# Frame Basics

This notebook covers the basics of working with SparkTK frames. A SparkTK frame is a proxy object for a large table of data in Spark, with properties and functions for operating on that data.

- See the [SparkTK Documentation](https://github.com/trustedanalytics/spark-tk/) for more information about the API.

In [1]:
# First, let's verify that the SparkTK libraries are installed
import sparktk
print("SparkTK installation path = %s" % (sparktk.__path__))

SparkTK installation path = ['/opt/anaconda2/lib/python2.7/site-packages/sparktk']


In [2]:
from sparktk import TkContext
tc = TkContext()

In [3]:
# Create a new frame by providing data and schema
data = [ ['a', 1], 
         ['b', 2], 
         ['c', 3], 
         ['b', 4],     
         ['a', 5] ]

schema = [ ('letter', str),
           ('number', int) ]

frame = tc.frame.create(data, schema)

In [4]:
# View the first few rows of a frame
frame.inspect()

[#]  letter  number
[0]  a            1
[1]  b            2
[2]  c            3
[3]  b            4
[4]  a            5

In [5]:
# View a specific number of rows of a frame
frame.inspect(2)

[#]  letter  number
[0]  a            1
[1]  b            2

In [6]:
# Add a column to the frame
frame.add_columns(lambda row: row.number * 2, ('number_doubled', int))
frame.inspect()

[#]  letter  number  number_doubled
[0]  a            1               2
[1]  b            2               4
[2]  c            3               6
[3]  b            4               8
[4]  a            5              10
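
To make the row function passed to `add_columns` more concrete, here is a minimal plain-Python sketch of the same per-row computation. It does not use SparkTK: the `Row` namedtuple is a hypothetical stand-in for the row object that SparkTK hands to the lambda, and the list comprehension stands in for the distributed map.

```python
from collections import namedtuple

# Hypothetical stand-in for a frame row; not a SparkTK class.
Row = namedtuple('Row', ['letter', 'number'])

rows = [Row('a', 1), Row('b', 2), Row('c', 3), Row('b', 4), Row('a', 5)]

# Same row function as in the add_columns call above.
double = lambda row: row.number * 2

# Apply the function to every row to get the new column's values.
doubled = [double(row) for row in rows]
print(doubled)  # [2, 4, 6, 8, 10]
```

The values match the `number_doubled` column shown in the output above.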

In [7]:
# Get summary information for a column
frame.column_summary_statistics('number_doubled')

bad_row_count             = 0
geometric_mean            = 5.21034216939
good_row_count            = 5
maximum                   = 10.0
mean                      = 6.0
mean_confidence_lower     = 3.22814141775
mean_confidence_upper     = 8.77185858225
minimum                   = 2.0
non_positive_weight_count = 0
positive_weight_count     = 5
standard_deviation        = 3.16227766017
total_weight              = 5.0
variance                  = 10.0
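
As a cross-check on the summary above, the following plain-Python sketch recomputes several of the reported values from the five `number_doubled` values. It does not call SparkTK. Two points are assumptions inferred from the printed numbers rather than taken from the SparkTK documentation: that the variance is the sample variance (dividing by n - 1), and that the confidence bounds are a normal-approximation 95% interval (z = 1.96).

```python
import math

values = [2, 4, 6, 8, 10]  # the number_doubled column from the frame above
n = len(values)

mean = sum(values) / float(n)

# Assumption: sample variance (divide by n - 1), which matches the reported 10.0.
variance = sum((v - mean) ** 2 for v in values) / (n - 1)
std_dev = math.sqrt(variance)

# Geometric mean computed through logarithms.
geometric_mean = math.exp(sum(math.log(v) for v in values) / n)

# Assumption: 95% interval using the normal approximation, z = 1.96.
half_width = 1.96 * std_dev / math.sqrt(n)
ci_lower = mean - half_width
ci_upper = mean + half_width

print("mean=%s variance=%s std_dev=%s" % (mean, variance, std_dev))
print("geometric_mean=%s" % geometric_mean)
print("ci_lower=%s ci_upper=%s" % (ci_lower, ci_upper))
```

Under these assumptions the computed values agree with the `mean`, `variance`, `standard_deviation`, `geometric_mean`, `mean_confidence_lower`, and `mean_confidence_upper` entries shown above.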

In [8]:
# Add a column with the cumulative sum of the number column
frame.cumulative_sum('number')
frame.inspect()

[#]  letter  number  number_doubled  number_cumulative_sum
[0]  a            1               2                    1.0
[1]  b            2               4                    3.0
[2]  c            3               6                    6.0
[3]  b            4               8                   10.0
[4]  a            5              10                   15.0
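
For readers unfamiliar with cumulative sums, this small plain-Python sketch reproduces the `number_cumulative_sum` column from the `number` values in the row order shown above. It is an illustration of the calculation only, not of how SparkTK computes it across a distributed frame.

```python
numbers = [1, 2, 3, 4, 5]  # the 'number' column, in the row order shown above

running_total = 0.0
cumulative = []
for value in numbers:
    # Each output entry is the sum of all values up to and including this row.
    running_total += value
    cumulative.append(running_total)

print(cumulative)  # [1.0, 3.0, 6.0, 10.0, 15.0]
```

Because each entry depends on the rows before it, the result depends on row order.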

In [9]:
# Rename a column
frame.rename_columns({ 'number_doubled': "x2" })
frame.inspect()

[#]  letter  number  x2  number_cumulative_sum
[0]  a            1   2                    1.0
[1]  b            2   4                    3.0
[2]  c            3   6                    6.0
[3]  b            4   8                   10.0
[4]  a            5  10                   15.0

In [10]:
# Sort the frame by column 'number' descending
frame.sort('number', False)
frame.inspect()

[#]  letter  number  x2  number_cumulative_sum
[0]  a            5  10                   15.0
[1]  b            4   8                   10.0
[2]  c            3   6                    6.0
[3]  b            2   4                    3.0
[4]  a            1   2                    1.0

In [11]:
# Remove a column from the frame
frame.drop_columns("x2")
frame.inspect()

[#]  letter  number  number_cumulative_sum
[0]  a            5                   15.0
[1]  b            4                   10.0
[2]  c            3                    6.0
[3]  b            2                    3.0
[4]  a            1                    1.0

In [12]:
# Download a frame from SparkTK to pandas
pandas_frame = frame.to_pandas(columns=['letter', 'number'])
pandas_frame

  letter  number
0      a       5
1      b       4
2      c       3
3      b       2
4      a       1


In [13]:
# Calculate aggregations on the frame
results = frame.group_by('letter', tc.agg.count, {'number': [tc.agg.avg, tc.agg.sum, tc.agg.min] })
results.inspect()

[#]  letter  count  number_AVG  number_SUM  number_MIN
[0]  b           2         3.0           6           2
[1]  a           2         3.0           6           1
[2]  c           1         3.0           3           3
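
The aggregation above can be sanity-checked with a plain-Python sketch that groups the same five rows by `letter` and computes the count, average, sum, and minimum of `number`. It does not use SparkTK; the dictionary keys simply mirror the column names in the output above.

```python
from collections import defaultdict

# (letter, number) pairs from the frame, in its current sorted order.
rows = [('a', 5), ('b', 4), ('c', 3), ('b', 2), ('a', 1)]

# Collect the numbers belonging to each letter.
groups = defaultdict(list)
for letter, number in rows:
    groups[letter].append(number)

# Compute the same aggregations requested in the group_by call.
summary = {}
for letter, numbers in groups.items():
    summary[letter] = {
        'count': len(numbers),
        'number_AVG': sum(numbers) / float(len(numbers)),
        'number_SUM': sum(numbers),
        'number_MIN': min(numbers),
    }

for letter in sorted(summary):
    print("%s %s" % (letter, summary[letter]))
```

The sketch prints the groups in alphabetical order, whereas the SparkTK output above lists them as `b`, `a`, `c`; the per-group values are the same.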

In [14]:
# Count the number of rows satisfying a predicate
frame.count(lambda row: row.number > 2)

3
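
The predicate count above is equivalent to the following plain-Python sketch, which applies the same test to the `number` values locally. The list here is a stand-in for the frame's column, not a SparkTK object.

```python
numbers = [5, 4, 3, 2, 1]  # the 'number' column from the frame above

# Same predicate as in the frame.count call, applied to a bare value.
is_large = lambda number: number > 2

# Count the values for which the predicate holds.
matching = sum(1 for number in numbers if is_large(number))
print(matching)  # 3
```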

Many more frame operations are available.  See the [SparkTK Documentation](https://github.com/trustedanalytics/spark-tk).