# Five-minute `DataFrame` demo

In [1]:
import findspark
findspark.init()

import pyspark

Initialize the `SparkContext` and the `SQLContext`:

In [2]:
sc = pyspark.SparkContext('local[2]')

sqc = pyspark.sql.SQLContext(sc)

The `SQLContext` gives us access to the `DataFrame` functionality.

### Read in some data and turn it into an RDD of lists

In [3]:
people_rdd = (sc.textFile('file:///cluster/apps/spark/spark-current/examples/src/main/resources/people.txt')
                .map(lambda line: line.split(',')))

In [4]:
people_rdd.first()

[u'Michael', u' 29']
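Note the leading space in the age field above. A plain-Python sketch (no Spark required) of what the `map` step does to each line may make this clearer. The sample lines are assumed from the outputs shown in this notebook, and `parse` is a hypothetical helper, not part of the original code:

```python
# Plain-Python sketch of the per-line parsing done by the map step.
# The sample lines are an assumption based on the notebook outputs.
lines = ['Michael, 29', 'Andy, 30', 'Justin, 19']


def parse(line):
    """Hypothetical helper: split on the comma and strip stray whitespace."""
    return [field.strip() for field in line.split(',')]


raw = [line.split(',') for line in lines]   # what the notebook cell does
clean = [parse(line) for line in lines]     # same split, whitespace removed

print(raw[0])          # the age field keeps its leading space
print(clean[0])        # both fields are stripped
print(int(raw[0][1]))  # int() tolerates surrounding whitespace
```

Since `int()` ignores surrounding whitespace, the unstripped fields are harmless for the age column here, but stripping would matter if the age were kept as a string.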

Now we can use this data to create `Row` objects and convert the `RDD` into a `DataFrame`: 

In [5]:
from pyspark.sql import Row

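# index into each record instead of unpacking it in the lambda signature,
# so the same line also runs on Python 3
row_rdd = people_rdd.map(lambda p: Row(name=p[0], age=int(p[1])))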

row_rdd.first()

df = sqc.createDataFrame(row_rdd)

When the `DataFrame` is constructed, the data type for each column is inferred:

In [6]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
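To illustrate why `age` comes out as `long` and `name` as `string`, here is a simplified plain-Python model of per-column type inference. It is only a sketch, not Spark's code: `PYTHON_TO_SPARK` and `infer_schema` are hypothetical names, and the mapping covers just a few basic Python types. Fields are sorted by name to mirror the column order in the schema printed above.

```python
# Simplified model of schema inference from one record (NOT Spark's code).
# Assumed mapping from a few Python types to Spark SQL type names.
PYTHON_TO_SPARK = {bool: 'boolean', int: 'long', float: 'double', str: 'string'}


def infer_schema(record):
    """Hypothetical helper: map each field of a dict to a type name."""
    return [(name, PYTHON_TO_SPARK[type(value)])
            for name, value in sorted(record.items())]


converted = {'name': 'Michael', 'age': int(' 29')}   # age converted with int()
unconverted = {'name': 'Michael', 'age': ' 29'}      # age left as text

print(infer_schema(converted))
print(infer_schema(unconverted))
```

In this model, as in the notebook, the `int(age)` conversion made while building the `Row` objects is what gives the column a numeric type. Without it, the age would be inferred as a string.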



In [7]:
df.first()

Row(age=29, name=u'Michael')

The `show` method pretty-prints the contents as a table:

In [8]:
df.show()

+---+-------+
|age|   name|
+---+-------+
| 29|Michael|
| 30|   Andy|
| 19| Justin|
+---+-------+



Let's compare the `RDD` methods with the `DataFrame` API. We want to get all the people older than 20:

In [9]:
# using the usual RDD methods (index into each record so this also runs on Python 3)
people_rdd.filter(lambda p: int(p[1]) > 20).collect()

[[u'Michael', u' 29'], [u'Andy', u' 30']]

In [10]:
# using the DataFrame
df.filter(df.age > 20).take(20)

[Row(age=29, name=u'Michael'), Row(age=30, name=u'Andy')]

There is no need to write `map`s when the operation can be expressed with the built-in `DataFrame` functions. Columns are referenced as attributes of the `DataFrame` object:

In [11]:
# this is a column that you can use in arithmetic expressions
df.age

Column<age>
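Evaluating `df.age` returns a description of a column rather than data. A toy model may help explain how arithmetic and comparisons on such an object can build an expression that is only evaluated later. This is a sketch under that assumption, not Spark's implementation: `ToyColumn` and the sample `rows` are hypothetical and operate on plain dictionaries.

```python
# Toy model of a lazily evaluated column expression (NOT Spark's implementation).
class ToyColumn(object):
    def __init__(self, description, evaluate):
        self.description = description  # readable form of the expression
        self.evaluate = evaluate        # function applied to one row (a dict)

    def __mul__(self, factor):
        # Build a new expression; nothing is computed yet.
        return ToyColumn('(%s * %r)' % (self.description, factor),
                         lambda row: self.evaluate(row) * factor)

    def __gt__(self, threshold):
        # Comparisons also return an expression instead of a boolean.
        return ToyColumn('(%s > %r)' % (self.description, threshold),
                         lambda row: self.evaluate(row) > threshold)

    def __repr__(self):
        return 'ToyColumn<%s>' % self.description


# Sample rows assumed from the outputs shown in this notebook.
rows = [{'name': 'Michael', 'age': 29},
        {'name': 'Andy', 'age': 30},
        {'name': 'Justin', 'age': 19}]

age = ToyColumn('age', lambda row: row['age'])
doubled = age * 2    # an expression object, not a list of numbers
is_older = age > 20  # likewise an expression, not True or False

print(doubled)
print([doubled.evaluate(r) for r in rows])
print([r['name'] for r in rows if is_older.evaluate(r)])
```

The point of the design is that `age * 2` and `age > 20` only describe a computation, so the engine that eventually receives the expression can decide how and where to run it.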

In [12]:
df.select(df.age, (df.age*2).alias('times two')).show()

+---+---------+
|age|times two|
+---+---------+
| 29|       58|
| 30|       60|
| 19|       38|
+---+---------+



In [13]:
# comparable RDD method (returns only the doubled ages; indexing also runs on Python 3)
people_rdd.map(lambda p: int(p[1])*2).collect()

[58, 60, 38]