# DataFrame

In [1]:
data = sc.parallelize([('M', 25), 
                       ('M', 20), 
                       ('M', 30), 
                       ('F', 25),
                       ('F', 20),
                       ('M', 30)])

Creating a DataFrame requires a dataset (e.g. an RDD or a list) and a schema. In the simplest case, Spark infers the data types from the dataset, so only the column names need to be supplied.

Alternatively, data can be loaded directly into a DataFrame, for instance from a `pandas` DataFrame or from `parquet` files.

In [2]:
df = spark.createDataFrame(data, ['gender', 'age'])
print("The structure of the dataframe is {}".format(df))
# show the contents of the DataFrame
df.show()
print("the number of rows in the dataframe is {}".format(df.count()))

The structure of the dataframe is DataFrame[gender: string, age: bigint]
+------+---+
|gender|age|
+------+---+
|     M| 25|
|     M| 20|
|     M| 30|
|     F| 25|
|     F| 20|
|     M| 30|
+------+---+

the number of rows in the dataframe is 6


Unlike RDDs, DataFrames have labelled columns, which makes the code easier to read.

In [3]:
df.filter(df.gender == 'M').show()

+------+---+
|gender|age|
+------+---+
|     M| 25|
|     M| 20|
|     M| 30|
|     M| 30|
+------+---+



In [10]:
df.groupBy(df.gender).count().show()

+------+-----+
|gender|count|
+------+-----+
|     F|    2|
|     M|    4|
+------+-----+



In [11]:
df.groupBy(df.gender).max("age").show()

+------+--------+
|gender|max(age)|
+------+--------+
|     F|      25|
|     M|      30|
+------+--------+



In [12]:
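# Note: this import shadows Python's built-in sum, max and min for the rest of the notebook
from pyspark.sql.functions import mean, sum, max, min, col
# groupBy returns a grouped object that can be reused for several aggregations
df1 = df.groupBy(df.gender)
df1.agg(sum("age").alias("total"), min("age"), max("age")).show()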

+------+-----+--------+--------+
|gender|total|min(age)|max(age)|
+------+-----+--------+--------+
|     F|   45|      20|      25|
|     M|  105|      20|      30|
+------+-----+--------+--------+



For some file formats (e.g. CSV), Spark provides readers that load the data directly into a DataFrame. In this example, we use a dataset with airline on-time statistics and delay causes. The CSV reader can take the column names from the header line of the file (option `header`). The option `inferSchema` makes Spark derive the data types from the data. This requires an extra pass over the file, which is expensive for large data volumes; alternatively, you can specify the schema manually. In this tutorial, we stick with `inferSchema`.

In [None]:
import os
import urllib.request

filename = '../data/2008.csv.bz2'
# Download the dataset only if it is not already available locally
if not os.path.exists(filename):
    urllib.request.urlretrieve("http://stat-computing.org/dataexpo/2009/2008.csv.bz2",
                               filename)

In [None]:
# Built-in CSV reader (Spark 2.0+): column names from the header line, data types inferred
f = spark.read.csv(filename, header=True, inferSchema=True)
f.printSchema()

Now we can compute the average departure and arrival delays per flight number.

In [None]:
averagedelays = f.groupBy(f.FlightNum).agg(mean("DepDelay"), mean("ArrDelay"))
averagedelays.show()

## Caching

When you need an intermediate result repeatedly, you can cache it for faster access. Note that `cache()` is lazy: the data is only computed and stored when an action such as `show()` runs. You can observe this by repeating the `show()` command several times: the first call is slow, and subsequent calls are fast.

In [None]:
averagedelays.cache()

In [None]:
averagedelays.show()