# Dataframes 

* Dataframes are a restricted sub-type of RDDs. 
* Restricting the type allows for more optimization.

* Dataframes store two-dimensional data, similar to the type of data stored in a spreadsheet.
   * Each column in a dataframe can have a different type.
   * Each row contains a `record`.

* Similar to, but not the same as, [pandas dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) and [R dataframes](http://www.r-tutor.com/r-introduction/data-frame)

In [1]:
#import findspark
#findspark.init()
from pyspark import SparkContext
sc = SparkContext(master="local[4]")
sc.version

'2.3.0'

In [2]:
import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
%pylab inline

# Just like using Spark requires having a SparkContext, using SQL requires an SQLContext
sqlContext = SQLContext(sc)
sqlContext

Populating the interactive namespace from numpy and matplotlib


<pyspark.sql.context.SQLContext at 0x7f6a70142470>

### Constructing a DataFrame from an RDD of Rows
Each Row defines its own fields; the schema is *inferred*.

In [3]:
# One way to create a DataFrame is to first define an RDD from a list of Rows 
_list = [Row(name=u"John", age=19),
                           Row(name=u"Smith", age=23),
                           Row(name=u"Sarah", age=18)]
some_rdd = sc.parallelize(_list)
some_rdd.collect()

[Row(age=19, name='John'),
 Row(age=23, name='Smith'),
 Row(age=18, name='Sarah')]

In [4]:
# The DataFrame is created from the same _list
# Infer schema from the first row, create a DataFrame and print the schema
some_df = sqlContext.createDataFrame(_list)
some_df.printSchema()
some_df.collect()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



[Row(age=19, name='John'),
 Row(age=23, name='Smith'),
 Row(age=18, name='Sarah')]

In [5]:
# A dataframe is an RDD of rows plus information on the schema.
# Performing collect() on either the RDD or the DataFrame gives the same result.
print(type(some_rdd),type(some_df))
print('some_df =',some_df.collect())
print('some_rdd=',some_rdd.collect())

<class 'pyspark.rdd.RDD'> <class 'pyspark.sql.dataframe.DataFrame'>
some_df = [Row(age=19, name='John'), Row(age=23, name='Smith'), Row(age=18, name='Sarah')]
some_rdd= [Row(age=19, name='John'), Row(age=23, name='Smith'), Row(age=18, name='Sarah')]


### Defining the Schema explicitly
Creating a DataFrame with a pre-defined schema allows the content of the RDD to be simple tuples rather than Rows.

In [6]:
# In this case we create the dataframe from an RDD of tuples (rather than Rows) and provide the schema explicitly
another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two fields - person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
  
# Create a DataFrame by applying the schema to the RDD and print the schema
another_df = sqlContext.createDataFrame(another_rdd, schema)
another_df.printSchema()
# root
#  |-- person_name: string (nullable = false)
#  |-- person_age: integer (nullable = false)

root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)



## Loading DataFrames from disk
There are many methods to load DataFrames from disk. Here we will discuss three of them:
1. Parquet
2. JSON (on your own)
3. CSV  (on your own)

In addition, there are APIs for connecting Spark to an external database. We will not discuss this type of connection in this class.

### Loading dataframes from JSON files
[JSON](http://www.json.org/) is a very popular readable file format for storing structured data.
Among its many uses are **Twitter**, `javascript` communication packets, and many others. In fact, this notebook file (with the extension `.ipynb`) is in JSON format. JSON can also be used to store tabular data and can be easily loaded into a dataframe.

In [7]:
%pwd
dir='../../../Data'
!ls -lh $dir

total 628K
-rw-r--r-- 1 jovyan users 619K Apr 18 05:04 HW25000.csv
drwxr-xr-x 6 jovyan users  192 Apr 18 02:01 namesAndFavColors.parquet
-rw-r--r-- 1 jovyan users   73 Apr 17 23:45 people.json
-rw-r--r-- 1 jovyan users  615 Apr 17 23:45 users.parquet
drwxr-xr-x 8 jovyan users  256 Apr 18 07:34 Weather


In [8]:
# when loading json files you can specify either a single file or a directory containing many json files.
path = "../../../Data/people.json"
!cat $path 

# Create a DataFrame from the file(s) pointed to by path
people = sqlContext.read.json(path)
print('people is a',type(people))
# Display the contents; the inferred schema can be visualized using the printSchema() method.
people.show()
people.printSchema()

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
people is a <class 'pyspark.sql.dataframe.DataFrame'>
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
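
The file above holds one JSON object per line, and the first record has no `age` key. The following is a plain-Python sketch (not Spark's actual implementation) of why the resulting table has two columns and a `null` in the first row: the column set is taken as the union of all keys, and a missing key becomes `None`.

```python
import json

# The three lines of people.json, as printed by `!cat` above.
json_lines = [
    '{"name":"Michael"}',
    '{"name":"Andy", "age":30}',
    '{"name":"Justin", "age":19}',
]

records = [json.loads(line) for line in json_lines]

# The column set is the union of the keys seen in any record, sorted by name.
columns = sorted({key for record in records for key in record})

# A record that lacks a key gets None, which Spark displays as null.
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```

This is also why both columns are reported as `nullable = true`: nothing in the file guarantees that every record carries every key.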



### Exercise: Loading csv files into dataframes

Spark 2.0 includes a facility for reading csv files. In this exercise you are to create similar functionality using your own code.

You are to write a class called `csv_reader` which has the following methods:

* `__init__(self,filepath):` receives as input the path to a csv file. It throws an exception `NoSuchFile` if the file does not exist.
* `Infer_Schema()` opens the file, reads the first 10 lines (or fewer if the file is shorter), and infers the schema. The first line of the csv file defines the column names. The following lines should have the same number of columns, and all of the elements of a column should be of the same type. The only types allowed are `int`, `float`, and `string`. The method infers the types of the columns, checks that they are consistent, and defines a dataframe schema of the form:
```python
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
```
If everything checks out, the method defines a `self.` variable that stores the schema and returns the schema as its output. If an error is found, an exception `BadCsvFormat` is raised.
* `read_DataFrame()`: reads the file, parses it, and creates a dataframe using the inferred schema. If one of the lines beyond the first 10 (i.e. a line that was not read by `Infer_Schema`) is not parsed correctly, the line is not added to the dataframe. Instead, it is added to an RDD called `bad_lines`.
The method returns the dataframe and the `bad_lines` RDD.
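
As a starting point for the type-inference step only (not a full solution), here is a minimal sketch. The helper names `infer_type` and `infer_column_types` are hypothetical, the sample values are made up, and the sketch raises a plain `ValueError` where the exercise asks for `BadCsvFormat`.

```python
def infer_type(values):
    """Return int, float, or str: the narrowest type that parses every value."""
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
            return candidate
        except ValueError:
            continue
    return str


def infer_column_types(header, rows):
    """Map each column name to its inferred type; reject ragged rows."""
    for row in rows:
        if len(row) != len(header):
            raise ValueError("row has %d fields, expected %d" % (len(row), len(header)))
    columns = zip(*rows)
    return {name: infer_type(col) for name, col in zip(header, columns)}


# Made-up sample: a header line followed by two already-split data lines.
header = ["person_name", "person_age", "height"]
rows = [["John", "19", "1.80"], ["Smith", "23", "1.75"]]
types = infer_column_types(header, rows)
print(types)
```

Note one design choice: a column mixing `"1"` and `"2.5"` is widened to `float` here instead of being rejected. Whether that counts as "consistent" is a decision the exercise leaves to you.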

### Parquet files

* [Parquet](http://parquet.apache.org/) is a popular columnar format.

* Spark SQL allows [SQL](https://en.wikipedia.org/wiki/SQL) queries to retrieve a subset of the rows without reading the whole file.

* Compatible with HDFS : allows parallel retrieval on a cluster.

* Parquet compresses the data in each column.
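
To see why a columnar layout helps, here is a toy plain-Python sketch. The records are made up, and run-length encoding is used only as a simple stand-in: Parquet's real encodings and compression codecs are more elaborate.

```python
from itertools import groupby

# Made-up records in row-oriented form: (station, measurement, year).
rows = [
    ("S1", "PRCP", 1944),
    ("S1", "PRCP", 1945),
    ("S1", "TMAX", 1944),
    ("S1", "TMAX", 1945),
]
names = ("station", "measurement", "year")

# Column-oriented form: one list per column, so a query that needs only
# `measurement` can skip the other columns entirely.
columns = {name: list(values) for name, values in zip(names, zip(*rows))}


def run_length_encode(values):
    """Collapse runs of equal neighbours into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


print(columns["measurement"])
print(run_length_encode(columns["measurement"]))
print(run_length_encode(columns["station"]))
```

Values within one column share a type and often repeat, so they tend to compress better than the same values interleaved row by row.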

### Spark and Hive
* Parquet is a **file format**, not an independent database server.
* Spark can work with the [Hive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) relational database system that supports the full array of database operations.
* Hive is compatible with HDFS.

In [9]:
dir='../../../Data'
parquet_file=dir+"/users.parquet"
!ls $dir

HW25000.csv  namesAndFavColors.parquet	people.json  users.parquet  Weather


In [10]:
#load a Parquet file
print(parquet_file)
df = sqlContext.read.load(parquet_file)
df.show()

../../../Data/users.parquet
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



In [11]:
df2=df.select("name", "favorite_color")
df2.show() 

+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          null|
|   Ben|           red|
+------+--------------+



In [12]:
outfilename="namesAndFavColors.parquet"
!rm -rf $dir/$outfilename
df2.write.save(dir+"/"+outfilename)
!ls -ld $dir/$outfilename

drwxr-xr-x 6 jovyan users 192 Apr 21 23:20 ../../../Data/namesAndFavColors.parquet


A new interface object called **SparkSession** was added in **Spark 2.0**. A Spark session is initialized using a `builder`. For example:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
         .master("local") \
         .appName("Word Count") \
         .config("spark.some.config.option", "some-value") \
         .getOrCreate()
```

Using a SparkSession, a Parquet file is read [as follows](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet):
```python
df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
```

## Loading Parquet Files
Here we are loading a dataframe from a parquet repository. The parquet directory is packaged into a single `.tgz` file using the `tar` command.

The `tgz` file is stored in an **open** bucket on S3.

### Levels of access on S3

* By default buckets on S3 are private to the creator of the bucket.
* It is possible to allow different levels of access (read, write, manage) to specific AWS users.
* It is also possible to provide **public** access, allowing anybody with an AWS account to access the data.
* Finally, it is possible to provide **open** access, which allows anybody, with or without an AWS account, to download the data through a URL.
* Moving terabytes of data around can cost real money. The owner of the data can specify `Requester Pays` so that the downloader, rather than the owner, bears the cost of the transfer.

In [13]:
%%time
state='CA'
path=dir+'/Weather'
target = path+'/%s.tgz'%state
source = 'https://mas-dse-open.s3-us-west-2.amazonaws.com/Weather/by_state/%s.tgz'%state
command='curl '+source+' > '+target
print(command)
!$command

curl https://mas-dse-open.s3-us-west-2.amazonaws.com/Weather/by_state/CA.tgz > ../../../Data/Weather/CA.tgz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  108M  100  108M    0     0  7404k      0  0:00:14  0:00:14 --:--:-- 7950k
CPU times: user 408 ms, sys: 239 ms, total: 647 ms
Wall time: 15.7 s


## Download using AWS-CLI
* Not faster when downloading to my computer.
* It will be **much** faster when running on an EC2 instance.

In [14]:
%%time
command='aws s3 cp s3://mas-dse-open/Weather/by_state/%s.tgz '%state+' '+target 
print(command)

!$command

aws s3 cp s3://mas-dse-open/Weather/by_state/CA.tgz  ../../../Data/Weather/CA.tgz
/bin/sh: 1: aws: not found
CPU times: user 7.01 ms, sys: 9.98 ms, total: 17 ms
Wall time: 725 ms


In [15]:
%%time
command= "tar xzf %s -C %s"%(target,path)
print(command)
!$command

tar xzf ../../../Data/Weather/CA.tgz -C ../../../Data/Weather
CPU times: user 210 ms, sys: 110 ms, total: 320 ms
Wall time: 8.09 s
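
The extraction step can also be done without shelling out to `tar`. Below is a minimal sketch using Python's standard `tarfile` module. The helper name `extract_tgz` and every file name in the demo are made up, and the demo builds a throwaway archive so that it runs without the weather data.

```python
import os
import tarfile
import tempfile


def extract_tgz(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir; return member names."""
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Self-contained demo on a throwaway archive (all names here are made up).
work_dir = tempfile.mkdtemp()
src_dir = os.path.join(work_dir, "demo.parquet")
os.mkdir(src_dir)
with open(os.path.join(src_dir, "part-0.txt"), "w") as f:
    f.write("placeholder\n")

archive_path = os.path.join(work_dir, "demo.tgz")
with tarfile.open(archive_path, "w:gz") as archive:
    archive.add(src_dir, arcname="demo.parquet")

out_dir = os.path.join(work_dir, "out")
os.mkdir(out_dir)
members = extract_tgz(archive_path, out_dir)
print(members)
```

With the notebook's variables, the call would be `extract_tgz(target, path)`. Only extract archives from sources you trust, since `extractall` writes whatever paths the archive contains.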


In [16]:
!du -sh $path/*

133M	../../../Data/Weather/CA.parquet
113M	../../../Data/Weather/CA.tgz
76M	../../../Data/Weather/NY.parquet
64M	../../../Data/Weather/NY.tgz
49M	../../../Data/Weather/STAT_NY.pickle
13M	../../../Data/Weather/decon_NY_PRCP_s20.parquet
48K	../../../Data/Weather/index.html
8.0K	../../../Data/Weather/index.html~


In [17]:
%%time
df =sqlContext.read.load(path+'/%s.parquet'%state)

CPU times: user 3.23 ms, sys: 1.39 ms, total: 4.62 ms
Wall time: 239 ms


In [18]:
%%time
df.count()

CPU times: user 1.73 ms, sys: 950 µs, total: 2.68 ms
Wall time: 954 ms


365518

In [19]:
#%%time 
#pandas_df = df.collect()

## Using SQL on Parquet files

It is sometimes more efficient to use SQL, which is a **declarative** language, instead of using operators such as select and join.

In [22]:
%%time
parquet_name=path+'/CA'
query="""SELECT station,measurement,year 
FROM parquet.`%s.parquet` 
WHERE measurement=\"PRCP\" """%parquet_name
print(query)

df2 = sqlContext.sql(query)
print('number of rows=',df2.count())
df2.show(5)

SELECT station,measurement,year 
FROM parquet.`../../../Data/Weather/CA.parquet` 
WHERE measurement="PRCP" 
number of rows= 36692
+-----------+-----------+----+
|    station|measurement|year|
+-----------+-----------+----+
|USC00040986|       PRCP|1944|
|USC00040986|       PRCP|1945|
|USC00040986|       PRCP|1946|
|USC00040986|       PRCP|1947|
|USC00040986|       PRCP|1948|
+-----------+-----------+----+
only showing top 5 rows

CPU times: user 6.74 ms, sys: 2.89 ms, total: 9.64 ms
Wall time: 1.65 s
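
To illustrate what "declarative" means, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for Spark SQL (an assumption made so the example runs without a cluster). The table name `weather` is hypothetical; the two PRCP rows echo the output above and the TMAX row is made up.

```python
import sqlite3

# Tiny in-memory table; the PRCP rows echo the output above, the TMAX row is made up.
records = [
    ("USC00040986", "PRCP", 1944),
    ("USC00040986", "PRCP", 1945),
    ("USC00040986", "TMAX", 1944),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weather (station TEXT, measurement TEXT, year INTEGER)")
conn.executemany("INSERT INTO weather VALUES (?, ?, ?)", records)

# Declarative: state which rows are wanted; the engine decides how to find them.
sql_rows = conn.execute(
    "SELECT station, measurement, year FROM weather "
    "WHERE measurement = ? ORDER BY year",
    ("PRCP",),
).fetchall()

# Operator style: spell out the filtering step explicitly.
loop_rows = sorted((r for r in records if r[1] == "PRCP"), key=lambda r: r[2])

print(sql_rows)
print(sql_rows == loop_rows)
conn.close()
```

Both versions return the same rows. The difference is that the SQL text leaves the execution strategy to the engine, which is what gives a query optimizer room to work.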


## Summary
* Dataframes are an efficient way to store data tables.
* All of the values in a column have the same type.
* A good way to store a dataframe on disk is to use a Parquet file.
* Next: Operations on dataframes.