# Dataframes 

* Dataframes are a restricted sub-type of RDDs. 
* Restricting the type allows for more optimization.

* Dataframes store two-dimensional data, similar to the type of data stored in a spreadsheet.
   * Each column in a dataframe can have a different type.
   * Each row contains a `record`.

* Similar to [pandas dataframes](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) and [R dataframes](http://www.r-tutor.com/r-introduction/data-frame)

In [None]:
#import findspark
#findspark.init()
from pyspark import SparkContext
import os

os.environ["PYSPARK_PYTHON"]="python3"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python3"

sc = SparkContext(master="local[4]")
sc.version

'3.5.0'

In [2]:
import os
import sys

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import Row, StructField, StructType, StringType, IntegerType
%pylab inline

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [3]:
# Just like using Spark requires having a SparkContext, using SQL requires an SQLContext
sqlContext = SQLContext(sc)
sqlContext 



<pyspark.sql.context.SQLContext at 0x7fffd88057d0>

## Spark sessions

[A newer API for spark dataframes](https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession)

We will stick to the old API in this class.

A new interface object called **SparkSession** was added in **Spark 2.0**. A Spark session is initialized using a `builder`. For example:
```python
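from pyspark.sql import SparkSession

# Build a session, or reuse an existing one if it is already running
spark = SparkSession.builder \
         .master("local") \
         .appName("Word Count") \
         .config("spark.some.config.option", "some-value") \
         .getOrCreate()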
```

Using a SparkSession, a Parquet file is read [as follows](http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet):
```python
df = spark.read.parquet('python/test_support/sql/parquet_partitioned')
```

### Constructing a DataFrame from an RDD of Rows
Each Row defines its own fields; the schema is *inferred*.

In [4]:
# One way to create a DataFrame is to first define an RDD from a list of Rows 
_list=[Row(name=u"John", age=19),
       Row(name=u"Smith", age=23),
       Row(name=u"Sarah", age=18)]
some_rdd = sc.parallelize(_list)
some_rdd.collect()

[Row(name='John', age=19),
 Row(name='Smith', age=23),
 Row(name='Sarah', age=18)]

In [5]:
# The DataFrame can be created from the RDD of Rows or, as here, directly from the list of Rows.
# The schema is inferred from the Rows themselves; it is printed in the next cell.
some_df = sqlContext.createDataFrame(_list)

In [6]:
some_df.printSchema()

root
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)
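
Note that `age` is reported as `long` rather than `integer`: when the schema is inferred, a Python `int` is mapped to Spark's 64-bit `long` type. The plain-Python sketch below mimics the idea of deriving field names and types from a record. The function `infer_schema_from_first` and the lookup table `PY_TO_SPARK` are hypothetical names made up for this illustration, and the sketch is a simplification rather than Spark's actual inference code.

```python
# Toy illustration only: maps a few Python types to Spark SQL type names.
# The table and the function name are assumptions made for this sketch.
PY_TO_SPARK = {str: "string", int: "long", float: "double", bool: "boolean"}


def infer_schema_from_first(records):
    """Return (field_name, spark_type_name) pairs taken from the first record."""
    first = records[0]
    return [(name, PY_TO_SPARK[type(value)]) for name, value in first.items()]


records = [{"name": "John", "age": 19},
           {"name": "Smith", "age": 23},
           {"name": "Sarah", "age": 18}]

schema = infer_schema_from_first(records)
for field, spark_type in schema:
    print(f" |-- {field}: {spark_type} (nullable = true)")
```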



In [7]:
# A dataframe is an RDD of rows plus information on the schema.
# Performing collect() on either the RDD or the DataFrame gives the same result.
print(type(some_rdd),type(some_df))
print('some_df =',some_df.collect())
print('some_rdd=',some_rdd.collect())

<class 'pyspark.rdd.RDD'> <class 'pyspark.sql.dataframe.DataFrame'>
some_df = [Row(name='John', age=19), Row(name='Smith', age=23), Row(name='Sarah', age=18)]
some_rdd= [Row(name='John', age=19), Row(name='Smith', age=23), Row(name='Sarah', age=18)]


### Defining the Schema explicitly
The advantage of creating a DataFrame with a pre-defined schema is that the content of the RDD can be simple tuples rather than Rows.

In [8]:
# In this case we create the dataframe from an RDD of tuples (rather than Rows) and provide the schema explicitly
another_rdd = sc.parallelize([("John", 19), ("Smith", 23), ("Sarah", 18)])
# Schema with two fields - person_name and person_age
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
  
# Create a DataFrame by applying the schema to the RDD and print the schema
another_df = sqlContext.createDataFrame(another_rdd, schema)
another_df.printSchema()
# Expected schema:
# root
#  |-- person_name: string (nullable = false)
#  |-- person_age: integer (nullable = false)

root
 |-- person_name: string (nullable = false)
 |-- person_age: integer (nullable = false)



## Loading DataFrames from disk
There are many methods for loading DataFrames from disk. Here we will discuss three of them:
1. Parquet
2. JSON (on your own)
3. CSV  (on your own)

In addition, there are APIs for connecting Spark to an external database. We will not discuss this type of connection in this class.

### Loading dataframes from JSON files
[JSON](http://www.json.org/) is a very popular human-readable file format for storing structured data.
Among its many uses are **twitter**, `javascript` communication packets, and many others. In fact, this notebook file (with the extension `.ipynb`) is in JSON format. JSON can also be used to store tabular data and can be easily loaded into a dataframe.

In [9]:
# when loading json files you can specify either a single file or a directory containing many json files.
print('--- json file')
path = "../Data/people.json"
!cat $path 

# Create a DataFrame from the file(s) pointed to by path
people = sqlContext.read.json(path)
print('\n--- dataframe\n people is a',type(people))
# show() displays the rows; the inferred schema is printed further down using printSchema().
people.show()

print('--- Schema')
people.printSchema()

--- json file
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

--- dataframe
 people is a <class 'pyspark.sql.dataframe.DataFrame'>
+----+-------+
| age|   name|
+----+-------+
|NULL|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

--- Schema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
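
The `NULL` in the `age` column arises because the first record has no `age` key: the columns are the union of the keys found in the file, and a missing key becomes a null. The plain-Python sketch below mimics that behaviour on the same three records. It is only an illustration of the idea and is not how Spark implements its JSON reader; the variable names are chosen for this example.

```python
import json

# The same three records as in people.json, one JSON object per line.
json_lines = '''{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}'''

records = [json.loads(line) for line in json_lines.splitlines()]

# The columns are the union of all keys seen in any record
# (sorted here, which matches the column order shown above).
columns = sorted({key for record in records for key in record})

# A key that is absent from a record becomes None (displayed as NULL by Spark).
rows = [tuple(record.get(col) for col in columns) for record in records]

print(columns)
for row in rows:
    print(row)
```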



### Exercise: Loading csv files into dataframes

Spark 2.0 includes a facility for reading csv files. In this exercise you are to create similar functionality using your own code.

You are to write a class called `csv_reader` which has the following methods:

* `__init__(self,filepath):` receives as input the path to a csv file. It throws an exception `NoSuchFile` if the file does not exist.
* `Infer_Schema()` opens the file, reads the first 10 lines (or fewer if the file is shorter), and infers the schema. The first line of the csv file defines the column names. The following lines should have the same number of columns, and all of the elements of a column should be of the same type. The only types allowed are `int`,`float`,`string`. The method infers the types of the columns, checks that they are consistent, and defines a dataframe schema of the form:
```python
schema = StructType([StructField("person_name", StringType(), False),
                     StructField("person_age", IntegerType(), False)])
```
If everything checks out, the method defines a `self.` variable that stores the schema and returns the schema as its output. If an error is found, an exception `BadCsvFormat` is raised.
* `read_DataFrame()`: reads the file, parses it and creates a dataframe using the inferred schema. If one of the lines beyond the first 10 (i.e. a line that was not read by `Infer_Schema`) is not parsed correctly, the line is not added to the DataFrame. Instead, it is added to an RDD called `bad_lines`.
The method returns the DataFrame and the `bad_lines` RDD.
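
As a possible starting point for the type-inference step, here is a minimal sketch in plain Python. The helper names `infer_type` and `infer_column_type` are hypothetical and not part of the required interface, and the sketch raises a plain `ValueError` where the exercise asks for `BadCsvFormat`. It does not read files or build a `StructType`.

```python
def infer_type(text):
    """Classify one CSV field as 'int', 'float' or 'string' (hypothetical helper)."""
    # Try the most restrictive type first: every int also parses as a float.
    for name, convert in (("int", int), ("float", float)):
        try:
            convert(text)
            return name
        except ValueError:
            pass
    return "string"


def infer_column_type(values):
    """Return the single type shared by all values, or raise ValueError."""
    types = {infer_type(value) for value in values}
    if len(types) != 1:
        raise ValueError(f"inconsistent column types: {sorted(types)}")
    return types.pop()


print(infer_column_type(["19", "23", "18"]))        # all fields parse as int
print(infer_column_type(["John", "Smith", "Sarah"]))  # none parse as numbers
```

Under this strict reading of the exercise, a column that mixes `"1"` and `"1.5"` is reported as inconsistent; a more lenient design could instead promote such a column to `float`.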

### Parquet files

* [Parquet](http://parquet.apache.org/) is a popular columnar format.

* Spark SQL allows [SQL](https://en.wikipedia.org/wiki/SQL) queries to retrieve a subset of the rows without reading the whole file.

* Compatible with HDFS : allows parallel retrieval on a cluster.

* Parquet compresses the data in each column.

* `<reponame>.parquet` is usually a **directory** with many files or subdirectories.
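
To see why storing each column contiguously helps compression, consider run-length encoding, one of the simple ideas used by columnar formats. The sketch below is a toy example in plain Python: the function name `run_length_encode` and the column values are made up for illustration, and Parquet's real encodings are considerably more elaborate.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


# Hypothetical column: many consecutive rows share the same value.
station_column = ["station_A"] * 5 + ["station_B"] * 3

encoded = run_length_encode(station_column)
print(encoded)
```

In a row-oriented layout the repeated values would be interleaved with the other fields of each record, so runs like these would not be adjacent.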

### Spark and Hive
* Parquet is a **file format**, not an independent database server.
* Spark can work with the [Hive](https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started) relational database system that supports the full array of database operations.
* Hive is compatible with HDFS.

In [10]:
dir='../Data'
parquet_file=dir+"/users.parquet"
!ls $dir

Moby-Dick.txt  namesAndFavColors.parquet  people.json  users.parquet  Weather


In [11]:
#load a Parquet file
print(parquet_file)
df = sqlContext.read.load(parquet_file)
df.show()

../Data/users.parquet
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          NULL|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+



In [12]:
df2=df.select("name", "favorite_color")
df2.show()

+------+--------------+
|  name|favorite_color|
+------+--------------+
|Alyssa|          NULL|
|   Ben|           red|
+------+--------------+



In [13]:
outfilename="namesAndFavColors.parquet"
!rm -rf $dir/$outfilename
df2.write.save(dir+"/"+outfilename)
!ls -ld $dir/$outfilename

drwxr-xr-x 6 jovyan users 192 Apr 14 17:52 ../Data/namesAndFavColors.parquet


## Let's have a look at a real-world dataframe

This dataframe is a small part of a large dataframe (15GB) which stores meteorological data from stations around the world.

In [14]:
from os.path import split,join,exists
from os import mkdir,getcwd,remove
from glob import glob

# Build the paths to the data directories (relative to the notebook's location)
notebook_dir=getcwd()
data_dir=join(split(notebook_dir)[0],'Data')
weather_dir=join(data_dir,'Weather')

file_index='NY'
zip_file='%s.tgz'%(file_index)

In [15]:
weather_parquet = join(weather_dir, zip_file[:-3]+'parquet')
print(weather_parquet)
df = sqlContext.read.load(weather_parquet)
df.show(1)

/home/jovyan/Library/CloudStorage/GoogleDrive-ssingal@ucsd.edu/My Drive/UCSD/Notes/6th Quarter - Spring 24/DSC 232R - BDA with Spark/GitHub/lecture-notebooks/Data/Weather/NY.parquet
+-----------+-----------+----+--------------------+-----------------+--------------+------------------+-----------------+-----+-----------------+
|    Station|Measurement|Year|              Values|       dist_coast|      latitude|         longitude|        elevation|state|             name|
+-----------+-----------+----+--------------------+-----------------+--------------+------------------+-----------------+-----+-----------------+
|USW00094704|   PRCP_s20|1945|[00 00 00 00 00 0...|361.8320007324219|42.57080078125|-77.71330261230469|208.8000030517578|   NY|DANSVILLE MUNI AP|
+-----------+-----------+----+--------------------+-----------------+--------------+------------------+-----------------+-----+-----------------+
only showing top 1 row



In [16]:
# Select a subset of the columns so that the output fits on a slide.
df.select('station','year','measurement').show(5)

+-----------+----+-----------+
|    station|year|measurement|
+-----------+----+-----------+
|USW00094704|1945|   PRCP_s20|
|USW00094704|1946|   PRCP_s20|
|USW00094704|1947|   PRCP_s20|
|USW00094704|1948|   PRCP_s20|
|USW00094704|1949|   PRCP_s20|
+-----------+----+-----------+
only showing top 5 rows



## Summary
* Dataframes are an efficient way to store data tables.
* All of the values in a column have the same type.
* A good way to store a dataframe on disk is to use a Parquet file.
* Next: Operations on dataframes.