# Intro to Spark DataFrames

Recall the progression of "list-like" technology in Python:

1. `list`
2. `numpy` array  (much faster, but limited to single machine)
3. Spark RDD  (distributed across many machines)

Tabular data has seen a similar progression:

1. `dict` (*could* be used to store tabular data)
2. `pandas` DataFrame  (more brains, but limited to single machine)
3. Spark DataFrame  (distributed across many machines)

We want to dive into Spark DataFrames now.


## Warmup: `dict` and `pandas`

For reference, here are two ways we can use plain Python containers to store tabular data (neither has any "brains", i.e. no column types, indexing, or vectorized operations):

In [1]:
# "columnar" storage (dict of lists)
columnar_table = {"col1": [3, 5, 4],
                  "col2": [6.23, 4.2, 6.8]}

In [2]:
# "row" storage (list of dicts)
row_table = [{"col1": 3, "col2": 6.23},
             {"col1": 5, "col2": 4.2},
             {"col1": 4, "col2": 6.8}]
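
The two layouts hold the same information, and converting between them takes only a few lines of plain Python. The following is a minimal sketch; the variable names `names`, `rows`, and `columns` are made up for illustration:

```python
columnar_table = {"col1": [3, 5, 4],
                  "col2": [6.23, 4.2, 6.8]}

# columnar -> rows: zip the column lists together, one dict per row
names = list(columnar_table)
rows = [dict(zip(names, values)) for values in zip(*columnar_table.values())]

# rows -> columnar: gather each column's values across all rows
columns = {name: [row[name] for row in rows] for name in names}

print(rows)
print(columns)
```

This is roughly the reshaping a table library has to do internally whenever it is handed data in one layout but stores it in the other.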

`pandas` can turn either one of these into a DataFrame, but the result will ALWAYS be columnar (each column = a `numpy` array):

In [3]:
import pandas as pd

pandas_df = pd.DataFrame(columnar_table)
pandas_df.head()

   col1  col2
0     3  6.23
1     5   4.2
2     4   6.8


In [4]:
pandas_df = pd.DataFrame(row_table)
pandas_df.head()

   col1  col2
0     3  6.23
1     5   4.2
2     4   6.8


Here's how to get the raw `pandas` column (which will be a `numpy` array):

In [5]:
col1 = pandas_df['col1'].values
type(col1)

numpy.ndarray

In [6]:
col1

array([3, 5, 4])
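
As a side note, recent `pandas` documentation recommends `.to_numpy()` over `.values` for this purpose. A small sketch, rebuilding the same DataFrame so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

pandas_df = pd.DataFrame({"col1": [3, 5, 4],
                          "col2": [6.23, 4.2, 6.8]})

# .to_numpy() returns the column's data as a numpy array
col1 = pandas_df["col1"].to_numpy()
col2 = pandas_df["col2"].to_numpy()

# each column has its own dtype, which is the point of columnar storage
print(type(col1), col1.dtype)
print(type(col2), col2.dtype)
```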

## Spark DataFrames

Let's move on to Spark DataFrames.  These are somewhat similar to `pandas` DataFrames, except they are distributed.

Another difference:  in Spark, DataFrames are stored in row-based format.

In [7]:
# for DataFrame we use a SparkSession instead of a SparkContext

from pyspark.sql import SparkSession

# SparkSession uses the "builder" syntax
ss = SparkSession.builder.\
     config("fs.defaultFS", "hdfs://namenode:8020").\
     config("dfs.replication", "1").\
     master('spark://spark-master:7077').\
     appName('test').getOrCreate()

The examples below are very similar in spirit to `sc.parallelize` for RDDs.  We are pushing small amounts of data from the driver up into Spark, usually for demonstration purposes:

### from list of `Row`s

In [8]:
from pyspark.sql import Row

# Spark DataFrames are row-based, so we feed the data in row by row
df = ss.createDataFrame([Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)])
df.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

### from `pandas`

We can also create one from a `pandas` DataFrame:

In [9]:
df = ss.createDataFrame(pandas_df)
df.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

### from list of tuples

Another easy way to create a DataFrame is from a list of tuples, passing the column names explicitly:

In [10]:
df = ss.createDataFrame([(3, 6.23), (5, 4.2), (4, 6.8)], ["col1", "col2"])
df.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

### from RDD

RDDs look VERY similar to a list of tuples, so it shouldn't surprise you that we can create a DataFrame from an RDD using almost exactly the same syntax:

In [11]:
# under the hood a SparkSession actually uses a SparkContext
# we can get the underlying SparkContext in order to play with RDDs
sc = ss.sparkContext

In [12]:
rdd = sc.parallelize([(3, 6.23), (5, 4.2), (4, 6.8)])

In [13]:
df = ss.createDataFrame(rdd, ["col1", "col2"])
df.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

### to RDD

We can also convert a DataFrame back to an RDD (but an RDD of `Row`s).

`Row` and `Column` are classes out of which the `DataFrame` class is built.

In [14]:
rdd2 = df.rdd
rdd2.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

### to `pandas`

We can convert back to a `pandas` DataFrame.  This brings all the data back to the driver, basically like a `.collect()`, so be careful with large DataFrames:

In [15]:
pandas_df2 = df.toPandas()
pandas_df2

   col1  col2
0     3  6.23
1     5   4.2
2     4   6.8


## take vs head vs show

In [16]:
df.take(3)

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

In [17]:
df.head(3)  # to mimic the `.head()` method in pandas

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

In [18]:
df.show(3)

+----+----+
|col1|col2|
+----+----+
|   3|6.23|
|   5| 4.2|
|   4| 6.8|
+----+----+



## Schemas

In all of the examples above the schema was *inferred*.  Spark just looked at the data and decided what type each column should be.  The inferred type is not always the one you want.

In [19]:
df.dtypes  # Spark SQL type names as strings (not numpy dtypes)

[('col1', 'bigint'), ('col2', 'double')]

In [20]:
df.schema  # the same information as Spark DataType objects (these mirror the Scala/Java types)

StructType(List(StructField(col1,LongType,true),StructField(col2,DoubleType,true)))

What if I intended for `col1` to not get very large?  It might be more space efficient for me to store it as a single `byte` (Spark's `ByteType`, which holds integers from -128 to 127).

(aside:  `double` is a common type name in other languages for what Python calls `float`.  Remember that Spark is implemented in Scala under the hood.  There are actually quite a few basic types)

We can specify a *schema*.  However, it is slightly verbose.  Remember that we are interfacing into the Java/Scala world, which is notoriously verbose:

In [21]:
from pyspark.sql.types import StructType, StructField, ByteType, DoubleType

schema = StructType([StructField("col1", ByteType(), True),
                     StructField("col2", DoubleType(), True)])

# the True arguments above specify whether or not the data is
# allowed to be missing (null)

In [22]:
df_from_schema = ss.createDataFrame([(3, 6.23), (5, 4.2), (4, 6.8)], schema)
df_from_schema.collect()

[Row(col1=3, col2=6.23), Row(col1=5, col2=4.2), Row(col1=4, col2=6.8)]

In [23]:
df_from_schema.dtypes

[('col1', 'tinyint'), ('col2', 'double')]

In [24]:
df_from_schema.schema

StructType(List(StructField(col1,ByteType,true),StructField(col2,DoubleType,true)))

## Writing out

We can write data out to HDFS.  For example, here is how to write out to a CSV:

In [25]:
# be careful that we are running as a user that has permission to
# write to a directory in HDFS
sc.sparkUser()

'vagrant'

In [29]:
from hdfs import InsecureClient
client = InsecureClient("http://namenode:50070", user='vagrant')
client.list('/Users/vagrant')

['structured-2018-01-14-neworleans',
 'structured-2018-03-11-atlanta',
 'structured-2018-04-01-birmingham',
 'structured-2018-04-08-proleague1',
 'structured-2018-04-19-relegation',
 'structured-2018-04-22-seattle',
 'structured-2018-06-17-anaheim',
 'structured-2018-07-29-proleague2',
 'structured-2018-08-19-champs']

In [28]:
df.write.csv('hdfs://namenode/Users/vagrant/test_df.csv')

In [30]:
# parquet is very popular, and much more efficient than csv
df.write.parquet('hdfs://namenode/Users/vagrant/test_df.parquet')

Incidentally, I never showed you how to write out an RDD.  It is just as easy:

In [31]:
rdd.saveAsPickleFile('hdfs://namenode/Users/vagrant/test_rdd.pickle')