# Walkthrough: DataFrames in Spark

This course has been tested with Spark 2.2 (April 2018)

# 1. Overview

What is Spark SQL?
- Spark SQL takes basic RDDs and **puts a schema on them**.

What is a DataFrame?
- DataFrames are the primary abstraction in Spark SQL.
- Think of DataFrames as **RDDs with a schema**.

What are **schemas**?
- Schemas are metadata about your data.
- Schema = Table Names + Column Names + Column Types

What are the pros of schemas?
- Schemas enable using **column names** instead of column positions
- Schemas enable **queries** using SQL and DataFrame syntax
- Schemas make your data more **structured**.
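
To make the first point concrete, here is a plain-Python analogy (no Spark involved). A bare tuple stands in for a schemaless RDD record, and a `namedtuple` stands in for a row that carries column names. The `Sale` type and its values are made up for illustration.

```python
from collections import namedtuple

# Without a schema: a record is a bare tuple, fields are addressed by position.
raw_record = (101, "WA", 300.0)
amount_by_position = raw_record[2]   # you must remember that index 2 is the amount

# With a "schema": column names are attached to the record.
Sale = namedtuple("Sale", ["id", "state", "amount"])   # hypothetical row type
sale = Sale(*raw_record)
amount_by_name = sale.amount         # self-documenting access

print(amount_by_position, amount_by_name)
```

Both lookups return the same value; only the second one survives a change in column order.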

# 2. Operational DataFrames in Python

We'll proceed along the usual Spark flow (see above).
1. create the environment to run Spark SQL from Python
2. create DataFrames from RDDs or from files
3. run some transformations
4. execute actions to obtain values (local objects in Python)

## 2.1. Initializing a `SparkSession` (`SparkContext` and `SQLContext`) in Python

Using:

```python
import pyspark as ps

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("df lecture") \
            .getOrCreate()
```

will create a *"local"* cluster that runs entirely inside the driver process, using 4 worker threads (use `local[*]` to get as many threads as there are cores).

In [None]:
import pyspark as ps

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("df lecture") \
            .getOrCreate()
            
sc = spark.sparkContext

## 2.2. Creating a DataFrame

### 2.2.1. From an RDD (specifying schema)

You can create a DataFrame from an existing RDD (whatever source you used to create this one), if you add a schema.

To build a schema, you will use the data types provided in the [`pyspark.sql.types`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) module. Here's a (subjective) list of the most useful ones.

| Types | Python-like type |
| - | - |
| StringType | string |
| IntegerType | int |
| FloatType | float |
| ArrayType\* | array or list |
| MapType | dict |

\* see later UDF functions on how to use that

In [None]:
%ls data

In [None]:
# remember that csv file ?
def casting_function(row):
    trans_id, date, store, state, product, amount = row
    return((int(trans_id), date, int(store), state, int(product), float(amount)))

rdd_sales = sc.textFile('data/sales.txt')\
        .map(lambda rowstr : rowstr.split())\
        .filter(lambda row: not row[0].startswith('#'))\
        .map(casting_function)

rdd_sales.collect()

In [None]:
# import the many data types
from pyspark.sql.types import *

# create a schema of your own
schema = StructType( [
    StructField('id', IntegerType(),True),
    StructField('date', StringType(),True),
    StructField('store', IntegerType(),True),
    StructField('state', StringType(),True),
    StructField('product', IntegerType(),True),
    StructField('amount', FloatType(),True) ] )

# feed that into a DataFrame
df = spark.createDataFrame(rdd_sales, schema)

# show the result
df.show()

# print the schema
df.printSchema()

### 2.2.2. Reading from files (inferring schema)

Use [`spark.read.csv`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv) to load a CSV into a DataFrame. Its keyword arguments control the header, quoting, separator and so on, and it can infer the schema.
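
To give an intuition of what "inferring the schema" means, here is a deliberately simplified plain-Python sketch. It is not Spark's actual algorithm (Spark supports more types and needs an extra pass over the data to do this); the `infer_type` helper and the sample values are assumptions made up for illustration.

```python
def infer_type(values):
    """Return the narrowest type name that fits every string in `values`."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                caster(v)          # raises ValueError if the string does not fit
            return name
        except ValueError:
            continue               # try the next, wider type
    return "string"                # fallback: everything fits a string

# Hypothetical column samples, as raw strings read from a CSV file.
columns = {
    "Date": ["2015-01-02", "2015-01-05"],
    "Volume": ["53204600", "64285500"],
    "Close": ["109.33", "106.25"],
}
inferred = {name: infer_type(vals) for name, vals in columns.items()}
print(inferred)
```

When the inferred types are not what you want, passing an explicit schema (as in 2.2.1) avoids the guesswork.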

In [None]:
# read CSV
df = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

# prints the schema
df.printSchema()

# some functions are still valid
print("line count: {}".format(df.count()))

# show the table in an oh-so-nice format
df.show()

Use [`spark.read.json`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json) to load a JSON file into a DataFrame. It infers the schema from the data, and it can also read compressed files.

In [None]:
%ls data

In [None]:
# read JSON
df = spark.read.json('data/sales.json')

# prints the schema
df.printSchema()

# some functions are still valid
print("line count: {}".format(df.count()))

# show the table in an oh-so-nice format
df.show()

In [None]:
# read JSON
df = spark.read.json('data/sales2.json.gz')

# show the table in an oh-so-nice format
df.show()

## 2.3. Actions : turning your DataFrame into a local object

Some actions remain the same as on RDDs, so you won't have to learn Spark all over again.

Some new actions let you describe and display the content in a more readable way.

When used/executed in IPython or in a notebook, they **launch the processing of the DAG**. This is where Spark stops being **lazy**. This is where your script will take time to execute.
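
The laziness described above can be mimicked in plain Python with generators. This is only an analogy, not how Spark is implemented: building the pipeline plays the role of chaining transformations, and calling `list()` plays the role of an action. All names below are made up for illustration.

```python
log = []

def parse(lines):
    # Generator: nothing runs until a consumer asks for values.
    for line in lines:
        log.append("parse " + line)
        yield int(line)

def keep_even(numbers):
    for n in numbers:
        log.append("filter " + str(n))
        if n % 2 == 0:
            yield n

# "Transformations": building the pipeline executes no work yet.
pipeline = keep_even(parse(["1", "2", "3", "4"]))
steps_before_action = len(log)

# "Action": materialising the result finally runs every step.
result = list(pipeline)
steps_after_action = len(log)

print(steps_before_action, steps_after_action, result)
```

The log stays empty until the result is requested, which is why the cost of a Spark job shows up at the action and not at the transformations.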

| Method | DF vs RDD? | Description |
| - | - | - |
| [`.collect()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.collect) | identical | Return a list that contains all of the elements as Rows. |
| [`.count()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.count) | identical | Return the number of elements. |
| [`.take(n)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.take) | identical | Take the first `n` elements. |
| `.top(n)` | <span style="color:red">different</span> | RDD only: DataFrames have no `.top()`. Sort with `.orderBy()` first, then use `.take(n)`. |
| [`.first()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.first) | identical | Return the first element. |
| [`.show(n)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show) | <span style="color:green">new</span> | Show the DataFrame in table format (`n=20` by default) |
| [`.toPandas()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toPandas) | <span style="color:green">new</span> | Convert the DF into a Pandas DF. |
| [`.printSchema()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.printSchema) | <span style="color:green">new</span> | Display the schema. This is not an action (it doesn't launch the DAG), but it fits best in this category. |
| [`.describe(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.describe) | <span style="color:green">new</span> | Compute summary statistics (count, mean, stddev, min, max) for the given columns, or for all numeric and string columns if none are given. |
| [`.sum(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.sum) | <span style="color:red">different</span> | Applies on GroupedData only (see transformations). |
| [`.mean(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.mean) | <span style="color:red">different</span> | Applies on GroupedData only (see transformations). |
| [`.min(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.min) | <span style="color:red">different</span> | Applies on GroupedData only (see transformations). |
| [`.max(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.max) | <span style="color:red">different</span> | Applies on GroupedData only (see transformations). |


In [None]:
# read CSV
df_sales = spark.read.csv('data/sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

In [None]:
df_sales.show()

In [None]:
df_sales.toPandas()

`.collect()` returns a Python list of `Row` objects, whose fields can be accessed by column name:

In [None]:
df_sales.collect()[0]["Date"]

In [None]:
# prints the schema
print("--- printSchema()")
df_sales.printSchema()

# prints the table itself
print("--- show()")
df_sales.show()

# show the statistics of all numerical columns
print("--- describe()")
df_sales.describe().show()

# show the statistics of one specific column
print("--- describe(Amount)")
df_sales.describe("Amount").show()
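
For reference, the statistics reported by `.describe()` correspond to the following plain-Python computation. The `amounts` list is hypothetical and is not taken from `data/sales.csv`.

```python
import statistics

amounts = [300.0, 10.0, 1.0, 25.0]   # hypothetical values for one numeric column

summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "stddev": statistics.stdev(amounts),   # sample standard deviation
    "min": min(amounts),
    "max": max(amounts),
}
for name, value in summary.items():
    print(name, value)
```

Note that `statistics.stdev` is the sample standard deviation (dividing by `n - 1`), which is also what Spark's `stddev` aggregate computes.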

## 2.4. Transformations on DataFrames

- They are still **lazy**: Spark doesn't apply the transformation right away, it just builds on the **DAG**
- They transform a DataFrame into another because DataFrames are also **immutable**.
- They can be **wide** or **narrow** (whether they shuffle partitions or not).
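
The narrow/wide distinction can be sketched in plain Python by modelling partitions as lists. This is a toy model under made-up data, not Spark code: a narrow transformation works on each partition independently, while a wide one first has to bring records with the same key together.

```python
from collections import defaultdict

# Two hypothetical partitions of (state, amount) records.
partitions = [
    [("WA", 300.0), ("CA", 10.0)],
    [("WA", 25.0), ("CA", 5.0), ("OR", 1.0)],
]

# Narrow: each output partition depends on exactly one input partition.
narrow = [[(state, amount * 2) for state, amount in part] for part in partitions]

# Wide: records with the same key must be brought together first (the "shuffle"),
# so every output group may depend on every input partition.
shuffled = defaultdict(list)
for part in partitions:
    for state, amount in part:
        shuffled[state].append(amount)
wide = {state: sum(amounts) for state, amounts in shuffled.items()}

print(narrow)
print(wide)
```

In the table that follows, `.withColumn`, `.filter` and `.select` behave like the narrow case, while `.groupBy`, `.join` and `.sort` generally require a shuffle.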

You got that... DataFrames are just RDDs with a schema.

| Method | Type | Category | Description |
| - | - | - | - |
| [`.withColumn(label,func)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.withColumn) | transformation | mapping | Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
| [`.filter(condition)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.filter) | transformation | reduction |  Filters rows using the given condition. |
| [`.sample()`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sample) | transformation | reduction | Return a sampled subset of this DataFrame. |
| [`.sampleBy(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sampleBy) | transformation | reduction | Returns a stratified sample without replacement based on the fraction given on each stratum. |
| [`.select(cols)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.select) | transformation | reduction | Projects a set of expressions and returns a new DataFrame. |
| [`.join(dfB)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) | transformation | operations | Joins with another DataFrame, using the given join expression. |
| [`.groupBy(col)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.groupBy) | transformation | operations | Groups the DataFrame using the specified columns, so we can run aggregation on them.  |
| [`.sort(cols)`](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort) | transformation | sorting |  Returns a new DataFrame sorted by the specified column(s). |



#### `.withColumn("label", func)` : loading the example data

In [None]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

df_aapl.printSchema()

#### `.withColumn("label", func)` : constant value

In [None]:
from pyspark.sql.functions import lit

df_out = df_aapl.withColumn("blabla", lit("echo"))

df_out.show(5)

#### `.withColumn("label", func)` : column operations

In [None]:
from pyspark.sql.functions import col

#df_out = df_aapl.withColumn("diff", df_aapl.High - df_aapl.Low)
df_out = df_aapl.withColumn("diff", df_aapl["High"] - df_aapl["Low"])
df_out.show(5)

df_out = df_aapl.withColumn("diff", df_aapl.High - df_aapl.Low)
df_out.show(5)

# below is the PREFERRED METHOD for referencing columns
df_out = df_aapl.withColumn("diff", col("High") - col("Low"))
df_out.show(5)

#### `.withColumn("label", func)` : user defined function

In [None]:
from pyspark.sql.functions import udf, col
from pyspark.sql.types import DoubleType

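def my_specialfunc(h, l, o, c):
    # product of the day's range (High - Low) and its Open - Close move
    return (h - l) * (o - c)

# wrap the plain Python function as a UDF returning a double
my_specialfunc_udf = udf(my_specialfunc, DoubleType())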

df_out = df_aapl.withColumn("special",
                            my_specialfunc_udf(col('High'),
                                               col('Low'),
                                               col('Open'),
                                               col('Close')))

df_out.show()
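
Because a UDF wraps an ordinary Python function, its logic can be checked locally on plain numbers before running it on a DataFrame. The sketch below repeats the formula of the UDF above and adds a hypothetical null-safe variant (`my_specialfunc_nullsafe` is not part of the original notebook): null column values reach a Python UDF as `None`, so arithmetic on them would raise an error.

```python
def my_specialfunc(h, l, o, c):
    # Same formula as the UDF above.
    return (h - l) * (o - c)

def my_specialfunc_nullsafe(h, l, o, c):
    # Hypothetical variant: return None (a Spark null) if any input is missing.
    if None in (h, l, o, c):
        return None
    # Return a Python float so the value matches a DoubleType return type.
    return float((h - l) * (o - c))

print(my_specialfunc(112.0, 110.0, 111.0, 110.5))
print(my_specialfunc_nullsafe(112.0, None, 111.0, 110.5))
```

The null-safe version can be passed to `udf(...)` in exactly the same way as the original function.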

#### `.filter(condition)` : filtering rows

In [None]:
from pyspark.sql.functions import col

df_out = df_aapl.filter("High > 120")

df_out.show()

In [None]:
from pyspark.sql.functions import col

df_out = df_aapl.filter(col('High') > 120)

df_out.show()

#### `.select(*cols)` : selecting specific columns

In [None]:
df_out = df_aapl.select(["Open", "Close"])

df_out.show(5)

#### `.groupBy()`: aggregating in DataFrames

In [None]:
from pyspark.sql import functions as F

df_out = df_sales.groupBy(col("State")).agg(F.sum(col("Amount")),F.mean("Amount"))

df_out.show()

#### `.orderBy()` : sorting by a column

In [None]:
df_out = df_sales.groupBy(col("State"))\
            .agg(F.sum(col("Amount")).alias("sumAmount"))\
            .orderBy(col("sumAmount"), ascending=False)

df_out.show()

## 2.5. Execute SQL statements

#### `.createOrReplaceTempView(name)` : registering your DataFrame as a table

In [None]:
# read CSV
df_sales = spark.read.csv('data/sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

# Now create an SQL table and issue SQL queries against it without
# using the sqlContext but through the SparkSession object.
# Creates a temporary view of the DataFrame
df_sales.createOrReplaceTempView("sales")

In [None]:
result = spark.sql('''
    SELECT state, AVG(amount) as avg_amount
    FROM sales
    GROUP BY state
    ''')
result.show()
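
If you want to check what a query like the one above computes without starting Spark, the same `GROUP BY` can be tried with Python's built-in `sqlite3` module. This is only a stand-in for experimenting with SQL: the table contents are made up, and SQLite's dialect differs from Spark SQL in places.

```python
import sqlite3

# In-memory database with a hypothetical "sales" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (state TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("WA", 300.0), ("CA", 10.0), ("WA", 100.0), ("CA", 20.0)],
)

# Same shape as the Spark SQL query: one average per state.
rows = conn.execute("""
    SELECT state, AVG(amount) AS avg_amount
    FROM sales
    GROUP BY state
    ORDER BY state
""").fetchall()
conn.close()

print(rows)
```

The `ORDER BY` is added here only to make the output order deterministic; without it, neither SQLite nor Spark guarantees the order of the groups.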

#### `spark.udf.register(name, func)` : registering a user-defined function

In [None]:
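from pyspark.sql.types import DoubleType

def myfun(x):
    # cast first so the result is always a Python float, matching DoubleType
    return float(x) ** 2

# without an explicit return type, the registered UDF would return strings
spark.udf.register('my_sql_fun', myfun, DoubleType())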

result = spark.sql('''
    SELECT state, my_sql_fun(amount) as square_amount
    FROM sales
    ''')

result.show()

# 3. Let's design chains of transformations together ! (reloaded)

## 3.1. Computing sales per state

### Input DataFrame

In [None]:
# read CSV
df_sales = spark.read.csv('data/sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_sales.show()

### Task

You want to obtain an ~~RDD~~ DataFrame of the states, sorted by decreasing total sales.

What transformations do you need to apply?

How would you draw the workflow of these transformations?

### Code

In [None]:
df_out = df_sales # code your transformations here

df_out.show()

### Solution

<details>
  <summary>Click here to see the solution below</summary>

```python
df_out = df_sales.groupBy(col('State'))\
                 .agg(F.sum(col('Amount')).alias('Money'))\
                 .orderBy("Money", ascending=False)

df_out.show()
```
</details>

## 3.2. Find the date on which AAPL's stock price was the highest

### Input DataFrame

In [None]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

### Task

Now, design a pipeline that would :

1. ~~filter out headers and last line~~
2. ~~split each line based on comma~~
3. keep only fields for Date ~~(col 0)~~ and Close ~~(col 4)~~
4. order by Close in descending order

### Code

In [None]:
df_out = df_aapl  # put your transformation here...

df_out.show(5)

### Solution

<details>
  <summary>Click here to see the solution below</summary>

```python
df_out = df_aapl.select("Date", "Close").orderBy(col("Close"), ascending=False)

df_out.show(5)
```
</details>