# DataFrames in Spark


# 1. Overview

## 1.1. RDDs versus DataFrames

What is a DataFrame?
- DataFrames are the primary abstraction in Spark SQL.
- Think of a DataFrame as an RDD with a schema.

What is a schema?
- Schemas are metadata about your data.
- Schemas define table names, column names, and column types over your data.
- Schemas enable using SQL and DataFrame syntax to query your RDDs, instead of using column positions.

What is Spark SQL?
- Spark SQL is Spark's module for structured data: it takes basic RDDs and puts a schema on them.

What is a schema again?
- Schema = Table Names + Column Names + Column Types

What are the pros of schemas?
- Schemas let you refer to columns by name instead of by position.
- Schemas enable queries using SQL and DataFrame syntax.
- Schemas make your data more structured.
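
To make the first benefit concrete, here is a minimal plain-Python sketch (no Spark involved) that contrasts position-based access with name-based access. The `Sale` record type and the sample row are illustrative assumptions, not part of the original notebook, and the type hints only mimic the column types a real schema would carry.

```python
from typing import NamedTuple

# Hypothetical record type standing in for a schema: column names + types.
class Sale(NamedTuple):
    id: int
    state: str
    amount: float

raw_row = (101, "WA", 300.0)      # schema-less: a plain tuple
sale = Sale(*raw_row)             # same data, now with column names

amount_by_position = raw_row[2]   # you must remember that amount is column 2
amount_by_name = sale.amount      # self-describing access

print(amount_by_position, amount_by_name)
print(Sale._fields)
```

Both lookups return the same value; the difference is that the named version keeps working if columns are reordered, which is roughly what a schema buys you in Spark.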

# 2. Operational DataFrames in Python

We'll proceed along the usual Spark flow (see above).
1. create the environment to run Spark / Spark SQL from Python
2. create DataFrames from RDDs or from files
3. run some transformations
4. execute actions to obtain values (local objects in Python)

## 2.1. Initializing a `SparkContext` and `SparkSession` in Python

Using:

```python
import pyspark as ps
sc = ps.SparkContext('local[4]')
```

will create a *"local"* cluster that runs entirely inside the driver process, using 4 worker threads (typically one per core).


In [1]:
import pyspark as ps    # for the pyspark suite

In [2]:
spark = (ps.sql.SparkSession
         .builder
         .master('local[4]')
         .appName('lecture')
         .getOrCreate()
        )
sc = spark.sparkContext

In [3]:
sc

In [4]:
spark

The `spark` session object serves as our SQL context: it is the entry point for creating DataFrames and running SQL queries.

In [5]:
# old (Spark 1.x) way of making a SQL Context: sqlContext = ps.SQLContext(sc)

## 2.2. Creating a DataFrame

### 2.2.1. From an RDD (specifying schema)

You can create a DataFrame from an existing RDD, whatever source that RDD was built from, as long as you supply a schema.

To build a schema, you will use existing data types provided in the [`pyspark.sql.types`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.types) module. Here's a list of the most useful ones (subjective criteria).

| Types | Python-like type |
| - | - |
| StringType | string |
| IntegerType | int |
| FloatType | float |
| ArrayType\* | array or list |
| MapType | dict |

\* see later UDF functions on how to use that

#### CSV to RDD to DataFrame

In [6]:
!head data/sales.csv

#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
102,11/15/2014,203,CA,321,200.00
106,11/19/2014,202,CA,331,330.00
103,11/17/2014,101,WA,373,750.00
105,11/19/2014,202,CA,321,200.00


In [7]:
def casting_function(row):
    (id, date, store, state, product, amount) = row
    return (int(id), date, int(store), state, int(product), float(amount))

rdd_sales = (
    sc.textFile('data/sales.csv')
        .map(lambda rowstr : rowstr.split(","))
        .filter(lambda row: not row[0].startswith('#'))
        .map(casting_function)
            )

rdd_sales.collect()

[(101, '11/13/2014', 100, 'WA', 331, 300.0),
 (104, '11/18/2014', 700, 'OR', 329, 450.0),
 (102, '11/15/2014', 203, 'CA', 321, 200.0),
 (106, '11/19/2014', 202, 'CA', 331, 330.0),
 (103, '11/17/2014', 101, 'WA', 373, 750.0),
 (105, '11/19/2014', 202, 'CA', 321, 200.0)]
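
Splitting on `","` works for this file, but it would break on quoted fields that contain commas. As a hedged alternative, the sketch below parses the same kind of lines with Python's standard `csv` module and also converts the date string to a `datetime.date`. It runs on a small in-memory sample rather than on `data/sales.csv`, and the `parse_sales` helper is a hypothetical name, not something from the original notebook.

```python
import csv
import io
from datetime import datetime

# Two sample lines in the same layout as the sales file (assumed sample).
SAMPLE = """#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
"""

def parse_sales(text):
    """Yield typed tuples, skipping blank lines and the '#'-prefixed header."""
    for row in csv.reader(io.StringIO(text)):
        if not row or row[0].startswith("#"):
            continue
        id_, date, store, state, product, amount = row
        yield (int(id_),
               datetime.strptime(date, "%m/%d/%Y").date(),
               int(store), state, int(product), float(amount))

rows = list(parse_sales(SAMPLE))
for r in rows:
    print(r)
```

Parsing the date up front is a design choice: it lets later steps sort or filter chronologically instead of comparing `MM/DD/YYYY` strings.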

In [8]:
# import the many data types
from pyspark.sql.types import *

# create a schema of your own
schema = StructType( [
    StructField('id',IntegerType(),True),
    StructField('date',StringType(),True),
    StructField('store',IntegerType(),True),
    StructField('state',StringType(),True),
    StructField('product',IntegerType(),True),
    StructField('amount',FloatType(),True) ] )

# feed that into a DataFrame
df = spark.createDataFrame(rdd_sales,schema)

# show the result
df.show()

# print the schema
df.printSchema()

+---+----------+-----+-----+-------+------+
| id|      date|store|state|product|amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+

root
 |-- id: integer (nullable = true)
 |-- date: string (nullable = true)
 |-- store: integer (nullable = true)
 |-- state: string (nullable = true)
 |-- product: integer (nullable = true)
 |-- amount: float (nullable = true)



### 2.2.2. Reading from files (inferring schema)

Use [`spark.read.csv`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv) to load a CSV into a DataFrame. It accepts options such as the separator, the quote character, and whether the first line is a header, and it can infer the schema.

In [9]:
# read CSV
df = spark.read.csv('data/sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

# prints the schema
df.printSchema()

# some functions are still valid
print("line count: {}".format(df.count()))

# show the table in a oh-so-nice format
df.show()

root
 |-- #ID: integer (nullable = true)
 |-- Date: string (nullable = true)
 |-- Store: integer (nullable = true)
 |-- State: string (nullable = true)
 |-- Product: integer (nullable = true)
 |-- Amount: double (nullable = true)

line count: 6
+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



Use [`spark.read.json`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.json) to load a JSON file into a DataFrame. It infers the schema from the data.

In [10]:
# read JSON
df = spark.read.json('data/sales.json')

# prints the schema
df.printSchema()

# some functions are still valid
print("line count: {}".format(df.count()))

# show the table in a oh-so-nice format
df.show()

root
 |-- amount: double (nullable = true)
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- product: long (nullable = true)
 |-- state: string (nullable = true)
 |-- store: long (nullable = true)

line count: 6
+------+----------+---+-------+-----+-----+
|amount|      date| id|product|state|store|
+------+----------+---+-------+-----+-----+
| 300.0|11/13/2014|101|    331|   WA|  100|
| 450.0|11/18/2014|104|    329|   OR|  700|
| 200.0|11/15/2014|102|    321|   CA|  203|
| 330.0|11/19/2014|106|    331|   CA|  202|
| 750.0|11/17/2014|103|    373|   WA|  101|
| 200.0|11/19/2014|105|    321|   CA|  202|
+------+----------+---+-------+-----+-----+



In [11]:
df.collect()

[Row(amount=300.0, date='11/13/2014', id=101, product=331, state='WA', store=100),
 Row(amount=450.0, date='11/18/2014', id=104, product=329, state='OR', store=700),
 Row(amount=200.0, date='11/15/2014', id=102, product=321, state='CA', store=203),
 Row(amount=330.0, date='11/19/2014', id=106, product=331, state='CA', store=202),
 Row(amount=750.0, date='11/17/2014', id=103, product=373, state='WA', store=101),
 Row(amount=200.0, date='11/19/2014', id=105, product=321, state='CA', store=202)]

## 2.3. Actions : turning your DataFrame into a local object

Some actions remain the same as for RDDs, so you won't have to learn Spark all over again.

Some new actions let you describe and display the content in a more readable way.

When used/executed in IPython or in a notebook, they **launch the processing of the DAG**. This is where Spark stops being **lazy**. This is where your script will take time to execute.

| Method | DF vs RDD? | Description |
| - | - | - |
| [`.collect()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.collect) | identical | Return a list that contains all of the elements as Rows. |
| [`.count()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.count) | identical | Return the number of elements. |
| [`.take(n)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.take) | identical | Take the first `n` elements. |
| `.top(n)` | <span style="color:red">different</span> | RDD only: DataFrames have no `.top()` method. Use `.orderBy(...)` followed by `.take(n)` or `.limit(n)` instead. |
| [`.first()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.first) | identical | Return the first element. |
| [`.show(n)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.show) | <span style="color:green">new</span> | Show the DataFrame in table format (`n=20` by default) |
| [`.toPandas()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.toPandas) | <span style="color:green">new</span> | Convert the DF into a Pandas DF. |
| [`.printSchema()`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.printSchema) | <span style="color:green">new</span> | Display the schema. Strictly speaking this is not an action (it doesn't launch the DAG), but it fits best in this table. |
| [`.describe(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.describe) | <span style="color:green">new</span> | Compute summary statistics (count, mean, stddev, min, max) for the given columns, or for all numeric and string columns if none are given. |
| [`.sum(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.sum) | <span style="color:red">different</span> | Applies to GroupedData only (see transformations). |
| [`.mean(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.mean) | <span style="color:red">different</span> | Applies to GroupedData only (see transformations). |
| [`.min(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.min) | <span style="color:red">different</span> | Applies to GroupedData only (see transformations). |
| [`.max(*cols)`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.GroupedData.max) | <span style="color:red">different</span> | Applies to GroupedData only (see transformations). |
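
For intuition about what `.describe()` reports, here is a plain-Python sketch that computes the same kind of summary for the `Amount` values of the six sales rows shown earlier. It uses the standard `statistics` module rather than Spark, and it assumes the standard deviation is the sample one, so treat it as a rough cross-check rather than as Spark's implementation.

```python
import statistics

# Amount column of the six sales rows shown earlier.
amounts = [300.0, 450.0, 200.0, 330.0, 750.0, 200.0]

summary = {
    "count": len(amounts),
    "mean": statistics.mean(amounts),
    "stddev": statistics.stdev(amounts),   # sample standard deviation (assumption)
    "min": min(amounts),
    "max": max(amounts),
}

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```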


In [12]:
df[['date','amount']].show()

+----------+------+
|      date|amount|
+----------+------+
|11/13/2014| 300.0|
|11/18/2014| 450.0|
|11/15/2014| 200.0|
|11/19/2014| 330.0|
|11/17/2014| 750.0|
|11/19/2014| 200.0|
+----------+------+



In [13]:
# read CSV
df_sales = spark.read.csv('data/sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

In [14]:
df_sales.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



In [15]:
df_sales.toPandas()

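| | #ID | Date | Store | State | Product | Amount |
| - | - | - | - | - | - | - |
| 0 | 101 | 11/13/2014 | 100 | WA | 331 | 300.0 |
| 1 | 104 | 11/18/2014 | 700 | OR | 329 | 450.0 |
| 2 | 102 | 11/15/2014 | 203 | CA | 321 | 200.0 |
| 3 | 106 | 11/19/2014 | 202 | CA | 331 | 330.0 |
| 4 | 103 | 11/17/2014 | 101 | WA | 373 | 750.0 |
| 5 | 105 | 11/19/2014 | 202 | CA | 321 | 200.0 |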


This is how `.collect()` returns things...

In [16]:
df_sales.collect()

[Row(#ID=101, Date='11/13/2014', Store=100, State='WA', Product=331, Amount=300.0),
 Row(#ID=104, Date='11/18/2014', Store=700, State='OR', Product=329, Amount=450.0),
 Row(#ID=102, Date='11/15/2014', Store=203, State='CA', Product=321, Amount=200.0),
 Row(#ID=106, Date='11/19/2014', Store=202, State='CA', Product=331, Amount=330.0),
 Row(#ID=103, Date='11/17/2014', Store=101, State='WA', Product=373, Amount=750.0),
 Row(#ID=105, Date='11/19/2014', Store=202, State='CA', Product=321, Amount=200.0)]

In [17]:
# prints the schema
print("--- printSchema()")
df_sales.printSchema()

# prints the table itself
print("--- show()")
df_sales.show()

# show the statistics of all numerical columns
print("--- describe()")
df_sales.describe().show()

# show the statistics of one specific column
print("--- describe(Amount)")
df_sales.describe("Amount").show()

--- printSchema()
root
 |-- #ID: integer (nullable = true)
 |-- Date: string (nullable = true)
 |-- Store: integer (nullable = true)
 |-- State: string (nullable = true)
 |-- Product: integer (nullable = true)
 |-- Amount: double (nullable = true)

--- show()
+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+

--- describe()
+-------+------------------+----------+------------------+-----+------------------+------------------+
|summary|               #ID|      Date|             Store|State|           Product|            Amount|
+-------+------------------+----------+------------------+-----+--------------

## 2.3. Transformations on DataFrames

- They are still **lazy**: Spark doesn't apply the transformation right away, it just adds a step to the **DAG**.
- They return a new DataFrame rather than modifying the original, because DataFrames are also **immutable**.
- They can be **wide** or **narrow** (whether they shuffle partitions or not).
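
Laziness can be illustrated without Spark. The sketch below uses Python generators as a loose analogy: chaining the steps does no work, and everything runs only when the result is consumed. The function names and the `log` list are illustrative assumptions; Spark's DAG additionally handles optimization and distribution, which this analogy does not capture.

```python
log = []  # records every step that actually executes

def read_rows():
    for amount in [300.0, 450.0, 200.0]:
        log.append(f"read {amount}")
        yield amount

def double(rows):
    for amount in rows:
        log.append(f"double {amount}")
        yield amount * 2

pipeline = double(read_rows())   # like chaining transformations: nothing runs yet
steps_before = len(log)

result = list(pipeline)          # like an action: the whole chain executes now
print(steps_before, result, len(log))   # prints: 0 [600.0, 900.0, 400.0] 6
```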

You got that... DataFrames are just RDDs with a schema.

### 2.3.1. selecting and adding columns

#### `.select(*cols)` : selecting specific columns

In [18]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

df_aapl.printSchema()

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+-------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows

root
 |-- Date: timestamp (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: 

In [19]:
df_out = df_aapl.select("Open", "Close")

df_out.show(5)

+----------+----------+
|      Open|     Close|
+----------+----------+
|117.949997|    118.25|
|117.099998|117.650002|
|116.809998|116.599998|
|116.860001|117.059998|
|    117.25|117.120003|
+----------+----------+
only showing top 5 rows



In [20]:
df_aapl[["Open","Close"]].show(5)

+----------+----------+
|      Open|     Close|
+----------+----------+
|117.949997|    118.25|
|117.099998|117.650002|
|116.809998|116.599998|
|116.860001|117.059998|
|    117.25|117.120003|
+----------+----------+
only showing top 5 rows



#### `.withColumn("label", func)` : constant value

In [23]:
from pyspark.sql.functions import lit

df_out = df_aapl.withColumn("blabla", lit(34))

df_out[['Open','High','blabla']].show(5)

+----------+----------+------+
|      Open|      High|blabla|
+----------+----------+------+
|117.949997|118.360001|    34|
|117.099998|117.739998|    34|
|116.809998|116.910004|    34|
|116.860001|117.379997|    34|
|    117.25|117.760002|    34|
+----------+----------+------+
only showing top 5 rows



#### `.withColumn("label", func)` : column operations

In [24]:
df_out = (df_aapl
            .withColumn("diff", 
                         df_aapl['High'] - df_aapl['Low'])
            .select('Date', 'High', 'Low', 'diff')
         )

df_out.show(5)

+-------------------+----------+----------+------------------+
|               Date|      High|       Low|              diff|
+-------------------+----------+----------+------------------+
|2016-10-25 00:00:00|118.360001|117.309998|1.0500030000000038|
|2016-10-24 00:00:00|117.739998|     117.0|0.7399979999999999|
|2016-10-21 00:00:00|116.910004|116.279999| 0.630004999999997|
|2016-10-20 00:00:00|117.379997|116.330002|1.0499950000000098|
|2016-10-19 00:00:00|117.760002|113.800003|3.9599989999999963|
+-------------------+----------+----------+------------------+
only showing top 5 rows



#### `.withColumn("label", func)` : user defined function
`udf()` turns a normal Python function into something Spark can parallelize across its distributed data. `udf()` takes two arguments: the function, and the data type the function returns (the return type defaults to `StringType` if omitted).
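
Because a UDF is just a Python function underneath, it can be checked locally before being wrapped. The sketch below defines the same formula as the notebook's `my_specialfunc` and evaluates it on the first AAPL row shown earlier. The `null_safe_specialfunc` wrapper is a hypothetical addition: Spark passes `None` to a Python UDF for null values, so guarding against it is a common precaution, not something the original notebook does.

```python
import math

def my_specialfunc(h, l, o, c):
    # Same formula as the notebook's function: (high - low) * exp(open - close)
    return (h - l) * math.exp(o - c)

def null_safe_specialfunc(h, l, o, c):
    # Hypothetical wrapper: propagate a missing value instead of raising
    if None in (h, l, o, c):
        return None
    return my_specialfunc(h, l, o, c)

# First AAPL row shown earlier: High, Low, Open, Close
value = my_specialfunc(118.360001, 117.309998, 117.949997, 118.25)
print(value)
print(null_safe_specialfunc(None, 117.309998, 117.949997, 118.25))
```

The first value should agree, up to float precision, with the first entry of the `special` column produced by the UDF.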

In [33]:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType, FloatType

def my_specialfunc(h,l,o,c):
    return ((h-l)*(math.exp(o-c)))

my_specialfunc_udf = udf(my_specialfunc, FloatType())

df_out = df_aapl.withColumn("special", my_specialfunc_udf(df_aapl['High'], 
                                                          df_aapl['Low'], 
                                                          df_aapl['Open'], 
                                                          df_aapl['Close']))

df_out.select('High', 'Low', 'Open', 'Close', 'special').show()

+----------+----------+----------+----------+----------+
|      High|       Low|      Open|     Close|   special|
+----------+----------+----------+----------+----------+
|118.360001|117.309998|117.949997|    118.25|0.77785903|
|117.739998|     117.0|117.099998|117.650002|   0.42694|
|116.910004|116.279999|116.809998|116.599998|0.77722335|
|117.379997|116.330002|116.860001|117.059998|0.85966575|
|117.760002|113.800003|    117.25|117.120003| 4.5097456|
|118.209999|117.449997|    118.18|117.470001| 1.5458359|
|117.839996|116.779999|117.330002|117.550003|0.85066664|
|118.169998|117.129997|117.879997|117.629997| 1.3353877|
|117.440002|115.720001|116.790001|116.980003| 1.4223677|
|117.980003|    116.75|117.349998|117.339996| 1.2423673|
|118.690002|116.199997|117.699997|116.300003| 10.097407|
|    116.75|114.720001|115.019997|116.050003| 0.7247194|
|114.559998|113.510002|114.309998|114.059998| 1.3482215|
|114.339996|113.129997|113.699997|113.889999| 1.0006177|
|113.660004|112.690002|113.4000

In [28]:
my_specialfunc_udf?

### 2.3.2. aggregating and sorting columns

#### `.groupBy()`: aggregating in DataFrames

In [34]:
df_sales.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



In [35]:
from pyspark.sql import functions as F
df_out = df_sales.groupBy("State").agg(F.sum("Amount"), F.avg('Product'))
#g = df_sales.groupBy("State").sum("Amount")
#g.show()

df_out.show()

+-----+-----------+-----------------+
|State|sum(Amount)|     avg(Product)|
+-----+-----------+-----------------+
|   OR|      450.0|            329.0|
|   CA|      730.0|324.3333333333333|
|   WA|     1050.0|            352.0|
+-----+-----------+-----------------+



#### `.orderBy()` : sorting by a column

In [36]:
df_out = (df_sales.groupBy("State")
                  .agg(F.sum("Amount"))
                  .orderBy("sum(Amount)", ascending=False)
         )

df_out.show()

+-----+-----------+
|State|sum(Amount)|
+-----+-----------+
|   WA|     1050.0|
|   CA|      730.0|
|   OR|      450.0|
+-----+-----------+
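
To see what the `groupBy` / `agg` / `orderBy` chain computes, here is a plain-Python sketch of the same logic on the six sales rows, reduced to their `State` and `Amount` columns. It is only meant to show the semantics on a single machine; Spark performs the grouping across partitions.

```python
from collections import defaultdict

# (State, Amount) pairs from the six sales rows.
sales = [("WA", 300.0), ("OR", 450.0), ("CA", 200.0),
         ("CA", 330.0), ("WA", 750.0), ("CA", 200.0)]

totals = defaultdict(float)
for state, amount in sales:
    totals[state] += amount      # groupBy("State") + sum("Amount")

# orderBy("sum(Amount)", ascending=False)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # [('WA', 1050.0), ('CA', 730.0), ('OR', 450.0)]
```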



# 3. Let's design chains of transformations together!

## 3.1. Computing sales per state

### Input DataFrame

In [37]:
# read CSV
df_sales = spark.read.csv('data/sales.csv',
                          header=True,       # use headers or not
                          quote='"',         # char for quotes
                          sep=",",           # char for separation
                          inferSchema=True)  # do we infer schema or not ?

df_sales.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



### Task

You want a DataFrame of states sorted by total sales amount, with the highest-earning state first.

What transformations do you need to apply?

### Code

In [38]:
df_out = df_sales

df_out.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



In [39]:
(df_out.groupBy('State')
         .max('Amount')
         .show()
)

+-----+-----------+
|State|max(Amount)|
+-----+-----------+
|   OR|      450.0|
|   CA|      330.0|
|   WA|      750.0|
+-----+-----------+



In [41]:
(df_out.groupBy('State')
       .agg(F.max('Amount').alias('uh'))
       .orderBy('uh').show()
)

+-----+-----+
|State|   uh|
+-----+-----+
|   CA|330.0|
|   OR|450.0|
|   WA|750.0|
+-----+-----+



### Solution (use your mouse to uncover)

<span style="color:white;font-family:'Courier New'"><br/>
df_out = df_sales.groupBy(df_sales.State)\<br/>
                 .agg(F.sum(df_sales.Amount).alias('Money'))\<br/>
                 .orderBy("Money", ascending=False)<br/>
<br/>
df_out.show()<br/>
</span>

## 3.2. Find the date on which AAPL's stock price was the highest

### Input DataFrame

In [31]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+-------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



### Task

Now, design a pipeline that would:

1. keep only the Date and Close fields
2. order by Close in descending order

### Code

In [32]:
df_out = df_aapl.select('Date', 'Close').orderBy('Close', ascending=False)

df_out.show(5)

+-------------------+----------+
|               Date|     Close|
+-------------------+----------+
|2015-11-03 00:00:00|    122.57|
|2015-11-04 00:00:00|     122.0|
|2015-11-02 00:00:00|    121.18|
|2015-11-06 00:00:00|121.059998|
|2015-11-05 00:00:00|120.919998|
+-------------------+----------+
only showing top 5 rows



### Solution

<span style="color:white;font-family:'Courier New'">
df_aapl.select("Date", "Close").orderBy(df_aapl.Close, ascending=False).show(5)<br/>
</span>


# 4. The SQL Interface

I know you missed it. Let's run some SQL queries on these tables!

First we register the DataFrame as a temporary view, which gives it a table name that SQL queries can refer to:

In [42]:
df_aapl.createOrReplaceTempView('aapl')

Now we can write queries with `spark.sql`, which can see any view registered this way. The result of a query is another Spark DataFrame.

In [43]:
df_sql = spark.sql("SELECT Open, Close, Close - Open as diff FROM aapl LIMIT 3")
df_sql.show()

+----------+----------+--------------------+
|      Open|     Close|                diff|
+----------+----------+--------------------+
|117.949997|    118.25|  0.3000030000000038|
|117.099998|117.650002|  0.5500040000000013|
|116.809998|116.599998|-0.20999999999999375|
+----------+----------+--------------------+



In [35]:
df_aapl.show()

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
|2016-10-18 00:00:00|    118.18|118.209999|117.449997|117.470001|24553500|117.470001|
|2016-10-17 00:00:00|117.330002|117.839996|116.779999|117.550003|23624900|117.550003|
|2016-10-14 00:00:00|117.879997|118.169998|117.129997|117.629997|35652200|117.629997|
|2016-10-13 00:00:00|116.790001|117.440002|115.720001|

In [44]:
df_sales.createOrReplaceTempView('sales')

In [45]:
df_sales.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



In [46]:
query = '''SELECT state, SUM(Amount) as total 
            FROM sales 
            GROUP BY State 
            ORDER BY total DESC'''

spark.sql(query).show()

+-----+------+
|state| total|
+-----+------+
|   WA|1050.0|
|   CA| 730.0|
|   OR| 450.0|
+-----+------+
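
The same query can be cross-checked outside Spark. The sketch below loads the six sales rows into an in-memory SQLite database using Python's standard `sqlite3` module and runs an equivalent `GROUP BY` query. The table definition and the column name `id` (instead of `#ID`) are assumptions made for this illustration; SQLite's SQL dialect is not identical to Spark SQL, but this query is valid in both.

```python
import sqlite3

rows = [
    (101, "11/13/2014", 100, "WA", 331, 300.0),
    (104, "11/18/2014", 700, "OR", 329, 450.0),
    (102, "11/15/2014", 203, "CA", 321, 200.0),
    (106, "11/19/2014", 202, "CA", 331, 330.0),
    (103, "11/17/2014", 101, "WA", 373, 750.0),
    (105, "11/19/2014", 202, "CA", 321, 200.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales "
    "(id INTEGER, date TEXT, store INTEGER, state TEXT, product INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", rows)

query = """SELECT state, SUM(amount) AS total
           FROM sales
           GROUP BY state
           ORDER BY total DESC"""

result = conn.execute(query).fetchall()
conn.close()
print(result)   # [('WA', 1050.0), ('CA', 730.0), ('OR', 450.0)]
```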

