# Spark-SQL, Spark-ML Objectives

At the end of this lecture you should be able to:

1. Chain Spark DataFrame methods together to do data munging.
2. Describe the Spark-ML API and recognize how it differs from scikit-learn.
3. Chain Spark-ML Transformers and Estimators together to compose ML pipelines.

# 3. Let's design chains of transformations together ! (reloaded)

## 3.1. Computing sales per state

### Input DataFrame

In [3]:
# read CSV
df_sales = sqlContext.read.csv('sales.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_sales.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



### Task

You want to obtain a ~~RDD~~ DataFrame of the states, sorted by total sales amount in descending order.

What transformations do you need to apply?
If you had to draw a workflow of these transformations, what would it look like?

### Code

In [4]:
df_out = df_sales

df_out.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



<span style="color:white;font-family:'Courier New'"><br/>
from pyspark.sql import functions as F<br/>
<br/>
df_out = df_sales.groupBy(df_sales.State)\<br/>
                 .agg(F.sum(df_sales.Amount).alias('Money'))\<br/>
                 .orderBy("Money", ascending=False)<br/>
<br/>
df_out.show()<br/>
</span>

## 3.2. Find the date on which AAPL's closing stock price was the highest

### Input DataFrame

In [13]:
!cp ../../spark/chris_overton/data/aapl.csv data/

In [15]:
# read CSV
df_aapl = sqlContext.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

+--------------------+----------+----------+----------+----------+--------+----------+
|                Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+--------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:...|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:...|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:...|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:...|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+--------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



### Task

Now, design a pipeline that would :

1. ~~filter out headers and last line~~
2. ~~split each line based on comma~~
3. keep only fields for Date ~~(col 0)~~ and Close ~~(col 4)~~
4. order by Close in descending order

### Code

In [16]:
df_out = df_aapl # apply transformation here...

df_out.show(5)

+--------------------+----------+----------+----------+----------+--------+----------+
|                Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+--------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:...|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:...|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:...|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:...|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+--------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



### Solution

<span style="color:white;font-family:'Courier New'">
df_out.select("Close", "Date").orderBy(df_aapl.Close, ascending=False).show(5)<br/>
</span>

# 4. Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-features.html

### What is different about the code below?

In [46]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=["Close"],
                                  outputCol="features")

df_vector = vectorAssembler.transform(df_aapl)
df_vector.select(['Open', 'High', 'Close', 'features']).show(5)

+----------+----------+----------+------------+
|      Open|      High|     Close|    features|
+----------+----------+----------+------------+
|117.949997|118.360001|    118.25|    [118.25]|
|117.099998|117.739998|117.650002|[117.650002]|
|116.809998|116.910004|116.599998|[116.599998]|
|116.860001|117.379997|117.059998|[117.059998]|
|    117.25|117.760002|117.120003|[117.120003]|
+----------+----------+----------+------------+
only showing top 5 rows



In [47]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledfeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(df_vector)

# Rescale each feature to the range [min, max] ([0, 1] by default).
scaledData = scalerModel.transform(df_vector)
scaledData.select("features", "scaledfeatures").show(5)

+------------+--------------------+
|    features|      scaledfeatures|
+------------+--------------------+
|    [118.25]| [0.865963404782699]|
|[117.650002]|[0.8473472730564975]|
|[116.599998]|[0.8147688098332226]|
|[117.059998]|[0.8290412250646944]|
|[117.120003]|[0.8309029995776607]|
+------------+--------------------+
only showing top 5 rows
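
For reference, the rescaling that `MinMaxScaler` applies to each feature value can be written as below. Here $E_{min}$ and $E_{max}$ are the column minimum and maximum learned by `fit`, and $min$ and $max$ are the scaler's target bounds (0 and 1 unless you change them):

$$\mathrm{Rescaled}(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} \cdot (max - min) + min$$

With the default bounds this reduces to $(e_i - E_{min}) / (E_{max} - E_{min})$, which is why every value in `scaledfeatures` lies between 0 and 1.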



# Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a Transformer. Important things to know about this type:

* They implement a `transform` method.
* They convert one `DataFrame` into another, usually by adding columns.

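Examples of Transformers: `VectorAssembler`, `Tokenizer`, `StopWordsRemover`, `HashingTF`, and many more. (Note that scalers such as `StandardScaler` and `MinMaxScaler` are Estimators: they must be fit before they can transform.)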



# Estimators

According to the docs: An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Important things to know about this type:

* They implement a `fit` method whose argument is a `DataFrame`.
* The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of Estimators: `LogisticRegression`, `DecisionTreeRegressor`, and many more
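
To make the `fit`/`transform` contract concrete, here is a toy sketch in plain Python (no Spark required). The classes `ToyMinMaxScaler` and `ToyMinMaxScalerModel` are hypothetical names invented for this illustration, and a list of dicts stands in for a `DataFrame`; this is not how Spark implements its scaler.

```python
class ToyMinMaxScalerModel:
    """Plays the role of a fitted Model: it is a Transformer."""

    def __init__(self, lo, hi):
        self.lo = lo
        self.hi = hi

    def transform(self, rows):
        span = self.hi - self.lo
        # Return new rows with an added "scaled" column; the input is untouched.
        # A constant column (span == 0) is mapped to 0.5 in this sketch.
        return [
            dict(row, scaled=(row["x"] - self.lo) / span if span else 0.5)
            for row in rows
        ]


class ToyMinMaxScaler:
    """Plays the role of an Estimator: fit() learns from data, returns a Model."""

    def fit(self, rows):
        values = [row["x"] for row in rows]
        return ToyMinMaxScalerModel(min(values), max(values))


rows = [{"x": 2.0}, {"x": 4.0}, {"x": 6.0}]
model = ToyMinMaxScaler().fit(rows)   # Estimator -> Model
scaled = model.transform(rows)        # Model is a Transformer
print(scaled)
```

The point to notice is the split of responsibilities: the Estimator holds only configuration, while the Model holds what was learned from the data (here, the minimum and maximum).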


# Pipelines

Many Data Science workflows can be described as the sequential application of various `Transformers` and `Estimators`.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

In [48]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0)
], ["id", "text", "label"])

In [49]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

tokens = tokenizer.transform(training)
hashes = hashingTF.transform(tokens)
logistic_model = lr.fit(hashes) #Uses columns named features/label by default

In [50]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# What do we need to do to this to get a prediction?

## Alternatively

In [51]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

In [52]:
# How can we apply the fitted pipeline to our test data?
prediction = model.transform(test)
prediction.select(['features', 'prediction', 'probability']).show()

+--------------------+----------+--------------------+
|            features|prediction|         probability|
+--------------------+----------+--------------------+
|(262144,[20197,24...|       1.0|[0.15964077387874...|
|(262144,[18910,10...|       0.0|[0.83783256854766...|
|(262144,[155117,2...|       1.0|[0.06926633132976...|
|(262144,[66695,15...|       0.0|[0.98215753334442...|
+--------------------+----------+--------------------+
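
Conceptually, fitting a pipeline walks the stages in order: a Transformer is applied as it is, while an Estimator is first fit on the data seen so far and its resulting Model is then applied. The plain-Python sketch below mimics that logic with toy classes; every class name here is hypothetical and lists of strings stand in for DataFrames, so treat it as an illustration of the idea rather than Spark's implementation.

```python
class Lowercase:
    """Toy Transformer: has transform() only."""

    def transform(self, docs):
        return [d.lower() for d in docs]


class VocabularyModel:
    """Toy fitted Model: maps each document to counts over a learned vocabulary."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab] for d in docs]


class Vocabulary:
    """Toy Estimator: fit() learns the vocabulary and returns a Model."""

    def fit(self, docs):
        return VocabularyModel(sorted({w for d in docs for w in d.split()}))


class ToyPipelineModel:
    """Result of fitting: a chain made only of Transformers."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, data):
        for step in self.steps:
            data = step.transform(data)
        return data


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        fitted = []
        for stage in self.stages:
            # Estimators are fit first; plain Transformers are used unchanged.
            step = stage.fit(data) if hasattr(stage, "fit") else stage
            data = step.transform(data)
            fitted.append(step)
        return ToyPipelineModel(fitted)


model = ToyPipeline([Lowercase(), Vocabulary()]).fit(["Spark b", "b d"])
features = model.transform(["spark d d", "hadoop"])
print(features)
```

Because the fitted pipeline stores the learned state of every stage, the same preprocessing is replayed on new data with a single `transform` call, which is what makes the pipeline version shorter than applying each stage by hand.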



### Ok, so can you describe when you should chain DataFrame transformations together and when you should chain ML Transformers/Estimators together?