# Spark-ML

Machine learning at scale!

# Objectives

At the end of this lecture you should be able to:

1. Chain Spark DataFrame methods together to do data munging.
2. Describe the Spark-ML API and recognize how it differs from scikit-learn.
3. Chain Spark-ML Transformers and Estimators together to compose ML pipelines.

In [1]:
import pyspark.sql.functions as F
import sys
sys.tracebacklimit = 0

# Working with Spark DataFrames

## Examples of the many similarities with Pandas

## Computing sales per state

In [2]:
# Read CSV
sales_df = sqlContext.read.csv('data/sales.csv',
                               header=True,       # parse the first line as a header?
                               quote='"',         # quote character
                               sep=",",           # separation character
                               inferSchema=True)  # infer schema?

sales_df.show(5)

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
+---+----------+-----+-----+-------+------+
only showing top 5 rows



### Pair Discussion - 2 mins

You want to obtain a sorted DataFrame of the states in which you have the most sales (by amount).

1. What transformations do you need to apply?
2. How would you perform those transformations in Pandas?

In [3]:
sales_df.show(1)

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
+---+----------+-----+-----+-------+------+
only showing top 1 row



In [None]:
out_df = (sales_df.<insert_code_here>)

out_df.show()

In [4]:
# Solution
out_df = (sales_df.groupBy(sales_df.State)
            # or           'State'
                  .agg(F.sum(sales_df.Amount).alias('Money'))
            # or       {'Amount': 'sum'} 
            # or  .sum('Amount')
                  .orderBy('Money', ascending=False))
 
out_df.show()

+-----+------+
|State| Money|
+-----+------+
|   WA|1050.0|
|   CA| 730.0|
|   OR| 450.0|
+-----+------+
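
To make the group / aggregate / sort steps concrete, here is a minimal plain-Python sketch of the same computation. It is not how Spark executes the query. It only uses the five rows printed by `sales_df.show(5)`, so its totals need not match the Spark output above, which was computed on the whole file.

```python
from collections import defaultdict

# The five rows printed by sales_df.show(5), as (State, Amount) pairs.
# The full CSV may hold more rows, so these totals can differ from Spark's.
rows = [("WA", 300.0), ("OR", 450.0), ("CA", 200.0), ("CA", 330.0), ("WA", 750.0)]

# groupBy('State') + sum('Amount'): keep one running total per state.
totals = defaultdict(float)
for state, amount in rows:
    totals[state] += amount

# orderBy('Money', ascending=False): sort the (state, total) pairs by total.
ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

for state, money in ranked:
    print(state, money)
```

The shape of the solution is the same in Spark and Pandas: group by a key, aggregate a column, then sort by the aggregate.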



## Find the date on which AAPL's closing stock price was the highest

In [5]:
# Read CSV
aapl_df = sqlContext.read.csv('data/aapl.csv',
                              header=True,       # parse the first line as a header?
                              quote='"',         # quote character
                              sep=",",           # separation character
                              inferSchema=True)  # infer schema?

aapl_df.select('Date', 'Open', 'High').show(5)

+--------------------+----------+----------+
|                Date|      Open|      High|
+--------------------+----------+----------+
|2016-10-25 00:00:...|117.949997|118.360001|
|2016-10-24 00:00:...|117.099998|117.739998|
|2016-10-21 00:00:...|116.809998|116.910004|
|2016-10-20 00:00:...|116.860001|117.379997|
|2016-10-19 00:00:...|    117.25|117.760002|
+--------------------+----------+----------+
only showing top 5 rows



### We can see that a sensible schema has been inferred

In [6]:
print(aapl_df.columns)
aapl_df.schema

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']


StructType(List(StructField(Date,TimestampType,true),StructField(Open,DoubleType,true),StructField(High,DoubleType,true),StructField(Low,DoubleType,true),StructField(Close,DoubleType,true),StructField(Volume,IntegerType,true),StructField(Adj Close,DoubleType,true)))

### Pair Discussion - 2 Mins

Now, design a pipeline that will:

1. Keep only fields for `Date` and `Close`.
2. Order by `Close` in descending order. How would you do this in Pandas?

In [7]:
aapl_df.select('Date', 'Open', 'High', 'Low', 'Close', 'Volume').show(1)

+--------------------+----------+----------+----------+------+--------+
|                Date|      Open|      High|       Low| Close|  Volume|
+--------------------+----------+----------+----------+------+--------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|118.25|39190300|
+--------------------+----------+----------+----------+------+--------+
only showing top 1 row



In [None]:
out_df = (aapl_df.<insert_code_here>)

out_df.show(5)

In [8]:
# Solution
out_df = (aapl_df.select('Date', 'Close')
                 .orderBy('Close', ascending=False))

out_df.show(5)

+--------------------+----------+
|                Date|     Close|
+--------------------+----------+
|2015-11-03 00:00:...|    122.57|
|2015-11-04 00:00:...|     122.0|
|2015-11-02 00:00:...|    121.18|
|2015-11-06 00:00:...|121.059998|
|2015-11-05 00:00:...|120.919998|
+--------------------+----------+
only showing top 5 rows



# Supervised Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-features.html

In [9]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=["Close"], 
                                  outputCol="Features")

vector_df = vectorAssembler.transform(aapl_df)
print(aapl_df.columns)
print(vector_df.columns)
vector_df.select('Features').show(5)

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'Features']
+------------+
|    Features|
+------------+
|    [118.25]|
|[117.650002]|
|[116.599998]|
|[117.059998]|
|[117.120003]|
+------------+
only showing top 5 rows



## Why does this difference matter?

Let's try to run Spark-ML's built-in [min-max scaler](https://spark.apache.org/docs/latest/ml-features.html#minmaxscaler) on the `Close` column of our `aapl_df`.

In [None]:
# This cell fails on purpose: `Close` is a plain double column, not a vector column
fail_scaler = MinMaxScaler(inputCol="Close", outputCol="Scaled Close")

fail_scaler_model = fail_scaler.fit(aapl_df)

scaled_close = fail_scaler_model.transform(aapl_df)
scaled_close.show(5)

### The input column has to be a vector...

In [10]:
scaler = MinMaxScaler(inputCol="Features", outputCol="Scaled Features")

# Compute summary statistics and generate MinMaxScalerModel
scaler_model = scaler.fit(vector_df)
# Note that fit() takes no column argument: the columns were set in the constructor above

# Rescale each feature to range [min, max], returns a dataframe
scaled_data = scaler_model.transform(vector_df)
scaled_data.select("Features", "Scaled Features").show(5)

+------------+--------------------+
|    Features|     Scaled Features|
+------------+--------------------+
|    [118.25]| [0.865963404782699]|
|[117.650002]|[0.8473472730564975]|
|[116.599998]|[0.8147688098332226]|
|[117.059998]|[0.8290412250646944]|
|[117.120003]|[0.8309029995776607]|
+------------+--------------------+
only showing top 5 rows
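
For intuition, the rescaling can be sketched in plain Python. This follows the formula given in the Spark docs for `MinMaxScaler`, `(x - E_min) / (E_max - E_min) * (max - min) + min`, applied to one column. The function name `min_max_scale` and the input values are made up for illustration.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to the middle of the range.
        return [0.5 * (new_min + new_max) for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


closes = [90.0, 100.0, 110.0, 130.0]  # made-up prices, not the AAPL data
scaled = min_max_scale(closes)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The `fit` step corresponds to finding `lo` and `hi`; the `transform` step corresponds to applying the formula to every row.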



# Transformers and Estimators

# Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a [Transformer](http://spark.apache.org/docs/latest/ml-pipeline.html#transformers). Important things to know about this type:

* They implement a `transform` method.
* They convert one `DataFrame` into another, usually by adding columns.

Examples of Transformers: [`VectorAssembler`](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler), [`Tokenizer`](http://spark.apache.org/docs/latest/ml-features.html#tokenizer), [`StopWordsRemover`](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover), and [many more](http://spark.apache.org/docs/latest/ml-features.html).



# Estimators

According to the docs: "An [Estimator](http://spark.apache.org/docs/latest/ml-pipeline.html#estimators) abstracts the concept of a learning algorithm or any algorithm that fits or trains on data". Important things to know about this type:

* They implement a `fit` method whose argument is a `DataFrame`.
* The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of Estimators: [`LogisticRegression`](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression), [`DecisionTreeRegressor`](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression), and [many more](http://spark.apache.org/docs/latest/ml-classification-regression.html).
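
The Estimator / Model relationship can be illustrated with a toy sketch in plain Python. This is not Spark code: `MeanCenterer` and `MeanCentererModel` are hypothetical classes, and a list of dicts stands in for a `DataFrame`.

```python
class MeanCenterer:
    """Toy Estimator: fit() learns the mean of one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(r[self.input_col] for r in rows) / len(rows)
        # fit() returns a *new* object that carries the learned state.
        return MeanCentererModel(self.input_col, self.output_col, mean)


class MeanCentererModel:
    """Toy Model, which is a Transformer: transform() applies the learned mean."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with an extra column; the input rows are left untouched.
        return [{**r, self.output_col: r[self.input_col] - self.mean} for r in rows]


rows = [{"x": 1.0}, {"x": 2.0}, {"x": 6.0}]
model = MeanCenterer("x", "x_centered").fit(rows)
centered = model.transform(rows)
print(centered)
```

Splitting the learned state into a separate model object means the same fitted model can later be applied to data it was never fitted on, such as a test set.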


In [11]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RegexTokenizer, HashingTF

# Prepare training documents from a list of (id, text, label (about big data?)) tuples.
train_df = spark.createDataFrame([
             (0, "spark is like hadoop mapreduce", 1.0),
             (1, "sparks light fire!!!", 0.0),
             (2, "elephants like simba", 0.0),
             (3, "hadoop is an elephant", 1.0),
             (4, "hadoop mapreduce", 1.0)], 
             ["id", "text", "label"])

In [12]:
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W") # Transformer
hashingTF = HashingTF(inputCol="tokens", outputCol="features") # Transformer
lr = LogisticRegression(maxIter=10, regParam=0.001) # Estimator

tokens_df = regexTokenizer.transform(train_df)
hashes_df = hashingTF.transform(tokens_df)
logistic_model = lr.fit(hashes_df) # Uses columns named features/label by default
# logistic_model is now a transformer
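
To see roughly what the two feature Transformers do, here is a plain-Python sketch of tokenizing on non-word characters followed by the hashing trick. It rests on assumptions: Spark's `HashingTF` uses MurmurHash3 and defaults to 2^18 features, whereas this sketch uses `zlib.crc32` and 16 buckets, so the indices will not match Spark's. The helper names `tokenize` and `hashing_tf` are hypothetical.

```python
import re
import zlib


def tokenize(text):
    # Split on non-word characters, lowercase, and drop empty strings,
    # mirroring RegexTokenizer(pattern="\\W") with its default settings.
    return [t for t in re.split(r"\W", text.lower()) if t]


def hashing_tf(tokens, num_features=16):
    # Map each token to a bucket index via a hash, then count hits per bucket.
    # No vocabulary is stored, so distinct tokens can collide in one bucket.
    counts = {}
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] = counts.get(index, 0) + 1
    return counts


tokens = tokenize("sparks light fire!!!")
vector = hashing_tf(tokens)
print(tokens)
print(vector)  # sparse representation: {bucket index: term count}
```

Because neither step learns anything from the data, both are Transformers rather than Estimators.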

### Now we can predict on new text

In [13]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test_df = spark.createDataFrame([
            (5, "simba has a spark"),
            (6, "hadoop"),
            (7, "mapreduce in spark"),
            (8, "apache hadoop")], 
            ["id", "text"])

### What do we need to do to this to get a prediction?

In [14]:
preds_df = logistic_model.transform(
                hashingTF.transform(
           regexTokenizer.transform(test_df)))

preds_df.select('text', 'prediction', 'probability').show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78779795057740...|
|            hadoop|       1.0|[0.02996000405249...|
|mapreduce in spark|       1.0|[0.02396543994089...|
|     apache hadoop|       1.0|[0.02996000405249...|
+------------------+----------+--------------------+



# Pipelines

Many data science workflows can be described as the sequential application of various `Transformers` and `Estimators`. In `sklearn` we know that this idea has been formalized into a class, [`Pipeline`](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

This abstraction proved so popular that Spark ML adopted the same concept.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

We already ran this flow by hand above, one stage at a time. Now let's implement it with a `Pipeline`!

## Configure an ML pipeline, which consists of three stages:
* tokenizer,
* hashingTF, and
* logistic regression

In [15]:
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
hashingTF = HashingTF(inputCol="tokens", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[regexTokenizer, hashingTF, lr])

# Fit the pipeline to training documents
model = pipeline.fit(train_df)

### And now `model` is a transformer representing the entire pipeline. So it's simple to run our `test_df` through.

In [16]:
prediction = model.transform(test_df)
prediction.select(['text', 'prediction', 'probability']).show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78779795057740...|
|            hadoop|       1.0|[0.02996000405249...|
|mapreduce in spark|       1.0|[0.02396543994089...|
|     apache hadoop|       1.0|[0.02996000405249...|
+------------------+----------+--------------------+
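
Conceptually, fitting a pipeline walks through the stages in order: Estimators are fitted and replaced by their models, and each stage's output feeds the next one. The sketch below shows that idea in plain Python. It is a simplification, not Spark's implementation, and every class in it (`ToyPipeline`, `AddLength`, `MaxScaler`, and so on) is hypothetical, with a list of dicts standing in for a `DataFrame`.

```python
class AddLength:
    """Toy Transformer: adds the number of whitespace-separated words."""

    def transform(self, rows):
        return [{**r, "length": len(r["text"].split())} for r in rows]


class MaxScaler:
    """Toy Estimator: learns the largest length seen during fit()."""

    def fit(self, rows):
        return MaxScalerModel(max(r["length"] for r in rows))


class MaxScalerModel:
    """Toy Model: divides each length by the learned maximum."""

    def __init__(self, max_length):
        self.max_length = max_length

    def transform(self, rows):
        return [{**r, "scaled": r["length"] / self.max_length} for r in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):     # Estimator: fit it and keep the model
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)  # output becomes the next stage's input
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    """The fitted pipeline is itself a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train = [{"text": "spark is like hadoop mapreduce"}, {"text": "hadoop mapreduce"}]
test = [{"text": "apache hadoop"}]

toy_model = ToyPipeline([AddLength(), MaxScaler()]).fit(train)
result = toy_model.transform(test)
print(result)
```

Note that the maximum is learned from `train` only; `test` is merely transformed with it, which is the same separation that `pipeline.fit(train_df)` followed by `model.transform(test_df)` gives you.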



# Unsupervised Machine Learning on DataFrames

We can similarly do unsupervised learning in Spark ML. And we can still use pipelines!

In [17]:
# Read CSV
iris_df = sqlContext.read.csv('data/iris.csv',
                              header=True,       # parse the first line as a header?
                              quote='"',         # quote character
                              sep=",",           # separation character
                              inferSchema=True)  # infer schema?

iris_df.show(5)

+-----------------+----------------+-----------------+----------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
+-----------------+----------------+-----------------+----------------+
|              5.1|             3.5|              1.4|             0.2|
|              4.9|             3.0|              1.4|             0.2|
|              4.7|             3.2|              1.3|             0.2|
|              4.6|             3.1|              1.5|             0.2|
|              5.0|             3.6|              1.4|             0.2|
+-----------------+----------------+-----------------+----------------+
only showing top 5 rows



In [18]:
from pyspark.ml.feature import PCA

col_names = ["sepal length (cm)", "sepal width (cm)", 
             "petal length (cm)", "petal width (cm)"]
pcaVectorAssembler = VectorAssembler(inputCols=col_names, outputCol="features")
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pipeline = Pipeline(stages=[pcaVectorAssembler, pca])

fitted_pipeline = pipeline.fit(iris_df)

fitted_pipeline.transform(iris_df).select("pcaFeatures").show(5)

+--------------------+
|         pcaFeatures|
+--------------------+
|[-2.8271359726790...|
|[-2.7959524821488...|
|[-2.6215235581650...|
|[-2.7649059004742...|
|[-2.7827501159516...|
+--------------------+
only showing top 5 rows



## Inspecting Pipelines

We can even access the individual fitted stages of a pipeline if there's an attribute on them that we're interested in.

In [19]:
fitted_pipeline.stages[-1].explainedVariance

DenseVector([0.9246, 0.053, 0.0172])
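
A quick way to read these numbers is to accumulate them, which shows how much of the total variance the first one, two and three components retain. The values below are copied from the printed output above, so they are rounded to four decimals.

```python
from itertools import accumulate

# Explained variance ratios as printed above (rounded by the DenseVector repr).
explained = [0.9246, 0.053, 0.0172]

# Running total: variance retained by the first 1, 2 and 3 components.
cumulative = [round(c, 4) for c in accumulate(explained)]
print(cumulative)  # [0.9246, 0.9776, 0.9948]
```

On these figures, the first component alone retains about 92% of the variance and the first two together almost 98%.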