# Spark ML

## Morning Objectives

At the end of this lecture you should be able to:

1. Describe the Spark ML API and recognize how it differs from sklearn
1. Chain Spark `DataFrame` methods together to do data munging
1. Chain Spark ML `Transformers` and `Estimators` together to compose ML `Pipeline`s

In [1]:
import pyspark.sql.functions as F

from pyspark.ml.feature import VectorAssembler, MinMaxScaler, RegexTokenizer, HashingTF, PCA
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

## A. Let's design chains of transformations together!

### A.1. Computing sales per state

#### Input

In [2]:
sales_df = sqlContext.read.csv('data/sales.csv',
    header=True,      # use headers
    quote='"',        # use " for quoting
    sep=',',          # use , for separating fields
    inferSchema=True) # infer schema

In [3]:
sales_df.show()

+---+----------+-----+-----+-------+------+
|#ID|      Date|Store|State|Product|Amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



In [4]:
sales_df.schema

StructType(List(StructField(#ID,IntegerType,true),StructField(Date,StringType,true),StructField(Store,IntegerType,true),StructField(State,StringType,true),StructField(Product,IntegerType,true),StructField(Amount,DoubleType,true)))

#### Task

You want a `DataFrame` of the states ranked by total sales (`Amount`), sorted in decreasing order of sales.

1. What transformations do you need to apply?
2. How would you draw the workflow of those transformations?

#### Code

In [5]:
(sales_df.groupBy(sales_df.State)
    .agg(F.sum(sales_df.Amount).alias('Money'))
    .orderBy('Money', ascending=False)
    .show())

+-----+------+
|State| Money|
+-----+------+
|   WA|1050.0|
|   CA| 730.0|
|   OR| 450.0|
+-----+------+
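
As a sanity check on the result above, the same group, sum, and sort logic can be written in plain Python over the six rows shown in the input. This is only an illustration of what the chain computes, not how Spark executes it; the `rows` list is copied by hand from the `sales_df.show()` output.

```python
from collections import defaultdict

# (State, Amount) pairs copied from the sales_df.show() output above
rows = [('WA', 300.0), ('OR', 450.0), ('CA', 200.0),
        ('CA', 330.0), ('WA', 750.0), ('CA', 200.0)]

# groupBy(State) + sum(Amount)
totals = defaultdict(float)
for state, amount in rows:
    totals[state] += amount

# orderBy('Money', ascending=False)
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```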



### A.2. Find the date on which AAPL's closing stock price was the highest

#### Input

In [6]:
aapl_df = sqlContext.read.csv('data/aapl.csv',
    header=True,
    quote='"',
    sep=',',
    inferSchema=True)

In [7]:
aapl_df.show(5)

+--------------------+----------+----------+----------+----------+--------+----------+
|                Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+--------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:...|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:...|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:...|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:...|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+--------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



#### Task

Now, design a pipeline that will:

1. Keep only fields for Date and Close
2. Order by Close in descending order

#### Code

In [8]:
(aapl_df.select('Date', 'Close')
    .orderBy('Close', ascending=False)
    .show(5))

+--------------------+----------+
|                Date|     Close|
+--------------------+----------+
|2015-11-03 00:00:...|    122.57|
|2015-11-04 00:00:...|     122.0|
|2015-11-02 00:00:...|    121.18|
|2015-11-06 00:00:...|121.059998|
|2015-11-05 00:00:...|120.919998|
+--------------------+----------+
only showing top 5 rows



## B. Supervised Machine Learning on DataFrames

- [Extracting, transforming and selecting features](http://spark.apache.org/docs/latest/ml-features.html)

### Question: What is the difference between `aapl_df` and `vector_df` after running the code below?

In [9]:
assembler = VectorAssembler(inputCols=['Close'], outputCol='features')

vector_df = assembler.transform(aapl_df)

In [10]:
print(type(aapl_df))
print()
print(aapl_df.schema)
print()
aapl_df.show(5)

<class 'pyspark.sql.dataframe.DataFrame'>

StructType(List(StructField(Date,TimestampType,true),StructField(Open,DoubleType,true),StructField(High,DoubleType,true),StructField(Low,DoubleType,true),StructField(Close,DoubleType,true),StructField(Volume,IntegerType,true),StructField(Adj Close,DoubleType,true)))

+--------------------+----------+----------+----------+----------+--------+----------+
|                Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+--------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:...|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:...|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:...|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
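|2016-10-19 00:00:...|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+--------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows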

In [11]:
print(type(vector_df))
print()
print(vector_df.schema)
print()
vector_df.show(5)

<class 'pyspark.sql.dataframe.DataFrame'>

StructType(List(StructField(Date,TimestampType,true),StructField(Open,DoubleType,true),StructField(High,DoubleType,true),StructField(Low,DoubleType,true),StructField(Close,DoubleType,true),StructField(Volume,IntegerType,true),StructField(Adj Close,DoubleType,true),StructField(features,VectorUDT,true)))

+--------------------+----------+----------+----------+----------+--------+----------+------------+
|                Date|      Open|      High|       Low|     Close|  Volume| Adj Close|    features|
+--------------------+----------+----------+----------+----------+--------+----------+------------+
|2016-10-25 00:00:...|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|    [118.25]|
|2016-10-24 00:00:...|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|[117.650002]|
|2016-10-21 00:00:...|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|[116.599998]|
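|2016-10-20 00:00:...|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|[117.059998]|
|2016-10-19 00:00:...|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|[117.120003]|
+--------------------+----------+----------+----------+----------+--------+----------+------------+
only showing top 5 rows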

### Follow-up: Why does this difference matter?

Let's try to run one of Spark ML's built-in transformers on some of our data.  Let's min-max scale the `Close` column.

In [12]:
scaler = MinMaxScaler(inputCol='Close', outputCol='Scaled Close').fit(aapl_df)

scaled_close_df = scaler.transform(aapl_df)

scaled_close_df.show(5)

IllegalArgumentException: u'requirement failed: Column Close must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually DoubleType.'

In [13]:
print(aapl_df.schema['Close'])
print(vector_df.schema['features'])

StructField(Close,DoubleType,true)
StructField(features,VectorUDT,true)


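Takeaway: `MinMaxScaler` needs its input column to be a vector (`VectorUDT`), not a plain `DoubleType`, so we have to assemble `Close` into a vector column with `VectorAssembler` first.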
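
Conceptually, `VectorAssembler` packs the selected numeric columns of each row into a single vector-valued column. Here is a rough plain-Python picture of that idea, using a dict for a row and a list in place of Spark's vector type. Both are stand-ins, and `assemble` is a hypothetical helper, not part of the Spark API.

```python
def assemble(row, input_cols, output_col):
    """Return a copy of `row` with `input_cols` packed into one list under `output_col`."""
    new_row = dict(row)  # leave the original row untouched, like a DataFrame transform
    new_row[output_col] = [float(row[c]) for c in input_cols]
    return new_row

# one row with values taken from the aapl_df output above (Date shortened)
row = {'Date': '2016-10-25', 'Close': 118.25, 'Volume': 39190300}
vector_row = assemble(row, ['Close'], 'features')
print(vector_row['features'])
```

The scalar `Close` value and the one-element `features` list hold the same number, but only the second has the shape that the scaler accepts.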

In [14]:
scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features').fit(vector_df)

scaled_features_df = scaler.transform(vector_df)

scaled_features_df.select('features', 'scaled_features').show(5)

+------------+--------------------+
|    features|     scaled_features|
+------------+--------------------+
|    [118.25]| [0.865963404782699]|
|[117.650002]|[0.8473472730564975]|
|[116.599998]|[0.8147688098332226]|
|[117.059998]|[0.8290412250646944]|
|[117.120003]|[0.8309029995776607]|
+------------+--------------------+
only showing top 5 rows
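
Min-max scaling maps each value `x` to `(x - min) / (max - min)`, where `min` and `max` are learned from the whole column during `fit`. The sketch below shows that arithmetic in plain Python for the default output range of [0, 1]. The input values are toy numbers rather than the AAPL data, `fit_min_max` is a hypothetical helper, and returning 0.5 for a constant column is a choice made for this sketch.

```python
def fit_min_max(values):
    """Learn min and max from `values` and return a function that scales into [0, 1]."""
    lo, hi = min(values), max(values)

    def transform(x):
        if hi == lo:
            return 0.5  # constant column: this sketch returns the midpoint
        return (x - lo) / (hi - lo)

    return transform

toy_values = [90.0, 100.0, 120.0]   # toy values, not the AAPL data
scale = fit_min_max(toy_values)     # "fit" step: learn min and max
scaled = [scale(x) for x in toy_values]  # "transform" step
print(scaled)
```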



## Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a [Transformer](http://spark.apache.org/docs/latest/ml-pipeline.html#transformers).  Important things to know about this type:

- They implement a `transform` method
- They convert one `DataFrame` into another, usually by adding columns

Examples of transformers: [`VectorAssembler`](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler), [`Tokenizer`](http://spark.apache.org/docs/latest/ml-features.html#tokenizer), [`StopWordsRemover`](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover), and [many more](http://spark.apache.org/docs/latest/ml-features.html)

## Estimators

According to the documentation: "An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data".  Important things to know about them:

- They implement a `fit` method whose argument is a `DataFrame`
- The output of `fit` is another type called `Model`, which is a `Transformer`

Examples of estimators: [`LogisticRegression`](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression), [`DecisionTreeRegressor`](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression), and [many more](http://spark.apache.org/docs/latest/ml-classification-regression.html)
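
The `fit` and `transform` contract can be pictured with two small plain-Python classes. This is a sketch of the pattern only: `MeanCenterer` and `MeanCentererModel` are hypothetical names, and they operate on lists of numbers rather than on Spark `DataFrame`s.

```python
class MeanCentererModel(object):
    """Plays the role of a fitted Model: it only knows how to transform."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        return [v - self.mean for v in values]


class MeanCenterer(object):
    """Plays the role of an Estimator: fit() learns from data and returns a Model."""

    def fit(self, values):
        return MeanCentererModel(sum(values) / float(len(values)))


model = MeanCenterer().fit([1.0, 2.0, 3.0])  # learns mean = 2.0
centered = model.transform([2.0, 5.0])       # applies the learned mean to new data
print(centered)
```

The point to notice is that the learned state (here the mean) lives in the object returned by `fit`, which is why that object can be reused on new data.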

## Pipelines

Many data science workflows can be described as the sequential application of various `Transformers` and `Estimators`.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

In [15]:
# prepare training set from a list of (id, text, label) tuples

training_df = spark.createDataFrame([(0, 'spark is like hadoop mapreduce', 1.0),
        (1, 'sparks light fire!!!', 0.0),
        (2, 'elephants like simba', 0.0),
        (3, 'hadoop is an elephant', 1.0),
        (4, 'hadoop mapreduce', 1.0)],
    ['id', 'text', 'label'])

In [16]:
tokenizer = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W')
tokens_df = tokenizer.transform(training_df)

In [17]:
tokens_df.show(5)

+---+--------------------+-----+--------------------+
| id|                text|label|              tokens|
+---+--------------------+-----+--------------------+
|  0|spark is like had...|  1.0|[spark, is, like,...|
|  1|sparks light fire!!!|  0.0|[sparks, light, f...|
|  2|elephants like simba|  0.0|[elephants, like,...|
|  3|hadoop is an elep...|  1.0|[hadoop, is, an, ...|
|  4|    hadoop mapreduce|  1.0| [hadoop, mapreduce]|
+---+--------------------+-----+--------------------+



In [18]:
tf = HashingTF(inputCol='tokens', outputCol='features')
tf_df = tf.transform(tokens_df)

In [19]:
tf_df.show(5)

+---+--------------------+-----+--------------------+--------------------+
| id|                text|label|              tokens|            features|
+---+--------------------+-----+--------------------+--------------------+
|  0|spark is like had...|  1.0|[spark, is, like,...|(262144,[15889,42...|
|  1|sparks light fire!!!|  0.0|[sparks, light, f...|(262144,[34036,91...|
|  2|elephants like simba|  0.0|[elephants, like,...|(262144,[23518,54...|
|  3|hadoop is an elep...|  1.0|[hadoop, is, an, ...|(262144,[15889,15...|
|  4|    hadoop mapreduce|  1.0| [hadoop, mapreduce]|(262144,[42633,15...|
+---+--------------------+-----+--------------------+--------------------+
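
The `features` column above is a sparse vector of size 262144 whose non-zero entries are term counts stored at hashed positions. The sketch below shows the general hashing-trick idea in plain Python. It assumes `zlib.crc32` as the hash function purely for illustration, so its indices will not match the ones Spark produces, and `hashing_tf` is a hypothetical helper.

```python
import zlib
from collections import Counter

def hashing_tf(tokens, num_features=262144):
    """Map each token to a bucket index by hashing, and count tokens per bucket."""
    counts = Counter()
    for token in tokens:
        # stand-in hash function; Spark's HashingTF uses its own hash
        index = zlib.crc32(token.encode('utf-8')) % num_features
        counts[index] += 1
    return dict(counts)

tokens = ['hadoop', 'is', 'an', 'elephant', 'hadoop']
vector = hashing_tf(tokens)
print(vector)
```

No vocabulary has to be built or stored, at the cost that two different tokens can occasionally land in the same bucket.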



In [20]:
model = LogisticRegression(maxIter=10, regParam=.001).fit(tf_df)
# (uses columns named features/label by default)

In [21]:
# prepare the test set from a list of unlabeled (id, text) tuples

test_df = spark.createDataFrame([(5, 'simba has a spark'),
        (6, 'hadoop'),
        (7, 'mapreduce in spark'),
        (8, 'apache hadoop')],
    ['id', 'text'])

# What do we need to do to get predictions on our test set?

predictions_df = model.transform(tf.transform(tokenizer.transform(test_df)))
predictions_df.select('text', 'prediction', 'probability').show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78779795057740...|
|            hadoop|       1.0|[0.02996000405249...|
|mapreduce in spark|       1.0|[0.02396543994089...|
|     apache hadoop|       1.0|[0.02996000405249...|
+------------------+----------+--------------------+



Alternatively, configure an ML pipeline, which consists of three stages: tokenizer, tf, and lr.

In [22]:
tokenizer = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W')
tf = HashingTF(inputCol='tokens', outputCol='features')
lr = LogisticRegression(maxIter=10, regParam=.001)

pipeline = Pipeline(stages=[tokenizer, tf, lr])

model = pipeline.fit(training_df)

In [23]:
# What do we need to do to get predictions on our test set?

predictions_df = model.transform(test_df)
predictions_df.select('text', 'prediction', 'probability').show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78779795057740...|
|            hadoop|       1.0|[0.02996000405249...|
|mapreduce in spark|       1.0|[0.02396543994089...|
|     apache hadoop|       1.0|[0.02996000405249...|
+------------------+----------+--------------------+
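
Both approaches give the same predictions because a pipeline simply automates the manual chaining: during `fit`, each estimator stage is fitted on the output of the stages before it, and the fitted pipeline then applies every stage's `transform` in order. The plain-Python sketch below illustrates that control flow under simplifying assumptions. All class names are hypothetical, the data are lists of numbers, and this is not Spark's implementation.

```python
class AddOne(object):
    """Toy transformer: has transform() only."""
    def transform(self, data):
        return [x + 1.0 for x in data]

class DivideByMaxModel(object):
    """Toy fitted model: remembers the max seen during fit()."""
    def __init__(self, max_value):
        self.max_value = max_value
    def transform(self, data):
        return [x / self.max_value for x in data]

class DivideByMax(object):
    """Toy estimator: fit() returns a model."""
    def fit(self, data):
        return DivideByMaxModel(max(data))

class ToyPipelineModel(object):
    def __init__(self, stages):
        self.stages = stages
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

class ToyPipeline(object):
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, 'fit'):
                stage = stage.fit(data)   # estimator -> fitted model
            fitted.append(stage)
            data = stage.transform(data)  # feed the next stage
        return ToyPipelineModel(fitted)

toy_model = ToyPipeline([AddOne(), DivideByMax()]).fit([1.0, 3.0])
result = toy_model.transform([0.0, 1.0])
print(result)
```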



## C. Unsupervised Machine Learning on DataFrames

In [24]:
# read csv
iris_df = sqlContext.read.csv('data/iris.csv',
    header=True,
    quote='"',
    sep=',',
    inferSchema=True)

In [25]:
iris_df.show()

+-----------------+----------------+-----------------+----------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
+-----------------+----------------+-----------------+----------------+
|              5.1|             3.5|              1.4|             0.2|
|              4.9|             3.0|              1.4|             0.2|
|              4.7|             3.2|              1.3|             0.2|
|              4.6|             3.1|              1.5|             0.2|
|              5.0|             3.6|              1.4|             0.2|
|              5.4|             3.9|              1.7|             0.4|
|              4.6|             3.4|              1.4|             0.3|
|              5.0|             3.4|              1.5|             0.2|
|              4.4|             2.9|              1.4|             0.2|
|              4.9|             3.1|              1.5|             0.1|
|              5.4|             3.7|              1.5|          

In [26]:
col_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

# fit() returns a fitted PipelineModel, so `pipeline` below refers to the fitted model
pipeline = (Pipeline(stages=[
        VectorAssembler(inputCols=col_names, outputCol='features'),
        PCA(k=3, inputCol='features', outputCol='pca_features')
    ]).fit(iris_df))

pipeline.transform(iris_df).select('pca_features').show()

+--------------------+
|        pca_features|
+--------------------+
|[-2.8271359726790...|
|[-2.7959524821488...|
|[-2.6215235581650...|
|[-2.7649059004742...|
|[-2.7827501159516...|
|[-3.2314457367733...|
|[-2.6904524156023...|
|[-2.8848611044591...|
|[-2.6233845324473...|
|[-2.8374984110638...|
|[-3.0048163084440...|
|[-2.8982003795119...|
|[-2.7239091217858...|
|[-2.2861426515079...|
|[-2.8677998808418...|
|[-3.1274737739836...|
|[-2.8888168946571...|
|[-2.8630203653038...|
|[-3.3122651363522...|
|[-2.9239969088652...|
+--------------------+
only showing top 20 rows



In [27]:
pipeline.stages[-1].explainedVariance

DenseVector([0.9246, 0.053, 0.0172])
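
Each entry of `explainedVariance` is the proportion of variance explained by one principal component, so a running sum shows how much is retained by keeping the first one, two, or three components. A small arithmetic sketch, with the values copied from the output above:

```python
# proportions copied from the explainedVariance output above
explained = [0.9246, 0.053, 0.0172]

cumulative = []
running = 0.0
for ratio in explained:
    running += ratio
    cumulative.append(round(running, 4))

print(cumulative)
```

On these numbers, the three components together retain roughly 99.5% of the variance, and the first component alone accounts for about 92%.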