# Spark ML

## Morning Objectives

At the end of this lecture you should be able to:

1. Describe the Spark ML API and recognize how it differs from sklearn
1. Chain Spark `DataFrame` methods together to do data munging
1. Chain Spark ML `Transformers` and `Estimators` together to compose ML `Pipeline`s

In [None]:
import pyspark.sql.functions as F

from pyspark.ml.feature import VectorAssembler, MinMaxScaler, RegexTokenizer, HashingTF, PCA
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

## A. Let's design chains of transformations together!

### A.1. Computing sales per state

#### Input

In [None]:
sales_df = sqlContext.read.csv('data/sales.csv',
    header=True,      # use headers
    quote='"',        # use " for quoting
    sep=',',          # use , for separating fields
    inferSchema=True) # infer schema

In [None]:
sales_df.show()

In [None]:
sales_df.schema

#### Task

You want to obtain a `DataFrame` of states sorted by total sales (`Amount`), in decreasing order, so the states with the most sales come first.

1. What transformations do you need to apply?
2. How would you draw the workflow of those transformations?

#### Code

In [None]:
# YOUR CODE HERE!

### A.2. Find the date on which AAPL's closing stock price was the highest

#### Input

In [None]:
aapl_df = sqlContext.read.csv('data/aapl.csv',
    header=True,
    quote='"',
    sep=',',
    inferSchema=True)

In [None]:
aapl_df.show(5)

#### Task

Now, design a pipeline that will:

1. Keep only fields for Date and Close
2. Order by Close in descending order

#### Code

In [None]:
# YOUR CODE HERE!

## B. Supervised Machine Learning on DataFrames

- [Extracting, transforming and selecting features](http://spark.apache.org/docs/latest/ml-features.html)

### Question: What is the difference between `aapl_df` and `vector_df` after running the code below?

In [None]:
assembler = VectorAssembler(inputCols=['Close'], outputCol='features')

vector_df = assembler.transform(aapl_df)

In [None]:
print(type(aapl_df))
print()
print(aapl_df.schema)
print()
aapl_df.show(5)

In [None]:
print(type(vector_df))
print()
print(vector_df.schema)
print()
vector_df.show(5)

### Follow-up: Why does this difference matter?

Let's try to run one of Spark ML's built-in transformers on some of our data.  Let's min-max scale the `Close` column.

In [None]:
scaler = MinMaxScaler(inputCol='Close', outputCol='Scaled Close').fit(aapl_df)

scaled_close_df = scaler.transform(aapl_df)

scaled_close_df.show(5)

In [None]:
print(aapl_df.schema['Close'])
print(vector_df.schema['features'])

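Takeaway: Spark ML feature transformers such as `MinMaxScaler` expect their input column to be a vector, so a plain numeric column like `Close` must first be assembled into a vector column (here, `features`).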

In [None]:
scaler = MinMaxScaler(inputCol='features', outputCol='scaled_features').fit(vector_df)

scaled_features_df = scaler.transform(vector_df)

scaled_features_df.select('features', 'scaled_features').show(5)
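
For intuition, the rescaling that `MinMaxScaler` learns and then applies can be sketched in plain Python. This is an illustrative sketch, not Spark's implementation: the function names `fit_min_max` and `apply_min_max` and the sample prices are made up. It follows the formula given in the Spark documentation, in which a constant column maps to the midpoint of the output range.

```python
def fit_min_max(values):
    """Learn the statistics the scaler needs: the column minimum and maximum."""
    return min(values), max(values)


def apply_min_max(values, col_min, col_max, out_min=0.0, out_max=1.0):
    """Rescale each value linearly from [col_min, col_max] to [out_min, out_max]."""
    if col_max == col_min:
        # A constant column has no spread, so every value maps to the midpoint.
        return [0.5 * (out_min + out_max) for _ in values]
    span = col_max - col_min
    return [(v - col_min) / span * (out_max - out_min) + out_min for v in values]


closes = [100.0, 110.0, 120.0, 140.0]  # made-up closing prices
col_min, col_max = fit_min_max(closes)
scaled = apply_min_max(closes, col_min, col_max)
print(scaled)
```

The split into a "fit" step and an "apply" step mirrors the `.fit(...)` and `.transform(...)` calls in the cell above.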

## Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a [Transformer](http://spark.apache.org/docs/latest/ml-pipeline.html#transformers).  Important things to know about this type:

- They implement a `transform` method
- They convert one `DataFrame` into another, usually by adding columns

Examples of transformers: [`VectorAssembler`](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler), [`Tokenizer`](http://spark.apache.org/docs/latest/ml-features.html#tokenizer), [`StopWordsRemover`](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover), and [many more](http://spark.apache.org/docs/latest/ml-features.html)
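
The transformer contract described above can be sketched without Spark. In this toy version a list of dicts stands in for a `DataFrame`, and `ToyTokenizer` is a hypothetical class written for illustration only; it is not part of any Spark API.

```python
class ToyTokenizer:
    """A toy transformer: splits a text column on whitespace into a new column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Build new rows instead of mutating the input, mirroring how a
        # DataFrame transformation returns a new DataFrame.
        return [dict(row, **{self.output_col: row[self.input_col].split()})
                for row in rows]


rows = [{'id': 0, 'text': 'hadoop mapreduce'}]
tokenized = ToyTokenizer('text', 'tokens').transform(rows)
print(tokenized)
```

Note that the input rows are left untouched and the output has the same columns plus one more.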

## Estimators

According to the documentation: "An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data".  Important things to know about them:

- They implement a `fit` method whose argument is a `DataFrame`
- The output of `fit` is another type called `Model`, which is a `Transformer`

Examples of estimators: [`LogisticRegression`](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression), [`DecisionTreeRegressor`](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression), and [many more](http://spark.apache.org/docs/latest/ml-classification-regression.html)
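
The relationship between an estimator and the model it produces can also be sketched in plain Python. `ToyMeanCenterer` and `ToyMeanCentererModel` are hypothetical names invented for this sketch, and a list of dicts again stands in for a `DataFrame`.

```python
class ToyMeanCenterer:
    """A toy estimator: fit() learns the mean of a column and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        return ToyMeanCentererModel(self.input_col, self.output_col, mean)


class ToyMeanCentererModel:
    """The fitted model: a transformer that carries the learned mean."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Subtract the mean learned at fit time, not the mean of these rows.
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean})
                for row in rows]


train = [{'x': 1.0}, {'x': 2.0}, {'x': 3.0}]
centerer_model = ToyMeanCenterer('x', 'x_centered').fit(train)
centered = centerer_model.transform([{'x': 5.0}])
print(centerer_model.mean, centered)
```

The point of the design is that whatever is learned from the training data lives in the model, so the same learned values can be applied later to new data.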

## Pipelines

Many Data Science workflows can be described as a sequential application of various `Transformers` and `Estimators`.
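
As a rough mental model, fitting such a sequence means walking through the stages in order: each estimator is fit on the data produced so far, and the resulting transformer is applied before moving on. The sketch below is a simplified plain-Python illustration, not Spark's `Pipeline` code; all class and function names in it are hypothetical.

```python
class Lowercase:
    """Toy transformer: adds a lowercased copy of the text column."""

    def transform(self, rows):
        return [dict(row, lower=row['text'].lower()) for row in rows]


class VocabularyCounter:
    """Toy estimator: fit() learns the vocabulary and returns a fitted model."""

    def fit(self, rows):
        vocab = {word for row in rows for word in row['lower'].split()}
        return VocabularyModel(vocab)


class VocabularyModel:
    """Toy fitted model: counts how many words of each row are in the vocabulary."""

    def __init__(self, vocab):
        self.vocab = set(vocab)

    def transform(self, rows):
        return [dict(row, known=sum(w in self.vocab for w in row['lower'].split()))
                for row in rows]


def fit_pipeline(stages, rows):
    """Fit the stages in order and return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        # Estimators are fit on the data produced so far; transformers are used as-is.
        fitted_stage = stage.fit(rows) if hasattr(stage, 'fit') else stage
        rows = fitted_stage.transform(rows)
        fitted.append(fitted_stage)
    return fitted


def transform_pipeline(fitted, rows):
    """Apply every fitted stage in order to new data."""
    for fitted_stage in fitted:
        rows = fitted_stage.transform(rows)
    return rows


train = [{'text': 'Hadoop MapReduce'}, {'text': 'Spark'}]
fitted = fit_pipeline([Lowercase(), VocabularyCounter()], train)
result = transform_pipeline(fitted, [{'text': 'spark and hadoop'}])
print(result)
```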

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

In [None]:
# prepare training set from a list of (id, text, label) tuples

training_df = spark.createDataFrame([(0, 'spark is like hadoop mapreduce', 1.0),
        (1, 'sparks light fire!!!', 0.0),
        (2, 'elephants like simba', 0.0),
        (3, 'hadoop is an elephant', 1.0),
        (4, 'hadoop mapreduce', 1.0)],
    ['id', 'text', 'label'])

In [None]:
tokenizer = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W')
tokens_df = tokenizer.transform(training_df)

In [None]:
tokens_df.show(5)
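
To see why the `tokens` column looks the way it does, here is a plain-Python approximation of the tokenizer configured above. It assumes the tokenizer's default behavior (lowercasing, treating the pattern as a separator, and dropping empty tokens); `regex_tokenize` is a hypothetical helper, and Python's `\W` may treat non-ASCII text differently from the JVM regex engine Spark uses.

```python
import re


def regex_tokenize(text, pattern=r'\W', min_token_length=1):
    """Lowercase the text, split on the pattern, and drop too-short tokens."""
    pieces = re.split(pattern, text.lower())
    # Consecutive separators (such as '!!!') produce empty strings; filter them out.
    return [piece for piece in pieces if len(piece) >= min_token_length]


tokens = regex_tokenize('sparks light fire!!!')
print(tokens)  # ['sparks', 'light', 'fire']
```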

In [None]:
tf = HashingTF(inputCol='tokens', outputCol='features')
tf_df = tf.transform(tokens_df)

In [None]:
tf_df.show(5)
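
The idea behind `HashingTF` is the hashing trick: each token is hashed to an index in a fixed-size vector, and the value at that index counts how often tokens landing there occur. The sketch below illustrates the idea in plain Python under stated assumptions: `hashing_tf` is a hypothetical helper, `zlib.crc32` is a stand-in hash (the Spark documentation names MurmurHash 3), and the vector size is kept small for readability, so the indices will not match Spark's output.

```python
import zlib


def hashing_tf(tokens, num_features=16):
    """Map tokens to a sparse {index: count} dict with the hashing trick."""
    counts = {}
    for token in tokens:
        # Stand-in stable hash; Spark uses a different hash function.
        index = zlib.crc32(token.encode('utf-8')) % num_features
        counts[index] = counts.get(index, 0.0) + 1.0
    return counts


vector = hashing_tf(['hadoop', 'is', 'an', 'elephant', 'hadoop'])
print(vector)
```

No vocabulary has to be learned first, which is why `HashingTF` is used here as a transformer; the trade-off is that two different tokens can collide on the same index.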

In [None]:
model = LogisticRegression(maxIter=10, regParam=.001).fit(tf_df)
# (uses columns named features/label by default)

In [None]:
# prepare the test set, which consists of unlabeled (id, text) tuples

test_df = spark.createDataFrame([(5, 'simba has a spark'),
        (6, 'hadoop'),
        (7, 'mapreduce in spark'),
        (8, 'apache hadoop')],
    ['id', 'text'])

# What do we need to do to get predictions on our test set?

# YOUR CODE HERE

Alternatively, configure an ML pipeline, which consists of three stages: tokenizer, tf, and lr.

In [None]:
tokenizer = RegexTokenizer(inputCol='text', outputCol='tokens', pattern='\\W')
tf = HashingTF(inputCol='tokens', outputCol='features')
lr = LogisticRegression(maxIter=10, regParam=.001)

pipeline = Pipeline(stages=[tokenizer, tf, lr])

model = pipeline.fit(training_df)

In [None]:
# What do we need to do to get predictions on our test set?

# YOUR CODE HERE

## C. Unsupervised Machine Learning on DataFrames

In [None]:
# read csv
iris_df = sqlContext.read.csv('data/iris.csv',
    header=True,
    quote='"',
    sep=',',
    inferSchema=True)

In [None]:
iris_df.show()

In [None]:
col_names = ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']

pipeline = (Pipeline(stages=[
        VectorAssembler(inputCols=col_names, outputCol='features'),
        PCA(k=3, inputCol='features', outputCol='pca_features')
    ]).fit(iris_df))

pipeline.transform(iris_df).select('pca_features').show()

In [None]:
pipeline.stages[-1].explainedVariance
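
The `explainedVariance` attribute holds, for each principal component that was kept, the proportion of the total variance it explains. A minimal sketch of that calculation, assuming the proportions are taken relative to the variance of all components and using made-up eigenvalues (not the iris results):

```python
def explained_variance(eigenvalues, k):
    """Proportion of total variance explained by each of the top k components."""
    total = sum(eigenvalues)
    top_k = sorted(eigenvalues, reverse=True)[:k]
    return [value / total for value in top_k]


# Made-up eigenvalues of a 4-feature covariance matrix, for illustration only.
ratios = explained_variance([4.0, 1.0, 0.5, 0.5], k=3)
print(ratios)
```

Because only `k=3` of the four components are kept, the proportions sum to less than 1.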