# Spark-ML

Machine learning at scale!

# Objectives

At the end of this lecture you should be able to:

1. Chain Spark DataFrame methods together to do data munging.
2. Describe the Spark-ML API and recognize how it differs from scikit-learn.
3. Chain Spark-ML Transformers and Estimators together to compose ML pipelines.

In [1]:
import pyspark.sql.functions as F
import sys
#sys.tracebacklimit = 0

In [2]:
import pyspark as ps

spark = ps.sql.SparkSession.builder \
            .master("local[4]") \
            .appName("df lecture") \
            .getOrCreate()


# Supervised Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-features.html

## Why does this difference matter?

Let's try to run Spark-ML's built-in [min-max scaler](https://spark.apache.org/docs/latest/ml-features.html#minmaxscaler) on the `Close` column of our `aapl_df`.

In [12]:
# read CSV
aapl_df = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# Assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=["Close"], 
                                  outputCol="Features")

vector_df = vectorAssembler.transform(aapl_df)
print(aapl_df.columns)
print(vector_df.columns)
# Features are vectors of length 1; adding Open, High, Low, etc. would give longer vectors
vector_df.select('Features').show(5)

['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close']
['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close', 'Features']
+------------+
|    Features|
+------------+
|    [118.25]|
|[117.650002]|
|[116.599998]|
|[117.059998]|
|[117.120003]|
+------------+
only showing top 5 rows



In [6]:
# This won't work: MinMaxScaler needs a vector column, and Close is a plain numeric column
fail_scaler = MinMaxScaler(inputCol="Close", outputCol="Scaled Close")

fail_scaler_model = fail_scaler.fit(aapl_df)

scaled_close = fail_scaler_model.transform(aapl_df)
scaled_close.show(5)

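NameError: name 'MinMaxScaler' is not defined

Note that the `NameError` recorded here only means the cell ran before `MinMaxScaler` was imported. With the import in place, `fit` still fails, because `Close` is a numeric column and `MinMaxScaler` requires a vector column.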

### Gotta have the column be a vector...

In [10]:
scaler = MinMaxScaler(inputCol="Features", outputCol="Scaled Features")

# Compute summary statistics and generate MinMaxScalerModel
scaler_model = scaler.fit(vector_df)
# No column is passed to fit: inputCol was already set when the scaler was constructed

# Rescale each feature to range [min, max], returns a dataframe
scaled_data = scaler_model.transform(vector_df)
scaled_data.select("Features", "Scaled Features").show(5)

+------------+--------------------+
|    Features|     Scaled Features|
+------------+--------------------+
|    [118.25]| [0.865963404782699]|
|[117.650002]|[0.8473472730564975]|
|[116.599998]|[0.8147688098332226]|
|[117.059998]|[0.8290412250646944]|
|[117.120003]|[0.8309029995776607]|
+------------+--------------------+
only showing top 5 rows
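The rescaling that `MinMaxScaler` applies to each feature can be sketched in plain Python. This is a minimal illustration of the documented formula, not Spark code: `min_max_scale` and the sample `closes` list are hypothetical names and values chosen for the example.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale a list of numbers linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to the middle of the target range.
        return [0.5 * (new_min + new_max) for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


# Illustrative closing prices (made up, not the AAPL column).
closes = [100.0, 110.0, 120.0]
scaled = min_max_scale(closes)
print(scaled)  # [0.0, 0.5, 1.0]
```

The minimum and maximum are what `fit` has to learn from the data, which is why the scaler is fitted before it can transform anything.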



# Transformers and Estimators

# Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a [Transformer](http://spark.apache.org/docs/latest/ml-pipeline.html#transformers). Important things to know about this type:

* They implement a `transform` method.
* They convert one `DataFrame` into another, usually by adding columns.

Examples of Transformers: [`VectorAssembler`](http://spark.apache.org/docs/latest/ml-features.html#vectorassembler), [`Tokenizer`](http://spark.apache.org/docs/latest/ml-features.html#tokenizer), [`StopWordsRemover`](http://spark.apache.org/docs/latest/ml-features.html#stopwordsremover), and [many more](http://spark.apache.org/docs/latest/ml-features.html).
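To make the Transformer contract concrete, here is a toy sketch in plain Python. It is not a Spark class: `UppercaseTransformer` is a hypothetical name, and a list of dicts stands in for a `DataFrame`. The only point is the shape of the interface: configure columns up front, then call `transform` to get new rows with an extra column.

```python
class UppercaseTransformer:
    """Hypothetical toy transformer: adds an upper-cased copy of a text column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        # Build new rows rather than mutating the input, mirroring how
        # a DataFrame transformation returns a new DataFrame.
        return [
            dict(row, **{self.output_col: row[self.input_col].upper()})
            for row in rows
        ]


rows = [{"id": 0, "text": "spark"}, {"id": 1, "text": "hadoop"}]
upper_rows = UppercaseTransformer("text", "text_upper").transform(rows)
print(upper_rows)
```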



# Estimators

According to the docs: "An [Estimator](http://spark.apache.org/docs/latest/ml-pipeline.html#estimators) abstracts the concept of a learning algorithm or any algorithm that fits or trains on data". Important things to know about this type:

* They implement a `fit` method whose argument is a `DataFrame`.
* The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of Estimators: [`LogisticRegression`](http://spark.apache.org/docs/latest/ml-classification-regression.html#logistic-regression), [`DecisionTreeRegressor`](http://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-regression), and [many more](http://spark.apache.org/docs/latest/ml-classification-regression.html).
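The Estimator-to-Model relationship can be sketched the same way. The classes below are hypothetical toys in plain Python (a list of dicts stands in for a `DataFrame`); they only illustrate that `fit` learns something from data and hands back an object with a `transform` method.

```python
class MeanCenterer:
    """Hypothetical toy estimator: learns the mean of one numeric column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return MeanCentererModel(self.input_col, self.output_col, mean)


class MeanCentererModel:
    """The fitted result: stores the learned mean and behaves as a transformer."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        return [
            dict(row, **{self.output_col: row[self.input_col] - self.mean})
            for row in rows
        ]


train_rows = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
centerer_model = MeanCenterer("x", "x_centered").fit(train_rows)
new_rows = centerer_model.transform([{"x": 10.0}])
print(centerer_model.mean, new_rows)  # 2.0 [{'x': 10.0, 'x_centered': 8.0}]
```

Note that the model can transform rows it never saw during `fit`, using the statistics learned from the training rows.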


In [3]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RegexTokenizer, HashingTF

# Prepare training documents from a list of (id, text, label) tuples; label is 1.0 when the text is about big data.
train_df = spark.createDataFrame([
             (0, "spark is like hadoop mapreduce", 1.0),
             (1, "sparks light fire!!!", 0.0),
             (2, "elephants like simba", 0.0),
             (3, "hadoop is an elephant", 1.0),
             (4, "hadoop mapreduce", 1.0)], 
             ["id", "text", "label"])

In [4]:
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")  # Transformer: splits text into words
hashingTF = HashingTF(inputCol="tokens", outputCol="features")  # Transformer: computes term frequencies
lr = LogisticRegression(maxIter=10, regParam=0.001)  # Estimator

tokens_df = regexTokenizer.transform(train_df)
print(tokens_df.take(2))
hashes_df = hashingTF.transform(tokens_df)
print(hashes_df.take(2))
logistic_model = lr.fit(hashes_df) # Uses columns named features/label by default
# logistic_model is now a transformer

[Row(id=0, text=u'spark is like hadoop mapreduce', label=1.0, tokens=[u'spark', u'is', u'like', u'hadoop', u'mapreduce']), Row(id=1, text=u'sparks light fire!!!', label=0.0, tokens=[u'sparks', u'light', u'fire'])]
[Row(id=0, text=u'spark is like hadoop mapreduce', label=1.0, tokens=[u'spark', u'is', u'like', u'hadoop', u'mapreduce'], features=SparseVector(262144, {15889: 1.0, 42633: 1.0, 155117: 1.0, 208258: 1.0, 234657: 1.0})), Row(id=1, text=u'sparks light fire!!!', label=0.0, tokens=[u'sparks', u'light', u'fire'], features=SparseVector(262144, {34036: 1.0, 91799: 1.0, 142979: 1.0}))]
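What the two transformers do to a single document can be approximated in plain Python. This is a rough sketch under stated assumptions: `tokenize` and `hashed_term_frequencies` are hypothetical helpers, the text is assumed to be lower-cased before splitting, and `zlib.crc32` is a stand-in hash, so the indices will not match the ones Spark prints above. The vector size 262144 is taken from the `SparseVector` output.

```python
import re
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the size shown in the SparseVector output


def tokenize(text):
    # Lower-case, split on non-word characters, and drop empty strings.
    return [tok for tok in re.split(r"\W", text.lower()) if tok]


def hashed_term_frequencies(tokens, num_features=NUM_FEATURES):
    # Stand-in hash (crc32); Spark uses a different hash function,
    # so the bucket indices here differ from Spark's.
    counts = Counter(
        zlib.crc32(tok.encode("utf-8")) % num_features for tok in tokens
    )
    return {index: float(count) for index, count in sorted(counts.items())}


tokens = tokenize("sparks light fire!!!")
sparse_counts = hashed_term_frequencies(tokens)
print(tokens)  # ['sparks', 'light', 'fire']
print(sparse_counts)
```

The hashing trick avoids building a vocabulary: each token goes straight to a bucket index, at the cost of occasional collisions between different tokens.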


### Now we can predict on new text

In [30]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test_df = spark.createDataFrame([
            (5, "simba has a spark"),
            (6, "hadoop"),
            (7, "mapreduce in spark"),
            (8, "apache hadoop")], 
            ["id", "text"])

### What do we need to do to this to get a prediction?

In [35]:
preds_df = logistic_model.transform(
                hashingTF.transform(
           regexTokenizer.transform(test_df)))

preds_df.select('text', 'prediction', 'probability').show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78819302361551...|
|            hadoop|       1.0|[0.02995590605364...|
|mapreduce in spark|       1.0|[0.02401898451752...|
|     apache hadoop|       1.0|[0.02995590605364...|
+------------------+----------+--------------------+
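The relationship between the `probability` and `prediction` columns can be sketched as follows. This is an illustration, not Spark's code: `predict_label` is a hypothetical helper, a threshold of 0.5 is assumed, the first entries are rounded from the table above, and the second entry of each vector is assumed to be one minus the first because the table truncates it.

```python
def predict_label(probability, threshold=0.5):
    """Return 1.0 when P(label = 1) exceeds the threshold, else 0.0."""
    return 1.0 if probability[1] > threshold else 0.0


# (text, P(label = 0)) pairs, rounded from the first entries in the table.
examples = [
    ("simba has a spark", 0.788),
    ("hadoop", 0.030),
]

labels = {}
for text, p0 in examples:
    probability = [p0, 1.0 - p0]  # assumed binary probability vector
    labels[text] = predict_label(probability)

print(labels)  # {'simba has a spark': 0.0, 'hadoop': 1.0}
```

The first entry of each probability vector is the probability of label 0, which is why a large first entry goes with a prediction of 0.0.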



# Pipelines

Many data science workflows can be described as the sequential application of various `Transformers` and `Estimators`. In `sklearn` this idea has been formalized into a class, [`Pipeline`](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html).

This way of thinking about things was so popular that Spark adopted the same idea with its own `Pipeline` class.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

## Configure an ML pipeline, which consists of three stages:
* tokenizer,
* hashingTF, and
* logistic regression

In [36]:
regexTokenizer = RegexTokenizer(inputCol="text", outputCol="tokens", pattern="\\W")
hashingTF = HashingTF(inputCol="tokens", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[regexTokenizer, hashingTF, lr])

# Fit the pipeline to training documents
model = pipeline.fit(train_df)

### And now `model` is a transformer representing the entire pipeline. So it's simple to run our `test_df` through.

In [37]:
prediction = model.transform(test_df)
prediction.select(['text', 'prediction', 'probability']).show()

+------------------+----------+--------------------+
|              text|prediction|         probability|
+------------------+----------+--------------------+
| simba has a spark|       0.0|[0.78819302361551...|
|            hadoop|       1.0|[0.02995590605364...|
|mapreduce in spark|       1.0|[0.02401898451752...|
|     apache hadoop|       1.0|[0.02995590605364...|
+------------------+----------+--------------------+
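What happens inside `pipeline.fit` and `model.transform` can be sketched in plain Python. Everything below is a hypothetical toy (lists of dicts stand in for DataFrames, and none of these classes are Spark's); it only shows the control flow: estimators are fitted on the data produced by the stages before them, and the fitted pipeline is a chain of transformers.

```python
class Lowercase:
    """Toy transformer: only has transform."""

    def transform(self, rows):
        return [dict(r, text=r["text"].lower()) for r in rows]


class LengthThreshold:
    """Toy estimator: fit learns the mean text length and returns a model."""

    def fit(self, rows):
        mean_len = sum(len(r["text"]) for r in rows) / len(rows)
        return LengthThresholdModel(mean_len)


class LengthThresholdModel:
    """Toy fitted model: flags texts longer than the learned threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def transform(self, rows):
        return [dict(r, is_long=len(r["text"]) > self.threshold) for r in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # estimator becomes a fitted model
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the output to the next stage
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


toy_train = [{"text": "ab"}, {"text": "abcdef"}]
toy_model = ToyPipeline([Lowercase(), LengthThreshold()]).fit(toy_train)
toy_preds = toy_model.transform([{"text": "ABCDEFGH"}, {"text": "a"}])
print(toy_preds)
```

This mirrors the nested `transform` calls written by hand earlier, with the bookkeeping moved into one object.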



# Unsupervised Machine Learning on DataFrames

We can similarly do unsupervised learning in Spark ML. And we can still use pipelines!

In [14]:
# Read CSV
iris_df = spark.read.csv('data/iris.csv',
                              header=True,       # parse the first line as a header?
                              quote='"',         # quote character
                              sep=",",           # separation character
                              inferSchema=True)  # infer schema?
iris_df.show(5)

+-----------------+----------------+-----------------+----------------+
|sepal length (cm)|sepal width (cm)|petal length (cm)|petal width (cm)|
+-----------------+----------------+-----------------+----------------+
|              5.1|             3.5|              1.4|             0.2|
|              4.9|             3.0|              1.4|             0.2|
|              4.7|             3.2|              1.3|             0.2|
|              4.6|             3.1|              1.5|             0.2|
|              5.0|             3.6|              1.4|             0.2|
+-----------------+----------------+-----------------+----------------+
only showing top 5 rows



In [21]:
from pyspark.ml.feature import PCA

col_names = ["sepal length (cm)", "sepal width (cm)", 
             "petal length (cm)", "petal width (cm)"]
pcaVectorAssembler = VectorAssembler(inputCols=col_names, outputCol="features")
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
pipeline = Pipeline(stages=[pcaVectorAssembler, pca])

fitted_pipeline = pipeline.fit(iris_df)

fitted_pipeline.transform(iris_df).select("pcaFeatures").take(5)

[Row(pcaFeatures=DenseVector([-2.8271, -5.6413, 0.6643])),
 Row(pcaFeatures=DenseVector([-2.796, -5.1452, 0.8463])),
 Row(pcaFeatures=DenseVector([-2.6215, -5.1774, 0.6181])),
 Row(pcaFeatures=DenseVector([-2.7649, -5.0036, 0.6051])),
 Row(pcaFeatures=DenseVector([-2.7828, -5.6486, 0.5465]))]

## Inspecting Pipelines

We can even access the individual fitted stages of a pipeline if there's an attribute on them that we're interested in.

In [34]:
pca_model = fitted_pipeline.stages[-1]
print(pca_model.explainedVariance)
print(pca_model.pc)

[0.924616207174,0.0530155678505,0.017185139525]
DenseMatrix([[-0.36158968, -0.65653988,  0.58099728],
             [ 0.08226889, -0.72971237, -0.59641809],
             [-0.85657211,  0.1757674 , -0.07252408],
             [-0.35884393,  0.07470647, -0.54906091]])


DenseMatrix(4, 3, [-0.3616, 0.0823, -0.8566, -0.3588, -0.6565, -0.7297, 0.1758, 0.0747, 0.581, -0.5964, -0.0725, -0.5491], 0)
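As a sanity check, the printed `pc` matrix can be tied back to the `pcaFeatures` output by hand. The sketch below uses only values printed in this notebook (the first iris row and the 4x3 `pc` matrix) and multiplies the raw feature vector by `pc` in plain Python; the variable names are chosen for the example.

```python
# Principal components as printed above: 4 input features x 3 components.
pc = [
    [-0.36158968, -0.65653988,  0.58099728],
    [ 0.08226889, -0.72971237, -0.59641809],
    [-0.85657211,  0.1757674,  -0.07252408],
    [-0.35884393,  0.07470647, -0.54906091],
]

# First row of iris_df: sepal length, sepal width, petal length, petal width.
first_row = [5.1, 3.5, 1.4, 0.2]

# Row vector times matrix: one dot product per component (column of pc).
projected = [
    sum(x * pc[i][j] for i, x in enumerate(first_row))
    for j in range(3)
]
print([round(v, 4) for v in projected])  # [-2.8271, -5.6413, 0.6643]
```

The result agrees with the first `pcaFeatures` row shown earlier, so in this output each transformed row is the raw feature vector multiplied by `pc`.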