If this notebook was started with the script `jupyspark.sh`, the variables `spark` and `sc` will already be defined.

If not, run the following cell:

In [1]:
import pyspark as ps

spark = (
        ps.sql.SparkSession.builder 
        .master("local[4]") 
        .appName("lecture") 
        .getOrCreate()
        )

sc = spark.sparkContext

In [2]:
spark

In [3]:
sc

# Spark-ML Objectives

At the end of this lecture you should be able to:

1. Describe the Spark-ML API and recognize how it differs from scikit-learn.
2. Chain Spark-ML Transformers and Estimators together to compose ML pipelines.

# Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-features.html


In [4]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+-------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



In [5]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=["Open","High", "Low","Close"],
                                  outputCol="features")

df_vector = vectorAssembler.transform(df_aapl)
df_vector.select(['Open', 'High', 'Low', 'Close', 'features']).show(5)

print("***"*25)

df_vector.select('features').show(5)

print("***"*25)

df_vector.select('features').take(5)

+----------+----------+----------+----------+--------------------+
|      Open|      High|       Low|     Close|            features|
+----------+----------+----------+----------+--------------------+
|117.949997|118.360001|117.309998|    118.25|[117.949997,118.3...|
|117.099998|117.739998|     117.0|117.650002|[117.099998,117.7...|
|116.809998|116.910004|116.279999|116.599998|[116.809998,116.9...|
|116.860001|117.379997|116.330002|117.059998|[116.860001,117.3...|
|    117.25|117.760002|113.800003|117.120003|[117.25,117.76000...|
+----------+----------+----------+----------+--------------------+
only showing top 5 rows

***************************************************************************
+--------------------+
|            features|
+--------------------+
|[117.949997,118.3...|
|[117.099998,117.7...|
|[116.809998,116.9...|
|[116.860001,117.3...|
|[117.25,117.76000...|
+--------------------+
only showing top 5 rows

***************************************************************************

[Row(features=DenseVector([117.95, 118.36, 117.31, 118.25])),
 Row(features=DenseVector([117.1, 117.74, 117.0, 117.65])),
 Row(features=DenseVector([116.81, 116.91, 116.28, 116.6])),
 Row(features=DenseVector([116.86, 117.38, 116.33, 117.06])),
 Row(features=DenseVector([117.25, 117.76, 113.8, 117.12]))]

In [6]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledfeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(df_vector)

# Rescale each feature to the range [min, max]; the defaults are min=0.0, max=1.0.
scaledData = scalerModel.transform(df_vector)
scaledData.select("features", "scaledfeatures").show(5)

print("***"*25)

scaledData.select("scaledfeatures").take(5)

+--------------------+--------------------+
|            features|      scaledfeatures|
+--------------------+--------------------+
|[117.949997,118.3...|[0.84364622791846...|
|[117.099998,117.7...|[0.81798975110079...|
|[116.809998,116.9...|[0.80923635459429...|
|[116.860001,117.3...|[0.81074565144089...|
|[117.25,117.76000...|[0.82251743035171...|
+--------------------+--------------------+
only showing top 5 rows

***************************************************************************


[Row(scaledfeatures=DenseVector([0.8436, 0.8302, 0.8659, 0.866])),
 Row(scaledfeatures=DenseVector([0.818, 0.8109, 0.8563, 0.8473])),
 Row(scaledfeatures=DenseVector([0.8092, 0.7851, 0.8339, 0.8148])),
 Row(scaledfeatures=DenseVector([0.8107, 0.7997, 0.8355, 0.829])),
 Row(scaledfeatures=DenseVector([0.8225, 0.8115, 0.7568, 0.8309]))]

In [7]:
scaledData.select("features", "scaledfeatures").first()

Row(features=DenseVector([117.95, 118.36, 117.31, 118.25]), scaledfeatures=DenseVector([0.8436, 0.8302, 0.8659, 0.866]))

In [8]:
scaledData.select("features", "scaledfeatures").first()['features']

DenseVector([117.95, 118.36, 117.31, 118.25])

In [9]:
scaledData.select("features", "scaledfeatures").first()['scaledfeatures']

DenseVector([0.8436, 0.8302, 0.8659, 0.866])
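To make the rescaling concrete, here is a minimal pure-Python sketch of column-wise min-max scaling, `(x - min) / (max - min) * (out_max - out_min) + out_min`. The helper `min_max_scale` is hypothetical and is not part of Spark. It uses only the five rows shown above, so its numbers will not match `scaledfeatures`, where the minimum and maximum come from the whole file. It also assumes no column is constant.

```python
# Toy column-wise min-max scaling on the five rows displayed above.
# Assumption: every column has max > min (no constant columns).
rows = [
    [117.949997, 118.360001, 117.309998, 118.25],
    [117.099998, 117.739998, 117.0, 117.650002],
    [116.809998, 116.910004, 116.279999, 116.599998],
    [116.860001, 117.379997, 116.330002, 117.059998],
    [117.25, 117.760002, 113.800003, 117.120003],
]


def min_max_scale(rows, out_min=0.0, out_max=1.0):
    """Rescale each column of `rows` to the range [out_min, out_max]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) * (out_max - out_min) + out_min
            for x, lo, hi in zip(row, lows, highs)
        ])
    return scaled


scaled_rows = min_max_scale(rows)
for row in scaled_rows:
    print([round(x, 4) for x in row])
```

The first row holds the maximum of every column in this five-row sample, so it scales to all ones here.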

# Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a Transformer. Important things to know about this type:

* They implement a `transform` method.
* They convert one `DataFrame` into another, usually by adding columns.

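Examples of Transformers: `VectorAssembler`, `Tokenizer`, `StopWordsRemover`, `HashingTF`, and many more. Note that scalers such as `MinMaxScaler` and `StandardScaler` are not Transformers themselves: they must be fit first, as in the `MinMaxScaler` cell above.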



# Estimators

According to the docs: An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Important things to know about this type:

* They implement a `fit` method whose argument is a `DataFrame`.
* The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of Estimators: `LogisticRegression`, `DecisionTreeRegressor`, and many more
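
The `fit`/`transform` contract can be sketched without Spark. The classes below are hypothetical toys, not part of the Spark API, and a plain list of dicts stands in for a `DataFrame`. The Estimator-like object learns something from the data in `fit` and returns a model, and the model's `transform` returns new rows with an added column.

```python
class ToyMeanCenterer:
    """Estimator-like: learns a column mean in fit() and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return ToyMeanCentererModel(self.input_col, self.output_col, mean)


class ToyMeanCentererModel:
    """Transformer-like: adds a new column computed from the learned mean."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows; the input rows are left untouched.
        return [
            {**row, self.output_col: row[self.input_col] - self.mean}
            for row in rows
        ]


data = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
centerer_model = ToyMeanCenterer("x", "x_centered").fit(data)
centered = centerer_model.transform(data)
print(centered)
```

As with `scaler.fit(df_vector)` earlier, the object that learns from the data and the object that applies the result are two separate types.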


# Pipelines

Many Data Science workflows can be described as sequential application of various `Transformers` and `Estimators`.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

In [10]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a a a b b c a d spark", 1.0),
    (1, "b c c c d c c a", 0.0),
    (2, "spark spark a a c spam", 1.0),
    (3, "c d d b d spam", 0.0)
], ["id", "text", "label"])

In [11]:
training.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0|a a a b b c a d s...|  1.0|
|  1|     b c c c d c c a|  0.0|
|  2|spark spark a a c...|  1.0|
|  3|      c d d b d spam|  0.0|
+---+--------------------+-----+



In [12]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokens = tokenizer.transform(training)
tokens.show(5)

+---+--------------------+-----+--------------------+
| id|                text|label|               words|
+---+--------------------+-----+--------------------+
|  0|a a a b b c a d s...|  1.0|[a, a, a, b, b, c...|
|  1|     b c c c d c c a|  0.0|[b, c, c, c, d, c...|
|  2|spark spark a a c...|  1.0|[spark, spark, a,...|
|  3|      c d d b d spam|  0.0|[c, d, d, b, d, s...|
+---+--------------------+-----+--------------------+



In [13]:
hashingTF = HashingTF(inputCol="words", outputCol="features")
hashes = hashingTF.transform(tokens)
hashes.show(5)

+---+--------------------+-----+--------------------+--------------------+
| id|                text|label|               words|            features|
+---+--------------------+-----+--------------------+--------------------+
|  0|a a a b b c a d s...|  1.0|[a, a, a, b, b, c...|(262144,[27526,28...|
|  1|     b c c c d c c a|  0.0|[b, c, c, c, d, c...|(262144,[27526,28...|
|  2|spark spark a a c...|  1.0|[spark, spark, a,...|(262144,[28698,19...|
|  3|      c d d b d spam|  0.0|[c, d, d, b, d, s...|(262144,[27526,28...|
+---+--------------------+-----+--------------------+--------------------+



In [14]:
hashes.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(262144,[27526,28...|
|(262144,[27526,28...|
|(262144,[28698,19...|
|(262144,[27526,28...|
+--------------------+



In [15]:
hashes.select('features').take(5)

[Row(features=SparseVector(262144, {27526: 1.0, 28698: 1.0, 30913: 2.0, 227410: 4.0, 234657: 1.0})),
 Row(features=SparseVector(262144, {27526: 1.0, 28698: 5.0, 30913: 1.0, 227410: 1.0})),
 Row(features=SparseVector(262144, {28698: 1.0, 197793: 1.0, 227410: 2.0, 234657: 2.0})),
 Row(features=SparseVector(262144, {27526: 3.0, 28698: 1.0, 30913: 1.0, 197793: 1.0}))]

In [16]:
print(hashes.select('features').first()['features'])

(262144,[27526,28698,30913,227410,234657],[1.0,1.0,2.0,4.0,1.0])
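
The printed triple is a sparse vector: its size, the non-zero indices, and the count at each index. A rough sketch of the hashing trick behind it follows. The function `toy_hashing_tf` is hypothetical, and `zlib.crc32` is only a stand-in hash, so the indices will differ from Spark's output above.

```python
import zlib
from collections import Counter


def toy_hashing_tf(words, num_features=262144):
    """Map each word to a bucket index by hashing, then count per bucket."""
    counts = Counter()
    for word in words:
        # Stand-in hash: Spark uses a different hash function internally.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1.0
    indices = sorted(counts)
    values = [counts[i] for i in indices]
    return num_features, indices, values


size, indices, values = toy_hashing_tf("a a a b b c a d spark".split())
print(size, indices, values)
```

No vocabulary is stored: a word's column depends only on its hash, so two different words can collide in the same bucket.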


In [17]:
lr = LogisticRegression(maxIter=10, 
                        regParam=0.001, 
                        featuresCol='features',
                        labelCol='label',
                        predictionCol='prediction',
                        probabilityCol='probability')
# These last four keywords are the defaults!
# I've written them out here for clarity

logistic_model = lr.fit(hashes)

In [18]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark a a a a"),
    (5, "c c c p"),
    (6, "spark spam spark a"),
    (7, "a a a c c c")
], ["id", "text"])

# What do we need to do to this to get a prediction?

In [None]:
# Why doesn't this work?

#logistic_model.transform(test)

The fitted model expects a `features` column, which `test` does not have. We need to transform our test data with the same steps we applied to the training data!

In [19]:
test_tokens = tokenizer.transform(test)
test_vectors = hashingTF.transform(test_tokens)
test_output = logistic_model.transform(test_vectors)

In [22]:
test_output.show()

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|     spark a a a a| [spark, a, a, a, a]|(262144,[227410,2...|[-7.7612663456679...|[4.25735553078516...|       1.0|
|  5|           c c c p|        [c, c, c, p]|(262144,[28698,21...|[3.51073257971980...|[0.97099160578993...|       0.0|
|  6|spark spam spark a|[spark, spam, spa...|(262144,[197793,2...|[-5.6987077106472...|[0.00333910522987...|       1.0|
|  7|       a a a c c c|  [a, a, a, c, c, c]|(262144,[28698,22...|[-0.6949249969721...|[0.33293838014924...|       1.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+



In [21]:
test_output.select('text','rawPrediction','probability','prediction').show(5)

+------------------+--------------------+--------------------+----------+
|              text|       rawPrediction|         probability|prediction|
+------------------+--------------------+--------------------+----------+
|     spark a a a a|[-7.7612663456679...|[4.25735553078516...|       1.0|
|           c c c p|[3.51073257971980...|[0.97099160578993...|       0.0|
|spark spam spark a|[-5.6987077106472...|[0.00333910522987...|       1.0|
|       a a a c c c|[-0.6949249969721...|[0.33293838014924...|       1.0|
+------------------+--------------------+--------------------+----------+



In [23]:
test_output.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [24]:
test_output.select('text', 'probability','prediction').show(5)

+------------------+--------------------+----------+
|              text|         probability|prediction|
+------------------+--------------------+----------+
|     spark a a a a|[4.25735553078516...|       1.0|
|           c c c p|[0.97099160578993...|       0.0|
|spark spam spark a|[0.00333910522987...|       1.0|
|       a a a c c c|[0.33293838014924...|       1.0|
+------------------+--------------------+----------+



In [25]:
test_output.select('probability').take(5)

[Row(probability=DenseVector([0.0004, 0.9996])),
 Row(probability=DenseVector([0.971, 0.029])),
 Row(probability=DenseVector([0.0033, 0.9967])),
 Row(probability=DenseVector([0.3329, 0.6671]))]

In [26]:
test_output.select('rawPrediction').take(5)

[Row(rawPrediction=DenseVector([-7.7613, 7.7613])),
 Row(rawPrediction=DenseVector([3.5107, -3.5107])),
 Row(rawPrediction=DenseVector([-5.6987, 5.6987])),
 Row(rawPrediction=DenseVector([-0.6949, 0.6949]))]
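
The two outputs are linked by the logistic function. As a sketch, assuming a binary model where `rawPrediction` is `[-margin, margin]`, the probability of class 1 is `1 / (1 + exp(-margin))`. The helper `probability_from_raw` is hypothetical, and the inputs are the rounded values printed above.

```python
import math


def probability_from_raw(raw_prediction):
    """Turn a binary [-margin, margin] pair into [P(class 0), P(class 1)]."""
    margin = raw_prediction[1]
    p1 = 1.0 / (1.0 + math.exp(-margin))
    return [1.0 - p1, p1]


raw_predictions = [
    [-7.7613, 7.7613],
    [3.5107, -3.5107],
    [-5.6987, 5.6987],
    [-0.6949, 0.6949],
]

for raw in raw_predictions:
    probs = probability_from_raw(raw)
    # Assumption: the default decision threshold of 0.5.
    label = 1.0 if probs[1] > 0.5 else 0.0
    print([round(p, 4) for p in probs], label)
```

Up to rounding, this reproduces the `probability` and `prediction` columns shown earlier.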

## Alternatively: put all these steps in a Pipeline

In [27]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

In [28]:
# How can we apply this to our test data?
prediction = model.transform(test)
prediction.select(['features', 'prediction', 'probability']).show()

+--------------------+----------+--------------------+
|            features|prediction|         probability|
+--------------------+----------+--------------------+
|(262144,[227410,2...|       1.0|[4.25735553078518...|
|(262144,[28698,21...|       0.0|[0.97099160578993...|
|(262144,[197793,2...|       1.0|[0.00333910522987...|
|(262144,[28698,22...|       1.0|[0.33293838014925...|
+--------------------+----------+--------------------+
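
Conceptually, fitting a pipeline walks the stages in order: Estimators are fit and replaced by their models, and each stage's output feeds the next. The sketch below illustrates that idea in plain Python. All class names are hypothetical toys rather than Spark's implementation, and lists of dicts stand in for `DataFrame`s.

```python
class ToyLowercaser:
    """Transformer-like stage: lowercases the text column."""

    def transform(self, rows):
        return [{**row, "text": row["text"].lower()} for row in rows]


class ToyVocabulary:
    """Estimator-like stage: learns the vocabulary seen during fit()."""

    def fit(self, rows):
        vocab = sorted({w for row in rows for w in row["text"].split()})
        return ToyVocabularyModel(vocab)


class ToyVocabularyModel:
    """Fitted stage: counts each known word in every row."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [
            {**row, "counts": [row["text"].split().count(w) for w in self.vocab]}
            for row in rows
        ]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)  # an Estimator becomes its model
            fitted.append(stage)
            rows = stage.transform(rows)  # output feeds the next stage
        return ToyPipelineModel(fitted)


class ToyPipelineModel:
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


train_rows = [{"text": "Spark a b"}, {"text": "c SPARK"}]
toy_model = ToyPipeline([ToyLowercaser(), ToyVocabulary()]).fit(train_rows)
toy_output = toy_model.transform([{"text": "spark spark z"}])
print(toy_output)
```

The fitted pipeline applies exactly the steps learned at training time to new data, which is why `model.transform(test)` works on the raw `test` DataFrame.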

