This notebook was started with the script `jupyspark.sh`, so the variables `spark` and `sc` are already defined.

If it had not been, I would uncomment and run the following cell:

In [2]:
# import pyspark as ps

# spark = ps.sql.SparkSession.builder \
#         .master("local[4]") \
#         .appName("df lecture") \
#         .getOrCreate()

# sc = spark.sparkContext

In [5]:
spark

In [6]:
sc

# Spark-ML Objectives

At the end of this lecture you should be able to:

1. Describe the Spark-ML API and recognize how it differs from scikit-learn.
2. Chain spark-ml Transformers and Estimators together to compose ML pipelines.

# Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-features.html


In [7]:
# read the CSV file into a DataFrame
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # first line holds the column names
                         quote='"',         # character used for quoting
                         sep=",",           # field separator
                         inferSchema=True)  # infer column types from the data

df_aapl.show(5)

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+-------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



In [8]:
from pyspark.ml.feature import MinMaxScaler, VectorAssembler

# assemble values in a vector
vectorAssembler = VectorAssembler(inputCols=["Open","High", "Low","Close"],
                                  outputCol="features")

df_vector = vectorAssembler.transform(df_aapl)
df_vector.select(['Open', 'High', 'Low', 'Close', 'features']).show(5)

+----------+----------+----------+----------+--------------------+
|      Open|      High|       Low|     Close|            features|
+----------+----------+----------+----------+--------------------+
|117.949997|118.360001|117.309998|    118.25|[117.949997,118.3...|
|117.099998|117.739998|     117.0|117.650002|[117.099998,117.7...|
|116.809998|116.910004|116.279999|116.599998|[116.809998,116.9...|
|116.860001|117.379997|116.330002|117.059998|[116.860001,117.3...|
|    117.25|117.760002|113.800003|117.120003|[117.25,117.76000...|
+----------+----------+----------+----------+--------------------+
only showing top 5 rows



In [9]:
scaler = MinMaxScaler(inputCol="features", outputCol="scaledfeatures")

# Compute summary statistics and generate MinMaxScalerModel
scalerModel = scaler.fit(df_vector)

# Rescale each feature to the range [min, max], which defaults to [0, 1].
scaledData = scalerModel.transform(df_vector)
scaledData.select("features", "scaledfeatures").show(5)

+--------------------+--------------------+
|            features|      scaledfeatures|
+--------------------+--------------------+
|[117.949997,118.3...|[0.84364622791846...|
|[117.099998,117.7...|[0.81798975110079...|
|[116.809998,116.9...|[0.80923635459429...|
|[116.860001,117.3...|[0.81074565144089...|
|[117.25,117.76000...|[0.82251743035171...|
+--------------------+--------------------+
only showing top 5 rows



In [10]:
scaledData.select("features", "scaledfeatures").first()

Row(features=DenseVector([117.95, 118.36, 117.31, 118.25]), scaledfeatures=DenseVector([0.8436, 0.8302, 0.8659, 0.866]))

In [11]:
scaledData.select("features", "scaledfeatures").first()['features']

DenseVector([117.95, 118.36, 117.31, 118.25])

In [12]:
scaledData.select("features", "scaledfeatures").first()['scaledfeatures']

DenseVector([0.8436, 0.8302, 0.8659, 0.866])
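
To make the rescaling concrete, here is a minimal plain-Python sketch of the min-max formula as I understand it from the Spark docs: `(x - col_min) / (col_max - col_min) * (out_max - out_min) + out_min`. The function name `min_max_rescale` and its parameters are my own, not part of Spark. It uses only the five `Open` values shown earlier, so the numbers will not match the notebook output, where the minimum and maximum come from the whole column.

```python
def min_max_rescale(values, out_min=0.0, out_max=1.0):
    """Linearly rescale numbers so the smallest maps to out_min
    and the largest maps to out_max (hypothetical helper, not Spark)."""
    col_min, col_max = min(values), max(values)
    if col_max == col_min:
        # Constant column: the Spark docs describe mapping this to the midpoint.
        return [0.5 * (out_min + out_max) for _ in values]
    scale = (out_max - out_min) / (col_max - col_min)
    return [(v - col_min) * scale + out_min for v in values]


# The five "Open" values printed by df_aapl.show(5) above.
open_prices = [117.949997, 117.099998, 116.809998, 116.860001, 117.25]
scaled_open = min_max_rescale(open_prices)
print(scaled_open)
```

In Spark the same split applies: `fit` finds the column minimum and maximum, and `transform` applies the formula row by row.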

# Transformers

The `VectorAssembler` class above is an example of a generic type in Spark, called a Transformer. Important things to know about this type:

* They implement a `transform` method.
* They convert one `DataFrame` into another, usually by adding columns.

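Examples of Transformers: `VectorAssembler`, `Tokenizer`, `StopWordsRemover`, `HashingTF`, and many more.

Note that scalers such as `MinMaxScaler` and `StandardScaler` are not Transformers themselves: they have to be fitted first, and the fitted model is the Transformer.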



# Estimators

According to the docs: An Estimator abstracts the concept of a learning algorithm or any algorithm that fits or trains on data. Important things to know about this type:

* They implement a `fit` method whose argument is a `DataFrame`.
* The output of `fit` is another type called `Model`, which is a `Transformer`.

Examples of Estimators: `LogisticRegression`, `DecisionTreeRegressor`, `MinMaxScaler` (used above), and many more.
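
The `fit` / `transform` contract can be sketched without Spark at all. The toy classes below (`MeanCenterer` and `MeanCentererModel`) are hypothetical names of mine, not Spark classes, and they work on a list of dicts standing in for a `DataFrame`. The point is only the shape: the Estimator learns something in `fit` and hands back a Model, and the Model adds a column in `transform`.

```python
class MeanCenterer:
    """Toy Estimator (not a Spark class): learns the mean of one column."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        values = [row[self.input_col] for row in rows]
        mean = sum(values) / len(values)
        # fit() returns a separate Model object that holds what was learned.
        return MeanCentererModel(self.input_col, self.output_col, mean)


class MeanCentererModel:
    """Toy Model, which is a Transformer: applies the learned mean."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are untouched.
        return [{**row, self.output_col: row[self.input_col] - self.mean}
                for row in rows]


rows = [{"Close": 118.0}, {"Close": 117.0}, {"Close": 116.0}]
centerer_model = MeanCenterer("Close", "centered").fit(rows)
centered_rows = centerer_model.transform(rows)
print(centered_rows)
```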


# Pipelines

Many Data Science workflows can be described as sequential application of various `Transformers` and `Estimators`.

![http://spark.apache.org/docs/latest/img/ml-Pipeline.png](http://spark.apache.org/docs/latest/img/ml-Pipeline.png)

Let's see two ways to implement the above flow!

In [13]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a b c c d d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g g h", 1.0),
    (3, "hadoop a mapreduce g", 0.0)
], ["id", "text", "label"])

In [14]:
training.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0| a b c c d d e spark|  1.0|
|  1|                 b d|  0.0|
|  2|       spark f g g h|  1.0|
|  3|hadoop a mapreduce g|  0.0|
+---+--------------------+-----+



In [15]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokens = tokenizer.transform(training)
tokens.show(5)

+---+--------------------+-----+--------------------+
| id|                text|label|               words|
+---+--------------------+-----+--------------------+
|  0| a b c c d d e spark|  1.0|[a, b, c, c, d, d...|
|  1|                 b d|  0.0|              [b, d]|
|  2|       spark f g g h|  1.0| [spark, f, g, g, h]|
|  3|hadoop a mapreduce g|  0.0|[hadoop, a, mapre...|
+---+--------------------+-----+--------------------+



In [16]:
hashingTF = HashingTF(inputCol="words", outputCol="features")
hashes = hashingTF.transform(tokens)
hashes.show(5)

+---+--------------------+-----+--------------------+--------------------+
| id|                text|label|               words|            features|
+---+--------------------+-----+--------------------+--------------------+
|  0| a b c c d d e spark|  1.0|[a, b, c, c, d, d...|(262144,[17222,27...|
|  1|                 b d|  0.0|              [b, d]|(262144,[27526,30...|
|  2|       spark f g g h|  1.0| [spark, f, g, g, h]|(262144,[15554,24...|
|  3|hadoop a mapreduce g|  0.0|[hadoop, a, mapre...|(262144,[42633,51...|
+---+--------------------+-----+--------------------+--------------------+



In [17]:
hashes.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(262144,[17222,27...|
|(262144,[27526,30...|
|(262144,[15554,24...|
|(262144,[42633,51...|
+--------------------+



In [18]:
hashes.select('features').take(5)

[Row(features=SparseVector(262144, {17222: 1.0, 27526: 2.0, 28698: 2.0, 30913: 1.0, 227410: 1.0, 234657: 1.0})),
 Row(features=SparseVector(262144, {27526: 1.0, 30913: 1.0})),
 Row(features=SparseVector(262144, {15554: 1.0, 24152: 1.0, 51505: 2.0, 234657: 1.0})),
 Row(features=SparseVector(262144, {42633: 1.0, 51505: 1.0, 155117: 1.0, 227410: 1.0}))]

In [19]:
print(hashes.select('features').first()['features'])

(262144,[17222,27526,28698,30913,227410,234657],[1.0,2.0,2.0,1.0,1.0,1.0])
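
The sparse vector above comes from the hashing trick: each word is hashed to an index below the vector size (262144, which is 2**18), and the value at that index counts how often the word occurs. Here is a minimal plain-Python sketch. The helper `hashing_tf` is my own name, and I use CRC32 as a stand-in hash, so the indices will not match the ones Spark produces.

```python
import zlib
from collections import Counter

NUM_FEATURES = 262144  # 2**18, the vector size shown in the output above


def hashing_tf(words, num_features=NUM_FEATURES):
    """Return sparse term frequencies as {index: count} (hypothetical helper)."""
    counts = Counter()
    for word in words:
        # Stand-in hash: CRC32. Spark uses its own hash function,
        # so these indices differ from the notebook output.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(sorted(counts.items()))


doc = "a b c c d d e spark".split()
sparse = hashing_tf(doc)
print(sparse)
```

No vocabulary has to be learned, which is why `HashingTF` is a Transformer rather than an Estimator. The trade-off is that two different words can land on the same index.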


In [20]:
lr = LogisticRegression(maxIter=10, 
                        regParam=0.001, 
                        featuresCol='features',
                        labelCol='label',
                        predictionCol='prediction',
                        probabilityCol='probability')
# These last four keywords are the defaults!
# I've written them out here for clarity

logistic_model = lr.fit(hashes) 

In [21]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "spark hadoop spark"),
    (7, "apache hadoop")
], ["id", "text"])

# What do we need to do to this to get a prediction?

In [22]:
# Why doesn't this work?

#logistic_model.transform(test)

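The fitted model expects a `features` column, which the raw `test` DataFrame does not have. We need to run our test data through the same tokenizer and hashingTF steps that we used on the training data!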

In [23]:
test_tokens = tokenizer.transform(test)
test_vectors = hashingTF.transform(test_tokens)
test_output = logistic_model.transform(test_vectors)

In [24]:
test_output.show(5)

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|       spark i j k|    [spark, i, j, k]|(262144,[20197,24...|[-1.5983128691464...|[0.16821754677494...|       1.0|
|  5|             l m n|           [l, m, n]|(262144,[18910,10...|[2.46049217146948...|[0.92132534525405...|       0.0|
|  6|spark hadoop spark|[spark, hadoop, s...|(262144,[155117,2...|[-3.1687538922198...|[0.04035864933082...|       1.0|
|  7|     apache hadoop|    [apache, hadoop]|(262144,[66695,15...|[4.94885618901197...|[0.99295841983053...|       0.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+



In [25]:
test_output.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = true)



In [26]:
test_output.select('text', 'probability','prediction').show(5)

+------------------+--------------------+----------+
|              text|         probability|prediction|
+------------------+--------------------+----------+
|       spark i j k|[0.16821754677494...|       1.0|
|             l m n|[0.92132534525405...|       0.0|
|spark hadoop spark|[0.04035864933082...|       1.0|
|     apache hadoop|[0.99295841983053...|       0.0|
+------------------+--------------------+----------+



In [27]:
test_output.select('probability').take(5)

[Row(probability=DenseVector([0.1682, 0.8318])),
 Row(probability=DenseVector([0.9213, 0.0787])),
 Row(probability=DenseVector([0.0404, 0.9596])),
 Row(probability=DenseVector([0.993, 0.007]))]

In [28]:
test_output.select('rawPrediction').take(5)

[Row(rawPrediction=DenseVector([-1.5983, 1.5983])),
 Row(rawPrediction=DenseVector([2.4605, -2.4605])),
 Row(rawPrediction=DenseVector([-3.1688, 3.1688])),
 Row(rawPrediction=DenseVector([4.9489, -4.9489]))]
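
The `rawPrediction` and `probability` columns are related: for this binary model, each probability looks like the logistic (sigmoid) function applied to the matching raw value, and the prediction is the index of the larger probability. The sketch below checks that by hand with the rounded values printed above. The names `sigmoid`, `probabilities` and `predictions` are mine.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# rawPrediction vectors copied (rounded) from the output above.
raw_predictions = [
    [-1.5983, 1.5983],
    [2.4605, -2.4605],
    [-3.1688, 3.1688],
    [4.9489, -4.9489],
]

probabilities = []
predictions = []
for raw in raw_predictions:
    probability = [sigmoid(r) for r in raw]
    probabilities.append(probability)
    # The predicted label is the index of the largest probability.
    predictions.append(float(probability.index(max(probability))))

for probability, prediction in zip(probabilities, predictions):
    print([round(p, 4) for p in probability], prediction)
```

Up to rounding, the printed pairs should agree with the `probability` and `prediction` columns shown earlier.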

## Alternatively

In [29]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

In [30]:
# How can we apply this to our test data?
prediction = model.transform(test)
prediction.select(['features', 'prediction', 'probability']).show()

+--------------------+----------+--------------------+
|            features|prediction|         probability|
+--------------------+----------+--------------------+
|(262144,[20197,24...|       1.0|[0.16821754677494...|
|(262144,[18910,10...|       0.0|[0.92132534525405...|
|(262144,[155117,2...|       1.0|[0.04035864933082...|
|(262144,[66695,15...|       0.0|[0.99295841983053...|
+--------------------+----------+--------------------+
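
Why does the pipeline version need no manual tokenizing or hashing of the test data? Conceptually, fitting a pipeline walks through the stages in order, fits every Estimator on the data as transformed so far, and keeps the resulting Transformers so they can be replayed later. The sketch below is a simplified plain-Python illustration of that loop, not Spark's implementation. All class and function names are hypothetical, and lists of strings stand in for DataFrames.

```python
class Lowercase:
    """Toy Transformer: only has transform()."""

    def transform(self, docs):
        return [d.lower() for d in docs]


class VocabularyBuilder:
    """Toy Estimator: fit() learns a vocabulary and returns a model."""

    def fit(self, docs):
        vocab = sorted({w for d in docs for w in d.split()})
        return VocabularyModel(vocab)


class VocabularyModel:
    """Toy Model: counts, per document, how often each known word occurs."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, docs):
        return [[d.split().count(w) for w in self.vocab] for d in docs]


def fit_pipeline(stages, data):
    """Fit stages in order, feeding each one the previous stage's output."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(data)  # Estimator becomes a fitted Model
        fitted.append(stage)
        data = stage.transform(data)
    return fitted


def transform_pipeline(fitted_stages, data):
    """Replay every fitted stage on new data, in the same order."""
    for stage in fitted_stages:
        data = stage.transform(data)
    return data


train_docs = ["Spark f g", "hadoop a g"]
fitted_stages = fit_pipeline([Lowercase(), VocabularyBuilder()], train_docs)
test_vectors = transform_pipeline(fitted_stages, ["spark hadoop spark", "l m n"])
print(fitted_stages[1].vocab, test_vectors)
```

This is the reason `model.transform(test)` works directly on the raw `text` column above: the fitted pipeline carries its preprocessing stages with it.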

