If this notebook was started with the script `jupyspark.sh`, the variables `spark` and `sc` will already be defined.

If not, run the following cell:

In [1]:
import pyspark as ps

spark = (
        ps.sql.SparkSession.builder 
        .master("local[4]") 
        .appName("lecture") 
        .getOrCreate()
        )

sc = spark.sparkContext

In [2]:
spark

In [3]:
sc

# Spark-ML Objectives

At the end of this lecture you should be able to:

1. Describe the Spark-ML API and recognize how it differs from scikit-learn.
2. Chain spark-ml Transformers and Estimators together to compose ML pipelines.

# Machine Learning on DataFrames

http://spark.apache.org/docs/latest/ml-pipeline.html

Spark's machine learning pipeline is organized around three main components:

- Spark Dataframes: data with a schema
- ***transformer*** objects: anything that has a `.transform()` method. It takes a Spark Dataframe as input and returns a new Spark Dataframe
- ***estimator*** objects: anything that has a `.fit()` method. It takes a Spark Dataframe as input and returns a ***transformer***
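
The contract between the two object types can be sketched in plain Python, with no Spark required. The class names `ToyMeanCenterer` and `ToyMeanCenterModel` are hypothetical stand-ins, not part of the Spark API, and a list of floats stands in for a DataFrame column. The point is only that `fit()` learns something from the data and hands back an object whose `transform()` applies it.

```python
class ToyMeanCenterModel:
    """Toy 'transformer': applies a mean that was learned earlier."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, values):
        # Return a new list; the input is left unchanged,
        # just as Spark's transform() returns a new DataFrame.
        return [v - self.mean for v in values]


class ToyMeanCenterer:
    """Toy 'estimator': learns the mean, then returns a transformer."""

    def fit(self, values):
        mean = sum(values) / len(values)
        return ToyMeanCenterModel(mean)


train = [1.0, 2.0, 3.0]
model = ToyMeanCenterer().fit(train)   # estimator -> transformer
centered = model.transform(train)      # [-1.0, 0.0, 1.0]
new_data = model.transform([10.0])     # reuses the mean learned from train
```

Note that `new_data` is shifted by the mean of `train`, not by its own mean: whatever `fit()` learned is frozen inside the returned transformer.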

In [4]:
# read CSV
df_aapl = spark.read.csv('data/aapl.csv',
                         header=True,       # use headers or not
                         quote='"',         # char for quotes
                         sep=",",           # char for separation
                         inferSchema=True)  # do we infer schema or not ?

df_aapl.show(5)

+-------------------+----------+----------+----------+----------+--------+----------+
|               Date|      Open|      High|       Low|     Close|  Volume| Adj Close|
+-------------------+----------+----------+----------+----------+--------+----------+
|2016-10-25 00:00:00|117.949997|118.360001|117.309998|    118.25|39190300|    118.25|
|2016-10-24 00:00:00|117.099998|117.739998|     117.0|117.650002|23538700|117.650002|
|2016-10-21 00:00:00|116.809998|116.910004|116.279999|116.599998|23192700|116.599998|
|2016-10-20 00:00:00|116.860001|117.379997|116.330002|117.059998|24125800|117.059998|
|2016-10-19 00:00:00|    117.25|117.760002|113.800003|117.120003|20034600|117.120003|
+-------------------+----------+----------+----------+----------+--------+----------+
only showing top 5 rows



Spark's machine learning pipeline doesn't quite operate on a simple dataframe where each column contains a feature. Instead, you must explicitly construct a `features` column that contains a **vector** of the relevant values. 

To do so, Spark has a **transformer** called `VectorAssembler`. Create a new instance of `VectorAssembler`, specifying the list of columns to concatenate and the name of the output column that will hold the vectors (`features` is a good choice). Then pass your dataframe to the `transform()` method, which returns a new dataframe.

In [5]:
from pyspark.ml.feature import VectorAssembler

# assemble values in a vector
vec_assembler = VectorAssembler(inputCols=["Open","High", "Low","Close"],
                                  outputCol="features")

df_vector = vec_assembler.transform(df_aapl)

df_vector.select(['Open', 'High', 'Low', 'Close', 'features']).show(5)

print("*"*75)

df_vector.select('features').show(5)

print("*"*75)

df_vector.select('features').take(5)

+----------+----------+----------+----------+--------------------+
|      Open|      High|       Low|     Close|            features|
+----------+----------+----------+----------+--------------------+
|117.949997|118.360001|117.309998|    118.25|[117.949997,118.3...|
|117.099998|117.739998|     117.0|117.650002|[117.099998,117.7...|
|116.809998|116.910004|116.279999|116.599998|[116.809998,116.9...|
|116.860001|117.379997|116.330002|117.059998|[116.860001,117.3...|
|    117.25|117.760002|113.800003|117.120003|[117.25,117.76000...|
+----------+----------+----------+----------+--------------------+
only showing top 5 rows

***************************************************************************
+--------------------+
|            features|
+--------------------+
|[117.949997,118.3...|
|[117.099998,117.7...|
|[116.809998,116.9...|
|[116.860001,117.3...|
|[117.25,117.76000...|
+--------------------+
only showing top 5 rows

***************************************************************************

[Row(features=DenseVector([117.95, 118.36, 117.31, 118.25])),
 Row(features=DenseVector([117.1, 117.74, 117.0, 117.65])),
 Row(features=DenseVector([116.81, 116.91, 116.28, 116.6])),
 Row(features=DenseVector([116.86, 117.38, 116.33, 117.06])),
 Row(features=DenseVector([117.25, 117.76, 113.8, 117.12]))]

Here's an example of an **estimator**: `MinMaxScaler`. `fit()` takes a dataframe, stores the min & max value for each feature, and returns a **transformer** object that has the `.transform()` method.

In [6]:
from pyspark.ml.feature import MinMaxScaler

scaler = MinMaxScaler(inputCol="features", outputCol="scaledfeatures")

# Compute summary statistics and generate MinMaxScalerModel
scaler_transformer = scaler.fit(df_vector)

# rescale each feature to the range [min, max] ([0, 1] by default)
scaled_data = scaler_transformer.transform(df_vector)


scaled_data.select("features", "scaledfeatures").show(5)

print("*"*75)

scaled_data.select("scaledfeatures").take(5)

+--------------------+--------------------+
|            features|      scaledfeatures|
+--------------------+--------------------+
|[117.949997,118.3...|[0.84364622791846...|
|[117.099998,117.7...|[0.81798975110079...|
|[116.809998,116.9...|[0.80923635459429...|
|[116.860001,117.3...|[0.81074565144089...|
|[117.25,117.76000...|[0.82251743035171...|
+--------------------+--------------------+
only showing top 5 rows

***************************************************************************


[Row(scaledfeatures=DenseVector([0.8436, 0.8302, 0.8659, 0.866])),
 Row(scaledfeatures=DenseVector([0.818, 0.8109, 0.8563, 0.8473])),
 Row(scaledfeatures=DenseVector([0.8092, 0.7851, 0.8339, 0.8148])),
 Row(scaledfeatures=DenseVector([0.8107, 0.7997, 0.8355, 0.829])),
 Row(scaledfeatures=DenseVector([0.8225, 0.8115, 0.7568, 0.8309]))]

In [7]:
scaled_data.select("features", "scaledfeatures").first()

Row(features=DenseVector([117.95, 118.36, 117.31, 118.25]), scaledfeatures=DenseVector([0.8436, 0.8302, 0.8659, 0.866]))

In [8]:
scaled_data.select("features", "scaledfeatures").first()['features']

DenseVector([117.95, 118.36, 117.31, 118.25])

In [9]:
scaled_data.select("features", "scaledfeatures").first()['scaledfeatures']

DenseVector([0.8436, 0.8302, 0.8659, 0.866])
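
As a rough illustration of the arithmetic behind `MinMaxScaler`, the plain-Python sketch below rescales one feature column to [0, 1], Spark's default output range. The function name `min_max_scale` and the sample values are made up for illustration (they are not the AAPL data). For a constant column the sketch returns the midpoint of the output range, which is the rule Spark's documentation describes for that case.

```python
def min_max_scale(values, out_min=0.0, out_max=1.0):
    """Rescale values linearly so min(values) -> out_min and max(values) -> out_max."""
    lo, hi = min(values), max(values)   # what fit() would learn and store
    if hi == lo:
        # Constant column: no spread to rescale, so use the midpoint of the range.
        return [0.5 * (out_min + out_max) for _ in values]
    return [(v - lo) / (hi - lo) * (out_max - out_min) + out_min for v in values]


closes = [100.0, 110.0, 120.0]          # made-up prices
scaled = min_max_scale(closes)          # [0.0, 0.5, 1.0]
```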

# Transformers & Estimators

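Spark ships with many built-in transformers and estimators for extracting, transforming, and selecting features. The full list is here: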

http://spark.apache.org/docs/latest/ml-features.html

# Pipelines

Many Data Science workflows can be described as sequential application of various `Transformers` and `Estimators`.

![](images/ml-Pipeline.png)


source: http://spark.apache.org/docs/latest/img/

Let's see two ways to implement the above flow!

In [10]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import CountVectorizer, Tokenizer

# Prepare training documents from a list of (id, text, label) tuples.
training = spark.createDataFrame([
    (0, "a a a b b c a d spark", 1.0),
    (1, "b c c c d c c a", 0.0),
    (2, "spark spark a a c spam", 1.0),
    (3, "c d d b d spam", 0.0)
], ["id", "text", "label"])

In [11]:
training.show(5)

+---+--------------------+-----+
| id|                text|label|
+---+--------------------+-----+
|  0|a a a b b c a d s...|  1.0|
|  1|     b c c c d c c a|  0.0|
|  2|spark spark a a c...|  1.0|
|  3|      c d d b d spam|  0.0|
+---+--------------------+-----+



In [12]:
tokenizer = Tokenizer(inputCol="text", outputCol="words")
tokens = tokenizer.transform(training)
tokens.show(5)

+---+--------------------+-----+--------------------+
| id|                text|label|               words|
+---+--------------------+-----+--------------------+
|  0|a a a b b c a d s...|  1.0|[a, a, a, b, b, c...|
|  1|     b c c c d c c a|  0.0|[b, c, c, c, d, c...|
|  2|spark spark a a c...|  1.0|[spark, spark, a,...|
|  3|      c d d b d spam|  0.0|[c, d, d, b, d, s...|
+---+--------------------+-----+--------------------+



In [13]:
cv = CountVectorizer(inputCol="words", outputCol="features")
cv_model = cv.fit(tokens)
cv_df = cv_model.transform(tokens)
cv_df.show(5)

+---+--------------------+-----+--------------------+--------------------+
| id|                text|label|               words|            features|
+---+--------------------+-----+--------------------+--------------------+
|  0|a a a b b c a d s...|  1.0|[a, a, a, b, b, c...|(6,[0,1,2,3,4],[1...|
|  1|     b c c c d c c a|  0.0|[b, c, c, c, d, c...|(6,[0,1,2,3],[5.0...|
|  2|spark spark a a c...|  1.0|[spark, spark, a,...|(6,[0,1,4,5],[1.0...|
|  3|      c d d b d spam|  0.0|[c, d, d, b, d, s...|(6,[0,2,3,5],[1.0...|
+---+--------------------+-----+--------------------+--------------------+



In [14]:
cv_df.select('features').show(5)

+--------------------+
|            features|
+--------------------+
|(6,[0,1,2,3,4],[1...|
|(6,[0,1,2,3],[5.0...|
|(6,[0,1,4,5],[1.0...|
|(6,[0,2,3,5],[1.0...|
+--------------------+



In [15]:
cv_df.select('features').take(5)

[Row(features=SparseVector(6, {0: 1.0, 1: 4.0, 2: 1.0, 3: 2.0, 4: 1.0})),
 Row(features=SparseVector(6, {0: 5.0, 1: 1.0, 2: 1.0, 3: 1.0})),
 Row(features=SparseVector(6, {0: 1.0, 1: 2.0, 4: 2.0, 5: 1.0})),
 Row(features=SparseVector(6, {0: 1.0, 2: 3.0, 3: 1.0, 5: 1.0}))]

In [16]:
print(cv_df.select('features').take(5)[0]['features'])

(6,[0,1,2,3,4],[1.0,4.0,1.0,2.0,1.0])
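
To see where those indices and counts come from, here is a plain-Python sketch of the idea behind `CountVectorizer`. It assumes the vocabulary is ordered by total term frequency across the corpus (the most frequent term gets index 0), which is consistent with the output above. The helper names `build_vocab` and `to_sparse` are hypothetical, and tie-breaking between equally frequent terms is not modeled.

```python
from collections import Counter

docs = [
    "a a a b b c a d spark",
    "b c c c d c c a",
    "spark spark a a c spam",
    "c d d b d spam",
]


def build_vocab(tokenized_docs):
    """'fit' step: map each term to an index, most frequent term first."""
    totals = Counter(tok for doc in tokenized_docs for tok in doc)
    return {term: i for i, (term, _) in enumerate(totals.most_common())}


def to_sparse(tokens, vocab):
    """'transform' step: {index: count}, skipping terms not in the vocabulary."""
    counts = Counter(vocab[tok] for tok in tokens if tok in vocab)
    return dict(sorted(counts.items()))


tokenized = [doc.split() for doc in docs]
vocab = build_vocab(tokenized)
first = to_sparse(tokenized[0], vocab)        # {0: 1, 1: 4, 2: 1, 3: 2, 4: 1}
unseen = to_sparse("c c c p".split(), vocab)  # 'p' was never seen: {0: 3}
```

The sparse form `(6,[0,1,2,3,4],[1.0,4.0,1.0,2.0,1.0])` reads as: vector length 6, non-zero positions 0 through 4, and the counts at those positions. A term that was not seen during `fit()` has no index, so it is dropped at transform time.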


In [17]:
lr = LogisticRegression(maxIter=10, 
                        regParam=0.001, 
                        featuresCol='features',
                        labelCol='label',
                        predictionCol='prediction',
                        probabilityCol='probability')
# These last four keywords are the defaults!
# I've written them out here for clarity

logistic_model = lr.fit(cv_df)

In [18]:
train_predictions = logistic_model.transform(cv_df)

train_predictions.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- label: double (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [19]:
train_predictions.show()

+---+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
| id|                text|label|               words|            features|       rawPrediction|         probability|prediction|
+---+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+
|  0|a a a b b c a d s...|  1.0|[a, a, a, b, b, c...|(6,[0,1,2,3,4],[1...|[-5.6771333894961...|[0.00341167842863...|       1.0|
|  1|     b c c c d c c a|  0.0|[b, c, c, c, d, c...|(6,[0,1,2,3],[5.0...|[5.48572083680497...|[0.99587156886007...|       0.0|
|  2|spark spark a a c...|  1.0|[spark, spark, a,...|(6,[0,1,4,5],[1.0...|[-6.0274989763248...|[0.00240571627821...|       1.0|
|  3|      c d d b d spam|  0.0|[c, d, d, b, d, s...|(6,[0,2,3,5],[1.0...|[5.79796408356077...|[0.99697545077755...|       0.0|
+---+--------------------+-----+--------------------+--------------------+--------------------+--------------------+----------+

In [20]:
train_predictions[['text', 'label', 'rawPrediction', 'probability', 'prediction']].show()

+--------------------+-----+--------------------+--------------------+----------+
|                text|label|       rawPrediction|         probability|prediction|
+--------------------+-----+--------------------+--------------------+----------+
|a a a b b c a d s...|  1.0|[-5.6771333894961...|[0.00341167842863...|       1.0|
|     b c c c d c c a|  0.0|[5.48572083680497...|[0.99587156886007...|       0.0|
|spark spark a a c...|  1.0|[-6.0274989763248...|[0.00240571627821...|       1.0|
|      c d d b d spam|  0.0|[5.79796408356077...|[0.99697545077755...|       0.0|
+--------------------+-----+--------------------+--------------------+----------+



In [21]:
train_predictions[['probability']].take(5)

[Row(probability=DenseVector([0.0034, 0.9966])),
 Row(probability=DenseVector([0.9959, 0.0041])),
 Row(probability=DenseVector([0.0024, 0.9976])),
 Row(probability=DenseVector([0.997, 0.003]))]

In [22]:
train_predictions[['rawPrediction']].take(5)

[Row(rawPrediction=DenseVector([-5.6771, 5.6771])),
 Row(rawPrediction=DenseVector([5.4857, -5.4857])),
 Row(rawPrediction=DenseVector([-6.0275, 6.0275])),
 Row(rawPrediction=DenseVector([5.798, -5.798]))]
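
For binary logistic regression, the two columns above are related by the logistic (sigmoid) function: `rawPrediction` holds the margin as `[-m, m]`, and each entry of `probability` is the sigmoid of the matching entry. The sketch below checks this against the rounded values of the first training row. `sigmoid` is a helper defined here, not a Spark function, and picking the larger probability assumes the default threshold of 0.5.

```python
import math


def sigmoid(x):
    """Logistic function: maps a real-valued margin to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


raw = [-5.6771, 5.6771]                         # rawPrediction for row 0 (rounded)
prob = [sigmoid(r) for r in raw]                # close to [0.0034, 0.9966]
predicted_label = float(prob.index(max(prob)))  # index of the larger probability
```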

In [23]:
# Prepare test documents, which are unlabeled (id, text) tuples.
test = spark.createDataFrame([
    (4, "spark a a a a"),
    (5, "c c c p"),
    (6, "spark spam spark a"),
    (7, "a a a c c c")
], ["id", "text"])

# What do we need to do to this to get a prediction?

In [24]:
# Why doesn't this work?

#logistic_model.transform(test)

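That call fails because `test` has only `id` and `text` columns, while the model expects a `features` column. We need to transform the test data with the same steps we applied to the training data, reusing the already-fitted `cv_model` so the vocabulary stays the same!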

In [25]:
test_tokens = tokenizer.transform(test)
test_vectors = cv_model.transform(test_tokens)
test_output = logistic_model.transform(test_vectors)

In [26]:
test_output.show()

+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
| id|              text|               words|            features|       rawPrediction|         probability|prediction|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+
|  4|     spark a a a a| [spark, a, a, a, a]| (6,[1,4],[4.0,1.0])|[-7.7612663456679...|[4.25735553078517...|       1.0|
|  5|           c c c p|        [c, c, c, p]|       (6,[0],[3.0])|[3.51073257971980...|[0.97099160578993...|       0.0|
|  6|spark spam spark a|[spark, spam, spa...|(6,[1,4,5],[1.0,2...|[-5.6987077106472...|[0.00333910522987...|       1.0|
|  7|       a a a c c c|  [a, a, a, c, c, c]| (6,[0,1],[3.0,3.0])|[-0.6949249969721...|[0.33293838014924...|       1.0|
+---+------------------+--------------------+--------------------+--------------------+--------------------+----------+



In [27]:
test_output.select('text','rawPrediction','probability','prediction').show(5)

+------------------+--------------------+--------------------+----------+
|              text|       rawPrediction|         probability|prediction|
+------------------+--------------------+--------------------+----------+
|     spark a a a a|[-7.7612663456679...|[4.25735553078517...|       1.0|
|           c c c p|[3.51073257971980...|[0.97099160578993...|       0.0|
|spark spam spark a|[-5.6987077106472...|[0.00333910522987...|       1.0|
|       a a a c c c|[-0.6949249969721...|[0.33293838014924...|       1.0|
+------------------+--------------------+--------------------+----------+



In [28]:
test_output.printSchema()

root
 |-- id: long (nullable = true)
 |-- text: string (nullable = true)
 |-- words: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- features: vector (nullable = true)
 |-- rawPrediction: vector (nullable = true)
 |-- probability: vector (nullable = true)
 |-- prediction: double (nullable = false)



In [29]:
test_output.select('text', 'probability','prediction').show(5)

+------------------+--------------------+----------+
|              text|         probability|prediction|
+------------------+--------------------+----------+
|     spark a a a a|[4.25735553078517...|       1.0|
|           c c c p|[0.97099160578993...|       0.0|
|spark spam spark a|[0.00333910522987...|       1.0|
|       a a a c c c|[0.33293838014924...|       1.0|
+------------------+--------------------+----------+



In [30]:
test_output.select('probability').take(5)

[Row(probability=DenseVector([0.0004, 0.9996])),
 Row(probability=DenseVector([0.971, 0.029])),
 Row(probability=DenseVector([0.0033, 0.9967])),
 Row(probability=DenseVector([0.3329, 0.6671]))]

In [31]:
test_output.select('rawPrediction').take(5)

[Row(rawPrediction=DenseVector([-7.7613, 7.7613])),
 Row(rawPrediction=DenseVector([3.5107, -3.5107])),
 Row(rawPrediction=DenseVector([-5.6987, 5.6987])),
 Row(rawPrediction=DenseVector([-0.6949, 0.6949]))]

## Alternatively: put all these steps in a Pipeline

In [32]:
# Configure an ML pipeline, which consists of three stages: tokenizer, cv (CountVectorizer), and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
cv = CountVectorizer(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, cv, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(training)

In [33]:
# How can we apply this to our test data?
prediction = model.transform(test)
prediction.select(['features', 'prediction', 'probability']).show()

+--------------------+----------+--------------------+
|            features|prediction|         probability|
+--------------------+----------+--------------------+
| (6,[1,4],[4.0,1.0])|       1.0|[4.25735553078517...|
|       (6,[0],[3.0])|       0.0|[0.97099160578993...|
|(6,[1,4,5],[1.0,2...|       1.0|[0.00333910522987...|
| (6,[0,1],[3.0,3.0])|       1.0|[0.33293838014924...|
+--------------------+----------+--------------------+
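
Conceptually, `Pipeline.fit()` walks through the stages in order: a transformer is applied directly, while an estimator is first fit on the data produced so far and then replaced by the fitted model it returns. The plain-Python sketch below mimics that loop on lists of strings. All names in it (`ToySplitter`, `ToyVocab`, `ToyVocabModel`, `toy_pipeline_fit`, `toy_pipeline_transform`) are hypothetical; it illustrates the idea and is not Spark's implementation.

```python
class ToySplitter:
    """Toy transformer: splits each string into tokens."""

    def transform(self, rows):
        return [row.split() for row in rows]


class ToyVocabModel:
    """Toy fitted model: keeps only tokens seen during fit()."""

    def __init__(self, vocab):
        self.vocab = vocab

    def transform(self, rows):
        return [[tok for tok in row if tok in self.vocab] for row in rows]


class ToyVocab:
    """Toy estimator: learns the set of tokens in the training rows."""

    def fit(self, rows):
        return ToyVocabModel({tok for row in rows for tok in row})


def toy_pipeline_fit(stages, rows):
    """Fit estimators in order; return the list of fitted transformers."""
    fitted = []
    for stage in stages:
        # Estimators expose fit(); plain transformers are used as they are.
        model = stage.fit(rows) if hasattr(stage, "fit") else stage
        rows = model.transform(rows)   # feed the output to the next stage
        fitted.append(model)
    return fitted


def toy_pipeline_transform(fitted, rows):
    """Apply every fitted stage in order, as a fitted pipeline would."""
    for model in fitted:
        rows = model.transform(rows)
    return rows


fitted = toy_pipeline_fit([ToySplitter(), ToyVocab()], ["a b", "b c"])
result = toy_pipeline_transform(fitted, ["c c c p"])   # [['c', 'c', 'c']]
```

Because the fitted stages travel together, new data goes through exactly the same steps as the training data, which is what the manual version had to arrange by hand.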



# Afternoon: Spinning up a cluster

To do this the long way, see this walkthrough: https://github.com/gSchool/dsi-spark-aws/blob/master/pair_part1.md

To do this the short way, we've written a script that uses the AWS CLI to start up a cluster:
https://github.com/gSchool/dsi-spark-aws/blob/master/scripts/launch_cluster.sh

The script requires you to have:
- AWS CLI set up
- an S3 bucket
- a PEM key pair, with the PEM file stored in `~/.ssh/` (if you need to create one, go [here](https://console.aws.amazon.com/ec2/v2/home#KeyPairs))
- the accompanying file [`bootstrap-emr.sh`](https://github.com/gSchool/dsi-spark-aws/blob/master/scripts/bootstrap-emr.sh) in the same folder as `launch_cluster.sh`

When running the script, you specify the name of the bucket, the name of the PEM key, and the number of worker nodes to have in your cluster. e.g.,
```bash
bash launch_cluster.sh mybucket mypem 4
```

### AWS Command Line interface

``` pip install awscli ```

``` aws configure ```

 - leave `AWS Access Key ID` and `AWS Secret Access Key` as `None`, since you should have already put them in your `~/.bash_profile` (`~/.bashrc` on Linux)
 - make sure `Default region name` matches the location of your cluster. [This page](https://www.npmjs.com/package/aws-regions) lists region codes.
 - leave `Default output format` unchanged

### S3 buckets with AWS CLI
- Create a bucket: 
  - `aws s3 mb s3://mynewbucketname`
- List files in a bucket: 
  - `aws s3 ls s3://bucketname`
- Copy local file to bucket: 
  - `aws s3 cp path/to/localfile s3://mybucketname`
- Copy from bucket to local current directory: 
  - `aws s3 cp s3://mybucket/path/to/file .`
- [AWS CLI S3 management reference](https://docs.aws.amazon.com/cli/latest/userguide/using-s3-commands.html)

### EC2: setting up ssh
- example `~/.ssh/config` entry:

```bash
# HostName: see the AWS console for the public DNS name.
# User: depending on your machine, this may be 'ubuntu' or 'ec2-user' instead.
# IdentityFile: must be the same key you chose when you set up the instance.
Host host_alias
    HostName ec2-54-219-176-90.us-west-1.compute.amazonaws.com
    User hadoop
    IdentityFile ~/.ssh/key_file.pem

```

- logging in to remote terminal:
  - `ssh host_alias`
    - (This is shorthand for `ssh -i ~/.ssh/key_file.pem <User>@<HostName>`)
- copying a file to a remote machine's home directory (note the colon!)
  - `scp path/to/local/file host_alias:`
    - (This is shorthand for `scp -i ~/.ssh/key_file.pem path/to/local/file <User>@<HostName>:`)
- copying a file to a remote machine
  - `scp path/to/local/file host_alias:path/to/target/directory`


### tmux: set it and forget it

Many processes are tied to your terminal. If you `ssh` into your remote machine and run a process in that terminal, that process will usually die if you suddenly lose your connection.

`tmux` is a tool for "terminal multiplexing". It's great for managing many processes in many terminals. Here's how to use `tmux` to start a process in a session that keeps running after your `ssh` connection closes:

- `ssh` into your remote machine
- start a tmux session with `tmux new -s some_name`
- start your process (notebook or script or whatever)
- type `<ctrl>-b d` to detach (now you are back in your `ssh`ed terminal)
- exit or shut down or go to sleep or whatever
- `ssh` back in to your remote machine
- type `tmux a -t some_name` to check on that process
- [handy tmux reference](https://gist.github.com/MohamedAlaa/2961058)