# ⭐ Scaling Machine Learning in Three Week course 
# - Week 3:
##  Evaluate and automate with Pipelines

**Prerequisite**
Run the notebook `week-3.0-data-prep-for-training` first.


In this exercise, you will:
 * Use the bot data set
 * Evaluate machine learning models
 * Automate the process with pipelines




This exercise is part of the [Scaling Machine Learning with Spark book](https://learning.oreilly.com/library/view/scaling-machine-learning/9781098106812/)
available on the O'Reilly platform or on [Amazon](https://amzn.to/3WgHQvd).

In [1]:
from pyspark.sql import SparkSession 
from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression, LinearRegressionModel
from pyspark.ml.fpm import FPGrowth, FPGrowthModel


from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder \
    .master('local[*]') \
    .appName("eval_and_pipelines") \
    .getOrCreate()

#### Load the machine learning models:


Load the models from the previous chapter:

In [2]:
lr_model = LinearRegressionModel.load('../models/linearRegression_model')

In [3]:
fpgrowth_model = FPGrowthModel.load('../models/fpGrowth_model')

Evaluate the models:

For evaluation, load the classified test data:

In [4]:
df_test = spark.read.parquet("../datasets/classified_test_data")

[From the docs:](https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html)

While there are many different types of classification algorithms, the evaluation of classification models all shares similar principles. 

In a supervised classification problem, there exists a _true output_ and a _model-generated predicted output_ for each data row. 



## Exercise 1: Evaluate your ML model 

### RegressionEvaluator functionality:


✅ **Task :** 


Start by predicting the outcome.
Use the model's `transform` function:

```python
    model.transform(vectorOfFeatures).select('prediction').show()
```

Notice that `transform` takes a DataFrame with a vector column of features as input.

The prediction represents whether the row is a bot or not:
 * 1: bot
 * 0: human

In [5]:
from pyspark.ml.feature import VectorAssembler

test = df_test.drop('description')
vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol="features", handleInvalid = "skip")
test_df_with_vector = vecAssembler.transform(test)
test_df_with_vector.show(2)

+-----------+--------+---------------+-------------+------------+----------------+--------+--------------+------+---------------+----+---+--------------------+
|screen_name|location|followers_count|friends_count|listed_count|favourites_count|verified|statuses_count|status|default_profile|name|bot|            features|
+-----------+--------+---------------+-------------+------------+----------------+--------+--------------+------+---------------+----+---+--------------------+
|          1|       0|            736|         3482|           4|              22|       0|           681|     1|              0|   1|  0|[1.0,0.0,736.0,34...|
|          1|       0|           3437|            2|         106|               0|       0|          4356|     1|              0|   1|  0|[1.0,0.0,3437.0,2...|
+-----------+--------+---------------+-------------+------------+----------------+--------+--------------+------+---------------+----+---+--------------------+
only showing top 2 rows



In [6]:
model_test_prediction = lr_model.transform(test_df_with_vector)

In [7]:
model_test_prediction = lr_model.transform(test_df_with_vector)
model_test_prediction.select('bot','prediction').show()

+---+-------------------+
|bot|         prediction|
+---+-------------------+
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
|  0|0.11710130261697825|
+---+-------------------+
only showing top 20 rows



The model gave us a score for the chances of a specific row being
a bot. In the output above, the rows shown all score about 0.117.

It is up to us to define the **threshold** for classifying a bot.
What if it shows us 0.9? Will that satisfy us? How certain do we want to be in the classification?

**RegressionEvaluator** is the evaluator for regression-based models.

Use `RegressionEvaluator` to evaluate the model:

```python
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="bot",metricName="r2")
R2 = lr_evaluator.evaluate(model_test_prediction)
```

Check out R2:
>R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 100% indicates that the model explains all the variability of the response data around its mean.

From: [RegressionAnalysis](https://blog.minitab.com/blog/adventures-in-statistics-2/regression-analysis-how-do-i-interpret-r-squared-and-assess-the-goodness-of-fit)

Notice the `metricName` param.

`RegressionEvaluator` supports:
 * `rmse` (default): root mean squared error
 * `mse`: mean squared error
 * `r2`: R-squared metric
 * `mae`: mean absolute error
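
To make the four metric names concrete, here is a minimal sketch that computes them by hand in plain Python, using the standard textbook definitions. The labels and scores are made up for illustration, and `regression_metrics` is a hypothetical helper, not part of Spark.

```python
import math

def regression_metrics(labels, predictions):
    """Compute rmse, mse, r2 and mae for two equal-length lists of numbers."""
    n = len(labels)
    errors = [p - y for y, p in zip(labels, predictions)]
    mse = sum(e * e for e in errors) / n          # mean squared error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_label = sum(labels) / n
    ss_res = sum(e * e for e in errors)           # residual sum of squares
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"rmse": math.sqrt(mse), "mse": mse, "r2": r2, "mae": mae}

# Hypothetical labels (1 = bot, 0 = human) and model scores, made up for illustration.
labels = [0, 0, 1, 1]
predictions = [0.1, 0.3, 0.6, 0.9]

metrics = regression_metrics(labels, predictions)
for name, value in metrics.items():
    print("%s = %g" % (name, value))
```

Lower is better for `rmse`, `mse`, and `mae`, while for `r2` a value closer to 1 is better.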


**Notice!** The next cells fill missing `bot` labels with 0 and then repeat the prediction on the train data, keeping both `bot` and `prediction` to get a feel for the classifier.

In [8]:
test = model_test_prediction.fillna({'bot':0})
test.show()

+-----------+--------+---------------+-------------+------------+----------------+--------+--------------+------+---------------+----+---+--------------------+-------------------+
|screen_name|location|followers_count|friends_count|listed_count|favourites_count|verified|statuses_count|status|default_profile|name|bot|            features|         prediction|
+-----------+--------+---------------+-------------+------------+----------------+--------+--------------+------+---------------+----+---+--------------------+-------------------+
|          1|       0|            736|         3482|           4|              22|       0|           681|     1|              0|   1|  0|[1.0,0.0,736.0,34...|0.11710130261697825|
|          1|       0|           3437|            2|         106|               0|       0|          4356|     1|              0|   1|  0|[1.0,0.0,3437.0,2...|0.11710130261697825|
|          1|       1|            150|            0|          18|               6|       0|         

In [9]:
from pyspark.ml.feature import VectorAssembler

df_train = spark.read.parquet("../datasets/classified_train_data")

train = df_train.drop('description')
vecAssemblerTrain = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol="features", handleInvalid = "skip")
vecAssemblerTrain = vecAssemblerTrain.transform(train)

model_train_prediction = lr_model.transform(vecAssemblerTrain)
model_train_prediction.select('bot','prediction')

test = model_train_prediction.fillna({'bot':0})
model_train_prediction  = test

In [10]:
lr_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="bot",metricName="r2")
R2 = lr_evaluator.evaluate(model_train_prediction)

# model_train_prediction holds the train-set predictions, so the message says "train"
print("R Squared (R2) on train data = %g" % R2)

R Squared (R2) on train data = 0.00281465


Looking back at the `prediction` output, we see that the raw values don't help us much.

The `prediction` output is a continuous number (in the output above, around 0.117), not a class.


However, we expect 1 or 0: bot or human.
What can we do?
Decide on a threshold.


For example, every prediction above 0.8 is a bot, and everything at or below 0.8 is a human.

Or maybe every prediction above 0.14?
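
To see how much the choice of threshold matters, here is a minimal sketch in plain Python. The scores are made up for illustration (they are not the model's output), and `classify` is a hypothetical helper that works on a list rather than on a Spark DataFrame.

```python
def classify(scores, threshold):
    """Label a score as bot (1) when it is above the threshold, otherwise human (0)."""
    return [1 if score > threshold else 0 for score in scores]

# Hypothetical scores, made up for illustration.
scores = [0.05, 0.12, 0.15, 0.40, 0.85]

strict = classify(scores, 0.8)    # only very confident rows become bots
lenient = classify(scores, 0.14)  # many more rows become bots

print("threshold 0.8 :", strict)
print("threshold 0.14:", lenient)
```

A strict threshold flags fewer rows as bots and a lenient one flags more, so the right value depends on which mistake costs you more.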

---

✅ Task :

Use the model's summary statistics.

For example, check the RMSE (Root Mean Squared Error)
for both train and test.


Code sample:

```python
def getLRSummary(df):
    df = df.drop('description')
    vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol="features", handleInvalid = "skip")
    vecAssembler = vecAssembler.transform(df)
    output_test = vecAssembler.drop('screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name')
    output_test  = output_test.selectExpr("features", "bot as label")
    # evaluate function returns LinearRegressionSummary instance that holds the evaluate results
    return lr_model.evaluate(output_test)
```


Here are the docs for the [`r2` property](https://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegressionSummary.r2)

Check on both training and test set:

In [11]:

def getLRSummary(df):
    df = df.drop('description')
    vecAssembler = VectorAssembler(inputCols=['screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name'], outputCol="features", handleInvalid = "skip")
    vecAssembler = vecAssembler.transform(df)
    output = vecAssembler.drop('screen_name','location','followers_count','friends_count','listed_count','favourites_count','verified','statuses_count','status','default_profile','name')
    output  = output.selectExpr("features", "bot as label")
    # evaluate function returns LinearRegressionSummary instance that holds the evaluated results
    return lr_model.evaluate(output)
    

In [12]:
df_train = df_train.fillna({'bot':0})
df_test = df_test.fillna({'bot':0})


In [13]:
train_results = getLRSummary(df_train)
print("Root Mean Squared Error (RMSE) on train data = %g" % train_results.rootMeanSquaredError)

test_results = getLRSummary(df_test)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_results.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on train data = 0.451414
Root Mean Squared Error (RMSE) on test data = 0.775022


`LinearRegressionSummary` gives you a summary of the statistical algorithm evaluations.

In [14]:
print("r2 on test data = %g" % test_results.r2)
print("r2 on train data = %g" % train_results.r2)

r2 on test data = 0.000431035
r2 on train data = 0.00281465


#### What did you learn?
What other evaluation metrics can you get out of a `LinearRegressionSummary` instance?



**Reminder**
What is r2? R-squared:
>R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression.

R-squared measures how much of the variability in `bot` / `label` can be explained by the model.
We must be cautious about performance measured only on the training set, so that we avoid overfitting the model to the training set.
Overfitting can create a model that is good only for the training set and not for the test set.
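
An r2 close to 0 has a simple reading: the model explains about as much as always predicting the mean label. The sketch below shows this in plain Python with made-up labels. `r_squared` is a hypothetical helper using the textbook definition, not a Spark call.

```python
def r_squared(labels, predictions):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Hypothetical labels, made up for illustration: one bot among four rows.
labels = [0, 0, 0, 1]

# A "model" that returns the same number for every row (the mean label).
constant = [sum(labels) / len(labels)] * len(labels)
r2_constant = r_squared(labels, constant)

# A "model" whose scores actually follow the labels.
informative = [0.1, 0.0, 0.2, 0.9]
r2_informative = r_squared(labels, informative)

print("r2 of constant predictions    = %g" % r2_constant)
print("r2 of informative predictions = %g" % r2_informative)
```

This matches what we saw earlier: when every row shown gets nearly the same prediction, r2 lands near 0.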


What is RMSE?
> Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals are a measure of how far from the regression line data points are; RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
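
One caveat about the quote above, shown as a small plain-Python sketch with made-up residuals: RMSE equals the standard deviation of the residuals only when the residuals average to zero. When the predictions are biased, RMSE is larger than the spread alone.

```python
import math
import statistics

def rmse(residuals):
    """Root of the mean of the squared residuals."""
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))

# Hypothetical residuals (prediction minus label), made up for illustration.
centered = [1.0, -1.0, 2.0, -2.0]  # average residual is 0
shifted = [1.0, 1.0, 2.0, 2.0]     # every prediction is too high

rmse_centered = rmse(centered)
std_centered = statistics.pstdev(centered)
rmse_shifted = rmse(shifted)
std_shifted = statistics.pstdev(shifted)

print("centered: rmse=%g std=%g" % (rmse_centered, std_centered))
print("shifted : rmse=%g std=%g" % (rmse_shifted, std_shifted))
```

So a large gap between train and test RMSE can come from bias on the test set as well as from noisier predictions.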


## Exercise 2: Build Simple Spark ML Pipelines

ML Pipelines provide a uniform set of high-level APIs built on top of DataFrames that help us create and tune practical machine learning pipelines.

In the previous exercise, you evaluated a linear regression model.

Logistic regression is used when the dependent variable is binary.
In our case, `bot` is binary: yes or no.

Linear regression is used to predict a continuous dependent variable.
This explains the poor results we received.
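
The key difference can be sketched in a few lines of plain Python. Logistic regression passes the linear score through the sigmoid function, so the result always lies strictly between 0 and 1 and can be read as a probability. The scores below are made up for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical linear scores (weights times features plus intercept), made up for illustration.
scores = [-3.0, 0.0, 3.0]
probabilities = [sigmoid(z) for z in scores]

for z, p in zip(scores, probabilities):
    print("score=%g -> probability of bot=%g" % (z, p))
```

A score of 0 maps to a probability of exactly 0.5, which is the natural default threshold between human and bot.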


### Spark ML Pipelines

Start with a simple ML pipeline:

In [15]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer


Remember that in Chapter 1 we split `description` into a list?
Let's do it with the `Tokenizer` functionality instead!

`Tokenizer` is part of `pyspark.ml.feature`, which gives us a lot of out-of-the-box functionality for feature extraction. Feature extraction is the _data-science_ way of transforming columns into new ones.
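
As a rough mental model, Spark's `Tokenizer` lowercases the text and splits it on whitespace. The plain-Python sketch below approximates that for a single string. `tokenize` is a hypothetical helper, and details such as repeated spaces may be handled differently by Spark.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()

# One of the descriptions from the data set.
words = tokenize("I live in Texas")
print(words)  # ['i', 'live', 'in', 'texas']
```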

Load saved data:

In [16]:
data = spark.read.parquet('../datasets/train_data_only_description')
data = data.fillna({'label':0})


In [17]:
data.show()
(trainingData, testData) = data.randomSplit([0.7, 0.3])

+--------------------+-----+
|         description|label|
+--------------------+-----+
|Contributing Edit...|    1|
|     I live in Texas|    0|
|Fresh E3 rumours ...|    0|
|''The 'Hello Worl...|    0|
|Proud West Belcon...|    1|
|Hello, I am here ...|    0|
|Meow! I want to t...|    0|
|I have something ...|    0|
|I have more than ...|    0|
|If you have to st...|   13|
|I am a twitterbot...|    1|
|Designing and mak...|    0|
|Host of Vleeties ...|    0|
|Benefiting Refuge...|    0|
|Access Hollywood ...|    0|
|Producer/Songwrit...|    0|
|CEO @Shapeways. I...|    0|
|Two division UFC ...|    0|
|Moderator of @mee...|    0|
|Tweeting every le...|    0|
+--------------------+-----+
only showing top 20 rows



### Tokenizer, HashingTF, and Logistic Regression

✅ **Task :** 

Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr:


Use the next code sample and adjust it to your needs:
```python
tokenizer = Tokenizer(inputCol="description", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

```
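
To get a feel for what the `hashingTF` stage does with the tokenized words, here is a minimal sketch of the hashing trick in plain Python. Each word is hashed to a slot in a fixed-size vector and the slot is incremented. This is an illustration only: Spark's `HashingTF` uses MurmurHash3 and defaults to 262,144 features, while this sketch assumes `zlib.crc32` and a tiny vector to stay readable.

```python
import zlib

def hashing_tf(words, num_features):
    """Count words into a fixed-size vector, using a hash to pick each word's slot."""
    vector = [0] * num_features
    for word in words:
        # crc32 is a stand-in hash, chosen because it is stable between runs.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

# Hypothetical tokens, made up for illustration.
words = ["spark", "bot", "spark"]
vector = hashing_tf(words, num_features=16)
print(vector)
```

With a small `num_features`, different words are more likely to collide in the same slot, which is why `numFeatures` is worth tuning.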


Now that we understand that Linear Regression might not be good enough for our data science purposes, we are going to work with Logistic Regression. This is your **3rd** Machine Learning model with Spark ML 🎉

In [18]:
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="description", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])


Call `fit` on the Pipeline to get the model:

If it fails here, validate that `description` doesn't have null values.
Our HashingTF doesn't know how to handle null values.
If those exist, create a new DataFrame without them and use the new DataFrame to build the model.

In case you need it:
```python
    # pass the column through `subset`; the first positional argument of dropna is `how`
    trainingData = trainingData.dropna(subset=['description'])
```


**Note! You might get an exception here**
If you get an out-of-memory exception here, it is because you need more memory to run this exercise. To free some up, make sure the other notebooks are closed and shut down. If you are uncertain how to do that, ask in the chat!
You can also sample the trainingData with:
```python
trainingData = trainingData.sample(fraction=0.5, seed=3)
```
Feel free to play with the fraction according to the available memory.

In [19]:
trainingData.count()

1708

In [20]:
# remove this if you have enough memory on your machine / running in a distributed setting
trainingData = trainingData.sample(fraction=0.01, seed=3)
trainingData.count()

23

In [21]:
# Fit the pipeline to training documents.
model = pipeline.fit(trainingData)

Make predictions:

In [22]:
# Make predictions on test documents and print columns of interest.
prediction = model.transform(testData)
selected = prediction.select("description", "probability", "prediction")
for row in selected.collect():
    description, prob, prediction = row
    print("(%s) --> prob=%s, prediction=%f" % (description, str(prob), prediction))

( CA""") --> prob=[0.8773797667315911,0.12262023326840887], prediction=0.000000
( England""") --> prob=[0.8773797667315911,0.12262023326840887], prediction=0.000000
( NV""") --> prob=[0.8773797667315911,0.12262023326840887], prediction=0.000000
("""Affordable & Professional Printing Services for Businesses & Individuals. For Cheapest Printing Prices Mail us sales@print365.ie #printing #Ireland""") --> prob=[0.9358968200892436,0.06410317991075642], prediction=0.000000
("""Follow and tweet us) --> prob=[0.882717793416593,0.11728220658340704], prediction=0.000000
("""International fanbase for Running Man ___ Variety Show. For enquires email runningmantown@gmail.com #7012""") --> prob=[0.9358968200892436,0.06410317991075642], prediction=0.000000
("""Rare and strong PokŽmon in Las Vegas and Spring Valley. See more PokŽmon at https://t.co/GB4nYu29n3""") --> prob=[0.9249361164527934,0.07506388354720661], prediction=0.000000
(#Muslim #Arab #American #husband #Father #politicaljunkie,#Liberal #

In the text output search for
`prediction=1`

And write in the chat which description got classified as bot!



In [23]:
# this is for clearing some memory :-)
prediction = None
selected = None

## Exercise 3: Put everything together

### CrossValidator, BinaryClassificationEvaluator and ParamGridBuilder functionality

✅ **Task :** 


`CrossValidator` lets us run multiple training and test splits within one function call: `fit`.

It runs the evaluation phase and chooses the best parameters.
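
To choose the best parameters, the evaluation phase needs one score per setting. With `BinaryClassificationEvaluator`, the default metric is the area under the ROC curve. One standard way to read that number: the chance that a randomly picked bot gets a higher score than a randomly picked human. The sketch below computes it that way in plain Python on made-up data. Spark builds the curve differently internally, so treat this as an illustration of the metric, not of Spark's implementation.

```python
def area_under_roc(labels, scores):
    """Fraction of (bot, human) pairs where the bot gets the higher score; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical labels (1 = bot, 0 = human) and scores, made up for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = area_under_roc(labels, scores)
print("area under ROC = %g" % auc)  # 0.75
```

A value of 0.5 means the scores rank bots no better than chance, and 1.0 means every bot outranks every human.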

Read about `CrossValidator` in the [docs](https://spark.apache.org/docs/latest/ml-tuning.html) and integrate it into your pipeline.

In the docs, search for `CrossValidator` `python` example.

Copy the example to the notebook and adjust it to your needs.

From the docs:
>`CrossValidator` - K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds which are used as separate training and test datasets e.g., with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once.
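
The splitting described in the quote can be sketched in plain Python. `k_fold_pairs` is a hypothetical helper written for illustration; it is not how Spark implements the split, but it produces the same kind of (training, test) pairs.

```python
import random

def k_fold_pairs(rows, k, seed=42):
    """Shuffle the rows, cut them into k folds, and use each fold as the test set once."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k non-overlapping folds
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        pairs.append((train, test))
    return pairs

# Nine hypothetical row ids, made up for illustration.
rows = list(range(9))
pairs = k_fold_pairs(rows, k=3)

for train, test in pairs:
    print("train size=%d test size=%d" % (len(train), len(test)))  # 6 and 3
```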

<details><summary>Can't find the example, click here! </summary>
<p>

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
```
    
</p>
</details>


<details><summary>Answer</summary>
<p>

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()


crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)  # use 3+ folds in practice


# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainingData)

prediction = cvModel.transform(testData)
selected = prediction.select("description", "probability", "prediction")
for row in selected.collect():
    print(row)
   
```
    
</p>
</details>

In [24]:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder


# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()


crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=3)  # use 3+ folds in practice


# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(trainingData)

prediction = cvModel.transform(testData)
selected = prediction.select("description", "probability", "prediction")
for row in selected.collect():
    print(row)


Row(description=' CA"""', probability=DenseVector([0.9915, 0.0085]), prediction=0.0)
Row(description=' England"""', probability=DenseVector([0.9953, 0.0047]), prediction=0.0)
Row(description=' NV"""', probability=DenseVector([0.9545, 0.0455]), prediction=0.0)
Row(description='"""Affordable & Professional Printing Services for Businesses & Individuals. For Cheapest Printing Prices Mail us sales@print365.ie #printing #Ireland"""', probability=DenseVector([0.991, 0.009]), prediction=0.0)
Row(description='"""Follow and tweet us', probability=DenseVector([0.936, 0.064]), prediction=0.0)
Row(description='"""International fanbase for Running Man ___ Variety Show. For enquires email runningmantown@gmail.com #7012"""', probability=DenseVector([0.9998, 0.0002]), prediction=0.0)
Row(description='"""Rare and strong PokŽmon in Las Vegas and Spring Valley. See more PokŽmon at https://t.co/GB4nYu29n3"""', probability=DenseVector([0.8216, 0.1784]), prediction=0.0)
Row(description='#Muslim #Arab #Ameri

In the text output search for
`prediction=1`

Write in the chat which `description` got classified as a bot!

Notice that some of the `description` values contain the word `bot`.

This means your algorithm picked up on it without being told directly to search for the word bot 🤓

---

In the last task, you used `BinaryClassificationEvaluator` since it better fits our needs: it works with binary labels, bot or human.

`ParamGridBuilder` is a utility that helps us construct a parameter grid for our algorithm. It lets us test various models built with various params.

`ParamGridBuilder` is part of the `pyspark.ml.tuning` lib.
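
Under the hood, a parameter grid is simply every combination of the values you list. The plain-Python sketch below rebuilds the grid from the task with `itertools.product` and counts how many models get trained. The dictionaries are an illustration only; Spark represents each combination as a param map.

```python
from itertools import product

# The same values used in the task.
num_features_values = [10, 100, 1000]
reg_param_values = [0.1, 0.01]
num_folds = 3

# Every combination of the two lists: 3 x 2 = 6 settings.
param_grid = [
    {"numFeatures": n, "regParam": r}
    for n, r in product(num_features_values, reg_param_values)
]

# Each setting is trained once per fold.
total_fits = len(param_grid) * num_folds

print("settings:", len(param_grid))         # 6
print("fits during cross-validation:", total_fits)  # 18
```

After those 18 fits, `CrossValidator` refits the best setting on the full training data, so adding values to the grid or raising `numFolds` makes `fit` noticeably slower.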


## Well Done! 👏👏👏
## You just finished: evaluating and automating with PySpark pipelines

## I hope you enjoyed it!


[@adipolak](https://mastodon.online/@adipolak)