
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Model Selection

Building machine learning solutions involves testing a number of different models.  This lesson explores tuning hyperparameters and cross-validation in order to select the optimal model as well as saving models and predictions.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
* Define hyperparameters and motivate their role in machine learning
* Tune hyperparameters using grid search
* Validate model performance using cross-validation
* Save a trained model and its predictions

<iframe  
src="//fast.wistia.net/embed/iframe/vaq1zoh9k4?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/vaq1zoh9k4?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Tuning, Validating and Saving

In earlier lessons, we addressed the methodological mistake of training _and_ evaluating a model on the same data.  This leads to **overfitting,** where the model performs well on data it has already seen but fails to predict anything useful on data it has not already seen.  To solve this, we proposed the train/test split, where we divided our dataset between a training set used to train the model and a test set used to evaluate the model's performance on unseen data.  In this lesson, we will explore a more rigorous solution to the problem of overfitting.

A **hyperparameter** is a parameter used in a machine learning algorithm that is set before the learning process begins.  In other words, a machine learning algorithm cannot learn hyperparameters from the data itself.  Hyperparameters need to be tested and validated by training multiple models.  Common hyperparameters include the number of iterations and the complexity of the model.  **Hyperparameter tuning** is the process of choosing the hyperparameter values that perform best according to our loss function, which is the way we penalize an algorithm for being wrong.
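To make the distinction concrete, here is a minimal sketch in plain Python (not Spark).  The slope `w` is a parameter learned from the data, while `max_iter` and `learning_rate` are hyperparameters fixed before training.  The toy data, function names, and values are made up for illustration.

```python
def mean_squared_error(w, xs, ys):
    """Loss function: average squared difference between predictions and labels."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_slope(xs, ys, max_iter, learning_rate=0.01):
    """Learn the slope w of y = w * x by gradient descent.

    max_iter and learning_rate are hyperparameters: we choose them up front.
    w is a parameter: the algorithm learns it from the data.
    """
    w = 0.0
    for _ in range(max_iter):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


# Toy data that follows y = 3x exactly (illustrative only)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

# Train one model per candidate value of the hyperparameter
losses = {}
for max_iter in [1, 10, 100]:
    w = fit_slope(xs, ys, max_iter)
    losses[max_iter] = mean_squared_error(w, xs, ys)
    print(max_iter, round(w, 4), losses[max_iter])
```

Each value of `max_iter` produces a different trained model, which is why tuning means training several models and comparing their losses.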

If we were to train a number of different models with different hyperparameters and then evaluate their performance on the test set, we would still risk overfitting because we might choose the hyperparameter that just so happens to perform the best on that particular test set.  To solve this, we can use _k_ subsets of our training set to train and validate our model, a process called **_k_-fold cross-validation.**

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/cross-validation.png" style="height: 400px; margin: 20px"/></div>

In this lesson, we will divide our dataset into _k_ "folds" in order to choose the best hyperparameters for our machine learning model.  We will then save the trained model and its predictions.

In [5]:
%run "./Includes/Classroom-Setup"

### Hyperparameter Tuning

Hyperparameter tuning is the process of choosing the optimal hyperparameters for a machine learning algorithm.  Each algorithm has different hyperparameters to tune.  You can explore these hyperparameters by using the `.explainParams()` method on a model.

**Grid search** is the process of exhaustively trying every combination of hyperparameters.  It takes all of the values we want to test and combines them in every possible way so that we test them using cross-validation.
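The idea of "every possible combination" can be sketched in plain Python with a Cartesian product.  The `grid` dictionary below is a hypothetical stand-in for the grid a tuning library would build; it is not the Spark API.

```python
from itertools import product

# Hypothetical grid: each hyperparameter name maps to the values to try
grid = {
    "maxIter": [1, 10, 100],
    "fitIntercept": [True, False],
    "standardization": [True, False],
}

names = list(grid)
# The Cartesian product yields one tuple per combination of values
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 * 2 * 2 = 12
print(combinations[0])
```

Each dictionary in `combinations` describes one model to train and validate.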

Start by performing a train/test split on the Boston dataset and building a pipeline for linear regression.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See <a href="https://en.wikipedia.org/wiki/Hyperparameter_optimization" target="_blank">the Wikipedia article on hyperparameter optimization</a> for more information.

In [7]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

bostonDF = (spark.read
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bostonhousing/bostonhousing/bostonhousing.csv")
  .drop("_c0")
)

# use 80/20 rule to split training and testing data
trainDF, testDF = bostonDF.randomSplit([0.8,0.2], seed=42)

assembler = VectorAssembler(inputCols=bostonDF.columns[:-1], outputCol="features")

lr = (LinearRegression()
      .setLabelCol("medv")
      .setFeaturesCol("features")
)

pipeline = Pipeline(stages = [assembler, lr])

Take a look at the model parameters using the `.explainParams()` method.

In [9]:
print(lr.explainParams())

`ParamGridBuilder()` allows us to string together all of the different possible hyperparameters we would like to test.  In this case, we can test the maximum number of iterations, whether to fit an intercept term, and whether to standardize our features before fitting.

<img alt="Caution" title="Caution" style="vertical-align: text-bottom; position: relative; height:1.3em; top:0.0em" src="https://files.training.databricks.com/static/images/icon-warning.svg"/> Since grid search exhaustively builds a model for each combination of parameters, the number of combinations grows quickly: each added hyperparameter multiplies the number of models to train.

In [11]:
from pyspark.ml.tuning import ParamGridBuilder

paramGrid = (ParamGridBuilder()
  .addGrid(lr.maxIter, [1, 10, 100])
  .addGrid(lr.fitIntercept, [True, False])
  .addGrid(lr.standardization, [True, False])
  .build()
)

Now `paramGrid` contains all of the combinations we will test in the next step.  Take a look at what it contains.

In [13]:
paramGrid

### Cross-Validation

There are a number of different ways of conducting cross-validation, allowing us to trade off between computational expense and model performance.  An exhaustive approach to cross-validation would include every possible split of the training set.  More commonly, _k_-fold cross-validation is used, where the training dataset is divided into _k_ smaller sets, or folds.  A model is then trained on _k_-1 folds of the training data and the remaining fold is used to evaluate its performance.  This is repeated _k_ times so that each fold serves as the validation set once, and the resulting scores are averaged.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See <a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)" target="_blank">the Wikipedia article on Cross-Validation</a> for more information.
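The fold mechanics can be sketched in plain Python.  This is an illustration, not how Spark implements it: the helper `k_fold_indices` is hypothetical, and it assigns rows to folds in a fixed round-robin order, whereas a real implementation would typically assign rows to folds at random.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Assign row i to fold i % k so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        # Training rows are all rows from the other k-1 folds
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=10, k=3))
for train, validation in splits:
    print("train:", sorted(train), "validation:", validation)
```

Every row appears in exactly one validation fold, so each candidate model is scored on data it was not trained on.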

Create a `RegressionEvaluator()` to evaluate our grid search experiments and a `CrossValidator()` to build our models.

In [16]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator

evaluator = RegressionEvaluator(
  labelCol = "medv", 
  predictionCol = "prediction"
)

cv = CrossValidator(
  estimator=pipeline,            # Estimator (individual model or pipeline)
  estimatorParamMaps=paramGrid,  # Grid of parameters to try (grid search)
  evaluator=evaluator,           # Evaluator
  numFolds=3,                    # Set k to 3
  seed=42                        # Seed to ensure our results are the same if run again
)

Fit the `CrossValidator()`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> This will train a large number of models.  If your cluster size is too small, it could take a while.

In [18]:
cvModel = cv.fit(trainDF)
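For a sense of scale, the fit above trains one model per parameter combination per fold.  The arithmetic below is a sketch that also assumes a final refit of the best combination on the full training set, which is how Spark's `CrossValidator` produces `bestModel`.  The variable names are illustrative.

```python
# Sizes of the grids used in this lesson
grid_sizes = {"maxIter": 3, "fitIntercept": 2, "standardization": 2}
num_folds = 3

num_combinations = 1
for size in grid_sizes.values():
    num_combinations *= size

cv_fits = num_combinations * num_folds  # one fit per combination per fold
total_fits = cv_fits + 1                # plus the final refit of the best combination

print(num_combinations, cv_fits, total_fits)  # 12 36 37
```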

Take a look at the scores from the different experiments.

In [20]:
for params, score in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
  print("".join([param.name+"\t"+str(params[param])+"\t" for param in params]))
  print("\tScore: {}".format(score))

You can then access the best model using the `.bestModel` attribute.

In [22]:
bestModel = cvModel.bestModel

In [23]:
evaluator.isLargerBetter()
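Whether the best score is the largest or the smallest depends on the metric, which is what `isLargerBetter()` reports.  The selection logic can be sketched in plain Python; the helper `best_index` is hypothetical and the scores are made up, standing in for `avgMetrics`.

```python
def best_index(avg_metrics, larger_is_better):
    """Return the position of the best average score across the grid."""
    positions = range(len(avg_metrics))
    if larger_is_better:
        return max(positions, key=lambda i: avg_metrics[i])
    return min(positions, key=lambda i: avg_metrics[i])


# Made-up average scores, one per parameter combination (illustrative only)
scores = [5.2, 4.9, 6.1]

print(best_index(scores, larger_is_better=False))  # error-style metric such as RMSE
print(best_index(scores, larger_is_better=True))   # metric such as area under the ROC curve
```

For an error metric such as RMSE, the default for `RegressionEvaluator`, smaller is better, so the combination with the lowest average score wins.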

In [24]:
param_dict = bestModel.stages[-1].extractParamMap()

for k, v in param_dict.items():
  print(k, ": ", v)

In [25]:
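# Read the tuned hyperparameter values from the underlying Java model.
# Every entry of paramGrid has the same keys, so any entry lists the tuned params.
java_model = bestModel.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name)) 
    for param in paramGrid[0]}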

### Saving Models and Predictions

Spark can save both the trained model we created and its predictions.  For online predictions such as on a stream of new data, saving the trained model and using it with Spark Streaming is a common application.  It's also common to retrain an algorithm as a nightly batch process and save the results to a database or Parquet table for later use.

Save the best model.

In [28]:
modelPath = userhome + "/cvPipelineModel"
dbutils.fs.rm(modelPath, recurse=True)

# save model to modelPath
cvModel.bestModel.save(modelPath)

Take a look at where it saved.

In [30]:
dbutils.fs.ls(modelPath)

In [31]:
from pyspark.ml import PipelineModel
readBackBestModel = PipelineModel.load(modelPath)
param_dict = readBackBestModel.stages[-1].extractParamMap()
for k, v in param_dict.items():
  print(k, ": ", v)

Save predictions made on `testDF`.

In [33]:
predictionsPath = userhome + "/modelPredictions.parquet"

# save predictions to predictionsPath
cvModel.bestModel.transform(testDF).write.mode("OVERWRITE").parquet(predictionsPath)

In [34]:
dbutils.fs.ls(predictionsPath)

In [35]:
# Read back the result
predictionDF = (spark
  .read
  .parquet(predictionsPath)
)

display(predictionDF)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv,features,prediction
0.00906,90.0,2.97,0,0.4,7.088,20.8,7.3073,1,285,15.3,394.72,7.85,32.2,"List(1, 13, List(), List(0.00906, 90.0, 2.97, 0.0, 0.4, 7.088, 20.8, 7.3073, 1.0, 285.0, 15.3, 394.72, 7.85))",31.408578543842005
0.01096,55.0,2.25,0,0.389,6.453,31.9,7.3073,1,300,15.3,394.72,8.23,22.0,"List(1, 13, List(), List(0.01096, 55.0, 2.25, 0.0, 0.389, 6.453, 31.9, 7.3073, 1.0, 300.0, 15.3, 394.72, 8.23))",27.2356724147633
0.01301,35.0,1.52,0,0.442,7.241,49.3,7.0379,1,284,15.5,394.74,5.49,32.7,"List(1, 13, List(), List(0.01301, 35.0, 1.52, 0.0, 0.442, 7.241, 49.3, 7.0379, 1.0, 284.0, 15.5, 394.74, 5.49))",30.304211206311624
0.01778,95.0,1.47,0,0.403,7.135,13.9,7.6534,3,402,17.0,384.3,4.45,32.9,"List(1, 13, List(), List(0.01778, 95.0, 1.47, 0.0, 0.403, 7.135, 13.9, 7.6534, 3.0, 402.0, 17.0, 384.3, 4.45))",30.458808711915974
0.01965,80.0,1.76,0,0.385,6.23,31.5,9.0892,1,241,18.2,341.6,12.93,20.1,"List(1, 13, List(), List(0.01965, 80.0, 1.76, 0.0, 0.385, 6.23, 31.5, 9.0892, 1.0, 241.0, 18.2, 341.6, 12.93))",20.260496085915605
0.03359,75.0,2.95,0,0.428,7.024,15.8,5.4011,3,252,18.3,395.62,1.98,34.9,"List(1, 13, List(), List(0.03359, 75.0, 2.95, 0.0, 0.428, 7.024, 15.8, 5.4011, 3.0, 252.0, 18.3, 395.62, 1.98))",34.069906290815894
0.03768,80.0,1.52,0,0.404,7.274,38.3,7.309,2,329,12.6,392.2,6.62,34.6,"List(1, 13, List(), List(0.03768, 80.0, 1.52, 0.0, 0.404, 7.274, 38.3, 7.309, 2.0, 329.0, 12.6, 392.2, 6.62))",34.458804323812224
0.03871,52.5,5.32,0,0.405,6.209,31.3,7.3172,6,293,16.6,396.9,7.14,23.2,"List(1, 13, List(), List(0.03871, 52.5, 5.32, 0.0, 0.405, 6.209, 31.3, 7.3172, 6.0, 293.0, 16.6, 396.9, 7.14))",26.984576022124987
0.03961,0.0,5.19,0,0.515,6.037,34.5,5.9853,5,224,20.2,396.9,8.01,21.1,"List(1, 13, List(), List(0.03961, 0.0, 5.19, 0.0, 0.515, 6.037, 34.5, 5.9853, 5.0, 224.0, 20.2, 396.9, 8.01))",20.84471091506774
0.04301,80.0,1.91,0,0.413,5.663,21.9,10.5857,4,334,22.0,382.8,8.05,18.2,"List(1, 13, List(), List(0.04301, 80.0, 1.91, 0.0, 0.413, 5.663, 21.9, 10.5857, 4.0, 334.0, 22.0, 382.8, 8.05))",14.63731891277724


## Exercise: Tuning a Model

Use grid search and cross-validation to tune the hyperparameters of a logistic regression model.

### Step 1: Import the Data

Import the data and perform a train/test split.

In [38]:
from pyspark.sql.functions import col

cols = ["index",
 "sample-code-number",
 "clump-thickness",
 "uniformity-of-cell-size",
 "uniformity-of-cell-shape",
 "marginal-adhesion",
 "single-epithelial-cell-size",
 "bare-nuclei",
 "bland-chromatin",
 "normal-nucleoli",
 "mitoses",
 "class"]

cancerDF = (spark.read  # read the data
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/cancer/biopsy/biopsy.csv")
)

cancerDF = (cancerDF    # Add column names; replace bare-nuclei with a 1/0 flag for non-null values
  .toDF(*cols)
  .withColumn("bare-nuclei", col("bare-nuclei").isNotNull().cast("integer"))
)

display(cancerDF)

index,sample-code-number,clump-thickness,uniformity-of-cell-size,uniformity-of-cell-shape,marginal-adhesion,single-epithelial-cell-size,bare-nuclei,bland-chromatin,normal-nucleoli,mitoses,class
1,1000025,5,1,1,1,2,1,3,1,1,benign
2,1002945,5,4,4,5,7,1,3,2,1,benign
3,1015425,3,1,1,1,2,1,3,1,1,benign
4,1016277,6,8,8,1,3,1,3,7,1,benign
5,1017023,4,1,1,3,2,1,3,1,1,benign
6,1017122,8,10,10,8,7,1,9,7,1,malignant
7,1018099,1,1,1,1,2,1,3,1,1,benign
8,1018561,2,1,2,1,2,1,3,1,1,benign
9,1033078,2,1,1,1,2,1,1,1,5,benign
10,1033078,4,2,1,1,2,1,2,1,1,benign


Perform a train/test split to create `trainCancerDF` and `testCancerDF`.  Put 80% of the data in `trainCancerDF` and use the seed that is set for you.

In [40]:
# TODO
seed = 42
trainCancerDF, testCancerDF = cancerDF.randomSplit([0.8, 0.2], seed=seed)

In [41]:
# TEST - Run this cell to test your solution
trainCancerCount, testCancerCount = trainCancerDF.count(), testCancerDF.count()

dbTest("ML1-P-08-01-01", True, trainCancerCount > 550 and trainCancerCount < 575)
dbTest("ML1-P-08-01-02", True, testCancerCount > 125 and testCancerCount < 140)

print("Tests passed!")

### Step 2: Create a Pipeline

Create a pipeline `cancerPipeline` that consists of the following stages:<br>

1. `indexer`: a `StringIndexer` that takes `class` as an input and outputs the column `is-malignant`
2. `assembler`: a `VectorAssembler` that takes all of the feature columns (everything except `index`, `sample-code-number`, and `class`) as input and outputs the column `features`
3. `logr`: a `LogisticRegression` that takes `features` as the input and `is-malignant` as the output variable

In [43]:
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml import Pipeline

indexer = StringIndexer(inputCol="class", outputCol="is-malignant")
assembler = VectorAssembler(
  inputCols=["clump-thickness", "uniformity-of-cell-size", "uniformity-of-cell-shape", "marginal-adhesion",
             "single-epithelial-cell-size", "bare-nuclei", "bland-chromatin", "normal-nucleoli", "mitoses"],
  outputCol="features"
)
logr = LogisticRegression(featuresCol="features", labelCol="is-malignant")
cancerPipeline = Pipeline(stages = [indexer,assembler,logr])

In [44]:
# TEST - Run this cell to test your solution
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler

dbTest("ML1-P-08-02-01", True, type(indexer) == type(StringIndexer()))
dbTest("ML1-P-08-02-02", True, indexer.getInputCol() == 'class')
dbTest("ML1-P-08-02-03", True, indexer.getOutputCol() == 'is-malignant')

dbTest("ML1-P-08-02-04", True, type(assembler) == type(VectorAssembler()))
dbTest("ML1-P-08-02-05", True, assembler.getInputCols() == cols[2:-1])
dbTest("ML1-P-08-02-06", True, assembler.getOutputCol() == 'features')

dbTest("ML1-P-08-02-07", True, type(logr) == type(LogisticRegression()))
dbTest("ML1-P-08-02-08", True, logr.getLabelCol() == "is-malignant")
dbTest("ML1-P-08-02-09", True, logr.getFeaturesCol() == 'features')

dbTest("ML1-P-08-02-10", True, type(cancerPipeline) == type(Pipeline()))

print("Tests passed!")

### Step 3: Create Grid Search Parameters

Take a look at the parameters for our `LogisticRegression` object.  Use this to build the inputs to grid search.

In [46]:
print(logr.explainParams())

Create a `ParamGridBuilder` object with two grids:<br><br>

1. A regularization parameter `regParam` of `[0., .2, .8, 1.]`
2. Test both with and without an intercept using `fitIntercept`

In [48]:
# TODO
from pyspark.ml.tuning import ParamGridBuilder

cancerParamGrid = (ParamGridBuilder()
                   .addGrid(logr.regParam,[0.,.2,.8,1.])
                   .addGrid(logr.fitIntercept,[True,False])
                   .build()
)

In [49]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-08-03-01", True, type(cancerParamGrid) == list)

print("Tests passed!")

### Step 4: Perform 3-Fold Cross-Validation

Create a `BinaryClassificationEvaluator` object and use it to perform 3-fold cross-validation.

In [51]:
# TODO
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

binaryEvaluator = BinaryClassificationEvaluator(
  labelCol = "is-malignant", 
  metricName = "areaUnderROC"
)

cancerCV = CrossValidator(
  estimator = cancerPipeline,            # Estimator (individual model or pipeline)
  estimatorParamMaps = cancerParamGrid,  # Grid of parameters to try (grid search)
  evaluator = binaryEvaluator,           # Evaluator
  numFolds = 3,                          # Set k to 3
  seed = 42                              # Seed to ensure our results are the same if run again
)

cancerCVModel = cancerCV.fit(trainCancerDF)

In [52]:
# TEST - Run this cell to test your solution
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

dbTest("ML1-P-08-04-01", True, type(binaryEvaluator) == type(BinaryClassificationEvaluator()))
dbTest("ML1-P-08-04-02", True, type(cancerCV) == type(CrossValidator()))

print("Tests passed!")

### Step 5: Examine the results

Take a look at the results.  Which combination of hyperparameters achieved the best cross-validated score?

In [54]:
for params, score in zip(cancerCVModel.getEstimatorParamMaps(), cancerCVModel.avgMetrics):
  print("".join([param.name+"\t"+str(params[param])+"\t" for param in params]))
  print("\tScore: {}".format(score))

In [55]:
binaryEvaluator.isLargerBetter()

In [56]:
param_dict1 = cancerCVModel.bestModel.stages[-1].extractParamMap()

for k, v in param_dict1.items():
  print(k, ": ", v)

In [57]:
java_model1 = cancerCVModel.bestModel.stages[-1]._java_obj
{param.name: java_model1.getOrDefault(java_model1.getParam(param.name)) 
    for param in cancerParamGrid[7]}

## Review

**Question:** What are hyperparameters and how are they used?  
**Answer:** A hyperparameter is a parameter, or setting, for a machine learning model that must be set before training the model.  Since a hyperparameter cannot be learned from the data itself, many different hyperparameter values should be tested in order to determine the best-performing set of values.

**Question:** How can I further improve my predictions?  
**Answer:** There are a number of different strategies including:
0. *Different models:* train different models such as a random forest or gradient boosted trees
0. *Expert knowledge:* combine the current pipeline with domain expertise about the data
0. *Better tuning:* continue tuning with more hyperparameters to choose from
0. *Feature engineering:* create new features for the model to train on
0. *Ensemble models:* combine predictions from multiple models, which can produce better results than any one model<br>

**Question:** Why is cross-validation an optimal strategy for model selection?  
**Answer:** Cross-validation is an optimal strategy for model selection because it makes the most of the available data while guarding against overfitting to a single validation split.  It is also flexible: the number of folds (_k_) can be adjusted to balance the cost of training against the reliability of the performance estimate.  If compute time is not an issue, a larger _k_ lets each model train on a larger share of the data.  Cross-validation is also embarrassingly parallel, since the different models can be trained and validated independently of one another.
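As a rough illustration of why this work parallelizes well, the sketch below treats each (hyperparameter value, fold) pair as an independent task and runs the tasks on a thread pool.  The `evaluate` function is a hypothetical stand-in that returns a made-up score instead of training a real model.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def evaluate(task):
    """Stand-in for 'train on k-1 folds, score on the held-out fold'."""
    reg_param, fold = task
    # Made-up score so the example runs without any data
    return reg_param, fold, round(1.0 - reg_param * 0.1 - fold * 0.01, 3)


# Every (hyperparameter value, fold) pair is an independent task
tasks = list(product([0.0, 0.2, 0.8, 1.0], range(3)))

# No task depends on another's result, so they can run concurrently
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(evaluate, tasks))

print(len(results))  # 4 values * 3 folds = 12 independent evaluations
```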

**Question:** Now that I've trained my model, how can I use it?  
**Answer:** There are a few different options.  Commonly, predictions are calculated in batch and saved to a database where they can be served when they are needed.  In the case of stream processing, models can predict on incoming streams of data.  Spark can also integrate well with <a href="http://mleap-docs.combust.ml/" target="_blank">other model serialization formats such as MLeap.</a>  Model serving will be covered in greater detail in later courses.

## Next Steps

[Start the capstone project]($./09-Capstone-Project ).

## Additional Topics & Resources

**Q:** Where can I find out more information on cross-validation?  
**A:** Check out the scikit-learn article <a href="https://scikit-learn.org/stable/modules/cross_validation.html" target="_blank">on cross-validation</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>