### Tuning Hyperparameters using Grid Search


Most machine learning algorithms have a set of parameters that govern the algorithm's behavior. These parameters are called hyperparameters to distinguish them from the model parameters, such as the coefficients in linear and logistic regression. In this module we show how to use grid search and cross-validation in Spark MLlib to determine a reasonable regularization parameter for [L1 (lasso) linear regression](https://en.wikipedia.org/wiki/Lasso_%28statistics%29). This notebook is based on material supplied by Cloudera under their Cloudera Academic Partner program and on the book *Spark: The Definitive Guide* by Bill Chambers and Matei Zaharia.

Topics
- Creating train, validation, and test datasets
- Preparing for hyperparameter tuning by specifying
  - Estimator
  - Hyperparameter grid
  - Evaluator
- Tuning hyperparameters using holdout cross-validation
- Tuning hyperparameters using k-fold cross-validation

You can find details of all of the classes, methods, and attributes in the [Spark MLlib API Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html) and a more general guide to their use in the [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html).

#### Generate the train and test datasets

In [0]:
# Load the regression modeling data  (saved version of "assembled" DataFrame from "Building and Evaluating Classification Models" notebook)
rides = spark.read.parquet("/mnt/my-data/duocar/regression_data")

# Rename columns only to make the printed output narrower;
# the column names used in our analysis are unchanged
rides\
  .withColumnRenamed("vehicle_year", "year")\
  .withColumnRenamed("star_rating", "star")\
  .withColumnRenamed("high_rating", "high")\
  .show(3, False)

# Create train and test DataFrames
(train, test) = rides.randomSplit([0.7, 0.3], 12345)
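
To make the split step concrete, here is a minimal plain-Python sketch of a weighted random split. It is an illustration only, not Spark's implementation, and the helper `random_split` is hypothetical. The point it makes is that each row is assigned independently, so a `[0.7, 0.3]` split yields approximately, not exactly, 70% and 30% of the rows.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform number per row.

    Hypothetical helper for illustration; Spark's randomSplit is distributed
    and will not reproduce these exact assignments.
    """
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    bounds[-1] = 1.0  # guard against floating-point drift

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()  # uniform in [0, 1)
        for i, upper in enumerate(bounds):
            if u < upper:
                splits[i].append(row)
                break
    return splits

train_rows, test_rows = random_split(list(range(1000)), [0.7, 0.3], seed=12345)
print(len(train_rows), len(test_rows))  # roughly 700 and 300, not exactly
```

Fixing the seed makes the split repeatable, which is why the notebook passes `12345` to `randomSplit`.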

In [0]:
# display(dbutils.fs.ls("dbfs:/mnt/my-data/duocar"))

#### Requirements for hyperparameter tuning

We need to specify four components to perform hyperparameter tuning using grid search:
1. Estimator
2. Hyperparameter grid
3. Evaluator
4. Validation method

#### Specify the estimator

In this example we will use L1 (lasso) linear regression as our estimator.
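
For reference, the objective that Spark's `LinearRegression` minimizes, as described in the MLlib documentation, can be written as follows. The symbols are chosen here for illustration: $\lambda$ is `regParam`, $\alpha$ is `elasticNetParam`, $n$ is the number of training rows, $w$ is the coefficient vector, and the intercept is omitted for brevity.

$$
\min_{w}\; \frac{1}{2n}\sum_{i=1}^{n}\left(x_i^{\top}w - y_i\right)^2 + \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
$$

With $\alpha = 1$ only the L1 term remains (lasso), with $\alpha = 0$ only the L2 term remains (ridge), and with $\lambda = 0$ the penalty vanishes and the fit reduces to ordinary least squares.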

In [0]:
# Setting `elasticNetParam=1.0` corresponds to L1 (lasso) linear regression
# Setting `elasticNetParam=0.0` would be L2 (ridge) linear regression
# We are interested in finding a good value of the regularization parameter `regParam`

from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="star_rating", elasticNetParam=1.0)

# Use the `explainParams` method to get the full list of hyperparameters as well as training parameters:
print(lr.explainParams())

#### Specify hyperparameter grid

Use the [ParamGridBuilder](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.ParamGridBuilder.html) class to specify a hyperparameter grid.

- `addGrid(param, values)` adds a parameter to the grid together with the list of candidate values to search over
- `baseOn(*args)` sets the given parameters in the grid to fixed values. It accepts either a parameter dictionary or a list of (parameter, value) pairs
- `build()` builds and returns all combinations of parameters specified by the grid
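
Conceptually, the built grid is the Cartesian product of the candidate values. The plain-Python sketch below illustrates that idea with a hypothetical helper, `build_grid`, that uses string keys; the real `ParamGridBuilder` keys its parameter maps by `Param` objects, so treat this as an analogy and not as the library's code.

```python
from itertools import product

def build_grid(param_values):
    """Return one dict per combination of the candidate values.

    `param_values` maps a parameter name to its list of candidates.
    Hypothetical helper for illustration only.
    """
    names = list(param_values)
    candidate_lists = [param_values[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*candidate_lists)]

toy_grid = build_grid({
    "elasticNetParam": [1.0],  # a single fixed value, similar in spirit to baseOn
    "regParam": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
})
for param_map in toy_grid:
    print(param_map)
```

The number of models to fit is the product of the list lengths, so adding a second parameter with several values multiplies the cost of the search.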

In [0]:
from pyspark.ml.tuning import ParamGridBuilder
regParamList = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
grid = ParamGridBuilder().addGrid(lr.regParam, regParamList).build()

# The resulting object is simply a list of parameter maps
for item in grid:
    print(item)

In [0]:
type(grid)

In [0]:
# Rather than specify `elasticNetParam` in the `LinearRegression` constructor, we can specify it in our grid
grid = ParamGridBuilder().baseOn({lr.elasticNetParam: 1.0}).addGrid(lr.regParam, regParamList).build()

In [0]:
# The resulting object is simply a list of parameter maps
# This time it includes elasticNetParam with a value of 1.0
for item in grid:
    print(item)
    print ("\n")

#### Specify the evaluator

In this case we will use [RegressionEvaluator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.evaluation.RegressionEvaluator.html) as our evaluator and specify root-mean-squared error (rmse) as the evaluation metric. Other machine learning algorithms have their own suitable evaluators.
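
As a reminder of what the metric measures, root-mean-squared error is the square root of the average squared difference between label and prediction. The sketch below computes it in plain Python on made-up numbers; the function name `rmse` and the sample values are illustrative and do not come from the rides data.

```python
import math

def rmse(labels, predictions):
    """Root-mean-squared error between two equal-length sequences."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up star ratings and predictions, for illustration only
print(rmse([5, 4, 3, 5], [4.5, 4.0, 3.5, 4.0]))
```

Because the metric is in the same units as the label, an RMSE near 1.1 on a star rating means predictions are off by roughly one star on average.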

In [0]:
# Use RegressionEvaluator as our evaluator and specify root-mean-squared error as the metric
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="star_rating", metricName="rmse")

#### Tuning the hyperparameters using holdout cross-validation

[TrainValidationSplit](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.TrainValidationSplit.html) supports validation for hyperparameter tuning. We use it to specify a holdout dataset for validation. It randomly splits the input dataset into train and validation sets once, fits one model per hyperparameter combination on the train set, and uses the evaluation metric on the validation set to select the best model. It is similar to `CrossValidator`, but because it splits the data only once, each combination is fit only once. This makes holdout validation comparatively cheap for large DataFrames.

Note that in our example 
- We already split the data (above) into train and test datasets. `TrainValidationSplit` splits the train dataset from that step into new train and validation datasets
- The models for each combination of hyperparameters will be fit using the new smaller training dataset
- Each of those models will be evaluated using the `evaluation metric` on the validation dataset to select the best model
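
The steps above can be sketched in plain Python on a toy problem. Everything here is an assumption made for illustration: the synthetic data, the one-feature lasso without intercept (which has a closed-form soft-thresholding solution), and the helper names. Spark also assigns rows to the two sets randomly per row, so its split sizes are approximate, whereas this sketch cuts a shuffled list at an exact position.

```python
import random

def fit_lasso_1d(rows, reg_param):
    """Closed-form lasso for y ~ w * x (one feature, no intercept)."""
    n = len(rows)
    mean_xy = sum(x * y for x, y in rows) / n
    mean_xx = sum(x * x for x, _ in rows) / n
    # Soft-thresholding: shrink toward zero by reg_param, never past zero
    shrunk = max(abs(mean_xy) - reg_param, 0.0)
    return (shrunk if mean_xy >= 0 else -shrunk) / mean_xx

def validation_rmse(rows, w):
    """RMSE of the predictions w * x against y."""
    return (sum((y - w * x) ** 2 for x, y in rows) / len(rows)) ** 0.5

def holdout_search(rows, reg_params, train_ratio=0.75, seed=54321):
    """Split once, score every candidate on the validation part, refit the best."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    sub_train, validation = shuffled[:cut], shuffled[cut:]
    metrics = [validation_rmse(validation, fit_lasso_1d(sub_train, r))
               for r in reg_params]
    best = reg_params[metrics.index(min(metrics))]
    # Refit on all rows handed to the search, using the winning value
    return metrics, best, fit_lasso_1d(rows, best)

# Synthetic data: y = 2x plus a little noise
noise = random.Random(0)
data = [(i / 10, 2.0 * (i / 10) + noise.gauss(0, 0.1)) for i in range(1, 101)]
metrics, best_reg, best_w = holdout_search(data, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
print(metrics)
print(best_reg, best_w)
```

The returned list plays the role of `validationMetrics`: one score per candidate, in the same order as the grid.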

In [0]:
# Use the `TrainValidationSplit` class to specify holdout cross-validation
from pyspark.ml.tuning import TrainValidationSplit
validator = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, trainRatio=0.75, seed=54321)

# Use the `fit` method to find the best set of hyperparameters
# The dataset will be split according to `trainRatio` 
cv_model = validator.fit(train)

The resulting model is an instance of the [TrainValidationSplitModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.TrainValidationSplitModel.html) class.

In [0]:
# Confirm type of model returned from TrainValidationSplit class
type(cv_model)

In [0]:
# The cross-validation results are stored in the `validationMetrics` attribute
cv_model.validationMetrics

# These are the rmse errors (smaller is better) 

#### Plot Validation Metric (rmse) against Regularization Parameter

In [0]:
# Zip the two lists together and create a dataframe for plotting
to_plot = zip(regParamList,cv_model.validationMetrics)
to_plot_df = spark.createDataFrame(to_plot, \
             schema=["Regularization_Param", "Validation_Metric (rmse)"])
to_plot_df.printSchema()

In [0]:
display(to_plot_df)
# x-axis Regularization Parameter
# y-axis Validation Metric (rmse)

Regularization_Param,Validation_Metric (rmse)
0.0,1.0995311645849588
0.1,1.109470369357201
0.2,1.1327856354695602
0.3,1.1481376494959807
0.4,1.1481376494959807
0.5,1.1481376494959807


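The best performance was with the regularization parameter at zero (i.e., ordinary linear regression with no penalty). However, it might be worth exploring the space between 0 and 0.1. Note that the RMSE is identical for 0.3, 0.4, and 0.5, which suggests that at these values the lasso penalty has already shrunk every coefficient to zero.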

In this case the `bestModel` attribute is an instance of the [LinearRegressionModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html) class. 

**Note:** 
- After the best hyperparameters are selected, the model is refit on the entire dataset passed to `fit` (here, `train`) using those hyperparameters
- The usual attributes and methods of the `LinearRegressionModel` are available  (so we can look at performance measures, coefficients, etc.)

In [0]:
# Show type of 'bestModel'
type(cv_model.bestModel)

In [0]:
# A function for printing common results of a linear regression model
# Note that this is a prototype. In its current form it will not handle
# models without an intercept. Works okay for this example
from pyspark.sql.types import *
import pandas as pd
def printLinRegResults(model, feature_list):
    # Query model performance:
    print ("R-Squared: " + str(model.summary.r2))
    print ("RMSE:     " + str(model.summary.rootMeanSquaredError))
    
    # Build a list of model coefficients with native python float types
    combined_coeff = []
    coeff_floats = [round(float(np_float),5) for np_float in model.coefficients] # convert coefficients to floats
    combined_coeff.extend((coeff_floats)) # Add the coefficients to the list
    combined_coeff.append(round(float(model.intercept),5)) # Append the intercept to the list of coefficients
    
    StandardErrors = [round(num, 5) for num in model.summary.coefficientStandardErrors]
    tValues = [round(num, 5) for num in model.summary.tValues]
    pValues = [round(num, 5) for num in model.summary.pValues]
    
    model_summary = list(zip(feature_list, combined_coeff, StandardErrors, tValues, pValues))
    df = pd.DataFrame(model_summary, columns = ['Feature', 'Coefficient', 'Standard Error', 't-value', 'p-value'])
    print("Note: Last row of this table represents the intercept")
    print(df)
    #display(df)    

In [0]:
# Show results of model by passing the model 
# and the feature list to the function
features_list = ['reviewed', 'vehicle_year', 'black', 'white', \
                 'gray', 'silver', 'blue', 'red', 'yellow', \
                 'green', 'brown', 'intercept']
printLinRegResults(cv_model.bestModel, features_list)

#### Tune hyperparameters using k-fold cross-validation

Use the [CrossValidator](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidator.html) class to specify k-fold cross-validation. K-fold cross-validation performs model selection by splitting the dataset into k non-overlapping, randomly partitioned folds that are used as separate training and test datasets. For example, with k=3 folds, k-fold cross-validation generates 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. Each fold is used as the test set exactly once. For small datasets, k-fold cross-validation typically gives a more reliable estimate of performance than a single holdout split, at the cost of fitting every hyperparameter combination k times.

The result is an instance of the [CrossValidatorModel](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.tuning.CrossValidatorModel.html) class.
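
The fold construction described above can be sketched in plain Python. The helper `k_fold_pairs` is hypothetical and works on row indices only. Spark assigns rows to folds randomly, so its folds are only approximately equal in size, whereas this sketch deals shuffled indices out evenly.

```python
import random

def k_fold_pairs(n_rows, k, seed):
    """Yield (training_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k non-overlapping folds
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation

pairs = list(k_fold_pairs(n_rows=9, k=3, seed=54321))
for training, validation in pairs:
    print(sorted(training), sorted(validation))
```

With 9 rows and k=3, each pair trains on 6 rows and validates on the remaining 3, and every row lands in a validation fold exactly once.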

In [0]:
# Use the CrossValidator class to specify the k-fold cross-validation
from pyspark.ml.tuning import CrossValidator
kfold_validator = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, numFolds=3, seed=54321)
kfold_model = kfold_validator.fit(train)

In [0]:
# Check type of resulting class
type(kfold_model)

In [0]:
# The cross-validation results are stored in the `avgMetrics` attribute
kfold_model.avgMetrics

# These are the rmse errors (smaller is better) 

#### Plot Validation Metric (rmse) against Regularization Parameter

In [0]:
to_plot2 = zip(regParamList,kfold_model.avgMetrics)
to_plot_df2 = spark.createDataFrame(to_plot2, schema=["Regularization_Param", "Validation_Metric"])

# The best performance was with the regularization parameter at zero (i.e. Linear Regression)
# However, it might be worth exploring the space between 0 and 0.1 

In [0]:
display(to_plot_df2)

Regularization_Param,Validation_Metric
0.0,1.0836731652226863
0.1,1.0958841775631103
0.2,1.121649513739378
0.3,1.136854244478132
0.4,1.136854244478132
0.5,1.136854244478132
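
Reading the best candidate off the table can also be done programmatically. The short sketch below pairs each regularization value with its averaged RMSE, copied from the output above, and takes the minimum. The variable names are illustrative. In Spark itself, the evaluator's `isLargerBetter()` method tells the validator which direction counts as better for the chosen metric.

```python
reg_params = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
avg_rmse = [1.0836731652226863, 1.0958841775631103, 1.121649513739378,
            1.136854244478132, 1.136854244478132, 1.136854244478132]

# RMSE is an error, so the best candidate is the one with the smallest value
best_reg, best_rmse = min(zip(reg_params, avg_rmse), key=lambda pair: pair[1])
print(best_reg, best_rmse)  # best_reg is 0.0
```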


In [0]:
# The `bestModel` attribute contains the model based on the best set of hyperparameters.
# In this case, it is an instance of the `LinearRegressionModel` class
type(kfold_model.bestModel)

In [0]:
printLinRegResults(kfold_model.bestModel, features_list)

In [0]:
# Compute the performance of the best model on the test dataset
summary_test = kfold_model.bestModel.evaluate(test)

print ("R-Squared: " + str(summary_test.r2))
print ("RMSE:     " + str(summary_test.rootMeanSquaredError))

print(type(summary_test))

In [0]:
# Show some of the predictions (used sampleBy to ensure I got some rides with reviews)
summary_test.predictions.sampleBy("reviewed", fractions={0:0.00025, 1:0.05}, seed = 42)\
  .drop("features")\
  .withColumnRenamed("reviewed","rev")\
  .withColumnRenamed("vehicle_year","year")\
  .withColumnRenamed("vehicle_color_cd","color")\
  .withColumnRenamed("star_rating","star")\
  .withColumnRenamed("high_rating","high")\
  .show(10, False)

### Hands On

![Hands-on](https://cis442f-open-data.s3.amazonaws.com/pictures/hands.png "Hands-on")


#### Exercises

(1) Maybe our regularization parameters are too large. Rerun the hyperparameter tuning with regularization parameters [0.0, 0.02, 0.04, 0.06, 0.08, 0.1].

(2) Create a parameter grid that searches over regularization type (lasso or ridge) as well as the regularization parameter.

(3) Apply hyperparameter tuning to another learning algorithm (estimator).



#### References

[Model Selection and hyperparameter tuning](http://spark.apache.org/docs/latest/ml-tuning.html)

[pyspark MLlib tuning](https://spark.apache.org/docs/latest/api/python/reference/pyspark.ml.html#tuning)

In [0]:
# A function for printing common results of a linear regression model
# Note that this is a prototype. In its current form it will not handle
# models without an intercept. Works okay for this example

# This version works even if pandas is not available

from pyspark.sql.types import *
def printLinRegResults(model, feature_list):
    # Query model performance:
    print ("R-Squared: " + str(model.summary.r2))
    print ("RMSE:     " + str(model.summary.rootMeanSquaredError))
   
    # Build a list of model coefficients with native python float types
    combined_coeff = []
    coeff_floats = [round(float(np_float),5) for np_float in model.coefficients] # convert coefficients to floats
    combined_coeff.extend((coeff_floats)) # Add the coefficients to the list
    combined_coeff.append(round(float(model.intercept),5)) # Append the intercept to the list of coefficients
    
    StandardErrors = [round(num, 5) for num in model.summary.coefficientStandardErrors]
    tValues = [round(num, 5) for num in model.summary.tValues]
    pValues = [round(num, 5) for num in model.summary.pValues]
    
    # Create a Spark DataFrame with a summary of the regression results (no pandas needed)
    # First define the schema of the DataFrame
    schema = StructType([StructField("Feature", StringType(), True),\
                        StructField("Coefficient", DoubleType(), True),\
                        StructField("Standard Error", DoubleType(), True),\
                        StructField("t-value", DoubleType(), True),\
                        StructField("p-value", DoubleType(), True)])
  
    # zip the elements of the various lists together and create a DataFrame
    model_summary = zip(feature_list, combined_coeff, StandardErrors, tValues, pValues)
    to_print_df = spark.createDataFrame(model_summary, schema=schema)
    print("Note: Last row of this table represents the intercept")
    to_print_df.show()
    
    

In [0]:
# This helped me confirm that I was including the right colors when displaying the results using the function I defined
rides.groupBy("vehicle_color_cd").count().orderBy("vehicle_color_cd").show()