
# Model Tuning

The hyperparameters of a machine learning model are parameters that are not learned from data. They should be set prior to fitting the model to the training set. In this chapter, you'll learn how to tune the hyperparameters of a tree-based model using grid search cross validation.

# Tuning a CART's Hyperparameters

1. Tuning a CART's hyperparameters
To obtain better performance, the hyperparameters of a machine learning model should be tuned.

2. Hyperparameters
Machine learning models are characterized by parameters and hyperparameters. Parameters are learned from data through training; examples of parameters include the split-feature and the split-point of a node in a CART. Hyperparameters are not learned from data; they should be set prior to training. Examples of hyperparameters include the maximum-depth and the splitting-criterion of a CART.
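To make the distinction concrete, here is a minimal sketch that fits a small tree and reads back one learned parameter next to one hyperparameter. The dataset choice (scikit-learn's built-in breast cancer data) and the values `max_depth=2` and `random_state=1` are assumptions made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hyperparameters: chosen by us before fitting (illustrative values)
dt = DecisionTreeClassifier(max_depth=2, criterion='gini', random_state=1)
hyperparams = dt.get_params()

# Parameters: learned from the data during fitting
dt.fit(X, y)
root_feature = dt.tree_.feature[0]      # index of the split-feature at the root node
root_threshold = dt.tree_.threshold[0]  # split-point learned for that feature

print('max_depth (hyperparameter):', hyperparams['max_depth'])
print('root split-feature index (parameter):', root_feature)
print('root split-point (parameter):', root_threshold)
```

Changing `max_depth` requires refitting, whereas the split-feature and split-point change automatically whenever the training data changes.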

3. What is hyperparameter tuning?
Hyperparameter tuning consists of searching for the set of hyperparameters that yields the optimal model for the learning algorithm, that is, the model achieving the optimal score. The score function measures the agreement between true labels and a model's predictions. In sklearn, it defaults to accuracy for classifiers and R-squared for regressors. A model's generalization performance is evaluated using cross-validation.
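As a quick check of the default score, the sketch below compares a classifier's `.score()` with an explicit accuracy computation and then estimates generalization performance with 5-fold cross-validation. The dataset, the split and the value `max_depth=3` are assumptions chosen for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

dt = DecisionTreeClassifier(max_depth=3, random_state=1)
dt.fit(X_train, y_train)

# For a classifier, .score() returns accuracy
default_score = dt.score(X_test, y_test)
explicit_accuracy = accuracy_score(y_test, dt.predict(X_test))

# Cross-validated estimate on the training set (default scoring: accuracy)
cv_scores = cross_val_score(
    DecisionTreeClassifier(max_depth=3, random_state=1), X_train, y_train, cv=5)

print('score():', default_score, '| accuracy_score:', explicit_accuracy)
print('mean 5-fold CV accuracy:', cv_scores.mean())
```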

4. Why tune hyperparameters?
A legitimate question that you may ask is: why bother tuning hyperparameters? Well, in scikit-learn, a model's default hyperparameters are not optimal for all problems. Hyperparameters should be tuned to obtain the best model performance.

5. Approaches to hyperparameter tuning
Now there are many approaches for hyperparameter tuning including: grid-search, random-search, and so on. In this course, we'll only be exploring the method of grid-search.

6. Grid search cross validation
In grid-search cross-validation, first you manually set a grid of discrete hyperparameter values. Then, you pick a metric for scoring model performance and you search exhaustively through the grid. For each set of hyperparameters, you evaluate the corresponding model's cross-validation score. The optimal hyperparameters are those for which the model achieves the best cross-validation score. Note that grid-search suffers from the curse of dimensionality, i.e., the number of combinations is the product of the number of values tried for each hyperparameter, so the bigger the grid, the longer it takes to find the solution.
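The growth of the grid can be sketched with scikit-learn's `ParameterGrid`, which enumerates every combination in a grid. The grid values and the number of folds below are hypothetical.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical two-dimensional grid
grid = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.05, 0.1],
}
combos = list(ParameterGrid(grid))   # 3 * 2 = 6 combinations

n_folds = 5
n_fits = len(combos) * n_folds       # one model fit per combination and fold

# Adding a third hyperparameter with 4 values multiplies the grid size by 4
bigger_grid = dict(grid, max_features=[0.2, 0.4, 0.6, 0.8])
n_bigger = len(ParameterGrid(bigger_grid))

print(len(combos), n_fits, n_bigger)
```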

7. Grid search cross validation: example
Let's walk through a concrete example to understand this procedure. Consider the case of a CART where you search through the two-dimensional hyperparameter grid shown here. The dimensions correspond to the CART's maximum-depth and the minimum-percentage of samples per leaf. For each combination of hyperparameters, the cross-validation score is evaluated using k-fold CV for example. Finally, the optimal hyperparameters correspond to the model achieving the best cross-validation score.
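The procedure can be written out by hand as a loop over the grid. This is only a sketch of the idea, not how `GridSearchCV` is implemented; the dataset, the grid values and the use of 5 folds are assumptions.

```python
from itertools import product

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hypothetical grid: maximum depth x minimum fraction of samples per leaf
max_depths = [2, 3, 4]
min_samples_leaves = [0.05, 0.1]

results = {}
for depth, leaf in product(max_depths, min_samples_leaves):
    model = DecisionTreeClassifier(
        max_depth=depth, min_samples_leaf=leaf, random_state=1)
    # Mean k-fold CV accuracy for this combination
    results[(depth, leaf)] = cross_val_score(model, X, y, cv=5).mean()

# The optimal hyperparameters are those with the best CV score
best_combo = max(results, key=results.get)
print('best (max_depth, min_samples_leaf):', best_combo)
print('best CV accuracy:', results[best_combo])
```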

8. Inspecting the hyperparameters of a CART in sklearn
Let's now see how we can inspect the hyperparameters of a CART in scikit-learn. You can first instantiate a DecisionTreeClassifier dt as shown here.

9. Inspecting the hyperparameters of a CART in sklearn
Then, call dt's .get_params() method. This returns a dictionary where the keys are the hyperparameter names. In the following, we'll only be optimizing max_depth, max_features and min_samples_leaf. Note that max_features is the number of features to consider when looking for the best split. When it's a float, it is interpreted as a fraction of the total number of features. You can learn more about these hyperparameters by consulting scikit-learn's documentation.
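A minimal sketch of this inspection follows. The second part, which fits a tree with `max_features=0.5` on the 30-feature breast cancer data, is an added illustration of how a float value is turned into a feature count.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=1)
params = dt.get_params()   # dict: hyperparameter name -> current value

for name in ['max_depth', 'max_features', 'min_samples_leaf']:
    print(name, '=', params[name])

# A float max_features is a fraction of the number of features
X, y = load_breast_cancer(return_X_y=True)   # 30 features
dt_half = DecisionTreeClassifier(max_features=0.5, random_state=1).fit(X, y)
print('features considered per split:', dt_half.max_features_)
```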

10. Grid search CV in sklearn (Breast Cancer dataset)
Let's now tune dt on the Wisconsin breast cancer dataset, which is already loaded and split into 80%-train and 20%-test. First, import GridSearchCV from sklearn.model_selection. Then, define a dictionary called params_dt containing the names of the hyperparameters to tune as keys and lists of hyperparameter-values as values. Once done, instantiate a GridSearchCV object grid_dt by passing dt as an estimator and params_dt as param_grid. Also set scoring to accuracy and cv to 10 in order to use 10-fold stratified cross-validation for model evaluation. Finally, fit grid_dt to the training set.
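The steps described here might look as follows. The transcript does not list the grid values, so the ones below are assumptions, as are the `random_state` seeds and the use of scikit-learn's built-in copy of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load and split the data: 80% train, 20% test
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

dt = DecisionTreeClassifier(random_state=1)

# Grid of hyperparameters (illustrative values, not taken from the course)
params_dt = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

# 10-fold CV with accuracy as the metric
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='accuracy',
                       cv=10)
grid_dt.fit(X_train, y_train)

print('combinations evaluated:', len(grid_dt.cv_results_['params']))
```

With an integer `cv` and a classifier, `GridSearchCV` uses stratified folds, which is why no explicit splitter is passed.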

11. Extracting the best hyperparameters
After training grid_dt, the best set of hyperparameter-values can be extracted from the attribute .best_params_ of grid_dt. Also, the best cross-validation accuracy can be accessed through grid_dt's .best_score_ attribute.

12. Extracting the best estimator
Similarly, the best model can be extracted using the .best_estimator_ attribute. Note that this model is fitted on the whole training set because the refit parameter of GridSearchCV is set to True by default. Finally, you can evaluate this model's test set accuracy using the score method. The result is about 94.7%, while the score of an untuned CART is 93%.
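A self-contained sketch of these extraction steps is shown below. It uses a small hypothetical grid and 5 folds to keep it quick, so its numbers are not expected to match the 94.7% quoted above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

# Small illustrative grid (values are assumptions)
params_dt = {'max_depth': [2, 3, 4], 'min_samples_leaf': [0.04, 0.08]}
grid_dt = GridSearchCV(DecisionTreeClassifier(random_state=1),
                       param_grid=params_dt, scoring='accuracy', cv=5)
grid_dt.fit(X_train, y_train)

best_hyperparams = grid_dt.best_params_   # dict of the winning values
best_cv_score = grid_dt.best_score_       # mean CV accuracy of the winner
best_model = grid_dt.best_estimator_      # refit on the whole training set

# Test set accuracy of the tuned tree
test_acc = best_model.score(X_test, y_test)

print('Best hyperparameters:', best_hyperparams)
print('Best CV accuracy: {:.3f}'.format(best_cv_score))
print('Test set accuracy: {:.3f}'.format(test_acc))
```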

13. Let's practice!
Now it's your turn to practice.

# Tree hyperparameters

In the following exercises you'll revisit the Indian Liver Patient dataset which was introduced in a previous chapter.

Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you'll be using the ROC AUC score as a metric instead of accuracy.

We have instantiated a DecisionTreeClassifier with sklearn's default hyperparameters and assigned it to dt. You can inspect the hyperparameters of dt in your console.

Which of the following is not a hyperparameter of dt?

- min_impurity_decrease
 - Incorrect! min_impurity_decrease is a hyperparameter and it defaults to 0.

- min_weight_fraction_leaf
 - Incorrect! min_weight_fraction_leaf is a hyperparameter and it defaults to 0.

- min_features
 - Well done! There is no hyperparameter named min_features.

- splitter
 - Incorrect! splitter is a hyperparameter and it defaults to 'best'.

# Set the tree's hyperparameter grid
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.

Instructions

1. Define a grid of hyperparameters corresponding to a Python dictionary called params_dt with:

 -  the key 'max_depth' set to a list of values 2, 3, and 4

 - the key 'min_samples_leaf' set to a list of values 0.12, 0.14, 0.16, 0.18

In [None]:
# Define params_dt
params_dt = {
    'max_depth': [2, 3, 4],
    'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]
}

Conclusion

Great! Next comes performing the grid search.

# Search for the optimal tree

In this exercise, you'll perform grid search using 5-fold cross validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object just like any scikit-learn estimator by using the .fit() method:

`grid_object.fit(X_train, y_train)`

An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.

Instructions

1. Import GridSearchCV from sklearn.model_selection.

2. Instantiate a GridSearchCV object using 5-fold CV by setting the parameters:

 - estimator to dt, param_grid to params_dt and

 - scoring to 'roc_auc'.

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
                       param_grid=params_dt,
                       scoring='roc_auc',
                       cv=5,
                       n_jobs=-1)

Conclusion

Awesome! As we said earlier, we will fit the model to the training data for you and in the next exercise you will compute the test set ROC AUC score.

# Evaluate the optimal tree

In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.

In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the method predict_proba() of an sklearn classifier to compute a 2D array whose columns contain the probabilities of the negative and positive class-labels, respectively.
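The shape and column order of that array can be checked with a short sketch. The Indian Liver Patient data is not bundled with scikit-learn, so the example uses a synthetic imbalanced dataset as a stand-in; the tree's hyperparameter values are also assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced binary problem (stand-in for the real dataset)
X, y = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1)

dt = DecisionTreeClassifier(max_depth=3, min_samples_leaf=0.1, random_state=1)
dt.fit(X_train, y_train)

proba = dt.predict_proba(X_test)   # shape: (n_test_samples, 2)
print('column order:', dt.classes_)  # columns follow dt.classes_

# Column 1 holds the probability of the positive class (label 1)
y_pred_proba = proba[:, 1]
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
```

ROC AUC is computed from the positive-class probabilities rather than from hard predictions, which is why the second column is selected.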

The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test, y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:

`grid_dt.fit(X_train, y_train)`

Instructions

1. Import roc_auc_score from sklearn.metrics.

2. Extract the .best_estimator_ attribute from grid_dt and assign it to best_model.

3. Predict the test set probabilities of obtaining the positive class y_pred_proba.

4. Compute the test set ROC AUC score test_roc_auc of best_model.

In [None]:
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

'''
<script.py> output:
    Test set ROC AUC score: 0.610
'''

Conclusion

Great work! An untuned classification-tree would achieve a ROC AUC score of 0.54!

# Tuning a RF's Hyperparameters

1. Tuning an RF's Hyperparameters
Let's now turn to tuning the hyperparameters of Random Forests, an ensemble method.

2. Random Forests Hyperparameters
In addition to the hyperparameters of the CARTs forming random forests, the ensemble itself is characterized by other hyperparameters such as the number of estimators, whether it uses bootstrapping or not, and so on.

3. Tuning is expensive
As a note, hyperparameter tuning is computationally expensive and may sometimes lead to only a very slight improvement in a model's performance. For this reason, it is advisable to weigh the impact of tuning on the pipeline of your data analysis project as a whole in order to understand if it is worth pursuing.

4. Inspecting RF Hyperparameters in sklearn
To inspect the hyperparameters of a RandomForestRegressor, first, import RandomForestRegressor from sklearn.ensemble and then instantiate a RandomForestRegressor rf as shown here.

5. Inspecting RF Hyperparameters in sklearn
The hyperparameters of rf along with their default values can be accessed by calling rf's .get_params() method. In the following, we'll be optimizing n_estimators, max_depth, min_samples_leaf and max_features. You can learn more about these hyperparameters by consulting scikit-learn's documentation.
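A minimal sketch of this inspection follows. The printed defaults depend on the installed scikit-learn version, so no expected output is shown; `random_state=1` is an assumption.

```python
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(random_state=1)
rf_params = rf.get_params()   # dict: hyperparameter name -> current value

# Tree-level hyperparameters plus ensemble-level ones such as
# n_estimators and bootstrap
for name in ['n_estimators', 'max_depth', 'min_samples_leaf',
             'max_features', 'bootstrap']:
    print(name, '=', rf_params[name])
```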

6. GridSearchCV in sklearn (auto dataset)
We'll perform grid-search cross-validation on the auto-dataset which is already loaded and split into 80%-train and 20%-test. First import mean_squared_error as MSE from sklearn.metrics and GridSearchCV from sklearn.model_selection. Then, define a dictionary called params_rf containing the grid of hyperparameters. Finally, instantiate a GridSearchCV object called grid_rf and pass the parameters rf as estimator, params_rf as param_grid. Also set cv to 3 to perform 3-fold cross-validation. In addition, set scoring to neg_mean_squared_error in order to use negative mean squared error as a metric. Note that the parameter verbose controls verbosity; the higher its value, the more messages are printed during fitting.
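The setup described here can be sketched as follows. The auto dataset is not bundled with scikit-learn, so a synthetic regression problem stands in for it, and the grid values are assumptions kept small for speed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (stand-in for the auto dataset)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

rf = RandomForestRegressor(random_state=1)

# Illustrative grid (values are assumptions)
params_rf = {
    'n_estimators': [25, 50],
    'max_depth': [4, 6],
    'min_samples_leaf': [0.1, 0.2],
    'max_features': ['log2', 'sqrt'],
}

grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1)
grid_rf.fit(X_train, y_train)

# best_score_ is a negative MSE: flip the sign before taking the root
cv_rmse = (-grid_rf.best_score_) ** 0.5
print('Best CV RMSE: {:.3f}'.format(cv_rmse))
```

The MSE is negated because `GridSearchCV` always maximizes its scoring metric, so a lower error has to correspond to a higher score.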

7. Searching for the best hyperparameters
You can now fit grid_rf to the training set as shown here. The output shows messages related to grid fitting as well as the obtained optimal model.

8. Extracting the best hyperparameters
You can extract rf's best hyperparameters by getting the attribute best_params_ from grid_rf. The results are shown here.

9. Evaluating the best model performance
You can also extract the best model from grid_rf. This enables you to predict the test set labels and evaluate the test-set RMSE. The output shows a result of 3.89. Had you trained an untuned model, the RMSE would be 3.98.
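The evaluation step, together with an untuned baseline for comparison, might be sketched like this. It again uses synthetic data and a hypothetical grid, so its RMSE values are unrelated to the 3.89 and 3.98 quoted above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data (stand-in for the auto dataset)
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

grid_rf = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid={'n_estimators': [25, 50], 'min_samples_leaf': [2, 10]},
    scoring='neg_mean_squared_error',
    cv=3)
grid_rf.fit(X_train, y_train)

# Tuned model: already refit on the whole training set
best_model = grid_rf.best_estimator_
y_pred = best_model.predict(X_test)
rmse_test = MSE(y_test, y_pred) ** 0.5

# Untuned baseline with default hyperparameters
rf_default = RandomForestRegressor(random_state=1).fit(X_train, y_train)
rmse_default = MSE(y_test, rf_default.predict(X_test)) ** 0.5

print('Test RMSE of best model: {:.3f}'.format(rmse_test))
print('Test RMSE of untuned model: {:.3f}'.format(rmse_default))
```

The tuned model is not guaranteed to beat the baseline on the test set, since the grid is chosen using cross-validation on the training data only.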

10. Let's practice!
Now let's try some examples.

# Random forests hyperparameters

In the following exercises, you'll be revisiting the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be tuning the hyperparameters of a Random Forests regressor.

We have instantiated a RandomForestRegressor called rf using sklearn's default hyperparameters. You can inspect the hyperparameters of rf in your console.

Which of the following is not a hyperparameter of rf?

Possible Answers

- min_weight_fraction_leaf
 - Incorrect! min_weight_fraction_leaf is a hyperparameter and it defaults to 0.

- criterion
 - Incorrect! criterion is a hyperparameter and it defaults to 'mse' (named 'squared_error' from scikit-learn 1.0 onward).

- learning_rate
 - Well done! There is no hyperparameter named learning_rate.

- warm_start
 - Incorrect! warm_start is a hyperparameter and it defaults to False.

# Set the hyperparameter grid of RF

In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf's hyperparameters and find the optimal regressor. For this purpose, you will be constructing a grid of hyperparameters and tune the number of estimators, the maximum number of features used when splitting each node and the minimum number of samples (or fraction) per leaf.

Instructions

1. Define a grid of hyperparameters corresponding to a Python dictionary called params_rf with:

 - the key 'n_estimators' set to a list of values 100, 350, 500

 - the key 'max_features' set to a list of values 'log2', 'auto', 'sqrt'

 - the key 'min_samples_leaf' set to a list of values 2, 10, 30

In [None]:
# Define the dictionary 'params_rf'
params_rf = {
    'n_estimators': [100, 350, 500],
    'max_features': ['log2', 'auto', 'sqrt'],  # 'auto' was removed in scikit-learn 1.3; use 1.0 (all features) there
    'min_samples_leaf': [2, 10, 30]
}

# Search for the optimal forest

In this exercise, you'll perform grid search using 3-fold cross validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.

Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object just like any scikit-learn estimator by using the .fit() method:

`grid_object.fit(X_train, y_train)`

The untuned random forests regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.

Instructions

1. Import GridSearchCV from sklearn.model_selection.

2. Instantiate a GridSearchCV object using 3-fold CV by using negative mean squared error as the scoring metric.

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
                       param_grid=params_rf,
                       scoring='neg_mean_squared_error',
                       cv=3,
                       verbose=1,
                       n_jobs=-1)

# Evaluate the optimal forest

In this last exercise of the course, you'll evaluate the test set RMSE of grid_rf's optimal model.

The dataset is already loaded and processed for you and is split into 80% train and 20% test. X_test, y_test and the function mean_squared_error from sklearn.metrics (under the alias MSE) are available in your environment. In addition, we have also loaded the trained GridSearchCV object grid_rf that you instantiated in the previous exercise. Note that grid_rf was trained as follows:

`grid_rf.fit(X_train, y_train)`

Instructions

1. Import mean_squared_error as MSE from sklearn.metrics.

2. Extract the best estimator from grid_rf and assign it to best_model.

3. Predict best_model's test set labels and assign the result to y_pred.

4. Compute best_model's test set RMSE.

In [None]:
# Import mean_squared_error from sklearn.metrics as MSE 
from sklearn.metrics import mean_squared_error as MSE

# Extract the best estimator
best_model = grid_rf.best_estimator_

# Predict test set labels
y_pred = best_model.predict(X_test)

# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**(1/2)

# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))

'''
<script.py> output:
    Test RMSE of best model: 50.569
'''

# Congratulations!

1. Congratulations!
Congratulations on completing this course!

2. How far you have come
Take a moment to look at how far you have come! In chapter 1, you started off by understanding and applying the CART algorithm to train decision trees or CARTs for problems involving classification and regression. In chapter 2, you understood what the generalization error of a supervised learning model is. In addition, you also learned how underfitting and overfitting can be diagnosed with cross-validation. Furthermore, you learned how model ensembling can produce results that are more robust than individual decision trees. In chapter 3, you applied randomization through bootstrapping and constructed a diverse set of trees in an ensemble through bagging. You also explored how random forests introduce further randomization by sampling features at the level of each node in each tree forming the ensemble. Chapter 4 introduced you to boosting, an ensemble method in which predictors are trained sequentially and where each predictor tries to correct the errors made by its predecessor. Specifically, you saw how AdaBoost involved tweaking the weights of the training samples while gradient boosting involved fitting each tree using the residuals of its predecessor as labels. You also learned how subsampling instances and features can lead to better performance through Stochastic Gradient Boosting. Finally, in chapter 5, you explored hyperparameter tuning through Grid Search cross-validation and you learned how important tuning is for getting the most out of your models.

3. Thank you!
I hope you enjoyed taking this course as much as I enjoyed developing it. Finally, I encourage you to apply the skills you learned by practicing on real-world datasets.

