#### Tuning a CART's Hyperparameters

the hyperparameters of a machine learning model are the parameters that aren't learned from data. they should be set prior to fitting the model to the training set. we'll learn how to tune the hyperparameters of a tree-based model using grid search and cross validation

to get better performance, the hyperparameters of a machine learning model should be tuned
machine learning models are characterized by parameters and hyperparameters:
- parameters are learned from data, for a CART these are the split-point of a node, the split-feature of a node, etc.
- hyperparameters aren't learned from data, you set them prior to training, for a CART these are max_depth, min_samples_leaf, the splitting criterion, etc.
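
to make the distinction concrete, here's a minimal sketch, assuming sklearn's built-in breast cancer dataset as a stand-in for the cancer data used in these notes. the hyperparameter values are arbitrary picks for illustration. `get_params()` returns what you set, while the fitted `tree_` attribute holds what was learned

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# assumption: the built-in breast cancer data stands in for the notes' dataset
X, y = load_breast_cancer(return_X_y=True)

# hyperparameters: chosen by you before fitting (values here are arbitrary)
dt = DecisionTreeClassifier(max_depth=2, min_samples_leaf=0.1, random_state=1)
print('set before fit:', dt.get_params()['max_depth'], dt.get_params()['min_samples_leaf'])

# parameters: learned from the data during fit
dt.fit(X, y)
root_feature = dt.tree_.feature[0]      # index of the split-feature at the root node
root_threshold = dt.tree_.threshold[0]  # split-point at the root node
print('learned at root: feature', root_feature, 'threshold', root_threshold)
```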

hyperparameter tuning is the search for the set of optimal hyperparameters for a learning algorithm, i.e. the set that results in an optimal model. the optimal model yields an optimal score. the score function measures the agreement between the true labels and a model's predictions, in sklearn the default is accuracy for classification and r squared for regression. the model's generalization performance is evaluated using cross validation (cv)
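
a quick check of those defaults, as a sketch under assumptions: the built-in breast cancer and diabetes datasets are stand-ins, and max_depth=3 is an arbitrary choice. leaving `scoring` unset in `cross_val_score` falls back to the estimator's own `score` method, so the results should match the explicit `'accuracy'` and `'r2'` scorers

```python
import numpy as np
from sklearn.datasets import load_breast_cancer, load_diabetes
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# assumption: built-in datasets used purely for illustration
X_clf, y_clf = load_breast_cancer(return_X_y=True)
X_reg, y_reg = load_diabetes(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=1)
reg = DecisionTreeRegressor(max_depth=3, random_state=1)

# classification: default score vs explicit accuracy
acc_default = cross_val_score(clf, X_clf, y_clf, cv=5)
acc_explicit = cross_val_score(clf, X_clf, y_clf, cv=5, scoring='accuracy')

# regression: default score vs explicit r squared
r2_default = cross_val_score(reg, X_reg, y_reg, cv=5)
r2_explicit = cross_val_score(reg, X_reg, y_reg, cv=5, scoring='r2')

print('same for classifier:', np.allclose(acc_default, acc_explicit))
print('same for regressor:', np.allclose(r2_default, r2_explicit))
```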

why do we tune hyperparameters? the default hyperparameters of sklearn's models aren't optimal for all problems, so you need to tune the hyperparameters to obtain the best model performance

there are lots of approaches to hyperparameter tuning like grid search, random search, bayesian optimization, genetic algorithms, etc.
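
for comparison with grid search, here's a rough sketch of random search using sklearn's `RandomizedSearchCV`. the dataset (built-in breast cancer data), the candidate values and `n_iter=8` are all assumptions chosen to keep it small. instead of trying every combination it samples a fixed number of them

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# assumption: built-in breast cancer data as a stand-in dataset
X, y = load_breast_cancer(return_X_y=True)

# candidate values to sample from (illustrative choices)
param_distributions = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

# try only 8 randomly sampled combinations out of the 48 possible ones
search = RandomizedSearchCV(estimator=DecisionTreeClassifier(random_state=1),
                            param_distributions=param_distributions,
                            n_iter=8,
                            scoring='accuracy',
                            cv=5,
                            random_state=1)
search.fit(X, y)

print('Best hyperparameters:\n', search.best_params_)
print('Best CV accuracy: {:.3f}'.format(search.best_score_))
```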

grid search cross validation:
- manually set a grid of discrete hyperparameter values
- pick a metric for scoring model performance
- search exhaustively through the grid
- for each set of hyperparameters, evaluate the corresponding model's cv score
- the optimal hyperparameters are those for which the model achieves the best cross-validation score
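
the steps above can be sketched by hand with `ParameterGrid` and `cross_val_score`. this is only an illustration of the idea, not how `GridSearchCV` is implemented internally. the dataset (built-in breast cancer data) and the tiny grid are assumptions

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# assumption: built-in breast cancer data as a stand-in dataset
X, y = load_breast_cancer(return_X_y=True)

# manually set a small grid of discrete hyperparameter values
grid = {'max_depth': [3, 4], 'min_samples_leaf': [0.04, 0.08]}

best_score, best_params = -1.0, None

# search exhaustively through every combination in the grid
for params in ParameterGrid(grid):
    model = DecisionTreeClassifier(random_state=1, **params)
    # the chosen metric is accuracy, evaluated with 5-fold cv
    score = cross_val_score(model, X, y, cv=5, scoring='accuracy').mean()
    # keep the combination with the best mean cv score
    if score > best_score:
        best_score, best_params = score, params

print('Best hyperparameters:\n', best_params)
print('Best CV accuracy: {:.3f}'.format(best_score))
```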

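grid search suffers from the curse of dimensionality: the number of combinations is the product of the number of values tried for each hyperparameter, so the larger the grid the longer it takes to find the solution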
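
to get a feel for how fast the cost grows, this sketch counts the model fits for the decision tree grid used further down in these notes (4 x 3 x 4 values with 10-fold cv). the extra `min_samples_split` values are a hypothetical addition, only there to show the multiplication

```python
from sklearn.model_selection import ParameterGrid

# the same grid as params_dt in these notes
grid = {
    'max_depth': [3, 4, 5, 6],
    'min_samples_leaf': [0.04, 0.06, 0.08],
    'max_features': [0.2, 0.4, 0.6, 0.8],
}

n_candidates = len(ParameterGrid(grid))  # 4 * 3 * 4 combinations
n_fits = n_candidates * 10               # each one is fitted once per fold (10-fold cv)
print(n_candidates, 'candidates,', n_fits, 'fits')

# hypothetical: one more hyperparameter with 5 values multiplies the work by 5
bigger = dict(grid, min_samples_split=[2, 4, 8, 16, 32])
n_candidates_bigger = len(ParameterGrid(bigger))
print(n_candidates_bigger, 'candidates,', n_candidates_bigger * 10, 'fits')
```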

In [None]:
# inspecting the hyperparameters of a CART in sklearn
from sklearn.tree import DecisionTreeClassifier

# set seed for reproducibility
SEED = 1

# instantiate a dtc
dt = DecisionTreeClassifier(random_state=SEED)

# print out a dictionary, the keys are the hyperparameter names and the values are their current settings
print(dt.get_params())

In [None]:
# tune dt on the cancer dataset, it's already loaded and split
from sklearn.model_selection import GridSearchCV

# define a dictionary with the names of the hyperparameters you want to tune as the keys and lists of hyperparameter values as values
params_dt = {
             'max_depth': [3, 4, 5, 6], 
             'min_samples_leaf': [0.04, 0.06, 0.08], 
             'max_features': [0.2, 0.4, 0.6, 0.8]
            }

# instantiate a grid search cv object
grid_dt = GridSearchCV(estimator=dt, 
                       param_grid=params_dt, 
                       scoring='accuracy',
                       cv=10, 
                       n_jobs=-1)

# fit to the training set
grid_dt.fit(X_train, y_train)

# extract the best hyperparameters 
best_hyperparams = grid_dt.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# extract the best cross validation accuracy score
best_CV_score = grid_dt.best_score_
print('Best CV accuracy: {:.3f}'.format(best_CV_score))

# extract the best model
best_model = grid_dt.best_estimator_

# this model is fitted on the whole training set because the refit parameter of the GridSearchCV is set to True by default

# evaluate test set accuracy
test_acc = best_model.score(X_test, y_test)
print("Test set accuracy of best model: {:.3f}".format(test_acc))

In [None]:
# exercise example: tune the hyperparameters of a classification tree
# the dataset is imbalanced, so use the ROC AUC score as the metric instead of accuracy,
# i.e. set scoring to 'roc_auc' in GridSearchCV and do all the same steps as above
# then evaluate the test set ROC AUC score of grid_dt's optimal model
# to do so, first determine the probability of obtaining the positive label for each test set observation
# you can use the method predict_proba() of an sklearn classifier to compute a 2D array containing the probabilities of the
# negative and positive class labels respectively along columns

# Import roc_auc_score from sklearn.metrics 
from sklearn.metrics import roc_auc_score

# Extract the best estimator
best_model = grid_dt.best_estimator_

# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:,1]

# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))

#### Tuning an RF's Hyperparameters

we'll now tune the hyperparameters of random forests, which are an ensemble method

in addition to the hyperparameters of the CARTs forming random forests, the ensemble itself is characterized by other hyperparameters like the number of estimators, whether it uses bootstrapping, etc.
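
one way to see which hyperparameters belong to the ensemble itself is to compare the `get_params()` keys of a random forest with those of a single tree. a small sketch (the exact lists depend on your sklearn version):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

tree_params = set(DecisionTreeRegressor().get_params())
forest_params = set(RandomForestRegressor().get_params())

# hyperparameters inherited from the CARTs forming the forest
shared = sorted(tree_params & forest_params)
# hyperparameters that only exist at the ensemble level
ensemble_only = sorted(forest_params - tree_params)

print('shared with a single tree:', shared)
print('ensemble only:', ensemble_only)
```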

hyperparameter tuning is computationally expensive and might only lead to a very slight improvement in some situations :(
because of this, you should weigh the impact of tuning on the pipeline of your data analysis project as a whole so that you can understand if it's worth pursuing
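
one rough way to weigh this is to time a baseline against a small grid search and compare the gain. a sketch under assumptions: the built-in diabetes dataset, 50 trees and a tiny 2 x 2 grid are all arbitrary choices to keep it fast, and the numbers will differ on your data and machine

```python
import time
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

# assumption: built-in diabetes data as a stand-in regression dataset
X, y = load_diabetes(return_X_y=True)
rf = RandomForestRegressor(n_estimators=50, random_state=1)

# baseline: untuned model, 3-fold cv
start = time.perf_counter()
default_score = cross_val_score(rf, X, y, cv=3, scoring='neg_mean_squared_error').mean()
default_seconds = time.perf_counter() - start

# tuned: small illustrative grid, same cv and metric
start = time.perf_counter()
grid = GridSearchCV(estimator=rf,
                    param_grid={'max_depth': [4, 8], 'min_samples_leaf': [0.05, 0.1]},
                    cv=3,
                    scoring='neg_mean_squared_error')
grid.fit(X, y)
tuned_seconds = time.perf_counter() - start

# square root of the mean cv MSE, easier to read than negative MSE
default_rmse = (-default_score) ** 0.5
tuned_rmse = (-grid.best_score_) ** 0.5

print('default: RMSE {:.2f} in {:.2f}s'.format(default_rmse, default_seconds))
print('tuned:   RMSE {:.2f} in {:.2f}s'.format(tuned_rmse, tuned_seconds))
```

if the RMSE barely moves while the search time grows a lot, tuning may not be worth it for that step of the project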

In [None]:
# inspect RF hyperparameters in sklearn
from sklearn.ensemble import RandomForestRegressor

# set seed for reproducibility
SEED = 1

# instantiate a random forests regressor
rf = RandomForestRegressor(random_state=SEED)

# inspect rf's hyperparameters
rf.get_params()

In [None]:
# perform grid-search cross-validation on the dataset which is already loaded and split
from sklearn.metrics import mean_squared_error as MSE
from sklearn.model_selection import GridSearchCV

# define a dictionary containing the grid of hyperparameters
params_rf = {
             'n_estimators': [300, 400, 500], 
             'max_depth': [4, 6, 8],
             'min_samples_leaf': [0.1, 0.2], 
             'max_features': ['log2', 'sqrt']
            }

# instantiate a grid search cv object with 3 fold cv
# verbose controls verbosity, the higher the value the more messages are printed during fitting
grid_rf = GridSearchCV(estimator=rf, 
                       param_grid=params_rf, 
                       cv=3,
                       scoring='neg_mean_squared_error',
                       verbose=1, 
                       n_jobs=-1)

# fit to the training set
grid_rf.fit(X_train, y_train)

# extract the best hyperparameters 
best_hyperparams = grid_rf.best_params_
print('Best hyperparameters:\n', best_hyperparams)

# extract the best model
best_model = grid_rf.best_estimator_
# predict the test set labels
y_pred = best_model.predict(X_test)
# evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))