## Finding the best hyper-parameters

Most machine learning algorithms, such as `DecisionTreeClassifier` and `RandomForestClassifier`, have settings that you can tune to change the behaviour of the model.

These are called *hyper-parameters*. The difference between *parameters* and *hyper-parameters* in machine learning is that the former are properties of the model that are learned from the training data during training, while the latter are set before training and are not changed by it.

For example, the hyper-parameters of `DecisionTreeClassifier` include:

- the splitting criterion
- the maximum tree depth
- ...

You'll find more info about all the hyper-parameters of `DecisionTreeClassifier` in the [scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html).
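To make the distinction concrete, here is a minimal sketch that sets two hyper-parameters by hand and then looks at what the tree learned during training. The chosen values (`criterion="entropy"`, `max_depth=3`, `random_state=0`) are arbitrary and only serve as an illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Hyper-parameters: chosen by us, before training (values are illustrative)
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
hyper_params = clf.get_params()
print("criterion:", hyper_params["criterion"])
print("max_depth:", hyper_params["max_depth"])

# Parameters: the structure of the tree, learned from the data during fit()
clf.fit(X, y)
print("Nodes in the learned tree:", clf.tree_.node_count)
print("Depth of the learned tree:", clf.get_depth())

# Training does not modify the hyper-parameters
print("Hyper-parameters unchanged:", clf.get_params() == hyper_params)
```

The learned depth can be smaller than `max_depth`, which is only an upper bound.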

Trying to find the best hyper-parameters is called *hyper-parameter tuning*.
However, you have to be careful not to let information from the testing set leak into the model.
Indeed, you can't rely on the accuracy on the testing set to choose the hyper-parameters: doing so would mean that you chose them based on the testing set, and thus used that data in the training procedure.
The testing set should be the LAST thing you evaluate on, and it should remain unseen until the very end.

Instead, you split the training set again to get a validation set, which you use to choose the best hyper-parameters.
There is a lot of conflicting information about what to call the set that you evaluate on last. Some call it the testing set, and others the validation set. In academia, the set that you use last is usually called the testing set.
You'll thus have a training-validation-test split. A rule of thumb is for this split to be 50-30-20, but this can vary with the quantity of data that you have at your disposal.
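As a rough sketch of this idea, the snippet below builds a 50-30-20 split with two calls to `train_test_split`, picks `max_depth` using only the validation set, and touches the test set once at the end. The candidate depths and the `random_state` values are assumptions made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# First split: hold out 20% of the data as the final test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Second split: 30% of the full data is 0.3 / 0.8 = 37.5% of what remains
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.375, random_state=0)

# Choose max_depth using the validation set only
best_depth, best_val_score = None, -1.0
for depth in [1, 2, 4, 8]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    val_score = model.score(X_val, y_val)
    if val_score > best_val_score:
        best_depth, best_val_score = depth, val_score

# Only now evaluate on the test set, once, with the chosen hyper-parameter
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print("Sizes (train, validation, test):", len(X_train), len(X_val), len(X_test))
print("Chosen max_depth:", best_depth)
print("Test accuracy: {:.2f}%".format(test_score * 100))
```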

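To perform hyper-parameter tuning with `scikit-learn`, there is a useful class called `GridSearchCV`, which tries every combination of the hyper-parameter values you specify and uses cross-validation to choose the best one. With cross-validation, the training set is repeatedly split into training and validation folds, so you don't need to set aside a separate validation set yourself.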
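To get a feel for what "every combination" means, the sketch below enumerates a small grid with `ParameterGrid`, the helper from `sklearn.model_selection` that represents such grids. The grid values here are hypothetical and deliberately small.

```python
from sklearn.model_selection import ParameterGrid

# A small, hypothetical grid: 3 values for each of 2 hyper-parameters
params = {
    "max_features": [0.25, 0.5, 0.75],
    "max_depth": [2, 4, None],
}

grid = ParameterGrid(params)

# Each candidate is one combination of values, given as a dict
for candidate in grid:
    print(candidate)

# With 5-fold cross-validation, every candidate is fitted 5 times
n_folds = 5
n_fits = len(grid) * n_folds
print("{} candidates, {} fits in total".format(len(grid), n_fits))
```

The number of fits grows multiplicatively with each hyper-parameter you add, which is worth keeping in mind before defining a large grid.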



In [23]:
from sklearn.model_selection import GridSearchCV, train_test_split
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn import datasets

dataset = datasets.load_breast_cancer()
X, y = dataset.data, dataset.target

In [24]:
# Creating train-test split and classifier
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=41, test_size=0.2)
decision_tree = DecisionTreeClassifier(random_state=5)

In [25]:
# Setting all the parameters we want to test
params = {
    'max_features' : np.arange(0.1,1,0.1).tolist(), #Number of features to consider as a fraction of all features
    'max_depth': [1,2,4,8, None] # Depth of the tree
}

print("Parameters:")
for k,v in params.items():
    print("{} : {}".format(k,v))
print()

Parameters:
max_features : [0.1, 0.2, 0.30000000000000004, 0.4, 0.5, 0.6, 0.7000000000000001, 0.8, 0.9]
max_depth : [1, 2, 4, 8, None]



In [26]:
# Setting up the grid search that will test every combination of parameters
gridsearch = GridSearchCV(estimator = decision_tree,
                        param_grid = params,
                        scoring = 'accuracy', 
                        cv = 5, # Use 5 folds
                        verbose = 1,
                        n_jobs = -1 # Use all available CPU cores
                        )

# As we are doing cross-validation on the training set, the testing set X_test is untouched
result = gridsearch.fit(X_train, y_train)

Fitting 5 folds for each of 45 candidates, totalling 225 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    0.7s
[Parallel(n_jobs=-1)]: Done 210 out of 225 | elapsed:    0.9s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 225 out of 225 | elapsed:    0.9s finished


In [27]:
print("The best parameters are :", result.best_params_)
# best_score_ is the mean cross-validated accuracy of the best candidate
print("The best accuracy is {:.2f}%".format(result.best_score_ * 100))

# We can now use the testing set with the optimal hyper-parameters to get the final generalization accuracy
decision_tree = result.best_estimator_
score = decision_tree.score(X_test, y_test)
print("The generalization accuracy of the model is {:.2f}%".format(score * 100))

The best parameters are : {'max_depth': 4, 'max_features': 0.6}
The best accuracy is 94.51%
The generalization accuracy of the model is 92.11%
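Beyond the single best candidate, a fitted `GridSearchCV` object keeps the scores of every candidate in its `cv_results_` attribute. The following sketch ranks all candidates by their mean cross-validated accuracy. To keep it short it searches over `max_depth` only, which is a simplification of the grid used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=41, test_size=0.2)

# Simplified grid (max_depth only) to keep the example small
search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=5),
    param_grid={"max_depth": [1, 2, 4, 8, None]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_train, y_train)

# cv_results_ is a dict of arrays with one entry per candidate
results = search.cv_results_
order = np.argsort(results["rank_test_score"])  # best candidate first

for i in order:
    print("{} : {:.2f}% (+/- {:.2f}%)".format(
        results["params"][i],
        results["mean_test_score"][i] * 100,
        results["std_test_score"][i] * 100,
    ))
```

Looking at the whole ranking helps to see whether the best candidate is clearly better than the others or only marginally so.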


Ideally, you would repeat all of the above with several different train-test splits and average the scores.
If we turn the code above into a function, we can easily run it multiple times.


In [28]:
def single_grid_search(X_train, y_train):
    """
    Performs a grid search using the given training set.
    """
    # Create a fresh classifier so that the search does not depend on the
    # global `decision_tree`, which is reassigned to the best estimator elsewhere
    estimator = DecisionTreeClassifier(random_state=5)

    # Setting all the parameters we want to test
    params = {
        'max_features' : np.arange(0.1,1,0.1).tolist(), # Number of features to consider as a fraction of all features
        'max_depth': [1,2,4,8, None] # Depth of the tree
    }

    gridsearch = GridSearchCV(estimator = estimator,
                            param_grid = params,
                            scoring = 'accuracy', 
                            cv = 5, # Use 5 folds
                            verbose = 0,
                            n_jobs = -1 # Use all available CPU cores
                            )

    # As we are doing cross-validation on the training set, the testing set is untouched
    return gridsearch.fit(X_train, y_train)

In [29]:
# Redoing the same computation as before, but this time
# using the function we created to show that we get the same results
result = single_grid_search(X_train, y_train)
decision_tree = result.best_estimator_
score = decision_tree.score(X_test, y_test)
print("The generalization accuracy of the model is {:.2f}%".format(score * 100))

The generalization accuracy of the model is 92.11%


In [30]:
# Now we can create k train-test splits using KFold
from sklearn.model_selection import KFold

# Using KFold instead of calling train_test_split multiple times ensures that
# each sample ends up in the test set of exactly one split
kf = KFold(n_splits=5, random_state=45, shuffle=True)

scores = []
for split, (train_index, test_index) in enumerate(kf.split(X), start=1):

    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Tune the hyper-parameters on the training part of this split only
    result = single_grid_search(X_train, y_train)

    decision_tree = result.best_estimator_
    score = decision_tree.score(X_test, y_test)
    scores.append(score)
    print("### Split {}: Accuracy is {:.2f}% ###".format(split, score*100))

print("The mean generalization accuracy of the model is {:.2f}% (+/- {:.2f}%)".format(np.mean(scores) * 100, np.std(scores) * 100))

### Split 1: Accuracy is 92.11% ###
### Split 2: Accuracy is 92.98% ###
### Split 3: Accuracy is 90.35% ###
### Split 4: Accuracy is 87.72% ###
### Split 5: Accuracy is 96.46% ###
The mean generalization accuracy of the model is 91.92% (+/- 2.89%)
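The loop above is a form of what is often called *nested cross-validation*: an outer loop estimates the generalization accuracy, while an inner cross-validation tunes the hyper-parameters. As a sketch, the same idea can be written more compactly by passing a `GridSearchCV` object to `cross_val_score`. The grid is reduced to `max_depth` here to keep the example fast, so the scores are not expected to match the ones above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Inner loop: hyper-parameter tuning with 5-fold cross-validation
# (reduced grid, chosen only to keep the example fast)
inner_search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=5),
    param_grid={"max_depth": [1, 2, 4, 8, None]},
    scoring="accuracy",
    cv=5,
)

# Outer loop: 5 train-test splits, each test fold is never seen by the inner search
outer_cv = KFold(n_splits=5, shuffle=True, random_state=45)
scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring="accuracy")

print("Accuracy per split:", np.round(scores * 100, 2))
print("Mean accuracy: {:.2f}% (+/- {:.2f}%)".format(
    np.mean(scores) * 100, np.std(scores) * 100))
```

For each outer split, `cross_val_score` clones `inner_search`, fits it on the training part (running the full grid search there) and scores the refitted best estimator on the held-out part.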


## References and additional reading material

[GridSearchCV - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)

[KFold - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html)
