# Hyperparameter Tuning

In the previous section, we did not discuss the parameters of random forest and gradient boosting. However, there are a few things to keep in mind when setting them.

This notebook gives crucial information regarding how to set the hyperparameters of both random forest and gradient boosting decision tree models.

## Random Forest

The main parameter to tune for random forest is the `n_estimators` parameter. In general, the more trees in the forest, the better the generalization performance will be. However, more trees also slow down fitting and prediction. When putting such a learner in production, the goal is to choose a number of estimators that balances computing time against generalization performance.

The `max_depth` parameter could also be tuned. Sometimes, there is no need to have fully grown trees. However, <code style="background:yellow; color:black">be aware that with random forest, trees are generally deep: we deliberately overfit each learner on its bootstrap sample, because this overfitting is mitigated when the learners are combined.</code>
Assembling underfitted trees (i.e. shallow trees) might also lead to an underfitted forest.
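
To make the last point concrete, here is a minimal sketch comparing a forest of depth-1 trees with a forest of fully grown trees. It uses a synthetic dataset (`make_friedman1`) rather than the California housing data used below, and the sample size and number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (an assumption made for this sketch).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for max_depth in (1, None):  # None lets every tree grow fully
    forest = RandomForestRegressor(
        n_estimators=50, max_depth=max_depth, random_state=0
    )
    forest.fit(X_train, y_train)
    # Store the (train R2, test R2) pair for each depth setting.
    scores[max_depth] = (
        forest.score(X_train, y_train),
        forest.score(X_test, y_test),
    )
    print(f"max_depth={max_depth}: "
          f"train R2={scores[max_depth][0]:.2f}, "
          f"test R2={scores[max_depth][1]:.2f}")
```

On a problem like this one, averaging many very shallow trees does not make up for the fact that each of them underfits.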

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

In [3]:
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target from units of 100 k$ to k$
data_train, data_test, target_train, target_test = train_test_split(data, target,
                                                                   random_state=0)

In [4]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [5]:
param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
}
grid_search = GridSearchCV(RandomForestRegressor(n_jobs=2),
                          param_grid=param_grid,
                          scoring="neg_mean_absolute_error", n_jobs=2)
grid_search.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")

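| | param_n_estimators | param_max_depth | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 8 | 30 | None | 34.602536 | 1 |
| 7 | 20 | None | 34.882662 | 2 |
| 6 | 10 | None | 36.103924 | 3 |
| 4 | 20 | 5.0 | 48.545337 | 4 |
| 5 | 30 | 5.0 | 48.681638 | 5 |
| 3 | 10 | 5.0 | 48.860571 | 6 |
| 2 | 30 | 3.0 | 57.061607 | 7 |
| 1 | 20 | 3.0 | 57.132398 | 8 |
| 0 | 10 | 3.0 | 57.140272 | 9 |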


We can observe that in our grid search, the largest `max_depth` (here `None`, i.e. fully grown trees) together with the largest `n_estimators` led to the best generalization performance, with a mean absolute error of about 34.6 k$.

<div class="alert alert-block alert-warning">
<b>Caution:</b> For the sake of clarity, no outer cross-validation is used to estimate the testing error. The scores above only show the effect of the parameters on the validation sets of what should be the inner cross-validation.
</div>
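
As a rough sketch of what the full procedure could look like, the grid search can itself be passed to `cross_validate`, so that the inner cross-validation selects the parameters and the outer one estimates the testing error. The synthetic data, the small grid, and the 3-fold splits below are assumptions chosen to keep the example quick to run.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_validate

# Small synthetic dataset (an assumption made for this sketch).
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# Inner cross-validation: selects the hyperparameters.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
inner_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)

# Outer cross-validation: estimates the testing error of the tuned model.
outer_results = cross_validate(
    inner_search, X, y, cv=3, scoring="neg_mean_absolute_error"
)
test_mae = -outer_results["test_score"]
print(f"Nested CV MAE: {test_mae.mean():.2f} +/- {test_mae.std():.2f}")
```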

## Gradient-boosting decision trees

For gradient boosting, the parameters are coupled, so we can no longer set them one after the other.  
The important parameters are `n_estimators`, `max_depth`, and `learning_rate`.

Let's first discuss the `max_depth` parameter. We saw in the section on gradient boosting that each new tree is fit on the errors (residuals) left by the previous trees in the ensemble. Thus, fitting fully grown trees would be detrimental. Indeed, the first tree of the ensemble would perfectly fit (overfit) the data, and no subsequent tree would be required since there would be no residuals left. Therefore, the trees used in gradient boosting should have a low depth, typically between 3 and 8 levels. Having very weak learners at each step helps reduce overfitting.

With this consideration in mind, the deeper the trees, the faster the residuals will be corrected and the fewer learners are required. Therefore, `n_estimators` should be increased if `max_depth` is lower.
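
The argument about fully grown trees can be checked with a short sketch. It fits a single decision tree on a synthetic dataset (an assumption, not the data used in this notebook) and measures the largest residual left on the training set for a shallow tree and for a fully grown one.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with continuous features (an assumption made for this sketch).
X, y = make_friedman1(n_samples=200, noise=1.0, random_state=0)

max_abs_residual = {}
for max_depth in (3, None):  # None grows the tree until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0).fit(X, y)
    residuals = y - tree.predict(X)
    max_abs_residual[max_depth] = np.abs(residuals).max()
    print(f"max_depth={max_depth}: "
          f"largest training residual = {max_abs_residual[max_depth]:.3f}")
```

The fully grown tree leaves essentially nothing for a second tree to fit, whereas the shallow tree leaves residuals that later trees can correct.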



Finally, we have overlooked the impact of the `learning_rate` parameter until now. When fitting the residuals, each new tree can be allowed to correct the whole remaining error or only a fraction of it. The learning rate controls this behaviour by scaling the contribution of each tree before it is added to the ensemble. With a small learning rate, each tree corrects only a small fraction of the residuals. With a large learning rate (e.g., 1), the full correction of each tree is applied. So, with a very low learning rate, we need more estimators to correct the overall error. However, too large a learning rate tends to produce an overfitted ensemble, similar to having too large a tree depth.
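
The following hand-rolled boosting loop is a simplified sketch of this mechanism for the squared error, not scikit-learn's actual implementation. The helper name `boosting_training_mse` and the synthetic data are hypothetical choices for illustration. It tracks only the training error, so it shows how quickly the residuals are corrected, not the overfitting risk.

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.tree import DecisionTreeRegressor


def boosting_training_mse(X, y, n_trees, learning_rate, max_depth=3):
    """Return the training MSE after each boosting step (illustrative helper)."""
    prediction = np.full(len(y), y.mean())  # start from the mean target
    history = []
    for _ in range(n_trees):
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # Only a fraction `learning_rate` of the tree's correction is applied.
        prediction += learning_rate * tree.predict(X)
        history.append(float(np.mean((y - prediction) ** 2)))
    return history


X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)
mse_small_lr = boosting_training_mse(X, y, n_trees=10, learning_rate=0.1)
mse_large_lr = boosting_training_mse(X, y, n_trees=10, learning_rate=1.0)
print(f"learning_rate=0.1: training MSE after 10 trees = {mse_small_lr[-1]:.2f}")
print(f"learning_rate=1.0: training MSE after 10 trees = {mse_large_lr[-1]:.2f}")
```

With the same number of trees, the small learning rate leaves more of the training error uncorrected, which is why it needs to be paired with a larger `n_estimators`.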

In [6]:
from sklearn.ensemble import GradientBoostingRegressor

param_grid = {
    "n_estimators": [10, 30, 50],
    "max_depth": [3, 5, None],
    "learning_rate": [0.1, 1],
}
grid_search = GridSearchCV(
    GradientBoostingRegressor(), param_grid=param_grid,
    scoring="neg_mean_absolute_error", n_jobs=2
)
grid_search.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")

| | param_n_estimators | param_max_depth | param_learning_rate | mean_test_score | rank_test_score |
|---|---|---|---|---|---|
| 5 | 50 | 5.0 | 0.1 | 35.659691 | 1 |
| 11 | 50 | 3.0 | 1.0 | 36.73689 | 2 |
| 10 | 30 | 3.0 | 1.0 | 37.437973 | 3 |
| 13 | 30 | 5.0 | 1.0 | 39.147928 | 4 |
| 4 | 30 | 5.0 | 0.1 | 39.274005 | 5 |
| 12 | 10 | 5.0 | 1.0 | 39.379441 | 6 |
| 14 | 50 | 5.0 | 1.0 | 39.890142 | 7 |
| 2 | 50 | 3.0 | 0.1 | 40.602042 | 8 |
| 9 | 10 | 3.0 | 1.0 | 41.589715 | 9 |
| 7 | 30 | None | 0.1 | 45.586703 | 10 |


<div class="alert alert-block alert-warning">
<b>Caution:</b> Here, we tune `n_estimators`, but be aware that using early stopping, as in the previous exercise, would be a better choice.
</div>
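
As a reminder of what early stopping can look like, here is a minimal sketch based on the `n_iter_no_change` and `validation_fraction` parameters of `GradientBoostingRegressor`. The previous exercise may have used a different setup, and the synthetic data and parameter values below are assumptions.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic dataset (an assumption made for this sketch).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)

model = GradientBoostingRegressor(
    n_estimators=1000,        # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    n_iter_no_change=5,       # stop once the validation score stalls for 5 steps
    validation_fraction=0.2,  # share of the training data held out internally
    random_state=0,
)
model.fit(X, y)
print(f"Trees actually fitted: {model.n_estimators_} (maximum allowed: 1000)")
```

With this approach, `n_estimators` only needs to be set large enough, and the number of trees actually used is read from the fitted `n_estimators_` attribute.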

## Main take-away

In this module, we discussed ensemble learners, a type of learner that combines simpler learners. We saw two strategies:

* one based on bootstrap samples allowing learners to be fit in a parallel manner;
* the other, called boosting, which fits learners sequentially.

Within these two families, we mainly focused on giving intuitions about the internal machinery of random forest and gradient boosting models, both of which are state-of-the-art methods.