# Hyperparameter Tuning

In the previous section, we did not discuss the hyperparameters of random forest and gradient boosting. However, there are a few things to keep in mind when setting them.

This notebook gives crucial information regarding how to set the hyperparameters of both random forest and gradient boosting decision tree models.

## Random Forest

The main hyperparameter to tune for random forest is `n_estimators`. In general, the more trees in the forest, the better the generalization performance, although the gains diminish as trees are added. However, more trees slow down both fitting and prediction. When putting such a learner in production, the goal is to balance computing time against generalization performance when choosing the number of estimators.
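The trade-off described above can be observed with a small timing experiment. This is a minimal sketch, assuming a synthetic dataset from `make_regression` rather than the housing data used later; the sample size and the list of forest sizes are arbitrary choices made for illustration.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: an assumption for illustration, not the housing dataset.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []
for n_estimators in [1, 10, 100]:
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    start = time.perf_counter()
    forest.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    score = forest.score(X_test, y_test)  # R2 on the held-out data
    results.append((n_estimators, fit_time, score))
    print(f"n_estimators={n_estimators:>3}: "
          f"fit time {fit_time:.3f}s, test R2 {score:.3f}")
```

The exact timings depend on the machine, so only the trend is meaningful: the fitting time grows with the number of trees, while the improvement in test score tends to flatten.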

The `max_depth` parameter could also be tuned. Sometimes, there is no need to have fully grown trees. However, <code style="background:yellow; color:black">be aware that with random forest, trees are generally deep: we deliberately let each tree overfit its bootstrap sample, because combining the trees mitigates this overfitting.</code>  
Conversely, assembling underfitted trees (i.e. shallow trees) might lead to an underfitted forest.
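To illustrate the effect of tree depth, the sketch below compares a forest of decision stumps with a forest of fully grown trees. It assumes a synthetic non-linear dataset from `make_friedman1`, chosen only for illustration; the number of trees and the noise level are arbitrary.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (an assumption for illustration).
X, y = make_friedman1(n_samples=500, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for max_depth in [1, None]:  # stumps versus fully grown trees
    forest = RandomForestRegressor(n_estimators=50, max_depth=max_depth,
                                   random_state=0)
    forest.fit(X_train, y_train)
    scores[max_depth] = (forest.score(X_train, y_train),
                         forest.score(X_test, y_test))
    print(f"max_depth={max_depth}: train R2 {scores[max_depth][0]:.3f}, "
          f"test R2 {scores[max_depth][1]:.3f}")
```

On data like this, the forest of stumps should score poorly on both the training and the testing sets, which is the signature of underfitting, whereas the deep trees fit the training set closely and still generalize better once averaged.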

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

In [2]:
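# Load the California housing data as pandas objects.
data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target from units of 100,000 dollars to k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)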

In [4]:
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

In [6]:
param_grid = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
}
# Search the grid with cross-validation, scoring by mean absolute error.
grid_search = GridSearchCV(
    RandomForestRegressor(n_jobs=2),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    n_jobs=2,
)
grid_search.fit(data_train, target_train)

# Keep the searched parameters and the scores, best-ranked first.
columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "rank_test_score"]
cv_results = pd.DataFrame(grid_search.cv_results_)
# Scores are negated errors; flip the sign to read them as errors in k$.
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results[columns].sort_values(by="rank_test_score")

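| | param_n_estimators | param_max_depth | mean_test_score | rank_test_score |
|---|---|---|---|---|
| 8 | 30 | None | 34.403766 | 1 |
| 7 | 20 | None | 34.936832 | 2 |
| 6 | 10 | None | 35.780756 | 3 |
| 5 | 30 | 5.0 | 48.623852 | 4 |
| 4 | 20 | 5.0 | 48.643395 | 5 |
| 3 | 10 | 5.0 | 49.157675 | 6 |
| 2 | 30 | 3.0 | 57.077625 | 7 |
| 1 | 20 | 3.0 | 57.125061 | 8 |
| 0 | 10 | 3.0 | 57.279044 | 9 |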


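We can observe that in our grid search, unrestricted depth (`max_depth=None`) together with the largest `n_estimators` led to the lowest mean absolute error, and therefore to the best generalization performance. Depth matters far more than the number of trees here: going from `max_depth=3` to fully grown trees lowers the error from about 57 to about 34, whereas going from 10 to 30 trees changes it by less than 2.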

<div class="alert alert-block alert-warning">
<b>Caution:</b> For the sake of clarity, no outer cross-validation is used to estimate the testing error. The scores above only show the effect of the hyperparameters on the validation sets of the grid search, which should be the inner loop of a nested cross-validation.
</div>
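For reference, a nested cross-validation can be sketched by passing the grid search itself to `cross_validate`. This is a minimal sketch under stated assumptions: it uses a small synthetic dataset from `make_friedman1` and a reduced grid so that it runs quickly, and the numbers of inner and outer folds are arbitrary.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_validate

# Small synthetic dataset (an assumption for illustration).
X, y = make_friedman1(n_samples=300, noise=1.0, random_state=0)

# Inner loop: the grid search selects the hyperparameters.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
inner_search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="neg_mean_absolute_error",
    cv=3,
)

# Outer loop: each fold re-runs the whole search on its training part and
# evaluates the selected model on data the search has never seen.
outer_results = cross_validate(
    inner_search, X, y, cv=3,
    scoring="neg_mean_absolute_error",
    return_estimator=True,
)

test_errors = -outer_results["test_score"]
print(f"Testing MAE: {test_errors.mean():.3f} +/- {test_errors.std():.3f}")
for search in outer_results["estimator"]:
    print(search.best_params_)
```

The outer scores estimate the testing error of the full procedure (search plus refit), and the selected hyperparameters may differ from one outer fold to another.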