*ref: https://inria.github.io/scikit-learn-mooc/python_scripts/ensemble_hyperparameters.html*

# Random forest
>**The main parameter to tune for random forest is the `n_estimators` parameter.**\
In general, the more trees in the forest, the better the generalization performance will be.\
However, adding trees slows down fitting and prediction. \
The goal when putting such a learner in production is to choose a number of estimators that balances computing time and generalization performance.
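
>The trade-off can be sketched by timing the fit for a few forest sizes. This is only an illustration: the synthetic dataset, the three forest sizes and the variable names are assumptions chosen to keep the run short, not part of the original notebook.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption, used only to keep the example fast).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []  # (n_estimators, fit time in seconds, test MAE)
for n_estimators in [1, 10, 100]:
    forest = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    start = time.perf_counter()
    forest.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    error = mean_absolute_error(y_test, forest.predict(X_test))
    results.append((n_estimators, fit_time, error))
    print(f"{n_estimators:>4} trees: fit in {fit_time:.3f} s, test MAE = {error:.2f}")
```

>Timings depend on the machine, so only the trend matters: the error typically drops quickly at first and then flattens, while the fitting time keeps growing with the number of trees.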

>Then, we could **also tune a parameter that controls the depth of each tree in the forest.**\
Two parameters are important for this: `max_depth` and `max_leaf_nodes`.\
They differ in the way they control the tree structure. \
Indeed, `max_depth` forces trees to be more symmetric, while `max_leaf_nodes` does not impose such a constraint.
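
>A minimal sketch of the difference, using single decision trees on synthetic data (both are assumptions made for illustration): one tree is capped at 3 levels, the other at 8 leaves, which is the number of leaves a full tree of depth 3 would have.

```python
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption, used only for illustration).
X, y = make_regression(n_samples=500, n_features=5, noise=5.0, random_state=0)

# At most 3 levels, hence at most 2**3 = 8 leaves.
depth_limited = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
# At most 8 leaves, grown best-first; the depth itself is not constrained.
leaf_limited = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X, y)

for name, tree in [("max_depth=3", depth_limited), ("max_leaf_nodes=8", leaf_limited)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}")
```

>With `max_leaf_nodes`, the tree is free to spend its leaf budget on one deep branch, so its depth can exceed 3 even though it has no more leaves than the depth-limited tree.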

>⚠️Be aware that with random forest, trees are generally deep: we deliberately overfit each tree on its bootstrap sample, because this overfitting is mitigated by combining the trees. \
Assembling underfitted trees (i.e. shallow trees) might also lead to an underfitted forest.
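
>To illustrate the warning, the sketch below compares a forest of decision stumps (`max_depth=1`) with a forest of fully grown trees. The Friedman #1 synthetic dataset and the two settings are assumptions picked for the demonstration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (an assumption for illustration).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}  # max_depth -> R2 on the test set
for max_depth in [1, None]:  # None lets each tree grow until its leaves are pure
    forest = RandomForestRegressor(
        n_estimators=100, max_depth=max_depth, random_state=0
    )
    forest.fit(X_train, y_train)
    scores[max_depth] = forest.score(X_test, y_test)
    print(f"max_depth={max_depth}: test R2 = {scores[max_depth]:.2f}")
```

>Averaging reduces the variance of the individual trees but not their bias, which is why the forest of stumps stays underfitted no matter how many trees it contains.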

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0)

In [3]:
import pandas as pd
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
}
search_cv = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=2), param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", n_iter=10, random_state=0, n_jobs=2,
)
search_cv.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

| | param_n_estimators | param_max_leaf_nodes | mean_test_error | std_test_error |
|---|---|---|---|---|
| 0 | 500 | 100 | 40.884375 | 0.695811 |
| 2 | 10 | 100 | 41.525385 | 0.66139 |
| 7 | 100 | 50 | 43.870925 | 0.766322 |
| 8 | 1 | 100 | 46.720423 | 1.310037 |
| 1 | 100 | 20 | 49.493528 | 0.960954 |
| 6 | 50 | 20 | 49.582823 | 0.767025 |
| 9 | 10 | 20 | 49.87435 | 1.003976 |
| 3 | 500 | 10 | 54.610072 | 0.85285 |
| 4 | 5 | 5 | 61.184241 | 0.920161 |
| 5 | 5 | 2 | 73.084221 | 0.91933 |


>Now we will estimate the generalization performance of the best model by refitting it on the full training set and scoring it on the test set, which it has never seen. \
The refit is done by default when calling the `.fit` method of the search (`refit=True`).
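
>As a sketch of what the refit gives us, the example below checks that scoring through the search object is the same as scoring the refitted `best_estimator_` by hand. The small synthetic dataset and the reduced parameter grid are assumptions made to keep the run short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data and a small search space (assumptions for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [5, 10, 20], "max_leaf_nodes": [5, 10, 20]},
    scoring="neg_mean_absolute_error", n_iter=4, cv=3, random_state=0,
)  # refit=True is the default
search.fit(X_train, y_train)

# best_estimator_ is the best candidate, refitted on all of X_train.
manual_error = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))
# search.score applies the chosen scorer to that same refitted estimator.
search_error = -search.score(X_test, y_test)
print(search.best_params_, f"test MAE = {search_error:.2f}")
```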


In [5]:
error = -search_cv.score(data_test, target_test)
print(f"On average, our random forest regressor makes an error of {error:.2f} k$")

On average, our random forest regressor makes an error of 41.97 k$


# Gradient-boosting decision trees
>For gradient-boosting, **parameters are coupled, so we cannot set the parameters one after the other anymore**.\
The important parameters are `n_estimators`, `learning_rate`, and `max_depth` or `max_leaf_nodes` (as previously discussed for random forest).

>**`max_depth` (or `max_leaf_nodes`)**: the trees used in gradient-boosting should have a low depth, typically between 3 and 8 levels, or few leaves (2^3=8 to 2^8=256). Having very weak learners at each step helps reduce overfitting.

>With this consideration in mind, **the deeper the trees, the faster the residuals will be corrected and the fewer learners are required**. Therefore, `n_estimators` should be increased if `max_depth` is lower.
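
>The coupling between tree depth and number of learners can be sketched with `staged_predict`, which yields the ensemble prediction after each added tree. The example tracks the training error only, since that is where the residual correction happens; the dataset and the two depths are assumptions chosen for illustration.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic non-linear regression problem (an assumption for illustration).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)

train_errors = {}  # max_depth -> training MAE after each boosting stage
for max_depth in [1, 3]:
    gbdt = GradientBoostingRegressor(
        n_estimators=100, max_depth=max_depth, learning_rate=0.1, random_state=0
    ).fit(X, y)
    train_errors[max_depth] = [
        mean_absolute_error(y, y_pred) for y_pred in gbdt.staged_predict(X)
    ]

for n_trees in [10, 50, 100]:
    summary = {depth: round(errors[n_trees - 1], 2) for depth, errors in train_errors.items()}
    print(f"after {n_trees} trees, training MAE by max_depth: {summary}")
```

>A lower training error is not a goal in itself: the point is only that shallower trees need more boosting stages to remove the same amount of residual.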

>With a very low **learning rate**, we will need more estimators to correct the overall error. \
However, a learning rate that is too large tends to produce an overfitted ensemble, similar to having trees that are too deep.
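
>A rough way to see this is to follow the test error stage by stage for a few learning rates. The dataset, the three learning rates and the 200-tree budget below are assumptions picked for the demonstration, not recommended values.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression problem (an assumption for illustration).
X, y = make_friedman1(n_samples=1000, noise=1.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

test_errors = {}  # learning_rate -> test MAE after each boosting stage
for learning_rate in [0.01, 0.1, 1.0]:
    gbdt = GradientBoostingRegressor(
        n_estimators=200, learning_rate=learning_rate, random_state=0
    ).fit(X_train, y_train)
    test_errors[learning_rate] = [
        mean_absolute_error(y_test, y_pred) for y_pred in gbdt.staged_predict(X_test)
    ]

for learning_rate, errors in test_errors.items():
    best_n = errors.index(min(errors)) + 1  # stage with the lowest test error
    print(f"learning_rate={learning_rate}: lowest test MAE {min(errors):.2f} at {best_n} trees")
```

>The number of trees at which the error bottoms out shifts with the learning rate, which is why the two parameters are searched jointly rather than one after the other.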

In [6]:
from scipy.stats import loguniform
from sklearn.ensemble import GradientBoostingRegressor

param_distributions = {
    "n_estimators": [1, 2, 5, 10, 20, 50, 100, 200, 500],
    "max_leaf_nodes": [2, 5, 10, 20, 50, 100],
    "learning_rate": loguniform(0.01, 1),
}
search_cv = RandomizedSearchCV(
    GradientBoostingRegressor(), param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", n_iter=20, random_state=0, n_jobs=2
)
search_cv.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_distributions.keys()]
columns += ["mean_test_error", "std_test_error"]
cv_results = pd.DataFrame(search_cv.cv_results_)
cv_results["mean_test_error"] = -cv_results["mean_test_score"]
cv_results["std_test_error"] = cv_results["std_test_score"]
cv_results[columns].sort_values(by="mean_test_error")

In [None]:
# Now we estimate the generalization performance of the best model using the test set.

error = -search_cv.score(data_test, target_test)
print(f"On average, our GBDT regressor makes an error of {error:.2f} k$")


>The mean test error on the held-out test set is slightly better than the cross-validated error of the best model. The reason is that the final model is refitted on the whole training set and therefore on more data than the models fitted on the inner cross-validation folds of the search procedure.