# Intuitions on ensemble tree-based models

## Bagging vs random forests

Bagging is a general strategy
* Can work with any base model (linear, trees, ...)

Random forests are bagged *randomized* decision trees
* At each split: a random subset of features is selected
* The best split is taken among the restricted subset
* Extra randomization decorrelates the prediction errors
* Uncorrelated errors make bagging work better
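
The difference between the two strategies can be sketched in code. The snippet below is only an illustration on synthetic data: the dataset, the number of trees and `max_features=0.5` are arbitrary assumptions, not values used elsewhere in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Plain bagging: each tree is fit on a bootstrap sample, but every split
# searches over all the features.
bagged_trees = BaggingRegressor(DecisionTreeRegressor(random_state=0),
                                n_estimators=50, random_state=0)

# Random forest: bootstrap samples plus a random subset of features
# (here half of them) considered at each split.
random_forest = RandomForestRegressor(n_estimators=50, max_features=0.5,
                                      random_state=0)

for name, model in [("Bagged trees", bagged_trees),
                    ("Random forest", random_forest)]:
    model.fit(X_train, y_train)
    print(f"{name}: R2 = {model.score(X_test, y_test):.3f}")
```

Note that `BaggingRegressor` has its own `max_features` parameter, but it draws the feature subset once per tree rather than at each split, so it is not equivalent to the per-split randomization of a random forest.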

<div class="alert alert-block alert-info">
<b>Note:</b>
<ul>
    <li>It's fine to use deep trees (max_depth=None) in random forests because of the reduced overfitting effect of prediction averaging. </li>
    <li>Adding trees generally improves the predictions, with diminishing returns; it is typical to use 100 trees or more.</li>
    <li>More trees: longer to fit, slower to predict and bigger models to deploy.</li></ul>    
</div>

In practice, gradient boosting is more flexible because it can optimize a choice of cost (loss) functions, and it tends to exhibit better predictive performance than traditional boosting.
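
As an illustration of this flexibility, the sketch below swaps the default loss of `GradientBoostingRegressor` for the quantile loss, so that the model estimates the 90th percentile of the target instead of its mean. The synthetic dataset and the choice of `alpha=0.9` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=500, n_features=5, noise=20.0,
                       random_state=0)

# loss="quantile" with alpha=0.9 targets the 90th percentile of the
# target rather than its conditional mean.
upper_model = GradientBoostingRegressor(loss="quantile", alpha=0.9,
                                        random_state=0)
upper_model.fit(X, y)

# Fraction of training targets at or below the predicted quantile.
coverage = np.mean(y <= upper_model.predict(X))
print(f"Fraction of targets below the predicted quantile: {coverage:.2f}")
```

Only the `loss` argument changes here; the sequential fitting of shallow trees stays the same.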

<div class="alert alert-block alert-warning">
<b>Take away:</b> <br>
    <ul>
        <li> Bagging and random forests fit trees independently
        <ul>    
        <li> each deep tree overfits individually </li>
            <li> averaging the tree predictions reduces overfitting</li></ul></li>
        <li> (Gradient) boosting fits trees sequentially
            <ul>
            <li> each shallow tree underfits individually</li>
                <li> sequentially adding trees reduces underfitting</li></ul></li>
        <li> Gradient boosting tends to perform slightly better than bagging and random forests; in addition, its shallow trees predict faster</li>
    </ul>
</div>
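
The contrast between deep, independently fitted trees and shallow, sequentially fitted trees can be checked by inspecting the depth of the fitted trees. This is a minimal sketch on synthetic data; the dataset and the number of trees are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)

# Random forest: fully grown trees (max_depth=None), fit independently.
forest = RandomForestRegressor(n_estimators=50, max_depth=None,
                               random_state=0).fit(X, y)
# Gradient boosting: shallow trees (max_depth=3), fit one after the other.
boosting = GradientBoostingRegressor(n_estimators=50, max_depth=3,
                                     random_state=0).fit(X, y)

forest_depths = [tree.get_depth() for tree in forest.estimators_]
# For gradient boosting, estimators_ is a 2D array of trees, hence ravel().
boosting_depths = [tree.get_depth() for tree in boosting.estimators_.ravel()]

mean_forest_depth = sum(forest_depths) / len(forest_depths)
print(f"Random forest: mean tree depth = {mean_forest_depth:.1f}")
print(f"Gradient boosting: max tree depth = {max(boosting_depths)}")
```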

## Introductory example to ensemble models

This notebook aims to emphasize the benefit of ensemble methods over simple models (e.g. decision tree, linear model, etc.). Combining simple models results in more powerful and robust models with less hassle.

We will be working with the California housing dataset. We recall that the goal in this dataset is to predict the median house value in some district in California based on demographic and geographic data.

In [1]:
from sklearn.datasets import fetch_california_housing

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$

We will check the generalization performance of a decision tree regressor with default parameters.

In [4]:
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=0)
cv_results = cross_validate(tree, data, target, n_jobs=2)
scores = cv_results['test_score']

In [6]:
print(f"R2 score obtained by cross-validation: "
     f"{scores.mean():.3f} +/- {scores.std():.3f}")

R2 score obtained by cross-validation: 0.345 +/- 0.104


We obtain fair results. However, as we previously presented in the "tree in depth" notebook, this model needs to be tuned to overcome over- or under-fitting. Indeed, the default parameters will not necessarily lead to an optimal decision tree. Instead of using the default values, we should search via cross-validation for the optimal values of important parameters such as `max_depth`, `min_samples_split`, or `min_samples_leaf`.

We recall that we need to tune these parameters because decision trees tend to overfit the training data when grown deep, and there is no rule for what each parameter should be set to. Thus, skipping the search could leave us with an underfitted or overfitted model.

Now, we run a grid search to tune the hyperparameters that we mentioned earlier.

In [7]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    "max_depth": [5, 8, None],
    "min_samples_split": [2, 10, 30, 50],
    "min_samples_leaf": [0.01, 0.05, 0.1, 1]}
cv = 3

tree = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid=param_grid, cv=cv, n_jobs=2)
cv_results = cross_validate(tree, data, target, n_jobs=2,
                            return_estimator=True)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

R2 score obtained by cross-validation: 0.523 +/- 0.108
CPU times: user 38.5 ms, sys: 42.3 ms, total: 80.8 ms
Wall time: 17.1 s


As expected, optimizing the hyperparameters has a positive effect on the generalization performance. However, it comes with a higher computational cost.

We can create a dataframe storing the important information collected during the tuning of the parameters and investigate the results.
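
A minimal sketch of such a dataframe is given below. To stay self-contained it fits a single `GridSearchCV` on synthetic data with a reduced grid; the dataset, the grid values and the names `grid_search` and `results` are assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data, used only for illustration.
X, y = make_regression(n_samples=300, n_features=8, noise=10.0,
                       random_state=0)

param_grid = {"max_depth": [5, 8, None], "min_samples_leaf": [0.05, 1]}
grid_search = GridSearchCV(DecisionTreeRegressor(random_state=0),
                           param_grid=param_grid, cv=3)
grid_search.fit(X, y)

# Keep one column per tuned parameter plus the summary test scores.
columns = [f"param_{name}" for name in param_grid]
columns += ["mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(grid_search.cv_results_)[columns]
results = results.sort_values(by="rank_test_score")
print(results)
```

With the nested setup used in this notebook, each fitted search is available in `cv_results["estimator"]` because `return_estimator=True` was passed to `cross_validate`, and each of them exposes its own `cv_results_` attribute.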

Now we will use an ensemble method called bagging. In short, this method takes a base regressor (here, a decision tree regressor) and trains several copies of it, each on a slightly modified version of the training set (by default, a bootstrap resample). The predictions of all these base regressors are then combined by averaging.
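
To make the mechanism concrete, bagging can be sketched by hand: resample the training set with replacement, fit one tree per resample, and average the predictions. This is a simplified illustration on synthetic data, not the implementation used by scikit-learn; the dataset and variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, used only for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10.0,
                       random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_trees = 20
trees = []
for _ in range(n_trees):
    # Bootstrap: draw as many samples as the training set, with replacement.
    indices = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(X_train[indices], y_train[indices]))

# Per-tree predictions, shape (n_trees, n_test_samples).
all_predictions = np.stack([tree.predict(X_test) for tree in trees])
# The bagged prediction is the average over the trees.
bagged_prediction = all_predictions.mean(axis=0)

mean_tree_mse = np.mean([mean_squared_error(y_test, predictions)
                         for predictions in all_predictions])
bagged_mse = mean_squared_error(y_test, bagged_prediction)
print(f"Average MSE of the individual trees: {mean_tree_mse:.1f}")
print(f"MSE of the averaged prediction: {bagged_mse:.1f}")
```

Because the squared error is convex, the error of the averaged prediction cannot exceed the average error of the individual trees; how much it improves depends on how decorrelated the trees' errors are.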

Here, we will use 20 decision trees and check the fitting time as well as the generalization performance on the left-out testing data. It is important to note that we are not going to tune any parameter of the decision tree.

In [9]:
%%time
from sklearn.ensemble import BaggingRegressor

base_estimator = DecisionTreeRegressor(random_state=0)
# The `estimator` parameter was named `base_estimator` before scikit-learn 1.2.
bagging_regressor = BaggingRegressor(estimator=base_estimator,
                                     n_estimators=20, random_state=0)
cv_results = cross_validate(bagging_regressor, data, target, n_jobs=2)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
     f"{scores.mean():.3f} +/- {scores.std():.3f}")

R2 score obtained by cross-validation: 0.639 +/- 0.083
CPU times: user 20.9 ms, sys: 8.64 ms, total: 29.6 ms
Wall time: 7.4 s


Without searching for optimal hyperparameters, the overall generalization performance of the bagging regressor is better than that of a single decision tree. In addition, the computational cost is lower than that of the hyperparameter search.

This shows the motivation behind the use of an ensemble learner: it gives a relatively good baseline with decent generalization performance without any parameter tuning.  