*ref: https://inria.github.io/scikit-learn-mooc/python_scripts/ensemble_random_forest.html*

Random forests have another particularity: **when training a tree, the search for the best split is done only on a subset of the original features taken at random.**

The random subsets are different for each split node. The goal is to inject additional randomization into the learning procedure to try to decorrelate the prediction errors of the individual trees.

Therefore, random forests use randomization on both axes of the data matrix:
- by bootstrapping samples for each tree in the forest;
- by randomly selecting a subset of features at each node of the tree.
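
The two sources of randomness listed above can be sketched with the standard library alone. This is a minimal illustration, not scikit-learn's implementation: the helper names and the sizes (`n_samples`, `n_features`, `max_features`) are hypothetical, chosen only to show that rows are drawn with replacement once per tree, while columns are drawn without replacement again at every split node.

```python
import random


def draw_bootstrap_indices(n_samples, rng):
    """Draw row indices with replacement (done once per tree)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def draw_feature_subset(n_features, max_features, rng):
    """Draw distinct column indices without replacement (done at each split node)."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(0)
n_samples, n_features, max_features = 10, 9, 3  # hypothetical sizes

# Axis 1: one bootstrap sample of the rows for a given tree.
rows_for_tree = draw_bootstrap_indices(n_samples, rng)

# Axis 2: a fresh feature subset for each split node of that tree.
cols_for_root = draw_feature_subset(n_features, max_features, rng)
cols_for_child = draw_feature_subset(n_features, max_features, rng)

print("rows used by the tree:", rows_for_tree)
print("features tried at the root:", cols_for_root)
print("features tried at a child node:", cols_for_child)
```

Because the rows are sampled with replacement, some indices typically appear several times and others not at all, whereas a feature subset never contains the same column twice.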

# A look at random forests

In [5]:
# We will illustrate the usage of a random forest classifier on the adult census dataset.

import pandas as pd

adult_census = pd.read_csv("../../datasets/adult-census.csv")
target_name = "class"
data = adult_census.drop(columns=[target_name, "education-num"])
target = adult_census[target_name]

>The adult census dataset contains categorical features, which we encode with an OrdinalEncoder: tree-based models can work very efficiently with such a naive representation of categorical variables.

In [6]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_transformer, make_column_selector

categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = make_column_transformer(
    (categorical_encoder, make_column_selector(dtype_include=object)),
    remainder="passthrough"
)
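
To make the encoder settings above more concrete, here is a rough pure-Python sketch of what ordinal encoding with `handle_unknown="use_encoded_value"` and `unknown_value=-1` amounts to. It is an illustration under assumptions, not the scikit-learn code: the helper names and the category strings are hypothetical, and it relies on the categories being ordered by sorting the values seen during fitting.

```python
def fit_ordinal_mapping(values):
    """Map each category seen at fit time to an integer code (sorted order)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def ordinal_encode(values, mapping, unknown_value=-1):
    """Encode values; categories not seen at fit time get `unknown_value`."""
    return [mapping.get(value, unknown_value) for value in values]


# Hypothetical category values, for illustration only.
train = ["Private", "State-gov", "Private", "Self-emp"]
mapping = fit_ordinal_mapping(train)

encoded_train = ordinal_encode(train, mapping)
encoded_test = ordinal_encode(["Self-emp", "Never-worked"], mapping)

print(mapping)        # {'Private': 0, 'Self-emp': 1, 'State-gov': 2}
print(encoded_train)  # [0, 2, 0, 1]
print(encoded_test)   # [1, -1]
```

Reserving -1 for unseen categories matters during cross-validation, where a rare category may appear in a test fold without being present in the corresponding training fold.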

In [8]:
# We first train a single decision tree classifier and check its
# generalization performance via cross-validation.

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

tree = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))

In [9]:
from sklearn.model_selection import cross_val_score

scores_tree = cross_val_score(tree, data, target)

print(f"Decision tree classifier: "
      f"{scores_tree.mean():.3f} ± {scores_tree.std():.3f}")

Decision tree classifier: 0.820 ± 0.006


In [10]:
# We construct a BaggingClassifier with a decision tree classifier as the base model.

from sklearn.ensemble import BaggingClassifier

bagged_trees = make_pipeline(
    preprocessor,
    BaggingClassifier(
        # This parameter was named `base_estimator` before scikit-learn 1.2.
        estimator=DecisionTreeClassifier(random_state=0),
        n_estimators=50, n_jobs=2, random_state=0,
    )
)

In [12]:
scores_bagged_trees = cross_val_score(bagged_trees, data, target)

print(f"Bagged decision tree classifier: "
      f"{scores_bagged_trees.mean():.3f} ± {scores_bagged_trees.std():.3f}")

Bagged decision tree classifier: 0.846 ± 0.005


>We now use a random forest. Observe that we do not need to specify any base estimator because it is forced to be a decision tree. Thus, we just specify the desired number of trees in the forest.

In [13]:
from sklearn.ensemble import RandomForestClassifier

random_forest = make_pipeline(
    preprocessor,
    RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
)

In [15]:
scores_random_forest = cross_val_score(random_forest, data, target)

print(f"Random forest classifier: "
      f"{scores_random_forest.mean():.3f} ± "
      f"{scores_random_forest.std():.3f}")

Random forest classifier: 0.851 ± 0.004


# Details about default hyperparameters
For random forests, it is possible to control the amount of randomness for each split by setting the value of the max_features hyperparameter:

- max_features=0.5 means that 50% of the features are considered at each split;
- max_features=1.0 means that all features are considered at each split, which effectively disables feature subsampling.

By default, RandomForestRegressor disables feature subsampling (max_features=1.0) while RandomForestClassifier uses max_features="sqrt", that is, sqrt(n_features) features are considered at each split. These default values reflect good practices given in the scientific literature.
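
As a rough guide to how a max_features value translates into a number of candidate features per split, the sketch below follows the rounding rule described in the scikit-learn documentation (`max(1, int(max_features * n_features))` for a float). The helper `resolve_max_features` is hypothetical and written only for illustration, and `n_features=12` is an arbitrary example value, not the size of the adult census dataset.

```python
import math


def resolve_max_features(max_features, n_features):
    """Return how many features are examined at each split (illustrative only)."""
    if max_features is None:
        return n_features  # no subsampling
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if isinstance(max_features, float):
        # A float is interpreted as a fraction of the features.
        return max(1, int(max_features * n_features))
    return int(max_features)  # an integer is used as an absolute count


n_features = 12  # hypothetical number of columns
print(resolve_max_features(0.5, n_features))     # 6
print(resolve_max_features(1.0, n_features))     # 12
print(resolve_max_features("sqrt", n_features))  # 3
print(resolve_max_features(None, n_features))    # 12
```

Note that the integer `1` and the float `1.0` mean very different things: the former tries a single feature per split, the latter tries all of them.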

>However, max_features is one of the hyperparameters to consider when tuning a random forest:
>- **too much randomness** in the trees can lead to underfitted base models and can be detrimental for the ensemble as a whole,
>- **too little randomness** in the trees leads to more correlation of the prediction errors and, as a result, reduces the benefits of the averaging step in terms of overfitting control.
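
The effect of correlated errors on the averaging step can be illustrated with a standard idealized result: if the trees' predictions are identically distributed with variance `sigma2` and pairwise correlation `rho`, the variance of their average is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`. The sketch below only evaluates this formula. The assumption of equal variances and a single common correlation is a simplification, and the function name is hypothetical.

```python
def variance_of_average(sigma2, rho, n_trees):
    """Variance of the mean of n_trees equally correlated predictions.

    Assumes every prediction has variance `sigma2` and every pair of
    predictions has correlation `rho` (an idealized setting).
    """
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_trees


n_trees = 50
for rho in (0.0, 0.5, 1.0):
    print(rho, variance_of_average(1.0, rho, n_trees))
# rho = 0.0 gives about 0.02 (averaging helps a lot)
# rho = 0.5 gives about 0.51 (the benefit saturates)
# rho = 1.0 gives 1.0 (averaging does not help at all)
```

Adding more trees only shrinks the second term, so the first term, driven by the correlation, sets a floor on the variance. Lowering max_features is one way to reduce that correlation, at the cost of weaker individual trees.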

**We summarize these details in the following table:**
<table class="colwidths-auto table">
<thead>
<tr class="row-odd"><th class="head"><p>Ensemble model class</p></th>
<th class="head"><p>Base model class</p></th>
<th class="head"><p>Default value for <code class="docutils literal notranslate"><span class="pre">max_features</span></code></p></th>
<th class="head"><p>Feature subsampling strategy</p></th>
</tr>
</thead>
<tbody>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">BaggingClassifier</span></code></p></td>
<td><p>User specified (flexible)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">n_features</span></code> (no&nbsp;subsampling)</p></td>
<td><p>Model level</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">RandomForestClassifier</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">DecisionTreeClassifier</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">sqrt(n_features)</span></code></p></td>
<td><p>Tree node level</p></td>
</tr>
<tr class="row-even"><td><p><code class="docutils literal notranslate"><span class="pre">BaggingRegressor</span></code></p></td>
<td><p>User specified (flexible)</p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">n_features</span></code> (no&nbsp;subsampling)</p></td>
<td><p>Model level</p></td>
</tr>
<tr class="row-odd"><td><p><code class="docutils literal notranslate"><span class="pre">RandomForestRegressor</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">DecisionTreeRegressor</span></code></p></td>
<td><p><code class="docutils literal notranslate"><span class="pre">n_features</span></code> (no&nbsp;subsampling)</p></td>
<td><p>Tree node level</p></td>
</tr>
</tbody>
</table>