*ref: https://inria.github.io/scikit-learn-mooc/python_scripts/ensemble_introduction.html*

In [1]:
from sklearn.datasets import fetch_california_housing

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$

In [2]:
# We first check the generalization performance of a decision tree regressor with default parameters.

from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(random_state=0)
cv_results = cross_validate(tree, data, target, n_jobs=2)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")

R2 score obtained by cross-validation: 0.354 ± 0.087


In [4]:
%%time

# Now, we run a grid search to tune the hyperparameters of the decision tree.

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [5, 8, None],
    "min_samples_split": [2, 10, 30, 50],
    "min_samples_leaf": [0.01, 0.05, 0.1, 1]}
cv = 3

tree = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    param_grid=param_grid, cv=cv, n_jobs=2)
cv_results = cross_validate(tree, data, target, n_jobs=2,
                            return_estimator=True)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")

R2 score obtained by cross-validation: 0.523 ± 0.107
CPU times: total: 46.9 ms
Wall time: 14.6 s

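Because `return_estimator=True` was passed to `cross_validate`, the fitted `GridSearchCV` object from each outer fold is kept under `cv_results["estimator"]`, so the hyperparameters selected in each fold can be inspected. The following is only a minimal sketch of that idea. It assumes a small synthetic dataset from `make_regression` in place of the housing data and a reduced, hypothetical grid (`demo_grid`) so that it runs quickly; the names `X_demo`, `y_demo`, and `demo_results` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption, to keep this quick).
X_demo, y_demo = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0)

# Reduced grid with hypothetical values, smaller than the grid used above.
demo_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0), param_grid=demo_grid, cv=3)

# The outer loop evaluates the whole tuning procedure (nested cross-validation).
demo_results = cross_validate(
    search, X_demo, y_demo, cv=5, return_estimator=True)

# One fitted GridSearchCV per outer fold, each with its own best_params_.
best_params_per_fold = [est.best_params_ for est in demo_results["estimator"]]
for fold, params in enumerate(best_params_per_fold):
    print(f"Fold {fold}: {params}")
```

If the selected parameters differ a lot from fold to fold, that may be a sign that the tuning itself is unstable on this data.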

**Now we will use an ensemble method called bagging.**

Here, we will use 20 decision trees and check the fitting time as well as the generalization performance on the left-out testing data. It is important to note that we are not going to tune any parameter of the decision tree.

In [5]:
%%time
from sklearn.ensemble import BaggingRegressor

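base_estimator = DecisionTreeRegressor(random_state=0)
# Note: the `base_estimator` keyword was renamed to `estimator` in
# scikit-learn 1.2 and removed in 1.4.
bagging_regressor = BaggingRegressor(
    estimator=base_estimator, n_estimators=20, random_state=0)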

cv_results = cross_validate(bagging_regressor, data, target, n_jobs=2)
scores = cv_results["test_score"]

print(f"R2 score obtained by cross-validation: "
      f"{scores.mean():.3f} ± {scores.std():.3f}")

R2 score obtained by cross-validation: 0.642 ± 0.083
CPU times: total: 156 ms
Wall time: 7.8 s


>Without searching for optimal hyperparameters, the overall generalization performance of the bagging regressor is better than that of a single decision tree.\
In addition, the computational cost is lower than that of searching for the optimal hyperparameters.

>This shows the motivation behind the use of an ensemble learner: it gives a relatively good baseline with decent generalization performance without any parameter tuning.

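To make the idea behind bagging concrete, here is a rough hand-written sketch: each tree is fitted on a bootstrap sample (rows drawn with replacement) and the predictions are averaged. This is not the internal implementation of `BaggingRegressor`, only an illustration in the same spirit. It assumes synthetic data from `make_regression` in place of the housing data, and all variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (an assumption, to keep this quick).
X_demo, y_demo = make_regression(
    n_samples=500, n_features=5, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
n_trees = 20
trees = []
for _ in range(n_trees):
    # Bootstrap sample: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# Aggregate by averaging the per-tree predictions on the test set.
all_preds = np.stack([t.predict(X_test) for t in trees])
bagged_pred = all_preds.mean(axis=0)

# Reference point: one untuned tree fitted on the full training set.
single_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
print(f"Single tree R2:  {r2_score(y_test, single_tree.predict(X_test)):.3f}")
print(f"Bagged trees R2: {r2_score(y_test, bagged_pred):.3f}")
```

Averaging never gives a larger squared error than the mean squared error of the individual trees, which is one way to see why the ensemble tends to be more stable than any single deep tree.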

