# 📝 Exercise M6.01

The aim of this notebook is to investigate if we can tune the hyperparameters
of a bagging regressor and evaluate the gain obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [2]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)
data_train

index,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
12485,1.2450,42.0,3.624254,1.125249,761.0,1.512922,38.58,-121.48
7356,2.4432,24.0,3.399168,1.054054,2480.0,5.155925,33.96,-118.16
12392,2.4659,17.0,9.747727,2.029545,958.0,2.177273,33.74,-116.41
5210,1.5114,39.0,4.599278,1.018051,975.0,3.519856,33.92,-118.28
7421,3.0550,41.0,4.119891,1.089918,1690.0,4.604905,33.96,-118.20
...,...,...,...,...,...,...,...,...
13123,4.4125,20.0,6.000000,1.045662,712.0,3.251142,38.27,-121.26
19648,2.9135,27.0,5.349282,0.933014,647.0,3.095694,37.48,-120.89
9845,3.1977,31.0,3.641221,0.941476,704.0,1.791349,36.58,-121.90
10799,5.6315,34.0,4.540598,1.064103,1052.0,2.247863,33.62,-117.93


<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator` (renamed `estimator` in recent scikit-learn
releases). Train the regressor and evaluate its statistical performance on
the testing set using the mean absolute error.

In [4]:
# Write your code here.

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error

bagging = BaggingRegressor(
    base_estimator=DecisionTreeRegressor(),
    n_jobs=2
)
bagging.fit(data_train, target_train)
target_predicted = bagging.predict(data_test)
print(f"Basic mean absolute error of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")


Basic mean absolute error of the bagging regressor:
36.53 k$
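
As a reminder of what this metric measures, here is a minimal pure-Python sketch of the mean absolute error: the average of the absolute differences between true and predicted targets. The helper name `mean_absolute_error_manual` and the three house values are hypothetical and only serve as an illustration; `sklearn.metrics.mean_absolute_error` computes the same quantity.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical house values and predictions in k$ (illustrative only).
y_true = [100.0, 200.0, 300.0]
y_pred = [110.0, 190.0, 330.0]

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3.
mae = mean_absolute_error_manual(y_true, y_pred)
print(f"{mae:.2f} k$")  # prints "16.67 k$"
```

Because the target was rescaled to k$, the error reads directly as an average price deviation in thousands of dollars.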


Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether any set of parameters improves on the default
regressor, still using the mean absolute error as the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>
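
The tip can be sketched as follows. This is a minimal example, not the exercise solution: it builds a fresh regressor (the variable names `model`, `top_level` and `nested` are illustrative) and separates the parameters of the bagging regressor from those of the inner tree. The tree is passed positionally because the keyword is `base_estimator` in older scikit-learn releases and `estimator` in recent ones, so the prefix of the nested names depends on the installed version.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Passing the tree positionally avoids depending on the keyword name.
model = BaggingRegressor(DecisionTreeRegressor())
params = model.get_params()

# Parameters of the bagging regressor itself.
top_level = sorted(name for name in params if "__" not in name)
# Parameters of the inner tree, addressed as "<prefix>__<name>".
nested = sorted(name for name in params if "__" in name)

print(top_level)
print(nested)
```

The nested names are the ones to use as keys in a search space when tuning the inner tree, for example its `max_depth`.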

In [10]:
# Write your code here.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
import pandas as pd

param_grid = {
    "n_estimators": randint(10, 30),
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": randint(3, 10),
}
search = RandomizedSearchCV(
    bagging, param_grid, n_iter=20, scoring="neg_mean_absolute_error"
)
_ = search.fit(data_train, target_train)

columns = [f"param_{name}" for name in param_grid.keys()]
columns += ["mean_test_score", "std_test_score", "rank_test_score"]
cv_results = pd.DataFrame(search.cv_results_)
cv_results = cv_results[columns].sort_values(by="rank_test_score")
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results

index,param_n_estimators,param_max_samples,param_max_features,param_base_estimator__max_depth,mean_test_score,std_test_score,rank_test_score
9,28,0.8,0.8,9,38.163799,1.323167,1
10,17,1.0,0.8,9,39.348485,1.43883,2
2,22,0.8,1.0,9,39.476948,0.896986,3
0,24,1.0,1.0,9,39.61191,0.966389,4
12,13,1.0,0.8,8,40.811257,1.269885,5
7,17,0.5,0.8,7,42.767927,1.064365,6
11,28,1.0,0.8,7,42.810864,1.17151,7
1,24,1.0,0.5,9,44.088563,1.648999,8
3,21,0.8,0.5,8,45.855,1.482286,9
4,25,0.8,0.5,8,46.088102,2.077267,10


In [11]:
target_predicted = search.predict(data_test)
print(f"Mean absolute error after tuning of the bagging regressor:\n"
      f"{mean_absolute_error(target_test, target_predicted):.2f} k$")

Mean absolute error after tuning of the bagging regressor:
38.03 k$


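The tuned model reaches a mean absolute error of 38.03 k$ on the testing set,
against 36.53 k$ for the default bagging regressor, so the search did not
improve on the default parameters. Note that the search space limits
`max_depth` to at most 9, whereas the default trees are grown without a depth
limit. We see that the bagging regressor provides a predictor in which fine
tuning is not as important as in the case of fitting a single decision tree.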
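
When reading the search space used above, it helps to know which integers `scipy.stats.randint(low, high)` can produce: the upper bound is excluded. The following minimal sketch, assuming only that `scipy` is installed, draws samples from the same two distributions and lists the distinct values obtained (the variable names are illustrative).

```python
from scipy.stats import randint

# Same distributions as in the search space above.
depth_dist = randint(3, 10)
n_estimators_dist = randint(10, 30)

# Draw many samples with a fixed seed and collect the distinct values.
depth_values = sorted(set(depth_dist.rvs(size=1000, random_state=0).tolist()))
n_estimators_values = sorted(
    set(n_estimators_dist.rvs(size=1000, random_state=0).tolist())
)

print(depth_values)
print(n_estimators_values)
```

Since the upper bound is exclusive, no sampled depth exceeds 9 and no sampled `n_estimators` exceeds 29. Widening these ranges is one way to let the search explore deeper trees.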