# 📝 Exercise M6.01

The aim of this notebook is to investigate whether tuning the hyperparameters
of a bagging regressor improves its performance, and to evaluate the gain
obtained.

We will load the California housing dataset and split it into a training and
a testing set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(as_frame=True, return_X_y=True)
target *= 100  # rescale the target in k$
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=0, test_size=0.5)

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">If you want a deeper overview regarding this dataset, you can refer to the
Appendix - Datasets description section at the end of this MOOC.</p>
</div>

Create a `BaggingRegressor` and provide a `DecisionTreeRegressor`
to its parameter `base_estimator` (renamed `estimator` in scikit-learn 1.2).
Train the regressor and evaluate its generalization performance on the
testing set using the mean absolute error.

In [39]:
BaggingRegressor?

In [40]:
# Write your code here.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error

rgr = BaggingRegressor(base_estimator=DecisionTreeRegressor())

In [41]:
rgr.fit?

In [42]:
rgr.fit(data_train, target_train)

BaggingRegressor(base_estimator=DecisionTreeRegressor())

In [43]:
rgr.score?

In [44]:
rgr.score(data_test, target_test)

0.7658179145428943

In [79]:
preds = rgr.predict(data_test)

In [80]:
mean_absolute_error(target_test, preds)

36.88547079457364
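
For reference, the mean absolute error is the average of the absolute differences between the targets and the predictions, so the value above is expressed in k$, like the rescaled target. A minimal sketch with made-up numbers (the helper `mae` and the toy values are illustrative and not part of the exercise):

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |y_true - y_pred|."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy house prices in k$ (made-up values, for illustration only).
toy_target = [200.0, 150.0, 320.0, 90.0]
toy_preds = [180.0, 165.0, 300.0, 100.0]

toy_mae = mae(toy_target, toy_preds)
print(toy_mae)  # absolute errors are 20, 15, 20, 10, so the mean is 16.25
```

`sklearn.metrics.mean_absolute_error` computes the same quantity on arrays.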

Now, create a `RandomizedSearchCV` instance using the previous model and
tune the important parameters of the bagging regressor. Find the best
parameters and check whether you can find a set of parameters that
improves on the default regressor, still using the mean absolute error as
the metric.

<div class="admonition tip alert alert-warning">
<p class="first admonition-title" style="font-weight: bold;">Tip</p>
<p class="last">You can list the bagging regressor's parameters using the <tt class="docutils literal">get_params</tt>
method.</p>
</div>
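
As a rough illustration of the tip above, the sketch below lists the parameter names of a fresh bagging regressor. It passes the inner tree positionally, on the assumption that the first argument is the inner estimator regardless of whether the installed scikit-learn version calls it `base_estimator` or `estimator`; the variable names are illustrative.

```python
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Pass the inner estimator positionally so the keyword name
# (`base_estimator` or `estimator`) does not matter here.
bagging = BaggingRegressor(DecisionTreeRegressor())

param_names = sorted(bagging.get_params())
for name in param_names:
    print(name)

# Parameters of the inner tree are exposed as "<prefix>__<parameter>".
nested_names = [name for name in param_names if name.endswith("__max_depth")]
print(nested_names)
```

The names containing a double underscore are the ones to use as keys when tuning the inner decision tree through a search object.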

In [45]:
# Write your code here.
from sklearn.model_selection import RandomizedSearchCV

In [46]:
# rgr.get_params()  # uncomment to list the tunable parameters

In [63]:
from scipy.stats import randint

In [64]:
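param_distributions = {
    # randint(low, high) draws integers in [low, high): high is excluded
    "n_estimators": randint(10, 30),
    # lists are sampled uniformly; floats are fractions of samples/features
    "max_samples": [0.5, 0.8, 1.0],
    "max_features": [0.5, 0.8, 1.0],
    "base_estimator__max_depth": randint(3, 10),
}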
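
To see what such a distribution yields, here is a small sketch that draws values from `randint(10, 30)` directly. The sample size and the `random_state` are arbitrary choices made for repeatability; the search itself draws one value per candidate.

```python
from scipy.stats import randint

n_estimators_dist = randint(10, 30)

# Draw many values at once to inspect the range that can be sampled.
samples = n_estimators_dist.rvs(size=1000, random_state=0)

# The upper bound is exclusive, so 30 itself is never drawn.
print(samples.min(), samples.max())
print(n_estimators_dist.pmf(30))
```

With these bounds, the search can therefore try between 10 and 29 estimators, and tree depths between 3 and 9.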

In [65]:
rgr_psearch = RandomizedSearchCV(
    rgr, param_distributions=param_distributions,
    scoring="neg_mean_absolute_error", cv=3, n_jobs=3)

In [66]:
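# Fit the search on the training set only, so that the testing set stays
# unseen until the final evaluation.
_ = rgr_psearch.fit(data_train, target_train)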

In [67]:
rgr_psearch.best_params_

{'base_estimator__max_depth': 6,
 'max_features': 1.0,
 'max_samples': 0.5,
 'n_estimators': 23}

In [68]:
rgr_psearch.best_score_

-51.798449137624004
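
The score is negative because scikit-learn scorers follow a "greater is better" convention, so error metrics are negated. The sketch below shows the same convention with `cross_val_score` on a small synthetic dataset; the data and the variable names are assumptions made for illustration, not the housing data used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Small synthetic regression problem (illustrative only).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = BaggingRegressor(DecisionTreeRegressor(), random_state=0)
neg_mae_scores = cross_val_score(
    model, X, y, cv=3, scoring="neg_mean_absolute_error")

# Flip the sign to read the scores as errors in the unit of the target.
mae_scores = -neg_mae_scores
print(mae_scores.mean())
```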

In [70]:
import pandas as pd

In [73]:
cv_results = pd.DataFrame(rgr_psearch.cv_results_).sort_values(by="rank_test_score")

In [74]:
cv_results["mean_test_score"] = -cv_results["mean_test_score"]

In [75]:
cv_results["mean_test_score"]

8    51.798449
9    52.377097
1    53.331383
4    53.968620
5    55.845047
0    56.994576
3    58.604706
2    59.333788
6    61.739287
7    63.570279
Name: mean_test_score, dtype: float64

In [76]:
preds = rgr_psearch.predict(data_test)

In [78]:
mean_absolute_error(target_test, preds)

44.241150510387186

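Here the tuned model reaches a mean absolute error of about 44.2 k$ on the
testing set, against about 36.9 k$ for the default bagging regressor, so this
search did not improve on the defaults. One likely reason is that the search
restricts `max_depth` to values below 10, whereas the default trees are fully
grown. More generally, we see that the bagging regressor provides a predictor
for which fine tuning is not as important as in the case of fitting a single
decision tree.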
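
One way to probe this claim is to vary a single hyperparameter, here `max_depth`, for both a single tree and a bagging of trees, and compare how much the test error moves. The sketch below does this on synthetic data; the dataset, the depth values, and the number of estimators are assumptions chosen for a quick run, and the outcome depends on the data and the seed.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (illustrative, not the housing data).
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for max_depth in (3, 6, None):  # None lets the trees grow fully
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
    bagging = BaggingRegressor(
        DecisionTreeRegressor(max_depth=max_depth),
        n_estimators=20, random_state=0)

    tree.fit(X_train, y_train)
    bagging.fit(X_train, y_train)

    tree_mae = mean_absolute_error(y_test, tree.predict(X_test))
    bagging_mae = mean_absolute_error(y_test, bagging.predict(X_test))
    results[max_depth] = (tree_mae, bagging_mae)
    print(f"max_depth={max_depth}: tree MAE={tree_mae:.1f}, "
          f"bagging MAE={bagging_mae:.1f}")
```

Comparing the spread of each column across depths gives a rough idea of how sensitive each model is to this hyperparameter.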