# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance, estimated by cross-validation on a training set.

In [1]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

data, target = fetch_california_housing(return_X_y=True, as_frame=True)
target *= 100  # rescale the target in k$

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

In this exercise, we progressively define the regression pipeline and
later tune its hyperparameters.

Start by defining a pipeline that:
* uses a `StandardScaler` to normalize the numerical data;
* uses a `sklearn.neighbors.KNeighborsRegressor` as a predictive model.

In [3]:
# Write your code here.
# If we check the data, we realize that it is all numerical.
# Let us start by importing the libraries
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), KNeighborsRegressor())
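`make_pipeline` names each step after its lowercased class name, and nested parameters are addressed as `<step>__<parameter>`. The following minimal sketch lists those names for a pipeline like the one above; the variable names `pipeline`, `step_names` and `tunable` are only illustrative.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the exercise pipeline.
pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())

# Step names are generated from the lowercased class names.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# Nested parameters use the "<step name>__<parameter name>" convention.
tunable = sorted(name for name in pipeline.get_params() if "__" in name)
for name in tunable:
    print(name)
```

These are the keys that a hyperparameter search expects in its parameter dictionary.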

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `n_neighbors` of the `KNeighborsRegressor` with values
  `np.logspace(0, 3, num=10).astype(np.int32)`;
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values `True`
  or `False`.
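To see which candidate values the `n_neighbors` expression produces, it can be evaluated on its own. This short sketch only inspects the array; the intermediate name `raw_values` is illustrative.

```python
import numpy as np

# Ten values evenly spaced on a log scale between 10**0 and 10**3.
raw_values = np.logspace(0, 3, num=10)

# Casting to an integer type truncates the fractional part,
# which is required because n_neighbors must be an integer.
n_neighbors_values = raw_values.astype(np.int32)

print(raw_values)
print(n_neighbors_values)
```

Spacing the candidates on a log scale covers several orders of magnitude with only ten values.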

Notice that in the notebook "Hyperparameter tuning by randomized-search" we
pass distributions to be sampled by the `RandomizedSearchCV`. In this case we
define a fixed grid of hyperparameters to be explored. Using a `GridSearchCV`
instead would explore all the possible combinations on the grid, which can be
costly to compute for large grids, whereas the parameter `n_iter` of the
`RandomizedSearchCV` controls the number of different random combinations that
are evaluated. Notice that when every parameter is given as a list,
combinations are sampled without replacement, so setting `n_iter` larger than
the number of possible combinations in the grid (in this case 10 x 2 x 2 = 40)
cannot produce new candidates: scikit-learn warns and evaluates each
combination only once.
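The size of the grid and the effect of `n_iter` can be checked without fitting any model. The sketch below uses `ParameterGrid` and `ParameterSampler` from `sklearn.model_selection`; the `random_state=0` value and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10)
    .astype(np.int32)
    .tolist(),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

# Exhaustive enumeration, as a grid search would do.
n_combinations = len(ParameterGrid(param_distributions))
print(n_combinations)

# Random subset of the grid, as a randomized search with n_iter=20 would draw.
sampled = list(ParameterSampler(param_distributions, n_iter=20, random_state=0))

# Count distinct candidates among the sampled ones.
unique = {tuple(sorted(candidate.items())) for candidate in sampled}
print(len(sampled), len(unique))
```

With only lists in the dictionary, the sampled candidates are all distinct, so 20 iterations cover half of the grid.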

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.

The `param_distributions` parameter of `RandomizedSearchCV` takes a dictionary with parameter names (str) as keys and **distributions** or **lists** as values.
* Distributions must provide an `rvs` method for sampling (such as those from `scipy.stats.distributions`).
* Lists enumerate the candidate values to try. The values in a list are sampled uniformly.
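As an illustration of mixing both kinds of values, the sketch below replaces the fixed `n_neighbors` grid with `scipy.stats.randint`, which exposes an `rvs` method. The choice of `randint(1, 1000)` is an assumption made for this example and is not part of the exercise.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    # Distribution: integers drawn uniformly from [1, 1000).
    "kneighborsregressor__n_neighbors": randint(1, 1000),
    # Lists: each value is picked uniformly.
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

# Draw a few candidates the way a randomized search would.
sampler = ParameterSampler(param_distributions, n_iter=5, random_state=0)
candidates = list(sampler)
for candidate in candidates:
    print(candidate)
```

The same dictionary could be passed as `param_distributions` to `RandomizedSearchCV`.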

For more details, read:
* [RandomizedSearchCV documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)
* [numpy.logspace documentation](https://numpy.org/doc/stable/reference/generated/numpy.logspace.html)

In [7]:
# Write your code here.
from sklearn.model_selection import RandomizedSearchCV
import numpy as np

param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}
model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    verbose=1,
)
model_random_search.fit(data_train, target_train)

Fitting 5 folds for each of 20 candidates, totalling 100 fits


In [12]:
import pandas as pd

cv_results = pd.DataFrame(model_random_search.cv_results_).sort_values(
    "rank_test_score", ascending=True, ignore_index=True)

cv_results.head()

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_standardscaler__with_std | param_standardscaler__with_mean | param_kneighborsregressor__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.016612 | 0.007184 | 0.221854 | 0.022057 | True | True | 10 | {'standardscaler__with_std': True, 'standardsc... | 0.670033 | 0.696368 | 0.699386 | 0.68632 | 0.687522 | 0.687926 | 0.010252 | 1 |
| 1 | 0.019718 | 0.006706 | 0.210318 | 0.013165 | True | False | 10 | {'standardscaler__with_std': True, 'standardsc... | 0.670033 | 0.696368 | 0.699386 | 0.68632 | 0.687522 | 0.687926 | 0.010252 | 1 |
| 2 | 0.018205 | 0.002494 | 0.284222 | 0.016111 | True | True | 21 | {'standardscaler__with_std': True, 'standardsc... | 0.664152 | 0.689769 | 0.694525 | 0.684824 | 0.684901 | 0.683634 | 0.010381 | 3 |
| 3 | 0.018343 | 0.003169 | 0.174719 | 0.010914 | True | False | 4 | {'standardscaler__with_std': True, 'standardsc... | 0.666287 | 0.684908 | 0.679874 | 0.66779 | 0.675202 | 0.674812 | 0.007067 | 4 |
| 4 | 0.018938 | 0.00458 | 0.170778 | 0.013488 | True | True | 4 | {'standardscaler__with_std': True, 'standardsc... | 0.666287 | 0.684908 | 0.679874 | 0.66779 | 0.675202 | 0.674812 | 0.007067 | 4 |
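In the results above, candidates that differ only in `with_mean` share the same scores. One plausible explanation is that nearest-neighbor distances do not change when every sample is shifted by the same vector, so centering has no effect on the predictions. The sketch below checks this on synthetic data, which is an assumption made for illustration and not the housing dataset.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, non-centered data (hypothetical stand-in for the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, scale=2.0, size=(500, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)
X_train, X_test = X[:400], X[400:]
y_train = y[:400]

# Fit the same pipeline with and without centering.
predictions = {}
for with_mean in (True, False):
    pipeline = make_pipeline(
        StandardScaler(with_mean=with_mean, with_std=True),
        KNeighborsRegressor(n_neighbors=10),
    )
    pipeline.fit(X_train, y_train)
    predictions[with_mean] = pipeline.predict(X_test)

# Compare the two sets of predictions up to floating-point tolerance.
same_predictions = np.allclose(predictions[True], predictions[False])
print(same_predictions)
```

Scaling by the standard deviation, in contrast, changes the relative weight of each feature in the distance and can therefore change the neighbors.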


In [8]:
model_random_search.best_params_

{'standardscaler__with_std': True,
 'standardscaler__with_mean': True,
 'kneighborsregressor__n_neighbors': 10}
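Since `RandomizedSearchCV` refits the best candidate on the full training set by default, the fitted search object can be scored directly on held-out data. The sketch below runs the whole workflow on a synthetic regression problem built with `make_regression`; the dataset, its size, and the `random_state` values are assumptions chosen so that the example runs quickly without downloading data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data, large enough for n_neighbors=1000.
X, y = make_regression(n_samples=2000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipeline = make_pipeline(StandardScaler(), KNeighborsRegressor())
param_distributions = {
    "kneighborsregressor__n_neighbors": np.logspace(0, 3, num=10).astype(np.int32),
    "standardscaler__with_mean": [True, False],
    "standardscaler__with_std": [True, False],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,  # fixed seed so the sampled candidates are reproducible
)
search.fit(X_train, y_train)

# score() delegates to the refitted best estimator (R2 for regressors).
test_score = search.score(X_test, y_test)
print(search.best_params_)
print(f"R2 on the test set: {test_score:.3f}")
```

The test score is computed on data that played no role in the search, so it gives a less optimistic estimate than `best_score_`.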