# Hyperparameter tuning by randomized-search

In the previous notebook, we showed how to use a grid-search approach to
search for the best hyperparameters maximizing the generalization performance
of a predictive model.

However, a grid-search approach has limitations. It does not scale well when
the number of parameters to tune increases. Also, the grid imposes a
regularity during the search which might miss better parameter
values between two consecutive values on the grid.
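To make the scaling issue concrete, here is a minimal sketch that counts the candidates of a full grid. The choice of 5 values per hyperparameter is a hypothetical setting used only for illustration.

```python
from itertools import product


def grid_size(n_values_per_param, n_params):
    """Count the candidates of a full grid with the same number of values per axis."""
    axis = range(n_values_per_param)
    return sum(1 for _ in product(axis, repeat=n_params))


# Hypothetical setting: 5 candidate values for each hyperparameter.
for n_params in range(1, 6):
    print(f"{n_params} hyperparameter(s): {grid_size(5, n_params)} candidates")
```

The number of candidates is multiplied by 5 each time a hyperparameter is added, and each candidate is itself fitted once per cross-validation fold.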

In this notebook, we present a different method to tune hyperparameters called
randomized search.

## Our predictive model

Let us reload the dataset as we did previously:

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

We extract the column containing the target.

In [2]:
target_name = "class"
target = adult_census[target_name]
target

0         <=50K
1         <=50K
2          >50K
3          >50K
4         <=50K
          ...  
48837     <=50K
48838      >50K
48839     <=50K
48840     <=50K
48841      >50K
Name: class, Length: 48842, dtype: object

We drop the target from our data, along with the `"education-num"` column,
which duplicates the information in the `"education"` column.

In [3]:
data = adult_census.drop(columns=[target_name, "education-num"])
data.head()

Unnamed: 0,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


Once the dataset is loaded, we split it into training and testing sets.

In [4]:
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42
)

We create the same predictive pipeline as done for the grid-search section.

In [5]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [("cat_preprocessor", categorical_preprocessor, categorical_columns)],
    remainder="passthrough",
)

In [6]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        (
            "classifier",
            HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4),
        ),
    ]
)

model

## Tuning using a randomized-search

With the `GridSearchCV` estimator, the parameters need to be specified
explicitly. We already mentioned that exploring a large number of values for
different parameters quickly becomes intractable.

Instead, we can randomly generate the parameter candidates. Such an approach
avoids the regularity of the grid, so adding more evaluations increases the
resolution in each direction. This is particularly useful in the frequent
situation where the choice of some hyperparameters is not very important, as
for hyperparameter 2 in the figure below.

![Randomized vs grid search](../figures/grid_vs_random_search.svg)

Indeed, the number of evaluation points needs to be divided across the two
different hyperparameters. With a grid, the danger is that the region of good
hyperparameters may fall between lines of the grid. In the figure, such a
region is aligned with the grid because hyperparameter 2 has a weak influence.
In contrast, a stochastic search samples hyperparameter 1 independently of
hyperparameter 2 and finds the optimal region.
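The following sketch illustrates this argument with a fixed budget of 9 evaluations. It uses two hypothetical hyperparameters scaled to `[0, 1]` and only counts how many distinct values of hyperparameter 1 each strategy explores; no model is fitted.

```python
import random
from itertools import product

n_points = 9  # same evaluation budget for both strategies

# Grid search: a 3 x 3 grid on two hyperparameters scaled to [0, 1].
axis = [0.0, 0.5, 1.0]
grid_points = list(product(axis, axis))

# Random search: both hyperparameters are drawn independently.
rng = random.Random(0)
random_points = [(rng.random(), rng.random()) for _ in range(n_points)]

distinct_grid = len({h1 for h1, _ in grid_points})
distinct_random = len({h1 for h1, _ in random_points})
print(f"Distinct values of hyperparameter 1 (grid):   {distinct_grid}")
print(f"Distinct values of hyperparameter 1 (random): {distinct_random}")
```

With the same budget, the grid only tries 3 values of hyperparameter 1, whereas the random draws try 9 different ones.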

The `RandomizedSearchCV` class allows for such a stochastic search. It is used
similarly to `GridSearchCV`, but sampling distributions need to be specified
instead of parameter values. For instance, we can draw candidates using a
log-uniform distribution because the parameters we are interested in take
positive values with a natural log scaling (0.1 is as close to 1 as 10 is).
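As an illustration of what "log-uniform" means, the sketch below draws values whose logarithm is uniformly distributed and counts them per decade. The helper `sample_loguniform` is a hypothetical pure-Python stand-in written for this example; it is not the implementation used by `scipy.stats.loguniform`.

```python
import math
import random
from collections import Counter


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-3, 10, rng) for _ in range(10_000)]

# Count samples per decade: [0.001, 0.01), [0.01, 0.1), [0.1, 1), [1, 10].
# The exponent is clamped to guard against rounding at the interval bounds.
decades = Counter(
    max(-3, min(0, math.floor(math.log10(x)))) for x in samples
)
for exponent in sorted(decades):
    print(f"[1e{exponent}, 1e{exponent + 1}): {decades[exponent]} samples")
```

Each decade receives roughly a quarter of the samples, whereas a plain uniform distribution on the same interval would put about 90% of them between 1 and 10.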

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last">Random search (with <tt class="docutils literal">RandomizedSearchCV</tt>) is typically beneficial compared to
grid search (with <tt class="docutils literal">GridSearchCV</tt>) to optimize 3 or more hyperparameters.</p>
</div>

We now optimize 3 other parameters in addition to the ones we optimized in
the notebook presenting the `GridSearchCV`:

* `l2_regularization`: it corresponds to the strength of the regularization;
* `min_samples_leaf`: it corresponds to the minimum number of samples required
  in a leaf;
* `max_bins`: it corresponds to the maximum number of bins to construct the
  histograms.

We recall the meaning of the 2 remaining parameters:

* `learning_rate`: it corresponds to the speed at which the gradient-boosting
  corrects the residuals at each boosting iteration;
* `max_leaf_nodes`: it corresponds to the maximum number of leaves for each
  tree in the ensemble.

<div class="admonition note alert alert-info">
<p class="first admonition-title" style="font-weight: bold;">Note</p>
<p class="last"><tt class="docutils literal">scipy.stats.loguniform</tt> can be used to generate floating-point numbers. To generate
random values for integer-valued parameters (e.g. <tt class="docutils literal">min_samples_leaf</tt>) we can
adapt it as follows:</p>
</div>

In [7]:
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)
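A quick way to check this wrapper is to draw a few samples from it. The sketch below repeats the class definition so that it can run on its own; the sample size and the seed are arbitrary choices made for this example.

```python
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""

    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


# `size` and `random_state` are forwarded to the underlying scipy distribution.
samples = loguniform_int(2, 256).rvs(size=1000, random_state=0)
print(samples[:10])
print(samples.min(), samples.max())
```

Note that `astype(int)` truncates the decimal part, so the upper bound is only returned if the underlying draw is exactly equal to it.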

Now, we can define the randomized search using the different distributions.
Executing 10 iterations of 5-fold cross-validation for random parametrizations
of this model on this dataset can take from 10 seconds to several minutes,
depending on the speed of the host computer and the number of available
processors.

In [8]:
%%time
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "classifier__l2_regularization": loguniform(1e-6, 1e3),
    "classifier__learning_rate": loguniform(0.001, 10),
    "classifier__max_leaf_nodes": loguniform_int(2, 256),
    "classifier__min_samples_leaf": loguniform_int(1, 100),
    "classifier__max_bins": loguniform_int(2, 255),
}

model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    # number of candidates sampled from param_distributions; a larger value
    # improves the quality of the search at a higher computational cost
    n_iter=10,
    cv=5,
    verbose=1,
)

model_random_search.fit(data_train, target_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
CPU times: total: 2min 2s
Wall time: 16.1 s
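To get a feel for the kind of candidates that are drawn from such distributions, we can sample a few of them with `ParameterSampler` from `sklearn.model_selection`. This is only a sketch: it uses two of the distributions defined above and an arbitrary seed, so the values differ from the ones evaluated by the search.

```python
from scipy.stats import loguniform
from sklearn.model_selection import ParameterSampler

# Two of the distributions used above, kept short for illustration.
distributions = {
    "classifier__l2_regularization": loguniform(1e-6, 1e3),
    "classifier__learning_rate": loguniform(0.001, 10),
}

# Each drawn candidate is a dict mapping parameter names to sampled values.
candidates = list(ParameterSampler(distributions, n_iter=5, random_state=0))
for candidate in candidates:
    print(candidate)
```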


Then, we can compute the accuracy score on the test set.

In [9]:
accuracy = model_random_search.score(data_test, target_test)

print(f"The test accuracy score of the best model is {accuracy:.2f}")

The test accuracy score of the best model is 0.88


In [12]:
from pprint import pprint

print("The best parameters are:")
pprint(model_random_search.best_params_)

The best parameters are:
{'classifier__l2_regularization': 1.1608042764447255e-05,
 'classifier__learning_rate': 0.41961777704389147,
 'classifier__max_bins': 120,
 'classifier__max_leaf_nodes': 18,
 'classifier__min_samples_leaf': 81}


We can inspect the results using the attribute `cv_results_` as we did
previously.

In [14]:
# get the parameter names
column_results = [f"param_{name}" for name in param_distributions.keys()]
column_results += ["mean_test_score", "std_test_score", "rank_test_score"]

print(model_random_search.cv_results_)

cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results = cv_results[column_results].sort_values(
    "mean_test_score", ascending=False
)


def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name


cv_results = cv_results.rename(shorten_param, axis=1)
cv_results

{'mean_fit_time': array([0.45101709, 0.27091813, 0.1520422 , 0.14804835, 0.36675005,
       0.07302952, 1.03306403, 0.09683986, 0.15102215, 0.17322745]), 'std_fit_time': array([0.05065122, 0.01011398, 0.00963119, 0.0176929 , 0.02357877,
       0.00723499, 0.04546066, 0.00785251, 0.01827054, 0.01423829]), 'mean_score_time': array([0.02831964, 0.02896013, 0.02223387, 0.0220222 , 0.0261632 ,
       0.01902413, 0.03104382, 0.0178462 , 0.02366319, 0.02087154]), 'std_score_time': array([0.00152873, 0.001391  , 0.00116579, 0.00189635, 0.00251911,
       0.0021724 , 0.00256005, 0.00265718, 0.00181708, 0.00323212]), 'param_classifier__l2_regularization': masked_array(data=[2.6770914730915104e-05, 10.101234581969754,
                   0.4505625424649028, 0.03316790451168031,
                   789.8101502945758, 0.00016791312809494392,
                   5.748736636039643, 1.926221139182056e-06,
                   1.1608042764447255e-05, 1.1018286651781029],
             mask=[False, False, Fal

Unnamed: 0,l2_regularization,learning_rate,max_leaf_nodes,min_samples_leaf,max_bins,mean_test_score,std_test_score,rank_test_score
8,1.2e-05,0.419618,18,81,120,0.867953,0.001829,1
1,10.101235,0.054625,22,15,47,0.855369,0.002759,2
6,5.748737,0.066335,152,1,42,0.854276,0.002391,3
0,2.7e-05,0.023282,45,1,11,0.841255,0.003269,4
2,0.450563,0.036966,4,3,97,0.830717,0.003891,5
3,0.033168,0.433015,4,3,5,0.830553,0.002378,6
9,1.101829,0.757313,78,1,2,0.800333,0.003615,7
4,789.81015,0.002743,36,27,2,0.758947,1.3e-05,8
7,2e-06,9.214924,58,4,52,0.680077,0.115199,9
5,0.000168,6.200753,2,24,91,0.283476,0.005123,10


Keep in mind that tuning is limited by the number of different combinations of
parameters that are scored by the randomized search. In fact, there might be
other sets of parameters leading to similar or better generalization
performance that were not tested in the search. In practice, a randomized
hyperparameter search is usually run with a large number of iterations. In
order to avoid the computation cost and still make a decent analysis, we load
the results obtained from a similar search with 500 iterations, which the
next cell computes and saves once.

In [15]:
# Run a larger search with 500 candidates and save the results to disk.
# This takes about 10 minutes to train.
model_random_search = RandomizedSearchCV(
    model,
    param_distributions=param_distributions,
    n_iter=500,
    n_jobs=2,
    cv=5,
)
model_random_search.fit(data_train, target_train)
cv_results = pd.DataFrame(model_random_search.cv_results_)
cv_results.to_csv("../figures/randomized_search_results.csv")

In [16]:
cv_results = pd.read_csv(
    "../figures/randomized_search_results.csv", index_col=0
)

(
    cv_results[column_results]
    .rename(shorten_param, axis=1)
    .sort_values("mean_test_score", ascending=False)
)

Unnamed: 0,l2_regularization,learning_rate,max_leaf_nodes,min_samples_leaf,max_bins,mean_test_score,std_test_score,rank_test_score
450,0.027372,0.284554,11,8,127,0.870383,0.000812,1
271,11.276787,0.143733,46,16,228,0.869946,0.002570,2
19,0.002734,0.257275,8,6,212,0.869701,0.001818,3
197,0.073667,0.107628,54,30,152,0.869509,0.003356,4
242,0.000314,0.042688,67,11,139,0.868963,0.002402,5
...,...,...,...,...,...,...,...,...
383,0.001917,4.311567,2,6,5,0.283476,0.005123,496
205,2.319208,6.191326,2,1,58,0.283476,0.005123,496
153,0.344496,9.497829,2,50,20,0.283476,0.005123,496
20,131.189756,7.510622,2,18,32,0.283476,0.005123,496


In this case, the top-performing models have test scores that largely overlap
with each other, meaning that the set of parameters leading to the
best generalization performance is not unique.
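One simple way to check this overlap is to compare the intervals "mean plus or minus one standard deviation" of the five best candidates, using the values copied from the table above. The one-standard-deviation interval is an assumption made for this sketch, not a formal statistical test.

```python
# (mean_test_score, std_test_score) of the five best candidates, copied from
# the table above.
top_candidates = [
    (0.870383, 0.000812),
    (0.869946, 0.002570),
    (0.869701, 0.001818),
    (0.869509, 0.003356),
    (0.868963, 0.002402),
]

best_mean, best_std = top_candidates[0]
best_low, best_high = best_mean - best_std, best_mean + best_std

overlaps_with_best = []
for rank, (mean, std) in enumerate(top_candidates[1:], start=2):
    low, high = mean - std, mean + std
    # Two intervals overlap when each one starts before the other one ends.
    overlaps = low <= best_high and high >= best_low
    overlaps_with_best.append(overlaps)
    print(f"rank {rank}: [{low:.4f}, {high:.4f}] overlaps with rank 1: {overlaps}")
```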


In this notebook, we saw how a randomized search offers a valuable alternative
to grid-search when the number of hyperparameters to tune is more than two. It
also alleviates the regularity imposed by the grid, which can sometimes be
problematic.

In the following, we will see how to use interactive plotting tools to explore
the results of large hyperparameter search sessions and gain some insights on
the range of parameter values that lead to the highest performing models and
on whether different hyperparameters are coupled or not.