# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance of a model fitted on a training set.

Here again we limit the size of the training set to make the computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [1]:

import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)
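To make the effect of `train_size=0.2` concrete, here is a minimal sketch on made-up data (the `toy_*` names and values are assumptions for illustration, not part of the census dataset). It only shows that 20% of the rows end up in the training set and the remaining 80% in the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, used only to show how `train_size` splits them.
toy_data = np.arange(200).reshape(100, 2)
toy_target = np.arange(100) % 2

toy_train, toy_test, toy_target_train, toy_target_test = train_test_split(
    toy_data, toy_target, train_size=0.2, random_state=42
)

# With 100 rows, 20 go to the training set and 80 to the test set.
print(toy_train.shape, toy_test.shape)
```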

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [2]:
# Write your code here.
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
print(f"Categorical cols : {categorical_columns}")

numerical_columns_selector = selector(dtype_include=int)
numerical_columns = numerical_columns_selector(data)
print(f"Numerical cols : {numerical_columns}")

Categorical cols : ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country']
Numerical cols : ['age', 'capital-gain', 'capital-loss', 'hours-per-week']
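As a side note, the dtype-based selection can be tried on a tiny frame. The sketch below uses a made-up `toy` DataFrame (an assumption for illustration) and selects columns with `dtype_include="number"` and `dtype_exclude="number"`, which is one possible alternative to listing `object` and `int` explicitly and also catches float columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Toy frame (made-up values) with one text column and two numeric columns.
toy = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "age": [25, 38, 52],
    "hours-per-week": [40.0, 37.5, 45.0],
})

# "number" matches every numeric dtype (integers and floats alike).
numeric_cols = selector(dtype_include="number")(toy)
# Excluding "number" keeps the remaining columns, here the text column.
other_cols = selector(dtype_exclude="number")(toy)

print(numeric_cols)
print(other_cols)
```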


In [29]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Write your code here.
categorical_processor = OneHotEncoder(handle_unknown="ignore")
numerical_processor = StandardScaler()
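The `handle_unknown="ignore"` option matters when a category shows up at prediction time (or in a cross-validation fold) without having been seen during `fit`. A minimal sketch with made-up category values: the unseen category is encoded as a row of zeros instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Made-up values: "Never-worked" does not appear in the training data.
train_values = [["Private"], ["State-gov"], ["Private"]]
new_values = [["State-gov"], ["Never-worked"]]

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_values)

# The encoder returns a sparse matrix by default; densify it for display.
encoded = encoder.transform(new_values).toarray()
print(encoder.categories_)
print(encoded)
```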

Subsequently, create a `ColumnTransformer` to dispatch each group of columns
to its preprocessing pipeline.

In [30]:
# Write your code here.

from sklearn.compose import ColumnTransformer

# solution
preprocessor = ColumnTransformer(
    [('cat_preprocessor', categorical_processor, categorical_columns),
     ('num_preprocessor', numerical_processor, numerical_columns)]
)
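To see what such a `ColumnTransformer` produces, the following sketch applies the same two processors to a made-up two-column frame (the `toy` data and the `toy_preprocessor` name are assumptions). The one-hot encoder yields one column per category and the scaler yields one column, so four rows become a 4 x 3 matrix.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up data: one categorical column and one numerical column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [20, 30, 40, 50],
})

toy_preprocessor = ColumnTransformer([
    ("cat_preprocessor", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ("num_preprocessor", StandardScaler(), ["age"]),
])

# Two one-hot columns for "sex" plus one scaled column for "age".
transformed = toy_preprocessor.fit_transform(toy)
print(transformed.shape)
```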

Assemble the final pipeline by combining the above preprocessor
with a logistic regression classifier. Force the maximum number of
iterations to `10_000` to ensure that the model will converge.

In [5]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Write your code here.

lr = LogisticRegression()
lr.get_params()

{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'l1_ratio': None,
 'max_iter': 100,
 'multi_class': 'auto',
 'n_jobs': None,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'lbfgs',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}

In [33]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Write your code here.

model = make_pipeline(preprocessor, LogisticRegression(max_iter=10_000))
model

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('cat_preprocessor',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  ['workclass', 'education',
                                                   'marital-status',
                                                   'occupation', 'relationship',
                                                   'race', 'sex',
                                                   'native-country']),
                                                 ('num_preprocessor',
                                                  StandardScaler(),
                                                  ['age', 'capital-gain',
                                                   'capital-loss',
                                                   'hours-per-week'])])),
                ('logisticregression', LogisticRegression(max_iter=10000))])

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
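A log-uniform distribution spreads its samples evenly across orders of magnitude, which is why it suits a parameter such as `C`. The sketch below draws samples from `loguniform(0.001, 10)` and counts how many fall in each decade; the sample size and seed are arbitrary choices made for illustration.

```python
import numpy as np
from scipy.stats import loguniform

C_distribution = loguniform(0.001, 10)
samples = C_distribution.rvs(size=1000, random_state=0)

# Count the samples per decade: [1e-3, 1e-2), [1e-2, 1e-1), [1e-1, 1), [1, 10].
decade_edges = np.logspace(-3, 1, num=5)
counts, _ = np.histogram(samples, bins=decade_edges)

print(f"min={samples.min():.4f}, max={samples.max():.4f}")
print(counts)  # roughly a quarter of the samples in each decade
```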

In [21]:
# [i for i in model.get_params().keys() if 'log' in i]

In [22]:
for i in range(0,3):
    print(i)

0
1
2


In [24]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

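# Write your code here.
# First attempt, kept for reference. It fails because:
# - `best_params_` only exists after `fit`, which is never called here
#   (hence the AttributeError below);
# - the scaler lives inside the `ColumnTransformer`, so its parameters are
#   named `columntransformer__num_preprocessor__with_mean` and
#   `columntransformer__num_preprocessor__with_std`;
# - the exercise asks for `C` between 0.001 and 10 and `n_iter=20`, and no
#   outer loop is needed since `RandomizedSearchCV` samples the candidates.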


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


from sklearn.model_selection import RandomizedSearchCV

for i in range(1, 21):
    param_distributions = {
        'logisticregression__C': loguniform(1e-6, 1e3),
        'standardscaler__with_mean': [True, False],
        'standardscaler__with_std': [True, False]
    }

    model_random_search= RandomizedSearchCV(
        model, param_distributions=param_distributions, n_iter=10,
        cv=5, verbose=1,
    )
    print(model_random_search.best_params_)
# model_random_search.fit(data_train, target_train)

AttributeError: 'RandomizedSearchCV' object has no attribute 'best_params_'

In [35]:
model.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'columntransformer', 'logisticregression', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__cat_preprocessor', 'columntransformer__num_preprocessor', 'columntransformer__cat_preprocessor__categories', 'columntransformer__cat_preprocessor__drop', 'columntransformer__cat_preprocessor__dtype', 'columntransformer__cat_preprocessor__handle_unknown', 'columntransformer__cat_preprocessor__sparse', 'columntransformer__num_preprocessor__copy', 'columntransformer__num_preprocessor__with_mean', 'columntransformer__num_preprocessor__with_std', 'logisticregression__C', 'logisticregression__class_weight', 'logisticregression__dual', 'logisticregression__fit_intercept', 'logisticregression__intercept_scaling', 'logisticregression__l1_rati
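The key names follow the pattern `<step>__<sub-step>__<parameter>`. As a small sketch on a stripped-down pipeline (the `toy_model` name and the column indices are assumptions for illustration), the code below filters the keys to find the scaler's `with_mean` parameter, sets it through its full path, and shows that a shortened path is rejected.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

toy_model = make_pipeline(
    ColumnTransformer([("num_preprocessor", StandardScaler(), [0, 1])]),
    LogisticRegression(max_iter=10_000),
)

# Keep only the keys that end with the parameter we are looking for.
with_mean_keys = [k for k in toy_model.get_params() if k.endswith("with_mean")]
print(with_mean_keys)

# The full path works because the scaler is nested in the ColumnTransformer.
toy_model.set_params(columntransformer__num_preprocessor__with_mean=False)

# The pipeline has no step called "standardscaler", so this path is invalid.
try:
    toy_model.set_params(standardscaler__with_mean=False)
    short_path_rejected = False
except ValueError as error:
    short_path_rejected = True
    print(f"Rejected: {error}")
```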

In [36]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# solution
param_distributions = {
    "logisticregression__C": loguniform(0.001, 10),
    "columntransformer__num_preprocessor__with_mean": [True, False],
    "columntransformer__num_preprocessor__with_std": [True, False],
}

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=20, error_score=np.nan, n_jobs=2, verbose=1, random_state=1)
model_random_search.fit(data_train, target_train)
model_random_search.best_params_

Fitting 5 folds for each of 20 candidates, totalling 100 fits


{'columntransformer__num_preprocessor__with_mean': False,
 'columntransformer__num_preprocessor__with_std': False,
 'logisticregression__C': 0.17169565852473864}

In this run, the best hyperparameters found neither center nor scale the numerical features (`with_mean=False` and `with_std=False`), and the final model is regularized (`C` ≈ 0.17).
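For readers who want to replay the whole workflow quickly, here is a minimal, self-contained sketch on synthetic data (the dataset, the `toy_*` names, and `n_iter=5` are assumptions chosen to keep it fast). In this simplified pipeline the scaler is a direct step, so its parameters are reachable as `standardscaler__...`. The sketch also shows that `best_params_` appears only after `fit`, and that the fitted search can be scored on held-out data.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))
toy_search = RandomizedSearchCV(
    toy_model,
    param_distributions={
        "logisticregression__C": loguniform(0.001, 10),
        "standardscaler__with_mean": [True, False],
        "standardscaler__with_std": [True, False],
    },
    n_iter=5,
    random_state=0,
)

# The attribute does not exist until the search has been fitted.
fitted_before = hasattr(toy_search, "best_params_")
print(f"best_params_ available before fit: {fitted_before}")

toy_search.fit(X_train, y_train)
print(toy_search.best_params_)

# The search refits the best candidate, so it can be scored on the test set.
test_accuracy = toy_search.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```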

Getting the best parameter combination is the main outcome of the hyperparameter optimization procedure. However, it is also interesting to assess the sensitivity of the best models to the choice of those parameters. The following code, not required to answer the quiz question, shows how to conduct such an interactive analysis for this pipeline with a parallel coordinate plot using the `plotly` library.

We could use `cv_results = model_random_search.cv_results_` to make a parallel coordinate plot as we did in the previous notebook (you are more than welcome to try!). Instead, we are going to load the results obtained from a similar search with many more iterations (1,000 instead of 20).
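For reference, `cv_results_` is a dictionary of arrays that converts directly into a `DataFrame`. The sketch below does this for a small search on synthetic data (the data, the `toy_*` names, and `n_iter=6` are assumptions for illustration) and lists the candidates from best to worst.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data, only used to produce a small `cv_results_` dictionary.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

toy_search = RandomizedSearchCV(
    LogisticRegression(max_iter=10_000),
    param_distributions={"C": loguniform(0.001, 10)},
    n_iter=6,
    random_state=0,
)
toy_search.fit(X, y)

# One row per sampled candidate, one column per recorded quantity.
toy_results = pd.DataFrame(toy_search.cv_results_)
columns = ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
print(toy_results[columns].sort_values("rank_test_score"))
```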

In [37]:
cv_results = pd.read_csv(
    "../figures/randomized_search_results_logistic_regression.csv")

In [40]:
cv_results.columns

Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_columntransformer__num_preprocessor__with_mean',
       'param_columntransformer__num_preprocessor__with_std',
       'param_logisticregression__C', 'params', 'split0_test_score',
       'split1_test_score', 'split2_test_score', 'split3_test_score',
       'split4_test_score', 'mean_test_score', 'std_test_score',
       'rank_test_score'],
      dtype='object')

In [43]:
cv_results.sort_values('rank_test_score').head(3)

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_columntransformer__num_preprocessor__with_mean | param_columntransformer__num_preprocessor__with_std | param_logisticregression__C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 926 | 0.380337 | 0.042428 | 0.033888 | 0.009012 | False | True | 0.279554 | {'columntransformer__num_preprocessor__with_me... | 0.846469 | 0.855681 | 0.847492 | 0.852535 | 0.839222 | 0.84828 | 0.005636 | 1 |
| 282 | 0.383388 | 0.088132 | 0.029121 | 0.001475 | True | True | 0.25475 | {'columntransformer__num_preprocessor__with_me... | 0.845957 | 0.854657 | 0.848004 | 0.852023 | 0.839734 | 0.848075 | 0.005157 | 2 |
| 787 | 0.464065 | 0.085474 | 0.040924 | 0.01001 | True | True | 0.247167 | {'columntransformer__num_preprocessor__with_me... | 0.846469 | 0.854657 | 0.848004 | 0.85151 | 0.839734 | 0.848075 | 0.005046 | 3 |


Selecting the best performing models (i.e. above an accuracy of ~0.845), we observe the following pattern:

- scaling the data is important. All the best performing models scale the data;
- centering the data does not have a strong impact. Both approaches, centering and not centering, can lead to good models;
- using some regularization is fine but using too much is a problem. Recall that a smaller value of `C` means a stronger regularization. In particular, no pipeline with `C` lower than 0.001 can be found among the best models.
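The link between `C` and regularization strength can be checked numerically. The following sketch, on synthetic data (an assumption for illustration, not the census dataset), fits an L2-penalized `LogisticRegression` for three values of `C` and prints the norm of the learned coefficients, which shrinks as `C` decreases.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for a real classification problem.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

coef_norms = {}
for C in [0.001, 0.1, 10]:
    clf = LogisticRegression(C=C, max_iter=10_000).fit(X, y)
    # A smaller C penalizes large weights more, pulling them toward zero.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: ||coef|| = {coef_norms[C]:.3f}")
```

With very small `C`, the weights are pushed so close to zero that the model can no longer use the features effectively, which is consistent with the absence of such pipelines among the best models.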