# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
generalization performance of a model trained on a training set.

Here again, we limit the size of the training set to make the computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [1]:

import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])
from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In this exercise, we will progressively define the classification pipeline
and later tune its hyperparameters.

Our pipeline should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
on each group of columns.

In [6]:
from sklearn.compose import make_column_selector as selector

cat_col_selector = selector(dtype_include=object)
cat_cols = cat_col_selector(data)
num_col_selector = selector(dtype_exclude=object)
num_cols = num_col_selector(data)

print(cat_cols,'\n', num_cols)


['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country'] 
 ['age', 'capital-gain', 'capital-loss', 'hours-per-week']


In [7]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

cat_preprocessor = OneHotEncoder(
    handle_unknown = 'ignore', sparse_output = False
)

scaler = StandardScaler()


Subsequently, create a `ColumnTransformer` to redirect the specific columns
to a preprocessing pipeline.

In [9]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    [
        ('one-hot-encoder', cat_preprocessor, cat_cols),
        ('numerical', scaler, num_cols)
    ],
    remainder = 'passthrough'
)


Assemble the final pipeline by combining the above preprocessor with a logistic regression classifier. Force the maximum number of iterations to `10_000` to ensure that the model will converge.

In [43]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(preprocessor, LogisticRegression(max_iter = 10_000))
model

Use `RandomizedSearchCV` with `n_iter=20` to find the best set of
hyperparameters by tuning the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.

In [42]:
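from scipy.stats import loguniform

# Mean and variance of a log-uniform distribution on [0.001, 10].
loguniform.stats(0.001, 10)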

(array(1.08562763), array(4.25009362))
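
The two values above are the mean and the variance of the distribution. For a log-uniform distribution the mean says little about a typical draw: samples are spread evenly across orders of magnitude, so roughly a quarter of them should fall in each decade between 0.001 and 10. The sketch below illustrates this with the standard library only. `sample_loguniform` is a hypothetical helper, not part of scipy; it relies on the fact that the logarithm of a log-uniform variable is uniformly distributed.

```python
import math
import random
import statistics


def sample_loguniform(low, high, rng):
    """Draw one value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
samples = [sample_loguniform(1e-3, 10, rng) for _ in range(20_000)]

# Share of samples in each decade: [0.001, 0.01), [0.01, 0.1), [0.1, 1), [1, 10).
edges = [1e-3, 1e-2, 1e-1, 1, 10]
shares = [
    sum(lo <= s < hi for s in samples) / len(samples)
    for lo, hi in zip(edges[:-1], edges[1:])
]
print("share per decade:", [round(share, 2) for share in shares])
print("sample mean:", round(statistics.mean(samples), 2))
print("sample median:", round(statistics.median(samples), 2))
```

The median of this distribution is the geometric mean of the bounds, sqrt(0.001 × 10) = 0.1, which is well below the mean of about 1.09.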

In [46]:
model.get_params()
# To change a value, use set_params with the same key names, e.g.
# model.set_params(logisticregression__C = 1e-3)

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(remainder='passthrough',
                     transformers=[('one-hot-encoder',
                                    OneHotEncoder(handle_unknown='ignore',
                                                  sparse_output=False),
                                    ['workclass', 'education', 'marital-status',
                                     'occupation', 'relationship', 'race', 'sex',
                                     'native-country']),
                                   ('numerical', StandardScaler(),
                                    ['age', 'capital-gain', 'capital-loss',
                                     'hours-per-week'])])),
  ('logisticregression', LogisticRegression(max_iter=10000))],
 'verbose': False,
 'columntransformer': ColumnTransformer(remainder='passthrough',
                   transformers=[('one-hot-encoder',
                                  OneHotEncoder(handle_unknown='ignore',
      
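
The keys of the dictionary returned by `get_params` are the names that `set_params` and `RandomizedSearchCV` accept: each level of nesting is joined with a double underscore, and `make_pipeline` names each step after its lowercased class name. Here is a minimal sketch on a reduced pipeline; the column indices `[0]` and `[1]` and the `toy_` names are placeholders, not the columns or objects of the exercise.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Placeholder column indices; no data is needed to inspect parameter names.
toy_preprocessor = ColumnTransformer(
    [
        ("one-hot-encoder", OneHotEncoder(handle_unknown="ignore"), [0]),
        ("numerical", StandardScaler(), [1]),
    ]
)
toy_model = make_pipeline(toy_preprocessor, LogisticRegression(max_iter=10_000))

# Keep only the three hyperparameters that the exercise asks us to tune.
wanted_suffixes = ("__C", "__with_mean", "__with_std")
tunable = sorted(
    name for name in toy_model.get_params() if name.endswith(wanted_suffixes)
)
for name in tunable:
    print(name)

# set_params accepts the same names.
toy_model.set_params(logisticregression__C=0.1)
```

The printed names follow the pattern `<step name>__<transformer name>__<parameter>`, which is the form expected as keys of a `param_distributions` dictionary.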

In [51]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# for C in [1e-3, 1e-2, 1e-1, 1, 10]:
#     model.set_params(logisticregression__C=C)
#     model.set_params()

param_distribution = {
    'logisticregression__C' : loguniform(0.001, 10),
    # 'logisticregression__C' : [1e-3, 1e-2, 1e-1, 1, 10],
    'columntransformer__numerical__with_mean': [True, False],
    'columntransformer__numerical__with_std': [True, False]
}

model_rscv = RandomizedSearchCV(
    model, param_distributions = param_distribution, n_iter = 20,
    random_state = 42
)

model_rscv.fit(data_train, target_train)

In [50]:
# Run with the discrete list 'logisticregression__C' : [1e-3, 1e-2, 1e-1, 1, 10]
# Runtime: about 1 min 50 s
print("Best Parameters:", model_rscv.best_params_)

Best Parameters: {'logisticregression__C': 1, 'columntransformer__numerical__with_std': False, 'columntransformer__numerical__with_mean': False}


In [52]:
# Run with 'logisticregression__C' : loguniform(0.001, 10)
# Runtime: about 3 min
print("Best Parameters:", model_rscv.best_params_)

Best Parameters: {'columntransformer__numerical__with_mean': True, 'columntransformer__numerical__with_std': False, 'logisticregression__C': 6.245139574743064}
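
The two runs above select different scaler settings, which may indicate that several candidates score almost equally. One way to check is to look at the mean and standard deviation of the cross-validation scores in `cv_results_` rather than only at `best_params_`. The sketch below does this on synthetic data from `make_classification` (an assumption made so that it runs without the census file), with a reduced pipeline and `n_iter=5`. The `toy_` names are hypothetical, and the numbers it prints say nothing about the results above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the census data.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

toy_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10_000))
toy_search = RandomizedSearchCV(
    toy_model,
    param_distributions={
        "logisticregression__C": loguniform(0.001, 10),
        "standardscaler__with_mean": [True, False],
        "standardscaler__with_std": [True, False],
    },
    n_iter=5,
    cv=5,
    random_state=0,
)
toy_search.fit(X_train, y_train)

# List candidates from best to worst with their score spread across folds.
results = toy_search.cv_results_
for idx in results["rank_test_score"].argsort():
    print(
        f"{results['mean_test_score'][idx]:.3f} "
        f"+/- {results['std_test_score'][idx]:.3f}  "
        f"{results['params'][idx]}"
    )

# Score of the refitted best model on data not used during the search.
toy_test_score = toy_search.score(X_test, y_test)
print("test accuracy:", round(toy_test_score, 3))
```

If the gap between the top candidates is smaller than their standard deviation across folds, the choice between them is within the noise of the cross-validation.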
