# 📝 Exercise M3.02

The goal is to find the set of hyperparameters that maximizes the
statistical performance on a test set.

Here again, we limit the size of the training set to make the computation
run faster. Feel free to increase the `train_size` value if your computer
is powerful enough.

In [17]:
import numpy as np
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42, train_size=0.2)
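As a quick illustration of what `train_size=0.2` does, the sketch below splits a hypothetical array of 1000 samples (not the census data): 20% of the rows go to the training set and the remaining 80% to the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 samples, one feature, binary labels.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

# Same call pattern as above: only train_size is given, so the test set
# receives all remaining samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, train_size=0.2)

print(len(X_train), len(X_test))
```

Increasing `train_size` gives the model more data to learn from, at the cost of longer fits during the search.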

Create your machine learning pipeline

You should:
* preprocess the categorical columns using a `OneHotEncoder` and use a
  `StandardScaler` to normalize the numerical data.
* use a `LogisticRegression` as a predictive model.

Start by defining the columns and the preprocessing pipelines to be applied
to each column.

In [18]:
data.head()

,age,workclass,education,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
0,25,Private,11th,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States
1,38,Private,HS-grad,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States
2,28,Local-gov,Assoc-acdm,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States
3,44,Private,Some-college,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States
4,18,?,Some-college,Never-married,?,Own-child,White,Female,0,0,30,United-States


In [25]:
target.head()

0     <=50K
1     <=50K
2      >50K
3      >50K
4     <=50K
Name: class, dtype: object

In [27]:
from sklearn import set_config
set_config(display="diagram")

In [20]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

from sklearn.compose import make_column_selector as selector

categorical_column_names = selector(dtype_include=object)(data)
numerical_column_names = selector(dtype_exclude=object)(data)
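To see what the selector returns, here is a minimal sketch on a hypothetical three-row frame (the name `toy` and its values are made up for illustration). The string column is created with an explicit `object` dtype, which is an assumption that keeps the behaviour independent of the pandas version.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy frame mixing one string column and two numeric columns.
toy = pd.DataFrame({
    "age": [25, 38, 28],
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "hours-per-week": [40, 50, 40],
})

# Calling the selector on a frame returns a plain list of column names.
toy_categorical = selector(dtype_include=object)(toy)
toy_numerical = selector(dtype_exclude=object)(toy)

print(toy_categorical)
print(toy_numerical)
```

Because the result is a list of names, it can be passed directly as the column specification of a `ColumnTransformer`.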

Subsequently, create a `ColumnTransformer` to dispatch each group of columns
to its preprocessing pipeline.

In [21]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer([
    ('categorical-preproc', OneHotEncoder(handle_unknown='ignore'),
     categorical_column_names),
    ('numerical-preproc', StandardScaler(), numerical_column_names),
])
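The sketch below applies the same kind of transformer to hypothetical toy data to show two things: the shape of the encoded output, and the effect of `handle_unknown='ignore'` on a category that was not seen during `fit`. The frames, the name `toy_preprocessor`, and the use of `sparse_threshold=0` (to always get a dense array that is easy to print) are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the string column is stored with the object dtype.
toy_train = pd.DataFrame({
    "workclass": pd.Series(["Private", "Private", "Local-gov"], dtype=object),
    "age": [25, 38, 28],
    "hours-per-week": [40, 50, 40],
})
toy_new = pd.DataFrame({
    "workclass": pd.Series(["Self-emp"], dtype=object),  # unseen during fit
    "age": [30],
    "hours-per-week": [45],
})

toy_preprocessor = ColumnTransformer(
    [
        ("categorical-preproc", OneHotEncoder(handle_unknown="ignore"),
         ["workclass"]),
        ("numerical-preproc", StandardScaler(), ["age", "hours-per-week"]),
    ],
    sparse_threshold=0,  # assumption for readability: always return a dense array
)

encoded_train = toy_preprocessor.fit_transform(toy_train)
encoded_new = toy_preprocessor.transform(toy_new)

# Two one-hot columns (one per category seen in training) + two scaled columns.
print(encoded_train.shape)
# The unseen category is encoded as all zeros instead of raising an error.
print(encoded_new[0, :2])
```

Ignoring unknown categories matters during cross-validation, where a rare category may be absent from some training folds.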

Finally, concatenate the preprocessing pipeline with a logistic regression.

In [22]:
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

model = make_pipeline(preprocessor, LogisticRegression())

Use a `RandomizedSearchCV` to find the best set of hyperparameters by tuning
the following parameters of the `model`:

- the parameter `C` of the `LogisticRegression` with values ranging from
  0.001 to 10. You can use a log-uniform distribution
  (i.e. `scipy.stats.loguniform`);
- the parameter `with_mean` of the `StandardScaler` with possible values
  `True` or `False`;
- the parameter `with_std` of the `StandardScaler` with possible values
  `True` or `False`.

Once the computation has completed, print the best combination of parameters
stored in the `best_params_` attribute.
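A log-uniform distribution is suggested for `C` because the range 0.001 to 10 spans four orders of magnitude. The sketch below draws samples from `scipy.stats.loguniform` and counts how many fall in each decade; the sample size and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import loguniform

c_distribution = loguniform(0.001, 10)
samples = c_distribution.rvs(size=1000, random_state=0)

# Count samples per decade: [0.001, 0.01), [0.01, 0.1), [0.1, 1), [1, 10].
decade_edges = np.array([1e-3, 1e-2, 1e-1, 1e0, 1e1])
counts, _ = np.histogram(samples, bins=decade_edges)
fractions = counts / samples.size

for low, high, fraction in zip(decade_edges[:-1], decade_edges[1:], fractions):
    print(f"[{low:g}, {high:g}]: {fraction:.2f}")
```

Each decade should receive roughly a quarter of the samples, whereas a plain uniform distribution over the same range would place about 90% of the draws between 1 and 10.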

In [23]:
model.get_params()

{'memory': None,
 'steps': [('columntransformer',
   ColumnTransformer(transformers=[('categorical-preproc',
                                    OneHotEncoder(handle_unknown='ignore'),
                                    ['workclass', 'education', 'marital-status',
                                     'occupation', 'relationship', 'race', 'sex',
                                     'native-country']),
                                   ('numerical-preproc', StandardScaler(),
                                    ['age', 'capital-gain', 'capital-loss',
                                     'hours-per-week'])])),
  ('logisticregression', LogisticRegression())],
 'verbose': False,
 'columntransformer': ColumnTransformer(transformers=[('categorical-preproc',
                                  OneHotEncoder(handle_unknown='ignore'),
                                  ['workclass', 'education', 'marital-status',
                                   'occupation', 'relationship', 'race', 'sex',
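The full `get_params()` output is long. A shorter way to find the names to tune is to keep only the keys containing a double underscore, which is how scikit-learn addresses the parameters of nested steps. The sketch below uses a hypothetical two-step pipeline (`toy_model`) rather than the model above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical small pipeline; make_pipeline names each step after its class.
toy_model = make_pipeline(StandardScaler(), LogisticRegression())

# Nested parameters follow the pattern "<step name>__<parameter name>".
nested_names = sorted(
    name for name in toy_model.get_params() if "__" in name)

for name in nested_names:
    print(name)
```

When a step is itself a `ColumnTransformer`, the pattern nests once more, for example `columntransformer__numerical-preproc__with_mean`.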
     

In [28]:
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'logisticregression__C': loguniform(0.001, 10),
    'columntransformer__numerical-preproc__with_mean': [False, True],
    'columntransformer__numerical-preproc__with_std': [False, True]
}

model_random_search = RandomizedSearchCV(
    model, param_distributions=param_distributions,
    n_iter=10, n_jobs=4, cv=5)
model_random_search.fit(data_train, target_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
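The output above appears to be a solver convergence warning: for at least one candidate, the optimizer stopped at its iteration limit. This is likely to happen for candidates that disable scaling (`with_std=False`). One possible remedy, following the hint in the message, is to raise `max_iter`. The sketch below shows the idea on hypothetical synthetic data; the value 1000 is an arbitrary choice, not a recommendation from the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the census data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Allow the solver more iterations than its default limit.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
n_iterations = int(clf[-1].n_iter_[0])
print(f"Solver stopped after {n_iterations} iterations")
```

In the exercise, the equivalent change would be `make_pipeline(preprocessor, LogisticRegression(max_iter=1000))`; the parameter name `logisticregression__C` used in the search stays the same.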


In [31]:
accuracy = model_random_search.score(data_test, target_test)

print("Test accuracy score of best model is: ", f"{accuracy:.2f}")

Test accuracy score of best model is:  0.85


With logistic regression, the test accuracy (0.85) is slightly lower than the one obtained with the gradient boosting classifier (0.87).

The random search selected the following parameters for the best model:

In [30]:
model_random_search.best_params_

{'columntransformer__numerical-preproc__with_mean': True,
 'columntransformer__numerical-preproc__with_std': True,
 'logisticregression__C': 5.829070819931309}
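Beyond `best_params_`, the `cv_results_` attribute stores the cross-validated score of every sampled candidate, which helps judge how much the best combination really stands out. The sketch below runs a small search on hypothetical synthetic data with a simplified pipeline (no `ColumnTransformer`), so the parameter names and scores differ from those above.

```python
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the census data.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

search = RandomizedSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_distributions={
        "logisticregression__C": loguniform(0.001, 10),
        "standardscaler__with_mean": [False, True],
        "standardscaler__with_std": [False, True],
    },
    n_iter=5, cv=3, random_state=0)
search.fit(X, y)

# One row per sampled candidate, best-ranked candidate first.
cv_results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
columns = [
    "param_logisticregression__C",
    "param_standardscaler__with_mean",
    "param_standardscaler__with_std",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
print(cv_results[columns])
```

If several candidates have mean scores within one standard deviation of each other, the differences between them are probably not meaningful.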