# Manual Hyperparameter Tuning

The process of learning a predictive model is driven by a set of internal parameters and a set of training data. These internal parameters are called hyperparameters and are specific to each family of models. A given set of hyperparameters is optimal only for a specific dataset, so the hyperparameters need to be tuned for each dataset. In this notebook we will use the words hyperparameters and parameters interchangeably.

## Set and get hyperparameters in scikit-learn

This notebook shows how we can get and set the value of a hyperparameter in a scikit-learn estimator. Recall that hyperparameters are the parameters that control the learning process.

They should not be confused with the fitted parameters, which result from training. Fitted parameters are recognizable in scikit-learn because their names end with an underscore `_`, for instance `model.coef_`.
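
The distinction can be checked directly on an estimator. The following is a minimal sketch that assumes a small synthetic dataset from `make_classification` as a stand-in for real data; the value `C=0.5` is arbitrary and chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the adult census dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

clf = LogisticRegression(C=0.5)

# The hyperparameter is chosen by us and is available before any training.
print("C before fit:", clf.C)
print("has coef_ before fit:", hasattr(clf, "coef_"))

clf.fit(X, y)

# The fitted parameter only exists once the model has been trained.
print("has coef_ after fit:", hasattr(clf, "coef_"))
print("coef_ shape:", clf.coef_.shape)
```

Fitting leaves `C` unchanged and only adds the attributes whose names end with an underscore.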

Let's start by importing the required modules and loading the adult census dataset.

In [8]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression


In [4]:
df = pd.read_csv('data/adult-census.csv')
df.head()

| | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | Private | 226802 | 11th | 7 | Never-married | Machine-op-inspct | Own-child | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 1 | 38 | Private | 89814 | HS-grad | 9 | Married-civ-spouse | Farming-fishing | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 2 | 28 | Local-gov | 336951 | Assoc-acdm | 12 | Married-civ-spouse | Protective-serv | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 3 | 44 | Private | 160323 | Some-college | 10 | Married-civ-spouse | Machine-op-inspct | Husband | Black | Male | 7688 | 0 | 40 | United-States | >50K |
| 4 | 18 | ? | 103497 | Some-college | 10 | Never-married | ? | Own-child | White | Female | 0 | 0 | 30 | United-States | <=50K |


We will only consider the numerical features.

In [5]:
numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]

target = df['class']
data = df[numerical_columns]
data.head(5)


| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |


Let's create a simple predictive model made of a scaler followed by a logistic regression classifier.

As mentioned in previous notebooks, many models, including linear ones, work better if all features have a similar scale. For this purpose we use the `StandardScaler`.
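
To make the effect of scaling concrete, here is a small sketch on a made-up array with two features on very different scales. The numbers are illustrative only and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up features on very different scales (illustrative values only).
X = np.array([
    [25.0, 0.0],
    [38.0, 0.0],
    [28.0, 5000.0],
    [44.0, 7688.0],
    [18.0, 0.0],
])

# Each column is centered on its mean and divided by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print("means before:", X.mean(axis=0))
print("means after: ", X_scaled.mean(axis=0).round(6))
print("stds after:  ", X_scaled.std(axis=0).round(6))
```

After the transformation each column has a mean of 0 and a standard deviation of 1, so no feature dominates the others merely because of its units.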

In [7]:
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ('preprocessor', StandardScaler()),
    ('classifier', LogisticRegression())
])

We can evaluate the generalization performance of the model via cross-validation.

In [9]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
scores = cv_results['test_score']
print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation:
0.800 +/- 0.003
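
For reference, the object returned by `cross_validate` is a dictionary of arrays with one entry per fold. The sketch below assumes synthetic data from `make_classification` and relies on the default number of folds (5 in recent scikit-learn versions).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (an assumption; not the adult census dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

cv_results = cross_validate(LogisticRegression(), X, y)

# Timing information and one test score per fold.
print(sorted(cv_results.keys()))
print("number of folds:", len(cv_results["test_score"]))
```

The mean and standard deviation reported above are computed over the `test_score` array.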


We created a model with the default value of `C`, which is 1. If we wanted to use a different value of `C`, we could have passed it when creating the `LogisticRegression` object, for example `LogisticRegression(C=1e-3)`.
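
As a quick check, the value passed to the constructor is stored on the estimator under the same name. This short sketch uses the hypothetical variable names `default_model` and `regularized_model`.

```python
from sklearn.linear_model import LogisticRegression

default_model = LogisticRegression()
regularized_model = LogisticRegression(C=1e-3)

# Constructor arguments are kept as attributes of the same name.
print("default C:", default_model.C)
print("custom C: ", regularized_model.C)
```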

We can also change a parameter of a model after it has been created with the `set_params` method, which is available for all scikit-learn estimators. For example, we can first list the parameter names that the pipeline exposes with `get_params`, then set `C=1e-3` and evaluate the model:

In [10]:
model.get_params().keys()

dict_keys(['memory', 'steps', 'verbose', 'preprocessor', 'classifier', 'preprocessor__copy', 'preprocessor__with_mean', 'preprocessor__with_std', 'classifier__C', 'classifier__class_weight', 'classifier__dual', 'classifier__fit_intercept', 'classifier__intercept_scaling', 'classifier__l1_ratio', 'classifier__max_iter', 'classifier__multi_class', 'classifier__n_jobs', 'classifier__penalty', 'classifier__random_state', 'classifier__solver', 'classifier__tol', 'classifier__verbose', 'classifier__warm_start'])

In [11]:
model.set_params(classifier__C=1e-3)
results = cross_validate(model, data, target)
scores = results['test_score']
print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation:
0.787 +/- 0.002


When the model of interest is a `Pipeline`, the parameter names are of the form `<model_name>__<parameter_name>` (note the double underscore). In our case, `classifier` is the step name given in the `Pipeline` definition and `C` is the parameter name of `LogisticRegression`.
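
The naming rule can be sketched on a freshly built pipeline with the same step names as above. The `rejected` flag is a hypothetical helper variable used only to record what happens when the step prefix is omitted.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "<step name>__<parameter name>" addresses a parameter of one step.
model.set_params(classifier__C=1e-3, preprocessor__with_mean=False)

# The change is visible on the step objects themselves.
print(model.named_steps["classifier"].C)
print(model.named_steps["preprocessor"].with_mean)

# Without the step prefix, the pipeline does not know the parameter.
rejected = False
try:
    model.set_params(C=1e-3)
except ValueError:
    rejected = True
print("bare 'C' rejected:", rejected)
```

The prefix is what tells the pipeline which step should receive the value, which is why a bare `C` raises an error.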

In general, you can use the `get_params` method to list all the parameters with their values. It returns a dictionary, so a single value can be looked up by its name:

In [12]:
model.get_params()['classifier__C']

0.001

We can systematically vary the value of `C` to see if there is an optimal value.

In [13]:
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    results = cross_validate(model, data, target)
    scores = results['test_score']
    print(f"Accuracy score via cross-validation with C={C}:\n"
          f"{scores.mean():.3f} +/- {scores.std():.3f}")

Accuracy score via cross-validation with C=0.001:
0.787 +/- 0.002
Accuracy score via cross-validation with C=0.01:
0.799 +/- 0.003
Accuracy score via cross-validation with C=0.1:
0.800 +/- 0.003
Accuracy score via cross-validation with C=1:
0.800 +/- 0.003
Accuracy score via cross-validation with C=10:
0.800 +/- 0.003


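We can see that as long as `C` is high enough, the model performs well: accuracy stays at roughly 0.80 for `C` of 0.01 and above, and only the smallest value, `C=0.001`, is noticeably worse (0.787).
What we did here is very manual: it involves scanning the values of `C` and picking the best one by hand.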
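
The manual scan can be made slightly less manual by recording the mean score for each candidate and letting Python pick the maximum. This is only a sketch: it assumes synthetic data from `make_classification`, and `candidate_values`, `mean_scores` and `best_C` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the adult census dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

candidate_values = [1e-3, 1e-2, 1e-1, 1, 10]
mean_scores = {}
for C in candidate_values:
    model.set_params(classifier__C=C)
    results = cross_validate(model, X, y)
    # Keep one summary number per candidate value.
    mean_scores[C] = results["test_score"].mean()

# Pick the candidate with the highest mean cross-validated accuracy.
best_C = max(mean_scores, key=mean_scores.get)
print(f"Best C: {best_C} (mean accuracy {mean_scores[best_C]:.3f})")
```

When several candidates tie, `max` returns the first one encountered, so the order of `candidate_values` decides between them.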

<div class="alert alert-block alert-warning">
<b>Warning:</b> <br>
When we evaluate a family of models on test data and pick the best performer, we cannot trust the corresponding prediction accuracy: the test data has been used to select the model, so it is no longer independent of this model. To get a trustworthy estimate, we need to apply the selected model to new data.
</div>
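
One possible way to follow this advice is to set aside a test set before any tuning, select `C` on the remaining data only, and score the selected model once on the untouched part. The sketch below assumes synthetic data and a simple `train_test_split`; it is an illustration of the idea rather than a recommended protocol.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the adult census dataset).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Set aside data that plays no part in choosing C.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Select C using cross-validation on the training part only.
selection_scores = {}
for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, X_train, y_train)
    selection_scores[C] = cv_results["test_score"].mean()
best_C = max(selection_scores, key=selection_scores.get)

# Refit the selected model and score it on the untouched data.
model.set_params(classifier__C=best_C)
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Selected C={best_C}, accuracy on held-out data: {test_accuracy:.3f}")
```

Because `X_test` was never seen during selection, `test_accuracy` is an estimate that is independent of the choice of `C`.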