*ref: https://inria.github.io/scikit-learn-mooc/python_scripts/parameter_tuning_manual.html*

The process of learning a predictive model is driven by a set of internal parameters and a set of training data.

These internal parameters are called **hyperparameters** and are specific to each family of models. They are chosen before training rather than learned from the data.

In addition, a given set of hyperparameters is optimal only for a specific dataset, so hyperparameters need to be tuned for each new problem.
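
To make the distinction concrete, here is a minimal sketch that contrasts a hyperparameter with the values learned during training. It uses synthetic data from `make_classification` as a stand-in (an assumption made so the snippet runs on its own; it is not the adult census dataset), and the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the adult census dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# C is a hyperparameter: we choose it when building the estimator.
clf = LogisticRegression(C=1.0)
has_coef_before_fit = hasattr(clf, "coef_")
print("coef_ exists before fit:", has_coef_before_fit)

# coef_ and intercept_ are learned from the training data by fit.
clf.fit(X, y)
print("Hyperparameter C:", clf.C)
print("Shape of learned coefficients:", clf.coef_.shape)
```

The value of `C` is the same before and after `fit`, while `coef_` only exists once the model has seen data.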

In [1]:
# We will start by loading the adult census dataset and only use the numerical features.

import pandas as pd

adult_census = pd.read_csv("../../datasets/adult-census.csv")

target_name = "class"
numerical_columns = [
    "age", "capital-gain", "capital-loss", "hours-per-week"]

target = adult_census[target_name]
data = adult_census[numerical_columns]

In [2]:
# Our data is only numerical.

data.head()

| | age | capital-gain | capital-loss | hours-per-week |
|---|---|---|---|---|
| 0 | 25 | 0 | 0 | 40 |
| 1 | 38 | 0 | 0 | 50 |
| 2 | 28 | 0 | 0 | 40 |
| 3 | 44 | 7688 | 0 | 40 |
| 4 | 18 | 0 | 0 | 30 |


In [3]:
# Let’s create a simple predictive model made of a scaler followed by a logistic regression classifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression())
])

In [4]:
# We can evaluate the generalization performance of the model via cross-validation.

from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} ± {scores.std():.3f}")

Accuracy score via cross-validation:
0.800 ± 0.003
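
As a rough illustration of what `cross_validate` returns, the sketch below runs it on synthetic data (an assumption so the snippet is self-contained) and inspects the result. The names `demo_model` and `demo_results` are hypothetical. `cv=5` is passed explicitly here; it matches the default 5-fold strategy of recent scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the adult census dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

demo_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The result is a dict of arrays with one entry per fold.
demo_results = cross_validate(demo_model, X, y, cv=5)
print("Keys:", sorted(demo_results))
print("Number of test scores:", len(demo_results["test_score"]))
```

For a classifier, `test_score` holds the accuracy on each held-out fold unless another `scoring` is requested, which is why the mean and standard deviation of that array summarize the generalization performance.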


We created a model with the **default C value that is equal to 1**. 

To use a different value of `C`, we could have passed it when creating the `LogisticRegression` object, for example `LogisticRegression(C=1e-3)`.
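
In `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` constrains the coefficients more strongly. The sketch below, on synthetic data (an assumption so it runs standalone), compares the coefficient norm for `C=1.0` and `C=1e-3`. The variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the adult census dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Same model family, two different values of the hyperparameter C.
weak_reg = LogisticRegression(C=1.0).fit(X_scaled, y)
strong_reg = LogisticRegression(C=1e-3).fit(X_scaled, y)

# Stronger regularization (smaller C) shrinks the coefficients.
norm_weak = np.linalg.norm(weak_reg.coef_)
norm_strong = np.linalg.norm(strong_reg.coef_)
print(f"Coefficient norm with C=1.0:  {norm_weak:.3f}")
print(f"Coefficient norm with C=1e-3: {norm_strong:.3f}")
```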

In [5]:
# We can also change a parameter of a model after it has been created with the
# set_params method, which is available for all scikit-learn estimators. For
# example, we can set C=1e-3, then fit and evaluate the model:

model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
print(f"Accuracy score via cross-validation:\n"
      f"{scores.mean():.3f} ± {scores.std():.3f}")

Accuracy score via cross-validation:
0.787 ± 0.002


When the model of interest is a Pipeline, the parameter names are of the form `<model_name>__<parameter_name>` (note the double underscore in the middle). 

In our case, `classifier` is the step name given in the `Pipeline` definition and `C` is the parameter name of `LogisticRegression`.
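
The naming rule can be checked with a small sketch. It rebuilds the same two-step pipeline under the hypothetical name `demo_model`, sets `C` through the step name, and shows that a bare `C` is rejected because the `Pipeline` itself has no such parameter.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# "classifier" is the step name, "C" is the parameter of that step.
demo_model.set_params(classifier__C=1e-3)
print("C of the classifier step:", demo_model.named_steps["classifier"].C)

# Without the step prefix, the name does not match any Pipeline parameter,
# so set_params raises a ValueError.
try:
    demo_model.set_params(C=1e-3)
    rejected = False
except ValueError:
    rejected = True
print("Bare 'C' rejected:", rejected)
```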

In [7]:
# In general, you can use the get_params method on scikit-learn models to list 
# all the parameters with their values. 
# For example, if you want to get all the parameter names, you can use:

for parameter in model.get_params():
    print(parameter)

memory
steps
verbose
preprocessor
classifier
preprocessor__copy
preprocessor__with_mean
preprocessor__with_std
classifier__C
classifier__class_weight
classifier__dual
classifier__fit_intercept
classifier__intercept_scaling
classifier__l1_ratio
classifier__max_iter
classifier__multi_class
classifier__n_jobs
classifier__penalty
classifier__random_state
classifier__solver
classifier__tol
classifier__verbose
classifier__warm_start


In [8]:
# If you want to get the value of a single parameter, for example classifier__C, you can use:

model.get_params()['classifier__C']

0.001
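
Since `get_params` returns a plain dictionary, it can be filtered like any other. The following sketch, using a hypothetical `demo_model` with default settings, keeps only the parameters of the `classifier` step and contrasts this with `get_params(deep=False)`, which returns only the constructor arguments of the pipeline itself.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Keep only the parameters that belong to the "classifier" step.
all_params = demo_model.get_params()
classifier_params = {
    name: value
    for name, value in all_params.items()
    if name.startswith("classifier__")
}
print("classifier__C:", classifier_params["classifier__C"])

# deep=False skips the nested parameters of the steps.
shallow_params = demo_model.get_params(deep=False)
print("Shallow parameter names:", sorted(shallow_params))
```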

In [9]:
# We can systematically vary the value of C to see if there is an optimal value.

for C in [1e-3, 1e-2, 1e-1, 1, 10]:
    model.set_params(classifier__C=C)
    cv_results = cross_validate(model, data, target)
    scores = cv_results["test_score"]
    print(f"Accuracy score via cross-validation with C={C}:\n"
          f"{scores.mean():.3f} ± {scores.std():.3f}")

Accuracy score via cross-validation with C=0.001:
0.787 ± 0.002
Accuracy score via cross-validation with C=0.01:
0.799 ± 0.003
Accuracy score via cross-validation with C=0.1:
0.800 ± 0.003
Accuracy score via cross-validation with C=1:
0.800 ± 0.003
Accuracy score via cross-validation with C=10:
0.800 ± 0.003


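We can see that accuracy drops to 0.787 only for the smallest value, C=0.001, where regularization is strongest. From C=0.01 upward it plateaus at about 0.800, so on this dataset the model performs well as long as C is high enough.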
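
The manual search above can be made a little more systematic by storing the mean score for each candidate and picking the best one. The sketch below is one possible way to do this, not the only one. It uses synthetic data (an assumption so it runs on its own), and the names `demo_model`, `mean_scores` and `best_C` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the adult census dataset).
X, y = make_classification(n_samples=300, n_features=4, random_state=0)

demo_model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Five candidates spaced evenly on a log scale from 1e-3 to 10.
mean_scores = {}
for C in np.logspace(-3, 1, num=5):
    demo_model.set_params(classifier__C=C)
    cv_results = cross_validate(demo_model, X, y)
    mean_scores[float(C)] = cv_results["test_score"].mean()

# Select the candidate with the highest mean cross-validation accuracy.
best_C = max(mean_scores, key=mean_scores.get)
for C, score in mean_scores.items():
    print(f"C={C:g}: mean accuracy {score:.3f}")
print(f"Best C among the candidates: {best_C:g}")
```

When several candidates score within one standard deviation of each other, as in the results above, the exact winner should not be over-interpreted.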