# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the combination of
parameters that maximizes the model's statistical performance.

Here we use a small subset of the Adult Census dataset to make the code
fast to execute. Once your code works on the small subset, try to
change `train_size` to a larger value (e.g. 0.8 for 80% instead of
20%).

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In [9]:
len(data), len(data_train), len(data_test)

(48842, 9768, 39074)

In [10]:
target.value_counts()

 <=50K    37155
 >50K     11687
Name: class, dtype: int64
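The class counts above are imbalanced, which gives a useful reference point for reading the accuracy and balanced-accuracy scores reported later. The following is a minimal sketch, not part of the original exercise: it computes the scores of a hypothetical baseline that always predicts the majority class, using only the counts printed above (labels are written without their leading space).

```python
# Class counts copied from the `value_counts()` output above.
class_counts = {"<=50K": 37155, ">50K": 11687}
n_samples = sum(class_counts.values())

# Accuracy of a hypothetical baseline that always predicts the majority class.
majority_class = max(class_counts, key=class_counts.get)
majority_accuracy = class_counts[majority_class] / n_samples

# Balanced accuracy averages the recall of each class: this baseline has a
# recall of 1 on the majority class and 0 on the other class.
per_class_recall = [
    1.0 if name == majority_class else 0.0 for name in class_counts
]
majority_balanced_accuracy = sum(per_class_recall) / len(per_class_recall)

print(f"always predicting {majority_class!r}: "
      f"accuracy={majority_accuracy:.3f}, "
      f"balanced accuracy={majority_balanced_accuracy:.3f}")
```

A model is only informative if it scores above these baseline values on the corresponding metric.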

In [3]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer(
    [('cat-preprocessor', categorical_preprocessor,
      selector(dtype_include=object))],
    remainder='passthrough', sparse_threshold=0)

# This import is required on scikit-learn < 1.0 to enable
# HistGradientBoostingClassifier; newer versions no longer need it.
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(random_state=42))
])


Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model, then train and test it. Evaluate the model with `cross_val_score`.
We will search over the following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter limits the number
  of leaves of each tree, and therefore how deep and complex each tree can
  grow.

In [17]:
from sklearn.model_selection import cross_val_score

learning_rates = [0.01, 0.1, 1, 10]
max_leaf_nodes = [3, 10, 30]

for lr in learning_rates:
    for mln in max_leaf_nodes:
        model.set_params(
            classifier__learning_rate=lr, classifier__max_leaf_nodes=mln)
        score = cross_val_score(model, data, target)
        print(f"lr: {lr:5.2f}, mln: {mln:2d}, "
              f"scoremean: {score.mean():.3f}, scorestd: {score.std():.3f}")

lr:  0.01, mln:  3, scoremean: 0.799, scorestd: 0.001
lr:  0.01, mln: 10, scoremean: 0.820, scorestd: 0.002
lr:  0.01, mln: 30, scoremean: 0.848, scorestd: 0.002
lr:  0.10, mln:  3, scoremean: 0.856, scorestd: 0.003
lr:  0.10, mln: 10, scoremean: 0.870, scorestd: 0.001
lr:  0.10, mln: 30, scoremean: 0.874, scorestd: 0.002
lr:  1.00, mln:  3, scoremean: 0.870, scorestd: 0.003
lr:  1.00, mln: 10, scoremean: 0.867, scorestd: 0.002
lr:  1.00, mln: 30, scoremean: 0.860, scorestd: 0.005
lr: 10.00, mln:  3, scoremean: 0.281, scorestd: 0.004
lr: 10.00, mln: 10, scoremean: 0.761, scorestd: 0.045
lr: 10.00, mln: 30, scoremean: 0.616, scorestd: 0.179


In [18]:
for lr in learning_rates:
    for mln in max_leaf_nodes:
        model.set_params(
            classifier__learning_rate=lr, classifier__max_leaf_nodes=mln)
        score = cross_val_score(
            model, data, target, scoring="balanced_accuracy")
        print(f"lr: {lr:5.2f}, mln: {mln:2d}, "
              f"scoremean: {score.mean():.3f}, scorestd: {score.std():.3f}")

lr:  0.01, mln:  3, scoremean: 0.580, scorestd: 0.003
lr:  0.01, mln: 10, scoremean: 0.628, scorestd: 0.004
lr:  0.01, mln: 30, scoremean: 0.699, scorestd: 0.004
lr:  0.10, mln:  3, scoremean: 0.744, scorestd: 0.006
lr:  0.10, mln: 10, scoremean: 0.786, scorestd: 0.004
lr:  0.10, mln: 30, scoremean: 0.797, scorestd: 0.005
lr:  1.00, mln:  3, scoremean: 0.790, scorestd: 0.006
lr:  1.00, mln: 10, scoremean: 0.790, scorestd: 0.007
lr:  1.00, mln: 30, scoremean: 0.788, scorestd: 0.006
lr: 10.00, mln:  3, scoremean: 0.268, scorestd: 0.004
lr: 10.00, mln: 10, scoremean: 0.577, scorestd: 0.042
lr: 10.00, mln: 30, scoremean: 0.546, scorestd: 0.173


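Best parameter combination (`learning_rate`, `max_leaf_nodes`): with both the
default accuracy and the balanced accuracy, `learning_rate=0.1` with
`max_leaf_nodes=30` gives the highest mean score. The balanced-accuracy
result for this combination is:

lr:  0.10, mln: 30, scoremean: 0.797, scorestd: 0.005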
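Reading the best combination off the printed output works for twelve rows, but the selection can also be done in code. The sketch below is an illustration rather than part of the original solution: it pairs each combination with the mean balanced-accuracy values copied from the output above and picks the maximum. In a real search, the scores would be collected inside the nested loops instead of being hard-coded; the names `reported_means` and `results` are introduced here for illustration.

```python
from itertools import product

learning_rates = [0.01, 0.1, 1, 10]
max_leaf_nodes = [3, 10, 30]

# Mean balanced-accuracy scores copied from the output above, in the same
# order as the nested loops (learning rate outer, leaf nodes inner).
reported_means = [
    0.580, 0.628, 0.699,
    0.744, 0.786, 0.797,
    0.790, 0.790, 0.788,
    0.268, 0.577, 0.546,
]

# Pair each (learning_rate, max_leaf_nodes) combination with its score.
results = dict(zip(product(learning_rates, max_leaf_nodes), reported_means))

# Select the combination with the highest mean score.
best_params, best_score = max(results.items(), key=lambda item: item[1])
print(f"best learning_rate={best_params[0]}, "
      f"best max_leaf_nodes={best_params[1]}, score={best_score:.3f}")
```

Keeping the scores in a dictionary keyed by the parameter tuple also makes it easy to sort or compare combinations afterwards.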