# 📝 Exercise M3.01

The goal is to write an exhaustive search to find the combination of
parameters that maximizes the model's statistical performance.

Here we use a small subset of the Adult Census dataset to make the code
fast to execute. Once your code works on the small subset, try
changing `train_size` to a larger value (e.g. 0.8 for 80% instead of
20%).
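To make the effect of `train_size` concrete, here is a minimal sketch on stand-in data. The 100-element list is a hypothetical placeholder for the census table, not part of the exercise.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples instead of the Adult Census table.
samples = list(range(100))

# train_size=0.2 keeps 20% of the samples for training, the rest for testing.
small_train, small_test = train_test_split(
    samples, train_size=0.2, random_state=42)

# train_size=0.8 keeps 80% of the samples for training.
large_train, large_test = train_test_split(
    samples, train_size=0.8, random_state=42)

print(len(small_train), len(small_test))
print(len(large_train), len(large_test))
```

A larger training set usually makes each fit slower, which is why the exercise starts with the small subset.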

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer(
    [('cat-preprocessor', categorical_preprocessor,
      selector(dtype_include=object))],
    remainder='passthrough', sparse_threshold=0)

# This import is only required with scikit-learn < 1.0, where
# HistGradientBoostingClassifier was still experimental
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(random_state=42))
])

In [3]:
from sklearn import set_config
set_config(display='diagram')

In [4]:
model


Use the previously defined model (called `model`) and two nested `for`
loops to search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model, then train and test it. The evaluation of the model should be
performed using `cross_val_score`. Search over the following parameter
values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter caps the number
  of leaves in each tree, which indirectly limits its depth.

In [24]:
# Write your code here.
model.get_params()['classifier__learning_rate']

0.01

In [25]:
model.get_params()['classifier__max_leaf_nodes']

30

In [26]:
learning_rate = [0.01, 0.1, 1, 10]
max_leaf_nodes = [3, 10, 30]

In [27]:
results = {
    'lr': [],
    'nodes': [],
    'test_mean': [],
    'test_std': [],
}

In [28]:
results

{'lr': [], 'nodes': [], 'test_mean': [], 'test_std': []}

In [29]:
from sklearn.model_selection import cross_validate

In [30]:
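# Note: SCORERS was removed in scikit-learn 1.3; newer versions expose
# sklearn.metrics.get_scorer_names() instead.
from sklearn.metrics import SCORERS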

In [31]:
# sorted(SCORERS.keys())

In [32]:
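for lr in learning_rate:
    for node in max_leaf_nodes:
        # Set both hyperparameters on the pipeline's "classifier" step.
        model.set_params(classifier__learning_rate=lr,
                         classifier__max_leaf_nodes=node)
        # Cross-validate with the default splitter. Note that this evaluates
        # on the full dataset; pass data_train and target_train instead to
        # search on the small subset defined above.
        cv_results = cross_validate(model, data, target)
        scores = cv_results["test_score"]
        # Record the combination together with its mean and std test score.
        results['lr'].append(lr)
        results['nodes'].append(node)
        results['test_mean'].append(scores.mean())
        results['test_std'].append(scores.std())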
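
The exercise text asks for `cross_val_score`, while the loop above uses `cross_validate`. As a minimal sketch, assuming synthetic data and a small decision tree chosen only for speed, the two functions return the same test scores when given the same estimator and folds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a fast estimator, both assumptions of this sketch.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
estimator = DecisionTreeClassifier(max_depth=3, random_state=0)

# cross_val_score returns the array of test scores directly.
scores_a = cross_val_score(estimator, X, y, cv=5)

# cross_validate returns a dict; the same scores live under "test_score".
scores_b = cross_validate(estimator, X, y, cv=5)["test_score"]

same_scores = np.allclose(scores_a, scores_b)
print(same_scores)
```

`cross_validate` additionally reports fit and score times, which is not needed for this exercise.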

In [33]:
results

{'lr': [0.01, 0.01, 0.01, 0.1, 0.1, 0.1, 1, 1, 1, 10, 10, 10],
 'nodes': [3, 10, 30, 3, 10, 30, 3, 10, 30, 3, 10, 30],
 'test_mean': [0.7987183295300809,
  0.8202572043190326,
  0.8482454338964729,
  0.8564146296232487,
  0.8703574876954956,
  0.8738995698254579,
  0.8698047187324492,
  0.8665083892461312,
  0.8598541773610868,
  0.28072167729758074,
  0.7614564750402605,
  0.6162475795293892],
 'test_std': [0.0013302343640772096,
  0.0018418157171543748,
  0.001751013395103816,
  0.0027885492881534767,
  0.001265588638684552,
  0.0022484274499671004,
  0.0030594344189543527,
  0.002458147489131834,
  0.0045806743262617215,
  0.003960326670561044,
  0.04505152811468116,
  0.17850571211553154]}
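
One way to read the best combination off these results is sketched below in plain Python. The values are a copy of the output above rounded to four decimals; the variable names are chosen for this illustration.

```python
# Rounded copy of the cross-validation results printed above.
results_rounded = {
    "lr": [0.01, 0.01, 0.01, 0.1, 0.1, 0.1, 1, 1, 1, 10, 10, 10],
    "nodes": [3, 10, 30, 3, 10, 30, 3, 10, 30, 3, 10, 30],
    "test_mean": [0.7987, 0.8203, 0.8482, 0.8564, 0.8704, 0.8739,
                  0.8698, 0.8665, 0.8599, 0.2807, 0.7615, 0.6162],
}

# Pair each combination with its mean test score.
rows = list(zip(results_rounded["lr"],
                results_rounded["nodes"],
                results_rounded["test_mean"]))

# Keep the row with the highest mean test score.
best_lr, best_nodes, best_mean = max(rows, key=lambda row: row[2])

print(best_lr, best_nodes, best_mean)
```

On these numbers the highest mean score belongs to `learning_rate=0.1` with `max_leaf_nodes=30`, while `learning_rate=10` performs far worse and with a much larger spread across folds.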

In [38]:
import pandas as pd
df = pd.DataFrame(results)

In [39]:
df

,lr,nodes,test_mean,test_std
0,0.01,3,0.798718,0.00133
1,0.01,10,0.820257,0.001842
2,0.01,30,0.848245,0.001751
3,0.1,3,0.856415,0.002789
4,0.1,10,0.870357,0.001266
5,0.1,30,0.8739,0.002248
6,1.0,3,0.869805,0.003059
7,1.0,10,0.866508,0.002458
8,1.0,30,0.859854,0.004581
9,10.0,3,0.280722,0.00396
