# 📝 Exercise M3.01

The goal is to write an exhaustive search that finds the combination of
hyperparameters maximizing the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code faster
to execute. Once your code works on the small subset, try changing
`train_size` to a larger value (e.g. 0.8 for 80% instead of 20%).

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42
)

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1
)
preprocessor = ColumnTransformer(
    [
        (
            "cat_preprocessor",
            categorical_preprocessor,
            selector(dtype_include=object),
        )
    ],
    remainder="passthrough",
)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline(
    [
        ("preprocessor", preprocessor),
        ("classifier", HistGradientBoostingClassifier(random_state=42)),
    ]
)

Use the previously defined model (called `model`) and two nested `for` loops
to search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, set the parameters on the
model, then evaluate it using `cross_val_score` on the training set. Use the
following parameter values:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10 and 30. This parameter caps the number
  of leaves in each tree, and thereby limits how deep each tree can grow.
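
The search space described above can be enumerated before any model is
trained. The sketch below is a minimal illustration, assuming the pipeline
step is named `classifier` as in the cell above, so that its parameters are
addressed with the `<step>__<parameter>` convention. The names
`learning_rates`, `max_leaf_nodes_values` and `param_grid` are introduced here
for illustration only.

```python
from itertools import product

# Candidate values taken from the exercise statement.
learning_rates = [0.01, 0.1, 1, 10]
max_leaf_nodes_values = [3, 10, 30]

# product() iterates in the same order as two nested for loops:
# the last iterable (max_leaf_nodes) varies fastest.
param_grid = [
    {"classifier__learning_rate": lr, "classifier__max_leaf_nodes": mln}
    for lr, mln in product(learning_rates, max_leaf_nodes_values)
]

print(len(param_grid))  # 4 learning rates x 3 leaf limits = 12 candidates
print(param_grid[0])
```

Each dictionary can then be applied to the pipeline with
`model.set_params(**params)` before calling `cross_val_score`.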

In [16]:
from sklearn.model_selection import cross_val_score

learning_rates = [0.01, 0.1, 1, 10]
max_leaf_nodes_values = [3, 10, 30]

# Track the best mean cross-validated accuracy and its parameters.
best_score = 0
best_lr = None
best_mln = None

for lr in learning_rates:
    for mln in max_leaf_nodes_values:
        # Parameters of a pipeline step are addressed as <step>__<parameter>.
        model.set_params(
            classifier__learning_rate=lr, classifier__max_leaf_nodes=mln
        )
        # Default 5-fold cross-validation on the training set only.
        scores = cross_val_score(model, data_train, target_train)
        mean_score = scores.mean()
        if mean_score > best_score:
            best_score = mean_score
            best_lr = lr
            best_mln = mln
        print(
            f"Avg. Accuracy for lr = {lr:>5} and max leaf nodes = {mln:>5} "
            f"is {mean_score:.3f}"
        )

Avg. Accuracy for lr =  0.01 and max leaf nodes =     3 is 0.790
Avg. Accuracy for lr =  0.01 and max leaf nodes =    10 is 0.814
Avg. Accuracy for lr =  0.01 and max leaf nodes =    30 is 0.842
Avg. Accuracy for lr =   0.1 and max leaf nodes =     3 is 0.849
Avg. Accuracy for lr =   0.1 and max leaf nodes =    10 is 0.863
Avg. Accuracy for lr =   0.1 and max leaf nodes =    30 is 0.861
Avg. Accuracy for lr =     1 and max leaf nodes =     3 is 0.852
Avg. Accuracy for lr =     1 and max leaf nodes =    10 is 0.853
Avg. Accuracy for lr =     1 and max leaf nodes =    30 is 0.839
Avg. Accuracy for lr =    10 and max leaf nodes =     3 is 0.288
Avg. Accuracy for lr =    10 and max leaf nodes =    10 is 0.282
Avg. Accuracy for lr =    10 and max leaf nodes =    30 is 0.540
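
The printed scores can be checked independently of the loop. The snippet below
is a small sketch that copies the mean accuracies from the output above into a
dictionary (the name `mean_scores` is hypothetical) and picks the entry with
the highest value.

```python
# Mean cross-validated accuracies copied from the output above,
# keyed by (learning_rate, max_leaf_nodes).
mean_scores = {
    (0.01, 3): 0.790, (0.01, 10): 0.814, (0.01, 30): 0.842,
    (0.1, 3): 0.849, (0.1, 10): 0.863, (0.1, 30): 0.861,
    (1, 3): 0.852, (1, 10): 0.853, (1, 30): 0.839,
    (10, 3): 0.288, (10, 10): 0.282, (10, 30): 0.540,
}

# max() with key=mean_scores.get returns the key with the highest score.
best_lr, best_mln = max(mean_scores, key=mean_scores.get)
print(best_lr, best_mln, mean_scores[(best_lr, best_mln)])
```

With these results, the best combination is `learning_rate=0.1` with
`max_leaf_nodes=10` (0.863), only marginally ahead of `max_leaf_nodes=30`
(0.861), while a learning rate of 10 degrades accuracy sharply.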


Now use the test set to score the model using the best parameters that we
found using cross-validation. You will have to refit the model over the full
training set.

In [17]:
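from sklearn.model_selection import cross_validate

# Set the best hyperparameters found by the search above.
model.set_params(
    classifier__learning_rate=best_lr, classifier__max_leaf_nodes=best_mln
)

# Cross-validate the tuned model on the full dataset; each fold refits the
# model from scratch, so the held-out test set is not scored directly here.
cv_results = cross_validate(model, data, target, return_train_score=True)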

train_scores = cv_results["train_score"]
test_scores = cv_results["test_score"]

print(
    "The mean cross-validation train accuracy is: "
    f"{train_scores.mean():.3f} ± {train_scores.std():.3f} "
)
print(
    "The mean cross-validation test accuracy is: "
    f"{test_scores.mean():.3f} ± {test_scores.std():.3f} "
)

The mean cross-validation train accuracy is: 0.873 ± 0.001 
The mean cross-validation test accuracy is: 0.870 ± 0.001 
