# 📝 Exercise M3.01

The goal is to write an exhaustive search that finds the combination of
parameters maximizing the model's generalization performance.

Here we use a small subset of the Adult Census dataset to make the code
faster to execute. Once your code works on the small subset, try to
change `train_size` to a larger value (e.g. 0.8 for 80% instead of
20%).
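
To see what `train_size` changes, here is a minimal sketch on a made-up array
of 100 rows. The array and the `toy_*` names are assumptions for illustration,
not the Adult Census data; the sketch only counts how many rows land in each
split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataset (assumption for illustration).
toy_data = np.arange(200).reshape(100, 2)
toy_target = np.arange(100) % 2

split_sizes = {}
for train_size in (0.2, 0.8):
    toy_train, toy_test, _, _ = train_test_split(
        toy_data, toy_target, train_size=train_size, random_state=42)
    # Record (number of training rows, number of testing rows).
    split_sizes[train_size] = (len(toy_train), len(toy_test))

print(split_sizes)  # {0.2: (20, 80), 0.8: (80, 20)}
```

A larger `train_size` gives the search more rows to learn from, at the cost of
a longer run time and a smaller test set.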

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split

adult_census = pd.read_csv("../datasets/adult-census.csv")

target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

data_train, data_test, target_train, target_test = train_test_split(
    data, target, train_size=0.2, random_state=42)

In [2]:
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector as selector
from sklearn.preprocessing import OrdinalEncoder

categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer(
    [('cat_preprocessor', categorical_preprocessor,
      selector(dtype_include=object))],
    remainder='passthrough', sparse_threshold=0)

from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", HistGradientBoostingClassifier(random_state=42))
])
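
The encoder configured above maps each category to an integer and sends any
category not seen during `fit` to `-1`. A minimal sketch of that behavior,
using made-up category names (`cat`, `dog`, `bird` are assumptions, not values
from the dataset):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

toy_encoder = OrdinalEncoder(handle_unknown="use_encoded_value",
                             unknown_value=-1)

# Fit on two known categories; they are sorted, so "cat" -> 0 and "dog" -> 1.
toy_encoder.fit(np.array([["cat"], ["dog"]], dtype=object))

# "bird" was never seen during fit, so it is encoded as -1.
encoded = toy_encoder.transform(np.array([["dog"], ["bird"]], dtype=object))
print(encoded.ravel().tolist())  # [1.0, -1.0]
```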


Using the previously defined model (called `model`) and two nested `for`
loops, search for the best combination of the `learning_rate` and
`max_leaf_nodes` parameters. For each combination, you will need to set the
parameters, then train and test the model. The evaluation of the model should
be performed using `cross_val_score` on the training set. We will use the
following parameter grid:
- `learning_rate` for the values 0.01, 0.1, 1 and 10. This parameter controls
  the ability of a new tree to correct the error of the previous sequence of
  trees.
- `max_leaf_nodes` for the values 3, 10, 30. This parameter caps the number
  of leaves, and thereby the complexity, of each tree.

In [19]:
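from sklearn.model_selection import cross_val_score

# 5-fold cross-validation (the default `cv`) on the full dataset, using
# whatever parameters `model` currently holds.
cross_val_score(model, data, target)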

array([0.79813696, 0.79864879, 0.79842342, 0.79719492, 0.80118755])

In [22]:
# Write your code here.
from sklearn.model_selection import cross_val_score

params = model.get_params()
print("Initial LR and leaf nodes:", params["classifier__learning_rate"],
      params["classifier__max_leaf_nodes"])

learning_rates = [0.01, 0.1, 1, 10]
max_leaf_n = [3, 10, 30]

for lr in learning_rates:
    for n in max_leaf_n:
        model.set_params(classifier__learning_rate=lr,
                         classifier__max_leaf_nodes=n)
        params = model.get_params()
        print(" --> Current LR: ", params["classifier__learning_rate"],
              "and leaf nodes: ", params["classifier__max_leaf_nodes"])
        # Note: this scores on the full dataset; the exercise asks for the
        # training set (`data_train`, `target_train`).
        scores = cross_val_score(model, data, target)
        print(f" Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
        print(" ")

Initial LR and leaf nodes: 10 30
 --> Current LR:  0.01 and leaf nodes:  3
 Accuracy: 0.799 +/- 0.001
 
 --> Current LR:  0.01 and leaf nodes:  10
 Accuracy: 0.820 +/- 0.002
 
 --> Current LR:  0.01 and leaf nodes:  30
 Accuracy: 0.848 +/- 0.002
 
 --> Current LR:  0.1 and leaf nodes:  3
 Accuracy: 0.856 +/- 0.003
 
 --> Current LR:  0.1 and leaf nodes:  10
 Accuracy: 0.870 +/- 0.001
 
 --> Current LR:  0.1 and leaf nodes:  30
 Accuracy: 0.874 +/- 0.002
 
 --> Current LR:  1 and leaf nodes:  3
 Accuracy: 0.870 +/- 0.003
 
 --> Current LR:  1 and leaf nodes:  10
 Accuracy: 0.867 +/- 0.002
 
 --> Current LR:  1 and leaf nodes:  30
 Accuracy: 0.860 +/- 0.005
 
 --> Current LR:  10 and leaf nodes:  3
 Accuracy: 0.281 +/- 0.004
 
 --> Current LR:  10 and leaf nodes:  10
 Accuracy: 0.761 +/- 0.045
 
 --> Current LR:  10 and leaf nodes:  30
 Accuracy: 0.616 +/- 0.179
 



Now use the test set to score the model with the best parameters
found by cross-validation on the training set.

Best parameters found above: `learning_rate` = 0.1 and `max_leaf_nodes` = 30,
with an accuracy of 0.874 +/- 0.002.
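
The choice above can be checked by taking the maximum over the mean accuracies
printed by the search. A minimal sketch, with the values copied by hand from
that output (the `mean_accuracy` dictionary is an illustrative name, not part
of the exercise):

```python
# Mean accuracy for each (learning_rate, max_leaf_nodes) pair, copied from
# the printed search output.
mean_accuracy = {
    (0.01, 3): 0.799, (0.01, 10): 0.820, (0.01, 30): 0.848,
    (0.1, 3): 0.856, (0.1, 10): 0.870, (0.1, 30): 0.874,
    (1, 3): 0.870, (1, 10): 0.867, (1, 30): 0.860,
    (10, 3): 0.281, (10, 10): 0.761, (10, 30): 0.616,
}

# Pick the key whose mean accuracy is the largest.
best_lr, best_mln = max(mean_accuracy, key=mean_accuracy.get)
print(best_lr, best_mln, mean_accuracy[(best_lr, best_mln)])  # 0.1 30 0.874
```

Several neighboring combinations score within a few thousandths of the best
one, so the ranking among them is not very decisive.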

In [26]:
# Write your code here.
from sklearn.model_selection import ShuffleSplit, cross_validate

model.set_params(classifier__learning_rate=0.1, classifier__max_leaf_nodes=30)
params = model.get_params()
print(" --> Current LR: ", params["classifier__learning_rate"],
      "and leaf nodes: ", params["classifier__max_leaf_nodes"])

# Note: this cross-validates on the full dataset with random 75/25 splits
# rather than scoring on the held-out test set (`data_test`, `target_test`).
ss = ShuffleSplit(random_state=0, test_size=0.25)
cross_v = cross_validate(model, data, target, cv=ss)
cross_v["test_score"]

 --> Current LR:  0.1 and leaf nodes:  30


array([0.87535828, 0.87159119, 0.87183687, 0.8727377 , 0.87339284,
       0.87347474, 0.87584964, 0.87953485, 0.87969863, 0.87208255])
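
The array above holds ten scores because `ShuffleSplit` draws 10 random splits
by default (`n_splits=10`). A minimal sketch on a made-up array of 200 rows
(an assumption for illustration) shows the size of each split:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy array standing in for the real data; only its length matters here.
toy_rows = np.zeros((200, 1))

toy_ss = ShuffleSplit(random_state=0, test_size=0.25)
# Each split yields a pair of index arrays: (train indices, test indices).
sizes = [(len(train_idx), len(test_idx))
         for train_idx, test_idx in toy_ss.split(toy_rows)]

print(len(sizes), sizes[0])  # 10 (150, 50)
```

Each score is therefore computed on a random 25% of the rows passed to
`cross_validate`, which is not the same as a single score on `data_test`.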

In [27]:
print(" Their solution is cool too: ")

# solution
from sklearn.model_selection import cross_val_score

learning_rate = [0.01, 0.1, 1, 10]
max_leaf_nodes = [3, 10, 30]

# Track the best mean cross-validation score seen so far and its parameters.
best_score = 0
best_params = {}
for lr in learning_rate:
    for mln in max_leaf_nodes:
        print(f"Evaluating model with learning rate {lr:.3f}"
              f" and max leaf nodes {mln}... ", end="")
        model.set_params(
            classifier__learning_rate=lr,
            classifier__max_leaf_nodes=mln
        )
        # Evaluate on the training set only, with 2-fold cross-validation.
        scores = cross_val_score(model, data_train, target_train, cv=2)
        mean_score = scores.mean()
        print(f"score: {mean_score:.3f}")
        if mean_score > best_score:
            best_score = mean_score
            best_params = {'learning-rate': lr, 'max leaf nodes': mln}
            print(f"Found new best model with score {best_score:.3f}!")

print(f"The best accuracy obtained is {best_score:.3f}")
print(f"The best parameters found are:\n {best_params}")

# solution
best_lr = best_params['learning-rate']
best_mln = best_params['max leaf nodes']

# Refit on the whole training set with the best parameters, then score once
# on the held-out test set.
model.set_params(classifier__learning_rate=best_lr,
                 classifier__max_leaf_nodes=best_mln)
model.fit(data_train, target_train)
test_score = model.score(data_test, target_test)

print(f"Test score after the parameter tuning: {test_score:.3f}")

 Their solution is cool too: 
Evaluating model with learning rate 0.010 and max leaf nodes 3... score: 0.789
Found new best model with score 0.789!
Evaluating model with learning rate 0.010 and max leaf nodes 10... score: 0.813
Found new best model with score 0.813!
Evaluating model with learning rate 0.010 and max leaf nodes 30... score: 0.842
Found new best model with score 0.842!
Evaluating model with learning rate 0.100 and max leaf nodes 3... score: 0.847
Found new best model with score 0.847!
Evaluating model with learning rate 0.100 and max leaf nodes 10... score: 0.859
Found new best model with score 0.859!
Evaluating model with learning rate 0.100 and max leaf nodes 30... score: 0.857
Evaluating model with learning rate 1.000 and max leaf nodes 3... score: 0.852
Evaluating model with learning rate 1.000 and max leaf nodes 10... score: 0.833
Evaluating model with learning rate 1.000 and max leaf nodes 30... score: 0.828
Evaluating model with learning rate 10.000 and max leaf node