In [108]:
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")

columns = ["Body Mass (g)", "Flipper Length (mm)", "Culmen Length (mm)"]
target_name = "Species"

# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

In [109]:
target.value_counts()

Adelie Penguin (Pygoscelis adeliae)          151
Gentoo penguin (Pygoscelis papua)            123
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64

In [110]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

Evaluate the pipeline using stratified 10-fold cross-validation with the balanced-accuracy scoring metric to choose the correct statement in the list below.

You can use:

- sklearn.model_selection.cross_validate to perform the cross-validation routine;
- the integer 10 for the parameter cv of cross_validate to use cross-validation with 10 folds;
- the string "balanced_accuracy" for the parameter scoring of cross_validate.

a) The average cross-validated test balanced accuracy of the above pipeline is between 0.9 and 1.0

b) The average cross-validated test balanced accuracy of the above pipeline is between 0.8 and 0.9

c) The average cross-validated test balanced accuracy of the above pipeline is between 0.5 and 0.8
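As a reminder of what the balanced-accuracy metric measures, here is a minimal pure-Python sketch: it averages the recall obtained on each class, so a model that ignores a minority class is penalized. The labels below are hypothetical and are not taken from the penguins dataset.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Average of the per-class recall."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced labels: 8 samples of one class, 2 of another
y_true = ["Adelie"] * 8 + ["Chinstrap"] * 2
# A model that always predicts the majority class
y_pred = ["Adelie"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, balanced)
```

Plain accuracy looks good here because the majority class dominates, while balanced accuracy drops to the level of a trivial classifier on two classes.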

In [111]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=10, n_jobs=2, scoring="balanced_accuracy")

cv_results

{'fit_time': array([0.00688362, 0.00686026, 0.0050416 , 0.00816345, 0.00452375,
        0.00464201, 0.00452185, 0.00470209, 0.00454974, 0.00466847]),
 'score_time': array([0.00474024, 0.00475216, 0.00435305, 0.00440693, 0.00406003,
        0.00427771, 0.00416422, 0.00430012, 0.00401139, 0.00418925]),
 'test_score': array([1.        , 1.        , 1.        , 0.91880342, 0.88253968,
        0.95238095, 0.97777778, 0.93015873, 0.90793651, 0.95238095])}
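To pick among the statements, the ten test scores have to be averaged. A small sketch using the scores copied from the output above (rounded as displayed, so the result is approximate):

```python
from statistics import mean

# Test scores copied from the cross_validate output above
test_scores = [
    1.0, 1.0, 1.0, 0.91880342, 0.88253968,
    0.95238095, 0.97777778, 0.93015873, 0.90793651, 0.95238095,
]

mean_score = mean(test_scores)
print(f"Mean test balanced accuracy: {mean_score:.3f}")
```

In the notebook itself, cv_results["test_score"].mean() gives the same number directly.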

Repeat the evaluation by setting the parameters in order to select the correct statements in the list below. We recall that you can use model.get_params() to list the parameters of the pipeline and use model.set_params(param_name=param_value) to update them. Remember that one way to compare two models is comparing the cross-validation test scores of both models fold-to-fold, i.e. counting the number of folds where one model has a better test score than the other.

 a) Looking at the individual cross-validation scores, using a model with n_neighbors=5 is substantially better (at least 7 of the cross-validation scores are better) than a model with n_neighbors=51
 
 b) Looking at the individual cross-validation scores, using a model with n_neighbors=5 is substantially better (at least 7 of the cross-validation scores are better) than a model with n_neighbors=101
 
 c) Looking at the individual cross-validation scores, a 5-nearest-neighbors model using a StandardScaler is substantially better (at least 7 of the cross-validation scores are better) than a 5-nearest-neighbors model using the raw features (without scaling).

In [112]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__n_neighbors': (5, 51, 101)
}
# No scoring argument is passed here, so the grid search reports plain accuracy
model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                 n_jobs=2, cv=10)
model_grid_search.fit(data, target)

cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
    "mean_test_score", ascending=False)
cv_results.head()

CPU times: user 59.7 ms, sys: 7.66 ms, total: 67.4 ms
Wall time: 178 ms


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.005189,0.000875,0.003881,0.000287,5,{'classifier__n_neighbors': 5},1.0,1.0,1.0,0.941176,0.911765,0.970588,0.970588,0.941176,0.911765,0.970588,0.961765,0.032353,1
1,0.00476,0.000493,0.004513,0.000607,51,{'classifier__n_neighbors': 51},0.971429,0.971429,1.0,0.911765,0.911765,0.970588,0.941176,0.970588,0.941176,0.970588,0.95605,0.027209,2
2,0.004541,0.000156,0.004381,0.00015,101,{'classifier__n_neighbors': 101},0.914286,0.971429,0.970588,0.911765,0.882353,0.911765,0.882353,0.911765,0.882353,0.941176,0.917983,0.03178,3


We see that 5 and 51 neighbors perform similarly, but always better than without scaling. That makes sense, since this is a distance-based model.
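To illustrate why scaling matters for a distance-based model, here is a minimal sketch with three hypothetical penguins and hypothetical per-feature standard deviations (none of these numbers come from the dataset). Without scaling, body mass in grams dominates the Euclidean distance; after dividing each feature by its spread, flipper length counts as well and the nearest neighbor changes.

```python
import math

# Hypothetical penguins: (body mass in g, flipper length in mm)
a = (3700.0, 190.0)
b = (3750.0, 215.0)  # close to a in mass, far in flipper length
c = (4000.0, 191.0)  # far from a in mass, close in flipper length

# Hypothetical standard deviations of the two features (illustration only)
feature_std = (800.0, 14.0)


def distance(p, q, scale=(1.0, 1.0)):
    """Euclidean distance after dividing each feature difference by `scale`."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(p, q, scale)))


raw_ab, raw_ac = distance(a, b), distance(a, c)
scaled_ab = distance(a, b, feature_std)
scaled_ac = distance(a, c, feature_std)

print(f"raw:    d(a, b) = {raw_ab:.1f}, d(a, c) = {raw_ac:.1f}")
print(f"scaled: d(a, b) = {scaled_ab:.2f}, d(a, c) = {scaled_ac:.2f}")
```

On the raw features b is the nearest neighbor of a, whereas on the scaled features it is c.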

In [113]:
unscaled_model = KNeighborsClassifier(n_neighbors=5)


# No scoring argument is passed here either, so test_score is plain accuracy
cv_results = cross_validate(unscaled_model, data, target, cv=10, n_jobs=2)
cv_results = pd.DataFrame(cv_results)
cv_results

   fit_time  score_time  test_score
0  0.005153    0.005711    0.742857
1  0.002942    0.004028    0.800000
2  0.002485    0.003500    0.794118
3  0.004050    0.003533    0.794118
4  0.002531    0.003486    0.647059
5  0.002597    0.003470    0.764706
6  0.002942    0.003556    0.882353
7  0.002493    0.003502    0.794118
8  0.002493    0.003471    0.911765
9  0.002861    0.003479    0.852941
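The fold-to-fold comparison suggested in the question can be sketched with the per-fold scores recorded in the two outputs above. Two caveats: the numbers are copied as displayed (rounded), and the cells that produced them did not pass a scoring argument, so they are plain accuracy values rather than balanced accuracy. The helper name folds_won is made up for this sketch.

```python
def folds_won(scores_a, scores_b):
    """Number of folds where model A has a strictly better score than model B."""
    return sum(a > b for a, b in zip(scores_a, scores_b))


# Per-fold scores copied from the grid-search output (StandardScaler + k-NN)
knn5_scaled = [1.0, 1.0, 1.0, 0.941176, 0.911765,
               0.970588, 0.970588, 0.941176, 0.911765, 0.970588]
knn51_scaled = [0.971429, 0.971429, 1.0, 0.911765, 0.911765,
                0.970588, 0.941176, 0.970588, 0.941176, 0.970588]
knn101_scaled = [0.914286, 0.971429, 0.970588, 0.911765, 0.882353,
                 0.911765, 0.882353, 0.911765, 0.882353, 0.941176]
# Per-fold scores copied from the unscaled 5-NN output
knn5_raw = [0.742857, 0.8, 0.794118, 0.794118, 0.647059,
            0.764706, 0.882353, 0.794118, 0.911765, 0.852941]

wins_5_vs_51 = folds_won(knn5_scaled, knn51_scaled)
wins_5_vs_101 = folds_won(knn5_scaled, knn101_scaled)
wins_scaled_vs_raw = folds_won(knn5_scaled, knn5_raw)

print(f"5-NN vs 51-NN (scaled):   {wins_5_vs_51} folds out of 10")
print(f"5-NN vs 101-NN (scaled):  {wins_5_vs_101} folds out of 10")
print(f"5-NN scaled vs unscaled:  {wins_scaled_vs_raw} folds out of 10")
```

Ties count for neither model, which is why a fold with identical scores does not contribute to the total.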


The Box-Cox method is a common preprocessing strategy for positive values. The other preprocessors work for any kind of numerical features. If you are curious about the details of those methods, feel free to read about them in the preprocessing chapter of the scikit-learn user guide, but this is not required to answer the quiz questions.

Use sklearn.model_selection.GridSearchCV to study the impact of the choice of the preprocessor and the number of neighbors on the stratified 10-fold cross-validated balanced_accuracy metric. We want to study the n_neighbors in the range [5, 51, 101] and preprocessor in the range all_preprocessors.



Of course, I can build several pipelines, one with each preprocessor, or with no preprocessor.

Does this make sense?

In [114]:
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = np.array(range(0, 11)).reshape(-1, 1)
# Two transformers that differ only in the number of quantiles (and the seed)
qt1 = QuantileTransformer(n_quantiles=10, random_state=0)
qt2 = QuantileTransformer(n_quantiles=3, random_state=42)
qt1.fit_transform(X), qt2.fit_transform(X)

(array([[0. ],
        [0.1],
        [0.2],
        [0.3],
        [0.4],
        [0.5],
        [0.6],
        [0.7],
        [0.8],
        [0.9],
        [1. ]]),
 array([[0. ],
        [0.1],
        [0.2],
        [0.3],
        [0.4],
        [0.5],
        [0.6],
        [0.7],
        [0.8],
        [0.9],
        [1. ]]))

- Transform features using quantile information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
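The idea can be sketched in pure Python by mapping each value to its rank, rescaled to [0, 1]. This is a deliberately simplified illustration that assumes distinct values; it is not how QuantileTransformer is implemented (which interpolates between a fixed number of estimated quantiles). The data and the function name rank_to_uniform are hypothetical.

```python
def rank_to_uniform(values):
    """Map each value to its rank rescaled to [0, 1] (assumes distinct values)."""
    order = sorted(values)
    n = len(values)
    return [order.index(v) / (n - 1) for v in values]


# Hypothetical feature with one large outlier
values = [1.0, 2.0, 3.0, 4.0, 100.0]

uniform = rank_to_uniform(values)

# Min-max scaling for contrast: the outlier squeezes all other values near 0
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

print(uniform)
print([round(v, 3) for v in min_max])
```

With the rank-based mapping the four ordinary values stay evenly spread, while min-max scaling pushes them all below 0.05 because of the single outlier.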
    

In [115]:
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
qt = QuantileTransformer(n_quantiles=10, random_state=0)

X, qt.fit_transform(X)


(array([[-0.13824745],
        [ 0.25568053],
        [ 0.28647607],
        [ 0.31445874],
        [ 0.44871043],
        [ 0.4621607 ],
        [ 0.47419529],
        [ 0.53041875],
        [ 0.53601089],
        [ 0.57826693],
        [ 0.58341858],
        [ 0.6000393 ],
        [ 0.60264963],
        [ 0.61096581],
        [ 0.66340465],
        [ 0.69025943],
        [ 0.71610905],
        [ 0.7375221 ],
        [ 0.7446845 ],
        [ 0.86356838],
        [ 0.87351977],
        [ 0.94101309],
        [ 0.9668895 ],
        [ 1.0602233 ],
        [ 1.06743866]]),
 array([[0.        ],
        [0.09871873],
        [0.10643612],
        [0.11754671],
        [0.21017437],
        [0.21945445],
        [0.23498666],
        [0.32443642],
        [0.33333333],
        [0.41360794],
        [0.42339464],
        [0.46257841],
        [0.47112236],
        [0.49834237],
        [0.59986536],
        [0.63390302],
        [0.66666667],
        [0.68873101],
        [0.69611125],
     

- Box-Cox

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
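For reference, the Box-Cox transform for a fixed parameter lambda can be written in a few lines. This sketch only shows the formula; PowerTransformer additionally estimates lambda from the data for each feature and, by default, standardizes the result. The function name box_cox and the sample inputs are chosen for illustration.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1.0) / lmbda


for value in (1.0, 4.0, 9.0):
    print(value, box_cox(value, 0.5), box_cox(value, 0))
```

The check on x is the reason Box-Cox is described as a strategy for positive values only.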

In [116]:
import numpy as np
from sklearn.preprocessing import PowerTransformer
pt, pt2 = PowerTransformer(), PowerTransformer("box-cox")
data1 = [[1, 2], [3, 2], [4, 5]]
pt.fit(data1)
pt2.fit(data1)

print(pt.transform(data1), pt2.transform(data1))

[[-1.31616039 -0.70710678]
 [ 0.20998268 -0.70710678]
 [ 1.1061777   1.41421356]] [[-1.33269291 -0.70710678]
 [ 0.25653283 -0.70710678]
 [ 1.07616008  1.41421356]]


In [117]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import QuantileTransformer
from sklearn.preprocessing import PowerTransformer

all_preprocessors = [
    None,
    StandardScaler(),
    MinMaxScaler(),
    QuantileTransformer(n_quantiles=100),
    PowerTransformer(method="box-cox"),
]

param_grid = {
    'classifier__n_neighbors': (5, 51, 101)
}

# Columns to display for each preprocessor
shown_columns = ["params", "mean_test_score", "split0_test_score",
                 "split1_test_score", "split2_test_score", "split3_test_score"]

for prep in all_preprocessors:
    print(f" Using preprocessor: {prep}")
    model = Pipeline(steps=[
        ("preprocessor", prep),
        ("classifier", KNeighborsClassifier(n_neighbors=5)),
    ])
    # No scoring argument is passed, so the scores below are plain accuracy
    model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                     n_jobs=2, cv=10)
    model_grid_search.fit(data, target)
    cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
        "mean_test_score", ascending=False)
    print(cv_results.sort_values("param_classifier__n_neighbors").head()[shown_columns])
    print(" ")

 Using preprocessor: None
                             params  mean_test_score  split0_test_score  \
0    {'classifier__n_neighbors': 5}         0.798403           0.742857   
1   {'classifier__n_neighbors': 51}         0.728151           0.742857   
2  {'classifier__n_neighbors': 101}         0.736891           0.742857   

   split1_test_score  split2_test_score  split3_test_score  
0           0.800000           0.794118           0.794118  
1           0.685714           0.735294           0.705882  
2           0.714286           0.705882           0.705882  
 
 Using preprocessor: StandardScaler()
                             params  mean_test_score  split0_test_score  \
0    {'classifier__n_neighbors': 5}         0.961765           1.000000   
1   {'classifier__n_neighbors': 51}         0.956050           0.971429   
2  {'classifier__n_neighbors': 101}         0.917983           0.914286   

   split1_test_score  split2_test_score  split3_test_score  
0           1.000000       


Which of the following statements hold:

 a) Looking at the individual cross-validation scores, the best ranked model using a StandardScaler is substantially better (at least 7 of the cross-validation scores are better) than using any other preprocessor 
 
 b) Using any of the preprocessors always gives a better ranking than using no preprocessor, irrespective of the value of n_neighbors 
 
 c) Looking at the individual cross-validation scores, the model with n_neighbors=5 and StandardScaler is substantially better (at least 7 of the cross-validation scores are better) than the model with n_neighbors=51 and StandardScaler 
 
 d) Looking at the individual cross-validation scores, the model with n_neighbors=51 and StandardScaler is substantially better (at least 7 of the cross-validation scores are better) than the model with n_neighbors=101 and StandardScaler

Select all answers that apply
Hint: pass {"preprocessor": all_preprocessors, "classifier__n_neighbors": [5, 51, 101]} for the param_grid argument to the GridSearchCV class.


In [118]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor": all_preprocessors,
    "classifier__n_neighbors": [5, 51, 101],
}

grid_search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=10,
).fit(data, target)
results = pd.DataFrame(grid_search.cv_results_).sort_values(
    by="rank_test_score", ascending=True
)
# convert the name of the preprocessor for later display
results["param_preprocessor"] = results["param_preprocessor"].apply(
    lambda x: x.__class__.__name__ if x is not None else "None"
)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__n_neighbors,param_preprocessor,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
1,0.004404,0.000249,0.004077,8.4e-05,5,StandardScaler,"{'classifier__n_neighbors': 5, 'preprocessor': StandardScaler()}",1.0,1.0,1.0,0.918803,0.88254,0.952381,0.977778,0.930159,0.907937,0.952381,0.952198,0.039902,1
2,0.00417,0.000137,0.004032,6.8e-05,5,MinMaxScaler,"{'classifier__n_neighbors': 5, 'preprocessor': MinMaxScaler()}",1.0,0.952381,1.0,0.944444,0.88254,0.930159,0.955556,0.952381,0.907937,0.952381,0.947778,0.034268,2
3,0.007212,0.001843,0.006711,0.002439,5,QuantileTransformer,"{'classifier__n_neighbors': 5, 'preprocessor': QuantileTransformer(n_quantiles=100)}",0.952381,0.92674,1.0,0.918803,0.904762,1.0,0.977778,0.930159,0.907937,0.952381,0.947094,0.033797,3
4,0.009083,0.002241,0.00462,0.000834,5,PowerTransformer,"{'classifier__n_neighbors': 5, 'preprocessor': PowerTransformer(method='box-cox')}",1.0,0.977778,1.0,0.863248,0.88254,0.952381,0.955556,0.930159,0.907937,1.0,0.94696,0.047387,4
6,0.004891,0.001747,0.004522,0.000343,51,StandardScaler,"{'classifier__n_neighbors': 51, 'preprocessor': StandardScaler()}",0.952381,0.977778,1.0,0.863248,0.88254,0.952381,0.955556,0.952381,0.930159,0.952381,0.94188,0.038905,5
8,0.005252,0.000111,0.004747,0.000605,51,QuantileTransformer,"{'classifier__n_neighbors': 51, 'preprocessor': QuantileTransformer(n_quantiles=100)}",0.857143,0.952381,1.0,0.863248,0.904762,0.904762,0.977778,0.930159,0.930159,0.952381,0.927277,0.043759,6
9,0.008145,0.000682,0.004722,0.000665,51,PowerTransformer,"{'classifier__n_neighbors': 51, 'preprocessor': PowerTransformer(method='box-cox')}",0.904762,0.977778,1.0,0.863248,0.834921,0.952381,0.907937,0.952381,0.930159,0.904762,0.922833,0.047883,7
7,0.004225,0.000346,0.004331,0.00014,51,MinMaxScaler,"{'classifier__n_neighbors': 51, 'preprocessor': MinMaxScaler()}",0.904762,0.952381,1.0,0.863248,0.834921,0.952381,0.907937,0.952381,0.930159,0.904762,0.920293,0.045516,8
11,0.004206,5.9e-05,0.00475,0.000668,101,StandardScaler,"{'classifier__n_neighbors': 101, 'preprocessor': StandardScaler()}",0.857143,0.952381,0.944444,0.863248,0.834921,0.857143,0.834921,0.88254,0.834921,0.904762,0.876642,0.041618,9
12,0.004105,0.000219,0.004586,0.00021,101,MinMaxScaler,"{'classifier__n_neighbors': 101, 'preprocessor': MinMaxScaler()}",0.857143,0.857143,0.944444,0.863248,0.834921,0.857143,0.765079,0.904762,0.834921,0.904762,0.862357,0.046244,10


In [119]:
cv_score_columns = results.columns[results.columns.str.startswith("split")]
# results is sorted by rank: row 0 is 5-NN with StandardScaler (rank 1),
# row 8 is 101-NN with StandardScaler (rank 9)
reference_model = results.iloc[0][cv_score_columns]
other_model = results.iloc[8][cv_score_columns]
print(
    f"5-NN with StandardScaler is strictly better than 101-NN with StandardScaler for "
    f"{np.sum(reference_model.to_numpy() > other_model.to_numpy())} "
    "CV iterations out of 10."
)

5-NN with StandardScaler is strictly better than 101-NN with StandardScaler for 10 CV iterations out of 10.


## Question 6
Evaluate the generalization performance of the best models found in each fold using nested cross-validation. Set return_estimator=True and cv=10 for the outer loop. The scoring metric must be the balanced-accuracy.

The mean generalization performance is:

 a) better than 0.97

 b) between 0.92 and 0.97

 c) below 0.92

In [120]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocessor": all_preprocessors,
    "classifier__n_neighbors": [5, 51, 101],
}

grid_search = GridSearchCV(
    model,
    param_grid=param_grid,
    scoring="balanced_accuracy",
    cv=10,
).fit(data, target)
results = pd.DataFrame(grid_search.cv_results_).sort_values(
    by="rank_test_score", ascending=True
)
# convert the name of the preprocessor for later display
results["param_preprocessor"] = results["param_preprocessor"].apply(
    lambda x: x.__class__.__name__ if x is not None else "None"
)
# results

In [121]:
# results.sort_values("mean_test_score", ascending = False).groupby("param_classifier__n_neighbors").first()

In [122]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

param_grid = {
    "preprocessor": all_preprocessors,
    "classifier__n_neighbors": [5, 51, 101],
}

model_grid_search = GridSearchCV(
    model, param_grid=param_grid, n_jobs=2, cv=2
)
model_grid_search.fit(data, target)



data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.2, random_state=42
)

model_grid_search.fit(data_train, target_train)
accuracy = model_grid_search.score(data_test, target_test)
print(f"Accuracy on test set: {accuracy:.3f}")

cv_results = cross_validate(
    model_grid_search, data, target, cv=5, n_jobs=2, return_estimator=True
)

cv_results = pd.DataFrame(cv_results)
cv_test_scores = cv_results['test_score']
print(
    "Generalization score with hyperparameters tuning:\n"
    f"{cv_test_scores.mean():.3f} +/- {cv_test_scores.std():.3f}"
)


Accuracy on test set: 0.971
Generalization score with hyperparameters tuning:
0.953 +/- 0.012


In [123]:
cv_results = cross_validate(
    grid_search,
    data,
    target,
    cv=10,
    n_jobs=2,
    scoring="balanced_accuracy",
    return_estimator=True,
)
cv_results = pd.DataFrame(cv_results)
cv_test_scores = cv_results['test_score']

print(
    "Generalization score with hyperparameters tuning:\n"
    f"{cv_test_scores.mean():.3f} +/- {cv_test_scores.std():.3f}"
)

Generalization score with hyperparameters tuning:
0.943 +/- 0.038


## Question 7 (1 point possible)
Explore the set of best parameters that the different grid search models found in each fold of the outer cross-validation. Remember that you can access them with the best_params_ attribute of the estimator. Select all the statements that are true.

 a) The tuned number of nearest neighbors is stable across all folds 
 
 b) The tuned number of nearest neighbors changes often across all folds 
 
 c) The optimal scaler is stable across all folds 
 
 d) The optimal scaler changes often across all folds
Select all answers that apply
Hint: it is important to pass return_estimator=True to the cross_validate function to be able to inspect the trained models saved in the "estimator" field of the CV results. If you forgot to do so for the previous question, please re-run the cross-validation with that option enabled.

In [124]:
cv_results1 = cross_validate(
    grid_search,
    data,
    target,
    cv=10,
    n_jobs=2,
    scoring="balanced_accuracy",
    return_estimator=True,
)
# print(cv_results)
cv_results = pd.DataFrame(cv_results1)
cv_test_scores = cv_results['test_score']

print(
    "Generalization score with hyperparameters tuning:\n"
    f"{cv_test_scores.mean():.3f} +/- {cv_test_scores.std():.3f}"
)

Generalization score with hyperparameters tuning:
0.943 +/- 0.038


In [125]:
pd.set_option('max_colwidth', None)

In [126]:
# cv_results1["estimator"]

In [127]:
for estimator in cv_results1["estimator"]:
    print(estimator.best_params_)

{'classifier__n_neighbors': 5, 'preprocessor': QuantileTransformer(n_quantiles=100)}
{'classifier__n_neighbors': 5, 'preprocessor': QuantileTransformer(n_quantiles=100)}
{'classifier__n_neighbors': 5, 'preprocessor': StandardScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': StandardScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': MinMaxScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': QuantileTransformer(n_quantiles=100)}
{'classifier__n_neighbors': 5, 'preprocessor': MinMaxScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': StandardScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': StandardScaler()}
{'classifier__n_neighbors': 5, 'preprocessor': QuantileTransformer(n_quantiles=100)}
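To judge stability across the outer folds, the best parameters printed above can be tallied. The sketch below hard-codes the ten recorded results (preprocessors are written as class-name strings) rather than reading them from the fitted estimators; in the notebook, the same tally could be built from estimator.best_params_ for each entry of cv_results1["estimator"].

```python
from collections import Counter

# (n_neighbors, preprocessor class name) for each outer fold, copied from above
best_params_per_fold = [
    (5, "QuantileTransformer"),
    (5, "QuantileTransformer"),
    (5, "StandardScaler"),
    (5, "StandardScaler"),
    (5, "MinMaxScaler"),
    (5, "QuantileTransformer"),
    (5, "MinMaxScaler"),
    (5, "StandardScaler"),
    (5, "StandardScaler"),
    (5, "QuantileTransformer"),
]

neighbor_counts = Counter(n for n, _ in best_params_per_fold)
scaler_counts = Counter(name for _, name in best_params_per_fold)

print("n_neighbors:", dict(neighbor_counts))
print("preprocessor:", dict(scaler_counts))
```

The number of neighbors is the same in every fold, while the selected preprocessor varies between three different choices.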

