In [1]:
import pandas as pd

penguins = pd.read_csv("../datasets/penguins.csv")

columns = ["Body Mass (g)", "Flipper Length (mm)", "Culmen Length (mm)"]
target_name = "Species"

# Remove lines with missing values for the columns of interest
penguins_non_missing = penguins[columns + [target_name]].dropna()

data = penguins_non_missing[columns]
target = penguins_non_missing[target_name]

In [4]:
target.value_counts()

Adelie Penguin (Pygoscelis adeliae)          151
Gentoo penguin (Pygoscelis papua)            123
Chinstrap penguin (Pygoscelis antarctica)     68
Name: Species, dtype: int64

In [6]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

Evaluate the pipeline using stratified 10-fold cross-validation with the balanced-accuracy scoring metric, then choose the correct statement in the list below.

You can use:

sklearn.model_selection.cross_validate to perform the cross-validation routine;
the integer 10 as the cv parameter of cross_validate to run 10-fold cross-validation (stratified by default for a classifier);
the string "balanced_accuracy" as the scoring parameter of cross_validate.

a) The average cross-validated test balanced accuracy of the above pipeline is between 0.9 and 1.0

b) The average cross-validated test balanced accuracy of the above pipeline is between 0.8 and 0.9

c) The average cross-validated test balanced accuracy of the above pipeline is between 0.5 and 0.8

In [7]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target, cv=10, n_jobs=2, scoring="balanced_accuracy")

cv_results

{'fit_time': array([0.00473571, 0.00671792, 0.00407672, 0.00412583, 0.00370669,
        0.00384164, 0.00391388, 0.00363302, 0.00362539, 0.0036366 ]),
 'score_time': array([0.00366426, 0.00392079, 0.003793  , 0.00353646, 0.00354743,
        0.00359511, 0.0035336 , 0.00342464, 0.00345469, 0.00377893]),
 'test_score': array([1.        , 1.        , 1.        , 0.91880342, 0.88253968,
        0.95238095, 0.97777778, 0.93015873, 0.90793651, 0.95238095])}
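The question asks for the average test score, which the cell above does not compute. A minimal sketch, assuming the ten `test_score` values are copied by hand from the printed output (in the notebook itself, `cv_results["test_score"].mean()` would give the same number):

```python
from statistics import mean

# Per-fold balanced-accuracy scores copied from the cross_validate output above.
test_scores = [
    1.0, 1.0, 1.0, 0.91880342, 0.88253968,
    0.95238095, 0.97777778, 0.93015873, 0.90793651, 0.95238095,
]

mean_score = mean(test_scores)
print(f"Mean test balanced accuracy: {mean_score:.3f}")
```

The mean falls between 0.9 and 1.0, which corresponds to statement a).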

Repeat the evaluation with different parameter values to select the correct statements in the list below. Recall that model.get_params() lists the parameters of the pipeline and model.set_params(param_name=param_value) updates them. One way to compare two models is to compare their cross-validation test scores fold by fold, i.e. to count the number of folds in which one model has a better test score than the other.

 a) Looking at the individual cross-validation scores, a model with n_neighbors=5 is substantially better (at least 7 of the cross-validation scores are better) than a model with n_neighbors=51

 b) Looking at the individual cross-validation scores, a model with n_neighbors=5 is substantially better (at least 7 of the cross-validation scores are better) than a model with n_neighbors=101

 c) Looking at the individual cross-validation scores, a 5-nearest-neighbors model using a StandardScaler is substantially better (at least 7 of the cross-validation scores are better) than a 5-nearest-neighbors model using the raw features (without scaling).
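As a rough illustration of the get_params / set_params mechanics mentioned above, the sketch below rebuilds the same pipeline and changes one nested parameter. Replacing the scaler with the string "passthrough" is one possible way to disable scaling for statement c); it is an assumption of this sketch, not necessarily what the notebook author did.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline(steps=[
    ("preprocessor", StandardScaler()),
    ("classifier", KNeighborsClassifier(n_neighbors=5)),
])

# Nested parameters are addressed as "<step name>__<parameter name>".
neighbor_params = [name for name in model.get_params() if name.endswith("n_neighbors")]
print(neighbor_params)

# Change the number of neighbors in place; the pipeline object is reused.
model.set_params(classifier__n_neighbors=51)
n_neighbors_after = model.get_params()["classifier__n_neighbors"]

# Assumption: skip the scaling step by replacing it with "passthrough".
model.set_params(preprocessor="passthrough")
```

After each call to set_params, the pipeline can be passed to cross_validate again with the same cv and scoring arguments, so that the folds are comparable.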

In [21]:
%%time
from sklearn.model_selection import GridSearchCV

param_grid = {
    'classifier__n_neighbors': (5, 51, 101)
}
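# Note: no `scoring` is passed, so the grid search uses the classifier's
# default metric (accuracy), not "balanced_accuracy" as in the cell above.
model_grid_search = GridSearchCV(model, param_grid=param_grid,
                                 n_jobs=2, cv=10)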
model_grid_search.fit(data, target)

cv_results = pd.DataFrame(model_grid_search.cv_results_).sort_values(
    "mean_test_score", ascending=False)
cv_results.head()

CPU times: user 38.2 ms, sys: 0 ns, total: 38.2 ms
Wall time: 137 ms


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier__n_neighbors,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,split5_test_score,split6_test_score,split7_test_score,split8_test_score,split9_test_score,mean_test_score,std_test_score,rank_test_score
0,0.003848,0.000449,0.003113,0.00029,5,{'classifier__n_neighbors': 5},1.0,1.0,1.0,0.941176,0.911765,0.970588,0.970588,0.941176,0.911765,0.970588,0.961765,0.032353,1
1,0.004083,0.000972,0.003288,0.000137,51,{'classifier__n_neighbors': 51},0.971429,0.971429,1.0,0.911765,0.911765,0.970588,0.941176,0.970588,0.941176,0.970588,0.95605,0.027209,2
2,0.00385,0.000792,0.003496,0.000131,101,{'classifier__n_neighbors': 101},0.914286,0.971429,0.970588,0.911765,0.882353,0.911765,0.882353,0.911765,0.882353,0.941176,0.917983,0.03178,3


We see that n_neighbors=5 and n_neighbors=51 perform similarly, and both score at least as well as the unscaled model on every fold. This makes sense, since k-nearest neighbors is a distance-based model.

In [20]:
unscaled_model = KNeighborsClassifier(n_neighbors=5)

# Note: no `scoring` is passed, so test_score is the default accuracy.
cv_results = cross_validate(unscaled_model, data, target, cv=10, n_jobs=2)
cv_results = pd.DataFrame(cv_results)
cv_results

Unnamed: 0,fit_time,score_time,test_score
0,0.003432,0.003328,0.742857
1,0.003489,0.003238,0.8
2,0.002327,0.002955,0.794118
3,0.002425,0.003005,0.794118
4,0.002295,0.002899,0.647059
5,0.002113,0.002912,0.764706
6,0.002247,0.003214,0.882353
7,0.002121,0.002936,0.794118
8,0.002106,0.002814,0.911765
9,0.002066,0.00291,0.852941
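The fold-to-fold comparison asked for in the three statements can be sketched from the per-fold scores printed above. The values below are copied by hand from the two tables, and `count_folds_better` is a hypothetical helper name. Note that these scores are plain accuracy, because neither cell passed `scoring="balanced_accuracy"`, so the counts under balanced accuracy may differ.

```python
def count_folds_better(scores_a, scores_b):
    """Count the folds where model A scores strictly higher than model B."""
    return sum(a > b for a, b in zip(scores_a, scores_b))


# split0..split9 test scores from the grid-search table (scaled pipeline).
scaled_k5 = [1.0, 1.0, 1.0, 0.941176, 0.911765,
             0.970588, 0.970588, 0.941176, 0.911765, 0.970588]
scaled_k51 = [0.971429, 0.971429, 1.0, 0.911765, 0.911765,
              0.970588, 0.941176, 0.970588, 0.941176, 0.970588]
scaled_k101 = [0.914286, 0.971429, 0.970588, 0.911765, 0.882353,
               0.911765, 0.882353, 0.911765, 0.882353, 0.941176]
# test_score column from the unscaled KNeighborsClassifier(n_neighbors=5) run.
unscaled_k5 = [0.742857, 0.8, 0.794118, 0.794118, 0.647059,
               0.764706, 0.882353, 0.794118, 0.911765, 0.852941]

better_than_51 = count_folds_better(scaled_k5, scaled_k51)
better_than_101 = count_folds_better(scaled_k5, scaled_k101)
better_than_unscaled = count_folds_better(scaled_k5, unscaled_k5)

print(f"k=5 beats k=51 on {better_than_51} of 10 folds")
print(f"k=5 beats k=101 on {better_than_101} of 10 folds")
print(f"scaled k=5 beats unscaled k=5 on {better_than_unscaled} of 10 folds")
```

On these accuracy scores, n_neighbors=5 wins 4 folds against n_neighbors=51 (below the threshold of 7), all 10 folds against n_neighbors=101, and 9 folds against the unscaled model (the remaining fold is a tie), which supports statements b) and c) but not a).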
