# Feature Selection
Enrique Juliá Arévalo, Sara Verde Camacho, Leo Pérez Peña

We begin by loading the required packages:

In [60]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif
import sys

**In this practical, you will become familiarized with some basic feature selection methods implemented in scikit-learn. Consider the prostate dataset that is attached to this practical. You are asked to:**

Load the dataset:

In [61]:
data = pd.read_csv('prostate.csv')
data.head()

Unnamed: 0,100_g_at,1000_at,1001_at,1002_f_at,1003_s_at,1004_at,1005_at,1006_at,1007_s_at,1008_f_at,...,AFFX-ThrX-5_at,AFFX-ThrX-M_at,AFFX-TrpnX-3_at,AFFX-TrpnX-5_at,AFFX-TrpnX-M_at,AFFX-YEL002c/WBP1_at,AFFX-YEL018w/_at,AFFX-YEL021w/URA3_at,AFFX-YEL024w/RIP1_at,Y
0,6.92746,7.391657,3.812922,3.453385,6.070151,5.527153,5.812353,3.167275,7.354981,9.419909,...,3.770583,2.884436,2.730025,3.126168,2.870161,3.08221,2.747289,3.226588,3.480196,0
1,7.222432,7.32905,3.958028,3.407226,5.921265,5.376464,7.303408,3.108708,7.391872,10.539579,...,3.190759,2.460119,2.696578,2.675271,2.940032,3.126269,3.013745,3.517859,3.428752,1
2,6.776402,7.664007,3.783702,3.152019,5.452293,5.111794,7.207638,3.07736,7.488371,6.833428,...,3.325183,2.603014,2.469759,2.615746,2.510172,2.730814,2.613696,2.823436,3.049716,0
3,6.919134,7.469634,4.004581,3.34117,6.070925,5.296108,8.744059,3.117104,7.203028,10.400557,...,3.625057,2.765521,2.681757,3.310741,3.197177,3.414182,3.193867,3.353537,3.567482,0
4,7.113561,7.322408,4.242724,3.489324,6.141657,5.62839,6.82537,3.794904,7.403024,10.240322,...,3.698067,3.026876,2.69167,3.23603,3.003906,3.081497,2.963307,3.47205,3.598103,1


1. Estimate the performance of the nearest neighbor classifier on this dataset using 10-times 10-fold cross validation when all the features are used for prediction. The number of neighbors should be chosen using an inner cross-validation procedure. You can use 5-fold cross validation for this.

In [62]:
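# Features: every column except the last one; the last column (Y) is the class label.
# Note: this slice also keeps the 'Unnamed: 0' row-index column as a feature.
X = np.array(data.iloc[:, :-1]).astype(float)
y = np.array(data.iloc[:, -1]).astype(int)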

In [63]:
param_grid = {'knn__n_neighbors': np.arange(1, 21)}

pipe = Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('knn', KNeighborsClassifier())
])


# Nested cross-validation
# - Outer loop: 10-fold CV (a single repetition, not the 10 repetitions the task asks for)
# - Inner loop: 5-fold CV to tune k

out_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
in_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

clf = GridSearchCV(pipe, param_grid=param_grid, cv=in_cv)


nested_scores = cross_val_score(clf, X, y, cv=out_cv)

print(f"Accuracy: {nested_scores.mean()} ± {nested_scores.std()}")


Accuracy: 0.7936363636363637 ± 0.14013275877768422
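The task asks for 10-times 10-fold cross-validation, whereas the cell above runs a single 10-fold repetition. A minimal sketch of the repeated variant follows, assuming scikit-learn's `RepeatedStratifiedKFold` as the outer splitter. The synthetic data from `make_classification`, the reduced `n_neighbors` grid, and the `*_demo` names are illustrative assumptions, not part of the original notebook; on the real data, `X` and `y` would be used instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_score)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prostate data (assumption, for a self-contained run).
X_demo, y_demo = make_classification(n_samples=100, n_features=50,
                                     n_informative=5, random_state=0)

pipe_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

# Inner loop: 5-fold CV to choose k (small grid to keep the sketch fast).
grid_demo = GridSearchCV(
    pipe_demo,
    param_grid={'knn__n_neighbors': [1, 3, 5, 7]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
)

# Outer loop: 10 repetitions of 10-fold CV, i.e. 100 train/test splits.
rep_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)
rep_scores = cross_val_score(grid_demo, X_demo, y_demo, cv=rep_cv)

print(f"Accuracy: {rep_scores.mean():.3f} ± {rep_scores.std():.3f} "
      f"over {len(rep_scores)} splits")
```

Averaging over 100 splits instead of 10 mainly reduces the variance that comes from one particular partition of the data; it does not remove the bias of the estimator itself.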


Longer version:

In [None]:
k_values = np.arange(3, 10)  # smaller range because this version is slow
out_scores = []
score_per_k = {}

for out_i, (train_val_index, test_index) in enumerate(out_cv.split(X, y)):
    X_train_val, X_test = X[train_val_index], X[test_index]
    y_train_val, y_test = y[train_val_index], y[test_index]

    scaler = preprocessing.StandardScaler().fit(X_train_val)
    X_train_scaled = scaler.transform(X_train_val)
    X_test_scaled = scaler.transform(X_test)

    in_cv = StratifiedKFold(n_splits=5) 

    mean_in_score = []

    for k in k_values:
        in_score = []

        for train_index, val_index in in_cv.split(X_train_val, y_train_val):
            X_train, X_val = X_train_val[train_index], X_train_val[val_index]
            y_train, y_val = y_train_val[train_index], y_train_val[val_index]

            scaler = preprocessing.StandardScaler().fit(X_train)
            in_X_train_scaled = scaler.transform(X_train)
            in_X_test_scaled = scaler.transform(X_val)
            
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(in_X_train_scaled, y_train)
            acc = accuracy_score(y_val, knn.predict(in_X_test_scaled))
            in_score.append(acc)
        
        mean_in_score.append(np.mean(in_score))
    
    final_k = k_values[np.argmax(mean_in_score)]
    final_knn = KNeighborsClassifier(n_neighbors=final_k)
    final_knn.fit(X_train_scaled, y_train_val)
    acc = accuracy_score(y_test, final_knn.predict(X_test_scaled))
    out_scores.append(acc)
    print(f"Iteration: {out_i + 1}, K: {final_k}, accuracy = {acc}")
    
    if final_k not in score_per_k:
        score_per_k[final_k] = [acc]
    else:
        score_per_k[final_k].append(acc)



Iteration: 1, K: 7, accuracy = 0.7272727272727273
Iteration: 2, K: 6, accuracy = 0.8181818181818182
Iteration: 3, K: 7, accuracy = 0.7
Iteration: 4, K: 3, accuracy = 0.9
Iteration: 5, K: 7, accuracy = 0.8
Iteration: 6, K: 7, accuracy = 0.7
Iteration: 7, K: 8, accuracy = 1.0
Iteration: 8, K: 7, accuracy = 0.7
Iteration: 9, K: 3, accuracy = 0.9
Iteration: 10, K: 8, accuracy = 0.7


In [65]:
out_scores = np.array(out_scores)
print(f"Mean accuracy: {out_scores.mean()}")

Mean accuracy: 0.7945454545454547


In [66]:
for k in score_per_k:
    print(f"Average accuracy for {k} neighbors: {np.array(score_per_k[k]).mean()}")

Average accuracy for 7 neighbors: 0.7254545454545456
Average accuracy for 6 neighbors: 0.8181818181818182
Average accuracy for 3 neighbors: 0.9
Average accuracy for 8 neighbors: 0.85


2. Estimate the performance of the nearest neighbor classifier on the same dataset when using a feature selection technique based on the F-score (ANOVA) that picks up the 10 most relevant features. Use the same cross-validation methods as in the previous step.

In [67]:
pipe = Pipeline([
    ('scaler', preprocessing.StandardScaler()),
    ('feature_selection', SelectKBest(score_func=f_classif, k=10)),
    ('knn', KNeighborsClassifier())
])

param_grid = {'knn__n_neighbors': np.arange(1, 21)}

outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

clf = GridSearchCV(pipe, param_grid=param_grid, cv=inner_cv)

nested_scores = cross_val_score(clf, X, y, cv=outer_cv)


print(f"Accuracy: {nested_scores.mean()} ± {nested_scores.std()}")


Accuracy: 0.9209090909090909 ± 0.09783186791718376


3. Repeat the previous experiment, this time using a random forest to pick the 10 most relevant features. Before the random forest, apply an initial filter based on the F-score that keeps only the 20% most promising features.
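One possible way to set this up is sketched below. It assumes `SelectPercentile` for the 20% F-score filter and `SelectFromModel` wrapped around a `RandomForestClassifier` for the final 10 features; setting `threshold=-np.inf` together with `max_features=10` makes the selection depend only on the importance ranking. The synthetic data, the number of trees, the reduced `n_neighbors` grid, and the `rf_*` and `*_demo` names are illustrative assumptions, so no result for the prostate data is implied.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel, SelectPercentile, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prostate data (assumption, for a self-contained run).
X_demo, y_demo = make_classification(n_samples=100, n_features=200,
                                     n_informative=10, random_state=0)

rf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    # Step 1: keep the 20% of features with the highest ANOVA F-score.
    ('filter', SelectPercentile(score_func=f_classif, percentile=20)),
    # Step 2: keep the 10 features with the highest random forest importance.
    ('rf_selection', SelectFromModel(
        RandomForestClassifier(n_estimators=50, random_state=0),
        threshold=-np.inf, max_features=10)),
    ('knn', KNeighborsClassifier()),
])

rf_outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
rf_inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

rf_clf = GridSearchCV(rf_pipe, param_grid={'knn__n_neighbors': [1, 3, 5]},
                      cv=rf_inner_cv)
rf_scores = cross_val_score(rf_clf, X_demo, y_demo, cv=rf_outer_cv)

print(f"Accuracy: {rf_scores.mean():.3f} ± {rf_scores.std():.3f}")
```

Because both selection steps live inside the pipeline, they are refitted on each training fold, so the held-out fold never influences which features are chosen.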

4. What feature selection method performs best? Can you explain why?

**Now we will address the problem of analyzing the trade-off between interpretability and prediction accuracy. For this, you are asked to:**

1. Estimate the performance of the nearest neighbor classifier with K=3 as a function of the number of features used for prediction. Use a 10-times 10-fold cross-validation method and plot the results obtained, that is, the prediction error against the number of features used for prediction. Use the F-score for feature selection. Report results from 1 feature to 200 features. Not all feature counts need to be explored; use a higher resolution close to 1 feature.
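A minimal sketch of this experiment is shown below, with the feature selection placed inside the pipeline so that it is refitted on every training fold. The grid of feature counts, the synthetic data, and the names `n_features_grid` and `cv_errors` are assumptions chosen for illustration; the resulting list can then be plotted against `n_features_grid` with any plotting library.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prostate data (assumption, for a self-contained run).
X_demo, y_demo = make_classification(n_samples=100, n_features=500,
                                     n_informative=10, random_state=0)

# Denser near 1 feature, coarser towards 200 (the exact grid is an arbitrary choice).
n_features_grid = [1, 2, 3, 5, 10, 20, 50, 100, 200]

# 10 repetitions of 10-fold CV.
curve_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)

cv_errors = []
for n_feat in n_features_grid:
    curve_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('select', SelectKBest(score_func=f_classif, k=n_feat)),
        ('knn', KNeighborsClassifier(n_neighbors=3)),
    ])
    acc = cross_val_score(curve_pipe, X_demo, y_demo, cv=curve_cv)
    cv_errors.append(1.0 - acc.mean())  # prediction error = 1 - accuracy

for n_feat, err in zip(n_features_grid, cv_errors):
    print(f"{n_feat:>3} features: error = {err:.3f}")
```

A logarithmic x-axis is usually a reasonable choice for the plot, since the grid is much denser near 1 feature than near 200.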

2. Repeat that process when the feature selection is done externally to the cross-validation loop using all the available data. Include these results in the previous plot.
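The sketch below contrasts the two procedures on pure noise, where the labels are independent of the features and the true error of any classifier is about 50%. This setup is an assumption made purely for illustration: it makes the optimism of external selection easy to see, since the features are picked after looking at the labels of samples that later serve as test data. The names `X_noise`, `y_noise`, `err_inside`, and `err_external` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pure-noise data: labels carry no information about the features (assumption).
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(60, 2000))
y_noise = np.repeat([0, 1], 30)

bias_cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=42)

# (a) Selection inside the CV loop: refitted on each training fold.
inside_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(score_func=f_classif, k=10)),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
err_inside = 1.0 - cross_val_score(inside_pipe, X_noise, y_noise, cv=bias_cv).mean()

# (b) Selection outside the CV loop: fitted once on all the data.
X_selected = SelectKBest(score_func=f_classif, k=10).fit_transform(X_noise, y_noise)
external_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=3)),
])
err_external = 1.0 - cross_val_score(external_pipe, X_selected, y_noise, cv=bias_cv).mean()

print(f"Selection inside CV:  error = {err_inside:.3f}")
print(f"Selection outside CV: error = {err_external:.3f}")
```

On the real dataset the same two pipelines would be evaluated over the whole grid of feature counts and drawn on the same axes as the previous curve.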


3. Are the two estimates obtained similar? What are their differences? If they are different try to explain why this is the case.

4. By taking a look at these results, what is the optimal number of features to use in this dataset in terms of interpretability vs. prediction error?


5. Given the results obtained in this part of the practical, you are asked to indicate which particular features should be used for prediction on this dataset. Include a list with them. Take a look at the documentation of SelectKBest from scikit-learn to understand how to do this. Use all available data to provide such a list of features. 
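A minimal sketch of how the names of the selected features can be recovered, assuming `SelectKBest.get_support()` is used to obtain a boolean mask over the columns and `scores_` to rank them. The DataFrame `demo`, its `probe_*` column names, and `k=10` are hypothetical placeholders; on the real data the frame loaded from `prostate.csv` and the number of features chosen in the previous question would be used.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in with hypothetical column names (assumption).
X_arr, y_arr = make_classification(n_samples=100, n_features=30,
                                   n_informative=5, random_state=0)
demo = pd.DataFrame(X_arr, columns=[f"probe_{i}" for i in range(30)])
demo['Y'] = y_arr

feature_names = demo.columns[:-1]  # every column except the label

# Fit the selector on all available data, as the question asks.
selector = SelectKBest(score_func=f_classif, k=10)
selector.fit(demo[feature_names], demo['Y'])

mask = selector.get_support()      # boolean mask, True for selected features
selected = feature_names[mask]

# Order the selected features from highest to lowest F-score.
order = np.argsort(selector.scores_[mask])[::-1]
ranked = selected[order]

print(list(ranked))
```

Fitting on all the data is appropriate here because the goal is to report a final feature list, not to estimate the prediction error.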