In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib notebook

# Everything is correlated
Or, how to do cross validation wrong.


Let's try a classification problem with a small data set and many candidate features.
The labels are going to be random, so we know no model we fit should predict them better than chance (about 50% accuracy).

In [2]:
npoints = 100
nfeatures = int(5e4)
rng = np.random.default_rng()

# Features: uniform noise. Labels: a fair coin flip, independent of every feature.
X = rng.random(size=(npoints, nfeatures))
Y = rng.random(size=(npoints,)) > 0.5


Now let's pretend we think there are too many features, and that most of them are probably useless. We only want the top 100 for building our model.

How can we pick these? Let's take the 100 features with the highest absolute correlation with our labels as the "useful" ones.

In [3]:
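def select_best_features(X, Y, n=100):
    """Return the column indices of the n features most correlated with Y."""
    corrs = np.zeros(X.shape[1])
    for ii in range(X.shape[1]):
        # Pearson correlation between feature ii and the labels
        corrs[ii] = np.corrcoef(X[:, ii], Y)[0, 1]
    # Rank by absolute correlation so strong negative correlations count too
    top_idxs = np.argsort(np.abs(corrs))[-n:]
    return top_idxs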

In [4]:
%%time
top_idxs = select_best_features(X, Y, 100)
X100 = X[:, top_idxs]

CPU times: user 4.05 s, sys: 23.4 ms, total: 4.08 s
Wall time: 4.14 s
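
The loop in `select_best_features` calls `np.corrcoef` once per column, which accounts for most of those four seconds. As a rough sketch (not part of the original notebook), the same correlations can be computed for all columns at once with array operations. The names `column_label_correlations`, `X_demo`, and `y_demo` are made up for this illustration, and the demo uses a smaller random matrix with a fixed seed so the comparison with the loop runs quickly.

```python
import numpy as np


def column_label_correlations(X, y):
    """Pearson correlation of every column of X with y, without a Python loop."""
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)  # center each feature column
    yc = y - y.mean()        # center the labels
    cov = yc @ Xc            # unnormalized covariance, one value per column
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return cov / denom


# Hypothetical demo data: smaller than the notebook's X, seeded for repeatability
rng = np.random.default_rng(0)
X_demo = rng.random(size=(100, 2000))
y_demo = rng.random(size=100) > 0.5

fast = column_label_correlations(X_demo, y_demo)
slow = np.array([
    np.corrcoef(X_demo[:, j], y_demo.astype(float))[0, 1]
    for j in range(X_demo.shape[1])
])
print("vectorized result matches the loop:", np.allclose(fast, slow))

# Same ranking rule as select_best_features
top_idxs_demo = np.argsort(np.abs(fast))[-100:]
```

The ranking step is unchanged, so this should select the same columns as the loop version, just faster.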


Now let's try fitting a model to the trimmed data set, and see how it performs on a held-out split and then in cross validation:

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn import svm

# Hold out 40% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X100, Y, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
print(clf.score(X_test, y_test))

1.0


In [6]:
folds = 5
scores = cross_val_score(clf, X100, Y, cv=folds)
print(f"{folds}-fold cross validation scores:")
print(scores)

5-fold cross validation scores:
[1. 1. 1. 1. 1.]


### Perfect cross validation?

The model appears to be able to fit the data perfectly... how?
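
One way to see how is to look at the size of the correlations that pure chance produces. The sketch below is a self-contained illustration rather than part of the original analysis: it regenerates random data of the same shape (the fixed seed is an assumption, for reproducibility only), picks the 100 features most correlated with the labels, and then checks those same features against a second, independent set of random labels. The helper `abs_corr` and the variable `y_fresh` are names invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
npoints, nfeatures = 100, 50_000

X = rng.random(size=(npoints, nfeatures))
y = (rng.random(size=npoints) > 0.5).astype(float)
y_fresh = (rng.random(size=npoints) > 0.5).astype(float)  # independent of y and X


def abs_corr(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (yc @ Xc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(r)


r_all = abs_corr(X, y)
top = np.argsort(r_all)[-100:]            # selected using every label in y
r_top_fresh = abs_corr(X[:, top], y_fresh)  # same columns, labels never seen

print(f"median |r| over all features:        {np.median(r_all):.3f}")
print(f"smallest |r| among the top 100:      {r_all[top].min():.3f}")
print(f"mean |r| of top 100 vs fresh labels: {r_top_fresh.mean():.3f}")
```

With 50,000 candidates and only 100 points, the best 100 features correlate noticeably with the labels purely by chance, while the same columns look like ordinary noise against labels they were not selected on. Because the selection saw every label, any later split of those rows tests on points that already influenced which features were kept.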

What happens if we split off the holdout set of X and Y *before* picking the best features?

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=0)
print(f"Input X shape: {X_train.shape}")
top_idxs = select_best_features(X_train, y_train, 100)
X100 = X_train[:, top_idxs]
print(f"output shape: {X100.shape}")

# Holdout rows, restricted to the features chosen from the training rows only
X_holdout = X_test[:, top_idxs]

Input X shape: (60, 50000)
output shape: (60, 100)


In [8]:
clf = svm.SVC(kernel='linear', C=1).fit(X100, y_train)

print(f"Score on holdout set: {clf.score(X_holdout, y_test)}")
# Cross validate within the holdout rows only; their labels played no part in feature selection
scores = cross_val_score(clf, X_holdout, y_test, cv=folds)
print(f"{folds}-fold cross validation scores:")
print(scores)

Score on holdout set: 0.525
5-fold cross validation scores:
[0.125 0.625 0.625 0.5   0.5  ]


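Now we see that feature selection is *part of the model training*. In the first attempt, the features were chosen using every label, including the labels of the points later used for testing, so the scores were inflated. If we don't set aside the validation set before selecting features, we won't get a representative score from cross validation.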
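
One way to keep the selection inside each training fold is to wrap it in a scikit-learn `Pipeline`, so that `cross_val_score` refits the selector on every split. The sketch below is an illustration under assumptions, not the notebook's own method: it swaps `select_best_features` for `SelectKBest` with `f_classif` (for two classes, the F-score ranks features in the same order as the absolute correlation), regenerates the random data, and uses a fixed seed for reproducibility.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # fixed seed: an assumption, for reproducibility only
X = rng.random(size=(100, 50_000))
y = (rng.random(size=100) > 0.5).astype(int)  # random labels, no real signal

# Leaky: features are selected once, using all labels, before cross validation
selector = SelectKBest(f_classif, k=100).fit(X, y)
leaky_scores = cross_val_score(
    SVC(kernel="linear", C=1), selector.transform(X), y, cv=5
)

# Honest: the selector is refit on the training rows of each fold
pipeline = make_pipeline(SelectKBest(f_classif, k=100), SVC(kernel="linear", C=1))
honest_scores = cross_val_score(pipeline, X, y, cv=5)

print(f"selection outside CV: {leaky_scores.mean():.2f}")
print(f"selection inside CV:  {honest_scores.mean():.2f}")
```

With random labels, the first number should land close to the perfect scores seen earlier, and the second should hover around chance. Putting every data-dependent step in the pipeline makes the honest version the default rather than something to remember by hand.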