# Cross Validation Experiment
### By Tanay Trivedi and Jonathan Bair

The fundamental point of the cross-validation example in 7.10.2 is that feature selection must be done inside the folds rather than outside them.

We first generate the dataset used in the example: 50 samples, 5000 standard normal predictors, and a binary target that is independent of the predictors.

## Data Generation

In [2]:
import numpy as np
import sklearn
import pandas as pd

In [28]:
predictors=np.linspace(1,5000,5000).astype(int)

In [10]:
sample=np.linspace(0,49,50)

In [17]:
data=pd.DataFrame(columns=predictors.astype(int),index=sample.astype(int))

In [20]:
data.loc[0:24,'target']=1
data.loc[25:49,'target']=0

In [32]:
predictors_data=np.random.randn(50,5000)

In [33]:
data.loc[0:49,predictors]=predictors_data

In [36]:
data=data.sample(frac=1)

In [40]:
data=data.reset_index(drop=True)

## Bad Cross Val
In the bad method, we rank the predictors by their correlation with the target using all 50 samples before cross-validation, and only train the classifier inside the folds.

In [41]:
corrs=pd.Series(index=predictors,dtype=float)
for i in predictors:
    corrs.loc[i]=data[i].corr(data['target'])

In [48]:
best_predictors=corrs.sort_values(ascending=False).iloc[0:100].index

In [51]:
X=data[best_predictors]
y=data['target']

The lines above compute the correlation of each predictor with the target, sort the predictors by that correlation, and keep the 100 most correlated.
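The per-column loop can also be written in one call. The following is a minimal sketch, assuming pandas' `DataFrame.corrwith`; the helper name `top_correlated` and the small demo data are hypothetical and only illustrate the ranking step.

```python
import numpy as np
import pandas as pd

def top_correlated(X: pd.DataFrame, y: pd.Series, k: int = 100) -> pd.Index:
    """Return the labels of the k columns of X most positively correlated with y.

    Hypothetical helper, not part of the notebook.
    """
    # corrwith computes the Pearson correlation of every column with y at once
    corrs = X.astype(float).corrwith(y.astype(float))
    # Same ranking rule as the loop: largest signed correlation first
    return corrs.sort_values(ascending=False).iloc[:k].index

# Small random demo data (assumed shapes, chosen only to keep the example fast)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.standard_normal((50, 200)), columns=np.arange(1, 201))
y_demo = pd.Series(np.repeat([1, 0], 25))

selected = top_correlated(X_demo, y_demo, k=10)
```

Like the loop, this ranks by signed correlation, so strongly negatively correlated predictors are not selected.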

In [54]:
from sklearn.model_selection import KFold

In [87]:
kf = KFold(n_splits=10)
index_gen=kf.split(X,y)

This generates 10 folds for cross-validation.
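To see what those folds look like for 50 samples, here is a small sketch that only inspects the split sizes. The placeholder array `dummy` is an assumption; `KFold` only needs the number of rows.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder array with 50 rows; only the number of samples matters to KFold
dummy = np.zeros((50, 1))

# Collect (train size, test size) for each of the 10 folds
fold_sizes = [(len(train), len(test))
              for train, test in KFold(n_splits=10).split(dummy)]
print(fold_sizes[0])  # (45, 5)
```

Each test fold holds only 5 samples, so a per-fold accuracy can only be a multiple of 0.2.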

In [88]:
from sklearn.neighbors import KNeighborsClassifier

In [89]:
accs=pd.Series(index=range(10),dtype=float)

In [90]:
for i in range(10):
    train_index,test_index=next(index_gen)
    
    neigh = KNeighborsClassifier(n_neighbors=1)
    neigh.fit(X.loc[train_index], y.loc[train_index]) 
    y_hat=neigh.predict(X.loc[test_index])
    y_real=y.loc[test_index].values
    accs.loc[i]=float(len(y_hat[y_hat==y_real]))/len(y_hat)
    

In [91]:
accs

0    1.0
1    1.0
2    1.0
3    1.0
4    0.8
5    1.0
6    1.0
7    1.0
8    1.0
9    1.0
dtype: float64

In [94]:
(1-accs.mean())*100

1.9999999999999907

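The above is the average error rate across folds, in percent. At 2% it is nonsensically low for random data: the target is independent of every predictor, so no classifier can truly do better than a 50% error rate. The estimate is biased because the predictors were chosen using the same samples that later serve as test data.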
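A single run could be a fluke, so one way to check that the bias is systematic is to repeat the bad procedure on several freshly generated datasets. The sketch below does this with plain NumPy arrays; the function name `leaky_cv_error`, the seeds, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

def leaky_cv_error(seed, n=50, p=5000, k=100, n_splits=10):
    """Cross-validated error when features are selected BEFORE splitting (the bad way)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    y = rng.permutation(np.repeat([0, 1], n // 2))  # labels independent of X

    # Pearson correlation of every column with y, using ALL samples (the leak)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corrs = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    )
    top = np.argsort(corrs)[-k:]  # indices of the k largest correlations

    errors = []
    for train, test in KFold(n_splits=n_splits).split(X):
        clf = KNeighborsClassifier(n_neighbors=1)
        clf.fit(X[train][:, top], y[train])
        errors.append(np.mean(clf.predict(X[test][:, top]) != y[test]))
    return float(np.mean(errors))

# Repeat on a handful of independent random datasets
leaky_errors = [leaky_cv_error(seed) for seed in range(5)]
print(np.mean(leaky_errors))
```

If the procedure were unbiased, these error rates would scatter around 0.5. Staying far below that on every dataset points to the selection step, not to luck.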

## Good Cross Val

Here we perform the correlation-based feature selection on the training side of each fold before fitting the classifier, so the held-out samples play no part in choosing the predictors.
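The same idea can be sketched with scikit-learn's own building blocks: putting the selector and the classifier in one `Pipeline` makes `cross_val_score` refit the selection on each training fold. This is an assumption-laden alternative, not the notebook's method. `SelectKBest` with `f_classif` stands in for the correlation ranking, and for two classes it ranks by absolute rather than signed correlation. The demo data is generated separately.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Random data with the same shape as the experiment; labels are independent of X
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((50, 5000))
y_demo = rng.permutation(np.repeat([0, 1], 25))

# Selection is a pipeline step, so it only ever sees the training part of a fold
pipe = make_pipeline(
    SelectKBest(f_classif, k=100),        # swapped-in selector (F-test, not signed correlation)
    KNeighborsClassifier(n_neighbors=1),
)

scores = cross_val_score(pipe, X_demo, y_demo, cv=KFold(n_splits=10))  # accuracy per fold
error_rate = 1 - scores.mean()
print(error_rate)
```

The explicit loop version used in this notebook follows.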

In [99]:
X=data[predictors]
y=data['target']

kf = KFold(n_splits=10)
index_gen=kf.split(X,y)
accs=pd.Series(index=range(10),dtype=float)
for i in range(10):
    train_index,test_index=next(index_gen)
    this_X=X.loc[train_index]
    this_y=y.loc[train_index]
    
    corrs=pd.Series(index=predictors,dtype=float)
    for j in predictors:
        corrs.loc[j]=this_X[j].corr(this_y)
    best_predictors=corrs.sort_values(ascending=False).iloc[0:100].index
    
    this_X=this_X[best_predictors]
    
    neigh = KNeighborsClassifier(n_neighbors=1)
    neigh.fit(this_X, this_y) 
    y_hat=neigh.predict(X.loc[test_index,best_predictors])
    y_real=y.loc[test_index].values
    accs.loc[i]=float(len(y_hat[y_hat==y_real]))/len(y_hat)
    

In [100]:
accs

0    0.4
1    0.2
2    0.8
3    0.8
4    0.6
5    0.4
6    0.8
7    0.6
8    0.4
9    0.8
dtype: float64

In [102]:
(1-accs.mean())*100

42.00000000000001

This is much more reasonable: for random data the error rate should be around 50%. Using 10 folds with so little data makes the estimate noisy, but the idea is correct.
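A rough way to judge how far from 50% the estimate can land by chance is the binomial standard error. The sketch below assumes the 50 held-out predictions behave like independent coin flips, which is only approximately true because the folds share training data.

```python
import math

n_test = 50      # every sample is held out exactly once across the 10 folds
p_true = 0.5     # true error rate when labels are independent of the predictors
observed = 0.42  # error rate reported above

# Standard error of an error-rate estimate from n_test independent predictions
se = math.sqrt(p_true * (1 - p_true) / n_test)

# Distance of the observed error from 50%, in standard errors
z = (observed - p_true) / se
print(round(se, 4), round(z, 2))  # 0.0707 -1.13
```

Under that approximation, 42% sits about one standard error below 50%, so it is consistent with chance-level performance.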