
In [None]:
# Assignment

# Re-implement the example in section 7.10.2 using any simple, out-of-the-box classifier (like k-nearest neighbors from scikit-learn).
# Reproduce the results for the incorrect and correct ways of doing cross-validation.

In [None]:
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate # cross-validates with (stratified) k-fold splits
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, f_classif

In [None]:
#----------GENERATE DATASET
np.random.seed(100)

data = np.random.normal(size=(50,500)) # 50 samples, 500 features of pure noise
dataframe = pd.DataFrame(data)

# inputs are noise; class labels are drawn independently of them
X = dataframe
y = np.ravel(np.random.randint(0, 2, (50,1)))


In [None]:
#----------THE WRONG WAY

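# prescreen features on ALL samples by ranking them with the ANOVA F-statistic against the output class
# (this is the mistake: samples later used for testing take part in the selection)
num_features = 10
X_new = SelectKBest(f_classif, k=num_features).fit_transform(X,y) # f_classif works with negative data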

In [None]:
# construct classifier using k nearest neighbours on the num_features selected features
neigh = KNeighborsClassifier(n_neighbors=3)

# perform 2-fold cross-validation; cross_validate clones and refits the classifier on each fold,
# and for a classifier an integer cv uses stratified folds
cv_results = cross_validate(neigh, X_new,y,cv=2)
print("Average scores from incorrect cross validation:",np.mean(cv_results['test_score']))

Average scores from incorrect cross validation: 0.78


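After pre-screening the predictors on the full dataset, keeping the 10 most strongly associated with the class labels, and then building a model on those features and cross-validating it, we obtain a mean accuracy of 0.78. Since the labels were drawn at random, independently of the features, no classifier can truly do better than 50% on new data. The estimate is optimistic because the held-out samples already influenced which features were chosen.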
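The selection effect can be checked directly. The sketch below is an illustration, not part of the assignment: it regenerates noise of the same shape with a hypothetical seed and balanced random labels, then compares the absolute correlation with the labels for the 10 selected features against all 500. The helper `abs_label_correlation` and the names `X_demo` and `y_demo` are assumptions introduced for this example.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)  # hypothetical seed, not the one used in the notebook
X_demo = rng.normal(size=(50, 500))               # 50 samples, 500 noise features
y_demo = rng.permutation(np.repeat([0, 1], 25))   # balanced labels, independent of X_demo

def abs_label_correlation(X, y):
    """Absolute Pearson correlation of each column of X with y (hypothetical helper)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

corr_all = abs_label_correlation(X_demo, y_demo)

# Select the 10 best features using every sample, as in the incorrect procedure
selector = SelectKBest(f_classif, k=10).fit(X_demo, y_demo)
corr_selected = corr_all[selector.get_support()]

print("mean |corr|, all features:     ", corr_all.mean())
print("mean |corr|, selected features:", corr_selected.mean())
```

Even though every feature is noise, the selected ones correlate more strongly with the labels than the average feature, simply because they were picked for that property on the same samples.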

In [None]:
#-------THE RIGHT WAY

# prepare data to work with Kfolds function
kf = KFold(n_splits=2, shuffle=True, random_state=0) # draw samples randomly for kfolds

X_trains = []
X_tests = []
y_trains = []
y_tests = []
X = X.to_numpy()
y = np.array(y)



In [None]:
# split into folds 
for train_index, test_index in kf.split(X): # there might have been a better way to do this with indexing
  X_trains.append(X[train_index])
  X_tests.append(X[test_index])
  y_trains.append(y[train_index])
  y_tests.append(y[test_index])


# for each fold, find "good" predictors using all samples except fold k
scores = 0

for fold in range(len(X_trains)):
  # kf.split already excludes fold k from the training indices,
  # so X_trains[fold] holds all samples except fold k
  fold_X_train, fold_y_train = X_trains[fold], y_trains[fold]
  fold_X_test, fold_y_test = X_tests[fold], y_tests[fold]

  # select good features using the training samples only
  kbest = SelectKBest(f_classif, k=num_features)
  X_train_selected = kbest.fit_transform(fold_X_train, fold_y_train)

  # build a multivariate classifier using all samples except fold k
  neigh = KNeighborsClassifier(n_neighbors=3)
  neigh.fit(X_train_selected, fold_y_train)

  # use classifier to predict for samples in fold k, restricted to the same features
  score = neigh.score(kbest.transform(fold_X_test), fold_y_test)
  scores += score

# compute average score
avg_score = scores/len(X_trains)
print("Average score using correct method:",avg_score)

Average score using correct method: 0.6000000000000001


Done correctly, cross-validation should yield a lower average accuracy, that is, a higher error rate. This is because feature selection and model fitting never see the samples of the held-out fold, so no information about them leaks into the prediction. In the run recorded here, the mean accuracy falls from 0.78 with the incorrect procedure to 0.6 with the correct one (error rate 0.22 vs 0.4), much closer to the 0.5 expected for labels that are independent of the features.
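Feature selection can also be kept inside each fold by wrapping the selector and the classifier in a scikit-learn pipeline, so that cross-validation refits both on the training part of every split. The following is a minimal sketch under stated assumptions: the data are regenerated with a hypothetical seed, and 5 folds are used instead of 2 purely for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)  # hypothetical seed, not the one used in the notebook
X_demo = rng.normal(size=(50, 500))               # 50 samples, 500 noise features
y_demo = rng.permutation(np.repeat([0, 1], 25))   # balanced labels, independent of X_demo

# The selector is a pipeline step, so it is fit only on each fold's training samples
pipeline = make_pipeline(
    SelectKBest(f_classif, k=10),
    KNeighborsClassifier(n_neighbors=3),
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)  # 5 folds is an assumption
fold_scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv)

print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", fold_scores.mean())
```

With this arrangement there is no manual bookkeeping of fold indices, which removes the most common way for held-out samples to leak into the selection step.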