
In [1]:
# Assignment

# Re-implement the example in section 7.10.2 using any simple, out-of-the-box classifier (like K nearest neighbors from scikit-learn).
# Reproduce the results for the incorrect and correct way of doing cross-validation.

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate # an integer cv uses StratifiedKFold for classifiers, KFold otherwise
from sklearn.model_selection import KFold
from sklearn.feature_selection import SelectKBest, chi2

In [3]:
#----------DATA READING 
data = 'https://raw.githubusercontent.com/wolframalexa/FrequentistML/master/iris.csv' # this is a very popular dataset and does not need cleaning
dataframe = pd.read_csv(data, sep=',', header='infer')
print("Species in dataset:",dataframe['species'].unique())

# split data into inputs and outputs
X = dataframe.drop('species', axis=1)
y = dataframe['species']
y = [0 if x == 'Iris-setosa' else 1 if x == 'Iris-versicolor' else 2 for x in y] # encode species data as integers

dataframe.describe()

Species in dataset: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']


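|       | sepal_length | sepal_width | petal_length | petal_width |
|-------|--------------|-------------|--------------|-------------|
| count | 150.0        | 150.0       | 150.0        | 150.0       |
| mean  | 5.843333     | 3.054       | 3.758667     | 1.198667    |
| std   | 0.828066     | 0.433594    | 1.76442      | 0.763161    |
| min   | 4.3          | 2.0         | 1.0          | 0.1         |
| 25%   | 5.1          | 2.8         | 1.6          | 0.3         |
| 50%   | 5.8          | 3.0         | 4.35         | 1.3         |
| 75%   | 6.4          | 3.3         | 5.1          | 1.8         |
| max   | 7.9          | 4.4         | 6.9          | 2.5         |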


In [4]:
#----------THE WRONG WAY

# prescreen features by scoring each one against the output class with a chi-squared test,
# using ALL samples (including those that are later held out for validation)
X_new = SelectKBest(chi2, k=2).fit_transform(X,y)
print("Original data:\n",X.loc[[0]])
print('Selected data:',X_new[0])

Original data:
    sepal_length  sepal_width  petal_length  petal_width
0           5.1          3.5           1.4          0.2
Selected data: [1.4 0.2]


Comparing the first row before and after selection shows that the chi-squared screen kept petal length and petal width, the two features most strongly associated with the class. We will build the K nearest neighbours classifier with just *petal_length* and *petal_width*.
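Rather than reading the selection off a single row, the selector can report which columns it kept and how each feature scored. The following is a minimal sketch; it assumes scikit-learn's bundled copy of the iris data (`load_iris`) as a stand-in for the CSV loaded above, and the names `X_demo`, `y_demo`, `mask` and `selected_names` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Assumption: the bundled iris data stands in for the CSV used in this notebook.
iris = load_iris()
X_demo, y_demo = iris.data, iris.target

# Fit the same chi-squared screen, keeping the two best-scoring features.
selector = SelectKBest(chi2, k=2).fit(X_demo, y_demo)

# get_support() returns a boolean mask with True for each kept column.
mask = selector.get_support()
selected_names = [name for name, keep in zip(iris.feature_names, mask) if keep]

# scores_ holds the chi-squared statistic of every feature, kept or not.
for name, score in zip(iris.feature_names, selector.scores_):
    print(f"{name}: chi2 = {score:.2f}")
print("Selected:", selected_names)
```

Printing the scores alongside the mask makes it visible how far apart the petal and sepal features are, which a single transformed row cannot show.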

In [5]:
# construct a k nearest neighbours classifier for the two selected features
neigh = KNeighborsClassifier(n_neighbors=3)

# perform 2-fold cross-validation; cross_validate clones and refits the classifier
# on each training split, and an integer cv with a classifier means StratifiedKFold
cv_results = cross_validate(neigh, X_new,y,cv=2)

# compute average error rate
avg_error = 1-np.average(cv_results['test_score'])
print("Average error using wrong method:",avg_error*100,"%")

Average error using wrong method: 4.666666666666663 %


Screening the predictors on the full dataset, then fitting a model on the selected features and cross-validating it, gives an average error rate of 4.67%. The screening step has already seen the labels of every sample, including the ones that are later held out, which is what makes this the wrong way to cross-validate.
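When the two procedures are compared, it helps to know exactly which splitter produced each number. The sketch below uses `check_cv` from `sklearn.model_selection` to show what an integer `cv` resolves to for a classifier, and then builds one explicit splitter that could be shared by both procedures. The bundled iris labels and the names `resolved` and `shared_cv` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, StratifiedKFold, check_cv

# Assumption: the bundled iris labels stand in for the encoded species column.
_, y_demo = load_iris(return_X_y=True)

# Resolve cv=2 the way scikit-learn does for a classifier with class labels.
resolved = check_cv(2, y_demo, classifier=True)
print(type(resolved).__name__, "shuffle =", resolved.shuffle)

# An explicit splitter can be passed as cv=... to cross_validate and also
# iterated by hand, so both procedures are scored on identical folds.
shared_cv = KFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, test_idx in shared_cv.split(y_demo):
    print(len(train_idx), "training samples,", len(test_idx), "held out")
```

Passing the same splitter object to both procedures removes the fold assignment as a source of difference between the two error rates.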

In [6]:
#-------THE RIGHT WAY

# prepare data to work with Kfolds function
kf = KFold(n_splits=2, shuffle=True, random_state=0) # draw samples randomly for kfolds

X_trains = []
X_tests = []
y_trains = []
y_tests = []
X = X.to_numpy()
y = np.array(y)



In [7]:
# split into folds 
for train_index, test_index in kf.split(X): # there might have been a better way to do this with indexing
  X_trains.append(X[train_index])
  X_tests.append(X[test_index])
  y_trains.append(y[train_index])
  y_tests.append(y[test_index])


# for each fold, find "good" predictors, fit a classifier and score it on fold k
errs = 0

for fold in range(0,len(X_trains)):
  # np.delete drops entry `fold` from each list and keeps the remaining one;
  # with n_splits=2 the remaining training split, X_trains[1-fold], holds the
  # same samples as X_tests[fold], so the classifier below is fit on fold k itself
  temp_X_trains = np.delete(X_trains, fold, axis=0)
  temp_X_trains = temp_X_trains[0]
  temp_X_tests = np.delete(X_tests, fold, axis=0)
  temp_y_trains = np.delete(y_trains, fold, axis=0)
  temp_y_trains = np.transpose(temp_y_trains)
  temp_y_tests = np.delete(y_tests, fold, axis=0)

  features = np.array([False, False, False, False])
  temp_X_trains_all = np.array([])
  temp_y_trains_all = np.array([])

  # select good features
  kbest = SelectKBest(chi2, k=2)

  X_new = kbest.fit_transform(temp_X_trains,temp_y_trains)
  features = features | kbest.get_support()             # "or" together to find good features

  # build a multivariate classifier on the selected features of temp_X_trains;
  # the zero columns pad the training matrix to four columns so that it matches
  # X_tests[fold], which only lines up when the last two features are selected
  selected_features = [i for i, feature in enumerate(features) if feature == True]
  X_selected = np.concatenate((np.zeros((temp_X_trains.shape[0],len(selected_features))) ,temp_X_trains[:, selected_features]), axis=1) # fit to selected features

  neigh = KNeighborsClassifier(n_neighbors=3)
  neigh.fit(X_selected, np.ravel(temp_y_trains))

  # use classifier to predict class labels for samples in fold k
  y_predicted = neigh.predict(X_tests[fold])
  diff = y_predicted - y_tests[fold]
  err = np.count_nonzero(diff)/len(diff)
  errs += err

# compute average error rate
avg_err = errs/len(X_trains)
print("Average error using correct method:",avg_err*100,"%")

Average error using correct method: 2.666666666666667 %


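Done correctly, cross-validation should generally yield a higher average error rate than the incorrect version. This is because neither the feature screening nor the classifier ever sees the held-out samples, so the estimate cannot benefit from information about them.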
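The expected gap is easiest to see on data where the true error rate is known. The sketch below follows the spirit of the simulation in section 7.10.2, using many pure-noise features and labels that are independent of them, so any classifier should be wrong about half the time. The sizes, the seed, the 1-nearest-neighbour classifier and the use of `f_classif` (an F-test, swapped in because `chi2` requires non-negative features) are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n_samples, n_features, n_keep = 50, 5000, 100  # illustrative sizes

# Gaussian noise features and balanced labels that do not depend on them.
X_noise = rng.standard_normal((n_samples, n_features))
y_noise = np.repeat([0, 1], n_samples // 2)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Wrong way: screen features on ALL samples, then cross-validate the classifier.
X_screened = SelectKBest(f_classif, k=n_keep).fit_transform(X_noise, y_noise)
knn = KNeighborsClassifier(n_neighbors=1)
wrong_err = 1 - cross_val_score(knn, X_screened, y_noise, cv=cv).mean()

# Right way: put the screening inside the pipeline so it is refit on each
# training split and never sees the held-out samples.
pipe = make_pipeline(SelectKBest(f_classif, k=n_keep),
                     KNeighborsClassifier(n_neighbors=1))
right_err = 1 - cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print(f"Wrong way error: {wrong_err * 100:.1f} %")
print(f"Right way error: {right_err * 100:.1f} %")
```

Wrapping the selector and the classifier in one pipeline is a design choice that makes the leak hard to reintroduce, since `cross_val_score` refits every step on each training split.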

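However, here the "correct" error rate (2.67%) is actually lower than the "incorrect" one (4.67%). The size and simplicity of the dataset contribute: with 75 samples per fold, each misclassified sample moves the error by about 1.3 percentage points, and all four iris features are informative, so screening on the full data leaks very little. The larger cause is the loop itself. With two folds, dropping entry `fold` from `X_trains` leaves the split that contains exactly the samples in `X_tests[fold]`, so each classifier is scored on the samples it was trained on. The 2.67% is therefore a resubstitution error rather than a cross-validation error, and it should not be read as evidence of a better model. The two numbers also come from different splits: `cross_validate` with `cv=2` uses unshuffled stratified folds, while the `KFold` here is shuffled.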
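For comparison, a per-fold loop can iterate over `kf.split` directly, fit both the screening and the classifier on the training indices only, and score on the held-out indices. This is a minimal sketch rather than a drop-in replacement for the cell above; it assumes the bundled iris data as a stand-in for the CSV, and the names `X_demo`, `y_demo` and `fold_errors` are illustrative. No error rate is quoted here because the sketch has not been run against the notebook's data.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

# Assumption: the bundled iris data stands in for the CSV used in this notebook.
X_demo, y_demo = load_iris(return_X_y=True)
kf = KFold(n_splits=2, shuffle=True, random_state=0)

fold_errors = []
for train_idx, test_idx in kf.split(X_demo):
    X_train, X_test = X_demo[train_idx], X_demo[test_idx]
    y_train, y_test = y_demo[train_idx], y_demo[test_idx]

    # Screen features using the training split only.
    selector = SelectKBest(chi2, k=2).fit(X_train, y_train)

    # Fit the classifier on the selected columns of the training split.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(selector.transform(X_train), y_train)

    # Apply the same column selection to the held-out fold before predicting.
    y_pred = knn.predict(selector.transform(X_test))
    fold_errors.append(np.mean(y_pred != y_test))

avg_err = float(np.mean(fold_errors))
print(f"Average error: {avg_err * 100:.2f} %")
```

Using `selector.transform` on both splits keeps the training and test matrices in the same column layout, so no zero padding is needed.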