# Simple Classifiers

Today we will examine classifiers on an EEG dataset from Michelle's lab, using a boundary-based classification method (a support vector machine) together with cross validation.

By the end of today's lesson, you should know how to implement K-fold cross validation for different values of K, and understand the accuracy and time tradeoffs of each.

Let's start by reading in the data - please be patient because it's a big file!

In [None]:
from scipy.io import loadmat

# Load in the data file
dataMat = loadmat('/home/data/eegData/sim001_bootstrap.mat')
dataMat = dataMat['data']

# Load in the category labels
trialData = loadmat('/home/data/eegData/sub001.mat')
trialData = trialData['catVec']
trialData = trialData[6:,0] # Here, the first 6 trials were practice and can be discarded

print(trialData.shape)
print(dataMat.shape)

As you can see, there are 63 electrodes, 600 time points, and 1200 trials. In this experiment, observers saw 100 repeats each of 12 different scene categories. Time 0 refers to 100 ms before stimulus onset, so we have 500 ms of stimulus-driven response.

A powerful way to use classifiers with EEG is to run the analysis separately at each time point. This shows us when, over the course of a trial, the neural signal carries information about scene category. However, it takes a long time to run a classifier at every time point! In the interest of time, we will isolate time point 200 (101 ms after image onset). This should yield a 63-electrode by 1200-trial array.
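
As a hint for the slicing step, here is a minimal sketch on a small made-up array. It assumes the axes are ordered electrodes x time points x trials, which should be checked against the shape printed above; `toyData` and `oneTimePoint` are illustrative names, not part of the lab's data.

```python
import numpy as np

# Toy stand-in for the EEG array (assumed axis order: electrodes x time x trials)
nElectrodes, nTimes, nTrials = 4, 10, 6
toyData = np.arange(nElectrodes * nTimes * nTrials).reshape(nElectrodes, nTimes, nTrials)

# An integer index on the middle axis drops that axis and keeps the other two
t = 3
oneTimePoint = toyData[:, t, :]
print(oneTimePoint.shape)  # (4, 6): electrodes x trials
```

The same indexing pattern applies to any time point, which is what a time-resolved analysis would loop over.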

In [None]:
time200 = # fill me in: should be 63x1200 array when printed
print(time200.shape)

Although it's a good intellectual exercise to work through creating different train-test splits of your data, I'm going to let you off easy today. We'll be using a built-in machine learning function to do k-fold cross validation. Run this cell to get the gist of how train and test splits work for this function:

In [None]:
from sklearn.model_selection import KFold
import numpy as np

# create an array of indices for all 1200 trials
allIndices = np.arange(1200)

# create a k-fold object
kf = KFold(n_splits=5)

# for each train-test split, print the length of train and test indices
for trainIdx, testIdx in kf.split(allIndices):
    print(len(trainIdx), len(testIdx))

This means that in each of the five folds, there are 960 trials that will be used to train the classifier, and 240 trials that will be used to test the classifier. Now, we are ready to apply this information to performing a 5-fold cross validation of this data! Fill in the code below to finish.
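
One detail worth knowing before you fill in the cell: by default, `KFold` does not shuffle, so each test fold is a contiguous block of indices. The sketch below shows this on ten toy indices and contrasts it with `shuffle=True`. Whether this matters here depends on how the trials are ordered in `trialData`, which this notebook does not tell us; if trials were grouped by category, contiguous folds could leave some categories out of a training set. `toyIndices` and the fold lists are illustrative names.

```python
import numpy as np
from sklearn.model_selection import KFold

toyIndices = np.arange(10)

# Default KFold: no shuffling, so each test fold is a contiguous block
plainFolds = [testIdx.tolist() for _, testIdx in KFold(n_splits=5).split(toyIndices)]
print(plainFolds)

# shuffle=True assigns items to folds at random; random_state makes it repeatable
shuffledKf = KFold(n_splits=5, shuffle=True, random_state=0)
shuffledFolds = [testIdx.tolist() for _, testIdx in shuffledKf.split(toyIndices)]
print(shuffledFolds)
```

scikit-learn also provides `StratifiedKFold`, which keeps the category proportions roughly equal across folds when you pass the labels to `split`.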

In [None]:
# import svm from machine learning library
from sklearn import svm

# initialize accuracy vector (hint: should be the number of folds)
overallAccuracy = np.zeros()

# initialize a counter for the folds
foldCount = -1

# loop through all folds
for trainIdx, testIdx in kf.split(allIndices):
    # increment counter
    foldCount += 1
    
    # initialize vector that will hold a 1 (accurate) or a 0 (inaccurate) value for each of the test items
    foldAccuracy = np.zeros()
    
    # define training data: fill in square brackets to indicate the trial indices for the given fold for training
    trainData = time200[].T
    # define training answers
    trainAnswers = trialData[]
    
    # define testing data: fill in square brackets to indicate the trial indices for the given fold for testing
    testData = time200[].T
    # define testing answers
    testAnswers = trialData[]
    
    # define the classifier
    classifier = svm.SVC(kernel='linear')
    
    # train the classifier
    classifier.fit(trainData, trainAnswers)
    
    # test the classifier
    predClass = classifier.predict(testData)
    
    # loop through each of the predicted classes and test whether they are correct
    for i in range(len(predClass)):
        if predClass[i] == testAnswers[i]:
            foldAccuracy[i] = 1
            
    overallAccuracy[foldCount] = np.mean(foldAccuracy)

# print the average accuracy over all folds
print(np.mean(overallAccuracy))
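
The element-by-element loop above is written out for clarity. As a side note, the same accuracy can be computed in one line by comparing the two arrays directly. The sketch below uses made-up predictions and labels (not results from the EEG data) to show that the loop, the vectorized comparison, and scikit-learn's `accuracy_score` agree.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up predictions and true labels, for illustration only
predToy = np.array([3, 1, 7, 7, 2, 12])
truthToy = np.array([3, 1, 5, 7, 2, 11])

# Element-by-element loop, in the style of the cell above
loopAcc = np.zeros(len(predToy))
for i in range(len(predToy)):
    if predToy[i] == truthToy[i]:
        loopAcc[i] = 1

# Vectorized equivalent: the mean of a boolean array is the fraction of True values
vectorAcc = np.mean(predToy == truthToy)

print(np.mean(loopAcc), vectorAcc, accuracy_score(truthToy, predToy))
```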

A randomly guessing classifier would get, on average, 1 in 12 items correct (an accuracy of about 0.083). Considering this, it's not bad performance! You may be wondering what's so great about 5-fold cross validation and whether better performance might be observed with other values of K.
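
To make the chance level concrete, here is a small simulation of a classifier that guesses uniformly at random among 12 categories with 100 trials each. It is a sketch under assumptions: the category codes 1 to 12 are made up for illustration (the actual coding in `catVec` may differ), and no EEG data is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
nCategories, nRepeats = 12, 100

# Balanced labels: 100 trials for each of 12 categories (codes 1..12 are an assumption)
labels = np.repeat(np.arange(1, nCategories + 1), nRepeats)

# Simulate many random guessers, each labelling all 1200 trials
nSims = 2000
guesses = rng.integers(1, nCategories + 1, size=(nSims, labels.size))
simAcc = (guesses == labels).mean(axis=1)

print(simAcc.mean())               # close to 1/12
print(np.percentile(simAcc, 95))   # accuracy that random guessing rarely exceeds
```

The spread of `simAcc` gives a rough sense of how far above 1/12 an accuracy needs to be before it is unlikely to come from guessing alone.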

In the cell(s) below, play around with other values of K and observe the accuracy as well as the time necessary to obtain a result. Discuss with your group what the optimal value might be (if there is any).
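
If it helps to get started, one possible way to organize the comparison is to loop over several values of K and time each run. The sketch below uses a small synthetic dataset from `make_classification` as a stand-in (it is not the EEG data, so the numbers it prints say nothing about the real result) and scikit-learn's `cross_val_score` in place of the hand-written loop. Note that `X` here is already samples x features, like `time200.T`.

```python
import time
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (NOT the EEG data): 240 samples, 20 features, 4 classes
X, y = make_classification(n_samples=240, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

results = {}
for k in (2, 5, 10, 20):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    start = time.perf_counter()
    # cross_val_score fits and tests one classifier per fold and returns k accuracies
    scores = cross_val_score(svm.SVC(kernel='linear'), X, y, cv=cv)
    elapsed = time.perf_counter() - start
    results[k] = (scores.mean(), elapsed)
    print(f"K={k:2d}  accuracy={scores.mean():.3f}  time={elapsed:.2f} s")
```

Larger K means each classifier trains on more of the data but more classifiers must be trained, which is the tradeoff to look for in your own runs.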