# Interpretation of Electroencephalography Measurements for Human Movements

w207 Summer 2015, Final Project

By Michael Marks, Nihar Patel, Carson Forter, Ji Yan, Jeff Yau

![live brain](http://s1.ibtimes.com/sites/www.ibtimes.com/files/styles/v2_article_large/public/2015/04/25/brain.jpg?itok=jIf_rHLW)

## Introduction
Electroencephalography (EEG) is a non-invasive method of measuring brain activity. An array of sensors are arranged on a person's scalp. These sensors detect the electrical activity of the neurons firing immediately under the skull. Though this technology has limited spatial resolution and cannot detect electrical activity within deep brain structures, it has the advantage of having high temporal resolution; it measures electrical activity on the time scale that biological processes in the brain actually happen. This technology has many medical uses, one of which is allowing for a brain-to-computer interface. In particular, because the human brain controls movement in part with surface-oriented neurons, there is potential for using EEG to interpret muscle movements as signals from the brain. Advances in sensor hardware will allow for greater spatial resolution in the future, but with current technology interpreting raw brain signals into the intended muscle movement is a challenge.

On June 29th, 2015, Kaggle, a data science competition platform, introduced [a contest](https://www.kaggle.com/c/grasp-and-lift-eeg-detection) for EEG classification. The dataset was a series of EEG recordings at 500 Hz from volunteers asked to repeatedly perform a task with their right hand. Each hand movement was recorded on video and carefully documented into a dataset time-matched with the EEG sensors. The result is a set of raw sensor data, 500 data points per second, labeled with one or more of 6 corresponding arm movements. Competitors were challenged to use the sample dataset to produce the algorithm that most accurately predicts the correct arm movement from the raw sensor data.

In this workbook, we describe how our group approached the problem. We outline our preliminary research, our initial tests, how we handled the large dataset, and finally our results.

## Libraries Used
We primarily use the [scikit-learn](http://scikit-learn.org/stable/index.html) python library for this project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
import time
import random
import gc
import re
import csv
import os
import psutil
import sys

# SK-learn libraries.
from sklearn.decomposition import PCA

from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.linear_model import SGDClassifier

# scipy
from scipy.signal import butter, lfilter



## Dataset
The data can be downloaded from Kaggle [here](https://www.kaggle.com/c/grasp-and-lift-eeg-detection/data). This workbook uses the folder structure data\test\ and data\train. 

The EEG data has a temporal aspect to it and all data is recorded at 500 Hz, meaning that there are 500 rows of data for every second recorded. For each participant, there is a separate file for each session containing the recordings from 32 EEG sensors. Parallel to each of these files is a file describing which of 6 different arm motions are being performed at any given frame (1/500th of a second). Each testing session is called a series, and consists of the participant completing the prescribed hand motion repeatedly for approximately 4 minutes. There are 8 series/label file pairs for each of the 12 subjects, 96 pairs of files in total. Additionally there are two more series of data for each subject without labels, which served as the kaggle test dataset.

Significantly, the hosts of the competition changed the dataset towards the end of our project. See the announcement on Kaggle [here](https://www.kaggle.com/c/grasp-and-lift-eeg-detection/forums/t/15413/announcement-error-in-dataset). An error in the dataset caused there to be a time discrepancy between the labels and the associated data. This meant that we needed to retrain our model on the new dataset in order to learn weights that were actually representative of the data. Anecdotally, this was disasterous for some groups' highly optimized algorithms. Luckily, the feature engineering and dimensionality reduction that we developed on the original dataset remained applicable and we were able to successfully carry over our work to the fixed data.

In [3]:
# Location of data can be changed here.
data_path = r'data\train'

### Loading data
Here we define functions for loading subsets of data into a Pandas dataframe. This is necessary because the data is too large to hold in memory on an average desktop computer, and so we will need to call this function many times on different chunks of the data. At approximately 2% positive values, the training labels are sparse. Pandas supports sparse datasets, but we chose not to implement this because we used it in maneagable chunks, we faced compatability issues with sklearn, and this did not address the real big data problem- the sensor data.

In [12]:
def open_data(subjects_to_use='[1,2]',series_for_training='[1-3]'):
    # Make a list of all the filenames with training data.
    
    train_data_filenames = glob.glob(data_path + r"\subj" + subjects_to_use +
                                     "_series"+series_for_training + "_data.csv")
    # Initialize an empty dataframe.
    train_data= pd.DataFrame()

    # Load the dataframe with the contents of each file.
    for file_ in train_data_filenames:
        train_data = train_data.append(pd.read_csv(file_,index_col=0, header=0))
    
    train_data.reset_index()
    # Make a list of all the filenames with training labels.
    train_labels_filenames = glob.glob(data_path + r"\subj" + subjects_to_use +
                                     "_series"+series_for_training + "_events.csv")

    # Initialize an empty dataframe.
    train_labels = pd.DataFrame()
    
    #Load the dataframe with the contents of each file.
    for file_ in train_labels_filenames:
        train_labels = train_labels.append(pd.read_csv(file_,index_col=0, header=0))
        
    train_labels.reset_index()
    # Return the resulting two dataframes.
    return train_data, train_labels



#Split up by subject for submission test data
def open_test_data(subject_to_use,test_data_path):
    # Make a list of all the filenames with training data.   
    train_data_filenames = glob.glob(test_data_path + r"\subj" + subject_to_use + "_series*_data.csv")
    
    # Initialize an empty dataframe.
    train_data= pd.DataFrame()

    # Load the dataframe with the contents of each file.
    for file_ in train_data_filenames:
        train_data = train_data.append(pd.read_csv(file_,index_col=0, header=0))
    
        
    Subject_Strings = np.array(train_data.index)
    train_data.reset_index()
    # Return the resulting two dataframes.
    return train_data,Subject_Strings




## Pre-Processing/Feature Engineering
Our group spent the vast majority of our time researching the best way to pre-process the data. We attempted  feature engineering by transforming the raw data using both a priori knowledge and general data reduction techniques. We began by consulting with medical EEG researchers and reading literature on EEG processing.

**Removing Channels**

Our medical EEG consultant did not have experience with interpreting muscle movement data, but was able to share some insights into neuroanatomy. The following brain map shows the different functional regions of the brain:

![brain functional regions](http://lh3.ggpht.com/_RIjx_Mg4ZVM/TNbSn_XhNcI/AAAAAAAACeY/Abc73jPpSbU/image_thumb6.png)

We anticipated that the frontal lobe of the brain would contain the relevant information for our problem. As the source of muscle movements, we believed that the region called the primary motor cortex is particularly important for predicting arm movements. The next image is a representation of what the primary motor cortex controls:

![homunculus](http://brainconnection.brainhq.com/wp-content/uploads/2013/03/1b.gif)

This is a coronal section of the brain (as if you were looking face-to-face with x-ray vision). The surface and about an inch under it, shaded in salmon color in the picture, is called the cortex. EEG sensors pick up the electrical firings of cells on the cortical surface shaded in blue. Note that the surface has grooves called sulci where the EEG sensors would have an especially hard time telling the difference between adjacently active sections of the cortex. The motor cortex, responsible for innervating our muscles, is always represented in the same spot and in the same order for humans, and we learned that individual differences likely wouldn't matter with the low spatial resolution available using EEG. For example, if Sensor 1 is positioned above the motor strip at the position where the fingers are represented, we would expect that sensor to be active when the participants are closing and releasing their fingers from the object, and to be inactive when participants are only moving their arms with their biceps.

Another important detail we learned was that the left side of the brain controls the right half of the body, so the cells we expect to be firing are on the left-side motor cortex.

Reviewing this information and considering potential sources of noise, we produced a function to remove channels that we anticipated would contain more noise than information.

In [5]:
# Remove the channels we don't want 
def Remove_Channels(df):
    df.drop(df.columns[[15,16,20,21,22,25,26,27,28,29,30,31,32]], axis=1, inplace=True)
    return df

**Frequency Filtering**

We read that the oscillatory nature of neural network electrical activity can be interpreted as a frequency power distribution. EEG sensor data in the 8-12 Hz and 16-24 Hz range has been shown to be related to motor neuron activity.

We used the following bandpass filter functions to eliminate frequencies in the raw sensor data outside of the desired ranges.

In [10]:
def butter_bandpass(lowcut, highcut, fs, order=5):
    nyq = 0.5 * fs
    low = lowcut / nyq
    high = highcut / nyq
    b, a = butter(order, [low, high], btype='band')
    return b, a

def butter_bandpass_filter(df, lowcut, highcut, fs, order=5):
    b, a = butter_bandpass(lowcut, highcut, fs, order=order)
    df = pd.DataFrame(lfilter(b, a, df), columns=df.columns)
    return df

def butterworth_filter(X,k,l):
    '''
      Butterworth Filter:
      scipy.signal.butter(N, Wn, btype='low', analog=False, output='ba')[source]
        N: the order of the filter, and 5 seems to be good enough
        Wn: critical frequency, at which point the gain drops to 1/sqrt(2) that of 
            the passband (the "-3dB point")
    '''
    b,a = butter(5,k/250.0,btype='lowpass')
    X = lfilter(b,a,X)
    return X


def preprocess_data(X):
    scaler= StandardScaler()
    # Just standardized the X
    X_normalized = scaler.fit_transform(X)
    
    # Define 2 x 20 features (based on a series of lowpass filters)
    nFeaturesAdded=20
    X_lowpass = np.zeros((np.shape(X_normalized)[0],nFeaturesAdded))
    l=30
    for i in range(nFeaturesAdded):
        X_lowpass[:,i] = butterworth_filter(X[:,0],2-(i*0.1),l)
        X_lowpass[:,i] = scaler.fit_transform(X_lowpass[:,i])
    X_lowpass_squared = X_lowpass ** 2
    X_preprocess = np.concatenate((X_lowpass, X_lowpass_squared),axis=1)
    return X_preprocess

**Signal Smoothing**

The Kaggle dataset description explains that the labels can be off by +/- 75 ms. Also, we observed that the raw data has frequent dips and rises during both positive and negative conditions. Consequently, we decided that some method of compressing the signal temporarily and smoothing could be beneficial. This requires care because the competition rules state that future data cannot be used to predict earlier labels.

We produced a number of functions designed to compress and smooth the signal. In all cases, data is smoothed using a right-edged window; only historical data is used to make adjustments at each data point.

In [6]:
# Bin the time. This was an attempt to bin the entire dataset into features based on the amount of time since test start.
def Bin_Time(num_rows,num_bins):
    Bin_Size = num_rows/num_bins
    Bins = np.zeros(shape=(num_rows,num_bins))
    Bin_Min = 0
    Bin_Max = Bin_Size
    for i in range(0,num_bins):
        Bins[Bin_Min:Bin_Max,i] = 1
        Bin_Min = Bin_Min + Bin_Size
        Bin_Max = Bin_Max + Bin_Size
    return Bins


# Return a rolling metric of each column of sensor data in a pandas dataframe with a given window size. Returns 
# df of same size. Metric can be mean, var, min, max, skew, or kurt. Multiplier can be absolute, square, or cube.
def df_rolling_metric(df,window, metric,multiplier='none'):
    # eval('pd.rolling'+metric+'(df.iloc[0:,i],'+ window +',min_periods = 0).fillna(0)')
    list_ = []
    Num_Cols = len(df.columns)
    if multiplier == 'abs':
        df = df.abs()
    if multiplier == 'square':
        df = df**2
    if multiplier == 'cube':
        df = df**3   
    for i in range(0,Num_Cols):
        Roll_Array = eval('pd.rolling_' + str(metric)+"(df.iloc[0:,i],"+str(window)+",min_periods = 0).fillna(0)")
        list_.append(Roll_Array)
    new_df = pd.concat(list_,1)
    for i in range(0,Num_Cols):
        new_df=new_df.rename(columns = {i:metric + str(i)})
    return np.array(new_df.astype('float32'))

# Return rolling quantile of each column in a pandas dataframe with a given window and quantile. Returns df of same size. 
def df_rolling_quantile(df,window,quantile):
    list_ = []
    Num_Cols = len(df.columns)
    for i in range(0,Num_Cols):
        Roll_Array = pd.rolling_quantile(df.iloc[0:,i],window,quantile,min_periods = 0).fillna(0)
        list_.append(Roll_Array)
    new_df = pd.concat(list_,1)
    return np.array(new_df.astype('float32'))


# BE CAREFUL NOT TO SUPPLY TOO MANY COLUMNS TO THIS FUNCTION. Returns 2^N columns, where N = intitial columns. 
# Return rolling pairwise correlation of each column in a pandas dataframe with a given window. Returns df of same size. 
def df_rolling_corr(df,window):
    list_ = []
    Num_Cols = len(df.columns)
    for i in range(0,Num_Cols):
        list_.append(pd.rolling_corr(df.iloc[0:,i],window,min_periods = 0))
    return pd.concat(list_,1)



# BE CAREFUL NOT TO SUPPLY TOO MANY COLUMNS TO THIS FUNCTION. Returns 2^N columns, where N = intitial columns. 
# Return rolling pairwise covariance of each column in a pandas dataframe with a given window. Returns df of same size. 
def df_rolling_cov(df,window):
    list_ = []
    Num_Cols = len(df.columns)
    for i in range(0,Num_Cols):
        list_.append(rolling_cov(df.iloc[0:,i],window,min_periods = 0))
    return pd.concat(list_,1)



**Feature Reduction**

We were concerned about the size of the dataset. To put it in context, series 1-8 from Subject 1 alone contains $1,422,329$ observations of $32$ columns, a total of $45514528$ data points. With the hundreds of features we have created, we (1) did not have enough time to conduct all training exercises on the full $12$ subjects dataset and (2) we could not fit all rows of data in memory. Therefore, we decided to try some methods of feature reduction to alleviate the memory and training-time issues.

We used principal component analysis as a means of reducing dimensionality. The function we wrote for PCA allowed us to input a desired percent variance explained. Using a set of principal components that explained about 90% of the variation in the data seemed to strike a good balance between a reduced number of dimensions and information loss.

In [7]:
#run PCA and return the number of PCs that explain the given amount of variance. 
def extract_PCs(Train_Features,Test_Features, PercentVarExplained):
    Start_Time = time.time()   
    Scale_Center = StandardScaler() #we must first scale and center the data.
    Train_Features = np.float16(Scale_Center.fit_transform(np.array(Train_Features)))
    gc.collect()  #Garbage collection (i.e. get rid of any outstanding unused memory)
    Test_Features = np.float16(Scale_Center.fit_transform(np.array(Test_Features)))
    gc.collect()

    pca = PCA()
    pca.fit(Train_Features)
    gc.collect()
    Explained_Variance_Ratios = pca.explained_variance_ratio_
    for i in range(1,len(Explained_Variance_Ratios)):
        if sum(Explained_Variance_Ratios[0:i]) >= PercentVarExplained:
                   NumPCs = i + 1 #add 1 since numpy array ranges are not inclusive
                   break
    print('PCA Complete:',round((time.time()-Start_Time),2), " seconds")
    print(NumPCs, 'Resultant Principal Components')

    return np.float32(pca.transform(Train_Features)[:,0:NumPCs]),np.float32(pca.transform(Test_Features)[:,0:NumPCs])

## Testing Features


**Metrics**

After creating features and data reduction functions, we created a pipeline for testing these functions with a simple linear regression model. Through this process, we discovered that we would constantly observe around 98% accuracy; roughly 98% of labels are negative, and roughly only 2% of all labels in the dataset are positive. This is a reflection of quick motions that participants were asked to perform, and their relatively short duration compared to time spent at every other movement or at rest.

When evaluating the predictive power of models for binary outcomes, **accuracy**, which measures the proportion of examples classified correctly, fails to provide a good evaluation of the model performance when the distribution of the two classes are very skewed, which is case in this EEG classification problem. (See the following link, written by one of our colleagues, of the shortcoming of accuracy: http://svds.com/post/basics-classifier-evaluation-part-1)

We decided to judge the performance of our pre-processing steps using Area Under the Curve (AUC) scoring, which is required by the Kaggle Competition: *"Submissions are evaluated on the mean column-wise AUC. That is, the mean of the individual areas under the ROC curve for each predicted column."* https://www.kaggle.com/c/grasp-and-lift-eeg-detection/details/evaluation

The curve is called the ** Receiver Operating Characteristic (ROC) curve **. ROC provides another approach to evaluate the predictive power of a binary classifier. It represents the locus (or collection) of the ratios of $sensitivity$ to $1-specificity$, where sensitivity is the proportion of events correctly predicted and specificity is the proportion of non-events correctly predicted. Ideally, both of these proportions are high. However, there generally exists an inverse relationship between the two; cutoff points closer to zero specificity give a higher probability that an event will be predicted as an event, but non-events will also be predicted as events. High specificty cutoff points give the opposite pattern. While the ROC curve can be a useful visual to compare models, it is more convenient to have a one number summary. This gives rise to  the **area** under the ROC curve, which is a measure of discrimination between events and non-events.

  **Sample model**

Due to the time series nature of the data, the main features we tested were rolling window statistics. These served as a means of smoothing the data. We used a simple logistic regression on only one of the 6 labels for testing purposes. We tested the features in the model by training on subject 1 series 1 and 2 (~350,000 rows of data), and testing on subject 1 series 3 (~200,000 samples). 

In [13]:
#Create a function that trains and runs a logistic regression model. Then prints and returns the AUC score
def regression_test(metric_string, train_data, train_labels, test_data, test_labels):
    
    Logistic_Reg = LogisticRegression(tol = .001)    
    Logistic_Reg.fit(train_data, train_labels)
    prob = Logistic_Reg.predict_proba(test_data)
    print(metric_string, "AUC =", round(roc_auc_score(test_labels, prob[:,1]),4))

def Test_Features():
    #just test on subject 1, series 1,2 and 3
    train_data, train_labels = open_data('[1]','[1,2]')
    dev_data, dev_labels = open_data('[1]','[3]')


    train_labels=train_labels['HandStart']
    dev_labels=dev_labels['HandStart']

    print "\n"
    regression_test("Baseline",train_data,train_labels,dev_data,dev_labels)
    
    print "\n"
    regression_test("lowpass filter",preprocess_data(np.asarray(train_data)),train_labels,preprocess_data(np.asarray(dev_data)),dev_labels)
    regression_test("Bandpass 2-30 Hz",
                    butter_bandpass_filter(train_data,2,30,500),train_labels,
                    butter_bandpass_filter(dev_data,2,30,500),dev_labels)

    print "\n"
    regression_test("Rolling Mean w/ Window of 100",df_rolling_metric(train_data,100,"mean"),train_labels,df_rolling_metric(dev_data,100,"mean"),dev_labels )
    regression_test("Rolling Mean w/ Window of 400",df_rolling_metric(train_data,400,"mean"),train_labels,df_rolling_metric(dev_data,400,"mean"),dev_labels)
    regression_test("Rolling Mean w/ Window of 700",df_rolling_metric(train_data,700,"mean"),train_labels,df_rolling_metric(dev_data,700,"mean"),dev_labels)
    regression_test("Rolling Mean w/ Window of 1000",df_rolling_metric(train_data,1000,"mean"),train_labels,df_rolling_metric(dev_data,1000,"mean"),dev_labels)
    regression_test("Rolling Mean w/ Window of 1500",df_rolling_metric(train_data,1500,"mean"),train_labels,df_rolling_metric(dev_data,1500,"mean"),dev_labels)
    regression_test("Rolling Mean w/ Window of 2000",df_rolling_metric(train_data,2000,"mean"),train_labels,df_rolling_metric(dev_data,2000,"mean"),dev_labels)

    print "\n"
    regression_test("Rolling Mean_Abs w/ Window of 100",df_rolling_metric(train_data,100,"mean","abs"),train_labels,df_rolling_metric(dev_data,100,"mean","abs"),dev_labels)
    regression_test("Rolling Mean_Abs w/ Window of 400",df_rolling_metric(train_data,400,"mean","abs"),train_labels,df_rolling_metric(dev_data,400,"mean","abs"),dev_labels)
    regression_test("Rolling Mean_Abs w/ Window of 700",df_rolling_metric(train_data,700,"mean","abs"),train_labels,df_rolling_metric(dev_data,700,"mean","abs"),dev_labels)
    regression_test("Rolling Mean_Abs w/ Window of 1000",df_rolling_metric(train_data,1000,"mean","abs"),train_labels,df_rolling_metric(dev_data,1000,"mean","abs"),dev_labels)
    regression_test("Rolling Mean_Abs w/ Window of 1500",df_rolling_metric(train_data,1500,"mean","abs"),train_labels,df_rolling_metric(dev_data,1500,"mean","abs"),dev_labels)
    regression_test("Rolling Mean_Abs w/ Window of 2000",df_rolling_metric(train_data,2000,"mean","abs"),train_labels,df_rolling_metric(dev_data,2000,"mean","abs"),dev_labels)

    print "\nTrying mean squared error to accentuate the large absolute values"
    regression_test("Rolling Mean_Square w/ Window of 100",df_rolling_metric(train_data,100,"mean","square"),train_labels,df_rolling_metric(dev_data,100,"mean","square"),dev_labels)
    regression_test("Rolling Mean_Square w/ Window of 400",df_rolling_metric(train_data,400,"mean","square"),train_labels,df_rolling_metric(dev_data,400,"mean","square"),dev_labels)
    regression_test("Rolling Mean_Square w/ Window of 700",df_rolling_metric(train_data,700,"mean","square"),train_labels,df_rolling_metric(dev_data,700,"mean","square"),dev_labels)
    regression_test("Rolling Mean_Square w/ Window of 1000",df_rolling_metric(train_data,1000,"mean","square"),train_labels,df_rolling_metric(dev_data,1000,"mean","square"),dev_labels)
    regression_test("Rolling Mean_Square w/ Window of 1500",df_rolling_metric(train_data,1500,"mean","square"),train_labels,df_rolling_metric(dev_data,1500,"mean","square"),dev_labels)
    regression_test("Rolling Mean_Square w/ Window of 2000",df_rolling_metric(train_data,2000,"mean","square"),train_labels,df_rolling_metric(dev_data,2000,"mean","square"),dev_labels)

    print "\n"
    regression_test("Rolling Skew w/ Window of 100",df_rolling_metric(train_data,100,"skew"),train_labels,df_rolling_metric(dev_data,100,"skew"),dev_labels)
    regression_test("Rolling Skew w/ Window of 400",df_rolling_metric(train_data,400,"skew"),train_labels,df_rolling_metric(dev_data,400,"skew"),dev_labels)
    regression_test("Rolling Skew w/ Window of 700",df_rolling_metric(train_data,700,"skew"),train_labels,df_rolling_metric(dev_data,700,"skew"),dev_labels)
    regression_test("Rolling Skew w/ Window of 1000",df_rolling_metric(train_data,1000,"skew"),train_labels,df_rolling_metric(dev_data,1000,"skew"),dev_labels)
    regression_test("Rolling Skew w/ Window of 1500",df_rolling_metric(train_data,1500,"skew"),train_labels,df_rolling_metric(dev_data,1500,"skew"),dev_labels)
    regression_test("Rolling Skew w/ Window of 2000",df_rolling_metric(train_data,2000,"skew"),train_labels,df_rolling_metric(dev_data,2000,"skew"),dev_labels)

    print "\n"
    regression_test("Rolling Min w/ Window of 100",df_rolling_metric(train_data,100,"min"),train_labels,df_rolling_metric(dev_data,100,"min"),dev_labels)
    regression_test("Rolling Min w/ Window of 400",df_rolling_metric(train_data,400,"min"),train_labels,df_rolling_metric(dev_data,400,"min"),dev_labels)
    regression_test("Rolling Min w/ Window of 700",df_rolling_metric(train_data,700,"min"),train_labels,df_rolling_metric(dev_data,700,"min"),dev_labels)
    regression_test("Rolling Min w/ Window of 1000",df_rolling_metric(train_data,1000,"min"),train_labels,df_rolling_metric(dev_data,1000,"min"),dev_labels)
    regression_test("Rolling Min w/ Window of 1500",df_rolling_metric(train_data,1500,"min"),train_labels,df_rolling_metric(dev_data,1500,"min"),dev_labels)
    regression_test("Rolling Min w/ Window of 2000",df_rolling_metric(train_data,2000,"min"),train_labels,df_rolling_metric(dev_data,2000,"min"),dev_labels)

    print "\n"
    regression_test("Rolling Max w/ Window of 100",df_rolling_metric(train_data,100,"max"),train_labels,df_rolling_metric(dev_data,100,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 200",df_rolling_metric(train_data,200,"max"),train_labels,df_rolling_metric(dev_data,200,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 400",df_rolling_metric(train_data,400,"max"),train_labels,df_rolling_metric(dev_data,400,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 700",df_rolling_metric(train_data,700,"max"),train_labels,df_rolling_metric(dev_data,700,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 1000",df_rolling_metric(train_data,1000,"max"),train_labels,df_rolling_metric(dev_data,1000,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 1500",df_rolling_metric(train_data,1500,"max"),train_labels,df_rolling_metric(dev_data,1500,"max"),dev_labels)
    regression_test("Rolling Max w/ Window of 2000",df_rolling_metric(train_data,2000,"max"),train_labels,df_rolling_metric(dev_data,2000,"max"),dev_labels)

    print "\n"
    # Trying different percentiles.
    Main_Pct_List = [.01,.05,.10,.25,.5,.75,.9,.95,.99]

    for Pct in Main_Pct_List:
        regression_test("Rolling Pct "+str(Pct*100)+ " w/ Window of 1000",df_rolling_quantile(train_data,400,Pct),train_labels,df_rolling_quantile(dev_data,400,Pct),dev_labels)

        
        
Test_Features()



('Baseline', 'AUC =', 0.786)


('lowpass filter', 'AUC =', 0.7273)
('Bandpass 2-30 Hz', 'AUC =', 0.7087)


('Rolling Mean w/ Window of 100', 'AUC =', 0.7105)
('Rolling Mean w/ Window of 400', 'AUC =', 0.6597)
('Rolling Mean w/ Window of 700', 'AUC =', 0.6638)
('Rolling Mean w/ Window of 1000', 'AUC =', 0.6666)
('Rolling Mean w/ Window of 1500', 'AUC =', 0.6555)
('Rolling Mean w/ Window of 2000', 'AUC =', 0.6519)


('\nRolling Mean_Abs w/ Window of 100', 'AUC =', 0.5426)
('Rolling Mean_Abs w/ Window of 400', 'AUC =', 0.5412)
('Rolling Mean_Abs w/ Window of 700', 'AUC =', 0.5558)
('Rolling Mean_Abs w/ Window of 1000', 'AUC =', 0.5699)
('Rolling Mean_Abs w/ Window of 1500', 'AUC =', 0.5821)
('Rolling Mean_Abs w/ Window of 2000', 'AUC =', 0.6134)

Trying mean squared error to accentuate the large absolute values
('Rolling Mean_Square w/ Window of 100', 'AUC =', 0.5018)
('Rolling Mean_Square w/ Window of 400', 'AUC =', 0.5262)
('Rolling Mean_Square w/ Window of 700', 'AUC =', 0.5317)
('Ro

##Feature Selection
Recursive feature selection was attempted but eventually abandoned due to computing limitations. Based on the performance of our tested features, certain features were manually selected and combined. This is by no means the best way to do feature selection, but options were limited due to memory constraints and processing time.

In [14]:
def Create_Features(train_data,test_data):
    Start_Time = time.time()   
    Train_Features = np.asarray(train_data)
    Test_Features = np.asarray(test_data)    
    train_data = train_data.iloc[0:,0:32]
    test_data = test_data.iloc[0:,0:32]
    
    Train_Features = np.concatenate((Train_Features,preprocess_data(np.asarray(train_data))),axis=1) #apply lowpass filter
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,100,"mean")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,400,"mean")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,1000,"mean")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,700,"skew")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,1350,"skew")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,2000,"skew")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,100,"mean",'square')),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,400,"mean",'square')),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,700,"mean",'square')),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,1000,"mean",'square')),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,100,"min")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,700,"min")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,1200,"min")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,100,"max")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,700,"max")),axis=1)
    Train_Features = np.concatenate((Train_Features,df_rolling_metric(train_data,1200,"max")),axis=1)
    print("Train Features Complete:",round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
    
    Test_Features = np.concatenate((Test_Features,preprocess_data(np.asarray(test_data))),axis=1) #apply lowpass filter
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,100,"mean")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,400,"mean")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,1000,"mean")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,700,"skew")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,1350,"skew")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,2000,"skew")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,100,"mean",'square')),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,400,"mean",'square')),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,700,"mean",'square')),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,1000,"mean",'square')),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,100,"min")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,700,"min")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,1200,"min")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,100,"max")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,700,"max")),axis=1)
    Test_Features = np.concatenate((Test_Features,df_rolling_metric(test_data,1200,"max")),axis=1)

    print("All Features Complete:",round((time.time()-Start_Time),2)/60, " Minutes Total")
    print(Test_Features.shape[1],"total features")
    return Train_Features, Test_Features

#this is a separate function that we used on a single data set, as opposed to both training and testing data
def Create_Features_single_dataset(data):
    Start_Time = time.time()   
    Features = np.asarray(data)

    print ('Creating Features........')
    gc.collect
    Features = np.concatenate((Features,preprocess_data(np.asarray(data))),axis=1) #apply lowpass filter
    Features = np.concatenate((Features,df_rolling_metric(data,100,"mean")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,400,"mean")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,1000,"mean")),axis=1)
    gc.collect
    Features = np.concatenate((Features,df_rolling_metric(data,500,"skew")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,1200,"skew")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,2000,"skew")),axis=1)
    gc.collect
    Features = np.concatenate((Features,df_rolling_metric(data,200,"mean",'square')),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,500,"mean",'square')),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,1000,"mean",'square')),axis=1)
    gc.collect
    Features = np.concatenate((Features,df_rolling_metric(data,100,"min")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,700,"min")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,1200,"min")),axis=1)
    gc.collect
    Features = np.concatenate((Features,df_rolling_metric(data,100,"max")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,700,"max")),axis=1)
    Features = np.concatenate((Features,df_rolling_metric(data,1200,"max")),axis=1)

    return Features
                         

##Selecting a  Model

In order to create and test a model, we decided to define a binary classifier function for each of the six label types:

 - Hand Start
 - First Digit Touch
 - Both Start Load Phase
 - Lift Off
 - Replace
 - Both Released
 
 
 We considered different models, including:
 

 - Logistic Regression
 - Decision Trees
 - Random Forest
 - Boosted Decision Trees
 - KNN
 - Linear Discriminant Analysis
 - SVM
 - Random Forest Regressor


Not shown here, we trained and tested multiple models to see which worked best, again using subject 1, series 1, 2 and 3. Our testing validated these ideas: logistic regression resulted in the highest AUC during testing (~.93). Though we were not expecting this, it does match what we've learned, that regression has high performance with large datasets.

In [15]:
# set parameters for Logistic Regression
C_Value = 1
penalty = 'l2'
Convergence_tol = .001

#Create a function that trains and runs a logistic regression model. Then prints and returns the AUC score
def Test_AUC(train_data, train_label, test_data, test_label,Category):
    C_Value = 1
    penalty = 'l2'
    Convergence_tol = .001
    class_weight="auto" 
    
    Logistic_Reg = LogisticRegression(C = C_Value, penalty = penalty,tol=Convergence_tol,class_weight = class_weight)    
    Logistic_Reg.fit(train_data, train_label)

    prob = Logistic_Reg.predict_proba(test_data)
    AUC = roc_auc_score(test_label, prob[:,1])
    print(Category, "AUC =",round(AUC,4))
    
    return AUC

def Test_Model(train_data,test_data,train_labels,test_labels):
    Train_Labels_HandStart =  train_labels['HandStart'].to_sparse(fill_value=0)
    Train_Labels_FirstDigitTouch =  train_labels['FirstDigitTouch'].to_sparse(fill_value=0)
    Train_Labels_BothStartLoadPhase =  train_labels['BothStartLoadPhase'].to_sparse(fill_value=0)
    Train_Labels_LiftOff =  train_labels['LiftOff'].to_sparse(fill_value=0)
    Train_Labels_Replace =  train_labels['Replace'].to_sparse(fill_value=0)
    Train_Labels_BothReleased =  train_labels['BothReleased'].to_sparse(fill_value=0)

    Test_Labels_HandStart =  test_labels['HandStart'].to_sparse(fill_value=0)
    Test_Labels_FirstDigitTouch =  test_labels['FirstDigitTouch'].to_sparse(fill_value=0)
    Test_Labels_BothStartLoadPhase =  test_labels['BothStartLoadPhase'].to_sparse(fill_value=0)
    Test_Labels_LiftOff =  test_labels['LiftOff'].to_sparse(fill_value=0)
    Test_Labels_Replace =  test_labels['Replace'].to_sparse(fill_value=0)
    Test_Labels_BothReleased =  test_labels['BothReleased'].to_sparse(fill_value=0)
    
    Start_Time = time.time()   
    AUC_HandStart = Test_AUC(train_data,Train_Labels_HandStart.to_dense(),test_data,Test_Labels_HandStart.to_dense(), 'HandStart')
    print(round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
          
    AUC_FirstDigitTouch = Test_AUC(train_data,Train_Labels_FirstDigitTouch.to_dense(),test_data,Test_Labels_FirstDigitTouch.to_dense(), 'FirstDigitTouch')
    print(round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
          
    AUC_BothStartLoadPhase = Test_AUC(train_data,Train_Labels_BothStartLoadPhase.to_dense(),test_data,Test_Labels_BothStartLoadPhase.to_dense(), 'BothStartLoadPhase')
    print(round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
          
    AUC_LiftOff = Test_AUC(train_data,Train_Labels_LiftOff.to_dense(),test_data,Test_Labels_LiftOff.to_dense(), 'LiftOff')
    print(round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
          
    AUC_Replace = Test_AUC(train_data,Train_Labels_Replace.to_dense(),test_data,Test_Labels_Replace.to_dense(), 'Replace')
    print(round((time.time()-Start_Time),2)/60, " Minutes Elapsed")
          
    AUC_BothReleased = Test_AUC(train_data,Train_Labels_BothReleased.to_dense(),test_data,Test_Labels_BothReleased.to_dense(), 'BothReleased')
    print(int(time.time()-Start_Time), " Seconds to complete")
          
    print("Overall Logistic Regression Score = ", np.mean((AUC_HandStart, AUC_FirstDigitTouch,AUC_BothStartLoadPhase,AUC_LiftOff,AUC_Replace,AUC_BothReleased)))

In [34]:
# Despite efficiencies built in, this code block has high memory requirements and
# has a long processing time.

Start_Time = time.time()   

#Open the data
train_data, train_labels = open_data('[1]','[1,2]')
test_data, test_labels = open_data('[1]','[3]')


#transform into features    
train_data, test_data = Create_Features(train_data,test_data)    


#perform dimensionality reduction. We'll use the PCs that explain 90% of the variance.
train_data, test_data = extract_PCs(train_data, test_data ,.93)
Test_Model(train_data,test_data,train_labels,test_labels)

print(round((time.time()-Start_Time),2)/60, "Total Minutes to Complete")

# Finalizing Model

Having selected features and chosen regression, we next wanted to train our model on the entire data set. Our initial attempts to scale the simple logistic regression model ran into memory issues, so we attempted two different methods of big data training: using a machine with 50 Gb of memory and batch training.

The 50 Gb machine provided good results, but would take half a day to process code. After many attempts at memory and processing time optimization, and after the kaggle dataset mistake, we decided to shift focus onto batch training.

### Batch Training All The Data With .partial_fit()

Though we tried different methods to conserve memory, no solution saved space on the order of magnitude necessary to fit on a standard desktop computer (16 Gb RAM). We decided to use a model that permits training on batches of data sequentially. The options available in Scikit-Learn include:

- Multinomial Naive Bayes
- Bernoulli Naive Bayes
- Perceptron
- SGD Classifier
- Passive Aggressive Classifier
- SGD Regressor

These functions all contain a .partial_fit() method for "out of core" (out of memory) learning.

After similarly testing these different models with their default settings, we chose an SGD classifier due to a higher AUC and quicker training time.

###Stochastic Gradient Descent Classifier

The SGD classifier was difficult to optimize parameters for and is an area where further improvement is needed in our model. Still, the SGD classifier worked well in our tests. Scikit-learn warns of these sensitive hyperparameters and of sensitivity to feature scaling, but the processed sensor data we used for features should not be problematic.

We wrote a function to take a select subject and series as input. It then uses then selected features to trains a classifier for each of the six labels in batches. It returns these trained models and the PCA transformations (i.e. the eigenvectors and eigenvalues). It is important to note that neither PCA nor any other vector-based transformation can work in batch training because the entire dataset is never contained in memory. PCA was very valuable to the processing speed and AUC, and so our solution was to create the component coefficients using limited data and applying that transformation to the remaining data as it was read for training.

In [40]:
def Batch_Train_SGD(subject,series,PercentVarExplained):
    Overall_Start_Time = time.time()
    counter = 1
    SGD_HandStart = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    
    SGD_FirstDigitTouch = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    
    SGD_BothStartLoadPhase = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    
    SGD_LiftOff = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    
    SGD_Replace = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    
    SGD_BothReleased = SGDClassifier(loss = 'log',n_jobs = 7,class_weight="auto",penalty  = 'l2',alpha = .001)    

    for series in series_batches:
        train_data, train_labels = open_data(subject,series)
        train_labels = train_labels.to_sparse(fill_value=0)
        #transform into features    
        train_data = Create_Features_single_dataset(train_data) 
        print "features completed"
        
        #We fit the PCA to the first batch and then apply it to all subsequent batches and test data. 
        if counter ==1:
            PCA_Start = time.time()
            Scale_Center = StandardScaler() #we must first scale and center the data.
            train_data = np.float16(Scale_Center.fit_transform(np.array(train_data)))
            print train_data.shape
            gc.collect()  #Garbage collection (i.e. get rid of any outstanding unused memory)
            pca = PCA()
            pca.fit(train_data)
            gc.collect() 
            Explained_Variance_Ratios = pca.explained_variance_ratio_
            for i in range(1,len(Explained_Variance_Ratios)):
                if sum(Explained_Variance_Ratios[0:i]) >= PercentVarExplained:
                    NumPCs = i + 1 #add 1 since numpy array ranges are not inclusive
                    break
            del Explained_Variance_Ratios
            print("PCA Complete:", NumPCs, "Resultant Principal Components:" ,
                  round((time.time()-Overall_Start_Time),2)/60, "Minutes")
        counter = counter + 1

        train_data = np.float32(pca.transform(train_data)[:,0:NumPCs])
        gc.collect()
        train_data = np.float16(Scale_Center.fit_transform(np.array(train_data)))
        gc.collect() 
        SGD_HandStart.partial_fit(train_data, train_labels['HandStart'].to_dense(),classes = [0,1])
        SGD_FirstDigitTouch.partial_fit(train_data, train_labels['FirstDigitTouch'].to_dense(),classes = [0,1])     
        SGD_BothStartLoadPhase.partial_fit(train_data, train_labels['BothStartLoadPhase'].to_dense(),classes = [0,1])
        SGD_LiftOff.partial_fit(train_data, train_labels['LiftOff'].to_dense(),classes = [0,1])
        SGD_Replace.partial_fit(train_data, train_labels['Replace'].to_dense(),classes = [0,1])
        SGD_BothReleased.partial_fit(train_data, train_labels['BothReleased'].to_dense(),classes = [0,1])
        gc.collect()     
    print("Subject", subject,"Model Fit Complete:", round((time.time()-Overall_Start_Time),2)/60, "Total Minutes")
    
    return pca, NumPCs,SGD_HandStart,SGD_FirstDigitTouch,SGD_BothStartLoadPhase,SGD_LiftOff,SGD_Replace,SGD_BothReleased



###Functions for Creating a Submission File

 - **Create_Submission_Data** - This function takes a subject ID and the output from Batch_Train_SGD as input (The trained models). It then loads the test data for that subject and saves the predictions for each of the six labels as a csv files so it does not need to be stored in memory.  
 - **Run_Full_Submission_Model** - combines the functionality of the Batch_Train_SGD and Create_Submission_Data functions into one single function. 
 - **Combine_Output_Files** - combines all of the csv files output by Create_Submission_Data into one file.
 

In [None]:
def Create_Submission_Data(Subj_ID,PercentVarExplained,test_data_path,pca,NumPCs,SGD_HandStart,
                           SGD_FirstDigitTouch,SGD_BothStartLoadPhase,SGD_LiftOff,SGD_Replace,SGD_BothReleased):
    print Subj_ID
    train_data, Subject_ID_Strings = open_test_data(str(Subj_ID),test_data_path)
    print train_data.shape
    
    #transform into features  
    gc.collect() 
    train_data = Create_Features_single_dataset(train_data) 
    gc.collect()       
    print train_data.shape
    Scale_Center = StandardScaler() #we must first scale and center the data.
    train_data = np.float16(Scale_Center.fit_transform(np.array(train_data)))
    gc.collect()  #Garbage collection (i.e. get rid of any outstanding unused memory)
    print train_data.shape
    train_data = np.float32(pca.transform(train_data)[:,0:NumPCs])
    
    train_data = np.float16(Scale_Center.fit_transform(np.array(train_data)))
    
    HandStart_Proba = np.float16(SGD_HandStart.predict_proba(train_data)[:,1])
    FirstDigitTouch_Proba = np.float16(SGD_FirstDigitTouch.predict_proba(train_data)[:,1])
    BothStartLoadPhase_Proba = np.float16(SGD_BothStartLoadPhase.predict_proba(train_data)[:,1])
    LiftOff_Proba = np.float16(SGD_LiftOff.predict_proba(train_data)[:,1])  
    Replace_Proba = np.float16(SGD_Replace.predict_proba(train_data)[:,1])
    BothReleased_Proba = np.float16(SGD_BothReleased.predict_proba(train_data)[:,1])
    
    del train_data  
    
    Subj_Probabilities = np.transpose(np.vstack([Subject_ID_Strings,HandStart_Proba,FirstDigitTouch_Proba,
                                                 BothStartLoadPhase_Proba,LiftOff_Proba,Replace_Proba,BothReleased_Proba]))
    
    File_Str = Output_Path + "\Subject" + str(Subj_ID) + "Probs.csv"  
    np.savetxt(File_Str, Subj_Probabilities, delimiter=",",fmt="%s",comments='')

    
def Combine_Output_Files(Output_Path):
    output_filenames = glob.glob(Output_Path + "\*Probs.csv"  )
    print output_filenames
    # Initialize an empty dataframe.
    All_Probabilities= pd.DataFrame()

    # Load the dataframe with the contents of each file.
    for file_ in output_filenames:
        All_Probabilities = All_Probabilities.append(pd.read_csv(file_))
    
    File_Str = "SubmissionFile.csv"  
    np.savetxt(File_Str,
               All_Probabilities, 
               delimiter=",", 
               header = "id,HandStart,FirstDigitTouch,BothStartLoadPhase,LiftOff,Replace,BothReleased",
               fmt="%s",comments='')

    
def Run_Full_Submission_Model(subjects,series_batches,PercentVarExplained,test_data_path,Output_Path):
    Overall_Start_Time = time.time()
    counter = 1
    for i in range(0,len(subjects)):
        Subj_ID = subjects[i]
        print "Subject" + str(Subj_ID) +" ...."
        Start_Time = time.time()
        pca, NumPCs, SGD_HandStart,SGD_FirstDigitTouch,SGD_BothStartLoadPhase,SGD_LiftOff,SGD_Replace,
        SGD_BothReleased = Batch_Train_SGD(str(Subj_ID),series_batches,PercentVarExplained)
        
        Current_Loop_Submission_Data = Create_Submission_Data(str(Subj_ID),PercentVarExplained,
                                                              test_data_path,pca,NumPCs,SGD_HandStart,SGD_FirstDigitTouch,
                                                              SGD_BothStartLoadPhase,SGD_LiftOff,SGD_Replace,SGD_BothReleased)
        gc.collect()

        print "Subject "+str(Subj_ID)+" Prediction Complete: " + str(round((time.time()-Start_Time),2)/60)+ " Total Minutes" 
    
    Combine_Output_Files(Output_Path)
    print round((time.time()-Overall_Start_Time),2)/60, "Total Minutes"

###Create Submission File

This section of code calls all the previously defined functions to create a .csv which can be submitted for the Kaggle competition. 
 - **test_data_path** - the path containing the 24 files of test data. There should be two files for each subject (series 9 and 10)
 - **Output_Path** - the path to which the individual subject prediction probabilities csv files will be written. 
 - **subjects** - the subjects upon which to train the model. The way this SGD model is set up, a model will be trained for each subject in this list. Each subjects test data will be predicted with their own model. This is much like was done with the batch logistic regression; however, the SGD allows the models to be trained in batches. Therefore, each subject's model is trained in four separate batches. This batch training allows a submission file to be generated with only 16 GB of RAM. 
 - **series_batches** - this is a list of the batches of series upon which the models will be trained. Each item in the list is a regex string for a range of series numbers. 
 - **PercentVarExplained** - This model uses PCA as a means of dimensionality reduction. This variable sets the desired amount of variance explained by returned principal components. For example, if this variable is set to 0.9, the principal components that explain 90% of the variation in the data will be used as features.


Given enough time to run, this script produces a csv which submitted to Kaggle earns an AUC score of $.744$. Though this is far short of our goal, this is large improvement from the naive models we started with. 

In [None]:
test_data_path = r'data\test'
Output_Path = r'data\'
subjects = [10,11,12] #subjects for training. each one will be done on it's own, batch training a model based on the batches below. 
series_batches = ['[1-2]','[3-4]','[5-6]','[7-8]'] #series for training. needs to be in batches
PercentVarExplained = .9

Run_Full_Submission_Model(subjects,series_batches,PercentVarExplained,test_data_path,Output_Path)

Subject10 ....
(479260, 32)
Creating Features........
features completed
(479260L, 552L)
('PCA Complete:', 67, 'Resultant Principal Components:', 2.440833333333333, 'Minutes')
(354201, 32)
Creating Features........
features completed
(393902, 32)
Creating Features........
features completed
(259552, 32)
Creating Features........
features completed
('Subject', '10', 'Model Fit Complete:', 6.629833333333334, 'Total Minutes')
10
(257237, 32)
Creating Features........
(257237L, 552L)
(257237L, 552L)
Subject 10 Prediction Complete: 7.75216666667 Total Minutes
Subject11 ....
(371097, 32)
Creating Features........
features completed
(371097L, 552L)
('PCA Complete:', 76, 'Resultant Principal Components:', 1.9508333333333332, 'Minutes')
(407697, 32)
Creating Features........
features completed
(466771, 32)
Creating Features........
features completed
(275472, 32)
Creating Features........
features completed
('Subject', '11', 'Model Fit Complete:', 6.833166666666667, 'Total Minutes')
11
(277497,