# Applied Machine Learning (DataSci W207 - Summer 2015)
## Final Project: Identify Hand Motions from ElectroEncephaloGraphy (EEG) Signals
### Group Members: Carson Forter, Michael Marks, Nihar Patal, Ji Yan, Jeffrey Yau

### Introduction:

This Kaggle competition challenges participants to identify when a hand is grasping, lifting, and replacing an object using ElectroEncephaloGram (EEG) data recorded from healthy subjects as they performed these activities. Successfully predicting hand movements from EEG signals is critical to developing a Brain-Computer Interface (BCI) device that would give patients with neurological disabilities the ability to move through the world with greater autonomy.

This report summarizes the work we have completed to date and our next steps. So far, we have:
1. Downloaded the dataset, which contains tens of millions of observations across all 12 subjects (its basic structure is described below), built a baseline model, and submitted it to Kaggle for scoring.
2. Read the extensive literature on EEG signal analysis to gain domain knowledge and learn EEG signal processing techniques.
3. Plotted the EEG signals of selected channels over time to visualize them.
4. Held weekly meetings to discuss results and next steps.
5. Consulted professional researchers at UCSF who are experts in EEG analysis; based on their suggestions, we eliminated 14 of the 32 channels because signals from the eliminated channels do not reflect movements of the right hand.
6. Tried different classification approaches:
-- Because the full dataset is large, we decided to test different approaches on a subset before training on the entire dataset, using subject 1's series 1 and 2 for training and subject 1's series 3 for testing.
-- Approach 1: Applied PCA to further reduce the channel dimension, applied a variety of backward rolling-window methods to smooth the series, and then classified each of the six hand-movement categories using logistic regression. Rolling-window smoothing uses only the information within the window to smooth a series and make its trend more prominent. The most common variant is the moving average (also called the rolling average), which is frequently applied to macroeconomic time series such as initial jobless claims. We computed the mean, various quantiles, and the first four moments within each window to capture different characteristics of the series.
-- Approach 2: Applied low-pass filtering, a common practice in EEG signal processing, and then classified each of the six hand-movement categories using logistic regression. EEG signals are notoriously noisy, so when analyzing them for hand movements it is important to filter out frequency bands that do not correspond to hand movements. That is why we applied low-pass filtering before feature engineering. We chose the *Butterworth filter design*, one of the most common linear filter approximations used to shape the frequency spectrum of a signal (others include the elliptic, or Cauer, filter, the Chebyshev type I and type II filters, and the Bessel filter, all of which are available in the *scipy.signal* package). A Butterworth filter requires choosing an order. We used a 5th-order filter, which is typical in the EEG literature, because this order seems to balance the trade-off between the flatness of the pass band and the width of the transition band between the pass band and the stop band.
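
As a rough illustration of Approach 2, the sketch below applies a 5th-order Butterworth low-pass filter to each channel with scipy.signal. The 30 Hz cutoff, the synthetic two-channel signal, and the helper name `lowpass` are assumptions made for the example; the report does not state the cutoff that was actually used. The filter is applied causally with `lfilter`, so each output sample depends only on the current and earlier input samples.

```python
import numpy as np
from scipy.signal import butter, lfilter

FS = 500.0  # sampling rate in Hz, as given in the competition description


def lowpass(data, cutoff_hz, fs=FS, order=5):
    """Causal Butterworth low-pass filter applied to each channel.

    data: array of shape (n_samples, n_channels)
    cutoff_hz: cutoff frequency in Hz (hypothetical value, not from the report)
    """
    nyquist = 0.5 * fs
    b, a = butter(order, cutoff_hz / nyquist, btype="low")
    # lfilter runs forward in time only, so no future samples are used
    return lfilter(b, a, data, axis=0)


# Synthetic stand-in for EEG: a 5 Hz component plus a 100 Hz component
t = np.arange(0, 2.0, 1.0 / FS)
slow = np.sin(2 * np.pi * 5 * t)
fast = 0.5 * np.sin(2 * np.pi * 100 * t)
raw = np.column_stack([slow + fast, slow - fast])

filtered = lowpass(raw, cutoff_hz=30.0)
```

A zero-phase alternative such as `scipy.signal.filtfilt` runs the filter forward and backward and therefore looks at later samples, which is why the causal form is used in this sketch.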

**Next Steps:**
Continue to try various filtering, feature engineering, and classification methods until we meet again this weekend. One of the methods we will attempt is:
-- Eliminate the channels identified in item 5 above; apply low-pass filtering to remove frequency bands that are irrelevant to hand movement; apply the common spatial pattern (CSP) method, which has been found to be useful in modeling EEG signals, to further reduce the feature dimension; then train models using (1) Linear Discriminant Analysis (LDA), which has been found to be useful in EEG classification, and (2) a Hidden Markov Model, in order to capture the temporal dependence of the EEG series.
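
To make the CSP step more concrete, here is a minimal two-class sketch. It is only an illustration under stated assumptions: the project has six event classes, whereas textbook CSP contrasts two conditions, and the function names, the synthetic trials, and the choice of one filter pair are all hypothetical. The filters come from the generalized eigenproblem between the class covariance matrices, and the features are the log of the normalized variance of the spatially filtered signals, which could then be passed to a classifier such as LDA.

```python
import numpy as np
from scipy.linalg import eigh


def csp_filters(trials_a, trials_b, n_pairs=1):
    """Compute CSP spatial filters for two classes.

    trials_a, trials_b: arrays of shape (n_trials, n_channels, n_samples)
    Returns an array of shape (2 * n_pairs, n_channels).
    """
    def mean_cov(trials):
        # Average the trace-normalized spatial covariance over trials
        covs = [x @ x.T / np.trace(x @ x.T) for x in trials]
        return np.mean(covs, axis=0)

    cov_a, cov_b = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem: cov_a w = lambda (cov_a + cov_b) w
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    # Eigenvalues are ascending; keep the filters at both extremes
    picks = np.r_[np.arange(n_pairs), np.arange(-n_pairs, 0)]
    return eigvecs[:, picks].T


def csp_features(trials, filters):
    """Log of the normalized variance of each spatially filtered signal."""
    projected = np.einsum("fc,tcs->tfs", filters, trials)
    var = projected.var(axis=2)
    return np.log(var / var.sum(axis=1, keepdims=True))


# Synthetic trials: class A is strongest on channel 0, class B on channel 1
rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 20, 4, 250
scale_a = np.array([3.0, 1.0, 1.0, 1.0])[:, None]
scale_b = np.array([1.0, 3.0, 1.0, 1.0])[:, None]
trials_a = rng.standard_normal((n_trials, n_channels, n_samples)) * scale_a
trials_b = rng.standard_normal((n_trials, n_channels, n_samples)) * scale_b

W = csp_filters(trials_a, trials_b, n_pairs=1)
feat_a = csp_features(trials_a, W)
feat_b = csp_features(trials_b, W)
```

With 2 * n_pairs filters, each trial is reduced from n_channels x n_samples values to 2 * n_pairs features, which is the dimension reduction the step above is after.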

In this project, we would like to account for both the spatial and temporal information in the EEG signals: different channels are spatially related, while the signals at time $t-\tau$, where $0 < \tau < t$, may help predict the signals at time $t$. We therefore conjecture that successful classification requires a holistic approach that accounts for the spatio-temporal relationships embedded in the data. In the EEG analysis literature, the spatio-temporal relationship is often handled at the filtering stage, but the classification methods typically used ignore the temporal relationship entirely (with the exception of one paper that we came across). For this reason, we want not only to capture the spatio-temporal relationship at the filtering stage but also to use a classification method that accounts for temporal information.

### Basic Format of the Datasets
There are 12 subjects in total, 10 series of trials for each subject, and approximately 30 trials within each series. The number of trials varies from series to series. The training set contains the first 8 series for each subject. The test set contains the 9th and 10th series.

For each grasp-and-lift (GAL) trial, the task is to detect 6 events:

1. HandStart
2. FirstDigitTouch
3. BothStartLoadPhase
4. LiftOff
5. Replace
6. BothReleased

<img src="C:/Users/K/Google Drive/_2.MIDS/W207 - Applied Machine Learning/Project/Final/EEG_Electrode_Numbering.jpeg">

### A Brief Description of the Datasets from "Multi-channel EEG recordings during 3,936 grasp and lift trials with varying weight and friction" published in Scientific Data
Reference: http://www.nature.com/articles/sdata201447#ref-link-section-1
WAY-EEG-GAL is a dataset designed to allow critical tests of techniques to decode sensation, intention, and action from scalp EEG recordings in humans who perform a grasp-and-lift task. Twelve participants performed lifting series in which the object’s weight (165, 330, or 660 g), surface friction (sandpaper, suede, or silk surface), or both, were changed unpredictably between trials, thus enforcing changes in fingertip force coordination. In each of a total of 3,936 trials, the participant was cued to reach for the object, grasp it with the thumb and index finger, lift it and hold it for a couple of seconds, put it back on the support surface, release it, and, lastly, to return the hand to a designated rest position. We recorded EEG (32 channels), EMG (five arm and hand muscles), the 3D position of both the hand and object, and force/torque at both contact plates. For each trial we provide 16 event times (e.g., ‘object lift-off’) and 18 measures that characterize the behaviour (e.g., ‘peak grip force’).

### Other Notes:
$\textbf{EEG signals}$ can be formalized as
$$
  \mathbf{E}_n \in \Re^{ch \times time}, \quad n = 1, \dots, N
$$
where
- $N$ is the number of trials
- $ch$ is the number of channels
- $time$ is the range of time domain

The data set includes **12** subjects, each of whom performed **10** series of about **30** trials. The recordings include **32** channels.

We will explore transformation methods to transform each EEG signal to feature vectors. That is,
$$
  \mathbf{E}_n \in \Re^{ch \times time} \mapsto \mathbf{x}_n \in \Re^d
$$
where $d$ is smaller than $ch \times time$.
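
As one hedged example of such a mapping, the sketch below reduces a single trial of shape $ch \times time$ to four summary statistics per channel, so that $d = 4 \times ch$. The particular statistics, the function name, and the synthetic trial are assumptions for illustration, not the transformation we have settled on.

```python
import numpy as np


def trial_to_features(trial):
    """Map one trial E_n of shape (ch, time) to a vector x_n of length 4 * ch.

    Uses the per-channel mean, standard deviation, and 25th and 75th percentiles.
    """
    mean = trial.mean(axis=1)
    std = trial.std(axis=1)
    q25, q75 = np.percentile(trial, [25, 75], axis=1)
    return np.concatenate([mean, std, q25, q75])


# Synthetic stand-in for one trial: 32 channels, 1000 time samples
rng = np.random.default_rng(0)
E_n = rng.standard_normal((32, 1000))
x_n = trial_to_features(E_n)
print(E_n.size, "->", x_n.size)  # prints: 32000 -> 128
```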

**Note 1: Temporal Data**
Unlike other datasets we have worked with in this class, EEG recordings are both temporal (time-dependent) and spatial data. Our first task is therefore to determine the sampling frequency and the observed time period.

Each time frame is given a unique id according to the subject, series, and frame to which it belongs. The six label columns are either zero or one, depending on whether the corresponding event has occurred within ±150 ms (±75 frames).
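
The conversion between the two window sizes follows from the 500 Hz sampling rate: 150 ms x 500 samples per second = 75 frames. The sketch below reproduces that arithmetic and builds a 0/1 label vector around a single event. The function name, the event position, and the choice to include both end points of the window are assumptions for illustration; the exact boundary convention in the events files may differ.

```python
import numpy as np

FS = 500                # sampling rate in Hz
HALF_WINDOW_MS = 150    # half-width of the label window in milliseconds
HALF_WINDOW_FRAMES = HALF_WINDOW_MS * FS // 1000  # 75 frames


def window_labels(n_frames, event_frame, half_width=HALF_WINDOW_FRAMES):
    """Return a 0/1 vector that is 1 within half_width frames of event_frame."""
    labels = np.zeros(n_frames, dtype=int)
    start = max(event_frame - half_width, 0)
    stop = min(event_frame + half_width + 1, n_frames)  # inclusive end (assumed)
    labels[start:stop] = 1
    return labels


# Hypothetical event at frame 400 in a 1000-frame series
labels = window_labels(n_frames=1000, event_frame=400)
```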

- the *_data.csv files contain the raw 32-channel EEG data (sampling rate 500 Hz)
- the *_events.csv files contain the ground-truth frame-wise labels for all events

According to the Kaggle forum and the description on the competition page, the data was sampled at a rate of 500 Hz, so each line in the data and events files corresponds to a measurement taken 1/500th of a second after the previous one. In the subj1_series1_events file, some lines have a '1' in more than one column, which means that more than one event is labeled at that time. We had initially assumed that each time point would have a single event associated with it.

Each event is tagged in the data with a window of ±150 ms (±75 frames at 500 Hz) around the actual occurrence of the event. For example, the class 'HandStart' is labeled within this window around the instant at which the subject starts moving the hand. From the point of view of the challenge, it would make no sense to ask contestants to predict an event with a precision of one time sample. Instead, contestants are asked to predict what the events files contain, that is, a time window around the actual event.

**Note 2: No Future Data Rule**
The competition imposes a restriction on the use of future data: a prediction for a given frame may not use data from later frames.
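
One way to stay within such a restriction is to compute smoothing features with a backward-looking window, as in the rolling-window approach described earlier. The sketch below is an assumption-laden illustration: the function name and the window length are hypothetical, and it takes the rule to mean that the feature at frame $t$ may use only frames up to and including $t$.

```python
import numpy as np


def backward_rolling_mean(x, window):
    """Mean of the current sample and the (window - 1) samples before it.

    x: array of shape (n_samples, n_channels). For the first samples, where
    fewer than `window` values exist, the mean of the available values is used.
    """
    x = np.asarray(x, dtype=float)
    csum = np.cumsum(x, axis=0)
    out = csum.copy()
    # Sum over the last `window` samples via a difference of cumulative sums
    out[window:] = csum[window:] - csum[:-window]
    counts = np.minimum(np.arange(1, len(x) + 1), window)[:, None]
    return out / counts


signal = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
smoothed = backward_rolling_mean(signal, window=2)
# smoothed[:, 0] is [1.0, 1.5, 2.5, 3.5, 4.5]

# Changing a later sample leaves all earlier outputs unchanged
altered = signal.copy()
altered[4, 0] = 100.0
smoothed_altered = backward_rolling_mean(altered, window=2)
```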

**Note 3: Overlapping Events**
The events do overlap, and a submission can contain ones in multiple columns of the same row. For evaluation purposes, each type of event is treated as a separate binary classification task, and the overall score is the average of the scores for the individual tasks.
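
The averaging can be sketched as follows, assuming that the per-task score is the area under the ROC curve; the text above does not name the metric, so treat that choice, the function names, and the toy labels and scores as assumptions. The pairwise AUC computation is only meant for small examples, since its memory use grows with the number of positive frames times the number of negative frames.

```python
import numpy as np


def auc(y_true, y_score):
    """ROC AUC: probability that a positive frame outscores a negative one.

    Ties between a positive and a negative score count as 0.5.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def mean_columnwise_auc(Y_true, Y_score):
    """Average the per-event AUC over the columns (one column per event type)."""
    n_events = Y_true.shape[1]
    return float(np.mean([auc(Y_true[:, k], Y_score[:, k]) for k in range(n_events)]))


# Toy example with two event types; the second row has both events active
Y_true = np.array([[1, 0], [1, 1], [0, 1], [0, 0]])
Y_score = np.array([[0.9, 0.2], [0.8, 0.7], [0.3, 0.4], [0.1, 0.6]])
score = mean_columnwise_auc(Y_true, Y_score)  # (1.0 + 0.75) / 2 = 0.875
```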