<br/><font size=10>Data organization</font><br/>

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span><ul class="toc-item"><li><span><a href="#Dataset-description" data-toc-modified-id="Dataset-description-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Dataset description</a></span></li><li><span><a href="#Load-dataset" data-toc-modified-id="Load-dataset-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load dataset</a></span></li></ul></li><li><span><a href="#Resampling" data-toc-modified-id="Resampling-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Resampling</a></span><ul class="toc-item"><li><span><a href="#Resampling-regarding-sampling-rate" data-toc-modified-id="Resampling-regarding-sampling-rate-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Resampling regarding sampling rate</a></span></li><li><span><a href="#Resampling-regarding-data-size" data-toc-modified-id="Resampling-regarding-data-size-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Resampling regarding data size</a></span></li></ul></li><li><span><a href="#Filtering" data-toc-modified-id="Filtering-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Filtering</a></span></li><li><span><a href="#Reference" data-toc-modified-id="Reference-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reference</a></span></li></ul></div>

# Introduction

The dataset used in this tutorial is the EEG Motor Movement/Imagery Dataset (EEG-MMIDB)[<sup>1</sup>](#refer-anchor-1).

In this tutorial, we provide two cleaned datasets extracted from EEG-MMIDB.   
> Dataset_1: the data of the first subject in EEG-MMIDB.  
> Dataset_2: the data of all 109 subjects in EEG-MMIDB.

* Note: the models in the rest of this BCI tutorial use Dataset_1 for their examples, because it is much cheaper to compute on.

## Dataset description

The original dataset description can be found here: https://archive.physionet.org/pn4/eegmmidb/

After our cleaning and sorting, each npy file holds one subject. The data in each file has shape [N, 65]: the first 64 columns are the 64 channel features and the last column is the class label. N varies between subjects but should be either 259520 or 255680; this difference is inherent in the original dataset.

The sampling rate is 160 Hz. Some trials last 4.1 seconds and others last 4.2 seconds, so we suggest segmenting the signals into one-second windows.

4.1 seconds (656 = 4.1 x 160 instances); 4.2 seconds (672 = 4.2 x 160 instances)
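The instance counts above follow directly from the sampling rate. As a quick check, the sketch below recomputes them and also converts the two possible values of N into recording time per subject (the variable names are only illustrative):

```python
fs = 160  # sampling rate in Hz, as stated in the dataset description

# Number of instances in a trial of each duration
trial_lengths = {sec: round(sec * fs) for sec in (4.1, 4.2)}
print(trial_lengths)  # {4.1: 656, 4.2: 672}

# Total recording time (in seconds) for the two possible values of N
durations = {n: n / fs for n in (259520, 255680)}
print(durations)  # {259520: 1622.0, 255680: 1598.0}
```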

> Labels:  
0: open eyes,  
1: close eyes.  
2: left hand,   
3: right hand.  
4: imagine left hand,   
5: imagine right hand.  
6: open fists,   
7: open feet.  
8: imagine fists,   
9: imagine feet.  
10: rest.

## Load dataset

In [2]:
import numpy as np
dataset_1=np.load('1.npy')
print('The shape of Dataset_1:', dataset_1.shape)
dataset_1

The shape of Dataset_1: (259520, 65)


array([[-16, -29,   2, ..., -11,  15,   0],
       [-56, -54, -27, ...,   1,  21,   0],
       [-55, -55, -29, ...,  18,  35,   0],
       ...,
       [  0,   0,   0, ...,   0,   0,   9],
       [  0,   0,   0, ...,   0,   0,   9],
       [  0,   0,   0, ...,   0,   0,   9]], dtype=int64)

*As shown above, Dataset_1 consists of 259520 time steps; each row holds 64 channel values followed by the class label in the last column.*

We first introduce several terms about data organization:
* Instance (time step or time point). Each instance is the list of values collected at a single time point (sampling point). For example, equipment with a 160 Hz sampling rate produces 160 instances per second.

* Segment (sample). Each segment contains multiple consecutive instances that together represent a specific event or state of the brain signals. The length of a segment is called the time window. 
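To make the two terms concrete, here is a minimal sketch of cutting an [N, 65] array into one-second segments of 160 instances each. The helper name `segment`, its `window` and `step` parameters, and the random stand-in data are assumptions made for illustration; they are not part of the tutorial's own code.

```python
import numpy as np

def segment(data, window=160, step=160):
    """Split an [N, C] array into [n_segments, window, C].

    Trailing instances that do not fill a whole window are dropped.
    Assumes N >= window.
    """
    n_segments = (data.shape[0] - window) // step + 1
    return np.stack([data[i * step: i * step + window] for i in range(n_segments)])

# Stand-in data with the same column layout as Dataset_1 (64 channels + 1 label)
rng = np.random.default_rng(0)
demo = rng.integers(-50, 50, size=(1000, 65))

segments = segment(demo)
print(segments.shape)  # (6, 160, 65)
```

Setting `step` smaller than `window` would give overlapping segments. A segment that straddles a label change contains instances from two classes; this sketch does not handle that case.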


# Resampling

Resampling is a method that consists of drawing repeated samples from the original data samples[<sup>2</sup>](#refer-anchor-2). There are two types of resampling widely used:
* Downsampling
* Upsampling

Next, we introduce the implementation of resampling in different conditions.

## Resampling regarding sampling rate

With respect to the sampling rate (frequency) of the input signal, we generally need to resample in two scenarios. 

* The first is to handle multimodal signals. For example, suppose we want to combine two input signals: brain signals (EEG) sampled at 1000 Hz and heartbeat signals (ECG) sampled at 20 Hz. To integrate them, we need to bring both to a common sampling rate, say 200 Hz. The practical operation is straightforward: keep one instance out of every five consecutive EEG instances (downsampling), where five is 1000/200; at the same time, repeat each ECG instance 10 times (upsampling), where 10 is 200/20. 


* Resampling also applies when the input signals have a much higher sampling frequency than we need. Take EEG data sampled at 1000 Hz as an example: the most useful information in EEG lies below 70 Hz, so by the Nyquist sampling theorem a sampling rate of at least about 140 Hz is enough to capture it. To reduce computational cost, it is acceptable, though not compulsory, to downsample the original 1000 Hz signal to a lower rate (e.g., 200 Hz). 
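The multimodal example can be sketched in a few lines of NumPy. The arrays below are synthetic one-second stand-ins for the EEG and ECG streams, and the variable names are illustrative only:

```python
import numpy as np

eeg = np.arange(1000)  # stand-in for 1 second of EEG at 1000 Hz
ecg = np.arange(20)    # stand-in for 1 second of ECG at 20 Hz

# Downsampling 1000 Hz -> 200 Hz: keep one instance out of every five
eeg_200 = eeg[::5]

# Upsampling 20 Hz -> 200 Hz: repeat each instance ten times
ecg_200 = np.repeat(ecg, 10)

print(eeg_200.shape, ecg_200.shape)  # (200,) (200,)
```

Plain slicing applies no low-pass filter, so any content above half the new sampling rate would alias into the result. When that matters, a routine that filters before discarding instances, such as `scipy.signal.decimate`, is the usual alternative.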

## Resampling regarding data size

This kind of resampling aims to balance the classes in a dataset (in both the training and the testing set). For example, suppose we have 1000 positive samples but only 200 negative samples, a situation generally called class imbalance in machine learning. A classifier can then reach an accuracy of about 83% (1000/1200) by blindly predicting the most frequent class, i.e., labelling every sample as positive. To address this issue, there are three typical solutions: 
1. Downsampling the most frequent class until a more balanced distribution is reached (e.g., randomly selecting examples from the majority class and deleting them from the training dataset). 
2. Upsampling the least frequent class (e.g., randomly duplicating examples from the minority class). 
3. Using other evaluation metrics, such as ROC curves and ROC AUC (area under the ROC curve), which are less sensitive to class imbalance. 
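The first two options can be sketched with plain NumPy indexing. The data here is synthetic (1000 positive and 200 negative samples with four made-up features), and all variable names are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1200, 4))           # synthetic features
y = np.array([1] * 1000 + [0] * 200)     # 1000 positive, 200 negative labels

# Accuracy of always predicting the majority (positive) class
print(f"{np.mean(y == 1):.3f}")  # 0.833

pos_idx = np.flatnonzero(y == 1)
neg_idx = np.flatnonzero(y == 0)

# Option 1: randomly keep only as many majority samples as there are minority samples
keep_pos = rng.choice(pos_idx, size=neg_idx.size, replace=False)
down_idx = np.concatenate([keep_pos, neg_idx])
X_down, y_down = X[down_idx], y[down_idx]
print(np.bincount(y_down))  # [200 200]

# Option 2: randomly duplicate minority samples until both classes have equal size
extra_neg = rng.choice(neg_idx, size=pos_idx.size, replace=True)
up_idx = np.concatenate([pos_idx, extra_neg])
X_up, y_up = X[up_idx], y[up_idx]
print(np.bincount(y_up))  # [1000 1000]
```

Downsampling discards data, while upsampling by duplication repeats the same minority samples many times; which trade-off is acceptable depends on the dataset.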

*In this tutorial, we do not need resampling, since our data has an appropriate sampling frequency and is already balanced.*

# Filtering

Sometimes we need to filter the EEG signals into frequency bands of interest in order to find the most informative and distinguishable patterns. 

> The EEG signals collected from any typical EEG hardware have several nonoverlapping frequency bands (Delta, Theta, Alpha, Beta, and Gamma) based on the strong intraband correlation with a distinct behavioral state. Each EEG pattern contains signals associated with particular brain information. The degree of awareness denotes the perception of individuals when presented with external stimuli. It is mainly defined in physiology instead of psychology.[<sup>3</sup>](#refer-anchor-3)  

Each frequency band represents a brain state and a qualitative assessment of awareness:
* Delta pattern (0.5--4 Hz) corresponds to deep sleep when the subject has lower awareness.  
* Theta pattern (4--8 Hz) corresponds to light sleep in the realm of low awareness.   
* Alpha pattern (8--12 Hz) mainly occurs during eyes closed and deeply relaxed state and corresponds to the medium awareness. 
* Beta pattern (12--30 Hz) is the dominant rhythm while the eyes of the subject are open and is associated with high awareness. Beta patterns capture most of our daily activities (such as eating, walking, and talking). 
* Gamma pattern (30--100 Hz) represents the co-interaction of several brain areas to carry out a specific motor and cognitive function. 

_Next, we provide filtering code that takes the Delta band as an example (this tutorial does not need filtering; the code is provided in case readers need it). The filter used here is a third-order Butterworth band-pass filter. Adjust the sampling frequency (fs), lowcut, and highcut to customize your own filter._ 

In [3]:
import myimporter  # active import from inside notebook
from BCI_functions import *  # BCI_functions.ipynb contains some functions we might use multiple times in this tutorial 

n_fea = 64  # 64 channels
label = dataset_1[:, n_fea: n_fea+1]  # separate the label column from the features
feature = dataset_1[:, 0:n_fea] 
feature_f=[]  # features after filtering

# EEG Delta pattern decomposition
fs = 160.0  # sampling frequency in Hz
lowcut = 0.5  # lower edge of the Delta band in Hz
highcut = 4.0  # upper edge of the Delta band in Hz
for i in range(feature.shape[1]):  # filter each channel independently
    x = feature[:, i]
    y = butter_bandpass_filter(x, lowcut, highcut, fs, order=3)
    feature_f.append(y) 
    
feature_f=np.array(feature_f).T
print('The shape of filtered feature:',feature_f.shape)

data_f=np.hstack((feature_f,label))  # stack label to filtered feature 
print("The shape of dataset_1 after filtering:",data_f.shape)

importing Jupyter notebook from BCI_functions.ipynb
The shape of filtered feature: (259520, 64)
The shape of dataset_1 after filtering: (259520, 65)
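`butter_bandpass_filter` is imported from BCI_functions.ipynb, which is not reproduced in this section. For readers who want a standalone version, a helper with the same call signature can be sketched with SciPy as below. This is an assumption about what such a helper might look like, not the actual code in BCI_functions.ipynb; in particular, the second-order-sections form and filtering along axis 0 are choices made here.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def butter_bandpass_filter(data, lowcut, highcut, fs, order=3):
    """Band-pass filter `data` along axis 0 with a Butterworth filter.

    Hypothetical stand-in for the helper in BCI_functions.ipynb.
    """
    nyq = 0.5 * fs  # Nyquist frequency
    # Cutoffs are normalized by the Nyquist frequency; 'sos' output is
    # numerically more robust than (b, a) coefficients for narrow bands.
    sos = butter(order, [lowcut / nyq, highcut / nyq], btype='band', output='sos')
    return sosfilt(sos, data, axis=0)

# Synthetic check: a 2 Hz component (inside the Delta band) plus a 40 Hz component
fs = 160.0
t = np.arange(1600) / fs  # 10 seconds at 160 Hz
x = np.sin(2 * np.pi * 2 * t) + np.sin(2 * np.pi * 40 * t)
y = butter_bandpass_filter(x, 0.5, 4.0, fs, order=3)
print(x.shape, y.shape)  # (1600,) (1600,)
```

Because the sketch filters along axis 0, an [N, 64] feature array could also be passed in directly to filter all channels in one call.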


# Reference

<div id="refer-anchor-1"></div>

- [1]  [Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 101(23):e215-e220](http://circ.ahajournals.org/cgi/content/full/101/23/e215)

<div id="refer-anchor-2"></div>

- [2]  [Resampling Imbalanced Data and Applying Machine Learning Techniques](https://medium.com/better-programming/resampling-imbalanced-data-and-applying-ml-techniques-91ebce40ff4d)

<div id="refer-anchor-3"></div>

- [3]  [Zhang X, Yao L, Wang X, et al. A survey on deep learning-based non-invasive brain signals: recent advances and new frontiers[J]. Journal of Neural Engineering, 2020.](https://arxiv.org/abs/1905.04149)

