## Introduction

One thing we must always keep in mind: we never have enough data.

Whether we use deep learning or classical machine learning, model performance depends on the data. The quantity and quality of the data are the decisive factors, and a necessary prerequisite, for reaching the results we want.

Note: 

* Because of the limited internship time and laboratory computing resources, I could not test every data augmentation method or measure model performance after each one;
* I will focus on the alphabet vision/imagination data;
* Some data augmentation methods can be combined.


### A review paper

See more information in this paper:

Data Augmentation for Deep Neural Networks Model in EEG Classification Task: A Review
https://www.frontiersin.org/articles/10.3389/fnhum.2021.765525/full


PS: Almost none of these papers publish or share their code
(although some open-source implementations can still be found on GitHub).

### Import packages

In [7]:
import os
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# our own pipeline
# from pipelines.tools import power_band, one_signal_band_power, generate_feature_dict
from pipelines.data_prapare import read_power_band_txt,read_features_table, read_signal_data
from pipelines.ml_functions import prepare_signals,set_seed, clean_all_feature_table
from pipelines.ml_functions import  print_performance, evaluate_model, model_evaluation_dict

In [8]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier


from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import plot_confusion_matrix, accuracy_score, f1_score, recall_score, precision_score

In [9]:
import warnings
warnings.filterwarnings("ignore")

set_seed(42)

## Our ways


### Add or reduce noise

We can add noise to, or remove noise from, the signal of a single channel to obtain more data.

In theory, adding noise improves the generalization ability and robustness of the model. However, given the particular nature of EEG signals, added noise introduces many uncontrollable factors and requires many comparative experiments, so we will not consider this kind of data augmentation for now.
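For reference only, the single-channel idea could be sketched as follows. This is a minimal illustration and not something tested in this notebook: the helper name `add_noise_to_channel`, the Gaussian noise model, and the `noise_ratio` of 5% of the channel's standard deviation are all assumptions, and the random array only stands in for a real `(channels, samples)` trial.

```python
import numpy as np

def add_noise_to_channel(trial, channel, noise_ratio=0.05, rng=None):
    """Return a copy of a (channels, samples) trial with Gaussian noise
    added to one channel only (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = trial.astype(float, copy=True)
    # Scale the noise relative to the spread of the chosen channel.
    sigma = noise_ratio * noisy[channel].std()
    noisy[channel] += rng.normal(0.0, sigma, size=noisy.shape[-1])
    return noisy

rng = np.random.default_rng(42)
trial = rng.normal(size=(16, 346))   # stand-in for one 16-channel trial
augmented = add_noise_to_channel(trial, channel=0, rng=rng)
```

Only channel 0 changes here; the remaining 15 channels are copied unchanged, so each call yields one extra sample per original trial.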

### Split signals

Static object recognition differs from motor imagery: within our 3-second signal, whether we look at the first, second, or last second, the brain is perceiving the same static object throughout. Based on this idea, directly segmenting the original data is also a plausible data augmentation method.

From the previous chapters, we know the length of our signals: the shortest has 346 samples and the longest has 420+. We therefore need a segment length that keeps each piece long enough to preserve quality, which in practice means dividing each signal into 2, 3, or 4 segments.

Note: Splitting the data does give us more samples, but it may introduce another problem: data quality.

Why? During the first second of each recording, the experimenter (that is, me) goes through a very brief transition from the previous letter to the next one, and this transition may affect the quality of that segment. (That said, slightly lower data quality can also act as a form of robustness enhancement.)

In addition, segments processed this way carry much less temporal information than the complete signal interval.
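To make the length trade-off concrete, here is a small sketch of how long each piece is for the shortest signal (346 samples) when it is cut into 2, 3, or 4 segments. The helper `split_trial` is hypothetical; it follows the same rule as the segmentation loop in this section (equal-width pieces, with the last piece absorbing any leftover samples), and the zero array only stands in for a real trial.

```python
import numpy as np

def split_trial(trial, n_pieces):
    """Split a (channels, samples) array into n_pieces along the time axis.
    The last piece keeps any samples left over after integer division."""
    seg = trial.shape[-1] // n_pieces
    pieces = [trial[:, k * seg:(k + 1) * seg] for k in range(n_pieces - 1)]
    pieces.append(trial[:, (n_pieces - 1) * seg:])
    return pieces

shortest = np.zeros((16, 346))       # stand-in for the shortest 16-channel trial
segment_lengths = {
    n: [piece.shape[-1] for piece in split_trial(shortest, n)]
    for n in (2, 3, 4)
}
print(segment_lengths)
```

With four pieces, the shortest trial yields segments of 86, 86, 86, and 88 samples, so segments are not all the same length and features should not assume a fixed segment size.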

In [10]:
aat_vis_signal, aat_img_signal, asl_vis_signal, asl_img_signal = read_signal_data()
labels = np.array(aat_vis_signal['label_index'])

Divide the signal directly into four parts: 

In [11]:
ch_names=['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6', 'ch7', 'ch8','ch9', 'ch10',
          'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16']
seg_labels = list()
seg_aat_vis = list()
seg_aat_img = list()
seg_piece = 4

for i in range(aat_vis_signal.shape[0]):
    current_vis = list()
    current_img = list()
    for name in ch_names:
        current_vis.append(np.array(eval(aat_vis_signal[name][i])))
        current_img.append(np.array(eval(aat_img_signal[name][i])))
    current_vis = np.array(current_vis)
    current_img = np.array(current_img)
    seg = int(current_vis.shape[-1]/seg_piece)
    seg_2 = int(current_img.shape[-1]/seg_piece)
    
    seg_aat_vis.extend([current_vis[:, 0:seg], current_vis[:,seg:seg*2], current_vis[:,seg*2:seg*3], current_vis[:, seg*3:]])
    seg_aat_img.extend([current_img[:, 0:seg_2], current_img[:,seg_2:seg_2*2], current_img[:,seg_2*2:seg_2*3], current_img[:,seg_2*3:]])
    seg_labels.extend([labels[i], labels[i], labels[i], labels[i]])

Store the segmented data in the following folders:
* seg_eeg_signals
* seg_eeg_features
* seg_band_powers

In [12]:
len(seg_aat_img)

3328

In [13]:
feature_path = "./data/EEG_features_Lintao/"  
signal_path = feature_path + "seg_eeg_signals/"
bp_feature_path = feature_path + "seg_band_powers/"
all_feature_path = feature_path + "seg_eeg_features/"

In [18]:
seg_aat_img[1][1]

array([-6452.41218238, -6456.32373766, -6463.92333077, -6460.68232783,
       -6454.35678415, -6448.88060676, -6438.75526652, -6450.66874631,
       -6437.81649325, -6444.79023752, -6455.51907486, -6444.85729275,
       -6446.62308057, -6447.27128116, -6445.63960381, -6455.36261265,
       -6436.16246416, -6436.47538858, -6438.53174907, -6431.60270829,
       -6444.52201659, -6445.90782474, -6441.01279271, -6448.0982957 ,
       -6456.41314464, -6459.54238886, -6446.91365324, -6450.60169108,
       -6456.63666208, -6454.55794985, -6442.28684214, -6434.86606298,
       -6444.56672008, -6448.94766199, -6440.34224037, -6452.56864459,
       -6448.00888872, -6444.92434799, -6442.24213865, -6439.69403979,
       -6447.13717069, -6443.60559506, -6439.35876362, -6442.01862121,
       -6448.94766199, -6451.36165039, -6447.71831604, -6448.65708931,
       -6451.36165039, -6454.22267368, -6453.97680449, -6448.07594396,
       -6446.01958347, -6455.92140626, -6447.67361256, -6451.1157812 ,
      

In [19]:
aat_img = dict()
aat_vision = dict()

count = 0 
for i in range(len(seg_aat_img)):

    if count == 0:
        aat_img['label_index'] = [seg_labels[i]]
        aat_vision['label_index'] = [seg_labels[i]]
        
        for j in range(16):
            aat_img[ch_names[j]] = [seg_aat_img[i][j,:]]
            aat_vision[ch_names[j]] = [seg_aat_vis[i][j,:]]

    else:
        aat_img['label_index'].append(seg_labels[i])
        aat_vision['label_index'].append(seg_labels[i])
        for j in range(16):
            aat_img[ch_names[j]].append(seg_aat_img[i][j,:])
            aat_vision[ch_names[j]].append(seg_aat_vis[i][j,:])
    
    count += 1

# positions_need = positions.iloc[:16,:]
aat_img = pd.DataFrame(aat_img)
aat_img.to_csv(signal_path+"aat_img.csv")


# positions_need = positions.iloc[:16,:]
aat_vision = pd.DataFrame(aat_vision)
aat_vision.to_csv(signal_path+"aat_vision.csv")
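One thing to watch when saving array-valued cells to CSV: the loading loop earlier in this section parses cells with `eval`, which suggests the signal files store each channel as a list-like string. NumPy arrays written directly may be saved in their printed, space-separated form instead, which would not parse the same way. A hedged sketch of a round trip that avoids this, using toy values, an in-memory buffer in place of a real file, and `ast.literal_eval` as a safer stand-in for `eval`:

```python
import ast
import io

import numpy as np
import pandas as pd

# Toy stand-ins for two segments of one channel.
segments = [np.array([1.5, -2.0, 3.25]), np.array([0.0, 4.0, -1.0])]
df = pd.DataFrame({
    "label_index": [9, 6],
    # Store plain Python lists so each cell is written as "[1.5, -2.0, 3.25]".
    "ch1": [seg.tolist() for seg in segments],
})

buffer = io.StringIO()               # in-memory replacement for a CSV file
df.to_csv(buffer, index=False)       # index=False avoids an "Unnamed: 0" column
buffer.seek(0)

loaded = pd.read_csv(buffer)
# Parse the list-like strings back into NumPy arrays.
loaded["ch1"] = loaded["ch1"].apply(lambda s: np.array(ast.literal_eval(s)))
```

Whether the pipeline's reader functions expect this exact format is an assumption; the point is only that the writer and the reader must agree on how an array is represented as text.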

In [20]:
aat_img.head()

Unnamed: 0,label_index,ch1,ch2,ch3,ch4,ch5,ch6,ch7,ch8,ch9,ch10,ch11,ch12,ch13,ch14,ch15,ch16
0,9,"[-3344.111543191856, -3345.095019947889, -3325...","[-6450.624042823797, -6441.549234574943, -6431...","[2805.2556878633127, 2805.5239087967766, 2806....","[-140.2571964570518, -140.65952785724735, -132...","[-2027.6608500076356, -2026.4091523181385, -20...","[-4437.335364500924, -4434.317878999457, -4429...","[-4727.192786597346, -4728.198615097835, -4736...","[-4534.788970326063, -4532.665554602809, -4538...","[7048.041468625244, 7049.561387248205, 7053.80...","[-1243.1593231152683, -1244.7015934826843, -12...","[3767.185362241908, 3765.084298263109, 3767.85...","[10897.905933607331, 10887.802945113532, 10908...","[-3521.3161732335298, -3517.985763309689, -351...","[-2543.5391120361223, -2534.486655531723, -252...","[-2891.376959249611, -2888.6053429371527, -288...","[1188.978694555604, 1193.0690637909252, 1193.4..."
1,9,"[-3349.610072327861, -3360.80829629997, -3361....","[-6452.412182380222, -6456.3237376599, -6463.9...","[2811.625935033076, 2812.184728644458, 2811.60...","[-143.92288254772217, -144.54873139247078, -14...","[-2018.5860417587808, -2014.0709893788087, -20...","[-4442.588024447921, -4443.035059337028, -4445...","[-4732.736019222262, -4730.567900010097, -4733...","[-4544.579034397488, -4541.583900640476, -4546...","[7068.962701435412, 7076.428184083484, 7072.71...","[-1329.8170363684935, -1296.6023441079074, -13...","[3775.768432112746, 3776.2825222352176, 3774.7...","[10882.125602021886, 10862.746639579134, 10886...","[-3517.672838887315, -3512.867213829424, -3515...","[-2549.417620827868, -2534.777228209642, -2546...","[-2893.0756918282145, -2890.035854582293, -289...","[1187.4364241881876, 1189.068101533425, 1187.8..."
2,9,"[-3365.18923821321, -3385.1940495007098, -3385...","[-6496.154546279257, -6507.57628769592, -6519....","[2818.756141514318, 2818.3091066252123, 2816.8...","[-151.03073728450985, -154.67407163072488, -15...","[-2000.4364252610717, -1994.1779368135856, -20...","[-4451.461666996678, -4448.779457662041, -4449...","[-4737.519292535698, -4731.864301188505, -4735...","[-4553.005642057138, -4547.484761176677, -4551...","[7069.007404924322, 7072.851904970635, 7068.71...","[-935.9345955770726, -915.862729056207, -930.5...","[3784.1279845390304, 3783.233914760818, 3781.0...","[10844.798188781524, 10850.967270251189, 10834...","[-3515.2364987416863, -3513.31424871853, -3514...","[-2537.817065455564, -2528.876367673441, -2528...","[-2894.1485755620693, -2892.6510086835638, -28...","[1183.3460549528666, 1186.072967776414, 1186.7..."
3,9,"[-3362.305863178476, -3379.472002920151, -3376...","[-6458.223635938602, -6473.9592640351375, -648...","[2839.476208624388, 2839.543263857754, 2839.49...","[-155.83636234240083, -166.20757176966333, -16...","[-1996.0778350922867, -1987.5394687103596, -19...","[-4460.402364778801, -4460.514123501078, -4461...","[-4744.671850761396, -4736.066429146103, -4741...","[-4566.260226519135, -4560.3817177273895, -456...","[7057.92093967449, 7066.97339617889, 7063.2630...","[-1195.3712934698217, -1181.2449909740676, -12...","[3804.736292926824, 3807.574964472648, 3803.64...","[10839.47847360116, 10854.208273197208, 10826....","[-3490.1354897183764, -3485.732196060681, -348...","[-2550.0211179281614, -2538.8675974449634, -25...","[-2894.997941851371, -2892.8521743836613, -289...","[1178.495726406065, 1179.6580171177409, 1180.2..."
4,6,"[-3458.060736425011, -3449.29885259853, -3464....","[-6607.600344133419, -6605.611038876897, -6616...","[2855.7482785878515, 2855.547112887754, 2855.5...","[-177.74107190860175, -175.57295269643697, -18...","[-1926.8321307697452, -1930.587223838237, -192...","[-4464.313920058479, -4463.129277602348, -4466...","[-4765.257807404734, -4766.978891727793, -4762...","[-4585.19215407278, -4586.712072695741, -4582....","[7106.424225142507, 7101.3280274066965, 7105.3...","[-731.1032093886387, -734.9030059460409, -735....","[3812.291182552717, 3813.028790119743, 3812.71...","[10726.535108868493, 10736.101655495366, 10725...","[-3481.351254147441, -3486.2015826942425, -348...","[-2576.0161967296835, -2584.554563111611, -258...","[-2899.244773297879, -2902.262258799345, -2900...","[1177.7357670945844, 1176.0146827715257, 1177...."


In [21]:
aat_img['ch1'][0]

array([-3344.11154319, -3345.09501995, -3325.87251972, -3326.43131333,
       -3318.38468532, -3322.02801967, -3341.60814781, -3338.47890359,
       -3354.12512471, -3361.12122072, -3346.54788334, -3371.87240981,
       -3378.91320931, -3391.76546237, -3400.66145666, -3398.13570954,
       -3416.66530569, -3416.71000918, -3429.38344829, -3425.33778254,
       -3427.81882618, -3444.80615196, -3433.33970706, -3442.97330892,
       -3448.8965212 , -3443.2638816 , -3453.90331196, -3440.73813447,
       -3454.68562301, -3457.16666665, -3453.99271893, -3460.87705623,
       -3456.42905908, -3464.72155627, -3461.32409112, -3451.1093439 ,
       -3457.07725967, -3452.85277997, -3461.18998065, -3459.49124807,
       -3448.78476248, -3463.08987893, -3446.14725663, -3452.29398636,
       -3443.666213  , -3436.69246873, -3431.59627099, -3421.82855866,
       -3429.31639306, -3424.19784358, -3418.52050048, -3417.62643071,
       -3403.90245961, -3402.76252064, -3407.05405558, -3392.25720075,
      

Create three functions in the pipelines folder to read this data back:

* read_seg_power_band_txt
* read_seg_features_table
* read_seg_signal_data

### Test our new seg data

Redefine our classifier initialization function:

In [4]:
def init_classifiers():
    """
    Initialize our machine learning classifier ---
    where catboost and NN (neural network classification) are not initialized,
    and most hyperparameters will take default values

    """

    model_names = ['SVM', 'LR', 'KNN', 'GBDT', 'DT', 'AdaB', 'RF', 'XGB', 'LGB', 'Catboost', 'NN']

    # the training parameters of each model
    param_grid_svc = [{}]
    param_grid_logistic = [{'C': [0.1], 'penalty': ['l1', 'l2']}]
    param_grid_knn = [{}, {'n_neighbors': list(range(3, 8))}]
    param_grid_gbdt = [{}]
    param_grid_tree = [{}]
    param_grid_boost = [{}]
    param_grid_rf = [{}]

    return ([(SVC(), model_names[0], param_grid_svc),
             (LogisticRegression(), model_names[1], param_grid_logistic),
             (KNeighborsClassifier(), model_names[2], param_grid_knn),
             (DecisionTreeClassifier(), model_names[4], param_grid_tree),
             (AdaBoostClassifier(), model_names[5], param_grid_boost),
             (RandomForestClassifier(), model_names[6], param_grid_rf)])
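Each tuple returned above has the form `(estimator, name, param_grid)`. The pipeline's `model_evaluation_dict` is not shown in this notebook, so the following is only a guess at how such a tuple might be consumed: a hypothetical `evaluate_with_grid` that runs a grid search on a stratified train split and reports a few of the same metrics. The split ratio, `cv=3`, and the synthetic data are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

def evaluate_with_grid(X, y, estimator, name, param_grid, seed=42):
    """Grid-search one classifier and report train/test metrics (sketch)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed)
    # An empty dict in param_grid means "use the estimator's defaults".
    search = GridSearchCV(estimator, param_grid, cv=3, scoring="accuracy")
    search.fit(X_tr, y_tr)
    y_pred = search.predict(X_te)
    return {
        "Classifier": name,
        "param": search.best_params_ or "default",
        "Training score": search.score(X_tr, y_tr),
        "Test Score": accuracy_score(y_te, y_pred),
        "F1 Score(Macro)": f1_score(y_te, y_pred, average="macro"),
        "F1 Score(Micro)": f1_score(y_te, y_pred, average="micro"),
    }

# Small synthetic 4-class problem standing in for the EEG features.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
result = evaluate_with_grid(X, y, KNeighborsClassifier(), "KNN",
                            [{}, {"n_neighbors": list(range(3, 8))}])
```

The dictionary keys mirror some of the result columns used in this chapter so that a list of such dictionaries can be passed straight to `pd.DataFrame`.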

In [5]:
from pipelines.data_prapare import read_seg_power_band_txt,read_seg_features_table, read_seg_signal_data

In [8]:
aat_vis, aat_img = read_seg_features_table()
bp_data_dict = read_seg_power_band_txt()

# 26 * 32 * 4 
labels_1 = np.array(aat_vis['label_index'])
# 26 * 32 * 4 * 2
labels_2 = np.concatenate((labels_1, labels_1), axis=0)


# for the feature analyse
col_name = list(aat_vis.columns)[1:]
# col_name

bp_data_dict.keys()

dict_keys(['welch_alphabet_imagination', 'welch_alphabet_vision'])

In [9]:
# for the feature analyse
ch_names=['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6', 'ch7', 'ch8','ch9', 'ch10',
              'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16']
band_name = ["δ-delta" , "θ-theta" , "α-alpha" , "β-beta" , "γ-gamma"]   
bp_col_names = [i+'_'+j for i in ch_names for j in band_name]

One known problem here: the column slice should start from 1, not 2:

In [10]:
# read and clean data
aat_vis = clean_all_feature_table(aat_vis.iloc[:, 2:])
aat_img = clean_all_feature_table(aat_img.iloc[:, 2:])

# bp feature data
bp_aat_img = np.array(bp_data_dict['welch_alphabet_imagination']).reshape(-1,80)
bp_aat_vis = np.array(bp_data_dict['welch_alphabet_vision']).reshape(-1,80)
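The `reshape(-1, 80)` relies on 16 channels times 5 bands giving 80 values per trial. The sketch below shows how the flattened column index relates to `bp_col_names`, under the assumption (not confirmed by the source) that the band-power values are stored channel by channel, with the five bands of each channel next to each other. The `arange` data is only a placeholder.

```python
import numpy as np

ch_names = [f"ch{i}" for i in range(1, 17)]
band_name = ["δ-delta", "θ-theta", "α-alpha", "β-beta", "γ-gamma"]
bp_col_names = [c + "_" + b for c in ch_names for b in band_name]

n_trials = 3
# Placeholder band powers with shape (trials, channels, bands).
bp = np.arange(n_trials * 16 * 5, dtype=float).reshape(n_trials, 16, 5)
flat = bp.reshape(-1, 80)

# Under the channel-major assumption, column = channel * 5 + band.
ch, band = 2, 3
col = ch * len(band_name) + band
same_value = flat[1, col] == bp[1, ch, band]
column_label = bp_col_names[col]
```

If the stored order were band-major instead, the same reshape would still run without error but the column names would no longer match the values, so this ordering is worth checking against the file format.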

In [12]:
aat_vis.shape

(3328, 1128)

In [13]:
labels_1.shape

(3328,)

In [14]:
bp_aat_vis.shape

(3328, 80)

#### BP features alphabet vision 26 Classification 

In [15]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(bp_aat_vis, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SVM | default | 0.047231 | 0.027027 | 0.041166 | 0.043366 | 0.001048 | 0.027027 | 0.038462 | 0.027027 | 0.00204 | 0.027027 |
| 1 | AdaB | default | 0.075569 | 0.065065 | 0.072416 | 0.066981 | 0.021759 | 0.065065 | 0.067553 | 0.065065 | 0.023422 | 0.065065 |
| 2 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.076857 | 0.063063 | 0.072716 | 0.06699 | 0.033897 | 0.063063 | 0.065733 | 0.063063 | 0.031914 | 0.063063 |
| 3 | KNN | {'n_neighbors': 7} | 0.287248 | 0.071071 | 0.222356 | 0.074277 | 0.068267 | 0.071071 | 0.069957 | 0.071071 | 0.062033 | 0.071071 |
| 4 | DT | default | 1.0 | 0.079079 | 0.723558 | 0.072122 | 0.080486 | 0.079079 | 0.080107 | 0.079079 | 0.07755 | 0.079079 |
| 5 | RF | default | 1.0 | 0.12012 | 0.735877 | 0.104767 | 0.09678 | 0.12012 | 0.118058 | 0.12012 | 0.098789 | 0.12012 |
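A note on reading the result tables in this section: in every row the three micro-averaged columns equal the test score. For single-label multiclass problems this is expected, because micro-averaged precision, recall, and F1 all reduce to accuracy, so the macro columns are the more informative ones. The toy example below (hand-made labels, hypothetical helper `micro_macro_f1`) illustrates this with a classifier that always predicts the same class.

```python
import numpy as np

def micro_macro_f1(y_true, y_pred):
    """Compute micro- and macro-averaged F1 from per-class counts."""
    classes = np.unique(np.concatenate((y_true, y_pred)))
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in classes])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in classes])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in classes])
    # Micro: pool the counts over all classes, then compute one F1.
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # Macro: compute F1 per class (0 when undefined), then average.
    denom = 2 * tp + fp + fn
    per_class = np.divide(2 * tp, denom,
                          out=np.zeros(len(classes)), where=denom > 0)
    return micro, per_class.mean()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 0, 0, 0])    # always predicts class 0
accuracy = np.mean(y_true == y_pred)
micro_f1, macro_f1 = micro_macro_f1(y_true, y_pred)
```

Here accuracy and micro F1 are both 1/3, while macro F1 drops to 1/6. A macro recall close to 1/26 together with a very low macro F1 in a results row is consistent with a model that has collapsed onto a single letter.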


#### BP features alphabet vis+img 26 Classification 

In [16]:
data_bp = np.concatenate((bp_aat_img, bp_aat_vis), axis=0)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_bp, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SVM | default | 0.045718 | 0.031047 | 0.041316 | 0.041641 | 0.006003 | 0.031047 | 0.038259 | 0.031047 | 0.003086 | 0.031047 |
| 1 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.078343 | 0.069604 | 0.075721 | 0.066107 | 0.049513 | 0.069604 | 0.068083 | 0.069604 | 0.034051 | 0.069604 |
| 2 | AdaB | default | 0.080489 | 0.068102 | 0.076773 | 0.074266 | 0.043494 | 0.068102 | 0.06952 | 0.068102 | 0.03604 | 0.068102 |
| 3 | KNN | default | 0.311011 | 0.077116 | 0.240835 | 0.065468 | 0.088277 | 0.077116 | 0.075631 | 0.077116 | 0.07314 | 0.077116 |
| 4 | DT | default | 1.0 | 0.082624 | 0.72476 | 0.089503 | 0.083831 | 0.082624 | 0.085381 | 0.082624 | 0.083608 | 0.082624 |
| 5 | RF | default | 1.0 | 0.123686 | 0.737079 | 0.131575 | 0.108094 | 0.123686 | 0.123407 | 0.123686 | 0.108008 | 0.123686 |


Because even Welch, our best band-power method so far, performs poorly here, I will not test FFT and multitaper.

#### All features alphabet vision 26 Classification 

In [17]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(aat_vis, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaB | default | 0.053242 | 0.042042 | 0.04988 | 0.06311 | 0.009253 | 0.042042 | 0.048853 | 0.042042 | 0.011936 | 0.042042 |
| 1 | SVM | default | 0.169601 | 0.086086 | 0.144531 | 0.106042 | 0.044464 | 0.086086 | 0.088036 | 0.086086 | 0.050579 | 0.086086 |
| 2 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.390726 | 0.136136 | 0.314303 | 0.129231 | 0.106433 | 0.136136 | 0.133081 | 0.136136 | 0.110994 | 0.136136 |
| 3 | KNN | {'n_neighbors': 3} | 0.589953 | 0.24024 | 0.484976 | 0.237419 | 0.296238 | 0.24024 | 0.243775 | 0.24024 | 0.238215 | 0.24024 |
| 4 | DT | default | 1.0 | 0.38038 | 0.814002 | 0.308271 | 0.382664 | 0.38038 | 0.379463 | 0.38038 | 0.374872 | 0.38038 |
| 5 | RF | default | 1.0 | 0.57958 | 0.873798 | 0.534995 | 0.590739 | 0.57958 | 0.578801 | 0.57958 | 0.574677 | 0.57958 |


#### All features alphabet vis+img 26 Classification 

In [18]:
data_all = np.concatenate((aat_img, aat_vis), axis=0)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_all, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AdaB | default | 0.103456 | 0.086129 | 0.098257 | 0.0719 | 0.022683 | 0.086129 | 0.093731 | 0.086129 | 0.033477 | 0.086129 |
| 1 | SVM | default | 0.179652 | 0.10015 | 0.155799 | 0.091008 | 0.074459 | 0.10015 | 0.098961 | 0.10015 | 0.069555 | 0.10015 |
| 2 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.308865 | 0.121683 | 0.252704 | 0.132221 | 0.105045 | 0.121683 | 0.119293 | 0.121683 | 0.100875 | 0.121683 |
| 3 | KNN | {'n_neighbors': 3} | 0.58446 | 0.21983 | 0.47506 | 0.213568 | 0.250737 | 0.21983 | 0.220398 | 0.21983 | 0.215108 | 0.21983 |
| 4 | DT | default | 1.0 | 0.522283 | 0.856671 | 0.482082 | 0.523828 | 0.522283 | 0.522434 | 0.522283 | 0.521087 | 0.522283 |
| 5 | RF | default | 1.0 | 0.796194 | 0.938852 | 0.738792 | 0.80623 | 0.796194 | 0.799565 | 0.796194 | 0.797076 | 0.796194 |


Check the remaining models. Three were planned (GBDT, XGB, LGB), but only GBDT and XGB are run below:

In [28]:
def init_classifiers_left():
    """
    Initialize our machine learning classifier ---
    where catboost and NN (neural network classification) are not initialized,
    and most hyperparameters will take default values

    """

    model_names = ['SVM', 'LR', 'KNN', 'GBDT', 'DT', 'AdaB', 'RF', 'XGB', 'LGB', 'Catboost', 'NN']

    # the training parameters of each model
    param_grid_svc = [{}]
    param_grid_logistic = [{'C': [0.1], 'penalty': ['l1', 'l2']}]
    param_grid_knn = [{}, {'n_neighbors': list(range(3, 8))}]
    param_grid_gbdt = [{}]
    param_grid_tree = [{}]
    param_grid_boost = [{}]
    param_grid_rf = [{}]
    param_grid_xgb = [{}]
    param_grid_lgb = [{}]

    return ([ (GradientBoostingClassifier(), model_names[3], param_grid_gbdt),
             (xgb.XGBClassifier(), model_names[7], param_grid_xgb)
             ])

This part of the code takes a long time to run.

Even without LGB, XGB alone takes at least 10-12 hours on CPU, so unless these models perform really well, we will not use them when testing sign language.

Of course, it is much faster with a GPU.


In [29]:
data_all = np.concatenate((aat_img, aat_vis), axis=0)

res_list = []
best_score = 0
classifiers = init_classifiers_left()
for i in classifiers: 
    results= model_evaluation_dict(data_all, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | GBDT | default | 0.952136 | 0.311467 | 0.759916 | 0.296205 | 0.314927 | 0.311467 | 0.310919 | 0.311467 | 0.308228 | 0.311467 |
| 1 | XGB | default | 1.0 | 0.481723 | 0.844501 | 0.402448 | 0.487494 | 0.481723 | 0.483239 | 0.481723 | 0.476254 | 0.481723 |


Even these more complex models do not improve test accuracy: GBDT (0.311) and XGB (0.482) both score well below the random forest (0.796) on the same data.

#### All features alphabet vis+img 4(A/B/C/D) Classification

In [22]:
indexs_4 = list()
data_4 = list()
for i in range(aat_vis.shape[0]):
    if labels_1[i] in [1,2,3,4]:
        indexs_4.append(labels_1[i])
        indexs_4.append(labels_1[i])
        data_4.append(aat_vis[i,:])
        data_4.append(aat_img[i,:])

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_4, indexs_4, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SVM | default | 0.481844 | 0.288961 | 0.423828 | 0.271127 | 0.296762 | 0.288961 | 0.30009 | 0.288961 | 0.273315 | 0.288961 |
| 1 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.703911 | 0.373377 | 0.604492 | 0.417469 | 0.373612 | 0.373377 | 0.373527 | 0.373377 | 0.372045 | 0.373377 |
| 2 | AdaB | default | 0.555866 | 0.49026 | 0.536133 | 0.445501 | 0.538641 | 0.49026 | 0.491167 | 0.49026 | 0.496771 | 0.49026 |
| 3 | KNN | {'n_neighbors': 3} | 0.772346 | 0.555195 | 0.707031 | 0.434253 | 0.576993 | 0.555195 | 0.554166 | 0.555195 | 0.555723 | 0.555195 |
| 4 | DT | default | 1.0 | 0.62987 | 0.888672 | 0.734605 | 0.621488 | 0.62987 | 0.621841 | 0.62987 | 0.620058 | 0.62987 |
| 5 | RF | default | 1.0 | 0.905844 | 0.97168 | 0.875841 | 0.910834 | 0.905844 | 0.905418 | 0.905844 | 0.90707 | 0.905844 |
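The A/B/C/D subset used for this experiment can also be selected with a boolean mask instead of a Python loop. The sketch below is only an illustration with stand-in arrays: it assumes the feature tables are NumPy arrays and that labels 1 to 4 correspond to A to D, as the selection code in this section implies.

```python
import numpy as np

# Stand-in labels (1..26, two trials each) and feature tables.
labels_1 = np.tile(np.arange(1, 27), 2)
rng = np.random.default_rng(0)
aat_vis = rng.normal(size=(labels_1.size, 5))
aat_img = rng.normal(size=(labels_1.size, 5))

# True for every row whose label is one of the four target letters.
mask = np.isin(labels_1, [1, 2, 3, 4])

# Vision rows first, then imagination rows, with matching labels.
data_4 = np.concatenate((aat_vis[mask], aat_img[mask]), axis=0)
indexs_4 = np.concatenate((labels_1[mask], labels_1[mask]))
```

The loop version interleaves vision and imagination rows, while this version stacks them in two blocks. The set of samples is identical, and the difference only matters if a later step depends on row order without shuffling.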


#### All features alphabet vision 4(A/B/C/D) Classification

In [23]:
indexs_4 = list()
data_4 = list()
for i in range(aat_vis.shape[0]):
    if labels_1[i] in [1,2,3,4]:
        indexs_4.append(labels_1[i])
        data_4.append(aat_vis[i,:])

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_4, indexs_4, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

| | Classifier | param | Traing score | Test Score | Whole score | CV Score | Precision(Macro) | Precision(Micro) | Recall(Macro) | Recall(Micro) | F1 Score(Macro) | F1 Score(Micro) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | SVM | default | 0.463687 | 0.227273 | 0.392578 | 0.268016 | 0.304951 | 0.227273 | 0.2529 | 0.227273 | 0.209844 | 0.227273 |
| 1 | AdaB | default | 0.606145 | 0.337662 | 0.525391 | 0.39119 | 0.384509 | 0.337662 | 0.338184 | 0.337662 | 0.340245 | 0.337662 |
| 2 | DT | default | 1.0 | 0.37013 | 0.810547 | 0.396905 | 0.375619 | 0.37013 | 0.368617 | 0.37013 | 0.37079 | 0.37013 |
| 3 | LR | {'C': 0.1, 'penalty': 'l2'} | 0.815642 | 0.376623 | 0.683594 | 0.338016 | 0.387544 | 0.376623 | 0.377824 | 0.376623 | 0.377793 | 0.376623 |
| 4 | KNN | {'n_neighbors': 3} | 0.801676 | 0.493506 | 0.708984 | 0.519206 | 0.519226 | 0.493506 | 0.500248 | 0.493506 | 0.495221 | 0.493506 |
| 5 | RF | default | 1.0 | 0.649351 | 0.894531 | 0.656587 | 0.656201 | 0.649351 | 0.651748 | 0.649351 | 0.648779 | 0.649351 |


### Conclusion

After window segmentation, which increases the amount of training data, model performance improved significantly. The random forest reached a test accuracy of about 80% on the 26-class vis+img task, which meets our minimum expected threshold (and the result could improve further with tuned hyperparameters).
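For context, these accuracies can be compared with the chance level of each task. The short calculation below assumes balanced classes, which the `26 * 32 * 4` comment in the loading cell suggests but does not prove; the two accuracies are copied from the random forest rows of the vis+img result tables.

```python
# Chance level for a balanced k-class problem is 1 / k.
chance_26 = 1 / 26
chance_4 = 1 / 4

rf_test_26 = 0.796194    # RF test score, all features, vis+img, 26 classes
rf_test_4 = 0.905844     # RF test score, all features, vis+img, A/B/C/D

ratio_26 = rf_test_26 / chance_26
ratio_4 = rf_test_4 / chance_4
print(f"26 classes: chance {chance_26:.4f}, RF is {ratio_26:.1f}x chance")
print(f" 4 classes: chance {chance_4:.4f}, RF is {ratio_4:.1f}x chance")
```

Relative to chance, the 26-class result is the stronger one even though its raw accuracy is lower than the 4-class accuracy.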

This result supports our assumption that the main reason for the poor initial performance was insufficient training data. Since much of the published code is not open source, and we do not have the computing power to test GANs and autoencoders, we will only use this segmentation-based pseudo-augmentation for this part. I will note the other methods for now and leave them for later.


## Official ways

### Classic 

* AutoEncoder / VAE

* GANs

Although these methods are classic and useful, many EEG data augmentation papers publish neither code nor data, so we can only understand their ideas by reading the papers directly.

(Reproducing their methods and frameworks takes both time and GPU resources.)


### Others


* Regularization and dropout (copy the data, or train for more epochs)

* Enhancement of EEG images (convert the EEG signal into an image and use image augmentation methods)

* Data augmentation in the RCN paper (https://arxiv.org/abs/1511.06448); it did not improve results much.

* Addressing Class Imbalance in Classification Problems of Noisy Signals by using Fourier Transform Surrogates https://arxiv.org/abs/1806.08675

* Data augmentation for eeg-based emotion recognition with deep convolutional neural networks

* Cross-session classification of mental workload levels using EEG and an adaptive deep learning model

* Improving brain computer interface performance by data augmentation with conditional Deep Convolutional Generative Adversarial Networks
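Of the methods listed, Fourier transform surrogates are cheap enough to sketch without a GPU. The code below shows the general idea only (keep each channel's amplitude spectrum, randomize the phases) and is not the cited paper's implementation. The helper name `ft_surrogate`, the choice to share one phase shift per frequency across channels, and the random stand-in trial are all assumptions.

```python
import numpy as np

def ft_surrogate(trial, rng):
    """Return a phase-randomized copy of a (channels, samples) array."""
    n = trial.shape[-1]
    spectrum = np.fft.rfft(trial, axis=-1)
    # One random phase shift per frequency bin, shared by all channels
    # (an assumption of this sketch, meant to keep channel relations).
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape[-1])
    phases[0] = 0.0                  # leave the DC term untouched
    if n % 2 == 0:
        phases[-1] = 0.0             # leave the Nyquist term untouched
    shifted = spectrum * np.exp(1j * phases)
    return np.fft.irfft(shifted, n=n, axis=-1)

rng = np.random.default_rng(0)
trial = rng.normal(size=(16, 86))    # stand-in for one 86-sample segment
surrogate = ft_surrogate(trial, rng)
```

The surrogate has the same per-channel amplitude spectrum as the original, so band-power features are unchanged while the time course differs. Whether that helps for this dataset is untested here.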

