## Introduction

This notebook attempts our final task, classifying the 26 letters, using feature extraction and feature selection with classical machine learning, plus simple deep learning, to see whether the task can be completed successfully.

If the results are not satisfactory, the next steps will focus on further testing, feature extraction combined with deep learning, more complex deep learning models, and reproducing the frameworks and models from the paper.

The main difficulties of this task are: (1) limited data and computing resources; (2) the task is considerably harder than binary or 3-, 4-, or 5-class classification.


## Import packages and read data

In [1]:
import os
import time
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# our own pipeline
# from pipelines.tools import plot_intervals
# from pipelines.tools import power_band, one_signal_band_power, power_band_timeslice
from pipelines.data_prapare import read_power_band_txt,read_features_table, read_signal_data
from pipelines.ml_functions import prepare_signals,set_seed, clean_all_feature_table
from pipelines.ml_functions import  print_performance, evaluate_model, model_evaluation_dict, init_classifiers

In [2]:
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

import torch
import xgboost as xgb
import lightgbm as lgb
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

import tensorflow as tf
import tensorflow.keras.layers as layers
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Conv1D, Dense, Flatten, GRU, LSTM, RNN, RepeatVector, TimeDistributed, SimpleRNN
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import plot_confusion_matrix, accuracy_score, f1_score, recall_score, precision_score

# pip install shap
# pip install lime
# import shap
# import lime
# from sklearn.tree import export_graphviz, plot_tree

In [3]:
import warnings
warnings.filterwarnings("ignore")

In [4]:
# from google.colab import drive
# drive.mount('/content/gdrive')

In [5]:
# import sys
# sys.path.append('/content/gdrive/My Drive/UMONS')

In [6]:
set_seed(42)

Read data

In [7]:
aat_vis, aat_img, asl_vis, asl_img = read_features_table()
bp_data_dict = read_power_band_txt()

# 26 classes * 32 samples = 832 samples
labels_1 = np.array(aat_vis['label_index'])
# 26 * 32 * 2 = 1664 samples
labels_2 = np.concatenate((labels_1, labels_1), axis=0)
# 26 * 32 * 4 = 3328 samples
labels_4 = np.concatenate((labels_2, labels_2), axis=0)


# for feature analysis
col_name = list(asl_img.columns)[2:]
# col_name

bp_data_dict.keys()

dict_keys(['fft_alphabet_imagination', 'fft_alphabet_vision', 'fft_asl_imagination', 'fft_asl_vision', 'multitaper_alphabet_imagination', 'multitaper_alphabet_vision', 'multitaper_asl_imagination', 'multitaper_asl_vision', 'welch_alphabet_imagination', 'welch_alphabet_vision', 'welch_asl_imagination', 'welch_asl_vision'])

In [8]:
# for feature analysis
ch_names=['ch1', 'ch2', 'ch3', 'ch4', 'ch5', 'ch6', 'ch7', 'ch8','ch9', 'ch10',
              'ch11', 'ch12', 'ch13', 'ch14', 'ch15', 'ch16']
band_name = ["δ-delta" , "θ-theta" , "α-alpha" , "β-beta" , "γ-gamma"]   
bp_col_names = [i+'_'+j for i in ch_names for j in band_name]

In [9]:
# read and clean data
aat_vis = clean_all_feature_table(aat_vis.iloc[:, 2:])
aat_img = clean_all_feature_table(aat_img.iloc[:, 2:])
asl_vis = clean_all_feature_table(asl_vis.iloc[:, 2:])
asl_img = clean_all_feature_table(asl_img.iloc[:, 2:])

# bp feature data
bp_aat_img = np.array(bp_data_dict['welch_alphabet_imagination']).reshape(-1,80)
bp_aat_vis = np.array(bp_data_dict['welch_alphabet_vision']).reshape(-1,80)
bp_asl_img = np.array(bp_data_dict['welch_asl_imagination']).reshape(-1,80)
bp_asl_vis = np.array(bp_data_dict['welch_asl_vision']).reshape(-1,80)
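
The `reshape(-1, 80)` calls flatten each sample's band-power block into one row. Here is a minimal sketch of how that layout lines up with `bp_col_names`, assuming each sample is stored as a 16 x 5 block (channels x bands). The random array is a stand-in, not the real data.

```python
import numpy as np

ch_names = [f"ch{i}" for i in range(1, 17)]
band_name = ["δ-delta", "θ-theta", "α-alpha", "β-beta", "γ-gamma"]
bp_col_names = [c + "_" + b for c in ch_names for b in band_name]

# Stand-in for one entry of bp_data_dict; the (samples, channels, bands)
# layout is an assumption made for this illustration.
rng = np.random.default_rng(0)
fake_bp = rng.random((832, 16, 5))

# Row-major flattening: the band index varies fastest within each channel.
flat = fake_bp.reshape(-1, 80)

# Column j of the flat table is channel j // 5, band j % 5.
j = bp_col_names.index("ch3_α-alpha")
assert flat[0, j] == fake_bp[0, 2, 2]
print(flat.shape, j)
```

If the stored arrays were instead laid out as bands x channels, the two loops in the `bp_col_names` comprehension would need to be swapped for the names to match the columns.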

## ML Classification (26 letters) and performances

### Alphabet letters

Training data: 26 classes × 32 samples × 2 conditions = 1664 samples; vision and imagination data are pooled without distinction.

#### BP features

In [10]:
data_bp = np.concatenate((bp_aat_img, bp_aat_vis), axis=0)
data_allfeature = np.concatenate((aat_img, aat_vis), axis=0) 


res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_bp, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.04811,0.026,0.041466,0.040385,0.001004,0.026,0.038462,0.026,0.001957,0.026
1,LR,"{'C': 0.1, 'penalty': 'l2'}",0.050687,0.026,0.043269,0.04468,0.00378,0.026,0.032418,0.026,0.006454,0.026
2,AdaB,default,0.050687,0.03,0.044471,0.054966,0.014364,0.03,0.044508,0.03,0.011762,0.03
3,DT,default,1.0,0.046,0.713341,0.056705,0.045503,0.046,0.043913,0.046,0.044035,0.046
4,GBDT,default,1.0,0.048,0.713942,0.064441,0.046638,0.048,0.047021,0.048,0.044396,0.048
5,KNN,{'n_neighbors': 7},0.243986,0.058,0.188101,0.048991,0.042888,0.058,0.055605,0.058,0.045531,0.058
6,LGB,default,1.0,0.078,0.722957,0.076496,0.059475,0.078,0.076301,0.078,0.061458,0.078
7,XGB,default,1.0,0.072,0.721154,0.06786,0.065267,0.072,0.065483,0.072,0.064068,0.072
8,RF,default,1.0,0.08,0.723558,0.071331,0.074598,0.08,0.077454,0.08,0.067588,0.08
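
`model_evaluation_dict` lives in the project's own `pipelines` package and is not shown in this notebook. As a rough, hypothetical sketch of the kind of evaluation that yields columns like those above (not the actual implementation), one could hold out 30% of the data, fit, and compute macro and micro scores. The 70/30 stratified split is an assumption; it is at least consistent with the test scores above all being multiples of 1/500. The name `evaluate_once` and the synthetic data are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_score, train_test_split


def evaluate_once(X, y, name, model, seed=42):
    """Hypothetical stand-in for model_evaluation_dict: one split, one fit."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    row = {
        "Classifier": name,
        "Training score": model.score(X_tr, y_tr),
        "Test score": model.score(X_te, y_te),
        # cross_val_score refits clones, so `model` itself is left untouched
        "CV score": cross_val_score(model, X_tr, y_tr, cv=5).mean(),
    }
    for avg in ("macro", "micro"):
        p, r, f1, _ = precision_recall_fscore_support(
            y_te, y_pred, average=avg, zero_division=0
        )
        row[f"Precision({avg})"] = p
        row[f"Recall({avg})"] = r
        row[f"F1({avg})"] = f1
    return row


# Pure-noise stand-in with the shape of the joint BP table (1664 x 80).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1664, 80))
y_demo = np.repeat(np.arange(1, 27), 64)

row = evaluate_once(
    X_demo, y_demo, "RF", RandomForestClassifier(n_estimators=20, random_state=42)
)
print({k: round(v, 3) if isinstance(v, float) else v for k, v in row.items()})
```

On pure noise like this, a random forest should reach a training score close to 1.0 while its test score stays near 1/26, which is the same pattern as the tree-based rows in the table above.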


#### All features

In [11]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_allfeature, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.127148,0.044,0.102163,0.059284,0.015161,0.044,0.050743,0.044,0.023001,0.044
1,KNN,{'n_neighbors': 7},0.243127,0.042,0.182692,0.054988,0.03162,0.042,0.042789,0.042,0.03177,0.042
2,AdaB,default,0.090206,0.08,0.087139,0.067013,0.033946,0.08,0.077595,0.08,0.036225,0.08
3,LGB,default,1.0,0.062,0.718149,0.085927,0.043085,0.062,0.068631,0.062,0.048533,0.062
4,LR,"{'C': 0.1, 'penalty': 'l2'}",0.40378,0.074,0.304688,0.070476,0.048802,0.074,0.071611,0.074,0.051432,0.074
5,GBDT,default,1.0,0.062,0.718149,0.070439,0.068968,0.062,0.056064,0.062,0.058203,0.062
6,XGB,default,1.0,0.08,0.723558,0.071316,0.06061,0.08,0.07188,0.08,0.062899,0.08
7,DT,default,1.0,0.064,0.71875,0.058422,0.066696,0.064,0.062087,0.064,0.062932,0.064
8,RF,default,1.0,0.078,0.722957,0.081639,0.058432,0.078,0.091251,0.078,0.065249,0.078


### ASL letters

Training data: 26 classes × 32 samples × 2 conditions = 1664 samples; vision and imagination data are pooled without distinction.

#### BP features

In [12]:
data_bp = np.concatenate((bp_asl_img, bp_asl_vis), axis=0)
data_allfeature = np.concatenate((asl_img, asl_vis), axis=0) 


res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_bp, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.050687,0.022,0.042067,0.042963,0.000851,0.022,0.038462,0.022,0.001666,0.022
1,LR,"{'C': 0.1, 'penalty': 'l2'}",0.056701,0.042,0.052284,0.046397,0.048618,0.042,0.045655,0.042,0.01096,0.042
2,AdaB,default,0.065292,0.04,0.057692,0.062739,0.051362,0.04,0.055428,0.04,0.020687,0.04
3,DT,default,1.0,0.044,0.71274,0.057552,0.043842,0.044,0.04302,0.044,0.042533,0.044
4,KNN,{'n_neighbors': 7},0.244845,0.06,0.189303,0.057552,0.042249,0.06,0.060566,0.06,0.044703,0.06
5,LGB,default,1.0,0.058,0.716947,0.082457,0.048162,0.058,0.058288,0.058,0.050086,0.058
6,GBDT,default,1.0,0.064,0.71875,0.060116,0.058091,0.064,0.068914,0.064,0.06036,0.064
7,XGB,default,1.0,0.084,0.72476,0.081587,0.057506,0.084,0.080228,0.084,0.062304,0.084
8,RF,default,1.0,0.082,0.724159,0.068722,0.074889,0.082,0.085016,0.082,0.070625,0.082


#### All features

In [13]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(data_allfeature, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.084192,0.044,0.072115,0.048968,0.021725,0.044,0.047754,0.044,0.022072,0.044
1,SVM,default,0.12457,0.048,0.101562,0.07388,0.017929,0.048,0.060374,0.048,0.02571,0.048
2,KNN,{'n_neighbors': 6},0.256014,0.046,0.192909,0.060102,0.032237,0.046,0.052689,0.046,0.035586,0.046
3,GBDT,default,1.0,0.05,0.714543,0.060978,0.052653,0.05,0.052417,0.05,0.051324,0.05
4,XGB,default,1.0,0.068,0.719952,0.066202,0.0483,0.068,0.060986,0.068,0.051406,0.068
5,LGB,default,1.0,0.068,0.719952,0.085021,0.054005,0.068,0.071978,0.068,0.057117,0.068
6,LR,"{'C': 0.1, 'penalty': 'l2'}",0.402062,0.094,0.309495,0.069621,0.074208,0.094,0.089874,0.094,0.061953,0.094
7,RF,default,1.0,0.086,0.725361,0.081587,0.063479,0.086,0.079463,0.086,0.06637,0.086
8,DT,default,1.0,0.062,0.718149,0.066129,0.070843,0.062,0.068275,0.062,0.067226,0.062


In [17]:
# labels_1

The results show that neither the joint data nor the ML models work: none of the feature sets carries information that consistently identifies, say, the letter A in the brain signal.
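
For reference, uniform guessing among 26 balanced classes gives an accuracy of 1/26 ≈ 0.038, so the test scores above (roughly 0.02 to 0.09) are at or only slightly above chance. Here is a minimal sketch of such a baseline using scikit-learn's `DummyClassifier`. The zero-valued features are placeholders, since a dummy model ignores its inputs, and the balanced label layout is an assumption.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

n_classes, per_class = 26, 64            # joint set: 26 * 32 * 2 = 1664 rows
y = np.repeat(np.arange(1, n_classes + 1), per_class)
X = np.zeros((y.size, 1))                # features are ignored by the dummy model

chance = 1.0 / n_classes
baseline = DummyClassifier(strategy="uniform", random_state=42)
cv_acc = cross_val_score(baseline, X, y, cv=5).mean()

print(f"theoretical chance: {chance:.4f}")
print(f"dummy CV accuracy:  {cv_acc:.4f}")
```

Reporting this baseline next to each results table makes it easier to judge whether a score reflects learning or guessing.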

I suspect several reasons for this:

1. We did not extract robust, discriminative features;
2. There are too many features for the model to choose from;
3. There are too many classes to classify (26/27), which makes classification difficult;
4. There is too little data for the models to learn well;
5. The joint data does not distinguish between vision and imagination, although the previous notebook showed that vision and imagination signals differ considerably;
6. Key spatial information (such as channel location) is missing;
7. The data is very noisy, which prevents correct results.

These are not the only possibilities, and the poor performance may well result from a combination of factors.
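
For the "too many features" reason, one common remedy is univariate feature selection fitted inside the cross-validation loop, so the selector never sees the held-out fold. Below is a minimal sketch on synthetic noise with the same shape as one full feature table (832 x 1120). The choice of `k=50` and of logistic regression is arbitrary and is not something tested in this notebook.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(832, 1120))         # synthetic stand-in, pure noise
y = np.repeat(np.arange(1, 27), 32)      # assumed balanced labels 1..26

# Scaling and selection are refit on each training fold by the pipeline.
pipe = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=50),
    LogisticRegression(C=0.1, max_iter=500),
)
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.round(3), scores.mean().round(3))
```

Selecting features on the whole dataset before splitting would leak label information into the test folds and can make noise look predictive, which is why the selector sits inside the pipeline here.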

In the next sections, I will test some of the above assumptions to try to improve our results.

## An attempted optimization

### Point 1

Point 1 will not be tested here. For additional features that are not in my feature list, see these papers:

Comparison of different feature extraction methods for EEG-based emotion recognition
https://www.sciencedirect.com/science/article/pii/S0208521620300553

A review of feature extraction and performance evaluation in epileptic seizure detection using EEG
https://www.sciencedirect.com/science/article/pii/S1746809419302836


### Point 6

For this point, I will try EEG features or movies in another notebook.

### Point 7

This point will be tested (by cleaning noise from the raw data) if needed.

### Point 5

Use a single set of 832 samples instead of the joint data (vision + imagination).

#### BP features

In [15]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(bp_aat_img, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,LR,"{'C': 0.1, 'penalty': 'l2'}",0.061856,0.024,0.050481,0.053244,0.00471,0.024,0.030983,0.024,0.007591,0.024
1,SVM,default,0.073883,0.024,0.058894,0.048042,0.014774,0.024,0.036859,0.024,0.010611,0.024
2,AdaB,default,0.106529,0.02,0.080529,0.037814,0.027535,0.02,0.016115,0.02,0.014636,0.02
3,LGB,default,1.0,0.032,0.709135,0.053273,0.024617,0.032,0.028407,0.032,0.025621,0.032
4,GBDT,default,1.0,0.044,0.71274,0.041204,0.033882,0.044,0.037536,0.044,0.034396,0.044
5,KNN,{'n_neighbors': 6},0.264605,0.032,0.194712,0.041204,0.043852,0.032,0.04949,0.032,0.035452,0.032
6,RF,default,1.0,0.06,0.717548,0.061894,0.06283,0.06,0.067569,0.06,0.053857,0.06
7,XGB,default,1.0,0.072,0.721154,0.065225,0.065208,0.072,0.078965,0.072,0.064314,0.072
8,DT,default,1.0,0.096,0.728365,0.051578,0.083881,0.096,0.097056,0.096,0.083649,0.096


#### All features

In [16]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(aat_vis, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,LR,"{'C': 0.1, 'penalty': 'l2'}",0.530928,0.032,0.38101,0.065283,0.012363,0.032,0.043956,0.032,0.019048,0.032
1,SVM,default,0.137457,0.036,0.106971,0.058475,0.012745,0.036,0.048077,0.036,0.019484,0.036
2,KNN,{'n_neighbors': 7},0.223368,0.036,0.167067,0.053302,0.039683,0.036,0.030148,0.036,0.027495,0.036
3,DT,default,1.0,0.024,0.706731,0.053156,0.030523,0.024,0.033974,0.024,0.027947,0.024
4,GBDT,default,1.0,0.032,0.709135,0.053244,0.033912,0.032,0.035321,0.032,0.029767,0.032
5,AdaB,default,0.106529,0.052,0.090144,0.049795,0.031957,0.052,0.071795,0.052,0.030611,0.052
6,XGB,default,1.0,0.052,0.715144,0.068819,0.036068,0.052,0.048397,0.052,0.038242,0.052
7,RF,default,1.0,0.056,0.716346,0.065254,0.052713,0.056,0.065004,0.056,0.045003,0.056
8,LGB,default,1.0,0.06,0.717548,0.054997,0.043198,0.06,0.061674,0.06,0.046896,0.06


In [18]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(asl_vis, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.073883,0.028,0.060096,0.044623,0.004976,0.028,0.035027,0.028,0.00717,0.028
1,SVM,default,0.152921,0.036,0.117788,0.07896,0.014673,0.036,0.063187,0.036,0.023403,0.036
2,GBDT,default,1.0,0.032,0.709135,0.070485,0.029311,0.032,0.038889,0.032,0.031357,0.032
3,XGB,default,1.0,0.044,0.71274,0.065254,0.027197,0.044,0.043881,0.044,0.032106,0.044
4,RF,default,1.0,0.064,0.71875,0.048042,0.041026,0.064,0.04768,0.064,0.04269,0.064
5,KNN,{'n_neighbors': 7},0.221649,0.056,0.171875,0.060111,0.054617,0.056,0.047934,0.056,0.043838,0.056
6,DT,default,1.0,0.048,0.713942,0.042987,0.048437,0.048,0.050278,0.048,0.048134,0.048
7,LR,"{'C': 0.1, 'penalty': 'l2'}",0.465636,0.08,0.34976,0.049883,0.037348,0.08,0.08047,0.08,0.048643,0.08
8,LGB,default,1.0,0.072,0.721154,0.073904,0.065184,0.072,0.069029,0.072,0.060841,0.072


#### Test RandomizedSearchCV, best parameters

In [23]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 3, 4]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# print(random_grid)

In [14]:
X_train,X_test,y_train,y_test = train_test_split(aat_vis,labels_1,test_size=0.25)

# Use the random grid to search for the best hyperparameters; a full grid search would take longer

rf = RandomForestClassifier()

# 5-fold cross-validation
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, 
                               cv = 5, verbose=2, random_state=13, n_jobs = -1)


In [15]:
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 21, 32, 43, 54, 65,
                                                      76, 87, 98, 110, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [20, 240, 460, 680, 900,
                                                         1120, 1340, 1560, 1780,
                                                         2000]},
                   random_state=13, verbose=2)

In [17]:
print(rf_random.best_params_)
best_random_rf = rf_random.best_estimator_

# print(best_random_rf)
# y_pred = best_random_rf.predict(X_test)
best_random_rf.score(X_test, y_test)

{'n_estimators': 900, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'auto', 'max_depth': 10, 'bootstrap': True}


0.07211538461538461
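
The search above was run with `max_features='auto'`, which recent scikit-learn releases no longer accept for random forests (to my knowledge it was deprecated in 1.1 and removed in 1.3, and was equivalent to `'sqrt'` for classifiers). Here is a minimal sketch of the same kind of search that should run on current versions, using a stratified split and fixed seeds. The synthetic data, the smaller grid, and the reduced `n_iter` and `cv` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

rng = np.random.default_rng(13)
X = rng.normal(size=(260, 40))           # synthetic stand-in for a feature table
y = np.repeat(np.arange(1, 27), 10)      # 26 balanced classes

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=13
)

param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", "log2"],    # 'auto' replaced by explicit options
    "max_depth": [10, 30, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=13),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=13,
)
search.fit(X_train, y_train)

print(search.best_params_)
print("CV score:", round(search.best_score_, 3))
print("test score:", round(search.score(X_test, y_test), 3))
```

Comparing `best_score_` with the held-out test score gives a quick check on whether the selected parameters generalize or merely fit the cross-validation folds.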

#### Check basic DNN

In [24]:
aat_vis_signal, aat_img_signal, asl_vis_signal, asl_img_signal = read_signal_data()
aat_vis_signal = prepare_signals(aat_vis_signal)
aat_img_signal = prepare_signals(aat_img_signal)
asl_vis_signal = prepare_signals(asl_vis_signal)
asl_img_signal = prepare_signals(asl_img_signal)

In [71]:
to_categorical(labels_1)[0].shape

(27,)
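
The one-hot vectors have 27 entries rather than 26 because `to_categorical` sizes its output as the largest label plus one, and the labels here appear to start at 1 (an inference from the label subsets used later in this notebook, not something it states). A small NumPy sketch of the same encoding, with `one_hot` as an illustrative helper, shows the unused column 0 and how shifting the labels to 0..25 removes it.

```python
import numpy as np


def one_hot(labels, num_classes=None):
    """Illustrative re-implementation of one-hot encoding for integer labels."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1   # same sizing rule as to_categorical
    return np.eye(num_classes, dtype="float32")[labels]


labels = np.repeat(np.arange(1, 27), 32)  # assumed layout: labels 1..26
with_gap = one_hot(labels)                # column 0 is never used
shifted = one_hot(labels - 1)             # labels 0..25, no empty column

print(with_gap.shape, shifted.shape)      # (832, 27) (832, 26)
print(with_gap[:, 0].sum())               # 0.0
```

The extra column does not break training, but it means the `Dense(27)` output layer has one unit that never receives a positive target.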

In [69]:
x_train, x_test, y_train, y_test = train_test_split(asl_vis_signal, to_categorical(labels_1), test_size=0.3)

In [74]:
model = Sequential()
model.add(LSTM(512, input_shape=(346, 16)))
model.add(Dense(27,activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(x_train,
                    y_train,
                    batch_size=16,
                    epochs=20,
                    verbose=1)

Epoch 1/20 ... Epoch 20/20 (per-epoch loss and accuracy were not preserved in this export)


Still not good.

### Point 3

Reduce the number of class labels in the dataset.
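
The cells in this section build each subset with an explicit loop over all 832 rows. An equivalent and shorter form uses a boolean mask from `np.isin`. This sketch assumes the feature table is a NumPy array (as the `aat_vis[i,:]` indexing suggests) and uses random stand-in data; `select_classes` is an illustrative helper, not part of the project's pipeline.

```python
import numpy as np


def select_classes(X, y, keep):
    """Return only the rows of X (and entries of y) whose label is in `keep`."""
    mask = np.isin(y, keep)
    return X[mask], y[mask]


rng = np.random.default_rng(0)
labels_demo = np.repeat(np.arange(1, 27), 32)     # assumed: 32 samples per label
features_demo = rng.normal(size=(832, 1120))      # stand-in for aat_vis

X_sub, y_sub = select_classes(features_demo, labels_demo, [1, 2, 3, 4, 5, 6, 7])
print(X_sub.shape, np.unique(y_sub))

# Same result as the loop-based version used in the notebook cells.
loop_rows = [features_demo[i, :] for i in range(832) if labels_demo[i] in [1, 2, 3, 4, 5, 6, 7]]
assert np.array_equal(np.array(loop_rows), X_sub)
```

The mask keeps rows in their original order, so the features and labels stay aligned without any manual bookkeeping.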

#### Test all models

In [10]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4,5,6,7]:
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])

In [11]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.262821,0.073529,0.205357,0.07125,0.02037,0.073529,0.091837,0.073529,0.033333,0.073529
1,RF,default,1.0,0.044118,0.709821,0.12875,0.042659,0.044118,0.041148,0.044118,0.040072,0.044118
2,XGB,default,1.0,0.073529,0.71875,0.090417,0.048291,0.073529,0.082143,0.073529,0.058072,0.073529
3,LR,"{'C': 0.1, 'penalty': 'l2'}",0.782051,0.088235,0.571429,0.044583,0.094322,0.088235,0.09246,0.088235,0.089633,0.088235
4,LGB,default,1.0,0.102941,0.727679,0.115417,0.215476,0.102941,0.113946,0.102941,0.097341,0.102941
5,KNN,{'n_neighbors': 7},0.288462,0.102941,0.232143,0.097917,0.09263,0.102941,0.109694,0.102941,0.099445,0.102941
6,DT,default,1.0,0.117647,0.732143,0.11375,0.127551,0.117647,0.108071,0.117647,0.112916,0.117647
7,GBDT,default,1.0,0.161765,0.745536,0.089583,0.221916,0.161765,0.160668,0.161765,0.166676,0.161765
8,AdaB,default,0.288462,0.220588,0.267857,0.1025,0.275603,0.220588,0.25772,0.220588,0.209127,0.220588


Reduce labels:

In [12]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4]:
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.348315,0.128205,0.28125,0.225,0.036765,0.128205,0.208333,0.128205,0.0625,0.128205
1,RF,default,1.0,0.076923,0.71875,0.2375,0.133333,0.076923,0.073413,0.076923,0.082043,0.076923
2,XGB,default,1.0,0.102564,0.726562,0.236111,0.090909,0.102564,0.118182,0.102564,0.099432,0.102564
3,LR,"{'C': 0.1, 'penalty': 'l2'}",0.876404,0.102564,0.640625,0.147222,0.096875,0.102564,0.123397,0.102564,0.102092,0.102564
4,AdaB,default,0.775281,0.128205,0.578125,0.256944,0.104037,0.128205,0.14881,0.128205,0.102632,0.128205
5,GBDT,default,1.0,0.153846,0.742188,0.158333,0.156857,0.153846,0.147384,0.153846,0.146324,0.153846
6,KNN,{'n_neighbors': 6},0.404494,0.179487,0.335938,0.144444,0.190476,0.179487,0.18961,0.179487,0.178041,0.179487
7,LGB,default,1.0,0.205128,0.757812,0.315278,0.218861,0.205128,0.311134,0.205128,0.199242,0.205128
8,DT,default,1.0,0.282051,0.78125,0.258333,0.273748,0.282051,0.269048,0.282051,0.267909,0.282051


Continue to reduce labels:

In [13]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2]:
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,LGB,default,1.0,0.3,0.78125,0.385,0.291667,0.3,0.3,0.3,0.292929,0.3
1,RF,default,1.0,0.3,0.78125,0.265,0.30303,0.3,0.30303,0.3,0.3,0.3
2,SVM,default,0.522727,0.45,0.5,0.335,0.225,0.45,0.5,0.45,0.310345,0.45
3,XGB,default,1.0,0.35,0.796875,0.49,0.338384,0.35,0.333333,0.35,0.335038,0.35
4,LR,"{'C': 0.1, 'penalty': 'l2'}",0.886364,0.45,0.75,0.48,0.449495,0.45,0.45,0.45,0.448622,0.45
5,GBDT,default,1.0,0.45,0.828125,0.68,0.478022,0.45,0.479167,0.45,0.448622,0.45
6,AdaB,default,1.0,0.45,0.828125,0.565,0.511905,0.45,0.510989,0.45,0.448622,0.45
7,KNN,{'n_neighbors': 6},0.545455,0.5,0.53125,0.32,0.483516,0.5,0.484848,0.5,0.479167,0.5
8,DT,default,1.0,0.5,0.84375,0.46,0.5,0.5,0.5,0.5,0.5,0.5


An accuracy of 0.5 on a two-class problem is chance level, so it tells us nothing.

Still, accuracy clearly rises as the number of classes shrinks.
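
Because chance level is 1/k for k balanced classes, raw accuracy is expected to rise as classes are removed even if a classifier learns nothing. A quick arithmetic sketch divides the best test score in each `aat_vis` (all features) table above by its chance level. The scores are copied from those tables, and balanced classes are assumed.

```python
# Best test accuracy per aat_vis table above, keyed by number of classes.
best_test = {26: 0.060, 7: 0.220588, 4: 0.282051, 2: 0.500}

ratios = {}
for k, acc in best_test.items():
    chance = 1.0 / k                      # assumes balanced classes
    ratios[k] = acc / chance
    print(f"{k:>2} classes: acc={acc:.3f}  chance={chance:.3f}  acc/chance={ratios[k]:.2f}")
```

The ratio goes from about 1.5 at 26 and 7 classes to 1.0 at 2 classes, so on these numbers the gain in raw accuracy mostly tracks the rising chance level. Each score is also the maximum over nine classifiers on a small test set, so all of them are optimistic estimates.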

Try adding data:

In [14]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        data_5.append(aat_img[i,:])
        data_5.append(asl_vis[i,:])
        data_5.append(asl_img[i,:])
        
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.664804,0.441558,0.597656,0.407516,0.430882,0.441558,0.452365,0.441558,0.401158,0.441558
1,GBDT,default,1.0,0.415584,0.824219,0.518627,0.433575,0.415584,0.443182,0.415584,0.405558,0.415584
2,DT,default,1.0,0.415584,0.824219,0.464706,0.412766,0.415584,0.417004,0.415584,0.409207,0.415584
3,LR,"{'C': 0.1, 'penalty': 'l2'}",0.826816,0.454545,0.714844,0.40817,0.455128,0.454545,0.454762,0.454545,0.453716,0.454545
4,KNN,{'n_neighbors': 7},0.614525,0.467532,0.570312,0.497386,0.477371,0.467532,0.478571,0.467532,0.46428,0.467532
5,AdaB,default,1.0,0.493506,0.847656,0.480065,0.485417,0.493506,0.485714,0.493506,0.484817,0.493506
6,RF,default,1.0,0.493506,0.847656,0.418627,0.492915,0.493506,0.492857,0.493506,0.492136,0.493506
7,XGB,default,1.0,0.506494,0.851562,0.446405,0.505735,0.506494,0.505814,0.506494,0.504404,0.506494
8,LGB,default,1.0,0.519481,0.855469,0.458824,0.52381,0.519481,0.52381,0.519481,0.519481,0.519481


Try adding labels:

In [15]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        data_5.append(aat_img[i,:])
        data_5.append(asl_vis[i,:])
        data_5.append(asl_img[i,:])
        
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.410615,0.142857,0.330078,0.181429,0.107221,0.142857,0.162743,0.142857,0.110321,0.142857
1,KNN,{'n_neighbors': 3},0.502793,0.201299,0.412109,0.176111,0.205016,0.201299,0.192992,0.201299,0.18239,0.201299
2,LGB,default,1.0,0.214286,0.763672,0.22881,0.215824,0.214286,0.218399,0.214286,0.213025,0.214286
3,XGB,default,1.0,0.227273,0.767578,0.268413,0.223144,0.227273,0.225525,0.227273,0.220915,0.227273
4,AdaB,default,0.645251,0.24026,0.523438,0.242857,0.227393,0.24026,0.233691,0.24026,0.22499,0.24026
5,LR,"{'C': 0.1, 'penalty': 'l2'}",0.734637,0.227273,0.582031,0.21246,0.226307,0.227273,0.226552,0.227273,0.226413,0.227273
6,DT,default,1.0,0.246753,0.773438,0.198095,0.248307,0.246753,0.246094,0.246753,0.245933,0.246753
7,RF,default,1.0,0.266234,0.779297,0.181349,0.262179,0.266234,0.265249,0.266234,0.26148,0.266234
8,GBDT,default,1.0,0.311688,0.792969,0.293016,0.308362,0.311688,0.311495,0.311688,0.308557,0.311688


Check the BP features:

In [16]:
bp_aat_vis[1,:].shape

(80,)

In [17]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(bp_aat_vis[i, :])
        data_5.append(bp_aat_img[i,:])
        data_5.append(bp_asl_vis[i, :])
        data_5.append(bp_asl_img[i, :])
        
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(np.array(data_5), np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.265363,0.227273,0.253906,0.259762,0.056818,0.227273,0.25,0.227273,0.092593,0.227273
1,KNN,{'n_neighbors': 4},0.519553,0.194805,0.421875,0.27381,0.193121,0.194805,0.200081,0.194805,0.187866,0.194805
2,LGB,default,1.0,0.188312,0.755859,0.293095,0.194483,0.188312,0.187843,0.188312,0.188842,0.188312
3,LR,"{'C': 0.1, 'penalty': 'l2'}",0.301676,0.220779,0.277344,0.229603,0.215175,0.220779,0.222835,0.220779,0.201775,0.220779
4,RF,default,1.0,0.233766,0.769531,0.22373,0.232055,0.233766,0.238875,0.233766,0.229996,0.233766
5,DT,default,1.0,0.233766,0.769531,0.273651,0.232482,0.233766,0.235328,0.233766,0.230305,0.233766
6,XGB,default,1.0,0.253247,0.775391,0.229206,0.248293,0.253247,0.253203,0.253247,0.248069,0.253247
7,GBDT,default,1.0,0.25974,0.777344,0.220556,0.259511,0.25974,0.261238,0.25974,0.256922,0.25974
8,AdaB,default,0.583799,0.298701,0.498047,0.237778,0.305044,0.298701,0.299848,0.298701,0.300594,0.298701


The BP features are no better than the full feature set.

Overall, though, performance does improve.

So we have basically identified the problems:

1. Not enough data.

2. Too many labels.

3. (Optional) Position information should be added.

#### Search best params

Although the results have improved, it is still worth checking whether a hyperparameter search improves performance further. After all, a preliminary analysis could identify the key features for classification, and the entire dataset could then be reclassified based on those features.
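
One simple way to run such a preliminary analysis is to rank features by a fitted random forest's `feature_importances_`. The sketch below uses synthetic two-class data with one deliberately informative column; `names` stands in for the notebook's `col_name` list, and the planted signal is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n, d = 256, 30
X = rng.normal(size=(n, d))
y = rng.integers(1, 3, size=n)            # two classes, labelled 1 and 2
X[:, 4] += 2.0 * (y == 2)                 # plant a signal in feature 4 only
names = [f"feat_{i}" for i in range(d)]   # stand-in for col_name

rf = RandomForestClassifier(n_estimators=200, random_state=13).fit(X, y)

# Impurity-based importances sum to 1; sort from most to least important.
order = np.argsort(rf.feature_importances_)[::-1]
top = [(names[i], float(rf.feature_importances_[i])) for i in order[:5]]
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Importances from a model that performs at chance level are not meaningful, so a ranking like this is only worth interpreting once the classifier clearly beats the baseline.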

In [18]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 20, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
# (scikit-learn requires an integer >= 2, so candidates using 1 cannot be fitted)
min_samples_split = [1,2,3,4,5]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,5]
# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
# print(random_grid)

In [19]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        data_5.append(aat_img[i,:])
        data_5.append(asl_vis[i,:])
        data_5.append(asl_img[i,:])
        

In [20]:
len(aat_img[1,:])

1120

In [21]:
X_train,X_test,y_train,y_test = train_test_split(data_5,indexs_5,test_size=0.25, random_state=13)

# Use the random grid to search for the best hyperparameters; a full grid search would take more time

rf = RandomForestClassifier()

# 5-fold cross-validation
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 10, 
                               cv = 5, verbose=2, random_state=13, n_jobs = -1)
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [10, 21, 32, 43, 54, 65,
                                                      76, 87, 98, 110, None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4, 5],
                                        'min_samples_split': [1, 2, 3, 4, 5],
                                        'n_estimators': [20, 240, 460, 680, 900,
                                                         1120, 1340, 1560, 1780,
                                                         2000]},
                   random_state=13, verbose=2)

In [22]:
print(rf_random.best_params_)
best_random_rf = rf_random.best_estimator_

# print(best_random_rf)
# y_pred = best_random_rf.predict(X_test)
best_random_rf.score(X_test, y_test)

{'n_estimators': 1340, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'auto', 'max_depth': 98, 'bootstrap': False}


0.453125

Not good enough: a test accuracy of 0.453 on a two-class problem (labels 1 and 2) is below the 50% chance level.

#### Check basic DNN

In [37]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4,5]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis_signal[i])
        data_5.append(aat_img_signal[i])

In [38]:
len(to_categorical(indexs_5)[1])

6
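
The length of 6 for five classes comes from how `to_categorical` sizes its output: without an explicit `num_classes`, it uses the largest label plus one, so labels 1 to 5 produce a column 0 that is never used. A minimal NumPy sketch of the effect, and of shifting the labels to start at zero, follows. The helper `one_hot` and the toy label array are illustrative assumptions, not part of the pipeline.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode integer labels, mimicking the max-label-plus-one default."""
    labels = np.asarray(labels, dtype=int)
    if num_classes is None:
        num_classes = labels.max() + 1
    return np.eye(num_classes, dtype="float32")[labels]

labels = np.array([1, 2, 3, 4, 5, 5, 1])  # toy labels in 1..5

raw = one_hot(labels)          # shape (7, 6): column 0 is never set
shifted = one_hot(labels - 1)  # shape (7, 5): one column per class

print(raw.shape, shifted.shape)
print("unused column sum:", raw[:, 0].sum())
```

With the shifted encoding, the output layer would need 5 units instead of 6. Keeping 6 units still trains, but one output is never a target.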

In [39]:
type(labels_1[0])

numpy.int64

In [40]:
x_train, x_test, y_train, y_test = train_test_split(np.array(data_5), to_categorical(indexs_5), test_size=0.3)

In [35]:
len(indexs_5)

160

In [41]:
model = Sequential()
model.add(SimpleRNN(512, input_shape=(346, 16)))
model.add(Dense(6,activation='softmax'))

model.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])

history = model.fit(x_train,
                    y_train,
                    batch_size=16,
                    epochs=40,
                    verbose=1)

Epoch 1/40
...
Epoch 40/40


The highest accuracy is 41%, which is still very poor!

### Point 2 Feature selections

* difference features

* LDA 

* PCA 

* pipeline selections

#### pipeline selections

In [46]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel,SelectKBest,chi2

In [44]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        data_5.append(aat_img[i,:])
        data_5.append(asl_vis[i,:])
        data_5.append(asl_img[i,:])
x_train, x_test, y_train, y_test = train_test_split(np.array(data_5), np.array(indexs_5), test_size=0.3)

In [53]:
clf = Pipeline([
    ('feature_selection',  SelectFromModel(LinearSVC(penalty='l1', loss='squared_hinge', dual=False))),
    ('classification', RandomForestClassifier())
])

In [54]:
clf.fit(x_train, y_train)

Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False,
                                                     penalty='l1'))),
                ('classification', RandomForestClassifier())])

In [55]:
clf.score(x_train,y_train)

1.0

In [56]:
y_pred = clf.predict(x_test)
accuracy, f1_w, recall_w, precision_w,  f1, recall, precision = print_performance(y_test, y_pred)   

accuracy: 0.182
f1 score macro av: 0.178
recall score macro av: 0.177
precision score macro av: 0.188
f1 score for every class:  [0.10126582 0.17910448 0.25263158 0.17910448]
recall:  [0.125      0.16666667 0.27272727 0.14285714]
precision:  [0.08510638 0.19354839 0.23529412 0.24      ]
 


In [58]:
clf = Pipeline([
    ('feature_selection', SelectKBest(chi2, k = 200)),
    ('classification', RandomForestClassifier())
])
clf.fit(x_train, y_train)

Pipeline(steps=[('feature_selection',
                 SelectKBest(k=200,
                             score_func=<function chi2 at 0x000002D57DB99798>)),
                ('classification', RandomForestClassifier())])

In [59]:
y_pred = clf.predict(x_test)
accuracy, f1_w, recall_w, precision_w,  f1, recall, precision = print_performance(y_test, y_pred)   

accuracy: 0.195
f1 score macro av: 0.195
recall score macro av: 0.195
precision score macro av: 0.198
f1 score for every class:  [0.13513514 0.24657534 0.18823529 0.21052632]
recall:  [0.15625    0.25       0.18181818 0.19047619]
precision:  [0.11904762 0.24324324 0.19512195 0.23529412]
 


#### PCA

In [14]:
indexs_5 = list()
data_5 = list()
for i in range(832):
    if labels_1[i] in [1,2,3,4]:
        indexs_5.append(labels_1[i])
        indexs_5.append(labels_1[i])
        data_5.append(aat_vis[i,:])
        data_5.append(aat_img[i,:])
pca = PCA(n_components=100)
pca.fit(np.array(data_5))
X = pca.transform(np.array(data_5))
x_train, x_test, y_train, y_test = train_test_split(X, np.array(indexs_5), test_size=0.3)
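
The choice of 100 components above is fixed by hand. One way to check such a choice, sketched here under assumptions, is to let `PCA` pick the number of components needed to reach a target share of explained variance by passing a float to `n_components`. The synthetic low-rank data and the 95% target are illustrative and do not come from the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(13)

# Synthetic data: 256 samples, 300 features driven by 20 latent factors plus noise.
latent = rng.normal(size=(256, 20))
mixing = rng.normal(size=(20, 300))
data = latent @ mixing + 0.1 * rng.normal(size=(256, 300))

# A float in (0, 1) asks PCA to keep enough components for that variance share.
pca = PCA(n_components=0.95, svd_solver="full")
reduced = pca.fit_transform(data)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("variance explained:", round(float(cumulative[-1]), 3))
```

If the number of components kept on the real features is far from 100, that would suggest revisiting the hand-picked value.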

In [15]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X, np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.575419,0.116883,0.4375,0.184641,0.1234,0.116883,0.121499,0.116883,0.11321,0.116883
1,DT,default,1.0,0.12987,0.738281,0.245752,0.153018,0.12987,0.123658,0.12987,0.127543,0.12987
2,LR,"{'C': 0.1, 'penalty': 'l2'}",0.687151,0.12987,0.519531,0.245098,0.129469,0.12987,0.132442,0.12987,0.12998,0.12987
3,RF,default,1.0,0.181818,0.753906,0.267974,0.158533,0.181818,0.201316,0.181818,0.174119,0.181818
4,GBDT,default,1.0,0.181818,0.753906,0.229412,0.190448,0.181818,0.186091,0.181818,0.181849,0.181818
5,LGB,default,1.0,0.207792,0.761719,0.234314,0.21802,0.207792,0.213955,0.207792,0.207827,0.207792
6,KNN,default,0.391061,0.207792,0.335938,0.161438,0.215171,0.207792,0.21409,0.207792,0.208946,0.207792
7,AdaB,default,0.681564,0.246753,0.550781,0.189216,0.247396,0.246753,0.266642,0.246753,0.233025,0.246753
8,XGB,default,1.0,0.285714,0.785156,0.240523,0.282164,0.285714,0.270753,0.285714,0.274681,0.285714


#### LDA

#### Note added after notebook 06.3: after validation, the results of the following part turned out not to be correct!

In [20]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=3)
lda.fit(X,np.array(indexs_5))
X_new = lda.transform(X)
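
One likely reason the LDA results do not hold up under validation (an assumption here, since this notebook does not state the cause) is that `lda.fit` sees the labels of every sample, including those that later land in the test split. The sketch below illustrates the effect on pure noise, where no classifier should beat chance: fitting LDA before the split leaks label information, while fitting it inside a pipeline on training rows only does not. All data and variable names are hypothetical.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(13)

# Pure noise: 240 samples, 200 features, 4 balanced classes with no real signal.
X_noise = rng.normal(size=(240, 200))
y_noise = np.repeat(np.arange(4), 60)

# Leaky protocol: LDA sees every label (including future test labels) before the split.
X_leaky = LinearDiscriminantAnalysis(n_components=3).fit_transform(X_noise, y_noise)
Xl_tr, Xl_te, yl_tr, yl_te = train_test_split(
    X_leaky, y_noise, test_size=0.3, stratify=y_noise, random_state=13)
leaky_acc = LogisticRegression(max_iter=1000).fit(Xl_tr, yl_tr).score(Xl_te, yl_te)

# Honest protocol: split first, then fit LDA and the classifier on training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_noise, y_noise, test_size=0.3, stratify=y_noise, random_state=13)
honest = make_pipeline(LinearDiscriminantAnalysis(n_components=3),
                       LogisticRegression(max_iter=1000))
honest_acc = honest.fit(X_tr, y_tr).score(X_te, y_te)

print(f"leaky accuracy:  {leaky_acc:.3f}")
print(f"honest accuracy: {honest_acc:.3f}")
```

On noise like this, the leaky protocol typically scores far above the 25% chance level while the honest one stays near it. Wrapping the reduction step and the classifier in one pipeline also keeps cross-validation honest, because the pipeline is refitted on each training fold.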

In [21]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, np.array(indexs_5), i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,DT,default,1.0,0.532468,0.859375,0.585948,0.531458,0.532468,0.519908,0.532468,0.518909,0.532468
1,RF,default,1.0,0.61039,0.882812,0.636928,0.634585,0.61039,0.62776,0.61039,0.614912,0.61039
2,XGB,default,1.0,0.623377,0.886719,0.643464,0.663603,0.623377,0.609004,0.623377,0.616526,0.623377
3,GBDT,default,1.0,0.623377,0.886719,0.62549,0.616986,0.623377,0.622368,0.623377,0.617208,0.623377
4,AdaB,default,0.664804,0.623377,0.652344,0.547059,0.656383,0.623377,0.625,0.623377,0.629268,0.623377
5,KNN,default,0.75419,0.636364,0.71875,0.681046,0.6446,0.636364,0.646104,0.636364,0.631992,0.636364
6,LGB,default,1.0,0.636364,0.890625,0.653268,0.637605,0.636364,0.641696,0.636364,0.634437,0.636364
7,LR,"{'C': 0.1, 'penalty': 'l2'}",0.670391,0.727273,0.6875,0.665359,0.732143,0.727273,0.732828,0.727273,0.731008,0.727273
8,SVM,default,0.715084,0.792208,0.738281,0.610131,0.793006,0.792208,0.809602,0.792208,0.798946,0.792208


Check all labels LDA:

In [23]:
lda = LinearDiscriminantAnalysis(n_components=24)

data_allfeature = np.concatenate((aat_img, aat_vis), axis=0) 
lda.fit(data_allfeature, labels_2)
X_new = lda.transform(data_allfeature)

In [24]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.238832,0.188,0.223558,0.18046,0.18371,0.188,0.216445,0.188,0.168039,0.188
1,DT,default,1.0,0.74,0.921875,0.753397,0.74349,0.74,0.746252,0.74,0.736796,0.74
2,GBDT,default,1.0,0.92,0.975962,0.917543,0.922469,0.92,0.919681,0.92,0.918839,0.92
3,XGB,default,1.0,0.948,0.984375,0.952815,0.949266,0.948,0.952154,0.948,0.949502,0.948
4,LGB,default,1.0,0.958,0.98738,0.962172,0.960804,0.958,0.962353,0.958,0.960788,0.958
5,RF,default,1.0,0.98,0.99399,0.981108,0.98148,0.98,0.980625,0.98,0.980578,0.98
6,KNN,{'n_neighbors': 7},0.993127,0.986,0.990986,0.992271,0.98685,0.986,0.987753,0.986,0.987028,0.986
7,LR,"{'C': 0.1, 'penalty': 'l2'}",1.0,0.994,0.998197,0.995697,0.994091,0.994,0.995072,0.994,0.994503,0.994
8,SVM,default,1.0,1.0,1.0,0.991409,1.0,1.0,1.0,1.0,1.0,1.0


In [26]:
X_new.shape

(1664, 24)

Next, we check LDA dimensionality reduction on the BP features (2*832 samples) and on the full feature set for all four datasets combined (4*832 samples).

Check BP features 2*832:

In [32]:
lda = LinearDiscriminantAnalysis(n_components=24)

data_bp = np.concatenate((bp_asl_img, bp_asl_vis), axis=0)
lda.fit(data_bp, labels_2)
X_new = lda.transform(data_bp)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,KNN,{'n_neighbors': 7},0.274914,0.088,0.21875,0.084203,0.085164,0.088,0.081779,0.088,0.073956,0.088
1,AdaB,default,0.137457,0.08,0.120192,0.063602,0.086998,0.08,0.082539,0.08,0.075404,0.08
2,SVM,default,0.363402,0.104,0.285457,0.114272,0.136159,0.104,0.098358,0.104,0.077126,0.104
3,XGB,default,1.0,0.092,0.727163,0.105644,0.08509,0.092,0.095168,0.092,0.085174,0.092
4,DT,default,1.0,0.088,0.725962,0.05756,0.084757,0.088,0.093223,0.088,0.085691,0.088
5,GBDT,default,1.0,0.098,0.728966,0.103949,0.097307,0.098,0.112182,0.098,0.095985,0.098
6,RF,default,1.0,0.114,0.733774,0.118582,0.116562,0.114,0.120197,0.114,0.104489,0.114
7,LGB,default,1.0,0.116,0.734375,0.099661,0.101375,0.116,0.113811,0.116,0.104718,0.116
8,LR,"{'C': 0.1, 'penalty': 'l2'}",0.256873,0.134,0.219952,0.161546,0.131958,0.134,0.141832,0.134,0.122992,0.134


Check all features for ASL, 2*832:

In [33]:
lda = LinearDiscriminantAnalysis(n_components=24)

data_allfeature = np.concatenate((asl_img, asl_vis), axis=0) 
lda.fit(data_allfeature, labels_2)
X_new = lda.transform(data_allfeature)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.149485,0.142,0.147236,0.182147,0.132132,0.142,0.137044,0.142,0.10216,0.142
1,DT,default,1.0,0.738,0.921274,0.745697,0.744805,0.738,0.749893,0.738,0.736815,0.738
2,GBDT,default,1.0,0.916,0.97476,0.913248,0.919378,0.916,0.920714,0.916,0.918042,0.916
3,XGB,default,1.0,0.932,0.979567,0.937312,0.932349,0.932,0.935479,0.932,0.9321,0.932
4,LGB,default,1.0,0.958,0.98738,0.967352,0.960879,0.958,0.957512,0.958,0.957847,0.958
5,RF,default,1.0,0.982,0.994591,0.988823,0.980581,0.982,0.982217,0.982,0.980764,0.982
6,LR,"{'C': 0.1, 'penalty': 'l2'}",1.0,0.99,0.996995,0.997414,0.991099,0.99,0.991026,0.99,0.990865,0.99
7,SVM,default,0.999141,0.996,0.998197,0.995704,0.996115,0.996,0.995334,0.996,0.9956,0.996
8,KNN,{'n_neighbors': 6},0.997423,0.996,0.996995,0.997414,0.996154,0.996,0.996154,0.996,0.996105,0.996


Check all features, 4*832:

In [34]:
lda = LinearDiscriminantAnalysis(n_components=24)

data_allfeature = np.concatenate((aat_img, aat_vis,asl_vis, asl_img), axis=0) 
lda.fit(data_allfeature, labels_4)
X_new = lda.transform(data_allfeature)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, labels_4, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.392872,0.36036,0.383113,0.373577,0.404874,0.36036,0.360711,0.36036,0.362895,0.36036
1,DT,default,1.0,0.386386,0.815805,0.398439,0.389493,0.386386,0.389492,0.386386,0.385751,0.386386
2,KNN,{'n_neighbors': 7},0.81623,0.66967,0.772236,0.715327,0.686968,0.66967,0.666713,0.66967,0.667345,0.66967
3,GBDT,default,1.0,0.706707,0.911959,0.695569,0.713877,0.706707,0.710233,0.706707,0.704474,0.706707
4,RF,default,1.0,0.735736,0.920673,0.752688,0.736146,0.735736,0.738769,0.735736,0.73391,0.735736
5,XGB,default,1.0,0.745746,0.923678,0.722199,0.744553,0.745746,0.746265,0.745746,0.742827,0.745746
6,LGB,default,1.0,0.745746,0.923678,0.735935,0.749149,0.745746,0.749053,0.745746,0.745483,0.745746
7,SVM,default,0.940747,0.810811,0.901743,0.81152,0.816055,0.810811,0.812111,0.810811,0.810749,0.810811
8,LR,"{'C': 0.1, 'penalty': 'l2'}",0.879777,0.832833,0.865685,0.823106,0.83515,0.832833,0.830877,0.832833,0.831157,0.832833


Check 1*832 alphabet vision all features:

In [35]:
lda = LinearDiscriminantAnalysis(n_components=24)

# data_allfeature = np.concatenate((aat_img, aat_vis,asl_vis, asl_img), axis=0) 
lda.fit(aat_vis, labels_1)
X_new = lda.transform(aat_vis)

res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new, labels_1, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,AdaB,default,0.280069,0.244,0.269231,0.311017,0.231566,0.244,0.269231,0.244,0.232331,0.244
1,GBDT,default,1.0,0.952,0.985577,0.953594,0.954937,0.952,0.950576,0.952,0.94756,0.952
2,DT,default,1.0,0.96,0.987981,0.945091,0.966474,0.96,0.964889,0.96,0.962141,0.96
3,XGB,default,1.0,0.984,0.995192,0.972472,0.982466,0.984,0.984998,0.984,0.983033,0.984
4,LGB,default,1.0,0.992,0.997596,0.987931,0.992521,0.992,0.994231,0.992,0.993067,0.992
5,SVM,default,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
6,LR,"{'C': 0.1, 'penalty': 'l2'}",1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
7,KNN,default,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
8,RF,default,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Although LDA has achieved great results, I still want to try other approaches to see whether they can achieve the same results.

After all, the new features produced by LDA dimensionality reduction are less interpretable than those of other methods.

#### Difference features

See notebook 05.5.

As seen above, the models perform very well after LDA. However, the LDA components are still derived from our more than 1,100 features, and using those features directly performs poorly. We therefore try to find the decisive features and check whether a model restricted to them performs better than a model using all features.

In [39]:
# modified version of this function
def percent_subtraction(alist1, alist2, feature_name_list):

    top_diff='all'
    differences = list()
    top_diffs = dict()
    for i in range(len(alist1)):
        
        diff = np.abs(alist1[i] - alist2[i])
        
        max1 = diff*100/np.abs(alist1[i])
        max2 = diff*100/np.abs(alist2[i])
        diff = round(max(max1, max2))   #  *100
        
        if diff >= 95:
            differences.append("100%")
            top_diffs[feature_name_list[i]] = diff 
        elif 95> diff >=85:
            differences.append("90%")
            top_diffs[feature_name_list[i]] = diff 
        elif 85> diff >=75:
            differences.append("80%")
            top_diffs[feature_name_list[i]] = diff 
        elif 75> diff >=65:
            differences.append("70%") 
            top_diffs[feature_name_list[i]] = diff 
        elif 65> diff >=55:
            differences.append("60%")
            top_diffs[feature_name_list[i]] = diff 
        elif 55> diff >=45:
            differences.append("50%")
            top_diffs[feature_name_list[i]] = diff 
        elif 45> diff >=35:
            differences.append("40%")
            top_diffs[feature_name_list[i]] = diff 
        elif 35> diff >=25:
            differences.append("30%")
            top_diffs[feature_name_list[i]] = diff 
        elif 25> diff >=15:
            differences.append("20%")
            top_diffs[feature_name_list[i]] = diff 
        elif 15> diff >=5:
            differences.append("10%")
            top_diffs[feature_name_list[i]] = diff 
        else:
            differences.append("0%")
    
    # print(top_diffs.items())
    # tt = sorted(top_diffs.items(), key=lambda d: d[1], reverse=True)
    if top_diff == 'all': 
        return differences, top_diffs
#     elif isinstance(top_diff,int):
#         return differences, tt[0:top_diff]
    else:
        raise ValueError
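
The binning rule in `percent_subtraction` can also be written without the `elif` chain. The sketch below is a vectorized NumPy variant under stated assumptions: the name `percent_difference_bins` and the `eps` guard against division by zero are additions for illustration, and the original function has no such guard.

```python
import numpy as np

def percent_difference_bins(a, b, names, eps=1e-12):
    """Bin the relative difference of two vectors into 0%, 10%, ..., 100%."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = np.abs(a - b)
    # Dividing by the smaller magnitude equals taking the larger of the two ratios.
    denom = np.maximum(np.minimum(np.abs(a), np.abs(b)), eps)  # eps: assumed guard
    pct = np.rint(100.0 * diff / denom)
    # 5..14 -> 10, 15..24 -> 20, ..., 95 and above -> 100, below 5 -> 0.
    bins = np.minimum((pct + 5) // 10, 10).astype(int) * 10
    labels = [f"{int(v)}%" for v in bins]
    top = {n: int(p) for n, p, v in zip(names, pct, bins) if v > 0}
    return labels, top

# Toy vectors standing in for two class-mean feature vectors.
labels, top = percent_difference_bins(
    [1.0, 2.0, 10.0, 4.0], [1.0, 1.0, 9.8, 3.0], ["f0", "f1", "f2", "f3"])
print(labels)
print(top)
```

With the guard, a feature whose mean is zero in one class falls into the 100% bin instead of raising an error, which is a behavioural difference from the original worth keeping in mind.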

In [38]:
aat_vis_data, aat_img_data, asl_vis_data, asl_img_data = read_features_table()

W_mean_aat_vis = np.mean(np.array(aat_vis[aat_vis_data['label'] == 'W']), axis=0, keepdims=False)
V_mean_aat_vis = np.mean(np.array(aat_vis[aat_vis_data['label'] == 'V']), axis=0, keepdims=False)
A_mean_aat_vis = np.mean(np.array(aat_vis[aat_vis_data['label'] == 'A']), axis=0, keepdims=False)

W_mean_aat_img = np.mean(np.array(aat_img[aat_img_data['label'] == 'W']), axis=0, keepdims=False)
V_mean_aat_img = np.mean(np.array(aat_img[aat_img_data['label'] == 'V']), axis=0, keepdims=False)
A_mean_aat_img = np.mean(np.array(aat_img[aat_img_data['label'] == 'A']), axis=0, keepdims=False)

In [43]:
differences, tt1 = percent_subtraction(W_mean_aat_vis, V_mean_aat_vis,np.array(aat_img_data.columns))
differences, tt2 = percent_subtraction(W_mean_aat_vis, A_mean_aat_vis,np.array(aat_img_data.columns))
differences, tt3 = percent_subtraction(A_mean_aat_vis, V_mean_aat_vis,np.array(aat_img_data.columns))
differences, tt4 = percent_subtraction(W_mean_aat_img, V_mean_aat_img,np.array(aat_img_data.columns))
differences, tt5 = percent_subtraction(W_mean_aat_img, A_mean_aat_img,np.array(aat_img_data.columns))
differences, tt6 = percent_subtraction(A_mean_aat_img, V_mean_aat_img,np.array(aat_img_data.columns))

intersection_set = list(set(tt1.keys()).intersection(tt2.keys(), tt3.keys() ,tt4.keys(), tt5.keys(), tt6.keys())) 
union_set = list(set(tt1.keys()).union(tt2.keys(), tt3.keys() ,tt4.keys(), tt5.keys(), tt6.keys())) 

In [46]:
len(union_set)
len(intersection_set)

822

Union set features:

In [3]:
# union_set
# aat_img.loc[:,union_set]

In [53]:
X_new = np.concatenate((aat_img, aat_vis), axis=0) 
X_new = pd.DataFrame(X_new, columns=col_name)
X_new[union_set].head()

Unnamed: 0,ch3_2_min_diff,ch2_signal_energy,ch1_hjorth_activity,ch5_PAPR,ch13_totalVariation,ch8_LZC,ch13_PB_SE_2,ch3_MAP,ch11_1_min_diff,ch4_hurst,...,ch16_mean_abs,ch1_PFD,ch4_2_mean_diff,ch10_median_frequency,ch9_LRSSV,ch10_pb_theta,ch10_pb_alpha,ch12_1_min_diff,ch14_mean_abs,ch1_alpha/delta
0,0.954677,0.489335,0.096217,0.003209,0.161263,0.289427,1.0,0.150765,0.96137,0.562523,...,0.129242,0.774133,0.180544,0.86161,0.383445,0.004782,0.003971,0.451122,0.273035,0.00468
1,0.991682,0.412496,0.169959,0.002276,0.257689,0.122819,0.862552,0.155995,0.972344,0.619079,...,0.128029,0.862124,0.11269,0.550921,0.133072,0.000954,0.000952,0.798344,0.273149,0.001966
2,0.997793,0.415079,0.073071,0.003309,0.3419,0.451762,0.686836,0.160928,0.973661,0.736924,...,0.126123,0.881273,0.048038,0.792301,0.061359,0.000535,0.000836,0.814103,0.263899,0.001449
3,0.994398,0.508645,0.098513,0.001742,0.190987,0.602973,0.844325,0.164132,0.941396,0.752094,...,0.127203,0.858147,0.049239,0.466906,0.124525,0.000398,0.000516,0.825321,0.244271,0.001183
4,1.0,0.411177,0.106731,0.003266,0.352711,0.343126,0.758032,0.163584,0.977831,0.566901,...,0.129638,0.851805,0.053042,0.935242,0.102707,0.000432,0.001188,0.786058,0.242634,0.004205


In [54]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new[union_set], labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.158076,0.044,0.123798,0.062695,0.017916,0.044,0.050023,0.044,0.024688,0.044
1,AdaB,default,0.094502,0.072,0.08774,0.063579,0.025482,0.072,0.078062,0.072,0.0347,0.072
2,KNN,{'n_neighbors': 7},0.246564,0.056,0.189303,0.053235,0.045722,0.056,0.061375,0.056,0.04483,0.056
3,XGB,default,1.0,0.066,0.719351,0.079856,0.049632,0.066,0.073191,0.066,0.056652,0.066
4,LR,"{'C': 0.1, 'penalty': 'l2'}",0.37457,0.084,0.28726,0.075612,0.068533,0.084,0.083222,0.084,0.061828,0.084
5,LGB,default,1.0,0.08,0.723558,0.078198,0.066592,0.08,0.078289,0.08,0.06565,0.08
6,RF,default,1.0,0.078,0.722957,0.079885,0.083762,0.078,0.095414,0.078,0.074974,0.078
7,DT,default,1.0,0.076,0.722356,0.062717,0.074486,0.076,0.081537,0.076,0.075701,0.076
8,GBDT,default,1.0,0.084,0.72476,0.06618,0.081704,0.084,0.084533,0.084,0.079477,0.084


Intersection set:

In [55]:
res_list = []
best_score = 0
classifiers = init_classifiers()
for i in classifiers:
    results= model_evaluation_dict(X_new[intersection_set], labels_2, i[0], i[1], i[2])
    res_list.append(results)
    

df_model_comparison = pd.DataFrame(res_list).sort_values(by=['F1 Score(Macro)','F1 Score(Micro)']).reset_index(drop=True)
df_model_comparison

Unnamed: 0,Classifier,param,Traing score,Test Score,Whole score,CV Score,Precision(Macro),Precision(Micro),Recall(Macro),Recall(Micro),F1 Score(Macro),F1 Score(Micro)
0,SVM,default,0.158935,0.066,0.13101,0.048151,0.020271,0.066,0.060613,0.066,0.029455,0.066
1,LR,"{'C': 0.1, 'penalty': 'l2'}",0.170962,0.06,0.13762,0.057538,0.03946,0.06,0.065329,0.06,0.040199,0.06
2,AdaB,default,0.138316,0.072,0.118389,0.06786,0.067398,0.072,0.073509,0.072,0.045337,0.072
3,LGB,default,1.0,0.076,0.722356,0.080762,0.053326,0.076,0.070881,0.076,0.05839,0.076
4,XGB,default,1.0,0.07,0.720553,0.082501,0.063895,0.07,0.068067,0.07,0.06104,0.07
5,KNN,{'n_neighbors': 7},0.271478,0.074,0.212139,0.063535,0.068455,0.074,0.07984,0.074,0.063565,0.074
6,RF,default,1.0,0.1,0.729567,0.09706,0.082039,0.1,0.093937,0.1,0.079542,0.1
7,DT,default,1.0,0.09,0.726562,0.063609,0.094263,0.09,0.095408,0.09,0.090276,0.09
8,GBDT,default,1.0,0.102,0.730168,0.086796,0.0917,0.102,0.105484,0.102,0.093681,0.102


Nothing changes: neither the union set nor the intersection set improves the performance!

### Point 4

Machine learning data augmentation based on the Python package imblearn.

SMOTE (Synthetic Minority Over-sampling Technique) deals with class imbalance by synthesizing new samples, which can improve classifier performance. It generates the new samples by interpolating between existing samples of the minority class.
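
To make the interpolation idea concrete, here is a minimal NumPy sketch of how SMOTE-style synthetic samples can be formed: pick a minority sample, pick one of its nearest minority neighbours, and take a random point on the segment between them. This is a simplified illustration, not imblearn's implementation; the function `smote_like_samples` and the toy data are assumptions. With the real library, `SMOTE().fit_resample(X, y)` expects the feature matrix first and the labels second.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise distances within the minority class; exclude self-matches.
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(minority), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random(size=(n_new, 1))  # position along the segment, in [0, 1)
    return minority[base] + gap * (minority[pick] - minority[base])

# Toy minority class: 6 samples with 4 features each.
minority = np.random.default_rng(1).random((6, 4))
synthetic = smote_like_samples(minority, n_new=10)
print(synthetic.shape)
```

Unlike copying rows, which repeats identical points, interpolation produces new points that lie between existing minority samples.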

#### All features

In [75]:
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import LabelEncoder

In [77]:
labels_1 = np.array(aat_vis_data['label_index'])

In [78]:
lab = LabelEncoder()
y_transformed = lab.fit_transform(labels_1)

In [83]:
new_labels = list()
new_data = list()

for i in range(832):
    if labels_1[i] in list(range(1,14)):
        new_labels.append(y_transformed [i])
        new_labels.append(y_transformed [i])
        new_labels.append(y_transformed [i])
        new_data.append(aat_vis[i,:])
        new_data.append(aat_vis[i,:])
        new_data.append(aat_vis[i,:])
    else:
        new_labels.append(y_transformed [i])
        new_data.append(aat_vis[i,:])


In [86]:
# Counter(new_labels)
new_labels[1]

8

In [87]:
new_data[1]

array([0.65356577, 0.65831931, 0.25139945, ..., 1.        , 1.        ,
       1.        ])

In [2]:
# fit_resample expects the features first and the labels second
# X_resampled, y_resampled = SMOTE().fit_resample(np.array(new_data), np.array(new_labels))
# Counter(y_resampled)

In [60]:
len(y_resampled)

832

This does not work this way, so we skip it!

## Conclusion

After the above attempts, we find that LDA achieves good results.

However, I will check whether this result is real.

### Notice!!!

(Added after notebook 06.3: these are not good results... Do you know why?)