# Classifier Model with PSD Features from EEG Data

**Description**:\
Build a prediction model from the training dataset obtained with the PSD method from MNE. The objective is to evaluate several classifier models and compare their results on two versions of the training dataset: one with all channels and one with only the frontopolar FP1 channel.

For this section, all the required datasets are stored in the `Training Dataset` directory; they are mainly the PSD feature extraction results. For further detail, see the notebook `Feature Extraction PSD.ipynb` in the `Feature Extraction` directory.

**Author**: Elmo Chavez\
**Date**: October 18, 2023

## Libraries

In [1]:
import pandas as pd
import numpy as np
import os
import sys

# Add the parent directory to the import path so that eeg_mne.py can be imported
path_eeg_mne = os.path.abspath('..')
sys.path.append(path_eeg_mne)
import eeg_mne

## Read the Dataset

In [2]:
path_training = '../Training Dataset/'

# Participants Dataset preselected in Feature Extraction step 1
file_participants_selected = 'Participants_Selected.csv'
df_participants_selected = pd.read_csv(path_training+file_participants_selected)

# PSD features with all channels
file_psd_features_all = 'PSD_Features-All_Channels.csv'
df_features_all = pd.read_csv(path_training+file_psd_features_all)

# PSD features with only FP1 channel
file_psd_features_fp1 = 'PSD_Features-FP1_Channel.csv'
df_features_fp1 = pd.read_csv(path_training+file_psd_features_fp1)

## Exploratory Data Analysis

**Brief Summary about the Participants Selected**

In [3]:
df_participants_selected

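| | participant_id | Gender | Age | Group | MMSE | time_max | points | sfreq | flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | sub-001 | 0 | 57 | 0 | 16 | 599.798 | 299900 | 500.0 | True |
| 1 | sub-002 | 0 | 78 | 0 | 22 | 793.098 | 396550 | 500.0 | True |
| 2 | sub-003 | 1 | 70 | 0 | 14 | 306.098 | 153050 | 500.0 | False |
| 3 | sub-004 | 0 | 67 | 0 | 20 | 706.098 | 353050 | 500.0 | True |
| 4 | sub-005 | 1 | 70 | 0 | 22 | 804.098 | 402050 | 500.0 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 83 | sub-084 | 0 | 71 | 1 | 24 | 652.098 | 326050 | 500.0 | True |
| 84 | sub-085 | 1 | 64 | 1 | 26 | 560.058 | 280030 | 500.0 | True |
| 85 | sub-086 | 1 | 49 | 1 | 26 | 578.798 | 289400 | 500.0 | True |
| 86 | sub-087 | 1 | 73 | 1 | 24 | 602.758 | 301380 | 500.0 | True |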


Keep only the flagged participants. The flag excludes:\
    - Healthy Control participants\
    - Participants with a maximum recorded time of less than 540 seconds\
    - Participants dropped to balance the classes at 22 samples per group (Alzheimer's Disease and Frontotemporal Dementia)

In [4]:
df_participants_selected = df_participants_selected[df_participants_selected['flag']].reset_index(drop=True)
df_participants_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44 entries, 0 to 43
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   participant_id  44 non-null     object 
 1   Gender          44 non-null     int64  
 2   Age             44 non-null     int64  
 3   Group           44 non-null     int64  
 4   MMSE            44 non-null     int64  
 5   time_max        44 non-null     float64
 6   points          44 non-null     int64  
 7   sfreq           44 non-null     float64
 8   flag            44 non-null     bool   
dtypes: bool(1), float64(2), int64(5), object(1)
memory usage: 2.9+ KB


In [5]:
df_participants_selected.groupby('Group')['participant_id'].count()

Group
0    22
1    22
Name: participant_id, dtype: int64
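
The `flag` column was computed in an earlier feature-extraction step that is not shown in this notebook. A minimal sketch of how the three selection criteria could be applied with pandas follows. The helper name `flag_participants`, the group code `2` for Healthy Control, the toy rows, and the rule of preferring the longest recordings when balancing are all assumptions made for illustration, not the author's implementation.

```python
import pandas as pd


def flag_participants(df, min_seconds=540.0, per_group=22, control_code=2):
    """Return a boolean Series marking the participants to keep (hypothetical helper)."""
    # Drop Healthy Control participants (assumed code) and recordings that are too short.
    eligible = df[(df["Group"] != control_code) & (df["time_max"] >= min_seconds)]
    # Balance the classes: keep at most `per_group` participants per group,
    # preferring the longest recordings (an assumed rule).
    kept = (
        eligible.sort_values("time_max", ascending=False)
        .groupby("Group")
        .head(per_group)
    )
    return pd.Series(df.index.isin(kept.index), index=df.index)


# Toy data with made-up values, only to exercise the helper.
demo = pd.DataFrame(
    {
        "participant_id": ["p1", "p2", "p3", "p4", "p5", "p6"],
        "Group": [0, 0, 0, 1, 1, 2],
        "time_max": [600.0, 790.0, 300.0, 650.0, 560.0, 700.0],
    }
)
demo["flag"] = flag_participants(demo, per_group=2)
selected = demo[demo["flag"]].reset_index(drop=True)
print(selected.groupby("Group")["participant_id"].count())
```

In the toy data, `p3` is dropped for its short recording and `p6` for belonging to the assumed control group, which leaves two participants in each remaining group.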

**Features Extracted using PSD Method from MNE for All the Channels**

In [6]:
eeg_mne.Dataset_Features_Summary(df_features_all)

Total Features: 6274
Windows: 11 -> ['w0', 'w1', 'w10', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'w9']
Channels: 19 -> ['F3', 'P4', 'Fz', 'C3', 'T6', 'F8', 'T4', 'O2', 'F4', 'O1', 'F7', 'Fp2', 'P3', 'Fp1', 'Cz', 'C4', 'T5', 'T3', 'Pz']
Frequency Bands: 5 -> ['delta', 'alpha', 'theta', 'beta', 'gamma']
Features: 6 -> ['total power', 'peak to peak', 'spectral entropy', 'average power', 'relative power', 'std dev']


**Features Extracted using PSD Method from MNE for the _FP1_ Channel**

In [7]:
eeg_mne.Dataset_Features_Summary(df_features_fp1)

Total Features: 334
Windows: 11 -> ['w0', 'w1', 'w10', 'w2', 'w3', 'w4', 'w5', 'w6', 'w7', 'w8', 'w9']
Channels: 1 -> ['Fp1']
Frequency Bands: 5 -> ['delta', 'alpha', 'theta', 'beta', 'gamma']
Features: 6 -> ['total power', 'peak to peak', 'spectral entropy', 'average power', 'relative power', 'std dev']
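
The two totals are consistent with one column per combination of window, channel, frequency band, and feature, plus four further columns in each file. The short check below shows the arithmetic. What the four remaining columns contain is not stated in the summary output, so treating them as non-feature columns (for example an identifier and the target) is an assumption.

```python
from math import prod


def expected_feature_columns(n_windows, n_channels, n_bands, n_features):
    """Number of columns if every window/channel/band/feature combination has one column."""
    return prod([n_windows, n_channels, n_bands, n_features])


all_channels = expected_feature_columns(11, 19, 5, 6)
fp1_only = expected_feature_columns(11, 1, 5, 6)

# Compare with the totals reported by the summary (6274 and 334).
print(all_channels, 6274 - all_channels)  # 6270 4
print(fp1_only, 334 - fp1_only)           # 330 4
```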


## Predictions with Cross-Validation

### All Channels

In [13]:
df_results_cv_allch = eeg_mne.eeg_classifier_cv(df=df_features_all, feature_id='participant_id', target='Group', feature_extraction='PSD', channels='All')
df_results_cv_allch.head(15)

Running: Support Vector
Running: Random Forest
Running: XGBoost
Running: LigthGBM
[LightGBM] [Info] Number of positive: 16, number of negative: 19
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.457143 -> initscore=-0.171850
[LightGBM] [Info] Start training from score -0.171850
[LightGBM] [Info] Number of positive: 19, number of negative: 16
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.542857 -> initscore=0.171850
[LightGBM] [Info] Start training from score 0.171850
[LightGBM] [Info] Number of positive: 18, number of negative: 17
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.514286 -> initscore=0.057158
[LightGBM] [Info] 

| | feature_extraction | channels | classifier | cross-validation | feature-selection | accuracy | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | PSD | All | Support Vector | KFold | anova | 0.477778 | 0.425 | 0.471667 |
| 1 | PSD | All | Support Vector | KFold | mutual_info_classif | 0.544444 | 0.451299 | 0.59 |
| 2 | PSD | All | Support Vector | KFold | chi2 | 0.461111 | 0.369744 | 0.55 |
| 3 | PSD | All | Support Vector | StratifiedKFold | anova | 0.544444 | 0.468075 | 0.565 |
| 4 | PSD | All | Support Vector | StratifiedKFold | mutual_info_classif | 0.566667 | 0.456667 | 0.575 |
| 5 | PSD | All | Support Vector | StratifiedKFold | chi2 | 0.480556 | 0.358881 | 0.525 |
| 6 | PSD | All | Support Vector | StratifiedShuffleSplit | anova | 0.577778 | 0.56632 | 0.57 |
| 7 | PSD | All | Support Vector | StratifiedShuffleSplit | mutual_info_classif | 0.466667 | 0.337512 | 0.47 |
| 8 | PSD | All | Support Vector | StratifiedShuffleSplit | chi2 | 0.444444 | 0.307692 | 0.5 |
| 9 | PSD | All | Random Forest | KFold | anova | 0.411111 | 0.289744 | 0.45 |
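
The source of `eeg_mne.eeg_classifier_cv` is not shown in this notebook. The sketch below illustrates how a grid of classifiers, cross-validation strategies, and feature-selection methods like the one in the results table could be evaluated with scikit-learn. The synthetic data, the choice of `k=10` selected features, the five splits, and the restriction to two classifiers are assumptions made to keep the example small; it is not the author's implementation.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, f_classif, mutual_info_classif
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    StratifiedShuffleSplit,
    cross_validate,
)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the PSD feature matrix: 44 samples, two classes.
X, y = make_classification(n_samples=44, n_features=60, n_informative=8, random_state=0)

cv_strategies = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedShuffleSplit": StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0),
}
selectors = {
    "anova": f_classif,
    "mutual_info_classif": partial(mutual_info_classif, random_state=0),
    "chi2": chi2,
}
classifiers = {
    "Support Vector": SVC(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

rows = []
for clf_name, clf in classifiers.items():
    for cv_name, cv in cv_strategies.items():
        for sel_name, score_func in selectors.items():
            # MinMaxScaler makes the training features non-negative, which chi2 requires.
            # Keeping selection inside the pipeline refits it on every training fold.
            model = make_pipeline(MinMaxScaler(), SelectKBest(score_func, k=10), clf)
            scores = cross_validate(
                model, X, y, cv=cv, scoring=("accuracy", "f1", "roc_auc")
            )
            rows.append(
                {
                    "classifier": clf_name,
                    "cross-validation": cv_name,
                    "feature-selection": sel_name,
                    "accuracy": scores["test_accuracy"].mean(),
                    "f1_score": scores["test_f1"].mean(),
                    "AUC": scores["test_roc_auc"].mean(),
                }
            )

# Show the three best configurations by mean AUC.
for row in sorted(rows, key=lambda r: r["AUC"], reverse=True)[:3]:
    print(row)
```

Each row holds the mean of a metric over the splits, so one table row corresponds to one classifier, cross-validation, and feature-selection combination.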


Show the Top 20 results

In [14]:
df_results_cv_allch.sort_values('AUC',ascending=False).head(20)

| | feature_extraction | channels | classifier | cross-validation | feature-selection | accuracy | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| 26 | PSD | All | XGBoost | StratifiedShuffleSplit | chi2 | 0.733333 | 0.72645 | 0.75 |
| 17 | PSD | All | Random Forest | StratifiedShuffleSplit | chi2 | 0.733333 | 0.725556 | 0.74 |
| 14 | PSD | All | Random Forest | StratifiedKFold | chi2 | 0.705556 | 0.69987 | 0.72 |
| 37 | PSD | All | AdaBoost | KFold | mutual_info_classif | 0.705556 | 0.701111 | 0.71 |
| 20 | PSD | All | XGBoost | KFold | chi2 | 0.663889 | 0.651053 | 0.691667 |
| 23 | PSD | All | XGBoost | StratifiedKFold | chi2 | 0.661111 | 0.648333 | 0.68 |
| 24 | PSD | All | XGBoost | StratifiedShuffleSplit | anova | 0.688889 | 0.656793 | 0.675 |
| 41 | PSD | All | AdaBoost | StratifiedKFold | chi2 | 0.658333 | 0.652013 | 0.675 |
| 11 | PSD | All | Random Forest | KFold | chi2 | 0.641667 | 0.63614 | 0.661667 |
| 10 | PSD | All | Random Forest | KFold | mutual_info_classif | 0.638889 | 0.634444 | 0.66 |


### FP1 Channel

In [15]:
df_results_cv_fp1 = eeg_mne.eeg_classifier_cv(df=df_features_all, feature_id='participant_id', target='Group', feature_extraction='PSD', channels='Fp1')
df_results_cv_fp1.head(15)

Running: Support Vector
Running: Random Forest
Running: XGBoost
Running: LigthGBM
[LightGBM] [Info] Number of positive: 16, number of negative: 19
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.457143 -> initscore=-0.171850
[LightGBM] [Info] Start training from score -0.171850
[LightGBM] [Info] Number of positive: 19, number of negative: 16
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.542857 -> initscore=0.171850
[LightGBM] [Info] Start training from score 0.171850
[LightGBM] [Info] Number of positive: 18, number of negative: 17
[LightGBM] [Info] Total Bins 0
[LightGBM] [Info] Number of data points in the train set: 35, number of used features: 0
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.514286 -> initscore=0.057158
[LightGBM] [Info] 

| | feature_extraction | channels | classifier | cross-validation | feature-selection | accuracy | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | PSD | Fp1 | Support Vector | KFold | anova | 0.477778 | 0.425 | 0.471667 |
| 1 | PSD | Fp1 | Support Vector | KFold | mutual_info_classif | 0.544444 | 0.451299 | 0.59 |
| 2 | PSD | Fp1 | Support Vector | KFold | chi2 | 0.461111 | 0.369744 | 0.55 |
| 3 | PSD | Fp1 | Support Vector | StratifiedKFold | anova | 0.544444 | 0.468075 | 0.565 |
| 4 | PSD | Fp1 | Support Vector | StratifiedKFold | mutual_info_classif | 0.566667 | 0.456667 | 0.575 |
| 5 | PSD | Fp1 | Support Vector | StratifiedKFold | chi2 | 0.480556 | 0.358881 | 0.525 |
| 6 | PSD | Fp1 | Support Vector | StratifiedShuffleSplit | anova | 0.577778 | 0.56632 | 0.57 |
| 7 | PSD | Fp1 | Support Vector | StratifiedShuffleSplit | mutual_info_classif | 0.466667 | 0.337512 | 0.47 |
| 8 | PSD | Fp1 | Support Vector | StratifiedShuffleSplit | chi2 | 0.444444 | 0.307692 | 0.5 |
| 9 | PSD | Fp1 | Random Forest | KFold | anova | 0.388889 | 0.276107 | 0.433333 |


In [16]:
df_results_cv_fp1.sort_values('AUC',ascending=False).head(20)

| | feature_extraction | channels | classifier | cross-validation | feature-selection | accuracy | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| 26 | PSD | Fp1 | XGBoost | StratifiedShuffleSplit | chi2 | 0.733333 | 0.72645 | 0.75 |
| 37 | PSD | Fp1 | AdaBoost | KFold | mutual_info_classif | 0.705556 | 0.701111 | 0.71 |
| 41 | PSD | Fp1 | AdaBoost | StratifiedKFold | chi2 | 0.680556 | 0.677143 | 0.695 |
| 20 | PSD | Fp1 | XGBoost | KFold | chi2 | 0.663889 | 0.651053 | 0.691667 |
| 16 | PSD | Fp1 | Random Forest | StratifiedShuffleSplit | mutual_info_classif | 0.666667 | 0.636538 | 0.685 |
| 23 | PSD | Fp1 | XGBoost | StratifiedKFold | chi2 | 0.661111 | 0.648333 | 0.68 |
| 24 | PSD | Fp1 | XGBoost | StratifiedShuffleSplit | anova | 0.688889 | 0.656793 | 0.675 |
| 11 | PSD | Fp1 | Random Forest | KFold | chi2 | 0.636111 | 0.630346 | 0.661667 |
| 10 | PSD | Fp1 | Random Forest | KFold | mutual_info_classif | 0.638889 | 0.634444 | 0.66 |
| 17 | PSD | Fp1 | Random Forest | StratifiedShuffleSplit | chi2 | 0.644444 | 0.63119 | 0.66 |


Top five results by AUC for each of the two approaches (All Channels and FP1 Channel)

In [22]:
df_results_cv = pd.concat([df_results_cv_allch, df_results_cv_fp1], ignore_index=True)
df_results_cv_sorted = df_results_cv.sort_values('AUC',ascending=False)
df_results_cv_sorted.groupby(['channels']).head(5)

| | feature_extraction | channels | classifier | cross-validation | feature-selection | accuracy | f1_score | AUC |
|---|---|---|---|---|---|---|---|---|
| 71 | PSD | Fp1 | XGBoost | StratifiedShuffleSplit | chi2 | 0.733333 | 0.72645 | 0.75 |
| 26 | PSD | All | XGBoost | StratifiedShuffleSplit | chi2 | 0.733333 | 0.72645 | 0.75 |
| 17 | PSD | All | Random Forest | StratifiedShuffleSplit | chi2 | 0.733333 | 0.725556 | 0.74 |
| 14 | PSD | All | Random Forest | StratifiedKFold | chi2 | 0.705556 | 0.69987 | 0.72 |
| 37 | PSD | All | AdaBoost | KFold | mutual_info_classif | 0.705556 | 0.701111 | 0.71 |
| 82 | PSD | Fp1 | AdaBoost | KFold | mutual_info_classif | 0.705556 | 0.701111 | 0.71 |
| 86 | PSD | Fp1 | AdaBoost | StratifiedKFold | chi2 | 0.680556 | 0.677143 | 0.695 |
| 20 | PSD | All | XGBoost | KFold | chi2 | 0.663889 | 0.651053 | 0.691667 |
| 65 | PSD | Fp1 | XGBoost | KFold | chi2 | 0.663889 | 0.651053 | 0.691667 |
| 61 | PSD | Fp1 | Random Forest | StratifiedShuffleSplit | mutual_info_classif | 0.666667 | 0.636538 | 0.685 |
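
`groupby(['channels']).head(5)` on the sorted frame returns the five best rows per approach, so one classifier can appear several times. To get exactly one row per classifier and approach, `idxmax` can be used instead, as in the sketch below. The small `results` frame holds a handful of AUC values copied from the tables in this section; on the real data, the same two lines would be applied to `df_results_cv`, and the variable names here are illustrative.

```python
import pandas as pd

# A few rows copied from the result tables (AUC only) to keep the example short.
results = pd.DataFrame(
    [
        ("All", "XGBoost", "StratifiedShuffleSplit", "chi2", 0.75),
        ("All", "Random Forest", "StratifiedShuffleSplit", "chi2", 0.74),
        ("All", "Random Forest", "StratifiedKFold", "chi2", 0.72),
        ("All", "AdaBoost", "KFold", "mutual_info_classif", 0.71),
        ("Fp1", "XGBoost", "StratifiedShuffleSplit", "chi2", 0.75),
        ("Fp1", "AdaBoost", "KFold", "mutual_info_classif", 0.71),
        ("Fp1", "AdaBoost", "StratifiedKFold", "chi2", 0.695),
        ("Fp1", "Random Forest", "StratifiedShuffleSplit", "mutual_info_classif", 0.685),
    ],
    columns=["channels", "classifier", "cross-validation", "feature-selection", "AUC"],
)

# Index label of the highest-AUC row within each (channels, classifier) pair.
best_idx = results.groupby(["channels", "classifier"])["AUC"].idxmax()
best_per_classifier = (
    results.loc[best_idx]
    .sort_values(["channels", "AUC"], ascending=[True, False])
    .reset_index(drop=True)
)
print(best_per_classifier)
```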


## Save Results

In [23]:
file_results_cv = 'Results PSD - Cross-Validation.csv'
df_results_cv.to_csv(path_training+file_results_cv)

## Conclusions

After applying three cross-validation strategies (KFold, StratifiedKFold, and StratifiedShuffleSplit) and five classifier models in combination with three feature-selection methods, the findings have been saved to a CSV file to be compared with future approaches. The best configurations reach an AUC of 0.75, and the following conclusions have been drawn:
* Cross-validation gives a more reliable picture of model performance than a single train-test split. Averaging each evaluation metric over the splits reduces the dependence of the results on one particular partition of the training dataset.
* The classifier models train consistently, except for LightGBM, which struggles with the small dataset size: its log reports `number of used features: 0` for the 35-sample training sets. A likely cause is LightGBM's default minimum of 20 samples per leaf (`min_child_samples`), which 35 training samples cannot satisfy on both sides of a split.
* Both approaches, All Channels and only FP1, yielded almost identical results. The next step is to perform hyperparameter optimization to find the best models. At this stage, it is uncertain whether noticeable differences will appear or whether the similar results will persist for both approaches.
* The feature-selection methods helped improve results by reducing the number of features; `chi2` in particular appears in 7 of the 10 best all-channel configurations by AUC.
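
One plausible reading of the `number of used features: 0` lines in the LightGBM log is the library's default `min_child_samples=20` (alias `min_data_in_leaf`): a split needs at least that many rows in each child, so at least 40 rows in total, while each training set here has 35. This is an inference from the log output, not something the notebook confirms. The helper `can_split` below is hypothetical and only illustrates the arithmetic.

```python
def can_split(n_train, min_child_samples=20):
    """True if a node with n_train rows can be split into two children of the minimum size."""
    return n_train >= 2 * min_child_samples


print(can_split(35))                       # False: 35 < 2 * 20
print(can_split(35, min_child_samples=5))  # True: 35 >= 2 * 5
```

If this reading is correct, lowering `min_child_samples` when constructing the LightGBM classifier would be one option to try during hyperparameter optimization.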