<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;"> A comparison of different classifiers’ accuracy & performance for high-dimensional data</h2>

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Problem formulation</h2>

The **EEG Brainwave Dataset** contains electrical brainwave signals recorded with an EEG headset, stored in temporal (time-ordered) format.

The challenge is: **Can we predict emotional sentiment from brainwave readings?**

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Import Packages</h2>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

import xgboost as xgb

import warnings
warnings.filterwarnings('ignore')

import os
os.listdir('../input')


In [None]:
brainwave_df = pd.read_csv('../input/emotions.csv')

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Peek of Data</h2>

In [None]:
brainwave_df.head()

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Tail of Data</h2>

In [None]:
brainwave_df.tail()

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Dimensions of Data</h2>

In [None]:
brainwave_df.shape

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Data Type For Each Attribute</h2>

In [None]:
brainwave_df.dtypes

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Descriptive Statistics</h2>

In [None]:
brainwave_df.describe()

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Class Distribution</h2>

In [None]:
plt.figure(figsize=(12,5))
sns.countplot(x=brainwave_df.label, color='mediumseagreen')
plt.title('Emotional sentiment class distribution', fontsize=16)
plt.ylabel('Class Counts', fontsize=16)
plt.xlabel('Class Label', fontsize=16)
plt.xticks(rotation='vertical');

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Correlation Between Attributes</h2>
Correlation refers to the relationship between two variables and how they may or may not change together.

The most common method for calculating correlation is [Pearson’s Correlation Coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient), which measures the strength of a linear relationship and is most reliable when the attributes involved are roughly normally distributed. A correlation of -1 or 1 indicates a perfect negative or positive linear relationship respectively, whereas a value of 0 indicates no linear relationship at all.
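
As a minimal sketch of what the coefficient measures, the snippet below computes Pearson's r by hand on synthetic data. The helper `pearson_r` and the arrays are illustrative assumptions, not part of the dataset or of this notebook's pipeline.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    denominator = np.sqrt((x_centered @ x_centered) * (y_centered @ y_centered))
    return (x_centered @ y_centered) / denominator

# Synthetic data (assumption: purely illustrative)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b_pos = 2.0 * a + 1.0          # exact increasing linear function of a
b_neg = -0.5 * a               # exact decreasing linear function of a
b_ind = rng.normal(size=200)   # drawn independently of a

r_pos = pearson_r(a, b_pos)
r_neg = pearson_r(a, b_neg)
r_ind = pearson_r(a, b_ind)
print(r_pos, r_neg, r_ind)
```

The hand-written helper should agree with `np.corrcoef` and with `DataFrame.corr(method='pearson')` up to floating-point error.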

In [None]:
label_df = brainwave_df['label']
brainwave_df.drop('label', axis = 1, inplace=True)

In [None]:
correlations = brainwave_df.corr(method='pearson')
correlations

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Skew of Univariate Distributions</h2>

In [None]:
skew = brainwave_df.skew()
skew

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">RandomForest Classifier</h2>

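`RandomForest` is an ensemble classifier that combines decision trees with bagging: each tree is trained on a bootstrap sample of the rows, and each split considers only a random subset of the features. It does not remove features from the data, but features that do little to reduce impurity (Gini by default in scikit-learn, or entropy) are rarely chosen for splits, so the model performs an implicit form of feature selection.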
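
How strongly a forest relies on each feature can be inspected through its `feature_importances_` attribute. The sketch below uses synthetic data (an assumption for illustration, not the EEG features) in which only the first two of ten columns determine the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 2 informative columns followed by 8 noise columns
rng = np.random.default_rng(0)
n_samples = 300
informative = rng.normal(size=(n_samples, 2))
noise = rng.normal(size=(n_samples, 8))
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized to sum to 1
importances = forest.feature_importances_
top_two = {int(i) for i in np.argsort(importances)[-2:]}
print(np.round(importances, 3))
print(top_two)
```

The noise columns still receive small non-zero importances, which is why this is better described as implicit weighting than as feature removal.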

In [None]:
%%time

pl_random_forest = Pipeline(steps=[('random_forest', RandomForestClassifier())])
scores = cross_val_score(pl_random_forest, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for RandomForest : ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Logistic Regression Classifier</h2>

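`Logistic Regression` is a linear classifier. Like linear regression, it computes a weighted sum of the features, but it passes that sum through a logistic (or, for multiple classes, softmax) function to produce class probabilities instead of a continuous output.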
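
A minimal sketch of the multinomial prediction step, assuming hypothetical weights `W` and biases `b` for three classes over two features (in practice these are learned by the solver):

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() numerically stable."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical parameters: one weight row and one bias per class
W = np.array([[ 1.0, -1.0],
              [ 0.0,  0.5],
              [-1.0,  1.0]])
b = np.array([0.0, 0.1, -0.1])

# Two example samples with two features each
X = np.array([[2.0, 0.0],
              [0.0, 2.0]])

scores = X @ W.T + b              # linear part, one score per class
probs = softmax(scores)           # class probabilities, each row sums to 1
predicted = probs.argmax(axis=1)  # index of the most probable class
print(probs)
print(predicted)
```

The decision boundaries are linear in the features because the class with the largest linear score always has the largest probability.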

In [None]:
%%time

pl_log_reg = Pipeline(steps=[('scaler',StandardScaler()),
                             ('log_reg', LogisticRegression(multi_class='multinomial', solver='saga', max_iter=200))])
scores = cross_val_score(pl_log_reg, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for Logistic Regression: ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Principal Component Analysis (PCA)</h2>

PCA projects the original variables onto a lower-dimensional space of uncorrelated principal components, ordered by how much variance each one explains, and thus reduces the number of variables required. Collinear variables are largely merged into the same components.
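
To illustrate how collinear variables end up in the same component, here is a rough sketch of PCA via the singular value decomposition on synthetic data. The four-column matrix is an assumption for illustration: three columns are noisy multiples of one latent signal and the fourth is independent.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 500
latent = rng.normal(size=(n_samples, 1))

# Synthetic data (assumption): three nearly collinear columns plus one independent column
X = np.hstack([
    latent + 0.05 * rng.normal(size=(n_samples, 1)),
    2.0 * latent + 0.05 * rng.normal(size=(n_samples, 1)),
    -latent + 0.05 * rng.normal(size=(n_samples, 1)),
    rng.normal(size=(n_samples, 1)),
])

# Standardize each column, then decompose
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
U, S, Vt = np.linalg.svd(X_std, full_matrices=False)

# Squared singular values are proportional to the variance along each component
explained_variance_ratio = S**2 / np.sum(S**2)

# Project onto the first two principal components
projected = X_std @ Vt[:2].T
print(np.round(explained_variance_ratio, 3))
print(projected.shape)
```

In this setup the first component absorbs the three collinear columns and the second captures the independent one, so two components retain nearly all of the variance of four variables.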

In [None]:
scaler = StandardScaler()
scaled_df = scaler.fit_transform(brainwave_df)
pca = PCA(n_components = 20)
pca_vectors = pca.fit_transform(scaled_df)
for index, var in enumerate(pca.explained_variance_ratio_):
    print("Explained Variance ratio by Principal Component ", (index+1), " : ", var)


In [None]:
plt.figure(figsize=(25,8))
sns.scatterplot(x=pca_vectors[:, 0], y=pca_vectors[:, 1], hue=label_df)
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
plt.xticks(rotation='vertical');

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Logistic Regression classifier with these two PCs</h2>

In [None]:
%%time
pl_log_reg_pca = Pipeline(steps=[('scaler',StandardScaler()),
                             ('pca', PCA(n_components = 2)),
                             ('log_reg', LogisticRegression(multi_class='multinomial', solver='saga', max_iter=200))])
scores = cross_val_score(pl_log_reg_pca, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for Logistic Regression with 2 Principal Components: ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Logistic Regression classifier with 10 PCs</h2>

In [None]:
%%time

pl_log_reg_pca_10 = Pipeline(steps=[('scaler',StandardScaler()),
                             ('pca', PCA(n_components = 10)),
                             ('log_reg', LogisticRegression(multi_class='multinomial', solver='saga', max_iter=200))])
scores = cross_val_score(pl_log_reg_pca_10, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for Logistic Regression with 10 Principal Components: ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Artificial Neural Network Classifier (ANN)</h2>

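An ANN classifier is non-linear: its hidden layers learn intermediate representations of the input, which acts as a form of automatic feature engineering, and hidden layers narrower than the input compress the data into fewer dimensions. `MLPClassifier` in scikit-learn implements a multi-layer perceptron, one kind of ANN. As with logistic regression, the data should be scaled before training.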
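
The scaling step amounts to standardizing each feature to zero mean and unit variance. A minimal sketch with NumPy, using two synthetic columns on very different scales (an assumption for illustration):

```python
import numpy as np

# Synthetic data (assumption): one large-scale and one small-scale feature
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(loc=1000.0, scale=250.0, size=100),
    rng.normal(loc=0.0, scale=0.01, size=100),
])

mean = X.mean(axis=0)
std = X.std(axis=0)           # population standard deviation (ddof=0), as StandardScaler uses
X_scaled = (X - mean) / std   # each column now has mean 0 and standard deviation 1

print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Inside a `Pipeline`, `StandardScaler` computes these statistics on the training folds only, which avoids leaking information from the validation fold during cross-validation.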

In [None]:
%%time

pl_mlp = Pipeline(steps=[('scaler',StandardScaler()),
                             ('mlp_ann', MLPClassifier(hidden_layer_sizes=(1275, 637)))])
scores = cross_val_score(pl_mlp, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for ANN : ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Linear Support Vector Machines Classifier (SVM)</h2>

In [None]:
%%time

pl_svm = Pipeline(steps=[('scaler',StandardScaler()),
                             ('pl_svm', LinearSVC())])
scores = cross_val_score(pl_svm, brainwave_df, label_df, cv=10,scoring='accuracy')
print('Accuracy for Linear SVM : ', scores.mean())

<h2 style="text-align:center; color:#546545;text-shadow: 2px 2px 4px #000000;">Extreme Gradient Boosting Classifier (XGBoost)</h2>

XGBoost is an ensemble classifier based on gradient-boosted trees. Like `RandomForest`, it does not drop features explicitly, but uninformative features are rarely selected for splits.

In [None]:
%%time
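from sklearn.preprocessing import LabelEncoder
# Recent XGBoost versions expect integer class labels 0..n_classes-1
encoded_labels = LabelEncoder().fit_transform(label_df)
pl_xgb = Pipeline(steps=
                  [('xgboost', xgb.XGBClassifier(objective='multi:softmax'))])
scores = cross_val_score(pl_xgb, brainwave_df, encoded_labels, cv=10, scoring='accuracy')
print('Accuracy for XGBoost Classifier : ', scores.mean())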