# Emotion Recognition using EEG Brainwave Data

## Project Overview

In this project, I aim to predict human emotions using EEG brainwave data. EEG (Electroencephalography) is a technique used to record electrical activity in the brain. The dataset I'm using contains EEG readings from individuals experiencing different emotions, and my goal is to build a machine learning model that can accurately predict the emotion based on these readings.

## Dataset

The dataset I'm using is available on Kaggle and contains EEG brainwave data along with the associated emotion for each data point. The emotions are categorized into three classes: POSITIVE, NEUTRAL, and NEGATIVE. The EEG data is represented as various features derived from the EEG signals, such as mean and FFT (Fast Fourier Transform) values.
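
To make these feature types concrete, here is a rough sketch of how a mean and FFT-magnitude features could be computed from one window of a signal. The sampling rate, window length, and synthetic sine wave are assumptions chosen for illustration; this is not the feature extraction that was actually used to build the Kaggle dataset.

```python
import numpy as np

sampling_rate = 128                              # Hz, assumed for illustration
t = np.arange(sampling_rate) / sampling_rate     # one second of sample times

# Synthetic "signal": a 10 Hz oscillation on top of a constant offset of 0.2
signal = 0.2 + np.sin(2 * np.pi * 10 * t)

mean_value = signal.mean()                       # time-domain feature
fft_magnitudes = np.abs(np.fft.rfft(signal))     # frequency-domain features
freqs = np.fft.rfftfreq(signal.size, d=1 / sampling_rate)
dominant_frequency = freqs[np.argmax(fft_magnitudes)]

print(f'mean: {mean_value:.3f}')
print(f'dominant frequency: {dominant_frequency} Hz')
```

Each FFT magnitude measures how strongly one frequency is present in the window, which is why such values can serve as model inputs alongside simple statistics like the mean.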

## Approach

I start by preprocessing the data, which involves cleaning, normalization, and splitting the data into a training set and a test set. I then train a Random Forest model on the training data and evaluate its performance on the test data. I also explore other models like Logistic Regression and Gradient Boosting, and perform hyperparameter tuning and cross-validation to improve the performance of my models.

Throughout the project, I also visualize the data and the results to gain a better understanding of the relationships between the features and the target variable. This includes creating a correlation heatmap, a feature importance plot, and a confusion matrix.

In [None]:
!pip install kaggle

In [None]:
import os
# Supply your own Kaggle credentials here; never commit a real API key to a shared notebook
os.environ['KAGGLE_USERNAME'] = 'your_kaggle_username'
os.environ['KAGGLE_KEY'] = 'your_kaggle_api_key'

In [None]:
!kaggle datasets download -d birdy654/eeg-brainwave-dataset-feeling-emotions

In [None]:
!unzip eeg-brainwave-dataset-feeling-emotions.zip

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, cross_val_score
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('emotions.csv')
df.head()
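
The cleaning step mentioned in the approach is not shown explicitly in this notebook. Below is a minimal sketch of the kind of sanity check that could run right after loading. The helper `summarize_quality` is hypothetical, and the small frame with made-up column names stands in for `emotions.csv`.

```python
import numpy as np
import pandas as pd

def summarize_quality(frame: pd.DataFrame) -> dict:
    """Hypothetical helper: count missing cells, duplicate rows, and non-numeric columns."""
    return {
        'missing_cells': int(frame.isna().sum().sum()),
        'duplicate_rows': int(frame.duplicated().sum()),
        'non_numeric_columns': frame.select_dtypes(exclude='number').columns.tolist(),
    }

# Synthetic stand-in for the real data (column names are invented for this example)
demo = pd.DataFrame({
    'mean_0_a': [1.0, 2.0, np.nan, 2.0],
    'fft_0_a': [0.5, 0.1, 0.3, 0.1],
    'label': ['POSITIVE', 'NEUTRAL', 'NEGATIVE', 'NEUTRAL'],
})
report = summarize_quality(demo)
print(report)
```

On the real data, the same call would be `summarize_quality(df)`; ideally the only non-numeric column it reports is `label`.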

## Data Exploration

Before training any model, it's important to understand the nature of the data. Let's start by checking the shape of the dataset, the number of unique labels (emotions), and the distribution of these labels.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Print dataset shape
print(f'\033[1;34mDataset shape: {df.shape}\033[0m')

# Print unique labels
print(f'\033[1;36mUnique labels: {df.label.unique()}\033[0m')

# Print label distribution
print(f'\033[1;32mLabel distribution:\033[0m\n{df.label.value_counts()}')

# Plot the distribution of labels/emotions
plt.figure(figsize=(10, 5))
sns.countplot(x='label', data=df, palette="cool")
plt.title('Emotion Distribution', color='purple')
plt.grid(color = 'gray', linestyle = '--', linewidth = 0.5)
plt.show()


## Data Preprocessing

Now that we have a basic understanding of our data, let's preprocess it for our machine learning models. This involves the following steps:

1. **Label Encoding:** Convert the categorical target labels (emotions) into numerical values.
2. **Feature Scaling:** Standardize the features to have a mean of 0 and a standard deviation of 1. This matters most for scale-sensitive models such as Logistic Regression; tree-based models like Random Forest are largely unaffected by it.
3. **Train-Test Split:** Split the data into a training set and a test set. The training set is used to train the model, and the test set is used to evaluate its performance.
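
As a small illustration of the label encoding step, the sketch below shows how `LabelEncoder` maps the three emotion names to integers. The short label list is made up for the example; the class names are the ones described in the dataset section.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['POSITIVE', 'NEUTRAL', 'NEGATIVE', 'NEUTRAL']
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# LabelEncoder assigns integer codes in sorted order of the class names
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)            # {'NEGATIVE': 0, 'NEUTRAL': 1, 'POSITIVE': 2}
print(encoded.tolist())   # [2, 1, 0, 1]

# inverse_transform recovers the original names from integer codes
decoded = encoder.inverse_transform([0, 2]).tolist()
print(decoded)            # ['NEGATIVE', 'POSITIVE']
```

Keeping the fitted encoder around makes it easy to turn model predictions back into readable emotion names.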

In [None]:
from sklearn.preprocessing import LabelEncoder

# Label encoding
le = LabelEncoder()
df['label'] = le.fit_transform(df['label'])

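# Separate the features from the target
features = df.drop('label', axis=1)

# Train-test split (done before scaling so the test set does not influence the scaler)
X_train, X_test, y_train, y_test = train_test_split(features, df['label'], test_size=0.2, random_state=42)

# Feature scaling: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)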

X_train.shape, X_test.shape
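
One way to make sure the scaling statistics are learned from the training data only is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below assumes synthetic three-class data from `make_classification` as a stand-in for the EEG features, and a smaller forest than the default to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the EEG features (an assumption)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler only on the data passed to .fit(),
# so the test set never influences the scaling statistics
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=50, random_state=42))
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f'Test accuracy: {test_accuracy:.3f}')
```

A pipeline like this can also be passed directly to cross-validation utilities, which then refit the scaler inside every fold.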

## Model Training and Evaluation

We're now ready to train our machine learning model. We'll start with a Random Forest classifier, which is a powerful and versatile machine learning model that works well on a wide range of datasets.

After training the model, we'll evaluate its performance on the test set. We'll look at the accuracy of the model, as well as the confusion matrix, which shows the number of correct and incorrect predictions for each class.

In [None]:
# Model training
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# Model evaluation
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print(f'Accuracy: {accuracy}')
print(f'Confusion Matrix:\n{cm}')

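# Label the axes with the original emotion names recovered from the encoder
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=le.classes_, yticklabels=le.classes_)
plt.xlabel('Predicted label')
plt.ylabel('True label')
plt.title('Confusion Matrix')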
plt.show()
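
Accuracy alone can hide per-class weaknesses, so it can help to read precision and recall off the confusion matrix as well. The sketch below uses made-up labels and predictions, and assumes the encoding 0 = NEGATIVE, 1 = NEUTRAL, 2 = POSITIVE.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels and predictions for illustration only
y_true = [0, 0, 1, 1, 2, 2, 2, 0]
y_hat = [0, 1, 1, 1, 2, 2, 0, 0]

cm_demo = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted class
print(cm_demo)

# Recall: correct predictions divided by the number of true samples per class (row sums)
recall = cm_demo.diagonal() / cm_demo.sum(axis=1)
# Precision: correct predictions divided by the number of predictions per class (column sums)
precision = cm_demo.diagonal() / cm_demo.sum(axis=0)

print(classification_report(y_true, y_hat,
                            target_names=['NEGATIVE', 'NEUTRAL', 'POSITIVE']))
```

In the notebook itself, the equivalent call would be `classification_report(y_test, y_pred, target_names=le.classes_)`.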

## Feature Importance

One of the advantages of the Random Forest model is that it provides a straightforward way to examine the importance of each feature in making predictions. This can be useful for understanding which features are most influential in our model, and can also help guide further feature selection or engineering efforts.

In [None]:
feature_importances = rf.feature_importances_
importance_df = pd.DataFrame({'feature': features.columns, 'importance': feature_importances})
top_features = importance_df.sort_values('importance', ascending=False).head(20)

plt.figure(figsize=(10, 8))
sns.barplot(x='importance', y='feature', data=top_features, orient='h', color='#6096B4')
plt.title('Top 20 Important Features')
plt.show()
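
The `feature_importances_` attribute is impurity-based and computed from the training data, so it can be worth cross-checking against permutation importance on held-out data. This is a sketch of that alternative technique, not something the notebook above ran; it uses synthetic data and a small forest as stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: 4 informative features out of 10 (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle one feature at a time on held-out data and measure the drop in accuracy
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
ranking = result.importances_mean.argsort()[::-1]

for idx in ranking[:5]:
    print(f'feature {idx}: {result.importances_mean[idx]:.3f} '
          f'+/- {result.importances_std[idx]:.3f}')
```

If the two rankings disagree strongly, that is a hint that some impurity-based importances reflect the training data more than real predictive value.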

## Hyperparameter Tuning and Cross-Validation

To further improve the performance of our model, we can tune its hyperparameters. Hyperparameters are the parameters of the model that are not learned from the data, but are set beforehand. For the Random Forest model, these include the number of trees in the forest (n_estimators) and the maximum depth of the trees (max_depth), among others.

We'll use GridSearchCV from scikit-learn to perform a grid search over a range of possible hyperparameter values. GridSearchCV also performs cross-validation, which means it splits the training data into k 'folds' and trains and evaluates the model k times, each time with a different fold held out as a validation set. This provides a more robust estimate of the model's performance.
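
The same k-fold idea can be used on its own to compare model families, such as the Logistic Regression and Gradient Boosting models mentioned in the approach. The following is a sketch on synthetic data; the model settings are illustrative assumptions, not the configurations used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the EEG features (an assumption)
X, y = make_classification(n_samples=300, n_features=15, n_informative=6,
                           n_classes=3, random_state=1)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=1),
    # Logistic Regression is scale-sensitive, so it gets its own scaler
    'logistic_regression': make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    'gradient_boosting': GradientBoostingClassifier(n_estimators=50, random_state=1),
}

cv_means = {}
for name, model in candidates.items():
    # 5 folds: each fold is held out once while the model trains on the other four
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    cv_means[name] = scores.mean()
    print(f'{name}: {scores.mean():.3f} (std {scores.std():.3f})')
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score varies from fold to fold.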

In [None]:
# Hyperparameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# Grid search
grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Best parameters and score
print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
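
The best cross-validation score is computed on the training folds only, so a natural follow-up is to score the tuned model on the held-out test set. Here is a minimal sketch on synthetic data with a deliberately tiny grid; in the notebook the equivalent would be `grid_search.best_estimator_.predict(X_test)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic 3-class data standing in for the EEG features (an assumption)
X, y = make_classification(n_samples=300, n_features=12, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# A deliberately small grid so the example runs quickly
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      {'n_estimators': [25, 50], 'max_depth': [None, 5]},
                      cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)

# With the default refit=True, best_estimator_ is retrained on the full training set
best_model = search.best_estimator_
held_out_accuracy = accuracy_score(y_te, best_model.predict(X_te))

print(f'Best CV score: {search.best_score_:.3f}')
print(f'Held-out test accuracy: {held_out_accuracy:.3f}')
```

A test accuracy well below the cross-validation score would suggest the grid search has overfit to the training folds.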

We used EEG data to predict emotions, preprocessing it and splitting it into training and test sets. We trained a Random Forest model, evaluated its performance, explored feature importance, and tuned hyperparameters for optimization.

The best hyperparameters achieved a mean 5-fold cross-validation accuracy of approximately 98.4% on the training data, showing potential for accurate emotion prediction via EEG data, although real-world scenarios would be more complex.

This technology could advance mental health treatments, enhance tech interfaces, and contribute to neuroscience research, highlighting machine learning's potential in emotion recognition.
