EEG Turkish Sentence Decoding using Deep Learning project

Dataset link: https://www.kaggle.com/datasets/mehmetbayin/turkish-sentence-eeg-dataset

    - Reading Demonstration Set: 15-second 14-channel EEG signals recorded from 20 volunteers
    
    - Reading Listening Set: 15-second 14-channel EEG signals recorded from 20 volunteers
    
    - EMOTIV EPOC+ mobile system 
        - collects 14-channel EEG signals from 16 scalp zones (14 recording channels plus 2 reference zones): AF3 (1), F7 (2), F3 (3), FC5 (4), T7 (5), P7 (6), O1 (7), O2 (8), P8 (9), T8 (10), FC6 (11), F4 (12), F8 (13), AF4 (14), P3 (reference zone), and P4 (reference zone)
        - sampling rate 128 Hz, bandwidth 0.16-43 Hz
        - dataset contains 1600 observations and 1600 labels
        - .mat file
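
As a quick sanity check on the figures above, the expected number of samples per recording is the duration multiplied by the sampling rate. The sketch below assumes each observation is stored at its full 15-second length with one row per channel; the constant names are illustrative and do not come from the dataset.

```python
# Illustrative arithmetic only; the names below are hypothetical.
DURATION_S = 15          # length of each recording in seconds
SAMPLING_RATE_HZ = 128   # EMOTIV EPOC+ sampling rate
N_CHANNELS = 14          # recording channels (reference zones excluded)

# Samples per channel = seconds * samples per second
samples_per_channel = DURATION_S * SAMPLING_RATE_HZ

# Assumed layout of one observation: (channels, time_points)
expected_shape = (N_CHANNELS, samples_per_channel)
print(expected_shape)
```

If the arrays loaded later have a different shape, the recordings may have been trimmed, padded, or transposed, which is worth checking before modelling.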

Methodology:
    - may require transfer learning in a multi-class classification setting (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10400490/)
    - RNN architectures such as LSTM are well suited to EEG data: they are designed to process sequences, which fits time-series EEG signals, and they can capture long-term dependencies in those signals
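
Recurrent layers in common frameworks (for example Keras `LSTM`, or PyTorch `nn.LSTM` with `batch_first=True`) read input as (samples, time_steps, features). The notebook does not name a framework, so the following is only a NumPy sketch of arranging a batch of trials in that layout. It assumes the trials are stored as (samples, channels, time_points) and uses synthetic data.

```python
import numpy as np

# Synthetic stand-in for a batch of EEG trials: (samples, channels, time_points).
rng = np.random.default_rng(0)
batch = rng.standard_normal((8, 14, 1920))

# A recurrent layer steps along the time axis and sees the 14 channel values
# as the feature vector at each step, so move time ahead of channels.
lstm_ready = np.transpose(batch, (0, 2, 1))

print(lstm_ready.shape)
```

Each channel's time series is unchanged by the transpose; only the axis order differs.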

In [None]:
from scipy.io import loadmat
import numpy as np
import pandas as pd
#loading eeg data as a .mat file
eeg_data = loadmat('data/TurkishSentenceEEGData.mat')


In [None]:
eeg_data

In [None]:
len(eeg_data)

In [None]:
eeg_data.keys()

In [None]:
eeg_data.values()
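
Printing the whole dictionary or its values can produce a lot of output. `loadmat` also adds metadata entries (`__header__`, `__version__`, `__globals__`) alongside the data variables. A compact alternative is to list only the shape and dtype of each variable. The helper `summarize_mat` below is hypothetical, and the example runs on a hand-built dictionary with made-up shapes rather than the real file.

```python
import numpy as np

def summarize_mat(mat_dict):
    """Return {name: (shape, dtype)} for the data variables in a loadmat-style dict."""
    summary = {}
    for name, value in mat_dict.items():
        if name.startswith("__"):  # skip __header__, __version__, __globals__
            continue
        arr = np.asarray(value)
        summary[name] = (arr.shape, str(arr.dtype))
    return summary

# Hand-built stand-in that mimics the structure loadmat returns.
fake_mat = {
    "__header__": b"MATLAB 5.0 MAT-file",
    "__version__": "1.0",
    "__globals__": [],
    "Observations": np.zeros((4, 1), dtype=object),
    "Labels": np.array([[1], [2], [1], [2]]),
}

summary = summarize_mat(fake_mat)
print(summary)
```

On the real data, `summarize_mat(eeg_data)` would give the same kind of overview without printing the signal values.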

In [None]:
#how many values are in each label
labels_array = eeg_data['Labels']
unique_values, counts = np.unique(labels_array, return_counts=True)
value_counts_dict = dict(zip(unique_values, counts))
value_counts_dict

In [None]:
# Extract the observations and flatten the labels to a 1-D array
observations = eeg_data['Observations']
labels = eeg_data['Labels'].ravel()

In [None]:
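# Assumes the first 800 observations are the demonstration set and the remaining 800 the listening set
demonstration_data = observations[:800]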
listening_data = observations[800:]
demonstration_labels = labels[:800]
listening_labels = labels[800:]


In [None]:
# Convert the 2D arrays in demonstration_data and listening_data to lists
demonstration_data_list = [obs.tolist() for obs in demonstration_data]
listening_data_list = [obs.tolist() for obs in listening_data]

In [None]:
# Convert to Pandas DataFrame
demo_df = pd.DataFrame({'EEG_Data': demonstration_data_list, 'Label': demonstration_labels})
listen_df = pd.DataFrame({'EEG_Data': listening_data_list, 'Label': listening_labels})

In [None]:
demo_df


In [None]:
listen_df

In [None]:
#check for missing values
print("Missing values in demo_df:", demo_df.isnull().sum())
print("Missing values in listen_df:", listen_df.isnull().sum())

In [None]:
#normalize EEG data to zero mean and unit variance using StandardScaler
from sklearn.preprocessing import StandardScaler

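def normalize_eeg_data_2d(eeg_2d_list):
    """
    Standardize each 2D EEG matrix in the list (column-wise, as StandardScaler does).
    """
    scaler = StandardScaler()
    # fit_transform refits the scaler on every matrix, so each one is scaled independently
    return [scaler.fit_transform(eeg_matrix) for eeg_matrix in eeg_2d_list]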
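
`StandardScaler` standardizes each column of the matrix it receives. If a trial is laid out as (channels, time_points), each time point is therefore scaled across the 14 channels, not each channel across time. Should per-channel normalization be the intent, a minimal NumPy sketch could look like this. The function name `zscore_per_channel` and the (channels, time_points) layout are assumptions, and the data is synthetic.

```python
import numpy as np

def zscore_per_channel(trial, eps=1e-8):
    """Standardize each row (channel) of a (channels, time_points) array."""
    trial = np.asarray(trial, dtype=float)
    mean = trial.mean(axis=1, keepdims=True)  # one mean per channel
    std = trial.std(axis=1, keepdims=True)    # one std per channel
    return (trial - mean) / (std + eps)       # eps guards against flat channels

# Synthetic trial with a non-zero mean and non-unit variance.
rng = np.random.default_rng(0)
trial = rng.normal(loc=5.0, scale=3.0, size=(14, 1920))
normalized = zscore_per_channel(trial)

print(normalized.mean(axis=1).round(3))
print(normalized.std(axis=1).round(3))
```

Whichever axis is chosen, it helps to confirm the orientation of one trial before normalizing the whole set.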

In [None]:
#apply normalization 
demo_df['EEG_Data'] = demo_df['EEG_Data'].apply(normalize_eeg_data_2d)
listen_df['EEG_Data'] = listen_df['EEG_Data'].apply(normalize_eeg_data_2d)

In [None]:
# Convert to a numpy array
demo_data_array = np.array(demo_df['EEG_Data'].tolist())
listen_data_array = np.array(listen_df['EEG_Data'].tolist())

print("Shape of demo_data_array:", demo_data_array.shape)
print("Shape of listen_data_array:", listen_data_array.shape)

In [None]:
# Removing the unnecessary dimension to fit (number_of_samples, channels, time_points).
demo_data_reshaped = np.squeeze(demo_data_array)
listen_data_reshaped = np.squeeze(listen_data_array)

print("Shape of demo_data_reshaped:", demo_data_reshaped.shape)
print("Shape of listen_data_reshaped:", listen_data_reshaped.shape)
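
Called without an `axis` argument, `np.squeeze` drops every length-1 axis, which can remove more than intended. The sketch below assumes the stacked array is 4-D with a singleton second axis; the shapes are small and made up for illustration.

```python
import numpy as np

# Assumed layout after stacking: (samples, 1, channels, time_points)
stacked = np.zeros((5, 1, 14, 32))

# Naming the axis removes only that singleton dimension.
squeezed = np.squeeze(stacked, axis=1)
print(squeezed.shape)

# With a single sample, a bare squeeze would also drop the batch axis.
single = np.zeros((1, 1, 14, 32))
bare = np.squeeze(single)
explicit = np.squeeze(single, axis=1)
print(bare.shape, explicit.shape)
```

Passing `axis=1` also raises an error if that axis is not of length 1, which surfaces shape surprises early.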

In [None]:
# Split the data into training, validation, and test sets
from sklearn.model_selection import train_test_split

In [None]:
def split_data(X, y, test_size=0.2, val_size=0.25):
    """
    Split features and labels into training, validation, and test sets.

    Args:
    - X: The features (EEG data).
    - y: The labels.
    - test_size: Proportion of the dataset to include in the test split.
    - val_size: Proportion of the remaining (non-test) data to include in the validation split.

    Returns:
    - X_train, y_train: Training data and labels.
    - X_val, y_val: Validation data and labels.
    - X_test, y_test: Test data and labels.
    """
    # First hold out the test set, then carve the validation set out of the remainder
    X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=val_size, random_state=42)
    return X_train, y_train, X_val, y_val, X_test, y_test

demo_labels = demo_df['Label'].values
listen_labels = listen_df['Label'].values

demo_X_train, demo_y_train, demo_X_val, demo_y_val, demo_X_test, demo_y_test = split_data(demo_data_reshaped, demo_labels)
listen_X_train, listen_y_train, listen_X_val, listen_y_val, listen_X_test, listen_y_test = split_data(listen_data_reshaped, listen_labels)

print("Demo Training Data Shape:", demo_X_train.shape)
print("Demo Validation Data Shape:", demo_X_val.shape)
print("Demo Test Data Shape:", demo_X_test.shape)
print("\nListening Training Data Shape:", listen_X_train.shape)
print("Listening Validation Data Shape:", listen_X_val.shape)
print("Listening Test Data Shape:", listen_X_test.shape)
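
Because the validation fraction is applied to what is left after the test split, `test_size=0.2` and `val_size=0.25` give a 60/20/20 split overall. The arithmetic below assumes 800 trials per set, as implied by the slicing at index 800. It mirrors how `train_test_split` rounds a fractional `test_size` (the held-out count is rounded up).

```python
import math

n_total = 800      # trials per set (assumed from the slicing at index 800)
test_size = 0.2    # fraction of all trials held out for testing
val_size = 0.25    # fraction of the remaining trials held out for validation

n_test = math.ceil(n_total * test_size)
n_remaining = n_total - n_test
n_val = math.ceil(n_remaining * val_size)
n_train = n_remaining - n_val

print(n_train, n_val, n_test)
```

These counts should match the first dimension of the shapes printed above for both the demonstration and listening sets.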
