## Data Preprocessing and Aggregation

This notebook preprocesses and aggregates raw EEG data for error-related potential (ErrP) analysis. It produces a clean, analysis-ready dataset for subsequent feature extraction and classification.

Dataset:

The EEG dataset is publicly available from the BNCI Horizon 2020 database.

- Title: 22. Monitoring error-related potentials (013-2015)
- Link: https://bnci-horizon-2020.eu/database/data-sets

Steps:

1. Loads raw EEG .mat files from all subjects, sessions, and trials.
2. Applies a 0.5–10 Hz bandpass filter to the continuous signal using the MNE-Python toolbox.
3. Extracts event markers and segments the continuous EEG into single-trial epochs aligned to task events, assigning each epoch an error/correct label.
4. Performs baseline correction for each epoch (removing pre-stimulus mean).
5. Aggregates all epochs, labels, and subject/session/trial metadata into unified arrays.
6. Crops each epoch to the analysis window of interest (approximately 200–600 ms post-stimulus, 200 samples at 512 Hz) for standardization and GAN fitting.
7. Saves the final cleaned and segmented arrays to disk as .npy files, ready for feature engineering and classification.

This notebook is the foundation for all following steps.
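
To illustrate what the bandpass step (step 2) does to a signal, here is a minimal sketch on synthetic data. It is not the notebook's method: the notebook filters with MNE, whereas this sketch swaps in a zero-phase Butterworth filter from SciPy so it runs without any EEG files. The filter order, the test frequencies, and the 20 s duration are assumptions chosen for illustration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 512                                   # sampling rate in Hz, as in this notebook
t = np.arange(0, 20, 1 / fs)               # 20 s of synthetic data (assumed duration)

slow = np.sin(2 * np.pi * 5 * t)           # 5 Hz component, inside the 0.5-10 Hz passband
fast = np.sin(2 * np.pi * 50 * t)          # 50 Hz component, outside the passband
x = slow + fast + 2.0                      # add a constant offset (drift stand-in)

# 4th-order Butterworth bandpass, applied forward and backward (zero phase).
# This is a stand-in for the MNE filter call, not an equivalent filter design.
sos = butter(4, [0.5, 10], btype="bandpass", fs=fs, output="sos")
y = sosfiltfilt(sos, x)

# Compare the filtered signal with the 5 Hz component away from the edges
mid = slice(5 * fs, 15 * fs)
err = np.sqrt(np.mean((y[mid] - slow[mid]) ** 2))
print(f"RMS difference to the 5 Hz component: {err:.4f}")
```

A small RMS difference indicates that the offset and the 50 Hz component were removed while the 5 Hz component passed through largely unchanged.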

In [1]:
import scipy.io as sio
import numpy as np
import os
import mne

import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data_dir = '/Users/Rosie/Documents/Applications/HRC_BCI_VU/Casus_BCI_classifier/data'

n_participants = 6
n_sessions = 2
n_trials = 10
fs = 512        # original sampling rate
#fs_new = 128   # in case of downsampling

ch_names = ['Fp1', 'AF7', 'AF3', 'F1', 'F3', 'F5', 'F7', 'FT7', 'FC5', 'FC3', 'FC1', 'C1', 'C3',
            'C5', 'T7', 'TP7', 'CP5', 'CP3', 'CP1', 'P1', 'P3', 'P5', 'P7', 'P9', 'PO7', 'PO3',
            'O1', 'Iz', 'Oz', 'POz', 'Pz', 'CPz', 'Fpz', 'Fp2', 'AF8', 'AF4', 'AFz', 'Fz', 'F2',
            'F4', 'F6', 'F8', 'FT8', 'FC6', 'FC4', 'FC2', 'FCz', 'Cz', 'C2', 'C4', 'C6', 'T8',
            'TP8', 'CP6', 'CP4', 'CP2', 'P2', 'P4', 'P6', 'P8', 'P10', 'PO8', 'PO4', 'O2']

epoch_start_ms = -200 # for baseline correction
epoch_end_ms = 600

all_epochs = []
all_labels = []
all_subjects = []
all_sessions = []
all_trials = []
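
The epoch limits above are given in milliseconds, but the segmentation works on sample indices. The sketch below shows the conversion used later in the notebook, `int(ms * fs / 1000)`, and the epoch length it produces at 512 Hz. The helper name `ms_to_samples` is hypothetical and only used here.

```python
def ms_to_samples(ms, fs):
    """Convert milliseconds to a sample offset; int() truncates toward zero."""
    return int(ms * fs / 1000)

fs = 512
epoch_start = ms_to_samples(-200, fs)   # -102.4 truncates to -102
epoch_end = ms_to_samples(600, fs)      # 307.2 truncates to 307
epoch_len = epoch_end - epoch_start     # 409 samples per epoch

print(epoch_start, epoch_end, epoch_len)  # -102 307 409
```

Because `int()` truncates toward zero rather than rounding down, the pre-event part is 102 samples (about 199.2 ms), not 103.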

In [3]:
for subj_idx in range(1, n_participants + 1):
    for sess_idx in range(1, n_sessions + 1):
        mat_path = os.path.join(data_dir, f'Subject{subj_idx:02d}_s{sess_idx}.mat')
        
        # Load session file
        subject = sio.loadmat(mat_path, struct_as_record=False, squeeze_me=True)
        runs = subject['run']  # 10 trials (runs) in this session
        
        # If only a single run is present, make it a list
        if isinstance(runs, np.ndarray):
            run_list = list(runs)
        else:
            run_list = [runs]
            
        for trial_idx, run in enumerate(run_list):
            # Extract EEG and header
            eeg = run.eeg # shape: (n_timepoints, n_channels)
            header = run.header
            
            # Bandpass filter with MNE (the resampling step below is disabled)
            info = mne.create_info(
                ch_names=ch_names,
                sfreq=fs,
                ch_types='eeg'
            )
            raw = mne.io.RawArray(eeg.T, info)  # MNE expects (n_channels, n_times)
            raw_filtered = raw.copy().filter(l_freq=0.5, h_freq=10, verbose=False)
            #raw_filtered_resampled = raw_filtered.copy().resample(fs_new, verbose=False)
            eeg_proc = raw_filtered.get_data().T  # (n_timepoints, n_channels)
            
            # Extract event types and sample positions (no downsampling applied)
            event_types = np.array(header.EVENT.TYP)
            event_positions = np.array(header.EVENT.POS)
            #event_positions_ds = (event_positions / (fs / fs_new)).astype(int)
            
            # Assign labels for error/correct
            labels = []
            positions = []
            
            #for typ, pos in zip(event_types, event_positions_ds):
            for typ, pos in zip(event_types, event_positions):
                if typ in [5, 10]:   # correct
                    labels.append(0)
                    positions.append(pos)
                elif typ in [6, 9]:  # error
                    labels.append(1)
                    positions.append(pos)
                    
            labels = np.array(labels)
            positions = np.array(positions)
            
            # Extract epochs and baseline correct
#             epoch_start = int(epoch_start_ms * fs_new / 1000)
#             epoch_end = int(epoch_end_ms * fs_new / 1000)
            epoch_start = int(epoch_start_ms * fs / 1000)
            epoch_end = int(epoch_end_ms * fs / 1000)
            
            for label, pos in zip(labels, positions):
                start_idx = pos + epoch_start
                end_idx = pos + epoch_end
                
                if start_idx >= 0 and end_idx < eeg_proc.shape[0]:
                    epoch = eeg_proc[start_idx:end_idx, :]  # slice the filtered signal
                    
                    # baseline correction: subtract mean of pre-event period
                    baseline = epoch[:abs(epoch_start), :].mean(axis=0, keepdims=True)
                    epoch = epoch - baseline
                    
                    all_epochs.append(epoch)
                    all_labels.append(label)
                    all_subjects.append(subj_idx)
                    all_sessions.append(sess_idx)
                    all_trials.append(trial_idx)

Creating RawArray with float64 data, n_channels=64, n_times=91648
    Range : 0 ... 91647 =      0.000 ...   178.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=102400
    Range : 0 ... 102399 =      0.000 ...   199.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=94720
    Range : 0 ... 94719 =      0.000 ...   184.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=95232
    Range : 0 ... 95231 =      0.000 ...   185.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=91136
    Range : 0 ... 91135 =      0.000 ...   177.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=90112
    Range : 0 ... 90111 =      0.000 ...   175.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=93184
    Range : 0 ... 93183 =      0.000 ...   181.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=101888
    Range : 0 ... 101887 =
[...]
    Range : 0 ... 91647 =      0.000 ...   178.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=93184
    Range : 0 ... 93183 =      0.000 ...   181.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=90624
    Range : 0 ... 90623 =      0.000 ...   176.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=89088
    Range : 0 ... 89087 =      0.000 ...   173.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=91136
    Range : 0 ... 91135 =      0.000 ...   177.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=90112
    Range : 0 ... 90111 =      0.000 ...   175.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=90624
    Range : 0 ... 90623 =      0.000 ...   176.998 secs
Ready.
Creating RawArray with float64 data, n_channels=64, n_times=91136
    Range : 0 ... 91135 =      0.000 ...   177.998 secs
Ready.
[...]
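
The labelling loop in the cell above can also be expressed with boolean masks. The sketch below is an alternative formulation on made-up event arrays, not the code that produced the results in this notebook; only the event codes (5 and 10 for correct, 6 and 9 for error) are taken from the loop, and the other codes and all positions are invented for illustration.

```python
import numpy as np

# Made-up event stream: codes other than 5, 6, 9, 10 are ignored
event_types = np.array([1, 5, 6, 2, 10, 9, 4, 5])
event_positions = np.array([100, 600, 1100, 1600, 2100, 2600, 3100, 3600])

correct_mask = np.isin(event_types, [5, 10])   # correct events
error_mask = np.isin(event_types, [6, 9])      # error events
keep = correct_mask | error_mask

labels = error_mask[keep].astype(int)          # 0 = correct, 1 = error
positions = event_positions[keep]

print(labels)     # [0 1 0 1 0]
print(positions)  # [ 600 1100 2100 2600 3600]
```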

In [4]:
# Final aggregation 
all_epochs = np.array(all_epochs)      # (n_epochs, epoch_len, n_channels)
all_labels = np.array(all_labels)      # (n_epochs,)
all_subjects = np.array(all_subjects)  # (n_epochs,)
all_sessions = np.array(all_sessions)  # (n_epochs,)
all_trials = np.array(all_trials)      # (n_epochs,)

print(f"Aggregated epochs shape: {all_epochs.shape}")
print(f"Aggregated labels shape: {all_labels.shape}")
print(f"Aggregated subjects shape: {all_subjects.shape}")
print(f"Aggregated sessions shape: {all_sessions.shape}")
print(f"Aggregated trials shape: {all_trials.shape}")

Aggregated epochs shape: (6437, 409, 64)
Aggregated labels shape: (6437,)
Aggregated subjects shape: (6437,)
Aggregated sessions shape: (6437,)
Aggregated trials shape: (6437,)
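
Beyond checking shapes, it can be useful to look at the class balance per subject, since error trials are typically the minority class. The following sketch shows one way to do that with `np.bincount`. The small arrays are synthetic stand-ins for `all_labels` and `all_subjects`, so the printed rates say nothing about the real dataset.

```python
import numpy as np

# Synthetic stand-ins for all_labels and all_subjects
labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])
subjects = np.array([1, 1, 1, 1, 2, 2, 2, 2])

error_rate = {}
for s in np.unique(subjects):
    counts = np.bincount(labels[subjects == s], minlength=2)  # [n_correct, n_error]
    error_rate[int(s)] = counts[1] / counts.sum()
    print(f"Subject {s}: correct={counts[0]}, error={counts[1]}")

print(error_rate)
```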


In [12]:
# Crop after baseline correction
# -209 instead of -200 gives a 200-sample window for GAN fitting
crop_start = int((200 - (-209)) * fs / 1000)  # 209
crop_end = int((600 - (-200)) * fs / 1000)    # 409

# Keep samples from about 209 ms to 600 ms post-event (end index is exclusive)
all_epochs_cropped = all_epochs[:, crop_start:crop_end, :]

print('New cropped shape:', all_epochs_cropped.shape)

New cropped shape: (6437, 200, 64)
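
To see which latencies the cropped samples correspond to, the indices can be mapped back to time relative to the event. This is a small sketch using the same constants as the cells above; the variable `times_ms` is introduced here for illustration and is not saved by the notebook.

```python
import numpy as np

fs = 512
epoch_start = int(-200 * fs / 1000)            # -102 samples before the event
crop_start = int((200 - (-209)) * fs / 1000)   # 209
crop_end = int((600 - (-200)) * fs / 1000)     # 409 (exclusive)

# Index within the epoch -> sample offset from the event -> milliseconds
sample_idx = np.arange(crop_start, crop_end)
times_ms = (sample_idx + epoch_start) / fs * 1000

print(len(times_ms))              # 200
print(times_ms[0], times_ms[-1])  # first and last latency in ms
```

With these constants the first cropped sample lies at about 209 ms and the last at about 598 ms after the event.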


In [13]:
out_dir = "/Users/Rosie/Documents/Applications/HRC_BCI_VU/Casus_BCI_classifier/data_preprocessed"

#np.save(os.path.join(out_dir, "all_epochs.npy"), all_epochs)  # uncropped epochs
np.save(os.path.join(out_dir, "all_epochs.npy"), all_epochs_cropped)
np.save(os.path.join(out_dir, "all_labels.npy"), all_labels)
np.save(os.path.join(out_dir, "all_subjects.npy"), all_subjects)
np.save(os.path.join(out_dir, "all_sessions.npy"), all_sessions)
np.save(os.path.join(out_dir, "all_trials.npy"), all_trials)
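
Downstream notebooks would read these files back with `np.load`. The sketch below shows a save/load round trip and a basic alignment check. It uses a temporary folder and small random arrays instead of the real output directory, so the folder and array contents are assumptions for illustration only.

```python
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
epochs = rng.standard_normal((12, 200, 64))   # synthetic stand-in for all_epochs_cropped
labels = rng.integers(0, 2, size=12)          # synthetic stand-in for all_labels

with tempfile.TemporaryDirectory() as out_dir:   # temporary stand-in for the output folder
    np.save(os.path.join(out_dir, "all_epochs.npy"), epochs)
    np.save(os.path.join(out_dir, "all_labels.npy"), labels)

    epochs_loaded = np.load(os.path.join(out_dir, "all_epochs.npy"))
    labels_loaded = np.load(os.path.join(out_dir, "all_labels.npy"))

# The arrays should survive the round trip and stay aligned on the first axis
roundtrip_ok = (
    np.array_equal(epochs, epochs_loaded)
    and np.array_equal(labels, labels_loaded)
    and epochs_loaded.shape[0] == labels_loaded.shape[0]
)
print(roundtrip_ok)
```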