This notebook explores the data and metadata associated with this competition. As a reminder, the goal of the competition is to detect and classify seizures and other types of harmful brain activity. From electroencephalography (EEG) signals recorded from critically ill hospital patients, we must build a classifier that assigns a given EEG time series to the correct class, based on the detection and recognition of one of 6 patterns:
* seizure (SZ)
* generalized periodic discharges (GPD)
* lateralized periodic discharges (LPD)
* lateralized rhythmic delta activity (LRDA)
* generalized rhythmic delta activity (GRDA)
* “other” for all signals that do not fit into the above categories

All patterns have specific characteristics based on their frequency, the presence of time gaps between pulses, their temporal extent and their location. For more information, you can watch this YouTube playlist from Fábio A. Nascimento: [EEG Talk - ACNS Critical Care EEG Terminology 2021](https://www.youtube.com/playlist?list=PL1qAb9U_Ln6EO07t6SuqjZdgkwwRNOECr) (careful, the sound is loud at the beginning of the videos ^^). Across those videos, Dr. Lawrence Hirsch explains each of the patterns of interest and what characterizes them. For a more thorough explanation with figures, you can also check this article from ACNS: [American Clinical Neurophysiology Society’s Standardized Critical Care EEG Terminology: 2021 Version](https://www.acns.org/UserFiles/file/ACNSStandardizedCriticalCareEEGTerminology_rev2021.pdf) and use your favorite LLM to summarize it.

<div style="font-size: large;">
    We will:
    <ol>
        <li>Look at the training metadata and understand what information each column gives</li>
        <li>Look at an example EEG record, the signals it contains and their characteristics, and plot them on a time axis</li>
        <li>Try to filter the EKG signal of this example record with the help of the scipy signal and fft modules</li>
    </ol>
</div>

<center>
    <img src="https://i.imgur.com/lkrxljj.png" alt="an AI generated illustration" width=600>
</center>

# 📦 Setup and imports

In [None]:
import os

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.fft as fft
import scipy.signal as signal
import seaborn as sns

matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['figure.figsize'] = (10, 6)

# 📂 Load and explore metadata

## Load metadata

In [None]:
dataset_path = "/kaggle/input/hms-harmful-brain-activity-classification"
training_metadata_path = os.path.join(dataset_path, "train.csv")

In [None]:
training_meta = pd.read_csv(training_metadata_path)
print(f"Training metadata shape: {training_meta.shape}")
training_meta.head()

In [None]:
print("Columns: ")
training_meta.columns.to_list()

## Comments

In [None]:
print("For each eeg and spectrogram, there is a unique patient")
print(training_meta.groupby("eeg_id").patient_id.nunique().value_counts())
print(training_meta.groupby("spectrogram_id").patient_id.nunique().value_counts())

print()
print("-" * 80)
print("But one patient can be recorded several times")
print(training_meta.groupby("patient_id").eeg_id.nunique().sort_values(ascending=False))
print(training_meta.groupby("patient_id").spectrogram_id.nunique().sort_values(ascending=False))

In [None]:
# Distribution of patterns is rather balanced
fig = plt.figure()
ax = sns.countplot(
    x="expert_consensus",
    data=training_meta,
    hue="expert_consensus",
    dodge=False
)
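legend = ax.get_legend()
if legend is not None:  # some seaborn versions omit the legend when hue duplicates x
    legend.get_frame().set_alpha(0.6)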
plt.title("Expert consensus distribution")
plt.grid()

In [None]:
# only label_id is unique for each row
training_meta.nunique()

In [None]:
# A single spectrogram can regroup several eegs
print(training_meta.groupby("spectrogram_id").eeg_id.nunique().sort_values(ascending=False))
print()
print("-" * 80)
print()
training_meta[training_meta.spectrogram_id == 764146759]

## Metadata column description

We will explain the meaning of the different columns. The explanations are mainly based on the competition data description available at [HMS - Harmful Brain Activity Classification](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/data):
* `eeg_id`: A unique identifier for the entire EEG recording. It is also the name of the EEG's parquet file in the **train_eegs** folder. Note that since each row designates a particular subsample of an EEG, there can be several rows with the same `eeg_id`.
* `eeg_sub_id`: An ID for the specific 50-second-long subsample this row's labels apply to. `eeg_sub_id` values are assigned in chronological order for each `eeg_id`, beginning with 0. The subsample is shifted from the beginning of the `eeg_id` record by `eeg_label_offset_seconds`. Note that even though the subsample is 50 seconds long, the annotation is done by looking at the central 10 seconds.
* `eeg_label_offset_seconds`: The time between the beginning of the consolidated EEG and this subsample (a simple time shift).
* `spectrogram_id`: A unique identifier for an entire spectrogram. A spectrogram can cover several EEG records. Note that `spectrogram_id` values are the names of the spectrogram files in the **train_spectrograms** folder. For more information, see the Wikipedia page on the [spectrogram](https://en.wikipedia.org/wiki/Spectrogram).
* `spectrogram_sub_id`: An ID for the specific 10-minute subsample this row's labels apply to. Same concept as for `eeg_sub_id`.
* `spectrogram_label_offset_seconds`: The time between the beginning of the consolidated spectrogram and this subsample (a simple time shift).
* `label_id`: An ID for this set of labels. Unique for each row.
* `patient_id`: An ID for the patient who donated the data. An `eeg_id` or `spectrogram_id` is associated with a single `patient_id`, but a single patient can be associated with several EEGs and spectrograms.
* `expert_consensus`: The consensus annotator label. Provided for convenience only. It gives the final decision classifying the central 10-second window of a given subsample into one of the 6 available categories.
* `[seizure/lpd/gpd/lrda/grda/other]_vote`: The count of annotator votes for a given brain activity class. The full names of the activity classes are as follows: lpd: lateralized periodic discharges, gpd: generalized periodic discharges, lrda: lateralized rhythmic delta activity, and grda: generalized rhythmic delta activity. Note that the total number of annotators can vary for each row, even within the same EEG. The size of the cohort of experts ranges from 1 to 28.
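
Since the number of annotators varies from row to row, raw vote counts are not directly comparable between rows. One possible way to handle this (a sketch, not something prescribed by the competition description) is to convert the counts into per-row proportions. The helper name `votes_to_proportions` and the toy values below are made up for illustration.

```python
import pandas as pd

VOTE_COLUMNS = [
    "seizure_vote", "lpd_vote", "gpd_vote",
    "lrda_vote", "grda_vote", "other_vote",
]


def votes_to_proportions(meta: pd.DataFrame) -> pd.DataFrame:
    """Divide each vote count by the total number of votes in its row."""
    votes = meta[VOTE_COLUMNS]
    return votes.div(votes.sum(axis=1), axis=0)


# Toy rows (made-up values) with 3 and 10 annotators respectively
toy_meta = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0], [0, 5, 0, 1, 0, 4]],
    columns=VOTE_COLUMNS,
)
toy_props = votes_to_proportions(toy_meta)
print(toy_props.round(2))
```

Applied to the real metadata, the same function would be called as `votes_to_proportions(training_meta)`.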

# 🧠 Let's look at an example eeg

## Recording parameters

In [None]:
fs = 200  # data sampled at 200 Hz
window_length = 10  # 10 seconds central window
subsample_length = 50  # 50 seconds subsampling for each row
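
To make the link between a metadata row and the raw samples concrete, here is a small sketch that turns `eeg_label_offset_seconds` into sample indices. It assumes the 10-second labeled window sits exactly in the middle of the 50-second subsample, and the helper name `subsample_bounds` is hypothetical.

```python
def subsample_bounds(offset_seconds, fs=200, subsample_length=50, window_length=10):
    """Return ((start, end) of the subsample, (start, end) of its central window).

    Indices are in samples; the end index is exclusive, so they can be
    used directly with slicing such as df.iloc[start:end].
    """
    start = int(round(offset_seconds * fs))
    end = start + subsample_length * fs
    margin = (subsample_length - window_length) // 2 * fs  # samples on each side
    return (start, end), (start + margin, end - margin)


sub, central = subsample_bounds(offset_seconds=6.0)
print(sub, central)
```

For an offset of 6 seconds, the subsample covers samples 1200 to 11200 and the central window covers samples 5200 to 7200.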

## Choose particular eeg_id and load eeg

In [None]:
eeg_path_template = dataset_path + "/train_eegs/{eeg_id}.parquet"

In [None]:
# we chose this id as it is associated with a heavily recognized pattern 
eeg_id = 2900632927
training_meta[training_meta.eeg_id == eeg_id]

In [None]:
df_eeg = pd.read_parquet(eeg_path_template.format(eeg_id=eeg_id))
print("EEG record shape: ", df_eeg.shape)
df_eeg.head()

In [None]:
print(df_eeg.columns.to_list())

We have 20 signals: 19 correspond to EEG measurements, which are voltage fluctuations recorded by electrodes placed on the scalp, and the last one, `EKG`, is an electrocardiogram of the patient.

The EEG measurements give a view of the electrical activity of the brain. The signal names (`Fp1`, `F3`, ...) correspond to standardized locations where the electrodes are attached. Each location is generally associated with a particular area of the brain. The letters designate a region, mostly named after the underlying lobe: pre-frontal (Fp), frontal (F), temporal (T), parietal (P) and occipital (O), plus central (C), which is a position rather than a lobe. An even number in the electrode name indicates that it is placed on the right side of the head, an odd number that it is placed on the left side. z (for zero) indicates that the electrode is placed on the midline, where it mostly serves as a reference point. For more information, you can see the [10–20 system (EEG)](https://en.wikipedia.org/wiki/10%E2%80%9320_system_(EEG)).
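
The naming rule above can be turned into a few lines of code. This is only an illustrative sketch: the helper `describe_electrode` is hypothetical and only handles the letter prefixes listed in this paragraph (so `EKG`, for instance, is rejected).

```python
import re

REGIONS = {
    "Fp": "pre-frontal",
    "F": "frontal",
    "T": "temporal",
    "P": "parietal",
    "O": "occipital",
    "C": "central",
}


def describe_electrode(name):
    """Split a 10-20 electrode name into (region, side)."""
    match = re.fullmatch(r"(Fp|F|T|P|O|C)(z|\d+)", name)
    if match is None:
        raise ValueError(f"Unrecognized electrode name: {name}")
    letters, suffix = match.groups()
    if suffix == "z":
        side = "midline"
    else:
        # even number: right side of the head, odd number: left side
        side = "right" if int(suffix) % 2 == 0 else "left"
    return REGIONS[letters], side


for electrode in ["Fp1", "F4", "Cz", "O2"]:
    print(electrode, describe_electrode(electrode))
```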

The unit of measurement is $\mu V$.

## Add time

In [None]:
df_eeg["time"] = df_eeg.index / fs
df_eeg.set_index("time", inplace=True)
df_eeg.index

## Plot one electrode

In [None]:
plt.plot(df_eeg.index, df_eeg["Fp1"])
plt.xlabel("Time (s)")
plt.ylabel("Amplitude (uV)")
plt.grid()
plt.title("Fp1")

We can focus on a particular window of time.

In [None]:
x_start = 25.0
x_end = 35.0
plt.plot(df_eeg.loc[x_start:x_end].index, df_eeg.loc[x_start:x_end]["Fp1"])
plt.grid()
plt.title("Fp1")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude (uV)")

# ⚡ Whole set of electrodes

In [None]:
fig, axes = plt.subplots(nrows=len(df_eeg.columns), ncols=1, sharex=True, figsize=(30, 50))
for i, col_name in enumerate(df_eeg.columns):
    ax = sns.lineplot(data=df_eeg, x=df_eeg.index, y=col_name, ax=axes[i])
    ax.set_title(col_name)
    ax.grid()
plt.subplots_adjust()

Not very insightful: the signals are very noisy. We can take as an extreme example the electrocardiogram `EKG` (or ECG), which is completely unreadable. Yet in the example figures (see the one below, representing an LPD identified in a portion of an EEG), there is an interpretable EKG over a 10-second window at the bottom right. A possible cause of the noise in our signal is an interfering signal that adds to the true signal.

One idea would be to smooth the signal with a simple moving average filter, or to filter out certain frequencies with a signal processing filter.
Luckily, the Python module [scipy.signal](https://docs.scipy.org/doc/scipy/reference/signal.html) offers a whole range of tools for signal processing.
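
As an illustration of the first idea, here is a minimal moving average sketch on a synthetic signal (a 1 Hz sine plus random noise, not the competition data). The helper name `moving_average` and the window size of 21 samples are arbitrary choices for the example.

```python
import numpy as np


def moving_average(x, window_size):
    """Smooth x with a centered moving average of window_size samples."""
    kernel = np.ones(window_size) / window_size
    return np.convolve(x, kernel, mode="same")


fs_demo = 200  # Hz, same sampling rate as the recordings
t = np.arange(2 * fs_demo) / fs_demo  # 2 seconds of samples
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 1.0 * t)                 # slow 1 Hz component
noisy = clean + 0.5 * rng.standard_normal(t.size)   # synthetic noise
smoothed = moving_average(noisy, window_size=21)

# Compare the error to the clean signal, away from the edges
inner = slice(21, -21)
err_noisy = np.std(noisy[inner] - clean[inner])
err_smoothed = np.std(smoothed[inner] - clean[inner])
print(err_noisy, err_smoothed)
```

A moving average attenuates all fast variations without targeting a specific frequency, which is why the rest of the notebook uses dedicated filters instead.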

<center>
    <img src="https://i.imgur.com/LY0Zh2b.png" alt="an AI generated illustration" width=800>
</center>

# 🔎 Analysing the EKG signal

We will focus on a 20-second window taken from the middle of the `EKG` signal.

We will inspect the frequency components of the EKG signal and use established literature to help us process it into an interpretable and meaningful signal.

## Extract a 20-second window in the middle of the EKG record

In [None]:
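# extract the EKG signal in a 20 second window
# .loc slicing is inclusive on both ends, so we drop the last sample
ekg_window = df_eeg.loc[30.0:50.0, "EKG"].iloc[:-1]
ekg_subsample = ekg_window.values.copy()
td = ekg_window.index.values  # associated time domain, same length as the signal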

In [None]:
fig = plt.figure(figsize=(6, 4))
plt.plot(td, ekg_subsample)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude (uV)")
plt.grid()
plt.title("EKG")

## Plot Fourier transform

NOTE: my memories of my signal processing courses are a little bit old. But I will always try to put resources to give the justification for each step taken 🙃.

The Fourier transform (FT) is well known in signal processing. It is a transformation that associates a function defined in the time domain with a function defined in the frequency domain. This new function describes the frequency spectrum of the input function. In brief, we can consider our input EKG signal to be made up of different frequency components (for example a 1 Hz component for a heart rate of 60 beats/min). The Fourier transform allows us to decompose the EKG signal into those components, and for each of them it gives an amplitude and a phase. For more information about the Fourier transform, you can check out the [wikipedia page](https://en.wikipedia.org/wiki/Fourier_transform).

As we have discrete signals, we have to compute a [discrete Fourier transform (DFT)](https://en.wikipedia.org/wiki/Discrete_Fourier_transform), and for that we can make use of the [Fast Fourier Transform (FFT)](https://en.wikipedia.org/wiki/Fast_Fourier_transform), which is an efficient algorithm for computing the DFT.

Scipy offers us a module [fft](https://docs.scipy.org/doc/scipy/reference/fft.html#module-scipy.fft) for Fourier analysis. We will rely upon it.
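
To make the relation between the DFT and the FFT concrete, the sketch below evaluates the DFT definition directly on a small synthetic signal and checks that `scipy.fft.fft` returns the same values. The helper `naive_dft` and the test signal (5 Hz and 20 Hz sines) are illustrative choices, not part of the analysis itself.

```python
import numpy as np
import scipy.fft as fft


def naive_dft(x):
    """Direct O(N^2) evaluation of X[k] = sum_n x[n] * exp(-2j*pi*k*n/N)."""
    n = np.arange(len(x))
    k = n.reshape(-1, 1)
    return np.exp(-2j * np.pi * k * n / len(x)) @ x


fs_demo = 200
t = np.arange(fs_demo) / fs_demo  # 1 second of samples
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 20 * t)

same_result = np.allclose(naive_dft(x), fft.fft(x))
freqs = fft.fftfreq(len(x), d=1 / fs_demo)
half = len(x) // 2  # keep only the positive frequencies
peak_freq = freqs[:half][np.argmax(np.abs(fft.fft(x))[:half])]
print(same_result, peak_freq)
```

Both computations agree, and the largest peak sits at the frequency of the strongest sine. The FFT only changes how fast the result is obtained, not the result itself.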

In [None]:
y_ekg = fft.fft(ekg_subsample)

# array of frequencies corresponding to the FT points
# determined by number of points and sampling interval (inverse of sampling frequency)
fd = fft.fftfreq(n=len(y_ekg), d=1 / fs)

In [None]:
# to plot the amplitude of each frequency, we take the absolute value of the complex number
# the signal is real, so its amplitude spectrum is symmetric around 0 Hz over (-fs/2, fs/2): we only plot the positive half
fig = plt.figure(figsize=(10, 6))
plt.plot(fd[:len(fd) // 2], np.abs(y_ekg)[:len(fd) // 2])
plt.grid()
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.title("EKG spectrum")

OK, so we have some good information here. The spectrum shows the amplitude of the frequency components from 0 Hz to 100 Hz. This value of 100 Hz is quite important: it is related to the [Nyquist–Shannon sampling theorem](https://en.wikipedia.org/wiki/Nyquist%E2%80%93Shannon_sampling_theorem), which states that a continuous signal $x(t)$ containing no frequency component higher than $B\; Hz$ can be fully reconstructed from its discrete samples $\{x(n \times \frac{1}{f_s})\}_n$ if the sampling frequency $f_s$ is more than twice the highest frequency in $x(t)$: $f_s > 2B$.

Our original `EKG` signal is sampled at 200 Hz. So the highest frequency we can have reliable information on is: $200/2 = 100 \;Hz$
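
A quick way to see why frequencies above $f_s/2$ cannot be trusted is to sample a sine that violates the condition. In the sketch below (synthetic signals, hypothetical helper `dominant_frequency`), a 130 Hz sine sampled at 200 Hz shows up at 70 Hz, exactly like a genuine 70 Hz sine. This effect is called aliasing.

```python
import numpy as np
import scipy.fft as fft

fs_demo = 200                          # sampling frequency (Hz)
t = np.arange(fs_demo) / fs_demo       # 1 second of samples


def dominant_frequency(x, fs):
    """Frequency (Hz) of the largest non-DC peak in the one-sided spectrum."""
    spectrum = np.abs(fft.rfft(x))
    freqs = fft.rfftfreq(len(x), d=1 / fs)
    return freqs[1:][np.argmax(spectrum[1:])]


f_ok = dominant_frequency(np.sin(2 * np.pi * 70 * t), fs_demo)        # below fs/2
f_aliased = dominant_frequency(np.sin(2 * np.pi * 130 * t), fs_demo)  # above fs/2
print(f_ok, f_aliased)
```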

We observe two major amplitude peaks:
* one at 0 Hz: it is the "non-oscillating" component, i.e. the mean of the signal. As an electric potential is only defined up to an additive constant, we can subtract the mean from the signal for convenience.
* a huge one at approximately 60 Hz: it is likely due to power line interference (50 Hz or 60 Hz depending on the country). We will certainly want to filter out this component.
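
Both observations can be reproduced on a synthetic stand-in for the EKG window. The amplitudes below are made up: a constant offset, a 1.2 Hz "heartbeat" and a strong 60 Hz interference. Subtracting the mean removes the 0 Hz peak and leaves the rest of the spectrum untouched, so the strongest remaining peak is the interference.

```python
import numpy as np
import scipy.fft as fft

fs_demo = 200
t = np.arange(20 * fs_demo) / fs_demo  # 20 s window, as in the notebook
# Synthetic signal: offset + 1.2 Hz component + strong 60 Hz interference
x = 500.0 + 50.0 * np.sin(2 * np.pi * 1.2 * t) + 200.0 * np.sin(2 * np.pi * 60.0 * t)

spectrum_raw = np.abs(fft.rfft(x))
spectrum_centered = np.abs(fft.rfft(x - x.mean()))
freqs = fft.rfftfreq(len(x), d=1 / fs_demo)

dc_raw, dc_centered = spectrum_raw[0], spectrum_centered[0]
strongest = freqs[np.argmax(spectrum_centered)]
print(dc_raw, dc_centered, strongest)
```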

We can also zoom in on low frequencies (< 2 Hz).

In [None]:
delta_f = 1 / len(y_ekg) * fs  # frequency resolution (step between each frequency point)
print(f"Frequency resolution: {delta_f} Hz")

In [None]:
fig = plt.figure(figsize=(10, 6))
plt.plot(fd[:40], np.abs(y_ekg)[:40])
plt.grid()
plt.xlabel("Frequency (Hz)")
plt.ylabel("Amplitude")
plt.title("EKG spectrum")

Apart from the mean component (0 Hz), there are very low frequency components (0.05-0.5 Hz).
However, heart rate typically ranges between 60 and 100 bpm (1-1.7 Hz). So those very low frequencies are
unlikely to be related to the heart rate; they may instead come from the respiration rate or slow movements of the patient.

We can filter them with a highpass filter.

# 🔉 Filtering !

For this part, I used this resource from GE Healthcare: [A Guide to ECG Signal Filtering](https://www.gehealthcare.com/insights/article/a-guide-to-ecg-signal-filtering), which gives insight on which frequencies to filter.

We first remove the mean component.

In [None]:
ekg_subsample_centered = ekg_subsample - ekg_subsample.mean()

Then we apply a highpass filter to remove frequency components below 0.5 Hz.

In [None]:
f_cutoff = 0.5  # cut-off frequency in Hz

# generate butterworth filter (by generating the coefficients)
b, a = signal.butter(N=4, Wn=f_cutoff, btype='highpass', fs=fs, analog=False, output='ba')

In [None]:
ekg_subsample_high = signal.filtfilt(b, a, ekg_subsample_centered) 
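
To check what such a filter does before trusting it, we can look at its gain at a few frequencies with `scipy.signal.freqz`. The sketch below rebuilds the same Butterworth design under separate names (`b_hp`, `a_hp`) and evaluates it at 0.1 Hz, 0.5 Hz and 5 Hz. These three frequencies are arbitrary probe points chosen for the example.

```python
import numpy as np
import scipy.signal as signal

fs_demo = 200
b_hp, a_hp = signal.butter(N=4, Wn=0.5, btype="highpass", fs=fs_demo)

# Gain of a single pass of the filter at a few frequencies of interest (Hz)
check_freqs = np.array([0.1, 0.5, 5.0])
_, h = signal.freqz(b_hp, a_hp, worN=check_freqs, fs=fs_demo)
single_pass_gain = np.abs(h)

# filtfilt runs the filter forward then backward, so the gain is squared
zero_phase_gain = single_pass_gain ** 2
print(single_pass_gain.round(3), zero_phase_gain.round(3))
```

A Butterworth filter has a gain of about 0.707 at its cut-off frequency, and because `filtfilt` applies it twice, the effective gain at 0.5 Hz is about 0.5. Frequencies well below the cut-off are strongly attenuated and frequencies well above it are left almost untouched.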

In [None]:
# notch filter to remove power line interference
f0 = 60.0  # Frequency to be removed from signal (Hz)
Q = 30.0  # Quality factor

b_notch, a_notch = signal.iirnotch(w0=f0, Q=Q, fs=fs)

In [None]:
ekg_subsample_filtered = signal.filtfilt(b_notch, a_notch, ekg_subsample_high)
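
The quality factor `Q` controls how narrow the notch is: for `iirnotch`, the -3 dB bandwidth is `f0 / Q`, so 2 Hz with the values used here. The sketch below rebuilds the same notch under separate names and checks its gain at the notch frequency and 10 Hz on either side (arbitrary probe points).

```python
import numpy as np
import scipy.signal as signal

fs_demo = 200
f0_demo, q_demo = 60.0, 30.0
b_n, a_n = signal.iirnotch(w0=f0_demo, Q=q_demo, fs=fs_demo)

bandwidth = f0_demo / q_demo  # -3 dB bandwidth in Hz, from the definition of Q
check_freqs = np.array([50.0, 60.0, 70.0])
_, h = signal.freqz(b_n, a_n, worN=check_freqs, fs=fs_demo)
notch_gain = np.abs(h)
print(bandwidth, notch_gain.round(3))
```

The gain is essentially zero at 60 Hz and stays close to 1 at 50 Hz and 70 Hz, so the notch removes the interference while leaving neighboring frequencies almost unchanged.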

In [None]:
# let's plot the Fourier transform of the filtered signal
y_filtered = fft.fft(ekg_subsample_filtered)

fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8))

axes[0].plot(fd[:len(fd) // 2], np.abs(y_filtered)[:len(fd) // 2])
axes[0].set_title("EKG spectrum after filtering")
axes[0].grid()
axes[0].set_xlabel("Frequency (Hz)")
axes[0].set_ylabel("Amplitude")

axes[1].plot(fd[:40], np.abs(y_filtered)[:40])
axes[1].set_title("EKG spectrum after filtering (zoomed on low frequencies)")
axes[1].grid()
axes[1].set_xlabel("Frequency (Hz)")
axes[1].set_ylabel("Amplitude")

plt.tight_layout()

We have successfully filtered our original signal. We can now plot the filtered EKG signal in the time domain and compare it to the unfiltered one.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8), sharex=True)
axes[0].plot(td, ekg_subsample)
axes[0].set_title("EKG signal")
axes[0].grid()
axes[0].set_ylabel("Amplitude")

axes[1].plot(td, ekg_subsample_filtered)
axes[1].set_title("EKG signal after filtering")
axes[1].grid()
axes[1].set_xlabel("Time (s)")
axes[1].set_ylabel("Amplitude")

There are some boundary effects at both ends of the filtered signal. For the moment we will simply ignore those extreme points.
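
The cell that follows drops 100 samples at each end, which corresponds to 0.5 s at 200 Hz. If you prefer to reason in seconds rather than in samples, a small helper along these lines could be used (`trim_edges` is a hypothetical name, and the integer array is only a placeholder for a real signal).

```python
import numpy as np


def trim_edges(x, fs, seconds):
    """Drop `seconds` worth of samples at both ends of x."""
    n = int(round(seconds * fs))
    return x[n:len(x) - n] if n > 0 else x


fs_demo = 200
x = np.arange(20 * fs_demo)  # placeholder for a 20 s signal
trimmed = trim_edges(x, fs_demo, seconds=0.5)
print(len(x), len(trimmed), trimmed[0], trimmed[-1])
```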

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=1, figsize=(10, 8), sharex=True)
axes[0].plot(td[100:-100], ekg_subsample[100:-100])
axes[0].set_title("EKG signal")
axes[0].grid()
axes[0].set_ylabel("Amplitude")

axes[1].plot(td[100:-100], ekg_subsample_filtered[100:-100])
axes[1].set_title("EKG signal after filtering")
axes[1].grid()
axes[1].set_xlabel("Time (s)")
axes[1].set_ylabel("Amplitude")

<span style="font-size: x-large;">Great 🙌 !</span>

# 🌛 TO BE CONTINUED (it's 1 AM in France)