# Introduction and EDA

## 1. Introduction 

### 1.1 Quick Summary 

**Goal:** Classification of seizures + other harmful brain activity <br>
**Data:** electroencephalography (EEG) signals <br>
**Submission Date:** 19th of April 

### 1.2 What is electroencephalography (EEG)? 

An electroencephalogram (EEG) is a test that measures electrical activity in the brain using small metal discs (electrodes) attached to the scalp. Brain cells communicate via electrical impulses and are active all the time, even during sleep. This activity shows up as wavy lines on an EEG recording. <br> <br>

<img src="https://assets-global.website-files.com/621e95f9ac30687a56e4297e/64a8d6171e48d618d6eb1f61_V2_1677770554425_847b3a5e-24b7-4152-ab9b-01ef841697d5.png" alt="level of measurements" width="720"/>

### 1.3 Which patterns are there? 

**patterns:** 
- seizure (SZ)
- generalized periodic discharges (GPD)
- lateralized periodic discharges (LPD)
- lateralized rhythmic delta activity (LRDA)
- generalized rhythmic delta activity (GRDA)

**weighting:**

- idealized &rarr; clear expert vote for one pattern
- proto &rarr; about half of the experts vote for one pattern, the other half for "other"
- edge &rarr; roughly 50/50 vote between two patterns

### 1.4 Evaluation metrics

**Kullback-Leibler Divergence** <br>

$D_{KL}(P \parallel Q) = \sum_{i} P(i) \log\left(\frac{P(i)}{Q(i)}\right)$

&rarr; measures the difference between two distributions 
- the log is used so that the ratios $\frac{1}{x}$ and $x$ contribute with the same magnitude (only the sign differs)
- $P(i)$ is used as a weighting to put the focus on cases we will find more often
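
As a minimal sketch of the formula above, the divergence for a single row of vote shares could be computed with NumPy as follows. The function name `kl_divergence`, the clipping constant `eps`, and the example vote shares are assumptions made for illustration; they are not taken from the competition code.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-15):
    """KL divergence D_KL(P || Q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    # Clip Q away from zero so that the ratio P/Q and its log stay finite
    # (eps is an assumed value, chosen only for this sketch).
    q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
    mask = p > 0  # terms with P(i) = 0 contribute nothing to the sum
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Hypothetical example: 3 of 4 experts voted "seizure", 1 voted "other".
p_true = np.array([0.75, 0.0, 0.0, 0.0, 0.0, 0.25])
q_uniform = np.full(6, 1 / 6)

print(kl_divergence(p_true, p_true))     # identical distributions give 0
print(kl_divergence(p_true, q_uniform))  # a uniform guess is penalized
```

A perfect prediction yields a divergence of zero, and any deviation from the expert distribution increases the value, so lower is better.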

### 1.5 Target Output 
&rarr; as a CSV file in the following format 

|eeg_id|seizure_vote|lpd_vote|gpd_vote|lrda_vote|grda_vote|other_vote|
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | 0.166 | 0.166 | 0.167 | 0.167 | 0.167 | 0.167 |
| 1 | 0.166 | 0.166 | 0.167 | 0.167 | 0.167 | 0.167 |
| ... | ... | ... | ... | ... | ... | ... |
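
A file in this format could be produced with pandas as sketched below. This is only a uniform baseline for illustration: the helper `make_uniform_submission` and the two example IDs are hypothetical, and the column names are copied from the table above.

```python
import pandas as pd

# Column order taken from the table above.
TARGET_COLUMNS = ['seizure_vote', 'lpd_vote', 'gpd_vote',
                  'lrda_vote', 'grda_vote', 'other_vote']

def make_uniform_submission(eeg_ids):
    """Build a baseline that assigns equal probability to every class."""
    submission = pd.DataFrame({'eeg_id': list(eeg_ids)})
    for column in TARGET_COLUMNS:
        submission[column] = 1.0 / len(TARGET_COLUMNS)
    return submission

# Hypothetical IDs for illustration; real IDs come from the test metadata.
submission = make_uniform_submission([0, 1])

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Each row has to sum to 1 because the six values are interpreted as a probability distribution over the classes.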

## 2. EDA

### 2.1 train.csv  

#### 2.1.1 Data Import

In [None]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df_train = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
df_train

#### 2.1.2 Data Description

In [None]:
df_train.describe().T

| Column Name                  | Description                                                         |
|------------------------------|---------------------------------------------------------------------|
| eeg_id                       | A unique identifier for the entire EEG recording.                   |
| eeg_sub_id                   | An ID for the specific 50-second long subsample this row's labels apply to. |
| eeg_label_offset_seconds     | The time between the beginning of the consolidated EEG and this subsample. |
| spectrogram_id               | A unique identifier for the entire spectrogram recording.           |
| spectrogram_sub_id           | An ID for the specific 10-minute subsample this row's labels apply to. |
| spectrogram_label_offset_seconds | The time between the beginning of the consolidated spectrogram and this subsample. |
| label_id                     | An ID for this set of labels.                                       |
| patient_id                   | An ID for the patient who donated the data.                         |
| expert_consensus             | The consensus annotator label. Provided for convenience only.       |


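The remaining columns (`seizure_vote`, `lpd_vote`, `gpd_vote`, `lrda_vote`, `grda_vote`, `other_vote`) hold the expert votes per category and are the targets to predict.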

In [None]:
df_train.info()

#### 2.1.3 train.csv EDA 

In [None]:
print('There are ', df_train['eeg_id'].nunique(), ' eeg data entries')
print('There are ', df_train['spectrogram_id'].nunique(), ' spectrogram data entries')
print('There are ', df_train['patient_id'].nunique(), ' different patients')

In [None]:
columns_to_plot = ['seizure_vote', 'lpd_vote', 'gpd_vote', 'lrda_vote', 'grda_vote', 'other_vote']

df_normalized = df_train[columns_to_plot].div(df_train[columns_to_plot].sum(axis=1), axis=0)

column_means = df_normalized.mean()

data_for_barplot = pd.DataFrame({'Columns': column_means.index, 'Mean Percentage': column_means.values})

sns.set(style="whitegrid")

plt.figure(figsize=(10, 6))
sns.barplot(x='Columns', y='Mean Percentage', data=data_for_barplot, palette="viridis")

plt.title('Comparison of Mean Percentages for Each Normalized Expert Vote')
plt.xlabel('Columns')
plt.ylabel('Mean Percentage')
plt.show()

In [None]:
plt.figure(figsize=(14, 16))
for i, column in enumerate(columns_to_plot, 1):
    plt.subplot(3, 2, i)
    sns.violinplot(x=df_train[column])
    plt.title(f'Violin Plot for {column}')

plt.suptitle('Violin Plots for Absolute Values of Each Column', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(4, 3))
column = 'eeg_label_offset_seconds'
sns.violinplot(x=df_train[column])

plt.title('The time between the beginning of the consolidated EEG and this subsample')
plt.suptitle('EEG Offset Seconds Distribution', y=1.02)
plt.tight_layout()
plt.show()

df_train[[column]].describe().T


In [None]:
plt.figure(figsize=(4, 3))
column = 'spectrogram_label_offset_seconds'
sns.violinplot(x=df_train[column])

plt.title('The time between the beginning of the consolidated spectrogram and this subsample')
plt.suptitle('Spectrogram Offset Seconds Distribution', y=1.02)
plt.tight_layout()
plt.show()

df_train[[column]].describe().T

In [None]:
plt.figure(figsize=(6, 5))
sns.heatmap(df_train[columns_to_plot].corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)

plt.title('correlation matrix for possible patterns')
plt.show()

### 2.2 train_eegs

#### 2.2.1 Data Import

In [None]:
df_eeg = pd.read_parquet('/kaggle/input/hms-harmful-brain-activity-classification/train_eegs/1628180742.parquet')
df_eeg

#### 2.2.2 Data Description

**Corresponding metadata from train.csv**

In [None]:
df_train[df_train.eeg_id == 1628180742]

**how to map the eeg_sub_id** <br>

Usually the EEG data for an eeg_id contains 10k rows, because 50 seconds x 200 records per second gives 10,000 rows. If there are multiple entries for an eeg_id in the metadata, each entry has its own eeg_sub_id. Next to the sub ID you can find the offset (`eeg_label_offset_seconds`), which is the time in seconds at which this sub-record starts. To find the starting row, multiply the offset by 200. Because the row index starts at 0, no further correction is needed: an offset of 0 seconds maps to row 0.

So let's imagine our last sub-record starts after 20 seconds. To find the starting row, we multiply 20 by 200, so the first row for this sub ID is 4000. The sub-record always covers 10k rows, which means its last row is 13999.
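
The mapping from offset to rows can be sketched as follows. With zero-based row positions the starting row is simply the offset multiplied by 200. The helper `slice_eeg_subsample` and the synthetic single-channel recording are assumptions for illustration; only the 200 records per second and the 50-second window come from the description above.

```python
import numpy as np
import pandas as pd

SAMPLING_RATE_HZ = 200  # records per second, as stated above
WINDOW_SECONDS = 50     # length of one labeled EEG sub-sample

def slice_eeg_subsample(df_eeg, offset_seconds):
    """Return the 50-second window starting `offset_seconds` into the recording."""
    start = int(offset_seconds * SAMPLING_RATE_HZ)
    stop = start + WINDOW_SECONDS * SAMPLING_RATE_HZ  # exclusive end position
    return df_eeg.iloc[start:stop]

# Synthetic stand-in for a 70-second recording with a single channel.
demo_eeg = pd.DataFrame({'Fp1': np.arange(70 * SAMPLING_RATE_HZ)})

window = slice_eeg_subsample(demo_eeg, offset_seconds=20)
print(window.index[0], window.index[-1], len(window))  # 4000 13999 10000
```

Using `iloc` with an exclusive end keeps the window at exactly 10,000 rows without any off-by-one adjustment.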

In our case we reduce the EEG data to the first subsample (only the first 50 seconds).

In [None]:
df_eeg = df_eeg.head(10000)

df_eeg.describe().T

**Based on 10-20 System**

The 10–20 system, also known as the International 10–20 system, is a standardized method for placing scalp electrodes during EEG exams, polysomnograph sleep studies, or lab research. It ensures consistency in testing, allowing for reproducibility and effective analysis. Electrodes are positioned based on their relationship to underlying areas of the brain, particularly the cerebral cortex. The system derives its name from the fact that electrode distances are either 10% or 20% of the total front–back or right–left distance of the skull. This method facilitates the detection of distinct electrical patterns in the brain during sleep and wake cycles. Various extrinsic factors can influence these patterns, such as age, medication, health conditions, neurological history, and substance use. The measurements are taken from specific anatomical locations, like the tragus, auricle, and mastoid, and are crucial for consistent and comparable results in scientific studies.

<img src="https://info.tmsi.com/hs-fs/hubfs/Blogs/0.1%20The%2010-20%20System/the-10-20-system-1-new.jpg?width=1307&height=689&name=the-10-20-system-1-new.jpg" alt="level of measurements" width="720"/>


The EKG column is for an electrocardiogram lead that records data from the heart.

<img src="https://my.clevelandclinic.org/-/scassets/images/org/health/articles/16953-electrocardiogram" alt="level of measurements" width="720"/>

In [None]:
columns = df_eeg.columns

plt.figure(figsize=(14, 30))
for i, column in enumerate(columns, 1):
    plt.subplot(20, 1, i)
    sns.violinplot(x=df_eeg[column])
    plt.title(f'Violin Plot for {column}')
    plt.xlim(-1600, 250)

plt.suptitle('Violin Plots for the electrode values per position', y=1.02)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(14, 12))
sns.heatmap(df_eeg.corr(), annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)

plt.title('Correlation Matrix')
plt.show()

### 2.3 train_spectrograms

#### 2.3.1 Data Import

- This is the spectrogram that corresponds to the EEG data loaded above.

In [None]:
spectrogram = pd.read_parquet('/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/353733.parquet')
spectrogram

#### 2.3.2 Data Explanation

**Column names description:**
- `LL_x`: Spectrogram values at frequency `x` (Hz) from the left lateral region
- `RL_x`: Spectrogram values at frequency `x` (Hz) from the right lateral region
- `LP_x`: Spectrogram values at frequency `x` (Hz) from the left parasagittal region
- `RP_x`: Spectrogram values at frequency `x` (Hz) from the right parasagittal region

**The connection to the metadata is based on a similar concept**
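
A rough sketch of that mapping for the spectrogram is shown below. It assumes that the spectrogram file has a `time` column in seconds and that rows are 2 seconds apart. Both are assumptions for this example and should be checked against the loaded data. The 10-minute window length comes from the column table above, and the helper `slice_spectrogram_window` is hypothetical.

```python
import numpy as np
import pandas as pd

SPECTROGRAM_WINDOW_SECONDS = 600  # 10-minute sub-sample from the column table

def slice_spectrogram_window(spectrogram, offset_seconds, time_column='time'):
    """Return the rows whose timestamp falls into the labeled 10-minute window."""
    start = offset_seconds
    stop = offset_seconds + SPECTROGRAM_WINDOW_SECONDS
    times = spectrogram[time_column]
    in_window = (times >= start) & (times < stop)
    return spectrogram.loc[in_window]

# Synthetic stand-in: one row every 2 seconds for 20 minutes (assumed spacing).
demo_spec = pd.DataFrame({'time': np.arange(1, 1200, 2), 'LL_0.59': 1.0})

window = slice_spectrogram_window(demo_spec, offset_seconds=100)
print(window['time'].min(), window['time'].max(), len(window))
```

Filtering on the time values instead of row positions avoids having to know the exact row spacing in advance.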

In [None]:
spectrogram.describe().T

#### 2.3.3 Data Analysis

In [None]:
def plot_spectrogram(spectrogram_path):
    '''
    Plot the log-scaled spectrogram of each of the four regions (LL, RL, RP, LP).

    source --> https://www.kaggle.com/code/clehmann10/plot-spectrograms
    '''
    sample_spect = pd.read_parquet(spectrogram_path)

    # Split the columns by region prefix; each column is one frequency bin
    split_spect = {
        "LL": sample_spect.filter(regex='^LL', axis=1),
        "RL": sample_spect.filter(regex='^RL', axis=1),
        "RP": sample_spect.filter(regex='^RP', axis=1),
        "LP": sample_spect.filter(regex='^LP', axis=1),
    }

    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12))
    axes = axes.flatten()
    label_interval = 5  # label only every 5th frequency bin
    for ax, (split_name, region) in zip(axes, split_spect.items()):
        # The log scale compresses the large dynamic range of the values
        img = ax.imshow(np.log(region).T, cmap='viridis', aspect='auto', origin='lower')
        cbar = fig.colorbar(img, ax=ax)
        cbar.set_label('Log(Value)')
        ax.set_title(split_name)
        ax.set_ylabel("Frequency (Hz)")
        ax.set_xlabel("Time")

        # Column names follow the pattern 'LL_x'; drop the 3-character prefix to get x
        frequencies = [column_name[3:] for column_name in region.columns]
        ax.set_yticks(np.arange(0, len(frequencies), label_interval))
        ax.set_yticklabels(frequencies[::label_interval])
    plt.tight_layout()
    plt.show()

In [None]:
plot_spectrogram('/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/353733.parquet')