> The goal of this competition is to detect and classify seizures and other types of harmful brain activity in electroencephalography (EEG) data. Even experts find this to be a challenging task and often disagree about the correct labels.

# Setup

In [None]:
import numpy as np
import polars as pl
pl.Config.set_tbl_cols(-1);
from pathlib import Path

# Load dataset

In [None]:
df_train = pl.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
df_test = pl.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/test.csv')
df_sample_submission = pl.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv')

# Data description

> **train.csv** Metadata for the train set. The expert annotators reviewed 50 second long EEG samples plus matched spectrograms covering a 10 minute window centered at the same time and labeled the central 10 seconds. Many of these samples overlapped and have been consolidated. train.csv provides the metadata that allows you to extract the original subsets that the raters annotated.
> 
> - `eeg_id` - A unique identifier for the entire EEG recording.
> - `eeg_sub_id` - An ID for the specific 50 second long subsample this row's labels apply to.
> - `eeg_label_offset_seconds` - The time between the beginning of the consolidated EEG and this subsample.
> - `spectrogram_id` - A unique identifier for the entire spectrogram.
> - `spectrogram_sub_id` - An ID for the specific 10 minute subsample this row's labels apply to.
> - `spectrogram_label_offset_seconds` - The time between the beginning of the consolidated spectrogram and this subsample.
> - `label_id` - An ID for this set of labels.
> - `patient_id` - An ID for the patient who donated the data.
> - `expert_consensus` - The consensus annotator label. Provided for convenience only.
> - `[seizure/lpd/gpd/lrda/grda/other]_vote` - The count of annotator votes for a given brain activity class. The full names of the activity classes are as follows: lpd: lateralized periodic discharges, gpd: generalized periodic discharges, lrda: lateralized rhythmic delta activity, and grda: generalized rhythmic delta activity. Detailed explanations of these patterns are [available here](https://www.acns.org/UserFiles/file/ACNSStandardizedCriticalCareEEGTerminology_rev2021.pdf).

In [None]:
display(df_train)

> **test.csv** Metadata for the test set. As there are no overlapping samples in the test set, many columns in the train metadata don't apply.
> 
> - `eeg_id`
> - `spectrogram_id`
> - `patient_id`

In [None]:
display(df_test)

> **sample_submission.csv**
> 
> - `eeg_id`
> - `[seizure/lpd/gpd/lrda/grda/other]_vote` - The target columns. Your predictions must be probabilities. Note that the test samples had between 3 and 20 annotators.

In [None]:
display(df_sample_submission)

> **train_eegs/** EEG data from one or more overlapping samples. Use the metadata in train.csv to select specific annotated subsets. The column names are [the names of the individual electrode locations for EEG leads](https://en.wikipedia.org/wiki/10%E2%80%9320_system_%28EEG%29), with one exception. The EKG column is for an electrocardiogram lead that records data from the heart. All of the EEG data (for both train and test) was collected at a frequency of 200 samples per second.
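
With a sampling rate of 200 samples per second, an offset in seconds maps directly to a row index in the parquet file. The sketch below works through that arithmetic for the 50 second window and for its central 10 seconds, which is the part the annotators labeled. The function names are hypothetical and the offset of 6 seconds is just an example value.

```python
EEG_SAMPLES_PER_SECOND = 200
WINDOW_SECONDS = 50    # length of the reviewed EEG sample
LABELED_SECONDS = 10   # central part that was actually labeled


def eeg_window_rows(offset_seconds):
    """Half-open row range [start, stop) of the 50 second window."""
    start = int(offset_seconds * EEG_SAMPLES_PER_SECOND)
    stop = start + WINDOW_SECONDS * EEG_SAMPLES_PER_SECOND
    return start, stop


def labeled_rows(offset_seconds):
    """Half-open row range of the central 10 seconds of the window."""
    start, stop = eeg_window_rows(offset_seconds)
    # 20 seconds of context on each side of the labeled segment
    margin = (WINDOW_SECONDS - LABELED_SECONDS) // 2 * EEG_SAMPLES_PER_SECOND
    return start + margin, stop - margin


window = eeg_window_rows(6.0)   # (1200, 11200): 10,000 rows
center = labeled_rows(6.0)      # (5200, 7200): 2,000 rows
```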

In [None]:
sample_eeg_id = 1628180742

In [None]:
df_sample_eeg = pl.read_parquet(
    Path('/kaggle/input/hms-harmful-brain-activity-classification/train_eegs')
    / f'{sample_eeg_id}.parquet'
)

In [None]:
display(df_sample_eeg)

> **test_eegs/** Exactly 50 seconds of EEG data.

In [None]:
sample_test_eeg_id = 3911565283
df_sample_test_eeg = pl.read_parquet(
    Path('/kaggle/input/hms-harmful-brain-activity-classification/test_eegs')
    / f'{sample_test_eeg_id}.parquet'
)

In [None]:
display(df_sample_test_eeg)

> **train_spectrograms/** Spectrograms assembled from EEG data. Use the metadata in train.csv to select specific annotated subsets. The column names indicate the frequency in hertz and the recording regions of the EEG electrodes. The latter are abbreviated as LL = left lateral; RL = right lateral; LP = left parasagittal; RP = right parasagittal.
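
Since each column name encodes a region and a frequency, the columns can be grouped by region with plain string handling. The sketch below assumes names of the form `<region>_<frequency>` plus a separate `time` column; the example names are made up for illustration and should be checked against the actual parquet schema.

```python
from collections import defaultdict

REGIONS = {"LL", "RL", "LP", "RP"}


def group_columns_by_region(columns):
    """Map each region prefix to the list of frequencies (Hz) in its columns."""
    groups = defaultdict(list)
    for name in columns:
        region, sep, freq = name.partition("_")
        # Skip columns such as 'time' that carry no region prefix
        if sep and region in REGIONS:
            groups[region].append(float(freq))
    return dict(groups)


# Hypothetical column names, assumed to follow '<region>_<frequency>'
example_columns = ["time", "LL_0.59", "LL_0.78", "RL_0.59", "LP_0.59", "RP_0.59"]
grouped = group_columns_by_region(example_columns)
```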

In [None]:
sample_spectrogram_id = 353733

In [None]:
df_sample_spectrogram = pl.read_parquet(
    Path('/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms')
    / f'{sample_spectrogram_id}.parquet'
)

In [None]:
display(df_sample_spectrogram)

> **test_spectrograms/** Spectrograms assembled using exactly 10 minutes of EEG data.

In [None]:
sample_test_spectrogram_id = 853520
df_sample_test_spectrogram = pl.read_parquet(
    Path('/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms')
    / f'{sample_test_spectrogram_id}.parquet'
)

In [None]:
display(df_sample_test_spectrogram)

> **example_figures/** Larger copies of the example case images used on the overview tab.

In [None]:
# display_pdf should render the PDF inline, but this does not seem to work on Kaggle
from IPython.display import display_pdf

with open('/kaggle/input/hms-harmful-brain-activity-classification/example_figures/Sample01.pdf', 'rb') as f:
    display_pdf(f.read(), raw=True)

# Visualization

In [None]:
display(df_train.describe())

In [None]:
display(df_sample_eeg.describe())

In [None]:
display(df_sample_spectrogram.describe())

In [None]:
display(df_train['expert_consensus'].value_counts())

In [None]:
display(df_train['label_id'].n_unique())

In [None]:
eeg_samples_per_second = 200

In [None]:
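def get_eeg_subsamples(df_eeg, offset):
    # Return the 50-second window that starts `offset` seconds into the recording.
    # The range is half-open, [offset, offset + 50) seconds: 10,000 rows at 200 Hz.
    return (
        df_eeg
        .with_row_count('index')  # newer polars releases call this with_row_index
        .filter(pl.col('index').is_between(
            offset * eeg_samples_per_second,
            (offset + 50) * eeg_samples_per_second,
            closed = 'left',
        ))
        .drop('index')
    )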

In [None]:
df_subsample_eeg = get_eeg_subsamples(df_sample_eeg, 0)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.to_pandas());

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.head(eeg_samples_per_second * 5).to_pandas());

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.head(eeg_samples_per_second * 1).to_pandas());

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.drop('EKG').to_pandas());

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.drop('EKG').head(eeg_samples_per_second * 5).to_pandas());

In [None]:
plt.figure(figsize = (16, 8))
sns.lineplot(df_subsample_eeg.drop('EKG').head(eeg_samples_per_second * 1).to_pandas());

In [None]:
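def get_spectrogram_subsamples(df_spectrogram, offset, region):
    # Return the 10-minute (600-second) window that starts at `offset` seconds,
    # keeping only the columns of one region ('LL', 'RL', 'LP' or 'RP').
    return (
        df_spectrogram
        .filter(pl.col('time').is_between(
            offset, offset + 600,
            closed = 'left',
        ))
        .select(pl.col(f'^{region}_.*$'))
    )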

In [None]:
df_subsample_spectrogram = get_spectrogram_subsamples(df_sample_spectrogram, 0, 'LL')

In [None]:
display(df_subsample_spectrogram)

In [None]:
plt.figure(figsize = (16, 8))
sns.heatmap(df_subsample_spectrogram.to_pandas().T);

In [None]:
plt.figure(figsize = (16, 8))
sns.heatmap(get_spectrogram_subsamples(df_sample_spectrogram, 0, 'RL').to_pandas().T);

In [None]:
plt.figure(figsize = (16, 8))
sns.heatmap(get_spectrogram_subsamples(df_sample_spectrogram, 0, 'LP').to_pandas().T);

In [None]:
plt.figure(figsize = (16, 8))
sns.heatmap(get_spectrogram_subsamples(df_sample_spectrogram, 0, 'RP').to_pandas().T);

# Stats

In [None]:
train_eeg_dir = Path('/kaggle/input/hms-harmful-brain-activity-classification/train_eegs')

In [None]:
def read_eeg_subsamples(eeg_id, offset):
    path = train_eeg_dir / f'{eeg_id}.parquet'
    df = pl.read_parquet(path)
    return get_eeg_subsamples(df, offset)
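
Several rows of train.csv can point to the same recording, so a reader like this may load the same file repeatedly. One possible mitigation is to cache the most recently loaded recordings with `functools.lru_cache`. The sketch below uses a stand-in loader that returns a plain list instead of calling `pl.read_parquet`; the names `load_recording` and `load_recording_cached` are hypothetical, and the cache size is an arbitrary choice.

```python
from functools import lru_cache

read_calls = []  # records every simulated disk read


def load_recording(eeg_id):
    """Stand-in for pl.read_parquet(train_eeg_dir / f'{eeg_id}.parquet')."""
    read_calls.append(eeg_id)
    return list(range(100))


@lru_cache(maxsize=8)
def load_recording_cached(eeg_id):
    """Keep up to 8 recently used recordings in memory."""
    return load_recording(eeg_id)


windows = []
# Three windows from recording 1, then one from recording 2
for eeg_id, offset in [(1, 0), (1, 10), (1, 20), (2, 0)]:
    recording = load_recording_cached(eeg_id)
    windows.append(recording[offset:offset + 10])
```

Here four windows are served with only two simulated reads. The cached object is shared between calls, so it should be treated as read-only.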

In [None]:
def make_eeg_stats(df_train = df_train):
    # For every labeled row, compute the per-channel min, max and null count
    # over the corresponding 50-second EEG window.
    acc = pl.DataFrame()
    df = df_train.select('eeg_id', 'eeg_label_offset_seconds')
    for eeg_id, offset in df.iter_rows():
        eeg = read_eeg_subsamples(eeg_id, offset)
        eeg_min = eeg.min().select(pl.col('*').name.suffix('_min'))
        eeg_max = eeg.max().select(pl.col('*').name.suffix('_max'))
        eeg_null = eeg.null_count().select(pl.col('*').name.suffix('_null_count'))
        # One stats row per window, appended below the rows collected so far
        acc = pl.concat([acc, pl.concat([eeg_min, eeg_max, eeg_null], how = 'horizontal')],
                        how = 'vertical')
    # Put the ids and offsets next to their stats
    return pl.concat([df, acc], how = 'horizontal')

In [None]:
df_eeg_stats = make_eeg_stats()

In [None]:
display(df_eeg_stats)

In [None]:
df_eeg_stats.write_parquet('train_eeg_stats.parquet')

In [None]:
display(df_eeg_stats.describe())

In [None]:
train_spec_dir = Path('/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms')

In [None]:
def read_spec_subsamples(spec_id, offset):
    path = train_spec_dir / f'{spec_id}.parquet'
    df = pl.read_parquet(path)
    return pl.concat(
        [
            get_spectrogram_subsamples(df, offset, region)
            for region in ['LL', 'RL', 'LP', 'RP']
        ],
        how = 'horizontal',
    )

In [None]:
def make_spec_stats(df_train = df_train):
    # For every labeled row, compute the per-column min, max and null count
    # over the corresponding 10-minute spectrogram window (all four regions).
    acc = pl.DataFrame()
    df = df_train.select('spectrogram_id', 'spectrogram_label_offset_seconds')
    for spec_id, offset in df.iter_rows():
        spec = read_spec_subsamples(spec_id, offset)
        spec_min = spec.min().select(pl.col('*').name.suffix('_min'))
        spec_max = spec.max().select(pl.col('*').name.suffix('_max'))
        spec_null = spec.null_count().select(pl.col('*').name.suffix('_null_count'))
        # One stats row per window, appended below the rows collected so far
        acc = pl.concat([acc, pl.concat([spec_min, spec_max, spec_null], how = 'horizontal')],
                        how = 'vertical')
    # Put the ids and offsets next to their stats
    return pl.concat([df, acc], how = 'horizontal')

In [None]:
df_spec_stats = make_spec_stats()

In [None]:
display(df_spec_stats)

In [None]:
df_spec_stats.write_parquet('train_spectrogram_stats.parquet')

In [None]:
display(df_spec_stats.describe())