# Introduction

**eeglib** is a library for converting EEG waveforms into feature sets:

[github: eeglib](https://github.com/Xiul109/eeglib)
[eeglib: A Python module for EEG feature extraction](https://www.sciencedirect.com/science/article/pii/S2352711021000753)

This notebook uses **eeglib** to extract various features from the training set and
visualise them using violinplots.

I've developed two plotting functions: one for the left-right regions of the brain
and one covering all of the spatial areas covered by the EEG sensors in a pair-wise
left-right-front-back manner. These plot eeglib features against my interpretation
of the probabilities.

## Things to note

1. **eeglib** is not in the standard Kaggle notebook image, so you have to
   pip install it offline from a dataset of wheels. I've already created this
   dataset here: [kaggle dataset: hms-hbac-offline-libs](https://www.kaggle.com/datasets/andrewscholan/hms-hbac-offline-libs).

## tldr

My take on this is that eeglib features might be a useful decomposition of the EEG
file data, but the data probably needs better preprocessing than eeglib's preprocessing
provides.

Anyway, there are a few pretty plots to look at!!!!

## And finally

If you've found this notebook useful, please upvote it on kaggle.

In [None]:
!pip install \
   --requirement /kaggle/input/hms-hbac-offline-libs/requirements.txt \
   --no-index \
   --find-links file:///kaggle/input/hms-hbac-offline-libs/wheels

In [None]:
# All imports in this code block

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import webcolors as wc
import math
import glob

from koilerplate import INPUT_ROOT, WORKING_ROOT, TEMP_ROOT
from pathlib import Path
from enum import Enum
from typing import List, Tuple, Dict
from dataclasses import dataclass
from tqdm.notebook import tqdm_notebook, tqdm
from copy import deepcopy
from pyarrow.parquet import ParquetDataset

from eeglib.helpers import Helper
from eeglib.eeg import EEG


In [None]:
# Set up some basic file paths
INPUT_PATH = Path(INPUT_ROOT)
COMPETITION_DATA_PATH = INPUT_PATH / "hms-harmful-brain-activity-classification"
COMPETITION_DATA_PATH

In [None]:
# Load the training set CSV
train_info = pd.read_csv(COMPETITION_DATA_PATH/"train.csv")
train_info

## Basic data visualisation

There's not a lot to look at in the basic training csv file. Just look at the expert categories to see if they are roughly the same order of magnitude each.

### Expert consensus

Just look at the raw data from the training file.

In [None]:
# Visualise the expert consensus category
sns.histplot(x="expert_consensus", data=train_info)

### Are all time offsets even ???

Just check to see if there are any odd time offsets...

In [None]:
eeg_time_offset_odd = train_info["eeg_label_offset_seconds"].mod(2) != 0
spectrogram_time_offset_odd = train_info["spectrogram_label_offset_seconds"].mod(2) != 0
print(f"All EEG start time offsets are multiples of 2 sec: {eeg_time_offset_odd.sum()==0}")
print(f"All spectrogram start time offsets are multiples of 2 sec: {spectrogram_time_offset_odd.sum()==0}")

## Labels

According to the description in the dataset:

- experts were given a 50 second extract of EEG waveform
  data and each label corresponds to the centre 10 seconds of the sample.

- the 50 second sample comes from a longer EEG waveform which may have a number
  of labelled sections in it, each extracted as a snippet of 50 seconds with the
  label corresponding to the centre 10 seconds.
  
- snippet sections may overlap.

- labels could overlap.

In [None]:
labels = train_info['label_id'].unique()
if len(labels)==train_info.shape[0]:
    print("All label IDs are unique")
else:
    print("WARNING: Some label IDs are duplicated")

In [None]:
eeg_ids = train_info['eeg_id'].unique()
print(f"There are {len(eeg_ids)} EEG ids")
print(f"There are approximately {len(labels)/len(eeg_ids):.2f} labels per EEG id")

In [None]:
# Add label start time and end times to the train info
EEG_SNAPSHOT_DURATION = 50.0    # Experts analysed 50 second snapshots for labelling
EEG_LABEL_DURATION = 10.0       # Labels relate to the centre 10 seconds of each 50s
train_info["eeg_label_start_time"] = (
    train_info["eeg_label_offset_seconds"] + 
    (EEG_SNAPSHOT_DURATION - EEG_LABEL_DURATION) / 2
)
train_info["eeg_label_end_time"] = train_info["eeg_label_start_time"] + EEG_LABEL_DURATION
train_info

In [None]:
VOTES_AND_CONSENSUS = [
    "seizure_vote", 
    "lpd_vote", "gpd_vote", 
    "lrda_vote", "grda_vote", 
    "other_vote", 
    "expert_consensus"
]
# First column of flags where there is an overlap between adjacent labels
train_info["label_overlaps_next"] = (
    (train_info["eeg_id"] == train_info["eeg_id"].shift(-1)) &
    (train_info["eeg_label_end_time"] > train_info["eeg_label_start_time"].shift(-1))
)
# Work out the overlap duration in seconds
train_info["label_overlap_next_sec"] = (
    train_info["eeg_label_end_time"] - train_info["eeg_label_start_time"].shift(-1)
).where(train_info["label_overlaps_next"], 0)
train_info

In [None]:
this_votes = train_info[["label_overlaps_next"] + VOTES_AND_CONSENSUS]
next_votes = train_info[VOTES_AND_CONSENSUS].shift(-1)
voting = this_votes.join(next_votes, lsuffix="_current", rsuffix="_next")
voting["join_vote_conflict"] = (
    voting["label_overlaps_next"] & (
        (voting["seizure_vote_current"] != voting["seizure_vote_next"])
        | (voting["lpd_vote_current"] != voting["lpd_vote_next"])
        | (voting["gpd_vote_current"] != voting["gpd_vote_next"])
        | (voting["lrda_vote_current"] != voting["lrda_vote_next"])
        | (voting["grda_vote_current"] != voting["grda_vote_next"])
        | (voting["other_vote_current"] != voting["other_vote_next"])
    )
)
voting["join_consensus_conflict"] = (
    voting["label_overlaps_next"]
    & (voting["expert_consensus_current"] != voting["expert_consensus_next"])
)
num_vote_conflicts = voting['join_vote_conflict'].sum()
num_consensus_conflicts = voting['join_consensus_conflict'].sum()
num_overlaps = voting['label_overlaps_next'].sum()

print(
    f"{num_vote_conflicts} conflicts in expert voting when labels overlap "
    f"({num_vote_conflicts/num_overlaps :.3%})."
)
print(
    f"{num_consensus_conflicts} conflicts in expert consensus when labels overlap "
    f"({num_consensus_conflicts/num_overlaps :.3%})."
)
print(f"{num_overlaps} total number of overlapping labels.")

### Discussion

There are various conflicts between labels when they are merged, but not as many as
we might fear.

Interestingly, if you assume that an expert focusses on a smaller central region rather
than the full 10 seconds of the label, the conflicts reduce significantly.

| Label length | Voting Conflicts | Consensus Conflicts | Overlapping labels |
| ------------ | ---------------- | ------------------- | ------------------ |
|        4 sec |     306 (0.729%) |        109 (0.260%) |              41960 |
|        6 sec |     767 (1.279%) |        255 (0.425%) |              59966 |
|        8 sec |    1247 (1.795%) |        427 (0.615%) |              69479 |
|       10 sec |    1739 (2.314%) |        594 (0.790%) |              71546 |

> **Note:** Above table not coded in this notebook. Shows results of some coding experiments.
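
The experiment behind the table is not part of this notebook. A minimal sketch of how it could be reproduced is to shrink the label window around the centre of the 50 second snapshot and repeat the overlap test. It assumes the same `eeg_id`, `eeg_label_offset_seconds` and `expert_consensus` columns as `train.csv`; the helper name `count_consensus_conflicts` and the toy frame are made up for illustration and are not the code that produced the numbers above.

```python
import pandas as pd

SNAPSHOT_DURATION = 50.0  # seconds of EEG shown to the experts


def count_consensus_conflicts(info: pd.DataFrame, label_duration: float):
    """Hypothetical helper: count adjacent overlapping labels and consensus conflicts."""
    # Centre a label window of the requested length inside the 50 s snapshot
    start = info["eeg_label_offset_seconds"] + (SNAPSHOT_DURATION - label_duration) / 2
    end = start + label_duration
    # A label overlaps the next row if both belong to the same EEG and the windows intersect
    same_eeg = info["eeg_id"] == info["eeg_id"].shift(-1)
    overlaps = same_eeg & (end > start.shift(-1))
    conflicts = overlaps & (
        info["expert_consensus"] != info["expert_consensus"].shift(-1)
    )
    return int(overlaps.sum()), int(conflicts.sum())


# Made-up rows, for illustration only
toy = pd.DataFrame({
    "eeg_id": [1, 1, 1, 2],
    "eeg_label_offset_seconds": [0.0, 6.0, 8.0, 0.0],
    "expert_consensus": ["GPD", "GPD", "Other", "Seizure"],
})
for duration in (4.0, 6.0, 8.0, 10.0):
    n_overlaps, n_conflicts = count_consensus_conflicts(toy, duration)
    print(f"{duration:>4.0f} sec: {n_overlaps} overlaps, {n_conflicts} consensus conflicts")
```

Only adjacent rows are compared, which mirrors the `shift(-1)` approach used earlier in the notebook.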

## Dropping conflicted data

The easiest way to deal with the conflicted labels in the training set is to
drop those labels where there are conflicts in the consensus (which is less than
1% of the data) as we treat them as outliers in the dataset.

In [None]:
train_info_clean = train_info[~voting["join_consensus_conflict"]].copy()
train_info_clean

### Votes cast

We only know that there is a panel of experts; some presumably will not have voted, and some may
base their vote more on either the spectrogram or the raw EEG.

In [None]:
# Do we know how many experts looked at each eeg trace?
# Start by simply counting the votes for each entry 
train_info_clean["vote_count"] = (
    train_info_clean["seizure_vote"] +
    train_info_clean["lpd_vote"] +
    train_info_clean["gpd_vote"] +
    train_info_clean["lrda_vote"] +
    train_info_clean["grda_vote"] +
    train_info_clean["other_vote"]
)
train_info_clean

In [None]:
# Now get the maximum vote count based on EEG ID and Spectrogram ID
train_info_clean["max_vote_count_eeg"] = train_info_clean.groupby("eeg_id")["vote_count"].transform("max")
train_info_clean["max_vote_count_spectrogram"] = train_info_clean.groupby("spectrogram_id")["vote_count"].transform("max")
train_info_clean

In [None]:
# Now test how many entries there are where the raw count is different from the max
total_rows = train_info_clean.shape[0]
fewer_eeg_votes = train_info_clean[train_info_clean["vote_count"] < train_info_clean["max_vote_count_eeg"]]
fewer_eeg_votes_rows = fewer_eeg_votes.shape[0]
fewer_spectrogram_votes = train_info_clean[train_info_clean["vote_count"] < train_info_clean["max_vote_count_spectrogram"]]
fewer_spectrogram_votes_rows = fewer_spectrogram_votes.shape[0]
inconsistent_max_votes = train_info_clean[train_info_clean["max_vote_count_eeg"] != train_info_clean["max_vote_count_spectrogram"]]
inconsistent_max_votes_rows = inconsistent_max_votes.shape[0]
print(f"{fewer_eeg_votes_rows=}/{total_rows}; {fewer_eeg_votes_rows/total_rows : .2%}")
print(f"{fewer_spectrogram_votes_rows=}/{total_rows}; {fewer_spectrogram_votes_rows/total_rows : .2%}")
print(f"{inconsistent_max_votes_rows=}/{total_rows}; {inconsistent_max_votes_rows/total_rows : .2%}")

In [None]:
# Explore further the inconsistent max votes
inconsistent_fewer_eeg_votes = inconsistent_max_votes[
    inconsistent_max_votes["max_vote_count_eeg"] < inconsistent_max_votes["max_vote_count_spectrogram"]
]
inconsistent_fewer_eeg_votes_rows = inconsistent_fewer_eeg_votes.shape[0]
inconsistent_fewer_spectrogram_votes = inconsistent_max_votes[
    inconsistent_max_votes["max_vote_count_spectrogram"] < inconsistent_max_votes["max_vote_count_eeg"]
]
inconsistent_fewer_spectrogram_votes_rows = inconsistent_fewer_spectrogram_votes.shape[0]
print(
    f"{inconsistent_fewer_eeg_votes_rows=}/{inconsistent_max_votes_rows}; "
    f"{inconsistent_fewer_eeg_votes_rows/inconsistent_max_votes_rows : .2%}"
)
print(
    f"{inconsistent_fewer_spectrogram_votes_rows=}/{inconsistent_max_votes_rows}; "
    f"{inconsistent_fewer_spectrogram_votes_rows/inconsistent_max_votes_rows : .2%}"
)

#### Discussion

So, what do we know:
1. Sometimes experts will not agree on a label and will not cast a vote.
2. The spectrogram IDs have more votes associated with them than the EEG IDs.
   This is probably consistent, as the spectrograms cover a longer period of time
   than the EEG traces.
3. This probably means that the size of the panel could be derived from the max
   number of votes associated with the spectrogram ID rather than the eeg ID.
   
Therefore, to convert our votes to probabilities we use the max count of spectrogram ID votes.
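
As a worked illustration of this rule, the sketch below uses made-up vote counts and only two vote columns (both are assumptions for illustration, not competition data). Dividing by the per-spectrogram maximum means that a row where fewer votes were cast ends up with probabilities that sum to less than one.

```python
import pandas as pd

# Made-up votes: two labels on spectrogram 10, one label on spectrogram 20
toy = pd.DataFrame({
    "spectrogram_id": [10, 10, 20],
    "seizure_vote": [3, 2, 0],
    "other_vote": [1, 0, 5],
})
vote_cols = ["seizure_vote", "other_vote"]
toy["vote_count"] = toy[vote_cols].sum(axis=1)
# Assumed panel size: the largest vote count seen for each spectrogram
toy["panel_size"] = toy.groupby("spectrogram_id")["vote_count"].transform("max")
# Row-wise division of each vote column by the assumed panel size
probabilities = toy[vote_cols].div(toy["panel_size"], axis=0)
print(probabilities.sum(axis=1).tolist())  # [1.0, 0.5, 1.0]
```

The second row received only two of an assumed four votes, so its probabilities sum to 0.5.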

### Probabilities

We'll divide the vote numbers by the maximum number of votes grouped by spectrogram IDs

In [None]:
VOTE_COLUMNS = ["seizure_vote", "lpd_vote", "gpd_vote", "lrda_vote", "grda_vote", "other_vote"]
P_COLUMNS = ["P_sz", "P_lpd", "P_gpd", "P_lrda", "P_grda", "P_other"]
for p_col, v_col in zip(P_COLUMNS, VOTE_COLUMNS):
    train_info_clean[p_col] = train_info_clean[v_col] / train_info_clean["max_vote_count_spectrogram"]
train_info_clean = train_info_clean.copy()
train_info_clean

In [None]:
# Work out probabilities across the whole training set
P_all = np.asarray(
    [train_info_clean[p_col].sum() for p_col in P_COLUMNS]
) / total_rows
P_all

In [None]:
# Visualise as a table
P_df = pd.DataFrame(data=P_all.reshape((1,6)), columns=["P_sz", "P_lpd", "P_gpd", "P_lrda", "P_grda", "P_other"])
P_df

In [None]:
# What's the probability sum across everything, don't expect this to be 1.0, but something close-ish
P_all.sum()

### Level of agreement

From the competition overview:

- 'idealized': High level of expert agreement
- 'proto patterns':  Cases where ~1/2 of experts give a label as “other” and ~1/2
   give one of the remaining five labels.
- 'edge cases': Where experts are approximately split between 2 of the 5 named patterns

> Not that easy to program!

In [None]:
# My stab at categorizing the agreement based on the vague description
def compute_agreement(P_sz:float, P_lpd:float, P_gpd:float, P_lrda:float, P_grda:float, P_other:float) -> str:
    agreement = "none"
    # Because of the way we have computed the probabilities for a row,
    # the parameters passed may not add to 1.0. Lets fix this first.
    P_tot = P_sz + P_lpd + P_gpd + P_lrda + P_grda + P_other
    P_sz = P_sz / P_tot
    P_lpd = P_lpd / P_tot
    P_gpd = P_gpd / P_tot
    P_lrda = P_lrda / P_tot
    P_grda = P_grda / P_tot
    P_other = P_other / P_tot
    # Now rank them by probability value
    p_dict = { "sz": P_sz, "lpd": P_lpd, "gpd": P_gpd, "lrda": P_lrda, "grda": P_grda, "other": P_other}
    p_list = list(p_dict.items())
    p_list.sort(key=lambda kv: 1-kv[1])
    # Ignore all but the three highest probabilities, get the top ranking categories
    first = p_list[0][0]
    second = p_list[1][0]
    third = p_list[2][0]
    # Re-scale by the ignored probabilities of the lower 3 categories
    p_first = p_dict[first]
    p_second = p_dict[second]
    p_third = p_dict[third]
    p_top_3 = p_first + p_second + p_third
    p_first /= p_top_3
    p_second /= p_top_3
    p_third /= p_top_3
    if p_first > 0.75:
        # If not other then there is strong agreement, if is other then no agreement
        if first != "other":
            agreement = "idealized"
    elif (p_first + p_second) > 0.75:
        # First two categories combined are significant
        if first == "other" or second == "other":
            # We'll assume that this is a proto-pattern
            agreement = "proto-pattern"
        else:
            # We have two of equal-ish ranking
            agreement = "edge-case"
    else:
        # No agreement (already set)
        ...
    return agreement

In [None]:
train_info_clean["agreement"] = train_info_clean.apply(
    lambda row : compute_agreement(
        row["P_sz"], row["P_lpd"], row["P_gpd"], row["P_lrda"], row["P_grda"], row["P_other"]
    ),
    axis=1
)
train_info_clean

In [None]:
# Visualise the derived agreement category
sns.histplot(x="agreement", data=train_info_clean)

## How long are the EEG files?

The competition says that some of the EEG files have been merged, so it would be useful to
know the range of file sizes (which gives us the number of samples).

In [None]:
# The EEG sampling rate is given in the competition data set info
EEG_SAMPLING_RATE = 200

In [None]:
print("Reading in training EEGs as a ParquetDataset, this takes some time...")
eeg_files_dataset = ParquetDataset(COMPETITION_DATA_PATH / "train_eegs")
print("Now reading the length of each parquet file, also takes time...")
file_durations:Dict[float, int] = {}
for fragment in tqdm_notebook(eeg_files_dataset.fragments):
    file_duration = fragment.count_rows() / EEG_SAMPLING_RATE
    if file_duration in file_durations:
        file_durations[file_duration] += 1
    else:
        file_durations[file_duration] = 1 
file_durations_df = pd.DataFrame(sorted([(k,v) for k,v in file_durations.items()]), columns=["duration", "file_count"])
file_durations_df 

In [None]:
# We're going to categorise the file length into various buckets...
def categorise_duration(duration: float, sort_order:bool = False) -> str:
    if duration == 50.0:
        bucket = (0, "50s")
    elif duration <= 60.0:
        bucket = (1, "52-60s")
    elif duration <= 90.0:
        bucket = (2, "60-90s")
    elif duration <= 120.0:
        bucket = (3, "90s-2m")
    elif duration <= 180.0:
        bucket = (4, "2-3m")
    elif duration <= 300.0:
        bucket = (5, "3-5m")
    elif duration <= 600.0:
        bucket = (6, "5-10m")
    elif duration <= 1200.0:
        bucket = (7, "10-20m")
    else:
        bucket = (8, "> 20m")
    return bucket[0 if sort_order else 1]

In [None]:
file_durations_df["duration_bucket"] = file_durations_df.apply(
    lambda row: categorise_duration(row["duration"]), axis=1
)
file_durations_df["bucket_order"] = file_durations_df.apply(
    lambda row: categorise_duration(row["duration"], True), axis=1
)
file_durations_df

In [None]:
# Look at the spread of file durations
axis = sns.barplot(
    data=file_durations_df, 
    x="duration_bucket", 
    y="file_count", 
    estimator="sum",
    errorbar=None
)
axis.bar_label(axis.containers[0])
print(f"Total number of EEG files: {file_durations_df['file_count'].sum()}")

## EEG Data exploration

Load in an arbitrary sample and have a look at the data

In [None]:
# Just choose an arbitrary sample to look at
SAMPLE = 7_824
sample_info = train_info_clean.iloc[SAMPLE]
sample_info

In [None]:
TIME_OFFSET_COLUMN = "time_offset"
EEG_SAMPLING_PERIOD = 1.0 / EEG_SAMPLING_RATE

# Function to add a time-channel to a dataframe
def add_time_channel(df: pd.DataFrame, period: float=EEG_SAMPLING_PERIOD) -> None:
    df[TIME_OFFSET_COLUMN] = df.index * period

### Per second probabilities

We have labels that overlap in any given EEG file and we know that there are
inconsistencies still between labels, however, we have dropped the overlapping
labels where the expert consensus "agreement" category differs.

We still need to translate our tabular label probabilities into something we can
use for training a model. Let's give each second in an EEG file a label probability
based on whatever labels happen to overlap.

In [None]:
def combine_probabilities(info: pd.DataFrame, row: pd.Series) -> pd.Series:
    time_offset = float(row.name)   # This is the index value for the row
    applicable_probabilities = info[P_COLUMNS].where(
        (info["eeg_label_start_time"]<=time_offset) & 
        (time_offset<info["eeg_label_end_time"])
    )
    # Note apply fillna **after** mean as we don't want NaN rows to count towards the mean
    return applicable_probabilities[P_COLUMNS].mean().fillna(0.0)

In [None]:
def eeg_P_per_sec(train_info: pd.DataFrame, eeg_id: int):
    eeg_info = train_info[train_info["eeg_id"]==eeg_id].copy()
    last_label_start_offset = eeg_info["eeg_label_offset_seconds"].iloc[-1]
    # Note as we know all label boundaries are every 2 secs we step index by 2
    p_per_sec = pd.DataFrame(0.0, index=np.arange(0.0, last_label_start_offset+EEG_SNAPSHOT_DURATION, 2.0), columns=P_COLUMNS)
    p_per_sec = p_per_sec.apply(lambda row: combine_probabilities(eeg_info, row), axis=1)
    p_per_sec["eeg_id"] = eeg_id
    p_per_sec["is_scored"] = p_per_sec[P_COLUMNS].sum(axis=1) > 0
    p_per_sec.replace(0.0, math.nan, inplace=True)
    return p_per_sec.copy()

In [None]:
sample_eeg_p_per_sec = eeg_P_per_sec(train_info_clean, sample_info["eeg_id"])
sample_eeg_p_per_sec

In [None]:
sns.lineplot(data=sample_eeg_p_per_sec[P_COLUMNS])

## Load a single 50s snapshot for the EEG we've selected

This is the 50 second slice that the experts have seen.

In [None]:
# Function to load the eeg data
def load_eeg(
    eeg_id:int, 
    start_time:float=0, 
    duration:float|None=None
) -> pd.DataFrame:
    df = pd.read_parquet(COMPETITION_DATA_PATH/"train_eegs"/f"{eeg_id}.parquet")
    # Interpolate over the dataframe as some of the eeg files have the
    # odd row of NaNs
    df = df.interpolate()
    # And front and back data may contain NaNs that won't interpolate
    df = df.fillna(0.0)
    # Add the time channel
    add_time_channel(df)
    # Get the duration for the whole file
    file_duration = df[TIME_OFFSET_COLUMN].iloc[-1] + EEG_SAMPLING_PERIOD
    # Now we want to extract the data for period in question
    end_time = (
        start_time + 
        (duration if duration is not None else file_duration - start_time)
    )
    df = df[
        (df[TIME_OFFSET_COLUMN] >= start_time) & 
        (df[TIME_OFFSET_COLUMN] < end_time)
    ]
    return df.copy().reset_index()

In [None]:
# Load the 50 second snapshot for the sample label using the function above
snapshot_eeg = load_eeg(
    sample_info['eeg_id'], 
    sample_info['eeg_label_offset_seconds'],
    EEG_SNAPSHOT_DURATION
)
snapshot_eeg

## Single ended vs differential

Looks like the data is single-ended (which is good) but we probably need to make it differential to produce the same type of plots in the sample data.

In [None]:
# What are the column names?
for col in snapshot_eeg.columns:
    print(col)

In [None]:
# Left lateral (LL)
LL_EEG_CHANNELS = ["Fp1-F7", "F7-T3", "T3-T5", "T5-O1"]
# Left parasagittal (LP)
LP_EEG_CHANNELS = ["Fp1-F3", "F3-C3", "C3-P3", "P3-O1"]
# Central
CC_EEG_CHANNELS = ["Fz-Cz", "Cz-Pz"]
# Right parasagittal (RP)
RP_EEG_CHANNELS = ["Fp2-F4", "F4-C4", "C4-P4", "P4-O2"]
# Right lateral (RL)
RL_EEG_CHANNELS = ["Fp2-F8", "F8-T4", "T4-T6", "T6-O2"]
# Auxiliary information columns
AUX_EEG_COLUMNS = [ "EKG", TIME_OFFSET_COLUMN ]

# Define how we want our EEG channels to be constructed
DIFF_EEG_COLUMNS = \
    LL_EEG_CHANNELS + \
    LP_EEG_CHANNELS + \
    CC_EEG_CHANNELS + \
    RP_EEG_CHANNELS + \
    RL_EEG_CHANNELS + \
    AUX_EEG_COLUMNS
DIFF_EEG_COLUMNS

In [None]:
# Make a new dataframe with differential channels, rather than single ended
def make_differential(df:pd.DataFrame, columns:List[str] = DIFF_EEG_COLUMNS):
    # Take copy as df is potentially a slice
    df = df.copy()
    to_drop: List[str] = []
    for column in columns:
        single_ended_channels = column.split("-")
        if len(single_ended_channels) == 2:
            # We need to make the differential channel
            df[column] = df[single_ended_channels[0]] - df[single_ended_channels[1]]
            if single_ended_channels[0] not in to_drop:
                to_drop.append(single_ended_channels[0])
            if single_ended_channels[1] not in to_drop:
                to_drop.append(single_ended_channels[1])
            # Standardize column value
            df[column] = (df[column] - df[column].mean()) / df[column].std()
    # Drop non-differential columns
    df.drop(to_drop, axis=1, inplace=True)
    # Now return the dataframe in the correct column order
    return df[columns]    

In [None]:
# Apply the differential function
extracted_diff_eeg = make_differential(snapshot_eeg)
extracted_diff_eeg

## Plotting

We're going to define a plotting process that can plot the dataframe columns as a grouped, stacked line plot using a combination of seaborn and matplot libraries.

In [None]:
# Define a dataclass that defines a group of channels to be plotted
@dataclass
class PlotGroup:
    channels: List[str]
    fg_color: str
    bg_color: str

In [None]:
# This function will plot a group of channels from the dataframe
def eeg_group_plot(
    df: pd.DataFrame, 
    x: str, 
    plot_group: PlotGroup, 
    snapshot_offset:float,
    axes:List[plt.Axes], 
    plot_span_secs:float,
    snapshot_duration
):
    xlim_lower = snapshot_offset + (snapshot_duration / 2) - (plot_span_secs / 2)
    xlim_upper = xlim_lower + plot_span_secs
    for channel, ax in zip(plot_group.channels, axes):
        sns.lineplot(data=df, x=x, y=channel, ax=ax, color=plot_group.fg_color)
        ax.set_facecolor(plot_group.bg_color)
        ax.set_xlim((xlim_lower, xlim_upper))
        ax.set_ylabel(channel, rotation=0, fontsize=12, horizontalalignment='right', verticalalignment='center')

In [None]:
# This function will plot and stack multiple groups together
def eeg_plot(
    df: pd.DataFrame, 
    x: str, 
    plot_groups: List[PlotGroup], 
    snapshot_offset: float, 
    plot_span_secs:float = EEG_LABEL_DURATION,
    snapshot_duration = EEG_SNAPSHOT_DURATION,
    fig_height=12, 
    fig_width=12
):
    num_channels = 0
    for plot_group in plot_groups:
        num_channels += len(plot_group.channels)
    figure, axes = plt.subplots(num_channels, 1)
    figure.subplots_adjust(hspace=0)
    figure.set_figheight(fig_height)
    figure.set_figwidth(fig_width)
    axis_num = 0
    for plot_group in plot_groups:
        num_group_chans = len(plot_group.channels)
        group_axes = axes[axis_num:axis_num+num_group_chans]
        eeg_group_plot(
            df, x, plot_group, snapshot_offset, group_axes,
            plot_span_secs=plot_span_secs,
            snapshot_duration=snapshot_duration
        )
        axis_num += num_group_chans
    figure.show()

In [None]:
# Now define what the plot groupings are and what the colours are to use.
# We'll plot from the left side of the head to the right side, front to back.
# We'll use red for left side and green for right (as per international navigation lights!)
# EKG will be last channel
DIFF_PLOT_GROUPS = [
    PlotGroup(["Fp1-F7", "F7-T3", "T3-T5", "T5-O1"], 'red', wc.CSS3_NAMES_TO_HEX["lightpink"]),
    PlotGroup(["Fp1-F3", "F3-C3", "C3-P3", "P3-O1"], 'red', wc.CSS3_NAMES_TO_HEX["lightsalmon"]),
    PlotGroup(["Fz-Cz", "Cz-Pz"], 'black', wc.CSS3_NAMES_TO_HEX["gainsboro"]),
    PlotGroup(["Fp2-F4", "F4-C4", "C4-P4", "P4-O2"], 'green', wc.CSS3_NAMES_TO_HEX["darkseagreen"]),
    PlotGroup(["Fp2-F8", "F8-T4", "T4-T6", "T6-O2"], 'green', wc.CSS3_NAMES_TO_HEX["lightseagreen"]),
    PlotGroup(["EKG"], 'blue', wc.CSS3_NAMES_TO_HEX["powderblue"]),
]

In [None]:
# Now plot the unprocessed data
eeg_plot(extracted_diff_eeg, TIME_OFFSET_COLUMN, DIFF_PLOT_GROUPS, sample_info['eeg_label_offset_seconds'])

## EEGLIB preprocessing

eeglib has sparse documentation and may/may not be useful. Let's just explore what can be done with the preprocessing functions it provides.

### eeglib Helper class

Looks like data has to start in a `Helper` object and we need to create this from a 
numpy array (as there's no way to directly import a parquet file).

**However:** If you simply create a `Helper` object from a numpy array the `Helper` 
only saves the reference to the data and then, potentially, modifies it during 
pre-processing. This is not a problem in a forward-running pipeline but when messing
around in the Notebook this can have unintended side-effects.

Therefore we create a helper function that safely creates the helper from the
DataFrame without danger of modifying the input dataframe.
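
The hazard is ordinary NumPy aliasing rather than anything specific to eeglib. The sketch below uses a hypothetical stand-in function, `normalize_in_place`, to mimic a preprocessing step that writes into the array it is given. It makes no claim about how `Helper` is implemented internally.

```python
import numpy as np


def normalize_in_place(data: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a preprocessing step that mutates its input."""
    data -= data.mean(axis=1, keepdims=True)
    data /= data.std(axis=1, keepdims=True)
    return data


original = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
snapshot = original.copy()  # Reference copy to compare against

# Passing a copy leaves the caller's array untouched
normalize_in_place(original.copy())
print(np.array_equal(original, snapshot))  # True

# Passing the array itself overwrites the caller's data
normalize_in_place(original)
print(np.array_equal(original, snapshot))  # False
```

This is why the helper function below takes a `.copy()` of the selected columns before converting them to a NumPy array.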

In [None]:
@dataclass
class BandPass:
    low_cutoff: float|None
    high_cutoff: float|None
        
DEFAULT_BANDPASS = BandPass(1.0, 50.0)

In [None]:
# Makes an eeglib helper without potential side effects on the input dataframe
# Returns the helper and the data dropped when we made the helper
def df_to_eeglib_helper(
    df: pd.DataFrame,
    columns: List[str]=DIFF_EEG_COLUMNS,
    drop_columns: List[str]=AUX_EEG_COLUMNS,
    sample_rate:int=EEG_SAMPLING_RATE, 
    window_size:int|None=int(EEG_LABEL_DURATION * EEG_SAMPLING_RATE),
    band_pass:BandPass|None=None,
    normalize:bool=False,
    ica:bool=False
) -> Tuple[Helper, pd.DataFrame]:
    required_cols = [col for col in columns if col not in drop_columns]
    # Here is all important copy, gives us a new data array to be mutated by eeglib
    copy_df = df[required_cols].copy()
    dropped_df = df[drop_columns].copy()
    data = copy_df.to_numpy().transpose()
    helper = Helper(
        data, 
        sampleRate=sample_rate, 
        names=required_cols, 
        windowSize=window_size,
        highpass=band_pass.low_cutoff if (band_pass and band_pass.low_cutoff) else None,
        lowpass=band_pass.high_cutoff if (band_pass and band_pass.high_cutoff) else None,
        normalize=normalize,
        ICA=ica,
    )
    return helper, dropped_df

In [None]:
# Convert an eeglib EEG object back into a standard pandas dataframe
def eeg_to_df(
    eeg: EEG, 
    eeg_channels:List[str], 
    restore_df:pd.DataFrame|None, 
    columns:List[str]=DIFF_EEG_COLUMNS
):
    df = pd.DataFrame(eeg.window.window.transpose(), columns=eeg_channels, copy=True)
    if (restore_df is not None):
        df = df.join(restore_df.reset_index())
    return df[columns]

In [None]:
helper, aux_df = df_to_eeglib_helper(
    extracted_diff_eeg,
    band_pass=DEFAULT_BANDPASS,
    normalize=True,
)
# Now we need to get an EEG object
# Note, looks like a bug in eeglib, can only iterate once so we'll collect the eegs in a list.
# Must make deep copies or will simply use underlying data for last iterator.
eegs = [deepcopy(eeg) for eeg in helper]
eeg = eegs[2] # Take centre 10 second window
start_time = (
    sample_info['eeg_label_offset_seconds'] + 
    (EEG_SNAPSHOT_DURATION - EEG_LABEL_DURATION) / 2.0
)
end_time = start_time + EEG_LABEL_DURATION
eeg_aux_df = aux_df[(aux_df[TIME_OFFSET_COLUMN] >= start_time) & (aux_df[TIME_OFFSET_COLUMN] < end_time)]
eeglib_modified_df = eeg_to_df(
    eeg, 
    helper.names, 
    eeg_aux_df
)
eeg_plot(eeglib_modified_df, TIME_OFFSET_COLUMN, DIFF_PLOT_GROUPS, sample_info['eeg_label_offset_seconds'])

### eeglib data input summary

So, we've ingested some of the EEG data and it's been pre-processed (basically filtered and normalized).

> Note that I can't get the ICA option to work, ignoring this!

It looks reasonably sensible, although there are definite edge effects (look at the trailing edge) caused by the
band-pass filter.

## eeglib features

So, why might we use eeglib?? The main reason appears to be that it can decompose
the time sequence eeg into a set of features. 
There is a list in the [eeglib features documentation](https://eeglib.readthedocs.io/en/latest/features.html)

There _may_ be some visual correlations discernable between the features it produces
and the expert votes in the training data.

So how to do this:

1. Read in an entire eeg file into an EEGlib helper.
2. Window this for every two seconds in the file data
3. Generate a feature set from the window, per channel.
4. Combine the results to produce a time-based set of features.
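
The four steps can also be sketched without eeglib. The block below is an illustration under stated assumptions, not the eeglib implementation: it uses synthetic data, a hypothetical `windowed_features` helper and non-overlapping windows. It computes only Hjorth activity (the variance of the signal) and Hjorth mobility from their standard definitions, with the first difference standing in for the derivative.

```python
import numpy as np
import pandas as pd

SAMPLING_RATE = 200   # Hz, as given for the competition EEG files
WINDOW_SECONDS = 2.0  # Non-overlapping window length (an assumption of this sketch)


def windowed_features(data: np.ndarray, names: list[str]) -> pd.DataFrame:
    """Hypothetical helper: data has shape (channels, samples)."""
    window = int(WINDOW_SECONDS * SAMPLING_RATE)
    n_windows = data.shape[1] // window
    # Drop the ragged tail, then reshape to (channels, windows, samples per window)
    chunks = data[:, : n_windows * window].reshape(len(names), n_windows, window)
    # Hjorth activity is the variance of each window
    activity = chunks.var(axis=2)
    # Hjorth mobility: sqrt(var(first difference) / var(signal)), in per-sample units
    mobility = np.sqrt(np.diff(chunks, axis=2).var(axis=2) / activity)
    columns = {}
    for idx, name in enumerate(names):
        columns[f"{name}.hjorth_activity"] = activity[idx]
        columns[f"{name}.hjorth_mobility"] = mobility[idx]
    # Index each row by the start time of its window, in seconds
    return pd.DataFrame(columns, index=np.arange(n_windows) * WINDOW_SECONDS)


rng = np.random.default_rng(0)
fake_eeg = rng.standard_normal((2, 10 * SAMPLING_RATE + 37))  # Just over 10 s, 2 channels
features = windowed_features(fake_eeg, ["Fp1-F7", "F7-T3"])
print(features.shape)  # (5, 4)
```

The `channel.feature` column naming follows the convention used by the feature extraction code later in this notebook.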


In [None]:
# Read in the entire eeg file
sample_eeg_df = make_differential(load_eeg(sample_info['eeg_id']))
sample_eeg_df

## Feature generation

Don't really know much about any of these feature types, but we'll generate the single channel
features available in eeglib for the entire sample, sliced into two second chunks.

In [None]:
def extract_eeg_features(
    helper: Helper,
    scored_entries: pd.Series,
    window_duration:float=EEG_LABEL_DURATION
) -> pd.DataFrame:
    # First we'll just build the features into python lists
    bp_alpha = []
    bp_beta = []
    bp_delta = []
    bp_theta = []
    pfd = []
    hfd = []
    hjorth_activity = []
    hjorth_mobility = []
    hjorth_complexity = []
    samp_en = []
    lzc = []
    dfa = []
    num_channels = len(helper.names)
    for eeg, is_scored in zip(helper, scored_entries):
        if is_scored:
            bp = eeg.bandPower()
            # Split band power and convert to db
            bp_alpha.append([10.0 * math.log10(ch["alpha"]) for ch in bp])
            bp_beta.append([10.0 * math.log10(ch["beta"]) for ch in bp])
            bp_delta.append([10.0 * math.log10(ch["delta"]) for ch in bp])
            bp_theta.append([10.0 * math.log10(ch["theta"]) for ch in bp])
            pfd.append(eeg.PFD())
            hfd.append(eeg.HFD())
            hjorth_activity.append(eeg.hjorthActivity())
            hjorth_mobility.append(eeg.hjorthMobility())
            hjorth_complexity.append(eeg.hjorthComplexity())
            samp_en.append(eeg.sampEn())
            lzc.append(eeg.LZC())
            dfa.append(eeg.DFA())
        else:
            bp_alpha.append([math.nan] * num_channels)
            bp_beta.append([math.nan] * num_channels)
            bp_delta.append([math.nan] * num_channels)
            bp_theta.append([math.nan] * num_channels)
            pfd.append([math.nan] * num_channels)
            hfd.append([math.nan] * num_channels)
            hjorth_activity.append([math.nan] * num_channels)
            hjorth_mobility.append([math.nan] * num_channels)
            hjorth_complexity.append([math.nan] * num_channels)
            samp_en.append([math.nan] * num_channels)
            lzc.append([math.nan] * num_channels)
            dfa.append([math.nan] * num_channels)
    # Now make this into a dataframe
    df_as_dict = {}
    #print(bp_alpha)
    for chan_idx, col_name in enumerate(helper.names):
        df_as_dict[f"{col_name}.bp_alpha"] = [feature_values[chan_idx] for feature_values in bp_alpha]
        df_as_dict[f"{col_name}.bp_beta"] = [feature_values[chan_idx] for feature_values in bp_beta]
        df_as_dict[f"{col_name}.bp_delta"] = [feature_values[chan_idx] for feature_values in bp_delta]
        df_as_dict[f"{col_name}.bp_theta"] = [feature_values[chan_idx] for feature_values in bp_theta]
        df_as_dict[f"{col_name}.pfd"] = [feature_values[chan_idx] for feature_values in pfd]
        df_as_dict[f"{col_name}.hfd"] = [feature_values[chan_idx] for feature_values in hfd]
        df_as_dict[f"{col_name}.hjorth_activity"] = [feature_values[chan_idx] for feature_values in hjorth_activity]
        df_as_dict[f"{col_name}.hjorth_mobility"] = [feature_values[chan_idx] for feature_values in hjorth_mobility]
        df_as_dict[f"{col_name}.hjorth_complexity"] = [feature_values[chan_idx] for feature_values in hjorth_complexity]
        df_as_dict[f"{col_name}.samp_en"] = [feature_values[chan_idx] for feature_values in samp_en]
        df_as_dict[f"{col_name}.lzc"] = [feature_values[chan_idx] for feature_values in lzc]
        df_as_dict[f"{col_name}.dfa"] = [feature_values[chan_idx] for feature_values in dfa]
    feature_df = pd.DataFrame.from_dict(df_as_dict)    
    # Set index as time channel
    feature_df.index *= window_duration
    return feature_df.copy() # This is because pandas warns the dataframe is fragmented!


In [None]:
# Build the sample features dataframe using the eeglib helper object
FEATURE_WINDOW_DURATION = 2.0
feature_helper, _ = df_to_eeglib_helper(
    sample_eeg_df,
    band_pass=DEFAULT_BANDPASS,
    normalize=True,
    window_size=int(FEATURE_WINDOW_DURATION * EEG_SAMPLING_RATE)
)
feature_df = extract_eeg_features(
    feature_helper, 
    sample_eeg_p_per_sec["is_scored"], 
    FEATURE_WINDOW_DURATION
)
feature_df

In [None]:
# There's a lot of numbers here! Try boiling some of it down into mean values per region...
EEG_REGIONS = [
    ("LL", LL_EEG_CHANNELS),
    ("LP", LP_EEG_CHANNELS),
    ("CC", CC_EEG_CHANNELS),
    ("RP", RP_EEG_CHANNELS),
    ("RL", RL_EEG_CHANNELS),
]
FEATURE_NAMES = [
    "bp_alpha", 
    "bp_beta", 
    "bp_delta", 
    "bp_theta", 
    "pfd", 
    "hfd", 
    "hjorth_activity",
    "hjorth_mobility", 
    "hjorth_complexity", 
    "samp_en", 
    "lzc", 
    "dfa"
]

# Function to add statistics channels per region
def add_feature_stats_by_region(df: pd.DataFrame) -> pd.DataFrame:
    with_stats_df = df.copy()
    for region_name, region_chans in EEG_REGIONS:
        with_stats_df = with_stats_df.copy() # prevent DF fragmented warning
        for feature_name in FEATURE_NAMES:
            region_feature_chans = [f"{chan}.{feature_name}" for chan in region_chans]
            with_stats_df[f"{region_name}.{feature_name}.mean"] = with_stats_df[region_feature_chans].mean(axis=1)
    return with_stats_df

In [None]:
features_and_region_stats_df = add_feature_stats_by_region(feature_df)
features_and_region_stats_df

## Joining the features data on probabilities for the eeg

In order to do some visualisation of the features we need to join the
features data with the probabilities for the same file data along the
time axis.
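
As a small sketch of what this join does, the toy example below aligns a feature table with a probability table on a shared time index and then keeps only the scored rows. The column names `Fp1.bp_alpha` and `seizure_p` and all of the values are made up for illustration; only the `is_scored` column mirrors the notebook.

```python
import pandas as pd

# Hypothetical features: one row per 2-second window, indexed by window start time
toy_features = pd.DataFrame(
    {"Fp1.bp_alpha": [-3.0, -2.5, -4.1]},
    index=[0.0, 2.0, 4.0],
)

# Hypothetical probabilities: one row per second, with a flag for scored seconds
toy_probabilities = pd.DataFrame(
    {
        "seizure_p": [0.0, 0.0, 1.0, 1.0, 0.5, 0.5],
        "is_scored": [False, False, True, True, True, True],
    },
    index=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
)

# DataFrame.join is a left join on the index, so only timestamps that
# exist in the feature table survive
toy_joined = toy_features.join(toy_probabilities)

# Keep only the windows that start on a scored second
toy_labelled = toy_joined[toy_joined["is_scored"]].copy()
```

Note that the join matches index values exactly, so the window start times have to coincide with timestamps in the probability table. A feature row with no matching timestamp would get NaN in the probability columns.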

In [None]:
# Create a named function that does the joining
def join_features_and_probabilities(
    features_df:pd.DataFrame, probabilities_df:pd.DataFrame
) -> pd.DataFrame:
    joined_df = features_df.join(probabilities_df)
    return joined_df[joined_df["is_scored"]].copy()

labelled_features = join_features_and_probabilities(
    features_and_region_stats_df, sample_eeg_p_per_sec
)
labelled_features

## Feature visualisation by brain region

We're first going to simply look at a plot of each of the features from a single
EEG file by the region of the brain that it emanated from.

In [None]:
REGION_MNEMONICS = [mne for mne, _ in EEG_REGIONS]
REGION_MNEMONICS

In [None]:
# Combine plot of scatterplot and density plot
def scatter_density_plot(
    data:pd.DataFrame, x:str, y:str, color: str, ax:plt.Axes, xlim:Tuple[float,float]=(-0.1, 1.1)
) -> None:
    scatter_plot = sns.scatterplot(
        data=data,
        x=x, 
        y=y, 
        ax=ax,
        color=color
    ).set(xlim=xlim)
#     CANNOT GET THIS TO WORK WITHOUT OCCASIONAL DATA-RELATED FAILURES
#     levels = 5
#     if data[y].count() >= levels:
#         sns.kdeplot(
#             data=data, 
#             x=x, 
#             y=y, 
#             levels=levels, 
#             fill=True, 
#             alpha=0.25, 
#             cut=2,
#             ax=ax, 
#             warn_singular=False,
#             color=color
#         ).set(xlim=xlim)

In [None]:
# Plot a scatterplot with density for each brain region 
# for a given probability channel
def regional_feature_probability_scatterplot(
    data: pd.DataFrame, 
    feature_name:str,
    probability_channels:List[str]=P_COLUMNS,
    regions:List[str]=REGION_MNEMONICS,
    fig_height:float=2.5,
    fig_width:float=12,
    force_color:str|None=None    
) -> None:
    figure, axes = plt.subplots(1, len(regions), sharey=True, sharex=True)
    figure.subplots_adjust(wspace=0)
    figure.set_figheight(fig_height)
    figure.set_figwidth(fig_width)
    visible_yaxis = True
    my_color_cycler = plt.cycler(
        'color',
        plt.rcParams['axes.prop_cycle'].by_key()['color']
    )
    for region, ax in tqdm_notebook(
        zip(regions, axes), 
        desc="Regions to plot",
        total=len(regions)
    ):
        feature_chan = f"{region}.{feature_name}"
        props_cycle = my_color_cycler()
        for probability_chan, props  in zip(probability_channels, props_cycle):
            scatter_density_plot(
                data=data, 
                x=probability_chan,
                y=feature_chan,
                ax=ax,
                color=props['color'] if force_color is None else force_color
            )
        ax.title.set_text(region)
        ax.yaxis.set_visible(visible_yaxis)
        ax.set_xlabel("probability")
        ax.set_ylabel(None)
        visible_yaxis = False
    figure.supylabel(feature_name)
    props_cycle = my_color_cycler()
    legend_handles = [
        mpatches.Patch(
            color=props['color'] if force_color is None else force_color,
            label=probability_chan
        )
        for probability_chan, props  in zip(probability_channels, props_cycle)
    ]
    figure.legend(handles=legend_handles)
    figure.show()

In [None]:
regional_feature_probability_scatterplot(labelled_features, "bp_alpha.mean")

In [None]:
for feature_name in tqdm_notebook(FEATURE_NAMES):
    regional_feature_probability_scatterplot(labelled_features, f"{feature_name}.mean")

## Spatial Visualisation

Let's try and spatially visualise the feature data by producing a grid of 14 scatterplots ordered as the probes are placed on the patient's head.
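
The grid layout relies on padding the shorter centre chain with `None` so that every chain has the same length and can be zipped into rows. Here is a minimal sketch of that idea, using made-up channel names rather than the real montage:

```python
# Hypothetical channel chains, ordered front to back (not the real montage)
toy_left = ["L1", "L2", "L3", "L4"]
toy_centre = ["C1", "C2"]
toy_right = ["R1", "R2", "R3", "R4"]

# Pad the shorter centre chain with None on both sides so it lines up
# with the middle rows of the longer chains
pad_total = len(toy_left) - len(toy_centre)
pad_before = pad_total // 2
toy_centre_padded = (
    [None] * pad_before + toy_centre + [None] * (pad_total - pad_before)
)

# Each zipped tuple becomes one row of subplots, read left to right
toy_grid = [list(row) for row in zip(toy_left, toy_centre_padded, toy_right)]
```

A `None` entry marks a grid cell with no probe, which the plotting code can then grey out instead of drawing a scatterplot.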

In [None]:
CC_EEG_CHANNELS_PADDED = [None] + CC_EEG_CHANNELS + [None]
SPATIAL_CHANNELS = [
    [ll, lp, cc, rp, rl] 
    for ll, lp, cc, rp, rl in zip(
        LL_EEG_CHANNELS, LP_EEG_CHANNELS, CC_EEG_CHANNELS_PADDED, RP_EEG_CHANNELS, RL_EEG_CHANNELS
    )
]
SPATIAL_CHANNELS

In [None]:
def spatial_feature_probability_scatterplot(
    data: pd.DataFrame,
    feature_name:str,
    probability_channels:List[str]=P_COLUMNS,
    spatial_channels:List[List[str]]=SPATIAL_CHANNELS,
    regions:List[str]=REGION_MNEMONICS,
    fig_height:float=8,
    fig_width:float=12,
    force_color:str|None=None
) -> None:
    common_xlim = (-0.1, 1.1)
    subplot_rows = len(spatial_channels)
    subplot_cols = max([len(row) for row in spatial_channels])
    figure, axes_2d = plt.subplots(subplot_rows, subplot_cols, sharey=True, sharex=True)
    figure.subplots_adjust(wspace=0, hspace=0)
    figure.set_figheight(fig_height)
    figure.set_figwidth(fig_width)
    my_color_cycler = plt.cycler(
        'color',
        plt.rcParams['axes.prop_cycle'].by_key()['color']
    )
    for row_idx, (channels, axes) in tqdm_notebook(
        enumerate(zip(spatial_channels, axes_2d)),
        total = subplot_rows,
        desc="Front to back"
    ):
        num_chans = len(channels)
        visible_yaxis = True
        first_row = 0 == row_idx
        last_row = subplot_rows == row_idx+1
        for col_idx, (region, ax) in tqdm_notebook(
            enumerate(zip(regions, axes)),
            total=subplot_cols,
            desc="Left to right"
        ):
            channel = channels[col_idx] if col_idx < num_chans else None
            if channel is not None:
                props_cycle = my_color_cycler()
                for probability_channel, props in zip(probability_channels, props_cycle):
                    scatter_density_plot(
                        data=data, 
                        x=probability_channel, 
                        y=f"{channel}.{feature_name}", 
                        ax=ax,
                        xlim=common_xlim,
                        color=props["color"] if force_color is None else force_color
                    )
            else:
                ax.xaxis.set_visible(last_row)
                ax.set_facecolor(wc.name_to_hex("gainsboro"))  # avoids the CSS3_NAMES_TO_HEX constant
                ax.set(xlim=common_xlim)
            if first_row:
                ax.title.set_text(region)
            ax.set_xlabel(None if not last_row else "probability")
            ax.set_ylabel(None)
            ax.yaxis.set_visible(visible_yaxis)
            visible_yaxis = False
    figure.supylabel(feature_name)
    props_cycle = my_color_cycler()
    legend_handles = [
        mpatches.Patch(
            color=props['color'] if force_color is None else force_color,
            label=probability_chan
        )
        for probability_chan, props  in zip(probability_channels, props_cycle)
    ]
    figure.legend(handles=legend_handles)
    figure.show()

In [None]:
spatial_feature_probability_scatterplot(labelled_features, "bp_alpha")

In [None]:
for feature_name in tqdm_notebook(FEATURE_NAMES, desc="Feature loop"):
    spatial_feature_probability_scatterplot(labelled_features, feature_name)

#### Discussion

So can we see any correlation?

I'm not sure. However, what we are looking at is not statistically significant:
we're only looking at a single EEG sequence from a single patient.

The next step would be to feature-ise a larger proportion of the training dataset
to see if the eeglib features begin to show correlations in the visualisations.
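
One way to back up the eyeballing with a number would be a rank correlation between each feature column and a probability column. The sketch below is an assumption of mine rather than something this notebook does: it ranks the columns and then takes the Pearson correlation of the ranks, which is equivalent to Spearman's rho. The helper name and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_correlations(
    df: pd.DataFrame, feature_columns: list[str], probability_column: str
) -> pd.Series:
    """Spearman-style correlation of each feature column with one probability column."""
    # Drop rows with a NaN in any of the selected columns, then convert to ranks
    ranked = df[feature_columns + [probability_column]].dropna().rank()
    # Pearson correlation of the ranks is Spearman's rho
    return ranked[feature_columns].corrwith(ranked[probability_column])


# Toy data (assumed): one feature that rises monotonically with p, one pure noise
toy_rng = np.random.default_rng(0)
toy_p = toy_rng.uniform(0.0, 1.0, 200)
toy_df = pd.DataFrame(
    {
        "p": toy_p,
        "monotone_feature": np.exp(3.0 * toy_p),
        "noise_feature": toy_rng.normal(size=200),
    }
)
toy_result = rank_correlations(toy_df, ["monotone_feature", "noise_feature"], "p")
```

A value near +1 or -1 would correspond to the kind of diagonal relationship we are hoping to spot in the scatterplots, while a value near 0 suggests no monotonic relationship.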


# Building a statistically larger features dataset

So let's go ahead and build a larger dataset using the eeglib feature extraction
based on the above code.


In [None]:
# Function that combines all of the previous sub-processes to build a
# labelled features dataframe for a single eeg file
def featurize_eeg(train_info: pd.DataFrame, eeg_id: int) -> pd.DataFrame:
    # Get temporal probabilities for this eeg
    probabilities_df = eeg_P_per_sec(train_info, eeg_id)
    # Load the EEG waveform data
    eeg_df = load_eeg(eeg_id)
    # Add a time channel to it
    add_time_channel(eeg_df)
    # Build the differential waveform
    eeg_df = make_differential(eeg_df)
    # Make an eeglib helper object
    feature_helper, _ = df_to_eeglib_helper(
        eeg_df,
        band_pass=DEFAULT_BANDPASS,
        normalize=True,
        window_size=int(FEATURE_WINDOW_DURATION * EEG_SAMPLING_RATE)
    )
    # Extract the raw channel features from the helper
    features_df = extract_eeg_features(feature_helper, probabilities_df["is_scored"], FEATURE_WINDOW_DURATION)
    # Add region statistics as well
    features_df = add_feature_stats_by_region(features_df)
    # And join the features and probabilities
    features_df = join_features_and_probabilities(features_df, probabilities_df)
    return features_df

In [None]:
# Just test the function works
test_eeg_df = featurize_eeg(train_info_clean, sample_info.eeg_id)
spatial_feature_probability_scatterplot(test_eeg_df, "dfa")

In [None]:
all_eeg_ids = train_info_clean["eeg_id"].unique()
print(f"Total number of EEG files: {len(all_eeg_ids)}")

In [None]:
# Set the random seed
RANDOM_SEED = 0x55aa6699

# Sample percentage of the dataset for visualisation
SAMPLE_FRAC_PC = 5.0 

In [None]:
NUM_SAMPLES = int(len(all_eeg_ids) * SAMPLE_FRAC_PC / 100.0)
random_generator = np.random.default_rng(seed=RANDOM_SEED)
random_eeg_ids = random_generator.choice(all_eeg_ids, size=NUM_SAMPLES, replace=False).tolist()
print(f"Total number of sampled EEG files: {len(random_eeg_ids)}")

In [None]:
# Some flags to control dataset generation
GENERATE_DATASET = True 
ITS_A_TEST_RUN = False
TEST_RUN_SLICE_SIZE = 20

sampled_eegs = random_eeg_ids
if not GENERATE_DATASET or ITS_A_TEST_RUN:
    # Just get a small slice of them
    sampled_eegs = random_eeg_ids[:TEST_RUN_SLICE_SIZE]
if GENERATE_DATASET:
    print(
        f"{'TEST MODE: ' if ITS_A_TEST_RUN else ''}"
        f"Will build dataset dataframe using {len(sampled_eegs)} EEG files."
    )
else:
    print("Dataset generation is disabled.")

In [None]:
# Now generate the dataset dataframes
output_dataset_df = None
for dataset_eeg_id in tqdm_notebook(sampled_eegs, desc="Files processed"):
    tqdm.write(f"EEG id: {dataset_eeg_id}")
    dataset_eeg_df = featurize_eeg(train_info_clean, dataset_eeg_id)
    if output_dataset_df is None:
        output_dataset_df = dataset_eeg_df
    else:
        output_dataset_df = pd.concat(
            [output_dataset_df, dataset_eeg_df], ignore_index=True, axis=0
        )
output_dataset_df

In [None]:
props_cycler = plt.cycler(
    'color',
    plt.rcParams['axes.prop_cycle'].by_key()['color']
)
for feature_name in tqdm_notebook(FEATURE_NAMES, desc="Feature loop"):
    props_cycle = props_cycler()
    for probability_channel, props in zip(P_COLUMNS, props_cycle):
        spatial_feature_probability_scatterplot(
            output_dataset_df, 
            feature_name, 
            [probability_channel],
            force_color=props['color']
        )

# Conclusion

So, I'm struggling to see much of a visual correlation between the eeglib
generated features and the various probability levels.

We're essentially looking for diagonal relationships in these graphs. There
are not many of them, and those that do appear are a bit unclear.

There are also obvious issues in the features, presumably as a result of
inadequate pre-processing or the standardization process applied when we made the
differential channels. These are particularly noticeable in the band power
features.

But... this might form a basis for input to train a model...

Many thanks,
Andrew.