# HMS - Data Inspection

Let's inspect and understand the data first. Navigate to each section for a summary of the findings.

**Comments welcome!**


## Table of Contents
- [train.csv](#train.csv)
- [train_eegs](#train_eegs)
- [train_spectrograms](#train_spectrograms)
- [test data](#test-data)
- [sample_submission.csv](#sample_submission.csv)

In [None]:
import os
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
base_dir = pathlib.Path("/kaggle/input/hms-harmful-brain-activity-classification")
base_dir

In [None]:
os.listdir(base_dir)

# train.csv
- Each row corresponds to a sample of an EEG recording and its respective spectrogram.
- eeg_id and spectrogram_id refer to longer recordings. For their judgement, the experts only used parts of these. The parts can be inferred from the given offsets and are enumerated by the sub ids.
- For the judgement, apparently, the following is used:
    - 50 second subsample of an EEG recording
    - 10 minute center-matched spectrogram
        - **How is this center matched, in particular for those that have offset=0?**
    - The experts labelled the central 10 seconds.
        - **What is labelled, EEG or spec, or both? Must look into discussions.**
- The combination of eeg_id + eeg_sub_id is a unique identifier.
- The 50 second intervals overlap but are not completely identical.
- An eeg_id has at most 1 spectrogram, but a spectrogram can cover multiple eeg_ids.
    - **How can this work? Is the EEG stopped in between?**
- Labels are well balanced.
- The number of expert votes differs between rows.
- Expert agreement as described on the Overview page (idealized, proto, edge) is not given, but can be (partially) calculated. The uncertainty might be worth considering!
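
One way to approximate expert agreement is to normalise the vote counts per row and take the share of the most-voted class. This is a minimal sketch on made-up counts. The column names are assumed to match the vote columns in train.csv, and the agreement measure is my own choice, not the definition from the Overview page.

```python
import pandas as pd

# Toy rows using the assumed vote column names; the counts are made up.
votes = pd.DataFrame(
    {
        "seizure_vote": [3, 0, 1],
        "lpd_vote": [0, 5, 1],
        "gpd_vote": [0, 0, 0],
        "lrda_vote": [0, 0, 0],
        "grda_vote": [0, 5, 0],
        "other_vote": [0, 0, 2],
    }
)

n_votes = votes.sum(axis=1)          # number of expert votes per row
probs = votes.div(n_votes, axis=0)   # vote shares, each row sums to 1
agreement = probs.max(axis=1)        # share of the most-voted class
top_class = probs.idxmax(axis=1)     # most-voted class (first one on ties)

summary = pd.DataFrame(
    {"n_votes": n_votes, "top_class": top_class, "agreement": agreement}
)
print(summary)
```

An agreement of 1.0 means all experts chose the same class. Lower values indicate split votes, which could be used to weight or filter samples.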


In [None]:
path_train = base_dir / "train.csv"
train = pd.read_csv(path_train, dtype={"eeg_id": "str", "spectrogram_id": "str"})
train

In [None]:
train["eeg_id"].value_counts()

In [None]:
print("Duplicates in id + sub_id:", train[["eeg_id", "eeg_sub_id"]].duplicated().any())
print("Duplicates in id + offset_seconds:", train[["eeg_id", "eeg_label_offset_seconds"]].duplicated().any())
print("Max number of spec by eeg_id:", train.groupby("eeg_id")["spectrogram_id"].nunique().max())
print("Max number of eeg_id by spec:", train.groupby("spectrogram_id")["eeg_id"].nunique().max())
print("Max number of patient_id by spec:", train.groupby("spectrogram_id")["patient_id"].nunique().max())
print("Max number of patient_id by eeg_id:", train.groupby("eeg_id")["patient_id"].nunique().max())

In [None]:
spec_id = train.groupby("spectrogram_id")["eeg_id"].nunique().idxmax()
train.query("spectrogram_id == @spec_id")

In [None]:
train["expert_consensus"].value_counts().plot(kind='bar')
plt.show()

In [None]:
vote_cols = [x for x in train.columns if "vote" in x]
train[vote_cols].sum(axis=1).value_counts().sort_index()

# train_eegs
- There are more EEG recordings in the train folder than in train.csv.
    - See [discussion](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/467058).
- Sampling frequency: 200 Hz
- Columns are the electrodes of the 10-20 system plus an EKG signal.
- Not sure what the best way to look at the data is. In the example PDF, the signals appear to be plotted as differences between channels.
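
If the signals are indeed meant to be viewed as differences, they could be computed per electrode pair as in the sketch below. It runs on synthetic data, and the pairs are hypothetical; the pairs used in the example PDF may differ.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for one EEG file: 2 seconds at 200 Hz, four electrodes.
toy_eeg = pd.DataFrame(rng.normal(size=(400, 4)), columns=["Fp1", "F7", "T3", "T5"])

# Hypothetical electrode pairs forming one chain; not taken from the data description.
pairs = [("Fp1", "F7"), ("F7", "T3"), ("T3", "T5")]

# One difference signal per pair, named "<first>-<second>".
diffs = pd.DataFrame({f"{a}-{b}": toy_eeg[a] - toy_eeg[b] for a, b in pairs})
print(diffs.head())
```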

In [None]:
sample_rate = 200

In [None]:
eeg_dir = base_dir / "train_eegs"
eeg_dir

In [None]:
eeg_ids_folder = set(x.stem for x in eeg_dir.glob("*.parquet"))
eeg_ids_train = set(train["eeg_id"].unique())
print("eeg_id_train == eeg_id_folder:", eeg_ids_train == eeg_ids_folder)
print("all train eeg_id present:", eeg_ids_train.issubset(eeg_ids_folder))
extra_eegs = sorted(eeg_ids_folder.difference(eeg_ids_train))
print("number of extra eeg_id in folder:", len(extra_eegs))
print("extra eeg_id in folder:", extra_eegs)

In [None]:
rec = train.iloc[1]
rec

In [None]:
path_eeg = eeg_dir / f"{rec.eeg_id}.parquet"
eeg = pd.read_parquet(path_eeg)
eeg["time"] = eeg.index / sample_rate
eeg

In [None]:
eeg.columns

In [None]:
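# 50 second window starting at the label offset
i_start = int(rec.eeg_label_offset_seconds) * sample_rate
i_stop = i_start + 50 * sample_rate
# iloc excludes i_stop, so the window has exactly 50 * sample_rate rows
window = eeg.iloc[i_start:i_stop].set_index("time")
fig, axs = plt.subplots(nrows=window.shape[1], figsize=(16, 10), tight_layout=True, sharex=True, gridspec_kw={"hspace": 0})
window.plot(subplots=True, ax=axs)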
for ax in axs.flat:
    ax.legend(loc="upper right")
plt.show()
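
Since the experts labelled the central 10 seconds of the 50 second window, it can help to cut out that part explicitly. The sketch below does this on a synthetic recording. The offset value is made up, and I assume the labelled part sits exactly in the middle of the window.

```python
import numpy as np
import pandas as pd

sample_rate = 200      # Hz, as stated above
window_seconds = 50    # length of the EEG subsample
label_seconds = 10     # length of the labelled central part

# Synthetic recording of 80 seconds with a single channel.
toy_eeg = pd.DataFrame({"Fp1": np.arange(80 * sample_rate, dtype=float)})
offset_seconds = 6     # made-up label offset

i_start = offset_seconds * sample_rate
i_stop = i_start + window_seconds * sample_rate
window = toy_eeg.iloc[i_start:i_stop]   # iloc excludes the stop index

# Samples before the labelled part, assuming it is centred in the window.
margin = (window_seconds - label_seconds) // 2 * sample_rate
center = window.iloc[margin : margin + label_seconds * sample_rate]
print(len(window), len(center))
```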

# train_spectrograms
- All spectrograms referenced in train.csv are present, and the folder contains no extras.
- Column names indicate position/region and frequency.
- Unsure about the scale of the spectrogram values. Might require a separate analysis.
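
As a starting point for the scale question, one could compare the raw value distribution with a log-scaled version. The sketch below uses synthetic heavy-tailed values. The NaN filling, the floor `eps`, and the log transform are assumptions on my side, not something the data description prescribes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for spectrogram values: non-negative and heavy-tailed.
toy_spec = pd.DataFrame(rng.lognormal(mean=0.0, sigma=2.0, size=(300, 5)))
toy_spec.iloc[0, 0] = np.nan   # one gap, in case real files contain missing values

# Raw scale: a few large values dominate.
print(pd.Series(toy_spec.to_numpy().ravel()).describe())

# Assumed preprocessing: fill gaps, clip to a small positive floor, take the log.
eps = 1e-4                     # hypothetical floor
log_spec = np.log(toy_spec.fillna(0).clip(lower=eps))
print(pd.Series(log_spec.to_numpy().ravel()).describe())
```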

In [None]:
spec_dir = base_dir / "train_spectrograms"
spec_dir

In [None]:
spec_ids_folder = set(x.stem for x in spec_dir.glob("*.parquet"))
spec_ids_train = set(train["spectrogram_id"].unique())
print("spec_ids_train == spec_ids_folder:", spec_ids_train == spec_ids_folder)

In [None]:
rec

In [None]:
path_spec = spec_dir / f"{rec.spectrogram_id}.parquet"
spec = pd.read_parquet(path_spec)
spec = spec.set_index("time")
spec

In [None]:
# columns: region (str) x freq (float)
columns = spec.columns.str.split("_", expand=True)
columns = pd.MultiIndex.from_tuples(
    [(x[0], float(x[1])) for x in columns], names=["region", "freq"]
)
spec.columns = columns
spec = spec.T
spec

In [None]:
regions = list(spec.index.get_level_values(0).unique())
regions

In [None]:
pd.Series(spec.loc["LL"].values.ravel()).describe()

In [None]:
regions

In [None]:
fig, axs = plt.subplots(nrows=2, ncols=2, sharex="all", sharey="all", tight_layout=True, figsize=(16, 10))
for region, ax in zip(regions, axs.flat):
    df = spec.loc[region]
    times = df.columns
    freqs = df.index
    ax.pcolormesh(times, freqs, df.values, cmap="viridis")
    ax.set_title(region)
axs[0,0].set_ylabel("freq")
axs[1,0].set_ylabel("freq")
axs[1,0].set_xlabel("time")
axs[1,1].set_xlabel("time")
plt.show()
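
To look only at the 10 minute part of a spectrogram that belongs to one row of train.csv, the time index can be filtered by an offset. This is a sketch on a synthetic spectrogram with one row every 2 seconds. The column names and the offset value are made up, and I assume the window starts at the offset.

```python
import numpy as np
import pandas as pd

# Synthetic spectrogram: one row every 2 seconds over 20 minutes.
times = np.arange(1, 1200, 2)
toy_spec = pd.DataFrame(
    {"LL_0.59": np.random.default_rng(0).random(len(times)), "LL_0.78": 0.0},
    index=pd.Index(times, name="time"),
)

offset_seconds = 120   # made-up offset, standing in for a per-row offset from train.csv
window_seconds = 600   # 10 minutes

# Keep rows with offset <= time < offset + 10 minutes.
in_window = (toy_spec.index >= offset_seconds) & (
    toy_spec.index < offset_seconds + window_seconds
)
spec_window = toy_spec[in_window]
print(len(spec_window), spec_window.index.min(), spec_window.index.max())
```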

# test data

## test.csv
- Only the ids of EEG, spectrogram and patient are given. No further information.

## test_eegs, test_spectrograms
- Single eeg is exactly 50 seconds long.
- Single spectrogram is 10 minutes long.

In [None]:
path_test = base_dir / "test.csv"
test = pd.read_csv(path_test, dtype={"eeg_id": "str", "spectrogram_id": "str"})
test

In [None]:
test_eeg_dir = base_dir / "test_eegs"
os.listdir(test_eeg_dir)

In [None]:
path_eeg = test_eeg_dir / f"{test.iloc[0].eeg_id}.parquet"
eeg = pd.read_parquet(path_eeg)
eeg

In [None]:
print("EEG is 50 seconds:", len(eeg) == 50 * sample_rate)

In [None]:
test_spec_dir = base_dir / "test_spectrograms"
os.listdir(test_spec_dir)

In [None]:
path_spec = test_spec_dir / f"{test.iloc[0].spectrogram_id}.parquet"
spec = pd.read_parquet(path_spec)
spec = spec.set_index("time")
spec

In [None]:
# 10 minutes at 30 rows per minute (one row every 2 seconds)
print("Spectrogram is 10 minutes:", len(spec) == 10 * 30)

# sample_submission.csv
- Predictions are keyed by eeg_id.
- One probability per class.
    - **Can we exploit that we know that 3-20 annotators have labelled?**

In [None]:
path_submission = base_dir / "sample_submission.csv"
submission = pd.read_csv(path_submission)
submission
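
A simple baseline in this format assigns the same probability to every class. The sketch below builds such a frame for made-up ids. The class column names are assumed to mirror the vote columns in train.csv and should be checked against the actual sample_submission.csv.

```python
import pandas as pd

# Assumed class columns, mirroring the vote columns in train.csv.
class_cols = [
    "seizure_vote", "lpd_vote", "gpd_vote",
    "lrda_vote", "grda_vote", "other_vote",
]

toy_test = pd.DataFrame({"eeg_id": ["111", "222"]})   # made-up ids

toy_submission = toy_test[["eeg_id"]].copy()
for col in class_cols:
    toy_submission[col] = 1.0 / len(class_cols)        # uniform probability per class

print(toy_submission)
```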