# EDA: Missing Data in HMS-HBA Competition Spectrograms

While processing EEG from a third-party source and creating spectrograms, I compared them with the competition spectrograms to make sure they were similar. During that process, I happened upon a competition spectrogram with significant missing data (`/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/9661509.parquet`). While I expected some missing data, this particular spectrogram was missing more than a third of the data for a complete 10-minute spectrogram (see below). I found this surprising and wanted to investigate further.

The results were surprising and might affect training with competition-provided spectrograms:

* 7.2% of `label_id`s have missing data in their 10-minute offset spectrogram window.
* 8.7% of `spectrogram_id`s have missing data in at least one of their offset windows.

See discussion [here](https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/478233).

In [None]:
import numpy as np 
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt

In [None]:
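spec = pd.read_parquet('/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/9661509.parquet')

# Fill missing values with 0 so the gaps clip to the minimum value below
spec = spec.fillna(0)
# First 300 rows (10 minutes at one row per 2 seconds) of the LL columns, transposed to (frequency, time)
sig = spec.iloc[:300, 1:101].T.values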

# Log spectrogram 
sig = tf.clip_by_value(sig, tf.math.exp(-4.0), tf.math.exp(8.0)) # avoid 0 in log
sig = tf.math.log(sig)

# Normalize spectrogram
sig -= tf.math.reduce_mean(sig)
sig /= tf.math.reduce_std(sig) + 1e-6

# Plot the spectrogram
times = spec.iloc[:300]['time']
frequencies = [float(c.split('_')[-1]) for c in spec.columns if c[:2] == 'LL']
img = sig.numpy() 
img -= img.min()
img /= img.max() + 1e-4
plt.figure(figsize=(10, 4))
plt.pcolormesh(times, frequencies, img, shading='gouraud')
plt.ylabel('Frequency [Hz]')
plt.xlabel('Time [sec]')
plt.title('Linear-frequency power spectrogram')
# plt.colorbar(format='%+2.0f dB')
plt.tight_layout()
plt.savefig('missing_data.png')
plt.show()

In [None]:
df = pd.read_csv('/kaggle/input/hms-harmful-brain-activity-classification/train.csv')
df['missing_data'] = 0.0

Load each spectrogram and, for every label's offset, calculate the fraction of missing values in the corresponding 10-minute window.

In [None]:
for spec_id, dff in df.groupby('spectrogram_id'):
    spec = pd.read_parquet(f'/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/{spec_id}.parquet')
    for idx, row in dff.iterrows():
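        # Convert the offset from seconds to rows (one spectrogram row per 2 seconds)
        offset = int(row['spectrogram_label_offset_seconds'] // 2)
        # Fraction of NaN values in the 300-row (10-minute) window, excluding the time column
        df.loc[idx, 'missing_data'] = spec.iloc[offset: 300 + offset, 1:].isna().mean().mean()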
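
To make the window calculation above easier to check, here is a minimal sketch on a synthetic spectrogram. The toy DataFrame, its column names, and the helper `window_missing_fraction` are hypothetical and only for illustration; the sketch assumes, as in the loop above, one row per 2 seconds, a 300-row window, and a first column holding the time.

```python
import numpy as np
import pandas as pd

# Hypothetical toy spectrogram: 400 rows (one per 2 seconds) and 4 value columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((400, 4)),
                   columns=['LL_0.59', 'LL_0.78', 'RL_0.59', 'RL_0.78'])
toy.insert(0, 'time', np.arange(1, 800, 2))
toy.iloc[100:200, 1:] = np.nan  # blank out rows 100-199 in every value column


def window_missing_fraction(spec, offset_seconds, n_rows=300, seconds_per_row=2):
    """Fraction of NaN values in the n_rows window starting at the given offset."""
    start = int(offset_seconds // seconds_per_row)
    window = spec.iloc[start:start + n_rows, 1:]  # skip the time column
    return float(window.isna().to_numpy().mean())


frac_0 = window_missing_fraction(toy, 0)      # rows 0-299: 100 of 300 rows missing
frac_300 = window_missing_fraction(toy, 300)  # rows 150-399: 50 of 250 rows missing
frac_400 = window_missing_fraction(toy, 400)  # rows 200-399: nothing missing
print(frac_0, frac_300, frac_400)
```

Note that when the window runs past the end of the spectrogram, `iloc` silently returns fewer than 300 rows, so the fraction is computed over the rows that exist rather than over a full 10 minutes.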

In [None]:
print(f"{(df['missing_data'] > 0).mean() * 100:.1f}% of label_ids have missing data.")
print(f"{(df.groupby('spectrogram_id')['missing_data'].max() > 0).mean() * 100:.1f}% of spectrogram_ids have missing data.")

In [None]:
# Keep only the labels whose window has missing data
df_missing = df.loc[df['missing_data'] > 0, 'missing_data']
df_missing.plot.hist(bins=20, title='Missing Data Distribution by `label_id`', xlabel='Portion of Missing Data');
plt.savefig('missing_data_label_id.png')

In [None]:
# Look at the maximum missing fraction per spectrogram (only spectrograms with missing data)
max_missing = df.loc[df['missing_data'] > 0].groupby('spectrogram_id')['missing_data'].max()
max_missing.plot.hist(bins=20, title='Max Missing Data Distribution by `spectrogram_id`', xlabel='Portion of Missing Data');
plt.savefig('missing_data_spec_id.png')

In [None]:
df.to_csv('train.csv', index=False)
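
One possible use of the saved `missing_data` column is to drop heavily affected labels before training. The sketch below is only an illustration under assumptions: the small `labels` frame stands in for the saved `train.csv`, and the `MAX_MISSING` threshold of 0.1 is a hypothetical value, not something derived from the analysis above.

```python
import pandas as pd

# Hypothetical stand-in for the saved train.csv with the added missing_data column.
labels = pd.DataFrame({
    'label_id': [1, 2, 3, 4],
    'spectrogram_id': [10, 10, 20, 30],
    'missing_data': [0.0, 0.05, 0.4, 0.0],
})

MAX_MISSING = 0.1  # hypothetical threshold; tune for your own pipeline

# Keep labels whose 10-minute window is missing at most MAX_MISSING of its values.
clean = labels[labels['missing_data'] <= MAX_MISSING].reset_index(drop=True)
print(f"Kept {len(clean)} of {len(labels)} labels")
```

Whether filtering, imputing, or leaving the gaps in place works best is an open question that would need to be checked against validation scores.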