In [None]:
import numpy as np
import pandas as pd
import time
import matplotlib.pyplot as plt
import seaborn as sns

# Introduction

This notebook gives a basic introduction to the data of the Harmful Brain Activity competition, describing how you get from the training data to making an actual submission.

In this competition we need to analyse EEGs and Spectrograms to predict the probabilities with which experts have classified each sample of brain activity.

# Training Data

The easiest way to understand this competition is by looking at the training data:

In [None]:
train_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/train.csv")
train_df.head()

Each row of the training CSV file contains a reference to an EEG file and to a Spectrogram file, along with a time offset into each of these files, marking the point that corresponds to the data for the current row.<br>

Additionally the last 6 columns (the "vote" columns) contain the target values that we're trying to predict. These represent the number of experts that have chosen each classification.<br>

So, for example, for the first row entry, 3 experts have analysed this data and all have decided that the data shows that the patient has suffered a seizure.<br>
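The relationship between vote counts and probabilities can be sketched as follows. This is a minimal illustration, not part of the competition code: `votes_to_probabilities` is a hypothetical helper, and the vote order (seizure, lpd, gpd, lrda, grda, other) is assumed to follow the order of the vote columns.

```python
import numpy as np

def votes_to_probabilities(votes):
    """Normalize a vector of expert vote counts so that it sums to 1."""
    votes = np.asarray(votes, dtype=float)
    total = votes.sum()
    if total == 0:
        raise ValueError("at least one vote is required")
    return votes / total

# Assumed vote order: seizure, lpd, gpd, lrda, grda, other
first_row_votes = [3, 0, 0, 0, 0, 0]  # three experts, all voting "seizure"
first_row_probs = votes_to_probabilities(first_row_votes)
print(first_row_probs)
```

For the first row, all of the probability mass ends up on the seizure class, because every expert agreed.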

## Training EEG

For the first row in the training data we can look at the matching EEG data - which in this case has <b>ID = 1628180742</b>

In [None]:
first_eeg_df = train_df[train_df['eeg_id']==1628180742]
first_eeg_df

If we look at all the rows of the training CSV that reference this EEG file, we can see that there are 9 predictions using this EEG, with each prediction being based on a 50-second period.<br>

So the first prediction will be from the start of the EEG (0.0 seconds) until the 50 second mark, the second will be from 6 seconds until 56 seconds and the last prediction is from 40 seconds until 90 seconds.


Here's the matching EEG data...

In [None]:
train_eeg = "/kaggle/input/hms-harmful-brain-activity-classification/train_eegs/1628180742.parquet"
train_eeg_df = pd.read_parquet(train_eeg)
print(train_eeg_df.shape)
train_eeg_df.head(3)

* The rows represent each EEG sample and there are 200 EEG samples taken per second, so the total number of seconds in this EEG data = (18000/200) = 90 seconds. This matches with where we expect the 50 second window of the last prediction to end.<br>

* Each column represents a different location on the skull, from where the EEG readings have been taken, except the last column 'EKG' which represents a heart reading. The various positions on the skull are as shown below...<br>
<br>
<br>
<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/21_electrodes_of_International_10-20_system_for_EEG.svg/1024px-21_electrodes_of_International_10-20_system_for_EEG.svg.png" alt="drawing" width="600"/>

<i>(Image by トマトン124 (talk) - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=10489987)</i>
</center>

We can look at the EEG reading for the first location 'Fp1'. In the plot below, the green vertical lines mark the start of each prediction period and the red lines mark the end of each period.

In [None]:
f = 'Fp1'
plt.figure(figsize=(12,3))
plt.plot(train_eeg_df[f])
plt.title(f)
plt.grid()
for i,row in first_eeg_df.iterrows():
    plt.axvline(x=(row['eeg_label_offset_seconds']*200),color='green')    
    plt.axvline(x=((row['eeg_label_offset_seconds']+50)*200),color='red')    
plt.show()

So, for the very first prediction in the training data, it represents the time period from 0 to 50 seconds, or from 0 to (50*200)= 10000 samples and is shown in the 'Fp1' plot below:

In [None]:
plt.figure(figsize=(12,3))
plt.plot(train_eeg_df[f][:10000])
plt.title(f)
plt.grid()
row = first_eeg_df.iloc[0]
plt.axvline(x=(row['eeg_label_offset_seconds']*200),color='green')    
plt.axvline(x=((row['eeg_label_offset_seconds']+50)*200),color='red')    
plt.show()
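The seconds-to-samples arithmetic used in the plots above can be written as a small helper. This is only a sketch: `eeg_window_samples` is a hypothetical function name, and it assumes the 200 samples per second and 50-second period described earlier.

```python
SAMPLE_RATE_HZ = 200   # EEG samples per second, as stated above
WINDOW_SECONDS = 50    # length of each prediction period

def eeg_window_samples(offset_seconds,
                       sample_rate=SAMPLE_RATE_HZ,
                       window_seconds=WINDOW_SECONDS):
    """Return the (start, end) sample indices of one prediction period."""
    start = int(round(offset_seconds * sample_rate))
    end = start + int(window_seconds * sample_rate)
    return start, end

# First, second and last offsets of the EEG discussed above
for offset in (0.0, 6.0, 40.0):
    start, end = eeg_window_samples(offset)
    print(f"offset {offset:>4} s -> samples {start} to {end}")
```

With these indices, something like `train_eeg_df.iloc[start:end]` would select the rows belonging to a single prediction period.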

## Training Spectrogram

We can examine the Spectrograms in exactly the same way as the EEGs.

In [None]:
first_spectrogram_df = train_df[train_df['spectrogram_id']==353733]
first_spectrogram_df

As with the EEGs, each Spectrogram can contain data relating to multiple predictions. So, as with the first EEG data, the first Spectrogram also contains data from 9 predictions.<br>

Opening the Spectrogram referenced on the first line of the training CSV, gives the following data:

In [None]:
train_spect = "/kaggle/input/hms-harmful-brain-activity-classification/train_spectrograms/353733.parquet"
train_spect_df = pd.read_parquet(train_spect).dropna()
print(train_spect_df.shape)
train_spect_df.head()

The columns of the Spectrogram data represent the location at which the reading was taken and the frequency in Hertz. The four possible recording locations are:

* LL = left lateral
* RL = right lateral
* LP = left parasagittal
* RP = right parasagittal

The image below illustrates these terms, showing three major planes annotated as follows:<br>
<br>
* a: Axial or horizontal plane (blue), that contains the lateral axis 
  and also the medial axis 
* c: Coronal plane (green), containing the axial axis and the lateral axis
* s: Sagittal plane (red), containing  the axial axis and the medial axis

<br>       
Also shown are:<br>

* e: The eye position showing the anterior end of the brain<br>
* p: an example of a parasagittal plane (yellow); 
  parasagittal planes comprise the class of planes 
  parallel to (and therefore lateral to) 
  the sagittal plane.

<br>
<br>
<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/f/f4/Human_brain_anatomical_planes_letter_annotations.jpg" width="600"/>       
    

<i>(Image by: https://commons.wikimedia.org/wiki/File:Human_brain_anatomical_planes_letter_annotations.jpg)</i>
</center>
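Since each Spectrogram column name combines a recording location with a frequency, the two parts can be separated with a small helper. This is a sketch under an assumption: the column names are taken to have the form `<region>_<frequency>`, and the example names below are hypothetical stand-ins rather than values read from the file.

```python
def split_column_name(column_name):
    """Split a name such as 'LL_0.59' into its region and frequency in Hz."""
    region, frequency = column_name.split("_", 1)
    return region, float(frequency)

# Hypothetical column names, one per recording location
example_columns = ["LL_0.59", "RL_1.95", "LP_10.16", "RP_19.92"]
parsed = [split_column_name(name) for name in example_columns]
for region, frequency in parsed:
    print(f"{region}: {frequency} Hz")
```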


For each of these 4 regions we can plot the Spectrogram to see how the frequencies change over time.

In [None]:
def plot_spectrogram(spectrogram_path):
    """ Code from: https://www.kaggle.com/code/clehmann10/plot-spectrograms
    """
    sample_spect = pd.read_parquet(spectrogram_path)
    
    split_spect = {
        "LL": sample_spect.filter(regex='^LL', axis=1),
        "RL": sample_spect.filter(regex='^RL', axis=1),
        "RP": sample_spect.filter(regex='^RP', axis=1),
        "LP": sample_spect.filter(regex='^LP', axis=1),
    }
    
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 12))
    axes = axes.flatten()
    label_interval = 5
    for i, split_name in enumerate(split_spect.keys()):
        ax = axes[i]
        img = ax.imshow(np.log(split_spect[split_name]).T, cmap='viridis', aspect='auto', origin='lower')  
        cbar = fig.colorbar(img, ax=ax)
        cbar.set_label('Log(Value)')
        ax.set_title(split_name)
        ax.set_ylabel("Frequency (Hz)")
        ax.set_xlabel("Time")
        
        # Label every `label_interval`-th frequency bin on the y-axis
        frequencies = [column_name[3:] for column_name in split_spect[split_name].columns]
        ax.set_yticks(np.arange(0, len(frequencies), label_interval))
        ax.set_yticklabels(frequencies[::label_interval])
        
    plt.tight_layout()
    plt.show()

In [None]:
# Note: the time axis should be multiplied by 2, to match with the time column in the Spectrogram data
plot_spectrogram(train_spect)

Each Spectrogram contains a 10-minute window that is centered at the same time as the associated EEG data. The 'time' column represents the time in seconds. So the first row of the train CSV file covers the Spectrogram period from 0 to (10*60) = 600 seconds, whereas the last row covers the period from 40 to 640 seconds. If we look at the last rows of this Spectrogram data we can see that the time matches this:

(Note - in the spectrograms shown above, their time axis should be multiplied by 2 to match with the time column of the spectrogram data)

In [None]:
train_spect_df.tail()
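Selecting the rows that belong to one 10-minute window can be sketched as below. This is an illustration only: `select_spectrogram_window` is a hypothetical helper, and the toy frame stands in for the real Spectrogram data, assuming one row every 2 seconds with a `time` column in seconds.

```python
import numpy as np
import pandas as pd

SPECTROGRAM_WINDOW_SECONDS = 10 * 60  # 10-minute window

def select_spectrogram_window(spect_df, offset_seconds,
                              window_seconds=SPECTROGRAM_WINDOW_SECONDS):
    """Return the rows whose 'time' lies in [offset, offset + window)."""
    start = offset_seconds
    end = offset_seconds + window_seconds
    mask = (spect_df["time"] >= start) & (spect_df["time"] < end)
    return spect_df.loc[mask]

# Toy stand-in for the real data: times 1, 3, 5, ..., 639 seconds (assumed)
toy_spect_df = pd.DataFrame({"time": np.arange(1, 640, 2)})

first_window = select_spectrogram_window(toy_spect_df, 0)   # 0 to 600 s
last_window = select_spectrogram_window(toy_spect_df, 40)   # 40 to 640 s
print(len(first_window), len(last_window))
```

Under these assumptions both windows contain the same number of rows, and the last window ends at the final row of the data.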

# Submission

As with the training data shown above, we want to predict how the experts categorize each sample in the test data. However, in the training data each row gives the number of experts that voted for each category, whereas in the submission we want to give the probability of each category.

This can be seen by looking at the sample submission file:

In [None]:
sub_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")
sub_df.head()

This sample row gives the ID of the associated EEG file to look at when making the prediction and the predicted probabilities for each category of diagnosis (the description for each of these categories can be found in the competition description: https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/overview)

So, in this case, each vote is set to 0.1667 = 1/6, so we're saying that each category is equally likely.

To make actual predictions we take the 'test.csv' file and, for each row, examine the specified EEG and Spectrogram files, before predicting the probability for each of the six categories.

In the given test file there's only a single row - but this expands to have multiple rows when you submit your notebook.

In [None]:
test_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/test.csv")
test_df.head()

If we imagine that there are a total of 60 experts making predictions and that their votes are split evenly over the 6 possible voting categories, then we'll have the following votes and probabilities:

In [None]:
seizure_vote = 10
lpd_vote = 10
gpd_vote = 10
lrda_vote = 10
grda_vote = 10
other_vote = 10

total_vote = seizure_vote + lpd_vote + gpd_vote + lrda_vote + grda_vote + other_vote
preds = [seizure_vote/total_vote, lpd_vote/total_vote, gpd_vote/total_vote, lrda_vote/total_vote, grda_vote/total_vote, other_vote/total_vote]

print(f"Total Number of Votes = {total_vote}")
print(f"Predictions = {preds}")

We can then use these predictions to generate the same entry for each row in the test data:

In [None]:
# generate the same predictions for every entry in the test data
df = pd.DataFrame(columns=sub_df.columns)
for i, row in test_df.iterrows():    
    df.loc[i] = [str(row['eeg_id'])] + preds
    
df 
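The same constant-prediction table can also be built without a row-by-row loop. The sketch below is one possible alternative: `constant_submission` is a hypothetical helper, the column names are assumed to match the sample submission, and the IDs are made-up values used only for illustration.

```python
import pandas as pd

# Assumed to match the columns of the sample submission
VOTE_COLUMNS = ["seizure_vote", "lpd_vote", "gpd_vote",
                "lrda_vote", "grda_vote", "other_vote"]

def constant_submission(eeg_ids, probabilities, vote_columns=VOTE_COLUMNS):
    """Build a submission frame that repeats one prediction for every ID."""
    submission = pd.DataFrame({"eeg_id": list(eeg_ids)})
    for column, probability in zip(vote_columns, probabilities):
        submission[column] = probability  # scalar is broadcast to all rows
    return submission

toy_ids = [111, 222, 333]     # hypothetical IDs, not real test data
uniform_preds = [1 / 6] * 6   # every category equally likely
toy_submission = constant_submission(toy_ids, uniform_preds)
print(toy_submission)
```

With the real data, `test_df['eeg_id']` would take the place of `toy_ids`.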

To get a slightly better score we can take the mean vote ratio from the training data and use these values for the predictions (see: https://www.kaggle.com/code/seshurajup/eda-train-csv):

seizure_vote    0.152810<br>
lpd_vote        0.142456<br>
gpd_vote        0.104062<br>
lrda_vote       0.065407<br>
grda_vote       0.114851<br>
other_vote      0.420414<br>
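One way such mean vote ratios could be computed is sketched below. This is not necessarily how the linked notebook arrived at its figures: `mean_vote_ratios` is a hypothetical helper, and the toy rows are invented so that the arithmetic is easy to follow.

```python
import pandas as pd

VOTE_COLUMNS = ["seizure_vote", "lpd_vote", "gpd_vote",
                "lrda_vote", "grda_vote", "other_vote"]

def mean_vote_ratios(df, vote_columns=VOTE_COLUMNS):
    """Turn each row's votes into ratios, then average the ratios per column."""
    votes = df[vote_columns].astype(float)
    ratios = votes.div(votes.sum(axis=1), axis=0)  # each row now sums to 1
    return ratios.mean()

# Invented rows for illustration only
toy_train_df = pd.DataFrame(
    [[3, 0, 0, 0, 0, 0],
     [0, 1, 0, 0, 0, 1],
     [0, 0, 0, 0, 0, 4]],
    columns=VOTE_COLUMNS,
)
toy_means = mean_vote_ratios(toy_train_df)
print(toy_means)
```

Because every row is normalized before averaging, the resulting means also sum to 1 and can be used directly as a probability vector.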

In [None]:
preds = [0.152810,0.142456,0.104062,0.065407,0.114851,0.420414]
df = pd.DataFrame(columns=sub_df.columns)
for i, row in test_df.iterrows():    
    df.loc[i] = [str(row['eeg_id'])] + preds
    
df 

In [None]:
# write the final submission file
df.to_csv('submission.csv', index=False)
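Since each row of the submission is meant to be a set of probabilities, it can be worth checking that the values are non-negative and sum to 1. The check below is a minimal sketch: `is_valid_distribution` is a hypothetical helper and the tolerance is an arbitrary choice.

```python
import numpy as np

def is_valid_distribution(probabilities, tol=1e-6):
    """Check that the values are non-negative and sum to 1 within `tol`."""
    p = np.asarray(probabilities, dtype=float)
    return bool(np.all(p >= 0) and abs(p.sum() - 1.0) <= tol)

# The mean vote ratios used for the baseline submission above
mean_ratio_preds = [0.152810, 0.142456, 0.104062, 0.065407, 0.114851, 0.420414]
print(is_valid_distribution(mean_ratio_preds))
```

Values rounded to four decimal places, such as six copies of 0.1667, sum to 1.0002 and would not pass this particular tolerance, so the tolerance should be chosen to suit the precision of the predictions.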

# Conclusion

We've looked at the basics of how to get from the training data to making an actual submission - although the submission is only a baseline. To get a real submission you'll need to take the EEG and Spectrogram data and do a proper prediction, but hopefully this will get you on your way.

If you've found this notebook useful please consider giving it a vote. Thanks!