# Examining the Training File

This notebook aims to give an introduction to the metadata in the **train.csv** file for this competition. I will focus the analysis on the patients, the brain activity classes, the EEG samples, and the spectrogram samples. Later on, I will focus on the features within the EEG and spectrogram files.

* [Custom Functions](#custom_functions)
* [Read Data](#read_file)
* [EDA](#eda)
    * [Patient Analysis](#patient_analysis)
    * [Brain Activity Classes](#classes)
    
### Summary 

**Patient Info**:
* There are 1950 patients in the training dataset.
* Most patients have *fewer than 5 EEG samples* in the training dataset, with 2 EEG samples being the most common.
* A few patients have *more than 100* different EEG samples in the training set.
* Similarly, most patients have *fewer than 5 spectrogram samples* in the training dataset, with *2 spectrogram samples being the most common*.
* A few patients have *more than 50* different spectrogram samples in the training set.
* Based on expert consensus, most patients only have *1 or 2 unique brain activity classes* in the training sample.


**Brain Activity Classes**:
* Expert consensus is roughly evenly split among the classes.
* Almost half of the records have votes for only 1 class.
    
More to come...

In [None]:
# Analysis packages
import pandas as pd 

# Visualization packages
import seaborn as sns
import matplotlib.pyplot as plt

# Custom Functions
<a id="custom_functions" ></a>


In [None]:
primary_color = '#FF595E'
secondary_color = '#1982C4'
third_color = '#6A4C93'
sns.set_style(
    "whitegrid",
    {
        "axes.facecolor": "#FFFDD0",
        "figure.facecolor": "#FFFDD0",
        "patch.facecolor":primary_color,
        "patch.edgecolor":'black',
        "axes.edgecolor":'black',
        "grid.color":'black'
    })

In [None]:
def custom_barplot(data, x, y, xlabel, ylabel, title, ax, vertical=True):
    """Draw a bar plot and label each bar with its share of the total."""
    ax = sns.barplot(
        data = data,
        x = x,
        y = y,
        color=primary_color,
        ax=ax
    )
    
    # Add percentage labels above the bars
    bars = ax.containers[0]
    if vertical:
        total = sum(data[y])
        ax.bar_label(
            bars, 
            labels = [f'{(bar.get_height()/total):.1%}' for bar in bars]
        )
    else:
        total = sum(data[x])
        ax.bar_label(
            bars, 
            labels = [f'{(bar.get_width()/total):.1%}' for bar in bars]
        )
        

    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
        
    return ax

def custom_boxenplot(data, x, xlabel, title, ax):
    """Draw a horizontal boxen plot of column `x` on the given axes."""
    ax = sns.boxenplot(
        data = data,
        x = x,
        ax=ax,
        color=primary_color
    )
        

    ax.set_xlabel(xlabel)
    ax.set_title(title)
        
    return ax
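
The percentage labels in `custom_barplot` are each bar's value divided by the sum of all values. As a minimal sketch of that formatting step on its own, with made-up counts (the `counts` list is hypothetical, not taken from the data):

```python
# Toy counts standing in for bar heights (hypothetical values).
counts = [2, 1, 1]
total = sum(counts)

# Same formatting as the bar labels: share of the total, one decimal place.
labels = [f"{count / total:.1%}" for count in counts]
print(labels)
```

The `:.1%` format spec multiplies by 100 and appends the percent sign, so no manual scaling is needed.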

# Read Data
<a id="read_file" ></a>

In [None]:
train_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/train.csv")
train_df

# EDA
<a id="eda" ></a>

## Patient Analysis
<a id="patient_analysis" ></a>

Let's look at the patients in the dataset and some of their characteristics.

In [None]:
print(f'There are {train_df.patient_id.nunique()} patients in the training dataset.')

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(16,6))

# Left plot
ax = custom_boxenplot(
    data= (train_df
         .groupby('patient_id')
         .eeg_id
         .nunique()
         .reset_index()
         .set_axis(['patient_id', 'num_eegs'], axis=1)
        ),
    x = 'num_eegs',
    ax=axs[0],
    xlabel = 'Number of EEGs per patient',
    title = 'There are a few patients with more than 150 EEG samples'
)
ax.grid(axis='x', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)

# Right plot
ax = custom_barplot(
    data = (train_df
             .groupby('patient_id')
             .eeg_id
             .nunique()
             .reset_index()
             .set_axis(['patient_id', 'num_eegs'], axis=1)
             .assign(
                 label = lambda x: x.num_eegs.apply(lambda x: '> 10' if x > 10 else x)
             )
            .groupby('label')
            .num_eegs
            .count()
            .reset_index()
        ),
    x = 'label',
    y = 'num_eegs',
    xlabel = 'Number of EEGs per patient',
    ylabel = 'Number of patients',
    title = 'Most patients have fewer than 5 EEG samples in the training set',
    ax = axs[1]
)
ax.grid(axis='y', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)

plt.show()
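
Both panels above start from the same per-patient count of unique EEGs. A possible way to compute it once and bucket it is sketched below on a toy frame. The `toy_df` values and the names `eegs_per_patient`, `order`, and `patients_per_bucket` are assumptions for illustration, not part of the competition data or the cell above.

```python
import pandas as pd

# Toy stand-in for train.csv (hypothetical values, not real competition data).
toy_df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3] + [4] * 12,
    "eeg_id": [10, 10, 11, 20, 21, 30] + list(range(100, 112)),
})

# Number of distinct EEGs per patient, computed once.
eegs_per_patient = toy_df.groupby("patient_id")["eeg_id"].nunique()

# Keep counts up to 10 as strings and collapse everything larger into "> 10".
labels = eegs_per_patient.astype(str).where(eegs_per_patient <= 10, "> 10")

# Reindexing against an explicit order keeps the buckets in numeric order,
# with "> 10" last, and fills buckets that have no patients with 0.
order = [str(i) for i in range(1, 11)] + ["> 10"]
patients_per_bucket = (
    labels.value_counts()
    .reindex(order, fill_value=0)
    .rename_axis("label")
    .reset_index(name="num_patients")
)
print(patients_per_bucket)
```

Using string labels throughout avoids mixing integers and the `'> 10'` string in one column, which makes the bar order explicit rather than dependent on how mixed types happen to sort.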

In [None]:
fig, axs = plt.subplots(ncols=2, figsize=(16,6))

# Left plot
ax = custom_boxenplot(
    data= (train_df
     .groupby('patient_id')
     .spectrogram_id
     .nunique()
     .reset_index()
     .set_axis(['patient_id', 'num_spectrograms'], axis=1)
    ),
    x='num_spectrograms',
    xlabel = 'Number of spectrograms per patient',
    title = 'There are a few patients with more than 50 spectrogram samples',
    ax=axs[0]
)
ax.grid(axis='x', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)

# Right plot
ax = custom_barplot(
    data = (train_df
             .groupby('patient_id')
             .spectrogram_id
             .nunique()
             .reset_index()
             .set_axis(['patient_id', 'num_spectrograms'], axis=1)
             .assign(
                 label = lambda x: x.num_spectrograms.apply(lambda x: '> 10' if x > 10 else x)
             )
            .groupby('label')
            .num_spectrograms
            .count()
            .reset_index()
        ),
    x = 'label',
    y = 'num_spectrograms',
    xlabel = 'Number of spectrograms per patient',
    ylabel = 'Number of patients',
    title = 'Most patients have fewer than 5 spectrogram samples in the training set',
    ax = axs[1]
)
ax.grid(axis='y', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)

plt.show()

### Brain Activity for Patients


In [None]:
fig, ax = plt.subplots()
ax = custom_barplot(
    data = (train_df
         .groupby('patient_id')
         .expert_consensus
         .nunique()
         .reset_index()
         .set_axis(['patient_id', 'unique_classes'], axis=1)
         .groupby('unique_classes', as_index=False)
         .patient_id
         .count()
        ),
    x = 'unique_classes',
    y = 'patient_id',
    xlabel = 'Number of unique brain activity classes per patient',
    ylabel = 'Number of patients',
    title = 'Most patients only have 1 or 2 unique brain activity classes',
    ax=ax
)
ax.grid(axis='y', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)
plt.show()

## Brain Activity Classes
<a id="classes" ></a>

In [None]:
fig, ax = plt.subplots()
custom_barplot(
    data=train_df.expert_consensus.value_counts().reset_index(),
    x='count',
    y='expert_consensus',
    xlabel = 'Number of subsamples',
    ylabel = 'Expert consensus',
    title = 'Expert consensus is roughly evenly split among subsamples',
    vertical=False,
    ax=ax
)
ax.set_xlim(0,23000)
ax.grid(axis='x', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)
plt.show()
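
The cell above relies on `value_counts().reset_index()` producing a column called `count`, which is the naming used from pandas 2.0 onward; older versions name the columns `index` and `expert_consensus` instead. A sketch that names both columns explicitly, so it should not depend on the pandas version, is shown below. The class names in `consensus` are placeholders, not the real labels.

```python
import pandas as pd

# Toy labels (hypothetical values) standing in for train_df.expert_consensus.
consensus = pd.Series(
    ["class_a", "class_b", "class_a", "class_c"], name="expert_consensus"
)

class_counts = (
    consensus.value_counts()
    .rename_axis("expert_consensus")  # name the index (the class labels) explicitly
    .reset_index(name="count")        # name the values column explicitly
)
print(class_counts.columns.tolist())
```

The resulting frame can be passed to `custom_barplot` with `x='count'` and `y='expert_consensus'` exactly as in the cell above.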

In [None]:
fig, ax = plt.subplots()

custom_barplot(
    data = (pd.concat(
                [train_df['label_id'],
                 train_df.filter(like='_vote')
                ], axis=1)
             .melt(id_vars='label_id')
             .query("value > 0")
             .groupby('label_id')
             .variable
             .nunique()
             .reset_index()
             .groupby('variable', as_index=False)
             .count()
             .set_axis(['unique_votes', 'count'], axis=1)
            ),
    x='unique_votes',
    y='count',
    xlabel='Unique brain activity classes voted for',
    ylabel='Subsamples',
    title='Around 47.8% of the labels are fully agreed upon by experts',
    ax = ax
)
ax.grid(axis='y', color='black', linestyle='--', alpha=0.8)
ax.set_axisbelow(True)
plt.show()
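
The melt-and-group pipeline above counts, for each label, how many classes received at least one vote. Assuming each `label_id` appears on exactly one row, the same quantity can be sketched row-wise without reshaping. The toy frame and its `a_vote`, `b_vote`, `c_vote` columns are hypothetical stand-ins for the real `*_vote` columns.

```python
import pandas as pd

# Toy stand-in for the vote columns (hypothetical names and values).
toy_df = pd.DataFrame({
    "label_id": [0, 1, 2, 3],
    "a_vote": [3, 1, 0, 2],
    "b_vote": [0, 2, 0, 2],
    "c_vote": [0, 0, 5, 1],
})

votes = toy_df.filter(like="_vote")

# For each row, count how many classes received at least one vote.
classes_voted = (votes > 0).sum(axis=1)

# Share of rows where every vote went to a single class.
share_unanimous = (classes_voted == 1).mean()
print(f"{share_unanimous:.1%}")
```

One difference to keep in mind: a row with no votes at all would show up here with a count of 0, whereas the `query("value > 0")` step in the cell above drops it entirely.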