<p style="font-size: 24px; font-weight: bold;">Hello there!</p>

<p style="font-size: 16px;">This notebook introduces a very simple way to create a submission file for the <b>"HMS - Harmful Brain Activity Classification"</b> competition.</p>

<p style="font-size: 16px;">For classification, we use only the data from the Cz electrode of each EEG recording as features.</p>

<p style="font-size: 16px;">In other words, the raw electrode samples themselves serve as the features, so feature engineering that accounts for the frequency characteristics of the signal is still needed.</p>

<p style="font-size: 16px;">The purpose of sharing this notebook is to provide a step-by-step guide to creating a submission with code that is as simple as possible, even if the implementation is rough.</p>

<p style="font-size: 16px;">I hope that releasing this notebook adds, even a little, to the excitement of the competition.</p>

<p style="font-size: 16px;">Let's enjoy Kaggle together!</p>

<p style="font-size: 16px;">This notebook executes in approximately 30 seconds.</p>

<h1>Import Modules</h1>

In [None]:
import os
import tqdm
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Global

In [None]:
# parent directory
PDIR = '/kaggle/input/hms-harmful-brain-activity-classification'

# Prepare train data

## Load CSV metadata

In [None]:
# Reading the CSV file 'train.csv' located in the directory specified by PDIR
df = pd.read_csv(os.path.join(PDIR, 'train.csv'))

# Displaying the first few rows of the DataFrame
display(df.head())

## Load EEG data

In [None]:
# Setting the sampling frequency and duration for EEG data collection
sampling_frequency = 200  # Sampling frequency in Hz
data_collection_duration = 50  # Duration of EEG data collection in seconds
total_samples = sampling_frequency * data_collection_duration  # Total number of samples in the duration

# Setting the number of training data points
num_train_data_points = 500  

# Collecting one single-row DataFrame per training example (concatenated once after the loop)
training_rows = []

# Iterating over each training data point
for i in tqdm.tqdm(range(num_train_data_points)):
    # Loading EEG data for a specified eeg_id
    eeg_id = df.loc[i, 'eeg_id']
    eeg_data = pd.read_parquet(os.path.join(PDIR, 'train_eegs', f'{eeg_id}.parquet'))
    
    # Extracting EEG data from the Cz electrode for 50 seconds
    label_offset_time = df.loc[i, 'eeg_label_offset_seconds']  # Offset time for the EEG label
    label_offset_index = int(sampling_frequency * label_offset_time)  # Calculating offset index
    cz_electrode_data = eeg_data['Cz'].iloc[label_offset_index:label_offset_index + total_samples]  # Extracting data for Cz electrode
    
    # Storing the extracted data as one row
    training_rows.append(cz_electrode_data.reset_index(drop=True).to_frame().transpose())

# Building the training DataFrame with one row per example
training_data_df = pd.concat(training_rows, axis=0)
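
The offset arithmetic in the loop above can be checked on its own, without reading any parquet files. Here is a minimal sketch, assuming a hypothetical helper `extract_window` (not part of this notebook) and a synthetic 60-second signal in place of a real recording:

```python
import numpy as np
import pandas as pd


def extract_window(signal: pd.Series, offset_seconds: float,
                   sampling_frequency: int = 200,
                   duration_seconds: int = 50) -> np.ndarray:
    """Return the labeled window of one channel as a 1-D array.

    Hypothetical helper for illustration; it mirrors the slicing in the loop.
    """
    start = int(sampling_frequency * offset_seconds)
    stop = start + sampling_frequency * duration_seconds
    return signal.iloc[start:stop].to_numpy()


# Synthetic 60-second recording standing in for a real parquet file.
fake_cz = pd.Series(np.arange(200 * 60, dtype=float), name="Cz")

# An offset of 6 seconds should start the window at sample 1200.
window = extract_window(fake_cz, offset_seconds=6.0)
print(window.shape, window[0], window[-1])
```

Each window has 200 × 50 = 10,000 samples, so every row of the training DataFrame ends up with 10,000 feature columns.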

# Prepare features (X_train) and target variable (y_train)

In [None]:
# Adding diagnosis results
training_data_df['expert_consensus'] = df[:num_train_data_points]['expert_consensus'].values

# Removing rows with missing values
training_data_df = training_data_df.dropna()
training_data_df = training_data_df.reset_index(drop=True)

# Separating data into features and target
y_train = training_data_df['expert_consensus']
X_train = training_data_df.drop('expert_consensus', axis=1)

# Displaying the first few rows of the feature dataset
display(X_train.head())
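
As mentioned in the introduction, the raw samples are used as features here, and frequency-aware feature engineering is left as the next step. One possible direction, shown only as a rough sketch, is to summarize each 50-second window by its mean spectral power in a few bands. The helper `band_powers` and the band edges are assumptions chosen for illustration; they are not used anywhere in this notebook.

```python
import numpy as np


def band_powers(window, sampling_frequency=200):
    """Mean spectral power of a 1-D signal in a few frequency bands."""
    # Band edges in Hz: illustrative values, not tuned for this competition.
    bands = {"delta": (0.5, 4), "theta": (4, 8), "alpha": (8, 13), "beta": (13, 30)}
    window = np.asarray(window, dtype=float)
    # Remove the mean so the DC offset does not dominate the spectrum.
    spectrum = np.abs(np.fft.rfft(window - window.mean())) ** 2
    freqs = np.fft.rfftfreq(window.size, d=1.0 / sampling_frequency)
    return {name: spectrum[(freqs >= lo) & (freqs < hi)].mean()
            for name, (lo, hi) in bands.items()}


# Synthetic check: a pure 10 Hz sine should put most of its power in the 8-13 Hz band.
t = np.arange(200 * 50) / 200
features = band_powers(np.sin(2 * np.pi * 10 * t))
print(max(features, key=features.get))
```

Applying such a function to every row would replace 10,000 raw columns with a handful of summary features, at the cost of discarding timing information within the window.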

# Train RandomForestClassifier()

In [None]:
# Initializing a RandomForestClassifier with a random state of 0
forest = RandomForestClassifier(random_state=0)

# Fitting the classifier to the training data
forest.fit(X_train, y_train)

# Prepare test data

## Load CSV metadata

In [None]:
# Reading the CSV file 'test.csv' located in the directory specified by PDIR
df_test = pd.read_csv(os.path.join(PDIR, 'test.csv'))

# Displaying the first few rows of the DataFrame
display(df_test.head())

## Prepare features (X_test)

In [None]:
# Collecting one single-row DataFrame per test example (concatenated once after the loop)
test_rows = []

# Iterating over each test data point
for i in tqdm.tqdm(range(len(df_test))):
    # Loading EEG data for a specified eeg_id
    eeg_id_ = df_test.loc[i, 'eeg_id']
    tmp = pd.read_parquet(os.path.join(PDIR, 'test_eegs', f'{eeg_id_}.parquet'))
    
    # Extracting EEG data from the Cz electrode
    cz_electrode_data = tmp['Cz']
    
    # Storing the extracted data as one row
    test_rows.append(cz_electrode_data.reset_index(drop=True).to_frame().transpose())

# Building the testing DataFrame with one row per example
X_test = pd.concat(test_rows, axis=0)
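
Note that training rows with missing values were dropped, but test rows cannot be dropped because every `eeg_id` needs a prediction. If the test recordings contain missing samples, one simple option is to fill them before calling the model. The sketch below uses a small synthetic DataFrame rather than the real `X_test`, and the fill strategy (carry neighbouring samples along the time axis, then fall back to 0) is only an assumption, not something this notebook does.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_test: 3 recordings, 5 samples each, with gaps.
X_demo = pd.DataFrame([
    [1.0, np.nan, 3.0, 4.0, 5.0],
    [np.nan, np.nan, np.nan, np.nan, np.nan],
    [2.0, 2.0, np.nan, np.nan, 6.0],
])

# Forward-fill then back-fill along each row (the time axis);
# rows that are entirely missing fall back to 0.
X_filled = X_demo.ffill(axis=1).bfill(axis=1).fillna(0.0)
print(X_filled)
```

Whether filling is needed at all depends on the scikit-learn version and on the data, so treat this as a safeguard to consider rather than a required step.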

# Predict and submit

In [None]:
# Calculate predictions using the trained RandomForestClassifier model
predictions = forest.predict_proba(X_test)

# Read the sample submission file
submission = pd.read_csv(f'{PDIR}/sample_submission.csv')

# Iterate over each test data point
for i in tqdm.tqdm(range(len(df_test))):
    # Set the 'eeg_id' in the submission DataFrame
    submission.loc[i, 'eeg_id'] = df_test.loc[i, 'eeg_id']
    
    # Set the probability for each class in the submission DataFrame
    for j, cls_name in enumerate(forest.classes_):
        submission.loc[i, f'{cls_name.lower()}_vote'] = predictions[i, j]
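
The loop above relies on `forest.classes_` to match each probability column to a `*_vote` column. Because the model is trained on only 500 rows, it is worth keeping in mind that a class absent from the training subset has no column in `predict_proba`. The sketch below shows one way to build the same table in a single step and fill such classes with 0. The helper `build_submission` is hypothetical, and the `VOTE_COLUMNS` list is an assumption about the sample submission layout, so check it against `sample_submission.csv`.

```python
import numpy as np
import pandas as pd

# Assumed vote columns of the sample submission; verify against the real file.
VOTE_COLUMNS = ["seizure_vote", "lpd_vote", "gpd_vote",
                "lrda_vote", "grda_vote", "other_vote"]


def build_submission(eeg_ids, probabilities, classes):
    """Map predict_proba output to vote columns, filling unseen classes with 0."""
    proba = pd.DataFrame(probabilities,
                         columns=[f"{c.lower()}_vote" for c in classes])
    # Reorder to the submission layout; classes the model never saw get 0.
    proba = proba.reindex(columns=VOTE_COLUMNS, fill_value=0.0)
    proba.insert(0, "eeg_id", list(eeg_ids))
    return proba


# Synthetic example: the model only ever saw three of the six classes.
demo = build_submission(
    eeg_ids=[101, 102],
    probabilities=np.array([[0.2, 0.5, 0.3], [0.1, 0.1, 0.8]]),
    classes=["GPD", "Other", "Seizure"],
)
print(demo)
```

With a fitted model, the call would look like `build_submission(df_test['eeg_id'], predictions, forest.classes_)`.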

In [None]:
# Display the submission DataFrame
display(submission)

In [None]:
# Saving the submission DataFrame to a CSV file without including the index
submission.to_csv('submission.csv', index=False)

# Congratulations!

You're now ready to submit your work on Kaggle!

Enjoy your experience on Kaggle!