# Explained: HMS Baseline & Resources 📚

This notebook is part of a series exploring the Harmful Brain Activity Classification task. The baseline model is built using ResNet34d, and resources from the [HMS Kaggle competition](https://www.kaggle.com/c/hms-harmful-brain-activity-classification) have been used for training and inference.

### Baseline Model
- The baseline model architecture is ResNet34d, a variant of ResNet.
- PyTorch is the primary deep learning library utilized for model training and evaluation.
- The model is trained using a configuration that includes seed, image transformation, and the number of folds for cross-validation.

### Key Resources
- Training data is handled using Pandas for CSV manipulation and Torchvision for image transformations.
- For inference, one saved model per fold is loaded from disk.
- Seed initialization helps make runs reproducible.
- The submission process involves loading the test data, preprocessing, model inference, and generating class probabilities.
- The final submission CSV contains one probability column per class.



## Import necessary libraries

In [None]:
# Importing necessary libraries
import pandas as pd  # 📊 Library for handling CSV files
import numpy as np  # 🧮 Library for matrix operations
import torch  # 🚀 Deep learning library - PyTorch
import torch.nn as nn  # 💡 Neural network module in PyTorch
import torch.nn.functional as F  # 🧠 Functional module for neural network operations
import torchvision.transforms as transforms  # 🖼️ Image processing library in PyTorch for data augmentation

# Utilities used later for seeding and for silencing warnings
import random  # 🎲 Library for generating random numbers
import warnings  # ⚠️ Library for handling warnings
warnings.filterwarnings('ignore')  # 🚫 Suppress all warnings during execution


## 🛠️ Configuration Settings for Model Training 🤖

**Explanation:**

1. `seed = 2024`: This parameter sets the random seed for reproducibility. Setting a seed ensures that the same sequence of random numbers is generated, making experiments reproducible. 🌱 [Random Seed - Wikipedia](https://en.wikipedia.org/wiki/Random_seed)

2. `image_transform = transforms.Resize((512, 512))`: This line defines an image transformation using the `transforms` module from Torchvision. The `Resize` transformation resizes its input to a height and width of (512, 512). Image transformations are commonly used for data preprocessing in deep learning pipelines. 🖼️ [Torchvision Transforms Documentation](https://pytorch.org/vision/stable/transforms.html)

3. `num_folds = 5`: This parameter represents the number of folds used in cross-validation. Cross-validation estimates how well a model generalizes: the dataset is split into folds, the model is trained on all folds but one and evaluated on the held-out fold, and this is repeated until every fold has served once as the validation set. In this notebook, `num_folds` also determines how many saved fold models are loaded for inference. 🔢 [Cross-Validation - Wikipedia](https://en.wikipedia.org/wiki/Cross-validation)

These configuration settings are essential for defining the experimental setup, ensuring reproducibility, and preparing data for model training. The seed is set for reproducibility, image transformation is specified for preprocessing, and the number of folds is set for cross-validation.
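
To make the fold idea concrete, here is a minimal sketch of how sample indices could be dealt into `num_folds` folds. The helper `make_folds` is hypothetical and not part of this notebook, and the original training run may well have used a different scheme (for example a grouped or stratified split).

```python
import random

def make_folds(num_samples, num_folds, seed):
    """Shuffle sample indices and deal them into num_folds disjoint folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    # Fold k receives every num_folds-th index starting at position k
    return [indices[k::num_folds] for k in range(num_folds)]

folds = make_folds(num_samples=10, num_folds=5, seed=2024)
for k, val_idx in enumerate(folds):
    # All indices outside the held-out fold form the training set
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    print(f"fold {k}: train on {len(train_idx)} samples, validate on {len(val_idx)}")
```

Each sample appears in exactly one validation fold, which is why five folds yield five separately trained models.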

In [None]:
from torchvision import transforms

class Config:
    seed = 2024
    image_transform = transforms.Resize((512, 512))
    num_folds = 5
    
    # Additional parameters
    num_classes = 6  # Number of output classes (seizure, lpd, gpd, lrda, grda, other)
    dropout_rate = 0.2  # Dropout rate for regularization in the model
    learning_rate_scheduler = "cosine"  # Learning rate scheduler type (e.g., cosine, step, etc.)
    warmup_epochs = 2  # Number of warm-up epochs for learning rate scheduler
    augmentation_prob = 0.5  # Probability of applying data augmentation during training
    logging_interval = 100  # Interval for logging training information
    batch_size = 24  # Training batch size
    num_epochs = 10  # Number of training epochs
    weight_decay = 1e-4  # Weight decay for regularization
    model_type = "resnet"  # Type of model architecture (e.g., resnet, vgg, etc.)
    optimizer = "adam"  # Optimizer type (e.g., adam, sgd, etc.)
    lr = 0.0008  # Initial learning rate
    momentum = 0.9  # Momentum for optimizer (if applicable)


## 🤖 Model Loading for Inference 🚀

**Explanation:**

1. `models = []`: This line initializes an empty list named `models` to store the loaded models. 🤖

2. `for i in range(Config.num_folds):`: This is a loop that iterates over the specified number of folds (given by `Config.num_folds`). It is used to load a model for each fold. 🔁

3. `model = torch.load(f'/kaggle/input/hms-baseline-resnet34d-512-512-training-5-folds/HMS_resnet_fold{i}.pth')`: This line loads a pre-trained PyTorch model for the ith fold from a specified file path. Because the result is used directly as a model, each file is expected to contain a full serialized model object rather than only a `state_dict`. 📂 [PyTorch Load and Save Model Documentation](https://pytorch.org/tutorials/beginner/saving_loading_models.html)

4. `models.append(model)`: After loading a model for a fold, the model is appended to the `models` list. This creates a list of models, one for each fold. 📋 [Python List Documentation](https://docs.python.org/3/tutorial/introduction.html#lists)

This cell loads the pre-trained model for each fold from the specified directory path and stores the models in a list (`models`). The loaded models can then be used for making predictions or further analysis during inference.
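
By default, `torch.load` restores tensors onto the device they were saved from, which can fail on a machine that lacks that device. The sketch below illustrates the `map_location` argument with a tiny stand-in model and a temporary file. The `nn.Linear` model and the file name are illustrative assumptions, not the competition checkpoints, and the sketch saves a `state_dict` rather than a whole model object, so it is a related technique and not the exact loading path used in this notebook.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Tiny stand-in model (assumption: the real checkpoints hold a ResNet34d)
model = nn.Linear(4, 6)

with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_path = os.path.join(tmp_dir, "toy_fold0.pth")  # hypothetical file name
    torch.save(model.state_dict(), ckpt_path)

    # map_location="cpu" remaps all stored tensors onto the CPU while loading
    state_dict = torch.load(ckpt_path, map_location="cpu")

# Rebuild the architecture, then copy the saved weights into it
restored = nn.Linear(4, 6)
restored.load_state_dict(state_dict)
restored.eval()  # switch to inference mode before predicting

x = torch.ones(1, 4)
with torch.no_grad():
    same_output = torch.allclose(model(x), restored(x))
print(same_output)
```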

In [None]:
# List to store loaded models
models = []

# Loop to load trained models for each fold
for i in range(Config.num_folds):
    # Loading a pre-trained model for the ith fold
    model = torch.load(f'/kaggle/input/hms-baseline-resnet34d-512-512-training-5-folds/HMS_resnet_fold{i}.pth')
    
    # Appending the loaded model to the list
    models.append(model)


## 🌱 Seed Initialization for Reproducibility 🌐

**Explanation:**

1. `torch.backends.cudnn.deterministic = True`: This line sets CuDNN (CUDA Deep Neural Network library) to deterministic mode, so that it selects deterministic convolution algorithms. This removes one source of run-to-run variation on the GPU. 🧠 [CuDNN Documentation](https://docs.nvidia.com/deeplearning/cudnn/index.html#deterministic-behavior)

2. `torch.backends.cudnn.benchmark`: This flag controls CuDNN's auto-tuner, which times several convolution algorithms and picks the fastest one for the current input shapes. Because the chosen algorithm can differ between runs, reproducibility requires disabling it by setting the flag to `False`; a value of `True` enables benchmarking. ⚙️ [CuDNN Documentation](https://docs.nvidia.com/deeplearning/cudnn/index.html#library-behavior)

3. `torch.manual_seed(seed)`: This line sets the manual seed for PyTorch operations, ensuring that random operations within PyTorch are also reproducible. 🌐 [PyTorch Random Seed Documentation](https://pytorch.org/docs/stable/notes/randomness.html)

4. `np.random.seed(seed)`: Setting the NumPy random seed ensures that operations involving NumPy arrays produce the same results across different runs. 🌐 [NumPy Random Seed Documentation](https://numpy.org/doc/stable/reference/random/generated/numpy.random.seed.html)

5. `random.seed(seed)`: Setting the built-in Python random seed ensures reproducibility for other Python-based random operations. 🌐 [Python Random Seed Documentation](https://docs.python.org/3/library/random.html#random.seed)

The purpose of this code snippet is to initialize seeds for various libraries to achieve reproducibility in the training process. Reproducible results are essential for debugging, understanding model behavior, and comparing different experimental runs.
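
As a quick illustration of what seeding buys you, the sketch below seeds Python's `random` module and NumPy, draws a few numbers, and repeats the draw. The helper `draw` is hypothetical and exists only for this demonstration; PyTorch is left out to keep the example small.

```python
import random

import numpy as np

def draw(seed):
    """Seed both generators, then draw a few numbers from each."""
    random.seed(seed)
    np.random.seed(seed)
    return [random.random() for _ in range(3)], np.random.rand(3)

py_a, np_a = draw(2024)
py_b, np_b = draw(2024)  # same seed, so the draws repeat exactly
py_c, np_c = draw(2025)  # a different seed starts a different sequence

print("same seed matches:", py_a == py_b and np.array_equal(np_a, np_b))
print("different seed matches:", py_a == py_c)
```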

In [None]:
# Function to set seeds for reproducibility
def seed_everything(seed):
    torch.backends.cudnn.deterministic = True  # 🧠 Set CuDNN to deterministic mode for GPU
    torch.backends.cudnn.benchmark = False  # ⚙️ Disable CuDNN benchmarking for consistent results
    torch.manual_seed(seed)  # 🌐 Set PyTorch manual seed for reproducibility
    np.random.seed(seed)  # 🌐 Set NumPy random seed for reproducibility
    random.seed(seed)  # 🌐 Set Python built-in random seed for reproducibility

# Calling the seed initialization function with the specified seed value
seed_everything(Config.seed)


## 📊 Loading Test Data and Preparing Submission DataFrame 🧾

**Explanation:**

1. `test_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/test.csv")`: This line reads the test data CSV file into a Pandas DataFrame (`test_df`). 📊 [Pandas read_csv Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

2. `submission = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")`: This line reads the sample submission CSV file into a Pandas DataFrame (`submission`). 📊 [Pandas read_csv Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

3. `submission = submission.merge(test_df, on='eeg_id', how='left')`: Merging the `submission` DataFrame with the `test_df` DataFrame based on the 'eeg_id' column. The resulting DataFrame contains information from both DataFrames. 🔄 [Pandas Merge, Join, and Concatenate Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

4. `submission['path'] = submission['spectrogram_id'].apply(lambda x: "/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms/" + str(x) + ".parquet")`: This line creates a new column 'path' in the `submission` DataFrame by applying a lambda function to the 'spectrogram_id' column. The lambda function constructs the file path for each spectrogram in the test set. 🧾 [Pandas apply Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

5. `submission.head()`: This line displays the first few rows of the `submission` DataFrame to verify the data loading and preprocessing steps. 🧐 [Pandas DataFrame head Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)

The code is responsible for loading the test dataset, preparing the submission DataFrame, and creating a new column 'path' that contains the file paths for the test spectrograms. This process is crucial for inputting test data into the trained models during inference.
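
The merge and path construction can be tried on toy data without the competition files. The two small DataFrames below are made-up stand-ins for `sample_submission.csv` and `test.csv`; only the column names `eeg_id` and `spectrogram_id` and the directory string are taken from this notebook.

```python
import pandas as pd

# Toy stand-ins for sample_submission.csv and test.csv (values are made up)
toy_submission = pd.DataFrame({"eeg_id": [11, 22], "seizure_vote": [0.0, 0.0]})
toy_test = pd.DataFrame({"eeg_id": [22, 11], "spectrogram_id": [900, 800]})

# how='left' keeps every submission row, in submission order
merged = toy_submission.merge(toy_test, on="eeg_id", how="left")

base_dir = "/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms/"
# Vectorized string concatenation, equivalent to the apply/lambda version
merged["path"] = base_dir + merged["spectrogram_id"].astype(str) + ".parquet"

print(merged[["eeg_id", "spectrogram_id", "path"]])
```

The vectorized concatenation produces the same strings as the `apply` call in the cell below; either form works for a test set of this size.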

In [None]:
# Loading the test dataset from CSV file
test_df = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/test.csv")

# Loading the sample submission dataframe
submission = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")

# Merging test dataframe with the submission dataframe based on the 'eeg_id' column
submission = submission.merge(test_df, on='eeg_id', how='left')

# Creating a new column 'path' by applying a lambda function to the 'spectrogram_id' column
submission['path'] = submission['spectrogram_id'].apply(lambda x: "/kaggle/input/hms-harmful-brain-activity-classification/test_spectrograms/" + str(x) + ".parquet")

# Displaying the first few rows of the submission dataframe
submission.head()


## 🧠 Inference on Test Data and Prediction Aggregation 🤖


**Explanation:**

1. `paths = submission['path'].values`: Extracts the file paths from the 'path' column in the submission dataframe. 📊 [Pandas DataFrame Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)

2. `eps = 1e-6`: Defines a small epsilon value to avoid division by zero during normalization.

3. `data = pd.read_parquet(path)`: Reads the parquet file specified by the file path using Pandas. Parquet is a columnar storage file format. 📊 [Pandas read_parquet Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_parquet.html)

4. Data Preprocessing:
   - Fills NaN values with -1.
   - Drops the first column and transposes the array, so that features run along rows and time steps along columns.
   - Selects the first 300 time steps.
   - Clips the data to the range [exp(-6), exp(10)] and applies a logarithmic transformation, so the values end up in [-6, 10]. The clipping also maps the -1 fill value to exp(-6) before the logarithm is taken.
   - Normalizes the data by subtracting the mean and dividing by the standard deviation (plus `eps`).

5. `data_tensor = torch.unsqueeze(torch.Tensor(data), dim=0)`: Converts the preprocessed data to a PyTorch tensor and adds an extra dimension using `unsqueeze`. 🧠 [PyTorch Tensor Documentation](https://pytorch.org/docs/stable/tensors.html)

6. `data = Config.image_transform(data_tensor)`: Applies the image transformation specified in the configuration to the data. 🖼️ [Torchvision Transforms Documentation](https://pytorch.org/vision/stable/transforms.html)

7. `test_pred = []`: Initializes an empty list to store predictions for each model.

8. Model Inference:
   - Loops over each loaded model.
   - Sets the model to evaluation mode using `model.eval()`.
   - Performs a forward pass and applies softmax over the class dimension to turn the model outputs into probabilities.
   - Detaches the predictions, moves them to CPU, and converts them to a NumPy array.

9. Aggregation:
   - Aggregates predictions across models by taking the mean along the model dimension.

10. `test_preds = np.array(test_preds)`: Converts the list of predictions to a NumPy array for further analysis.

This code conducts inference on the test data using the loaded models, aggregates predictions, and produces the final predictions for submission. The models are expected to have been loaded in a previous code cell.
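
The preprocessing steps can be checked in isolation on synthetic data. The sketch below is a NumPy-only approximation under stated assumptions: the function name `preprocess`, the toy array shape, and the random values are all made up, and the input is assumed to already be a (features, time) array.

```python
import numpy as np

def preprocess(spec, n_steps=300, eps=1e-6):
    """Clip, log-transform and standardize a (features, time) array."""
    spec = np.where(np.isnan(spec), -1.0, spec)   # NaN -> -1, like fillna(-1)
    spec = spec[:, :n_steps]                      # keep the first n_steps columns
    spec = np.clip(spec, np.exp(-6), np.exp(10))  # avoid log of non-positive values
    spec = np.log(spec)                           # values now lie in [-6, 10]
    return (spec - spec.mean()) / (spec.std() + eps)

rng = np.random.default_rng(0)
toy = rng.uniform(0.0, 50.0, size=(400, 320))  # made-up (features, time) data
toy[0, 0] = np.nan                             # simulate a missing value

out = preprocess(toy)
print(out.shape, round(float(out.mean()), 4), round(float(out.std()), 4))
```

After standardization the array has a mean close to 0 and a standard deviation close to 1, regardless of the scale of the raw values.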

In [None]:
# Extracting file paths from the 'path' column in the submission dataframe
paths = submission['path'].values

# List to store predictions for each test spectrogram
test_preds = []

# Loop over each file path in the test dataset
for path in paths:
    eps = 1e-6  # Small epsilon value to avoid division by zero
    
    # Reading the parquet file and preprocessing the data
    data = pd.read_parquet(path)
    data = data.fillna(-1).values[:, 1:].T
    data = data[:, 0:300]
    data = np.clip(data, np.exp(-6), np.exp(10))
    data = np.log(data)
    
    # Normalizing the data
    data_mean = data.mean(axis=(0, 1))
    data_std = data.std(axis=(0, 1))
    data = (data - data_mean) / (data_std + eps)
    
    # Converting the data to PyTorch tensor and applying image transformation
    data_tensor = torch.unsqueeze(torch.Tensor(data), dim=0)
    data = Config.image_transform(data_tensor)
    
    # List to store predictions for each model
    test_pred = []
    
    # Loop over each model in the list of loaded models
    for model in models:
        model.eval()  # Set the model to evaluation mode
        with torch.no_grad():
            # Forward pass to obtain model predictions
            pred = F.softmax(model(data.unsqueeze(0)), dim=1)[0]
            pred = pred.detach().cpu().numpy()
        test_pred.append(pred)
    
    # Aggregating predictions by taking the mean across models
    test_pred = np.array(test_pred).mean(axis=0)
    test_preds.append(test_pred)

# Converting the list of predictions to a NumPy array
test_preds = np.array(test_preds)
test_preds
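
The aggregation step can be illustrated without any trained models. The sketch below uses made-up logits for five hypothetical fold models and six classes, with a small NumPy `softmax` helper written for this example, to show that averaging per-model probabilities still yields a valid probability vector.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Made-up logits from 5 fold models for one sample with 6 classes
rng = np.random.default_rng(2024)
fold_logits = rng.normal(size=(5, 6))

fold_probs = softmax(fold_logits)    # shape (5, 6), each row sums to 1
ensemble = fold_probs.mean(axis=0)   # shape (6,), average over the models

print(ensemble.shape, float(ensemble.sum()))
```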


## 📊 Generating Submission CSV with Class Probabilities 📄

**Explanation:**

1. `submission = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")`: Reads the sample submission CSV file into a Pandas DataFrame (`submission`). 📊 [Pandas read_csv Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)

2. `labels = ['seizure', 'lpd', 'gpd', 'lrda', 'grda', 'other']`: Defines a list of class labels corresponding to the target classes.

3. Adding Class Probability Columns:
   - Loops over each class label.
   - Adds a new column to the submission dataframe for each class, containing the corresponding class probabilities from the aggregated test predictions.

4. `submission.to_csv("submission.csv", index=None)`: Saves the modified submission dataframe to a CSV file named "submission.csv" without including the index column. 📄 [Pandas to_csv Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)

5. `submission.head()`: Displays the first few rows of the modified submission dataframe with added class probability columns. 🧐 [Pandas DataFrame head Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html)

This cell fills the submission dataframe with one `_vote` column per class, using the aggregated test predictions as class probabilities. The resulting dataframe is then saved to a CSV file for submission.
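
A small sanity check on toy data can confirm the column layout before writing the real file. The prediction values and `eeg_id` values below are made up for illustration; only the label names and the `_vote` suffix come from this notebook.

```python
import numpy as np
import pandas as pd

labels = ['seizure', 'lpd', 'gpd', 'lrda', 'grda', 'other']

# Made-up predictions for two test rows; each row sums to 1
toy_preds = np.array([
    [0.50, 0.10, 0.10, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.75],
])
toy_submission = pd.DataFrame({"eeg_id": [11, 22]})  # hypothetical ids

# One probability column per class, in the same order as the labels
for i, label in enumerate(labels):
    toy_submission[f"{label}_vote"] = toy_preds[:, i]

vote_cols = [f"{label}_vote" for label in labels]
row_sums = toy_submission[vote_cols].sum(axis=1)
print(toy_submission.shape, np.allclose(row_sums, 1.0))
```

Checking that every row of vote columns sums to 1 is a cheap way to catch a wrong axis or a label-order mix-up before submitting.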

In [None]:
# Reading the sample submission dataframe
submission = pd.read_csv("/kaggle/input/hms-harmful-brain-activity-classification/sample_submission.csv")

# List of class labels
labels = ['seizure', 'lpd', 'gpd', 'lrda', 'grda', 'other']

# Adding columns for class probabilities based on the aggregated test predictions
for i in range(len(labels)):
    submission[f'{labels[i]}_vote'] = test_preds[:, i]

# Saving the submission dataframe to a CSV file
submission.to_csv("submission.csv", index=None)

# Displaying the first few rows of the submission dataframe with added columns
submission.head()


## Explore More! 👀

I appreciate you taking the time to explore this notebook! If you found it insightful or helpful in any way, feel free to delve into more of my projects on my profile.

👉 [Check out My Profile](https://www.kaggle.com/zulqarnainali) 👈

## Share Your Thoughts! 🗣️
Your feedback is invaluable! If you have any comments, questions, or ideas to share, I'm eager to hear from you. Your insights contribute significantly to my ongoing improvement.

📬 Drop me an email at: [zulqar445ali@gmail.com](mailto:zulqar445ali@gmail.com)

I want to express my gratitude for your time and engagement. Your support motivates me to create more meaningful content.

Happy coding, and I wish you the best in all your data science endeavors! 🚀