# Requirements of the Project
## Tools and Libraries
The project relies on the following tools and libraries. Pin specific versions where necessary to ensure compatibility.

- Python: the primary programming language.
- NumPy and Pandas: for data manipulation.
- Scikit-Learn: for applying machine learning algorithms.
- TensorFlow or PyTorch: for deep learning methods (the code in this project uses PyTorch).
- Matplotlib and Seaborn: for plotting and visualization.

## Splitting Small Datasets
Instead of using the full dataset, we sample a small subset from the original data.

Loading Data: The dataset is loaded from the CSV file train.csv into a DataFrame df using Pandas.

Analyzing Class Distribution: The distribution of classes in the expert_consensus column is counted and printed, which reveals any imbalance among the classes.

### Initial Data Distribution

| expert_consensus | Count |
|------------------|-------|
| Seizure          | 20933 |
| GRDA             | 18861 |
| Other            | 18808 |
| GPD              | 16702 |
| LRDA             | 16640 |
| LPD              | 14856 |

Balancing the Data:
The minimum class count (min_samples) is computed, but the code does not use it as the sample size. Instead, a fixed desired_samples of 50 per class is set, presumably to keep the subset small while still giving each class enough examples for training. For each class, 50 samples are drawn at random without replacement, so no sample appears twice within a class.
The sampled rows are appended to the balanced_df DataFrame, which forms the new balanced dataset.

Output Balanced Data: 
The newly balanced data's class distribution is printed to verify that each class now has an equal number of samples. Finally, this balanced dataset is saved to balanced_train.csv, making it available for further processing or model training.
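
The balancing procedure described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual script: the helper name `balance_classes`, the `seed` parameter, and the toy DataFrame are assumptions made for the example, while `expert_consensus`, `desired_samples`, `min_samples`, and `balanced_df` follow the names used in the text.

```python
import pandas as pd


def balance_classes(df: pd.DataFrame, label_col: str = "expert_consensus",
                    desired_samples: int = 50, seed: int = 42) -> pd.DataFrame:
    """Draw `desired_samples` rows per class without replacement."""
    min_samples = df[label_col].value_counts().min()
    if desired_samples > min_samples:
        raise ValueError(
            f"desired_samples={desired_samples} exceeds the smallest class ({min_samples})"
        )
    parts = [
        group.sample(n=desired_samples, replace=False, random_state=seed)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, ignore_index=True)


# Tiny synthetic stand-in for train.csv (the real file has more rows and columns).
toy = pd.DataFrame({
    "eeg_id": range(30),
    "expert_consensus": ["Seizure"] * 12 + ["GPD"] * 10 + ["Other"] * 8,
})
balanced_df = balance_classes(toy, desired_samples=5)
print(balanced_df["expert_consensus"].value_counts())
```

With the real data, the same helper would be applied to `pd.read_csv("train.csv")` and the result written out with `to_csv("balanced_train.csv", index=False)`.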

### Balanced Data Distribution

| expert_consensus | Count |
|------------------|-------|
| Seizure          | 50    |
| GPD              | 50    |
| LRDA             | 50    |
| Other            | 50    |
| GRDA             | 50    |
| LPD              | 50    |

# Implementation and Design

## Logging Setup

    import functools
    import logging
    import time

    # `paths.ROOT` is defined elsewhere in the project; the `/` operator implies a pathlib.Path.
    log_filename = paths.ROOT / 'new_version_training_record.log'

    logging.basicConfig(filename=log_filename, level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s', datefmt='%Y-%m-%d %H:%M:%S')


    def log_time(func):
        """Wrapper that logs the running time of the decorated function."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.time()
            result = func(*args, **kwargs)
            elapsed = time.time() - start_time
            message = f"{func.__name__} took {elapsed:.4f} seconds."
            logging.info(message)  # written to the log file
            print(message)         # echoed to the console
            return result

        return wrapper

We implemented a logging system to record execution times and operational status during training. The setup includes:

Logging Configuration: Each log record contains a timestamp, a log level, and a message, and is written to new_version_training_record.log for later analysis.

Execution Time Monitoring: The log_time decorator measures and logs the execution time of critical functions, which helps identify performance bottlenecks.

This makes each training run traceable, which is useful both for debugging and for improving computational efficiency in our experiments.

## Pre-processing the Dataset

### Original dataset

Examples of an original spectrogram and of the EEG signals are shown below.

spectrogram

![spectrogram](output/spectrogram.jpg)

eeg_example

![eeg_example](output/eeg_example.jpg)

More information about this competition can be found in this link:
https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/468010


### Logarithmic Transformation
Before standardization, each spectrogram undergoes a logarithmic transformation to stabilize the variance across the image. The transformation reduces the skewness of the original distribution, which is particularly useful for spectral data that often spans several orders of magnitude.

Data Augmentation: To make the model more robust to variations in the input, optional augmentation such as horizontal flipping is applied. This simulates a broader set of inputs that the model might encounter in practice.
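
As an illustration of the optional horizontal flip, the sketch below flips a 2-D spectrogram along its width axis with a given probability. The function name `random_horizontal_flip`, the probability `p`, and the assumed (height, width) layout are choices made for this example and are not taken from the project code.

```python
import numpy as np


def random_horizontal_flip(img: np.ndarray, p: float = 0.5, rng=None) -> np.ndarray:
    """Flip a (height, width) array along its width axis with probability p."""
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < p:
        return img[:, ::-1].copy()  # copy so the result is contiguous
    return img


spec = np.arange(6, dtype=np.float32).reshape(2, 3)
flipped = random_horizontal_flip(spec, p=1.0)    # p=1.0 forces the flip
unchanged = random_horizontal_flip(spec, p=0.0)  # p=0.0 disables it
print(flipped)
```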

### Spectrogram Standardization
Following the logarithmic transformation, each spectrogram is standardized by subtracting the mean and dividing by the standard deviation of its pixel values. This gives every input image zero mean and unit variance, which is a favorable condition for model training.

    import numpy as np

    def log_and_Standarize(self, img):
        # Defined as a method, hence the `self` parameter.
        # Log-transform the spectrogram; clipping to [e^-4, e^8] keeps the log finite.
        img = np.clip(img, np.exp(-4), np.exp(8))
        img = np.log(img)

        # Standardize per image; NaNs are ignored when computing the statistics.
        ep = 1e-6  # avoids division by zero for constant images
        mu = np.nanmean(img.flatten())
        std = np.nanstd(img.flatten())
        img = (img - mu) / (std + ep)
        img = np.nan_to_num(img, nan=0.0)  # remaining NaNs become 0
        return img



### Wavelet Transform Feature Extraction

We apply several techniques to preprocess and enrich the training dataset so that the model can handle complex signals. In addition to the raw EEG and the spectrograms, we use wavelet transforms to extend the feature set.

Wavelet Transform Feature Extraction:
We use the pywt library to perform wavelet decomposition, with db1 as the mother wavelet and a decomposition level of 5. This captures details and trends of the original signal at different scales.
Because the wavelet coefficient arrays differ in length and scale from level to level, they are not suitable as direct, uniform model inputs. We therefore compute statistical features (mean and standard deviation) from the coefficients of each level, which keeps the scale of the inputs consistent.
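
The decomposition and per-level statistics can be sketched as below. It uses `pywt.wavedec` with db1 and level 5 as stated in the text; the function name `wavelet_stat_features` and the random test signal are assumptions for illustration, and the project's own code may organize the features differently.

```python
import numpy as np
import pywt


def wavelet_stat_features(signal: np.ndarray, wavelet: str = "db1", level: int = 5) -> np.ndarray:
    """Return [mean, std] for each coefficient array of a multilevel DWT."""
    # For level=5 the result is [cA5, cD5, cD4, cD3, cD2, cD1].
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    feats = []
    for c in coeffs:
        feats.extend([np.mean(c), np.std(c)])
    return np.asarray(feats, dtype=np.float32)


rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)  # stand-in for one EEG channel
features = wavelet_stat_features(signal)
print(features.shape)  # (12,): 6 coefficient arrays x 2 statistics
```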

Feature Integration and Standardization:
The wavelet-derived features and the original spectrogram features are standardized before being passed to the model: a logarithmic transformation smooths the distribution, and normalization then brings the data onto a common scale.
The processed wavelet features are merged with the other spectrogram features, resized to the target resolution (128x256), and integrated into the final input tensor.

### Reshaping and Merging Modalities
Spectrograms from different sources, such as EEG and wavelet transforms, are resized and merged into a single tensor. The merged tensor gives the model a multi-dimensional view of the features, which matters for recognizing complex patterns.
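
A rough sketch of the resize-and-merge step is given below. The text does not say how the modalities are combined, so stacking them along a channel axis is an assumption here, as are the helper names `resize_nearest` and `merge_modalities`, the nearest-neighbour resize (a stand-in for whatever image resize the project uses), and the input shapes. Only the 128x256 target resolution comes from the text.

```python
import numpy as np


def resize_nearest(img: np.ndarray, out_h: int = 128, out_w: int = 256) -> np.ndarray:
    """Nearest-neighbour resize of a 2-D array (stand-in for a proper image resize)."""
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[rows[:, None], cols[None, :]]


def merge_modalities(images: list) -> np.ndarray:
    """Resize every 2-D input to 128x256 and stack along a new channel axis."""
    return np.stack([resize_nearest(im) for im in images], axis=0).astype(np.float32)


rng = np.random.default_rng(0)
kaggle_spec = rng.random((100, 300))  # shapes are illustrative only
eeg_spec = rng.random((128, 256))
wavelet_img = rng.random((64, 64))
x = merge_modalities([kaggle_spec, eeg_spec, wavelet_img])
print(x.shape)  # (3, 128, 256)
```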

## Training

In the training phase of our neural network model, several steps are undertaken to ensure effective learning:

### Model Setup
The CustomModel class is initialized from a predefined model configuration and uses timm.create_model to build the underlying architecture, an EfficientNet variant (tf_efficientnet_b0).

### Data Preparation
Training data is loaded with the CustomDataset class, which handles data augmentation, multi-modality data integration (spectrograms, EEG, and wavelet spectrograms), and the transformations needed to prepare batches for the model.

### Optimizer and Scheduler
An AdamW optimizer is used together with a learning rate scheduler (OneCycleLR), which adjusts the learning rate dynamically during training to speed up convergence. The learning rate schedule is shown in the figure further below.
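
To show how the two pieces fit together, the sketch below pairs AdamW with OneCycleLR and records the learning rate after every step. The placeholder model, the learning rate, the weight decay, and the number of steps per epoch are assumptions for illustration; only the optimizer, the scheduler, and the 3 epochs come from this document.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

model = nn.Linear(8, 6)          # placeholder for the real network
epochs, steps_per_epoch = 3, 10  # 10 steps per epoch is illustrative
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)  # assumed values
scheduler = OneCycleLR(optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)

lrs = []
for _ in range(epochs * steps_per_epoch):
    optimizer.step()   # in real training this follows loss.backward()
    scheduler.step()   # OneCycleLR is stepped once per batch, not once per epoch
    lrs.append(scheduler.get_last_lr()[0])

print(f"peak lr = {max(lrs):.2e}, final lr = {lrs[-1]:.2e}")
```

The recorded values rise to `max_lr` early in training and then decay, which is the general shape a one-cycle schedule produces.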

### Training Loop: Cross-Validation over Feature Combinations
Training processes batches of data, computes the loss with KLDivLoss, and updates the model weights. Gradient clipping is applied to avoid exploding gradients. Detailed logs of training progress, including loss values and learning rates, are kept by the custom logging setup.

We ran a series of experiments to identify the best combination of input features and model parameters. Each experiment used 3 epochs and 3-fold cross-validation so that results are comparable and less dependent on a single split. The cross-validation results for the different feature combinations, for two EfficientNet variants, are summarized in the section "Features combinations experiment result".
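
The experiment grid can be enumerated as in the following sketch, which lists every non-empty combination of the three input types for both models. The flag names are hypothetical and chosen to mirror the column headers of the results table; the actual configuration keys in the project may differ.

```python
from itertools import product

MODELS = ["tf_efficientnet_b0", "efficientnet_b4"]
# Hypothetical flag names mirroring the results table columns.
FLAGS = ["use_kaggle_spectrograms", "use_eeg_spectrograms", "use_wavelet_spectrograms"]

experiments = [
    {"model": model_name, **dict(zip(FLAGS, flags))}
    for model_name in MODELS
    for flags in product([True, False], repeat=len(FLAGS))
    if any(flags)  # at least one input modality must be enabled
]

print(len(experiments))  # 14 configurations: 2 models x 7 feature combinations
for cfg in experiments[:3]:
    print(cfg)
```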

### Amp Scaler
Mixed-precision training is enabled through torch.cuda.amp, which accelerates training and reduces memory use.
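
A single training step that combines mixed precision, KLDivLoss, and gradient clipping might look like the sketch below. The placeholder model, the random batch, the learning rate, and `max_grad_norm` are assumptions, and the scaler is simply disabled when no GPU is available. Recent PyTorch versions also offer equivalent classes under `torch.amp`.

```python
import torch
import torch.nn.functional as F
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # AMP is switched off on CPU in this sketch

model = nn.Linear(16, 6).to(device)  # placeholder for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.KLDivLoss(reduction="batchmean")
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(x: torch.Tensor, y: torch.Tensor, max_grad_norm: float = 1.0) -> float:
    """One optimization step; `y` holds per-class target probabilities."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        logits = model(x)
        # KLDivLoss expects log-probabilities as its first argument.
        loss = criterion(F.log_softmax(logits, dim=1), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale gradients before clipping them
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


x = torch.randn(4, 16, device=device)
y = torch.softmax(torch.randn(4, 6, device=device), dim=1)
loss_value = train_step(x, y)
print(loss_value)
```

Calling `scaler.unscale_` before clipping matters because the clipping threshold would otherwise be compared against scaled gradients.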

### Two-Stage Strategy

![Learning Rate Schedule](output/lr.jpg)

## Evaluation
The model's performance is evaluated using the following steps:

### Validation Data Loading
A separate validation dataset is prepared with the CustomDataset class, without data augmentation, so that the model is evaluated on unmodified data.

### Evaluation Loop
During evaluation, predictions are made without updating the model weights. Losses are computed with the same criterion as in training (KLDivLoss). The softmax function is applied to the predictions to convert logits into probabilities.

### Metrics Calculation
The evaluation phase logs the loss for each batch as well as the overall validation loss, which helps assess how well the model generalizes.

### Model Saving
If the validation loss of the current epoch is lower than every previously recorded loss, the model state is saved. This ensures that only the best-performing model is retained.
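
The evaluation loop and the save-on-improvement rule can be sketched as follows. The placeholder model, the random validation batches, and the checkpoint file name are hypothetical; the real project uses CustomModel and CustomDataset, which are not reproduced here.

```python
import os
import tempfile

import torch
import torch.nn.functional as F
from torch import nn


@torch.no_grad()
def evaluate(model: nn.Module, batches, criterion: nn.Module):
    """Return the mean validation loss and softmax probabilities; no weights change."""
    model.eval()
    losses, probs = [], []
    for x, y in batches:
        logits = model(x)
        losses.append(criterion(F.log_softmax(logits, dim=1), y).item())
        probs.append(torch.softmax(logits, dim=1))
    return sum(losses) / len(losses), torch.cat(probs)


model = nn.Linear(16, 6)  # placeholder network
criterion = nn.KLDivLoss(reduction="batchmean")
val_batches = [(torch.randn(4, 16), torch.softmax(torch.randn(4, 6), dim=1)) for _ in range(3)]

best_loss = float("inf")
ckpt_path = os.path.join(tempfile.gettempdir(), "best_model.pth")  # hypothetical file name
for epoch in range(3):
    # ... one epoch of training would run here ...
    val_loss, val_probs = evaluate(model, val_batches, criterion)
    if val_loss < best_loss:  # keep only the best checkpoint
        best_loss = val_loss
        torch.save(model.state_dict(), ckpt_path)
```

Averaging the per-batch losses equals the overall mean only when all batches have the same size, which holds in this toy setup.
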

Together, these steps describe how the model is trained and evaluated, and each of them produces logs or metrics that help assess the model's performance and identify areas for improvement.

### Features combinations experiment result

| Model             | Use Kaggle Spectrograms | Use EEG Spectrograms | Use Wavelet Spectrograms | CV Result (KL divergence, lower is better) |
|-------------------|-------------------------|----------------------|--------------------------|-------------|
| **tf_efficientnet_b0** | **True**                    | **False**                | **False**                    | **1.2792**      |
| tf_efficientnet_b0 | True                    | True                 | True                     | 1.2816      |
| tf_efficientnet_b0 | True                    | True                 | False                    | 1.2915      |
| tf_efficientnet_b0 | True                    | False                | True                     | 1.2948      |
| tf_efficientnet_b0 | False                   | True                 | True                     | 1.3456      |
| tf_efficientnet_b0 | False                   | False                | True                     | 1.3557      |
| tf_efficientnet_b0 | False                   | True                 | False                    | 1.3655      |
| **efficientnet_b4**    | **True**                    | **False**                | **False**                    | **1.3070**      |
| efficientnet_b4    | True                    | False                | True                     | 1.3266      |
| efficientnet_b4    | True                    | True                 | True                     | 1.3388      |
| efficientnet_b4    | True                    | True                 | False                    | 1.3415      |
| efficientnet_b4    | False                   | True                 | True                     | 1.3431      |
| efficientnet_b4    | False                   | True                 | False                    | 1.3676      |
| efficientnet_b4    | False                   | False                | True                     | 1.3566      |


The results show that training on Kaggle spectrograms alone (without EEG or wavelet spectrograms) gave the best cross-validation result for both tf_efficientnet_b0 and efficientnet_b4. The CV result is a loss, so lower is better: tf_efficientnet_b0 reached 1.2792 with Kaggle spectrograms only, ahead of every combination that adds further features (the closest being all three inputs at 1.2816).

The outcome measure is the Kullback-Leibler divergence (KLDivLoss), computed after applying a log-softmax transformation to the predictions:

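    import torch
    import torch.nn.functional as F
    from torch import nn

    def get_result(oof_df):
        # `label_cols` and `target_preds` are lists of column names defined elsewhere in the project.
        kl_loss = nn.KLDivLoss(reduction="batchmean")
        labels = torch.tensor(oof_df[label_cols].values)
        preds = torch.tensor(oof_df[target_preds].values)
        preds = F.log_softmax(preds, dim=1)  # KLDivLoss expects log-probabilities as input
        result = kl_loss(preds, labels)
        return result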
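
For reference, the quantity that get_result returns can be written out as below. The notation is introduced here for illustration and does not appear in the project code: $z_{ic}$ are the raw predictions in the target_preds columns, $y_{ic}$ the label probabilities in the label_cols columns, $N$ the number of rows, and $C$ the number of classes (six in this dataset).

$$
\mathrm{KL} = \frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{ic}\,\bigl(\log y_{ic} - \log p_{ic}\bigr),
\qquad
p_{ic} = \frac{e^{z_{ic}}}{\sum_{k=1}^{C} e^{z_{ik}}}
$$

Terms with $y_{ic} = 0$ contribute zero, and the division by $N$ corresponds to the "batchmean" reduction.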

Based on the experimental trials, we conclude that simpler feature sets may sometimes provide superior predictive performance, potentially due to reduced overfitting and noise. These insights will guide the future direction of our research, focusing on optimizing feature selection and model configuration to enhance model efficacy.

## References

- https://www.kaggle.com/code/alejopaullier/hms-efficientnetb0-pytorch-train
- https://www.kaggle.com/competitions/hms-harmful-brain-activity-classification/discussion/468010
