# **BirdCLEF 2024: Bird Sound Classification Project**

## Overview

This project classifies bird species from audio recordings with deep learning, aiming for high performance in the BirdCLEF 2024 competition. It combines waveform and spectrogram data augmentation, a binary focal loss, and an ensemble of cross-validation fold models. The pipeline covers data preprocessing, model training, evaluation, and inference, and it exports the trained models to the ONNX format for optimized deployment.

## Table of Contents

1. Environment Setup
2. Configuration
3. Data Preprocessing
4. Dataset Handling
5. Data Augmentation
6. Model Architecture
7. Training Utilities
8. Training Loop
9. Training Process
10. Inference (including ONNX export)
11. Visualization

## Environment Setup

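The notebook begins by importing the libraries used for data manipulation, audio processing, model building, training, and evaluation. The flags `KAGGLE`, `isTrain`, and `isInference` choose between the Kaggle and local environments and between training and inference, and later cells branch on them.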

In [None]:
import os
import gc
import sys
import glob
import time
import shutil
import random
import warnings
warnings.simplefilter("ignore")

# Run-mode flags read by later cells. They are not defined in this excerpt, so the
# values below are assumptions: Kaggle is detected from its input directory.
KAGGLE = os.path.exists("/kaggle/input")
isTrain = not KAGGLE
isInference = True

import wandb

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, GroupKFold, StratifiedGroupKFold

from tqdm.notebook import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
from torch.cuda import amp
print(f"pytorch version is {torch.__version__}")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

import torchvision
from torchvision.transforms import v2 as transforms

import librosa
import torchaudio
import torchaudio.transforms as audioT

if not KAGGLE:
    # Training-only dependencies (not needed for Kaggle inference)
    import nnAudio
    from nnAudio import features
    import albumentations
    from audiomentations import Compose, SpecCompose, OneOf, AddGaussianNoise, AddColorNoise
    from audiomentations import TimeStretch, PitchShift, Shift, SpecFrequencyMask, TimeMask
    from audiomentations import Gain, GainTransition
    from torcheval.metrics.functional import multiclass_auroc, multiclass_f1_score, multiclass_precision, multiclass_recall, multilabel_accuracy
    from adan_pytorch import Adan
import timm


**Key Components:**

* **Data Manipulation:** numpy, pandas
* **Audio Processing:** librosa, torchaudio, nnAudio
* **Machine Learning Utilities:** scikit-learn, torch, torchvision
* **Data Augmentation:** albumentations, audiomentations
* **Metrics:** torcheval, scikit-learn
* **Optimization:** adan_pytorch (Adan optimizer)
* **Visualization:** matplotlib, seaborn
* **Model Management:** timm, onnx, onnxruntime
* **Experiment Tracking:** wandb (Weights & Biases)

## Configuration

A configuration class centralizes all the settings and hyperparameters used throughout the project. This ensures easy management and modification of parameters.

In [None]:
class config:
    if KAGGLE:
        dir = "/kaggle/input/birdclef-2024/"
    else:
        dir = "/mnt/d/kaggle/birdclef-2024/"

    wave_path = "original_waves/second_30/"
    model_name = 'tf_efficientnet_b0'
    pool_type = 'avg'

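    train_duration = 30       # Seconds of audio kept per training recording
    slice_duration = 5        # Duration (s) of each slice fed into the model

    test_duration = 5         # Duration (s) of each validation/test window
    train_drop_duration = 1   # Seconds skipped at the start of each training recording

    # Spectrogram parameters
    sr = 32000                # Sample rate (Hz)
    fmin = 20                 # Lowest mel-filter frequency (Hz)
    fmax = 15000              # Highest mel-filter frequency (Hz)
    n_mels = 128              # Mel bins (spectrogram height)
    n_fft = n_mels * 8        # 1024-point FFT
    size_x = 512              # Target number of time frames per slice
    hop_length = int(sr * slice_duration / size_x)       # 312 samples
    test_hop_length = int(sr * test_duration / size_x)   # 312 samples
    bins_per_octave = 12

    # Cross-validation
    nfolds = 5
    inference_folds = [4]     # Folds that are trained and used for inference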

    # Training settings
    enable_amp = True
    train_batchsize = 32
    valid_batchsize = 1
    loss_type = "BCEFocalLoss"
    lr = 1.0e-03
    optimizer = 'adan'
    weight_decay = 1.0e-02
    es_patience = 5
    deterministic = True
    max_epoch = 9
    aug_epoch = 6

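    # Labels, sampling, and logging
    useSecondary = True
    secondary_label_value = 0.5   # Target value given to secondary labels
    oversample = False
    oversample_threthold = 60
    seed = 42
    wandb = True

    # Augmentation probabilities (0.0 disables the transform)
    aug_noise = 0.0
    aug_gain = 0.0
    aug_wave_pitchshift = 0.0
    aug_wave_shift = 0.0
    aug_spec_xymasking = 0.0
    aug_spec_coarsedrop = 0.0
    aug_spec_hflip = 0.0
    aug_spec_mixup = 0.0          # Chance per augmented epoch of using mixup_one_loop
    aug_spec_mixup_prob = 0.5     # Inside mixup_one_loop: chance of spec_mixup instead of mixup
    alpha = 0.95                  # Beta(alpha, alpha) parameter for mixup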

    # Label smoothing
    smoothing_value = 0.0

cfg = config()


**Highlights:**
* **Data Paths:** Differentiates between Kaggle and local environments.
* **Model Parameters:** Defines the model architecture and pooling strategy.
* **Training Hyperparameters:** Learning rate, batch sizes, epochs, optimizer choice, etc.
* **Data Augmentation:** Configurable probabilities for the individual augmentation techniques.
* **Cross-Validation:** Specifies the number of folds and which folds to use during inference.


## Data Preprocessing

### **Loading and Cleaning the Data**

The dataset is loaded from CSV files, and duplicate entries are removed to ensure data quality.

In [None]:
import ast

sample_submission = pd.read_csv(cfg.dir + "sample_submission.csv")
LABELS = list(sample_submission.set_index("row_id").columns)  # class names in submission order

if not KAGGLE:
    train_csv = pd.read_csv(cfg.dir + "train_eda.csv")
    train_csv["fileID"] = train_csv["filename"].map(lambda x: x.split("/")[1][:-4])
else:
    train_csv = pd.read_csv(cfg.dir + "train_metadata.csv")

# secondary_labels is stored as a string such as "['x', 'y']"; join it with the primary label
train_csv["new_target"] = train_csv["primary_label"] + " " + train_csv["secondary_labels"].map(lambda x: " ".join(ast.literal_eval(x)))
train_csv["len_new_target"] = train_csv["new_target"].map(lambda x: len(x.split()))

# File ID = filename without the species folder and the ".ogg" extension
train_csv["filename_tmp"] = train_csv["filename"].map(lambda x: x.split("/")[1][:-4])
# Drop every row whose file ID occurs more than once (all copies, not only the extras)
train_csv = train_csv[~train_csv["filename_tmp"].duplicated(keep=False)].reset_index(drop=True)


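**Steps:**
1. **Loading Sample Submission:** Reads the list of class labels used in the competition (every submission column except `row_id`).
1. **Loading Training Data:**
    * Locally, it reads `train_eda.csv`.
    * On Kaggle, it reads `train_metadata.csv`.
1. **Creating New Targets:** Combines the primary label and the parsed secondary labels into one space-separated target string (`new_target`) and records the number of labels (`len_new_target`).
1. **Removing Duplicates:** Drops every row whose file ID appears more than once, so the same recording cannot be counted twice or end up in both a training and a validation fold.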
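You can check both preprocessing rules on toy rows without the competition CSVs. The sketch below uses only the standard library. The helper names `build_target` and `drop_all_duplicates` and the sample filenames are hypothetical. It assumes `secondary_labels` is stored as a Python-list string, as the `ast.literal_eval` call above implies.

```python
import ast
from collections import Counter


def build_target(primary_label: str, secondary_labels: str) -> str:
    """Join the primary label with the parsed secondary-label list (stored as a string)."""
    return primary_label + " " + " ".join(ast.literal_eval(secondary_labels))


def drop_all_duplicates(filenames: list[str]) -> list[str]:
    """Keep only files whose ID (folder and '.ogg' stripped) occurs exactly once."""
    file_ids = [name.split("/")[1][:-4] for name in filenames]
    counts = Counter(file_ids)
    return [name for name, fid in zip(filenames, file_ids) if counts[fid] == 1]


# Hypothetical rows: the same recording ID appears under two species folders
rows = ["asbfly/XC0001.ogg", "ashdro1/XC0001.ogg", "ashdro1/XC0002.ogg"]
print(build_target("asbfly", "['ashdro1', 'grewar3']"))  # asbfly ashdro1 grewar3
print(drop_all_duplicates(rows))                          # ['ashdro1/XC0002.ogg']
```

A recording with no secondary labels (`"[]"`) yields a target string with a trailing space. `split()` ignores that space, so `len_new_target` is still 1.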

## Dataset Handling
A custom `BirdCLEF_Dataset` class, a subclass of `torch.utils.data.Dataset`, handles data loading, preprocessing, and augmentation. This write-up omits the mode-specific loading logic in `__getitem__`.

In [None]:
class BirdCLEF_Dataset(torch.utils.data.Dataset):
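    def __init__(self, df, augmentation=False, mode='train'):
        # In test mode a plain list of file paths is passed instead of a DataFrame
        self.df = df.reset_index(drop=True) if isinstance(df, pd.DataFrame) else df
        self.mode = mode
        self.augmentation = augmentation

    def __len__(self):
        return len(self.df)

    def normalize(self, x):
        # Replace -inf (log of zero power) with the mean of the finite values
        valid_values = x[x != float('-inf')]
        mean_value = np.mean(valid_values)
        x[x == float('-inf')] = mean_value
        # Min-max scale to [0, 1]; skip the division for a constant spectrogram
        x = x - x.min()
        max_value = x.max()
        if max_value > 0:
            x = x / max_value
        return x

    def wave_tile_and_cutoff(self, data):
        # data has shape (channels, samples)
        drop_duration = cfg.sr * cfg.train_drop_duration
        use_duration = cfg.sr * cfg.train_duration

        # Skip the first train_drop_duration seconds when the clip is long enough
        if len(data[0]) > drop_duration:
            data = data[:, drop_duration:]

        # Repeat short clips until they cover train_duration seconds
        if len(data[0]) < use_duration:
            n_repeats = 1 + use_duration // len(data[0])
            data = np.tile(data, (1, n_repeats))

        data = data[:, :use_duration]
        return data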

    def label_smoothing(self, idx, target):
        secondary_target = target * cfg.secondary_label_value
        out_of_target_noise_intensity = cfg.smoothing_value / (len(LABELS) - 1)
        out_of_target_noise_array = torch.ones(target.shape) * out_of_target_noise_intensity
        secondary_target_with_noise = secondary_target + out_of_target_noise_array
        secondary_target_with_noise = torch.clip(secondary_target_with_noise, min=0, max=cfg.secondary_label_value)

        primary_target = np.isin(LABELS, self.df.loc[idx, "primary_label"]).astype(int)
        primary_target = torch.tensor(primary_target, dtype=torch.float32)

        primary_and_secondary_target_with_noise = primary_target + secondary_target_with_noise
        new_target = torch.clip(primary_and_secondary_target_with_noise, min=0, max=1)
        new_target = new_target - primary_target * cfg.smoothing_value
        return new_target

    def __getitem__(self, idx):
        if self.mode == 'train':
            # Load and preprocess training data
            pass
        elif self.mode == 'valid':
            # Load and preprocess validation data
            pass
        elif self.mode == 'test':
            # Load and preprocess test data
            pass
        elif self.mode == 'clean':
            # Load and preprocess clean data for visualization
            pass


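**Functionalities:**
* **Normalization:** Replaces `-inf` values with the mean of the finite values, then min-max scales the spectrogram to [0, 1].
* **Wave Tiling and Cutoff:** Skips the first `cfg.train_drop_duration` seconds of a recording when it is long enough, tiles short clips, and truncates every clip to `cfg.train_duration` seconds.
* **Label Smoothing:** Builds soft targets. The primary label gets `1 - cfg.smoothing_value`, secondary labels get `cfg.secondary_label_value`, and every other class gets `cfg.smoothing_value / (len(LABELS) - 1)`. With the default `smoothing_value = 0.0`, the primary label is 1 and secondary labels are 0.5.
* **Data Augmentation:** Integrates waveform and spectrogram augmentations to make the model more robust.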
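The NumPy sketch below restates `wave_tile_and_cutoff` and `label_smoothing` so the resulting arrays are concrete. It is not the project's code. It passes class membership as integer indices instead of looking it up in `LABELS`, and it uses a toy sample rate of 8 Hz only to keep the arrays small.

```python
import numpy as np


def tile_and_cutoff(wave: np.ndarray, sr: int, drop_s: int, use_s: int) -> np.ndarray:
    """Mirror of wave_tile_and_cutoff for a (channels, samples) array."""
    drop, use = sr * drop_s, sr * use_s
    if wave.shape[1] > drop:                 # skip the first drop_s seconds when possible
        wave = wave[:, drop:]
    if wave.shape[1] < use:                  # repeat short clips to cover use_s seconds
        n_repeats = 1 + use // wave.shape[1]
        wave = np.tile(wave, (1, n_repeats))
    return wave[:, :use]


def soft_targets(primary: int, secondary: list[int], n_classes: int,
                 secondary_value: float = 0.5, smoothing: float = 0.0) -> np.ndarray:
    """Same arithmetic as label_smoothing, written with integer class indices."""
    sec = np.zeros(n_classes)
    sec[secondary] = secondary_value
    # Spread the smoothing mass over all classes, capped at the secondary value
    sec = np.clip(sec + smoothing / (n_classes - 1), 0.0, secondary_value)
    prim = np.zeros(n_classes)
    prim[primary] = 1.0
    # Primary label ends at 1 - smoothing after clipping to [0, 1]
    return np.clip(prim + sec, 0.0, 1.0) - prim * smoothing


wave = np.arange(80, dtype=np.float32).reshape(1, -1)    # 10 s at a toy sr of 8 Hz
clip = tile_and_cutoff(wave, sr=8, drop_s=1, use_s=30)
print(clip.shape)                                        # (1, 240)
print(soft_targets(0, [2], 5, smoothing=0.1))            # 0.9, 0.025, 0.5, 0.025, 0.025
```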

## Data Augmentation
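Data augmentation artificially expands the dataset and makes the model more robust to variations in the audio. Waveform transforms (audiomentations) act on the raw audio, and spectrogram transforms (albumentations) act on the mel spectrogram. Each transform is gated by its probability in `cfg`. In the configuration above all of these probabilities are 0.0, so augmentation stays disabled until you raise them.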

In [None]:
if isTrain == True:
    normal_augment = Compose([
        OneOf([
            Gain(min_gain_in_db=-15, max_gain_in_db=15, p=1.0),
            GainTransition(min_gain_in_db=-24.0, max_gain_in_db=6.0, min_duration=0.2, max_duration=6.0, p=1.0)
        ], p=cfg.aug_gain),

        OneOf([
            AddGaussianNoise(p=1),
            AddColorNoise(p=1, min_snr_db=5, max_snr_db=20, min_f_decay=-3.01, max_f_decay=-3.01)
        ], p=cfg.aug_noise),

        PitchShift(min_semitones=-1, max_semitones=1, p=cfg.aug_wave_pitchshift),
        Shift(p=cfg.aug_wave_shift)
    ])
    alb_transform = [
        albumentations.XYMasking(num_masks_x=2, num_masks_y=1, 
                                 mask_x_length=cfg.size_x // 30, mask_y_length=cfg.n_mels // 30,
                                 fill_value=0, mask_fill_value=0, p=cfg.aug_spec_xymasking),
        albumentations.CoarseDropout(fill_value=0, min_holes=20, max_holes=50, p=cfg.aug_spec_coarsedrop),
        albumentations.HorizontalFlip(p=cfg.aug_spec_hflip)    
    ]
    albumentations_augment = albumentations.Compose(alb_transform)


### **Mixup Techniques**
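Mixup blends two samples into a new one, which regularizes the model. A mixing weight λ is drawn from Beta(`cfg.alpha`, `cfg.alpha`), and each sample in a batch is combined with a randomly permuted partner as λ·x_i + (1 - λ)·x_j. The `"same_wave"` mode mixes only the data and leaves the labels unchanged. The `"other_wave"` mode also mixes the targets with the same λ.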

In [None]:
def mixup(data, targets, alpha, mode="same_wave"):
    lam = np.random.beta(alpha, alpha)  # mixing weight shared by the whole batch
    if mode == "same_wave":
        # Mix the rows of a NumPy array with each other; labels are left unchanged
        data = torch.tensor(data)
        indices = torch.randperm(data.size(0))
        new_data = data * lam + data[indices] * (1 - lam)
        return new_data.numpy()
    elif mode == "other_wave":
        # Mix a batch of tensors with a shuffled copy of itself, together with its targets
        indices = torch.randperm(data.size(0))
        new_data = data * lam + data[indices] * (1 - lam)
        new_targets = targets * lam + targets[indices] * (1 - lam)
        return new_data, new_targets
    raise ValueError(f"unknown mixup mode: {mode}")
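
The NumPy sketch below mirrors the `"other_wave"` branch. It shows that mixing the targets with the same λ keeps every label row a valid convex combination. The function name `mixup_batch`, the toy shapes, and the use of a `numpy.random.Generator` (instead of the global NumPy and torch RNGs used above) are assumptions for this illustration.

```python
import numpy as np


def mixup_batch(data: np.ndarray, targets: np.ndarray, alpha: float,
                rng: np.random.Generator) -> tuple[np.ndarray, np.ndarray, float]:
    """Mix each sample with a randomly permuted partner using one lambda per batch."""
    lam = rng.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha)
    perm = rng.permutation(len(data))     # partner index for every sample
    mixed_data = lam * data + (1.0 - lam) * data[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_data, mixed_targets, lam


rng = np.random.default_rng(0)
data = rng.normal(size=(4, 1, 8, 8)).astype(np.float32)   # toy spectrogram batch
targets = np.eye(3)[[0, 1, 2, 0]]                          # one-hot labels, 3 classes
x, y, lam = mixup_batch(data, targets, alpha=0.95, rng=rng)
print(round(lam, 3), y.sum(axis=1))   # every mixed target row still sums to 1
```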


### **Spectrogram Transformation**
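Three mel-spectrogram layers turn raw waveforms into the network's input. `spec_layer` handles training slices with `cfg.hop_length` and `valid_spec_layer` handles validation windows with `cfg.test_hop_length`; both run on `device`. `test_spec_layer` runs on the CPU for Kaggle inference. With the current configuration both hop lengths are 312 samples.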

In [None]:
spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels, f_min=cfg.fmin, f_max=cfg.fmax, mel_scale='slaney', center=True, pad_mode='reflect'
).to(device)

valid_spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.test_hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels, f_min=cfg.fmin, f_max=cfg.fmax, mel_scale='slaney', center=True, pad_mode='reflect'
).to(device)

test_spec_layer = torchaudio.transforms.MelSpectrogram(
    sample_rate=cfg.sr, hop_length=cfg.test_hop_length, n_fft=cfg.n_fft,
    n_mels=cfg.n_mels, f_min=cfg.fmin, f_max=cfg.fmax, mel_scale='slaney', center=True, pad_mode='reflect'
).cpu()
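
With `center=True`, the STFT pads the signal so that a clip of N samples produces 1 + N // hop_length frames. Plugging in the configuration values gives the time axis the model sees: a 5-second slice becomes 513 frames. That equals `cfg.size_x + 1`, which matches the dummy input shape used for the ONNX export. The stdlib-only check below uses a hypothetical helper, `mel_frames`.

```python
def mel_frames(sr: int, duration_s: int, size_x: int) -> tuple[int, int]:
    """Return (hop_length, n_frames) for a centered STFT configured as in cfg."""
    hop_length = int(sr * duration_s / size_x)   # same formula as cfg.hop_length
    n_samples = sr * duration_s
    n_frames = 1 + n_samples // hop_length       # frame count when center=True
    return hop_length, n_frames


hop, frames = mel_frames(sr=32000, duration_s=5, size_x=512)
print(hop, frames)  # 312 513
```

Because `int()` truncates 312.5 down to 312, the slice gets one frame more than `size_x`. It does not get exactly `size_x` frames.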


## Model Architecture

### **BirdModel**
`BirdModel` wraps a backbone from the timm library, which provides many state-of-the-art image architectures. It adds a configurable pooling stage and a custom classification head. Before the spectrogram reaches the backbone, its single channel is repeated to three channels and normalized with ImageNet statistics.

In [None]:
class BirdModel(torch.nn.Module):
    def __init__(self, model_name, pretrained, in_channels, num_classes, pool="default"):
        super().__init__()
        self.pool = pool
        # ImageNet statistics; forward() repeats the 1-channel spectrogram to 3 channels
        self.normalize = transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

        if pool == "default":
            # Keep the backbone's own global pooling: output is (batch, num_features)
            self.backbone = timm.create_model(
                model_name=model_name, pretrained=pretrained,
                num_classes=0, in_chans=3)
        else:
            # No global pooling: output is (batch, num_features, H, W), pooled in forward()
            self.backbone = timm.create_model(
                model_name=model_name, pretrained=pretrained,
                num_classes=0, in_chans=3, global_pool="")

        in_features = self.backbone.num_features

        self.max_pooling = torch.nn.Sequential(
            torch.nn.AdaptiveMaxPool2d(1),
            torch.nn.Flatten(start_dim=1, end_dim=-1)
        )
        self.avg_pooling = torch.nn.Sequential(
            torch.nn.AdaptiveAvgPool2d(1),
            torch.nn.Flatten(start_dim=1, end_dim=-1)
        )
        # Sized for concatenated max+avg features; forward() sums them and does not use this
        self.both_pooling_neck = torch.nn.Sequential(
            torch.nn.BatchNorm1d(2 * in_features),
            torch.nn.Linear(in_features=2 * in_features, out_features=in_features)
        )

        self.head = torch.nn.Sequential(
            torch.nn.BatchNorm1d(in_features),
            torch.nn.Linear(in_features=in_features, out_features=256),
            torch.nn.Hardswish(inplace=True),
            torch.nn.Dropout(0.1),
            torch.nn.Linear(in_features=256, out_features=num_classes)
        )

        # Not applied in forward(): the model returns logits, and the sigmoid is applied
        # inside the BCE-with-logits losses and at evaluation/inference time
        self.active = torch.nn.Sigmoid()

    def forward(self, x):
        x = x.expand(-1, 3, -1, -1)  # (B, 1, n_mels, T) -> (B, 3, n_mels, T)
        x = self.normalize(x)
        x = self.backbone(x)

        if self.pool == "max":
            x = self.max_pooling(x)
        elif self.pool == "avg":
            x = self.avg_pooling(x)
        elif self.pool == "both":
            x_max = self.max_pooling(x)
            x_avg = self.avg_pooling(x)
            x = x_max + x_avg
        # pool == "default": the backbone has already pooled the features

        x = self.head(x)
        return x  # logits, shape (B, num_classes)


**Features:**
1. **Backbone:** Uses models from the timm library, such as eca_nfnet_l0, convnext_small_fb_in22k_ft_in1k_384, and convnextv2_tiny_fcmae_ft_in22k_in1k_384. The configuration above uses `tf_efficientnet_b0`.
1. **Pooling Strategies:**
    * **Default:** Uses the backbone's built-in global pooling.
    * **Max Pooling:** Captures the most prominent activation in each channel.
    * **Average Pooling:** Captures the average activation in each channel.
    * **Both:** Sums the max-pooled and average-pooled vectors for a richer feature representation.
1. **Classification Head:** Batch normalization, a 256-unit linear layer with Hardswish activation, dropout (0.1), and a final linear layer with one output per class.
1. **Output:** The model returns logits. The sigmoid for multi-label classification is applied outside the model, in the loss and at evaluation and inference time.
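The NumPy sketch below shows the three explicit pooling modes. It assumes a `(batch, channels, height, width)` feature map, which is what the backbone returns with `global_pool=""`. Because `"both"` adds the two pooled vectors instead of concatenating them, the head always receives `in_features` values. That is why `head` can start with `BatchNorm1d(in_features)` in every mode. The function name is hypothetical.

```python
import numpy as np


def pool_features(fmap: np.ndarray, pool: str) -> np.ndarray:
    """Reduce a (batch, channels, height, width) feature map to (batch, channels)."""
    avg = fmap.mean(axis=(2, 3))   # like AdaptiveAvgPool2d(1) + Flatten
    mx = fmap.max(axis=(2, 3))     # like AdaptiveMaxPool2d(1) + Flatten
    if pool == "avg":
        return avg
    if pool == "max":
        return mx
    if pool == "both":
        return mx + avg            # BirdModel sums the two vectors (no concatenation)
    raise ValueError(f"unknown pool type: {pool}")


fmap = np.arange(2 * 3 * 2 * 2, dtype=np.float32).reshape(2, 3, 2, 2)
print(pool_features(fmap, "both").shape)   # (2, 3)
print(pool_features(fmap, "both")[0, 0])   # 4.5 = max 3.0 + mean 1.5
```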

## Training Utilities

### **Setting Random Seeds**
Ensures reproducibility by setting seeds for various random number generators.

In [None]:
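def set_random_seed(seed: int = 42, deterministic: bool = False):
    """Set seeds for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    # Only affects subprocesses; the running interpreter's hash seed is fixed at startup
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # seed every visible GPU
    torch.backends.cudnn.deterministic = deterministic
    if deterministic:
        torch.backends.cudnn.benchmark = False  # autotuning may select nondeterministic kernels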


### **Loss Function: BCEFocalLoss**
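Implements a binary cross-entropy focal loss. It addresses class imbalance by down-weighting easy, well-classified examples, so training concentrates on the hard ones. Positive targets are weighted by `alpha * (1 - p) ** gamma` and negative targets by `p ** gamma`, where `p` is the predicted probability.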

In [None]:
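class BCEFocalLoss(nn.Module):
    """Binary focal loss on logits: easy, confident predictions are down-weighted."""

    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha  # weight of the positive term
        self.gamma = gamma  # focusing parameter; gamma = 0 gives (alpha-weighted) BCE

    def forward(self, preds, targets):
        # Element-wise BCE computed from raw logits (numerically stable)
        bce_loss = nn.functional.binary_cross_entropy_with_logits(preds, targets, reduction="none")
        probas = torch.sigmoid(preds)

        # Positive term scaled by alpha * (1 - p) ** gamma, negative term by p ** gamma
        pos_loss = targets * self.alpha * (1.0 - probas) ** self.gamma * bce_loss
        neg_loss = (1.0 - targets) * probas ** self.gamma * bce_loss

        return (pos_loss + neg_loss).mean()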
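
Written out, for a logit z with p = sigmoid(z) and target y, each element contributes y · α · (1 - p)^γ · BCE + (1 - y) · p^γ · BCE, and the loss is the mean over all elements. This variant has no (1 - α) factor on the negative term, and `initialization()` uses α = 1. The NumPy mirror below uses hypothetical helper names and the standard stable BCE-with-logits formula. It checks two properties: with γ = 0 and α = 1 the loss reduces to plain BCE, and with γ = 2 confident correct predictions are down-weighted.

```python
import numpy as np


def bce_with_logits(logits: np.ndarray, targets: np.ndarray) -> np.ndarray:
    """Element-wise, numerically stable binary cross-entropy on logits."""
    return np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))


def bce_focal_loss(logits: np.ndarray, targets: np.ndarray,
                   alpha: float = 1.0, gamma: float = 2.0) -> float:
    """NumPy mirror of BCEFocalLoss.forward (mean over all elements)."""
    bce = bce_with_logits(logits, targets)
    p = 1.0 / (1.0 + np.exp(-logits))
    pos = targets * alpha * (1.0 - p) ** gamma * bce
    neg = (1.0 - targets) * p ** gamma * bce
    return float((pos + neg).mean())


logits = np.array([[3.0, -3.0], [0.0, 2.0]])
targets = np.array([[1.0, 0.0], [1.0, 0.0]])
plain_bce = float(bce_with_logits(logits, targets).mean())
print(bce_focal_loss(logits, targets, gamma=0.0), plain_bce)   # identical values
print(bce_focal_loss(logits, targets, gamma=2.0))              # smaller than plain BCE
```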


### **Metrics**
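Model performance is measured with torcheval's macro AUROC, macro precision, and micro and macro F1, plus micro F1 at 0.3 and 0.5 probability thresholds (torcheval and scikit-learn). All of these are computed in `evaluate_validation` below.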

## Training Loop

### **Initialization**
Sets up the model, optimizer, scheduler, scaler, and loss function based on the configuration.

In [None]:
def initialization():
    model = BirdModel(model_name=cfg.model_name, pretrained=True, in_channels=3, num_classes=len(LABELS), pool=cfg.pool_type).to(device)

    if cfg.optimizer == 'adan':
        optimizer = Adan(model.parameters(), lr=cfg.lr, betas=(0.02, 0.08, 0.01), weight_decay=cfg.weight_decay)
    else:
        optimizer = torch.optim.AdamW(params=model.parameters(), lr=cfg.lr, weight_decay=cfg.weight_decay)

    # Stepped once per batch; relies on the global train_dataloader built in the training loop
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer=optimizer, epochs=cfg.max_epoch,
        pct_start=0.0, steps_per_epoch=len(train_dataloader),
        max_lr=cfg.lr, div_factor=25, final_div_factor=4.0e-01
    )

    scaler = amp.GradScaler(enabled=cfg.enable_amp)
    if cfg.loss_type == "BCEFocalLoss":
        loss_func = BCEFocalLoss(alpha=1)
    elif cfg.loss_type == "BCEWithLogitsLoss":
        loss_func = torch.nn.BCEWithLogitsLoss()
    else:
        raise ValueError(f"unknown loss_type: {cfg.loss_type}")

    return model, optimizer, scheduler, scaler, loss_func.to(device)
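
With `pct_start=0.0` the warm-up phase has zero length, so `OneCycleLR` is essentially a cosine decay from `max_lr` to `max_lr / (div_factor * final_div_factor)` = 1e-3 / (25 × 0.4) = 1e-4. The decay runs over `max_epoch * steps_per_epoch` steps. Because `final_div_factor=0.4` is below 1, the final LR is larger than the nominal initial LR (`max_lr / div_factor` = 4e-5). The sketch below assumes PyTorch's default `anneal_strategy="cos"` and uses a hypothetical helper, `one_cycle_lr`. PyTorch's per-step values can differ slightly at the schedule boundaries.

```python
import math


def one_cycle_lr(frac: float, max_lr: float = 1e-3, div_factor: float = 25.0,
                 final_div_factor: float = 0.4) -> float:
    """Approximate OneCycleLR learning rate at fraction `frac` of training, pct_start=0."""
    initial_lr = max_lr / div_factor          # 4e-5 for the config above
    min_lr = initial_lr / final_div_factor    # 1e-4, larger than initial_lr here
    # Cosine interpolation from max_lr (frac=0) down to min_lr (frac=1)
    return min_lr + (max_lr - min_lr) * (math.cos(math.pi * frac) + 1.0) / 2.0


for frac in (0.0, 0.5, 1.0):
    print(frac, one_cycle_lr(frac))
```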


### **Training Functions**
**`train_one_loop`** handles a single epoch of training without mixup augmentation.



In [None]:
def train_one_loop(model, optimizer, scaler, scheduler, dataloader, loss_fn):
    trainloss = 0
    model.train()

    for data, label in tqdm(dataloader, leave=False, desc="[train]"):
        data, label = data.to(device), label.to(device)

        optimizer.zero_grad()
        # Mixed-precision forward pass (bfloat16 autocast)
        with amp.autocast(cfg.enable_amp, dtype=torch.bfloat16):
            pred = model(data)
            loss = loss_fn(pred, label)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # OneCycleLR is stepped once per batch

        trainloss += loss.item()
        del data, label, loss

    trainloss /= len(dataloader)
    if cfg.wandb:
        wandb.log({"train_loss": trainloss, "lr": scheduler.get_last_lr()[0]})
    return model, optimizer, scaler, scheduler, trainloss


**`mixup_one_loop`** handles a single epoch of training with mixup augmentation.

In [None]:
def mixup_one_loop(model, optimizer, scaler, scheduler, dataloader, loss_fn):
    trainloss = 0
    model.train()

    for data, label in tqdm(dataloader, leave=False, desc="[train]"):
        # With probability 1 - cfg.aug_spec_mixup_prob use batch mixup, otherwise spec_mixup
        # (spec_mixup is not shown in this write-up)
        if np.random.random() > cfg.aug_spec_mixup_prob:
            data, label = mixup(data=data, targets=label, alpha=cfg.alpha, mode="other_wave")
        else:
            data, label = spec_mixup(data=data, targets=label)
        data, label = data.to(device), label.to(device)

        optimizer.zero_grad()
        with amp.autocast(cfg.enable_amp, dtype=torch.bfloat16):
            pred = model(data)
            loss = loss_fn(pred, label)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()

        trainloss += loss.item()
        del data, label, loss

    trainloss /= len(dataloader)
    if cfg.wandb:
        wandb.log({"train_loss": trainloss, "lr": scheduler.get_last_lr()[0]})
    return model, optimizer, scaler, scheduler, trainloss


### **Evaluation Function**
Evaluates the model on the validation set, computing various metrics.

In [None]:
from sklearn import metrics


@torch.no_grad()
def evaluate_validation(model, dataloader, loss_fn):
    model.eval()

    preds, trues, targets = [], [], []

    for data, label in tqdm(dataloader, leave=False, desc="[valid]"):
        # valid_batchsize is 1: data[0] holds every window of one recording
        d = data[0].unsqueeze(1).to(device)   # (n_windows, 1, n_mels, T)
        label = label[0]                      # (n_windows, n_classes)
        pred = model(d)

        preds.extend(pred.cpu())
        trues.extend(label)
        targets.extend(label.argmax(dim=1))   # primary class index per window

    logits = torch.stack(preds)
    t = torch.sigmoid(logits)
    targets = torch.stack(targets)
    y_trues = torch.stack(trues)

    validloss = loss_fn(logits, y_trues).item()

    sk_f1_30 = metrics.f1_score(y_trues.numpy(), t.numpy() > 0.30, average="micro")
    sk_f1_50 = metrics.f1_score(y_trues.numpy(), t.numpy() > 0.50, average="micro")

    auc = multiclass_auroc(input=t, target=targets, num_classes=len(LABELS), average="macro").item()
    prec = multiclass_precision(input=t, target=targets, num_classes=len(LABELS), average="macro").item()

    f1 = multiclass_f1_score(input=t, target=targets, num_classes=len(LABELS), average="micro").item()
    f1_macro = multiclass_f1_score(input=t, target=targets, num_classes=len(LABELS), average="macro").item()

    t_03 = (t > 0.3).long()
    f1_03 = multiclass_f1_score(input=t_03, target=targets, num_classes=len(LABELS), average="micro").item()

    t_05 = (t > 0.5).long()
    f1_05 = multiclass_f1_score(input=t_05, target=targets, num_classes=len(LABELS), average="micro").item()

    if cfg.wandb:
        wandb.log({
            "valid_loss": validloss,
            "AUC": auc,
            "precision": prec,
            "F1": f1,
            "F1_macro": f1_macro,
            "F1 30%": f1_03,
            "F1 50%": f1_05
        })
    return validloss, auc, f1, f1_03, f1_05, sk_f1_30, sk_f1_50
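
`sk_f1_30` and `sk_f1_50` are micro-averaged F1 scores over every (window, class) cell after the probabilities are thresholded. The NumPy sketch below performs the same computation under one stated assumption: it binarizes the targets with `y_true > 0`. scikit-learn's `f1_score` expects binary indicator arrays and rejects soft values such as the 0.5 secondary labels. The helper name is hypothetical.

```python
import numpy as np


def micro_f1_at_threshold(y_true: np.ndarray, probs: np.ndarray, threshold: float) -> float:
    """Micro-averaged F1 over all (sample, class) cells after binarizing the probabilities."""
    pred = probs > threshold
    true = y_true > 0                  # any positive target counts as present (assumption)
    tp = np.sum(pred & true)
    fp = np.sum(pred & ~true)
    fn = np.sum(~pred & true)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


y_true = np.array([[1, 0, 0], [0, 1, 0]])
probs = np.array([[0.8, 0.4, 0.1], [0.2, 0.35, 0.6]])
print(micro_f1_at_threshold(y_true, probs, 0.3))   # 2 TP, 2 FP, 0 FN -> 0.667
print(micro_f1_at_threshold(y_true, probs, 0.5))   # 1 TP, 1 FP, 1 FN -> 0.5
```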


## Training Process


### **Experiment Naming and Setup**
The experiment names are managed based on whether the environment is Kaggle or local, and whether the task is training or inference.



In [None]:
if isTrain == False and isInference == True:
    newDir = False
else:
    newDir = True
print(newDir)

if KAGGLE == False:
    if isTrain == True:
        name = sorted(glob.glob("exp1*.ipynb"))[-1][:-6]
        print(f"filename is {name}")
    else:
        name = "exp1057"
        print(f"filename is {name}")
else:
    name = "exp1057"
    name = f'bird2024{name}'
    print(f"filename is {name}")
trial = "trial1"
p_name = f"BirdCLEF_cv_ver2"


### **Directory and Logging Setup**
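Logs in to Weights & Biases (WandB) for experiment tracking and prepares the checkpoint directory. When a new directory is requested, any existing `{name}/checkpoint` folder is deleted and recreated, so earlier checkpoints for the same experiment name are lost.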

In [None]:
if not KAGGLE:
    if cfg.wandb:
        # wandb.login() reads the WANDB_API_KEY environment variable (or prompts);
        # avoid hard-coding the key in the notebook
        wandb.login()

    if newDir:
        checkpoint_dir = f"{name}/checkpoint"
        shutil.rmtree(checkpoint_dir, ignore_errors=True)  # start from an empty folder
        os.makedirs(checkpoint_dir, exist_ok=True)


### **Cross-Validation Setup**
Implements Stratified K-Fold cross-validation to ensure each fold has a representative distribution of classes.

In [None]:
from sklearn.model_selection import StratifiedKFold

# Stratify on the primary label so every fold has a similar class distribution
train_csv["fold"] = -1
skf = StratifiedKFold(n_splits=cfg.nfolds, shuffle=True, random_state=cfg.seed)
for fold, (train_index, valid_index) in enumerate(skf.split(train_csv, train_csv["primary_label"])):
    train_csv.loc[valid_index, "fold"] = fold


## **Training Loop Execution**
For each fold listed in `cfg.inference_folds` (only fold 4 in the configuration above), the model is trained and then evaluated on that fold's held-out validation data.

In [None]:
if isTrain:
    set_random_seed(seed=cfg.seed, deterministic=cfg.deterministic)

    if cfg.wandb:
        # The original passes a tmp_params dict that is not shown; log the config attributes instead
        run_config = {k: v for k, v in vars(config).items() if not k.startswith("__")}
        wandb.init(project=p_name, name=f"{name}", config=run_config)

    for fold in cfg.inference_folds:
        train_ = train_csv.loc[train_csv["fold"] != fold]

        # get_oversampled_df is not shown in this write-up
        if cfg.oversample:
            train = get_oversampled_df(df=train_)
        else:
            train = train_

        # Augmented loader for the first cfg.aug_epoch epochs, clean loader afterwards
        augme_dataset = BirdCLEF_Dataset(df=train, augmentation=True, mode='train')
        augme_dataloader = torch.utils.data.DataLoader(dataset=augme_dataset, batch_size=cfg.train_batchsize, shuffle=True)

        train_dataset = BirdCLEF_Dataset(df=train, augmentation=False, mode='train')
        train_dataloader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=cfg.train_batchsize, shuffle=True)

        valid = train_csv.loc[train_csv["fold"] == fold]
        valid_dataset = BirdCLEF_Dataset(df=valid, augmentation=False, mode='valid')
        valid_dataloader = torch.utils.data.DataLoader(dataset=valid_dataset, batch_size=cfg.valid_batchsize, shuffle=False)

        model, optimizer, scheduler, scaler, loss_func = initialization()

        best_f1 = 0
        best_auc = 0
        best_loss = float("inf")  # the first epoch always counts as an improvement
        best_state = None
        for e in range(cfg.max_epoch):
            start_time = time.time()
            if e < cfg.aug_epoch:
                if cfg.aug_spec_mixup > np.random.random():
                    model, optimizer, scaler, scheduler, train_loss = mixup_one_loop(
                        model=model, optimizer=optimizer, scaler=scaler,
                        scheduler=scheduler, dataloader=augme_dataloader, loss_fn=loss_func
                    )
                else:
                    model, optimizer, scaler, scheduler, train_loss = train_one_loop(
                        model=model, optimizer=optimizer, scaler=scaler,
                        scheduler=scheduler, dataloader=augme_dataloader, loss_fn=loss_func
                    )
            else:
                model, optimizer, scaler, scheduler, train_loss = train_one_loop(
                    model=model, optimizer=optimizer, scaler=scaler,
                    scheduler=scheduler, dataloader=train_dataloader, loss_fn=loss_func
                )

            valid_loss, auc, f1, f1_03, f1_05, sk_f1_30, sk_f1_50 = evaluate_validation(
                model=model, dataloader=valid_dataloader, loss_fn=loss_func
            )

            end_time = time.time()
            if best_loss > valid_loss:
                print(f"[epoch {str(e).zfill(2)}] AUC{auc: .4f}, F1{f1: .4f}, F1_03{f1_03: .4f}, F1_05{f1_05: .4f}")
                print(f"[epoch {str(e).zfill(2)}] SKF1_03{sk_f1_30: .4f}, SKF1_05{sk_f1_50: .4f}")
                print(f"[epoch {str(e).zfill(2)}] valid_loss {valid_loss: .6f}")
                print(f"[epoch {str(e).zfill(2)}] update loss {best_loss: .6f} --> {valid_loss: .6f} {(end_time - start_time): .1f}[s]")
                print(f"[epoch {str(e).zfill(2)}] update auc score {best_auc: .6f} --> {auc: .6f} {(end_time - start_time): .1f}[s]")
                model_name = f'{name}/checkpoint/fold_{fold}_snapshot_epoch_{str(e).zfill(2)}.pth'
                # Copy the weights; keeping a reference to `model` would save the last epoch instead
                best_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
                best_loss = valid_loss
                best_auc = auc
                best_f1 = f1
            else:
                print(f"[epoch {str(e).zfill(2)}] NOT update loss {best_loss: .6f} <-- {valid_loss: .6f} {(end_time - start_time): .1f}[s]")
                print(f"[epoch {str(e).zfill(2)}] NOT update score {best_auc: .6f} <-- {auc: .6f} {(end_time - start_time): .1f}[s]")

        if cfg.wandb:
            wandb.log({"best_loss": best_loss, "best_f1": best_f1, "best_auc": best_auc})

        torch.save(best_state, model_name)

        del model, best_state
        gc.collect()
        torch.cuda.empty_cache()
        print("--")


**Process Breakdown:**
1. **Seed Initialization:** Ensures reproducible results.
1. **WandB Initialization:** Logs experiment details for monitoring.
1. **Cross-Validation Loop:** Iterates over each fold in `cfg.inference_folds`.
    * **Data Splitting:** Separates training and validation data based on the current fold.
    * **Oversampling:** Applies oversampling to handle class imbalance, if enabled.
    * **Dataset and Dataloader Creation:** Prepares an augmented training loader, a clean training loader, and a validation loader.
    * **Model Initialization:** Sets up the model, optimizer, scheduler, scaler, and loss function.
    * **Epoch Loop:** Trains the model for `cfg.max_epoch` epochs.
        1. **Augmentation Phase:** For the first `cfg.aug_epoch` epochs, trains on the augmented loader and uses mixup (`mixup_one_loop`) with probability `cfg.aug_spec_mixup` per epoch.
        1. **Training Phase:** After the augmentation phase, trains on the clean loader. Each step runs the forward and backward passes and updates the weights.
        1. **Validation Phase:** Evaluates model performance on the validation set.
        1. **Checkpointing:** Tracks the epoch with the lowest validation loss and saves that checkpoint at the end of the fold.
1. **Cleanup:** Frees memory by deleting models and clearing caches.

## **Inference**
After training, the model is used to make predictions on test data. The trained models are converted to the ONNX format for optimized inference.

### **Loading Models**
Loads the best model checkpoints and prepares them for inference.

In [None]:
models = dict()
models_names = dict()

for fold in cfg.inference_folds:
    if KAGGLE == True:
        bestmodel_path = sorted(glob.glob(f"/kaggle/input/{name}/checkpoint/fold_{fold}*.pth"))[-1]
    else:
        bestmodel_path = sorted(glob.glob(f"{name}/checkpoint/fold_{fold}*.pth"))[-1]
    print(bestmodel_path)
    # The pool type must match the one used for training
    model = BirdModel(model_name=cfg.model_name, pretrained=False, in_channels=1, num_classes=len(LABELS), pool=cfg.pool_type)
    model.load_state_dict(torch.load(bestmodel_path, map_location=torch.device('cpu')))
    model = model.eval()
    models[fold] = model

    models_names[fold] = os.path.splitext(bestmodel_path)[0] + ".onnx"
    print(models_names[fold])


### **Preparing Test Data**
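Sets up the test dataset and dataloader based on the environment. On Kaggle it uses every test soundscape. Locally it uses only the first three unlabeled soundscapes, which is enough to time the pipeline.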

In [None]:
if KAGGLE == True:
    test_audio_dir = f"{cfg.dir}test_soundscapes/"
    file_list = glob.glob(test_audio_dir + "*.ogg")
    file_list = sorted(file_list)
else:
    test_audio_dir = f"{cfg.dir}unlabeled_soundscapes/"
    file_list = glob.glob(test_audio_dir + "*.ogg")
    file_list = sorted(file_list)[:3]
test_dataset = BirdCLEF_Dataset(df=file_list, mode="test")
test_dataloader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=1, shuffle=False)
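
This write-up does not show the test-mode branch of `__getitem__`. The ONNX dummy input below has 48 windows, and 48 × 5 s = 240 s, which suggests each soundscape is cut into consecutive 5-second windows. The NumPy sketch below shows that windowing. It assumes non-overlapping windows and zero-padding of a short final window (both assumptions), and `split_into_windows` is a hypothetical helper.

```python
import numpy as np


def split_into_windows(wave: np.ndarray, sr: int = 32000, window_s: int = 5) -> np.ndarray:
    """Cut a mono waveform into consecutive non-overlapping windows, zero-padding the tail."""
    win = sr * window_s
    n_windows = -(-len(wave) // win)                  # ceiling division
    padded = np.zeros(n_windows * win, dtype=wave.dtype)
    padded[: len(wave)] = wave
    return padded.reshape(n_windows, win)


soundscape = np.zeros(32000 * 240, dtype=np.float32)  # a silent 240-second recording
windows = split_into_windows(soundscape)
print(windows.shape)  # (48, 160000)
```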


### **Exporting Models to ONNX**
Converts the trained PyTorch models to the ONNX format for efficient inference.

In [None]:
# Dummy input: 48 five-second windows (240 s of audio) x 1 channel x n_mels x (size_x + 1) frames
input_tensor = torch.randn((48, 1, cfg.n_mels, cfg.size_x + 1))
output_names = ["output"]
input_names = ["x"]

if not KAGGLE:
    for fold in cfg.inference_folds:
        torch.onnx.export(
            model=models[fold].eval(),
            args=(input_tensor,),  # a tuple of positional model inputs
            input_names=input_names,
            output_names=output_names,
            f=models_names[fold]
        )


### **Loading ONNX Models**
Loads the ONNX models using onnxruntime for inference.

In [None]:
import onnx
import onnxruntime as ort

onnx_sessions = dict()

for fold in cfg.inference_folds:
    if KAGGLE:
        onnxmodel_path = sorted(glob.glob(f"/kaggle/input/{name}/checkpoint/fold_{fold}*.onnx"))[-1]
    else:
        onnxmodel_path = sorted(glob.glob(f"{name}/checkpoint/fold_{fold}*.onnx"))[-1]
    print(onnxmodel_path)
    models_names[fold] = onnxmodel_path

    onnx_model = onnx.load(onnxmodel_path)
    # CPU inference; listing the provider explicitly avoids errors on GPU-enabled builds
    onnx_sessions[fold] = ort.InferenceSession(
        onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
    )


### **Making Predictions**
Performs inference on the test dataset and aggregates predictions from all folds.

In [None]:
start_time = time.time()

predictions = []
for data in tqdm(test_dataloader):
    preds = []
    for fold in cfg.inference_folds:
        session = onnx_sessions[fold]
        pred = session.run(output_names, {input_names[0]: data[0].numpy()})[0]
        preds.append(torch.sigmoid(torch.tensor(pred)))  # logits -> probabilities
    # Average the probabilities of all folds for this soundscape
    preds_per_batch = torch.stack(preds, dim=0).mean(dim=0)
    predictions.extend(preds_per_batch)

if len(predictions) > 0:
    predictions = torch.stack(predictions)
use_time = time.time() - start_time
n_files = max(len(file_list), 1)
print(f"folds {cfg.inference_folds}: {len(file_list)} ogg files took {use_time:.1f} s")
print(f"folds {cfg.inference_folds}: estimated {1100 * use_time / n_files:.1f} s ({1100 * use_time / n_files / 60:.1f} min) for 1,100 ogg files")
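
The loop averages probabilities, not logits: each fold's output passes through a sigmoid before the mean. The two orders are not equivalent. For logits 0 and 4, the mean of the sigmoids is about 0.741, while the sigmoid of the mean logit is about 0.881. The NumPy sketch below uses a hypothetical helper and random logits as stand-ins for real model outputs. The 182 classes match the class count hard-coded in the visualization code.

```python
import numpy as np


def ensemble_probabilities(fold_logits: list[np.ndarray]) -> np.ndarray:
    """Apply a sigmoid to each fold's logits, then average the probabilities."""
    probs = [1.0 / (1.0 + np.exp(-logits)) for logits in fold_logits]
    return np.mean(np.stack(probs, axis=0), axis=0)


# Two hypothetical folds, 48 five-second windows x 182 classes each
rng = np.random.default_rng(0)
fold_logits = [rng.normal(size=(48, 182)) for _ in range(2)]
probs = ensemble_probabilities(fold_logits)
print(probs.shape)  # (48, 182)
```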


## **Visualization**
Provides visual insights into model predictions and performance metrics.

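### **Example Spectrograms**
Displays the spectrograms the dataset produces for one training sample (with augmentation) and for the same clip in validation mode, with one panel per returned chunk. This shows what the model actually receives as input.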

In [None]:
if isTrain:
    # Training mode: a single (augmented) spectrogram per sample
    print("train data")
    dataset = BirdCLEF_Dataset(df=train_csv, augmentation=True, mode="train")
    data, target = dataset[270]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.imshow(data[0], cmap="jet", origin="lower")
    plt.show()

    # Validation mode: several chunks per clip, one row each; augmentation disabled
    print("validation data")
    dataset = BirdCLEF_Dataset(df=train_csv, augmentation=False, mode="valid")
    data, target = dataset[270]
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(data), tight_layout=True)
    for idx, ax in enumerate(np.atleast_1d(axes)):
        ax.imshow(data[idx], cmap="jet", origin="lower")
    plt.show()
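In validation mode the dataset returns several panels per clip. The NumPy sketch below shows one common way to produce such chunks: the waveform is cut into non-overlapping 5-second windows, and the last window is zero-padded. The 32 kHz sample rate and the helper `split_into_windows` are assumptions. This is not the dataset's actual code.

```python
import numpy as np


def split_into_windows(wave: np.ndarray, sr: int = 32000, window_s: float = 5.0) -> np.ndarray:
    """Split a 1-D waveform into consecutive, non-overlapping windows (hypothetical helper).

    Assumes 32 kHz audio and 5 s windows by default. The last window is zero-padded
    so every window has sr * window_s samples. Returns shape (n_windows, window_len).
    """
    window_len = int(sr * window_s)
    n_windows = max(1, int(np.ceil(len(wave) / window_len)))
    padded = np.zeros(n_windows * window_len, dtype=wave.dtype)
    padded[: len(wave)] = wave
    return padded.reshape(n_windows, window_len)


# A 4-minute clip at 32 kHz yields 48 windows, matching the 48 rows per soundscape
clip = np.random.default_rng(0).standard_normal(4 * 60 * 32000).astype(np.float32)
windows = split_into_windows(clip)
print(windows.shape)  # (48, 160000)
```

Each window would then be converted to a spectrogram separately, which would account for the one-panel-per-row layout in the validation figure.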


### **Evaluation Metrics Visualization**
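Counts the recordings per primary label in each fold, so the label distributions of the folds can be compared. The plots of the distribution and of the per-class AUC scores are left as a placeholder.

In [None]:
if not KAGGLE:
    # Number of recordings for each (fold, primary_label) pair
    fold_label_counts = train_csv.groupby("fold", as_index=False)["primary_label"].value_counts()
    # Jupyter does not echo an expression inside an if block, so display it explicitly
    display(fold_label_counts)

# Additional plots for label distribution and model performance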
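The AUC scores mentioned above are usually per-class ROC-AUC values. The NumPy sketch below computes them with the pairwise (Mann-Whitney) definition and skips species that have no positive or no negative samples. This is similar in spirit to a macro-averaged ROC-AUC that ignores species without positives. The helper names are hypothetical, and this is not the official scoring code.

```python
import numpy as np


def binary_auc(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """ROC-AUC as the probability that a random positive outranks a random negative.

    Ties count as half. Cost is O(n_pos * n_neg), fine for a validation fold.
    """
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


def per_class_auc(y_true: np.ndarray, y_score: np.ndarray) -> dict:
    """Return {class_index: AUC} for classes with both positives and negatives."""
    scores = {}
    for c in range(y_true.shape[1]):
        n_pos = int(y_true[:, c].sum())
        if n_pos == 0 or n_pos == len(y_true):
            continue  # AUC is undefined unless both outcomes are present
        scores[c] = binary_auc(y_true[:, c], y_score[:, c])
    return scores


# Hypothetical multi-hot targets and probabilities: 4 samples x 3 classes
y_true = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 0, 1]])
y_score = np.array([[0.9, 0.1, 0.2], [0.3, 0.2, 0.8], [0.6, 0.4, 0.1], [0.7, 0.3, 0.4]])
aucs = per_class_auc(y_true, y_score)
print(aucs)  # class 1 is skipped: it has no positives
print(np.mean(list(aucs.values())))  # macro average over the scored classes
```

For a single class, scikit-learn's `roc_auc_score` computes the same quantity and can serve as a cross-check.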


### **Prediction Visualization**
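Plots a heatmap of one fold's predicted probabilities, with one row per 5-second window and one column per species. The true labels of the recording are marked with dashed red lines.
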
In [None]:
if not KAGGLE:
    USE_FOLD = 4  # fold whose validation split and model are visualized

    def predict_and_visualize(data_index, use_fold=USE_FOLD):
        c_dataset, df = get_fold_data(fold=use_fold)
        model = models[use_fold].to(device).eval()

        inputs, target = c_dataset[data_index]
        with torch.no_grad():
            prediction = torch.sigmoid(model(inputs.to(device))).cpu().numpy()

        # Ground-truth species, placed at the centres of their heatmap columns
        df_index = df.index[data_index]
        true_labels = df.loc[df_index, "new_target"].split()
        true_guide_pos = [LABELS.index(l) + 0.5 for l in true_labels]

        fig, ax = plt.subplots(figsize=(24, 1.5))
        sns.heatmap(prediction, cmap="jet", vmin=0, vmax=1, ax=ax)

        # Heatmap cells are centred at i + 0.5, so offset the ticks to match
        ax.set_xticks(np.arange(len(LABELS)) + 0.5)
        ax.set_xticklabels(LABELS, fontsize=8, color="black")
        # Each row is a 5-second window, labelled with its end time in seconds
        ax.set_yticks(np.arange(prediction.shape[0]) + 0.5)
        ax.set_yticklabels(np.arange(1, 1 + prediction.shape[0]) * 5)

        for pos in true_guide_pos:
            ax.axvline(x=pos, color="red", ls="--", lw=0.9)

        ax.set_title(" ".join(true_labels))
        plt.show()

        return prediction

    # A few examples from the start of the fold
    for i in range(5):
        predict_and_visualize(data_index=i)

    # Then every 30th sample of the same fold
    N = len(train_csv.loc[train_csv["fold"] == USE_FOLD]) // 30
    print(N)
    for i in range(N):
        try:
            predict_and_visualize(data_index=i * 30)
        except Exception as e:
            print(f"sample {i * 30} skipped: {e}")
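With 182 columns, the heatmap can be hard to read. The hypothetical NumPy helper below, which is not part of the notebook, gives a text version of the same plot. For each 5-second window it lists the top-k species and whether any of them is a true label. The species names in the example are placeholders.

```python
import numpy as np


def top_k_per_window(probs: np.ndarray, labels: list, true_labels: list, k: int = 3) -> list:
    """Summarise a (n_windows, n_classes) probability matrix window by window.

    Returns one dict per window with its end time in seconds (assuming 5 s windows),
    the k most probable labels with their scores, and whether any is a true label.
    """
    rows = []
    for w, p in enumerate(probs):
        top_idx = np.argsort(p)[::-1][:k]  # indices of the k highest probabilities
        top = [(labels[i], float(p[i])) for i in top_idx]
        rows.append({
            "end_s": (w + 1) * 5,
            "top": top,
            "hit": any(name in true_labels for name, _ in top),
        })
    return rows


labels = ["species_a", "species_b", "species_c", "species_d"]  # placeholder names
probs = np.array([[0.10, 0.80, 0.05, 0.30],
                  [0.70, 0.20, 0.10, 0.60]])
summary = top_k_per_window(probs, labels, true_labels=["species_b"], k=2)
for row in summary:
    print(row)
```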


## **ONNX Export**
Converts trained PyTorch models into the ONNX format, facilitating optimized and platform-independent deployment.

### **Setting ONNX Configuration**
Defines input and output parameters for the ONNX export process.

In [None]:
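# Dummy input used to trace the graph: (batch, channels, n_mels, frames).
# The batch of 48 matches the 48 five-second windows per 4-minute soundscape used for the submission.
input_tensor = torch.randn((48, 1, cfg.n_mels, cfg.size_x + 1))
output_names = ["output"]
input_names = ["x"]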


### **Exporting Models**
Exports each trained model to the ONNX format.

In [None]:
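if not KAGGLE:
    for fold in cfg.inference_folds:
        # Export on CPU so that the model and the dummy input are on the same device
        torch.onnx.export(
            model=models[fold].cpu().eval(),
            args=(input_tensor,),  # positional inputs passed as a one-element tuple
            f=models_names[fold],
            input_names=input_names,
            output_names=output_names,
        )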


### **Verifying ONNX Models**
Loads each exported model, validates its structure with the ONNX checker, and creates an onnxruntime inference session for it.

In [None]:
import onnx
import onnxruntime as ort

onnx_sessions = {}
# On Kaggle the uploaded checkpoints are mounted under /kaggle/input
ckpt_root = f"/kaggle/input/{name}/checkpoint" if KAGGLE else f"{name}/checkpoint"

for fold in cfg.inference_folds:
    # Use the last matching checkpoint (in sorted order) for this fold
    onnxmodel_path = sorted(glob.glob(f"{ckpt_root}/fold_{fold}*.onnx"))[-1]
    print(onnxmodel_path)
    models_names[fold] = onnxmodel_path

    onnx_model = onnx.load(onnxmodel_path)
    onnx.checker.check_model(onnx_model)  # raises if the graph is malformed
    onnx_sessions[fold] = ort.InferenceSession(
        onnx_model.SerializeToString(), providers=["CPUExecutionProvider"]
    )
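Loading a model and building a session shows that the file can be read. It does not show that the model computes the same numbers as the PyTorch model. One way to check that is to run the same input through both and compare the outputs. The helper below does only the comparison, on NumPy arrays, so it runs without PyTorch. Its name `compare_outputs` and the tolerances are assumptions. The trailing comment sketches how it might be called.

```python
import numpy as np


def compare_outputs(torch_out: np.ndarray, onnx_out: np.ndarray,
                    rtol: float = 1e-3, atol: float = 1e-5) -> float:
    """Raise AssertionError if the outputs differ beyond the tolerances (hypothetical helper).

    Returns the maximum absolute difference for logging.
    """
    torch_out = np.asarray(torch_out, dtype=np.float64)
    onnx_out = np.asarray(onnx_out, dtype=np.float64)
    if torch_out.shape != onnx_out.shape:
        raise AssertionError(f"shape mismatch: {torch_out.shape} vs {onnx_out.shape}")
    np.testing.assert_allclose(onnx_out, torch_out, rtol=rtol, atol=atol)
    return float(np.abs(onnx_out - torch_out).max())


# Possible use inside the loading loop (not executed here):
#   x = torch.randn(48, 1, cfg.n_mels, cfg.size_x + 1)
#   with torch.no_grad():
#       ref = models[fold].cpu().eval()(x).numpy()
#   out = onnx_sessions[fold].run(None, {"x": x.numpy()})[0]
#   print(fold, compare_outputs(ref, out))

ref = np.array([[0.5, -1.25], [2.0, 0.0]])
print(compare_outputs(ref, ref + 1e-7))
```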


### **Performing Inference with ONNX Models**
Runs the fold ensemble with onnxruntime on the test data and measures how long inference takes.

In [None]:
start_time = time.time()

predictions = []
for data in tqdm(test_dataloader):
    preds = []
    for fold in cfg.inference_folds:
        session = onnx_sessions[fold]
        # The exported model returns logits; convert them to probabilities
        logits = session.run(output_names, {input_names[0]: data[0].numpy()})[0]
        preds.append(torch.sigmoid(torch.from_numpy(logits)))
    # Average the per-fold probabilities for this batch
    preds_per_batch = torch.stack(preds, dim=0).mean(dim=0)
    predictions.extend(preds_per_batch)

if predictions:
    predictions = torch.stack(predictions)  # shape: (n_windows, n_classes)

use_time = time.time() - start_time
# Linear extrapolation from the 3 timed files to 1,100 files
est_time = 1100 * use_time / 3
print(f"folds {cfg.inference_folds}: 3 ogg files took {use_time:.1f} s")
print(f"folds {cfg.inference_folds}: 1,100 ogg files would take about {est_time:.1f} s ({est_time / 60:.1f} min)")
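The two timing prints extrapolate linearly from the 3 timed files to 1,100 files. The small helper below spells out that arithmetic and compares the result with a time budget. `estimate_runtime` is hypothetical, and the 120-minute default budget is an assumption that should be adjusted to the competition's actual limit.

```python
def estimate_runtime(elapsed_s: float, n_measured: int = 3, n_target: int = 1100,
                     budget_min: float = 120.0) -> dict:
    """Linearly extrapolate inference time from a small timed run (hypothetical helper).

    elapsed_s: wall-clock seconds measured for n_measured files.
    Assumes a constant per-file cost (ignores warm-up and I/O effects).
    budget_min: assumed time limit in minutes.
    """
    if n_measured <= 0:
        raise ValueError("n_measured must be positive")
    per_file_s = elapsed_s / n_measured
    total_s = per_file_s * n_target
    total_min = total_s / 60
    return {
        "per_file_s": per_file_s,
        "total_s": total_s,
        "total_min": total_min,
        "fits_budget": total_min <= budget_min,
    }


# e.g. 3 files timed at 9.0 s -> 3.0 s per file -> 3,300 s (55 min) for 1,100 files
result = estimate_runtime(9.0)
print(result)
```

The first batch usually includes session warm-up, so timing more than 3 files and excluding that batch tends to give a more reliable linear estimate.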


### **Generating Submission File**
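Creates the submission CSV. It has one row per 5-second window of each test soundscape, with row_id set to "<file name>_<window end time in seconds>", and one probability column per species.

In [None]:
bird_cols = sample_submission.columns[1:]
df = pd.DataFrame(columns=["row_id"] + list(bird_cols))

# Each 4-minute soundscape is scored in 48 windows of 5 seconds;
# row_id is "<file name without extension>_<window end time in seconds>"
SOUNDSCAPE_SECONDS = 4 * 60
WINDOW_SECONDS = 5
row_list = []
for file in file_list:
    dataname = os.path.splitext(os.path.basename(file))[0]
    for i in range(SOUNDSCAPE_SECONDS // WINDOW_SECONDS):
        row_list.append(f"{dataname}_{(i + 1) * WINDOW_SECONDS}")
df["row_id"] = row_list

# Without predictions (e.g. no test files), the probability columns stay empty
if len(predictions) > 0:
    df[bird_cols] = predictions.numpy()
df.to_csv("submission.csv", index=False)

# Highest predicted probability per species in the first soundscape (first 48 rows)
df[:48].set_index("row_id").max().plot(kind="bar", figsize=(24, 4))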
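A few format checks before submitting can catch mistakes such as a wrong row count or NaN probabilities. The code above leaves NaN probabilities when there are no predictions. The pandas sketch below runs these checks. `check_submission` is a hypothetical helper, and the row_id pattern it expects follows the code above.

```python
import re

import numpy as np
import pandas as pd


def check_submission(sub: pd.DataFrame, n_files: int, windows_per_file: int = 48) -> list:
    """Return a list of problems found in a submission DataFrame (empty if none).

    Hypothetical helper; assumes row_id = "<file name>_<window end second>".
    """
    problems = []
    if sub.columns[0] != "row_id":
        problems.append("first column must be row_id")
    if len(sub) != n_files * windows_per_file:
        problems.append(f"expected {n_files * windows_per_file} rows, got {len(sub)}")
    bad_ids = [r for r in sub["row_id"] if not re.fullmatch(r".+_\d+", str(r))]
    if bad_ids:
        problems.append(f"{len(bad_ids)} malformed row_id values, e.g. {bad_ids[0]!r}")
    probs = sub.iloc[:, 1:].to_numpy(dtype=float)
    if np.isnan(probs).any():
        problems.append("NaN probabilities")
    elif (probs < 0).any() or (probs > 1).any():
        problems.append("probabilities outside [0, 1]")
    return problems


# One hypothetical soundscape with two placeholder species columns
rows = [f"soundscape_1_{(i + 1) * 5}" for i in range(48)]
good = pd.DataFrame({"row_id": rows, "species_a": 0.1, "species_b": 0.9})
print(check_submission(good, n_files=1))  # []
bad = good.assign(species_a=np.nan)
print(check_submission(bad, n_files=1))
```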
