# Deep learning algorithms to classify audio

In [17]:

%pip install tensorflow
%pip install keras
%pip install -U tensorflow-addons

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow-addons
  Downloading tensorflow_addons-0.22.0-cp310-cp310-win_amd64.whl.metadata (1.8 kB)
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Downloading tensorflow_addons-0.22.0-cp310-cp310-win_amd64.whl (719 kB)
   ---------------------------------------- 0.0/719.8 kB ? eta -:--:--
   --------------------------------------- 719.8/719.8 kB 28.6 MB/s eta 0:00:00
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
Successfully installed tensorflow-addons-0.22.0 typeguard-2.1

In [19]:
# Standard library
import gc
import glob
import logging
import os
import random
import re
import sys
import time
import warnings
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Third-party
import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, roc_auc_score)
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm


warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.ERROR)

os.environ["CUDA_VISIBLE_DEVICES"] = ""

print(tf.__version__)
print(dir(tf.keras))

2.19.0
['DTypePolicy', 'FloatDTypePolicy', 'Function', 'Initializer', 'Input', 'InputSpec', 'KerasTensor', 'Layer', 'Loss', 'Metric', 'Model', 'Operation', 'Optimizer', 'Quantizer', 'Regularizer', 'RematScope', 'Sequential', 'StatelessScope', 'SymbolicScope', 'Variable', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'activations', 'applications', 'backend', 'callbacks', 'config', 'constraints', 'datasets', 'device', 'distribution', 'dtype_policies', 'export', 'initializers', 'layers', 'legacy', 'losses', 'metrics', 'mixed_precision', 'models', 'name_scope', 'ops', 'optimizers', 'preprocessing', 'quantizers', 'random', 'regularizers', 'remat', 'tree', 'utils', 'version', 'visualization', 'wrappers']


## BirdCLEF 2025: ResNet-based Multi-label Classification Approach

In this Kaggle competition, our primary task is to **identify multiple animal species (birds, amphibians, mammals, insects)** from soundscape recordings. We built a robust classification pipeline leveraging deep learning, particularly using a **ResNet50** model in a multi-label classification setting.

Below is a detailed breakdown of our methodology, rationale, and technical choices.

---

### **1. Data Preparation and Representation**

#### **Input Data: Mel-Spectrograms**
- **Shape:** Each audio sample was preprocessed into a Mel-spectrogram, represented by a `(128, 256)` array.  
- **Reasoning:** Mel-spectrograms effectively represent audio signals in the frequency-time domain, capturing essential acoustic features useful for species identification.

#### **Labels: Single-label to Multi-label Conversion**
- Original dataset provided **single-label annotations**, i.e., each recording labeled with one primary species.
- However, competition guidelines require **multi-label outputs** (probabilities for all species).
- **Our approach**: Converted single-label annotations into a sparse **one-hot encoding** multi-label format (206 species), allowing the network to output independent probabilities for each species.
- **Why?** Even though each training example has only one positive class, a multi-label approach is beneficial for flexibility at inference, allowing independent species probability estimation.

---

### **2. Model Architecture: ResNet50**

#### **Why ResNet50?**
- **Residual Networks (ResNets)** have been proven effective in complex feature extraction, significantly reducing the **vanishing gradient** problem through residual connections.
- **Preliminary advantage:** ResNet captures hierarchical audio features—crucial for distinguishing subtle differences in species-specific calls.

#### **Input Adaptation**
- **Issue:** Standard ResNet expects **3-channel inputs (RGB images)**, but our Mel-spectrograms have only **1 channel**.
- **Solution:** Added a simple `Conv2D(3, (1,1))` layer to convert the single-channel input into a 3-channel representation.  
  **Reasoning:** This minimal adaptation allows efficient usage of standard pre-trained architectures with minimal computational overhead.
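
To make the channel adaptation concrete, here is a minimal NumPy sketch of what a 1x1 convolution from 1 to 3 channels computes. The weight and bias values are hypothetical placeholders (in the real model they are learned), and `conv1x1_expand` is an illustrative helper, not part of Keras.

```python
import numpy as np

def conv1x1_expand(mel, kernel, bias):
    """Map a (H, W, 1) input to (H, W, C) the way a 1x1 convolution does.

    kernel: (C,) one weight per output channel (hypothetical values here)
    bias:   (C,) one bias per output channel
    """
    # A 1x1 kernel has no spatial extent, so every output pixel is just the
    # input pixel scaled by a per-channel weight plus a per-channel bias.
    return mel * kernel.reshape(1, 1, -1) + bias.reshape(1, 1, -1)

rng = np.random.default_rng(0)
mel = rng.normal(size=(128, 256, 1)).astype(np.float32)  # stand-in spectrogram
kernel = np.array([1.0, 0.5, -2.0], dtype=np.float32)    # illustrative weights
bias = np.zeros(3, dtype=np.float32)

rgb_like = conv1x1_expand(mel, kernel, bias)
print(rgb_like.shape)  # (128, 256, 3)
```

Each of the three output channels is a rescaled copy of the same spectrogram, so the layer adds only 3 weights and 3 biases.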

#### **Model Output (Sigmoid)**
- **Activation:** Used **sigmoid** activation in the final dense layer (206 neurons), with each neuron independently representing the probability of a species' presence.
- **Loss function:** **Binary Cross-Entropy (BCE)** suited for multi-label classification.
- **Label smoothing:** Applied BCE with label smoothing (`label_smoothing=0.05`) to regularize training, prevent overfitting, and ensure generalization.
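
As a rough illustration of what label smoothing does to the targets, the sketch below applies the usual binary smoothing rule (targets pulled toward 0.5) and compares the BCE before and after. The probabilities are made-up values, and both helper functions are illustrative rather than the Keras internals.

```python
import numpy as np

def smooth_binary_labels(y_true, label_smoothing):
    # Pull hard 0/1 targets toward 0.5: 1 -> 1 - ls/2 and 0 -> ls/2.
    return y_true * (1.0 - label_smoothing) + 0.5 * label_smoothing

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Mean element-wise BCE over all (sample, class) entries.
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    per_element = -(y_true * np.log(y_prob)
                    + (1.0 - y_true) * np.log(1.0 - y_prob))
    return per_element.mean()

y_true = np.array([[0.0, 1.0, 0.0, 0.0]])    # one-hot target, 4 toy classes
y_prob = np.array([[0.1, 0.8, 0.05, 0.2]])   # hypothetical sigmoid outputs

y_smooth = smooth_binary_labels(y_true, label_smoothing=0.05)
print(y_smooth)  # [[0.025 0.975 0.025 0.025]]
print(binary_cross_entropy(y_true, y_prob))    # loss against hard targets
print(binary_cross_entropy(y_smooth, y_prob))  # loss against smoothed targets
```

With smoothed targets the loss can no longer be driven to zero by pushing outputs to exactly 0 or 1, which is the regularizing effect described above.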

---

### **3. Deep Learning Techniques Used**

We employed multiple deep learning techniques learned in class to enhance model performance:

#### **Dropout (regularization)**
- **Usage:** Applied `Dropout(0.3)` after global average pooling (GAP).
- **Purpose:** Reduces overfitting by randomly ignoring neurons, helping generalization.
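
The following is a small sketch of inverted dropout, the scheme commonly used by deep learning frameworks: units are zeroed at random during training and the survivors are rescaled so the expected activation is unchanged. The feature tensor is a stand-in for the pooled ResNet output, and `inverted_dropout` is an illustrative helper.

```python
import numpy as np

def inverted_dropout(x, rate, rng, training=True):
    # At inference time dropout is a no-op.
    if not training or rate == 0.0:
        return x
    keep_prob = 1.0 - rate
    # Keep each unit with probability keep_prob, then rescale the survivors
    # so the expected value of every unit matches the no-dropout case.
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

rng = np.random.default_rng(42)
features = np.ones((4, 2048), dtype=np.float32)  # stand-in for GAP output

dropped = inverted_dropout(features, rate=0.3, rng=rng)
print("fraction zeroed:", (dropped == 0).mean())   # roughly 0.3
print("mean activation:", dropped.mean())          # roughly 1.0
```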

#### **ReLU Activations (non-linearity)**
- **Built-in in ResNet:** ResNet layers inherently include multiple **Rectified Linear Units (ReLU)**.
- **Purpose:** Introduces non-linearity into the model, crucial for learning complex acoustic feature mappings.

#### **Max Pooling (feature extraction & reduction)**
- **Built-in in ResNet:** Reduces spatial dimensions, helping to abstract and distill relevant features, minimizing sensitivity to minor temporal shifts in audio.

#### **Data Augmentation (generalization)**
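- **Techniques:** Applied simple augmentations like random horizontal flips (`RandomFlip`) and rotations (`RandomRotation`) as illustrative examples.
- **Caveat:** These are image-style transforms. On a spectrogram, a horizontal flip reverses time and a rotation mixes the time and frequency axes, so they serve here only as placeholders.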

#### **Adam Optimizer (efficient training)**
- **Why Adam?** An adaptive optimizer that adjusts learning rates automatically, accelerating convergence and handling noisy gradients well—particularly beneficial for audio data.
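
To show what "adaptive" means in practice, here is a minimal sketch of a single Adam update in NumPy, using the standard update rule with the learning rate from this notebook (`1e-4`). The parameter and gradient values are toy numbers, and `adam_step` is an illustrative helper.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-7):
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Each parameter's step is normalized by its own gradient magnitude.
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, 1.0])
grad = np.array([100.0, 0.01])   # very different gradient scales (toy values)
m = np.zeros(2)
v = np.zeros(2)

new_param, m, v = adam_step(param, grad, m, v, t=1)
step = param - new_param
print(step)  # both entries are close to the learning rate, 1e-4
```

Although the two gradients differ by four orders of magnitude, both parameters move by about the same amount, which is why Adam copes well with noisy or badly scaled gradients.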

#### **Early Stopping (training efficiency)**
- **Purpose:** Stops training automatically when validation loss ceases improvement (`patience=5`).  
- **Benefit:** Prevents unnecessary training, reduces overfitting, and saves computational resources.
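
The patience rule can be sketched in plain Python. The loss values below are toy numbers, not results from this notebook, and `early_stopping_epoch` is an illustrative helper that assumes any strict decrease counts as an improvement.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to beat the
    best validation loss seen so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Patience never ran out: training used every epoch.
    return len(val_losses), best_epoch

toy_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.36, 0.38, 0.36, 0.30]
stop_epoch, best_epoch = early_stopping_epoch(toy_losses, patience=5)
print(stop_epoch, best_epoch)  # 8 3
```

Note that the run stops at epoch 8 and never sees the better loss at epoch 9. With `restore_best_weights=True`, the weights from the best epoch (here epoch 3) are the ones kept.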

#### **Class Weighting (addressing imbalance)**
- Dataset exhibited significant class imbalance (some species had very few samples).
- **Approach:** Implemented `class_weight` inversely proportional to species frequency.
- **Effect:** Ensures that rare species are appropriately emphasized during training, improving their identification accuracy.
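
A toy example helps show the effect of inverse-frequency weights. It uses the same `max_count / freq` rule as this notebook on made-up labels, and assumes each sample's weight is looked up from its single positive class (the argmax of its one-hot row).

```python
from collections import Counter

import numpy as np

labels = ["a", "a", "a", "a", "b", "b", "c"]   # toy, imbalanced labels
label_to_idx = {lb: i for i, lb in enumerate(sorted(set(labels)))}

counts = Counter(labels)
max_count = max(counts.values())
# Rarer classes get proportionally larger weights.
class_weight = {label_to_idx[lb]: max_count / n for lb, n in counts.items()}
print(class_weight)

# One-hot targets, one positive class per sample.
indices = [label_to_idx[lb] for lb in labels]
one_hot = np.eye(len(label_to_idx))[indices]

# Per-sample weight chosen by the sample's positive class.
sample_weight = np.array([class_weight[i] for i in one_hot.argmax(axis=1)])

# Total weight contributed by each class is now equal.
per_class_total = one_hot.T @ sample_weight
print(per_class_total)
```

After weighting, each class contributes the same total weight to the loss regardless of how many samples it has.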

---

### **4. Training and Validation**

#### **Dataset Split**
- Training set: 80% | Validation set: 20%, stratified by species.
- **Rationale:** Provides a robust estimate of model performance on unseen data and helps identify overfitting early.

#### **Metrics for Evaluation**
- Initially monitored `accuracy` and `loss` via `model.evaluate`.
- **Extended Evaluation:** Used external tools (`scikit-learn classification_report`) to compute detailed precision, recall, and F1-scores.
- **Reasoning:** Accuracy alone can be misleading, especially in imbalanced or multi-label scenarios. F1-score provides a more balanced measure of model performance.
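
The gap between accuracy and F1 is easy to reproduce on toy data. The sketch below is a NumPy-only macro F1 (classes with no true or predicted positives score 0) applied to a hypothetical model that always predicts the majority class. In practice `sklearn.metrics.f1_score` with `average="macro"` serves the same purpose.

```python
import numpy as np

def macro_f1(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tp = (y_pred * y_true).sum(axis=0)
    fp = (y_pred * (1 - y_true)).sum(axis=0)
    fn = ((1 - y_pred) * y_true).sum(axis=0)
    denom = 2 * tp + fp + fn
    # F1 per class; classes with nothing to score get 0.
    f1 = np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)
    return f1.mean()

# Toy data: 3 samples of class 0, 1 sample of class 1, none of class 2.
y_true = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0]])
# Hypothetical model that always predicts class 0.
y_prob = np.tile([0.9, 0.1, 0.1], (4, 1))

element_accuracy = ((y_prob >= 0.5).astype(int) == y_true).mean()
print("element-wise accuracy:", element_accuracy)  # about 0.83
print("macro F1:", macro_f1(y_true, y_prob))       # about 0.29
```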

---

### **5. Model Persistence**

- **Saved model:** Used `model.save('my_resnet_model.h5')`, which stores the architecture and weights in a single HDF5 file.
- **Reasoning:** Allows for straightforward model loading (`tf.keras.models.load_model`) and ensures reproducibility and ease of future inference.

---

### **6. Inference & Submission Generation**

- **Inference Procedure:**  
  - Divide the test soundscape audio into fixed-duration segments (5 seconds each).
  - Extract Mel-spectrogram for each segment.
  - Predict species probabilities (`model.predict`), obtaining a `(1, 206)` vector for each segment.
- **Ensemble Predictions (optional strategy):** You can aggregate or average overlapping segment probabilities to achieve more robust predictions.
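
The segmentation step might look like the following sketch. The 32 kHz sample rate, the zero-padding of the final segment, and the helper name `split_into_segments` are assumptions for illustration. Each row of the result would then be converted to a Mel-spectrogram and passed to `model.predict`.

```python
import numpy as np

def split_into_segments(waveform, sample_rate, segment_seconds=5.0):
    """Cut a 1-D waveform into fixed-length segments, zero-padding the last."""
    seg_len = int(sample_rate * segment_seconds)
    n_segments = int(np.ceil(len(waveform) / seg_len))
    padded = np.zeros(n_segments * seg_len, dtype=waveform.dtype)
    padded[: len(waveform)] = waveform
    return padded.reshape(n_segments, seg_len)

sample_rate = 32_000  # assumed sample rate, not stated in this notebook
rng = np.random.default_rng(0)
waveform = rng.normal(size=sample_rate * 12).astype(np.float32)  # 12 s of noise

segments = split_into_segments(waveform, sample_rate)
print(segments.shape)  # (3, 160000)
```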

- **Why multi-label inference?**  
  Real soundscape recordings may contain multiple simultaneous species. Even though training annotations were single-label, our multi-label output allows flexibility, capturing realistic ecological scenarios.

---

### **7. Why This Overall Approach? (Summary)**

Our chosen pipeline integrates practical deep learning techniques learned in class into a cohesive system tailored specifically for acoustic species identification:

- **ResNet:** proven success in capturing complex patterns and features in audio.
- **Multi-label approach:** meets competition requirements and realistically handles multiple species scenarios.
- **Deep learning tricks (dropout, Adam, augmentation, label smoothing, class weighting, early stopping):** enhance robustness, reduce overfitting, and optimize performance given real-world challenges such as imbalanced data and limited labeled examples.

In [4]:
# -------------------------
# 1) Load train_data.npy
# -------------------------
# Content example:
# data_dict[fid] = {
# 'data': (128,256) Mel spectrum,
# 'label': 'Name of a species'
# }
# -------------------------
data_dict = np.load('dataset/train_data.npy', allow_pickle=True).item()

X_list = []
y_list = []

all_labels_set = set()

for fid, content in data_dict.items():
    mel_2d = content['data']             # shape=(128,256)
    label_str = content['label']         # 'species_xxx'

    X_list.append(mel_2d)
    y_list.append(label_str)
    all_labels_set.add(label_str)

X_array = np.array(X_list, dtype=np.float32)    # shape=(N,128,256)
y_array = np.array(y_list)                      # shape=(N,)

all_labels = sorted(list(all_labels_set))
label_to_idx = {lb: i for i, lb in enumerate(all_labels)}
num_species = len(all_labels)

print("Number of samples:", X_array.shape[0])
print("Mel shape: (128,256)")
print("Number of unique species:", num_species)

Number of samples: 28564
Mel shape: (128,256)
Number of unique species: 206


In [5]:
# -----------------------------
# 2) Multi-label One-Hot: Only one position in each record is 1
# -----------------------------
Y_one_hot = np.zeros((len(y_array), num_species), dtype=np.float32)
for i, lb in enumerate(y_array):
    Y_one_hot[i, label_to_idx[lb]] = 1.0

# -----------------------------
# 3) Split training/validation set (80/20)
# -----------------------------
X_train, X_val, y_train, y_val = train_test_split(
    X_array, Y_one_hot, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_array # Stratify by string label
)

print("Train shape:", X_train.shape, y_train.shape)
print("Val shape:",   X_val.shape,   y_val.shape)

Train shape: (22851, 128, 256) (22851, 206)
Val shape: (5713, 128, 256) (5713, 206)


In [6]:
# -----------------------------
# 4) Dealing with data imbalance -> class_weight
# Since each record has only one label, we can count the number of times each label appears
# and assign weights in inverse proportion.
# -----------------------------
label_counts = Counter(y_array)
max_count = max(label_counts.values())
# Give higher weight to less common categories
class_weight = {}
for lb, freq in label_counts.items():
    idx = label_to_idx[lb]
    class_weight[idx] = max_count / freq

print("Class weight example:", list(class_weight.items())[:5])

Class weight example: [(110, 7.7952755905511815), (177, 6.470588235294118), (71, 3.1832797427652735), (30, 33.0), (47, 47.142857142857146)]


In [None]:
# -----------------------------
# 5) Build data pipeline + data augmentation
# Random flip/rotate (for images)
# -----------------------------
augment_layers = tf.keras.Sequential([
    layers.RandomFlip(mode='horizontal'),
    layers.RandomRotation(0.1),
], name="data_augmentation")

def preprocess_fn(x, y):
    # x: (128,256) => expand dims to (128,256,1)
    x = tf.expand_dims(x, axis=-1)
    # cast to float
    x = tf.cast(x, tf.float32)
    x = augment_layers(x, training=True)  
    return x, y

def preprocess_fn_val(x, y):
    # No data augmentation for validation set
    x = tf.expand_dims(x, axis=-1)
    x = tf.cast(x, tf.float32)
    return x, y

batch_size = 16
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(buffer_size=2048).map(preprocess_fn).batch(batch_size).prefetch(tf.data.AUTOTUNE)

val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_ds = val_ds.map(preprocess_fn_val).batch(batch_size).prefetch(tf.data.AUTOTUNE)

In [None]:
# -----------------------------
# 6) Build ResNet (Keras)
# - Input (128,256,1) => First use Conv2D to transform to 3 channels => ResNet50 => GAP => Multi-label sigmoid
# - Add Dropout after ResNet output
# - Use BinaryCrossentropy(label_smoothing=...) for label smoothing
# -----------------------------
def build_resnet50(input_shape=(128, 256, 1), num_classes=10):
    inputs = layers.Input(shape=input_shape)

    # Convert 1 channel to 3 channels (1x1 convolution)
    x = layers.Conv2D(3, (1, 1), padding='same')(inputs)

    # weights=None (the Python object, not the string 'None') means random init
    base_model = ResNet50(include_top=False,
                          weights=None,
                          input_tensor=x)
    x = base_model.output
    x = layers.GlobalAveragePooling2D()(x)
    # Add an additional dropout to prevent overfitting
    x = layers.Dropout(0.3)(x)
    # Output multiple labels -> num_classes neurons, activation=sigmoid
    outputs = layers.Dense(num_classes, activation='sigmoid')(x)

    model = models.Model(inputs, outputs, name="ResNet50_BirdCLEF")
    return model


model = build_resnet50(input_shape=(128, 256, 1), num_classes=num_species)

loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    label_smoothing=0.05 # Smoothing
)

model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss=loss_fn,
    metrics=['accuracy']
)

model.summary()

# Early Stop Callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

In [None]:
# -----------------------------
# 7) Start training
# Note: Multi-label + class_weight
# Because each sample has only one positive class, this is actually equivalent to "single label"
# -----------------------------
epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    class_weight=class_weight,
    callbacks=[early_stopping]  # callbacks is documented as a list
)
model.save('my_resnet_model.h5')  

val_loss, val_acc = model.evaluate(val_ds, verbose=1)
print("Validation Loss:", val_loss)
print("Validation Accuracy:", val_acc)

Epoch 1/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 1066s 727ms/step - accuracy: 0.0040 - loss: 1.2208 - val_accuracy: 3.5008e-04 - val_loss: 0.1341
Epoch 2/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 1015s 710ms/step - accuracy: 0.0013 - loss: 0.9861 - val_accuracy: 8.7520e-04 - val_loss: 0.1339
Epoch 3/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 2074s 1s/step - accuracy: 0.0028 - loss: 0.9893 - val_accuracy: 8.7520e-04 - val_loss: 0.1337
Epoch 4/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 1848s 1s/step - accuracy: 0.0014 - loss: 0.9942 - val_accuracy: 0.0011 - val_loss: 0.1343
Epoch 5/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 1038s 726ms/step - accuracy: 0.0019 - loss: 0.9787 - val_accuracy: 3.5008e-04 - val_loss: 0.1340
Epoch 6/20
1429/1429 ━━━━━━━━━━━━━━━━━━━━ 3295s 2s/step - accuracy: 0.0015 - loss: 0.9729 - val_accuracy: 0.000

## Training Summary (ResNet50 on BirdCLEF 2025)

- Model: ResNet50 (no pretraining, sigmoid multi-label output)

- Loss: Binary Crossentropy + Label Smoothing

- Epochs: Trained 8/20 (EarlyStopped)

- Final Val Loss: 0.1337

- Validation Accuracy: ~0.00087 (not meaningful for multi-label)

- Filtered Macro AUC: 0.6262

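Training ran without diverging, but the validation loss stayed essentially flat (0.1341 to 0.1337 over the logged epochs), and a macro AUC of about 0.63 is only modestly above the 0.5 chance level, so this baseline learned little.

Accuracy is not a useful metric here; use AUC / F1 instead.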

In [15]:
val_images = []
val_labels = []
for x_batch, y_batch in val_ds:
    val_images.append(x_batch.numpy())
    val_labels.append(y_batch.numpy())

val_images = np.concatenate(val_images, axis=0)
val_labels = np.concatenate(val_labels, axis=0)

# Predict
pred_probs = model.predict(val_images)  # shape: (N, 206)
pred_labels = (pred_probs > 0.5).astype(np.float32)


valid_species_idx = (val_labels.sum(axis=0) > 0)

auc = roc_auc_score(
    val_labels[:, valid_species_idx],
    pred_probs[:, valid_species_idx],
    average="macro"
)

print("Filtered Macro AUC:", auc)


[1m179/179[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m49s[0m 274ms/step
Filtered Macro AUC: 0.62621917844194
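
The filtering step matters because ROC AUC is undefined for a class that has no positive (or no negative) examples in the validation set. The NumPy sketch below illustrates the idea on toy data, using the pairwise-ranking definition of AUC. It is an illustration, not a replacement for `roc_auc_score`, and its pairwise comparison is only practical for small arrays.

```python
import numpy as np

def binary_auc(y_true, scores):
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Fraction of (positive, negative) pairs ranked correctly; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def filtered_macro_auc(y_true, y_prob):
    n_pos = y_true.sum(axis=0)
    # Keep only classes with at least one positive and one negative sample.
    valid = (n_pos > 0) & (n_pos < len(y_true))
    aucs = [binary_auc(y_true[:, j], y_prob[:, j]) for j in np.where(valid)[0]]
    return float(np.mean(aucs))

# Toy data: class 2 never occurs, so it is skipped.
y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0]])
y_prob = np.array([[0.9, 0.2, 0.5],
                   [0.3, 0.8, 0.5],
                   [0.6, 0.7, 0.5],
                   [0.4, 0.6, 0.5]])

print(filtered_macro_auc(y_true, y_prob))  # 0.875
```

Here class 0 is ranked perfectly (AUC 1.0), class 1 gets 3 of 4 pairs right (AUC 0.75), and the macro average of the two valid classes is 0.875.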


## Improved Model

This version improves upon the baseline multi-label classification pipeline by introducing several key enhancements:

### 1. Used Pretrained ResNet50 (`weights='imagenet'`)
- **Why:** Transfer learning helps the model converge faster and learn better high-level features from spectrograms.
- **How:** Input mel-spectrograms are single-channel; we use a `Conv2D(1x1)` layer to expand them to 3 channels, making them compatible with ImageNet-pretrained ResNet50.

### 2. Added Proper Metrics for Multi-Label Evaluation
- **Why:** Accuracy is misleading in multi-label tasks.
- **How:** Tracked AUC, Precision, and Recall during training using `tf.keras.metrics`.

### 3. Added Learning Rate Scheduler
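- **Why:** Lowering the learning rate once validation loss plateaus lets the optimizer take finer steps instead of stalling.
- **How:** `ReduceLROnPlateau` halves the learning rate (`factor=0.5`) after 3 epochs without improvement in `val_loss`, down to a floor of `1e-6`.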
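
The schedule can be sketched in plain Python with the settings used in this notebook (`factor=0.5`, `patience=3`, `min_lr=1e-6`). This is a simplified illustration: it treats any strict decrease as an improvement and ignores the callback's `min_delta` and `cooldown` options. The loss values are toy numbers.

```python
def reduce_lr_on_plateau(val_losses, lr=1e-4, factor=0.5, patience=3, min_lr=1e-6):
    """Return the learning rate in effect during each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        history.append(lr)
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # Plateau detected: shrink the rate, but never below min_lr.
                lr = max(lr * factor, min_lr)
                wait = 0
    return history

toy_losses = [0.50, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46]
lr_history = reduce_lr_on_plateau(toy_losses)
print(lr_history)
```

In this toy run the rate stays at `1e-4` for the first five epochs and drops to `5e-5` from epoch 6 onward. Because the scheduler's patience (3) is shorter than early stopping's (5), the learning rate gets reduced at least once before training is halted.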

### 4. Model Saved for Future Inference
- **Why:** For consistent reusability and easy inference/finetuning later.
- **How:** Saved using `model.save('my_resnet_model_improved.h5')`.

In [21]:
# -----------------------------
# 6) Build ResNet (Keras)
# - Input (128,256,1) => First use Conv2D to transform to 3 channels => ResNet50 => GAP => Multi-label sigmoid
# - Add Dropout after ResNet output
# - Use BinaryCrossentropy(label_smoothing=...) for label smoothing
# -----------------------------
def build_resnet50(input_shape=(128, 256, 1), num_classes=10):
    inputs = layers.Input(shape=input_shape)

    x = layers.Conv2D(3, (1, 1), padding='same')(inputs)

    base_model = ResNet50(
        include_top=False,
        weights='imagenet'
    )

    x = base_model(x)

    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='sigmoid')(x)

    model = models.Model(inputs=inputs, outputs=outputs, name="ResNet50_BirdCLEF")
    return model


model = build_resnet50(input_shape=(128, 256, 1), num_classes=num_species)

loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    label_smoothing=0.05  # Smoothing
)

model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss=loss_fn,
    metrics=[
        tf.keras.metrics.AUC(name='auc', multi_label=True),
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall')
    ]
)

model.summary()

# Early Stop Callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=3,
    min_lr=1e-6,
    verbose=1
)

In [None]:
# -----------------------------
# 7) Start training
# Note: Multi-label + class_weight
# Because each sample has only one positive class, this is actually equivalent to "single label"
# -----------------------------
epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    class_weight=class_weight,
    callbacks=[lr_scheduler, early_stopping]
)
model.save('my_resnet_model_improved.h5')  

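# evaluate() returns the loss followed by the compiled metrics, in order:
# auc, precision, recall
val_loss, val_auc, val_precision, val_recall = model.evaluate(val_ds, verbose=1)
print("Validation Loss:", val_loss)
print("Validation AUC:", val_auc)
print("Validation Precision:", val_precision)
print("Validation Recall:", val_recall)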