# Deep learning algorithms to classify audio (ResNet)

In [None]:
%pip install tensorflow
%pip install keras

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeableNote: you may need to restart the kernel to use updated packages.

Defaulting to user installation because normal site-packages is not writeable
Collecting tensorflow-addons
  Downloading tensorflow_addons-0.22.0-cp310-cp310-win_amd64.whl.metadata (1.8 kB)
Collecting typeguard<3.0.0,>=2.7 (from tensorflow-addons)
  Downloading typeguard-2.13.3-py3-none-any.whl.metadata (3.6 kB)
Downloading tensorflow_addons-0.22.0-cp310-cp310-win_amd64.whl (719 kB)
   ---------------------------------------- 0.0/719.8 kB ? eta -:--:--
   --------------------------------------- 719.8/719.8 kB 28.6 MB/s eta 0:00:00
Downloading typeguard-2.13.3-py3-none-any.whl (17 kB)
Installing collected packages: typeguard, tensorflow-addons
Successfully installed tensorflow-addons-0.22.0 typeguard-2.1

In [None]:
import gc
import glob
import logging
import os
import random
import re
import sys
import time
import warnings
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import joblib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    f1_score,
    roc_auc_score,
)
from sklearn.model_selection import (
    RandomizedSearchCV,
    StratifiedKFold,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.optimizers import Adam
from tqdm import tqdm

warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.ERROR)

os.environ["CUDA_VISIBLE_DEVICES"] = ""

print(tf.__version__)
print(dir(tf.keras))

2.19.0
['DTypePolicy', 'FloatDTypePolicy', 'Function', 'Initializer', 'Input', 'InputSpec', 'KerasTensor', 'Layer', 'Loss', 'Metric', 'Model', 'Operation', 'Optimizer', 'Quantizer', 'Regularizer', 'RematScope', 'Sequential', 'StatelessScope', 'SymbolicScope', 'Variable', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', 'activations', 'applications', 'backend', 'callbacks', 'config', 'constraints', 'datasets', 'device', 'distribution', 'dtype_policies', 'export', 'initializers', 'layers', 'legacy', 'losses', 'metrics', 'mixed_precision', 'models', 'name_scope', 'ops', 'optimizers', 'preprocessing', 'quantizers', 'random', 'regularizers', 'remat', 'tree', 'utils', 'version', 'visualization', 'wrappers']


## BirdCLEF 2025: ResNet-based Multi-label Classification Approach

In this Kaggle competition our goal is to **identify multiple animal species (primarily birds)** from 10‑second soundscape clips. We implement a deep‑learning pipeline built around an **ImageNet‑pre‑trained ResNet50** fine‑tuned on Mel‑spectrogram “images” in a **multi‑label** setting.

---

### 1. Data Preparation

| Step | Detail (exactly mirroring the code) |
|------|-------------------------------------|
| **Load** | `np.load('dataset/train_data.npy', allow_pickle=True)` → dictionary with keys `data` (128 × 256 Mel) and `label` (species string). |
| **Shape** | Each sample → `(128, 256)` **single‑channel** Mel‑spectrogram. |
| **Label space** | Discovered dynamically from the file; in our run it equals **206 unique species** (`num_species`). |
| **One‑Hot** | `Y_one_hot[i, idx] = 1` gives **sparse one‑hot vectors** (exactly one “1” per sample). |
| **Stratified split** | `train_test_split(..., test_size=0.2, stratify=y_array)` → **80 / 20** train/val, preserving class ratios. |
| **Class imbalance** | `class_weight[idx] = max_count / freq` assigns **inverse‑frequency weights** for rare species. |

---

### 2. Data Pipeline & Augmentation

```text
(128,256,1) Mel  ──► expand_dims
               ──► RandomRotation(0.05)
               ──► RandomZoom(height_factor=0.05)
               ──► BATCH / PREFETCH
```
---

### 3. Model Architecture

| Block | Implementation |
|-------|----------------|
| **Input** | `Input(shape=(128,256,1))` |
| **Channel lift** | `Conv2D(3, 1 × 1)` converts 1‑channel → 3‑channel so we can reuse ImageNet weights. |
| **Backbone** | `ResNet50(include_top=False, weights='imagenet')` (all layers trainable by default). |
| **Pooling** | `GlobalAveragePooling2D()` |
| **Regularization** | `Dropout(0.3)` |
| **Head** | `Dense(num_species, activation='sigmoid')` |

---

### 4. Training Configuration

| Item | Code Value | Rationale |
|------|------------|-----------|
| **Loss** | `BinaryCrossentropy(label_smoothing=0.05)` | Multi‑label + smooth out hard 0/1 targets. |
| **Optimizer** | `Adam(lr=1e‑4)` | Adaptive, stable for noisy gradients. |
| **Metrics** | `AUC`, `Precision`, `Recall` | Accuracy is not informative for sparse multi‑label; these capture ranking & class‑wise performance. |
| **Callbacks** | `ReduceLROnPlateau(factor=0.5, patience=3)` <br>`EarlyStopping(patience=5, restore_best_weights=True)` | Automatic LR scheduling and training cut‑off when val‑loss stops improving. |
| **Epochs** | 20 (upper bound) | Early‑stop generally triggers sooner. |
| **Class weights** | passed via `class_weight` | Boosts loss for under‑represented species. |

---

### 5. Results & Persistence

* After training, **best validation metrics** are restored (EarlyStopping).
* Model is saved with `model.save('my_resnet_model_improved.h5')`, bundling architecture + weights for **one‑line re‑load**:  
  ```python
  model = tf.keras.models.load_model('my_resnet_model_improved.h5')
  ```
* Evaluation on the held‑out 20 % validation set is printed immediately after saving.

---

### 6. Future Work

* **SpecAugment**‑style time/frequency masking could further boost robustness.  
* Explore **EfficientNet‑based** backbones for a better parameter‑accuracy trade‑off.  
* Iion choices, loss/metric configs, and saving paths are all one‑to‑one with the implementation.

In [None]:
# -------------------------
# 1) Load train_data.npy
# -------------------------
data_dict = np.load('dataset/train_data.npy', allow_pickle=True).item()

X_list = []
y_list = []

all_labels_set = set()

for fid, content in data_dict.items():
    mel_2d = content['data']             # shape=(128,256)
    label_str = content['label']         # 'species_xxx'

    X_list.append(mel_2d)
    y_list.append(label_str)
    all_labels_set.add(label_str)

X_array = np.array(X_list, dtype=np.float32)    # shape=(N,128,256)
y_array = np.array(y_list)                      # shape=(N,)

all_labels = sorted(list(all_labels_set))
label_to_idx = {lb: i for i, lb in enumerate(all_labels)}
num_species = len(all_labels)

print("Number of samples:", X_array.shape[0])
print("Mel shape: (128,256)")
print("Number of unique species:", num_species)

Number of samples: 28564
Mel shape: (128,256)
Number of unique species: 206


In [5]:
# -----------------------------
# 2) Multi-label One-Hot: Only one position in each record is 1
# -----------------------------
Y_one_hot = np.zeros((len(y_array), num_species), dtype=np.float32)
for i, lb in enumerate(y_array):
    Y_one_hot[i, label_to_idx[lb]] = 1.0

# -----------------------------
# 3) Split training/validation set (80/20)
# -----------------------------
X_train, X_val, y_train, y_val = train_test_split(
    X_array, Y_one_hot, 
    test_size=0.2, 
    random_state=42, 
    stratify=y_array # Stratify by string label
)

print("Train shape:", X_train.shape, y_train.shape)
print("Val shape:",   X_val.shape,   y_val.shape)

Train shape: (22851, 128, 256) (22851, 206)
Val shape: (5713, 128, 256) (5713, 206)


In [None]:
# -----------------------------
# 4) Dealing with data imbalance -> class_weight
# -----------------------------
label_counts = Counter(y_array)
max_count = max(label_counts.values())
# Give higher weight to less common categories
class_weight = {}
for lb, freq in label_counts.items():
    idx = label_to_idx[lb]
    class_weight[idx] = max_count / freq

print("Class weight example:", list(class_weight.items())[:5])

Class weight example: [(110, 7.7952755905511815), (177, 6.470588235294118), (71, 3.1832797427652735), (30, 33.0), (47, 47.142857142857146)]


In [None]:
# -----------------------------
# 5) Build data pipeline + data augmentation
# -----------------------------
augment_layers = tf.keras.Sequential([
    layers.RandomRotation(0.05),
    layers.RandomZoom(height_factor=0.05)
])

def preprocess_fn(x, y):
    # x: (128,256) => expand dims(128,256,1)
    x = tf.expand_dims(x, axis=-1)
    # cast to float
    x = tf.cast(x, tf.float32)
    x = augment_layers(x, training=True)  
    return x, y

def preprocess_fn_val(x, y):
    # No data augmentation for validation set
    x = tf.expand_dims(x, axis=-1)
    x = tf.cast(x, tf.float32)
    return x, y

batch_size = 16
train_ds = tf.data.Dataset.from_tensor_slices((X_train, y_train))
train_ds = train_ds.shuffle(buffer_size=2048).map(preprocess_fn).batch(batch_size).prefetch(tf.data.AUTOTUNE)

val_ds = tf.data.Dataset.from_tensor_slices((X_val, y_val))
val_ds = val_ds.map(preprocess_fn_val).batch(batch_size).prefetch(tf.data.AUTOTUNE)

In [None]:
# -----------------------------
# 6) Build ResNet (Keras)
# - Input (128,256,1) => First use Conv2D to transform to 3 channels => ResNet50 => GAP => Multi-label sigmoid
# - Add Dropout after ResNet output
# - Use BinaryCrossentropy for label smoothing
# -----------------------------
def build_resnet50(input_shape=(128, 256, 1), num_classes=206):
    inputs = layers.Input(shape=input_shape)

    x = layers.Conv2D(3, (1, 1), padding='same')(inputs)

    base_model = ResNet50(
        include_top=False,
        weights='imagenet'
    )

    x = base_model(x)

    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='sigmoid')(x)

    model = models.Model(inputs=inputs, outputs=outputs, name="ResNet50_BirdCLEF")
    return model


model = build_resnet50(input_shape=(128, 256, 1), num_classes=num_species)

loss_fn = tf.keras.losses.BinaryCrossentropy(
    from_logits=False,
    label_smoothing=0.05  # Smoothing
)

model.compile(
    optimizer=Adam(learning_rate=1e-4),
    loss=loss_fn,
    metrics=[
        tf.keras.metrics.AUC(name='auc', multi_label=True),
        tf.keras.metrics.Precision(name='precision'),
        tf.keras.metrics.Recall(name='recall')
    ]
)

model.summary()

# Early Stop Callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss',
    factor=0.5,
    patience=3,
    min_lr=1e-6,
    verbose=1
)

In [None]:
# -----------------------------
# 7) Start training
# Multi-label + class_weight
# -----------------------------
epochs = 20
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs,
    class_weight=class_weight,
    callbacks=[lr_scheduler, early_stopping]
)
model.save('my_resnet_model_improved.h5')  

Epoch 1/20
[1m1429/1429[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1250s[0m 855ms/step - auc: 0.4705 - loss: 1.2112 - precision: 0.0044 - recall: 0.0176 - val_auc: 0.5833 - val_loss: 0.1341 - val_precision: 0.0000e+00 - val_recall: 0.0000e+00 - learning_rate: 1.0000e-04
Epoch 2/20
[1m1429/1429[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3003s[0m 2s/step - auc: 0.5557 - loss: 0.9794 - precision: 0.0000e+00 - recall: 0.0000e+00 - val_auc: 0.7492 - val_loss: 0.1333 - val_precision: 0.1250 - val_recall: 7.0016e-04 - learning_rate: 1.0000e-04
Epoch 3/20
[1m1429/1429[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2394s[0m 2s/step - auc: 0.6803 - loss: 0.9749 - precision: 0.0197 - recall: 2.9257e-05 - val_auc: 0.7960 - val_loss: 0.1327 - val_precision: 0.5476 - val_recall: 0.0040 - learning_rate: 1.0000e-04
Epoch 4/20
[1m1429/1429[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1611s[0m 1s/step - auc: 0.7592 - loss: 0.9593 - precision: 0.3186 - recall: 9.2242e-04 - val_auc: 0.81

ValueError: too many values to unpack (expected 2)

In [23]:
results = model.evaluate(val_ds, verbose=1)
val_loss = results[0]
val_auc = results[1]
val_precision = results[2]
val_recall = results[3]

print(f"Validation Loss: {val_loss:.4f}")
print(f"Validation AUC: {val_auc:.4f}")
print(f"Precision: {val_precision:.4f}")
print(f"Recall: {val_recall:.4f}")

[1m358/358[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m54s[0m 150ms/step - auc: 0.7709 - loss: 0.1266 - precision: 0.7917 - recall: 0.3850
Validation Loss: 0.1266
Validation AUC: 0.8828
Precision: 0.7959
Recall: 0.3898


## Training Summary of Improved

### Final Validation Results

| Metric       | Value    |
|--------------|----------|
| **Loss**     | `0.1266` |
| **AUC**      | `0.8828` |
| **Precision**| `0.7959` |
| **Recall**   | `0.3898` |

These results show that the model performs **very well in terms of ranking (AUC ≈ 0.88)** and **has strong precision (≈ 0.80)**, though recall is moderate due to the single-label nature of training.

---

### Comparison with Baseline

| Version           | AUC    | Loss   | Notes |
|-------------------|--------|--------|-------|
| Baseline (from scratch) | ~0.63 | ~0.13 | No pretraining, no label smoothing, used accuracy |
| Improved (this version) | **0.88** | **0.1266** | With pretrained backbone, real metrics, regularization |

The improved model demonstrates a **significant gain in AUC (~+0.25)** and a **clearer training trajectory**. These changes resulted in better generalization and more confidence-calibrated predictions.