# Freesound Audio Tagging 2019 — Medal-Driven Notebook (Plan v2.1: Gold-Only Strategy, Mandatory Fixes Applied)

Experiment Log and Plan (v2.1 — incorporates C0 audit + mandatory revisions):
- Objective: GOLD medal. Optimize Label-Weighted LRAP (LWLRAP). Focus on SOTA pretrained encoders, robust noisy-label handling, strong CV, TTA, and ensembling.
- Data artifacts present: train_curated.csv, train_noisy.csv, train_curated.zip, train_noisy.zip, test.zip, sample_submission.csv.
- Protocol: concise notebook, document attempts, backup before major changes, delete stale code, submit for audit at milestones.

Unified Gold Strategy (Single Path Only)
1) Encoder: Fine-tune pretrained PANNs (primary: CNN14, backup: ResNet38). Use log-mel frontend matching PANNs defaults.
2) CV: 5-fold MultilabelStratifiedKFold (seed=42), stratifying on 80-class binary matrix. Track per-fold/per-class LRAP and global LWLRAP.
3) Noisy-label protocol: teacher-student and curriculum with confidence filtering/weighting.
4) Inference: strong TTA (multi time-crop), fold-averaging, rank-aware ensembling across diverse models/seeds.

Details
- Label Space:
  - Parse label column (comma-separated in the competition CSVs; the parser also accepts semicolons). Build a consistent class list that matches the sample_submission column order.
  - Binarize y for CV and training. Save mapping for inference.

- Audio Preprocessing (Log-mel aligned to PANNs):
  - Sample rate: 32,000 Hz (resample all audio to 32k).
  - Clip duration: 10.0 s target per sample.
  - Variable length policy: during training, random time-crop to 10 s if longer; if shorter, loop-pad (tile then trim) with 0.1 probability; else zero-pad to 10 s. At inference, use multi-crop TTA (see below).
  - STFT: n_fft=1024, hop_length=320 (10 ms hop), win_length=1024, window=hann, center=True.
  - Mel: PRIMARY n_mels=64 (to match PANNs CNN14 pretraining); diversity models may use n_mels=128 or 256. fmin=50 Hz, fmax=14000 Hz (the PANNs CNN14 pretraining values), htk=False, norm=None.
  - Log scale: use librosa.power_to_db on mel power; then standardize per-frequency bin with dataset mean/std (computed on curated train); clamp to [-10, 10] after standardization for stability.
  - Channel: mono (downmix).
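
The variable-length policy and the framing arithmetic above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the notebook's training loader: `fit_to_length` and its arguments are hypothetical names, and the frame count assumes librosa-style `center=True` framing (1 + samples // hop).

```python
import numpy as np

SR = 32_000                      # target sample rate from the plan
CROP_SEC = 10.0
TARGET = int(SR * CROP_SEC)      # 320,000 samples per training crop
HOP = 320                        # 10 ms hop at 32 kHz

def fit_to_length(y, target=TARGET, loop_prob=0.1, rng=None):
    """Random-crop long clips; loop-pad short clips with prob. loop_prob, else zero-pad.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    if n >= target:
        # Random time-crop of exactly `target` samples.
        start = int(rng.integers(0, n - target + 1))
        return y[start:start + target]
    if n > 0 and rng.random() < loop_prob:
        # Loop-pad: tile the clip, then trim to the target length.
        reps = -(-target // n)   # ceiling division
        return np.tile(y, reps)[:target]
    # Default for short clips: zero-pad at the end.
    return np.pad(y, (0, target - n))

# With center=True framing, a 10 s crop yields this many STFT frames.
N_FRAMES = 1 + TARGET // HOP
```

Under these assumptions a 10 s crop maps to 1001 frames by 64 mel bins, which is the input shape the 64-mel models would see.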

- Model Architecture:
  - Base: PANNs CNN14 pretrained on AudioSet (log-mel, 64 mel). Replace classifier with attention pooling head for 80 classes.
  - Pooling: attention pooling over time (linear attention + context gating) instead of mean pooling.
  - Head: Dropout p=0.5 before final linear; output 80 logits.
  - Alternate diversity models: PANNs ResNet38; CNN14 with 128/256 mels; crop lengths 5 s and 12 s variants.
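
To make the pooling idea concrete, here is a simplified NumPy sketch of attention pooling over time. It is an assumption-laden illustration: it omits the context gating mentioned above, the weight matrices `w_att` and `w_cla` are hypothetical stand-ins for learned layers, and the real head would be a PyTorch module on top of the CNN14 feature map.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, w_att, w_cla):
    """Pool frame-level features into clip-level logits.

    feats: (T, D) frame features; w_att, w_cla: (D, C) hypothetical weights.
    Returns (C,) clip logits.
    """
    att = softmax(feats @ w_att, axis=0)   # per-class weights over time, summing to 1
    cla = feats @ w_cla                    # per-frame class logits
    return (att * cla).sum(axis=0)         # attention-weighted average over time
```

When the attention weights are uniform (for example `w_att` all zeros), this reduces to mean pooling of the per-frame logits, which is the baseline the plan wants to improve on.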

- Losses and Label Handling:
  - Primary loss: BCEWithLogitsLoss.
  - Class weighting: inverse sqrt class frequency from curated train; normalize weights to mean=1.0.
  - Label smoothing: 0.05 on positives only (targets y -> y*(1-0.05), so positives become 0.95); negatives stay at 0.
  - Robust noise alternatives: Generalized Cross Entropy (q=0.7) or Symmetric Cross Entropy (alpha=1.0, beta=0.5) if needed in noisy stages.
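
The class weighting and positive-only smoothing can be sketched in NumPy as follows. The function names are hypothetical, the smoothing follows the "negatives stay at 0" reading of the plan, and a real run would presumably pass the weights to a PyTorch BCE-with-logits loss rather than use this reference implementation.

```python
import numpy as np

def inv_sqrt_class_weights(Y):
    """Inverse-sqrt class-frequency weights, normalized to mean 1.0.

    Y: (N, C) binary label matrix from curated train.
    """
    freq = np.maximum(Y.sum(axis=0), 1.0)   # guard against empty classes
    w = 1.0 / np.sqrt(freq)
    return w / w.mean()

def smooth_positives(y, eps=0.05):
    """Smooth positives to 1 - eps; negatives stay at 0."""
    return y * (1.0 - eps)

def weighted_bce_with_logits(logits, targets, class_w):
    """Numerically stable BCE on logits with per-class weights (mean over all entries)."""
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float((loss * class_w).mean())
```

A typical call would be `weighted_bce_with_logits(logits, smooth_positives(Y), inv_sqrt_class_weights(Y))`.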

- Optimizer, Schedule, Checkpointing:
  - Optimizer: AdamW (betas=(0.9, 0.999), weight_decay=1e-4).
  - LR schedule: cosine with warmup. Base LR 2e-4 for head, 1e-4 for encoder (param groups). Warmup 1 epoch (or 1000 steps), then cosine decay.
  - Epochs: 20 epochs curated-only warm start; 10–15 epochs with noisy curriculum stages (see below).
  - Mixed precision (AMP) enabled; gradient clipping at 5.0.
  - Batch size: as large as fits GPU (V100 16GB): target 32 for 64-mel 10 s; adjust dynamically.
  - Early stopping & checkpoints: monitor validation LWLRAP each epoch, save best checkpoint (highest LWLRAP); patience=3 epochs for early stop. Use best-val checkpoint for inference.
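
The warmup-plus-cosine schedule and the patience rule can be sketched in plain Python. This is a minimal sketch under assumptions: `lr_at` and `BestTracker` are hypothetical names, warmup is linear, and the cosine decays to `min_lr=0.0`, none of which the plan specifies.

```python
import math

def lr_at(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

class BestTracker:
    """Track the best validation LWLRAP and signal early stopping."""

    def __init__(self, patience=3):
        self.best = -math.inf
        self.bad_epochs = 0
        self.patience = patience

    def update(self, score):
        """Return (is_best, should_stop) for this epoch's validation score."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
            return True, False
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience
```

The head and encoder parameter groups would each call `lr_at` with their own `base_lr` (2e-4 and 1e-4); a checkpoint is saved whenever `update` reports `is_best`.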

- Data Augmentations:
  - Time-domain: random gain [-6, +6] dB; pink/gaussian noise injection SNR ~ 20–30 dB; time shift ±0.5 s; mild pitch shift (±2 semitones) and time stretch (0.9–1.1).
  - Spectrogram: SpecAugment — 2 freq masks (width up to 20 mel bins), 2 time masks (up to 10% of frames), without masking entire clip.
  - MixUp: alpha=0.4 on spectrograms/logits; labels mixed linearly; probability 0.5.
  - Random time-crop as above; for multi-scale training, occasionally use 5 s or 12 s crops (p=0.2 each) in diversity runs.
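
A NumPy sketch of the spectrogram-level augmentations may help pin down the parameters. The function names are hypothetical, masked regions are filled with the spectrogram mean (an assumption, since the plan does not name a fill value), and MixUp here is applied to input spectrograms only.

```python
import numpy as np

def spec_augment(spec, rng, n_freq=2, max_f=20, n_time=2, max_t_frac=0.10):
    """Apply frequency and time masks to a (n_mels, n_frames) log-mel array."""
    out = spec.copy()
    n_mels, n_frames = out.shape
    fill = out.mean()                       # assumed fill value
    for _ in range(n_freq):
        w = int(rng.integers(0, min(max_f, n_mels) + 1))
        f0 = int(rng.integers(0, n_mels - w + 1))
        out[f0:f0 + w, :] = fill
    max_t = int(max_t_frac * n_frames)      # each time mask covers at most 10% of frames
    for _ in range(n_time):
        w = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, n_frames - w + 1))
        out[:, t0:t0 + w] = fill
    return out

def mixup(x, y, rng, alpha=0.4, p=0.5):
    """With probability p, mix each example with a shuffled partner.

    x: (B, ...) inputs; y: (B, C) targets. Labels are mixed linearly.
    """
    if rng.random() >= p:
        return x, y
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```

With two time masks of at most 10% each, no more than 20% of frames can be masked, so a clip is never fully masked.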

- Cross-Validation (explicit):
  - Iterative Stratification (MultilabelStratifiedKFold) with n_splits=5, shuffle=True, random_state=42.
  - Stratify on 80-dim binary labels from curated train only; keep folds disjoint by filename.
  - Metrics per fold: LWLRAP (primary), per-class LRAP, macro/micro AUC for diagnostics; early-stop on LWLRAP.

- Noisy Data Protocol (multi-step):
  1) Train a strong teacher on curated-only (5-fold, out-of-fold predictions saved for all curated).
  2) Use the teacher (fold models averaged) to infer probabilities on train_noisy.
  3) Confidence filtering:
     - Positive selection: keep labels where teacher prob for that class ≥ 0.8; set others to 0 for that sample.
     - Optional addition: add teacher positive pseudo-labels for classes ≥ 0.95 even if not present in weak labels.
     - Discard samples with no remaining positives after filtering.
  4) Weighting: per-sample weight = max teacher prob among positives; clip to [0.5, 1.0]. Also scale by class weight as above.
  5) Curriculum retraining:
     - Stage A: fine-tune teacher from curated checkpoint adding filtered noisy with low weight (0.5) for 3–5 epochs.
     - Stage B: increase noisy sample weights to 0.75–1.0 for 5–10 epochs; optionally unfreeze more encoder layers.
     - If noisy destabilizes LWLRAP, swap to GCE (q=0.7) or SCE (α=1, β=0.5).
  6) Recompute OOF on curated and evaluate LWLRAP improvements before proceeding.
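
Steps 3 and 4 of the protocol can be sketched as a single NumPy function. This is an illustration under assumptions: `filter_noisy` is a hypothetical name, `weak` is the binarized train_noisy label matrix, and `probs` is the fold-averaged teacher output on the same clips.

```python
import numpy as np

def filter_noisy(weak, probs, keep_thr=0.8, add_thr=0.95,
                 w_clip=(0.5, 1.0), add_pseudo=False):
    """Confidence-filter weak labels and derive per-sample weights.

    weak:  (N, C) binary weak labels.
    probs: (N, C) teacher probabilities.
    Returns (labels, weights, keep_mask) with samples lacking positives dropped.
    """
    # Keep a weak label only if the teacher agrees with prob >= keep_thr.
    kept = weak * (probs >= keep_thr)
    if add_pseudo:
        # Optionally add very confident teacher positives missing from the weak labels.
        kept = np.maximum(kept, (probs >= add_thr).astype(kept.dtype))
    keep_mask = kept.sum(axis=1) > 0
    # Per-sample weight: max teacher prob among remaining positives, clipped.
    max_pos = np.where(kept > 0, probs, 0.0).max(axis=1)
    weights = np.clip(max_pos, *w_clip)
    return kept[keep_mask], weights[keep_mask], keep_mask
```

The Stage A and Stage B multipliers (0.5, then 0.75 to 1.0) would scale `weights` further, alongside the class weights.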

- Sampling Strategy:
  - Balanced batch sampler: ensure each batch includes rare classes via inverse-frequency sampling on curated; cap oversampling at 5x.
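
One possible reading of the capped inverse-frequency sampler is sketched below. The interpretation is an assumption: each clip takes the weight of its rarest label, weights are expressed relative to the most common clip type, and "5x" caps that ratio. `sampling_weights` is a hypothetical name, and every clip is assumed to have at least one label.

```python
import numpy as np

def sampling_weights(Y, cap=5.0):
    """Per-clip sampling weights from inverse class frequency, capped at `cap`.

    Y: (N, C) binary curated labels, at least one positive per row (assumed).
    """
    freq = np.maximum(Y.sum(axis=0), 1.0)
    inv = 1.0 / freq
    # A clip inherits the weight of its rarest label.
    w = np.where(Y > 0, inv, 0.0).max(axis=1)
    # Express weights relative to the least-upweighted clip, then cap.
    w = w / w[w > 0].min()
    return np.minimum(w, cap)
```

Normalizing the result with `w / w.sum()` gives probabilities that could feed a weighted random sampler.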

- Inference, Rank Averaging, and Mapping Back to Probabilities:
  - TTA per model/fold: K=5 time-crops of 10 s each spaced uniformly; for clips <10 s, use varied pad/loop starts. For long clips, consider K=10 if time allows.
  - Compute logits per crop; average logits across crops; apply sigmoid at the end.
  - Fold/model ensembling: perform per-class rank averaging across models/folds/TTAs.
  - Rank→probability mapping (explicit): per class, min-max normalize averaged ranks to [0,1] using CV distribution; then apply per-class temperature scaling (scalar τ_c learned on OOF/CV by minimizing BCE) to calibrate probabilities. Ensure final outputs ∈ [0,1].
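
The multi-crop TTA arithmetic can be sketched as follows. The helper names are hypothetical, and short clips are simplified to a single crop at offset 0 here; the varied pad/loop starts described above are not implemented in this sketch.

```python
import numpy as np

def crop_starts(n_samples, crop_len, k=5):
    """Start offsets for k uniformly spaced crops of length crop_len."""
    if n_samples <= crop_len:
        return [0]   # simplification: one padded crop for short clips
    return [int(s) for s in np.linspace(0, n_samples - crop_len, k)]

def tta_probs(logits_per_crop):
    """Average logits over crops (axis 0), then apply sigmoid once.

    logits_per_crop: (K, C) array of per-crop logits for one clip.
    """
    mean_logits = np.mean(logits_per_crop, axis=0)
    return 1.0 / (1.0 + np.exp(-mean_logits))
```

Averaging logits before the sigmoid, as the plan specifies, is not the same as averaging probabilities; the two differ whenever the crops disagree.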

- Ensembling Plan (diversity targets):
  - At least 4–6 models:
    1) CNN14, 64 mels, 10 s crop, BCE+LS, seed 42.
    2) CNN14, 64 mels, 10 s crop, BCE+LS, seed 2025.
    3) CNN14, 128 mels, 10 s crop.
    4) ResNet38, 64 mels, 10 s crop.
    5) CNN14, 64 mels, 10 s crop with focal loss (gamma 1.5) variant.
    6) CNN14, 64 mels, multi-scale crops (5/10/12 s schedule).
  - Blend via rank averaging with per-model weights proportional to CV LWLRAP (normalize weights to sum=1).
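
A weighted per-class rank average could look like the sketch below. It makes several simplifying assumptions: ties get ordinal ranks, min-max normalization uses the batch being blended rather than a stored CV distribution, and the per-class temperature scaling step is left out. The function names are hypothetical.

```python
import numpy as np

def rank_columns(scores):
    """Ordinal ranks in [0, N-1] within each class column of an (N, C) array."""
    return scores.argsort(axis=0).argsort(axis=0).astype(np.float64)

def rank_average(pred_list, weights=None):
    """Blend (N, C) prediction matrices by weighted per-class rank averaging.

    weights: e.g. each model's CV LWLRAP; normalized here to sum to 1.
    Returns an (N, C) array min-max scaled to [0, 1] per class.
    """
    w = np.ones(len(pred_list)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()
    avg = sum(wi * rank_columns(p) for wi, p in zip(w, pred_list))
    lo = avg.min(axis=0, keepdims=True)
    hi = avg.max(axis=0, keepdims=True)
    return (avg - lo) / np.maximum(hi - lo, 1e-12)
```

For example, `rank_average([p_cnn14, p_resnet38], weights=[cv_a, cv_b])` would blend two models, where the prediction matrices and CV scores are placeholders for values produced elsewhere in the pipeline.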

- Engineering & Efficiency:
  - Augmentation vs caching: to preserve augmentation diversity, compute spectrograms on-the-fly during training (torchaudio preferred). Optionally cache only for inference or for fixed crops used in TTA.
  - Use PyTorch DataLoader with num_workers=8–12, prefetch and pinned memory; cudnn.benchmark=True.
  - Determinism: set all seeds; log configs and per-fold metrics; save checkpoints and OOF predictions.
  - Notebook backups: programmatically save a copy of agent_notebook.ipynb before major refactors.

Milestones & Audits
- C0 (this): Gold-only Plan v2.1 with mandatory fixes — Approved.
- C1: Data loading, label parsing, CV split, LWLRAP implementation check, basic EDA (label counts/durations), and mel frontend prototype aligned to PANNs (64 mel).
- C2: PANNs CNN14 curated-only 5-fold training; OOF LWLRAP reported; baseline submission (safety).
- C3: Noisy protocol (teacher inference, filtering/weighting), curriculum fine-tune; updated CV.
- C4: TTA and initial ensembling across folds/seeds; submit improved predictions.
- C5: Diversity models (ResNet38, 128/256 mels), final rank-averaged ensemble; final submission.

Next Action: Proceed to C1 implementation with this corrected plan (PANNs-aligned 64-mel frontend, explicit rank→prob mapping, and defined checkpointing/early stopping).


In [1]:
# Cell 2: Consolidated setup and imports (single source of truth)
import sys, subprocess, os, json, time, warnings
from pathlib import Path

warnings.filterwarnings('ignore')

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
        print(f'Installed: {pkg}')
    except Exception as e:
        print(f'Failed to install {pkg}: {e}')

# Map pip package names to import names. They differ for
# iterative-stratification (iterstrat) and scikit-learn (sklearn), so
# deriving the module name from the package name always triggers a reinstall.
IMPORT_NAMES = {
    'soundfile': 'soundfile',
    'librosa': 'librosa',
    'iterative-stratification': 'iterstrat',
    'joblib': 'joblib',
    'scikit-learn': 'sklearn',
}

for pkg, module in IMPORT_NAMES.items():
    try:
        __import__(module)
        print(f'{pkg} already available')
    except Exception:
        pip_install(pkg)

import numpy as np
import pandas as pd
import soundfile as sf
import librosa
import torch
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import joblib

BASE = Path('.')
np.random.seed(42)



soundfile already available
librosa already available
Installed: iterative-stratification
joblib already available
Installed: scikit-learn


In [2]:
# Cell 3: Offline PANNs asset preparation (labels CSV stub + CNN14 checkpoint)
from pathlib import Path
import urllib.request

assets_dir = Path('/app/panns_data')
assets_dir.mkdir(parents=True, exist_ok=True)

# Write a placeholder AudioSet label file (527 classes) if none exists.
labels_csv = assets_dir / 'class_labels_indices.csv'
if not labels_csv.exists():
    import csv
    with open(labels_csv, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['index','mid','display_name'])
        for i in range(527):
            w.writerow([i, f'/m/{i}', f'class_{i:03d}'])
    print('Created labels stub at', labels_csv)
else:
    print('Labels CSV exists at', labels_csv)

# Download the CNN14 checkpoint in 1 MiB chunks if it is missing or empty.
ckpt_path = assets_dir / 'Cnn14_mAP=0.431.pth'
url = 'https://zenodo.org/record/3987831/files/Cnn14_mAP=0.431.pth?download=1'
if not ckpt_path.exists() or ckpt_path.stat().st_size == 0:
    print('Downloading CNN14 weights ...')
    with urllib.request.urlopen(url) as resp, open(ckpt_path, 'wb') as out:
        while True:
            chunk = resp.read(1<<20)
            if not chunk:
                break
            out.write(chunk)
    print('Saved CNN14 to', ckpt_path, 'size:', ckpt_path.stat().st_size)
else:
    print('CNN14 checkpoint present:', ckpt_path, 'size:', ckpt_path.stat().st_size)



Created labels stub at /app/panns_data/class_labels_indices.csv
Downloading CNN14 weights ...
Saved CNN14 to /app/panns_data/Cnn14_mAP=0.431.pth size: 327428481


In [3]:
# Cell 4: Data foundation, robust label parsing, MLSKF folds (no fallback)
BASE = Path('.')
df_cur = pd.read_csv(BASE / 'train_curated.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
fname_col = 'fname' if 'fname' in df_cur.columns else df_cur.columns[0]
labels_col = 'labels' if 'labels' in df_cur.columns else df_cur.columns[-1]

# Class order follows the sample_submission columns.
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    # Accept comma- or semicolon-separated labels; fail loudly on unknown names.
    if not isinstance(s, str):
        return []
    toks = [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]
    unknown = [t for t in toks if t not in label_to_idx]
    if unknown:
        raise ValueError(f'Unknown labels: {unknown[:5]} (total {len(unknown)})')
    return toks

def encode_tokens(toks):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in toks:
        y[label_to_idx[t]] = 1.0
    return y

def encode_labels(s):
    return encode_tokens(parse_labels_str(s))

# 5-fold multilabel stratified split on the curated set.
tokens = df_cur[labels_col].apply(parse_labels_str)
Y_cur = np.stack(tokens.apply(encode_tokens).values)
mlsk = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(df_cur), -1, dtype=int)
for k, (_, val_idx) in enumerate(mlsk.split(df_cur[fname_col].values, Y_cur)):
    folds[val_idx] = k
assert (folds >= 0).all()
df_cur['fold'] = folds
df_cur.to_csv('train_curated_folds.csv', index=False)
with open('metadata.json', 'w') as f:
    json.dump({'class_names': class_names, 'label_to_idx': label_to_idx, 'fname_col': fname_col, 'labels_col': labels_col}, f)
print('Saved train_curated_folds.csv and metadata.json; fold counts:')
print(df_cur['fold'].value_counts().sort_index())

def lwlrap_np(truth, scores):
    # Label-weighted label-ranking average precision.
    # Returns (overall LWLRAP, per-class LRAP).
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        # Rank classes by descending score; precision at each true label's rank.
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    # Weight each class by its share of all positive labels.
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class



Saved train_curated_folds.csv and metadata.json; fold counts:
fold
0    999
1    994
2    992
3    997
4    988
Name: count, dtype: int64


In [4]:
# Cell 5: PANNs CNN14 embeddings + OVR Logistic Regression (OOF + submission)
# Relies on encode_labels and lwlrap_np defined in Cell 4.
from pathlib import Path
import numpy as np, pandas as pd
import librosa, torch, joblib, time
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
CROP_SEC = 10.0
EMB_DIM = 2048
N_FOLDS = 5
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing; run Cell 3 first.'

df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)
train_dir = BASE / 'train_curated'
test_dir  = BASE / 'test'

def load_center_crop_10s(path, sr=SR, crop_sec=CROP_SEC):
    # Resample to mono at `sr`, then center-crop or zero-pad to crop_sec seconds.
    y, _ = librosa.load(path, sr=sr, mono=True)
    target = int(sr * crop_sec)
    if len(y) >= target:
        start = max(0, (len(y) - target)//2)
        y = y[start:start+target]
    else:
        y = np.pad(y, (0, target-len(y)))
    return y.astype(np.float32)

emb_cur_path = BASE / 'embeddings_curated.npy'
emb_test_path = BASE / 'embeddings_test.npy'

def extract_embeddings(file_list, root_dir):
    # One 2048-dim CNN14 embedding per file, from a single center crop.
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    X = np.zeros((len(file_list), EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for i, fname in enumerate(file_list):
        y = load_center_crop_10s(str(Path(root_dir) / fname), sr=SR, crop_sec=CROP_SEC)
        y_batched = np.expand_dims(y, 0)
        with torch.no_grad():
            out = at.inference(y_batched)
        if isinstance(out, tuple) and len(out) == 2:
            embedding = out[1]
        elif isinstance(out, dict) and 'embedding' in out:
            embedding = out['embedding']
        else:
            raise RuntimeError(f'Unexpected AudioTagging output: {type(out)}')
        X[i] = np.asarray(embedding, dtype=np.float32)[0]
        if (i+1) % 200 == 0:
            dt = time.time() - t0
            print(f'  {i+1}/{len(file_list)} in {dt/60:.1f} min')
    return X

train_files = df_cur['fname'].values
test_files  = df_ss['fname'].values

# Reuse cached embeddings only when both files are present.
if emb_cur_path.exists() and emb_test_path.exists():
    X_cur = np.load(emb_cur_path)
    X_test = np.load(emb_test_path)
    print('Loaded cached embeddings.')
else:
    print('Extracting curated embeddings ...')
    X_cur = extract_embeddings(train_files, root_dir=train_dir)
    np.save(emb_cur_path, X_cur)
    print('Extracting test embeddings ...')
    X_test = extract_embeddings(test_files, root_dir=test_dir)
    np.save(emb_test_path, X_test)
    print('Saved embeddings to disk.')

# Out-of-fold evaluation: the scaler is fit inside each fold's pipeline.
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values)
oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    base_lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    oof[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores.append(lw)
    print(f'Fold {k} LWLRAP={lw:.4f}')
oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'OOF LWLRAP={oof_lw:.4f}; per-fold={fold_scores}')
np.save('oof_panns_lr.npy', oof)

# Refit on all curated data for the baseline submission.
base_lr_full = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
clf_full = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr_full), n_jobs=-1)
clf_full.fit(X_cur, Y_cur)
test_proba = clf_full.predict_proba(X_test).astype(np.float32)
sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', test_files)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv. Shape:', sub.shape)
joblib.dump(clf_full, 'ovr_logreg_panns.joblib')
print('Saved classifier checkpoint.')



Extracting curated embeddings ...
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
  200/4970 in 4.0 min
  400/4970 in 13.5 min
  600/4970 in 22.3 min
  800/4970 in 32.5 min
  1000/4970 in 40.9 min
  1200/4970 in 45.6 min
  1400/4970 in 49.2 min
  1600/4970 in 50.2 min
  1800/4970 in 51.2 min
  2000/4970 in 52.8 min
  2200/4970 in 53.9 min
  2400/4970 in 54.9 min
  2600/4970 in 55.9 min
  2800/4970 in 56.9 min
  3000/4970 in 57.9 min
  3200/4970 in 58.9 min
  3400/4970 in 59.9 min
  3600/4970 in 60.9 min
  3800/4970 in 62.0 min
  4000/4970 in 63.0 min
  4200/4970 in 64.1 min
  4400/4970 in 65.1 min
  4600/4970 in 66.4 min
  4800/4970 in 67.6 min
Extracting test embeddings ...
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
  200/3361 in 1.3 min
  400/3361 in 2.5 min
  600/3361 in 3.7 min
  800/3361 in 5.0 min
  1000/3361 in 6.2 min
  1200/3361 in 7.5 min
  1400/3361 in 8.7 min
  1600/3361 in 9.8 min
  1800/3361 in 11.0 min
  2000/3361 in 12.2 min
  2200/3361 in 13.5 min
  2400/3361 in 14.8 min
  2600/3361 in 16.1 min
  2800/3361 in 17.4 min
  3000/3361 in 18.6 min
  3200/3361 in 19.8 min
Saved embeddings to disk.
Fold 0 LWLRAP=0.8091
Fold 1 LWLRAP=0.8004
Fold 2 LWLRAP=0.8080
Fold 3 LWLRAP=0.8097
Fold 4 LWLRAP=0.7748
OOF LWLRAP=0.8001; per-fold=[0.8091265640638446, 0.8004257362345152, 0.8080218131533763, 0.8096867591565071, 0.7748105563971049]
Saved submission.csv. Shape: (3361, 81)
Saved classifier checkpoint.
