# Freesound Audio Tagging 2019 — Medal-Driven Notebook (Plan v2.1: Gold-Only Strategy, Mandatory Fixes Applied)

Experiment Log and Plan (v2.1 — incorporates C0 audit + mandatory revisions):
- Objective: GOLD medal. Optimize Label-Weighted LRAP (LWLRAP). Focus on SOTA pretrained encoders, robust noisy-label handling, strong CV, TTA, and ensembling.
- Data artifacts present: train_curated.csv, train_noisy.csv, train_curated.zip, train_noisy.zip, test.zip, sample_submission.csv.
- Protocol: concise notebook, document attempts, backup before major changes, delete stale code, submit for audit at milestones.

Unified Gold Strategy (Single Path Only)
1) Encoder: Fine-tune pretrained PANNs (primary: CNN14, backup: ResNet38). Use log-mel frontend matching PANNs defaults.
2) CV: 5-fold MultilabelStratifiedKFold (seed=42), stratifying on 80-class binary matrix. Track per-fold/per-class LRAP and global LWLRAP.
3) Noisy-label protocol: teacher-student and curriculum with confidence filtering/weighting.
4) Inference: strong TTA (multi time-crop), fold-averaging, rank-aware ensembling across diverse models/seeds.

Details
- Label Space:
  - Parse label column (comma-separated in the provided CSVs; accept semicolons as well). Build consistent class list sorted to match sample_submission column order.
  - Binarize y for CV and training. Save mapping for inference.

- Audio Preprocessing (Log-mel aligned to PANNs):
  - Sample rate: 32,000 Hz (resample all audio to 32k).
  - Clip duration: 10.0 s target per sample.
  - Variable length policy: during training, random time-crop to 10 s if longer; if shorter, loop-pad (tile then trim) with 0.1 probability; else zero-pad to 10 s. At inference, use multi-crop TTA (see below).
  - STFT: n_fft=1024, hop_length=320 (10 ms hop), win_length=1024, window=hann, center=True.
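  - Mel: PRIMARY n_mels=64 (to match PANNs CNN14 pretraining); diversity models may use n_mels=128 or 256. fmin=50 Hz, fmax=16000 Hz (Nyquist; the released PANNs CNN14 32 kHz configuration uses fmax=14000 Hz, so switch to 14000 if strict alignment with the pretrained frontend matters), htk=False, norm=None.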
  - Log scale: use librosa.power_to_db on mel power; then standardize per-frequency bin with dataset mean/std (computed on curated train); clamp to [-10, 10] after standardization for stability.
  - Channel: mono (downmix).
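
The variable-length policy above can be sketched as a single helper. This is a minimal illustration, not the notebook's actual data loader: the name `fit_to_length` and its parameters are hypothetical, and it assumes the waveform is already mono and resampled to 32 kHz.

```python
import numpy as np

SR = 32000
TARGET = 10 * SR  # 10.0 s at 32 kHz


def fit_to_length(y, target=TARGET, loop_prob=0.1, rng=None):
    """Random-crop long clips; loop-pad (with prob. loop_prob) or zero-pad short ones."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    if n >= target:
        # Random time-crop: any start such that the crop fits inside the clip.
        start = int(rng.integers(0, n - target + 1))
        return y[start:start + target]
    if n > 0 and rng.random() < loop_prob:
        # Loop-pad: tile the clip, then trim to the target length.
        reps = int(np.ceil(target / n))
        return np.tile(y, reps)[:target]
    # Default for short clips: zero-pad at the end.
    return np.pad(y, (0, target - n), mode="constant")
```

Passing a seeded `rng` makes the crop reproducible, which helps when debugging a single fold.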

- Model Architecture:
  - Base: PANNs CNN14 pretrained on AudioSet (log-mel, 64 mel). Replace classifier with attention pooling head for 80 classes.
  - Pooling: attention pooling over time (linear attention + context gating) instead of mean pooling.
  - Head: Dropout p=0.5 before final linear; output 80 logits.
  - Alternate diversity models: PANNs ResNet38; CNN14 with 128/256 mels; crop lengths 5 s and 12 s variants.
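
As a rough illustration of attention pooling over time, the NumPy sketch below pools frame-level embeddings into clip-level logits. It rests on several assumptions: the encoder is taken to emit a `(T, D)` matrix of frame embeddings, the weight names (`w_att`, `w_cls`, `b_cls`) are hypothetical, and context gating and dropout are left out for brevity. With an all-zero attention vector the result reduces to plain mean pooling, which makes a convenient sanity check.

```python
import numpy as np


def attention_pool_logits(feats, w_att, w_cls, b_cls):
    """Pool (T, D) frame embeddings into (C,) clip-level logits."""
    scores = feats @ w_att                  # (T,) one attention score per frame
    scores = scores - scores.max()          # stabilize the softmax
    att = np.exp(scores) / np.exp(scores).sum()  # (T,) weights over time, sum to 1
    pooled = att @ feats                    # (D,) attention-weighted average
    return pooled @ w_cls + b_cls           # (C,) logits


rng = np.random.default_rng(0)
T, D, C = 50, 16, 80                        # toy sizes, not CNN14's real dimensions
feats = rng.normal(size=(T, D))
w_cls = rng.normal(size=(D, C)) * 0.01
b_cls = np.zeros(C)

# Zero attention weights give uniform attention, i.e. mean pooling.
logits = attention_pool_logits(feats, np.zeros(D), w_cls, b_cls)
mean_pooled_logits = feats.mean(axis=0) @ w_cls + b_cls
```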

- Losses and Label Handling:
  - Primary loss: BCEWithLogitsLoss.
  - Class weighting: inverse sqrt class frequency from curated train; normalize weights to mean=1.0.
  - Label smoothing: 0.05, applied to positive targets only (y -> y*(1-0.05) + 0.5*0.05, so 1 -> 0.975); negatives stay at 0.
  - Robust noise alternatives: Generalized Cross Entropy (q=0.7) or Symmetric Cross Entropy (alpha=1.0, beta=0.5) if needed in noisy stages.
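
The class weighting, positive-only smoothing, and BCE-with-logits pieces can be sketched together in NumPy. The function names are hypothetical and the loss is written out by hand for illustration; in training it would presumably be the PyTorch `BCEWithLogitsLoss` with the same targets and weights.

```python
import numpy as np


def class_weights_inv_sqrt(Y):
    """Inverse-sqrt class-frequency weights, normalized to mean 1.0."""
    freq = np.maximum(Y.sum(axis=0), 1.0)   # guard against empty classes
    w = 1.0 / np.sqrt(freq)
    return w / w.mean()


def smooth_positives(Y, eps=0.05):
    """Smooth positive targets only (1 -> 0.975 for eps=0.05); negatives stay 0."""
    return np.where(Y > 0, Y * (1 - eps) + 0.5 * eps, 0.0)


def weighted_bce_with_logits(logits, targets, class_w):
    """Numerically stable BCE on logits, scaled per class, averaged over all entries."""
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float((loss * class_w).mean())


Y = np.array([[1, 0], [1, 0], [1, 0], [1, 1]], dtype=np.float32)
w = class_weights_inv_sqrt(Y)     # class 0 is 4x more frequent, so it gets half the weight
T_smooth = smooth_positives(Y)
```

Because the weights are normalized to mean 1.0, the overall loss scale stays comparable to the unweighted case.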

- Optimizer, Schedule, Checkpointing:
  - Optimizer: AdamW (betas=(0.9, 0.999), weight_decay=1e-4).
  - LR schedule: cosine with warmup. Base LR 2e-4 for head, 1e-4 for encoder (param groups). Warmup 1 epoch (or 1000 steps), then cosine decay.
  - Epochs: 20 epochs curated-only warm start; 10–15 epochs with noisy curriculum stages (see below).
  - Mixed precision (AMP) enabled; gradient clipping at 5.0.
  - Batch size: as large as fits GPU (V100 16GB): target 32 for 64-mel 10 s; adjust dynamically.
  - Early stopping & checkpoints: monitor validation LWLRAP each epoch, save best checkpoint (highest LWLRAP); patience=3 epochs for early stop. Use best-val checkpoint for inference.
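
One way to express the warmup-plus-cosine schedule is as a pure function of the step index, evaluated once per parameter group. This is a sketch under assumptions: `lr_at_step` is a hypothetical helper, warmup is linear, and the decay floor `min_lr` defaults to zero because the plan does not state one.

```python
import math


def lr_at_step(step, total_steps, base_lr, warmup_steps=1000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Separate base rates for the two parameter groups named in the plan.
head_lr = lr_at_step(5500, total_steps=10000, base_lr=2e-4)     # halfway through decay
encoder_lr = lr_at_step(5500, total_steps=10000, base_lr=1e-4)
```

In PyTorch the same function could be wrapped in a `LambdaLR` multiplier, one lambda per group.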

- Data Augmentations:
  - Time-domain: random gain [-6, +6] dB; pink/gaussian noise injection SNR ~ 20–30 dB; time shift ±0.5 s; mild pitch shift (±2 semitones) and time stretch (0.9–1.1).
  - Spectrogram: SpecAugment — 2 freq masks (width up to 20 mel bins), 2 time masks (up to 10% of frames), without masking entire clip.
  - MixUp: alpha=0.4 on spectrograms/logits; labels mixed linearly; probability 0.5.
  - Random time-crop as above; for multi-scale training, occasionally use 5 s or 12 s crops (p=0.2 each) in diversity runs.
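
A minimal MixUp sketch on a batch of spectrograms, assuming one mixing coefficient per batch drawn from Beta(0.4, 0.4) and application with probability 0.5. The helper name `mixup_batch` is hypothetical, and the per-batch (rather than per-sample) coefficient is a simplifying choice.

```python
import numpy as np


def mixup_batch(X, Y, alpha=0.4, p=0.5, rng=None):
    """Mix a batch with a shuffled copy of itself; labels are mixed linearly."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() >= p:
        return X, Y                       # skip MixUp for this batch
    lam = rng.beta(alpha, alpha)          # mixing coefficient in (0, 1)
    perm = rng.permutation(len(X))        # partner index for each sample
    X_mix = lam * X + (1 - lam) * X[perm]
    Y_mix = lam * Y + (1 - lam) * Y[perm]
    return X_mix, Y_mix
```

The mixed targets are soft, so they pair naturally with a BCE-style loss rather than a hard-label loss.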

- Cross-Validation (explicit):
  - Iterative Stratification (MultilabelStratifiedKFold) with n_splits=5, shuffle=True, random_state=42.
  - Stratify on 80-dim binary labels from curated train only; keep folds disjoint by filename.
  - Metrics per fold: LWLRAP (primary), per-class LRAP, macro/micro AUC for diagnostics; early-stop on LWLRAP.

- Noisy Data Protocol (multi-step):
  1) Train a strong teacher on curated-only (5-fold, out-of-fold predictions saved for all curated).
  2) Use the teacher (fold models averaged) to infer probabilities on train_noisy.
  3) Confidence filtering:
     - Positive selection: keep labels where teacher prob for that class ≥ 0.8; set others to 0 for that sample.
     - Optional addition: add teacher positive pseudo-labels for classes ≥ 0.95 even if not present in weak labels.
     - Discard samples with no remaining positives after filtering.
  4) Weighting: per-sample weight = max teacher prob among positives; clip to [0.5, 1.0]. Also scale by class weight as above.
  5) Curriculum retraining:
     - Stage A: fine-tune teacher from curated checkpoint adding filtered noisy with low weight (0.5) for 3–5 epochs.
     - Stage B: increase noisy sample weights to 0.75–1.0 for 5–10 epochs; optionally unfreeze more encoder layers.
     - If noisy destabilizes LWLRAP, swap to GCE (q=0.7) or SCE (α=1, β=0.5).
  6) Recompute OOF on curated and evaluate LWLRAP improvements before proceeding.
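
Steps 3 and 4 of the protocol (confidence filtering and per-sample weighting) can be sketched as follows. The function `filter_noisy_labels` and its argument names are hypothetical; `P_teacher` stands for the fold-averaged teacher probabilities on train_noisy, and class weighting is left out.

```python
import numpy as np


def filter_noisy_labels(Y_weak, P_teacher, keep_thr=0.8, add_thr=0.95,
                        add_pseudo=False, w_min=0.5, w_max=1.0):
    """Return filtered labels, per-sample weights, and the boolean keep mask."""
    # Keep a weak label only where the teacher agrees with high confidence.
    Y = ((Y_weak > 0) & (P_teacher >= keep_thr)).astype(np.float32)
    if add_pseudo:
        # Optionally add very confident teacher positives not in the weak labels.
        Y = np.maximum(Y, (P_teacher >= add_thr).astype(np.float32))
    keep = Y.sum(axis=1) > 0              # drop samples with no remaining positives
    max_pos = np.where(Y > 0, P_teacher, 0.0).max(axis=1)
    weights = np.clip(max_pos, w_min, w_max)
    return Y[keep], weights[keep], keep


Y_weak = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 0]])
P_teacher = np.array([[0.90, 0.30, 0.97],
                      [0.50, 0.10, 0.20],
                      [0.10, 0.85, 0.40]])
Y_f, w_f, keep = filter_noisy_labels(Y_weak, P_teacher)
```

In this toy case the second sample is discarded because the teacher gives its only weak label a probability of 0.5.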

- Sampling Strategy:
  - Balanced batch sampler: ensure each batch includes rare classes via inverse-frequency sampling on curated; cap oversampling at 5x.

- Inference, Rank Averaging, and Mapping Back to Probabilities:
  - TTA per model/fold: K=5 time-crops of 10 s each spaced uniformly; for clips <10 s, use varied pad/loop starts. For long clips, consider K=10 if time allows.
  - Compute logits per crop; average logits across crops; apply sigmoid at the end.
  - Fold/model ensembling: perform per-class rank averaging across models/folds/TTAs.
  - Rank→probability mapping (explicit): per class, min-max normalize averaged ranks to [0,1] using CV distribution; then apply per-class temperature scaling (scalar τ_c learned on OOF/CV by minimizing BCE) to calibrate probabilities. Ensure final outputs ∈ [0,1].
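
The crop placement and logit averaging for TTA can be sketched as below. Both helpers are hypothetical names; for clips shorter than the crop length this sketch returns a single start at 0 and leaves the varied pad/loop starts to the padding routine.

```python
import numpy as np


def tta_crop_starts(n_samples, crop_len, k=5):
    """Start offsets for k uniformly spaced crops covering the whole clip."""
    if n_samples <= crop_len:
        return [0]
    return np.linspace(0, n_samples - crop_len, k).round().astype(int).tolist()


def tta_predict(logits_per_crop):
    """Average logits across crops, then apply the sigmoid once at the end."""
    mean_logits = np.mean(logits_per_crop, axis=0)
    return 1.0 / (1.0 + np.exp(-mean_logits))


SR = 32000
starts = tta_crop_starts(20 * SR, 10 * SR, k=5)   # a 20 s clip, 10 s crops
probs = tta_predict(np.array([[0.0, 0.0], [2.0, -2.0]]))
```

Averaging in logit space before the sigmoid keeps a single very confident crop from being flattened as much as it would be when averaging probabilities.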

- Ensembling Plan (diversity targets):
  - At least 4–6 models:
    1) CNN14, 64 mels, 10 s crop, BCE+LS, seed 42.
    2) CNN14, 64 mels, 10 s crop, BCE+LS, seed 2025.
    3) CNN14, 128 mels, 10 s crop.
    4) ResNet38, 64 mels, 10 s crop.
    5) CNN14, 64 mels, 10 s crop with focal loss (gamma 1.5) variant.
    6) CNN14, 64 mels, multi-scale crops (5/10/12 s schedule).
  - Blend via rank averaging with per-model weights proportional to CV LWLRAP (normalize weights to sum=1).
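
A small sketch of the weighted per-class rank blend, assuming each model contributes an `(n_samples, n_classes)` prediction matrix and that weights are the CV LWLRAP scores normalized to sum to 1. The helper names are hypothetical, ties are broken by position rather than averaged, and the temperature-scaling calibration step is not shown.

```python
import numpy as np


def rank_columns(P):
    """Per-class ranks scaled to [0, 1] (0 = lowest score in that column)."""
    ranks = P.argsort(axis=0).argsort(axis=0)
    return ranks / max(P.shape[0] - 1, 1)


def rank_blend(preds, cv_scores):
    """Weighted average of per-class ranks; weights proportional to CV score."""
    w = np.asarray(cv_scores, dtype=float)
    w = w / w.sum()
    return sum(wi * rank_columns(P) for wi, P in zip(w, preds))


P1 = np.array([[0.1, 0.9], [0.5, 0.2], [0.9, 0.4]])
P2 = 1.0 - P1                                  # a model that ranks in reverse
blend_equal = rank_blend([P1, P2], [0.5, 0.5])
blend_skewed = rank_blend([P1, P2], [0.75, 0.25])
```

With equal weights the two opposite rankings cancel to 0.5 everywhere; with skewed weights the blend leans toward the stronger model.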

- Engineering & Efficiency:
  - Augmentation vs caching: to preserve augmentation diversity, compute spectrograms on-the-fly during training (torchaudio preferred). Optionally cache only for inference or for fixed crops used in TTA.
  - Use PyTorch DataLoader with num_workers=8–12, prefetch and pinned memory; cudnn.benchmark=True.
  - Determinism: set all seeds; log configs and per-fold metrics; save checkpoints and OOF predictions.
  - Notebook backups: programmatically save a copy of agent_notebook.ipynb before major refactors.

Milestones & Audits
- C0 (this): Gold-only Plan v2.1 with mandatory fixes — Approved.
- C1: Data loading, label parsing, CV split, LWLRAP implementation check, basic EDA (label counts/durations), and mel frontend prototype aligned to PANNs (64 mel).
- C2: PANNs CNN14 curated-only 5-fold training; OOF LWLRAP reported; baseline submission (safety).
- C3: Noisy protocol (teacher inference, filtering/weighting), curriculum fine-tune; updated CV.
- C4: TTA and initial ensembling across folds/seeds; submit improved predictions.
- C5: Diversity models (ResNet38, 128/256 mels), final rank-averaged ensemble; final submission.

Next Action: Proceed to C1 implementation with this corrected plan (PANNs-aligned 64-mel frontend, explicit rank→prob mapping, and defined checkpointing/early stopping).


In [1]:
# C1: Data I/O, label parsing, CV split (MLSK), LWLRAP metric, basic EDA, unzip audio
import os, zipfile, json, sys, math, random, warnings
from pathlib import Path
import numpy as np
import pandas as pd

warnings.filterwarnings('ignore')
BASE = Path('.')

# 1) Unzip datasets if not already extracted
def unzip_if_needed(zip_path: Path, dest_dir: Path):
    if not dest_dir.exists():
        dest_dir.mkdir(parents=True, exist_ok=True)
    # Heuristic: check if dest_dir has any audio files
    has_files = any(dest_dir.rglob('*.wav')) or any(dest_dir.rglob('*.mp3')) or any(dest_dir.rglob('*.flac'))
    if (not has_files) and zip_path.exists():
        print(f"Extracting {zip_path} -> {dest_dir} ...")
        with zipfile.ZipFile(zip_path, 'r') as z:
            z.extractall(dest_dir)
        print("Done.")
    return dest_dir

train_curated_csv = BASE / 'train_curated.csv'
train_noisy_csv   = BASE / 'train_noisy.csv'
sample_sub_csv    = BASE / 'sample_submission.csv'
train_curated_zip = BASE / 'train_curated.zip'
train_noisy_zip   = BASE / 'train_noisy.zip'
test_zip          = BASE / 'test.zip'

train_curated_dir = unzip_if_needed(train_curated_zip, BASE / 'train_curated')
train_noisy_dir   = unzip_if_needed(train_noisy_zip,   BASE / 'train_noisy')
test_dir          = unzip_if_needed(test_zip,          BASE / 'test')

# 2) Load CSVs
df_cur = pd.read_csv(train_curated_csv)
df_noi = pd.read_csv(train_noisy_csv)
df_ss  = pd.read_csv(sample_sub_csv)

print('Curated shape:', df_cur.shape)
print('Noisy shape:', df_noi.shape)
print('Sample submission shape:', df_ss.shape)

# 3) Determine schema
print('Curated head:')
print(df_cur.head())
print('Noisy head:')
print(df_noi.head())

# Expect columns: 'fname' and 'labels' (semicolon-separated). Confirm and adapt if needed.
fname_col = 'fname' if 'fname' in df_cur.columns else df_cur.columns[0]
labels_col = 'labels' if 'labels' in df_cur.columns else df_cur.columns[-1]
print(f"Using columns -> fname: {fname_col}, labels: {labels_col}")

# 4) Build class list from sample_submission column order
class_names = [c for c in df_ss.columns if c != 'fname']
n_classes = len(class_names)
print('Number of classes from sample_submission:', n_classes)

# 5) Parse labels to multilabel binary matrix for curated data
label_to_idx = {c:i for i,c in enumerate(class_names)}

def encode_labels(label_str: str):
    y = np.zeros(n_classes, dtype=np.float32)
    if isinstance(label_str, str) and label_str.strip():
        for tok in label_str.replace(';', ',').split(','):  # accept ',' or ';' as the separator
            t = tok.strip()
            if t in label_to_idx:
                y[label_to_idx[t]] = 1.0
    return y

Y_cur = np.stack(df_cur[labels_col].apply(encode_labels).values)
print('Curated label matrix shape:', Y_cur.shape, 'positives:', int(Y_cur.sum()))

# 6) Implement LWLRAP metric (Kaggle-official style)
def lwlrap(truth, scores):
    # truth: (n_samples, n_classes) binary; scores: same shape floats
    # Computes label-weighted label-ranking average precision
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    # Per-class sum of precisions
    precisions_for_these_labels = np.zeros(n_labels)
    labels_per_class = truth.sum(axis=0)
    # Avoid divide by zero
    labels_per_class = np.maximum(labels_per_class, 1)
    for i in range(n_samples):
        pos_label_indices = np.where(truth[i] > 0)[0]
        if len(pos_label_indices) == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        # precision at each true label
        pos_rank_indices = np.where(ranked_truth > 0)[0]
        precisions = cumsum[pos_rank_indices] / (pos_rank_indices + 1)
        # add to corresponding classes
        ranked_labels = ranking[pos_rank_indices]
        for lbl, prec in zip(ranked_labels, precisions):
            precisions_for_these_labels[lbl] += prec
    per_class_lwlrap = precisions_for_these_labels / labels_per_class
    # label-weighted: weight by prevalence of positives
    weights = (truth.sum(axis=0) / np.maximum(truth.sum(), 1))
    return float((per_class_lwlrap * weights).sum()), per_class_lwlrap

# 7) Create 5-fold Multilabel Stratified K-Fold (iterative stratification)
folds = np.full(len(df_cur), -1, dtype=int)
mlsk_available = False
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
    mlsk = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for k, (_, val_idx) in enumerate(mlsk.split(df_cur[fname_col].values, Y_cur)):
        folds[val_idx] = k
    mlsk_available = True
except Exception as e:
    print('iterstrat not available or failed to import:', e)
    print('Falling back to regular KFold (WARNING: less reliable for multilabel).')
    from sklearn.model_selection import KFold
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    for k, (_, val_idx) in enumerate(kf.split(df_cur)):
        folds[val_idx] = k

df_cur['fold'] = folds
print('Fold distribution (counts):')
print(df_cur['fold'].value_counts().sort_index())

# 8) Basic EDA: label frequency and co-occurrence summary
label_freq = Y_cur.sum(axis=0)
top_idx = np.argsort(-label_freq)[:10]
print('Top 10 classes by frequency:')
for i in top_idx:
    print(f"{class_names[i]}: {int(label_freq[i])}")
print('Num samples with >3 labels:', int((Y_cur.sum(axis=1) > 3).sum()))

# 9) Save artifacts: class mapping and folds
meta = {
    'class_names': class_names,
    'label_to_idx': label_to_idx,
    'fname_col': fname_col,
    'labels_col': labels_col,
    'mlsk_available': mlsk_available
}
with open('metadata.json', 'w') as f:
    json.dump(meta, f)
df_cur.to_csv('train_curated_folds.csv', index=False)
print('Saved metadata.json and train_curated_folds.csv')

# 10) Sanity check LWLRAP implementation with a tiny synthetic example
truth_demo = np.array([[1,0,1],[0,1,0]], dtype=int)
scores_demo = np.array([[0.9, 0.2, 0.8],[0.1,0.7,0.3]], dtype=float)
demo_lwlrap, _ = lwlrap(truth_demo, scores_demo)
print('LWLRAP demo (should be between 0 and 1):', demo_lwlrap)

print('\nC1 setup complete: data loaded, labels parsed, 5-folds created, LWLRAP ready.')

Extracting train_curated.zip -> train_curated ...


Done.
Extracting train_noisy.zip -> train_noisy ...


Done.
Extracting test.zip -> test ...


Done.
Curated shape: (4970, 2)
Noisy shape: (19815, 2)
Sample submission shape: (3361, 81)
Curated head:
          fname           labels
0  0006ae4e.wav             Bark
1  0019ef41.wav         Raindrop
2  001ec0ad.wav  Finger_snapping
3  0026c7cb.wav              Run
4  0026f116.wav  Finger_snapping
Noisy head:
          fname                                          labels
0  00097e21.wav                    Bathtub_(filling_or_washing)
1  000b6cfb.wav                                      Motorcycle
2  00116cd2.wav              Marimba_and_xylophone,Glockenspiel
3  00127d14.wav  Water_tap_and_faucet,Sink_(filling_or_washing)
4  0019adae.wav                                        Raindrop
Using columns -> fname: fname, labels: labels
Number of classes from sample_submission: 80
Curated label matrix shape: (4970, 80) positives: 5752
iterstrat not available or failed to import: No module named 'iterstrat'


Fold distribution (counts):
fold
0    994
1    994
2    994
3    994
4    994
Name: count, dtype: int64
Top 10 classes by frequency:
Marimba_and_xylophone: 75
Screaming: 75
Scissors: 75
Run: 75
Raindrop: 75
Printer: 75
Motorcycle: 75
Microwave_oven: 75
Meow: 75
Shatter: 75
Num samples with >3 labels: 5
Saved metadata.json and train_curated_folds.csv
LWLRAP demo (should be between 0 and 1): 1.0

C1 setup complete: data loaded, labels parsed, 5-folds created, LWLRAP ready.
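
The demo in the cell above uses a perfectly ranked example, so it can only return 1.0 and would not catch most ranking bugs. A stricter check, sketched here with a compact hypothetical re-implementation (`lwlrap_check`), uses an imperfect ranking whose value can be worked out by hand: the three positives receive precisions 1, 2/3, and 1, giving 8/9.

```python
import numpy as np


def lwlrap_check(truth, scores):
    """Overall LWLRAP: mean precision-at-rank over all positive labels."""
    total, hits = 0.0, 0
    for t, s in zip(truth, scores):
        order = np.argsort(-s)                 # classes sorted by descending score
        ranked = t[order]
        pos = np.flatnonzero(ranked)           # 0-based ranks of the true labels
        total += (np.cumsum(ranked)[pos] / (pos + 1)).sum()
        hits += len(pos)
    return total / max(hits, 1)


truth = np.array([[1, 0, 1], [0, 1, 0]])
scores = np.array([[0.9, 0.8, 0.2], [0.1, 0.7, 0.3]])  # class 2 is ranked last in row 0
value = lwlrap_check(truth, scores)
print(round(value, 4))  # 0.8889
```

Running the notebook's own `lwlrap` on the same inputs should give the same overall value.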


In [2]:
# C1 Remediation: Install MLSKF, robust label parsing, duration EDA, 64-mel frontend prototype, rebuild folds
import sys, subprocess, json, math
from pathlib import Path
import numpy as np
import pandas as pd
import soundfile as sf
import librosa

BASE = Path('.')
train_curated_csv = BASE / 'train_curated.csv'
sample_sub_csv    = BASE / 'sample_submission.csv'
train_curated_dir = BASE / 'train_curated'

# 0) Load CSVs and sample submission for class order
df_cur = pd.read_csv(train_curated_csv)
df_ss  = pd.read_csv(sample_sub_csv)
fname_col = 'fname' if 'fname' in df_cur.columns else df_cur.columns[0]
labels_col = 'labels' if 'labels' in df_cur.columns else df_cur.columns[-1]
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

# 1) Robust label parsing with validation (accepts comma or semicolon separators); fail loudly on unknown labels
def parse_labels_str(label_str):
    if not isinstance(label_str, str):
        return []
    toks = [t.strip() for t in label_str.replace(';', ',').split(',') if t.strip()]
    # validate
    unknown = [t for t in toks if t not in label_to_idx]
    if unknown:
        raise ValueError(f"Unknown labels encountered: {unknown[:5]} ... total {len(unknown)}. Check class mappings/CSV.")
    return toks

def encode_labels(toks):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in toks:
        y[label_to_idx[t]] = 1.0
    return y

cur_tokens = df_cur[labels_col].apply(parse_labels_str)
Y_cur = np.stack(cur_tokens.apply(encode_labels).values)
print('Reparsed curated label matrix:', Y_cur.shape, 'positives:', int(Y_cur.sum()))

# 2) Investigate EDA anomaly: label frequency distribution
label_freq = Y_cur.sum(axis=0)
vals, counts = np.unique(label_freq, return_counts=True)
print('Label frequency value_counts (value -> num_classes):')
for v, c in zip(vals.astype(int), counts):
    print(f'  {v} -> {c}')
top10_idx = np.argsort(-label_freq)[:10]
print('Top10 classes and counts:')
for i in top10_idx:
    print(f'  {class_names[i]}: {int(label_freq[i])}')
if len(vals) == 1:
    print('NOTE: All classes have identical frequency in curated set. Dataset appears artificially balanced. Adjust class weighting accordingly (likely no weighting needed for curated-only).')
elif counts.max() > len(class_names)*0.5:
    print('NOTE: Majority of classes share the same frequency. Curated set is near-balanced. Use light/none class weighting for curated.')

# 3) Audio duration EDA using soundfile metadata (fast, no full decode)
audio_paths = df_cur[fname_col].apply(lambda f: str(train_curated_dir / f)).values
durations = []
missing = 0
for p in audio_paths:
    try:
        info = sf.info(p)
        d = info.frames / max(info.samplerate, 1)
        durations.append(d)
    except Exception:
        # fall back to a full decode with librosa if the header cannot be read
        try:
            y, sr = librosa.load(p, sr=None, mono=True)
            durations.append(len(y)/sr)
        except Exception:
            missing += 1
            durations.append(np.nan)
durations = np.array(durations, dtype=float)
valid = ~np.isnan(durations)
dur = durations[valid]
def pct(a, q):
    return float(np.percentile(a, q)) if len(a) else float('nan')
print(f'Durations stats (n={len(dur)}, missing={missing}): min={dur.min():.3f}s, mean={dur.mean():.3f}s, p50={pct(dur,50):.3f}s, p90={pct(dur,90):.3f}s, p99={pct(dur,99):.3f}s, max={dur.max():.3f}s')
print('Implication: use 10.0s target crops; apply random crop if longer, zero/loop pad if shorter as per plan.')

# 4) Install iterative-stratification and build 5-fold MultilabelStratifiedKFold (no fallback)
need_install = False
try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
except Exception:
    need_install = True
if need_install:
    print('Installing iterative-stratification ...')
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'iterative-stratification'])
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

mlsk = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(df_cur), -1, dtype=int)
for k, (_, val_idx) in enumerate(mlsk.split(df_cur[fname_col].values, Y_cur)):
    folds[val_idx] = k
assert (folds >= 0).all(), 'Fold assignment failed.'
df_cur['fold'] = folds
df_cur.to_csv('train_curated_folds.csv', index=False)
print('Rebuilt 5-fold MultilabelStratifiedKFold. Fold counts:')
print(df_cur['fold'].value_counts().sort_index())

# 5) PANNs-aligned mel-spectrogram frontend prototype (sr=32k, n_mels=64)
MEL_CFG = {
    'sr': 32000,
    'n_fft': 1024,
    'hop_length': 320,
    'win_length': 1024,
    'n_mels': 64,
    'fmin': 50,
    'fmax': 16000,
}

def load_audio_32k(path, sr=32000):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y, s

def logmel_64(y, sr, cfg=MEL_CFG):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=cfg['n_fft'], hop_length=cfg['hop_length'],
                                       win_length=cfg['win_length'], window='hann', n_mels=cfg['n_mels'],
                                       fmin=cfg['fmin'], fmax=cfg['fmax'], center=True, power=2.0)
    S_db = librosa.power_to_db(S, ref=1.0)
    return S_db.astype(np.float32)  # shape (n_mels, T)

# Demo on a few files
demo_files = [str(train_curated_dir / f) for f in df_cur[fname_col].head(3).values]
for p in demo_files:
    y, sr = load_audio_32k(p, sr=MEL_CFG['sr'])
    S = logmel_64(y, sr)
    T = S.shape[1]
    seconds = T * (MEL_CFG['hop_length'] / MEL_CFG['sr'])
    print(f'Mel64 for {Path(p).name}: shape={S.shape}, covers ~{seconds:.2f}s of audio')

print('\nC1 remediation complete: robust parsing, EDA, mel64 prototype, and MLSKF folds rebuilt.')

ModuleNotFoundError: No module named 'soundfile'

In [3]:
# Install missing dependencies for C1 remediation
import sys, subprocess

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
        print(f'Installed: {pkg}')
    except Exception as e:
        print(f'Failed to install {pkg}: {e}')

# Ensure soundfile and iterative-stratification are available
try:
    import soundfile  # noqa: F401
    print('soundfile already available')
except Exception:
    pip_install('soundfile')

try:
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold  # noqa: F401
    print('iterative-stratification already available')
except Exception:
    pip_install('iterative-stratification')

print('Dependency installation step complete.')

Installed: soundfile


Installed: iterative-stratification
Dependency installation step complete.


In [6]:
# C1 Remediation (re-run): Robust parsing, Duration EDA, 64-mel frontend, enforce MLSKF folds (no KFold fallback)
import json
from pathlib import Path
import numpy as np
import pandas as pd
import soundfile as sf
import librosa
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

BASE = Path('.')
train_curated_csv = BASE / 'train_curated.csv'
sample_sub_csv    = BASE / 'sample_submission.csv'
train_curated_dir = BASE / 'train_curated'

# Load CSVs and class order
df_cur = pd.read_csv(train_curated_csv)
df_ss  = pd.read_csv(sample_sub_csv)
fname_col = 'fname' if 'fname' in df_cur.columns else df_cur.columns[0]
labels_col = 'labels' if 'labels' in df_cur.columns else df_cur.columns[-1]
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

# Robust label parsing with validation
def parse_labels_str(label_str):
    if not isinstance(label_str, str):
        return []
    toks = [t.strip() for t in label_str.replace(';', ',').split(',') if t.strip()]
    unknown = [t for t in toks if t not in label_to_idx]
    if unknown:
        raise ValueError(f"Unknown labels encountered: {unknown[:5]} ... total {len(unknown)}")
    return toks

def encode_labels(toks):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in toks:
        y[label_to_idx[t]] = 1.0
    return y

cur_tokens = df_cur[labels_col].apply(parse_labels_str)
Y_cur = np.stack(cur_tokens.apply(encode_labels).values)
print('Reparsed curated label matrix:', Y_cur.shape, 'positives:', int(Y_cur.sum()))

# Investigate label frequency anomaly
label_freq = Y_cur.sum(axis=0)
vals, counts = np.unique(label_freq, return_counts=True)
print('Label frequency value_counts (value -> num_classes):')
for v, c in zip(vals.astype(int), counts):
    print(f'  {int(v)} -> {int(c)}')
if len(vals) == 1:
    print('NOTE: All classes have identical frequency in curated set (artificial balance). Use none/light class weighting on curated.')

# Duration EDA (soundfile metadata)
audio_paths = df_cur[fname_col].apply(lambda f: str(train_curated_dir / f)).values
durations = []
missing = 0
for p in audio_paths:
    try:
        info = sf.info(p)
        durations.append(info.frames / max(info.samplerate, 1))
    except Exception:
        try:
            y, sr = librosa.load(p, sr=None, mono=True)
            durations.append(len(y)/sr)
        except Exception:
            durations.append(np.nan)
            missing += 1
dur = np.array(durations, dtype=float)
valid = np.isfinite(dur)
dur = dur[valid]
def pct(a, q):
    return float(np.percentile(a, q)) if len(a) else float('nan')
print(f'Durations stats (n={len(dur)}, missing={missing}): min={dur.min():.3f}s, mean={dur.mean():.3f}s, p50={pct(dur,50):.3f}s, p90={pct(dur,90):.3f}s, p99={pct(dur,99):.3f}s, max={dur.max():.3f}s')
print('Implication: 10.0s target crops; random crop if longer; zero/loop pad if shorter.')

# Enforce 5-fold MultilabelStratifiedKFold (no fallback)
mlsk = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(df_cur), -1, dtype=int)
for k, (_, val_idx) in enumerate(mlsk.split(df_cur[fname_col].values, Y_cur)):
    folds[val_idx] = k
assert (folds >= 0).all(), 'Fold assignment failed.'
df_cur['fold'] = folds
df_cur.to_csv('train_curated_folds.csv', index=False)
print('Rebuilt MLSKF. Fold counts:')
print(df_cur['fold'].value_counts().sort_index())

# 64-mel frontend prototype (PANNs-aligned)
MEL_CFG = dict(sr=32000, n_fft=1024, hop_length=320, win_length=1024, n_mels=64, fmin=50, fmax=16000)
def load_audio_32k(path, sr=32000):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y, s
def logmel_64(y, sr, cfg=MEL_CFG):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=cfg['n_fft'], hop_length=cfg['hop_length'],
                                       win_length=cfg['win_length'], window='hann', n_mels=cfg['n_mels'],
                                       fmin=cfg['fmin'], fmax=cfg['fmax'], center=True, power=2.0)
    S_db = librosa.power_to_db(S, ref=1.0)
    return S_db.astype(np.float32)
demo_files = [str(train_curated_dir / f) for f in df_cur[fname_col].head(3).values]
for p in demo_files:
    y, sr = load_audio_32k(p, sr=MEL_CFG['sr'])
    S = logmel_64(y, sr)
    T = S.shape[1]
    seconds = T * (MEL_CFG['hop_length'] / MEL_CFG['sr'])
    print(f'Mel64 for {Path(p).name}: shape={S.shape}, covers ~{seconds:.2f}s of audio')

print('\nC1 remediation complete: robust parsing, duration EDA, mel64 prototype, and MLSKF folds rebuilt.')

Reparsed curated label matrix: (4970, 80) positives: 5752
Label frequency value_counts (value -> num_classes):
  47 -> 1
  48 -> 1
  49 -> 1
  50 -> 1
  53 -> 1
  56 -> 3
  57 -> 1
  63 -> 2
  65 -> 1
  67 -> 1
  72 -> 1
  75 -> 66


Durations stats (n=4970, missing=0): min=0.300s, mean=7.648s, p50=4.676s, p90=20.349s, p99=29.000s, max=57.571s
Implication: 10.0s target crops; random crop if longer; zero/loop pad if shorter.


Rebuilt MLSKF. Fold counts:
fold
0    999
1    994
2    992
3    997
4    988
Name: count, dtype: int64


Mel64 for 0006ae4e.wav: shape=(64, 704), covers ~7.04s of audio
Mel64 for 0019ef41.wav: shape=(64, 206), covers ~2.06s of audio
Mel64 for 001ec0ad.wav: shape=(64, 152), covers ~1.52s of audio

C1 remediation complete: robust parsing, duration EDA, mel64 prototype, and MLSKF folds rebuilt.


In [5]:
# Install librosa and related deps for audio frontend
import sys, subprocess

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
        print(f'Installed: {pkg}')
    except Exception as e:
        print(f'Failed to install {pkg}: {e}')

try:
    import librosa  # noqa: F401
    print('librosa already available')
except Exception:
    pip_install('librosa')
    # optional backends/utilities
    pip_install('audioread')

print('Librosa installation step complete.')

Installed: librosa


Installed: audioread
Librosa installation step complete.


In [20]:
# C2 Remediation: Pretrained PANNs CNN14 embeddings + fast OVR Logistic Regression (5-fold OOF + submission)
# - Uses panns-inference pretrained CNN14 (AudioSet) as frozen encoder to extract 2048-d embeddings per clip.
# - Trains scikit-learn OneVsRest LogisticRegression on these embeddings using the existing 5-fold MLSKF splits.
# - Produces valid OOF LWLRAP and a fast baseline submission. Caches embeddings.

import sys, subprocess, os, time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa
import joblib
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Ensure panns data directory and local assets exist
os.makedirs('/app/panns_data', exist_ok=True)
LABELS_CSV = Path('/app/panns_data/class_labels_indices.csv')
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert LABELS_CSV.exists(), 'Missing /app/panns_data/class_labels_indices.csv. Run the stub-creation cell first.'
assert CKPT_PATH.exists(), 'Missing CNN14 checkpoint at /app/panns_data/Cnn14_mAP=0.431.pth. Run the download cell first.'

# Install panns-inference if missing
try:
    import panns_inference
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    import panns_inference

from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
CROP_SEC = 10.0
EMB_DIM = 2048
N_FOLDS = 5

# Load metadata and folds (precomputed MLSKF)
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)
train_dir = BASE / 'train_curated'
test_dir  = BASE / 'test'

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    toks = [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]
    unk = [t for t in toks if t not in label_to_idx]
    if unk:
        raise ValueError(f'Unknown labels encountered: {unk[:5]} (total {len(unk)})')
    return toks

def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        y[label_to_idx[t]] = 1.0
    return y

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions_for_these_labels = np.zeros(n_labels)
    labels_per_class = truth.sum(axis=0)
    labels_per_class = np.maximum(labels_per_class, 1)
    for i in range(n_samples):
        pos_idx = np.where(truth[i] > 0)[0]
        if len(pos_idx) == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank_indices = np.where(ranked_truth > 0)[0]
        precisions = cumsum[pos_rank_indices] / (pos_rank_indices + 1)
        ranked_labels = ranking[pos_rank_indices]
        for lbl, prec in zip(ranked_labels, precisions):
            precisions_for_these_labels[lbl] += prec
    per_class = precisions_for_these_labels / labels_per_class
    weights = (truth.sum(axis=0) / np.maximum(truth.sum(), 1))
    return float((per_class * weights).sum()), per_class

# Deterministic center crop to 10s (no randomness for embeddings)
def load_center_crop_10s(path, sr=SR, crop_sec=CROP_SEC):
    y, s = librosa.load(path, sr=sr, mono=True)
    target = int(crop_sec * sr)
    if len(y) >= target:
        start = max(0, (len(y) - target) // 2)
        y = y[start:start+target]
    else:
        y = np.pad(y, (0, target - len(y)), mode='constant')
    return y.astype(np.float32)

emb_cur_path = BASE / 'embeddings_curated.npy'
emb_test_path = BASE / 'embeddings_test.npy'

def extract_embeddings(file_list, root_dir):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    # Use local checkpoint to avoid any downloads
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    X = np.zeros((len(file_list), EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for i, fname in enumerate(file_list):
        wav_path = str(Path(root_dir) / fname)
        y = load_center_crop_10s(wav_path, sr=SR, crop_sec=CROP_SEC)
        # AudioTagging.inference expects a batched waveform: shape (B, T)
        y_batched = np.expand_dims(y, 0)
        with torch.no_grad():
            out = at.inference(y_batched)
        # Some versions return tuple (clipwise_output, embedding)
        if isinstance(out, tuple) and len(out) == 2:
            embedding = out[1]
        elif isinstance(out, dict) and 'embedding' in out:
            embedding = out['embedding']
        else:
            raise RuntimeError(f'Unexpected AudioTagging output type: {type(out)}')
        X[i] = np.asarray(embedding, dtype=np.float32)[0]
        if (i+1) % 200 == 0:
            dt = time.time() - t0
            print(f'  Extracted {i+1}/{len(file_list)} embeddings in {dt/60:.1f} min')
    return X

# Filenames
train_files = df_cur['fname'].values
test_files  = df_ss['fname'].values

if emb_cur_path.exists() and emb_test_path.exists():
    X_cur = np.load(emb_cur_path)
    X_test = np.load(emb_test_path)
    print('Loaded cached embeddings.')
else:
    print('Extracting curated embeddings ...')
    X_cur = extract_embeddings(train_files, train_dir)
    np.save(emb_cur_path, X_cur)
    print('Extracting test embeddings ...')
    X_test = extract_embeddings(test_files, test_dir)
    np.save(emb_test_path, X_test)
    print('Saved embeddings to disk.')

# Targets
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values)

# 5-fold OOF using existing folds
oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []

for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    print(f'Fold {k}: train {len(trn_idx)}, val {len(val_idx)}')
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    base_lr = LogisticRegression(
        solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0
    )
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    oof[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores.append(lw)
    print(f'  Fold {k} LWLRAP={lw:.4f}')

oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'OOF LWLRAP={oof_lw:.4f}; per-fold={fold_scores}')
np.save('oof_panns_lr.npy', oof)

# Train full model and predict test
base_lr_full = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
clf_full = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr_full), n_jobs=-1)
clf_full.fit(X_cur, Y_cur)
test_proba = clf_full.predict_proba(X_test).astype(np.float32)

sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', test_files)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv using PANNs CNN14 embeddings + OVR LR. Shape:', sub.shape)

# Save classifier for reuse
joblib.dump(clf_full, 'ovr_logreg_panns.joblib')
print('Saved classifier checkpoint.')


Loaded cached embeddings.
Fold 0: train 3971, val 999


  Fold 0 LWLRAP=0.8091
Fold 1: train 3976, val 994


  Fold 1 LWLRAP=0.8004
Fold 2: train 3978, val 992


  Fold 2 LWLRAP=0.8080
Fold 3: train 3973, val 997


  Fold 3 LWLRAP=0.8097
Fold 4: train 3982, val 988


  Fold 4 LWLRAP=0.7748
OOF LWLRAP=0.8001; per-fold=[0.8091265640638446, 0.8004257362345152, 0.8080218131533763, 0.8096867591565071, 0.7748105563971049]


Saved submission.csv using PANNs CNN14 embeddings + OVR LR. Shape: (3361, 81)
Saved classifier checkpoint.


In [13]:
# Create offline stub for panns_inference labels to bypass wget during import
import os
from pathlib import Path
labels_dir = Path('/app/panns_data')
labels_dir.mkdir(parents=True, exist_ok=True)
labels_csv = labels_dir / 'class_labels_indices.csv'
if not labels_csv.exists():
    # AudioSet has 527 classes; create a stub with correct header and 527 rows
    import csv
    with open(labels_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['index', 'mid', 'display_name'])
        for i in range(527):
            writer.writerow([i, f'/m/{i}', f'class_{i:03d}'])
    print('Created stub labels CSV at', labels_csv)
else:
    print('Labels CSV already exists at', labels_csv)

print('Stub setup complete. Next: re-run the PANNs embeddings cell to observe weight download behavior and capture expected checkpoint path.')


Created stub labels CSV at /app/panns_data/class_labels_indices.csv
Stub setup complete. Next: re-run the PANNs embeddings cell to observe weight download behavior and capture expected checkpoint path.


In [14]:
# Download pretrained PANNs CNN14 weights via Python (no wget/apt) to /app/panns_data
import os, sys, hashlib
from pathlib import Path
import urllib.request

dest_dir = Path('/app/panns_data')
dest_dir.mkdir(parents=True, exist_ok=True)
ckpt_path = dest_dir / 'Cnn14_mAP=0.431.pth'

# Zenodo direct download link (shared by colleagues)
url = 'https://zenodo.org/record/3987831/files/Cnn14_mAP=0.431.pth?download=1'

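def download(url, path):
    print(f'Downloading CNN14 weights to {path} ...')
    # Write to a temporary file first so an interrupted download is never
    # mistaken for a complete checkpoint by the size check below.
    tmp_path = path.with_name(path.name + '.part')
    with urllib.request.urlopen(url) as resp, open(tmp_path, 'wb') as out:
        block_size = 1 << 20  # read in 1 MiB chunks
        while True:
            chunk = resp.read(block_size)
            if not chunk:
                break
            out.write(chunk)
    os.replace(tmp_path, path)  # rename into place only after the full file is written
    print('Download complete.')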

if ckpt_path.exists() and ckpt_path.stat().st_size > 0:
    print('Pretrained weights already present at:', ckpt_path, 'size:', ckpt_path.stat().st_size)
else:
    download(url, ckpt_path)
    print('Saved:', ckpt_path, 'size:', ckpt_path.stat().st_size)

print('Next step: modify PANNs embeddings cell to pass checkpoint_path=str(ckpt_path) to AudioTagging and re-run.')


Downloading CNN14 weights to /app/panns_data/Cnn14_mAP=0.431.pth ...


Download complete.
Saved: /app/panns_data/Cnn14_mAP=0.431.pth size: 327428481
Next step: modify PANNs embeddings cell to pass checkpoint_path=str(ckpt_path) to AudioTagging and re-run.


In [17]:
# Diagnose panns_inference.AudioTagging API and verify inference call + output keys with various input shapes
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch, sys, subprocess, os

# Ensure assets exist
LABELS_CSV = Path('/app/panns_data/class_labels_indices.csv')
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert LABELS_CSV.exists(), 'Missing labels CSV'
assert CKPT_PATH.exists(), 'Missing checkpoint file'

# Import AudioTagging
try:
    import panns_inference
except Exception:
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    import panns_inference
from panns_inference import AudioTagging

print('panns_inference version:', getattr(panns_inference, '__version__', 'unknown'))
print('AudioTagging public attrs:', [m for m in dir(AudioTagging) if not m.startswith('_')])

# Prepare one waveform from curated to test API
df_cur = pd.read_csv('train_curated_folds.csv')
wav_path = str(Path('train_curated') / df_cur.iloc[0]['fname'])
y, sr = librosa.load(wav_path, sr=32000, mono=True)
target = 32000 * 10
if len(y) >= target:
    start = max(0, (len(y) - target)//2)
    y = y[start:start+target]
else:
    y = np.pad(y, (0, target - len(y)))
y = y.astype(np.float32)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device for test:', device)
print('Checkpoint path:', CKPT_PATH)
print('Using CPU.' if device=='cpu' else 'Using CUDA.')
at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)

def try_call(inp, name):
    try:
        out = at.inference(inp)
        print(f'Call with {name} succeeded; type(out)={type(out)}')
        if isinstance(out, dict):
            print('  Output keys:', list(out.keys()))
            for k in ['embedding', 'clipwise_output', 'tags']:
                if k in out:
                    v = out[k]
                    try:
                        print(f'    {k} shape:', getattr(v, 'shape', 'n/a'))
                    except Exception:
                        pass
        else:
            # some versions return tuple (clipwise_output, embedding)
            try:
                print('  Treating as tuple-like, len:', len(out))
                for i, v in enumerate(out):
                    print(f'    item {i} type={type(v)}, shape={getattr(v, "shape", "n/a")}')
            except Exception as e:
                print('  Non-iterable output:', repr(e))
    except Exception as e:
        print(f'Call with {name} failed:', repr(e))

# Try different input shapes
try_call(y, '1D waveform (T,)')
try_call(np.expand_dims(y, 0), 'batched waveform (1, T)')
try_call([y], 'list of waveforms [T]')

print('Diagnostics complete. Use the successful calling convention above in the embeddings pipeline.')


panns_inference version: 0.1.0
AudioTagging public attrs: ['inference']
Device for test: cpu
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth


Using CPU.
Call with 1D waveform (T,) failed: IndexError('too many indices for tensor of dimension 1')


Call with batched waveform (1, T) succeeded; type(out)=<class 'tuple'>
  Treating as tuple-like, len: 2
    item 0 type=<class 'numpy.ndarray'>, shape=(1, 527)
    item 1 type=<class 'numpy.ndarray'>, shape=(1, 2048)
Call with list of waveforms [T] failed: AttributeError("'list' object has no attribute 'dtype'")
Diagnostics complete. Use the successful calling convention above in the embeddings pipeline.


In [22]:
# Programmatically create a clean, linear notebook: C2_gold_pipeline_final.ipynb
import json, sys, subprocess, os, time, shutil
from pathlib import Path

# Ensure nbformat/nbconvert are available
def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
    except Exception as e:
        print(f'Failed to install {pkg}: {e}')

try:
    import nbformat
except Exception:
    pip_install('nbformat')
    import nbformat
try:
    from nbconvert.preprocessors import ExecutePreprocessor
except Exception:
    pip_install('nbconvert')
    from nbconvert.preprocessors import ExecutePreprocessor

import nbformat as nbf

final_nb_path = Path('C2_gold_pipeline_final.ipynb')

# 1) Get plan markdown from current notebook's Cell 0
plan_md = None
try:
    with open('agent_notebook.ipynb', 'r', encoding='utf-8') as f:
        cur_nb = nbformat.read(f, as_version=4)
    if cur_nb['cells'] and cur_nb['cells'][0]['cell_type'] == 'markdown':
        plan_md = cur_nb['cells'][0]['source']
except Exception as e:
    print('Warning: could not read agent_notebook.ipynb:', e)
if plan_md is None:
    plan_md = '# Freesound Audio Tagging 2019 — Gold-Only Plan v2.1 (consolidated)\nThis notebook is a clean, linear reproduction of the approved plan and winning pipeline.'

# 2) Consolidated setup & imports cell (no panns_inference import here to satisfy asset sequencing)
cell2 = '''\
# Cell 2: Consolidated setup and imports (single source of truth)\n
import sys, subprocess, os, json, time, warnings\n
from pathlib import Path\n
warnings.filterwarnings('ignore')\n
def pip_install(pkg):\n
    try:\n
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])\n
        print(f'Installed: {pkg}')\n
    except Exception as e:\n
        print(f'Failed to install {pkg}: {e}')\n
\n
# pip package names and import names differ for iterative-stratification and scikit-learn\n
for pkg, mod in [('soundfile', 'soundfile'), ('librosa', 'librosa'), ('iterative-stratification', 'iterstrat'), ('joblib', 'joblib'), ('scikit-learn', 'sklearn')]:\n
    try:\n
        __import__(mod)\n
        print(f'{pkg} already available')\n
    except Exception:\n
        pip_install(pkg)\n
\n
import numpy as np\n
import pandas as pd\n
import soundfile as sf\n
import librosa\n
import torch\n
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold\n
from sklearn.linear_model import LogisticRegression\n
from sklearn.multiclass import OneVsRestClassifier\n
from sklearn.preprocessing import StandardScaler\n
from sklearn.pipeline import make_pipeline\n
import joblib\n
BASE = Path('.')\n
np.random.seed(42)\n
'''

# 3) Offline PANNs asset preparation
cell3 = '''\
# Cell 3: Offline PANNs asset preparation (labels CSV stub + CNN14 checkpoint)\n
from pathlib import Path\n
import urllib.request\n
assets_dir = Path('/app/panns_data')\n
assets_dir.mkdir(parents=True, exist_ok=True)\n
labels_csv = assets_dir / 'class_labels_indices.csv'\n
if not labels_csv.exists():\n
    import csv\n
    with open(labels_csv, 'w', newline='') as f:\n
        w = csv.writer(f)\n
        w.writerow(['index','mid','display_name'])\n
        for i in range(527):\n
            w.writerow([i, f'/m/{i}', f'class_{i:03d}'])\n
    print('Created labels stub at', labels_csv)\n
else:\n
    print('Labels CSV exists at', labels_csv)\n
\n
ckpt_path = assets_dir / 'Cnn14_mAP=0.431.pth'\n
url = 'https://zenodo.org/record/3987831/files/Cnn14_mAP=0.431.pth?download=1'\n
if not ckpt_path.exists() or ckpt_path.stat().st_size == 0:\n
    print('Downloading CNN14 weights ...')\n
    with urllib.request.urlopen(url) as resp, open(ckpt_path, 'wb') as out:\n
        while True:\n
            chunk = resp.read(1<<20)\n
            if not chunk:\n
                break\n
            out.write(chunk)\n
    print('Saved CNN14 to', ckpt_path, 'size:', ckpt_path.stat().st_size)\n
else:\n
    print('CNN14 checkpoint present:', ckpt_path, 'size:', ckpt_path.stat().st_size)\n
'''

# 4) Data foundation & CV (no KFold fallback). Define helper funcs once here.
cell4 = '''\
# Cell 4: Data foundation, robust label parsing, MLSKF folds (no fallback)\n
BASE = Path('.')\n
df_cur = pd.read_csv(BASE / 'train_curated.csv')\n
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')\n
fname_col = 'fname' if 'fname' in df_cur.columns else df_cur.columns[0]\n
labels_col = 'labels' if 'labels' in df_cur.columns else df_cur.columns[-1]\n
class_names = [c for c in df_ss.columns if c != 'fname']\n
label_to_idx = {c:i for i,c in enumerate(class_names)}\n
n_classes = len(class_names)\n
\n
def parse_labels_str(s):\n
    if not isinstance(s, str):\n
        return []\n
    toks = [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]\n
    unknown = [t for t in toks if t not in label_to_idx]\n
    if unknown:\n
        raise ValueError(f'Unknown labels: {unknown[:5]} (total {len(unknown)})')\n
    return toks\n
\n
def encode_tokens(toks):\n
    y = np.zeros(n_classes, dtype=np.float32)\n
    for t in toks:\n
        y[label_to_idx[t]] = 1.0\n
    return y\n
\n
def encode_labels(s):\n
    return encode_tokens(parse_labels_str(s))\n
\n
tokens = df_cur[labels_col].apply(parse_labels_str)\n
Y_cur = np.stack(tokens.apply(encode_tokens).values)\n
mlsk = MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=42)\n
folds = np.full(len(df_cur), -1, dtype=int)\n
for k, (_, val_idx) in enumerate(mlsk.split(df_cur[fname_col].values, Y_cur)):\n
    folds[val_idx] = k\n
assert (folds >= 0).all()\n
df_cur['fold'] = folds\n
df_cur.to_csv('train_curated_folds.csv', index=False)\n
with open('metadata.json', 'w') as f:\n
    json.dump({'class_names': class_names, 'label_to_idx': label_to_idx, 'fname_col': fname_col, 'labels_col': labels_col}, f)\n
print('Saved train_curated_folds.csv and metadata.json; fold counts:')\n
print(df_cur['fold'].value_counts().sort_index())\n
\n
def lwlrap_np(truth, scores):\n
    assert truth.shape == scores.shape\n
    n_samples, n_labels = truth.shape\n
    precisions = np.zeros(n_labels)\n
    labels_per_class = np.maximum(truth.sum(axis=0), 1)\n
    for i in range(n_samples):\n
        pos = np.where(truth[i] > 0)[0]\n
        if pos.size == 0:\n
            continue\n
        ranking = np.argsort(-scores[i])\n
        ranked_truth = truth[i][ranking]\n
        cumsum = np.cumsum(ranked_truth)\n
        pos_rank = np.where(ranked_truth > 0)[0]\n
        prec = cumsum[pos_rank] / (pos_rank + 1)\n
        ranked_labels = ranking[pos_rank]\n
        for lbl, p in zip(ranked_labels, prec):\n
            precisions[lbl] += p\n
    per_class = precisions / labels_per_class\n
    weights = truth.sum(axis=0) / max(truth.sum(), 1)\n
    return float((per_class * weights).sum()), per_class\n
'''

# 5) PANNs embeddings + OVR Logistic Regression pipeline (imports panns_inference here, after assets)
cell5 = '''\
# Cell 5: PANNs CNN14 embeddings + OVR Logistic Regression (OOF + submission)\n
from pathlib import Path\n
import numpy as np, pandas as pd\n
import librosa, torch, joblib, time\n
from sklearn.linear_model import LogisticRegression\n
from sklearn.multiclass import OneVsRestClassifier\n
from sklearn.preprocessing import StandardScaler\n
from sklearn.pipeline import make_pipeline\n
from panns_inference import AudioTagging\n
\n
BASE = Path('.')\n
SR = 32000\n
CROP_SEC = 10.0\n
EMB_DIM = 2048\n
N_FOLDS = 5\n
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')\n
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing; run Cell 3 first.'\n
\n
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')\n
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')\n
class_names = [c for c in df_ss.columns if c != 'fname']\n
label_to_idx = {c:i for i,c in enumerate(class_names)}\n
n_classes = len(class_names)\n
train_dir = BASE / 'train_curated'\n
test_dir  = BASE / 'test'\n
\n
def load_center_crop_10s(path, sr=SR, crop_sec=CROP_SEC):\n
    y, s = librosa.load(path, sr=sr, mono=True)\n
    target = int(sr * crop_sec)\n
    if len(y) >= target:\n
        start = max(0, (len(y) - target)//2)\n
        y = y[start:start+target]\n
    else:\n
        y = np.pad(y, (0, target-len(y)))\n
    return y.astype(np.float32)\n
\n
emb_cur_path = BASE / 'embeddings_curated.npy'\n
emb_test_path = BASE / 'embeddings_test.npy'\n
\n
def extract_embeddings(file_list, root_dir):\n
    device = 'cuda' if torch.cuda.is_available() else 'cpu'\n
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)\n
    X = np.zeros((len(file_list), EMB_DIM), dtype=np.float32)\n
    t0 = time.time()\n
    for i, fname in enumerate(file_list):\n
        y = load_center_crop_10s(str(Path(root_dir) / fname), sr=SR, crop_sec=CROP_SEC)\n
        y_batched = np.expand_dims(y, 0)\n
        with torch.no_grad():\n
            out = at.inference(y_batched)\n
        if isinstance(out, tuple) and len(out) == 2:\n
            embedding = out[1]\n
        elif isinstance(out, dict) and 'embedding' in out:\n
            embedding = out['embedding']\n
        else:\n
            raise RuntimeError(f'Unexpected AudioTagging output: {type(out)}')\n
        X[i] = np.asarray(embedding, dtype=np.float32)[0]\n
        if (i+1) % 200 == 0:\n
            dt = time.time() - t0\n
            print(f'  {i+1}/{len(file_list)} in {dt/60:.1f} min')\n
    return X\n
\n
train_files = df_cur['fname'].values\n
test_files  = df_ss['fname'].values\n
\n
if emb_cur_path.exists() and emb_test_path.exists():\n
    X_cur = np.load(emb_cur_path)\n
    X_test = np.load(emb_test_path)\n
    print('Loaded cached embeddings.')\n
else:\n
    print('Extracting curated embeddings ...')\n
    X_cur = extract_embeddings(train_files, root_dir=BASE/'train_curated')\n
    np.save(emb_cur_path, X_cur)\n
    print('Extracting test embeddings ...')\n
    X_test = extract_embeddings(test_files, root_dir=BASE/'test')\n
    np.save(emb_test_path, X_test)\n
    print('Saved embeddings to disk.')\n
\n
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values)\n
oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)\n
fold_scores = []\n
for k in range(N_FOLDS):\n
    trn_idx = np.where(df_cur['fold'].values != k)[0]\n
    val_idx = np.where(df_cur['fold'].values == k)[0]\n
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]\n
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]\n
    base_lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)\n
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)\n
    clf.fit(X_tr, y_tr)\n
    proba = clf.predict_proba(X_va)\n
    oof[val_idx] = proba.astype(np.float32)\n
    lw, _ = lwlrap_np(y_va, proba)\n
    fold_scores.append(lw)\n
    print(f'Fold {k} LWLRAP={lw:.4f}')\n
oof_lw, _ = lwlrap_np(Y_cur, oof)\n
print(f'OOF LWLRAP={oof_lw:.4f}; per-fold={fold_scores}')\n
np.save('oof_panns_lr.npy', oof)\n
\n
base_lr_full = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)\n
clf_full = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr_full), n_jobs=-1)\n
clf_full.fit(X_cur, Y_cur)\n
test_proba = clf_full.predict_proba(X_test).astype(np.float32)\n
sub = pd.DataFrame(test_proba, columns=class_names)\n
sub.insert(0, 'fname', test_files)\n
sub.to_csv('submission.csv', index=False)\n
print('Saved submission.csv. Shape:', sub.shape)\n
joblib.dump(clf_full, 'ovr_logreg_panns.joblib')\n
print('Saved classifier checkpoint.')\n
'''

# Build the new notebook
nb = nbf.v4.new_notebook()
nb['cells'] = [
    nbf.v4.new_markdown_cell(plan_md),
    nbf.v4.new_code_cell(cell2),
    nbf.v4.new_code_cell(cell3),
    nbf.v4.new_code_cell(cell4),
    nbf.v4.new_code_cell(cell5),
]

with open(final_nb_path, 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)
print('Wrote clean notebook to', final_nb_path)

# Clean-room validation: remove cached embeddings and panns assets before executing
for p in ['embeddings_curated.npy', 'embeddings_test.npy', 'oof_panns_lr.npy', 'ovr_logreg_panns.joblib']:
    fp = Path(p)
    if fp.exists():
        try:
            fp.unlink()
            print('Removed cache:', fp)
        except Exception as e:
            print('Failed to remove', fp, ':', e)
assets_dir = Path('/app/panns_data')
if assets_dir.exists():
    try:
        shutil.rmtree(assets_dir)
        print('Removed assets dir:', assets_dir)
    except Exception as e:
        print('Failed to remove assets dir:', e)

# Execute the new notebook end-to-end to validate Run All in clean state
try:
    print('Executing the clean notebook (this may take significant time) ...')
    with open(final_nb_path, 'r', encoding='utf-8') as f:
        nb_to_run = nbformat.read(f, as_version=4)
    ep = ExecutePreprocessor(timeout=36000, kernel_name='python3')
    ep.preprocess(nb_to_run, {'metadata': {'path': '.'}})
    with open(final_nb_path, 'w', encoding='utf-8') as f:
        nbformat.write(nb_to_run, f)
    print('Clean notebook executed successfully.')
except Exception as e:
    print('Execution of the clean notebook failed:', repr(e))
    print('Open and Run All manually if needed.')

print('Done.')


Wrote clean notebook to C2_gold_pipeline_final.ipynb
Removed cache: embeddings_curated.npy
Removed cache: embeddings_test.npy
Removed cache: oof_panns_lr.npy
Removed cache: ovr_logreg_panns.joblib
Removed assets dir: /app/panns_data
Executing the clean notebook (this may take significant time) ...


Clean notebook executed successfully.
Done.


# C3 Plan (Revised and Final): De-risked Order, Pure Teacher, Data-Driven Thresholds, and Hygiene

Objective: Expand the gold margin on top of the approved C2 baseline (OOF LWLRAP 0.8001) using a rigorously reproducible, low-risk-first sequence. Deliverables: clean, versioned C3 notebooks; ablations per step; improved OOF and final submission.

Final priority order (mandatory):
1) Multi-Crop Embedding TTA (lowest risk, immediate gain)
- Extract K=5 time crops per clip for curated and test: begin, center, end, and two uniform offsets. Average the resulting 2048-d embeddings per clip; cache as embeddings_curated_mc5.npy and embeddings_test_mc5.npy.
- Use the same MLSKF splits (seed=42) and retrain the OVR Logistic Regression head on averaged embeddings. Report OOF and per-fold metrics.
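
The crop placement and averaging in step 1 can be sketched as follows. This is a minimal illustration, not the notebook's implementation: `multi_crop_starts`, `multi_crop`, and `mean_embedding` are hypothetical names, `embed_fn` is a stand-in for any callable mapping a `(B, T)` batch to `(B, D)` embeddings, and zero-padding short clips (which makes all five crops identical) is an assumption carried over from the C2 center-crop loader.

```python
import numpy as np

SR = 32000
TARGET = SR * 10  # 10 s at 32 kHz, as in the C2 loader


def multi_crop_starts(n_samples, target=TARGET):
    """Start indices for the begin, 25%, center, 75%, and end crops."""
    slack = max(0, n_samples - target)  # room available to slide the window
    return [int(round(slack * f)) for f in (0.0, 0.25, 0.5, 0.75, 1.0)]


def multi_crop(y, target=TARGET):
    """Return a (5, target) array of fixed-length crops of waveform y."""
    if len(y) < target:
        # Assumption: zero-pad short clips, so all five crops coincide.
        y = np.pad(y, (0, target - len(y)))
    starts = multi_crop_starts(len(y), target)
    return np.stack([y[s:s + target] for s in starts])


def mean_embedding(y, embed_fn, target=TARGET):
    """Average the per-crop embeddings; embed_fn maps (B, T) -> (B, D)."""
    crops = multi_crop(np.asarray(y, dtype=np.float32), target)
    return np.asarray(embed_fn(crops)).mean(axis=0)
```

For a clip no longer than the target, the five start indices are all zero, so the averaged embedding equals the single-crop embedding and the cached `mc5` features stay comparable with the C2 ones.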

2) Dual Features (feature fusion)
- Augment the 2048-d embeddings with CNN14 clipwise_output (527-d) to form a 2575-d feature vector; re-train LR with identical CV setup.
- Cache fused features; update ablation table to show incremental delta from TTA-only to TTA+Fusion.
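
A minimal sketch of the fusion in step 2, assuming the two arrays are already cached per clip. The shapes follow the diagnostic output above, where `at.inference` returned a `(1, 527)` clipwise output and a `(1, 2048)` embedding; `fuse_features` is a hypothetical helper name.

```python
import numpy as np


def fuse_features(embedding, clipwise_output):
    """Concatenate (N, 2048) embeddings with (N, 527) clipwise outputs -> (N, 2575)."""
    embedding = np.asarray(embedding, dtype=np.float32)
    clipwise_output = np.asarray(clipwise_output, dtype=np.float32)
    if embedding.shape[0] != clipwise_output.shape[0]:
        raise ValueError('embedding and clipwise_output must have the same number of rows')
    return np.concatenate([embedding, clipwise_output], axis=1)
```

The two blocks live on different scales (clipwise outputs are probabilities in [0, 1]), so the `StandardScaler` already present in the LR pipeline is doing real work here and should stay in place.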

3) Teacher–Student on train_noisy (methodologically pure)
- Teacher models: strictly the 5 out-of-fold C2/C3 fold models (no full-data teacher). The teacher prediction for any curated clip must come only from the fold model that did not train on that clip (its out-of-fold prediction), to preserve purity. For train_noisy (separate set), use the fold-ensemble average (all 5 folds).
- Embed train_noisy with the same TTA settings as curated (K=5), and optionally the fused features if step 2 is adopted. Cache as embeddings_noisy_mc5.npy (and fused variant).
- Score train_noisy with the teacher ensemble to obtain per-class probabilities.
- Data-driven threshold selection: before filtering, analyze the teacher’s probability distribution per class on train_noisy (histograms/quantiles). Choose thresholds by targeting a precision-first regime with coverage constraints:
  - Candidate grids per class: quantiles q ∈ {0.98, 0.95, 0.90, 0.85} or fixed probabilities p ∈ {0.95, 0.90, 0.85}. Pick per-class thresholds that land in a target retained-rate band (e.g., retain 20–60% of weak positives) while maximizing estimated precision.
  - Add new positives (not in weak labels) only at very high confidence (e.g., top 1–2% quantile or p ≥ class-specific high threshold). Cap added positives per clip (≤3).
- Guardrails and logs:
  - Log per-class retained counts and retained-rate vs weak labels; fail if median retained-rate < 30% or if any class retains < 10 samples.
  - Verify sample_weight propagation: weight each noisy sample by the maximum teacher probability among its positive labels (clipped to [0.5, 1.0]); log summary stats.
  - Abort, or raise the thresholds, if curated OOF drops by >0.01 from the TTA+Fusion baseline.
- Retrain LR on curated + filtered noisy (same MLSKF folds for curated; noisy used only in training folds with its sample weights). Re-evaluate curated OOF to validate gains.
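
The threshold selection, capped additions, and sample weighting in step 3 can be sketched as below. Several choices here are assumptions rather than part of the plan: the strictest in-band quantile is used as a proxy for "maximizing estimated precision", classes with no in-band candidate fall back to the candidate closest to the band, and the names `select_thresholds`, `filter_noisy`, and `add_thr` are hypothetical.

```python
import numpy as np


def select_thresholds(probs, weak, quantiles=(0.98, 0.95, 0.90, 0.85), band=(0.2, 0.6)):
    """Per-class thresholds from teacher-probability quantiles on train_noisy.

    probs: (N, C) teacher probabilities; weak: (N, C) weak (noisy) labels.
    """
    lo, hi = band
    thr = np.ones(probs.shape[1], dtype=np.float64)
    for c in range(probs.shape[1]):
        pos = probs[weak[:, c] > 0, c]
        if pos.size == 0:
            continue  # no weak positives: leave the threshold at 1.0
        cands = [float(np.quantile(probs[:, c], q)) for q in quantiles]
        rates = [float((pos >= t).mean()) for t in cands]  # retained-rate per candidate
        in_band = [i for i, r in enumerate(rates) if lo <= r <= hi]
        if in_band:
            best = in_band[0]  # assumption: strictest in-band candidate
        else:
            best = int(np.argmin([min(abs(r - lo), abs(r - hi)) for r in rates]))
        thr[c] = cands[best]
    return thr


def filter_noisy(probs, weak, thr, add_thr, max_added=3):
    """Keep confident weak labels, add capped new positives, compute sample weights."""
    weak = np.asarray(weak) > 0
    kept = weak & (probs >= thr)
    cand = (~weak) & (probs >= add_thr)
    # Cap additions per clip: keep only the max_added most confident candidates.
    scores = np.where(cand, probs, -np.inf)
    top = np.argsort(-scores, axis=1)[:, :max_added]
    rows = np.arange(probs.shape[0])[:, None]
    added = np.zeros_like(cand)
    added[rows, top] = cand[rows, top]
    labels = kept | added
    keep_rows = labels.any(axis=1)  # drop clips left with no labels
    conf = np.where(labels, probs, 0.0).max(axis=1)
    weights = np.clip(conf, 0.5, 1.0)
    return labels[keep_rows].astype(np.float32), weights[keep_rows].astype(np.float32), keep_rows
```

The returned weights are intended for the `sample_weight` argument of the classifier's `fit`. Whether that argument is forwarded through a given wrapper stack should be verified, which is what the propagation guardrail above asks for.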

4) Lightweight Ensemble and Calibration (final polish)
- Train several LR heads varying C ∈ {1.0, 2.0, 4.0} and seed ∈ {42, 2025}; average probabilities (or rank-average if distributions differ). Keep folds fixed for clean OOF blending.
- Per-class temperature scaling: fit temperatures on curated OOF by minimizing BCE; apply to test predictions. Log pre/post calibration metrics.

Hygiene, reproducibility, and versioning (mandatory):
- Start each C3 milestone from the clean C2 artifact structure. Maintain a linear, 5–7 cell notebook per milestone without stale/diagnostic cells. No KFold fallback anywhere.
- Versioned notebooks and ablations:
  - C3A_TTA.ipynb: Implement K=5 multi-crop TTA on embeddings; report OOF and save oof_tta.npy.
  - C3B_Fusion.ipynb: Add 527-d clipwise outputs to form 2575-d features; report OOF and save oof_tta_fusion.npy.
  - C3C_TeacherStudent.ipynb: Build pure OOF teacher ensemble, analyze distributions, select data-driven thresholds, filter/weight train_noisy, retrain LR; save artifacts (thresholds.json, oof_tta_fusion_noisy.npy).
  - C3D_Ensemble_Calibration.ipynb: Train LR heads with different C/seeds; blend and calibrate; save final oof_ensemble.npy and temperatures.npy.
- Artifacts to persist per step: embeddings and/or fused features (.npy), OOF predictions (.npy), per-fold metrics (JSON), selected thresholds per class (JSON), calibration temps (JSON), and final submission.csv.
- Determinism: set seeds; pin critical package versions (iterative-stratification, scikit-learn, librosa, panns-inference); log environment versions.
- Efficiency: batch panns-inference on GPU (micro-batches 8–16), cache all embeddings/features per variant to avoid recompute.
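
The micro-batching mentioned under Efficiency can be sketched as follows. The diagnostic cell only confirmed that `at.inference` accepts a `(1, T)` array and returns a `(clipwise_output, embedding)` tuple; that larger batches behave the same way is an assumption to check. `embed_in_batches`, `load_fn`, and `infer_fn` are hypothetical names, with `infer_fn` standing in for `at.inference`.

```python
import numpy as np


def embed_in_batches(paths, load_fn, infer_fn, batch_size=16):
    """Embed clips in micro-batches.

    load_fn(path) -> fixed-length 1-D float32 waveform
    infer_fn(batch) -> (clipwise_output, embedding) for a (B, T) batch
    """
    chunks = []
    for start in range(0, len(paths), batch_size):
        # Load only the current micro-batch to keep memory bounded.
        batch = np.stack([load_fn(p) for p in paths[start:start + batch_size]])
        _, emb = infer_fn(batch)
        chunks.append(np.asarray(emb, dtype=np.float32))
    if not chunks:
        return np.zeros((0, 0), dtype=np.float32)
    return np.concatenate(chunks, axis=0)
```

Loading per batch instead of stacking every waveform up front matters here: a 10 s clip at 32 kHz is 320,000 float32 samples, so holding several thousand clips in memory at once would cost multiple gigabytes.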

Acceptance criteria and safety checks:
- Curated OOF must not degrade by >0.01 at any step; if it does, revert the last change or tighten filtering thresholds.
- Monitor per-fold LWLRAP and per-class LRAP to prevent collapse on rare classes.
- Final deliverable remains linear, offline-first, and re-runnable from a clean state (assets recreated automatically).
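
The acceptance checks above can be expressed as a small guard, sketched here under assumptions. The 0.01 tolerance on OOF LWLRAP comes from the plan; the per-class tolerance `class_drop=0.10` is a hypothetical value, since the plan does not say how large a per-class LRAP drop counts as a collapse. The per-class arrays are assumed to be the second return value of `lwlrap_np`.

```python
import numpy as np


def check_acceptance(base_lwlrap, cand_lwlrap, base_per_class, cand_per_class,
                     max_drop=0.01, class_drop=0.10):
    """Return (ok, collapsed): ok is False if OOF LWLRAP fell by more than max_drop;
    collapsed lists class indices whose LRAP fell by more than class_drop."""
    base_per_class = np.asarray(base_per_class, dtype=float)
    cand_per_class = np.asarray(cand_per_class, dtype=float)
    collapsed = np.where(base_per_class - cand_per_class > class_drop)[0].tolist()
    ok = (base_lwlrap - cand_lwlrap) <= max_drop
    return bool(ok), collapsed
```

A step would be reverted when `ok` is False; a non-empty `collapsed` list is a prompt to inspect those classes even when the global score holds.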

Next action (C3A): Implement K=5 multi-crop embedding TTA in a clean notebook cloned from C2_gold_pipeline_final.ipynb, retrain the LR head with fixed MLSKF folds, and submit for audit with OOF deltas and cached artifacts.

In [27]:
# Programmatically create C3A_TTA_curated.ipynb (K=5 multi-crop TTA embeddings for curated ONLY + OOF LR)
import json, sys, subprocess, os, time
from pathlib import Path

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
    except Exception as e:
        print('pip install failed for', pkg, ':', e)

# Ensure nbformat/nbconvert
try:
    import nbformat
except Exception:
    pip_install('nbformat')
    import nbformat
try:
    from nbconvert.preprocessors import ExecutePreprocessor
except Exception:
    pip_install('nbconvert')
    from nbconvert.preprocessors import ExecutePreprocessor
import nbformat as nbf

nb_path = Path('C3A_TTA_curated.ipynb')

# Cell 1: Markdown plan
cell1 = """
# C3A: Multi-Crop Embedding TTA (K=5) — Curated Only Stage

This staged notebook implements K=5 deterministic multi-crop TTA at the embedding level using PANNs CNN14 for curated data only, computes OOF LWLRAP, and caches artifacts.
- Crops per clip (target T=10s @ 32kHz): begin, center, end, 25% offset, 75% offset.
- Aggregation: mean across the 5 embeddings (no pre-normalization).
- CV: fixed 5-fold MLSKF from train_curated_folds.csv (seed=42).
- Outputs: embeddings_curated_mc5.npy, metadata_c3a.json; OOF (oof_tta.npy), per-fold LWLRAP.
- Test MC5 extraction and submission are deferred to a separate follow-up stage after OOF verification.
"""

# Cell 2: Setup and imports (install panns-inference, but do not import yet)
cell2 = '''
import sys, subprocess, os, json, time, warnings
from pathlib import Path
warnings.filterwarnings('ignore')
def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
        print(f'Installed: {pkg}')
    except Exception as e:
        print(f'Failed to install {pkg}: {e}')

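# Check availability without importing: panns_inference must not be imported before Cell 3 prepares its assets.
# pip package names and import names differ for iterative-stratification and scikit-learn.
import importlib.util
for pkg, mod in [('soundfile', 'soundfile'), ('librosa', 'librosa'), ('iterative-stratification', 'iterstrat'),
                 ('joblib', 'joblib'), ('scikit-learn', 'sklearn'), ('panns-inference', 'panns_inference')]:
    if importlib.util.find_spec(mod) is not None:
        print(f'{pkg} already available')
    else:
        pip_install(pkg)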

import numpy as np
import pandas as pd
import soundfile as sf
import librosa
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import joblib
BASE = Path('.')
np.random.seed(42)
print('Versions:', {k:__import__(k).__version__ if hasattr(__import__(k), '__version__') else 'n/a' for k in ['librosa','torch','sklearn']})
'''

# Cell 3: Offline PANNs assets
cell3 = '''
from pathlib import Path
import urllib.request
assets_dir = Path('/app/panns_data')
assets_dir.mkdir(parents=True, exist_ok=True)
labels_csv = assets_dir / 'class_labels_indices.csv'
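# Stub label file with 527 placeholder rows (the AudioSet class count) so label lookup works offline.
# The display names are placeholders, not the real AudioSet labels.
if not labels_csv.exists():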
    import csv
    with open(labels_csv, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['index','mid','display_name'])
        for i in range(527):
            w.writerow([i, f'/m/{i}', f'class_{i:03d}'])
    print('Created labels stub at', labels_csv)
else:
    print('Labels CSV exists at', labels_csv)

ckpt_path = assets_dir / 'Cnn14_mAP=0.431.pth'
url = 'https://zenodo.org/record/3987831/files/Cnn14_mAP=0.431.pth?download=1'
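if not ckpt_path.exists() or ckpt_path.stat().st_size == 0:
    print('Downloading CNN14 weights ...')
    # Download to a temporary file and rename on success, so an interrupted
    # download cannot leave a truncated checkpoint that passes the size check.
    tmp_path = ckpt_path.with_name(ckpt_path.name + '.part')
    with urllib.request.urlopen(url) as resp, open(tmp_path, 'wb') as out:
        while True:
            chunk = resp.read(1<<20)
            if not chunk:
                break
            out.write(chunk)
    tmp_path.replace(ckpt_path)
    print('Saved CNN14 to', ckpt_path, 'size:', ckpt_path.stat().st_size)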
else:
    print('CNN14 checkpoint present:', ckpt_path, 'size:', ckpt_path.stat().st_size)
'''

# Cell 4: Load folds, helpers (no rebuild of folds)
cell4 = '''
from pathlib import Path
import numpy as np, pandas as pd
import json
BASE = Path('.')
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    toks = [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]
    unknown = [t for t in toks if t not in label_to_idx]
    if unknown:
        raise ValueError(f'Unknown labels: {unknown[:5]} (total {len(unknown)})')
    return toks

def encode_tokens(toks):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in toks:
        y[label_to_idx[t]] = 1.0
    return y

def encode_labels(s):
    return encode_tokens(parse_labels_str(s))

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

print('Loaded folds and helpers. Fold counts:', df_cur['fold'].value_counts().sort_index().to_dict())
'''

# Cell 5: Curated-only MC5 embedding extraction (batched crops) and caching
cell5 = '''
from pathlib import Path
import numpy as np, pandas as pd, time, json, hashlib
import librosa, torch
from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 5
EMB_DIM = 2048
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'Checkpoint missing; run Cell 3.'

def load_audio(path, sr=SR):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y.astype(np.float32)

def crop_starts(L, T):
    if L <= T:
        return [0, 0, 0, 0, 0]
    starts = [0, (L - T)//2, L - T, int(0.25*(L - T)), int(0.75*(L - T))]
    starts = [max(0, min(s, L - T)) for s in starts]
    return starts

def crops_for_wave(y, T):
    L = len(y)
    starts = crop_starts(L, T)
    crops = []
    for s in starts:
        if L >= T:
            crops.append(y[s:s+T])
        else:
            pad = np.pad(y, (0, T - L))
            crops.append(pad)
    return np.stack(crops, 0)  # (K, T)

def sha1_of_file(path: Path, block_size=1<<20):
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        while True:
            b = f.read(block_size)
            if not b:
                break
            h.update(b)
    return h.hexdigest()

def extract_mc5_embeddings_curated(file_list, root_dir, log_every=100):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    X = np.zeros((len(file_list), EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for i, fname in enumerate(file_list):
        y = load_audio(str(Path(root_dir) / fname), sr=SR)
        crops = crops_for_wave(y, T)  # (5, T)
        with torch.no_grad():
            out = at.inference(crops)  # batched inference on (5, T)
        if isinstance(out, tuple) and len(out)==2:
            embs = np.asarray(out[1], dtype=np.float32)  # (5, 2048)
        elif isinstance(out, dict) and 'embedding' in out:
            embs = np.asarray(out['embedding'], dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output type')
        assert embs.ndim == 2 and embs.shape[1] == EMB_DIM, f'Bad embedding shape: {embs.shape}'
        X[i] = embs.mean(axis=0)
        if (i+1) % log_every == 0:
            dt = time.time() - t0
            print(f'  Curated MC5: {i+1}/{len(file_list)} in {dt/60:.1f} min')
    assert X.shape == (len(file_list), EMB_DIM), f'Output shape mismatch: {X.shape}'
    return X

df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
train_files = df_cur['fname'].values

emb_cur_mc5_path = BASE / 'embeddings_curated_mc5.npy'
if emb_cur_mc5_path.exists():
    X_cur = np.load(emb_cur_mc5_path)
    print('Loaded cached curated MC5 embeddings.')
else:
    print('Extracting curated MC5 embeddings ...')
    X_cur = extract_mc5_embeddings_curated(train_files, root_dir=BASE/'train_curated', log_every=100)
    np.save(emb_cur_mc5_path, X_cur)
    print('Saved curated MC5 embeddings.')

import torch as _torch, librosa as _librosa
ckpt_size = CKPT_PATH.stat().st_size if CKPT_PATH.exists() else None
ckpt_sha1 = sha1_of_file(CKPT_PATH) if CKPT_PATH.exists() else None
meta = {
    'tta': 'mc5',
    'sr': SR, 'T_sec': T_SEC, 'T': T,
    'crops': 'begin,center,end,25%,75%',
    'aggregation': 'mean',
    'stage': 'curated_only',
    'versions': {'torch': getattr(_torch, '__version__', 'n/a'), 'librosa': getattr(_librosa, '__version__', 'n/a')},
    'checkpoint': {'path': str(CKPT_PATH), 'size_bytes': ckpt_size, 'sha1': ckpt_sha1}
}
with open('metadata_c3a.json', 'w') as f:
    json.dump(meta, f)
print('metadata_c3a.json written with provenance.')
'''

# Cell 6: Train LR on curated MC5 embeddings, OOF only (no test/submission in this stage) + Persist metrics
cell6 = '''
from pathlib import Path
import numpy as np, pandas as pd, json
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

BASE = Path('.')
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    toks = [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]
    return toks

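def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        # Fail loudly on labels outside the submission schema (as the Cell 4 helper does)
        # instead of silently dropping them.
        if t not in label_to_idx:
            raise ValueError(f'Unknown label: {t!r}')
        y[label_to_idx[t]] = 1.0
    return y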

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

X_cur = np.load('embeddings_curated_mc5.npy')
assert X_cur.ndim == 2 and X_cur.shape[1] == 2048, f'Embeddings shape invalid: {X_cur.shape}'
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)

oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(5):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    base_lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    oof[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores.append(lw)
    print(f'Fold {k} LWLRAP (MC5 curated)={lw:.4f}')
oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'OOF LWLRAP (MC5 curated)={oof_lw:.4f}; per-fold={fold_scores}')
np.save('oof_tta.npy', oof)

# Mandatory metrics persistence
BASELINE = 0.8001
delta = float(oof_lw - BASELINE)
ck_sha1 = None
try:
    with open('metadata_c3a.json', 'r') as f:
        meta = json.load(f)
    ck_sha1 = meta.get('checkpoint', {}).get('sha1')
except Exception:
    pass
metrics = {
    'stage': 'C3A_curated_mc5',
    'oof_lwlrap': float(oof_lw),
    'per_fold_lwlrap': [float(x) for x in fold_scores],
    'delta_vs_c2_baseline': delta,
    'tta_scheme': 'mc5_mean',
    'checkpoint_sha1': ck_sha1
}
with open('metrics_c3a_curated.json', 'w') as f:
    json.dump(metrics, f)
print('Persisted metrics to metrics_c3a_curated.json:', metrics)
'''

# Build notebook
nb = nbf.v4.new_notebook()
nb['cells'] = [
    nbf.v4.new_markdown_cell(cell1),
    nbf.v4.new_code_cell(cell2),
    nbf.v4.new_code_cell(cell3),
    nbf.v4.new_code_cell(cell4),
    nbf.v4.new_code_cell(cell5),
    nbf.v4.new_code_cell(cell6),
]

with open(nb_path, 'w', encoding='utf-8') as f:
    nbformat.write(nb, f)
print('Wrote C3A_TTA_curated.ipynb (curated-only stage) with metrics persistence')

print('Note: Not auto-executing due to previous long runtimes. Run this notebook with Run All to extract curated MC5 and compute OOF, then we will add test extraction in a follow-up stage.')


Wrote C3A_TTA_curated.ipynb (curated-only stage) with metrics persistence
Note: Not auto-executing due to previous long runtimes. Run this notebook with Run All to extract curated MC5 and compute OOF, then we will add test extraction in a follow-up stage.
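
To make the MC5 crop layout concrete, here is a minimal sketch that recomputes the five start offsets for a 25 s clip at 32 kHz. It mirrors the `crop_starts` helper in the staged notebook; the 25 s and 4 s clip lengths are assumed example values, not statistics from the dataset.

```python
SR = 32000          # sample rate used throughout the notebook
T = int(SR * 10.0)  # 10 s crop length in samples

def crop_starts(L, T):
    """Start offsets (in samples) for the begin, center, end, 25% and 75% crops."""
    if L <= T:
        # Short clips are zero-padded to T, so all five crops are identical.
        return [0, 0, 0, 0, 0]
    span = L - T  # largest valid start offset
    return [0, span // 2, span, int(0.25 * span), int(0.75 * span)]

# Hypothetical 25 s clip (illustrative length only).
starts = crop_starts(25 * SR, T)
print([s / SR for s in starts])  # [0.0, 7.5, 15.0, 3.75, 11.25] (start times in seconds)

# A hypothetical 4 s clip yields five identical crops.
print(crop_starts(4 * SR, T))    # [0, 0, 0, 0, 0]
```

For clips of 10 s or less the five crops coincide, so averaging them equals a single zero-padded pass and MC5 adds no diversity for those files.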


In [28]:
# Execute the staged C3A_TTA_curated.ipynb to produce MC5 curated embeddings and OOF
import sys, subprocess, time
from pathlib import Path

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
    except Exception as e:
        print('pip install failed for', pkg, ':', e)

try:
    import nbformat
except Exception:
    pip_install('nbformat')
    import nbformat

try:
    from nbconvert.preprocessors import ExecutePreprocessor
except Exception:
    pip_install('nbconvert')
    from nbconvert.preprocessors import ExecutePreprocessor

nb_path = Path('C3A_TTA_curated.ipynb')
assert nb_path.exists(), 'C3A_TTA_curated.ipynb not found. Generate it first (Cell 12).'

print('Executing C3A_TTA_curated.ipynb ... This may take significant time (GPU-accelerated).')
t0 = time.time()
with open(nb_path, 'r', encoding='utf-8') as f:
    nb_to_run = nbformat.read(f, as_version=4)
ep = ExecutePreprocessor(timeout=36000, kernel_name='python3')
ep.preprocess(nb_to_run, {'metadata': {'path': '.'}})
with open(nb_path, 'w', encoding='utf-8') as f:
    nbformat.write(nb_to_run, f)
dt = time.time() - t0
print(f'Notebook executed successfully in {dt/60:.1f} minutes.')

# Quick artifact check
for p in ['embeddings_curated_mc5.npy', 'oof_tta.npy', 'metadata_c3a.json']:
    print(p, 'exists' if Path(p).exists() else 'MISSING')


Executing C3A_TTA_curated.ipynb ... This may take significant time (GPU-accelerated).


Notebook executed successfully in 1.3 minutes.
embeddings_curated_mc5.npy exists
oof_tta.npy exists
metadata_c3a.json exists
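
The provenance stored in metadata_c3a.json (checkpoint size and SHA-1) could be used to confirm that a later stage loads the same weights. The sketch below shows one way to do that. `checkpoint_matches` is a hypothetical helper, not part of the notebook, and the demo runs on a throwaway file rather than the real CNN14 checkpoint.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def sha1_of_file(path, block_size=1 << 20):
    """Stream a file through SHA-1 in 1 MiB blocks (same approach as the staged notebook)."""
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(block_size), b''):
            h.update(block)
    return h.hexdigest()

def checkpoint_matches(meta_path, ckpt_path):
    """Hypothetical helper: compare a file against the size and SHA-1 recorded in the metadata JSON."""
    with open(meta_path) as f:
        recorded = json.load(f).get('checkpoint', {})
    ckpt_path = Path(ckpt_path)
    return (ckpt_path.stat().st_size == recorded.get('size_bytes')
            and sha1_of_file(ckpt_path) == recorded.get('sha1'))

# Self-contained demo on a dummy file standing in for the real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / 'dummy.pth'
    ckpt.write_bytes(b'not a real checkpoint')
    meta = Path(tmp) / 'metadata.json'
    meta.write_text(json.dumps({'checkpoint': {
        'path': str(ckpt),
        'size_bytes': ckpt.stat().st_size,
        'sha1': sha1_of_file(ckpt),
    }}))
    ok_before = checkpoint_matches(meta, ckpt)
    ckpt.write_bytes(b'tampered checkpoint!!')  # same length, different content
    ok_after = checkpoint_matches(meta, ckpt)

print(ok_before, ok_after)  # True False
```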


In [29]:
# C3A Curated TTA: Verify and/or persist metrics (safety check)
from pathlib import Path
import json
import numpy as np
import pandas as pd

metrics_path = Path('metrics_c3a_curated.json')
print('metrics_c3a_curated.json exists:', metrics_path.exists())
if metrics_path.exists():
    with open(metrics_path, 'r') as f:
        metrics = json.load(f)
    print('Loaded metrics:', metrics)
else:
    # Fallback: compute OOF LWLRAP from oof_tta.npy and persist metrics now
    print('metrics_c3a_curated.json missing. Computing from oof_tta.npy and persisting ...')
    oof_path = Path('oof_tta.npy')
    assert oof_path.exists(), 'Missing oof_tta.npy to compute metrics.'
    oof = np.load(oof_path)
    df_cur = pd.read_csv('train_curated_folds.csv')
    df_ss  = pd.read_csv('sample_submission.csv')
    class_names = [c for c in df_ss.columns if c != 'fname']
    label_to_idx = {c:i for i,c in enumerate(class_names)}
    n_classes = len(class_names)
    def parse_labels_str(s):
        if not isinstance(s, str):
            return []
        return [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]
    def encode_labels(s):
        y = np.zeros(n_classes, dtype=np.float32)
        for t in parse_labels_str(s):
            if t in label_to_idx:
                y[label_to_idx[t]] = 1.0
        return y
    def lwlrap_np(truth, scores):
        assert truth.shape == scores.shape
        n_samples, n_labels = truth.shape
        precisions = np.zeros(n_labels)
        labels_per_class = np.maximum(truth.sum(axis=0), 1)
        for i in range(n_samples):
            pos = np.where(truth[i] > 0)[0]
            if pos.size == 0:
                continue
            ranking = np.argsort(-scores[i])
            ranked_truth = truth[i][ranking]
            cumsum = np.cumsum(ranked_truth)
            pos_rank = np.where(ranked_truth > 0)[0]
            prec = cumsum[pos_rank] / (pos_rank + 1)
            ranked_labels = ranking[pos_rank]
            for lbl, p in zip(ranked_labels, prec):
                precisions[lbl] += p
        per_class = precisions / labels_per_class
        weights = truth.sum(axis=0) / max(truth.sum(), 1)
        return float((per_class * weights).sum()), per_class
    Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)
    oof_lw, _ = lwlrap_np(Y_cur, oof)
    BASELINE = 0.8001
    delta = float(oof_lw - BASELINE)
    # Try to include checkpoint sha1 if available
    ck_sha1 = None
    try:
        with open('metadata_c3a.json', 'r') as f:
            meta = json.load(f)
        ck_sha1 = meta.get('checkpoint', {}).get('sha1')
    except Exception:
        pass
    metrics = {
        'stage': 'C3A_curated_mc5',
        'oof_lwlrap': float(oof_lw),
        'per_fold_lwlrap': None,
        'delta_vs_c2_baseline': delta,
        'tta_scheme': 'mc5_mean',
        'checkpoint_sha1': ck_sha1
    }
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f)
    print('Persisted fallback metrics to metrics_c3a_curated.json:', metrics)

# Display quick summary
print('Artifacts present:', {
    'embeddings_curated_mc5.npy': Path('embeddings_curated_mc5.npy').exists(),
    'oof_tta.npy': Path('oof_tta.npy').exists(),
    'metadata_c3a.json': Path('metadata_c3a.json').exists(),
    'metrics_c3a_curated.json': metrics_path.exists(),
})

metrics_c3a_curated.json exists: True
Loaded metrics: {'stage': 'C3A_curated_mc5', 'oof_lwlrap': 0.8049036654978639, 'per_fold_lwlrap': [0.8125134119899252, 0.8058232379476366, 0.8127316131414057, 0.8160070710672407, 0.7809753987693859], 'delta_vs_c2_baseline': 0.004803665497863818, 'tta_scheme': 'mc5_mean', 'checkpoint_sha1': '5f73e32676afd7a763ddec6693d975be16859f90'}
Artifacts present: {'embeddings_curated_mc5.npy': True, 'oof_tta.npy': True, 'metadata_c3a.json': True, 'metrics_c3a_curated.json': True}
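
As a hand-checkable reference for the metric reported above, the sketch below restates LWLRAP in pure Python and evaluates it on a two-clip toy example. It is an illustrative re-statement under the same definition as `lwlrap_np`, not the notebook's implementation, and the toy labels and scores are made up for the example.

```python
def lwlrap(truth, scores):
    """Pure-Python label-weighted LRAP; returns (overall score, per-class precision)."""
    n_labels = len(truth[0])
    precision_sum = [0.0] * n_labels
    label_count = [0] * n_labels
    for t_row, s_row in zip(truth, scores):
        # Rank class indices by descending score.
        ranking = sorted(range(n_labels), key=lambda c: -s_row[c])
        hits = 0
        for rank, c in enumerate(ranking, start=1):
            if t_row[c]:
                hits += 1
                # Precision at the rank of this positive label.
                precision_sum[c] += hits / rank
                label_count[c] += 1
    per_class = [p / max(n, 1) for p, n in zip(precision_sum, label_count)]
    total = max(sum(label_count), 1)
    # Weight each class by its share of all positive labels.
    overall = sum(pc * n / total for pc, n in zip(per_class, label_count))
    return overall, per_class

# Toy data: clip 0 has class 0; clip 1 has classes 1 and 2.
truth = [[1, 0, 0],
         [0, 1, 1]]
scores = [[0.9, 0.1, 0.2],
          [0.8, 0.7, 0.1]]
score, per_class = lwlrap(truth, scores)
print(round(score, 4), [round(p, 4) for p in per_class])  # 0.7222 [1.0, 0.5, 0.6667]
```

In the second clip the negative class 0 outranks both positives, so class 1 scores 1/2 and class 2 scores 2/3. With one positive label per class the weights are equal, giving (1 + 1/2 + 2/3) / 3 = 13/18.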


In [40]:
# C3A Test MC5: Batched 5-crop embeddings for test, fit LR on curated MC5, predict test, write submission.csv (train-test parity)
import os, sys, subprocess, time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def pip_install(pkg):
    try:
        subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', pkg])
    except Exception as e:
        print('pip install failed for', pkg, ':', e)

# Ensure panns-inference is available
try:
    import panns_inference  # noqa: F401
except Exception:
    pip_install('panns-inference')
    import panns_inference  # noqa: F401
from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 5  # enforce MC5 for strict train-test parity
EMB_DIM = 2048
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing. Re-run the C3A asset cell if needed.'

df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
n_classes = len(class_names)

# Load curated MC5 embeddings (training uses MC5 embeddings per plan)
emb_cur_mc5_path = BASE / 'embeddings_curated_mc5.npy'
assert emb_cur_mc5_path.exists(), 'embeddings_curated_mc5.npy not found. Run curated MC5 extraction first.'
X_cur = np.load(emb_cur_mc5_path)
assert X_cur.ndim == 2 and X_cur.shape[1] == EMB_DIM

def load_audio(path, sr=SR):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y.astype(np.float32)

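def crop_starts(L, T, K):
    # Only the K=3 and K=5 layouts are defined; any other K would break the (B, K, EMB_DIM) reshape.
    assert K in (3, 5), f'Unsupported K={K}'
    if L <= T:
        return [0]*K
    if K == 3:
        starts = [0, (L - T)//2, L - T]
    else:
        starts = [0, (L - T)//2, L - T, int(0.25*(L - T)), int(0.75*(L - T))]
    return [max(0, min(s, L - T)) for s in starts]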

def crops_for_wave(y, T, K):
    L = len(y)
    starts = crop_starts(L, T, K)
    crops = []
    if L >= T:
        for s in starts:
            crops.append(y[s:s+T])
    else:
        pad = np.pad(y, (0, T - L))
        crops = [pad for _ in range(K)]
    return np.stack(crops, 0)  # (K, T)

def extract_mc_embeddings_batched(file_list, root_dir, K=5, batch_size=16, log_every=200):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Device: {device.upper()} | Batch size: {batch_size} | K={K}')
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    N = len(file_list)
    X = np.zeros((N, EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_files = file_list[start_idx:end_idx]
        # Load and crop all in batch
        batch_crops = []  # (B*K, T)
        for fname in batch_files:
            y = load_audio(str(Path(root_dir) / fname), sr=SR)
            ck = crops_for_wave(y, T, K)  # (K, T)
            batch_crops.append(ck)
        batch_crops = np.concatenate(batch_crops, axis=0)  # (B*K, T)
        with torch.no_grad():
            out = at.inference(batch_crops)
        if isinstance(out, tuple) and len(out)==2:
            embs = np.asarray(out[1], dtype=np.float32)  # (B*K, 2048)
        elif isinstance(out, dict) and 'embedding' in out:
            embs = np.asarray(out['embedding'], dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output type')
        # reshape to (B, K, EMB_DIM) and mean over K
        B = end_idx - start_idx
        embs = embs.reshape(B, K, EMB_DIM).mean(axis=1)
        X[start_idx:end_idx] = embs
        if (end_idx) % log_every == 0 or end_idx == N:
            dt = time.time() - t0
            print(f'  Test MC{K}: {end_idx}/{N} in {dt/60:.1f} min')
    return X

# Extract/Load MC5 test embeddings
emb_test_mc_path = BASE / f'embeddings_test_mc{K}.npy'
test_files = df_ss['fname'].values
if emb_test_mc_path.exists():
    X_test = np.load(emb_test_mc_path)
    print(f'Loaded cached test MC{K} embeddings.')
else:
    print(f'Extracting test MC{K} embeddings (batched) ...')
    # Heuristic batch size, counted in files (each forward pass sees batch_size * K crops):
    # larger on CPU to leverage MKL, moderate on GPU to fit memory
    bs = 64 if not torch.cuda.is_available() else 16
    X_test = extract_mc_embeddings_batched(test_files, root_dir=BASE/'test', K=K, batch_size=bs, log_every=256)
    np.save(emb_test_mc_path, X_test)
    print(f'Saved test MC{K} embeddings to {emb_test_mc_path}.')

# Train LR on curated MC5 embeddings and predict test (MC5) with serial n_jobs=1 to avoid parallelism bugs
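# Dict lookup avoids a linear list scan (class_names.index) for every label token
label_to_idx = {c: i for i, c in enumerate(class_names)}

def encode_labels_row(s):
    y = np.zeros(n_classes, dtype=np.float32)
    if isinstance(s, str):
        for t in s.replace(';', ',').split(','):
            t = t.strip()
            if t and t in label_to_idx:
                y[label_to_idx[t]] = 1.0
    return y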

Y_cur = np.stack(df_cur['labels'].apply(encode_labels_row).values)
base_lr_full = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=1, verbose=0)
clf_full = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr_full), n_jobs=1)
clf_full.fit(X_cur, Y_cur)
test_proba = clf_full.predict_proba(X_test).astype(np.float32)
sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', test_files)
sub.to_csv('submission.csv', index=False)
print(f'Saved submission.csv (MC{K} TTA, batched, n_jobs=1). Shape:', sub.shape)


Loaded cached test MC5 embeddings.


Saved submission.csv (MC5 TTA, batched, n_jobs=1). Shape: (3361, 81)
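
The batched extractor relies on crops being stacked file by file, so that reshaping `(B*K, D)` to `(B, K, D)` and averaging over axis 1 pools the K crops of each file. The toy sketch below checks that grouping with small arrays. The sizes and fill values are illustrative assumptions, and `D=4` stands in for the 2048-d embedding.

```python
import numpy as np

B, K, D = 3, 5, 4  # files per batch, crops per file, toy embedding width

# Fake per-crop "embeddings": every crop of file b is filled with the value b,
# stacked file by file in the same order the extractor concatenates its crops.
per_file = [np.full((K, D), float(b), dtype=np.float32) for b in range(B)]
flat = np.concatenate(per_file, axis=0)      # (B*K, D), file-major order

pooled = flat.reshape(B, K, D).mean(axis=1)  # (B, D): one averaged vector per file
print(pooled[:, 0])                          # [0. 1. 2.]

# Reshaping with the axes swapped would average crops from different files.
mixed = flat.reshape(K, B, D).mean(axis=0)
print(mixed[:, 0])
```

Because the layout is file-major, the grouping stays correct for the final, smaller batch as long as `B` is recomputed from `end_idx - start_idx`, as the cell above does.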


In [33]:
# C3B: Dual Features Fusion (MC5) — concatenate 2048-d embeddings with 527-d clipwise outputs; retrain LR; submit
import os, time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import json

# Ensure panns-inference is available
try:
    import panns_inference  # noqa: F401
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    import panns_inference  # noqa: F401
from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 5  # MC5 parity
EMB_DIM = 2048
CLIP_DIM = 527
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing. Prepare assets first.'

# Data and class schema
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)
train_files = df_cur['fname'].values
test_files  = df_ss['fname'].values

# Existing MC5 embeddings from C3A
X_cur_path = BASE / 'embeddings_curated_mc5.npy'
X_test_path = BASE / 'embeddings_test_mc5.npy'
assert X_cur_path.exists() and X_test_path.exists(), 'MC5 embeddings missing. Run C3A first.'
X_cur = np.load(X_cur_path)
X_test = np.load(X_test_path)
assert X_cur.shape[1] == EMB_DIM and X_test.shape[1] == EMB_DIM

def load_audio(path, sr=SR):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y.astype(np.float32)

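def crop_starts(L, T, K=5):
    # Only the 5-crop layout is implemented here; guard against a mismatched K.
    assert K == 5, f'Unsupported K={K}'
    if L <= T:
        return [0]*K
    return [0, (L - T)//2, L - T, int(0.25*(L - T)), int(0.75*(L - T))]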

def crops_for_wave(y, T, K=5):
    L = len(y)
    starts = crop_starts(L, T, K)
    crops = []
    if L >= T:
        for s in starts:
            s = max(0, min(s, L - T))
            crops.append(y[s:s+T])
    else:
        pad = np.pad(y, (0, T - L))
        crops = [pad for _ in range(K)]
    return np.stack(crops, 0)  # (K, T)

def extract_mc5_dual_features(file_list, root_dir, batch_size=64, log_every=256):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Device: {device.upper()} | Batch size: {batch_size} | Dual dims: emb={EMB_DIM}, clip={CLIP_DIM}')
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    N = len(file_list)
    emb = np.zeros((N, EMB_DIM), dtype=np.float32)
    clip = np.zeros((N, CLIP_DIM), dtype=np.float32)
    t0 = time.time()
    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_files = file_list[start_idx:end_idx]
        batch_crops = []  # (B*K, T)
        for fname in batch_files:
            y = load_audio(str(Path(root_dir) / fname), sr=SR)
            ck = crops_for_wave(y, T, K)
            batch_crops.append(ck)
        batch_crops = np.concatenate(batch_crops, axis=0)
        with torch.no_grad():
            out = at.inference(batch_crops)
        if isinstance(out, tuple) and len(out)==2:
            clip_bk = np.asarray(out[0], dtype=np.float32)  # (B*K, 527)
            emb_bk  = np.asarray(out[1], dtype=np.float32)  # (B*K, 2048)
        elif isinstance(out, dict):
            clip_bk = np.asarray(out.get('clipwise_output'), dtype=np.float32)
            emb_bk  = np.asarray(out.get('embedding'), dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output type')
        B = end_idx - start_idx
        emb_b = emb_bk.reshape(B, K, EMB_DIM).mean(axis=1)
        clip_b = clip_bk.reshape(B, K, CLIP_DIM).mean(axis=1)
        emb[start_idx:end_idx] = emb_b
        clip[start_idx:end_idx] = clip_b
        if end_idx % log_every == 0 or end_idx == N:
            dt = time.time() - t0
            print(f'  Dual MC5: {end_idx}/{N} in {dt/60:.1f} min')
    return emb, clip

# Paths for clipwise features
CL_cur_path = BASE / 'clipwise_curated_mc5.npy'
CL_test_path = BASE / 'clipwise_test_mc5.npy'

if CL_cur_path.exists():
    CL_cur = np.load(CL_cur_path)
    print('Loaded cached clipwise_curated_mc5.npy')
else:
    print('Extracting curated dual features (MC5) ...')
    # Embeddings are already cached from C3A. The extractor returns both outputs,
    # but only the clipwise output is kept here; the re-extracted embeddings are discarded.
    _, CL_cur = extract_mc5_dual_features(train_files, BASE/'train_curated', batch_size=64, log_every=200)
    np.save(CL_cur_path, CL_cur)
    print('Saved clipwise_curated_mc5.npy')

if CL_test_path.exists():
    CL_test = np.load(CL_test_path)
    print('Loaded cached clipwise_test_mc5.npy')
else:
    print('Extracting test dual features (MC5) ...')
    _, CL_test = extract_mc5_dual_features(test_files, BASE/'test', batch_size=64, log_every=256)
    np.save(CL_test_path, CL_test)
    print('Saved clipwise_test_mc5.npy')

assert CL_cur.shape[0] == X_cur.shape[0] and CL_test.shape[0] == X_test.shape[0]
assert CL_cur.shape[1] == CLIP_DIM and CL_test.shape[1] == CLIP_DIM

# Fuse features: [embeddings | clipwise]
X_cur_fused = np.concatenate([X_cur, CL_cur], axis=1)
X_test_fused = np.concatenate([X_test, CL_test], axis=1)
print('Fused shapes:', X_cur_fused.shape, X_test_fused.shape)
assert X_cur_fused.shape[1] == EMB_DIM + CLIP_DIM == 2575

# Encode labels
def encode_labels_row(s):
    y = np.zeros(n_classes, dtype=np.float32)
    if isinstance(s, str):
        for t in s.replace(';', ',').split(','):
            t = t.strip()
            if t and t in label_to_idx:
                y[label_to_idx[t]] = 1.0
    return y
Y_cur = np.stack(df_cur['labels'].apply(encode_labels_row).values)

# OOF with fused features
def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(5):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur_fused[trn_idx], X_cur_fused[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    # Slightly stronger regularization to reduce overfit with higher dim
    base_lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=1.0, n_jobs=16, verbose=0)
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)
    clf.fit(X_tr, y_tr)
    proba = clf.predict_proba(X_va)
    oof[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores.append(lw)
    print(f'Fold {k} LWLRAP (Fusion MC5)={lw:.4f}')
oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'OOF LWLRAP (Fusion MC5)={oof_lw:.4f}; per-fold={fold_scores}')
np.save('oof_tta_fusion.npy', oof)

# Persist metrics
BASELINE_C3A = 0.8049036654978639
metrics = {
    'stage': 'C3B_fusion_mc5',
    'oof_lwlrap': float(oof_lw),
    'per_fold_lwlrap': [float(x) for x in fold_scores],
    'delta_vs_c3a_baseline': float(oof_lw - BASELINE_C3A),
    'tta_scheme': 'mc5_mean',
    'feature_dims': {'embedding': EMB_DIM, 'clipwise': CLIP_DIM, 'fused': EMB_DIM+CLIP_DIM}
}
with open('metrics_c3b_curated.json', 'w') as f:
    json.dump(metrics, f)
print('Persisted metrics to metrics_c3b_curated.json:', metrics)

# Train full model on fused features and predict test
base_lr_full = LogisticRegression(solver='lbfgs', max_iter=1000, C=1.0, n_jobs=16, verbose=0)
clf_full = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr_full), n_jobs=-1)
clf_full.fit(X_cur_fused, Y_cur)
test_proba = clf_full.predict_proba(X_test_fused).astype(np.float32)
sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', test_files)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (Fusion MC5). Shape:', sub.shape)


Extracting curated dual features (MC5) ...
Device: CPU | Batch size: 64 | Dual dims: emb=2048, clip=527
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
  Dual MC5: 1600/4970 in 9.5 min
  Dual MC5: 3200/4970 in 19.7 min
  Dual MC5: 4800/4970 in 29.4 min
  Dual MC5: 4970/4970 in 30.5 min
Saved clipwise_curated_mc5.npy
Extracting test dual features (MC5) ...
Device: CPU | Batch size: 64 | Dual dims: emb=2048, clip=527
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth
Using CPU.
  Dual MC5: 256/3361 in 1.5 min
  Dual MC5: 512/3361 in 3.4 min
  Dual MC5: 768/3361 in 5.5 min
  Dual MC5: 1024/3361 in 7.6 min
  Dual MC5: 1280/3361 in 9.8 min
  Dual MC5: 1536/3361 in 12.0 min
  Dual MC5: 1792/3361 in 14.1 min
  Dual MC5: 2048/3361 in 16.0 min
  Dual MC5: 2304/3361 in 17.5 min
  Dual MC5: 2560/3361 in 18.9 min
  Dual MC5: 2816/3361 in 20.4 min
  Dual MC5: 3072/3361 in 21.8 min
  Dual MC5: 3328/3361 in 23.3 min
  Dual MC5: 3361/3361 in 23.5 min
Saved clipwise_test_mc5.npy
Fused shapes: (4970, 2575) (3361, 2575)
Fold 0 LWLRAP (Fusion MC5)=0.8106
Fold 1 LWLRAP (Fusion MC5)=0.8128
Fold 2 LWLRAP (Fusion MC5)=0.8065
Fold 3 LWLRAP (Fusion MC5)=0.8059
Fold 4 LWLRAP (Fusion MC5)=0.7733
OOF LWLRAP (Fusion MC5)=0.8017; per-fold=[0.8106072890540544, 0.8127845900116647, 0.8064642232857031, 0.8059466790179934, 0.773337002183789]
Persisted metrics to metrics_c3b_curated.json: {'stage': 'C3B_fusion_mc5', 'oof_lwlrap': 0.8017233014450034, 'per_fold_lwlrap': [0.8106072890540544, 0.8127845900116647, 0.8064642232857031, 0.8059466790179934, 0.773337002183789], 'delta_vs_c3a_baseline': -0.003180364052860485, 'tta_scheme': 'mc5_mean', 'feature_dims': {'embedding': 2048, 'clipwise': 527, 'fused': 2575}}
Saved submission.csv (Fusion MC5). Shape: (3361, 81)
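
Note on the metric: the LWLRAP values logged in these cells can be sanity-checked on a toy case. The sketch below is a compact NumPy restatement, not the notebook's own helper. The name `lwlrap_toy` and the toy arrays are illustrative assumptions, and scores are assumed to have no ties. It relies on the fact that weighting each class's mean precision by that class's share of positive labels collapses to the mean precision-at-rank over all positive labels.

```python
import numpy as np

def lwlrap_toy(truth, scores):
    """Label-weighted LRAP: every positive label contributes its precision-at-rank equally."""
    truth = np.asarray(truth, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    total, n_pos = 0.0, 0
    for t, s in zip(truth, scores):
        order = np.argsort(-s)            # class indices sorted by descending score
        ranked = t[order]                 # ground truth in ranked order
        hits = np.cumsum(ranked)          # number of positives seen up to each rank
        pos = np.where(ranked > 0)[0]     # 0-based ranks of the positive labels
        total += float((hits[pos] / (pos + 1)).sum())
        n_pos += pos.size
    return total / max(n_pos, 1)

# Toy data (assumed for illustration): 2 clips, 3 classes.
truth = [[1, 0, 1],
         [0, 1, 0]]
scores = [[0.9, 0.8, 0.1],
          [0.2, 0.3, 0.7]]

# Clip 0: class 0 is ranked 1st (precision 1/1), class 2 is ranked 3rd (precision 2/3).
# Clip 1: class 1 is ranked 2nd (precision 1/2).
score = lwlrap_toy(truth, scores)
print(round(score, 4))  # (1 + 2/3 + 1/2) / 3 = 13/18
```

Because the per-fold scores are normalized by each fold's own positive-label count, the OOF score over all curated clips is generally close to, but not identical to, the plain mean of the per-fold values.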


In [36]:
# C3C: Pure Teacher-Student on train_noisy (center-crop fast path, BATCHED) — OOF guardrails + new submission
# Rationale: Use batched center-crop embeddings to enable fast teacher inference over 19.8k noisy clips.
# Teachers are strictly 5 OOF LR models trained on curated center-crop embeddings (no full-data leak).
# Select high-confidence positives among weak labels (precision-first), weight samples, retrain student,
# validate curated OOF (must not drop >0.01 vs C2 baseline 0.8001), then produce a new submission on test center-crop.

import os, json, time, math
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import joblib

# Ensure panns_inference for feature extraction (center-crop) on noisy if needed
try:
    from panns_inference import AudioTagging
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
CROP_SEC = 10.0
T = int(SR * CROP_SEC)
EMB_DIM = 2048
N_FOLDS = 5
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'Missing CNN14 checkpoint; run earlier asset cell.'

# Load dataframes and schema
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_noi = pd.read_csv(BASE / 'train_noisy.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    return [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]

def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        if t in label_to_idx:
            y[label_to_idx[t]] = 1.0
    return y

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

# Load curated/test center-crop embeddings (C2 artifacts)
emb_cur_path = BASE / 'embeddings_curated.npy'
emb_test_path = BASE / 'embeddings_test.npy'
assert emb_cur_path.exists() and emb_test_path.exists(), 'Center-crop embeddings missing; run C2 pipeline.'
X_cur = np.load(emb_cur_path)
X_test = np.load(emb_test_path)
assert X_cur.shape[1] == EMB_DIM and X_test.shape[1] == EMB_DIM
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)

# 1) Train 5 pure OOF teacher models on curated center-crop embeddings
teachers = []
oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    base_lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
    clf = OneVsRestClassifier(make_pipeline(StandardScaler(with_mean=True, with_std=True), base_lr), n_jobs=-1)
    clf.fit(X_tr, y_tr)
    teachers.append(clf)
    proba = clf.predict_proba(X_va)
    oof[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores.append(lw)
    print(f'Teacher fold {k} LWLRAP (center-crop)={lw:.4f}')
oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'Teacher OOF LWLRAP (center-crop curated)={oof_lw:.4f}; per-fold={fold_scores}')

# 2) Extract center-crop embeddings for train_noisy (BATCHED) if not cached
emb_noisy_path = BASE / 'embeddings_noisy.npy'
def load_center_crop_10s(path, sr=SR, crop_sec=CROP_SEC):
    y, s = librosa.load(path, sr=sr, mono=True)
    if y.dtype != np.float32:
        y = y.astype(np.float32)
    target = int(sr * crop_sec)
    if len(y) >= target:
        start = max(0, (len(y) - target)//2)
        y = y[start:start+target]
    else:
        y = np.pad(y, (0, target-len(y)))
    return y

def extract_noisy_embeddings_batched(file_list, root_dir, batch_size=128, log_every=1024):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Extracting noisy embeddings batched | Device: {device.upper()} | batch_size={batch_size}')
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    N = len(file_list)
    X = np.zeros((N, EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for start in range(0, N, batch_size):
        end = min(start + batch_size, N)
        batch_files = file_list[start:end]
        batch_waves = []
        for fname in batch_files:
            y = load_center_crop_10s(str(Path(root_dir) / fname))
            batch_waves.append(y)
        batch_waves = np.stack(batch_waves, 0)  # (B, T)
        with torch.no_grad():
            out = at.inference(batch_waves)
        if isinstance(out, tuple) and len(out)==2:
            emb_b = np.asarray(out[1], dtype=np.float32)  # (B, 2048)
        elif isinstance(out, dict) and 'embedding' in out:
            emb_b = np.asarray(out['embedding'], dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output for noisy batch')
        X[start:end] = emb_b
        if (end % log_every == 0) or (end == N):
            dt = time.time() - t0
            print(f'  Noisy {end}/{N} in {dt/60:.1f} min')
    return X

if emb_noisy_path.exists():
    X_noisy = np.load(emb_noisy_path)
    print('Loaded cached embeddings_noisy.npy')
else:
    noisy_files = df_noi['fname'].values
    root = BASE / 'train_noisy'
    # Heuristic: bigger batch on CPU, moderate on GPU
    bs = 128 if not torch.cuda.is_available() else 64
    X_noisy = extract_noisy_embeddings_batched(noisy_files, root, batch_size=bs, log_every=2048)
    np.save(emb_noisy_path, X_noisy)
    print('Saved embeddings_noisy.npy')

# 3) Teacher ensemble predictions on train_noisy (average of fold teachers)
print('Scoring train_noisy with teacher ensemble ...')
probs_noisy = np.zeros((len(df_noi), n_classes), dtype=np.float32)
for t_idx, clf in enumerate(teachers):
    p = clf.predict_proba(X_noisy).astype(np.float32)
    probs_noisy += p
probs_noisy /= len(teachers)

# 4) Data-driven thresholds per class (precision-first among weak positives)
print('Selecting thresholds and filtering weak positives ...')
weak_lists = df_noi['labels'].apply(parse_labels_str).values
thr = np.full(n_classes, 0.95, dtype=np.float32)  # default for classes with fewer than 10 weak positives
for c, name in enumerate(class_names):
    idx = [i for i, toks in enumerate(weak_lists) if name in toks]
    if len(idx) >= 50:
        vals = probs_noisy[idx, c]
        q95 = float(np.quantile(vals, 0.95))
        thr[c] = max(0.90, q95)
    elif len(idx) >= 10:
        vals = probs_noisy[idx, c]
        q90 = float(np.quantile(vals, 0.90))
        thr[c] = max(0.92, q90)
    else:
        thr[c] = 0.95

# Build selected noisy labels matrix based on thresholds, restricted to weak positives only (no new positives)
Y_noisy_sel = np.zeros((len(df_noi), n_classes), dtype=np.float32)
sel_mask = np.zeros(len(df_noi), dtype=bool)
weights_noisy = np.zeros(len(df_noi), dtype=np.float32)
for i, toks in enumerate(weak_lists):
    if not toks:
        continue
    cls_idx = [label_to_idx[t] for t in toks if t in label_to_idx]
    if not cls_idx:
        continue
    preds = probs_noisy[i, cls_idx]
    keep = [j for j, p in zip(cls_idx, preds) if p >= thr[j]]
    if keep:
        Y_noisy_sel[i, keep] = 1.0
        sel_mask[i] = True
        weights_noisy[i] = float(np.clip(probs_noisy[i, keep].max(), 0.5, 1.0))

selected_idx = np.where(sel_mask)[0]
print('Selected noisy samples:', len(selected_idx), 'out of', len(df_noi), f'({len(selected_idx)/len(df_noi)*100:.1f}%)')
assert len(selected_idx) > 0, 'No noisy samples selected; thresholds too strict.'

# Persist thresholds and selection summary
thr_dict = {name: float(thr[i]) for i, name in enumerate(class_names)}
with open('thresholds_c3c.json', 'w') as f:
    json.dump({'per_class_thresholds': thr_dict, 'selection_count': int(len(selected_idx))}, f)
print('Saved thresholds_c3c.json')

# 5) OOF validation on curated with noisy augmentation in training folds
print('Validating curated OOF with noisy-augmented training ...')
oof_student = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores_student = []
BASELINE_C2 = 0.8001  # curated center-crop baseline

def predict_oof_weighted_lr(X_tr, y_tr, X_va, sw_tr, C=2.0, max_iter=1000):
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    proba = np.zeros((X_va.shape[0], n_classes), dtype=np.float32)
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=max_iter, C=C, n_jobs=16, verbose=0)
        lr.fit(X_tr_s, y_tr[:, c], sample_weight=sw_tr)
        proba[:, c] = lr.predict_proba(X_va_s)[:, 1]
    return proba

for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr = np.concatenate([X_cur[trn_idx], X_noisy[selected_idx]], axis=0)
    y_tr = np.concatenate([Y_cur[trn_idx], Y_noisy_sel[selected_idx]], axis=0)
    sw_tr = np.concatenate([np.ones(len(trn_idx), dtype=np.float32), weights_noisy[selected_idx]], axis=0)
    X_va = X_cur[val_idx]
    y_va = Y_cur[val_idx]
    proba = predict_oof_weighted_lr(X_tr, y_tr, X_va, sw_tr, C=2.0, max_iter=1000)
    oof_student[val_idx] = proba.astype(np.float32)
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores_student.append(lw)
    print(f'Student fold {k} LWLRAP={lw:.4f}')
oof_lw_student, _ = lwlrap_np(Y_cur, oof_student)
print(f'Student OOF LWLRAP={oof_lw_student:.4f}; per-fold={fold_scores_student}')

# Guardrail: If degradation > 0.01, tighten thresholds and recompute once
if oof_lw_student + 0.01 < BASELINE_C2:
    print('WARNING: OOF degraded by >0.01 vs baseline. Tightening thresholds to 0.98 minimum and recomputing selection ...')
    thr = np.maximum(thr, 0.98)
    Y_noisy_sel[:] = 0.0
    sel_mask[:] = False
    weights_noisy[:] = 0.0
    for i, toks in enumerate(weak_lists):
        if not toks:
            continue
        cls_idx = [label_to_idx[t] for t in toks if t in label_to_idx]
        if not cls_idx:
            continue
        preds = probs_noisy[i, cls_idx]
        keep = [j for j, p in zip(cls_idx, preds) if p >= thr[j]]
        if keep:
            Y_noisy_sel[i, keep] = 1.0
            sel_mask[i] = True
            weights_noisy[i] = float(np.clip(probs_noisy[i, keep].max(), 0.5, 1.0))
    selected_idx = np.where(sel_mask)[0]
    print('After tightening, selected noisy samples:', len(selected_idx))
    oof_student[:] = 0.0
    fold_scores_student = []
    for k in range(N_FOLDS):
        trn_idx = np.where(df_cur['fold'].values != k)[0]
        val_idx = np.where(df_cur['fold'].values == k)[0]
        X_tr = np.concatenate([X_cur[trn_idx], X_noisy[selected_idx]], axis=0)
        y_tr = np.concatenate([Y_cur[trn_idx], Y_noisy_sel[selected_idx]], axis=0)
        sw_tr = np.concatenate([np.ones(len(trn_idx), dtype=np.float32), weights_noisy[selected_idx]], axis=0)
        X_va = X_cur[val_idx]
        y_va = Y_cur[val_idx]
        proba = predict_oof_weighted_lr(X_tr, y_tr, X_va, sw_tr, C=2.0, max_iter=1000)
        oof_student[val_idx] = proba.astype(np.float32)
        lw, _ = lwlrap_np(y_va, proba)
        fold_scores_student.append(lw)
        print(f'(Tight) Student fold {k} LWLRAP={lw:.4f}')
    oof_lw_student, _ = lwlrap_np(Y_cur, oof_student)
    print(f'(Tight) Student OOF LWLRAP={oof_lw_student:.4f}')

# Persist C3C curated metrics
metrics_c3c = {
    'stage': 'C3C_teacher_student_center_batched',
    'teacher_oof_lwlrap': float(oof_lw),
    'student_oof_lwlrap': float(oof_lw_student),
    'student_per_fold': [float(x) for x in fold_scores_student],
    'baseline_c2_center': float(BASELINE_C2),  # reuse the guardrail constant instead of repeating the literal
    'selected_noisy': int(len(selected_idx))
}
with open('metrics_c3c_curated.json', 'w') as f:
    json.dump(metrics_c3c, f)
print('Saved metrics_c3c_curated.json:', metrics_c3c)

# 6) Train full student on curated + selected noisy and predict test (center-crop submission)
def predict_full_weighted_lr(X_full, Y_full, W_full, X_eval, C=2.0, max_iter=1000):
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_full_s = scaler.fit_transform(X_full)
    X_eval_s = scaler.transform(X_eval)
    proba = np.zeros((X_eval.shape[0], n_classes), dtype=np.float32)
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=max_iter, C=C, n_jobs=16, verbose=0)
        lr.fit(X_full_s, Y_full[:, c], sample_weight=W_full)
        proba[:, c] = lr.predict_proba(X_eval_s)[:, 1]
    return proba

X_full = np.concatenate([X_cur, X_noisy[selected_idx]], axis=0)
Y_full = np.concatenate([Y_cur, Y_noisy_sel[selected_idx]], axis=0)
W_full = np.concatenate([np.ones(len(X_cur), dtype=np.float32), weights_noisy[selected_idx]], axis=0)

test_proba = predict_full_weighted_lr(X_full, Y_full, W_full, X_test, C=2.0, max_iter=1000)
sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', df_ss['fname'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (C3C center-crop student, batched). Shape:', sub.shape)

# Save a note only: neither the scaler nor the per-class models are serialized, to keep the notebook light
joblib.dump({'note': 'Per-class LR models trained ad-hoc; reproduce by re-running this cell', 'selected_noisy': selected_idx.shape[0]}, 'ovr_logreg_student_c3c_center_note.joblib')
print('Saved student training note checkpoint.')


Teacher fold 0 LWLRAP (center-crop)=0.8091
Teacher fold 1 LWLRAP (center-crop)=0.8000
Teacher fold 2 LWLRAP (center-crop)=0.8086
Teacher fold 3 LWLRAP (center-crop)=0.8101
Teacher fold 4 LWLRAP (center-crop)=0.7745
Teacher OOF LWLRAP (center-crop curated)=0.8001; per-fold=[0.8091147409599597, 0.7999997415260673, 0.8085873055161429, 0.8101376948155439, 0.7744689058892884]
Loaded cached embeddings_noisy.npy
Scoring train_noisy with teacher ensemble ...
Selecting thresholds and filtering weak positives ...
Selected noisy samples: 1183 out of 19815 (6.0%)
Saved thresholds_c3c.json
Validating curated OOF with noisy-augmented training ...
Student fold 0 LWLRAP=0.8109
Student fold 1 LWLRAP=0.8039
Student fold 2 LWLRAP=0.8078
Student fold 3 LWLRAP=0.8150
Student fold 4 LWLRAP=0.7795
Student OOF LWLRAP=0.8034; per-fold=[0.8108990560912546, 0.8039474681684713, 0.807789676386343, 0.8150056151141913, 0.7795498747685645]
Saved metrics_c3c_curated.json: {'stage': 'C3C_teacher_student_center_batched', 'teacher_oof_lwlrap': 0.8000558604499433, 'student_oof_lwlrap': 0.8034271510417962, 'student_per_fold': [0.8108990560912546, 0.8039474681684713, 0.807789676386343, 0.8150056151141913, 0.7795498747685645], 'baseline_c2_center': 0.8001, 'selected_noisy': 1183}
Saved submission.csv (C3C center-crop student, batched). Shape: (3361, 81)
Saved student training note checkpoint.
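
Note on the selection rule: the filtering step in the cell above can be illustrated on a tiny example. This is a minimal sketch under assumed values. The three-class probabilities, weak-label index lists, and thresholds are made up, and `select_weak_positives` is a hypothetical helper name. It mirrors two rules from the cell: only labels already present in the weak annotation can be kept, and a clip's sample weight is its highest kept teacher probability clipped to [0.5, 1.0].

```python
import numpy as np

def select_weak_positives(probs, weak_idx, thr):
    """Keep weak labels whose teacher probability clears the per-class threshold."""
    n, c = probs.shape
    y_sel = np.zeros((n, c), dtype=np.float32)
    weights = np.zeros(n, dtype=np.float32)
    for i, cls_idx in enumerate(weak_idx):
        keep = [j for j in cls_idx if probs[i, j] >= thr[j]]
        if keep:
            y_sel[i, keep] = 1.0
            # One weight per clip: the most confident kept label, clipped to [0.5, 1.0].
            weights[i] = np.clip(probs[i, keep].max(), 0.5, 1.0)
    return y_sel, weights

# Assumed teacher-ensemble probabilities for 3 noisy clips and 3 classes.
probs = np.array([[0.97, 0.10, 0.99],
                  [0.40, 0.93, 0.05],
                  [0.20, 0.30, 0.96]])
weak_idx = [[0], [1], [0, 1]]          # class indices named in each clip's weak label
thr = np.array([0.95, 0.95, 0.95])     # assumed per-class thresholds

y_sel, weights = select_weak_positives(probs, weak_idx, thr)
selected = np.where(weights > 0)[0]
print(selected, y_sel[selected], weights[selected])
```

Only clip 0 survives: its weak label (class 0) clears the threshold. Clip 0's class 2 and clip 2's class 2 score highly but are not weak labels, so this rule never promotes them to positives.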


In [37]:
# C3C (Remediated): Pure Teacher-Student on train_noisy with strict MC5 parity and class-specific weights
# Foundation: C3A MC5 embeddings (best validated model, OOF=0.8049). No center-crop anywhere.
# - Train 5 pure OOF teachers on curated MC5
# - Extract MC5 embeddings for train_noisy (batched)
# - Teacher ensemble on noisy -> per-class thresholds (precision-first: max(0.85, 0.90-quantile) of teacher probs among weak positives)
# - Select weak positives above threshold, plus ultra-confident new positives (p>=0.98, cap 3/clip)
# - Train student with per-class sample weights (noisy negatives get weight 0)
# - Validate curated OOF against true baseline (0.8049); persist metrics; produce submission on test MC5.

import os, json, time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import joblib

# panns-inference
try:
    from panns_inference import AudioTagging
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 5
EMB_DIM = 2048
N_FOLDS = 5
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing; prepare assets first.'

# Data
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_noi = pd.read_csv(BASE / 'train_noisy.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    return [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]

def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        if t in label_to_idx:
            y[label_to_idx[t]] = 1.0
    return y

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

# Load curated/test MC5 embeddings (foundation)
X_cur_path = BASE / 'embeddings_curated_mc5.npy'
X_test_path = BASE / 'embeddings_test_mc5.npy'
assert X_cur_path.exists() and X_test_path.exists(), 'Missing MC5 embeddings; run C3A first.'
X_cur = np.load(X_cur_path)
X_test = np.load(X_test_path)
assert X_cur.shape[1] == EMB_DIM and X_test.shape[1] == EMB_DIM
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)

# 1) Train 5 pure OOF teachers on curated MC5
teachers = []
oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    proba_va = np.zeros((len(val_idx), n_classes), dtype=np.float32)
    models_k = []
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=16, verbose=0)
        lr.fit(X_tr_s, y_tr[:, c])
        proba_va[:, c] = lr.predict_proba(X_va_s)[:, 1]
        models_k.append(lr)
    teachers.append((scaler, models_k))
    oof[val_idx] = proba_va
    lw, _ = lwlrap_np(y_va, proba_va)
    fold_scores.append(lw)
    print(f'Teacher fold {k} LWLRAP (MC5)={lw:.4f}')
oof_lw, _ = lwlrap_np(Y_cur, oof)
print(f'Teacher OOF LWLRAP (MC5 curated)={oof_lw:.4f}; per-fold={fold_scores}')

# 2) Extract MC5 embeddings for train_noisy (batched, cached)
def load_audio(path, sr=SR):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y.astype(np.float32)

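def crop_starts(L, T, K=5):
    # The five offsets (start, center, end, 25%, 75%) are hard-coded, so only K=5 is supported.
    assert K == 5, 'crop_starts is hard-coded for K=5'
    if L <= T:
        return [0]*K
    return [0, (L - T)//2, L - T, int(0.25*(L - T)), int(0.75*(L - T))]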

def crops_for_wave(y, T, K=5):
    L = len(y)
    starts = crop_starts(L, T, K)
    if L >= T:
        crops = [y[max(0, min(s, L - T)):max(0, min(s, L - T))+T] for s in starts]
    else:
        pad = np.pad(y, (0, T - L))
        crops = [pad for _ in range(K)]
    return np.stack(crops, 0)  # (K, T)

def extract_mc_embeddings_batched(file_list, root_dir, K=5, batch_size=128, log_every=2048):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Noisy MC{K} extraction | Device: {device.upper()} | batch_size={batch_size}')
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    N = len(file_list)
    X = np.zeros((N, EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_files = file_list[start_idx:end_idx]
        batch_crops = []
        for fname in batch_files:
            y = load_audio(str(Path(root_dir) / fname), sr=SR)
            ck = crops_for_wave(y, T, K)
            batch_crops.append(ck)
        batch_crops = np.concatenate(batch_crops, axis=0)  # (B*K, T)
        with torch.no_grad():
            out = at.inference(batch_crops)
        if isinstance(out, tuple) and len(out)==2:
            emb_bk = np.asarray(out[1], dtype=np.float32)
        elif isinstance(out, dict) and 'embedding' in out:
            emb_bk = np.asarray(out['embedding'], dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output type')
        B = end_idx - start_idx
        embs = emb_bk.reshape(B, K, EMB_DIM).mean(axis=1)
        X[start_idx:end_idx] = embs
        if (end_idx % log_every == 0) or (end_idx == N):
            dt = time.time() - t0
            print(f'  Noisy MC{K}: {end_idx}/{N} in {dt/60:.1f} min')
    return X

emb_noisy_mc_path = BASE / 'embeddings_noisy_mc5.npy'
if emb_noisy_mc_path.exists():
    X_noisy = np.load(emb_noisy_mc_path)
    print('Loaded cached embeddings_noisy_mc5.npy')
else:
    noisy_files = df_noi['fname'].values
    root = BASE / 'train_noisy'
    bs = 128 if not torch.cuda.is_available() else 16
    X_noisy = extract_mc_embeddings_batched(noisy_files, root, K=K, batch_size=bs, log_every=2048)
    np.save(emb_noisy_mc_path, X_noisy)
    print('Saved embeddings_noisy_mc5.npy')

# 3) Teacher ensemble predictions on train_noisy (average of fold teachers)
print('Scoring train_noisy (MC5) with teacher ensemble ...')
probs_noisy = np.zeros((len(df_noi), n_classes), dtype=np.float32)
for (scaler, models_k) in teachers:
    Xn_s = scaler.transform(X_noisy)
    p = np.zeros_like(probs_noisy)
    for c in range(n_classes):
        p[:, c] = models_k[c].predict_proba(Xn_s)[:, 1]
    probs_noisy += p.astype(np.float32)
probs_noisy /= len(teachers)

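# 4) Per-class thresholds among weak positives: max(0.85, 0.90-quantile), relaxed to the 0.85-quantile
#    when fewer than 20 of >=50 weak positives are retained (keeps roughly the top 10-15% per class at most)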
weak_lists = df_noi['labels'].apply(parse_labels_str).values
thr = np.full(n_classes, 0.90, dtype=np.float32)
retained_per_class = np.zeros(n_classes, dtype=int)
total_weak_per_class = np.zeros(n_classes, dtype=int)
for c, name in enumerate(class_names):
    idx = [i for i, toks in enumerate(weak_lists) if name in toks]
    total_weak_per_class[c] = len(idx)
    if len(idx) == 0:
        thr[c] = 0.95
        continue
    vals = probs_noisy[idx, c]
    q = 0.90
    qv = float(np.quantile(vals, q))
    t = max(0.85, qv)
    retained = int((vals >= t).sum())
    if retained < 20 and len(idx) >= 50:
        q = 0.85
        qv = float(np.quantile(vals, q))
        t = max(0.85, qv)
        retained = int((vals >= t).sum())
    thr[c] = t
    retained_per_class[c] = retained

# Build noisy label matrix restricted to weak positives; add ultra-confident new positives (p>=0.98, cap 3/clip)
Y_noisy_sel = np.zeros((len(df_noi), n_classes), dtype=np.float32)
sel_mask = np.zeros(len(df_noi), dtype=bool)
for i, toks in enumerate(weak_lists):
    cls_idx = [label_to_idx[t] for t in toks if t in label_to_idx]
    for j in cls_idx:
        if probs_noisy[i, j] >= thr[j]:
            Y_noisy_sel[i, j] = 1.0
    not_weak = [j for j in range(n_classes) if j not in cls_idx]
    if not_weak:
        top_new = [(j, probs_noisy[i, j]) for j in not_weak if probs_noisy[i, j] >= 0.98]
        top_new.sort(key=lambda x: -x[1])
        for j, _p in top_new[:3]:
            Y_noisy_sel[i, j] = 1.0
    if Y_noisy_sel[i].sum() > 0:
        sel_mask[i] = True

selected_idx = np.where(sel_mask)[0]
print('Selected noisy samples (any positives after filtering):', len(selected_idx), 'out of', len(df_noi), f'({len(selected_idx)/len(df_noi)*100:.1f}%)')
assert len(selected_idx) > 0, 'No noisy samples selected; thresholds too strict.'

# Persist thresholds and selection summary
thr_dict = {name: float(thr[i]) for i, name in enumerate(class_names)}
summary = {
    'per_class_thresholds': thr_dict,
    'selection_count': int(len(selected_idx)),
    'retained_per_class': {class_names[i]: int(retained_per_class[i]) for i in range(n_classes)},
    'total_weak_per_class': {class_names[i]: int(total_weak_per_class[i]) for i in range(n_classes)}
}
with open('thresholds_c3c.json', 'w') as f:
    json.dump(summary, f)
print('Saved thresholds_c3c.json (MC5).')

# 5) OOF validation on curated with noisy augmentation, class-specific sample weights (noisy negatives weight=0)
print('Validating curated OOF (student, MC5, class-specific weights) ...')
oof_student = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores_student = []
BASELINE_C3A = 0.8049036654978639

def predict_oof_student_mc5(X_tr, y_tr, X_va, probs_noisy_sel, C=2.0, max_iter=1000):
    n_cur = y_tr.shape[0] - probs_noisy_sel.shape[0]
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    proba = np.zeros((X_va.shape[0], n_classes), dtype=np.float32)
    # y_tr noisy block for masking
    y_noi = y_tr[n_cur:]
    for c in range(n_classes):
        w_cur = np.ones(n_cur, dtype=np.float32)
        # class-specific weights: use prob only for positive pseudo-labels; negatives get 0.0
        w_noi_c = np.where(y_noi[:, c] > 0.5, np.clip(probs_noisy_sel[:, c], 0.5, 1.0), 0.0).astype(np.float32)
        w = np.concatenate([w_cur, w_noi_c], axis=0)
        lr = LogisticRegression(solver='lbfgs', max_iter=max_iter, C=C, n_jobs=16, verbose=0)
        lr.fit(X_tr_s, y_tr[:, c], sample_weight=w)
        proba[:, c] = lr.predict_proba(X_va_s)[:, 1]
    return proba

for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr_cur, y_tr_cur = X_cur[trn_idx], Y_cur[trn_idx]
    X_va, y_va = X_cur[val_idx], Y_cur[val_idx]
    X_tr = np.concatenate([X_tr_cur, X_noisy[selected_idx]], axis=0)
    y_tr = np.concatenate([y_tr_cur, Y_noisy_sel[selected_idx]], axis=0)
    proba = predict_oof_student_mc5(X_tr, y_tr, X_va, probs_noisy[selected_idx], C=2.0, max_iter=1000)
    oof_student[val_idx] = proba
    lw, _ = lwlrap_np(y_va, proba)
    fold_scores_student.append(lw)
    print(f'Student fold {k} LWLRAP={lw:.4f}')
oof_lw_student, _ = lwlrap_np(Y_cur, oof_student)
print(f'Student OOF LWLRAP (MC5)={oof_lw_student:.4f}; per-fold={fold_scores_student}')

metrics_c3c = {
    'stage': 'C3C_teacher_student_mc5_batched',
    'teacher_oof_lwlrap': float(oof_lw),
    'student_oof_lwlrap': float(oof_lw_student),
    'student_per_fold': [float(x) for x in fold_scores_student],
    'baseline_c3a_mc5': float(BASELINE_C3A),
    'selected_noisy': int(len(selected_idx))
}
with open('metrics_c3c_curated.json', 'w') as f:
    json.dump(metrics_c3c, f)
print('Saved metrics_c3c_curated.json (MC5):', metrics_c3c)

# 6) Train full student on curated + selected noisy (MC5) and predict test (MC5) -> submission.csv
def predict_full_student_mc5(X_full, Y_full, probs_noisy_sel, X_eval, n_cur, C=2.0, max_iter=1000):
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_full_s = scaler.fit_transform(X_full)
    X_eval_s = scaler.transform(X_eval)
    proba = np.zeros((X_eval.shape[0], n_classes), dtype=np.float32)
    y_noi = Y_full[n_cur:]
    for c in range(n_classes):
        w_cur = np.ones(n_cur, dtype=np.float32)
        w_noi_c = np.where(y_noi[:, c] > 0.5, np.clip(probs_noisy_sel[:, c], 0.5, 1.0), 0.0).astype(np.float32)
        w = np.concatenate([w_cur, w_noi_c], axis=0)
        lr = LogisticRegression(solver='lbfgs', max_iter=max_iter, C=C, n_jobs=16, verbose=0)
        lr.fit(X_full_s, Y_full[:, c], sample_weight=w)
        proba[:, c] = lr.predict_proba(X_eval_s)[:, 1]
    return proba

X_full = np.concatenate([X_cur, X_noisy[selected_idx]], axis=0)
Y_full = np.concatenate([Y_cur, Y_noisy_sel[selected_idx]], axis=0)
n_cur = X_cur.shape[0]
test_proba = predict_full_student_mc5(X_full, Y_full, probs_noisy[selected_idx], X_test, n_cur=n_cur, C=2.0, max_iter=1000)
sub = pd.DataFrame(test_proba, columns=class_names)
sub.insert(0, 'fname', df_ss['fname'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (C3C MC5 student). Shape:', sub.shape)

# Save note artifact
joblib.dump({'note': 'MC5 teacher-student per-class LR trained on-the-fly (class-specific weights); reproduce by re-running this cell', 'selected_noisy': int(len(selected_idx))}, 'ovr_logreg_student_c3c_mc5_note.joblib')
print('Saved student training note (MC5).')


Teacher fold 0 LWLRAP (MC5)=0.8121


Teacher fold 1 LWLRAP (MC5)=0.8054


Teacher fold 2 LWLRAP (MC5)=0.8123


Teacher fold 3 LWLRAP (MC5)=0.8147


Teacher fold 4 LWLRAP (MC5)=0.7801
Teacher OOF LWLRAP (MC5 curated)=0.8049; per-fold=[0.8120782509891866, 0.805387697867566, 0.8122972082814656, 0.8147039939573124, 0.7801080960536076]
Noisy MC5 extraction | Device: CPU | batch_size=128
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth


Using CPU.


  Noisy MC5: 2048/19815 in 11.4 min


  Noisy MC5: 4096/19815 in 22.6 min


  Noisy MC5: 6144/19815 in 37.4 min


  Noisy MC5: 8192/19815 in 51.5 min


  Noisy MC5: 10240/19815 in 64.0 min


  Noisy MC5: 12288/19815 in 79.0 min


  Noisy MC5: 14336/19815 in 90.7 min


  Noisy MC5: 16384/19815 in 106.6 min


  Noisy MC5: 18432/19815 in 123.8 min


  Noisy MC5: 19815/19815 in 135.5 min
Saved embeddings_noisy_mc5.npy
Scoring train_noisy (MC5) with teacher ensemble ...


Selected noisy samples (any positives after filtering): 4446 out of 19815 (22.4%)
Saved thresholds_c3c.json (MC5).
Validating curated OOF (student, MC5, class-specific weights) ...


Student fold 0 LWLRAP=0.8190


Student fold 1 LWLRAP=0.8099


Student fold 2 LWLRAP=0.8235


Student fold 3 LWLRAP=0.8197


Student fold 4 LWLRAP=0.7886
Student OOF LWLRAP (MC5)=0.8121; per-fold=[0.8189625167044805, 0.8099130328822854, 0.8235343965782844, 0.819744053492511, 0.7885782685976188]
Saved metrics_c3c_curated.json (MC5): {'stage': 'C3C_teacher_student_mc5_batched', 'teacher_oof_lwlrap': 0.8049036654978639, 'student_oof_lwlrap': 0.8121370323443532, 'student_per_fold': [0.8189625167044805, 0.8099130328822854, 0.8235343965782844, 0.819744053492511, 0.7885782685976188], 'baseline_c3a_mc5': 0.8049036654978639, 'selected_noisy': 4446}


Saved submission.csv (C3C MC5 student). Shape: (3361, 81)
Saved student training note (MC5).
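The C3C log above reports teacher and student LWLRAP fold by fold. A minimal sketch of the per-fold comparison follows, using only the rounded values printed in that log. The variable names are illustrative and not part of the notebook.

```python
# Per-fold LWLRAP values copied from the C3C log above (rounded to 4 decimals).
teacher = [0.8121, 0.8054, 0.8123, 0.8147, 0.7801]
student = [0.8190, 0.8099, 0.8235, 0.8197, 0.7886]

# Gain of the student over the teacher on each validation fold.
gains = [round(s - t, 4) for t, s in zip(teacher, student)]
mean_gain = sum(gains) / len(gains)

# Index of the fold where the student scores lowest.
weakest_fold = min(range(len(student)), key=lambda k: student[k])

print("per-fold gain:", gains)
print(f"mean per-fold gain: {mean_gain:.4f}")
print("weakest student fold:", weakest_fold)
```

The student improves on every fold in this run, and fold 4 remains the weakest for both models, so it is a reasonable place to look for hard or mislabeled clips.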


In [39]:
# C3D: Lightweight Ensemble + Per-class Temperature Scaling (MC5 parity)
# - Rebuild MC5 teachers, rescore train_noisy to recompute selection (deterministic)
# - Train student heads with different C values, fold-OOF ensemble
# - Fit per-class temperature scaling on OOF, apply to test, write submission.csv
# - Persist artifacts: metrics_c3d_curated.json, oof_ensemble.npy, temps_c3d.json

import os, json, time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

try:
    from panns_inference import AudioTagging
except Exception:
    import sys, subprocess
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--quiet', 'panns-inference'])
    from panns_inference import AudioTagging

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 5
EMB_DIM = 2048
N_FOLDS = 5
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing.'

# Data
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_noi = pd.read_csv(BASE / 'train_noisy.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
label_to_idx = {c:i for i,c in enumerate(class_names)}
n_classes = len(class_names)

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    return [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]

def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        if t in label_to_idx:
            y[label_to_idx[t]] = 1.0
    return y

def lwlrap_np(truth, scores):
    """Label-weighted label-ranking average precision; returns (overall LWLRAP, per-class LRAP)."""
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

# Load MC5 embeddings
X_cur = np.load('embeddings_curated_mc5.npy')
X_test = np.load('embeddings_test_mc5.npy')
X_noisy = np.load('embeddings_noisy_mc5.npy') if Path('embeddings_noisy_mc5.npy').exists() else None
assert X_noisy is not None, 'embeddings_noisy_mc5.npy missing; run C3C MC5 first.'
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)

# 1) Rebuild teachers on curated MC5 (pure OOF)
teachers = []
oof_teacher = np.zeros((len(df_cur), n_classes), dtype=np.float32)
for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr, X_va = X_cur[trn_idx], X_cur[val_idx]
    y_tr, y_va = Y_cur[trn_idx], Y_cur[val_idx]
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    proba_va = np.zeros((len(val_idx), n_classes), dtype=np.float32)
    models_k = []
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=2.0, n_jobs=1)
        lr.fit(X_tr_s, y_tr[:, c])
        proba_va[:, c] = lr.predict_proba(X_va_s)[:, 1]
        models_k.append(lr)
    teachers.append((scaler, models_k))
    oof_teacher[val_idx] = proba_va
teacher_oof_lw, _ = lwlrap_np(Y_cur, oof_teacher)
print(f'Teacher OOF (MC5) rebuilt: {teacher_oof_lw:.4f}')

# 2) Score train_noisy with teacher ensemble to recompute selection
print('Scoring noisy with teacher ensemble ...')
probs_noisy = np.zeros((len(df_noi), n_classes), dtype=np.float32)
for scaler, models_k in teachers:
    Xn_s = scaler.transform(X_noisy)
    p = np.zeros_like(probs_noisy)
    for c in range(n_classes):
        p[:, c] = models_k[c].predict_proba(Xn_s)[:, 1]
    probs_noisy += p.astype(np.float32)
probs_noisy /= len(teachers)

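# Per-class thresholds for weak labels: keep a weak label only when the teacher probability is
# >= max(0.85, q90 of that class's weak-positive scores), i.e. roughly the top 10% per class
# (relaxed to q85 when fewer than 20 of >= 50 candidates pass). Ultra-confident new positives are added below.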
weak_lists = df_noi['labels'].apply(parse_labels_str).values
thr = np.full(n_classes, 0.90, dtype=np.float32)
for c, name in enumerate(class_names):
    idx = [i for i, toks in enumerate(weak_lists) if name in toks]
    if len(idx) == 0:
        thr[c] = 0.95
        continue
    vals = probs_noisy[idx, c]
    qv = float(np.quantile(vals, 0.90))
    t = max(0.85, qv)
    if (vals >= t).sum() < 20 and len(idx) >= 50:
        qv = float(np.quantile(vals, 0.85))
        t = max(0.85, qv)
    thr[c] = t

Y_noisy_sel = np.zeros((len(df_noi), n_classes), dtype=np.float32)
sel_mask = np.zeros(len(df_noi), dtype=bool)
for i, toks in enumerate(weak_lists):
    cls_idx = [label_to_idx[t] for t in toks if t in label_to_idx]
    for j in cls_idx:
        if probs_noisy[i, j] >= thr[j]:
            Y_noisy_sel[i, j] = 1.0
    # ultra-confident new positives (cap 3)
    not_weak = [j for j in range(n_classes) if j not in cls_idx]
    top_new = [(j, probs_noisy[i, j]) for j in not_weak if probs_noisy[i, j] >= 0.98]
    top_new.sort(key=lambda x: -x[1])
    for j, _p in top_new[:3]:
        Y_noisy_sel[i, j] = 1.0
    if Y_noisy_sel[i].sum() > 0:
        sel_mask[i] = True
selected_idx = np.where(sel_mask)[0]
print('Selected noisy:', len(selected_idx), f'({len(selected_idx)/len(df_noi)*100:.1f}%)')
assert len(selected_idx) > 0

# 3) OOF ensemble of students with different C; class-specific weights (noisy negatives weight=0)
C_LIST = [1.0, 2.0, 4.0]
oof_ensemble = np.zeros((len(df_cur), n_classes), dtype=np.float32)
fold_scores = []
for k in range(N_FOLDS):
    trn_idx = np.where(df_cur['fold'].values != k)[0]
    val_idx = np.where(df_cur['fold'].values == k)[0]
    X_tr_cur, y_tr_cur = X_cur[trn_idx], Y_cur[trn_idx]
    X_va, y_va = X_cur[val_idx], Y_cur[val_idx]
    X_tr = np.concatenate([X_tr_cur, X_noisy[selected_idx]], axis=0)
    y_tr = np.concatenate([y_tr_cur, Y_noisy_sel[selected_idx]], axis=0)
    n_cur = X_tr_cur.shape[0]
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    y_noi = y_tr[n_cur:]
    proba_accum = np.zeros((len(val_idx), n_classes), dtype=np.float32)
    for Cval in C_LIST:
        proba_cfg = np.zeros((len(val_idx), n_classes), dtype=np.float32)
        for c in range(n_classes):
            w_cur = np.ones(n_cur, dtype=np.float32)
            w_noi_c = np.where(y_noi[:, c] > 0.5, np.clip(probs_noisy[selected_idx, c], 0.5, 1.0), 0.0).astype(np.float32)
            w = np.concatenate([w_cur, w_noi_c], axis=0)
            lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=Cval, n_jobs=1)
            lr.fit(X_tr_s, y_tr[:, c], sample_weight=w)
            proba_cfg[:, c] = lr.predict_proba(X_va_s)[:, 1]
        proba_accum += proba_cfg
    proba_avg = proba_accum / len(C_LIST)
    oof_ensemble[val_idx] = proba_avg
    lw, _ = lwlrap_np(y_va, proba_avg)
    fold_scores.append(lw)
    print(f'Fold {k} ensemble LWLRAP={lw:.4f}')
oof_lw_ens, _ = lwlrap_np(Y_cur, oof_ensemble)
print(f'OOF LWLRAP ensemble (MC5 student, {len(C_LIST)} heads) = {oof_lw_ens:.4f}; per-fold={fold_scores}')
np.save('oof_ensemble.npy', oof_ensemble)

# 4) Per-class temperature scaling on OOF to calibrate probabilities
def safe_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))
def bce_loss(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

temps = np.ones(n_classes, dtype=np.float32)
for c in range(n_classes):
    y = Y_cur[:, c]
    p = oof_ensemble[:, c]
    z = safe_logit(p)
    best_t, best_loss = 1.0, bce_loss(y, p)
    # grid over tau in [0.5, 2.0] in steps of 0.1; tau > 1 softens probabilities, tau < 1 sharpens them
    for tau in np.linspace(0.5, 2.0, 16):
        p_cal = sigmoid(z / tau)
        loss = bce_loss(y, p_cal)
        if loss < best_loss:
            best_loss, best_t = loss, float(tau)
    temps[c] = best_t
print('Calibrated per-class temperatures (min/max):', float(temps.min()), float(temps.max()))
with open('temps_c3d.json', 'w') as f:
    json.dump({class_names[i]: float(temps[i]) for i in range(n_classes)}, f)
print('Saved temps_c3d.json')

# 5) Train full ensemble on curated+selected noisy and predict test; apply temperature scaling
X_full = np.concatenate([X_cur, X_noisy[selected_idx]], axis=0)
Y_full = np.concatenate([Y_cur, Y_noisy_sel[selected_idx]], axis=0)
n_cur = X_cur.shape[0]
scaler_full = StandardScaler(with_mean=True, with_std=True)
X_full_s = scaler_full.fit_transform(X_full)
X_test_s = scaler_full.transform(X_test)

proba_test_accum = np.zeros((X_test.shape[0], n_classes), dtype=np.float32)
y_noi_full = Y_full[n_cur:]
for Cval in C_LIST:
    proba_test_cfg = np.zeros_like(proba_test_accum)
    for c in range(n_classes):
        w_cur = np.ones(n_cur, dtype=np.float32)
        w_noi_c = np.where(y_noi_full[:, c] > 0.5, np.clip(probs_noisy[selected_idx, c], 0.5, 1.0), 0.0).astype(np.float32)
        w = np.concatenate([w_cur, w_noi_c], axis=0)
        lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=Cval, n_jobs=1)
        lr.fit(X_full_s, Y_full[:, c], sample_weight=w)
        proba_test_cfg[:, c] = lr.predict_proba(X_test_s)[:, 1]
    proba_test_accum += proba_test_cfg
proba_test = proba_test_accum / len(C_LIST)

# Apply temperature scaling
logits_test = safe_logit(proba_test)
for c in range(n_classes):
    logits_test[:, c] = logits_test[:, c] / max(temps[c], 1e-6)
proba_test_cal = sigmoid(logits_test)

# 6) Save submission and metrics
sub = pd.DataFrame(proba_test_cal, columns=class_names)
sub.insert(0, 'fname', df_ss['fname'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (C3D ensemble + temp scaling). Shape:', sub.shape)

metrics = {
    'stage': 'C3D_ensemble_calibration_mc5',
    'teacher_oof_lwlrap': float(teacher_oof_lw),
    'student_oof_lwlrap_ensemble': float(oof_lw_ens),
    'per_fold_student_ens': [float(x) for x in fold_scores],
    'C_list': C_LIST,
    'selected_noisy': int(len(selected_idx)),
    'temps_min': float(temps.min()),
    'temps_max': float(temps.max())
}
with open('metrics_c3d_curated.json', 'w') as f:
    json.dump(metrics, f)
print('Saved metrics_c3d_curated.json:', metrics)


Teacher OOF (MC5) rebuilt: 0.8049
Scoring noisy with teacher ensemble ...


Selected noisy: 4506 (22.7%)


Fold 0 ensemble LWLRAP=0.8230


Fold 1 ensemble LWLRAP=0.8116


Fold 2 ensemble LWLRAP=0.8236


Fold 3 ensemble LWLRAP=0.8228


Fold 4 ensemble LWLRAP=0.7925
OOF LWLRAP ensemble (MC5 student, 3 heads) = 0.8147; per-fold=[0.8230405728471376, 0.8115917422894654, 0.8235739410416323, 0.8228305234799349, 0.7924861952710641]


Calibrated per-class temperatures (min/max): 1.100000023841858 2.0
Saved temps_c3d.json


Saved submission.csv (C3D ensemble + temp scaling). Shape: (3361, 81)
Saved metrics_c3d_curated.json: {'stage': 'C3D_ensemble_calibration_mc5', 'teacher_oof_lwlrap': 0.8048759167920174, 'student_oof_lwlrap_ensemble': 0.814695573093225, 'per_fold_student_ens': [0.8230405728471376, 0.8115917422894654, 0.8235739410416323, 0.8228305234799349, 0.7924861952710641], 'C_list': [1.0, 2.0, 4.0], 'selected_noisy': 4506, 'temps_min': 1.100000023841858, 'temps_max': 2.0}
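The C3D cell reports OOF LWLRAP before temperature scaling only. Because LWLRAP ranks classes within each clip, a single shared temperature leaves it unchanged, while per-class temperatures can reorder classes. The toy sketch below illustrates this in plain Python so it runs without numpy. The data and the helper names (`lwlrap`, `apply_temps`) are made up for the illustration and are not the notebook's functions.

```python
import math

def lwlrap(truth, scores):
    """LWLRAP for small lists of lists: mean precision-at-hit over all positive labels."""
    total_hits = 0
    precision_sum = 0.0
    for t_row, s_row in zip(truth, scores):
        # Rank classes within the sample by descending score.
        order = sorted(range(len(s_row)), key=lambda j: -s_row[j])
        hits = 0
        for rank, j in enumerate(order, start=1):
            if t_row[j] > 0:
                hits += 1
                precision_sum += hits / rank
                total_hits += 1
    return precision_sum / max(total_hits, 1)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def apply_temps(logits, temps):
    """Divide each class logit by that class's temperature, then squash."""
    return [[sigmoid(z / t) for z, t in zip(row, temps)] for row in logits]

# Toy data (hypothetical): two clips, two classes.
truth = [[1, 0], [0, 1]]
logits = [[2.0, 1.5], [-1.0, 0.5]]

before = lwlrap(truth, apply_temps(logits, [1.0, 1.0]))
shared = lwlrap(truth, apply_temps(logits, [2.0, 2.0]))     # same tau for all classes
per_class = lwlrap(truth, apply_temps(logits, [2.0, 1.0]))  # different tau per class

print(before, shared, per_class)
```

In this toy case the shared temperature keeps the score at 1.0 while the per-class temperatures drop it to 0.75. It may therefore be worth recomputing LWLRAP on the calibrated OOF predictions before trusting a calibrated submission.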


In [41]:
# C4 Phase 2 — Step 1: Mandatory duration diagnostics (curated vs test)
import numpy as np
import pandas as pd
from pathlib import Path
import soundfile as sf

BASE = Path('.')
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
cur_paths = [str(BASE / 'train_curated' / f) for f in df_cur['fname'].values]
test_paths = [str(BASE / 'test' / f) for f in df_ss['fname'].values]

def fast_duration(p):
    try:
        info = sf.info(p)
        return info.frames / max(info.samplerate, 1)
    except Exception:
        return np.nan

def describe_durations(paths, name):
    durs = np.array([fast_duration(p) for p in paths], dtype=float)
    durs = durs[np.isfinite(durs)]
    if durs.size == 0:
        print(f"{name}: no valid durations found")
        return
    def pct(a, q):
        return float(np.percentile(a, q))
    print(f"{name} durations (n={len(durs)}): min={durs.min():.3f}s, mean={durs.mean():.3f}s, p50={pct(durs,50):.3f}s, p75={pct(durs,75):.3f}s, p90={pct(durs,90):.3f}s, p95={pct(durs,95):.3f}s, p99={pct(durs,99):.3f}s, max={durs.max():.3f}s")

print('Duration diagnostics — curated vs test\n')
describe_durations(cur_paths, 'Curated')
describe_durations(test_paths, 'Test')
print('\nImplications: If test median >> curated median or tail is longer, prioritize multi-crop TTA (MC7) and robust handling of long clips. If test median << curated, ensure pad/loop policies are consistent.')


Duration diagnostics — curated vs test



Curated durations (n=4970): min=0.300s, mean=7.648s, p50=4.676s, p75=11.146s, p90=20.349s, p95=24.497s, p99=29.000s, max=57.571s


Test durations (n=3361): min=0.320s, mean=10.489s, p50=8.160s, p75=16.160s, p90=23.700s, p95=27.000s, p99=29.580s, max=30.000s

Implications: If test median >> curated median or tail is longer, prioritize multi-crop TTA (MC7) and robust handling of long clips. If test median << curated, ensure pad/loop policies are consistent.
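To relate these percentiles to the 7-crop schedule in the next cell (begin, evenly spaced interior points, end), here is a small sketch of the crop start times in seconds. The helper `mc_starts_sec` is hypothetical; the notebook's own function works in samples and rounds to integers.

```python
def mc_starts_sec(duration, crop=10.0, k=7):
    """Start times (seconds) of k evenly spaced crops of length `crop` in a clip of `duration` seconds."""
    if duration <= crop:
        # Clip fits in one window: every crop starts at 0 (the clip is padded instead).
        return [0.0] * k
    span = duration - crop
    return [round(span * i / (k - 1), 3) for i in range(k)]

# Durations taken from the test-set diagnostics above: p50, p75, p90, max.
for label, dur in [("p50", 8.160), ("p75", 16.160), ("p90", 23.700), ("max", 30.000)]:
    starts = mc_starts_sec(dur)
    stride = starts[1] - starts[0]
    print(f"{label}: duration={dur:.2f}s stride={stride:.3f}s starts={starts}")
```

Since the test median (8.16 s) is below the 10 s window, at least half of the test clips yield seven identical padded crops under this schedule. Multi-crop averaging only adds information for the longer clips, where neighbouring crops still overlap heavily (a stride of about 3.3 s at 30 s).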


In [42]:
# C4 Phase 2 — Step 2: Strict-parity MC7 embeddings for curated and test (begin, 16%, 33%, 50%, 66%, 83%, end) with batched inference
import time
from pathlib import Path
import numpy as np
import pandas as pd
import librosa, torch

BASE = Path('.')
SR = 32000
T_SEC = 10.0
T = int(SR * T_SEC)
K = 7
EMB_DIM = 2048
CKPT_PATH = Path('/app/panns_data/Cnn14_mAP=0.431.pth')
assert CKPT_PATH.exists(), 'CNN14 checkpoint missing; prepare assets first.'

df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
train_files = df_cur['fname'].values
test_files  = df_ss['fname'].values

def load_audio(path, sr=SR):
    y, s = librosa.load(path, sr=sr, mono=True)
    return y.astype(np.float32)

def mc7_starts(L: int, T: int):
    if L <= T:
        return [0]*7
    span = max(L - T, 1)
    # begin, 1/6, 2/6, center, 4/6, 5/6, end, as fractions of the usable span L - T
    ratios = [0.0, 1.0/6.0, 2.0/6.0, 0.5, 4.0/6.0, 5.0/6.0, 1.0]
    starts = [int(round(r * span)) for r in ratios]
    return [max(0, min(s, L - T)) for s in starts]

def crops_for_wave(y: np.ndarray, T: int, K: int = 7):
    L = len(y)
    starts = mc7_starts(L, T)
    if L >= T:
        crops = [y[s:s+T] for s in starts]
    else:
        pad = np.pad(y, (0, T - L))
        crops = [pad for _ in range(K)]
    return np.stack(crops, 0)

def extract_mc7_embeddings(file_list, root_dir, batch_size=16, log_every=200):
    from panns_inference import AudioTagging
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Device: {device.upper()} | Batch size: {batch_size} | K=7')
    at = AudioTagging(checkpoint_path=str(CKPT_PATH), device=device)
    N = len(file_list)
    X = np.zeros((N, EMB_DIM), dtype=np.float32)
    t0 = time.time()
    for start_idx in range(0, N, batch_size):
        end_idx = min(start_idx + batch_size, N)
        batch_files = file_list[start_idx:end_idx]
        batch_crops = []  # (B*K, T)
        for fname in batch_files:
            y = load_audio(str(Path(root_dir) / fname), sr=SR)
            ck = crops_for_wave(y, T, 7)
            batch_crops.append(ck)
        batch_crops = np.concatenate(batch_crops, axis=0)
        with torch.no_grad():
            out = at.inference(batch_crops)
        if isinstance(out, tuple) and len(out)==2:
            embs = np.asarray(out[1], dtype=np.float32)
        elif isinstance(out, dict) and 'embedding' in out:
            embs = np.asarray(out['embedding'], dtype=np.float32)
        else:
            raise RuntimeError('Unexpected AudioTagging output type')
        B = end_idx - start_idx
        embs = embs.reshape(B, 7, EMB_DIM).mean(axis=1)
        X[start_idx:end_idx] = embs
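        # Logs only when end_idx is an exact multiple of log_every (or at the end), so with
        # batch_size=64 and log_every=200 progress appears every 1600 files.
        if (end_idx % log_every == 0) or (end_idx == N):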
            dt = time.time() - t0
            print(f'  {end_idx}/{N} in {dt/60:.1f} min')
    return X

cur_path = BASE / 'embeddings_curated_mc7.npy'
tst_path = BASE / 'embeddings_test_mc7.npy'

if cur_path.exists():
    X_cur_mc7 = np.load(cur_path)
    print('Loaded cached embeddings_curated_mc7.npy')
else:
    bs = 16 if torch.cuda.is_available() else 64
    print('Extracting curated MC7 embeddings ...')
    X_cur_mc7 = extract_mc7_embeddings(train_files, BASE/'train_curated', batch_size=bs, log_every=200)
    np.save(cur_path, X_cur_mc7)
    print('Saved embeddings_curated_mc7.npy')

if tst_path.exists():
    X_test_mc7 = np.load(tst_path)
    print('Loaded cached embeddings_test_mc7.npy')
else:
    bs = 16 if torch.cuda.is_available() else 64
    print('Extracting test MC7 embeddings ...')
    X_test_mc7 = extract_mc7_embeddings(test_files, BASE/'test', batch_size=bs, log_every=256)
    np.save(tst_path, X_test_mc7)
    print('Saved embeddings_test_mc7.npy')

print('MC7 extraction complete. Shapes:', X_cur_mc7.shape, X_test_mc7.shape)


Extracting curated MC7 embeddings ...
Device: CPU | Batch size: 64 | K=7
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth


Using CPU.


  1600/4970 in 15.4 min


  3200/4970 in 31.7 min


  4800/4970 in 49.1 min


  4970/4970 in 50.8 min
Saved embeddings_curated_mc7.npy
Extracting test MC7 embeddings ...
Device: CPU | Batch size: 64 | K=7
Checkpoint path: /app/panns_data/Cnn14_mAP=0.431.pth


Using CPU.


  256/3361 in 2.5 min


  512/3361 in 4.4 min


  768/3361 in 6.3 min


  1024/3361 in 8.3 min


  1280/3361 in 10.2 min


  1536/3361 in 12.2 min


  1792/3361 in 15.2 min


  2048/3361 in 18.2 min


  2304/3361 in 21.2 min


  2560/3361 in 23.8 min


  2816/3361 in 26.4 min


  3072/3361 in 28.4 min


  3328/3361 in 30.5 min


  3361/3361 in 30.8 min
Saved embeddings_test_mc7.npy
MC7 extraction complete. Shapes: (4970, 2048) (3361, 2048)
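The curated log above prints progress only at 1600, 3200 and 4800 files even though `log_every=200`. That follows from the exact-multiple check combined with a batch size of 64. The sketch below reproduces the logged positions and shows one possible alternative that logs whenever a multiple of `log_every` is crossed. Both helper functions are hypothetical and only simulate the loop indices.

```python
def logged_at(n, batch_size, log_every):
    """Positions where the notebook's check (end % log_every == 0 or end == n) fires."""
    out = []
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        if end % log_every == 0 or end == n:
            out.append(end)
    return out

def logged_at_crossing(n, batch_size, log_every):
    """Alternative: fire whenever a batch reaches or passes the next multiple of log_every."""
    out = []
    next_mark = log_every
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        if end >= next_mark or end == n:
            out.append(end)
            while next_mark <= end:
                next_mark += log_every
    return out

print(logged_at(4970, 64, 200))               # curated run settings
print(logged_at(3361, 64, 256)[:3])           # test run settings
print(logged_at_crossing(4970, 64, 200)[:3])  # crossing-based variant
```

With `log_every=256` the batch size divides the interval, which is why the test run logs every 256 files as intended.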


In [None]:
# C4 Phase 2 — Step 3–5: Curated-only MC7 Ensemble (LR grid, PCA+LR, kNN) + Rank Blending + Per-class Temp Scaling -> submission.csv
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
import json

BASE = Path('.')
df_cur = pd.read_csv(BASE / 'train_curated_folds.csv')
df_ss  = pd.read_csv(BASE / 'sample_submission.csv')
class_names = [c for c in df_ss.columns if c != 'fname']
n_classes = len(class_names)
test_files = df_ss['fname'].values

X_cur = np.load('embeddings_curated_mc7.npy')
X_test = np.load('embeddings_test_mc7.npy')
assert X_cur.ndim==2 and X_test.ndim==2 and X_cur.shape[1]==X_test.shape[1]==2048

def parse_labels_str(s):
    if not isinstance(s, str):
        return []
    return [t.strip() for t in s.replace(';', ',').split(',') if t.strip()]

label_to_idx = {c:i for i,c in enumerate(class_names)}
def encode_labels(s):
    y = np.zeros(n_classes, dtype=np.float32)
    for t in parse_labels_str(s):
        if t in label_to_idx:
            y[label_to_idx[t]] = 1.0
    return y
Y_cur = np.stack(df_cur['labels'].apply(encode_labels).values).astype(np.float32)

def lwlrap_np(truth, scores):
    assert truth.shape == scores.shape
    n_samples, n_labels = truth.shape
    precisions = np.zeros(n_labels)
    labels_per_class = np.maximum(truth.sum(axis=0), 1)
    for i in range(n_samples):
        pos = np.where(truth[i] > 0)[0]
        if pos.size == 0:
            continue
        ranking = np.argsort(-scores[i])
        ranked_truth = truth[i][ranking]
        cumsum = np.cumsum(ranked_truth)
        pos_rank = np.where(ranked_truth > 0)[0]
        prec = cumsum[pos_rank] / (pos_rank + 1)
        ranked_labels = ranking[pos_rank]
        for lbl, p in zip(ranked_labels, prec):
            precisions[lbl] += p
    per_class = precisions / labels_per_class
    weights = truth.sum(axis=0) / max(truth.sum(), 1)
    return float((per_class * weights).sum()), per_class

def ranks_0_1(x):
    # x: (N,) vector, return ranks in [0,1]
    order = np.argsort(x)
    ranks = np.empty_like(order, dtype=np.float32)
    ranks[order] = np.arange(len(x), dtype=np.float32)
    if len(x) > 1:
        ranks /= (len(x) - 1)
    else:
        ranks[:] = 0.0
    return ranks

def safe_logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
def bce_loss(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

# Head definitions
LR_C_GRID = [0.5, 1.0, 2.0, 4.0, 8.0]
SEEDS = [42, 2025, 7]
PCA_DIMS = [512, 1024]
KNN_K = [50, 100]

heads_info = []  # list of dicts: {name, oof, test, oof_lw}

folds = df_cur['fold'].values.astype(int)
N_FOLDS = np.unique(folds).size

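# Utility: fit a per-class (one-vs-rest) LR head with StandardScaler on fold
# NOTE: in scikit-learn, random_state only affects the 'sag', 'saga' and 'liblinear' solvers,
# so with solver='lbfgs' heads that differ only by seed are expected to give identical predictions.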
def fit_ovr_lr(X_tr, y_tr, X_va, X_te, C=2.0, seed=42):
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    X_te_s = scaler.transform(X_te)
    proba_va = np.zeros((X_va.shape[0], n_classes), dtype=np.float32)
    proba_te = np.zeros((X_te.shape[0], n_classes), dtype=np.float32)
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=C, n_jobs=1, random_state=seed)
        lr.fit(X_tr_s, y_tr[:, c])
        proba_va[:, c] = lr.predict_proba(X_va_s)[:, 1]
        proba_te[:, c] = lr.predict_proba(X_te_s)[:, 1]
    return proba_va, proba_te

def fit_pca_lr(X_tr, y_tr, X_va, X_te, n_comp=512, C=2.0, seed=42):
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    X_te_s = scaler.transform(X_te)
    pca = PCA(n_components=n_comp, svd_solver='auto', random_state=seed)
    X_tr_p = pca.fit_transform(X_tr_s)
    X_va_p = pca.transform(X_va_s)
    X_te_p = pca.transform(X_te_s)
    proba_va = np.zeros((X_va.shape[0], n_classes), dtype=np.float32)
    proba_te = np.zeros((X_te.shape[0], n_classes), dtype=np.float32)
    for c in range(n_classes):
        lr = LogisticRegression(solver='lbfgs', max_iter=1000, C=C, n_jobs=1, random_state=seed)
        lr.fit(X_tr_p, y_tr[:, c])
        proba_va[:, c] = lr.predict_proba(X_va_p)[:, 1]
        proba_te[:, c] = lr.predict_proba(X_te_p)[:, 1]
    return proba_va, proba_te

def fit_knn(X_tr, y_tr, X_va, X_te, k=50):
    # Multioutput KNN classification with cosine distance; use probability via distance weights
    scaler = StandardScaler(with_mean=True, with_std=True)
    X_tr_s = scaler.fit_transform(X_tr)
    X_va_s = scaler.transform(X_va)
    X_te_s = scaler.transform(X_te)
    knn = KNeighborsClassifier(n_neighbors=k, metric='cosine', weights='distance', n_jobs=1)
    knn.fit(X_tr_s, y_tr)
    # predict_proba returns list of arrays (n_classes long), each (N, 2)
    def knn_proba(X):
        probs_list = knn.predict_proba(X)
        # Convert list to (N, C) of positive class probabilities
        out = np.zeros((X.shape[0], n_classes), dtype=np.float32)
        for c, pc in enumerate(probs_list):
            # pc shape (N, n_classes_of_target_c). For binary, columns correspond to [class0, class1]
            if pc.ndim == 2 and pc.shape[1] == 2:
                out[:, c] = pc[:, 1].astype(np.float32)
            else:
                # fallback: only one label value was seen in training for this class, so
                # predict_proba has a single column; use hard predictions instead
                out[:, c] = (knn.predict(X)[:, c] > 0.5).astype(np.float32)
        return out
    return knn_proba(X_va_s), knn_proba(X_te_s)

def run_head(name, fit_fn, *args, **kwargs):
    oof = np.zeros((len(df_cur), n_classes), dtype=np.float32)
    te = np.zeros((len(X_test), n_classes), dtype=np.float32)
    for k in range(N_FOLDS):
        trn_idx = np.where(folds != k)[0]
        val_idx = np.where(folds == k)[0]
        X_tr, y_tr = X_cur[trn_idx], Y_cur[trn_idx]
        X_va, y_va = X_cur[val_idx], Y_cur[val_idx]
        proba_va, proba_te = fit_fn(X_tr, y_tr, X_va, X_test, *args, **kwargs)
        oof[val_idx] = proba_va.astype(np.float32)
        te += proba_te.astype(np.float32)
    te /= N_FOLDS
    oof_lw, _ = lwlrap_np(Y_cur, oof)
    heads_info.append({'name': name, 'oof': oof, 'test': te, 'oof_lw': float(oof_lw)})
    print(f"Head {name}: OOF LWLRAP={oof_lw:.4f}")

print('Training curated-only heads (leak-proof per fold) ...')
# LR heads grid
for C in LR_C_GRID:
    for seed in SEEDS:
        run_head(f'LR_C{C}_S{seed}', fit_ovr_lr, C, seed)

# PCA+LR heads (reduced C grid for speed)
for n_comp in PCA_DIMS:
    for C in [1.0, 2.0]:
        for seed in SEEDS:
            run_head(f'PCA{n_comp}_LR_C{C}_S{seed}', fit_pca_lr, n_comp, C, seed)

# kNN heads
for k in KNN_K:
    run_head(f'kNN_k{k}', fit_knn, k)

# Rank blending with weights proportional to head OOF LWLRAP
weights = np.array([h['oof_lw'] for h in heads_info], dtype=np.float64)
w_sum = weights.sum()
assert w_sum > 0

def blend_by_rank(heads, split='oof'):
    # Weighted average of per-class rank-normalized scores; split='oof' uses OOF, anything else uses test
    key = 'oof' if split == 'oof' else 'test'
    N = heads[0][key].shape[0]
    blended = np.zeros((N, n_classes), dtype=np.float32)
    for c in range(n_classes):
        acc = np.zeros(N, dtype=np.float64)
        for w, h in zip(weights, heads):
            acc += w * ranks_0_1(h[key][:, c])
        blended[:, c] = (acc / w_sum).astype(np.float32)
    return blended

blend_oof = blend_by_rank(heads_info, 'oof')
blend_test = blend_by_rank(heads_info, 'test')
blend_oof_lw, _ = lwlrap_np(Y_cur, blend_oof)
print(f'Blended OOF LWLRAP (rank-avg, weighted) = {blend_oof_lw:.4f}')

# Guardrail
BASELINE = 0.8049  # C3A MC5 OOF, target to exceed here as curated-only MC7
if blend_oof_lw < BASELINE - 1e-6:
    print('WARNING: Blended OOF is below baseline. Investigate before trusting LB.')

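# Per-class Temperature Scaling on blended OOF ranks (treated as probs in [0,1])
# NOTE: different temperatures per class can reorder classes within a clip, so LWLRAP after
# scaling may differ from blend_oof_lw; the guardrail above checks the unscaled blend only.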
temps = np.ones(n_classes, dtype=np.float32)
for c in range(n_classes):
    y = Y_cur[:, c]
    p = blend_oof[:, c]
    z = safe_logit(p)
    best_t, best_loss = 1.0, bce_loss(y, p)
    for tau in np.linspace(0.5, 2.0, 16):
        p_cal = sigmoid(z / tau)
        loss = bce_loss(y, p_cal)
        if loss < best_loss:
            best_loss, best_t = loss, float(tau)
    temps[c] = best_t
print('Temperature scaling range (min/max):', float(temps.min()), float(temps.max()))

# Apply temps to test blended ranks
logits_test = safe_logit(blend_test)
for c in range(n_classes):
    logits_test[:, c] = logits_test[:, c] / max(temps[c], 1e-6)
proba_test_final = sigmoid(logits_test).astype(np.float32)

# Save submission and metrics
sub = pd.DataFrame(proba_test_final, columns=class_names)
sub.insert(0, 'fname', test_files)
sub.to_csv('submission.csv', index=False)
metrics = {
    'stage': 'C4_curated_only_mc7_ensemble_rankblend_temp',
    'n_heads': len(heads_info),
    'heads': [{'name': h['name'], 'oof_lw': float(h['oof_lw'])} for h in heads_info],
    'blended_oof_lw': float(blend_oof_lw),
    'temps_min': float(temps.min()), 'temps_max': float(temps.max())
}
with open('metrics_c4_curated_mc7.json', 'w') as f:
    json.dump(metrics, f)
print('Saved submission.csv and metrics_c4_curated_mc7.json. Heads used:', len(heads_info))
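The blend above weights every head by its OOF LWLRAP, so the number of heads per family matters. A minimal sketch of the head count implied by the grids in this cell follows. The name `PCA_C_GRID` is hypothetical (the cell uses an inline list). Collapsing LR heads across seeds rests on the assumption that `solver='lbfgs'` is unaffected by `random_state`.

```python
from itertools import product

# Grids copied from the cell above.
LR_C_GRID = [0.5, 1.0, 2.0, 4.0, 8.0]
SEEDS = [42, 2025, 7]
PCA_DIMS = [512, 1024]
PCA_C_GRID = [1.0, 2.0]  # hypothetical name for the inline list [1.0, 2.0]
KNN_K = [50, 100]

lr_heads = [f"LR_C{C}_S{s}" for C, s in product(LR_C_GRID, SEEDS)]
pca_heads = [f"PCA{n}_LR_C{C}_S{s}" for n, C, s in product(PCA_DIMS, PCA_C_GRID, SEEDS)]
knn_heads = [f"kNN_k{k}" for k in KNN_K]
all_heads = lr_heads + pca_heads + knn_heads

# Assumption: with lbfgs the seed does not change the fit, so LR heads collapse to one per C.
distinct_lr = sorted({name.rsplit("_S", 1)[0] for name in lr_heads})

print("heads (total, LR, PCA+LR, kNN):", len(all_heads), len(lr_heads), len(pca_heads), len(knn_heads))
print("distinct LR configs:", distinct_lr)
print(f"LR share of heads: {len(lr_heads) / len(all_heads):.1%}")
```

If head scores turn out to be close to each other, the plain LR family would carry about half of the blend weight while contributing only five distinct models. Dropping the seed loop for LR, or de-duplicating heads before blending, could be a cheap experiment.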
