# Dog Breed Identification — Medal Plan

Goal: Achieve a medal-worthy multi-class log loss on Kaggle's Dog Breed Identification.

Strategy:
- Use a strong pretrained CNN from timm (e.g., convnext_tiny or efficientnetv2_s) with ImageNet weights.
- Single-label, 120 classes; optimize CrossEntropy with label smoothing.
- Mixed precision training (AMP) for speed; AdamW + cosine schedule with warmup.
- Data pipeline: RandomResizedCrop 384, RandAugment/ColorJitter, CutMix/MixUp optional; validation: Resize/CenterCrop.
- Split: Stratified train/val (e.g., 90/10). Start with a single split to get a baseline, then extend to 5-fold if time permits.
- Early stopping on val logloss; checkpoint the best model.
- Inference-time TTA (e.g., 3-5 flipped/resized views) to improve robustness.
- Ensure submission format matches sample_submission (columns are breeds, rows by test image id, probabilities sum to 1 per row).
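
As a reference point for the metric named in the goal, here is a minimal sketch of multi-class log loss in plain Python. The function name `multiclass_log_loss` and the clipping constant `eps` are assumptions for illustration, not taken from the competition's scoring code. It shows that a uniform guess over 120 breeds scores ln(120), which is the number any trained model has to beat.

```python
import math

def multiclass_log_loss(probs, labels, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    probs:  list of rows, each a list of per-class probabilities.
    labels: list of integer class indices, one per row.
    eps:    clipping constant (an assumption, used to avoid log(0)).
    """
    total = 0.0
    for row, y in zip(probs, labels):
        p = row[y] / sum(row)              # renormalize so the row sums to 1
        p = min(max(p, eps), 1.0 - eps)    # clip away from 0 and 1
        total += -math.log(p)
    return total / len(labels)

# A uniform guess over 120 breeds scores ln(120) whatever the true labels are.
uniform = [[1.0 / 120] * 120 for _ in range(3)]
baseline = multiclass_log_loss(uniform, [0, 57, 119])
print(round(baseline, 4))  # 4.7875
```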

Milestones:
1) Environment setup (install torch/cu121, timm, albumentations, opencv).
2) Load labels.csv; build class map and stratified split.
3) Dataset/Dataloader with augmentations.
4) Baseline model training (1-epoch head-only warmup with the backbone frozen, then unfreeze).
5) Validate, tune LR/augmentation if needed.
6) Full-train best config; TTA inference on test; write submission.csv.

Logging/Speed:
- Print fold/epoch progress, timings, and interim val logloss.
- Use num_workers based on CPU; pin_memory; gradient accumulation if needed.
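
To make the gradient-accumulation bullet concrete, the sketch below works out what accumulation does to one epoch. The helper name `accumulation_plan` is hypothetical and the numbers are illustrative. It reports the effective batch size, the number of optimizer steps, and how many trailing batches would be left without a step if the loop only steps on every `accum_steps`-th batch.

```python
def accumulation_plan(n_batches, batch_size, accum_steps):
    """Return (effective_batch_size, optimizer_steps, leftover_batches) for one epoch."""
    effective_bs = batch_size * accum_steps
    optimizer_steps = n_batches // accum_steps
    # Batches after the last full group; their gradients are discarded unless
    # the training loop also steps on the final batch of the epoch.
    leftover = n_batches % accum_steps
    return effective_bs, optimizer_steps, leftover

# Illustrative numbers: 1034 batches of 8 images, stepping every 4 batches.
print(accumulation_plan(1034, 8, 4))  # (32, 258, 2)
```

A non-zero leftover is harmless in size but worth knowing about when a per-step scheduler's total step count is derived from the same arithmetic.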

Next: Set up the environment and verify the GPU, then request expert feedback on model choice/size and augmentation/TTA settings before long runs.

In [19]:
# GPU diagnostic
!nvidia-smi || echo 'nvidia-smi failed: GPU not visible to the runtime'

Failed to initialize NVML: Unknown Error
nvidia-smi failed: GPU not visible to the runtime
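
The same check can be done from Python so that later cells can branch on the result instead of relying on shell output. This is a sketch using only the standard library; `gpu_visible` is a hypothetical helper name, and it assumes that `nvidia-smi -L` prints one line per device starting with `GPU`.

```python
import shutil
import subprocess

def gpu_visible(timeout_s=10):
    """Return True if `nvidia-smi -L` succeeds and lists at least one GPU."""
    exe = shutil.which('nvidia-smi')
    if exe is None:
        return False  # driver tools are not on PATH
    try:
        result = subprocess.run([exe, '-L'], capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired):
        return False
    # An NVML failure like the one above returns a non-zero exit code.
    return result.returncode == 0 and 'GPU' in result.stdout

print('GPU visible:', gpu_visible())
```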


In [2]:
# Environment setup and GPU check
import sys, subprocess, os, time

# Set writable cache dirs BEFORE importing libraries that may download from HF Hub
CACHE_ROOT = os.path.join(os.getcwd(), '.model_cache')
HF_ROOT = os.path.join(os.getcwd(), 'hf_cache')
os.makedirs(CACHE_ROOT, exist_ok=True)
os.makedirs(HF_ROOT, exist_ok=True)
os.environ['TORCH_HOME'] = CACHE_ROOT
os.environ['XDG_CACHE_HOME'] = CACHE_ROOT  # avoid /app/.cache
os.environ['HF_HOME'] = HF_ROOT
os.environ['HF_HUB_CACHE'] = os.path.join(HF_ROOT, 'hub')
os.environ['HUGGINGFACE_HUB_CACHE'] = os.path.join(HF_ROOT, 'hub')
# Reduce CUDA memory fragmentation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

def pip_install(pkgs):
    print(f"Installing: {' '.join(pkgs)}", flush=True)
    subprocess.check_call([sys.executable, '-m', 'pip', 'install', '-q'] + pkgs)

# Force reinstall PyTorch from official cu121 index
print('Reinstalling torch/torchvision from cu121 index (force-reinstall)...', flush=True)
subprocess.check_call([sys.executable, '-m', 'pip', 'install', '--force-reinstall', '-q',
                       'torch==2.4.0', 'torchvision==0.19.0',
                       '--index-url', 'https://download.pytorch.org/whl/cu121'])

# Ensure timm, albumentations, opencv, sklearn, pandas, numpy
pkgs = [
    'timm==1.0.9',
    'albucore==0.0.11',  # ensure compatibility for albumentations 1.4.x
    'albumentations==1.4.8',
    'opencv-python-headless==4.10.0.84',
    'pandas==2.2.2',
    'scikit-learn==1.5.1',
    'numpy==1.26.4'
]
pip_install(pkgs)

# Retry import; if albumentations still fails, attempt a fallback pin
try:
    import torch, torchvision, timm, albumentations as A, cv2, pandas as pd, numpy as np, sklearn
except Exception as e:
    print('Primary import failed, attempting albumentations fallback pin (1.3.1) ...', e, flush=True)
    pip_install(['albumentations==1.3.1'])
    import torch, torchvision, timm, albumentations as A, cv2, pandas as pd, numpy as np, sklearn

print('Versions:',
      f"torch {torch.__version__}",
      f"torchvision {torchvision.__version__}",
      f"timm {timm.__version__}",
      f"albumentations {A.__version__}",
      f"cv2 {cv2.__version__}",
      f"sklearn {sklearn.__version__}")

print('Torch CUDA build:', getattr(torch.version, 'cuda', 'n/a'))
print('torch.cuda.is_available():', torch.cuda.is_available())
try:
    if torch.cuda.is_available():
        print('GPU count:', torch.cuda.device_count())
        for i in range(torch.cuda.device_count()):
            print(f'GPU {i}:', torch.cuda.get_device_name(i))
    else:
        print('No CUDA GPUs detected by PyTorch.')
except Exception as e:
    print('CUDA query error:', e)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device)
if device == 'cuda':
    torch.backends.cudnn.benchmark = True
else:
    print('WARNING: CUDA not available, training will be slow.')

# Quick dataset sanity checks
import glob
train_dir, test_dir = 'train', 'test'
train_imgs = glob.glob(os.path.join(train_dir, '*.jpg'))
test_imgs = glob.glob(os.path.join(test_dir, '*.jpg'))
print(f"Found {len(train_imgs)} train images, {len(test_imgs)} test images")
labels_path = 'labels.csv'
ss_path = 'sample_submission.csv'
assert os.path.exists(labels_path), 'labels.csv not found'
assert os.path.exists(ss_path), 'sample_submission.csv not found'
labels_df = pd.read_csv(labels_path)
ss = pd.read_csv(ss_path)
print('labels_df shape:', labels_df.shape, 'unique breeds:', labels_df['breed'].nunique())
print('sample_submission shape:', ss.shape)
print('First 3 breeds in sample_submission:', list(ss.columns[1:4]))
print('Setup complete.')

Reinstalling torch/torchvision from cu121 index (force-reinstall)...




Installing: timm==1.0.9 albucore==0.0.11 albumentations==1.4.8 opencv-python-headless==4.10.0.84 pandas==2.2.2 scikit-learn==1.5.1 numpy==1.26.4






INFO:albumentations.check_version:A new version of Albumentations is available: 2.0.8 (you have 1.4.8). Upgrade using: pip install --upgrade albumentations


Versions: torch 2.4.0+cu121 torchvision 0.19.0+cu121 timm 1.0.9 albumentations 1.4.8 cv2 4.10.0 sklearn 1.5.1
Torch CUDA build: 12.1
torch.cuda.is_available(): False
No CUDA GPUs detected by PyTorch.
Device: cpu
Found 9199 train images, 1023 test images
labels_df shape: (9199, 2) unique breeds: 120
sample_submission shape: (1023, 121)
First 3 breeds in sample_submission: ['affenpinscher', 'afghan_hound', 'african_hunting_dog']
Setup complete.


In [3]:
# Data prep: stratified split, datasets, dataloaders, transforms
import os, glob, random, math, time
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold, train_test_split
import torch
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)

# Build class mapping from the sample_submission column order so predictions align with the submission format
breeds = list(pd.read_csv('sample_submission.csv').columns[1:])
breed_to_idx = {b:i for i,b in enumerate(breeds)}
num_classes = len(breeds)
print('Num classes:', num_classes)

# Merge labels with paths and map to indices
labels_df = pd.read_csv('labels.csv')
labels_df['filepath'] = labels_df['id'].apply(lambda x: os.path.join('train', f'{x}.jpg'))
labels_df['label'] = labels_df['breed'].map(breed_to_idx)
assert labels_df['label'].notnull().all(), 'Found breeds not in sample_submission columns'

# Stratified split 90/10
train_df, val_df = train_test_split(labels_df, test_size=0.1, stratify=labels_df['label'], random_state=SEED)
print('Train/Val sizes:', len(train_df), len(val_df))

# Transforms (CPU-friendly, simple)
IMG_SIZE = 224
imagenet_mean = (0.485, 0.456, 0.406)
imagenet_std = (0.229, 0.224, 0.225)

train_tfms = A.Compose([
    A.RandomResizedCrop(IMG_SIZE, IMG_SIZE, scale=(0.8, 1.0), ratio=(0.75, 1.33), p=1.0),
    A.HorizontalFlip(p=0.5),
    A.Normalize(mean=imagenet_mean, std=imagenet_std),
    ToTensorV2(),
])

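# Note: Resize already outputs IMG_SIZE x IMG_SIZE, so the CenterCrop below is a no-op at these settings
val_tfms = A.Compose([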
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.CenterCrop(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=imagenet_mean, std=imagenet_std),
    ToTensorV2(),
])

class DogDataset(Dataset):
    def __init__(self, df, augment, is_train=True):
        self.df = df.reset_index(drop=True)
        self.augment = augment
        self.is_train = is_train
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = cv2.imread(row.filepath)
        if img is None:
            raise FileNotFoundError(row.filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        aug = self.augment(image=img)
        img_t = aug['image']
        if self.is_train:
            label = int(row.label)
            return img_t, label
        else:
            return img_t, row.id

# Datasets (is_train=True makes the dataset return labels, which validation also needs)
ds_train = DogDataset(train_df, train_tfms, is_train=True)
ds_val = DogDataset(val_df, val_tfms, is_train=True)

# Dataloaders (CPU-safe settings)
BATCH_SIZE = 8
NUM_WORKERS = 0
train_loader = DataLoader(
    ds_train, batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS, pin_memory=False, drop_last=True
)
val_loader = DataLoader(
    ds_val, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=False
)
print('DataLoaders ready:', len(train_loader), 'train batches,', len(val_loader), 'val batches')

# Prepare test df and loader
test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(glob.glob(os.path.join('test','*.jpg')))]
test_df = pd.DataFrame({'id': test_ids})
test_df['filepath'] = test_df['id'].apply(lambda x: os.path.join('test', f'{x}.jpg'))
ds_test = DogDataset(test_df, val_tfms, is_train=False)
test_loader = DataLoader(
    ds_test, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=False
)
print('Test batches:', len(test_loader))

Num classes: 120
Train/Val sizes: 8279 920
DataLoaders ready: 1034 train batches, 115 val batches
Test batches: 128
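
The sizes printed above can be reproduced by hand, which is a quick way to confirm that the split and the `drop_last` settings behave as intended. The sketch below is plain arithmetic; `n_batches` is a hypothetical helper, and the hold-out size assumes the 10% fraction is rounded up.

```python
import math

def n_batches(n_samples, batch_size, drop_last):
    """Number of batches a DataLoader yields over a map-style dataset."""
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

n_total = 9199
n_val = math.ceil(0.1 * n_total)   # assumed rounding of the 10% hold-out
n_train = n_total - n_val
print(n_train, n_val)                           # 8279 920
print(n_batches(n_train, 8, drop_last=True))    # 1034
print(n_batches(n_val, 8, drop_last=False))     # 115
print(n_batches(1023, 8, drop_last=False))      # 128
```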


In [None]:
# Model, training loop with AMP, Mixup, EMA, early stopping
import math, time, os, sys, importlib
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
from timm.utils import ModelEmaV2
from timm.data.mixup import Mixup
from timm.loss import SoftTargetCrossEntropy, LabelSmoothingCrossEntropy
from torchvision.models import efficientnet_v2_s, EfficientNet_V2_S_Weights

# Ensure cache dirs are writable for any downloads
CACHE_ROOT = os.path.join(os.getcwd(), '.model_cache')
os.makedirs(CACHE_ROOT, exist_ok=True)
os.environ['TORCH_HOME'] = CACHE_ROOT
os.environ['XDG_CACHE_HOME'] = CACHE_ROOT

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Create torchvision model (avoid HF Hub read-only cache issue)
weights = EfficientNet_V2_S_Weights.IMAGENET1K_V1
model = efficientnet_v2_s(weights=weights)
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, num_classes)
model = model.to(device)
model = model.to(memory_format=torch.channels_last)

# Mixup/CutMix
mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=1.0, cutmix_minmax=None, prob=1.0, switch_prob=0.5, mode='batch', label_smoothing=0.0, num_classes=num_classes)
use_mixup = True
criterion = SoftTargetCrossEntropy() if use_mixup else LabelSmoothingCrossEntropy(smoothing=0.1)

# Optimizer with weight decay exclusions (by parameter shape/name to avoid tensor comparisons)
lr = 5e-4
weight_decay = 0.01
decay, no_decay = [], []
for name, p in model.named_parameters():
    if not p.requires_grad:
        continue
    if p.ndim == 1 or name.endswith('.bias'):
        no_decay.append(p)
    else:
        decay.append(p)
param_groups = [
    {'params': decay, 'weight_decay': weight_decay},
    {'params': no_decay, 'weight_decay': 0.0},
]
optimizer = torch.optim.AdamW(param_groups, lr=lr, betas=(0.9, 0.999))

# Scheduler: linear warmup is set manually at the start of each warmup epoch; cosine annealing covers the remaining epochs
epochs = 20
warmup_epochs = 2
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs, eta_min=1e-6)

# EMA (keep EMA on same device as model to avoid dtype/device mismatches)
ema_decay = 0.9996
model_ema = ModelEmaV2(model, decay=ema_decay)  # no explicit device arg
# ensure EMA module is on same device and memory format
if getattr(model_ema, 'module', None) is not None:
    model_ema.module.to(device)
    model_ema.module.to(memory_format=torch.channels_last)

# Grad scaler for AMP
scaler = GradScaler(enabled=True)

# Train utils
def train_one_epoch(epoch, freeze_backbone=False, accum_steps=4):
    model.train()
    if freeze_backbone:
        for n,p in model.named_parameters():
            if 'classifier' in n:
                p.requires_grad = True
            else:
                p.requires_grad = False
    else:
        for p in model.parameters():
            p.requires_grad = True
    running_loss = 0.0
    n_samples = 0
    start = time.time()
    optimizer.zero_grad(set_to_none=True)
    for i, (images, targets) in enumerate(train_loader):
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        targets = targets.to(device, non_blocking=True)
        if use_mixup:
            images, targets = mixup_fn(images, targets)
        with autocast(enabled=True):
            outputs = model(images)
            loss = criterion(outputs, targets) / accum_steps
        scaler.scale(loss).backward()
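        # Step on every accum_steps-th batch and on the final batch, so trailing
        # gradients are not discarded when len(train_loader) % accum_steps != 0
        if (i + 1) % accum_steps == 0 or (i + 1) == len(train_loader):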
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            if model_ema is not None:
                model_ema.update(model)
        running_loss += loss.item() * accum_steps * images.size(0)
        n_samples += images.size(0)
        if (i+1) % 50 == 0:
            elapsed = time.time() - start
            print(f"Epoch {epoch} [{i+1}/{len(train_loader)}] loss={running_loss/max(n_samples,1):.4f} time={elapsed:.1f}s", flush=True)
    return running_loss / max(n_samples,1)

def _get_ema_module():
    return getattr(model_ema, 'module', None)

def evaluate(use_ema=False):
    model_to_eval = _get_ema_module() if (use_ema and model_ema is not None) else model
    # ensure EMA is on the correct device for eval
    if use_ema and model_to_eval is not None and device.type == 'cuda':
        model_to_eval.to(device)
        model_to_eval.to(memory_format=torch.channels_last)
    model_to_eval.eval()
    total_loss = 0.0
    n_samples = 0
    with torch.no_grad():
        for images, targets in val_loader:
            images = images.to(device, non_blocking=True, memory_format=torch.channels_last).float()
            targets = targets.to(device, non_blocking=True)
            outputs = model_to_eval(images)
            loss = F.cross_entropy(outputs, targets)
            total_loss += loss.item() * images.size(0)
            n_samples += images.size(0)
    return total_loss / max(n_samples,1)

# Training loop with early stopping
best_loss = float('inf')
best_ema_loss = float('inf')
patience = 4
no_improve = 0
save_dir = 'checkpoints'
os.makedirs(save_dir, exist_ok=True)

print('Starting training...', flush=True)
t0 = time.time()
for epoch in range(epochs):
    # Set LR at the BEGINNING of epoch (manual linear warmup across warmup_epochs)
    if epoch < warmup_epochs:
        warm_factor = (epoch + 1) / max(1, warmup_epochs)
        for pg in optimizer.param_groups:
            pg['lr'] = lr * warm_factor
    else:
        # keep scheduler-managed LR for cosine phase
        pass

    freeze = (epoch == 0)  # head-only warmup for first epoch
    train_loss = train_one_epoch(epoch, freeze_backbone=freeze, accum_steps=4)
    val_loss = evaluate(use_ema=False)
    val_loss_ema = evaluate(use_ema=True)
    cur_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: train_loss={train_loss:.4f} val_loss={val_loss:.5f} val_loss_ema={val_loss_ema:.5f} lr={cur_lr:.6f} elapsed={(time.time()-t0)/60:.1f}m", flush=True)
    improved = False
    if val_loss < best_loss - 1e-5:
        best_loss = val_loss
        torch.save({'model': model.state_dict(), 'epoch': epoch, 'val_loss': best_loss}, os.path.join(save_dir, 'best.pth'))
        improved = True
    if val_loss_ema < best_ema_loss - 1e-5:
        best_ema_loss = val_loss_ema
        ema_module = _get_ema_module()
        if ema_module is not None:
            cpu_state = {k: v.cpu() for k, v in ema_module.state_dict().items()}
            torch.save({'model': cpu_state, 'epoch': epoch, 'val_loss': best_ema_loss}, os.path.join(save_dir, 'best_ema.pth'))
        improved = True
    if not improved:
        no_improve += 1
    else:
        no_improve = 0
    # Step cosine scheduler AFTER warmup epochs
    if epoch >= warmup_epochs:
        scheduler.step()
    if no_improve >= patience:
        print('Early stopping triggered.', flush=True)
        break
print('Training complete. Best val loss:', best_loss, 'Best EMA val loss:', best_ema_loss)
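
The learning-rate schedule in the cell above is split between a manual warmup and `CosineAnnealingLR`, which makes it hard to read off the value at a given epoch. The sketch below reproduces it in closed form, assuming the scheduler follows the standard cosine expression and that the learning rate is not modified elsewhere. `lr_at_epoch` is a hypothetical helper whose defaults mirror the cell's settings.

```python
import math

def lr_at_epoch(epoch, base_lr=5e-4, warmup_epochs=2, epochs=20, eta_min=1e-6):
    """Learning rate used during `epoch` (0-based) under warmup + cosine decay."""
    if epoch < warmup_epochs:
        # Manual linear warmup: lr * (epoch + 1) / warmup_epochs
        return base_lr * (epoch + 1) / max(1, warmup_epochs)
    t = epoch - warmup_epochs        # scheduler steps taken so far
    t_max = epochs - warmup_epochs   # T_max passed to CosineAnnealingLR
    return eta_min + (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max)) / 2

for e in (0, 1, 2, 10, 19):
    print(e, f"{lr_at_epoch(e):.6f}")
```

Under these settings the first epoch, which is also the head-only epoch, runs at half the base rate, and the peak rate is held for two epochs (1 and 2) before the decay starts.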

In [4]:
# Inference: load best checkpoint from timm convnext_tiny run, TTA (orig + hflip), temperature scaling, create submission.csv
import os, numpy as np, pandas as pd, torch, torch.nn.functional as F
import timm

@torch.no_grad()
def predict_logits_tta(model, loader, device, use_hflip=True):
    model.eval()
    all_logits, all_ids = [], []
    for batch in loader:
        if len(batch) == 2 and isinstance(batch[1], torch.Tensor):
            images, _ = batch
            ids = None
        else:
            images, ids = batch
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        logits = model(images)
        if use_hflip:
            logits_flip = model(images.flip(-1))
            logits = 0.5 * (logits + logits_flip)
        all_logits.append(logits.float().cpu())
        if ids is not None:
            all_ids.extend(list(ids))
    all_logits = torch.cat(all_logits, dim=0)
    return all_logits, all_ids

def build_model(num_classes):
    # Architecture must match the checkpoint being loaded; load_ckpt uses strict=True
    m = timm.create_model('convnext_tiny', pretrained=False, num_classes=num_classes, drop_path_rate=0.2)
    return m

def load_ckpt(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    state = ckpt.get('model', ckpt)
    model.load_state_dict(state, strict=True)
    return model

def nll_from_logits(logits, targets):
    return F.cross_entropy(logits, targets).item()

def find_best_temperature(val_logits, val_targets, t_min=0.5, t_max=2.0, steps=31):
    ts = np.linspace(t_min, t_max, steps)
    best_t, best_loss = 1.0, float('inf')
    for t in ts:
        loss = nll_from_logits(val_logits / t, val_targets)
        if loss < best_loss:
            best_loss, best_t = loss, float(t)
    return best_t, best_loss

def run_inference_and_save(use_ema=False):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = build_model(num_classes)
    ckpt_name = 'best_ema.pth' if use_ema else 'best.pth'
    model = load_ckpt(model, os.path.join('checkpoints', ckpt_name))
    model = model.to(device)
    model = model.to(memory_format=torch.channels_last)

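    # Validation targets and logits for temperature scaling.
    # val_loader is not shuffled, so the dataset order matches the order of the logits;
    # reading labels from the DataFrame avoids decoding every image a second time.
    val_targets = torch.tensor(val_loader.dataset.df['label'].values, dtype=torch.long)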
    val_logits, _ = predict_logits_tta(model, val_loader, device, use_hflip=True)
    T_opt, _ = find_best_temperature(val_logits, val_targets, 0.5, 2.0, 31)
    print(f'Optimal temperature T={T_opt:.3f}')

    # Test logits with same TTA
    test_logits, test_ids = predict_logits_tta(model, test_loader, device, use_hflip=True)
    probs = F.softmax(test_logits / T_opt, dim=1).numpy()

    # Build submission in sample_submission order
    sub = pd.DataFrame(probs, columns=breeds)
    sub.insert(0, 'id', test_ids)
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv with shape', sub.shape)

# To run after training completes: run_inference_and_save(use_ema=True or False)
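
Temperature scaling divides the logits by a scalar T before the softmax and picks the T that minimizes validation NLL. The sketch below shows the same grid search in plain Python on made-up toy logits. The names `mean_nll` and `grid_search_temperature` are hypothetical, and the toy data are not from this run. When the model is more confident than it is accurate, the search settles on T > 1.

```python
import math

def mean_nll(logits, targets, temperature):
    """Mean cross-entropy of softmax(logits / temperature) against integer targets."""
    total = 0.0
    for row, y in zip(logits, targets):
        scaled = [z / temperature for z in row]
        m = max(scaled)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(targets)

def grid_search_temperature(logits, targets, t_min=0.5, t_max=2.0, steps=31):
    """Evaluate an evenly spaced grid of temperatures and keep the best one."""
    best_t, best_loss = 1.0, float('inf')
    for i in range(steps):
        t = t_min + (t_max - t_min) * i / (steps - 1)
        loss = mean_nll(logits, targets, t)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss

# Toy data (made up): four identical confident predictions, one of them wrong.
toy_logits = [[1.5, 0.0]] * 4
toy_targets = [0, 0, 0, 1]
best_t, best_loss = grid_search_temperature(toy_logits, toy_targets)
print(f"T={best_t:.2f} nll={best_loss:.4f}")
```

A single scalar cannot change the argmax, so accuracy is untouched; only the confidence of the predicted probabilities moves, which is exactly what log loss rewards.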

In [5]:
# Execute inference with automatic EMA selection based on stored val_loss
import os, torch

def choose_use_ema():
    best_path = os.path.join('checkpoints', 'best.pth')
    best_ema_path = os.path.join('checkpoints', 'best_ema.pth')
    use_ema = False
    if os.path.exists(best_path) and os.path.exists(best_ema_path):
        try:
            b = torch.load(best_path, map_location='cpu', weights_only=True)
            be = torch.load(best_ema_path, map_location='cpu', weights_only=True)
            b_loss = float(b.get('val_loss', float('inf')))
            be_loss = float(be.get('val_loss', float('inf')))
            use_ema = be_loss < b_loss
            print(f"Choosing {'EMA' if use_ema else 'non-EMA'} checkpoint (val_loss: best={b_loss:.5f}, best_ema={be_loss:.5f})")
        except Exception as e:
            print('Failed to compare checkpoints, defaulting to non-EMA:', e)
            use_ema = False
    elif os.path.exists(best_ema_path):
        print('Only best_ema.pth found, using EMA.')
        use_ema = True
    else:
        print('best_ema.pth not found, using non-EMA best.pth.')
        use_ema = False
    return use_ema

use_ema = choose_use_ema()
run_inference_and_save(use_ema=use_ema)



Choosing EMA checkpoint (val_loss: best=0.84325, best_ema=0.69830)




Optimal temperature T=1.200


Saved submission.csv with shape (1023, 121)
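
Before uploading, it is cheap to verify the two properties the plan calls out: column order and rows that sum to 1. The sketch below uses only the standard library and a tiny in-memory file standing in for `submission.csv`. `validate_submission` is a hypothetical helper, and the tolerance is an assumption chosen to absorb float32 rounding.

```python
import csv
import io

def validate_submission(fileobj, expected_columns, tol=1e-4):
    """Check header order and that each row's probabilities sum to 1 within tol.

    Returns the number of data rows; raises ValueError on the first problem found.
    """
    reader = csv.reader(fileobj)
    header = next(reader)
    if header != ['id'] + list(expected_columns):
        raise ValueError('header does not match the expected column order')
    n_rows = 0
    for row in reader:
        probs = [float(x) for x in row[1:]]
        if len(probs) != len(expected_columns):
            raise ValueError(f'row {row[0]} has {len(probs)} probabilities')
        if any(p < 0.0 or p > 1.0 for p in probs):
            raise ValueError(f'row {row[0]} has a value outside [0, 1]')
        if abs(sum(probs) - 1.0) > tol:
            raise ValueError(f'row {row[0]} sums to {sum(probs):.6f}')
        n_rows += 1
    return n_rows

# Toy three-class file (made-up ids and values) standing in for submission.csv.
toy = io.StringIO('id,a,b,c\nx1,0.7,0.2,0.1\nx2,0.1,0.1,0.8\n')
n_ok = validate_submission(toy, ['a', 'b', 'c'])
print(n_ok)  # 2
```

For the real file, the call would look like `validate_submission(open('submission.csv', newline=''), breeds)`, with the expected row count compared against the 1023 test images.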


In [None]:
# New strong training run: timm convnext_tiny (default pretrained weights) + fixed EMA + per-step warmup + grad clip
import os, time, math
import torch
import torch.nn as nn
import torch.nn.functional as F
# torch.cuda.amp.GradScaler is deprecated in torch 2.4; torch.amp provides both names
from torch.amp import autocast, GradScaler

# Allow TF32 matmuls on GPUs that support them (Ampere or newer); a T4 (Turing) has no TF32 units
torch.set_float32_matmul_precision('high')

# Writable caches (must set before any hub downloads)
os.environ['HF_HOME'] = os.path.join(os.getcwd(), 'hf_cache')
os.environ['HF_HUB_CACHE'] = os.path.join(os.getcwd(), 'hf_cache', 'hub')
os.environ['TORCH_HOME'] = os.path.join(os.getcwd(), '.model_cache')
os.makedirs(os.environ['HF_HOME'], exist_ok=True)
os.makedirs(os.environ['TORCH_HOME'], exist_ok=True)

import timm
from timm.utils import ModelEmaV2
from timm.data.mixup import Mixup
from timm.loss import SoftTargetCrossEntropy, LabelSmoothingCrossEntropy
from torch.optim.lr_scheduler import SequentialLR, LinearLR, CosineAnnealingLR

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.benchmark = True

# Try to minimize fragmentation
torch.cuda.empty_cache()

# Model (memory-efficient backbone) 
model = timm.create_model(
    'convnext_tiny',  # use default IM1K pretrained to avoid invalid tag issues
    pretrained=True,
    num_classes=num_classes,
    drop_path_rate=0.2
)
if hasattr(model, 'set_grad_checkpointing'):
    model.set_grad_checkpointing(True)
model = model.to(device)
model = model.to(memory_format=torch.channels_last)

# MixUp/CutMix setup
mixup_fn = Mixup(mixup_alpha=0.2, cutmix_alpha=1.0, prob=1.0, switch_prob=0.5, mode='batch', label_smoothing=0.0, num_classes=num_classes)
use_mixup = True
criterion_soft = SoftTargetCrossEntropy()
criterion_ls = LabelSmoothingCrossEntropy(smoothing=0.1)

# Optimizer with wd exclusions
lr = 5e-4
weight_decay = 0.01
decay, no_decay = [], []
for n, p in model.named_parameters():
    if not p.requires_grad:
        continue
    if p.ndim == 1 or n.endswith('.bias'):
        no_decay.append(p)
    else:
        decay.append(p)
optimizer = torch.optim.AdamW([
    {'params': decay, 'weight_decay': weight_decay},
    {'params': no_decay, 'weight_decay': 0.0},
], lr=lr, betas=(0.9, 0.999))

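# EMA (manual BN buffer sync workaround).
# Note: ConvNeXt uses LayerNorm, so sync_bn_buffers matches no keys for this model,
# and ModelEmaV2.update already averages every state_dict entry, buffers included.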
ema_decay = 0.9996
model_ema = ModelEmaV2(model, decay=ema_decay)
if getattr(model_ema, 'module', None) is not None:
    model_ema.module.to(device)
    model_ema.module.to(memory_format=torch.channels_last)

def sync_bn_buffers(ema_module: torch.nn.Module, src_module: torch.nn.Module):
    ema_state = ema_module.state_dict()
    src_state = src_module.state_dict()
    for k, v in src_state.items():
        if ('running_mean' in k) or ('running_var' in k) or ('num_batches_tracked' in k):
            if k in ema_state and ema_state[k].shape == v.shape:
                ema_state[k] = v.detach().clone()
    ema_module.load_state_dict(ema_state, strict=True)

# Scheduler: per-step warmup then cosine
epochs = 25
accum_steps = 8
warmup_steps = 500
total_steps = (len(train_loader) // accum_steps) * epochs
warmup_lr = LinearLR(optimizer, start_factor=0.1, total_iters=warmup_steps)
cosine_lr = CosineAnnealingLR(optimizer, T_max=max(1, total_steps - warmup_steps), eta_min=1e-6)
scheduler = SequentialLR(optimizer, schedulers=[warmup_lr, cosine_lr], milestones=[warmup_steps])

scaler = GradScaler(enabled=True)

def evaluate_model(eval_model):
    eval_model.eval()
    total_loss, n = 0.0, 0
    with torch.no_grad():
        for images, targets in val_loader:
            images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
            targets = targets.to(device, non_blocking=True)
            with autocast('cuda', enabled=True):
                outputs = eval_model(images)
                loss = F.cross_entropy(outputs, targets)
            total_loss += loss.item() * images.size(0)
            n += images.size(0)
    return total_loss / max(n, 1)

save_dir = 'checkpoints'
os.makedirs(save_dir, exist_ok=True)
best_loss = float('inf')
best_ema_loss = float('inf')
patience = 6
no_improve = 0

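# Report the resolution actually used by the loaders built earlier (IMG_SIZE) rather than a hard-coded value
print(f'Starting strong run (timm convnext_tiny @{IMG_SIZE}px)...', flush=True)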
t0 = time.time()
for epoch in range(epochs):
    model.train()
    running_loss, seen = 0.0, 0
    start = time.time()
    optimizer.zero_grad(set_to_none=True)

    # Disable MixUp in the final 3 epochs, switch to label smoothing
    use_mixup = (epoch < epochs - 3)

    for it, (images, targets) in enumerate(train_loader):
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        targets = targets.to(device, non_blocking=True)
        if use_mixup:
            images, targets = mixup_fn(images, targets)
        with autocast('cuda', enabled=True):
            outputs = model(images)
            loss = (criterion_soft(outputs, targets) if use_mixup else criterion_ls(outputs, targets))
            loss = loss / accum_steps
        scaler.scale(loss).backward()
        if (it + 1) % accum_steps == 0:
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            if model_ema is not None:
                model_ema.update(model)
                sync_bn_buffers(model_ema.module, model)
            scheduler.step()
        running_loss += loss.item() * accum_steps * images.size(0)
        seen += images.size(0)
        if (it + 1) % 50 == 0:
            print(f"Epoch {epoch} [{it+1}/{len(train_loader)}] loss={running_loss/max(seen,1):.4f} time={time.time()-start:.1f}s", flush=True)

    # Validation
    val_loss = evaluate_model(model)
    ema_module = getattr(model_ema, 'module', None)
    if ema_module is not None:
        ema_module.to(device)
        ema_module.to(memory_format=torch.channels_last)
        val_loss_ema = evaluate_model(ema_module)
    else:
        val_loss_ema = float('inf')

    cur_lr = optimizer.param_groups[0]['lr']
    print(f"Epoch {epoch}: train_loss={running_loss/max(seen,1):.4f} val={val_loss:.5f} val_ema={val_loss_ema:.5f} lr={cur_lr:.6f} elapsed={(time.time()-t0)/60:.1f}m", flush=True)

    improved = False
    if val_loss < best_loss - 1e-6:
        best_loss = val_loss
        torch.save({'model': model.state_dict(), 'epoch': epoch, 'val_loss': best_loss}, os.path.join(save_dir, 'best.pth'))
        improved = True
    if val_loss_ema < best_ema_loss - 1e-6:
        best_ema_loss = val_loss_ema
        if ema_module is not None:
            cpu_state = {k: v.cpu() for k, v in ema_module.state_dict().items()}
            torch.save({'model': cpu_state, 'epoch': epoch, 'val_loss': best_ema_loss}, os.path.join(save_dir, 'best_ema.pth'))
        improved = True
    if not improved:
        no_improve += 1
    else:
        no_improve = 0
    if no_improve >= patience:
        print('Early stopping.', flush=True)
        break

print('Strong run complete. Best val:', best_loss, 'Best EMA val:', best_ema_loss)
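
Whether an EMA decay of 0.9996 is appropriate depends on how many optimizer steps the run takes, since the average updates once per step. The sketch below is a back-of-the-envelope check in plain Python. `ema_update` and `ema_half_life` are hypothetical helpers, and the step count assumes the 1034 batches per epoch and `accum_steps = 8` from this notebook.

```python
import math

def ema_update(ema_value, new_value, decay):
    """One EMA step: ema <- decay * ema + (1 - decay) * new."""
    return decay * ema_value + (1.0 - decay) * new_value

def ema_half_life(decay):
    """Number of updates after which an old value's weight in the EMA has halved."""
    return math.log(0.5) / math.log(decay)

steps_per_epoch = 1034 // 8      # optimizer steps per epoch with accum_steps=8
half_life = ema_half_life(0.9996)
print(f"half-life: {half_life:.0f} steps, {half_life / steps_per_epoch:.1f} epochs")

# Starting from 0 and tracking a constant 1, the EMA reaches 0.5 after one half-life.
value = 0.0
for _ in range(round(half_life)):
    value = ema_update(value, 1.0, 0.9996)
print(f"EMA after one half-life: {value:.3f}")
```

With roughly 1,700 steps per half-life against 129 steps per epoch, the averaged weights lag the live model by many epochs, so the EMA checkpoint mostly pays off late in a long run.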

In [None]:
# 5-Fold CV setup (splits, transforms, dataloaders) for medal push
import os, math, random, glob, time
import numpy as np
import pandas as pd
import torch
from sklearn.model_selection import StratifiedKFold
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2

SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)

# Config for strong runs
CFG = {
    'model_name': 'convnext_small.in12k_ft_in1k',
    'img_size': 384,
    'epochs': 25,
    'patience': 6,
    'batch_size': 4,          # physical BS
    'accum_steps': 8,         # effective BS = 4 * 8 = 32
    'num_workers': min(8, os.cpu_count() or 4),
    'lr': 2e-4,
    'weight_decay': 0.05,
    'ema_decay': 0.9996,
    'mixup_alpha': 0.1,
    'cutmix_alpha': 0.5,
    'disable_mixup_last': 5,  # epochs
}

# Rebuild the class mapping and labels from disk so this cell can run standalone
breeds = list(pd.read_csv('sample_submission.csv').columns[1:])
breed_to_idx = {b:i for i,b in enumerate(breeds)}
num_classes = len(breeds)
labels_df = pd.read_csv('labels.csv')
labels_df['filepath'] = labels_df['id'].apply(lambda x: os.path.join('train', f'{x}.jpg'))
labels_df['label'] = labels_df['breed'].map(breed_to_idx)
assert labels_df['label'].notnull().all()

# Build 5 folds
os.makedirs('folds', exist_ok=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
folds = []
for fold, (trn_idx, val_idx) in enumerate(skf.split(labels_df['id'].values, labels_df['label'].values)):
    trn_df = labels_df.iloc[trn_idx].reset_index(drop=True)
    val_df = labels_df.iloc[val_idx].reset_index(drop=True)
    folds.append((trn_df, val_df))
    trn_df[['id','breed','filepath','label']].to_csv(f'folds/train_fold{fold}.csv', index=False)
    val_df[['id','breed','filepath','label']].to_csv(f'folds/val_fold{fold}.csv', index=False)
print('Created 5-fold splits. Fold sizes:')
for i,(trn,val) in enumerate(folds):
    print(f'  Fold {i}: train={len(trn)} val={len(val)}')

# Transforms (robust, albumentations-only due to version quirks)
IM = CFG['img_size']
imagenet_mean = (0.485, 0.456, 0.406)
imagenet_std = (0.229, 0.224, 0.225)

def get_train_tfms():
    return A.Compose([
        A.RandomResizedCrop(IM, IM, scale=(0.8, 1.0), ratio=(0.75, 1.33), p=1.0),
        A.HorizontalFlip(p=0.5),
        A.OneOf([
            A.ColorJitter(brightness=0.25, contrast=0.25, saturation=0.15, hue=0.05, p=1.0),
            A.RandomBrightnessContrast(brightness_limit=0.25, contrast_limit=0.25, p=1.0),
            A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=1.0),
            A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10, p=1.0),
            A.Blur(blur_limit=3, p=1.0),
            A.GaussNoise(var_limit=(10.0, 50.0), p=1.0),
            A.Sharpen(alpha=(0.1, 0.2), lightness=(0.8, 1.2), p=1.0),
        ], p=0.9),
        A.CoarseDropout(max_holes=1, max_height=int(0.12*IM), max_width=int(0.12*IM), min_holes=1, fill_value=0, p=0.5),
        A.Normalize(mean=imagenet_mean, std=imagenet_std),
        ToTensorV2(),
    ])

def get_val_tfms():
    # Note: Resize already outputs IM x IM, so the CenterCrop is a no-op at these settings
    return A.Compose([
        A.Resize(IM, IM),
        A.CenterCrop(IM, IM),
        A.Normalize(mean=imagenet_mean, std=imagenet_std),
        ToTensorV2(),
    ])

class DogDataset(torch.utils.data.Dataset):
    def __init__(self, df, augment, is_train=True):
        self.df = df.reset_index(drop=True)
        self.augment = augment
        self.is_train = is_train
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = cv2.imread(row.filepath)
        if img is None:
            raise FileNotFoundError(row.filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = self.augment(image=img)['image']
        if self.is_train:
            return img, int(row.label)
        else:
            return img, row.id

def make_loaders_for_fold(fold_idx):
    trn_df, val_df = folds[fold_idx]
    ds_tr = DogDataset(trn_df, get_train_tfms(), is_train=True)
    # is_train=True here only means "return labels": validation batches need targets, not ids
    ds_va = DogDataset(val_df, get_val_tfms(), is_train=True)
    num_workers = CFG['num_workers']
    train_loader = torch.utils.data.DataLoader(
        ds_tr, batch_size=CFG['batch_size'], shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True,
        persistent_workers=True if num_workers > 0 else False, prefetch_factor=2 if num_workers > 0 else None
    )
    val_loader = torch.utils.data.DataLoader(
        ds_va, batch_size=CFG['batch_size'], shuffle=False, num_workers=num_workers, pin_memory=True,
        persistent_workers=True if num_workers > 0 else False, prefetch_factor=2 if num_workers > 0 else None
    )
    return train_loader, val_loader, trn_df, val_df

print('CV setup complete. Use make_loaders_for_fold(k) to get loaders for fold k. Next: implement per-fold training loop with EMA and save OOF logits.')

In [14]:
# 5-fold training loop with EMA, OOF saving, and ensemble inference utilities
import os, time, math, gc, glob
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.cuda.amp import autocast, GradScaler
import timm
from timm.utils import ModelEmaV2
from timm.data.mixup import Mixup
from timm.loss import SoftTargetCrossEntropy, LabelSmoothingCrossEntropy
import albumentations as A
from albumentations.pytorch import ToTensorV2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.backends.cudnn.benchmark = True
torch.set_float32_matmul_precision('high')
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

os.makedirs('checkpoints', exist_ok=True)
os.makedirs('oof', exist_ok=True)

def build_model(num_classes):
    # drop_path_rate tuned to 0.25 as per expert advice
    m = timm.create_model(CFG['model_name'], pretrained=True, num_classes=num_classes, drop_path_rate=0.25)
    if hasattr(m, 'set_grad_checkpointing'):
        m.set_grad_checkpointing(True)
    m.to(device)
    m.to(memory_format=torch.channels_last)
    return m

def get_optim(model):
    decay, no_decay = [], []
    for n, p in model.named_parameters():
        if not p.requires_grad: continue
        if p.ndim == 1 or n.endswith('.bias'): no_decay.append(p)
        else: decay.append(p)
    optimizer = torch.optim.AdamW([
        {'params': decay, 'weight_decay': CFG['weight_decay']},
        {'params': no_decay, 'weight_decay': 0.0},
    ], lr=CFG['lr'], betas=(0.9, 0.999))
    return optimizer

def evaluate(model, loader):
    model.eval()
    total, n = 0.0, 0
    with torch.no_grad():
        for images, targets in loader:
            images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
            targets = targets.to(device, non_blocking=True)
            with autocast(enabled=True):
                logits = model(images)
                loss = F.cross_entropy(logits, targets)
            total += loss.item() * images.size(0)
            n += images.size(0)
    return total / max(n, 1)

def collect_logits(model, loader):
    model.eval()
    all_logits, all_targets = [], []
    with torch.no_grad():
        for images, targets in loader:
            images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
            with autocast(enabled=True):
                logits = model(images)
            all_logits.append(logits.float().cpu())
            all_targets.append(targets.cpu())
    return torch.cat(all_logits, 0), torch.cat(all_targets, 0)

def train_one_fold(fold_idx):
    print(f'===== Fold {fold_idx} training start =====', flush=True)
    train_loader, val_loader, trn_df, val_df = make_loaders_for_fold(fold_idx)
    model = build_model(num_classes)
    optimizer = get_optim(model)
    scaler = GradScaler(enabled=True)
    model_ema = ModelEmaV2(model, decay=CFG['ema_decay'])
    model_ema.module.to(device); model_ema.module.to(memory_format=torch.channels_last)
    # Keep some identity samples via prob=0.8
    mixup_fn = Mixup(mixup_alpha=CFG['mixup_alpha'], cutmix_alpha=CFG['cutmix_alpha'], prob=0.8, switch_prob=0.5, mode='batch', label_smoothing=0.0, num_classes=num_classes)
    crit_soft = SoftTargetCrossEntropy()
    crit_ls = LabelSmoothingCrossEntropy(smoothing=0.1)

    accum = CFG['accum_steps']

    # --- Start of refined scheduler/EMA block ---
    steps_per_epoch = len(train_loader)
    update_steps_per_epoch = max(1, steps_per_epoch // CFG['accum_steps'])
    total_updates = update_steps_per_epoch * CFG['epochs']

    # LR: guarded linear warmup then cosine to 0.01x base LR
    min_lr = CFG['lr'] * 0.01
    # Warmup length in optimizer updates: max(one epoch, 30% of training), capped at 400
    warmup_updates = min(400, max(update_steps_per_epoch, int(0.3 * total_updates)))
    def get_lr_update(u_step):
        if u_step < warmup_updates:
            return CFG['lr'] * (u_step + 1) / max(1, warmup_updates)
        t = (u_step - warmup_updates) / max(1, (total_updates - warmup_updates))
        return min_lr + (CFG['lr'] - min_lr) * 0.5 * (1 + math.cos(math.pi * t))

    # EMA: adaptive decay warmup
    ema_warmup_updates = min(800, int(0.3 * total_updates))
    ema_decay_start = 0.99
    ema_decay_target = CFG['ema_decay']
    def get_ema_decay_update(u_step):
        if u_step >= ema_warmup_updates:
            return ema_decay_target
        frac = u_step / max(1, ema_warmup_updates)
        return float(ema_decay_start + (ema_decay_target - ema_decay_start) * frac)
    # --- End of refined scheduler/EMA block ---

    best, best_ema, no_improve = 1e9, 1e9, 0
    global_iter = 0           # counts dataloader iterations
    update_step = 0           # counts optimizer updates

    for epoch in range(CFG['epochs']):
        model.train()
        running, seen = 0.0, 0
        t0 = time.time()
        optimizer.zero_grad(set_to_none=True)
        use_mix = (epoch < (CFG['epochs'] - CFG['disable_mixup_last']))
        for it, (images, targets) in enumerate(train_loader):
            images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
            targets = targets.to(device, non_blocking=True)
            if use_mix:
                images, targets = mixup_fn(images, targets)
            with autocast(enabled=True):
                logits = model(images)
                loss = (crit_soft(logits, targets) if use_mix else crit_ls(logits, targets)) / accum
            scaler.scale(loss).backward()

            if (it + 1) % accum == 0:
                # Apply LR & EMA based on optimizer update steps
                lr_now = get_lr_update(update_step)
                for pg in optimizer.param_groups: pg['lr'] = lr_now
                scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)
                model_ema.decay = get_ema_decay_update(update_step)
                model_ema.update(model)
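                # ModelEmaV2.update() already averages every state_dict() entry, buffers included.
                # Only call update_buffers() if the installed timm version actually provides it.
                if hasattr(model_ema, 'update_buffers'):
                    model_ema.update_buffers(model)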
                update_step += 1

            running += loss.item() * accum * images.size(0)
            seen += images.size(0)
            global_iter += 1
            if (it + 1) % 100 == 0:
                cur_lr = optimizer.param_groups[0]['lr']
                print(f"Fold {fold_idx} Epoch {epoch} [{it+1}/{len(train_loader)}] loss={running/max(seen,1):.4f} lr={cur_lr:.6f} upd={update_step} time={time.time()-t0:.1f}s", flush=True)

        val = evaluate(model, val_loader)
        model_ema.module.to(device); model_ema.module.to(memory_format=torch.channels_last)
        val_ema = evaluate(model_ema.module, val_loader)
        cur_lr = optimizer.param_groups[0]['lr']
        print(f"Fold {fold_idx} Epoch {epoch}: train_loss={running/max(seen,1):.4f} val={val:.5f} val_ema={val_ema:.5f} lr={cur_lr:.6f}", flush=True)

        improved = False
        ckpt_dir = f'checkpoints/fold{fold_idx}'
        os.makedirs(ckpt_dir, exist_ok=True)
        if val < best - 1e-6:
            best = val; improved = True
            torch.save({'model': model.state_dict(), 'val_loss': best, 'epoch': epoch}, os.path.join(ckpt_dir, 'best.pth'))
        if val_ema < best_ema - 1e-6:
            best_ema = val_ema; improved = True
            cpu_state = {k: v.detach().cpu() for k, v in model_ema.module.state_dict().items()}
            torch.save({'model': cpu_state, 'val_loss': best_ema, 'epoch': epoch}, os.path.join(ckpt_dir, 'best_ema.pth'))
        if not improved:
            no_improve += 1
        else:
            no_improve = 0
        if no_improve >= CFG['patience']:
            print(f'Fold {fold_idx}: early stopping.', flush=True)
            break
        gc.collect(); torch.cuda.empty_cache()

    # Save OOF logits (EMA)
    print(f'Fold {fold_idx}: generating OOF logits for validation set...', flush=True)
    # Build EMA model with pretrained=False to avoid redundant downloads/VRAM spikes
    ema_model = timm.create_model(CFG['model_name'], pretrained=False, num_classes=num_classes, drop_path_rate=0.25)
    if hasattr(ema_model, 'set_grad_checkpointing'):
        ema_model.set_grad_checkpointing(True)
    ema_ckpt = torch.load(os.path.join(f'checkpoints/fold{fold_idx}', 'best_ema.pth'), map_location='cpu', weights_only=True)
    state = ema_ckpt.get('model', ema_ckpt)
    ema_model.load_state_dict(state, strict=True)
    ema_model.to(device); ema_model.to(memory_format=torch.channels_last)
    val_loader = make_loaders_for_fold(fold_idx)[1]
    oof_logits, oof_tgts = collect_logits(ema_model, val_loader)
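    # val_df has a reset index, so idx is only a row position within this fold; also store image ids
    np.savez_compressed(f'oof/fold{fold_idx}_oof.npz', logits=oof_logits.numpy(), targets=oof_tgts.numpy(),
                        idx=val_df.index.values, ids=np.asarray(val_df['id'], dtype=str))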
    del ema_model; gc.collect(); torch.cuda.empty_cache()
    print(f'===== Fold {fold_idx} done. Best val={best:.5f} best_ema={best_ema:.5f} =====', flush=True)

def nll_from_logits_np(logits, targets):
    logits_t = torch.from_numpy(logits)
    targets_t = torch.from_numpy(targets).long()
    return F.cross_entropy(logits_t, targets_t).item()

def fit_global_temperature_from_oof():
    # Load all OOF logits
    all_logits, all_targets = [], []
    for k in range(5):
        path = f'oof/fold{k}_oof.npz'
        if not os.path.exists(path):
            continue
        d = np.load(path)
        all_logits.append(d['logits'])
        all_targets.append(d['targets'])
    if len(all_logits) == 0:
        print('WARNING: No OOF files found. Defaulting temperature T=1.0', flush=True)
        return 1.0
    logits = np.concatenate(all_logits, 0)
    targets = np.concatenate(all_targets, 0)
    Ts = np.linspace(0.5, 2.5, 41)
    best_T, best_loss = 1.0, 1e9
    for T in Ts:
        loss = nll_from_logits_np(logits / T, targets)
        if loss < best_loss:
            best_loss, best_T = loss, float(T)
    print(f'Fitted global temperature T={best_T:.3f} on OOF (CE={best_loss:.5f})')
    return best_T

def fit_bias_vector_from_oof(T=1.0):
    # Vector scaling: fit per-class bias b on OOF to minimize CE of logits/T + b
    all_logits, all_targets = [], []
    for k in range(5):
        path = f'oof/fold{k}_oof.npz'
        if not os.path.exists(path):
            continue
        d = np.load(path)
        all_logits.append(d['logits'])
        all_targets.append(d['targets'])
    if len(all_logits) == 0:
        print('WARNING: No OOF files for bias fit. Using zero bias.', flush=True)
        return torch.zeros((num_classes,), dtype=torch.float32)
    logits = torch.from_numpy(np.concatenate(all_logits, 0)).float() / float(T)
    targets = torch.from_numpy(np.concatenate(all_targets, 0)).long()
    b = torch.zeros((num_classes,), dtype=torch.float32, requires_grad=True)
    optimizer = torch.optim.LBFGS([b], max_iter=100, tolerance_grad=1e-7, tolerance_change=1e-9, line_search_fn='strong_wolfe')
    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits + b.unsqueeze(0), targets)
        loss.backward()
        return loss
    loss0 = F.cross_entropy(logits, targets).item()
    optimizer.step(closure)
    with torch.no_grad():
        loss1 = F.cross_entropy(logits + b.unsqueeze(0), targets).item()
    print(f'Bias vector fit on OOF: CE before={loss0:.5f} after={loss1:.5f}')
    return b.detach()

@torch.no_grad()
def predict_logits_tta(model, loader, use_hflip=True):
    model.eval()
    all_logits = []
    for images, _ in loader:
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        with autocast(enabled=True):
            logits = model(images)
            if use_hflip:
                logits = 0.5 * (logits + model(images.flip(-1)))
        all_logits.append(logits.float().cpu())
    return torch.cat(all_logits, 0)

def build_scaled_loader(size):
    tfm = A.Compose([
        A.Resize(size, size),
        A.CenterCrop(size, size),
        A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ToTensorV2(),
    ])
    import glob as _glob, os as _os, cv2
    class TestDS(torch.utils.data.Dataset):
        def __init__(self, ids): self.ids = ids
        def __len__(self): return len(self.ids)
        def __getitem__(self, i):
            pid = self.ids[i]
            img = cv2.imread(_os.path.join('test', f'{pid}.jpg'))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = tfm(image=img)['image']
            return img, pid
    test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(_glob.glob(os.path.join('test','*.jpg')))]
    ds = TestDS(test_ids)
    # Use a larger, fixed batch size for fast inference; set num_workers=0 to prevent hang
    loader = torch.utils.data.DataLoader(ds, batch_size=32, shuffle=False, num_workers=0, pin_memory=False)
    return loader, test_ids

def infer_ensemble_and_save():
    sizes = [CFG['img_size'], CFG['img_size'] + 32]
    fold_ckpts = [f'checkpoints/fold{k}/best_ema.pth' for k in range(5) if os.path.exists(f'checkpoints/fold{k}/best_ema.pth')]
    # Check the fatal case first so the "proceeding anyway" warning is never printed before an abort
    if len(fold_ckpts) == 0:
        print('ERROR: No fold checkpoints found. Aborting ensemble inference.', flush=True)
        return
    if len(fold_ckpts) < 2:
        print(f'WARNING: Only {len(fold_ckpts)} fold checkpoints found. Proceeding anyway.', flush=True)
    # Fit temperature on OOF
    T = fit_global_temperature_from_oof()
    # Also fit per-class bias on OOF (vector scaling) using logits/T
    b = fit_bias_vector_from_oof(T=T)
    # Prepare loaders per size
    loaders = [build_scaled_loader(s)[0] for s in sizes]
    # Accumulate logits across folds and TTAs
    fold_logits = None
    for fi, ck in enumerate(fold_ckpts):
        print(f'Loading fold checkpoint: {ck}', flush=True)
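        # pretrained=False: weights come from the fold checkpoint, so skip the ImageNet download
        model = timm.create_model(CFG['model_name'], pretrained=False, num_classes=num_classes, drop_path_rate=0.25)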
        ckpt = torch.load(ck, map_location='cpu', weights_only=True)
        state = ckpt.get('model', ckpt)
        model.load_state_dict(state, strict=True)
        model.to(device); model.to(memory_format=torch.channels_last)
        logits_sum = None
        for ld in loaders:
            lg = predict_logits_tta(model, ld, use_hflip=True)
            logits_sum = lg if logits_sum is None else logits_sum + lg
        logits_avg = logits_sum / len(loaders)
        fold_logits = logits_avg if fold_logits is None else fold_logits + logits_avg
        del model; gc.collect(); torch.cuda.empty_cache()
    fold_logits = fold_logits / len(fold_ckpts)
    # Apply temperature and per-class bias before softmax
    probs = F.softmax(fold_logits / float(T) + b.unsqueeze(0), dim=1).numpy()
    # Build submission
    _, test_ids = build_scaled_loader(CFG['img_size'])
    sub = pd.DataFrame(probs, columns=breeds)
    sub.insert(0, 'id', test_ids)
    sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv (ensemble) with shape', sub.shape)

print('Utilities ready: call train_one_fold(k) for k in 0..4, then infer_ensemble_and_save() to create submission.')

Utilities ready: call train_one_fold(k) for k in 0..4, then infer_ensemble_and_save() to create submission.
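
The warmup-plus-cosine schedule defined inside train_one_fold can be inspected on its own before committing to a long run. The sketch below is a standalone restatement of get_lr_update. The function name lr_at_update and the example numbers (1000 total updates, 100 warmup updates) are assumptions for illustration, not values taken from this run.

```python
import math

def lr_at_update(u_step, base_lr, total_updates, warmup_updates, min_lr_ratio=0.01):
    """Linear warmup to base_lr, then cosine decay down to min_lr_ratio * base_lr."""
    min_lr = base_lr * min_lr_ratio
    if u_step < warmup_updates:
        # Warmup is indexed from 1 so the very first update has a non-zero LR
        return base_lr * (u_step + 1) / max(1, warmup_updates)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1]
    t = (u_step - warmup_updates) / max(1, total_updates - warmup_updates)
    return min_lr + (base_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * t))

# Hypothetical budget: 1000 optimizer updates, 100 of them warmup
for u in (0, 50, 99, 100, 550, 1000):
    print(u, lr_at_update(u, base_lr=2e-4, total_updates=1000, warmup_updates=100))
```

Printing a handful of update steps like this is a cheap way to confirm that the peak LR is reached at the end of warmup and that the floor is 1% of the base LR.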


In [None]:
# Kick off sanity fold training (reduce epochs for speed) 
import time
start_ts = time.time()
CFG['epochs'] = 18
CFG['patience'] = 4
print('Starting train_one_fold(0) with CFG:', CFG)
train_one_fold(0)
print(f'Fold 0 training finished in {(time.time()-start_ts)/60:.2f} min')
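
The LR and EMA schedules count optimizer updates rather than dataloader iterations, so it can help to estimate the update budget before launching a fold. A rough sketch follows, assuming drop_last=True as in make_loaders_for_fold. count_updates is a hypothetical helper and n_train=8000 is a made-up fold size, not the real one.

```python
def count_updates(n_train, batch_size, accum_steps, epochs):
    """Mirror the step bookkeeping of train_one_fold for a loader built with drop_last=True."""
    steps_per_epoch = n_train // batch_size              # the incomplete last batch is dropped
    updates_per_epoch = max(1, steps_per_epoch // accum_steps)
    return {
        'effective_batch': batch_size * accum_steps,
        'steps_per_epoch': steps_per_epoch,
        'updates_per_epoch': updates_per_epoch,
        'total_updates': updates_per_epoch * epochs,
    }

# batch_size, accum_steps and epochs follow the CFG used for the sanity run; n_train is assumed
budget = count_updates(n_train=8000, batch_size=4, accum_steps=8, epochs=18)
print(budget)
```

If total_updates comes out smaller than the 400-update warmup cap, most of the run would be spent warming up, which is a sign to raise the epoch count or lower accum_steps.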

In [13]:
# Override CFG and loaders without re-running cell 7; use saved CSV folds
import os, time, math, glob, cv2, torch, pandas as pd, numpy as np, random
import albumentations as A
from albumentations.pytorch import ToTensorV2

SEED = 42
torch.manual_seed(SEED); torch.cuda.manual_seed_all(SEED)
random.seed(SEED); np.random.seed(SEED)

# Stability: prevent OpenCV thread contention
cv2.setNumThreads(0)

CFG = {
    'model_name': 'convnext_small.in12k_ft_in1k',
    'img_size': 384,
    'epochs': 25,
    'patience': 6,
    'batch_size': 4,
    'accum_steps': 8,
    'num_workers': 4,
    'lr': 2e-4,
    'weight_decay': 0.05,
    'ema_decay': 0.9996,
    'mixup_alpha': 0.1,
    'cutmix_alpha': 0.5,
    'disable_mixup_last': 5,
}

# Rebuild breeds mapping
breeds = list(pd.read_csv('sample_submission.csv').columns[1:])
breed_to_idx = {b:i for i,b in enumerate(breeds)}
num_classes = len(breeds)

# Transforms
IM = CFG['img_size']
imagenet_mean = (0.485, 0.456, 0.406)
imagenet_std = (0.229, 0.224, 0.225)

def get_train_tfms():
    return A.Compose([
        A.RandomResizedCrop(IM, IM, scale=(0.8, 1.0), ratio=(0.75, 1.33), p=1.0),
        A.HorizontalFlip(p=0.5),
        A.OneOf([
            A.ColorJitter(brightness=0.25, contrast=0.25, saturation=0.15, hue=0.05, p=1.0),
            A.RandomBrightnessContrast(brightness_limit=0.25, contrast_limit=0.25, p=1.0),
            A.HueSaturationValue(hue_shift_limit=10, sat_shift_limit=15, val_shift_limit=10, p=1.0),
            A.RGBShift(r_shift_limit=10, g_shift_limit=10, b_shift_limit=10, p=1.0),
            A.Blur(blur_limit=3, p=1.0),
            A.GaussNoise(var_limit=(10.0, 50.0), p=1.0),
            A.Sharpen(alpha=(0.1, 0.2), lightness=(0.8, 1.2), p=1.0),
        ], p=0.9),
        A.ShiftScaleRotate(shift_limit=0.02, scale_limit=0.1, rotate_limit=10, p=0.3),
        A.CoarseDropout(max_holes=1, max_height=int(0.12*IM), max_width=int(0.12*IM), min_holes=1, fill_value=0, p=0.5),
        A.Normalize(mean=imagenet_mean, std=imagenet_std),
        ToTensorV2(),
    ])

def get_val_tfms():
    return A.Compose([
        A.Resize(IM, IM),
        A.CenterCrop(IM, IM),
        A.Normalize(mean=imagenet_mean, std=imagenet_std),
        ToTensorV2(),
    ])

class DogDataset(torch.utils.data.Dataset):
    def __init__(self, df, augment, is_train=True):
        self.df = df.reset_index(drop=True)
        self.augment = augment
        self.is_train = is_train
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img = cv2.imread(row.filepath)
        if img is None:
            raise FileNotFoundError(row.filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = self.augment(image=img)['image']
        if self.is_train:
            return img, int(row.label)
        else:
            return img, row.id

def seed_worker(worker_id: int):
    base_seed = SEED + worker_id
    np.random.seed(base_seed)
    random.seed(base_seed)
    torch.manual_seed(base_seed)
    return None

def make_loaders_for_fold(fold_idx):
    trn_df = pd.read_csv(f'folds/train_fold{fold_idx}.csv')
    val_df = pd.read_csv(f'folds/val_fold{fold_idx}.csv')
    ds_tr = DogDataset(trn_df, get_train_tfms(), is_train=True)
    # is_train=True here only means "return labels": validation batches need targets, not ids
    ds_va = DogDataset(val_df, get_val_tfms(), is_train=True)
    num_workers = CFG['num_workers']
    train_loader = torch.utils.data.DataLoader(
        ds_tr, batch_size=CFG['batch_size'], shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True,
        persistent_workers=True if num_workers > 0 else False, prefetch_factor=2 if num_workers > 0 else None,
        worker_init_fn=seed_worker
    )
    val_loader = torch.utils.data.DataLoader(
        ds_va, batch_size=CFG['batch_size'], shuffle=False, num_workers=num_workers, pin_memory=True,
        persistent_workers=True if num_workers > 0 else False, prefetch_factor=2 if num_workers > 0 else None,
        worker_init_fn=seed_worker
    )
    return train_loader, val_loader, trn_df, val_df

print('CFG and loaders set via CSV folds. Ready to call train_one_fold(0).')

CFG and loaders set via CSV folds. Ready to call train_one_fold(0).


In [7]:
# CPU-only ensemble inference: convnext_tiny (existing ckpts) + convnext_small fold0 (if present), extra scales
import os, time, numpy as np, pandas as pd, torch, torch.nn.functional as F
import timm
from albumentations.pytorch import ToTensorV2
import albumentations as A

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def build_model(name, num_classes):
    return timm.create_model(name, pretrained=False, num_classes=num_classes)

def load_ckpt(model, ckpt_path):
    ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    state = ckpt.get('model', ckpt)
    model.load_state_dict(state, strict=True)
    return model

@torch.no_grad()
def predict_logits_tta(model, loader, use_hflip=True):
    model.eval()
    outs = []
    for batch in loader:
        images = batch[0]  # second item is targets (val) or ids (test); unused here
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        logits = model(images)
        if use_hflip:
            logits = 0.5 * (logits + model(images.flip(-1)))
        outs.append(logits.float().cpu())
    return torch.cat(outs, 0)

def build_scaled_test_loader(size):
    tfm = A.Compose([
        A.Resize(size, size),
        A.CenterCrop(size, size),
        A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ToTensorV2(),
    ])
    from torch.utils.data import Dataset, DataLoader
    import glob, cv2
    class TestDS(Dataset):
        def __init__(self, ids):
            self.ids = ids
        def __len__(self):
            return len(self.ids)
        def __getitem__(self, i):
            pid = self.ids[i]
            img = cv2.imread(os.path.join('test', f'{pid}.jpg'))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = tfm(image=img)['image']
            return img, pid
    test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(glob.glob(os.path.join('test','*.jpg')))]
    ds = TestDS(test_ids)
    ld = torch.utils.data.DataLoader(ds, batch_size=8, shuffle=False, num_workers=0, pin_memory=False)
    return ld, test_ids

def nll_from_logits(logits, targets):
    return F.cross_entropy(logits, targets).item()

def find_best_temperature(val_logits, val_targets, t_min=0.5, t_max=2.0, steps=31):
    ts = np.linspace(t_min, t_max, steps)
    best_t, best_loss = 1.0, float('inf')
    for t in ts:
        loss = nll_from_logits(val_logits / t, val_targets)
        if loss < best_loss:
            best_loss, best_t = loss, float(t)
    return best_t, best_loss

print('Starting CPU ensemble inference...', flush=True)
# Build models conditionally (include EMA and non-EMA if both exist)
models = []
names = []

# convnext_tiny checkpoints
if os.path.exists('checkpoints/best_ema.pth'):
    m = build_model('convnext_tiny', num_classes)
    m = load_ckpt(m, 'checkpoints/best_ema.pth').to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('tiny_ema'); print('Loaded convnext_tiny EMA')
if os.path.exists('checkpoints/best.pth'):
    m = build_model('convnext_tiny', num_classes)
    m = load_ckpt(m, 'checkpoints/best.pth').to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('tiny'); print('Loaded convnext_tiny non-EMA')

# convnext_small fold0 checkpoints
if os.path.exists('checkpoints/fold0/best_ema.pth'):
    m = build_model('convnext_small.in12k_ft_in1k', num_classes)
    m = load_ckpt(m, 'checkpoints/fold0/best_ema.pth').to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('small_f0_ema'); print('Loaded convnext_small fold0 EMA')
if os.path.exists('checkpoints/fold0/best.pth'):
    m = build_model('convnext_small.in12k_ft_in1k', num_classes)
    m = load_ckpt(m, 'checkpoints/fold0/best.pth').to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('small_f0'); print('Loaded convnext_small fold0 non-EMA')

assert len(models) > 0, 'No checkpoints found to run ensemble inference.'

# Fit temperature on current val split using ensemble of models
with torch.no_grad():
    val_logits_list = []
    for m in models:
        lg = predict_logits_tta(m, val_loader, use_hflip=True)
        val_logits_list.append(lg)
    val_logits_ens = torch.stack(val_logits_list, 0).mean(0)
    val_targets = torch.cat([t for _, t in val_loader], 0).long()
T_opt, ce = find_best_temperature(val_logits_ens, val_targets, 0.5, 2.0, 41)
print(f'Fitted temperature T={T_opt:.3f} (CE={ce:.5f}) on current val split')

# Test inference at multiple scales, then average across models and scales
sizes = [224, 256, 288, 320]
test_logits_accum = None
for sz in sizes:
    ld, test_ids = build_scaled_test_loader(sz)
    logits_models = []
    for m in models:
        lg = predict_logits_tta(m, ld, use_hflip=True)
        logits_models.append(lg)
    logits_scale_avg = torch.stack(logits_models, 0).mean(0)
    test_logits_accum = logits_scale_avg if test_logits_accum is None else (test_logits_accum + logits_scale_avg)
test_logits_avg = test_logits_accum / len(sizes)
probs = F.softmax(test_logits_avg / T_opt, dim=1).numpy()

# Build and save submission
sub = pd.DataFrame(probs, columns=breeds)
sub.insert(0, 'id', test_ids)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (CPU ensemble) with shape', sub.shape, 'from models:', names, 'scales:', sizes)
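
Temperature scaling divides the logits by a single scalar T chosen to minimize validation cross-entropy; T > 1 softens overconfident predictions. The sketch below reproduces the grid search in pure Python on hand-made toy logits so the effect is easy to see. mean_nll, grid_search_T and the toy data are illustrative assumptions and stand in for the tensor-based find_best_temperature above.

```python
import math

def mean_nll(logits, targets, T=1.0):
    """Mean negative log-likelihood of softmax(logits / T) at the target classes."""
    total = 0.0
    for row, y in zip(logits, targets):
        scaled = [z / T for z in row]
        m = max(scaled)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
        total += log_z - scaled[y]
    return total / len(targets)

def grid_search_T(logits, targets, t_min=0.5, t_max=2.0, steps=31):
    """Return the temperature on an evenly spaced grid with the lowest mean NLL."""
    best_T, best_loss = 1.0, float('inf')
    for i in range(steps):
        T = t_min + (t_max - t_min) * i / (steps - 1)
        loss = mean_nll(logits, targets, T)
        if loss < best_loss:
            best_T, best_loss = T, loss
    return best_T, best_loss

# Toy data (assumed): four confident predictions, one of which (row 2) is wrong
toy_logits = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [3.0, 0.0, 0.0], [0.0, 0.0, 3.0]]
toy_targets = [0, 1, 1, 2]
T_best, loss_best = grid_search_T(toy_logits, toy_targets)
print('T =', T_best, 'NLL =', loss_best, 'NLL at T=1:', mean_nll(toy_logits, toy_targets))
```

Because one in four confident predictions is wrong, the search settles on a temperature above 1, which lowers the log loss without changing any argmax.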

In [10]:
# CPU-only optimized inference: 2 EMA models (tiny + small_f0), 3 scales, per-model temp + blend weight optimization
import os, numpy as np, pandas as pd, torch, torch.nn.functional as F
import timm, scipy.optimize as opt
import albumentations as A
from albumentations.pytorch import ToTensorV2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def build_model(name, num_classes):
    return timm.create_model(name, pretrained=False, num_classes=num_classes)

def load_ckpt(model, path):
    ckpt = torch.load(path, map_location='cpu', weights_only=True)
    state = ckpt.get('model', ckpt)
    model.load_state_dict(state, strict=True)
    return model

@torch.no_grad()
def predict_logits_tta(model, loader, use_hflip=True):
    model.eval()
    outs = []
    for batch in loader:
        images = batch[0]  # second item is targets (val) or ids (test); unused here
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        logits = model(images)
        if use_hflip:
            logits = 0.5 * (logits + model(images.flip(-1)))
        outs.append(logits.float().cpu())
    return torch.cat(outs, 0)

def build_test_loader(size, batch_size=48):
    tfm = A.Compose([A.Resize(size, size), A.CenterCrop(size, size), A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)), ToTensorV2()])
    from torch.utils.data import Dataset, DataLoader
    import glob, cv2
    class TestDS(Dataset):
        def __init__(self, ids): self.ids = ids
        def __len__(self): return len(self.ids)
        def __getitem__(self, i):
            pid = self.ids[i]
            img = cv2.imread(os.path.join('test', f'{pid}.jpg'))
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = tfm(image=img)['image']
            return img, pid
    test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(glob.glob(os.path.join('test','*.jpg')))]
    ds = TestDS(test_ids)
    ld = torch.utils.data.DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=False)
    return ld, test_ids

def ce_from_logits(logits, targets):
    return F.cross_entropy(logits, targets).item()

def grid_search_temperature(val_logits, val_targets, t_min=0.5, t_max=2.0, steps=41):
    ts = np.linspace(t_min, t_max, steps)
    best_t, best_loss = 1.0, 1e9
    for t in ts:
        loss = ce_from_logits(val_logits / t, val_targets)
        if loss < best_loss:
            best_loss, best_t = loss, float(t)
    return best_t, best_loss

print('Optimized CPU inference: loading EMA models...', flush=True)
paths = {
    'tiny_ema': 'checkpoints/best_ema.pth',
    'small_f0_ema': 'checkpoints/fold0/best_ema.pth'
}
avail = {k: os.path.exists(v) for k,v in paths.items()}
assert any(avail.values()), 'No EMA checkpoints found'

models = []
names = []
if avail.get('tiny_ema'):
    m = build_model('convnext_tiny', num_classes)
    m = load_ckpt(m, paths['tiny_ema']).to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('tiny_ema')
if avail.get('small_f0_ema'):
    m = build_model('convnext_small.in12k_ft_in1k', num_classes)
    m = load_ckpt(m, paths['small_f0_ema']).to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('small_f0_ema')
print('Loaded:', names)

# 1) Compute val logits per model (reuse existing val_loader from Cell 3) and fit per-model temperatures
val_targets = torch.cat([t for _, t in val_loader], 0).long()
val_logits_list, Ts = [], []
for m in models:
    lg = predict_logits_tta(m, val_loader, use_hflip=True)
    t_opt, _ = grid_search_temperature(lg, val_targets, 0.5, 2.0, 41)
    val_logits_list.append(lg); Ts.append(t_opt)
print('Per-model temperatures:', dict(zip(names, Ts)))

# 2) Optimize blend weight w in [0,1] to minimize CE on val
def val_loss_for_w(w):
    w = float(w)
    blend = w * (val_logits_list[0] / Ts[0])
    if len(val_logits_list) > 1:
        blend = blend + (1.0 - w) * (val_logits_list[1] / Ts[1])
    return ce_from_logits(blend, val_targets)
if len(val_logits_list) == 1:
    w_opt = 1.0
else:
    res = opt.minimize_scalar(lambda x: val_loss_for_w(x), bounds=(0.0, 1.0), method='bounded', options={'xatol':1e-3, 'maxiter':100})
    w_opt = float(res.x)
print('Optimized blend weight w:', w_opt)

# 3) Test inference at three scales (224,256,288) with hflip; batch_size=48 for speed
sizes = [224, 256, 288]
test_logits_accum = None
for sz in sizes:
    ld, test_ids = build_test_loader(sz, batch_size=48)
    # per-model logits with their calibrated temps
    per_model = []
    for mi, m in enumerate(models):
        lg = predict_logits_tta(m, ld, use_hflip=True) / Ts[mi]
        per_model.append(lg)
    # blend with w_opt
    if len(per_model) == 1:
        blended = per_model[0]
    else:
        blended = w_opt * per_model[0] + (1.0 - w_opt) * per_model[1]
    test_logits_accum = blended if test_logits_accum is None else (test_logits_accum + blended)
test_logits_avg = test_logits_accum / len(sizes)
probs = F.softmax(test_logits_avg, dim=1).numpy()

sub = pd.DataFrame(probs, columns=breeds)
sub.insert(0, 'id', test_ids)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (CPU optimized blend) with shape', sub.shape, 'models:', names, 'scales:', sizes, 'w:', w_opt)
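
Blending two models in logit space with a single weight w pays off most when their errors differ. The sketch below illustrates this on toy data, with a plain grid search standing in for the bounded scipy minimize_scalar call used in the cell above. All names and logits here are made up for illustration.

```python
import math

def mean_nll(logits, targets):
    """Mean negative log-likelihood of softmax(logits) at the target classes."""
    total = 0.0
    for row, y in zip(logits, targets):
        m = max(row)  # stable log-sum-exp
        total += m + math.log(sum(math.exp(z - m) for z in row)) - row[y]
    return total / len(targets)

def blend(logits_a, logits_b, w):
    """Convex combination w * A + (1 - w) * B, element by element."""
    return [[w * x + (1.0 - w) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(logits_a, logits_b)]

def grid_search_w(logits_a, logits_b, targets, steps=21):
    """Return the blend weight on an even grid over [0, 1] with the lowest mean NLL."""
    best_w, best_loss = 0.0, float('inf')
    for i in range(steps):
        w = i / (steps - 1)
        loss = mean_nll(blend(logits_a, logits_b, w), targets)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, best_loss

# Toy data (assumed): model A is wrong on the last row, model B on the third row
targets = [0, 1, 0, 1]
logits_a = [[2.0, 0.0], [0.0, 2.0], [2.0, 0.0], [2.0, 0.0]]
logits_b = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0], [0.0, 2.0]]
w_opt, loss_opt = grid_search_w(logits_a, logits_b, targets)
print('w =', w_opt, 'blend NLL =', loss_opt, 'A alone:', mean_nll(logits_a, targets))
```

With symmetric, complementary mistakes the best weight sits in the middle and the blend beats either model alone. On real validation logits the optimum usually leans toward the stronger model.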

In [9]:
# CPU feature extraction + logistic regression, then blend with current submission
import os, glob, numpy as np, pandas as pd, torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import timm, cv2, albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
import scipy.optimize as opt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class ImageDS(Dataset):
    def __init__(self, df, tfm):
        self.df = df.reset_index(drop=True); self.tfm = tfm
    def __len__(self): return len(self.df)
    def __getitem__(self, i):
        r = self.df.iloc[i]
        img = cv2.imread(r.filepath); img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        t = self.tfm(image=img)['image']
        return t, int(r.label) if 'label' in r else r.id

def build_feature_model(name):
    # num_classes=0 returns global pooled features
    m = timm.create_model(name, pretrained=True, num_classes=0)
    m.eval(); m.to(device); m.to(memory_format=torch.channels_last)
    return m

def make_tfm(sz):
    return A.Compose([A.Resize(sz, sz), A.CenterCrop(sz, sz), A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)), ToTensorV2()])

@torch.no_grad()
def extract_feats(model, loader):
    feats, ys_or_ids = [], []
    for batch in loader:
        imgs, meta = batch
        imgs = imgs.to(device, non_blocking=True, memory_format=torch.channels_last)
        f = model(imgs).float().cpu().numpy()
        feats.append(f)
        if isinstance(meta, torch.Tensor):
            ys_or_ids.extend(meta.cpu().numpy().tolist())
        else:
            ys_or_ids.extend(list(meta))
    return np.concatenate(feats, 0), ys_or_ids

# Use convnext_tiny features at two sizes for speed
model_name = 'convnext_tiny'
sizes = [224, 256]
BATCH = 32

# Full train df for features
labels_df_all = pd.read_csv('labels.csv')
labels_df_all['filepath'] = labels_df_all['id'].apply(lambda x: os.path.join('train', f'{x}.jpg'))
labels_df_all['label'] = labels_df_all['breed'].map(breed_to_idx)

# Validation split from Cell 3 already exists: val_df
train_only_df = labels_df_all[~labels_df_all['id'].isin(val_df['id'])].reset_index(drop=True)

feat_list_tr, y_list_tr = [], []
feat_list_va, y_list_va = [], []
for sz in sizes:
    tfm = make_tfm(sz)
    ds_tr = ImageDS(train_only_df, tfm); ld_tr = DataLoader(ds_tr, batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=False)
    ds_va = ImageDS(val_df, tfm); ld_va = DataLoader(ds_va, batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=False)
    m = build_feature_model(model_name)
    f_tr, y_tr = extract_feats(m, ld_tr); f_va, y_va = extract_feats(m, ld_va)
    feat_list_tr.append(f_tr); feat_list_va.append(f_va)
    y_list_tr = y_tr; y_list_va = y_va
    del m; torch.cuda.empty_cache()

X_tr = np.concatenate(feat_list_tr, axis=1)
X_va = np.concatenate(feat_list_va, axis=1)
y_tr = np.array(y_list_tr, dtype=int)
y_va = np.array(y_list_va, dtype=int)
print('Feature shapes:', X_tr.shape, X_va.shape)

# Train multinomial logistic regression
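# multi_class is left at its default: with solver='saga' a multiclass target is already fit as a
# multinomial model, and the explicit argument is deprecated in recent scikit-learn releases.
lr_clf = LogisticRegression(max_iter=200, n_jobs=-1, verbose=0, penalty='l2', solver='saga', C=1.0)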
lr_clf.fit(X_tr, y_tr)
probs_va_lr = lr_clf.predict_proba(X_va)
print('Val CE (LR only):', log_loss(y_va, probs_va_lr, labels=list(range(num_classes))))

# Build test features
test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(glob.glob(os.path.join('test','*.jpg')))]
test_df_all = pd.DataFrame({'id': test_ids})
test_df_all['filepath'] = test_df_all['id'].apply(lambda x: os.path.join('test', f'{x}.jpg'))

feat_list_te = []
for sz in sizes:
    tfm = make_tfm(sz)
    # ImageDS yields (image, id) when the frame has no 'label' column, so it doubles as the test dataset
    ld_te = DataLoader(ImageDS(test_df_all, tfm), batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=False)
    m = build_feature_model(model_name)
    f_te, _ids = extract_feats(m, ld_te)
    feat_list_te.append(f_te)
    del m; torch.cuda.empty_cache()
X_te = np.concatenate(feat_list_te, axis=1)
probs_te_lr = lr_clf.predict_proba(X_te)

# Load current submission (from optimized blend) and corresponding val probs to optimize blend
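base_sub = pd.read_csv('submission.csv')
# Align rows to the sorted test_ids used for the LR features so both probability matrices share one row order
# (raises KeyError if an id is missing, instead of silently blending mismatched rows)
base_sub = base_sub.set_index('id').loc[test_ids].reset_index()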
assert base_sub.shape[1] == 121, 'submission.csv format mismatch'
base_cols = list(base_sub.columns[1:])
assert base_cols == breeds, 'breed columns mismatch'

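# Recompute base-model val probs from the Cell 13 checkpoints to search the blend weight on validation.
# NOTE: these use an unweighted logit mean with hflip only, while submission.csv came from a
# temperature-scaled, weighted, multi-scale blend, so w_opt is only an approximation for the test blend.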
from copy import deepcopy
paths = {
    'tiny_ema': 'checkpoints/best_ema.pth',
    'small_f0_ema': 'checkpoints/fold0/best_ema.pth'
}
models_blend = []
names_blend = []
for name, pth in paths.items():
    if os.path.exists(pth):
        nm = 'convnext_tiny' if 'tiny' in name else 'convnext_small.in12k_ft_in1k'
        m = timm.create_model(nm, pretrained=False, num_classes=num_classes)
        ck = torch.load(pth, map_location='cpu', weights_only=True); st = ck.get('model', ck)
        m.load_state_dict(st, strict=True); m.to(device); m.to(memory_format=torch.channels_last)
        models_blend.append(m); names_blend.append(name)

@torch.no_grad()
def val_logits_for_models(models):
    outs = []
    for m in models:
        m.eval()
        logits = []
        for imgs, tgts in val_loader:
            imgs = imgs.to(device, non_blocking=True, memory_format=torch.channels_last)
            lg = m(imgs)
            lg = 0.5*(lg + m(imgs.flip(-1)))
            logits.append(lg.float().cpu())
        outs.append(torch.cat(logits, 0))
    return outs

val_logits_models = val_logits_for_models(models_blend) if len(models_blend)>0 else []
val_targets = torch.cat([t for _, t in val_loader], 0).long().numpy()

# Optimize blend: base (models_blend averaged) vs LR features
if len(val_logits_models) > 0:
    val_logits_base = torch.stack(val_logits_models, 0).mean(0).numpy()
    val_probs_base = torch.softmax(torch.from_numpy(val_logits_base), dim=1).numpy()
else:
    val_probs_base = np.zeros((len(val_targets), num_classes), dtype=np.float32) + (1.0/num_classes)

def val_blend_loss(w):
    w = float(w)
    probs = w*val_probs_base + (1.0-w)*probs_va_lr
    return log_loss(val_targets, probs, labels=list(range(num_classes)))
res = opt.minimize_scalar(lambda x: val_blend_loss(x), bounds=(0.0,1.0), method='bounded', options={'xatol':1e-3,'maxiter':100})
w_opt = float(res.x) if res.success else 0.5
print('Optimized blend w (base vs LR):', w_opt)

# Build final probs for test
if len(models_blend) > 0:
    # reuse base_sub as base probs
    probs_base_te = base_sub[breeds].values.astype(np.float32)
else:
    probs_base_te = np.zeros_like(probs_te_lr) + (1.0/num_classes)
probs_final = w_opt*probs_base_te + (1.0-w_opt)*probs_te_lr

sub = pd.DataFrame(probs_final, columns=breeds)
sub.insert(0, 'id', base_sub['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv (blend base + LR features) with shape', sub.shape, 'w=', w_opt)
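
Because the final blend is a convex combination of two probability matrices, each row should still sum to 1, provided both inputs are row-aligned by test id and individually normalised. The following is a small sketch of a sanity check that could run before saving. `blend_probs` and `check_probs` are hypothetical helper names, and the clipping constant is an assumption rather than a value the notebook sets.

```python
import numpy as np

def blend_probs(p_base, p_other, w):
    """Convex combination of two (N, C) probability matrices with weight w on p_base."""
    if p_base.shape != p_other.shape:
        raise ValueError(f"shape mismatch: {p_base.shape} vs {p_other.shape}")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w must lie in [0, 1]")
    return w * p_base + (1.0 - w) * p_other

def check_probs(p, eps=1e-7, atol=1e-4):
    """Clip away exact zeros, renormalise rows, and verify that they sum to 1."""
    p = np.clip(np.asarray(p, dtype=np.float64), eps, 1.0)
    p = p / p.sum(axis=1, keepdims=True)
    assert np.isfinite(p).all(), "non-finite probabilities"
    assert np.allclose(p.sum(axis=1), 1.0, atol=atol), "rows do not sum to 1"
    return p

# Synthetic stand-ins for probs_base_te and probs_te_lr (8 images, 120 breeds)
rng = np.random.default_rng(0)
a = rng.dirichlet(np.ones(120), size=8)
b = rng.dirichlet(np.ones(120), size=8)
final = check_probs(blend_probs(a, b, w=0.7))
print(final.shape, final.sum(axis=1).round(6))
```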

In [12]:
# CPU-only: calibration with per-class bias on blended logits (2 EMA models, 2 scales) to reduce logloss
import os, numpy as np, pandas as pd, torch, torch.nn.functional as F
import timm, albumentations as A
from albumentations.pytorch import ToTensorV2

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def build_model(name, num_classes):
    return timm.create_model(name, pretrained=False, num_classes=num_classes)

def load_ckpt(model, path):
    ckpt = torch.load(path, map_location='cpu', weights_only=True)
    state = ckpt.get('model', ckpt)
    model.load_state_dict(state, strict=True)
    return model

@torch.no_grad()
def predict_logits_tta(model, loader, use_hflip=True):
    model.eval()
    outs = []
    for batch in loader:
        # Labelled (image, target) and test (image, id) batches both carry the image tensor first
        images = batch[0]
        images = images.to(device, non_blocking=True, memory_format=torch.channels_last)
        logits = model(images)
        if use_hflip:
            logits = 0.5 * (logits + model(images.flip(-1)))
        outs.append(logits.float().cpu())
    return torch.cat(outs, 0)

def build_loader(size, df, is_test=False, batch_size=48):
    tfm = A.Compose([A.Resize(size, size), A.CenterCrop(size, size), A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)), ToTensorV2()])
    from torch.utils.data import Dataset, DataLoader
    import cv2
    class DS(Dataset):
        def __init__(self, df): self.df=df.reset_index(drop=True)
        def __len__(self): return len(self.df)
        def __getitem__(self, i):
            r = self.df.iloc[i]
            img = cv2.imread(r.filepath); img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
            img = tfm(image=img)['image']
            return (img, r.label) if not is_test else (img, r.id)
    ds = DS(df)
    ld = torch.utils.data.DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=False)
    return ld

# Load EMA checkpoints (tiny + small fold0). If not found, fallback to available ones.
paths = {
    'tiny_ema': 'checkpoints/best_ema.pth',
    'small_f0_ema': 'checkpoints/fold0/best_ema.pth'
}
models = []
names = []
if os.path.exists(paths['tiny_ema']):
    m = build_model('convnext_tiny', num_classes); m = load_ckpt(m, paths['tiny_ema']).to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('tiny_ema')
if os.path.exists(paths['small_f0_ema']):
    m = build_model('convnext_small.in12k_ft_in1k', num_classes); m = load_ckpt(m, paths['small_f0_ema']).to(device).to(memory_format=torch.channels_last)
    models.append(m); names.append('small_f0_ema')
assert len(models) > 0, 'No EMA checkpoints found'
print('Calib using models:', names)

# Build val/test DataFrames compatible with loaders
val_df_ = val_df[['id','filepath','label']].copy()
import glob  # not part of this cell's import header
test_ids = [os.path.splitext(os.path.basename(p))[0] for p in sorted(glob.glob(os.path.join('test','*.jpg')))]
test_df_ = pd.DataFrame({'id': test_ids}); test_df_['filepath'] = test_df_['id'].apply(lambda x: os.path.join('test', f'{x}.jpg'))

# Sizes and per-model temperature via grid on validation
sizes = [224, 256]
val_targets = torch.tensor(val_df_['label'].values, dtype=torch.long)
val_logits_models = []  # list of tensors (N,C)
Ts = []
for m in models:
    # average logits over sizes
    logits_sum = None
    for sz in sizes:
        ld_va = build_loader(sz, val_df_, is_test=False, batch_size=32)
        lg = predict_logits_tta(m, ld_va, use_hflip=True)
        logits_sum = lg if logits_sum is None else (logits_sum + lg)
    lg_avg = logits_sum / len(sizes)
    # temperature grid search
    ts = np.linspace(0.5, 2.0, 41); best_t, best_ce = 1.0, 1e9
    for t in ts:
        ce = F.cross_entropy(lg_avg/float(t), val_targets).item()
        if ce < best_ce: best_ce, best_t = ce, float(t)
    Ts.append(best_t); val_logits_models.append(lg_avg)
print('Per-model T:', Ts)

# Optimize blend weight w on validation
def ce_w(w):
    w = float(w)
    blend = (val_logits_models[0]/Ts[0])
    if len(val_logits_models) > 1:
        blend = w*(val_logits_models[0]/Ts[0]) + (1.0-w)*(val_logits_models[1]/Ts[1])
    return F.cross_entropy(blend, val_targets).item()
if len(val_logits_models) == 1:
    w_opt = 1.0
else:
    from scipy.optimize import minimize_scalar
    res = minimize_scalar(lambda x: ce_w(x), bounds=(0.0,1.0), method='bounded', options={'xatol':1e-3,'maxiter':100})
    w_opt = float(res.x)
print('w_opt:', w_opt)

# Build blended validation logits
val_blend = (val_logits_models[0]/Ts[0]) if len(val_logits_models)==1 else (w_opt*(val_logits_models[0]/Ts[0]) + (1.0-w_opt)*(val_logits_models[1]/Ts[1]))

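# Fit per-class bias vector b to minimize CE on validation: minimize CE(softmax(val_blend + b), y).
# NOTE: b has num_classes free parameters and is both fit and scored on the same validation set,
# so the 'after' CE printed below is an in-sample (optimistic) estimate.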
b = torch.zeros((num_classes,), dtype=torch.float32, requires_grad=True)
optim = torch.optim.LBFGS([b], max_iter=100, tolerance_grad=1e-7, tolerance_change=1e-9, line_search_fn='strong_wolfe')
def closure():
    optim.zero_grad()
    logits = val_blend + b.unsqueeze(0)
    loss = F.cross_entropy(logits, val_targets)
    loss.backward()
    return loss
loss0 = F.cross_entropy(val_blend, val_targets).item()
optim.step(closure)
with torch.no_grad():
    loss1 = F.cross_entropy(val_blend + b.unsqueeze(0), val_targets).item()
print(f'Bias calib CE: before={loss0:.5f} after={loss1:.5f}')

# Detach bias for test-time inference to avoid grads in numpy conversion
b = b.detach()

# Inference on test with sizes, blend, and bias
test_logits_accum = None
for sz in sizes:
    ld_te = build_loader(sz, test_df_[['id','filepath']].copy(), is_test=True, batch_size=48)
    # collect per-model logits
    per_model = []
    for mi, m in enumerate(models):
        lg = predict_logits_tta(m, ld_te, use_hflip=True) / Ts[mi]
        per_model.append(lg)
    # blend
    if len(per_model) == 1:
        blended = per_model[0]
    else:
        blended = w_opt*per_model[0] + (1.0 - w_opt)*per_model[1]
    test_logits_accum = blended if test_logits_accum is None else (test_logits_accum + blended)
test_logits_avg = test_logits_accum / len(sizes)
probs = F.softmax(test_logits_avg + b.unsqueeze(0), dim=1).detach().numpy()

sub = pd.DataFrame(probs, columns=breeds)
sub.insert(0, 'id', test_df_['id'].values)
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv with per-class bias calibration. Shape:', sub.shape)

Calib using models: ['tiny_ema', 'small_f0_ema']


Per-model T: [1.1, 0.65]
w_opt: 0.07063204494756387
Bias calib CE: before=0.52495 after=0.42436


Saved submission.csv with per-class bias calibration. Shape: (1023, 121)
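
The before/after CE above is measured on the same validation rows used to fit the per-class bias terms, so part of the gain may be overfitting. One way to gauge this, sketched below on synthetic logits, is to fit the bias on one half of the validation set and score it on the other half. This is a NumPy restatement that uses plain gradient descent rather than the cell's LBFGS fit. `fit_bias`, the learning rate, and the step count are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce(logits, y):
    """Mean cross-entropy of integer targets y under softmax(logits)."""
    p = softmax(logits)
    return float(-np.log(p[np.arange(len(y)), y]).mean())

def fit_bias(logits, y, lr=0.5, steps=500):
    """Gradient descent on CE(softmax(logits + b), y) with respect to b only."""
    n, c = logits.shape
    onehot = np.eye(c)[y]
    b = np.zeros(c)
    for _ in range(steps):
        grad = (softmax(logits + b) - onehot).mean(axis=0)  # dCE/db
        b -= lr * grad
    return b

# Synthetic stand-in for val_blend / val_targets
rng = np.random.default_rng(0)
n, c = 1000, 20
y = rng.integers(0, c, size=n)
logits = rng.normal(size=(n, c))
logits[np.arange(n), y] += 2.0
logits += np.linspace(-1.0, 1.0, c)  # inject a systematic per-class offset

half = n // 2
b = fit_bias(logits[:half], y[:half])
fit_before, fit_after = ce(logits[:half], y[:half]), ce(logits[:half] + b, y[:half])
held_before, held_after = ce(logits[half:], y[half:]), ce(logits[half:] + b, y[half:])
print(f"fit half : before={fit_before:.4f} after={fit_after:.4f}")
print(f"held-out : before={held_before:.4f} after={held_after:.4f}")
```

If the held-out improvement is much smaller than the in-sample one, the bias vector is probably fitting noise in the validation split.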


In [15]:
# === GPU Driver: 5-fold ConvNeXt-Small medal run ===
import torch, time, gc
try:
    CFG
    train_one_fold
    infer_ensemble_and_save
except NameError as e:
    print('ERROR: Missing definitions. Run Cells 11 (CFG/loaders) and 9 (training/utils) first, then re-run this cell.')
    raise

if not torch.cuda.is_available():
    print('CUDA not available. Relaunch environment to obtain a working GPU (nvidia-smi must succeed), then run:')
    print('  1) Run Cell 11 (CFG/loaders)')
    print('  2) Run Cell 9 (training/utils)')
    print('  3) Run this driver cell to train all folds and infer submission')
else:
    print('GPU detected. Starting 5-fold training...')
    # Runtime-safe medal plan per expert synthesis
    CFG['epochs'] = 8
    CFG['patience'] = 2
    t0 = time.time()
    for fold_idx in range(5):
        fold_t0 = time.time()
        print(f'>>> Launching fold {fold_idx + 1} / 5 (fold_idx={fold_idx})', flush=True)
        train_one_fold(fold_idx)
        torch.cuda.synchronize()
        elapsed_h = (time.time() - fold_t0) / 3600.0
        print(f'>>> Fold {fold_idx} finished in {elapsed_h:.2f} h', flush=True)
        # Free VRAM/host RAM between folds
        gc.collect()
        torch.cuda.empty_cache()
    total_h = (time.time()-t0)/3600.0
    print(f'All folds completed in {total_h:.2f} h. Starting ensemble inference...', flush=True)
    infer_ensemble_and_save()
    gc.collect(); torch.cuda.empty_cache()
    print('Done. submission.csv saved.')

CUDA not available. Relaunch environment to obtain a working GPU (nvidia-smi must succeed), then run:
  1) Run Cell 11 (CFG/loaders)
  2) Run Cell 9 (training/utils)
  3) Run this driver cell to train all folds and infer submission


## Medal Run Checklist (GPU required)

Follow these exact steps after relaunching with a working GPU:

1) Verify GPU works:
   - Run Cell 1 (nvidia-smi). Ensure torch.cuda.is_available() == True (Cell 2 prints device).
   - If not available, relaunch/copy-and-edit to a new host. Optional: add a lightweight retry loop to re-check CUDA every 60–90s.

2) Prepare definitions:
   - Run Cell 11 (CFG/loaders via CSV folds).
   - Run Cell 9 (training + inference utilities).

3) Train all folds + infer:
   - Run Cell 16 (GPU driver).
   - It sets epochs=8 and patience=2 (runtime-safe), trains folds 0..4 with EMA warmup, then runs ensemble TTA (sizes: 384, 416), applies a global temperature fit on OOF predictions, and saves submission.csv.

Notes:
- Keep CFG as-is: convnext_small.in12k_ft_in1k @384, batch_size=4, accum_steps=8, lr=2e-4, ema_decay warmup ~0.99→0.9996 (adaptive), mixup_alpha=0.1, cutmix_alpha=0.5, num_workers=4, drop_path_rate=0.25.
- LR/EMA schedules are tied to optimizer update steps (respecting grad accumulation).
- Dataloaders use pin_memory and persistent_workers for speed; inference uses batch_size=32 and num_workers=0 to avoid hangs.
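- If OOM: lower CFG['batch_size'] (e.g., 4 → 2) and raise CFG['accum_steps'] to match (e.g., 8 → 16, keeping the effective batch at 32) in Cell 11, then re-run Cell 9 and Cell 16. Reducing accum_steps alone does not lower peak memory, because each micro-batch stays the same size.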
- Do not change hyperparameters mid-run.
- Stability tip: add `cv2.setNumThreads(0)` near imports to avoid OpenCV thread contention.
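
To make the note about schedules concrete: with batch_size=4 and accum_steps=8, the optimizer updates once per 8 micro-batches (effective batch 32), so warmup and cosine decay are indexed by update count rather than by batch count. A minimal sketch follows. The warmup length, the training-set size, and the helper names are assumptions for illustration, not values taken from CFG.

```python
import math

def updates_per_epoch(n_samples, batch_size, accum_steps):
    """Optimizer updates per epoch, counting a trailing partial accumulation as one update."""
    n_batches = math.ceil(n_samples / batch_size)
    return math.ceil(n_batches / accum_steps)

def lr_at_update(step, total_updates, base_lr=2e-4, warmup_updates=50, min_lr=0.0):
    """Linear warmup followed by cosine decay, indexed by optimizer update."""
    if step < warmup_updates:
        return base_lr * (step + 1) / warmup_updates
    progress = (step - warmup_updates) / max(1, total_updates - warmup_updates)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * min(1.0, progress)))

n_train = 8000            # hypothetical number of training images in one fold
per_epoch = updates_per_epoch(n_train, batch_size=4, accum_steps=8)
total = per_epoch * 8     # epochs=8, as set by the driver cell
print("effective batch:", 4 * 8, "| updates/epoch:", per_epoch, "| total updates:", total)
for s in (0, 49, 50, total // 2, total - 1):
    print(f"update {s:5d}: lr={lr_at_update(s, total):.2e}")
```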

Outcome:
- After ≥2 folds (preferably all 5), submission.csv will be saved in CWD, ready to score.

In [17]:
# Lightweight CUDA retry loop (run this to periodically check for GPU availability)
import time, torch, datetime
max_checks = 8  # ~9 minutes total at interval=75s (7 sleeps between 8 checks)
interval_s = 75
print(f"[{datetime.datetime.now().strftime('%H:%M:%S')}] Starting CUDA retry loop: {max_checks} checks, {interval_s}s interval", flush=True)
for i in range(max_checks):
    ok = torch.cuda.is_available()
    ts = datetime.datetime.now().strftime('%H:%M:%S')
    print(f"[{ts}] Check {i+1}/{max_checks}: torch.cuda.is_available() = {ok}", flush=True)
    if ok:
        print('GPU is now available. Next steps: run Cell 11 -> Cell 9 -> Cell 16.', flush=True)
        break
    if i < max_checks - 1:
        time.sleep(interval_s)
else:
    print('GPU still unavailable after retries. Recommended: Restart session or Copy & Edit to force new host, then run Cells 11 -> 9 -> 16.', flush=True)

[10:17:08] Starting CUDA retry loop: 8 checks, 75s interval


[10:17:08] Check 1/8: torch.cuda.is_available() = False


[10:18:23] Check 2/8: torch.cuda.is_available() = False


[10:19:38] Check 3/8: torch.cuda.is_available() = False


[10:20:53] Check 4/8: torch.cuda.is_available() = False


[10:22:08] Check 5/8: torch.cuda.is_available() = False


[10:23:23] Check 6/8: torch.cuda.is_available() = False


[10:24:38] Check 7/8: torch.cuda.is_available() = False


[10:25:53] Check 8/8: torch.cuda.is_available() = False


GPU still unavailable after retries. Recommended: Restart session or Copy & Edit to force new host, then run Cells 11 -> 9 -> 16.
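
The loop above only polls `torch.cuda.is_available()` inside the current process. Since the checklist also requires `nvidia-smi` to succeed, a complementary check can query the driver from a fresh subprocess. The sketch below uses only the standard library. `gpu_visible` is a hypothetical helper name and the 20-second timeout is an arbitrary assumption.

```python
import shutil
import subprocess

def gpu_visible(timeout_s=20):
    """Return True only if nvidia-smi exists on PATH and exits with status 0."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        # `nvidia-smi -L` lists the GPUs the driver can see
        result = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0

print("nvidia-smi reports a usable GPU:", gpu_visible())
```

A False result here would be consistent with the `Failed to initialize NVML` message from the GPU diagnostic cell.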
