# APTOS 2019 Blindness Detection — Medal Plan & Experiment Log

Objective: Train an image model to predict diabetic retinopathy grade (0–4) on fundus images. Metric: Quadratic Weighted Kappa (QWK). Deliver submission.csv and iterate to medal.

Success gates:
- Bronze ≥ 0.9145
- Silver ≥ 0.9197
- Gold ≥ 0.9305

Metric quirks:
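- Classes are ordinal: QWK penalizes a miss by the squared distance between predicted and true grade, so distant misses cost far more than adjacent ones. QWK is also sensitive to calibration and to where the cutpoints (thresholds) are placed.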
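
To make the distance penalty concrete, here is a minimal pure-Python sketch of QWK. The function name `quadratic_weighted_kappa` and the toy labels are illustrative assumptions; the notebook itself relies on `sklearn.metrics.cohen_kappa_score(..., weights='quadratic')`.

```python
from collections import Counter

def quadratic_weighted_kappa(y_true, y_pred, n_classes=5):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w[i][j] = (i - j)^2."""
    n = len(y_true)
    observed = Counter(zip(y_true, y_pred))   # confusion-matrix counts
    hist_true = Counter(y_true)
    hist_pred = Counter(y_pred)
    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2
            num += w * observed[(i, j)]
            # Expected count under independence of the two label histograms
            den += w * hist_true[i] * hist_pred[j] / n
    # Undefined when both label sets collapse to the same single class
    return 1.0 - num / den if den > 0 else float('nan')

y_true    = [0, 0, 1, 2, 3, 4, 4, 2]
near_miss = [0, 0, 1, 2, 3, 4, 3, 2]  # one grade-4 image predicted as 3
far_miss  = [0, 0, 1, 2, 3, 4, 0, 2]  # the same image predicted as 0

qwk_near = quadratic_weighted_kappa(y_true, near_miss)
qwk_far = quadratic_weighted_kappa(y_true, far_miss)
print(f"near miss: {qwk_near:.3f}")  # 0.970
print(f"far miss:  {qwk_far:.3f}")   # 0.556
```

A single error four grades away costs roughly fourteen times more kappa than the same error one grade away, which is why threshold placement matters so much here.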

Validation protocol (lock early):
- 5-fold StratifiedKFold by diagnosis with fixed seed
- Save and reuse exact fold indices across experiments (deterministic CV)
- Deterministic: set seeds (python, numpy, torch, cudnn), fix worker seeds
- Track OOF predictions and compute QWK per fold + overall; early stop on QWK
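
The "early stop on QWK" rule can be sketched as a small helper. This is an illustrative assumption, not the trainer used later in the notebook: the class name `QWKEarlyStopper` is hypothetical, while `patience` and `min_epochs` mirror the config keys that appear in the training cells.

```python
class QWKEarlyStopper:
    """Track the best validation QWK and signal a stop after `patience`
    epochs without improvement, but never before `min_epochs`."""

    def __init__(self, patience=5, min_epochs=8):
        self.patience = patience
        self.min_epochs = min_epochs
        self.best = float('-inf')
        self.best_epoch = 0
        self.bad_epochs = 0

    def step(self, epoch, qwk):
        """Record one epoch (1-based). Returns (improved, should_stop)."""
        improved = qwk > self.best
        if improved:
            self.best = qwk
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        should_stop = epoch >= self.min_epochs and self.bad_epochs >= self.patience
        return improved, should_stop

# Simulated per-epoch validation QWK for one fold (made-up numbers)
curve = [0.80, 0.85, 0.84, 0.83, 0.86]
stopper = QWKEarlyStopper(patience=2, min_epochs=3)
stopped_at = None
for epoch, q in enumerate(curve, start=1):
    improved, stop = stopper.step(epoch, q)
    # In a real loop, `improved` is where the fold checkpoint would be saved
    if stop:
        stopped_at = epoch
        break
print(stopper.best, stopper.best_epoch, stopped_at)
```

With `patience=2`, training stops at epoch 4 and never sees the later 0.86, which illustrates the trade-off between a short patience and noisy QWK curves.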

Mandatory preprocessing (apply to train/val/test):
- Circular crop of fundus (remove black borders)
- Brightness/contrast normalization: Ben Graham/CLAHE, optional color constancy
- Center and resize to target resolution; avoid vertical flips

Baseline plan (fast to strong):
1) Sanity checks + GPU test
2) Data load + EDA: class balance, image dims, sample visual checks of preprocessing
3) Minimal baseline (pipeline validation): EfficientNet-B0/224, REGRESSION head (1 out), MSELoss, AdamW, cosine LR, AMP, 5-fold, 2–3 epochs on subset
4) Threshold optimization on OOF (Nelder–Mead/coordinate descent, 4 cutpoints) for QWK; apply fixed thresholds to test; submit
5) Stronger model @512: EfficientNet-B3 or EfficientNetV2-S, augmentations (horizontal flip only, ±15° rotate, scale/zoom, brightness/contrast/gamma, light blur; no vertical flips or 90° rotations), class weights or sampler, EMA
6) TTA (orig + hflip) and resolution/seed sweep; retrain with early stopping on QWK
7) Ensembling: 2–3 diverse models/seeds; average regression outputs; re-opt thresholds on blended OOF
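
Steps 4 and 7 combine as: average the regression outputs of several models, then map the blended scores to grades with four cutpoints. The sketch below is a dependency-free illustration with made-up predictions; `blend` and `apply_cutpoints` are hypothetical names, and `bisect_right` is used because it matches what `np.digitize(preds, cuts)` returns for ascending cutpoints.

```python
from bisect import bisect_right

def apply_cutpoints(preds, cuts):
    """Map continuous scores to grades 0-4 using 4 ascending cutpoints.
    A score equal to a cutpoint falls into the higher grade."""
    cuts = sorted(float(c) for c in cuts)
    return [min(bisect_right(cuts, p), 4) for p in preds]

def blend(*model_preds):
    """Element-wise mean of regression outputs from several models."""
    return [sum(vals) / len(vals) for vals in zip(*model_preds)]

# Made-up regression outputs from two models on four images
model_a = [0.2, 1.4, 2.6, 3.9]
model_b = [0.6, 1.8, 2.2, 3.3]

blended = blend(model_a, model_b)
labels = apply_cutpoints(blended, [0.5, 1.5, 2.5, 3.5])
print(labels)  # [0, 2, 2, 4]
```

Because blending shifts the score distribution, cutpoints tuned on one model's OOF predictions are not guaranteed to suit the blend, hence the re-optimization on blended OOF in step 7.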

Risk controls:
- Install torch + timm; test CUDA first. Start with tiny subset to avoid long broken runs.
- Early stopping, save best QWK per fold; log OOF and thresholds to disk.
- Keep notebook concise; delete stale cells after logging.
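
Logging thresholds to disk can be as small as a JSON round trip. The sketch below assumes the `{"cuts": [...]}` layout that the inference cells read back; the helper names and the file name are hypothetical.

```python
import json
import tempfile
from pathlib import Path

def save_thresholds(path, cuts):
    """Write cutpoints as {"cuts": [...]}, sorted ascending."""
    payload = {'cuts': sorted(float(c) for c in cuts)}
    Path(path).write_text(json.dumps(payload))

def load_thresholds(path):
    """Read the cutpoints back as a list of floats."""
    return json.loads(Path(path).read_text())['cuts']

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'thresholds_example.json'  # hypothetical file name
    save_thresholds(path, [1.48, 0.52, 2.61, 3.3])
    restored = load_thresholds(path)
print(restored)  # [0.52, 1.48, 2.61, 3.3]
```

Saving one such file per experiment name keeps test-time thresholds tied to the OOF predictions they were tuned on.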

Deliverables:
- submission.csv (id_code, diagnosis)
- Reproducible CV with OOF QWK tracked and thresholds saved

Experiment Log:
| Exp | Model/Res | Loss | Aug | Folds/Seeds | OOF QWK | LB QWK | Notes |
|-----|-----------|------|-----|-------------|---------|--------|-------|
|     |           |      |     |             |         |        |       |

Next step: GPU test, package installs, implement fundus crop + normalization, and data load.

In [None]:
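import sys, subprocess, importlib
def ensure(pkg, pip_name=None):
    """Import `pkg`; if that fails, pip-install `pip_name` (defaults to `pkg`) and retry."""
    pip_name = pip_name or pkg
    try:
        importlib.import_module(pkg)
        print(f"OK: {pkg} already installed")
    except Exception:
        print(f"Installing {pip_name}...")
        subprocess.run([sys.executable, '-m', 'pip', 'install', pip_name, '-q'], check=False)
        # Refresh import caches so the freshly installed package is discoverable
        importlib.invalidate_caches()
        importlib.import_module(pkg)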

# Core deps
ensure('pandas')
ensure('numpy')
ensure('scipy')
ensure('sklearn', 'scikit-learn')
ensure('tqdm')
ensure('cv2', 'opencv-python-headless')
ensure('albumentations')

# PyTorch + vision + timm (CUDA 12.1 wheels)
try:
    import torch
    import torchvision
except Exception:
    print('Installing torch/torchvision for CUDA 12.1...')
    subprocess.run([sys.executable, '-m', 'pip', 'install', '--index-url', 'https://download.pytorch.org/whl/cu121', 'torch', 'torchvision', '-q'], check=False)
    import torch, torchvision
ensure('timm')

import torch
print(f"GPU Available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU Count: {torch.cuda.device_count()}")
    print(f"GPU Name: {torch.cuda.get_device_name(0)}")
    props = torch.cuda.get_device_properties(0)
    print(f"GPU Memory: {props.total_memory / 1024**3:.1f} GB")

import pandas as pd
train_path, test_path = 'train.csv', 'test.csv'
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
print('Train shape:', train_df.shape, 'Test shape:', test_df.shape)
print('Train head:')
print(train_df.head())
if 'diagnosis' in train_df.columns:
    print('Diagnosis distribution:')
    print(train_df['diagnosis'].value_counts().sort_index())

In [None]:
import cv2
import numpy as np
import os
from typing import Tuple

def robust_circular_crop(img: np.ndarray) -> np.ndarray:
    h, w = img.shape[:2]
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.medianBlur(gray, 11)
    # Otsu threshold
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Sometimes Otsu picks background as fg; choose better of mask / inverted by largest contour area
    def largest_area(m):
        cnts, _ = cv2.findContours(m, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        return 0 if not cnts else max(cv2.contourArea(c) for c in cnts)
    area1 = largest_area(mask)
    area2 = largest_area(255 - mask)
    if area2 > area1:
        mask = 255 - mask
    # Morph open to clean noise
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (9, 9))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    cnts, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not cnts:
        # Fallback: center square crop
        side = min(h, w)
        y0 = (h - side) // 2
        x0 = (w - side) // 2
        return img[y0:y0+side, x0:x0+side]
    c = max(cnts, key=cv2.contourArea)
    (x, y), r = cv2.minEnclosingCircle(c)
    x, y, r = int(x), int(y), int(r)
    # sanity: if radius too small, fallback to center crop
    if r < min(h, w) * 0.2:
        side = min(h, w)
        y0 = (h - side) // 2
        x0 = (w - side) // 2
        return img[y0:y0+side, x0:x0+side]
    x1, y1 = max(0, x - r), max(0, y - r)
    x2, y2 = min(w, x + r), min(h, y + r)
    crop = img[y1:y2, x1:x2]
    return crop if crop.size else img

def ben_graham_normalize(img: np.ndarray, sigma: int = 30) -> np.ndarray:
    # Apply on fixed-size image with fixed sigma
    blur = cv2.GaussianBlur(img, (0, 0), sigma)
    enhanced = cv2.addWeighted(img, 4, blur, -4, 128)
    return np.clip(enhanced, 0, 255).astype(np.uint8)

def preprocess_fundus(img: np.ndarray, out_size: int = 512, use_ben: bool = True) -> np.ndarray:
    # 1) robust circular crop
    img = robust_circular_crop(img)
    # 2) resize to fixed size
    img = cv2.resize(img, (out_size, out_size), interpolation=cv2.INTER_AREA)
    # 3) illumination normalization (Ben Graham)
    if use_ben:
        img = ben_graham_normalize(img, sigma=30)
    return img

def read_preprocess(image_id: str, root: str, out_size: int = 512) -> np.ndarray:
    path = os.path.join(root, f"{image_id}.png")
    img = cv2.imread(path, cv2.IMREAD_COLOR)
    if img is None:
        raise FileNotFoundError(path)
    img = preprocess_fundus(img, out_size=out_size, use_ben=True)
    return img

# Quick dry-run on a few samples to validate shapes and visualize before/after:
try:
    import pandas as pd
    import matplotlib.pyplot as plt
    train_df = pd.read_csv('train.csv')
    # pick up to 2 samples per class for visual check
    ids = []
    for cls in sorted(train_df['diagnosis'].unique()):
        ids += train_df[train_df['diagnosis'] == cls]['id_code'].head(2).tolist()
    ids = ids[:8]
    print('Visual check IDs:', ids)
    n = len(ids)
    fig, axes = plt.subplots(n, 2, figsize=(6, 3*n))
    if n == 1:
        axes = np.array([axes])
    for i, iid in enumerate(ids):
        p = os.path.join('train_images', f'{iid}.png')
        orig = cv2.imread(p, cv2.IMREAD_COLOR)
        proc = preprocess_fundus(orig.copy(), out_size=512, use_ben=True)
        # BGR->RGB for display
        orig_rgb = cv2.cvtColor(orig, cv2.COLOR_BGR2RGB)
        proc_rgb = cv2.cvtColor(proc, cv2.COLOR_BGR2RGB)
        axes[i, 0].imshow(orig_rgb); axes[i, 0].set_title(f'{iid} orig'); axes[i, 0].axis('off')
        axes[i, 1].imshow(proc_rgb); axes[i, 1].set_title('processed'); axes[i, 1].axis('off')
    plt.tight_layout()
    plt.show()
    # Print basic stats to confirm dtype/range
    for iid in ids[:3]:
        arr = read_preprocess(iid, 'train_images', out_size=512)
        print(iid, arr.shape, arr.dtype, arr.min(), arr.max())
except Exception as e:
    print('Preprocessing validation error:', e)

In [None]:
import os, random, json
from pathlib import Path
import cv2
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm

# Reuses read_preprocess() from cell 2 (run that cell first)

SEED = 42
FOLDS = 5
OUT_SIZE = 512
FOLD_FILE = f'folds_s{SEED}_k{FOLDS}.csv'
CACHE_DIR = Path(f'cache_{OUT_SIZE}')

def seed_everything(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except Exception:
        pass

seed_everything(SEED)

# Create/reuse CV folds
train_df = pd.read_csv('train.csv')
if os.path.exists(FOLD_FILE):
    folds_df = pd.read_csv(FOLD_FILE)
    print('Loaded existing folds:', FOLD_FILE)
else:
    skf = StratifiedKFold(n_splits=FOLDS, shuffle=True, random_state=SEED)
    fold_idx = np.zeros(len(train_df), dtype=int) - 1
    for f, (_, val_idx) in enumerate(skf.split(train_df['id_code'], train_df['diagnosis'])):
        fold_idx[val_idx] = f
    folds_df = train_df.copy()
    folds_df['fold'] = fold_idx
    folds_df.to_csv(FOLD_FILE, index=False)
    print('Saved folds to', FOLD_FILE)

# Weighted sampler weights (1/sqrt(n_class)) to save for later
vc = folds_df['diagnosis'].value_counts().to_dict()
class_weights = {k: 1.0 / np.sqrt(v) for k, v in vc.items()}
with open('class_weights.json', 'w') as f:
    json.dump(class_weights, f)
print('Class weights:', class_weights)

# Cache preprocessed images (train + test) to speed up IO
train_src = Path('train_images')
test_src = Path('test_images')
train_dst = CACHE_DIR / 'train'
test_dst = CACHE_DIR / 'test'
train_dst.mkdir(parents=True, exist_ok=True)
test_dst.mkdir(parents=True, exist_ok=True)

def cache_split(df, src_dir: Path, dst_dir: Path):
    for iid in tqdm(df['id_code'].tolist(), desc=f'Caching to {dst_dir}'):
        outp = dst_dir / f'{iid}.png'
        if outp.exists():
            continue
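        try:
            img = read_preprocess(iid, str(src_dir), out_size=OUT_SIZE)
            # cv2.imwrite returns False on failure instead of raising
            if not cv2.imwrite(str(outp), img):
                raise IOError(f'cv2.imwrite failed for {outp}')
        except Exception as e:
            print('Error caching', iid, e)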

cache_split(train_df, train_src, train_dst)
test_df = pd.read_csv('test.csv')
cache_split(test_df, test_src, test_dst)
print('Caching complete:', train_dst, test_dst)

In [None]:
import os, json, numpy as np, pandas as pd, torch, timm, cv2
from pathlib import Path
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torch.utils.data import Dataset, DataLoader
from torch.optim.swa_utils import AveragedModel

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
CACHE_DIR_512 = Path('cache_512/test')
TEST_DF = pd.read_csv('test.csv')
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def get_transforms_for_size(size):
    return A.Compose([A.Resize(size, size), A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD), ToTensorV2()])

class CacheTestDS(Dataset):
    def __init__(self, df, img_dir: Path, size: int):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.tfm = get_transforms_for_size(size)
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(img_path), cv2.IMREAD_COLOR)
        if img is None:
            raise FileNotFoundError(str(img_path))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        x = self.tfm(image=img)['image']
        return x, -1.0

def load_b0_model(ckpt_path: str):
    model = timm.create_model('tf_efficientnet_b0_ns', pretrained=False, num_classes=1)
    model = model.to(DEVICE).to(memory_format=torch.channels_last)
    ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    model.load_state_dict(ckpt['model'], strict=True)
    model.eval()
    return model

def load_b3_ema_model(ckpt_path: str):
    base = timm.create_model('tf_efficientnet_b3_ns', pretrained=False, num_classes=1)
    base = base.to(DEVICE).to(memory_format=torch.channels_last)
    ema = AveragedModel(base)
    ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    ema.load_state_dict(ckpt['model'], strict=True)
    ema.eval()
    return ema

@torch.no_grad()
def infer_models(model_specs):
    # model_specs: list of dicts {path, kind ('b0'|'b3'), size}
    all_preds = []
    for spec in model_specs:
        size = spec['size']
        ds = CacheTestDS(TEST_DF, CACHE_DIR_512, size=size)
        loader = DataLoader(ds, batch_size=2, shuffle=False, num_workers=0, pin_memory=True)
        if spec['kind'] == 'b0':
            model = load_b0_model(spec['path'])
        else:
            model = load_b3_ema_model(spec['path'])
        preds = []
        for xb, _ in loader:
            xb = xb.to(DEVICE, non_blocking=True).contiguous(memory_format=torch.channels_last)
            xb_flip = torch.flip(xb, dims=[3])
            with torch.amp.autocast('cuda', enabled=(DEVICE=='cuda')):
                p1 = model(xb).squeeze(1).float()
                p2 = model(xb_flip).squeeze(1).float()
                p = 0.5 * (p1 + p2)
            preds.append(p.detach().cpu().numpy())
        all_preds.append(np.concatenate(preds))
        del model; torch.cuda.empty_cache()
    return np.mean(np.stack(all_preds, axis=0), axis=0)

def apply_thresholds(preds, cuts):
    t = np.sort(np.asarray(cuts).astype(float))
    y = np.digitize(preds, t)
    return np.clip(y, 0, 4).astype(int)

# Assemble available checkpoints
model_specs = []
# b3@512 folds 0-1 (EMA checkpoints present)
if os.path.exists('models/b3_512_reg_v1_fold0.pt'):
    model_specs.append({'path': 'models/b3_512_reg_v1_fold0.pt', 'kind': 'b3', 'size': 512})
if os.path.exists('models/b3_512_reg_v1_fold1.pt'):
    model_specs.append({'path': 'models/b3_512_reg_v1_fold1.pt', 'kind': 'b3', 'size': 512})
# b0@512 folds 0-4
for f in range(5):
    p = f'models/b0_512_reg_v1_fold{f}.pt'
    if os.path.exists(p):
        model_specs.append({'path': p, 'kind': 'b0', 'size': 512})

assert len(model_specs) > 0, 'No ensemble checkpoints found.'
print('Ensembling models:', model_specs)

blended_preds = infer_models(model_specs)
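# NOTE: these cutpoints were tuned on B0 OOF only; per the plan, re-optimize on blended OOF once available
with open('thresholds_b0_512_reg_v1.json', 'r') as fh:
    th = json.load(fh)['cuts']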
labels = apply_thresholds(blended_preds, th)
sub = pd.DataFrame({'id_code': TEST_DF['id_code'], 'diagnosis': labels})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape, 'using ensemble of', len(model_specs), 'models')

In [None]:
import os, json, numpy as np, pandas as pd, cv2, torch, timm
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torch.optim.swa_utils import AveragedModel
from scipy.optimize import minimize

EXP_NAME = 'b3_384_reg_v2'
MODEL_NAME = 'tf_efficientnet_b3_ns'
TEST_DF = pd.read_csv('test.csv')
CACHE_DIR = Path('cache_512/test')
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def get_test_tfm(size=384):
    return A.Compose([A.Resize(size, size), A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD), ToTensorV2()])

class CacheTestDS(Dataset):
    def __init__(self, df, img_dir: Path, size: int = 384):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.tfm = get_test_tfm(size)
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        p = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(p), cv2.IMREAD_COLOR)
        if img is None:
            raise FileNotFoundError(str(p))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        x = self.tfm(image=img)['image']
        return x, -1.0

@torch.no_grad()
def infer_fold(ckpt_path: str, batch_size: int = 2, size: int = 384):
    ds = CacheTestDS(TEST_DF, CACHE_DIR, size=size)
    loader = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=False)
    base = timm.create_model(MODEL_NAME, pretrained=False, num_classes=1)
    ema = AveragedModel(base)
    ckpt = torch.load(ckpt_path, map_location='cpu', weights_only=True)
    ema.load_state_dict(ckpt['model'], strict=True)
    ema.eval()
    preds = []
    for xb, _ in loader:
        out = ema(xb).squeeze(1).float()
        preds.append(out.detach().cpu().numpy())
    return np.concatenate(preds)

def apply_thresholds(preds, cuts):
    t = np.sort(np.asarray(cuts).astype(float))
    y = np.digitize(preds, t)
    return np.clip(y, 0, 4).astype(int)

def optimize_thresholds(preds, targets, init=[0.5,1.5,2.5,3.5]):
    preds = np.asarray(preds).ravel().astype(float)
    targets = np.asarray(targets).ravel().astype(int)
    def neg_qwk(c):
        from sklearn.metrics import cohen_kappa_score
        y = apply_thresholds(preds, c)
        return -cohen_kappa_score(targets, y, weights='quadratic')
    res = minimize(neg_qwk, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':500, 'xatol':1e-3, 'fatol':1e-3})
    return np.sort(res.x).tolist()

# Collect fold checkpoints
fold_paths = [f'models/{EXP_NAME}_fold{f}.pt' for f in range(5) if os.path.exists(f'models/{EXP_NAME}_fold{f}.pt')]
assert len(fold_paths) > 0, f'No trained fold checkpoints found for {EXP_NAME}.'
print('Found fold checkpoints:', fold_paths)

# Inference per fold on CPU, then average
fold_preds = [infer_fold(p, batch_size=2, size=384) for p in fold_paths]
test_preds = np.mean(np.stack(fold_preds, axis=0), axis=0)
print('Test preds shape:', test_preds.shape)

# Optimize thresholds on OOF and save
oof_preds_path = f'oof/{EXP_NAME}_oof_preds.npy'
oof_tgts_path = f'oof/{EXP_NAME}_oof_targets.npy'
assert os.path.exists(oof_preds_path) and os.path.exists(oof_tgts_path), 'OOF files not found.'
oof_preds = np.load(oof_preds_path)
oof_tgts = np.load(oof_tgts_path)
cuts = optimize_thresholds(oof_preds, oof_tgts)
with open(f'thresholds_{EXP_NAME}.json', 'w') as fh:
    json.dump({'cuts': cuts}, fh)
print('Optimized thresholds:', cuts)

# Make submission
labels = apply_thresholds(test_preds, cuts)
sub = pd.DataFrame({'id_code': TEST_DF['id_code'], 'diagnosis': labels})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv', sub.shape)

In [None]:
# Sanity check: train a single fold with shorter schedule to verify stability before full CV
import pandas as pd, time
folds_df = pd.read_csv('folds_s42_k5.csv')

# Temporarily reduce epochs/patience for quick validation
NUM_EPOCHS = 10
MIN_EPOCHS = 6
PATIENCE = 5

start = time.time()
oof_pred, y_val, ckpt_path = train_one_fold(0, folds_df)
from sklearn.metrics import cohen_kappa_score
import numpy as np
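# optimize_thresholds may return the cutpoints alone or a (cuts, score) tuple, depending on which cell defined it
res = optimize_thresholds(oof_pred, y_val)
cuts = res[0] if isinstance(res, tuple) else res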
y_round = apply_thresholds(oof_pred, cuts)
q_naive = cohen_kappa_score(y_val.astype(int), np.clip(np.rint(oof_pred), 0, 4).astype(int), weights='quadratic')
q_opt = cohen_kappa_score(y_val.astype(int), y_round.astype(int), weights='quadratic')
print('Sanity Fold0: QWK naive/opt:', q_naive, q_opt, 'pred_mean/std:', float(np.mean(oof_pred)), float(np.std(oof_pred)), 'ckpt:', ckpt_path)
print(f'Wall-clock: {(time.time()-start)/60:.1f} min')

In [None]:
# Sanity check for ordinal pipeline: train a single fold briefly and report QWK
import time, pandas as pd, numpy as np
folds_df = pd.read_csv('folds_s42_k5.csv')

# Short schedule for sanity
NUM_EPOCHS = 8
ACCUM_STEPS = 16

t0 = time.time()
preds_ord, y_val, best_path = train_one_fold(0, folds_df)
from sklearn.metrics import cohen_kappa_score
q = cohen_kappa_score(y_val, preds_ord, weights='quadratic')
print(f'Sanity Fold0 (ordinal): QWK={q:.4f}, preds dist:', np.bincount(preds_ord, minlength=5), 'path:', best_path)
print(f'Wall-clock: {(time.time()-t0)/60:.1f} min')

In [None]:
# Minimal, clean baseline: EfficientNet-B0 @512 regression, single-fold sanity run (FP32, no AMP/EMA/schedulers)
import os, time, json, numpy as np, pandas as pd, cv2, torch, timm
import torch.nn as nn
from pathlib import Path
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
from sklearn.metrics import cohen_kappa_score

torch.backends.cudnn.benchmark = True
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
CACHE_DIR = Path('cache_512/train')
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def qwk(y_true, y_pred):
    return cohen_kappa_score(np.asarray(y_true).astype(int), np.asarray(y_pred).astype(int), weights='quadratic')

def apply_thresholds(preds, cuts):
    t = np.sort(np.asarray(cuts).astype(float))
    y = np.digitize(preds, t)
    return np.clip(y, 0, 4).astype(int)

def optimize_thresholds(preds, targets, init=[0.5,1.5,2.5,3.5]):
    from scipy.optimize import minimize
    preds = np.asarray(preds).ravel().astype(float)
    targets = np.asarray(targets).ravel().astype(int)
    def neg_qwk(c):
        y = apply_thresholds(preds, c)
        return -qwk(targets, y)
    res = minimize(neg_qwk, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':400, 'xatol':1e-3, 'fatol':1e-3})
    return np.sort(res.x)

class CacheDS(Dataset):
    def __init__(self, df, img_dir: Path, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.train = train
        self.tfm = A.Compose([
            A.Resize(512, 512),
            A.HorizontalFlip(p=0.5 if train else 0.0),
            A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
            ToTensorV2(),
        ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        p = self.img_dir / f"{r['id_code']}.png"
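        img = cv2.imread(str(p), cv2.IMREAD_COLOR)
        if img is None:
            # cv2.imread returns None for missing/unreadable files
            raise FileNotFoundError(str(p))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)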
        x = self.tfm(image=img)['image']
        y = torch.tensor(float(r['diagnosis']), dtype=torch.float32)
        return x, y

def sanity_train_b0_fold0(epochs=5, bs=8, lr=1e-4):
    # Clamp batch size to avoid CUDA OOM in FP32 B0@512
    if bs > 2:
        print(f"[Info] Reducing batch size from {bs} to 2 to prevent OOM.")
        bs = 2
    folds = pd.read_csv('folds_s42_k5.csv')
    trn_df = folds[folds.fold != 0].copy()
    val_df = folds[folds.fold == 0].copy()
    trn_ds = CacheDS(trn_df, CACHE_DIR, train=True)
    val_ds = CacheDS(val_df, CACHE_DIR, train=False)
    trn_loader = DataLoader(trn_ds, batch_size=bs, shuffle=True, num_workers=2, pin_memory=(DEVICE=='cuda'))
    val_loader = DataLoader(val_ds, batch_size=bs, shuffle=False, num_workers=2, pin_memory=(DEVICE=='cuda'))

    model = timm.create_model('tf_efficientnet_b0_ns', pretrained=True, num_classes=1).to(DEVICE)
    # Explicitly disable grad checkpointing to avoid CheckpointError
    try:
        model.set_grad_checkpointing(False)
    except Exception:
        pass
    if DEVICE=='cuda':
        model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-5)
    crit = nn.SmoothL1Loss(reduction='mean')

    best_q = -1.0; best_pred = None
    y_val = val_df['diagnosis'].values.astype(int)
    for ep in range(1, epochs+1):
        model.train()
        for xb, yb in trn_loader:
            xb = xb.to(DEVICE, non_blocking=True)
            if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
            yb = yb.to(DEVICE, non_blocking=True)
            out = model(xb).squeeze(1)
            loss = crit(out, yb)
            opt.zero_grad(set_to_none=True)
            loss.backward()
            opt.step()
        # Validate
        model.eval(); preds = []
        with torch.no_grad():
            for xb, _ in val_loader:
                xb = xb.to(DEVICE, non_blocking=True)
                if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
                p = model(xb).squeeze(1).float()
                preds.append(p.detach().cpu().numpy())
        preds = np.concatenate(preds)
        cuts = optimize_thresholds(preds, y_val)
        q = qwk(y_val, apply_thresholds(preds, cuts))
        print(f'Epoch {ep}: val_qwk_opt={q:.4f}, preds(mean/std)=({preds.mean():.3f}/{preds.std():.3f})')
        if q > best_q: best_q, best_pred = q, preds.copy()
        if DEVICE=='cuda': torch.cuda.empty_cache()
    return best_q

print('Minimal B0@512 regression sanity runner ready (FP32). Run sanity_train_b0_fold0() to verify learning.')

In [None]:
# Execute minimal B0@512 sanity run (single fold, few epochs) to validate pipeline
best_q = sanity_train_b0_fold0(epochs=6, bs=8, lr=1e-4)
print('Sanity Fold0 best QWK (opt thresholds):', best_q)

In [None]:
# Enhanced B0@512 single-fold trainer with sampler, aug, AMP, accumulation, clipping, cosine LR
import math, numpy as np, pandas as pd, torch, timm, albumentations as A, cv2, random
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader, WeightedRandomSampler
from albumentations.pytorch import ToTensorV2
from sklearn.metrics import cohen_kappa_score
from pathlib import Path

cfg = {
    'size': 512,
    'epochs': 12,
    'batch_size': 2,  # small to fit; use accumulation to simulate 8
    'accum': 4,
    'lr': 2e-4,
    'weight_decay': 1e-5,
    'model': 'tf_efficientnet_b0_ns',
    'num_workers': 2,
    'grad_clip': 1.0,
    'seed': 42,
}
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
CACHE_DIR = Path('cache_512/train')

def seed_everything(seed: int):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

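def seed_worker(worker_id):
    # torch.initial_seed() inside a worker is the DataLoader's base seed + worker_id.
    # Deriving from it keeps runs reproducible while giving each epoch a fresh
    # augmentation stream (a fixed cfg['seed'] + worker_id would replay the same one).
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)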

seed_everything(cfg['seed'])

def qwk(y_true, y_pred):
    return cohen_kappa_score(np.asarray(y_true).astype(int), np.asarray(y_pred).astype(int), weights='quadratic')

def apply_thresholds(preds, cuts):
    t = np.sort(np.asarray(cuts).astype(float))
    y = np.digitize(preds, t)
    return np.clip(y, 0, 4).astype(int)

def optimize_thresholds(preds, targets, init=[0.5,1.5,2.5,3.5]):
    from scipy.optimize import minimize
    preds = np.asarray(preds).ravel().astype(float)
    targets = np.asarray(targets).ravel().astype(int)
    def neg_qwk(c):
        y = apply_thresholds(preds, c)
        return -qwk(targets, y)
    res = minimize(neg_qwk, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':500, 'xatol':1e-3, 'fatol':1e-3})
    return np.sort(res.x)

class CacheDSAug(Dataset):
    def __init__(self, df, img_dir: Path, train=True, size=512):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.train = train
        if train:
            self.tfm = A.Compose([
                A.Resize(size, size),
                A.HorizontalFlip(p=0.5),
                A.Rotate(limit=10, border_mode=cv2.BORDER_REFLECT_101, p=0.5),
                A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.4),
                A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
                ToTensorV2(),
            ])
        else:
            self.tfm = A.Compose([
                A.Resize(size, size),
                A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
                ToTensorV2(),
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        p = self.img_dir / f"{r['id_code']}.png"
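        img = cv2.imread(str(p), cv2.IMREAD_COLOR)
        if img is None:
            # cv2.imread returns None for missing/unreadable files
            raise FileNotFoundError(str(p))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)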
        x = self.tfm(image=img)['image']
        y = torch.tensor(float(r['diagnosis']), dtype=torch.float32)
        return x, y

def make_weighted_sampler(labels: np.ndarray):
    vals, counts = np.unique(labels, return_counts=True)
    freq = {v: c for v, c in zip(vals, counts)}
    w_per_class = {v: 1.0 / c for v, c in freq.items()}
    weights = np.array([w_per_class[int(y)] for y in labels], dtype=np.float32)
    return WeightedRandomSampler(weights.tolist(), num_samples=len(weights), replacement=True)

def train_enhanced_b0_fold0():
    folds = pd.read_csv('folds_s42_k5.csv')
    trn_df = folds[folds.fold != 0].copy()
    val_df = folds[folds.fold == 0].copy()
    trn_ds = CacheDSAug(trn_df, CACHE_DIR, train=True, size=cfg['size'])
    val_ds = CacheDSAug(val_df, CACHE_DIR, train=False, size=cfg['size'])
    sampler = make_weighted_sampler(trn_df['diagnosis'].values.astype(int))
    gen_tr = torch.Generator(); gen_tr.manual_seed(cfg['seed'] + 123)
    gen_va = torch.Generator(); gen_va.manual_seed(cfg['seed'] + 456)
    trn_loader = DataLoader(trn_ds, batch_size=cfg['batch_size'], sampler=sampler, num_workers=cfg['num_workers'], pin_memory=(DEVICE=='cuda'), worker_init_fn=seed_worker, generator=gen_tr)
    val_loader = DataLoader(val_ds, batch_size=max(2, cfg['batch_size']), shuffle=False, num_workers=cfg['num_workers'], pin_memory=(DEVICE=='cuda'), worker_init_fn=seed_worker, generator=gen_va)

    model = timm.create_model(cfg['model'], pretrained=True, num_classes=1).to(DEVICE)
    try: model.set_grad_checkpointing(False)
    except Exception: pass
    if DEVICE=='cuda': model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=cfg['lr'], weight_decay=cfg['weight_decay'])
    scaler = torch.amp.GradScaler('cuda', enabled=(DEVICE=='cuda'))
    crit = nn.SmoothL1Loss(reduction='mean')

    # Cosine schedule over total steps
    steps_per_epoch = math.ceil(len(trn_loader) / max(1, cfg['accum']))
    total_steps = cfg['epochs'] * steps_per_epoch
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=total_steps, eta_min=cfg['lr'] * 0.1)

    best_q = -1.0
    y_val = val_df['diagnosis'].values.astype(int)
    global_step = 0
    for ep in range(1, cfg['epochs'] + 1):
        model.train()
        opt.zero_grad(set_to_none=True)
        for it, (xb, yb) in enumerate(trn_loader):
            xb = xb.to(DEVICE, non_blocking=True)
            if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
            yb = yb.to(DEVICE, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(DEVICE=='cuda')):
                out = model(xb).squeeze(1)
                loss = crit(out, yb) / cfg['accum']
            if DEVICE=='cuda':
                scaler.scale(loss).backward()
            else:
                loss.backward()
            if ((it + 1) % cfg['accum'] == 0) or ((it + 1) == len(trn_loader)):
                if DEVICE=='cuda': scaler.unscale_(opt)
                nn.utils.clip_grad_norm_(model.parameters(), max_norm=cfg['grad_clip'])
                if DEVICE=='cuda':
                    scaler.step(opt); scaler.update()
                else:
                    opt.step()
                opt.zero_grad(set_to_none=True)
                scheduler.step()
                global_step += 1

        # Validate
        model.eval(); preds = []
        with torch.no_grad():
            for xb, _ in val_loader:
                xb = xb.to(DEVICE, non_blocking=True)
                if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
                with torch.amp.autocast('cuda', enabled=(DEVICE=='cuda')):
                    p = model(xb).squeeze(1).float()
                preds.append(p.detach().cpu().numpy())
        preds = np.concatenate(preds)
        cuts = optimize_thresholds(preds, y_val)
        q = qwk(y_val, apply_thresholds(preds, cuts))
        print(f'Epoch {ep}: val_qwk_opt={q:.4f}, lr={opt.param_groups[0]["lr"]:.2e}, pred_mean/std=({preds.mean():.3f}/{preds.std():.3f})')
        if q > best_q: best_q = q
        if DEVICE=='cuda': torch.cuda.empty_cache()
    print('Enhanced sanity Fold0 best QWK:', best_q)
    return best_q

print('Enhanced B0@512 trainer ready: sampler+aug+AMP+accum+clip+cosine (deterministic). Run train_enhanced_b0_fold0() and aim for QWK>0.89.')

In [None]:
# Run enhanced single-fold training aiming for QWK > 0.89
best_q_enh = train_enhanced_b0_fold0()
print('Enhanced Fold0 best QWK:', best_q_enh)
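
The accumulation condition in the training loop steps the optimizer on every `accum`-th batch and once more on a trailing partial group, so the scheduler should see `ceil(n_batches / accum)` steps per epoch. The sketch below checks that count in isolation; `count_optimizer_steps` is a hypothetical helper, not part of the notebook, and the batch counts are illustrative.

```python
import math

def count_optimizer_steps(n_batches: int, accum: int) -> int:
    """Count optimizer steps in one epoch under the loop's accumulation rule."""
    steps = 0
    for it in range(n_batches):
        # Same condition as the training loop: a full group, or the last batch of the epoch.
        if ((it + 1) % accum == 0) or ((it + 1) == n_batches):
            steps += 1
    return steps

# Illustrative (n_batches, accum) pairs, including a trailing partial group.
for n_batches, accum in [(10, 4), (8, 4), (1, 4), (1465, 4)]:
    steps = count_optimizer_steps(n_batches, accum)
    assert steps == math.ceil(n_batches / accum)
    print(n_batches, accum, steps)
```

Because each batch loss is divided by `accum`, a trailing group with fewer than `accum` batches contributes a proportionally smaller gradient; that is usually acceptable for one step per epoch.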

In [1]:
# 5-fold CV for enhanced B0@512: EMA, early stopping, warmup+cosine, cleaner aug, accum=4, OOF+TTA submission
import os, json, math, numpy as np, pandas as pd, torch, timm, cv2, random
from pathlib import Path
from torch.utils.data import DataLoader, WeightedRandomSampler, Dataset
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torch.optim.swa_utils import AveragedModel

# Best-effort: try to disable gradient checkpointing globally to avoid CheckpointError (silently skipped if this torch build lacks the helper)
try:
    import torch.utils.checkpoint as cp
    cp.set_checkpoint_enabled(False)
    print('[Init] torch.utils.checkpoint globally disabled')
except Exception:
    pass

models_dir = Path('models'); models_dir.mkdir(exist_ok=True)
oof_dir = Path('oof'); oof_dir.mkdir(exist_ok=True)

# Ensure core config/globals exist
if 'cfg' not in globals():
    cfg = {
        'size': 512,
        'epochs': 20,
        'batch_size': 2,
        'accum': 4,
        'lr': 2e-4,
        'weight_decay': 1e-5,
        'model': 'tf_efficientnet_b0_ns',
        'num_workers': 2,
        'grad_clip': 1.0,
        'seed': 42,
        'warmup_epochs': 1,
        'patience': 5,
        'ema_decay': 0.999,
        'min_epochs': 8,
    }
else:
    cfg['epochs'] = cfg.get('epochs', 20)
    cfg['accum'] = 4
    cfg['lr'] = 2e-4
    cfg['warmup_epochs'] = cfg.get('warmup_epochs', 1)
    cfg['patience'] = 5
    cfg['ema_decay'] = cfg.get('ema_decay', 0.999)
    cfg['seed'] = cfg.get('seed', 42)
    cfg['size'] = 512
    cfg['batch_size'] = 2
    cfg['weight_decay'] = cfg.get('weight_decay', 1e-5)
    cfg['model'] = cfg.get('model', 'tf_efficientnet_b0_ns')
    cfg['num_workers'] = 2
    cfg['grad_clip'] = cfg.get('grad_clip', 1.0)
    cfg['min_epochs'] = cfg.get('min_epochs', 8)
if 'DEVICE' not in globals():
    DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
if 'IMAGENET_MEAN' not in globals(): IMAGENET_MEAN = (0.485, 0.456, 0.406)
if 'IMAGENET_STD' not in globals(): IMAGENET_STD = (0.229, 0.224, 0.225)

# Ensure deterministic seeding utilities exist in this scope
if 'seed_everything' not in globals():
    def seed_everything(seed: int):
        random.seed(seed); np.random.seed(seed); torch.manual_seed(seed)
        if torch.cuda.is_available(): torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True; torch.backends.cudnn.benchmark = False
if 'seed_worker' not in globals():
    def seed_worker(worker_id):
        # torch.initial_seed() is derived from the loader's generator, so it is deterministic yet differs per worker and per epoch
        worker_seed = torch.initial_seed() % (2**32)
        np.random.seed(worker_seed); random.seed(worker_seed)

# Expect apply_thresholds, optimize_thresholds, qwk defined earlier; if not, define minimal fallbacks
if 'apply_thresholds' not in globals():
    def apply_thresholds(preds, cuts):
        t = np.sort(np.asarray(cuts).astype(float)); y = np.digitize(preds, t)
        return np.clip(y, 0, 4).astype(int)
if 'optimize_thresholds' not in globals():
    def optimize_thresholds(preds, targets, init=(0.5, 1.5, 2.5, 3.5)):
        from scipy.optimize import minimize
        from sklearn.metrics import cohen_kappa_score
        def qwk_local(y_true, y_pred):
            return cohen_kappa_score(np.asarray(y_true).astype(int), np.asarray(y_pred).astype(int), weights='quadratic')
        preds = np.asarray(preds).ravel().astype(float); targets = np.asarray(targets).ravel().astype(int)
        def neg_qwk(c): return -qwk_local(targets, apply_thresholds(preds, c))
        res = minimize(neg_qwk, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':500,'xatol':1e-3,'fatol':1e-3})
        return np.sort(res.x)
if 'qwk' not in globals():
    from sklearn.metrics import cohen_kappa_score
    def qwk(y_true, y_pred):
        return cohen_kappa_score(np.asarray(y_true).astype(int), np.asarray(y_pred).astype(int), weights='quadratic')

class CacheTestDS(Dataset):
    def __init__(self, df, img_dir: Path, size: int):
        self.df = df.reset_index(drop=True); self.img_dir = Path(img_dir)
        self.tfm = A.Compose([A.Resize(size, size), A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD), ToTensorV2()])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]; p = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(p), cv2.IMREAD_COLOR); img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        x = self.tfm(image=img)['image']; return x, -1.0

def make_weighted_sampler(labels: np.ndarray):
    vals, counts = np.unique(labels, return_counts=True); wpc = {int(v): 1.0/float(c) for v, c in zip(vals, counts)}
    weights = np.array([wpc[int(y)] for y in labels], dtype=np.float32)
    return WeightedRandomSampler(weights.tolist(), num_samples=len(weights), replacement=True)

class CacheDSAugCV(Dataset):
    def __init__(self, df, img_dir: Path, train=True, size=512):
        self.df = df.reset_index(drop=True); self.img_dir = Path(img_dir)
        if train:
            self.tfm = A.Compose([
                A.Resize(size, size),
                A.HorizontalFlip(p=0.5),
                A.Rotate(limit=15, border_mode=cv2.BORDER_REFLECT_101, p=0.5),
                A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5),
                A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
                ToTensorV2(),
            ])
        else:
            self.tfm = A.Compose([A.Resize(size, size), A.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD), ToTensorV2()])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]; p = self.img_dir / f"{r['id_code']}.png"
        img = cv2.imread(str(p), cv2.IMREAD_COLOR); img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        x = self.tfm(image=img)['image']; y = torch.tensor(float(r['diagnosis']), dtype=torch.float32)
        return x, y

def _warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    if step < warmup_steps:
        return base_lr * (step + 1) / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

def train_one_fold_b0_enh(fold: int):
    # Also ensure checkpointing disabled inside function scope
    try:
        import torch.utils.checkpoint as cp
        cp.set_checkpoint_enabled(False)
    except Exception:
        pass
    base_seed = cfg.get('seed', 42); seed_everything(int(base_seed) + int(fold))
    folds = pd.read_csv('folds_s42_k5.csv')
    trn_df = folds[folds.fold != fold].copy(); val_df = folds[folds.fold == fold].copy()
    trn_ds = CacheDSAugCV(trn_df, Path('cache_512/train'), train=True, size=cfg['size']); val_ds = CacheDSAugCV(val_df, Path('cache_512/train'), train=False, size=cfg['size'])
    sampler = make_weighted_sampler(trn_df['diagnosis'].values.astype(int))
    gen_tr = torch.Generator(); gen_tr.manual_seed(int(base_seed) + 1000 + int(fold))
    gen_va = torch.Generator(); gen_va.manual_seed(int(base_seed) + 2000 + int(fold))
    trn_loader = DataLoader(trn_ds, batch_size=cfg['batch_size'], sampler=sampler, num_workers=cfg['num_workers'], pin_memory=(DEVICE=='cuda'), worker_init_fn=seed_worker, generator=gen_tr)
    val_loader = DataLoader(val_ds, batch_size=max(2, cfg['batch_size']), shuffle=False, num_workers=cfg['num_workers'], pin_memory=(DEVICE=='cuda'), worker_init_fn=seed_worker, generator=gen_va)

    model = timm.create_model(cfg['model'], pretrained=True, num_classes=1).to(DEVICE)
    # Robustly disable any gradient checkpointing
    try: model.set_grad_checkpointing(False)
    except Exception: pass
    try:
        from timm.layers import set_grad_checkpointing as timm_set_gc
        timm_set_gc(model, False)
    except Exception:
        try:
            from timm.models.layers import set_grad_checkpointing as timm_set_gc_legacy
            timm_set_gc_legacy(model, False)
        except Exception:
            pass
    if hasattr(model, 'grad_checkpointing'):
        try: model.grad_checkpointing = False
        except Exception: pass
    if DEVICE=='cuda': model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=cfg['lr'], weight_decay=cfg['weight_decay'])
    # Disable AMP to avoid any interaction with checkpointing
    scaler = torch.amp.GradScaler('cuda', enabled=False)
    crit = torch.nn.SmoothL1Loss(reduction='mean')
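    # AveragedModel with avg_fn=None is an equal-weight running mean (SWA), so pass an explicit EMA rule that uses cfg['ema_decay']
    ema_decay = float(cfg['ema_decay'])
    ema_model = AveragedModel(model, avg_fn=lambda avg, cur, n: ema_decay * avg + (1.0 - ema_decay) * cur)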

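    steps_per_epoch = math.ceil(len(trn_loader) / max(1, cfg['accum']))
    total_steps = max(1, cfg['epochs'] * steps_per_epoch)
    warmup_steps = int(cfg['warmup_epochs'] * steps_per_epoch)
    # Start at the warmup LR; otherwise the first optimizer step would run at the full base LR
    for pg in opt.param_groups: pg['lr'] = _warmup_cosine_lr(0, total_steps, cfg['lr'], warmup_steps)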

    best_q = -1.0; best_preds = None
    y_val = val_df['diagnosis'].values.astype(int)
    global_step = 0; epochs_no_improve = 0
    for ep in range(1, cfg['epochs'] + 1):
        model.train(); opt.zero_grad(set_to_none=True)
        running_loss = 0.0; n_loss = 0
        for it, (xb, yb) in enumerate(trn_loader):
            xb = xb.to(DEVICE, non_blocking=True)
            if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
            yb = yb.to(DEVICE, non_blocking=True)
            # FP32 forward (no autocast) to avoid checkpoint incompatibilities
            out = model(xb).squeeze(1); loss = crit(out, yb) / cfg['accum']
            loss.backward()
            running_loss += loss.detach().float().item(); n_loss += 1
            if ((it + 1) % cfg['accum'] == 0) or ((it + 1) == len(trn_loader)):
                torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=cfg['grad_clip'])
                opt.step()
                opt.zero_grad(set_to_none=True)
                # Correct EMA update
                ema_model.update_parameters(model)
                # Per-step LR schedule (warmup + cosine)
                global_step += 1
                lr_now = _warmup_cosine_lr(global_step, total_steps, cfg['lr'], warmup_steps)
                for pg in opt.param_groups: pg['lr'] = lr_now

        # Update BN stats of EMA before validation
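        try:
            torch.optim.swa_utils.update_bn(trn_loader, ema_model, device=DEVICE)
        except Exception as e:
            # Do not fail the fold, but make stale EMA BatchNorm statistics visible in the log
            print(f'[Fold {fold}] update_bn failed ({type(e).__name__}: {e}); EMA BN stats may be stale')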

        # Validate EMA
        ema_model.eval(); preds_ema = []
        with torch.no_grad():
            for xb, _ in val_loader:
                xb = xb.to(DEVICE, non_blocking=True)
                if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
                p = ema_model(xb).squeeze(1).float()
                preds_ema.append(p.detach().cpu().numpy())
        preds_ema = np.concatenate(preds_ema)
        cuts_ema = optimize_thresholds(preds_ema, y_val); q_ema = qwk(y_val, apply_thresholds(preds_ema, cuts_ema))

        # Validate base model too
        model.eval(); preds_base = []
        with torch.no_grad():
            for xb, _ in val_loader:
                xb = xb.to(DEVICE, non_blocking=True)
                if DEVICE=='cuda': xb = xb.contiguous(memory_format=torch.channels_last)
                p = model(xb).squeeze(1).float()
                preds_base.append(p.detach().cpu().numpy())
        preds_base = np.concatenate(preds_base)
        cuts_base = optimize_thresholds(preds_base, y_val); q_base = qwk(y_val, apply_thresholds(preds_base, cuts_base))

        avg_loss = (running_loss / max(1, n_loss)) * cfg['accum']
        print(f"[Fold {fold}] Epoch {ep}: loss={avg_loss:.4f}, lr={opt.param_groups[0]['lr']:.2e}, Q_base={q_base:.4f} (m/s={preds_base.mean():.3f}/{preds_base.std():.3f}), Q_ema={q_ema:.4f} (m/s={preds_ema.mean():.3f}/{preds_ema.std():.3f})")

        q_curr = max(q_ema, q_base); use_ema = (q_ema >= q_base); cuts_curr = (cuts_ema if use_ema else cuts_base); preds_curr = (preds_ema if use_ema else preds_base)
        if q_curr > best_q + 1e-4:
            best_q = q_curr; best_preds = preds_curr.copy()
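            # Save the weights that produced best_preds; unwrap AveragedModel (.module) so the keys match a plain timm model at load time
            best_state = (ema_model.module if use_ema else model).state_dict()
            torch.save({'model': best_state}, models_dir / f'b0_512_enh_v2_ema_fold{int(fold)}.pt')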
            epochs_no_improve = 0
        else:
            epochs_no_improve += 1
        if (ep >= int(cfg['min_epochs'])) and (epochs_no_improve >= int(cfg['patience'])):
            print(f'[Fold {fold}] Early stopping at epoch {ep} (best QWK={best_q:.4f})')
            break
        if DEVICE=='cuda': torch.cuda.empty_cache()
    return best_preds, y_val, float(best_q)

def run_cv_b0_enh_and_submit():
    folds = pd.read_csv('folds_s42_k5.csv')
    oof_preds = np.zeros(len(folds), dtype=np.float32); oof_tgts = folds['diagnosis'].values.astype(int); fold_q = []
    for fold in sorted(folds['fold'].unique()):
        preds, y_val, bq = train_one_fold_b0_enh(int(fold))
        idx = folds.index[folds.fold == fold].to_numpy(); oof_preds[idx] = preds; fold_q.append(bq)
        print(f'Fold {fold} best QWK: {bq:.4f}')
    np.save(oof_dir / 'b0_512_enh_v2_oof_preds.npy', oof_preds)
    np.save(oof_dir / 'b0_512_enh_v2_oof_targets.npy', oof_tgts)
    cuts = optimize_thresholds(oof_preds, oof_tgts).tolist()
    with open('thresholds_b0_512_enh_v2.json', 'w') as fh: json.dump({'cuts': cuts}, fh)
    print('OOF QWK (opt):', qwk(oof_tgts, apply_thresholds(oof_preds, cuts)), 'per-fold:', fold_q)

    test_df = pd.read_csv('test.csv')
    test_ds = CacheTestDS(test_df, Path('cache_512/test'), size=cfg['size'])
    test_loader = DataLoader(test_ds, batch_size=2, shuffle=False, num_workers=cfg['num_workers'], pin_memory=(DEVICE=='cuda'), worker_init_fn=seed_worker)
    all_fold_preds = []
    for fold in sorted(folds['fold'].unique()):
        model = timm.create_model(cfg['model'], pretrained=False, num_classes=1).to(DEVICE)
        try: model.set_grad_checkpointing(False)
        except Exception: pass
        try:
            from timm.layers import set_grad_checkpointing as timm_set_gc
            timm_set_gc(model, False)
        except Exception:
            try:
                from timm.models.layers import set_grad_checkpointing as timm_set_gc_legacy
                timm_set_gc_legacy(model, False)
            except Exception:
                pass
        if DEVICE=='cuda': model = model.to(memory_format=torch.channels_last)
        ck = torch.load(models_dir / f'b0_512_enh_v2_ema_fold{int(fold)}.pt', map_location='cpu', weights_only=True)
        model.load_state_dict(ck['model'], strict=True); model.eval()
        preds_fold = []
        with torch.no_grad():
            for xb, _ in test_loader:
                xb = xb.to(DEVICE, non_blocking=True).contiguous(memory_format=torch.channels_last)
                xb_flip = torch.flip(xb, dims=[3])
                # FP32 inference (no autocast) for safety
                p1 = model(xb).squeeze(1).float(); p2 = model(xb_flip).squeeze(1).float(); p = 0.5 * (p1 + p2)
                preds_fold.append(p.detach().cpu().numpy())
        all_fold_preds.append(np.concatenate(preds_fold)); del model; torch.cuda.empty_cache()
    test_preds = np.mean(np.stack(all_fold_preds, axis=0), axis=0); labels = apply_thresholds(test_preds, cuts)
    sub = pd.DataFrame({'id_code': test_df['id_code'], 'diagnosis': labels}); sub.to_csv('submission.csv', index=False)
    print('Saved submission.csv with EMA 5-fold TTA. Shape:', sub.shape)

print('5-fold CV (EMA+update_bn+dual-val+relaxed-ES+warmup+cosine+clean aug+accum=4, size=512, bs=2, nw=2, lr=2e-4, no-checkpointing) ready: run run_cv_b0_enh_and_submit().')

5-fold CV (EMA+update_bn+dual-val+relaxed-ES+warmup+cosine+clean aug+accum=4, size=512, bs=2, nw=2, lr=2e-4, no-checkpointing) ready: run run_cv_b0_enh_and_submit().
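
To make the per-step schedule concrete, here is a minimal standalone sketch of the same warmup-plus-cosine formula used by `_warmup_cosine_lr`. The step counts and base LR below are illustrative assumptions, not values measured from a run.

```python
import math

def warmup_cosine_lr(step, total_steps, base_lr, warmup_steps):
    """Linear warmup to base_lr, then half-cosine decay towards zero."""
    if step < warmup_steps:
        return base_lr * (step + 1) / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * t))

# Illustrative numbers only: 10 warmup steps out of 100, base LR 2e-4.
total_steps, warmup_steps, base_lr = 100, 10, 2e-4
for step in (0, 4, 9, 10, 55, 100):
    lr = warmup_cosine_lr(step, total_steps, base_lr, warmup_steps)
    print(f"step={step:3d} lr={lr:.2e}")
```

The warmup branch reaches `base_lr` on its last step, the cosine branch starts from `base_lr` at `t = 0`, and the LR falls to half of `base_lr` midway through the decay phase.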


In [None]:
# Run full 5-fold CV with enhanced B0@512 and generate submission.csv
import time
t0 = time.time()
run_cv_b0_enh_and_submit()
print(f'Total wall-clock: {(time.time()-t0)/60:.1f} min')
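
The CV run converts regression outputs to grades with four cutpoints. As a small sketch of that mapping, the block below repeats the fallback `apply_thresholds` logic on made-up scores; the scores and the default cutpoints are illustrative, not optimized values from this experiment.

```python
import numpy as np

def apply_thresholds(preds, cuts):
    """Map continuous scores to grades 0-4 using four sorted cutpoints."""
    t = np.sort(np.asarray(cuts, dtype=float))
    return np.clip(np.digitize(preds, t), 0, 4).astype(int)

# Made-up regression outputs, including values outside the 0-4 range.
scores = np.array([-0.3, 0.49, 0.5, 1.2, 2.6, 3.5, 7.0])
grades = apply_thresholds(scores, [0.5, 1.5, 2.5, 3.5])
print(grades.tolist())
```

A score equal to a cutpoint falls into the higher grade, and out-of-range scores are assigned to grades 0 and 4.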

In [None]:
# Single-fold sanity check for v2 pipeline (expect QWK >= 0.89 before full CV)
import time
t0 = time.time()
preds_fold0, y_val_fold0, best_q_fold0 = train_one_fold_b0_enh(0)
print(f"Fold 0 sanity check — best QWK: {best_q_fold0:.4f}. Wall-clock: {(time.time()-t0)/60:.1f} min")
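
When reading `Q_ema` against `Q_base` in the fold logs, it helps to remember that `AveragedModel` computes an equal-weight running average unless an averaging function is supplied, which behaves very differently from an exponential moving average governed by a decay such as `cfg['ema_decay']`. The pure-Python sketch below contrasts the two update rules on a single scalar "weight"; the helper names, the decay of 0.9, and the toy sequence are assumptions chosen for illustration.

```python
def running_mean(avg, new, n_averaged):
    """Equal-weight update: every past value counts the same."""
    return avg + (new - avg) / (n_averaged + 1)

def ema(avg, new, decay):
    """Exponential moving average: recent values dominate."""
    return decay * avg + (1.0 - decay) * new

# A scalar "weight" that jumps from 0.0 to 1.0 halfway through training (toy data).
values = [0.0] * 50 + [1.0] * 50
mean_avg = ema_avg = values[0]
for n, v in enumerate(values[1:], start=1):
    mean_avg = running_mean(mean_avg, v, n)
    ema_avg = ema(ema_avg, v, decay=0.9)

print(f"equal-weight average: {mean_avg:.3f}")
print(f"EMA (decay=0.9):      {ema_avg:.3f}")
```

The equal-weight average stays anchored to early values, while the EMA tracks the recent ones. With a decay close to 1 and few optimizer steps, an EMA can also lag the base model, which is one reason to validate both.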