# APTOS 2019 Blindness Detection – Medal Plan

Goal: Win a medal via strong CV and fast CNN baseline → iterate with error-driven improvements.

Pipeline overview:
- Validation: 5-fold StratifiedKFold on diagnosis, fixed seed, save folds to disk; monitor OOF QWK.
- Metric: Optimize quadratic weighted kappa via post-training threshold search on OOF.
- Model: timm pretrained CNN (start: tf_efficientnet_b3_ns, fallback: resnet50). Head: regression (single logit) with MSE/Huber; inference uses learned thresholds.
- Image size: start 384px, then 512px if time allows. Mixed precision, GPU mandatory.
- Augmentations (Albumentations): Horizontal/vertical flip, random rotate/shift, RandomResizedCrop, brightness/contrast, CLAHE (light), blur/noise light, Cutout optional. Normalize to ImageNet.
- Preprocessing v0: simple center-crop and resize (no heavy circle crop yet). v1: fast fundus crop via threshold mask.
- Training: AdamW, cosine schedule w/ warmup, early stopping (patience 3), epochs ~10-12 per fold (stop early).
- TTA: simple (hflip) at inference after solid CV.
- Logging: print fold idx, epoch, train/valid loss, valid QWK, elapsed time. Save best per fold, OOF preds, test preds, thresholds.
- Repro: fix seeds, deterministic ops where feasible.

Iteration plan:
1) Sanity: Verify GPU (install torch), list data counts, class distribution; create folds.
2) Baseline: EfficientNet-B3 384px, regression + OOF thresholding. Quick 2-fold smoke test (2-3 epochs) to validate pipeline.
3) Full 5-fold run; cache OOF/test. Submit baseline.
4) Improvements:
   - Better preprocessing: fast circular mask crop; light CLAHE.
   - Stronger aug schedule; try BCE-with-ordinal targets vs regression.
   - Image size 512; lr/weight-decay tuning; label smoothing for CE variant.
   - TTA on LB; blend seeds/models.
5) Error analysis: OOF confusion, per-class QWK bins; adjust thresholds/augs.

Safety/efficiency:
- Subsample for smoke runs; early stop aggressively.
- While training, run expert review on strategy and CV; do not wait on unproductive runs.

Artifacts:
- folds.csv, oof.npy, oof_targets.npy, oof_logits.npy, thresholds.npy, test_preds.npy, submission.csv

Next: Get expert feedback before heavy runs; then environment check and installs.

In [None]:
# Environment setup: install GPU PyTorch and libs; verify GPU
import sys, subprocess, time
def pip_install(args):
    print('Installing:', ' '.join(args), flush=True)
    r = subprocess.run([sys.executable, '-m', 'pip', 'install'] + args, check=False)
    print('Return code:', r.returncode, flush=True)

# Install PyTorch CUDA 12.1 first to avoid CPU wheels
pip_install(['--upgrade', 'pip'])
pip_install(['--index-url', 'https://download.pytorch.org/whl/cu121', 'torch', 'torchvision', 'torchaudio'])
# Core libs
pip_install(['timm', 'albumentations>=1.4.0', 'opencv-python-headless', 'scikit-learn', 'pandas', 'numpy', 'tqdm', 'scipy'])

import torch
print(f'GPU Available: {torch.cuda.is_available()}', flush=True)
print(f'GPU Count: {torch.cuda.device_count()}', flush=True)
if torch.cuda.is_available():
    print(f'GPU Name: {torch.cuda.get_device_name(0)}', flush=True)
    props = torch.cuda.get_device_properties(0)
    print(f'GPU Memory: {props.total_memory / 1024**3:.1f} GB', flush=True)

In [2]:
# Data load, CV folds, and fast 512px cache with circle-crop + Ben filter + CLAHE
import os, math, random, json, gc, shutil, glob, time
from pathlib import Path
import numpy as np
import pandas as pd
import cv2
from tqdm import tqdm
from sklearn.model_selection import StratifiedKFold

SEED = 42
random.seed(SEED); np.random.seed(SEED)

DATA_DIR = Path('.')
TRAIN_DIR = DATA_DIR / 'train_images'
TEST_DIR = DATA_DIR / 'test_images'
FOLDS_CSV = DATA_DIR / 'folds.csv'
CACHE_DIR = DATA_DIR / 'cache512'
CACHE_TRAIN = CACHE_DIR / 'train'
CACHE_TEST = CACHE_DIR / 'test'
IMG_SIZE = 512

# Read CSVs
train_df = pd.read_csv(DATA_DIR / 'train.csv')
test_df = pd.read_csv(DATA_DIR / 'test.csv')
print('Train shape:', train_df.shape, 'Test shape:', test_df.shape, flush=True)
print('Train head:\n', train_df.head(), flush=True)
print('Class distribution (train):\n', train_df['diagnosis'].value_counts().sort_index(), flush=True)

# Create 5-fold stratified splits (fixed and reusable)
if FOLDS_CSV.exists():
    folds_df = pd.read_csv(FOLDS_CSV)
    print('Loaded existing folds.csv with shape', folds_df.shape, flush=True)
else:
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    folds = np.zeros(len(train_df), dtype=np.int32)
    for fold, (_, val_idx) in enumerate(skf.split(train_df['id_code'], train_df['diagnosis'])):
        folds[val_idx] = fold
    folds_df = train_df.copy()
    folds_df['fold'] = folds
    folds_df.to_csv(FOLDS_CSV, index=False)
    print('Saved folds.csv with shape', folds_df.shape, flush=True)
print(folds_df['fold'].value_counts().sort_index(), flush=True)

# --- Fast fundus preprocessing ---
def circle_crop_ben_clahe(img_bgr: np.ndarray, size: int = 512, clahe_p: float = 0.4) -> np.ndarray:
    h, w = img_bgr.shape[:2]
    # Make a quick mask for fundus
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    # Normalize a bit to stabilize threshold
    gray_blur = cv2.GaussianBlur(gray, (0,0), 3)
    thr = cv2.threshold(gray_blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
    # Invert if background is white-ish
    if np.mean(thr) > 127:
        thr = cv2.bitwise_not(thr)
    # Morph close to fill holes
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7,7))
    thr = cv2.morphologyEx(thr, cv2.MORPH_CLOSE, kernel, iterations=2)
    contours, _ = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        cnt = max(contours, key=cv2.contourArea)
        x, y, bw, bh = cv2.boundingRect(cnt)
    else:
        # Fallback center square
        side = min(h, w)
        x = (w - side) // 2; y = (h - side) // 2; bw = side; bh = side
    crop = img_bgr[y:y+bh, x:x+bw]
    ch, cw = crop.shape[:2]
    # Pad to square
    side = max(ch, cw)
    pad_top = (side - ch) // 2; pad_bottom = side - ch - pad_top
    pad_left = (side - cw) // 2; pad_right = side - cw - pad_left
    crop_sq = cv2.copyMakeBorder(crop, pad_top, pad_bottom, pad_left, pad_right, cv2.BORDER_CONSTANT, value=(0,0,0))
    # Resize
    img = cv2.resize(crop_sq, (size, size), interpolation=cv2.INTER_AREA)
    # Ben Graham enhancement (unsharp-like)
    blur = cv2.GaussianBlur(img, (0,0), size/30)  # sigma proportional to size
    img = cv2.addWeighted(img, 4, blur, -4, 128)
    img = np.clip(img, 0, 255).astype(np.uint8)
    # Light CLAHE on L channel
    if random.random() < clahe_p:
        lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
        L, A, B = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        L = clahe.apply(L)
        lab = cv2.merge([L, A, B])
        img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    return img

def ensure_dir(p: Path):
    p.mkdir(parents=True, exist_ok=True)

ensure_dir(CACHE_TRAIN); ensure_dir(CACHE_TEST)

def cache_split(df: pd.DataFrame, src_dir: Path, dst_dir: Path, id_col: str, limit: int | None = None):
    paths = []
    for iid in df[id_col].values.tolist():
        src = src_dir / f"{iid}.png"
        dst = dst_dir / f"{iid}.png"
        paths.append((str(src), str(dst)))
    if limit is not None:
        paths = paths[:limit]
    cnt_exist = sum(os.path.exists(d) for _, d in paths)
    print(f'{dst_dir.name}: {cnt_exist}/{len(paths)} already cached', flush=True)
    to_process = [(s, d) for s, d in paths if not os.path.exists(d)]
    print(f'Processing {len(to_process)} images into {dst_dir} ...', flush=True)
    t0 = time.time()
    for i, (src, dst) in enumerate(tqdm(to_process, total=len(to_process))):
        img = cv2.imread(src, cv2.IMREAD_COLOR)
        if img is None:
            # Skip or copy as-is if missing
            continue
        img = circle_crop_ben_clahe(img, size=IMG_SIZE, clahe_p=0.4)
        cv2.imwrite(dst, img, [cv2.IMWRITE_PNG_COMPRESSION, 3])
        if (i+1) % 200 == 0:
            print(f'Cached {i+1}/{len(to_process)}; elapsed {time.time()-t0:.1f}s', flush=True)
    print(f'Done {dst_dir.name}; total elapsed {time.time()-t0:.1f}s', flush=True)

# Full cache run at 512px for all images (per expert advice)
SMOKE_LIMIT = None  # set to None for full cache run
cache_split(train_df, TRAIN_DIR, CACHE_TRAIN, id_col='id_code', limit=SMOKE_LIMIT)
cache_split(test_df, TEST_DIR, CACHE_TEST, id_col='id_code', limit=SMOKE_LIMIT)
print('Cache complete. Proceed to training.', flush=True)

Train shape: (3295, 2) Test shape: (367, 1)


Train head:
         id_code  diagnosis
0  2a2274bcb00a          0
1  eda29a9d78f3          0
2  789c60cba801          0
3  a07efb1ecfc0          0
4  44855f666225          2


Class distribution (train):
 diagnosis
0    1628
1     340
2     896
3     176
4     255
Name: count, dtype: int64


Loaded existing folds.csv with shape (3295, 3)


fold
0    659
1    659
2    659
3    659
4    659
Name: count, dtype: int64


train: 3295/3295 already cached


Processing 0 images into cache512/train ...


0it [00:00, ?it/s]

0it [00:00, ?it/s]

Done train; total elapsed 0.0s


test: 367/367 already cached


Processing 0 images into cache512/test ...





0it [00:00, ?it/s]

0it [00:00, ?it/s]

Done test; total elapsed 0.0s


Cache complete. Proceed to training.





In [3]:
# Training scaffold: Dataset, Augs, Model (EffNet-B4 @512), CV loop, QWK + threshold optimization
import math, os, time, json, gc, copy, random
from pathlib import Path
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
import cv2
import timm
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold
from tqdm import tqdm
from scipy.optimize import minimize

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
IMG_DIR_TRAIN = CACHE_TRAIN
IMG_DIR_TEST = CACHE_TEST
NUM_CLASSES = 5

# Ensure writable HF cache (timm pretrained weights) to avoid read-only /app/.cache errors
HF_CACHE_DIR = Path('./hf_cache')
HF_CACHE_DIR.mkdir(parents=True, exist_ok=True)
os.environ['HF_HOME'] = str(HF_CACHE_DIR)
os.environ['HUGGINGFACE_HUB_CACHE'] = str(HF_CACHE_DIR)
os.environ['XDG_CACHE_HOME'] = str(HF_CACHE_DIR)
os.environ['HF_HUB_CACHE'] = str(HF_CACHE_DIR)
# Help CUDA allocator avoid fragmentation per PyTorch docs
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'

def seed_everything(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(SEED)

# Hyperparams (full run defaults)
BATCH_SIZE = 8  # start safer; OOM logic can drop to 4 or 2 with accumulation
EPOCHS_FULL = 15
LR = 2e-4
WD = 1e-5
PATIENCE = 3
NUM_WORKERS_TRAIN = 0
NUM_WORKERS_TEST = 6
BACKBONE = 'tf_efficientnet_b4_ns'

# Albumentations pipelines (conservative, per expert advice)
MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.HorizontalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90, 110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

class DRDataset(Dataset):
    def __init__(self, df: pd.DataFrame, img_dir: Path, transforms=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.transforms = transforms
        self.has_target = 'diagnosis' in df.columns
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(img_path), cv2.IMREAD_COLOR)
        if img is None:
            # Fallback: read from original dir if cache missing
            orig = TRAIN_DIR / f"{row['id_code']}.png"
            img = cv2.imread(str(orig), cv2.IMREAD_COLOR)
            if img is None:
                img = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        # Hard enforce size to avoid variable tensor sizes even if transforms fail
        if img.shape[0] != IMG_SIZE or img.shape[1] != IMG_SIZE:
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE), interpolation=cv2.INTER_AREA)
        if self.transforms:
            img = self.transforms(image=img)['image']
        if self.has_target:
            target = float(row['diagnosis'])
            return img, torch.tensor(target, dtype=torch.float32)
        else:
            return img, row['id_code']

class RegHeadModel(nn.Module):
    def __init__(self, backbone_name='tf_efficientnet_b4_ns', pretrained=True):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=pretrained, num_classes=0, global_pool='avg', cache_dir=str(HF_CACHE_DIR))
        # Try to enable gradient checkpointing to save memory
        try:
            if hasattr(self.backbone, 'set_grad_checkpointing'):
                self.backbone.set_grad_checkpointing(True)
        except Exception:
            pass
        in_ch = self.backbone.num_features
        self.head = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(in_ch, 1)
        )
    def forward(self, x):
        feats = self.backbone(x)
        out = self.head(feats).squeeze(1)
        return out

def qwk(y_true, y_pred_cls):
    return cohen_kappa_score(y_true, y_pred_cls, weights='quadratic')

def preds_to_classes(preds, thresholds):
    th0, th1, th2, th3 = thresholds
    return np.digitize(preds, bins=[th0, th1, th2, th3])

def optimize_thresholds(oof_targets, oof_preds, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(oof_targets).astype(float)
    p = np.asarray(oof_preds).astype(float)
    def _loss(th):
        th = np.sort(th)
        th = np.clip(th, -1.0, 4.0)
        cls = preds_to_classes(p, th)
        return -qwk(y, cls)
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05:
            th[i] = th[i-1] + 0.05
    return th

def get_loaders(tr_df, va_df, batch_size=16, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate(model, dl, loss_fn):
    model.eval()
    preds = []; targs = []; val_loss = 0.0; n = 0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            yb = yb.to(device, non_blocking=True)
            with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
                out = model(xb)
                loss = loss_fn(out, yb)
            bs = xb.size(0)
            val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy())
            targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/n, preds, targs

def tta_predict(model, dl):
    model.eval()
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
                p1 = model(xb)
                p2 = model(torch.flip(xb, dims=[-1]))
                p = (p1 + p2) / 2.0
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def train_one_fold(fold, folds_df, epochs=12, lr=2e-4, wd=1e-5, batch_size=16, patience=3, backbone='tf_efficientnet_b4_ns', num_workers=0):
    print(f'\n===== Fold {fold} / {folds_df["fold"].nunique()} =====', flush=True)
    if torch.cuda.is_available():
        try:
            torch.cuda.reset_peak_memory_stats()
        except Exception:
            pass
        print('Peak GB before fold:', f'{torch.cuda.max_memory_allocated()/1024**3:.2f}', flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders(tr_df, va_df, batch_size=batch_size, num_workers=num_workers)
    # Use pretrained=True now that cache dir is writable
    model = RegHeadModel(backbone_name=backbone, pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.cuda.amp.GradScaler(enabled=True)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr):
            return (step+1)/len(dl_tr)
        progress = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5 * (1 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    best_qwk = -1.0; best_state = None; best_preds = None; best_targs = None; no_improve = 0
    global_step = 0; t_start = time.time()
    accum_steps = max(1, 16 // batch_size)
    optimizer.zero_grad(set_to_none=True)
    for epoch in range(1, epochs+1):
        model.train()
        tr_loss = 0.0; n = 0; t0 = time.time()
        for it, (xb, yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
                out = model(xb)
                loss = loss_fn(out, yb) / accum_steps
            scaler.scale(loss).backward()
            if (it + 1) % accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)
                scheduler.step()
            bs = xb.size(0); tr_loss += (loss.item() * accum_steps)*bs; n += bs; global_step += 1
            if (it+1) % 50 == 0:
                print(f'Epoch {epoch} it {it+1}/{len(dl_tr)} tr_loss {(tr_loss/n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        val_loss, v_preds, v_targs = validate(model, dl_va, loss_fn)
        tmp_cls = preds_to_classes(v_preds, [0.5,1.5,2.5,3.5])
        val_q = qwk(v_targs, tmp_cls)
        print(f'Epoch {epoch}: tr_loss {(tr_loss/n):.4f} val_loss {val_loss:.4f} val_qwk {val_q:.4f} epoch_time {time.time()-t0:.1f}s total_elapsed {time.time()-t_start:.1f}s', flush=True)
        if val_q > best_qwk:
            best_qwk = val_q; best_state = copy.deepcopy(model.state_dict()); best_preds = v_preds.copy(); best_targs = v_targs.copy(); no_improve = 0
        else:
            no_improve += 1
        if no_improve >= patience:
            print('Early stopping triggered', flush=True); break
    model.load_state_dict(best_state)
    return model, best_preds, best_targs

def run_cv(folds_df, backbone='tf_efficientnet_b4_ns', epochs=12, batch_size=8, num_workers=0):
    # Fix index alignment for OOF
    folds_df = folds_df.reset_index(drop=True).copy()
    n_folds = folds_df['fold'].nunique()
    oof_preds = np.zeros(len(folds_df), dtype=np.float32)
    oof_targs = folds_df['diagnosis'].values.astype(float)
    models = []
    for fold in range(n_folds):
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available():
                torch.cuda.reset_peak_memory_stats()
        except Exception:
            pass
        gc.collect()
        current_bs = batch_size
        for attempt in range(6):
            print(f'Attempt {attempt+1}: trying batch_size={current_bs}', flush=True)
            try:
                fm, v_preds, v_targs = train_one_fold(fold, folds_df, epochs=epochs, lr=LR, wd=WD, batch_size=current_bs, patience=PATIENCE, backbone=backbone, num_workers=num_workers)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                msg = str(e).lower()
                if 'out of memory' in msg or 'cuda out of memory' in msg:
                    print(f'CUDA OOM on fold {fold} with batch_size={current_bs}. Reducing and retrying (attempt {attempt+1})...', flush=True)
                    # Aggressive cleanup
                    try:
                        del fm
                    except Exception:
                        pass
                    gc.collect()
                    try:
                        torch.cuda.empty_cache()
                        if torch.cuda.is_available():
                            torch.cuda.reset_peak_memory_stats()
                    except Exception:
                        pass
                    current_bs = max(2, current_bs - 2)
                    if attempt >= 5:
                        raise
                    continue
                else:
                    raise
        models.append(fm)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = v_preds
        fold_q = qwk(v_targs, preds_to_classes(v_preds, [0.5,1.5,2.5,3.5]))
        print(f'Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    th = optimize_thresholds(oof_targs, oof_preds)
    oof_q = qwk(oof_targs, preds_to_classes(oof_preds, th))
    print('Optimized thresholds:', th, 'OOF QWK:', f'{oof_q:.5f}', flush=True)
    np.save('oof_preds.npy', oof_preds); np.save('oof_targets.npy', oof_targs); np.save('thresholds.npy', th)
    return models, th, oof_q

def predict_test(models, thresholds, batch_size=16):
    test_ids = test_df['id_code'].values
    dte = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
    dl_te = DataLoader(dte, batch_size=batch_size*2, shuffle=False, num_workers=NUM_WORKERS_TEST, pin_memory=True, persistent_workers=False)
    all_preds = []
    for m_i, m in enumerate(models):
        print(f'Inference model {m_i+1}/{len(models)}', flush=True)
        preds = tta_predict(m, dl_te)
        all_preds.append(preds)
    reg_pred = np.mean(np.stack(all_preds, axis=0), axis=0)
    cls = preds_to_classes(reg_pred, thresholds).astype(int)
    sub = pd.DataFrame({'id_code': test_ids, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('test_reg_preds.npy', reg_pred)
    print('Saved submission.csv and test_reg_preds.npy')
    return sub

# --- Full CV run (smoke off) ---
SMOKE_TRAIN = False
if SMOKE_TRAIN:
    folds_use = folds_df.copy()
    folds_use = folds_use[folds_use['fold'].isin([0,1])].copy()
    models, thres, oof_q = run_cv(folds_use, backbone=BACKBONE, epochs=2, batch_size=8, num_workers=0)
    print('Smoke OOF QWK (partial):', oof_q)
else:
    # Optional one-time warmup to force weight download before CV
    try:
        _ = timm.create_model(BACKBONE, pretrained=True, num_classes=0, global_pool='avg', cache_dir=str(HF_CACHE_DIR))
        del _
    except Exception as e:
        print('Pretrained weight warmup failed (will retry inside folds):', e)
    models, thres, oof_q = run_cv(folds_df, backbone=BACKBONE, epochs=EPOCHS_FULL, batch_size=BATCH_SIZE, num_workers=NUM_WORKERS_TRAIN)

print('Training scaffold ready.')

  original_init(self, **validated_kwargs)
  model = create_fn(


Attempt 1: trying batch_size=8



===== Fold 0 / 5 =====


Peak GB before fold: 0.00


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8565 elapsed 19.8s


Epoch 1 it 100/329 tr_loss 0.8189 elapsed 38.7s


Epoch 1 it 150/329 tr_loss 0.7184 elapsed 57.9s


Epoch 1 it 200/329 tr_loss 0.6401 elapsed 77.0s


Epoch 1 it 250/329 tr_loss 0.5738 elapsed 96.1s


Epoch 1 it 300/329 tr_loss 0.5251 elapsed 115.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.5026 val_loss 0.2787 val_qwk 0.6722 epoch_time 137.9s total_elapsed 137.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2933 elapsed 19.4s


Epoch 2 it 100/329 tr_loss 0.2979 elapsed 39.0s


Epoch 2 it 150/329 tr_loss 0.2945 elapsed 58.7s


Epoch 2 it 200/329 tr_loss 0.2820 elapsed 78.2s


Epoch 2 it 250/329 tr_loss 0.2825 elapsed 97.7s


Epoch 2 it 300/329 tr_loss 0.2866 elapsed 117.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2833 val_loss 0.2376 val_qwk 0.7754 epoch_time 139.6s total_elapsed 277.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2769 elapsed 19.7s


Epoch 3 it 100/329 tr_loss 0.2680 elapsed 39.4s


Epoch 3 it 150/329 tr_loss 0.2482 elapsed 58.9s


Epoch 3 it 200/329 tr_loss 0.2542 elapsed 78.5s


Epoch 3 it 250/329 tr_loss 0.2484 elapsed 98.1s


Epoch 3 it 300/329 tr_loss 0.2484 elapsed 117.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2467 val_loss 0.2210 val_qwk 0.8427 epoch_time 140.0s total_elapsed 417.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1759 elapsed 19.7s


Epoch 4 it 100/329 tr_loss 0.1851 elapsed 39.2s


Epoch 4 it 150/329 tr_loss 0.1870 elapsed 58.7s


Epoch 4 it 200/329 tr_loss 0.1976 elapsed 78.2s


Epoch 4 it 250/329 tr_loss 0.1929 elapsed 97.7s


Epoch 4 it 300/329 tr_loss 0.1973 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1975 val_loss 0.1877 val_qwk 0.8067 epoch_time 139.8s total_elapsed 557.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1677 elapsed 19.5s


Epoch 5 it 100/329 tr_loss 0.1797 elapsed 39.1s


Epoch 5 it 150/329 tr_loss 0.1897 elapsed 58.7s


Epoch 5 it 200/329 tr_loss 0.1822 elapsed 78.4s


Epoch 5 it 250/329 tr_loss 0.1834 elapsed 98.0s


Epoch 5 it 300/329 tr_loss 0.1890 elapsed 117.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1892 val_loss 0.1839 val_qwk 0.8305 epoch_time 139.9s total_elapsed 697.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1762 elapsed 19.6s


Epoch 6 it 100/329 tr_loss 0.1676 elapsed 39.0s


Epoch 6 it 150/329 tr_loss 0.1609 elapsed 58.6s


Epoch 6 it 200/329 tr_loss 0.1651 elapsed 78.3s


Epoch 6 it 250/329 tr_loss 0.1634 elapsed 97.7s


Epoch 6 it 300/329 tr_loss 0.1614 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1583 val_loss 0.1678 val_qwk 0.8561 epoch_time 139.8s total_elapsed 837.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1159 elapsed 19.5s


Epoch 7 it 100/329 tr_loss 0.1185 elapsed 39.0s


Epoch 7 it 150/329 tr_loss 0.1314 elapsed 58.4s


Epoch 7 it 200/329 tr_loss 0.1335 elapsed 77.9s


Epoch 7 it 250/329 tr_loss 0.1367 elapsed 97.3s


Epoch 7 it 300/329 tr_loss 0.1388 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1401 val_loss 0.1676 val_qwk 0.8522 epoch_time 139.6s total_elapsed 977.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.1366 elapsed 19.5s


Epoch 8 it 100/329 tr_loss 0.1320 elapsed 39.0s


Epoch 8 it 150/329 tr_loss 0.1292 elapsed 58.5s


Epoch 8 it 200/329 tr_loss 0.1252 elapsed 78.2s


Epoch 8 it 250/329 tr_loss 0.1271 elapsed 97.7s


Epoch 8 it 300/329 tr_loss 0.1259 elapsed 117.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1263 val_loss 0.1830 val_qwk 0.8351 epoch_time 139.9s total_elapsed 1116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.0992 elapsed 19.5s


Epoch 9 it 100/329 tr_loss 0.1142 elapsed 38.9s


Epoch 9 it 150/329 tr_loss 0.1078 elapsed 58.4s


Epoch 9 it 200/329 tr_loss 0.1058 elapsed 77.9s


Epoch 9 it 250/329 tr_loss 0.1025 elapsed 97.3s


Epoch 9 it 300/329 tr_loss 0.1053 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.1075 val_loss 0.1687 val_qwk 0.8528 epoch_time 139.5s total_elapsed 1256.4s


Early stopping triggered


Fold 0 val QWK (default th): 0.8561


Attempt 1: trying batch_size=8



===== Fold 1 / 5 =====


Peak GB before fold: 0.15


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8326 elapsed 19.5s


Epoch 1 it 100/329 tr_loss 0.8417 elapsed 39.0s


Epoch 1 it 150/329 tr_loss 0.7409 elapsed 58.5s


Epoch 1 it 200/329 tr_loss 0.6514 elapsed 78.0s


Epoch 1 it 250/329 tr_loss 0.5818 elapsed 97.5s


Epoch 1 it 300/329 tr_loss 0.5457 elapsed 117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.5269 val_loss 0.2900 val_qwk 0.6758 epoch_time 139.5s total_elapsed 139.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2926 elapsed 19.3s


Epoch 2 it 100/329 tr_loss 0.3127 elapsed 39.0s


Epoch 2 it 150/329 tr_loss 0.3013 elapsed 58.5s


Epoch 2 it 200/329 tr_loss 0.2880 elapsed 78.0s


Epoch 2 it 250/329 tr_loss 0.2784 elapsed 97.5s


Epoch 2 it 300/329 tr_loss 0.2807 elapsed 117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2813 val_loss 0.1968 val_qwk 0.8333 epoch_time 139.5s total_elapsed 279.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.1865 elapsed 19.5s


Epoch 3 it 100/329 tr_loss 0.2037 elapsed 39.3s


Epoch 3 it 150/329 tr_loss 0.2284 elapsed 58.8s


Epoch 3 it 200/329 tr_loss 0.2347 elapsed 78.3s


Epoch 3 it 250/329 tr_loss 0.2316 elapsed 97.8s


Epoch 3 it 300/329 tr_loss 0.2309 elapsed 117.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2281 val_loss 0.2117 val_qwk 0.8029 epoch_time 140.1s total_elapsed 419.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.2109 elapsed 19.5s


Epoch 4 it 100/329 tr_loss 0.1984 elapsed 39.0s


Epoch 4 it 150/329 tr_loss 0.2012 elapsed 58.5s


Epoch 4 it 200/329 tr_loss 0.1986 elapsed 78.2s


Epoch 4 it 250/329 tr_loss 0.1971 elapsed 97.8s


Epoch 4 it 300/329 tr_loss 0.1933 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1953 val_loss 0.1955 val_qwk 0.8565 epoch_time 139.9s total_elapsed 559.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1708 elapsed 19.5s


Epoch 5 it 100/329 tr_loss 0.1589 elapsed 38.9s


Epoch 5 it 150/329 tr_loss 0.1632 elapsed 58.4s


Epoch 5 it 200/329 tr_loss 0.1632 elapsed 77.9s


Epoch 5 it 250/329 tr_loss 0.1609 elapsed 97.4s


Epoch 5 it 300/329 tr_loss 0.1603 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1614 val_loss 0.1857 val_qwk 0.8268 epoch_time 139.5s total_elapsed 698.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1391 elapsed 19.5s


Epoch 6 it 100/329 tr_loss 0.1311 elapsed 39.0s


Epoch 6 it 150/329 tr_loss 0.1362 elapsed 58.5s


Epoch 6 it 200/329 tr_loss 0.1455 elapsed 78.0s


Epoch 6 it 250/329 tr_loss 0.1492 elapsed 97.5s


Epoch 6 it 300/329 tr_loss 0.1504 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1474 val_loss 0.1869 val_qwk 0.8239 epoch_time 139.6s total_elapsed 838.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1249 elapsed 19.6s


Epoch 7 it 100/329 tr_loss 0.1310 elapsed 39.0s


Epoch 7 it 150/329 tr_loss 0.1274 elapsed 58.5s


Epoch 7 it 200/329 tr_loss 0.1274 elapsed 78.0s


Epoch 7 it 250/329 tr_loss 0.1267 elapsed 97.6s


Epoch 7 it 300/329 tr_loss 0.1289 elapsed 117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1294 val_loss 0.1879 val_qwk 0.8356 epoch_time 139.8s total_elapsed 978.1s


Early stopping triggered


Fold 1 val QWK (default th): 0.8565


Attempt 1: trying batch_size=8



===== Fold 2 / 5 =====


Peak GB before fold: 0.28


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.9227 elapsed 19.6s


Epoch 1 it 100/329 tr_loss 0.8357 elapsed 39.2s


Epoch 1 it 150/329 tr_loss 0.7319 elapsed 58.8s


Epoch 1 it 200/329 tr_loss 0.6474 elapsed 78.3s


Epoch 1 it 250/329 tr_loss 0.5928 elapsed 97.8s


Epoch 1 it 300/329 tr_loss 0.5496 elapsed 117.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.5290 val_loss 0.2881 val_qwk 0.7551 epoch_time 139.9s total_elapsed 139.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2827 elapsed 19.5s


Epoch 2 it 100/329 tr_loss 0.2955 elapsed 39.0s


Epoch 2 it 150/329 tr_loss 0.2884 elapsed 58.5s


Epoch 2 it 200/329 tr_loss 0.2841 elapsed 78.1s


Epoch 2 it 250/329 tr_loss 0.2794 elapsed 97.7s


Epoch 2 it 300/329 tr_loss 0.2775 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2764 val_loss 0.2128 val_qwk 0.8193 epoch_time 139.8s total_elapsed 279.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2284 elapsed 19.5s


Epoch 3 it 100/329 tr_loss 0.2408 elapsed 39.0s


Epoch 3 it 150/329 tr_loss 0.2326 elapsed 58.6s


Epoch 3 it 200/329 tr_loss 0.2358 elapsed 78.1s


Epoch 3 it 250/329 tr_loss 0.2389 elapsed 97.7s


Epoch 3 it 300/329 tr_loss 0.2413 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2366 val_loss 0.1820 val_qwk 0.8399 epoch_time 140.0s total_elapsed 419.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1845 elapsed 19.5s


Epoch 4 it 100/329 tr_loss 0.1942 elapsed 39.0s


Epoch 4 it 150/329 tr_loss 0.1995 elapsed 58.5s


Epoch 4 it 200/329 tr_loss 0.1923 elapsed 77.9s


Epoch 4 it 250/329 tr_loss 0.1876 elapsed 97.4s


Epoch 4 it 300/329 tr_loss 0.1903 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1913 val_loss 0.1811 val_qwk 0.8662 epoch_time 139.5s total_elapsed 559.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1661 elapsed 19.5s


Epoch 5 it 100/329 tr_loss 0.1702 elapsed 39.1s


Epoch 5 it 150/329 tr_loss 0.1710 elapsed 58.6s


Epoch 5 it 200/329 tr_loss 0.1733 elapsed 78.1s


Epoch 5 it 250/329 tr_loss 0.1668 elapsed 97.6s


Epoch 5 it 300/329 tr_loss 0.1653 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1680 val_loss 0.1825 val_qwk 0.8502 epoch_time 139.6s total_elapsed 698.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1722 elapsed 19.6s


Epoch 6 it 100/329 tr_loss 0.1615 elapsed 39.1s


Epoch 6 it 150/329 tr_loss 0.1564 elapsed 58.6s


Epoch 6 it 200/329 tr_loss 0.1562 elapsed 78.2s


Epoch 6 it 250/329 tr_loss 0.1559 elapsed 97.6s


Epoch 6 it 300/329 tr_loss 0.1555 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1557 val_loss 0.1939 val_qwk 0.8303 epoch_time 139.7s total_elapsed 838.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1496 elapsed 19.5s


Epoch 7 it 100/329 tr_loss 0.1418 elapsed 39.0s


Epoch 7 it 150/329 tr_loss 0.1452 elapsed 58.5s


Epoch 7 it 200/329 tr_loss 0.1452 elapsed 78.1s


Epoch 7 it 250/329 tr_loss 0.1393 elapsed 97.5s


Epoch 7 it 300/329 tr_loss 0.1405 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1394 val_loss 0.1907 val_qwk 0.8671 epoch_time 139.7s total_elapsed 978.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.1030 elapsed 19.4s


Epoch 8 it 100/329 tr_loss 0.1194 elapsed 38.9s


Epoch 8 it 150/329 tr_loss 0.1114 elapsed 58.4s


Epoch 8 it 200/329 tr_loss 0.1146 elapsed 77.9s


Epoch 8 it 250/329 tr_loss 0.1141 elapsed 97.3s


Epoch 8 it 300/329 tr_loss 0.1159 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1151 val_loss 0.1508 val_qwk 0.8698 epoch_time 139.3s total_elapsed 1117.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.0851 elapsed 19.4s


Epoch 9 it 100/329 tr_loss 0.0858 elapsed 38.8s


Epoch 9 it 150/329 tr_loss 0.0841 elapsed 58.3s


Epoch 9 it 200/329 tr_loss 0.0900 elapsed 77.8s


Epoch 9 it 250/329 tr_loss 0.0928 elapsed 97.2s


Epoch 9 it 300/329 tr_loss 0.0958 elapsed 116.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.0993 val_loss 0.1866 val_qwk 0.8452 epoch_time 139.0s total_elapsed 1256.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10 it 50/329 tr_loss 0.0850 elapsed 19.4s


Epoch 10 it 100/329 tr_loss 0.0845 elapsed 38.9s


Epoch 10 it 150/329 tr_loss 0.0791 elapsed 58.4s


Epoch 10 it 200/329 tr_loss 0.0812 elapsed 77.8s


Epoch 10 it 250/329 tr_loss 0.0805 elapsed 97.3s


Epoch 10 it 300/329 tr_loss 0.0801 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10: tr_loss 0.0826 val_loss 0.1603 val_qwk 0.8610 epoch_time 139.3s total_elapsed 1396.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 11 it 50/329 tr_loss 0.0773 elapsed 19.5s


Epoch 11 it 100/329 tr_loss 0.0824 elapsed 38.9s


Epoch 11 it 150/329 tr_loss 0.0808 elapsed 58.4s


Epoch 11 it 200/329 tr_loss 0.0773 elapsed 77.9s


Epoch 11 it 250/329 tr_loss 0.0756 elapsed 97.3s


Epoch 11 it 300/329 tr_loss 0.0772 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 11: tr_loss 0.0791 val_loss 0.1959 val_qwk 0.8608 epoch_time 139.6s total_elapsed 1535.7s


Early stopping triggered


Fold 2 val QWK (default th): 0.8698


Attempt 1: trying batch_size=8



===== Fold 3 / 5 =====


Peak GB before fold: 0.41


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8941 elapsed 19.5s


Epoch 1 it 100/329 tr_loss 0.8648 elapsed 39.0s


Epoch 1 it 150/329 tr_loss 0.7648 elapsed 58.5s


Epoch 1 it 200/329 tr_loss 0.6573 elapsed 78.1s


Epoch 1 it 250/329 tr_loss 0.5936 elapsed 97.6s


Epoch 1 it 300/329 tr_loss 0.5445 elapsed 117.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.5209 val_loss 0.3047 val_qwk 0.7008 epoch_time 139.6s total_elapsed 139.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.3395 elapsed 19.5s


Epoch 2 it 100/329 tr_loss 0.2837 elapsed 39.0s


Epoch 2 it 150/329 tr_loss 0.2890 elapsed 58.5s


Epoch 2 it 200/329 tr_loss 0.2812 elapsed 78.1s


Epoch 2 it 250/329 tr_loss 0.2802 elapsed 97.6s


Epoch 2 it 300/329 tr_loss 0.2827 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2828 val_loss 0.2235 val_qwk 0.7751 epoch_time 139.8s total_elapsed 279.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2717 elapsed 19.4s


Epoch 3 it 100/329 tr_loss 0.2674 elapsed 38.9s


Epoch 3 it 150/329 tr_loss 0.2526 elapsed 58.5s


Epoch 3 it 200/329 tr_loss 0.2381 elapsed 78.0s


Epoch 3 it 250/329 tr_loss 0.2376 elapsed 97.5s


Epoch 3 it 300/329 tr_loss 0.2316 elapsed 117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2300 val_loss 0.2039 val_qwk 0.8220 epoch_time 139.4s total_elapsed 418.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.2460 elapsed 19.5s


Epoch 4 it 100/329 tr_loss 0.2156 elapsed 38.9s


Epoch 4 it 150/329 tr_loss 0.2193 elapsed 58.4s


Epoch 4 it 200/329 tr_loss 0.2071 elapsed 77.9s


Epoch 4 it 250/329 tr_loss 0.1994 elapsed 97.4s


Epoch 4 it 300/329 tr_loss 0.1986 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1972 val_loss 0.1859 val_qwk 0.8346 epoch_time 139.5s total_elapsed 558.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1536 elapsed 19.5s


Epoch 5 it 100/329 tr_loss 0.1698 elapsed 38.9s


Epoch 5 it 150/329 tr_loss 0.1785 elapsed 58.5s


Epoch 5 it 200/329 tr_loss 0.1700 elapsed 78.0s


Epoch 5 it 250/329 tr_loss 0.1730 elapsed 97.4s


Epoch 5 it 300/329 tr_loss 0.1771 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1773 val_loss 0.1952 val_qwk 0.8106 epoch_time 139.4s total_elapsed 697.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1561 elapsed 19.4s


Epoch 6 it 100/329 tr_loss 0.1623 elapsed 38.9s


Epoch 6 it 150/329 tr_loss 0.1582 elapsed 58.4s


Epoch 6 it 200/329 tr_loss 0.1584 elapsed 77.8s


Epoch 6 it 250/329 tr_loss 0.1589 elapsed 97.4s


Epoch 6 it 300/329 tr_loss 0.1579 elapsed 116.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1583 val_loss 0.2184 val_qwk 0.7817 epoch_time 139.6s total_elapsed 837.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1392 elapsed 19.4s


Epoch 7 it 100/329 tr_loss 0.1385 elapsed 39.0s


Epoch 7 it 150/329 tr_loss 0.1329 elapsed 58.5s


Epoch 7 it 200/329 tr_loss 0.1379 elapsed 78.0s


Epoch 7 it 250/329 tr_loss 0.1376 elapsed 97.7s


Epoch 7 it 300/329 tr_loss 0.1409 elapsed 117.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1409 val_loss 0.1830 val_qwk 0.8708 epoch_time 140.1s total_elapsed 977.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.1134 elapsed 19.6s


Epoch 8 it 100/329 tr_loss 0.1306 elapsed 39.0s


Epoch 8 it 150/329 tr_loss 0.1397 elapsed 58.4s


Epoch 8 it 200/329 tr_loss 0.1370 elapsed 78.1s


Epoch 8 it 250/329 tr_loss 0.1322 elapsed 97.6s


Epoch 8 it 300/329 tr_loss 0.1298 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1299 val_loss 0.1629 val_qwk 0.8614 epoch_time 139.4s total_elapsed 1117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.1215 elapsed 19.5s


Epoch 9 it 100/329 tr_loss 0.1190 elapsed 39.0s


Epoch 9 it 150/329 tr_loss 0.1192 elapsed 58.4s


Epoch 9 it 200/329 tr_loss 0.1167 elapsed 77.9s


Epoch 9 it 250/329 tr_loss 0.1136 elapsed 97.3s


Epoch 9 it 300/329 tr_loss 0.1106 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.1106 val_loss 0.1699 val_qwk 0.8664 epoch_time 139.4s total_elapsed 1256.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10 it 50/329 tr_loss 0.1071 elapsed 19.4s


Epoch 10 it 100/329 tr_loss 0.0949 elapsed 38.8s


Epoch 10 it 150/329 tr_loss 0.0948 elapsed 58.4s


Epoch 10 it 200/329 tr_loss 0.0946 elapsed 77.8s


Epoch 10 it 250/329 tr_loss 0.0962 elapsed 97.3s


Epoch 10 it 300/329 tr_loss 0.0957 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10: tr_loss 0.0971 val_loss 0.1667 val_qwk 0.8566 epoch_time 139.1s total_elapsed 1395.6s


Early stopping triggered


Fold 3 val QWK (default th): 0.8708


Attempt 1: trying batch_size=8



===== Fold 4 / 5 =====


Peak GB before fold: 0.54


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.9296 elapsed 19.5s


Epoch 1 it 100/329 tr_loss 0.8330 elapsed 39.0s


Epoch 1 it 150/329 tr_loss 0.7267 elapsed 58.5s


Epoch 1 it 200/329 tr_loss 0.6438 elapsed 77.9s


Epoch 1 it 250/329 tr_loss 0.5729 elapsed 97.6s


Epoch 1 it 300/329 tr_loss 0.5289 elapsed 117.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.5142 val_loss 0.2656 val_qwk 0.7778 epoch_time 139.6s total_elapsed 139.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.3244 elapsed 19.5s


Epoch 2 it 100/329 tr_loss 0.3094 elapsed 39.1s


Epoch 2 it 150/329 tr_loss 0.2948 elapsed 58.5s


Epoch 2 it 200/329 tr_loss 0.2870 elapsed 78.0s


Epoch 2 it 250/329 tr_loss 0.2814 elapsed 97.4s


Epoch 2 it 300/329 tr_loss 0.2784 elapsed 116.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2746 val_loss 0.2156 val_qwk 0.7977 epoch_time 139.2s total_elapsed 278.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2329 elapsed 19.4s


Epoch 3 it 100/329 tr_loss 0.2287 elapsed 38.7s


Epoch 3 it 150/329 tr_loss 0.2279 elapsed 58.0s


Epoch 3 it 200/329 tr_loss 0.2392 elapsed 77.4s


Epoch 3 it 250/329 tr_loss 0.2406 elapsed 96.8s


Epoch 3 it 300/329 tr_loss 0.2412 elapsed 116.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2476 val_loss 0.2453 val_qwk 0.7538 epoch_time 138.7s total_elapsed 417.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.2325 elapsed 19.4s


Epoch 4 it 100/329 tr_loss 0.2169 elapsed 38.8s


Epoch 4 it 150/329 tr_loss 0.2169 elapsed 58.1s


Epoch 4 it 200/329 tr_loss 0.2190 elapsed 77.5s


Epoch 4 it 250/329 tr_loss 0.2117 elapsed 96.8s


Epoch 4 it 300/329 tr_loss 0.2133 elapsed 116.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.2105 val_loss 0.1851 val_qwk 0.8470 epoch_time 138.6s total_elapsed 556.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1536 elapsed 19.5s


Epoch 5 it 100/329 tr_loss 0.1671 elapsed 38.8s


Epoch 5 it 150/329 tr_loss 0.1825 elapsed 58.2s


Epoch 5 it 200/329 tr_loss 0.1845 elapsed 77.5s


Epoch 5 it 250/329 tr_loss 0.1843 elapsed 96.8s


Epoch 5 it 300/329 tr_loss 0.1833 elapsed 116.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1864 val_loss 0.2040 val_qwk 0.7962 epoch_time 138.7s total_elapsed 694.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1644 elapsed 19.1s


Epoch 6 it 100/329 tr_loss 0.1611 elapsed 38.4s


Epoch 6 it 150/329 tr_loss 0.1606 elapsed 57.9s


Epoch 6 it 200/329 tr_loss 0.1693 elapsed 77.2s


Epoch 6 it 250/329 tr_loss 0.1638 elapsed 96.6s


Epoch 6 it 300/329 tr_loss 0.1685 elapsed 115.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1663 val_loss 0.1732 val_qwk 0.8332 epoch_time 138.3s total_elapsed 833.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1429 elapsed 19.4s


Epoch 7 it 100/329 tr_loss 0.1388 elapsed 38.8s


Epoch 7 it 150/329 tr_loss 0.1292 elapsed 58.2s


Epoch 7 it 200/329 tr_loss 0.1355 elapsed 77.7s


Epoch 7 it 250/329 tr_loss 0.1353 elapsed 97.3s


Epoch 7 it 300/329 tr_loss 0.1391 elapsed 116.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1373 val_loss 0.1518 val_qwk 0.8745 epoch_time 139.0s total_elapsed 972.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.1391 elapsed 19.5s


Epoch 8 it 100/329 tr_loss 0.1228 elapsed 39.1s


Epoch 8 it 150/329 tr_loss 0.1268 elapsed 58.6s


Epoch 8 it 200/329 tr_loss 0.1294 elapsed 78.1s


Epoch 8 it 250/329 tr_loss 0.1216 elapsed 97.5s


Epoch 8 it 300/329 tr_loss 0.1261 elapsed 117.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1309 val_loss 0.1766 val_qwk 0.8735 epoch_time 139.7s total_elapsed 1111.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.0938 elapsed 19.7s


Epoch 9 it 100/329 tr_loss 0.1056 elapsed 39.2s


Epoch 9 it 150/329 tr_loss 0.1092 elapsed 58.9s


Epoch 9 it 200/329 tr_loss 0.1026 elapsed 78.5s


Epoch 9 it 250/329 tr_loss 0.1098 elapsed 98.1s


Epoch 9 it 300/329 tr_loss 0.1129 elapsed 117.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.1171 val_loss 0.1733 val_qwk 0.8613 epoch_time 140.2s total_elapsed 1252.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10 it 50/329 tr_loss 0.0835 elapsed 19.6s


Epoch 10 it 100/329 tr_loss 0.0870 elapsed 39.2s


Epoch 10 it 150/329 tr_loss 0.0995 elapsed 58.8s


Epoch 10 it 200/329 tr_loss 0.0963 elapsed 78.4s


Epoch 10 it 250/329 tr_loss 0.1006 elapsed 98.0s


Epoch 10 it 300/329 tr_loss 0.1005 elapsed 117.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10: tr_loss 0.1005 val_loss 0.1715 val_qwk 0.8423 epoch_time 140.6s total_elapsed 1392.7s


Early stopping triggered


Fold 4 val QWK (default th): 0.8745


Optimized thresholds: [0.51432359 1.49038092 2.48714289 3.27959726] OOF QWK: 0.86938


Training scaffold ready.


In [4]:
# Inference and submission after CV completes
print('OOF QWK (from CV):', oof_q if 'oof_q' in globals() else 'N/A', flush=True)
sub = predict_test(models, thres, batch_size=BATCH_SIZE)
print('Submission head:\n', sub.head(), flush=True)
import os
print('submission.csv exists:', os.path.exists('submission.csv'), flush=True)

OOF QWK (from CV): 0.8693825133407611


Inference model 1/5


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Inference model 2/5


Inference model 3/5


Inference model 4/5


Inference model 5/5


Saved submission.csv and test_reg_preds.npy
Submission head:
         id_code  diagnosis
0  b460ca9fa26f          0
1  6cee2e148520          0
2  ca6842bfcbc9          1
3  6cbc3dad809c          2
4  a9bc2f892cb3          0


submission.csv exists: True


In [5]:
# Train EfficientNet-B5 @512, then blend with B4; re-optimize thresholds on blended OOF; create blended submission
import shutil, os
print('Preparing for B5 run + blend with B4', flush=True)

# Rename current B4 artifacts to keep them
if os.path.exists('oof_preds.npy'):
    shutil.move('oof_preds.npy', 'oof_preds_b4.npy')
if os.path.exists('oof_targets.npy') and not os.path.exists('oof_targets_b4.npy'):
    shutil.copy('oof_targets.npy', 'oof_targets_b4.npy')  # targets identical
if os.path.exists('thresholds.npy'):
    shutil.move('thresholds.npy', 'thresholds_b4.npy')
if os.path.exists('test_reg_preds.npy'):
    shutil.move('test_reg_preds.npy', 'test_reg_preds_b4.npy')

# Train B5 with safer batch size
BACKBONE_B5 = 'tf_efficientnet_b5_ns'
BATCH_SIZE_B5 = 8  # OOM logic will reduce if needed
print('Starting B5 training...', flush=True)
models_b5, thres_b5, oof_q_b5 = run_cv(folds_df, backbone=BACKBONE_B5, epochs=EPOCHS_FULL, batch_size=BATCH_SIZE_B5, num_workers=NUM_WORKERS_TRAIN)
print('B5 OOF QWK:', oof_q_b5, flush=True)

# Inference for B5 to cache its test preds
print('Running B5 inference...', flush=True)
sub_b5 = predict_test(models_b5, thres_b5, batch_size=BATCH_SIZE_B5)

# Load OOFs and blend (equal weights). Re-optimize thresholds on blended OOF.
oof_b4 = np.load('oof_preds_b4.npy')
oof_b5 = np.load('oof_preds.npy')  # latest from B5 run
oof_t = np.load('oof_targets_b4.npy')
oof_blend = 0.5 * oof_b4 + 0.5 * oof_b5
th_blend = optimize_thresholds(oof_t, oof_blend)
oof_q_blend = qwk(oof_t, preds_to_classes(oof_blend, th_blend))
print('Blended OOF QWK:', f'{oof_q_blend:.5f}', 'Thresholds:', th_blend, flush=True)
np.save('thresholds_blend.npy', th_blend)

# Blend test predictions and create submission
preds_b4 = np.load('test_reg_preds_b4.npy')
preds_b5 = np.load('test_reg_preds.npy')
preds_blend = 0.5 * preds_b4 + 0.5 * preds_b5
cls_blend = preds_to_classes(preds_blend, th_blend).astype(int)
sub_blend = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls_blend})
sub_blend.to_csv('submission.csv', index=False)
np.save('test_reg_preds_blend.npy', preds_blend)
print('Saved blended submission.csv and test_reg_preds_blend.npy', flush=True)
print('Head of blended submission:\n', sub_blend.head(), flush=True)

Preparing for B5 run + blend with B4


Starting B5 training...


Attempt 1: trying batch_size=8



===== Fold 0 / 5 =====


Peak GB before fold: 0.67


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8244 elapsed 26.0s


Epoch 1 it 100/329 tr_loss 0.8092 elapsed 51.9s


Epoch 1 it 150/329 tr_loss 0.6943 elapsed 77.5s


Epoch 1 it 200/329 tr_loss 0.6014 elapsed 103.2s


Epoch 1 it 250/329 tr_loss 0.5465 elapsed 128.8s


Epoch 1 it 300/329 tr_loss 0.5019 elapsed 154.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.4841 val_loss 0.2344 val_qwk 0.7695 epoch_time 183.0s total_elapsed 183.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2817 elapsed 25.7s


Epoch 2 it 100/329 tr_loss 0.2638 elapsed 51.4s


Epoch 2 it 150/329 tr_loss 0.2638 elapsed 77.1s


Epoch 2 it 200/329 tr_loss 0.2740 elapsed 102.6s


Epoch 2 it 250/329 tr_loss 0.2649 elapsed 128.3s


Epoch 2 it 300/329 tr_loss 0.2657 elapsed 154.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2631 val_loss 0.2431 val_qwk 0.7500 epoch_time 182.2s total_elapsed 365.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2698 elapsed 25.8s


Epoch 3 it 100/329 tr_loss 0.2210 elapsed 51.5s


Epoch 3 it 150/329 tr_loss 0.2184 elapsed 77.2s


Epoch 3 it 200/329 tr_loss 0.2163 elapsed 102.8s


Epoch 3 it 250/329 tr_loss 0.2168 elapsed 128.5s


Epoch 3 it 300/329 tr_loss 0.2159 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2193 val_loss 0.1841 val_qwk 0.8477 epoch_time 182.3s total_elapsed 547.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1902 elapsed 25.7s


Epoch 4 it 100/329 tr_loss 0.1861 elapsed 51.3s


Epoch 4 it 150/329 tr_loss 0.1772 elapsed 77.1s


Epoch 4 it 200/329 tr_loss 0.1746 elapsed 102.8s


Epoch 4 it 250/329 tr_loss 0.1779 elapsed 128.4s


Epoch 4 it 300/329 tr_loss 0.1796 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1820 val_loss 0.2268 val_qwk 0.8498 epoch_time 182.5s total_elapsed 730.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1656 elapsed 25.8s


Epoch 5 it 100/329 tr_loss 0.1723 elapsed 51.4s


Epoch 5 it 150/329 tr_loss 0.1692 elapsed 77.1s


Epoch 5 it 200/329 tr_loss 0.1645 elapsed 102.6s


Epoch 5 it 250/329 tr_loss 0.1597 elapsed 128.2s


Epoch 5 it 300/329 tr_loss 0.1558 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1547 val_loss 0.1730 val_qwk 0.8507 epoch_time 182.2s total_elapsed 912.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1507 elapsed 25.9s


Epoch 6 it 100/329 tr_loss 0.1447 elapsed 51.5s


Epoch 6 it 150/329 tr_loss 0.1482 elapsed 77.3s


Epoch 6 it 200/329 tr_loss 0.1403 elapsed 102.9s


Epoch 6 it 250/329 tr_loss 0.1370 elapsed 128.5s


Epoch 6 it 300/329 tr_loss 0.1368 elapsed 154.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1373 val_loss 0.1649 val_qwk 0.8555 epoch_time 182.5s total_elapsed 1095.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.0945 elapsed 25.8s


Epoch 7 it 100/329 tr_loss 0.1030 elapsed 51.4s


Epoch 7 it 150/329 tr_loss 0.1105 elapsed 77.0s


Epoch 7 it 200/329 tr_loss 0.1102 elapsed 102.7s


Epoch 7 it 250/329 tr_loss 0.1121 elapsed 128.4s


Epoch 7 it 300/329 tr_loss 0.1097 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1115 val_loss 0.1806 val_qwk 0.8622 epoch_time 182.3s total_elapsed 1277.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.1139 elapsed 25.6s


Epoch 8 it 100/329 tr_loss 0.1062 elapsed 51.5s


Epoch 8 it 150/329 tr_loss 0.0999 elapsed 77.1s


Epoch 8 it 200/329 tr_loss 0.0977 elapsed 102.9s


Epoch 8 it 250/329 tr_loss 0.0979 elapsed 128.5s


Epoch 8 it 300/329 tr_loss 0.1037 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1020 val_loss 0.1596 val_qwk 0.8555 epoch_time 182.6s total_elapsed 1460.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.0802 elapsed 25.6s


Epoch 9 it 100/329 tr_loss 0.0834 elapsed 51.5s


Epoch 9 it 150/329 tr_loss 0.0894 elapsed 77.1s


Epoch 9 it 200/329 tr_loss 0.0864 elapsed 102.9s


Epoch 9 it 250/329 tr_loss 0.0878 elapsed 128.6s


Epoch 9 it 300/329 tr_loss 0.0877 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.0873 val_loss 0.1618 val_qwk 0.8614 epoch_time 182.5s total_elapsed 1642.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10 it 50/329 tr_loss 0.0792 elapsed 25.6s


Epoch 10 it 100/329 tr_loss 0.0767 elapsed 51.4s


Epoch 10 it 150/329 tr_loss 0.0843 elapsed 77.1s


Epoch 10 it 200/329 tr_loss 0.0827 elapsed 103.0s


Epoch 10 it 250/329 tr_loss 0.0882 elapsed 128.7s


Epoch 10 it 300/329 tr_loss 0.0883 elapsed 154.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10: tr_loss 0.0896 val_loss 0.1727 val_qwk 0.8411 epoch_time 183.0s total_elapsed 1825.8s


Early stopping triggered


Fold 0 val QWK (default th): 0.8622


Attempt 1: trying batch_size=8



===== Fold 1 / 5 =====


Peak GB before fold: 0.89


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8956 elapsed 25.8s


Epoch 1 it 100/329 tr_loss 0.7975 elapsed 51.6s


Epoch 1 it 150/329 tr_loss 0.6895 elapsed 77.6s


Epoch 1 it 200/329 tr_loss 0.6028 elapsed 103.3s


Epoch 1 it 250/329 tr_loss 0.5508 elapsed 129.0s


Epoch 1 it 300/329 tr_loss 0.5125 elapsed 154.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.4926 val_loss 0.2171 val_qwk 0.8030 epoch_time 183.2s total_elapsed 183.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2445 elapsed 25.9s


Epoch 2 it 100/329 tr_loss 0.2531 elapsed 51.6s


Epoch 2 it 150/329 tr_loss 0.2806 elapsed 77.2s


Epoch 2 it 200/329 tr_loss 0.2685 elapsed 102.9s


Epoch 2 it 250/329 tr_loss 0.2609 elapsed 128.5s


Epoch 2 it 300/329 tr_loss 0.2674 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2680 val_loss 0.2672 val_qwk 0.7240 epoch_time 182.5s total_elapsed 365.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2665 elapsed 25.5s


Epoch 3 it 100/329 tr_loss 0.2364 elapsed 51.3s


Epoch 3 it 150/329 tr_loss 0.2248 elapsed 77.0s


Epoch 3 it 200/329 tr_loss 0.2244 elapsed 102.6s


Epoch 3 it 250/329 tr_loss 0.2206 elapsed 128.5s


Epoch 3 it 300/329 tr_loss 0.2203 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2182 val_loss 0.1836 val_qwk 0.8438 epoch_time 182.8s total_elapsed 548.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1604 elapsed 25.7s


Epoch 4 it 100/329 tr_loss 0.1760 elapsed 51.5s


Epoch 4 it 150/329 tr_loss 0.1743 elapsed 77.2s


Epoch 4 it 200/329 tr_loss 0.1699 elapsed 103.0s


Epoch 4 it 250/329 tr_loss 0.1746 elapsed 128.6s


Epoch 4 it 300/329 tr_loss 0.1717 elapsed 154.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1694 val_loss 0.1708 val_qwk 0.8602 epoch_time 182.7s total_elapsed 731.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1407 elapsed 25.8s


Epoch 5 it 100/329 tr_loss 0.1395 elapsed 51.4s


Epoch 5 it 150/329 tr_loss 0.1375 elapsed 77.0s


Epoch 5 it 200/329 tr_loss 0.1402 elapsed 102.8s


Epoch 5 it 250/329 tr_loss 0.1449 elapsed 128.3s


Epoch 5 it 300/329 tr_loss 0.1463 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1473 val_loss 0.1842 val_qwk 0.8561 epoch_time 182.4s total_elapsed 913.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1363 elapsed 25.6s


Epoch 6 it 100/329 tr_loss 0.1346 elapsed 51.4s


Epoch 6 it 150/329 tr_loss 0.1407 elapsed 77.1s


Epoch 6 it 200/329 tr_loss 0.1414 elapsed 102.7s


Epoch 6 it 250/329 tr_loss 0.1396 elapsed 128.5s


Epoch 6 it 300/329 tr_loss 0.1326 elapsed 154.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1313 val_loss 0.1914 val_qwk 0.8353 epoch_time 182.5s total_elapsed 1096.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.0805 elapsed 25.7s


Epoch 7 it 100/329 tr_loss 0.0944 elapsed 51.4s


Epoch 7 it 150/329 tr_loss 0.1069 elapsed 77.0s


Epoch 7 it 200/329 tr_loss 0.1149 elapsed 102.7s


Epoch 7 it 250/329 tr_loss 0.1156 elapsed 128.4s


Epoch 7 it 300/329 tr_loss 0.1187 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1175 val_loss 0.1688 val_qwk 0.8521 epoch_time 182.2s total_elapsed 1278.5s


Early stopping triggered


Fold 1 val QWK (default th): 0.8602


Attempt 1: trying batch_size=8



===== Fold 2 / 5 =====


Peak GB before fold: 1.10


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.7916 elapsed 25.6s


Epoch 1 it 100/329 tr_loss 0.7574 elapsed 51.1s


Epoch 1 it 150/329 tr_loss 0.6685 elapsed 76.8s


Epoch 1 it 200/329 tr_loss 0.5915 elapsed 102.3s


Epoch 1 it 250/329 tr_loss 0.5373 elapsed 127.7s


Epoch 1 it 300/329 tr_loss 0.4952 elapsed 153.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.4795 val_loss 0.2534 val_qwk 0.7535 epoch_time 181.5s total_elapsed 181.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.3024 elapsed 25.7s


Epoch 2 it 100/329 tr_loss 0.2745 elapsed 51.3s


Epoch 2 it 150/329 tr_loss 0.2563 elapsed 76.9s


Epoch 2 it 200/329 tr_loss 0.2666 elapsed 102.4s


Epoch 2 it 250/329 tr_loss 0.2709 elapsed 128.2s


Epoch 2 it 300/329 tr_loss 0.2654 elapsed 153.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2624 val_loss 0.1730 val_qwk 0.8402 epoch_time 182.1s total_elapsed 363.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2121 elapsed 25.6s


Epoch 3 it 100/329 tr_loss 0.2254 elapsed 51.3s


Epoch 3 it 150/329 tr_loss 0.2270 elapsed 77.0s


Epoch 3 it 200/329 tr_loss 0.2319 elapsed 102.6s


Epoch 3 it 250/329 tr_loss 0.2264 elapsed 128.3s


Epoch 3 it 300/329 tr_loss 0.2185 elapsed 154.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2206 val_loss 0.1876 val_qwk 0.8413 epoch_time 182.3s total_elapsed 546.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.2090 elapsed 25.7s


Epoch 4 it 100/329 tr_loss 0.1909 elapsed 51.3s


Epoch 4 it 150/329 tr_loss 0.1873 elapsed 76.9s


Epoch 4 it 200/329 tr_loss 0.1805 elapsed 102.6s


Epoch 4 it 250/329 tr_loss 0.1791 elapsed 128.2s


Epoch 4 it 300/329 tr_loss 0.1733 elapsed 153.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1745 val_loss 0.1646 val_qwk 0.8622 epoch_time 182.1s total_elapsed 728.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1744 elapsed 25.6s


Epoch 5 it 100/329 tr_loss 0.1676 elapsed 51.4s


Epoch 5 it 150/329 tr_loss 0.1640 elapsed 77.0s


Epoch 5 it 200/329 tr_loss 0.1581 elapsed 102.5s


Epoch 5 it 250/329 tr_loss 0.1552 elapsed 128.3s


Epoch 5 it 300/329 tr_loss 0.1519 elapsed 153.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1497 val_loss 0.1669 val_qwk 0.8430 epoch_time 182.3s total_elapsed 910.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1086 elapsed 25.7s


Epoch 6 it 100/329 tr_loss 0.1193 elapsed 51.2s


Epoch 6 it 150/329 tr_loss 0.1169 elapsed 77.0s


Epoch 6 it 200/329 tr_loss 0.1198 elapsed 102.6s


Epoch 6 it 250/329 tr_loss 0.1230 elapsed 128.1s


Epoch 6 it 300/329 tr_loss 0.1262 elapsed 153.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1266 val_loss 0.1994 val_qwk 0.8054 epoch_time 182.0s total_elapsed 1092.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1330 elapsed 25.7s


Epoch 7 it 100/329 tr_loss 0.1087 elapsed 51.5s


Epoch 7 it 150/329 tr_loss 0.1078 elapsed 77.2s


Epoch 7 it 200/329 tr_loss 0.1109 elapsed 102.9s


Epoch 7 it 250/329 tr_loss 0.1112 elapsed 128.8s


Epoch 7 it 300/329 tr_loss 0.1104 elapsed 154.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1108 val_loss 0.1843 val_qwk 0.8534 epoch_time 182.8s total_elapsed 1275.4s


Early stopping triggered


Fold 2 val QWK (default th): 0.8622


Attempt 1: trying batch_size=8



===== Fold 3 / 5 =====


Peak GB before fold: 1.31


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8399 elapsed 25.6s


Epoch 1 it 100/329 tr_loss 0.7378 elapsed 51.2s


Epoch 1 it 150/329 tr_loss 0.6400 elapsed 77.0s


Epoch 1 it 200/329 tr_loss 0.5644 elapsed 102.6s


Epoch 1 it 250/329 tr_loss 0.5150 elapsed 128.3s


Epoch 1 it 300/329 tr_loss 0.4781 elapsed 153.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.4639 val_loss 0.3287 val_qwk 0.6301 epoch_time 182.3s total_elapsed 182.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.3039 elapsed 25.5s


Epoch 2 it 100/329 tr_loss 0.2765 elapsed 51.3s


Epoch 2 it 150/329 tr_loss 0.2805 elapsed 77.0s


Epoch 2 it 200/329 tr_loss 0.2700 elapsed 102.6s


Epoch 2 it 250/329 tr_loss 0.2707 elapsed 128.2s


Epoch 2 it 300/329 tr_loss 0.2691 elapsed 154.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2686 val_loss 0.2190 val_qwk 0.8134 epoch_time 182.3s total_elapsed 364.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.2038 elapsed 25.8s


Epoch 3 it 100/329 tr_loss 0.2213 elapsed 51.3s


Epoch 3 it 150/329 tr_loss 0.2306 elapsed 76.8s


Epoch 3 it 200/329 tr_loss 0.2303 elapsed 102.3s


Epoch 3 it 250/329 tr_loss 0.2240 elapsed 128.1s


Epoch 3 it 300/329 tr_loss 0.2170 elapsed 153.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2163 val_loss 0.1907 val_qwk 0.8181 epoch_time 181.8s total_elapsed 546.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1745 elapsed 25.4s


Epoch 4 it 100/329 tr_loss 0.1833 elapsed 51.0s


Epoch 4 it 150/329 tr_loss 0.1786 elapsed 76.5s


Epoch 4 it 200/329 tr_loss 0.1817 elapsed 102.3s


Epoch 4 it 250/329 tr_loss 0.1741 elapsed 127.9s


Epoch 4 it 300/329 tr_loss 0.1755 elapsed 153.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1731 val_loss 0.1811 val_qwk 0.8299 epoch_time 181.8s total_elapsed 728.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1680 elapsed 25.7s


Epoch 5 it 100/329 tr_loss 0.1540 elapsed 51.5s


Epoch 5 it 150/329 tr_loss 0.1509 elapsed 77.2s


Epoch 5 it 200/329 tr_loss 0.1514 elapsed 103.1s


Epoch 5 it 250/329 tr_loss 0.1585 elapsed 129.0s


Epoch 5 it 300/329 tr_loss 0.1576 elapsed 154.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1591 val_loss 0.1832 val_qwk 0.8423 epoch_time 183.3s total_elapsed 911.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1498 elapsed 25.7s


Epoch 6 it 100/329 tr_loss 0.1452 elapsed 51.2s


Epoch 6 it 150/329 tr_loss 0.1372 elapsed 76.9s


Epoch 6 it 200/329 tr_loss 0.1393 elapsed 102.7s


Epoch 6 it 250/329 tr_loss 0.1398 elapsed 128.4s


Epoch 6 it 300/329 tr_loss 0.1370 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1342 val_loss 0.1738 val_qwk 0.8372 epoch_time 182.4s total_elapsed 1094.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1053 elapsed 25.6s


Epoch 7 it 100/329 tr_loss 0.1048 elapsed 51.3s


Epoch 7 it 150/329 tr_loss 0.1138 elapsed 77.0s


Epoch 7 it 200/329 tr_loss 0.1175 elapsed 102.6s


Epoch 7 it 250/329 tr_loss 0.1198 elapsed 128.2s


Epoch 7 it 300/329 tr_loss 0.1195 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1176 val_loss 0.1747 val_qwk 0.8570 epoch_time 182.0s total_elapsed 1276.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.0885 elapsed 25.7s


Epoch 8 it 100/329 tr_loss 0.0868 elapsed 51.3s


Epoch 8 it 150/329 tr_loss 0.0900 elapsed 77.0s


Epoch 8 it 200/329 tr_loss 0.0945 elapsed 102.7s


Epoch 8 it 250/329 tr_loss 0.0985 elapsed 128.4s


Epoch 8 it 300/329 tr_loss 0.1020 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1029 val_loss 0.1880 val_qwk 0.8324 epoch_time 182.4s total_elapsed 1458.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9 it 50/329 tr_loss 0.0923 elapsed 25.6s


Epoch 9 it 100/329 tr_loss 0.0886 elapsed 51.2s


Epoch 9 it 150/329 tr_loss 0.0844 elapsed 77.0s


Epoch 9 it 200/329 tr_loss 0.0913 elapsed 102.6s


Epoch 9 it 250/329 tr_loss 0.0896 elapsed 128.1s


Epoch 9 it 300/329 tr_loss 0.0916 elapsed 153.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 9: tr_loss 0.0921 val_loss 0.1685 val_qwk 0.8647 epoch_time 182.1s total_elapsed 1640.6s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10 it 50/329 tr_loss 0.0864 elapsed 25.6s


Epoch 10 it 100/329 tr_loss 0.0906 elapsed 51.3s


Epoch 10 it 150/329 tr_loss 0.0850 elapsed 77.0s


Epoch 10 it 200/329 tr_loss 0.0775 elapsed 102.7s


Epoch 10 it 250/329 tr_loss 0.0783 elapsed 128.3s


Epoch 10 it 300/329 tr_loss 0.0813 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 10: tr_loss 0.0821 val_loss 0.1827 val_qwk 0.8549 epoch_time 182.2s total_elapsed 1822.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 11 it 50/329 tr_loss 0.0657 elapsed 25.7s


Epoch 11 it 100/329 tr_loss 0.0618 elapsed 51.4s


Epoch 11 it 150/329 tr_loss 0.0660 elapsed 77.0s


Epoch 11 it 200/329 tr_loss 0.0703 elapsed 102.6s


Epoch 11 it 250/329 tr_loss 0.0709 elapsed 128.4s


Epoch 11 it 300/329 tr_loss 0.0702 elapsed 153.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 11: tr_loss 0.0701 val_loss 0.1570 val_qwk 0.8613 epoch_time 182.3s total_elapsed 2005.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 12 it 50/329 tr_loss 0.0650 elapsed 25.6s


Epoch 12 it 100/329 tr_loss 0.0555 elapsed 51.2s


Epoch 12 it 150/329 tr_loss 0.0549 elapsed 76.8s


Epoch 12 it 200/329 tr_loss 0.0558 elapsed 102.6s


Epoch 12 it 250/329 tr_loss 0.0560 elapsed 128.2s


Epoch 12 it 300/329 tr_loss 0.0561 elapsed 153.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 12: tr_loss 0.0563 val_loss 0.1684 val_qwk 0.8468 epoch_time 182.2s total_elapsed 2187.3s


Early stopping triggered


Fold 3 val QWK (default th): 0.8647


Attempt 1: trying batch_size=8



===== Fold 4 / 5 =====


Peak GB before fold: 1.52


  model = create_fn(


  scaler = torch.cuda.amp.GradScaler(enabled=True)
  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1 it 50/329 tr_loss 0.8809 elapsed 25.7s


Epoch 1 it 100/329 tr_loss 0.7895 elapsed 51.4s


Epoch 1 it 150/329 tr_loss 0.6756 elapsed 77.0s


Epoch 1 it 200/329 tr_loss 0.5966 elapsed 102.9s


Epoch 1 it 250/329 tr_loss 0.5462 elapsed 128.6s


Epoch 1 it 300/329 tr_loss 0.5062 elapsed 154.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 1: tr_loss 0.4838 val_loss 0.2493 val_qwk 0.8079 epoch_time 182.7s total_elapsed 182.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2 it 50/329 tr_loss 0.2786 elapsed 25.6s


Epoch 2 it 100/329 tr_loss 0.2888 elapsed 51.2s


Epoch 2 it 150/329 tr_loss 0.2807 elapsed 76.9s


Epoch 2 it 200/329 tr_loss 0.2763 elapsed 102.8s


Epoch 2 it 250/329 tr_loss 0.2703 elapsed 128.3s


Epoch 2 it 300/329 tr_loss 0.2541 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 2: tr_loss 0.2565 val_loss 0.2056 val_qwk 0.8416 epoch_time 182.3s total_elapsed 365.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3 it 50/329 tr_loss 0.1979 elapsed 25.6s


Epoch 3 it 100/329 tr_loss 0.2073 elapsed 51.2s


Epoch 3 it 150/329 tr_loss 0.2081 elapsed 77.0s


Epoch 3 it 200/329 tr_loss 0.2125 elapsed 102.5s


Epoch 3 it 250/329 tr_loss 0.2092 elapsed 128.1s


Epoch 3 it 300/329 tr_loss 0.2128 elapsed 153.7s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 3: tr_loss 0.2137 val_loss 0.2038 val_qwk 0.8469 epoch_time 181.9s total_elapsed 547.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4 it 50/329 tr_loss 0.1952 elapsed 25.7s


Epoch 4 it 100/329 tr_loss 0.1840 elapsed 51.5s


Epoch 4 it 150/329 tr_loss 0.1823 elapsed 77.2s


Epoch 4 it 200/329 tr_loss 0.1847 elapsed 102.8s


Epoch 4 it 250/329 tr_loss 0.1810 elapsed 128.4s


Epoch 4 it 300/329 tr_loss 0.1779 elapsed 154.3s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 4: tr_loss 0.1774 val_loss 0.1847 val_qwk 0.8301 epoch_time 182.4s total_elapsed 729.5s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5 it 50/329 tr_loss 0.1757 elapsed 25.6s


Epoch 5 it 100/329 tr_loss 0.1680 elapsed 51.5s


Epoch 5 it 150/329 tr_loss 0.1682 elapsed 77.1s


Epoch 5 it 200/329 tr_loss 0.1617 elapsed 102.8s


Epoch 5 it 250/329 tr_loss 0.1588 elapsed 128.4s


Epoch 5 it 300/329 tr_loss 0.1608 elapsed 154.2s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 5: tr_loss 0.1600 val_loss 0.1718 val_qwk 0.8669 epoch_time 182.4s total_elapsed 911.9s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6 it 50/329 tr_loss 0.1563 elapsed 25.8s


Epoch 6 it 100/329 tr_loss 0.1543 elapsed 51.4s


Epoch 6 it 150/329 tr_loss 0.1458 elapsed 77.0s


Epoch 6 it 200/329 tr_loss 0.1406 elapsed 102.6s


Epoch 6 it 250/329 tr_loss 0.1455 elapsed 128.5s


Epoch 6 it 300/329 tr_loss 0.1462 elapsed 154.1s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 6: tr_loss 0.1452 val_loss 0.1604 val_qwk 0.8540 epoch_time 182.5s total_elapsed 1094.4s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7 it 50/329 tr_loss 0.1053 elapsed 25.6s


Epoch 7 it 100/329 tr_loss 0.1181 elapsed 51.2s


Epoch 7 it 150/329 tr_loss 0.1176 elapsed 76.9s


Epoch 7 it 200/329 tr_loss 0.1211 elapsed 102.7s


Epoch 7 it 250/329 tr_loss 0.1240 elapsed 128.3s


Epoch 7 it 300/329 tr_loss 0.1231 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 7: tr_loss 0.1236 val_loss 0.1586 val_qwk 0.8548 epoch_time 182.4s total_elapsed 1276.8s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8 it 50/329 tr_loss 0.0889 elapsed 25.7s


Epoch 8 it 100/329 tr_loss 0.0920 elapsed 51.3s


Epoch 8 it 150/329 tr_loss 0.0931 elapsed 76.9s


Epoch 8 it 200/329 tr_loss 0.0990 elapsed 102.7s


Epoch 8 it 250/329 tr_loss 0.1000 elapsed 128.4s


Epoch 8 it 300/329 tr_loss 0.1028 elapsed 154.0s


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Epoch 8: tr_loss 0.1024 val_loss 0.1800 val_qwk 0.8321 epoch_time 182.4s total_elapsed 1459.2s


Early stopping triggered


Fold 4 val QWK (default th): 0.8669


Optimized thresholds: [0.52202625 1.33637352 2.53423977 3.68851833] OOF QWK: 0.86733


B5 OOF QWK: 0.867325189236931


Running B5 inference...


Inference model 1/5


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


Inference model 2/5


Inference model 3/5


Inference model 4/5


Inference model 5/5


Saved submission.csv and test_reg_preds.npy


Blended OOF QWK: 0.87970 Thresholds: [0.56248219 1.49132383 2.58129375 3.13354049]


Saved blended submission.csv and test_reg_preds_blend.npy


Head of blended submission:
         id_code  diagnosis
0  b460ca9fa26f          0
1  6cee2e148520          0
2  ca6842bfcbc9          1
3  6cbc3dad809c          2
4  a9bc2f892cb3          0


In [6]:
# Stronger TTA inference for both B4 and B5, then blend with thresholds_blend.npy
import numpy as np, pandas as pd, torch
from torch.utils.data import DataLoader

def multi_tta_predict(models, dl):
    preds_accum = []
    for m_i, model in enumerate(models):
        model.eval()
        preds_model = []
        with torch.no_grad():
            for xb, _ in dl:
                xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
                with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):
                    p0 = model(xb)
                    p1 = model(torch.flip(xb, dims=[-1]))  # hflip
                    p2 = model(torch.flip(xb, dims=[-2]))  # vflip
                    x90 = xb.transpose(-1, -2)  # rot90 via transpose + hflip
                    p3 = model(x90)
                    x270 = torch.flip(x90, dims=[-1])
                    p4 = model(x270)
                    p = (p0 + p1 + p2 + p3 + p4) / 5.0
                preds_model.append(p.detach().float().cpu().numpy())
        preds_model = np.concatenate(preds_model)
        preds_accum.append(preds_model)
        print(f'TTA inference done for model {m_i+1}/{len(models)}', flush=True)
    return np.mean(np.stack(preds_accum, axis=0), axis=0)

print('Running stronger TTA + blend...', flush=True)
dte = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
dl_te = DataLoader(dte, batch_size=BATCH_SIZE*2, shuffle=False, num_workers=NUM_WORKERS_TEST, pin_memory=True, persistent_workers=False)

# B4 TTA
preds_b4_tta = multi_tta_predict(models, dl_te)
np.save('test_reg_preds_b4_tta.npy', preds_b4_tta)

# B5 TTA
preds_b5_tta = multi_tta_predict(models_b5, dl_te)
np.save('test_reg_preds_b5_tta.npy', preds_b5_tta)

# Blend and threshold (use thresholds_blend optimized on OOF blend)
th_blend = np.load('thresholds_blend.npy')
preds_blend_tta = 0.5 * preds_b4_tta + 0.5 * preds_b5_tta
cls_blend_tta = preds_to_classes(preds_blend_tta, th_blend).astype(int)
sub_blend_tta = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls_blend_tta})
sub_blend_tta.to_csv('submission.csv', index=False)
np.save('test_reg_preds_blend_tta.npy', preds_blend_tta)
print('Saved TTA-blended submission.csv and test_reg_preds_blend_tta.npy', flush=True)
print('Head of TTA-blended submission:\n', sub_blend_tta.head(), flush=True)

Running stronger TTA + blend...


  with torch.cuda.amp.autocast(enabled=True, dtype=torch.float16):


TTA inference done for model 1/5


TTA inference done for model 2/5


TTA inference done for model 3/5


TTA inference done for model 4/5


TTA inference done for model 5/5


TTA inference done for model 1/5


TTA inference done for model 2/5


TTA inference done for model 3/5


TTA inference done for model 4/5


TTA inference done for model 5/5


Saved TTA-blended submission.csv and test_reg_preds_blend_tta.npy


Head of TTA-blended submission:
         id_code  diagnosis
0  b460ca9fa26f          0
1  6cee2e148520          0
2  ca6842bfcbc9          1
3  6cbc3dad809c          2
4  a9bc2f892cb3          0


In [11]:
# RRC + EMA strengthened recipe and new B5@512 seed run
import math, time, copy, gc, os
import numpy as np
import pandas as pd
import albumentations as A
import cv2
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from timm.utils import ModelEmaV2
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize
from albumentations.pytorch import ToTensorV2

print('Setting up strengthened recipe: RRC + EMA for B5@512', flush=True)

# Augmentations per expert advice (fix params for current Albumentations)
MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_rrc = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.85, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.06), rotate=(-15, 15), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.HueSaturationValue(hue_shift_limit=5, sat_shift_limit=8, val_shift_limit=8, p=0.3),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_rrc = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def preds_to_classes_local(preds, thresholds):
    th0, th1, th2, th3 = thresholds
    return np.digitize(preds, bins=[th0, th1, th2, th3])

def qwk_local(y_true, y_pred_cls):
    return cohen_kappa_score(y_true, y_pred_cls, weights='quadratic')

def optimize_thresholds_local(oof_targets, oof_preds, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(oof_targets).astype(float)
    p = np.asarray(oof_preds).astype(float)
    def _loss(th):
        th = np.sort(th)
        th = np.clip(th, -1.0, 4.0)
        cls = preds_to_classes_local(p, th)
        return -qwk_local(y, cls)
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05:
            th[i] = th[i-1] + 0.05
    return th

def get_loaders_rrc(tr_df, va_df, batch_size=16, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_rrc)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_rrc)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_with_model(model, dl, loss_fn):
    model.eval()
    preds = []; targs = []; val_loss = 0.0; n = 0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb)
                loss = loss_fn(out, yb)
            bs = xb.size(0)
            val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy())
            targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/n, preds, targs

def train_one_fold_ema(fold, folds_df, epochs=12, lr=2e-4, wd=1e-5, batch_size=16, patience=3, backbone='tf_efficientnet_b5_ns', num_workers=0, ema_decay=0.9996):
    print(f'\n===== [RRC+EMA] Fold {fold} / {folds_df["fold"].nunique()} =====', flush=True)
    if torch.cuda.is_available():
        try: torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        print('Peak GB before fold:', f'{torch.cuda.max_memory_allocated()/1024**3:.2f}', flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_rrc(tr_df, va_df, batch_size=batch_size, num_workers=num_workers)
    model = RegHeadModel(backbone_name=backbone, pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr):
            return (step+1)/len(dl_tr)
        progress = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5 * (1 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    best_metric = float('inf'); best_state = None; best_preds = None; best_targs = None; no_improve = 0
    global_step = 0; t_start = time.time()
    accum_steps = max(1, 16 // batch_size)
    optimizer.zero_grad(set_to_none=True)
    for epoch in range(1, epochs+1):
        model.train()
        tr_loss = 0.0; n = 0; t0 = time.time()
        for it, (xb, yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb)
                loss = loss_fn(out, yb) / accum_steps
            scaler.scale(loss).backward()
            if (it + 1) % accum_steps == 0:
                scaler.step(optimizer)
                scaler.update()
                optimizer.zero_grad(set_to_none=True)
                scheduler.step()
                ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item() * accum_steps)*bs; n += bs; global_step += 1
            if (it+1) % 50 == 0:
                print(f'Epoch {epoch} it {it+1}/{len(dl_tr)} tr_loss {(tr_loss/n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        # Validate with current (non-EMA) model for stability; still save EMA weights
        val_loss_m, v_preds_m, v_targs = validate_with_model(model, dl_va, loss_fn)
        cls_def = preds_to_classes_local(v_preds_m, [0.5,1.5,2.5,3.5])
        val_q_def = qwk_local(v_targs, cls_def)
        print(f'Epoch {epoch}: tr_loss {(tr_loss/n):.4f} val_loss {val_loss_m:.4f} val_qwk(def) {val_q_def:.4f} preds[min/mean/max]={v_preds_m.min():.3f}/{v_preds_m.mean():.3f}/{v_preds_m.max():.3f} epoch_time {time.time()-t0:.1f}s total_elapsed {time.time()-t_start:.1f}s', flush=True)
        # Early stopping on val_loss
        if val_loss_m < best_metric:
            best_metric = val_loss_m; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = v_preds_m.copy(); best_targs = v_targs.copy(); no_improve = 0
        else:
            no_improve += 1
        if no_improve >= patience:
            print('Early stopping triggered', flush=True); break
    # Load best EMA weights and return EMA model
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_ema(folds_df, backbone='tf_efficientnet_b5_ns', epochs=15, batch_size=8, num_workers=0):
    folds_df = folds_df.reset_index(drop=True).copy()
    n_folds = folds_df['fold'].nunique()
    oof_preds = np.zeros(len(folds_df), dtype=np.float32)
    oof_targs = folds_df['diagnosis'].values.astype(float)
    models_local = []
    for fold in range(n_folds):
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available():
                torch.cuda.reset_peak_memory_stats()
        except Exception:
            pass
        gc.collect()
        current_bs = batch_size
        for attempt in range(6):
            print(f'Attempt {attempt+1}: trying batch_size={current_bs}', flush=True)
            try:
                fm, v_preds, v_targs = train_one_fold_ema(fold, folds_df, epochs=epochs, lr=LR, wd=WD, batch_size=current_bs, patience=PATIENCE, backbone=backbone, num_workers=num_workers)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                msg = str(e).lower()
                if 'out of memory' in msg or 'cuda out of memory' in msg:
                    print(f'CUDA OOM on fold {fold} with batch_size={current_bs}. Reducing and retrying (attempt {attempt+1})...', flush=True)
                    try: del fm
                    except Exception: pass
                    gc.collect()
                    try:
                        torch.cuda.empty_cache()
                        if torch.cuda.is_available():
                            torch.cuda.reset_peak_memory_stats()
                    except Exception:
                        pass
                    current_bs = max(2, current_bs - 2)
                    if attempt >= 5:
                        raise
                    continue
                else:
                    raise
        models_local.append(fm)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = v_preds
        fold_q = qwk_local(v_targs, preds_to_classes_local(v_preds, [0.5,1.5,2.5,3.5]))
        print(f'[RRC+EMA] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    th = optimize_thresholds_local(oof_targs, oof_preds)
    oof_q = qwk_local(oof_targs, preds_to_classes_local(oof_preds, th))
    print('[RRC+EMA] Optimized thresholds:', th, 'OOF QWK:', f'{oof_q:.5f}', flush=True)
    np.save('oof_preds_b5_seed2025_rrc_ema.npy', oof_preds); np.save('oof_targets_b5_seed2025_rrc_ema.npy', oof_targs); np.save('thresholds_b5_seed2025_rrc_ema.npy', th)
    return models_local, th, oof_q

def bootstrap_thresholds(oof_targets, oof_preds, n_bootstrap=100, base=[0.5,1.5,2.5,3.5]):
    N = len(oof_targets)
    cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_local(oof_targets[idx], oof_preds[idx], init=base)
        cuts.append(th)
        if (i+1) % 20 == 0:
            print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts)
    med = np.median(cuts, axis=0)
    med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j] - med[j-1] < 0.1:
            med[j] = med[j-1] + 0.1
    base_arr = np.array([0.5,1.5,2.5,3.5])
    med = 0.9*med + 0.1*base_arr
    return med

# Launch B5@512 with new seed and RRC+EMA
SEED_NEW = 2025
seed_everything(SEED_NEW)
BACKBONE_NEW = 'tf_efficientnet_b5_ns'
EPOCHS_NEW = 15
BATCH_SIZE_NEW = 8
print('Starting RRC+EMA B5@512 training, seed', SEED_NEW, flush=True)
models_b5_seed2025_ema, th_b5_seed2025_ema, oof_q_b5_seed2025_ema = run_cv_ema(folds_df, backbone=BACKBONE_NEW, epochs=EPOCHS_NEW, batch_size=BATCH_SIZE_NEW, num_workers=NUM_WORKERS_TRAIN)
print('B5@512 RRC+EMA OOF QWK:', oof_q_b5_seed2025_ema, flush=True)

Setting up strengthened recipe: RRC + EMA for B5@512


Starting RRC+EMA B5@512 training, seed 2025


Attempt 1: trying batch_size=8



===== [RRC+EMA] Fold 0 / 5 =====


Peak GB before fold: 2.61


  model = create_fn(


Epoch 1 it 50/329 tr_loss 0.8294 elapsed 27.2s


Epoch 1 it 100/329 tr_loss 0.7854 elapsed 54.6s


Epoch 1 it 150/329 tr_loss 0.6518 elapsed 81.6s


Epoch 1 it 200/329 tr_loss 0.5687 elapsed 108.7s


Epoch 1 it 250/329 tr_loss 0.5134 elapsed 135.8s


Epoch 1 it 300/329 tr_loss 0.4774 elapsed 162.9s


Epoch 1: tr_loss 0.4650 val_loss 0.2359 val_qwk(def) 0.7981 preds[min/mean/max]=-0.466/0.939/5.203 epoch_time 192.1s total_elapsed 192.1s


Epoch 2 it 50/329 tr_loss 0.3449 elapsed 27.2s


Epoch 2 it 100/329 tr_loss 0.2995 elapsed 54.3s


Epoch 2 it 150/329 tr_loss 0.2772 elapsed 81.5s


Epoch 2 it 200/329 tr_loss 0.2737 elapsed 108.6s


Epoch 2 it 250/329 tr_loss 0.2717 elapsed 135.6s


Epoch 2 it 300/329 tr_loss 0.2732 elapsed 162.8s


Epoch 2: tr_loss 0.2693 val_loss 0.2069 val_qwk(def) 0.8119 preds[min/mean/max]=-0.300/1.122/3.162 epoch_time 191.8s total_elapsed 384.0s


Epoch 3 it 50/329 tr_loss 0.2276 elapsed 27.1s


Epoch 3 it 100/329 tr_loss 0.2414 elapsed 54.5s


Epoch 3 it 150/329 tr_loss 0.2366 elapsed 81.6s


Epoch 3 it 200/329 tr_loss 0.2269 elapsed 108.7s


Epoch 3 it 250/329 tr_loss 0.2248 elapsed 135.8s


Epoch 3 it 300/329 tr_loss 0.2239 elapsed 163.0s


Epoch 3: tr_loss 0.2225 val_loss 0.1761 val_qwk(def) 0.8621 preds[min/mean/max]=-0.122/1.193/3.828 epoch_time 191.9s total_elapsed 576.0s


Epoch 4 it 50/329 tr_loss 0.1888 elapsed 27.1s


Epoch 4 it 100/329 tr_loss 0.1867 elapsed 54.1s


Epoch 4 it 150/329 tr_loss 0.1800 elapsed 81.4s


Epoch 4 it 200/329 tr_loss 0.1963 elapsed 108.5s


Epoch 4 it 250/329 tr_loss 0.1946 elapsed 135.7s


Epoch 4 it 300/329 tr_loss 0.1930 elapsed 163.1s


Epoch 4: tr_loss 0.1872 val_loss 0.1646 val_qwk(def) 0.8713 preds[min/mean/max]=-0.384/1.136/4.535 epoch_time 192.1s total_elapsed 768.1s


Epoch 5 it 50/329 tr_loss 0.1446 elapsed 27.1s


Epoch 5 it 100/329 tr_loss 0.1645 elapsed 54.2s


Epoch 5 it 150/329 tr_loss 0.1671 elapsed 81.3s


Epoch 5 it 200/329 tr_loss 0.1569 elapsed 108.3s


Epoch 5 it 250/329 tr_loss 0.1577 elapsed 135.4s


Epoch 5 it 300/329 tr_loss 0.1608 elapsed 162.5s


Epoch 5: tr_loss 0.1576 val_loss 0.1637 val_qwk(def) 0.8556 preds[min/mean/max]=-0.216/1.002/5.266 epoch_time 191.7s total_elapsed 959.9s


Epoch 6 it 50/329 tr_loss 0.1235 elapsed 27.0s


Epoch 6 it 100/329 tr_loss 0.1234 elapsed 54.1s


Epoch 6 it 150/329 tr_loss 0.1326 elapsed 81.2s


Epoch 6 it 200/329 tr_loss 0.1270 elapsed 108.3s


Epoch 6 it 250/329 tr_loss 0.1278 elapsed 135.3s


Epoch 6 it 300/329 tr_loss 0.1277 elapsed 162.4s


Epoch 6: tr_loss 0.1311 val_loss 0.1681 val_qwk(def) 0.8782 preds[min/mean/max]=-0.126/1.195/5.262 epoch_time 191.7s total_elapsed 1151.7s


Epoch 7 it 50/329 tr_loss 0.1062 elapsed 27.1s


Epoch 7 it 100/329 tr_loss 0.1099 elapsed 54.2s


Epoch 7 it 150/329 tr_loss 0.1089 elapsed 81.3s


Epoch 7 it 200/329 tr_loss 0.1137 elapsed 108.4s


Epoch 7 it 250/329 tr_loss 0.1159 elapsed 135.6s


Epoch 7 it 300/329 tr_loss 0.1227 elapsed 162.7s


Epoch 7: tr_loss 0.1262 val_loss 0.1644 val_qwk(def) 0.8637 preds[min/mean/max]=-0.245/1.094/4.285 epoch_time 192.1s total_elapsed 1343.8s


Epoch 8 it 50/329 tr_loss 0.1182 elapsed 27.1s


Epoch 8 it 100/329 tr_loss 0.1111 elapsed 54.1s


Epoch 8 it 150/329 tr_loss 0.1112 elapsed 81.2s


Epoch 8 it 200/329 tr_loss 0.1167 elapsed 108.4s


Epoch 8 it 250/329 tr_loss 0.1106 elapsed 135.5s


Epoch 8 it 300/329 tr_loss 0.1064 elapsed 162.7s


Epoch 8: tr_loss 0.1052 val_loss 0.1629 val_qwk(def) 0.8598 preds[min/mean/max]=-0.285/0.979/4.352 epoch_time 191.9s total_elapsed 1535.7s


Epoch 9 it 50/329 tr_loss 0.0912 elapsed 27.0s


Epoch 9 it 100/329 tr_loss 0.0885 elapsed 54.2s


Epoch 9 it 150/329 tr_loss 0.0891 elapsed 81.3s


Epoch 9 it 200/329 tr_loss 0.0887 elapsed 108.5s


Epoch 9 it 250/329 tr_loss 0.0973 elapsed 135.6s


Epoch 9 it 300/329 tr_loss 0.0985 elapsed 162.7s


Epoch 9: tr_loss 0.0992 val_loss 0.1648 val_qwk(def) 0.8595 preds[min/mean/max]=-0.197/1.001/4.117 epoch_time 191.7s total_elapsed 1727.5s


Epoch 10 it 50/329 tr_loss 0.0797 elapsed 27.3s


Epoch 10 it 100/329 tr_loss 0.0790 elapsed 54.4s


Epoch 10 it 150/329 tr_loss 0.0781 elapsed 81.5s


Epoch 10 it 200/329 tr_loss 0.0766 elapsed 108.5s


Epoch 10 it 250/329 tr_loss 0.0793 elapsed 135.6s


Epoch 10 it 300/329 tr_loss 0.0805 elapsed 162.7s


Epoch 10: tr_loss 0.0804 val_loss 0.1539 val_qwk(def) 0.8821 preds[min/mean/max]=-0.216/1.112/4.242 epoch_time 191.6s total_elapsed 1919.1s


Epoch 11 it 50/329 tr_loss 0.0666 elapsed 27.2s


Epoch 11 it 100/329 tr_loss 0.0686 elapsed 54.4s


Epoch 11 it 150/329 tr_loss 0.0634 elapsed 81.4s


Epoch 11 it 200/329 tr_loss 0.0641 elapsed 108.9s


Epoch 11 it 250/329 tr_loss 0.0674 elapsed 136.0s


Epoch 11 it 300/329 tr_loss 0.0686 elapsed 163.2s


Epoch 11: tr_loss 0.0681 val_loss 0.1722 val_qwk(def) 0.8469 preds[min/mean/max]=-0.223/1.012/4.262 epoch_time 192.2s total_elapsed 2111.4s


Epoch 12 it 50/329 tr_loss 0.0626 elapsed 27.0s


Epoch 12 it 100/329 tr_loss 0.0575 elapsed 54.1s


Epoch 12 it 150/329 tr_loss 0.0590 elapsed 81.0s


Epoch 12 it 200/329 tr_loss 0.0564 elapsed 108.0s


Epoch 12 it 250/329 tr_loss 0.0558 elapsed 135.0s


Epoch 12 it 300/329 tr_loss 0.0553 elapsed 162.1s


Epoch 12: tr_loss 0.0567 val_loss 0.1716 val_qwk(def) 0.8452 preds[min/mean/max]=-0.019/1.106/3.953 epoch_time 191.4s total_elapsed 2302.7s


Epoch 13 it 50/329 tr_loss 0.0508 elapsed 27.2s


Epoch 13 it 100/329 tr_loss 0.0469 elapsed 54.4s


Epoch 13 it 150/329 tr_loss 0.0446 elapsed 81.5s


Epoch 13 it 200/329 tr_loss 0.0437 elapsed 108.6s


Epoch 13 it 250/329 tr_loss 0.0477 elapsed 135.7s


Epoch 13 it 300/329 tr_loss 0.0495 elapsed 162.8s


Epoch 13: tr_loss 0.0485 val_loss 0.1763 val_qwk(def) 0.8587 preds[min/mean/max]=-0.149/1.111/4.051 epoch_time 192.1s total_elapsed 2494.8s


Early stopping triggered


[RRC+EMA] Fold 0 val QWK (default th): 0.8821


Attempt 1: trying batch_size=8



===== [RRC+EMA] Fold 1 / 5 =====


Peak GB before fold: 2.71


  model = create_fn(


Epoch 1 it 50/329 tr_loss 0.7917 elapsed 27.1s


Epoch 1 it 100/329 tr_loss 0.7129 elapsed 54.3s


Epoch 1 it 150/329 tr_loss 0.6190 elapsed 81.4s


Epoch 1 it 200/329 tr_loss 0.5436 elapsed 108.6s


Epoch 1 it 250/329 tr_loss 0.5012 elapsed 135.7s


Epoch 1 it 300/329 tr_loss 0.4726 elapsed 162.9s


Epoch 1: tr_loss 0.4612 val_loss 0.2567 val_qwk(def) 0.7305 preds[min/mean/max]=-0.319/0.837/3.213 epoch_time 192.0s total_elapsed 192.0s


Epoch 2 it 50/329 tr_loss 0.2318 elapsed 27.1s


Epoch 2 it 100/329 tr_loss 0.2700 elapsed 54.2s


Epoch 2 it 150/329 tr_loss 0.2727 elapsed 81.5s


Epoch 2 it 200/329 tr_loss 0.2657 elapsed 108.7s


Epoch 2 it 250/329 tr_loss 0.2793 elapsed 135.8s


Epoch 2 it 300/329 tr_loss 0.2726 elapsed 162.9s


Epoch 2: tr_loss 0.2679 val_loss 0.2192 val_qwk(def) 0.8213 preds[min/mean/max]=-0.465/0.950/6.039 epoch_time 191.7s total_elapsed 383.8s


Epoch 3 it 50/329 tr_loss 0.2222 elapsed 27.1s


Epoch 3 it 100/329 tr_loss 0.2119 elapsed 54.1s


Epoch 3 it 150/329 tr_loss 0.2102 elapsed 81.2s


Epoch 3 it 200/329 tr_loss 0.1985 elapsed 108.2s


Epoch 3 it 250/329 tr_loss 0.2042 elapsed 135.2s


Epoch 3 it 300/329 tr_loss 0.2074 elapsed 162.3s


Epoch 3: tr_loss 0.2093 val_loss 0.2196 val_qwk(def) 0.7934 preds[min/mean/max]=-0.206/0.921/3.658 epoch_time 191.5s total_elapsed 575.4s


Epoch 4 it 50/329 tr_loss 0.1968 elapsed 27.1s


Epoch 4 it 100/329 tr_loss 0.1821 elapsed 54.1s


Epoch 4 it 150/329 tr_loss 0.1821 elapsed 81.2s


Epoch 4 it 200/329 tr_loss 0.1846 elapsed 108.2s


Epoch 4 it 250/329 tr_loss 0.1795 elapsed 135.2s


Epoch 4 it 300/329 tr_loss 0.1840 elapsed 162.2s


Epoch 4: tr_loss 0.1825 val_loss 0.1976 val_qwk(def) 0.8100 preds[min/mean/max]=-0.117/0.939/4.699 epoch_time 191.2s total_elapsed 766.6s


Epoch 5 it 50/329 tr_loss 0.1428 elapsed 27.1s


Epoch 5 it 100/329 tr_loss 0.1501 elapsed 54.2s


Epoch 5 it 150/329 tr_loss 0.1491 elapsed 81.4s


Epoch 5 it 200/329 tr_loss 0.1537 elapsed 108.4s


Epoch 5 it 250/329 tr_loss 0.1555 elapsed 135.4s


Epoch 5 it 300/329 tr_loss 0.1580 elapsed 162.6s


Epoch 5: tr_loss 0.1555 val_loss 0.1700 val_qwk(def) 0.8652 preds[min/mean/max]=-0.205/1.106/4.988 epoch_time 191.5s total_elapsed 958.2s


Epoch 6 it 50/329 tr_loss 0.1429 elapsed 27.0s


Epoch 6 it 100/329 tr_loss 0.1343 elapsed 53.9s


Epoch 6 it 150/329 tr_loss 0.1316 elapsed 81.0s


Epoch 6 it 200/329 tr_loss 0.1273 elapsed 108.1s


Epoch 6 it 250/329 tr_loss 0.1269 elapsed 135.3s


Epoch 6 it 300/329 tr_loss 0.1334 elapsed 162.4s


Epoch 6: tr_loss 0.1316 val_loss 0.1728 val_qwk(def) 0.8529 preds[min/mean/max]=-0.098/1.071/4.285 epoch_time 191.7s total_elapsed 1150.0s


Epoch 7 it 50/329 tr_loss 0.1129 elapsed 27.2s


Epoch 7 it 100/329 tr_loss 0.1090 elapsed 54.3s


Epoch 7 it 150/329 tr_loss 0.1102 elapsed 81.3s


Epoch 7 it 200/329 tr_loss 0.1150 elapsed 108.4s


Epoch 7 it 250/329 tr_loss 0.1148 elapsed 135.5s


Epoch 7 it 300/329 tr_loss 0.1150 elapsed 162.5s


Epoch 7: tr_loss 0.1150 val_loss 0.1774 val_qwk(def) 0.8486 preds[min/mean/max]=-0.059/1.142/4.441 epoch_time 191.6s total_elapsed 1341.6s


Epoch 8 it 50/329 tr_loss 0.0926 elapsed 27.0s


Epoch 8 it 100/329 tr_loss 0.1016 elapsed 54.4s


Epoch 8 it 150/329 tr_loss 0.1006 elapsed 81.5s


Epoch 8 it 200/329 tr_loss 0.0980 elapsed 108.7s


Epoch 8 it 250/329 tr_loss 0.1014 elapsed 135.8s


Epoch 8 it 300/329 tr_loss 0.1016 elapsed 162.7s


Epoch 8: tr_loss 0.1033 val_loss 0.1922 val_qwk(def) 0.8313 preds[min/mean/max]=-0.343/0.857/3.568 epoch_time 191.6s total_elapsed 1533.2s


Early stopping triggered


[RRC+EMA] Fold 1 val QWK (default th): 0.8652


Attempt 1: trying batch_size=8



===== [RRC+EMA] Fold 2 / 5 =====


Peak GB before fold: 2.82


  model = create_fn(


Epoch 1 it 50/329 tr_loss 0.8881 elapsed 27.0s


Epoch 1 it 100/329 tr_loss 0.7580 elapsed 54.1s


Epoch 1 it 150/329 tr_loss 0.6554 elapsed 81.3s


Epoch 1 it 200/329 tr_loss 0.5753 elapsed 108.4s


Epoch 1 it 250/329 tr_loss 0.5270 elapsed 135.5s


Epoch 1 it 300/329 tr_loss 0.4861 elapsed 162.6s


Epoch 1: tr_loss 0.4775 val_loss 0.2715 val_qwk(def) 0.7659 preds[min/mean/max]=-0.623/0.893/3.705 epoch_time 191.7s total_elapsed 191.7s


Epoch 2 it 50/329 tr_loss 0.2855 elapsed 27.0s


Epoch 2 it 100/329 tr_loss 0.2740 elapsed 54.1s


Epoch 2 it 150/329 tr_loss 0.2751 elapsed 81.2s


Epoch 2 it 200/329 tr_loss 0.2839 elapsed 108.4s


Epoch 2 it 250/329 tr_loss 0.2788 elapsed 135.5s


Epoch 2 it 300/329 tr_loss 0.2728 elapsed 162.5s


Epoch 2: tr_loss 0.2746 val_loss 0.2340 val_qwk(def) 0.8309 preds[min/mean/max]=-0.161/1.316/3.418 epoch_time 191.8s total_elapsed 383.6s


Epoch 3 it 50/329 tr_loss 0.2141 elapsed 27.1s


Epoch 3 it 100/329 tr_loss 0.2393 elapsed 54.5s


Epoch 3 it 150/329 tr_loss 0.2371 elapsed 81.7s


Epoch 3 it 200/329 tr_loss 0.2323 elapsed 108.9s


Epoch 3 it 250/329 tr_loss 0.2260 elapsed 136.0s


Epoch 3 it 300/329 tr_loss 0.2211 elapsed 163.0s


Epoch 3: tr_loss 0.2175 val_loss 0.2090 val_qwk(def) 0.8473 preds[min/mean/max]=-0.228/1.219/5.277 epoch_time 192.2s total_elapsed 575.9s


Epoch 4 it 50/329 tr_loss 0.1746 elapsed 27.2s


Epoch 4 it 100/329 tr_loss 0.1732 elapsed 54.5s


Epoch 4 it 150/329 tr_loss 0.1761 elapsed 81.6s


Epoch 4 it 200/329 tr_loss 0.1708 elapsed 108.6s


Epoch 4 it 250/329 tr_loss 0.1771 elapsed 135.7s


Epoch 4 it 300/329 tr_loss 0.1824 elapsed 163.0s


Epoch 4: tr_loss 0.1812 val_loss 0.1822 val_qwk(def) 0.8485 preds[min/mean/max]=-0.403/1.033/3.789 epoch_time 192.1s total_elapsed 768.0s


Epoch 5 it 50/329 tr_loss 0.1805 elapsed 27.1s


Epoch 5 it 100/329 tr_loss 0.1612 elapsed 54.4s


Epoch 5 it 150/329 tr_loss 0.1580 elapsed 81.5s


Epoch 5 it 200/329 tr_loss 0.1591 elapsed 108.7s


Epoch 5 it 250/329 tr_loss 0.1643 elapsed 135.8s


Epoch 5 it 300/329 tr_loss 0.1679 elapsed 162.9s


Epoch 5: tr_loss 0.1667 val_loss 0.1956 val_qwk(def) 0.8640 preds[min/mean/max]=-0.201/1.206/5.586 epoch_time 192.2s total_elapsed 960.3s


Epoch 6 it 50/329 tr_loss 0.1389 elapsed 27.1s


Epoch 6 it 100/329 tr_loss 0.1449 elapsed 54.2s


Epoch 6 it 150/329 tr_loss 0.1433 elapsed 81.2s


Epoch 6 it 200/329 tr_loss 0.1366 elapsed 108.5s


Epoch 6 it 250/329 tr_loss 0.1365 elapsed 135.5s


Epoch 6 it 300/329 tr_loss 0.1336 elapsed 162.8s


Epoch 6: tr_loss 0.1340 val_loss 0.1719 val_qwk(def) 0.8496 preds[min/mean/max]=-0.160/1.037/4.078 epoch_time 192.0s total_elapsed 1152.3s


Epoch 7 it 50/329 tr_loss 0.1268 elapsed 27.1s


Epoch 7 it 100/329 tr_loss 0.1227 elapsed 54.2s


Epoch 7 it 150/329 tr_loss 0.1231 elapsed 81.3s


Epoch 7 it 200/329 tr_loss 0.1265 elapsed 108.3s


Epoch 7 it 250/329 tr_loss 0.1256 elapsed 135.5s


Epoch 7 it 300/329 tr_loss 0.1278 elapsed 162.6s


Epoch 7: tr_loss 0.1249 val_loss 0.1757 val_qwk(def) 0.8389 preds[min/mean/max]=-0.022/1.180/4.188 epoch_time 191.6s total_elapsed 1343.9s


Epoch 8 it 50/329 tr_loss 0.0885 elapsed 27.2s


Epoch 8 it 100/329 tr_loss 0.0957 elapsed 54.3s


Epoch 8 it 150/329 tr_loss 0.0900 elapsed 81.5s


Epoch 8 it 200/329 tr_loss 0.0985 elapsed 108.6s


Epoch 8 it 250/329 tr_loss 0.0980 elapsed 135.7s


Epoch 8 it 300/329 tr_loss 0.0987 elapsed 162.8s


Epoch 8: tr_loss 0.1006 val_loss 0.1830 val_qwk(def) 0.8548 preds[min/mean/max]=-0.174/1.167/4.379 epoch_time 192.1s total_elapsed 1536.0s


Epoch 9 it 50/329 tr_loss 0.0874 elapsed 27.1s


Epoch 9 it 100/329 tr_loss 0.0856 elapsed 54.3s


Epoch 9 it 150/329 tr_loss 0.0860 elapsed 81.5s


Epoch 9 it 200/329 tr_loss 0.0893 elapsed 108.6s


Epoch 9 it 250/329 tr_loss 0.0913 elapsed 135.8s


Epoch 9 it 300/329 tr_loss 0.0935 elapsed 163.0s


Epoch 9: tr_loss 0.0921 val_loss 0.1566 val_qwk(def) 0.8754 preds[min/mean/max]=-0.243/1.143/4.543 epoch_time 192.3s total_elapsed 1728.3s


Epoch 10 it 50/329 tr_loss 0.0712 elapsed 27.3s


Epoch 10 it 100/329 tr_loss 0.0733 elapsed 54.5s


Epoch 10 it 150/329 tr_loss 0.0701 elapsed 81.8s


Epoch 10 it 200/329 tr_loss 0.0736 elapsed 109.0s


Epoch 10 it 250/329 tr_loss 0.0768 elapsed 136.2s


Epoch 10 it 300/329 tr_loss 0.0777 elapsed 163.2s


Epoch 10: tr_loss 0.0768 val_loss 0.1513 val_qwk(def) 0.8726 preds[min/mean/max]=-0.158/1.079/4.301 epoch_time 192.4s total_elapsed 1920.8s


Epoch 11 it 50/329 tr_loss 0.0717 elapsed 27.0s


Epoch 11 it 100/329 tr_loss 0.0645 elapsed 54.2s


Epoch 11 it 150/329 tr_loss 0.0642 elapsed 81.4s


Epoch 11 it 200/329 tr_loss 0.0645 elapsed 108.5s


Epoch 11 it 250/329 tr_loss 0.0634 elapsed 135.8s


Epoch 11 it 300/329 tr_loss 0.0632 elapsed 162.9s


Epoch 11: tr_loss 0.0622 val_loss 0.1707 val_qwk(def) 0.8599 preds[min/mean/max]=-0.231/0.981/4.012 epoch_time 192.2s total_elapsed 2113.1s


Epoch 12 it 50/329 tr_loss 0.0497 elapsed 27.1s


Epoch 12 it 100/329 tr_loss 0.0529 elapsed 54.1s


Epoch 12 it 150/329 tr_loss 0.0531 elapsed 81.2s


Epoch 12 it 200/329 tr_loss 0.0524 elapsed 108.4s


Epoch 12 it 250/329 tr_loss 0.0539 elapsed 135.6s


Epoch 12 it 300/329 tr_loss 0.0545 elapsed 162.9s


Epoch 12: tr_loss 0.0539 val_loss 0.1638 val_qwk(def) 0.8544 preds[min/mean/max]=-0.121/1.066/4.129 epoch_time 192.0s total_elapsed 2305.1s


Epoch 13 it 50/329 tr_loss 0.0383 elapsed 27.2s


Epoch 13 it 100/329 tr_loss 0.0407 elapsed 54.4s


Epoch 13 it 150/329 tr_loss 0.0396 elapsed 81.5s


Epoch 13 it 200/329 tr_loss 0.0409 elapsed 108.7s


Epoch 13 it 250/329 tr_loss 0.0410 elapsed 135.8s


Epoch 13 it 300/329 tr_loss 0.0420 elapsed 162.8s


Epoch 13: tr_loss 0.0420 val_loss 0.1659 val_qwk(def) 0.8548 preds[min/mean/max]=-0.045/1.081/4.078 epoch_time 192.0s total_elapsed 2497.1s


Early stopping triggered


[RRC+EMA] Fold 2 val QWK (default th): 0.8726


Attempt 1: trying batch_size=8



===== [RRC+EMA] Fold 3 / 5 =====


Peak GB before fold: 2.93


  model = create_fn(


Epoch 1 it 50/329 tr_loss 0.8207 elapsed 27.1s


Epoch 1 it 100/329 tr_loss 0.7581 elapsed 54.3s


Epoch 1 it 150/329 tr_loss 0.6637 elapsed 81.5s


Epoch 1 it 200/329 tr_loss 0.5749 elapsed 108.6s


Epoch 1 it 250/329 tr_loss 0.5163 elapsed 135.9s


Epoch 1 it 300/329 tr_loss 0.4889 elapsed 163.1s


Epoch 1: tr_loss 0.4728 val_loss 0.2687 val_qwk(def) 0.7330 preds[min/mean/max]=-0.372/0.821/3.422 epoch_time 192.1s total_elapsed 192.1s


Epoch 2 it 50/329 tr_loss 0.2742 elapsed 27.0s


Epoch 2 it 100/329 tr_loss 0.2648 elapsed 54.1s


Epoch 2 it 150/329 tr_loss 0.2610 elapsed 81.2s


Epoch 2 it 200/329 tr_loss 0.2753 elapsed 108.3s


Epoch 2 it 250/329 tr_loss 0.2773 elapsed 135.3s


Epoch 2 it 300/329 tr_loss 0.2730 elapsed 162.6s


Epoch 2: tr_loss 0.2654 val_loss 0.2363 val_qwk(def) 0.7646 preds[min/mean/max]=-0.074/0.958/3.834 epoch_time 191.9s total_elapsed 384.1s


Epoch 3 it 50/329 tr_loss 0.2420 elapsed 27.1s


Epoch 3 it 100/329 tr_loss 0.2093 elapsed 54.3s


Epoch 3 it 150/329 tr_loss 0.2180 elapsed 81.4s


Epoch 3 it 200/329 tr_loss 0.2202 elapsed 108.5s


Epoch 3 it 250/329 tr_loss 0.2208 elapsed 135.6s


Epoch 3 it 300/329 tr_loss 0.2204 elapsed 162.8s


Epoch 3: tr_loss 0.2178 val_loss 0.1913 val_qwk(def) 0.8419 preds[min/mean/max]=-0.007/1.211/4.512 epoch_time 191.9s total_elapsed 576.1s


Epoch 4 it 50/329 tr_loss 0.1728 elapsed 27.3s


Epoch 4 it 100/329 tr_loss 0.1822 elapsed 54.4s


Epoch 4 it 150/329 tr_loss 0.1761 elapsed 81.6s


Epoch 4 it 200/329 tr_loss 0.1774 elapsed 108.8s


Epoch 4 it 250/329 tr_loss 0.1754 elapsed 135.9s


Epoch 4 it 300/329 tr_loss 0.1782 elapsed 163.1s


Epoch 4: tr_loss 0.1801 val_loss 0.1888 val_qwk(def) 0.8225 preds[min/mean/max]=-0.186/0.987/3.719 epoch_time 192.5s total_elapsed 768.7s


Epoch 5 it 50/329 tr_loss 0.1610 elapsed 27.2s


Epoch 5 it 100/329 tr_loss 0.1772 elapsed 54.3s


Epoch 5 it 150/329 tr_loss 0.1774 elapsed 81.6s


Epoch 5 it 200/329 tr_loss 0.1739 elapsed 108.8s


Epoch 5 it 250/329 tr_loss 0.1706 elapsed 136.0s


Epoch 5 it 300/329 tr_loss 0.1639 elapsed 163.3s


Epoch 5: tr_loss 0.1598 val_loss 0.1692 val_qwk(def) 0.8467 preds[min/mean/max]=-0.105/1.003/4.531 epoch_time 192.5s total_elapsed 961.2s


Epoch 6 it 50/329 tr_loss 0.1305 elapsed 27.2s


Epoch 6 it 100/329 tr_loss 0.1212 elapsed 54.4s


Epoch 6 it 150/329 tr_loss 0.1242 elapsed 81.6s


Epoch 6 it 200/329 tr_loss 0.1267 elapsed 108.7s


Epoch 6 it 250/329 tr_loss 0.1297 elapsed 135.9s


Epoch 6 it 300/329 tr_loss 0.1386 elapsed 163.1s


Epoch 6: tr_loss 0.1415 val_loss 0.1803 val_qwk(def) 0.8500 preds[min/mean/max]=-0.357/1.011/4.359 epoch_time 192.5s total_elapsed 1153.8s


Epoch 7 it 50/329 tr_loss 0.1311 elapsed 27.1s


Epoch 7 it 100/329 tr_loss 0.1251 elapsed 54.2s


Epoch 7 it 150/329 tr_loss 0.1186 elapsed 81.2s


Epoch 7 it 200/329 tr_loss 0.1175 elapsed 108.5s


Epoch 7 it 250/329 tr_loss 0.1271 elapsed 135.6s


Epoch 7 it 300/329 tr_loss 0.1279 elapsed 163.0s


Epoch 7: tr_loss 0.1268 val_loss 0.1778 val_qwk(def) 0.8343 preds[min/mean/max]=-0.175/1.020/4.383 epoch_time 192.1s total_elapsed 1345.8s


Epoch 8 it 50/329 tr_loss 0.1272 elapsed 26.9s


Epoch 8 it 100/329 tr_loss 0.1185 elapsed 54.1s


Epoch 8 it 150/329 tr_loss 0.1176 elapsed 81.2s


Epoch 8 it 200/329 tr_loss 0.1120 elapsed 108.2s


Epoch 8 it 250/329 tr_loss 0.1130 elapsed 135.2s


Epoch 8 it 300/329 tr_loss 0.1113 elapsed 162.2s


Epoch 8: tr_loss 0.1131 val_loss 0.1829 val_qwk(def) 0.8376 preds[min/mean/max]=-0.209/1.076/3.531 epoch_time 191.5s total_elapsed 1537.3s


Early stopping triggered


[RRC+EMA] Fold 3 val QWK (default th): 0.8467


Attempt 1: trying batch_size=8



===== [RRC+EMA] Fold 4 / 5 =====


Peak GB before fold: 3.03


  model = create_fn(


Epoch 1 it 50/329 tr_loss 1.0075 elapsed 27.1s


Epoch 1 it 100/329 tr_loss 0.8450 elapsed 54.2s


Epoch 1 it 150/329 tr_loss 0.7023 elapsed 81.3s


Epoch 1 it 200/329 tr_loss 0.6056 elapsed 108.3s


Epoch 1 it 250/329 tr_loss 0.5489 elapsed 135.4s


Epoch 1 it 300/329 tr_loss 0.5064 elapsed 162.4s


Epoch 1: tr_loss 0.4849 val_loss 0.2398 val_qwk(def) 0.7633 preds[min/mean/max]=-0.156/1.020/3.496 epoch_time 191.5s total_elapsed 191.5s


Epoch 2 it 50/329 tr_loss 0.2927 elapsed 27.0s


Epoch 2 it 100/329 tr_loss 0.2810 elapsed 54.0s


Epoch 2 it 150/329 tr_loss 0.2778 elapsed 81.1s


Epoch 2 it 200/329 tr_loss 0.2758 elapsed 108.2s


Epoch 2 it 250/329 tr_loss 0.2717 elapsed 135.2s


Epoch 2 it 300/329 tr_loss 0.2640 elapsed 162.3s


Epoch 2: tr_loss 0.2606 val_loss 0.2175 val_qwk(def) 0.8182 preds[min/mean/max]=-0.143/1.220/4.047 epoch_time 191.5s total_elapsed 383.0s


Epoch 3 it 50/329 tr_loss 0.2209 elapsed 27.2s


Epoch 3 it 100/329 tr_loss 0.2378 elapsed 54.2s


Epoch 3 it 150/329 tr_loss 0.2354 elapsed 81.1s


Epoch 3 it 200/329 tr_loss 0.2241 elapsed 108.6s


Epoch 3 it 250/329 tr_loss 0.2157 elapsed 135.3s


Epoch 3 it 300/329 tr_loss 0.2111 elapsed 162.5s


Epoch 3: tr_loss 0.2125 val_loss 0.1924 val_qwk(def) 0.8615 preds[min/mean/max]=-0.154/1.259/4.648 epoch_time 191.7s total_elapsed 574.8s


Epoch 4 it 50/329 tr_loss 0.2063 elapsed 27.1s


Epoch 4 it 100/329 tr_loss 0.2006 elapsed 54.2s


Epoch 4 it 150/329 tr_loss 0.1936 elapsed 81.6s


Epoch 4 it 200/329 tr_loss 0.1963 elapsed 108.7s


Epoch 4 it 250/329 tr_loss 0.1937 elapsed 135.7s


Epoch 4 it 300/329 tr_loss 0.1928 elapsed 162.9s


Epoch 4: tr_loss 0.1891 val_loss 0.2116 val_qwk(def) 0.7886 preds[min/mean/max]=-0.149/0.889/3.209 epoch_time 191.8s total_elapsed 766.6s


Epoch 5 it 50/329 tr_loss 0.1503 elapsed 27.0s


Epoch 5 it 100/329 tr_loss 0.1622 elapsed 54.0s


Epoch 5 it 150/329 tr_loss 0.1632 elapsed 81.1s


Epoch 5 it 200/329 tr_loss 0.1593 elapsed 108.0s


Epoch 5 it 250/329 tr_loss 0.1598 elapsed 135.1s


Epoch 5 it 300/329 tr_loss 0.1588 elapsed 162.2s


Epoch 5: tr_loss 0.1611 val_loss 0.1710 val_qwk(def) 0.8624 preds[min/mean/max]=-0.404/1.209/4.637 epoch_time 191.4s total_elapsed 958.1s


Epoch 6 it 50/329 tr_loss 0.1422 elapsed 27.1s


Epoch 6 it 100/329 tr_loss 0.1328 elapsed 54.2s


Epoch 6 it 150/329 tr_loss 0.1503 elapsed 81.3s


Epoch 6 it 200/329 tr_loss 0.1460 elapsed 108.3s


Epoch 6 it 250/329 tr_loss 0.1473 elapsed 135.3s


Epoch 6 it 300/329 tr_loss 0.1477 elapsed 162.4s


Epoch 6: tr_loss 0.1509 val_loss 0.1671 val_qwk(def) 0.8573 preds[min/mean/max]=-0.271/1.033/3.918 epoch_time 191.3s total_elapsed 1149.5s


Epoch 7 it 50/329 tr_loss 0.1226 elapsed 27.2s


Epoch 7 it 100/329 tr_loss 0.1126 elapsed 54.3s


Epoch 7 it 150/329 tr_loss 0.1300 elapsed 81.5s


Epoch 7 it 200/329 tr_loss 0.1261 elapsed 108.5s


Epoch 7 it 250/329 tr_loss 0.1260 elapsed 135.5s


Epoch 7 it 300/329 tr_loss 0.1265 elapsed 162.5s


Epoch 7: tr_loss 0.1289 val_loss 0.1778 val_qwk(def) 0.8230 preds[min/mean/max]=-0.038/1.006/3.760 epoch_time 191.8s total_elapsed 1341.3s


Epoch 8 it 50/329 tr_loss 0.0891 elapsed 27.0s


Epoch 8 it 100/329 tr_loss 0.0985 elapsed 54.0s


Epoch 8 it 150/329 tr_loss 0.1000 elapsed 81.0s


Epoch 8 it 200/329 tr_loss 0.1068 elapsed 108.0s


Epoch 8 it 250/329 tr_loss 0.1108 elapsed 135.0s


Epoch 8 it 300/329 tr_loss 0.1115 elapsed 162.1s


Epoch 8: tr_loss 0.1135 val_loss 0.1859 val_qwk(def) 0.8635 preds[min/mean/max]=-0.184/1.304/3.920 epoch_time 190.9s total_elapsed 1532.2s


Epoch 9 it 50/329 tr_loss 0.1036 elapsed 27.0s


Epoch 9 it 100/329 tr_loss 0.0885 elapsed 54.0s


Epoch 9 it 150/329 tr_loss 0.0876 elapsed 81.0s


Epoch 9 it 200/329 tr_loss 0.0911 elapsed 108.0s


Epoch 9 it 250/329 tr_loss 0.0957 elapsed 135.0s


Epoch 9 it 300/329 tr_loss 0.0985 elapsed 162.0s


Epoch 9: tr_loss 0.0981 val_loss 0.1488 val_qwk(def) 0.8712 preds[min/mean/max]=-0.224/1.141/4.035 epoch_time 191.1s total_elapsed 1723.3s


Epoch 10 it 50/329 tr_loss 0.0883 elapsed 26.9s


Epoch 10 it 100/329 tr_loss 0.0802 elapsed 53.8s


Epoch 10 it 150/329 tr_loss 0.0813 elapsed 80.7s


Epoch 10 it 200/329 tr_loss 0.0834 elapsed 107.5s


Epoch 10 it 250/329 tr_loss 0.0837 elapsed 134.5s


Epoch 10 it 300/329 tr_loss 0.0838 elapsed 161.6s


Epoch 10: tr_loss 0.0834 val_loss 0.1523 val_qwk(def) 0.8745 preds[min/mean/max]=-0.100/1.146/3.709 epoch_time 190.6s total_elapsed 1914.0s


Epoch 11 it 50/329 tr_loss 0.0753 elapsed 27.1s


Epoch 11 it 100/329 tr_loss 0.0695 elapsed 54.1s


Epoch 11 it 150/329 tr_loss 0.0705 elapsed 81.2s


Epoch 11 it 200/329 tr_loss 0.0702 elapsed 108.2s


Epoch 11 it 250/329 tr_loss 0.0696 elapsed 135.2s


Epoch 11 it 300/329 tr_loss 0.0698 elapsed 162.1s


Epoch 11: tr_loss 0.0727 val_loss 0.1512 val_qwk(def) 0.8636 preds[min/mean/max]=-0.135/1.120/3.775 epoch_time 191.0s total_elapsed 2105.0s


Epoch 12 it 50/329 tr_loss 0.0535 elapsed 27.0s


Epoch 12 it 100/329 tr_loss 0.0514 elapsed 54.0s


Epoch 12 it 150/329 tr_loss 0.0516 elapsed 81.0s


Epoch 12 it 200/329 tr_loss 0.0527 elapsed 108.0s


Epoch 12 it 250/329 tr_loss 0.0545 elapsed 134.9s


Epoch 12 it 300/329 tr_loss 0.0574 elapsed 161.8s


Epoch 12: tr_loss 0.0582 val_loss 0.1662 val_qwk(def) 0.8505 preds[min/mean/max]=-0.090/1.034/4.094 epoch_time 190.9s total_elapsed 2295.9s


Early stopping triggered


[RRC+EMA] Fold 4 val QWK (default th): 0.8712


[RRC+EMA] Optimized thresholds: [0.57085114 1.37378998 2.37971497 3.43718414] OOF QWK: 0.87391


B5@512 RRC+EMA OOF QWK: 0.8739131318130468


In [14]:
# Prep: 640px cache, B4@640 RRC+EMA 3-fold run, and NNLS-weighted blend + bootstrapped thresholds
import os, time, gc, copy, math
from pathlib import Path
import numpy as np
import pandas as pd
import cv2
import torch
import torch.nn as nn
import albumentations as A
from albumentations.pytorch import ToTensorV2
from torch.utils.data import Dataset, DataLoader
from timm.utils import ModelEmaV2
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Preparing 640px pipeline (cache + B4 3-fold + blend)', flush=True)

# 640 cache setup
CACHE640_DIR = Path('cache640'); CACHE640_TRAIN = CACHE640_DIR / 'train'; CACHE640_TEST = CACHE640_DIR / 'test'
CACHE640_DIR.mkdir(parents=True, exist_ok=True); CACHE640_TRAIN.mkdir(parents=True, exist_ok=True); CACHE640_TEST.mkdir(parents=True, exist_ok=True)
IMG_SIZE_640 = 640

def circle_crop_ben_clahe_sz(img_bgr: np.ndarray, size: int = 512, clahe_p: float = 0.4) -> np.ndarray:
    h, w = img_bgr.shape[:2]
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray_blur = cv2.GaussianBlur(gray, (0,0), 3)
    thr = cv2.threshold(gray_blur, 0, 255, cv2.THRESH_BINARY+cv2.THRESH_OTSU)[1]
    if np.mean(thr) > 127: thr = cv2.bitwise_not(thr)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7,7))
    thr = cv2.morphologyEx(thr, cv2.MORPH_CLOSE, kernel, iterations=2)
    contours, _ = cv2.findContours(thr, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if contours:
        cnt = max(contours, key=cv2.contourArea)
        x, y, bw, bh = cv2.boundingRect(cnt)
    else:
        side = min(h, w); x = (w - side) // 2; y = (h - side) // 2; bw = side; bh = side
    crop = img_bgr[y:y+bh, x:x+bw]
    ch, cw = crop.shape[:2]; side = max(ch, cw)
    pad_top = (side - ch) // 2; pad_bottom = side - ch - pad_top
    pad_left = (side - cw) // 2; pad_right = side - cw - pad_left
    crop_sq = cv2.copyMakeBorder(crop, pad_top, pad_bottom, pad_left, pad_right, cv2.BORDER_CONSTANT, value=(0,0,0))
    img = cv2.resize(crop_sq, (size, size), interpolation=cv2.INTER_AREA)
    blur = cv2.GaussianBlur(img, (0,0), size/30)
    img = cv2.addWeighted(img, 4, blur, -4, 128)
    img = np.clip(img, 0, 255).astype(np.uint8)
    if np.random.rand() < 0.4:
        lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)
        L, A_, B_ = cv2.split(lab)
        clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
        L = clahe.apply(L)
        lab = cv2.merge([L, A_, B_])
        img = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
    return img

def cache_split_sz(df: pd.DataFrame, src_dir: Path, dst_dir: Path, id_col: str, size: int, limit: int | None = None):
    paths = [(str(src_dir / f"{iid}.png"), str(dst_dir / f"{iid}.png")) for iid in df[id_col].values.tolist()]
    if limit is not None: paths = paths[:limit]
    cnt_exist = sum(os.path.exists(d) for _, d in paths)
    print(f'{dst_dir.name}: {cnt_exist}/{len(paths)} already cached at {size}px', flush=True)
    to_process = [(s, d) for s, d in paths if not os.path.exists(d)]
    t0 = time.time()
    for i, (src, dst) in enumerate(to_process):
        img = cv2.imread(src, cv2.IMREAD_COLOR)
        if img is None: continue
        img = circle_crop_ben_clahe_sz(img, size=size, clahe_p=0.4)
        cv2.imwrite(dst, img, [cv2.IMWRITE_PNG_COMPRESSION, 3])
        if (i+1) % 200 == 0:
            print(f'Cached {i+1}/{len(to_process)}; elapsed {time.time()-t0:.1f}s', flush=True)
    print(f'Done {dst_dir.name}; total elapsed {time.time()-t0:.1f}s', flush=True)

# Local dataset for arbitrary size/dir
class LocalDRDataset(Dataset):
    def __init__(self, df: pd.DataFrame, img_dir: Path, img_size: int, transforms=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        self.transforms = transforms
        self.has_target = 'diagnosis' in df.columns
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(img_path), cv2.IMREAD_COLOR)
        if img is None:
            img = np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        if img.shape[0] != self.img_size or img.shape[1] != self.img_size:
            img = cv2.resize(img, (self.img_size, self.img_size), interpolation=cv2.INTER_AREA)
        if self.transforms:
            img = self.transforms(image=img)['image']
        if self.has_target:
            return img, torch.tensor(float(row['diagnosis']), dtype=torch.float32)
        else:
            return img, row['id_code']

# Transforms for 640 with RRC + light augs (tightened per expert advice)
MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_640 = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE_640, IMG_SIZE_640), scale=(0.9, 1.0), ratio=(0.98, 1.02), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.04), rotate=(-10, 10), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_640 = A.Compose([
    A.Resize(IMG_SIZE_640, IMG_SIZE_640),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def preds_to_classes_blend(preds, thresholds):
    th0, th1, th2, th3 = thresholds
    return np.digitize(preds, bins=[th0, th1, th2, th3])

def qwk_func(y_true, y_pred_cls):
    return cohen_kappa_score(y_true, y_pred_cls, weights='quadratic')

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -1.0, 4.0)
        cls = preds_to_classes_blend(p, th)
        return -qwk_func(y, cls)
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05: th[i] = th[i-1] + 0.05
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=100, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1) % 20 == 0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j] - med[j-1] < 0.1: med[j] = med[j-1] + 0.1
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

def get_loaders_640(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = LocalDRDataset(tr_df, CACHE640_TRAIN, IMG_SIZE_640, transforms=train_tfms_640)
    dva = LocalDRDataset(va_df, CACHE640_TRAIN, IMG_SIZE_640, transforms=valid_tfms_640)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_model_generic(model, dl, loss_fn):
    model.eval(); preds = []; targs = []; val_loss = 0.0; n = 0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)
            bs = xb.size(0); val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy()); targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/n, preds, targs

def train_one_fold_b4_640_rrc_ema(fold, folds_df, epochs=15, lr=2e-4, wd=1e-5, batch_size=4, patience=3, ema_decay=0.9996):
    print(f'\n===== [B4@640 RRC+EMA] Fold {fold} =====', flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_640(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = RegHeadModel(backbone_name='tf_efficientnet_b4_ns', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr): return (step+1)/len(dl_tr)
        progress = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5 * (1 + math.cos(math.pi * progress))
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    best_loss = float('inf'); best_state = None; best_preds = None; best_targs = None; no_improve = 0
    accum_steps = max(1, 16 // batch_size); t_start = time.time()
    optimizer.zero_grad(set_to_none=True)
    for epoch in range(1, epochs+1):
        model.train(); tr_loss = 0.0; n = 0; t0 = time.time()
        for it, (xb, yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb) / accum_steps
            scaler.scale(loss).backward()
            if (it + 1) % accum_steps == 0:
                scaler.step(optimizer); scaler.update(); optimizer.zero_grad(set_to_none=True); scheduler.step(); ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item()*accum_steps)*bs; n += bs
            if (it+1) % 50 == 0:
                print(f'Epoch {epoch} it {it+1}/{len(dl_tr)} tr_loss {(tr_loss/n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        vloss, vpreds, vtargs = validate_model_generic(model, dl_va, loss_fn)
        vq = qwk_func(vtargs, preds_to_classes_blend(vpreds, [0.5,1.5,2.5,3.5]))
        print(f'Epoch {epoch}: tr_loss {(tr_loss/n):.4f} val_loss {vloss:.4f} val_qwk(def) {vq:.4f} epoch_time {time.time()-t0:.1f}s total_elapsed {time.time()-t_start:.1f}s', flush=True)
        if vloss < best_loss:
            best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = vpreds.copy(); best_targs = vtargs.copy(); no_improve = 0
        else:
            no_improve += 1
        if no_improve >= patience:
            print('Early stopping triggered', flush=True); break
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_b4_640_rrc_ema(folds_df, epochs=15, batch_size=4, folds_to_run=(0,1,2)):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); oof_preds = np.full(n, np.nan, dtype=np.float32); oof_targs = folds_df['diagnosis'].values.astype(float)
    models_out = [];
    for fold in folds_to_run:
        try:
            torch.cuda.empty_cache();
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        gc.collect()
        current_bs = batch_size
        for attempt in range(6):
            print(f'Attempt {attempt+1}: trying batch_size={current_bs}', flush=True)
            try:
                fm, v_preds, v_targs = train_one_fold_b4_640_rrc_ema(fold, folds_df, epochs=epochs, lr=LR, wd=WD, batch_size=current_bs, patience=PATIENCE, ema_decay=0.9996)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                if 'out of memory' in str(e).lower():
                    print(f'CUDA OOM on fold {fold} with batch_size={current_bs}. Reducing and retrying...', flush=True)
                    try: del fm
                    except Exception: pass
                    gc.collect()
                    try: torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
                    except Exception: pass
                    current_bs = max(2, current_bs - 2);
                    if attempt >= 5: raise
                    continue
                else:
                    raise
        models_out.append(fm)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = v_preds
        fold_q = qwk_func(v_targs, preds_to_classes_blend(v_preds, [0.5,1.5,2.5,3.5]))
        print(f'[B4@640] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    # Save partial OOF (NaNs where not run)
    np.save('oof_preds_b4_640_rrc_ema.npy', oof_preds); np.save('oof_targets_b4_640_rrc_ema.npy', oof_targs)
    return models_out, oof_preds, oof_targs

def nnls_blend_weights(y, preds_list, mask=None):
    y = np.asarray(y).astype(float)
    mats = []
    for p in preds_list:
        mats.append(np.asarray(p).reshape(-1,1))
    X = np.concatenate(mats, axis=1)
    if mask is not None:
        X = X[mask]; y = y[mask]
    w, _ = nnls(X, y)
    if w.sum() > 0:
        w = w / w.sum()
    else:
        w = np.ones_like(w) / len(w)
    # Cap and floor weights per expert advice to avoid dominance / collapse
    w = np.clip(w, 0.05, 0.8)
    w = w / w.sum()
    return w

def prepare_and_run_b4_640_and_blend():
    # 1) Cache 640 if needed
    cache_split_sz(train_df, TRAIN_DIR, CACHE640_TRAIN, id_col='id_code', size=IMG_SIZE_640, limit=None)
    cache_split_sz(test_df, TEST_DIR, CACHE640_TEST, id_col='id_code', size=IMG_SIZE_640, limit=None)
    print('640 cache ready.', flush=True)
    # 2) Train B4@640 for 3 folds
    models_b4_640, oof_b4_640, oof_t = run_cv_b4_640_rrc_ema(folds_df, epochs=15, batch_size=4, folds_to_run=(0,1,2))
    # 3) Load existing OOFs
    paths = {
        'b4_512': 'oof_preds_b4.npy',
        'b5_512': 'oof_preds.npy',
        'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
        'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    }
    loaded = {};
    for k, p in paths.items():
        if os.path.exists(p): loaded[k] = np.load(p)
    # Align mask where all selected models have non-nan preds
    keys = list(loaded.keys())
    arrs = [loaded[k] for k in keys]
    mask = np.ones_like(oof_t, dtype=bool)
    for a in arrs:
        mask &= np.isfinite(a)
    # 4) Fit NNLS blend weights on intersection
    weights = nnls_blend_weights(oof_t, arrs, mask=mask)
    print('NNLS blend weights (', keys, '):', weights, flush=True)
    # 5) Create blended OOF and optimize + bootstrap thresholds
    blend_oof = np.zeros_like(oof_t, dtype=float)
    for w, a in zip(weights, arrs):
        a_fill = np.where(np.isfinite(a), a, 0.0)
        blend_oof += w * a_fill
    th_opt = optimize_thresholds_generic(oof_t[mask], blend_oof[mask])
    oof_q_opt = qwk_func(oof_t[mask], preds_to_classes_blend(blend_oof[mask], th_opt))
    print('Blend OOF QWK (opt on mask):', f'{oof_q_opt:.5f}', 'th:', th_opt, flush=True)
    th_boot = bootstrap_thresholds_generic(oof_t[mask], blend_oof[mask], n_bootstrap=100, base=[0.5,1.5,2.5,3.5])
    print('Bootstrapped thresholds:', th_boot, flush=True)
    np.save('blend_weights_keys.npy', np.array(keys, dtype=object)); np.save('blend_weights_vals.npy', weights); np.save('thresholds_blend_boot.npy', th_boot)
    # 6) TTA inference for B4@640 (simple hflip) and final blended submission
    dte640 = LocalDRDataset(test_df[['id_code']].copy(), CACHE640_TEST, IMG_SIZE_640, transforms=valid_tfms_640)
    dl_te640 = DataLoader(dte640, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    def tta_predict_simple(models, dl):
        preds = [];
        with torch.no_grad():
            for xb, _ in dl:
                xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
                with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                    p0 = 0; cnt = 0
                    for m in models:
                        m.eval()
                        p0m = m(xb); p1m = m(torch.flip(xb, dims=[-1]))
                        p0 += (p0m + p1m) / 2.0; cnt += 1
                    p = p0 / max(1, cnt)
                preds.append(p.detach().float().cpu().numpy())
        return np.concatenate(preds)
    preds_b4_640 = tta_predict_simple(models_b4_640, dl_te640)
    np.save('test_reg_preds_b4_640_rrc_ema.npy', preds_b4_640)
    # Load existing test preds
    test_paths = {
        'b4_512': 'test_reg_preds_b4_tta.npy' if os.path.exists('test_reg_preds_b4_tta.npy') else 'test_reg_preds_b4.npy',
        'b5_512': 'test_reg_preds_b5_tta.npy' if os.path.exists('test_reg_preds_b5_tta.npy') else 'test_reg_preds.npy',
    }
    test_loaded = []
    for k in keys:
        if k == 'b4_640':
            test_loaded.append(preds_b4_640)
        elif k in test_paths and os.path.exists(test_paths[k]):
            test_loaded.append(np.load(test_paths[k]))
        elif k in ('b4_512','b5_512') and os.path.exists('test_reg_preds_blend_tta.npy'):
            pass
        elif k == 'b5_512_rrcema':
            pass
    # If some test preds missing (e.g., new EMA B5), fallback to equal weights among available
    if len(test_loaded) != len(keys):
        print('Warning: some test preds missing for keys', keys, 'using available ones only for final blend now.', flush=True)
        weights_use = weights[:len(test_loaded)];
        if weights_use.sum() > 0: weights_use = weights_use / weights_use.sum()
    else:
        weights_use = weights
    # Blend available test preds in consistent key order
    final_test_list = []
    for k in keys:
        if k == 'b4_640': final_test_list.append(preds_b4_640)
        elif k in test_paths and os.path.exists(test_paths[k]): final_test_list.append(np.load(test_paths[k]))
    if len(final_test_list) == 0:
        print('No test predictions available to blend.', flush=True); return
    w_adj = np.array(weights[:len(final_test_list)])
    if w_adj.sum() == 0: w_adj = np.ones_like(w_adj)/len(w_adj)
    w_adj = w_adj / w_adj.sum()
    blend_test = np.zeros_like(final_test_list[0], dtype=float)
    for w, a in zip(w_adj, final_test_list): blend_test += w * a
    th_final = np.load('thresholds_blend_boot.npy') if os.path.exists('thresholds_blend_boot.npy') else th_opt
    cls = preds_to_classes_blend(blend_test, th_final).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('test_reg_preds_blend_final.npy', blend_test)
    print('Saved submission.csv with NNLS-weighted blend and bootstrapped thresholds.', flush=True)

print('Cell ready. After B5 RRC+EMA completes, execute prepare_and_run_b4_640_and_blend() to continue medal push.', flush=True)

Preparing 640px pipeline (cache + B4 3-fold + blend)


Cell ready. After B5 RRC+EMA completes, execute prepare_and_run_b4_640_and_blend() to continue medal push.


In [15]:
# Post-B5(EMA) inference + trigger 640 pipeline
import numpy as np, torch, pandas as pd
from torch.utils.data import DataLoader

def hflip_tta_predict(models, dl):
    preds = [];
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                p_sum = 0; cnt = 0
                for m in models:
                    m.eval()
                    p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    p_sum += (p0 + p1) / 2.0; cnt += 1
                p = p_sum / max(1, cnt)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def run_b5_seed2025_ema_test_infer_if_ready():
    if 'models_b5_seed2025_ema' not in globals():
        print('EMA B5 models not present yet; skip test inference.', flush=True);
        return False
    try:
        dte = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
        dl_te = DataLoader(dte, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        preds = hflip_tta_predict(models_b5_seed2025_ema, dl_te)
        np.save('test_reg_preds_b5_seed2025_rrc_ema.npy', preds)
        print('Saved test_reg_preds_b5_seed2025_rrc_ema.npy', flush=True)
        return True
    except Exception as e:
        print('B5 EMA test inference failed:', e, flush=True);
        return False

print('If B5@512 RRC+EMA run finished, we will cache its test preds, then launch 640 pipeline.', flush=True)
ok = run_b5_seed2025_ema_test_infer_if_ready()
if ok:
    try:
        prepare_and_run_b4_640_and_blend()
    except NameError:
        print('Cell 8 not executed yet; run it first to define prepare_and_run_b4_640_and_blend()', flush=True)

If B5@512 RRC+EMA run finished, we will cache its test preds, then launch 640 pipeline.


Saved test_reg_preds_b5_seed2025_rrc_ema.npy


train: 0/3295 already cached at 640px


Cached 200/3295; elapsed 40.1s


Cached 400/3295; elapsed 79.3s


Cached 600/3295; elapsed 117.6s


Cached 800/3295; elapsed 157.1s


Cached 1000/3295; elapsed 195.4s


Cached 1200/3295; elapsed 234.9s


Cached 1400/3295; elapsed 274.6s


Cached 1600/3295; elapsed 314.1s


Cached 1800/3295; elapsed 353.1s


Cached 2000/3295; elapsed 391.6s


Cached 2200/3295; elapsed 429.9s


Cached 2400/3295; elapsed 470.3s


Cached 2600/3295; elapsed 510.4s


Cached 2800/3295; elapsed 548.4s


Cached 3000/3295; elapsed 586.4s


Cached 3200/3295; elapsed 624.1s


Done train; total elapsed 642.6s


test: 0/367 already cached at 640px


Cached 200/367; elapsed 37.7s


Done test; total elapsed 73.4s


640 cache ready.


Attempt 1: trying batch_size=4



===== [B4@640 RRC+EMA] Fold 0 =====


  model = create_fn(


Epoch 1 it 50/659 tr_loss 0.8093 elapsed 15.7s


Epoch 1 it 100/659 tr_loss 0.8099 elapsed 31.6s


Epoch 1 it 150/659 tr_loss 0.7996 elapsed 47.6s


Epoch 1 it 200/659 tr_loss 0.8165 elapsed 63.3s


Epoch 1 it 250/659 tr_loss 0.8202 elapsed 79.5s


Epoch 1 it 300/659 tr_loss 0.7834 elapsed 95.6s


Epoch 1 it 350/659 tr_loss 0.7432 elapsed 111.3s


Epoch 1 it 400/659 tr_loss 0.7103 elapsed 127.5s


Epoch 1 it 450/659 tr_loss 0.6809 elapsed 143.3s


Epoch 1 it 500/659 tr_loss 0.6525 elapsed 159.6s


Epoch 1 it 550/659 tr_loss 0.6241 elapsed 175.9s


Epoch 1 it 600/659 tr_loss 0.5980 elapsed 191.8s


Epoch 1 it 650/659 tr_loss 0.5819 elapsed 207.9s


Epoch 1: tr_loss 0.5782 val_loss 0.2419 val_qwk(def) 0.7853 epoch_time 228.3s total_elapsed 228.3s


Epoch 2 it 50/659 tr_loss 0.3196 elapsed 16.2s


Epoch 2 it 100/659 tr_loss 0.3786 elapsed 32.0s


Epoch 2 it 150/659 tr_loss 0.3435 elapsed 48.0s


Epoch 2 it 200/659 tr_loss 0.3412 elapsed 64.2s


Epoch 2 it 250/659 tr_loss 0.3442 elapsed 79.8s


Epoch 2 it 300/659 tr_loss 0.3508 elapsed 96.0s


Epoch 2 it 350/659 tr_loss 0.3526 elapsed 111.9s


Epoch 2 it 400/659 tr_loss 0.3438 elapsed 127.6s


Epoch 2 it 450/659 tr_loss 0.3367 elapsed 143.6s


Epoch 2 it 500/659 tr_loss 0.3311 elapsed 159.6s


Epoch 2 it 550/659 tr_loss 0.3284 elapsed 175.5s


Epoch 2 it 600/659 tr_loss 0.3270 elapsed 191.6s


Epoch 2 it 650/659 tr_loss 0.3260 elapsed 207.3s


Epoch 2: tr_loss 0.3264 val_loss 0.2284 val_qwk(def) 0.8149 epoch_time 227.5s total_elapsed 455.8s


Epoch 3 it 50/659 tr_loss 0.2865 elapsed 16.0s


Epoch 3 it 100/659 tr_loss 0.2730 elapsed 32.1s


Epoch 3 it 150/659 tr_loss 0.2693 elapsed 47.7s


Epoch 3 it 200/659 tr_loss 0.2846 elapsed 63.9s


Epoch 3 it 250/659 tr_loss 0.2756 elapsed 79.9s


Epoch 3 it 300/659 tr_loss 0.2869 elapsed 95.7s


Epoch 3 it 350/659 tr_loss 0.2866 elapsed 111.8s


Epoch 3 it 400/659 tr_loss 0.2813 elapsed 128.0s


Epoch 3 it 450/659 tr_loss 0.2780 elapsed 143.7s


Epoch 3 it 500/659 tr_loss 0.2779 elapsed 159.7s


Epoch 3 it 550/659 tr_loss 0.2777 elapsed 175.8s


Epoch 3 it 600/659 tr_loss 0.2733 elapsed 191.8s


Epoch 3 it 650/659 tr_loss 0.2756 elapsed 207.9s


Epoch 3: tr_loss 0.2738 val_loss 0.2347 val_qwk(def) 0.7686 epoch_time 228.1s total_elapsed 684.0s


Epoch 4 it 50/659 tr_loss 0.3103 elapsed 15.9s


Epoch 4 it 100/659 tr_loss 0.2623 elapsed 31.8s


Epoch 4 it 150/659 tr_loss 0.2455 elapsed 47.8s


Epoch 4 it 200/659 tr_loss 0.2402 elapsed 63.8s


Epoch 4 it 250/659 tr_loss 0.2285 elapsed 79.6s


Epoch 4 it 300/659 tr_loss 0.2335 elapsed 95.5s


Epoch 4 it 350/659 tr_loss 0.2427 elapsed 111.2s


Epoch 4 it 400/659 tr_loss 0.2470 elapsed 127.4s


Epoch 4 it 450/659 tr_loss 0.2452 elapsed 143.4s


Epoch 4 it 500/659 tr_loss 0.2442 elapsed 159.2s


Epoch 4 it 550/659 tr_loss 0.2401 elapsed 175.2s


Epoch 4 it 600/659 tr_loss 0.2367 elapsed 191.4s


Epoch 4 it 650/659 tr_loss 0.2391 elapsed 207.2s


Epoch 4: tr_loss 0.2377 val_loss 0.2410 val_qwk(def) 0.7407 epoch_time 227.4s total_elapsed 911.4s


Epoch 5 it 50/659 tr_loss 0.1786 elapsed 15.9s


Epoch 5 it 100/659 tr_loss 0.1980 elapsed 31.9s


Epoch 5 it 150/659 tr_loss 0.1994 elapsed 47.4s


Epoch 5 it 200/659 tr_loss 0.2025 elapsed 63.6s


Epoch 5 it 250/659 tr_loss 0.1968 elapsed 79.5s


Epoch 5 it 300/659 tr_loss 0.2060 elapsed 95.2s


Epoch 5 it 350/659 tr_loss 0.2144 elapsed 111.1s


Epoch 5 it 400/659 tr_loss 0.2182 elapsed 127.1s


Epoch 5 it 450/659 tr_loss 0.2160 elapsed 142.8s


Epoch 5 it 500/659 tr_loss 0.2164 elapsed 158.9s


Epoch 5 it 550/659 tr_loss 0.2137 elapsed 174.8s


Epoch 5 it 600/659 tr_loss 0.2092 elapsed 190.5s


Epoch 5 it 650/659 tr_loss 0.2069 elapsed 206.5s


Epoch 5: tr_loss 0.2059 val_loss 0.1937 val_qwk(def) 0.8133 epoch_time 226.8s total_elapsed 1138.2s


Epoch 6 it 50/659 tr_loss 0.1110 elapsed 16.0s


Epoch 6 it 100/659 tr_loss 0.1622 elapsed 31.8s


Epoch 6 it 150/659 tr_loss 0.1699 elapsed 47.8s


Epoch 6 it 200/659 tr_loss 0.1667 elapsed 63.9s


Epoch 6 it 250/659 tr_loss 0.1635 elapsed 79.8s


Epoch 6 it 300/659 tr_loss 0.1678 elapsed 95.8s


Epoch 6 it 350/659 tr_loss 0.1679 elapsed 111.6s


Epoch 6 it 400/659 tr_loss 0.1657 elapsed 127.6s


Epoch 6 it 450/659 tr_loss 0.1641 elapsed 143.7s


Epoch 6 it 500/659 tr_loss 0.1636 elapsed 159.5s


Epoch 6 it 550/659 tr_loss 0.1666 elapsed 175.5s


Epoch 6 it 600/659 tr_loss 0.1686 elapsed 191.6s


Epoch 6 it 650/659 tr_loss 0.1710 elapsed 207.2s


Epoch 6: tr_loss 0.1722 val_loss 0.1766 val_qwk(def) 0.8355 epoch_time 227.5s total_elapsed 1365.7s


Epoch 7 it 50/659 tr_loss 0.1895 elapsed 16.0s


Epoch 7 it 100/659 tr_loss 0.1861 elapsed 32.0s


Epoch 7 it 150/659 tr_loss 0.1713 elapsed 47.6s


Epoch 7 it 200/659 tr_loss 0.1708 elapsed 63.7s


Epoch 7 it 250/659 tr_loss 0.1656 elapsed 79.8s


Epoch 7 it 300/659 tr_loss 0.1676 elapsed 95.6s


Epoch 7 it 350/659 tr_loss 0.1682 elapsed 111.6s


Epoch 7 it 400/659 tr_loss 0.1662 elapsed 127.4s


Epoch 7 it 450/659 tr_loss 0.1631 elapsed 143.4s


Epoch 7 it 500/659 tr_loss 0.1605 elapsed 159.5s


Epoch 7 it 550/659 tr_loss 0.1623 elapsed 175.3s


Epoch 7 it 600/659 tr_loss 0.1627 elapsed 191.4s


Epoch 7 it 650/659 tr_loss 0.1629 elapsed 207.4s


Epoch 7: tr_loss 0.1627 val_loss 0.1957 val_qwk(def) 0.8374 epoch_time 227.7s total_elapsed 1593.5s


Epoch 8 it 50/659 tr_loss 0.1358 elapsed 15.9s


Epoch 8 it 100/659 tr_loss 0.1264 elapsed 32.1s


Epoch 8 it 150/659 tr_loss 0.1186 elapsed 48.2s


Epoch 8 it 200/659 tr_loss 0.1307 elapsed 64.1s


Epoch 8 it 250/659 tr_loss 0.1328 elapsed 80.1s


Epoch 8 it 300/659 tr_loss 0.1318 elapsed 96.2s


Epoch 8 it 350/659 tr_loss 0.1364 elapsed 112.0s


Epoch 8 it 400/659 tr_loss 0.1402 elapsed 128.2s


Epoch 8 it 450/659 tr_loss 0.1405 elapsed 144.2s


Epoch 8 it 500/659 tr_loss 0.1401 elapsed 160.2s


Epoch 8 it 550/659 tr_loss 0.1420 elapsed 176.2s


Epoch 8 it 600/659 tr_loss 0.1434 elapsed 192.2s


Epoch 8 it 650/659 tr_loss 0.1404 elapsed 207.9s


Epoch 8: tr_loss 0.1409 val_loss 0.1847 val_qwk(def) 0.8403 epoch_time 228.3s total_elapsed 1821.8s


Epoch 9 it 50/659 tr_loss 0.1159 elapsed 16.2s


Epoch 9 it 100/659 tr_loss 0.1238 elapsed 32.5s


Epoch 9 it 150/659 tr_loss 0.1187 elapsed 48.3s


Epoch 9 it 200/659 tr_loss 0.1142 elapsed 64.5s


Epoch 9 it 250/659 tr_loss 0.1121 elapsed 80.4s


Epoch 9 it 300/659 tr_loss 0.1147 elapsed 96.5s


Epoch 9 it 350/659 tr_loss 0.1154 elapsed 112.6s


Epoch 9 it 400/659 tr_loss 0.1204 elapsed 128.5s


Epoch 9 it 450/659 tr_loss 0.1215 elapsed 144.5s


Epoch 9 it 500/659 tr_loss 0.1251 elapsed 160.6s


Epoch 9 it 550/659 tr_loss 0.1255 elapsed 176.5s


Epoch 9 it 600/659 tr_loss 0.1302 elapsed 192.5s


Epoch 9 it 650/659 tr_loss 0.1303 elapsed 208.3s


Epoch 9: tr_loss 0.1308 val_loss 0.1975 val_qwk(def) 0.8328 epoch_time 228.9s total_elapsed 2050.7s


Early stopping triggered


[B4@640] Fold 0 val QWK (default th): 0.8355


Attempt 1: trying batch_size=4



===== [B4@640 RRC+EMA] Fold 1 =====


  model = create_fn(


Epoch 1 it 50/659 tr_loss 0.8031 elapsed 15.8s


Epoch 1 it 100/659 tr_loss 0.8478 elapsed 32.0s


Epoch 1 it 150/659 tr_loss 0.8476 elapsed 48.4s


Epoch 1 it 200/659 tr_loss 0.8544 elapsed 64.5s


Epoch 1 it 250/659 tr_loss 0.8433 elapsed 80.7s


Epoch 1 it 300/659 tr_loss 0.8057 elapsed 96.6s


Epoch 1 it 350/659 tr_loss 0.7716 elapsed 112.7s


Epoch 1 it 400/659 tr_loss 0.7394 elapsed 128.9s


Epoch 1 it 450/659 tr_loss 0.6969 elapsed 144.6s


Epoch 1 it 500/659 tr_loss 0.6657 elapsed 160.8s


Epoch 1 it 550/659 tr_loss 0.6382 elapsed 176.9s


Epoch 1 it 600/659 tr_loss 0.6100 elapsed 192.7s


Epoch 1 it 650/659 tr_loss 0.5889 elapsed 208.8s


Epoch 1: tr_loss 0.5846 val_loss 0.2740 val_qwk(def) 0.7131 epoch_time 229.3s total_elapsed 229.3s


Epoch 2 it 50/659 tr_loss 0.2856 elapsed 16.1s


Epoch 2 it 100/659 tr_loss 0.3232 elapsed 31.9s


Epoch 2 it 150/659 tr_loss 0.3432 elapsed 48.1s


Epoch 2 it 200/659 tr_loss 0.3409 elapsed 64.3s


Epoch 2 it 250/659 tr_loss 0.3285 elapsed 80.1s


Epoch 2 it 300/659 tr_loss 0.3318 elapsed 96.2s


Epoch 2 it 350/659 tr_loss 0.3361 elapsed 112.1s


Epoch 2 it 400/659 tr_loss 0.3274 elapsed 128.4s


Epoch 2 it 450/659 tr_loss 0.3225 elapsed 144.6s


Epoch 2 it 500/659 tr_loss 0.3187 elapsed 160.4s


Epoch 2 it 550/659 tr_loss 0.3162 elapsed 176.8s


Epoch 2 it 600/659 tr_loss 0.3124 elapsed 193.0s


Epoch 2 it 650/659 tr_loss 0.3105 elapsed 208.8s


Epoch 2: tr_loss 0.3120 val_loss 0.3062 val_qwk(def) 0.6814 epoch_time 229.2s total_elapsed 458.6s


Epoch 3 it 50/659 tr_loss 0.2684 elapsed 16.2s


Epoch 3 it 100/659 tr_loss 0.3077 elapsed 32.3s


Epoch 3 it 150/659 tr_loss 0.3078 elapsed 48.1s


Epoch 3 it 200/659 tr_loss 0.2990 elapsed 64.3s


Epoch 3 it 250/659 tr_loss 0.2920 elapsed 80.2s


Epoch 3 it 300/659 tr_loss 0.2906 elapsed 96.5s


Epoch 3 it 350/659 tr_loss 0.2850 elapsed 112.8s


Epoch 3 it 400/659 tr_loss 0.2816 elapsed 128.9s


Epoch 3 it 450/659 tr_loss 0.2727 elapsed 145.0s


Epoch 3 it 500/659 tr_loss 0.2736 elapsed 161.1s


Epoch 3 it 550/659 tr_loss 0.2761 elapsed 176.9s


Epoch 3 it 600/659 tr_loss 0.2695 elapsed 193.1s


Epoch 3 it 650/659 tr_loss 0.2663 elapsed 209.3s


Epoch 3: tr_loss 0.2650 val_loss 0.2133 val_qwk(def) 0.8005 epoch_time 229.8s total_elapsed 688.4s


Epoch 4 it 50/659 tr_loss 0.2371 elapsed 15.8s


Epoch 4 it 100/659 tr_loss 0.2464 elapsed 31.9s


Epoch 4 it 150/659 tr_loss 0.2535 elapsed 47.9s


Epoch 4 it 200/659 tr_loss 0.2349 elapsed 63.7s


Epoch 4 it 250/659 tr_loss 0.2261 elapsed 79.9s


Epoch 4 it 300/659 tr_loss 0.2270 elapsed 95.7s


Epoch 4 it 350/659 tr_loss 0.2181 elapsed 111.7s


Epoch 4 it 400/659 tr_loss 0.2170 elapsed 127.9s


Epoch 4 it 450/659 tr_loss 0.2165 elapsed 143.6s


Epoch 4 it 500/659 tr_loss 0.2163 elapsed 159.7s


Epoch 4 it 550/659 tr_loss 0.2151 elapsed 175.8s


Epoch 4 it 600/659 tr_loss 0.2147 elapsed 191.6s


Epoch 4 it 650/659 tr_loss 0.2166 elapsed 207.8s


Epoch 4: tr_loss 0.2201 val_loss 0.3065 val_qwk(def) 0.6783 epoch_time 228.2s total_elapsed 916.7s


Epoch 5 it 50/659 tr_loss 0.2494 elapsed 16.1s


Epoch 5 it 100/659 tr_loss 0.2212 elapsed 31.9s


Epoch 5 it 150/659 tr_loss 0.2250 elapsed 47.9s


Epoch 5 it 200/659 tr_loss 0.2213 elapsed 63.6s


Epoch 5 it 250/659 tr_loss 0.2098 elapsed 79.6s


Epoch 5 it 300/659 tr_loss 0.2140 elapsed 95.7s


Epoch 5 it 350/659 tr_loss 0.2108 elapsed 111.4s


Epoch 5 it 400/659 tr_loss 0.2106 elapsed 127.5s


Epoch 5 it 450/659 tr_loss 0.2147 elapsed 143.5s


Epoch 5 it 500/659 tr_loss 0.2149 elapsed 159.2s


Epoch 5 it 550/659 tr_loss 0.2163 elapsed 175.4s


Epoch 5 it 600/659 tr_loss 0.2105 elapsed 191.6s


Epoch 5 it 650/659 tr_loss 0.2074 elapsed 207.4s


Epoch 5: tr_loss 0.2061 val_loss 0.2182 val_qwk(def) 0.8091 epoch_time 227.7s total_elapsed 1144.4s


Epoch 6 it 50/659 tr_loss 0.2177 elapsed 16.0s


Epoch 6 it 100/659 tr_loss 0.1953 elapsed 32.2s


Epoch 6 it 150/659 tr_loss 0.1761 elapsed 47.9s


Epoch 6 it 200/659 tr_loss 0.1702 elapsed 64.0s


Epoch 6 it 250/659 tr_loss 0.1716 elapsed 79.9s


Epoch 6 it 300/659 tr_loss 0.1687 elapsed 95.9s


Epoch 6 it 350/659 tr_loss 0.1689 elapsed 111.9s


Epoch 6 it 400/659 tr_loss 0.1690 elapsed 127.8s


Epoch 6 it 450/659 tr_loss 0.1710 elapsed 143.8s


Epoch 6 it 500/659 tr_loss 0.1707 elapsed 159.8s


Epoch 6 it 550/659 tr_loss 0.1756 elapsed 175.8s


Epoch 6 it 600/659 tr_loss 0.1735 elapsed 191.9s


Epoch 6 it 650/659 tr_loss 0.1746 elapsed 207.7s


Epoch 6: tr_loss 0.1751 val_loss 0.1851 val_qwk(def) 0.8612 epoch_time 228.2s total_elapsed 1372.6s


Epoch 7 it 50/659 tr_loss 0.1612 elapsed 16.1s


Epoch 7 it 100/659 tr_loss 0.1734 elapsed 32.2s


Epoch 7 it 150/659 tr_loss 0.1783 elapsed 47.8s


Epoch 7 it 200/659 tr_loss 0.1665 elapsed 64.0s


Epoch 7 it 250/659 tr_loss 0.1643 elapsed 80.0s


Epoch 7 it 300/659 tr_loss 0.1596 elapsed 95.8s


Epoch 7 it 350/659 tr_loss 0.1566 elapsed 111.9s


Epoch 7 it 400/659 tr_loss 0.1549 elapsed 128.0s


Epoch 7 it 450/659 tr_loss 0.1507 elapsed 143.9s


Epoch 7 it 500/659 tr_loss 0.1502 elapsed 159.9s


Epoch 7 it 550/659 tr_loss 0.1512 elapsed 176.0s


Epoch 7 it 600/659 tr_loss 0.1544 elapsed 191.8s


Epoch 7 it 650/659 tr_loss 0.1521 elapsed 207.9s


Epoch 7: tr_loss 0.1532 val_loss 0.2538 val_qwk(def) 0.8445 epoch_time 228.4s total_elapsed 1601.0s


Epoch 8 it 50/659 tr_loss 0.1326 elapsed 16.0s


Epoch 8 it 100/659 tr_loss 0.1299 elapsed 31.7s


Epoch 8 it 150/659 tr_loss 0.1357 elapsed 47.7s


Epoch 8 it 200/659 tr_loss 0.1383 elapsed 63.5s


Epoch 8 it 250/659 tr_loss 0.1400 elapsed 79.7s


Epoch 8 it 300/659 tr_loss 0.1373 elapsed 95.8s


Epoch 8 it 350/659 tr_loss 0.1352 elapsed 111.5s


Epoch 8 it 400/659 tr_loss 0.1309 elapsed 127.6s


Epoch 8 it 450/659 tr_loss 0.1308 elapsed 143.4s


Epoch 8 it 500/659 tr_loss 0.1352 elapsed 159.3s


Epoch 8 it 550/659 tr_loss 0.1378 elapsed 175.3s


Epoch 8 it 600/659 tr_loss 0.1398 elapsed 191.0s


Epoch 8 it 650/659 tr_loss 0.1406 elapsed 207.0s


Epoch 8: tr_loss 0.1413 val_loss 0.1698 val_qwk(def) 0.8552 epoch_time 227.4s total_elapsed 1828.4s


Epoch 9 it 50/659 tr_loss 0.1198 elapsed 15.9s


Epoch 9 it 100/659 tr_loss 0.1304 elapsed 31.6s


Epoch 9 it 150/659 tr_loss 0.1285 elapsed 47.7s


Epoch 9 it 200/659 tr_loss 0.1277 elapsed 63.7s


Epoch 9 it 250/659 tr_loss 0.1322 elapsed 79.5s


Epoch 9 it 300/659 tr_loss 0.1314 elapsed 95.5s


Epoch 9 it 350/659 tr_loss 0.1291 elapsed 111.8s


Epoch 9 it 400/659 tr_loss 0.1228 elapsed 127.6s


Epoch 9 it 450/659 tr_loss 0.1251 elapsed 143.7s


Epoch 9 it 500/659 tr_loss 0.1248 elapsed 159.5s


Epoch 9 it 550/659 tr_loss 0.1262 elapsed 175.5s


Epoch 9 it 600/659 tr_loss 0.1244 elapsed 191.6s


Epoch 9 it 650/659 tr_loss 0.1263 elapsed 207.3s


Epoch 9: tr_loss 0.1261 val_loss 0.2056 val_qwk(def) 0.8269 epoch_time 227.7s total_elapsed 2056.2s


Epoch 10 it 50/659 tr_loss 0.1044 elapsed 16.0s


Epoch 10 it 100/659 tr_loss 0.1058 elapsed 32.0s


Epoch 10 it 150/659 tr_loss 0.1047 elapsed 47.8s


Epoch 10 it 200/659 tr_loss 0.1097 elapsed 63.8s


Epoch 10 it 250/659 tr_loss 0.1131 elapsed 79.9s


Epoch 10 it 300/659 tr_loss 0.1093 elapsed 95.7s


Epoch 10 it 350/659 tr_loss 0.1126 elapsed 111.8s


Epoch 10 it 400/659 tr_loss 0.1121 elapsed 127.8s


Epoch 10 it 450/659 tr_loss 0.1143 elapsed 143.5s


Epoch 10 it 500/659 tr_loss 0.1129 elapsed 159.7s


Epoch 10 it 550/659 tr_loss 0.1111 elapsed 175.5s


Epoch 10 it 600/659 tr_loss 0.1116 elapsed 191.7s


Epoch 10 it 650/659 tr_loss 0.1110 elapsed 207.7s


Epoch 10: tr_loss 0.1109 val_loss 0.2143 val_qwk(def) 0.8016 epoch_time 228.2s total_elapsed 2284.4s


Epoch 11 it 50/659 tr_loss 0.1157 elapsed 15.9s


Epoch 11 it 100/659 tr_loss 0.1098 elapsed 32.1s


Epoch 11 it 150/659 tr_loss 0.1042 elapsed 48.1s


Epoch 11 it 200/659 tr_loss 0.0977 elapsed 63.9s


Epoch 11 it 250/659 tr_loss 0.0948 elapsed 80.1s


Epoch 11 it 300/659 tr_loss 0.0963 elapsed 95.8s


Epoch 11 it 350/659 tr_loss 0.0957 elapsed 111.9s


Epoch 11 it 400/659 tr_loss 0.0960 elapsed 128.0s


Epoch 11 it 450/659 tr_loss 0.0955 elapsed 143.8s


Epoch 11 it 500/659 tr_loss 0.0961 elapsed 159.9s


Epoch 11 it 550/659 tr_loss 0.0931 elapsed 176.0s


Epoch 11 it 600/659 tr_loss 0.0962 elapsed 191.9s


Epoch 11 it 650/659 tr_loss 0.0980 elapsed 208.1s


Epoch 11: tr_loss 0.0981 val_loss 0.2239 val_qwk(def) 0.8463 epoch_time 228.6s total_elapsed 2512.9s


Early stopping triggered


[B4@640] Fold 1 val QWK (default th): 0.8552


Attempt 1: trying batch_size=4



===== [B4@640 RRC+EMA] Fold 2 =====


  model = create_fn(


Epoch 1 it 50/659 tr_loss 0.8272 elapsed 15.9s


Epoch 1 it 100/659 tr_loss 0.8438 elapsed 32.2s


Epoch 1 it 150/659 tr_loss 0.8468 elapsed 48.5s


Epoch 1 it 200/659 tr_loss 0.8046 elapsed 64.4s


Epoch 1 it 250/659 tr_loss 0.7780 elapsed 80.7s


Epoch 1 it 300/659 tr_loss 0.7643 elapsed 96.6s


Epoch 1 it 350/659 tr_loss 0.7322 elapsed 112.7s


Epoch 1 it 400/659 tr_loss 0.7132 elapsed 129.0s


Epoch 1 it 450/659 tr_loss 0.6811 elapsed 145.0s


Epoch 1 it 500/659 tr_loss 0.6561 elapsed 161.2s


Epoch 1 it 550/659 tr_loss 0.6376 elapsed 177.1s


Epoch 1 it 600/659 tr_loss 0.6149 elapsed 192.9s


Epoch 1 it 650/659 tr_loss 0.5891 elapsed 209.0s


Epoch 1: tr_loss 0.5857 val_loss 0.2672 val_qwk(def) 0.7838 epoch_time 229.5s total_elapsed 229.5s


Epoch 2 it 50/659 tr_loss 0.3578 elapsed 16.1s


Epoch 2 it 100/659 tr_loss 0.3353 elapsed 31.9s


Epoch 2 it 150/659 tr_loss 0.3375 elapsed 48.2s


Epoch 2 it 200/659 tr_loss 0.3334 elapsed 64.4s


Epoch 2 it 250/659 tr_loss 0.3292 elapsed 80.7s


Epoch 2 it 300/659 tr_loss 0.3237 elapsed 97.1s


Epoch 2 it 350/659 tr_loss 0.3314 elapsed 113.1s


Epoch 2 it 400/659 tr_loss 0.3241 elapsed 129.7s


Epoch 2 it 450/659 tr_loss 0.3287 elapsed 145.9s


Epoch 2 it 500/659 tr_loss 0.3324 elapsed 162.0s


Epoch 2 it 550/659 tr_loss 0.3277 elapsed 178.4s


Epoch 2 it 600/659 tr_loss 0.3286 elapsed 194.4s


Epoch 2 it 650/659 tr_loss 0.3281 elapsed 210.7s


Epoch 2: tr_loss 0.3289 val_loss 0.2821 val_qwk(def) 0.6904 epoch_time 231.4s total_elapsed 461.0s


Epoch 3 it 50/659 tr_loss 0.2975 elapsed 16.3s


Epoch 3 it 100/659 tr_loss 0.3110 elapsed 32.3s


Epoch 3 it 150/659 tr_loss 0.2917 elapsed 48.7s


Epoch 3 it 200/659 tr_loss 0.2923 elapsed 65.2s


Epoch 3 it 250/659 tr_loss 0.2939 elapsed 80.9s


Epoch 3 it 300/659 tr_loss 0.2926 elapsed 97.2s


Epoch 3 it 350/659 tr_loss 0.2905 elapsed 113.3s


Epoch 3 it 400/659 tr_loss 0.2876 elapsed 129.5s


Epoch 3 it 450/659 tr_loss 0.2851 elapsed 145.7s


Epoch 3 it 500/659 tr_loss 0.2797 elapsed 161.6s


Epoch 3 it 550/659 tr_loss 0.2796 elapsed 177.7s


Epoch 3 it 600/659 tr_loss 0.2830 elapsed 193.9s


Epoch 3 it 650/659 tr_loss 0.2776 elapsed 209.8s


Epoch 3: tr_loss 0.2767 val_loss 0.2088 val_qwk(def) 0.7910 epoch_time 230.4s total_elapsed 691.4s


Epoch 4 it 50/659 tr_loss 0.2254 elapsed 16.1s


Epoch 4 it 100/659 tr_loss 0.2354 elapsed 32.2s


Epoch 4 it 150/659 tr_loss 0.2282 elapsed 48.0s


Epoch 4 it 200/659 tr_loss 0.2366 elapsed 64.1s


Epoch 4 it 250/659 tr_loss 0.2496 elapsed 80.3s


Epoch 4 it 300/659 tr_loss 0.2437 elapsed 96.2s


Epoch 4 it 350/659 tr_loss 0.2423 elapsed 112.3s


Epoch 4 it 400/659 tr_loss 0.2415 elapsed 128.1s


Epoch 4 it 450/659 tr_loss 0.2413 elapsed 144.5s


Epoch 4 it 500/659 tr_loss 0.2414 elapsed 160.7s


Epoch 4 it 550/659 tr_loss 0.2370 elapsed 176.6s


Epoch 4 it 600/659 tr_loss 0.2350 elapsed 192.7s


Epoch 4 it 650/659 tr_loss 0.2351 elapsed 208.5s


Epoch 4: tr_loss 0.2337 val_loss 0.2169 val_qwk(def) 0.7935 epoch_time 229.3s total_elapsed 920.7s


Epoch 5 it 50/659 tr_loss 0.1980 elapsed 15.8s


Epoch 5 it 100/659 tr_loss 0.2063 elapsed 31.9s


Epoch 5 it 150/659 tr_loss 0.1885 elapsed 47.7s


Epoch 5 it 200/659 tr_loss 0.1990 elapsed 63.7s


Epoch 5 it 250/659 tr_loss 0.1960 elapsed 79.8s


Epoch 5 it 300/659 tr_loss 0.1992 elapsed 95.7s


Epoch 5 it 350/659 tr_loss 0.2027 elapsed 111.8s


Epoch 5 it 400/659 tr_loss 0.2028 elapsed 127.6s


Epoch 5 it 450/659 tr_loss 0.2059 elapsed 143.7s


Epoch 5 it 500/659 tr_loss 0.2064 elapsed 159.9s


Epoch 5 it 550/659 tr_loss 0.2010 elapsed 175.8s


Epoch 5 it 600/659 tr_loss 0.2003 elapsed 192.2s


Epoch 5 it 650/659 tr_loss 0.2013 elapsed 208.2s


Epoch 5: tr_loss 0.2030 val_loss 0.2365 val_qwk(def) 0.7974 epoch_time 229.1s total_elapsed 1149.8s


Epoch 6 it 50/659 tr_loss 0.1483 elapsed 16.0s


Epoch 6 it 100/659 tr_loss 0.1761 elapsed 32.7s


Epoch 6 it 150/659 tr_loss 0.1811 elapsed 49.1s


Epoch 6 it 200/659 tr_loss 0.1896 elapsed 65.1s


Epoch 6 it 250/659 tr_loss 0.1816 elapsed 81.4s


Epoch 6 it 300/659 tr_loss 0.1780 elapsed 97.5s


Epoch 6 it 350/659 tr_loss 0.1769 elapsed 113.7s


Epoch 6 it 400/659 tr_loss 0.1801 elapsed 130.2s


Epoch 6 it 450/659 tr_loss 0.1838 elapsed 146.1s


Epoch 6 it 500/659 tr_loss 0.1806 elapsed 162.4s


Epoch 6 it 550/659 tr_loss 0.1802 elapsed 178.3s


Epoch 6 it 600/659 tr_loss 0.1827 elapsed 194.6s


Epoch 6 it 650/659 tr_loss 0.1861 elapsed 210.9s


Epoch 6: tr_loss 0.1855 val_loss 0.2264 val_qwk(def) 0.8015 epoch_time 231.6s total_elapsed 1381.4s


Early stopping triggered


[B4@640] Fold 2 val QWK (default th): 0.7910


NNLS blend weights ( ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640'] ): [0.16578083 0.26201063 0.40078278 0.17142576]


Blend OOF QWK (opt on mask): 0.88125 th: [0.57361172 1.57246177 2.45997467 2.9617959 ]


Bootstrap 20/100


Bootstrap 40/100


Bootstrap 60/100


Bootstrap 80/100


Bootstrap 100/100


Bootstrapped thresholds: [0.52041135 1.50418103 2.55775329 3.28651022]




Saved submission.csv with NNLS-weighted blend and bootstrapped thresholds.


In [16]:
# Utility: Recompute hflip-only test preds for B4@512 and B5@512, then finalize blend with NNLS weights
import numpy as np, pandas as pd, torch
from torch.utils.data import DataLoader

def hflip_only_predict(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval()
                    p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    s += (p0 + p1) / 2.0; c += 1
                p = s / max(1, c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def recompute_hflip_test_preds_for_512_models():
    made = {}
    try:
        dte = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
        dl = DataLoader(dte, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        if 'models' in globals() and isinstance(models, list) and len(models) > 0:
            p_b4 = hflip_only_predict(models, dl)
            np.save('test_reg_preds_b4_hflip.npy', p_b4)
            print('Saved test_reg_preds_b4_hflip.npy', flush=True)
            made['b4_512'] = 'test_reg_preds_b4_hflip.npy'
        else:
            print('models (B4@512) not found; skipping', flush=True)
        if 'models_b5' in globals() and isinstance(models_b5, list) and len(models_b5) > 0:
            p_b5 = hflip_only_predict(models_b5, dl)
            np.save('test_reg_preds_b5_hflip.npy', p_b5)
            print('Saved test_reg_preds_b5_hflip.npy', flush=True)
            made['b5_512'] = 'test_reg_preds_b5_hflip.npy'
        else:
            print('models_b5 (B5@512) not found; skipping', flush=True)
    except Exception as e:
        print('Recompute hflip test preds failed:', e, flush=True)
    return made

def finalize_blend_with_standardized_hflip():
    # Load NNLS weights and keys from previous step
    if not (os.path.exists('blend_weights_keys.npy') and os.path.exists('blend_weights_vals.npy')):
        print('Blend weights not found. Run prepare_and_run_b4_640_and_blend() first.', flush=True);
        return
    keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
    weights = np.load('blend_weights_vals.npy')
    # Thresholds
    th = np.load('thresholds_blend_boot.npy') if os.path.exists('thresholds_blend_boot.npy') else np.array([0.5,1.5,2.5,3.5], dtype=float)
    # Collect test preds per key with hflip-only preference
    pred_map = {}
    for k in keys:
        if k == 'b4_512':
            if os.path.exists('test_reg_preds_b4_hflip.npy'):
                pred_map[k] = np.load('test_reg_preds_b4_hflip.npy')
            elif os.path.exists('test_reg_preds_b4_tta.npy'):
                pred_map[k] = np.load('test_reg_preds_b4_tta.npy')
            elif os.path.exists('test_reg_preds_b4.npy'):
                pred_map[k] = np.load('test_reg_preds_b4.npy')
        elif k == 'b5_512':
            if os.path.exists('test_reg_preds_b5_hflip.npy'):
                pred_map[k] = np.load('test_reg_preds_b5_hflip.npy')
            elif os.path.exists('test_reg_preds_b5_tta.npy'):
                pred_map[k] = np.load('test_reg_preds_b5_tta.npy')
            elif os.path.exists('test_reg_preds.npy'):
                pred_map[k] = np.load('test_reg_preds.npy')
        elif k == 'b5_512_rrcema':
            # already saved as hflip-only earlier
            if os.path.exists('test_reg_preds_b5_seed2025_rrc_ema.npy'):
                pred_map[k] = np.load('test_reg_preds_b5_seed2025_rrc_ema.npy')
        elif k == 'b4_640':
            if os.path.exists('test_reg_preds_b4_640_rrc_ema.npy'):
                pred_map[k] = np.load('test_reg_preds_b4_640_rrc_ema.npy')
    # Verify availability
    missing = [k for k in keys if k not in pred_map]
    if missing:
        print('Missing test preds for keys:', missing, flush=True)
    used_keys = [k for k in keys if k in pred_map]
    mats = [pred_map[k] for k in used_keys]
    if not mats:
        print('No test preds to blend.', flush=True); return
    w = weights[:len(mats)]
    if w.sum() <= 0: w = np.ones_like(w) / len(w)
    w = w / w.sum()
    blend = np.zeros_like(mats[0], dtype=float)
    for wi, arr in zip(w, mats): blend += wi * arr
    # Threshold to classes
    def preds_to_classes_blend(preds, thresholds):
        th0, th1, th2, th3 = thresholds
        return np.digitize(preds, bins=[th0, th1, th2, th3])
    cls = preds_to_classes_blend(blend, th).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('test_reg_preds_blend_final.npy', blend)
    print('Final standardized hflip-only blend saved to submission.csv', flush=True)

print('Cell ready: after B4@640 completes and initial blend is done, run:', flush=True)
print('- recompute_hflip_test_preds_for_512_models()', flush=True)
print('- finalize_blend_with_standardized_hflip()', flush=True)

Cell ready: after B4@640 completes and initial blend is done, run:


- recompute_hflip_test_preds_for_512_models()


- finalize_blend_with_standardized_hflip()


In [17]:
# Recompute hflip-only preds for 512px models and finalize standardized blend
made = recompute_hflip_test_preds_for_512_models()
print('Made:', made, flush=True)
finalize_blend_with_standardized_hflip()

Saved test_reg_preds_b4_hflip.npy


Saved test_reg_preds_b5_hflip.npy


Made: {'b4_512': 'test_reg_preds_b4_hflip.npy', 'b5_512': 'test_reg_preds_b5_hflip.npy'}


Final standardized hflip-only blend saved to submission.csv


In [18]:
# SEResNeXt50_32x4d @512 (3 folds) with RRC(scale=(0.92,1.0)) + EMA, then refit NNLS with caps [0.05,0.7], bootstrap thresholds (n=180)
import os, gc, copy, time, math, numpy as np, pandas as pd, torch, torch.nn as nn, albumentations as A, cv2
from torch.utils.data import DataLoader
from timm.utils import ModelEmaV2
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Starting SEResNeXt50_32x4d@512 (3 folds) RRC+EMA...', flush=True)

# Transforms: RRC tightened to (0.92, 1.0)
MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_serx = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.92, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.05), rotate=(-12, 12), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_serx = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def get_loaders_serx(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_serx)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_serx)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_serx(model, dl, loss_fn):
    model.eval(); preds = []; targs = []; val_loss = 0.0; n = 0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)
            bs = xb.size(0); val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy()); targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/max(1,n), preds, targs

def train_one_fold_serx_ema(fold, folds_df, epochs=15, lr=2e-4, wd=1e-5, batch_size=8, patience=3, ema_decay=0.9996):
    print(f"\n===== [SEResNeXt50@512 RRC+EMA] Fold {fold} =====", flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_serx(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = RegHeadModel(backbone_name='seresnext50_32x4d', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr): return (step+1)/len(dl_tr)
        progress = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5 * (1 + math.cos(math.pi * progress))
    sch = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    best_loss = float('inf'); best_state = None; best_preds = None; best_targs = None; no_imp = 0
    accum = max(1, 16 // batch_size); t0_all = time.time(); opt.zero_grad(set_to_none=True)
    for ep in range(1, epochs+1):
        model.train(); tr_loss=0.0; n=0; t0=time.time()
        for it,(xb,yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)/accum
            scaler.scale(loss).backward()
            if (it+1)%accum==0:
                scaler.step(opt); scaler.update(); opt.zero_grad(set_to_none=True); sch.step(); ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item()*accum)*bs; n += bs
            if (it+1)%50==0: print(f'Epoch {ep} it {it+1}/{len(dl_tr)} tr_loss {tr_loss/max(1,n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        vloss, vpreds, vtargs = validate_serx(model, dl_va, loss_fn)
        vq = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'Epoch {ep}: tr_loss {tr_loss/max(1,n):.4f} val_loss {vloss:.4f} val_qwk(def) {vq:.4f} epoch_time {time.time()-t0:.1f}s total_elapsed {time.time()-t0_all:.1f}s', flush=True)
        if vloss < best_loss:
            best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = vpreds.copy(); best_targs = vtargs.copy(); no_imp = 0
        else:
            no_imp += 1
        if no_imp >= patience:
            print('Early stopping triggered', flush=True); break
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_serx_ema(folds_df, folds_to_run=(0,1,2), epochs=15, batch_size=8):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); oof_preds = np.full(n, np.nan, dtype=np.float32); oof_targs = folds_df['diagnosis'].values.astype(float)
    models_out = []
    for fold in folds_to_run:
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception:
            pass
        gc.collect()
        bs = batch_size
        for attempt in range(5):
            print(f'Attempt {attempt+1}: trying batch_size={bs}', flush=True)
            try:
                m, vpreds, vtargs = train_one_fold_serx_ema(fold, folds_df, epochs=epochs, lr=2e-4, wd=1e-5, batch_size=bs, patience=3, ema_decay=0.9996)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                if 'out of memory' in str(e).lower():
                    print(f'CUDA OOM on fold {fold} with batch_size={bs}. Reducing...', flush=True)
                    try: del m
                    except Exception: pass
                    gc.collect()
                    try: torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
                    except Exception: pass
                    bs = max(2, bs-2); continue
                else:
                    raise
        models_out.append(m)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = vpreds
        fold_q = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'[SEResNeXt50@512] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    np.save('oof_preds_serx50_512_rrc_ema.npy', oof_preds); np.save('oof_targets_serx50_512_rrc_ema.npy', oof_targs)
    return models_out, oof_preds, oof_targs

def hflip_tta_predict_mean(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval(); p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    s += (p0 + p1) / 2.0; c += 1
                p = s / max(1,c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def qwk_func(y_true, y_pred_cls):
    return cohen_kappa_score(y_true, y_pred_cls, weights='quadratic')

def preds_to_classes_blend(preds, thresholds):
    th0, th1, th2, th3 = thresholds
    return np.digitize(preds, bins=[th0, th1, th2, th3])

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -1.0, 4.0)
        cls = preds_to_classes_blend(p, th)
        return -qwk_func(y, cls)
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05: th[i] = th[i-1] + 0.05
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=180, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j] - med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

def nnls_blend_weights_capped(y, preds_list, mask=None, lo=0.05, hi=0.7):
    y = np.asarray(y).astype(float); mats = [np.asarray(p).reshape(-1,1) for p in preds_list]
    X = np.concatenate(mats, axis=1)
    if mask is not None: X = X[mask]; y = y[mask]
    w, _ = nnls(X, y)
    if w.sum() > 0: w = w / w.sum()
    else: w = np.ones_like(w) / len(w)
    w = np.clip(w, lo, hi)
    w = w / w.sum()
    return w

# 1) Train SEResNeXt50 (3 folds)
models_serx, oof_serx, oof_t = run_cv_serx_ema(folds_df, folds_to_run=(0,1,2), epochs=15, batch_size=8)

# 2) Inference for SEResNeXt50 (hflip-only TTA) on test
dte_serx = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_serx)
dl_te_serx = DataLoader(dte_serx, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
preds_serx = hflip_tta_predict_mean(models_serx, dl_te_serx)
np.save('test_reg_preds_serx50_512_rrc_ema.npy', preds_serx)
print('Saved test_reg_preds_serx50_512_rrc_ema.npy', flush=True)

# 3) Refit NNLS on OOF intersection including SEResNeXt50 with caps [0.05,0.7]
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
}
loaded_keys = []; loaded_arrs = []; mask = np.ones_like(oof_t, dtype=bool)
for k,p in paths_oof.items():
    if os.path.exists(p):
        arr = np.load(p); loaded_keys.append(k); loaded_arrs.append(arr); mask &= np.isfinite(arr)
    else:
        print('Missing OOF for', k, '->', p, flush=True)
weights = nnls_blend_weights_capped(oof_t, loaded_arrs, mask=mask, lo=0.05, hi=0.7)
print('NNLS weights (capped) for', loaded_keys, ':', weights, flush=True)

# 4) Create blended OOF, bootstrap thresholds (n=180)
blend_oof = np.zeros_like(oof_t, dtype=float)
for w, a in zip(weights, loaded_arrs):
    a_fill = np.where(np.isfinite(a), a, 0.0); blend_oof += w * a_fill
th_opt = optimize_thresholds_generic(oof_t[mask], blend_oof[mask])
oof_q_opt = qwk_func(oof_t[mask], preds_to_classes_blend(blend_oof[mask], th_opt))
print('Blend OOF QWK (opt on mask):', f'{oof_q_opt:.5f}', 'th:', th_opt, flush=True)
th_boot = bootstrap_thresholds_generic(oof_t[mask], blend_oof[mask], n_bootstrap=180, base=[0.5,1.5,2.5,3.5])
print('Bootstrapped thresholds (n=180):', th_boot, flush=True)
np.save('blend_weights_keys.npy', np.array(loaded_keys, dtype=object)); np.save('blend_weights_vals.npy', weights); np.save('thresholds_blend_boot.npy', th_boot)

# 5) Finalize standardized hflip-only test blend across all available keys
pred_map = {}
for k in loaded_keys:
    if k == 'b4_512' and os.path.exists('test_reg_preds_b4_hflip.npy'): pred_map[k] = np.load('test_reg_preds_b4_hflip.npy')
    elif k == 'b5_512' and os.path.exists('test_reg_preds_b5_hflip.npy'): pred_map[k] = np.load('test_reg_preds_b5_hflip.npy')
    elif k == 'b5_512_rrcema' and os.path.exists('test_reg_preds_b5_seed2025_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_b5_seed2025_rrc_ema.npy')
    elif k == 'b4_640' and os.path.exists('test_reg_preds_b4_640_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_b4_640_rrc_ema.npy')
    elif k == 'serx50_512_rrcema' and os.path.exists('test_reg_preds_serx50_512_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_serx50_512_rrc_ema.npy')

used_keys = [k for k in loaded_keys if k in pred_map]
mats = [pred_map[k] for k in used_keys]
if len(mats) == 0:
    print('No test preds available to blend.', flush=True)
else:
    w_use = weights[:len(mats)]
    if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
    w_use = w_use / w_use.sum()
    blend_test = np.zeros_like(mats[0], dtype=float)
    for wi, arr in zip(w_use, mats): blend_test += wi * arr
    cls = preds_to_classes_blend(blend_test, th_boot).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('test_reg_preds_blend_final.npy', blend_test)
    print('Final submission.csv written with SEResNeXt50 included.', flush=True)

Starting SEResNeXt50_32x4d@512 (3 folds) RRC+EMA...


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 0 =====


Epoch 1 it 50/329 tr_loss 0.8071 elapsed 17.4s


Epoch 1 it 100/329 tr_loss 0.7231 elapsed 35.1s


Epoch 1 it 150/329 tr_loss 0.6547 elapsed 52.4s


Epoch 1 it 200/329 tr_loss 0.5866 elapsed 69.6s


Epoch 1 it 250/329 tr_loss 0.5414 elapsed 86.7s


Epoch 1 it 300/329 tr_loss 0.5019 elapsed 104.1s


Epoch 1: tr_loss 0.4844 val_loss 0.2652 val_qwk(def) 0.7086 epoch_time 125.1s total_elapsed 125.1s


Epoch 2 it 50/329 tr_loss 0.2773 elapsed 17.2s


Epoch 2 it 100/329 tr_loss 0.2600 elapsed 34.8s


Epoch 2 it 150/329 tr_loss 0.2522 elapsed 52.2s


Epoch 2 it 200/329 tr_loss 0.2525 elapsed 69.5s


Epoch 2 it 250/329 tr_loss 0.2582 elapsed 87.0s


Epoch 2 it 300/329 tr_loss 0.2475 elapsed 104.3s


Epoch 2: tr_loss 0.2461 val_loss 0.2254 val_qwk(def) 0.8147 epoch_time 125.1s total_elapsed 250.2s


Epoch 3 it 50/329 tr_loss 0.2181 elapsed 17.3s


Epoch 3 it 100/329 tr_loss 0.2162 elapsed 34.7s


Epoch 3 it 150/329 tr_loss 0.2170 elapsed 52.2s


Epoch 3 it 200/329 tr_loss 0.2107 elapsed 69.4s


Epoch 3 it 250/329 tr_loss 0.2108 elapsed 86.6s


Epoch 3 it 300/329 tr_loss 0.2155 elapsed 104.2s


Epoch 3: tr_loss 0.2102 val_loss 0.1901 val_qwk(def) 0.8185 epoch_time 125.0s total_elapsed 375.2s


Epoch 4 it 50/329 tr_loss 0.1712 elapsed 17.2s


Epoch 4 it 100/329 tr_loss 0.1788 elapsed 34.4s


Epoch 4 it 150/329 tr_loss 0.1770 elapsed 51.7s


Epoch 4 it 200/329 tr_loss 0.1858 elapsed 68.9s


Epoch 4 it 250/329 tr_loss 0.1911 elapsed 86.1s


Epoch 4 it 300/329 tr_loss 0.1934 elapsed 103.7s


Epoch 4: tr_loss 0.1899 val_loss 0.1803 val_qwk(def) 0.8396 epoch_time 124.3s total_elapsed 499.6s


Epoch 5 it 50/329 tr_loss 0.1621 elapsed 17.2s


Epoch 5 it 100/329 tr_loss 0.1696 elapsed 34.4s


Epoch 5 it 150/329 tr_loss 0.1626 elapsed 51.7s


Epoch 5 it 200/329 tr_loss 0.1622 elapsed 68.9s


Epoch 5 it 250/329 tr_loss 0.1661 elapsed 86.3s


Epoch 5 it 300/329 tr_loss 0.1625 elapsed 103.4s


Epoch 5: tr_loss 0.1620 val_loss 0.1847 val_qwk(def) 0.8224 epoch_time 124.0s total_elapsed 623.6s


Epoch 6 it 50/329 tr_loss 0.1285 elapsed 17.1s


Epoch 6 it 100/329 tr_loss 0.1342 elapsed 34.6s


Epoch 6 it 150/329 tr_loss 0.1412 elapsed 51.7s


Epoch 6 it 200/329 tr_loss 0.1510 elapsed 68.9s


Epoch 6 it 250/329 tr_loss 0.1566 elapsed 86.4s


Epoch 6 it 300/329 tr_loss 0.1551 elapsed 103.7s


Epoch 6: tr_loss 0.1562 val_loss 0.1721 val_qwk(def) 0.8598 epoch_time 124.4s total_elapsed 747.9s


Epoch 7 it 50/329 tr_loss 0.1443 elapsed 17.5s


Epoch 7 it 100/329 tr_loss 0.1527 elapsed 34.8s


Epoch 7 it 150/329 tr_loss 0.1484 elapsed 52.2s


Epoch 7 it 200/329 tr_loss 0.1418 elapsed 69.6s


Epoch 7 it 250/329 tr_loss 0.1423 elapsed 87.0s


Epoch 7 it 300/329 tr_loss 0.1418 elapsed 104.6s


Epoch 7: tr_loss 0.1421 val_loss 0.1827 val_qwk(def) 0.8543 epoch_time 125.4s total_elapsed 873.4s


Epoch 8 it 50/329 tr_loss 0.1149 elapsed 17.3s


Epoch 8 it 100/329 tr_loss 0.1217 elapsed 34.5s


Epoch 8 it 150/329 tr_loss 0.1215 elapsed 52.1s


Epoch 8 it 200/329 tr_loss 0.1193 elapsed 69.2s


Epoch 8 it 250/329 tr_loss 0.1182 elapsed 86.3s


Epoch 8 it 300/329 tr_loss 0.1176 elapsed 103.8s


Epoch 8: tr_loss 0.1187 val_loss 0.1922 val_qwk(def) 0.8201 epoch_time 124.4s total_elapsed 997.8s


Epoch 9 it 50/329 tr_loss 0.0939 elapsed 17.2s


Epoch 9 it 100/329 tr_loss 0.0947 elapsed 34.4s


Epoch 9 it 150/329 tr_loss 0.1086 elapsed 51.5s


Epoch 9 it 200/329 tr_loss 0.1108 elapsed 68.9s


Epoch 9 it 250/329 tr_loss 0.1133 elapsed 86.1s


Epoch 9 it 300/329 tr_loss 0.1141 elapsed 103.3s


Epoch 9: tr_loss 0.1123 val_loss 0.1688 val_qwk(def) 0.8497 epoch_time 123.8s total_elapsed 1121.6s


Epoch 10 it 50/329 tr_loss 0.1078 elapsed 17.4s


Epoch 10 it 100/329 tr_loss 0.1147 elapsed 34.7s


Epoch 10 it 150/329 tr_loss 0.1107 elapsed 52.0s


Epoch 10 it 200/329 tr_loss 0.1052 elapsed 69.4s


Epoch 10 it 250/329 tr_loss 0.1105 elapsed 86.7s


Epoch 10 it 300/329 tr_loss 0.1071 elapsed 103.9s


Epoch 10: tr_loss 0.1053 val_loss 0.1707 val_qwk(def) 0.8627 epoch_time 124.6s total_elapsed 1246.3s


Epoch 11 it 50/329 tr_loss 0.0863 elapsed 17.5s


Epoch 11 it 100/329 tr_loss 0.0829 elapsed 34.7s


Epoch 11 it 150/329 tr_loss 0.0789 elapsed 51.9s


Epoch 11 it 200/329 tr_loss 0.0826 elapsed 69.1s


Epoch 11 it 250/329 tr_loss 0.0811 elapsed 86.3s


Epoch 11 it 300/329 tr_loss 0.0824 elapsed 103.6s


Epoch 11: tr_loss 0.0839 val_loss 0.1613 val_qwk(def) 0.8622 epoch_time 124.3s total_elapsed 1370.6s


Epoch 12 it 50/329 tr_loss 0.0726 elapsed 17.1s


Epoch 12 it 100/329 tr_loss 0.0820 elapsed 34.5s


Epoch 12 it 150/329 tr_loss 0.0832 elapsed 51.6s


Epoch 12 it 200/329 tr_loss 0.0845 elapsed 68.7s


Epoch 12 it 250/329 tr_loss 0.0818 elapsed 86.1s


Epoch 12 it 300/329 tr_loss 0.0800 elapsed 103.1s


Epoch 12: tr_loss 0.0788 val_loss 0.1679 val_qwk(def) 0.8400 epoch_time 123.8s total_elapsed 1494.4s


Epoch 13 it 50/329 tr_loss 0.0695 elapsed 17.3s


Epoch 13 it 100/329 tr_loss 0.0705 elapsed 34.5s


Epoch 13 it 150/329 tr_loss 0.0684 elapsed 51.6s


Epoch 13 it 200/329 tr_loss 0.0687 elapsed 69.0s


Epoch 13 it 250/329 tr_loss 0.0739 elapsed 86.1s


Epoch 13 it 300/329 tr_loss 0.0739 elapsed 103.2s


Epoch 13: tr_loss 0.0747 val_loss 0.1756 val_qwk(def) 0.8520 epoch_time 123.7s total_elapsed 1618.1s


Epoch 14 it 50/329 tr_loss 0.0500 elapsed 17.1s


Epoch 14 it 100/329 tr_loss 0.0583 elapsed 34.3s


Epoch 14 it 150/329 tr_loss 0.0603 elapsed 51.5s


Epoch 14 it 200/329 tr_loss 0.0629 elapsed 68.6s


Epoch 14 it 250/329 tr_loss 0.0615 elapsed 85.9s


Epoch 14 it 300/329 tr_loss 0.0648 elapsed 103.0s


Epoch 14: tr_loss 0.0640 val_loss 0.1776 val_qwk(def) 0.8595 epoch_time 123.5s total_elapsed 1741.6s


Early stopping triggered


[SEResNeXt50@512] Fold 0 val QWK (default th): 0.8622


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 1 =====


Epoch 1 it 50/329 tr_loss 0.9114 elapsed 17.2s


Epoch 1 it 100/329 tr_loss 0.7994 elapsed 34.4s


Epoch 1 it 150/329 tr_loss 0.7208 elapsed 51.7s


Epoch 1 it 200/329 tr_loss 0.6261 elapsed 68.9s


Epoch 1 it 250/329 tr_loss 0.5664 elapsed 86.0s


Epoch 1 it 300/329 tr_loss 0.5215 elapsed 103.2s


Epoch 1: tr_loss 0.5033 val_loss 0.3511 val_qwk(def) 0.5943 epoch_time 123.8s total_elapsed 123.8s


Epoch 2 it 50/329 tr_loss 0.2181 elapsed 17.2s


Epoch 2 it 100/329 tr_loss 0.2431 elapsed 34.7s


Epoch 2 it 150/329 tr_loss 0.2424 elapsed 51.9s


Epoch 2 it 200/329 tr_loss 0.2451 elapsed 69.1s


Epoch 2 it 250/329 tr_loss 0.2456 elapsed 86.5s


Epoch 2 it 300/329 tr_loss 0.2544 elapsed 103.6s


Epoch 2: tr_loss 0.2533 val_loss 0.2828 val_qwk(def) 0.6732 epoch_time 124.3s total_elapsed 248.1s


Epoch 3 it 50/329 tr_loss 0.2606 elapsed 17.2s


Epoch 3 it 100/329 tr_loss 0.2433 elapsed 34.2s


Epoch 3 it 150/329 tr_loss 0.2154 elapsed 51.6s


Epoch 3 it 200/329 tr_loss 0.2018 elapsed 68.8s


Epoch 3 it 250/329 tr_loss 0.2106 elapsed 85.8s


Epoch 3 it 300/329 tr_loss 0.2135 elapsed 103.3s


Epoch 3: tr_loss 0.2108 val_loss 0.2040 val_qwk(def) 0.8091 epoch_time 124.0s total_elapsed 372.1s


Epoch 4 it 50/329 tr_loss 0.2055 elapsed 17.2s


Epoch 4 it 100/329 tr_loss 0.2074 elapsed 34.4s


Epoch 4 it 150/329 tr_loss 0.1902 elapsed 51.5s


Epoch 4 it 200/329 tr_loss 0.1843 elapsed 68.7s


Epoch 4 it 250/329 tr_loss 0.1825 elapsed 85.8s


Epoch 4 it 300/329 tr_loss 0.1821 elapsed 102.9s


Epoch 4: tr_loss 0.1868 val_loss 0.1929 val_qwk(def) 0.8208 epoch_time 123.8s total_elapsed 495.9s


Epoch 5 it 50/329 tr_loss 0.1573 elapsed 17.1s


Epoch 5 it 100/329 tr_loss 0.1536 elapsed 34.2s


Epoch 5 it 150/329 tr_loss 0.1556 elapsed 51.3s


Epoch 5 it 200/329 tr_loss 0.1575 elapsed 68.5s


Epoch 5 it 250/329 tr_loss 0.1581 elapsed 85.9s


Epoch 5 it 300/329 tr_loss 0.1628 elapsed 103.1s


Epoch 5: tr_loss 0.1628 val_loss 0.2102 val_qwk(def) 0.8201 epoch_time 123.6s total_elapsed 619.6s


Epoch 6 it 50/329 tr_loss 0.1558 elapsed 17.1s


Epoch 6 it 100/329 tr_loss 0.1456 elapsed 34.6s


Epoch 6 it 150/329 tr_loss 0.1459 elapsed 51.7s


Epoch 6 it 200/329 tr_loss 0.1492 elapsed 69.0s


Epoch 6 it 250/329 tr_loss 0.1531 elapsed 86.6s


Epoch 6 it 300/329 tr_loss 0.1498 elapsed 103.9s


Epoch 6: tr_loss 0.1485 val_loss 0.1759 val_qwk(def) 0.8741 epoch_time 124.8s total_elapsed 744.3s


Epoch 7 it 50/329 tr_loss 0.1331 elapsed 17.1s


Epoch 7 it 100/329 tr_loss 0.1502 elapsed 34.6s


Epoch 7 it 150/329 tr_loss 0.1464 elapsed 51.9s


Epoch 7 it 200/329 tr_loss 0.1382 elapsed 69.2s


Epoch 7 it 250/329 tr_loss 0.1435 elapsed 86.4s


Epoch 7 it 300/329 tr_loss 0.1440 elapsed 103.7s


Epoch 7: tr_loss 0.1440 val_loss 0.2153 val_qwk(def) 0.8535 epoch_time 124.9s total_elapsed 869.2s


Epoch 8 it 50/329 tr_loss 0.1412 elapsed 17.2s


Epoch 8 it 100/329 tr_loss 0.1300 elapsed 34.4s


Epoch 8 it 150/329 tr_loss 0.1279 elapsed 51.5s


Epoch 8 it 200/329 tr_loss 0.1273 elapsed 68.7s


Epoch 8 it 250/329 tr_loss 0.1257 elapsed 86.2s


Epoch 8 it 300/329 tr_loss 0.1252 elapsed 103.3s


Epoch 8: tr_loss 0.1257 val_loss 0.1671 val_qwk(def) 0.8663 epoch_time 124.0s total_elapsed 993.3s


Epoch 9 it 50/329 tr_loss 0.1151 elapsed 17.2s


Epoch 9 it 100/329 tr_loss 0.1166 elapsed 34.4s


Epoch 9 it 150/329 tr_loss 0.1077 elapsed 51.7s


Epoch 9 it 200/329 tr_loss 0.1095 elapsed 68.8s


Epoch 9 it 250/329 tr_loss 0.1112 elapsed 86.3s


Epoch 9 it 300/329 tr_loss 0.1124 elapsed 103.5s


Epoch 9: tr_loss 0.1127 val_loss 0.2112 val_qwk(def) 0.8071 epoch_time 124.4s total_elapsed 1117.7s


Epoch 10 it 50/329 tr_loss 0.0975 elapsed 17.2s


Epoch 10 it 100/329 tr_loss 0.0918 elapsed 34.5s


Epoch 10 it 150/329 tr_loss 0.0963 elapsed 52.0s


Epoch 10 it 200/329 tr_loss 0.0933 elapsed 69.2s


Epoch 10 it 250/329 tr_loss 0.0976 elapsed 86.4s


Epoch 10 it 300/329 tr_loss 0.1000 elapsed 103.9s


Epoch 10: tr_loss 0.0993 val_loss 0.1864 val_qwk(def) 0.8215 epoch_time 124.6s total_elapsed 1242.3s


Epoch 11 it 50/329 tr_loss 0.0738 elapsed 17.1s


Epoch 11 it 100/329 tr_loss 0.0833 elapsed 34.3s


Epoch 11 it 150/329 tr_loss 0.0843 elapsed 51.7s


Epoch 11 it 200/329 tr_loss 0.0858 elapsed 68.9s


Epoch 11 it 250/329 tr_loss 0.0870 elapsed 86.0s


Epoch 11 it 300/329 tr_loss 0.0900 elapsed 103.1s


Epoch 11: tr_loss 0.0906 val_loss 0.1747 val_qwk(def) 0.8596 epoch_time 123.9s total_elapsed 1366.2s


Early stopping triggered


[SEResNeXt50@512] Fold 1 val QWK (default th): 0.8663


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 2 =====


Epoch 1 it 50/329 tr_loss 0.9190 elapsed 17.1s


Epoch 1 it 100/329 tr_loss 0.8260 elapsed 34.2s


Epoch 1 it 150/329 tr_loss 0.7267 elapsed 51.3s


Epoch 1 it 200/329 tr_loss 0.6508 elapsed 68.5s


Epoch 1 it 250/329 tr_loss 0.5831 elapsed 85.6s


Epoch 1 it 300/329 tr_loss 0.5344 elapsed 102.7s


Epoch 1: tr_loss 0.5122 val_loss 0.2299 val_qwk(def) 0.7763 epoch_time 123.3s total_elapsed 123.3s


Epoch 2 it 50/329 tr_loss 0.2375 elapsed 17.1s


Epoch 2 it 100/329 tr_loss 0.2564 elapsed 34.4s


Epoch 2 it 150/329 tr_loss 0.2529 elapsed 51.6s


Epoch 2 it 200/329 tr_loss 0.2604 elapsed 68.7s


Epoch 2 it 250/329 tr_loss 0.2621 elapsed 86.0s


Epoch 2 it 300/329 tr_loss 0.2675 elapsed 103.2s


Epoch 2: tr_loss 0.2641 val_loss 0.2207 val_qwk(def) 0.8253 epoch_time 123.7s total_elapsed 247.1s


Epoch 3 it 50/329 tr_loss 0.2091 elapsed 17.1s


Epoch 3 it 100/329 tr_loss 0.2234 elapsed 34.3s


Epoch 3 it 150/329 tr_loss 0.2236 elapsed 51.3s


Epoch 3 it 200/329 tr_loss 0.2162 elapsed 68.6s


Epoch 3 it 250/329 tr_loss 0.2136 elapsed 85.8s


Epoch 3 it 300/329 tr_loss 0.2128 elapsed 103.3s


Epoch 3: tr_loss 0.2131 val_loss 0.1805 val_qwk(def) 0.8267 epoch_time 123.9s total_elapsed 371.0s


Epoch 4 it 50/329 tr_loss 0.1920 elapsed 17.2s


Epoch 4 it 100/329 tr_loss 0.1895 elapsed 34.4s


Epoch 4 it 150/329 tr_loss 0.1785 elapsed 51.6s


Epoch 4 it 200/329 tr_loss 0.1838 elapsed 68.7s


Epoch 4 it 250/329 tr_loss 0.1846 elapsed 85.8s


Epoch 4 it 300/329 tr_loss 0.1806 elapsed 102.9s


Epoch 4: tr_loss 0.1835 val_loss 0.1739 val_qwk(def) 0.8611 epoch_time 123.5s total_elapsed 494.6s


Epoch 5 it 50/329 tr_loss 0.1556 elapsed 17.4s


Epoch 5 it 100/329 tr_loss 0.1437 elapsed 34.6s


Epoch 5 it 150/329 tr_loss 0.1514 elapsed 51.8s


Epoch 5 it 200/329 tr_loss 0.1640 elapsed 69.3s


Epoch 5 it 250/329 tr_loss 0.1686 elapsed 86.4s


Epoch 5 it 300/329 tr_loss 0.1734 elapsed 103.6s


Epoch 5: tr_loss 0.1730 val_loss 0.1899 val_qwk(def) 0.8415 epoch_time 124.2s total_elapsed 618.8s


Epoch 6 it 50/329 tr_loss 0.1536 elapsed 17.2s


Epoch 6 it 100/329 tr_loss 0.1528 elapsed 34.4s


Epoch 6 it 150/329 tr_loss 0.1545 elapsed 51.4s


Epoch 6 it 200/329 tr_loss 0.1530 elapsed 68.5s


Epoch 6 it 250/329 tr_loss 0.1502 elapsed 85.7s


Epoch 6 it 300/329 tr_loss 0.1538 elapsed 102.9s


Epoch 6: tr_loss 0.1544 val_loss 0.1913 val_qwk(def) 0.8135 epoch_time 123.5s total_elapsed 742.3s


Epoch 7 it 50/329 tr_loss 0.1300 elapsed 17.2s


Epoch 7 it 100/329 tr_loss 0.1373 elapsed 34.7s


Epoch 7 it 150/329 tr_loss 0.1371 elapsed 51.8s


Epoch 7 it 200/329 tr_loss 0.1394 elapsed 69.0s


Epoch 7 it 250/329 tr_loss 0.1401 elapsed 86.4s


Epoch 7 it 300/329 tr_loss 0.1435 elapsed 103.5s


Epoch 7: tr_loss 0.1448 val_loss 0.1688 val_qwk(def) 0.8751 epoch_time 124.2s total_elapsed 866.5s


Epoch 8 it 50/329 tr_loss 0.1307 elapsed 17.2s


Epoch 8 it 100/329 tr_loss 0.1240 elapsed 34.3s


Epoch 8 it 150/329 tr_loss 0.1213 elapsed 51.4s


Epoch 8 it 200/329 tr_loss 0.1233 elapsed 68.5s


Epoch 8 it 250/329 tr_loss 0.1256 elapsed 85.5s


Epoch 8 it 300/329 tr_loss 0.1249 elapsed 103.0s


Epoch 8: tr_loss 0.1266 val_loss 0.2319 val_qwk(def) 0.8468 epoch_time 123.6s total_elapsed 990.2s


Epoch 9 it 50/329 tr_loss 0.1294 elapsed 17.1s


Epoch 9 it 100/329 tr_loss 0.1164 elapsed 34.5s


Epoch 9 it 150/329 tr_loss 0.1135 elapsed 51.9s


Epoch 9 it 200/329 tr_loss 0.1068 elapsed 69.3s


Epoch 9 it 250/329 tr_loss 0.1108 elapsed 86.5s


Epoch 9 it 300/329 tr_loss 0.1146 elapsed 104.1s


Epoch 9: tr_loss 0.1140 val_loss 0.2094 val_qwk(def) 0.8026 epoch_time 125.1s total_elapsed 1115.3s


Epoch 10 it 50/329 tr_loss 0.0990 elapsed 17.3s


Epoch 10 it 100/329 tr_loss 0.0992 elapsed 34.6s


Epoch 10 it 150/329 tr_loss 0.0932 elapsed 51.9s


Epoch 10 it 200/329 tr_loss 0.1060 elapsed 69.1s


Epoch 10 it 250/329 tr_loss 0.1080 elapsed 86.3s


Epoch 10 it 300/329 tr_loss 0.1068 elapsed 103.5s


Epoch 10: tr_loss 0.1075 val_loss 0.1831 val_qwk(def) 0.8443 epoch_time 124.4s total_elapsed 1239.7s


Early stopping triggered


[SEResNeXt50@512] Fold 2 val QWK (default th): 0.8751


Saved test_reg_preds_serx50_512_rrc_ema.npy


NNLS weights (capped) for ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema'] : [0.10003348 0.16706826 0.32235549 0.15777191 0.25277085]


Blend OOF QWK (opt on mask): 0.88544 th: [0.5374432  1.53759037 2.41192657 3.05278265]


Bootstrap 30/180


Bootstrap 60/180


Bootstrap 90/180


Bootstrap 120/180


Bootstrap 150/180


Bootstrap 180/180


Bootstrapped thresholds (n=180): [0.53542452 1.53205458 2.44638209 3.23597427]


Final submission.csv written with SEResNeXt50 included.


In [19]:
# SEResNeXt50_32x4d second seed @512 (3 folds) with RRC(scale=(0.92,1.0)) + EMA, then refit NNLS (caps [0.05,0.7]) and bootstrap thresholds (n=180)
import os, numpy as np, pandas as pd, torch
from torch.utils.data import DataLoader

print('Starting SEResNeXt50_32x4d@512 second seed (3 folds) RRC+EMA...', flush=True)
seed_everything(2026)

# 1) Train SEResNeXt50 second seed (3 folds)
models_serx_s2, oof_serx_s2, oof_t = run_cv_serx_ema(folds_df, folds_to_run=(0,1,2), epochs=15, batch_size=8)
np.save('oof_preds_serx50_512_rrc_ema_seed2026.npy', oof_serx_s2); np.save('oof_targets_serx50_512_rrc_ema_seed2026.npy', oof_t)

# 2) Inference for SEResNeXt50 second seed (hflip-only TTA) on test
dte_serx2 = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_serx)
dl_te_serx2 = DataLoader(dte_serx2, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
preds_serx2 = hflip_tta_predict_mean(models_serx_s2, dl_te_serx2)
np.save('test_reg_preds_serx50_512_rrc_ema_seed2026.npy', preds_serx2)
print('Saved test_reg_preds_serx50_512_rrc_ema_seed2026.npy', flush=True)

# 3) Refit NNLS on OOF intersection including both SEResNeXt seeds with caps [0.05,0.7]
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
}
loaded_keys = []; loaded_arrs = []; mask = np.ones_like(oof_t, dtype=bool)
for k,p in paths_oof.items():
    if os.path.exists(p):
        arr = np.load(p); loaded_keys.append(k); loaded_arrs.append(arr); mask &= np.isfinite(arr)
    else:
        print('Missing OOF for', k, '->', p, flush=True)
weights = nnls_blend_weights_capped(oof_t, loaded_arrs, mask=mask, lo=0.05, hi=0.7)
print('NNLS weights (capped) for', loaded_keys, ':', weights, flush=True)

# 4) Create blended OOF, bootstrap thresholds (n=180)
blend_oof = np.zeros_like(oof_t, dtype=float)
for w, a in zip(weights, loaded_arrs):
    a_fill = np.where(np.isfinite(a), a, 0.0); blend_oof += w * a_fill
th_opt = optimize_thresholds_generic(oof_t[mask], blend_oof[mask])
oof_q_opt = qwk_func(oof_t[mask], preds_to_classes_blend(blend_oof[mask], th_opt))
print('Blend OOF QWK (opt on mask):', f'{oof_q_opt:.5f}', 'th:', th_opt, flush=True)
th_boot = bootstrap_thresholds_generic(oof_t[mask], blend_oof[mask], n_bootstrap=180, base=[0.5,1.5,2.5,3.5])
print('Bootstrapped thresholds (n=180):', th_boot, flush=True)
np.save('blend_weights_keys.npy', np.array(loaded_keys, dtype=object)); np.save('blend_weights_vals.npy', weights); np.save('thresholds_blend_boot.npy', th_boot)

# 5) Finalize standardized hflip-only test blend across all available keys (including s2) and write submission
pred_map = {}
for k in loaded_keys:
    if k == 'b4_512' and os.path.exists('test_reg_preds_b4_hflip.npy'): pred_map[k] = np.load('test_reg_preds_b4_hflip.npy')
    elif k == 'b5_512' and os.path.exists('test_reg_preds_b5_hflip.npy'): pred_map[k] = np.load('test_reg_preds_b5_hflip.npy')
    elif k == 'b5_512_rrcema' and os.path.exists('test_reg_preds_b5_seed2025_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_b5_seed2025_rrc_ema.npy')
    elif k == 'b4_640' and os.path.exists('test_reg_preds_b4_640_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_b4_640_rrc_ema.npy')
    elif k == 'serx50_512_rrcema' and os.path.exists('test_reg_preds_serx50_512_rrc_ema.npy'): pred_map[k] = np.load('test_reg_preds_serx50_512_rrc_ema.npy')
    elif k == 'serx50_512_rrcema_s2' and os.path.exists('test_reg_preds_serx50_512_rrc_ema_seed2026.npy'): pred_map[k] = np.load('test_reg_preds_serx50_512_rrc_ema_seed2026.npy')
used_keys = [k for k in loaded_keys if k in pred_map]
mats = [pred_map[k] for k in used_keys]
if len(mats) == 0:
    print('No test preds available to blend.', flush=True)
else:
    w_use = weights[:len(mats)]
    if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
    w_use = w_use / w_use.sum()
    blend_test = np.zeros_like(mats[0], dtype=float)
    for wi, arr in zip(w_use, mats): blend_test += wi * arr
    cls = preds_to_classes_blend(blend_test, th_boot).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('test_reg_preds_blend_final.npy', blend_test)
    print('Final submission.csv written with SEResNeXt50 second seed included.', flush=True)

Starting SEResNeXt50_32x4d@512 second seed (3 folds) RRC+EMA...


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 0 =====


Epoch 1 it 50/329 tr_loss 0.7859 elapsed 17.5s


Epoch 1 it 100/329 tr_loss 0.7754 elapsed 35.2s


Epoch 1 it 150/329 tr_loss 0.7008 elapsed 52.3s


Epoch 1 it 200/329 tr_loss 0.6383 elapsed 69.4s


Epoch 1 it 250/329 tr_loss 0.5724 elapsed 86.6s


Epoch 1 it 300/329 tr_loss 0.5307 elapsed 103.9s


Epoch 1: tr_loss 0.5112 val_loss 0.2468 val_qwk(def) 0.7891 epoch_time 124.6s total_elapsed 124.6s


Epoch 2 it 50/329 tr_loss 0.2708 elapsed 17.1s


Epoch 2 it 100/329 tr_loss 0.2604 elapsed 34.4s


Epoch 2 it 150/329 tr_loss 0.2579 elapsed 51.5s


Epoch 2 it 200/329 tr_loss 0.2526 elapsed 68.6s


Epoch 2 it 250/329 tr_loss 0.2548 elapsed 86.1s


Epoch 2 it 300/329 tr_loss 0.2535 elapsed 103.3s


Epoch 2: tr_loss 0.2508 val_loss 0.1856 val_qwk(def) 0.8386 epoch_time 123.9s total_elapsed 248.5s


Epoch 3 it 50/329 tr_loss 0.2051 elapsed 17.2s


Epoch 3 it 100/329 tr_loss 0.2189 elapsed 34.3s


Epoch 3 it 150/329 tr_loss 0.2225 elapsed 51.5s


Epoch 3 it 200/329 tr_loss 0.2192 elapsed 68.9s


Epoch 3 it 250/329 tr_loss 0.2113 elapsed 86.1s


Epoch 3 it 300/329 tr_loss 0.2153 elapsed 103.6s


Epoch 3: tr_loss 0.2138 val_loss 0.2020 val_qwk(def) 0.8182 epoch_time 124.2s total_elapsed 372.8s


Epoch 4 it 50/329 tr_loss 0.1769 elapsed 17.0s


Epoch 4 it 100/329 tr_loss 0.1862 elapsed 34.1s


Epoch 4 it 150/329 tr_loss 0.1942 elapsed 51.3s


Epoch 4 it 200/329 tr_loss 0.1947 elapsed 68.5s


Epoch 4 it 250/329 tr_loss 0.1868 elapsed 85.6s


Epoch 4 it 300/329 tr_loss 0.1827 elapsed 103.1s


Epoch 4: tr_loss 0.1831 val_loss 0.2049 val_qwk(def) 0.8514 epoch_time 123.7s total_elapsed 496.5s


Epoch 5 it 50/329 tr_loss 0.1594 elapsed 17.1s


Epoch 5 it 100/329 tr_loss 0.1547 elapsed 34.3s


Epoch 5 it 150/329 tr_loss 0.1699 elapsed 51.5s


Epoch 5 it 200/329 tr_loss 0.1640 elapsed 68.9s


Epoch 5 it 250/329 tr_loss 0.1626 elapsed 86.0s


Epoch 5 it 300/329 tr_loss 0.1659 elapsed 103.3s


Epoch 5: tr_loss 0.1646 val_loss 0.1820 val_qwk(def) 0.8314 epoch_time 124.0s total_elapsed 620.5s


Epoch 6 it 50/329 tr_loss 0.0977 elapsed 17.2s


Epoch 6 it 100/329 tr_loss 0.1033 elapsed 34.4s


Epoch 6 it 150/329 tr_loss 0.1211 elapsed 51.6s


Epoch 6 it 200/329 tr_loss 0.1260 elapsed 69.1s


Epoch 6 it 250/329 tr_loss 0.1301 elapsed 86.2s


Epoch 6 it 300/329 tr_loss 0.1359 elapsed 103.4s


Epoch 6: tr_loss 0.1410 val_loss 0.1751 val_qwk(def) 0.8607 epoch_time 124.0s total_elapsed 744.6s


Epoch 7 it 50/329 tr_loss 0.1264 elapsed 17.1s


Epoch 7 it 100/329 tr_loss 0.1483 elapsed 34.5s


Epoch 7 it 150/329 tr_loss 0.1449 elapsed 51.5s


Epoch 7 it 200/329 tr_loss 0.1420 elapsed 68.7s


Epoch 7 it 250/329 tr_loss 0.1383 elapsed 85.8s


Epoch 7 it 300/329 tr_loss 0.1397 elapsed 102.9s


Epoch 7: tr_loss 0.1411 val_loss 0.1692 val_qwk(def) 0.8747 epoch_time 123.5s total_elapsed 868.1s


Epoch 8 it 50/329 tr_loss 0.1217 elapsed 17.4s


Epoch 8 it 100/329 tr_loss 0.1245 elapsed 34.6s


Epoch 8 it 150/329 tr_loss 0.1230 elapsed 52.2s


Epoch 8 it 200/329 tr_loss 0.1230 elapsed 69.3s


Epoch 8 it 250/329 tr_loss 0.1219 elapsed 86.4s


Epoch 8 it 300/329 tr_loss 0.1234 elapsed 103.8s


Epoch 8: tr_loss 0.1240 val_loss 0.1922 val_qwk(def) 0.8239 epoch_time 124.5s total_elapsed 992.6s


Epoch 9 it 50/329 tr_loss 0.1175 elapsed 17.1s


Epoch 9 it 100/329 tr_loss 0.1158 elapsed 34.3s


Epoch 9 it 150/329 tr_loss 0.1196 elapsed 51.3s


Epoch 9 it 200/329 tr_loss 0.1183 elapsed 68.5s


Epoch 9 it 250/329 tr_loss 0.1182 elapsed 86.0s


Epoch 9 it 300/329 tr_loss 0.1182 elapsed 103.1s


Epoch 9: tr_loss 0.1198 val_loss 0.2075 val_qwk(def) 0.8633 epoch_time 123.7s total_elapsed 1116.3s


Epoch 10 it 50/329 tr_loss 0.1041 elapsed 17.2s


Epoch 10 it 100/329 tr_loss 0.1061 elapsed 34.7s


Epoch 10 it 150/329 tr_loss 0.1023 elapsed 51.8s


Epoch 10 it 200/329 tr_loss 0.1004 elapsed 69.0s


Epoch 10 it 250/329 tr_loss 0.0994 elapsed 86.3s


Epoch 10 it 300/329 tr_loss 0.1002 elapsed 103.5s


Epoch 10: tr_loss 0.1022 val_loss 0.1802 val_qwk(def) 0.8538 epoch_time 124.2s total_elapsed 1240.4s


Early stopping triggered


[SEResNeXt50@512] Fold 0 val QWK (default th): 0.8747


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 1 =====


Epoch 1 it 50/329 tr_loss 0.9218 elapsed 17.2s


Epoch 1 it 100/329 tr_loss 0.8024 elapsed 34.3s


Epoch 1 it 150/329 tr_loss 0.7117 elapsed 51.5s


Epoch 1 it 200/329 tr_loss 0.6187 elapsed 68.7s


Epoch 1 it 250/329 tr_loss 0.5593 elapsed 85.9s


Epoch 1 it 300/329 tr_loss 0.5075 elapsed 103.1s


Epoch 1: tr_loss 0.4860 val_loss 0.2877 val_qwk(def) 0.6920 epoch_time 123.7s total_elapsed 123.7s


Epoch 2 it 50/329 tr_loss 0.2860 elapsed 17.1s


Epoch 2 it 100/329 tr_loss 0.2924 elapsed 34.5s


Epoch 2 it 150/329 tr_loss 0.2696 elapsed 51.5s


Epoch 2 it 200/329 tr_loss 0.2721 elapsed 68.7s


Epoch 2 it 250/329 tr_loss 0.2648 elapsed 86.1s


Epoch 2 it 300/329 tr_loss 0.2591 elapsed 103.3s


Epoch 2: tr_loss 0.2556 val_loss 0.2279 val_qwk(def) 0.8052 epoch_time 124.0s total_elapsed 247.7s


Epoch 3 it 50/329 tr_loss 0.1932 elapsed 17.1s


Epoch 3 it 100/329 tr_loss 0.2026 elapsed 34.1s


Epoch 3 it 150/329 tr_loss 0.2141 elapsed 51.3s


Epoch 3 it 200/329 tr_loss 0.2168 elapsed 68.8s


Epoch 3 it 250/329 tr_loss 0.2062 elapsed 85.9s


Epoch 3 it 300/329 tr_loss 0.2138 elapsed 103.0s


Epoch 3: tr_loss 0.2167 val_loss 0.1960 val_qwk(def) 0.8206 epoch_time 123.9s total_elapsed 371.7s


Epoch 4 it 50/329 tr_loss 0.1886 elapsed 17.1s


Epoch 4 it 100/329 tr_loss 0.1832 elapsed 34.3s


Epoch 4 it 150/329 tr_loss 0.1989 elapsed 51.6s


Epoch 4 it 200/329 tr_loss 0.1960 elapsed 68.8s


Epoch 4 it 250/329 tr_loss 0.1987 elapsed 85.9s


Epoch 4 it 300/329 tr_loss 0.1914 elapsed 103.0s


Epoch 4: tr_loss 0.1928 val_loss 0.1986 val_qwk(def) 0.8571 epoch_time 123.5s total_elapsed 495.2s


Epoch 5 it 50/329 tr_loss 0.1596 elapsed 17.1s


Epoch 5 it 100/329 tr_loss 0.1506 elapsed 34.5s


Epoch 5 it 150/329 tr_loss 0.1531 elapsed 51.7s


Epoch 5 it 200/329 tr_loss 0.1503 elapsed 68.9s


Epoch 5 it 250/329 tr_loss 0.1609 elapsed 86.1s


Epoch 5 it 300/329 tr_loss 0.1619 elapsed 103.2s


Epoch 5: tr_loss 0.1625 val_loss 0.2063 val_qwk(def) 0.8513 epoch_time 124.2s total_elapsed 619.4s


Epoch 6 it 50/329 tr_loss 0.1438 elapsed 17.1s


Epoch 6 it 100/329 tr_loss 0.1423 elapsed 34.4s


Epoch 6 it 150/329 tr_loss 0.1448 elapsed 51.4s


Epoch 6 it 200/329 tr_loss 0.1451 elapsed 68.5s


Epoch 6 it 250/329 tr_loss 0.1477 elapsed 85.6s


Epoch 6 it 300/329 tr_loss 0.1500 elapsed 102.8s


Epoch 6: tr_loss 0.1470 val_loss 0.1822 val_qwk(def) 0.8547 epoch_time 123.5s total_elapsed 742.8s


Epoch 7 it 50/329 tr_loss 0.1358 elapsed 17.1s


Epoch 7 it 100/329 tr_loss 0.1376 elapsed 34.5s


Epoch 7 it 150/329 tr_loss 0.1351 elapsed 51.6s


Epoch 7 it 200/329 tr_loss 0.1357 elapsed 68.8s


Epoch 7 it 250/329 tr_loss 0.1345 elapsed 85.9s


Epoch 7 it 300/329 tr_loss 0.1337 elapsed 103.0s


Epoch 7: tr_loss 0.1349 val_loss 0.1712 val_qwk(def) 0.8485 epoch_time 123.6s total_elapsed 866.4s


Epoch 8 it 50/329 tr_loss 0.1207 elapsed 17.1s


Epoch 8 it 100/329 tr_loss 0.1181 elapsed 34.5s


Epoch 8 it 150/329 tr_loss 0.1235 elapsed 51.6s


Epoch 8 it 200/329 tr_loss 0.1233 elapsed 68.8s


Epoch 8 it 250/329 tr_loss 0.1296 elapsed 86.3s


Epoch 8 it 300/329 tr_loss 0.1294 elapsed 103.4s


Epoch 8: tr_loss 0.1273 val_loss 0.2179 val_qwk(def) 0.7929 epoch_time 123.9s total_elapsed 990.4s


Epoch 9 it 50/329 tr_loss 0.1212 elapsed 17.2s


Epoch 9 it 100/329 tr_loss 0.1198 elapsed 34.6s


Epoch 9 it 150/329 tr_loss 0.1136 elapsed 51.7s


Epoch 9 it 200/329 tr_loss 0.1120 elapsed 68.9s


Epoch 9 it 250/329 tr_loss 0.1119 elapsed 86.3s


Epoch 9 it 300/329 tr_loss 0.1136 elapsed 103.4s


Epoch 9: tr_loss 0.1114 val_loss 0.1831 val_qwk(def) 0.8563 epoch_time 124.1s total_elapsed 1114.5s


Epoch 10 it 50/329 tr_loss 0.0858 elapsed 17.2s


Epoch 10 it 100/329 tr_loss 0.0875 elapsed 34.3s


Epoch 10 it 150/329 tr_loss 0.0931 elapsed 51.4s


Epoch 10 it 200/329 tr_loss 0.0972 elapsed 68.6s


Epoch 10 it 250/329 tr_loss 0.0934 elapsed 85.8s


Epoch 10 it 300/329 tr_loss 0.0904 elapsed 102.9s


Epoch 10: tr_loss 0.0913 val_loss 0.2227 val_qwk(def) 0.8448 epoch_time 123.5s total_elapsed 1238.0s


Early stopping triggered


[SEResNeXt50@512] Fold 1 val QWK (default th): 0.8485


Attempt 1: trying batch_size=8



===== [SEResNeXt50@512 RRC+EMA] Fold 2 =====


Epoch 1 it 50/329 tr_loss 0.9156 elapsed 17.1s


Epoch 1 it 100/329 tr_loss 0.8361 elapsed 34.3s


Epoch 1 it 150/329 tr_loss 0.7379 elapsed 51.5s


Epoch 1 it 200/329 tr_loss 0.6468 elapsed 68.6s


Epoch 1 it 250/329 tr_loss 0.5935 elapsed 85.8s


Epoch 1 it 300/329 tr_loss 0.5384 elapsed 102.9s


Epoch 1: tr_loss 0.5159 val_loss 0.2692 val_qwk(def) 0.7245 epoch_time 123.6s total_elapsed 123.6s


Epoch 2 it 50/329 tr_loss 0.2836 elapsed 17.3s


Epoch 2 it 100/329 tr_loss 0.2627 elapsed 34.7s


Epoch 2 it 150/329 tr_loss 0.2575 elapsed 51.9s


Epoch 2 it 200/329 tr_loss 0.2618 elapsed 69.0s


Epoch 2 it 250/329 tr_loss 0.2587 elapsed 86.5s


Epoch 2 it 300/329 tr_loss 0.2606 elapsed 103.6s


Epoch 2: tr_loss 0.2582 val_loss 0.1950 val_qwk(def) 0.8044 epoch_time 124.4s total_elapsed 248.0s


Epoch 3 it 50/329 tr_loss 0.2146 elapsed 17.0s


Epoch 3 it 100/329 tr_loss 0.2140 elapsed 34.2s


Epoch 3 it 150/329 tr_loss 0.2107 elapsed 51.6s


Epoch 3 it 200/329 tr_loss 0.2258 elapsed 69.0s


Epoch 3 it 250/329 tr_loss 0.2296 elapsed 86.1s


Epoch 3 it 300/329 tr_loss 0.2272 elapsed 103.3s


Epoch 3: tr_loss 0.2264 val_loss 0.2296 val_qwk(def) 0.7598 epoch_time 124.3s total_elapsed 372.4s


Epoch 4 it 50/329 tr_loss 0.1384 elapsed 17.2s


Epoch 4 it 100/329 tr_loss 0.1692 elapsed 34.4s


Epoch 4 it 150/329 tr_loss 0.1849 elapsed 51.5s


Epoch 4 it 200/329 tr_loss 0.1863 elapsed 68.6s


Epoch 4 it 250/329 tr_loss 0.1881 elapsed 85.8s


Epoch 4 it 300/329 tr_loss 0.1847 elapsed 103.2s


Epoch 4: tr_loss 0.1878 val_loss 0.2171 val_qwk(def) 0.8473 epoch_time 124.0s total_elapsed 496.3s


Epoch 5 it 50/329 tr_loss 0.1585 elapsed 17.2s


Epoch 5 it 100/329 tr_loss 0.1738 elapsed 34.4s


Epoch 5 it 150/329 tr_loss 0.1671 elapsed 51.7s


Epoch 5 it 200/329 tr_loss 0.1661 elapsed 68.7s


Epoch 5 it 250/329 tr_loss 0.1680 elapsed 85.9s


Epoch 5 it 300/329 tr_loss 0.1699 elapsed 103.1s


Epoch 5: tr_loss 0.1677 val_loss 0.1853 val_qwk(def) 0.8121 epoch_time 124.1s total_elapsed 620.5s


Epoch 6 it 50/329 tr_loss 0.1516 elapsed 17.2s


Epoch 6 it 100/329 tr_loss 0.1561 elapsed 34.3s


Epoch 6 it 150/329 tr_loss 0.1539 elapsed 51.9s


Epoch 6 it 200/329 tr_loss 0.1507 elapsed 69.1s


Epoch 6 it 250/329 tr_loss 0.1454 elapsed 86.2s


Epoch 6 it 300/329 tr_loss 0.1482 elapsed 103.6s


Epoch 6: tr_loss 0.1474 val_loss 0.1570 val_qwk(def) 0.8760 epoch_time 124.2s total_elapsed 744.7s


Epoch 7 it 50/329 tr_loss 0.1513 elapsed 17.1s


Epoch 7 it 100/329 tr_loss 0.1307 elapsed 34.4s


Epoch 7 it 150/329 tr_loss 0.1326 elapsed 51.6s


Epoch 7 it 200/329 tr_loss 0.1323 elapsed 68.8s


Epoch 7 it 250/329 tr_loss 0.1244 elapsed 85.9s


Epoch 7 it 300/329 tr_loss 0.1265 elapsed 103.1s


Epoch 7: tr_loss 0.1294 val_loss 0.1690 val_qwk(def) 0.8641 epoch_time 123.7s total_elapsed 868.4s


Epoch 8 it 50/329 tr_loss 0.1040 elapsed 17.5s


Epoch 8 it 100/329 tr_loss 0.1323 elapsed 34.6s


Epoch 8 it 150/329 tr_loss 0.1259 elapsed 51.9s


Epoch 8 it 200/329 tr_loss 0.1286 elapsed 69.1s


Epoch 8 it 250/329 tr_loss 0.1348 elapsed 86.2s


Epoch 8 it 300/329 tr_loss 0.1294 elapsed 103.6s


Epoch 8: tr_loss 0.1313 val_loss 0.1888 val_qwk(def) 0.8323 epoch_time 124.2s total_elapsed 992.6s


Epoch 9 it 50/329 tr_loss 0.1085 elapsed 17.2s


Epoch 9 it 100/329 tr_loss 0.1118 elapsed 34.4s


Epoch 9 it 150/329 tr_loss 0.1062 elapsed 51.5s


Epoch 9 it 200/329 tr_loss 0.1089 elapsed 68.5s


Epoch 9 it 250/329 tr_loss 0.1127 elapsed 85.6s


Epoch 9 it 300/329 tr_loss 0.1087 elapsed 103.1s


Epoch 9: tr_loss 0.1086 val_loss 0.1745 val_qwk(def) 0.8551 epoch_time 123.6s total_elapsed 1116.2s


Early stopping triggered


[SEResNeXt50@512] Fold 2 val QWK (default th): 0.8760


Saved test_reg_preds_serx50_512_rrc_ema_seed2026.npy


NNLS weights (capped) for ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2'] : [0.12077895 0.17904818 0.31066866 0.11245652 0.22942863 0.04761905]


Blend OOF QWK (opt on mask): 0.88151 th: [0.55382156 1.42203981 2.43150836 3.30617202]


Bootstrap 30/180


Bootstrap 60/180


Bootstrap 90/180


Bootstrap 120/180


Bootstrap 150/180


Bootstrap 180/180


Bootstrapped thresholds (n=180): [0.53216035 1.51247651 2.47348362 3.39321307]


Final submission.csv written with SEResNeXt50 second seed included.


In [21]:
# Level-2 Stacker (LightGBM multiclass) on OOF regression features; hflip-only test; bootstrap thresholds; write submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
import lightgbm as lgb

print('L2 stacking: building feature matrices from OOF/test preds...', flush=True)
oof_t = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
folds_df_l2 = pd.read_csv('folds.csv')

# Define model keys and corresponding file paths (OOF and Test)
keys = [
    ('b4_512', 'oof_preds_b4.npy', 'test_reg_preds_b4_hflip.npy'),
    ('b5_512', 'oof_preds.npy', 'test_reg_preds_b5_hflip.npy'),
    ('b5_512_rrcema', 'oof_preds_b5_seed2025_rrc_ema.npy', 'test_reg_preds_b5_seed2025_rrc_ema.npy'),
    ('b4_640', 'oof_preds_b4_640_rrc_ema.npy', 'test_reg_preds_b4_640_rrc_ema.npy'),
    ('serx50_512_rrcema', 'oof_preds_serx50_512_rrc_ema.npy', 'test_reg_preds_serx50_512_rrc_ema.npy'),
    ('serx50_512_rrcema_s2', 'oof_preds_serx50_512_rrc_ema_seed2026.npy', 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy'),
]

feat_oof_list = []
feat_te_list = []
used_keys = []
for name, oof_path, te_path in keys:
    if os.path.exists(oof_path) and os.path.exists(te_path):
        oof_arr = np.load(oof_path)
        te_arr = np.load(te_path)
        feat_oof_list.append(oof_arr.reshape(-1, 1))
        feat_te_list.append(te_arr.reshape(-1, 1))
        used_keys.append(name)
    else:
        print('Skipping key due to missing files:', name, '->', oof_path, te_path, flush=True)

if len(feat_oof_list) == 0:
    raise RuntimeError('No features available for stacking.')

X_oof_full = np.concatenate(feat_oof_list, axis=1).astype(np.float32)
X_te_full = np.concatenate(feat_te_list, axis=1).astype(np.float32)
mask = np.isfinite(X_oof_full).all(axis=1) & np.isfinite(oof_t)
print('Features used:', used_keys, 'X_oof shape:', X_oof_full.shape, 'mask sum:', int(mask.sum()), flush=True)

# L2 LightGBM classifier with 5 classes
params = dict(objective='multiclass', num_class=5, learning_rate=0.03, n_estimators=2000,
              num_leaves=63, max_depth=-1, feature_fraction=0.8, bagging_fraction=0.8, bagging_freq=1,
              reg_lambda=1.0, verbosity=-1)

X = X_oof_full[mask]
y = oof_t[mask].astype(int)
X_te = X_te_full.copy()

# 5-fold CV consistent with original folds
oof_proba = np.zeros((len(X_oof_full), 5), dtype=np.float32)
te_proba_acc = np.zeros((len(X_te), 5), dtype=np.float32)

for f in range(5):
    va_idx_all = (folds_df_l2['fold'].values == f)
    va_mask_idx = np.where(mask & va_idx_all)[0]
    tr_mask_idx = np.where(mask & (~va_idx_all))[0]
    if len(va_mask_idx) == 0 or len(tr_mask_idx) == 0:
        continue
    X_tr, y_tr = X_oof_full[tr_mask_idx], oof_t[tr_mask_idx].astype(int)
    X_va, y_va = X_oof_full[va_mask_idx], oof_t[va_mask_idx].astype(int)
    clf = lgb.LGBMClassifier(**params)
    # Remove unsupported verbose kwarg; use callbacks for early stopping and silent logging
    clf.fit(X_tr, y_tr, eval_set=[(X_va, y_va)], eval_metric='multi_logloss', callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)])
    oof_proba[va_mask_idx] = clf.predict_proba(X_va)
    te_proba_acc += clf.predict_proba(X_te)
    y_va_pred = np.argmax(oof_proba[va_mask_idx], axis=1)
    print(f'[L2] Fold {f} done; val size {len(va_mask_idx)}', flush=True)

te_proba = te_proba_acc / 5.0

# Convert probabilities to expected value scalar for thresholds
classes = np.arange(5, dtype=np.float32).reshape(1, -1)
oof_reg = (oof_proba * classes).sum(axis=1)
te_reg = (te_proba * classes).sum(axis=1)

def qwk_func(y_true, y_pred_cls):
    return cohen_kappa_score(y_true, y_pred_cls, weights='quadratic')

def preds_to_classes_blend(preds, thresholds):
    th0, th1, th2, th3 = thresholds
    return np.digitize(preds, bins=[th0, th1, th2, th3])

from scipy.optimize import minimize
def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -1.0, 4.0)
        cls = preds_to_classes_blend(p, th)
        return -qwk_func(y, cls)
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05: th[i] = th[i-1] + 0.05
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=200, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j] - med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

# Optimize thresholds on masked OOF expected values
th_opt = optimize_thresholds_generic(oof_t[mask], oof_reg[mask])
oof_q_opt = qwk_func(oof_t[mask], preds_to_classes_blend(oof_reg[mask], th_opt))
print('L2 OOF QWK (opt on mask):', f'{oof_q_opt:.5f}', 'th:', th_opt, flush=True)
th_boot = bootstrap_thresholds_generic(oof_t[mask], oof_reg[mask], n_bootstrap=200, base=[0.5,1.5,2.5,3.5])
print('L2 bootstrapped thresholds (n=200):', th_boot, flush=True)
np.save('l2_used_keys.npy', np.array(used_keys, dtype=object)); np.save('l2_oof_reg.npy', oof_reg); np.save('l2_te_reg.npy', te_reg); np.save('l2_thresholds_boot.npy', th_boot)

# Final test classes and submission
cls_final = preds_to_classes_blend(te_reg, th_boot).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls_final})
sub.to_csv('submission.csv', index=False)
print('L2 stacker submission.csv written.', flush=True)

L2 stacking: building feature matrices from OOF/test preds...


Features used: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2'] X_oof shape: (3295, 6) mask sum: 1977


Training until validation scores don't improve for 100 rounds


Early stopping, best iteration is:
[80]	valid_0's multi_logloss: 0.603583
[L2] Fold 0 done; val size 659


Training until validation scores don't improve for 100 rounds




Early stopping, best iteration is:
[99]	valid_0's multi_logloss: 0.621572
[L2] Fold 1 done; val size 659


Training until validation scores don't improve for 100 rounds




Early stopping, best iteration is:
[95]	valid_0's multi_logloss: 0.613289
[L2] Fold 2 done; val size 659


L2 OOF QWK (opt on mask): 0.87159 th: [0.52815568 1.64064813 2.34277524 3.30322924]




Bootstrap 30/200


Bootstrap 60/200


Bootstrap 90/200


Bootstrap 120/200


Bootstrap 150/200


Bootstrap 180/200


L2 bootstrapped thresholds (n=200): [0.53360907 1.59051364 2.33572592 3.20070596]


L2 stacker submission.csv written.


In [22]:
# Ordinal Head Fine-tuning for B5@512 (RRC+EMA) per expert recipe
import os, gc, copy, time, math, numpy as np, pandas as pd, torch, torch.nn as nn, albumentations as A, cv2
from torch.utils.data import DataLoader
from timm.utils import ModelEmaV2
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import nnls, minimize
from albumentations.pytorch import ToTensorV2

print('Starting B5@512 Ordinal head fine-tuning (5 folds)...', flush=True)

# Reuse globals: folds_df, IMG_SIZE, IMG_DIR_TRAIN, device, seed_everything, HF_CACHE_DIR
MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_ord = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.85, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.06), rotate=(-15, 15), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_ord = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

class OrdinalDRDataset(Dataset):
    def __init__(self, df: pd.DataFrame, img_dir: Path, transforms=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.transforms = transforms
        self.has_target = 'diagnosis' in df.columns
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_path = self.img_dir / f"{row['id_code']}.png"
        img = cv2.imread(str(img_path), cv2.IMREAD_COLOR)
        if img is None:
            img = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        if img.shape[0] != IMG_SIZE or img.shape[1] != IMG_SIZE:
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE), interpolation=cv2.INTER_AREA)
        if self.transforms:
            img = self.transforms(image=img)['image']
        if self.has_target:
            y = int(row['diagnosis'])
            return img, torch.tensor(y, dtype=torch.long)
        else:
            return img, row['id_code']

class OrdinalHeadModel(nn.Module):
    def __init__(self, backbone_name='tf_efficientnet_b5_ns', pretrained=True):
        super().__init__()
        self.backbone = timm.create_model(backbone_name, pretrained=pretrained, num_classes=0, global_pool='avg', cache_dir=str(HF_CACHE_DIR))
        try:
            if hasattr(self.backbone, 'set_grad_checkpointing'):
                self.backbone.set_grad_checkpointing(True)
        except Exception:
            pass
        in_ch = self.backbone.num_features
        self.dropout = nn.Dropout(0.25)
        self.head = nn.Linear(in_ch, 4)  # 4 ordinal logits
    def forward(self, x):
        feat = self.backbone(x)
        feat = self.dropout(feat)
        logits = self.head(feat)
        return logits  # raw logits

def make_targets_ordinal(y_long: torch.Tensor) -> torch.Tensor:
    y = y_long.float()
    t = torch.stack([(y >= 1.0), (y >= 2.0), (y >= 3.0), (y >= 4.0)], dim=1).float()
    return t

def get_pos_weight_from_train(df_all: pd.DataFrame) -> torch.Tensor:
    y = df_all['diagnosis'].values.astype(int)
    N = len(y)
    pos = []
    for k in range(4):
        s = (y >= (k+1)).sum()
        s = max(1, int(s))
        neg = N - s
        pos.append(neg / s)
    pw = torch.tensor(pos, dtype=torch.float32, device=device)
    return pw

def qwk_from_ev(y_true, ev, thresholds):
    y = np.asarray(y_true).astype(float)
    th0, th1, th2, th3 = thresholds
    cls = np.digitize(ev, bins=[th0, th1, th2, th3])
    return cohen_kappa_score(y, cls, weights='quadratic')

def optimize_thresholds_ev(y, ev, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(ev).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -0.5, 4.5)
        th0, th1, th2, th3 = th
        cls = np.digitize(p, bins=[th0, th1, th2, th3])
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1500, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.05:
            th[i] = th[i-1] + 0.05
    return th

def get_loaders_ord(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = OrdinalDRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_ord)
    dva = OrdinalDRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_ord)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_ord(model, dl, loss_fn):
    model.eval(); val_loss=0.0; n=0; all_logits=[]; all_y=[]
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            yb = yb.to(device, non_blocking=True)
            yt = make_targets_ordinal(yb)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                logits = model(xb)
                loss = loss_fn(logits, yt)
            bs = xb.size(0); val_loss += loss.item()*bs; n += bs
            all_logits.append(logits.detach().float().cpu().numpy()); all_y.append(yb.detach().long().cpu().numpy())
    logits = np.concatenate(all_logits); y = np.concatenate(all_y)
    probs = 1.0 / (1.0 + np.exp(-logits))
    ev = probs.sum(axis=1)
    return val_loss/max(1,n), logits, probs, ev, y

def hflip_tta_logits(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval()
                    l0 = m(xb); l1 = m(torch.flip(xb, dims=[-1]))
                    s += (l0 + l1) / 2.0; c += 1
                logits = s / max(1,c)
            preds.append(logits.detach().float().cpu().numpy())
    return np.concatenate(preds)

def train_fold_b5_ord(fold, folds_df, batch_size=8, ema_decay=0.9996, patience=2, epochs_head=1, epochs_unfreeze=4):
    print(f"\n===== [B5 Ordinal] Fold {fold} =====", flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_ord(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = OrdinalHeadModel('tf_efficientnet_b5_ns', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    # pos_weight from full train
    pos_weight = get_pos_weight_from_train(folds_df)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction='mean')
    scaler = torch.amp.GradScaler('cuda', enabled=True)

    def run_phase(opt, sch, max_epochs, freeze_backbone=False):
        best_loss = float('inf'); best_state=None; best_logits=None; best_probs=None; best_ev=None; best_y=None; no_imp=0
        if freeze_backbone:
            for p in model.backbone.parameters(): p.requires_grad_(False)
            model.backbone.eval()
        else:
            for p in model.backbone.parameters(): p.requires_grad_(True)
            model.backbone.train()
        for ep in range(1, max_epochs+1):
            model.train(); tr_loss=0.0; n=0; t0=time.time()
            for it, (xb, yb) in enumerate(dl_tr):
                xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
                yb = yb.to(device, non_blocking=True)
                yt = make_targets_ordinal(yb)
                with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                    logits = model(xb); loss = loss_fn(logits, yt)
                scaler.scale(loss).backward()
                scaler.step(opt); scaler.update(); opt.zero_grad(set_to_none=True);
                if sch is not None: sch.step()
                ema.update(model)
                bs = xb.size(0); tr_loss += loss.item()*bs; n += bs
                if (it+1) % 50 == 0:
                    print(f'  it {it+1}/{len(dl_tr)} tr_loss {tr_loss/max(1,n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
            vloss, vlogits, vprobs, vev, vy = validate_ord(model, dl_va, loss_fn)
            q_def = qwk_from_ev(vy, vev, [0.5,1.5,2.5,3.5])
            print(f'  Epoch {ep}: tr_loss {tr_loss/max(1,n):.4f} val_loss {vloss:.4f} val_qwk(def) {q_def:.4f}', flush=True)
            if vloss < best_loss:
                best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict());
                best_logits = vlogits.copy(); best_probs = vprobs.copy(); best_ev = vev.copy(); best_y = vy.copy(); no_imp = 0
            else:
                no_imp += 1
            if no_imp >= patience:
                print('  Early stopping.', flush=True); break
        ema.module.load_state_dict(best_state)
        model.load_state_dict(best_state, strict=False)
        return best_logits, best_probs, best_ev, best_y

    # Phase A: head-only
    head_params = list(model.head.parameters()) + list(model.dropout.parameters())
    optA = torch.optim.AdamW(head_params, lr=1e-3, weight_decay=1e-5)
    stepsA = max(1, epochs_head) * len(dl_tr)
    def lr_lambda_a(step):
        warm = len(dl_tr)
        if step < warm: return (step+1)/warm
        prog = (step - warm) / max(1, stepsA - warm)
        return 0.5*(1+math.cos(math.pi*prog))
    schA = torch.optim.lr_scheduler.LambdaLR(optA, lr_lambda_a)
    _ = run_phase(optA, schA, max_epochs=epochs_head, freeze_backbone=True)

    # Phase B: unfreeze all with small LR on backbone, higher on head
    params = [
        {'params': model.backbone.parameters(), 'lr': 3e-5},
        {'params': model.head.parameters(), 'lr': 1e-4},
        {'params': model.dropout.parameters(), 'lr': 1e-4},
    ]
    optB = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-5)
    stepsB = max(1, epochs_unfreeze) * len(dl_tr)
    def lr_lambda_b(step):
        warm = len(dl_tr)
        if step < warm: return (step+1)/warm
        prog = (step - warm) / max(1, stepsB - warm)
        return 0.5*(1+math.cos(math.pi*prog))
    schB = torch.optim.lr_scheduler.LambdaLR(optB, lr_lambda_b)
    vlogits, vprobs, vev, vy = run_phase(optB, schB, max_epochs=epochs_unfreeze, freeze_backbone=False)
    return ema.module, vlogits, vprobs, vev, vy

def run_cv_b5_ordinal(folds_df, batch_size=8):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); n_folds = folds_df['fold'].nunique()
    oof_ev = np.zeros(n, dtype=np.float32)
    oof_probs4 = np.zeros((n,4), dtype=np.float32)
    y_all = folds_df['diagnosis'].values.astype(int)
    models_ord = []
    for f in range(n_folds):
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        gc.collect()
        m, vlogits, vprobs, vev, vy = train_fold_b5_ord(f, folds_df, batch_size=batch_size, ema_decay=0.9996, patience=2, epochs_head=1, epochs_unfreeze=4)
        models_ord.append(m)
        va_idx = np.where(folds_df['fold'].values == f)[0]
        oof_ev[va_idx] = vev.astype(np.float32)
        oof_probs4[va_idx] = vprobs.astype(np.float32)
        q_def = qwk_from_ev(vy, vev, [0.5,1.5,2.5,3.5])
        print(f'[B5 Ordinal] Fold {f} val QWK(def): {q_def:.4f}', flush=True)
    th = optimize_thresholds_ev(y_all, oof_ev)
    oof_q = qwk_from_ev(y_all, oof_ev, th)
    print('B5 Ordinal OOF QWK (opt):', f'{oof_q:.5f}', 'th:', th, flush=True)
    np.save('oof_ev_b5_ordinal.npy', oof_ev); np.save('oof_probs4_b5_ordinal.npy', oof_probs4); np.save('oof_targets_b5_ordinal.npy', y_all); np.save('thresholds_b5_ordinal.npy', th)
    return models_ord, oof_ev, oof_probs4, y_all, th, oof_q

def infer_test_b5_ordinal(models_ord):
    dte = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_ord)
    dl_te = DataLoader(dte, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    logits = hflip_tta_logits(models_ord, dl_te)
    probs = 1.0 / (1.0 + np.exp(-logits))
    ev = probs.sum(axis=1).astype(np.float32)
    np.save('test_ev_b5_ordinal.npy', ev); np.save('test_probs4_b5_ordinal.npy', probs.astype(np.float32))
    print('Saved test_ev_b5_ordinal.npy and test_probs4_b5_ordinal.npy', flush=True)
    return ev, probs

def nnls_blend_ord_with_existing():
    print('Blending ordinal EV with existing regression EVs via NNLS...', flush=True)
    # Load OOF targets (use main folds targets)
    y = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
    # Load available OOF EV sources
    mats = []; keys = []
    def add_oof(path, key):
        if os.path.exists(path):
            arr = np.load(path); mats.append(arr.reshape(-1,1)); keys.append(key)
            return True
        return False
    add_oof('oof_preds_b4.npy', 'b4_512')
    add_oof('oof_preds.npy', 'b5_512')
    add_oof('oof_preds_b5_seed2025_rrc_ema.npy', 'b5_512_rrcema')
    add_oof('oof_preds_b4_640_rrc_ema.npy', 'b4_640')
    add_oof('oof_preds_serx50_512_rrc_ema.npy', 'serx50_512_rrcema')
    add_oof('oof_preds_serx50_512_rrc_ema_seed2026.npy', 'serx50_512_rrcema_s2')
    add_oof('oof_ev_b5_ordinal.npy', 'b5_512_ordinal')
    if len(mats) < 2:
        print('Not enough OOF sources to blend.')
        return None, None, None, None
    X = np.concatenate(mats, axis=1)
    mask = np.isfinite(X).all(axis=1) & np.isfinite(y)
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, 0.05, 0.8); w = w / w.sum()
    blend_oof = (X * w.reshape(1,-1)).sum(axis=1)
    th = optimize_thresholds_ev(y[mask], blend_oof[mask])
    oof_q = qwk_from_ev(y[mask], blend_oof[mask], th)
    print('NNLS blend (incl ordinal) OOF QWK:', f'{oof_q:.5f}', 'weights:', w, 'keys:', keys, flush=True)
    np.save('blend_weights_keys.npy', np.array(keys, dtype=object)); np.save('blend_weights_vals.npy', w);
    # Build test blend in same key order
    test_list = []
    test_map = {
        'b4_512': 'test_reg_preds_b4_hflip.npy',
        'b5_512': 'test_reg_preds_b5_hflip.npy',
        'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
        'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
        'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
        'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
        'b5_512_ordinal': 'test_ev_b5_ordinal.npy',
    }
    for k in keys:
        pth = test_map.get(k, None)
        if pth and os.path.exists(pth):
            test_list.append(np.load(pth).reshape(-1))
        else:
            print('Missing test preds for key', k, '->', pth, flush=True)
    if len(test_list) != len(keys):
        print('Warning: some test preds missing; re-normalizing weights over available.', flush=True)
    avail = min(len(test_list), len(w))
    w_use = w[:avail];
    if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
    w_use = w_use / w_use.sum()
    blend_test = np.zeros_like(test_list[0], dtype=float)
    for wi, arr in zip(w_use, test_list[:avail]):
        blend_test += wi * arr
    return blend_oof, blend_test, w, th

# Execute training and inference
seed_everything(42)
models_b5_ordinal, oof_ev_b5_ord, oof_probs4_b5_ord, y_all_b5_ord, th_b5_ord, oof_q_b5_ord = run_cv_b5_ordinal(folds_df, batch_size=8)
print('B5 Ordinal OOF done. OOF QWK:', oof_q_b5_ord, flush=True)
test_ev_b5_ord, test_probs4_b5_ord = infer_test_b5_ordinal(models_b5_ordinal)

# Blend with existing models via NNLS and write submission
blend_oof_ev, blend_test_ev, w_blend, th_blend = nnls_blend_ord_with_existing()
if blend_oof_ev is not None:
    # Bootstrap thresholds on blended OOF EV
    def bootstrap_thresholds_ev(y, p, n_bootstrap=180, base=[0.5,1.5,2.5,3.5]):
        N = len(y); cuts = []
        for i in range(n_bootstrap):
            idx = np.random.randint(0, N, size=N)
            th = optimize_thresholds_ev(y[idx], p[idx], init=base)
            cuts.append(th)
            if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
        cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
        for j in range(1,4):
            if med[j] - med[j-1] < 0.1: med[j] = med[j-1] + 0.1
        base_arr = np.array([0.5,1.5,2.5,3.5])
        return 0.9*med + 0.1*base_arr
    y_ref = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
    mask_ref = np.isfinite(blend_oof_ev) & np.isfinite(y_ref)
    th_boot = bootstrap_thresholds_ev(y_ref[mask_ref], blend_oof_ev[mask_ref], n_bootstrap=180)
    np.save('thresholds_blend_boot.npy', th_boot)
    cls = np.digitize(blend_test_ev, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    print('Submission with ordinal-enhanced NNLS blend written to submission.csv', flush=True)
else:
    print('Skipping submission write due to missing blend.', flush=True)

Starting B5@512 Ordinal head fine-tuning (5 folds)...



===== [B5 Ordinal] Fold 0 =====


  model = create_fn(


  it 50/329 tr_loss 1.0178 elapsed 11.7s


  it 100/329 tr_loss 0.9991 elapsed 23.2s


  it 150/329 tr_loss 1.0008 elapsed 34.6s


  it 200/329 tr_loss 0.9807 elapsed 45.9s


  it 250/329 tr_loss 0.9678 elapsed 57.3s


  it 300/329 tr_loss 0.9344 elapsed 68.7s


  Epoch 1: tr_loss 0.9220 val_loss 0.7650 val_qwk(def) 0.4706


  it 50/329 tr_loss 1.0267 elapsed 28.2s


  it 100/329 tr_loss 0.9914 elapsed 56.1s


  it 150/329 tr_loss 0.9752 elapsed 84.2s


  it 200/329 tr_loss 0.9539 elapsed 112.4s


  it 250/329 tr_loss 0.9135 elapsed 140.5s


  it 300/329 tr_loss 0.8791 elapsed 168.7s


  Epoch 1: tr_loss 0.8622 val_loss 0.5895 val_qwk(def) 0.6785


  it 50/329 tr_loss 0.6588 elapsed 28.5s


  it 100/329 tr_loss 0.6544 elapsed 56.9s


  it 150/329 tr_loss 0.6474 elapsed 85.3s


  it 200/329 tr_loss 0.6213 elapsed 113.7s


  it 250/329 tr_loss 0.6018 elapsed 142.1s


  it 300/329 tr_loss 0.5894 elapsed 170.2s


  Epoch 2: tr_loss 0.5820 val_loss 0.4720 val_qwk(def) 0.7948


  it 50/329 tr_loss 0.4888 elapsed 28.4s


  it 100/329 tr_loss 0.4859 elapsed 56.6s


  it 150/329 tr_loss 0.4733 elapsed 84.9s


  it 200/329 tr_loss 0.4945 elapsed 113.1s


  it 250/329 tr_loss 0.5033 elapsed 141.4s


  it 300/329 tr_loss 0.4900 elapsed 169.7s


  Epoch 3: tr_loss 0.4803 val_loss 0.4677 val_qwk(def) 0.7741


  it 50/329 tr_loss 0.4338 elapsed 28.1s


  it 100/329 tr_loss 0.4433 elapsed 56.3s


  it 150/329 tr_loss 0.4332 elapsed 84.5s


  it 200/329 tr_loss 0.4333 elapsed 112.6s


  it 250/329 tr_loss 0.4353 elapsed 140.8s


  it 300/329 tr_loss 0.4326 elapsed 169.1s


  Epoch 4: tr_loss 0.4378 val_loss 0.4536 val_qwk(def) 0.7887


[B5 Ordinal] Fold 0 val QWK(def): 0.7887



===== [B5 Ordinal] Fold 1 =====


  model = create_fn(


  it 50/329 tr_loss 0.9245 elapsed 11.9s


  it 100/329 tr_loss 0.9870 elapsed 23.7s


  it 150/329 tr_loss 0.9740 elapsed 35.1s


  it 200/329 tr_loss 0.9458 elapsed 46.7s


  it 250/329 tr_loss 0.9317 elapsed 58.2s


  it 300/329 tr_loss 0.9211 elapsed 70.1s


  Epoch 1: tr_loss 0.9118 val_loss 0.7609 val_qwk(def) 0.5070


  it 50/329 tr_loss 0.9924 elapsed 28.2s


  it 100/329 tr_loss 0.9790 elapsed 56.3s


  it 150/329 tr_loss 0.9686 elapsed 84.6s


  it 200/329 tr_loss 0.9371 elapsed 112.9s


  it 250/329 tr_loss 0.8957 elapsed 141.1s


  it 300/329 tr_loss 0.8656 elapsed 169.2s


  Epoch 1: tr_loss 0.8535 val_loss 0.5990 val_qwk(def) 0.6749


  it 50/329 tr_loss 0.6186 elapsed 28.4s


  it 100/329 tr_loss 0.5851 elapsed 57.0s


  it 150/329 tr_loss 0.5795 elapsed 85.6s


  it 200/329 tr_loss 0.5934 elapsed 114.0s


  it 250/329 tr_loss 0.5891 elapsed 142.4s


  it 300/329 tr_loss 0.5837 elapsed 170.8s


  Epoch 2: tr_loss 0.5812 val_loss 0.4995 val_qwk(def) 0.7915


  it 50/329 tr_loss 0.4859 elapsed 28.2s


  it 100/329 tr_loss 0.4843 elapsed 56.6s


  it 150/329 tr_loss 0.4769 elapsed 84.9s


  it 200/329 tr_loss 0.4893 elapsed 113.1s


  it 250/329 tr_loss 0.4865 elapsed 141.3s


  it 300/329 tr_loss 0.4812 elapsed 169.2s


  Epoch 3: tr_loss 0.4838 val_loss 0.4731 val_qwk(def) 0.7932


  it 50/329 tr_loss 0.4438 elapsed 28.2s


  it 100/329 tr_loss 0.4404 elapsed 56.5s


  it 150/329 tr_loss 0.4413 elapsed 84.8s


  it 200/329 tr_loss 0.4654 elapsed 112.8s


  it 250/329 tr_loss 0.4637 elapsed 140.7s


  it 300/329 tr_loss 0.4603 elapsed 168.8s


  Epoch 4: tr_loss 0.4619 val_loss 0.4686 val_qwk(def) 0.8051


[B5 Ordinal] Fold 1 val QWK(def): 0.8051



===== [B5 Ordinal] Fold 2 =====


  model = create_fn(


  it 50/329 tr_loss 1.0386 elapsed 11.8s


  it 100/329 tr_loss 1.0203 elapsed 23.4s


  it 150/329 tr_loss 0.9777 elapsed 34.8s


  it 200/329 tr_loss 0.9699 elapsed 46.3s


  it 250/329 tr_loss 0.9450 elapsed 57.9s


  it 300/329 tr_loss 0.9227 elapsed 69.5s


  Epoch 1: tr_loss 0.9169 val_loss 0.7487 val_qwk(def) 0.4976


  it 50/329 tr_loss 0.9427 elapsed 28.2s


  it 100/329 tr_loss 0.9294 elapsed 56.5s


  it 150/329 tr_loss 0.9383 elapsed 84.8s


  it 200/329 tr_loss 0.9153 elapsed 113.0s


  it 250/329 tr_loss 0.8722 elapsed 141.2s


  it 300/329 tr_loss 0.8603 elapsed 169.4s


  Epoch 1: tr_loss 0.8440 val_loss 0.6006 val_qwk(def) 0.6629


  it 50/329 tr_loss 0.6252 elapsed 28.1s


  it 100/329 tr_loss 0.6030 elapsed 56.5s


  it 150/329 tr_loss 0.5940 elapsed 84.5s


  it 200/329 tr_loss 0.5840 elapsed 112.8s


  it 250/329 tr_loss 0.6087 elapsed 141.1s


  it 300/329 tr_loss 0.6085 elapsed 169.0s


  Epoch 2: tr_loss 0.6090 val_loss 0.5201 val_qwk(def) 0.7165


  it 50/329 tr_loss 0.5290 elapsed 28.2s


  it 100/329 tr_loss 0.5250 elapsed 56.3s


  it 150/329 tr_loss 0.5129 elapsed 84.6s


  it 200/329 tr_loss 0.5153 elapsed 113.0s


  it 250/329 tr_loss 0.5087 elapsed 141.0s


  it 300/329 tr_loss 0.5038 elapsed 169.4s


  Epoch 3: tr_loss 0.5034 val_loss 0.4590 val_qwk(def) 0.7930


  it 50/329 tr_loss 0.4500 elapsed 28.1s


  it 100/329 tr_loss 0.4567 elapsed 56.4s


  it 150/329 tr_loss 0.4505 elapsed 84.4s


  it 200/329 tr_loss 0.4544 elapsed 112.7s


  it 250/329 tr_loss 0.4533 elapsed 140.9s


  it 300/329 tr_loss 0.4524 elapsed 169.0s


  Epoch 4: tr_loss 0.4533 val_loss 0.4633 val_qwk(def) 0.8132


[B5 Ordinal] Fold 2 val QWK(def): 0.7930



===== [B5 Ordinal] Fold 3 =====


  model = create_fn(


  it 50/329 tr_loss 1.0237 elapsed 11.5s


  it 100/329 tr_loss 0.9814 elapsed 22.9s


  it 150/329 tr_loss 0.9848 elapsed 34.3s


  it 200/329 tr_loss 0.9647 elapsed 45.7s


  it 250/329 tr_loss 0.9388 elapsed 57.3s


  it 300/329 tr_loss 0.9171 elapsed 68.7s


  Epoch 1: tr_loss 0.9038 val_loss 0.7791 val_qwk(def) 0.5004


  it 50/329 tr_loss 1.0068 elapsed 28.1s


  it 100/329 tr_loss 0.9668 elapsed 56.3s


  it 150/329 tr_loss 0.9789 elapsed 84.5s


  it 200/329 tr_loss 0.9345 elapsed 112.7s


  it 250/329 tr_loss 0.9026 elapsed 140.9s


  it 300/329 tr_loss 0.8606 elapsed 169.3s


  Epoch 1: tr_loss 0.8378 val_loss 0.6254 val_qwk(def) 0.6789


  it 50/329 tr_loss 0.6361 elapsed 28.4s


  it 100/329 tr_loss 0.6043 elapsed 56.8s


  it 150/329 tr_loss 0.6036 elapsed 85.1s


  it 200/329 tr_loss 0.5974 elapsed 113.3s


  it 250/329 tr_loss 0.5898 elapsed 141.4s


  it 300/329 tr_loss 0.5794 elapsed 169.3s


  Epoch 2: tr_loss 0.5697 val_loss 0.5375 val_qwk(def) 0.7584


  it 50/329 tr_loss 0.5240 elapsed 28.3s


  it 100/329 tr_loss 0.5125 elapsed 56.6s


  it 150/329 tr_loss 0.4873 elapsed 85.2s


  it 200/329 tr_loss 0.4864 elapsed 113.5s


  it 250/329 tr_loss 0.4840 elapsed 141.9s


  it 300/329 tr_loss 0.4805 elapsed 170.4s


  Epoch 3: tr_loss 0.4803 val_loss 0.4844 val_qwk(def) 0.7862


  it 50/329 tr_loss 0.4633 elapsed 28.3s


  it 100/329 tr_loss 0.4632 elapsed 56.4s


  it 150/329 tr_loss 0.4511 elapsed 84.8s


  it 200/329 tr_loss 0.4494 elapsed 113.2s


  it 250/329 tr_loss 0.4471 elapsed 141.7s


  it 300/329 tr_loss 0.4465 elapsed 170.1s


  Epoch 4: tr_loss 0.4440 val_loss 0.4957 val_qwk(def) 0.7991


[B5 Ordinal] Fold 3 val QWK(def): 0.7862



===== [B5 Ordinal] Fold 4 =====


  model = create_fn(


  it 50/329 tr_loss 0.9645 elapsed 11.7s


  it 100/329 tr_loss 0.9648 elapsed 23.1s


  it 150/329 tr_loss 0.9635 elapsed 34.7s


  it 200/329 tr_loss 0.9617 elapsed 46.1s


  it 250/329 tr_loss 0.9366 elapsed 57.8s


  it 300/329 tr_loss 0.9165 elapsed 69.3s


  Epoch 1: tr_loss 0.9037 val_loss 0.7574 val_qwk(def) 0.4998


  it 50/329 tr_loss 1.0192 elapsed 28.4s


  it 100/329 tr_loss 0.9828 elapsed 56.8s


  it 150/329 tr_loss 0.9623 elapsed 84.9s


  it 200/329 tr_loss 0.9298 elapsed 113.1s


  it 250/329 tr_loss 0.9027 elapsed 141.2s


  it 300/329 tr_loss 0.8689 elapsed 169.5s


  Epoch 1: tr_loss 0.8420 val_loss 0.6065 val_qwk(def) 0.6823


  it 50/329 tr_loss 0.5788 elapsed 28.3s


  it 100/329 tr_loss 0.5816 elapsed 56.3s


  it 150/329 tr_loss 0.5555 elapsed 84.4s


  it 200/329 tr_loss 0.5751 elapsed 112.3s


  it 250/329 tr_loss 0.5828 elapsed 140.5s


  it 300/329 tr_loss 0.5857 elapsed 168.8s


  Epoch 2: tr_loss 0.5840 val_loss 0.4920 val_qwk(def) 0.7898


  it 50/329 tr_loss 0.5732 elapsed 28.1s


  it 100/329 tr_loss 0.5595 elapsed 55.9s


  it 150/329 tr_loss 0.5352 elapsed 84.0s


  it 200/329 tr_loss 0.5340 elapsed 112.2s


  it 250/329 tr_loss 0.5172 elapsed 140.4s


  it 300/329 tr_loss 0.5147 elapsed 168.6s


  Epoch 3: tr_loss 0.5178 val_loss 0.4681 val_qwk(def) 0.8031


  it 50/329 tr_loss 0.5322 elapsed 28.3s


  it 100/329 tr_loss 0.4924 elapsed 56.5s


  it 150/329 tr_loss 0.4672 elapsed 84.7s


  it 200/329 tr_loss 0.4717 elapsed 113.0s


  it 250/329 tr_loss 0.4651 elapsed 141.3s


  it 300/329 tr_loss 0.4610 elapsed 169.5s


  Epoch 4: tr_loss 0.4611 val_loss 0.4694 val_qwk(def) 0.8002


[B5 Ordinal] Fold 4 val QWK(def): 0.8031


B5 Ordinal OOF QWK (opt): 0.83947 th: [0.47965827 1.59524857 3.43743032 3.79561891]


B5 Ordinal OOF done. OOF QWK: 0.8394662918387915


Saved test_ev_b5_ordinal.npy and test_probs4_b5_ordinal.npy


Blending ordinal EV with existing regression EVs via NNLS...


NNLS blend (incl ordinal) OOF QWK: 0.88123 weights: [0.11275188 0.17426021 0.30175443 0.10151839 0.21649879 0.04660815
 0.04660815] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'b5_512_ordinal']


Bootstrap 30/180


Bootstrap 60/180


Bootstrap 90/180


Bootstrap 120/180


Bootstrap 150/180


Bootstrap 180/180


Submission with ordinal-enhanced NNLS blend written to submission.csv


In [26]:
# Pseudo-OOF fill for 3-fold models + L2 XGBoost stacker (EV-only) per expert plan
import os, gc, numpy as np, pandas as pd, torch
from torch.utils.data import DataLoader

print('Starting pseudo-OOF fill and L2 XGBoost stacker...', flush=True)

# 1) Build pseudo-OOF for 3-fold families by averaging their models over entire TRAIN (hflip-only TTA)
def hflip_tta_predict_models(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval()
                    p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    s += (p0 + p1) / 2.0; c += 1
                p = s / max(1,c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def infer_train_for_models(models):
    dtr = DRDataset(folds_df[['id_code']].copy(), IMG_DIR_TRAIN, transforms=valid_tfms)
    dl_tr_all = DataLoader(dtr, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    return hflip_tta_predict_models(models, dl_tr_all).reshape(-1)

pseudo_cache = {}
if 'models_b4_640' in globals() and isinstance(models_b4_640, list) and len(models_b4_640) > 0:
    print('Generating pseudo-OOF for b4_640...', flush=True)
    pseudo_cache['b4_640'] = infer_train_for_models(models_b4_640)
    np.save('pseudo_oof_b4_640.npy', pseudo_cache['b4_640'])
else:
    print('models_b4_640 not in memory; will skip pseudo-OOF for b4_640 if missing.', flush=True)
if 'models_serx' in globals() and isinstance(models_serx, list) and len(models_serx) > 0:
    print('Generating pseudo-OOF for serx50_512_rrcema...', flush=True)
    pseudo_cache['serx50_512_rrcema'] = infer_train_for_models(models_serx)
    np.save('pseudo_oof_serx50_512_rrc_ema.npy', pseudo_cache['serx50_512_rrcema'])
else:
    print('models_serx not in memory; will skip pseudo-OOF for serx50.', flush=True)
if 'models_serx_s2' in globals() and isinstance(models_serx_s2, list) and len(models_serx_s2) > 0:
    print('Generating pseudo-OOF for serx50_512_rrcema_s2...', flush=True)
    pseudo_cache['serx50_512_rrcema_s2'] = infer_train_for_models(models_serx_s2)
    np.save('pseudo_oof_serx50_512_rrc_ema_seed2026.npy', pseudo_cache['serx50_512_rrcema_s2'])
else:
    print('models_serx_s2 not in memory; will skip pseudo-OOF for serx50 s2.', flush=True)

# 2) Load OOF arrays and fill NaNs (folds 3/4) with pseudo-OOF; if NaNs remain, DROP that feature for stability
oof_t = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
feat_specs = [
    ('b4_512', 'oof_preds_b4.npy', None),
    ('b5_512', 'oof_preds.npy', None),
    ('b5_512_rrcema', 'oof_preds_b5_seed2025_rrc_ema.npy', None),
    ('b4_640', 'oof_preds_b4_640_rrc_ema.npy', 'pseudo_oof_b4_640.npy'),
    ('serx50_512_rrcema', 'oof_preds_serx50_512_rrc_ema.npy', 'pseudo_oof_serx50_512_rrc_ema.npy'),
    ('serx50_512_rrcema_s2', 'oof_preds_serx50_512_rrc_ema_seed2026.npy', 'pseudo_oof_serx50_512_rrc_ema_seed2026.npy'),
    ('b5_512_ordinal', 'oof_ev_b5_ordinal.npy', None),
]
X_cols = []
for name, oof_path, pseudo_path in feat_specs:
    if not os.path.exists(oof_path):
        print('Missing OOF for', name, '->', oof_path, 'skipping feature.', flush=True)
        continue
    arr = np.load(oof_path).reshape(-1)
    if np.any(~np.isfinite(arr)) and pseudo_path is not None and os.path.exists(pseudo_path):
        pseudo = np.load(pseudo_path).reshape(-1)
        mask_nan = ~np.isfinite(arr)
        arr[mask_nan] = pseudo[mask_nan]
    if np.any(~np.isfinite(arr)):
        print('Skipping feature due to remaining NaNs:', name, flush=True)
        continue
    X_cols.append((name, arr))

if len(X_cols) < 3:
    raise RuntimeError('Too few features collected for L2.')

X_oof = np.stack([c[1] for c in X_cols], axis=1).astype(np.float32)
names = [c[0] for c in X_cols]
mask_full = np.isfinite(X_oof).all(axis=1) & np.isfinite(oof_t)
print('L2 features:', names, 'X_oof shape:', X_oof.shape, 'full mask sum:', int(mask_full.sum()), flush=True)

# 3) Build test feature matrix in same order
test_map = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'b5_512_ordinal': 'test_ev_b5_ordinal.npy',
}
X_te_list = []
for n in names:
    p = test_map.get(n, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {n}: {p}')
    X_te_list.append(np.load(p).reshape(-1))
X_te = np.stack(X_te_list, axis=1).astype(np.float32)
print('Test feature shape:', X_te.shape, flush=True)

# 4) XGBoost GPU multiclass stacker, CV by original folds
try:
    import xgboost as xgb
except Exception as e:
    import sys, subprocess
    print('Installing xgboost...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost'], check=False)
    import xgboost as xgb

params = dict(
    objective='multi:softprob', num_class=5,
    tree_method='gpu_hist', predictor='gpu_predictor',
    n_estimators=2000, learning_rate=0.05,
    max_depth=3, min_child_weight=2,
    subsample=0.8, colsample_bytree=0.8,
    reg_lambda=1.0, reg_alpha=0.0,
    random_state=42, n_jobs=-1
)

folds_ref = pd.read_csv('folds.csv')
X = X_oof[mask_full]
y = oof_t[mask_full].astype(int)
oof_proba = np.zeros((len(X_oof), 5), dtype=np.float32)
te_proba_acc = np.zeros((len(X_te), 5), dtype=np.float32)

for f in range(5):
    va_mask_idx = np.where((folds_ref['fold'].values == f) & mask_full)[0]
    tr_mask_idx = np.where((folds_ref['fold'].values != f) & mask_full)[0]
    if len(va_mask_idx) == 0 or len(tr_mask_idx) == 0:
        print(f'[L2-XGB] Skip fold {f} due to empty split.', flush=True);
        continue
    X_tr, y_tr = X_oof[tr_mask_idx], oof_t[tr_mask_idx].astype(int)
    X_va, y_va = X_oof[va_mask_idx], oof_t[va_mask_idx].astype(int)
    clf = xgb.XGBClassifier(**params)
    # Compatibility path: train without eval_set/early stopping due to version API differences
    clf.fit(X_tr, y_tr)
    oof_proba[va_mask_idx] = clf.predict_proba(X_va)
    te_proba_acc += clf.predict_proba(X_te)
    print(f'[L2-XGB] Fold {f} done; val size {len(va_mask_idx)}', flush=True)

te_proba = te_proba_acc / 5.0

# 5) Convert probs to EV, optimize/boot thresholds, write submission
classes = np.arange(5, dtype=np.float32).reshape(1,-1)
oof_ev = (oof_proba * classes).sum(axis=1)
te_ev = (te_proba * classes).sum(axis=1)

from sklearn.metrics import cohen_kappa_score
def preds_to_classes_ev(preds, th):
    return np.digitize(preds, bins=[th[0], th[1], th[2], th[3]])
from scipy.optimize import minimize
def optimize_thresholds_ev(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -0.5, 4.5)
        cls = preds_to_classes_ev(p, th)
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.1: th[i] = th[i-1] + 0.1
    return th

th_opt = optimize_thresholds_ev(oof_t[mask_full], oof_ev[mask_full])
oof_q = cohen_kappa_score(oof_t[mask_full], preds_to_classes_ev(oof_ev[mask_full], th_opt), weights='quadratic')
print('L2-XGB OOF QWK (opt):', f'{oof_q:.5f}', 'th:', th_opt, flush=True)

def bootstrap_thresholds_ev(y, p, n_bootstrap=200, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_ev(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.1: med[j] = med[j-1] + 0.1
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

th_boot = bootstrap_thresholds_ev(oof_t[mask_full], oof_ev[mask_full], n_bootstrap=200)
np.save('l2xgb_used_keys.npy', np.array(names, dtype=object))
np.save('l2xgb_oof_ev.npy', oof_ev); np.save('l2xgb_te_ev.npy', te_ev); np.save('l2xgb_thresholds_boot.npy', th_boot)

cls = preds_to_classes_ev(te_ev, th_boot).astype(int)
sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('L2-XGB submission.csv written.', flush=True)

Starting pseudo-OOF fill and L2 XGBoost stacker...


models_b4_640 not in memory; will skip pseudo-OOF for b4_640 if missing.


Generating pseudo-OOF for serx50_512_rrcema...


Generating pseudo-OOF for serx50_512_rrcema_s2...


Skipping feature due to remaining NaNs: b4_640


L2 features: ['b4_512', 'b5_512', 'b5_512_rrcema', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'b5_512_ordinal'] X_oof shape: (3295, 6) full mask sum: 3295


Test feature shape: (367, 6)



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[L2-XGB] Fold 0 done; val size 659



    E.g. tree_method = "hist", device = "cuda"

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[L2-XGB] Fold 1 done; val size 659



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[L2-XGB] Fold 2 done; val size 659



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[L2-XGB] Fold 3 done; val size 659



    E.g. tree_method = "hist", device = "cuda"

Parameters: { "predictor" } are not used.



[L2-XGB] Fold 4 done; val size 659



    E.g. tree_method = "hist", device = "cuda"



L2-XGB OOF QWK (opt): 0.85989 th: [0.5021006  1.5080409  2.56013595 3.5100819 ]


Bootstrap 30/200


Bootstrap 60/200


Bootstrap 90/200


Bootstrap 120/200


Bootstrap 150/200


Bootstrap 180/200


L2-XGB submission.csv written.


In [27]:
# Rebuild best NNLS blend (regression EVs only) and write submission
import os, numpy as np, pandas as pd
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Recomputing NNLS blend over regression EVs (no ordinal)...', flush=True)
oof_t = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
}
keys = []; arrs = []; mask = np.ones_like(oof_t, dtype=bool)
for k, p in paths_oof.items():
    if os.path.exists(p):
        a = np.load(p).reshape(-1); keys.append(k); arrs.append(a); mask &= np.isfinite(a)
    else:
        print('Missing OOF for', k, '->', p, flush=True)
if len(arrs) < 2: raise RuntimeError('Not enough OOF sources to blend')
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
w, _ = nnls(X[mask], oof_t[mask].astype(float))
if w.sum() > 0: w = w / w.sum()
w = np.clip(w, 0.05, 0.7); w = w / w.sum()
blend_oof = (X * w.reshape(1,-1)).sum(axis=1)
def preds_to_classes_ev(preds, th):
    return np.digitize(preds, bins=[th[0], th[1], th[2], th[3]])
def optimize_thresholds_ev(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, -0.5, 4.5)
        cls = preds_to_classes_ev(p, th)
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.1: th[i] = th[i-1] + 0.1
    return th
th_opt = optimize_thresholds_ev(oof_t[mask], blend_oof[mask])
oof_q = cohen_kappa_score(oof_t[mask], preds_to_classes_ev(blend_oof[mask], th_opt), weights='quadratic')
print('NNLS OOF QWK (opt on mask):', f'{oof_q:.5f}', 'weights:', w, 'keys:', keys, flush=True)

def bootstrap_thresholds_ev(y, p, n_bootstrap=180, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_ev(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.1: med[j] = med[j-1] + 0.1
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr
th_boot = bootstrap_thresholds_ev(oof_t[mask], blend_oof[mask], n_bootstrap=180)
np.save('blend_weights_keys.npy', np.array(keys, dtype=object)); np.save('blend_weights_vals.npy', w); np.save('thresholds_blend_boot.npy', th_boot)

# Build test blend in same key order using standardized hflip-only preds
test_map = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
}
test_list = []; used_keys = []
for k in keys:
    p = test_map.get(k, None)
    if p and os.path.exists(p):
        test_list.append(np.load(p).reshape(-1)); used_keys.append(k)
    else:
        print('Missing test preds for', k, '->', p, flush=True)
if len(test_list) == 0: raise RuntimeError('No test preds available for NNLS blend')
w_use = w[:len(test_list)]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_test = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_use, test_list): blend_test += wi * arr
th = th_boot
cls = np.digitize(blend_test, bins=[th[0], th[1], th[2], th[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('Best-regression NNLS submission.csv written.', flush=True)

Recomputing NNLS blend over regression EVs (no ordinal)...


NNLS OOF QWK (opt on mask): 0.88151 weights: [0.12077895 0.17904818 0.31066866 0.11245652 0.22942863 0.04761905] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2']


Bootstrap 30/180


Bootstrap 60/180


Bootstrap 90/180


Bootstrap 120/180


Bootstrap 150/180


Bootstrap 180/180


Best-regression NNLS submission.csv written.


In [28]:
# ResNet200d @512 (3 folds) with RRC+EMA, then refit NNLS blend and write submission
import os, gc, copy, time, math, numpy as np, pandas as pd, torch, torch.nn as nn, albumentations as A, cv2
from torch.utils.data import DataLoader
from albumentations.pytorch import ToTensorV2
from timm.utils import ModelEmaV2
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Starting resnet200d@512 (3 folds) RRC+EMA...', flush=True)

MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_r200d = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.88, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.05), rotate=(-12, 12), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_r200d = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def get_loaders_r200d(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_r200d)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_r200d)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_r200d(model, dl, loss_fn):
    model.eval(); preds=[]; targs=[]; val_loss=0.0; n=0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)
            bs = xb.size(0); val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy()); targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/max(1,n), preds, targs

def train_one_fold_r200d_ema(fold, folds_df, epochs=12, lr=2e-4, wd=1e-5, batch_size=8, patience=2, ema_decay=0.9996):
    print(f"\n===== [resnet200d@512 RRC+EMA] Fold {fold} =====", flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_r200d(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = RegHeadModel(backbone_name='resnet200d', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr): return (step+1)/len(dl_tr)
        prog = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5*(1+math.cos(math.pi*prog))
    sch = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    best_loss = float('inf'); best_state=None; best_preds=None; best_targs=None; no_imp=0; accum = max(1, 16 // batch_size)
    opt.zero_grad(set_to_none=True)
    for ep in range(1, epochs+1):
        model.train(); tr_loss=0.0; n=0; t0=time.time()
        for it,(xb,yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)/accum
            scaler.scale(loss).backward()
            if (it+1)%accum==0:
                scaler.step(opt); scaler.update(); opt.zero_grad(set_to_none=True); sch.step(); ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item()*accum)*bs; n += bs
            if (it+1)%50==0: print(f'Epoch {ep} it {it+1}/{len(dl_tr)} tr_loss {tr_loss/max(1,n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        vloss, vpreds, vtargs = validate_r200d(model, dl_va, loss_fn)
        vq = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'Epoch {ep}: tr_loss {tr_loss/max(1,n):.4f} val_loss {vloss:.4f} val_qwk(def) {vq:.4f} epoch_time {time.time()-t0:.1f}s', flush=True)
        if vloss < best_loss:
            best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = vpreds.copy(); best_targs = vtargs.copy(); no_imp=0
        else:
            no_imp += 1
        if no_imp >= patience:
            print('Early stopping triggered', flush=True); break
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_r200d_ema(folds_df, folds_to_run=(0,1,2), epochs=12, batch_size=8):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); oof_preds = np.full(n, np.nan, dtype=np.float32); oof_targs = folds_df['diagnosis'].values.astype(float)
    models_out = []
    for fold in folds_to_run:
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        gc.collect()
        bs = batch_size
        for attempt in range(5):
            print(f'Attempt {attempt+1}: trying batch_size={bs}', flush=True)
            try:
                m, vpreds, vtargs = train_one_fold_r200d_ema(fold, folds_df, epochs=epochs, lr=2e-4, wd=1e-5, batch_size=bs, patience=2, ema_decay=0.9996)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                if 'out of memory' in str(e).lower():
                    print(f'CUDA OOM on fold {fold} with batch_size={bs}. Reducing...', flush=True)
                    try: del m
                    except Exception: pass
                    gc.collect()
                    try: torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
                    except Exception: pass
                    bs = max(2, bs-2); continue
                else:
                    raise
        models_out.append(m)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = vpreds
        fold_q = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'[resnet200d@512] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    np.save('oof_preds_r200d_512_rrc_ema.npy', oof_preds); np.save('oof_targets_r200d_512_rrc_ema.npy', oof_targs)
    return models_out, oof_preds, oof_targs

def nnls_blend_with_r200d():
    print('Refitting NNLS with resnet200d added...', flush=True)
    paths_oof = {
        'b4_512': 'oof_preds_b4.npy',
        'b5_512': 'oof_preds.npy',
        'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
        'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
        'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
        'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
        'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    }
    y = np.load('oof_targets_b4.npy') if os.path.exists('oof_targets_b4.npy') else np.load('oof_targets.npy')
    keys=[]; arrs=[]; mask = np.isfinite(y).copy()
    for k,p in paths_oof.items():
        if os.path.exists(p):
            a = np.load(p).reshape(-1); keys.append(k); arrs.append(a); mask &= np.isfinite(a)
        else:
            print('Missing OOF for', k, '->', p, flush=True)
    if len(arrs) < 3:
        print('Too few arrays for NNLS. Abort blend.', flush=True); return None, None, None, None, None
    X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, 0.05, 0.7); w = w / w.sum()
    blend_oof = (X * w.reshape(1,-1)).sum(axis=1)
    def preds_to_classes_ev(preds, th):
        return np.digitize(preds, bins=[th[0], th[1], th[2], th[3]])
    def optimize_thresholds_ev(y, p, init=[0.5,1.5,2.5,3.5]):
        y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
        def _loss(th):
            th = np.sort(th); th = np.clip(th, -0.5, 4.5)
            cls = preds_to_classes_ev(p, th)
            return -cohen_kappa_score(y, cls, weights='quadratic')
        res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
        th = np.sort(res.x)
        for i in range(1,4):
            if th[i]-th[i-1] < 0.1: th[i] = th[i-1] + 0.1
        return th
    th_opt = optimize_thresholds_ev(y[mask], blend_oof[mask])
    oof_q = cohen_kappa_score(y[mask], preds_to_classes_ev(blend_oof[mask], th_opt), weights='quadratic')
    print('NNLS (with r200d) OOF QWK:', f'{oof_q:.5f}', 'weights:', w, 'keys:', keys, flush=True)
    # Build test blend
    test_map = {
        'b4_512': 'test_reg_preds_b4_hflip.npy',
        'b5_512': 'test_reg_preds_b5_hflip.npy',
        'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
        'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
        'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
        'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
        'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    }
    test_list=[]; used_keys=[]
    for k in keys:
        p = test_map.get(k, None)
        if p and os.path.exists(p): test_list.append(np.load(p).reshape(-1)); used_keys.append(k)
        else: print('Missing test preds for', k, '->', p, flush=True)
    if len(test_list) == 0:
        print('No test preds to blend.', flush=True); return None, None, None, None, None
    w_use = w[:len(test_list)]
    if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
    w_use = w_use / w_use.sum()
    blend_test = np.zeros_like(test_list[0], dtype=float)
    for wi, arr in zip(w_use, test_list): blend_test += wi * arr
    return blend_oof, blend_test, w, th_opt, keys

def tta_predict_hflip_mean(models, dl):
    preds = [];
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval(); p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]));
                    s += (p0 + p1)/2.0; c += 1
                p = s / max(1,c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

# Execute: train resnet200d 3-folds, infer test, then NNLS-blend
seed_everything(1337)
models_r200d, oof_r200d, oof_t_r200d = run_cv_r200d_ema(folds_df, folds_to_run=(0,1,2), epochs=12, batch_size=8)
dte_r200d = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_r200d)
dl_te_r200d = DataLoader(dte_r200d, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
preds_r200d = tta_predict_hflip_mean(models_r200d, dl_te_r200d)
np.save('test_reg_preds_r200d_512_rrc_ema.npy', preds_r200d)
blend_oof_r200d, blend_test_r200d, w_r200d, th_r200d, keys_r200d = nnls_blend_with_r200d()
if blend_oof_r200d is not None:
    # Bootstrap thresholds for final stability
    def bootstrap_thresholds_ev(y, p, n_bootstrap=180, base=[0.5,1.5,2.5,3.5]):
        N = len(y); cuts=[]
        for i in range(n_bootstrap):
            idx = np.random.randint(0, N, size=N)
            # Reuse optimizer above
            def opt_th(yy, pp, init=base):
                def _loss(th):
                    th = np.sort(th); th = np.clip(th, -0.5, 4.5)
                    cls = np.digitize(pp, bins=[th[0], th[1], th[2], th[3]])
                    return -cohen_kappa_score(yy, cls, weights='quadratic')
                res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1200, 'xatol':1e-3, 'fatol':1e-3})
                th = np.sort(res.x)
                for j in range(1,4):
                    if th[j]-th[j-1] < 0.1: th[j] = th[j-1] + 0.1
                return th
            thb = opt_th(oof_t_r200d[np.isfinite(blend_oof_r200d)], blend_oof_r200d[np.isfinite(blend_oof_r200d)], init=base)
            cuts.append(thb)
            if (i+1)%30==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
        cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
        for j in range(1,4):
            if med[j]-med[j-1] < 0.1: med[j] = med[j-1] + 0.1
        base_arr = np.array([0.5,1.5,2.5,3.5])
        return 0.9*med + 0.1*base_arr
    th_boot = bootstrap_thresholds_ev(oof_t_r200d, blend_oof_r200d, n_bootstrap=180)
    cls = np.digitize(blend_test_r200d, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    print('Submission with resnet200d-enhanced NNLS blend written to submission.csv', flush=True)
else:
    print('Skipping submission due to missing blend inputs.', flush=True)

Starting resnet200d@512 (3 folds) RRC+EMA...


Attempt 1: trying batch_size=8



===== [resnet200d@512 RRC+EMA] Fold 0 =====


Epoch 1 it 50/329 tr_loss 0.7889 elapsed 33.8s


Epoch 1 it 100/329 tr_loss 0.7351 elapsed 66.6s


Epoch 1 it 150/329 tr_loss 0.6393 elapsed 99.3s


Epoch 1 it 200/329 tr_loss 0.5627 elapsed 132.6s


Epoch 1 it 250/329 tr_loss 0.5106 elapsed 165.2s


Epoch 1 it 300/329 tr_loss 0.4765 elapsed 198.2s


Epoch 1: tr_loss 0.4618 val_loss 0.2885 val_qwk(def) 0.6658 epoch_time 235.2s


Epoch 2 it 50/329 tr_loss 0.3547 elapsed 32.6s


Epoch 2 it 100/329 tr_loss 0.3106 elapsed 65.7s


Epoch 2 it 150/329 tr_loss 0.2983 elapsed 98.4s


Epoch 2 it 200/329 tr_loss 0.2796 elapsed 131.5s


Epoch 2 it 250/329 tr_loss 0.2644 elapsed 164.4s


Epoch 2 it 300/329 tr_loss 0.2644 elapsed 197.2s


Epoch 2: tr_loss 0.2645 val_loss 0.2254 val_qwk(def) 0.7846 epoch_time 234.3s


Epoch 3 it 50/329 tr_loss 0.2420 elapsed 33.1s


Epoch 3 it 100/329 tr_loss 0.2257 elapsed 65.9s


Epoch 3 it 150/329 tr_loss 0.2221 elapsed 99.1s


Epoch 3 it 200/329 tr_loss 0.2232 elapsed 131.9s


Epoch 3 it 250/329 tr_loss 0.2248 elapsed 165.2s


Epoch 3 it 300/329 tr_loss 0.2291 elapsed 198.1s


Epoch 3: tr_loss 0.2231 val_loss 0.2115 val_qwk(def) 0.8232 epoch_time 235.1s


Epoch 4 it 50/329 tr_loss 0.2043 elapsed 33.1s


Epoch 4 it 100/329 tr_loss 0.2093 elapsed 66.4s


Epoch 4 it 150/329 tr_loss 0.1989 elapsed 99.2s


Epoch 4 it 200/329 tr_loss 0.1955 elapsed 132.1s


Epoch 4 it 250/329 tr_loss 0.1938 elapsed 165.2s


Epoch 4 it 300/329 tr_loss 0.1965 elapsed 198.0s


Epoch 4: tr_loss 0.1927 val_loss 0.2529 val_qwk(def) 0.8538 epoch_time 235.0s


Epoch 5 it 50/329 tr_loss 0.2020 elapsed 33.1s


Epoch 5 it 100/329 tr_loss 0.1776 elapsed 66.0s


Epoch 5 it 150/329 tr_loss 0.1772 elapsed 99.1s


Epoch 5 it 200/329 tr_loss 0.1720 elapsed 131.8s


Epoch 5 it 250/329 tr_loss 0.1689 elapsed 164.9s


Epoch 5 it 300/329 tr_loss 0.1714 elapsed 198.0s


Epoch 5: tr_loss 0.1724 val_loss 0.1866 val_qwk(def) 0.8152 epoch_time 234.7s


Epoch 6 it 50/329 tr_loss 0.1160 elapsed 33.1s


Epoch 6 it 100/329 tr_loss 0.1437 elapsed 66.0s


Epoch 6 it 150/329 tr_loss 0.1433 elapsed 98.8s


Epoch 6 it 200/329 tr_loss 0.1492 elapsed 131.9s


Epoch 6 it 250/329 tr_loss 0.1585 elapsed 165.2s


Epoch 6 it 300/329 tr_loss 0.1611 elapsed 198.5s


Epoch 6: tr_loss 0.1617 val_loss 0.1880 val_qwk(def) 0.8439 epoch_time 235.8s


Epoch 7 it 50/329 tr_loss 0.1471 elapsed 32.8s


Epoch 7 it 100/329 tr_loss 0.1393 elapsed 65.8s


Epoch 7 it 150/329 tr_loss 0.1459 elapsed 98.4s


Epoch 7 it 200/329 tr_loss 0.1467 elapsed 130.9s


Epoch 7 it 250/329 tr_loss 0.1464 elapsed 163.9s


Epoch 7 it 300/329 tr_loss 0.1438 elapsed 196.5s


Epoch 7: tr_loss 0.1421 val_loss 0.2060 val_qwk(def) 0.8396 epoch_time 233.5s


Early stopping triggered


[resnet200d@512] Fold 0 val QWK (default th): 0.8152


Attempt 1: trying batch_size=8



===== [resnet200d@512 RRC+EMA] Fold 1 =====


Epoch 1 it 50/329 tr_loss 0.7355 elapsed 32.9s


Epoch 1 it 100/329 tr_loss 0.7555 elapsed 65.9s


Epoch 1 it 150/329 tr_loss 0.6703 elapsed 98.5s


Epoch 1 it 200/329 tr_loss 0.5893 elapsed 131.5s


Epoch 1 it 250/329 tr_loss 0.5325 elapsed 164.1s


Epoch 1 it 300/329 tr_loss 0.4915 elapsed 197.2s


Epoch 1: tr_loss 0.4766 val_loss 0.2940 val_qwk(def) 0.6610 epoch_time 234.1s


Epoch 2 it 50/329 tr_loss 0.3100 elapsed 32.8s


Epoch 2 it 100/329 tr_loss 0.2791 elapsed 65.4s


Epoch 2 it 150/329 tr_loss 0.2643 elapsed 98.0s


Epoch 2 it 200/329 tr_loss 0.2698 elapsed 131.1s


Epoch 2 it 250/329 tr_loss 0.2592 elapsed 163.7s


Epoch 2 it 300/329 tr_loss 0.2627 elapsed 196.3s


Epoch 2: tr_loss 0.2630 val_loss 0.2089 val_qwk(def) 0.8247 epoch_time 233.1s


Epoch 3 it 50/329 tr_loss 0.2260 elapsed 32.5s


Epoch 3 it 100/329 tr_loss 0.2284 elapsed 65.2s


Epoch 3 it 150/329 tr_loss 0.2321 elapsed 98.2s


Epoch 3 it 200/329 tr_loss 0.2303 elapsed 130.8s


Epoch 3 it 250/329 tr_loss 0.2312 elapsed 163.4s


Epoch 3 it 300/329 tr_loss 0.2295 elapsed 196.1s


Epoch 3: tr_loss 0.2263 val_loss 0.1939 val_qwk(def) 0.8044 epoch_time 233.0s


Epoch 4 it 50/329 tr_loss 0.1678 elapsed 32.9s


Epoch 4 it 100/329 tr_loss 0.1913 elapsed 65.4s


Epoch 4 it 150/329 tr_loss 0.2033 elapsed 97.9s


Epoch 4 it 200/329 tr_loss 0.1942 elapsed 130.8s


Epoch 4 it 250/329 tr_loss 0.1927 elapsed 163.4s


Epoch 4 it 300/329 tr_loss 0.1969 elapsed 196.0s


Epoch 4: tr_loss 0.1959 val_loss 0.2258 val_qwk(def) 0.7409 epoch_time 232.9s


Epoch 5 it 50/329 tr_loss 0.1882 elapsed 32.9s


Epoch 5 it 100/329 tr_loss 0.1686 elapsed 65.5s


Epoch 5 it 150/329 tr_loss 0.1623 elapsed 98.2s


Epoch 5 it 200/329 tr_loss 0.1720 elapsed 130.6s


Epoch 5 it 250/329 tr_loss 0.1732 elapsed 163.4s


Epoch 5 it 300/329 tr_loss 0.1715 elapsed 196.3s


Epoch 5: tr_loss 0.1723 val_loss 0.2298 val_qwk(def) 0.8430 epoch_time 233.1s


Early stopping triggered


[resnet200d@512] Fold 1 val QWK (default th): 0.8044


Attempt 1: trying batch_size=8



===== [resnet200d@512 RRC+EMA] Fold 2 =====


Epoch 1 it 50/329 tr_loss 0.7809 elapsed 32.7s


Epoch 1 it 100/329 tr_loss 0.7758 elapsed 65.6s


Epoch 1 it 150/329 tr_loss 0.6384 elapsed 98.3s


Epoch 1 it 200/329 tr_loss 0.5693 elapsed 131.3s


Epoch 1 it 250/329 tr_loss 0.5182 elapsed 164.0s


Epoch 1 it 300/329 tr_loss 0.4846 elapsed 197.1s


Epoch 1: tr_loss 0.4682 val_loss 0.2799 val_qwk(def) 0.6791 epoch_time 234.0s


Epoch 2 it 50/329 tr_loss 0.2752 elapsed 32.6s


Epoch 2 it 100/329 tr_loss 0.2807 elapsed 65.5s


Epoch 2 it 150/329 tr_loss 0.2841 elapsed 98.0s


Epoch 2 it 200/329 tr_loss 0.2801 elapsed 130.6s


Epoch 2 it 250/329 tr_loss 0.2691 elapsed 163.5s


Epoch 2 it 300/329 tr_loss 0.2657 elapsed 196.2s


Epoch 2: tr_loss 0.2599 val_loss 0.2034 val_qwk(def) 0.8059 epoch_time 233.0s


Epoch 3 it 50/329 tr_loss 0.1839 elapsed 32.7s


Epoch 3 it 100/329 tr_loss 0.1855 elapsed 65.5s


Epoch 3 it 150/329 tr_loss 0.2095 elapsed 98.0s


Epoch 3 it 200/329 tr_loss 0.2134 elapsed 130.7s


Epoch 3 it 250/329 tr_loss 0.2118 elapsed 163.7s


Epoch 3 it 300/329 tr_loss 0.2107 elapsed 196.3s


Epoch 3: tr_loss 0.2149 val_loss 0.2002 val_qwk(def) 0.8312 epoch_time 233.1s


Epoch 4 it 50/329 tr_loss 0.2121 elapsed 32.7s


Epoch 4 it 100/329 tr_loss 0.2052 elapsed 65.7s


Epoch 4 it 150/329 tr_loss 0.2001 elapsed 98.3s


Epoch 4 it 200/329 tr_loss 0.1935 elapsed 130.9s


Epoch 4 it 250/329 tr_loss 0.1953 elapsed 163.9s


Epoch 4 it 300/329 tr_loss 0.1898 elapsed 196.5s


Epoch 4: tr_loss 0.1909 val_loss 0.1947 val_qwk(def) 0.8542 epoch_time 233.4s


Epoch 5 it 50/329 tr_loss 0.1823 elapsed 32.5s


Epoch 5 it 100/329 tr_loss 0.1782 elapsed 65.7s


Epoch 5 it 150/329 tr_loss 0.1802 elapsed 98.2s


Epoch 5 it 200/329 tr_loss 0.1758 elapsed 130.8s


Epoch 5 it 250/329 tr_loss 0.1697 elapsed 163.9s


Epoch 5 it 300/329 tr_loss 0.1704 elapsed 196.6s


Epoch 5: tr_loss 0.1733 val_loss 0.2189 val_qwk(def) 0.7791 epoch_time 233.5s


Epoch 6 it 50/329 tr_loss 0.1633 elapsed 32.7s


Epoch 6 it 100/329 tr_loss 0.1497 elapsed 65.2s


Epoch 6 it 150/329 tr_loss 0.1696 elapsed 97.8s


Epoch 6 it 200/329 tr_loss 0.1683 elapsed 130.5s


Epoch 6 it 250/329 tr_loss 0.1588 elapsed 163.4s


Epoch 6 it 300/329 tr_loss 0.1575 elapsed 196.1s


Epoch 6: tr_loss 0.1573 val_loss 0.1788 val_qwk(def) 0.8558 epoch_time 232.6s


Epoch 7 it 50/329 tr_loss 0.1111 elapsed 32.9s


Epoch 7 it 100/329 tr_loss 0.1281 elapsed 65.5s


Epoch 7 it 150/329 tr_loss 0.1319 elapsed 98.0s


Epoch 7 it 200/329 tr_loss 0.1336 elapsed 130.5s


Epoch 7 it 250/329 tr_loss 0.1331 elapsed 163.5s


Epoch 7 it 300/329 tr_loss 0.1397 elapsed 195.9s


Epoch 7: tr_loss 0.1385 val_loss 0.1927 val_qwk(def) 0.8156 epoch_time 232.7s


Epoch 8 it 50/329 tr_loss 0.1045 elapsed 32.8s


Epoch 8 it 100/329 tr_loss 0.1175 elapsed 65.9s


Epoch 8 it 150/329 tr_loss 0.1234 elapsed 98.6s


Epoch 8 it 200/329 tr_loss 0.1231 elapsed 131.7s


Epoch 8 it 250/329 tr_loss 0.1286 elapsed 164.5s


Epoch 8 it 300/329 tr_loss 0.1267 elapsed 197.2s


Epoch 8: tr_loss 0.1273 val_loss 0.2013 val_qwk(def) 0.8453 epoch_time 234.1s


Early stopping triggered


[resnet200d@512] Fold 2 val QWK (default th): 0.8558


Refitting NNLS with resnet200d added...


NNLS (with r200d) OOF QWK: 0.88248 weights: [0.11219482 0.17026146 0.29639526 0.11005915 0.18994785 0.04761905
 0.07352242] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'r200d_512_rrcema']


Bootstrap 30/180


Bootstrap 60/180


Bootstrap 90/180


Bootstrap 120/180


Bootstrap 150/180


Bootstrap 180/180


Submission with resnet200d-enhanced NNLS blend written to submission.csv


In [29]:
# ConvNeXt-Base @512 (3 folds, RRC+EMA), z-scored NNLS blend with caps and robust thresholds
import os, gc, copy, time, math, numpy as np, pandas as pd, torch, torch.nn as nn, albumentations as A, cv2
from torch.utils.data import DataLoader
from albumentations.pytorch import ToTensorV2
from timm.utils import ModelEmaV2
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import nnls, minimize

print('Starting ConvNeXt-Base@512 (3 folds) RRC+EMA...', flush=True)

SEED_CNX = 2027
seed_everything(SEED_CNX)
BACKBONE_CNX = 'convnext_base.fb_in22k_ft_in1k'
EPOCHS_CNX = 12
PATIENCE_CNX = 3
BATCH_CNX = 8
LR_CNX = 2e-4
WD_CNX = 1e-5

MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_cnx = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.88, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.06), rotate=(-12, 12), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_cnx = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def get_loaders_cnx(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_cnx)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_cnx)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_loss_preds(model, dl, loss_fn):
    model.eval(); preds=[]; targs=[]; vloss=0.0; n=0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)
            bs = xb.size(0); vloss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy()); targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return vloss/max(1,n), preds, targs

def train_one_fold_cnx_ema(fold, folds_df, epochs=12, lr=2e-4, wd=1e-5, batch_size=8, patience=3, ema_decay=0.9996):
    print(f"\n===== [ConvNeXtB@512 RRC+EMA] Fold {fold} =====", flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_cnx(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = RegHeadModel(backbone_name=BACKBONE_CNX, pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr): return (step+1)/len(dl_tr)
        prog = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5*(1+math.cos(math.pi*prog))
    sch = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    best_loss = float('inf'); best_state=None; best_preds=None; best_targs=None; no_imp=0
    accum = max(1, 16 // batch_size); opt.zero_grad(set_to_none=True)
    for ep in range(1, epochs+1):
        model.train(); tr_loss=0.0; n=0; t0=time.time()
        for it,(xb,yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)/accum
            scaler.scale(loss).backward()
            if (it+1)%accum==0:
                scaler.step(opt); scaler.update(); opt.zero_grad(set_to_none=True); sch.step(); ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item()*accum)*bs; n += bs
            if (it+1)%50==0: print(f'Epoch {ep} it {it+1}/{len(dl_tr)} tr_loss {tr_loss/max(1,n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        # validate EMA model (save/validate EMA)
        vloss, vpreds, vtargs = validate_loss_preds(ema.module, dl_va, loss_fn)
        vq = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'Epoch {ep}: tr_loss {tr_loss/max(1,n):.4f} val_loss {vloss:.4f} val_qwk(def) {vq:.4f} epoch_time {time.time()-t0:.1f}s', flush=True)
        if vloss < best_loss:
            best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = vpreds.copy(); best_targs = vtargs.copy(); no_imp=0
        else:
            no_imp += 1
        if no_imp >= patience:
            print('Early stopping triggered', flush=True); break
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_cnx_ema(folds_df, folds_to_run=(0,1,2), epochs=12, batch_size=8):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); oof_preds = np.full(n, np.nan, dtype=np.float32); oof_targs = folds_df['diagnosis'].values.astype(float)
    models_out = []
    for fold in folds_to_run:
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        gc.collect()
        bs = batch_size
        for attempt in range(4):
            print(f'Attempt {attempt+1}: trying batch_size={bs}', flush=True)
            try:
                m, vpreds, vtargs = train_one_fold_cnx_ema(fold, folds_df, epochs=epochs, lr=LR_CNX, wd=WD_CNX, batch_size=bs, patience=PATIENCE_CNX, ema_decay=0.9996)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                if 'out of memory' in str(e).lower():
                    print(f'CUDA OOM on fold {fold} with batch_size={bs}. Reducing...', flush=True)
                    try: del m
                    except Exception: pass
                    gc.collect()
                    try: torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
                    except Exception: pass
                    bs = max(2, bs-2); continue
                else:
                    raise
        models_out.append(m)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = vpreds
        fold_q = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'[ConvNeXtB@512] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    np.save('oof_preds_convnextb_512_rrc_ema.npy', oof_preds); np.save('oof_targets_convnextb_512_rrc_ema.npy', oof_targs)
    return models_out, oof_preds, oof_targs

def hflip_tta_predict_mean(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval(); p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    s += (p0 + p1) / 2.0; c += 1
                p = s / max(1, c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def zscore_fit_transform(arr_list, mask):
    stats = []
    z_list = []
    for a in arr_list:
        a_m = a[mask]
        mu = float(np.mean(a_m))
        sd = float(np.std(a_m) + 1e-6)
        z = (a - mu) / sd
        stats.append((mu, sd))
        z_list.append(z)
    return z_list, stats

def apply_zscore(arr_list, stats):
    z_list = []
    for a, (mu, sd) in zip(arr_list, stats):
        z_list.append((a - mu) / (sd + 1e-6))
    return z_list

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], th[3]])
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=200, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

def nnls_with_caps_zscore_and_serx_cap(y, oof_map, test_map, keys, lo=0.05, hi=0.70, serx_pair=('serx50_512_rrcema','serx50_512_rrcema_s2'), serx_cap_sum=0.30):
    arrs = [oof_map[k] for k in keys]
    mask = np.ones_like(y, dtype=bool)
    for a in arrs: mask &= np.isfinite(a)
    z_arrs, stats = zscore_fit_transform(arrs, mask)
    X = np.concatenate([z.reshape(-1,1) for z in z_arrs], axis=1)
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, lo, hi); w = w / w.sum()
    # serx correlated sum cap
    if serx_pair[0] in keys and serx_pair[1] in keys:
        i0 = keys.index(serx_pair[0]); i1 = keys.index(serx_pair[1])
        s = w[i0] + w[i1]
        if s > serx_cap_sum:
            scale = serx_cap_sum / s
            w[i0] *= scale; w[i1] *= scale
            rem = 1.0 - (w[i0] + w[i1])
            others = [i for i in range(len(w)) if i not in (i0,i1)]
            w_others = w[others]; s_others = w_others.sum() + 1e-9
            w[others] = w_others / s_others * rem
    # build blended OOF
    blend_oof = (X * w.reshape(1,-1)).sum(axis=1)
    # build blended TEST using same z-score stats
    te_arrs = [test_map[k] for k in keys]
    te_z = apply_zscore(te_arrs, stats)
    te_mat = np.concatenate([z.reshape(-1,1) for z in te_z], axis=1)
    blend_te = (te_mat * w.reshape(1,-1)).sum(axis=1)
    return blend_oof, blend_te, w, mask

# 1) Train ConvNeXtB 3-folds
models_cnx, oof_cnx, oof_t_cnx = run_cv_cnx_ema(folds_df, folds_to_run=(0,1,2), epochs=EPOCHS_CNX, batch_size=BATCH_CNX)

# 2) Inference on test (hflip-only) for ConvNeXtB
dte_cnx = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_cnx)
dl_te_cnx = DataLoader(dte_cnx, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
preds_cnx = hflip_tta_predict_mean(models_cnx, dl_te_cnx).reshape(-1)
np.save('test_reg_preds_convnextb_512_rrc_ema.npy', preds_cnx)

# 3) Refit NNLS on z-scored OOF EVs with caps, include ConvNeXtB
oof_paths = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
test_paths = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}
y_ref = None
for tgt_path in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt_path):
        y_ref = np.load(tgt_path).reshape(-1); break
if y_ref is None:
    y_ref = oof_t_cnx.reshape(-1)

loaded_keys = []; oof_map = {}; te_map = {}
for k, p in oof_paths.items():
    if os.path.exists(p):
        a = np.load(p).reshape(-1);
        oof_map[k] = a; loaded_keys.append(k)
for k, p in test_paths.items():
    if os.path.exists(p):
        te_map[k] = np.load(p).reshape(-1)

keys_use = [k for k in loaded_keys if k in te_map]
if len(keys_use) < 3:
    print('Too few models available for NNLS; aborting blend.', flush=True)
else:
    blend_oof, blend_te, w, mask_nnls = nnls_with_caps_zscore_and_serx_cap(y_ref, oof_map, te_map, keys_use, lo=0.05, hi=0.70, serx_cap_sum=0.30)
    th = optimize_thresholds_generic(y_ref[mask_nnls], blend_oof[mask_nnls])
    oof_q = cohen_kappa_score(y_ref[mask_nnls], np.digitize(blend_oof[mask_nnls], bins=[th[0], th[1], th[2], th[3]]), weights='quadratic')
    print('Z-scored NNLS OOF QWK:', f'{oof_q:.5f}', 'weights:', w, 'keys:', keys_use, flush=True)
    th_boot = bootstrap_thresholds_generic(y_ref[mask_nnls], blend_oof[mask_nnls], n_bootstrap=200, base=th.tolist())
    # Optional tiny smoothing: clip EV to [0,4]
    blend_te = np.clip(blend_te, 0.0, 4.0)
    cls = np.digitize(blend_te, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
    sub = pd.DataFrame({'id_code': test_df['id_code'].values, 'diagnosis': cls})
    sub.to_csv('submission.csv', index=False)
    np.save('blend_weights_keys.npy', np.array(keys_use, dtype=object)); np.save('blend_weights_vals.npy', w); np.save('thresholds_blend_boot.npy', th_boot)
    print('Final submission.csv written with ConvNeXtB included.', flush=True)

Starting ConvNeXt-Base@512 (3 folds) RRC+EMA...


Attempt 1: trying batch_size=8



===== [ConvNeXtB@512 RRC+EMA] Fold 0 =====




Epoch 1 it 50/329 tr_loss 0.6565 elapsed 33.5s


Epoch 1 it 100/329 tr_loss 0.5066 elapsed 66.9s


Epoch 1 it 150/329 tr_loss 0.4398 elapsed 99.6s


Epoch 1 it 200/329 tr_loss 0.3992 elapsed 132.8s


Epoch 1 it 250/329 tr_loss 0.3718 elapsed 165.8s


Epoch 1 it 300/329 tr_loss 0.3524 elapsed 199.0s


Epoch 1: tr_loss 0.3468 val_loss 0.7725 val_qwk(def) -0.0875 epoch_time 235.2s


Epoch 2 it 50/329 tr_loss 0.2411 elapsed 33.1s


Epoch 2 it 100/329 tr_loss 0.2279 elapsed 66.2s


Epoch 2 it 150/329 tr_loss 0.2348 elapsed 99.2s


Epoch 2 it 200/329 tr_loss 0.2293 elapsed 132.2s


Epoch 2 it 250/329 tr_loss 0.2329 elapsed 165.4s


Epoch 2 it 300/329 tr_loss 0.2208 elapsed 198.3s


Epoch 2: tr_loss 0.2157 val_loss 0.6495 val_qwk(def) 0.0424 epoch_time 234.4s


Epoch 3 it 50/329 tr_loss 0.1917 elapsed 32.9s


Epoch 3 it 100/329 tr_loss 0.2082 elapsed 65.9s


Epoch 3 it 150/329 tr_loss 0.1985 elapsed 99.0s


Epoch 3 it 200/329 tr_loss 0.2027 elapsed 131.9s


Epoch 3 it 250/329 tr_loss 0.1890 elapsed 164.7s


Epoch 3 it 300/329 tr_loss 0.1837 elapsed 198.0s


Epoch 3: tr_loss 0.1838 val_loss 0.5329 val_qwk(def) 0.3375 epoch_time 234.0s


Epoch 4 it 50/329 tr_loss 0.1750 elapsed 32.9s


Epoch 4 it 100/329 tr_loss 0.1753 elapsed 65.8s


Epoch 4 it 150/329 tr_loss 0.1598 elapsed 98.6s


Epoch 4 it 200/329 tr_loss 0.1607 elapsed 131.8s


Epoch 4 it 250/329 tr_loss 0.1564 elapsed 164.5s


Epoch 4 it 300/329 tr_loss 0.1584 elapsed 197.4s


Epoch 4: tr_loss 0.1605 val_loss 0.4370 val_qwk(def) 0.4784 epoch_time 233.5s


Epoch 5 it 50/329 tr_loss 0.1423 elapsed 32.9s


Epoch 5 it 100/329 tr_loss 0.1427 elapsed 65.9s


Epoch 5 it 150/329 tr_loss 0.1378 elapsed 98.9s


Epoch 5 it 200/329 tr_loss 0.1368 elapsed 131.6s


Epoch 5 it 250/329 tr_loss 0.1369 elapsed 164.4s


Epoch 5 it 300/329 tr_loss 0.1396 elapsed 197.2s


Epoch 5: tr_loss 0.1432 val_loss 0.3650 val_qwk(def) 0.5535 epoch_time 233.2s


Epoch 6 it 50/329 tr_loss 0.1133 elapsed 33.0s


Epoch 6 it 100/329 tr_loss 0.1114 elapsed 65.9s


Epoch 6 it 150/329 tr_loss 0.1195 elapsed 98.7s


Epoch 6 it 200/329 tr_loss 0.1149 elapsed 131.4s


Epoch 6 it 250/329 tr_loss 0.1184 elapsed 164.5s


Epoch 6 it 300/329 tr_loss 0.1190 elapsed 197.4s


Epoch 6: tr_loss 0.1201 val_loss 0.3156 val_qwk(def) 0.6195 epoch_time 233.5s


Epoch 7 it 50/329 tr_loss 0.0793 elapsed 32.9s


Epoch 7 it 100/329 tr_loss 0.0995 elapsed 65.7s


Epoch 7 it 150/329 tr_loss 0.1047 elapsed 98.5s


Epoch 7 it 200/329 tr_loss 0.0992 elapsed 131.5s


Epoch 7 it 250/329 tr_loss 0.1004 elapsed 164.1s


Epoch 7 it 300/329 tr_loss 0.0994 elapsed 196.9s


Epoch 7: tr_loss 0.1011 val_loss 0.2815 val_qwk(def) 0.6740 epoch_time 232.9s


Epoch 8 it 50/329 tr_loss 0.1023 elapsed 32.8s


Epoch 8 it 100/329 tr_loss 0.0990 elapsed 66.0s


Epoch 8 it 150/329 tr_loss 0.0971 elapsed 98.8s


Epoch 8 it 200/329 tr_loss 0.0992 elapsed 131.5s


Epoch 8 it 250/329 tr_loss 0.0937 elapsed 164.2s


Epoch 8 it 300/329 tr_loss 0.0901 elapsed 197.3s


Epoch 8: tr_loss 0.0901 val_loss 0.2579 val_qwk(def) 0.7190 epoch_time 233.3s


Epoch 9 it 50/329 tr_loss 0.0638 elapsed 32.7s


Epoch 9 it 100/329 tr_loss 0.0611 elapsed 65.5s


Epoch 9 it 150/329 tr_loss 0.0630 elapsed 98.2s


Epoch 9 it 200/329 tr_loss 0.0638 elapsed 131.0s


Epoch 9 it 250/329 tr_loss 0.0690 elapsed 164.0s


Epoch 9 it 300/329 tr_loss 0.0708 elapsed 196.8s


Epoch 9: tr_loss 0.0701 val_loss 0.2399 val_qwk(def) 0.7556 epoch_time 232.8s


Epoch 10 it 50/329 tr_loss 0.0522 elapsed 32.7s


Epoch 10 it 100/329 tr_loss 0.0536 elapsed 65.5s


Epoch 10 it 150/329 tr_loss 0.0512 elapsed 98.5s


Epoch 10 it 200/329 tr_loss 0.0560 elapsed 131.2s


Epoch 10 it 250/329 tr_loss 0.0596 elapsed 163.9s


Epoch 10 it 300/329 tr_loss 0.0575 elapsed 196.7s


Epoch 10: tr_loss 0.0561 val_loss 0.2273 val_qwk(def) 0.7816 epoch_time 233.0s


Epoch 11 it 50/329 tr_loss 0.0357 elapsed 32.7s


Epoch 11 it 100/329 tr_loss 0.0361 elapsed 65.3s


Epoch 11 it 150/329 tr_loss 0.0364 elapsed 98.1s


Epoch 11 it 200/329 tr_loss 0.0401 elapsed 130.8s


Epoch 11 it 250/329 tr_loss 0.0443 elapsed 163.6s


Epoch 11 it 300/329 tr_loss 0.0431 elapsed 196.3s


Epoch 11: tr_loss 0.0428 val_loss 0.2171 val_qwk(def) 0.8016 epoch_time 232.4s


Epoch 12 it 50/329 tr_loss 0.0266 elapsed 32.7s


Epoch 12 it 100/329 tr_loss 0.0310 elapsed 65.9s


Epoch 12 it 150/329 tr_loss 0.0342 elapsed 98.6s


Epoch 12 it 200/329 tr_loss 0.0357 elapsed 131.3s


Epoch 12 it 250/329 tr_loss 0.0342 elapsed 164.0s


Epoch 12 it 300/329 tr_loss 0.0331 elapsed 196.6s


Epoch 12: tr_loss 0.0344 val_loss 0.2079 val_qwk(def) 0.8185 epoch_time 232.6s


[ConvNeXtB@512] Fold 0 val QWK (default th): 0.8185


Attempt 1: trying batch_size=8



===== [ConvNeXtB@512 RRC+EMA] Fold 1 =====




Epoch 1 it 50/329 tr_loss 0.9835 elapsed 33.1s


Epoch 1 it 100/329 tr_loss 0.7007 elapsed 66.1s


Epoch 1 it 150/329 tr_loss 0.5648 elapsed 99.0s


Epoch 1 it 200/329 tr_loss 0.5073 elapsed 132.0s


Epoch 1 it 250/329 tr_loss 0.4590 elapsed 165.3s


Epoch 1 it 300/329 tr_loss 0.4174 elapsed 198.4s


Epoch 1: tr_loss 0.4014 val_loss 1.0190 val_qwk(def) -0.0014 epoch_time 234.8s


Epoch 2 it 50/329 tr_loss 0.2162 elapsed 33.1s


Epoch 2 it 100/329 tr_loss 0.1950 elapsed 66.2s


Epoch 2 it 150/329 tr_loss 0.2117 elapsed 99.1s


Epoch 2 it 200/329 tr_loss 0.2222 elapsed 132.1s


Epoch 2 it 250/329 tr_loss 0.2162 elapsed 165.0s


Epoch 2 it 300/329 tr_loss 0.2143 elapsed 198.3s


Epoch 2: tr_loss 0.2159 val_loss 0.8760 val_qwk(def) 0.0092 epoch_time 234.4s


Epoch 3 it 50/329 tr_loss 0.1834 elapsed 33.0s


Epoch 3 it 100/329 tr_loss 0.1688 elapsed 65.8s


Epoch 3 it 150/329 tr_loss 0.1689 elapsed 98.6s


Epoch 3 it 200/329 tr_loss 0.1796 elapsed 131.5s


Epoch 3 it 250/329 tr_loss 0.1912 elapsed 164.6s


Epoch 3 it 300/329 tr_loss 0.1928 elapsed 197.6s


Epoch 3: tr_loss 0.1911 val_loss 0.7404 val_qwk(def) 0.0870 epoch_time 233.6s


Epoch 4 it 50/329 tr_loss 0.1702 elapsed 32.8s


Epoch 4 it 100/329 tr_loss 0.1647 elapsed 65.6s


Epoch 4 it 150/329 tr_loss 0.1623 elapsed 98.7s


Epoch 4 it 200/329 tr_loss 0.1669 elapsed 131.5s


Epoch 4 it 250/329 tr_loss 0.1650 elapsed 164.3s


Epoch 4 it 300/329 tr_loss 0.1656 elapsed 197.2s


Epoch 4: tr_loss 0.1628 val_loss 0.6320 val_qwk(def) 0.2669 epoch_time 233.6s


Epoch 5 it 50/329 tr_loss 0.1579 elapsed 32.7s


Epoch 5 it 100/329 tr_loss 0.1479 elapsed 65.6s


Epoch 5 it 150/329 tr_loss 0.1458 elapsed 98.3s


Epoch 5 it 200/329 tr_loss 0.1430 elapsed 131.1s


Epoch 5 it 250/329 tr_loss 0.1443 elapsed 163.9s


Epoch 5 it 300/329 tr_loss 0.1456 elapsed 197.0s


Epoch 5: tr_loss 0.1435 val_loss 0.5353 val_qwk(def) 0.3751 epoch_time 233.2s


Epoch 6 it 50/329 tr_loss 0.1081 elapsed 32.9s


Epoch 6 it 100/329 tr_loss 0.1136 elapsed 65.6s


Epoch 6 it 150/329 tr_loss 0.1206 elapsed 98.3s


Epoch 6 it 200/329 tr_loss 0.1202 elapsed 131.0s


Epoch 6 it 250/329 tr_loss 0.1164 elapsed 164.1s


Epoch 6 it 300/329 tr_loss 0.1144 elapsed 196.8s


Epoch 6: tr_loss 0.1157 val_loss 0.4631 val_qwk(def) 0.4389 epoch_time 232.8s


Epoch 7 it 50/329 tr_loss 0.0901 elapsed 32.8s


Epoch 7 it 100/329 tr_loss 0.0932 elapsed 65.5s


Epoch 7 it 150/329 tr_loss 0.0992 elapsed 98.2s


Epoch 7 it 200/329 tr_loss 0.1009 elapsed 131.2s


Epoch 7 it 250/329 tr_loss 0.1008 elapsed 163.9s


Epoch 7 it 300/329 tr_loss 0.1010 elapsed 196.8s


Epoch 7: tr_loss 0.0991 val_loss 0.4079 val_qwk(def) 0.5349 epoch_time 232.9s


Epoch 8 it 50/329 tr_loss 0.0814 elapsed 33.0s


Epoch 8 it 100/329 tr_loss 0.0784 elapsed 65.7s


Epoch 8 it 150/329 tr_loss 0.0761 elapsed 98.5s


Epoch 8 it 200/329 tr_loss 0.0804 elapsed 131.1s


Epoch 8 it 250/329 tr_loss 0.0788 elapsed 163.7s


Epoch 8 it 300/329 tr_loss 0.0835 elapsed 196.4s


Epoch 8: tr_loss 0.0851 val_loss 0.3655 val_qwk(def) 0.6111 epoch_time 232.7s


Epoch 9 it 50/329 tr_loss 0.0620 elapsed 32.8s


Epoch 9 it 100/329 tr_loss 0.0605 elapsed 65.4s


Epoch 9 it 150/329 tr_loss 0.0588 elapsed 98.0s


Epoch 9 it 200/329 tr_loss 0.0614 elapsed 130.6s


Epoch 9 it 250/329 tr_loss 0.0643 elapsed 163.1s


Epoch 9 it 300/329 tr_loss 0.0639 elapsed 196.1s


Epoch 9: tr_loss 0.0642 val_loss 0.3338 val_qwk(def) 0.6579 epoch_time 232.2s


Epoch 10 it 50/329 tr_loss 0.0429 elapsed 32.5s


Epoch 10 it 100/329 tr_loss 0.0486 elapsed 65.1s


Epoch 10 it 150/329 tr_loss 0.0505 elapsed 97.8s


Epoch 10 it 200/329 tr_loss 0.0491 elapsed 130.7s


Epoch 10 it 250/329 tr_loss 0.0505 elapsed 163.3s


Epoch 10 it 300/329 tr_loss 0.0508 elapsed 196.1s


Epoch 10: tr_loss 0.0496 val_loss 0.3078 val_qwk(def) 0.6951 epoch_time 232.0s


Epoch 11 it 50/329 tr_loss 0.0290 elapsed 32.9s


Epoch 11 it 100/329 tr_loss 0.0331 elapsed 65.4s


Epoch 11 it 150/329 tr_loss 0.0361 elapsed 98.0s


Epoch 11 it 200/329 tr_loss 0.0391 elapsed 130.5s


Epoch 11 it 250/329 tr_loss 0.0374 elapsed 163.2s


Epoch 11 it 300/329 tr_loss 0.0383 elapsed 195.9s


Epoch 11: tr_loss 0.0383 val_loss 0.2872 val_qwk(def) 0.7355 epoch_time 232.3s


Epoch 12 it 50/329 tr_loss 0.0379 elapsed 32.6s


Epoch 12 it 100/329 tr_loss 0.0352 elapsed 65.2s


Epoch 12 it 150/329 tr_loss 0.0340 elapsed 97.8s


Epoch 12 it 200/329 tr_loss 0.0323 elapsed 130.4s


Epoch 12 it 250/329 tr_loss 0.0318 elapsed 163.3s


Epoch 12 it 300/329 tr_loss 0.0315 elapsed 195.9s


Epoch 12: tr_loss 0.0316 val_loss 0.2690 val_qwk(def) 0.7582 epoch_time 232.0s


[ConvNeXtB@512] Fold 1 val QWK (default th): 0.7582


Attempt 1: trying batch_size=8



===== [ConvNeXtB@512 RRC+EMA] Fold 2 =====




Epoch 1 it 50/329 tr_loss 0.7245 elapsed 33.1s


Epoch 1 it 100/329 tr_loss 0.5437 elapsed 66.2s


Epoch 1 it 150/329 tr_loss 0.4742 elapsed 99.2s


Epoch 1 it 200/329 tr_loss 0.4172 elapsed 132.3s


Epoch 1 it 250/329 tr_loss 0.4016 elapsed 165.7s


Epoch 1 it 300/329 tr_loss 0.3709 elapsed 198.8s


Epoch 1: tr_loss 0.3606 val_loss 0.7673 val_qwk(def) 0.0100 epoch_time 235.1s


Epoch 2 it 50/329 tr_loss 0.2309 elapsed 33.1s


Epoch 2 it 100/329 tr_loss 0.2325 elapsed 66.3s


Epoch 2 it 150/329 tr_loss 0.2192 elapsed 99.3s


Epoch 2 it 200/329 tr_loss 0.2239 elapsed 132.3s


Epoch 2 it 250/329 tr_loss 0.2086 elapsed 165.2s


Epoch 2 it 300/329 tr_loss 0.2081 elapsed 198.5s


Epoch 2: tr_loss 0.2040 val_loss 0.6454 val_qwk(def) 0.1724 epoch_time 234.7s


Epoch 3 it 50/329 tr_loss 0.2369 elapsed 32.9s


Epoch 3 it 100/329 tr_loss 0.1997 elapsed 65.8s


Epoch 3 it 150/329 tr_loss 0.1905 elapsed 98.7s


Epoch 3 it 200/329 tr_loss 0.1793 elapsed 131.5s


Epoch 3 it 250/329 tr_loss 0.1867 elapsed 164.7s


Epoch 3 it 300/329 tr_loss 0.1844 elapsed 197.5s


Epoch 3: tr_loss 0.1873 val_loss 0.5230 val_qwk(def) 0.3730 epoch_time 233.5s


Epoch 4 it 50/329 tr_loss 0.1624 elapsed 32.8s


Epoch 4 it 100/329 tr_loss 0.1500 elapsed 65.7s


Epoch 4 it 150/329 tr_loss 0.1465 elapsed 98.7s


Epoch 4 it 200/329 tr_loss 0.1503 elapsed 131.5s


Epoch 4 it 250/329 tr_loss 0.1540 elapsed 164.3s


Epoch 4 it 300/329 tr_loss 0.1563 elapsed 197.2s


Epoch 4: tr_loss 0.1546 val_loss 0.4217 val_qwk(def) 0.4853 epoch_time 233.4s


Epoch 5 it 50/329 tr_loss 0.1476 elapsed 33.2s


Epoch 5 it 100/329 tr_loss 0.1443 elapsed 65.9s


Epoch 5 it 150/329 tr_loss 0.1437 elapsed 98.7s


Epoch 5 it 200/329 tr_loss 0.1377 elapsed 131.5s


Epoch 5 it 250/329 tr_loss 0.1363 elapsed 164.2s


Epoch 5 it 300/329 tr_loss 0.1344 elapsed 196.9s


Epoch 5: tr_loss 0.1340 val_loss 0.3447 val_qwk(def) 0.5771 epoch_time 233.5s


Epoch 6 it 50/329 tr_loss 0.1113 elapsed 32.8s


Epoch 6 it 100/329 tr_loss 0.1102 elapsed 65.6s


Epoch 6 it 150/329 tr_loss 0.1146 elapsed 98.2s


Epoch 6 it 200/329 tr_loss 0.1235 elapsed 131.3s


Epoch 6 it 250/329 tr_loss 0.1252 elapsed 164.0s


Epoch 6 it 300/329 tr_loss 0.1242 elapsed 196.8s


Epoch 6: tr_loss 0.1241 val_loss 0.2986 val_qwk(def) 0.6614 epoch_time 232.9s


Epoch 7 it 50/329 tr_loss 0.1106 elapsed 32.9s


Epoch 7 it 100/329 tr_loss 0.0975 elapsed 65.4s


Epoch 7 it 150/329 tr_loss 0.0945 elapsed 98.4s


Epoch 7 it 200/329 tr_loss 0.0900 elapsed 131.1s


Epoch 7 it 250/329 tr_loss 0.0889 elapsed 163.9s


Epoch 7 it 300/329 tr_loss 0.0934 elapsed 196.6s


Epoch 7: tr_loss 0.0948 val_loss 0.2680 val_qwk(def) 0.7237 epoch_time 232.7s


Epoch 8 it 50/329 tr_loss 0.0662 elapsed 33.0s


Epoch 8 it 100/329 tr_loss 0.0727 elapsed 65.8s


Epoch 8 it 150/329 tr_loss 0.0746 elapsed 98.4s


Epoch 8 it 200/329 tr_loss 0.0785 elapsed 131.0s


Epoch 8 it 250/329 tr_loss 0.0782 elapsed 164.0s


Epoch 8 it 300/329 tr_loss 0.0771 elapsed 196.8s


Epoch 8: tr_loss 0.0797 val_loss 0.2498 val_qwk(def) 0.7597 epoch_time 233.0s


Epoch 9 it 50/329 tr_loss 0.0650 elapsed 32.6s


Epoch 9 it 100/329 tr_loss 0.0702 elapsed 65.3s


Epoch 9 it 150/329 tr_loss 0.0705 elapsed 98.0s


Epoch 9 it 200/329 tr_loss 0.0680 elapsed 131.1s


Epoch 9 it 250/329 tr_loss 0.0704 elapsed 163.8s


Epoch 9 it 300/329 tr_loss 0.0679 elapsed 196.4s


Epoch 9: tr_loss 0.0681 val_loss 0.2408 val_qwk(def) 0.7823 epoch_time 232.4s


Epoch 10 it 50/329 tr_loss 0.0594 elapsed 32.6s


Epoch 10 it 100/329 tr_loss 0.0656 elapsed 65.7s


Epoch 10 it 150/329 tr_loss 0.0633 elapsed 98.3s


Epoch 10 it 200/329 tr_loss 0.0599 elapsed 130.9s


Epoch 10 it 250/329 tr_loss 0.0582 elapsed 163.6s


Epoch 10 it 300/329 tr_loss 0.0590 elapsed 196.6s


Epoch 10: tr_loss 0.0579 val_loss 0.2311 val_qwk(def) 0.7977 epoch_time 232.6s


Epoch 11 it 50/329 tr_loss 0.0367 elapsed 32.6s


Epoch 11 it 100/329 tr_loss 0.0405 elapsed 65.3s


Epoch 11 it 150/329 tr_loss 0.0441 elapsed 98.0s


Epoch 11 it 200/329 tr_loss 0.0413 elapsed 130.6s


Epoch 11 it 250/329 tr_loss 0.0442 elapsed 163.4s


Epoch 11 it 300/329 tr_loss 0.0441 elapsed 196.1s


Epoch 11: tr_loss 0.0444 val_loss 0.2201 val_qwk(def) 0.8073 epoch_time 232.3s


Epoch 12 it 50/329 tr_loss 0.0265 elapsed 32.6s


Epoch 12 it 100/329 tr_loss 0.0296 elapsed 65.3s


Epoch 12 it 150/329 tr_loss 0.0327 elapsed 97.9s


Epoch 12 it 200/329 tr_loss 0.0354 elapsed 130.7s


Epoch 12 it 250/329 tr_loss 0.0360 elapsed 163.3s


Epoch 12 it 300/329 tr_loss 0.0348 elapsed 195.9s


Epoch 12: tr_loss 0.0343 val_loss 0.2089 val_qwk(def) 0.8205 epoch_time 231.7s


[ConvNeXtB@512] Fold 2 val QWK (default th): 0.8205


Z-scored NNLS OOF QWK: 0.75996 weights: [0.11134283 0.17306955 0.28974523 0.08823347 0.18743803 0.04602204
 0.05812681 0.04602204] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'r200d_512_rrcema', 'convnextb_512_rrcema']


Bootstrap 40/200


Bootstrap 80/200


Bootstrap 120/200


Bootstrap 160/200


Bootstrap 200/200


Final submission.csv written with ConvNeXtB included.


In [30]:
# Refit LOFO-averaged NNLS (caps + serx cap) incl. ConvNeXt; bootstrap thresholds; write submission
import os, numpy as np, pandas as pd
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Refitting LOFO-averaged NNLS with caps and serx-pair cap (incl. ConvNeXt)...', flush=True)

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None:
    raise RuntimeError('OOF targets not found')

loaded_keys = []; arrs = [];
for k,p in paths_oof.items():
    if os.path.exists(p):
        a = np.load(p).reshape(-1); loaded_keys.append(k); arrs.append(a)
    
if len(arrs) < 3:
    raise RuntimeError('Too few OOF arrays for NNLS')

X_full = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)

def fit_nnls_masked(y, X, mask, lo=0.05, hi=0.70, serx_pair=('serx50_512_rrcema','serx50_512_rrcema_s2'), serx_cap_sum=0.30, keys=None):
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, lo, hi); w = w / w.sum()
    # serx-pair sum cap
    if keys is not None and all(k in keys for k in serx_pair):
        i0 = keys.index(serx_pair[0]); i1 = keys.index(serx_pair[1])
        s = w[i0] + w[i1]
        if s > serx_cap_sum:
            scale = serx_cap_sum / s
            w[i0] *= scale; w[i1] *= scale
            rem = 1.0 - (w[i0] + w[i1])
            idx_others = [i for i in range(len(w)) if i not in (i0,i1)]
            s_others = w[idx_others].sum() + 1e-9
            w[idx_others] = w[idx_others] / s_others * rem
    return w

# LOFO-averaged NNLS
keys_all = loaded_keys.copy()
weights_list = []
w_all = fit_nnls_masked(y, X_full, mask, lo=0.05, hi=0.70, keys=keys_all)
weights_list.append(w_all)
for i in range(len(keys_all)):
    keep_idx = [j for j in range(len(keys_all)) if j != i]
    X_lofo = X_full[:, keep_idx]
    keys_lofo = [keys_all[j] for j in keep_idx]
    w_lofo = fit_nnls_masked(y, X_lofo, mask, lo=0.05, hi=0.70, keys=keys_lofo)
    # map back to full length with zero for left-out index
    w_full = np.zeros(len(keys_all), dtype=float)
    ptr = 0
    for j in range(len(keys_all)):
        if j == i: continue
        w_full[j] = w_lofo[ptr]; ptr += 1
    # renormalize to 1
    if w_full.sum() > 0: w_full /= w_full.sum()
    weights_list.append(w_full)
w_avg = np.mean(np.stack(weights_list, axis=0), axis=0)
w_avg = np.clip(w_avg, 0.05, 0.70); w_avg = w_avg / w_avg.sum()

# Build blended OOF with averaged weights
blend_oof = (X_full * w_avg.reshape(1,-1)).sum(axis=1)

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], th[3]])
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

th = optimize_thresholds_generic(y[mask], blend_oof[mask])
oof_q = cohen_kappa_score(y[mask], np.digitize(blend_oof[mask], bins=[th[0], th[1], th[2], th[3]]), weights='quadratic')
print('LOFO-avg NNLS OOF QWK:', f'{oof_q:.5f}', 'weights:', w_avg, 'keys:', keys_all, flush=True)

def bootstrap_thresholds_generic(y, p, n_bootstrap=200, base=None):
    if base is None: base = [0.5,1.5,2.5,3.5]
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        thb = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(thb)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

th_boot = bootstrap_thresholds_generic(y[mask], blend_oof[mask], n_bootstrap=200, base=th.tolist())
np.save('blend_weights_keys.npy', np.array(keys_all, dtype=object))
np.save('blend_weights_vals.npy', w_avg)
np.save('thresholds_blend_boot.npy', th_boot)

# Build blended TEST with same key order; use available preds only
test_list = []; used_keys = []
for k in keys_all:
    p = paths_te.get(k, None)
    if p and os.path.exists(p):
        test_list.append(np.load(p).reshape(-1)); used_keys.append(k)
    else:
        print('Missing test preds for', k, '->', p, flush=True)
if len(test_list) == 0:
    raise RuntimeError('No test preds found for final blend')
w_use = w_avg[:len(test_list)]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_use, test_list):
    blend_te += wi * arr
# optional EV clipping
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = np.digitize(blend_te, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('Submission.csv written with LOFO-averaged NNLS blend (incl. ConvNeXt).', flush=True)

Refitting LOFO-averaged NNLS with caps and serx-pair cap (incl. ConvNeXt)...


LOFO-avg NNLS OOF QWK: 0.88284 weights: [0.11385803 0.15888844 0.26617223 0.10379826 0.18198154 0.05618721
 0.06956592 0.04954837] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'r200d_512_rrcema', 'convnextb_512_rrcema']


Bootstrap 40/200


Bootstrap 80/200


Bootstrap 120/200


Bootstrap 160/200


Bootstrap 200/200


Submission.csv written with LOFO-averaged NNLS blend (incl. ConvNeXt).


In [31]:
# SEResNeXt101_32x8d @512 (3 folds) with RRC+EMA, then LOFO-avg NNLS (caps + serx pair cap), bootstrap thresholds, write submission
import os, gc, copy, time, math, numpy as np, pandas as pd, torch, torch.nn as nn, albumentations as A, cv2
from torch.utils.data import DataLoader
from albumentations.pytorch import ToTensorV2
from timm.utils import ModelEmaV2
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import nnls, minimize

print('Starting SEResNeXt101_32x8d@512 (3 folds) RRC+EMA...', flush=True)

MEAN = (0.485, 0.456, 0.406); STD = (0.229, 0.224, 0.225)
train_tfms_s101 = A.Compose([
    A.RandomResizedCrop(size=(IMG_SIZE, IMG_SIZE), scale=(0.90, 1.0), ratio=(0.95, 1.05), interpolation=cv2.INTER_AREA),
    A.HorizontalFlip(p=0.5),
    A.Affine(scale=(0.95, 1.05), translate_percent=(0.0, 0.05), rotate=(-12, 12), border_mode=cv2.BORDER_REFLECT_101, p=0.7),
    A.RandomBrightnessContrast(brightness_limit=0.15, contrast_limit=0.15, p=0.7),
    A.RandomGamma(gamma_limit=(90,110), p=0.4),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])
valid_tfms_s101 = A.Compose([
    A.Resize(IMG_SIZE, IMG_SIZE),
    A.Normalize(mean=MEAN, std=STD),
    ToTensorV2(),
])

def get_loaders_s101(tr_df, va_df, batch_size=8, num_workers=0):
    dtr = DRDataset(tr_df, IMG_DIR_TRAIN, transforms=train_tfms_s101)
    dva = DRDataset(va_df, IMG_DIR_TRAIN, transforms=valid_tfms_s101)
    dl_tr = DataLoader(dtr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(dva, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=False)
    return dl_tr, dl_va

def validate_s101(model, dl, loss_fn):
    model.eval(); preds=[]; targs=[]; val_loss=0.0; n=0
    with torch.no_grad():
        for xb, yb in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)
            bs = xb.size(0); val_loss += loss.item()*bs; n += bs
            preds.append(out.detach().float().cpu().numpy()); targs.append(yb.detach().float().cpu().numpy())
    preds = np.concatenate(preds); targs = np.concatenate(targs)
    return val_loss/max(1,n), preds, targs

def train_one_fold_s101_ema(fold, folds_df, epochs=12, lr=2e-4, wd=1e-5, batch_size=8, patience=3, ema_decay=0.9996):
    print(f"\n===== [SEResNeXt101@512 RRC+EMA] Fold {fold} =====", flush=True)
    tr_df = folds_df[folds_df['fold'] != fold][['id_code','diagnosis']].reset_index(drop=True)
    va_df = folds_df[folds_df['fold'] == fold][['id_code','diagnosis']].reset_index(drop=True)
    dl_tr, dl_va = get_loaders_s101(tr_df, va_df, batch_size=batch_size, num_workers=0)
    model = RegHeadModel(backbone_name='seresnext101_32x8d', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    loss_fn = nn.HuberLoss(delta=1.0)
    scaler = torch.amp.GradScaler('cuda', enabled=True)
    ema = ModelEmaV2(model, decay=ema_decay, device=None)
    total_steps = epochs * len(dl_tr)
    def lr_lambda(step):
        if step < len(dl_tr): return (step+1)/len(dl_tr)
        prog = (step - len(dl_tr)) / max(1, total_steps - len(dl_tr))
        return 0.5*(1+math.cos(math.pi*prog))
    sch = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    best_loss = float('inf'); best_state=None; best_preds=None; best_targs=None; no_imp=0; accum = max(1, 16 // batch_size)
    opt.zero_grad(set_to_none=True)
    for ep in range(1, epochs+1):
        model.train(); tr_loss=0.0; n=0; t0=time.time()
        for it,(xb,yb) in enumerate(dl_tr):
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last); yb = yb.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                out = model(xb); loss = loss_fn(out, yb)/accum
            scaler.scale(loss).backward()
            if (it+1)%accum==0:
                scaler.step(opt); scaler.update(); opt.zero_grad(set_to_none=True); sch.step(); ema.update(model)
            bs = xb.size(0); tr_loss += (loss.item()*accum)*bs; n += bs
            if (it+1)%50==0: print(f'Epoch {ep} it {it+1}/{len(dl_tr)} tr_loss {tr_loss/max(1,n):.4f} elapsed {time.time()-t0:.1f}s', flush=True)
        vloss, vpreds, vtargs = validate_s101(ema.module, dl_va, loss_fn)
        vq = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'Epoch {ep}: tr_loss {tr_loss/max(1,n):.4f} val_loss {vloss:.4f} val_qwk(def) {vq:.4f} epoch_time {time.time()-t0:.1f}s', flush=True)
        if vloss < best_loss:
            best_loss = vloss; best_state = copy.deepcopy(ema.module.state_dict()); best_preds = vpreds.copy(); best_targs = vtargs.copy(); no_imp=0
        else:
            no_imp += 1
        if no_imp >= patience:
            print('Early stopping triggered', flush=True); break
    ema.module.load_state_dict(best_state)
    return ema.module, best_preds, best_targs

def run_cv_s101_ema(folds_df, folds_to_run=(0,1,2), epochs=12, batch_size=8):
    folds_df = folds_df.reset_index(drop=True).copy()
    n = len(folds_df); oof_preds = np.full(n, np.nan, dtype=np.float32); oof_targs = folds_df['diagnosis'].values.astype(float)
    models_out = []
    for fold in folds_to_run:
        try:
            torch.cuda.empty_cache()
            if torch.cuda.is_available(): torch.cuda.reset_peak_memory_stats()
        except Exception: pass
        gc.collect()
        bs = batch_size
        for attempt in range(5):
            print(f'Attempt {attempt+1}: trying batch_size={bs}', flush=True)
            try:
                m, vpreds, vtargs = train_one_fold_s101_ema(fold, folds_df, epochs=epochs, lr=2e-4, wd=1e-5, batch_size=bs, patience=3, ema_decay=0.9996)
                break
            except (RuntimeError, torch.cuda.OutOfMemoryError, torch.OutOfMemoryError) as e:
                if 'out of memory' in str(e).lower():
                    print(f'CUDA OOM on fold {fold} with batch_size={bs}. Reducing...', flush=True)
                    try: del m
                    except Exception: pass
                    gc.collect()
                    try: torch.cuda.empty_cache(); torch.cuda.reset_peak_memory_stats()
                    except Exception: pass
                    bs = max(2, bs-2); continue
                else:
                    raise
        models_out.append(m)
        va_idx = np.where(folds_df['fold'].values == fold)[0]
        oof_preds[va_idx] = vpreds
        fold_q = cohen_kappa_score(vtargs, np.digitize(vpreds, bins=[0.5,1.5,2.5,3.5]), weights='quadratic')
        print(f'[SEResNeXt101@512] Fold {fold} val QWK (default th): {fold_q:.4f}', flush=True)
    np.save('oof_preds_serx101_512_rrc_ema.npy', oof_preds); np.save('oof_targets_serx101_512_rrc_ema.npy', oof_targs)
    return models_out, oof_preds, oof_targs

def hflip_tta_predict_mean(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval(); p0 = m(xb); p1 = m(torch.flip(xb, dims=[-1]))
                    s += (p0 + p1) / 2.0; c += 1
                p = s / max(1,c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], th[3]])
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=200, base=[0.5,1.5,2.5,3.5]):
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        th = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(th)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

def fit_nnls_masked(y, X, mask, lo=0.05, hi=0.70, serx_pair=('serx50_512_rrcema','serx50_512_rrcema_s2'), serx_cap_sum=0.30, keys=None):
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, lo, hi); w = w / w.sum()
    if keys is not None and all(k in keys for k in serx_pair):
        i0 = keys.index(serx_pair[0]); i1 = keys.index(serx_pair[1])
        s = w[i0] + w[i1]
        if s > serx_cap_sum:
            scale = serx_cap_sum / s
            w[i0] *= scale; w[i1] *= scale
            rem = 1.0 - (w[i0] + w[i1])
            idx_others = [i for i in range(len(w)) if i not in (i0,i1)]
            s_others = w[idx_others].sum() + 1e-9
            w[idx_others] = w[idx_others] / s_others * rem
    return w

# 1) Train SEResNeXt101 (3 folds)
seed_everything(2028)
models_s101, oof_s101, oof_t_s101 = run_cv_s101_ema(folds_df, folds_to_run=(0,1,2), epochs=12, batch_size=8)

# 2) Test inference (hflip-only TTA) and save
dte_s101 = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_s101)
dl_te_s101 = DataLoader(dte_s101, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
preds_s101 = hflip_tta_predict_mean(models_s101, dl_te_s101).reshape(-1)
np.save('test_reg_preds_serx101_512_rrc_ema.npy', preds_s101)

# 3) LOFO-averaged NNLS with caps, include serx101
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
    'serx101_512_rrcema': 'oof_preds_serx101_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
    'serx101_512_rrcema': 'test_reg_preds_serx101_512_rrc_ema.npy',
}
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: y = oof_t_s101.reshape(-1)

loaded_keys = []; arrs = [];
for k,p in paths_oof.items():
    if os.path.exists(p):
        a = np.load(p).reshape(-1); loaded_keys.append(k); arrs.append(a)
X_full = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
keys_all = loaded_keys.copy()
weights_list = []
w_all = fit_nnls_masked(y, X_full, mask, lo=0.05, hi=0.70, keys=keys_all)
weights_list.append(w_all)
for i in range(len(keys_all)):
    keep_idx = [j for j in range(len(keys_all)) if j != i]
    X_lofo = X_full[:, keep_idx]
    keys_lofo = [keys_all[j] for j in keep_idx]
    w_lofo = fit_nnls_masked(y, X_lofo, mask, lo=0.05, hi=0.70, keys=keys_lofo)
    w_full = np.zeros(len(keys_all), dtype=float)
    ptr = 0
    for j in range(len(keys_all)):
        if j == i: continue
        w_full[j] = w_lofo[ptr]; ptr += 1
    if w_full.sum() > 0: w_full /= w_full.sum()
    weights_list.append(w_full)
w_avg = np.mean(np.stack(weights_list, axis=0), axis=0)
w_avg = np.clip(w_avg, 0.05, 0.70); w_avg = w_avg / w_avg.sum()
blend_oof = (X_full * w_avg.reshape(1,-1)).sum(axis=1)
th = optimize_thresholds_generic(y[mask], blend_oof[mask])
oof_q = cohen_kappa_score(y[mask], np.digitize(blend_oof[mask], bins=[th[0], th[1], th[2], th[3]]), weights='quadratic')
print('LOFO-avg NNLS OOF QWK (with serx101):', f'{oof_q:.5f}', 'weights:', w_avg, 'keys:', keys_all, flush=True)
th_boot = bootstrap_thresholds_generic(y[mask], blend_oof[mask], n_bootstrap=200, base=th.tolist())
np.save('blend_weights_keys.npy', np.array(keys_all, dtype=object)); np.save('blend_weights_vals.npy', w_avg); np.save('thresholds_blend_boot.npy', th_boot)

# 4) Build blended TEST in same key order; hflip-only; EV clip [0,4]
test_list = []; used_keys = []
for k in keys_all:
    p = paths_te.get(k, None)
    if p and os.path.exists(p):
        test_list.append(np.load(p).reshape(-1)); used_keys.append(k)
    else:
        print('Missing test preds for', k, '->', p, flush=True)
if len(test_list) == 0:
    raise RuntimeError('No test preds found for final blend with serx101')
w_use = w_avg[:len(test_list)]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_use, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = np.digitize(blend_te, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('Final submission.csv written with SEResNeXt101 included.', flush=True)

Starting SEResNeXt101_32x8d@512 (3 folds) RRC+EMA...


Attempt 1: trying batch_size=8



===== [SEResNeXt101@512 RRC+EMA] Fold 0 =====




Epoch 1 it 50/329 tr_loss 0.8250 elapsed 42.0s


Epoch 1 it 100/329 tr_loss 0.6866 elapsed 83.5s


Epoch 1 it 150/329 tr_loss 0.5888 elapsed 125.0s


Epoch 1 it 200/329 tr_loss 0.5365 elapsed 166.6s


Epoch 1 it 250/329 tr_loss 0.4897 elapsed 208.3s


Epoch 1 it 300/329 tr_loss 0.4606 elapsed 249.8s


Epoch 1: tr_loss 0.4445 val_loss 0.8055 val_qwk(def) 0.0000 epoch_time 292.7s


Epoch 2 it 50/329 tr_loss 0.2537 elapsed 41.3s


Epoch 2 it 100/329 tr_loss 0.2859 elapsed 82.9s


Epoch 2 it 150/329 tr_loss 0.2707 elapsed 124.5s


Epoch 2 it 200/329 tr_loss 0.2609 elapsed 165.8s


Epoch 2 it 250/329 tr_loss 0.2579 elapsed 207.4s


Epoch 2 it 300/329 tr_loss 0.2521 elapsed 248.7s


Epoch 2: tr_loss 0.2507 val_loss 0.7796 val_qwk(def) 0.0000 epoch_time 291.7s


Epoch 3 it 50/329 tr_loss 0.1803 elapsed 41.2s


Epoch 3 it 100/329 tr_loss 0.1970 elapsed 82.8s


Epoch 3 it 150/329 tr_loss 0.2234 elapsed 124.3s


Epoch 3 it 200/329 tr_loss 0.2243 elapsed 165.5s


Epoch 3 it 250/329 tr_loss 0.2208 elapsed 207.0s


Epoch 3 it 300/329 tr_loss 0.2212 elapsed 248.5s


Epoch 3: tr_loss 0.2181 val_loss 0.7477 val_qwk(def) 0.0000 epoch_time 290.9s


Epoch 4 it 50/329 tr_loss 0.1709 elapsed 41.6s


Epoch 4 it 100/329 tr_loss 0.1659 elapsed 82.9s


Epoch 4 it 150/329 tr_loss 0.1704 elapsed 124.3s


Epoch 4 it 200/329 tr_loss 0.1739 elapsed 165.5s


Epoch 4 it 250/329 tr_loss 0.1869 elapsed 206.9s


Epoch 4 it 300/329 tr_loss 0.1864 elapsed 248.6s


Epoch 4: tr_loss 0.1861 val_loss 0.7163 val_qwk(def) 0.0000 epoch_time 291.2s


Epoch 5 it 50/329 tr_loss 0.1677 elapsed 41.6s


Epoch 5 it 100/329 tr_loss 0.1703 elapsed 82.7s


Epoch 5 it 150/329 tr_loss 0.1812 elapsed 124.1s


Epoch 5 it 200/329 tr_loss 0.1728 elapsed 165.6s


Epoch 5 it 250/329 tr_loss 0.1715 elapsed 206.7s


Epoch 5 it 300/329 tr_loss 0.1707 elapsed 248.2s


Epoch 5: tr_loss 0.1697 val_loss 0.6875 val_qwk(def) 0.0263 epoch_time 290.7s


Epoch 6 it 50/329 tr_loss 0.1573 elapsed 41.5s


Epoch 6 it 100/329 tr_loss 0.1593 elapsed 82.7s


Epoch 6 it 150/329 tr_loss 0.1545 elapsed 124.2s


Epoch 6 it 200/329 tr_loss 0.1532 elapsed 165.6s


Epoch 6 it 250/329 tr_loss 0.1592 elapsed 206.8s


Epoch 6 it 300/329 tr_loss 0.1596 elapsed 248.3s


Epoch 6: tr_loss 0.1637 val_loss 0.6568 val_qwk(def) 0.1139 epoch_time 290.8s


Epoch 7 it 50/329 tr_loss 0.1520 elapsed 41.5s


Epoch 7 it 100/329 tr_loss 0.1492 elapsed 82.7s


Epoch 7 it 150/329 tr_loss 0.1469 elapsed 124.1s


Epoch 7 it 200/329 tr_loss 0.1430 elapsed 165.6s


Epoch 7 it 250/329 tr_loss 0.1439 elapsed 206.7s


Epoch 7 it 300/329 tr_loss 0.1493 elapsed 248.2s


Epoch 7: tr_loss 0.1483 val_loss 0.6275 val_qwk(def) 0.2671 epoch_time 290.6s


Epoch 8 it 50/329 tr_loss 0.1261 elapsed 41.4s


Epoch 8 it 100/329 tr_loss 0.1211 elapsed 82.9s


Epoch 8 it 150/329 tr_loss 0.1223 elapsed 124.0s


Epoch 8 it 200/329 tr_loss 0.1308 elapsed 165.5s


Epoch 8 it 250/329 tr_loss 0.1358 elapsed 206.7s


Epoch 8 it 300/329 tr_loss 0.1417 elapsed 248.2s


Epoch 8: tr_loss 0.1445 val_loss 0.5946 val_qwk(def) 0.3566 epoch_time 290.7s


Epoch 9 it 50/329 tr_loss 0.1544 elapsed 41.4s


Epoch 9 it 100/329 tr_loss 0.1504 elapsed 82.9s


Epoch 9 it 150/329 tr_loss 0.1484 elapsed 123.9s


Epoch 9 it 200/329 tr_loss 0.1537 elapsed 165.2s


Epoch 9 it 250/329 tr_loss 0.1447 elapsed 206.6s


Epoch 9 it 300/329 tr_loss 0.1447 elapsed 247.8s


Epoch 9: tr_loss 0.1472 val_loss 0.5617 val_qwk(def) 0.3861 epoch_time 290.5s


Epoch 10 it 50/329 tr_loss 0.1302 elapsed 41.2s


Epoch 10 it 100/329 tr_loss 0.1374 elapsed 82.7s


Epoch 10 it 150/329 tr_loss 0.1252 elapsed 123.8s


Epoch 10 it 200/329 tr_loss 0.1205 elapsed 165.2s


Epoch 10 it 250/329 tr_loss 0.1208 elapsed 206.7s


Epoch 10 it 300/329 tr_loss 0.1247 elapsed 247.8s


Epoch 10: tr_loss 0.1259 val_loss 0.5258 val_qwk(def) 0.4231 epoch_time 290.5s


Epoch 11 it 50/329 tr_loss 0.0988 elapsed 41.1s


Epoch 11 it 100/329 tr_loss 0.1111 elapsed 82.5s


Epoch 11 it 150/329 tr_loss 0.1100 elapsed 123.9s


Epoch 11 it 200/329 tr_loss 0.1137 elapsed 165.0s


Epoch 11 it 250/329 tr_loss 0.1121 elapsed 206.4s


Epoch 11 it 300/329 tr_loss 0.1107 elapsed 247.5s


Epoch 11: tr_loss 0.1114 val_loss 0.4865 val_qwk(def) 0.4400 epoch_time 290.3s


Epoch 12 it 50/329 tr_loss 0.0989 elapsed 41.2s


Epoch 12 it 100/329 tr_loss 0.1011 elapsed 82.6s


Epoch 12 it 150/329 tr_loss 0.1028 elapsed 123.9s


Epoch 12 it 200/329 tr_loss 0.0977 elapsed 165.1s


Epoch 12 it 250/329 tr_loss 0.0952 elapsed 206.5s


Epoch 12 it 300/329 tr_loss 0.0959 elapsed 247.6s


Epoch 12: tr_loss 0.0947 val_loss 0.4425 val_qwk(def) 0.4865 epoch_time 290.4s


[SEResNeXt101@512] Fold 0 val QWK (default th): 0.4865


Attempt 1: trying batch_size=8



===== [SEResNeXt101@512 RRC+EMA] Fold 1 =====




Epoch 1 it 50/329 tr_loss 0.8943 elapsed 41.5s


Epoch 1 it 100/329 tr_loss 0.7737 elapsed 83.2s


Epoch 1 it 150/329 tr_loss 0.6425 elapsed 124.6s


Epoch 1 it 200/329 tr_loss 0.5597 elapsed 166.4s


Epoch 1 it 250/329 tr_loss 0.5053 elapsed 208.0s


Epoch 1 it 300/329 tr_loss 0.4672 elapsed 249.5s


Epoch 1: tr_loss 0.4502 val_loss 0.8726 val_qwk(def) 0.0000 epoch_time 292.4s


Epoch 2 it 50/329 tr_loss 0.2823 elapsed 41.3s


Epoch 2 it 100/329 tr_loss 0.2788 elapsed 83.0s


Epoch 2 it 150/329 tr_loss 0.2804 elapsed 124.5s


Epoch 2 it 200/329 tr_loss 0.2762 elapsed 165.6s


Epoch 2 it 250/329 tr_loss 0.2638 elapsed 207.0s


Epoch 2 it 300/329 tr_loss 0.2589 elapsed 248.1s


Epoch 2: tr_loss 0.2629 val_loss 0.8442 val_qwk(def) 0.0000 epoch_time 290.8s


Epoch 3 it 50/329 tr_loss 0.2046 elapsed 41.1s


Epoch 3 it 100/329 tr_loss 0.2027 elapsed 82.5s


Epoch 3 it 150/329 tr_loss 0.2068 elapsed 123.9s


Epoch 3 it 200/329 tr_loss 0.2084 elapsed 164.9s


Epoch 3 it 250/329 tr_loss 0.2054 elapsed 206.3s


Epoch 3 it 300/329 tr_loss 0.1995 elapsed 247.7s


Epoch 3: tr_loss 0.2071 val_loss 0.8148 val_qwk(def) 0.0000 epoch_time 290.2s


Epoch 4 it 50/329 tr_loss 0.1526 elapsed 41.3s


Epoch 4 it 100/329 tr_loss 0.1943 elapsed 82.3s


Epoch 4 it 150/329 tr_loss 0.1981 elapsed 123.7s


Epoch 4 it 200/329 tr_loss 0.2036 elapsed 164.7s


Epoch 4 it 250/329 tr_loss 0.1994 elapsed 205.9s


Epoch 4 it 300/329 tr_loss 0.1983 elapsed 247.2s


Epoch 4: tr_loss 0.1974 val_loss 0.7817 val_qwk(def) 0.0000 epoch_time 289.6s


Epoch 5 it 50/329 tr_loss 0.1615 elapsed 41.3s


Epoch 5 it 100/329 tr_loss 0.1668 elapsed 82.3s


Epoch 5 it 150/329 tr_loss 0.1745 elapsed 123.4s


Epoch 5 it 200/329 tr_loss 0.1800 elapsed 164.4s


Epoch 5 it 250/329 tr_loss 0.1779 elapsed 205.7s


Epoch 5 it 300/329 tr_loss 0.1763 elapsed 246.9s


Epoch 5: tr_loss 0.1755 val_loss 0.7444 val_qwk(def) 0.0009 epoch_time 289.3s


Epoch 6 it 50/329 tr_loss 0.1804 elapsed 41.3s


Epoch 6 it 100/329 tr_loss 0.1590 elapsed 82.3s


Epoch 6 it 150/329 tr_loss 0.1670 elapsed 123.5s


Epoch 6 it 200/329 tr_loss 0.1719 elapsed 164.4s


Epoch 6 it 250/329 tr_loss 0.1683 elapsed 205.7s


Epoch 6 it 300/329 tr_loss 0.1699 elapsed 246.8s


Epoch 6: tr_loss 0.1669 val_loss 0.7024 val_qwk(def) 0.0226 epoch_time 289.2s


Epoch 7 it 50/329 tr_loss 0.1481 elapsed 41.2s


Epoch 7 it 100/329 tr_loss 0.1415 elapsed 82.1s


Epoch 7 it 150/329 tr_loss 0.1578 elapsed 123.3s


Epoch 7 it 200/329 tr_loss 0.1623 elapsed 164.4s


Epoch 7 it 250/329 tr_loss 0.1565 elapsed 205.3s


Epoch 7 it 300/329 tr_loss 0.1585 elapsed 246.6s


Epoch 7: tr_loss 0.1602 val_loss 0.6597 val_qwk(def) 0.1382 epoch_time 288.9s


Epoch 8 it 50/329 tr_loss 0.1432 elapsed 41.2s


Epoch 8 it 100/329 tr_loss 0.1577 elapsed 82.1s


Epoch 8 it 150/329 tr_loss 0.1647 elapsed 123.3s


Epoch 8 it 200/329 tr_loss 0.1620 elapsed 164.4s


Epoch 8 it 250/329 tr_loss 0.1646 elapsed 205.4s


Epoch 8 it 300/329 tr_loss 0.1571 elapsed 246.5s


Epoch 8: tr_loss 0.1520 val_loss 0.6137 val_qwk(def) 0.2946 epoch_time 288.8s


Epoch 9 it 50/329 tr_loss 0.1134 elapsed 41.2s


Epoch 9 it 100/329 tr_loss 0.1134 elapsed 82.0s


Epoch 9 it 150/329 tr_loss 0.1173 elapsed 123.3s


Epoch 9 it 200/329 tr_loss 0.1169 elapsed 164.5s


Epoch 9 it 250/329 tr_loss 0.1242 elapsed 205.4s


Epoch 9 it 300/329 tr_loss 0.1233 elapsed 246.4s


Epoch 9: tr_loss 0.1261 val_loss 0.5686 val_qwk(def) 0.3698 epoch_time 288.7s


Epoch 10 it 50/329 tr_loss 0.0927 elapsed 41.2s


Epoch 10 it 100/329 tr_loss 0.1016 elapsed 82.2s


Epoch 10 it 150/329 tr_loss 0.1017 elapsed 123.3s


Epoch 10 it 200/329 tr_loss 0.1022 elapsed 164.4s


Epoch 10 it 250/329 tr_loss 0.1069 elapsed 205.2s


Epoch 10 it 300/329 tr_loss 0.1085 elapsed 246.3s


Epoch 10: tr_loss 0.1114 val_loss 0.5209 val_qwk(def) 0.4117 epoch_time 288.6s


Epoch 11 it 50/329 tr_loss 0.0822 elapsed 41.3s


Epoch 11 it 100/329 tr_loss 0.0997 elapsed 82.4s


Epoch 11 it 150/329 tr_loss 0.0995 elapsed 123.1s


Epoch 11 it 200/329 tr_loss 0.1014 elapsed 164.3s


Epoch 11 it 250/329 tr_loss 0.1083 elapsed 205.2s


Epoch 11 it 300/329 tr_loss 0.1090 elapsed 246.5s


Epoch 11: tr_loss 0.1091 val_loss 0.4733 val_qwk(def) 0.4413 epoch_time 289.0s


Epoch 12 it 50/329 tr_loss 0.0986 elapsed 41.3s


Epoch 12 it 100/329 tr_loss 0.0984 elapsed 82.6s


Epoch 12 it 150/329 tr_loss 0.1003 elapsed 123.4s


Epoch 12 it 200/329 tr_loss 0.1014 elapsed 164.6s


Epoch 12 it 250/329 tr_loss 0.0974 elapsed 205.4s


Epoch 12 it 300/329 tr_loss 0.0977 elapsed 246.7s


Epoch 12: tr_loss 0.0964 val_loss 0.4267 val_qwk(def) 0.4613 epoch_time 289.1s


[SEResNeXt101@512] Fold 1 val QWK (default th): 0.4613


Attempt 1: trying batch_size=8



===== [SEResNeXt101@512 RRC+EMA] Fold 2 =====




Epoch 1 it 50/329 tr_loss 0.8474 elapsed 41.4s


Epoch 1 it 100/329 tr_loss 0.7344 elapsed 83.0s


Epoch 1 it 150/329 tr_loss 0.6068 elapsed 124.2s


Epoch 1 it 200/329 tr_loss 0.5429 elapsed 165.9s


Epoch 1 it 250/329 tr_loss 0.4893 elapsed 207.8s


Epoch 1 it 300/329 tr_loss 0.4578 elapsed 249.2s


Epoch 1: tr_loss 0.4409 val_loss 0.8243 val_qwk(def) 0.0000 epoch_time 292.2s


Epoch 2 it 50/329 tr_loss 0.2779 elapsed 41.4s


Epoch 2 it 100/329 tr_loss 0.2509 elapsed 83.2s


Epoch 2 it 150/329 tr_loss 0.2536 elapsed 124.6s


Epoch 2 it 200/329 tr_loss 0.2538 elapsed 166.2s


Epoch 2 it 250/329 tr_loss 0.2570 elapsed 207.8s


Epoch 2 it 300/329 tr_loss 0.2541 elapsed 249.1s


Epoch 2: tr_loss 0.2534 val_loss 0.7957 val_qwk(def) 0.0000 epoch_time 292.1s


Epoch 3 it 50/329 tr_loss 0.2133 elapsed 41.4s


Epoch 3 it 100/329 tr_loss 0.2104 elapsed 83.0s


Epoch 3 it 150/329 tr_loss 0.2182 elapsed 124.5s


Epoch 3 it 200/329 tr_loss 0.2233 elapsed 165.8s


Epoch 3 it 250/329 tr_loss 0.2245 elapsed 207.3s


Epoch 3 it 300/329 tr_loss 0.2221 elapsed 248.6s


Epoch 3: tr_loss 0.2205 val_loss 0.7608 val_qwk(def) 0.0000 epoch_time 291.6s


Epoch 4 it 50/329 tr_loss 0.1852 elapsed 41.4s


Epoch 4 it 100/329 tr_loss 0.2075 elapsed 82.9s


Epoch 4 it 150/329 tr_loss 0.2028 elapsed 124.1s


Epoch 4 it 200/329 tr_loss 0.2005 elapsed 165.6s


Epoch 4 it 250/329 tr_loss 0.1962 elapsed 207.2s


Epoch 4 it 300/329 tr_loss 0.1936 elapsed 248.4s


Epoch 4: tr_loss 0.1941 val_loss 0.7255 val_qwk(def) 0.0000 epoch_time 291.4s


Epoch 5 it 50/329 tr_loss 0.1561 elapsed 41.2s


Epoch 5 it 100/329 tr_loss 0.1651 elapsed 82.7s


Epoch 5 it 150/329 tr_loss 0.1655 elapsed 124.2s


Epoch 5 it 200/329 tr_loss 0.1691 elapsed 165.4s


Epoch 5 it 250/329 tr_loss 0.1747 elapsed 207.0s


Epoch 5 it 300/329 tr_loss 0.1773 elapsed 248.2s


Epoch 5: tr_loss 0.1790 val_loss 0.6932 val_qwk(def) 0.0283 epoch_time 291.2s


Epoch 6 it 50/329 tr_loss 0.1404 elapsed 41.3s


Epoch 6 it 100/329 tr_loss 0.1479 elapsed 82.8s


Epoch 6 it 150/329 tr_loss 0.1530 elapsed 124.3s


Epoch 6 it 200/329 tr_loss 0.1623 elapsed 165.4s


Epoch 6 it 250/329 tr_loss 0.1662 elapsed 206.9s


Epoch 6 it 300/329 tr_loss 0.1687 elapsed 248.0s


Epoch 6: tr_loss 0.1683 val_loss 0.6584 val_qwk(def) 0.1521 epoch_time 290.9s


Epoch 7 it 50/329 tr_loss 0.1588 elapsed 41.2s


Epoch 7 it 100/329 tr_loss 0.1475 elapsed 82.6s


Epoch 7 it 150/329 tr_loss 0.1525 elapsed 123.9s


Epoch 7 it 200/329 tr_loss 0.1609 elapsed 165.0s


Epoch 7 it 250/329 tr_loss 0.1567 elapsed 206.5s


Epoch 7 it 300/329 tr_loss 0.1592 elapsed 247.7s


Epoch 7: tr_loss 0.1581 val_loss 0.6236 val_qwk(def) 0.3094 epoch_time 290.6s


Epoch 8 it 50/329 tr_loss 0.1664 elapsed 41.2s


Epoch 8 it 100/329 tr_loss 0.1431 elapsed 82.6s


Epoch 8 it 150/329 tr_loss 0.1434 elapsed 124.1s


Epoch 8 it 200/329 tr_loss 0.1455 elapsed 165.2s


Epoch 8 it 250/329 tr_loss 0.1513 elapsed 206.7s


Epoch 8 it 300/329 tr_loss 0.1507 elapsed 248.1s


Epoch 8: tr_loss 0.1474 val_loss 0.5892 val_qwk(def) 0.3731 epoch_time 290.7s


Epoch 9 it 50/329 tr_loss 0.1135 elapsed 41.2s


Epoch 9 it 100/329 tr_loss 0.1305 elapsed 82.9s


Epoch 9 it 150/329 tr_loss 0.1277 elapsed 124.3s


Epoch 9 it 200/329 tr_loss 0.1286 elapsed 165.5s


Epoch 9 it 250/329 tr_loss 0.1337 elapsed 207.0s


Epoch 9 it 300/329 tr_loss 0.1334 elapsed 248.5s


Epoch 9: tr_loss 0.1354 val_loss 0.5543 val_qwk(def) 0.3917 epoch_time 291.0s


Epoch 10 it 50/329 tr_loss 0.1058 elapsed 41.5s


Epoch 10 it 100/329 tr_loss 0.1143 elapsed 82.7s


Epoch 10 it 150/329 tr_loss 0.1195 elapsed 124.1s


Epoch 10 it 200/329 tr_loss 0.1203 elapsed 165.3s


Epoch 10 it 250/329 tr_loss 0.1198 elapsed 206.7s


Epoch 10 it 300/329 tr_loss 0.1237 elapsed 248.1s


Epoch 10: tr_loss 0.1252 val_loss 0.5090 val_qwk(def) 0.4081 epoch_time 290.6s


Epoch 11 it 50/329 tr_loss 0.1070 elapsed 41.5s


Epoch 11 it 100/329 tr_loss 0.1126 elapsed 82.7s


Epoch 11 it 150/329 tr_loss 0.1086 elapsed 124.0s


Epoch 11 it 200/329 tr_loss 0.1075 elapsed 165.1s


Epoch 11 it 250/329 tr_loss 0.1065 elapsed 206.6s


Epoch 11 it 300/329 tr_loss 0.1057 elapsed 248.1s


Epoch 11: tr_loss 0.1089 val_loss 0.4677 val_qwk(def) 0.4273 epoch_time 290.7s


Epoch 12 it 50/329 tr_loss 0.0808 elapsed 41.5s


Epoch 12 it 100/329 tr_loss 0.0817 elapsed 82.7s


Epoch 12 it 150/329 tr_loss 0.0848 elapsed 124.1s


Epoch 12 it 200/329 tr_loss 0.0880 elapsed 165.3s


Epoch 12 it 250/329 tr_loss 0.0920 elapsed 206.6s


Epoch 12 it 300/329 tr_loss 0.0938 elapsed 248.1s


Epoch 12: tr_loss 0.0925 val_loss 0.4240 val_qwk(def) 0.4542 epoch_time 290.8s


[SEResNeXt101@512] Fold 2 val QWK (default th): 0.4542


LOFO-avg NNLS OOF QWK (with serx101): 0.87925 weights: [0.10734698 0.15107517 0.25412166 0.09858538 0.17259916 0.05223563
 0.06604996 0.04899303 0.04899303] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'r200d_512_rrcema', 'convnextb_512_rrcema', 'serx101_512_rrcema']


Bootstrap 40/200


Bootstrap 80/200


Bootstrap 120/200


Bootstrap 160/200


Bootstrap 200/200


Final submission.csv written with SEResNeXt101 included.


In [32]:
# Recreate best LOFO-avg NNLS blend (exclude serx101) and write final submission
import os, numpy as np, pandas as pd
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Rebuilding best LOFO-avg NNLS (exclude serx101)...', flush=True)
keys_fixed = ['b4_512','b5_512','b5_512_rrcema','b4_640','serx50_512_rrcema','serx50_512_rrcema_s2','r200d_512_rrcema','convnextb_512_rrcema']
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt): y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

loaded_keys = []; arrs = [];
for k in keys_fixed:
    p = paths_oof[k]
    if os.path.exists(p):
        a = np.load(p).reshape(-1); loaded_keys.append(k); arrs.append(a)
    else:
        print('Missing OOF for', k, '->', p, flush=True)
if len(arrs) < 3: raise RuntimeError('Too few OOF arrays for NNLS')

X_full = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)

def fit_nnls_masked(y, X, mask, lo=0.05, hi=0.70, serx_pair=('serx50_512_rrcema','serx50_512_rrcema_s2'), serx_cap_sum=0.30, keys=None):
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, lo, hi); w = w / w.sum()
    if keys is not None and all(k in keys for k in serx_pair):
        i0 = keys.index(serx_pair[0]); i1 = keys.index(serx_pair[1])
        s = w[i0] + w[i1]
        if s > serx_cap_sum:
            scale = serx_cap_sum / s
            w[i0] *= scale; w[i1] *= scale
            rem = 1.0 - (w[i0] + w[i1])
            idx_others = [i for i in range(len(w)) if i not in (i0,i1)]
            s_others = w[idx_others].sum() + 1e-9
            w[idx_others] = w[idx_others] / s_others * rem
    return w

# LOFO-averaged NNLS
keys_all = loaded_keys.copy()
weights_list = []
w_all = fit_nnls_masked(y, X_full, mask, lo=0.05, hi=0.70, keys=keys_all)
weights_list.append(w_all)
for i in range(len(keys_all)):
    keep_idx = [j for j in range(len(keys_all)) if j != i]
    X_lofo = X_full[:, keep_idx]
    keys_lofo = [keys_all[j] for j in keep_idx]
    w_lofo = fit_nnls_masked(y, X_lofo, mask, lo=0.05, hi=0.70, keys=keys_lofo)
    w_full = np.zeros(len(keys_all), dtype=float)
    ptr = 0
    for j in range(len(keys_all)):
        if j == i: continue
        w_full[j] = w_lofo[ptr]; ptr += 1
    if w_full.sum() > 0: w_full /= w_full.sum()
    weights_list.append(w_full)
w_avg = np.mean(np.stack(weights_list, axis=0), axis=0)
w_avg = np.clip(w_avg, 0.05, 0.70); w_avg = w_avg / w_avg.sum()

blend_oof = (X_full * w_avg.reshape(1,-1)).sum(axis=1)

def optimize_thresholds_generic(y_true, p, init=[0.5,1.5,2.5,3.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], th[3]])
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

th = optimize_thresholds_generic(y[mask], blend_oof[mask])
oof_q = cohen_kappa_score(y[mask], np.digitize(blend_oof[mask], bins=[th[0], th[1], th[2], th[3]]), weights='quadratic')
print('Best-set LOFO-avg NNLS OOF QWK:', f'{oof_q:.5f}', 'weights:', w_avg, 'keys:', keys_all, flush=True)

def bootstrap_thresholds_generic(y_true, p, n_bootstrap=200, base=None):
    if base is None: base = [0.5,1.5,2.5,3.5]
    N = len(y_true); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        thb = optimize_thresholds_generic(y_true[idx], p[idx], init=base)
        cuts.append(thb)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

th_boot = bootstrap_thresholds_generic(y[mask], blend_oof[mask], n_bootstrap=200, base=th.tolist())
np.save('blend_weights_keys.npy', np.array(keys_all, dtype=object))
np.save('blend_weights_vals.npy', w_avg)
np.save('thresholds_blend_boot.npy', th_boot)

# Build blended TEST with same key order
test_list = []; used_keys = []
for k in keys_all:
    p = paths_te[k]
    if os.path.exists(p):
        test_list.append(np.load(p).reshape(-1)); used_keys.append(k)
    else:
        print('Missing test preds for', k, '->', p, flush=True)
if len(test_list) == 0: raise RuntimeError('No test preds for final blend')
w_use = w_avg[:len(test_list)]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_use, test_list): blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = np.digitize(blend_te, bins=[th_boot[0], th_boot[1], th_boot[2], th_boot[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('Final submission.csv written with best-set (no serx101).', flush=True)

Rebuilding best LOFO-avg NNLS (exclude serx101)...


Best-set LOFO-avg NNLS OOF QWK: 0.88284 weights: [0.11385803 0.15888844 0.26617223 0.10379826 0.18198154 0.05618721
 0.06956592 0.04954837] keys: ['b4_512', 'b5_512', 'b5_512_rrcema', 'b4_640', 'serx50_512_rrcema', 'serx50_512_rrcema_s2', 'r200d_512_rrcema', 'convnextb_512_rrcema']


Bootstrap 40/200


Bootstrap 80/200


Bootstrap 120/200


Bootstrap 160/200


Bootstrap 200/200


Final submission.csv written with best-set (no serx101).


In [33]:
# Subset selection: try top-K model subsets (by NNLS weight) for LOFO-avg NNLS; pick best OOF and write submission
import os, numpy as np, pandas as pd
from scipy.optimize import nnls, minimize
from sklearn.metrics import cohen_kappa_score

print('Subset LOFO-avg NNLS search over top-K models...', flush=True)
keys_all = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w_all = np.load('blend_weights_vals.npy')
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None:
    raise RuntimeError('OOF targets not found')

# Load OOF arrays
oof_map = {}
for k in keys_all:
    p = paths_oof.get(k, None)
    if p and os.path.exists(p):
        oof_map[k] = np.load(p).reshape(-1)

def fit_nnls_masked(y, X, mask, lo=0.05, hi=0.70):
    w, _ = nnls(X[mask], y[mask].astype(float))
    if w.sum() > 0: w = w / w.sum()
    w = np.clip(w, lo, hi); w = w / w.sum()
    return w

def optimize_thresholds_generic(y, p, init=[0.5,1.5,2.5,3.5]):
    y = np.asarray(y).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], th[3]])
        return -cohen_kappa_score(y, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1500, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

def bootstrap_thresholds_generic(y, p, n_bootstrap=160, base=None):
    if base is None: base = [0.5,1.5,2.5,3.5]
    N = len(y); cuts = []
    for i in range(n_bootstrap):
        idx = np.random.randint(0, N, size=N)
        thb = optimize_thresholds_generic(y[idx], p[idx], init=base)
        cuts.append(thb)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n_bootstrap}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

# Order keys by current weights (desc)
order_idx = np.argsort(w_all)[::-1]
keys_sorted = [keys_all[i] for i in order_idx]
w_sorted = w_all[order_idx]
print('Weight order:', list(zip(keys_sorted, w_sorted)), flush=True)

best = (-1.0, None, None, None, None)  # (qwk, keys_used, weights, th_boot, blend_oof)
for K in [4,5,6,7,8]:
    use = [k for k in keys_sorted[:K] if k in oof_map]
    if len(use) < 3: continue
    # Build X and mask
    arrs = [oof_map[k] for k in use]
    mask = np.isfinite(y).copy()
    for a in arrs: mask &= np.isfinite(a)
    X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
    # Fit NNLS with caps
    w = fit_nnls_masked(y, X, mask, lo=0.05, hi=0.70)
    blend_oof = (X * w.reshape(1,-1)).sum(axis=1)
    th = optimize_thresholds_generic(y[mask], blend_oof[mask])
    oof_q = cohen_kappa_score(y[mask], np.digitize(blend_oof[mask], bins=[th[0], th[1], th[2], th[3]]), weights='quadratic')
    print(f'Top-{K} keys {use} -> OOF QWK {oof_q:.5f}', flush=True)
    if oof_q > best[0]:
        th_boot = bootstrap_thresholds_generic(y[mask], blend_oof[mask], n_bootstrap=160, base=th.tolist())
        best = (oof_q, use, w, th_boot, blend_oof)

best_q, best_keys, best_w, best_th, best_oof = best
print('Best subset OOF QWK:', f'{best_q:.5f}', 'keys:', best_keys, 'weights:', best_w, flush=True)
np.save('blend_weights_keys.npy', np.array(best_keys, dtype=object)); np.save('blend_weights_vals.npy', best_w); np.save('thresholds_blend_boot.npy', best_th)

# Build test blend in best key order
test_list = []; used = []
for k in best_keys:
    p = paths_te.get(k, None)
    if p and os.path.exists(p):
        test_list.append(np.load(p).reshape(-1)); used.append(k)
    else:
        print('Missing test preds for', k, '->', p, flush=True)
if len(test_list) == 0:
    raise RuntimeError('No test preds for best subset')
w_use = best_w[:len(test_list)]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_use, test_list): blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = np.digitize(blend_te, bins=[best_th[0], best_th[1], best_th[2], best_th[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('Final submission.csv written with best subset blend.', flush=True)

Subset LOFO-avg NNLS search over top-K models...


Weight order: [('b5_512_rrcema', 0.2661722288708696), ('serx50_512_rrcema', 0.18198154293406468), ('b5_512', 0.158888441188999), ('b4_512', 0.11385802994503175), ('b4_640', 0.1037982605635069), ('r200d_512_rrcema', 0.06956591954477008), ('serx50_512_rrcema_s2', 0.05618720830717532), ('convnextb_512_rrcema', 0.0495483686455826)]


Top-4 keys ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512'] -> OOF QWK 0.88240


Bootstrap 40/160


Bootstrap 80/160


Bootstrap 120/160


Bootstrap 160/160


Top-5 keys ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512', 'b4_640'] -> OOF QWK 0.88339


Bootstrap 40/160


Bootstrap 80/160


Bootstrap 120/160


Bootstrap 160/160


Top-6 keys ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512', 'b4_640', 'r200d_512_rrcema'] -> OOF QWK 0.88010


Top-7 keys ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512', 'b4_640', 'r200d_512_rrcema', 'serx50_512_rrcema_s2'] -> OOF QWK 0.88248


Top-8 keys ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512', 'b4_640', 'r200d_512_rrcema', 'serx50_512_rrcema_s2', 'convnextb_512_rrcema'] -> OOF QWK 0.88172


Best subset OOF QWK: 0.88339 keys: ['b5_512_rrcema', 'serx50_512_rrcema', 'b5_512', 'b4_512', 'b4_640'] weights: [0.3262021  0.24090006 0.18800059 0.1268179  0.11807935]


Final submission.csv written with best subset blend.


In [34]:
# Final thresholds refinement: heavier bootstrap (n=320) and optional th3 +0.05 tweak; rebuild submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Refining thresholds with heavier bootstrap and optional th3 tweak...', flush=True)

# Load best subset keys/weights from previous step
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

# Paths
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build OOF matrix for current best keys
arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds(y_true, p, init=[0.5,1.5,2.5,3.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = preds_to_classes(p, th)
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2500, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

th0 = optimize_thresholds(y[mask], blend_oof[mask])
q0 = cohen_kappa_score(y[mask], preds_to_classes(blend_oof[mask], th0), weights='quadratic')
print('Base th OOF QWK:', f'{q0:.5f}', 'th:', th0, flush=True)

def bootstrap_thresholds(y_true, p, base, n=320):
    N = len(y_true); cuts = []
    for i in range(n):
        idx = np.random.randint(0, N, size=N)
        thb = optimize_thresholds(y_true[idx], p[idx], init=base)
        cuts.append(thb)
        if (i+1)%40==0: print(f'Bootstrap {i+1}/{n}', flush=True)
    cuts = np.array(cuts); med = np.median(cuts, axis=0); med = np.clip(med, 0.3, 3.7)
    for j in range(1,4):
        if med[j]-med[j-1] < 0.12: med[j] = med[j-1] + 0.12
    base_arr = np.array([0.5,1.5,2.5,3.5])
    return 0.9*med + 0.1*base_arr

th_boot = bootstrap_thresholds(y[mask], blend_oof[mask], base=th0, n=320)
q_boot = cohen_kappa_score(y[mask], preds_to_classes(blend_oof[mask], th_boot), weights='quadratic')
print('Bootstrapped th OOF QWK:', f'{q_boot:.5f}', 'th_boot:', th_boot, flush=True)

# Optional final stabilizer: try +0.05 on last threshold (th3)
th_tweak = th_boot.copy()
th_tweak[3] = min(3.7, th_tweak[3] + 0.05)
q_tweak = cohen_kappa_score(y[mask], preds_to_classes(blend_oof[mask], th_tweak), weights='quadratic')
use_th = th_tweak if q_tweak > q_boot else th_boot
print('Chosen OOF QWK:', f"{(q_tweak if q_tweak > q_boot else q_boot):.5f}", 'using th:', use_th, flush=True)

# Build test blend with same keys/weights
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
if w_te.sum() <= 0: w_te = np.ones_like(w_te)/len(w_te)
w_te = w_te / w_te.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = np.digitize(blend_te, bins=[use_th[0], use_th[1], use_th[2], use_th[3]]).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_boot.npy', use_th)
print('Final submission.csv written with refined thresholds.', flush=True)

Refining thresholds with heavier bootstrap and optional th3 tweak...


Base th OOF QWK: 0.88339 th: [0.55947401 1.48773105 2.45901277 3.04202706]


Bootstrap 40/320


Bootstrap 80/320


Bootstrap 120/320


Bootstrap 160/320


Bootstrap 200/320


Bootstrap 240/320


Bootstrap 280/320


Bootstrap 320/320


Bootstrapped th OOF QWK: 0.88279 th_boot: [0.55416249 1.48895794 2.46311149 3.08782436]


Chosen OOF QWK: 0.88279 using th: [0.55416249 1.48895794 2.46311149 3.08782436]


Final submission.csv written with refined thresholds.


In [None]:
# Rebuild submission using base optimized thresholds (no bootstrap) for best top-5 subset
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Writing submission with base optimized thresholds (no bootstrap)...', flush=True)
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# OOF blend for base thresholds
arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]; w_use = w_use / (w_use.sum() if w_use.sum() > 0 else len(w_use))
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds(y_true, p, init=[0.5,1.5,2.5,3.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = preds_to_classes(p, th)
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000,'xatol':1e-3,'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

th_base = optimize_thresholds(y[mask], blend_oof[mask])
oof_q = cohen_kappa_score(y[mask], preds_to_classes(blend_oof[mask], th_base), weights='quadratic')
print('Base thresholds OOF QWK:', f'{oof_q:.5f}', 'th_base:', th_base, flush=True)

# Build test blend and write submission
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
w_te = w_te / (w_te.sum() if w_te.sum() > 0 else len(w_te))
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = preds_to_classes(blend_te, th_base).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_boot.npy', th_base)
print('Final submission.csv written with base optimized thresholds.', flush=True)

In [35]:
# Write submission using base-optimized thresholds (no bootstrap) for current best subset keys/weights
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Building submission with base thresholds (no bootstrap) for best subset...', flush=True)
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build OOF blend
arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds(y_true, p, init=[0.5,1.5,2.5,3.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        cls = preds_to_classes(p, th)
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':2000, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

th_base = optimize_thresholds(y[mask], blend_oof[mask])
oof_q = cohen_kappa_score(y[mask], preds_to_classes(blend_oof[mask], th_base), weights='quadratic')
print('Base thresholds OOF QWK:', f'{oof_q:.5f}', 'th_base:', th_base, flush=True)

# Build test blend with same keys/weights
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
if w_te.sum() <= 0: w_te = np.ones_like(w_te)/len(w_te)
w_te = w_te / w_te.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = preds_to_classes(blend_te, th_base).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_boot.npy', th_base)
print('Final submission.csv written with base thresholds for best subset.', flush=True)

Building submission with base thresholds (no bootstrap) for best subset...


Base thresholds OOF QWK: 0.88339 th_base: [0.55947401 1.48773105 2.45901277 3.04202706]


Final submission.csv written with base thresholds for best subset.


In [36]:
# 1-D grid search on th3 for the current best top-5 subset blend; write submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score

print('Starting 1-D grid search on th3 for best subset blend...', flush=True)

# Load best subset keys/weights determined earlier (Cell 23)
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

# Paths
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

# Targets
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1)
        break
if y is None:
    raise RuntimeError('OOF targets not found')

# Build OOF EV blend for current best keys
arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs:
    mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0:
    w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

from scipy.optimize import minimize
def optimize_th012(y_true, p, init=[0.5,1.5,2.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th3):
        th0, th1, th2 = th3[0], th3[1], th3[2]
        th = np.array([th0, th1, th2])
        th = np.sort(th)
        th = np.clip(th, 0.3, 3.7)
        # enforce minimum gaps
        th[1] = max(th[1], th[0] + 0.12)
        th[2] = max(th[2], th[1] + 0.12)
        cls = np.digitize(p, bins=[th[0], th[1], th[2], 3.7])  # temporary th3 high
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1200, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    th = np.clip(th, 0.3, 3.7)
    th[1] = max(th[1], th[0] + 0.12)
    th[2] = max(th[2], th[1] + 0.12)
    return th

# First get good th0, th1, th2 (fix th3 large during this optimization)
th012 = optimize_th012(y[mask], blend_oof[mask], init=[0.55, 1.49, 2.46])
th2_min = th012[2] + 0.12

# Now 1-D grid over th3
grid = np.arange(th2_min, 3.701, 0.005)
best_q = -1.0
best_th3 = None
for i, th3 in enumerate(grid):
    th = np.array([th012[0], th012[1], th012[2], th3], dtype=float)
    cls = preds_to_classes(blend_oof[mask], th)
    q = cohen_kappa_score(y[mask], cls, weights='quadratic')
    if q > best_q:
        best_q = q; best_th3 = th3
    if (i+1) % 50 == 0:
        print(f'Grid {i+1}/{len(grid)} th3={th3:.3f} best_q={best_q:.5f}', flush=True)

best_th = np.array([th012[0], th012[1], th012[2], best_th3], dtype=float)
print('Best OOF QWK from th3 grid:', f'{best_q:.5f}', 'best_th:', best_th, flush=True)

# Build TEST EV blend using same keys/weights
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
if w_te.sum() <= 0:
    w_te = np.ones_like(w_te)/len(w_te)
w_te = w_te / w_te.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = preds_to_classes(blend_te, best_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_grid.npy', best_th)
print('submission.csv written with th3-grid-optimized thresholds.', flush=True)

Starting 1-D grid search on th3 for best subset blend...


Grid 50/231 th3=2.791 best_q=0.88141


Grid 100/231 th3=3.041 best_q=0.88463


Grid 150/231 th3=3.291 best_q=0.88463


Grid 200/231 th3=3.541 best_q=0.88463


Best OOF QWK from th3 grid: 0.88463 best_th: [0.5697338  1.49034491 2.42640278 3.04140278]


submission.csv written with th3-grid-optimized thresholds.


In [37]:
# 2-D grid search around th2 and th3 for the best top-5 subset blend; write submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score

print('Starting 2-D grid search around th2 & th3 for best subset blend...', flush=True)

# Load keys/weights from best subset
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

# Paths
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

# Targets
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build OOF EV blend
arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

from scipy.optimize import minimize
def optimize_th01(y_true, p, init=[0.55, 1.49]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        t0, t1 = th[0], th[1]
        t0 = float(np.clip(t0, 0.3, 1.6))
        t1 = max(float(np.clip(t1, 0.8, 2.2)), t0 + 0.12)
        cls = np.digitize(p, bins=[t0, t1, 3.7, 3.7])
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':800, 'xatol':1e-3, 'fatol':1e-3})
    t0 = float(np.clip(res.x[0], 0.3, 1.6))
    t1 = float(max(np.clip(res.x[1], 0.8, 2.2), t0 + 0.12))
    return np.array([t0, t1], dtype=float)

# Get good th0, th1 from quick opt
t01 = optimize_th01(y[mask], blend_oof[mask], init=[0.57, 1.49])

# Seed th2/th3 from prior results if available
seed_th_path = 'thresholds_blend_grid.npy' if os.path.exists('thresholds_blend_grid.npy') else ('thresholds_blend_boot.npy' if os.path.exists('thresholds_blend_boot.npy') else None)
if seed_th_path:
    seed_th = np.load(seed_th_path)
    th2_c = float(seed_th[2]); th3_c = float(seed_th[3])
else:
    th2_c = 2.46; th3_c = 3.04

# Define 2D search windows
th2_lo = max(t01[1] + 0.12, th2_c - 0.18)
th2_hi = min(3.50, th2_c + 0.18)
th3_lo_base = th3_c - 0.18
th3_hi = 3.70
grid_step = 0.005
th2_vals = np.arange(th2_lo, th2_hi + 1e-9, grid_step)

best_q = -1.0
best_th = None
cnt = 0
for t2 in th2_vals:
    t3_lo = max(t2 + 0.12, th3_lo_base)
    t3_vals = np.arange(t3_lo, th3_hi + 1e-9, grid_step)
    for t3 in t3_vals:
        th = np.array([t01[0], t01[1], t2, t3], dtype=float)
        cls = preds_to_classes(blend_oof[mask], th)
        q = cohen_kappa_score(y[mask], cls, weights='quadratic')
        cnt += 1
        if q > best_q:
            best_q = q; best_th = th.copy()
    if cnt % 2000 == 0:
        print(f'Checked {cnt} combos; current best_q={best_q:.5f} best_th={best_th}', flush=True)

print('Best OOF QWK from 2D grid:', f'{best_q:.5f}', 'best_th:', best_th, flush=True)
np.save('thresholds_blend_grid2d.npy', best_th)

# Build TEST EV blend using same keys/weights
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
if w_te.sum() <= 0: w_te = np.ones_like(w_te)/len(w_te)
w_te = w_te / w_te.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list): blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = preds_to_classes(blend_te, best_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('submission.csv written with 2D (th2, th3) grid-optimized thresholds.', flush=True)

Starting 2-D grid search around th2 & th3 for best subset blend...


Best OOF QWK from 2D grid: 0.88482 best_th: [0.57       1.4155     2.42640278 3.04140278]


submission.csv written with 2D (th2, th3) grid-optimized thresholds.


In [38]:
# Final small-step 4D threshold refinement (Nelder-Mead from current best) on top-5 blend; write submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Starting 4D NM refinement from current best thresholds...', flush=True)

keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

arrs = []
for k in keys:
    p = paths_oof.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing OOF preds for {k}: {p}')
    arrs.append(np.load(p).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs], axis=1)
mask = np.isfinite(y).copy()
for a in arrs: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

# Seed thresholds: prefer 2D grid best if available, else previous boot/base
seed = None
for p in ['thresholds_blend_grid2d.npy', 'thresholds_blend_grid.npy', 'thresholds_blend_boot.npy']:
    if os.path.exists(p):
        seed = np.load(p).astype(float); break
if seed is None:
    seed = np.array([0.56, 1.49, 2.46, 3.04], dtype=float)

def enforce_gaps(th):
    th = np.clip(th, 0.3, 3.7)
    th = np.sort(th)
    # enforce min gaps
    if th[1] - th[0] < 0.12: th[1] = th[0] + 0.12
    if th[2] - th[1] < 0.12: th[2] = th[1] + 0.12
    if th[3] - th[2] < 0.12: th[3] = min(3.7, th[2] + 0.12)
    return th

def qwk_loss(th):
    th = enforce_gaps(np.array(th, dtype=float))
    cls = preds_to_classes(blend_oof[mask], th)
    return -cohen_kappa_score(y[mask], cls, weights='quadratic')

# Small-step NM starting from seed
res = minimize(qwk_loss, x0=seed, method='Nelder-Mead', options={'maxiter': 600, 'xatol':1e-3, 'fatol':1e-3})
best_th = enforce_gaps(res.x)
best_q = -qwk_loss(best_th)
print('Refined OOF QWK:', f'{best_q:.5f}', 'best_th:', best_th, flush=True)
np.save('thresholds_blend_nm.npy', best_th)

# Apply to TEST blend
test_list = []
for k in keys:
    p = paths_te.get(k, None)
    if p is None or not os.path.exists(p):
        raise RuntimeError(f'Missing test preds for {k}: {p}')
    test_list.append(np.load(p).reshape(-1))
w_te = w_use[:len(test_list)]
if w_te.sum() <= 0: w_te = np.ones_like(w_te)/len(w_te)
w_te = w_te / w_te.sum()
blend_te = np.zeros_like(test_list[0], dtype=float)
for wi, arr in zip(w_te, test_list):
    blend_te += wi * arr
blend_te = np.clip(blend_te, 0.0, 4.0)
cls = preds_to_classes(blend_te, best_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
print('submission.csv written with 4D-NM-refined thresholds.', flush=True)

Starting 4D NM refinement from current best thresholds...


Refined OOF QWK: 0.88482 best_th: [0.57       1.4155     2.42640278 3.04140278]


submission.csv written with 4D-NM-refined thresholds.


In [44]:
# Isotonic calibration on blended EV, re-opt thresholds, micro-scan th3, write submission
import os, numpy as np, pandas as pd
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Starting isotonic calibration + threshold re-opt...', flush=True)

# Load best subset keys/weights
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

# Paths
paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

# Targets
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build OOF/Test EV blend for current keys
arrs_oof = []; arrs_te = []
for k in keys:
    po = paths_oof.get(k, None); pt = paths_te.get(k, None)
    if po is None or not os.path.exists(po): raise RuntimeError(f'Missing OOF for {k}: {po}')
    if pt is None or not os.path.exists(pt): raise RuntimeError(f'Missing test for {k}: {pt}')
    arrs_oof.append(np.load(po).reshape(-1))
    arrs_te.append(np.load(pt).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs_oof], axis=1)
mask = np.isfinite(y).copy()
for a in arrs_oof: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)
blend_te = np.zeros_like(arrs_te[0], dtype=float)
for wi, arr in zip(w_use, arrs_te): blend_te += wi * arr

# Fit isotonic on OOF EV -> calibrated EV (handle NaNs by fitting only on finite mask, transforming only mask)
ir = IsotonicRegression(increasing=True, out_of_bounds='clip')
mask_fit = mask & np.isfinite(blend_oof)
if mask_fit.sum() == 0:
    raise RuntimeError('No finite OOF EV available for isotonic fit')
ir.fit(blend_oof[mask_fit], y[mask_fit].astype(float))
# Transform only on finite indices to avoid NaN errors
cal_oof = blend_oof.copy()
cal_oof[mask_fit] = ir.transform(blend_oof[mask_fit])
# Any remaining non-finite, leave as original EV
cal_oof = np.where(np.isfinite(cal_oof), cal_oof, blend_oof)
# Test EV: ensure finite then transform
if not np.isfinite(blend_te).all():
    med = float(np.nanmedian(blend_te[np.isfinite(blend_te)]))
    blend_te = np.where(np.isfinite(blend_te), blend_te, med)
cal_te = ir.transform(blend_te)
cal_oof = np.clip(cal_oof, 0.0, 4.0)
cal_te = np.clip(cal_te, 0.0, 4.0)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds_strict(y_true, p, init=[0.5,1.5,2.5,3.5]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        for i in range(1,4):
            if th[i] - th[i-1] < 0.12: th[i] = th[i-1] + 0.12
        cls = preds_to_classes(p, th)
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1500, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x);
    for i in range(1,4):
        if th[i] - th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

# Optimize thresholds on calibrated OOF
th0 = optimize_thresholds_strict(y[mask_fit], cal_oof[mask_fit], init=[0.57, 1.42, 2.43, 3.04])
q0 = cohen_kappa_score(y[mask_fit], preds_to_classes(cal_oof[mask_fit], th0), weights='quadratic')
print('Iso-cal OOF QWK (NM):', f'{q0:.5f}', 'th0:', th0, flush=True)

# Optional micro-scan on th3 in [th3-0.05, th3+0.05] with step 0.0025, min gap respected
t2 = th0[2]
th3_min = max(t2 + 0.12, th0[3] - 0.05)
th3_max = min(3.7, th0[3] + 0.05)
grid = np.arange(th3_min, th3_max + 1e-9, 0.0025)
best_q = q0; best_th = th0.copy()
for i, th3 in enumerate(grid):
    th = th0.copy(); th[3] = th3
    if th[3] - th[2] < 0.12: th[3] = th[2] + 0.12
    q = cohen_kappa_score(y[mask_fit], preds_to_classes(cal_oof[mask_fit], th), weights='quadratic')
    if q > best_q:
        best_q = q; best_th = th.copy()
    if (i+1) % 40 == 0:
        print(f'th3 scan {i+1}/{len(grid)} best_q={best_q:.5f}', flush=True)
print('Iso-cal best OOF QWK:', f'{best_q:.5f}', 'best_th:', best_th, flush=True)

# Apply to calibrated Test EV and write submission
cls = preds_to_classes(cal_te, best_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_isotonic.npy', best_th)
print('submission.csv written with isotonic-calibrated EV and optimized thresholds.', flush=True)

Starting isotonic calibration + threshold re-opt...


Iso-cal OOF QWK (NM): 0.88482 th0: [0.57  1.42  2.43  3.192]


th3 scan 40/41 best_q=0.88482


Iso-cal best OOF QWK: 0.88482 best_th: [0.57  1.42  2.43  3.192]


submission.csv written with isotonic-calibrated EV and optimized thresholds.


In [41]:
# Affine EV scaling grid (p' = a*p + b) + threshold re-opt; write submission
import os, numpy as np, pandas as pd
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Starting affine EV scaling grid search...', flush=True)

# Load best subset keys/weights
keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'oof_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'oof_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'oof_preds_convnextb_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
    'serx50_512_rrcema_s2': 'test_reg_preds_serx50_512_rrc_ema_seed2026.npy',
    'r200d_512_rrcema': 'test_reg_preds_r200d_512_rrc_ema.npy',
    'convnextb_512_rrcema': 'test_reg_preds_convnextb_512_rrc_ema.npy',
}

# Targets
y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt):
        y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build OOF/Test EV blend for current keys
arrs_oof = []; arrs_te = []
for k in keys:
    po = paths_oof.get(k, None); pt = paths_te.get(k, None)
    if po is None or not os.path.exists(po): raise RuntimeError(f'Missing OOF for {k}: {po}')
    if pt is None or not os.path.exists(pt): raise RuntimeError(f'Missing test for {k}: {pt}')
    arrs_oof.append(np.load(po).reshape(-1))
    arrs_te.append(np.load(pt).reshape(-1))
X = np.concatenate([a.reshape(-1,1) for a in arrs_oof], axis=1)
mask = np.isfinite(y).copy()
for a in arrs_oof: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]
if w_use.sum() <= 0: w_use = np.ones_like(w_use)/len(w_use)
w_use = w_use / w_use.sum()
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)
blend_te = np.zeros_like(arrs_te[0], dtype=float)
for wi, arr in zip(w_use, arrs_te): blend_te += wi * arr

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds_strict(y_true, p, init=[0.57,1.42,2.43,3.19]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        for i in range(1,4):
            if th[i] - th[i-1] < 0.12: th[i] = th[i-1] + 0.12
        cls = preds_to_classes(p, th)
        return -cohen_kappa_score(y_true, cls, weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1200, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x)
    for i in range(1,4):
        if th[i] - th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

# Affine grid
a_vals = np.arange(0.97, 1.0301, 0.005)
b_vals = np.arange(-0.05, 0.0501, 0.005)
best = (-1.0, 1.0, 0.0, np.array([0.57,1.42,2.43,3.19], dtype=float))
cnt = 0
for a in a_vals:
    for b in b_vals:
        p = a*blend_oof + b
        p = np.clip(p, 0.0, 4.0)
        th = optimize_thresholds_strict(y[mask], p[mask], init=best[3])
        q = cohen_kappa_score(y[mask], preds_to_classes(p[mask], th), weights='quadratic')
        cnt += 1
        if q > best[0]:
            best = (q, a, b, th)
    # light logging
    if (cnt % 100) < len(b_vals):
        print(f'Checked {cnt} combos; current best QWK={best[0]:.5f} a={best[1]:.3f} b={best[2]:+.3f}', flush=True)

best_q, best_a, best_b, best_th = best
print('Best affine OOF QWK:', f'{best_q:.5f}', 'a,b:', (best_a, best_b), 'th:', best_th, flush=True)

# Apply to TEST and write submission
p_te = np.clip(best_a*blend_te + best_b, 0.0, 4.0)
cls = preds_to_classes(p_te, best_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_affine.npy', best_th); np.save('affine_params.npy', np.array([best_a, best_b], dtype=float))
print('submission.csv written with affine-scaled EV and optimized thresholds.', flush=True)

Starting affine EV scaling grid search...


Checked 105 combos; current best QWK=0.88454 a=0.985 b=+0.020


Checked 210 combos; current best QWK=0.88463 a=1.010 b=+0.010


Best affine OOF QWK: 0.88463 a,b: (1.01, 0.009999999999999967) th: [0.57918575 1.5137256  2.46036221 3.08284331]


submission.csv written with affine-scaled EV and optimized thresholds.


In [43]:
# Recompute hvflip (orig+hflip+vflip)/3 test preds for top-5 models and overwrite hflip files; then rerun isotonic cell
import os, numpy as np, pandas as pd, torch
from torch.utils.data import DataLoader

print('Recomputing hvflip TTA test preds for top-5 models...', flush=True)

def hvflip_tta_predict_mean(models, dl):
    preds = []
    with torch.no_grad():
        for xb, _ in dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast('cuda', enabled=True, dtype=torch.float16):
                s = 0; c = 0
                for m in models:
                    m.eval()
                    p0 = m(xb)
                    p1 = m(torch.flip(xb, dims=[-1]))  # hflip
                    p2 = m(torch.flip(xb, dims=[-2]))  # vflip
                    s += (p0 + p1 + p2) / 3.0; c += 1
                p = s / max(1, c)
            preds.append(p.detach().float().cpu().numpy())
    return np.concatenate(preds)

# B4 512
try:
    if 'models' in globals() and isinstance(models, list) and len(models) > 0:
        dte_b4 = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
        dl_b4 = DataLoader(dte_b4, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        p_b4 = hvflip_tta_predict_mean(models, dl_b4).reshape(-1)
        np.save('test_reg_preds_b4_hflip.npy', p_b4)
        print('Overwrote test_reg_preds_b4_hflip.npy with hvflip average', flush=True)
    else:
        print('Skip B4 512: models not in memory', flush=True)
except Exception as e:
    print('B4 512 hvflip failed:', e, flush=True)

# B5 512
try:
    if 'models_b5' in globals() and isinstance(models_b5, list) and len(models_b5) > 0:
        dte_b5 = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
        dl_b5 = DataLoader(dte_b5, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        p_b5 = hvflip_tta_predict_mean(models_b5, dl_b5).reshape(-1)
        np.save('test_reg_preds_b5_hflip.npy', p_b5)
        print('Overwrote test_reg_preds_b5_hflip.npy with hvflip average', flush=True)
    else:
        print('Skip B5 512: models_b5 not in memory', flush=True)
except Exception as e:
    print('B5 512 hvflip failed:', e, flush=True)

# B5 512 RRC+EMA
try:
    if 'models_b5_seed2025_ema' in globals() and isinstance(models_b5_seed2025_ema, list) and len(models_b5_seed2025_ema) > 0:
        dte_b5e = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms)
        dl_b5e = DataLoader(dte_b5e, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        p_b5e = hvflip_tta_predict_mean(models_b5_seed2025_ema, dl_b5e).reshape(-1)
        np.save('test_reg_preds_b5_seed2025_rrc_ema.npy', p_b5e)
        print('Overwrote test_reg_preds_b5_seed2025_rrc_ema.npy with hvflip average', flush=True)
    else:
        print('Skip B5 512 EMA: models_b5_seed2025_ema not in memory', flush=True)
except Exception as e:
    print('B5 512 EMA hvflip failed:', e, flush=True)

# B4 640
try:
    if 'models_b4_640' in globals() and isinstance(models_b4_640, list) and len(models_b4_640) > 0:
        dte_b4640 = LocalDRDataset(test_df[['id_code']].copy(), CACHE640_TEST, IMG_SIZE_640, transforms=valid_tfms_640)
        dl_b4640 = DataLoader(dte_b4640, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        p_b4640 = hvflip_tta_predict_mean(models_b4_640, dl_b4640).reshape(-1)
        np.save('test_reg_preds_b4_640_rrc_ema.npy', p_b4640)
        print('Overwrote test_reg_preds_b4_640_rrc_ema.npy with hvflip average', flush=True)
    else:
        print('Skip B4 640: models_b4_640 not in memory', flush=True)
except Exception as e:
    print('B4 640 hvflip failed:', e, flush=True)

# SEResNeXt50 512
try:
    if 'models_serx' in globals() and isinstance(models_serx, list) and len(models_serx) > 0:
        dte_serx = DRDataset(test_df[['id_code']].copy(), IMG_DIR_TEST, transforms=valid_tfms_serx)
        dl_serx = DataLoader(dte_serx, batch_size=16, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
        p_serx = hvflip_tta_predict_mean(models_serx, dl_serx).reshape(-1)
        np.save('test_reg_preds_serx50_512_rrc_ema.npy', p_serx)
        print('Overwrote test_reg_preds_serx50_512_rrc_ema.npy with hvflip average', flush=True)
    else:
        print('Skip SEResNeXt50: models_serx not in memory', flush=True)
except Exception as e:
    print('SEResNeXt50 hvflip failed:', e, flush=True)

print('hvflip recompute done. Now rerun isotonic cell (cell 30) to write submission.', flush=True)

Recomputing hvflip TTA test preds for top-5 models...


Overwrote test_reg_preds_b4_hflip.npy with hvflip average


In [45]:
# Isotonic + th3 safety nudge (+0.02 if within 0.0005 OOF), write submission
import os, numpy as np, pandas as pd
from sklearn.isotonic import IsotonicRegression
from sklearn.metrics import cohen_kappa_score
from scipy.optimize import minimize

print('Isotonic + th3 safety nudge...', flush=True)

keys = np.load('blend_weights_keys.npy', allow_pickle=True).tolist()
w = np.load('blend_weights_vals.npy')

paths_oof = {
    'b4_512': 'oof_preds_b4.npy',
    'b5_512': 'oof_preds.npy',
    'b5_512_rrcema': 'oof_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'oof_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'oof_preds_serx50_512_rrc_ema.npy',
}
paths_te = {
    'b4_512': 'test_reg_preds_b4_hflip.npy',
    'b5_512': 'test_reg_preds_b5_hflip.npy',
    'b5_512_rrcema': 'test_reg_preds_b5_seed2025_rrc_ema.npy',
    'b4_640': 'test_reg_preds_b4_640_rrc_ema.npy',
    'serx50_512_rrcema': 'test_reg_preds_serx50_512_rrc_ema.npy',
}

y = None
for tgt in ['oof_targets_b4.npy', 'oof_targets.npy']:
    if os.path.exists(tgt): y = np.load(tgt).reshape(-1); break
if y is None: raise RuntimeError('OOF targets not found')

# Build blends
arrs_oof=[]; arrs_te=[]
for k in keys:
    if k not in paths_oof or k not in paths_te: continue
    po=paths_oof[k]; pt=paths_te[k]
    if not (os.path.exists(po) and os.path.exists(pt)): continue
    arrs_oof.append(np.load(po).reshape(-1)); arrs_te.append(np.load(pt).reshape(-1))
if len(arrs_oof)==0: raise RuntimeError('No arrays for isotonic step')
X = np.concatenate([a.reshape(-1,1) for a in arrs_oof], axis=1)
mask = np.isfinite(y).copy()
for a in arrs_oof: mask &= np.isfinite(a)
w_use = w[:X.shape[1]]; w_use = w_use / (w_use.sum() if w_use.sum()>0 else len(w_use))
blend_oof = (X * w_use.reshape(1,-1)).sum(axis=1)
blend_te = np.zeros_like(arrs_te[0], dtype=float)
for wi, arr in zip(w_use, arrs_te): blend_te += wi * arr

# Isotonic fit on valid OOF only
ir = IsotonicRegression(increasing=True, out_of_bounds='clip')
mask_fit = mask & np.isfinite(blend_oof)
ir.fit(blend_oof[mask_fit], y[mask_fit].astype(float))
cal_oof = blend_oof.copy()
cal_oof[mask_fit] = ir.transform(blend_oof[mask_fit])
cal_oof = np.where(np.isfinite(cal_oof), cal_oof, blend_oof)
if not np.isfinite(blend_te).all():
    med = float(np.nanmedian(blend_te[np.isfinite(blend_te)]))
    blend_te = np.where(np.isfinite(blend_te), blend_te, med)
cal_te = ir.transform(blend_te)
cal_oof = np.clip(cal_oof, 0.0, 4.0); cal_te = np.clip(cal_te, 0.0, 4.0)

def preds_to_classes(p, th):
    return np.digitize(p, bins=[th[0], th[1], th[2], th[3]])

def optimize_thresholds_strict(y_true, p, init=[0.57,1.42,2.43,3.19]):
    y_true = np.asarray(y_true).astype(float); p = np.asarray(p).astype(float)
    def _loss(th):
        th = np.sort(th); th = np.clip(th, 0.3, 3.7)
        for i in range(1,4):
            if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
        return -cohen_kappa_score(y_true, preds_to_classes(p, th), weights='quadratic')
    res = minimize(_loss, x0=np.array(init, dtype=float), method='Nelder-Mead', options={'maxiter':1200, 'xatol':1e-3, 'fatol':1e-3})
    th = np.sort(res.x);
    for i in range(1,4):
        if th[i]-th[i-1] < 0.12: th[i] = th[i-1] + 0.12
    th = np.clip(th, 0.3, 3.7)
    return th

# Base NM
th0 = optimize_thresholds_strict(y[mask_fit], cal_oof[mask_fit], init=[0.57,1.42,2.43,3.19])
q0 = cohen_kappa_score(y[mask_fit], preds_to_classes(cal_oof[mask_fit], th0), weights='quadratic')
print('Iso-cal OOF QWK:', f'{q0:.5f}', 'th0:', th0, flush=True)

# Fine th3 scan ±0.05
t2 = th0[2]; th3_min = max(t2+0.12, th0[3]-0.05); th3_max = min(3.7, th0[3]+0.05)
grid = np.arange(th3_min, th3_max+1e-9, 0.0025)
best_q = q0; best_th = th0.copy()
for i, th3 in enumerate(grid):
    th = th0.copy(); th[3] = max(th[2]+0.12, min(3.7, th3))
    q = cohen_kappa_score(y[mask_fit], preds_to_classes(cal_oof[mask_fit], th), weights='quadratic')
    if q > best_q: best_q = q; best_th = th.copy()
print('Best th3 scan OOF:', f'{best_q:.5f}', 'best_th:', best_th, flush=True)

# Safety nudge: +0.02 to th3 if within 0.0005 OOF drop
safe_th = best_th.copy(); safe_th[3] = min(3.7, best_th[3] + 0.02)
q_safe = cohen_kappa_score(y[mask_fit], preds_to_classes(cal_oof[mask_fit], safe_th), weights='quadratic')
chosen_th = safe_th if (best_q - q_safe) <= 0.0005 else best_th
chosen_q = q_safe if (best_q - q_safe) <= 0.0005 else best_q
print('Chosen OOF QWK:', f'{chosen_q:.5f}', 'chosen_th:', chosen_th, flush=True)

# Apply to test
cls = preds_to_classes(cal_te, chosen_th).astype(int)
sub = pd.DataFrame({'id_code': pd.read_csv('test.csv')['id_code'].values, 'diagnosis': cls})
sub.to_csv('submission.csv', index=False)
np.save('thresholds_blend_isotonic_safe.npy', chosen_th)
print('submission.csv written (isotonic + th3 safety nudge).', flush=True)

Isotonic + th3 safety nudge...


Iso-cal OOF QWK: 0.88482 th0: [0.57 1.42 2.43 3.19]


Best th3 scan OOF: 0.88482 best_th: [0.57 1.42 2.43 3.19]


Chosen OOF QWK: 0.88482 chosen_th: [0.57 1.42 2.43 3.21]


submission.csv written (isotonic + th3 safety nudge).
