# Plan: PetFinder Pawpularity — Medal-Oriented Roadmap

## 0) Environment Gate
- Verify GPU with nvidia-smi; if no GPU, exit.
- Install cu121 PyTorch stack and core libs (timm, albumentations, opencv, xgboost, catboost).
- Enable mixed precision and cudnn.benchmark.

## 1) Data & EDA
- Load train.csv/test.csv; inspect target stats and metadata distributions.
- Create Pawpularity bins for stratification (e.g., 10–20 bins).
- Verify that images exist and sample their dimensions.

## 2) Validation Protocol (lock early)
- 5-fold StratifiedGroupKFold on Pawpularity bins, grouping perceptual-hash duplicates so they never straddle folds; deterministic seed.
- Keep a single folds.csv reused by all runs.
- Metric: RMSE computed OOF per fold and overall.
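
A minimal sketch of the metric bookkeeping above, assuming out-of-fold predictions are held in flat lists aligned with the training rows. The helper name `oof_rmse` and the toy values are illustrative only and are not part of the pipeline.

```python
import math
from collections import defaultdict

def oof_rmse(y_true, y_pred, folds):
    """Return (per-fold RMSE dict, overall RMSE) for out-of-fold predictions."""
    sq_err = defaultdict(list)
    for t, p, f in zip(y_true, y_pred, folds):
        sq_err[f].append((t - p) ** 2)
    # RMSE within each fold
    per_fold = {f: math.sqrt(sum(v) / len(v)) for f, v in sorted(sq_err.items())}
    # Overall RMSE pools every squared error; it is not the mean of fold RMSEs
    all_err = [e for v in sq_err.values() for e in v]
    overall = math.sqrt(sum(all_err) / len(all_err))
    return per_fold, overall

# Toy data (hypothetical values, two folds)
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [13.0, 24.0, 30.0, 40.0]
folds = [0, 0, 1, 1]
per_fold, overall = oof_rmse(y_true, y_pred, folds)
print(per_fold, overall)
```

Pooling squared errors across folds generally gives a different number than averaging the per-fold RMSEs, so it helps to log both and to compare runs on the same one.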

## 3) Baseline Model (fast, reliable)
- Image-only finetune with timm pretrained backbone:
  - Candidate: efficientnet_b0/b3 or convnext_tiny/swin_tiny; start with efficientnet_b0 @ 384 px.
  - Loss: MSELoss; Output: single regression head.
  - Augmentations (kept light): RandomResizedCrop, horizontal/vertical flip, color jitter, cutout.
  - Optim: AdamW; lr sched: cosine with warmup; batch size tuned for VRAM; AMP.
  - 10–15 epochs with early stopping patience 2–3. Save best per fold.
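
The "cosine with warmup" schedule can be sketched as a per-step learning-rate multiplier. This is an illustrative version that assumes `warmup_steps >= 1` and decays to zero; the function name `warmup_cosine` is hypothetical.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps):
    """LR multiplier in (0, 1]: linear warmup, then cosine decay toward zero."""
    if step < warmup_steps:
        # (step + 1) so the very first step already has a nonzero learning rate
        return (step + 1) / warmup_steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))

total, warm = 100, 10
mults = [warmup_cosine(s, total, warm) for s in range(total)]
print(mults[0], mults[9], mults[10], mults[99])
```

The multiplier is meant to scale each parameter group's base learning rate, for example through `torch.optim.lr_scheduler.LambdaLR` stepped once per iteration.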

## 4) Metadata Model
- Train gradient boosting on the 12 metadata features (and simple image stats like brightness/contrast from thumbnails):
  - XGBoost/CatBoost with GPU if feasible; otherwise CPU XGB small grid.
  - OOF predictions saved.

## 5) Blend
- Weighted average of the CNN and metadata-model OOF predictions, with weights tuned on CV by grid search.
- Apply same weights on test predictions.
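
A rough sketch of the weight search, assuming two aligned OOF prediction lists and a single blend weight `w` in [0, 1]. The names `best_blend_weight`, `oof_cnn`, and `oof_meta` are hypothetical and the toy values are made up.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def best_blend_weight(y_true, oof_a, oof_b, steps=100):
    """Grid-search w in [0, 1] minimizing RMSE of w*oof_a + (1-w)*oof_b."""
    best_w, best_score = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        blend = [w * a + (1 - w) * b for a, b in zip(oof_a, oof_b)]
        score = rmse(y_true, blend)
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Toy OOF predictions with opposite biases (hypothetical values)
y_true = [30.0, 50.0, 70.0]
oof_cnn = [34.0, 54.0, 74.0]   # consistently 4 too high
oof_meta = [26.0, 46.0, 66.0]  # consistently 4 too low
w, score = best_blend_weight(y_true, oof_cnn, oof_meta)
print(w, score)
```

Because the weight is fitted on the same OOF predictions it is scored on, the blended CV score is slightly optimistic; a coarse grid keeps that effect small.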

## 6) Improvements Iteration
- Scale up image size to 512; try stronger backbones (convnext_tiny/base, efficientnet_b3/b4, swin_tiny).
- Multi-seed ensembling (2–3 seeds).
- TTA (hflip + minor resize/crop) for test.
- Fine-tune last-N layers, then unfreeze all.
- Try head that concatenates metadata with image embedding (single multimodal model).

## 7) Error Analysis
- Analyze OOF errors by bins, brightness, subject focus, pet type guesses (cat/dog heuristic).
- Adjust aug and loss if needed (e.g., HuberLoss).
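
To illustrate why HuberLoss is a candidate here, the sketch below evaluates the Huber loss on a single residual. It follows the same piecewise definition as `torch.nn.HuberLoss` and uses `delta=7.5`, the value that appears in the training cells of this notebook; the standalone function `huber` is illustrative only.

```python
def huber(residual, delta=7.5):
    """Huber loss for one residual: quadratic within delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

# Compare against the half-squared error for small, boundary, and large residuals
for r in (2.0, 7.5, 40.0):
    print(r, huber(r), 0.5 * r * r)
```

For a residual of 40 the Huber value is 271.875 against 800 for the half-squared error, so a few badly mispredicted images pull the gradients much less than they would under MSE.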

## 8) Reproducibility & Logging
- Write train.py script (fold loop, logs per fold, elapsed time).
- Cache datasets/transforms; save OOF/test preds to disk (.npy/.csv).
- Verify submission.csv schema and mtime before submit.

## Milestone Checkpoints (request expert review)
- After this plan.
- After baseline CNN OOF.
- After metadata model + first blend.
- Before long training upgrades (higher res/backbones).
- Before final submission.

## Targets
- Bronze: RMSE ≤ 17.10 via baseline CNN + blend.
- Silver: ≤ 17.07 via stronger backbone/size + multi-seed.
- Gold: ≤ 16.95 via 512px, convnext/efficientnet_b3+, multimodal head, careful CV and blending.

Next: run environment GPU check and set up deps, then fold generation and a quick smoke baseline on a 1k-image subset to validate pipeline speed.

In [None]:
import os, sys, subprocess, shutil, time, json, math, random
from pathlib import Path
import pandas as pd

def run(cmd):
    print(f"$ {' '.join(cmd)}", flush=True)
    return subprocess.run(cmd, check=False, text=True, capture_output=False)

print('=== GPU CHECK ===', flush=True)
run(['bash','-lc','nvidia-smi || true'])

print('\n=== Install PyTorch cu121 stack ===', flush=True)
for pkg in ("torch","torchvision","torchaudio"):
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)

for d in (
    "/app/.pip-target/torch",
    "/app/.pip-target/torch-2.8.0.dist-info",
    "/app/.pip-target/torch-2.4.1.dist-info",
    "/app/.pip-target/torchvision",
    "/app/.pip-target/torchvision-0.23.0.dist-info",
    "/app/.pip-target/torchvision-0.19.1.dist-info",
    "/app/.pip-target/torchaudio",
    "/app/.pip-target/torchaudio-2.8.0.dist-info",
    "/app/.pip-target/torchaudio-2.4.1.dist-info",
    "/app/.pip-target/torchgen",
    "/app/.pip-target/functorch",
):
    if os.path.exists(d):
        print('Removing', d, flush=True)
        shutil.rmtree(d, ignore_errors=True)

def pip(*args):
    print('> pip', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Exact cu121 torch stack
pip('install',
    '--index-url','https://download.pytorch.org/whl/cu121',
    '--extra-index-url','https://pypi.org/simple',
    'torch==2.4.1','torchvision==0.19.1','torchaudio==2.4.1')

Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

print('\n=== Install core deps (constrained) ===', flush=True)
pip('install','-c','constraints.txt',
    'timm==1.0.9','albumentations==1.4.10','opencv-python-headless==4.10.0.84',
    'scikit-learn==1.5.2','xgboost==2.1.1','catboost==1.2.7','pillow==10.4.0','imagehash==4.3.1','tqdm==4.66.5',
    '--upgrade-strategy','only-if-needed')

import torch
print('torch:', torch.__version__, 'built CUDA:', getattr(torch.version,'cuda',None))
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f"Wrong CUDA build: {torch.version.cuda}"
assert torch.cuda.is_available(), 'CUDA not available'

print('\n=== Data sanity ===', flush=True)
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
print('train shape:', train_df.shape, 'test shape:', test_df.shape)
print('train columns:', list(train_df.columns))
print('target describe:\n', train_df['Pawpularity'].describe())
print('sample train head:\n', train_df.head(3).to_string(index=False))

img_train_dir = Path('train')
img_test_dir = Path('test')
missing_train = [i for i in train_df['Id'].head(20) if not (img_train_dir / f"{i}.jpg").exists()]
missing_test = [i for i in test_df['Id'].head(20) if not (img_test_dir / f"{i}.jpg").exists()]
print('missing first-20 train imgs:', missing_train)
print('missing first-20 test imgs:', missing_test)
print('Setup complete.')

# Next: generate phashes and folds.csv with StratifiedGroupKFold (qcut=10).

In [8]:
import pandas as pd, numpy as np, os, gc, time
from pathlib import Path
from PIL import Image
import imagehash
from tqdm import tqdm
from sklearn.model_selection import StratifiedGroupKFold

t0 = time.time()
train_df = pd.read_csv('train.csv')
img_dir = Path('train')

print('=== Compute perceptual hashes (phash) for duplicates grouping ===', flush=True)
hash_vals = []  # store 64-bit integers
ids = train_df['Id'].tolist()

def phash_to_int(hobj):
    arr = np.asarray(hobj.hash, dtype=np.uint8).reshape(-1)
    packed = np.packbits(arr)
    return int.from_bytes(packed.tobytes(), 'big')

for i in tqdm(ids, total=len(ids)):
    fp = img_dir / f"{i}.jpg"
    with Image.open(fp) as img:
        img = img.convert('RGB')
        h = imagehash.phash(img, hash_size=8)  # 64 bits
        hv = phash_to_int(h)
        hash_vals.append(hv)

phash_df = pd.DataFrame({'Id': ids, 'phash64': hash_vals})
phash_df.to_csv('phash.csv', index=False)
print('Saved phash.csv. Example:\n', phash_df.head().to_string(index=False))

print('=== Build groups via exact/near-duplicate (Hamming distance <=1) ===', flush=True)
# Union-Find (DSU)
n = len(phash_df)
parent = list(range(n))
rank = [0]*n
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x
def union(a,b):
    ra, rb = find(a), find(b)
    if ra == rb: return
    if rank[ra] < rank[rb]:
        parent[ra] = rb
    elif rank[ra] > rank[rb]:
        parent[rb] = ra
    else:
        parent[rb] = ra
        rank[ra] += 1

# Map hash -> first index encountered
hash_to_idx = {}
for idx, hv in enumerate(phash_df['phash64'].values):
    if hv in hash_to_idx:
        union(idx, hash_to_idx[hv])
    else:
        hash_to_idx[hv] = idx

# Near-duplicate (Hamming 1): flip each bit and check presence
BITS = 64
for idx, hv in enumerate(phash_df['phash64'].values):
    for b in range(BITS):
        nb = hv ^ (1 << b)
        j = hash_to_idx.get(nb)
        if j is not None:
            union(idx, j)
    if (idx+1) % 1000 == 0:
        print(f' processed {idx+1}/{n} for hamming-1 unions', flush=True)

roots = [find(i) for i in range(n)]
root_to_group = {}
grp_ids = []
gid = 0
for r in roots:
    if r not in root_to_group:
        root_to_group[r] = gid
        gid += 1
    grp_ids.append(root_to_group[r])

fold_df = train_df[['Id','Pawpularity']].copy()
fold_df['group'] = grp_ids

print('Groups formed:', len(set(grp_ids)), 'over', n, 'images')

print('=== Create stratification bins with qcut=10 ===', flush=True)
bins = pd.qcut(fold_df['Pawpularity'], q=10, labels=False, duplicates='drop')
fold_df['bin'] = bins
print('Bin counts:\n', fold_df['bin'].value_counts().sort_index())

print('=== StratifiedGroupKFold (5 folds, seed 42) ===', flush=True)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(fold_df), -1, dtype=int)
for k, (_, val_idx) in enumerate(sgkf.split(fold_df, y=fold_df['bin'], groups=fold_df['group'])):
    folds[val_idx] = k
fold_df['fold'] = folds
assert (fold_df['fold']>=0).all(), 'Some rows did not get a fold assignment'

fold_df[['Id','fold','group','bin']].to_csv('folds.csv', index=False)
print(f'Saved folds.csv with shape {fold_df.shape}. Fold counts:\n', fold_df['fold'].value_counts().sort_index())
print('Elapsed: %.1fs' % (time.time()-t0))

# Next: implement smoke training (convnext_tiny @384) using these folds.

=== Compute perceptual hashes (phash) for duplicates grouping ===


 12%|█▏        | 1031/8920 [00:04<00:37, 212.81it/s]

In [None]:
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from sklearn.metrics import mean_squared_error

torch.backends.cudnn.benchmark = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'

class PetDataset(Dataset):
    def __init__(self, df, img_dir, img_size=384, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.train = train
        self.img_size = img_size
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            self.tf = T.Compose([
                T.RandomResizedCrop(img_size, scale=(0.8,1.0), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.1,0.1,0.1,0.05),
                T.ToTensor(),
                normalize,
                T.RandomErasing(p=0.25, scale=(0.02, 0.1), ratio=(0.3, 3.3), value='random'),
            ])
        else:
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        if 'Pawpularity' in r:
            y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
            return x, y
        return x, r.Id

class Head(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Dropout(0.2),
            nn.Linear(in_ch, in_ch//2),
            nn.ReLU(inplace=True),
            nn.Dropout(0.2),
            nn.Linear(in_ch//2, 1),
        )
    def forward(self, x):
        return self.fc(x)

def build_model(backbone='convnext_tiny', pretrained=True):
    m = timm.create_model(backbone, pretrained=pretrained, num_classes=0, global_pool='avg')
    in_ch = m.num_features
    head = Head(in_ch)
    model = nn.Sequential(m, head)
    return model

def rmse(y_true, y_pred):
    # `squared=False` is deprecated in recent scikit-learn; take the square root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def train_smoke(seed=42, img_size=384, epochs=2, batch_size=64, fold_id=0, max_train=1000, max_val=300):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    folds = pd.read_csv('folds.csv')
    df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')
    tr = df[df.fold != fold_id].sample(n=min(max_train, (df.fold != fold_id).sum()), random_state=seed)
    va = df[df.fold == fold_id].sample(n=min(max_val, (df.fold == fold_id).sum()), random_state=seed)
    ds_tr = PetDataset(tr, 'train', img_size, train=True)
    ds_va = PetDataset(va, 'train', img_size, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=6, pin_memory=True, drop_last=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=6, pin_memory=True)
    model = build_model('convnext_tiny', pretrained=True).to(device)
    # param groups: lower LR for backbone (first module), higher for head (second)
    back_params = list(model[0].parameters())
    head_params = list(model[1].parameters())
    opt = torch.optim.AdamW([
        {'params': back_params, 'lr': 2e-4},
        {'params': head_params, 'lr': 1e-3},
    ], weight_decay=0.05)
    total_steps = epochs * len(dl_tr)
    warmup = max(1, int(0.05 * total_steps))
    def lr_lambda(step):
        # (step + 1) keeps the very first optimizer step from running at lr = 0
        if step < warmup: return (step + 1) / max(1, warmup)
        t = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1 + math.cos(math.pi * t))
    sch = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lr_lambda)
    huber = nn.HuberLoss(delta=7.5)  # targets in [0,100]
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))  # torch.cuda.amp.GradScaler is deprecated
    best = 1e9
    for ep in range(1, epochs+1):
        t0 = time.time()
        model.train()
        running = 0.0
        for it, (x, y) in enumerate(dl_tr):
            x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
            opt.zero_grad(set_to_none=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                logits = model(x).squeeze(1)
                preds = 100.0 * torch.sigmoid(logits)
                loss = huber(preds, y.squeeze(1))
            scaler.scale(loss).backward()
            scaler.step(opt)
            scaler.update()
            sch.step()
            running += loss.item()
            if (it+1) % 20 == 0:
                print(f'Ep{ep} it{it+1}/{len(dl_tr)} loss={running/(it+1):.4f}', flush=True)
        # validation
        model.eval()
        vs, vp = [], []
        with torch.no_grad():
            for x, y in dl_va:
                x = x.to(device, non_blocking=True)
                with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                    logits = model(x).squeeze(1)
                    preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
                vs.append(y.squeeze(1).cpu().numpy())
                vp.append(preds)
        y_true = np.concatenate(vs); y_pred = np.concatenate(vp)
        fold_rmse = rmse(y_true, y_pred)
        print(f'Epoch {ep}: val RMSE={fold_rmse:.4f} | time {time.time()-t0:.1f}s', flush=True)
        best = min(best, fold_rmse)
    print('Smoke run done. Best val RMSE:', best)
    return best

print('=== SMOKE TRAINING (1k train / 300 val, convnext_tiny @384) ===', flush=True)
best = train_smoke(seed=42, img_size=384, epochs=2, batch_size=64, fold_id=0, max_train=1000, max_val=300)
print('Smoke best RMSE:', best)
gc.collect(); torch.cuda.empty_cache()

In [9]:
import pandas as pd, numpy as np, time
from sklearn.model_selection import StratifiedGroupKFold

t0 = time.time()
print('=== Fast folds build using existing phash.csv (exact-duplicate groups only) ===', flush=True)
train_df = pd.read_csv('train.csv')
phash_df = pd.read_csv('phash.csv')
df = train_df.merge(phash_df, on='Id', how='left')
assert df['phash64'].notna().all(), 'Missing phash values; recompute if needed.'

# Group by exact duplicate phash
df['group'] = pd.factorize(df['phash64'])[0]
print('Groups formed (exact hash):', df['group'].nunique())

# Create stratification bins; fall back to fewer quantiles if any bin has fewer than 5 rows
q = 10
for q_try in [10, 9, 8, 7, 6, 5]:
    try:
        bins = pd.qcut(df['Pawpularity'], q=q_try, labels=False, duplicates='drop')
        vc = bins.value_counts().min()
        if vc >= 5:
            q = q_try
            break
    except Exception:
        continue
bins = pd.qcut(df['Pawpularity'], q=q, labels=False, duplicates='drop')
df['bin'] = bins
print(f'Using q={q} bins. Min bin count:', df['bin'].value_counts().min())

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(df), -1, dtype=int)
for k, (_, vidx) in enumerate(sgkf.split(df, y=df['bin'], groups=df['group'])):
    folds[vidx] = k
df['fold'] = folds
assert (df['fold']>=0).all(), 'Some rows not assigned a fold.'
df[['Id','fold','group','bin']].to_csv('folds.csv', index=False)
print('Saved folds.csv. Fold counts:\n', df['fold'].value_counts().sort_index())
print('Elapsed: %.1fs' % (time.time()-t0))

=== Fast folds build using existing phash.csv (exact-duplicate groups only) ===


Groups formed (exact hash): 8901
Using q=10 bins. Min bin count: 755


Saved folds.csv. Fold counts:
 fold
0    1785
1    1781
2    1785
3    1784
4    1785
Name: count, dtype: int64
Elapsed: 1.9s


In [None]:
import sys, subprocess
print('=== Fixing Albumentations dependency: installing compatible albucore ===', flush=True)
subprocess.run([sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt', 'albucore==0.0.12', '--upgrade-strategy', 'only-if-needed'], check=True)
import albumentations as A, albucore, cv2
print('albumentations:', A.__version__)
print('albucore:', albucore.__version__)
print('opencv:', cv2.__version__)
print('Albumentations stack OK')

In [None]:
import os, math, time, random, gc, json
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import mean_squared_error

torch.backends.cudnn.benchmark = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'

def seed_everything(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

class PetDatasetFull(Dataset):
    def __init__(self, df, img_dir, img_size=448, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        self.train = train
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            self.tf = T.Compose([
                T.RandomResizedCrop(img_size, scale=(0.9,1.0), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.05,0.05,0.05,0.025),
                T.ToTensor(),
                normalize,
            ])
        else:
            # Eval: pure Resize only (no CenterCrop/pad)
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
        return x, y

def build_model(backbone='eva02_base_patch14_448.mim_in22k_ft_in1k', pretrained=True):
    m = timm.create_model(
        backbone, pretrained=pretrained,
        num_classes=1, global_pool='avg', drop_path_rate=0.05
    )
    return m

def rmse(y_true, y_pred):
    # `squared=False` is deprecated in recent scikit-learn; take the square root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def _eval_fold_model(eval_model, ds_va, batch_size, img_size):
    va_dl = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=6, pin_memory=True, persistent_workers=True)
    eval_model.eval()
    oof = np.zeros(len(ds_va), dtype=np.float32)
    vs = []
    ptr = 0
    with torch.no_grad():
        for x,y in va_dl:
            bs = x.size(0)
            x = x.to(device, non_blocking=True).float()
            logits = eval_model(x).squeeze(1)
            preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
            oof[ptr:ptr+bs] = np.clip(preds,0,100); ptr += bs
            vs.append(y.squeeze(1).cpu().numpy())
    y_true = np.concatenate(vs)
    val_rmse = rmse(y_true, oof)
    return oof, val_rmse

def _predict_test(eval_model, batch_size, img_size):
    test_df = pd.read_csv('test.csv')
    test_tf = T.Compose([
        T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
    ])
    test_tf_hf = T.Compose([
        T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
        T.RandomHorizontalFlip(p=1.0),
        T.ToTensor(),
        T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
    ])
    def load_img(pid): return Image.open(Path('test')/f'{pid}.jpg').convert('RGB')
    preds_test = []
    eval_model.eval()
    with torch.no_grad():
        for i in range(0, len(test_df), batch_size):
            ids = test_df['Id'].iloc[i:i+batch_size].tolist()
            imgs = [load_img(pid) for pid in ids]
            x1 = torch.stack([test_tf(im) for im in imgs]).to(device, non_blocking=True).float()
            x2 = torch.stack([test_tf_hf(im) for im in imgs]).to(device, non_blocking=True).float()
            p1 = 100.0 * torch.sigmoid(eval_model(x1).squeeze(1))
            p2 = 100.0 * torch.sigmoid(eval_model(x2).squeeze(1))
            p = 0.5*(p1+p2)
            preds_test.append(p.float().cpu().numpy())
    preds_test = np.clip(np.concatenate(preds_test), 0, 100)
    return preds_test

def _build_param_groups_with_wd(model, head_ids, backbone_lr, head_lr, weight_decay=0.05):
    # Parameter-wise weight decay: exclude biases, norms, pos/bias embeddings, cls_token
    no_decay_terms = ('bias', 'bn', 'norm', 'ln', 'pos_embed', 'cls_token', 'absolute_pos_embed', 'relative_position_bias_table')
    back_decay, back_no_decay, head_decay, head_no_decay = [], [], [], []
    for n, p in model.named_parameters():
        if not p.requires_grad:
            continue
        is_no_decay = any(t in n for t in no_decay_terms)
        is_head = (id(p) in head_ids)
        if is_head:
            (head_no_decay if is_no_decay else head_decay).append(p)
        else:
            (back_no_decay if is_no_decay else back_decay).append(p)
    param_groups = [
        {'params': back_decay, 'lr': backbone_lr, 'weight_decay': weight_decay},
        {'params': back_no_decay, 'lr': backbone_lr, 'weight_decay': 0.0},
        {'params': head_decay, 'lr': head_lr, 'weight_decay': weight_decay},
        {'params': head_no_decay, 'lr': head_lr, 'weight_decay': 0.0},
    ]
    return param_groups

class WarmupCosine:
    def __init__(self, optimizer, total_steps, warmup_steps):
        self.optimizer = optimizer
        self.total_steps = max(1, total_steps)
        self.warmup_steps = max(1, warmup_steps)
        self.step_num = 0
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
    def step(self):
        self.step_num += 1
        s = self.step_num
        if s <= self.warmup_steps:
            mult = s / float(self.warmup_steps)
        else:
            t = (s - self.warmup_steps) / float(max(1, self.total_steps - self.warmup_steps))
            mult = max(0.01, 0.5 * (1.0 + math.cos(math.pi * t)))
        for i, pg in enumerate(self.optimizer.param_groups):
            base = self.base_lrs[i]
            pg['lr'] = base * mult

def train_one_fold(fold, df, img_size=448, epochs=18, batch_size=22, seed=42, resume_if_exists=True, backbone_max_lr=1.6e-4, head_max_lr=1.2e-3):
    seed_everything(seed+fold)
    # capture validation indices BEFORE resets for OOF alignment
    val_idx = df.index[df.fold == fold].values
    tr = df[df.fold != fold].reset_index(drop=True)
    va = df[df.fold == fold].reset_index(drop=True)
    ds_tr = PetDatasetFull(tr, 'train', img_size, train=True)
    ds_va = PetDatasetFull(va, 'train', img_size, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=10, pin_memory=True, drop_last=True, persistent_workers=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=10, pin_memory=True, persistent_workers=True)

    backbone_name = 'eva02_base_patch14_448.mim_in22k_ft_in1k'
    model = build_model(backbone_name, pretrained=True).to(device)
    # Optional speed
    if device == 'cuda':
        model = model.to(memory_format=torch.channels_last)

    # Resume: look for new best.pt first, then legacy best_ema.pt
    ckpt_new = Path(f'models/fold{fold}_best.pt')
    ckpt_legacy = Path(f'models/fold{fold}_best_ema.pt')
    ckpt_path = ckpt_new if ckpt_new.exists() else (ckpt_legacy if ckpt_legacy.exists() else None)
    if resume_if_exists and ckpt_path is not None:
        print(f'Fold {fold}: found existing {ckpt_path}, attempting resume inference...', flush=True)
        try:
            eval_model = build_model(backbone_name, pretrained=False).to(device).float()
            state = torch.load(ckpt_path, map_location='cpu', weights_only=True)
            eval_model.load_state_dict({k: v.to(device) for k, v in state.items()})
            oof, val_rmse = _eval_fold_model(eval_model, ds_va, batch_size, img_size)
            print(f'Fold {fold} (resume) val RMSE={val_rmse:.4f}', flush=True)
            if val_rmse <= 19.0:
                dbg = va[['Id','Pawpularity']].copy(); dbg['pred'] = oof; dbg.to_csv(f'fold{fold}_oof_debug.csv', index=False)
                preds_test = _predict_test(eval_model, batch_size, img_size)
                return val_idx, oof, preds_test, float(val_rmse)
            else:
                print(f'Fold {fold}: resume checkpoint underperforms (RMSE {val_rmse:.4f} > 19.0). Proceeding to retrain...', flush=True)
        except Exception as e:
            print(f'Fold {fold}: resume failed due to state mismatch or load error -> {e}. Proceeding to retrain...', flush=True)

    # Robust classifier head/backbone param split
    clf = model.get_classifier()
    if isinstance(clf, str):
        head_module = getattr(model, clf)
    else:
        head_module = clf
    head_params = list(head_module.parameters())
    head_ids = set(id(p) for p in head_params)
    back_params = [p for p in model.parameters() if id(p) not in head_ids]
    assert len(head_params) > 0, 'Head params empty'
    print(f'Param split: head={sum(p.numel() for p in head_params)} params, backbone={sum(p.numel() for p in back_params)} params', flush=True)
    print(f'LRs: backbone_max_lr={backbone_max_lr:.2e}, head_max_lr={head_max_lr:.2e}', flush=True)

    # Optimizer with parameter-wise weight decay and safer betas
    param_groups = _build_param_groups_with_wd(model, head_ids, backbone_max_lr, head_max_lr, weight_decay=0.05)
    opt = torch.optim.AdamW(param_groups, betas=(0.9, 0.98))

    # Gentle linear warmup + cosine decay (per-iteration)
    total_steps = epochs * len(dl_tr)
    warmup_steps = max(1, int(0.10 * total_steps))
    sch = WarmupCosine(opt, total_steps=total_steps, warmup_steps=warmup_steps)

    huber = nn.HuberLoss(delta=7.5)
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))
    ema = ModelEmaV2(model, decay=0.9997)

    best = 1e9; best_state = None; patience=4; no_improve=0
    Path('models').mkdir(exist_ok=True)
    print(f'Fold {fold}: train={len(tr)} val={len(va)} steps/epoch={len(dl_tr)} warmup_steps={warmup_steps}', flush=True)
    first_batch = True
    for ep in range(1, epochs+1):
        ep_t0 = time.time()
        model.train(); tr_loss=0.0
        for it,(x,y) in enumerate(dl_tr):
            if device=='cuda':
                x = x.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            else:
                x = x.to(device, non_blocking=True)
            y = y.to(device, non_blocking=True)
            opt.zero_grad(set_to_none=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                logits = model(x).squeeze(1)
                preds = 100.0 * torch.sigmoid(logits)
                loss = huber(preds, y.squeeze(1))
            scaler.scale(loss).backward()
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt); scaler.update(); sch.step()
            ema.update(model)
            tr_loss += loss.item()
            if first_batch:
                if torch.cuda.is_available():
                    torch.cuda.synchronize()
                    print(f'  CUDA max memory MB: {torch.cuda.max_memory_allocated()/1e6:.1f}', flush=True)
                first_batch = False
            if (it+1) % 100 == 0:
                lrs = [pg['lr'] for pg in opt.param_groups]
                # Show 4 groups: back_decay, back_no_decay, head_decay, head_no_decay
                print(f'  ep{ep} it{it+1}/{len(dl_tr)} loss={tr_loss/(it+1):.4f} lr=[{lrs[0]:.2e},{lrs[1]:.2e},{lrs[2]:.2e},{lrs[3]:.2e}]', flush=True)

        # validation (early raw for ep<=3; otherwise EMA), FP32, no autocast
        eval_model = (model if ep <= 3 else ema.module).to(device).float()
        eval_model.eval(); vs=[]; vp=[]
        with torch.no_grad():
            for x,y in dl_va:
                x = x.to(device, non_blocking=True).float()
                logits = eval_model(x).squeeze(1)
                preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
                vs.append(y.squeeze(1).cpu().numpy()); vp.append(preds)
        y_true = np.concatenate(vs); y_pred = np.clip(np.concatenate(vp), 0, 100)
        val_rmse = rmse(y_true, y_pred)
        print(f'Fold {fold} Ep {ep}: val RMSE={val_rmse:.4f} | ep_time={time.time()-ep_t0:.1f}s', flush=True)
        print(f'  val pred mean/std: {y_pred.mean():.2f}/{y_pred.std():.2f}', flush=True)
        if val_rmse < best - 1e-4:
            best = val_rmse; no_improve=0
            best_state = {k:v.cpu() for k,v in eval_model.state_dict().items()}
            torch.save(best_state, f'models/fold{fold}_best.pt')
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'  Early stopping at epoch {ep}', flush=True)
                break

    # load best weights (raw or EMA depending on when best occurred)
    if best_state is not None:
        eval_model = build_model(backbone_name, pretrained=False).to(device).float()
        eval_model.load_state_dict({k:v.to(device) for k,v in best_state.items()})
    else:
        eval_model = ema.module.to(device).float()

    # OOF preds
    oof, final_val_rmse = _eval_fold_model(eval_model, ds_va, batch_size, img_size)
    # per-fold debug CSV
    dbg = va[['Id','Pawpularity']].copy(); dbg['pred'] = oof; dbg.to_csv(f'fold{fold}_oof_debug.csv', index=False)

    # Test preds with HFlip TTA
    preds_test = _predict_test(eval_model, batch_size, img_size)
    return val_idx, oof, preds_test, final_val_rmse

print('=== FULL 5-FOLD TRAINING: EVA02 Base @448 (Huber, Resize-only eval, EMA, HFlip TTA) ===', flush=True)
seed = 42; img_size=448; epochs=18; batch_size=22
folds = pd.read_csv('folds.csv')
train_df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')

oof = np.zeros(len(train_df), dtype=np.float32)
test_preds_folds = []
fold_scores = []
Path('models').mkdir(exist_ok=True)
t0 = time.time()
for f in range(5):
    f_t0 = time.time()
    print(f'--- Fold {f} start ---', flush=True)
    back_lr = 2.0e-4 if f == 0 else 1.6e-4
    vidx, oof_f, tp_f, best_rmse = train_one_fold(f, train_df, img_size=img_size, epochs=epochs, batch_size=batch_size, seed=seed, resume_if_exists=False, backbone_max_lr=back_lr, head_max_lr=1.2e-3)
    assert len(vidx) == len(oof_f), 'OOF length mismatch vs indices'
    oof[vidx] = oof_f
    test_preds_folds.append(tp_f)
    fold_scores.append(best_rmse)
    print(f'--- Fold {f} done. Best/Final RMSE={best_rmse:.4f}. Elapsed fold {time.time()-f_t0:.1f}s ---', flush=True)
    gc.collect(); torch.cuda.empty_cache()

print('Fold RMSEs:', fold_scores, 'Mean:', float(np.mean(fold_scores)))
oof_df = train_df[['Id','Pawpularity']].copy(); oof_df['oof'] = np.clip(oof,0,100)
oof_df.to_csv('oof_eva02b_448.csv', index=False)
np.save('test_preds_eva02b_448.npy', np.stack(test_preds_folds, axis=0))
overall_rmse = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'OOF RMSE: {overall_rmse:.5f} | Total time {time.time()-t0:.1f}s', flush=True)

# Create preliminary submission from mean of fold test preds
test_df = pd.read_csv('test.csv')
test_preds = np.mean(np.load('test_preds_eva02b_448.npy'), axis=0)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_preds,0,100)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv shape', sub.shape)
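
A note on the EMA setting used above (`ModelEmaV2(model, decay=0.9997)`, with validation switching from the raw model to the EMA copy after epoch 3): an exponential moving average that starts from the initial weights still carries a share `decay**n` of those weights after `n` updates. The sketch below only evaluates that formula. `initial_weight_share` and `updates_to_reach` are hypothetical helper names, and the 223 updates per epoch are an assumption borrowed from the ConvNeXt log later in this notebook (892 iterations with `accum_steps=4`), so this is an illustration rather than a measurement of the EVA02 run.

```python
import math

def initial_weight_share(decay: float, n_updates: int) -> float:
    """Fraction of the EMA weights still attributable to the values it was initialised with."""
    return decay ** n_updates

def updates_to_reach(decay: float, share: float) -> int:
    """Smallest number of EMA updates after which the initial share is `share` or below."""
    return math.ceil(math.log(share) / math.log(decay))

if __name__ == "__main__":
    updates_per_epoch = 223  # assumption: 892 iterations / accum_steps=4
    for epoch in (3, 5, 10, 18):
        n = epoch * updates_per_epoch
        share = initial_weight_share(0.9997, n)
        print(f"epoch {epoch:2d}: {share:.3f} of the EMA is still the initial weights")
    print("updates until share <= 10%:", updates_to_reach(0.9997, 0.10))
```

Under these assumptions the EMA copy evaluated from epoch 4 onward is still dominated by the weights it was initialised with, which is worth keeping in mind when comparing raw and EMA validation scores under `patience=4`. A lower decay or a decay warmup are common remedies, though neither was tested here.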

In [3]:
# Metadata model prep: engineered image stats + XGBoost trainer (using same folds)
import os, time, gc, math, json
from pathlib import Path
import numpy as np
import pandas as pd
import cv2
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # squared=False was removed in scikit-learn 1.6; take the root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def compute_image_stats(ids, img_dir='train'):
    feats = []
    t0 = time.time()
    for i, pid in enumerate(ids):
        fp = str(Path(img_dir)/f"{pid}.jpg")
        img = cv2.imdecode(np.fromfile(fp, dtype=np.uint8), cv2.IMREAD_COLOR)
        if img is None:
            feats.append((pid, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, np.nan))
            continue
        h, w = img.shape[:2]
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        # brightness/contrast
        mean_gray = float(gray.mean())
        std_gray = float(gray.std())
        # saturation/value stats
        sat_mean = float(hsv[...,1].mean()); sat_std = float(hsv[...,1].std())
        val_mean = float(hsv[...,2].mean()); val_std = float(hsv[...,2].std())
        # sharpness proxy: variance of the Laplacian (low values suggest blur)
        lap_var = float(cv2.Laplacian(gray, cv2.CV_64F).var())
        # aspect ratio
        aspect = float(w)/max(1.0, float(h))
        feats.append((pid, mean_gray, std_gray, sat_mean, sat_std, val_mean, val_std, lap_var, aspect))
        if (i+1) % 1000 == 0:
            print(f'  processed {i+1}/{len(ids)} imgs in {time.time()-t0:.1f}s', flush=True)
    cols = ['Id','gray_mean','gray_std','sat_mean','sat_std','val_mean','val_std','lap_var','aspect']
    return pd.DataFrame(feats, columns=cols)

def build_meta_frames():
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    # compute or load cached
    tr_path = Path('meta_features.csv'); te_path = Path('meta_features_test.csv')
    if tr_path.exists() and te_path.exists():
        meta_tr = pd.read_csv(tr_path); meta_te = pd.read_csv(te_path)
    else:
        print('Computing image stats for train...', flush=True)
        meta_tr = compute_image_stats(train_df['Id'].tolist(), img_dir='train')
        print('Computing image stats for test...', flush=True)
        meta_te = compute_image_stats(test_df['Id'].tolist(), img_dir='test')
        meta_tr.to_csv(tr_path, index=False); meta_te.to_csv(te_path, index=False)
        print('Saved meta features to CSV.')
    # merge 12 binary metadata + engineered stats
    meta_cols = ['Subject Focus','Eyes','Face','Near','Action','Accessory','Group','Collage','Human','Occlusion','Info','Blur']
    X_tr = train_df[['Id','Pawpularity'] + meta_cols].merge(meta_tr, on='Id', how='left')
    X_te = test_df[['Id'] + meta_cols].merge(meta_te, on='Id', how='left')
    # simple imputation: fill NaNs (unreadable images) with the TRAIN median so both frames share one fill value
    stat_cols = ['gray_mean','gray_std','sat_mean','sat_std','val_mean','val_std','lap_var','aspect']
    for c in stat_cols:
        med = X_tr[c].median()
        X_tr[c] = X_tr[c].fillna(med)
        X_te[c] = X_te[c].fillna(med)
    return X_tr, X_te

def train_xgb_meta(seed=42):
    folds = pd.read_csv('folds.csv')
    X_tr, X_te = build_meta_frames()
    X_tr = X_tr.merge(folds[['Id','fold']], on='Id', how='left')
    features = [c for c in X_tr.columns if c not in ('Id','Pawpularity','fold')]
    print('Meta features:', len(features))
    oof = np.zeros(len(X_tr), dtype=np.float32)
    test_preds_folds = []
    fold_scores = []
    params = {
        'learning_rate': 0.03,
        'max_depth': 6,
        'min_child_weight': 1.0,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'reg_lambda': 1.0,
        # the number of rounds is set via num_boost_round in xgb.train; 'n_estimators' is ignored by the native API
        'tree_method': 'hist',
        'device': 'cuda',
        'random_state': seed,
    }
    for f in range(5):
        tr_idx = np.where(X_tr['fold'].values != f)[0]
        va_idx = np.where(X_tr['fold'].values == f)[0]
        dtr = xgb.DMatrix(X_tr.iloc[tr_idx][features].values, label=X_tr.iloc[tr_idx]['Pawpularity'].values)
        dva = xgb.DMatrix(X_tr.iloc[va_idx][features].values, label=X_tr.iloc[va_idx]['Pawpularity'].values)
        dte = xgb.DMatrix(X_te[features].values)
        # Built-in RMSE (the default metric for squared-error regression) on 'va' drives early stopping;
        # the deprecated feval hook that duplicated it is not needed.
        model = xgb.train(params, dtr, num_boost_round=2000,
                          evals=[(dtr,'tr'),(dva,'va')],
                          verbose_eval=100,
                          early_stopping_rounds=100)
        pva = model.predict(dva, iteration_range=(0, model.best_iteration+1))
        pte = model.predict(dte, iteration_range=(0, model.best_iteration+1))
        oof[va_idx] = pva
        test_preds_folds.append(pte)
        fold_rmse = rmse(X_tr.iloc[va_idx]['Pawpularity'].values, pva)
        fold_scores.append(fold_rmse)
        print(f'XGB fold {f} rmse={fold_rmse:.5f} | best_iter={model.best_iteration}', flush=True)
        gc.collect()
    oof_df = X_tr[['Id','Pawpularity']].copy(); oof_df['oof_meta'] = oof
    oof_df.to_csv('oof_xgb_meta.csv', index=False)
    test_pred = np.mean(np.vstack(test_preds_folds), axis=0).astype(np.float32)
    np.save('test_preds_xgb_meta.npy', test_pred)
    overall = rmse(oof_df['Pawpularity'].values, oof_df['oof_meta'].values)
    print('XGB META OOF RMSE:', overall)
    return overall

# NOTE: Do not execute now; baseline CNN training is running.
# After image model finishes, run: xgb_rmse = train_xgb_meta(seed=42), then blend on OOF.
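
To make the `lap_var` feature above more concrete: the variance of the Laplacian is commonly used as a sharpness proxy, near zero for flat or heavily blurred images and larger when fine detail is present. The NumPy-only sketch below approximates it with the 4-neighbour kernel on interior pixels. `laplacian_variance` and `box_blur` are hypothetical helpers, and the border handling differs from `cv2.Laplacian`, so values will not match OpenCV exactly.

```python
import numpy as np

def laplacian_variance(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian over interior pixels (sharpness proxy)."""
    g = gray.astype(np.float64)
    # up + down + left + right - 4 * centre, evaluated on the interior only
    lap = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1]
    return float(lap.var())

def box_blur(gray: np.ndarray) -> np.ndarray:
    """3x3 mean filter on interior pixels, used here only to simulate a softer image."""
    g = gray.astype(np.float64)
    out_h, out_w = g.shape[0] - 2, g.shape[1] - 2
    acc = np.zeros((out_h, out_w))
    for dy in range(3):
        for dx in range(3):
            acc += g[dy:dy + out_h, dx:dx + out_w]
    return acc / 9.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sharp = rng.integers(0, 256, size=(64, 64)).astype(np.float64)  # synthetic test image
    print("flat   :", laplacian_variance(np.full((64, 64), 128.0)))
    print("sharp  :", laplacian_variance(sharp))
    print("blurred:", laplacian_variance(box_blur(sharp)))
```

The flat image scores zero and the blurred copy scores below the original, which is the ordering the feature relies on.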

In [None]:
# Blending: tune OOF weight between CNN and XGB meta, create blended submission
import numpy as np, pandas as pd
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # squared=False was removed in scikit-learn 1.6; take the root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def blend_and_submit(img_oof_path='oof_convnext_tiny_384.csv',
                     meta_oof_path='oof_xgb_meta.csv',
                     img_test_npy='test_preds_convnext_tiny_384.npy',
                     meta_test_npy='test_preds_xgb_meta.npy',
                     out_csv='submission_blend.csv'):
    oof_img = pd.read_csv(img_oof_path).rename(columns={'oof':'img_pred'})
    oof_meta = pd.read_csv(meta_oof_path).rename(columns={'oof_meta':'meta_pred'})
    oof = oof_img.merge(oof_meta, on=['Id','Pawpularity'], how='inner')
    y = oof['Pawpularity'].values.astype(np.float32)
    p_img = oof['img_pred'].values.astype(np.float32)
    p_meta = oof['meta_pred'].values.astype(np.float32)
    # grid search weight on image model
    best = (1.0, rmse(y, p_img))  # (w_img, score)
    for w in np.linspace(0.50, 0.95, 46):
        pred = np.clip(w*p_img + (1.0-w)*p_meta, 0, 100)
        s = rmse(y, pred)
        if s < best[1]:
            best = (float(w), float(s))
    print(f'Best blend weight on image: w_img={best[0]:.2f}, OOF RMSE={best[1]:.5f}', flush=True)
    # blend test predictions
    img_test_folds = np.load(img_test_npy)  # shape (5, 992)
    img_test = img_test_folds.mean(axis=0).astype(np.float32)
    meta_test = np.load(meta_test_npy).astype(np.float32)
    test_df = pd.read_csv('test.csv')
    blend_test = np.clip(best[0]*img_test + (1.0-best[0])*meta_test, 0, 100)
    sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': blend_test})
    sub.to_csv(out_csv, index=False)
    print('Saved blended submission to', out_csv, 'shape', sub.shape, flush=True)
    return best

# Usage after both models are trained:
# best = blend_and_submit()
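
As a cross-check on the grid search above: ignoring the final clipping, the squared-error-optimal weight for a two-model blend `w*a + (1-w)*b` has the closed form `w* = sum((a-b)*(y-b)) / sum((a-b)**2)`. The sketch below runs it on synthetic arrays. `optimal_blend_weight` is a hypothetical helper, not part of the notebook, and the noise levels are arbitrary.

```python
import numpy as np

def optimal_blend_weight(y, a, b) -> float:
    """Least-squares weight w on `a` for the blend w*a + (1-w)*b (no clipping applied)."""
    y, a, b = (np.asarray(v, dtype=np.float64) for v in (y, a, b))
    d = a - b
    denom = float(np.dot(d, d))
    if denom == 0.0:
        return 1.0  # identical predictions: every weight gives the same blend
    return float(np.dot(d, y - b) / denom)

def rmse(y, p) -> float:
    y = np.asarray(y, dtype=np.float64)
    p = np.asarray(p, dtype=np.float64)
    return float(np.sqrt(np.mean((y - p) ** 2)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.uniform(1, 100, size=2000)
    a = y + rng.normal(0, 18, size=y.size)  # synthetic stand-in for image-model OOF
    b = y + rng.normal(0, 20, size=y.size)  # synthetic stand-in for metadata-model OOF
    w = optimal_blend_weight(y, a, b)
    print(f"w*={w:.3f}  blended RMSE={rmse(y, w * a + (1 - w) * b):.3f}")
    print(f"a alone={rmse(y, a):.3f}  b alone={rmse(y, b):.3f}")
```

If the closed-form weight on the real OOF files falls outside 0.50 to 0.95, the grid's range, not the data, is what limits the blend.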

In [2]:
# ConvNeXt Base @512 stable 5-fold training (Huber, EMA, warmup+cosine, Resize eval, HFlip TTA)
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import mean_squared_error

# mitigate fragmentation before allocating tensors
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass

def seed_everything(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

class PetDatasetCNB(Dataset):
    def __init__(self, df, img_dir, img_size=512, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            self.tf = T.Compose([
                T.RandomResizedCrop(img_size, scale=(0.85,1.0), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.05,0.05,0.05,0.025),
                T.ToTensor(),
                normalize,
            ])
        else:
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
        return x, y

def build_cnb_model(backbone='convnext_base.fb_in22k_ft_in1k', pretrained=True):
    m = timm.create_model(backbone, pretrained=pretrained, num_classes=1, global_pool='avg', drop_path_rate=0.05)
    return m

def rmse(y_true, y_pred):
    # squared=False was removed in scikit-learn 1.6; take the root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def _eval_model(eval_model, ds_va, batch_size):
    va_dl = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    eval_model.eval()
    oof = np.zeros(len(ds_va), dtype=np.float32)
    vs = []; ptr = 0
    with torch.no_grad():
        for x,y in va_dl:
            bs = x.size(0)
            x = x.to(device, non_blocking=True).float()
            logits = eval_model(x).squeeze(1)
            preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
            oof[ptr:ptr+bs] = np.clip(preds,0,100); ptr += bs
            vs.append(y.squeeze(1).cpu().numpy())
    y_true = np.concatenate(vs)
    return oof, rmse(y_true, oof)

def _predict_test_cnb(eval_model, batch_size, img_size):
    test_df = pd.read_csv('test.csv')
    tf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    tf_hf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.RandomHorizontalFlip(p=1.0), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    def load_img(pid): return Image.open(Path('test')/f'{pid}.jpg').convert('RGB')
    preds = []
    eval_model.eval()
    with torch.no_grad():
        for i in range(0, len(test_df), min(batch_size, 16)):
            ids = test_df['Id'].iloc[i:i+min(batch_size, 16)].tolist()
            imgs = [load_img(pid) for pid in ids]
            x1 = torch.stack([tf(im) for im in imgs]).to(device, non_blocking=True).float()
            x2 = torch.stack([tf_hf(im) for im in imgs]).to(device, non_blocking=True).float()
            p1 = 100.0 * torch.sigmoid(eval_model(x1).squeeze(1))
            p2 = 100.0 * torch.sigmoid(eval_model(x2).squeeze(1))
            p = 0.5*(p1+p2)
            preds.append(p.float().cpu().numpy())
    return np.clip(np.concatenate(preds), 0, 100)

def _build_param_groups_cnb(model, head_ids, backbone_lr, head_lr, weight_decay=0.05):
    no_decay_terms = ('bias','bn','norm','ln')
    back_decay, back_no_decay, head_decay, head_no_decay = [], [], [], []
    for n,p in model.named_parameters():
        if not p.requires_grad: continue
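        # 1-D params (biases, norm scales, layer-scale gamma) skip weight decay even when the name match misses them
        is_no_decay = p.ndim <= 1 or any(t in n for t in no_decay_terms)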
        is_head = (id(p) in head_ids)
        if is_head:
            (head_no_decay if is_no_decay else head_decay).append(p)
        else:
            (back_no_decay if is_no_decay else back_decay).append(p)
    return [
        {'params': back_decay, 'lr': backbone_lr, 'weight_decay': weight_decay},
        {'params': back_no_decay, 'lr': backbone_lr, 'weight_decay': 0.0},
        {'params': head_decay, 'lr': head_lr, 'weight_decay': weight_decay},
        {'params': head_no_decay, 'lr': head_lr, 'weight_decay': 0.0},
    ]

class WarmupCosine:
    def __init__(self, optimizer, total_steps, warmup_steps):
        self.opt = optimizer; self.total_steps = max(1,total_steps); self.warmup_steps = max(1,warmup_steps); self.s=0
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
    def step(self):
        self.s += 1
        if self.s <= self.warmup_steps:
            mult = self.s / float(self.warmup_steps)
        else:
            t = (self.s - self.warmup_steps) / float(max(1, self.total_steps - self.warmup_steps))
            mult = max(0.01, 0.5*(1+math.cos(math.pi*t)))
        for i, pg in enumerate(self.opt.param_groups):
            pg['lr'] = self.base_lrs[i] * mult

def train_fold_cnb(fold, df, img_size=512, epochs=18, batch_size=8, seed=42, backbone_max_lr=1.6e-4, head_max_lr=1.2e-3, accum_steps=4, resume_if_exists=True):
    seed_everything(seed+fold)
    val_idx = df.index[df.fold == fold].values
    tr = df[df.fold != fold].reset_index(drop=True)
    va = df[df.fold == fold].reset_index(drop=True)
    ds_tr = PetDatasetCNB(tr, 'train', img_size, train=True)
    ds_va = PetDatasetCNB(va, 'train', img_size, train=False)

    # Try resume: if best checkpoint exists, skip training and just evaluate + predict
    ckpt = Path(f'models/cnb_fold{fold}_best.pt')
    if resume_if_exists and ckpt.exists():
        print(f'Fold {fold}: found existing {ckpt}, resuming for eval/predict...', flush=True)
        eval_model = build_cnb_model(pretrained=False).to(device).float()
        # the checkpoint is a plain dict of tensors, so the safer weights_only=True loader is sufficient
        state = torch.load(ckpt, map_location='cpu', weights_only=True)
        eval_model.load_state_dict({k: v.to(device) for k, v in state.items()})
        oof, final_rmse = _eval_model(eval_model, ds_va, batch_size)
        preds_test = _predict_test_cnb(eval_model, batch_size, img_size)
        dbg = va[['Id','Pawpularity']].copy(); dbg['pred']=oof; dbg.to_csv(f'cnb_fold{fold}_oof_debug.csv', index=False)
        return val_idx, oof, preds_test, final_rmse

    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)

    backbone = 'convnext_base.fb_in22k_ft_in1k'
    model = build_cnb_model(backbone, pretrained=True).to(device)
    if device=='cuda': model = model.to(memory_format=torch.channels_last)

    clf = model.get_classifier()
    head_module = getattr(model, clf) if isinstance(clf, str) else clf
    head_params = list(head_module.parameters()); head_ids = set(id(p) for p in head_params)
    back_params = [p for p in model.parameters() if id(p) not in head_ids]
    print(f'CNB fold{fold}: head params={sum(p.numel() for p in head_params)} backbone params={sum(p.numel() for p in back_params)}', flush=True)
    print(f'LRs: backbone={backbone_max_lr:.2e} head={head_max_lr:.2e}', flush=True)

    param_groups = _build_param_groups_cnb(model, head_ids, backbone_max_lr, head_max_lr, weight_decay=0.05)
    opt = torch.optim.AdamW(param_groups, betas=(0.9,0.98))
    total_steps = epochs * len(dl_tr) // max(1, accum_steps)
    warmup_steps = max(1, int(0.10*total_steps))
    sch = WarmupCosine(opt, total_steps, warmup_steps)
    huber = nn.HuberLoss(delta=7.5)
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))
    ema = ModelEmaV2(model, decay=0.9997)

    best = 1e9; best_state=None; patience=4; no_improve=0
    Path('models').mkdir(exist_ok=True)
    print(f'Fold {fold}: train={len(tr)} val={len(va)} steps/epoch={len(dl_tr)} warmup_steps={warmup_steps} accum_steps={accum_steps}', flush=True)
    first_batch=True
    for ep in range(1, epochs+1):
        t0=time.time(); model.train(); loss_sum=0.0
        opt.zero_grad(set_to_none=True)
        for it,(x,y) in enumerate(dl_tr):
            if device=='cuda': x=x.to(device,non_blocking=True).to(memory_format=torch.channels_last)
            else: x=x.to(device,non_blocking=True)
            y=y.to(device,non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                logits = model(x).squeeze(1)
                preds = 100.0 * torch.sigmoid(logits)
                loss = huber(preds, y.squeeze(1)) / float(accum_steps)
            scaler.scale(loss).backward()
            if ((it+1) % accum_steps) == 0:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
                ema.update(model)
            loss_sum += loss.item() * float(accum_steps)
            if first_batch:
                if torch.cuda.is_available():
                    torch.cuda.synchronize(); print(f'  CUDA max memory MB: {torch.cuda.max_memory_allocated()/1e6:.1f}', flush=True)
                first_batch=False
            if (it+1)%100==0:
                lrs=[pg['lr'] for pg in opt.param_groups]
                print(f'  ep{ep} it{it+1}/{len(dl_tr)} loss={loss_sum/(it+1):.4f} lr=[{lrs[0]:.2e},{lrs[1]:.2e},{lrs[2]:.2e},{lrs[3]:.2e}]', flush=True)
        # flush leftover grads if last batch wasn't multiple of accum_steps
        if ((it+1) % accum_steps) != 0:
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
            ema.update(model)

        eval_model = (model if ep<=3 else ema.module).to(device).float()
        eval_model.eval(); vs=[]; vp=[]
        with torch.no_grad():
            for x,y in dl_va:
                x=x.to(device,non_blocking=True).float()
                logits=eval_model(x).squeeze(1)
                preds=(100.0*torch.sigmoid(logits)).float().cpu().numpy()
                vs.append(y.squeeze(1).cpu().numpy()); vp.append(preds)
        y_true=np.concatenate(vs); y_pred=np.clip(np.concatenate(vp),0,100)
        v_rmse = rmse(y_true, y_pred)
        print(f'Fold {fold} Ep {ep}: val RMSE={v_rmse:.4f} | ep_time={time.time()-t0:.1f}s', flush=True)
        print(f'  val pred mean/std: {y_pred.mean():.2f}/{y_pred.std():.2f}', flush=True)

        # Early abort guardrail at epoch 7
        if ep == 7 and (v_rmse >= 17.8 or y_pred.std() < 7.0):
            print('  Abort guard triggered at epoch 7', flush=True)
            break

        if v_rmse < best - 1e-4:
            best=v_rmse; no_improve=0
            best_state={k:v.cpu() for k,v in eval_model.state_dict().items()}
            torch.save(best_state, f'models/cnb_fold{fold}_best.pt')
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'  Early stopping at epoch {ep}', flush=True); break

    if best_state is not None:
        eval_model = build_cnb_model(pretrained=False).to(device).float()
        eval_model.load_state_dict({k:v.to(device) for k,v in best_state.items()})
    else:
        eval_model = ema.module.to(device).float()

    oof, final_rmse = _eval_model(eval_model, ds_va, batch_size)
    preds_test = _predict_test_cnb(eval_model, batch_size, img_size)
    dbg = va[['Id','Pawpularity']].copy(); dbg['pred']=oof; dbg.to_csv(f'cnb_fold{fold}_oof_debug.csv', index=False)
    return val_idx, oof, preds_test, final_rmse

print('=== CNB 5-FOLD TRAINING @512 (Huber, EMA, Resize eval, HFlip TTA) ===', flush=True)
seed=42; img_size=512; epochs=18; batch_size=8
folds = pd.read_csv('folds.csv')
train_df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')
oof = np.zeros(len(train_df), dtype=np.float32)
test_preds_folds=[]; fold_scores=[]
Path('models').mkdir(exist_ok=True)
t0=time.time()
for f in range(5):
    print(f'--- CNB Fold {f} start ---', flush=True)
    # Raise backbone LR for folds >=1 due to underfit signals in fold 0
    back_lr = 2.0e-4 if f >= 1 else 1.6e-4
    vidx, oof_f, tp_f, best_rmse = train_fold_cnb(f, train_df, img_size=img_size, epochs=epochs, batch_size=batch_size, seed=seed, backbone_max_lr=back_lr, head_max_lr=1.2e-3, accum_steps=4, resume_if_exists=True)
    oof[vidx]=oof_f; test_preds_folds.append(tp_f); fold_scores.append(best_rmse)
    print(f'--- CNB Fold {f} done. Best RMSE={best_rmse:.4f} ---', flush=True)
    gc.collect(); torch.cuda.empty_cache()
print('CNB fold RMSEs:', fold_scores, 'Mean:', float(np.mean(fold_scores)))
oof_df = train_df[['Id','Pawpularity']].copy(); oof_df['oof']=np.clip(oof,0,100)
oof_df.to_csv('oof_convnext_base_512.csv', index=False)
np.save('test_preds_convnext_base_512.npy', np.stack(test_preds_folds, axis=0))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'CNB OOF RMSE: {overall:.5f} | Total time {time.time()-t0:.1f}s', flush=True)
test_df = pd.read_csv('test.csv')
test_preds = np.mean(np.load('test_preds_convnext_base_512.npy'), axis=0)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_preds,0,100)})
sub.to_csv('submission.csv', index=False)
print('Saved submission.csv shape', sub.shape)

=== CNB 5-FOLD TRAINING @512 (Huber, EMA, Resize eval, HFlip TTA) ===
--- CNB Fold 0 start ---
Fold 0: found existing models/cnb_fold0_best.pt, resuming for eval/predict...
--- CNB Fold 0 done. Best RMSE=18.2842 ---
--- CNB Fold 1 start ---
CNB fold1: head params=1025 backbone params=87566464
LRs: backbone=2.00e-04 head=1.20e-03
Fold 1: train=7139 val=1781 steps/epoch=892 warmup_steps=401 accum_steps=4
  CUDA max memory MB: 8129.9
  ep1 it100/892 loss=97.0383 lr=[1.25e-05,1.25e-05,7.48e-05,7.48e-05]
  ep1 it200/892 loss=94.8805 lr=[2.49e-05,2.49e-05,1.50e-04,1.50e-04]
  ep1 it300/892 loss=90.7242 lr=[3.74e-05,3.74e-05,2.24e-04,2.24e-04]
  ep1 it400/892 loss=87.9907 lr=[4.99e-05,4.99e-05,2.99e-04,2.99e-04]
  ep1 it500/892 loss=85.3325 lr=[6.23e-05,6.23e-05,3.74e-04,3.74e-04]
  ep1 it600/892 loss=84.9728 lr=[7.48e-05,7.48e-05,4.49e-04,4.49e-04]
  ep1 it700/892 loss=84.1452 lr=[8.73e-05,8.73e-05,5.24e-04,5.24e-04]
  ep1 it800/892 loss=84.9877 lr=[9.98e-05,9.98e-05,5.99e-04,5.99e-04]
Fold 1 Ep 1: val RMSE=18.1554 | ep_time=185.1s
  val pred mean/std: 35.97/10.59
  ep2 it100/892 loss=70.9226 lr=[1.24e-04,1.24e-04,7.42e-04,7.42e-04]
  ep2 it200/892 loss=76.3574 lr=[1.36e-04,1.36e-04,8.17e-04,8.17e-04]
  ep2 it300/892 loss=75.5731 lr=[1.49e-04,1.49e-04,8.92e-04,8.92e-04]
  ep2 it400/892 loss=78.1371 lr=[1.61e-04,1.61e-04,9.67e-04,9.67e-04]
  ep2 it500/892 loss=76.9466 lr=[1.74e-04,1.74e-04,1.04e-03,1.04e-03]
  ep2 it600/892 loss=77.8202 lr=[1.86e-04,1.86e-04,1.12e-03,1.12e-03]
  ep2 it700/892 loss=78.3151 lr=[1.99e-04,1.99e-04,1.19e-03,1.19e-03]
  ep2 it800/892 loss=77.8021 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
Fold 1 Ep 2: val RMSE=19.4080 | ep_time=186.1s
  val pred mean/std: 31.87/9.56
  ep3 it100/892 loss=79.9017 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
  ep3 it200/892 loss=74.7884 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
  ep3 it300/892 loss=73.3123 lr=[1.99e-04,1.99e-04,1.20e-03,1.20e-03]
  ep3 it400/892 loss=73.1154 lr=[1.99e-04,1.99e-04,1.20e-03,1.20e-03]
  ep3 it500/892 loss=71.5225 lr=[1.99e-04,1.99e-04,1.19e-03,1.19e-03]
  ep3 it600/892 loss=71.2952 lr=[1.99e-04,1.99e-04,1.19e-03,1.19e-03]
  ep3 it700/892 loss=70.5806 lr=[1.98e-04,1.98e-04,1.19e-03,1.19e-03]
  ep3 it800/892 loss=70.4482 lr=[1.98e-04,1.98e-04,1.19e-03,1.19e-03]
Fold 1 Ep 3: val RMSE=19.7835 | ep_time=186.4s
  val pred mean/std: 32.18/13.69
  ep4 it100/892 loss=59.3830 lr=[1.97e-04,1.97e-04,1.18e-03,1.18e-03]
  ep4 it200/892 loss=58.0505 lr=[1.96e-04,1.96e-04,1.18e-03,1.18e-03]
  ep4 it300/892 loss=56.7019 lr=[1.96e-04,1.96e-04,1.17e-03,1.17e-03]
  ep4 it400/892 loss=55.4624 lr=[1.95e-04,1.95e-04,1.17e-03,1.17e-03]
  ep4 it500/892 loss=56.8812 lr=[1.94e-04,1.94e-04,1.17e-03,1.17e-03]
  ep4 it600/892 loss=56.7087 lr=[1.93e-04,1.93e-04,1.16e-03,1.16e-03]
  ep4 it700/892 loss=56.4128 lr=[1.93e-04,1.93e-04,1.16e-03,1.16e-03]
  ep4 it800/892 loss=56.4729 lr=[1.92e-04,1.92e-04,1.15e-03,1.15e-03]
Fold 1 Ep 4: val RMSE=18.6558 | ep_time=186.8s
  val pred mean/std: 38.96/8.73
  ep5 it100/892 loss=44.1119 lr=[1.90e-04,1.90e-04,1.14e-03,1.14e-03]
  ep5 it200/892 loss=47.4114 lr=[1.89e-04,1.89e-04,1.13e-03,1.13e-03]
  ep5 it300/892 loss=45.3865 lr=[1.88e-04,1.88e-04,1.13e-03,1.13e-03]
  ep5 it400/892 loss=45.1585 lr=[1.87e-04,1.87e-04,1.12e-03,1.12e-03]
  ep5 it500/892 loss=44.8070 lr=[1.86e-04,1.86e-04,1.12e-03,1.12e-03]
  ep5 it600/892 loss=44.1221 lr=[1.85e-04,1.85e-04,1.11e-03,1.11e-03]
  ep5 it700/892 loss=44.2372 lr=[1.84e-04,1.84e-04,1.10e-03,1.10e-03]
  ep5 it800/892 loss=43.8476 lr=[1.82e-04,1.82e-04,1.09e-03,1.09e-03]
Fold 1 Ep 5: val RMSE=18.2563 | ep_time=186.4s
  val pred mean/std: 39.07/8.90
  Early stopping at epoch 5
--- CNB Fold 1 done. Best RMSE=18.1554 ---
--- CNB Fold 2 start ---
CNB fold2: head params=1025 backbone params=87566464
LRs: backbone=2.00e-04 head=1.20e-03
Fold 2: train=7135 val=1785 steps/epoch=891 warmup_steps=400 accum_steps=4
  CUDA max memory MB: 9180.7
  ep1 it100/891 loss=124.4569 lr=[1.25e-05,1.25e-05,7.50e-05,7.50e-05]
  ep1 it200/891 loss=108.7567 lr=[2.50e-05,2.50e-05,1.50e-04,1.50e-04]
  ep1 it300/891 loss=101.5202 lr=[3.75e-05,3.75e-05,2.25e-04,2.25e-04]
  ep1 it400/891 loss=96.2617 lr=[5.00e-05,5.00e-05,3.00e-04,3.00e-04]
  ep1 it500/891 loss=91.9585 lr=[6.25e-05,6.25e-05,3.75e-04,3.75e-04]
  ep1 it600/891 loss=89.8857 lr=[7.50e-05,7.50e-05,4.50e-04,4.50e-04]
  ep1 it700/891 loss=89.1205 lr=[8.75e-05,8.75e-05,5.25e-04,5.25e-04]
  ep1 it800/891 loss=87.6906 lr=[1.00e-04,1.00e-04,6.00e-04,6.00e-04]
Fold 2 Ep 1: val RMSE=18.7596 | ep_time=186.5s
  val pred mean/std: 39.92/10.99
  ep2 it100/891 loss=74.6893 lr=[1.24e-04,1.24e-04,7.44e-04,7.44e-04]
  ep2 it200/891 loss=75.6485 lr=[1.37e-04,1.37e-04,8.19e-04,8.19e-04]
  ep2 it300/891 loss=75.8687 lr=[1.49e-04,1.49e-04,8.94e-04,8.94e-04]
  ep2 it400/891 loss=75.4197 lr=[1.61e-04,1.61e-04,9.69e-04,9.69e-04]
  ep2 it500/891 loss=76.2573 lr=[1.74e-04,1.74e-04,1.04e-03,1.04e-03]
  ep2 it600/891 loss=76.6988 lr=[1.87e-04,1.87e-04,1.12e-03,1.12e-03]
  ep2 it700/891 loss=77.7396 lr=[1.99e-04,1.99e-04,1.19e-03,1.19e-03]
  ep2 it800/891 loss=78.3615 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
Fold 2 Ep 2: val RMSE=18.5025 | ep_time=186.6s
  val pred mean/std: 39.33/10.23
  ep3 it100/891 loss=78.2949 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
  ep3 it200/891 loss=74.9215 lr=[2.00e-04,2.00e-04,1.20e-03,1.20e-03]
  ep3 it300/891 loss=72.9513 lr=[1.99e-04,1.99e-04,1.20e-03,1.20e-03]
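
The learning rates in the log above can be reproduced from `WarmupCosine` by hand. The scheduler advances once per optimizer step, that is every `accum_steps=4` iterations, so iteration 100 of epoch 1 corresponds to scheduler step 25. The sketch restates the multiplier as a pure function (`lr_multiplier` is a hypothetical name) and plugs in fold 1's values, `total_steps = 18 * 892 // 4 = 4014` and `warmup_steps = 401`.

```python
import math

def lr_multiplier(step: int, total_steps: int, warmup_steps: int) -> float:
    """Linear warmup to 1.0, then cosine decay floored at 0.01 (mirrors WarmupCosine.step)."""
    if step <= warmup_steps:
        return step / float(warmup_steps)
    t = (step - warmup_steps) / float(max(1, total_steps - warmup_steps))
    return max(0.01, 0.5 * (1 + math.cos(math.pi * t)))

if __name__ == "__main__":
    total_steps, warmup_steps, accum_steps = 4014, 401, 4  # fold 1 values from the log
    steps_per_epoch = 892 // accum_steps
    for epoch, it in [(1, 100), (1, 800), (2, 800), (5, 800)]:
        step = (epoch - 1) * steps_per_epoch + it // accum_steps
        m = lr_multiplier(step, total_steps, warmup_steps)
        print(f"ep{epoch} it{it}: backbone={2.0e-4 * m:.2e} head={1.2e-3 * m:.2e}")
```

Agreement with the logged values is a quick way to confirm that warmup ends partway through epoch 2 and that the backbone and head groups share one multiplier.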


In [4]:
# Execute metadata pipeline (engineered image stats + XGBoost) and save OOF/test preds
print('=== TRAIN XGB METADATA MODEL ===', flush=True)
xgb_rmse = train_xgb_meta(seed=42)
print(f'XGB META OOF RMSE: {xgb_rmse:.5f}', flush=True)

=== TRAIN XGB METADATA MODEL ===
Computing image stats for train...
  processed 1000/8920 imgs in 9.9s
  processed 2000/8920 imgs in 19.3s
  processed 3000/8920 imgs in 29.2s
  processed 4000/8920 imgs in 39.0s
  processed 5000/8920 imgs in 48.6s
  processed 6000/8920 imgs in 58.2s
  processed 7000/8920 imgs in 67.7s
  processed 8000/8920 imgs in 77.2s
Computing image stats for test...
Saved meta features to CSV.
Meta features: 20
[0]	tr-rmse:20.71878	va-rmse:20.21670
Parameters: { "n_estimators" } are not used.
[100]	tr-rmse:18.93849	va-rmse:20.23123
[149]	tr-rmse:18.26153	va-rmse:20.29109
XGB fold 0 rmse=20.15373 | best_iter=50
[0]	tr-rmse:20.59100	va-rmse:20.73815
Parameters: { "n_estimators" } are not used.
[100]	tr-rmse:18.88596	va-rmse:20.70907
[129]	tr-rmse:18.51150	va-rmse:20.73264
XGB fold 1 rmse=20.69700 | best_iter=30
[0]	tr-rmse:20.53495	va-rmse:20.97442
Parameters: { "n_estimators" } are not used.
[100]	tr-rmse:18.86696	va-rmse:20.90346
[137]	tr-rmse:18.38003	va-rmse:20.94220
XGB fold 2 rmse=20.87232 | best_iter=37
[0]	tr-rmse:20.46780	va-rmse:21.23119
Parameters: { "n_estimators" } are not used.
[100]	tr-rmse:18.71906	va-rmse:21.28087
[124]	tr-rmse:18.37555	va-rmse:21.31688
XGB fold 3 rmse=21.22168 | best_iter=25
[0]	tr-rmse:20.75786	va-rmse:20.05083
Parameters: { "n_estimators" } are not used.
[100]	tr-rmse:18.90914	va-rmse:20.11583
[112]	tr-rmse:18.74101	va-rmse:20.13501
XGB fold 4 rmse=20.03572 | best_iter=13
XGB META OOF RMSE: 20.60077001680868
XGB META OOF RMSE: 20.60077
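
For context on the XGB META OOF RMSE above, one sanity check is a constant baseline: for every validation row, predict the mean target of the other folds. The sketch below is illustrative only. `constant_baseline_rmse` and the toy arrays are hypothetical and not part of this notebook; with real data it would take the `Pawpularity` column of train.csv and the `fold` column from folds.csv.

```python
import numpy as np

def constant_baseline_rmse(y, fold):
    """RMSE of predicting, for each validation fold, the mean target of the other folds."""
    y = np.asarray(y, dtype=np.float64)
    fold = np.asarray(fold)
    pred = np.empty_like(y)
    for f in np.unique(fold):
        va = fold == f
        pred[va] = y[~va].mean()  # mean computed on training rows only
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Toy data (hypothetical): four samples split into two folds
y_toy = [0.0, 10.0, 20.0, 30.0]
fold_toy = [0, 1, 0, 1]
baseline = constant_baseline_rmse(y_toy, fold_toy)
print(f"constant baseline RMSE: {baseline:.4f}")
```

If a model's OOF RMSE lands close to this baseline on the real targets, the model is adding little beyond the target mean, which is worth knowing before assigning it a blend weight.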




In [14]:
# Swin-B @384 (in1k default weights) fold training; only fold 0 is run below (Unbounded MSE, EMA, warmup+cosine, Resize eval, HFlip TTA)
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import mean_squared_error

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass

def seed_everything(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

def rmse(y_true, y_pred):
    # squared=False is deprecated in recent scikit-learn releases; take the root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

class PetDatasetSWIN(Dataset):
    def __init__(self, df, img_dir, img_size=384, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        self.train = train
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            # Minimal context-preserving aug: Resize + HFlip + light ColorJitter
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.05,0.05,0.05,0.025),
                T.ToTensor(),
                normalize,
            ])
        else:
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
        return x, y

def build_swin_model(backbone='swin_base_patch4_window7_224', pretrained=True, img_size=384):
    m = timm.create_model(backbone, pretrained=pretrained, num_classes=1, global_pool='avg', drop_path_rate=0.05, img_size=img_size)
    return m

def _build_param_groups(model, head_ids, backbone_lr, head_lr, weight_decay=0.05):
    no_decay_terms = ('bias','bn','norm','ln','pos_embed','cls_token','absolute_pos_embed','relative_position_bias_table')
    back_decay, back_no_decay, head_decay, head_no_decay = [], [], [], []
    for n,p in model.named_parameters():
        if not p.requires_grad: continue
        is_no_decay = any(t in n for t in no_decay_terms)
        is_head = (id(p) in head_ids)
        if is_head:
            (head_no_decay if is_no_decay else head_decay).append(p)
        else:
            (back_no_decay if is_no_decay else back_decay).append(p)
    return [
        {'params': back_decay, 'lr': backbone_lr, 'weight_decay': weight_decay},
        {'params': back_no_decay, 'lr': backbone_lr, 'weight_decay': 0.0},
        {'params': head_decay, 'lr': head_lr, 'weight_decay': weight_decay},
        {'params': head_no_decay, 'lr': head_lr, 'weight_decay': 0.0},
    ]

class WarmupCosine:
    def __init__(self, optimizer, total_steps, warmup_steps, min_lr_mult=0.05):
        self.opt = optimizer; self.total_steps = max(1,total_steps); self.warmup_steps = max(1,warmup_steps); self.s=0
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
        self.min_lr_mult = float(min_lr_mult)
    def step(self):
        self.s += 1
        if self.s <= self.warmup_steps:
            mult = self.s / float(self.warmup_steps)
        else:
            t = (self.s - self.warmup_steps) / float(max(1, self.total_steps - self.warmup_steps))
            mult = max(self.min_lr_mult, 0.5*(1+math.cos(math.pi*t)))
        for i, pg in enumerate(self.opt.param_groups):
            pg['lr'] = self.base_lrs[i] * mult

def _eval_model(eval_model, ds_va, batch_size):
    va_dl = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    eval_model.eval()
    oof = np.zeros(len(ds_va), dtype=np.float32); vs=[]; ptr=0
    with torch.no_grad():
        for x,y in va_dl:
            bs = x.size(0)
            x = x.to(device, non_blocking=True).float()
            preds = eval_model(x).squeeze(1).float().cpu().numpy()  # unbounded
            oof[ptr:ptr+bs] = preds; ptr += bs
            vs.append(y.squeeze(1).cpu().numpy())
    y_true = np.concatenate(vs)
    # metric on clipped preds as per contest target domain
    return np.clip(oof,0,100), rmse(y_true, np.clip(oof,0,100))

def _predict_test(eval_model, batch_size, img_size):
    test_df = pd.read_csv('test.csv')
    tf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    tf_hf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.RandomHorizontalFlip(p=1.0), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    def load_img(pid): return Image.open(Path('test')/f'{pid}.jpg').convert('RGB')
    preds = []; eval_model.eval()
    with torch.no_grad():
        for i in range(0, len(test_df), min(batch_size, 16)):
            ids = test_df['Id'].iloc[i:i+min(batch_size, 16)].tolist()
            imgs = [load_img(pid) for pid in ids]
            x1 = torch.stack([tf(im) for im in imgs]).to(device, non_blocking=True).float()
            x2 = torch.stack([tf_hf(im) for im in imgs]).to(device, non_blocking=True).float()
            p1 = eval_model(x1).squeeze(1)  # unbounded
            p2 = eval_model(x2).squeeze(1)  # unbounded
            p = 0.5*(p1+p2)
            preds.append(p.float().cpu().numpy())
    return np.clip(np.concatenate(preds), 0, 100)

def train_fold_swin(fold, df, img_size=384, epochs=18, batch_size=16, seed=42, backbone_max_lr=1.0e-4, head_max_lr=5.0e-4, accum_steps=2):
    seed_everything(seed+fold)
    val_idx = df.index[df.fold == fold].values
    tr = df[df.fold != fold].reset_index(drop=True)
    va = df[df.fold == fold].reset_index(drop=True)
    ds_tr = PetDatasetSWIN(tr, 'train', img_size, train=True)
    ds_va = PetDatasetSWIN(va, 'train', img_size, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)

    backbone = 'swin_base_patch4_window7_224'
    model = build_swin_model(backbone, pretrained=True, img_size=img_size).to(device)
    if device=='cuda': model = model.to(memory_format=torch.channels_last)

    clf = model.get_classifier()
    head_module = getattr(model, clf) if isinstance(clf, str) else clf
    head_params = list(head_module.parameters()); head_ids = set(id(p) for p in head_params)
    back_params = [p for p in model.parameters() if id(p) not in head_ids]
    print(f'SWIN fold{fold}: head params={sum(p.numel() for p in head_params)} backbone params={sum(p.numel() for p in back_params)}', flush=True)
    print(f'LRs: backbone={backbone_max_lr:.2e} head={head_max_lr:.2e}', flush=True)

    param_groups = _build_param_groups(model, head_ids, backbone_max_lr, head_max_lr, weight_decay=0.05)
    opt = torch.optim.AdamW(param_groups, betas=(0.9,0.98))
    total_steps = epochs * len(dl_tr) // max(1, accum_steps)
    warmup_steps = max(1, int(0.10*total_steps))
    sch = WarmupCosine(opt, total_steps, warmup_steps, min_lr_mult=0.05)
    mse = nn.MSELoss()
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))
    ema = ModelEmaV2(model, decay=0.9997)

    best = 1e9; best_state=None; patience=4; no_improve=0
    Path('models').mkdir(exist_ok=True)
    print(f'Fold {fold}: train={len(tr)} val={len(va)} steps/epoch={len(dl_tr)} warmup_steps={warmup_steps} accum_steps={accum_steps}', flush=True)
    first_batch=True
    for ep in range(1, epochs+1):
        t0=time.time(); model.train(); loss_sum=0.0
        opt.zero_grad(set_to_none=True)
        for it,(x,y) in enumerate(dl_tr):
            if device=='cuda': x=x.to(device,non_blocking=True).to(memory_format=torch.channels_last)
            else: x=x.to(device,non_blocking=True)
            y=y.to(device,non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                out = model(x).squeeze(1)  # unbounded
                loss = mse(out, y.squeeze(1)) / float(accum_steps)
            scaler.scale(loss).backward()
            if ((it+1) % accum_steps) == 0:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
                ema.update(model)
            loss_sum += loss.item() * float(accum_steps)
            if first_batch:
                if torch.cuda.is_available():
                    torch.cuda.synchronize(); print(f'  CUDA max memory MB: {torch.cuda.max_memory_allocated()/1e6:.1f}', flush=True)
                first_batch=False
            if (it+1)%100==0:
                lrs=[pg['lr'] for pg in opt.param_groups]
                print(f'  ep{ep} it{it+1}/{len(dl_tr)} loss={loss_sum/(it+1):.4f} lr=[{lrs[0]:.2e},{lrs[1]:.2e},{lrs[2]:.2e},{lrs[3]:.2e}]', flush=True)
        if ((it+1) % accum_steps) != 0:
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
            ema.update(model)

        eval_model = (model if ep<=3 else ema.module).to(device).float()
        eval_model.eval(); vs=[]; vp=[]
        with torch.no_grad():
            for x,y in dl_va:
                x=x.to(device,non_blocking=True).float()
                preds=eval_model(x).squeeze(1).float().cpu().numpy()  # unbounded
                vs.append(y.squeeze(1).cpu().numpy()); vp.append(preds)
        y_true=np.concatenate(vs); y_pred=np.concatenate(vp)
        y_pred_clip = np.clip(y_pred,0,100)
        v_rmse = rmse(y_true, y_pred_clip)
        print(f'Fold {fold} Ep {ep}: val RMSE={v_rmse:.4f} | ep_time={time.time()-t0:.1f}s', flush=True)
        print(f'  val pred mean/std: {y_pred_clip.mean():.2f}/{y_pred_clip.std():.2f}', flush=True)

        # Guardrails: abort if Ep3 val RMSE > 18.5 or pred std < 8 (checked before the checkpoint update, so an aborting epoch is never saved as best)
        if ep == 3 and (v_rmse > 18.5 or y_pred_clip.std() < 8.0):
            print('  Abort guard (Ep3) triggered', flush=True); break

        if v_rmse < best - 1e-4:
            best=v_rmse; no_improve=0
            best_state={k:v.cpu() for k,v in eval_model.state_dict().items()}
            torch.save(best_state, f'models/swinb_fold{fold}_best.pt')
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'  Early stopping at epoch {ep}', flush=True); break

    if best_state is not None:
        eval_model = build_swin_model(backbone, pretrained=False, img_size=img_size).to(device).float()
        eval_model.load_state_dict({k:v.to(device) for k,v in best_state.items()})
    else:
        eval_model = ema.module.to(device).float()

    oof, final_rmse = _eval_model(eval_model, ds_va, batch_size)
    preds_test = _predict_test(eval_model, batch_size, img_size)
    dbg = va[['Id','Pawpularity']].copy(); dbg['pred']=oof; dbg.to_csv(f'swinb_fold{fold}_oof_debug.csv', index=False)
    return val_idx, oof, preds_test, final_rmse

print('=== SWIN-B 5-FOLD TRAINING @384 (Unbounded MSE, EMA, Resize eval, HFlip TTA) ===', flush=True)
seed=42; img_size=384; epochs=18; batch_size=16
folds = pd.read_csv('folds.csv')
train_df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')
oof = np.zeros(len(train_df), dtype=np.float32)
test_preds_folds=[]; fold_scores=[]
Path('models').mkdir(exist_ok=True)
t0=time.time()
for f in [0]:
    print(f'--- SWIN-B Fold {f} start ---', flush=True)
    back_lr = 1.0e-4
    vidx, oof_f, tp_f, best_rmse = train_fold_swin(f, train_df, img_size=img_size, epochs=epochs, batch_size=batch_size, seed=seed, backbone_max_lr=back_lr, head_max_lr=5.0e-4, accum_steps=2)
    oof[vidx]=oof_f; test_preds_folds.append(tp_f); fold_scores.append(best_rmse)
    print(f'--- SWIN-B Fold {f} done. Best RMSE={best_rmse:.4f} ---', flush=True)
    gc.collect(); torch.cuda.empty_cache()
print('SWIN-B fold RMSEs:', fold_scores, 'Mean:', float(np.mean(fold_scores)))
oof_df = train_df[['Id','Pawpularity']].copy(); oof_df['oof']=np.clip(oof,0,100)
oof_df.to_csv('oof_swinb_384.csv', index=False)
np.save('test_preds_swinb_384.npy', np.stack(test_preds_folds, axis=0))
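# NOTE: only the folds listed in the loop above are filled; rows from untrained folds stay at 0,
# so this "overall" RMSE (and the quick blend below) is only meaningful once all 5 folds have run.
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'SWIN-B OOF RMSE: {overall:.5f} | Total time {time.time()-t0:.1f}s', flush=True)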
test_df = pd.read_csv('test.csv')
test_preds = np.mean(np.load('test_preds_swinb_384.npy'), axis=0)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_preds,0,100)})
sub.to_csv('submission_swinb.csv', index=False)
print('Saved submission_swinb.csv shape', sub.shape)

# Optional quick blend with meta to check OOF immediately
try:
    oof_meta = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta_pred'})
    oof_img = pd.read_csv('oof_swinb_384.csv').rename(columns={'oof':'img_pred'})
    oof_m = oof_img.merge(oof_meta, on=['Id','Pawpularity'], how='inner')
    y = oof_m['Pawpularity'].values.astype(np.float32)
    p_img = oof_m['img_pred'].values.astype(np.float32)
    p_meta = oof_m['meta_pred'].values.astype(np.float32)
    best = (1.0, rmse(y, p_img))
    for w in np.linspace(0.80,0.90,11):
        pred = np.clip(w*p_img + (1.0-w)*p_meta, 0, 100)
        s = rmse(y, pred)
        if s < best[1]: best = (float(w), float(s))
    print(f'[Quick Blend] Best w_img={best[0]:.2f} OOF RMSE={best[1]:.5f}', flush=True)
except Exception as e:
    print('Quick blend skipped:', e)

=== SWIN-B 5-FOLD TRAINING @384 (Unbounded MSE, EMA, Resize eval, HFlip TTA) ===
--- SWIN-B Fold 0 start ---
SWIN fold0: head params=1025 backbone params=86743224
LRs: backbone=1.00e-04 head=5.00e-04
Fold 0: train=7135 val=1785 steps/epoch=445 warmup_steps=400 accum_steps=2
  CUDA max memory MB: 14323.9
  ep1 it100/445 loss=1782.2747 lr=[1.25e-05,1.25e-05,6.25e-05,6.25e-05]
  ep1 it200/445 loss=1218.5705 lr=[2.50e-05,2.50e-05,1.25e-04,1.25e-04]
  ep1 it300/445 loss=956.9012 lr=[3.75e-05,3.75e-05,1.88e-04,1.88e-04]
  ep1 it400/445 loss=824.7604 lr=[5.00e-05,5.00e-05,2.50e-04,2.50e-04]
Fold 0 Ep 1: val RMSE=20.4619 | ep_time=146.0s
  val pred mean/std: 34.42/0.17
  ep2 it100/445 loss=455.3415 lr=[6.83e-05,6.83e-05,3.41e-04,3.41e-04]
  ep2 it200/445 loss=452.3151 lr=[8.07e-05,8.07e-05,4.04e-04,4.04e-04]
  ep2 it300/445 loss=430.6922 lr=[9.33e-05,9.33e-05,4.66e-04,4.66e-04]
  ep2 it400/445 loss=417.3413 lr=[1.00e-04,1.00e-04,5.00e-04,5.00e-04]
Fold 0 Ep 2: val RMSE=19.0231 | ep_time=147.9s
  val pred mean/std: 40.60/5.07
  ep3 it100/445 loss=335.9042 lr=[9.98e-05,9.98e-05,4.99e-04,4.99e-04]
  ep3 it200/445 loss=348.3608 lr=[9.96e-05,9.96e-05,4.98e-04,4.98e-04]
  ep3 it300/445 loss=338.9156 lr=[9.93e-05,9.93e-05,4.96e-04,4.96e-04]
  ep3 it400/445 loss=337.9598 lr=[9.89e-05,9.89e-05,4.94e-04,4.94e-04]
Fold 0 Ep 3: val RMSE=18.9894 | ep_time=148.4s
  val pred mean/std: 42.24/6.83
  Abort guard (Ep3) triggered
--- SWIN-B Fold 0 done. Best RMSE=19.0232 ---
SWIN-B fold RMSEs: [19.023167] Mean: 19.02316665649414
SWIN-B OOF RMSE: 39.72723 | Total time 498.9s
Saved submission_swinb.csv shape (992, 2)
[Quick Blend] Best w_img=0.80 OOF RMSE=34.08064
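
The gap between the fold RMSE (19.02) and the reported OOF RMSE (39.73) is consistent with the OOF array being scored over all rows while only fold 0 was trained, so the remaining rows still hold their initial zeros. The quick-blend figure is computed from the same array and would be affected the same way. A minimal sketch of scoring only the rows that belong to trained folds follows; the helper name `oof_rmse_on_trained_folds` and the toy arrays are hypothetical, not from the notebook.

```python
import numpy as np

def oof_rmse_on_trained_folds(y_true, oof, fold, trained_folds):
    """RMSE restricted to rows whose fold was actually trained and predicted."""
    y_true = np.asarray(y_true, dtype=np.float64)
    oof = np.asarray(oof, dtype=np.float64)
    mask = np.isin(np.asarray(fold), list(trained_folds))
    if not mask.any():
        raise ValueError("no rows belong to the trained folds")
    err = y_true[mask] - np.clip(oof[mask], 0, 100)
    return float(np.sqrt(np.mean(err ** 2)))

# Toy data (hypothetical): fold 1 was never trained, so its OOF slots are still 0
y = np.array([30.0, 50.0, 40.0, 60.0])
fold = np.array([0, 0, 1, 1])
oof = np.array([33.0, 46.0, 0.0, 0.0])

masked = oof_rmse_on_trained_folds(y, oof, fold, trained_folds=[0])
naive = float(np.sqrt(np.mean((y - oof) ** 2)))  # scores the zero-filled rows too
print(f"masked={masked:.4f} naive={naive:.4f}")
```

In the notebook itself, the equivalent would be to pass the list of folds that were looped over (here `[0]`) and the `fold` column of `train_df`, and to apply the same mask before any blend-weight search.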




In [11]:
# ConvNeXtV2 Base @384 5-fold training (Huber on 100*sigmoid outputs, EMA, Warmup+Cosine, Resize eval, HFlip TTA)
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import mean_squared_error

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass

def seed_everything(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

def rmse(y_true, y_pred):
    # squared=False is deprecated in recent scikit-learn releases; take the root explicitly
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

class PetDatasetCNV2(Dataset):
    def __init__(self, df, img_dir, img_size=512, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            self.tf = T.Compose([
                T.RandomResizedCrop(img_size, scale=(0.85,1.0), ratio=(0.9,1.1), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.1,0.1,0.1,0.05),
                T.ToTensor(),
                normalize,
            ])
        else:
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
        return x, y

def build_cnv2_model(backbone='convnextv2_base.fcmae_ft_in22k_in1k', pretrained=True):
    m = timm.create_model(backbone, pretrained=pretrained, num_classes=1, global_pool='avg', drop_path_rate=0.05)
    return m

def _build_param_groups(model, head_ids, backbone_lr, head_lr, weight_decay=0.05):
    no_decay_terms = ('bias','bn','norm','ln','pos_embed','cls_token','absolute_pos_embed','relative_position_bias_table')
    back_decay, back_no_decay, head_decay, head_no_decay = [], [], [], []
    for n,p in model.named_parameters():
        if not p.requires_grad: continue
        is_no_decay = any(t in n for t in no_decay_terms)
        is_head = (id(p) in head_ids)
        if is_head:
            (head_no_decay if is_no_decay else head_decay).append(p)
        else:
            (back_no_decay if is_no_decay else back_decay).append(p)
    return [
        {'params': back_decay, 'lr': backbone_lr, 'weight_decay': weight_decay},
        {'params': back_no_decay, 'lr': backbone_lr, 'weight_decay': 0.0},
        {'params': head_decay, 'lr': head_lr, 'weight_decay': weight_decay},
        {'params': head_no_decay, 'lr': head_lr, 'weight_decay': 0.0},
    ]

class WarmupCosine:
    def __init__(self, optimizer, total_steps, warmup_steps, min_lr_mult=0.01):
        self.opt = optimizer; self.total_steps = max(1,total_steps); self.warmup_steps = max(1,warmup_steps); self.s=0
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
        self.min_lr_mult = float(min_lr_mult)
    def step(self):
        self.s += 1
        if self.s <= self.warmup_steps:
            mult = self.s / float(self.warmup_steps)
        else:
            t = (self.s - self.warmup_steps) / float(max(1, self.total_steps - self.warmup_steps))
            mult = max(self.min_lr_mult, 0.5*(1+math.cos(math.pi*t)))
        for i, pg in enumerate(self.opt.param_groups):
            pg['lr'] = self.base_lrs[i] * mult

def _eval_model(eval_model, ds_va, batch_size, img_size):
    va_dl = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    eval_model.eval()
    oof = np.zeros(len(ds_va), dtype=np.float32); vs=[]; ptr=0
    with torch.no_grad():
        for x,y in va_dl:
            bs = x.size(0)
            x = x.to(device, non_blocking=True).float()
            logits = eval_model(x).squeeze(1)
            preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
            oof[ptr:ptr+bs] = np.clip(preds,0,100); ptr += bs
            vs.append(y.squeeze(1).cpu().numpy())
    y_true = np.concatenate(vs)
    return oof, rmse(y_true, oof)

def _predict_test(eval_model, batch_size, img_size):
    test_df = pd.read_csv('test.csv')
    tf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    tf_hf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.RandomHorizontalFlip(p=1.0), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    def load_img(pid): return Image.open(Path('test')/f'{pid}.jpg').convert('RGB')
    preds = []; eval_model.eval()
    with torch.no_grad():
        for i in range(0, len(test_df), min(batch_size, 16)):
            ids = test_df['Id'].iloc[i:i+min(batch_size, 16)].tolist()
            imgs = [load_img(pid) for pid in ids]
            x1 = torch.stack([tf(im) for im in imgs]).to(device, non_blocking=True).float()
            x2 = torch.stack([tf_hf(im) for im in imgs]).to(device, non_blocking=True).float()
            p1 = 100.0 * torch.sigmoid(eval_model(x1).squeeze(1))
            p2 = 100.0 * torch.sigmoid(eval_model(x2).squeeze(1))
            p = 0.5*(p1+p2)
            preds.append(p.float().cpu().numpy())
    return np.clip(np.concatenate(preds), 0, 100)

def train_fold_cnv2(fold, df, img_size=512, epochs=12, batch_size=8, seed=42, backbone_max_lr=2.0e-4, head_max_lr=1.0e-3, accum_steps=4):
    seed_everything(seed+fold)
    val_idx = df.index[df.fold == fold].values
    tr = df[df.fold != fold].reset_index(drop=True)
    va = df[df.fold == fold].reset_index(drop=True)
    ds_tr = PetDatasetCNV2(tr, 'train', img_size, train=True)
    ds_va = PetDatasetCNV2(va, 'train', img_size, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(ds_va, batch_size=min(batch_size,16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)

    backbone = 'convnextv2_base.fcmae_ft_in22k_in1k'
    model = build_cnv2_model(backbone, pretrained=True).to(device)
    if device=='cuda': model = model.to(memory_format=torch.channels_last)

    clf = model.get_classifier()
    head_module = getattr(model, clf) if isinstance(clf, str) else clf
    head_params = list(head_module.parameters()); head_ids = set(id(p) for p in head_params)
    back_params = [p for p in model.parameters() if id(p) not in head_ids]
    print(f'CNV2 fold{fold}: head params={sum(p.numel() for p in head_params)} backbone params={sum(p.numel() for p in back_params)}', flush=True)
    print(f'LRs: backbone={backbone_max_lr:.2e} head={head_max_lr:.2e}', flush=True)

    param_groups = _build_param_groups(model, head_ids, backbone_max_lr, head_max_lr, weight_decay=0.05)
    opt = torch.optim.AdamW(param_groups, betas=(0.9,0.98))
    total_steps = epochs * len(dl_tr) // max(1, accum_steps)
    warmup_steps = max(1, int(0.10*total_steps))
    sch = WarmupCosine(opt, total_steps, warmup_steps, min_lr_mult=0.01)
    huber = nn.HuberLoss(delta=7.5)
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))
    ema = ModelEmaV2(model, decay=0.9997)

    best = 1e9; best_state=None; patience=4; no_improve=0
    Path('models').mkdir(exist_ok=True)
    print(f'Fold {fold}: train={len(tr)} val={len(va)} steps/epoch={len(dl_tr)} warmup_steps={warmup_steps} accum_steps={accum_steps}', flush=True)
    first_batch=True
    for ep in range(1, epochs+1):
        t0=time.time(); model.train(); loss_sum=0.0
        opt.zero_grad(set_to_none=True)
        for it,(x,y) in enumerate(dl_tr):
            if device=='cuda': x=x.to(device,non_blocking=True).to(memory_format=torch.channels_last)
            else: x=x.to(device,non_blocking=True)
            y=y.to(device,non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                logits = model(x).squeeze(1)
                preds = 100.0 * torch.sigmoid(logits)
                loss = huber(preds, y.squeeze(1)) / float(accum_steps)
            scaler.scale(loss).backward()
            if ((it+1) % accum_steps) == 0:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
                ema.update(model)
            loss_sum += loss.item() * float(accum_steps)
            if first_batch:
                if torch.cuda.is_available():
                    torch.cuda.synchronize(); print(f'  CUDA max memory MB: {torch.cuda.max_memory_allocated()/1e6:.1f}', flush=True)
                first_batch=False
            if (it+1)%100==0:
                lrs=[pg['lr'] for pg in opt.param_groups]
                print(f'  ep{ep} it{it+1}/{len(dl_tr)} loss={loss_sum/(it+1):.4f} lr=[{lrs[0]:.2e},{lrs[1]:.2e},{lrs[2]:.2e},{lrs[3]:.2e}]', flush=True)
        # flush leftover grads
        if ((it+1) % accum_steps) != 0:
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
            ema.update(model)

        # Validation: raw model for ep<=3, EMA afterwards; FP32
        eval_model = (model if ep<=3 else ema.module).to(device).float()
        eval_model.eval(); vs=[]; vp=[]
        with torch.no_grad():
            for x,y in dl_va:
                x=x.to(device,non_blocking=True).float()
                logits=eval_model(x).squeeze(1)
                preds=(100.0*torch.sigmoid(logits)).float().cpu().numpy()
                vs.append(y.squeeze(1).cpu().numpy()); vp.append(preds)
        y_true=np.concatenate(vs); y_pred=np.clip(np.concatenate(vp),0,100)
        v_rmse = rmse(y_true, y_pred)
        print(f'Fold {fold} Ep {ep}: val RMSE={v_rmse:.4f} | ep_time={time.time()-t0:.1f}s', flush=True)
        print(f'  val pred mean/std: {y_pred.mean():.2f}/{y_pred.std():.2f}', flush=True)

        # Guardrails
        if ep == 3 and (v_rmse > 19.0 or y_pred.std() < 8.0):
            print('  Abort guard (Ep3) triggered', flush=True); break
        if ep == 5 and (v_rmse > 18.0 or y_pred.std() < 9.0):
            print('  Abort guard (Ep5) triggered', flush=True); break
        if ep == 7 and (v_rmse > 17.6 or y_pred.std() < 10.0):
            print('  Abort guard (Ep7) triggered', flush=True); break

        if v_rmse < best - 1e-4:
            best=v_rmse; no_improve=0
            best_state={k:v.cpu() for k,v in eval_model.state_dict().items()}
            torch.save(best_state, f'models/cnv2_fold{fold}_best.pt')
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'  Early stopping at epoch {ep}', flush=True); break

    if best_state is not None:
        eval_model = build_cnv2_model(backbone, pretrained=False).to(device).float()
        eval_model.load_state_dict({k:v.to(device) for k,v in best_state.items()})
    else:
        eval_model = ema.module.to(device).float()

    oof, final_rmse = _eval_model(eval_model, ds_va, batch_size, img_size)
    preds_test = _predict_test(eval_model, batch_size, img_size)
    dbg = va[['Id','Pawpularity']].copy(); dbg['pred']=oof; dbg.to_csv(f'cnv2_fold{fold}_oof_debug.csv', index=False)
    return val_idx, oof, preds_test, final_rmse

print('=== ConvNeXtV2-B 5-FOLD TRAINING @384 (Huber, EMA, Resize eval, HFlip TTA) ===', flush=True)
seed=42; img_size=384; epochs=12; batch_size=8
folds = pd.read_csv('folds.csv')
train_df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')
oof = np.zeros(len(train_df), dtype=np.float32)
test_preds_folds=[]; fold_scores=[]
Path('models').mkdir(exist_ok=True)
t0=time.time()
for f in range(5):
    print(f'--- CNV2 Fold {f} start ---', flush=True)
    vidx, oof_f, tp_f, best_rmse = train_fold_cnv2(f, train_df, img_size=img_size, epochs=epochs, batch_size=batch_size, seed=seed, backbone_max_lr=3.0e-4, head_max_lr=1.0e-3, accum_steps=4)
    oof[vidx]=oof_f; test_preds_folds.append(tp_f); fold_scores.append(best_rmse)
    print(f'--- CNV2 Fold {f} done. Best RMSE={best_rmse:.4f} ---', flush=True)
    gc.collect(); torch.cuda.empty_cache()
print('CNV2 fold RMSEs:', fold_scores, 'Mean:', float(np.mean(fold_scores)))
oof_df = train_df[['Id','Pawpularity']].copy(); oof_df['oof']=np.clip(oof,0,100)
oof_df.to_csv('oof_cnv2b_384.csv', index=False)
np.save('test_preds_cnv2b_384.npy', np.stack(test_preds_folds, axis=0))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'CNV2 OOF RMSE: {overall:.5f} | Total time {time.time()-t0:.1f}s', flush=True)
test_df = pd.read_csv('test.csv')
test_preds = np.mean(np.load('test_preds_cnv2b_384.npy'), axis=0)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_preds,0,100)})
sub.to_csv('submission_cnv2b_384.csv', index=False)
print('Saved submission_cnv2b_384.csv shape', sub.shape)

# Quick OOF blend with metadata if available
try:
    oof_meta = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta_pred'})
    oof_img = pd.read_csv('oof_cnv2b_384.csv').rename(columns={'oof':'img_pred'})
    oof_m = oof_img.merge(oof_meta, on=['Id','Pawpularity'], how='inner')
    y = oof_m['Pawpularity'].values.astype(np.float32)
    p_img = oof_m['img_pred'].values.astype(np.float32)
    p_meta = oof_m['meta_pred'].values.astype(np.float32)
    best = (1.0, rmse(y, p_img))
    for w in np.linspace(0.60,0.75,16):
        pred = np.clip(w*p_img + (1.0-w)*p_meta, 0, 100)
        s = rmse(y, pred)
        if s < best[1]: best = (float(w), float(s))
    print(f'[Quick Blend META] Best w_img={best[0]:.2f} OOF RMSE={best[1]:.5f}', flush=True)
except Exception as e:
    print('Quick blend skipped:', e)

=== ConvNeXtV2-B 5-FOLD TRAINING @384 (Huber, EMA, Resize eval, HFlip TTA) ===
--- CNV2 Fold 0 start ---
CNV2 fold0: head params=1025 backbone params=87692800
LRs: backbone=3.00e-04 head=1.00e-03
Fold 0: train=7135 val=1785 steps/epoch=891 warmup_steps=267 accum_steps=4
  CUDA max memory MB: 14323.9
  ep1 it100/891 loss=109.6341 lr=[2.81e-05,2.81e-05,9.36e-05,9.36e-05]
  ep1 it200/891 loss=95.4748 lr=[5.62e-05,5.62e-05,1.87e-04,1.87e-04]
  ep1 it300/891 loss=91.4548 lr=[8.43e-05,8.43e-05,2.81e-04,2.81e-04]
  ep1 it400/891 loss=88.8713 lr=[1.12e-04,1.12e-04,3.75e-04,3.75e-04]
  ep1 it500/891 loss=89.1932 lr=[1.40e-04,1.40e-04,4.68e-04,4.68e-04]
  ep1 it600/891 loss=87.9685 lr=[1.69e-04,1.69e-04,5.62e-04,5.62e-04]
  ep1 it700/891 loss=87.9162 lr=[1.97e-04,1.97e-04,6.55e-04,6.55e-04]
  ep1 it800/891 loss=87.9600 lr=[2.25e-04,2.25e-04,7.49e-04,7.49e-04]
Fold 0 Ep 1: val RMSE=21.4161 | ep_time=230.5s
  val pred mean/std: 28.12/6.26
  ep2 it100/891 loss=86.5986 lr=[2.79e-04,2.79e-04,9.29e-04,9.29e-04]
  ep2 it200/891 loss=84.0669 lr=[3.00e-04,3.00e-04,1.00e-03,1.00e-03]
  ep2 it300/891 loss=82.8404 lr=[3.00e-04,3.00e-04,1.00e-03,1.00e-03]
  ep2 it400/891 loss=84.1093 lr=[3.00e-04,3.00e-04,9.99e-04,9.99e-04]
  ep2 it500/891 loss=83.8058 lr=[2.99e-04,2.99e-04,9.97e-04,9.97e-04]
  ep2 it600/891 loss=83.2170 lr=[2.99e-04,2.99e-04,9.95e-04,9.95e-04]
  ep2 it700/891 loss=83.7356 lr=[2.98e-04,2.98e-04,9.93e-04,9.93e-04]
  ep2 it800/891 loss=85.3235 lr=[2.97e-04,2.97e-04,9.90e-04,9.90e-04]
Fold 0 Ep 2: val RMSE=23.7113 | ep_time=227.9s
  val pred mean/std: 25.32/0.00
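
The epoch 2 line above reports a prediction std of 0.00: the model is emitting nearly the same value for every validation image, and the RMSE rises from 21.4161 to 23.7113. A minimal sketch of a health check that could be run on each validation pass is given here. The helper name `prediction_health` and the `min_std=1.0` threshold are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def prediction_health(y_true, y_pred, min_std=1.0):
    """Summarise a validation pass and flag near-constant predictions.

    min_std is an illustrative threshold (assumption), not a tuned value.
    """
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # RMSE of always predicting the label mean equals the population std of the labels.
    baseline_rmse = float(np.std(y_true))
    pred_std = float(np.std(y_pred))
    return {
        'rmse': rmse,
        'baseline_rmse': baseline_rmse,
        'pred_std': pred_std,
        'collapsed': pred_std < min_std,
        'beats_baseline': rmse < baseline_rmse,
    }
```

A run that is flagged as collapsed and does not beat the constant baseline is usually not worth training further with the same settings.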




In [13]:
# BEiT-Base @384 5-fold training (Huber, EMA, Warmup+Cosine, Resize eval, HFlip TTA)
import os, math, time, random, gc
from pathlib import Path
import numpy as np
import pandas as pd
from PIL import Image
import torch, torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torchvision.transforms as T
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import mean_squared_error

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
torch.backends.cudnn.benchmark = True
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass

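def seed_everything(seed=42):
    """Seed Python, NumPy and PyTorch (CPU and all CUDA devices)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def rmse(y_true, y_pred):
    # Newer scikit-learn releases drop the `squared` argument, so take the root explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))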

class PetDatasetBEIT(Dataset):
    def __init__(self, df, img_dir, img_size=384, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = Path(img_dir)
        self.img_size = img_size
        normalize = T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))
        if train:
            self.tf = T.Compose([
                T.RandomResizedCrop(img_size, scale=(0.85,1.0), ratio=(0.9,1.1), interpolation=T.InterpolationMode.BICUBIC),
                T.RandomHorizontalFlip(p=0.5),
                T.ColorJitter(0.1,0.1,0.1,0.05),
                T.ToTensor(),
                normalize,
            ])
        else:
            self.tf = T.Compose([
                T.Resize((img_size, img_size), interpolation=T.InterpolationMode.BICUBIC),
                T.ToTensor(),
                normalize,
            ])
    def __len__(self): return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        img = Image.open(self.img_dir / f"{r.Id}.jpg").convert('RGB')
        x = self.tf(img)
        y = torch.tensor([float(r.Pawpularity)], dtype=torch.float32)
        return x, y

def build_beit_model(backbone='beit_base_patch16_384', pretrained=True):
    m = timm.create_model(backbone, pretrained=pretrained, num_classes=1, global_pool='avg', drop_path_rate=0.05)
    return m

def _build_param_groups(model, head_ids, backbone_lr, head_lr, weight_decay=0.02):
    no_decay_terms = ('bias','bn','norm','ln','pos_embed','cls_token','absolute_pos_embed','relative_position_bias_table')
    back_decay, back_no_decay, head_decay, head_no_decay = [], [], [], []
    for n,p in model.named_parameters():
        if not p.requires_grad: continue
        is_no_decay = any(t in n for t in no_decay_terms)
        is_head = (id(p) in head_ids)
        if is_head:
            (head_no_decay if is_no_decay else head_decay).append(p)
        else:
            (back_no_decay if is_no_decay else back_decay).append(p)
    return [
        {'params': back_decay, 'lr': backbone_lr, 'weight_decay': weight_decay},
        {'params': back_no_decay, 'lr': backbone_lr, 'weight_decay': 0.0},
        {'params': head_decay, 'lr': head_lr, 'weight_decay': weight_decay},
        {'params': head_no_decay, 'lr': head_lr, 'weight_decay': 0.0},
    ]

class WarmupCosine:
    def __init__(self, optimizer, total_steps, warmup_steps, min_lr_mult=0.01):
        self.opt = optimizer; self.total_steps = max(1,total_steps); self.warmup_steps = max(1,warmup_steps); self.s=0
        self.base_lrs = [pg['lr'] for pg in optimizer.param_groups]
        self.min_lr_mult = float(min_lr_mult)
    def step(self):
        self.s += 1
        if self.s <= self.warmup_steps:
            mult = self.s / float(self.warmup_steps)
        else:
            t = (self.s - self.warmup_steps) / float(max(1, self.total_steps - self.warmup_steps))
            mult = max(self.min_lr_mult, 0.5*(1+math.cos(math.pi*t)))
        for i, pg in enumerate(self.opt.param_groups):
            pg['lr'] = self.base_lrs[i] * mult

def _eval_model(eval_model, ds_va, batch_size):
    va_dl = DataLoader(ds_va, batch_size=min(batch_size, 16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    eval_model.eval()
    oof = np.zeros(len(ds_va), dtype=np.float32); vs=[]; ptr=0
    with torch.no_grad():
        for x,y in va_dl:
            bs = x.size(0)
            x = x.to(device, non_blocking=True).float()
            logits = eval_model(x).squeeze(1)
            preds = (100.0 * torch.sigmoid(logits)).float().cpu().numpy()
            oof[ptr:ptr+bs] = np.clip(preds,0,100); ptr += bs
            vs.append(y.squeeze(1).cpu().numpy())
    y_true = np.concatenate(vs)
    return oof, rmse(y_true, oof)

def _predict_test(eval_model, batch_size, img_size):
    test_df = pd.read_csv('test.csv')
    tf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    tf_hf = T.Compose([T.Resize((img_size,img_size), interpolation=T.InterpolationMode.BICUBIC), T.RandomHorizontalFlip(p=1.0), T.ToTensor(), T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225))])
    def load_img(pid): return Image.open(Path('test')/f'{pid}.jpg').convert('RGB')
    preds = []; eval_model.eval()
    with torch.no_grad():
        for i in range(0, len(test_df), min(batch_size, 16)):
            ids = test_df['Id'].iloc[i:i+min(batch_size, 16)].tolist()
            imgs = [load_img(pid) for pid in ids]
            x1 = torch.stack([tf(im) for im in imgs]).to(device, non_blocking=True).float()
            x2 = torch.stack([tf_hf(im) for im in imgs]).to(device, non_blocking=True).float()
            p1 = 100.0 * torch.sigmoid(eval_model(x1).squeeze(1))
            p2 = 100.0 * torch.sigmoid(eval_model(x2).squeeze(1))
            p = 0.5*(p1+p2)
            preds.append(p.float().cpu().numpy())
    return np.clip(np.concatenate(preds), 0, 100)

def train_fold_beit(fold, df, img_size=384, epochs=12, batch_size=16, seed=42, backbone_max_lr=2.0e-4, head_max_lr=1.0e-3, accum_steps=2):
    seed_everything(seed+fold)
    val_idx = df.index[df.fold == fold].values
    tr = df[df.fold != fold].reset_index(drop=True)
    va = df[df.fold == fold].reset_index(drop=True)
    ds_tr = PetDatasetBEIT(tr, 'train', img_size, train=True)
    ds_va = PetDatasetBEIT(va, 'train', img_size, train=False)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=4, pin_memory=True, drop_last=True, persistent_workers=False)
    dl_va = DataLoader(ds_va, batch_size=min(batch_size,16), shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)

    backbone = 'beit_base_patch16_384'
    model = build_beit_model(backbone, pretrained=True).to(device)
    if device=='cuda': model = model.to(memory_format=torch.channels_last)

    clf = model.get_classifier()
    head_module = getattr(model, clf) if isinstance(clf, str) else clf
    head_params = list(head_module.parameters()); head_ids = set(id(p) for p in head_params)
    back_params = [p for p in model.parameters() if id(p) not in head_ids]
    print(f'BEIT fold{fold}: head params={sum(p.numel() for p in head_params)} backbone params={sum(p.numel() for p in back_params)}', flush=True)
    print(f'LRs: backbone={backbone_max_lr:.2e} head={head_max_lr:.2e}', flush=True)

    param_groups = _build_param_groups(model, head_ids, backbone_max_lr, head_max_lr, weight_decay=0.02)
    opt = torch.optim.AdamW(param_groups, betas=(0.9,0.98))
    total_steps = epochs * len(dl_tr) // max(1, accum_steps)
    warmup_steps = max(1, int(0.10*total_steps))
    sch = WarmupCosine(opt, total_steps, warmup_steps, min_lr_mult=0.01)
    huber = nn.HuberLoss(delta=7.5)
    scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))
    ema = ModelEmaV2(model, decay=0.9997)

    best = 1e9; best_state=None; patience=4; no_improve=0
    Path('models').mkdir(exist_ok=True)
    print(f'Fold {fold}: train={len(tr)} val={len(va)} steps/epoch={len(dl_tr)} warmup_steps={warmup_steps} accum_steps={accum_steps}', flush=True)
    first_batch=True
    for ep in range(1, epochs+1):
        t0=time.time(); model.train(); loss_sum=0.0
        opt.zero_grad(set_to_none=True)
        for it,(x,y) in enumerate(dl_tr):
            if device=='cuda': x=x.to(device,non_blocking=True).to(memory_format=torch.channels_last)
            else: x=x.to(device,non_blocking=True)
            y=y.to(device,non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                logits = model(x).squeeze(1)
                preds = 100.0 * torch.sigmoid(logits)
                loss = huber(preds, y.squeeze(1)) / float(accum_steps)
            scaler.scale(loss).backward()
            if ((it+1) % accum_steps) == 0:
                scaler.unscale_(opt)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                scaler.step(opt); scaler.update(); sch.step(); opt.zero_grad(set_to_none=True)
                ema.update(model)
            loss_sum += loss.item() * float(accum_steps)
            if first_batch:
                if torch.cuda.is_available():
                    torch.cuda.synchronize(); print(f'  CUDA max memory MB: {torch.cuda.max_memory_allocated()/1e6:.1f}', flush=True)
                first_batch=False
            if (it+1)%100==0:
                lrs=[pg['lr'] for pg in opt.param_groups]
                print(f'  ep{ep} it{it+1}/{len(dl_tr)} loss={loss_sum/(it+1):.4f} lr=[{lrs[0]:.2e},{lrs[1]:.2e},{lrs[2]:.2e},{lrs[3]:.2e}]', flush=True)

        # Validation: raw model for ep<=3, EMA afterwards; FP32
        eval_model = (model if ep<=3 else ema.module).to(device).float()
        eval_model.eval(); vs=[]; vp=[]
        with torch.no_grad():
            for x,y in dl_va:
                x=x.to(device,non_blocking=True).float()
                logits=eval_model(x).squeeze(1)
                preds=(100.0*torch.sigmoid(logits)).float().cpu().numpy()
                vs.append(y.squeeze(1).cpu().numpy()); vp.append(preds)
        y_true=np.concatenate(vs); y_pred=np.clip(np.concatenate(vp),0,100)
        v_rmse = rmse(y_true, y_pred)
        print(f'Fold {fold} Ep {ep}: val RMSE={v_rmse:.4f} | ep_time={time.time()-t0:.1f}s', flush=True)
        print(f'  val pred mean/std: {y_pred.mean():.2f}/{y_pred.std():.2f}', flush=True)

        # Guardrails (conservative for BEiT)
        if ep == 3 and (v_rmse > 19.2 or y_pred.std() < 7.5):
            print('  Abort guard (Ep3) triggered', flush=True); break
        if ep == 5 and (v_rmse > 18.2 or y_pred.std() < 8.5):
            print('  Abort guard (Ep5) triggered', flush=True); break
        if ep == 7 and (v_rmse > 17.7 or y_pred.std() < 9.5):
            print('  Abort guard (Ep7) triggered', flush=True); break

        if v_rmse < best - 1e-4:
            best=v_rmse; no_improve=0
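            # Clone so the snapshot never aliases live weights (on CPU, .cpu() returns the same tensor).
            best_state={k:v.detach().cpu().clone() for k,v in eval_model.state_dict().items()}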
            torch.save(best_state, f'models/beitb_fold{fold}_best.pt')
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'  Early stopping at epoch {ep}', flush=True); break

    if best_state is not None:
        eval_model = build_beit_model(pretrained=False).to(device).float()
        eval_model.load_state_dict({k:v.to(device) for k,v in best_state.items()})
    else:
        eval_model = ema.module.to(device).float()

    oof, final_rmse = _eval_model(eval_model, ds_va, batch_size)
    preds_test = _predict_test(eval_model, batch_size, img_size)
    dbg = va[['Id','Pawpularity']].copy(); dbg['pred']=oof; dbg.to_csv(f'beitb_fold{fold}_oof_debug.csv', index=False)
    return val_idx, oof, preds_test, final_rmse

print('=== BEiT-Base 5-FOLD TRAINING @384 (Huber, EMA, Resize eval, HFlip TTA) ===', flush=True)
seed=42; img_size=384; epochs=12; batch_size=16
folds = pd.read_csv('folds.csv')
train_df = pd.read_csv('train.csv').merge(folds[['Id','fold']], on='Id', how='left')
oof = np.zeros(len(train_df), dtype=np.float32)
test_preds_folds=[]; fold_scores=[]
Path('models').mkdir(exist_ok=True)
t0=time.time()
for f in range(5):
    print(f'--- BEIT Fold {f} start ---', flush=True)
    vidx, oof_f, tp_f, best_rmse = train_fold_beit(f, train_df, img_size=img_size, epochs=epochs, batch_size=batch_size, seed=seed, backbone_max_lr=2.0e-4, head_max_lr=1.0e-3, accum_steps=2)
    oof[vidx]=oof_f; test_preds_folds.append(tp_f); fold_scores.append(best_rmse)
    print(f'--- BEIT Fold {f} done. Best RMSE={best_rmse:.4f} ---', flush=True)
    gc.collect(); torch.cuda.empty_cache()
print('BEIT fold RMSEs:', fold_scores, 'Mean:', float(np.mean(fold_scores)))
oof_df = train_df[['Id','Pawpularity']].copy(); oof_df['oof']=np.clip(oof,0,100)
oof_df.to_csv('oof_beitb_384.csv', index=False)
np.save('test_preds_beitb_384.npy', np.stack(test_preds_folds, axis=0))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'BEIT OOF RMSE: {overall:.5f} | Total time {time.time()-t0:.1f}s', flush=True)
test_df = pd.read_csv('test.csv')
test_preds = np.mean(np.load('test_preds_beitb_384.npy'), axis=0)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_preds,0,100)})
sub.to_csv('submission_beitb_384.csv', index=False)
print('Saved submission_beitb_384.csv shape', sub.shape)

# Quick OOF blend with metadata if available
try:
    oof_meta = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta_pred'})
    oof_img = pd.read_csv('oof_beitb_384.csv').rename(columns={'oof':'img_pred'})
    oof_m = oof_img.merge(oof_meta, on=['Id','Pawpularity'], how='inner')
    y = oof_m['Pawpularity'].values.astype(np.float32)
    p_img = oof_m['img_pred'].values.astype(np.float32)
    p_meta = oof_m['meta_pred'].values.astype(np.float32)
    best = (1.0, rmse(y, p_img))
    for w in np.linspace(0.60,0.80,11):
        pred = np.clip(w*p_img + (1.0-w)*p_meta, 0, 100)
        s = rmse(y, pred)
        if s < best[1]: best = (float(w), float(s))
    print(f'[Quick Blend META] Best w_img={best[0]:.2f} OOF RMSE={best[1]:.5f}', flush=True)
except Exception as e:
    print('Quick blend skipped:', e)

=== BEiT-Base 5-FOLD TRAINING @384 (Huber, EMA, Resize eval, HFlip TTA) ===
--- BEIT Fold 0 start ---
BEIT fold0: head params=769 backbone params=85975104
LRs: backbone=2.00e-04 head=1.00e-03
Fold 0: train=7135 val=1785 steps/epoch=445 warmup_steps=267 accum_steps=2
  CUDA max memory MB: 14323.9
  ep1 it100/445 loss=103.2987 lr=[3.75e-05,3.75e-05,1.87e-04,1.87e-04]
  ep1 it200/445 loss=96.6062 lr=[7.49e-05,7.49e-05,3.75e-04,3.75e-04]
  ep1 it300/445 loss=94.7846 lr=[1.12e-04,1.12e-04,5.62e-04,5.62e-04]
  ep1 it400/445 loss=92.2349 lr=[1.50e-04,1.50e-04,7.49e-04,7.49e-04]
Fold 0 Ep 1: val RMSE=21.0644 | ep_time=110.6s
  val pred mean/std: 32.73/3.13
  ep2 it100/445 loss=88.2513 lr=[2.00e-04,2.00e-04,1.00e-03,1.00e-03]
  ep2 it200/445 loss=88.4633 lr=[2.00e-04,2.00e-04,9.99e-04,9.99e-04]
  ep2 it300/445 loss=88.7181 lr=[1.99e-04,1.99e-04,9.95e-04,9.95e-04]
  ep2 it400/445 loss=88.4251 lr=[1.98e-04,1.98e-04,9.90e-04,9.90e-04]
Fold 0 Ep 2: val RMSE=21.4319 | ep_time=110.6s
  val pred mean/std: 30.60/0.51
  ep3 it100/445 loss=86.3513 lr=[1.96e-04,1.96e-04,9.78e-04,9.78e-04]
  ep3 it200/445 loss=88.2242 lr=[1.94e-04,1.94e-04,9.68e-04,9.68e-04]
  ep3 it300/445 loss=88.9415 lr=[1.91e-04,1.91e-04,9.55e-04,9.55e-04]
  ep3 it400/445 loss=89.5667 lr=[1.88e-04,1.88e-04,9.40e-04,9.40e-04]
Fold 0 Ep 3: val RMSE=20.5651 | ep_time=111.4s
  val pred mean/std: 34.07/0.61
  Abort guard (Ep3) triggered
--- BEIT Fold 0 done. Best RMSE=21.0644 ---
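
In the fold 0 log, epoch 3 has the lowest validation RMSE (20.5651), yet the fold reports Best RMSE=21.0644. The reason is ordering: in `train_fold_beit` the abort guard breaks out of the loop before the best-checkpoint update runs, so the aborting epoch is never saved. If keeping the lowest-RMSE epoch matters more than discarding a run the guard has rejected, one option is to update the checkpoint first. The sketch below replays the logged values through a hypothetical `track_best` helper; it only illustrates the ordering and is not the notebook's code.

```python
def track_best(history, guards):
    """Replay per-epoch (val_rmse, pred_std) pairs with checkpointing before guards.

    history: list of (val_rmse, pred_std), epoch 1 first.
    guards:  {epoch: (max_rmse, min_std)} abort thresholds.
    Returns (best_rmse, best_epoch, stopped_epoch); stopped_epoch is None if no guard fired.
    """
    best_rmse, best_epoch = float('inf'), None
    for ep, (v_rmse, pred_std) in enumerate(history, start=1):
        # Record the checkpoint first, so an aborting epoch can still be the best one.
        if v_rmse < best_rmse - 1e-4:
            best_rmse, best_epoch = v_rmse, ep
        if ep in guards:
            max_rmse, min_std = guards[ep]
            if v_rmse > max_rmse or pred_std < min_std:
                return best_rmse, best_epoch, ep
    return best_rmse, best_epoch, None

# Values copied from the fold 0 log and the guard thresholds in train_fold_beit.
fold0 = [(21.0644, 3.13), (21.4319, 0.51), (20.5651, 0.61)]
guards = {3: (19.2, 7.5), 5: (18.2, 8.5), 7: (17.7, 9.5)}
print(track_best(fold0, guards))
```

With this ordering the replay keeps epoch 3 as the best epoch and still stops at epoch 3. Whether a checkpoint with a prediction std of 0.61 is worth keeping is a separate judgment call.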


--- BEIT Fold 1 start ---
BEIT fold1: head params=769 backbone params=85975104
LRs: backbone=2.00e-04 head=1.00e-03
Fold 1: train=7139 val=1781 steps/epoch=446 warmup_steps=267 accum_steps=2
  CUDA max memory MB: 14323.9
  ep1 it100/446 loss=107.8455 lr=[3.75e-05,3.75e-05,1.87e-04,1.87e-04]
  ep1 it200/446 loss=96.8216 lr=[7.49e-05,7.49e-05,3.75e-04,3.75e-04]
  ep1 it300/446 loss=93.2268 lr=[1.12e-04,1.12e-04,5.62e-04,5.62e-04]
  ep1 it400/446 loss=91.2224 lr=[1.50e-04,1.50e-04,7.49e-04,7.49e-04]
Fold 1 Ep 1: val RMSE=21.5800 | ep_time=112.4s
  val pred mean/std: 32.75/1.34
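
The learning rates printed in these logs can be checked by hand against the `WarmupCosine` schedule. The scheduler advances once per optimizer step, which with `accum_steps=2` means once every two iterations. The sketch below restates the multiplier as a pure function (`warmup_cosine_mult` is a hypothetical name) and recomputes the fold 0 values at `ep1 it100`, using the fold 0 settings from the log.

```python
import math

def warmup_cosine_mult(step, total_steps, warmup_steps, min_lr_mult=0.01):
    """LR multiplier after `step` optimizer steps (step >= 1), mirroring WarmupCosine.step."""
    if step <= warmup_steps:
        return step / float(warmup_steps)
    t = (step - warmup_steps) / float(max(1, total_steps - warmup_steps))
    return max(min_lr_mult, 0.5 * (1.0 + math.cos(math.pi * t)))

# Fold 0 settings taken from the log: 12 epochs, 445 iterations per epoch, accum_steps=2.
epochs, steps_per_epoch, accum_steps = 12, 445, 2
total_steps = epochs * steps_per_epoch // accum_steps
warmup_steps = max(1, int(0.10 * total_steps))

opt_steps = 100 // accum_steps  # optimizer steps taken after 100 iterations
mult = warmup_cosine_mult(opt_steps, total_steps, warmup_steps)
print(f'backbone={2.0e-4 * mult:.2e} head={1.0e-3 * mult:.2e}')
# prints: backbone=3.75e-05 head=1.87e-04
```

This matches the logged `lr=[3.75e-05,3.75e-05,1.87e-04,1.87e-04]`. It also shows that warmup (267 optimizer steps) ends early in epoch 2, so the backbone is already at its peak rate of 2.00e-04 by `ep2 it100`.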




In [15]:
# CLIP ViT-L/14@336 feature extraction + RidgeCV per-fold + blend with metadata
import os, time, math, gc, sys, subprocess
from pathlib import Path
import numpy as np
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

def pip_install(pkg):
    print('> pip install', pkg, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt', pkg, '--upgrade-strategy', 'only-if-needed'], check=True)

print('=== Install open_clip_torch ===', flush=True)
pip_install('open_clip_torch==2.26.1')
pip_install('ftfy')
pip_install('regex')

import open_clip

torch.backends.cudnn.benchmark = True
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print('Device:', device, flush=True)

def rmse(y_true, y_pred):
    # Newer scikit-learn releases drop the `squared` argument, so take the root explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

class ImageIdDataset(Dataset):
    def __init__(self, ids, img_dir, preprocess):
        self.ids = ids
        self.dir = Path(img_dir)
        self.preprocess = preprocess
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        pid = self.ids[i]
        img = Image.open(self.dir / f'{pid}.jpg').convert('RGB')
        x = self.preprocess(img)
        return x, pid

def extract_clip_feats(model_name='ViT-L-14-336', pretrained='openai', img_size=336, batch_size=64):
    print(f'Loading CLIP model: {model_name} ({pretrained})', flush=True)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval()
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    train_ids = train_df['Id'].tolist()
    test_ids = test_df['Id'].tolist()
    ds_tr = ImageIdDataset(train_ids, 'train', preprocess)
    ds_te = ImageIdDataset(test_ids, 'test', preprocess)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    dl_te = DataLoader(ds_te, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    feats_tr = np.zeros((len(ds_tr), model.visual.output_dim), dtype=np.float32)
    feats_te = np.zeros((len(ds_te), model.visual.output_dim), dtype=np.float32)
    print('Extracting train features...', flush=True)
    ptr = 0; t0=time.time()
    with torch.no_grad():
        for i,(x, _) in enumerate(dl_tr):
            x = x.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                f = model.encode_image(x)
            f = f.float()
            # L2 normalize
            f = torch.nn.functional.normalize(f, dim=1)
            n = f.size(0)
            feats_tr[ptr:ptr+n] = f.cpu().numpy(); ptr += n
            if (i+1)%50==0:
                print(f'  train batches {i+1}, elapsed {time.time()-t0:.1f}s', flush=True)
    print('Train feats done in %.1fs' % (time.time()-t0))
    print('Extracting test features...', flush=True)
    ptr = 0; t0=time.time()
    with torch.no_grad():
        for i,(x, _) in enumerate(dl_te):
            x = x.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                f = model.encode_image(x)
            f = f.float()
            f = torch.nn.functional.normalize(f, dim=1)
            n = f.size(0)
            feats_te[ptr:ptr+n] = f.cpu().numpy(); ptr += n
            if (i+1)%50==0:
                print(f'  test batches {i+1}, elapsed {time.time()-t0:.1f}s', flush=True)
    print('Test feats done in %.1fs' % (time.time()-t0))
    np.save('clip_vitl14_336_train_feats.npy', feats_tr)
    np.save('clip_vitl14_336_test_feats.npy', feats_te)
    print('Saved features:', feats_tr.shape, feats_te.shape, flush=True)
    return feats_tr, feats_te, train_df, test_df

def ridge_cv_on_feats(feats_tr, feats_te, y, folds_df, seed=42):
    alphas = np.logspace(-2, 3, 25)
    oof = np.zeros(len(y), dtype=np.float32)
    test_preds_folds = []
    scores = []
    for f in range(5):
        tr_idx = np.where(folds_df['fold'].values != f)[0]
        va_idx = np.where(folds_df['fold'].values == f)[0]
        X_tr, X_va = feats_tr[tr_idx], feats_tr[va_idx]
        y_tr, y_va = y[tr_idx], y[va_idx]
        model = RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5)
        t0=time.time()
        model.fit(X_tr, y_tr)
        p_va = model.predict(X_va).astype(np.float32)
        p_te = model.predict(feats_te).astype(np.float32)
        oof[va_idx] = p_va
        test_preds_folds.append(p_te)
        s = rmse(y_va, p_va)
        scores.append(float(s))
        print(f'Ridge fold {f}: rmse={s:.5f} | alpha={model.alpha_:.5f} | elapsed={time.time()-t0:.1f}s', flush=True)
        gc.collect()
    test_mean = np.mean(np.vstack(test_preds_folds), axis=0).astype(np.float32)
    return oof, test_mean, scores

print('=== CLIP FE + RidgeCV pipeline start ===', flush=True)
feats_tr_path = Path('clip_vitl14_336_train_feats.npy')
feats_te_path = Path('clip_vitl14_336_test_feats.npy')
if feats_tr_path.exists() and feats_te_path.exists():
    feats_tr = np.load(feats_tr_path)
    feats_te = np.load(feats_te_path)
    train_df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    print('Loaded cached CLIP features:', feats_tr.shape, feats_te.shape, flush=True)
else:
    feats_tr, feats_te, train_df, test_df = extract_clip_feats()

folds = pd.read_csv('folds.csv')
df = train_df.merge(folds[['Id','fold']], on='Id', how='left')
y = df['Pawpularity'].values.astype(np.float32)

oof_clip, test_clip, scores = ridge_cv_on_feats(feats_tr, feats_te, y, df)
print('CLIP Ridge fold RMSEs:', scores, 'Mean:', float(np.mean(scores)))
oof_df = df[['Id','Pawpularity']].copy(); oof_df['oof'] = np.clip(oof_clip, 0, 100)
oof_df.to_csv('oof_clip_vitl14_336_ridge.csv', index=False)
np.save('test_preds_clip_vitl14_336_ridge.npy', np.clip(test_clip, 0, 100))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print(f'CLIP Ridge OOF RMSE: {overall:.5f}', flush=True)

# Blend with metadata OOF if available
blend_weight_img = None; blend_score = None
try:
    oof_meta = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta_pred'})
    oof_img = pd.read_csv('oof_clip_vitl14_336_ridge.csv').rename(columns={'oof':'img_pred'})
    oof_m = oof_img.merge(oof_meta, on=['Id','Pawpularity'], how='inner')
    yb = oof_m['Pawpularity'].values.astype(np.float32)
    p_img = oof_m['img_pred'].values.astype(np.float32)
    p_meta = oof_m['meta_pred'].values.astype(np.float32)
    best = (1.0, rmse(yb, p_img))
    for w in np.arange(0.85, 0.951, 0.005):
        pred = np.clip(w*p_img + (1.0-w)*p_meta, 0, 100)
        s = rmse(yb, pred)
        if s < best[1]: best = (float(w), float(s))
    blend_weight_img, blend_score = best
    print(f'[Blend META] Best w_img={blend_weight_img:.3f} OOF RMSE={blend_score:.5f}', flush=True)
except Exception as e:
    print('Meta blend skipped:', e)

# Create submission from CLIP Ridge or blended if meta available
test_df = pd.read_csv('test.csv')
sub_path = 'submission_clip_ridge.csv'
if blend_weight_img is not None and Path('test_preds_xgb_meta.npy').exists():
    meta_test = np.load('test_preds_xgb_meta.npy').astype(np.float32)
    blend_test = np.clip(blend_weight_img*test_clip + (1.0-blend_weight_img)*meta_test, 0, 100)
    sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': blend_test})
    sub_path = 'submission_clip_ridge_blend.csv'
else:
    sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(test_clip, 0, 100)})
sub.to_csv(sub_path, index=False)
print('Saved', sub_path, 'shape', sub.shape, flush=True)
print('=== CLIP FE + RidgeCV pipeline done ===', flush=True)
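
The blend loop above only evaluates weights on a narrow grid (0.85 to 0.95 in this cell), so an optimum outside that range is never seen. For a two-model linear blend, the weight that minimizes squared error has a closed form, which can serve as a sanity check on where the grid should be centered. The sketch below is a minimal version under stated assumptions: `best_blend_weight` is a hypothetical helper, and it ignores the `np.clip(..., 0, 100)` that the loop applies to the blended predictions, so its answer may differ slightly from the grid result.

```python
import numpy as np

def best_blend_weight(y, p_img, p_meta):
    """Least-squares weight w for w*p_img + (1-w)*p_meta, clipped to [0, 1].

    Minimizing sum((y - p_meta - w*d)**2) with d = p_img - p_meta gives
    w = d.(y - p_meta) / d.d; the objective is convex in w, so clipping
    the result yields the constrained optimum.
    """
    y = np.asarray(y, dtype=np.float64)
    p_img = np.asarray(p_img, dtype=np.float64)
    p_meta = np.asarray(p_meta, dtype=np.float64)
    d = p_img - p_meta
    denom = float(np.dot(d, d))
    if denom == 0.0:
        return 1.0  # identical predictions: every weight gives the same blend
    w = float(np.dot(d, y - p_meta) / denom)
    return min(1.0, max(0.0, w))
```

Called on the merged OOF columns, for example `best_blend_weight(yb, p_img, p_meta)`, it returns a weight that can be compared with the grid's `blend_weight_img`. A value pinned at 0 or 1, or well outside the grid, suggests the grid range needs revisiting.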

=== Install open_clip_torch ===


> pip install open_clip_torch==2.26.1


Installing collected packages: mpmath, wcwidth, urllib3, typing-extensions, tqdm, sympy, safetensors, regex, pyyaml, pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, idna, hf-xet, fsspec, filelock, charset_normalizer, certifi, triton, requests, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, ftfy, nvidia-cusolver-cu12, huggingface-hub, torch, torchvision, timm, open_clip_torch


Successfully installed MarkupSafe-3.0.2 certifi-2025.8.3 charset_normalizer-3.4.3 filelock-3.19.1 fsspec-2025.9.0 ftfy-6.3.1 hf-xet-1.1.10 huggingface-hub-0.35.1 idna-3.10 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 open_clip_torch-2.26.1 packaging-25.0 pillow-11.3.0 pyyaml-6.0.3 regex-2025.9.18 requests-2.32.5 safetensors-0.6.2 sympy-1.14.0 timm-1.0.20 torch-2.4.1 torchvision-0.19.1 tqdm-4.67.1 triton-3.0.0 typing-extensions-4.15.0 urllib3-2.5.0 wcwidth-0.2.14




> pip install ftfy


Collecting ftfy
  Downloading ftfy-6.3.1-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.8/44.8 KB 3.2 MB/s eta 0:00:00
Collecting wcwidth
  Downloading wcwidth-0.2.14-py2.py3-none-any.whl (37 kB)


Installing collected packages: wcwidth, ftfy
Successfully installed ftfy-6.3.1 wcwidth-0.2.14
> pip install regex




Collecting regex
  Downloading regex-2025.9.18-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (798 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 KB 29.6 MB/s eta 0:00:00


Installing collected packages: regex
Successfully installed regex-2025.9.18




Device: cuda


=== CLIP FE + RidgeCV pipeline start ===


Loading CLIP model: ViT-L-14-336 (openai)


100%|████████████████████████████████████████| 934M/934M [00:08<00:00, 106MiB/s]




Extracting train features...




  train batches 50, elapsed 32.9s


  train batches 100, elapsed 64.6s


Train feats done in 90.1s
Extracting test features...




Test feats done in 11.2s
Saved features: (8920, 768) (992, 768)


Ridge fold 0: rmse=17.32029 | alpha=1.95734 | elapsed=4.7s




Ridge fold 1: rmse=17.18394 | alpha=1.95734 | elapsed=4.8s




Ridge fold 2: rmse=17.74895 | alpha=1.95734 | elapsed=4.0s




Ridge fold 3: rmse=18.38489 | alpha=1.95734 | elapsed=4.3s




Ridge fold 4: rmse=17.06082 | alpha=1.95734 | elapsed=4.0s




CLIP Ridge fold RMSEs: [17.32028579711914, 17.18393898010254, 17.748952865600586, 18.384885787963867, 17.06081771850586] Mean: 17.539776229858397
CLIP Ridge OOF RMSE: 17.54647


[Blend META] Best w_img=1.000 OOF RMSE=17.54646


Saved submission_clip_ridge_blend.csv shape (992, 2)


=== CLIP FE + RidgeCV pipeline done ===
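The log above reports two slightly different numbers for the same model: the mean of the five fold RMSEs (about 17.5398) and the pooled OOF RMSE (17.54647). The gap arises because RMSE is not linear: the pooled value is the square root of the size-weighted mean of the fold MSEs, which can never be smaller than the plain mean of the fold RMSEs. The sketch below recomputes both from the rounded per-fold values in the log. It assumes the five validation folds are equally sized (8920 / 5 rows each), which the log does not confirm, so the pooled value should only land near the logged one, not match it digit for digit.

```python
import numpy as np

# Fold RMSEs copied from the "Ridge fold k" log lines (rounded to 5 decimals).
fold_rmse = np.array([17.32029, 17.18394, 17.74895, 18.38489, 17.06082])

# Assumption (not confirmed by the log): every validation fold has the same row count.
fold_sizes = np.full(5, 8920 // 5)

# Plain average of the per-fold RMSEs.
mean_of_folds = fold_rmse.mean()

# Pooled RMSE: square root of the size-weighted mean of the per-fold MSEs.
pooled = np.sqrt(np.sum(fold_sizes * fold_rmse**2) / fold_sizes.sum())

print(f"mean of fold RMSEs: {mean_of_folds:.5f}")
print(f"pooled RMSE:        {pooled:.5f}")
```

When comparing runs, it helps to pick one of the two (the pooled OOF RMSE is the one the blend search optimises) and use it consistently.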




In [16]:
# Set final submission.csv from CLIP Ridge (blended with meta if available)
import pandas as pd, os
if os.path.exists('submission_clip_ridge_blend.csv'):
    sub = pd.read_csv('submission_clip_ridge_blend.csv')
else:
    sub = pd.read_csv('submission_clip_ridge.csv')
sub.to_csv('submission.csv', index=False)
print('Final submission.csv written with shape:', sub.shape, 'columns:', list(sub.columns))
print(sub.head().to_string(index=False))

Final submission.csv written with shape: (992, 2) columns: ['Id', 'Pawpularity']
                              Id  Pawpularity
ee51b99832f1ba868f646df93d2b6b81    36.907543
caddfb3f8bff9c4b95dbe022018eea21    38.117523
582eeabd4a448a53ebb79995888a4b0b    28.580616
afc1ad7f0c5eea880759d09e77f7deee    29.608410
d5bdf3446e86ce4ec67ce7a00f1cccc2    38.989830


In [17]:
# XGBoost on CLIP features (+metadata) with 5-fold CV, then 3-way blend (CLIP Ridge, XGB-CLIP, META)
import gc, time
import numpy as np, pandas as pd
from pathlib import Path
import xgboost as xgb
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # `squared=False` is deprecated since scikit-learn 1.4; take the square root explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

print('=== XGB on CLIP features (+meta) ===', flush=True)
# Load CLIP feats cached
feats_tr = np.load('clip_vitl14_336_train_feats.npy')
feats_te = np.load('clip_vitl14_336_test_feats.npy')
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
folds = pd.read_csv('folds.csv')
df = train_df.merge(folds[['Id','fold']], on='Id', how='left')
y = df['Pawpularity'].values.astype(np.float32)

# Optional metadata features (engineered already cached by cell 7)
meta_tr = pd.read_csv('meta_features.csv') if Path('meta_features.csv').exists() else None
meta_te = pd.read_csv('meta_features_test.csv') if Path('meta_features_test.csv').exists() else None
if meta_tr is None or meta_te is None:
    # fallback: use only 12 binary metadata from CSV
    meta_cols = ['Subject Focus','Eyes','Face','Near','Action','Accessory','Group','Collage','Human','Occlusion','Info','Blur']
    meta_trm = train_df[['Id'] + meta_cols].copy()
    meta_tem = test_df[['Id'] + meta_cols].copy()
else:
    # merge 12 binaries + engineered stats
    meta_cols = ['Subject Focus','Eyes','Face','Near','Action','Accessory','Group','Collage','Human','Occlusion','Info','Blur']
    meta_trm = train_df[['Id'] + meta_cols].merge(meta_tr, on='Id', how='left')
    meta_tem = test_df[['Id'] + meta_cols].merge(meta_te, on='Id', how='left')
    for c in ['gray_mean','gray_std','sat_mean','sat_std','val_mean','val_std','lap_var','aspect']:
        if c in meta_trm: meta_trm[c] = meta_trm[c].fillna(meta_trm[c].median())
        if c in meta_tem: meta_tem[c] = meta_tem[c].fillna(meta_tem[c].median())

# Build dense feature matrices: [CLIP | META]
meta_feats = [c for c in meta_trm.columns if c != 'Id']
X_tr = np.concatenate([feats_tr, meta_trm[meta_feats].values.astype(np.float32)], axis=1)
X_te = np.concatenate([feats_te, meta_tem[meta_feats].values.astype(np.float32)], axis=1)
print('Shapes: CLIP', feats_tr.shape, '+ META', len(meta_feats), '-> X_tr', X_tr.shape, 'X_te', X_te.shape, flush=True)

oof_xgb = np.zeros(len(y), dtype=np.float32)
test_preds_folds = []
scores = []
params = {
    'learning_rate': 0.05,
    'max_depth': 7,
    'min_child_weight': 1.0,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'reg_lambda': 1.0,
    'tree_method': 'hist',
    'device': 'cuda',
    'random_state': 42,
}
for f in range(5):
    tr_idx = np.where(df['fold'].values != f)[0]
    va_idx = np.where(df['fold'].values == f)[0]
    dtr = xgb.DMatrix(X_tr[tr_idx], label=y[tr_idx])
    dva = xgb.DMatrix(X_tr[va_idx], label=y[va_idx])
    dte = xgb.DMatrix(X_te)
    t0=time.time()
    # The default objective (reg:squarederror) already reports RMSE on every eval set, so no
    # custom metric is needed; early stopping tracks the last eval set ('va').
    model = xgb.train(params, dtr, num_boost_round=10000, evals=[(dtr,'tr'),(dva,'va')],
                      early_stopping_rounds=200, verbose_eval=200)
    p_va = model.predict(dva, iteration_range=(0, model.best_iteration+1)).astype(np.float32)
    p_te = model.predict(dte, iteration_range=(0, model.best_iteration+1)).astype(np.float32)
    oof_xgb[va_idx] = p_va
    test_preds_folds.append(p_te)
    s = rmse(y[va_idx], p_va)
    scores.append(float(s))
    print(f'XGB-CLIP fold {f}: rmse={s:.5f} | best_iter={model.best_iteration} | elapsed={time.time()-t0:.1f}s', flush=True)
    gc.collect()

oof_xgb_df = df[['Id','Pawpularity']].copy(); oof_xgb_df['oof'] = np.clip(oof_xgb, 0, 100)
oof_xgb_df.to_csv('oof_xgb_clip_meta.csv', index=False)
test_xgb = np.mean(np.vstack(test_preds_folds), axis=0).astype(np.float32)
np.save('test_preds_xgb_clip_meta.npy', np.clip(test_xgb, 0, 100))
overall = rmse(oof_xgb_df['Pawpularity'].values, oof_xgb_df['oof'].values)
print('XGB-CLIP(+meta) fold RMSEs:', scores, 'Mean:', float(np.mean(scores)))
print(f'XGB-CLIP(+meta) OOF RMSE: {overall:.5f}', flush=True)

print('=== 3-way OOF blend (CLIP-Ridge, XGB-CLIP, META) ===', flush=True)
oof_ridge = pd.read_csv('oof_clip_vitl14_336_ridge.csv').rename(columns={'oof':'ridge'})
oof_meta = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta'})
oof_xgbm = pd.read_csv('oof_xgb_clip_meta.csv').rename(columns={'oof':'xgb_clip'})
m = oof_ridge.merge(oof_xgbm, on=['Id','Pawpularity']).merge(oof_meta, on=['Id','Pawpularity'])
y_true = m['Pawpularity'].values.astype(np.float32)
pr = m['ridge'].values.astype(np.float32)
px = m['xgb_clip'].values.astype(np.float32)
pm = m['meta'].values.astype(np.float32)
best = (0.90, 0.05, 0.05, rmse(y_true, pr))  # wr, wx, wm, score
for wr in np.linspace(0.70, 0.95, 26):
    for wx in np.linspace(0.00, 0.30, 16):
        wm = 1.0 - wr - wx
        if wm < 0.00 or wm > 0.20:
            continue
        pred = np.clip(wr*pr + wx*px + wm*pm, 0, 100)
        s = rmse(y_true, pred)
        if s < best[3]:
            best = (float(wr), float(wx), float(wm), float(s))
print(f'[3-way Blend] wr={best[0]:.3f} wx={best[1]:.3f} wm={best[2]:.3f} OOF RMSE={best[3]:.5f}', flush=True)

# Apply 3-way blend to test
test_df = pd.read_csv('test.csv')
test_ridge = np.load('test_preds_clip_vitl14_336_ridge.npy').astype(np.float32)
test_xgbc = np.load('test_preds_xgb_clip_meta.npy').astype(np.float32)
test_meta = np.load('test_preds_xgb_meta.npy').astype(np.float32) if Path('test_preds_xgb_meta.npy').exists() else np.zeros_like(test_ridge)
wr, wx, wm = best[0], best[1], best[2]
test_blend = np.clip(wr*test_ridge + wx*test_xgbc + wm*test_meta, 0, 100)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': test_blend})
sub.to_csv('submission_clip_ensemble.csv', index=False)
print('Saved submission_clip_ensemble.csv shape', sub.shape, flush=True)

=== XGB on CLIP features (+meta) ===


Shapes: CLIP (8920, 768) + META 20 -> X_tr (8920, 788) X_te (992, 788)


[0]	tr-rmse:20.40631	va-rmse:20.09181
[200]	tr-rmse:5.07264	va-rmse:17.50918
[400]	tr-rmse:1.78062	va-rmse:17.44156
[600]	tr-rmse:0.60026	va-rmse:17.43655
[632]	tr-rmse:0.50415	va-rmse:17.43651


XGB-CLIP fold 0: rmse=17.43451 | best_iter=432 | elapsed=11.7s




[0]	tr-rmse:20.29478	va-rmse:20.60252
[200]	tr-rmse:4.86808	va-rmse:17.88545
[400]	tr-rmse:1.66107	va-rmse:17.81039
[600]	tr-rmse:0.54684	va-rmse:17.80423
[800]	tr-rmse:0.17741	va-rmse:17.80436
[859]	tr-rmse:0.12690	va-rmse:17.80367


XGB-CLIP fold 1: rmse=17.80211 | best_iter=660 | elapsed=16.8s




[0]	tr-rmse:20.20075	va-rmse:20.84301
[200]	tr-rmse:4.86245	va-rmse:18.08161
[400]	tr-rmse:1.69246	va-rmse:18.05155
[600]	tr-rmse:0.55818	va-rmse:18.05257
[647]	tr-rmse:0.43794	va-rmse:18.05183




XGB-CLIP fold 2: rmse=18.04831 | best_iter=447 | elapsed=12.0s




[0]	tr-rmse:20.13149	va-rmse:21.08061
[200]	tr-rmse:4.64518	va-rmse:18.82445
[400]	tr-rmse:1.56988	va-rmse:18.80511
[579]	tr-rmse:0.58169	va-rmse:18.80703


XGB-CLIP fold 3: rmse=18.80007 | best_iter=379 | elapsed=10.8s




[0]	tr-rmse:20.42051	va-rmse:19.92709
[200]	tr-rmse:4.91822	va-rmse:17.68702
[336]	tr-rmse:2.40385	va-rmse:17.69036


XGB-CLIP fold 4: rmse=17.66584 | best_iter=137 | elapsed=6.2s




XGB-CLIP(+meta) fold RMSEs: [17.434505462646484, 17.802114486694336, 18.048307418823242, 18.800071716308594, 17.665842056274414] Mean: 17.950168228149415
XGB-CLIP(+meta) OOF RMSE: 17.95627


=== 3-way OOF blend (CLIP-Ridge, XGB-CLIP, META) ===


[3-way Blend] wr=0.820 wx=0.180 wm=0.000 OOF RMSE=17.52494


Saved submission_clip_ensemble.csv shape (992, 2)
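The blend search in the cell above scans a restricted box (`wr` in 0.70 to 0.95, `wm` capped at 0.20), so the logged optimum is only the best point inside that box. As a rough cross-check, the sketch below enumerates every non-negative weight vector on a fixed grid that sums to exactly 1, for any number of models. The function name `search_blend_weights`, the `step` parameter, and the synthetic predictions are illustrative assumptions, not part of the notebook; real use would pass the merged OOF columns instead.

```python
import itertools
import numpy as np

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))

def search_blend_weights(y_true, preds, step=0.05):
    """Grid-search non-negative weights summing to 1 over the columns of `preds`."""
    n_models = preds.shape[1]
    ticks = int(round(1.0 / step))
    best_w, best_score = None, np.inf
    # Enumerate integer compositions so every weight vector sums to exactly `ticks`.
    for combo in itertools.product(range(ticks + 1), repeat=n_models - 1):
        if sum(combo) > ticks:
            continue
        w = np.array(list(combo) + [ticks - sum(combo)], dtype=np.float64) / ticks
        score = rmse(y_true, np.clip(preds @ w, 0, 100))
        if score < best_score:
            best_w, best_score = w, score
    return best_w, best_score

# Synthetic stand-in for three OOF prediction columns (hypothetical data).
rng = np.random.default_rng(0)
y = rng.uniform(1, 100, size=500)
preds = np.column_stack([y + rng.normal(0, s, size=500) for s in (15.0, 18.0, 20.0)])

w, s = search_blend_weights(y, preds)
print("weights:", w, "blend RMSE:", round(s, 5))
```

Because the single-model corners are part of the grid, the returned score can never be worse than the best individual model on the same rows. The weights are still tuned on the rows they are scored on, so small gains should be treated with caution.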




In [22]:
# Overwrite submission.csv with best available ensemble (prefer stacked > 2CLIP ensemble)
import os, pandas as pd
candidates = [
    'submission_stacked.csv',
    'submission_clip2_ensemble.csv',
    'submission_clip_ensemble.csv',
    'submission_clip_ridge_blend.csv',
    'submission_clip_ridge.csv',
]
for p in candidates:
    if os.path.exists(p):
        sub = pd.read_csv(p)
        sub.to_csv('submission.csv', index=False)
        print('submission.csv set from', p, '-> rows:', len(sub))
        print(sub.head().to_string(index=False))
        break
else:
    raise FileNotFoundError('No candidate submission files found')

submission.csv set from submission_stacked.csv -> rows: 992
                              Id  Pawpularity
ee51b99832f1ba868f646df93d2b6b81    38.701420
caddfb3f8bff9c4b95dbe022018eea21    43.743908
582eeabd4a448a53ebb79995888a4b0b    32.262863
afc1ad7f0c5eea880759d09e77f7deee    28.141064
d5bdf3446e86ce4ec67ce7a00f1cccc2    38.118626
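The cell above copies the first candidate file it finds to `submission.csv` without validating it. A minimal sanity check, sketched below under the assumption that a valid submission has exactly the columns `Id` and `Pawpularity`, one row per test Id, and predictions inside the same [0, 100] range the notebook clips to, could run before the overwrite. `check_submission` is a hypothetical helper, not something defined elsewhere in the notebook.

```python
import pandas as pd

def check_submission(sub, expected_ids, lo=0.0, hi=100.0):
    """Return a list of human-readable problems; an empty list means the frame looks sane."""
    problems = []
    if list(sub.columns) != ["Id", "Pawpularity"]:
        return [f"unexpected columns: {list(sub.columns)}"]
    if len(sub) != len(expected_ids):
        problems.append(f"row count {len(sub)} != expected {len(expected_ids)}")
    if sub["Id"].duplicated().any():
        problems.append("duplicate Ids")
    if set(sub["Id"]) != set(expected_ids):
        problems.append("Id set differs from expected")
    if sub["Pawpularity"].isna().any():
        problems.append("NaN predictions")
    elif not sub["Pawpularity"].between(lo, hi).all():
        problems.append(f"predictions outside [{lo}, {hi}]")
    return problems

# Tiny illustrative frames (made-up Ids and values).
good = pd.DataFrame({"Id": ["a", "b", "c"], "Pawpularity": [36.9, 38.1, 28.6]})
bad = pd.DataFrame({"Id": ["a", "b", "b"], "Pawpularity": [36.9, float("nan"), 120.0]})

print(check_submission(good, ["a", "b", "c"]))
print(check_submission(bad, ["a", "b", "c"]))
```

In the notebook this would be called with `pd.read_csv('test.csv')['Id']` as `expected_ids`, raising if the returned list is non-empty.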


In [19]:
# Add second CLIP model ViT-B/16 (224) features, concat with ViT-L/14-336, RidgeCV, and re-blend
import sys, subprocess, time, gc
import numpy as np, pandas as pd
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

def pip_install(pkg):
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt', pkg, '--upgrade-strategy', 'only-if-needed'], check=True)

try:
    import open_clip
except Exception:
    print('Installing open_clip...', flush=True)
    pip_install('open_clip_torch==2.26.1'); pip_install('ftfy'); pip_install('regex')
    import open_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

def rmse(y_true, y_pred):
    # `squared=False` is deprecated since scikit-learn 1.4; take the square root explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

class ImgDS(Dataset):
    def __init__(self, ids, img_dir, preprocess):
        self.ids = ids; self.dir = Path(img_dir); self.preprocess = preprocess
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        pid = self.ids[i]
        img = Image.open(self.dir/f'{pid}.jpg').convert('RGB')
        return self.preprocess(img), pid

def extract_feats(model_name, pretrained, ids, img_dir, batch=128):
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval()
    ds = ImgDS(ids, img_dir, preprocess)
    dl = DataLoader(ds, batch_size=batch, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    feats = np.zeros((len(ds), model.visual.output_dim), dtype=np.float32)
    ptr=0; t0=time.time()
    with torch.no_grad():
        for i,(x,_) in enumerate(dl):
            x = x.to(device, non_blocking=True)
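            # torch.cuda.amp.autocast is deprecated in recent PyTorch; torch.autocast is the replacement.
            with torch.autocast(device_type=device, enabled=(device=='cuda')):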
                f = model.encode_image(x)
            f = torch.nn.functional.normalize(f.float(), dim=1)
            n=f.size(0); feats[ptr:ptr+n]=f.cpu().numpy(); ptr+=n
            if (i+1)%50==0: print(f'  {model_name} batches {i+1}, elapsed {time.time()-t0:.1f}s', flush=True)
    return feats

print('=== Extract ViT-B/16 (openai, 224) features ===', flush=True)
train_df = pd.read_csv('train.csv'); test_df = pd.read_csv('test.csv')
tr_ids = train_df['Id'].tolist(); te_ids = test_df['Id'].tolist()
tr_b16_path = Path('clip_vitb16_224_train_feats.npy'); te_b16_path = Path('clip_vitb16_224_test_feats.npy')
if tr_b16_path.exists() and te_b16_path.exists():
    feats_tr_b = np.load(tr_b16_path); feats_te_b = np.load(te_b16_path)
    print('Loaded cached ViT-B/16 feats:', feats_tr_b.shape, feats_te_b.shape, flush=True)
else:
    feats_tr_b = extract_feats('ViT-B-16', 'openai', tr_ids, 'train', batch=128)
    feats_te_b = extract_feats('ViT-B-16', 'openai', te_ids, 'test', batch=128)
    np.save(tr_b16_path, feats_tr_b); np.save(te_b16_path, feats_te_b)
    print('Saved ViT-B/16 feats:', feats_tr_b.shape, feats_te_b.shape, flush=True)

# Load existing ViT-L/14-336 feats
feats_tr_l = np.load('clip_vitl14_336_train_feats.npy'); feats_te_l = np.load('clip_vitl14_336_test_feats.npy')
print('ViT-L/14-336 feats:', feats_tr_l.shape, feats_te_l.shape, flush=True)

# Concat features [L14_336 | B16_224]
X_tr = np.concatenate([feats_tr_l, feats_tr_b], axis=1).astype(np.float32)
X_te = np.concatenate([feats_te_l, feats_te_b], axis=1).astype(np.float32)
folds = pd.read_csv('folds.csv')
df = train_df.merge(folds[['Id','fold']], on='Id', how='left')
y = df['Pawpularity'].values.astype(np.float32)

print('=== RidgeCV on concatenated CLIP feats ===', flush=True)
alphas = np.logspace(-2, 3, 25)
oof = np.zeros(len(y), dtype=np.float32); test_folds=[]; scores=[]; alphas_used=[]
for f in range(5):
    tr_idx = np.where(df['fold'].values != f)[0]; va_idx = np.where(df['fold'].values == f)[0]
    model = RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5)
    t0=time.time(); model.fit(X_tr[tr_idx], y[tr_idx])
    p_va = model.predict(X_tr[va_idx]).astype(np.float32); p_te = model.predict(X_te).astype(np.float32)
    oof[va_idx]=p_va; test_folds.append(p_te)
    s = rmse(y[va_idx], p_va); scores.append(float(s)); alphas_used.append(float(model.alpha_))
    print(f'Ridge-2CLIP fold {f}: rmse={s:.5f} alpha={model.alpha_:.5f} elapsed={time.time()-t0:.1f}s', flush=True)
test_mean = np.mean(np.vstack(test_folds), axis=0).astype(np.float32)
oof_df = df[['Id','Pawpularity']].copy(); oof_df['oof'] = np.clip(oof,0,100)
oof_df.to_csv('oof_clip2_ridge.csv', index=False)
np.save('test_preds_clip2_ridge.npy', np.clip(test_mean,0,100))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print('Ridge-2CLIP fold RMSEs:', scores, 'Mean:', float(np.mean(scores)))
print(f'Ridge-2CLIP OOF RMSE: {overall:.5f}', flush=True)

# 3-way OOF blend: Ridge-2CLIP (img1) + XGB-CLIP(+meta) (img2) + META
try:
    o1 = pd.read_csv('oof_clip2_ridge.csv').rename(columns={'oof':'img1'})
    o2 = pd.read_csv('oof_xgb_clip_meta.csv').rename(columns={'oof':'img2'})
    om = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta'})
    m = o1.merge(o2, on=['Id','Pawpularity']).merge(om, on=['Id','Pawpularity'])
    y_true = m['Pawpularity'].values.astype(np.float32)
    p1 = m['img1'].values.astype(np.float32)
    p2 = m['img2'].values.astype(np.float32)
    pm = m['meta'].values.astype(np.float32)
    best = (0.88, 0.08, 0.04, rmse(y_true, p1))
    for w1 in np.linspace(0.70, 0.95, 26):
        for w2 in np.linspace(0.00, 0.25, 26):
            wm = 1.0 - w1 - w2
            if wm < 0.00 or wm > 0.20: continue
            pred = np.clip(w1*p1 + w2*p2 + wm*pm, 0, 100)
            s = rmse(y_true, pred)
            if s < best[3]: best = (float(w1), float(w2), float(wm), float(s))
    print(f'[Blend 2CLIP+XGB+META] w1={best[0]:.3f} w2={best[1]:.3f} wm={best[2]:.3f} OOF RMSE={best[3]:.5f}', flush=True)
    # Apply to test
    t1 = np.load('test_preds_clip2_ridge.npy').astype(np.float32)
    t2 = np.load('test_preds_xgb_clip_meta.npy').astype(np.float32)
    tm = np.load('test_preds_xgb_meta.npy').astype(np.float32)
    blend = np.clip(best[0]*t1 + best[1]*t2 + best[2]*tm, 0, 100)
    sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': blend})
    sub.to_csv('submission_clip2_ensemble.csv', index=False)
    print('Saved submission_clip2_ensemble.csv', sub.shape, flush=True)
except Exception as e:
    print('Blend failed/skipped:', e)

=== Extract ViT-B/16 (openai, 224) features ===



100%|███████████████████████████████████████| 351M/351M [00:04<00:00, 83.5MiB/s]






  ViT-B-16 batches 50, elapsed 14.5s


Saved ViT-B/16 feats: (8920, 512) (992, 512)


ViT-L/14-336 feats: (8920, 768) (992, 768)


=== RidgeCV on concatenated CLIP feats ===


Ridge-2CLIP fold 0: rmse=17.27534 alpha=3.16228 elapsed=9.5s




Ridge-2CLIP fold 1: rmse=17.05776 alpha=3.16228 elapsed=12.4s




Ridge-2CLIP fold 2: rmse=17.69733 alpha=3.16228 elapsed=14.3s




Ridge-2CLIP fold 3: rmse=18.30937 alpha=3.16228 elapsed=14.3s




Ridge-2CLIP fold 4: rmse=16.85355 alpha=3.16228 elapsed=14.1s


Ridge-2CLIP fold RMSEs: [17.27533721923828, 17.057764053344727, 17.69733428955078, 18.309368133544922, 16.85354995727539] Mean: 17.43867073059082
Ridge-2CLIP OOF RMSE: 17.44642


[Blend 2CLIP+XGB+META] w1=0.840 w2=0.160 wm=0.000 OOF RMSE=17.42812


Saved submission_clip2_ensemble.csv (992, 2)




In [21]:
# Stacking: RidgeCV on OOF predictions (CLIP-L ridge, 2CLIP ridge, XGB-CLIP+meta, META)
import numpy as np, pandas as pd
from pathlib import Path
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    # `squared=False` is deprecated since scikit-learn 1.4; take the square root explicitly.
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

print('=== Stacking OOF predictions with RidgeCV ===', flush=True)
o1 = pd.read_csv('oof_clip_vitl14_336_ridge.csv').rename(columns={'oof':'clip_l14'})
o2 = pd.read_csv('oof_clip2_ridge.csv').rename(columns={'oof':'clip_2clip'})
o3 = pd.read_csv('oof_xgb_clip_meta.csv').rename(columns={'oof':'xgb_clipmeta'})
o4 = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta'})
m = o1.merge(o2, on=['Id','Pawpularity']).merge(o3, on=['Id','Pawpularity']).merge(o4, on=['Id','Pawpularity'])
y = m['Pawpularity'].values.astype(np.float32)
X = m[['clip_l14','clip_2clip','xgb_clipmeta','meta']].values.astype(np.float32)
print('Stack features shape:', X.shape, flush=True)

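alphas = np.logspace(-6, 3, 40)
stack = make_pipeline(StandardScaler(with_mean=True, with_std=True), RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5))
stack.fit(X, y)
# NOTE: the stacker is fit and then scored on the same rows, so the RMSE printed below is an
# in-sample score over OOF inputs (optimistic), not an out-of-fold score for the stacker itself.
p_oof = stack.predict(X).astype(np.float32)
score = rmse(y, np.clip(p_oof, 0, 100))
print(f'Stack Ridge OOF RMSE: {score:.5f}', flush=True)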

# Build test matrix in the same column order
test_df = pd.read_csv('test.csv')
t1 = np.load('test_preds_clip_vitl14_336_ridge.npy').astype(np.float32)
t2 = np.load('test_preds_clip2_ridge.npy').astype(np.float32)
t3 = np.load('test_preds_xgb_clip_meta.npy').astype(np.float32)
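# If META test predictions are missing, fall back to the mean OOF META value (it standardizes to 0, so the
# column contributes nothing) instead of raw zeros, which would sit far outside the training range.
t4 = np.load('test_preds_xgb_meta.npy').astype(np.float32) if Path('test_preds_xgb_meta.npy').exists() else np.full_like(t1, m['meta'].mean())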
Xt = np.vstack([t1, t2, t3, t4]).T.astype(np.float32)
p_test = stack.predict(Xt).astype(np.float32)
sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(p_test, 0, 100)})
sub.to_csv('submission_stacked.csv', index=False)
print('Saved submission_stacked.csv', sub.shape, flush=True)

=== Stacking OOF predictions with RidgeCV ===
Stack features shape: (8920, 4)
Stack Ridge OOF RMSE: 17.40986
Saved submission_stacked.csv (992, 2)
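
The stack RMSE above comes from fitting the stacker and predicting on the same 8920 rows, so it is an in-sample figure. A hedged sketch of one way to get an out-of-fold estimate for the stacker itself, using `cross_val_predict`. The helper name `stack_oof_rmse`, the `KFold` split, and the synthetic inputs are assumptions for illustration; the notebook's own `folds.csv` could be substituted for the splitter.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def stack_oof_rmse(X, y, n_splits=5, seed=42):
    """Cross-validated RMSE of a StandardScaler + RidgeCV stacker.

    Each row is predicted by a stacker that never saw it during fitting.
    """
    stacker = make_pipeline(StandardScaler(), RidgeCV(alphas=np.logspace(-6, 3, 40)))
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    p = np.clip(cross_val_predict(stacker, X, y, cv=cv), 0, 100)
    return float(np.sqrt(np.mean((y - p) ** 2))), p

# Synthetic stand-ins for four base-model OOF columns (hypothetical data)
rng = np.random.default_rng(0)
y_demo = rng.uniform(1, 100, size=400)
X_demo = np.column_stack([y_demo + rng.normal(0, s, size=400) for s in (10, 12, 15, 25)])
cv_rmse, cv_pred = stack_oof_rmse(X_demo, y_demo)
print(f'cross-validated stack RMSE: {cv_rmse:.5f}')
```

With only a handful of stack inputs and thousands of rows, the gap between in-sample and cross-validated RMSE is usually small, but the cross-validated number is the one comparable to the base models' OOF scores.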




In [23]:
# Add third CLIP model ViT-H/14 features, concat with existing (L14-336, B16-224), RidgeCV, then restack
import sys, subprocess, time, gc
import numpy as np, pandas as pd
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error

def pip_install(pkg):
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt', pkg, '--upgrade-strategy', 'only-if-needed'], check=True)

try:
    import open_clip
except Exception:
    print('Installing open_clip...', flush=True)
    pip_install('open_clip_torch==2.26.1'); pip_install('ftfy'); pip_install('regex')
    import open_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'

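def rmse(y_true, y_pred):
    # Take the square root explicitly: `squared=False` is deprecated since scikit-learn 1.4 and removed in later releases
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))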

class ImgDS(Dataset):
    def __init__(self, ids, img_dir, preprocess):
        self.ids = ids; self.dir = Path(img_dir); self.preprocess = preprocess
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        pid = self.ids[i]
        img = Image.open(self.dir/f'{pid}.jpg').convert('RGB')
        return self.preprocess(img), pid

def extract_feats(model_name, pretrained, ids, img_dir, batch=64):
    print(f'Load {model_name} ({pretrained})', flush=True)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval()
    ds = ImgDS(ids, img_dir, preprocess)
    dl = DataLoader(ds, batch_size=batch, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    feats = np.zeros((len(ds), model.visual.output_dim), dtype=np.float32)
    ptr=0; t0=time.time()
    with torch.no_grad():
        for i,(x,_) in enumerate(dl):
            x = x.to(device, non_blocking=True)
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                f = model.encode_image(x)
            f = torch.nn.functional.normalize(f.float(), dim=1)
            n=f.size(0); feats[ptr:ptr+n]=f.cpu().numpy(); ptr+=n
            if (i+1)%50==0: print(f'  {model_name} batches {i+1}, elapsed {time.time()-t0:.1f}s', flush=True)
    print(f'{model_name} feats done in {time.time()-t0:.1f}s', flush=True)
    return feats

print('=== Extract ViT-H/14 (laion2b_s32b_b79k) features ===', flush=True)
train_df = pd.read_csv('train.csv'); test_df = pd.read_csv('test.csv')
tr_ids = train_df['Id'].tolist(); te_ids = test_df['Id'].tolist()
tr_h14_path = Path('clip_vith14_train_feats.npy'); te_h14_path = Path('clip_vith14_test_feats.npy')
if tr_h14_path.exists() and te_h14_path.exists():
    feats_tr_h = np.load(tr_h14_path); feats_te_h = np.load(te_h14_path)
    print('Loaded cached ViT-H/14 feats:', feats_tr_h.shape, feats_te_h.shape, flush=True)
else:
    feats_tr_h = extract_feats('ViT-H-14', 'laion2b_s32b_b79k', tr_ids, 'train', batch=64)
    feats_te_h = extract_feats('ViT-H-14', 'laion2b_s32b_b79k', te_ids, 'test', batch=64)
    np.save(tr_h14_path, feats_tr_h); np.save(te_h14_path, feats_te_h)
    print('Saved ViT-H/14 feats:', feats_tr_h.shape, feats_te_h.shape, flush=True)

# Load existing feats (L14-336, B16-224) and concat all three
feats_tr_l = np.load('clip_vitl14_336_train_feats.npy'); feats_te_l = np.load('clip_vitl14_336_test_feats.npy')
feats_tr_b = np.load('clip_vitb16_224_train_feats.npy'); feats_te_b = np.load('clip_vitb16_224_test_feats.npy')
X_tr = np.concatenate([feats_tr_l, feats_tr_b, feats_tr_h], axis=1).astype(np.float32)
X_te = np.concatenate([feats_te_l, feats_te_b, feats_te_h], axis=1).astype(np.float32)
folds = pd.read_csv('folds.csv')
df = train_df.merge(folds[['Id','fold']], on='Id', how='left')
y = df['Pawpularity'].values.astype(np.float32)

print('=== RidgeCV on 3x CLIP feats (L14+B16+H14) ===', flush=True)
alphas = np.logspace(-2, 3, 25)
oof = np.zeros(len(y), dtype=np.float32); test_folds=[]; scores=[]
for f in range(5):
    tr_idx = np.where(df['fold'].values != f)[0]; va_idx = np.where(df['fold'].values == f)[0]
    model = RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5)
    t0=time.time(); model.fit(X_tr[tr_idx], y[tr_idx])
    p_va = model.predict(X_tr[va_idx]).astype(np.float32); p_te = model.predict(X_te).astype(np.float32)
    oof[va_idx]=p_va; test_folds.append(p_te)
    s = rmse(y[va_idx], p_va); scores.append(float(s))
    print(f'Ridge-3CLIP fold {f}: rmse={s:.5f} alpha={model.alpha_:.5f} elapsed={time.time()-t0:.1f}s', flush=True)
test_mean = np.mean(np.vstack(test_folds), axis=0).astype(np.float32)
oof_df = df[['Id','Pawpularity']].copy(); oof_df['oof'] = np.clip(oof,0,100)
oof_df.to_csv('oof_clip3_ridge.csv', index=False)
np.save('test_preds_clip3_ridge.npy', np.clip(test_mean,0,100))
overall = rmse(oof_df['Pawpularity'].values, oof_df['oof'].values)
print('Ridge-3CLIP fold RMSEs:', scores, 'Mean:', float(np.mean(scores)))
print(f'Ridge-3CLIP OOF RMSE: {overall:.5f}', flush=True)

# Update stacking to include 3-CLIP ridge
try:
    o1 = pd.read_csv('oof_clip_vitl14_336_ridge.csv').rename(columns={'oof':'clip_l14'})
    o2 = pd.read_csv('oof_clip2_ridge.csv').rename(columns={'oof':'clip_2clip'})
    o3 = pd.read_csv('oof_xgb_clip_meta.csv').rename(columns={'oof':'xgb_clipmeta'})
    o4 = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta'})
    o5 = pd.read_csv('oof_clip3_ridge.csv').rename(columns={'oof':'clip_3clip'})
    m = o1.merge(o2, on=['Id','Pawpularity']).merge(o3, on=['Id','Pawpularity']).merge(o4, on=['Id','Pawpularity']).merge(o5, on=['Id','Pawpularity'])
    y_stk = m['Pawpularity'].values.astype(np.float32)
    X_stk = m[['clip_l14','clip_2clip','clip_3clip','xgb_clipmeta','meta']].values.astype(np.float32)
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    alphas_stk = np.logspace(-6, 3, 40)
    stk = make_pipeline(StandardScaler(with_mean=True, with_std=True), RidgeCV(alphas=alphas_stk, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5))
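    stk.fit(X_stk, y_stk)
    # NOTE: fit and scored on the same rows, so the value printed below is an in-sample (optimistic) stack RMSE
    p_oof = stk.predict(X_stk).astype(np.float32)
    s_stk = rmse(y_stk, np.clip(p_oof,0,100))
    print(f'Stack (5 sources incl. 3CLIP) OOF RMSE: {s_stk:.5f}', flush=True)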
    # Build test features in same order
    t1 = np.load('test_preds_clip_vitl14_336_ridge.npy').astype(np.float32)
    t2 = np.load('test_preds_clip2_ridge.npy').astype(np.float32)
    t5 = np.load('test_preds_clip3_ridge.npy').astype(np.float32)
    t3 = np.load('test_preds_xgb_clip_meta.npy').astype(np.float32)
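    # Path is already imported at the top of the cell. If META test predictions are missing, use the mean
    # OOF META value (neutral after standardization) rather than raw zeros.
    t4 = np.load('test_preds_xgb_meta.npy').astype(np.float32) if Path('test_preds_xgb_meta.npy').exists() else np.full_like(t1, m['meta'].mean())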
    Xt = np.vstack([t1, t2, t5, t3, t4]).T.astype(np.float32)
    p_test = stk.predict(Xt).astype(np.float32)
    sub = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(p_test, 0, 100)})
    sub.to_csv('submission_stacked_3clip.csv', index=False)
    print('Saved submission_stacked_3clip.csv', sub.shape, flush=True)
except Exception as e:
    print('Stack update failed/skipped:', e)

=== Extract ViT-H/14 (laion2b_s32b_b79k) features ===
Load ViT-H-14 (laion2b_s32b_b79k)
  ViT-H-14 batches 50, elapsed 25.0s
  ViT-H-14 batches 100, elapsed 49.4s
ViT-H-14 feats done in 69.0s
Load ViT-H-14 (laion2b_s32b_b79k)
ViT-H-14 feats done in 8.7s
Saved ViT-H/14 feats: (8920, 1024) (992, 1024)
=== RidgeCV on 3x CLIP feats (L14+B16+H14) ===
Ridge-3CLIP fold 0: rmse=17.15195 alpha=5.10897 elapsed=26.2s
Ridge-3CLIP fold 1: rmse=16.92295 alpha=5.10897 elapsed=26.3s
Ridge-3CLIP fold 2: rmse=17.55738 alpha=5.10897 elapsed=26.2s
Ridge-3CLIP fold 3: rmse=18.25373 alpha=5.10897 elapsed=26.2s
Ridge-3CLIP fold 4: rmse=16.84151 alpha=5.10897 elapsed=26.2s
Ridge-3CLIP fold RMSEs: [17.151952743530273, 16.922954559326172, 17.557384490966797, 18.253732681274414, 16.841506958007812] Mean: 17.345506286621095
Ridge-3CLIP OOF RMSE: 17.35332
Stack (5 sources incl. 3CLIP) OOF RMSE: 17.32087
Saved submission_stacked_3clip.csv (992, 2)
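
The log above prints both the mean of the per-fold RMSEs and a pooled OOF RMSE, and the two differ slightly. A small sketch of why: the pooled value is the square root of the size-weighted mean of squared fold RMSEs, which is never smaller than the weighted mean of the fold RMSEs. The fold sizes below (1784 each, i.e. 8920 / 5) are an assumption; the real stratified folds may be slightly uneven, and the notebook also clips OOF predictions before pooling, so the result will not match the logged value exactly.

```python
import numpy as np

def pooled_rmse(fold_rmses, fold_sizes):
    """Combine per-fold RMSEs into the RMSE over all rows pooled together."""
    r = np.asarray(fold_rmses, dtype=np.float64)
    n = np.asarray(fold_sizes, dtype=np.float64)
    return float(np.sqrt(np.sum(n * r ** 2) / np.sum(n)))

# Per-fold RMSEs copied from the Ridge-3CLIP log; equal fold sizes are assumed
fold_rmses = [17.151952743530273, 16.922954559326172, 17.557384490966797,
              18.253732681274414, 16.841506958007812]
fold_sizes = [1784] * 5

mean_of_folds = float(np.mean(fold_rmses))
pooled = pooled_rmse(fold_rmses, fold_sizes)
print(f'mean of fold RMSEs: {mean_of_folds:.5f}  pooled RMSE: {pooled:.5f}')
```

When comparing runs, it helps to stick to one of the two numbers consistently, since the pooled value weights high-error folds more heavily.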




In [None]:
# ViT-g-14 (laion2b_s12b_b42k) HFlip-averaged features -> Ridge (solo + 4-CLIP) -> Restack -> Isotonic calibration
import sys, subprocess, time, gc, math
import numpy as np, pandas as pd
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from sklearn.linear_model import RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.isotonic import IsotonicRegression

def pip_install(pkg):
    subprocess.run([sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt', pkg, '--upgrade-strategy', 'only-if-needed'], check=True)

try:
    import open_clip
except Exception:
    print('Installing open_clip...', flush=True)
    pip_install('open_clip_torch==2.26.1'); pip_install('ftfy'); pip_install('regex')
    import open_clip

device = 'cuda' if torch.cuda.is_available() else 'cpu'
torch.backends.cudnn.benchmark = True
try:
    torch.set_float32_matmul_precision('high')
except Exception:
    pass

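def rmse(y_true, y_pred):
    # Take the square root explicitly: `squared=False` is deprecated since scikit-learn 1.4 and removed in later releases
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))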

class ImgDS(Dataset):
    def __init__(self, ids, img_dir, preprocess):
        self.ids = ids; self.dir = Path(img_dir); self.preprocess = preprocess
    def __len__(self): return len(self.ids)
    def __getitem__(self, i):
        pid = self.ids[i]
        img = Image.open(self.dir/f'{pid}.jpg').convert('RGB')
        return self.preprocess(img), pid

def extract_feats_hflip_avg(model_name, pretrained, ids, img_dir, batch=64):
    print(f'Load {model_name} ({pretrained})', flush=True)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval()
    ds = ImgDS(ids, img_dir, preprocess)
    dl = DataLoader(ds, batch_size=batch, shuffle=False, num_workers=4, pin_memory=True, persistent_workers=False)
    out_dim = model.visual.output_dim
    feats = np.zeros((len(ds), out_dim), dtype=np.float32)
    ptr=0; t0=time.time()
    with torch.no_grad():
        for i,(x,_) in enumerate(dl):
            x = x.to(device, non_blocking=True)
            xh = torch.flip(x, dims=[-1])  # horizontal flip on tensor
            with torch.amp.autocast('cuda', enabled=(device=='cuda')):
                f1 = model.encode_image(x)
                f2 = model.encode_image(xh)
            f = 0.5*(f1 + f2)
            f = torch.nn.functional.normalize(f.float(), dim=1)
            n=f.size(0); feats[ptr:ptr+n]=f.cpu().numpy(); ptr+=n
            if (i+1)%50==0: print(f'  {model_name} batches {i+1}, elapsed {time.time()-t0:.1f}s', flush=True)
    print(f'{model_name} feats done in {time.time()-t0:.1f}s', flush=True)
    return feats

print('=== ViT-g/14 (laion2b_s12b_b42k) HFlip-averaged features ===', flush=True)
train_df = pd.read_csv('train.csv'); test_df = pd.read_csv('test.csv')
tr_ids = train_df['Id'].tolist(); te_ids = test_df['Id'].tolist()
tr_g14_path = Path('clip_vitg14_train_feats.npy'); te_g14_path = Path('clip_vitg14_test_feats.npy')
if tr_g14_path.exists() and te_g14_path.exists():
    feats_tr_g = np.load(tr_g14_path); feats_te_g = np.load(te_g14_path)
    print('Loaded cached ViT-g/14 feats:', feats_tr_g.shape, feats_te_g.shape, flush=True)
else:
    feats_tr_g = extract_feats_hflip_avg('ViT-g-14', 'laion2b_s12b_b42k', tr_ids, 'train', batch=64)
    feats_te_g = extract_feats_hflip_avg('ViT-g-14', 'laion2b_s12b_b42k', te_ids, 'test', batch=64)
    np.save(tr_g14_path, feats_tr_g); np.save(te_g14_path, feats_te_g)
    print('Saved ViT-g/14 feats:', feats_tr_g.shape, feats_te_g.shape, flush=True)

# Solo Ridge on ViT-g/14
folds = pd.read_csv('folds.csv')
df = train_df.merge(folds[['Id','fold']], on='Id', how='left')
y = df['Pawpularity'].values.astype(np.float32)
alphas = np.logspace(-2, 3, 25)
oof_g = np.zeros(len(y), dtype=np.float32); test_folds_g=[]; scores_g=[]
for f in range(5):
    tr_idx = np.where(df['fold'].values != f)[0]; va_idx = np.where(df['fold'].values == f)[0]
    model = RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5)
    t0=time.time(); model.fit(feats_tr_g[tr_idx], y[tr_idx])
    p_va = model.predict(feats_tr_g[va_idx]).astype(np.float32); p_te = model.predict(feats_te_g).astype(np.float32)
    oof_g[va_idx]=p_va; test_folds_g.append(p_te)
    s = rmse(y[va_idx], p_va); scores_g.append(float(s))
    print(f'Ridge-g14 fold {f}: rmse={s:.5f} alpha={model.alpha_:.5f} elapsed={time.time()-t0:.1f}s', flush=True)
test_g = np.mean(np.vstack(test_folds_g), axis=0).astype(np.float32)
oof_g_df = df[['Id','Pawpularity']].copy(); oof_g_df['oof'] = np.clip(oof_g,0,100)
oof_g_df.to_csv('oof_clip_vitg14_ridge.csv', index=False)
np.save('test_preds_clip_vitg14_ridge.npy', np.clip(test_g,0,100))
overall_g = rmse(oof_g_df['Pawpularity'].values, oof_g_df['oof'].values)
print(f'Ridge-g14 OOF RMSE: {overall_g:.5f}', flush=True)

# 4-CLIP concatenation: [L/14-336 (768-d) | B/16-224 (512-d) | H/14 (1024-d) | g/14]; combined shape is printed below
feats_tr_l = np.load('clip_vitl14_336_train_feats.npy'); feats_te_l = np.load('clip_vitl14_336_test_feats.npy')
feats_tr_b = np.load('clip_vitb16_224_train_feats.npy'); feats_te_b = np.load('clip_vitb16_224_test_feats.npy')
feats_tr_h = np.load('clip_vith14_train_feats.npy'); feats_te_h = np.load('clip_vith14_test_feats.npy')
X_tr4 = np.concatenate([feats_tr_l, feats_tr_b, feats_tr_h, feats_tr_g], axis=1).astype(np.float32)
X_te4 = np.concatenate([feats_te_l, feats_te_b, feats_te_h, feats_te_g], axis=1).astype(np.float32)
print('4-CLIP shapes:', X_tr4.shape, X_te4.shape, flush=True)

oof4 = np.zeros(len(y), dtype=np.float32); test_folds4=[]; scores4=[]
for f in range(5):
    tr_idx = np.where(df['fold'].values != f)[0]; va_idx = np.where(df['fold'].values == f)[0]
    model = RidgeCV(alphas=alphas, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5)
    t0=time.time(); model.fit(X_tr4[tr_idx], y[tr_idx])
    p_va = model.predict(X_tr4[va_idx]).astype(np.float32); p_te = model.predict(X_te4).astype(np.float32)
    oof4[va_idx]=p_va; test_folds4.append(p_te)
    s = rmse(y[va_idx], p_va); scores4.append(float(s))
    print(f'Ridge-4CLIP fold {f}: rmse={s:.5f} alpha={model.alpha_:.5f} elapsed={time.time()-t0:.1f}s', flush=True)
test4 = np.mean(np.vstack(test_folds4), axis=0).astype(np.float32)
oof4_df = df[['Id','Pawpularity']].copy(); oof4_df['oof'] = np.clip(oof4,0,100)
oof4_df.to_csv('oof_clip4_ridge.csv', index=False)
np.save('test_preds_clip4_ridge.npy', np.clip(test4,0,100))
overall4 = rmse(oof4_df['Pawpularity'].values, oof4_df['oof'].values)
print('Ridge-4CLIP OOF RMSE:', f'{overall4:.5f}')

# Restack with new sources (include 3CLIP and 4CLIP; Ridge will regularize)
print('=== Restack with 6 inputs (l14, 2clip, 3clip, 4clip, xgb_clipmeta, meta) ===', flush=True)
o1 = pd.read_csv('oof_clip_vitl14_336_ridge.csv').rename(columns={'oof':'clip_l14'})
o2 = pd.read_csv('oof_clip2_ridge.csv').rename(columns={'oof':'clip_2clip'})
o3 = pd.read_csv('oof_clip3_ridge.csv').rename(columns={'oof':'clip_3clip'})
o4 = pd.read_csv('oof_clip4_ridge.csv').rename(columns={'oof':'clip_4clip'})
o5 = pd.read_csv('oof_xgb_clip_meta.csv').rename(columns={'oof':'xgb_clipmeta'})
o6 = pd.read_csv('oof_xgb_meta.csv').rename(columns={'oof_meta':'meta'})
m = o1.merge(o2, on=['Id','Pawpularity']).merge(o3, on=['Id','Pawpularity']).merge(o4, on=['Id','Pawpularity']).merge(o5, on=['Id','Pawpularity']).merge(o6, on=['Id','Pawpularity'])
y_stk = m['Pawpularity'].values.astype(np.float32)
X_stk = m[['clip_l14','clip_2clip','clip_3clip','clip_4clip','xgb_clipmeta','meta']].values.astype(np.float32)
alphas_stk = np.logspace(-6, 3, 40)
stack = make_pipeline(StandardScaler(with_mean=True, with_std=True), RidgeCV(alphas=alphas_stk, fit_intercept=True, scoring='neg_root_mean_squared_error', cv=5))
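stack.fit(X_stk, y_stk)
# NOTE: fit and scored on the same rows, so the value printed below is an in-sample (optimistic) stack RMSE
p_oof = stack.predict(X_stk).astype(np.float32)
stk_oof_clip = np.clip(p_oof, 0, 100)
stk_rmse = rmse(y_stk, stk_oof_clip)
print(f'Stack (6 src) OOF RMSE: {stk_rmse:.5f}', flush=True)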

# Build test features and predictions
t1 = np.load('test_preds_clip_vitl14_336_ridge.npy').astype(np.float32)
t2 = np.load('test_preds_clip2_ridge.npy').astype(np.float32)
t3 = np.load('test_preds_clip3_ridge.npy').astype(np.float32)
t4 = np.load('test_preds_clip4_ridge.npy').astype(np.float32)
t5 = np.load('test_preds_xgb_clip_meta.npy').astype(np.float32)
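# Path is already imported at the top of the cell. If META test predictions are missing, use the mean
# OOF META value (neutral after standardization) rather than raw zeros.
t6 = np.load('test_preds_xgb_meta.npy').astype(np.float32) if Path('test_preds_xgb_meta.npy').exists() else np.full_like(t1, m['meta'].mean())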
Xt = np.vstack([t1,t2,t3,t4,t5,t6]).T.astype(np.float32)
p_test = stack.predict(Xt).astype(np.float32)
p_test_clip = np.clip(p_test, 0, 100)
sub_stack = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': p_test_clip})
sub_stack.to_csv('submission_stacked_4clip.csv', index=False)
print('Saved submission_stacked_4clip.csv', sub_stack.shape, flush=True)

# Isotonic calibration on final stack OOF and apply to test
# NOTE: the isotonic map is fit and scored on the same rows, so the RMSE printed below is in-sample.
# A piecewise-constant fit can overfit here; treat any gain as unconfirmed until it is cross-validated.
print('=== Isotonic calibration on stack OOF ===', flush=True)
iso = IsotonicRegression(y_min=0.0, y_max=100.0, out_of_bounds='clip')
iso.fit(stk_oof_clip, y_stk)  # map from stack_pred -> target
stk_oof_iso = iso.transform(stk_oof_clip)
iso_rmse = rmse(y_stk, np.clip(stk_oof_iso,0,100))
print(f'Isotonic-calibrated stack OOF RMSE: {iso_rmse:.5f}', flush=True)
p_test_iso = iso.transform(p_test_clip)
sub_iso = pd.DataFrame({'Id': test_df['Id'], 'Pawpularity': np.clip(p_test_iso, 0, 100)})
sub_iso.to_csv('submission_stacked_4clip_iso.csv', index=False)
print('Saved submission_stacked_4clip_iso.csv', sub_iso.shape, flush=True)

# Update primary submission.csv to calibrated stack
sub_iso.to_csv('submission.csv', index=False)
print('submission.csv updated -> submission_stacked_4clip_iso.csv')
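
The cell above fits the isotonic map on the same predictions it then scores, so its RMSE cannot show whether calibration helps on unseen rows. A hedged sketch of a cross-fitted check, where each row is calibrated by a map fit on the other folds. The helper name `crossfit_isotonic`, the `KFold` split, and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import KFold

def crossfit_isotonic(pred, y, n_splits=5, seed=42):
    """Return calibrated predictions where each row uses a map fit without that row."""
    pred = np.asarray(pred, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    out = np.zeros_like(pred)
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, va_idx in cv.split(pred):
        iso = IsotonicRegression(y_min=0.0, y_max=100.0, out_of_bounds='clip')
        iso.fit(pred[tr_idx], y[tr_idx])
        out[va_idx] = iso.predict(pred[va_idx])
    return out

# Synthetic stand-in for stack predictions and targets (hypothetical data)
rng = np.random.default_rng(0)
y_demo = rng.uniform(1, 100, size=600)
pred_demo = np.clip(0.6 * y_demo + 20 + rng.normal(0, 12, size=600), 0, 100)

cal_demo = crossfit_isotonic(pred_demo, y_demo)
raw_rmse = float(np.sqrt(np.mean((y_demo - pred_demo) ** 2)))
cal_rmse = float(np.sqrt(np.mean((y_demo - cal_demo) ** 2)))
print(f'raw RMSE: {raw_rmse:.5f}  cross-fitted isotonic RMSE: {cal_rmse:.5f}')
```

If the cross-fitted RMSE is not lower than the uncalibrated one, keeping the uncalibrated stack as the primary submission is the safer choice.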

=== ViT-g/14 (laion2b_s12b_b42k) HFlip-averaged features ===


Load ViT-g-14 (laion2b_s12b_b42k)
