# Herbarium 2022 - FGVC9: Plan and Milestones

Objective: win a medal. The leaderboard (LB) metric is macro F1.
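
Macro F1 averages per-class F1 with equal weight, so a rare species counts as much as a common one. The sketch below is a dependency-free illustration of the metric; the helper name `macro_f1` is made up for this note, and it assumes the usual convention of averaging over every class that appears in either the labels or the predictions.

```python
from collections import Counter

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over classes seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    classes = set(y_true) | set(y_pred)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy data: one frequent class (0) and two rare ones (1, 2).
y_true = [0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 1, 1]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f'accuracy={accuracy:.3f} macro_f1={score:.3f}')
```

A single mistake on the rarest class leaves accuracy above 0.8 but pulls macro F1 down to about 0.56, which is why the long tail dominates this plan.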

Milestone 0: Environment & Data Sanity
- Verify GPU availability (nvidia-smi, torch cuda).
- Inspect metadata: train/test sizes, class count (taxonID was assumed as the label key; the metadata inspected below uses category_id), label distribution skew, per-institution/site info.
- Confirm image paths exist and are readable.

Milestone 1: Validation Protocol
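- Stratified KFold by class label (category_id in the metadata; e.g., 5 folds). Any class with fewer images than folds cannot appear in every fold and needs explicit handling.
- Deterministic seeds; compute the fold assignment once and cache it so every run reuses the same splits.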
- Track macro-F1 OOF (primary) and accuracy.
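
The protocol above can be sketched without scikit-learn as a seeded round-robin assignment per class. This is an illustration, not the notebook's implementation: `stratified_folds` and the cache file name in the comment are hypothetical, and a real run would more likely use `sklearn.model_selection.StratifiedKFold`.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=5, seed=42):
    """Assign a fold id to every sample, spreading each class round-robin over folds."""
    rng = random.Random(seed)  # local RNG keeps the result independent of global state
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [0] * len(labels)
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        offset = rng.randrange(n_folds)  # avoid piling every rare class into fold 0
        for j, i in enumerate(idxs):
            folds[i] = (j + offset) % n_folds
    return folds

# Class 2 has only two samples, fewer than the number of folds.
labels = [0] * 10 + [1] * 5 + [2] * 2
folds = stratified_folds(labels, n_folds=5, seed=42)
print(folds)
# To cache (hypothetical file name): json.dump(folds, open('folds_seed42.json', 'w'))
```

Classes smaller than `n_folds` end up in only some validation folds, so their per-fold F1 is undefined elsewhere; scoring the concatenated OOF predictions avoids that problem.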

Milestone 2: Fast Baseline
- Model: timm pretrained backbone (e.g., convnext_base/large CNNs or an EVA02 base ViT) with mixed precision.
- Image size 384 (224 for smoke tests); augmentation: RandAugment/AutoAugment + ColorJitter + light Mixup/CutMix.
- Loss: LabelSmoothingCrossEntropy or FocalLoss; try both (OOF-driven).
- Optimizer: AdamW; OneCycle or cosine schedule; EMA weights.
- Long tail: compare loss re-weighting (class weights) against class-balanced sampling; choose by OOF macro-F1.
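
One common re-weighting choice, and the one the linear-probe cell later in this notebook uses, is inverse square-root frequency normalized to mean 1. A minimal sketch follows; the function name is illustrative and, unlike the notebook cell, it gives empty classes a weight of zero.

```python
import math
from collections import Counter

def inv_sqrt_class_weights(labels, num_classes):
    """Weight each class by 1/sqrt(count), rescaled so the weights average to 1."""
    counts = Counter(labels)
    raw = [1.0 / math.sqrt(counts[c]) if counts[c] else 0.0 for c in range(num_classes)]
    scale = num_classes / sum(raw)
    return [r * scale for r in raw]

# Toy long tail: 100, 25 and 4 samples.
labels = [0] * 100 + [1] * 25 + [2] * 4
weights = inv_sqrt_class_weights(labels, num_classes=3)
print([round(w, 3) for w in weights])
```

The rarest class is 25 times less frequent than the most common one but only gets 5 times the weight, which is the softening that the square root is meant to provide.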

Milestone 3: Speed/Scale-up
- Progressive resizing: 224 -> 384 -> 448 if time.
- Use mixed precision (fp16), channels_last, and optionally torch.compile.
- Enable efficient dataloading: prefetch, pin_memory, num_workers, jpeg-turbo if available.

Milestone 4: Error Analysis & Iteration
- OOF error buckets: by family/genus, institution/source, size, orientation.
- Try hierarchical soft targets (family/genus smoothing) if time.
- Try stronger backbones (beit/beitv2, convnextv2, efficientnetv2).

Milestone 5: Ensembling
- Blend diverse seeds/models/resolutions by averaging their logits.
- Calibrate if needed (temperature scaling fitted on OOF predictions).
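
Both steps can be sketched in a few lines. The example is illustrative only: all function names are made up here, the temperature is found by a coarse grid search rather than a proper optimizer, and the logits are toy values.

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def average_logits(per_model_logits):
    """per_model_logits[m][n][c]: logit of model m for sample n and class c."""
    n_models = len(per_model_logits)
    return [[sum(col) / n_models for col in zip(*rows)]
            for rows in zip(*per_model_logits)]

def nll(logit_rows, labels, T):
    """Mean negative log-likelihood at temperature T."""
    return -sum(math.log(softmax(r, T)[y]) for r, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda T: nll(logit_rows, labels, T))

# Two toy models, three OOF samples, three classes; sample 3 is confidently wrong.
model_a = [[10.0, 0.0, 0.0], [0.0, 6.0, 0.0], [9.0, 0.0, 0.0]]
model_b = [[6.0, 0.0, 0.0], [0.0, 10.0, 0.0], [7.0, 0.0, 0.0]]
oof_labels = [0, 1, 2]

blended = average_logits([model_a, model_b])
T_best = fit_temperature(blended, oof_labels)
print('blended:', blended)
print('T_best:', T_best, 'NLL@1:', round(nll(blended, oof_labels, 1.0), 3),
      'NLL@T_best:', round(nll(blended, oof_labels, T_best), 3))
```

Dividing logits by a positive temperature never changes the argmax, so calibration leaves macro F1 untouched; it only matters when probabilities from different models are blended or thresholded.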

Milestone 6: Submission
- Generate test predictions and save submission.csv.
- Verify shape and ids align with sample_submission.csv.
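
A small check along these lines can catch misaligned ids before uploading. It is a sketch under assumptions: `check_submission` is a hypothetical helper, it takes file contents as strings so it can be run on toy data, and the `Id,Predicted` header is taken from the sample submission printed later in this notebook.

```python
import csv
import io

def check_submission(sub_text, sample_text):
    """Return a list of problems; an empty list means the two files align."""
    sub = [r for r in csv.reader(io.StringIO(sub_text)) if r]
    sample = [r for r in csv.reader(io.StringIO(sample_text)) if r]
    if not sub or not sample:
        return ['empty file']
    problems = []
    if sub[0] != sample[0]:
        problems.append(f'header {sub[0]} != {sample[0]}')
    if len(sub) != len(sample):
        problems.append(f'row count {len(sub) - 1} != {len(sample) - 1}')
    if [r[0] for r in sub[1:]] != [r[0] for r in sample[1:]]:
        problems.append('Id column differs in content or order')
    if any(len(r) != 2 or r[1] == '' for r in sub[1:]):
        problems.append('missing prediction in at least one row')
    return problems

sample_text = 'Id,Predicted\n0,42\n1,42\n2,42\n'
good_text = 'Id,Predicted\n0,7\n1,13\n2,42\n'
bad_text = 'Id,Predicted\n0,7\n2,42\n1,13\n'  # rows out of order
print(check_submission(good_text, sample_text))
print(check_submission(bad_text, sample_text))
```

On real files the two arguments would be `open('submission.csv').read()` and `open('sample_submission.csv').read()`.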

Checkpoints for Expert Review
- After Env+Data sanity (M0).
- After baseline OOF ready (M2).
- After any major iteration/ensemble (M4/M5).

Next Cell (to add):
- Environment checks, file listings, metadata inspection, class counts, and quick integrity assertions.

In [1]:
import os, json, time, csv, sys, subprocess, itertools
from collections import Counter, defaultdict

t0 = time.time()
print('=== GPU CHECK (nvidia-smi) ===', flush=True)
subprocess.run(['bash','-lc','nvidia-smi || true'])

print('\n=== FILE LIST ===', flush=True)
for p in sorted(os.listdir('.')):
    if os.path.isdir(p):
        n_dirs = sum(1 for entry in os.scandir(p) if entry.is_dir())
        n_files = sum(1 for entry in os.scandir(p) if entry.is_file())
        print(f'{p}/ -> {n_dirs} dirs, {n_files} files')
    else:
        print(p, os.path.getsize(p))

def load_json(path):
    with open(path, 'r') as f:
        return json.load(f)

print('\n=== LOAD METADATA ===', flush=True)
train_meta = load_json('train_metadata.json')
test_meta = load_json('test_metadata.json')
print('train type:', type(train_meta))
if isinstance(train_meta, dict):
    print('train keys:', list(train_meta.keys())[:10])
else:
    print('train len:', len(train_meta))
print('test type:', type(test_meta))
if isinstance(test_meta, dict):
    print('test keys:', list(test_meta.keys())[:10])
elif isinstance(test_meta, list):
    print('test len:', len(test_meta))

# Infer structures
def info_dataset(meta, name):
    if isinstance(meta, dict):
        imgs = meta.get('images') or meta.get('data') or meta.get('items')
        ann = meta.get('annotations')
        cat = meta.get('categories')
        if imgs is not None:
            print(f'{name}: images={len(imgs)}')
        else:
            print(f'{name}: images UNKNOWN (no key)')
        if ann is not None:
            print(f'{name}: annotations={len(ann)}')
        if cat is not None:
            print(f'{name}: categories={len(cat)}')
        return imgs, ann, cat
    elif isinstance(meta, list):
        print(f'{name}: list length={len(meta)}')
        return meta, None, None
    else:
        print(f'{name}: type={type(meta)}')
        return None, None, None

train_imgs, train_ann, train_cat = info_dataset(train_meta, 'train')
test_imgs, test_ann, test_cat = info_dataset(test_meta, 'test')

# Peek at a few samples
def peek(lst, n=2):
    if not lst: return
    for i, it in enumerate(itertools.islice(lst, n)):
        if isinstance(it, dict):
            print(f'sample[{i}] keys:', list(it.keys())[:20])
        else:
            print(f'sample[{i}] type:', type(it))

print('\ntrain images peek:')
peek(train_imgs, 3)
print('train annotations peek:')
peek(train_ann, 3)
print('train categories peek:')
peek(train_cat, 3)
print('test images/items peek:')
peek(test_imgs, 3)

# Try to locate labels and ids
def guess_label_key(sample):
    for k in ('taxonID','category_id','label','target','y','class_id','category'):
        if k in sample:
            return k
    return None

train_label_key = None
train_id_key = None
train_fname_key = None
if train_imgs and isinstance(train_imgs, list) and len(train_imgs)>0 and isinstance(train_imgs[0], dict):
    sample = train_imgs[0]
    for k in ('image_id','id','_id','uid','record_id'):
        if k in sample: train_id_key = k; break
    for k in ('file_name','file','filepath','path','image_path','name'):
        if k in sample: train_fname_key = k; break
if train_ann and len(train_ann)>0 and isinstance(train_ann[0], dict):
    train_label_key = guess_label_key(train_ann[0])
elif train_imgs and len(train_imgs)>0 and isinstance(train_imgs[0], dict):
    train_label_key = guess_label_key(train_imgs[0])

print('train_id_key:', train_id_key, 'train_label_key:', train_label_key, 'train_fname_key:', train_fname_key)

test_id_key = None
test_fname_key = None
if test_imgs and len(test_imgs)>0 and isinstance(test_imgs[0], dict):
    for k in ('image_id','id','_id','uid','record_id'):
        if k in test_imgs[0]: test_id_key = k; break
    for k in ('file_name','file','filepath','path','image_path','name'):
        if k in test_imgs[0]: test_fname_key = k; break
print('test_id_key:', test_id_key, 'test_fname_key:', test_fname_key)

# Class counts if possible
cnt = None
if train_ann and train_label_key:
    cnt = Counter(a[train_label_key] for a in train_ann if train_label_key in a)
    print('num classes from annotations:', len(cnt))
    print('top5 classes by freq:', cnt.most_common(5))
elif train_imgs and train_label_key:
    cnt = Counter(i[train_label_key] for i in train_imgs if train_label_key in i)
    print('num classes from images:', len(cnt))
    print('top5 classes by freq:', cnt.most_common(5))

print('\n=== SAMPLE SUBMISSION ===', flush=True)
with open('sample_submission.csv','r') as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(itertools.islice(reader, 5))  # tolerates files with fewer than 5 data rows
print('header:', header)
print('first rows:', rows)

# Sanity: count image files recursively (does not open files or match them against the metadata)
def count_leaf_files(root):
    total = 0
    for dp, dn, fn in os.walk(root):
        total += len(fn)
    return total
n_train_files = count_leaf_files('train_images') if os.path.exists('train_images') else -1
n_test_files  = count_leaf_files('test_images') if os.path.exists('test_images') else -1
print(f'train_images file count (recursive): {n_train_files}')
print(f'test_images file count  (recursive): {n_test_files}')

print('\nDone. Elapsed: %.2fs' % (time.time()-t0), flush=True)

=== GPU CHECK (nvidia-smi) ===


Sun Sep 28 21:49:22 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     414MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

.00_eda_and_planning_kernel_state.json 183
00_eda_and_planning.ipynb 54597
agent_metadata/ -> 3 dirs, 9 files
checkpoints/ -> 0 dirs, 0 files
constraints.txt 51
description.md 9057
docker_run.log 404262
embeddings/ -> 0 dirs, 6 files
requirements.txt 2684
sample_submission.csv 1629423
submission.csv 2043720
task.txt 2372
test_images/ -> 175 dirs, 0 files
test_metadata.json 18686508
train_images/ -> 156 dirs, 0 files
train_metadata.json 590741595

=== LOAD METADATA ===


train type: <class 'dict'>
train keys: ['annotations', 'categories', 'distances', 'genera', 'images', 'institutions', 'license']
test type: <class 'list'>
test len: 174052
train: images=665720
train: annotations=665720
train: categories=15501
test: list length=174052

train images peek:
sample[0] keys: ['file_name', 'image_id', 'license']
sample[1] keys: ['file_name', 'image_id', 'license']
sample[2] keys: ['file_name', 'image_id', 'license']
train annotations peek:
sample[0] keys: ['category_id', 'genus_id', 'image_id', 'institution_id']
sample[1] keys: ['category_id', 'genus_id', 'image_id', 'institution_id']
sample[2] keys: ['category_id', 'genus_id', 'image_id', 'institution_id']
train categories peek:
sample[0] keys: ['authors', 'category_id', 'family', 'genus', 'scientificName', 'species']
sample[1] keys: ['authors', 'category_id', 'family', 'genus', 'scientificName', 'species']
sample[2] keys: ['authors', 'category_id', 'family', 'genus', 'scientificName', 'species']
test images/items peek:

header: ['Id', 'Predicted']
first rows: [['0', '42'], ['1', '42'], ['2', '42'], ['3', '42'], ['4', '42']]


train_images file count (recursive): 665720
test_images file count  (recursive): 174052

Done. Elapsed: 3.30s
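
The cell above builds a per-class `Counter` but the captured output does not show how skewed it is. A summary like the following could quantify the long tail named in Milestone 0. It is a sketch on toy labels: `skew_summary` and `rare_threshold` are illustrative names, and no real dataset statistics are implied.

```python
import statistics
from collections import Counter

def skew_summary(labels, rare_threshold=5):
    """Summarize class-size imbalance for a list of labels."""
    sizes = sorted(Counter(labels).values())
    return {
        'num_classes': len(sizes),
        'min': sizes[0],
        'median': statistics.median(sizes),
        'max': sizes[-1],
        'imbalance_ratio': sizes[-1] / sizes[0],
        'num_rare': sum(1 for s in sizes if s < rare_threshold),
    }

toy_labels = [0] * 80 + [1] * 10 + [2] * 3 + [3] * 1
summary = skew_summary(toy_labels)
print(summary)
```

With the notebook's variables, the input would be `[a['category_id'] for a in train_ann]`.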


In [None]:
import os, sys, subprocess, shutil, time
t0=time.time()
print('== Reset any prior torch stack ==', flush=True)
for pkg in ("torch","torchvision","torchaudio"):
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)

for d in (
    "/app/.pip-target/torch",
    "/app/.pip-target/torchvision",
    "/app/.pip-target/torchaudio",
    "/app/.pip-target/torch-2.4.1.dist-info",
    "/app/.pip-target/torchvision-0.19.1.dist-info",
    "/app/.pip-target/torchaudio-2.4.1.dist-info",
):
    if os.path.exists(d):
        print('Removing', d); shutil.rmtree(d, ignore_errors=True)

def pip(*args):
    print('> pip', ' '.join(args), flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

print('== Install PyTorch cu121 stack ==', flush=True)
pip('install',
    '--index-url','https://download.pytorch.org/whl/cu121',
    '--extra-index-url','https://pypi.org/simple',
    'torch==2.4.1','torchvision==0.19.1','torchaudio==2.4.1')

with open('constraints.txt','w') as f:  # context manager closes the file before pip reads it
    f.write('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')
print('== Install non-torch deps under constraints ==', flush=True)
pip('install','-c','constraints.txt',
    'timm==1.0.9',
    'open_clip_torch==2.24.0',
    'faiss-cpu==1.8.0',
    'pandas', 'numpy', 'scikit-learn', 'Pillow', 'tqdm',
    '--upgrade-strategy','only-if-needed')

import torch
print('torch:', torch.__version__, 'cuda build:', getattr(torch.version,'cuda',None))
print('CUDA available:', torch.cuda.is_available())
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available'
print('GPU:', torch.cuda.get_device_name(0))
print('Done installs in %.1fs' % (time.time()-t0), flush=True)

In [2]:
import os, json, time, math, sys, gc, hashlib
import numpy as np
import pandas as pd
from PIL import Image, ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
import torch
from torch.utils.data import Dataset, DataLoader
import open_clip
from tqdm import tqdm

print('== Build train/test tables ==', flush=True)
with open('train_metadata.json','r') as f:
    tmeta = json.load(f)
train_imgs = tmeta['images']
train_anns = tmeta['annotations']
img_by_id = {x['image_id']: x for x in train_imgs}

rows = []
for a in train_anns:
    iid = a['image_id']
    r = img_by_id[iid]
    rows.append((iid, r['file_name'], a['category_id']))
train_df = pd.DataFrame(rows, columns=['image_id','file_name','category_id'])
train_df['path'] = 'train_images/' + train_df['file_name'].astype(str)
print('train_df:', train_df.shape, 'unique classes:', train_df['category_id'].nunique())

with open('test_metadata.json','r') as f:
    test_meta = json.load(f)
test_df = pd.DataFrame(test_meta)
test_df = test_df[['image_id','file_name']].copy()
test_df['path'] = 'test_images/' + test_df['file_name'].astype(str)
print('test_df:', test_df.shape)

# Build class frequency and maps
cls_counts = train_df['category_id'].value_counts().sort_index()
cat_ids = cls_counts.index.to_list()
cat_id2idx = {cid:i for i,cid in enumerate(cat_ids)}
idx2cat_id = {i:cid for cid,i in cat_id2idx.items()}
train_df['y'] = train_df['category_id'].map(cat_id2idx).astype(int)
num_classes = len(cat_ids)
print('num_classes:', num_classes)

os.makedirs('embeddings', exist_ok=True)

class ImageDataset(Dataset):
    def __init__(self, df, preprocess):
        self.df = df.reset_index(drop=True)
        self.preprocess = preprocess
    def __len__(self):
        return len(self.df)
    def __getitem__(self, i):
        p = self.df.at[i, 'path']
        try:
            img = Image.open(p).convert('RGB')
        except Exception:
            # fall back to a 224x224 black image if the file cannot be read
            img = Image.new('RGB', (224,224), (0,0,0))
        return self.preprocess(img), i

def extract_embeddings(df, model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=8, out_path='embeddings/train_vitb32.npy'):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Loading open_clip model {model_name} / {pretrained}', flush=True)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval()
    model = model.to(device)
    torch.backends.cudnn.benchmark = True
    ds = ImageDataset(df, preprocess)
    dl = DataLoader(ds, batch_size=bs, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=True)
    n = len(ds)
    with torch.no_grad():
        # probe dim
        x0, _ = next(iter(dl))
        x0 = x0.to(device, non_blocking=True)
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=(device=='cuda')):
            f0 = model.encode_image(x0)
        d = f0.shape[-1]
    feats = np.memmap(out_path + '.mmap', dtype='float32', mode='w+', shape=(n, d))
    t0 = time.time()
    seen = 0
    with torch.no_grad():
        for xb, idx in tqdm(dl, total=math.ceil(n/bs), desc='emb', mininterval=2.0):
            xb = xb.to(device, non_blocking=True)
            with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=(device=='cuda')):
                fb = model.encode_image(xb)
            fb = fb.float()
            fb = torch.nn.functional.normalize(fb, p=2, dim=-1)
            fb = fb.cpu().numpy()
            feats[idx.numpy()] = fb
            seen += xb.size(0)
            if seen % (bs*10) == 0:
                print(f'  wrote {seen}/{n}', flush=True)
    feats.flush()
    np.save(out_path, np.asarray(feats))
    del feats
    try:
        os.remove(out_path + '.mmap')  # temporary memmap is no longer needed
    except OSError:
        pass
    print('Saved', out_path, 'elapsed %.1fs' % (time.time()-t0))

def run_retrieval_kNN(train_emb_path, test_emb_path, k=50, out_csv='submission.csv'):
    import faiss
    print('Loading embeddings...', flush=True)
    e_tr = np.load(train_emb_path).astype('float32')
    e_te = np.load(test_emb_path).astype('float32')
    # e_* already L2-normalized
    print('train emb:', e_tr.shape, 'test emb:', e_te.shape)
    index = faiss.IndexFlatIP(e_tr.shape[1])
    index.add(e_tr)
    print('Searching kNN...', flush=True)
    sims, idxs = index.search(e_te, k)
    # Map neighbor indices to labels
    y_tr = train_df['y'].to_numpy()
    # vote with similarity weights
    num_classes = y_tr.max()+1
    preds = np.empty((e_te.shape[0],), dtype=np.int32)
    for i in tqdm(range(e_te.shape[0]), desc='vote', mininterval=2.0):
        ids = idxs[i]
        ws = sims[i]
        labs = y_tr[ids]
        # accumulate per-class score
        scores = {}
        for lab, w in zip(labs, ws):
            scores[lab] = scores.get(lab, 0.0) + float(max(w, 0.0))
        # pick argmax
        if scores:
            preds[i] = max(scores.items(), key=lambda x:x[1])[0]
        else:
            preds[i] = labs[0]
    # map back to category_id
    pred_cat = [idx2cat_id[int(p)] for p in preds]
    # Assumption: sample_submission Ids are 0..N-1 in the same order as test_metadata.json;
    # verify this against sample_submission.csv before submitting.
    sub = pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat})
    sub.to_csv(out_csv, index=False)
    print('Wrote', out_csv)

print('== Retrieval baseline plan ready ==')
print('Next steps:')
print('1) extract_embeddings(train_df, model_name=\'ViT-B-32\', pretrained=\'laion2b_s34b_b79k\', bs=256, out_path=\'embeddings/train_vitb32.npy\')')
print('2) extract_embeddings(test_df,  model_name=\'ViT-B-32\', pretrained=\'laion2b_s34b_b79k\', bs=256, out_path=\'embeddings/test_vitb32.npy\')')
print('3) run_retrieval_kNN(\'embeddings/train_vitb32.npy\', \'embeddings/test_vitb32.npy\', k=50, out_csv=\'submission.csv\')')
print('Smoke test first on a small subset before full run (e.g., head 20000).')



== Build train/test tables ==


train_df: (665720, 4) unique classes: 15501
test_df: (174052, 3)
num_classes: 15501
== Retrieval baseline plan ready ==
Next steps:
1) extract_embeddings(train_df, model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, out_path='embeddings/train_vitb32.npy')
2) extract_embeddings(test_df,  model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, out_path='embeddings/test_vitb32.npy')
3) run_retrieval_kNN('embeddings/train_vitb32.npy', 'embeddings/test_vitb32.npy', k=50, out_csv='submission.csv')
Smoke test first on a small subset before full run (e.g., head 20000).
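
The voting rule inside `run_retrieval_kNN` sums the non-negative similarities of each neighbor's class and picks the class with the largest total. A toy illustration (the helper name `weighted_vote` and the numbers are made up):

```python
from collections import Counter

def weighted_vote(neighbor_labels, neighbor_sims):
    """Return the label whose neighbors have the largest summed similarity."""
    scores = {}
    for lab, sim in zip(neighbor_labels, neighbor_sims):
        scores[lab] = scores.get(lab, 0.0) + max(sim, 0.0)  # negative sims count as 0
    return max(scores.items(), key=lambda kv: kv[1])[0] if scores else None

# One very close neighbor of class 7 versus two distant neighbors of class 3.
neighbor_labels = [7, 3, 3]
neighbor_sims = [0.9, 0.3, 0.3]

weighted = weighted_vote(neighbor_labels, neighbor_sims)
majority = Counter(neighbor_labels).most_common(1)[0][0]
print('weighted:', weighted, 'majority:', majority)  # -> weighted: 7 majority: 3
```

Similarity weighting lets one strong match outvote several weak ones, which tends to matter for rare classes that have few training images to appear among the neighbors.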


In [None]:
# Smoke test: small subset embeddings and kNN vote
import time, numpy as np, pandas as pd, os

t0 = time.time()
train_n = 5000
test_n = 2000
train_sub = train_df.head(train_n).copy()
test_sub  = test_df.head(test_n).copy()
print(f'Smoke subset sizes -> train: {len(train_sub)}, test: {len(test_sub)}', flush=True)

os.makedirs('embeddings', exist_ok=True)
tr_out = 'embeddings/train_vitb32_smoke.npy'
te_out = 'embeddings/test_vitb32_smoke.npy'

extract_embeddings(train_sub, model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=8, out_path=tr_out)
extract_embeddings(test_sub,  model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=8, out_path=te_out)

print('Embeddings saved (smoke). Loading back to verify...', flush=True)
etr = np.load(tr_out)
ete = np.load(te_out)
print('Shapes -> train:', etr.shape, 'test:', ete.shape)

# kNN on smoke to validate pipeline (no submission write for full test yet)
def run_knn_smoke():
    import faiss, numpy as np
    idx = faiss.IndexFlatIP(etr.shape[1])
    idx.add(etr.astype('float32'))
    sims, inds = idx.search(ete.astype('float32'), 20)
    y_tr = train_sub['y'].to_numpy()
    preds = np.empty((ete.shape[0],), dtype=np.int32)
    for i in range(ete.shape[0]):
        labs = y_tr[inds[i]]
        ws = sims[i]
        scores = {}
        for lab, w in zip(labs, ws):
            if w > 0:
                scores[lab] = scores.get(lab, 0.0) + float(w)
        preds[i] = max(scores.items(), key=lambda x:x[1])[0] if scores else labs[0]
    print('kNN smoke preds sample:', preds[:10])
run_knn_smoke()

print('Smoke test done. Elapsed: %.1fs' % (time.time()-t0))
print('Next: run full extraction and kNN to produce submission.csv.')

In [None]:
# Full extraction and kNN submission
import time, numpy as np

t0 = time.time()
train_out = 'embeddings/train_vitb32.npy'
test_out  = 'embeddings/test_vitb32.npy'
print('Starting full embedding extraction with OpenCLIP ViT-B/32...', flush=True)
extract_embeddings(train_df, model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=12, out_path=train_out)
extract_embeddings(test_df,  model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=12, out_path=test_out)
print('Embeddings ready. Now running kNN and writing submission.csv ...', flush=True)
run_retrieval_kNN(train_out, test_out, k=50, out_csv='submission.csv')
print('All done in %.1f minutes' % ((time.time()-t0)/60.0))

In [None]:
# Prototype cosine classifier with class-prior debiasing (GPU, chunked)
import time, numpy as np, pandas as pd, torch

t0 = time.time()
train_emb_path = 'embeddings/train_vitb32.npy'
test_emb_path  = 'embeddings/test_vitb32.npy'
print('Loading embeddings...', flush=True)
E_tr = np.load(train_emb_path).astype('float32')  # (Ntr, D) L2-normalized
E_te = np.load(test_emb_path).astype('float32')   # (Nte, D) L2-normalized
print('Shapes:', E_tr.shape, E_te.shape, flush=True)

# Build class prototypes (mean of normalized train embeddings per class)
y_tr = train_df['y'].to_numpy()
num_classes = y_tr.max() + 1
D = E_tr.shape[1]
P_sum = np.zeros((num_classes, D), dtype=np.float32)
cnt = np.zeros((num_classes,), dtype=np.int64)
for c in range(num_classes):
    # indices of all training samples with label c (one full pass over y_tr per class)
    idx = np.where(y_tr == c)[0]
    if idx.size == 0:
        continue
    P_sum[c] = E_tr[idx].mean(axis=0)
    cnt[c] = idx.size
print('Prototype counts built. Non-empty classes:', int((cnt>0).sum()), flush=True)
# L2-normalize prototypes
P = P_sum / (np.linalg.norm(P_sum, axis=1, keepdims=True) + 1e-12)
del P_sum

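# Class-prior debiasing: subtract alpha * log(pi_c) from logits.
# NOTE: the logits here are unscaled cosines in [-1, 1], whereas alpha * log(pi_c) can differ by
# several units between frequent and rare classes, so the prior term may dominate the prediction.
# The linear+prototype cell further down multiplies the cosine by proto_scale = 10.0.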
alpha = 0.5
pi = cnt / cnt.sum()
log_pi = np.log(pi + 1e-12).astype('float32')

# Move prototypes to GPU (fp16 for speed/memory)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
P_t = torch.from_numpy(P).to(device=device, dtype=torch.float16)
log_pi_t = torch.from_numpy(log_pi).to(device=device, dtype=torch.float16)
del P
torch.cuda.empty_cache() if device=='cuda' else None

# Chunked inference: logits = E_te_chunk @ P^T (cosine since both L2-normalized)
bs = 4096
Nte = E_te.shape[0]
pred_idx = np.empty((Nte,), dtype=np.int32)
start = 0
loop_t0 = time.time()
while start < Nte:
    end = min(start + bs, Nte)
    X = torch.from_numpy(E_te[start:end]).to(device=device, dtype=torch.float16, non_blocking=True)
    # matmul: (B,D) @ (C,D)^T -> (B,C)
    logits = X @ P_t.T  # cosine similarity
    # debias
    logits = logits - alpha * log_pi_t
    # argmax per row
    top1 = torch.argmax(logits, dim=1).to('cpu').numpy().astype(np.int32)
    pred_idx[start:end] = top1
    start = end
    if (start // bs) % 10 == 0:
        print(f'Processed {start}/{Nte} (elapsed {time.time()-loop_t0:.1f}s)', flush=True)
del P_t, log_pi_t
torch.cuda.empty_cache() if device=='cuda' else None

# Map back to original category_id and write submission in test order
pred_cat = [idx2cat_id[int(i)] for i in pred_idx]
sub = pd.DataFrame({ 'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat })
sub.to_csv('submission.csv', index=False)
print('Wrote submission.csv')
print('Total elapsed: %.1f s' % (time.time()-t0))
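
The debiasing step subtracts `alpha * log(pi_c)` from cosine similarities that are bounded by 1 in magnitude. The toy example below shows how the prior term can decide the prediction unless the cosine is scaled first. The helper and the scale value 30.0 are purely illustrative and are not tuned settings from this notebook.

```python
import math

def adjusted_argmax(cosines, counts, alpha=0.5, scale=1.0):
    """argmax_c of scale * cos_c - alpha * log(pi_c), with pi_c = count_c / total."""
    total = sum(counts)
    scores = [scale * c - alpha * math.log(n / total) for c, n in zip(cosines, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

cosines = [0.80, 0.60]  # the test embedding is clearly closer to class 0
counts = [1000, 10]     # class 0 is 100x more frequent than class 1

unscaled = adjusted_argmax(cosines, counts, scale=1.0)
scaled = adjusted_argmax(cosines, counts, scale=30.0)
print(unscaled, scaled)  # -> 1 0
```

With unscaled cosines the rare class wins despite the lower similarity, because the prior correction (about 2.3 here) exceeds any possible cosine gap. Choosing `alpha` and the scale jointly on held-out data would be one way to set the balance.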

In [None]:
# Re-extract OpenCLIP embeddings with 2x TTA (orig + hflip), average and L2-normalize
import time, math, numpy as np, torch, json
from torch.utils.data import Dataset, DataLoader
from PIL import Image, ImageOps, ImageFile
import open_clip
ImageFile.LOAD_TRUNCATED_IMAGES = True

class TTADataset(Dataset):
    def __init__(self, df, preprocess):
        self.df = df.reset_index(drop=True)
        self.preprocess = preprocess
    def __len__(self): return len(self.df)
    def __getitem__(self, i):
        p = self.df.at[i, 'path']
        try:
            img = Image.open(p).convert('RGB')
        except Exception:
            img = Image.new('RGB', (224,224), (0,0,0))
        img_flip = ImageOps.mirror(img)
        return self.preprocess(img), self.preprocess(img_flip), i

def extract_embeddings_tta2(df, model_name='ViT-B-32', pretrained='laion2b_s34b_b79k', bs=256, num_workers=12, out_path='embeddings/train_vitb32_tta2.npy'):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    print(f'Loading open_clip model {model_name} / {pretrained}', flush=True)
    model, _, preprocess = open_clip.create_model_and_transforms(model_name, pretrained=pretrained, device=device)
    model.eval(); model = model.to(device)
    torch.backends.cudnn.benchmark = True
    ds = TTADataset(df, preprocess)
    dl = DataLoader(ds, batch_size=bs, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=True)
    n = len(ds)
    # probe dim
    with torch.no_grad():
        (x0a, x0b, _) = next(iter(dl))
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=(device=='cuda')):
            d = model.encode_image(x0a.to(device)).shape[-1]
    feats = np.memmap(out_path + '.mmap', dtype='float32', mode='w+', shape=(n, d))
    t0 = time.time(); seen = 0
    with torch.no_grad():
        for xa, xb, idx in dl:  # reuse the loader built above; a second DataLoader would spawn another persistent worker pool
            xa = xa.to(device, non_blocking=True); xb = xb.to(device, non_blocking=True)
            with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=(device=='cuda')):
                fa = model.encode_image(xa)
                fb = model.encode_image(xb)
            fa = torch.nn.functional.normalize(fa.float(), p=2, dim=-1)
            fb = torch.nn.functional.normalize(fb.float(), p=2, dim=-1)
            f = (fa + fb) * 0.5
            f = torch.nn.functional.normalize(f, p=2, dim=-1).cpu().numpy()
            feats[idx.numpy()] = f
            seen += xa.size(0)
            if seen % (bs*10) == 0: print(f'  wrote {seen}/{n}', flush=True)
    feats.flush(); np.save(out_path, np.asarray(feats)); del feats
    import os
    try:
        os.remove(out_path + '.mmap')  # temporary memmap is no longer needed
    except OSError:
        pass
    print('Saved', out_path, 'elapsed %.1fs' % (time.time()-t0))

print('Starting 2x TTA extraction for train/test (ViT-B/32)...', flush=True)
t0 = time.time()
extract_embeddings_tta2(train_df, out_path='embeddings/train_vitb32_tta2.npy')
extract_embeddings_tta2(test_df,  out_path='embeddings/test_vitb32_tta2.npy')
print('TTA embeddings done in %.1f min' % ((time.time()-t0)/60.0), flush=True)

# Run prototype cosine with debias on TTA features
import pandas as pd
E_tr = np.load('embeddings/train_vitb32_tta2.npy').astype('float32')
E_te = np.load('embeddings/test_vitb32_tta2.npy').astype('float32')
y_tr = train_df['y'].to_numpy()
num_classes = y_tr.max()+1; D = E_tr.shape[1]
P_sum = np.zeros((num_classes, D), dtype=np.float32); cnt = np.zeros((num_classes,), dtype=np.int64)
for c in range(num_classes):
    idx = np.where(y_tr==c)[0]
    if idx.size: P_sum[c] = E_tr[idx].mean(axis=0); cnt[c]=idx.size
P = P_sum / (np.linalg.norm(P_sum, axis=1, keepdims=True) + 1e-12); del P_sum
alpha = 0.5
log_pi = np.log((cnt / cnt.sum()) + 1e-12).astype('float32')
device = 'cuda' if torch.cuda.is_available() else 'cpu'
P_t = torch.from_numpy(P).to(device=device, dtype=torch.float16); del P
log_pi_t = torch.from_numpy(log_pi).to(device=device, dtype=torch.float16)
bs = 4096; Nte = E_te.shape[0]; pred_idx = np.empty((Nte,), dtype=np.int32)
s=0; t1=time.time()
while s < Nte:
    e = min(s+bs, Nte)
    X = torch.from_numpy(E_te[s:e]).to(device=device, dtype=torch.float16, non_blocking=True)
    logits = X @ P_t.T
    logits = logits - alpha * log_pi_t
    pred_idx[s:e] = torch.argmax(logits, dim=1).to('cpu').numpy().astype(np.int32)
    s = e
    if (s//bs)%10==0: print(f'TTA infer {s}/{Nte} (elapsed {time.time()-t1:.1f}s)', flush=True)
del P_t, log_pi_t; torch.cuda.empty_cache() if device=='cuda' else None
pred_cat = [idx2cat_id[int(i)] for i in pred_idx]
sub = pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat})
sub.to_csv('submission.csv', index=False)
print('submission.csv written (TTA prototypes)')
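
The fusion used in `extract_embeddings_tta2` normalizes each view, averages, and normalizes again. A two-dimensional toy example of why the second normalization is needed (the helper names are illustrative):

```python
import math

def l2_normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def fuse_views(f_orig, f_flip):
    """Average two L2-normalized view embeddings and renormalize the result."""
    a, b = l2_normalize(f_orig), l2_normalize(f_flip)
    mean = [(x + y) / 2 for x, y in zip(a, b)]
    return l2_normalize(mean), math.sqrt(sum(x * x for x in mean))

fused, mean_norm = fuse_views([3.0, 4.0], [4.0, 3.0])
print('norm of plain average:', round(mean_norm, 4))
print('fused:', [round(x, 4) for x in fused])
```

The plain average of two unit vectors has norm below 1 whenever the views disagree, so without the final normalization the inner products computed later would no longer be cosines.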

In [None]:
# Linear probe on CLIP embeddings + prototype blend (fast, GPU, chunked)
import os, time, math, numpy as np, pandas as pd, torch, torch.nn as nn

def pick_emb(path_tta, path_base):
    return path_tta if os.path.exists(path_tta) else path_base

train_emb_path = pick_emb('embeddings/train_vitb32_tta2.npy','embeddings/train_vitb32.npy')
test_emb_path  = pick_emb('embeddings/test_vitb32_tta2.npy', 'embeddings/test_vitb32.npy')
print('Using embeddings:', train_emb_path, ' / ', test_emb_path, flush=True)

# Load features
E_tr = np.load(train_emb_path).astype('float32')
E_te = np.load(test_emb_path).astype('float32')
y_tr = train_df['y'].to_numpy()
num_classes = int(y_tr.max()+1); D = E_tr.shape[1]
print('Train feats:', E_tr.shape, 'Test feats:', E_te.shape, 'Classes:', num_classes, flush=True)

# Class weights (inverse sqrt freq)
freq = np.bincount(y_tr, minlength=num_classes).astype(np.float32)
w = 1.0 / np.sqrt(freq + 1e-8)
w = (w * (num_classes / w.sum())).astype('float32')

# Dataset and loader over numpy arrays
class FeatDS(torch.utils.data.Dataset):
    def __init__(self, feats, labels):
        self.X = feats; self.y = labels
    def __len__(self): return self.X.shape[0]
    def __getitem__(self, i):
        return self.X[i], self.y[i]

device = 'cuda' if torch.cuda.is_available() else 'cpu'
train_ds = FeatDS(E_tr, y_tr.astype(np.int64))
bs = 16384
train_dl = torch.utils.data.DataLoader(train_ds, batch_size=bs, shuffle=True, num_workers=4, pin_memory=True, persistent_workers=True)

# Model: linear classifier
model = nn.Linear(D, num_classes, bias=True).to(device)
optim = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
crit = nn.CrossEntropyLoss(weight=torch.from_numpy(w).to(device))
scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))  # torch.cuda.amp.GradScaler is deprecated in torch 2.4
epochs = 2
t0 = time.time()
for ep in range(1, epochs+1):
    model.train()
    run_loss = 0.0; n_seen = 0
    ep_t0 = time.time()
    for xb, yb in train_dl:
        xb = xb.to(device=device, dtype=torch.float16, non_blocking=True)
        yb = yb.to(device=device, non_blocking=True)
        optim.zero_grad(set_to_none=True)
        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=(device=='cuda')):
            logits = model(xb)
            loss = crit(logits, yb)
        scaler.scale(loss).backward()
        scaler.step(optim); scaler.update()
        run_loss += loss.detach().float().item() * xb.size(0)
        n_seen += xb.size(0)
    print(f'Epoch {ep}/{epochs} - loss {(run_loss/max(1,n_seen)):.4f} - elapsed {time.time()-ep_t0:.1f}s', flush=True)
print('Training done in %.1fs' % (time.time()-t0), flush=True)

# Build prototypes for blend and log priors
P_sum = np.zeros((num_classes, D), dtype=np.float32); cnt = freq.astype(np.int64)
for c in range(num_classes):
    idx = np.where(y_tr==c)[0]
    if idx.size: P_sum[c] = E_tr[idx].mean(axis=0)
P = P_sum / (np.linalg.norm(P_sum, axis=1, keepdims=True) + 1e-12); del P_sum
alpha = 0.5
log_pi = np.log((cnt / max(1, int(cnt.sum()))) + 1e-12).astype('float32')

# Inference chunked with blend: 0.7*linear + 0.3*scaled_cosine
model.eval()
with torch.no_grad():
    W_proto = torch.from_numpy(P).to(device=device, dtype=torch.float16)
    log_pi_t = torch.from_numpy(log_pi).to(device=device, dtype=torch.float16)
    blend_w_lin = 0.7
    blend_w_proto = 0.3
    proto_scale = 10.0
    Nte = E_te.shape[0]; bs_inf = 2048
    preds = np.empty((Nte,), dtype=np.int32)
    s = 0; t1 = time.time();
    while s < Nte:
        e = min(s+bs_inf, Nte)
        X = torch.from_numpy(E_te[s:e]).to(device=device, dtype=torch.float16, non_blocking=True)
        with torch.amp.autocast('cuda', dtype=torch.float16, enabled=(device=='cuda')):
            logit_lin = model(X)                              # (B,C)
            logit_proto = (X @ W_proto.T) * proto_scale      # (B,C)
            logits = blend_w_lin*logit_lin + blend_w_proto*logit_proto
            logits = logits - alpha * log_pi_t               # debias
        top1 = torch.argmax(logits, dim=1).to('cpu').numpy().astype(np.int32)
        preds[s:e] = top1
        s = e
        if (s//bs_inf) % 10 == 0:
            print(f'Infer {s}/{Nte} (elapsed {time.time()-t1:.1f}s)', flush=True)
    del W_proto, log_pi_t
    if device == 'cuda':
        torch.cuda.empty_cache()

# Map to category_id and write
pred_cat = [idx2cat_id[int(i)] for i in preds]
sub = pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat})
sub.to_csv('submission.csv', index=False)
print('submission.csv written (linear+proto blend)')
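
The prototype branch of the blend above can be read in isolation. The following is a minimal NumPy sketch rather than the notebook's code: `class_prototypes` and `debiased_proto_logits` are hypothetical helper names, the toy arrays are made up, and the query rows are L2-normalised explicitly (an assumption; the cell above relies on whatever normalisation `E_te` already has).

```python
import numpy as np

def class_prototypes(emb, labels, n_classes):
    """Unit-length mean embedding per class; classes without samples stay at zero."""
    protos = np.zeros((n_classes, emb.shape[1]), dtype=np.float32)
    for c in range(n_classes):
        rows = emb[labels == c]
        if len(rows):
            protos[c] = rows.mean(axis=0)
    norms = np.linalg.norm(protos, axis=1, keepdims=True)
    return protos / (norms + 1e-12)

def debiased_proto_logits(x, protos, counts, scale=10.0, alpha=0.5):
    """Scaled cosine similarity minus alpha * log prior (boosts rare classes)."""
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)  # assumption: normalise queries
    log_prior = np.log(counts / counts.sum() + 1e-12)
    return scale * (x @ protos.T) - alpha * log_prior

# Toy data: class 0 has two training samples, class 1 has one.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
labels = np.array([0, 0, 1])
counts = np.bincount(labels, minlength=2).astype(np.float64)
protos = class_prototypes(emb, labels, n_classes=2)

query = np.array([[1.0, 1.0]])  # slightly closer to the class-0 prototype
pred_no_debias = debiased_proto_logits(query, protos, counts, alpha=0.0).argmax(axis=1)
pred_debiased = debiased_proto_logits(query, protos, counts, alpha=1.0).argmax(axis=1)
print(pred_no_debias, pred_debiased)
```

Raising `alpha` moves borderline decisions toward the rarer class, which is the intended effect under a macro-F1 metric. The value 0.5 used above is a setting to validate, not a known optimum.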

In [8]:
# ConvNeXtV2 training (speed-tuned) with LADE, EMA, AMP
import os, time, math, json, random, gc
import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from PIL import Image, ImageFile
import timm
from timm.utils import ModelEmaV2
from timm.data import create_transform

ImageFile.LOAD_TRUNCATED_IMAGES = True
torch.backends.cudnn.benchmark = True

def seed_all(s=42):
    random.seed(s); np.random.seed(s); torch.manual_seed(s); torch.cuda.manual_seed_all(s)
seed_all(42)

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
try: torch.set_float32_matmul_precision('medium')
except Exception: pass  # older torch builds may lack this setting
torch.backends.cuda.matmul.allow_tf32 = True

# Config
num_classes = int(train_df['y'].max()+1)
img_size = 256  # reduced for throughput
backbone_candidates = [
    'convnextv2_small',                  # use default pretrained tag
    'convnextv2_tiny',                   # fallback smaller
    'convnextv2_base.fcmae_ft_in22k_in1k',  # known-good base weights
]
epochs = 6
batch_size = 96     # try 96 first; if OOM, reduce to 80/72/64 manually
accum_steps = 1     # no grad accumulation for faster steps
num_workers = 16    # more workers for I/O
lr = 3e-4
weight_decay = 0.05
ema_decay = 0.99985
warmup_epochs = 1
val_fold_index = 0
ckpt_dir = Path('checkpoints'); ckpt_dir.mkdir(exist_ok=True)

# Folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = list(skf.split(train_df.index.values, train_df['y'].values))
tr_idx, va_idx = folds[val_fold_index]
train_idx = train_df.index.values[tr_idx]
valid_idx = train_df.index.values[va_idx]
print(f'Fold {val_fold_index}: train {len(train_idx)}, valid {len(valid_idx)}')

# Transforms via timm (bicubic + light RandomErasing)
mean = (0.485, 0.456, 0.406); std = (0.229, 0.224, 0.225)
train_tfms = create_transform(
    input_size=img_size, is_training=True,
    scale=(0.9, 1.0), ratio=(0.8, 1.25), hflip=0.5,
    interpolation='bicubic', mean=mean, std=std,
    re_prob=0.25, re_mode='pixel'
)
valid_tfms = create_transform(
    input_size=img_size, is_training=False,
    interpolation='bicubic', mean=mean, std=std
)

class ImgDS(Dataset):
    def __init__(self, df, indices, tfm):
        self.df = df.loc[indices].reset_index(drop=True)
        self.tfm = tfm
    def __len__(self): return len(self.df)
    def __getitem__(self, i):
        row = self.df.iloc[i]
        path = row['path']; y = int(row['y'])
        try: img = Image.open(path).convert('RGB')
        except Exception: img = Image.new('RGB', (img_size, img_size), (0,0,0))  # unreadable file -> black placeholder
        return self.tfm(img), y

train_ds = ImgDS(train_df, train_idx, train_tfms)
valid_ds = ImgDS(train_df, valid_idx, valid_tfms)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, persistent_workers=True, drop_last=True, prefetch_factor=6)
valid_dl = DataLoader(valid_ds, batch_size=max(1, batch_size*2), shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=6)

# LADE priors
freq = train_df.loc[train_idx, 'y'].value_counts().reindex(range(num_classes), fill_value=0).values.astype(np.float32)
prior = (freq + 1.0) / float(freq.sum() + num_classes)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
log_prior = torch.from_numpy(np.log(prior)).float().to(device)

def create_model_with_fallback(cands, num_classes):
    last_err = None
    for name in cands:
        try:
            torch.cuda.empty_cache(); gc.collect()
            print(f'Trying model: {name}', flush=True)
            mdl = timm.create_model(name, pretrained=True, num_classes=num_classes)
            mdl = mdl.to(device)
            mdl = mdl.to(memory_format=torch.channels_last)
            print(f'Loaded model: {name}', flush=True)
            return mdl, name
        except Exception as e:
            last_err = e
            print(f'OOM or error with {name}: {e}', flush=True)
            try: del mdl
            except NameError: pass  # mdl was never assigned (create_model itself failed)
            torch.cuda.empty_cache(); gc.collect()
    raise last_err

model, model_name = create_model_with_fallback(backbone_candidates, num_classes)
# EMA on CPU to save VRAM
ema = ModelEmaV2(model, decay=ema_decay, device='cpu')

opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay, betas=(0.9, 0.999))
updates_per_epoch = math.ceil(len(train_dl) / max(1, accum_steps))
total_steps = epochs * updates_per_epoch
warmup_steps = max(1, warmup_epochs * updates_per_epoch)

def cosine_lr(step_it):
    if step_it < warmup_steps:
        return (step_it + 1) / max(1, warmup_steps)
    t = (step_it - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, max(0.0, t))))

scaler = torch.amp.GradScaler('cuda', enabled=(device=='cuda'))

best_f1 = -1.0
best_path = ckpt_dir / f'cnv2x{img_size}_{model_name.replace(".", "_")}_fold{val_fold_index}_ema_best.pth'

def validate(use_ema=True):
    # Move EMA to device for eval to avoid swapping full state dicts repeatedly
    mdl = model
    if use_ema:
        ema.module.to(device)
        mdl = ema.module
    mdl.eval()
    preds = []; targets = []
    with torch.no_grad(), torch.amp.autocast('cuda', enabled=(device=='cuda')):
        for xb, yb in valid_dl:
            xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            yb = yb.to(device, non_blocking=True)
            logits = mdl(xb)
            pred = torch.argmax(logits, dim=1)
            preds.append(pred.cpu().numpy()); targets.append(yb.cpu().numpy())
    if use_ema:
        ema.module.to('cpu')
        if device == 'cuda':
            torch.cuda.empty_cache()
    preds = np.concatenate(preds); targets = np.concatenate(targets)
    return f1_score(targets, preds, average='macro')

micro_it = 0
step_it = 0
t_train0 = time.time()
for ep in range(1, epochs+1):
    model.train()
    run_loss = 0.0; n_seen = 0
    ep_t0 = time.time()
    opt.zero_grad(set_to_none=True)
    for xb, yb in train_dl:
        xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
        yb = yb.to(device, non_blocking=True)
        with torch.amp.autocast('cuda', enabled=(device=='cuda')):
            logits = model(xb)
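            # Prior-shifted (balanced-softmax style) logits: this is only the cross-entropy
            # part of LADE, without its regulariser. Validation and inference use the raw logits.
            logits_adj = logits + log_prior.unsqueeze(0)
            # Light label smoothing on top of the prior-shifted loss
            loss = F.cross_entropy(logits_adj, yb, label_smoothing=0.05) / max(1, accum_steps)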
        scaler.scale(loss).backward()
        micro_it += 1
        bs_now = xb.size(0)
        run_loss += (loss.detach().float().item() * max(1, accum_steps)) * bs_now
        n_seen += bs_now
        if (micro_it % max(1, accum_steps)) == 0:
            scaler.unscale_(opt)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            lr_scale = cosine_lr(step_it)
            for pg in opt.param_groups: pg['lr'] = lr * lr_scale
            scaler.step(opt); scaler.update()
            opt.zero_grad(set_to_none=True)
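            # EMA is frozen at its initial weights during warm-up, so the EMA model
            # validated after the warm-up epoch has not seen any training updates yet.
            if step_it >= warmup_steps:
                ema.update(model)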
            step_it += 1
            if step_it % 200 == 0:
                # step_it counts optimizer updates across all epochs, so report it against total_steps
                print(f'ep {ep} step {step_it}/{total_steps} lr {opt.param_groups[0]["lr"]:.2e} loss {run_loss/max(1,n_seen):.4f}', flush=True)
    ep_loss = run_loss / max(1, n_seen)
    f1 = validate(use_ema=True)
    print(f'Epoch {ep}/{epochs} - loss {ep_loss:.4f} - val macro-F1 {f1:.5f} - elapsed {time.time()-ep_t0:.1f}s', flush=True)
    if f1 > best_f1:
        best_f1 = f1
        torch.save({'model': ema.module.state_dict(), 'f1': f1, 'epoch': ep, 'cfg': {'model_name': model_name, 'img_size': img_size, 'fold': val_fold_index}}, best_path)
        print(f'  Saved new best to {best_path} (F1={f1:.5f})', flush=True)

print(f'Training complete in {(time.time()-t_train0)/60:.1f} min. Best F1={best_f1:.5f}')
gc.collect()
if device == 'cuda':
    torch.cuda.empty_cache()



Fold 0: train 532576, valid 133144


Trying model: convnextv2_small


OOM or error with convnextv2_small: No pretrained weights exist for convnextv2_small. Use `pretrained=False` for random init.


Trying model: convnextv2_tiny


Loaded model: convnextv2_tiny


ep 1 step 200/5547 lr 1.08e-05 loss 9.7397


ep 1 step 400/5547 lr 2.16e-05 loss 9.6781


ep 1 step 600/5547 lr 3.24e-05 loss 9.6372


ep 1 step 800/5547 lr 4.33e-05 loss 9.5947


ep 1 step 1000/5547 lr 5.41e-05 loss 9.5180


ep 1 step 1200/5547 lr 6.49e-05 loss 9.4054


ep 1 step 1400/5547 lr 7.57e-05 loss 9.2762


ep 1 step 1600/5547 lr 8.65e-05 loss 9.1351


ep 1 step 1800/5547 lr 9.73e-05 loss 8.9870


ep 1 step 2000/5547 lr 1.08e-04 loss 8.8309


ep 1 step 2200/5547 lr 1.19e-04 loss 8.6729


ep 1 step 2400/5547 lr 1.30e-04 loss 8.5137


ep 1 step 2600/5547 lr 1.41e-04 loss 8.3527


ep 1 step 2800/5547 lr 1.51e-04 loss 8.1929


ep 1 step 3000/5547 lr 1.62e-04 loss 8.0331


ep 1 step 3200/5547 lr 1.73e-04 loss 7.8767


ep 1 step 3400/5547 lr 1.84e-04 loss 7.7235


ep 1 step 3600/5547 lr 1.95e-04 loss 7.5723


ep 1 step 3800/5547 lr 2.06e-04 loss 7.4259


ep 1 step 4000/5547 lr 2.16e-04 loss 7.2837


ep 1 step 4200/5547 lr 2.27e-04 loss 7.1480


ep 1 step 4400/5547 lr 2.38e-04 loss 7.0168


ep 1 step 4600/5547 lr 2.49e-04 loss 6.8910


ep 1 step 4800/5547 lr 2.60e-04 loss 6.7704


ep 1 step 5000/5547 lr 2.70e-04 loss 6.6569


ep 1 step 5200/5547 lr 2.81e-04 loss 6.5470


ep 1 step 5400/5547 lr 2.92e-04 loss 6.4421


Epoch 1/6 - loss 6.3700 - val macro-F1 0.00000 - elapsed 3201.9s


  Saved new best to checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth (F1=0.00000)


ep 2 step 5600/5547 lr 3.00e-04 loss 3.2664


ep 2 step 5800/5547 lr 3.00e-04 loss 3.2342


ep 2 step 6000/5547 lr 3.00e-04 loss 3.2308


ep 2 step 6200/5547 lr 3.00e-04 loss 3.2232


ep 2 step 6400/5547 lr 2.99e-04 loss 3.2170


ep 2 step 6600/5547 lr 2.99e-04 loss 3.1985


ep 2 step 6800/5547 lr 2.98e-04 loss 3.1831


ep 2 step 7000/5547 lr 2.98e-04 loss 3.1657


ep 2 step 7200/5547 lr 2.97e-04 loss 3.1496


ep 2 step 7400/5547 lr 2.97e-04 loss 3.1348


ep 2 step 7600/5547 lr 2.96e-04 loss 3.1172


ep 2 step 7800/5547 lr 2.95e-04 loss 3.1022


ep 2 step 8000/5547 lr 2.94e-04 loss 3.0888


ep 2 step 8200/5547 lr 2.93e-04 loss 3.0776


ep 2 step 8400/5547 lr 2.92e-04 loss 3.0636


ep 2 step 8600/5547 lr 2.91e-04 loss 3.0480


ep 2 step 8800/5547 lr 2.90e-04 loss 3.0339


ep 2 step 9000/5547 lr 2.89e-04 loss 3.0201


ep 2 step 9200/5547 lr 2.87e-04 loss 3.0059


ep 2 step 9400/5547 lr 2.86e-04 loss 2.9920


ep 2 step 9600/5547 lr 2.84e-04 loss 2.9784


ep 2 step 9800/5547 lr 2.83e-04 loss 2.9641


ep 2 step 10000/5547 lr 2.81e-04 loss 2.9515


ep 2 step 10200/5547 lr 2.80e-04 loss 2.9391


ep 2 step 10400/5547 lr 2.78e-04 loss 2.9271


ep 2 step 10600/5547 lr 2.76e-04 loss 2.9146


ep 2 step 10800/5547 lr 2.74e-04 loss 2.9033


ep 2 step 11000/5547 lr 2.72e-04 loss 2.8914


Epoch 2/6 - loss 2.8858 - val macro-F1 0.42962 - elapsed 3380.1s


  Saved new best to checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth (F1=0.42962)


ep 3 step 11200/5547 lr 2.70e-04 loss 2.0268


ep 3 step 11400/5547 lr 2.68e-04 loss 2.0375


ep 3 step 11600/5547 lr 2.66e-04 loss 2.0281


ep 3 step 11800/5547 lr 2.64e-04 loss 2.0296


ep 3 step 12000/5547 lr 2.62e-04 loss 2.0352


ep 3 step 12200/5547 lr 2.59e-04 loss 2.0404


ep 3 step 12400/5547 lr 2.57e-04 loss 2.0444


ep 3 step 12600/5547 lr 2.55e-04 loss 2.0471


ep 3 step 12800/5547 lr 2.52e-04 loss 2.0492


ep 3 step 13000/5547 lr 2.50e-04 loss 2.0494


ep 3 step 13200/5547 lr 2.47e-04 loss 2.0495


ep 3 step 13400/5547 lr 2.44e-04 loss 2.0510


ep 3 step 13600/5547 lr 2.42e-04 loss 2.0499


ep 3 step 13800/5547 lr 2.39e-04 loss 2.0495


ep 3 step 14000/5547 lr 2.36e-04 loss 2.0481


ep 3 step 14200/5547 lr 2.34e-04 loss 2.0470


ep 3 step 14400/5547 lr 2.31e-04 loss 2.0472


ep 3 step 14600/5547 lr 2.28e-04 loss 2.0459


ep 3 step 14800/5547 lr 2.25e-04 loss 2.0452


ep 3 step 15000/5547 lr 2.22e-04 loss 2.0433


ep 3 step 15200/5547 lr 2.19e-04 loss 2.0418


ep 3 step 15400/5547 lr 2.16e-04 loss 2.0396


ep 3 step 15600/5547 lr 2.13e-04 loss 2.0362


ep 3 step 15800/5547 lr 2.10e-04 loss 2.0339


ep 3 step 16000/5547 lr 2.07e-04 loss 2.0312


ep 3 step 16200/5547 lr 2.03e-04 loss 2.0292


ep 3 step 16400/5547 lr 2.00e-04 loss 2.0256


ep 3 step 16600/5547 lr 1.97e-04 loss 2.0225


Epoch 3/6 - loss 2.0219 - val macro-F1 0.59279 - elapsed 3370.7s


  Saved new best to checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth (F1=0.59279)


ep 4 step 16800/5547 lr 1.94e-04 loss 1.4712


ep 4 step 17000/5547 lr 1.91e-04 loss 1.4626


ep 4 step 17200/5547 lr 1.87e-04 loss 1.4656


ep 4 step 17400/5547 lr 1.84e-04 loss 1.4697


ep 4 step 17600/5547 lr 1.81e-04 loss 1.4746


ep 4 step 17800/5547 lr 1.77e-04 loss 1.4782


ep 4 step 18000/5547 lr 1.74e-04 loss 1.4800


ep 4 step 18200/5547 lr 1.71e-04 loss 1.4825


ep 4 step 18400/5547 lr 1.67e-04 loss 1.4848


ep 4 step 18600/5547 lr 1.64e-04 loss 1.4835


ep 4 step 18800/5547 lr 1.60e-04 loss 1.4841


ep 4 step 19000/5547 lr 1.57e-04 loss 1.4851


ep 4 step 19200/5547 lr 1.54e-04 loss 1.4848


ep 4 step 19400/5547 lr 1.50e-04 loss 1.4832


ep 4 step 19600/5547 lr 1.47e-04 loss 1.4818


ep 4 step 19800/5547 lr 1.43e-04 loss 1.4815


ep 4 step 20000/5547 lr 1.40e-04 loss 1.4810


ep 4 step 20200/5547 lr 1.37e-04 loss 1.4807


ep 4 step 20400/5547 lr 1.33e-04 loss 1.4796


ep 4 step 20600/5547 lr 1.30e-04 loss 1.4780


ep 4 step 20800/5547 lr 1.27e-04 loss 1.4776


ep 4 step 21000/5547 lr 1.23e-04 loss 1.4768


ep 4 step 21200/5547 lr 1.20e-04 loss 1.4757


ep 4 step 21400/5547 lr 1.17e-04 loss 1.4747


ep 4 step 21600/5547 lr 1.13e-04 loss 1.4733


ep 4 step 21800/5547 lr 1.10e-04 loss 1.4720


ep 4 step 22000/5547 lr 1.07e-04 loss 1.4703


Epoch 4/6 - loss 1.4688 - val macro-F1 0.62790 - elapsed 3371.2s


  Saved new best to checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth (F1=0.62790)


ep 5 step 22200/5547 lr 1.03e-04 loss 1.0953


ep 5 step 22400/5547 lr 1.00e-04 loss 1.1133


ep 5 step 22600/5547 lr 9.71e-05 loss 1.1106


ep 5 step 22800/5547 lr 9.39e-05 loss 1.1133


ep 5 step 23000/5547 lr 9.08e-05 loss 1.1136


ep 5 step 23200/5547 lr 8.76e-05 loss 1.1134


ep 5 step 23400/5547 lr 8.46e-05 loss 1.1125


ep 5 step 23600/5547 lr 8.15e-05 loss 1.1127


ep 5 step 23800/5547 lr 7.85e-05 loss 1.1113


ep 5 step 24000/5547 lr 7.56e-05 loss 1.1121


ep 5 step 24200/5547 lr 7.26e-05 loss 1.1116


ep 5 step 24400/5547 lr 6.97e-05 loss 1.1109


ep 5 step 24600/5547 lr 6.69e-05 loss 1.1103


ep 5 step 24800/5547 lr 6.41e-05 loss 1.1097


ep 5 step 25000/5547 lr 6.13e-05 loss 1.1095


ep 5 step 25200/5547 lr 5.86e-05 loss 1.1088


ep 5 step 25400/5547 lr 5.59e-05 loss 1.1076


ep 5 step 25600/5547 lr 5.33e-05 loss 1.1064


ep 5 step 25800/5547 lr 5.07e-05 loss 1.1056


ep 5 step 26000/5547 lr 4.82e-05 loss 1.1051


ep 5 step 26200/5547 lr 4.57e-05 loss 1.1040


ep 5 step 26400/5547 lr 4.33e-05 loss 1.1031


ep 5 step 26600/5547 lr 4.10e-05 loss 1.1020


ep 5 step 26800/5547 lr 3.87e-05 loss 1.1012


ep 5 step 27000/5547 lr 3.64e-05 loss 1.1000


ep 5 step 27200/5547 lr 3.42e-05 loss 1.0990


ep 5 step 27400/5547 lr 3.21e-05 loss 1.0980


ep 5 step 27600/5547 lr 3.00e-05 loss 1.0971


Epoch 5/6 - loss 1.0962 - val macro-F1 0.63557 - elapsed 3370.8s


  Saved new best to checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth (F1=0.63557)


ep 6 step 27800/5547 lr 2.80e-05 loss 0.9494


ep 6 step 28000/5547 lr 2.61e-05 loss 0.9461


ep 6 step 28200/5547 lr 2.42e-05 loss 0.9464


ep 6 step 28400/5547 lr 2.24e-05 loss 0.9447


ep 6 step 28600/5547 lr 2.06e-05 loss 0.9440


ep 6 step 28800/5547 lr 1.89e-05 loss 0.9443


ep 6 step 29000/5547 lr 1.73e-05 loss 0.9449


ep 6 step 29200/5547 lr 1.58e-05 loss 0.9453


ep 6 step 29400/5547 lr 1.43e-05 loss 0.9447


ep 6 step 29600/5547 lr 1.29e-05 loss 0.9444


ep 6 step 29800/5547 lr 1.15e-05 loss 0.9448


ep 6 step 30000/5547 lr 1.03e-05 loss 0.9444


ep 6 step 30200/5547 lr 9.05e-06 loss 0.9444


ep 6 step 30400/5547 lr 7.93e-06 loss 0.9439


ep 6 step 30600/5547 lr 6.87e-06 loss 0.9432


ep 6 step 30800/5547 lr 5.89e-06 loss 0.9427


ep 6 step 31000/5547 lr 4.99e-06 loss 0.9423


ep 6 step 31200/5547 lr 4.16e-06 loss 0.9420


ep 6 step 31400/5547 lr 3.40e-06 loss 0.9419


ep 6 step 31600/5547 lr 2.72e-06 loss 0.9418


ep 6 step 31800/5547 lr 2.11e-06 loss 0.9417


ep 6 step 32000/5547 lr 1.58e-06 loss 0.9412


ep 6 step 32200/5547 lr 1.13e-06 loss 0.9408


ep 6 step 32400/5547 lr 7.50e-07 loss 0.9407


ep 6 step 32600/5547 lr 4.49e-07 loss 0.9406


ep 6 step 32800/5547 lr 2.24e-07 loss 0.9406


ep 6 step 33000/5547 lr 7.71e-08 loss 0.9402


ep 6 step 33200/5547 lr 6.63e-09 loss 0.9402


Epoch 6/6 - loss 0.9401 - val macro-F1 0.63298 - elapsed 3370.5s


Training complete in 334.5 min. Best F1=0.63557
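
Two features of the log above can be checked with a little arithmetic: the logged learning rates, and the epoch-1 EMA score of 0.00000. The sketch below re-implements the warm-up plus cosine schedule from the training cell and computes how much of the initial EMA weights remains after a number of updates. It assumes the standard EMA rule `ema = decay * ema + (1 - decay) * model`; `lr_at` and `ema_initial_share` are hypothetical helper names, and the constants are copied from the config and log.

```python
import math

def lr_at(step, base_lr, warmup_steps, total_steps):
    """Linear warm-up to base_lr, then cosine decay to zero (same shape as cosine_lr above)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, max(0.0, t))))

def ema_initial_share(decay, n_updates):
    """Fraction of the initial weights still present in the EMA after n_updates."""
    return decay ** n_updates

updates_per_epoch = 5547              # from the log
total_steps = 6 * updates_per_epoch   # epochs = 6
warmup_steps = updates_per_epoch      # warmup_epochs = 1

# The log line "step 200" is printed after the update that used schedule index 199.
print(f"{lr_at(199, 3e-4, warmup_steps, total_steps):.2e}")  # log shows 1.08e-05

# No EMA updates happen during warm-up, so the share of initial weights is 1.0 after epoch 1.
share_after_ep1 = ema_initial_share(0.99985, 0)
share_after_ep2 = ema_initial_share(0.99985, updates_per_epoch)
print(share_after_ep1, round(share_after_ep2, 3))
```

Under these assumptions the EMA model validated after epoch 1 is still the initial network, which would be consistent with the 0.00000 macro-F1. After epoch 2 roughly 44% of the initial weights remain in the average, so the early EMA scores likely understate the live model.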


In [9]:
# Inference: load EMA-best checkpoint and run 2x TTA on test, save logits and submission
import time, glob, numpy as np, torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image, ImageFile
import timm, pandas as pd, os
from timm.data import create_transform
ImageFile.LOAD_TRUNCATED_IMAGES = True

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Robust glob to match our saved pattern cnv2x{img_size}_{model_name}_fold0_ema_best.pth
ckpts = sorted(glob.glob(str(ckpt_dir / 'cnv2x*_fold0_ema_best.pth')), key=lambda p: os.stat(p).st_mtime)
best_ckpt = ckpts[-1] if ckpts else None
print('Best checkpoint:', best_ckpt)
assert best_ckpt is not None, 'No best checkpoint found.'

# Load checkpoint and rebuild the exact model from cfg
ckpt = torch.load(best_ckpt, map_location=device)
cfg = ckpt.get('cfg', {})
model_name_ckpt = cfg.get('model_name', 'convnextv2_base.fcmae_ft_in22k_in1k')
img_size = int(cfg.get('img_size', 256))
num_classes = int(train_df['y'].max()+1)
model_inf = timm.create_model(model_name_ckpt, pretrained=False, num_classes=num_classes).to(device).to(memory_format=torch.channels_last)
model_inf.load_state_dict(ckpt['model'], strict=True)
model_inf.eval()

# Test transforms (timm create_transform with bicubic, center crop)
mean = (0.485, 0.456, 0.406); std = (0.229, 0.224, 0.225)
test_tfms = create_transform(input_size=img_size, is_training=False, interpolation='bicubic', mean=mean, std=std)

class TestDS(Dataset):
    def __init__(self, df, tfm):
        self.df = df.reset_index(drop=True)
        self.tfm = tfm
    def __len__(self): return len(self.df)
    def __getitem__(self, i):
        p = self.df.at[i, 'path']
        try:
            img = Image.open(p).convert('RGB')
        except Exception:
            img = Image.new('RGB', (img_size, img_size), (0,0,0))
        x = self.tfm(img)
        return x, i

test_ds = TestDS(test_df, test_tfms)
bs_inf = 128
num_workers = 10
test_dl = DataLoader(test_ds, batch_size=bs_inf, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=True)

# Inference with 2x TTA (orig + hflip), average logits; save logits fp16 and submission
Nte = len(test_ds)
logits_fp16 = np.memmap('logits_test_fp16.mmap', dtype='float16', mode='w+', shape=(Nte, num_classes))
t0 = time.time(); seen = 0
with torch.no_grad(), torch.amp.autocast('cuda', enabled=(device=='cuda')):
    for xb, idx in test_dl:
        xb = xb.to(device, non_blocking=True).to(memory_format=torch.channels_last)
        lo = model_inf(xb)
        xf = torch.flip(xb, dims=[-1])
        lf = model_inf(xf)
        l = 0.5 * (lo + lf)
        l = l.float()
        logits_fp16[idx.numpy()] = l.cpu().numpy().astype('float16')
        seen += xb.size(0)
        if seen % (bs_inf*10) == 0:
            print(f'Infer {seen}/{Nte} (elapsed {time.time()-t0:.1f}s)', flush=True)
logits_fp16.flush()  # make sure every row is on disk before re-opening read-only
del logits_fp16
np.save('logits_test_fp16.npy', np.memmap('logits_test_fp16.mmap', dtype='float16', mode='r', shape=(Nte, num_classes)))
try: os.remove('logits_test_fp16.mmap')
except OSError: pass
print('Saved logits_test_fp16.npy')

# Build submission (argmax)
logits = np.load('logits_test_fp16.npy').astype('float32')
preds = logits.argmax(axis=1).astype('int32')
pred_cat = [idx2cat_id[int(i)] for i in preds]
sub = pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat})
sub.to_csv('submission_classifier.csv', index=False)
print('Wrote submission_classifier.csv')
print('Inference total elapsed: %.1f s' % (time.time()-t0))

Best checkpoint: checkpoints/cnv2x256_convnextv2_tiny_fold0_ema_best.pth


Infer 1280/174052 (elapsed 7.7s)


Infer 2560/174052 (elapsed 11.5s)


Infer 3840/174052 (elapsed 15.4s)


Infer 5120/174052 (elapsed 19.2s)


Infer 6400/174052 (elapsed 23.0s)


Infer 7680/174052 (elapsed 26.8s)


Infer 8960/174052 (elapsed 30.6s)


Infer 10240/174052 (elapsed 34.4s)


Infer 11520/174052 (elapsed 38.3s)


Infer 12800/174052 (elapsed 42.1s)


Infer 14080/174052 (elapsed 45.9s)


Infer 15360/174052 (elapsed 49.7s)


Infer 16640/174052 (elapsed 53.6s)


Infer 17920/174052 (elapsed 57.4s)


Infer 19200/174052 (elapsed 61.2s)


Infer 20480/174052 (elapsed 65.0s)


Infer 21760/174052 (elapsed 68.9s)


Infer 23040/174052 (elapsed 72.7s)


Infer 24320/174052 (elapsed 76.5s)


Infer 25600/174052 (elapsed 80.3s)


Infer 26880/174052 (elapsed 84.2s)


Infer 28160/174052 (elapsed 88.0s)


Infer 29440/174052 (elapsed 91.8s)


Infer 30720/174052 (elapsed 95.7s)


Infer 32000/174052 (elapsed 99.5s)


Infer 33280/174052 (elapsed 103.3s)


Infer 34560/174052 (elapsed 107.1s)


Infer 35840/174052 (elapsed 111.0s)


Infer 37120/174052 (elapsed 114.8s)


Infer 38400/174052 (elapsed 118.6s)


Infer 39680/174052 (elapsed 122.5s)


Infer 40960/174052 (elapsed 126.3s)


Infer 42240/174052 (elapsed 130.1s)


Infer 43520/174052 (elapsed 134.0s)


Infer 44800/174052 (elapsed 137.8s)


Infer 46080/174052 (elapsed 141.6s)


Infer 47360/174052 (elapsed 145.5s)


Infer 48640/174052 (elapsed 149.3s)


Infer 49920/174052 (elapsed 153.1s)


Infer 51200/174052 (elapsed 157.0s)


Infer 52480/174052 (elapsed 160.8s)


Infer 53760/174052 (elapsed 164.7s)


Infer 55040/174052 (elapsed 168.5s)


Infer 56320/174052 (elapsed 172.3s)


Infer 57600/174052 (elapsed 176.2s)


Infer 58880/174052 (elapsed 180.0s)


Infer 60160/174052 (elapsed 183.8s)


Infer 61440/174052 (elapsed 187.7s)


Infer 62720/174052 (elapsed 191.5s)


Infer 64000/174052 (elapsed 195.4s)


Infer 65280/174052 (elapsed 199.2s)


Infer 66560/174052 (elapsed 203.0s)


Infer 67840/174052 (elapsed 206.9s)


Infer 69120/174052 (elapsed 210.7s)


Infer 70400/174052 (elapsed 214.6s)


Infer 71680/174052 (elapsed 218.4s)


Infer 72960/174052 (elapsed 222.6s)


Infer 74240/174052 (elapsed 226.4s)


Infer 75520/174052 (elapsed 230.2s)


Infer 76800/174052 (elapsed 234.1s)


Infer 78080/174052 (elapsed 237.9s)


Infer 79360/174052 (elapsed 241.8s)


Infer 80640/174052 (elapsed 245.6s)


Infer 81920/174052 (elapsed 249.4s)


Infer 83200/174052 (elapsed 253.3s)


Infer 84480/174052 (elapsed 257.1s)


Infer 85760/174052 (elapsed 260.9s)


Infer 87040/174052 (elapsed 264.8s)


Infer 88320/174052 (elapsed 268.6s)


Infer 89600/174052 (elapsed 272.5s)


Infer 90880/174052 (elapsed 276.3s)


Infer 92160/174052 (elapsed 280.1s)


Infer 93440/174052 (elapsed 284.2s)


Infer 94720/174052 (elapsed 288.0s)


Infer 96000/174052 (elapsed 291.9s)


Infer 97280/174052 (elapsed 295.7s)


Infer 98560/174052 (elapsed 299.5s)


Infer 99840/174052 (elapsed 303.4s)


Infer 101120/174052 (elapsed 307.2s)


Infer 102400/174052 (elapsed 311.0s)


Infer 103680/174052 (elapsed 314.9s)


Infer 104960/174052 (elapsed 318.7s)


Infer 106240/174052 (elapsed 322.6s)


Infer 107520/174052 (elapsed 326.4s)


Infer 108800/174052 (elapsed 330.2s)


Infer 110080/174052 (elapsed 334.1s)


Infer 111360/174052 (elapsed 337.9s)


Infer 112640/174052 (elapsed 341.7s)


Infer 113920/174052 (elapsed 345.6s)


Infer 115200/174052 (elapsed 349.4s)


Infer 116480/174052 (elapsed 353.3s)


Infer 117760/174052 (elapsed 357.1s)


Infer 119040/174052 (elapsed 360.9s)


Infer 120320/174052 (elapsed 364.8s)


Infer 121600/174052 (elapsed 368.6s)


Infer 122880/174052 (elapsed 372.4s)


Infer 124160/174052 (elapsed 376.3s)


Infer 125440/174052 (elapsed 380.1s)


Infer 126720/174052 (elapsed 383.9s)


Infer 128000/174052 (elapsed 387.7s)


Infer 129280/174052 (elapsed 391.6s)


Infer 130560/174052 (elapsed 395.4s)


Infer 131840/174052 (elapsed 399.2s)


Infer 133120/174052 (elapsed 403.1s)


Infer 134400/174052 (elapsed 406.9s)


Infer 135680/174052 (elapsed 410.7s)


Infer 136960/174052 (elapsed 414.6s)


Infer 138240/174052 (elapsed 418.4s)


Infer 139520/174052 (elapsed 422.2s)


Infer 140800/174052 (elapsed 426.0s)


Infer 142080/174052 (elapsed 429.9s)


Infer 143360/174052 (elapsed 433.7s)


Infer 144640/174052 (elapsed 437.5s)


Infer 145920/174052 (elapsed 441.3s)


Infer 147200/174052 (elapsed 445.2s)


Infer 148480/174052 (elapsed 449.0s)


Infer 149760/174052 (elapsed 452.8s)


Infer 151040/174052 (elapsed 456.7s)


Infer 152320/174052 (elapsed 460.5s)


Infer 153600/174052 (elapsed 464.3s)


Infer 154880/174052 (elapsed 468.1s)


Infer 156160/174052 (elapsed 472.0s)


Infer 157440/174052 (elapsed 475.8s)


Infer 158720/174052 (elapsed 479.6s)


Infer 160000/174052 (elapsed 483.5s)


Infer 161280/174052 (elapsed 487.3s)


Infer 162560/174052 (elapsed 491.1s)


Infer 163840/174052 (elapsed 495.0s)


Infer 165120/174052 (elapsed 498.8s)


Infer 166400/174052 (elapsed 502.6s)


Infer 167680/174052 (elapsed 506.5s)


Infer 168960/174052 (elapsed 510.3s)


Infer 170240/174052 (elapsed 514.1s)


Infer 171520/174052 (elapsed 518.0s)


Infer 172800/174052 (elapsed 521.8s)


Saved logits_test_fp16.npy


Wrote submission_classifier.csv
Inference total elapsed: 554.6 s
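
The inference cell streams logits into a disk-backed `np.memmap` so that the full test-by-class matrix never has to sit in RAM while batches arrive. The pattern can be sketched on toy data as follows; `write_logits_in_batches` is a hypothetical helper, and the shapes and file name are made up for illustration.

```python
import os
import tempfile
import numpy as np

def write_logits_in_batches(batches, n_rows, n_cols, path):
    """Fill an on-disk float16 matrix batch by batch, then return it as an in-memory array."""
    mm = np.memmap(path, dtype=np.float16, mode="w+", shape=(n_rows, n_cols))
    for row_idx, block in batches:
        mm[row_idx] = block.astype(np.float16)  # rows may arrive in any order
    mm.flush()                                   # push pending writes to disk
    del mm                                       # close the writable mapping
    ro = np.memmap(path, dtype=np.float16, mode="r", shape=(n_rows, n_cols))
    out = np.array(ro)                           # copy into RAM before the file is removed
    del ro
    return out

rng = np.random.default_rng(0)
full = rng.normal(size=(10, 4)).astype(np.float32)
batches = [(np.arange(s, min(s + 3, 10)), full[s:s + 3]) for s in range(0, 10, 3)]

with tempfile.TemporaryDirectory() as tmp:
    logits = write_logits_in_batches(batches, 10, 4, os.path.join(tmp, "toy_logits.mmap"))

print(logits.shape, logits.dtype)
```

Storage is `n_rows * n_cols * 2` bytes in float16. The rounding is usually harmless for argmax but can matter when logits from several models are later summed, so casting back to float32 before blending (as the notebook does) is a reasonable precaution.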


In [10]:
# Proto logits (TTA embeddings) + optional blend with classifier logits
import os, time, numpy as np, pandas as pd, torch

def pick_emb(path_tta, path_base):
    return path_tta if os.path.exists(path_tta) else path_base

train_emb_path = pick_emb('embeddings/train_vitb32_tta2.npy','embeddings/train_vitb32.npy')
test_emb_path  = pick_emb('embeddings/test_vitb32_tta2.npy', 'embeddings/test_vitb32.npy')
print('Using embeddings:', train_emb_path, ' / ', test_emb_path, flush=True)

# Load features and build prototypes
E_tr = np.load(train_emb_path).astype('float32')
E_te = np.load(test_emb_path).astype('float32')
y_tr = train_df['y'].to_numpy()
num_classes = int(y_tr.max()+1); D = E_tr.shape[1]
freq = np.bincount(y_tr, minlength=num_classes).astype(np.float32)
P_sum = np.zeros((num_classes, D), dtype=np.float32)
for c in range(num_classes):
    idx = np.where(y_tr==c)[0]
    if idx.size: P_sum[c] = E_tr[idx].mean(axis=0)
P = P_sum / (np.linalg.norm(P_sum, axis=1, keepdims=True) + 1e-12); del P_sum
alpha = 0.5
log_pi = np.log((freq / max(1, int(freq.sum()))) + 1e-12).astype('float32')
proto_scale = 12.0

# Compute proto logits for test (chunked, GPU if available)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
P_t = torch.from_numpy(P).to(device=device, dtype=torch.float16); del P
log_pi_t = torch.from_numpy(log_pi).to(device=device, dtype=torch.float16)
Nte = E_te.shape[0]; bs = 2048
proto_mmap = np.memmap('proto_logits_fp16.mmap', dtype='float16', mode='w+', shape=(Nte, num_classes))
t0 = time.time(); done = 0
with torch.no_grad():
    s = 0
    while s < Nte:
        e = min(s+bs, Nte)
        X = torch.from_numpy(E_te[s:e]).to(device=device, dtype=torch.float16, non_blocking=True)
        logits = (X @ P_t.T) * proto_scale
        logits = logits - alpha * log_pi_t
        proto_mmap[s:e] = logits.cpu().numpy().astype('float16')
        s = e; done = s
        if (done//bs) % 10 == 0:
            print(f'proto logits {done}/{Nte} (elapsed {time.time()-t0:.1f}s)', flush=True)
del P_t, log_pi_t
proto_mmap.flush()  # make sure every row is on disk before re-opening read-only
proto_logits = np.memmap('proto_logits_fp16.mmap', dtype='float16', mode='r', shape=(Nte, num_classes))
np.save('proto_logits_fp16.npy', proto_logits)
try: os.remove('proto_logits_fp16.mmap')
except OSError: pass
print('Saved proto_logits_fp16.npy')

# Backup: write a proto-only submission for reference
preds_proto = np.load('proto_logits_fp16.npy').astype('float32').argmax(axis=1).astype('int32')
pred_cat_proto = [idx2cat_id[int(i)] for i in preds_proto]
pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat_proto}).to_csv('submission_proto.csv', index=False)
print('Wrote submission_proto.csv')

# Optional blend: if classifier logits exist, blend 0.8 classifier + 0.2 proto and write submission_blend.csv
if os.path.exists('logits_test_fp16.npy'):
    clf_logits = np.load('logits_test_fp16.npy').astype('float32')
    prt_logits = np.load('proto_logits_fp16.npy').astype('float32')
    assert clf_logits.shape == prt_logits.shape, (clf_logits.shape, prt_logits.shape)
    logits_blend = 0.8*clf_logits + 0.2*prt_logits
    preds = logits_blend.argmax(axis=1).astype('int32')
    pred_cat = [idx2cat_id[int(i)] for i in preds]
    pd.DataFrame({'Id': np.arange(len(test_df), dtype=np.int32), 'Predicted': pred_cat}).to_csv('submission_blend.csv', index=False)
    print('Wrote submission_blend.csv')
else:
    print('Classifier logits not found; blend skipped. Run the classifier inference cell first, then re-run this cell.')

Using embeddings: embeddings/train_vitb32_tta2.npy  /  embeddings/test_vitb32_tta2.npy


proto logits 20480/174052 (elapsed 0.8s)


proto logits 40960/174052 (elapsed 1.5s)


proto logits 61440/174052 (elapsed 2.2s)


proto logits 81920/174052 (elapsed 3.0s)


proto logits 102400/174052 (elapsed 3.7s)


proto logits 122880/174052 (elapsed 4.4s)


proto logits 143360/174052 (elapsed 5.2s)


proto logits 163840/174052 (elapsed 5.9s)


Saved proto_logits_fp16.npy


Wrote submission_proto.csv


Wrote submission_blend.csv
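
The 0.8/0.2 blend above is a fixed choice. If held-out logits for both branches were available (the cells shown here do not save them, so this is an assumption), the weight could instead be picked by macro-F1 on the validation fold. A minimal NumPy sketch with hypothetical helper names and toy data:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 over all n_classes (absent classes score 0)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def sweep_blend_weight(clf_logits, proto_logits, y_true, weights):
    """Score w * clf + (1 - w) * proto for each w; return the best w and all scores."""
    n_classes = clf_logits.shape[1]
    scores = {}
    for w in weights:
        pred = (w * clf_logits + (1.0 - w) * proto_logits).argmax(axis=1)
        scores[float(w)] = macro_f1(y_true, pred, n_classes)
    best_w = max(scores, key=scores.get)  # first weight reaching the top score
    return best_w, scores

# Toy validation set: the classifier is right everywhere, the prototypes are wrong everywhere.
y_val = np.array([0, 1, 2])
clf_val = 2.0 * np.eye(3)
proto_val = np.eye(3)[[1, 2, 0]]
best_w, scores = sweep_blend_weight(clf_val, proto_val, y_val, weights=[0.0, 0.5, 1.0])
print(best_w, scores)
```

Note that this `macro_f1` averages over every class index, whereas `sklearn.metrics.f1_score(average='macro')` by default averages only over labels present in the targets or predictions. The two can therefore differ on folds that miss some classes.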


In [11]:
# Copy blended submission to submission.csv
import os, pandas as pd
src = 'submission_blend.csv'
dst = 'submission.csv'
assert os.path.exists(src), f"{src} not found"
df = pd.read_csv(src)
assert list(df.columns)==['Id','Predicted'], 'Invalid submission columns'
df.to_csv(dst, index=False)
print('Wrote', dst, 'rows:', len(df))
print(df.head())

Wrote submission.csv rows: 174052
   Id  Predicted
0   0       9493
1   1       5344
2   2       3819
3   3       2799
4   4       6234
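
The copy cell only asserts the column names, while the plan also calls for checking that shape and ids align with `sample_submission.csv`. A stdlib-only sketch of such a check follows. `check_submission` is a hypothetical helper, the CSV strings are toy stand-ins, and it assumes the sample file uses the same `Id`/`Predicted` columns.

```python
import csv
import io

def check_submission(sub_csv, sample_csv, id_col="Id", pred_col="Predicted"):
    """Return a list of problems; an empty list means the two files line up."""
    sub = csv.DictReader(io.StringIO(sub_csv))
    sample = csv.DictReader(io.StringIO(sample_csv))
    expected = [id_col, pred_col]
    if sub.fieldnames != expected or sample.fieldnames != expected:
        return [f"unexpected columns: submission={sub.fieldnames}, sample={sample.fieldnames}"]
    sub_rows, sample_rows = list(sub), list(sample)
    problems = []
    if len(sub_rows) != len(sample_rows):
        problems.append(f"row count {len(sub_rows)} != {len(sample_rows)}")
    sub_ids = [r[id_col] for r in sub_rows]
    if sub_ids != [r[id_col] for r in sample_rows]:
        problems.append("Id column differs from sample (values or order)")
    if len(set(sub_ids)) != len(sub_ids):
        problems.append("duplicate Ids")
    if any(not (r[pred_col] or "").strip() for r in sub_rows):
        problems.append("empty predictions")
    return problems

sample_text = "Id,Predicted\n0,0\n1,0\n2,0\n"
good_text = "Id,Predicted\n0,9493\n1,5344\n2,3819\n"
bad_text = "Id,Predicted\n0,9493\n2,3819\n"   # one row missing

print(check_submission(good_text, sample_text))
print(check_submission(bad_text, sample_text))
```

With real files the two arguments would be the file contents, for example `Path('submission.csv').read_text()`.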
