# Plan to WIN A MEDAL – Cdiscount Image Classification

Objectives:
- Build a robust, GPU-accelerated image classification pipeline for 5k+ classes
- Establish trustworthy CV and fast baselines; iterate to strong accuracy
- Produce a valid submission.csv ASAP, then improve

Workflow:
1) Environment & GPU
- Verify GPU availability (nvidia-smi), install PyTorch CUDA 12.1 stack if needed
- Create constraints to lock torch versions; avoid drift

2) Data understanding
- Inspect files: train.bson, test.bson, train_example.bson, category_names.csv, sample_submission.csv
- Determine schema: BSON contains products, each with multiple images and category_id (train only)
- Map category_id <-> class index; confirm number of classes

3) Reader & Dataset
- Implement BSON streaming reader (avoid loading entire file in memory)
- For train: expand to (product_id, image_index) samples with label; for test: (product_id, image_index)
- Use JPEG decode from bytes (PIL) and standard transforms; cache occasional decoded bytes if RAM allows

4) CV Protocol
- StratifiedGroupKFold by product_id grouped, stratified on category_id
- Single deterministic split first (e.g., 90/10) to iterate fast; later move to 5-fold if time
- Aggregate per-product predictions by mean of image logits

5) Baseline Model (fast)
- Use torchvision/timm pretrained backbone at 180x180: e.g., efficientnet_b0, convnext_tiny, or resnet50
- Replace head with num_classes; use label-smoothing CE
- Optim: AdamW, cosine schedule, warmup; amp autocast; EMA head optional
- Data aug: RandomResizedCrop(180), HFlip; test-time: center-crop only

6) Training Loop
- Mixed precision, gradient accumulation if needed; log every N steps with elapsed time
- Early stopping by val accuracy; save best weights per fold

7) Inference
- Predict per-image, aggregate per-product (mean logits), argmax category_id
- Write submission.csv with correct columns

8) Iterations to Medal
- Resolution sweep: 224, 256 short runs
- Stronger backbones: convnext_tiny/base, efficientnet_b3, vit_tiny/base (if time)
- Class-balanced sampling / reweighting; focal loss trial
- Ensembling: average logits from 2–3 diverse seeds/backbones

9) Validation Rigor
- Save folds, OOF logits; compute accuracy by product and by image
- Inspect per-class accuracy and top-k errors; adjust sampling/augmentations

10) Time Management
- Start with train_example.bson to validate pipeline and submission
- Scale to a subset of train (e.g., 200k products) for fast baseline
- Only then run full training; keep logs; consider interrupting long runs based on expert advice

In [None]:
# Install BSON dependency (provided by pymongo)
import sys, subprocess
def pip(*args):
    print('> pip', *args, flush=True)
    return subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

pip('install', 'pymongo')
print('pymongo installed for bson import')

In [None]:
# Environment & GPU check + Torch cu121 install, quick data peek
import os, sys, shutil, subprocess, time, json, bson, struct, io
from pathlib import Path
import pandas as pd

def run(cmd):
    print('>',' '.join(cmd), flush=True)
    return subprocess.run(cmd, check=False, text=True, capture_output=True)

print('=== NVIDIA SMI ===', flush=True)
print(run(['bash','-lc','nvidia-smi || true']).stdout)

# Clean any preinstalled torch stacks to avoid CUDA mismatch
for pkg in ("torch","torchvision","torchaudio"):
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", pkg], check=False)

# Remove possible shadow dirs
for d in (
    "/app/.pip-target/torch",
    "/app/.pip-target/torchvision",
    "/app/.pip-target/torchaudio",
    "/app/.pip-target/torch-2.4.1.dist-info",
    "/app/.pip-target/torchvision-0.19.1.dist-info",
    "/app/.pip-target/torchaudio-2.4.1.dist-info",
    "/app/.pip-target/torchgen",
    "/app/.pip-target/functorch",
):
    if os.path.exists(d):
        print('Removing', d, flush=True)
        shutil.rmtree(d, ignore_errors=True)

def pip(*args):
    print('> pip', *args, flush=True)
    return subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Install exact cu121 torch stack
pip('install',
    '--index-url','https://download.pytorch.org/whl/cu121',
    '--extra-index-url','https://pypi.org/simple',
    'torch==2.4.1','torchvision==0.19.1','torchaudio==2.4.1')

# Freeze torch versions for later installs
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

import torch
print('torch:', torch.__version__, 'CUDA build:', getattr(torch.version,'cuda',None))
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f"Wrong CUDA build: {torch.version.cuda}"
assert torch.cuda.is_available(), 'CUDA not available'

print('\n=== Repo contents ===')
for p in sorted(Path('.').glob('*')):
    try:
        sz = p.stat().st_size
    except Exception:
        sz = -1
    print(f"{p.name}\t{sz/1e6:.1f} MB")

# Quick CSV peeks
if Path('sample_submission.csv').exists():
    ss = pd.read_csv('sample_submission.csv')
    print('\nSample submission shape:', ss.shape)
    print(ss.head())
if Path('category_names.csv').exists():
    cat = pd.read_csv('category_names.csv')
    print('\nCategory names shape:', cat.shape)
    print(cat.head())

# Peek a few BSON docs from train_example to confirm schema
def peek_bson(path, n=3):
    print(f'\nPeeking {n} docs from {path}...')
    c = 0
    with open(path, 'rb') as f:
        while c < n:
            len_bytes = f.read(4)
            if not len_bytes or len(len_bytes) < 4:
                break
            (doc_len,) = struct.unpack('<i', len_bytes)
            rest = f.read(doc_len - 4)
            if len(rest) < doc_len - 4:
                break
            try:
                d = bson.BSON(len_bytes + rest).decode()
            except Exception as e:
                print('Decode error:', e)
                continue
            keys = list(d.keys())
            print('Doc keys:', keys)
            print({k: type(d[k]).__name__ for k in keys})
            if '_id' in d:
                print('product_id:', d['_id'])
            if 'category_id' in d:
                print('category_id:', d['category_id'])
            if 'imgs' in d:
                print('n_imgs:', len(d['imgs']))
            c += 1

if Path('train_example.bson').exists():
    peek_bson('train_example.bson', n=3)
elif Path('train.bson').exists():
    # Fallback to full train if example is absent (will be slow, we only read 1-3 docs)
    peek_bson('train.bson', n=3)

print('\nSETUP COMPLETE', flush=True)

In [None]:
# BSON indexing utilities: build product-level index and category mappings
import os, struct, json, io
from pathlib import Path
import pandas as pd

def build_bson_index(bson_path, out_index_csv, is_train=True, max_docs=None, log_every=100000):
    rows = []
    bson_path = Path(bson_path)
    total = bson_path.stat().st_size
    with open(bson_path, 'rb') as f:
        ofs = 0
        i = 0
        while True:
            len_bytes = f.read(4)
            if not len_bytes or len(len_bytes) < 4:
                break
            (doc_len,) = struct.unpack('<i', len_bytes)
            rest = f.read(doc_len - 4)
            if len(rest) < doc_len - 4:
                break
            try:
                import bson as _bson
                d = _bson.BSON(len_bytes + rest).decode()
            except Exception:
                # skip corrupted doc; advance offset anyway
                ofs += doc_len
                i += 1
                continue
            prod_id = d.get('_id')
            imgs = d.get('imgs', [])
            n_imgs = len(imgs) if isinstance(imgs, list) else 0
            cat_id = d.get('category_id') if is_train else None
            rows.append({
                'offset': ofs,
                '_id': prod_id,
                'n_imgs': n_imgs,
                'category_id': cat_id if is_train else pd.NA,
                'doc_len': doc_len,
            })
            ofs += doc_len
            i += 1
            if log_every and (i % log_every == 0):
                print(f"Indexed {i:,} docs, pos {ofs/1e9:.3f} GB / {total/1e9:.3f} GB", flush=True)
            if max_docs is not None and i >= max_docs:
                break
    df = pd.DataFrame(rows)
    df.to_csv(out_index_csv, index=False)
    print(f"Saved index: {out_index_csv} with {len(df):,} rows", flush=True)
    return df

def build_category_mapping_from_index(train_index_csv, out_map_json):
    df = pd.read_csv(train_index_csv)
    cats = sorted(df['category_id'].dropna().astype(int).unique().tolist())
    cat2idx = {int(c): i for i, c in enumerate(cats)}
    idx2cat = {i: int(c) for i, c in enumerate(cats)}
    payload = {'cat2idx': cat2idx, 'idx2cat': idx2cat, 'num_classes': len(cats)}
    Path(out_map_json).write_text(json.dumps(payload))
    print(f"Saved category mapping to {out_map_json} (num_classes={len(cats)})", flush=True)
    return payload

# Dry-run helpers (won't execute automatically):
print('Indexing utilities ready. Example usage:', flush=True)
print("- build_bson_index('train_example.bson', 'train_example_index.csv', is_train=True)", flush=True)
print("- build_bson_index('test.bson', 'test_index.csv', is_train=False, max_docs=50000)  # for smoke", flush=True)
print("- build_bson_index('train.bson', 'train_index.csv', is_train=True, log_every=200000)", flush=True)
print("- build_category_mapping_from_index('train_index.csv', 'category_mapping.json')", flush=True)

In [None]:
# Build tiny indices/mapping on train_example to validate pipeline
from pathlib import Path
import pandas as pd, json

if Path('train_example.bson').exists():
    df_idx = build_bson_index('train_example.bson', 'train_example_index.csv', is_train=True)
    print(df_idx.head())
    mapping = build_category_mapping_from_index('train_example_index.csv', 'category_mapping_example.json')
    print('Mapping keys:', list(mapping.keys()), 'num_classes:', mapping['num_classes'])
else:
    print('train_example.bson not found; skipping example index build')

In [None]:
# Build full train/test indices with explicit dtypes and save as parquet (optionally smoke-limit)
import pandas as pd, pyarrow as pa, pyarrow.parquet as pq, struct, os, json, time
from pathlib import Path
import bson as _bson

def build_index_parquet(bson_path: str, out_parquet: str, is_train: bool, max_docs=None, log_every=200000):
    t0 = time.time()
    rows = []
    p = Path(bson_path)
    total = p.stat().st_size
    with open(p, 'rb') as f:
        ofs = 0
        i = 0
        while True:
            len_bytes = f.read(4)
            if not len_bytes or len(len_bytes) < 4:
                break
            (doc_len,) = struct.unpack('<i', len_bytes)
            rest = f.read(doc_len - 4)
            if len(rest) < doc_len - 4:
                break
            try:
                d = _bson.BSON(len_bytes + rest).decode()
            except Exception:
                ofs += doc_len; i += 1
                continue
            prod_id = int(d.get('_id'))
            imgs = d.get('imgs', [])
            n_imgs = len(imgs) if isinstance(imgs, list) else 0
            if n_imgs <= 0:
                ofs += doc_len; i += 1
                continue
            cat_id = int(d['category_id']) if is_train else None
            rows.append((ofs, prod_id, n_imgs, cat_id if is_train else None, doc_len))
            ofs += doc_len
            i += 1
            if log_every and (i % log_every == 0):
                print(f"Indexed {i:,} docs, pos {ofs/1e9:.3f} GB / {total/1e9:.3f} GB, elapsed {time.time()-t0:.1f}s", flush=True)
            if max_docs is not None and i >= max_docs:
                break
    if not rows:
        print('No rows parsed; nothing to write.')
        return None
    # Build DataFrame with explicit dtypes
    df = pd.DataFrame(rows, columns=['offset','_id','n_imgs','category_id','doc_len'])
    dtypes = {
        'offset': 'int64',
        '_id': 'int64',
        'n_imgs': 'int16',
        'doc_len': 'int32'
    }
    if is_train:
        dtypes['category_id'] = 'Int64'
    else:
        dtypes['category_id'] = 'Int64'
        df['category_id'] = pd.NA
    df = df.astype(dtypes)
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, out_parquet, compression='zstd')
    print(f"Saved {len(df):,} rows to {out_parquet} in {time.time()-t0:.1f}s", flush=True)
    return out_parquet

def ensure_full_indices(smoke=False):
    # If smoke=True, limit docs for quick run
    train_out = 'train_index.parquet'
    test_out = 'test_index.parquet'
    if not Path(train_out).exists():
        build_index_parquet('train.bson', train_out, is_train=True, max_docs=(50000 if smoke else None))
    else:
        print(f'{train_out} exists, skipping.')
    if not Path(test_out).exists():
        build_index_parquet('test.bson', test_out, is_train=False, max_docs=(50000 if smoke else None))
    else:
        print(f'{test_out} exists, skipping.')
    # Build mapping if train index created
    if Path(train_out).exists():
        df_tr = pd.read_parquet(train_out, columns=['category_id'])
        cats = sorted(df_tr['category_id'].dropna().astype(int).unique().tolist())
        cat2idx = {int(c): i for i, c in enumerate(cats)}
        idx2cat = {i: int(c) for i, c in enumerate(cats)}
        payload = {'cat2idx': cat2idx, 'idx2cat': idx2cat, 'num_classes': len(cats)}
        Path('category_mapping.json').write_text(json.dumps(payload))
        print(f"Saved category_mapping.json (num_classes={len(cats)})", flush=True)

print('Full-index builder ready. Call ensure_full_indices(smoke=True) for a quick pass, or smoke=False for full run.', flush=True)

In [None]:
# Smoke-build indices (50k docs) for train/test to validate performance before full pass
import time
t0 = time.time()
print('Starting ensure_full_indices(smoke=True)...')
ensure_full_indices(smoke=True)
print(f'Done smoke indexing in {time.time()-t0:.1f}s')

In [3]:
# Dataset and DataLoader utilities for BSON with per-worker file handles
import os, io, random, struct
from pathlib import Path
import pandas as pd
from PIL import Image, ImageFile
import torch
from torch.utils.data import Dataset, DataLoader, Subset
import torchvision.transforms as T
import bson as _bson

ImageFile.LOAD_TRUNCATED_IMAGES = True

def load_mapping(path='category_mapping.json'):
    import json
    m = json.loads(Path(path).read_text())
    cat2idx = {int(k): int(v) for k,v in m['cat2idx'].items()}
    idx2cat = {int(k): int(v) for k,v in m['idx2cat'].items()}
    num_classes = int(m['num_classes'])
    return cat2idx, idx2cat, num_classes

def read_bson_doc_at(fh, offset):
    # fh is a file handle opened in 'rb'
    fh.seek(offset)
    len_bytes = fh.read(4)
    if not len_bytes or len(len_bytes) < 4:
        return None
    (doc_len,) = struct.unpack('<i', len_bytes)
    rest = fh.read(doc_len - 4)
    if len(rest) < doc_len - 4:
        return None
    try:
        d = _bson.BSON(len_bytes + rest).decode()
        return d
    except Exception:
        return None

class BSONProductTrain(Dataset):
    def __init__(self, index_df: pd.DataFrame, bson_path: str, cat2idx: dict, image_size=224):
        self.df = index_df.reset_index(drop=True)
        self.bson_path = str(bson_path)
        self.cat2idx = cat2idx
        self._fh = None  # per-worker handle, opened lazily
        # Strong but stable aug pipeline
        self.transform = T.Compose([
            T.RandomResizedCrop(image_size, scale=(0.5, 1.0), ratio=(0.75, 1.33)),
            T.RandomHorizontalFlip(0.5),
            T.ColorJitter(0.2, 0.2, 0.2, 0.05),
            T.ToTensor(),
            T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
            T.RandomErasing(p=0.18, scale=(0.02, 0.2), ratio=(0.3, 3.3)),
        ])

    def __len__(self):
        return len(self.df)

    def _ensure_fh(self):
        if self._fh is None:
            self._fh = open(self.bson_path, 'rb', buffering=0)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        self._ensure_fh()
        d = read_bson_doc_at(self._fh, int(row['offset']))
        # Always set label from index row's category_id mapping (do NOT force class 0 on failures)
        cat_id_row = int(row['category_id']) if pd.notna(row['category_id']) else None
        y = self.cat2idx.get(cat_id_row, 0) if cat_id_row is not None else 0
        if d is None:
            # return a dummy image but keep correct label
            img = Image.new('RGB', (256,256), color=(128,128,128))
            return self.transform(img), y
        imgs = d.get('imgs', [])
        # pick one random image
        choice = random.randrange(len(imgs)) if imgs else 0
        pic_bytes = imgs[choice]['picture'] if imgs else None
        try:
            img = Image.open(io.BytesIO(pic_bytes)).convert('RGB') if pic_bytes is not None else Image.new('RGB',(256,256))
        except Exception:
            img = Image.new('RGB',(256,256), color=(128,128,128))
        x = self.transform(img)
        return x, y

class BSONImageEval(Dataset):
    # Yields one image per item for val/test; aggregate per-product outside
    def __init__(self, index_df: pd.DataFrame, bson_path: str, include_label: bool, cat2idx: dict | None, image_size=224):
        # expand to per-image rows
        rows = []
        for _, r in index_df.iterrows():
            for k in range(int(r['n_imgs'])):
                rows.append((int(r['offset']), int(r['_id']), k, int(r['category_id']) if include_label and pd.notna(r['category_id']) else None))
        self.df = pd.DataFrame(rows, columns=['offset','_id','img_idx','category_id'])
        # Cheap IO win: process in offset order to improve locality
        self.df.sort_values('offset', inplace=True)
        self.bson_path = str(bson_path)
        self.include_label = include_label
        self.cat2idx = cat2idx
        self._fh = None
        self.transform = T.Compose([
            T.Resize(256),
            T.CenterCrop(image_size),
            T.ToTensor(),
            T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ])

    def __len__(self):
        return len(self.df)

    def _ensure_fh(self):
        if self._fh is None:
            self._fh = open(self.bson_path, 'rb', buffering=0)

    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        self._ensure_fh()
        d = read_bson_doc_at(self._fh, int(r['offset']))
        imgs = d.get('imgs', []) if d is not None else []
        pic_bytes = imgs[int(r['img_idx'])]['picture'] if d is not None and imgs else None
        try:
            img = Image.open(io.BytesIO(pic_bytes)).convert('RGB') if pic_bytes is not None else Image.new('RGB',(256,256))
        except Exception:
            img = Image.new('RGB',(256,256), color=(128,128,128))
        x = self.transform(img)
        if self.include_label:
            cat_id = int(r['category_id']) if r['category_id'] is not None and pd.notna(r['category_id']) else None
            y = self.cat2idx.get(cat_id, 0) if (self.cat2idx is not None and cat_id is not None) else -1
            return x, int(r['_id']), int(r['img_idx']), y
        else:
            return x, int(r['_id']), int(r['img_idx'])

def make_dataloaders_for_smoke(train_index_path='train_index.parquet', batch_size=64, num_workers=8):
    cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
    dft = pd.read_parquet(train_index_path)
    # small subset for smoke: 2k products
    dft = dft.sample(n=min(2000, len(dft)), random_state=42).reset_index(drop=True)
    # split 90/10 stratified by category at product level
    from sklearn.model_selection import StratifiedGroupKFold
    y = dft['category_id'].astype(int)
    g = dft['_id'].astype(int)
    skf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
    tr_idx, va_idx = next(skf.split(dft, y, g))
    df_tr = dft.iloc[tr_idx].reset_index(drop=True)
    df_va = dft.iloc[va_idx].reset_index(drop=True)
    ds_tr = BSONProductTrain(df_tr, 'train.bson', cat2idx, image_size=192)
    ds_va = BSONImageEval(df_va, 'train.bson', include_label=True, cat2idx=cat2idx, image_size=192)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=4, drop_last=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, persistent_workers=True, prefetch_factor=4)
    return dl_tr, dl_va, num_classes, (df_tr, df_va)

print('Dataset classes ready. Next: create loaders with make_dataloaders_for_smoke() and wire a minimal ConvNeXt-Tiny training loop.', flush=True)

Dataset classes ready. Next: create loaders with make_dataloaders_for_smoke() and wire a minimal ConvNeXt-Tiny training loop.


In [None]:
# Install timm, create smoke loaders, and sanity-check a couple batches
import sys, subprocess, time, torch
from pathlib import Path

def pip(*args):
    print('> pip', *args, flush=True)
    return subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Install timm honoring torch constraints but without touching torch stack
if not Path('installed_timm.flag').exists():
    pip('install', '-c', 'constraints.txt', 'timm==1.0.9', '--upgrade-strategy', 'only-if-needed', '--no-deps')
    Path('installed_timm.flag').write_text('ok')
    print('timm installed')
else:
    print('timm already installed')

torch.backends.cudnn.benchmark = True

dl_tr, dl_va, num_classes, (df_tr, df_va) = make_dataloaders_for_smoke(batch_size=128, num_workers=12)
print('num_classes:', num_classes, 'train products:', len(df_tr), 'val products:', len(df_va))

# Fetch a couple of train batches and move to GPU to validate shapes and throughput
t0 = time.time()
for bi, (x, y) in enumerate(dl_tr):
    x = x.cuda(non_blocking=True)
    y = y.cuda(non_blocking=True)
    if bi == 0:
        print('Train batch 0:', x.shape, y.shape, 'GPU mem (MB):', torch.cuda.memory_allocated()/1e6)
    if bi >= 2:
        break
print('Train fetch elapsed:', time.time()-t0, 's')

# Fetch a couple of val batches
t1 = time.time()
for bi, batch in enumerate(dl_va):
    # dl_va yields (x, _id, img_idx, y)
    x = batch[0].cuda(non_blocking=True)
    if bi == 0:
        print('Val batch 0 x-shape:', x.shape)
    if bi >= 2:
        break
print('Val fetch elapsed:', time.time()-t1, 's')

print('Smoke loaders OK')

In [None]:
# Minimal ConvNeXt-Tiny training loop (smoke) with product-level validation
import time, math, collections
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
import timm

def train_one_epoch(model, loader, optimizer, scaler, device, log_every=20):
    model.train()
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1).to(device)
    total, correct, running_loss = 0, 0, 0.0
    t0 = time.time()
    for i, (x, y) in enumerate(loader):
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with autocast():
            logits = model(x)
            loss = criterion(logits, y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        running_loss += loss.item() * x.size(0)
        preds = logits.argmax(dim=1)
        total += y.size(0)
        correct += (preds == y).sum().item()
        if (i+1) % log_every == 0:
            elapsed = time.time() - t0
            print(f"  iter {i+1}/{len(loader)}  loss={(running_loss/total):.4f}  acc={(correct/total):.4f}  elapsed={elapsed:.1f}s", flush=True)
    return running_loss/total, correct/total

def validate_product_level(model, loader, df_va, cat2idx, device):
    model.eval()
    # True label per product_id
    id2y = {int(r['_id']): cat2idx[int(r['category_id'])] for _, r in df_va.iterrows()}
    agg_logits = {}  # _id -> sum logits
    with torch.no_grad():
        for batch in loader:
            # batch: (x, _id, img_idx, y)
            x = batch[0].to(device, non_blocking=True)
            ids = batch[1].tolist()
            with autocast():
                logits = model(x).float().cpu()  # [B, C]
            for pid, logit in zip(ids, logits):
                if pid not in agg_logits:
                    agg_logits[pid] = logit.clone()
                else:
                    agg_logits[pid] += logit
    # Compute product-level accuracy
    correct = 0
    for pid, logit in agg_logits.items():
        pred = int(logit.argmax().item())
        if pid in id2y and pred == id2y[pid]:
            correct += 1
    total = len(id2y)
    acc = correct / max(1, total)
    return acc

def run_smoke_training(epochs=1, lr=1e-3, wd=0.05):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    dl_tr, dl_va, num_classes, (df_tr, df_va) = make_dataloaders_for_smoke(batch_size=128, num_workers=12)
    cat2idx, idx2cat, _ = load_mapping('category_mapping.json')
    model = timm.create_model('convnext_tiny', pretrained=True, num_classes=num_classes)
    model.to(device)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    scaler = GradScaler()
    best_acc = -1.0
    for ep in range(1, epochs+1):
        print(f"Epoch {ep}/{epochs}")
        tr_loss, tr_acc = train_one_epoch(model, dl_tr, optimizer, scaler, device, log_every=10)
        va_acc = validate_product_level(model, dl_va, df_va, cat2idx, device)
        print(f"  Train loss={tr_loss:.4f} acc={tr_acc:.4f} | Val product-acc={va_acc:.4f}", flush=True)
        if va_acc > best_acc:
            best_acc = va_acc
            torch.save({'model': model.state_dict(), 'val_acc': va_acc}, 'model_smoke_convnext_tiny.pt')
            print(f"  Saved checkpoint with val_acc={va_acc:.4f}")
    print(f"Best val product-acc: {best_acc:.4f}")
    return best_acc

print('Smoke training utilities ready. Call run_smoke_training(epochs=1) to validate end-to-end.')

In [None]:
# Install missing dependency and run smoke training for 1 epoch
import sys, subprocess, time, os
from pathlib import Path

def pip(*args):
    print('> pip', *args, flush=True)
    return subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Ensure huggingface_hub is available for timm pretrained weights without touching torch stack
if not Path('installed_hfhub.flag').exists():
    pip('install', '-c', 'constraints.txt', 'huggingface_hub==0.35.1', '--upgrade-strategy', 'only-if-needed', '--no-deps')
    Path('installed_hfhub.flag').write_text('ok')
    print('huggingface_hub installed')
else:
    print('huggingface_hub already installed')

t0 = time.time()
acc = run_smoke_training(epochs=1, lr=1e-3, wd=0.05)
print(f'Smoke training done in {time.time()-t0:.1f}s, val product-acc={acc:.4f}')

In [None]:
# Reload timm hub to pick up newly installed huggingface_hub, then run smoke training
import importlib, timm, time
import timm.models._hub as timm_hub
importlib.reload(timm_hub)
print('Reloaded timm.models._hub; has_hf_hub should be True now.')
t0 = time.time()
acc = run_smoke_training(epochs=1, lr=1e-3, wd=0.05)
print(f'Smoke training done in {time.time()-t0:.1f}s, val product-acc={acc:.4f}')

In [None]:
# Build full indices (train/test) and full mapping
import time
t0 = time.time()
print('Starting ensure_full_indices(smoke=False)...')
ensure_full_indices(smoke=False)
print(f'Done full indexing in {time.time()-t0:.1f}s')

In [None]:
# FORCE full reindex (overwrite smoke parquet) to get full class mapping (~5270)
from pathlib import Path
import pandas as pd, json
print('Removing existing parquets if present...')
for p in ['train_index.parquet','test_index.parquet']:
    if Path(p).exists():
        Path(p).unlink()
        print('Deleted', p)
print('Building full train index...')
build_index_parquet('train.bson', 'train_index.parquet', is_train=True, max_docs=None, log_every=200000)
print('Building full test index...')
build_index_parquet('test.bson', 'test_index.parquet', is_train=False, max_docs=None, log_every=200000)
print('Rebuilding full category mapping...')
df_tr = pd.read_parquet('train_index.parquet', columns=['category_id'])
cats = sorted(df_tr['category_id'].dropna().astype(int).unique().tolist())
cat2idx = {int(c): i for i, c in enumerate(cats)}
idx2cat = {i: int(c) for i, c in enumerate(cats)}
payload = {'cat2idx': cat2idx, 'idx2cat': idx2cat, 'num_classes': len(cats)}
Path('category_mapping.json').write_text(json.dumps(payload))
print('Full mapping saved. num_classes =', len(cats))

In [None]:
# Build 200k balanced subset (q-cap + 1/sqrt(freq) top-up) and persist SGKF 90/10 split
import math, numpy as np, pandas as pd
from pathlib import Path
from sklearn.model_selection import StratifiedGroupKFold

train_idx_path = 'train_index.parquet'
assert Path(train_idx_path).exists(), 'train_index.parquet not found'
df = pd.read_parquet(train_idx_path, columns=['_id','category_id','offset','n_imgs'])
df = df.dropna(subset=['category_id']).copy()
df['category_id'] = df['category_id'].astype(np.int64)

# Compute class freq
freq = df['category_id'].value_counts().rename_axis('category_id').reset_index(name='freq')
n_classes = freq.shape[0]
target_n = 200_000
q = int(math.ceil(target_n / n_classes))
print('Classes:', n_classes, 'target subset size:', target_n, 'q per class:', q)

# Per-class cap sample
df_grouped = df.groupby('category_id', group_keys=False)
def take_q(g):
    n = len(g)
    k = min(q, n) if n > 0 else 0
    if k == 0:
        return g.iloc[:0]
    return g.sample(n=k, random_state=42, replace=False)
cap_df = df_grouped.apply(take_q).reset_index(drop=True)
print('After cap sample:', len(cap_df))

# Top-up if needed with p(c) ∝ 1/sqrt(f_c) from remaining products
remaining = df.merge(cap_df[['_id']], on='_id', how='left', indicator=True)
remaining = remaining[remaining['_merge'] == 'left_only'].drop(columns=['_merge'])
if len(cap_df) < target_n and len(remaining) > 0:
    freq_map = freq.set_index('category_id')['freq'].to_dict()
    w = remaining['category_id'].map(lambda c: 1.0 / math.sqrt(freq_map.get(int(c), 1)))
    w = w / w.sum()
    need = target_n - len(cap_df)
    need = min(need, len(remaining))
    top_idx = remaining.sample(n=need, weights=w, random_state=42).index
    top_df = remaining.loc[top_idx]
    sub_df = pd.concat([cap_df, top_df], axis=0, ignore_index=True)
else:
    sub_df = cap_df
sub_df = sub_df.drop_duplicates('_id')
print('Final subset size:', len(sub_df))

# Persist subset product_ids
sub_ids = sub_df['_id'].astype(np.int64).tolist()
pd.Series(sub_ids, name='_id').to_csv('subset_200k_ids.csv', index=False)
print('Saved subset_200k_ids.csv')

# SGKF 90/10 split within subset
y = sub_df['category_id'].astype(np.int64).values
g = sub_df['_id'].astype(np.int64).values
skf = StratifiedGroupKFold(n_splits=10, shuffle=True, random_state=42)
tr_idx, va_idx = next(skf.split(sub_df, y, g))
train_ids = sub_df.iloc[tr_idx]['_id'].astype(np.int64).tolist()
val_ids = sub_df.iloc[va_idx]['_id'].astype(np.int64).tolist()
pd.Series(train_ids, name='_id').to_csv('train_ids.csv', index=False)
pd.Series(val_ids, name='_id').to_csv('val_ids.csv', index=False)
print('Saved train_ids.csv:', len(train_ids), 'val_ids.csv:', len(val_ids))

# Quick coverage check
print('Covered classes in subset:', sub_df['category_id'].nunique())
print('Covered classes in train split:', sub_df.iloc[tr_idx]['category_id'].nunique())
print('Covered classes in val split:', sub_df.iloc[va_idx]['category_id'].nunique())

In [1]:
# Train ConvNeXt-Tiny@192 on 200k subset (2-3 epochs) for quick LB check; then prepare inference util
import time, math, collections, io, numpy as np, pandas as pd, torch, torch.nn as nn, torch.optim as optim
from torch import amp
from torch.utils.data import DataLoader
import timm
from timm.utils import ModelEmaV2
from timm.scheduler.cosine_lr import CosineLRScheduler

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = True
torch.set_float32_matmul_precision('high')

def make_subset_loaders(batch_size=128, num_workers=16, image_size=192):
    # Load full mapping and index
    cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
    dfi = pd.read_parquet('train_index.parquet')
    train_ids = pd.read_csv('train_ids.csv')['_id'].astype(np.int64).tolist()
    val_ids = pd.read_csv('val_ids.csv')['_id'].astype(np.int64).tolist()
    df_tr = dfi[dfi['_id'].isin(train_ids)].reset_index(drop=True)
    df_va = dfi[dfi['_id'].isin(val_ids)].reset_index(drop=True)
    ds_tr = BSONProductTrain(df_tr, 'train.bson', cat2idx, image_size=image_size)
    ds_va = BSONImageEval(df_va, 'train.bson', include_label=True, cat2idx=cat2idx, image_size=image_size)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True,
                       persistent_workers=True, prefetch_factor=8, drop_last=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True,
                       persistent_workers=True, prefetch_factor=12)
    return dl_tr, dl_va, df_tr, df_va, num_classes, cat2idx, idx2cat

def compute_class_weights(df_tr, cat2idx):
    # w_c ∝ 1/sqrt(freq_c), normalized to mean=1
    freq = df_tr['category_id'].value_counts().to_dict()
    num_classes = len(cat2idx)
    w = np.ones(num_classes, dtype=np.float32)
    for cat, idx in cat2idx.items():
        f = freq.get(int(cat), 1)
        w[idx] = 1.0 / math.sqrt(float(f))
    w = w / (w.mean() + 1e-9)
    return torch.tensor(w, dtype=torch.float32)

def train_epoch(model, loader, optimizer, scaler, device, criterion, sched=None, accum_steps=2, log_every=200, ep_offset=0, model_ema=None, ep_offset_updates=0):
    model.train()
    total, correct, loss_sum = 0, 0, 0.0
    optimizer.zero_grad(set_to_none=True)
    t0 = time.time()
    upd_idx = 0
    for it, (x, y) in enumerate(loader):
        x = x.to(device, non_blocking=True).to(memory_format=torch.channels_last)
        y = y.to(device, non_blocking=True)
        with amp.autocast('cuda'):
            logits = model(x)
            loss = criterion(logits, y) / accum_steps
        scaler.scale(loss).backward()
        if (it + 1) % accum_steps == 0:
            # grad clip
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            if model_ema is not None:
                model_ema.update(model)
            # step scheduler per optimizer update (not per micro-iter)
            if sched is not None:
                upd_idx += 1
                sched.step_update(ep_offset_updates + upd_idx)
        preds = logits.detach().argmax(1)
        total += y.size(0)
        correct += (preds == y).sum().item()
        loss_sum += loss.item() * accum_steps * x.size(0)
        if (it + 1) % log_every == 0:
            print(f"  it {it+1}/{len(loader)} loss={loss_sum/total:.4f} acc={correct/total:.4f} elapsed={time.time()-t0:.1f}s", flush=True)
    return loss_sum/total, correct/total

def validate_products(model, loader, df_va, cat2idx, device):
    model.eval()
    id2y = {int(r['_id']): cat2idx[int(r['category_id'])] for _, r in df_va.iterrows()}
    agg = {}
    cnt = {}
    with torch.no_grad():
        for batch in loader:
            x = batch[0].to(device, non_blocking=True).to(memory_format=torch.channels_last)
            ids = batch[1].tolist()
            with amp.autocast('cuda'):
                logits = model(x).float().cpu()
            for pid, logit in zip(ids, logits):
                if pid not in agg:
                    agg[pid] = torch.zeros_like(logit)
                    cnt[pid] = 0
                agg[pid] += logit
                cnt[pid] += 1
    correct = 0
    for pid, logit in agg.items():
        m = logit / max(1, cnt.get(pid, 1))
        if int(pid) in id2y and int(m.argmax().item()) == id2y[int(pid)]:
            correct += 1
    return correct / max(1, len(id2y))

def train_convnext_tiny_subset(epochs=2, lr=1e-3, wd=0.05, image_size=192, batch_size=128, accum_steps=2):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    dl_tr, dl_va, df_tr, df_va, num_classes, cat2idx, idx2cat = make_subset_loaders(batch_size=batch_size, num_workers=16, image_size=image_size)
    model = timm.create_model('convnext_tiny', pretrained=True, num_classes=num_classes).to(device)
    model = model.to(memory_format=torch.channels_last)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=wd)
    # EMA
    model_ema = ModelEmaV2(model, decay=0.9997)
    # Class-weighted CE with smoothing
    cls_w = compute_class_weights(df_tr, cat2idx).to(device)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1, weight=cls_w)
    scaler = amp.GradScaler('cuda')
    # Scheduler: 1 epoch warmup + cosine to 1e-6, step per optimizer update
    steps_per_epoch = len(dl_tr)
    updates_per_epoch = int(math.ceil(steps_per_epoch / max(1, accum_steps)))
    total_updates = updates_per_epoch * epochs
    warmup_updates = updates_per_epoch  # 1 epoch warmup
    sched = CosineLRScheduler(
        optimizer, t_initial=max(1, total_updates - warmup_updates), lr_min=1e-6,
        warmup_t=warmup_updates, warmup_lr_init=1e-5, t_in_epochs=False
    )
    best_acc = -1.0
    for ep in range(1, epochs+1):
        print(f"Epoch {ep}/{epochs}")
        ep_offset = (ep - 1) * steps_per_epoch
        ep_offset_updates = (ep - 1) * updates_per_epoch
        tr_loss, tr_acc = train_epoch(model, dl_tr, optimizer, scaler, device, criterion, sched=sched, accum_steps=accum_steps, log_every=200, ep_offset=ep_offset, model_ema=model_ema, ep_offset_updates=ep_offset_updates)
        va_acc = validate_products(model_ema.module, dl_va, df_va, cat2idx, device)
        print(f"  train loss={tr_loss:.4f} acc={tr_acc:.4f} | val product-acc={va_acc:.4f}", flush=True)
        if va_acc > best_acc:
            best_acc = va_acc
            torch.save({'model': model_ema.module.state_dict(), 'val_acc': va_acc}, 'ckpt_tiny_subset.pt')
            print(f"  Saved ckpt_tiny_subset.pt (EMA, val_acc={va_acc:.4f})")
    return best_acc

print('Subset training utilities ready. Next: run acc = train_convnext_tiny_subset(epochs=2) and then implement test inference & submission.')

Subset training utilities ready. Next: run acc = train_convnext_tiny_subset(epochs=2) and then implement test inference & submission.


In [4]:
# Launch 2-epoch training on 200k subset (ConvNeXt-Tiny@192) with EMA + cosine schedule
import time
t0 = time.time()
acc = train_convnext_tiny_subset(epochs=2, lr=1e-3, wd=0.05, image_size=192, batch_size=128, accum_steps=2)
print(f'Training finished in {time.time()-t0:.1f}s, best val product-acc={acc:.5f}', flush=True)

Epoch 1/2


  it 200/1406 loss=8.6756 acc=0.0003 elapsed=36.7s


  it 400/1406 loss=8.5689 acc=0.0018 elapsed=71.2s


  it 600/1406 loss=8.3278 acc=0.0081 elapsed=105.9s


  it 800/1406 loss=8.0837 acc=0.0171 elapsed=140.7s


  it 1000/1406 loss=7.8805 acc=0.0257 elapsed=175.7s


  it 1200/1406 loss=7.7187 acc=0.0329 elapsed=210.7s


  it 1400/1406 loss=7.5911 acc=0.0393 elapsed=245.8s


  train loss=7.5874 acc=0.0396 | val product-acc=0.0020


  Saved ckpt_tiny_subset.pt (EMA, val_acc=0.0020)
Epoch 2/2


  it 200/1406 loss=6.3735 acc=0.1112 elapsed=35.6s


  it 400/1406 loss=6.3342 acc=0.1159 elapsed=70.8s


  it 600/1406 loss=6.3104 acc=0.1176 elapsed=106.0s


  it 800/1406 loss=6.2916 acc=0.1198 elapsed=141.3s


  it 1000/1406 loss=6.2759 acc=0.1209 elapsed=176.6s


  it 1200/1406 loss=6.2602 acc=0.1232 elapsed=211.9s


  it 1400/1406 loss=6.2481 acc=0.1248 elapsed=247.3s


  train loss=6.2467 acc=0.1249 | val product-acc=0.0573


  Saved ckpt_tiny_subset.pt (EMA, val_acc=0.0573)


Training finished in 539.5s, best val product-acc=0.05735


In [5]:
# Sanity checks: unseen val classes, label mismatches, class-weight extremes, quick val image-acc probe
import numpy as np, pandas as pd, torch

cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
dfi = pd.read_parquet('train_index.parquet')
train_ids = pd.read_csv('train_ids.csv')['_id'].astype(np.int64).tolist()
val_ids = pd.read_csv('val_ids.csv')['_id'].astype(np.int64).tolist()
df_tr = dfi[dfi['_id'].isin(train_ids)].reset_index(drop=True)
df_va = dfi[dfi['_id'].isin(val_ids)].reset_index(drop=True)

# 1) Unseen-in-train classes check
seen_tr = set(df_tr['category_id'].dropna().astype(int).unique().tolist())
seen_va = set(df_va['category_id'].dropna().astype(int).unique().tolist())
unseen = seen_va - seen_tr
print('Unseen val classes count:', len(unseen))
print('Val products affected:', df_va[df_va['category_id'].isin(list(unseen))].shape[0])

# 2) Build val loader and check label consistency on a few batches
ds_va = BSONImageEval(df_va, 'train.bson', include_label=True, cat2idx=cat2idx, image_size=192)
from torch.utils.data import DataLoader
dl_va = DataLoader(ds_va, batch_size=128, shuffle=False, num_workers=8, pin_memory=True, persistent_workers=False, prefetch_factor=8)
id2y = {int(r['_id']): cat2idx[int(r['category_id'])] for _, r in df_va.iterrows()}
mism = 0; checked = 0
for bi, batch in enumerate(dl_va):
    _, pids, _, ys = batch
    pids = pids.tolist(); ys = ys.tolist()
    for pid, y in zip(pids, ys):
        if id2y.get(int(pid), -999) != int(y):
            mism += 1
    checked += len(pids)
    if checked >= 6400:
        break
print('Val label mismatches (first ~5k items):', mism)

# 3) Class-weight extremes
from math import sqrt
freq = df_tr['category_id'].value_counts().to_dict()
w = np.ones(len(cat2idx), dtype=np.float32)
for cat, idx in cat2idx.items():
    w[idx] = 1.0 / sqrt(float(freq.get(int(cat), 1)))
w = w / (w.mean() + 1e-9)
print('Class weight stats: min/median/max =', float(w.min()), float(np.median(w)), float(w.max()))

# 4) Quick image-level val probe on EMA checkpoint (first ~100 batches)
ckpt = torch.load('ckpt_tiny_subset.pt', map_location='cpu') if Path('ckpt_tiny_subset.pt').exists() else None
img_acc = None
if ckpt is not None:
    import timm
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = timm.create_model('convnext_tiny', pretrained=False, num_classes=num_classes).to(device).to(memory_format=torch.channels_last)
    model.load_state_dict(ckpt['model'], strict=False)
    model.eval()
    correct = 0; total = 0
    with torch.no_grad():
        for bi, batch in enumerate(dl_va):
            x = batch[0].to(device, non_blocking=True).to(memory_format=torch.channels_last)
            y = [id2y[int(pid)] for pid in batch[1].tolist()]
            y = torch.tensor(y, device=device, dtype=torch.long)
            with torch.amp.autocast('cuda') if device.type=='cuda' else torch.no_grad():
                logits = model(x)
            pred = logits.argmax(1)
            correct += (pred == y).sum().item()
            total += y.size(0)
            if bi >= 100:
                break
    img_acc = correct / max(1, total)
print('Quick image-level val acc (~100 batches):', img_acc)

Unseen val classes count: 0
Val products affected: 0


Val label mismatches (first ~5k items): 0
Class weight stats: min/median/max = 0.6879196763038635 0.958866536617279 2.5369224548339844


  ckpt = torch.load('ckpt_tiny_subset.pt', map_location='cpu') if Path('ckpt_tiny_subset.pt').exists() else None


Quick image-level val acc (~100 batches): 0.04447710396039604


In [11]:
# Test inference with HFlip TTA, product-level aggregation, and submission.csv generation
import pandas as pd, numpy as np, torch
from pathlib import Path
import torch.multiprocessing as mp
mp.set_sharing_strategy('file_system')  # avoid /dev/shm exhaustion

def build_test_loader(image_size=192, batch_size=128, num_workers=0):
    dft = pd.read_parquet('test_index.parquet')
    ds_te = BSONImageEval(dft, 'test.bson', include_label=False, cat2idx=None, image_size=image_size)
    if num_workers <= 0:
        dl_te = DataLoader(ds_te, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=False)
    else:
        dl_te = DataLoader(ds_te, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=False,
                           persistent_workers=False, prefetch_factor=2)
    return dl_te, dft

def infer_and_submit(ckpt_path='ckpt_tiny_subset.pt', model_name='convnext_tiny', image_size=192, out_path='submission.csv'):
    assert Path(ckpt_path).exists(), f'Checkpoint not found: {ckpt_path}'
    cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    import timm
    model = timm.create_model(model_name, pretrained=False, num_classes=num_classes).to(device).to(memory_format=torch.channels_last)
    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model'], strict=False)
    model.eval()
    dl_te, dft = build_test_loader(image_size=image_size, batch_size=128, num_workers=0)
    agg = {}  # pid -> summed logits (torch tensor on CPU)
    cnt = {}  # pid -> count
    with torch.no_grad():
        for i, batch in enumerate(dl_te):
            x = batch[0].to(device, non_blocking=True).to(memory_format=torch.channels_last)
            pids = batch[1].tolist()
            with amp.autocast('cuda') if device.type=='cuda' else torch.no_grad():
                logits = model(x)
                logits_fl = model(x.flip(-1))
            out = (logits + logits_fl) * 0.5
            out = out.float().cpu()
            for pid, logit in zip(pids, out):
                if pid not in agg:
                    agg[pid] = logit.clone()
                    cnt[pid] = 1
                else:
                    agg[pid] += logit
                    cnt[pid] += 1
            if (i+1) % 200 == 0:
                print(f'  infer {i+1}/{len(dl_te)} batches', flush=True)
    # Build submission following sample_submission order
    ss = pd.read_csv('sample_submission.csv')
    preds = []
    missing = 0
    for pid in ss['_id'].astype(np.int64).tolist():
        if pid in agg:
            logit = agg[pid] / max(1, cnt[pid])
            cls = int(logit.argmax().item())
            cat = int(idx2cat[cls])
        else:
            missing += 1
            cat = int(idx2cat[0])
        preds.append(cat)
    if missing:
        print('Warning: missing products in aggregation:', missing)
    sub = pd.DataFrame({'_id': ss['_id'].astype(np.int64), 'category_id': preds})
    sub.to_csv(out_path, index=False)
    print('Wrote', out_path, 'rows:', len(sub))
    return out_path

# FAST path: one image per product (first image) to speed up inference for initial LB check
class BSONProductEvalOneImage(Dataset):
    def __init__(self, index_df: pd.DataFrame, bson_path: str, image_size=192):
        self.df = index_df[['offset','_id']].reset_index(drop=True).copy()
        self.bson_path = str(bson_path)
        self._fh = None
        self.transform = T.Compose([
            T.Resize(int(image_size*1.15)),
            T.CenterCrop(image_size),
            T.ToTensor(),
            T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ])
    def __len__(self):
        return len(self.df)
    def _ensure_fh(self):
        if self._fh is None:
            self._fh = open(self.bson_path, 'rb', buffering=0)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        self._ensure_fh()
        d = read_bson_doc_at(self._fh, int(r['offset']))
        imgs = d.get('imgs', []) if d is not None else []
        pic_bytes = imgs[0]['picture'] if d is not None and imgs else None
        try:
            img = Image.open(io.BytesIO(pic_bytes)).convert('RGB') if pic_bytes is not None else Image.new('RGB',(256,256))
        except Exception:
            img = Image.new('RGB',(256,256), color=(128,128,128))
        x = self.transform(img)
        return x, int(r['_id'])

def build_test_loader_fast_products(image_size=192, batch_size=256, num_workers=0):
    dft = pd.read_parquet('test_index.parquet')
    ds = BSONProductEvalOneImage(dft, 'test.bson', image_size=image_size)
    if num_workers <= 0:
        dl = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=False)
    else:
        dl = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=False,
                        persistent_workers=False, prefetch_factor=2)
    return dl, dft

def infer_and_submit_fast(ckpt_path='ckpt_tiny_subset.pt', model_name='convnext_tiny', image_size=192, out_path='submission.csv'):
    assert Path(ckpt_path).exists(), f'Checkpoint not found: {ckpt_path}'
    _, idx2cat, num_classes = load_mapping('category_mapping.json')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    import timm
    model = timm.create_model(model_name, pretrained=False, num_classes=num_classes).to(device).to(memory_format=torch.channels_last)
    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model'], strict=False)
    model.eval()
    dl, dft = build_test_loader_fast_products(image_size=image_size, batch_size=256, num_workers=0)
    preds_map = {}  # pid -> cls
    with torch.no_grad():
        for i, (x, pids) in enumerate(dl):
            x = x.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with amp.autocast('cuda') if device.type=='cuda' else torch.no_grad():
                logits = model(x)
                logits_fl = model(x.flip(-1))
            out = (logits + logits_fl) * 0.5
            cls = out.argmax(1).detach().cpu().tolist()
            for pid, c in zip(pids.tolist(), cls):
                preds_map[int(pid)] = int(c)
            if (i+1) % 200 == 0:
                print(f'  fast infer {i+1}/{len(dl)} batches', flush=True)
    ss = pd.read_csv('sample_submission.csv')
    preds = []
    missing = 0
    for pid in ss['_id'].astype(np.int64).tolist():
        if int(pid) in preds_map:
            cat = int(idx2cat[preds_map[int(pid)]])
        else:
            missing += 1
            cat = int(idx2cat[0])
        preds.append(cat)
    if missing:
        print('Warning: missing products in fast inference:', missing)
    sub = pd.DataFrame({'_id': ss['_id'].astype(np.int64), 'category_id': preds})
    sub.to_csv(out_path, index=False)
    print('Wrote', out_path, 'rows:', len(sub))
    return out_path

print('Inference utilities ready. Call infer_and_submit(...) for full per-image agg, or infer_and_submit_fast(...) for one-image-per-product baseline.')

Inference utilities ready. Call infer_and_submit(...) for full per-image agg, or infer_and_submit_fast(...) for one-image-per-product baseline.


In [12]:
# Generate submission quickly using fast one-image-per-product inference (HFlip TTA)
out_path = infer_and_submit_fast(ckpt_path='ckpt_tiny_subset.pt', model_name='convnext_tiny', image_size=192, out_path='submission.csv')
import pandas as pd
sub = pd.read_csv(out_path)
print('submission shape:', sub.shape)
print(sub.head())

  ckpt = torch.load(ckpt_path, map_location='cpu')


  fast infer 200/2762 batches


  fast infer 400/2762 batches


  fast infer 600/2762 batches


  fast infer 800/2762 batches


  fast infer 1000/2762 batches


  fast infer 1200/2762 batches


  fast infer 1400/2762 batches


  fast infer 1600/2762 batches


  fast infer 1800/2762 batches


  fast infer 2000/2762 batches


  fast infer 2200/2762 batches


  fast infer 2400/2762 batches


  fast infer 2600/2762 batches


Wrote submission.csv rows: 706990
submission shape: (706990, 2)
   _id  category_id
0    6   1000012546
1    7   1000012546
2   12   1000001844
3   59   1000019193
4   61   1000010633


In [13]:
# ConvNeXt-Base@224 training on 200k subset with EMA, cosine, early stopping
import math, time, numpy as np, pandas as pd, torch, torch.nn as nn, torch.optim as optim
from torch import amp
from torch.utils.data import DataLoader
import timm
from timm.utils import ModelEmaV2
from timm.scheduler.cosine_lr import CosineLRScheduler

def make_subset_loaders_base(batch_size=64, num_workers=16, image_size=224):
    cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
    dfi = pd.read_parquet('train_index.parquet')
    train_ids = pd.read_csv('train_ids.csv')['_id'].astype(np.int64).tolist()
    val_ids = pd.read_csv('val_ids.csv')['_id'].astype(np.int64).tolist()
    df_tr = dfi[dfi['_id'].isin(train_ids)].reset_index(drop=True)
    df_va = dfi[dfi['_id'].isin(val_ids)].reset_index(drop=True)
    ds_tr = BSONProductTrain(df_tr, 'train.bson', cat2idx, image_size=image_size)
    ds_va = BSONImageEval(df_va, 'train.bson', include_label=True, cat2idx=cat2idx, image_size=image_size)
    dl_tr = DataLoader(ds_tr, batch_size=batch_size, shuffle=True, num_workers=num_workers, pin_memory=True,
                       persistent_workers=True, prefetch_factor=8, drop_last=True)
    dl_va = DataLoader(ds_va, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True,
                       persistent_workers=True, prefetch_factor=12)
    return dl_tr, dl_va, df_tr, df_va, num_classes, cat2idx, idx2cat

def compute_class_weights_from_df(df_tr, cat2idx):
    import numpy as np, math as _m
    freq = df_tr['category_id'].value_counts().to_dict()
    w = np.ones(len(cat2idx), dtype=np.float32)
    for cat, idx in cat2idx.items():
        f = float(freq.get(int(cat), 1))
        w[idx] = 1.0 / _m.sqrt(max(f, 1.0))
    w = w / (w.mean() + 1e-9)
    return torch.tensor(w, dtype=torch.float32)

def validate_products_base(model, loader, df_va, cat2idx, device):
    model.eval()
    id2y = {int(r['_id']): cat2idx[int(r['category_id'])] for _, r in df_va.iterrows()}
    agg, cnt = {}, {}
    with torch.no_grad():
        for batch in loader:
            x = batch[0].to(device, non_blocking=True).to(memory_format=torch.channels_last)
            ids = batch[1].tolist()
            with amp.autocast('cuda') if device.type == 'cuda' else torch.no_grad():
                logits = model(x).float().cpu()
            for pid, logit in zip(ids, logits):
                if pid not in agg:
                    agg[pid] = logit.clone(); cnt[pid] = 1
                else:
                    agg[pid] += logit; cnt[pid] += 1
    correct = 0
    for pid, logit in agg.items():
        m = logit / max(1, cnt.get(pid, 1))
        if int(pid) in id2y and int(m.argmax().item()) == id2y[int(pid)]:
            correct += 1
    total = len(id2y)
    return correct / max(1, total)

def train_convnext_base_subset(epochs=10, lr=8e-4, wd=0.05, image_size=224, batch_size=64, accum_steps=4, patience=2, drop_path=0.2, ckpt_path='ckpt_convnextb_subset.pt'):
    torch.backends.cudnn.benchmark = True
    torch.backends.cudnn.allow_tf32 = True
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.set_float32_matmul_precision('high')
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    dl_tr, dl_va, df_tr, df_va, num_classes, cat2idx, idx2cat = make_subset_loaders_base(batch_size=batch_size, num_workers=16, image_size=image_size)
    model = timm.create_model('convnext_base', pretrained=True, num_classes=num_classes, drop_path_rate=drop_path).to(device).to(memory_format=torch.channels_last)
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=wd, betas=(0.9,0.999))
    model_ema = ModelEmaV2(model, decay=0.9997)
    cls_w = compute_class_weights_from_df(df_tr, cat2idx).to(device)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1, weight=cls_w)
    scaler = amp.GradScaler('cuda') if device.type == 'cuda' else None
    steps_per_epoch = len(dl_tr)
    updates_per_epoch = int(math.ceil(steps_per_epoch / max(1, accum_steps)))
    total_updates = updates_per_epoch * epochs
    warmup_updates = updates_per_epoch  # 1 epoch warmup
    sched = CosineLRScheduler(
        optimizer, t_initial=max(1, total_updates - warmup_updates), lr_min=1e-6,
        warmup_t=warmup_updates, warmup_lr_init=1e-5, t_in_epochs=False
    )
    best_acc = -1.0
    no_improve = 0
    print(f'Start training: epochs={epochs}, eff_batch~={batch_size*accum_steps}, lr={lr}, wd={wd}, img={image_size}', flush=True)
    for ep in range(1, epochs + 1):
        t0 = time.time()
        model.train()
        total, correct, loss_sum = 0, 0, 0.0
        optimizer.zero_grad(set_to_none=True)
        upd_idx = 0
        for it, (x, y) in enumerate(dl_tr):
            x = x.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            y = y.to(device, non_blocking=True)
            if scaler is not None:
                with amp.autocast('cuda'):
                    logits = model(x)
                    loss = criterion(logits, y) / accum_steps
                scaler.scale(loss).backward()
            else:
                logits = model(x)
                loss = criterion(logits, y) / accum_steps
                loss.backward()
            if (it + 1) % accum_steps == 0:
                if scaler is not None:
                    scaler.unscale_(optimizer)
                torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                if scaler is not None:
                    scaler.step(optimizer); scaler.update()
                else:
                    optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                model_ema.update(model)
                upd_idx += 1
                sched.step_update((ep - 1) * updates_per_epoch + upd_idx)
            preds = logits.detach().argmax(1)
            total += y.size(0)
            correct += (preds == y).sum().item()
            loss_sum += loss.item() * accum_steps * x.size(0)
            if (it + 1) % 100 == 0:
                print(f'E{ep} it {it+1}/{len(dl_tr)} loss={loss_sum/max(1,total):.4f} acc={correct/max(1,total):.4f} elapsed={time.time()-t0:.1f}s', flush=True)
        tr_loss = loss_sum / max(1, total)
        tr_acc = correct / max(1, total)
        va_acc = validate_products_base(model_ema.module, dl_va, df_va, cat2idx, device)
        print(f'Epoch {ep}: train loss={tr_loss:.4f} acc={tr_acc:.4f} | val product-acc={va_acc:.4f} | took {time.time()-t0:.1f}s', flush=True)
        if va_acc > best_acc:
            best_acc = va_acc; no_improve = 0
            torch.save({'model': model_ema.module.state_dict(), 'val_acc': va_acc}, ckpt_path)
            print(f'  Saved {ckpt_path} (EMA) with val_acc={va_acc:.4f}', flush=True)
        else:
            no_improve += 1
            if no_improve >= patience:
                print(f'Early stopping at epoch {ep} (no improvement {no_improve}/{patience})', flush=True)
                break
    print('Best val product-acc:', f'{best_acc:.4f}')
    return best_acc

print('ConvNeXt-Base training function ready. Next: execute train_convnext_base_subset(...) to launch the strong run.')

ConvNeXt-Base training function ready. Next: execute train_convnext_base_subset(...) to launch the strong run.


In [14]:
# Launch ConvNeXt-Base@224 strong run on 200k subset
import os, time, torch
os.environ.setdefault('OMP_NUM_THREADS', '1')
os.environ.setdefault('MKL_NUM_THREADS', '1')
print('CUDA available:', torch.cuda.is_available())
if torch.cuda.is_available():
    print('GPU:', torch.cuda.get_device_name(0))
t0 = time.time()
best_acc = train_convnext_base_subset(
    epochs=10, lr=8e-4, wd=0.05, image_size=224,
    batch_size=64, accum_steps=4, patience=2, drop_path=0.2,
    ckpt_path='ckpt_convnextb_subset.pt'
)
print(f'Total training time: {time.time()-t0:.1f}s | best val product-acc={best_acc:.4f}', flush=True)

CUDA available: True
GPU: NVIDIA A10-24Q


Start training: epochs=10, eff_batch~=256, lr=0.0008, wd=0.05, img=224


E1 it 100/2812 loss=8.7830 acc=0.0006 elapsed=35.8s


E1 it 200/2812 loss=8.7411 acc=0.0004 elapsed=63.5s


E1 it 300/2812 loss=8.6976 acc=0.0005 elapsed=91.2s


E1 it 400/2812 loss=8.6684 acc=0.0004 elapsed=119.2s


E1 it 500/2812 loss=8.6341 acc=0.0015 elapsed=147.1s


E1 it 600/2812 loss=8.5816 acc=0.0039 elapsed=175.0s


E1 it 700/2812 loss=8.5054 acc=0.0082 elapsed=203.0s


E1 it 800/2812 loss=8.4161 acc=0.0124 elapsed=231.1s


E1 it 900/2812 loss=8.3145 acc=0.0175 elapsed=259.1s


E1 it 1000/2812 loss=8.2018 acc=0.0224 elapsed=287.2s


E1 it 1100/2812 loss=8.0858 acc=0.0277 elapsed=315.4s


E1 it 1200/2812 loss=7.9727 acc=0.0331 elapsed=343.8s


E1 it 1300/2812 loss=7.8628 acc=0.0388 elapsed=371.9s


E1 it 1400/2812 loss=7.7570 acc=0.0446 elapsed=400.1s


E1 it 1500/2812 loss=7.6558 acc=0.0497 elapsed=428.3s


E1 it 1600/2812 loss=7.5604 acc=0.0552 elapsed=456.5s


E1 it 1700/2812 loss=7.4692 acc=0.0609 elapsed=484.7s


E1 it 1800/2812 loss=7.3824 acc=0.0668 elapsed=513.0s


E1 it 1900/2812 loss=7.3012 acc=0.0717 elapsed=541.3s


E1 it 2000/2812 loss=7.2259 acc=0.0768 elapsed=569.6s


E1 it 2100/2812 loss=7.1554 acc=0.0816 elapsed=597.9s


E1 it 2200/2812 loss=7.0909 acc=0.0858 elapsed=626.2s


E1 it 2300/2812 loss=7.0300 acc=0.0899 elapsed=654.4s


E1 it 2400/2812 loss=6.9737 acc=0.0939 elapsed=682.7s


E1 it 2500/2812 loss=6.9196 acc=0.0977 elapsed=710.9s


E1 it 2600/2812 loss=6.8703 acc=0.1012 elapsed=739.1s


E1 it 2700/2812 loss=6.8238 acc=0.1046 elapsed=767.3s


E1 it 2800/2812 loss=6.7806 acc=0.1077 elapsed=795.5s


Epoch 1: train loss=6.7758 acc=0.1080 | val product-acc=0.0076 | took 854.5s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.0076


E2 it 100/2812 loss=4.9407 acc=0.2725 elapsed=28.4s


E2 it 200/2812 loss=4.9086 acc=0.2767 elapsed=56.6s


E2 it 300/2812 loss=4.9302 acc=0.2724 elapsed=84.7s


E2 it 400/2812 loss=4.9416 acc=0.2710 elapsed=112.9s


E2 it 500/2812 loss=4.9546 acc=0.2703 elapsed=141.2s


E2 it 600/2812 loss=4.9593 acc=0.2699 elapsed=169.5s


E2 it 700/2812 loss=4.9531 acc=0.2699 elapsed=197.8s


E2 it 800/2812 loss=4.9523 acc=0.2701 elapsed=226.1s


E2 it 900/2812 loss=4.9516 acc=0.2698 elapsed=254.4s


E2 it 1000/2812 loss=4.9585 acc=0.2691 elapsed=282.6s


E2 it 1100/2812 loss=4.9548 acc=0.2697 elapsed=310.8s


E2 it 1200/2812 loss=4.9546 acc=0.2694 elapsed=339.3s


E2 it 1300/2812 loss=4.9522 acc=0.2696 elapsed=367.4s


E2 it 1400/2812 loss=4.9462 acc=0.2705 elapsed=395.5s


E2 it 1500/2812 loss=4.9453 acc=0.2709 elapsed=423.7s


E2 it 1600/2812 loss=4.9450 acc=0.2711 elapsed=451.9s


E2 it 1700/2812 loss=4.9458 acc=0.2713 elapsed=480.2s


E2 it 1800/2812 loss=4.9418 acc=0.2723 elapsed=508.4s


E2 it 1900/2812 loss=4.9383 acc=0.2729 elapsed=536.6s


E2 it 2000/2812 loss=4.9349 acc=0.2737 elapsed=564.9s


E2 it 2100/2812 loss=4.9300 acc=0.2747 elapsed=593.2s


E2 it 2200/2812 loss=4.9268 acc=0.2754 elapsed=621.6s


E2 it 2300/2812 loss=4.9209 acc=0.2764 elapsed=649.9s


E2 it 2400/2812 loss=4.9147 acc=0.2773 elapsed=678.2s


E2 it 2500/2812 loss=4.9117 acc=0.2775 elapsed=706.5s


E2 it 2600/2812 loss=4.9066 acc=0.2784 elapsed=734.7s


E2 it 2700/2812 loss=4.9005 acc=0.2794 elapsed=762.8s


E2 it 2800/2812 loss=4.8940 acc=0.2805 elapsed=791.0s


Epoch 2: train loss=4.8928 acc=0.2806 | val product-acc=0.1933 | took 848.6s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.1933


E3 it 100/2812 loss=3.8330 acc=0.4400 elapsed=28.3s


E3 it 200/2812 loss=3.8521 acc=0.4359 elapsed=56.6s


E3 it 300/2812 loss=3.8633 acc=0.4339 elapsed=84.8s


E3 it 400/2812 loss=3.8655 acc=0.4326 elapsed=113.1s


E3 it 500/2812 loss=3.8590 acc=0.4339 elapsed=141.4s


E3 it 600/2812 loss=3.8626 acc=0.4325 elapsed=169.8s


E3 it 700/2812 loss=3.8730 acc=0.4317 elapsed=198.1s


E3 it 800/2812 loss=3.8781 acc=0.4314 elapsed=226.5s


E3 it 900/2812 loss=3.8842 acc=0.4293 elapsed=254.8s


E3 it 1000/2812 loss=3.8905 acc=0.4288 elapsed=283.1s


E3 it 1100/2812 loss=3.8938 acc=0.4285 elapsed=311.4s


E3 it 1200/2812 loss=3.9038 acc=0.4269 elapsed=339.7s


E3 it 1300/2812 loss=3.9141 acc=0.4253 elapsed=367.8s


E3 it 1400/2812 loss=3.9187 acc=0.4247 elapsed=396.0s


E3 it 1500/2812 loss=3.9205 acc=0.4244 elapsed=424.2s


E3 it 1600/2812 loss=3.9248 acc=0.4244 elapsed=452.5s


E3 it 1700/2812 loss=3.9283 acc=0.4239 elapsed=480.7s


E3 it 1800/2812 loss=3.9334 acc=0.4226 elapsed=508.9s


E3 it 1900/2812 loss=3.9358 acc=0.4224 elapsed=537.2s


E3 it 2000/2812 loss=3.9363 acc=0.4226 elapsed=565.5s


E3 it 2100/2812 loss=3.9383 acc=0.4220 elapsed=593.8s


E3 it 2200/2812 loss=3.9414 acc=0.4217 elapsed=622.1s


E3 it 2300/2812 loss=3.9442 acc=0.4215 elapsed=650.4s


E3 it 2400/2812 loss=3.9465 acc=0.4211 elapsed=678.8s


E3 it 2500/2812 loss=3.9465 acc=0.4213 elapsed=707.1s


E3 it 2600/2812 loss=3.9470 acc=0.4213 elapsed=735.4s


E3 it 2700/2812 loss=3.9476 acc=0.4216 elapsed=763.7s


E3 it 2800/2812 loss=3.9485 acc=0.4213 elapsed=791.9s


Epoch 3: train loss=3.9493 acc=0.4211 | val product-acc=0.3276 | took 849.6s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.3276


E4 it 100/2812 loss=3.0468 acc=0.6081 elapsed=28.5s


E4 it 200/2812 loss=3.0681 acc=0.5988 elapsed=56.7s


E4 it 300/2812 loss=3.0794 acc=0.5969 elapsed=84.9s


E4 it 400/2812 loss=3.0893 acc=0.5923 elapsed=113.1s


E4 it 500/2812 loss=3.0948 acc=0.5917 elapsed=141.3s


E4 it 600/2812 loss=3.0974 acc=0.5903 elapsed=169.6s


E4 it 700/2812 loss=3.1087 acc=0.5877 elapsed=198.0s


E4 it 800/2812 loss=3.1125 acc=0.5858 elapsed=226.3s


E4 it 900/2812 loss=3.1182 acc=0.5846 elapsed=254.6s


E4 it 1000/2812 loss=3.1234 acc=0.5836 elapsed=282.9s


E4 it 1100/2812 loss=3.1309 acc=0.5819 elapsed=311.2s


E4 it 1200/2812 loss=3.1311 acc=0.5814 elapsed=339.4s


E4 it 1300/2812 loss=3.1371 acc=0.5802 elapsed=367.6s


E4 it 1400/2812 loss=3.1473 acc=0.5784 elapsed=395.8s


E4 it 1500/2812 loss=3.1515 acc=0.5767 elapsed=424.0s


E4 it 1600/2812 loss=3.1578 acc=0.5753 elapsed=452.2s


E4 it 1700/2812 loss=3.1644 acc=0.5739 elapsed=480.4s


E4 it 1800/2812 loss=3.1698 acc=0.5729 elapsed=508.6s


E4 it 1900/2812 loss=3.1762 acc=0.5713 elapsed=536.8s


E4 it 2000/2812 loss=3.1806 acc=0.5704 elapsed=565.0s


E4 it 2100/2812 loss=3.1851 acc=0.5693 elapsed=593.3s


E4 it 2200/2812 loss=3.1885 acc=0.5683 elapsed=621.5s


E4 it 2300/2812 loss=3.1912 acc=0.5676 elapsed=649.8s


E4 it 2400/2812 loss=3.1960 acc=0.5663 elapsed=678.2s


E4 it 2500/2812 loss=3.1994 acc=0.5660 elapsed=706.6s


E4 it 2600/2812 loss=3.2023 acc=0.5653 elapsed=735.1s


E4 it 2700/2812 loss=3.2068 acc=0.5643 elapsed=763.5s


E4 it 2800/2812 loss=3.2110 acc=0.5632 elapsed=791.8s


Epoch 4: train loss=3.2108 acc=0.5633 | val product-acc=0.3901 | took 849.5s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.3901


E5 it 100/2812 loss=2.4867 acc=0.7430 elapsed=28.5s


E5 it 200/2812 loss=2.5181 acc=0.7367 elapsed=56.7s


E5 it 300/2812 loss=2.5363 acc=0.7327 elapsed=84.8s


E5 it 400/2812 loss=2.5454 acc=0.7290 elapsed=112.9s


E5 it 500/2812 loss=2.5562 acc=0.7257 elapsed=141.2s


E5 it 600/2812 loss=2.5618 acc=0.7241 elapsed=169.4s


E5 it 700/2812 loss=2.5695 acc=0.7222 elapsed=197.7s


E5 it 800/2812 loss=2.5778 acc=0.7206 elapsed=225.9s


E5 it 900/2812 loss=2.5834 acc=0.7192 elapsed=254.2s


E5 it 1000/2812 loss=2.5901 acc=0.7181 elapsed=282.5s


E5 it 1100/2812 loss=2.5923 acc=0.7169 elapsed=310.7s


E5 it 1200/2812 loss=2.5956 acc=0.7158 elapsed=339.0s


E5 it 1300/2812 loss=2.5991 acc=0.7144 elapsed=367.2s


E5 it 1400/2812 loss=2.6025 acc=0.7135 elapsed=395.4s


E5 it 1500/2812 loss=2.6048 acc=0.7129 elapsed=423.6s


E5 it 1600/2812 loss=2.6082 acc=0.7115 elapsed=451.9s


E5 it 1700/2812 loss=2.6123 acc=0.7100 elapsed=480.0s


E5 it 1800/2812 loss=2.6165 acc=0.7089 elapsed=508.2s


E5 it 1900/2812 loss=2.6189 acc=0.7081 elapsed=536.3s


E5 it 2000/2812 loss=2.6203 acc=0.7077 elapsed=564.5s


E5 it 2100/2812 loss=2.6226 acc=0.7070 elapsed=592.7s


E5 it 2200/2812 loss=2.6266 acc=0.7059 elapsed=621.0s


E5 it 2300/2812 loss=2.6281 acc=0.7053 elapsed=649.3s


E5 it 2400/2812 loss=2.6299 acc=0.7047 elapsed=677.6s


E5 it 2500/2812 loss=2.6311 acc=0.7046 elapsed=705.9s


E5 it 2600/2812 loss=2.6314 acc=0.7045 elapsed=734.2s


E5 it 2700/2812 loss=2.6319 acc=0.7039 elapsed=762.5s


E5 it 2800/2812 loss=2.6328 acc=0.7038 elapsed=790.7s


Epoch 5: train loss=2.6323 acc=0.7040 | val product-acc=0.4237 | took 848.6s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4237


E6 it 100/2812 loss=2.2155 acc=0.8161 elapsed=28.5s


E6 it 200/2812 loss=2.2080 acc=0.8188 elapsed=56.8s


E6 it 300/2812 loss=2.2047 acc=0.8189 elapsed=85.0s


E6 it 400/2812 loss=2.2122 acc=0.8174 elapsed=113.3s


E6 it 500/2812 loss=2.2117 acc=0.8179 elapsed=141.6s


E6 it 600/2812 loss=2.2156 acc=0.8157 elapsed=169.8s


E6 it 700/2812 loss=2.2210 acc=0.8155 elapsed=198.0s


E6 it 800/2812 loss=2.2193 acc=0.8154 elapsed=226.2s


E6 it 900/2812 loss=2.2258 acc=0.8132 elapsed=254.5s


E6 it 1000/2812 loss=2.2236 acc=0.8137 elapsed=282.8s


E6 it 1100/2812 loss=2.2256 acc=0.8130 elapsed=311.0s


E6 it 1200/2812 loss=2.2287 acc=0.8120 elapsed=339.3s


E6 it 1300/2812 loss=2.2327 acc=0.8108 elapsed=367.6s


E6 it 1400/2812 loss=2.2329 acc=0.8103 elapsed=395.8s


E6 it 1500/2812 loss=2.2341 acc=0.8101 elapsed=424.1s


E6 it 1600/2812 loss=2.2339 acc=0.8102 elapsed=452.3s


E6 it 1700/2812 loss=2.2373 acc=0.8095 elapsed=480.4s


E6 it 1800/2812 loss=2.2372 acc=0.8092 elapsed=508.6s


E6 it 1900/2812 loss=2.2386 acc=0.8088 elapsed=536.8s


E6 it 2000/2812 loss=2.2385 acc=0.8087 elapsed=565.0s


E6 it 2100/2812 loss=2.2390 acc=0.8085 elapsed=593.1s


E6 it 2200/2812 loss=2.2406 acc=0.8081 elapsed=621.4s


E6 it 2300/2812 loss=2.2418 acc=0.8076 elapsed=649.6s


E6 it 2400/2812 loss=2.2437 acc=0.8073 elapsed=677.9s


E6 it 2500/2812 loss=2.2449 acc=0.8070 elapsed=706.1s


E6 it 2600/2812 loss=2.2462 acc=0.8066 elapsed=734.4s


E6 it 2700/2812 loss=2.2462 acc=0.8065 elapsed=762.7s


E6 it 2800/2812 loss=2.2477 acc=0.8062 elapsed=791.1s


Epoch 6: train loss=2.2480 acc=0.8062 | val product-acc=0.4420 | took 849.1s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4420


E7 it 100/2812 loss=1.9902 acc=0.8719 elapsed=28.4s


E7 it 200/2812 loss=1.9898 acc=0.8714 elapsed=56.6s


E7 it 300/2812 loss=1.9938 acc=0.8710 elapsed=84.7s


E7 it 400/2812 loss=1.9934 acc=0.8704 elapsed=112.9s


E7 it 500/2812 loss=1.9894 acc=0.8710 elapsed=141.0s


E7 it 600/2812 loss=1.9905 acc=0.8711 elapsed=169.2s


E7 it 700/2812 loss=1.9955 acc=0.8699 elapsed=197.3s


E7 it 800/2812 loss=1.9968 acc=0.8691 elapsed=225.7s


E7 it 900/2812 loss=1.9980 acc=0.8685 elapsed=253.9s


E7 it 1000/2812 loss=1.9953 acc=0.8693 elapsed=282.1s


E7 it 1100/2812 loss=1.9970 acc=0.8682 elapsed=310.4s


E7 it 1200/2812 loss=2.0001 acc=0.8672 elapsed=338.7s


E7 it 1300/2812 loss=1.9991 acc=0.8674 elapsed=367.1s


E7 it 1400/2812 loss=2.0006 acc=0.8667 elapsed=395.5s


E7 it 1500/2812 loss=2.0001 acc=0.8670 elapsed=424.0s


E7 it 1600/2812 loss=2.0011 acc=0.8669 elapsed=452.4s


E7 it 1700/2812 loss=2.0015 acc=0.8663 elapsed=480.6s


E7 it 1800/2812 loss=2.0026 acc=0.8661 elapsed=508.7s


E7 it 1900/2812 loss=2.0031 acc=0.8659 elapsed=536.9s


E7 it 2000/2812 loss=2.0035 acc=0.8657 elapsed=565.0s


E7 it 2100/2812 loss=2.0037 acc=0.8659 elapsed=593.2s


E7 it 2200/2812 loss=2.0037 acc=0.8660 elapsed=621.3s


E7 it 2300/2812 loss=2.0035 acc=0.8661 elapsed=649.5s


E7 it 2400/2812 loss=2.0036 acc=0.8660 elapsed=677.7s


E7 it 2500/2812 loss=2.0039 acc=0.8660 elapsed=705.9s


E7 it 2600/2812 loss=2.0045 acc=0.8658 elapsed=734.2s


E7 it 2700/2812 loss=2.0050 acc=0.8656 elapsed=762.4s


E7 it 2800/2812 loss=2.0042 acc=0.8657 elapsed=790.7s


Epoch 7: train loss=2.0040 acc=0.8657 | val product-acc=0.4548 | took 848.7s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4548


E8 it 100/2812 loss=1.8346 acc=0.9047 elapsed=28.6s


E8 it 200/2812 loss=1.8543 acc=0.9000 elapsed=56.8s


E8 it 300/2812 loss=1.8501 acc=0.9028 elapsed=85.1s


E8 it 400/2812 loss=1.8566 acc=0.9013 elapsed=113.5s


E8 it 500/2812 loss=1.8585 acc=0.9005 elapsed=141.8s


E8 it 600/2812 loss=1.8586 acc=0.9005 elapsed=170.1s


E8 it 700/2812 loss=1.8570 acc=0.9007 elapsed=198.4s


E8 it 800/2812 loss=1.8612 acc=0.9002 elapsed=226.7s


E8 it 900/2812 loss=1.8603 acc=0.9004 elapsed=255.0s


E8 it 1000/2812 loss=1.8602 acc=0.8999 elapsed=283.3s


E8 it 1100/2812 loss=1.8596 acc=0.9003 elapsed=311.6s


E8 it 1200/2812 loss=1.8589 acc=0.9001 elapsed=340.0s


E8 it 1300/2812 loss=1.8573 acc=0.9002 elapsed=368.3s


E8 it 1400/2812 loss=1.8565 acc=0.9006 elapsed=396.6s


E8 it 1500/2812 loss=1.8575 acc=0.9006 elapsed=424.9s


E8 it 1600/2812 loss=1.8594 acc=0.9000 elapsed=453.2s


E8 it 1700/2812 loss=1.8602 acc=0.8998 elapsed=481.4s


E8 it 1800/2812 loss=1.8593 acc=0.9000 elapsed=509.6s


E8 it 1900/2812 loss=1.8590 acc=0.9000 elapsed=537.7s


E8 it 2000/2812 loss=1.8590 acc=0.9000 elapsed=565.9s


E8 it 2100/2812 loss=1.8587 acc=0.8999 elapsed=594.1s


E8 it 2200/2812 loss=1.8582 acc=0.9000 elapsed=622.3s


E8 it 2300/2812 loss=1.8588 acc=0.8998 elapsed=650.5s


E8 it 2400/2812 loss=1.8586 acc=0.8998 elapsed=678.6s


E8 it 2500/2812 loss=1.8587 acc=0.8998 elapsed=706.8s


E8 it 2600/2812 loss=1.8582 acc=0.8998 elapsed=734.9s


E8 it 2700/2812 loss=1.8584 acc=0.8998 elapsed=763.2s


E8 it 2800/2812 loss=1.8583 acc=0.8998 elapsed=791.5s


Epoch 8: train loss=1.8584 acc=0.8998 | val product-acc=0.4607 | took 849.4s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4607


E9 it 100/2812 loss=1.8052 acc=0.9145 elapsed=28.6s


E9 it 200/2812 loss=1.7968 acc=0.9170 elapsed=56.9s


E9 it 300/2812 loss=1.7976 acc=0.9166 elapsed=85.1s


E9 it 400/2812 loss=1.7992 acc=0.9150 elapsed=113.4s


E9 it 500/2812 loss=1.7982 acc=0.9154 elapsed=141.7s


E9 it 600/2812 loss=1.7999 acc=0.9146 elapsed=170.0s


E9 it 700/2812 loss=1.7979 acc=0.9152 elapsed=198.3s


E9 it 800/2812 loss=1.7992 acc=0.9150 elapsed=226.5s


E9 it 900/2812 loss=1.8007 acc=0.9152 elapsed=254.8s


E9 it 1000/2812 loss=1.7992 acc=0.9157 elapsed=283.1s


E9 it 1100/2812 loss=1.7966 acc=0.9161 elapsed=311.4s


E9 it 1200/2812 loss=1.7958 acc=0.9163 elapsed=339.7s


E9 it 1300/2812 loss=1.7973 acc=0.9158 elapsed=368.0s


E9 it 1400/2812 loss=1.7979 acc=0.9154 elapsed=396.4s


E9 it 1500/2812 loss=1.7971 acc=0.9154 elapsed=424.8s


E9 it 1600/2812 loss=1.7962 acc=0.9153 elapsed=453.1s


E9 it 1700/2812 loss=1.7967 acc=0.9152 elapsed=481.4s


E9 it 1800/2812 loss=1.7971 acc=0.9149 elapsed=509.7s


E9 it 1900/2812 loss=1.7963 acc=0.9148 elapsed=538.0s


E9 it 2000/2812 loss=1.7972 acc=0.9146 elapsed=566.2s


E9 it 2100/2812 loss=1.7964 acc=0.9147 elapsed=594.5s


E9 it 2200/2812 loss=1.7953 acc=0.9149 elapsed=622.7s


E9 it 2300/2812 loss=1.7950 acc=0.9151 elapsed=651.0s


E9 it 2400/2812 loss=1.7956 acc=0.9150 elapsed=679.2s


E9 it 2500/2812 loss=1.7956 acc=0.9150 elapsed=707.5s


E9 it 2600/2812 loss=1.7954 acc=0.9150 elapsed=735.8s


E9 it 2700/2812 loss=1.7955 acc=0.9150 elapsed=764.1s


E9 it 2800/2812 loss=1.7949 acc=0.9151 elapsed=792.4s


Epoch 9: train loss=1.7949 acc=0.9151 | val product-acc=0.4648 | took 850.2s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4648


E10 it 100/2812 loss=1.8066 acc=0.9133 elapsed=28.4s


E10 it 200/2812 loss=1.7813 acc=0.9176 elapsed=56.7s


E10 it 300/2812 loss=1.7925 acc=0.9142 elapsed=85.0s


E10 it 400/2812 loss=1.7952 acc=0.9134 elapsed=113.2s


E10 it 500/2812 loss=1.7932 acc=0.9141 elapsed=141.5s


E10 it 600/2812 loss=1.7953 acc=0.9139 elapsed=169.7s


E10 it 700/2812 loss=1.7957 acc=0.9138 elapsed=197.9s


E10 it 800/2812 loss=1.7913 acc=0.9151 elapsed=226.1s


E10 it 900/2812 loss=1.7901 acc=0.9151 elapsed=254.3s


E10 it 1000/2812 loss=1.7922 acc=0.9147 elapsed=282.6s


E10 it 1100/2812 loss=1.7901 acc=0.9154 elapsed=310.8s


E10 it 1200/2812 loss=1.7890 acc=0.9157 elapsed=339.0s


E10 it 1300/2812 loss=1.7879 acc=0.9159 elapsed=367.3s


E10 it 1400/2812 loss=1.7874 acc=0.9160 elapsed=395.6s


E10 it 1500/2812 loss=1.7874 acc=0.9160 elapsed=424.0s


E10 it 1600/2812 loss=1.7881 acc=0.9160 elapsed=452.3s


E10 it 1700/2812 loss=1.7891 acc=0.9159 elapsed=480.5s


E10 it 1800/2812 loss=1.7894 acc=0.9160 elapsed=508.8s


E10 it 1900/2812 loss=1.7892 acc=0.9160 elapsed=537.0s


E10 it 2000/2812 loss=1.7893 acc=0.9161 elapsed=565.3s


E10 it 2100/2812 loss=1.7902 acc=0.9159 elapsed=593.5s


E10 it 2200/2812 loss=1.7908 acc=0.9155 elapsed=621.7s


E10 it 2300/2812 loss=1.7896 acc=0.9159 elapsed=649.9s


E10 it 2400/2812 loss=1.7888 acc=0.9160 elapsed=678.2s


E10 it 2500/2812 loss=1.7887 acc=0.9159 elapsed=706.4s


E10 it 2600/2812 loss=1.7892 acc=0.9158 elapsed=734.6s


E10 it 2700/2812 loss=1.7890 acc=0.9158 elapsed=762.8s


E10 it 2800/2812 loss=1.7883 acc=0.9160 elapsed=791.1s


Epoch 10: train loss=1.7882 acc=0.9160 | val product-acc=0.4663 | took 849.0s


  Saved ckpt_convnextb_subset.pt (EMA) with val_acc=0.4663


Best val product-acc: 0.4663


Total training time: 8509.6s | best val product-acc=0.4663


In [15]:
# Inference with ConvNeXt-Base EMA checkpoint: full per-image aggregation + HFlip TTA
from pathlib import Path
import pandas as pd
ckpt_path = 'ckpt_convnextb_subset.pt'
assert Path(ckpt_path).exists(), 'ConvNeXt-Base checkpoint not found. Train run must complete first.'
out_path = infer_and_submit(ckpt_path=ckpt_path, model_name='convnext_base', image_size=224, out_path='submission_base.csv')
sub = pd.read_csv(out_path)
print('submission_base shape:', sub.shape)
print(sub.head())

  ckpt = torch.load(ckpt_path, map_location='cpu')


  infer 200/9671 batches


  infer 400/9671 batches


  infer 600/9671 batches


  infer 800/9671 batches


  infer 1000/9671 batches


  infer 1200/9671 batches


  infer 1400/9671 batches


  infer 1600/9671 batches


  infer 1800/9671 batches


  infer 2000/9671 batches


  infer 2200/9671 batches


  infer 2400/9671 batches


  infer 2600/9671 batches


  infer 2800/9671 batches


  infer 3000/9671 batches


  infer 3200/9671 batches


  infer 3400/9671 batches


  infer 3600/9671 batches


  infer 3800/9671 batches


  infer 4000/9671 batches


  infer 4200/9671 batches


  infer 4400/9671 batches


  infer 4600/9671 batches


  infer 4800/9671 batches


  infer 5000/9671 batches


  infer 5200/9671 batches


  infer 5400/9671 batches


  infer 5600/9671 batches


  infer 5800/9671 batches


  infer 6000/9671 batches


  infer 6200/9671 batches


  infer 6400/9671 batches


  infer 6600/9671 batches


  infer 6800/9671 batches


  infer 7000/9671 batches


  infer 7200/9671 batches


  infer 7400/9671 batches


  infer 7600/9671 batches


  infer 7800/9671 batches


  infer 8000/9671 batches


  infer 8200/9671 batches


  infer 8400/9671 batches


  infer 8600/9671 batches


  infer 8800/9671 batches


  infer 9000/9671 batches


  infer 9200/9671 batches


  infer 9400/9671 batches


  infer 9600/9671 batches


Wrote submission_base.csv rows: 706990


submission_base shape: (706990, 2)
   _id  category_id
0    6   1000005744
1    7   1000004079
2   12   1000018286
3   59   1000007138
4   61   1000010653


In [16]:
# Copy best submission to official filename and quick sanity check
import pandas as pd, shutil, os
assert os.path.exists('submission_base.csv'), 'submission_base.csv not found'
shutil.copyfile('submission_base.csv', 'submission.csv')
sub = pd.read_csv('submission.csv')
print('submission.csv shape:', sub.shape)
print(sub.head())

submission.csv shape: (706990, 2)
   _id  category_id
0    6   1000005744
1    7   1000004079
2   12   1000018286
3   59   1000007138
4   61   1000010653


In [None]:
# Multi-scale TTA inference: add Resize 240 -> CenterCrop 224 alongside 256->224 + HFlip
import pandas as pd, numpy as np, torch
from pathlib import Path
from torch.utils.data import DataLoader
import torchvision.transforms as T

class BSONImageEvalScaled(Dataset):
    def __init__(self, index_df: pd.DataFrame, bson_path: str, include_label: bool, cat2idx: dict | None, resize_size: int, crop_size: int):
        rows = []
        for _, r in index_df.iterrows():
            for k in range(int(r['n_imgs'])):
                rows.append((int(r['offset']), int(r['_id']), k, int(r['category_id']) if include_label and pd.notna(r['category_id']) else None))
        self.df = pd.DataFrame(rows, columns=['offset','_id','img_idx','category_id'])
        self.df.sort_values('offset', inplace=True)
        self.bson_path = str(bson_path)
        self.include_label = include_label
        self.cat2idx = cat2idx
        self._fh = None
        self.transform = T.Compose([
            T.Resize(resize_size),
            T.CenterCrop(crop_size),
            T.ToTensor(),
            T.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ])
    def __len__(self):
        return len(self.df)
    def _ensure_fh(self):
        if self._fh is None:
            self._fh = open(self.bson_path, 'rb', buffering=0)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        self._ensure_fh()
        d = read_bson_doc_at(self._fh, int(r['offset']))
        imgs = d.get('imgs', []) if d is not None else []
        pic_bytes = imgs[int(r['img_idx'])]['picture'] if d is not None and imgs else None
        try:
            img = Image.open(io.BytesIO(pic_bytes)).convert('RGB') if pic_bytes is not None else Image.new('RGB',(256,256))
        except Exception:
            img = Image.new('RGB',(256,256), color=(128,128,128))
        x = self.transform(img)
        if self.include_label:
            cat_id = int(r['category_id']) if r['category_id'] is not None and pd.notna(r['category_id']) else None
            y = self.cat2idx.get(cat_id, 0) if (self.cat2idx is not None and cat_id is not None) else -1
            return x, int(r['_id']), int(r['img_idx']), y
        else:
            return x, int(r['_id']), int(r['img_idx'])

def infer_and_submit_multiscale(ckpt_path: str, model_name: str, crop_size=224, scales=(256, 240), out_path='submission_ms.csv'):
    assert Path(ckpt_path).exists(), f'Checkpoint not found: {ckpt_path}'
    cat2idx, idx2cat, num_classes = load_mapping('category_mapping.json')
    import timm
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = timm.create_model(model_name, pretrained=False, num_classes=num_classes).to(device).to(memory_format=torch.channels_last)
    ckpt = torch.load(ckpt_path, map_location='cpu')
    model.load_state_dict(ckpt['model'], strict=False)
    model.eval()
    dft = pd.read_parquet('test_index.parquet')
    agg, cnt = {}, {}  # pid -> logits sum, count
    with torch.no_grad():
        for s in scales:
            ds = BSONImageEvalScaled(dft, 'test.bson', include_label=False, cat2idx=None, resize_size=s, crop_size=crop_size)
            dl = DataLoader(ds, batch_size=128, shuffle=False, num_workers=0, pin_memory=False)
            for i, batch in enumerate(dl):
                x = batch[0].to(device, non_blocking=True).to(memory_format=torch.channels_last)
                pids = batch[1].tolist()
                with amp.autocast('cuda') if device.type=='cuda' else torch.no_grad():
                    logits = model(x)
                    logits_fl = model(x.flip(-1))
                out = (logits + logits_fl) * 0.5
                out = out.float().cpu()
                for pid, logit in zip(pids, out):
                    if pid not in agg:
                        agg[pid] = logit.clone(); cnt[pid] = 1
                    else:
                        agg[pid] += logit; cnt[pid] += 1
                if (i+1) % 200 == 0:
                    print(f'scale {s}: infer {i+1}/{len(dl)} batches', flush=True)
    ss = pd.read_csv('sample_submission.csv')
    preds = []
    missing = 0
    for pid in ss['_id'].astype(np.int64).tolist():
        if pid in agg:
            logit = agg[pid] / max(1, cnt[pid])
            cls = int(logit.argmax().item())
            cat = int(idx2cat[cls])
        else:
            missing += 1
            cat = int(idx2cat[0])
        preds.append(cat)
    if missing:
        print('Warning: missing products in aggregation:', missing)
    sub = pd.DataFrame({'_id': ss['_id'].astype(np.int64), 'category_id': preds})
    sub.to_csv(out_path, index=False)
    print('Wrote', out_path, 'rows:', len(sub))
    return out_path

# Run multi-scale TTA on ConvNeXt-Base EMA checkpoint and prepare submission.csv
ms_out = infer_and_submit_multiscale(ckpt_path='ckpt_convnextb_subset.pt', model_name='convnext_base', crop_size=224, scales=(256,240), out_path='submission_ms.csv')
import shutil
shutil.copyfile(ms_out, 'submission.csv')
print('submission.csv updated from multi-scale TTA:', ms_out)

  ckpt = torch.load(ckpt_path, map_location='cpu')
