# Plan: iWildCam 2019 - FGVC6 (Medal-focused)

Objectives:
- Establish GPU-ready environment quickly; verify GPU first.
- Build robust, fast baseline with strong CV mirroring test (site/sequence-aware if available).
- Iterate to a medal via model/augmentation/ensembling with trustworthy OOF.

Workflow:
1) Environment & GPU check
   - Verify CUDA/GPU with nvidia-smi; install PyTorch cu121 if needed.
   - Set constraints to avoid torch drift.

2) Data audit
   - Inspect train.csv/test.csv schema; image paths; class counts; label imbalance.
   - Check for site/location/domains (e.g., location, seq_id) to build GroupKFold if present.
   - Verify images exist; unzip with progress and cache paths.

3) CV protocol
   - Target: macro-F1. Use stratified KFold on category_id; if site/seq available, use StratifiedGroupKFold.
   - Fix a single seed and save folds.json for reproducibility.

4) Baseline model
   - torchvision pretrained backbones (e.g., convnext_tiny, resnet50, efficientnet_v2_s) with AMP + SGD/AdamW.
   - 224 or 256 resolution; light augmentations (RandAug/AutoAug, ColorJitter, RandomResizedCrop).
   - Class-balanced sampler or focal loss to address imbalance.
   - Train 1–2 epochs smoke test on subset; then 5-fold full run with early stopping on F1.

5) Improvements
   - Higher resolution (384/448) for final; CutMix/Mixup; cosine schedule; EMA.
   - Calibrate thresholds per-class (optimize F1 on OOF logits).
   - TTA at inference.
   - Ensemble diverse backbones/seeds.

6) Error analysis
   - OOF confusion matrix; per-class F1; tune thresholds; mine hard classes.

7) Submission
   - Predict test with TTA; save submission.csv; verify format.

Checkpoints for expert review:
- After plan (this cell).
- After data audit & CV design.
- After baseline OOF.
- After improvements/ensembles.

Next actions:
- Add GPU check cell and run.
- Load CSVs; inspect columns; check for grouping keys.
- Unzip images to folders and verify counts.

In [1]:
import subprocess, time, shutil, os, sys
print("[GPU CHECK] Running nvidia-smi...", flush=True)
try:
    out = subprocess.run(["bash","-lc","nvidia-smi || true"], capture_output=True, text=True, check=False)
    print(out.stdout)
    if out.stderr:
        print(out.stderr, file=sys.stderr)
except Exception as e:
    print("nvidia-smi failed:", e)
print("[ENV] Python:", sys.version)
print("[ENV] CUDA paths present:", os.path.exists("/usr/local/nvidia"))
print("[DISK] / and /mnt usage:")
for p in ["/","/mnt"]:
    try:
        total, used, free = shutil.disk_usage(p)
        print(f"  {p}: total={total/1e12:.2f}TB used={used/1e12:.2f}TB free={free/1e12:.2f}TB")
    except Exception as e:
        print(f"  {p}: error {e}")
print("[DONE] GPU/Env check finished.", flush=True)

[GPU CHECK] Running nvidia-smi...


Sat Sep 27 17:05:08 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [2]:
import pandas as pd, os, json, numpy as np
pd.set_option('display.max_columns', 200)
print('[DATA AUDIT] Loading CSVs...', flush=True)
train_path, test_path = 'train.csv', 'test.csv'
train = pd.read_csv(train_path)
test = pd.read_csv(test_path)
print('[TRAIN] shape:', train.shape, ' columns:', list(train.columns))
print('[TEST ] shape:', test.shape,  ' columns:', list(test.columns))

# Inspect target and candidate grouping keys
target_col = 'category_id' if 'category_id' in train.columns else None
cand_groups = ['sequence_id','seq_id','seq_id_long','sequence','camera_id','location','site','location_id','place']
present_groups = [c for c in cand_groups if c in train.columns]
print('[CAND GROUP COLS IN TRAIN]:', present_groups)
print('[TARGET] present:', target_col is not None)
if target_col:
    print('[TARGET] nunique classes:', train[target_col].nunique())
    print('[TARGET] head value_counts:')
    print(train[target_col].value_counts().head(10))

# Missing values overview
na_train = train.isna().mean().sort_values(ascending=False)
na_test = test.isna().mean().sort_values(ascending=False)
print('[NA RATE TRAIN] top 10:\n', na_train.head(10))
print('[NA RATE TEST ] top 10:\n', na_test.head(10))

# Show a few rows for schema understanding
print('\n[TRAIN HEAD]\n', train.head(3))
print('\n[TEST  HEAD]\n', test.head(3))

# Verify image filename/path columns and zip existence
img_cols = [c for c in train.columns if 'file' in c.lower() or 'image' in c.lower() or 'path' in c.lower()]
print('[IMAGE-RELATED COLS IN TRAIN]:', img_cols)
print('[FILES] train_images.zip exists:', os.path.exists('train_images.zip'), ' size:', os.path.getsize('train_images.zip') if os.path.exists('train_images.zip') else -1)
print('[FILES] test_images.zip  exists:', os.path.exists('test_images.zip'),  ' size:', os.path.getsize('test_images.zip') if os.path.exists('test_images.zip') else -1)

# Save a quick schema summary for folds planning
schema = {
    'train_columns': list(train.columns),
    'test_columns': list(test.columns),
    'present_groups': present_groups,
    'n_classes': int(train[target_col].nunique()) if target_col else None
}
json.dump(schema, open('schema_summary.json','w'))
print('[SAVED] schema_summary.json')

[DATA AUDIT] Loading CSVs...


[TRAIN] shape: (179422, 11)  columns: ['category_id', 'date_captured', 'file_name', 'frame_num', 'id', 'location', 'rights_holder', 'seq_id', 'seq_num_frames', 'width', 'height']
[TEST ] shape: (16877, 10)  columns: ['date_captured', 'file_name', 'frame_num', 'id', 'location', 'rights_holder', 'seq_id', 'seq_num_frames', 'width', 'height']
[CAND GROUP COLS IN TRAIN]: ['seq_id', 'location']
[TARGET] present: True
[TARGET] nunique classes: 14
[TARGET] head value_counts:
category_id
0     128468
19     10861
1       6035
8       5783
11      5762
13      5303
16      4773
17      4125
3       2902
18      1846
Name: count, dtype: int64
[NA RATE TRAIN] top 10:
 category_id       0.0
date_captured     0.0
file_name         0.0
frame_num         0.0
id                0.0
location          0.0
rights_holder     0.0
seq_id            0.0
seq_num_frames    0.0
width             0.0
dtype: float64
[NA RATE TEST ] top 10:
 date_captured     0.0
file_name         0.0
frame_num         0.0
id      

In [4]:
import os, time, glob, zipfile, sys

def extract_zip_py(zip_path, out_dir, progress_interval=500):
    t0 = time.time()
    os.makedirs(out_dir, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        members = zf.infolist()
        n = len(members)
        print(f"[UNZIP] Extracting {zip_path} -> {out_dir} ({n} files)", flush=True)
        for i, m in enumerate(members, 1):
            zf.extract(m, out_dir)
            if i % progress_interval == 0 or i == n:
                dt = time.time() - t0
                print(f"  extracted {i}/{n} ({i/n*100:.1f}%) elapsed {dt/60:.1f} min", flush=True)
    n_files = sum([len(files) for _, _, files in os.walk(out_dir)])
    dt = time.time() - t0
    print(f"[UNZIP DONE] {zip_path}: {n_files} files in {dt/60:.2f} min", flush=True)

def needs_extract(out_dir, pattern='*.jpg'):
    return not os.path.exists(out_dir) or len(glob.glob(os.path.join(out_dir, pattern))) == 0

if os.path.exists('train_images.zip') and needs_extract('train_images'):
    extract_zip_py('train_images.zip', 'train_images')
else:
    print('[SKIP] train_images already extracted or zip missing')

if os.path.exists('test_images.zip') and needs_extract('test_images'):
    extract_zip_py('test_images.zip', 'test_images')
else:
    print('[SKIP] test_images already extracted or zip missing')

train_samples = glob.glob('train_images/*.jpg')[:3]
test_samples = glob.glob('test_images/*.jpg')[:3]
print('[SAMPLE FILES] train:', train_samples)
print('[SAMPLE FILES] test :', test_samples)

[UNZIP] Extracting train_images.zip -> train_images (179224 files)


  extracted 500/179224 (0.3%) elapsed 0.0 min


  extracted 1000/179224 (0.6%) elapsed 0.0 min


  extracted 1500/179224 (0.8%) elapsed 0.0 min


  extracted 2000/179224 (1.1%) elapsed 0.0 min


  extracted 2500/179224 (1.4%) elapsed 0.0 min


  extracted 3000/179224 (1.7%) elapsed 0.0 min


  extracted 3500/179224 (2.0%) elapsed 0.0 min


  extracted 4000/179224 (2.2%) elapsed 0.0 min


  extracted 4500/179224 (2.5%) elapsed 0.0 min


  extracted 5000/179224 (2.8%) elapsed 0.1 min


  extracted 5500/179224 (3.1%) elapsed 0.1 min


  extracted 6000/179224 (3.3%) elapsed 0.1 min


  extracted 6500/179224 (3.6%) elapsed 0.1 min


  extracted 7000/179224 (3.9%) elapsed 0.1 min


  extracted 7500/179224 (4.2%) elapsed 0.1 min


  extracted 8000/179224 (4.5%) elapsed 0.1 min


  extracted 8500/179224 (4.7%) elapsed 0.1 min


  extracted 9000/179224 (5.0%) elapsed 0.1 min


  extracted 9500/179224 (5.3%) elapsed 0.1 min


  extracted 10000/179224 (5.6%) elapsed 0.1 min


  extracted 10500/179224 (5.9%) elapsed 0.1 min


  extracted 11000/179224 (6.1%) elapsed 0.1 min


  extracted 11500/179224 (6.4%) elapsed 0.1 min


  extracted 12000/179224 (6.7%) elapsed 0.1 min


  extracted 12500/179224 (7.0%) elapsed 0.1 min


  extracted 13000/179224 (7.3%) elapsed 0.1 min


  extracted 13500/179224 (7.5%) elapsed 0.1 min


  extracted 14000/179224 (7.8%) elapsed 0.1 min


  extracted 14500/179224 (8.1%) elapsed 0.1 min


  extracted 15000/179224 (8.4%) elapsed 0.1 min


  extracted 15500/179224 (8.6%) elapsed 0.1 min


  extracted 16000/179224 (8.9%) elapsed 0.1 min


  extracted 16500/179224 (9.2%) elapsed 0.1 min


  extracted 17000/179224 (9.5%) elapsed 0.1 min


  extracted 17500/179224 (9.8%) elapsed 0.1 min


  extracted 18000/179224 (10.0%) elapsed 0.1 min


  extracted 18500/179224 (10.3%) elapsed 0.2 min


  extracted 19000/179224 (10.6%) elapsed 0.2 min


  extracted 19500/179224 (10.9%) elapsed 0.2 min


  extracted 20000/179224 (11.2%) elapsed 0.2 min


  extracted 20500/179224 (11.4%) elapsed 0.2 min


  extracted 21000/179224 (11.7%) elapsed 0.2 min


  extracted 21500/179224 (12.0%) elapsed 0.2 min


  extracted 22000/179224 (12.3%) elapsed 0.2 min


  extracted 22500/179224 (12.6%) elapsed 0.2 min


  extracted 23000/179224 (12.8%) elapsed 0.2 min


  extracted 23500/179224 (13.1%) elapsed 0.2 min


  extracted 24000/179224 (13.4%) elapsed 0.2 min


  extracted 24500/179224 (13.7%) elapsed 0.2 min


  extracted 25000/179224 (13.9%) elapsed 0.2 min


  extracted 25500/179224 (14.2%) elapsed 0.2 min


  extracted 26000/179224 (14.5%) elapsed 0.2 min


  extracted 26500/179224 (14.8%) elapsed 0.2 min


  extracted 27000/179224 (15.1%) elapsed 0.2 min


  extracted 27500/179224 (15.3%) elapsed 0.2 min


  extracted 28000/179224 (15.6%) elapsed 0.2 min


  extracted 28500/179224 (15.9%) elapsed 0.2 min


  extracted 29000/179224 (16.2%) elapsed 0.2 min


  extracted 29500/179224 (16.5%) elapsed 0.2 min


  extracted 30000/179224 (16.7%) elapsed 0.2 min


  extracted 30500/179224 (17.0%) elapsed 0.2 min


  extracted 31000/179224 (17.3%) elapsed 0.2 min


  extracted 31500/179224 (17.6%) elapsed 0.2 min


  extracted 32000/179224 (17.9%) elapsed 0.2 min


  extracted 32500/179224 (18.1%) elapsed 0.3 min


  extracted 33000/179224 (18.4%) elapsed 0.3 min


  extracted 33500/179224 (18.7%) elapsed 0.3 min


  extracted 34000/179224 (19.0%) elapsed 0.3 min


  extracted 34500/179224 (19.2%) elapsed 0.3 min


  extracted 35000/179224 (19.5%) elapsed 0.3 min


  extracted 35500/179224 (19.8%) elapsed 0.3 min


  extracted 36000/179224 (20.1%) elapsed 0.3 min


  extracted 36500/179224 (20.4%) elapsed 0.3 min


  extracted 37000/179224 (20.6%) elapsed 0.3 min


  extracted 37500/179224 (20.9%) elapsed 0.3 min


  extracted 38000/179224 (21.2%) elapsed 0.3 min


  extracted 38500/179224 (21.5%) elapsed 0.3 min


  extracted 39000/179224 (21.8%) elapsed 0.3 min


  extracted 39500/179224 (22.0%) elapsed 0.3 min


  extracted 40000/179224 (22.3%) elapsed 0.3 min


  extracted 40500/179224 (22.6%) elapsed 0.3 min


  extracted 41000/179224 (22.9%) elapsed 0.3 min


  extracted 41500/179224 (23.2%) elapsed 0.3 min


  extracted 42000/179224 (23.4%) elapsed 0.3 min


  extracted 42500/179224 (23.7%) elapsed 0.3 min


  extracted 43000/179224 (24.0%) elapsed 0.3 min


  extracted 43500/179224 (24.3%) elapsed 0.3 min


  extracted 44000/179224 (24.6%) elapsed 0.3 min


  extracted 44500/179224 (24.8%) elapsed 0.3 min


  extracted 45000/179224 (25.1%) elapsed 0.3 min


  extracted 45500/179224 (25.4%) elapsed 0.4 min


  extracted 46000/179224 (25.7%) elapsed 0.4 min


  extracted 46500/179224 (25.9%) elapsed 0.4 min


  extracted 47000/179224 (26.2%) elapsed 0.4 min


  extracted 47500/179224 (26.5%) elapsed 0.4 min


  extracted 48000/179224 (26.8%) elapsed 0.4 min


  extracted 48500/179224 (27.1%) elapsed 0.4 min


  extracted 49000/179224 (27.3%) elapsed 0.4 min


  extracted 49500/179224 (27.6%) elapsed 0.4 min


  extracted 50000/179224 (27.9%) elapsed 0.4 min


  extracted 50500/179224 (28.2%) elapsed 0.4 min


  extracted 51000/179224 (28.5%) elapsed 0.4 min


  extracted 51500/179224 (28.7%) elapsed 0.4 min


  extracted 52000/179224 (29.0%) elapsed 0.4 min


  extracted 52500/179224 (29.3%) elapsed 0.4 min


  extracted 53000/179224 (29.6%) elapsed 0.4 min


  extracted 53500/179224 (29.9%) elapsed 0.4 min


  extracted 54000/179224 (30.1%) elapsed 0.4 min


  extracted 54500/179224 (30.4%) elapsed 0.4 min


  extracted 55000/179224 (30.7%) elapsed 0.4 min


  extracted 55500/179224 (31.0%) elapsed 0.4 min


  extracted 56000/179224 (31.2%) elapsed 0.4 min


  extracted 56500/179224 (31.5%) elapsed 0.4 min


  extracted 57000/179224 (31.8%) elapsed 0.4 min


  extracted 57500/179224 (32.1%) elapsed 0.4 min


  extracted 58000/179224 (32.4%) elapsed 0.4 min


  extracted 58500/179224 (32.6%) elapsed 0.4 min


  extracted 59000/179224 (32.9%) elapsed 0.4 min


  extracted 59500/179224 (33.2%) elapsed 0.5 min


  extracted 60000/179224 (33.5%) elapsed 0.5 min


  extracted 60500/179224 (33.8%) elapsed 0.5 min


  extracted 61000/179224 (34.0%) elapsed 0.5 min


  extracted 61500/179224 (34.3%) elapsed 0.5 min


  extracted 62000/179224 (34.6%) elapsed 0.5 min


  extracted 62500/179224 (34.9%) elapsed 0.5 min


  extracted 63000/179224 (35.2%) elapsed 0.5 min


  extracted 63500/179224 (35.4%) elapsed 0.5 min


  extracted 64000/179224 (35.7%) elapsed 0.5 min


  extracted 64500/179224 (36.0%) elapsed 0.5 min


  extracted 65000/179224 (36.3%) elapsed 0.5 min


  extracted 65500/179224 (36.5%) elapsed 0.5 min


  extracted 66000/179224 (36.8%) elapsed 0.5 min


  extracted 66500/179224 (37.1%) elapsed 0.5 min


  extracted 67000/179224 (37.4%) elapsed 0.5 min


  extracted 67500/179224 (37.7%) elapsed 0.5 min


  extracted 68000/179224 (37.9%) elapsed 0.5 min


  extracted 68500/179224 (38.2%) elapsed 0.5 min


  extracted 69000/179224 (38.5%) elapsed 0.5 min


  extracted 69500/179224 (38.8%) elapsed 0.5 min


  extracted 70000/179224 (39.1%) elapsed 0.5 min


  extracted 70500/179224 (39.3%) elapsed 0.5 min


  extracted 71000/179224 (39.6%) elapsed 0.5 min


  extracted 71500/179224 (39.9%) elapsed 0.5 min


  extracted 72000/179224 (40.2%) elapsed 0.5 min


  extracted 72500/179224 (40.5%) elapsed 0.5 min


  extracted 73000/179224 (40.7%) elapsed 0.6 min


  extracted 73500/179224 (41.0%) elapsed 0.6 min


  extracted 74000/179224 (41.3%) elapsed 0.6 min


  extracted 74500/179224 (41.6%) elapsed 0.6 min


  extracted 75000/179224 (41.8%) elapsed 0.6 min


  extracted 75500/179224 (42.1%) elapsed 0.6 min


  extracted 76000/179224 (42.4%) elapsed 0.6 min


  extracted 76500/179224 (42.7%) elapsed 0.6 min


  extracted 77000/179224 (43.0%) elapsed 0.6 min


  extracted 77500/179224 (43.2%) elapsed 0.6 min


  extracted 78000/179224 (43.5%) elapsed 0.6 min


  extracted 78500/179224 (43.8%) elapsed 0.6 min


  extracted 79000/179224 (44.1%) elapsed 0.6 min


  extracted 79500/179224 (44.4%) elapsed 0.6 min


  extracted 80000/179224 (44.6%) elapsed 0.6 min


  extracted 80500/179224 (44.9%) elapsed 0.6 min


  extracted 81000/179224 (45.2%) elapsed 0.6 min


  extracted 81500/179224 (45.5%) elapsed 0.6 min


  extracted 82000/179224 (45.8%) elapsed 0.6 min


  extracted 82500/179224 (46.0%) elapsed 0.6 min


  extracted 83000/179224 (46.3%) elapsed 0.6 min


  extracted 83500/179224 (46.6%) elapsed 0.6 min


  extracted 84000/179224 (46.9%) elapsed 0.6 min


  extracted 84500/179224 (47.1%) elapsed 0.6 min


  extracted 85000/179224 (47.4%) elapsed 0.6 min


  extracted 85500/179224 (47.7%) elapsed 0.6 min


  extracted 86000/179224 (48.0%) elapsed 0.7 min


  extracted 86500/179224 (48.3%) elapsed 0.7 min


  extracted 87000/179224 (48.5%) elapsed 0.7 min


  extracted 87500/179224 (48.8%) elapsed 0.7 min


  extracted 88000/179224 (49.1%) elapsed 0.7 min


  extracted 88500/179224 (49.4%) elapsed 0.7 min


  extracted 89000/179224 (49.7%) elapsed 0.7 min


  extracted 89500/179224 (49.9%) elapsed 0.7 min


  extracted 90000/179224 (50.2%) elapsed 0.7 min


  extracted 90500/179224 (50.5%) elapsed 0.7 min


  extracted 91000/179224 (50.8%) elapsed 0.7 min


  extracted 91500/179224 (51.1%) elapsed 0.7 min


  extracted 92000/179224 (51.3%) elapsed 0.7 min


  extracted 92500/179224 (51.6%) elapsed 0.7 min


  extracted 93000/179224 (51.9%) elapsed 0.7 min


  extracted 93500/179224 (52.2%) elapsed 0.7 min


  extracted 94000/179224 (52.4%) elapsed 0.8 min


  extracted 94500/179224 (52.7%) elapsed 0.8 min


  extracted 95000/179224 (53.0%) elapsed 0.8 min


  extracted 95500/179224 (53.3%) elapsed 0.8 min


  extracted 96000/179224 (53.6%) elapsed 0.8 min


  extracted 96500/179224 (53.8%) elapsed 0.8 min


  extracted 97000/179224 (54.1%) elapsed 0.8 min


  extracted 97500/179224 (54.4%) elapsed 0.8 min


  extracted 98000/179224 (54.7%) elapsed 0.8 min


  extracted 98500/179224 (55.0%) elapsed 0.8 min


  extracted 99000/179224 (55.2%) elapsed 0.8 min


  extracted 99500/179224 (55.5%) elapsed 0.8 min


  extracted 100000/179224 (55.8%) elapsed 0.8 min


  extracted 100500/179224 (56.1%) elapsed 0.8 min


  extracted 101000/179224 (56.4%) elapsed 0.8 min


  extracted 101500/179224 (56.6%) elapsed 0.8 min


  extracted 102000/179224 (56.9%) elapsed 0.8 min


  extracted 102500/179224 (57.2%) elapsed 0.8 min


  extracted 103000/179224 (57.5%) elapsed 0.9 min


  extracted 103500/179224 (57.7%) elapsed 0.9 min


  extracted 104000/179224 (58.0%) elapsed 0.9 min


  extracted 104500/179224 (58.3%) elapsed 0.9 min


  extracted 105000/179224 (58.6%) elapsed 0.9 min


  extracted 105500/179224 (58.9%) elapsed 0.9 min


  extracted 106000/179224 (59.1%) elapsed 0.9 min


  extracted 106500/179224 (59.4%) elapsed 0.9 min


  extracted 107000/179224 (59.7%) elapsed 0.9 min


  extracted 107500/179224 (60.0%) elapsed 0.9 min


  extracted 108000/179224 (60.3%) elapsed 0.9 min


  extracted 108500/179224 (60.5%) elapsed 0.9 min


  extracted 109000/179224 (60.8%) elapsed 0.9 min


  extracted 109500/179224 (61.1%) elapsed 0.9 min


  extracted 110000/179224 (61.4%) elapsed 0.9 min


  extracted 110500/179224 (61.7%) elapsed 0.9 min


  extracted 111000/179224 (61.9%) elapsed 0.9 min


  extracted 111500/179224 (62.2%) elapsed 0.9 min


  extracted 112000/179224 (62.5%) elapsed 0.9 min


  extracted 112500/179224 (62.8%) elapsed 1.0 min


  extracted 113000/179224 (63.0%) elapsed 1.0 min


  extracted 113500/179224 (63.3%) elapsed 1.0 min


  extracted 114000/179224 (63.6%) elapsed 1.0 min


  extracted 114500/179224 (63.9%) elapsed 1.0 min


  extracted 115000/179224 (64.2%) elapsed 1.0 min


  extracted 115500/179224 (64.4%) elapsed 1.0 min


  extracted 116000/179224 (64.7%) elapsed 1.0 min


  extracted 116500/179224 (65.0%) elapsed 1.0 min


  extracted 117000/179224 (65.3%) elapsed 1.0 min


  extracted 117500/179224 (65.6%) elapsed 1.0 min


  extracted 118000/179224 (65.8%) elapsed 1.0 min


  extracted 118500/179224 (66.1%) elapsed 1.0 min


  extracted 119000/179224 (66.4%) elapsed 1.0 min


  extracted 119500/179224 (66.7%) elapsed 1.0 min


  extracted 120000/179224 (67.0%) elapsed 1.0 min


  extracted 120500/179224 (67.2%) elapsed 1.1 min


  extracted 121000/179224 (67.5%) elapsed 1.1 min


  extracted 121500/179224 (67.8%) elapsed 1.1 min


  extracted 122000/179224 (68.1%) elapsed 1.1 min


  extracted 122500/179224 (68.4%) elapsed 1.1 min


  extracted 123000/179224 (68.6%) elapsed 1.1 min


  extracted 123500/179224 (68.9%) elapsed 1.1 min


  extracted 124000/179224 (69.2%) elapsed 1.1 min


  extracted 124500/179224 (69.5%) elapsed 1.1 min


  extracted 125000/179224 (69.7%) elapsed 1.1 min


  extracted 125500/179224 (70.0%) elapsed 1.1 min


  extracted 126000/179224 (70.3%) elapsed 1.1 min


  extracted 126500/179224 (70.6%) elapsed 1.1 min


  extracted 127000/179224 (70.9%) elapsed 1.1 min


  extracted 127500/179224 (71.1%) elapsed 1.1 min


  extracted 128000/179224 (71.4%) elapsed 1.1 min


  extracted 128500/179224 (71.7%) elapsed 1.2 min


  extracted 129000/179224 (72.0%) elapsed 1.2 min


  extracted 129500/179224 (72.3%) elapsed 1.2 min


  extracted 130000/179224 (72.5%) elapsed 1.2 min


  extracted 130500/179224 (72.8%) elapsed 1.2 min


  extracted 131000/179224 (73.1%) elapsed 1.2 min


  extracted 131500/179224 (73.4%) elapsed 1.2 min


  extracted 132000/179224 (73.7%) elapsed 1.2 min


  extracted 132500/179224 (73.9%) elapsed 1.2 min


  extracted 133000/179224 (74.2%) elapsed 1.2 min


  extracted 133500/179224 (74.5%) elapsed 1.2 min


  extracted 134000/179224 (74.8%) elapsed 1.2 min


  extracted 134500/179224 (75.0%) elapsed 1.2 min


  extracted 135000/179224 (75.3%) elapsed 1.2 min


  extracted 135500/179224 (75.6%) elapsed 1.2 min


  extracted 136000/179224 (75.9%) elapsed 1.2 min


  extracted 136500/179224 (76.2%) elapsed 1.2 min


  extracted 137000/179224 (76.4%) elapsed 1.2 min


  extracted 137500/179224 (76.7%) elapsed 1.2 min


  extracted 138000/179224 (77.0%) elapsed 1.2 min


  extracted 138500/179224 (77.3%) elapsed 1.2 min


  extracted 139000/179224 (77.6%) elapsed 1.2 min


  extracted 139500/179224 (77.8%) elapsed 1.2 min


  extracted 140000/179224 (78.1%) elapsed 1.2 min


  extracted 140500/179224 (78.4%) elapsed 1.2 min


  extracted 141000/179224 (78.7%) elapsed 1.3 min


  extracted 141500/179224 (79.0%) elapsed 1.3 min


  extracted 142000/179224 (79.2%) elapsed 1.3 min


  extracted 142500/179224 (79.5%) elapsed 1.3 min


  extracted 143000/179224 (79.8%) elapsed 1.3 min


  extracted 143500/179224 (80.1%) elapsed 1.3 min


  extracted 144000/179224 (80.3%) elapsed 1.3 min


  extracted 144500/179224 (80.6%) elapsed 1.3 min


  extracted 145000/179224 (80.9%) elapsed 1.3 min


  extracted 145500/179224 (81.2%) elapsed 1.3 min


  extracted 146000/179224 (81.5%) elapsed 1.3 min


  extracted 146500/179224 (81.7%) elapsed 1.3 min


  extracted 147000/179224 (82.0%) elapsed 1.3 min


  extracted 147500/179224 (82.3%) elapsed 1.3 min


  extracted 148000/179224 (82.6%) elapsed 1.3 min


  extracted 148500/179224 (82.9%) elapsed 1.3 min


  extracted 149000/179224 (83.1%) elapsed 1.3 min


  extracted 149500/179224 (83.4%) elapsed 1.3 min


  extracted 150000/179224 (83.7%) elapsed 1.3 min


  extracted 150500/179224 (84.0%) elapsed 1.3 min


  extracted 151000/179224 (84.3%) elapsed 1.3 min


  extracted 151500/179224 (84.5%) elapsed 1.3 min


  extracted 152000/179224 (84.8%) elapsed 1.3 min


  extracted 152500/179224 (85.1%) elapsed 1.4 min


  extracted 153000/179224 (85.4%) elapsed 1.4 min


  extracted 153500/179224 (85.6%) elapsed 1.4 min


  extracted 154000/179224 (85.9%) elapsed 1.4 min


  extracted 154500/179224 (86.2%) elapsed 1.4 min


  extracted 155000/179224 (86.5%) elapsed 1.4 min


  extracted 155500/179224 (86.8%) elapsed 1.4 min


  extracted 156000/179224 (87.0%) elapsed 1.4 min


  extracted 156500/179224 (87.3%) elapsed 1.4 min


  extracted 157000/179224 (87.6%) elapsed 1.4 min


  extracted 157500/179224 (87.9%) elapsed 1.4 min


  extracted 158000/179224 (88.2%) elapsed 1.4 min


  extracted 158500/179224 (88.4%) elapsed 1.4 min


  extracted 159000/179224 (88.7%) elapsed 1.4 min


  extracted 159500/179224 (89.0%) elapsed 1.4 min


  extracted 160000/179224 (89.3%) elapsed 1.4 min


  extracted 160500/179224 (89.6%) elapsed 1.4 min


  extracted 161000/179224 (89.8%) elapsed 1.4 min


  extracted 161500/179224 (90.1%) elapsed 1.4 min


  extracted 162000/179224 (90.4%) elapsed 1.4 min


  extracted 162500/179224 (90.7%) elapsed 1.4 min


  extracted 163000/179224 (90.9%) elapsed 1.4 min


  extracted 163500/179224 (91.2%) elapsed 1.4 min


  extracted 164000/179224 (91.5%) elapsed 1.4 min


  extracted 164500/179224 (91.8%) elapsed 1.4 min


  extracted 165000/179224 (92.1%) elapsed 1.4 min


  extracted 165500/179224 (92.3%) elapsed 1.5 min


  extracted 166000/179224 (92.6%) elapsed 1.5 min


  extracted 166500/179224 (92.9%) elapsed 1.5 min


  extracted 167000/179224 (93.2%) elapsed 1.5 min


  extracted 167500/179224 (93.5%) elapsed 1.5 min


  extracted 168000/179224 (93.7%) elapsed 1.5 min


  extracted 168500/179224 (94.0%) elapsed 1.5 min


  extracted 169000/179224 (94.3%) elapsed 1.5 min


  extracted 169500/179224 (94.6%) elapsed 1.5 min


  extracted 170000/179224 (94.9%) elapsed 1.5 min


  extracted 170500/179224 (95.1%) elapsed 1.5 min


  extracted 171000/179224 (95.4%) elapsed 1.5 min


  extracted 171500/179224 (95.7%) elapsed 1.5 min


  extracted 172000/179224 (96.0%) elapsed 1.5 min


  extracted 172500/179224 (96.2%) elapsed 1.5 min


  extracted 173000/179224 (96.5%) elapsed 1.5 min


  extracted 173500/179224 (96.8%) elapsed 1.5 min


  extracted 174000/179224 (97.1%) elapsed 1.5 min


  extracted 174500/179224 (97.4%) elapsed 1.5 min


  extracted 175000/179224 (97.6%) elapsed 1.5 min


  extracted 175500/179224 (97.9%) elapsed 1.5 min


  extracted 176000/179224 (98.2%) elapsed 1.5 min


  extracted 176500/179224 (98.5%) elapsed 1.5 min


  extracted 177000/179224 (98.8%) elapsed 1.5 min


  extracted 177500/179224 (99.0%) elapsed 1.6 min


  extracted 178000/179224 (99.3%) elapsed 1.6 min


  extracted 178500/179224 (99.6%) elapsed 1.6 min


  extracted 179000/179224 (99.9%) elapsed 1.6 min


  extracted 179224/179224 (100.0%) elapsed 1.6 min


[UNZIP DONE] train_images.zip: 179224 files in 1.57 min


[UNZIP] Extracting test_images.zip -> test_images (16862 files)


  extracted 500/16862 (3.0%) elapsed 0.0 min


  extracted 1000/16862 (5.9%) elapsed 0.0 min


  extracted 1500/16862 (8.9%) elapsed 0.0 min


  extracted 2000/16862 (11.9%) elapsed 0.0 min


  extracted 2500/16862 (14.8%) elapsed 0.0 min


  extracted 3000/16862 (17.8%) elapsed 0.0 min


  extracted 3500/16862 (20.8%) elapsed 0.0 min


  extracted 4000/16862 (23.7%) elapsed 0.0 min


  extracted 4500/16862 (26.7%) elapsed 0.0 min


  extracted 5000/16862 (29.7%) elapsed 0.0 min


  extracted 5500/16862 (32.6%) elapsed 0.0 min


  extracted 6000/16862 (35.6%) elapsed 0.0 min


  extracted 6500/16862 (38.5%) elapsed 0.1 min


  extracted 7000/16862 (41.5%) elapsed 0.1 min


  extracted 7500/16862 (44.5%) elapsed 0.1 min


  extracted 8000/16862 (47.4%) elapsed 0.1 min


  extracted 8500/16862 (50.4%) elapsed 0.1 min


  extracted 9000/16862 (53.4%) elapsed 0.1 min


  extracted 9500/16862 (56.3%) elapsed 0.1 min


  extracted 10000/16862 (59.3%) elapsed 0.1 min


  extracted 10500/16862 (62.3%) elapsed 0.1 min


  extracted 11000/16862 (65.2%) elapsed 0.1 min


  extracted 11500/16862 (68.2%) elapsed 0.1 min


  extracted 12000/16862 (71.2%) elapsed 0.1 min


  extracted 12500/16862 (74.1%) elapsed 0.1 min


  extracted 13000/16862 (77.1%) elapsed 0.1 min


  extracted 13500/16862 (80.1%) elapsed 0.1 min


  extracted 14000/16862 (83.0%) elapsed 0.1 min


  extracted 14500/16862 (86.0%) elapsed 0.1 min


  extracted 15000/16862 (89.0%) elapsed 0.1 min


  extracted 15500/16862 (91.9%) elapsed 0.1 min


  extracted 16000/16862 (94.9%) elapsed 0.1 min


  extracted 16500/16862 (97.9%) elapsed 0.1 min


  extracted 16862/16862 (100.0%) elapsed 0.1 min


[UNZIP DONE] test_images.zip: 16862 files in 0.13 min


[SAMPLE FILES] train: ['train_images/594ceb0f-23d2-11e8-a6a3-ec086b02610b.jpg', 'train_images/59c804a7-23d2-11e8-a6a3-ec086b02610b.jpg', 'train_images/58d47ec1-23d2-11e8-a6a3-ec086b02610b.jpg']
[SAMPLE FILES] test : ['test_images/5a0e34a3-23d2-11e8-a6a3-ec086b02610b.jpg', 'test_images/59fe232a-23d2-11e8-a6a3-ec086b02610b.jpg', 'test_images/59f9470a-23d2-11e8-a6a3-ec086b02610b.jpg']


In [5]:
import os, sys, subprocess, shutil, time
print('[INSTALL] Preparing CUDA 12.1 torch stack...', flush=True)
def pip(*args):
    print('> pip', ' '.join(args), flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# 0) Uninstall any preexisting torch stack (best-effort)
for pkg in ('torch','torchvision','torchaudio'):
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', pkg], check=False)
for d in (
    '/app/.pip-target/torch',
    '/app/.pip-target/torchvision',
    '/app/.pip-target/torchaudio',
    '/app/.pip-target/torch-2.4.1.dist-info',
    '/app/.pip-target/torchvision-0.19.1.dist-info',
    '/app/.pip-target/torchaudio-2.4.1.dist-info',
    '/app/.pip-target/torchgen', '/app/.pip-target/functorch'
):
    if os.path.exists(d):
        print('Removing', d)
        shutil.rmtree(d, ignore_errors=True)

# 1) Install EXACT cu121 torch stack
pip('install',
    '--index-url','https://download.pytorch.org/whl/cu121',
    '--extra-index-url','https://pypi.org/simple',
    'torch==2.4.1','torchvision==0.19.1','torchaudio==2.4.1')

# 2) Freeze versions
from pathlib import Path
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

# 3) Install non-torch deps honoring constraints
pip('install','-c','constraints.txt',
    'timm==1.0.9','albumentations==1.4.10','opencv-python-headless',
    'scikit-learn','pandas','numpy','matplotlib','seaborn','pyyaml',
    '--upgrade-strategy','only-if-needed')

# 4) Sanity check GPU
import torch
print('torch:', torch.__version__, 'CUDA build:', getattr(torch.version, 'cuda', None))
print('CUDA available:', torch.cuda.is_available())
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available'
print('GPU:', torch.cuda.get_device_name(0))
print('[INSTALL] Done.', flush=True)

[INSTALL] Preparing CUDA 12.1 torch stack...






> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 447.6 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 424.1 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 321.8 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 58.2 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 60.8 MB/s eta 0:00:00


Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 419.9 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 489.6 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 180.2 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 309.3 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 493.1 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 256.4 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 203.9 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 262.6 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 512.7 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 221.7 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 261.1 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 511.6 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 575.7 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 290.7 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 318.1 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 243.4 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 332.0 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 379.2 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.2 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


> pip install -c constraints.txt timm==1.0.9 albumentations==1.4.10 opencv-python-headless scikit-learn pandas numpy matplotlib seaborn pyyaml --upgrade-strategy only-if-needed


Collecting timm==1.0.9
  Downloading timm-1.0.9-py3-none-any.whl (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 77.1 MB/s eta 0:00:00
Collecting albumentations==1.4.10
  Downloading albumentations-1.4.10-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.9/161.9 KB 420.6 MB/s eta 0:00:00


Collecting opencv-python-headless
  Downloading opencv_python_headless-4.12.0.88-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (54.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 MB 239.0 MB/s eta 0:00:00
Collecting scikit-learn
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 251.2 MB/s eta 0:00:00
Collecting pandas
  Downloading pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 250.1 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 190.5 MB/s eta 0:00:00


Collecting matplotlib
  Downloading matplotlib-3.10.6-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 252.2 MB/s eta 0:00:00
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.9/294.9 KB 512.3 MB/s eta 0:00:00
Collecting pyyaml
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 529.1 MB/s eta 0:00:00


Collecting torchvision
  Downloading torchvision-0.19.1-cp311-cp311-manylinux1_x86_64.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 216.4 MB/s eta 0:00:00
Collecting huggingface_hub
  Downloading huggingface_hub-0.35.1-py3-none-any.whl (563 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 563.3/563.3 KB 518.2 MB/s eta 0:00:00


Collecting safetensors
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 247.6 MB/s eta 0:00:00
Collecting torch
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 266.0 MB/s eta 0:00:00


Collecting scikit-image>=0.21.0
  Downloading scikit_image-0.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 232.0 MB/s eta 0:00:00


Collecting scipy>=1.10.0
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 203.7 MB/s eta 0:00:00


Collecting typing-extensions>=4.9.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 400.8 MB/s eta 0:00:00
Collecting pydantic>=2.7.0
  Downloading pydantic-2.11.9-py3-none-any.whl (444 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 444.9/444.9 KB 523.1 MB/s eta 0:00:00


Collecting albucore>=0.0.11
  Downloading albucore-0.0.33-py3-none-any.whl (18 kB)
Collecting opencv-python-headless
  Downloading opencv_python_headless-4.11.0.86-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (50.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.0/50.0 MB 232.7 MB/s eta 0:00:00
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 516.5 MB/s eta 0:00:00


Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 509.2/509.2 KB 515.7 MB/s eta 0:00:00
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 347.8/347.8 KB 487.5 MB/s eta 0:00:00
Collecting python-dateutil>=2.8.2
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 KB 495.2 MB/s eta 0:00:00


Collecting pillow>=8
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 274.7 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 350.3 MB/s eta 0:00:00


Collecting fonttools>=4.22.0
  Downloading fonttools-4.60.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (5.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.0/5.0 MB 239.8 MB/s eta 0:00:00
Collecting cycler>=0.10
  Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Collecting contourpy>=1.0.1
  Downloading contourpy-1.3.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (355 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 355.2/355.2 KB 484.0 MB/s eta 0:00:00
Collecting pyparsing>=2.3.1
  Downloading pyparsing-3.2.5-py3-none-any.whl (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.9/113.9 KB 419.6 MB/s eta 0:00:00


Collecting kiwisolver>=1.3.1
  Downloading kiwisolver-1.4.9-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 230.1 MB/s eta 0:00:00


Collecting simsimd>=5.9.2
  Downloading simsimd-6.5.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 181.2 MB/s eta 0:00:00


Collecting stringzilla>=3.10.4
  Downloading stringzilla-4.0.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (496 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 496.5/496.5 KB 384.9 MB/s eta 0:00:00
Collecting typing-inspection>=0.4.0
  Downloading typing_inspection-0.4.1-py3-none-any.whl (14 kB)
Collecting annotated-types>=0.6.0
  Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)


Collecting pydantic-core==2.33.2
  Downloading pydantic_core-2.33.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 524.6 MB/s eta 0:00:00
Collecting six>=1.5
  Downloading six-1.17.0-py2.py3-none-any.whl (11 kB)
Collecting networkx>=3.0
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 542.7 MB/s eta 0:00:00
Collecting lazy-loader>=0.4
  Downloading lazy_loader-0.4-py3-none-any.whl (12 kB)
Collecting imageio!=2.35.0,>=2.33
  Downloading imageio-2.37.0-py3-none-any.whl (315 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 315.8/315.8 KB 342.9 MB/s eta 0:00:00


Collecting tifffile>=2022.8.12
  Downloading tifffile-2025.9.20-py3-none-any.whl (230 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 230.1/230.1 KB 465.7 MB/s eta 0:00:00
Collecting fsspec>=2023.5.0
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 487.2 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 400.0 MB/s eta 0:00:00
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 429.3 MB/s eta 0:00:00
Collecting tqdm>=4.42.1
  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 455.3 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 235.0 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 241.6 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 245.6 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 237.5 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 338.6 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 270.0 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 257.1 MB/s eta 0:00:00
Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 234.6 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 527.8 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 229.1 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 373.9 MB/s eta 0:00:00
Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 295.6 MB/s eta 0:00:00
Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 491.1 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 547.9 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 233.2 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 488.8 MB/s eta 0:00:00
Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 460.1 MB/s eta 0:00:00
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 450.9 MB/s eta 0:00:00


Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 488.3 MB/s eta 0:00:00
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 519.5 MB/s eta 0:00:00


Installing collected packages: simsimd, pytz, mpmath, urllib3, tzdata, typing-extensions, tqdm, threadpoolctl, sympy, stringzilla, six, safetensors, pyyaml, pyparsing, pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, kiwisolver, joblib, idna, hf-xet, fsspec, fonttools, filelock, cycler, charset_normalizer, certifi, annotated-types, typing-inspection, triton, tifffile, scipy, requests, python-dateutil, pydantic-core, opencv-python-headless, nvidia-cusparse-cu12, nvidia-cudnn-cu12, lazy-loader, jinja2, imageio, contourpy, scikit-learn, scikit-image, pydantic, pandas, nvidia-cusolver-cu12, matplotlib, huggingface_hub, albucore, torch, seaborn, albumentations, torchvision, timm


Successfully installed MarkupSafe-3.0.2 albucore-0.0.33 albumentations-1.4.10 annotated-types-0.7.0 certifi-2025.8.3 charset_normalizer-3.4.3 contourpy-1.3.3 cycler-0.12.1 filelock-3.19.1 fonttools-4.60.0 fsspec-2025.9.0 hf-xet-1.1.10 huggingface_hub-0.35.1 idna-3.10 imageio-2.37.0 jinja2-3.1.6 joblib-1.5.2 kiwisolver-1.4.9 lazy-loader-0.4 matplotlib-3.10.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 opencv-python-headless-4.11.0.86 packaging-25.0 pandas-2.3.2 pillow-11.3.0 pydantic-2.11.9 pydantic-core-2.33.2 pyparsing-3.2.5 python-dateutil-2.9.0.post0 pytz-2025.2 pyyaml-6.0.3 requests-2.32.5 safetensors-0.6.2 scikit-image-0.25.2 scikit-lear









torch: 2.4.1+cu121 CUDA build: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q
[INSTALL] Done.


In [6]:
import pandas as pd, numpy as np, os, json, time
from sklearn.model_selection import StratifiedGroupKFold

print('[FOLDS] Building StratifiedGroupKFold by seq_id (5 folds)...', flush=True)
train = pd.read_csv('train.csv')
assert 'category_id' in train.columns and 'seq_id' in train.columns, 'Required columns missing'

# Basic sanity checks
n_classes = train['category_id'].nunique()
print(f'[FOLDS] n_train={len(train)} n_classes={n_classes} unique seq={train.seq_id.nunique()}')

# Create folds
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
folds = np.full(len(train), -1, dtype=int)
t0 = time.time()
for fold, (tr_idx, va_idx) in enumerate(sgkf.split(train, y=train['category_id'], groups=train['seq_id'])):
    folds[va_idx] = fold
    # Logging
    yv = train.loc[va_idx, 'category_id']
    print(f'  fold {fold}: val_size={len(va_idx)} classes={yv.nunique()} seqs={train.loc[va_idx,"seq_id"].nunique()}')
    print('    class dist (top10):', yv.value_counts().head(10).to_dict())

assert (folds >= 0).all(), 'Some rows not assigned to folds'
train['fold'] = folds

# Verify no sequence crosses folds
seq_to_folds = train.groupby('seq_id')['fold'].nunique()
leak_seqs = (seq_to_folds > 1).sum()
print(f'[FOLDS] Sequences spanning multiple folds: {leak_seqs}')
assert leak_seqs == 0, 'Sequence leakage detected across folds!'

# Save folds to disk
cols_to_save = ['id','file_name','seq_id','location','category_id','fold']
folds_df = train[cols_to_save].copy()
folds_df.to_csv('folds.csv', index=False)
json.dump({'n_splits':5,'random_state':42,'group_col':'seq_id','stratify':'category_id'}, open('folds_meta.json','w'))
print('[FOLDS] Saved folds.csv and folds_meta.json')

# Create path columns for convenience
train_path_dir = 'train_images'
test_path_dir = 'test_images'
assert os.path.isdir(train_path_dir) and os.path.isdir(test_path_dir), 'Image dirs missing'
folds_df['path'] = folds_df['file_name'].apply(lambda x: os.path.join(train_path_dir, x))
folds_df.to_csv('folds_with_paths.csv', index=False)
print('[FOLDS] Saved folds_with_paths.csv')

# Quick sample check
print('[FOLDS] head:\n', folds_df.head())
print(f'[FOLDS] Done in {(time.time()-t0):.2f}s')

[FOLDS] Building StratifiedGroupKFold by seq_id (5 folds)...


[FOLDS] n_train=179422 n_classes=14 unique seq=141628


  fold 0: val_size=35901 classes=14 seqs=28326
    class dist (top10): {0: 25727, 19: 2137, 1: 1226, 11: 1163, 8: 1154, 13: 1047, 16: 944, 17: 810, 3: 609, 18: 357}
  fold 1: val_size=35915 classes=14 seqs=28326
    class dist (top10): {0: 25632, 19: 2199, 1: 1191, 11: 1162, 8: 1134, 13: 1046, 16: 985, 17: 857, 3: 615, 18: 392}
  fold 2: val_size=35826 classes=14 seqs=28326
    class dist (top10): {0: 25670, 19: 2114, 11: 1180, 1: 1179, 8: 1144, 13: 1076, 16: 999, 17: 834, 3: 574, 18: 360}
  fold 3: val_size=35870 classes=14 seqs=28325
    class dist (top10): {0: 25740, 19: 2220, 1: 1200, 8: 1164, 11: 1138, 13: 1059, 16: 899, 17: 833, 3: 537, 18: 380}
  fold 4: val_size=35910 classes=14 seqs=28325
    class dist (top10): {0: 25699, 19: 2191, 1: 1239, 8: 1187, 11: 1119, 13: 1075, 16: 946, 17: 791, 3: 567, 18: 357}


[FOLDS] Sequences spanning multiple folds: 0


[FOLDS] Saved folds.csv and folds_meta.json


[FOLDS] Saved folds_with_paths.csv
[FOLDS] head:
                                      id  \
0  588a679f-23d2-11e8-a6a3-ec086b02610b   
1  59279ce3-23d2-11e8-a6a3-ec086b02610b   
2  5a2af4ab-23d2-11e8-a6a3-ec086b02610b   
3  593d68d7-23d2-11e8-a6a3-ec086b02610b   
4  58782b45-23d2-11e8-a6a3-ec086b02610b   

                                  file_name  \
0  588a679f-23d2-11e8-a6a3-ec086b02610b.jpg   
1  59279ce3-23d2-11e8-a6a3-ec086b02610b.jpg   
2  5a2af4ab-23d2-11e8-a6a3-ec086b02610b.jpg   
3  593d68d7-23d2-11e8-a6a3-ec086b02610b.jpg   
4  58782b45-23d2-11e8-a6a3-ec086b02610b.jpg   

                                 seq_id  location  category_id  fold  \
0  6f12067d-5567-11e8-b3c0-dca9047ef277       115           19     2   
1  6faa92d1-5567-11e8-b1ae-dca9047ef277        96            0     1   
2  6f7d4702-5567-11e8-9e03-dca9047ef277        57            0     4   
3  6f0f6778-5567-11e8-b5d2-dca9047ef277        90            3     1   
4  6f789194-5567-11e8-946a-dca9047ef277       10

In [16]:
import os, time, math, random, gc, json
from pathlib import Path
import numpy as np
import pandas as pd
import cv2
cv2.setNumThreads(0)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import albumentations as A
from albumentations.pytorch import ToTensorV2
import timm
from timm.utils import ModelEmaV2
from sklearn.metrics import f1_score

torch.backends.cudnn.benchmark = True
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
try:
    torch.set_num_threads(4)
except Exception:
    pass

class ImgDs(Dataset):
    def __init__(self, df, img_dir, transforms=None):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.transforms = transforms
        self.has_y = 'category_id' in df.columns
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        fp = os.path.join(self.img_dir, r['file_name'])
        img = cv2.imread(fp, cv2.IMREAD_COLOR)
        if img is None:
            img = np.zeros((512,512,3), dtype=np.uint8)
        else:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        if self.transforms:
            img = self.transforms(image=img)['image']
        if self.has_y:
            return img, int(r['category_id'])
        else:
            return img, -1

class FocalLoss(nn.Module):
    def __init__(self, gamma=1.5, alpha=0.25, reduction='mean'):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.reduction = reduction
    def forward(self, logits, target):
        ce = F.cross_entropy(logits, target, reduction='none')
        pt = torch.exp(-ce)
        loss = (self.alpha * (1-pt)**self.gamma) * ce
        if self.reduction == 'mean':
            return loss.mean()
        elif self.reduction == 'sum':
            return loss.sum()
        return loss

def get_transforms(img_size=224):
    train_tf = A.Compose([
        A.RandomResizedCrop(img_size, img_size, scale=(0.7,1.0), p=1.0),
        A.HorizontalFlip(p=0.5),
        A.ColorJitter(0.2,0.2,0.2,0.1,p=0.3),
        A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ToTensorV2()
    ])
    val_tf = A.Compose([
        A.LongestMaxSize(max_size=img_size),
        A.PadIfNeeded(img_size, img_size, border_mode=cv2.BORDER_CONSTANT, value=0),
        A.Normalize(mean=(0.485,0.456,0.406), std=(0.229,0.224,0.225)),
        ToTensorV2()
    ])
    return train_tf, val_tf

def seed_all(seed=42):
    random.seed(seed); np.random.seed(seed); torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)

def train_one_epoch(model, ema, loader, optimizer, scaler, loss_fn, epoch, log_interval=200):
    model.train()
    t0 = time.time()
    total = 0.0; n = 0
    for it, (x, y) in enumerate(loader):
        x = x.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
        y = y.to(DEVICE, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with torch.autocast(device_type='cuda', dtype=torch.float16) if DEVICE=='cuda' else torch.autocast('cpu'):
            logits = model(x)
            loss = loss_fn(logits, y)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        if ema is not None:
            ema.update(model)
        total += loss.item() * x.size(0)
        n += x.size(0)
        if (it+1) % log_interval == 0:
            dt = time.time()-t0
            print(f'[EPOCH {epoch}] it {it+1}/{len(loader)} loss {total/n:.4f} elapsed {dt/60:.2f}m', flush=True)
    return total/n if n>0 else 0.0

def predict(model, loader):
    model.eval()
    preds = []
    with torch.no_grad():
        for x, _ in loader:
            x = x.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.autocast(device_type='cuda', dtype=torch.float16) if DEVICE=='cuda' else torch.autocast('cpu'):
                logits = model(x)
            preds.append(logits.float().cpu())
    return torch.cat(preds, dim=0).numpy()

def seq_average_logits(df_idx, logits, df):
    val_df = df.iloc[df_idx].reset_index(drop=True)
    val_df = val_df.assign(_row=np.arange(len(val_df)))
    arr = logits
    out = np.zeros_like(arr)
    for sid, grp in val_df.groupby('seq_id')['_row']:
        idxs = grp.values
        out[idxs] = arr[idxs].mean(axis=0, keepdims=True)
    return out

def macro_f1_from_logits(y_true, logits):
    y_pred = logits.argmax(axis=1)
    return f1_score(y_true, y_pred, average='macro')

def smoke_train_one_fold(fold=0, img_size=224, epochs=1, batch_size=16, model_name='resnet18', max_train=2000, max_val=500, pretrained=False):
    seed_all(42)
    df = pd.read_csv('folds.csv')
    n_classes = df['category_id'].nunique()
    tr_idx_all = df.index[df['fold'] != fold].to_list()
    va_idx_all = df.index[df['fold'] == fold].to_list()
    tr_idx = tr_idx_all[:max_train]
    va_idx = va_idx_all[:max_val]
    print(f'[SMOKE] fold={fold} train={len(tr_idx)}/{len(tr_idx_all)} val={len(va_idx)}/{len(va_idx_all)} classes={n_classes}', flush=True)
    train_tf, val_tf = get_transforms(img_size)
    train_ds = ImgDs(df.iloc[tr_idx], 'train_images', train_tf)
    val_ds   = ImgDs(df.iloc[va_idx], 'train_images', val_tf)
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True, drop_last=True)
    val_loader   = DataLoader(val_ds, batch_size=max(8, batch_size), shuffle=False, num_workers=0, pin_memory=True)

    print(f'[MODEL] Creating {model_name}, pretrained={pretrained}', flush=True)
    model = timm.create_model(model_name, pretrained=pretrained, num_classes=n_classes)
    model.to(DEVICE); model.to(memory_format=torch.channels_last)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    loss_fn = FocalLoss(gamma=1.5, alpha=0.25)
    ema = None  # disable EMA for smoke to reduce overhead
    scaler = torch.amp.GradScaler('cuda') if DEVICE=='cuda' else torch.amp.GradScaler('cpu')

    steps_per_epoch = max(1, len(train_loader))
    warmup_steps = steps_per_epoch
    total_steps = steps_per_epoch * epochs

    def lr_schedule(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        prog = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1 + math.cos(math.pi * prog))

    sched = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_schedule)

    global_step = 0
    for ep in range(1, epochs+1):
        loss = train_one_epoch(model, ema, train_loader, optimizer, scaler, loss_fn, ep, log_interval=20)
        for _ in range(steps_per_epoch):
            sched.step(); global_step += 1
        val_logits = predict(model, val_loader)
        y_true = df.iloc[va_idx]['category_id'].values
        val_logits_seq = seq_average_logits(va_idx, val_logits, df)
        f1_plain = macro_f1_from_logits(y_true, val_logits)
        f1_seq = macro_f1_from_logits(y_true, val_logits_seq)
        print(f'[SMOKE][E{ep}] loss={loss:.4f} F1_plain={f1_plain:.4f} F1_seq={f1_seq:.4f}', flush=True)
        gc.collect();
        try:
            torch.cuda.empty_cache()
        except Exception:
            pass

    np.save(f'oof_logits_fold{fold}.npy', val_logits_seq)
    np.save(f'oof_idx_fold{fold}.npy', np.array(va_idx))
    print('[SMOKE] Saved OOF logits/idx for fold', fold)

print('[NEXT] Ready to run smoke_train_one_fold(fold=0, img_size=224, epochs=1, batch_size=16, model_name=\'resnet18\', max_train=2000, max_val=500, pretrained=False)')

  from .autonotebook import tqdm as notebook_tqdm


[NEXT] Ready to run smoke_train_one_fold(fold=0, img_size=224, epochs=1, batch_size=32, model_name='convnext_tiny', max_train=5000, max_val=1000)


In [10]:
import sys, subprocess, shutil, os
def run(*args, check=True):
    print('>', ' '.join(args), flush=True)
    subprocess.run(list(args), check=check)

print('[FIX] Clean reinstall: albumentations==1.3.1 (no albucore dep)...', flush=True)
# Uninstall conflicting packages
run(sys.executable, '-m', 'pip', 'uninstall', '-y', 'albumentations', 'albucore', check=False)
# Remove possible stale site dirs
for d in (
    '/app/.pip-target/albumentations',
    '/app/.pip-target/albumentations-1.4.20.dist-info',
    '/app/.pip-target/albucore',
    '/app/.pip-target/albucore-0.0.19.dist-info',
):
    if os.path.exists(d):
        print('Removing', d)
        shutil.rmtree(d, ignore_errors=True)

# Reinstall older stable albumentations without albucore dependency
run(sys.executable, '-m', 'pip', 'install', '-c', 'constraints.txt',
    '--upgrade', '--force-reinstall', '--no-cache-dir',
    'albumentations==1.3.1', 'opencv-python-headless', '--upgrade-strategy', 'only-if-needed')

print('[FIX] Verifying import...')
import albumentations as A
from albumentations.pytorch import ToTensorV2
print('[OK] albumentations', A.__version__, 'module at', A.__file__)

[FIX] Clean reinstall: albumentations==1.3.1 (no albucore dep)...


> /usr/bin/python3.11 -m pip uninstall -y albumentations albucore


Found existing installation: albumentations 1.4.20
Uninstalling albumentations-1.4.20:
  Successfully uninstalled albumentations-1.4.20
Found existing installation: albucore 0.0.33
Uninstalling albucore-0.0.33:
  Successfully uninstalled albucore-0.0.33
Removing /app/.pip-target/albumentations
Removing /app/.pip-target/albucore-0.0.19.dist-info
> /usr/bin/python3.11 -m pip install -c constraints.txt --upgrade --force-reinstall --no-cache-dir albumentations==1.3.1 opencv-python-headless --upgrade-strategy only-if-needed


Collecting albumentations==1.3.1
  Downloading albumentations-1.3.1-py3-none-any.whl (125 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 125.7/125.7 KB 5.8 MB/s eta 0:00:00
Collecting opencv-python-headless
  Downloading opencv_python_headless-4.12.0.88-cp37-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (54.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.0/54.0 MB 224.9 MB/s eta 0:00:00
Collecting scikit-image>=0.16.1
  Downloading scikit_image-0.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 237.5 MB/s eta 0:00:00


Collecting scipy>=1.1.0
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 222.8 MB/s eta 0:00:00
Collecting qudida>=0.0.4
  Downloading qudida-0.0.4-py3-none-any.whl (3.5 kB)


Collecting numpy>=1.11.1
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 345.5 MB/s eta 0:00:00
Collecting PyYAML
  Downloading pyyaml-6.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (806 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 806.6/806.6 KB 530.2 MB/s eta 0:00:00
Collecting opencv-python-headless
  Downloading opencv_python_headless-4.11.0.86-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (50.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 50.0/50.0 MB 201.0 MB/s eta 0:00:00


Collecting scikit-learn>=0.19.1
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 351.0 MB/s eta 0:00:00
Collecting typing-extensions
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 285.9 MB/s eta 0:00:00
Collecting lazy-loader>=0.4
  Downloading lazy_loader-0.4-py3-none-any.whl (12 kB)
Collecting packaging>=21
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 422.4 MB/s eta 0:00:00
Collecting imageio!=2.35.0,>=2.33
  Downloading imageio-2.37.0-py3-none-any.whl (315 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 315.8/315.8 KB 413.2 MB/s eta 0:00:00


Collecting networkx>=3.0
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 530.9 MB/s eta 0:00:00


Collecting pillow>=10.1
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 165.3 MB/s eta 0:00:00
Collecting tifffile>=2022.8.12
  Downloading tifffile-2025.9.20-py3-none-any.whl (230 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 230.1/230.1 KB 429.9 MB/s eta 0:00:00


Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)
Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 507.1 MB/s eta 0:00:00


Installing collected packages: typing-extensions, threadpoolctl, PyYAML, pillow, packaging, numpy, networkx, joblib, tifffile, scipy, opencv-python-headless, lazy-loader, imageio, scikit-learn, scikit-image, qudida, albumentations


Successfully installed PyYAML-6.0.3 albumentations-1.3.1 imageio-2.37.0 joblib-1.5.2 lazy-loader-0.4 networkx-3.5 numpy-1.26.4 opencv-python-headless-4.11.0.86 packaging-25.0 pillow-11.3.0 qudida-0.0.4 scikit-image-0.25.2 scikit-learn-1.7.2 scipy-1.16.2 threadpoolctl-3.6.0 tifffile-2025.9.20 typing-extensions-4.15.0


[FIX] Verifying import...


[OK] albumentations 1.3.1 module at /app/.pip-target/albumentations/__init__.py


In [18]:
import time
t0 = time.time()
print('[RUN] Starting 224px smoke training: fold=0, epochs=1, bs=16, model=resnet18 (pretrained=False), subset train=2000 val=500', flush=True)
smoke_train_one_fold(fold=0, img_size=224, epochs=1, batch_size=16, model_name='resnet18', max_train=2000, max_val=500, pretrained=False)
print(f'[RUN] Done. Elapsed {(time.time()-t0)/60:.2f} min', flush=True)

[RUN] Starting 224px smoke training: fold=0, epochs=1, bs=16, model=resnet18 (pretrained=False), subset train=2000 val=500


NameError: name 'smoke_train_one_fold' is not defined

In [19]:
import os, time, math, random, gc, json, sys
import numpy as np
import pandas as pd
import cv2
cv2.setNumThreads(0)
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from torchvision import models
from sklearn.metrics import f1_score

torch.backends.cudnn.benchmark = True
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
try: torch.set_num_threads(4)
except Exception: pass

class TinyDs(Dataset):
    def __init__(self, df, img_dir, img_size=224, train=True):
        self.df = df.reset_index(drop=True)
        self.img_dir = img_dir
        self.img_size = img_size
        self.train = train
        self.has_y = 'category_id' in df.columns
    def __len__(self):
        return len(self.df)
    def __getitem__(self, idx):
        r = self.df.iloc[idx]
        fp = os.path.join(self.img_dir, r['file_name'])
        img = cv2.imread(fp, cv2.IMREAD_COLOR)
        if img is None:
            img = np.zeros((self.img_size, self.img_size, 3), dtype=np.uint8)
        else:
            img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        h, w = img.shape[:2]
        s = self.img_size
        scale = s / max(h, w)
        nh, nw = int(h*scale), int(w*scale)
        img = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
        pad = np.zeros((s, s, 3), dtype=img.dtype)
        y0 = (s - nh) // 2; x0 = (s - nw) // 2
        pad[y0:y0+nh, x0:x0+nw] = img
        img = pad.astype(np.float32) / 255.0
        mean = np.array([0.485,0.456,0.406], dtype=np.float32)
        std  = np.array([0.229,0.224,0.225], dtype=np.float32)
        img = (img - mean) / std
        img = np.transpose(img, (2,0,1))
        x = torch.from_numpy(img)
        y = int(r['category_id']) if self.has_y else -1
        return x, y

def run_tiny_smoke(img_size=224, max_train=1024, max_val=256, batches_limit=50, bs=16):
        df = pd.read_csv('folds.csv')
        n_classes = df['category_id'].nunique()
        tr = df[df['fold']!=0].iloc[:max_train].copy()
        va = df[df['fold']==0].iloc[:max_val].copy()
        print(f'[SMOKE2] train={len(tr)} val={len(va)} classes={n_classes}', flush=True)
        train_ds = TinyDs(tr, 'train_images', img_size, train=True)
        val_ds   = TinyDs(va, 'train_images', img_size, train=False)
        train_loader = DataLoader(train_ds, batch_size=bs, shuffle=True, num_workers=0, pin_memory=True, drop_last=True)
        val_loader   = DataLoader(val_ds, batch_size=bs*2, shuffle=False, num_workers=0, pin_memory=True)
        model = models.resnet18(weights=None)
        model.fc = nn.Linear(model.fc.in_features, n_classes)
        model.to(DEVICE); model.to(memory_format=torch.channels_last)
        opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        t0 = time.time()
        seen = 0
        for it, (x,y) in enumerate(train_loader, 1):
            x = x.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
            y = y.to(DEVICE, non_blocking=True)
            opt.zero_grad(set_to_none=True)
            logits = model(x)
            loss = loss_fn(logits, y)
            loss.backward(); opt.step()
            seen += x.size(0)
            if it % 10 == 0:
                print(f'  it {it} loss {loss.item():.4f} seen {seen}', flush=True)
            if it >= batches_limit:
                break
        print(f'[SMOKE2] Train done in {(time.time()-t0):.1f}s, evaluating...', flush=True)
        model.eval()
        preds = [];
        with torch.no_grad():
            for x, _ in val_loader:
                x = x.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
                logits = model(x).float().cpu().numpy()
                preds.append(logits)
        logits = np.concatenate(preds, 0)
        y_true = va['category_id'].values
        y_pred = logits.argmax(1)
        f1 = f1_score(y_true, y_pred, average='macro')
        print(f'[SMOKE2] Macro-F1 (plain argmax, no seq avg): {f1:.4f}', flush=True)
        return f1

print('[NEXT] Ready to run: run_tiny_smoke(img_size=224, max_train=1024, max_val=256, batches_limit=50, bs=16)')

[NEXT] Ready to run: run_tiny_smoke(img_size=224, max_train=1024, max_val=256, batches_limit=50, bs=16)


In [20]:
import time
t0=time.time()
print('[RUN] Tiny smoke with torchvision resnet18, bs=16, 50 batches, 1024/256 subset', flush=True)
f1 = run_tiny_smoke(img_size=224, max_train=1024, max_val=256, batches_limit=50, bs=16)
print(f'[RUN] Done tiny smoke. F1={f1:.4f}. Elapsed {(time.time()-t0):.1f}s', flush=True)

[RUN] Tiny smoke with torchvision resnet18, bs=16, 50 batches, 1024/256 subset


[SMOKE2] train=1024 val=256 classes=14


KernelDied: Kernel died unexpectedly.

In [21]:
print('[DEBUG] Forcing CPU for smoke to diagnose kernel crashes...', flush=True)
DEVICE = 'cpu'
f1_cpu = run_tiny_smoke(img_size=224, max_train=256, max_val=128, batches_limit=20, bs=8)
print('[DEBUG] CPU tiny smoke F1:', f1_cpu)

[DEBUG] Forcing CPU for smoke to diagnose kernel crashes...


NameError: name 'run_tiny_smoke' is not defined

In [23]:
# Tabular-only baseline (metadata) to get a working submission and OOF estimate
import pandas as pd, numpy as np, time, json
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
target = 'category_id'

# Feature engineering on metadata
def fe(df):
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df = df.copy()
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    # frame/sequence context
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(-1).astype(int)
    df['loc'] = df['location'].astype(str)
    return df

train_fe = fe(train)
test_fe  = fe(test)

# Define features
num_cols = ['frame_num','seq_num_frames','width','height','year','month','day','hour','is_night']
cat_cols = ['loc']

X = train_fe[num_cols + cat_cols]
y = train_fe[target].values
groups = train_fe['seq_id'].astype(str).values
X_test = test_fe[num_cols + cat_cols]

pre = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=True, with_std=True), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=True), cat_cols),
    ]
)

clf = LogisticRegression(
    multi_class='multinomial',
    solver='saga',
    max_iter=200,
    C=1.0,
    class_weight='balanced',
    n_jobs=4,
)

pipe = Pipeline([('pre', pre), ('clf', clf)])

# CV
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_pred = np.zeros(len(train_fe), dtype=int)
fold_scores = []

for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X, y=y, groups=groups)):
    t_fold = time.time()
    X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
    y_tr, y_va = y[tr_idx], y[va_idx]
    print(f'[TAB] Fold {fold} train={len(tr_idx)} val={len(va_idx)}', flush=True)
    model = pipe
    model.fit(X_tr, y_tr)
    y_hat = model.predict(X_va)
    f1 = f1_score(y_va, y_hat, average='macro')
    oof_pred[va_idx] = y_hat
    fold_scores.append(f1)
    print(f'[TAB] Fold {fold} macro-F1={f1:.5f} elapsed {time.time()-t_fold:.1f}s', flush=True)

oof_f1 = f1_score(y, oof_pred, average='macro')
print(f'[TAB] OOF macro-F1={oof_f1:.5f} folds={fold_scores} mean={np.mean(fold_scores):.5f}', flush=True)

# Fit on all data and predict test
model_full = pipe
t_fit = time.time()
model_full.fit(X, y)
print(f'[TAB] Full fit done in {time.time()-t_fit:.1f}s', flush=True)
test_pred = model_full.predict(X_test)

# Build submission
sub = pd.DataFrame({'id': test['id'], 'category_id': test_pred.astype(int)})
sub.to_csv('submission.csv', index=False)
print('[SUB] Saved submission.csv shape', sub.shape, ' unique classes:', sub['category_id'].nunique())
print('[DONE] Tabular baseline finished in {:.1f}s'.format(time.time()-t0))

[TAB] Fold 0 train=143521 val=35901




[TAB] Fold 0 macro-F1=0.40463 elapsed 15.2s


[TAB] Fold 1 train=143507 val=35915




[TAB] Fold 1 macro-F1=0.42253 elapsed 15.0s


[TAB] Fold 2 train=143596 val=35826




[TAB] Fold 2 macro-F1=0.33763 elapsed 15.0s


[TAB] Fold 3 train=143552 val=35870




[TAB] Fold 3 macro-F1=0.25582 elapsed 15.0s


[TAB] Fold 4 train=143512 val=35910




[TAB] Fold 4 macro-F1=0.38640 elapsed 14.8s


[TAB] OOF macro-F1=0.35691 folds=[0.404634700095836, 0.42252826065052496, 0.3376292851021045, 0.2558230431491976, 0.38640356421654315] mean=0.36140




[TAB] Full fit done in 21.2s


[SUB] Saved submission.csv shape (16877, 2)  unique classes: 11
[DONE] Tabular baseline finished in 125.3s




In [24]:
# Tabular OOF logits + seq-avg + per-class bias optimization, and improved submission
import numpy as np, pandas as pd, time
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
target = 'category_id'

def fe(df):
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df = df.copy()
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(-1).astype(int)
    df['loc'] = df['location'].astype(str)
    return df

train_fe = fe(train)
test_fe  = fe(test)
num_cols = ['frame_num','seq_num_frames','width','height','year','month','day','hour','is_night']
cat_cols = ['loc']
X = train_fe[num_cols + cat_cols]
y = train_fe[target].values
groups = train_fe['seq_id'].astype(str).values
X_test = test_fe[num_cols + cat_cols]

pre = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(with_mean=True, with_std=True), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=True), cat_cols),
    ]
)
clf = LogisticRegression(
    multi_class='multinomial',
    solver='saga',
    max_iter=200,
    C=1.0,
    class_weight='balanced',
    n_jobs=4,
)
pipe = Pipeline([('pre', pre), ('clf', clf)])

n_classes = int(train[target].nunique())
oof_logits = np.full((len(train), n_classes), np.nan, dtype=np.float32)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

print('[TAB2] Collecting OOF probabilities/logits with seq-aware CV...', flush=True)
for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X, y=y, groups=groups)):
    t_fold = time.time()
    model = pipe
    model.fit(X.iloc[tr_idx], y[tr_idx])
    proba = model.predict_proba(X.iloc[va_idx])
    # convert to logit-like space (use log-prob as surrogate logits)
    logits = np.log(np.clip(proba, 1e-8, 1.0))
    # sequence-average within this fold's val indices
    va_df = train_fe.iloc[va_idx][['seq_id']].reset_index(drop=True)
    logits_seq = logits.copy()
    # map from seq -> indices within va fold
    for sid, grp in va_df.groupby('seq_id').groups.items():
        idxs = np.array(list(grp))
        m = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = m
    oof_logits[va_idx] = logits_seq
    y_hat = logits_seq.argmax(axis=1)
    f1 = f1_score(y[va_idx], y_hat, average='macro')
    print(f'[TAB2] Fold {fold} seq-avg macro-F1={f1:.5f} elapsed {time.time()-t_fold:.1f}s', flush=True)

assert not np.isnan(oof_logits).any(), 'OOF logits have NaNs'

def optimize_biases(y_true, logits, n_iters=2, grid=np.linspace(-1.5, 1.5, 21)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]
            best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc:
                    best_sc = sc; best_c = d
            b[c] = best_c
            if best_c != bc:
                best = best_sc; improved = True
        if not improved:
            break
    return b, best

# Optimize per-class additive biases on OOF seq-averaged logits
y_true = y
b_opt, f1_oof = optimize_biases(y_true, oof_logits)
print(f"[TAB2] OOF seq-avg macro-F1 (with biases) = {f1_oof:.5f}")
print('[TAB2] Biases:', np.round(b_opt, 3))

# Fit on full data and produce test predictions
model_full = pipe
t_fit = time.time()
model_full.fit(X, y)
print(f'[TAB2] Full fit done in {time.time()-t_fit:.1f}s', flush=True)
proba_test = model_full.predict_proba(X_test)
logits_test = np.log(np.clip(proba_test, 1e-8, 1.0))

# Sequence-average test logits, then apply biases and predict one label per seq_id, broadcast to frames
test_df = test_fe[['id','seq_id']].copy()
logits_adj = logits_test.copy()
pred_seq = {}
for sid, idxs in test_df.groupby('seq_id').groups.items():
    idxs = np.array(list(idxs))
    m = logits_adj[idxs].mean(axis=0) + b_opt
    lab = int(m.argmax())
    pred_seq[sid] = lab

test_pred = test_df['seq_id'].map(pred_seq).astype(int).values
sub2 = pd.DataFrame({'id': test['id'], 'category_id': test_pred})
sub2.to_csv('submission_seq_bias.csv', index=False)
print('[SUB2] Saved submission_seq_bias.csv shape', sub2.shape, ' unique classes:', sub2['category_id'].nunique())
print('[DONE] Post-processed tabular submission in {:.1f}s'.format(time.time()-t0))

[TAB2] Collecting OOF probabilities/logits with seq-aware CV...






[TAB2] Fold 0 seq-avg macro-F1=0.08702 elapsed 14.8s






[TAB2] Fold 1 seq-avg macro-F1=0.08409 elapsed 14.6s






[TAB2] Fold 2 seq-avg macro-F1=0.09343 elapsed 14.7s






[TAB2] Fold 3 seq-avg macro-F1=0.08599 elapsed 14.9s






[TAB2] Fold 4 seq-avg macro-F1=0.06439 elapsed 14.5s


[TAB2] OOF seq-avg macro-F1 (with biases) = 0.10264
[TAB2] Biases: [ 1.5  -0.3   1.5  -1.5  -1.5  -1.5  -1.5  -1.35  0.6  -1.05  1.5   0.
 -0.75 -0.15]




[TAB2] Full fit done in 21.3s


[SUB2] Saved submission_seq_bias.csv shape (16877, 2)  unique classes: 11
[DONE] Post-processed tabular submission in 137.4s




In [25]:
# Overwrite submission.csv with sequence-avg+bias tuned predictions
import pandas as pd, os
src = 'submission_seq_bias.csv'
dst = 'submission.csv'
assert os.path.exists(src), f"Missing {src}"
sub = pd.read_csv(src)
sub.to_csv(dst, index=False)
print('[COPY] Wrote', dst, 'from', src, 'shape', sub.shape, 'unique classes', sub['category_id'].nunique())
print(sub.head())

[COPY] Wrote submission.csv from submission_seq_bias.csv shape (16877, 2) unique classes 11
                                     id  category_id
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b            9
1  599fbd89-23d2-11e8-a6a3-ec086b02610b            0
2  59fae563-23d2-11e8-a6a3-ec086b02610b            0
3  5a24a741-23d2-11e8-a6a3-ec086b02610b            0
4  59eab924-23d2-11e8-a6a3-ec086b02610b            9


In [26]:
# CatBoost metadata model with seq-avg and per-class bias tuning
import os, time, math, numpy as np, pandas as pd
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score

t0 = time.time()
print('[CB] Installing catboost if missing...', flush=True)
import subprocess, sys
subprocess.run([sys.executable, '-m', 'pip', 'install', 'catboost==1.2.5'], check=False)
from catboost import CatBoostClassifier, Pool

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
target = 'category_id'

def fe(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    # sequence context
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    # cyclical time
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24.0)
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24.0)
    return df

train_fe = fe(train)
test_fe  = fe(test)

num_cols = [
    'width','height','year','month','day','hour','is_night',
    'frame_num','seq_num_frames','frame_ratio','is_first','is_last','hour_sin','hour_cos'
]
cat_cols = ['location','rights_holder']
all_cols = num_cols + cat_cols

X_all = train_fe[all_cols].copy()
y_all = train_fe[target].astype(int).values
groups = train_fe['seq_id'].astype(str).values
X_test = test_fe[all_cols].copy()

cat_idx = [all_cols.index(c) for c in cat_cols]
n_classes = train[target].nunique()

def add_fold_safe_counts(X_tr, X_va, cols=('location','rights_holder')):
    X_tr = X_tr.copy(); X_va = X_va.copy()
    for c in cols:
        cnt = X_tr[c].value_counts()
        X_tr[f'cnt_{c}'] = X_tr[c].map(cnt).fillna(1).astype(int)
        X_va[f'cnt_{c}'] = X_va[c].map(cnt).fillna(1).astype(int)
    return X_tr, X_va

print('[CB] 5-fold SGKF by seq_id starting...', flush=True)
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_logits = np.full((len(train_fe), n_classes), np.nan, dtype=np.float32)

for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X_all, y=y_all, groups=groups)):
    t_fold = time.time()
    X_tr = X_all.iloc[tr_idx].copy(); X_va = X_all.iloc[va_idx].copy()
    y_tr = y_all[tr_idx]; y_va = y_all[va_idx]
    # fold-safe counts
    X_tr, X_va = add_fold_safe_counts(X_tr, X_va)
    # build Pools
    cat_features_idx = cat_idx + [X_tr.columns.get_loc('cnt_location'), X_tr.columns.get_loc('cnt_rights_holder')]
    train_pool = Pool(X_tr, label=y_tr, cat_features=cat_features_idx)
    valid_pool = Pool(X_va, label=y_va, cat_features=cat_features_idx)
    model = CatBoostClassifier(
        loss_function='MultiClass',
        eval_metric='MultiClass',
        iterations=2000,
        depth=8,
        learning_rate=0.05,
        l2_leaf_reg=6,
        auto_class_weights='Balanced',
        early_stopping_rounds=100,
        random_seed=42,
        task_type='CPU',
        verbose=False
    )
    model.fit(train_pool, eval_set=valid_pool, use_best_model=True, verbose=False)
    proba = model.predict_proba(valid_pool)
    logits = np.log(np.clip(np.asarray(proba), 1e-8, 1.0))
    # seq-average within val fold
    va_df = train_fe.iloc[va_idx][['seq_id']].reset_index(drop=True)
    logits_seq = logits.copy()
    for sid, grp in va_df.groupby('seq_id').groups.items():
        idxs = np.array(list(grp))
        m = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = m
    oof_logits[va_idx] = logits_seq
    f1 = f1_score(y_va, logits_seq.argmax(1), average='macro')
    print(f"[CB] Fold {fold} seq-avg macro-F1={f1:.5f} elapsed {time.time()-t_fold:.1f}s", flush=True)

assert not np.isnan(oof_logits).any(), 'NaNs in OOF logits'

def optimize_biases(y_true, logits, n_iters=2, grid=np.linspace(-1.5, 1.5, 21)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]; best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc:
                    best_sc = sc; best_c = d
            b[c] = best_c
            if best_c != bc:
                best = best_sc; improved = True
        if not improved: break
    return b, best

b_opt, f1_oof = optimize_biases(y_all, oof_logits)
print(f"[CB] OOF seq-avg macro-F1 (with biases) = {f1_oof:.5f}", flush=True)
print('[CB] Biases:', np.round(b_opt, 3))

# Fit on full data with counts (computed on full train for test mapping) and predict test
X_full = X_all.copy()
for c in ['location','rights_holder']:
    cnt = X_full[c].value_counts()
    X_full[f'cnt_{c}'] = X_full[c].map(cnt).fillna(1).astype(int)
X_test_full = X_test.copy()
for c in ['location','rights_holder']:
    cnt = X_full[c].value_counts()
    X_test_full[f'cnt_{c}'] = X_test_full[c].map(cnt).fillna(1).astype(int)
cat_features_idx_full = [X_full.columns.get_loc(c) for c in cat_cols] + [X_full.columns.get_loc('cnt_location'), X_full.columns.get_loc('cnt_rights_holder')]
pool_full = Pool(X_full, label=y_all, cat_features=cat_features_idx_full)
pool_test = Pool(X_test_full, cat_features=cat_features_idx_full)
model_full = CatBoostClassifier(
    loss_function='MultiClass',
    eval_metric='MultiClass',
    iterations=2000,
    depth=8,
    learning_rate=0.05,
    l2_leaf_reg=6,
    auto_class_weights='Balanced',
    early_stopping_rounds=100,
    random_seed=42,
    task_type='CPU',
    verbose=False
)
print('[CB] Fitting full model...', flush=True)
model_full.fit(pool_full, verbose=False)
proba_test = model_full.predict_proba(pool_test)
logits_test = np.log(np.clip(np.asarray(proba_test), 1e-8, 1.0))

# Sequence-average test logits, apply biases, predict one label per seq_id, broadcast
test_df = test_fe[['id','seq_id']].copy()
pred_seq = {}
for sid, idxs in test_df.groupby('seq_id').groups.items():
    idxs = np.array(list(idxs))
    m = logits_test[idxs].mean(axis=0) + b_opt
    pred_seq[sid] = int(m.argmax())
test_pred = test_df['seq_id'].map(pred_seq).astype(int).values
sub_cb = pd.DataFrame({'id': test['id'], 'category_id': test_pred})
sub_cb.to_csv('submission_cat_seq_bias.csv', index=False)
print('[SUB-CB] Saved submission_cat_seq_bias.csv shape', sub_cb.shape, 'unique classes', sub_cb['category_id'].nunique())
print('[CB] Done in {:.1f}s'.format(time.time()-t0), flush=True)

[CB] Installing catboost if missing...


Collecting catboost==1.2.5
  Downloading catboost-1.2.5-cp311-cp311-manylinux2014_x86_64.whl (98.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.2/98.2 MB 140.8 MB/s eta 0:00:00
Collecting plotly
  Downloading plotly-6.3.0-py3-none-any.whl (9.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.8/9.8 MB 104.8 MB/s eta 0:00:00


Collecting scipy
  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 126.2 MB/s eta 0:00:00


Collecting pandas>=0.24
  Downloading pandas-2.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.4/12.4 MB 160.6 MB/s eta 0:00:00
Collecting six
  Downloading six-1.17.0-py2.py3-none-any.whl (11 kB)
Collecting graphviz
  Downloading graphviz-0.21-py3-none-any.whl (47 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 47.3/47.3 KB 397.1 MB/s eta 0:00:00


Collecting matplotlib
  Downloading matplotlib-3.10.6-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.7/8.7 MB 288.7 MB/s eta 0:00:00


Collecting numpy>=1.16.0
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 140.4 MB/s eta 0:00:00


Collecting python-dateutil>=2.8.2
  Downloading python_dateutil-2.9.0.post0-py2.py3-none-any.whl (229 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 229.9/229.9 KB 478.2 MB/s eta 0:00:00
Collecting tzdata>=2022.7
  Downloading tzdata-2025.2-py2.py3-none-any.whl (347 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 347.8/347.8 KB 395.8 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2025.2-py2.py3-none-any.whl (509 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 509.2/509.2 KB 526.3 MB/s eta 0:00:00
Collecting cycler>=0.10
  Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)


Collecting pillow>=8
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 35.0 MB/s eta 0:00:00


Collecting fonttools>=4.22.0
  Downloading fonttools-4.60.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (5.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.0/5.0 MB 265.1 MB/s eta 0:00:00
Collecting contourpy>=1.0.1
  Downloading contourpy-1.3.3-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (355 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 355.2/355.2 KB 492.2 MB/s eta 0:00:00


Collecting kiwisolver>=1.3.1
  Downloading kiwisolver-1.4.9-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 465.6 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 396.6 MB/s eta 0:00:00
Collecting pyparsing>=2.3.1
  Downloading pyparsing-3.2.5-py3-none-any.whl (113 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 113.9/113.9 KB 453.9 MB/s eta 0:00:00
Collecting narwhals>=1.15.1
  Downloading narwhals-2.5.0-py3-none-any.whl (407 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 407.3/407.3 KB 327.2 MB/s eta 0:00:00


Installing collected packages: pytz, tzdata, six, pyparsing, pillow, packaging, numpy, narwhals, kiwisolver, graphviz, fonttools, cycler, scipy, python-dateutil, plotly, contourpy, pandas, matplotlib, catboost


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
albumentations 1.4.10 requires albucore>=0.0.11, which is not installed.


Successfully installed catboost-1.2.5 contourpy-1.3.3 cycler-0.12.1 fonttools-4.60.0 graphviz-0.21 kiwisolver-1.4.9 matplotlib-3.10.6 narwhals-2.5.0 numpy-1.26.4 packaging-25.0 pandas-2.3.2 pillow-11.3.0 plotly-6.3.0 pyparsing-3.2.5 python-dateutil-2.9.0.post0 pytz-2025.2 scipy-1.16.2 six-1.17.0 tzdata-2025.2




[CB] 5-fold SGKF by seq_id starting...


[CB] Fold 0 seq-avg macro-F1=0.09345 elapsed 379.1s


[CB] Fold 1 seq-avg macro-F1=0.09262 elapsed 402.5s


[CB] Fold 2 seq-avg macro-F1=0.09276 elapsed 457.5s


[CB] Fold 3 seq-avg macro-F1=0.09357 elapsed 368.5s


[CB] Fold 4 seq-avg macro-F1=0.09397 elapsed 285.2s


[CB] OOF seq-avg macro-F1 (with biases) = 0.10027


[CB] Biases: [ 1.35 -0.75  0.6  -1.5  -1.5  -1.35 -1.5  -1.05  1.5  -0.15  0.45  1.5
  1.05 -1.5 ]
[CB] Fitting full model...


[SUB-CB] Saved submission_cat_seq_bias.csv shape (16877, 2) unique classes 13
[CB] Done in 2641.5s


In [27]:
# Overwrite submission.csv with CatBoost seq-avg+bias tuned predictions
import pandas as pd, os
src = 'submission_cat_seq_bias.csv'
dst = 'submission.csv'
assert os.path.exists(src), f"Missing {src}"
sub = pd.read_csv(src)
sub.to_csv(dst, index=False)
print('[COPY-CB] Wrote', dst, 'from', src, 'shape', sub.shape, 'unique classes', sub['category_id'].nunique())
print(sub.head())

[COPY-CB] Wrote submission.csv from submission_cat_seq_bias.csv shape (16877, 2) unique classes 13
                                     id  category_id
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b            0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b            0
2  59fae563-23d2-11e8-a6a3-ec086b02610b            8
3  5a24a741-23d2-11e8-a6a3-ec086b02610b            0
4  59eab924-23d2-11e8-a6a3-ec086b02610b            0


In [39]:
# CatBoost with fold-safe priors/TE + optional cheap image features, seq-avg + bias tuning
import numpy as np, pandas as pd, time, os
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score
from catboost import CatBoostClassifier, Pool

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
target = 'category_id'
n_classes = int(train[target].nunique())

def fe_base(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['doy']  = dt.dt.dayofyear.fillna(1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    # cyclical time
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24.0)
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24.0)
    df['doy_sin']  = np.sin(2*np.pi*df['doy']/366.0)
    df['doy_cos']  = np.cos(2*np.pi*df['doy']/366.0)
    # hour bins for loc x hour interactions
    bins = [-1,3,7,11,15,19,23]
    labels = [0,1,2,3,4,5]
    df['hour_bin'] = pd.cut(df['hour'], bins=bins, labels=labels, include_lowest=True).astype(int)
    # aspect ratio
    df['aspect'] = (df['width'] / df['height']).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return df

def entropy_from_probs(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def m_estimate_prior(counts, total, pg, m):
    return (counts + m * pg) / (total + m)

def build_group_priors(train_idx_df, key_col, classes, m):
    g = {}
    grp = train_idx_df.groupby([key_col, 'category_id']).size().unstack(fill_value=0)
    for c in classes:
        if c not in grp.columns:
            grp[c] = 0
    grp = grp[classes]
    total_counts = grp.sum(axis=1).astype(int)
    pg = train_idx_df['category_id'].value_counts(normalize=True).reindex(classes).fillna(0).values
    for key, row in grp.iterrows():
        cnts = row.values.astype(float)
        n = int(total_counts.loc[key])
        p = m_estimate_prior(cnts, n, pg, m)
        ent = entropy_from_probs(p)
        g[key] = (p, n, ent)
    p_global = pg.copy(); ent_global = entropy_from_probs(p_global)
    return g, p_global, ent_global

def map_group_priors(df_in, key_col, prior_map, p_global, ent_global, prefix, classes):
    df = df_in.copy()
    probs_mat = np.zeros((len(df), len(classes)), dtype=np.float32)
    counts = np.zeros(len(df), dtype=np.int32)
    ents = np.zeros(len(df), dtype=np.float32)
    key_vals = df[key_col].values
    for i, k in enumerate(key_vals):
        tpl = prior_map.get(k)
        if tpl is None:
            probs_mat[i] = p_global; counts[i] = 0; ents[i] = ent_global
        else:
            p, n, e = tpl; probs_mat[i] = p; counts[i] = n; ents[i] = e
    for j, c in enumerate(classes):
        df[f'{prefix}_p_{c}'] = probs_mat[:, j]
    df[f'{prefix}_count'] = np.log1p(counts)
    df[f'{prefix}_entropy'] = ents
    return df

def build_loc_hour_entropy(train_idx_df, m=300):
    key = train_idx_df['location'].astype(str) + '|' + train_idx_df['hour_bin'].astype(str)
    grp = train_idx_df.assign(k=key).groupby(['k','category_id']).size().unstack(fill_value=0)
    classes = sorted(train_idx_df['category_id'].unique().tolist())
    for c in classes:
        if c not in grp.columns: grp[c] = 0
    grp = grp[classes]
    total_counts = grp.sum(axis=1).astype(int)
    pg = train_idx_df['category_id'].value_counts(normalize=True).reindex(classes).fillna(0).values
    ent_map = {}
    for k, row in grp.iterrows():
        cnts = row.values.astype(float); n = int(total_counts.loc[k])
        p = m_estimate_prior(cnts, n, pg, m)
        ent_map[k] = (entropy_from_probs(p), n)
    ent_global = entropy_from_probs(pg)
    return ent_map, ent_global

def map_loc_hour_entropy(df_in, ent_map, ent_global):
    df = df_in.copy()
    k = df['location'].astype(str) + '|' + df['hour_bin'].astype(str)
    ents = np.zeros(len(df), dtype=np.float32)
    cnts = np.zeros(len(df), dtype=np.int32)
    for i, key in enumerate(k.values):
        tpl = ent_map.get(key)
        if tpl is None:
            ents[i] = ent_global; cnts[i] = 0
        else:
            e, n = tpl; ents[i] = e; cnts[i] = n
    df['loc_hour_entropy'] = ents
    df['loc_hour_count'] = np.log1p(cnts)
    return df

def maybe_merge_img_feats(df_tr, df_te):
    img_tr_path = 'img_feats_train.csv'; img_te_path = 'img_feats_test.csv'
    if os.path.exists(img_tr_path) and os.path.exists(img_te_path):
        fe_tr = pd.read_csv(img_tr_path); fe_te = pd.read_csv(img_te_path)
        df_tr = df_tr.merge(fe_tr, on='id', how='left')
        df_te = df_te.merge(fe_te, on='id', how='left')
        print('[CB-P] Merged image feats:', fe_tr.shape, fe_te.shape, flush=True)
    else:
        print('[CB-P] Image feats not found; proceeding without.', flush=True)
    return df_tr, df_te

train_fe = fe_base(train)
test_fe  = fe_base(test)
classes_all = sorted(train[target].unique().tolist())

# Ensure id column exists for merging image features
train_fe['id'] = train['id'].astype(str).values
test_fe['id']  = test['id'].astype(str).values
train_fe, test_fe = maybe_merge_img_feats(train_fe, test_fe)

img_num = [c for c in ['laplacian_var','hsv_s_mean','gray_mean','gray_std','hsv_v_mean','file_size_kb'] if c in train_fe.columns]
base_num = ['width','height','aspect','year','month','day','hour','doy','is_night','frame_num','seq_num_frames','frame_ratio','is_first','is_last','hour_sin','hour_cos','doy_sin','doy_cos'] + img_num
cat_cols = ['rights_holder']

X_all = train_fe[['location','rights_holder','hour_bin','id'] + base_num].copy()
y_all = train_fe[target].astype(int).values
groups = train_fe['seq_id'].astype(str).values
X_test_base = test_fe[['location','rights_holder','hour_bin','id'] + base_num].copy()

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_logits = np.full((len(train_fe), n_classes), np.nan, dtype=np.float32)

print('[CB-P] 5-fold with fold-safe priors/entropy...', flush=True)
for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X_all, y=y_all, groups=groups)):
    t_fold = time.time()
    # Build priors on train fold
    idx_cols = ['location','hour_bin','category_id']
    loc_map, loc_pg, loc_entg = build_group_priors(train_fe.iloc[tr_idx][['location','category_id']], 'location', classes_all, m=100)
    rh_map,  rh_pg,  rh_entg  = build_group_priors(train_fe.iloc[tr_idx][['rights_holder','category_id']], 'rights_holder', classes_all, m=50)
    lxh_map, lxh_entg = build_loc_hour_entropy(train_fe.iloc[tr_idx][idx_cols].copy(), m=300)

    X_tr = X_all.iloc[tr_idx].copy()
    X_va = X_all.iloc[va_idx].copy()

    # Fold-safe imputation for image features: median on train, apply to val
    for c in img_num:
        med = X_tr[c].median() if np.isfinite(X_tr[c]).any() else 0.0
        X_tr[c] = X_tr[c].fillna(med)
        X_va[c] = X_va[c].fillna(med)

    X_tr = map_group_priors(X_tr, 'location', loc_map, loc_pg, loc_entg, 'loc', classes_all)
    X_va = map_group_priors(X_va, 'location', loc_map, loc_pg, loc_entg, 'loc', classes_all)
    X_tr = map_group_priors(X_tr, 'rights_holder', rh_map, rh_pg, rh_entg, 'rh', classes_all)
    X_va = map_group_priors(X_va, 'rights_holder', rh_map, rh_pg, rh_entg, 'rh', classes_all)
    X_tr = map_loc_hour_entropy(X_tr, lxh_map, lxh_entg)
    X_va = map_loc_hour_entropy(X_va, lxh_map, lxh_entg)

    use_cols = base_num + [
        'loc_count','loc_entropy','rh_count','rh_entropy','loc_hour_entropy','loc_hour_count'
    ] + [f'loc_p_{c}' for c in classes_all] + [f'rh_p_{c}' for c in classes_all] + cat_cols
    X_tr_use = X_tr[use_cols].copy()
    X_va_use = X_va[use_cols].copy()
    cat_idx = [use_cols.index(c) for c in cat_cols]
    train_pool = Pool(X_tr_use, label=y_all[tr_idx], cat_features=cat_idx)
    valid_pool = Pool(X_va_use, label=y_all[va_idx], cat_features=cat_idx)
    model = CatBoostClassifier(
        loss_function='MultiClass',
        eval_metric='MultiClass',
        iterations=2500,
        depth=8,
        learning_rate=0.05,
        l2_leaf_reg=10,
        random_strength=4,
        subsample=0.8,
        rsm=0.8,
        bootstrap_type='Bernoulli',
        auto_class_weights='Balanced',
        early_stopping_rounds=150,
        random_seed=42,
        task_type='CPU',
        verbose=False
    )
    model.fit(train_pool, eval_set=valid_pool, use_best_model=True, verbose=False)
    proba = model.predict_proba(valid_pool)
    logits = np.log(np.clip(np.asarray(proba), 1e-8, 1.0))
    # seq-avg within val fold
    va_seq = train_fe.iloc[va_idx]['seq_id'].values
    logits_seq = logits.copy()
    from collections import defaultdict
    gmap = defaultdict(list)
    for i, sid in enumerate(va_seq):
        gmap[sid].append(i)
    for idxs in gmap.values():
        mlog = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = mlog
    oof_logits[va_idx] = logits_seq
    f1 = f1_score(y_all[va_idx], logits_seq.argmax(1), average='macro')
    print(f"[CB-P] Fold {fold} seq-avg macro-F1={f1:.5f} elapsed {time.time()-t_fold:.1f}s", flush=True)

assert not np.isnan(oof_logits).any(), 'NaNs in OOF logits'

def optimize_biases(y_true, logits, n_iters=3, grid=np.linspace(-2.0, 2.0, 41)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]; best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc:
                    best_sc = sc; best_c = d
            b[c] = best_c
            if best_c != bc:
                best = best_sc; improved = True
        if not improved: break
    return b, best

b_opt, f1_oof = optimize_biases(y_all, oof_logits)
print(f"[CB-P] OOF seq-avg macro-F1 (with biases) = {f1_oof:.5f}", flush=True)
print('[CB-P] Biases:', np.round(b_opt, 3))

# Fit full model with priors for test mapping
print('[CB-P] Fitting full model and predicting test...', flush=True)
X_full = X_all.copy()
X_test = X_test_base.copy()
for c in img_num:
    med = X_full[c].median() if np.isfinite(X_full[c]).any() else 0.0
    X_full[c] = X_full[c].fillna(med)
    X_test[c] = X_test[c].fillna(med)

loc_map_full, loc_pg_full, loc_entg_full = build_group_priors(train_fe[['location','category_id']], 'location', classes_all, m=100)
rh_map_full,  rh_pg_full,  rh_entg_full  = build_group_priors(train_fe[['rights_holder','category_id']], 'rights_holder', classes_all, m=50)
lxh_map_full, lxh_entg_full = build_loc_hour_entropy(train_fe[['location','hour_bin','category_id']].copy(), m=300)
X_full = map_group_priors(X_full, 'location', loc_map_full, loc_pg_full, loc_entg_full, 'loc', classes_all)
X_full = map_group_priors(X_full, 'rights_holder', rh_map_full,  rh_pg_full,  rh_entg_full,  'rh', classes_all)
X_full = map_loc_hour_entropy(X_full, lxh_map_full, lxh_entg_full)
X_test = map_group_priors(X_test, 'location', loc_map_full, loc_pg_full, loc_entg_full, 'loc', classes_all)
X_test = map_group_priors(X_test, 'rights_holder', rh_map_full,  rh_pg_full,  rh_entg_full,  'rh', classes_all)
X_test = map_loc_hour_entropy(X_test, lxh_map_full, lxh_entg_full)
use_cols = base_num + ['loc_count','loc_entropy','rh_count','rh_entropy','loc_hour_entropy','loc_hour_count'] + [f'loc_p_{c}' for c in classes_all] + [f'rh_p_{c}' for c in classes_all] + cat_cols
cat_idx_full = [use_cols.index(c) for c in cat_cols]
pool_full = Pool(X_full[use_cols], label=y_all, cat_features=cat_idx_full)
pool_test = Pool(X_test[use_cols], cat_features=cat_idx_full)
model_full = CatBoostClassifier(
    loss_function='MultiClass',
    eval_metric='MultiClass',
    iterations=2500,
    depth=8,
    learning_rate=0.05,
    l2_leaf_reg=10,
    random_strength=4,
    subsample=0.8,
    rsm=0.8,
    bootstrap_type='Bernoulli',
    auto_class_weights='Balanced',
    early_stopping_rounds=150,
    random_seed=42,
    task_type='CPU',
    verbose=False
)
model_full.fit(pool_full, verbose=False)
proba_test = model_full.predict_proba(pool_test)
logits_test = np.log(np.clip(np.asarray(proba_test), 1e-8, 1.0))

# Sequence-average per seq and apply biases
test_df = test_fe[['id','seq_id']].copy()
from collections import defaultdict
seq_map = defaultdict(list)
for i, sid in enumerate(test_df['seq_id'].values):
    seq_map[sid].append(i)
pred_seq = {}
for sid, idxs in seq_map.items():
    m = logits_test[idxs].mean(axis=0) + b_opt
    pred_seq[sid] = int(np.argmax(m))
test_pred = test_df['seq_id'].map(pred_seq).astype(int).values
# Use test_df['id'] to ensure same length as predictions to avoid mismatches; later mapping cell aligns strictly to test ids
sub_cbp = pd.DataFrame({'id': test_df['id'].astype(str).values, 'category_id': test_pred})
sub_cbp.to_csv('submission_cb_priors_seq_bias.csv', index=False)
print('[SUB-CB-P] Saved submission_cb_priors_seq_bias.csv', sub_cbp.shape, 'unique classes', sub_cbp['category_id'].nunique())
print('[CB-P] Done in {:.1f}s'.format(time.time()-t0), flush=True)

[CB-P] Merged image feats: (179422, 7) (16877, 7)


[CB-P] 5-fold with fold-safe priors/entropy...


[CB-P] Fold 0 seq-avg macro-F1=0.09273 elapsed 114.4s


[CB-P] Fold 1 seq-avg macro-F1=0.09286 elapsed 129.5s


[CB-P] Fold 2 seq-avg macro-F1=0.09378 elapsed 119.4s


[CB-P] Fold 3 seq-avg macro-F1=0.09248 elapsed 126.8s


[CB-P] Fold 4 seq-avg macro-F1=0.09351 elapsed 121.4s


[CB-P] OOF seq-avg macro-F1 (with biases) = 0.10246


[CB-P] Biases: [ 1.3 -0.8  0.4 -2.  -2.  -0.9 -2.  -0.4  2.  -1.9  1.6  2.   0.5  0. ]
[CB-P] Fitting full model and predicting test...


[SUB-CB-P] Saved submission_cb_priors_seq_bias.csv (16937, 2) unique classes 5
[CB-P] Done in 899.1s


In [28]:
# FIX submission: map class indices to original category_id labels and regenerate submission.csv
import numpy as np, pandas as pd
from catboost import Pool

# Rebuild features exactly as in Cell 15 to get test Pool; reuse model_full and b_opt from memory
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
classes_all = sorted(train['category_id'].unique().tolist())

def fe(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24.0)
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24.0)
    return df

train_fe = fe(train)
test_fe  = fe(test)

num_cols = [
    'width','height','year','month','day','hour','is_night',
    'frame_num','seq_num_frames','frame_ratio','is_first','is_last','hour_sin','hour_cos'
]
cat_cols = ['location','rights_holder']
all_cols = num_cols + cat_cols

X_full = train_fe[all_cols].copy()
y_all = train_fe['category_id'].astype(int).values
X_test_full = test_fe[all_cols].copy()

# add count features as in Cell 15
for c in ['location','rights_holder']:
    cnt = X_full[c].value_counts()
    X_full[f'cnt_{c}'] = X_full[c].map(cnt).fillna(1).astype(int)
for c in ['location','rights_holder']:
    cnt = X_full[c].value_counts()
    X_test_full[f'cnt_{c}'] = X_test_full[c].map(cnt).fillna(1).astype(int)

use_cols = all_cols + ['cnt_location','cnt_rights_holder']
cat_features_idx_full = [use_cols.index(c) for c in cat_cols] + [use_cols.index('cnt_location'), use_cols.index('cnt_rights_holder')]

# Build Pools for prediction with the same cat feature indices
pool_test = Pool(X_test_full[use_cols], cat_features=cat_features_idx_full)

# Predict probabilities with trained model_full (from Cell 15)
proba_test = model_full.predict_proba(pool_test)
logits_test = np.log(np.clip(np.asarray(proba_test), 1e-8, 1.0))

# Sequence-average logits and apply learned biases b_opt (from Cell 15), then map argmax indices to original labels
from collections import defaultdict
seq_map = defaultdict(list)
for i, sid in enumerate(test_fe['seq_id'].values):
    seq_map[sid].append(i)
pred_seq = {}
for sid, idxs in seq_map.items():
    m = logits_test[idxs].mean(axis=0) + b_opt
    idx = int(np.argmax(m))
    pred_seq[sid] = classes_all[idx]  # map index -> original category_id label

test_pred = test_fe['seq_id'].map(pred_seq).astype(int).values
sub_fix = pd.DataFrame({'id': test['id'], 'category_id': test_pred})
sub_fix.to_csv('submission.csv', index=False)
print('[SUB-FIX] Saved submission.csv (mapped to original labels) shape', sub_fix.shape, 'unique classes', sub_fix['category_id'].nunique())
print(sub_fix.head())

[SUB-FIX] Saved submission.csv (mapped to original labels) shape (16877, 2) unique classes 13
                                     id  category_id
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b            0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b            0
2  59fae563-23d2-11e8-a6a3-ec086b02610b           14
3  5a24a741-23d2-11e8-a6a3-ec086b02610b            0
4  59eab924-23d2-11e8-a6a3-ec086b02610b            0


In [29]:
# Validate submission format against sample_submission
import pandas as pd
import numpy as np
ss = pd.read_csv('sample_submission.csv')
sub = pd.read_csv('submission.csv')
print('[CHK] sample_submission shape:', ss.shape, 'columns:', list(ss.columns))
print('[CHK] submission shape:', sub.shape, 'columns:', list(sub.columns))
print('[CHK] sample head:', ss.head().to_dict('records')[:3])
print('[CHK] sub head:', sub.head().to_dict('records')[:3])
# Check column names exact match
cols_match = list(ss.columns) == list(sub.columns)
print('[CHK] Columns match exactly:', cols_match)
# Check id coverage and order
ss_ids = ss['id'].astype(str)
sub_ids = sub['id'].astype(str)
missing = set(ss_ids) - set(sub_ids)
extra = set(sub_ids) - set(ss_ids)
print('[CHK] missing ids in sub:', len(missing))
print('[CHK] extra ids in sub:', len(extra))
if (not cols_match) or (len(missing) > 0) or (len(extra) > 0) or (len(ss) != len(sub)):
    print('[FIX] Rebuilding submission to match sample_submission id order and columns...')
    # If sub has different column names, rename accordingly
    sub_renamed = sub.copy()
    # Ensure correct columns names
    sub_renamed.columns = ['id','category_id']
    # Merge to sample order
    merged = ss[['id']].merge(sub_renamed, on='id', how='left')
    # If any ids missing, fill with mode of train category_id (fallback 0)
    if merged['category_id'].isna().any():
        try:
            tr = pd.read_csv('train.csv')
            fallback = int(tr['category_id'].mode().iloc[0])
        except Exception:
            fallback = 0
        merged['category_id'] = merged['category_id'].fillna(fallback).astype(int)
    merged.to_csv('submission.csv', index=False)
    print('[FIX] Wrote aligned submission.csv shape', merged.shape)
else:
    # Ensure dtypes
    sub['category_id'] = sub['category_id'].astype(int)
    # Reorder to sample order just in case
    sub_aligned = ss[['id']].merge(sub, on='id', how='left')
    sub_aligned.to_csv('submission.csv', index=False)
    print('[OK] submission.csv already matches; re-saved aligned order. shape', sub_aligned.shape)
print('[DONE] Validation complete.')

[CHK] sample_submission shape: (16877, 3) columns: ['Unnamed: 0', 'Id', 'Category']
[CHK] submission shape: (16877, 2) columns: ['id', 'category_id']
[CHK] sample head: [{'Unnamed: 0': 0, 'Id': '5998cfa4-23d2-11e8-a6a3-ec086b02610b', 'Category': 0}, {'Unnamed: 0': 4, 'Id': '599fbd89-23d2-11e8-a6a3-ec086b02610b', 'Category': 0}, {'Unnamed: 0': 6, 'Id': '59fae563-23d2-11e8-a6a3-ec086b02610b', 'Category': 0}]
[CHK] sub head: [{'id': '5998cfa4-23d2-11e8-a6a3-ec086b02610b', 'category_id': 0}, {'id': '599fbd89-23d2-11e8-a6a3-ec086b02610b', 'category_id': 0}, {'id': '59fae563-23d2-11e8-a6a3-ec086b02610b', 'category_id': 14}]
[CHK] Columns match exactly: False


KeyError: 'id'

In [30]:
# Rebuild submission.csv to match competition headers exactly: ['Id','Category'] and sample order
import pandas as pd
ss = pd.read_csv('sample_submission.csv')
pred = pd.read_csv('submission.csv')  # current predictions with columns ['id','category_id']
pred = pred.rename(columns={'id':'Id', 'category_id':'Category'})
# Align to sample order and columns; drop any extra columns
sub_aligned = ss[['Id']].merge(pred[['Id','Category']], on='Id', how='left')
# Fill any missing categories with majority class 0 fallback (safe default)
if sub_aligned['Category'].isna().any():
    sub_aligned['Category'] = sub_aligned['Category'].fillna(0).astype(int)
sub_aligned = sub_aligned[['Id','Category']].copy()
sub_aligned.to_csv('submission.csv', index=False)
print('[FIX-FMT] Wrote submission.csv with columns', list(sub_aligned.columns), 'shape', sub_aligned.shape)
print(sub_aligned.head())

[FIX-FMT] Wrote submission.csv with columns ['Id', 'Category'] shape (16937, 2)
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b         0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b        14
3  5a24a741-23d2-11e8-a6a3-ec086b02610b         0
4  59eab924-23d2-11e8-a6a3-ec086b02610b         0


In [31]:
# Ensure submission.csv exactly matches test ids and required headers
import pandas as pd
test = pd.read_csv('test.csv')
pred = pd.read_csv('submission.csv')  # could be with either header style

# Normalize column names
cols = {c.lower(): c for c in pred.columns}
if 'id' in cols and 'category' in cols:
    pred = pred.rename(columns={cols['id']:'Id', cols['category']:'Category'})
elif 'id' in cols and 'category_id' in cols:
    pred = pred.rename(columns={cols['id']:'Id', cols['category_id']:'Category'})
elif 'Id' in pred.columns and 'Category' in pred.columns:
    pass
else:
    raise RuntimeError(f'Unexpected submission columns: {list(pred.columns)}')

# Align to test ids and order
sub = pd.DataFrame({'Id': test['id'].astype(str)})
pred['Id'] = pred['Id'].astype(str)
sub = sub.merge(pred[['Id','Category']], on='Id', how='left')

# Fill any missing with majority class 0 (safe fallback)
sub['Category'] = sub['Category'].fillna(0).astype(int)

sub.to_csv('submission.csv', index=False)
print('[FINAL-SUB] submission.csv shape', sub.shape, 'columns', list(sub.columns), 'nunique Category', sub['Category'].nunique())
print(sub.head())

[FINAL-SUB] submission.csv shape (17177, 2) columns ['Id', 'Category'] nunique Category 13
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b         0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b        14
3  5a24a741-23d2-11e8-a6a3-ec086b02610b         0
4  59eab924-23d2-11e8-a6a3-ec086b02610b         0


In [32]:
# Rebuild final submission from CatBoost predictions with proper label mapping and exact length == len(test)
import pandas as pd, numpy as np
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
pred_raw = pd.read_csv('submission_cat_seq_bias.csv')  # columns: id, category_id (indices 0..13)

# Map class indices -> original category_id labels
classes_all = sorted(train['category_id'].unique().tolist())
idx2label = {i: lab for i, lab in enumerate(classes_all)}

# Ensure one row per test id in predictions
pred_raw = pred_raw.drop_duplicates(subset=['id'], keep='first')
pred_raw['mapped'] = pred_raw['category_id'].map(idx2label).astype(int)

# Align strictly to test ids (no row count change, no duplication)
sub = pd.DataFrame({'Id': test['id'].astype(str)})
m = test[['id']].merge(pred_raw[['id','mapped']], on='id', how='left')
if m['mapped'].isna().any():
    # fallback to majority class 0 if any missing
    m['mapped'] = m['mapped'].fillna(0).astype(int)
sub['Category'] = m['mapped'].astype(int).values

# Final validations
assert len(sub) == len(test), f'Row count mismatch: {len(sub)} vs {len(test)}'
assert set(sub.columns) == {'Id','Category'}

sub.to_csv('submission.csv', index=False)
print('[FINAL-SUB-MAP] submission.csv shape', sub.shape, 'columns', list(sub.columns), 'nunique Category', sub['Category'].nunique())
print(sub.head())

[FINAL-SUB-MAP] submission.csv shape (16877, 2) columns ['Id', 'Category'] nunique Category 13
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b         0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b        14
3  5a24a741-23d2-11e8-a6a3-ec086b02610b         0
4  59eab924-23d2-11e8-a6a3-ec086b02610b         0


In [35]:
# Map advanced CB priors submission (indices) to original labels and build final submission.csv
import pandas as pd, numpy as np
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
pred_raw = pd.read_csv('submission_cb_priors_seq_bias.csv')  # columns: id, category_id (indices 0..13)

# Map class indices -> original category_id labels
classes_all = sorted(train['category_id'].unique().tolist())
idx2label = {i: lab for i, lab in enumerate(classes_all)}

# Ensure one row per test id
pred_raw = pred_raw.drop_duplicates(subset=['id'], keep='first')
pred_raw['mapped'] = pred_raw['category_id'].map(idx2label).astype(int)

# Align strictly to test ids
sub = pd.DataFrame({'Id': test['id'].astype(str)})
m = test[['id']].merge(pred_raw[['id','mapped']], on='id', how='left')
if m['mapped'].isna().any():
    m['mapped'] = m['mapped'].fillna(0).astype(int)
sub['Category'] = m['mapped'].astype(int).values

assert len(sub) == len(test), f'Row count mismatch: {len(sub)} vs {len(test)}'
assert set(sub.columns) == {'Id','Category'}
sub.to_csv('submission.csv', index=False)
print('[FINAL-SUB-ADV] submission.csv shape', sub.shape, 'columns', list(sub.columns), 'nunique Category', sub['Category'].nunique())
print(sub.head())

[FINAL-SUB-ADV] submission.csv shape (16877, 2) columns ['Id', 'Category'] nunique Category 5
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b         0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b         0
3  5a24a741-23d2-11e8-a6a3-ec086b02610b         0
4  59eab924-23d2-11e8-a6a3-ec086b02610b         0


In [36]:
# Cheap CPU image features (laplacian_var, hsv_s_mean, gray_mean/std, v_mean, file_size_kb)
import os, cv2, numpy as np, pandas as pd, time
from joblib import Parallel, delayed

cv2.setNumThreads(0)

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')

def compute_feats(img_path, max_side=128):
    try:
        img = cv2.imread(img_path, cv2.IMREAD_COLOR)
        if img is None:
            return (np.nan, np.nan, np.nan, np.nan, np.nan)
        h, w = img.shape[:2]
        s = max(h, w)
        if s > max_side and s > 0:
            scale = max_side / float(s)
            nh, nw = max(1, int(h*scale)), max(1, int(w*scale))
            img = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_AREA)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        lap = cv2.Laplacian(gray, cv2.CV_64F)
        lap_var = float(lap.var())
        hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
        s_mean = float(hsv[...,1].mean())
        v_mean = float(hsv[...,2].mean())
        g_mean = float(gray.mean())
        g_std  = float(gray.std())
        return (lap_var, s_mean, g_mean, g_std, v_mean)
    except Exception:
        return (np.nan, np.nan, np.nan, np.nan, np.nan)

def process_df(df, img_dir):
    paths = [os.path.join(img_dir, fn) for fn in df['file_name'].tolist()]
    sizes = []
    for p in paths:
        try:
            sizes.append(os.path.getsize(p) / 1024.0)
        except Exception:
            sizes.append(np.nan)
    t0 = time.time()
    feats = Parallel(n_jobs=8, prefer='threads', batch_size=64)(delayed(compute_feats)(p) for p in paths)
    dt = time.time() - t0
    print(f'[IMG-FE] processed {len(paths)} images from {img_dir} in {dt/60:.2f} min')
    feats = np.asarray(feats, dtype=np.float32)
    out = pd.DataFrame({
        'id': df['id'].astype(str).values,
        'laplacian_var': feats[:,0],
        'hsv_s_mean': feats[:,1],
        'gray_mean': feats[:,2],
        'gray_std': feats[:,3],
        'hsv_v_mean': feats[:,4],
        'file_size_kb': np.array(sizes, dtype=np.float32)
    })
    return out

print('[IMG-FE] Starting feature extraction (train)...', flush=True)
fe_tr = process_df(train[['id','file_name']].copy(), 'train_images')
fe_tr.to_csv('img_feats_train.csv', index=False)
print('[IMG-FE] Saved img_feats_train.csv', fe_tr.shape, flush=True)

print('[IMG-FE] Starting feature extraction (test)...', flush=True)
fe_te = process_df(test[['id','file_name']].copy(), 'test_images')
fe_te.to_csv('img_feats_test.csv', index=False)
print('[IMG-FE] Saved img_feats_test.csv', fe_te.shape, flush=True)

print('[IMG-FE] Done.')

[IMG-FE] Starting feature extraction (train)...


[IMG-FE] processed 179422 images from train_images in 1.79 min


[IMG-FE] Saved img_feats_train.csv (179422, 7)


[IMG-FE] Starting feature extraction (test)...


[IMG-FE] processed 16877 images from test_images in 0.15 min
[IMG-FE] Saved img_feats_test.csv (16877, 7)


[IMG-FE] Done.


In [40]:
# Reuse model_full and b_opt: alternative seq pooling (mean vs top-2 mean vs conf-weighted) and build new submission file
import numpy as np, pandas as pd, os
from catboost import Pool
from collections import defaultdict

# Rebuild features exactly as in Cell 17 to get test Pool (with image feats + priors mapping)
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')

def fe_base(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['doy']  = dt.dt.dayofyear.fillna(1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    df['hour_sin'] = np.sin(2*np.pi*df['hour']/24.0)
    df['hour_cos'] = np.cos(2*np.pi*df['hour']/24.0)
    df['doy_sin']  = np.sin(2*np.pi*df['doy']/366.0)
    df['doy_cos']  = np.cos(2*np.pi*df['doy']/366.0)
    bins = [-1,3,7,11,15,19,23]
    labels = [0,1,2,3,4,5]
    df['hour_bin'] = pd.cut(df['hour'], bins=bins, labels=labels, include_lowest=True).astype(int)
    df['aspect'] = (df['width'] / df['height']).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return df

train_fe = fe_base(train)
test_fe  = fe_base(test)
train_fe['id'] = train['id'].astype(str).values
test_fe['id']  = test['id'].astype(str).values

# Merge cheap image feats if available
if os.path.exists('img_feats_train.csv') and os.path.exists('img_feats_test.csv'):
    fe_tr = pd.read_csv('img_feats_train.csv'); fe_te = pd.read_csv('img_feats_test.csv')
    train_fe = train_fe.merge(fe_tr, on='id', how='left')
    test_fe  = test_fe.merge(fe_te, on='id', how='left')

classes_all = sorted(train['category_id'].unique().tolist())
img_num = [c for c in ['laplacian_var','hsv_s_mean','gray_mean','gray_std','hsv_v_mean','file_size_kb'] if c in train_fe.columns]
base_num = ['width','height','aspect','year','month','day','hour','doy','is_night','frame_num','seq_num_frames','frame_ratio','is_first','is_last','hour_sin','hour_cos','doy_sin','doy_cos'] + img_num
cat_cols = ['rights_holder']

X_full = train_fe[['location','rights_holder','hour_bin','id'] + base_num].copy()
X_test = test_fe[['location','rights_holder','hour_bin','id'] + base_num].copy()

# Median-impute image feats
for c in img_num:
    med = X_full[c].median() if np.isfinite(X_full[c]).any() else 0.0
    X_full[c] = X_full[c].fillna(med)
    X_test[c] = X_test[c].fillna(med)

# Build priors using full train to map to test
def entropy_from_probs(p):
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def m_estimate_prior(counts, total, pg, m):
    return (counts + m * pg) / (total + m)

def build_group_priors(train_idx_df, key_col, classes, m):
    g = {}; grp = train_idx_df.groupby([key_col, 'category_id']).size().unstack(fill_value=0)
    for c in classes:
        if c not in grp.columns: grp[c] = 0
    grp = grp[classes]
    total_counts = grp.sum(axis=1).astype(int)
    pg = train_idx_df['category_id'].value_counts(normalize=True).reindex(classes).fillna(0).values
    for key, row in grp.iterrows():
        cnts = row.values.astype(float); n = int(total_counts.loc[key])
        p = m_estimate_prior(cnts, n, pg, m); ent = entropy_from_probs(p)
        g[key] = (p, n, ent)
    p_global = pg.copy(); ent_global = entropy_from_probs(p_global)
    return g, p_global, ent_global

def map_group_priors(df_in, key_col, prior_map, p_global, ent_global, prefix, classes):
    df = df_in.copy()
    probs_mat = np.zeros((len(df), len(classes)), dtype=np.float32)
    counts = np.zeros(len(df), dtype=np.int32)
    ents = np.zeros(len(df), dtype=np.float32)
    for i, k in enumerate(df[key_col].values):
        tpl = prior_map.get(k)
        if tpl is None:
            probs_mat[i] = p_global; counts[i] = 0; ents[i] = ent_global
        else:
            p, n, e = tpl; probs_mat[i] = p; counts[i] = n; ents[i] = e
    for j, c in enumerate(classes_all):
        df[f'{prefix}_p_{c}'] = probs_mat[:, j]
    df[f'{prefix}_count'] = np.log1p(counts)
    df[f'{prefix}_entropy'] = ents
    return df

def build_loc_hour_entropy(train_idx_df, m=300):
    key = train_idx_df['location'].astype(str) + '|' + train_idx_df['hour_bin'].astype(str)
    grp = train_idx_df.assign(k=key).groupby(['k','category_id']).size().unstack(fill_value=0)
    classes = sorted(train_idx_df['category_id'].unique().tolist())
    for c in classes:
        if c not in grp.columns: grp[c] = 0
    grp = grp[classes]
    total_counts = grp.sum(axis=1).astype(int)
    pg = train_idx_df['category_id'].value_counts(normalize=True).reindex(classes).fillna(0).values
    ent_map = {}
    for k, row in grp.iterrows():
        cnts = row.values.astype(float); n = int(total_counts.loc[k])
        p = m_estimate_prior(cnts, n, pg, m)
        ent_map[k] = (entropy_from_probs(p), n)
    ent_global = entropy_from_probs(pg)
    return ent_map, ent_global

loc_map_full, loc_pg_full, loc_entg_full = build_group_priors(train_fe[['location','category_id']], 'location', classes_all, m=100)
rh_map_full,  rh_pg_full,  rh_entg_full  = build_group_priors(train_fe[['rights_holder','category_id']], 'rights_holder', classes_all, m=50)
lxh_map_full, lxh_entg_full = build_loc_hour_entropy(train_fe[['location','hour_bin','category_id']].copy(), m=300)
X_full = map_group_priors(X_full, 'location', loc_map_full, loc_pg_full, loc_entg_full, 'loc', classes_all)
X_full = map_group_priors(X_full, 'rights_holder', rh_map_full,  rh_pg_full,  rh_entg_full,  'rh', classes_all)
X_full = map_loc_hour_entropy(X_full, lxh_map_full, lxh_entg_full)
X_test = map_group_priors(X_test, 'location', loc_map_full, loc_pg_full, loc_entg_full, 'loc', classes_all)
X_test = map_group_priors(X_test, 'rights_holder', rh_map_full,  rh_pg_full,  rh_entg_full,  'rh', classes_all)
X_test = map_loc_hour_entropy(X_test, lxh_map_full, lxh_entg_full)

use_cols = base_num + ['loc_count','loc_entropy','rh_count','rh_entropy','loc_hour_entropy','loc_hour_count'] + [f'loc_p_{c}' for c in classes_all] + [f'rh_p_{c}' for c in classes_all] + cat_cols
cat_idx_full = [use_cols.index(c) for c in cat_cols]
pool_test = Pool(X_test[use_cols], cat_features=cat_idx_full)

# Predict logits via existing model_full in memory
proba_test = model_full.predict_proba(pool_test)
logits_test = np.log(np.clip(np.asarray(proba_test), 1e-8, 1.0))

test_df = test_fe[['id','seq_id']].copy()
seq_map = defaultdict(list)
for i, sid in enumerate(test_df['seq_id'].values):
    seq_map[sid].append(i)

def pool_mean(idxs):
    return logits_test[idxs].mean(axis=0)

def pool_top2(idxs):
    # mean of top-2 frames by max logit
    if len(idxs) <= 2:
        return logits_test[idxs].mean(axis=0)
    scores = logits_test[idxs].max(axis=1)
    top2 = np.argsort(-scores)[:2]
    sel = [idxs[i] for i in top2]
    return logits_test[sel].mean(axis=0)

def pool_conf_weight(idxs):
    # confidence-weighted mean using softmax(max-logit) as weights
    arr = logits_test[idxs]
    conf = arr.max(axis=1)
    w = np.exp(conf - conf.max())
    w = w / (w.sum() + 1e-8)
    return (arr * w[:, None]).sum(axis=0)

poolers = {'mean': pool_mean, 'top2': pool_top2, 'confw': pool_conf_weight}
subs = {}
for name, fn in poolers.items():
    pred_seq = {}
    for sid, idxs in seq_map.items():
        m = fn(idxs) + b_opt  # reuse tuned biases
        pred_seq[sid] = int(np.argmax(m))
    test_pred = test_df['seq_id'].map(pred_seq).astype(int).values
    df_out = pd.DataFrame({'id': test_df['id'].astype(str).values, 'category_id': test_pred})
    out_path = f'submission_cb_priors_seq_bias_{name}.csv'
    df_out.to_csv(out_path, index=False)
    subs[name] = out_path
    print(f'[ALT-POOL] Saved {out_path} shape', df_out.shape, 'unique classes', df_out['category_id'].nunique())

print('[ALT-POOL] Done. Next: run mapping cell to build final submission.csv from the best variant (start with top2).')

[ALT-POOL] Saved submission_cb_priors_seq_bias_mean.csv shape (16937, 2) unique classes 5
[ALT-POOL] Saved submission_cb_priors_seq_bias_top2.csv shape (16937, 2) unique classes 5
[ALT-POOL] Saved submission_cb_priors_seq_bias_confw.csv shape (16937, 2) unique classes 5
[ALT-POOL] Done. Next: run mapping cell to build final submission.csv from the best variant (start with top2).


In [41]:
# Map top-2 pooled CB priors submission to original labels and build final submission.csv
import pandas as pd, numpy as np
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
pred_raw = pd.read_csv('submission_cb_priors_seq_bias_top2.csv')  # columns: id, category_id (indices 0..13)

# Map class indices -> original category_id labels
classes_all = sorted(train['category_id'].unique().tolist())
idx2label = {i: lab for i, lab in enumerate(classes_all)}

# Ensure one row per test id
pred_raw = pred_raw.drop_duplicates(subset=['id'], keep='first')
pred_raw['mapped'] = pred_raw['category_id'].map(idx2label).astype(int)

# Align strictly to test ids and order
sub = pd.DataFrame({'Id': test['id'].astype(str)})
m = test[['id']].merge(pred_raw[['id','mapped']], on='id', how='left')
if m['mapped'].isna().any():
    m['mapped'] = m['mapped'].fillna(0).astype(int)
sub['Category'] = m['mapped'].astype(int).values

assert len(sub) == len(test), f'Row count mismatch: {len(sub)} vs {len(test)}'
assert set(sub.columns) == {'Id','Category'}
sub.to_csv('submission.csv', index=False)
print('[FINAL-SUB-TOP2] submission.csv shape', sub.shape, 'columns', list(sub.columns), 'nunique Category', sub['Category'].nunique())
print(sub.head())

[FINAL-SUB-TOP2] submission.csv shape (16877, 2) columns ['Id', 'Category'] nunique Category 5
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b         0
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b         0
3  5a24a741-23d2-11e8-a6a3-ec086b02610b         0
4  59eab924-23d2-11e8-a6a3-ec086b02610b         0


In [50]:
# Two-stage empty-vs-nonempty gate on seq-averaged logits + biases; build final submission (aligned to test.csv)
import numpy as np, pandas as pd
from sklearn.metrics import f1_score

# Assumes variables from Cell 17/30/31 are in memory: oof_logits, y_all (indices or labels), train_fe, classes_all, b_opt, logits_test, test_fe
assert 'oof_logits' in globals() and 'y_all' in globals() and 'train_fe' in globals(), 'Missing OOF artifacts; run model cell first.'
assert 'classes_all' in globals() and 'b_opt' in globals() and 'logits_test' in globals() and 'test_fe' in globals(), 'Missing test artifacts; run model cell first.'

# Identify index corresponding to original label 0 (empty)
idx_empty = classes_all.index(0)

def softmax_rows(x):
    x = x - x.max(axis=1, keepdims=True)
    ex = np.exp(x)
    return ex / (ex.sum(axis=1, keepdims=True) + 1e-12)

# Map y_all to ORIGINAL labels if it's currently indices 0..K-1
y_all_arr = np.asarray(y_all)
if np.issubdtype(y_all_arr.dtype, np.integer) and y_all_arr.max() < len(classes_all) and set(np.unique(y_all_arr)) == set(range(len(classes_all))):
    y_all_labels = np.array([classes_all[i] for i in y_all_arr], dtype=int)
else:
    y_all_labels = y_all_arr.astype(int)

# OOF seq-averaged logits already in oof_logits; apply biases
oof_adj = oof_logits + b_opt[None, :]
oof_prob = softmax_rows(oof_adj)
p0 = oof_prob[:, idx_empty]

# Collapse to one row per sequence for gate tuning; use mode target per seq (in ORIGINAL label space)
tr_seq = train_fe['seq_id'].values
df_tmp = pd.DataFrame({'seq_id': tr_seq, 'y': y_all_labels, 'p0': p0})
y_mode = df_tmp.groupby('seq_id')['y'].agg(lambda s: s.mode().iloc[0]).reset_index(name='y')
p0_first = df_tmp.groupby('seq_id')['p0'].first().reset_index(name='p0')
df_oof = y_mode.merge(p0_first, on='seq_id', how='left')

# For non-empty branch, suppress empty logit and optionally apply temperature to spread predictions
T_nonempty = 0.7  # temperature <1.0 to increase diversity

# Build map seq->first row index for logits lookup (seq-avg oof_adj values are identical per seq anyway)
seq_first_idx = {}
for i, sid in enumerate(tr_seq):
    if sid not in seq_first_idx:
        seq_first_idx[sid] = i
idxs = np.array([seq_first_idx[sid] for sid in df_oof['seq_id'].values], dtype=int)
logits_seq = oof_adj[idxs].copy()

# Candidate thresholds for empty gate
cands = np.linspace(0.3, 0.95, 66)
best_t, best_f1 = 0.5, -1.0
for t in cands:
    # Decide empty vs non-empty by p0
    is_empty = df_oof['p0'].values >= t
    # Non-empty branch: suppress empty class and apply temperature scaling
    logits_ne = logits_seq.copy()
    logits_ne[:, idx_empty] = -1e9
    logits_ne = logits_ne / max(T_nonempty, 1e-6)
    pred_idx = logits_ne.argmax(axis=1)
    pred_labels = np.where(is_empty, 0, np.array([classes_all[j] for j in pred_idx]))
    f1 = f1_score(df_oof['y'].values, pred_labels, average='macro')
    if f1 > best_f1:
        best_f1, best_t = f1, t

print(f'[GATE] Best empty gate threshold t={best_t:.3f} (T_nonempty={T_nonempty}) yields OOF seq-avg macro-F1={best_f1:.5f}')

# Apply gate to test aligned strictly to test.csv to avoid length mismatches
test_true = pd.read_csv('test.csv')
# Build mapping from test_fe rows (used to compute logits_test) to their indices
df_logits = pd.DataFrame({
    'id': test_fe['id'].astype(str).values,
    'seq_id': test_fe['seq_id'].values,
    'row_idx': np.arange(len(test_fe), dtype=int)
})
# Map each test.csv id to the corresponding logits_test row index (first occurrence if duplicates)
id2idx = {}
for rid, rix in zip(df_logits['id'].values, df_logits['row_idx'].values):
    if rid not in id2idx:
        id2idx[rid] = int(rix)
mapped_idx = test_true['id'].astype(str).map(id2idx).values
if np.any(pd.isna(mapped_idx)):
    mapped_idx = np.where(pd.isna(mapped_idx), 0, mapped_idx).astype(int)
else:
    mapped_idx = mapped_idx.astype(int)

# Group indices by seq_id from test.csv and compute seq-mean logits on those indices
from collections import defaultdict
seq_to_indices = defaultdict(list)
for i, sid in enumerate(test_true['seq_id'].values):
    seq_to_indices[sid].append(mapped_idx[i])

pred_seq = {}
for sid, idxs in seq_to_indices.items():
    m = logits_test[idxs].mean(axis=0) + b_opt  # seq-avg + biases
    prob = softmax_rows(m[None, :])[0]
    if prob[idx_empty] >= best_t:
        pred_seq[sid] = 0  # empty label
    else:
        m_ne = m.copy()
        m_ne[idx_empty] = -1e9
        m_ne = m_ne / max(T_nonempty, 1e-6)
        pred_seq[sid] = classes_all[int(np.argmax(m_ne))]

test_pred_labels = pd.Series(test_true['seq_id'].values).map(pred_seq).astype(int).values

# Build final submission with correct headers and order
sub = pd.DataFrame({'Id': test_true['id'].astype(str).values, 'Category': test_pred_labels})
assert len(sub) == len(test_true), f'Length mismatch: {len(sub)} vs {len(test_true)}'
sub.to_csv('submission.csv', index=False)
print('[GATE] Wrote submission.csv with gate. shape', sub.shape, 'nunique Category', sub['Category'].nunique())
print(sub.head())

[GATE] Best empty gate threshold t=0.340 (T_nonempty=0.7) yields OOF seq-avg macro-F1=0.68623
[GATE] Wrote submission.csv with gate. shape (16877, 2) nunique Category 14
                                     Id  Category
0  5998cfa4-23d2-11e8-a6a3-ec086b02610b        19
1  599fbd89-23d2-11e8-a6a3-ec086b02610b         0
2  59fae563-23d2-11e8-a6a3-ec086b02610b        16
3  5a24a741-23d2-11e8-a6a3-ec086b02610b        18
4  59eab924-23d2-11e8-a6a3-ec086b02610b        16


In [44]:
# CPU embedding extraction: timm resnet18 (160px, GAP) -> 512-D fp16; saves train/test .npy and id/order CSVs
import os, time, math, gc, cv2, numpy as np, pandas as pd, torch, timm
from torch import nn
from torch.utils.data import Dataset, DataLoader

cv2.setNumThreads(0)
torch.set_num_threads(8)
DEVICE = 'cpu'

IMG_SIZE = 160  # faster than 224; upgrade to 224 or switch to r50 if time allows
BATCH_SIZE = 256
NUM_WORKERS = 8
MODEL_NAME = 'resnet18'
OUT_TRAIN = f'emb_train_{MODEL_NAME}_{IMG_SIZE}.npy'
OUT_TEST  = f'emb_test_{MODEL_NAME}_{IMG_SIZE}.npy'
OUT_TRAIN_IDS = f'emb_train_{MODEL_NAME}_{IMG_SIZE}_ids.csv'
OUT_TEST_IDS  = f'emb_test_{MODEL_NAME}_{IMG_SIZE}_ids.csv'

mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std  = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess_pad(img, size=IMG_SIZE):
    if img is None:
        img = np.zeros((size, size, 3), dtype=np.uint8)
    else:
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    h, w = img.shape[:2]
    scale = size / max(h, w) if max(h, w) > 0 else 1.0
    nh, nw = max(1, int(h*scale)), max(1, int(w*scale))
    img_rs = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_AREA)
    pad = np.zeros((size, size, 3), dtype=np.uint8)
    y0 = (size - nh) // 2; x0 = (size - nw) // 2
    pad[y0:y0+nh, x0:x0+nw] = img_rs
    x = pad.astype(np.float32) / 255.0
    x = (x - mean) / std
    x = np.transpose(x, (2, 0, 1))
    return x

class ImgDs(Dataset):
    def __init__(self, df, img_dir):
        self.ids = df['id'].astype(str).values
        self.fns = df['file_name'].values
        self.img_dir = img_dir
    def __len__(self): return len(self.fns)
    def __getitem__(self, i):
        fp = os.path.join(self.img_dir, self.fns[i])
        img = cv2.imread(fp, cv2.IMREAD_COLOR)
        x = preprocess_pad(img)
        return torch.from_numpy(x), self.ids[i]

def build_loader(df, img_dir):
    ds = ImgDs(df, img_dir)
    return DataLoader(ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True, persistent_workers=False)

def extract_embeddings(df, img_dir, out_npy, out_ids):
    n = len(df)
    loader = build_loader(df, img_dir)
    print(f'[EMB] Loading {MODEL_NAME} pretrained backbone...', flush=True)
    model = timm.create_model(MODEL_NAME, pretrained=True, num_classes=0)  # pooled features
    model.eval(); model.to(DEVICE); model.to(memory_format=torch.channels_last)
    feats = None; ids_all = []
    t0 = time.time(); seen = 0
    with torch.inference_mode():
        for it, (xb, ids) in enumerate(loader, 1):
            xb = xb.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
            f = model(xb).float().cpu().numpy()  # (B, D)
            if feats is None:
                D = f.shape[1]
                feats = np.memmap(out_npy + '.mmap', mode='w+', dtype=np.float16, shape=(n, D))
            bsz = f.shape[0]
            feats[seen:seen+bsz] = f.astype(np.float16)
            ids_all.extend(list(ids))
            seen += bsz
            if it % 50 == 0 or seen == n:
                dt = time.time()-t0
                print(f'  [EMB] it {it} seen {seen}/{n} ({seen/n*100:.1f}%) elapsed {dt/60:.2f}m', flush=True)
    # Flush memmap to .npy
    arr = np.array(feats, copy=True)  # load to RAM
    del feats; gc.collect()
    np.save(out_npy, arr)
    try:
        os.remove(out_npy + '.mmap')
    except Exception:
        pass
    pd.DataFrame({'id': ids_all}).to_csv(out_ids, index=False)
    print(f'[EMB] Saved {out_npy} shape {arr.shape} and {out_ids}', flush=True)

train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
print('[EMB] Starting extraction: train n=', len(train), ' test n=', len(test), ' img_size=', IMG_SIZE, ' bs=', BATCH_SIZE, flush=True)
extract_embeddings(train[['id','file_name']], 'train_images', OUT_TRAIN, OUT_TRAIN_IDS)
extract_embeddings(test[['id','file_name']],  'test_images',  OUT_TEST,  OUT_TEST_IDS)
print('[EMB] Done.')

  from .autonotebook import tqdm as notebook_tqdm


[EMB] Starting extraction: train n= 179422  test n= 16877  img_size= 160  bs= 256


[EMB] Loading resnet18 pretrained backbone...


  [EMB] it 50 seen 12800/179422 (7.1%) elapsed 0.59m


  [EMB] it 100 seen 25600/179422 (14.3%) elapsed 1.12m


  [EMB] it 150 seen 38400/179422 (21.4%) elapsed 1.65m


  [EMB] it 200 seen 51200/179422 (28.5%) elapsed 2.18m


  [EMB] it 250 seen 64000/179422 (35.7%) elapsed 2.71m


  [EMB] it 300 seen 76800/179422 (42.8%) elapsed 3.24m


  [EMB] it 350 seen 89600/179422 (49.9%) elapsed 3.77m


  [EMB] it 400 seen 102400/179422 (57.1%) elapsed 4.29m


  [EMB] it 450 seen 115200/179422 (64.2%) elapsed 4.83m


  [EMB] it 500 seen 128000/179422 (71.3%) elapsed 5.36m


  [EMB] it 550 seen 140800/179422 (78.5%) elapsed 5.88m


  [EMB] it 600 seen 153600/179422 (85.6%) elapsed 6.41m


  [EMB] it 650 seen 166400/179422 (92.7%) elapsed 6.94m


  [EMB] it 700 seen 179200/179422 (99.9%) elapsed 7.47m


  [EMB] it 701 seen 179422/179422 (100.0%) elapsed 7.47m


[EMB] Saved emb_train_resnet18_160.npy shape (179422, 512) and emb_train_resnet18_160_ids.csv


[EMB] Loading resnet18 pretrained backbone...


  [EMB] it 50 seen 12800/16877 (75.8%) elapsed 0.56m


  [EMB] it 66 seen 16877/16877 (100.0%) elapsed 0.73m


[EMB] Saved emb_test_resnet18_160.npy shape (16877, 512) and emb_test_resnet18_160_ids.csv


[EMB] Done.


In [45]:
# Logistic Regression on ResNet18 embeddings (+numeric meta), SGKF OOF -> bias tune -> test preds
import numpy as np, pandas as pd, time, os
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
classes_all = sorted(train['category_id'].unique().tolist())
n_classes = len(classes_all)

# Load embeddings and ensure alignment to train/test id order
emb_tr = np.load('emb_train_resnet18_160.npy')  # (N_tr, 512)
emb_te = np.load('emb_test_resnet18_160.npy')   # (N_te, 512)
ids_tr = pd.read_csv('emb_train_resnet18_160_ids.csv')['id'].astype(str).values
ids_te = pd.read_csv('emb_test_resnet18_160_ids.csv')['id'].astype(str).values
assert emb_tr.shape[0] == len(train) and emb_te.shape[0] == len(test), f'Emb shape mismatch: {emb_tr.shape}, {emb_te.shape}'
assert np.all(ids_tr == train['id'].astype(str).values), 'Train embedding id order mismatch'
assert np.all(ids_te == test['id'].astype(str).values), 'Test embedding id order mismatch'

def fe_base(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    df['aspect'] = (df['width'] / df['height']).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return df

train_fe = fe_base(train)
test_fe  = fe_base(test)

# Numeric meta features (keep small to avoid overfitting, categorical omitted for speed)
num_cols = ['width','height','aspect','year','month','day','hour','is_night','frame_num','seq_num_frames','frame_ratio','is_first','is_last']
X_num_tr = train_fe[num_cols].astype(np.float32).values
X_num_te = test_fe[num_cols].astype(np.float32).values

# Concatenate embeddings + numeric
X_all = np.concatenate([emb_tr.astype(np.float32), X_num_tr], axis=1)
X_test = np.concatenate([emb_te.astype(np.float32), X_num_te], axis=1)
y_all = train['category_id'].values
groups = train['seq_id'].astype(str).values

print('[LR-EMB] Shapes: X_all', X_all.shape, 'X_test', X_test.shape, 'classes', n_classes, flush=True)

# Build scaler+LR pipeline
pipe = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('clf', LogisticRegression(
        multi_class='multinomial', solver='saga', max_iter=300, C=0.5,
        class_weight='balanced', n_jobs=8, verbose=0))
])

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_logits = np.full((len(train), n_classes), np.nan, dtype=np.float32)

for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X_all, y=y_all, groups=groups)):
    t_fold = time.time()
    X_tr, X_va = X_all[tr_idx], X_all[va_idx]
    y_tr, y_va = y_all[tr_idx], y_all[va_idx]
    pipe.fit(X_tr, y_tr)
    proba = pipe.predict_proba(X_va)
    logits = np.log(np.clip(proba, 1e-9, 1.0))
    # Sequence-average within fold
    va_seq = train.loc[va_idx, 'seq_id'].values
    logits_seq = logits.copy()
    from collections import defaultdict
    g = defaultdict(list)
    for i, sid in enumerate(va_seq): g[sid].append(i)
    for idxs in g.values():
        m = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = m
    oof_logits[va_idx] = logits_seq
    f1 = f1_score(y_va, logits_seq.argmax(1), average='macro')
    print(f'[LR-EMB] Fold {fold} seq-avg macro-F1={f1:.5f} elapsed {time.time()-t_fold:.1f}s', flush=True)

assert not np.isnan(oof_logits).any(), 'NaNs in OOF logits'

def optimize_biases(y_true, logits, n_iters=2, grid=np.linspace(-1.5, 1.5, 19)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]; best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc: best_sc, best_c = sc, d
            b[c] = best_c
            if best_c != bc: best = best_sc; improved = True
        if not improved: break
    return b, best

b_opt, f1_oof = optimize_biases(y_all, oof_logits)
print(f'[LR-EMB] OOF seq-avg macro-F1 (bias-tuned)={f1_oof:.5f}', flush=True)
print('[LR-EMB] Biases:', np.round(b_opt, 3))

# Fit on full data and predict test
t_fit = time.time()
pipe.fit(X_all, y_all)
print(f'[LR-EMB] Full fit done in {time.time()-t_fit:.1f}s', flush=True)
proba_test = pipe.predict_proba(X_test)
logits_test = np.log(np.clip(proba_test, 1e-9, 1.0))

# Sequence-average test logits and apply biases; predict per seq then broadcast
test_df = test[['id','seq_id']].copy()
from collections import defaultdict
seq_map = defaultdict(list)
for i, sid in enumerate(test_df['seq_id'].values): seq_map[sid].append(i)
pred_seq = {}
for sid, idxs in seq_map.items():
    m = logits_test[idxs].mean(axis=0) + b_opt
    pred_seq[sid] = int(np.argmax(m))
test_pred_idx = test_df['seq_id'].map(pred_seq).astype(int).values

# Map class indices -> original labels
idx2label = {i: lab for i, lab in enumerate(classes_all)}
test_pred = np.vectorize(idx2label.get)(test_pred_idx).astype(int)

sub = pd.DataFrame({'Id': test['id'].astype(str), 'Category': test_pred})
sub.to_csv('submission.csv', index=False)
print('[LR-EMB] Wrote submission.csv shape', sub.shape, 'nunique Category', sub['Category'].nunique())

# Expose globals for gate cell reuse
globals().update({'oof_logits': oof_logits, 'y_all': y_all, 'b_opt': b_opt, 'logits_test': logits_test, 'train_fe': train_fe, 'test_fe': test_fe, 'classes_all': classes_all})
print('[LR-EMB] Exposed oof_logits/y_all/b_opt/logits_test/train_fe/test_fe/classes_all for gating.', flush=True)
print('[LR-EMB] Done in {:.1f}s'.format(time.time()-t0), flush=True)

[LR-EMB] Shapes: X_all (179422, 525) X_test (16877, 525) classes 14






[LR-EMB] Fold 0 seq-avg macro-F1=0.09338 elapsed 736.9s




In [47]:
# XGBoost on ResNet18 embeddings (+numeric meta), 5-fold SGKF -> OOF logits -> bias tune -> test logits
import numpy as np, pandas as pd, time, os, sys, subprocess
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score

# Ensure xgboost is installed
try:
    import xgboost as xgb
except Exception:
    print('[XGB] Installing xgboost...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'xgboost==2.1.1'], check=True)
    import xgboost as xgb

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
classes_all = sorted(train['category_id'].unique().tolist())
n_classes = len(classes_all)
label2idx = {lab: i for i, lab in enumerate(classes_all)}

# Load embeddings
emb_tr = np.load('emb_train_resnet18_160.npy').astype(np.float32)
emb_te = np.load('emb_test_resnet18_160.npy').astype(np.float32)
ids_tr = pd.read_csv('emb_train_resnet18_160_ids.csv')['id'].astype(str).values
ids_te = pd.read_csv('emb_test_resnet18_160_ids.csv')['id'].astype(str).values
assert emb_tr.shape[0] == len(train) and emb_te.shape[0] == len(test)
assert np.all(ids_tr == train['id'].astype(str).values)
assert np.all(ids_te == test['id'].astype(str).values)

def fe_base(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    df['aspect'] = (df['width'] / df['height']).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return df

train_fe = fe_base(train); test_fe = fe_base(test)
num_cols = ['width','height','aspect','year','month','day','hour','is_night','frame_num','seq_num_frames','frame_ratio','is_first','is_last']
X_num_tr = train_fe[num_cols].astype(np.float32).values
X_num_te = test_fe[num_cols].astype(np.float32).values
X_all = np.concatenate([emb_tr, X_num_tr], axis=1)
X_test = np.concatenate([emb_te, X_num_te], axis=1)
y_all_labels = train['category_id'].values
y_all = np.array([label2idx[v] for v in y_all_labels], dtype=np.int32)  # map to 0..K-1
groups = train['seq_id'].astype(str).values

print('[XGB-EMB] Shapes:', X_all.shape, X_test.shape, 'classes', n_classes, flush=True)

params = {
    'objective': 'multi:softprob',
    'num_class': int(n_classes),
    'tree_method': 'hist',
    'max_depth': 7,
    'learning_rate': 0.05,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'eval_metric': 'mlogloss',
    'nthread': -1,
}
n_estimators = 2000
early_stopping = 100

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_logits = np.full((len(train), n_classes), np.nan, dtype=np.float32)
best_iteration_last_fold = 1000

for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X_all, y=y_all, groups=groups)):
    t_fold = time.time()
    dtr = xgb.DMatrix(X_all[tr_idx], label=y_all[tr_idx])
    dva = xgb.DMatrix(X_all[va_idx], label=y_all[va_idx])
    watchlist = [(dtr, 'train'), (dva, 'valid')]
    model = xgb.train(params, dtr, num_boost_round=n_estimators, evals=watchlist, early_stopping_rounds=early_stopping, verbose_eval=False)
    best_iteration_last_fold = int(model.best_iteration + 1)
    proba = model.predict(dva)
    logits = np.log(np.clip(proba, 1e-9, 1.0))
    # seq-average within val fold
    va_seq = train.loc[va_idx, 'seq_id'].values
    from collections import defaultdict
    g = defaultdict(list)
    for i, sid in enumerate(va_seq): g[sid].append(i)
    logits_seq = logits.copy()
    for idxs in g.values():
        m = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = m
    oof_logits[va_idx] = logits_seq
    f1 = f1_score(y_all[va_idx], logits_seq.argmax(1), average='macro')
    print(f'[XGB-EMB] Fold {fold} seq-avg macro-F1={f1:.5f} rounds={model.best_iteration+1} elapsed {time.time()-t_fold:.1f}s', flush=True)

assert not np.isnan(oof_logits).any(), 'NaNs in OOF logits'

def optimize_biases(y_true, logits, n_iters=2, grid=np.linspace(-1.5, 1.5, 19)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]; best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc: best_sc, best_c = sc, d
            b[c] = best_c
            if best_c != bc: best = best_sc; improved = True
        if not improved: break
    return b, best

b_opt, f1_oof = optimize_biases(y_all, oof_logits)
print(f'[XGB-EMB] OOF seq-avg macro-F1 (bias-tuned)={f1_oof:.5f}', flush=True)
print('[XGB-EMB] Biases:', np.round(b_opt, 3))

# Fit full model and predict test
dall = xgb.DMatrix(X_all, label=y_all)
dte = xgb.DMatrix(X_test)
num_round_full = int(max(best_iteration_last_fold, 1000) * 1.1)
model_full = xgb.train(params, dall, num_boost_round=num_round_full)
proba_test = model_full.predict(dte)
logits_test = np.log(np.clip(proba_test, 1e-9, 1.0))

# Expose globals for gating and blending
globals().update({'oof_logits': oof_logits, 'y_all': y_all, 'b_opt': b_opt, 'logits_test': logits_test, 'train_fe': train_fe, 'test_fe': test_fe, 'classes_all': classes_all})
print('[XGB-EMB] Exposed oof/logits for gate. Total time {:.1f}s'.format(time.time()-t0), flush=True)

[XGB-EMB] Shapes: (179422, 525) (16877, 525) classes 14


KeyboardInterrupt: 

In [49]:
# LightGBM on r18 embeddings (+numeric meta), 5-fold SGKF -> OOF logits -> bias tune -> test logits
import numpy as np, pandas as pd, time, os, sys, subprocess
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.metrics import f1_score

# Ensure lightgbm is available
try:
    import lightgbm as lgb
except Exception:
    print('[LGB] Installing lightgbm...', flush=True)
    subprocess.run([sys.executable, '-m', 'pip', 'install', 'lightgbm==4.6.0'], check=True)
    import lightgbm as lgb

t0 = time.time()
train = pd.read_csv('train.csv')
test  = pd.read_csv('test.csv')
classes_all = sorted(train['category_id'].unique().tolist())
n_classes = len(classes_all)
label2idx = {lab: i for i, lab in enumerate(classes_all)}

# Load embeddings
emb_tr = np.load('emb_train_resnet18_160.npy').astype(np.float32)
emb_te = np.load('emb_test_resnet18_160.npy').astype(np.float32)
ids_tr = pd.read_csv('emb_train_resnet18_160_ids.csv')['id'].astype(str).values
ids_te = pd.read_csv('emb_test_resnet18_160_ids.csv')['id'].astype(str).values
assert emb_tr.shape[0] == len(train) and emb_te.shape[0] == len(test)
assert np.all(ids_tr == train['id'].astype(str).values)
assert np.all(ids_te == test['id'].astype(str).values)

def fe_base(df):
    df = df.copy()
    dt = pd.to_datetime(df['date_captured'], errors='coerce')
    df['year'] = dt.dt.year.fillna(-1).astype(int)
    df['month'] = dt.dt.month.fillna(-1).astype(int)
    df['day'] = dt.dt.day.fillna(-1).astype(int)
    df['hour'] = dt.dt.hour.fillna(-1).astype(int)
    df['is_night'] = ((df['hour'] < 6) | (df['hour'] > 19)).astype(int)
    df['frame_num'] = df['frame_num'].fillna(-1).astype(int)
    df['seq_num_frames'] = df['seq_num_frames'].fillna(1).astype(int)
    df['frame_ratio'] = (df['frame_num'] / df['seq_num_frames']).clip(0,1)
    df['is_first'] = (df['frame_num'] <= 1).astype(int)
    df['is_last']  = (df['frame_num'] >= df['seq_num_frames']).astype(int)
    df['aspect'] = (df['width'] / df['height']).replace([np.inf, -np.inf], np.nan).fillna(0.0)
    return df

train_fe = fe_base(train); test_fe = fe_base(test)
num_cols = ['width','height','aspect','year','month','day','hour','is_night','frame_num','seq_num_frames','frame_ratio','is_first','is_last']
X_num_tr = train_fe[num_cols].astype(np.float32).values
X_num_te = test_fe[num_cols].astype(np.float32).values
X_all = np.concatenate([emb_tr, X_num_tr], axis=1)
X_test = np.concatenate([emb_te, X_num_te], axis=1)
y_all_labels = train['category_id'].values
y_all = np.array([label2idx[v] for v in y_all_labels], dtype=np.int32)
groups = train['seq_id'].astype(str).values

print('[LGB-EMB] Shapes:', X_all.shape, X_test.shape, 'classes', n_classes, flush=True)

params = {
    'objective': 'multiclass',
    'num_class': int(n_classes),
    'metric': 'multi_logloss',
    'learning_rate': 0.05,
    'num_leaves': 63,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 1,
    'min_data_in_leaf': 20,
    'verbosity': -1,
    'force_col_wise': True
}
n_estimators = 800
early_stopping = 50

sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
oof_logits = np.full((len(train), n_classes), np.nan, dtype=np.float32)
best_iter_last = 400

for fold, (tr_idx, va_idx) in enumerate(sgkf.split(X_all, y=y_all, groups=groups)):
    t_fold = time.time()
    dtr = lgb.Dataset(X_all[tr_idx], label=y_all[tr_idx])
    dva = lgb.Dataset(X_all[va_idx], label=y_all[va_idx])
    model = lgb.train(
        params,
        dtr,
        num_boost_round=n_estimators,
        valid_sets=[dtr, dva],
        valid_names=['train','valid'],
        callbacks=[lgb.early_stopping(early_stopping, verbose=False)]
    )
    best_iter_last = int(model.best_iteration or n_estimators)
    proba = model.predict(X_all[va_idx], num_iteration=model.best_iteration)
    logits = np.log(np.clip(proba, 1e-9, 1.0))
    # seq-average within val fold
    va_seq = train.loc[va_idx, 'seq_id'].values
    from collections import defaultdict
    g = defaultdict(list)
    for i, sid in enumerate(va_seq): g[sid].append(i)
    logits_seq = logits.copy()
    for idxs in g.values():
        m = logits[idxs].mean(axis=0, keepdims=True)
        logits_seq[idxs] = m
    oof_logits[va_idx] = logits_seq
    f1 = f1_score(y_all[va_idx], logits_seq.argmax(1), average='macro')
    print(f'[LGB-EMB] Fold {fold} seq-avg macro-F1={f1:.5f} rounds={best_iter_last} elapsed {time.time()-t_fold:.1f}s', flush=True)

assert not np.isnan(oof_logits).any(), 'NaNs in OOF logits'

def optimize_biases(y_true, logits, n_iters=2, grid=np.linspace(-1.5, 1.5, 19)):
    b = np.zeros(logits.shape[1], dtype=np.float32)
    best = f1_score(y_true, (logits + b).argmax(1), average='macro')
    for _ in range(n_iters):
        improved = False
        for c in range(logits.shape[1]):
            bc = b[c]; best_c = bc; best_sc = best
            for d in grid:
                b[c] = d
                sc = f1_score(y_true, (logits + b).argmax(1), average='macro')
                if sc > best_sc: best_sc, best_c = sc, d
            b[c] = best_c
            if best_c != bc: best = best_sc; improved = True
        if not improved: break
    return b, best

b_opt, f1_oof = optimize_biases(y_all, oof_logits)
print(f'[LGB-EMB] OOF seq-avg macro-F1 (bias-tuned)={f1_oof:.5f}', flush=True)
print('[LGB-EMB] Biases:', np.round(b_opt, 3))

# Fit full model and predict test
dall = lgb.Dataset(X_all, label=y_all)
model_full = lgb.train(params, dall, num_boost_round=int(best_iter_last*1.1))
proba_test = model_full.predict(X_test)
logits_test = np.log(np.clip(proba_test, 1e-9, 1.0))

# Expose globals
globals().update({'oof_logits': oof_logits, 'y_all': y_all, 'b_opt': b_opt, 'logits_test': logits_test, 'train_fe': train_fe, 'test_fe': test_fe, 'classes_all': classes_all})
print('[LGB-EMB] Exposed oof/logits. Total time {:.1f}s'.format(time.time()-t0), flush=True)

[LGB-EMB] Shapes: (179422, 525) (16877, 525) classes 14


[LGB-EMB] Fold 0 seq-avg macro-F1=0.67870 rounds=84 elapsed 65.4s


[LGB-EMB] Fold 1 seq-avg macro-F1=0.69600 rounds=73 elapsed 60.6s


[LGB-EMB] Fold 2 seq-avg macro-F1=0.68628 rounds=81 elapsed 63.3s


[LGB-EMB] Fold 3 seq-avg macro-F1=0.68206 rounds=63 elapsed 55.4s


[LGB-EMB] Fold 4 seq-avg macro-F1=0.67563 rounds=58 elapsed 52.4s


[LGB-EMB] OOF seq-avg macro-F1 (bias-tuned)=0.69890


[LGB-EMB] Biases: [-0.667 -0.333  0.    -0.167 -0.333  0.833 -0.167 -0.333  1.333  0.
  0.167  0.5   -0.5   -0.167]


[LGB-EMB] Exposed oof/logits. Total time 372.9s
