# Herbarium 2020 - FGVC7: Plan

Objectives:
- Establish environment (GPU) and data layout quickly
- Build a fast, correct baseline (strong pretrained CNN) with solid CV
- Iterate with efficient augmentations and long-tail strategies
- Save predictions to submission.csv

Workflow:
1) Env check: GPU via nvidia-smi; verify CUDA; print Python/pip; ensure torch stack
2) Data audit: list nybg2020, locate train/test images and metadata (COCO-like JSON)
3) Minimal loader: dataset + transforms; sanity dataloader preview
4) Baseline model: timm pretrained (e.g., convnext_base or efficientnetv2), label smoothing, CE focal optional
5) CV: StratifiedKFold by category_id; log macro-F1 on val; save OOF/test logits
6) Train at moderate resolution (224→384), mixup/cutmix off initially; AMP; cosine LR; early stop
7) Inference TTA; generate submission.csv
8) Error analysis; if time: debias long-tail via class-balanced loss / reweight / higher res

Major checkpoints to request expert review:
- After this plan + env/data audit
- After baseline CV ready
- After first scoring run and before heavy training

In [2]:
import os, sys, json, time, shutil, subprocess, textwrap
from pathlib import Path

t0 = time.time()
print('=== ENV CHECK ===', flush=True)
print('CWD:', os.getcwd(), flush=True)
print('Python:', sys.version.replace('\n',' '), flush=True)
print('pip list (top 20):', flush=True)
subprocess.run([sys.executable, '-m', 'pip', 'list', '--format=columns'], check=False)
print('\n--- nvidia-smi ---', flush=True)
subprocess.run(['bash','-lc','nvidia-smi || true'], check=False)

print('\n=== DATA AUDIT ===', flush=True)
root = Path('.')
data_dir = root / 'nybg2020'
print('Exists nybg2020:', data_dir.exists(), flush=True)

# List top-level and a couple of sublevels for nybg2020
def list_dir(p: Path, max_entries=50):
    try:
        items = sorted(p.iterdir())
    except Exception as e:
        print('Cannot list', p, e, flush=True); return
    for i, it in enumerate(items[:max_entries]):
        kind = 'DIR ' if it.is_dir() else 'FILE'
        size = it.stat().st_size if it.is_file() else -1
        print(f'{kind}\t{it}', ('', f'\t{size:,} bytes')[it.is_file()], flush=True)
    if len(items) > max_entries:
        print(f'... ({len(items)-max_entries} more)', flush=True)

if data_dir.exists():
    list_dir(data_dir)
    # Find json metadata files
    metas = []
    for dirpath, dirnames, filenames in os.walk(data_dir):
        for f in filenames:
            if f.lower().endswith('.json'):
                metas.append(os.path.join(dirpath, f))
    print('\nFound JSON files:', len(metas), flush=True)
    for m in metas[:20]:
        print(' -', m, flush=True)

# Inspect sample_submission
ss_path = Path('sample_submission.csv')
if ss_path.exists():
    import pandas as pd
    ss_head = pd.read_csv(ss_path, nrows=10)
    total_rows = sum(1 for _ in open(ss_path, 'r')) - 1
    print('\nsample_submission head:\n', ss_head, flush=True)
    print('sample_submission rows (excluding header):', total_rows, flush=True)

print('\nElapsed: %.2fs' % (time.time()-t0), flush=True)

=== ENV CHECK ===


CWD: /var/lib/simon/agent_run_states/herbarium-2020-fgvc7-20250923-034836


Python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]


pip list (top 20):


Package                       Version
----------------------------- ------------
accelerate                    0.34.2
albucore                      0.0.33
albumentations                1.4.10
annotated-types               0.7.0
anyio                         4.10.0
asttokens                     3.0.0
cachetools                    5.5.2
catboost                      1.2.8
certifi                       2025.8.3
chardet                       5.2.0
charset-normalizer            3.4.3
comm                          0.2.3
contourpy                     1.3.3
cramjam                       2.11.0
cycler                        0.12.1
db-dtypes                     1.2.0
debugpy                       1.8.17
decorator                     5.2.1
et_xmlfile                    2.0.0
executing                     2.2.1
fastparquet                   2024.2.0
filelock                      3.19.1
fonttools                     4.60.0
fsspec                        2025.9.0
google-api-core               2.25.1


Tue Sep 23 07:38:45 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |    2420MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

Exists nybg2020: True


DIR 	nybg2020/test 


DIR 	nybg2020/train 



Found JSON files: 2


 - nybg2020/test/metadata.json


 - nybg2020/train/metadata.json



sample_submission head:
    Id  Predicted
0   0          0
1   1          0
2   2          0
3   3          0
4   4          0
5   5          0
6   6          0
7   7          0
8   8          0
9   9          0


sample_submission rows (excluding header): 219124



Elapsed: 1.84s


In [3]:
import json, os, time
from pathlib import Path
import pandas as pd
from collections import Counter

t0 = time.time()
train_meta_path = Path('nybg2020/train/metadata.json')
test_meta_path = Path('nybg2020/test/metadata.json')
assert train_meta_path.exists() and test_meta_path.exists(), 'metadata.json not found'

with open(train_meta_path, 'r') as f:
    train_meta = json.load(f)
with open(test_meta_path, 'r') as f:
    test_meta = json.load(f)

# Build dataframes
train_imgs = pd.DataFrame(train_meta['images'])[['id','file_name','width','height']].rename(columns={'id':'image_id'})
train_anns = pd.DataFrame(train_meta['annotations'])[['image_id','category_id','region_id','id']].rename(columns={'id':'ann_id'})
train_df = train_anns.merge(train_imgs, on='image_id', how='left')

categories = pd.DataFrame(train_meta['categories'])[['id','name','family','genus']].rename(columns={'id':'category_id'})
num_classes = categories.shape[0]

test_imgs = pd.DataFrame(test_meta['images'])[['id','file_name','width','height']].rename(columns={'id':'image_id'})

print('Train rows:', len(train_df), 'Images:', len(train_imgs), 'Test images:', len(test_imgs), flush=True)
print('Unique categories in train:', train_df['category_id'].nunique(), 'Categories listed:', num_classes, flush=True)

# Build contiguous class index mapping
uniq_cats = sorted(train_df['category_id'].unique())
catid2idx = {int(cid): int(i) for i, cid in enumerate(uniq_cats)}
idx2catid = {int(i): int(cid) for cid, i in catid2idx.items()}
print('Min/Max category_id:', min(uniq_cats), max(uniq_cats), '→ mapped to [0..', len(uniq_cats)-1, ']', flush=True)

train_df['class_idx'] = train_df['category_id'].map(catid2idx)

# Compose full file paths
train_df['filepath'] = train_df['file_name'].apply(lambda x: str(Path('nybg2020/train')/x))
test_imgs['filepath'] = test_imgs['file_name'].apply(lambda x: str(Path('nybg2020/test')/x))

# Sanity checks: files exist (sample)
missing_train = sum(not Path(p).exists() for p in train_df['filepath'].head(100))
missing_test = sum(not Path(p).exists() for p in test_imgs['filepath'].head(100))
print('Missing (sample) train files:', missing_train, 'Missing (sample) test files:', missing_test, flush=True)

# Class distribution summary
cnt = train_df['class_idx'].value_counts().sort_index()
print('Class count stats → min:', int(cnt.min()), 'median:', int(cnt.median()), 'p95:', int(cnt.quantile(0.95)), 'max:', int(cnt.max()), flush=True)

# Save mappings (ensure JSON-serializable types)
Path('artifacts').mkdir(exist_ok=True)
catid2idx_json = {str(int(k)): int(v) for k, v in catid2idx.items()}
idx2catid_json = {str(int(k)): int(v) for k, v in idx2catid.items()}
with open('artifacts/catid2idx.json','w') as f: json.dump(catid2idx_json, f)
with open('artifacts/idx2catid.json','w') as f: json.dump(idx2catid_json, f)

# Preview few rows
print(train_df.head(3)[['image_id','category_id','class_idx','filepath']])
print(test_imgs.head(3))
print('Elapsed: %.2fs' % (time.time()-t0), flush=True)

Train rows: 811623 Images: 811623 Test images: 219124


Unique categories in train: 32093 Categories listed: 32094


Min/Max category_id: 0 32093 → mapped to [0.. 32092 ]


Missing (sample) train files: 0 Missing (sample) test files: 0


Class count stats → min: 1 median: 7 p95: 116 max: 1412


   image_id  category_id  class_idx                                 filepath
0    124136        15672      15672  nybg2020/train/images/156/72/124136.jpg
1      5327        15672      15672    nybg2020/train/images/156/72/5327.jpg
2    449419        15672      15672  nybg2020/train/images/156/72/449419.jpg
  image_id         file_name  width  height                        filepath
0        0  images/000/0.jpg    667    1000  nybg2020/test/images/000/0.jpg
1        1  images/000/1.jpg    667    1000  nybg2020/test/images/000/1.jpg
2        2  images/000/2.jpg    675    1000  nybg2020/test/images/000/2.jpg
Elapsed: 6.76s


In [1]:
# Install correct CUDA 12.1 torch stack + timm and sanity check GPU
import os, sys, subprocess, shutil, time
from pathlib import Path

t0 = time.time()
def pip(*args):
    print('> pip', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Uninstall any preexisting torch stack to avoid mismatches
for pkg in ('torch','torchvision','torchaudio'):
    subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', pkg], check=False)

# Clean stray site dirs (idempotent)
for d in (
    '/app/.pip-target/torch',
    '/app/.pip-target/torchvision',
    '/app/.pip-target/torchaudio',
    '/app/.pip-target/torchgen',
    '/app/.pip-target/functorch',
):
    if os.path.exists(d):
        print('Removing', d, flush=True)
        shutil.rmtree(d, ignore_errors=True)

# Install cu121 torch stack
pip('install','--index-url','https://download.pytorch.org/whl/cu121','--extra-index-url','https://pypi.org/simple','torch==2.4.1','torchvision==0.19.1','torchaudio==2.4.1')

# Freeze versions
Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')

# Install timm and helpers without upgrading torch
pip('install','-c','constraints.txt','timm==1.0.9','albumentations==1.4.10','opencv-python-headless==4.10.0.84','accelerate==0.34.2','--upgrade-strategy','only-if-needed')

import torch
print('torch:', torch.__version__, 'CUDA build:', getattr(torch.version, 'cuda', None), flush=True)
print('CUDA available:', torch.cuda.is_available(), flush=True)
assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
assert torch.cuda.is_available(), 'CUDA not available'
print('GPU:', torch.cuda.get_device_name(0), flush=True)
print('Elapsed install/check: %.1fs' % (time.time()-t0), flush=True)

Found existing installation: torch 2.4.1


Uninstalling torch-2.4.1:
  Successfully uninstalled torch-2.4.1


Found existing installation: torchvision 0.19.1
Uninstalling torchvision-0.19.1:
  Successfully uninstalled torchvision-0.19.1


> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 531.5 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 523.3 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 512.4 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 427.9 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 495.6 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 263.9 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 447.0 MB/s eta 0:00:00
Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 395.2 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 464.4 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 265.1 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 173.2 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 251.6 MB/s eta 0:00:00


Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 502.7 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 197.2 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 119.5 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 465.5 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106


  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 214.9 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 185.3 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 153.3 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 456.2 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 335.4 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 531.4 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 296.3 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)


Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 494.9 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.2 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0




> pip install -c constraints.txt timm==1.0.9 albumentations==1.4.10 opencv-python-headless==4.10.0.84 accelerate==0.34.2 --upgrade-strategy only-if-needed


Collecting timm==1.0.9
  Downloading timm-1.0.9-py3-none-any.whl (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 69.1 MB/s eta 0:00:00
Collecting albumentations==1.4.10
  Downloading albumentations-1.4.10-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.9/161.9 KB 21.3 MB/s eta 0:00:00


Collecting opencv-python-headless==4.10.0.84
  Downloading opencv_python_headless-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (49.9 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 49.9/49.9 MB 340.1 MB/s eta 0:00:00
Collecting accelerate==0.34.2
  Downloading accelerate-0.34.2-py3-none-any.whl (324 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 324.4/324.4 KB 244.3 MB/s eta 0:00:00
Collecting pyyaml
  Downloading PyYAML-6.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (762 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 763.0/763.0 KB 474.5 MB/s eta 0:00:00
Collecting huggingface_hub
  Downloading huggingface_hub-0.35.0-py3-none-any.whl (563 kB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 563.4/563.4 KB 511.2 MB/s eta 0:00:00
Collecting torch
  Downloading torch-2.4.1-cp311-cp311-manylinux1_x86_64.whl (797.1 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 797.1/797.1 MB 301.9 MB/s eta 0:00:00


Collecting torchvision
  Downloading torchvision-0.19.1-cp311-cp311-manylinux1_x86_64.whl (7.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.0/7.0 MB 368.8 MB/s eta 0:00:00


Collecting safetensors
  Downloading safetensors-0.6.2-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (485 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 485.8/485.8 KB 393.7 MB/s eta 0:00:00


Collecting pydantic>=2.7.0
  Downloading pydantic-2.11.9-py3-none-any.whl (444 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 444.9/444.9 KB 496.9 MB/s eta 0:00:00
Collecting scipy>=1.10.0


  Downloading scipy-1.16.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (35.9 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 35.9/35.9 MB 92.0 MB/s eta 0:00:00


Collecting typing-extensions>=4.9.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 385.2 MB/s eta 0:00:00
Collecting scikit-learn>=1.3.2
  Downloading scikit_learn-1.7.2-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (9.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 9.7/9.7 MB 284.6 MB/s eta 0:00:00


Collecting albucore>=0.0.11
  Downloading albucore-0.0.33-py3-none-any.whl (18 kB)
Collecting scikit-image>=0.21.0
  Downloading scikit_image-0.25.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.8/14.8 MB 253.0 MB/s eta 0:00:00


Collecting numpy<2,>=1.24.4
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 260.8 MB/s eta 0:00:00


Collecting psutil
  Downloading psutil-7.1.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (291 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 291.2/291.2 KB 479.6 MB/s eta 0:00:00
Collecting packaging>=20.0
  Downloading packaging-25.0-py3-none-any.whl (66 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 66.5/66.5 KB 414.3 MB/s eta 0:00:00


Collecting stringzilla>=3.10.4
  Downloading stringzilla-4.0.14-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl (496 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 496.5/496.5 KB 495.7 MB/s eta 0:00:00


Collecting simsimd>=5.9.2
  Downloading simsimd-6.5.3-cp311-cp311-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 465.2 MB/s eta 0:00:00
Collecting requests
  Downloading requests-2.32.5-py3-none-any.whl (64 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.7/64.7 KB 428.8 MB/s eta 0:00:00
Collecting tqdm>=4.42.1


  Downloading tqdm-4.67.1-py3-none-any.whl (78 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.5/78.5 KB 404.5 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)
Collecting fsspec>=2023.5.0
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 395.9 MB/s eta 0:00:00
Collecting hf-xet<2.0.0,>=1.1.3
  Downloading hf_xet-1.1.10-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.2/3.2 MB 518.2 MB/s eta 0:00:00


Collecting pydantic-core==2.33.2
  Downloading pydantic_core-2.33.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 508.0 MB/s eta 0:00:00
Collecting annotated-types>=0.6.0
  Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)
Collecting typing-inspection>=0.4.0
  Downloading typing_inspection-0.4.1-py3-none-any.whl (14 kB)
Collecting tifffile>=2022.8.12
  Downloading tifffile-2025.9.20-py3-none-any.whl (230 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 230.1/230.1 KB 453.0 MB/s eta 0:00:00


Collecting pillow>=10.1
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 276.0 MB/s eta 0:00:00
Collecting lazy-loader>=0.4
  Downloading lazy_loader-0.4-py3-none-any.whl (12 kB)
Collecting networkx>=3.0
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 501.0 MB/s eta 0:00:00
Collecting imageio!=2.35.0,>=2.33
  Downloading imageio-2.37.0-py3-none-any.whl (315 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 315.8/315.8 KB 442.8 MB/s eta 0:00:00
Collecting threadpoolctl>=3.1.0
  Downloading threadpoolctl-3.6.0-py3-none-any.whl (18 kB)


Collecting joblib>=1.2.0
  Downloading joblib-1.5.2-py3-none-any.whl (308 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 308.4/308.4 KB 494.4 MB/s eta 0:00:00
Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 226.2 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 157.7 MB/s eta 0:00:00
Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 66.7 MB/s eta 0:00:00
Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 215.7 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 137.2 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 276.2 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 406.9 MB/s eta 0:00:00
Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 418.0 MB/s eta 0:00:00
Collecting nvidia-cuda-runtime-cu12==12.1.105


  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 486.1 MB/s eta 0:00:00
Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 282.8 MB/s eta 0:00:00
Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 297.2 MB/s eta 0:00:00
Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 223.5 MB/s eta 0:00:00
Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 299.6 MB/s eta 0:00:00
Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 458.8 MB/s eta 0:00:00
Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 300.2 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading MarkupSafe-3.0.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (23 kB)
Collecting idna<4,>=2.5
  Downloading idna-3.10-py3-none-any.whl (70 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 70.4/70.4 KB 400.1 MB/s eta 0:00:00
Collecting certifi>=2017.4.17
  Downloading certifi-2025.8.3-py3-none-any.whl (161 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.2/161.2 KB 449.9 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
  Downloading urllib3-2.5.0-py3-none-any.whl (129 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.8/129.8 KB 430.6 MB/s eta 0:00:00


Collecting charset_normalizer<4,>=2
  Downloading charset_normalizer-3.4.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (150 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 150.3/150.3 KB 451.4 MB/s eta 0:00:00
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 531.9 MB/s eta 0:00:00


Installing collected packages: simsimd, mpmath, urllib3, typing-extensions, tqdm, threadpoolctl, sympy, stringzilla, safetensors, pyyaml, psutil, pillow, packaging, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, joblib, idna, hf-xet, fsspec, filelock, charset_normalizer, certifi, annotated-types, typing-inspection, triton, tifffile, scipy, requests, pydantic-core, opencv-python-headless, nvidia-cusparse-cu12, nvidia-cudnn-cu12, lazy-loader, jinja2, imageio, scikit-learn, scikit-image, pydantic, nvidia-cusolver-cu12, huggingface_hub, albucore, torch, albumentations, torchvision, accelerate, timm


Successfully installed MarkupSafe-3.0.2 accelerate-0.34.2 albucore-0.0.33 albumentations-1.4.10 annotated-types-0.7.0 certifi-2025.8.3 charset_normalizer-3.4.3 filelock-3.19.1 fsspec-2025.9.0 hf-xet-1.1.10 huggingface_hub-0.35.0 idna-3.10 imageio-2.37.0 jinja2-3.1.6 joblib-1.5.2 lazy-loader-0.4 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 opencv-python-headless-4.10.0.84 packaging-25.0 pillow-11.3.0 psutil-7.1.0 pydantic-2.11.9 pydantic-core-2.33.2 pyyaml-6.0.2 requests-2.32.5 safetensors-0.6.2 scikit-image-0.25.2 scikit-learn-1.7.2 scipy-1.16.2 simsimd-6.5.3 stringzilla-4.0.14 sympy-1.14.0 threadpoolctl-3.6.0 tifffile-2025.9.20 timm-1.0.9 tor



torch: 2.4.1+cu121 CUDA build: 12.1


CUDA available: True


GPU: NVIDIA A10-24Q


Elapsed install/check: 131.6s


In [4]:
# Dataset, transforms, folds, and utilities (torchvision-based to avoid albumentations issues)
import math, random, gc
import numpy as np
import pandas as pd
from pathlib import Path
from typing import List, Tuple
from sklearn.model_selection import StratifiedKFold
from PIL import Image
import torchvision.transforms as T

import torch
from torch.utils.data import Dataset, DataLoader

def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = False
    torch.backends.cudnn.benchmark = True

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def get_transforms(img_size: int = 384, train: bool = True):
    if train:
        return T.Compose([
            T.RandomResizedCrop(img_size, scale=(0.8, 1.0), ratio=(0.9, 1.1)),
            T.RandomHorizontalFlip(p=0.5),
            # Mild, stable augs for fast convergence
            T.ColorJitter(0.1, 0.1, 0.1, 0.05),
            T.ToTensor(),
            T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
            T.RandomErasing(p=0.05, scale=(0.02, 0.2), ratio=(0.3, 3.3), value=0),
        ])
    else:
        return T.Compose([
            T.Resize(img_size),
            T.CenterCrop(img_size),
            T.ToTensor(),
            T.Normalize(mean=IMAGENET_MEAN, std=IMAGENET_STD),
        ])

class HerbariumDataset(Dataset):
    def __init__(self, df: pd.DataFrame, img_col: str = 'filepath', label_col: str | None = 'class_idx', transforms=None):
        self.df = df.reset_index(drop=True)
        self.imgs = self.df[img_col].tolist()
        self.labels = None if label_col is None else self.df[label_col].astype(int).tolist()
        self.transforms = transforms
    def __len__(self):
        return len(self.imgs)
    def __getitem__(self, idx):
        p = self.imgs[idx]
        with Image.open(p) as im:
            im = im.convert('RGB')
            img = self.transforms(im) if self.transforms else T.ToTensor()(im)
        if self.labels is None:
            return img, -1
        return img, self.labels[idx]

def make_folds(df: pd.DataFrame, n_splits: int = 5, seed: int = 42) -> List[Tuple[np.ndarray, np.ndarray]]:
    # Long-tail safe stratification via frequency bins
    counts = df['class_idx'].map(df['class_idx'].value_counts()).values
    bins = np.array([1,2,3,5,10,20,50,100,200,500,1000,1e9], dtype=float)
    y_bins = np.digitize(counts, bins)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    splits = []
    for fi, (tr, va) in enumerate(skf.split(np.zeros(len(y_bins)), y_bins)):
        print(f'Fold {fi}: train {len(tr)} val {len(va)}', flush=True)
        splits.append((tr, va))
    return splits

def compute_class_weights(df: pd.DataFrame, col='class_idx', power: float = 0.5) -> torch.Tensor:
    counts = df[col].value_counts().sort_index().values.astype(float)
    w = 1.0 / np.power(counts, power)
    w = w / w.mean()
    return torch.tensor(w, dtype=torch.float32)

# Build (or load) folds and example loaders (smoke-test ready)
GLOBAL_IMG_SIZE = 384
set_seed(42)
folds_cache = Path('artifacts/folds.npy')
if not folds_cache.exists():
    assert 'train_df' in globals(), 'train_df not found; run previous cell'
    splits = make_folds(train_df, n_splits=5, seed=42)
    np.save(folds_cache, np.array(splits, dtype=object))
else:
    splits = np.load(folds_cache, allow_pickle=True).tolist()
    print('Loaded cached folds:', len(splits))

# Prepare a tiny stratified subset for smoke run (limit classes to keep it small)
def stratified_sample(df: pd.DataFrame, per_class: int = 1, max_classes: int = 512, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    classes = df['class_idx'].unique()
    if len(classes) > max_classes:
        classes = rng.choice(classes, size=max_classes, replace=False)
        df = df[df['class_idx'].isin(classes)]
    parts = []
    for k, g in df.groupby('class_idx', sort=False):
        take = min(per_class, len(g))
        parts.append(g.sample(n=take, random_state=seed))
    return pd.concat(parts, axis=0).reset_index(drop=True)

smoke_df = stratified_sample(train_df[['filepath','class_idx']], per_class=1, max_classes=512, seed=42)
print('Smoke subset size:', len(smoke_df), 'unique classes:', smoke_df['class_idx'].nunique(), flush=True)

# Example DataLoaders for smoke test
train_tfms = get_transforms(GLOBAL_IMG_SIZE, train=True)
val_tfms = get_transforms(GLOBAL_IMG_SIZE, train=False)
smoke_ds = HerbariumDataset(smoke_df, transforms=train_tfms)
smoke_loader = DataLoader(smoke_ds, batch_size=32, shuffle=True, num_workers=4, pin_memory=True, drop_last=False)
print('Smoke loader ready, batches:', len(smoke_loader), flush=True)

Loaded cached folds: 5
Smoke subset size: 512 unique classes: 512


Smoke loader ready, batches: 16


In [None]:
# Training loop (smoke run) with ConvNeXt-B @384, AMP, CE+LS, class weights
import time, math, os
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import autocast, GradScaler
import timm

def build_model(num_classes: int, model_name: str = 'convnext_base.fb_in22k', pretrained: bool = True):
    m = timm.create_model(model_name, pretrained=pretrained, num_classes=num_classes)
    m.to(memory_format=torch.channels_last)
    return m

def train_smoke_epoch():
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    num_classes = int(train_df['class_idx'].nunique())
    print('Num classes:', num_classes, flush=True)
    model = build_model(num_classes, 'convnext_base.fb_in22k', pretrained=True).to(device)
    model = model.to(memory_format=torch.channels_last)
    torch.backends.cudnn.benchmark = True

    cls_weights = compute_class_weights(train_df).to(device)
    criterion = nn.CrossEntropyLoss(weight=cls_weights, label_smoothing=0.10).to(device)
    optimizer = optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.05)
    scaler = GradScaler(enabled=True)

    model.train()
    total, correct, running_loss = 0, 0, 0.0
    t0 = time.time()
    for bi, (imgs, labels) in enumerate(smoke_loader):
        imgs = imgs.to(device, non_blocking=True).to(memory_format=torch.channels_last)
        labels = labels.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        with autocast(enabled=True):
            logits = model(imgs)
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        running_loss += loss.item() * imgs.size(0)
        preds = logits.argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += imgs.size(0)
        if (bi+1) % 5 == 0:
            elapsed = time.time()-t0
            print(f'[Smoke] Batch {bi+1}/{len(smoke_loader)}  loss={running_loss/total:.4f} acc={correct/total:.4f} elapsed={elapsed:.1f}s', flush=True)
    print(f'[Smoke] Done. Loss={running_loss/total:.4f} Acc={correct/total:.4f} total={total}', flush=True)

# Note: Run this once to validate end-to-end before full 5-fold training
print('Ready to run smoke epoch once installations finish. Call train_smoke_epoch() next.', flush=True)

In [None]:
# Fix albumentations dependency: install/upgrade albucore to match A>=1.4
import sys, subprocess, time
t0=time.time()
def pip(*args):
    print('> pip', *args, flush=True)
    subprocess.run([sys.executable, '-m', 'pip', *args], check=True)

# Honor torch constraints; only install albucore
pip('install','-c','constraints.txt','albucore>=0.0.12','--upgrade','--upgrade-strategy','only-if-needed')

import albucore, albumentations as A
from albumentations.pytorch import ToTensorV2
print('albucore:', getattr(albucore,'__version__','?'), 'albumentations:', getattr(A,'__version__','?'), flush=True)
print('Albumentations import OK. Elapsed: %.1fs' % (time.time()-t0), flush=True)

In [None]:
# Run smoke training epoch to validate pipeline
train_smoke_epoch()

In [None]:
# Fast prototype-based inference with ConvNeXt features (no training)
import torch, torch.nn.functional as F, time
from torch.utils.data import DataLoader
import timm
import pandas as pd

def build_feature_extractor(model_name: str = 'convnext_base.fb_in22k', img_size: int = 256):
    # num_classes=0 makes model forward return pooled features
    model = timm.create_model(model_name, pretrained=True, num_classes=0, global_pool='avg')
    model.to(torch.device('cuda' if torch.cuda.is_available() else 'cpu'))
    model.to(memory_format=torch.channels_last)
    model.eval()
    tfm = get_transforms(img_size, train=False)
    return model, tfm

def select_prototypes(df: pd.DataFrame) -> pd.DataFrame:
    # one image per class (first occurrence) for speed
    return df[['filepath','class_idx']].drop_duplicates('class_idx', keep='first').reset_index(drop=True)

def extract_embeddings(model, df: pd.DataFrame, batch_size: int = 128, img_size: int = 256, log_prefix: str = '', num_workers: int = 4):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    ds = HerbariumDataset(df, transforms=get_transforms(img_size, train=False))
    dl = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
    embs = []
    t0 = time.time()
    with torch.no_grad():
        for bi, (imgs, _) in enumerate(dl):
            imgs = imgs.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast(device_type='cuda', enabled=device.type=='cuda'):
                feat = model(imgs)
            feat = F.normalize(feat.float(), dim=1)
            embs.append(feat.cpu())
            if (bi+1) % 50 == 0:
                print(f'{log_prefix} batch {bi+1}/{len(dl)} elapsed {time.time()-t0:.1f}s', flush=True)
    embs = torch.cat(embs, dim=0)
    return embs

def run_prototype_submission(img_size: int = 256, batch_size: int = 128):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # Build feature extractor
    print('Building feature extractor...', flush=True)
    model, _ = build_feature_extractor('convnext_base.fb_in22k', img_size)

    # Prototypes (one per class)
    print('Selecting prototypes...', flush=True)
    proto_df = select_prototypes(train_df)
    print('Prototype images:', len(proto_df), 'num classes:', train_df['class_idx'].nunique(), flush=True)
    proto_embs = extract_embeddings(model, proto_df, batch_size=batch_size, img_size=img_size, log_prefix='PROT', num_workers=4)
    proto_embs = proto_embs.to(device, non_blocking=True).half() if device.type=='cuda' else proto_embs

    # Test embeddings streamed and classified on the fly
    print('Classifying test...', flush=True)
    # Ensure test order by image_id matching sample_submission Id
    test_order = test_imgs.sort_values('image_id').reset_index(drop=True)
    test_ds = HerbariumDataset(test_order[['filepath']], label_col=None, transforms=get_transforms(img_size, train=False))
    test_dl = DataLoader(test_ds, batch_size=batch_size, shuffle=False, num_workers=4, pin_memory=True)

    preds_idx = []
    t0 = time.time()
    with torch.no_grad():
        for bi, (imgs, _) in enumerate(test_dl):
            imgs = imgs.to(device, non_blocking=True).to(memory_format=torch.channels_last)
            with torch.amp.autocast(device_type='cuda', enabled=device.type=='cuda'):
                feats = model(imgs)
            feats = F.normalize(feats.float(), dim=1)
            if device.type=='cuda':
                feats = feats.half()
            # cosine similarity via dot product of normalized vectors
            sims = feats @ proto_embs.T
            top_idx = sims.argmax(dim=1).int().cpu().tolist()
            preds_idx.extend(top_idx)
            if (bi+1) % 50 == 0:
                print(f'TEST batch {bi+1}/{len(test_dl)} elapsed {time.time()-t0:.1f}s', flush=True)

    # Map prototype class_idx back to original category_id
    # proto_df rows correspond to columns in proto_embs
    proto_class_idx = proto_df['class_idx'].tolist()
    mapped_catid = [idx2catid[str(int(proto_class_idx[i]))] if isinstance(list(idx2catid.keys())[0], str) else idx2catid[int(proto_class_idx[i])] for i in preds_idx]

    sub = pd.DataFrame({'Id': test_order['image_id'].astype(int), 'Predicted': pd.Series(mapped_catid).astype(int)})
    sub = sub.sort_values('Id').reset_index(drop=True)
    sub.to_csv('submission.csv', index=False)
    print('Wrote submission.csv with', len(sub), 'rows')

print('Prototype pipeline ready. Call run_prototype_submission() to generate submission.csv', flush=True)

In [None]:
# Generate submission with prototype-based classifier (replaces any existing submission.csv)
run_prototype_submission(img_size=224, batch_size=128)

In [23]:
# Finetune ConvNeXt-B @224 with head-warmup + full FT; CE+LS (no class weights) + WRS; EMA; step-capped epochs
import os, math, time, gc, json
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.cuda.amp import GradScaler
from torch.optim.lr_scheduler import CosineAnnealingLR
import timm
from timm.utils import ModelEmaV2
from torch.utils.data import DataLoader, WeightedRandomSampler

os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'expandable_segments:True')
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.set_float32_matmul_precision('high')

# Global hyperparams
IMG_SIZE = 224
BATCH_SIZE = 64  # drop to 48 if OOM
EPOCHS = 8
WARMUP_EPOCHS = 2  # head-only warmup
LR_BACKBONE = 3e-3
LR_HEAD = 1e-2
WD = 0.05
NUM_WORKERS = 8  # after clean torch install, use workers for speed
STEPS_PER_EPOCH = 1000  # cap steps/epoch via sampler (reduced for faster iterations)
VAL_MAX_BATCHES = 400   # cap validation to speed up
EVAL_EVERY = 1          # validate every epoch
ART_DIR = Path('artifacts'); ART_DIR.mkdir(exist_ok=True)

def build_convnext_b(num_classes: int):
    m = timm.create_model('convnext_base.fb_in22k', pretrained=True, num_classes=num_classes)
    # DO NOT enable grad checkpointing (it broke grads earlier)
    # Init classifier bias to -log(num_classes) to avoid large initial logits
    try:
        head = m.get_classifier() if hasattr(m, 'get_classifier') else getattr(m, 'head', None)
        if head is None:
            head = getattr(m, 'head', None)
        bias = getattr(head, 'bias', None)
        if bias is not None:
            nn.init.constant_(bias, -math.log(float(num_classes)))
            print('Init head bias to -log(C)', flush=True)
    except Exception as e:
        print('Head bias init skipped:', e, flush=True)
    m.to(memory_format=torch.channels_last)
    return m.to(DEVICE)

def make_loaders_for_fold(fold_idx: int):
    tr_idx, va_idx = splits[fold_idx]
    tr_df = train_df.iloc[tr_idx][['filepath','class_idx']].reset_index(drop=True)
    va_df = train_df.iloc[va_idx][['filepath','class_idx']].reset_index(drop=True)
    # per-sample weights for sampler (inverse sqrt of class freq)
    counts = train_df['class_idx'].value_counts().sort_index().values.astype(float)
    samp_w_cls = 1.0 / np.sqrt(counts)
    samp_w_cls = samp_w_cls / samp_w_cls.mean()
    samp_w = torch.tensor(samp_w_cls, dtype=torch.float32)
    per_sample_w = samp_w[torch.tensor(tr_df['class_idx'].values, dtype=torch.long)]
    # Limit epoch length
    num_samples = int(STEPS_PER_EPOCH * BATCH_SIZE)
    sampler = WeightedRandomSampler(weights=per_sample_w, num_samples=num_samples, replacement=True)

    tr_ds = HerbariumDataset(tr_df, transforms=get_transforms(IMG_SIZE, train=True))
    va_ds = HerbariumDataset(va_df, transforms=get_transforms(IMG_SIZE, train=False))

    # Build DataLoader kwargs safely when NUM_WORKERS==0 (no prefetch/persistent)
    tr_kwargs = dict(batch_size=BATCH_SIZE, sampler=sampler, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True, drop_last=True)
    va_kwargs = dict(batch_size=BATCH_SIZE, shuffle=False, num_workers=NUM_WORKERS, pin_memory=True, drop_last=False)
    if NUM_WORKERS > 0:
        tr_kwargs.update(dict(persistent_workers=True, prefetch_factor=2))
        va_kwargs.update(dict(persistent_workers=True, prefetch_factor=2))

    tr_loader = DataLoader(tr_ds, **tr_kwargs)
    va_loader = DataLoader(va_ds, **va_kwargs)
    return tr_loader, va_loader

def split_head_backbone_params(model):
    head_names = set()
    # ConvNeXt in timm uses m.head.fc as classifier
    if hasattr(model, 'head') and hasattr(model.head, 'fc'):
        for n, _ in model.head.fc.named_parameters(prefix='head.fc'):
            head_names.add(n)
    elif hasattr(model, 'get_classifier'):
        for n, _ in model.get_classifier().named_parameters():
            head_names.add(n)
    head_params, backbone_params = [], []
    for n, p in model.named_parameters():
        if ('head' in n and 'fc' in n) or (n in head_names):
            head_params.append(p)
        else:
            backbone_params.append(p)
    return head_params, backbone_params

def train_one_fold(fold_idx: int):
    if DEVICE.type=='cuda': torch.cuda.empty_cache()
    num_classes = int(train_df['class_idx'].nunique())
    model = build_convnext_b(num_classes)
    ema = ModelEmaV2(model, decay=0.9998)
    tr_loader, va_loader = make_loaders_for_fold(fold_idx)
    print(f'[F{fold_idx}] len(tr_loader)={len(tr_loader)} expected={STEPS_PER_EPOCH}', flush=True)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.05).to(DEVICE)

    # Phase 1: head-only warmup
    for p in model.parameters():
        p.requires_grad = False
    head_params, _ = split_head_backbone_params(model)
    for p in head_params:
        p.requires_grad = True
    optimizer = optim.AdamW(head_params, lr=LR_HEAD, weight_decay=0.0)
    scheduler = None  # no scheduler in warmup
    scaler = GradScaler(enabled=(DEVICE.type=='cuda'))
    best_f1 = -1.0
    ckpt_path = ART_DIR / f'convnext_b_fold{fold_idx}.pt'

    for epoch in range(EPOCHS):
        model.train()
        torch.backends.cudnn.benchmark = True
        t0 = time.time()
        tr_loss, seen = 0.0, 0
        print(f'[F{fold_idx}] Epoch {epoch+1} steps this epoch: {len(tr_loader)}', flush=True)
        for bi, (imgs, labels) in enumerate(tr_loader):
            imgs = imgs.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
            labels = labels.to(DEVICE, non_blocking=True)
            optimizer.zero_grad(set_to_none=True)
            with torch.amp.autocast(device_type='cuda', enabled=(DEVICE.type=='cuda')):
                logits = model(imgs)
                loss = criterion(logits, labels)
            scaler.scale(loss).backward()
            nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            scaler.step(optimizer); scaler.update()
            ema.update(model)
            tr_loss += loss.item() * imgs.size(0)
            seen += imgs.size(0)
            if (bi+1) % 200 == 0:
                print(f'[F{fold_idx}] ep{epoch+1} step{bi+1}/{len(tr_loader)} loss={tr_loss/max(seen,1):.4f} elapsed={time.time()-t0:.1f}s', flush=True)

        # Transition to Phase 2 after warmup epochs
        if epoch + 1 == WARMUP_EPOCHS:
            # Unfreeze all, set two param groups
            for p in model.parameters():
                p.requires_grad = True
            head_params, backbone_params = split_head_backbone_params(model)
            optimizer = optim.AdamW([
                {'params': backbone_params, 'lr': LR_BACKBONE, 'weight_decay': WD},
                {'params': head_params, 'lr': LR_HEAD * 1.5, 'weight_decay': WD},
            ])
            scheduler = CosineAnnealingLR(optimizer, T_max=max(EPOCHS - WARMUP_EPOCHS, 1), eta_min=1e-6)

        # Scheduler step in Phase 2
        if scheduler is not None and (epoch + 1) > WARMUP_EPOCHS:
            scheduler.step()

        # Validation each epoch (EMA weights), capped batches
        ema_model = ema.module
        ema_model.eval()
        val_logits, val_labels = [], []
        with torch.no_grad():
            max_batches = min(len(va_loader), VAL_MAX_BATCHES)
            for bi, (imgs, labels) in enumerate(va_loader):
                imgs = imgs.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
                with torch.amp.autocast(device_type='cuda', enabled=(DEVICE.type=='cuda')):
                    logits = ema_model(imgs)
                val_logits.append(logits.float().cpu())
                val_labels.append(labels.cpu())
                if (bi+1) % 100 == 0:
                    print(f'[F{fold_idx}] val batch {bi+1}/{max_batches}', flush=True)
                if (bi + 1) >= VAL_MAX_BATCHES:
                    break
        val_logits = torch.cat(val_logits, dim=0)
        val_labels = torch.cat(val_labels, dim=0).numpy()
        val_preds = val_logits.argmax(dim=1).numpy()
        f1 = f1_score(val_labels, val_preds, average='macro')
        print(f'[F{fold_idx}] Epoch {epoch+1}/{EPOCHS} steps/ep={len(tr_loader)} train_loss={tr_loss/max(seen,1):.4f} val_macroF1={f1:.5f} elapsed={time.time()-t0:.1f}s', flush=True)

        if f1 > best_f1:
            best_f1 = f1
            torch.save({'model': ema_model.state_dict(), 'f1': best_f1}, ckpt_path)
            print(f'[F{fold_idx}] New best F1 {best_f1:.5f}. Saved {ckpt_path}', flush=True)

        gc.collect();
        if DEVICE.type=='cuda': torch.cuda.empty_cache()

    print(f'[F{fold_idx}] Best val macro-F1: {best_f1:.5f}', flush=True)
    return str(ckpt_path), best_f1

def train_n_folds(n_first_folds: int = 1):
    ckpts, scores = [], []
    for f in range(n_first_folds):
        print(f'=== Training fold {f} ===', flush=True)
        ck, sc = train_one_fold(f)
        ckpts.append(ck); scores.append(sc)
    print('Fold scores:', scores, 'avg:', float(np.mean(scores)) if scores else -1, flush=True)
    Path(ART_DIR/'trained_folds.json').write_text(json.dumps({'ckpts': ckpts, 'scores': scores}))
    return ckpts, scores

print('Finetune pipeline ready. Call train_n_folds(1) to start.', flush=True)

Finetune pipeline ready. Call train_n_folds(1) to start.


In [24]:
# Launch finetuning: train 3 folds for ensemble
ckpts, scores = train_n_folds(3)
print('Trained folds:', ckpts, 'Scores:', scores)

=== Training fold 0 ===


Init head bias to -log(C)


[F0] len(tr_loader)=1000 expected=1000


[F0] Epoch 1 steps this epoch: 1000


  scaler = GradScaler(enabled=(DEVICE.type=='cuda'))


[F0] ep1 step200/1000 loss=11.3310 elapsed=24.4s


[F0] ep1 step400/1000 loss=11.2726 elapsed=45.7s


[F0] ep1 step600/1000 loss=11.1530 elapsed=67.0s


[F0] ep1 step800/1000 loss=11.0239 elapsed=88.5s


[F0] ep1 step1000/1000 loss=10.8841 elapsed=110.0s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 1/8 steps/ep=1000 train_loss=10.8841 val_macroF1=0.01399 elapsed=151.5s


[F0] New best F1 0.01399. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 2 steps this epoch: 1000


[F0] ep2 step200/1000 loss=10.1689 elapsed=22.6s


[F0] ep2 step400/1000 loss=10.0721 elapsed=44.2s


[F0] ep2 step600/1000 loss=9.9590 elapsed=65.8s


[F0] ep2 step800/1000 loss=9.8625 elapsed=87.5s


[F0] ep2 step1000/1000 loss=9.7734 elapsed=109.1s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 2/8 steps/ep=1000 train_loss=9.7734 val_macroF1=0.04860 elapsed=146.5s


[F0] New best F1 0.04860. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 3 steps this epoch: 1000


[F0] ep3 step200/1000 loss=8.7229 elapsed=61.6s


[F0] ep3 step400/1000 loss=8.7125 elapsed=122.5s


[F0] ep3 step600/1000 loss=8.6495 elapsed=183.6s


[F0] ep3 step800/1000 loss=8.5406 elapsed=244.5s


[F0] ep3 step1000/1000 loss=8.4248 elapsed=305.9s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 3/8 steps/ep=1000 train_loss=8.4248 val_macroF1=0.06758 elapsed=343.6s


[F0] New best F1 0.06758. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 4 steps this epoch: 1000


[F0] ep4 step200/1000 loss=7.6868 elapsed=61.4s


[F0] ep4 step400/1000 loss=7.5446 elapsed=122.6s


[F0] ep4 step600/1000 loss=7.4283 elapsed=183.9s


[F0] ep4 step800/1000 loss=7.3281 elapsed=245.1s


[F0] ep4 step1000/1000 loss=7.2434 elapsed=306.5s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 4/8 steps/ep=1000 train_loss=7.2434 val_macroF1=0.08087 elapsed=344.1s


[F0] New best F1 0.08087. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 5 steps this epoch: 1000


[F0] ep5 step200/1000 loss=6.2975 elapsed=62.3s


[F0] ep5 step400/1000 loss=6.0409 elapsed=123.4s


[F0] ep5 step600/1000 loss=5.8902 elapsed=184.6s


[F0] ep5 step800/1000 loss=5.7981 elapsed=245.8s


[F0] ep5 step1000/1000 loss=5.7225 elapsed=307.3s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 5/8 steps/ep=1000 train_loss=5.7225 val_macroF1=0.10612 elapsed=344.9s


[F0] New best F1 0.10612. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 6 steps this epoch: 1000


[F0] ep6 step200/1000 loss=5.0576 elapsed=61.9s


[F0] ep6 step400/1000 loss=4.9197 elapsed=123.0s


[F0] ep6 step600/1000 loss=4.8367 elapsed=184.1s


[F0] ep6 step800/1000 loss=4.7792 elapsed=245.3s


[F0] ep6 step1000/1000 loss=4.7329 elapsed=306.8s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 6/8 steps/ep=1000 train_loss=4.7329 val_macroF1=0.12689 elapsed=345.0s


[F0] New best F1 0.12689. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 7 steps this epoch: 1000


[F0] ep7 step200/1000 loss=4.1085 elapsed=61.8s


[F0] ep7 step400/1000 loss=3.9005 elapsed=123.0s


[F0] ep7 step600/1000 loss=3.7656 elapsed=184.2s


[F0] ep7 step800/1000 loss=3.6691 elapsed=245.3s


[F0] ep7 step1000/1000 loss=3.5911 elapsed=306.7s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 7/8 steps/ep=1000 train_loss=3.5911 val_macroF1=0.14769 elapsed=344.2s


[F0] New best F1 0.14769. Saved artifacts/convnext_b_fold0.pt


[F0] Epoch 8 steps this epoch: 1000


[F0] ep8 step200/1000 loss=3.0637 elapsed=61.5s


[F0] ep8 step400/1000 loss=2.9928 elapsed=122.6s


[F0] ep8 step600/1000 loss=2.9515 elapsed=183.7s


[F0] ep8 step800/1000 loss=2.9227 elapsed=244.8s


[F0] ep8 step1000/1000 loss=2.8904 elapsed=306.2s


[F0] val batch 100/400


[F0] val batch 200/400


[F0] val batch 300/400


[F0] val batch 400/400


[F0] Epoch 8/8 steps/ep=1000 train_loss=2.8904 val_macroF1=0.16330 elapsed=343.6s


[F0] New best F1 0.16330. Saved artifacts/convnext_b_fold0.pt


[F0] Best val macro-F1: 0.16330


=== Training fold 1 ===


Init head bias to -log(C)


[F1] len(tr_loader)=1000 expected=1000


[F1] Epoch 1 steps this epoch: 1000


  scaler = GradScaler(enabled=(DEVICE.type=='cuda'))


[F1] ep1 step200/1000 loss=11.3274 elapsed=24.9s


[F1] ep1 step400/1000 loss=11.2767 elapsed=46.6s


[F1] ep1 step600/1000 loss=11.1542 elapsed=68.3s


[F1] ep1 step800/1000 loss=11.0078 elapsed=90.0s


[F1] ep1 step1000/1000 loss=10.8730 elapsed=111.7s


[F1] val batch 100/400


[F1] val batch 200/400


[F1] val batch 300/400


[F1] val batch 400/400


[F1] Epoch 1/8 steps/ep=1000 train_loss=10.8730 val_macroF1=0.01322 elapsed=153.7s


[F1] New best F1 0.01322. Saved artifacts/convnext_b_fold1.pt


[F1] Epoch 2 steps this epoch: 1000


[F1] ep2 step200/1000 loss=10.1018 elapsed=22.5s


[F1] ep2 step400/1000 loss=10.0015 elapsed=44.2s


[F1] ep2 step600/1000 loss=9.9143 elapsed=65.9s


[F1] ep2 step800/1000 loss=9.8220 elapsed=87.5s


[F1] ep2 step1000/1000 loss=9.7418 elapsed=109.2s


[F1] val batch 100/400


[F1] val batch 200/400


[F1] val batch 300/400


[F1] val batch 400/400


[F1] Epoch 2/8 steps/ep=1000 train_loss=9.7418 val_macroF1=0.04911 elapsed=146.5s


[F1] New best F1 0.04911. Saved artifacts/convnext_b_fold1.pt


[F1] Epoch 3 steps this epoch: 1000


[F1] ep3 step200/1000 loss=8.6458 elapsed=62.0s


[F1] ep3 step400/1000 loss=8.6592 elapsed=122.9s


[F1] ep3 step600/1000 loss=8.6033 elapsed=183.9s


[F1] ep3 step800/1000 loss=8.5048 elapsed=244.9s


[F1] ep3 step1000/1000 loss=8.3952 elapsed=306.3s


[F1] val batch 100/400


KeyboardInterrupt: 

In [16]:
# Inference: load fold checkpoints, 2x TTA (orig + hflip), average logits, write submission.csv
import json, math, time
from pathlib import Path
import torch
import torch.nn.functional as F
import pandas as pd
import numpy as np
from torch.utils.data import DataLoader

def load_fold_models(ckpt_paths):
    num_classes = int(train_df['class_idx'].nunique())
    models = []
    for p in ckpt_paths:
        if not Path(p).exists():
            print('Missing ckpt:', p, flush=True); continue
        m = timm.create_model('convnext_base.fb_in22k', pretrained=False, num_classes=num_classes)
        m.to(memory_format=torch.channels_last)
        sd = torch.load(p, map_location='cpu')
        m.load_state_dict(sd['model'], strict=True)
        m.to(DEVICE).eval()
        models.append(m)
        print('Loaded ckpt:', p, 'F1=', sd.get('f1', None), flush=True)
    return models

def infer_test(models, img_size: int = 384, batch_size: int = 64, num_workers: int = 8):
    assert len(models) > 0, 'No models provided for inference'
    test_order = test_imgs.sort_values('image_id').reset_index(drop=True)
    ds = HerbariumDataset(test_order[['filepath']], label_col=None, transforms=get_transforms(img_size, train=False))
    dl = DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True)
    all_logits_sum = []
    t0 = time.time()
    with torch.no_grad():
        for bi, (imgs, _) in enumerate(dl):
            imgs = imgs.to(DEVICE, non_blocking=True).to(memory_format=torch.channels_last)
            # TTA: original + hflip
            with torch.amp.autocast(device_type='cuda', enabled=(DEVICE.type=='cuda')):
                logits_ens = None
                for m in models:
                    l1 = m(imgs)
                    l2 = m(torch.flip(imgs, dims=[3]))  # horizontal flip
                    l = (l1 + l2) * 0.5
                    logits_ens = l if logits_ens is None else (logits_ens + l)
            all_logits_sum.append(logits_ens.float().cpu())
            if (bi+1) % 50 == 0:
                print(f'INF batch {bi+1}/{len(dl)} elapsed {time.time()-t0:.1f}s', flush=True)
    all_logits_sum = torch.cat(all_logits_sum, dim=0)
    return all_logits_sum, test_order

# Precompute smoothed log-prior for inference-time logit adjustment (long-tail debiasing)
counts = train_df['class_idx'].value_counts().sort_index().values.astype(np.float64)
C = counts.shape[0]
alpha = 1.0  # Laplace smoothing
prior = (counts + alpha) / (counts.sum() + alpha * C)
prior = torch.tensor(prior, dtype=torch.float32)
log_prior = prior.log()

# Defaults per expert advice
BETA = 0.7   # power/soften prior via effective tau scaling
TEMP = 1.5   # temperature
LAM = 0.5    # blend weight

def adjust_logits(logits_sum: torch.Tensor, tau: float, beta: float = BETA, T: float = TEMP, lam: float = LAM) -> torch.Tensor:
    # Temperature
    logits = logits_sum / float(T)
    # Adjust with smoothed prior
    adj = logits - (tau * beta) * log_prior[None, :]
    # Blend with unadjusted logits
    blended = (1.0 - lam) * logits + lam * adj
    return blended

def write_submission_from_logits(logits_sum: torch.Tensor, test_order: pd.DataFrame, tau: float, out_path: str = 'submission.csv', lam: float = LAM, T: float = TEMP, beta: float = BETA):
    blended = adjust_logits(logits_sum, tau=tau, beta=beta, T=T, lam=lam)
    preds_idx = blended.argmax(dim=1).numpy().tolist()
    # idx2catid keys may be str; handle both
    if isinstance(next(iter(idx2catid.keys())), str):
        mapped = [int(idx2catid[str(int(i))]) for i in preds_idx]
    else:
        mapped = [int(idx2catid[int(i)]) for i in preds_idx]
    sub = pd.DataFrame({'Id': test_order['image_id'].astype(int), 'Predicted': mapped})
    sub = sub.sort_values('Id').reset_index(drop=True)
    sub.to_csv(out_path, index=False)
    print(f'Saved {out_path} rows:', len(sub), flush=True)

def cache_test_logits(ckpt_paths, logits_path: str = 'artifacts/test_logits_fold0.pt', ids_path: str = 'artifacts/test_ids.npy', img_size: int = None):
    img_sz = IMG_SIZE if img_size is None else img_size
    models = load_fold_models(ckpt_paths)
    logits_sum, test_order = infer_test(models, img_size=img_sz, batch_size=64, num_workers=NUM_WORKERS)
    Path('artifacts').mkdir(exist_ok=True)
    torch.save(logits_sum, logits_path)
    np.save(ids_path, test_order['image_id'].values.astype(np.int64))
    print('Cached:', logits_path, 'and', ids_path, flush=True)
    return logits_sum, test_order

def write_submission_from_cached(logits_path: str, ids_path: str, tau: float, out_path: str = 'submission.csv', lam: float = LAM, T: float = TEMP, beta: float = BETA):
    logits_sum = torch.load(logits_path, map_location='cpu')
    ids = np.load(ids_path)
    blended = adjust_logits(logits_sum, tau=tau, beta=beta, T=T, lam=lam)
    preds_idx = blended.argmax(dim=1).numpy().tolist()
    if isinstance(next(iter(idx2catid.keys())), str):
        mapped = [int(idx2catid[str(int(i))]) for i in preds_idx]
    else:
        mapped = [int(idx2catid[int(i)]) for i in preds_idx]
    sub = pd.DataFrame({'Id': ids.astype(int), 'Predicted': mapped})
    sub = sub.sort_values('Id').reset_index(drop=True)
    sub.to_csv(out_path, index=False)
    print(f'Saved {out_path} rows:', len(sub), 'tau=', tau, 'lam=', lam, 'T=', T, 'beta=', beta, flush=True)

def run_inference_from_trained(n_folds_to_use: int = 1, tau: float = 1.5):
    # Try to read ckpt list from artifacts/trained_folds.json, else fallback to default paths
    ckpt_paths = []
    trained_json = ART_DIR / 'trained_folds.json'
    if trained_json.exists():
        meta = json.loads(trained_json.read_text())
        ckpt_paths = meta.get('ckpts', [])[:n_folds_to_use]
    if not ckpt_paths:
        # Default guess
        ckpt_paths = [str(ART_DIR / f'convnext_b_fold{i}.pt') for i in range(n_folds_to_use)]
    print('Using ckpts:', ckpt_paths, flush=True)
    models = load_fold_models(ckpt_paths)
    logits_sum, test_order = infer_test(models, img_size=IMG_SIZE, batch_size=64, num_workers=NUM_WORKERS)
    write_submission_from_logits(logits_sum, test_order, tau=tau, out_path='submission.csv', lam=LAM, T=TEMP, beta=BETA)

print('Inference pipeline ready with tau/T/beta/lambda adjustment. Use cache_test_logits once, then write_submission_from_cached for fast tau sweep.', flush=True)

Inference pipeline ready with tau/T/beta/lambda adjustment. Use cache_test_logits once, then write_submission_from_cached for fast tau sweep.


In [None]:
# Cache test logits once, then write submissions for tau sweeps (two presets) quickly
from pathlib import Path
import shutil, os

candidates = ['artifacts/convnext_b_fold0.pt', 'nybg2020/convnext_b_fold0.pt']
existing = [p for p in candidates if Path(p).exists()]
print('Candidate ckpts:', candidates, '-> existing:', existing, flush=True)
assert len(existing) > 0, 'No checkpoint found. Ensure training saved a ckpt.'

# Prefer the newest checkpoint (by mtime) to avoid averaging mismatched epochs
chosen_ckpt = max(existing, key=lambda p: Path(p).stat().st_mtime)
ckpt_paths = [chosen_ckpt]
print('Using newest ckpt:', ckpt_paths[0], flush=True)

logits_path = 'artifacts/test_logits_fold0.pt'
ids_path = 'artifacts/test_ids.npy'
force_recache = True  # Force refresh with latest checkpoint

if force_recache:
    for fp in (logits_path, ids_path):
        if Path(fp).exists():
            try:
                os.remove(fp)
                print('Removed stale cache:', fp, flush=True)
            except Exception as e:
                print('Could not remove', fp, e, flush=True)

if not Path(logits_path).exists() or not Path(ids_path).exists():
    print('Caching test logits/ids...', flush=True)
    cache_test_logits(ckpt_paths, logits_path=logits_path, ids_path=ids_path, img_size=IMG_SIZE)
else:
    print('Using cached logits/ids:', logits_path, ids_path, flush=True)

# Tau sweeps per expert advice
taus_wide = [0.7, 1.0, 1.2, 1.5, 1.8, 2.0, 2.5, 3.0, 3.5]
presets = [
    {'name':'p1','beta':0.7,'T':1.5,'lam':0.5, 'taus': taus_wide},
    {'name':'p2','beta':0.5,'T':1.2,'lam':0.3, 'taus': [1.5, 2.0, 2.5, 3.0, 3.5]},
]
out_root = Path('artifacts/submissions'); out_root.mkdir(parents=True, exist_ok=True)
for ps in presets:
    name, beta, T, lam, taus = ps['name'], ps['beta'], ps['T'], ps['lam'], ps['taus']
    out_dir = out_root / name
    out_dir.mkdir(parents=True, exist_ok=True)
    for t in taus:
        outp = out_dir / f'submission_{name}_tau_{t:.2f}.csv'
        write_submission_from_cached(logits_path, ids_path, tau=float(t), out_path=str(outp), lam=lam, T=T, beta=beta)

# Choose one to copy to submission.csv for immediate submit (preset p1, tau=3.00 as new try)
best_preset = 'p1'; best_tau = 3.00
chosen = out_root / best_preset / f'submission_{best_preset}_tau_{best_tau:.2f}.csv'
assert chosen.exists(), f'Missing expected {chosen}'
shutil.copyfile(chosen, 'submission.csv')
print('Copied', chosen, '-> submission.csv', flush=True)

Candidate ckpts: ['artifacts/convnext_b_fold0.pt', 'nybg2020/convnext_b_fold0.pt'] -> existing: ['artifacts/convnext_b_fold0.pt']


Using newest ckpt: artifacts/convnext_b_fold0.pt


Removed stale cache: artifacts/test_logits_fold0.pt


Removed stale cache: artifacts/test_ids.npy


Caching test logits/ids...


  sd = torch.load(p, map_location='cpu')


Loaded ckpt: artifacts/convnext_b_fold0.pt F1= 0.16330388671425206
