# Kuzushiji Recognition: Plan and Checklist

Objectives:
- Achieve medal-level f1-score by building a robust detection+recognition pipeline.
- Ship a working baseline ASAP; iterate with CV rigor and expert feedback.

Milestones:
1) Environment & GPU
- Verify GPU availability (nvidia-smi).
- Install PyTorch CUDA 12.1 stack if needed.

2) Data Audit & EDA
- Inspect train.csv, sample_submission.csv, unicode_translation.csv.
- Determine exact schema: image_id, bbox/points, unicode labels, etc.
- Verify submission format: triples per image: `Unicode cx cy`.
- Unzip train_images.zip and test_images.zip; count images, sizes.

3) Validation Protocol
- Stratify by page and character distribution if possible.
- Use image-level KFold; ensure transforms fit per fold only.
- Save folds to disk for reuse.

4) Baseline Model (Fast)
- Detector: Faster R-CNN (torchvision) or YOLOv5/8 if feasible.
- Single-stage baseline: torchvision FasterRCNN with ResNet50-FPN; train on resized images.
- Recognition: Treat as classification on detected crops (shared classifier head) or directly predict Unicode via detector’s class head if label space manageable.
- Start with modest image size (e.g., 1024 short-side), AMP, 1–3 epochs smoke test.

5) Improved Pipeline
- Increase resolution, stronger aug (Albumentations), longer training with early stopping.
- Class imbalance handling (focal loss or class weights).
- TTA for detection; NMS tuning.
- Unicode normalization via unicode_translation.csv mapping.

6) Inference & Submission
- Convert detections to required `Unicode cx cy` per image.
- Validate format vs sample_submission; sanity-check outputs.

7) Iteration & Ensembling
- If time, train 2–3 seeds or a second backbone and blend.
- Error analysis on OOF: per-class f1, confidence calibration, fix top buckets.

Risks & Mitigations:
- Heavy training time: start with small subset smoke runs; print progress/elapsed per epoch.
- Incorrect format: validate against sample and small hand-crafted files.
- CV mismatch: lock folds early; mirror test distribution if available.

Next Actions:
- A) GPU check and install torch cu121.
- B) Unzip data; inspect CSV schemas; preview images.
- C) Define folds and baseline training loop skeleton.
- D) Request expert review on plan and CV before heavy training.

In [1]:
# Environment check, data unzip, and CSV audit
import os, sys, subprocess, shutil, time, zipfile, math, json, re
from pathlib import Path
import pandas as pd

def run(cmd):
    print('$', ' '.join(cmd), flush=True)
    return subprocess.run(cmd, check=False, text=True, capture_output=True)

# 1) GPU check
print('=== NVIDIA SMI ===', flush=True)
print(run(['bash','-lc','nvidia-smi || true']).stdout)

# 2) Torch cu121 install (idempotent)
def ensure_torch_cu121():
    import importlib
    try:
        torch = importlib.import_module('torch')
        import torch as _t
        print('torch version present:', _t.__version__, 'cuda:', getattr(_t.version,'cuda',None), 'is_available:', _t.cuda.is_available())
        # If CUDA not available or wrong build, reinstall
        if not _t.cuda.is_available() or not str(getattr(_t.version,'cuda','')).startswith('12.1'):
            raise RuntimeError('Reinstall torch stack for cu121')
        return
    except Exception as e:
        print('Installing torch cu121 stack...', e)
        # Uninstall possible conflicting stacks (best-effort)
        subprocess.run([sys.executable, '-m', 'pip', 'uninstall', '-y', 'torch', 'torchvision', 'torchaudio'], check=False)
        # Clean stray site dirs that can shadow correct wheels
        for d in (
            '/app/.pip-target/torch',
            '/app/.pip-target/torchvision',
            '/app/.pip-target/torchaudio',
            '/app/.pip-target/torch-2.8.0.dist-info',
            '/app/.pip-target/torchvision-0.23.0.dist-info',
            '/app/.pip-target/torchaudio-2.8.0.dist-info',
            '/app/.pip-target/torch-2.4.1.dist-info',
            '/app/.pip-target/torchvision-0.19.1.dist-info',
            '/app/.pip-target/torchaudio-2.4.1.dist-info',
        ):
            if os.path.exists(d):
                shutil.rmtree(d, ignore_errors=True)
        def pip(*args):
            print('> pip', *args, flush=True)
            subprocess.run([sys.executable, '-m', 'pip', *args], check=True)
        pip('install', '--index-url', 'https://download.pytorch.org/whl/cu121', '--extra-index-url', 'https://pypi.org/simple', 'torch==2.4.1', 'torchvision==0.19.1', 'torchaudio==2.4.1')
        Path('constraints.txt').write_text('torch==2.4.1\ntorchvision==0.19.1\ntorchaudio==2.4.1\n')
        # Sanity
        import torch
        print('torch:', torch.__version__, 'built CUDA:', getattr(torch.version, 'cuda', None))
        print('CUDA available:', torch.cuda.is_available())
        assert str(getattr(torch.version,'cuda','')).startswith('12.1'), f'Wrong CUDA build: {torch.version.cuda}'
        assert torch.cuda.is_available(), 'CUDA not available'
        print('GPU:', torch.cuda.get_device_name(0))

ensure_torch_cu121()

CWD = Path('.')
train_csv = CWD / 'train.csv'
trans_csv = CWD / 'unicode_translation.csv'
sample_sub_csv = CWD / 'sample_submission.csv'
train_zip = CWD / 'train_images.zip'
test_zip = CWD / 'test_images.zip'
train_dir = CWD / 'train_images'
test_dir = CWD / 'test_images'

# 3) Unzip datasets if needed
def safe_unzip(zpath: Path, out_dir: Path):
    if out_dir.exists() and any(out_dir.iterdir()):
        print(f'{out_dir} exists; skipping unzip')
        return
    assert zpath.exists(), f'Missing zip: {zpath}'
    out_dir.mkdir(parents=True, exist_ok=True)
    print(f'Unzipping {zpath} -> {out_dir} ...', flush=True)
    t0 = time.time()
    with zipfile.ZipFile(zpath) as zf:
        zf.extractall(out_dir)
    print(f'Done in {time.time()-t0:.1f}s')

safe_unzip(train_zip, train_dir)
safe_unzip(test_zip, test_dir)

def count_images(img_dir: Path):
    exts = {'.jpg','.jpeg','.png','.bmp','.tif','.tiff'}
    n = 0
    for p in img_dir.rglob('*'):
        if p.suffix.lower() in exts:
            n += 1
    return n

print('Train images:', count_images(train_dir))
print('Test images:', count_images(test_dir))

# 4) CSV audit
assert train_csv.exists(), 'train.csv missing'
assert trans_csv.exists(), 'unicode_translation.csv missing'
assert sample_sub_csv.exists(), 'sample_submission.csv missing'

df_train = pd.read_csv(train_csv)
df_trans = pd.read_csv(trans_csv)
df_sample = pd.read_csv(sample_sub_csv)
print('train.csv shape:', df_train.shape)
print('train.csv columns:', df_train.columns.tolist())
print(df_train.head(3))
print('unicode_translation.csv shape:', df_trans.shape)
print(df_trans.head(3))
print('sample_submission.csv shape:', df_sample.shape)
print(df_sample.head(3))

# 5) Parse labels format guess: space-separated triples: unicode cx cy
def parse_labels_to_unicodes(labels: str):
    if not isinstance(labels, str) or labels.strip() == '':
        return []
    toks = labels.strip().split()
    # Expect groups of 3; if not, try groups of 5 (unicode x y w h) as fallback
    if len(toks) % 3 == 0:
        return [toks[i] for i in range(0, len(toks), 3)]
    elif len(toks) % 5 == 0:
        return [toks[i] for i in range(0, len(toks), 5)]
    else:
        return []

sample_labels = df_train.iloc[0]['labels'] if 'labels' in df_train.columns else None
print('Sample labels string:', sample_labels)

uniq = {}
cnt_per_image = []
for s in df_train.get('labels', pd.Series([], dtype=str)).fillna(''):
    ulist = parse_labels_to_unicodes(s)
    cnt_per_image.append(len(ulist))
    for u in ulist:
        uniq[u] = uniq.get(u, 0) + 1
print('Images with any labels:', sum(c>0 for c in cnt_per_image), 'of', len(cnt_per_image))
print('Total labeled instances:', sum(cnt_per_image))
print('Unique unicode tokens (raw):', len(uniq))
print('Top 10 tokens:', sorted(uniq.items(), key=lambda x: -x[1])[:10])

# 6) Tiny submission validator: mirror sample format
def make_tiny_submission(df_samp: pd.DataFrame) -> pd.DataFrame:
    sub = df_samp.copy()
    # leave empty predictions
    return sub

df_tiny = make_tiny_submission(df_sample)
out_path = Path('submission.csv')
df_tiny.to_csv(out_path, index=False)
print('Wrote tiny submission.csv with shape', df_tiny.shape, 'Head:')
print(df_tiny.head(2))

=== NVIDIA SMI ===


$ bash -lc nvidia-smi || true


Mon Sep 29 18:18:20 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.144.06             Driver Version: 550.144.06     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA A10-24Q                 On  |   00000002:00:00.0 Off |                    0 |
| N/A   N/A    P0             N/A /  N/A  |     182MiB /  24512MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

> pip install --index-url https://download.pytorch.org/whl/cu121 --extra-index-url https://pypi.org/simple torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1




Looking in indexes: https://download.pytorch.org/whl/cu121, https://pypi.org/simple


Collecting torch==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torch-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (799.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 799.0/799.0 MB 481.0 MB/s eta 0:00:00


Collecting torchvision==0.19.1
  Downloading https://download.pytorch.org/whl/cu121/torchvision-0.19.1%2Bcu121-cp311-cp311-linux_x86_64.whl (7.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 475.6 MB/s eta 0:00:00


Collecting torchaudio==2.4.1
  Downloading https://download.pytorch.org/whl/cu121/torchaudio-2.4.1%2Bcu121-cp311-cp311-linux_x86_64.whl (3.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.4/3.4 MB 437.2 MB/s eta 0:00:00
Collecting filelock
  Downloading filelock-3.19.1-py3-none-any.whl (15 kB)


Collecting typing-extensions>=4.8.0
  Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 44.6/44.6 KB 3.2 MB/s eta 0:00:00
Collecting sympy
  Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.3/6.3 MB 142.2 MB/s eta 0:00:00


Collecting fsspec
  Downloading fsspec-2025.9.0-py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.3/199.3 KB 518.1 MB/s eta 0:00:00


Collecting nvidia-cuda-cupti-cu12==12.1.105
  Downloading nvidia_cuda_cupti_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (14.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 517.8 MB/s eta 0:00:00


Collecting nvidia-cuda-nvrtc-cu12==12.1.105
  Downloading nvidia_cuda_nvrtc_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (23.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.7/23.7 MB 507.4 MB/s eta 0:00:00


Collecting nvidia-cublas-cu12==12.1.3.1
  Downloading nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl (410.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 410.6/410.6 MB 529.0 MB/s eta 0:00:00


Collecting nvidia-cusparse-cu12==12.1.0.106
  Downloading nvidia_cusparse_cu12-12.1.0.106-py3-none-manylinux1_x86_64.whl (196.0 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 196.0/196.0 MB 515.7 MB/s eta 0:00:00


Collecting nvidia-nvtx-cu12==12.1.105
  Downloading nvidia_nvtx_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (99 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 99.1/99.1 KB 471.9 MB/s eta 0:00:00


Collecting jinja2
  Downloading jinja2-3.1.6-py3-none-any.whl (134 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 134.9/134.9 KB 487.3 MB/s eta 0:00:00


Collecting nvidia-cusolver-cu12==11.4.5.107
  Downloading nvidia_cusolver_cu12-11.4.5.107-py3-none-manylinux1_x86_64.whl (124.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 124.2/124.2 MB 527.0 MB/s eta 0:00:00


Collecting nvidia-cufft-cu12==11.0.2.54
  Downloading nvidia_cufft_cu12-11.0.2.54-py3-none-manylinux1_x86_64.whl (121.6 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.6/121.6 MB 530.1 MB/s eta 0:00:00


Collecting nvidia-curand-cu12==10.3.2.106
  Downloading nvidia_curand_cu12-10.3.2.106-py3-none-manylinux1_x86_64.whl (56.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.5/56.5 MB 455.7 MB/s eta 0:00:00


Collecting nvidia-cudnn-cu12==9.1.0.70
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl (664.8 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 664.8/664.8 MB 534.4 MB/s eta 0:00:00


Collecting nvidia-cuda-runtime-cu12==12.1.105
  Downloading nvidia_cuda_runtime_cu12-12.1.105-py3-none-manylinux1_x86_64.whl (823 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 823.6/823.6 KB 524.0 MB/s eta 0:00:00


Collecting triton==3.0.0
  Downloading triton-3.0.0-1-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (209.4 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 209.4/209.4 MB 538.9 MB/s eta 0:00:00


Collecting nvidia-nccl-cu12==2.20.5
  Downloading nvidia_nccl_cu12-2.20.5-py3-none-manylinux2014_x86_64.whl (176.2 MB)


     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 176.2/176.2 MB 531.2 MB/s eta 0:00:00


Collecting networkx
  Downloading networkx-3.5-py3-none-any.whl (2.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.0/2.0 MB 517.3 MB/s eta 0:00:00


Collecting pillow!=8.3.*,>=5.3.0
  Downloading pillow-11.3.0-cp311-cp311-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (6.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.6/6.6 MB 259.5 MB/s eta 0:00:00


Collecting numpy
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.3/18.3 MB 135.5 MB/s eta 0:00:00


Collecting nvidia-nvjitlink-cu12
  Downloading nvidia_nvjitlink_cu12-12.9.86-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (39.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 39.7/39.7 MB 302.8 MB/s eta 0:00:00


Collecting MarkupSafe>=2.0
  Downloading markupsafe-3.0.3-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.manylinux_2_28_x86_64.whl (22 kB)
Collecting mpmath<1.4,>=1.1.0
  Downloading mpmath-1.3.0-py3-none-any.whl (536 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 536.2/536.2 KB 504.9 MB/s eta 0:00:00


Installing collected packages: mpmath, typing-extensions, sympy, pillow, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, MarkupSafe, fsspec, filelock, triton, nvidia-cusparse-cu12, nvidia-cudnn-cu12, jinja2, nvidia-cusolver-cu12, torch, torchvision, torchaudio


Successfully installed MarkupSafe-3.0.3 filelock-3.19.1 fsspec-2025.9.0 jinja2-3.1.6 mpmath-1.3.0 networkx-3.5 numpy-1.26.4 nvidia-cublas-cu12-12.1.3.1 nvidia-cuda-cupti-cu12-12.1.105 nvidia-cuda-nvrtc-cu12-12.1.105 nvidia-cuda-runtime-cu12-12.1.105 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.0.2.54 nvidia-curand-cu12-10.3.2.106 nvidia-cusolver-cu12-11.4.5.107 nvidia-cusparse-cu12-12.1.0.106 nvidia-nccl-cu12-2.20.5 nvidia-nvjitlink-cu12-12.9.86 nvidia-nvtx-cu12-12.1.105 pillow-11.3.0 sympy-1.14.0 torch-2.4.1+cu121 torchaudio-2.4.1+cu121 torchvision-0.19.1+cu121 triton-3.0.0 typing-extensions-4.15.0


torch: 2.4.1+cu121 built CUDA: 12.1
CUDA available: True
GPU: NVIDIA A10-24Q
Unzipping train_images.zip -> train_images ...


Done in 8.2s
Unzipping test_images.zip -> test_images ...


Done in 0.9s
Train images: 3244
Test images: 361
train.csv shape: (3244, 2)
train.csv columns: ['image_id', 'labels']
            image_id                                             labels
0  200004148_00015_1  U+306F 1187 361 47 27 U+306F 1487 2581 48 28 U...
1  200021712-00008_2  U+4E00 1543 1987 58 11 U+4E00 1296 1068 91 11 ...
2  100249416_00034_1  U+4E00 1214 415 73 11 U+4E00 1386 412 72 13 U+...
unicode_translation.csv shape: (4781, 2)
  Unicode char
0  U+0031    1
1  U+0032    2
2  U+0034    4
sample_submission.csv shape: (361, 2)
            image_id                 labels
0        umgy007-028  U+003F 1 1 U+FF2F 2 2
1        hnsd004-026  U+003F 1 1 U+FF2F 2 2
2  200003076_00034_2  U+003F 1 1 U+FF2F 2 2
Sample labels string: U+306F 1187 361 47 27 U+306F 1487 2581 48 28 U+3070 1187 1063 74 30 U+3070 594 1154 93 31 U+306F 1192 1842 52 32 U+309D 755 2601 24 33 U+3070 1336 531 88 33 U+3044 1326 444 60 34 U+53E3 1342 2649 44 35 U+306F 1485 1427 46 35 U+306F 450 1642 51 35 U+306F 156

Images with any labels: 3244 of 3244
Total labeled instances: 747579
Unique unicode tokens (raw): 7819
Top 10 tokens: [('U+306B', 17289), ('U+306E', 16954), ('U+3057', 15561), ('U+3066', 14297), ('U+3068', 11609), ('U+3092', 11102), ('U+306F', 10320), ('U+304B', 9997), ('U+308A', 9848), ('U+306A', 9592)]
Wrote tiny submission.csv with shape (361, 2) Head:
      image_id                 labels
0  umgy007-028  U+003F 1 1 U+FF2F 2 2
1  hnsd004-026  U+003F 1 1 U+FF2F 2 2


In [2]:
# Parse bbox stats and create CV folds
import numpy as np
from sklearn.model_selection import KFold

def parse_labels_full(labels: str):
    if not isinstance(labels, str) or labels.strip() == '':
        return []
    toks = labels.strip().split()
    out = []
    if len(toks) % 5 == 0:
        for i in range(0, len(toks), 5):
            u, x, y, w, h = toks[i:i+5]
            try:
                out.append((u, int(x), int(y), int(w), int(h)))
            except:
                pass
    elif len(toks) % 3 == 0:
        # triples fallback (unicode, cx, cy); synthesize tiny boxes for stats
        for i in range(0, len(toks), 3):
            u, cx, cy = toks[i:i+3]
            try:
                out.append((u, int(cx), int(cy), 1, 1))
            except:
                pass
    return out

# Build per-image annotations and bbox stats
anns = []
per_image_counts = []
for r in df_train.itertuples(index=False):
    image_id = getattr(r, 'image_id') if hasattr(r, 'image_id') else r[0]
    labels = getattr(r, 'labels') if hasattr(r, 'labels') else r[1]
    boxes = parse_labels_full(labels)
    per_image_counts.append((image_id, len(boxes)))
    for (u,x,y,w,h) in boxes:
        anns.append((image_id, u, x, y, w, h))

df_anns = pd.DataFrame(anns, columns=['image_id','unicode','x','y','w','h'])
df_counts = pd.DataFrame(per_image_counts, columns=['image_id','count'])
print('Annotations dataframe:', df_anns.shape, 'unique images:', df_anns.image_id.nunique())
print('Counts per image stats:', df_counts['count'].describe().to_dict())
if len(df_anns):
    print('w stats:', df_anns['w'].describe().to_dict())
    print('h stats:', df_anns['h'].describe().to_dict())

# Recommend crop size ~ 2-3x median h
if len(df_anns):
    med_h = float(df_anns['h'].median())
    crop_rec = int(np.clip(2.5 * med_h, 64, 192))
    print('Median bbox height:', med_h, '=> recommended crop size:', crop_rec)

# 5-fold CV grouped by image, stratified by binned counts
df_counts = df_counts.sample(frac=1.0, random_state=42).reset_index(drop=True)
bins = pd.qcut(df_counts['count'], q=min(10, max(2, df_counts['count'].nunique())), duplicates='drop')
df_counts['bin'] = bins.cat.codes if hasattr(bins, 'cat') else 0

kf = KFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for fold, (_, val_idx) in enumerate(kf.split(df_counts, df_counts['bin'])):
    img_ids = df_counts.loc[val_idx, 'image_id'].values
    folds.extend([(iid, fold) for iid in img_ids])
df_folds = pd.DataFrame(folds, columns=['image_id','fold'])
df_folds.to_csv('folds.csv', index=False)
print('Saved folds.csv with shape', df_folds.shape)
print(df_folds['fold'].value_counts().sort_index().to_dict())

Annotations dataframe: (613505, 6) unique images: 3244
Counts per image stats: {'count': 3244.0, 'mean': 189.1199136868064, 'std': 89.52639349462329, 'min': 2.0, '25%': 132.0, '50%': 188.0, '75%': 228.0, 'max': 597.0}
w stats: {'count': 613505.0, 'mean': 77.17600834549026, 'std': 30.474135132977892, 'min': 6.0, '25%': 55.0, '50%': 77.0, '75%': 96.0, 'max': 520.0}
h stats: {'count': 613505.0, 'mean': 94.98453476336786, 'std': 34.537406387764584, 'min': 5.0, '25%': 72.0, '50%': 91.0, '75%': 112.0, 'max': 993.0}
Median bbox height: 91.0 => recommended crop size: 192
Saved folds.csv with shape (3244, 2)
{0: 649, 1: 649, 2: 649, 3: 649, 4: 648}


In [3]:
# Optional: Regenerate folds with GroupKFold by book prefix (before '-')
from sklearn.model_selection import GroupKFold

def get_book_prefix(image_id: str):
    return str(image_id).split('-')[0] if isinstance(image_id, str) else ''

df_counts2 = df_counts.copy()
df_counts2['group'] = df_counts2['image_id'].apply(get_book_prefix)
print('Unique groups:', df_counts2['group'].nunique())

gkf = GroupKFold(n_splits=5)
folds_g = []
for fold, (_, val_idx) in enumerate(gkf.split(df_counts2, groups=df_counts2['group'])):
    img_ids = df_counts2.loc[val_idx, 'image_id'].values
    folds_g.extend([(iid, fold) for iid in img_ids])
df_folds_g = pd.DataFrame(folds_g, columns=['image_id','fold'])
df_folds_g.to_csv('folds_group.csv', index=False)
print('Saved folds_group.csv with shape', df_folds_g.shape)
print(df_folds_g['fold'].value_counts().sort_index().to_dict())

Unique groups: 1371
Saved folds_group.csv with shape (3244, 2)
{0: 649, 1: 649, 2: 649, 3: 649, 4: 648}
