
# BreakHis-IB Split Builder 📚🧪
Create **patient-held-out**, **subtype-balanced per magnification** evaluation splits for the **BreakHis** dataset, plus a natural-frequency comparator — without redistributing images.

**What this notebook does**
- Scans your local BreakHis root to build a metadata table (path, patient_id, subtype, benign/malignant, magnification).
- Builds two evaluation protocols:
  1. **StrictBalancedTest**: equal image counts for each of the **8 subtypes** within each of the **4 magnifications** (subject to data availability), patient-held-out.
  2. **NaturalTest**: patient-held-out, mirrors the original distribution.
- Creates **Validation** (balanced, smaller) and **Train** from the remaining patients.
- Saves split CSVs and a compact JSON report, and prints a **leakage check**.
- (Optional) **Materialize** splits as **symlinks** or **copies** to a new folder.

> ⚠️ **Licensing:** This notebook does **not** rehost BreakHis images. It only creates split metadata (CSVs) and optionally symlinks/copies from your local copy.



## 0. Configuration
Set your local paths and split preferences here.


In [None]:

from pathlib import Path
import os

# ==== REQUIRED: Set the path where you've extracted BreakHis ====
# Example structure (may vary):
#   /data/BreaKHis_v1/
#     └── histology_slides/breast/
#           ├── benign/adenosis/.../*.png
#           ├── benign/fibroadenoma/.../*.png
#           ├── malignant/ductal_carcinoma/.../*.png
#           └── ...
BREAKHIS_ROOT = Path("/path/to/BreaKHis_v1")   # <-- CHANGE THIS

# ==== Where to save outputs ====
OUT_DIR = Path("./breakhis_ib_outputs")         # CSVs & reports
MATERIALIZE_IMAGES = False                      # True to create split folders with symlinks/copies
MATERIALIZE_METHOD = "symlink"                  # "symlink" or "copy"
MATERIALIZE_DIR = Path("./breakhis_ib_materialized")  # target dir if MATERIALIZE_IMAGES=True

# ==== Split behavior ====
RANDOM_SEED = 42

# Strict balanced test target per (magnification, subtype).
# Use None to auto-compute the feasible K (minimum available across subtypes per magnification).
STRICT_TEST_K = None     # e.g., set to 30 if you want a fixed target (will adapt downward if impossible)

# Validation target per cell (magnification, subtype) for strict-balanced protocol.
VALIDATION_K = 10        # keep modest to avoid starving the train set

# Target fraction of images for the natural test set (approximate, since whole patients are assigned).
# You can also set ABS_N_TEST_IMAGES instead.
NATURAL_TEST_FRACTION = 0.2
ABS_N_TEST_IMAGES = None  # If set (e.g., 1200), overrides NATURAL_TEST_FRACTION

# File patterns to include (feel free to add extensions)
IMG_EXTS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}

OUT_DIR.mkdir(parents=True, exist_ok=True)
if MATERIALIZE_IMAGES:
    MATERIALIZE_DIR.mkdir(parents=True, exist_ok=True)

print("Configured. Root:", BREAKHIS_ROOT)



## 1. Imports & helper tables
We define subtype mappings, robust filename/path parsers, and some utility functions.
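
As a minimal sketch of the filename-parsing idea, the snippet below applies a simplified pattern to the example tile name used in this notebook. `DEMO_FNAME` and `parse_demo` are illustrative names that are not part of the notebook, and the group names (`year`, `slide`, `seq`) are assumptions about what each field means; adjust the pattern if your filenames differ.

```python
import re

# Hypothetical, simplified pattern for names like SOB_B_A-14-22549-40X-001.png
DEMO_FNAME = re.compile(
    r"SOB_(?P<bm>[BM])_(?P<code>[A-Z]{1,2})-(?P<year>\d+)-(?P<slide>[0-9A-Za-z]+)"
    r"-(?P<mag>\d+)X?-(?P<seq>\d+)"
)

def parse_demo(name):
    """Return the parsed fields as a dict, or None if the name does not match."""
    m = DEMO_FNAME.search(name)
    if m is None:
        return None
    fields = m.groupdict()
    fields["mag"] = int(fields["mag"])  # magnification as a number, e.g. 40
    return fields

parsed = parse_demo("SOB_B_A-14-22549-40X-001.png")
print(parsed)
```

Returning `None` for non-matching names mirrors the notebook's approach of recording unparsed files instead of guessing.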


In [None]:

import re
import json
import random
import shutil
from collections import defaultdict, Counter

import numpy as np
import pandas as pd

random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)

# Subtype canonical names & short codes used in BreakHis papers/directory names.
SUBTYPE_CANON = {
    # Benign
    "adenosis": ("A", "benign"),
    "fibroadenoma": ("F", "benign"),
    "phyllodes_tumor": ("PT", "benign"),
    "tubular_adenoma": ("TA", "benign"),
    # Malignant
    "ductal_carcinoma": ("DC", "malignant"),
    "lobular_carcinoma": ("LC", "malignant"),
    "mucinous_carcinoma": ("MC", "malignant"),
    "papillary_carcinoma": ("PC", "malignant"),
}

# Alternate names we might see in paths
SUBTYPE_ALIASES = {
    "adenosis": "adenosis",
    "fibroadenoma": "fibroadenoma",
    "phyllodes": "phyllodes_tumor",
    "phyllodes_tumor": "phyllodes_tumor",
    "tubular_adenoma": "tubular_adenoma",
    "ductal_carcinoma": "ductal_carcinoma",
    "ductal": "ductal_carcinoma",
    "lobular_carcinoma": "lobular_carcinoma",
    "lobular": "lobular_carcinoma",
    "mucinous_carcinoma": "mucinous_carcinoma",
    "mucinous": "mucinous_carcinoma",
    "papillary_carcinoma": "papillary_carcinoma",
    "papillary": "papillary_carcinoma",
}

SUBTYPE_CODE2NAME = {v[0]: k for k, v in SUBTYPE_CANON.items()}

MAG_REGEX = re.compile(r"(?:^|[_\-\/])(?P<mag>40|100|200|400)X(?:[_\-\/]|$)", re.IGNORECASE)

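# Filename pattern for BreakHis tiles. The official release names tiles like
#   SOB_B_A-14-22549AB-40-001.png   (<year>-<slide id>, magnification without "X"),
# while some copies use SOB_B_A-14-22549-40X-001.png; both forms are accepted.
# `pnum` captures <year>-<slide id>, because the year alone does not identify a patient.
FNAME_REGEX = re.compile(
    r"SOB[_-](?P<bm>[BM])[_-](?P<code>A|F|PT|TA|DC|LC|MC|PC)-(?P<pnum>\d+-[0-9A-Za-z]+)-(?P<mag>40|100|200|400)X?-\d+",
    re.IGNORECASE
)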

def find_in_path(path_str, words):
    low = path_str.lower()
    for w in words:
        if w in low:
            return True
    return False

def infer_label_from_path(p: Path):
    s = p.as_posix().lower()
    if "benign" in s:
        return "benign"
    if "malignant" in s:
        return "malignant"
    return None

def infer_subtype_from_path(p: Path):
    s = p.as_posix().lower()
    # check canonical & alias keys
    for alias, canon in SUBTYPE_ALIASES.items():
        if f"/{alias}/" in s or s.endswith(f"/{alias}") or f"_{alias}_" in s or f"-{alias}-" in s:
            return canon
    return None

def infer_magnification(path: Path):
    m = MAG_REGEX.search(path.as_posix())
    if m:
        return int(m.group("mag"))
    # try filename tokens like '40x'
    base = path.name.lower()
    for k in [40, 100, 200, 400]:
        if f"{k}x" in base:
            return k
    return None

def infer_from_filename(path: Path):
    """Parse from filename (SOB_B_A-14-xxxxx-40X-yy.png). Returns dict or {}."""
    m = FNAME_REGEX.search(path.name)
    if not m:
        # Try full path
        m = FNAME_REGEX.search(path.as_posix())
    if not m:
        return {}
    bm = m.group("bm").upper()
    label = "benign" if bm == "B" else "malignant"
    code = m.group("code").upper()
    mag = int(m.group("mag"))
    pnum = m.group("pnum")
    # Patient id: <subtype code>-<pnum>, where pnum is the patient/slide field captured by FNAME_REGEX
    patient_id = f"{code}-{pnum}"
    subtype = SUBTYPE_CODE2NAME.get(code)
    return {"label": label, "subtype": subtype, "subtype_code": code, "magnification": mag, "patient_id": patient_id}

def robust_parse_record(path: Path):
    """Return dict with keys: path, label, subtype, subtype_code, magnification, patient_id.
    Tries directory cues first, then filename regex. Records 'parse_issue' if anything missing.
    """
    rec = {"path": str(path), "label": None, "subtype": None, "subtype_code": None, "magnification": None, "patient_id": None, "parse_issue": None}
    # First, directory cues
    label = infer_label_from_path(path)
    subtype = infer_subtype_from_path(path)
    magnification = infer_magnification(path)

    if subtype in SUBTYPE_CANON:
        code, canon_label = SUBTYPE_CANON[subtype]
        if label is None:
            label = canon_label
        subtype_code = code
    else:
        subtype_code = None

    # Fall back to the filename parse for any field the directory cues missed.
    # (Assigning through locals() has no effect inside a function, so do it explicitly.)
    fname_guess = infer_from_filename(path)
    if label is None:
        label = fname_guess.get("label")
    if subtype is None:
        subtype = fname_guess.get("subtype")
    if subtype_code is None:
        subtype_code = fname_guess.get("subtype_code")
    if magnification is None:
        magnification = fname_guess.get("magnification")

    # patient_id is only available from the filename
    rec.update({
        "label": label,
        "subtype": subtype,
        "subtype_code": subtype_code,
        "magnification": magnification,
        "patient_id": fname_guess.get("patient_id"),
    })

    # If any of the critical fields are missing, mark issue
    required = ["label", "subtype", "subtype_code", "magnification", "patient_id"]
    missing = [k for k in required if rec[k] in (None, "", "unknown")]
    if missing:
        rec["parse_issue"] = f"missing:{','.join(missing)}"
    return rec

def ensure_dir(path: Path):
    path.mkdir(parents=True, exist_ok=True)

def set_seed(seed=42):
    """Seed Python's and NumPy's global random generators."""
    random.seed(seed)
    np.random.seed(seed)



## 2. Scan the BreakHis filesystem and build metadata
This walks your `BREAKHIS_ROOT` and parses labels, subtypes, magnifications, and patient IDs.  
Any files that cannot be parsed are logged to `parse_errors.csv`.
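
The scan amounts to a recursive walk with a case-insensitive extension filter. The sketch below shows that idea on a throwaway directory; `find_images`, `DEMO_EXTS`, and the file names are made up for illustration.

```python
import tempfile
from pathlib import Path

DEMO_EXTS = {".png", ".jpg"}  # hypothetical subset of IMG_EXTS

def find_images(root, exts):
    """Return all files under root whose lowercased suffix is in exts."""
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix.lower() in exts)

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "benign" / "adenosis").mkdir(parents=True)
    (root / "benign" / "adenosis" / "tile_001.PNG").write_bytes(b"")
    (root / "benign" / "notes.txt").write_text("not an image")
    found = [p.relative_to(root).as_posix() for p in find_images(root, DEMO_EXTS)]

print(found)
```

Lowercasing the suffix is what lets `tile_001.PNG` pass a filter that lists only `.png`.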


In [None]:

set_seed(RANDOM_SEED)

if not BREAKHIS_ROOT.exists():
    raise FileNotFoundError(f"Set BREAKHIS_ROOT correctly. Not found: {BREAKHIS_ROOT}")

all_paths = []
for p in BREAKHIS_ROOT.rglob("*"):
    if p.is_file() and p.suffix.lower() in IMG_EXTS:
        all_paths.append(p)

print(f"Found {len(all_paths)} image files.")

recs = []
for p in all_paths:
    rec = robust_parse_record(p)
    recs.append(rec)

df = pd.DataFrame(recs)
print("Parsed rows:", len(df))

# Save raw metadata
meta_csv = OUT_DIR / "metadata_raw.csv"
df.to_csv(meta_csv, index=False)
print("Saved:", meta_csv)

# Keep only well-parsed rows
ok = df[df["parse_issue"].isna()].copy()
bad = df[~df["parse_issue"].isna()].copy()

ok_csv = OUT_DIR / "metadata_ok.csv"
bad_csv = OUT_DIR / "parse_errors.csv"
ok.to_csv(ok_csv, index=False)
bad.to_csv(bad_csv, index=False)
print(f"OK rows: {len(ok)}  | Parse issues: {len(bad)}")
print("Saved:", ok_csv, "and", bad_csv)

display(ok.head())



## 3. Sanity checks
Quick counts by label, subtype, magnification, and patient summary.
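
A magnification × subtype grid is often easier to read than the long-format counts. The following sketch builds one with `pd.crosstab` on made-up rows standing in for the `ok` table; the data and variable names are illustrative only.

```python
import pandas as pd

# Toy metadata standing in for the `ok` table (values are made up)
toy = pd.DataFrame({
    "magnification": [40, 40, 40, 100, 100, 200],
    "subtype": ["adenosis", "adenosis", "fibroadenoma", "adenosis", "fibroadenoma", "fibroadenoma"],
    "patient_id": ["A-1", "A-2", "F-1", "A-1", "F-1", "F-2"],
})

# Rows: magnification, columns: subtype, values: image counts (0 where a cell is empty)
grid = pd.crosstab(toy["magnification"], toy["subtype"])
print(grid)

# Patients per cell matter too, because splits are made at patient level
patients = toy.groupby(["magnification", "subtype"])["patient_id"].nunique()
print(patients)
```

Empty cells show up as zeros in the grid, which makes scarce subtypes at a given magnification easy to spot before choosing `K`.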


In [None]:

def counts_table(data):
    c1 = data.groupby(["label"]).size().rename("count").reset_index()
    c2 = data.groupby(["subtype"]).size().rename("count").reset_index().sort_values("subtype")
    c3 = data.groupby(["magnification"]).size().rename("count").reset_index().sort_values("magnification")
    c4 = data.groupby(["magnification","subtype"]).size().rename("count").reset_index()
    n_patients = data["patient_id"].nunique()
    print("Unique patients:", n_patients)
    return c1, c2, c3, c4

c1, c2, c3, c4 = counts_table(ok)
display(c1); display(c2); display(c3); display(c4.head())



## 4. Split builders
We create:
- **StrictBalancedTest**: equal per `(magnification, subtype)` where feasible (auto-adjusts `K` downward if needed).
- **Validation**: smaller balanced set from remaining patients.
- **Train**: everything else.
- **NaturalTest**: patient-held-out, approximates the natural class distribution.
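
The balanced sampler interleaves picks across patients within each cell. The toy function below restates that round-robin idea in plain Python; `round_robin_pick` and the item names are hypothetical and only meant to show the behaviour on a small example.

```python
import random

def round_robin_pick(items_by_patient, k, seed=0):
    """Pick up to k items, cycling over patients so no single patient dominates."""
    rng = random.Random(seed)
    queues = []
    for pid in sorted(items_by_patient):
        q = list(items_by_patient[pid])
        rng.shuffle(q)  # random order within each patient
        queues.append(q)
    picked = []
    while len(picked) < k and any(queues):
        for q in queues:
            if q and len(picked) < k:
                picked.append(q.pop())
    return picked

toy_cell = {"P1": ["p1_a", "p1_b", "p1_c", "p1_d"], "P2": ["p2_a"], "P3": ["p3_a", "p3_b"]}
picked = round_robin_pick(toy_cell, k=5)
print(picked)
```

With `k=5`, the first pass takes one item from each of the three patients and the second pass takes one more from `P1` and `P3`. Because every touched patient is later held out, a large `K` can pull many patients into the test split.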


In [None]:

from typing import Dict, List, Tuple

def sample_balanced_by_cell(df_pool: pd.DataFrame, target_k: Dict[Tuple[int,str], int], rng: np.random.RandomState):
    """Attempt to sample exactly target_k[(mag, subtype)] images per cell.
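    - A patient may contribute several images to the sample; patient exclusivity
      across splits is enforced by the caller.
    - Picks are interleaved round-robin across patients, so a large K draws from many
      patients, all of whom the caller then holds out of the remaining splits.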
    - Returns a DataFrame and a dict of shortfalls for cells that couldn't meet K.
    """
    picked = []
    shortfalls = {}
    # Process rare cells first (fewest available images); shuffling happens per patient below
    avail = df_pool.groupby(["magnification","subtype"]).size().rename("n").reset_index()
    order = avail.sort_values("n").reset_index(drop=True)

    for _, row in order.iterrows():
        mag, st = int(row.magnification), row.subtype
        K = target_k.get((mag, st), 0)
        if K <= 0:
            continue
        cell = df_pool[(df_pool.magnification==mag) & (df_pool.subtype==st)]
        # Prefer sampling across multiple patients: group by patient then round-robin
        by_pid = {pid: g for pid, g in cell.groupby("patient_id")}
        if not by_pid:
            shortfalls[(mag,st)] = K
            continue
        # Build a round-robin list of indices
        rr = []
        for pid, g in by_pid.items():
            idxs = list(g.index)
            rng.shuffle(idxs)
            rr.append(idxs)
        # Interleave to avoid taking too many from one patient
        flat = []
        while any(rr) and len(flat) < K:
            for lst in rr:
                if lst and len(flat) < K:
                    flat.append(lst.pop())
        if len(flat) < K:
            shortfalls[(mag,st)] = K - len(flat)
        picked.append(df_pool.loc[flat])
    if picked:
        out = pd.concat(picked).drop_duplicates()
    else:
        out = pd.DataFrame(columns=df_pool.columns)
    return out, shortfalls

def compute_auto_K(df_pool: pd.DataFrame) -> Dict[Tuple[int,str], int]:
    """For each magnification, set K to the minimum available across subtypes at that magnification."""
    target_k = {}
    for mag, gmag in df_pool.groupby("magnification"):
        counts = gmag.groupby("subtype").size()
        # Require that each subtype exists; if a subtype is missing entirely at this magnification, K=0 for that cell.
        min_k = int(counts.min()) if len(counts)>0 else 0
        for st in SUBTYPE_CANON.keys():
            # Only set K where subtype exists in pool
            n = int(gmag[gmag.subtype==st].shape[0])
            target_k[(int(mag), st)] = min(min_k, n)
    return target_k

def assign_split_strict_balanced(df_all: pd.DataFrame, strict_test_k=None, validation_k=10, seed=42):
    rng = np.random.RandomState(seed)
    pool = df_all.copy()

    # Determine K for test. compute_auto_K yields one common K per magnification
    # (the scarcest subtype's count), which keeps cells balanced and always feasible.
    target_k = compute_auto_K(pool)
    if strict_test_k is not None:
        # A fixed K is capped by the feasible per-magnification K so cells stay equal
        target_k = {cell: min(int(strict_test_k), k) for cell, k in target_k.items()}

    # Sample StrictBalancedTest; short1 stays empty unless a cell cannot reach its K
    test_bal, short1 = sample_balanced_by_cell(pool, target_k, rng)

    # Mark test patients (held-out)
    test_pids = set(test_bal.patient_id.unique())
    remain = pool[~pool.patient_id.isin(test_pids)].copy()

    # Build Validation (balanced, smaller): one common K per magnification among the
    # remaining patients, capped by validation_k
    val_target = {cell: min(int(validation_k), k) for cell, k in compute_auto_K(remain).items()}

    val_bal, shortv = sample_balanced_by_cell(remain, val_target, rng)
    val_pids = set(val_bal.patient_id.unique())
    remain2 = remain[~remain.patient_id.isin(val_pids)].copy()

    train = remain2.copy()

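    # Compose split labels. Images of val/test patients that were not sampled are
    # marked "unused" so they cannot leak into train.
    df_splits = df_all.copy()
    df_splits["split"] = "unused"
    df_splits.loc[df_splits.index.isin(train.index), "split"] = "train"
    df_splits.loc[df_splits.index.isin(val_bal.index), "split"] = "val"
    df_splits.loc[df_splits.index.isin(test_bal.index), "split"] = "test_strict_balanced"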

    # Report
    report = {
        "n_train": int((df_splits.split=="train").sum()),
        "n_val": int((df_splits.split=="val").sum()),
        "n_test_strict_balanced": int((df_splits.split=="test_strict_balanced").sum()),
        "n_test_patients": len(test_pids),
        "n_val_patients": len(val_pids),
        "strict_target_k": target_k,
        "val_target_k": val_target,
        "shortfalls_test": short1 if 'short1' in locals() else {},
        "shortfalls_val": shortv if 'shortv' in locals() else {},
    }
    return df_splits, report

def assign_split_natural(df_all: pd.DataFrame, test_fraction=0.2, abs_n_test=None, seed=42):
    rng = np.random.RandomState(seed)
    # Patient-level sampling to approximate target size
    patients = list(df_all.patient_id.unique())
    rng.shuffle(patients)
    df = df_all.copy()

    if abs_n_test is not None:
        target = abs_n_test
    else:
        target = int(round(test_fraction * len(df)))

    test_pids = set()
    n = 0
    for pid in patients:
        k = df[df.patient_id==pid].shape[0]
        if n + k <= target or n==0:
            test_pids.add(pid); n += k
        if n >= target:
            break

    remain = df[~df.patient_id.isin(test_pids)].copy()
    test_nat = df[df.patient_id.isin(test_pids)].copy()

    # Split remain into train/val (simple patient split; you can also balance val like above if desired)
    rem_pids = list(remain.patient_id.unique())
    rng.shuffle(rem_pids)
    n_val = max(1, int(0.1 * len(rem_pids)))
    val_pids = set(rem_pids[:n_val])
    val = remain[remain.patient_id.isin(val_pids)].copy()
    train = remain[~remain.patient_id.isin(val_pids)].copy()

    df_splits = df_all.copy()
    df_splits["split"] = "train"
    df_splits.loc[df_splits.index.isin(val.index), "split"] = "val"
    df_splits.loc[df_splits.index.isin(test_nat.index), "split"] = "test_natural"

    report = {
        "n_train": int((df_splits.split=="train").sum()),
        "n_val": int((df_splits.split=="val").sum()),
        "n_test_natural": int((df_splits.split=="test_natural").sum()),
        "n_test_patients": len(test_pids),
        "n_val_patients": len(val_pids),
    }
    return df_splits, report



## 5. Build the splits
This will create:
- `splits_strict_balanced.csv` (train/val/test_strict_balanced)
- `splits_natural.csv` (train/val/test_natural)
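
Downstream code typically only needs the `path` and `split` columns of these CSVs. As a rough sketch, assuming a CSV with those columns, the snippet below writes a made-up split table to a temporary file, reads it back, and groups image paths by split name.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for splits_strict_balanced.csv (rows are made up)
toy_splits = pd.DataFrame({
    "path": ["a.png", "b.png", "c.png", "d.png"],
    "patient_id": ["A-1", "A-1", "F-2", "DC-3"],
    "split": ["train", "train", "val", "test_strict_balanced"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "splits_strict_balanced.csv"
    toy_splits.to_csv(csv_path, index=False)
    loaded = pd.read_csv(csv_path)

# One list of image paths per split name
paths_by_split = {name: grp["path"].tolist() for name, grp in loaded.groupby("split")}
print(paths_by_split)
```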


In [None]:

# Strict balanced protocol
df_strict, rep_strict = assign_split_strict_balanced(ok, strict_test_k=STRICT_TEST_K, validation_k=VALIDATION_K, seed=RANDOM_SEED)
strict_csv = OUT_DIR / "splits_strict_balanced.csv"
df_strict.to_csv(strict_csv, index=False)
print("Saved:", strict_csv)
print(json.dumps({k: (v if not isinstance(v, dict) else "dict(size="+str(len(v))+")") for k,v in rep_strict.items()}, indent=2))

# Natural protocol
df_nat, rep_nat = assign_split_natural(ok, test_fraction=NATURAL_TEST_FRACTION, abs_n_test=ABS_N_TEST_IMAGES, seed=RANDOM_SEED)
natural_csv = OUT_DIR / "splits_natural.csv"
df_nat.to_csv(natural_csv, index=False)
print("Saved:", natural_csv)
print(json.dumps(rep_nat, indent=2))

display(df_strict['split'].value_counts())
display(df_nat['split'].value_counts())



## 6. Leakage & balance checks
Verify **no patient overlap** across splits and inspect per-cell counts.
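
The leakage check reduces to pairwise set intersections of patient IDs. Here is a small standalone sketch of that idea; `shared_patients` and the patient IDs are invented for illustration.

```python
from itertools import combinations

def shared_patients(patient_ids_by_split):
    """Return {(split_a, split_b): shared patient ids} for every overlapping pair."""
    overlaps = {}
    for a, b in combinations(sorted(patient_ids_by_split), 2):
        common = set(patient_ids_by_split[a]) & set(patient_ids_by_split[b])
        if common:
            overlaps[(a, b)] = common
    return overlaps

clean = {"train": {"A-1", "F-2"}, "val": {"DC-3"}, "test": {"LC-4"}}
leaky = {"train": {"A-1", "F-2"}, "val": {"DC-3"}, "test": {"F-2", "LC-4"}}

print(shared_patients(clean))
print(shared_patients(leaky))
```

An empty result means no patient appears in more than one split; any non-empty entry names the offending pair of splits and the patients they share.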


In [None]:

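def check_patient_leakage(df_splits: pd.DataFrame, split_col='split', ignore=('unused',)):
    """Return {(split_a, split_b): shared patient ids} for every overlapping pair of splits.
    Split names listed in `ignore` (rows that belong to no split) are skipped.
    """
    problems = {}
    splits = [s for s in df_splits[split_col].unique().tolist() if s not in ignore]
    pid_by_split = {s: set(df_splits[df_splits[split_col]==s]['patient_id'].unique()) for s in splits}
    for i in range(len(splits)):
        for j in range(i+1, len(splits)):
            a, b = splits[i], splits[j]
            inter = pid_by_split[a] & pid_by_split[b]
            if inter:
                problems[(a,b)] = inter
    return problems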

def per_cell_table(df_splits, which_split):
    g = df_splits[df_splits['split']==which_split].groupby(['magnification', 'subtype']).size().rename('count').reset_index()
    return g.sort_values(['magnification','subtype'])

# Strict
leak_strict = check_patient_leakage(df_strict)
print("Patient overlaps (strict):", {str(k): len(v) for k,v in leak_strict.items()})
print("StrictBalanced Test per-cell counts:")
display(per_cell_table(df_strict, 'test_strict_balanced'))

# Natural
leak_nat = check_patient_leakage(df_nat)
print("Patient overlaps (natural):", {str(k): len(v) for k,v in leak_nat.items()})
print("Natural Test per-cell counts:")
display(per_cell_table(df_nat, 'test_natural'))



## 7. (Optional) Materialize images
If `MATERIALIZE_IMAGES=True`, we create a folder structure like:
```
breakhis_ib_materialized/
  strict_balanced/{train,val,test_strict_balanced}/<label>/<subtype>/<magnification>/...
  natural/{train,val,test_natural}/<label>/<subtype>/<magnification>/...
```
Use `symlink` to save disk space, or `copy` if you need a fully independent tree.
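
For orientation, the core of materialization is a "link or copy unless already present" step. The sketch below demonstrates it in a temporary directory using the `copy` method; `link_or_copy` and the paths are hypothetical, and resolving the symlink target to an absolute path is a design choice of this sketch.

```python
import os
import shutil
import tempfile
from pathlib import Path

def link_or_copy(src, dst, method="symlink"):
    """Place src at dst as a symlink or a copy; return False if dst is already there."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists() or dst.is_symlink():  # is_symlink() also catches dangling links
        return False
    if method == "symlink":
        os.symlink(src.resolve(), dst)  # absolute target survives a change of working directory
    else:
        shutil.copy2(src, dst)
    return True

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "source" / "tile.png"
    src.parent.mkdir(parents=True)
    src.write_bytes(b"fake image bytes")
    dst = root / "out" / "train" / "benign" / "adenosis" / "40X" / src.name
    first = link_or_copy(src, dst, method="copy")
    second = link_or_copy(src, dst, method="copy")  # already present, so skipped
    copied_bytes = dst.read_bytes()

print(first, second)
```

Skipping existing targets makes the step safe to re-run after an interrupted materialization.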


In [None]:

def materialize_from_splits(df_splits: pd.DataFrame, root_out: Path, protocol_name: str, split_names: List[str], method='symlink'):
    assert method in ('symlink','copy')
    base = root_out / protocol_name
    ensure_dir(base)
    n=0
    for split_name in split_names:
        sub = base / split_name
        ensure_dir(sub)
        df_sub = df_splits[df_splits['split']==split_name].copy()
        for _, row in df_sub.iterrows():
            src = Path(row['path'])
            # organize: <label>/<subtype>/<magnification>/filename
            rel = Path(str(row['label'])) / str(row['subtype']) / f"{int(row['magnification'])}X"
            dst_dir = sub / rel
            ensure_dir(dst_dir)
            dst = dst_dir / src.name
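            try:
                # is_symlink() also catches dangling links, which exists() reports as missing
                if dst.exists() or dst.is_symlink():
                    continue
                if method == 'symlink':
                    # Absolute target so the link works regardless of the working directory
                    os.symlink(src.resolve(), dst)
                else:
                    shutil.copy2(src, dst)
                n += 1
            except OSError as e:
                print("Failed:", src, "->", dst, "|", e)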
    print(f"Materialized {n} files under {base}")

if MATERIALIZE_IMAGES:
    # Strict
    materialize_from_splits(df_strict, MATERIALIZE_DIR, protocol_name="strict_balanced",
                            split_names=["train","val","test_strict_balanced"], method=MATERIALIZE_METHOD)
    # Natural
    materialize_from_splits(df_nat, MATERIALIZE_DIR, protocol_name="natural",
                            split_names=["train","val","test_natural"], method=MATERIALIZE_METHOD)



## 8. Outputs
- `metadata_raw.csv`: all parsed rows (including errors).
- `metadata_ok.csv`: successfully parsed rows.
- `parse_errors.csv`: rows where any of the critical fields were missing.
- `splits_strict_balanced.csv`: patient-held-out, subtype-balanced-per-magnification (train/val/test_strict_balanced).
- `splits_natural.csv`: patient-held-out, natural-frequency (train/val/test_natural).
- Leakage and balance printouts: confirm no patient overlap and check per-cell counts.
- *(optional)* materialized folder trees with symlinks/copies.



## 9. Export a compact JSON report
Handy for your IEEE DataPort page or README.


In [None]:

report = {
    "random_seed": RANDOM_SEED,
    "n_images_total": int(len(ok)),
    "n_patients_total": int(ok['patient_id'].nunique()),
    "strict_balanced": {
        # Cast to plain int so the values are JSON-serializable
        "counts": {s: int(n) for s, n in df_strict['split'].value_counts().items()},
        # Count patients from the split table itself rather than masking `ok` with its column
        "patients_per_split": {s: int(g['patient_id'].nunique()) for s, g in df_strict.groupby('split')},
    },
    "natural": {
        "counts": {s: int(n) for s, n in df_nat['split'].value_counts().items()},
        "patients_per_split": {s: int(g['patient_id'].nunique()) for s, g in df_nat.groupby('split')},
    },
}

with open(OUT_DIR / "split_report.json", "w") as f:
    json.dump(report, f, indent=2)

print(json.dumps(report, indent=2))
print("Saved:", OUT_DIR / "split_report.json")



## 10. Tips & troubleshooting
- If `parse_errors.csv` is large, your local folder structure or filenames differ from what the parsers expect. Open that CSV and check the `parse_issue` column to see which fields fail.  
  You can extend `SUBTYPE_ALIASES`, tweak the regexes, or add custom rules.
- If a requested `K` is not available in some cells of **StrictBalancedTest**, the code lowers `K` per magnification to the feasible minimum.  
  This is expected when some subtypes are scarce at a magnification; inspect `rep_strict["strict_target_k"]` for the values actually used.
- To reproduce splits exactly, keep `RANDOM_SEED` fixed and commit the generated CSVs to your repo.
- If you need **per-patient caps** (e.g., at most N tiles per patient per cell), you can add a cap before sampling in `sample_balanced_by_cell`.
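
One possible way to implement the per-patient cap from the last tip is to shuffle the pool and keep the first few rows of each `(magnification, subtype, patient_id)` group before sampling. `cap_per_patient` is a hypothetical helper, not part of the notebook, and the toy rows are made up.

```python
import pandas as pd

def cap_per_patient(df, max_per_patient, seed=0):
    """Keep at most max_per_patient rows per (magnification, subtype, patient_id)."""
    shuffled = df.sample(frac=1.0, random_state=seed)  # random order so head() is a random pick
    capped = shuffled.groupby(["magnification", "subtype", "patient_id"]).head(max_per_patient)
    return capped.sort_index()

toy = pd.DataFrame({
    "magnification": [40] * 6,
    "subtype": ["adenosis"] * 6,
    "patient_id": ["A-1", "A-1", "A-1", "A-1", "A-2", "A-2"],
})
capped = cap_per_patient(toy, max_per_patient=2)
print(capped["patient_id"].value_counts().to_dict())
```

The capped table could then be passed to `sample_balanced_by_cell` in place of the full pool; note that capping also lowers the feasible `K` for cells with few patients.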
