# Notebook 0: Config & Reproducible Bootstrap

This notebook recreates the assignment setup once, then reuses it everywhere. It downloads the same data, applies the same preprocessing, writes the same train/val/test CSVs, and saves a minimal config JSON (`part0_min.json`) that I carry into Week 1. Goal: a clean, reproducible base so later notebooks don't depend on the big original training notebook.

## Block 1: Environment bootstrap (run twice)

This block standardizes the runtime across local + Colab:

- Detect Colab, mount Drive, and set PROJECT_ROOT.
- Pin exact versions (NumPy first, then PyTorch 2.2.2 CUDA wheels with CPU fallback, then the rest).
- Install explainability libs (captum) and CV/ML stack.
- Do an editable install of my /src module with --no-deps (reuses helpers from the initial assignment without re-resolving pins).
- On Colab, patch requires-python if needed and hard-restart once so ABI/state is clean.
- Second run prints versions and exposes safe Drive writers.

How to use:
- Run once → installs & restarts.
- Run again → verifies versions and defines the safe Drive writers.
- Set CXR_PROJ_ROOT to override the project path locally.

In [1]:
# --- Block 1 (fixed): Colab-only bootstrap with one-time restart -------------
import sys, os, subprocess, platform, time
from pathlib import Path

# Detect if code is running inside Google Colab by checking loaded modules.
# Side effect: enables Colab-specific behavior (Drive mount, pip pins, restart).
IN_COLAB = "google.colab" in sys.modules
print(f"🐍 Python: {sys.version.split()[0]} | Colab: {IN_COLAB}")

def sh(cmd: str):
    # Thin wrapper around subprocess.run for shell commands with simple error handling.
    # Raises SystemExit on non-zero return to halt the notebook early and surface the failing command.
    print(">>", cmd)
    r = subprocess.run(cmd, shell=True)
    if r.returncode != 0:
        raise SystemExit(r.returncode)

# Marker file used to prevent infinite restarts of the Colab runtime.
# Presence of this file indicates the first-run bootstrap has completed.
MARK = Path("/content/_cxr_bootstrap_done")

# Resolve project root depending on environment.
# In Colab: try common Drive locations, else fallback to current working directory.
# Outside Colab: allow override via CXR_PROJ_ROOT env var, else cwd.
if IN_COLAB:
    from google.colab import drive
    drive.mount("/content/drive", force_remount=False)  # Avoid remount churn; user can toggle if needed.
    print("✅ Google Drive mounted")
    CANDIDATES = [
        Path("/content/drive/MyDrive/code/chest-xray-lab"),  # Preferred repo path
        Path("/content/drive/MyDrive/chest_xray_lab"),       # Alternate naming
        Path.cwd(),                                          # Fallback to notebook directory
    ]
else:
    CANDIDATES = [Path(os.environ.get("CXR_PROJ_ROOT", Path.cwd()))]

# Pick the first existing candidate as the canonical project root.
# Note: if multiple exist, ordering matters; consider logging a warning in multi-match situations.
PROJ_ROOT = next((p.resolve() for p in CANDIDATES if p.exists()), Path.cwd().resolve())
os.environ["CXR_PROJ_ROOT"] = str(PROJ_ROOT)  # Export for child processes (e.g., pip, editable install).
print("📁 PROJECT ROOT:", PROJ_ROOT)

# Add src/ to the import path to support 'editable-like' imports without installing.
SRC_DIR = PROJ_ROOT / "src"
if str(SRC_DIR) not in sys.path:
    sys.path.insert(0, str(SRC_DIR))

# First-time bootstrap path (only in Colab and only if MARK doesn't exist yet).
# Performs pinning/uninstall to harmonize binary dependencies and avoid ABI mismatches.
if IN_COLAB and not MARK.exists():
    # Patch 'requires-python' in pyproject for Colab base kernels that lag behind.
    # Caution: mutates repository file; safe for Colab but avoid committing this change upstream.
    pyproject = PROJ_ROOT / "pyproject.toml"
    if pyproject.exists():
        txt = pyproject.read_text()
        if 'requires-python = ">=3.12"' in txt:
            print("⚠️ Patching requires-python to >=3.10 for Colab base kernel…")
            pyproject.write_text(txt.replace('requires-python = ">=3.12"', 'requires-python = ">=3.10"'))

    # Clean slate to prevent silent ABI conflicts (NumPy/Torch/Scipy stack).
    # Set CXR_FORCE_CLEAN=0 to skip aggressive uninstalls when debugging.
    if os.environ.get("CXR_FORCE_CLEAN", "1") == "1":
        sh("pip -q uninstall -y "
           "torch torchvision torchaudio "
           "numpy pandas scipy scikit-learn scikit-image "
           "matplotlib "
           "opencv-python opencv-contrib-python opencv-python-headless "
           "jax jaxlib pillow tabulate kagglehub captum torchcam || true")
        sh("pip -q install --upgrade pip")  # Keep pip recent to reduce resolver quirks.

    # 1) Pin NumPy FIRST to a version compatible with PyTorch 2.2 wheels on Colab.
    # Rationale: prevents resolver from upgrading NumPy to an ABI that mismatches Torch.
    sh("pip install --no-cache-dir numpy==1.26.4")

    # 2) Install Torch 2.2.2 with CUDA wheels where available; fallback logic handles Colab GPU variants.
    # Tries CUDA 12.1 then 11.8. If both fail (e.g., CPU runtime), installs CPU wheels from Torch index.
    for wheels in (
        "torch==2.2.2+cu121 torchvision==0.17.2+cu121 torchaudio==2.2.2+cu121",
        "torch==2.2.2+cu118 torchvision==0.17.2+cu118 torchaudio==2.2.2+cu118",
    ):
        try:
            sh(f"pip install --no-cache-dir -f https://download.pytorch.org/whl/torch_stable.html {wheels}")
            break
        except SystemExit:
            pass
    else:
        print("⚠️ GPU wheels failed; installing CPU wheels")
        sh("pip install --no-cache-dir --index-url https://download.pytorch.org/whl/cpu "
           "torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2")

    # 3) Install the rest of the scientific stack with strict pins where stability is critical.
    # Matplotlib pinned with fallback spec to accommodate transient wheel availability on Colab images.
    try:
        sh("pip install --no-cache-dir matplotlib==3.10.7")
    except SystemExit:
        sh("pip install --no-cache-dir 'matplotlib>=3.10,<3.11'")
    sh("pip install --no-cache-dir pandas==2.3.3 scipy==1.16.2 scikit-learn==1.7.2 scikit-image==0.25.2 tabulate==0.9.0 kagglehub==0.3.13")
    try:
        sh("pip install --no-cache-dir opencv-python==4.9.0.80")
    except SystemExit:
        # Headless fallback avoids GUI backends not present on Colab; functionality is similar for CV workloads.
        sh("pip install --no-cache-dir opencv-python-headless==4.9.0.80")

    # 4) Model interpretability tooling (optional extras).
    # Captum is aligned with Torch 2.2; TorchCAM kept optional to encourage custom Grad-CAM if desired.
    sh("pip install --no-cache-dir captum==0.7.0")
    # Optional: drop torchcam and implement Grad-CAM yourself
    # sh("pip install --no-cache-dir torchcam==0.4.0")

    # 5) Editable install of the repo WITHOUT dependency resolution.
    # --no-deps ensures previously pinned wheels aren't re-resolved by project metadata.
    sh(f"pip install --no-cache-dir -e '{PROJ_ROOT}' --no-deps")  # Path is quoted so a project root containing spaces survives the shell.

    # 6) Create marker and force a hard restart to ensure the runtime imports the freshly pinned ABIs.
    # Using SIGKILL avoids partial state; the guard MARK prevents infinite restart loops.
    MARK.touch()
    import os as _os
    _os.kill(_os.getpid(), 9)

# Second run (post-restart): import and print versions to verify environment health.
# Skips all uninstall/install work; functions as a quick sanity check plus utility defs.
if IN_COLAB:
    import numpy as np, torch, cv2, pandas, scipy, sklearn, skimage, matplotlib
    print("python:", sys.version.split()[0], "|", platform.platform())
    print("numpy:", np.__version__)
    print("torch :", torch.__version__, "| CUDA?", torch.cuda.is_available())  # Note: torch.cuda.is_available() reflects driver/runtime availability.
    print("cv2   :", cv2.__version__)
    print("pandas:", pandas.__version__, "| scipy:", scipy.__version__)
    print("sklearn:", sklearn.__version__, "| skimage:", skimage.__version__)
    print("matplotlib:", matplotlib.__version__)

    # Lightweight, robust file writers for Google Drive to mitigate sync lag / buffering issues.
    # safe_write_bytes ensures data durability by fsync + size check with retries.
    import io
    def safe_write_bytes(path: Path, data: bytes, retries: int = 3, sleep_s: float = 0.5):
        path.parent.mkdir(parents=True, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data); f.flush(); os.fsync(f.fileno())
        for _ in range(retries):
            if path.exists() and path.stat().st_size == len(data):
                return True
            time.sleep(sleep_s)
        raise IOError(f"Drive sync failed for {path}")
    def safe_write_text(path: Path, text: str): return safe_write_bytes(path, text.encode("utf-8"))
    globals().update(dict(safe_write_bytes=safe_write_bytes, safe_write_text=safe_write_text))
    # Note: safe_write_* return True on success; callers may want to assert the return or handle exceptions.


🐍 Python: 3.12.12 | Colab: True
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
✅ Google Drive mounted
📁 PROJECT ROOT: /content/drive/MyDrive/code/chest-xray-lab
python: 3.12.12 | Linux-6.6.105+-x86_64-with-glibc2.35
numpy: 1.26.4
torch : 2.2.2+cu121 | CUDA? False
cv2   : 4.9.0
pandas: 2.3.3 | scipy: 1.16.2
sklearn: 1.7.2 | skimage: 0.25.2
matplotlib: 3.10.7
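The version printout can also be turned into an automatic check. The sketch below is not part of the notebook: `check_pins` and the `PINS` table are hypothetical names, and the pins simply mirror versions installed in Block 1. It compares each installed base version (ignoring local suffixes such as `+cu121`) against its pin and returns the mismatches.

```python
from importlib import metadata

# Hypothetical pin table mirroring versions installed in Block 1.
PINS = {"numpy": "1.26.4", "torch": "2.2.2", "captum": "0.7.0"}

def check_pins(pins, get_version=metadata.version):
    """Return {dist_name: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            # Drop local version labels such as "+cu121" before comparing.
            found = get_version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            found = None  # distribution is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches

if __name__ == "__main__":
    bad = check_pins(PINS)
    print("all pins satisfied" if not bad else f"pin mismatches: {bad}")
```

An empty result means the runtime matches the pins; anything else is a hint to re-run the first-time bootstrap.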


## Block 2: Paths, env, and package import

This block wires up paths and avoids duplication:

- Reads `CXR_PROJ_ROOT` from Block 1 and sets env vars for raw/processed data, manifests, and a fixed seed.
- Installs `chest_xray_lab` in **editable mode** only if it's not already importable (prevents double-installs between local and Colab).
- Imports the canonical config (`PROJ_ROOT`, `RAW_DIR`, `PROC_DIR`, `MANIFESTS`, `DEVICE`, `SEED`) so every notebook uses the same "single source of truth".
- Ensures the expected directories exist and prints a quick device summary (CUDA/MPS/CPU).
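For orientation, here is a minimal sketch of how a config module could turn those environment variables into `Path` objects. It illustrates the general pattern only and is not the contents of `chest_xray_lab.config`; `resolve_config` is a hypothetical name, and the defaults copy the `setdefault` values used in this block.

```python
import os
from pathlib import Path

def resolve_config(env=None):
    """Build path/seed settings from environment variables, defaulting under the project root."""
    env = os.environ if env is None else env
    root = Path(env.get("CXR_PROJ_ROOT", Path.cwd())).resolve()
    return {
        "PROJ_ROOT": root,
        # Each location can be overridden individually through its own variable.
        "RAW_DIR": Path(env.get("CXR_RAW_DIR", root / "data" / "raw" / "chest_xray")),
        "PROC_DIR": Path(env.get("CXR_PROC_DIR", root / "data" / "processed" / "chest_xray_split")),
        "MANIFESTS": Path(env.get("CXR_MANIFESTS", root / "data" / "processed" / "manifests")),
        # Environment values are strings, so the seed is converted here.
        "SEED": int(env.get("CXR_SEED", "42")),
    }

cfg = resolve_config({"CXR_PROJ_ROOT": "/tmp/demo", "CXR_SEED": "7"})
print(cfg["SEED"], cfg["MANIFESTS"].name)
```

Passing a plain dict instead of `os.environ` keeps the resolver easy to test without touching the real environment.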


In [2]:
# --- Block 2: paths/env + editable install (no duplication) ------------------
import os, importlib.util, subprocess, pathlib
from pathlib import Path

# 1) Ensure env vars ONCE (Block 1 already set CXR_PROJ_ROOT)
# Read the project root from the environment and normalize to an absolute path.
# setdefault(...) writes defaults only if the variable is not already defined,
# which lets advanced users override via environment without editing code.
PR = Path(os.environ["CXR_PROJ_ROOT"]).resolve()
os.environ.setdefault("CXR_RAW_DIR",   str(PR / "data" / "raw" / "chest_xray"))          # Source dataset location (immutable copies ideally).
os.environ.setdefault("CXR_PROC_DIR",  str(PR / "data" / "processed" / "chest_xray_split")) # Derived/processed splits live here (safe to regenerate).
os.environ.setdefault("CXR_MANIFESTS", str(PR / "data" / "processed" / "manifests"))     # CSV/JSON manifests for reproducible pipelines.
os.environ.setdefault("CXR_SEED", "42")                                                   # Global seed as string (convert to int at use-sites).

# 2) Editable install only if package not importable
# If the project package is not importable, install in editable mode.
# This avoids duplicate installs and ensures local code edits are immediately reflected.
# Pitfall: mixing 'chest-xray-lab' (dist-name) and 'chest_xray_lab' (import-name) is intentional;
# uninstall targets the distribution, import checks the module. Keep both consistent with pyproject.
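if importlib.util.find_spec("chest_xray_lab") is None:
    import sys  # Local import so this cell does not rely on Block 1's namespace.
    print("ℹ Installing 'chest_xray_lab' in editable mode...")
    # sys.executable targets the interpreter running this kernel, not whatever "python" resolves to on PATH.
    subprocess.run([sys.executable, "-m", "pip", "uninstall", "-y", "chest-xray-lab"], check=False)  # Best-effort cleanup; ignore failure.
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-e", str(PR)])                   # Editable install binds imports to local source tree.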

# 3) Import from config (single source of truth) and ensure dirs
# Centralized config ensures paths/devices/seeds are defined once and reused everywhere.
# Importing here asserts the package is importable and that config resolves env vars to Path objects.
from chest_xray_lab.config import PROJ_ROOT, RAW_DIR, PROC_DIR, MANIFESTS, DEVICE, SEED

# Create required directories idempotently. parents=True creates nested paths; exist_ok=True avoids errors if already present.
for d in (RAW_DIR, PROC_DIR, MANIFESTS):
    d.mkdir(parents=True, exist_ok=True)

# Log resolved paths for transparency and easier debugging in shared notebooks.
print("📁 PROJ_ROOT:", PROJ_ROOT)
print("📁 RAW_DIR  :", RAW_DIR)
print("📁 PROC_DIR :", PROC_DIR)
print("📁 MANIFESTS:", MANIFESTS)

# 4) Device summary
# Provide a concise runtime device report, preferring Apple MPS (Metal) when available,
# then CUDA, else CPU. This is purely informational; model code should still query DEVICE from config.
import torch
if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
    print("🟢 Apple MPS available")
elif torch.cuda.is_available():
    print(f"🟢 CUDA device: {torch.cuda.get_device_name(0)}")
else:
    print("⚪️ CPU-only")


📁 PROJ_ROOT: /content/drive/MyDrive/code/chest-xray-lab
📁 RAW_DIR  : /content/drive/MyDrive/code/chest-xray-lab/data/raw/chest_xray
📁 PROC_DIR : /content/drive/MyDrive/code/chest-xray-lab/data/processed/chest_xray_split
📁 MANIFESTS: /content/drive/MyDrive/code/chest-xray-lab/data/processed/manifests
⚪️ CPU-only


## Block 3: Core imports, seed, and runtime summary

This block pulls in the core libs, locks the global seed, and prints a quick runtime report.

- Uses `set_global_seed(SEED)` from my utils so runs are deterministic by default.
- `FAST_GPU=False` keeps cuDNN in deterministic mode; flip to `True` only if you accept small nondeterminism for speed.
- Sets `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to reduce CUDA memory fragmentation (no-op on CPU/MPS).
- Prints Python/NumPy/OpenCV/Matplotlib/Torch versions and whether CUDA is available.
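`set_global_seed` lives in `chest_xray_lab.utils.repro` and its body is not shown in this notebook. As a rough sketch of what a helper with that signature typically does (an assumption, not the actual implementation), the hypothetical `seed_everything` below seeds Python's RNG, seeds NumPy and PyTorch when they are installed, and maps the two flags onto the cuDNN switches.

```python
import os
import random

def seed_everything(seed, deterministic=True, fast_gpu=False):
    """Seed the RNGs that are available; NumPy and PyTorch are optional in this sketch."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only affects Python processes started later
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # safe to call when no GPU is present
    # Deterministic cuDNN kernels vs. autotuned (faster, less reproducible) ones.
    torch.backends.cudnn.deterministic = deterministic
    torch.backends.cudnn.benchmark = fast_gpu

seed_everything(42)
first_draw = random.random()
seed_everything(42)
second_draw = random.random()
print("same draw after reseeding:", first_draw == second_draw)
```

Seeding alone does not make every GPU operation deterministic; it only removes the randomness that comes from the generators themselves.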


In [3]:
# --- Core imports ------------------------------------------------------------
import os, sys
import numpy as np
import matplotlib.pyplot as plt
import cv2, torch, torch.nn as nn

from chest_xray_lab.config import SEED
from chest_xray_lab.utils.repro import set_global_seed

# --- Reproducibility (deterministic by default) ------------------------------
FAST_GPU = False   # set True only if you accept minor nondeterminism for speed
set_global_seed(SEED, deterministic=not FAST_GPU, fast_gpu=FAST_GPU)


# optional (no-op on CPU/MPS)
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

# --- Versions ---------------------------------------------------------------
import matplotlib
print(f"🔢 Seed: {SEED}")
print(f"Python: {sys.version.split()[0]}")
print(f"NumPy : {np.__version__} | OpenCV: {cv2.__version__} | Matplotlib: {matplotlib.__version__}")
print(f"Torch : {torch.__version__}")
print("CUDA available:", torch.cuda.is_available())


🔢 Seed: 42
Python: 3.12.12
NumPy : 1.26.4 | OpenCV: 4.9.0 | Matplotlib: 3.10.7
Torch : 2.2.2+cu121
CUDA available: False


## Block 4: Download raw data (idempotent)

This block downloads the Chest X-Ray Pneumonia dataset via `kagglehub` and lays it out under `CXR_RAW_DIR` with the expected `train/val/test` structure.

- Targets the same dataset as the assignment (`paultimothymooney/chest-xray-pneumonia`).
- Finds the **shallowest** directory that already contains `train/val/test` (handles extra nesting in Kaggle zips).
- Copies into `CXR_RAW_DIR` if missing; otherwise **skips** existing non-empty splits (safe to re-run).
- Prints a tiny count summary per split so we can sanity-check the download.

Inputs: `CXR_RAW_DIR` from env (set earlier).  
Outputs: populated `train/val/test` folders under `CXR_RAW_DIR`, plus a human-readable summary.

> Re-running is a no-op unless the target is empty.
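The per-split totals printed by this block do not show class balance. A small sketch of a per-class count follows, assuming the raw layout is `split/CLASS/file` (for example `train/NORMAL`, `train/PNEUMONIA`); `class_counts` is a hypothetical helper, and the demo runs on a synthetic directory tree rather than the real dataset.

```python
import tempfile
from pathlib import Path

def class_counts(raw_dir, splits=("train", "val", "test")):
    """Return {split: {class_name: n_files}} for a split/class/file layout."""
    out = {}
    for split in splits:
        split_dir = Path(raw_dir) / split
        out[split] = {}
        if not split_dir.is_dir():
            continue  # missing split: report it as empty
        for cls_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
            out[split][cls_dir.name] = sum(1 for f in cls_dir.rglob("*") if f.is_file())
    return out

# Tiny synthetic tree so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ("train/NORMAL/a.jpeg", "train/PNEUMONIA/b.jpeg", "train/PNEUMONIA/c.jpeg"):
        p = Path(tmp) / rel
        p.parent.mkdir(parents=True, exist_ok=True)
        p.touch()
    demo = class_counts(tmp)
print(demo["train"])
```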


In [4]:
# make_dataset_raw_min.py
from pathlib import Path
import shutil
import kagglehub
import os

# Resolve the raw dataset directory from env and ensure it exists.
RAW_DIR = Path(os.environ["CXR_RAW_DIR"]).resolve()
RAW_DIR.mkdir(parents=True, exist_ok=True)

def _has_split(root: Path) -> bool:
    # Check that train/val/test subdirs exist and are non-empty.
    # Note: any(glob("**/*")) is true for any entry (file or directory) inside, so only truly empty dirs count as "absent".
    return all((root / s).is_dir() and any((root / s).glob("**/*"))
               for s in ("train", "val", "test"))

def _find_split_root(cache_root: Path) -> Path:
    # Walk cache_root and find the SHALLOWEST dir that directly contains
    # train/val/test (case-insensitive via .name.lower()).
    # Skips paths within __MACOSX bundles that can appear in zips.
    candidates = []
    for p in cache_root.rglob("*"):
        if not p.is_dir():
            continue
        names = {c.name.lower() for c in p.iterdir() if c.is_dir()}
        if {"train", "val", "test"}.issubset(names):
            if "__macosx" in {part.lower() for part in p.parts}:
                continue
            candidates.append((len(p.parts), p))
    if not candidates:
        # Defensive: dataset structure changed or missing; fail with context.
        raise FileNotFoundError(f"Could not locate train/val/test under {cache_root}")
    # Return shallowest candidate to avoid nested duplicates.
    return sorted(candidates, key=lambda t: t[0])[0][1]

def _copy_split(src_root: Path, dst_root: Path):
    # Copy each split once; if destination exists and is non-empty, skip.
    # dirs_exist_ok=True lets copytree merge into existing dirs (Py3.8+).
    for split in ("train", "val", "test"):
        src = (src_root / split).resolve()
        dst = (dst_root / split).resolve()
        if dst.exists() and any(dst.glob("**/*")):
            print(f"↪ Skip existing '{split}' at {dst}")
            continue
        print(f"Copying {split}: {src} -> {dst}")
        shutil.copytree(src, dst, dirs_exist_ok=True)

def _count_files(folder: Path) -> int:
    # Count all regular files under a folder (recursive).
    return sum(1 for f in folder.rglob("*") if f.is_file())

def prepare_raw_dataset():
    # Idempotent entrypoint: if dataset already present, do nothing.
    if _has_split(RAW_DIR):
        print("✅ Raw dataset already present at:", RAW_DIR)
    else:
        # Pull dataset to a local cache via kagglehub; returns the cache path.
        cache_root = Path(kagglehub.dataset_download("paultimothymooney/chest-xray-pneumonia")).resolve()
        print("Downloaded to cache:", cache_root)
        # Detect the folder that holds train/val/test and copy into RAW_DIR.
        split_root = _find_split_root(cache_root)
        print("Using split root:", split_root)
        _copy_split(split_root, RAW_DIR)

    # Lightweight human-readable counts (no manifest creation here).
    summary = {s: _count_files(RAW_DIR / s) for s in ("train", "val", "test")}
    print("📦 Prepared at:", RAW_DIR)
    print("📊 Counts:", summary)

if __name__ == "__main__":
    # Allow running as a script (e.g., `python make_dataset_raw_min.py`).
    prepare_raw_dataset()


Using Colab cache for faster access to the 'chest-xray-pneumonia' dataset.
Downloaded to cache: /kaggle/input/chest-xray-pneumonia
Using split root: /kaggle/input/chest-xray-pneumonia/chest_xray
Copying train: /kaggle/input/chest-xray-pneumonia/chest_xray/train -> /content/drive/MyDrive/code/chest-xray-lab/data/raw/chest_xray/train
Copying val: /kaggle/input/chest-xray-pneumonia/chest_xray/val -> /content/drive/MyDrive/code/chest-xray-lab/data/raw/chest_xray/val
Copying test: /kaggle/input/chest-xray-pneumonia/chest_xray/test -> /content/drive/MyDrive/code/chest-xray-lab/data/raw/chest_xray/test
📦 Prepared at: /content/drive/MyDrive/code/chest-xray-lab/data/raw/chest_xray
📊 Counts: {'train': 5216, 'val': 16, 'test': 624}


## Block 5: Patient-wise split only

This cell repeats the patient-wise split exactly as it was done in the initial assignment.
- Groups by patient to prevent leakage.
- Stratifies into `train/val/test` with `test_frac=0.11`, `val_frac=0.09`, `seed=42`.
- Returns `train_items`, `val_items`, `test_items`, `POS_WEIGHT`.
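The internals of `split_by_patient` are not shown here. The grouping idea can be sketched as follows, under stated assumptions: `grouped_split` is a hypothetical stand-in, the fractions are applied to patients rather than images, and the sketch does not stratify by class.

```python
import random
from collections import defaultdict

def grouped_split(items, seed=42, test_frac=0.11, val_frac=0.09):
    """Split (path, label, patient_id) triples so that no patient spans two splits."""
    by_patient = defaultdict(list)
    for item in items:
        by_patient[item[2]].append(item)
    patients = sorted(by_patient)           # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(patients)
    n_test = round(len(patients) * test_frac)
    n_val = round(len(patients) * val_frac)
    groups = {
        "test": patients[:n_test],
        "val": patients[n_test:n_test + n_val],
        "train": patients[n_test + n_val:],
    }
    # Expand each patient group back into its images.
    return {name: [it for pid in pids for it in by_patient[pid]]
            for name, pids in groups.items()}

# Synthetic data: 100 patients with 2 images each.
demo_items = [(f"img_{p}_{k}.png", p % 2, f"patient{p}") for p in range(100) for k in range(2)]
demo_splits = grouped_split(demo_items)
demo_ids = {name: {it[2] for it in part} for name, part in demo_splits.items()}
print("train/val overlap:", len(demo_ids["train"] & demo_ids["val"]))
```

Because whole patients are assigned to a split, the image-level proportions only approximate the requested fractions.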

In [5]:
import os
from pathlib import Path
from chest_xray_lab.utils.split import split_by_patient

# Resolve raw dataset dir from env (must be set by earlier blocks/config).
RAW_DIR = Path(os.environ["CXR_RAW_DIR"])  # e.g., $CXR_PROJ_ROOT/data/raw/chest_xray

# Patient-wise split to prevent leakage (images from same patient never cross splits).
# seed controls reproducibility; test/val fractions are proportions of the whole set.
splits = split_by_patient(RAW_DIR, seed=42, test_frac=0.11, val_frac=0.09)

# Unpack convenience views.
# Each item: (absolute_image_path: Path, label: str/int, patient_id: str)
train_items = splits["train"]
val_items   = splits["val"]
test_items  = splits["test"]

# Class imbalance weight for positive class; typically used in BCEWithLogitsLoss(pos_weight=...).
POS_WEIGHT  = splits["pos_weight"]


== Patient-level split summary ==
split   total   NORMAL   PNEUM.    Pos%   patients
train    4719     1232     3487   73.9%       2538
val       530      141      389   73.4%        286
test      607      210      397   65.4%        350

Patient overlap (should be 0): train∩val=0, train∩test=0, val∩test=0
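How `POS_WEIGHT` is computed inside `split_by_patient` is not shown in this notebook. A common convention for `BCEWithLogitsLoss(pos_weight=...)` is the ratio of negative to positive training examples; under that assumption, and treating PNEUMONIA as the positive class, the train row of the summary table gives a value of roughly 0.35.

```python
# Train-split counts copied from the summary table.
n_normal, n_pneumonia = 1232, 3487

# Assumed convention: pos_weight = (#negatives) / (#positives).
pos_weight = n_normal / n_pneumonia
print(f"pos_weight = {pos_weight:.4f}")  # → pos_weight = 0.3533
```

A value below 1 down-weights the majority (positive) class so that both classes contribute comparably to the loss.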


## Block 6: Reproduce preprocessing & warm the cache (assignment-parity)

This cell **recreates the exact preprocessing pipeline** used in the initial assignment for both training and evaluation, then warms a PNG cache under `CXR_PROC_DIR`.

- Uses a versioned `PreprocConfig` (`cover_crop → resize → pad → 224×224`) to match the original pipeline.
- Caches results **idempotently** (write only if missing), so later notebooks read the same tensors the model saw.
- These cached images are later written to file/Drive and consumed by the Week-1 tasks to:
  - preserve the **same train/val/test splits** the `best_model.pt` was trained/tested on,
  - reproduce the **same dataset mean/std** assumptions used in Week-1 (by drawing from the same preprocessed set).

Outcome: a deterministic, reusable cache that mirrors the original training/eval data exactly.
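The versioning idea can be sketched as follows. This is an assumption about how a config hash and a cache path could be derived, not the code in `chest_xray_lab.utils.preproc`: `DemoPreprocConfig`, `demo_preproc_hash`, and `demo_cache_path` are hypothetical names, and the choice of SHA-256 truncated to 10 hex characters is illustrative.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class DemoPreprocConfig:
    # Field names follow the PreprocConfig call in this block; the class itself is a stand-in.
    target_hw: tuple = (224, 224)
    resize_mode: str = "cover_crop"
    pad_mode: str = "reflect"
    dark_frac: float = 0.80
    min_keep_run: int = 8

def demo_preproc_hash(cfg, n_hex=10):
    """Hash the config's sorted JSON form so equal settings always give the same tag."""
    payload = json.dumps(asdict(cfg), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:n_hex]

def demo_cache_path(proc_dir, cfg, rel):
    """Place the cached PNG under a per-config folder, mirroring the raw relative path."""
    return Path(proc_dir) / demo_preproc_hash(cfg) / Path(rel).with_suffix(".png")

demo_cfg = DemoPreprocConfig()
print(demo_cache_path("processed", demo_cfg, "train/NORMAL/IM-0001.jpeg"))
```

With a layout like this, changing any preprocessing setting produces a new folder instead of silently mixing old and new images.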


In [6]:
import os
from pathlib import Path
from chest_xray_lab.utils.preproc import PreprocConfig, preproc_hash, cache_path, ensure_cached_png
import cv2, numpy as np

# Resolve dataset roots from environment. (Ensure earlier blocks set these.)
# RAW_DIR is resolved so relative_to() in warm_cache compares like with like.
RAW_DIR  = Path(os.environ["CXR_RAW_DIR"]).resolve()
PROC_DIR = Path(os.environ["CXR_PROC_DIR"])

# Preprocessing configuration (single source of truth).
# target_hw: final H×W; cover_crop keeps aspect by center-cropping after scaling.
# pad_mode: how borders are padded if needed; "reflect" avoids hard edges.
# dark_frac/min_keep_run: heuristics for removing large dark margins/borders.
CFG = PreprocConfig(
    target_hw=(224,224),
    resize_mode="cover_crop",   # matches your old _RESIZE_MODE
    pad_mode="reflect",
    dark_frac=0.80,
    min_keep_run=8,
)

# Stable hash of CFG for cache versioning; changes invalidate/segregate caches.
PHASH = preproc_hash(CFG)
print("preproc hash:", PHASH)

def warm_cache(items):  # items = list of (abs_path, label, patient_id)
    # Materialize preprocessed PNGs for a split to avoid on-the-fly work later.
    new = 0
    for abs_path, *_ in items:
        # Compute path of the cached PNG relative to RAW_DIR + CFG.
        rel = str(Path(abs_path).resolve().relative_to(RAW_DIR))
        dst = cache_path(PROC_DIR, CFG, rel)
        if not dst.exists():
            # Performs crop + cover-resize + write PNG atomically (via library util).
            ensure_cached_png(Path(abs_path), dst, CFG)  # does crop + cover-resize + write PNG
            new += 1
    print(f"warmed {len(items)} items | new files written: {new}")

# Populate cache for all splits; idempotent (skips files that already exist).
warm_cache(train_items)
warm_cache(val_items)
warm_cache(test_items)


preproc hash: 3a0b791144
warmed 4719 items | new files written: 4719
warmed 530 items | new files written: 530
warmed 607 items | new files written: 607


## Block 7: Load checkpoint, bind cached data, and sanity-check

This cell rebuilds the exact model, loads the saved `best_model.pt`, and points it at the cached PNGs from Block 6.

- Recreates split → cache pairs (`to_cached_pairs`) using the same `CFG` and `cache_path`.
- Runs a quick eval on `test` to sanity-check that this setup matches the original assignment (tiny drift possible from package/runtime versions, but split + preprocessing are identical).
- Computes the train-set mean/std in `[0,1]` to serve as the occlusion baseline for MoRF/SOFI in Week-1.

**Observed (this run):** `TEST: AUC=0.996, Acc=0.965, F1=0.973`  
**Original assignment:** `TEST: AUC=0.996, Acc=0.962, F1=0.971`  
**Train mean:** `0.575170` now vs `0.575157` before (negligible).

Conclusion: metrics are effectively identical; the cache + checkpoint wiring is faithful to the original setup.
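The shuffled-label check used in this block can be illustrated in plain Python. This is a minimal sketch, not the code behind `compute_shuf_auc`: `auroc` and `shuffled_auroc` are hypothetical helpers, and the quadratic pair count is only meant for small examples.

```python
import random

def auroc(y_true, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def shuffled_auroc(y_true, scores, seed=42):
    """Permute the labels, keep the scores fixed, and recompute AUROC."""
    labels = list(y_true)
    random.Random(seed).shuffle(labels)
    return auroc(labels, scores)

demo_y = [0, 0, 0, 1, 1, 1]
demo_logits = [-2.0, -1.0, 0.5, 0.2, 1.5, 3.0]
print("AUROC:", auroc(demo_y, demo_logits))
print("shuffled-label AUROC:", shuffled_auroc(demo_y, demo_logits))
```

On a real test set, a shuffled-label AUROC far from 0.5 would suggest that the evaluation leaks label information through something other than the predictions.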

In [7]:
from chest_xray_lab.config import DEVICE, CKPT_PATH, RAW_DIR, PROC_DIR, SEED
from chest_xray_lab.models.build import build_model
from chest_xray_lab.models.load  import load_checkpoint
from chest_xray_lab.data.cache_io import to_cached_pairs, mean_std_from_cached_pairs
from chest_xray_lab.data.dataset_eval import make_eval_loader
from chest_xray_lab.eval.metrics import collect_logits, compute_epoch_metrics, compute_shuf_auc

# Build + load
model = build_model("efficientnet_b0", pretrained=False, in_chans=1).to(DEVICE)
model = load_checkpoint(model, CKPT_PATH, map_location=DEVICE)

# Map split items -> cached PNG pairs using the preprocessing CFG and cache_path from earlier.
# Each element becomes (cached_png_path, label). Reuses disk cache; does not recompute if present.
cached_train = to_cached_pairs(train_items, RAW_DIR, PROC_DIR, CFG, cache_path)
cached_val   = to_cached_pairs(val_items,   RAW_DIR, PROC_DIR, CFG, cache_path)
cached_test  = to_cached_pairs(test_items,  RAW_DIR, PROC_DIR, CFG, cache_path)

# (Optional) quick sanity eval
# Eval loader: deterministic (no shuffle), appropriate transforms for cached PNGs.
test_dl = make_eval_loader(cached_test, batch_size=32)
# Collect logits for the whole split; y_logits are raw scores (sigmoid applied inside metrics if expected).
y_true, y_logits = collect_logits(model, test_dl, DEVICE)
# Threshold-dependent metrics (Acc/F1) at 0.5; AUROC is threshold-free.
metrics = compute_epoch_metrics(y_true, y_logits, thresh=0.5)
# Label permutation baseline: keeps predictions fixed, shuffles labels (expected AUROC ≈ 0.5 if sane).
shuf_auc = compute_shuf_auc(y_true, y_logits, seed=SEED)
print(f"TEST:  AUC={metrics['auroc']:.3f}  Acc={metrics['acc']:.3f}  F1={metrics['f1']:.3f}")
print(f"SHUF AUC (labels permuted, predictions fixed): {shuf_auc:.3f}")

# Baseline stats for occlusion-style evals (e.g., MoRF/insertion): mean/std in [0,1] space from training set.
mean01, std01 = mean_std_from_cached_pairs(cached_train)
print(f"[Occlusion] train mean in [0,1]: {mean01:.6f} (std={std01:.6f})")


TEST:  AUC=0.996  Acc=0.965  F1=0.973
SHUF AUC (labels permuted, predictions fixed): 0.527
[Occlusion] train mean in [0,1]: 0.575170 (std=0.172597)
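How `mean_std_from_cached_pairs` aggregates is not shown in this notebook. One plausible reading, stated here as an assumption, is a pixel-level mean and population standard deviation over all training images after scaling 8-bit values to `[0,1]`. The hypothetical `running_mean_std01` below does that in a single streaming pass over nested lists, so it runs without NumPy or OpenCV.

```python
import math

def running_mean_std01(images):
    """Pixel-level mean/std over an iterable of 8-bit grayscale images, in [0, 1] units."""
    n = 0
    total = 0.0
    total_sq = 0.0
    for img in images:                  # img: rows of integer pixel values in 0..255
        for row in img:
            for px in row:
                v = px / 255.0
                n += 1
                total += v
                total_sq += v * v
    if n == 0:
        raise ValueError("no pixels seen")
    mean = total / n
    var = max(total_sq / n - mean * mean, 0.0)  # clamp tiny negative rounding error
    return mean, math.sqrt(var)

demo_mean, demo_std = running_mean_std01([[[0, 255]], [[255, 255]]])
print(f"mean={demo_mean:.4f} std={demo_std:.4f}")  # → mean=0.7500 std=0.4330
```

Keeping only the running count, sum, and sum of squares avoids loading the whole training set into memory.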


## Block 8: Persist manifests & minimal config for Week-1

This cell writes everything Week-1 needs to disk:

- CSVs with cached `(png_path, label)` for `train/val/test`.
- A minimal `part0_min.json` with checkpoint path, seed, preprocessing hash, and the `[0,1]` train mean/std used for occlusion baselines.
- All paths are absolute so downstream notebooks can load without extra setup.

**Colab note:** Google Drive can lag. Newly written files may take a little time to appear in Drive. Give it a short pause before opening the Week-1 notebook to avoid "file not found" hiccups.

Outputs:
- `${MANIFESTS}/train_cached.csv`, `val_cached.csv`, `test_cached.csv`
- `${MANIFESTS}/part0_min.json`

These are the artifacts Week-1 reads to reproduce the initial assignment's setup.
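On the consuming side, a Week-1 notebook could load these artifacts roughly as sketched below. `load_part0` is a hypothetical helper; it relies only on the `csv` key of `part0_min.json` and on the `png_path,label` CSV header written by this block, and the demo uses made-up values in a temporary directory.

```python
import csv
import json
import tempfile
from pathlib import Path

def load_part0(config_path):
    """Read the minimal config and the (png_path, label) rows of each split CSV."""
    cfg = json.loads(Path(config_path).read_text())
    splits = {}
    for name, csv_path in cfg["csv"].items():
        with open(csv_path, newline="") as f:
            splits[name] = [(row["png_path"], int(row["label"])) for row in csv.DictReader(f)]
    return cfg, splits

# Round-trip demo on a temporary directory with made-up values.
with tempfile.TemporaryDirectory() as tmp:
    demo_csv = Path(tmp) / "train_cached.csv"
    with open(demo_csv, "w", newline="") as f:
        w = csv.writer(f)
        w.writerow(["png_path", "label"])
        w.writerow(["/cache/a.png", 1])
    demo_json = Path(tmp) / "part0_min.json"
    demo_json.write_text(json.dumps({"seed": 42, "csv": {"train": str(demo_csv)}}))
    loaded_cfg, loaded_splits = load_part0(demo_json)
print(loaded_cfg["seed"], loaded_splits["train"])
```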

In [8]:
# --- Minimal manifests for Week-1 (no meta folder) --------------------------
import csv, json, hashlib
from pathlib import Path

from chest_xray_lab.config import PROJ_ROOT, MANIFESTS as MANI_DIR, CKPT_PATH, SEED
# If you didn't compute mean01,std01 earlier, import helper:
# from chest_xray_lab.data.cache_io import mean_std_from_cached_pairs

# Ensure manifests directory exists (idempotent).
MANI_DIR.mkdir(parents=True, exist_ok=True)

def _write_pairs_csv(pairs, out_csv: Path):  # pairs: [(png_path,label), ...]
    # Serialize cached pairs to a simple 2-column CSV for downstream loaders.
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(out_csv, "w", newline="") as f:
        w = csv.writer(f); w.writerow(["png_path","label"])
        for p, y in pairs:
            w.writerow([p, int(y)])  # force int for consistent parsing

def _filelist_md5(pairs):
    # Stable checksum of the file list (sorted) for quick change detection.
    # str(p) lets this work whether the pairs hold Path objects or plain strings.
    items = "\n".join(sorted(str(p) for p, _ in pairs)).encode()
    return hashlib.md5(items).hexdigest()

# 1) Split CSVs
# Export per-split lists of (cached_png_path, label).
train_csv = MANI_DIR / "train_cached.csv"
val_csv   = MANI_DIR / "val_cached.csv"
test_csv  = MANI_DIR / "test_cached.csv"
_write_pairs_csv(cached_train, train_csv)
_write_pairs_csv(cached_val,   val_csv)
_write_pairs_csv(cached_test,  test_csv)

# 2) Baseline stats in [0,1]
# Prefer previously computed mean/std (avoids recompute and keeps runs aligned).
train_mean01, train_std01 = float(mean01), float(std01)
# If you didn't compute them yet in this session, uncomment:
# train_mean01, train_std01 = mean_std_from_cached_pairs(cached_train)

# 3) Minimal config JSON (absolute paths; week-1 notebook can read only this)
# Acts as a single source of truth for evaluation/setup.
part0_min = {
    "ckpt_path": str(CKPT_PATH),
    "seed": int(SEED),
    "preproc_hash": PHASH,                     # from your preproc block
    "train_mean01": train_mean01,
    "train_std01":  train_std01,
    "csv": {
        "train": str(train_csv),
        "val":   str(val_csv),
        "test":  str(test_csv),
    },
    "manifests_dir": str(MANI_DIR)
}
part0_min_path = MANI_DIR / "part0_min.json"
part0_min_path.write_text(json.dumps(part0_min, indent=2))

print("Wrote CSVs:", train_csv, val_csv, test_csv)
print("Wrote cfg :", part0_min_path)


Wrote CSVs: /content/drive/MyDrive/code/chest-xray-lab/data/processed/manifests/train_cached.csv /content/drive/MyDrive/code/chest-xray-lab/data/processed/manifests/val_cached.csv /content/drive/MyDrive/code/chest-xray-lab/data/processed/manifests/test_cached.csv
Wrote cfg : /content/drive/MyDrive/code/chest-xray-lab/data/processed/manifests/part0_min.json
