# Experiment: Baseline U-Net v2.3 Pipeline Validation

**Date:** 2026-02-24  
**Experiment ID:** `baseline_v23`  
**Status:** Preliminary (seed 42 only; seeds 123/456 pending)  
**Type:** Training  
**GitHub Issue:** [#37](https://github.com/wrockey/vmat-diffusion/issues/37)  

---

## 1. Overview

### 1.1 Objective

Validate the full training-inference-evaluation pipeline end-to-end on v2.3 preprocessed data (74 cases). This is the first experiment on the work machine with the corrected preprocessing pipeline (D95 artifact fix, #4). The primary goal is **pipeline validation**, with secondary goals of establishing preliminary v2.3 baseline metrics and identifying issues before the Phase 2 ablation study.

### 1.2 Key Results (Seed 42 Only — PRELIMINARY)

| Metric | Value | Target | Status |
|--------|-------|--------|--------|
| PTV70 D95 \|error\| | **1.01 ± 0.76 Gy** | < 2 Gy | 6/7 pass |
| PTV-region Gamma 3%/3mm | **85.5 ± 10.9%** | > 95% | Close |
| QUANTEC compliance | **4/7 cases (57%)** | > 90% | Dmax hotspots only |
| MAE | 4.80 ± 2.45 Gy | diagnostic | n/a |
| Global Gamma 3%/3mm | 28.1 ± 12.6% | diagnostic | n/a |

*Note: Metrics updated 2026-02-24 after switching inference default from overlap=32 to overlap=64 (see #50).*

### 1.3 Conclusion

The v2.3 pipeline works end-to-end. Even a pure MSE baseline achieves good PTV70 D95 accuracy (1.01 Gy mean absolute error), but it exhibits a systematic PTV underdose (-0.86 Gy bias) and OAR Dmax hotspots, both of which the Phase 2 asymmetric PTV and DVH-aware losses are designed to address. A consistent Femur L > R asymmetry in MAE (7.2 Gy mean difference, 7/7 cases) warrants investigation. Case-level analysis shows that the model's main weakness is OAR dose prediction in cases with atypical dose spread (compact or unusually wide), where errors of 10-18 Gy per structure occur; this is expected to improve with more data and clinical losses.

---

## 2. What Changed

This is the **first experiment on v2.3 data**. There is no direct prior experiment to compare against on this data. The closest reference is the pilot `baseline_unet_run1` on v2.2.0 data (23 cases, home machine), but those metrics are **invalid** due to the D95 preprocessing artifact (#4).

| Parameter | Pilot (v2.2.0) | This Experiment (v2.3) |
|-----------|---------------|------------------------|
| Data version | v2.2.0 (D95 artifact) | **v2.3.0 (fixed)** |
| Cases | 23 | **74** |
| Machine | Home (Windows) | **Work (WSL2)** |
| Architecture | BaselineUNet3D | BaselineUNet3D (identical) |
| Loss | MSE + neg penalty | MSE + neg penalty (identical) |
| Epochs | 200 | 200 (identical) |
| Patch size | 128 | 128 (identical) |

**Everything else is identical.** The only changes are data version, case count, and compute platform.

---

## 3. Reproducibility

In [None]:
import json
from pathlib import Path
from IPython.display import Markdown, display

PROJECT_ROOT = Path('..').resolve()

REPRODUCIBILITY = {
    'git_commit': '82bddc5e5cac8faaa3aa63b14686bdccbf6bba3b',
    'git_message': 'fix: Patch sampling crash when volume Z equals patch_size',
    'python_version': '3.12.12',
    'pytorch_version': '2.10.0+cu126',
    'pytorch_lightning_version': '2.6.1',
    'cuda_version': '12.6',
    'gpu': 'NVIDIA GeForce RTX 3090',
    'random_seed': 42,
    'experiment_date': '2026-02-23',
    'platform': 'WSL2 Ubuntu 24.04 LTS',
}

print('Reproducibility Information:')
for k, v in REPRODUCIBILITY.items():
    print(f'  {k}: {v}')

env_snapshot = PROJECT_ROOT / 'runs' / 'baseline_v23_environment_snapshot.txt'
print(f'\n  Environment snapshot: {env_snapshot} ({"exists" if env_snapshot.exists() else "MISSING"})')

### Command to Reproduce

```bash
# 1. Checkout exact code
git checkout 82bddc5e

# 2. Activate environment
conda activate vmat-diffusion

# 3. Train (seed 42)
python scripts/train_baseline_unet.py \
    --data_dir ~/data/processed_npz \
    --exp_name baseline_v23_seed42 \
    --epochs 200 \
    --batch_size 2 \
    --seed 42

# 4. Evaluate (overlap=64, fixed in #50)
python scripts/inference_baseline_unet.py \
    --checkpoint runs/baseline_v23_seed42/checkpoints/best-epoch=172-val/mae_gy=6.047.ckpt \
    --input_dir ~/data/processed_npz \
    --output_dir predictions/baseline_v23_seed42_test \
    --compute_metrics --gamma_subsample 4 --overlap 64
```
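
The best-checkpoint path above contains a `/` inside the file name. This presumably comes from a logged metric named `val/mae_gy` being formatted into the checkpoint name, which is an assumption because the checkpoint callback is not shown in this notebook. A minimal sketch of recovering the epoch and validation MAE from such a path, using a hypothetical helper `parse_checkpoint_path`:

```python
import re

# Pattern inferred from the single path shown in this notebook (an assumption).
_CKPT_RE = re.compile(
    r"best-epoch=(?P<epoch>\d+)-val/mae_gy=(?P<mae>\d+(?:\.\d+)?)\.ckpt$"
)

def parse_checkpoint_path(path: str) -> tuple[int, float]:
    """Return (epoch, val_mae_gy) parsed from a best-checkpoint path."""
    match = _CKPT_RE.search(path)
    if match is None:
        raise ValueError(f"Unrecognized checkpoint path: {path}")
    return int(match.group("epoch")), float(match.group("mae"))

epoch, mae = parse_checkpoint_path(
    "runs/baseline_v23_seed42/checkpoints/best-epoch=172-val/mae_gy=6.047.ckpt"
)
print(epoch, mae)  # 172 6.047
```

Because of the embedded `/`, the file on disk is `mae_gy=6.047.ckpt` inside a directory named `best-epoch=172-val`, so shell globs on `best-*.ckpt` would not match it.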

---

## 4. Dataset

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

# Load test case IDs
test_cases_path = PROJECT_ROOT / 'runs' / 'baseline_v23_seed42' / 'test_cases.json'
with open(test_cases_path) as f:
    test_info = json.load(f)

DATASET = {
    'preprocessing_version': 'v2.3.0',
    'total_cases': 74,
    'plan_types': '11 SIB (70/56 Gy) + 63 single-Rx (70 Gy only)',
    'train_cases': 60,
    'val_cases': 7,
    'test_case_ids': test_info['test_cases'],
    'split_seed': test_info['seed'],
    'note': 'Split is per-seed (not locked). Production will use locked stratified split (#38).',
}

print(f'Preprocessing version: {DATASET["preprocessing_version"]}')
print(f'Total cases: {DATASET["total_cases"]} ({DATASET["plan_types"]})')
print(f'Split (seed={DATASET["split_seed"]}): {DATASET["train_cases"]} train / {DATASET["val_cases"]} val / {len(DATASET["test_case_ids"])} test')
print(f'Test case IDs: {DATASET["test_case_ids"]}')
print(f'\nNote: {DATASET["note"]}')
print('\nWARNING: Test set contains NO PTV56 structures (all single-Rx cases).')
print('This limits the clinical relevance; the production run must use an SIB-only dataset.')
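
The per-seed split noted above means the test cases change with the seed. A minimal sketch of one way such a split could be drawn follows. The project's actual split logic is not shown here, so the helper `split_cases` and its shuffle-then-slice scheme are assumptions; only the 60/7/7 sizes come from this section.

```python
import random

def split_cases(case_ids, seed, n_val=7, n_test=7):
    """Shuffle case IDs with a seeded RNG and slice into train/val/test lists."""
    ids = sorted(case_ids)       # sort first so the result does not depend on input order
    rng = random.Random(seed)    # local RNG; the global random state is left untouched
    rng.shuffle(ids)
    test = ids[:n_test]
    val = ids[n_test:n_test + n_val]
    train = ids[n_test + n_val:]
    return train, val, test

# Hypothetical IDs; the real case names are read from test_cases.json above.
all_ids = [f"case_{i:04d}" for i in range(74)]
train, val, test = split_cases(all_ids, seed=42)
print(len(train), len(val), len(test))  # 60 7 7
```

A locked split, as planned in #38, would instead fix the test IDs once and reuse them for every seed.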

---

## 5. Model & Training Configuration

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

config_path = PROJECT_ROOT / 'runs' / 'baseline_v23_seed42' / 'training_config.json'
with open(config_path) as f:
    config = json.load(f)

print(f'Model: {config["model"]}')
print(f'Parameters: {config["model_params"]:,}')
print(f'Script version: {config["version"]}')
print(f'\nHyperparameters:')
for k, v in sorted(config['hparams'].items()):
    print(f'  {k}: {v}')

# Training summary
summary_path = PROJECT_ROOT / 'runs' / 'baseline_v23_seed42' / 'training_summary.json'
with open(summary_path) as f:
    summary = json.load(f)

print(f'\nTraining Summary:')
print(f'  Duration: {summary["total_time_hours"]:.1f} hours')
print(f'  Final epoch: {summary["final_metrics"]["epoch"]}')
print(f'  Best val MAE: {summary["best_val_mae_gy"]:.3f} Gy (epoch 172)')
print(f'  Final val MAE: {summary["final_metrics"]["val_mae_gy"]:.3f} Gy')
print(f'  Final val Gamma: {summary["final_metrics"]["val_gamma"]:.1f}%')

---

## 6. Results

Figures generated by `scripts/generate_baseline_v23_figures.py` and loaded below.  
Representative case for single-case figures: **prostate70gy_0056** (below-median MAE = 3.30 Gy). Inference uses overlap=64 (default updated in #50).

### 6.1 Training Curves

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig1_training_curves.png', width=900))

**Caption:** Training and validation curves for baseline_v23 (seed 42, 200 epochs). (A) Training loss (blue) decreases steadily while validation loss (orange) plateaus after ~50 epochs, indicating significant overfitting (8.5x val/train ratio by epoch 150). (B) Validation MAE (solid blue) and Gamma pass rate (dashed green) on dual axes, with best checkpoint marked at epoch 172 (MAE = 6.05 Gy). Both metrics are noisy due to the small validation set (n=7).

**Key observations:**
- Significant overfitting: val/train loss ratio reaches 8.5-10x, consistent with 23.7M parameters trained on only 60 cases
- Val MAE essentially flat after epoch 100 (trend slope = -0.004 Gy/epoch) — model has converged
- High epoch-to-epoch variance in val metrics due to small val set (n=7)
- Best checkpoint at epoch 172 may be a favorable fluctuation; typical late-epoch MAE is ~7.5 Gy
- **Clinical implication:** More training data (from Institution B) is the primary lever for improvement, not more epochs or architecture changes
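
The two diagnostics quoted above (val/train loss ratio and late-epoch trend slope) can be derived from per-epoch logs. A minimal sketch on synthetic numbers follows, since the column names in `metrics.csv` are not shown here; both helper names are hypothetical and the real figures may have been computed differently.

```python
def val_train_ratio(train_loss, val_loss):
    """How many times larger the validation loss is than the training loss."""
    return val_loss / train_loss

def trend_slope(epochs, values, start_epoch=100):
    """Ordinary least-squares slope of values vs. epochs for epochs >= start_epoch."""
    pts = [(e, v) for e, v in zip(epochs, values) if e >= start_epoch]
    n = len(pts)
    mean_x = sum(e for e, _ in pts) / n
    mean_y = sum(v for _, v in pts) / n
    sxy = sum((e - mean_x) * (v - mean_y) for e, v in pts)
    sxx = sum((e - mean_x) ** 2 for e, _ in pts)
    return sxy / sxx

# Synthetic validation MAE curve: fast early drop, then a nearly flat tail.
epochs = list(range(200))
val_mae = [10.0 - 0.03 * e if e < 100 else 7.0 - 0.004 * (e - 100) for e in epochs]

print(round(trend_slope(epochs, val_mae), 3))    # -0.004
print(round(val_train_ratio(0.02, 0.17), 1))     # 8.5
```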

### 6.2 Dose Colorwash (Representative Case)

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig2_dose_colorwash.png', width=900))

**Caption:** Predicted (top) vs ground truth (bottom) dose distribution overlaid on CT for prostate70gy_0056 (MAE = 3.30 Gy, below-median test case). Views: axial (left), coronal (center), sagittal (right) through PTV70 centroid. Dose displayed in Gy with 5 Gy threshold. The high-dose PTV region (red, ~70 Gy) is well-reproduced in both shape and magnitude. The intermediate dose region (20-50 Gy, green-yellow) shows broader predicted distribution vs ground truth.

**Key observations:**
- PTV70 coverage is visually excellent — the high-dose region matches closely
- The predicted dose "spray" in the low-dose periphery appears broader than GT, consistent with the model averaging over multiple valid low-dose solutions
- Dose gradients at the PTV boundary appear smoother in the prediction than GT, which may reflect the MSE loss averaging effect
- Coronal view shows symmetric predicted dose distribution (L/R ratio = 1.054), confirming correct sliding window blending with overlap=64
- **Clinical implication:** The model captures the essential dose distribution pattern; errors are primarily in clinically less constrained regions
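
The overlap setting mentioned above controls how densely patches tile the volume at inference time. As a minimal sketch (not the project's inference script, whose tiling and blending logic is not shown here), the patch start positions along one axis could be computed as follows. `window_starts` is a hypothetical helper, and clamping the last window to the volume edge is an assumption.

```python
def window_starts(length, patch=128, overlap=64):
    """Start indices of sliding windows that cover [0, length) along one axis."""
    if overlap >= patch:
        raise ValueError("overlap must be smaller than the patch size")
    if length <= patch:
        return [0]  # a single (possibly padded) window covers the whole axis
    stride = patch - overlap
    starts = list(range(0, length - patch + 1, stride))
    if starts[-1] != length - patch:
        starts.append(length - patch)  # clamp a final window to the far edge
    return starts

print(window_starts(256))              # [0, 64, 128]
print(window_starts(200))              # [0, 64, 72]
print(window_starts(256, overlap=32))  # [0, 96, 128]
```

With overlap=64 the stride is half the patch size, so voxels away from the volume edges fall inside at least two windows per axis and the blending step has more than one prediction to average.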

### 6.3 Dose Difference Map

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig3_dose_difference.png', width=900))

**Caption:** Dose difference map (predicted minus ground truth, in Gy) for prostate70gy_0056. Blue regions indicate underdose, red indicates overdose. Diverging RdBu_r colormap centered at zero.

**Key observations:**
- Largest errors are in the low-to-intermediate dose transition zone, not in the PTV itself
- Blue (underdose) patches visible at the periphery — the model underpredicts dose in the low-dose "spray" region
- The PTV region itself shows minimal difference (near-zero), confirming good target coverage accuracy
- **Clinical implication:** The error pattern is spatially coherent and concentrated where clinical constraints are weakest, consistent with the semi-multi-modal hypothesis — multiple valid dose distributions exist in these regions
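
The difference map above is a signed quantity, whereas MAE discards the sign. A minimal sketch of how the two relate, on a toy array rather than the real prediction files; the helper name `dose_error_summary` and its output keys are hypothetical:

```python
import numpy as np

def dose_error_summary(pred, gt, mask=None):
    """Summarize predicted-minus-GT dose (Gy), optionally restricted to a mask."""
    diff = np.asarray(pred, dtype=float) - np.asarray(gt, dtype=float)  # > 0 means overdose
    if mask is not None:
        diff = diff[np.asarray(mask, dtype=bool)]
    return {
        "mae_gy": float(np.mean(np.abs(diff))),   # magnitude only
        "bias_gy": float(np.mean(diff)),          # signed mean error
        "frac_under": float(np.mean(diff < 0)),   # fraction of underdosed voxels
    }

gt = np.array([[70.0, 70.0], [30.0, 10.0]])
pred = np.array([[69.0, 70.0], [33.0, 8.0]])
print(dose_error_summary(pred, gt))
# {'mae_gy': 1.5, 'bias_gy': 0.0, 'frac_under': 0.5}
```

In this toy case the over- and underdose cancel in the bias while the MAE stays at 1.5 Gy, which is why a signed map and an absolute metric are worth reading together.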

### 6.4 DVH Comparison

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig4_dvh_comparison.png', width=800))

**Caption:** DVH curves for prostate70gy_0056: solid lines = ground truth, dashed = predicted. Structures shown: PTV70, Rectum, Bladder, Femur_L, Femur_R, Bowel.

**Key observations:**
- **PTV70:** Predicted DVH closely matches GT — the steep drop-off near 70 Gy is well-captured, confirming good D95/D98 accuracy for this case
- **Rectum:** Predicted DVH (dashed orange) shows higher dose across the mid-range compared to GT (solid orange), indicating the model overestimates rectum dose. This is the structure where Dmax violations occur
- **Bladder:** Similar pattern — predicted DVH shifted slightly higher than GT in the mid-dose range
- **Femur_L vs Femur_R:** Clear asymmetry — Femur_L predicted DVH diverges more from GT than Femur_R, consistent with the systematic L/R bias seen in all test cases
- **Bowel:** Good agreement at low doses; prediction slightly overestimates
- **Clinical implication:** For this case, the predicted OAR dose is consistently higher than GT. While this is "safe" (conservative), it means the model predicts less OAR sparing than was actually achieved, which could trigger false QUANTEC violations
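
For readers less familiar with the DVH quantities used above, here is a minimal sketch of a cumulative DVH and of Dx metrics such as D95, computed on a synthetic dose array. The function names are hypothetical, and the evaluation script may bin or interpolate differently.

```python
import numpy as np

def cumulative_dvh(dose, mask, bin_width_gy=0.5):
    """Return (dose_axis, volume_percent) for the voxels selected by mask."""
    d = np.asarray(dose, dtype=float)[np.asarray(mask, dtype=bool)]
    dose_axis = np.arange(0.0, d.max() + bin_width_gy, bin_width_gy)
    # Percent of the structure volume receiving at least each dose level.
    volume = np.array([(d >= level).mean() * 100.0 for level in dose_axis])
    return dose_axis, volume

def d_at_volume(dose, mask, volume_percent):
    """Dx: minimum dose received by the hottest volume_percent of the structure."""
    d = np.asarray(dose, dtype=float)[np.asarray(mask, dtype=bool)]
    return float(np.percentile(d, 100.0 - volume_percent))

# Synthetic "PTV": dose spread uniformly between 60 and 72 Gy.
dose = np.linspace(60.0, 72.0, 1000).reshape(10, 10, 10)
mask = np.ones_like(dose, dtype=bool)

print(round(d_at_volume(dose, mask, 95), 2))  # 60.6
axis, vol = cumulative_dvh(dose, mask)
print(vol[0])  # 100.0
```

D95 is therefore the 5th percentile of the voxel doses inside the structure, which is why a few cold voxels at the PTV edge can pull it down noticeably.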

### 6.5 Gamma Analysis

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig5_gamma_bar_chart.png', width=900))

**Caption:** Global vs PTV-region gamma pass rate (3%/3mm) per test case. Blue bars: global gamma; gold bars: PTV-region gamma (PTV70 + 5mm margin). Dashed orange line: 95% clinical target.

**Key observations:**
- **PTV-region Gamma (85.5% mean) far exceeds Global Gamma (28.1% mean)**, a roughly 3x ratio that supports the strategic decision to focus on PTV-region accuracy over global metrics
- Two cases (P0005, P0024) achieve PTV-region Gamma near/above 95% — the model CAN reach clinical targets in the PTV
- P0079 is an outlier with only 60.9% PTV Gamma — warrants case-level investigation
- Global Gamma is dominated by failures in the low-dose periphery, as expected
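- **Clinical implication:** The model is most accurate where it matters most clinically (the PTV region), although the mean PTV-region Gamma of 85.5% is still below the 95% target. The low Global Gamma is attributed to valid dose diversity in unconstrained regions rather than to model failure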
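
To make the 3%/3mm criterion concrete, the following is a simplified one-dimensional, brute-force sketch of a global gamma pass rate. It is not the evaluation script's implementation, which works on 3D volumes and is not shown here. The global normalization to the reference maximum and the 10% low-dose cutoff are assumptions, and `gamma_pass_rate_1d` is a hypothetical name.

```python
import numpy as np

def gamma_pass_rate_1d(ref, ev, spacing_mm, dose_frac=0.03, dta_mm=3.0, low_cut=0.1):
    """Percent of reference points with gamma <= 1 (global dose normalization)."""
    ref = np.asarray(ref, dtype=float)
    ev = np.asarray(ev, dtype=float)
    x = np.arange(ref.size) * spacing_mm
    dose_tol = dose_frac * ref.max()                        # e.g. 3% of the max dose
    dist2 = ((x[:, None] - x[None, :]) / dta_mm) ** 2       # rows: ref point, cols: eval point
    dose2 = ((ev[None, :] - ref[:, None]) / dose_tol) ** 2
    gamma = np.sqrt((dist2 + dose2).min(axis=1))            # best match for each ref point
    keep = ref >= low_cut * ref.max()                       # ignore very low dose
    return 100.0 * float(np.mean(gamma[keep] <= 1.0))

x = np.arange(100) * 1.0
ref = 70.0 * np.exp(-((x - 50.0) / 15.0) ** 2)
shifted = 70.0 * np.exp(-((x - 52.0) / 15.0) ** 2)          # same profile moved by 2 mm

print(gamma_pass_rate_1d(ref, ref, spacing_mm=1.0))       # 100.0
print(gamma_pass_rate_1d(ref, shifted, spacing_mm=1.0))   # 100.0
```

A 2 mm shift passes everywhere because each reference point finds an identical dose within the 3 mm distance tolerance, which illustrates how gamma forgives small spatial offsets that a voxel-wise MAE would penalize.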

### 6.6 Per-Case Results

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig6_per_case_boxplots.png', width=900))

**Caption:** Distribution of key metrics across 7 test cases (seed 42). (A) MAE in Gy, (B) Global Gamma 3%/3mm pass rate, (C) PTV-region Gamma 3%/3mm, (D) PTV70 D95 error (predicted minus GT, negative = underdose). Individual data points shown as colored dots; box shows IQR.

**Key observations:**
- **MAE (4.80 +/- 2.45 Gy):** Wide spread from 1.5 to 9.3 Gy; two outlier cases (P0005, P0065) have large OAR errors that dominate the mean
- **Global Gamma (28.1 +/- 12.6%):** Consistently low across all cases, confirming this is not an outlier effect
- **PTV Gamma (85.5 +/- 10.9%):** Most cases clustered 83-96%, with P0079 as a clear outlier at 61%
- **D95 Error (-0.86 +/- 0.92 Gy):** Systematic negative bias (underdose) — 5/7 cases show underdose. This is exactly the pattern the asymmetric PTV loss targets
- **Clinical implication:** The D95 underdose bias is the most actionable finding — it's systematic, clinically relevant, and directly addressable with loss function engineering. The wide MAE spread reflects OAR dose prediction failures in atypical cases, addressable with more data and structure-aware losses

### 6.7 QUANTEC Compliance

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig7_quantec_compliance.png', width=900))

**Caption:** QUANTEC constraint compliance heatmap for 7 test cases. Green (P) = pass, orange (F) = fail. Rows: test cases; columns: individual clinical constraints.

**Key observations:**
- **4/7 cases pass all constraints (57% full compliance)**
- All failures are **Dmax violations** (single-voxel hotspots): Rectum Dmax (P0027: 75.8 Gy, P0056: 75.1 Gy), Bladder Dmax (P0065: 75.8 Gy), Bowel Dmax (P0065: 55.5 Gy)
- **Volume-based constraints (V70, V60, V50, V45) pass in ALL cases**, which suggests the model predicts OAR DVH shapes well enough to satisfy these limits
- PTV70 D95 and V95 pass in ALL cases — the model meets PTV coverage requirements
- **Clinical implication:** The Dmax violations are marginal (0.1-3.5 Gy over limit) and represent single-voxel artifacts, not systematic DVH failure. These could be addressed by Dmax-aware loss terms or post-processing. The volume constraint compliance is excellent and clinically meaningful
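
The contrast between Dmax and volume-based constraints can be illustrated with a small sketch. The structure below is synthetic, the limits are placeholders rather than the values used by the evaluation script, and both helper names are hypothetical.

```python
def v_at_dose(doses_gy, threshold_gy):
    """Vx: percent of structure voxels receiving at least threshold_gy."""
    return 100.0 * sum(d >= threshold_gy for d in doses_gy) / len(doses_gy)

def check_constraints(doses_gy, dmax_limit_gy, vx_limits):
    """vx_limits maps a dose threshold (Gy) to the maximum allowed volume percent."""
    results = {"Dmax": max(doses_gy) <= dmax_limit_gy}
    for threshold, max_percent in vx_limits.items():
        results[f"V{threshold:g}"] = v_at_dose(doses_gy, threshold) <= max_percent
    return results

# Synthetic OAR: 999 voxels at 40 Gy plus a single 75.8 Gy hotspot voxel.
doses = [40.0] * 999 + [75.8]

# Placeholder limits for illustration only.
print(check_constraints(doses, dmax_limit_gy=75.0, vx_limits={70: 20.0, 50: 50.0}))
# {'Dmax': False, 'V70': True, 'V50': True}
```

One hot voxel is enough to fail Dmax while barely moving any Vx value (here V70 is 0.1%), which matches the pattern of Dmax-only failures described above.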

### 6.8 Femur L/R Asymmetry

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/baseline_v23/figures/fig8_femur_asymmetry.png', width=900))

**Caption:** Femur L/R dose prediction asymmetry analysis. (A) Paired MAE bars for Femur_L (green) vs Femur_R (blue) per case. (B) MAE difference (L minus R) showing consistent positive bias. Mean difference: 7.19 Gy. Femur_L is worse in 7/7 cases.

**Key observations:**
- **Femur_L MAE is 1.4-4.3x the Femur_R MAE in every test case (7/7)**, which makes a chance sampling effect unlikely
- Mean Femur_L MAE: 11.8 Gy vs Femur_R: 4.6 Gy (mean difference: 7.2 Gy)
- A difference this consistent points to a systematic cause, most likely a bias in the training data
- **Possible causes:** (1) Beam arrangement asymmetry in treatment plans (e.g., preferential beam angles that deliver more dose through left femur), (2) Patient positioning/anatomy laterality in the dataset, (3) L/R label confusion in some source contours
- **Clinical implication:** This finding should be investigated before the production run. If it persists with more data, it may indicate a real dosimetric pattern in VMAT prostate plans. If it's a data issue (label swap), fixing it would immediately improve model accuracy
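
How unlikely is a 7/7 split under pure chance? A minimal sketch using an exact two-sided sign test, which only assumes that L > R and R > L would be equally likely if there were no systematic effect. This test is not part of the notebook's planned analysis, and `sign_test_p_value` is a hypothetical helper.

```python
from math import comb

def sign_test_p_value(n_positive, n_total):
    """Two-sided exact sign test against a 50/50 null (ties removed beforehand)."""
    k = max(n_positive, n_total - n_positive)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2.0 * tail)

print(sign_test_p_value(7, 7))  # 0.015625
```

With only 7 cases, 7/7 is the most extreme outcome possible, so about 0.016 is the smallest p-value this test can return. It speaks to the consistency of the sign only, not to which of the listed causes is responsible.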

---

## 7. Statistical Analysis

**Status: PENDING** — requires seeds 123 and 456 to complete. Statistical analysis with n=7 cases from a single seed cannot support formal inference.

When all 3 seeds are complete, this section will include:
- Per-condition summary: mean +/- std across 3 seeds (averaged per-case first), with 95% bootstrap CI
- Comparison to subsequent experiments via paired Wilcoxon signed-rank test
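
As a minimal sketch of the bootstrap interval mentioned above, a percentile bootstrap CI for a mean can be computed with the standard library alone. The input values are synthetic, the helper name `bootstrap_ci` and its defaults are assumptions, and the final analysis may resample at a different level (for example per seed rather than per case).

```python
import random

def bootstrap_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement, recording the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Synthetic per-case metric values (not results from this experiment).
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
lower, upper = bootstrap_ci(values)
print(f"mean = {sum(values) / len(values):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

With n=7 the interval is wide and depends heavily on individual cases, which is consistent with deferring formal inference until all seeds are available.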

### Preliminary Descriptive Statistics (Seed 42 Only)

In [None]:
import json
import numpy as np
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()
results_path = PROJECT_ROOT / 'predictions' / 'baseline_v23_seed42_test' / 'baseline_evaluation_results.json'

with open(results_path) as f:
    data = json.load(f)

# Extract metrics
cases = data['per_case_results']
maes = [c['dose_metrics']['mae_gy'] for c in cases]
gammas_g = [c['gamma']['global_3mm3pct']['gamma_pass_rate'] for c in cases]
gammas_p = [c['gamma']['ptv_region_3mm3pct']['gamma_pass_rate'] for c in cases]
d95_errors = [
    c['dvh_metrics']['PTV70']['D95_error']
    for c in cases
    if c['dvh_metrics'].get('PTV70', {}).get('D95_error') is not None
]

print('Descriptive Statistics (n=7, seed 42 only — NOT for formal inference)')
print('='*65)
print(f'{"Metric":<30} {"Mean":>8} {"Std":>8} {"Min":>8} {"Max":>8}')
print('-'*65)
for name, vals in [('MAE (Gy)', maes), ('Global Gamma (%)', gammas_g),
                    ('PTV Gamma (%)', gammas_p), ('D95 Error (Gy)', d95_errors)]:
    print(f'{name:<30} {np.mean(vals):>8.2f} {np.std(vals):>8.2f} {np.min(vals):>8.2f} {np.max(vals):>8.2f}')

---

## 8. Cross-Experiment Comparison

**Status: PENDING** — this is the first v2.3 experiment. No valid prior results exist for comparison (pilot metrics are invalid due to D95 artifact).

| Experiment | Data | MAE (Gy) | Global Gamma | PTV Gamma | D95 \|Error\| | QUANTEC |
|------------|------|----------|-------------|-----------|-------------|------|
| **baseline_v23** (seed 42) | v2.3, 74 cases | 4.80 +/- 2.45 | 28.1 +/- 12.6% | 85.5 +/- 10.9% | 1.01 +/- 0.76 Gy | 57% |
| pilot baseline (INVALID) | v2.2, 23 cases | ~~1.43~~ | ~~14.2%~~ | — | ~~-20 Gy~~ | — |

*Note: MAE/Gamma updated 2026-02-24 with overlap=64 inference (#50). PTV Gamma, D95, and QUANTEC unchanged (PTV-region metrics are insensitive to overlap).*

The pilot numbers are struck through because they were computed on data with the D95 preprocessing artifact (#4) and cannot be meaningfully compared.

This table will expand as Phase 2 experiments are completed.

---

## 9. Conclusions, Limitations, and Next Steps

### Conclusions

1. **Pipeline validated end-to-end** — v2.3 preprocessing, training, inference, and evaluation all work correctly on the work machine. This was the primary goal of this experiment.
2. **D95 preprocessing fix confirmed** — all 7 GT D95 values >= 66.5 Gy (min = 68.7 Gy), confirming issue #4 is resolved.
3. **PTV70 D95 accuracy is surprisingly good for a pure MSE baseline** — mean |error| of 1.01 Gy is within the pre-registered 2 Gy clinical threshold in 6/7 cases.
4. **Systematic PTV underdose (-0.86 Gy bias) validates the Phase 2 strategy** — the asymmetric PTV loss is designed to penalize exactly this pattern.
5. **QUANTEC violations are exclusively Dmax hotspots** — all volume constraints pass, indicating the model learns good DVH shapes but has single-voxel noise at high doses.
6. **Femur L/R asymmetry is a new finding** — 7.2 Gy mean difference, consistent across 100% of test cases. Requires investigation.
7. **Significant overfitting (8.5x val/train ratio)** — more data is the primary lever for improvement.

### Limitations

- **Single seed** — no measure of training variability. Seeds 123/456 are required for publishable results.
- **Small test set (n=7)** — wide confidence intervals; individual cases dominate summary statistics.
- **Mixed SIB + single-Rx dataset** — 63/74 cases are single-Rx (no PTV56). The model is primarily learning single-Rx dose patterns, which may not transfer to the SIB-only production dataset.
- **No PTV56 in test set** — cannot evaluate the SIB dose-painting accuracy that is central to the paper.
- **Per-seed data split** — test cases differ across seeds (locked stratified split #38 not yet implemented).
- **Gamma subsample=4** — faster but lower resolution than publication-quality subsample=2.

### Next Steps

- [ ] Investigate Femur L/R asymmetry: check training data for L/R label consistency, beam arrangement patterns
- [ ] **Decision:** Run seeds 123/456 on this mixed dataset, or wait for SIB-only production data?
- [ ] Collect Institution B data (#2) — this is the primary blocker for the production experiment
- [ ] Define and implement locked stratified test set (#38)
- [ ] When production data is ready: re-run baseline as Condition 1 of the Phase 2 ablation study

---

## 10. Artifacts

| Artifact | Path |
|----------|------|
| Best Checkpoint | `runs/baseline_v23_seed42/checkpoints/best-epoch=172-val/mae_gy=6.047.ckpt` |
| Last Checkpoint | `runs/baseline_v23_seed42/checkpoints/last.ckpt` |
| Training Metrics | `runs/baseline_v23_seed42/version_1/metrics.csv` |
| Training Config | `runs/baseline_v23_seed42/training_config.json` |
| Training Summary | `runs/baseline_v23_seed42/training_summary.json` |
| Test Cases | `runs/baseline_v23_seed42/test_cases.json` |
| Environment Snapshot | `runs/baseline_v23_environment_snapshot.txt` |
| Run Log | `runs/baseline_v23_run.log` |
| Figures (PNG + PDF) | `runs/baseline_v23/figures/` (8 figures, 16 files) |
| Figure Generation Script | `scripts/generate_baseline_v23_figures.py` |
| Test Predictions | `predictions/baseline_v23_seed42_test/*.npz` |
| Evaluation Results | `predictions/baseline_v23_seed42_test/baseline_evaluation_results.json` |

---

*Notebook created: 2026-02-24*  
*Last updated: 2026-02-24*  
*Status: Preliminary (seed 42 only)*