# Experiment: Baseline U-Net v2.3 — 3-Seed Aggregate

**Date:** 2026-02-26  
**Experiment ID:** `baseline_v23` (3-seed aggregate)  
**Status:** Complete  
**Type:** Training (3-seed)  
**GitHub Issue:** [#54](https://github.com/wrockey/vmat-diffusion/issues/54)  
**Prior notebook:** [2026-02-24 seed42 preliminary](2026-02-24_baseline_v23_preliminary.ipynb)  

---

## 1. Overview

### 1.1 Objective

Complete the 3-seed baseline evaluation on v2.3 data (74 cases) to establish the MSE-only reference for the Phase 2 combined loss ablation study. Seeds 42, 123, and 456 use identical architecture and hyperparameters; only the random seed differs (affecting data split, weight initialization, and augmentation order).

### 1.2 Key Results (3-Seed Aggregate)

| Metric | Aggregate (mean ± seed std) | Per-Seed Range | Target | Status |
|--------|------|------|--------|--------|
| MAE | **4.22 ± 0.53 Gy** | 3.51 – 4.80 Gy | diagnostic | n/a |
| Gamma 3%/3mm (global) | **33.8 ± 4.6%** | 28.1 – 39.2% | diagnostic | n/a |
| Gamma 3%/3mm (PTV-region) | **80.2 ± 5.3%** | 74.6 – 85.5% | > 95% | Below target |
| PTV70 D95 Gap | **-1.76 ± 0.69 Gy** | -2.46 to -0.86 Gy | > -2 Gy | 2/3 seeds pass |

### 1.3 Conclusion

The 3-seed baseline confirms **systematic PTV underdosing** (-1.76 Gy mean D95 gap, negative in all 3 seeds) as the primary clinical target for Phase 2. Seed variability is moderate (MAE std 0.53 Gy), so 3 seeds is sufficient per the pre-registered decision rule (effect must be > 2× seed std to claim significance). PTV-region Gamma (80.2%) is solid but below the 95% clinical target. Case 0065 is the worst case in both seeds whose test set includes it (seeds 42 and 456, MAE 9–10 Gy), which points to an anatomy-driven outlier rather than seed sensitivity. These numbers establish the reference point for the 16-condition Phase 2 ablation study.
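
As a minimal sketch of the decision rule quoted above, the check below compares an effect size against twice the seed standard deviation. The function name `exceeds_seed_noise` and both candidate values are hypothetical; only the baseline mean and seed std (4.22 ± 0.53 Gy) come from the table in 1.2.

```python
def exceeds_seed_noise(baseline_mean, candidate_mean, seed_std, k=2.0):
    """Apply the 'effect must be > k x seed std' rule.

    Returns (effect, threshold, passes). Hypothetical helper, not project code.
    """
    effect = abs(candidate_mean - baseline_mean)
    threshold = k * seed_std
    return effect, threshold, effect > threshold


# Baseline MAE from this notebook: 4.22 +/- 0.53 Gy.
# The candidate MAEs (3.00 and 3.50 Gy) are made-up values for illustration.
for candidate in (3.00, 3.50):
    effect, threshold, passes = exceeds_seed_noise(4.22, candidate, 0.53)
    print(f'candidate {candidate:.2f} Gy: effect {effect:.2f} Gy, '
          f'threshold {threshold:.2f} Gy, significant: {passes}')
```

Under this rule, an MAE change smaller than about 1.06 Gy would be treated as indistinguishable from seed noise.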

---

## 2. What Changed

Compared to the [seed 42 preliminary notebook](2026-02-24_baseline_v23_preliminary.ipynb), this notebook adds seeds 123 and 456. **Everything else is identical**: same architecture, loss, hyperparameters, and data version.

| Parameter | Seed 42 Preliminary | This Notebook |
|-----------|--------------------|--------------|
| Seeds | 42 only | **42, 123, 456** |
| Reporting | Per-seed metrics | **Mean ± std across seeds** |
| Status | Preliminary | **Complete** |
| Statistical analysis | Pending | **Done** |

**Single variable under test:** Random seed (data split, weight init, augmentation order).

---

## 3. Reproducibility

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

SEED_INFO = {
    42: {
        'git_commit': '82bddc5e5cac8faaa3aa63b14686bdccbf6bba3b',
        'git_message': 'fix: Patch sampling crash when volume Z equals patch_size',
    },
    123: {
        'git_commit': 'c2454b8',
        'git_message': 'docs: Add seed123 figures and update experiment index',
    },
    456: {
        'git_commit': '11afb2f',
        'git_message': 'fix: Phase 2 code review findings (6 fixes)',
    },
}

COMMON_REPRO = {
    'python_version': '3.12.12',
    'pytorch_version': '2.10.0+cu126',
    'pytorch_lightning_version': '2.6.1',
    'cuda_version': '12.6',
    'gpu': 'NVIDIA GeForce RTX 3090',
    'platform': 'WSL2 Ubuntu 24.04 LTS',
}

print('Common environment:')
for k, v in COMMON_REPRO.items():
    print(f'  {k}: {v}')

print('\nPer-seed git commits:')
for seed, info in SEED_INFO.items():
    print(f'  Seed {seed}: {info["git_commit"][:7]} ({info["git_message"]})')

# Check environment snapshots
for seed, run_dir in [(42, 'runs/baseline_v23_seed42'),
                       (123, 'scripts/runs/baseline_v23_seed123'),
                       (456, 'scripts/runs/baseline_v23_seed456')]:
    snap = PROJECT_ROOT / run_dir / 'environment_snapshot.txt'
    if not snap.exists():
        snap = PROJECT_ROOT / 'runs' / 'baseline_v23_environment_snapshot.txt'
    print(f'  Seed {seed} env snapshot: {"exists" if snap.exists() else "MISSING"}')

### Commands to Reproduce

```bash
# Seed 42
python scripts/train_baseline_unet.py \
    --data_dir ~/data/processed_npz --exp_name baseline_v23_seed42 \
    --epochs 200 --batch_size 2 --seed 42

# Seed 123 (run from scripts/; the subshell leaves the working directory unchanged)
(cd scripts && python train_baseline_unet.py \
    --data_dir ~/data/processed_npz --exp_name baseline_v23_seed123 \
    --epochs 200 --batch_size 2 --seed 123)

# Seed 456 (also run from scripts/)
(cd scripts && python train_baseline_unet.py \
    --data_dir ~/data/processed_npz --exp_name baseline_v23_seed456 \
    --epochs 200 --batch_size 2 --seed 456)

# Inference (each seed — use overlap=64 per #50)
python scripts/inference_baseline_unet.py \
    --checkpoint <seed_checkpoint> \
    --input_dir <test_symlink_dir> \
    --output_dir predictions/baseline_v23_seed<N>_test \
    --compute_metrics --overlap 64 --gamma_subsample 4
```

---

## 4. Dataset

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

print('Preprocessing version: v2.3.0')
print('Total cases: 74 (11 SIB 70/56 Gy + 63 single-Rx 70 Gy only)')
print('Split: 60 train / 7 val / 7 test (per seed, NOT locked)')
print()

# Load and display test splits per seed
seed_paths = {
    42: PROJECT_ROOT / 'runs/baseline_v23_seed42/test_cases.json',
    123: PROJECT_ROOT / 'scripts/runs/baseline_v23_seed123/test_cases.json',
    456: PROJECT_ROOT / 'scripts/runs/baseline_v23_seed456/test_cases.json',
}

all_test_cases = set()
for seed, path in seed_paths.items():
    with open(path) as f:
        info = json.load(f)
    cases = info['test_cases']
    all_test_cases.update(cases)
    print(f'Seed {seed} test cases: {sorted(cases)}')

# Check overlap
for s1, s2 in [(42, 123), (42, 456), (123, 456)]:
    with open(seed_paths[s1]) as f:
        c1 = set(json.load(f)['test_cases'])
    with open(seed_paths[s2]) as f:
        c2 = set(json.load(f)['test_cases'])
    overlap = c1 & c2
    print(f'  Seed {s1} \u2229 {s2} overlap: {len(overlap)} cases ({sorted(overlap) if overlap else "none"})')

print(f'\nTotal unique test cases across all seeds: {len(all_test_cases)}')
print(f'\nNote: Per-seed splits (not locked). Test sets overlap partially.')
print('WARNING: No PTV56 structures in any test split (all single-Rx cases).')

---

## 5. Model & Training Configuration

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

# All seeds use identical config — load seed 42 as reference
config_path = PROJECT_ROOT / 'runs/baseline_v23_seed42/training_config.json'
with open(config_path) as f:
    config = json.load(f)

print(f'Model: {config["model"]}')
print(f'Parameters: {config["model_params"]:,}')
print(f'Script version: {config["version"]}')
print(f'\nHyperparameters (identical across all 3 seeds):')
for k, v in sorted(config['hparams'].items()):
    print(f'  {k}: {v}')

# Per-seed training summaries
print(f'\n{"="*65}')
print(f'{"Seed":<8} {"Duration (h)":<14} {"Best Val MAE":<16} {"Best Epoch":<12} {"Final Epoch"}')
print(f'{"-"*65}')

summaries = {
    42: PROJECT_ROOT / 'runs/baseline_v23_seed42/training_summary.json',
    123: PROJECT_ROOT / 'scripts/runs/baseline_v23_seed123/training_summary.json',
    456: PROJECT_ROOT / 'scripts/runs/baseline_v23_seed456/training_summary.json',
}

for seed, path in summaries.items():
    with open(path) as f:
        s = json.load(f)
    print(f'{seed:<8} {s["total_time_hours"]:<14.1f} {s["best_val_mae_gy"]:<16.3f} {"see ckpt":<12} {s["final_metrics"]["epoch"]}')

---

## 6. Results

All figures generated by `scripts/generate_baseline_v23_figures.py` with `--seed` flag.  
Per-seed figures in `runs/baseline_v23/figures_seed{42,123,456}/`.  
Aggregate figures (seed 42 representative case) in `runs/baseline_v23/figures/`.  
Inference uses overlap=64 (default updated in #50).

### 6.1 Training Curves (All 3 Seeds)

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig1_training_curves.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig1_training_curves.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig1_training_curves.png', width=800))

**Caption:** Training curves for all 3 seeds (42, 123, 456). Each panel shows training loss, validation loss, validation MAE, and validation Gamma vs epoch. All seeds show similar convergence patterns: rapid initial improvement followed by significant overfitting (val/train loss ratio 8-10x by epoch 150).

**Key observations:**
- All 3 seeds converge by ~epoch 100; additional training yields marginal improvement
- Best val MAE varies substantially: seed 456 (3.55 Gy) < seed 123 (4.16 Gy) < seed 42 (6.05 Gy)
- This val MAE spread is partly a small-sample artifact (the validation set has only 7 cases); test MAE is more stable (3.51–4.80 Gy)
- Overfitting pattern is consistent across seeds, which is consistent with a data-limited effect (60 training cases for 23.7M parameters)

### 6.2 Dose Colorwash (Representative Cases, Per Seed)

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42 (prostate70gy_0056, MAE=3.30 Gy)</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig2_dose_colorwash.png', width=800))
display(HTML('<h4>Seed 123 (representative case)</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig2_dose_colorwash.png', width=800))
display(HTML('<h4>Seed 456 (representative case)</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig2_dose_colorwash.png', width=800))

**Caption:** Predicted vs ground truth dose distributions overlaid on CT for representative cases (below-median MAE) from each seed. Axial, coronal, and sagittal views through PTV70 centroid. Dose displayed in Gy with 5 Gy threshold.

**Key observations:**
- PTV70 coverage is visually excellent across all seeds: the high-dose region matches closely
- Low-dose "spray" region shows broader predicted distribution vs GT in all seeds, consistent with the model averaging multiple valid low-dose solutions (semi-multi-modal hypothesis)
- Dose gradients at PTV boundary appear smoother in predictions, as expected with MSE loss
- **Clinical implication:** Consistent visual quality across seeds supports model reliability, though only one representative case per seed is shown

### 6.3 Dose Difference Maps

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig3_dose_difference.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig3_dose_difference.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig3_dose_difference.png', width=800))

**Caption:** Dose difference maps (predicted minus ground truth, Gy) for representative cases from each seed. Blue = underdose, red = overdose.

**Key observations:**
- Error spatial pattern is consistent across seeds: largest errors in low-to-intermediate dose transition zone, minimal difference in PTV itself
- Slight blue (underdose) bias visible near PTV boundary across all seeds, consistent with the D95 gap finding
- **Clinical implication:** Error pattern is spatially coherent, not random noise, and so addressable by targeted loss functions

### 6.4 DVH Comparison

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig4_dvh_comparison.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig4_dvh_comparison.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig4_dvh_comparison.png', width=800))

**Caption:** DVH curves for representative cases from each seed. Solid = ground truth, dashed = predicted. All major structures shown.

**Key observations:**
- PTV70 DVH shape is well-captured across all seeds: the steep drop-off near 70 Gy is consistent
- OAR DVH curves are systematically slightly higher in predictions (conservative bias)
- Femur L/R asymmetry visible in DVH across all seeds
- **Clinical implication:** Conservative OAR bias is "safe" but may trigger false QUANTEC violations

### 6.5 Gamma Analysis

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig5_gamma_bar_chart.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig5_gamma_bar_chart.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig5_gamma_bar_chart.png', width=800))

**Caption:** Global vs PTV-region Gamma 3%/3mm pass rates per test case for each seed. Blue bars: global; gold bars: PTV-region. Dashed line: 95% clinical target.

**Key observations:**
- PTV-region Gamma consistently outperforms global Gamma by 2-3x across all seeds
- PTV-region Gamma by seed: seed 42 (85.5%), seed 123 (74.6%), seed 456 (78.8%)
- Global Gamma is driven down by low-dose background diversity; it is diagnostic only, not a clinical concern
- **Clinical implication:** PTV-region Gamma (80.2% aggregate) is the meaningful metric; 95% target requires Phase 2 losses
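
For readers less familiar with the metric, the following is a minimal 1D sketch of a global 3%/3mm Gamma pass rate. It assumes a brute-force search on the sampling grid with no interpolation, and a low-dose cutoff expressed as a percentage of the maximum reference dose. It is not the project's implementation, which evaluates 3D volumes with subsampling; `gamma_pass_rate_1d`, its parameters, and the dose profile are illustrative.

```python
import math


def gamma_pass_rate_1d(ref, pred, spacing_mm, dose_pct=3.0, dta_mm=3.0, cutoff_pct=10.0):
    """Percentage of reference points with gamma <= 1 (global normalisation).

    Illustrative only: grid search without interpolation on a 1D profile.
    """
    max_ref = max(ref)
    dose_tol = dose_pct / 100.0 * max_ref   # global: tolerance relative to max dose
    cutoff = cutoff_pct / 100.0 * max_ref   # skip points below the low-dose cutoff
    passed = evaluated = 0
    for i, d_ref in enumerate(ref):
        if d_ref < cutoff:
            continue
        evaluated += 1
        # Gamma at point i: minimum combined distance/dose deviation over all pred points.
        gamma = min(
            math.hypot((j - i) * spacing_mm / dta_mm, (d_pred - d_ref) / dose_tol)
            for j, d_pred in enumerate(pred)
        )
        passed += gamma <= 1.0
    return 100.0 * passed / evaluated if evaluated else float('nan')


ref = [0.0, 10.0, 40.0, 70.0, 70.0, 40.0, 10.0, 0.0]    # hypothetical profile, Gy
shifted = [0.0] + ref[:-1]                               # same shape, moved by one sample
offset = [d + 5.0 for d in ref]                          # uniform +5 Gy error

print(gamma_pass_rate_1d(ref, ref, spacing_mm=1.0))      # identical profiles
print(gamma_pass_rate_1d(ref, shifted, spacing_mm=1.0))  # 1 mm shift, within 3 mm DTA
print(gamma_pass_rate_1d(ref, offset, spacing_mm=1.0))   # 5 Gy exceeds 3% of 70 Gy
```

The shifted profile still passes everywhere because the distance-to-agreement term absorbs a 1 mm displacement, whereas the uniform 5 Gy offset fails at every evaluated point.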

### 6.6 Per-Case Box Plots

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig6_per_case_boxplots.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig6_per_case_boxplots.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig6_per_case_boxplots.png', width=800))

**Caption:** Metric distributions across 7 test cases per seed: MAE (Gy), Global Gamma, PTV-region Gamma, and PTV70 D95 error.

**Key observations:**
- D95 error is consistently negative across all seeds, confirming systematic underdosing
- Case 0065 appears in seed 42 and seed 456 test sets and is consistently the worst performer (MAE ~9-10 Gy), confirming anatomy-driven difficulty
- Inter-case variability is large relative to inter-seed variability, consistent with n=7 test set limitations
- **Clinical implication:** Case-level variability drives the wide confidence intervals; more test cases (locked split on full dataset) needed for tighter estimates
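
To make the D95 error concrete: D95 is the dose that at least 95% of the structure volume receives, and the gap is assumed here to be predicted minus ground truth, so that negative values mean underdosing (matching the sign used in this notebook). The sketch below uses hypothetical voxel doses and hypothetical helper names; it does not interpolate the DVH.

```python
def d95(voxel_doses_gy):
    """Largest dose that at least 95% of the voxels receive (no interpolation)."""
    ordered = sorted(voxel_doses_gy, reverse=True)
    k = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer arithmetic
    return ordered[k - 1]


def d95_gap(pred_gy, truth_gy):
    """Predicted D95 minus ground-truth D95; negative means predicted underdose."""
    return d95(pred_gy) - d95(truth_gy)


# Hypothetical PTV of 100 voxels: the plan delivers 70 Gy everywhere,
# while the prediction drops to 66 Gy in 10% of the volume.
truth = [70.0] * 100
pred = [70.0] * 90 + [66.0] * 10

gap = d95_gap(pred, truth)
print(f'D95 gap: {gap:+.1f} Gy')
```

Because D95 depends on the coldest 5% of the target, a modest cold region near the PTV boundary is enough to produce a negative gap even when most of the target is predicted correctly.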

### 6.7 QUANTEC Compliance

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig7_quantec_compliance.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig7_quantec_compliance.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig7_quantec_compliance.png', width=800))

**Caption:** QUANTEC constraint compliance heatmaps for all 3 seeds. Green (P) = pass, orange (F) = fail.

**Key observations:**
- All failures across all seeds are Dmax violations (single-voxel hotspots), not volume constraint failures
- Volume-based constraints (V70, V60, V50, V45) pass universally
- PTV70 D95/V95 constraints pass in all cases across all seeds
- **Clinical implication:** Model correctly predicts DVH shapes; Dmax hotspot violations are marginal and addressable
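
The contrast between Dmax and volume constraints can be illustrated with a small sketch. The voxel doses and both limits (75 Gy Dmax, V70 < 20%) are hypothetical placeholders chosen for the example, not the constraint values used in this notebook.

```python
def v_x_percent(doses_gy, threshold_gy):
    """Percentage of the structure volume receiving at least threshold_gy."""
    return 100.0 * sum(d >= threshold_gy for d in doses_gy) / len(doses_gy)


# Hypothetical OAR with 1000 voxels at 40 Gy; the prediction adds one 76 Gy hotspot voxel.
reference = [40.0] * 1000
predicted = [40.0] * 999 + [76.0]

DMAX_LIMIT_GY = 75.0   # hypothetical limit
V70_LIMIT_PCT = 20.0   # hypothetical limit

for name, doses in (('reference', reference), ('predicted', predicted)):
    dmax = max(doses)
    v70 = v_x_percent(doses, 70.0)
    print(f'{name}: Dmax={dmax:.1f} Gy ({"pass" if dmax < DMAX_LIMIT_GY else "FAIL"}), '
          f'V70={v70:.1f}% ({"pass" if v70 < V70_LIMIT_PCT else "FAIL"})')
```

A single hot voxel flips the Dmax check while moving V70 by only 0.1 percentage points, which is why Dmax failures can coexist with well-predicted DVH shapes.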

### 6.8 Femur L/R Asymmetry

In [None]:
from IPython.display import Image, display, HTML
display(HTML('<h4>Seed 42</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed42/fig8_femur_asymmetry.png', width=800))
display(HTML('<h4>Seed 123</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed123/fig8_femur_asymmetry.png', width=800))
display(HTML('<h4>Seed 456</h4>'))
display(Image(filename='../runs/baseline_v23/figures_seed456/fig8_femur_asymmetry.png', width=800))

**Caption:** Femur L/R dose prediction asymmetry for all 3 seeds. Femur_L MAE is consistently higher than Femur_R.

**Key observations:**
- Femur L > R asymmetry is reproduced across all 3 seeds, confirming it is not seed-dependent
- This is likely a real dosimetric pattern in VMAT prostate plans (beam arrangement bias) rather than a label issue
- **Clinical implication:** Known limitation; monitor but not a priority for Phase 2 loss engineering

---

## 7. Statistical Analysis

### 7.1 Per-Seed Test Set Results

In [None]:
import json
import numpy as np
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()
eval_paths = {
    42: PROJECT_ROOT / 'predictions/baseline_v23_seed42_test/baseline_evaluation_results.json',
    123: PROJECT_ROOT / 'predictions/baseline_v23_seed123_test/baseline_evaluation_results.json',
    456: PROJECT_ROOT / 'predictions/baseline_v23_seed456_test/baseline_evaluation_results.json',
}

seed_metrics = {}
for seed, path in eval_paths.items():
    with open(path) as f:
        d = json.load(f)
    cases = d['per_case_results']
    maes = [c['dose_metrics']['mae_gy'] for c in cases]
    gammas_g = [c['gamma']['global_3mm3pct']['gamma_pass_rate'] for c in cases]
    gammas_p = [c['gamma']['ptv_region_3mm3pct']['gamma_pass_rate'] for c in cases]
    d95_gaps = [c['dvh_metrics']['PTV70']['D95_error'] for c in cases
                if c['dvh_metrics'].get('PTV70', {}).get('D95_error') is not None]
    seed_metrics[seed] = {
        'mae': maes, 'gamma_g': gammas_g, 'gamma_p': gammas_p, 'd95': d95_gaps,
        'case_ids': [c['case_id'] for c in cases],
    }

print(f'{"Seed":<8} {"MAE (Gy)":<20} {"Gamma Global (%)":<20} {"Gamma PTV (%)":<20} {"D95 Gap (Gy)":<20}')
print('=' * 88)
for seed in [42, 123, 456]:
    s = seed_metrics[seed]
    print(f'{seed:<8} '
          f'{np.mean(s["mae"]):.2f} \u00b1 {np.std(s["mae"]):.2f}{"":<8} '
          f'{np.mean(s["gamma_g"]):.1f} \u00b1 {np.std(s["gamma_g"]):.1f}{"":<8} '
          f'{np.mean(s["gamma_p"]):.1f} \u00b1 {np.std(s["gamma_p"]):.1f}{"":<8} '
          f'{np.mean(s["d95"]):.2f} \u00b1 {np.std(s["d95"]):.2f}')

### 7.2 Cross-Seed Aggregate (Mean of Seed Means ± Seed Std)

In [None]:
import numpy as np

# Aggregate: mean of seed means, std of seed means
metrics_names = ['mae', 'gamma_g', 'gamma_p', 'd95']
display_names = ['MAE (Gy)', 'Gamma Global (%)', 'Gamma PTV-region (%)', 'PTV70 D95 Gap (Gy)']

print(f'{"Metric":<28} {"Aggregate":<20} {"Seed 42":<12} {"Seed 123":<12} {"Seed 456":<12}')
print('=' * 84)

for name, disp in zip(metrics_names, display_names):
    seed_means = [np.mean(seed_metrics[s][name]) for s in [42, 123, 456]]
    agg_mean = np.mean(seed_means)
    agg_std = np.std(seed_means)
    print(f'{disp:<28} {agg_mean:.2f} \u00b1 {agg_std:.2f}{"":<8} '
          f'{seed_means[0]:.2f}{"":<8} {seed_means[1]:.2f}{"":<8} {seed_means[2]:.2f}')

print(f'\nSeed variability assessment:')
mae_means = [np.mean(seed_metrics[s]['mae']) for s in [42, 123, 456]]
print(f'  MAE seed std: {np.std(mae_means):.2f} Gy')
print(f'  MAE seed range: {min(mae_means):.2f} \u2013 {max(mae_means):.2f} Gy')
print(f'  Decision rule: effect must be > {2*np.std(mae_means):.2f} Gy (2\u00d7 seed std) to claim significance')
print(f'  3 seeds sufficient (seed std {np.std(mae_means):.2f} < effect threshold)')
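
One detail of the aggregate above: `np.std` defaults to `ddof=0` (population standard deviation). With only three seed means, the sample standard deviation (`ddof=1`) is noticeably larger, which would also raise the 2× threshold. The sketch below shows the difference using the standard library. The three values are illustrative: they are chosen to match the reported MAE range and mean, and the middle value is inferred rather than read from the results files.

```python
import statistics

# Illustrative seed means in Gy; the middle value is inferred, not measured.
seed_means = [3.51, 4.35, 4.80]

pop_std = statistics.pstdev(seed_means)    # equivalent to np.std(x) / np.std(x, ddof=0)
sample_std = statistics.stdev(seed_means)  # equivalent to np.std(x, ddof=1)

print(f'population std (ddof=0): {pop_std:.2f} Gy')
print(f'sample std     (ddof=1): {sample_std:.2f} Gy')
print(f'2x threshold: {2 * pop_std:.2f} Gy vs {2 * sample_std:.2f} Gy')
```

Whichever convention is used, it should be stated alongside the aggregate and applied identically to every experiment compared against this baseline.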

### 7.3 Case-Level Analysis

In [None]:
import numpy as np

# Identify outlier cases that appear in multiple seeds
print('Cases appearing in multiple test sets:')
all_cases = {}
for seed in [42, 123, 456]:
    for i, cid in enumerate(seed_metrics[seed]['case_ids']):
        if cid not in all_cases:
            all_cases[cid] = []
        all_cases[cid].append((seed, seed_metrics[seed]['mae'][i]))

for cid, appearances in sorted(all_cases.items()):
    if len(appearances) > 1:
        details = ', '.join([f'seed {s}: MAE={m:.2f}' for s, m in appearances])
        print(f'  {cid}: {details}')

# Case 0065 analysis
print(f'\nCase 0065 (worst performer):')
for seed in [42, 123, 456]:
    s = seed_metrics[seed]
    if 'prostate70gy_0065' in s['case_ids']:
        idx = s['case_ids'].index('prostate70gy_0065')
        # 'd95' drops cases without a PTV70 D95 value, so it is only
        # index-aligned with 'case_ids' when nothing was filtered out.
        d95_txt = (f'{s["d95"][idx]:.2f} Gy' if len(s['d95']) == len(s['case_ids'])
                   else 'n/a (d95 list not aligned with case_ids)')
        print(f'  Seed {seed}: MAE={s["mae"][idx]:.2f} Gy, '
              f'Gamma_PTV={s["gamma_p"][idx]:.1f}%, '
              f'D95_gap={d95_txt}')
    else:
        print(f'  Seed {seed}: not in test set')

---

## 8. Cross-Experiment Comparison

This is the first v2.3 experiment with complete 3-seed results. Pilot results (v2.2) are invalid.

| Experiment | Data | Seeds | MAE (Gy) | Gamma Global | Gamma PTV | PTV70 D95 Gap | Status |
|------------|------|-------|----------|-------------|-----------|---------------|--------|
| **baseline_v23** | v2.3, 74 cases | 42, 123, 456 | **4.22 ± 0.53** | **33.8 ± 4.6%** | **80.2 ± 5.3%** | **-1.76 ± 0.69 Gy** | **Complete** |
| pilot baseline (INVALID) | v2.2, 23 cases | 42 | ~~1.43~~ | ~~14.2%~~ | n/a | ~~-20 Gy~~ | Invalid (#4) |

This table will expand as Phase 2 experiments complete. Comparisons to subsequent experiments will use paired Wilcoxon signed-rank tests on all seed × case observations (3 seeds × 7 cases = 21 paired observations per metric).
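
As a rough sketch of the planned comparison, the function below computes an exact two-sided Wilcoxon signed-rank p-value for paired observations using only the standard library. It assumes no zero differences and no tied absolute differences; in practice a library routine such as `scipy.stats.wilcoxon` would handle those cases. The function name and the five per-case MAE pairs are hypothetical.

```python
def wilcoxon_signed_rank_exact(baseline, candidate):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (w_plus, p_value). Sketch only: rejects zeros and ties.
    """
    diffs = [c - b for b, c in zip(baseline, candidate)]
    magnitudes = sorted(abs(d) for d in diffs)
    if 0 in magnitudes or len(set(magnitudes)) != len(magnitudes):
        raise ValueError('this sketch does not handle zero or tied differences')
    rank = {m: i + 1 for i, m in enumerate(magnitudes)}
    n = len(diffs)
    max_sum = n * (n + 1) // 2
    w_plus = sum(rank[abs(d)] for d in diffs if d > 0)
    # counts[s] = number of sign patterns whose positive ranks sum to s (null distribution).
    counts = [1] + [0] * max_sum
    for r in range(1, n + 1):
        for s in range(max_sum, r - 1, -1):
            counts[s] += counts[s - r]
    tail = min(w_plus, max_sum - w_plus)
    p_value = min(1.0, 2 * sum(counts[:tail + 1]) / 2 ** n)
    return w_plus, p_value


# Hypothetical per-case MAE (Gy) for the same five cases under two models.
baseline_mae = [4.1, 3.8, 5.2, 9.6, 3.3]
candidate_mae = [3.9, 3.5, 4.8, 9.1, 2.7]
w_plus, p = wilcoxon_signed_rank_exact(baseline_mae, candidate_mae)
print(f'W+ = {w_plus}, two-sided p = {p:.4f}')
```

With five pairs that all move in the same direction, the smallest achievable two-sided p-value is 0.0625; the planned 21 paired observations allow much smaller p-values.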

### Phase 2 Targets (what combined loss must improve)

| Metric | Baseline | Target | Rationale |
|--------|----------|--------|----------|
| PTV70 D95 Gap | -1.76 ± 0.69 Gy | > -0.5 Gy | Asymmetric PTV loss penalizes underdosing 3x |
| PTV-region Gamma | 80.2 ± 5.3% | > 95% | Structure-weighted + DVH losses focus on anatomy |
| MAE | 4.22 ± 0.53 Gy | ≤ 4.5 Gy | Should not regress with combined loss |

---

## 9. Conclusions, Limitations, and Next Steps

### Conclusions

1. **3-seed baseline is complete and reproducible.** Seed variability is moderate (MAE std 0.53 Gy, about 13% of the 4.22 Gy mean). Per the pre-registered decision rule, 3 seeds is sufficient (no 5-seed expansion needed).

2. **Systematic PTV underdosing is the primary finding.** D95 gap of -1.76 Gy is negative in ALL 3 seeds (range -2.46 to -0.86 Gy). This is the strongest signal for Phase 2 improvement via asymmetric PTV loss.

3. **PTV-region Gamma (80.2%) is solid but below 95% target.** Structure-weighted and DVH-aware losses should push this closer to clinical acceptance.

4. **Global Gamma (33.8%) is low but clinically irrelevant.** Driven by valid low-dose diversity. Confirmed as diagnostic-only metric.

5. **Case 0065 is an anatomy-driven outlier**: consistently worst across all seeds where it appears (MAE 9-10 Gy). Not a seed sensitivity issue.

6. **Val MAE is a poor predictor of test performance.** Val MAE ranges 3.55–6.05 Gy across seeds; test MAE is tighter (3.51–4.80 Gy). This is a small-val-set artifact (n=7).

7. **Femur L/R asymmetry persists across all seeds** (7.2 Gy mean difference). Likely a real VMAT dosimetric pattern.

### Limitations

- **Small test set (n=7 per seed)** with different test cases per seed (non-locked split). Wide per-case variability dominates confidence intervals.
- **Mixed SIB + single-Rx dataset**: 63/74 cases are single-Rx. Production must use SIB-only cohort.
- **No PTV56 in any test split**: SIB dose-painting accuracy untested.
- **Significant overfitting** (val/train ratio 8-10x): more data is the primary improvement lever.
- **Gamma subsample=4**: faster but lower resolution than publication-quality subsample=2.

### Next Steps

1. **Augmentation ablation** (#45): in progress, determines if current augmentation helps or hurts
2. **Combined loss pilot** (#57): full 5-component loss + uncertainty weighting, single seed
3. **Architecture scouts** (#53): C11 (Attention), C13 (BottleneckAttn), C15 (WiderBaseline)
4. Collect Institution B data (#2): primary lever for reducing overfitting
5. Implement locked stratified split (#38) for production experiments
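
One possible shape for the locked stratified split in step 5 is sketched below, assuming stratification on SIB versus single-Rx and a split that is computed once and then reused by every seed. The function name, case IDs, and the fixed split seed are all hypothetical; the actual design is tracked in #38.

```python
import random


def locked_stratified_split(case_ids, is_sib, n_val=7, n_test=7, seed=0):
    """Split cases once, keeping the SIB / single-Rx ratio in each subset.

    Hypothetical sketch: per-stratum counts are rounded, so subset sizes can
    be off by one for other dataset compositions.
    """
    rng = random.Random(seed)            # fixed seed: the split does not vary per run
    strata = {True: [], False: []}
    for cid in sorted(case_ids):         # sort first so input order cannot change the result
        strata[is_sib[cid]].append(cid)
    split = {'train': [], 'val': [], 'test': []}
    total = len(case_ids)
    for members in strata.values():
        rng.shuffle(members)
        k_val = round(n_val * len(members) / total)
        k_test = round(n_test * len(members) / total)
        split['val'] += members[:k_val]
        split['test'] += members[k_val:k_val + k_test]
        split['train'] += members[k_val + k_test:]
    return {name: sorted(ids) for name, ids in split.items()}


# Hypothetical IDs mirroring the dataset composition: 11 SIB + 63 single-Rx cases.
case_ids = [f'case_{i:04d}' for i in range(74)]
is_sib = {cid: i < 11 for i, cid in enumerate(case_ids)}

split = locked_stratified_split(case_ids, is_sib)
print({name: len(ids) for name, ids in split.items()})
print('SIB cases in test:', sum(is_sib[cid] for cid in split['test']))
```

Writing the returned dictionary to a JSON file and loading it in every run would keep the test set identical across seeds, so that seed variability reflects only weight initialization and augmentation order.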

---

## 10. Artifacts

### Per-Seed Artifacts

| Artifact | Seed 42 | Seed 123 | Seed 456 |
|----------|---------|----------|----------|
| Run directory | `runs/baseline_v23_seed42/` | `scripts/runs/baseline_v23_seed123/` | `scripts/runs/baseline_v23_seed456/` |
| Best checkpoint | `best-epoch=172-*.ckpt` | `best-epoch=124-*.ckpt` | `best-epoch=176-*.ckpt` |
| Training config | `training_config.json` | `training_config.json` | `training_config.json` |
| Training summary | `training_summary.json` | `training_summary.json` | `training_summary.json` |
| Test cases | `test_cases.json` | `test_cases.json` | `test_cases.json` |
| Env snapshot | `environment_snapshot.txt` | `environment_snapshot.txt` | `environment_snapshot.txt` |
| Predictions | `predictions/baseline_v23_seed42_test/` | `predictions/baseline_v23_seed123_test/` | `predictions/baseline_v23_seed456_test/` |
| Eval results | `baseline_evaluation_results.json` | `baseline_evaluation_results.json` | `baseline_evaluation_results.json` |
| Figures | `runs/baseline_v23/figures_seed42/` | `runs/baseline_v23/figures_seed123/` | `runs/baseline_v23/figures_seed456/` |

### Aggregate Artifacts

| Artifact | Path |
|----------|------|
| Cross-seed summary | `predictions/baseline_v23_cross_seed_summary.json` |
| Aggregate figures | `runs/baseline_v23/figures/` (seed 42 representative) |
| Figure generation script | `scripts/generate_baseline_v23_figures.py` |
| This notebook | `notebooks/2026-02-26_baseline_v23_aggregate.ipynb` |
| Seed 42 preliminary notebook | `notebooks/2026-02-24_baseline_v23_preliminary.ipynb` |

---

*Notebook created: 2026-02-26*  
*Status: Complete (3-seed aggregate)*