# Experiment: C11 Architecture Scout — AttentionUNet3D (MSE-only)

**Date:** 2026-03-01  
**Experiment ID:** `C11_attn_mse` (seed 42, single seed)  
**Status:** Complete (Preliminary — seed 42 only)  
**Type:** Training (architecture scout)  
**GitHub Issue:** [#53](https://github.com/wrockey/vmat-diffusion/issues/53)  

---

## 1. Overview

### 1.1 Objective

Test whether AttentionUNet3D architecture improves dose prediction versus the baseline U-Net, using MSE-only loss to isolate the architecture effect. This is scout C11 in the architecture ablation series (#53), evaluating whether attention gates at skip connections provide a meaningful benefit.

### 1.2 Hypothesis

Attention gates at skip connections will help the model focus on clinically relevant regions (PTV, OAR boundaries) by selectively gating which spatial features are passed from encoder to decoder. This should improve PTV coverage (D95 gap, PTV gamma) relative to baseline, which uses unselective skip connections.

### 1.3 Key Results

| Metric | C11 AttentionUNet | Baseline (seed 42) | Delta |
|--------|-------------------|-------------------|-------|
| MAE (Gy) | 4.57 ± 2.51 | 4.80 ± 2.45 | -0.23 (slightly better) |
| Gamma Global (%) | 29.6 ± 9.5 | 28.1 ± 12.6 | +1.6pp (slightly better) |
| Gamma PTV (%) | 81.1 ± 8.8 | 87.3 ± 10.8 | **-6.1pp (worse)** |
| D95 Gap (Gy) | -2.20 ± 0.91 | -0.83 ± 0.46 | **-1.37 (worse)** |

### 1.4 Conclusion

**AttentionUNet3D does not improve over baseline.** MAE and global gamma are marginally better, but PTV gamma is substantially worse (81.1% vs 87.3%, a -6.1pp regression) and the D95 gap worsens from -0.83 Gy to -2.20 Gy (deeper underdose). The attention mechanism does not help and may actively hurt PTV coverage by diluting PTV-focused features at skip connections. This confirms that architecture alone is not the bottleneck — loss function engineering (as demonstrated by the combined loss pilot at 96.4% PTV gamma) has a far larger impact. Training also took 20.3h vs ~12h for baseline, adding compute cost with no benefit.

---

## 2. What Changed

Compared to baseline_v23 (seed 42), this experiment replaces **BaselineUNet3D with AttentionUNet3D**. **Everything else is identical** (same loss, data, augmentation, seed, epochs, optimizer, batch size).

| Parameter | Baseline seed42 | This Experiment |
|-----------|----------------|-----------------|
| Architecture | BaselineUNet3D | **AttentionUNet3D** |
| Parameters | 23.73M | **23.93M (+0.8%)** |
| Skip connections | Standard (unselective) | **Attention-gated** |
| Loss function | MSE + neg penalty | MSE + neg penalty (identical) |
| Seed | 42 | 42 (identical split) |
| Augmentation | ON | ON (identical) |
| Optimizer | AdamW, lr=1e-4, wd=0.01 | AdamW, lr=1e-4, wd=0.01 (identical) |
| Epochs | 200 | 200 (identical) |
| Batch size | 2 | 2 (identical) |
| Patch size | 128³ | 128³ (identical) |
| All other hyperparameters | Default | Default (identical) |

**Single variable under test:** BaselineUNet3D (plain skip connections) vs AttentionUNet3D (attention-gated skip connections).

---

## 3. Reproducibility

In [None]:
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

REPRODUCIBILITY = {
    'git_commit': 'bded645',
    'python_version': '3.12.12',
    'pytorch_version': '2.10.0+cu126',
    'pytorch_lightning_version': '2.6.1',
    'cuda_version': '12.6',
    'gpu': 'NVIDIA GeForce RTX 3090',
    'random_seed': 42,
    'experiment_date': '2026-03-01',
    'platform': 'WSL2 Ubuntu 24.04 LTS',
    'training_time_hours': 20.3,
}

print('Reproducibility Information:')
for k, v in REPRODUCIBILITY.items():
    print(f'  {k}: {v}')

### Command to Reproduce

```bash
# Train (AttentionUNet3D, MSE-only loss)
python scripts/train_baseline_unet.py \
    --data_dir ~/data/processed_npz \
    --exp_name C11_attn_mse_seed42 \
    --architecture attention_unet \
    --epochs 200 --batch_size 2 --seed 42

# Inference
python scripts/inference_baseline_unet.py \
    --checkpoint runs/C11_attn_mse_seed42/checkpoints/best-epoch=128-val/mae_gy=6.399.ckpt \
    --input_dir <test_symlink_dir> \
    --output_dir predictions/C11_attn_mse_seed42_test \
    --compute_metrics --overlap 64 --gamma_subsample 4
```

Environment snapshot: `runs/C11_attn_mse_seed42/environment_snapshot.txt`

---

## 4. Dataset

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

test_cases_path = PROJECT_ROOT / 'runs' / 'C11_attn_mse_seed42' / 'test_cases.json'
with open(test_cases_path) as f:
    test_info = json.load(f)

print(f'Preprocessing version: v2.3.0')
print(f'Total cases: 74')
print(f'Split (seed={test_info["seed"]}): 60 train / 7 val / 7 test')
print(f'Test case IDs: {sorted(test_info["test_cases"])}')
print(f'\nNote: Same seed/split as baseline_v23 seed42 for direct architecture comparison.')

**Test cases (7):** prostate70gy_0005, prostate70gy_0018, prostate70gy_0024, prostate70gy_0027, prostate70gy_0056, prostate70gy_0065, prostate70gy_0079

**Data provenance:** 74 cases preprocessed with v2.3.0 pipeline (native resolution crop, B-spline dose resampling). Identical to baseline_v23. The same seed 42 split ensures direct comparability — the only variable is the architecture.

---

## 5. Model & Training Configuration

In [None]:
import json
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()

config_path = PROJECT_ROOT / 'runs' / 'C11_attn_mse_seed42' / 'training_config.json'
with open(config_path) as f:
    config = json.load(f)

print(f'Model: {config["model"]}')
print(f'Parameters: {config["model_params"]:,}')

print(f'\nHyperparameters:')
for k, v in sorted(config['hparams'].items()):
    print(f'  {k}: {v}')

summary_path = PROJECT_ROOT / 'runs' / 'C11_attn_mse_seed42' / 'training_summary.json'
with open(summary_path) as f:
    summary = json.load(f)

print(f'\nTraining Summary:')
print(f'  Duration: {summary["total_time_hours"]:.1f} hours')
print(f'  Best val MAE: {summary["best_val_mae_gy"]:.3f} Gy')
print(f'  Final epoch: {summary["final_metrics"]["epoch"]}')

### Architecture

- **Model:** AttentionUNet3D, 48 base channels (48 -> 96 -> 192 -> 384 -> 768), **23.93M parameters** (vs 23.73M baseline, +0.8%)
- **Input:** 9 channels (1 CT + 8 SDF), **Output:** 1 channel (dose)
- **Constraint conditioning:** FiLM embedding (13-dim constraint vector)
- **Patch size:** 128x128x128 voxels
- **Key difference:** Attention gates at each decoder skip connection. Each gate takes the decoder feature map (query) and encoder feature map (key/value) and produces a soft spatial mask weighting which encoder features to pass through.

### Attention Gate Mechanism

At each of the 4 skip connections, the attention gate computes:

$$\alpha_i = \sigma\left(W_\psi\left(\text{ReLU}(W_x x_i + W_g g_i + b)\right) + b_\psi\right)$$

where $x_i$ is the encoder feature map, $g_i$ is the decoder (gating) signal, and $\alpha_i \in [0,1]$ is the attention coefficient. The attended feature is $\hat{x}_i = \alpha_i \odot x_i$.

### Loss Configuration

| Component | Weight | Notes |
|-----------|--------|-------|
| MSE | 1.0 | Standard pixel-wise mean squared error |
| Negative penalty | 0.1 | Penalizes predicted dose < 0 |

MSE-only loss chosen to isolate the architecture effect from loss function effects. Identical to baseline_v23 seed42.

### Training

- **Optimizer:** AdamW, lr=1e-4, weight_decay=0.01
- **Epochs:** 200, batch_size=2
- **Best checkpoint:** epoch 128 (val MAE = 6.40 Gy)
- **Training time:** 20.3h (vs ~12h for baseline — attention adds ~70% compute overhead)
- **Augmentation:** ON (random flips + intensity jitter)

---

## 6. Results

Figures generated by `scripts/generate_C11_attn_mse_figures.py`.  
Representative case: **prostate70gy_0056** (below-median MAE = 3.18 Gy).  
Inference uses overlap=64, gamma_subsample=4.

### Per-Case Metrics

In [None]:
import json
import numpy as np
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()
eval_path = PROJECT_ROOT / 'predictions' / 'C11_attn_mse_seed42_test' / 'baseline_evaluation_results.json'

with open(eval_path) as f:
    results = json.load(f)

print(f'{"Case":<30} {"MAE (Gy)":>10} {"Gamma Gl (%)": >14} {"Gamma PTV (%)": >14} {"D95 Gap (Gy)": >13}')
print('-' * 85)

maes, gammas_g, gammas_p, d95s = [], [], [], []
for c in results['per_case_results']:
    cid = c['case_id']
    mae = c['dose_metrics']['mae_gy']
    gam_g = c['gamma']['global_3mm3pct']['gamma_pass_rate']
    gam_p = c['gamma']['ptv_region_3mm3pct']['gamma_pass_rate']
    d95 = c['dvh_metrics'].get('PTV70', {}).get('D95_error', float('nan'))
    maes.append(mae)
    gammas_g.append(gam_g)
    gammas_p.append(gam_p)
    d95s.append(d95)
    print(f'{cid:<30} {mae:>10.2f} {gam_g:>14.1f} {gam_p:>14.1f} {d95:>13.2f}')

print('-' * 85)
print(f'{"Mean +/- Std":<30} '
      f'{np.mean(maes):>10.2f}+/-{np.std(maes):.2f} '
      f'{np.mean(gammas_g):>10.1f}+/-{np.std(gammas_g):.1f} '
      f'{np.mean(gammas_p):>10.1f}+/-{np.std(gammas_p):.1f} '
      f'{np.mean(d95s):>9.2f}+/-{np.std(d95s):.2f}')

**Approximate per-case metrics (from evaluation JSON — load code above for exact values):**

| Case | MAE (Gy) | Gamma Global (%) | Gamma PTV (%) | D95 Gap (Gy) |
|------|----------|-----------------|---------------|-------------|
| prostate70gy_0005 | 4.83 | 23.4 | 77.4 | -2.04 |
| prostate70gy_0018 | 4.76 | 20.8 | 83.6 | -2.47 |
| prostate70gy_0024 | 5.57 | 16.1 | 90.3 | -1.73 |
| prostate70gy_0027 | 1.52 | 42.3 | 82.4 | -0.78 |
| prostate70gy_0056 | 3.18 | 35.0 | 76.3 | -2.53 |
| prostate70gy_0065 | 9.64 | 32.1 | 88.0 | -3.21 |
| prostate70gy_0079 | 2.66 | 38.6 | 69.4 | -2.65 |
| **Mean +/- Std** | **4.57 +/- 2.51** | **29.6 +/- 9.5** | **81.1 +/- 8.8** | **-2.20 +/- 0.91** |

**Notable:** No case exceeds 90% PTV Gamma. All D95 gaps are negative (underdose), ranging from -0.78 to -3.21 Gy. prostate70gy_0065 is again the highest MAE outlier (9.64 Gy).

### 6.1 Training Curves

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig1_training_curves.png', width=900))

**Caption:** Training loss and validation MAE vs epoch for C11 AttentionUNet3D (seed 42, 200 epochs). Best val MAE: 6.40 Gy at epoch 128.

**Key observations:**
- Best val MAE (6.40 Gy) is slightly worse than baseline (6.05 Gy at best epoch), suggesting the attention mechanism does not improve generalization under MSE-only loss
- Training appears stable with no divergence, confirming the attention gate implementation is numerically sound
- The extra 0.2M attention parameters do not accelerate convergence relative to baseline
- **Clinical implication:** The attention gates add ~70% compute overhead without reducing validation loss. Under MSE supervision, the model cannot exploit spatial selectivity effectively.

### 6.2 Dose Colorwash

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig2_dose_colorwash.png', width=900))

**Caption:** Predicted vs ground truth dose for prostate70gy_0056 (MAE = 3.18 Gy, below-median). Axial, coronal, sagittal views through PTV70 centroid.

**Key observations:**
- PTV70 region shows noticeably cooler dose (less red/orange) in prediction vs GT, consistent with the -2.53 Gy D95 underdose for this case
- Overall dose shape and conformality are preserved — the model correctly identifies the treatment region
- The underdose pattern is concentrated near the PTV boundary rather than uniformly distributed
- **Clinical implication:** The attention mechanism fails to improve PTV boundary accuracy. The predicted dose undershoots the prescription near PTV edges, which would translate to an inadequate coverage plan clinically. This pattern is worse than baseline despite adding attention gates.

### 6.3 Dose Difference Map

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig3_dose_difference.png', width=900))

**Caption:** Dose difference (predicted minus GT, Gy) for prostate70gy_0056. Blue = underdose, red = overdose.

**Key observations:**
- PTV region shows predominantly blue (underdose) — opposite to the combined loss pilot which showed red (overdose)
- The underdose is spatially concentrated at PTV boundaries rather than uniformly spread across the volume
- Peripheral low-dose regions show mixed blue/red consistent with baseline behavior
- **Clinical implication:** The attention mechanism does not help the model prioritize PTV boundary accuracy. The underdose pattern at PTV edges is characteristic of MSE optimization without explicit PTV-focused loss terms. This confirms the combined loss pilot finding that loss engineering — not architecture — is the key lever for PTV coverage.

### 6.4 DVH Comparison

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig4_dvh_comparison.png', width=800))

**Caption:** DVH curves for prostate70gy_0056. Solid = ground truth, dashed = predicted.

**Key observations:**
- PTV70 predicted DVH is shifted left (lower dose) compared to GT, consistent with the -2.53 Gy D95 underdose
- OAR DVH curves track GT at a similar level to baseline — the attention mechanism does not degrade OAR accuracy
- The PTV56 DVH is also shifted left, indicating the underdose extends to the lower-dose PTV as well
- **Clinical implication:** A clinical plan with this predicted DVH would fail PTV coverage requirements. The D95 underdose would prompt a treatment plan revision. The attention mechanism specifically fails to fix the PTV boundary coverage that MSE-only training systematically underestimates.

### 6.5 Gamma Analysis

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig5_gamma_bar_chart.png', width=900))

**Caption:** Global vs PTV-region Gamma 3%/3mm per test case (C11 AttentionUNet MSE, seed 42).

**Key observations:**
- No case exceeds the 95% PTV Gamma clinical target (best: 90.3% for prostate70gy_0024)
- Mean PTV Gamma of 81.1% is substantially below the 95% target and below baseline (87.3%)
- Global Gamma (29.6%) is marginally better than baseline (28.1%), but clinically insignificant
- prostate70gy_0079 shows the worst PTV Gamma at 69.4%
- **Clinical implication:** AttentionUNet3D fails to meet clinical spatial accuracy standards for PTV coverage. The regression vs baseline (-6.1pp PTV gamma) suggests the attention gates may suppress spatial features needed for PTV boundary accuracy. The 95% clinical target is not achievable with architecture alone under MSE-only training.

### 6.6 Per-Case Box Plots

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig6_per_case_boxplots.png', width=900))

**Caption:** Metric distributions across 7 test cases (C11 AttentionUNet MSE, seed 42).

**Key observations:**
- D95 gap distribution is entirely negative (all 7 cases underdose), ranging from -0.78 to -3.21 Gy
- MAE distribution has high variance (2.51 Gy std), driven by the prostate70gy_0065 outlier (9.64 Gy)
- PTV Gamma distribution spans 69.4% to 90.3% — wide spread with no outlier in the favorable direction
- prostate70gy_0027 is again the easiest case (1.52 Gy MAE, 82.4% PTV gamma) — anatomy/plan complexity drives case difficulty more than the model
- **Clinical implication:** The wide variance and systematic underdose pattern across all 7 cases indicates this is a bias, not noise. The attention gate architecture does not correct the fundamental underdose tendency of MSE-only training.

### 6.7 QUANTEC Compliance

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig7_quantec_compliance.png', width=900))

**Caption:** QUANTEC constraint compliance heatmap (C11 AttentionUNet MSE, seed 42).

**Key observations:**
- Volume-based OAR constraints pass universally, consistent with baseline — the model correctly preserves OAR sparing
- PTV D95 constraints fail (predicted dose is below threshold) due to underdose, worse than baseline
- The compliance pattern is similar to the no-augmentation ablation — MSE-only training without PTV-focused loss fails PTV coverage regardless of architecture
- **Clinical implication:** AttentionUNet3D meets OAR constraints but fails PTV coverage. This is the same failure mode as baseline, confirming the attention gates do not address the root cause (lack of PTV-targeted loss).

### 6.8 Seed Variability

In [None]:
from IPython.display import Image, display
display(Image(filename='../runs/C11_attn_mse/figures/fig8_seed_variability.png', width=900))

**Caption:** Seed variability analysis (C11 AttentionUNet MSE). Note: this is a single-seed pilot (seed 42 only). The figure shows per-case metric distributions rather than cross-seed comparisons.

**Key observations:**
- Single-seed pilot — cross-seed variability cannot be assessed
- Per-case distributions show consistent underdose across all cases, suggesting the finding is systematic
- The combination of high MAE variance (2.51 Gy) and consistent D95 underdose suggests the issue is loss-driven, not seed-driven
- **Clinical implication:** Given the negative result (-6.1pp PTV gamma vs baseline), full 3-seed confirmation is not prioritized. The effect size is large enough that the direction of the finding is clear: attention gates do not help.

---

## 7. Statistical Analysis

This is a **single-seed pilot** (seed 42 only). Formal cross-seed statistics are not available. The comparison below is a **paired analysis** on the same 7 test cases (same seed, same split) between this experiment and the baseline.

In [None]:
import json
import numpy as np
from pathlib import Path

PROJECT_ROOT = Path('..').resolve()
pred_base = PROJECT_ROOT / 'predictions'

def load_metrics(eval_path):
    with open(eval_path) as f:
        d = json.load(f)
    maes, gammas_g, gammas_p, d95 = [], [], [], []
    for c in d['per_case_results']:
        maes.append(c['dose_metrics']['mae_gy'])
        gammas_g.append(c['gamma']['global_3mm3pct']['gamma_pass_rate'])
        gammas_p.append(c['gamma']['ptv_region_3mm3pct']['gamma_pass_rate'])
        ptv70 = c['dvh_metrics'].get('PTV70', {})
        if 'D95_error' in ptv70:
            d95.append(ptv70['D95_error'])
    return {'mae': maes, 'gamma_g': gammas_g, 'gamma_p': gammas_p, 'd95': d95,
            'case_ids': [c['case_id'] for c in d['per_case_results']]}

c11 = load_metrics(pred_base / 'C11_attn_mse_seed42_test/baseline_evaluation_results.json')
baseline = load_metrics(pred_base / 'baseline_v23_seed42_test/baseline_evaluation_results.json')

print('Head-to-Head Comparison: C11 AttentionUNet vs Baseline (same 7 test cases, same seed 42 split)')
print('=' * 90)
for metric, key, unit in [('MAE', 'mae', 'Gy'), ('Gamma Global', 'gamma_g', '%'),
                            ('Gamma PTV', 'gamma_p', '%'), ('D95 Gap', 'd95', 'Gy')]:
    c11_m, c11_s = np.mean(c11[key]), np.std(c11[key])
    bl_m, bl_s = np.mean(baseline[key]), np.std(baseline[key])
    diff = c11_m - bl_m
    sign = '+' if diff > 0 else ''
    print(f'  {metric:<18} C11: {c11_m:6.2f} +/- {c11_s:5.2f} {unit}  '
          f'Baseline: {bl_m:6.2f} +/- {bl_s:5.2f} {unit}  Diff: {sign}{diff:.2f}')

# Per-case paired differences for PTV Gamma
print(f'\nPer-Case PTV Gamma Differences (C11 - Baseline):')
diffs_gamma = []
for i, cid in enumerate(c11['case_ids']):
    j = baseline['case_ids'].index(cid)
    d = c11['gamma_p'][i] - baseline['gamma_p'][j]
    diffs_gamma.append(d)
    sign = '+' if d > 0 else ''
    print(f'  {cid}: {sign}{d:.1f}pp')
print(f'  Mean diff: {np.mean(diffs_gamma):+.1f}pp (positive = C11 is better)')
print(f'  Cases where C11 is better: {sum(1 for d in diffs_gamma if d > 0)}/7')

# Per-case paired differences for D95
print(f'\nPer-Case D95 Gap Differences (C11 - Baseline):')
diffs_d95 = []
for i, cid in enumerate(c11['case_ids']):
    j = baseline['case_ids'].index(cid)
    d = c11['d95'][i] - baseline['d95'][j]
    diffs_d95.append(d)
    sign = '+' if d > 0 else ''
    print(f'  {cid}: {sign}{d:.2f} Gy')
print(f'  Mean diff: {np.mean(diffs_d95):+.2f} Gy (negative = C11 underdoses more)')
print(f'  Cases where C11 is better (less underdose): {sum(1 for d in diffs_d95 if d > 0)}/7')

print(f'\nNote: Pilot status (1 seed). Architecture scout is negative result — full 3-seed '
      f'confirmation not planned.')

### Statistical Summary

| Metric | C11 AttentionUNet | Baseline (seed42) | Delta | Direction |
|--------|-------------------|-------------------|-------|-----------|
| MAE (Gy) | 4.57 +/- 2.51 | 4.80 +/- 2.45 | -0.23 | Slightly better |
| Gamma global (%) | 29.6 +/- 9.5 | 28.1 +/- 12.6 | +1.6pp | Slightly better |
| Gamma PTV (%) | 81.1 +/- 8.8 | 87.3 +/- 10.8 | **-6.1pp** | **Worse** |
| D95 gap (Gy) | -2.20 +/- 0.91 | -0.83 +/- 0.46 | **-1.37** | **Worse** |

**Interpretation:** The C11 architecture scout is a **negative result**. MAE and global gamma show minor, clinically insignificant improvements (+/- within noise). The two primary clinical metrics — PTV gamma and D95 gap — both worsen substantially. PTV gamma regresses by 6.1pp (81.1% vs 87.3%), and D95 underdose deepens by 1.37 Gy (-2.20 vs -0.83 Gy). All 7 test cases show deeper underdose with the attention architecture.

This is a single-seed pilot. Given the magnitude of the negative effect, full 3-seed confirmation is not warranted — the direction is unambiguous. The architecture scout series (C11/C13/C15) is helping confirm that **architecture choice is not the primary bottleneck** for clinical metric improvement.

---

## 8. Cross-Experiment Comparison

| Experiment | MAE (Gy) | Gamma Global (%) | Gamma PTV (%) | D95 Gap (Gy) | Status |
|------------|----------|-----------------|---------------|-------------|--------|
| Baseline 3-seed aggregate | 4.22 +/- 0.53 | 33.8 +/- 4.6 | 80.2 +/- 5.3 | -1.76 +/- 0.69 | Complete |
| Baseline seed42 | 4.80 +/- 2.45 | 28.1 +/- 12.6 | 87.3 +/- 10.8 | -0.83 +/- 0.46 | Complete |
| No augmentation (seed42) | 5.04 +/- 2.92 | 27.4 +/- 9.8 | 83.2 +/- 9.8 | -1.89 +/- 1.01 | Complete |
| Combined loss pilot (seed42) | 4.54 +/- 1.84 | 30.8 +/- 12.4 | **96.4 +/- 5.4** | +1.37 +/- 0.57 | Preliminary |
| **C11 AttentionUNet MSE (seed42)** | **4.57 +/- 2.51** | **29.6 +/- 9.5** | **81.1 +/- 8.8** | **-2.20 +/- 0.91** | **Preliminary** |
| Phase 2 target | < 3.0 | -- | > 95% | >= -0.5 | -- |

### Key Takeaways

1. **Architecture is not the bottleneck.** C11 (AttentionUNet) and baseline (plain U-Net) produce near-identical MAE (4.57 vs 4.80 Gy), but the attention architecture actually *worsens* the two primary clinical metrics. Architecture changes without loss changes cannot achieve the 95% PTV Gamma target.

2. **Loss function engineering dominates.** The combined loss pilot achieves 96.4% PTV Gamma (+15.3pp over C11) with the *same* baseline architecture and only 0.03 Gy better MAE. The difference between 81.1% and 96.4% PTV Gamma is entirely attributable to the loss function, not architecture.

3. **C11 is worse than baseline on clinical metrics.** PTV gamma regresses by 6.1pp and D95 underdose deepens by 1.37 Gy. The attention mechanism appears to actively hurt PTV coverage. This is counterintuitive — the attention hypothesis was that it would help focus on PTV boundaries, but the effect is opposite.

4. **MSE-only training systematically underdoses.** All MSE-only experiments (baseline, no-aug, C11) show negative D95 gaps. The combined loss is the only experiment to flip the sign, confirming that explicit PTV-targeted loss terms are required.

5. **Compute overhead not justified.** AttentionUNet required 20.3h vs ~12h for baseline (+70% training time) with no clinical benefit. For architecture exploration, the combined loss applied to BaselineUNet3D is a more efficient path.

---

## 9. Conclusions, Limitations, and Next Steps

### What Worked

1. **Stable training.** AttentionUNet3D trains without divergence. The attention gate implementation is numerically stable and compatible with the existing training infrastructure.

2. **MAE marginally improved** (-0.23 Gy vs baseline). This is within noise given the test set size (n=7) but confirms the attention mechanism does not harm global dose accuracy.

3. **OAR sparing maintained.** QUANTEC OAR compliance is equivalent to baseline. The attention gates do not degrade OAR dose accuracy.

4. **Architecture scout validated the experimental framework.** The C11/C13/C15 scout series is successfully isolating architecture from loss effects, enabling clear attribution.

### What Didn't Work

1. **PTV gamma regressed by 6.1pp** (81.1% vs 87.3% for baseline). The attention mechanism — intended to focus on clinically relevant regions — appears to suppress the encoder features needed for accurate PTV boundary prediction.

2. **D95 underdose deepened by 1.37 Gy** (-2.20 vs -0.83 Gy). All 7 test cases show worse underdose. This is a systematic bias, not random variation.

3. **Training time increased 70%** (20.3h vs ~12h). The attention gate parameters add significant forward-pass overhead without proportional benefit.

4. **Best val MAE slightly worse** (6.40 Gy vs 6.05 Gy for baseline at best epoch). The attention mechanism does not improve validation loss under MSE supervision.

### Mechanistic Hypothesis

The attention gate learns to suppress encoder features based on the decoder query signal. Under MSE-only training, the loss signal does not explicitly reward PTV boundary accuracy — it rewards global voxel accuracy. The attention gate may learn to suppress high-frequency PTV boundary features (which are harder to predict) in favor of the smooth low-dose background (which dominates MSE by volume). This would explain why PTV metrics worsen while global MAE improves marginally.

**Counter-hypothesis:** With a PTV-focused loss (e.g., the combined loss), attention gates might actually help — they could learn to focus on the PTV when the loss signal explicitly rewards PTV accuracy. Testing AttentionUNet3D with the combined loss is a potential future experiment.

### Limitations

- **Single seed (42 only)** — results should be interpreted as directional evidence, not statistically confirmed. However, the effect size (-6.1pp PTV gamma) is large enough to be confidently negative.
- **Small test set (n=7)** — cannot compute significance statistics.
- **Cannot separate attention gate effects from initialization** — it's possible the additional parameters require a different learning rate schedule.
- **Single architecture variant** — only standard Okta-style attention gates tested. Alternative attention formulations (e.g., CBAM, self-attention, multi-head) are not tested and could behave differently.

### Next Steps

1. **Continue architecture scout series:** C13 (BottleneckAttn) and C15 (Wider Baseline) scouts should run to complete the planned comparison. Architecture choices do not appear to matter much, but completing the planned experiments maintains scientific rigor.

2. **Focus Phase 2 on loss engineering:** The combined loss pilot demonstrated +15.3pp PTV gamma improvement. Tuning the asymmetric penalty weight (from 3:1 to 2:1) to correct D95 overdose is the highest-priority next step.

3. **Do not run 3-seed AttentionUNet confirmation:** The negative result is clear from seed 42 alone. Running 3 seeds on a clearly inferior architecture would consume ~60h GPU time without scientific value.

4. **Consider AttentionUNet + combined loss** as a future experiment only if the architecture scout series suggests combined benefit. Current evidence does not support prioritizing this.

---

## 10. Artifacts

| Artifact | Path |
|----------|------|
| Run directory | `runs/C11_attn_mse_seed42/` |
| Best checkpoint | `runs/C11_attn_mse_seed42/checkpoints/best-epoch=128-val/mae_gy=6.399.ckpt` |
| Training config | `runs/C11_attn_mse_seed42/training_config.json` |
| Training summary | `runs/C11_attn_mse_seed42/training_summary.json` |
| Test cases | `runs/C11_attn_mse_seed42/test_cases.json` |
| Predictions | `predictions/C11_attn_mse_seed42_test/` |
| Evaluation JSON | `predictions/C11_attn_mse_seed42_test/baseline_evaluation_results.json` |
| Figures (PNG + PDF) | `runs/C11_attn_mse/figures/` (8 figures, 16 files) |
| Figure script | `scripts/generate_C11_attn_mse_figures.py` |
| Environment snapshot | `runs/C11_attn_mse_seed42/environment_snapshot.txt` |
| This notebook | `notebooks/2026-03-01_C11_attn_mse.ipynb` |

---

*Notebook created: 2026-03-01*  
*Status: Complete (Preliminary — seed 42 only)*