# Strategic Assessment: VMAT Dose Prediction

**Date:** 2026-01-20  
**Status:** Complete  

---

## Executive Summary

This notebook synthesizes findings from our baseline U-Net and DDPM experiments to assess:
1. **Scientific value** of the dose prediction approach
2. **Novelty** of our methods
3. **Predictive utility** and clinical relevance
4. **Recommended path forward**

### Key Conclusions

| Finding | Implication |
|---------|-------------|
| DDPM matches but doesn't beat baseline on validation MAE (3.78 vs 3.73 Gy); DDPM test metrics are still TBD | Diffusion adds complexity without benefit for this task |
| MAE is good (1.43 Gy), Gamma is poor (14.2%) | Model captures magnitude but misses gradients |
| "More steps = worse" in DDPM | Fundamental mismatch: dose prediction is deterministic |
| SDF + FiLM conditioning approach is novel | Publishable regardless of absolute performance |

**Bottom line:** The baseline approach has scientific merit and clinical utility potential. Focus on improving gradient capture (perceptual/adversarial loss) rather than continuing DDPM work.

---

## 1. Model Comparison Summary

### 1.1 Results Overview

| Model | Val MAE (Gy) | Test MAE (Gy) | Test Gamma (3%/3mm) | Inference Time |
|-------|--------------|---------------|---------------------|----------------|
| **Baseline U-Net** | 3.73 | **1.43 ± 0.24** | 14.2 ± 5.7% | ~30 sec |
| DDPM (optimized) | 3.78 | TBD | TBD | ~6 min (50 steps) |
| DDPM (as trained) | 12.19 | - | - | - |

### 1.2 What We Learned

**DDPM Investigation Results:**
- Phase 1 optimization found that **50 DDIM steps** is optimal (not 100+)
- **Counter-intuitive finding:** More steps = worse results
- **Ensemble averaging doesn't help:** Sample variability is near-zero (~0.02 std)
- Root cause of the poor as-trained result (12.19 Gy val MAE): validation during training sampled with too many steps
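
The ensemble-averaging observation can be checked with a per-voxel spread across repeated samples. The following is a minimal sketch, not the project's evaluation code: `ensemble_summary` is a hypothetical helper, the sample stack is synthetic, and the 0.02 perturbation is only a stand-in for the reported variability (the source does not state its units).

```python
import numpy as np

def ensemble_summary(samples: np.ndarray) -> dict:
    """Summarise a stack of dose samples shaped (n_samples, *volume_shape)."""
    mean_dose = samples.mean(axis=0)           # ensemble-averaged prediction
    voxel_std = samples.std(axis=0, ddof=1)    # per-voxel spread across samples
    # Largest deviation of any single sample from the ensemble mean.
    max_shift = np.abs(samples - mean_dose).max()
    return {
        "mean_voxel_std": float(voxel_std.mean()),
        "max_shift": float(max_shift),
    }

# Synthetic stand-in: one base volume plus tiny per-sample perturbations.
rng = np.random.default_rng(0)
base = rng.uniform(0.0, 70.0, size=(8, 8, 8))
samples = base + rng.normal(0.0, 0.02, size=(5, 8, 8, 8))

summary = ensemble_summary(samples)
print(summary)
```

When `max_shift` is tiny compared with the dose error of interest, every sample is effectively the same prediction, so averaging them cannot move the result.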

**Why DDPM Doesn't Help Here:**

| Red Flag | Interpretation |
|----------|----------------|
| More steps = worse | Model denoises away the dose signal (structural issue) |
| Near-zero sample variability | Outputs barely depend on the initial noise, so the model acts as a deterministic regressor rather than a generative sampler |
| 50/1000 steps optimal | Closer to few-step prediction than to long iterative refinement |
| Matches baseline on validation MAE (3.78 vs 3.73 Gy) | Added complexity provides no measurable accuracy benefit |

**Fundamental Mismatch:**
- Dose prediction is **deterministic**: one correct answer per patient anatomy + prescription
- Diffusion models excel at **multi-modal generation**: many valid outputs (faces, art, etc.)
- We're forcing a generative framework onto a regression problem

---

## 2. Scientific Value Assessment

### 2.1 Is This Scientifically Interesting?

**Yes, for several reasons:**

#### Clinical Utility
Treatment planning currently takes **hours of manual iteration** by dosimetrists. A model that predicts dose in seconds could:
- Provide **instant quality assurance** (compare predicted vs planned)
- Generate **starting points** for optimization algorithms
- Enable **automated planning workflows**
- Help with **plan review** ("does this look like what we'd expect?")
- **Reduce planning time** from hours to minutes

#### What's Novel About Our Approach

| Aspect | Standard Approaches | Our Project |
|--------|---------------------|-------------|
| Structure representation | Binary masks | **Signed Distance Fields (SDFs)** - provide gradient information |
| Constraint handling | Post-hoc or ignored | **FiLM conditioning** - explicit constraint encoding |
| Disease site | Often IMRT/head-neck | **Prostate VMAT with SIB** - specific clinical scenario |
| Deliverability | Usually ignored | **MLC prediction planned for Phase 2** - end-to-end pipeline |
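
To make the SDF row concrete, here is a minimal NumPy-only sketch of turning a binary mask into a signed distance field. It is an illustration under assumptions, not the project's preprocessing: the sign convention (negative inside, positive outside), the function name, and the brute-force pairwise search are all choices made here. For real 3D volumes a Euclidean distance transform such as `scipy.ndimage.distance_transform_edt` would be the usual tool.

```python
import numpy as np

def signed_distance_field(mask: np.ndarray, spacing_mm) -> np.ndarray:
    """Brute-force SDF in mm: negative inside the structure, positive outside."""
    mask = mask.astype(bool)
    if mask.all() or not mask.any():
        raise ValueError("mask must contain both inside and outside voxels")
    spacing = np.asarray(spacing_mm, dtype=float)
    # Voxel-centre coordinates in mm, in the same C order as mask.ravel().
    coords = np.argwhere(np.ones(mask.shape, dtype=bool)) * spacing
    flat = mask.ravel()
    # Pairwise distances between all voxel centres (O(N^2): small grids only).
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    to_inside = dists[:, flat].min(axis=1)     # distance to nearest inside voxel
    to_outside = dists[:, ~flat].min(axis=1)   # distance to nearest outside voxel
    sdf = np.where(flat, -to_outside, to_inside)
    return sdf.reshape(mask.shape)

# Toy 2D structure: a 3x3 square in a 7x7 grid with 1 mm pixels.
mask = np.zeros((7, 7), dtype=bool)
mask[2:5, 2:5] = True
sdf = signed_distance_field(mask, spacing_mm=(1.0, 1.0))
print(sdf[3])  # central row: values change smoothly across the boundary
```

The binary mask is identical (0) at one and two pixels outside the square, whereas the SDF distinguishes them, which is the "gradient information" the table refers to.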

#### The Diffusion "Negative Result" is Publishable
Showing that diffusion **doesn't help** for deterministic dose prediction is valuable:
- Saves other researchers from going down this path
- Provides insight into when diffusion is/isn't appropriate
- "More steps = worse" finding is genuinely interesting
- Contributes to understanding of diffusion model applicability

### 2.2 Literature Context

**Dose prediction is an active research area:**

| Publication Type | Status |
|------------------|--------|
| U-Net for dose prediction | Established (multiple papers) |
| 3D convolutions for RT | Established |
| Diffusion for dose prediction | Emerging (few papers, mostly positive claims) |
| SDF representation for structures | **Novel** (typically binary masks used) |
| FiLM conditioning for constraints | **Novel** (constraints usually not encoded) |
| Rigorous diffusion vs baseline comparison | **Valuable** (most papers don't compare fairly) |

**Our contribution could be:**
1. First rigorous comparison showing DDPM doesn't outperform baseline for dose prediction
2. Novel SDF representation for anatomical structures
3. Explicit constraint conditioning via FiLM
4. Analysis of why diffusion fails for this deterministic task
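
For readers unfamiliar with FiLM, the core operation is a per-channel scale and shift computed from a condition vector. The sketch below shows that operation in plain NumPy. It is illustrative only: the class, the three-element constraint vector, and its values are hypothetical, and the actual model would use learned layers in its training framework rather than random weights.

```python
import numpy as np

class FiLM:
    """Feature-wise linear modulation: out = gamma(c) * features + beta(c)."""

    def __init__(self, cond_dim: int, n_channels: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.n_channels = n_channels
        # Single linear map from the condition vector to [gamma | beta].
        self.weight = rng.normal(0.0, 0.1, size=(cond_dim, 2 * n_channels))
        # Bias chosen so a zero condition vector gives gamma=1, beta=0 (identity).
        self.bias = np.concatenate([np.ones(n_channels), np.zeros(n_channels)])

    def params(self, cond):
        out = np.asarray(cond, dtype=float) @ self.weight + self.bias
        return out[: self.n_channels], out[self.n_channels:]

    def __call__(self, features, cond):
        gamma, beta = self.params(cond)
        # Broadcast per-channel parameters over the spatial axes.
        shape = (self.n_channels,) + (1,) * (features.ndim - 1)
        return gamma.reshape(shape) * features + beta.reshape(shape)

film = FiLM(cond_dim=3, n_channels=4)
features = np.random.default_rng(1).normal(size=(4, 2, 2, 2))  # (C, D, H, W)
constraints = np.array([1.0, 0.8, 0.5])  # hypothetical normalised constraint values
modulated = film(features, constraints)
```

Because the same feature map is rescaled differently for different constraint vectors, the constraints influence the prediction directly instead of being applied post hoc.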

---

## 3. Predictive Utility Analysis

### 3.1 Current Performance

| Metric | Achieved | Target | Assessment |
|--------|----------|--------|------------|
| MAE | 1.43 Gy (test) | < 3 Gy | **Excellent** - well below target |
| Gamma (3%/3mm) | 14.2% | > 95% | **Poor** - far below clinical threshold |

### 3.2 The MAE vs Gamma Disconnect

**Why is MAE good but Gamma poor?**

The model predicts the right **magnitude** of dose but misses the **sharp gradients** needed for clinical acceptance:

| What Model Does Well | What Model Misses |
|---------------------|-------------------|
| Overall dose level | Sharp dose falloff at PTV edges |
| Average OAR doses | Steep gradients between structures |
| Spatial distribution shape | High-frequency dose detail |
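
A toy 1D example shows the mechanism: blurring a sharp dose edge keeps MAE modest while gamma points near the edge fail. The sketch below assumes global dose normalisation, a 10% low-dose cutoff, and an exhaustive search over a 1 mm grid; the function name and the synthetic profile are illustrative, and it is not meant to reproduce the reported 14.2%.

```python
import numpy as np

def gamma_pass_rate_1d(ref, evl, spacing_mm, dose_pct=3.0, dta_mm=3.0, cutoff_pct=10.0):
    """Global gamma pass rate (%) for two 1D dose profiles on the same grid."""
    ref = np.asarray(ref, dtype=float)
    evl = np.asarray(evl, dtype=float)
    x = np.arange(ref.size) * spacing_mm
    dose_tol = dose_pct / 100.0 * ref.max()          # global normalisation
    # Rows index reference points, columns index evaluated points.
    dist_term = (x[:, None] - x[None, :]) / dta_mm
    dose_term = (evl[None, :] - ref[:, None]) / dose_tol
    gamma = np.sqrt(dist_term ** 2 + dose_term ** 2).min(axis=1)
    keep = ref >= cutoff_pct / 100.0 * ref.max()     # ignore very low dose
    return 100.0 * float((gamma[keep] <= 1.0).mean())

# Synthetic profile: 70 Gy plateau on a 20 Gy background, 1 mm spacing.
ref = np.full(200, 20.0)
ref[80:120] = 70.0
# "Blurred" prediction: 15 mm moving average with edge padding.
kernel = np.ones(15) / 15.0
blurred = np.convolve(np.pad(ref, 7, mode="edge"), kernel, mode="valid")

mae = float(np.abs(blurred - ref).mean())
rate_same = gamma_pass_rate_1d(ref, ref, spacing_mm=1.0)
rate_blur = gamma_pass_rate_1d(ref, blurred, spacing_mm=1.0)
print(f"MAE {mae:.2f} Gy, gamma {rate_blur:.1f}% (identical profile: {rate_same:.1f}%)")
```

All failures in this toy case sit at the two edges, where no evaluated point within the distance-to-agreement radius has a dose close enough to the reference.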

**DVH Analysis (from test evaluation):**
- **PTV70 D95:** Predicted ~50 Gy vs Target 70 Gy (**underdosed by ~20 Gy**)
- **OAR constraints:** All pass (model is conservative)

**Interpretation:** The model learns a "blurred" dose pattern - safe for OARs but inadequate for tumor coverage. This is a common failure mode for MSE-trained regression models.
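
D95 is the dose that at least 95% of the structure receives, which for equal-volume voxels is the 5th percentile of the voxel doses. The sketch below assumes uniform voxel volumes; the helper name and the synthetic PTV are illustrative and are not taken from the test evaluation.

```python
import numpy as np

def dose_at_volume(dose: np.ndarray, mask: np.ndarray, volume_pct: float) -> float:
    """D_x in Gy: dose received by at least x% of the structure's voxels."""
    voxels = dose[mask.astype(bool)]
    return float(np.percentile(voxels, 100.0 - volume_pct))

# Synthetic PTV: 90% of voxels at 70 Gy, a 10% rim blurred down to 50 Gy.
dose = np.full(1000, 70.0)
dose[:100] = 50.0
ptv_mask = np.ones(1000, dtype=bool)

d95 = dose_at_volume(dose, ptv_mask, 95.0)
mean_dose = float(dose[ptv_mask].mean())
print(f"mean {mean_dose:.1f} Gy, D95 {d95:.1f} Gy")
```

In this synthetic case the mean PTV dose stays close to prescription while D95 collapses to the rim value, which is how a small average error can coexist with a large coverage deficit.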

### 3.3 What This Means for Utility

| Use Case | Current Viability | Path to Viability |
|----------|-------------------|-------------------|
| Quality assurance (flag outliers) | **Possible** | Already works for gross errors |
| Plan comparison | **Possible** | Relative comparisons work |
| Optimization starting point | **Limited** | Needs better gradient capture |
| Automated planning | **No** | Requires Gamma > 95% |
| Clinical deployment | **No** | Requires extensive validation |

---

## 4. Data Considerations

### 4.1 Current Dataset

| Metric | Value |
|--------|-------|
| Total cases | 24 (23 usable) |
| Train/Val/Test split | 19/2/2 |
| Disease site | Prostate with SIB |
| Data version | v2.2.0 (SDF fix) |

### 4.2 What 24 Cases CAN Tell Us

- **Relative model comparison:** Both models trained on identical data/splits, so baseline vs DDPM comparison is valid
- **Fundamental architectural issues:** Extreme volatility isn't just small-sample noise
- **Workflow validation:** Training completes, checkpoints save, metrics log correctly
- **Loss vs metric disconnect:** If diffusion loss ↓ but dose quality doesn't improve, more data won't fix that

### 4.3 What 24 Cases CAN'T Tell Us

- **Absolute performance:** Both models will likely improve with more data
- **Publication-quality claims:** The current test set has only 2 cases; a larger one (n≥50) is needed for statistical significance
- **Generalization:** Limited diversity in patient anatomy

### 4.4 Will More Data (n=100+) Change Conclusions?

| Aspect | Prediction |
|--------|------------|
| Absolute MAE | Both will improve (~2-3 Gy → ~1.5-2 Gy) |
| DDPM vs Baseline gap | **Unlikely to change** - structural mismatch remains |
| Gamma pass rate | May improve with more diverse training examples |
| "More steps = worse" | Won't change - this is architectural, not data-related |

---

## 5. Recommended Path Forward

### 5.1 Priority Actions

| Priority | Action | Rationale |
|----------|--------|----------|
| **1 (Highest)** | Add perceptual/adversarial loss to baseline | Captures high-frequency gradients missing from MSE |
| **2** | Try Flow Matching | Simpler than diffusion, better for regression-style tasks |
| **3** | Collect 100+ cases | Improves any approach, enables publication |
| **4** | Structure-weighted loss | Penalize PTV/OAR boundary errors more |
| **5** | DVH loss term | Directly optimize clinical metrics |
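
As one possible shape for the structure-weighted loss in priority 4, voxels within a band around a structure boundary can be up-weighted using the SDF that is already an input. This is a sketch under assumptions: `band_mm`, `boundary_weight`, and the function name are hypothetical, the weights are untuned, and a training version would be written in the model's framework rather than NumPy.

```python
import numpy as np

def boundary_weighted_mae(pred, target, sdf_mm, band_mm=5.0, boundary_weight=4.0):
    """MAE with extra weight on voxels within band_mm of a structure boundary."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    # |SDF| is the distance to the boundary, regardless of inside or outside.
    weights = np.where(np.abs(sdf_mm) <= band_mm, boundary_weight, 1.0)
    return float((weights * np.abs(pred - target)).sum() / weights.sum())

# Toy 1D case: a 1 Gy error confined to the boundary band.
sdf = np.arange(-10.0, 11.0)                  # distance to boundary in mm
target = np.zeros_like(sdf)
pred = np.where(np.abs(sdf) <= 5.0, 1.0, 0.0)

plain = float(np.abs(pred - target).mean())
weighted = boundary_weighted_mae(pred, target, sdf)
print(f"plain MAE {plain:.3f} Gy, boundary-weighted {weighted:.3f} Gy")
```

Normalising by the weight sum keeps the loss on the same scale as plain MAE, so a spatially uniform error scores identically under both.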

### 5.2 What NOT to Do

| Avoid | Reason |
|-------|--------|
| Continue DDPM hyperparameter tuning | Structural issues won't be fixed by tuning |
| Wait for n=100 hoping DDPM improves | DDPM vs baseline gap won't change |
| Proceed to Phase 2/3 of DDPM optimization | Diminishing returns |

### 5.3 Success Criteria for Next Phase

| Milestone | Target |
|-----------|--------|
| Gamma pass rate | > 80% (intermediate), > 95% (clinical) |
| PTV D95 error | < 5 Gy |
| OAR constraints | Continue to pass |
| Inference time | < 1 minute |

---

## 6. Publication Potential

### 6.1 Possible Paper Angles

| Angle | Strength | Requirement |
|-------|----------|-------------|
| "Diffusion doesn't help for dose prediction" | Novel negative result | Rigorous comparison (validation done; DDPM test metrics still TBD) |
| "SDF representation improves dose prediction" | Novel method | Ablation study (planned) |
| "Constraint-conditioned dose prediction via FiLM" | Novel conditioning | Architecture description |
| "Perceptual loss for clinical dose quality" | Practical improvement | Needs implementation |

### 6.2 What's Needed for Publication

- [ ] Larger test set (n≥50) for statistical significance
- [ ] Ablation studies (SDF vs binary masks, with/without FiLM)
- [ ] Comparison with published methods
- [ ] Clinical expert review of sample predictions
- [ ] DVH analysis across all structures
- [ ] Gamma analysis at multiple thresholds

---

## 7. References

### Key Experiment Files

| Resource | Location |
|----------|----------|
| Baseline training | `notebooks/2026-01-19_baseline_unet_experiment.ipynb` |
| Baseline test eval | `notebooks/2026-01-19_baseline_unet_test_evaluation.ipynb` |
| Phase 1 results | `experiments/phase1_sampling/phase1_summary.json` |
| DDPM optimization plan | `docs/DDPM_OPTIMIZATION_PLAN.md` |
| Experiment index | `notebooks/EXPERIMENTS_INDEX.md` |

### Checkpoints

| Model | Checkpoint |
|-------|------------|
| Baseline (best) | `runs/baseline_unet_run1/checkpoints/best-epoch=012-val/mae_gy=3.735.ckpt` |
| DDPM (best) | `runs/vmat_dose_ddpm/checkpoints/best-epoch=015-val/mae_gy=12.19.ckpt` |

---

*Notebook created: 2026-01-20*  
*Analysis synthesizes findings from baseline and DDPM experiments*