# Gamma Metric Hypothesis Analysis

**Date:** 2026-01-23  
**Experiment ID:** gamma_metric_analysis  
**Status:** Complete

---

## 1. The Hypothesis

### Problem Statement
Our best model achieves only **31.2% Gamma pass rate** (3%/3mm), far below the **95% clinical target**. Before pursuing more complex solutions (more data, new architectures), we need to ask:

**Is overall Gamma actually the right metric for evaluating dose prediction quality?**

### The Insight

Gamma measures similarity to ONE specific reference plan. But consider:

1. **Clinical Reality:** As long as PTV objectives and OAR constraints are met, clinicians are satisfied. The exact dose distribution in "no-man's land" (between PTVs and OARs) doesn't matter clinically.

2. **Physical Reality:** Multiple dose distributions could be physically deliverable for the same patient. Each training case may have a different (but valid) low-dose pattern.

3. **Model Behavior:** If training cases have different low-dose patterns, our model learns to predict the **average** of these patterns - which may be:
   - Clinically acceptable (DVH constraints still pass)
   - But physically implausible (blurred average of valid solutions)
   - And low Gamma (doesn't match any specific reference)

### Hypothesis

**If our predictions pass clinical DVH constraints and achieve high Gamma in PTV regions, then overall Gamma is a misleading metric that penalizes valid variations in clinically irrelevant regions.**

### Implications

| If True | Then |
|---------|------|
| Model is already clinically useful | We have a metric problem, not a model problem |
| Low-dose variation is expected | DDPM/generative models may be more appropriate |
| 95% overall Gamma may be unrealistic | Need to define clinically relevant metrics |

---

## 2. What Actually Matters Clinically

### Prostate VMAT Clinical Constraints

| Structure | Constraint | Clinical Importance |
|-----------|------------|--------------------|
| **PTV70** | D95 ≥ 66.5 Gy (95% of Rx) | **CRITICAL** - tumor control |
| **PTV56** | D95 ≥ 53.2 Gy (95% of Rx) | **CRITICAL** - elective nodes |
| **Rectum** | V70 < 15% | **HIGH** - late toxicity |
| **Rectum** | V65 < 25% | **HIGH** - late toxicity |
| **Bladder** | V70 < 25% | **MODERATE** - toxicity |
| **Femurs** | D50 < 50 Gy | **LOW** - rarely limiting |

### What Gamma Measures vs What Clinicians Care About

| Metric | What It Measures | Clinical Relevance |
|--------|-----------------|-------------------|
| Overall Gamma | Similarity to specific reference | **LOW** - penalizes valid variations |
| PTV Gamma | Accuracy in target volumes | **HIGH** - treatment efficacy |
| OAR Gamma | Accuracy near organs | **MODERATE** - depends on constraints |
| DVH D95, Vx | Clinical constraint compliance | **HIGHEST** - what clinicians check |
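
For reference, the Gamma index compared in this table is conventionally defined as below. This is the standard formulation, not something taken from the analysis script. The normalization of the dose criterion (global vs local, and which reference dose the 3% is taken from) is not stated in this report and is left open here.

$$
\Gamma(\mathbf{r}_e, \mathbf{r}_r) = \sqrt{\frac{\lVert \mathbf{r}_e - \mathbf{r}_r \rVert^2}{\Delta d^2} + \frac{\bigl(D_e(\mathbf{r}_e) - D_r(\mathbf{r}_r)\bigr)^2}{\Delta D^2}},
\qquad
\gamma(\mathbf{r}_r) = \min_{\mathbf{r}_e} \Gamma(\mathbf{r}_e, \mathbf{r}_r)
$$

Here $D_r$ is the reference (ground truth) dose, $D_e$ is the evaluated (predicted) dose, $\Delta d = 3$ mm is the distance-to-agreement criterion, and $\Delta D$ is the 3% dose-difference criterion. A voxel passes when $\gamma \le 1$, and the pass rate is the percentage of evaluated voxels that pass. Because every voxel is scored against one reference distribution, a prediction that differs from the reference but is still clinically valid is penalized, which is the concern raised in Section 1.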

### The Physical Deliverability Question

Even if DVH constraints pass, is the predicted dose **physically deliverable** by an MLC?

**Proxies for deliverability:**
- Dose gradients should be smooth (penumbra ~3-5mm)
- No sharp discontinuities
- Dose patterns should resemble real plans

**Gold standard:** Run predicted dose through inverse planning TPS to see if MLC sequence can recreate it.

---

## 3. Experimental Design

### Tests to Perform

#### Test 1: DVH Clinical Acceptability
**Question:** Do our predictions pass clinical constraints?

**Method:**
- Compute D95, D50, Dmean for PTVs
- Compute V70, V65, V50 for OARs
- Check pass/fail against clinical thresholds
- Compare to ground truth constraint compliance

**Success criterion:** Predictions pass constraints at similar rate to ground truth plans.
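
The DVH quantities in this test can be sketched as follows. This is a minimal illustration rather than the code in `scripts/analyze_gamma_metric_hypothesis.py`: the function names are hypothetical, and it assumes a dose array in Gy, a boolean structure mask of the same shape, and equal voxel volumes.

```python
import numpy as np


def dose_at_volume(dose, mask, volume_pct):
    """Dx: the dose (Gy) that at least `volume_pct` percent of the structure receives."""
    voxels = dose[mask]
    # D95 is the dose exceeded by 95% of voxels, i.e. the 5th percentile.
    return float(np.percentile(voxels, 100.0 - volume_pct))


def volume_at_dose(dose, mask, dose_gy):
    """Vx: percent of the structure volume receiving at least `dose_gy`."""
    voxels = dose[mask]
    return float(100.0 * np.mean(voxels >= dose_gy))


# Synthetic example (not patient data): a 70 Gy target where 10% of voxels are cold.
dose = np.full((10, 10, 10), 70.0)
dose[0, :, :] = 60.0                       # 100 of 1000 voxels at 60 Gy
ptv_mask = np.ones(dose.shape, dtype=bool)

d95 = dose_at_volume(dose, ptv_mask, 95)
v70 = volume_at_dose(dose, ptv_mask, 70.0)
print(f"D95 = {d95:.1f} Gy, passes D95 >= 66.5 Gy: {d95 >= 66.5}")
print(f"V70 = {v70:.1f}%")
```

In the synthetic case the cold region covers more than 5% of the volume, so D95 falls to the cold-region dose and the D95 constraint fails even though 90% of the target receives the full dose.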

#### Test 2: PTV-Only Gamma
**Question:** Is the model accurate where it matters most?

**Method:**
- Compute Gamma only within PTV70 and PTV56 volumes
- Exclude low-dose regions from calculation
- Compare to overall Gamma

**Success criterion:** PTV Gamma >> Overall Gamma (indicating low-dose variation is the issue).

#### Test 3: Region-Specific Gamma Breakdown
**Question:** Where do Gamma failures concentrate?

**Regions:**
- PTV (D > 95% Rx): Critical accuracy
- OAR (Rectum, Bladder): Constraint regions
- High dose (50-95% Rx): Gradient regions
- Low dose (10-50% Rx): "No-man's land"

**Hypothesis:** Low-dose regions have lowest Gamma (most variation in training data).
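
One way to organize the regional breakdown is sketched below. It assumes a Gamma map has already been computed per voxel by some Gamma implementation, that the PTV and OAR regions come from contoured structure masks, and that the dose-level regions are defined on the reference dose. The function names and the tiny 1D arrays are illustrative only. Note that the OAR mask can overlap the dose-level regions, so the regions are not a partition of the volume.

```python
import numpy as np


def region_masks(ref_dose, rx_gy, ptv_mask, oar_mask):
    """Build the four region masks from the reference dose and structure masks."""
    frac = ref_dose / rx_gy
    return {
        "PTV": ptv_mask,
        "OAR": oar_mask,
        "High dose": (frac >= 0.50) & (frac < 0.95),
        "Low dose": (frac >= 0.10) & (frac < 0.50),
    }


def gamma_pass_rate(gamma_map, mask):
    """Percent of voxels in `mask` with gamma <= 1 (NaN if the region is empty)."""
    values = gamma_map[mask]
    values = values[np.isfinite(values)]
    if values.size == 0:
        return float("nan")
    return float(100.0 * np.mean(values <= 1.0))


# Synthetic 1D example (not patient data).
ref_dose = np.array([70.0, 69.0, 50.0, 40.0, 20.0, 10.0, 2.0])
gamma_map = np.array([0.5, 0.7, 0.8, 1.5, 1.9, 2.0, 0.1])
ptv_mask = ref_dose >= 66.5   # stand-in for a contoured PTV
oar_mask = np.array([False, False, False, True, True, False, False])

rates = {
    name: gamma_pass_rate(gamma_map, mask)
    for name, mask in region_masks(ref_dose, 70.0, ptv_mask, oar_mask).items()
}
for name, rate in rates.items():
    print(f"{name:>9}: {rate:.1f}%")
```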

### Analysis Script

```bash
# Run the analysis
python scripts/analyze_gamma_metric_hypothesis.py
```

This script computes:
1. DVH metrics and constraint pass/fail for each structure
2. Region-specific Gamma (PTV, OAR, high-dose, low-dose)
3. Publication-ready figures comparing metrics

---

## 4. Reproducibility Information

### Git Information
- **Commit:** TBD (will be filled after running)
- **Repository:** wrockey/vmat-diffusion
- **Branch:** main

### Data
- **Predictions analyzed:** `predictions/structure_weighted_test/` (best Gamma model)
- **Ground truth:** `I:\processed_npz`
- **Test cases:** case_0007, case_0021

### Command to Reproduce
```cmd
call C:\pinokio\bin\miniconda\Scripts\activate.bat vmat-win
cd C:\Users\Bill\vmat-diffusion-project
python scripts\analyze_gamma_metric_hypothesis.py
```

---

## 5. Results

### 5.1 DVH Clinical Constraint Results

| Constraint | Prediction Pass Rate | Ground Truth Pass Rate | Assessment |
|------------|---------------------|------------------|------------|
| PTV70 D95 >= 66.5 Gy | **0%** | 0% | Both fail (cold spots in PTV) |
| PTV56 D95 >= 53.2 Gy | **0%** | 100% | **CRITICAL: Model underdoses PTV56** |
| Rectum V70 <= 15% | 100% | 100% | PASS |
| Rectum V65 <= 25% | 100% | 100% | PASS |
| Rectum V50 <= 50% | 100% | 100% | PASS |
| Bladder V70 <= 25% | 100% | 100% | PASS |
| Bladder V65 <= 50% | 100% | 100% | PASS |

**Key Finding:** OAR constraints pass, but PTV coverage is inadequate!

### 5.2 Detailed D95 Analysis

| Structure | Case | Pred D95 (Gy) | Ground Truth D95 (Gy) | Threshold (Gy) | Delta (Pred - Ground Truth) |
|-----------|------|---------------|-----------------|----------------|-------|
| PTV70 | 0007 | 55.1 | 55.4 | 66.5 | -0.3 Gy |
| PTV70 | 0021 | 48.0 | 54.7 | 66.5 | **-6.7 Gy** |
| PTV56 | 0007 | 46.9 | 55.2 | 53.2 | **-8.3 Gy** |
| PTV56 | 0021 | 48.6 | 55.6 | 53.2 | **-7.0 Gy** |

**Critical Finding:** In three of the four PTV/case combinations, the model **underdoses by 6.7-8.3 Gy** compared to ground truth. Only PTV70 in case 0007 is close (-0.3 Gy).
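
The deltas, and their size relative to prescription, can be re-derived directly from the table values. A minimal sketch, assuming prescriptions of 70 Gy and 56 Gy (inferred from the "95% of Rx" thresholds in Section 2):

```python
# (structure, case, predicted D95, ground-truth D95, prescription), all in Gy.
rows = [
    ("PTV70", "0007", 55.1, 55.4, 70.0),
    ("PTV70", "0021", 48.0, 54.7, 70.0),
    ("PTV56", "0007", 46.9, 55.2, 56.0),
    ("PTV56", "0021", 48.6, 55.6, 56.0),
]

deltas = []
for structure, case, pred, truth, rx in rows:
    delta = pred - truth
    deltas.append(delta)
    print(f"{structure} case {case}: delta = {delta:+.1f} Gy "
          f"({100.0 * delta / rx:+.1f}% of {rx:.0f} Gy Rx)")

mean_delta = sum(deltas) / len(deltas)
print(f"Mean delta over all four rows: {mean_delta:+.1f} Gy")
```

With only two test cases, the per-row values are more informative than the mean, which is pulled toward zero by the near-match for PTV70 in case 0007.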

### 5.3 Region-Specific Gamma

| Region | Case 0007 | Case 0021 | Mean +/- Std | vs Overall (31.2%), percentage points |
|--------|-----------|-----------|--------------|-------------------|
| **PTV (Critical)** | 52.6% | 30.5% | **41.5% +/- 11.0%** | +10.3 |
| **High Dose (>50% Rx)** | 47.2% | 30.1% | **38.6% +/- 8.6%** | +7.4 |
| **Low Dose (10-50% Rx)** | 31.2% | 29.7% | **30.4% +/- 0.8%** | Similar |
| **OAR (Constraints)** | 20.9% | 13.8% | **17.3% +/- 3.5%** | -13.9 |
| **Overall (reference)** | 31.2% | 29.0% | **31.2%** | Baseline |

Note: the two per-case Overall values shown (31.2% and 29.0%) average to 30.1%, not 31.2%. The 31.2% baseline is the figure quoted in Section 1 and is the one used for the comparisons in the last column.
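
The summary columns can be checked from the per-case values with a few lines of standard-library Python. This sketch assumes the table reports the population standard deviation, since that choice reproduces the tabulated values. With only two cases, the sample standard deviation would be larger by a factor of √2.

```python
from statistics import mean, pstdev

# Per-case Gamma pass rates (%) for cases 0007 and 0021, copied from the table.
per_case = {
    "PTV": (52.6, 30.5),
    "High dose": (47.2, 30.1),
    "Low dose": (31.2, 29.7),
    "OAR": (20.9, 13.8),
    "Overall": (31.2, 29.0),
}
BASELINE = 31.2  # overall Gamma quoted in Section 1

summary = {}
for region, values in per_case.items():
    summary[region] = (mean(values), pstdev(values))
    m, s = summary[region]
    print(f"{region:>9}: {m:.2f} +/- {s:.2f}  ({m - BASELINE:+.2f} points vs baseline)")
```

Several means fall exactly on a rounding boundary (for example 41.55 for PTV), so a recomputed table may differ from the one above by 0.1 in the last digit.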

### 5.4 Figures

#### Figure 1: Clinical Constraint Pass Rates
![Constraint Pass Rates](../runs/gamma_metric_analysis/figures/fig1_constraint_pass_rates.png)

#### Figure 2: Region-Specific Gamma
![Region Gamma](../runs/gamma_metric_analysis/figures/fig2_region_gamma.png)

#### Figure 3: DVH Comparison
![DVH Comparison](../runs/gamma_metric_analysis/figures/fig3_dvh_comparison.png)

#### Figure 4: Key Finding
![Key Finding](../runs/gamma_metric_analysis/figures/fig4_key_finding.png)

---

## 6. Analysis and Discussion

### 6.1 Key Questions Answered

1. **Do predictions pass clinical constraints?**
   - **PARTIAL:** OAR constraints pass (100%), but PTV coverage fails
   - The model is "too conservative" - protecting OARs at the expense of PTVs

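2. **Is PTV Gamma much higher than overall Gamma?**
   - **YES, but modestly:** Mean PTV Gamma (41.5%) is about 10 percentage points higher than overall (31.2%)
   - The gap comes almost entirely from case 0007 (52.6%); in case 0021, PTV Gamma (30.5%) is close to overall (29.0%), so with two cases this is suggestive rather than conclusive
   - And 41.5% is still far below clinical acceptability (95%)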

3. **Does low-dose region have lowest Gamma?**
   - **NO:** Low-dose Gamma (30.4%) is similar to overall (31.2%)
   - **Surprise:** OAR Gamma (17.3%) is the LOWEST!
   - This suggests the model struggles with OAR dose gradients, not just "no-man's land"

### 6.2 Hypothesis Evaluation

| Finding | Hypothesis Support | Reality |
|---------|-------------------|---------|
| PTV Gamma > Overall | **SUPPORTED** (41.5% > 31.2%) | Model more accurate in PTVs |
| DVH constraints pass | **MIXED** - OARs pass, PTVs fail | Model underdoses PTVs |
| Low-dose variation | **NOT SUPPORTED** | OAR region is the weak point |

**The original hypothesis is PARTIALLY SUPPORTED but reveals a more serious issue:**
- Overall Gamma IS somewhat misleading (PTV accuracy is higher)
- But the model has a **REAL clinical problem**: systematic PTV underdosing

### 6.3 Root Cause Analysis

**Why is the model underdosing PTVs?**

1. **MSE Loss Symmetry:** MSE treats overdose and underdose equally, but clinically:
   - PTV underdose = treatment failure (tumor recurrence)
   - PTV slight overdose = acceptable (within tolerance)

2. **Conservative Learning:** The model may be learning to avoid high doses near OARs (where training cases vary) by generally predicting lower doses everywhere.

3. **Limited Training Data:** With only 23 cases, the model may not have learned the correct dose levels.

### 6.4 Key Insight: The 7-8 Gy Gap

In three of the four PTV/case combinations, the model's D95 is ~7-8 Gy below ground truth:
- This corresponds to approximately **10% of the 70 Gy prescription dose** (roughly 12-15% of the 56 Gy prescription for PTV56)
- In clinical terms, this is the difference between:
  - a D95 at about 95% of prescription (acceptable) vs
  - a D95 at about 85% of prescription (unacceptable)

### 6.5 Implications for Next Steps

| Option | Pros | Cons |
|--------|------|------|
| Asymmetric PTV loss | Directly penalizes underdosing | May overcorrect |
| DVH-aware D95 loss | Already implemented, clinically motivated | Slow, complex |
| Scale correction | Simple post-processing | Doesn't fix learning |
| More training data | Fundamental improvement | Requires more cases |

---

## 7. Conclusions

### 7.1 Main Finding

**The model underdoses PTVs by roughly 7-8 Gy (about 10% of the 70 Gy prescription) in three of the four PTV/case combinations, failing every PTV D95 constraint while meeting all OAR constraints.**

This reveals a **real clinical problem** (not just a metric problem):
- The model has learned to be too conservative
- It protects OARs at the expense of PTV coverage
- This would result in inadequate tumor treatment

### 7.2 Hypothesis Outcome

The original hypothesis ("Overall Gamma is misleading") is **partially supported**:
- Mean PTV Gamma (41.5%) > Overall Gamma (31.2%) suggests better accuracy where it matters, though the difference is driven by one of the two test cases
- BUT the hypothesis didn't anticipate that DVH constraints would fail in PTVs

The analysis revealed a more fundamental issue: **systematic PTV underdosing**.

### 7.3 Revised Success Metrics

Based on this analysis, we propose:

| Metric | Threshold | Priority | Status |
|--------|-----------|----------|--------|
| PTV D95 compliance | 100% pass | **Critical** | **FAILING** |
| OAR constraint compliance | 100% pass | **Critical** | PASSING |
| PTV Gamma (3%/3mm) | >= 80% | Secondary | 41.5% (failing) |
| Overall Gamma | Report only | Informational | 31.2% |

### 7.4 Recommendations

1. **Immediate:** Add asymmetric loss that heavily penalizes PTV underdosing
2. **Short-term:** Fine-tune with DVH-aware loss emphasizing D95
3. **Medium-term:** Acquire more training data (100+ cases)
4. **Long-term:** Consider physical deliverability constraints

---

## 8. Next Steps

### 8.1 Proposed Next Experiment: Asymmetric PTV Loss

**Hypothesis:** Adding an asymmetric loss that penalizes PTV underdosing more heavily than overdosing will improve D95 coverage without sacrificing OAR sparing.

**Implementation:**
```python
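import torch
import torch.nn as nn


class AsymmetricPTVLoss(nn.Module):
    """
    Penalizes PTV underdosing 3x more than overdosing.
    - Underdose (pred < target): weight = 3.0
    - Overdose (pred >= target): weight = 1.0
    """
    def forward(self, pred, target, ptv_mask):
        error = pred - target
        # Per-voxel weight: 3.0 where the prediction is below target, else 1.0.
        ones = torch.ones_like(error)
        weights = torch.where(error < 0, 3.0 * ones, ones)
        mask = ptv_mask.to(error.dtype)
        # Average over PTV voxels only; a plain .mean() would also count the
        # zeroed-out voxels outside the PTV and shrink the loss with PTV size.
        n_ptv = mask.sum().clamp(min=1.0)
        return (weights * mask * error**2).sum() / n_ptv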
```

**Expected Outcome:**
- Higher PTV D95 values (closer to ground truth)
- May slightly increase OAR doses (acceptable trade-off)
- Improved Gamma in PTV regions

### 8.2 Alternative: Stronger DVH-Aware Loss

The existing DVH-aware loss already has D95 terms, but with low weight. Options:
1. Increase `dvh_d95_weight` from 10.0 to 50.0
2. Add explicit minimum D95 constraint term
3. Make D95 loss asymmetric (only penalize low D95)
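
Options 2 and 3 both amount to a one-sided penalty on D95 shortfall. The scalar sketch below illustrates the idea only. `one_sided_d95_penalty` is a hypothetical name, and how the existing DVH-aware loss computes D95 is not shown in this report. A training loss would need a differentiable approximation of D95 in place of the plain numbers used here.

```python
def one_sided_d95_penalty(pred_d95_gy, min_d95_gy):
    """Squared shortfall below the minimum D95; zero when the constraint is met."""
    shortfall = max(0.0, min_d95_gy - pred_d95_gy)
    return shortfall ** 2


# PTV56 threshold from Section 2 with the predicted D95 values from Section 5.2.
PTV56_MIN_D95_GY = 53.2
for case, pred_d95 in [("0007", 46.9), ("0021", 48.6)]:
    penalty = one_sided_d95_penalty(pred_d95, PTV56_MIN_D95_GY)
    print(f"case {case}: penalty = {penalty:.2f} Gy^2")

# A D95 above the threshold is not penalized for exceeding it.
print(one_sided_d95_penalty(55.2, PTV56_MIN_D95_GY))
```

Unlike a symmetric D95 term, this leaves the optimizer free to exceed the minimum, so it would likely need to be paired with the existing OAR terms to keep overdosing in check.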

### 8.3 Longer-term Considerations

| Approach | Description | When to Try |
|----------|-------------|-------------|
| DDPM/Flow | Generative model for sharper outputs | After fixing D95 issue |
| Attention U-Net | Better long-range dependencies | If asymmetric loss insufficient |
| More data (100+) | Fundamental improvement | When available |
| TPS integration | Verify physical deliverability | After acceptable metrics |

---

## 9. Artifacts

| Artifact | Path |
|----------|------|
| Analysis Script | `scripts/analyze_gamma_metric_hypothesis.py` |
| Results JSON | `runs/gamma_metric_analysis/analysis_results.json` |
| Figures | `runs/gamma_metric_analysis/figures/` |