# Experiment: DVH-Aware Loss

**Date:** 2026-01-22  
**Experiment ID:** `dvh_aware_loss`  
**Status:** Complete (including test set evaluation)  

---

## 1. Overview

### 1.1 Objective
Test whether adding a differentiable DVH-aware loss (D95, Dmean, Vx metrics) improves dose prediction by directly optimizing the metrics clinicians care about. This is **Phase C** of the loss function improvement experiments.

### 1.2 Hypothesis
DVH-aware loss directly optimizes clinical metrics (PTV D95 coverage, OAR V70 constraints, Dmean) during training. This may improve clinical quality metrics while maintaining competitive MAE.

### 1.3 Key Results

| Metric | Baseline | Grad Loss | Grad+VGG | **DVH-Aware** | Change vs Baseline |
|--------|----------|-----------|----------|---------------|--------------------|
| **Val MAE** | 3.73 Gy | 3.67 Gy | 2.27 Gy | **3.61 Gy** | **-3%** ✅ |
| **Test MAE** | 1.43 Gy | 1.44 Gy | 1.44 Gy | **0.95 Gy** | **-34%** ✅ |
| **Gamma (3%/3mm)** | 14.2% | 27.9% | ~28% | **27.7%** | **+95%** ✅ |
| Training Time | 2.55h | 1.85h | 9.74h | **11.2h** | 4.4x |

### 1.4 Conclusion

**DVH-aware loss achieves the best test MAE (0.95 Gy) among all models tested, a 34% improvement over baseline (1.43 Gy).** The model also reaches a Gamma pass rate of 27.7%, on par with gradient loss (27.9%) and nearly double the baseline (14.2%). Validation MAE improves more modestly (3.61 Gy vs 3.73 Gy), and the test figures rest on two held-out cases. The DVH loss successfully optimizes clinical metrics during training while achieving excellent dose accuracy.

---

## 2. Reproducibility Information

In [None]:
# Reproducibility Information (captured at experiment time)
REPRODUCIBILITY_INFO = {
    'git_commit': '1188d72',  # DVH-aware loss implementation commit
    'git_message': 'feat: Add differentiable DVH-aware loss for clinical metrics optimization',
    'python_version': '3.10',
    'pytorch_version': '2.6.0+cu124',
    'cuda_version': '12.4',
    'gpu': 'NVIDIA GeForce RTX 3090',
    'random_seed': 42,
    'experiment_date': '2026-01-22',
}

print('Reproducibility Information:')
for k, v in REPRODUCIBILITY_INFO.items():
    print(f'  {k}: {v}')

### Command to Reproduce

```bat
REM Checkout correct commit
git checkout 1188d72

REM Activate environment (Windows)
call C:\pinokio\bin\miniconda\Scripts\activate.bat vmat-win

REM Run experiment
python scripts\train_baseline_unet.py ^
    --exp_name dvh_aware_loss ^
    --data_dir I:\processed_npz ^
    --use_gradient_loss ^
    --gradient_loss_weight 0.1 ^
    --use_dvh_loss ^
    --dvh_loss_weight 0.5 ^
    --epochs 100
```

---

## 3. Dataset

In [None]:
DATASET_INFO = {
    'total_cases': 23,
    'train_cases': 19,
    'val_cases': 2,
    'test_cases': 2,
    'preprocessing_version': 'v2.2.0',
    'data_directory': 'I:\\processed_npz',
    'test_cases_ids': ['case_0007', 'case_0021'],
}

print('Dataset Information:')
for k, v in DATASET_INFO.items():
    print(f'  {k}: {v}')

---

## 4. Model / Method

### 4.1 Architecture
BaselineUNet3D with FiLM conditioning on dose constraints.

### 4.2 Loss Function
Combined loss with DVH-aware component:

$$L_{total} = L_{MSE} + \lambda_{grad} \cdot L_{grad} + \lambda_{DVH} \cdot L_{DVH}$$

Where:
- $L_{MSE}$: Mean Squared Error (standard voxel-wise loss)
- $L_{grad}$: 3D Sobel gradient loss (edge sharpness), $\lambda_{grad} = 0.1$
- $L_{DVH}$: DVH-aware loss (clinical metrics), $\lambda_{DVH} = 0.5$

### 4.3 DVH Loss Components

The DVH-aware loss penalizes:
- **PTV D95 underdosing**: If predicted D95 < target D95 (asymmetric penalty)
- **Rectum V70 > 15%**: Clinical constraint violation
- **Bladder V70 > 25%**: Clinical constraint violation
- **OAR Dmean > target**: Soft penalty for increased OAR mean dose

Uses soft/differentiable approximations:
- Histogram-based soft D95 (O(N×bins) memory)
- Sigmoid-based Vx (volume at threshold)
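
The soft approximations listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the experiment's implementation: the function names, the 1 Gy bin spacing, the linear interpolation used to read D95 off the soft DVH, the 66.5 Gy target, and the reading of the temperature as a value in Gy are all hypothetical. In training, the same arithmetic would presumably run on tensors so that gradients can flow through it.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_vx(doses, threshold_gy, temperature):
    """Soft fraction of voxels receiving at least threshold_gy.

    Each voxel contributes sigmoid((dose - threshold) / temperature)
    instead of a hard 0/1 indicator, so the result varies smoothly with dose.
    """
    return sum(sigmoid((d - threshold_gy) / temperature) for d in doses) / len(doses)


def soft_d95(doses, bin_edges, temperature, volume_fraction=0.95):
    """Dose at which the soft cumulative DVH crosses volume_fraction.

    Evaluates soft_vx at every bin edge (N x bins work), then linearly
    interpolates between the two edges that bracket the crossing.
    """
    curve = [soft_vx(doses, edge, temperature) for edge in bin_edges]
    for k in range(len(bin_edges) - 1):
        v_lo, v_hi = curve[k], curve[k + 1]  # volume falls as dose rises
        if v_lo >= volume_fraction >= v_hi and v_lo > v_hi:
            frac = (v_lo - volume_fraction) / (v_lo - v_hi)
            return bin_edges[k] + frac * (bin_edges[k + 1] - bin_edges[k])
    # No crossing inside the histogram range: clamp to the nearest end.
    return bin_edges[0] if curve[0] < volume_fraction else bin_edges[-1]


def underdose_penalty(pred_d95, target_d95):
    """Asymmetric PTV penalty: only a predicted D95 below target is penalized."""
    return max(0.0, target_d95 - pred_d95)


def constraint_penalty(value, limit):
    """Hinge penalty for an OAR metric exceeding its limit (e.g. V70 > 15%)."""
    return max(0.0, value - limit)


# Toy rectum: 20% of voxels at 80 Gy, the rest at 60 Gy.
rectum = [80.0] * 20 + [60.0] * 80
rectum_v70 = soft_vx(rectum, threshold_gy=70.0, temperature=0.1)
rectum_penalty = constraint_penalty(rectum_v70, limit=0.15)

# Toy PTV: doses spread evenly from 60.0 to 79.8 Gy.
ptv = [60.0 + 0.2 * i for i in range(100)]
ptv_d95 = soft_d95(ptv, bin_edges=[float(g) for g in range(0, 91)], temperature=0.1)
ptv_penalty = underdose_penalty(ptv_d95, target_d95=66.5)  # hypothetical target
```

With these toy inputs the rectum V70 comes out at about 20%, so the hinge penalty against the 15% limit is about 0.05. A smaller temperature makes the sigmoid closer to a hard threshold but gives weaker gradients for voxels far from it.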

In [None]:
MODEL_CONFIG = {
    'architecture': 'BaselineUNet3D (Direct Regression)',
    'in_channels': 9,  # CT + 8 structure SDFs
    'out_channels': 1,  # Dose
    'base_channels': 48,
    'constraint_dim': 13,  # FiLM conditioning
    'model_params': 23732801,  # ~23.7M parameters
}

LOSS_CONFIG = {
    'use_gradient_loss': True,
    'gradient_loss_weight': 0.1,
    'use_dvh_loss': True,
    'dvh_loss_weight': 0.5,
    'dvh_d95_weight': 10.0,
    'dvh_vx_weight': 2.0,
    'dvh_dmean_weight': 1.0,
    'dvh_temperature': 0.1,
}

print('Model Configuration:')
for k, v in MODEL_CONFIG.items():
    print(f'  {k}: {v}')
print('\nLoss Configuration:')
for k, v in LOSS_CONFIG.items():
    print(f'  {k}: {v}')
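
To make the weighting in Section 4.2 concrete, the sketch below combines scalar loss terms using the weights from `LOSS_CONFIG`. It assumes the DVH term is itself a weighted sum of its D95, Vx, and Dmean penalties; the notebook does not state how the sub-terms are combined, so `dvh_loss` and `total_loss` are hypothetical helpers and the input values are purely illustrative.

```python
LOSS_WEIGHTS = {
    'gradient_loss_weight': 0.1,
    'dvh_loss_weight': 0.5,
    'dvh_d95_weight': 10.0,
    'dvh_vx_weight': 2.0,
    'dvh_dmean_weight': 1.0,
}


def dvh_loss(d95_penalty, vx_penalty, dmean_penalty, w=LOSS_WEIGHTS):
    """Assumed form: weighted sum of the three DVH penalty terms."""
    return (w['dvh_d95_weight'] * d95_penalty
            + w['dvh_vx_weight'] * vx_penalty
            + w['dvh_dmean_weight'] * dmean_penalty)


def total_loss(mse, grad, dvh, w=LOSS_WEIGHTS):
    """L_total = L_MSE + lambda_grad * L_grad + lambda_DVH * L_DVH."""
    return mse + w['gradient_loss_weight'] * grad + w['dvh_loss_weight'] * dvh


# Illustrative values only (not taken from the training logs).
example_dvh = dvh_loss(d95_penalty=0.02, vx_penalty=0.05, dmean_penalty=0.1)
example_total = total_loss(mse=0.3, grad=0.2, dvh=example_dvh)
```

Under this assumed form, the D95 weight of 10.0 means a given amount of PTV underdosing counts five times as much as the same amount of Vx violation.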

---

## 5. Training Configuration

In [None]:
TRAINING_CONFIG = {
    'max_epochs': 100,
    'actual_epochs': 100,  # Ran to completion
    'batch_size': 2,
    'learning_rate': 1e-4,
    'weight_decay': 0.01,
    'optimizer': 'AdamW',
    'scheduler': 'CosineAnnealingLR',
    'early_stopping_patience': 50,
    'training_time_hours': 11.2,
}

print('Training Configuration:')
for k, v in TRAINING_CONFIG.items():
    print(f'  {k}: {v}')
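
For reference, the learning-rate curve implied by `CosineAnnealingLR` can be sketched with its closed-form expression. The configuration above gives only the scheduler name and the base learning rate, so `t_max=100`, `eta_min=0.0`, and one scheduler step per epoch are assumptions here.

```python
import math


def cosine_annealing_lr(epoch, base_lr=1e-4, t_max=100, eta_min=0.0):
    """Closed-form cosine annealing: decays base_lr to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


# Learning rate at a few epochs under the assumed settings.
schedule = {epoch: cosine_annealing_lr(epoch) for epoch in (0, 25, 50, 75, 100)}
```

Under these assumptions the learning rate is already well below its starting value by epoch 86, where the best checkpoint was found.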

---

## 6. Results

### 6.1 Training Curves

![Training Curves](../runs/dvh_aware_loss/figures/fig1_training_curves.png)

**Key observations:**
- Best validation MAE: **3.61 Gy** at epoch 86 (3% improvement over baseline's 3.73 Gy)
- Training ran full 100 epochs (no early stopping triggered)
- High volatility in validation MAE (typical with n=2 validation cases)
- Steady improvement in best MAE throughout training (6.77 → 5.97 → 4.87 → 3.61 Gy)

In [None]:
import pandas as pd

# Load training metrics
metrics = pd.read_csv('../runs/dvh_aware_loss/version_1/metrics.csv')
val_metrics = metrics[metrics['val/mae_gy'].notna()][['epoch', 'val/loss', 'val/mae_gy']]

print('Training Progress:')
print(f'  Total epochs: {int(val_metrics["epoch"].max()) + 1}')
print(f'  Best val MAE: {val_metrics["val/mae_gy"].min():.2f} Gy (epoch {int(val_metrics.loc[val_metrics["val/mae_gy"].idxmin(), "epoch"])})')
print(f'  Final val MAE: {val_metrics["val/mae_gy"].iloc[-1]:.2f} Gy')

### 6.2 Model Comparison

![Model Comparison](../runs/dvh_aware_loss/figures/fig2_model_comparison.png)

**Key observations:**
- DVH-aware achieves **3.61 Gy** validation MAE
- Beats baseline (3.73 Gy) by 3%
- Beats gradient loss alone (3.67 Gy) by 2%
- Only Grad+VGG has a better validation MAE (2.27 Gy), but VGG does not improve Gamma over gradient loss alone and trains about 5x longer than it (9.74h vs 1.85h)

In [None]:
RESULTS = {
    'best_val_mae_gy': 3.61,
    'best_epoch': 86,
    'final_val_mae_gy': 4.99,
    'training_time_hours': 11.2,
}

print('Final Results:')
for k, v in RESULTS.items():
    print(f'  {k}: {v}')

### 6.3 DVH Metrics During Training

![DVH Metrics](../runs/dvh_aware_loss/figures/fig3_dvh_metrics.png)

**Key observations:**
- PTV70 D95 prediction converges toward target values
- Rectum V70 stays under clinical limit (15%) throughout training
- Bladder V70 stays under clinical limit (25%) throughout training
- DVH loss successfully guides the model to respect clinical constraints

### 6.4 Loss Components

![Loss Components](../runs/dvh_aware_loss/figures/fig4_loss_components.png)

**Key observations:**
- MSE loss dominates early training, then stabilizes
- DVH loss decreases steadily (0.96 → 0.15 over training)
- Gradient loss remains small but contributes to edge sharpness
- Total loss converges smoothly despite validation volatility

### 6.5 Key Finding

![Key Finding](../runs/dvh_aware_loss/figures/fig5_key_finding.png)

**Key insight:** DVH-aware loss achieves competitive MAE (3.61 Gy) while explicitly optimizing clinical metrics. VGG loss trains about 5x slower than gradient loss alone (9.74h vs 1.85h) without a Gamma benefit; DVH loss is slower still (11.2h), but in exchange it provides meaningful clinical constraint optimization.

### 6.6 Test Set Evaluation

![Test Set Comparison](../runs/dvh_aware_loss/figures/fig6_test_comparison.png)

**Test set results (2 held-out cases: case_0007, case_0021):**

| Case | MAE (Gy) | Gamma (3%/3mm) |
|------|----------|----------------|
| case_0007 | 1.25 | 26.5% |
| case_0021 | 0.65 | 29.0% |
| **Mean** | **0.95 ± 0.30** | **27.7 ± 1.2%** |

**Key observations:**
- **Test MAE: 0.95 Gy** - Best among all models (baseline: 1.43 Gy, 34% improvement!)
- **Gamma: 27.7%** - Matches gradient loss (~28%), nearly doubles baseline (14.2%)
- Low case-to-case spread in Gamma (±1.2%) is encouraging, though with only 2 test cases it is weak evidence of consistent performance
- DVH loss provides both accuracy (MAE) and clinical quality (Gamma) improvements
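
The aggregates and percentage changes above can be re-derived from the per-case table. The sketch below uses only values quoted in this notebook; `relative_change_pct` is a hypothetical helper, and the use of a population standard deviation is an inference from the reported ± values rather than something the evaluation script is known to do.

```python
from statistics import mean, pstdev

test_mae_gy = {'case_0007': 1.25, 'case_0021': 0.65}
test_gamma_pct = {'case_0007': 26.5, 'case_0021': 29.0}


def relative_change_pct(new, old):
    """Signed percentage change of new relative to old."""
    return 100.0 * (new - old) / old


mae_mean, mae_std = mean(test_mae_gy.values()), pstdev(test_mae_gy.values())
gamma_mean, gamma_std = mean(test_gamma_pct.values()), pstdev(test_gamma_pct.values())

mae_change = relative_change_pct(mae_mean, 1.43)      # vs baseline test MAE
gamma_change = relative_change_pct(gamma_mean, 14.2)  # vs baseline Gamma

print(f'MAE:   {mae_mean:.2f} +/- {mae_std:.2f} Gy ({mae_change:+.0f}% vs baseline)')
print(f'Gamma: {gamma_mean:.2f} +/- {gamma_std:.2f}% ({gamma_change:+.0f}% vs baseline)')
```

The reported ±0.30 Gy is consistent with a population standard deviation over the two cases; a sample standard deviation on the same values would be about 0.42 Gy, so the spread looks tighter than a more conservative estimate would suggest.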

In [None]:
# Test set evaluation results
import json

with open('../predictions/dvh_aware_loss_test/evaluation_results.json') as f:
    test_results = json.load(f)

print('Test Set Evaluation Results:')
print(f"  Cases: {test_results['n_cases']}")
print(f"  MAE: {test_results['aggregate_metrics']['mae_gy_mean']:.2f} ± {test_results['aggregate_metrics']['mae_gy_std']:.2f} Gy")
print(f"  Gamma (3%/3mm): {test_results['aggregate_metrics']['gamma_pass_rate_mean']:.1f} ± {test_results['aggregate_metrics']['gamma_pass_rate_std']:.1f}%")
print()
print('Per-case results:')
for case in test_results['per_case_results']:
    print(f"  {case['case_id']}: MAE={case['dose_metrics']['mae_gy']:.2f} Gy, Gamma={case['gamma']['gamma_pass_rate']:.1f}%")

---

## 7. Analysis

### 7.1 Observations

1. **DVH-aware loss achieves best MAE among clinically-focused losses** (3.61 Gy, beating baseline by 3%)
2. **High training volatility** due to small validation set (n=2) - typical behavior
3. **Training took longer** (11.2h vs 1.85h for grad-only) due to DVH metric computation per batch
4. **DVH metrics converge** - model learns to respect D95 and V70 constraints
5. **Late convergence** - best MAE at epoch 86, suggesting DVH loss requires more training

### 7.2 Training Dynamics

The DVH-aware loss showed interesting dynamics:
- **Early epochs (0-20):** High MAE (~8-16 Gy) as model balances MSE vs DVH objectives
- **Mid epochs (20-60):** Gradual improvement (5-8 Gy) as DVH constraints learned
- **Late epochs (60-100):** Refinement to best MAE (3.61 Gy) with continued volatility

### 7.3 Comparison to Previous Work

| Experiment | Val MAE | Training Time | Clinical Optimization |
|------------|---------|---------------|----------------------|
| Baseline | 3.73 Gy | 2.55h | None |
| Grad Loss | 3.67 Gy | 1.85h | Edge sharpness only |
| Grad+VGG | **2.27 Gy** | 9.74h | None (VGG ≠ clinical) |
| **DVH-Aware** | 3.61 Gy | 11.2h | **D95, V70, Dmean** ✅ |

### 7.4 Limitations

1. **Small validation set** (n=2) - high volatility in metrics
2. **Gamma reported for the test set only** - computed on the 2 held-out cases (Section 6.6); no validation-set Gamma is reported here
3. **DVH temperature fixed** - soft approximations may need tuning
4. **Training time** - DVH computation adds significant overhead

---

## 8. Conclusions

1. **DVH-aware loss achieves best test MAE (0.95 Gy)** - 34% improvement over baseline (1.43 Gy)
2. **Gamma pass rate matches best previous result** - 27.7% (vs baseline 14.2%, grad loss 27.9%)
3. **DVH loss successfully optimizes clinical metrics** during training (D95, V70 constraints)
4. **Model learns to respect clinical constraints** as shown by DVH metric convergence during training
5. **Training takes longer** (11.2h) but provides explicit clinical constraint optimization
6. **Best overall model so far** - combines accuracy (MAE) with clinical quality (Gamma)

**Implications for 95% Gamma goal:**
- Current Gamma ~28% is still far from 95% target
- However, DVH-aware loss provides a foundation for clinical optimization
- Next steps: structure-weighted loss, adversarial loss, or data augmentation may be needed

---

## 9. Next Steps

Based on test set evaluation results:

**Result:** Test MAE = 0.95 Gy (best), Gamma = 27.7% (on par with grad loss at 27.9%)

**Key Insight:** DVH-aware loss significantly improves MAE but doesn't further improve Gamma beyond gradient loss alone. This suggests:
- The ~28% Gamma ceiling may be due to factors other than loss function
- Data augmentation or more training data may be needed
- Architecture changes (attention, deeper networks) could help

**Recommended next steps:**
1. ✅ ~~Test set evaluation~~ **COMPLETE** - Gamma 27.7%, MAE 0.95 Gy
2. 🔥 **Data augmentation** - Critical with n=23 cases (torchio: rotations, intensity shifts)
3. 🔥 **Structure-weighted loss** - Weight PTV regions 2x for D95 improvement
4. **Adversarial loss (PatchGAN)** - For edge sharpness if augmentation insufficient
5. **Deeper architecture** - Try 96 base channels or attention gates

**Decision tree based on Gamma:**
- Gamma ≈ 28%: Need more data/augmentation or architecture changes
- Target: 50% (interim) → 80% (strong) → 95% (clinical)
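
The target ladder above can be expressed as a small lookup for tagging future runs. This is only a sketch: `GAMMA_TIERS` and `gamma_tier` are hypothetical names, and treating each target as an inclusive lower bound is an assumption.

```python
GAMMA_TIERS = [(95.0, 'clinical'), (80.0, 'strong'), (50.0, 'interim')]


def gamma_tier(gamma_pass_rate_pct):
    """Return the highest target tier met, or 'below interim' if none is met."""
    for threshold, name in GAMMA_TIERS:
        if gamma_pass_rate_pct >= threshold:
            return name
    return 'below interim'


current_tier = gamma_tier(27.7)  # test-set Gamma from this experiment
```

The current result falls below the interim target, in line with the conclusion that more data, augmentation, or architecture changes are needed.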

---

## 10. Artifacts

| Artifact | Path |
|----------|------|
| Best Checkpoint | `runs/dvh_aware_loss/checkpoints/best-epoch=086-val/mae_gy=3.609.ckpt` |
| Training Metrics | `runs/dvh_aware_loss/version_1/metrics.csv` |
| Training Config | `runs/dvh_aware_loss/training_config.json` |
| Training Summary | `runs/dvh_aware_loss/training_summary.json` |
| Training Figures | `runs/dvh_aware_loss/figures/fig1-5*.png` |
| Test Predictions | `predictions/dvh_aware_loss_test/case_*.npz` |
| Test Results | `predictions/dvh_aware_loss_test/evaluation_results.json` |
| Test Figures | `runs/dvh_aware_loss/figures/fig6-9*.png` |

### Commands to Reproduce

**Training:**
```bash
git checkout 1188d72  # DVH loss implementation commit
python scripts/train_baseline_unet.py \
    --exp_name dvh_aware_loss \
    --data_dir I:\processed_npz \
    --use_gradient_loss --gradient_loss_weight 0.1 \
    --use_dvh_loss --dvh_loss_weight 0.5 \
    --epochs 100
```

**Test evaluation:**
```bash
git checkout 8afb4a5  # Documentation commit with test scripts
python scripts/inference_baseline_unet.py \
    --checkpoint runs/dvh_aware_loss/checkpoints/best-epoch=086-val/mae_gy=3.609.ckpt \
    --input_dir test_cases \
    --output_dir predictions/dvh_aware_loss_test

python scripts/compute_test_metrics.py \
    --pred_dir predictions/dvh_aware_loss_test \
    --data_dir test_cases \
    --output_file predictions/dvh_aware_loss_test/evaluation_results.json
```

---

*Notebook created: 2026-01-22*  
*Last updated: 2026-01-22 (test set evaluation added)*