# Experiment: [TITLE]

**Date:** YYYY-MM-DD  
**Experiment ID:** `experiment_name`  
**Status:** In Progress / Complete / Failed  
**Type:** Training / Analysis  

---

## 1. Overview

### 1.1 Objective

[What are we testing? What hypothesis? What is the clinical motivation?]

### 1.2 Key Results

| Metric | This Experiment | Baseline | Change |
|--------|----------------|----------|--------|
| MAE (Gy) | X.XX ± X.XX | X.XX ± X.XX | +/-X% |
| Gamma 3%/3mm | XX.X ± X.X% | XX.X ± X.X% | +/-X% |
| PTV70 D95 Gap (Gy) | X.XX ± X.XX | X.XX ± X.XX | +/-X Gy |

### 1.3 Conclusion

[1-2 sentence summary: what did we learn and what does it mean clinically?]

---

## 2. What Changed

Compared to **[prior experiment name]** (`git_hash`), this experiment changes:

| Parameter | Prior Value | This Experiment |
|-----------|------------|------------------|
| [changed param] | [old] | [new] |

**Everything else is identical.** If more than one variable changed, justify why.
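
The single-variable rule can be checked mechanically by diffing the saved configs of the two runs. The sketch below assumes both runs wrote a flat `training_config.json` (the file listed under Artifacts). The helpers `diff_configs` and `load_config` and the example keys and values are hypothetical, not part of the project code.

```python
import json
from pathlib import Path


def diff_configs(prior: dict, current: dict) -> dict:
    """Return {key: (prior_value, current_value)} for every key that differs.

    A key present on only one side is reported with None on the other side.
    """
    missing = object()  # sentinel so an absent key always counts as a change
    changed = {}
    for key in sorted(set(prior) | set(current)):
        old, new = prior.get(key, missing), current.get(key, missing)
        if old != new:
            changed[key] = (
                None if old is missing else old,
                None if new is missing else new,
            )
    return changed


def load_config(run_dir: str) -> dict:
    """Load training_config.json from a run directory (directory layout is assumed)."""
    with open(Path(run_dir) / "training_config.json") as f:
        return json.load(f)


# Illustrative values only, not real experiment settings.
prior = {"lr": 1e-4, "batch_size": 2, "loss": "mse"}
current = {"lr": 1e-4, "batch_size": 2, "loss": "mse+gradient"}

changes = diff_configs(prior, current)
for key, (old, new) in changes.items():
    print(f"| {key} | {old} | {new} |")  # rows for the table in this section
if len(changes) > 1:
    print("More than one parameter changed: justify this in Section 2.")
```

With real runs, `diff_configs(load_config(prior_dir), load_config(current_dir))` would produce the rows of the table in this section.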

---

## 3. Reproducibility

```python
import subprocess
import sys
import torch
from datetime import datetime
from pathlib import Path


def git(*args):
    """Return stdout of a git command, or None if git fails (e.g., not a repository)."""
    try:
        result = subprocess.run(['git', *args], capture_output=True, text=True)
    except OSError:  # git executable not found
        return None
    return result.stdout.strip() if result.returncode == 0 else None


# Auto-capture reproducibility info
git_status = git('status', '--porcelain')
REPRODUCIBILITY = {
    'git_commit': git('rev-parse', 'HEAD') or 'unknown',
    'git_message': git('log', '-1', '--format=%s') or 'unknown',
    'git_dirty': None if git_status is None else git_status != '',
    'python_version': f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}',
    'pytorch_version': torch.__version__,
    'cuda_version': torch.version.cuda if torch.cuda.is_available() else 'N/A',
    'gpu': torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'N/A',
    'random_seed': 42,  # <-- must match the seed actually used for training
    'experiment_date': datetime.now().strftime('%Y-%m-%d'),
}

print('Reproducibility Information:')
for k, v in REPRODUCIBILITY.items():
    print(f'  {k}: {v}')

if REPRODUCIBILITY['git_dirty'] is None:
    print('\n  WARNING: Could not query git; commit hash and dirty state are unknown.')
elif REPRODUCIBILITY['git_dirty']:
    print('\n  WARNING: Uncommitted changes present! Commit before running.')

# Verify environment snapshot exists
EXP_NAME = 'FILL_IN'  # <-- SET THIS
env_snapshot = Path(f'../runs/{EXP_NAME}/environment_snapshot.txt')
if env_snapshot.exists():
    print(f'\n  Environment snapshot: {env_snapshot}')
else:
    print(f'\n  WARNING: No environment snapshot found. Run: conda list --export > {env_snapshot}')
```

### Command to Reproduce

```bash
# 1. Checkout exact code
git checkout <COMMIT_HASH>

# 2. Activate environment
conda activate vmat-diffusion

# 3. Train
python scripts/train_baseline_unet.py \
    --exp_name <NAME> \
    --data_dir ~/data/processed_npz \
    [FULL FLAGS HERE]

# 4. Evaluate
python scripts/inference_baseline_unet.py \
    --checkpoint runs/<NAME>/checkpoints/best-*.ckpt \
    --input_dir ~/data/processed_npz
```

---

## 4. Dataset

```python
import json
from pathlib import Path

DATA_DIR = Path('~/data/processed_npz').expanduser()

# Load batch summary for data provenance
batch_summary = DATA_DIR / 'batch_summary.json'
if batch_summary.exists():
    with open(batch_summary) as f:
        summary = json.load(f)
    print(f'Preprocessing version: {summary.get("script_version", "unknown")}')
    print(f'Processed date: {summary.get("processed_date", "unknown")}')
    print(f'Total cases: {summary.get("total_cases", "unknown")}')
    print(f'Settings: {json.dumps(summary.get("settings", {}), indent=2)}')
else:
    print(f'WARNING: {batch_summary} not found; data provenance is unverified')

# Record actual case IDs for each split
DATASET = {
    'preprocessing_version': 'v2.3.0',  # should match the script_version printed above
    'total_cases': 0,       # FILL IN
    'train_case_ids': [],   # FILL IN with actual case IDs
    'val_case_ids': [],     # FILL IN
    'test_case_ids': [],    # FILL IN
}

print(f'\nSplit: {len(DATASET["train_case_ids"])} train / '
      f'{len(DATASET["val_case_ids"])} val / '
      f'{len(DATASET["test_case_ids"])} test')
```
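
Once the case IDs are filled in, it is worth confirming that no case appears in more than one split and that the split sizes add up to the recorded total. The following is a minimal sketch of such a check. `check_splits` is a hypothetical helper and the case IDs are made up for illustration.

```python
def check_splits(train_ids, val_ids, test_ids, total_cases=None):
    """Return a list of human-readable problems; an empty list means the split is clean."""
    problems = []
    splits = {"train": train_ids, "val": val_ids, "test": test_ids}

    # A case listed twice inside one split usually indicates a copy-paste error.
    for name, ids in splits.items():
        if len(ids) != len(set(ids)):
            problems.append(f"duplicate case IDs within {name}")

    # Any case shared between two splits leaks information across them.
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlap = sorted(set(splits[a]) & set(splits[b]))
            if overlap:
                problems.append(f"{a}/{b} overlap: {overlap}")

    # Optionally compare the number of unique cases per split against the total.
    n = sum(len(set(ids)) for ids in splits.values())
    if total_cases is not None and n != total_cases:
        problems.append(f"split sizes sum to {n}, expected {total_cases}")
    return problems


# Made-up IDs: case_002 appears in both train and test.
issues = check_splits(["case_001", "case_002"], ["case_003"], ["case_002"], total_cases=4)
for issue in issues:
    print("WARNING:", issue)
```

In the notebook this could be called as `check_splits(DATASET['train_case_ids'], DATASET['val_case_ids'], DATASET['test_case_ids'], DATASET['total_cases'])`.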

---

## 5. Model & Training Configuration

*Skip this section for Analysis-type experiments.*

```python
import json
from pathlib import Path

# Load saved training config (auto-generated by training script).
# EXP_NAME is set in the Section 3 cell.
config_path = Path(f'../runs/{EXP_NAME}/training_config.json')
if config_path.exists():
    with open(config_path) as f:
        config = json.load(f)
    print('Training Configuration (from training_config.json):')
    for k, v in sorted(config.items()):
        print(f'  {k}: {v}')
else:
    print(f'WARNING: {config_path} not found')
```

---

## 6. Results

Figures are generated by `scripts/generate_<exp_name>_figures.py` and loaded here.
See CLAUDE.md "Medical Physics Figure Set" for the required figures.

### 6.1 Training Curves

![Training Curves](../runs/EXP_NAME/figures/fig1_training_curves.png)

**Caption:** Training loss and validation MAE vs epoch for [experiment name]. [What the reader should observe. How convergence compares to prior experiments.]

**Key observations:**
- [observation 1]
- [observation 2]

### 6.2 Dose Colorwash (Representative Case)

![Dose Colorwash](../runs/EXP_NAME/figures/fig2_dose_colorwash.png)

**Caption:** Predicted (left) vs ground truth (right) dose distribution overlaid on CT for [case_id]. Axial slice through PTV70 centroid. [Clinical interpretation — where does the prediction agree/disagree? Are dose gradients realistic?]

### 6.3 Dose Difference Map

![Dose Difference](../runs/EXP_NAME/figures/fig3_dose_difference.png)

**Caption:** Dose difference (predicted minus ground truth) for [case_id]. Positive values (red) indicate overdose, negative (blue) indicate underdose. [Where are the largest errors? Are they clinically significant?]

### 6.4 DVH Comparison

![DVH](../runs/EXP_NAME/figures/fig4_dvh_comparison.png)

**Caption:** DVH curves for predicted (dashed) vs ground truth (solid) for all structures, [case_id]. [Which structures agree well? Where are the largest discrepancies? Are clinical constraints met?]
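
The D95 values reported elsewhere in this template are single points read off the DVH. As a rough sketch of the arithmetic, D95 is the highest dose that at least 95% of a structure's voxels receive, assuming equal-volume voxels. The helper `dose_at_volume` and the toy doses are hypothetical. Taking the "gap" as predicted minus ground truth is also an assumption, since this template does not define it.

```python
import math


def dose_at_volume(doses_gy, volume_percent):
    """Return D_x: the highest dose that at least x% of the voxels receive."""
    if not doses_gy:
        raise ValueError("structure contains no voxels")
    ordered = sorted(doses_gy, reverse=True)  # hottest voxel first
    # Number of voxels needed to cover x% of the structure (at least one).
    k = max(1, math.ceil(len(ordered) * volume_percent / 100))
    return ordered[k - 1]  # dose of the k-th hottest voxel


# Toy voxel doses inside a target structure (illustrative numbers only).
truth = [70.5, 70.2, 70.0, 69.8, 69.9, 70.1, 70.3, 69.7, 70.4, 69.6,
         70.0, 70.2, 69.9, 70.1, 70.0, 69.8, 70.3, 70.1, 69.5, 68.0]
pred = [d - 0.5 for d in truth]  # pretend the model is uniformly 0.5 Gy cold

d95_truth = dose_at_volume(truth, 95)
d95_pred = dose_at_volume(pred, 95)
print(f"D95 truth = {d95_truth:.2f} Gy, predicted = {d95_pred:.2f} Gy, "
      f"gap = {d95_pred - d95_truth:+.2f} Gy")
```

A single cold voxel (68.0 Gy here) does not move D95, which is why the metric is less sensitive to isolated errors than the minimum dose.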

### 6.5 Gamma Map

![Gamma](../runs/EXP_NAME/figures/fig5_gamma_map.png)

**Caption:** 3%/3mm gamma index map for [case_id]. Green = pass, red = fail. Overall pass rate: XX.X%. [Where does the model fail? Is failure concentrated in clinically relevant regions?]

### 6.6 Per-Case Results

![Box Plots](../runs/EXP_NAME/figures/fig6_per_case_boxplots.png)

**Caption:** Distribution of MAE, Gamma pass rate, and PTV70 D95 error across N test cases. Boxes span the IQR, whiskers extend to the most extreme values within 1.5×IQR of the box, and dots mark outliers beyond the whiskers. [Are results consistent across cases or do outliers dominate the mean?]
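
To name the outlier cases rather than only plotting them, the same 1.5×IQR rule can be applied to the per-case values directly. This is a minimal sketch using only the standard library. `iqr_outliers`, the case IDs, and the MAE values are hypothetical, and the quartile method is an assumption that may differ slightly from the one used by the plotting library.

```python
import statistics


def iqr_outliers(per_case: dict, k: float = 1.5) -> dict:
    """Return the cases whose value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # "inclusive" interpolates linearly between the observed data points.
    q1, _, q3 = statistics.quantiles(per_case.values(), n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return {case: v for case, v in per_case.items() if v < low or v > high}


# Made-up per-case MAE values in Gy.
mae = {"case_01": 1.0, "case_02": 1.2, "case_03": 1.1,
       "case_04": 0.9, "case_05": 1.3, "case_06": 3.5}

for case, value in iqr_outliers(mae).items():
    print(f"{case}: MAE = {value:.2f} Gy is an outlier; inspect this case individually")
```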

---

## 7. Statistical Analysis

```python
import numpy as np
from scipy import stats

# Load per-case results
# results = load_evaluation_results(EXP_NAME)
# baseline = load_evaluation_results('baseline_experiment')

# Example statistical analysis (fill in with real data)
def report_metric(name, values, baseline_values=None, n_boot=10_000, seed=0):
    """Report the mean with a bootstrap 95% CI and a paired comparison to baseline."""
    values = np.asarray(values, dtype=float)
    mean = values.mean()
    # Percentile bootstrap CI of the mean. Percentiles of the raw values would
    # describe the spread across cases, not the uncertainty of the mean.
    rng = np.random.default_rng(seed)
    boot_means = rng.choice(values, size=(n_boot, values.size), replace=True).mean(axis=1)
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
    print(f'{name}: {mean:.2f} (95% CI: [{ci_low:.2f}, {ci_high:.2f}])')

    if baseline_values is not None:
        baseline_values = np.asarray(baseline_values, dtype=float)
        # Paired Wilcoxon signed-rank test: both arrays must list the
        # same test cases in the same order.
        stat, p_value = stats.wilcoxon(values, baseline_values)
        diff = mean - baseline_values.mean()
        print(f'  vs baseline: {diff:+.2f} (p={p_value:.4f}, '
              f'{"significant" if p_value < 0.05 else "not significant"} at alpha=0.05)')

# report_metric('MAE (Gy)', mae_values, baseline_mae)
# report_metric('Gamma 3%/3mm (%)', gamma_values, baseline_gamma)
# report_metric('PTV70 D95 Gap (Gy)', d95_values, baseline_d95)
```

---

## 8. Cross-Experiment Comparison

Comparison to ALL prior experiments on standardized metrics:

| Experiment | MAE (Gy) | Gamma 3%/3mm | PTV70 D95 Gap | Training Time |
|------------|----------|--------------|---------------|---------------|
| baseline_v23 | X.XX ± X.XX | XX.X ± X.X% | X.XX ± X.XX Gy | X.Xh |
| **this_experiment** | **X.XX ± X.XX** | **XX.X ± X.X%** | **X.XX ± X.XX Gy** | **X.Xh** |

**Key comparisons:**
- [How does this compare to baseline? Is the improvement clinically meaningful?]
- [Is the tradeoff (e.g., training time) worth it?]

---

## 9. Conclusions, Limitations, and Next Steps

### Conclusions
- [Key finding 1]
- [Key finding 2]

### Limitations
- [Limitation 1 — e.g., single seed, limited case count, etc.]
- [Limitation 2]

### Next Steps
- [ ] [What this experiment motivates]
- [ ] [Follow-up experiment if results warrant]

---

## 10. Artifacts

| Artifact | Path |
|----------|------|
| Best Checkpoint | `runs/<exp>/checkpoints/best-*.ckpt` |
| Training Metrics | `runs/<exp>/version_*/metrics.csv` |
| Training Config | `runs/<exp>/training_config.json` |
| Environment Snapshot | `runs/<exp>/environment_snapshot.txt` |
| Figures (PNG + PDF) | `runs/<exp>/figures/` |
| Figure Generation Script | `scripts/generate_<exp>_figures.py` |
| Test Predictions | `predictions/<exp>_test/case_*.npz` |
| Test Results | `predictions/<exp>_test/evaluation_results.json` |
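
Before marking an experiment Complete, the paths in the table above can be checked for existence. The sketch below mirrors the table's patterns. `missing_artifacts` is a hypothetical helper, and `root` is assumed to be the repository root (`..` when run from the notebook, judging by the `../runs/` paths used in earlier cells).

```python
from pathlib import Path


def missing_artifacts(root, exp):
    """Return the names of expected artifacts that match no file under root."""
    patterns = {
        "Best Checkpoint": f"runs/{exp}/checkpoints/best-*.ckpt",
        "Training Metrics": f"runs/{exp}/version_*/metrics.csv",
        "Training Config": f"runs/{exp}/training_config.json",
        "Environment Snapshot": f"runs/{exp}/environment_snapshot.txt",
        "Figures (PNG + PDF)": f"runs/{exp}/figures/*",
        "Figure Generation Script": f"scripts/generate_{exp}_figures.py",
        "Test Predictions": f"predictions/{exp}_test/case_*.npz",
        "Test Results": f"predictions/{exp}_test/evaluation_results.json",
    }
    # An artifact counts as present if at least one path matches its pattern.
    return [name for name, pattern in patterns.items()
            if not any(Path(root).glob(pattern))]


for name in missing_artifacts("..", "FILL_IN"):  # set the experiment name first
    print(f"WARNING: missing artifact: {name}")
```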

---

*Notebook created: YYYY-MM-DD*  
*Last updated: YYYY-MM-DD*