# COMPASS — Clinical Validation with Annotated Datasets

**Purpose:** This notebook explains how to evaluate COMPASS predictions against annotated ground-truth labels/values from phenotype cohorts.

It covers the two validation scripts:
1. `compute_confusion_matrix.py` — Mode-aware metric outputs (binary/multiclass/regression/hierarchical)
2. `detailed_analysis.py` — Deep statistical analysis (`.txt` + multiple `.png` plots)

> These scripts are designed for post-hoc evaluation after a batch run (Step 05) has completed.


## Prerequisites

1. **Completed batch run** — Participant result folders in a `participant_runs/` directory
2. **Ground-truth annotations file** — e.g., `cases_controls_with_specific_subtypes.txt`
3. **Python dependencies**: `matplotlib`, `numpy`, and optionally `scipy` (for statistical tests)

Each participant folder should contain at minimum a `report_ID{eid}.json` file.
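
Before running either script, it can help to check which participant folders lack a report. The following is a minimal sketch, assuming the folders are named `participant_ID{eid}` and the report sits directly inside each one. The helper name `find_missing_reports` is hypothetical and is not part of COMPASS.

```python
from pathlib import Path


def find_missing_reports(results_dir):
    """Return eids of participant folders without a report_ID{eid}.json file.

    Assumes folders are named participant_ID{eid} and that the report sits
    directly inside each folder (an assumption; adjust to your layout).
    """
    missing = []
    for folder in sorted(Path(results_dir).glob("participant_ID*")):
        if not folder.is_dir():
            continue
        eid = folder.name[len("participant_ID"):]
        if not (folder / f"report_ID{eid}.json").is_file():
            missing.append(eid)
    return missing
```

Calling `find_missing_reports("../results/participant_runs")` returns a list of eids to investigate; an empty list means every folder has its report.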


## File Structure

```
utils/validation/with_annotated_dataset/
├── compute_confusion_matrix.py   # Task-type metric generator
├── detailed_analysis.py          # Deep statistical analysis + plots
└── validation_guide.ipynb        # This notebook
```

Output directory structure:
```
results/analysis/
├── binary_confusion_matrix/
│   ├── integrated_confusion_matrix.png
│   ├── major_depressive_disorder_confusion_matrix.png
│   └── ...
└── details/
    ├── detailed_analysis.txt
    ├── composite_vs_accuracy.png
    ├── probability_calibration.png
    ├── iteration_improvement.png
    ├── verdict_accuracy.png
    └── ...  (per-disorder variants)
```


## 1. Computing Metrics by Prediction Type

The metrics script supports binary, multiclass, regression, and hierarchical evaluation. In binary mode it also generates CASE vs. CONTROL confusion matrix plots.


In [None]:
# --- Integrated confusion matrix (all disorders combined) ---
# Adjust paths to match your environment.

!python utils/validation/with_annotated_dataset/compute_confusion_matrix.py \
    --results_dir ../results/participant_runs \
    --targets_file ../data/__TARGETS__/cases_controls_with_specific_subtypes.txt \
    --prediction_type binary \
    --output_dir ../results/analysis/binary_confusion_matrix


In [None]:
# --- With per-disorder breakdown ---
# Generates one additional .png per disorder group.

!python utils/validation/with_annotated_dataset/compute_confusion_matrix.py \
    --results_dir ../results/participant_runs \
    --targets_file ../data/__TARGETS__/cases_controls_with_specific_subtypes.txt \
    --prediction_type binary \
    --output_dir ../results/analysis/binary_confusion_matrix \
    --disorder_groups "MAJOR_DEPRESSIVE_DISORDER,ANXIETY_DISORDERS,SUBSTANCE_USE_DISORDERS,SLEEP_WAKE_DISORDERS,BIPOLAR_AND_MANIC_DISORDERS"


### Arguments Reference

| Argument | Required | Description |
|---|---|---|
| `--results_dir` | ✓ | Path to directory containing `participant_ID{eid}/` folders |
| `--targets_file` | ✓ (binary mode) | Path to annotated ground-truth `.txt` file |
| `--prediction_type` | ✗ | `binary` (default), `multiclass`, `regression_univariate`, `regression_multivariate`, `hierarchical` |
| `--annotations_json` | ✓ (non-binary) | JSON annotations for multiclass/regression/hierarchical runs |
| `--output_dir` | ✓ | Where to save `.png` output files |
| `--disorder_groups` | ✗ | Comma-separated list of disorder group names for per-group matrices |

### Metrics Displayed

In binary mode, each confusion matrix `.png` includes a sidebar with:
- **Accuracy** — (TP + TN) / total — overall share of correct predictions
- **Sensitivity (Recall)** — TP / (TP + FN) — how well CASEs are detected
- **Specificity** — TN / (TN + FP) — how well CONTROLs are identified
- **Precision** — TP / (TP + FP) — positive predictive value
- **F1 Score** — Harmonic mean of precision and sensitivity
- **MCC** — Matthews Correlation Coefficient (balanced metric, range -1 to +1)
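
The definitions above can be sketched directly from the four confusion-matrix counts. This is an illustrative reimplementation, not the code in `compute_confusion_matrix.py`. The function name `binary_metrics`, the zero-denominator convention, and the example counts are all assumptions.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Compute the sidebar metrics from confusion-matrix counts.

    Ratios with a zero denominator are reported as 0.0 here; that is a
    convention chosen for this sketch, not necessarily what the script does.
    """
    def ratio(num, den):
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Made-up counts for illustration only.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name:12s} {value:.3f}")
```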


## 2. Detailed Statistical Analysis

The detailed analysis script generates a comprehensive text report and four diagnostic plots.


In [None]:
# --- Integrated analysis ---

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --targets_file ../data/__TARGETS__/cases_controls_with_specific_subtypes.txt \
    --prediction_type binary \
    --output_dir ../results/analysis/details


In [None]:
# --- With per-disorder breakdown ---

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --targets_file ../data/__TARGETS__/cases_controls_with_specific_subtypes.txt \
    --prediction_type binary \
    --output_dir ../results/analysis/details \
    --disorder_groups "MAJOR_DEPRESSIVE_DISORDER,ANXIETY_DISORDERS,SUBSTANCE_USE_DISORDERS,SLEEP_WAKE_DISORDERS,BIPOLAR_AND_MANIC_DISORDERS"


### Generated Outputs

**Text Report** (`detailed_analysis.txt`):
1. Cohort summary (N, cases/controls, failures)
2. Task-type metric tables (binary/multiclass/regression/hierarchical)
3. Per-disorder breakdown
4. Failure analysis
5. Verdict quality analysis
6. Composite score ↔ accuracy correlation (point-biserial r)
7. Probability calibration statistics
8. Iteration improvement analysis with significance testing
9. Resource usage summary (tokens, duration)
10. Full per-participant results table

**Plots:**

| Plot | What It Shows |
|---|---|
| `composite_vs_accuracy.png` | Scatter: Critic composite score vs. correct/incorrect (with point-biserial correlation) |
| `probability_calibration.png` | Calibration curve + probability histogram (correct vs incorrect) |
| `iteration_improvement.png` | Violin + scatter plot of composite scores across iterations, with Mann-Whitney U significance brackets |
| `verdict_accuracy.png` | Bar chart: SATISFACTORY vs UNSATISFACTORY verdict accuracy rates |


## 3. Interpreting Results

### Composite Score ↔ Accuracy Correlation
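If the Critic composite score is working as intended, higher composite scores should be associated with correct predictions. A statistically significant positive point-biserial correlation (`p < 0.05`) supports this, although a significant result can still correspond to a weak effect, so check the magnitude of `r` as well.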
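
For intuition, the point-biserial coefficient is simply Pearson's r computed between a continuous variable and a 0/1 variable. The sketch below computes it by hand on made-up values. The function name `point_biserial` and the data are illustrative assumptions, and the script's own implementation may differ.

```python
import math


def point_biserial(scores, correct):
    """Point-biserial r between continuous scores and 0/1 correctness flags.

    Numerically identical to Pearson's r with one binary variable.
    """
    n = len(scores)
    mean_s = sum(scores) / n
    mean_c = sum(correct) / n
    cov = sum((s - mean_s) * (c - mean_c) for s, c in zip(scores, correct))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    var_c = sum((c - mean_c) ** 2 for c in correct)
    if var_s == 0 or var_c == 0:
        return float("nan")  # undefined when either variable is constant
    return cov / math.sqrt(var_s * var_c)


# Made-up values for illustration only.
composite = [0.90, 0.85, 0.80, 0.60, 0.55, 0.40]
is_correct = [1, 1, 1, 0, 1, 0]
r = point_biserial(composite, is_correct)
print(f"point-biserial r = {r:.3f}")
```

If `scipy` is installed, `scipy.stats.pointbiserialr` returns both the coefficient and its p-value.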

### Iteration Improvement
The violin plot shows whether the actor-critic feedback loop improves prediction quality across iterations. Significance brackets (Mann-Whitney U) annotate whether the improvement is statistically significant:
- `***` p < 0.001
- `**`  p < 0.01
- `*`   p < 0.05
- `ns`  not significant
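
The bracket legend can be expressed as a small mapping, shown here together with a by-hand Mann-Whitney U statistic for intuition. Both function names are hypothetical, and the sketch assumes the strict inequalities listed above. A p-value requires the distribution of U, which this sketch does not compute; `scipy.stats.mannwhitneyu` provides one.

```python
def significance_label(p_value):
    """Map a p-value to the bracket annotation used in the legend above."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return "ns"


def mann_whitney_u(x, y):
    """U statistic for x: number of (xi, yj) pairs with xi > yj, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Made-up composite scores for two iterations, for illustration only.
later_iteration = [0.70, 0.80, 0.90]
first_iteration = [0.50, 0.60, 0.75]
u = mann_whitney_u(later_iteration, first_iteration)
```

A U value close to `len(x) * len(y)` means nearly every score in the first sample exceeds every score in the second.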

### Probability Calibration (Binary)
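Predictions are well calibrated when predicted probabilities match observed frequencies: among participants assigned an 80% probability of CASE, approximately 80% should actually be CASEs. Deviations of the calibration curve from the diagonal indicate miscalibration. With predicted probability on the x-axis, points below the diagonal mean the model overestimates the probability of CASE, and points above it mean the model underestimates it.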
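
A calibration curve is typically built by binning predictions and comparing the mean predicted probability in each bin with the observed fraction of CASEs. The sketch below uses equal-width bins on made-up data. The function name `calibration_bins`, the bin count, and the binning scheme are assumptions and may not match what `detailed_analysis.py` does.

```python
def calibration_bins(probs, labels, n_bins=5):
    """Group predictions into equal-width probability bins.

    Returns (mean predicted probability, observed CASE fraction, count)
    for each non-empty bin. labels are 1 for CASE and 0 for CONTROL.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for members in bins:
        if members:
            mean_p = sum(p for p, _ in members) / len(members)
            frac_case = sum(y for _, y in members) / len(members)
            rows.append((mean_p, frac_case, len(members)))
    return rows


# Made-up predictions for illustration only.
probs = [0.1, 0.1, 0.3, 0.7, 0.9, 0.9, 0.9, 0.9]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
for mean_p, frac_case, n in calibration_bins(probs, labels):
    print(f"predicted {mean_p:.2f}  observed {frac_case:.2f}  n={n}")
```

Bins where the two values diverge are the ones that pull the curve away from the diagonal. Bins with few participants give noisy estimates, so read them with care.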

### MCC (Matthews Correlation Coefficient, Binary)
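MCC summarizes all four cells of the confusion matrix in a single value, which makes it more robust to class imbalance than accuracy or F1. It ranges from -1 (complete disagreement) through 0 (no better than chance) to +1 (perfect prediction). As a rough rule of thumb, MCC > 0.3 is often treated as acceptable and > 0.5 as good, though suitable thresholds depend on the application.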
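
To illustrate why MCC is worth reading alongside accuracy, consider a made-up imbalanced cohort in which almost every participant is predicted as CONTROL. The function name `accuracy_and_mcc` and the counts are hypothetical, and setting MCC to 0.0 when the denominator is zero is a convention chosen for this sketch.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC); MCC is set to 0.0 if its denominator is zero."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return accuracy, mcc


# Made-up cohort: 10 CASEs and 90 CONTROLs, with only 1 CASE detected.
acc, mcc = accuracy_and_mcc(tp=1, tn=89, fp=1, fn=9)
print(f"accuracy={acc:.2f}  MCC={mcc:.2f}")
```

Here accuracy is high because CONTROLs dominate, while MCC stays low because nine of the ten CASEs are missed.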


## 4. Custom Disorder Groups

You can analyze any subset of disorders by passing a custom `--disorder_groups` list. The disorder names must match those in the targets file (e.g., `MAJOR_DEPRESSIVE_DISORDER`, `ANXIETY_DISORDERS`).

For the standard UK Biobank demonstration cohort, the five groups are:
1. `MAJOR_DEPRESSIVE_DISORDER`
2. `ANXIETY_DISORDERS`
3. `SUBSTANCE_USE_DISORDERS`
4. `SLEEP_WAKE_DISORDERS`
5. `BIPOLAR_AND_MANIC_DISORDERS`
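
When analyzing several subsets, it may be convenient to assemble the command in Python rather than editing the comma-separated string by hand. The sketch below uses only the flags shown earlier in this notebook. The helper name `build_command` is hypothetical, and the paths are the same placeholders used in the cells above.

```python
import shlex

DISORDER_GROUPS = [
    "MAJOR_DEPRESSIVE_DISORDER",
    "ANXIETY_DISORDERS",
    "SUBSTANCE_USE_DISORDERS",
    "SLEEP_WAKE_DISORDERS",
    "BIPOLAR_AND_MANIC_DISORDERS",
]


def build_command(script, results_dir, targets_file, output_dir, groups):
    """Assemble the argument list for one of the validation scripts."""
    return [
        "python", script,
        "--results_dir", results_dir,
        "--targets_file", targets_file,
        "--prediction_type", "binary",
        "--output_dir", output_dir,
        "--disorder_groups", ",".join(groups),
    ]


cmd = build_command(
    "utils/validation/with_annotated_dataset/detailed_analysis.py",
    "../results/participant_runs",
    "../data/__TARGETS__/cases_controls_with_specific_subtypes.txt",
    "../results/analysis/details",
    DISORDER_GROUPS[:2],  # e.g. only the first two groups
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, or the printed string can be pasted after `!` in a notebook cell.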
