# COMPASS Validation Guide (Annotated Datasets)

This notebook is a **didactic, mode-specific guide** for automated annotation validation in COMPASS.

Covered prediction types:

- binary classification
- multiclass classification
- univariate regression
- multivariate regression
- hierarchical mixed-task prediction

All commands assume project root: `multi_agent_system/`.


## 1. Validation Workflow (Common to All Modes)

Validation runs in two layers:

1. `run_validation_metrics.py`
   - computes core metrics + core plots
2. `detailed_analysis.py`
   - computes extended diagnostics + row-level audit artifacts

Recommended review order:

1. annotation contract artifacts (`annotation_contract*`)
2. core metrics JSON (`*_metrics.json`; `binary_metrics_integrated.json` in binary mode)
3. mode-specific plots
4. detailed rows + detailed text reports


## 2. Inputs and Template Files

Required inputs:

- participant outputs in `results/participant_runs/`
- binary mode: `--targets_file` (JSON only)
- non-binary modes: `--annotations_json`

Template examples:

- `utils/validation/with_annotated_dataset/annotation_templates/examples/binary_targets_example.json`
- `utils/validation/with_annotated_dataset/annotation_templates/examples/multiclass_annotations_example.json`
- `utils/validation/with_annotated_dataset/annotation_templates/examples/regression_univariate_annotations_example.json`
- `utils/validation/with_annotated_dataset/annotation_templates/examples/regression_multivariate_annotations_example.json`
- `utils/validation/with_annotated_dataset/annotation_templates/examples/hierarchical_annotations_example.json`
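
The input rule above (binary mode takes `--targets_file`, every other mode takes `--annotations_json`) can be captured in a few lines. The following is a minimal sketch that only assembles argument lists from the script paths and flags shown in this guide; the helper name `build_validation_commands` and its parameters are hypothetical and not part of COMPASS, and the optional `--disorder_groups` flag used in the binary example is omitted.

```python
from pathlib import Path

# Script location as used in the commands of this guide.
SCRIPT_DIR = Path("utils/validation/with_annotated_dataset")


def build_validation_commands(prediction_type, results_dir, labels_file,
                              core_output_dir, detailed_output_dir):
    """Return [core_command, detailed_command] as argument lists (hypothetical helper)."""
    # Binary mode takes --targets_file; every other mode takes --annotations_json.
    labels_flag = "--targets_file" if prediction_type == "binary" else "--annotations_json"
    layers = [
        ("run_validation_metrics.py", core_output_dir),
        ("detailed_analysis.py", detailed_output_dir),
    ]
    return [
        ["python", (SCRIPT_DIR / script).as_posix(),
         "--results_dir", str(results_dir),
         "--prediction_type", prediction_type,
         labels_flag, str(labels_file),
         "--output_dir", str(output_dir)]
        for script, output_dir in layers
    ]


commands = build_validation_commands(
    "multiclass",
    "../results/participant_runs",
    "../data/__TARGETS__/annotated_targets.json",
    "../results/analysis/multiclass",
    "../results/analysis/multiclass_details",
)
for command in commands:
    print(" ".join(command))
    # To actually run a layer: subprocess.run(command, check=True)
```

Keeping the commands as lists rather than shell strings avoids quoting problems if a path contains spaces.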


## 3. Binary Classification Validation

### Automated contract checks

- binary annotation JSON parses correctly
- participant ID alignment (annotation ↔ participant output)
- valid binary truth labels (`CASE` / `CONTROL`)
- extractable binary prediction payload
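
To make the ID-alignment and label checks concrete, here is a minimal sketch. It assumes annotations and predictions have already been loaded into plain dictionaries keyed by participant ID; that layout, the participant IDs, and the function name `check_binary_contract` are illustrative assumptions, not the actual COMPASS file schema or implementation.

```python
VALID_LABELS = {"CASE", "CONTROL"}


def check_binary_contract(truth, predictions):
    """Report alignment and label problems (hypothetical helper).

    truth:       dict mapping participant ID -> annotated label
    predictions: dict mapping participant ID -> predicted label
    """
    truth_ids, pred_ids = set(truth), set(predictions)
    return {
        # Annotated participants with no participant output.
        "missing_predictions": sorted(truth_ids - pred_ids),
        # Participant outputs with no annotation.
        "unannotated_predictions": sorted(pred_ids - truth_ids),
        # Annotations whose label is neither CASE nor CONTROL.
        "invalid_labels": sorted(
            pid for pid, label in truth.items() if label not in VALID_LABELS
        ),
    }


# Illustrative data only.
truth = {"P001": "CASE", "P002": "CONTROL", "P003": "case?"}
predictions = {"P001": "CASE", "P002": "CASE", "P004": "CONTROL"}
report = check_binary_contract(truth, predictions)
print(report)
```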

### Automated outputs (core)

- `binary_metrics_integrated.json`
- `integrated_confusion_matrix.png`

### Automated outputs (detailed)

- `detailed_analysis.txt`
- `detailed_analysis_binary.json`
- binary diagnostics plots (composite vs correctness, calibration, verdict accuracy, iteration improvement; availability depends on data support)

### How to interpret

- start with confusion matrix and sensitivity/specificity
- use the Brier score and expected calibration error (ECE), together with the calibration plot, to assess probability reliability
- inspect verdict/composite diagnostics for critic-quality alignment
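
For readers who want to see how these quantities relate, the sketch below computes sensitivity, specificity, the Brier score, and a binned ECE from labels and predicted CASE probabilities. The 0.5 threshold, the 10 equal-width bins, and the function name `binary_summary` are assumptions made for illustration; the guide does not state which threshold or ECE variant COMPASS uses.

```python
def binary_summary(y_true, p_case, threshold=0.5, n_bins=10):
    """y_true: 1 for CASE, 0 for CONTROL; p_case: predicted probability of CASE."""
    n = len(y_true)
    y_pred = [1 if p >= threshold else 0 for p in p_case]
    tp = sum(t == 1 and y == 1 for t, y in zip(y_true, y_pred))
    tn = sum(t == 0 and y == 0 for t, y in zip(y_true, y_pred))
    fp = sum(t == 0 and y == 1 for t, y in zip(y_true, y_pred))
    fn = sum(t == 1 and y == 0 for t, y in zip(y_true, y_pred))

    # Brier score: mean squared gap between probability and outcome.
    brier = sum((p - t) ** 2 for t, p in zip(y_true, p_case)) / n

    # Binned ECE: per bin, |mean predicted probability - observed CASE rate|,
    # weighted by the share of rows that fall into the bin.
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, p_case):
        bins[min(int(p * n_bins), n_bins - 1)].append((t, p))
    ece = 0.0
    for members in bins:
        if members:
            observed = sum(t for t, _ in members) / len(members)
            predicted = sum(p for _, p in members) / len(members)
            ece += len(members) / n * abs(predicted - observed)

    return {
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        "brier": brier,
        "ece": ece,
    }


# Illustrative data only.
summary = binary_summary([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.6])
print(summary)
```

A low Brier score with a high ECE usually points to good ranking but miscalibrated probabilities, which is why the calibration plot is worth reading alongside both numbers.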


In [None]:
# Binary validation
!python utils/validation/with_annotated_dataset/run_validation_metrics.py \
    --results_dir ../results/participant_runs \
    --prediction_type binary \
    --targets_file ../data/__TARGETS__/binary_targets.json \
    --output_dir ../results/analysis/binary_confusion_matrix \
    --disorder_groups "MAJOR_DEPRESSIVE_DISORDER,ANXIETY_DISORDERS"

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --prediction_type binary \
    --targets_file ../data/__TARGETS__/binary_targets.json \
    --output_dir ../results/analysis/details \
    --disorder_groups "MAJOR_DEPRESSIVE_DISORDER,ANXIETY_DISORDERS"


## 4. Multiclass Classification Validation

### Automated contract checks

- class label presence per participant
- participant ID alignment
- valid prediction extraction for multiclass payload

### Automated outputs (core)

- `multiclass_metrics.json`
- `annotation_contract_multiclass.json`
- multiclass confusion matrix / per-class metrics / calibration plots

### Automated outputs (detailed)

- `detailed_analysis_multiclass.json`
- `detailed_analysis_multiclass.txt`
- `detailed_rows_multiclass.json`
- `detailed_annotation_contract_multiclass.json/.txt`
- extended diagnostics (top confusions, confidence/entropy, label distribution)

### How to interpret

- read macro F1 first, then per-class F1
- inspect top confusions for error structure
- use confidence diagnostics to detect overconfident mistakes
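
The reading order above can be illustrated with a small sketch that computes per-class F1, macro F1, and the most frequent confusions from two label lists. The class names and the helper names `per_class_f1` and `top_confusions` are hypothetical; this is not how COMPASS computes its metrics internally, only a reference for what the numbers mean.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs among the errors."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)


# Illustrative data only.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)  # unweighted mean over classes
print(f1, macro_f1, top_confusions(y_true, y_pred))
```

Macro F1 weights every class equally, so a rare class with poor F1 pulls it down even when overall accuracy looks acceptable; the per-class values show which class is responsible.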


In [None]:
# Multiclass validation
!python utils/validation/with_annotated_dataset/run_validation_metrics.py \
    --results_dir ../results/participant_runs \
    --prediction_type multiclass \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/multiclass

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --prediction_type multiclass \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/multiclass_details


## 5. Univariate Regression Validation

### Automated contract checks

- exactly one numeric target output per participant
- participant ID alignment
- extractable univariate regression prediction

### Automated outputs (core)

- `regression_univariate_metrics.json`
- `annotation_contract_regression_univariate.json`
- parity + error bar plots

### Automated outputs (detailed)

- `detailed_analysis_regression_univariate.json/.txt`
- `detailed_rows_regression_univariate.json`
- `detailed_annotation_contract_regression_univariate.json/.txt`
- residual distribution, residual-vs-true, top-error diagnostics

### How to interpret

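- MAE/RMSE = typical error magnitude in target units (RMSE penalizes large errors more heavily)
- R² = proportion of target variance explained by the predictions
- residual diagnostics = evidence of systematic bias or heteroscedasticity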
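
As a reference for these quantities, the following sketch computes MAE, RMSE, R², and the mean residual from two numeric lists. The function name `regression_summary` and the sample values are illustrative assumptions; the residual is taken as prediction minus truth, which may differ from the sign convention used in the COMPASS plots.

```python
import math


def regression_summary(y_true, y_pred):
    """Return MAE, RMSE, R^2 and mean residual (prediction minus truth)."""
    n = len(y_true)
    residuals = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)              # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(ss_res / n),
        "r2": 1 - ss_res / ss_tot if ss_tot else float("nan"),
        # A mean residual far from zero indicates systematic over- or under-prediction.
        "mean_residual": sum(residuals) / n,
    }


# Illustrative data only: every prediction is 0.5 too high.
stats = regression_summary([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
print(stats)
```

In this example R² is still 0.8 although every prediction is biased upward by the same amount, which is why the residual diagnostics are worth checking even when R² looks good.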


In [None]:
# Univariate regression validation
!python utils/validation/with_annotated_dataset/run_validation_metrics.py \
    --results_dir ../results/participant_runs \
    --prediction_type regression_univariate \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/univariate_regression

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --prediction_type regression_univariate \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/univariate_regression_details


## 6. Multivariate Regression Validation

### Automated contract checks

- two or more numeric outputs per participant
- participant ID alignment
- extractable multivariate regression prediction

### Automated outputs (core)

- `regression_multivariate_metrics.json`
- `annotation_contract_regression_multivariate.json`
- parity + error bar plots

### Automated outputs (detailed)

- `detailed_analysis_regression_multivariate.json/.txt`
- `detailed_rows_regression_multivariate.json`
- `detailed_annotation_contract_regression_multivariate.json/.txt`
- residual diagnostics + top absolute error ranking

### How to interpret

- inspect per-output metrics before macro summary
- use top-error ranking to identify problematic output dimensions
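
The sketch below shows why per-output metrics should be read before the macro summary: one poorly predicted output can hide behind an acceptable average. The output keys `score_a` and `score_b`, the row layout, and the helper name `per_output_mae` are hypothetical and chosen only for illustration.

```python
def per_output_mae(truth_rows, pred_rows):
    """Rows are dicts mapping output key -> value, aligned by participant."""
    keys = sorted(truth_rows[0])
    n = len(truth_rows)
    per_output = {
        key: sum(abs(p[key] - t[key]) for t, p in zip(truth_rows, pred_rows)) / n
        for key in keys
    }
    macro = sum(per_output.values()) / len(per_output)  # unweighted mean over outputs
    return per_output, macro


# Illustrative data only.
truth_rows = [{"score_a": 10.0, "score_b": 1.0}, {"score_a": 20.0, "score_b": 2.0}]
pred_rows = [{"score_a": 11.0, "score_b": 1.0}, {"score_a": 25.0, "score_b": 2.0}]
per_output, macro = per_output_mae(truth_rows, pred_rows)
print(per_output, macro)

# Rank outputs from worst to best to find the problematic dimension.
worst_first = sorted(per_output, key=per_output.get, reverse=True)
print(worst_first)
```

Note that averaging MAE across outputs is only meaningful when the outputs share a comparable scale; otherwise compare each output against its own range.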


In [None]:
# Multivariate regression validation
!python utils/validation/with_annotated_dataset/run_validation_metrics.py \
    --results_dir ../results/participant_runs \
    --prediction_type regression_multivariate \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/multivariate_regression

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --prediction_type regression_multivariate \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/multivariate_regression_details


## 7. Hierarchical Mixed-Task Validation

### Automated contract checks (strict)

- node payload validity per participant
- **cross-participant schema consistency**:
  - same node IDs
  - same node mode per node ID
  - same regression output-key set per regression node

If this consistency is violated, validation stops with an explicit schema mismatch error.

### Schema consistency checklist

1. every participant has the same node set
2. each node keeps the same mode in all rows
3. regression node output keys are identical across rows
4. classification nodes always include a label
5. regression nodes always include numeric values
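
Items 1 to 3 of the checklist can be pre-checked before running validation. The following is a minimal sketch under a loudly stated assumption: annotations are represented as a dictionary of participant ID to node ID to node payload, where each payload has a `mode` field and regression nodes carry a `values` dictionary. These field names, the node IDs, and the function `find_schema_mismatches` are hypothetical; consult the hierarchical template file for the real layout.

```python
def find_schema_mismatches(annotations):
    """Compare every participant against the first one (hypothetical helper).

    annotations: dict participant ID -> dict node ID -> node payload
    Returns a list of (participant ID, description) tuples.
    """
    mismatches = []
    participant_ids = sorted(annotations)
    if not participant_ids:
        return mismatches
    reference = annotations[participant_ids[0]]
    for pid in participant_ids[1:]:
        nodes = annotations[pid]
        # Checklist item 1: same node set.
        if set(nodes) != set(reference):
            mismatches.append((pid, "node set differs"))
            continue
        for node_id in sorted(nodes):
            node, ref_node = nodes[node_id], reference[node_id]
            # Checklist item 2: same mode per node.
            if node.get("mode") != ref_node.get("mode"):
                mismatches.append((pid, f"{node_id}: mode differs"))
            # Checklist item 3: same regression output keys.
            elif node.get("mode") == "regression" and (
                set(node.get("values", {})) != set(ref_node.get("values", {}))
            ):
                mismatches.append((pid, f"{node_id}: regression output keys differ"))
    return mismatches


# Illustrative data only.
annotations = {
    "P001": {"dx": {"mode": "classification", "label": "CASE"},
             "severity": {"mode": "regression", "values": {"score_a": 3.0, "score_b": 1.5}}},
    "P002": {"dx": {"mode": "classification", "label": "CONTROL"},
             "severity": {"mode": "regression", "values": {"score_a": 1.0, "score_b": 0.5}}},
    "P003": {"dx": {"mode": "classification", "label": "CASE"},
             "severity": {"mode": "regression", "values": {"score_a": 2.0}}},
    "P004": {"dx": {"mode": "classification", "label": "CASE"}},
}
problems = find_schema_mismatches(annotations)
print(problems)
```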

### Automated outputs (core)

- `hierarchical_metrics.json`
- `annotation_contract_hierarchical.json`
- node score/support + node-type distribution plots

### Automated outputs (detailed)

- `detailed_analysis_hierarchical.json/.txt`
- `detailed_rows_hierarchical.json`
- `detailed_annotation_contract_hierarchical.json/.txt`
- node coverage + hierarchical metric heatmap diagnostics

### How to interpret

- inspect per-node metrics before macro score
- inspect coverage and support before concluding model quality


In [None]:
# Hierarchical mixed-task validation
!python utils/validation/with_annotated_dataset/run_validation_metrics.py \
    --results_dir ../results/participant_runs \
    --prediction_type hierarchical \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/hierarchical

!python utils/validation/with_annotated_dataset/detailed_analysis.py \
    --results_dir ../results/participant_runs \
    --prediction_type hierarchical \
    --annotations_json ../data/__TARGETS__/annotated_targets.json \
    --output_dir ../results/analysis/hierarchical_details


## 8. Cross-Mode Artifact Inspection Helper

Use this helper to quickly inspect metrics, contract validity, and top annotation issues across generated outputs.


In [None]:
# Quick artifact inspection helper (metrics + contract + detailed rows)
import json
from pathlib import Path

out_dir = Path('../results/analysis')


def load_payload(path):
    """Return the JSON object in `path`, or None if it is unreadable or not an object."""
    try:
        payload = json.loads(path.read_text(encoding='utf-8'))
    except (OSError, ValueError) as exc:  # json.JSONDecodeError is a ValueError
        print(f'- {path.name}: skipped ({exc})')
        return None
    if not isinstance(payload, dict):
        print(f'- {path.name}: skipped (top-level JSON value is not an object)')
        return None
    return payload


print('\n[METRICS]')
# '*_metrics*.json' also matches binary_metrics_integrated.json,
# which the narrower '*_metrics.json' pattern would miss.
for p in sorted(out_dir.rglob('*_metrics*.json'))[:50]:
    payload = load_payload(p)
    if payload is None:
        continue
    ptype = payload.get('prediction_type')
    n_rows = payload.get('n_rows')
    contract = payload.get('annotation_contract')
    valid = contract.get('n_valid_rows') if isinstance(contract, dict) else None
    total = contract.get('n_rows') if isinstance(contract, dict) else None
    if valid is not None:
        print(f'- {p.name}: type={ptype} rows={n_rows} contract={valid}/{total}')
    else:
        print(f'- {p.name}: type={ptype} rows={n_rows}')

print('\n[ANNOTATION CONTRACT ISSUES]')
for p in sorted(out_dir.rglob('annotation_contract_*.json'))[:50]:
    payload = load_payload(p)
    if payload is None:
        continue
    issues = payload.get('issue_counts') if isinstance(payload.get('issue_counts'), dict) else {}
    # Show the five most frequent issue types first.
    top = sorted(issues.items(), key=lambda kv: int(kv[1] or 0), reverse=True)[:5]
    top_str = ', '.join([f'{k}={v}' for k, v in top]) if top else 'none'
    print(f"- {p.name}: valid={payload.get('n_valid_rows')}/{payload.get('n_rows')}; top_issues={top_str}")

print('\n[DETAILED ROW PAYLOADS]')
for p in sorted(out_dir.rglob('detailed_rows_*.json'))[:50]:
    payload = load_payload(p)
    if payload is None:
        continue
    print(f"- {p.name}: n_rows={payload.get('n_rows')}")


## 9. HPC and Batch Integration Notes

- `hpc/05_submit_batch.sh` forwards `PREDICTION_TYPE` to validation scripts.
- Binary validation expects JSON `TARGETS_FILE`.
- Non-binary validation expects `ANNOTATIONS_JSON`.
- The same annotation-contract logic is shared across local, batch, and HPC workflows.

Example:

```bash
PREDICTION_TYPE=regression_univariate \
ANNOTATIONS_JSON=~/compass_pipeline/data/__TARGETS__/annotated_targets.json \
bash hpc/05_submit_batch.sh
```


## 10. XAI Scope

XAI methods currently apply only to pure root-level binary classification.

Multiclass, regression, and hierarchical validation remain fully supported; their outputs carry explicit `xai_status: skipped` metadata.
