# COMPASS on HPC: Operational Guide

**Audience:** researchers and engineers running COMPASS on Slurm-managed GPU clusters.

This notebook is a didactic operations guide. It explains the architecture, workflow, and decision points required to run COMPASS reliably on HPC infrastructure.

> Run the workflow from a terminal and through Slurm jobs, not from notebook cells. The command cells below are examples to copy into your shell.


## Table of Contents

1. Why HPC for COMPASS
2. Core HPC Components
3. End-to-End Workflow (`00` -> `04`)
4. Command Walkthrough
5. Slurm Directives Explained
6. Monitoring and Log Interpretation
7. Troubleshooting Playbook
8. Performance and Stability Tuning
9. Reproducibility and Validation
10. Adaptation Checklist for Other Clusters


## 1) Why HPC for COMPASS

COMPASS can run with either public APIs or local models. Local execution on HPC is useful when you need:

- strict data locality / privacy constraints,
- reproducible runtime control,
- budget control for repeated clinical validation runs,
- deterministic infrastructure for benchmarking.

Tradeoff: local single-GPU inference is usually slower than hosted APIs.


## 2) Core HPC Components

### Login Node
- Used for SSH access, editing files, and submitting jobs.
- Not intended for heavy GPU compute.

### Compute Node
- Runs the actual workload after Slurm allocation.
- Provides GPUs/CPUs/RAM according to your `#SBATCH` request.

### Slurm
- Scheduler that queues, allocates, and tracks jobs.
- Main commands: `sbatch`, `squeue`, `sacct`, `scancel`.

### Apptainer
- Container runtime for reproducible environments on HPC.
- Keeps runtime dependency stack stable across nodes.

### Python Virtual Environment
- Project-level dependency layer used inside the container runtime.


## 3) End-to-End Workflow (`00` -> `04`)

The scripts live in the `hpc/` directory and are numbered in execution order.

| Step | Script | Goal |
|---|---|---|
| 00 | `00_check_status.sh` | Validate prerequisites before spending GPU hours |
| 01 | `01_setup_environment.sh` | Build container + virtualenv runtime |
| 02 | `02_download_models.sh` | Fetch and patch model artifacts |
| 03 | `03_submit_single.sh` | Smoke-test one participant |
| 04 | `04_submit_batch.sh` | Run sequential cohort validation on a single GPU |

**Recommended policy:** never run step `04` before step `03` succeeds cleanly.
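
One way to enforce this policy is to check the final Slurm state of the smoke-test job before launching the batch. The sketch below assumes you noted the job ID that step `03` produced; `job_state` and `gate_batch` are hypothetical helper names and are not part of the COMPASS scripts.

```bash
# Hypothetical helpers (not part of the COMPASS scripts).

# Print the final state of a job allocation, e.g. COMPLETED or FAILED.
# -X: allocation only (skip job steps), -n: no header, -P: parsable output.
job_state() {
    sacct -j "$1" -X -n -P --format=State | head -n 1
}

# Return 0 only if the given job finished in state COMPLETED.
gate_batch() {
    local state
    state="$(job_state "$1")"
    if [ "$state" = "COMPLETED" ]; then
        echo "smoke test OK; safe to run step 04"
    else
        echo "smoke test state is '${state:-unknown}'; do not run step 04" >&2
        return 1
    fi
}

# Example:
# gate_batch <JOBID> && bash hpc/04_submit_batch.sh
```

A `COMPLETED` state only means the job exited with code 0, so it is still worth reading the step `03` logs before trusting the result.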


## 4) Command Walkthrough

Set project root first:


```bash
# Terminal commands (copy/paste in shell)
cd ~/compass_pipeline/multi_agent_system
```

```bash
# Step 0: Pre-flight
bash hpc/00_check_status.sh
```

```bash
# Step 1: Environment setup
bash hpc/01_setup_environment.sh
```

```bash
# Step 2: Model download/prep
bash hpc/02_download_models.sh
```

```bash
# Step 3: Single participant smoke test
bash hpc/03_submit_single.sh
```

```bash
# Step 4: Sequential batch execution
bash hpc/04_submit_batch.sh
```
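
The preparation steps (`00` to `02`) can be chained so that the first failure stops the sequence. This is a minimal sketch: `run_steps` is a hypothetical helper, and it assumes each script returns a non-zero exit code when it fails. Steps `03` and `04` are left out on purpose because, judging by their names, they submit jobs whose outcome is only known later.

```bash
# Hypothetical helper: run scripts in order and stop at the first failure.
run_steps() {
    local script
    for script in "$@"; do
        echo "==> $script"
        if ! bash "$script"; then
            echo "FAILED: $script (stopping)" >&2
            return 1
        fi
    done
    echo "all steps passed"
}

# Example, from the project root:
# run_steps hpc/00_check_status.sh hpc/01_setup_environment.sh hpc/02_download_models.sh
```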


## 5) Slurm Directives Explained

Submission scripts use `#SBATCH` headers. Typical single-GPU fields:

- `--partition`: queue/partition name for your cluster.
- `--gres=gpu:<type>:1`: request one GPU of a specific type.
- `--cpus-per-task`: CPU workers for preprocessing/runtime overhead.
- `--mem`: host RAM budget (separate from GPU VRAM).
- `--time`: wall-clock limit; Slurm terminates the job when it is reached.
- `--output` / `--error`: stdout/stderr log destinations.

If your cluster names resources differently, adapt these values first.
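
For orientation, the directives above could appear together in a script header like the one below. Every value is a placeholder chosen for illustration (partition name, GPU type, CPU count, memory, and time are assumptions, not COMPASS defaults); only the log file naming follows the `logs/compass_single_<JOBID>` pattern used in this guide.

```bash
#!/bin/bash
# Illustrative header only: all values are placeholders, not COMPASS defaults.
#SBATCH --job-name=compass_single
# Hypothetical partition name; use the one your site provides.
#SBATCH --partition=gpu
# Hypothetical GPU type; requests exactly one device.
#SBATCH --gres=gpu:a100:1
#SBATCH --cpus-per-task=8
# Host RAM, not GPU VRAM.
#SBATCH --mem=64G
# Wall-clock cap in HH:MM:SS.
#SBATCH --time=04:00:00
# %j expands to the job ID.
#SBATCH --output=logs/compass_single_%j.out
#SBATCH --error=logs/compass_single_%j.err

echo "Job ${SLURM_JOB_ID:-unset} running on $(uname -n)"
```

Slurm does not create the `logs/` directory for you, so make sure it exists before submitting.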


## 6) Monitoring and Log Interpretation

### Runtime monitoring
```bash
squeue -u $USER
```

### Live logs
```bash
tail -f logs/compass_single_<JOBID>.out
tail -f logs/compass_batch_<JOBID>.out
```

### Error logs
```bash
cat logs/compass_single_<JOBID>.err
cat logs/compass_batch_<JOBID>.err
```

Notes:
- `.out` captures runtime milestones and tool/agent progress.
- `.err` captures exceptions/tracebacks and scheduler/runtime failures.
- Inspect logs with `cat`, `less`, or `tail`; do not execute them as scripts.
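
For a quick triage of an `.err` file, counting tracebacks can tell you whether a run failed once or repeatedly. The sketch below assumes failures surface as standard Python tracebacks; `count_tracebacks` is a hypothetical helper name.

```bash
# Hypothetical helper: count Python tracebacks in a log file.
count_tracebacks() {
    # grep -c prints the number of matching lines; it exits 1 when that
    # number is 0, so "|| true" keeps the helper usable under "set -e".
    grep -c 'Traceback (most recent call last)' "$1" || true
}

# Example:
# count_tracebacks logs/compass_batch_<JOBID>.err
# tail -n 40 logs/compass_batch_<JOBID>.err   # context around the last failure
```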


## 7) Troubleshooting Playbook

### Job exits immediately
- Check path assumptions (`PROJECT_DIR`, `MODELS_DIR`, `VENV_DIR`, container path).
- Run `00_check_status.sh` again.
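
A path check can also be done by hand. The sketch below assumes you have set `PROJECT_DIR`, `MODELS_DIR`, and `VENV_DIR` in your current shell to the same values the submission scripts use; `check_path_vars` is a hypothetical helper, not something the scripts provide.

```bash
# Hypothetical helper: report whether each named variable points at a directory.
check_path_vars() {
    local name value status=0
    for name in "$@"; do
        # Portable indirect lookup: read the variable whose name is in $name.
        # Only pass trusted variable names, since this uses eval.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "UNSET    $name"
            status=1
        elif [ ! -d "$value" ]; then
            echo "MISSING  $name=$value"
            status=1
        else
            echo "OK       $name=$value"
        fi
    done
    return "$status"
}

# Example:
# check_path_vars PROJECT_DIR MODELS_DIR VENV_DIR
```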

### `apptainer` command missing
- Check how your cluster makes Apptainer available on compute nodes (environment module or site policy).
- Confirm the script is actually running on a compute node, not the login node.
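
Both checks can be scripted. In the sketch below, `in_slurm_job` relies on Slurm setting `SLURM_JOB_ID` inside a job, and `require_cmd` reports where a command resolves on the current host; both names are hypothetical helpers.

```bash
# Hypothetical helpers for diagnosing a missing command.

# Succeeds only inside a Slurm job, where Slurm sets SLURM_JOB_ID.
in_slurm_job() {
    [ -n "${SLURM_JOB_ID:-}" ]
}

# Report where (and whether) a command resolves on the current host.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1") on $(uname -n)"
    else
        echo "$1 not on PATH on $(uname -n)" >&2
        return 1
    fi
}

# Example, placed near the top of a submission script:
# in_slurm_job || echo "warning: not inside a Slurm job" >&2
# require_cmd apptainer || exit 1
```

If the command is missing only on compute nodes, your site may require loading an environment module first; the module name is site-specific.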

### Model initialization failure
- Confirm model directories are complete and readable.
- Check quantization/runtime flags in logs.
- Reduce context window if memory pressure appears.

### JSON/structured-output instability in local model
- Lower generation temperature.
- Tighten output token limits for tool calls.
- Keep retry/repair strategy enabled in orchestrated flow.


## 8) Performance and Stability Tuning

Primary levers:

- context size (`--max_tokens` and the local engine's maximum model length),
- per-role token budgets (`--max_agent_input`, `--max_agent_output`, `--max_tool_input`, `--max_tool_output`),
- runtime backend flags (local engine, quantization, eager mode, memory utilization),
- chunking and evidence extraction strategy.

Guideline:
- First optimize for **successful deterministic completion**.
- Then optimize latency while keeping output quality acceptable.


## 9) Reproducibility and Validation

Best practices:

- Keep scripts and config under version control.
- Capture exact script/runtime info in `.out` logs (already implemented in scripts).
- Keep per-participant artifacts in `results/` for auditability.
- Validate with fixed cohorts and compare confusion-matrix metrics across runs.
- Change one parameter family at a time when benchmarking.
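
If you also want a machine-readable record stored next to the artifacts, a small manifest per run is one option. This is a sketch under assumptions: `write_run_manifest` is a hypothetical helper, the key names are arbitrary, and the commit is recorded only when the command runs inside a git checkout.

```bash
# Hypothetical helper: write a small key=value manifest for one run.
write_run_manifest() {
    local out="$1"
    {
        echo "timestamp_utc=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "host=$(uname -n)"
        echo "job_id=${SLURM_JOB_ID:-none}"
        # Falls back to "unknown" outside a git checkout or without git.
        echo "git_commit=$(git rev-parse --verify HEAD 2>/dev/null || echo unknown)"
    } > "$out"
}

# Example:
# write_run_manifest "results/run_manifest_${SLURM_JOB_ID:-manual}.txt"
```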


## 10) Adaptation Checklist for Other Clusters

Before first run on a new HPC:

1. Update Slurm directives (`partition`, GPU resource syntax, memory/time limits).
2. Update path variables in `hpc/03_submit_single.sh` and `hpc/04_submit_batch.sh`.
3. Confirm data root and participant folder conventions.
4. Validate `00 -> 03` end-to-end before any `04` batch run.
5. Document your site-specific overrides in a local runbook.
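
For item 1, `sinfo` can show what a new cluster actually offers before you edit any directive. The sketch below wraps it in a hypothetical helper, `list_partitions`, that fails with a clear message when run on a machine without Slurm.

```bash
# Hypothetical helper: list partitions with their GPU resources and limits.
list_partitions() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a Slurm login node" >&2
        return 1
    fi
    # %P partition, %G generic resources (e.g. GPUs),
    # %l maximum job time, %m memory per node in MB
    sinfo -o "%P %G %l %m"
}

# Example:
# list_partitions
```

The `%G` column shows the GPU resource strings your site uses, which is what goes into `--gres`.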

Related docs:
- `hpc/README.md`
- `utils/batch_run.py`
- `main.py`
