# COMPASS on HPC: Operational Guide

**Audience:** researchers and engineers running COMPASS on Slurm-managed GPU clusters.

This notebook is an operations guide. It explains the infrastructure components, the execution workflow, and the monitoring and debugging steps needed to run COMPASS reliably on HPC.

> Run everything from a terminal and through Slurm jobs. The command cells below are examples to copy into your shell.


## Table of Contents

1. Why HPC for COMPASS
2. HPC Infrastructure Components
3. Storage and Data Layout
4. End-to-End Workflow (00 -> 05)
5. Command Walkthrough
6. Slurm Directives Explained
7. Monitoring and Log Interpretation
8. Troubleshooting Playbook
9. Performance and Stability Tuning
10. Reproducibility and Validation
11. Adaptation Checklist for Other Clusters


## 1) Why HPC for COMPASS

COMPASS can run with either public APIs or local models. HPC local execution is useful when you need:

- strict data locality / privacy constraints,
- reproducible runtime control,
- budget control for repeated clinical validation runs,
- deterministic infrastructure for benchmarking.

Tradeoff: local single-GPU inference is usually slower than hosted APIs.


## 2) HPC Infrastructure Components

### Login node
- Used for SSH access, editing files, and submitting jobs.
- Often has restrictions on heavy compute.

### Compute node
- Runs the workload after Slurm allocation.
- Provides GPUs/CPUs/RAM according to your `#SBATCH` request.

### Slurm scheduler
- Queues, allocates, and tracks jobs.
- Common commands: `sbatch`, `squeue`, `sacct`, `scancel`.

### Apptainer container runtime
- Runs a reproducible environment on compute nodes.
- Uses `--nv` to expose NVIDIA drivers inside the container.
- Uses `--bind` to mount your project/models/data into the container.

### Python environment
- A virtual environment provides Python dependencies for COMPASS.
- You typically activate it inside the container execution block.
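
As a rough sketch of how the container and the virtualenv fit together, the function below wraps a command in `apptainer exec` with `--nv` and `--bind`, then activates the virtualenv inside the container. The variable names, the `bin/activate` location, the `DRY_RUN` switch, and the `python --version` payload are assumptions for illustration; the real submission scripts may be organized differently, and `<image>` remains a placeholder for your image name.

```bash
# Paths follow the layout used in this guide; override them via the environment.
PROJECT_DIR="${PROJECT_DIR:-$HOME/compass_pipeline/multi_agent_system}"
MODELS_DIR="${MODELS_DIR:-$HOME/compass_models}"
VENV_DIR="${VENV_DIR:-$HOME/compass_venv}"
SIF="${SIF:-$HOME/compass_containers/<image>.sif}"   # placeholder: set to your image

# Run a shell snippet inside the container with GPU access and the venv active.
# With DRY_RUN set, the command is printed instead of executed.
run_in_container() {
  ${DRY_RUN:+echo} apptainer exec --nv \
    --bind "$PROJECT_DIR:$PROJECT_DIR" \
    --bind "$MODELS_DIR:$MODELS_DIR" \
    --bind "$VENV_DIR:$VENV_DIR" \
    "$SIF" \
    bash -c "source '$VENV_DIR/bin/activate' && cd '$PROJECT_DIR' && $1"
}

# Dry run: show (approximately) what would be executed on a compute node.
DRY_RUN=1 run_in_container "python --version"
```

Printing the command first makes it easy to confirm the bind paths before spending an allocation on a job that fails at startup.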


## 3) Storage and Data Layout

Typical recommended layout on shared storage:

- Project: `~/compass_pipeline/multi_agent_system`
- Container image: `~/compass_containers/<image>.sif`
- Virtualenv: `~/compass_venv/`
- Models: `~/compass_models/`
- Logs: `~/compass_pipeline/multi_agent_system/logs/`
- Participant data: `../data/__FEATURES__/HPC_data/participant_ID<id>/`

Notes:
- Many clusters use shared parallel filesystems; large model and data directories should live on those.
- Keep secrets in environment variables or a local `.env` file; never commit them.
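
A quick way to confirm this layout before submitting anything is a small directory check. The sketch below is illustrative only and is not the contents of `01_check_status.sh`; the function name `check_layout` is hypothetical, and the paths are the defaults listed above.

```bash
# Report whether each directory exists and is readable.
# Returns non-zero if at least one directory is missing.
check_layout() {
  missing=0
  for dir in "$@"; do
    if [ -d "$dir" ] && [ -r "$dir" ]; then
      echo "ok       $dir"
    else
      echo "MISSING  $dir"
      missing=$((missing + 1))
    fi
  done
  [ "$missing" -eq 0 ]
}

check_layout \
  "$HOME/compass_pipeline/multi_agent_system" \
  "$HOME/compass_containers" \
  "$HOME/compass_venv" \
  "$HOME/compass_models" \
  "$HOME/compass_pipeline/multi_agent_system/logs" \
  || echo "Some directories are missing; see the lines marked MISSING."
```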


## 4) End-to-End Workflow (00 -> 05)

The workflow scripts live in `hpc/` and are numbered in the order they should run.

| Step | Script | Goal |
|---|---|---|
| 00 | `00_deploy_and_run.sh` | Optional: copy repo/data to HPC and SSH in |
| 01 | `01_check_status.sh` | Validate prerequisites before spending GPU hours |
| 02 | `02_setup_environment.sh` | Build container + virtualenv runtime |
| 03 | `03_download_models.sh` | Fetch and patch model artifacts |
| 04 | `04_submit_single.sh` | Smoke-test one participant end-to-end |
| 05 | `05_submit_batch.sh` | Run sequential cohort validation on single GPU |

**Recommended policy:** never run step `05` before step `04` succeeds cleanly.
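
One way to enforce this policy is to check the final Slurm state of the step `04` job before launching step `05`. The sketch below assumes you noted the job ID when step `04` was submitted and exported it as `SMOKE_JOB_ID` (a hypothetical variable, not something the scripts define). A `COMPLETED` state only means the job exited with status 0, so it complements reading the logs rather than replacing it.

```bash
# Print the final state of a Slurm job (for example COMPLETED, FAILED, TIMEOUT).
job_state() {
  # -X: job allocation only (no steps), -n: no header, -P: parsable output
  sacct -j "$1" -X -n -P --format=State | head -n 1
}

# Succeed only for an exact COMPLETED state; "CANCELLED by <uid>" and others fail.
is_clean_state() {
  [ "$1" = "COMPLETED" ]
}

SMOKE_JOB_ID="${SMOKE_JOB_ID:-}"   # hypothetical: job ID of the step 04 run
if [ -n "$SMOKE_JOB_ID" ] && command -v sacct >/dev/null 2>&1; then
  state="$(job_state "$SMOKE_JOB_ID")"
  if is_clean_state "$state"; then
    bash hpc/05_submit_batch.sh
  else
    echo "Step 04 job $SMOKE_JOB_ID ended as '$state'; not submitting the batch." >&2
  fi
fi
```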


## 5) Command Walkthrough

Set project root first:


```bash
# Terminal commands (copy/paste in shell)
cd ~/compass_pipeline/multi_agent_system
```

```bash
# Step 00 (optional): deploy from your workstation
bash hpc/00_deploy_and_run.sh
```

```bash
# Step 01: pre-flight checks
bash hpc/01_check_status.sh
```

```bash
# Step 02: environment setup (container + venv)
bash hpc/02_setup_environment.sh
```

```bash
# Step 03: download models
bash hpc/03_download_models.sh
```

```bash
# Step 04: single participant smoke test
bash hpc/04_submit_single.sh
```

```bash
# Step 05: sequential batch execution
bash hpc/05_submit_batch.sh
```


## 6) Slurm Directives Explained

Submission scripts use `#SBATCH` headers. Typical single-GPU fields:

- `--partition`: queue/partition name for your cluster.
- `--gres=gpu:<type>:1`: request one GPU of a specific type.
- `--cpus-per-task`: CPU cores allocated to the task, used for preprocessing and runtime overhead.
- `--mem`: host RAM budget (separate from GPU VRAM).
- `--time`: wall-clock limit; Slurm terminates the job when it is reached.
- `--output` / `--error`: stdout/stderr log destinations.

If your cluster names resources differently, adapt these values first.
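
For orientation, a single-GPU header using these fields might look like the sketch below. Every value is a placeholder or an assumption: `<partition>` and `<type>` must be replaced with your site's names, the CPU, memory, and time numbers are arbitrary, and the log filenames simply mirror the `compass_single_<JOBID>` naming used in this guide (`%j` is Slurm's filename pattern for the job ID).

```bash
#!/bin/bash
#SBATCH --job-name=compass_single
#SBATCH --partition=<partition>
#SBATCH --gres=gpu:<type>:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --time=04:00:00
#SBATCH --output=logs/compass_single_%j.out
#SBATCH --error=logs/compass_single_%j.err

# Record where the job landed; SLURM_JOB_ID is set by Slurm inside a job.
echo "Job ${SLURM_JOB_ID:-unset} running on $(uname -n)"
```

Slurm does not create the `logs/` directory for you, so it should exist before submission.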


## 7) Monitoring and Log Interpretation

### Runtime monitoring
```bash
squeue -u $USER
```

### Live logs
```bash
tail -f logs/compass_single_<JOBID>.out
tail -f logs/compass_batch_<JOBID>.out
```

### Error logs
```bash
cat logs/compass_single_<JOBID>.err
cat logs/compass_batch_<JOBID>.err
```

Notes:
- `.out` captures runtime milestones and tool/agent progress.
- `.err` captures exceptions/tracebacks and scheduler/runtime failures.
- Inspect logs with `cat`, `less`, or `tail`; do not execute them as scripts.
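
When many jobs have run, a quick scan of the `.err` files can show which ones deserve a closer look. The sketch below counts lines matching a few failure patterns; the patterns (Python tracebacks, out-of-memory messages, Slurm cancellation notices) are illustrative guesses rather than an exhaustive list, and `count_error_lines` is a hypothetical helper name.

```bash
# Count lines in a log file that look like failures (case-insensitive).
count_error_lines() {
  grep -c -i -E 'Traceback|out of memory|oom_kill|CANCELLED' "$1"
}

# Summarize every COMPASS error log in logs/, if any exist.
for log in logs/compass_*.err; do
  [ -f "$log" ] || continue
  n="$(count_error_lines "$log")" || true   # grep exits 1 when the count is 0
  echo "$log: $n suspicious line(s)"
done
```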


## 8) Troubleshooting Playbook

### Job exits immediately
- Check path assumptions (`PROJECT_DIR`, `MODELS_DIR`, `VENV_DIR`, container path).
- Run `01_check_status.sh` again.

### `apptainer` command missing
- Verify cluster module/environment policy for compute nodes.
- Confirm script is actually running on a compute node, not a login node.
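
A few lines at the top of a job script can make both checks visible in the `.out` log. This is a minimal sketch: it relies on Slurm setting `SLURM_JOB_ID` inside a job, and the hint about `module load` is an assumption about how some sites provide `apptainer`, not a statement about your cluster.

```bash
# Succeed when running inside a Slurm job allocation.
in_slurm_job() {
  [ -n "${SLURM_JOB_ID:-}" ]
}

echo "host: $(uname -n)"
if in_slurm_job; then
  echo "inside Slurm job $SLURM_JOB_ID"
else
  echo "SLURM_JOB_ID is not set: probably a login node or a shell outside Slurm"
fi

if command -v apptainer >/dev/null 2>&1; then
  apptainer --version
else
  echo "apptainer not found on PATH; some sites require a 'module load' first"
fi
```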

### Model initialization failure
- Confirm model directories are complete and readable.
- Check quantization/runtime flags in logs.
- Reduce context window if memory pressure appears.
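
To judge whether memory pressure is the cause, it can help to look at GPU memory use on the compute node while the model loads. The sketch below turns `nvidia-smi` query output into a used-memory percentage per GPU; `mem_percent` is a hypothetical helper, and the command has to run on the node that holds the allocation.

```bash
# Convert "used, total" pairs (one line per GPU) into a percentage per GPU.
mem_percent() {
  awk -F',' '{ printf "%d\n", 100 * $1 / $2 }'
}

if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total \
             --format=csv,noheader,nounits | mem_percent
else
  echo "nvidia-smi not found: run this on a GPU compute node"
fi
```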

### JSON/structured-output instability in local model
- Lower generation temperature.
- Tighten output token limits for tool calls.
- Keep retry/repair strategy enabled in orchestrated flow.


## 9) Performance and Stability Tuning

Primary levers:

- context size (`--max_tokens` and the local engine's maximum model length),
- per-role token budgets (`--max_agent_input`, `--max_agent_output`, `--max_tool_input`, `--max_tool_output`),
- runtime backend flags (local engine, quantization, eager mode, memory utilization),
- chunking and evidence extraction strategy.

Guideline:
- First optimize for **successful deterministic completion**.
- Then optimize latency while keeping output quality acceptable.
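
One way to keep these levers manageable is to define the token budgets in a single place and print them at the start of each run, so the `.out` log records what was used. In the sketch below only the flag names come from this guide; the numeric defaults are hypothetical, and how the flags reach `main.py` or the submission scripts is left to your setup.

```bash
# Hypothetical defaults: tune these for your model and GPU.
MAX_TOKENS="${MAX_TOKENS:-8192}"
MAX_AGENT_INPUT="${MAX_AGENT_INPUT:-4096}"
MAX_AGENT_OUTPUT="${MAX_AGENT_OUTPUT:-1024}"
MAX_TOOL_INPUT="${MAX_TOOL_INPUT:-4096}"
MAX_TOOL_OUTPUT="${MAX_TOOL_OUTPUT:-1024}"

# Assemble the token-budget flags as one string for logging or reuse.
token_flags() {
  echo "--max_tokens $MAX_TOKENS" \
       "--max_agent_input $MAX_AGENT_INPUT --max_agent_output $MAX_AGENT_OUTPUT" \
       "--max_tool_input $MAX_TOOL_INPUT --max_tool_output $MAX_TOOL_OUTPUT"
}

echo "token budget flags: $(token_flags)"
```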


## 10) Reproducibility and Validation

Best practices:

- Keep scripts and config under version control.
- Capture exact script/runtime info in `.out` logs (scripts print fingerprints and runtime settings).
- Keep per-participant artifacts in `results/` for auditability.
- Validate with fixed cohorts and compare confusion-matrix metrics across runs.
- Change one parameter family at a time when benchmarking.
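
If you want an additional record alongside what the scripts already print, a short fingerprint of the code state can be written to the log at job start. This is a minimal sketch and not the scripts' own implementation: `print_fingerprint` is a hypothetical helper, and it assumes `git` and `sha256sum` are available on the node.

```bash
# Print a timestamp, the git commit (if inside a repository), and file checksums.
print_fingerprint() {
  echo "date_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "git_commit: $(git rev-parse HEAD)"
    # Non-empty porcelain output means there are uncommitted changes.
    if [ -n "$(git status --porcelain)" ]; then
      echo "git_dirty: yes"
    fi
  fi
  for f in "$@"; do
    if [ -f "$f" ]; then
      sha256sum "$f"
    fi
  done
  return 0
}

print_fingerprint hpc/04_submit_single.sh hpc/05_submit_batch.sh
```

Comparing these lines across two runs shows quickly whether a metric change could come from a code or script difference rather than from the parameter under test.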


## 11) Adaptation Checklist for Other Clusters

Before first run on a new HPC:

1. Update Slurm directives (`partition`, GPU resource syntax, memory/time limits).
2. Update path variables in `hpc/04_submit_single.sh` and `hpc/05_submit_batch.sh`.
3. Confirm data root and participant folder conventions.
4. Validate `01 -> 04` end-to-end before any `05` batch run.
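
For item 1, `sinfo` can list what the new cluster actually offers before you edit any directives. The sketch below wraps it in a hypothetical `list_partitions` function and prints the partition name, generic resources (where GPU types appear), CPUs per node, memory per node, and the time limit.

```bash
# List partitions with their GPU resources and limits, if Slurm is available.
list_partitions() {
  if command -v sinfo >/dev/null 2>&1; then
    # %P partition, %G generic resources, %c CPUs/node, %m memory/node (MB), %l time limit
    sinfo -o "%P %G %c %m %l"
  else
    echo "sinfo not found: run this on the cluster login node"
  fi
}

list_partitions
```

The values in the `%G` column indicate the GPU type string to use in `--gres=gpu:<type>:1`.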

Related docs:
- `hpc/README.md`
- `utils/batch_run.py`
- `main.py`
