Summary
For completed jobs, hpc status <id> should also show the following fields from sacct:
- State (COMPLETED / FAILED / OUT_OF_MEMORY / TIMEOUT / CANCELLED, etc.)
- ExitCode
- Elapsed
- MaxRSS
- ReqMem
Motivation
To diagnose why a job failed (OOM vs. timeout, etc.), users currently have to SSH in and run sacct by hand.
We are not going to build generalized "tuning suggestions" (the old hpc tune idea), but just surfacing the raw sacct numbers is already useful on its own.
Implementation notes
- Run sacct -j <jobid> --format=State,ExitCode,Elapsed,MaxRSS,ReqMem --noheader -P over SSH.
- Slurm-only feature. PJM should be handled separately (or skipped for now).
- If the job is still queued or running, fall back to the existing squeue-based display.
- Structured output (e.g. --json) can be revisited together with MCP integration; a human-readable table is fine for now.
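
A minimal sketch of the notes above, in Python, in case it helps scope the work. It only builds the command string, parses the pipe-delimited output, and decides whether to fall back to squeue; running the command over SSH is left to whatever transport hpc already uses. The names FIELDS, ACTIVE_STATES, build_sacct_command, parse_sacct_output, and should_fall_back_to_squeue are hypothetical, the set of "active" states is an assumption that may need more entries, and the sample output is made up for illustration.

```python
import shlex

# Fields requested from sacct, in the order given in the implementation notes.
FIELDS = ["State", "ExitCode", "Elapsed", "MaxRSS", "ReqMem"]

# Assumption: states treated as "not finished yet". Other Slurm states
# (e.g. SUSPENDED, COMPLETING) may need to be added here.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def build_sacct_command(job_id: str) -> str:
    """Return the sacct command line to run on the cluster login node."""
    return (
        f"sacct -j {shlex.quote(job_id)} "
        f"--format={','.join(FIELDS)} --noheader -P"
    )


def parse_sacct_output(text: str) -> list[dict[str, str]]:
    """Parse '|'-delimited sacct output into one dict per row.

    sacct usually prints one row for the job plus one per job step,
    and MaxRSS is typically populated only on step rows, so all rows are kept.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        values = line.split("|")
        if len(values) != len(FIELDS):
            continue  # skip anything that does not look like a data row
        rows.append(dict(zip(FIELDS, values)))
    return rows


def should_fall_back_to_squeue(rows: list[dict[str, str]]) -> bool:
    """True if sacct has no record yet or the job is still queued/running."""
    if not rows:
        return True
    # State can carry a suffix such as "CANCELLED by 1000"; keep the first word.
    state = rows[0]["State"].split()[0]
    return state in ACTIVE_STATES


# Made-up sample output: a job row followed by a batch-step row.
sample = (
    "OUT_OF_MEMORY|0:125|00:03:12||4G\n"
    "OUT_OF_MEMORY|0:125|00:03:12|4012345K|\n"
)
rows = parse_sacct_output(sample)
```

With rows like these, hpc status could print State, ExitCode, Elapsed, and ReqMem from the first row and take MaxRSS from whichever step row reports it.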
Out of scope
- Custom log parsing for failure modes. Slurm already reports OUT_OF_MEMORY etc., which is enough.
Priority
Medium. Good cost/benefit: makes triage of failed jobs noticeably easier.