feat(status): show sacct accounting fields for completed Slurm jobs#14
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a Slurm-specific “detail” path for hpc status <jobid> that surfaces useful sacct accounting fields for completed jobs, while keeping the existing polling/status logic unchanged for other call sites.
Changes:
- Introduced
JobDetailplus Slurmsacctdetail command + parser (including unit-aware MaxRSS selection across steps). - Added
JobManager.get_job_detail()and updatedcli.statusto render accounting fields for terminal Slurm states (with fallback when detail is unavailable / unsupported). - Added comprehensive unit tests for scheduler parsing, JobManager behavior, and CLI output formatting/fallbacks.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_scheduler.py | Adds unit tests for Slurm detail_cmd and parse_detail, plus PJM “detail not supported” coverage. |
| tests/test_job.py | Adds tests for JobManager.get_job_detail() behavior for Slurm vs PJM. |
| tests/test_cli.py | Adds CLI-level tests verifying detailed output for terminal jobs and fallback behavior. |
| src/hpc/scheduler.py | Adds JobDetail, Slurm sacct detail command, and parsing logic (including unit-aware MaxRSS max across steps). |
| src/hpc/job.py | Adds get_job_detail() API to retrieve scheduler accounting details. |
| src/hpc/cli.py | Updates status to display sacct details for terminal Slurm states and preserve raw state strings. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
`hpc status <id>` on a terminal Slurm job now also prints ExitCode, Elapsed, MaxRSS, and ReqMem alongside the raw State, fetched via a single `sacct -P` call. The polling path used by `wait_for_job` is left untouched. PJM falls back to the prior single-line display. Closes #5
The previous logic preferred the `.batch` row's MaxRSS whenever it was non-empty. For workloads dispatched via `srun` (typical for MPI / GPU jobs) `.batch` only reports the launcher's RSS, so this would under-report the actual peak memory living on numbered step rows. Use a unit-aware max across all non-empty step rows instead.
706a2b8 to
336c354
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
hpc status <id>on a terminal Slurm job now printsExitCode,Elapsed,MaxRSS, andReqMemalongside the raw sacct State string, fetched via a singlesacct -j <id> --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem --noheader -Pcall.status_cmd/parse_status/wait_for_job) is left untouched. The new detail path is consumed only bycli.status.MaxRSSis selected as the unit-aware maximum across all non-empty step rows, so jobs that dispatch viasrun(where.batchonly reflects the launcher) report the workload's real RSS rather than the launcher's small footprint.OUT_OF_MEMORY,CANCELLED+,CANCELLED by 12345); theJobStatusenum is not extended.Test plan
uv run pytest— 138 passed (baseline 119 + 19 new).uv run ruff check— All checks passed.uv run ruff format --check— clean.uv run pyright src/hpc/— 0 errors, 0 warnings.Closes #5