Integrate sacct info for completed jobs into hpc status #5

@ultimatile

Summary

For completed jobs, hpc status <id> should also show the following fields from sacct:

  • State (COMPLETED / FAILED / OUT_OF_MEMORY / TIMEOUT / CANCELLED, etc.)
  • ExitCode
  • Elapsed
  • MaxRSS
  • ReqMem

Motivation

To diagnose why a job failed (OOM vs. timeout, etc.), users currently have to SSH in and run sacct by hand.
We are not going to build generalized "tuning suggestions" (the old hpc tune idea), but just surfacing the raw sacct numbers is already useful on its own.

Implementation notes

  • Run sacct -j <jobid> --format=State,ExitCode,Elapsed,MaxRSS,ReqMem --noheader -P over SSH.
  • Slurm-only feature. PJM should be handled separately (or skipped for now).
  • If the job is still queued or running, fall back to the existing squeue-based display.
  • Structured output (e.g. --json) can be revisited together with MCP integration; a human-readable table is fine for now.
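A minimal Python sketch of the notes above, to make the parsing and fallback decision concrete. It is not the actual hpc implementation. The function names, the `host` argument, the set of "active" states, and the sample text are all hypothetical. It also adds JobID to the format list, which is an assumption beyond the notes above: sacct prints one row per job step, and MaxRSS is typically filled in on step rows such as `<jobid>.batch` rather than on the allocation row, so the rows need a label.

```python
import shlex
import subprocess

# JobID is an addition for illustration; the other fields come from the issue.
FIELDS = ["JobID", "State", "ExitCode", "Elapsed", "MaxRSS", "ReqMem"]

# Assumption: states for which the existing squeue-based display should be used.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def build_sacct_command(job_id: str) -> list[str]:
    """Build the sacct argv; -P gives pipe-delimited output without padding."""
    return ["sacct", "-j", job_id, f"--format={','.join(FIELDS)}", "--noheader", "-P"]


def fetch_sacct(host: str, job_id: str) -> str:
    """Run sacct over SSH (hypothetical helper; needs a reachable Slurm host)."""
    remote = shlex.join(build_sacct_command(job_id))
    result = subprocess.run(
        ["ssh", host, remote], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_sacct(output: str) -> list[dict[str, str]]:
    """Turn pipe-delimited sacct output into one dict per row."""
    rows = []
    for line in output.splitlines():
        values = line.split("|")
        if len(values) != len(FIELDS):
            continue  # skip blank or unexpected lines
        rows.append(dict(zip(FIELDS, values)))
    return rows


def is_finished(rows: list[dict[str, str]]) -> bool:
    """True if the first (allocation) row reports a non-active state."""
    if not rows or not rows[0]["State"]:
        return False  # no accounting record yet: fall back to squeue
    # sacct can report e.g. "CANCELLED by 1000"; keep only the first word.
    state = rows[0]["State"].split()[0]
    return state not in ACTIVE_STATES


def format_table(rows: list[dict[str, str]]) -> str:
    """Render rows as a plain, left-aligned, human-readable table."""
    widths = {f: max([len(f)] + [len(r[f]) for r in rows]) for f in FIELDS}
    lines = ["  ".join(f.ljust(widths[f]) for f in FIELDS)]
    for r in rows:
        lines.append("  ".join(r[f].ljust(widths[f]) for f in FIELDS))
    return "\n".join(lines)


# Made-up sample text in the -P format, for illustration only (not real output).
SAMPLE = (
    "12345|OUT_OF_MEMORY|0:125|00:03:12||4G\n"
    "12345.batch|OUT_OF_MEMORY|0:125|00:03:12|4123456K|\n"
    "12345.extern|COMPLETED|0:0|00:03:12|0|\n"
)

rows = parse_sacct(SAMPLE)
if is_finished(rows):
    print(format_table(rows))
else:
    print("job still queued or running; use the squeue-based display")
```

Keeping the parser separate from the SSH call means it can be tested against captured text without a cluster, and a later --json mode could reuse the same row dictionaries.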

Out of scope

  • Custom log parsing for failure modes — Slurm already reports OUT_OF_MEMORY etc., which is enough.

Priority

Medium. Good cost/benefit: makes triage of failed jobs noticeably easier.
