wait command returns prematurely for array jobs (parse_status only inspects first sacct line) #7

@ultimatile

Summary

hpc wait <array_job_id> returns long before the array job actually finishes. For a Slurm array job, parse_status receives the output of sacct -j <id> --format=State --noheader and returns the status from only the first line, ignoring every other array task.

Affected code

src/hpc/scheduler.py (Slurm branch):

def status_cmd(self, job_id: str) -> list[str]:
    return ["sacct", "-j", job_id, "--format=State", "--noheader"]

def parse_status(self, output: str) -> JobStatus:
    lines = output.strip().splitlines()
    status_str = lines[0].strip().rstrip("+") if lines else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)

wait_for_job in src/hpc/job.py checks status in terminal_states and returns as soon as parse_status reports a terminal state.
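
To make the interaction concrete, here is a minimal sketch of a polling loop of the kind described. The real wait_for_job is not quoted in this issue, so the function name wait_for_job_sketch, the poll_status callback, the interval parameter, and the contents of TERMINAL_STATES are all assumptions for illustration. The JobStatus enum is a stand-in for the project's own.

```python
import time
from enum import Enum
from typing import Callable


class JobStatus(Enum):  # stand-in for the project's JobStatus
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"


# Assumed terminal set; the project's terminal_states may differ.
TERMINAL_STATES = {
    JobStatus.COMPLETED,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.TIMEOUT,
}


def wait_for_job_sketch(
    poll_status: Callable[[], JobStatus], interval: float = 0.0
) -> JobStatus:
    """Poll until the reported status is terminal, then return it."""
    while True:
        status = poll_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval)


# A poller that reports COMPLETED on the first call ends the wait at once,
# regardless of what the remaining array tasks are doing.
print(wait_for_job_sketch(lambda: JobStatus.COMPLETED))  # JobStatus.COMPLETED
```

The loop itself is only as correct as the status it is handed: a single premature terminal value from parse_status is enough to end the wait.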

Repro

For an array job with mixed state (some tasks COMPLETED, others still RUNNING/PENDING), with -X added here so that each row is a job rather than a job step:

$ sacct -j 7701965 --format=State --noheader -X
 COMPLETED 
 COMPLETED 
 COMPLETED 
 ...
 RUNNING
 RUNNING
 PENDING

parse_status reads only the first line (COMPLETED), so wait_for_job returns JobStatus.COMPLETED while ~150 array tasks are still pending or running.
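
The behaviour can be reproduced without a cluster by feeding a mixed-state string to a standalone copy of the current logic. In this sketch the JobStatus enum and _STATUS_MAP are stand-ins (the project's definitions are not shown in this issue), and parse_status_first_line is a hypothetical free-function copy of the method quoted under "Affected code".

```python
from enum import Enum


class JobStatus(Enum):  # stand-in for the project's JobStatus
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"


# Stand-in mapping from sacct state strings to JobStatus members.
_STATUS_MAP = {status.value: status for status in JobStatus}


def parse_status_first_line(output: str) -> JobStatus:
    """Copy of the current logic: only the first row is inspected."""
    lines = output.strip().splitlines()
    status_str = lines[0].strip().rstrip("+") if lines else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


# Shortened version of the mixed-state sacct output from the repro.
MIXED_OUTPUT = " COMPLETED \n COMPLETED \n RUNNING \n PENDING \n"

result = parse_status_first_line(MIXED_OUTPUT)
print(result)  # JobStatus.COMPLETED, although two rows are not terminal
```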

Expected

hpc wait should block until all array tasks reach a terminal state.

Suggested fix

Aggregate the statuses over all sacct rows in Slurm.parse_status (and analogously for PJM.parse_status). Sketch:

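def parse_status(self, output: str) -> JobStatus:
    # One cleaned state per non-empty sacct row (trailing "+" stripped, as in the current code).
    lines = [ln.strip().rstrip("+") for ln in output.strip().splitlines() if ln.strip()]
    if not lines:
        return JobStatus.FAILED
    # Unknown state strings map to FAILED, matching the current behaviour.
    statuses = [_STATUS_MAP.get(s, JobStatus.FAILED) for s in lines]
    # Any non-terminal task keeps the whole job non-terminal.
    if any(s == JobStatus.PENDING for s in statuses):
        return JobStatus.PENDING
    if any(s == JobStatus.RUNNING for s in statuses):
        return JobStatus.RUNNING
    # Every task is terminal: report a failure-type outcome if any task had one.
    for terminal in (JobStatus.FAILED, JobStatus.CANCELLED, JobStatus.TIMEOUT):
        if any(s == terminal for s in statuses):
            return terminal
    return JobStatus.COMPLETED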

Notes:

  • Adding -X to status_cmd helps too: it suppresses job steps, so each row is a job (for an array job, an array task) and the aggregation runs over a clean set.
  • For non-array jobs, sacct -X returns one row anyway, so the aggregated logic still works.
  • The PJM branch has the same shape (lines[1].strip()) and likely the same issue for PJM step jobs; worth a parallel fix.
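
As a quick check of the notes above, here is a self-contained sketch that combines the -X variant of status_cmd with an equivalent, table-driven form of the aggregation. It is illustrative only: JobStatus and _STATUS_MAP are stand-ins for the project's definitions, and aggregate_status and _PRIORITY are hypothetical names. The priority order is the one used in the suggested fix.

```python
from enum import Enum


class JobStatus(Enum):  # stand-in for the project's JobStatus
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"
    TIMEOUT = "TIMEOUT"


# Stand-in mapping from sacct state strings to JobStatus members.
_STATUS_MAP = {status.value: status for status in JobStatus}

# First match wins: non-terminal states, then failure-type terminal states.
_PRIORITY = (
    JobStatus.PENDING,
    JobStatus.RUNNING,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.TIMEOUT,
)


def status_cmd(job_id: str) -> list[str]:
    # -X limits sacct output to job allocations, leaving out job steps.
    return ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"]


def aggregate_status(output: str) -> JobStatus:
    """Reduce all sacct rows to one status for the whole job."""
    rows = [ln.strip().rstrip("+") for ln in output.splitlines() if ln.strip()]
    if not rows:
        return JobStatus.FAILED
    seen = {_STATUS_MAP.get(row, JobStatus.FAILED) for row in rows}
    for candidate in _PRIORITY:
        if candidate in seen:
            return candidate
    return JobStatus.COMPLETED


print(aggregate_status(" COMPLETED \n RUNNING \n PENDING \n"))  # JobStatus.PENDING
print(aggregate_status(" COMPLETED \n COMPLETED \n"))  # JobStatus.COMPLETED
print(aggregate_status(" COMPLETED \n CANCELLED+ \n"))  # JobStatus.CANCELLED
```

A single-row input passes through unchanged, which covers the non-array case.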

Workaround

Until this is fixed, fall back to a manual loop:

hpc exec "until [ \$(squeue -j <id> -h 2>/dev/null | wc -l) -eq 0 ]; do sleep 60; done"

Environment

  • hpc 0.4.0
  • Scheduler: Slurm
  • Array job submitted with [slurm.options].array = "1-N%K"
