Summary
hpc wait <array_job_id> returns long before the array job actually finishes. For a Slurm array job, parse_status reads sacct -j <id> --format=State --noheader and returns the status from only the first output line, ignoring every other array task.
Affected code
src/hpc/scheduler.py (Slurm branch):
    def status_cmd(self, job_id: str) -> list[str]:
        return ["sacct", "-j", job_id, "--format=State", "--noheader"]

    def parse_status(self, output: str) -> JobStatus:
        lines = output.strip().splitlines()
        status_str = lines[0].strip().rstrip("+") if lines else ""
        return _STATUS_MAP.get(status_str, JobStatus.FAILED)
wait_for_job in src/hpc/job.py checks status in terminal_states and returns as soon as parse_status reports a terminal state.
Repro
For an array job with mixed state (some tasks COMPLETED, others still RUNNING/PENDING):
$ sacct -j 7701965 --format=State --noheader -X
COMPLETED
COMPLETED
COMPLETED
...
RUNNING
RUNNING
PENDING
parse_status reads only the first line (COMPLETED), so wait_for_job returns JobStatus.COMPLETED while ~150 array tasks are still pending or running.
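The early return can be reproduced without a cluster. The following is a minimal sketch, not the project's code: JobStatus and _STATUS_MAP are stand-ins for the real definitions in src/hpc (their actual contents are assumed), and parse_status_first_line is a hypothetical free-function copy of the current Slurm parse_status logic.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the real enum in src/hpc; member names are assumed.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Stand-in mapping from sacct State strings to JobStatus (assumed contents).
_STATUS_MAP = {
    "PENDING": JobStatus.PENDING,
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "FAILED": JobStatus.FAILED,
}


def parse_status_first_line(output: str) -> JobStatus:
    """Copy of the current logic: only the first sacct row is inspected."""
    lines = output.strip().splitlines()
    status_str = lines[0].strip().rstrip("+") if lines else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


# Mixed-state sacct output, shaped like the listing in the repro.
mixed_output = "COMPLETED\nCOMPLETED\nRUNNING\nPENDING\n"

result = parse_status_first_line(mixed_output)
print(result)  # JobStatus.COMPLETED, although tasks are still running/pending
```

Because COMPLETED is a terminal state, a caller that stops on the first terminal status, as wait_for_job is described as doing, would return at this point.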
Expected
hpc wait should block until all array tasks reach a terminal state.
Suggested fix
Aggregate the statuses over all sacct rows in Slurm.parse_status (and analogously for PJM.parse_status). Sketch:
    def parse_status(self, output: str) -> JobStatus:
        # One cleaned state string per non-empty sacct row.
        lines = [ln.strip().rstrip("+") for ln in output.strip().splitlines() if ln.strip()]
        if not lines:
            return JobStatus.FAILED
        statuses = [_STATUS_MAP.get(s, JobStatus.FAILED) for s in lines]
        # Any task that has not finished keeps the whole job non-terminal.
        if any(s == JobStatus.PENDING for s in statuses):
            return JobStatus.PENDING
        if any(s == JobStatus.RUNNING for s in statuses):
            return JobStatus.RUNNING
        # All tasks are terminal: report the first failure-like state found.
        for terminal in (JobStatus.FAILED, JobStatus.CANCELLED, JobStatus.TIMEOUT):
            if any(s == terminal for s in statuses):
                return terminal
        return JobStatus.COMPLETED
Notes:
- Adding -X to status_cmd is helpful too: it suppresses job steps, so each row is one array task and the aggregation runs over a clean set.
- For non-array jobs, sacct with -X returns a single row, so the aggregated logic still works. Without -X, job-step rows are aggregated as well.
- The PJM branch has the same shape (lines[1].strip()) and likely the same issue for PJM step jobs; worth a parallel fix.
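A regression check for the aggregated behaviour might look like the sketch below. It is self-contained, so JobStatus and _STATUS_MAP are stand-ins with assumed contents, and aggregate_status is a hypothetical free-function version of the proposed parse_status. status_cmd_with_x is likewise hypothetical and only shows the existing command with sacct's -X (--allocations) flag appended.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the real enum in src/hpc; member names are assumed.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    TIMEOUT = "timeout"


# Stand-in mapping from sacct State strings to JobStatus (assumed contents).
_STATUS_MAP = {
    "PENDING": JobStatus.PENDING,
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "FAILED": JobStatus.FAILED,
    "CANCELLED": JobStatus.CANCELLED,
    "TIMEOUT": JobStatus.TIMEOUT,
}

# Precedence: unfinished states first, then failure-like terminal states.
_PRECEDENCE = (
    JobStatus.PENDING,
    JobStatus.RUNNING,
    JobStatus.FAILED,
    JobStatus.CANCELLED,
    JobStatus.TIMEOUT,
)


def status_cmd_with_x(job_id: str) -> list[str]:
    """Existing sacct command plus -X, so job steps are not listed."""
    return ["sacct", "-j", job_id, "--format=State", "--noheader", "-X"]


def aggregate_status(output: str) -> JobStatus:
    """Reduce all sacct rows to one status; COMPLETED only if every row is."""
    rows = [ln.strip().rstrip("+") for ln in output.splitlines() if ln.strip()]
    if not rows:
        return JobStatus.FAILED
    statuses = {_STATUS_MAP.get(row, JobStatus.FAILED) for row in rows}
    for candidate in _PRECEDENCE:
        if candidate in statuses:
            return candidate
    return JobStatus.COMPLETED


# Sample sacct outputs and the status each should aggregate to.
cases = {
    "COMPLETED\nCOMPLETED\nRUNNING\nPENDING\n": JobStatus.PENDING,
    "COMPLETED\nRUNNING\n": JobStatus.RUNNING,
    "COMPLETED\nFAILED\nCOMPLETED\n": JobStatus.FAILED,
    "COMPLETED\nCANCELLED+\n": JobStatus.CANCELLED,  # truncated state with '+'
    "COMPLETED\n": JobStatus.COMPLETED,              # single-row, non-array job
}
for sacct_output, expected in cases.items():
    assert aggregate_status(sacct_output) is expected
```

The single-row case covers non-array jobs, and the mixed-state case mirrors the repro, where the first row alone would have reported COMPLETED.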
Workaround
Until this is fixed, fall back to a manual loop:
hpc exec "until [ \$(squeue -j <id> -h 2>/dev/null | wc -l) -eq 0 ]; do sleep 60; done"
Environment
- hpc 0.4.0
- Scheduler: Slurm
- Array job submitted with [slurm.options].array = "1-N%K"