Summary
Slurm.parse_status and PJM.parse_status both fall back to JobStatus.FAILED when the scheduler's status command returns no data row. This conflates two distinct cases:
- The scheduler genuinely reports the job as failed.
- The scheduler does not yet have a row for the job (e.g., Slurm sacct immediately after submission, before accounting indexing catches up).
Affected code
src/hpc/scheduler.py:
class Slurm(Scheduler):
    def parse_status(self, output: str) -> JobStatus:
        lines = output.strip().splitlines()
        status_str = lines[0].strip().rstrip("+") if lines else ""
        return _STATUS_MAP.get(status_str, JobStatus.FAILED)

class PJM(Scheduler):
    def parse_status(self, output: str) -> JobStatus:
        lines = output.strip().splitlines()
        status_str = lines[1].strip() if len(lines) >= 2 else ""
        return self._STATUS_MAP.get(status_str, JobStatus.FAILED)
For Slurm, an empty sacct response gives status_str = "" ⇒ JobStatus.FAILED.
For PJM, a header-only pjstat response gives status_str = "" ⇒ JobStatus.FAILED.
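The conflation can be seen in isolation with a minimal standalone sketch. The JobStatus members, the contents of _STATUS_MAP, and the "ST" header line below are stand-ins assumed for illustration; only the parsing logic is copied from the snippets above.

```python
from enum import Enum


class JobStatus(Enum):
    # Stand-in for the project's enum; only the members needed here.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Hypothetical subset of the status table; the real one lives in scheduler.py.
_STATUS_MAP = {
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "FAILED": JobStatus.FAILED,
}


def slurm_parse_status(output: str) -> JobStatus:
    # Same logic as Slurm.parse_status above.
    lines = output.strip().splitlines()
    status_str = lines[0].strip().rstrip("+") if lines else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


def pjm_parse_status(output: str) -> JobStatus:
    # Same logic as PJM.parse_status above.
    lines = output.strip().splitlines()
    status_str = lines[1].strip() if len(lines) >= 2 else ""
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


genuine_failure = slurm_parse_status("FAILED\n")  # scheduler really says FAILED
not_yet_indexed = slurm_parse_status("")          # empty sacct response
header_only = pjm_parse_status("ST\n")            # assumed header line, no data row

# All three are JobStatus.FAILED, so callers cannot tell the cases apart.
print(genuine_failure, not_yet_indexed, header_only)
```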
Impact
hpc job-output --follow <id> (added in #4) inspects the status before deciding whether to use tail -F (active) or fall back to cat (terminal). When the scheduler has not yet indexed the just-submitted job, the misparsed FAILED sends the command to the cat path; the output file does not exist yet, so the user gets a No such file error instead of a streaming view.
hpc wait is also affected: it currently treats FAILED as a terminal state and stops polling, so a wait launched immediately after submission can short-circuit before the job actually starts.
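The short-circuit in hpc wait can be illustrated with a toy polling loop. This is not the project's wait_for_job; the TERMINAL set, the poll callable, and the simulated status sequence are assumptions made for the sketch.

```python
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    # Stand-in for the project's enum.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


# Assumed set of terminal states; FAILED is terminal, as described above.
TERMINAL = {JobStatus.COMPLETED, JobStatus.FAILED}


def wait_for_job(poll: Callable[[], JobStatus], max_polls: int = 10):
    """Poll until a terminal status is seen; return (status, polls used)."""
    for attempt in range(1, max_polls + 1):
        status = poll()
        if status in TERMINAL:
            return status, attempt
    raise TimeoutError("job did not reach a terminal state")


# Simulated scheduler: the first poll falls in the not-yet-indexed window and
# is misparsed as FAILED; later polls would have shown the real progression.
statuses = iter([JobStatus.FAILED, JobStatus.RUNNING, JobStatus.COMPLETED])
result = wait_for_job(lambda: next(statuses))
print(result)  # stops on the first poll, before the job has started
```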
Suggested fix
Distinguish "no parseable row" from "FAILED" in parse_status. Sketch:
- Add a SchedulerError exception in scheduler.py.
- Have parse_status raise SchedulerError when its input is structurally insufficient (empty Slurm output, header-only PJM output).
- In JobManager.get_job_status, convert SchedulerError to SSHError so existing callers (wait_for_job, get_job_output, tail_job_output) inherit the "transient / unknown status" handling they already implement for SSHError.
This keeps the JobStatus enum clean (no new UNKNOWN variant required) and reuses existing retry / fall-through paths in callers.
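A minimal sketch of these three steps, written as free functions so it runs on its own. The enum members, the status table, the SSHError stand-in, and the get_job_status signature are assumptions; in the real code the parsers are methods on Slurm and PJM and the conversion would live in JobManager.get_job_status.

```python
from enum import Enum
from typing import Callable


class JobStatus(Enum):
    # Stand-in for the project's enum; no UNKNOWN member is added.
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


class SchedulerError(Exception):
    """Scheduler output contained no parseable status row."""


class SSHError(Exception):
    """Stand-in for the project's existing SSHError."""


# Hypothetical subset of the status table.
_STATUS_MAP = {
    "RUNNING": JobStatus.RUNNING,
    "COMPLETED": JobStatus.COMPLETED,
    "FAILED": JobStatus.FAILED,
}


def slurm_parse_status(output: str) -> JobStatus:
    lines = output.strip().splitlines()
    if not lines:
        # Empty sacct response: the job may simply not be indexed yet.
        raise SchedulerError("sacct returned no rows")
    status_str = lines[0].strip().rstrip("+")
    # Unrecognised status strings still map to FAILED, as before.
    return _STATUS_MAP.get(status_str, JobStatus.FAILED)


def pjm_parse_status(output: str) -> JobStatus:
    lines = output.strip().splitlines()
    if len(lines) < 2:
        # Header-only (or empty) pjstat response: no data row to read.
        raise SchedulerError("pjstat returned no data row")
    return _STATUS_MAP.get(lines[1].strip(), JobStatus.FAILED)


def get_job_status(raw_output: str, parse: Callable[[str], JobStatus]) -> JobStatus:
    # Convert SchedulerError to SSHError so callers reuse their existing
    # handling for transient or unknown status.
    try:
        return parse(raw_output)
    except SchedulerError as exc:
        raise SSHError(f"job status unavailable: {exc}") from exc
```

With this shape, a genuine FAILED row still returns JobStatus.FAILED, while an empty or header-only response surfaces as SSHError and flows through whatever retry or fall-through path each caller already has.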
Repro
For a freshly submitted Slurm job before sacct indexing catches up:
job_id=$(hpc submit "sleep 30")
hpc status $job_id # may print "FAILED" before the job has actually run
Environment
- hpc 0.4.0
- Slurm and PJM both affected