The v0.9.0 diagnostic surfaced that an item briefly read as COMPLETED
between "vanished from squeue" and "sacct returned a row". An agent
calling `scripthut run watch --exit-status` could observe that
transient state and return success a few seconds before the sacct row
landed and flipped the item to FAILED. The earlier v0.6.5 ExitCode cross-check
was the correctness backstop; this release closes the window so the
correctness verdict is the FIRST verdict consumers see.
Also fills gap A: `item.exit_code` is now populated from sacct's
ExitCode column, so consumers (CI gates, agents) can read the numeric
code directly instead of scraping the `Exit code: N` line from the
task's stdout.
State machine:
- New `RunItemStatus.SETTLING` value — "scheduler queue clear,
accounting confirmation pending". Non-terminal: `Run.status`
keeps the run in RUNNING while any item is SETTLING, and
`running_count` includes SETTLING items so the concurrency cap
accounts for them. `progress`/`completed_count` deliberately
exclude SETTLING (it's not "done" yet).
- Squeue-vanish (`QUEUED`/`RUNNING` → missing) now transitions to
SETTLING instead of optimistic COMPLETED.
`finished_at` is stamped with "now" as an approximate end time
(sacct corrects it to the true `end_time` on resolution) so the
long-grace fallback has a timestamp to age against.
- Squeue says `JobState.COMPLETED` directly → also SETTLING. The
squeue tool sometimes reports COMPLETED for very brief jobs where
the script's exit code may still disagree (the v0.6.5 scenario).
Going through SETTLING means accounting confirms the verdict
before consumers see it.
- Squeue failure states (`FAILED` / `TIMEOUT` / `OOM` / etc.) still
go directly to FAILED: Slurm is the authoritative source of
the failure reason, and there is no "completed after all" recovery
path for these.
- `generates_source` handling is deferred from queue-vanish to the
sacct-confirmed SETTLING → COMPLETED transition, so dependent
tasks don't spawn off an unconfirmed completion.
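The transitions above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `ItemStatus`, `next_status`, `run_status`, and the `ACTIVE` set are hypothetical names, the run-level status is collapsed to three strings, and which states count toward `running_count` besides SETTLING is an assumption.

```python
from __future__ import annotations

from enum import Enum


class ItemStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SETTLING = "settling"  # queue clear, accounting confirmation pending
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL = {ItemStatus.COMPLETED, ItemStatus.FAILED}
# Assumption: these are the states that occupy a concurrency slot.
ACTIVE = {ItemStatus.QUEUED, ItemStatus.RUNNING, ItemStatus.SETTLING}
# Squeue states that map straight to FAILED (only the ones named above).
SQUEUE_FAILURES = {"FAILED", "TIMEOUT", "OOM"}


def next_status(current: ItemStatus, squeue_state: str | None) -> ItemStatus:
    """Map one squeue observation (None = job no longer listed) to a status."""
    if current not in (ItemStatus.QUEUED, ItemStatus.RUNNING):
        return current  # SETTLING and terminal items are resolved elsewhere
    if squeue_state in SQUEUE_FAILURES:
        return ItemStatus.FAILED
    if squeue_state is None or squeue_state == "COMPLETED":
        return ItemStatus.SETTLING  # wait for accounting to confirm
    if squeue_state == "RUNNING":
        return ItemStatus.RUNNING
    return current


def run_status(items: list[ItemStatus]) -> str:
    """A run stays 'running' while any item is non-terminal, SETTLING included."""
    if any(s not in TERMINAL for s in items):
        return "running"
    return "failed" if ItemStatus.FAILED in items else "completed"


def running_count(items: list[ItemStatus]) -> int:
    return sum(s in ACTIVE for s in items)


def completed_count(items: list[ItemStatus]) -> int:
    return sum(s is ItemStatus.COMPLETED for s in items)
```

With one COMPLETED and one SETTLING item, `run_status` stays `"running"`, `running_count` is 1, and `completed_count` is 1, which is the behaviour described for the concurrency cap and for `progress`.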
sacct integration (Phase C):
- `JobStats` grows `exit_code: int | None`.
`SlurmBackend.get_job_stats` parses sacct's ExitCode column via a
new `_slurm_parse_exit_int` helper. `.batch` row wins over the
main entry on conflict (that's where user code ran); main entry
is the fallback for jobs without a `.batch` step. Unparseable
fields surface as `None`, not `0` — distinguishing "no data" from
"actually exited 0".
- `main.poll_backend`'s sacct query gains Phase C alongside Phase A
(resource-stats refresh of already-terminal items) and Phase B
(SUBMITTED-past-grace resolution): SETTLING items are included
in every sacct query.
- The resolution branch transitions SETTLING → COMPLETED / FAILED
based on the sacct State, mirroring the SUBMITTED-past-grace
shape. `generates_source` fires here.
- Every sacct observation that surfaces an exit_code now writes it
to `item.exit_code`, regardless of which branch handles the item.
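A rough sketch of the parsing rule described above, for orientation only. `parse_exit_code` and `pick_exit_code` are hypothetical stand-ins, not the actual `_slurm_parse_exit_int` or `get_job_stats`. The sketch assumes the field has sacct's `<exit>:<signal>` shape, returns only the exit half, and falls back to the main entry when the `.batch` value is missing or unparseable; how the real helper treats the signal half is not stated here.

```python
from __future__ import annotations


def parse_exit_code(field: str | None) -> int | None:
    """Return the exit half of an `<exit>:<signal>` field, or None if unparseable."""
    if not field:
        return None
    exit_part, sep, signal_part = field.strip().partition(":")
    if not sep:
        return None  # no colon: not the expected shape
    try:
        code = int(exit_part)
        int(signal_part)  # validated only; the signal half is ignored in this sketch
    except ValueError:
        return None
    return code


def pick_exit_code(rows: dict[str, str], job_id: str) -> int | None:
    """Prefer the `.batch` row (where user code ran), else the main entry."""
    batch = parse_exit_code(rows.get(f"{job_id}.batch"))
    if batch is not None:
        return batch
    return parse_exit_code(rows.get(job_id))
```

Returning `None` rather than `0` for bad input keeps "no data" distinguishable from a real zero exit, so a caller can test `code is None` before comparing against 0.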
Long-grace fallback:
- `SETTLING_NO_RECORD_TIMEOUT_SECONDS = 600` and
`SETTLING_UNCONFIRMED_MARKER` give SETTLING items a 10-minute
window for sacct to surface. After that, the item falls back to
COMPLETED with the marker on `item.error` so consumers know the
exit code wasn't accounting-confirmed. Marking FAILED here would
invent a failure from an accounting-DB-availability problem,
which is worse than reporting "ran but unverified". The marker
text spells out the situation so users can investigate.
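The resolution and fallback logic can be sketched like this. It is an illustration under stated assumptions: `ItemSketch` and `resolve_settling` are hypothetical names, the marker string is a placeholder because the real wording is not quoted here, any sacct State other than `COMPLETED` is treated as a failure, and the `>=` comparison at the boundary is a guess.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

SETTLING_NO_RECORD_TIMEOUT_SECONDS = 600
# Placeholder text: the real marker wording is not quoted in these notes.
SETTLING_UNCONFIRMED_MARKER = "<unconfirmed-by-accounting marker>"


@dataclass
class ItemSketch:
    status: str
    finished_at: datetime  # stamped with "now" when the job left the queue
    exit_code: int | None = None
    error: str | None = None


def resolve_settling(
    item: ItemSketch,
    sacct_state: str | None,
    sacct_exit_code: int | None,
    now: datetime,
) -> None:
    """Resolve a SETTLING item from sacct, or fall back after the timeout."""
    if item.status != "settling":
        return
    if sacct_state is not None:
        # Accounting answered: record the code and take its verdict.
        if sacct_exit_code is not None:
            item.exit_code = sacct_exit_code
        item.status = "completed" if sacct_state == "COMPLETED" else "failed"
        return
    # No sacct row yet: age the item against its approximate end time.
    age = (now - item.finished_at).total_seconds()
    if age >= SETTLING_NO_RECORD_TIMEOUT_SECONDS:
        item.status = "completed"
        item.error = SETTLING_UNCONFIRMED_MARKER  # "ran but unverified"
```

A consumer that needs a confirmed verdict could treat `status == "completed"` with `exit_code is None` and the marker on `error` as unverified rather than as success.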
Persistence:
- `RunItem.to_dict` / `from_dict` serialize `exit_code`. Old
persisted runs (pre-0.10.0) load fine: missing field → None.
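A minimal sketch of the backward-compatible round trip, assuming a simplified record. `ItemRecord` and its `name` field are hypothetical; the real `RunItem` carries more fields.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class ItemRecord:
    name: str
    exit_code: int | None = None

    def to_dict(self) -> dict[str, Any]:
        return {"name": self.name, "exit_code": self.exit_code}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "ItemRecord":
        # .get() lets records written before the field existed load as None.
        return cls(name=data["name"], exit_code=data.get("exit_code"))
```

Using `data.get("exit_code")` without a default of `0` is what keeps an old record ("unknown") distinct from a job that really exited 0.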
Tests:
- `test_settling.py` (16 tests): `Run.status` treats SETTLING as
non-terminal in all permutations (settling-only, mixed with
completed); `running_count` includes SETTLING; `progress`
excludes it; a run whose items are all COMPLETED is terminal. `exit_code`
round-trips through `to_dict` / `from_dict`, including the
zero-vs-None distinction. `_slurm_parse_exit_int` is covered across the
`<n>:<n>` matrix plus unparseable edges. `SlurmBackend.get_job_stats`
populates `exit_code` with `.batch` winning on conflict, main as
fallback, and `None` on unparseable input.
- `test_submit_verify.py` — the two queue-vanish tests now assert
SETTLING (not optimistic COMPLETED) since that's the new
contract.
All 551 tests in the broad sweep and 156 in the backend-specific suites pass.
Bumping to 0.10.0 because the run-state machine grows a new
non-terminal value: any external tooling that polled run.status
expecting only the v0.9 set must now also handle SETTLING (or
keep polling, which is the intent).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>