v0.10.0

@tlamadon tlamadon tagged this 04 Jun 14:19
The v0.9.0 diagnostic surfaced that an item briefly read as COMPLETED
between "vanished from squeue" and "sacct returned a row". An agent
calling `scripthut run watch --exit-status` could observe that
transient and return success a few seconds before sacct landed and
flipped the item to FAILED. The earlier v0.6.5 ExitCode cross-check
was the correctness backstop; this release closes the window so the
correctness verdict is the FIRST verdict consumers see.

Also fills gap A: `item.exit_code` is now populated from sacct's
ExitCode column, so consumers (CI gates, agents) can read the numeric
code directly instead of scraping the `Exit code: N` line from the
task's stdout.

State machine:
- New `RunItemStatus.SETTLING` value — "scheduler queue clear,
  accounting confirmation pending". Non-terminal: `Run.status`
  keeps the run in RUNNING while any item is SETTLING, and
  `running_count` includes SETTLING items so the concurrency cap
  accounts for them. `progress`/`completed_count` deliberately
  exclude SETTLING (it's not "done" yet).
- Squeue-vanish (`QUEUED`/`RUNNING` → missing) now transitions to
  SETTLING instead of optimistic COMPLETED.
  `finished_at` is stamped with "now" as an approximate end time
  (sacct corrects it to the true `end_time` on resolution) so the
  long-grace fallback has a timestamp to age against.
- Squeue says `JobState.COMPLETED` directly → also SETTLING. The
  squeue tool sometimes reports COMPLETED for very brief jobs where
  the script's exit code may still disagree (the v0.6.5 scenario).
  Going through SETTLING means accounting confirms the verdict
  before consumers see it.
- Squeue failure states (`FAILED` / `TIMEOUT` / `OOM` / etc.) keep
  going directly to FAILED — slurm is the authoritative source of
  the failure reason; there's no "completed after all" recovery
  path for these.
- `generates_source` handling is deferred from queue-vanish to the
  sacct-confirmed SETTLING → COMPLETED transition, so dependent
  tasks don't spawn off an unconfirmed completion.
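
The transitions and counters above can be sketched roughly as follows. This is a minimal illustration, not scripthut's code: the `Item` class, the `on_squeue` function, and the string values of the enum are hypothetical, and it assumes `running_count` covers only RUNNING and SETTLING.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class RunItemStatus(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    SETTLING = "settling"  # queue clear, accounting confirmation pending
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL = {RunItemStatus.COMPLETED, RunItemStatus.FAILED}
# Squeue states that go straight to FAILED (names as listed in the notes).
SQUEUE_FAILURES = {"FAILED", "TIMEOUT", "OOM"}


@dataclass
class Item:  # hypothetical stand-in for a run item
    status: RunItemStatus


def on_squeue(item: Item, squeue_state: str | None) -> None:
    """Apply one squeue observation; None means the job vanished from the queue."""
    if item.status not in (RunItemStatus.QUEUED, RunItemStatus.RUNNING):
        return
    if squeue_state in SQUEUE_FAILURES:
        item.status = RunItemStatus.FAILED
    elif squeue_state is None or squeue_state == "COMPLETED":
        # Never an optimistic COMPLETED: accounting has to confirm first.
        item.status = RunItemStatus.SETTLING


def running_count(items: list[Item]) -> int:
    # Assumption: SETTLING still occupies a concurrency slot.
    active = (RunItemStatus.RUNNING, RunItemStatus.SETTLING)
    return sum(1 for i in items if i.status in active)


def completed_count(items: list[Item]) -> int:
    # SETTLING is deliberately not counted as done.
    return sum(1 for i in items if i.status is RunItemStatus.COMPLETED)


def run_is_terminal(items: list[Item]) -> bool:
    return all(i.status in TERMINAL for i in items)
```

With one SETTLING item and one COMPLETED item, `run_is_terminal` stays false, which is what keeps a watcher from returning early.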

sacct integration (Phase C):
- `JobStats` grows `exit_code: int | None`.
  `SlurmBackend.get_job_stats` parses sacct's ExitCode column via a
  new `_slurm_parse_exit_int` helper. `.batch` row wins over the
  main entry on conflict (that's where user code ran); main entry
  is the fallback for jobs without a `.batch` step. Unparseable
  fields surface as `None`, not `0` — distinguishing "no data" from
  "actually exited 0".
- `main.poll_backend`'s sacct query gains Phase C alongside Phase A
  (resource-stats refresh of already-terminal items) and Phase B
  (SUBMITTED-past-grace resolution): SETTLING items are included
  in every sacct query.
- The resolution branch transitions SETTLING → COMPLETED / FAILED
  based on the sacct State, mirroring the SUBMITTED-past-grace
  shape. `generates_source` fires here.
- Every sacct observation that surfaces an exit_code now writes it
  to `item.exit_code`, regardless of which branch handles the item.
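
As an illustration of the parsing and precedence rules above, here is a small sketch. `parse_exit_int` and `pick_exit_code` are hypothetical names, not the real `_slurm_parse_exit_int` or `get_job_stats`. It assumes the ExitCode field has sacct's `<exit>:<signal>` shape, that only the exit part is wanted, and that the main entry is also used when the `.batch` value does not parse; the actual helper may treat the signal part and that last case differently.

```python
from __future__ import annotations


def parse_exit_int(field: str | None) -> int | None:
    """Return the exit part of an "<exit>:<signal>" field, or None if unusable."""
    if not field:
        return None
    head, _, _ = field.strip().partition(":")
    try:
        return int(head)
    except ValueError:
        # "No data" must stay distinguishable from a real exit 0.
        return None


def pick_exit_code(rows: dict[str, str], job_id: str) -> int | None:
    """rows maps a sacct JobID (e.g. "123" or "123.batch") to its ExitCode text."""
    batch = parse_exit_int(rows.get(f"{job_id}.batch"))
    if batch is not None:
        return batch  # .batch wins: that is where user code ran
    return parse_exit_int(rows.get(job_id))
```

For example, `pick_exit_code({"123": "0:0", "123.batch": "1:0"}, "123")` yields the `.batch` value rather than the main entry's `0`.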

Long-grace fallback:
- `SETTLING_NO_RECORD_TIMEOUT_SECONDS = 600` and
  `SETTLING_UNCONFIRMED_MARKER` give SETTLING items a 10-minute
  window for sacct to surface. After that, the item falls back to
  COMPLETED with the marker on `item.error` so consumers know the
  exit code wasn't accounting-confirmed. Marking FAILED here would
  invent a failure from an accounting-DB-availability problem,
  which is worse than reporting "ran but unverified". The marker
  text spells out the situation so users can investigate.
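
A rough sketch of how the fallback decision might look. The two constant names come from this release; the marker text, the `resolve_settling` function, its tuple return shape, and the strict `>` comparison at the boundary are assumptions made for illustration.

```python
from __future__ import annotations

from datetime import datetime, timedelta, timezone

SETTLING_NO_RECORD_TIMEOUT_SECONDS = 600
# Placeholder text; the real marker wording is not reproduced here.
SETTLING_UNCONFIRMED_MARKER = "completed, but exit code not confirmed by accounting"


def resolve_settling(
    finished_at: datetime,
    now: datetime,
    sacct_state: str | None,
) -> tuple[str, str | None]:
    """Return (new_status, error) for one SETTLING item."""
    if sacct_state is not None:
        # Accounting answered: its verdict is the one consumers see.
        status = "COMPLETED" if sacct_state == "COMPLETED" else "FAILED"
        return status, None
    age = (now - finished_at).total_seconds()
    if age > SETTLING_NO_RECORD_TIMEOUT_SECONDS:
        # No record after the grace window: report "ran but unverified"
        # instead of inventing a failure.
        return "COMPLETED", SETTLING_UNCONFIRMED_MARKER
    return "SETTLING", None
```

Because `finished_at` is stamped at queue-vanish time, the age check has something to measure against even when sacct never returns a row.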

Persistence:
- `RunItem.to_dict` / `from_dict` serialize `exit_code`. Old
  persisted runs (pre-0.10.0) load fine: missing field → None.
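
The backward-compatible load can be pictured like this. The `Item` class and its `name` field are hypothetical; only the `exit_code` handling mirrors what the notes describe.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class Item:  # hypothetical, reduced to the fields needed here
    name: str
    exit_code: int | None = None

    def to_dict(self) -> dict[str, Any]:
        return {"name": self.name, "exit_code": self.exit_code}

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> Item:
        # .get() makes a missing key (older persisted run) load as None,
        # which stays distinct from a recorded exit code of 0.
        return cls(name=data["name"], exit_code=data.get("exit_code"))
```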

Tests:
- `test_settling.py` (16): `Run.status` treats SETTLING as
  non-terminal in all permutations (settling-only, mixed with
  completed); `running_count` includes SETTLING; `progress`
  excludes it; full COMPLETED set is terminal. `exit_code`
  round-trips through `to_dict` / `from_dict`, including the
  zero-vs-None distinction. `_slurm_parse_exit_int` across the
  `<n>:<n>` matrix plus unparseable edges. `SlurmBackend.get_job_stats`
  populates exit_code with .batch winning on conflict, main as
  fallback, and `None` on unparseable.
- `test_submit_verify.py` — the two queue-vanish tests now assert
  SETTLING (not optimistic COMPLETED) since that's the new
  contract.

551/551 in the broad sweep + 156 in the backend-specific suites pass.

Bumping to 0.10.0 because the run-state machine grows a new
non-terminal value: any external tooling that polled run.status
expecting only the v0.9 set must now also handle SETTLING (or
keep polling, which is the intent).
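
For external tooling, one way to stay robust to new non-terminal values is to wait on an explicit terminal set and keep polling on anything else. A minimal sketch, assuming statuses are exposed as lowercase strings and that `fetch_status` is whatever callable the tool already uses to read them (both are assumptions, not scripthut API):

```python
import time
from typing import Callable

# Assumed serialized names; adjust to what the tool actually receives.
TERMINAL_STATUSES = frozenset({"completed", "failed"})


def wait_for_terminal(
    fetch_status: Callable[[], str],
    poll_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status appears; anything else means keep waiting."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # "settling", or any value added later, lands here and is retried.
        sleep(poll_seconds)
```

Matching on the terminal set rather than enumerating the in-progress values means a future non-terminal status needs no change on the consumer side.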

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>