Skip to content

v0.6.5

@tlamadon tlamadon tagged this 03 Jun 23:15
Two independent failure modes were letting a non-zero exit slip
through as COMPLETED:

1. Bash masking inside the generated script. A common pattern like
   `solve | tee log.txt` exits 0 even when solve dies, because tee
   returns 0. The script's trailing `EXIT_CODE=$?` then captured 0,
   sbatch exited cleanly, Slurm marked the job COMPLETED, and the
   framework had no way to disagree. Fix: insert `set -o pipefail`
   at the very top of the generated bash body so the pipeline's
   exit code is the rightmost non-zero command instead of always
   the last. Deliberately *not* using `set -e`: that would silently
   break legitimate `cmd || handle_failure` patterns inside user
   commands.

2. State-vs-ExitCode disagreement. Even when bash propagates the
   exit code correctly, some Slurm setups can show State=COMPLETED
   alongside an ExitCode of `1:0` (it happens on certain Slurm
   versions and srun layouts). Fix: extend the sacct query with
   ExitCode and override COMPLETED → FAILED whenever either half
   of `<exit>:<signal>` is non-zero. Defense-in-depth check runs
   on both the .batch row (where user code lives) and the main
   row (so jobs without a .batch step still benefit). The
   existing failure-state override (`OUT_OF_MEMORY` / `TIMEOUT`
   beating COMPLETED) still wins ordering, so users get the
   specific reason rather than generic FAILED when one's available.

Helper `_slurm_exitcode_indicates_failure` is conservative: empty,
unparseable, or non-`<n>:<n>`-shaped fields return False so we
never invent a failure from a column scripthut doesn't recognize.

Tests:
- Existing TestSacctParsesState fixtures grow the new ExitCode
  column (appending `|0:0` to every row, matching real sacct
  output for clean runs).
- New TestSacctExitCodeOverridesCompleted class covers the new
  override path: non-zero exit, non-zero signal, main-only entry,
  zero exit stays COMPLETED, unparseable doesn't panic, failure
  states still beat the generic override.
- New TestExitcodeParser unit-tests the helper across the
  `<exit>:<signal>` matrix plus the empty / unparseable edges.
- New TestGeneratedScriptSetsPipefail asserts pipefail appears in
  every generated body and runs *before* extra_init so
  module-load pipelines are also protected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Assets 2
Loading