Two independent failure modes were letting a non-zero exit slip
through as COMPLETED:
1. Bash masking inside the generated script. A common pattern like
`solve | tee log.txt` exits 0 even when solve dies, because tee
returns 0. The script's trailing `EXIT_CODE=$?` then captured 0,
sbatch exited cleanly, Slurm marked the job COMPLETED, and the
framework had no way to disagree. Fix: insert `set -o pipefail`
at the very top of the generated bash body so the pipeline's
exit code is the rightmost non-zero command instead of always
the last. Deliberately *not* using `set -e`: that would silently
break legitimate `cmd || handle_failure` patterns inside user
commands.
2. State-vs-ExitCode disagreement. Even when bash propagates the
exit code correctly, some Slurm setups can show State=COMPLETED
alongside an ExitCode of `1:0` (it happens on certain Slurm
versions and srun layouts). Fix: extend the sacct query with
ExitCode and override COMPLETED → FAILED whenever either half
of `<exit>:<signal>` is non-zero. Defense-in-depth check runs
on both the .batch row (where user code lives) and the main
row (so jobs without a .batch step still benefit). The
existing failure-state override (`OUT_OF_MEMORY` / `TIMEOUT`
beating COMPLETED) still wins ordering, so users get the
specific reason rather than generic FAILED when one's available.
Helper `_slurm_exitcode_indicates_failure` is conservative: empty,
unparseable, or non-`<n>:<n>`-shaped fields return False so we
never invent a failure from a column scripthut doesn't recognize.
Tests:
- Existing TestSacctParsesState fixtures grow the new ExitCode
column (appending `|0:0` to every row, matching real sacct
output for clean runs).
- New TestSacctExitCodeOverridesCompleted class covers the new
override path: non-zero exit, non-zero signal, main-only entry,
zero exit stays COMPLETED, unparseable doesn't panic, failure
states still beat the generic override.
- New TestExitcodeParser unit-tests the helper across the
`<exit>:<signal>` matrix plus the empty / unparseable edges.
- New TestGeneratedScriptSetsPipefail asserts pipefail appears in
every generated body and runs *before* extra_init so
module-load pipelines are also protected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>