Async job batching #68964

Merged
dwoz merged 17 commits into saltstack:3008.x from dwoz:batching-async
Apr 27, 2026

dwoz commented Apr 20, 2026

Adds async batch execution. salt '*' fun --batch N --async now returns a JID
immediately and the master drives the batch to completion on its own. Sync
batch is refactored onto the same state machine so the two drivers share a
single implementation and behave identically.

User-facing changes

  • --async --batch-size=N is now a supported combination.
  • New runner: salt-run batch.status <jid>, batch.list_active,
    batch.stop <jid> (kill=True to halt hard).
  • Batch state is persisted to the JID directory, so results survive the CLI
    disconnecting.
  • Sync batch output, yield shape, and semantics (batch_wait, failhard,
    timeouts) are unchanged.

Design

The master runs a dedicated process that owns batch progression. It listens
for minion returns, advances a small pure state machine, publishes the next
sub-batch under the same JID, and fires batch lifecycle events
(salt/batch/<jid>/{new,progress,complete,halted,recover}). The master's
maintenance loop sweeps stale batches and either recovers or retires them.
The design follows a single-operator model: eauth is validated once at
publish time, and every sub-batch thereafter is attributed to the same user,
so publisher-ACL and audit behavior are preserved.
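
As a rough illustration of how a consumer might route the lifecycle events above, the sketch below splits a tag into its JID and event name. The helper `parse_batch_tag` is hypothetical and is not part of this PR; only the tag layout `salt/batch/<jid>/<event>` and the five event names come from the description.

```python
from typing import Optional, Tuple

# Lifecycle event names listed in the PR description.
LIFECYCLE_EVENTS = {"new", "progress", "complete", "halted", "recover"}


def parse_batch_tag(tag: str) -> Optional[Tuple[str, str]]:
    """Return (jid, event) for a salt/batch/<jid>/<event> tag, else None.

    Hypothetical helper for illustration; not an API added by this PR.
    """
    parts = tag.split("/")
    if len(parts) != 4 or parts[0] != "salt" or parts[1] != "batch":
        return None
    jid, event = parts[2], parts[3]
    if not jid or event not in LIFECYCLE_EVENTS:
        return None
    return jid, event
```

A listener would typically ignore any tag for which this returns `None`, such as per-minion `salt/job/<jid>/ret/<minion>` returns, and dispatch on the event name otherwise.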

Compatibility

No CLI flag, config key, or return-shape changes. Sync batches now also
write state to disk, so they're visible to the new runner commands.

Issues

Closes #25362, #58502.

dwoz requested a review from a team as a code owner April 20, 2026 00:21
dwoz added the test:full (Run the full test suite) label Apr 20, 2026
dwoz added 17 commits April 23, 2026 19:00
Add the structural skeleton for async batch mode without changing any
runtime behavior. New files contain class/function definitions with
docstrings but no implementation bodies. Existing files have
commented-out call sites describing where the new code will be wired in.

New files:
- salt/utils/batch_state.py: BatchState, progress_batch(), Action
- salt/utils/batch_output.py: CLIOutput, EventOutput, SilentOutput
- salt/runners/batch.py: status(), list_active(), stop()

Modified files:
- salt/master.py: BatchManager process class, Maintenance.handle_batch_jobs()
- salt/cli/batch.py: commented-out refactoring plan in Batch.run()
- salt/cli/salt.py: commented-out async+batch path in _run_batch()
Pin behavior that is currently implicit in salt.cli.batch.Batch so the
upcoming progress_batch() refactor can proceed against a stable
baseline.

Covers:

- get_bnum edges: 100% exact, <1% ceil branch, batch=0, percentage with
  an empty minion list, and fractional percentages.
- __update_wait: empty, all-past, all-future, and mixed-sorted cases.
- Batch.run branches: early return with no minions, show_jid/verbose
  passthrough from an attached parser, raw-mode yield shape with
  raw=True forwarding, retcode-dict max collapse, empty retcode dict
  collapsing to zero, failed-to-respond plus failhard halting without
  yielding, and batch_wait blocking the next dispatch (verified via a
  controlled clock and a spin-sleep counter).
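
The get_bnum edge cases listed above can be pictured with a small stand-alone function. This is a sketch, not the code in salt.cli.batch: the name `batch_size` is hypothetical, and the rounding rule (round up only when the computed share is below one minion, truncate otherwise) is an assumption inferred from the "<1% ceil branch" case.

```python
import math
from typing import Sequence, Union


def batch_size(batch: Union[int, str], minions: Sequence[str]) -> int:
    """Resolve a batch spec such as 5 or "10%" to a minion count.

    Hypothetical stand-in for get_bnum; the rounding rule is assumed.
    """
    if isinstance(batch, str) and "%" in batch:
        share = float(batch.strip("%")) / 100.0 * len(minions)
        # A share below one minion rounds up; anything else truncates.
        return math.ceil(share) if share < 1 else int(share)
    return int(batch)
```

Under this rule a percentage applied to an empty minion list resolves to zero, and a fractional percentage such as "33.3%" of ten minions truncates to three.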

Made-with: Cursor
Previously minion_tracker[queue]["minions"] and next_ (the sub-batch
target list passed into cmd_iter_no_block) were the same list object.  As
returns arrive, the .remove() calls that prune the tracker also
mutate next_.  salt.client.LocalClient.cmd_iter_no_block materializes
its target list before streaming, so production is unaffected, but a
streaming consumer of args[0] would see the list shorten mid-iteration
and stop iterating early.  The same aliasing let existing multi-minion
unit tests pass silently with placeholder ({'ret': {}}) yields for
minions that should have returned real data.

Copy the list so the tracker owns its own state.  Tighten the
existing multi-minion tests (test_single_jid_across_batch_iterations,
test_single_jid_with_failhard, test_single_jid_single_batch) to
assert every yield carries a real return, which would have caught the
aliasing.
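
The aliasing hazard can be reproduced outside Salt in a few lines. In this sketch `dispatch` and `run` are hypothetical stand-ins: `dispatch` plays the role of a streaming consumer that iterates its target list lazily, and `run` prunes a tracker as each "return" arrives.

```python
def dispatch(targets):
    """Stand-in for a streaming consumer that iterates its targets lazily."""
    for minion in targets:
        yield minion


def run(copy_targets: bool):
    """Return the minions the consumer actually saw."""
    next_ = ["m1", "m2", "m3", "m4"]
    # Without the copy, tracker and next_ are the same list object.
    tracker = list(next_) if copy_targets else next_
    seen = []
    for minion in dispatch(next_):
        seen.append(minion)
        tracker.remove(minion)  # prune the tracker as the return arrives
    return seen
```

With the shared list, each removal shifts the remaining items under the iterator, so `run(False)` sees only `m1` and `m3`; with the copy, `run(True)` sees all four minions.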

Made-with: Cursor
Introduce ``salt/utils/batch_state.py`` as the single source of truth
for batch slot accounting, failhard, timeouts, and batch_wait
semantics.  ``progress_batch()`` is a pure function that consumes the
current BatchState plus new minion returns and returns an Action
describing what the caller should do next (publish, yield, halt).
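
To make the "pure function" shape concrete, here is a heavily simplified sketch. The names BatchState, Action, and progress_batch match the commit message, but every field, the action kinds, and the return signature below are assumptions made for illustration; the real module also handles timeouts and batch_wait, which are omitted here.

```python
from dataclasses import dataclass, replace
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class BatchState:
    # All fields are hypothetical; the real BatchState is not shown in the PR text.
    pending: Tuple[str, ...]               # minions not yet published to
    active: FrozenSet[str] = frozenset()   # minions with a job in flight
    batch_size: int = 1
    failhard: bool = False


@dataclass(frozen=True)
class Action:
    kind: str                              # "publish", "wait", "halt", or "done"
    minions: Tuple[str, ...] = ()


def progress_batch(
    state: BatchState, returns: Dict[str, int]
) -> Tuple[BatchState, Action]:
    """Fold new returns (minion -> retcode) into state and pick the next step.

    Pure: the input state is never mutated; a new state is returned.
    """
    active = state.active.difference(returns)
    if state.failhard and any(rc != 0 for rc in returns.values()):
        return replace(state, active=active, pending=()), Action("halt")
    free = max(0, state.batch_size - len(active))
    nxt, rest = state.pending[:free], state.pending[free:]
    if nxt:
        new_state = replace(state, active=active.union(nxt), pending=rest)
        return new_state, Action("publish", nxt)
    return replace(state, active=active), Action("wait" if active else "done")
```

Because the function only maps (state, returns) to (state, action), a sync driver and an async driver can share it and differ only in how they carry out the returned action.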

Async execution lives in a dedicated ``BatchManager``
``SignalHandlingProcess`` (``salt/utils/batch_manager.py``) started by
the master's ``ProcessManager``.  The manager listens for
``salt/batch/<jid>/{new,stop,recover}`` events and for per-minion
``salt/job/<jid>/ret/<minion>`` returns, persists state to
``.batch.p`` in the JID dir, and maintains an on-disk active-batch
index at ``<cachedir>/batch_active.p``.  An active-index
reconciliation on every tick closes the race where a ``new`` event
is lost while a CLI-created batch is still on disk.  Output goes to
the event bus via ``EventOutput``; the maintenance loop periodically
sweeps stale/halted batches and fires ``recover`` events.

Sync batch (``salt/cli/batch.py``) is refactored to drive the same
``progress_batch()`` plus a ``CLIOutput`` adapter — no more inline
slot math or private ``__update_wait`` helper.  Observable yield
shape (normal / raw / failed-True-skip / batch_wait idle / failhard
early-exit) is preserved byte-for-byte against the pre-refactor
implementation, proven by a parity suite that replays the Phase 1
conformance scenarios through ``Batch.run()``.  Persistence is
best-effort on the sync path so missing ``cachedir`` (legacy test
fixtures) degrades gracefully; successful runs write ``.batch.p``
with ``driver="cli"`` so ``salt-run batch.status`` and
``batch.list_active`` now see sync batches too.

CLI: ``salt --async --batch-size=N`` no longer rejects the
combination.  The CLI gathers minions, publishes the first sub-batch
via ``LocalClient.run_job`` (so eauth is handled exactly once),
persists state, fires ``salt/batch/<jid>/new`` for the BatchManager
to adopt, prints the JID, and exits.  A ``batch`` runner exposes
``status``, ``list_active``, and ``stop`` (graceful + ``kill=True``).

Tests: 123 unit tests across the state machine (conformance + direct
helper tests), output adapters, BatchManager, stop runner,
maintenance safety net, sync-driver parity, and the preserved sync
batch suite — all green.

Made-with: Cursor
CI sets RAISE_DEPRECATIONS_RUNTIME_ERRORS=1. With Salt at 3008.0+, warn_until(3008, ...) and version="Argon" are treated as expired and raise RuntimeError. Bump numeric gates to 3009, use Potassium for file.shortcut, and align namespaced_function test expectations.

Made-with: Cursor
Use writable cachedir/root in yamlex and saltcheck tests under CI=1.
Fix mocked_tcp_pub_client to create futures on a dedicated event loop.
Patch aptpkg tests with a temp sources.list for deb822 del_repo paths.
Harden junos YAML security tests for loader aliases and suite ordering.
Relax log beacon and SSH single trace assertions when logging is noisy.
Rename batch_state _scenarios module to batch_state_scenarios and update
imports plus EXCLUDED_FILES in test_module_names.
Use grains.get for Debian osmajorrelease in test_ip_to_host.
Align vsphere deprecation removal text with the warn_until gate.
Refresh batch parity expectations for batching-async behavior.

Made-with: Cursor
Coalesce None timeout and gather_job_timeout in batch_state init and
progress_batch so orchestration with explicit null timeout does not raise.
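
The reason an explicit null needs coalescing is that `dict.get(key, default)` only applies the default when the key is absent, not when it is present with a value of None. A minimal sketch follows; the helper name `resolve_timeouts` and the fallback values 5 and 10 are assumptions for illustration, not the PR's code.

```python
from typing import Any, Dict, Tuple

# Assumed fallbacks for illustration only.
DEFAULT_TIMEOUT = 5
DEFAULT_GATHER_JOB_TIMEOUT = 10


def resolve_timeouts(opts: Dict[str, Any]) -> Tuple[int, int]:
    """Return (timeout, gather_job_timeout), treating None like a missing key."""
    timeout = opts.get("timeout")
    gather = opts.get("gather_job_timeout")
    if timeout is None:
        timeout = DEFAULT_TIMEOUT
    if gather is None:
        gather = DEFAULT_GATHER_JOB_TIMEOUT
    return int(timeout), int(gather)
```

With `opts = {"timeout": None}`, `opts.get("timeout", 5)` still yields None, and any later arithmetic on it raises TypeError; the explicit `is None` check avoids that.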
Resolve main salt RPM path from install_salt.pkgs for RPM tilde pre-releases
in test_pkg_meta.
Skip warn_until RuntimeError tests unless RAISE_DEPRECATIONS_RUNTIME_ERRORS=1;
split date and kwargs cases into dedicated tests.
Remove nxos unit tests for cmd, show, add_config, and system_info removed in
3008.x.

Made-with: Cursor
Query %{VERSION} and %{RELEASE} from the metapackage for provides and
requires expectations so RPM metadata matches GA and RC builds.
Merge rpmlib(TildeInVersions) lines from actual rpm output so GA
packages and varying rpm versions stay valid.

Made-with: Cursor
Map PEP440 prev_version strings to RPM NEVRA spelling for Photon yum
installs, with unit coverage. Restart systemd Salt services after local
RPM upgrades and after downgrade integration tests on Linux. Harden
upgrade systemd fixture teardown when a test skips before install(upgrade)
and restart services after teardown downgrade on all Linux pkg runs.

Made-with: Cursor
Accept salt-owned paths under package dirs when not listed explicitly (test_pkg_paths). On macOS, detect /opt/saltstack/salt vs /opt/salt, align bin_dir with install_dir, refresh binary_paths after pkg install and on fixture enter. Resolve Darwin symlink/version tests via install_salt.binary_paths with legacy /usr/local/sbin fallback.
Align test_pkg_paths with real root vs salt ownership on install and upgrades. Harden macOS onedir detection (exists, flat salt wrapper, which fallback) and add unit coverage. Allow longer Windows NSIS install_previous timeout.
test_pkg_paths: some packages keep salt-owned files in group root.

_macos_salt_onedir_prefix: restore a single which() + resolve() path, and
_refresh_macos_binary_paths: prepend /opt onedir locations to PATH so
shutil.which sees the on-disk layout even when a stale default install_dir
influenced the environment.
Add pep440_public_equal for comparing salt --version to artifact or prev
versions when PEP 440 local segments differ. Use it in test_salt_version and
test_salt_upgrade; use prev_version verbatim when use_prev_version is set.
Relax fnmatch patterns for compare/symlink tests. Retry download_file on
HTTP 5xx. Extend macOS onedir prefix tests for command-v fallback.
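
For readers unfamiliar with PEP 440 local segments, the comparison idea can be sketched with the standard library alone. The functions below are hypothetical and are not the PR's pep440_public_equal: they only drop everything after the first "+" and do not normalize alternative spellings of the same release, which a full PEP 440 parser would.

```python
def public_part(version: str) -> str:
    """Drop the PEP 440 local segment (everything after the first '+')."""
    return version.strip().partition("+")[0].lower()


def public_equal(left: str, right: str) -> bool:
    """Compare two version strings while ignoring their local segments.

    Simplified sketch: no normalization beyond case and whitespace.
    """
    return public_part(left) == public_part(right)
```

Under this sketch a build reported as `3008.0+12.g1a2b3c` compares equal to an artifact version of `3008.0`, while `3008.0rc1` and `3008.0` remain distinct.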
Treat minion runtime trees and drop-in config as exclusions. Match
3008-style salt:salt subtree checks while skipping excluded paths.
Allow salt-owned onedir files under /opt/saltstack/salt with root or
salt group, and permit salt-owned files under minion runtime prefixes
when walking root-owned directories.
Pin non-Photon RPM downgrades to the requested NEVRA so dnf/yum cannot
land on the wrong major. Allow Debian/Ubuntu post-downgrade path
ownership checks for releases before 3007.0.

On Windows, verify pkg.removed with list_pkgs scoped like Add/Remove
Programs (exclude installer components and updates) so successful
uninstalls are not reported as failed.

Run 32-bit pkg functional tests against putty before npp for more
stable winrepo metadata on CI runners.
The Broadcom salt.repo enables v3006 LTS by default, which excludes *3008*
from that stanza. Package tests that install --prev-version=3008.x need
salt-repo-latest so yum/dnf can resolve published 3008 builds.

Apply the same enablement on Photon, which previously skipped repo toggles.
PhotonOS uses tdnf behind yum; config-manager --enable is unsupported.
For prev-version 3008+, set enabled=1 on the salt-repo-latest stanza in the
copied salt.repo INI and run makecache so install_previous can resolve RPMs.
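
Flipping enabled=1 on one stanza of a repo file can be sketched with Python's configparser. This is an illustration under assumptions, not the test-suite code: the function name `enable_stanza` is hypothetical, the repo file is assumed to parse as plain INI, and the makecache step is left out.

```python
import configparser
import io


def enable_stanza(repo_text: str, stanza: str) -> str:
    """Return repo_text with enabled=1 set on the named stanza.

    Hypothetical helper; raises KeyError if the stanza is absent.
    """
    # Disable interpolation so literal '%' or '$' in URLs is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    if not parser.has_section(stanza):
        raise KeyError(stanza)
    parser.set(stanza, "enabled", "1")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()
```

Note that configparser drops comments when it rewrites a file, which is acceptable for a throwaway copy of salt.repo but would not be for a file kept under administrator control.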