Async job batching #68964
Merged
Add the structural skeleton for async batch mode without changing any runtime behavior. New files contain class/function definitions with docstrings but no implementation bodies. Existing files have commented-out call sites describing where the new code will be wired in.
New files:
- salt/utils/batch_state.py: BatchState, progress_batch(), Action
- salt/utils/batch_output.py: CLIOutput, EventOutput, SilentOutput
- salt/runners/batch.py: status(), list_active(), stop()
Modified files:
- salt/master.py: BatchManager process class, Maintenance.handle_batch_jobs()
- salt/cli/batch.py: commented-out refactoring plan in Batch.run()
- salt/cli/salt.py: commented-out async+batch path in _run_batch()
Pin behavior that is currently implicit in salt.cli.batch.Batch so the upcoming progress_batch() refactor can proceed against a stable baseline.
Covers:
- get_bnum edges: 100% exact, <1% ceil branch, batch=0, percentage with an empty minion list, and fractional percentages.
- __update_wait: empty, all-past, all-future, and mixed-sorted cases.
- Batch.run branches: early return with no minions, show_jid/verbose passthrough from an attached parser, raw-mode yield shape with raw=True forwarding, retcode-dict max collapse, empty retcode dict collapsing to zero, failed-to-respond plus failhard halting without yielding, and batch_wait blocking the next dispatch (verified via a controlled clock and a spin-sleep counter).
Made-with: Cursor
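For reviewers who want the get_bnum edges in executable form, here is a minimal sketch of the sizing rule those tests pin. `sketch_batch_size` is a hypothetical stand-in written for this note, not the code in salt/cli/batch.py, and it assumes the option arrives as either an integer or an "N%" string.

```python
import math


def sketch_batch_size(batch, minion_count):
    """Resolve a batch option (int or 'N%' string) to a slot count.

    Hypothetical illustration of the edges listed above; not Salt's code.
    """
    if isinstance(batch, str) and batch.endswith("%"):
        share = float(batch.rstrip("%")) / 100.0 * minion_count
        # A share below one minion rounds up so the batch can still progress.
        # With an empty minion list the share is 0.0 and ceil(0.0) stays 0.
        return math.ceil(share) if share < 1 else int(share)
    return int(batch)
```

Fractional percentages at or above one minion are truncated rather than rounded, which is the behavior the "fractional percentages" case is meant to hold in place.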
Previously minion_tracker[queue]["minions"] and next_ (the sub-batch
target list passed into cmd_iter_no_block) were the same list. As
returns arrive, the .remove() calls that prune the tracker also
mutate next_. salt.client.LocalClient.cmd_iter_no_block materializes
its target list before streaming, so production is unaffected, but a
streaming consumer of args[0] would see the list shorten mid-iteration
and hit StopIteration early. The same aliasing made existing multi-minion
unit tests pass silently with placeholder ({'ret': {}}) yields for
minions that should have returned real data.
Copy the list so the tracker owns its own state. Tighten the
existing multi-minion tests (test_single_jid_across_batch_iterations,
test_single_jid_with_failhard, test_single_jid_single_batch) to
assert every yield carries a real return, which would have caught the
aliasing.
Made-with: Cursor
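The aliasing hazard is easy to reproduce outside Salt. The sketch below uses hypothetical names (`stream_targets`, `run_sub_batch`) and a generator as the stand-in for a streaming consumer; it only illustrates the list behavior described above, not the actual Batch.run code.

```python
def stream_targets(targets):
    """Hypothetical streaming consumer that reads its target list lazily."""
    for minion in targets:
        yield minion


def run_sub_batch(next_, copy_for_tracker):
    """Prune a tracker list while a consumer is still iterating next_."""
    # Without the copy, tracker and next_ are the same list object.
    tracker = list(next_) if copy_for_tracker else next_
    seen = []
    for minion in stream_targets(next_):
        seen.append(minion)
        tracker.remove(minion)  # prune the tracker as the return arrives
    return seen
```

With the shared list, `run_sub_batch(["m1", "m2", "m3"], copy_for_tracker=False)` only ever sees `m1` and `m3`, because each removal shifts the remaining items under the iterator. With the copy, all three minions are seen.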
Introduce ``salt/utils/batch_state.py`` as the single source of truth
for batch slot accounting, failhard, timeouts, and batch_wait
semantics. ``progress_batch()`` is a pure function that consumes the
current BatchState plus new minion returns and returns an Action
describing what the caller should do next (publish, yield, halt).
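As a rough picture of the "pure function returning an Action" shape, here is a deliberately reduced sketch. `SketchState`, `SketchAction`, and `sketch_progress` are hypothetical names; the real BatchState and Action carry more fields, and this version omits timeouts and batch_wait entirely.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class SketchState:
    batch_size: int
    pending: Tuple[str, ...]            # minions not yet published to
    active: FrozenSet[str] = frozenset()  # minions with a job in flight
    failhard: bool = False
    halted: bool = False


@dataclass(frozen=True)
class SketchAction:
    publish: Tuple[str, ...] = ()  # minions the caller should publish to next
    halt: bool = False


def sketch_progress(state, returns):
    """Consume {minion: retcode} returns; return (new_state, action).

    Nothing is mutated: the caller decides how to publish and persist.
    """
    if state.halted:
        return state, SketchAction(halt=True)
    active = frozenset(m for m in state.active if m not in returns)
    if state.failhard and any(rc != 0 for rc in returns.values()):
        return replace(state, active=active, halted=True), SketchAction(halt=True)
    free = max(state.batch_size - len(active), 0)
    to_publish = state.pending[:free]
    new_state = replace(
        state,
        pending=state.pending[free:],
        active=active | frozenset(to_publish),
    )
    return new_state, SketchAction(publish=to_publish)
```

Because the function has no side effects, the sync CLI driver and the async manager can both call it and differ only in how they act on the returned action.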
Async execution lives in a dedicated ``BatchManager``
``SignalHandlingProcess`` (``salt/utils/batch_manager.py``) started by
the master's ``ProcessManager``. The manager listens for
``salt/batch/<jid>/{new,stop,recover}`` events and for per-minion
``salt/job/<jid>/ret/<minion>`` returns, persists state to
``.batch.p`` in the JID dir, and maintains an on-disk active-batch
index at ``<cachedir>/batch_active.p``. An active-index
reconciliation on every tick closes the race where a ``new`` event
is lost while a CLI-created batch is still on disk. Output goes to
the event bus via ``EventOutput``; the maintenance loop periodically
sweeps stale/halted batches and fires ``recover`` events.
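To make the two tag families concrete, the sketch below classifies an incoming event tag. `classify_tag` and both patterns are hypothetical and assume JIDs are purely numeric strings; the BatchManager's real subscription and matching logic may differ.

```python
import re

# Assumption: JIDs consist of digits only.
BATCH_TAG = re.compile(r"^salt/batch/(?P<jid>\d+)/(?P<event>new|stop|recover)$")
RETURN_TAG = re.compile(r"^salt/job/(?P<jid>\d+)/ret/(?P<minion>.+)$")


def classify_tag(tag):
    """Return ('batch', jid, event), ('return', jid, minion), or None."""
    match = BATCH_TAG.match(tag)
    if match:
        return ("batch", match["jid"], match["event"])
    match = RETURN_TAG.match(tag)
    if match:
        return ("return", match["jid"], match["minion"])
    return None  # not a tag the batch manager cares about
```

A dispatcher built on this would hand "batch" tuples to lifecycle handling and "return" tuples to the state machine for the matching JID.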
Sync batch (``salt/cli/batch.py``) is refactored to drive the same
``progress_batch()`` plus a ``CLIOutput`` adapter — no more inline
slot math or private ``__update_wait`` helper. Observable yield
shape (normal / raw / failed-True-skip / batch_wait idle / failhard
early-exit) is preserved byte-for-byte against the pre-refactor
implementation, proven by a parity suite that replays the Phase 1
conformance scenarios through ``Batch.run()``. Persistence is
best-effort on the sync path so missing ``cachedir`` (legacy test
fixtures) degrades gracefully; successful runs write ``.batch.p``
with ``driver="cli"`` so ``salt-run batch.status`` and
``batch.list_active`` now see sync batches too.
CLI: ``salt --async --batch-size=N`` no longer rejects the
combination. The CLI gathers minions, publishes the first sub-batch
via ``LocalClient.run_job`` (so eauth is handled exactly once),
persists state, fires ``salt/batch/<jid>/new`` for the BatchManager
to adopt, prints the JID, and exits. A ``batch`` runner exposes
``status``, ``list_active``, and ``stop`` (graceful + ``kill=True``).
Tests: 123 unit tests across the state machine (conformance + direct
helper tests), output adapters, BatchManager, stop runner,
maintenance safety net, sync-driver parity, and the preserved sync
batch suite — all green.
Made-with: Cursor
CI sets RAISE_DEPRECATIONS_RUNTIME_ERRORS=1. With Salt at 3008.0+, warn_until(3008, ...) and version="Argon" are treated as expired and raise RuntimeError. Bump numeric gates to 3009, use Potassium for file.shortcut, and align namespaced_function test expectations. Made-with: Cursor
Use writable cachedir/root in yamlex and saltcheck tests under CI=1. Fix mocked_tcp_pub_client to create futures on a dedicated event loop. Patch aptpkg tests with a temp sources.list for deb822 del_repo paths. Harden junos YAML security tests for loader aliases and suite ordering. Relax log beacon and SSH single trace assertions when logging is noisy. Rename batch_state _scenarios module to batch_state_scenarios and update imports plus EXCLUDED_FILES in test_module_names. Use grains.get for Debian osmajorrelease in test_ip_to_host. Align vsphere deprecation removal text with the warn_until gate. Refresh batch parity expectations for batching-async behavior. Made-with: Cursor
Coalesce None timeout and gather_job_timeout in batch_state init and progress_batch so orchestration with explicit null timeout does not raise. Resolve main salt RPM path from install_salt.pkgs for RPM tilde pre-releases in test_pkg_meta. Skip warn_until RuntimeError tests unless RAISE_DEPRECATIONS_RUNTIME_ERRORS=1; split date and kwargs cases into dedicated tests. Remove nxos unit tests for cmd, show, add_config, and system_info removed in 3008.x. Made-with: Cursor
Query %{VERSION} and %{RELEASE} from the metapackage for provides and
requires expectations so RPM metadata matches GA and RC builds.
Merge rpmlib(TildeInVersions) lines from actual rpm output so GA
packages and varying rpm versions stay valid.
Made-with: Cursor
Map PEP440 prev_version strings to RPM NEVRA spelling for Photon yum installs, with unit coverage. Restart systemd Salt services after local RPM upgrades and after downgrade integration tests on Linux. Harden upgrade systemd fixture teardown when a test skips before install(upgrade) and restart services after teardown downgrade on all Linux pkg runs. Made-with: Cursor
Accept salt-owned paths under package dirs when not listed explicitly (test_pkg_paths). On macOS, detect /opt/saltstack/salt vs /opt/salt, align bin_dir with install_dir, refresh binary_paths after pkg install and on fixture enter. Resolve Darwin symlink/version tests via install_salt.binary_paths with legacy /usr/local/sbin fallback.
Align test_pkg_paths with real root vs salt ownership on install and upgrades. Harden macOS onedir detection (exists, flat salt wrapper, which fallback) and add unit coverage. Allow longer Windows NSIS install_previous timeout.
test_pkg_paths: some packages keep salt-owned files in group root. _macos_salt_onedir_prefix: restore a single which() + resolve() path, and _refresh_macos_binary_paths: prepend /opt onedir locations to PATH so shutil.which sees the on-disk layout even when a stale default install_dir influenced the environment.
Add pep440_public_equal for comparing salt --version to artifact or prev versions when PEP 440 local segments differ. Use it in test_salt_version and test_salt_upgrade; use prev_version verbatim when use_prev_version is set. Relax fnmatch patterns for compare/symlink tests. Retry download_file on HTTP 5xx. Extend macOS onedir prefix tests for command-v fallback.
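The idea behind comparing only the public part of a version can be sketched as follows. `sketch_public_equal` is a hypothetical helper, not the pep440_public_equal added here, and it assumes both strings are already in normalized PEP 440 form so that dropping everything after the `+` is sufficient.

```python
def sketch_public_part(version):
    """Drop the PEP 440 local segment (everything after '+')."""
    return version.split("+", 1)[0].strip().lower()


def sketch_public_equal(left, right):
    """Compare two version strings while ignoring local segments.

    Assumption: inputs are already normalized, so no further PEP 440
    canonicalization (e.g. '1.0-rc1' vs '1.0rc1') is attempted.
    """
    return sketch_public_part(left) == sketch_public_part(right)
```

This is enough to treat a build such as `3008.0+12.gabcdef` as matching `3008.0`, while a release candidate still compares unequal to the final release.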
Treat minion runtime trees and drop-in config as exclusions. Match 3008-style salt:salt subtree checks while skipping excluded paths. Allow salt-owned onedir files under /opt/saltstack/salt with root or salt group, and permit salt-owned files under minion runtime prefixes when walking root-owned directories.
Pin non-Photon RPM downgrades to the requested NEVRA so dnf/yum cannot land on the wrong major. Allow Debian/Ubuntu post-downgrade path ownership checks for releases before 3007.0. On Windows, verify pkg.removed with list_pkgs scoped like Add/Remove Programs (exclude installer components and updates) so successful uninstalls are not reported as failed. Run 32-bit pkg functional tests against putty before npp for more stable winrepo metadata on CI runners.
The Broadcom salt.repo enables v3006 LTS by default, which excludes *3008* from that stanza. Package tests that install --prev-version=3008.x need salt-repo-latest so yum/dnf can resolve published 3008 builds. Apply the same enablement on Photon, which previously skipped repo toggles.
PhotonOS uses tdnf behind yum; config-manager --enable is unsupported. For prev-version 3008+, set enabled=1 on the salt-repo-latest stanza in the copied salt.repo INI and run makecache so install_previous can resolve RPMs.
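Flipping `enabled=1` on one stanza of the copied repo file can be done with the standard library. The sketch below uses a hypothetical `enable_repo_stanza` helper and invented sample content; the package tests may edit the file differently, and the makecache step is not shown.

```python
import configparser
import io


def enable_repo_stanza(repo_text, stanza="salt-repo-latest"):
    """Return repo_text with enabled=1 set on the given stanza.

    Note: configparser drops comments when it rewrites the file.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(repo_text)
    if not parser.has_section(stanza):
        raise KeyError(stanza)
    parser.set(stanza, "enabled", "1")
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Invented sample; only the salt-repo-latest stanza name comes from the text above.
SAMPLE_REPO = (
    "[salt-repo-3006-lts]\nname=lts\nenabled=1\n\n"
    "[salt-repo-latest]\nname=latest\nenabled=0\n"
)
```

Calling `enable_repo_stanza(SAMPLE_REPO)` leaves the LTS stanza untouched and switches only the latest stanza on.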
Adds async batch execution.
`salt '*' fun --batch N --async` now returns a JID immediately and the master drives the batch to completion on its own. Sync
batch is refactored onto the same state machine so the two drivers share a
single implementation and behave identically.
User-facing changes
- `--async --batch-size=N` is now a supported combination.
- `salt-run batch.status <jid>`, `batch.list_active`, `batch.stop <jid>` (`kill=True` to halt hard).
- [...] disconnecting.
- [...] (`batch_wait`, `failhard`, timeouts) are unchanged.
Design
The master runs a dedicated process that owns batch progression. It listens
for minion returns, advances a small pure state machine, publishes the next
sub-batch under the same JID, and fires batch lifecycle events
(`salt/batch/<jid>/{new,progress,complete,halted,recover}`). The master's
maintenance loop sweeps stale batches and either recovers or retires them.
A single-operator model: eauth is validated once at publish time, and every
sub-batch thereafter is attributed to the same user so publisher-ACL and
audit behavior are preserved.
Compatibility
No CLI flag, config key, or return-shape changes. Sync batches now also
write state to disk, so they're visible to the new runner commands.
Issues
Closes #25362, #58502.