Cap minion AsyncAuth retry loop with auth_tries (#69442) #69443
Merged
dwoz added a commit to dwoz/salt that referenced this pull request Jun 30, 2026
Per maintainer direction on PR saltstack#69443: on the 3006.x LTS branch, the new outer-loop cap added in 86f697d must not silently change failure modes for existing users. A long-disconnected minion that used to eventually reconnect should keep doing so on upgrade unless the operator explicitly opts in to the new safety cap.

Rather than reuse the pre-existing ``auth_tries`` option (which is already consumed by ``sign_in()`` as the per-request channel-send retry count, and which would change behavior for every existing 3006.x installation if we used it for the outer loop too), introduce a new dedicated ``auth_retries`` minion option:

* Added to ``VALID_OPTS`` and ``DEFAULT_MINION_OPTS`` with default ``0``.
* ``0`` is interpreted by ``AsyncAuth._authenticate()`` as "unlimited / preserve the pre-3006.26 behavior of retrying forever".
* A positive integer caps the outer loop at that many attempts and surfaces a ``SaltClientError("Failed to authenticate with the master after N attempts")``.

Tests
-----

* Renamed ``test_authenticate_caps_retry_loop_with_auth_tries_69442`` to ``..._with_auth_retries_69442`` and switched the opts key.
* Added ``test_authenticate_default_does_not_cap_retry_loop_69442`` asserting that with no ``auth_retries`` set the loop runs well past any plausible small finite cap and the resulting error is the generic "Attempt to authenticate ... failed" rather than the cap-specific "...after N attempts".

Refs saltstack#69442
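The opt-in semantics described in this commit message can be illustrated with a minimal, self-contained sketch. It is not the code from salt/crypt.py: the function names ``authenticate`` and ``flaky_master`` are hypothetical, ``SaltClientError`` is a local stand-in for the real exception class, and the sleep/backoff between attempts is left out so that only the cap logic is visible.

```python
import itertools


class SaltClientError(Exception):
    """Local stand-in for salt.exceptions.SaltClientError."""


def authenticate(sign_in, opts):
    """Call sign_in() until it stops answering "retry" or the cap is hit.

    Assumption for this sketch: auth_retries == 0 (or unset) means
    "no cap", and a positive value caps the number of "retry" answers.
    """
    cap = opts.get("auth_retries", 0)
    attempts = 0
    while True:
        creds = sign_in()
        if creds != "retry":
            return creds
        attempts += 1
        if cap > 0 and attempts >= cap:
            raise SaltClientError(
                f"Failed to authenticate with the master after {attempts} attempts"
            )


def flaky_master(retries_before_accept):
    """Hypothetical sign_in stub: answers "retry" N times, then succeeds."""
    answers = itertools.chain(
        ["retry"] * retries_before_accept,
        itertools.repeat({"aes": "key"}),
    )
    return lambda: next(answers)


if __name__ == "__main__":
    # Default opts (no auth_retries): 50 "retry" answers are tolerated.
    print(authenticate(flaky_master(50), {}))
    # Opt-in cap of 3: the loop gives up on the third "retry".
    try:
        authenticate(flaky_master(50), {"auth_retries": 3})
    except SaltClientError as exc:
        print(exc)
```

With the default opts the stub eventually succeeds no matter how many "retry" answers precede it; only an explicit positive ``auth_retries`` turns the stuck loop into an error.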
The minion's AsyncAuth._authenticate() outer loop on 3006.x and 3007.x
keeps calling sign_in() forever whenever the master answers with the
"retry" sentinel (key not yet accepted, master AES rotation in flight,
multi-master probe). The minion sleeps acceptance_wait_time between
attempts, doubling up to acceptance_wait_time_max, and never surfaces
an error: no log, no traceback, just a stuck minion.
3008.x already caps this loop using the existing auth_tries option
(default 7); backport the same guard so the minion bails out of
_authenticate() with SaltClientError("Failed to authenticate with the
master after N attempts") once auth_tries iterations have been spent
returning "retry". auth_tries=0 keeps the old "loop forever" behavior
for operators who actually want it.
The synchronous SAuth.authenticate() path is intentionally left
unchanged: that is a separate code path used by salt-call and other
single-shot CLI flows, and its existing semantics are out of scope for
this fix.
Fixes saltstack#69442
What does this PR do?
Backports the auth_tries outer-loop cap on the minion's AsyncAuth._authenticate() from 3008.x. When sign_in() keeps returning the "retry" sentinel, the minion will now bail out of the authentication loop after auth_tries attempts (default 7) with SaltClientError("Failed to authenticate with the master after N attempts"), instead of looping silently forever with exponential backoff up to acceptance_wait_time_max. auth_tries=0 preserves the legacy "loop forever" behaviour for operators who explicitly want it. The SAuth.authenticate() synchronous path is intentionally left unchanged: it is the salt-call / single-shot CLI codepath and is out of scope for this fix.
What issues does this PR fix or reference?
Fixes #69442
Previous Behavior
On 3006.x and 3007.x, a minion whose sign_in() consistently returns "retry" (master key not yet accepted, master AES rotation in flight, multi-master probe against an unreachable peer, etc.) sleeps acceptance_wait_time between attempts, doubles that delay up to acceptance_wait_time_max, and never logs an error. The minion appears stuck with no operator-visible signal.
New Behavior
After auth_tries consecutive "retry" responses, the loop terminates with SaltClientError("Failed to authenticate with the master after N attempts"), which is then wrapped by salt.channel.client.AsyncPubChannel.connect() into the user-visible "Unable to sign_in to master: ..." log line. This matches the behaviour 3008.x has had since the auth_tries cap was introduced.
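A rough sketch of how the cap-specific error might reach the operator, assuming the wrapping is a plain catch-and-re-raise with the "Unable to sign_in to master: " prefix. The functions ``capped_authenticate``, ``connect`` and ``describe_failure`` are hypothetical stand-ins, not the actual AsyncPubChannel code, and ``SaltClientError`` is redefined locally so the example runs without Salt installed.

```python
class SaltClientError(Exception):
    """Local stand-in for salt.exceptions.SaltClientError."""


def capped_authenticate():
    """Hypothetical authenticate step that has just exhausted its cap."""
    raise SaltClientError(
        "Failed to authenticate with the master after 7 attempts"
    )


def connect(authenticate):
    """Wrap an authentication failure in the sign_in prefix (assumed shape)."""
    try:
        return authenticate()
    except SaltClientError as exc:
        raise SaltClientError(f"Unable to sign_in to master: {exc}") from exc


def describe_failure(authenticate):
    """Return the message an operator would see, or None on success."""
    try:
        connect(authenticate)
    except SaltClientError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(describe_failure(capped_authenticate))
```

The point of the sketch is only that the inner "after N attempts" text survives inside the outer message, so the log line tells the operator both that sign-in failed and why.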
Merge requirements satisfied?
* Docs (auth_tries is already documented and its default of 7 carries over)
* Changelog (changelog/69442.fixed.md)
* Tests (tests/pytests/unit/test_crypt.py::test_authenticate_caps_retry_loop_with_auth_tries_69442)
Commits signed with GPG?
No (matching the rest of 3006.x history; let me know if you want this re-signed).