Gracefully stop ZMQ REQ client coroutine before closing IOLoop (#68637)#69044
Open
twangboy wants to merge 9 commits intosaltstack:3006.xfrom
Open
Gracefully stop ZMQ REQ client coroutine before closing IOLoop (#68637)#69044twangboy wants to merge 9 commits intosaltstack:3006.xfrom
twangboy wants to merge 9 commits intosaltstack:3006.xfrom
Conversation
…tack#68637) AsyncReqMessageClient.close() ran while the IOLoop was stopped, so _send_recv could stay blocked in Tornado Queue.get(), retaining queue waiters and related state for each short-lived ReqChannel (e.g. salt-api). Queue a shutdown sentinel, run_sync the loop once to unwind _send_recv, signal completion via a Future, then tear down the socket/context. Split ZMQ teardown into _close_zmq_only() for reconnect paths so we do not recurse through close(). Fix future.exception() checks to call the method. Refs saltstack#68637
…k#68637) - Branch close() on io_loop._running: run_sync when idle, otherwise schedule graceful shutdown via add_future/add_callback; block with threading.Event when closing from another thread. Document Tornado _running/_thread_ident coupling in code comments and class docstring. - Add unit coverage: graceful close on dedicated loop, close while IOLoop.start() is active (no run_sync / "already running"). - Add Linux RSS regression tests (functional connect-churn, integration LocalClient.pub under live master) and tools/repro_reqchannel_churn.py. - Reject cleartext payloads missing cmd in _handle_clear with a safe log line (keys only) and unit test; fix ulimit hint in tests/conftest (-n not -u). - Add agents/docs/pr-description-68637-reqchannel.md as PR blurb draft.
…#68637) - Teach AsyncReqMessageClient.close() to avoid run_sync when the owning IOLoop is running: schedule graceful shutdown via add_future, block on an Event from other threads, and note Tornado _running/_thread_ident coupling. - Harden _handle_clear for missing cmd with keys-only ERROR logging; fix tests/conftest ulimit hint (-n). - Add Linux RSS regressions (functional connect-churn, integration LocalClient.pub), repro script tools/repro_reqchannel_churn.py, and agents PR blurb MD. - Stabilize RequestClient zeromq tests for deferred teardown: poll after close where needed, relaxed caplog for graceful shutdown path, REP stream/socket/ctx teardown helper to trim pyzmq __del__ noise. - Lint: pylint broad-except on RSS churn tests; salt.utils.files.fopen for VmRSS reads.
saltstack#68637) Deferring AsyncReqMessageClient.close() on the owning I/O loop returned before ZMQ FDs dropped, so reconnect paths could overlap old and new REQ stacks and trip functional FD/RSS regressions under load. Production: - Expose teardown completion via close_future(); keep close() semantics for callers that block off-loop with remove_timeout/Event. - Zeromq RequestClient: close_future(); minion connect_master paths yield AsyncReqChannel.close_async() before recreating ReqChannel (saltstack#68637 / fd leak flake). - Base RequestClient.default close_future() completes immediately after close(). Zeromq send(): - Cancel per-message timeouts with io_loop.remove_timeout() (Salt Tornado uses _Timeout handles without .cancel()) instead of crashing in finally paths. Developer / tests: - Add tools/repro_reqchannel_churn.py for Linux VmRSS churn outside pytest (saltstack#68637). - Harden zeromq test_request_client: sync/async finalize helpers drain REQ teardown before REP/context teardown (reduces pyzmq Socket.__del__ stderr noise).
…drain tick pkg: SaltPkgInstall.__exit__: If terminate_process_list hits ``OSError`` with errno 22 (EINVAL) during psutil wait/cleanup—seen on aarch64/debian CI—log a warning and continue instead of failing the fixture teardown after test_salt_upgrade. minion: After ``yield req_channel.close_async()`` in ``connect_master`` and the multi-master reconnect path, ``yield salt.ext.tornado.gen.sleep(0)`` once so the IOLoop can run deferred Zeromq finalizers before eval_master/recreating the channel. Helps flaky functional FD snapshots without relaxing test_fd_leak thresholds. Refs: saltstack#68637 (REQ teardown sequencing); pkg upgrade teardown flake.
…op_async Add Minion.destroy_async() and Syndic.destroy_async() so REQ channels can run close_async() and yield to the loop before the next connect attempt, matching how connect_master already closes the REQ side asynchronously. MinionManager._connect_minion now yields destroy_async() on SaltClientError, SaltMasterUnresolvableError, and generic exceptions instead of synchronous destroy(), avoiding half-closed transport state during multi-minion retries. MinionManager.stop_async now yields destroy_async() for each minion after stopping child processes. The broad Exception path in _connect_minion now applies the same auth_wait backoff as SaltClientError. Without a delay, connect_master could retry in a tight loop, creating ZMQ contexts faster than teardown and leading to resource errors on some hosts. Unit tests: assert destroy_async is invoked on connect failure paths and use a small coroutine helper for MagicMock side_effect when stubbing destroy_async.
…tency Use Minion.destroy_async() (REQ close_async + loop yield) on MinionManager _connect_minion retry paths and stop_async so transports finish closing before the next connect attempt. Back off on broad exceptions like SaltClientError to avoid tight reconnect loops that churn ZMQ contexts. Unit tests: expect destroy_async where appropriate; stop_async test asserts destroy_async on the mock minion. Functional test_minion_connection_failure_no_fd_leak: replace the single-shot baseline+margin check with a staircase detector plus a hard cap vs baseline, and document why—async teardown can defer FD drops across sampling windows, so the test targets sustained growth rather than one-sample noise.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
AsyncReqMessageClient.close() ran while the IOLoop was stopped, so _send_recv could stay blocked in Tornado Queue.get(), retaining queue waiters and related state for each short-lived ReqChannel (e.g. salt-api). Queue a shutdown sentinel, run_sync the loop once to unwind _send_recv, signal completion via a Future, then tear down the socket/context. Split ZMQ teardown into _close_zmq_only() for reconnect paths so we do not recurse through close(). Fix future.exception() checks to call the method.
What issues does this PR fix or reference?
Fixes #68637
Merge requirements satisfied?
[NOTICE] Bug fixes or features added to Salt require tests.
Commits signed with GPG?
Yes