Skip to content

Gracefully stop ZMQ REQ client coroutine before closing IOLoop (#68637)#69044

Open
twangboy wants to merge 9 commits intosaltstack:3006.xfrom
twangboy:fix/68637/3006.x/queue
Open

Gracefully stop ZMQ REQ client coroutine before closing IOLoop (#68637)#69044
twangboy wants to merge 9 commits intosaltstack:3006.xfrom
twangboy:fix/68637/3006.x/queue

Conversation

@twangboy
Copy link
Copy Markdown
Contributor

@twangboy twangboy commented May 4, 2026

What does this PR do?

AsyncReqMessageClient.close() ran while the IOLoop was stopped, so _send_recv could stay blocked in Tornado Queue.get(), retaining queue waiters and related state for each short-lived ReqChannel (e.g. salt-api). Queue a shutdown sentinel, run_sync the loop once to unwind _send_recv, signal completion via a Future, then tear down the socket/context. Split ZMQ teardown into _close_zmq_only() for reconnect paths so we do not recurse through close(). Fix future.exception() checks to call the method.

What issues does this PR fix or reference?

Fixes #68637

Merge requirements satisfied?

[NOTICE] Bug fixes or features added to Salt require tests.

Commits signed with GPG?

Yes

@twangboy twangboy requested a review from a team as a code owner May 4, 2026 19:23
@twangboy twangboy self-assigned this May 4, 2026
@twangboy twangboy added the test:full Run the full test suite label May 4, 2026
@twangboy twangboy added this to the Sulpher v3006.24 milestone May 4, 2026
twangboy added 9 commits May 7, 2026 11:27
…tack#68637)

AsyncReqMessageClient.close() ran while the IOLoop was stopped, so _send_recv
could stay blocked in Tornado Queue.get(), retaining queue waiters and related
state for each short-lived ReqChannel (e.g. salt-api).
Queue a shutdown sentinel, run_sync the loop once to unwind _send_recv, signal
completion via a Future, then tear down the socket/context. Split ZMQ teardown
into _close_zmq_only() for reconnect paths so we do not recurse through close().
Fix future.exception() checks to call the method.
Refs saltstack#68637
…k#68637)

- Branch close() on io_loop._running: run_sync when idle, otherwise
  schedule graceful shutdown via add_future/add_callback; block with
  threading.Event when closing from another thread. Document Tornado
  _running/_thread_ident coupling in code comments and class docstring.
- Add unit coverage: graceful close on dedicated loop, close while
  IOLoop.start() is active (no run_sync / "already running").
- Add Linux RSS regression tests (functional connect-churn, integration
  LocalClient.pub under live master) and tools/repro_reqchannel_churn.py.
- Reject cleartext payloads missing cmd in _handle_clear with a safe
  log line (keys only) and unit test; fix ulimit hint in tests/conftest
  (-n not -u).
- Add agents/docs/pr-description-68637-reqchannel.md as PR blurb draft.
…#68637)

- Teach AsyncReqMessageClient.close() to avoid run_sync when the owning
  IOLoop is running: schedule graceful shutdown via add_future, block on an
  Event from other threads, and note Tornado _running/_thread_ident coupling.
- Harden _handle_clear for missing cmd with keys-only ERROR logging; fix
  tests/conftest ulimit hint (-n).
- Add Linux RSS regressions (functional connect-churn, integration LocalClient.pub),
  repro script tools/repro_reqchannel_churn.py, and agents PR blurb MD.
- Stabilize RequestClient zeromq tests for deferred teardown: poll after close
  where needed, relaxed caplog for graceful shutdown path, REP stream/socket/ctx
  teardown helper to trim pyzmq __del__ noise.
- Lint: pylint broad-except on RSS churn tests; salt.utils.files.fopen for VmRSS reads.
saltstack#68637)

Deferring AsyncReqMessageClient.close() on the owning I/O loop returned before ZMQ
FDs dropped, so reconnect paths could overlap old and new REQ stacks and trip
functional FD/RSS regressions under load.
Production:
- Expose teardown completion via close_future(); keep close() semantics for callers
  that block off-loop with remove_timeout/Event.
- Zeromq RequestClient: close_future(); minion connect_master paths yield
  AsyncReqChannel.close_async() before recreating ReqChannel (saltstack#68637 / fd leak flake).
- Base RequestClient.default close_future() completes immediately after close().
Zeromq send():
- Cancel per-message timeouts with io_loop.remove_timeout() (Salt Tornado uses
  _Timeout handles without .cancel()) instead of crashing in finally paths.
Developer / tests:
- Add tools/repro_reqchannel_churn.py for Linux VmRSS churn outside pytest (saltstack#68637).
- Harden zeromq test_request_client: sync/async finalize helpers drain REQ teardown
  before REP/context teardown (reduces pyzmq Socket.__del__ stderr noise).
…drain tick

pkg: SaltPkgInstall.__exit__: If terminate_process_list hits ``OSError`` with
errno 22 (EINVAL) during psutil wait/cleanup—seen on aarch64/debian CI—log a
warning and continue instead of failing the fixture teardown after test_salt_upgrade.
minion: After ``yield req_channel.close_async()`` in ``connect_master`` and the
multi-master reconnect path, ``yield salt.ext.tornado.gen.sleep(0)`` once so the
IOLoop can run deferred Zeromq finalizers before eval_master/recreating the channel.
Helps flaky functional FD snapshots without relaxing test_fd_leak thresholds.
Refs: saltstack#68637 (REQ teardown sequencing); pkg upgrade teardown flake.
…op_async

Add Minion.destroy_async() and Syndic.destroy_async() so REQ channels can run
close_async() and yield to the loop before the next connect attempt, matching
how connect_master already closes the REQ side asynchronously.
MinionManager._connect_minion now yields destroy_async() on SaltClientError,
SaltMasterUnresolvableError, and generic exceptions instead of synchronous
destroy(), avoiding half-closed transport state during multi-minion retries.
MinionManager.stop_async now yields destroy_async() for each minion after
stopping child processes.
The broad Exception path in _connect_minion now applies the same auth_wait
backoff as SaltClientError. Without a delay, connect_master could retry in a
tight loop, creating ZMQ contexts faster than teardown and leading to resource
errors on some hosts.
Unit tests: assert destroy_async is invoked on connect failure paths and use a
small coroutine helper for MagicMock side_effect when stubbing destroy_async.
…tency

Use Minion.destroy_async() (REQ close_async + loop yield) on MinionManager
_connect_minion retry paths and stop_async so transports finish closing before
the next connect attempt. Back off on broad exceptions like SaltClientError to
avoid tight reconnect loops that churn ZMQ contexts.
Unit tests: expect destroy_async where appropriate; stop_async test asserts
destroy_async on the mock minion.
Functional test_minion_connection_failure_no_fd_leak: replace the single-shot
baseline+margin check with a staircase detector plus a hard cap vs baseline, and
document why—async teardown can defer FD drops across sampling windows, so the
test targets sustained growth rather than one-sample noise.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:full Run the full test suite

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant