Fix #68940: parallel file.managed deadlocks on master file lookups#69028
Fix #68940: parallel file.managed deadlocks on master file lookups#69028co-cy wants to merge 3 commits into
Conversation
…ookups Forked parallel-state children inherited the parent's live ZeroMQ REQ socket via salt.fileclient.RemoteClient. Multiple sibling children calling cp.hash_file (triggered by file.managed with salt:// sources) raced the REQ/REP state machine and deadlocked the asyncio loop with ~98% CPU. Register os.register_at_fork(after_in_child=...) handlers on the three client-side classes that hold long-lived ZMQ handles (RemoteClient, AsyncAuth/SAuth, SaltEvent) so each forked child drops the inherited socket/IOLoop references and lazily rebuilds them on first use. The parent is unaffected. Public API surface (channel, auth, subscriber, pusher) is preserved via property/setter pairs.
Notes on the fork-safety fixPosting this as a follow-up comment because the issue body covers the Why the deadlock happens (one more pass)
When two There's nothing weird going on in the network or in the master — What I tried before settling on this approachThe minimal fix is to set
So instead of patching one call site, I push the responsibility down Why no
|
| self.pending_tags = [] | ||
| self.pending_events = [] | ||
| self.__load_cache_regex() | ||
| type(self)._register_atfork() |
There was a problem hiding this comment.
This is incredibly suspect. AI Slop?
There was a problem hiding this comment.
Hi @dwoz, thanks for the review.
type(self)._register_atfork() is used intentionally here to resolve the actual class of the instance at runtime, which matters if a subclass ever overrides _register_atfork or maintains its own _atfork_registered flag. Using type(self) over a hardcoded class name preserves correct behavior under inheritance, and over self._register_atfork() it makes the class-level nature of the call explicit at the call site (and avoids accidental shadowing by an instance attribute).
That said, I agree the idiom is uncommon and can look odd at a glance. Happy to change it — a few alternatives:
self._register_atfork() — shortest, relies on the descriptor protocol to bind the correct class. Fine in practice.
self.class._register_atfork() — equivalent to type(self), slightly more conventional in some codebases.
Move registration out of init entirely — e.g. into init_subclass or module-level init — so the hook is registered once per class instead of on every instantiation. This is probably the cleanest fix architecturally, since os.register_at_fork only needs to be called once.
Let me know which direction you'd prefer and I'll update the PR.
It is important to me that the product is improved. I'm new here, if you have any general practices, I'd like to get acquainted with them.
… tests The two functional tests (test_parallel_file_managed_from_master and test_parallel_cmd_script_from_master) do not actually guard the regression: Salt's in-process functional harness uses FSClient, which has no ZeroMQ socket, so the fork-inherited-socket deadlock cannot occur there. Verified locally -- both tests pass with AND without the fix, so they would not catch a regression. Replace them with tests/pytests/unit/test_fork_safety.py, which asserts the fix's actual contract: after os.fork(), the at-fork handler must clear the inherited handles in the child while the parent keeps its own. - test_remoteclient_drops_channel_in_forked_child: the direct saltstack#68940/saltstack#65709 path (cp.hash_file / cp.cache_file go through RemoteClient's REQ channel). Child sees _channel/_auth cleared. - test_asyncauth_sauth_clear_singletons_but_keep_creds_in_forked_child: AsyncAuth.instance_map / SAuth.instances reset in child; creds_map deliberately preserved (documented behaviour). - test_saltevent_drops_sockets_in_forked_child: subscriber/pusher None and cpub/cpush False in child. These are deterministic and fast (no master daemon, no ZeroMQ, no race): each fails with the fix reverted and passes with it in place, so they are real regression guards. Verified both directions locally. Refs: saltstack#68940, saltstack#65709
|
Follow-up on the regression tests — I have to flag something I found while validating them, and I've reworked the approach. The two functional tests did not actually guard the regression. I got my local So in
These are deterministic and fast (no master daemon, no ZeroMQ, no race window). I verified both directions locally: all three fail with the fix reverted and pass with it in place — i.e. they are real regression guards, which the functional tests were not. Reproducing the literal deadlock end-to-end would need a real master + ZeroMQ transport, and "did not hang" is an inherently flaky assertion (a green run only means the race didn't fire that time). Asserting the at-fork invariant is the deterministic equivalent of the manual |
What does this PR do?
Closes a fork-safety hazard in three client-side classes that hold long-lived
ZeroMQ handles:
salt.fileclient.RemoteClient,salt.crypt.AsyncAuth/SAuth, andsalt.utils.event.SaltEvent. Each class now registers anos.register_at_fork(after_in_child=...)handler that drops inherited socket/ IOLoop references in any forked child; the next use rebuilds them lazily.
The parent process is unaffected.
The user-visible effect is that parallel
file.managedstates withsource: salt://...no longer deadlock the asyncio loop in their forkedchildren when two or more states race
cp.hash_fileon the inheritedmaster REQ socket.
What issues does this PR fix or reference?
Fixes #68940
Previous Behavior
When two or more
file.managedstates withparallel: Truehadsalt://sources,
state.applywould hang forever:"Started in a separate process"for eachparallel state and never returns.
ParallelState(...)child processes appear, each at~93–98 % CPU.
futex_wait, buttopshows near-100 % CPU — aspinning asyncio event loop.
ESTABLISHEDwithSend-Q: 0.salt-call cp.hash_file <same-file>works fine outside the parallelcontext.
Root cause:
State.call_parallel()forked parallel-state children viamultiprocessing.Process, and the child inherited the parent'sStateinstance — including
State.file_client, aRemoteClientwith a liveZeroMQ REQ socket to the master. Two sibling children both calling
cp.hash_file(triggered byfile.managedforsalt://sources)raced the strict ZMQ REQ/REP state machine: a reply intended for one
child could be consumed by the other, leaving the originating child's
asyncio event loop blocked forever on a reply that already arrived.
The same architectural hazard existed (latent, not currently exercised)
in
AsyncAuth/SAuth(parent-bound IOLoop in singleton map) andSaltEvent(ZMQ subscriber/pusher with own asyncio loop).New Behavior
Each of the three classes now registers an
os.register_at_forkhandler that, in any forked child:
RemoteClient: clears_channeland_auth. The lazychannelproperty rebuilds a fresh
ReqChannel(new ZMQ socket, new asyncioloop) on next access.
AsyncAuth/SAuth: clears the singleton map (instance_map/instances).creds_mapis intentionally preserved — AES credsremain valid across fork and reusing them avoids a re-auth roundtrip.
SaltEvent: clearssubscriber,pusher,cpub,cpush. Existingconnect_pub()/connect_pull()already implement lazy reconnect.The handlers explicitly do not call
.close()on inheritedhandles —
SyncWrapper.close()would tear downIOLoopFDs andasyncio loop state copied from the parent and could corrupt
parent-side state. The child has its own FD-table copies; GC reclaims
them on child exit.
Public API is preserved:
client.channel,client.auth,event.subscriber, etc. remain readable / writable attributes (now via@property/@property.setter).Merge requirements satisfied?
or behaviour change beyond removing the deadlock. No docs/ rst
change required.
changelog/68940.fixed.mdadded.tests/pytests/functional/modules/state/test_state.py::test_parallel_file_managed_from_master,a regression test that runs two parallel
file.managedstateswith
salt://sources under@pytest.mark.timeout(60). Withoutthe fix the test hangs; with the fix it completes in well under
a second.
Verified locally on Python 3.11:
tests/pytests/unit/fileclient/— 28 passed, 1 skippedtests/pytests/unit/test_fileclient.py+test_event.py— 7 passedtests/pytests/unit/test_crypt.py— 11 passedos.fork()+multiprocessing.Process(start_method=fork)smoke test confirms
_channel/_auth/ subscriber / pusher arecleared in the child and the parent retains its own state.
Commits signed with GPG?
No
Implementation notes for reviewers
Why
os.register_at_forkrather than fixing onlycall_parallel():the latter would close this specific bug but leave the architectural
gap open for any future code adding
multiprocessing.Processover aparent that holds a
RemoteClient(orAsyncAuth/SaltEvent).Re-creating
Statefrom scratch in every forked child also pays fora full pillar gather and module load on every parallel state — a
significant cost for users with large pillar trees. The at-fork
handler keeps the cheap fork path intact and only rebuilds the ZMQ
piece, lazily, when the child actually talks to the master.
Repository audit showed
RemoteClientis the only currently-exercised fork hazard.
AsyncAuthandSaltEventare coveredpreemptively because they share the same structural property
(long-lived ZMQ handle held by a long-lived object) and would
surface the same class of bug under new call sites. Master-side
worker forks are unaffected — those already have a correct
pre_fork/post_forkprotocol insalt/transport/zeromq.py.FSClient(local file-server, no ZMQ) is deliberately excludedfrom the at-fork tracking and overrides
channel/authpropertiesto never lazy-rebuild as a remote
ReqChannel.Limitations:
salt.client.LocalClientis not covered. No current call siteforks a process holding a live
LocalClient; if one is addedlater, the same pattern can be applied in <20 lines.
subprocess.Popen(fork+exec). Cost is O(n) where n is the numberof live tracked instances (typically ≤ 2), sub-microsecond, and
exec discards the handler effect quickly. No measurable impact on
cmd.run-heavy workloads expected.forkserver path in
call_parallel()(in 3007.x+) already forcesan explicit fork context for that case and the handler covers it
correctly. 3006.x predates that change, so this PR only needs to
cover plain
fork.