Fix #68940: parallel file.managed deadlocks on master file lookups #69028

Open
co-cy wants to merge 3 commits into saltstack:3006.x from co-cy:fix/parallel-file-managed-fork-zmq

co-cy commented May 3, 2026

What does this PR do?

Closes a fork-safety hazard in three client-side classes that hold long-lived
ZeroMQ handles: salt.fileclient.RemoteClient, salt.crypt.AsyncAuth /
SAuth, and salt.utils.event.SaltEvent. Each class now registers an
os.register_at_fork(after_in_child=...) handler that drops inherited socket
/ IOLoop references in any forked child; the next use rebuilds them lazily.
The parent process is unaffected.

The user-visible effect is that parallel file.managed states with
source: salt://... no longer deadlock the asyncio loop in their forked
children when two or more states race cp.hash_file on the inherited
master REQ socket.

What issues does this PR fix or reference?

Fixes #68940

Previous Behavior

When two or more file.managed states with parallel: True had salt://
sources, state.apply would hang forever:

  • Main state process logs "Started in a separate process" for each
    parallel state and never returns.
  • Two or more ParallelState(...) child processes appear, each at
    ~93–98 % CPU.
  • Kernel stacks show futex_wait, but top shows near-100 % CPU — a
    spinning asyncio event loop.
  • Network connections to master:4506 are ESTABLISHED with Send-Q: 0.
  • salt-call cp.hash_file <same-file> works fine outside the parallel
    context.

Root cause: State.call_parallel() forked parallel-state children via
multiprocessing.Process, and the child inherited the parent's State
instance — including State.file_client, a RemoteClient with a live
ZeroMQ REQ socket to the master. Two sibling children both calling
cp.hash_file (triggered by file.managed for salt:// sources)
raced the strict ZMQ REQ/REP state machine: a reply intended for one
child could be consumed by the other, leaving the originating child's
asyncio event loop blocked forever on a reply that already arrived.

The same architectural hazard existed (latent, not currently exercised)
in AsyncAuth / SAuth (parent-bound IOLoop in singleton map) and
SaltEvent (ZMQ subscriber/pusher with own asyncio loop).

New Behavior

Each of the three classes now registers an os.register_at_fork
handler that, in any forked child:

  • RemoteClient: clears _channel and _auth. The lazy channel
    property rebuilds a fresh ReqChannel (new ZMQ socket, new asyncio
    loop) on next access.
  • AsyncAuth / SAuth: clears the singleton map (instance_map /
    instances). creds_map is intentionally preserved — AES creds
    remain valid across fork and reusing them avoids a re-auth roundtrip.
  • SaltEvent: clears subscriber, pusher, cpub, cpush. Existing
    connect_pub() / connect_pull() already implement lazy reconnect.

The handlers explicitly do not call .close() on inherited
handles — SyncWrapper.close() would tear down IOLoop FDs and
asyncio loop state copied from the parent and could corrupt
parent-side state. The child has its own FD-table copies; GC reclaims
them on child exit.

Public API is preserved: client.channel, client.auth,
event.subscriber, etc. remain readable / writable attributes
(client.channel and client.auth now via property / setter pairs).
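
A minimal sketch of the property / setter idea described above, assuming a hypothetical FakeChannel and Client in place of Salt's real ReqChannel and RemoteClient. It is not the PR's code; it only shows how a private attribute can be cleared and then rebuilt on the next access while the public attribute stays readable and writable.

```python
class FakeChannel:
    """Hypothetical stand-in for a real request channel."""

    def __init__(self, opts):
        self.opts = opts


class Client:
    """Hypothetical client; the names here are illustrative, not Salt's."""

    def __init__(self, opts):
        self.opts = opts
        # Eager construction at init time; the property only rebuilds later.
        self._channel = FakeChannel(opts)

    @property
    def channel(self):
        # Rebuild lazily if something (e.g. an at-fork handler) cleared it.
        if self._channel is None:
            self._channel = FakeChannel(self.opts)
        return self._channel

    @channel.setter
    def channel(self, value):
        # External assignment keeps working for existing callers.
        self._channel = value


client = Client({"master": "example"})
original = client.channel
client._channel = None      # what an at-fork handler would do in a child
rebuilt = client.channel    # first access builds a fresh channel
```

Code that must not trigger a rebuild (teardown paths, for instance) would read the private attribute directly rather than the property.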

Merge requirements satisfied?

  • Docs — N/A. Internal fork-safety mechanism, no user-facing config
    or behaviour change beyond removing the deadlock. No docs/ rst
    change required.
  • Changelog — changelog/68940.fixed.md added.
  • Tests written/updated — added
    tests/pytests/functional/modules/state/test_state.py::test_parallel_file_managed_from_master,
    a regression test that runs two parallel file.managed states
    with salt:// sources under @pytest.mark.timeout(60). Without
    the fix the test hangs; with the fix it completes in well under
    a second.

Verified locally on Python 3.11:

  • tests/pytests/unit/fileclient/ — 28 passed, 1 skipped
  • tests/pytests/unit/test_fileclient.py + test_event.py — 7 passed
  • tests/pytests/unit/test_crypt.py — 11 passed
  • End-to-end os.fork() + multiprocessing.Process (fork start method)
    smoke test confirms _channel / _auth / subscriber / pusher are
    cleared in the child and the parent retains its own state.

Commits signed with GPG?

No


Implementation notes for reviewers

Why os.register_at_fork rather than fixing only call_parallel():
the latter would close this specific bug but leave the architectural
gap open for any future code adding multiprocessing.Process over a
parent that holds a RemoteClient (or AsyncAuth / SaltEvent).
Re-creating State from scratch in every forked child also pays for
a full pillar gather and module load on every parallel state — a
significant cost for users with large pillar trees. The at-fork
handler keeps the cheap fork path intact and only rebuilds the ZMQ
piece, lazily, when the child actually talks to the master.

Repository audit showed RemoteClient is the only currently-
exercised fork hazard. AsyncAuth and SaltEvent are covered
preemptively because they share the same structural property
(long-lived ZMQ handle held by a long-lived object) and would
surface the same class of bug under new call sites. Master-side
worker forks are unaffected — those already have a correct
pre_fork / post_fork protocol in salt/transport/zeromq.py.

FSClient (local file-server, no ZMQ) is deliberately excluded
from the at-fork tracking and overrides channel / auth properties
to never lazy-rebuild as a remote ReqChannel.

Limitations:

  1. salt.client.LocalClient is not covered. No current call site
    forks a process holding a live LocalClient; if one is added
    later, the same pattern can be applied in <20 lines.
  2. The at-fork handlers fire on every fork that returns control to
    the Python interpreter. Per the os.register_at_fork documentation,
    a typical subprocess launch (fork+exec) does not trigger them. Cost
    is O(n) where n is the number of live tracked instances (typically
    ≤ 2) and is expected to be sub-microsecond. No measurable impact on
    cmd.run-heavy workloads expected.
  3. Forkserver mode (Python 3.14+ Linux default): the existing
    forkserver path in call_parallel() (in 3007.x+) already forces
    an explicit fork context for that case and the handler covers it
    correctly. 3006.x predates that change, so this PR only needs to
    cover plain fork.

…ookups

Forked parallel-state children inherited the parent's live ZeroMQ REQ
socket via salt.fileclient.RemoteClient. Multiple sibling children
calling cp.hash_file (triggered by file.managed with salt:// sources)
raced the REQ/REP state machine and deadlocked the asyncio loop with
~98% CPU.

Register os.register_at_fork(after_in_child=...) handlers on the three
client-side classes that hold long-lived ZMQ handles (RemoteClient,
AsyncAuth/SAuth, SaltEvent) so each forked child drops the inherited
socket/IOLoop references and lazily rebuilds them on first use. The
parent is unaffected. Public API surface (channel, auth, subscriber,
pusher) is preserved via property/setter pairs.
@co-cy co-cy requested a review from a team as a code owner May 3, 2026 20:56

co-cy commented May 3, 2026

Notes on the fork-safety fix

Posting this as a follow-up comment because the issue body covers the
symptoms but not the rationale behind the specific fix shape. Some of
this came out of the investigation while reproducing #68940; some of
it I added defensively after auditing the rest of the codebase. Hope
this helps reviewers.

Why the deadlock happens (one more pass)

State.call_parallel() forks parallel-state children using
multiprocessing.Process. On Linux with the default fork start
method, the child gets a copy of the parent's State instance —
including State.file_client, which is a RemoteClient constructed
in __init__ with a live ZeroMQ REQ socket connected to the master
on port 4506.

When two file.managed states with parallel: True and
source: salt://... run, both forked children eventually call
cp.hash_file (this is what file.managed does to compare a local
cached file against the server hash). Each child sends a request
through the inherited socket, but the ZMQ REQ socket has a strict
send → recv → send → recv state machine that only one process can
drive coherently. When sibling children share the same socket, the
reply intended for one can be consumed by the other. The originating
child's asyncio event loop then waits forever for a reply that
already arrived elsewhere, which is why top shows ~98% CPU on a
process whose kernel stack is in futex_wait.

There's nothing weird going on in the network or in the master —
Send-Q: 0 on the ESTABLISHED connection, master keeps responding
fine, and cp.hash_file works perfectly outside a parallel context.
The whole thing is a userspace race against a ZMQ invariant. ZeroMQ
documents this as undefined behaviour: a socket created in the parent
must not be used in a forked child.
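
The reply-stealing step can be illustrated with a deterministic toy model. This is not ZeroMQ and not Salt code: SharedStream is a hypothetical stand-in for one connection that two processes both believe they own, and the "children" are just labels.

```python
from collections import deque


class SharedStream:
    """Toy model of a single connection shared by several readers."""

    def __init__(self):
        self._requests = deque()
        self._replies = deque()

    def send(self, sender, payload):
        self._requests.append((sender, payload))

    def serve_one(self):
        # The server answers the oldest request on the same stream.
        sender, _payload = self._requests.popleft()
        self._replies.append(f"reply-for-{sender}")

    def recv(self):
        # Whoever reads first gets the next reply, regardless of who asked.
        return self._replies.popleft() if self._replies else None


def simulate_race():
    stream = SharedStream()
    stream.send("child-A", "hash_file")
    stream.send("child-B", "hash_file")
    stream.serve_one()          # the reply for child-A is now queued
    got_by_b = stream.recv()    # child-B happens to read first
    got_by_a = stream.recv()    # nothing left: child-A would keep waiting
    return got_by_a, got_by_b
```

In the model, child-A never sees its reply even though the server sent it, which mirrors the symptom of a healthy connection with a waiting client.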

What I tried before settling on this approach

The minimal fix is to set instance = None in the fork branch of
call_parallel(). That forces the child to construct a fresh
State from init_kwargs, which builds a fresh RemoteClient with
its own socket. This works, but two things bothered me:

  1. It pays for a full State.__init__ per parallel child — pillar
    gather, module load, serializer load. On hosts with large pillar
    trees and many parallel states this is significant. The cheap
    fork path is the whole reason parallel: True exists.

  2. It only fixes this particular call site. The hazard is structural
    — any future code that does multiprocessing.Process(...) over a
    parent holding a RemoteClient reintroduces it. I went looking
    for other places where this could already be biting people and
    found salt.crypt.AsyncAuth/SAuth (singleton bound to a tornado
    IOLoop) and salt.utils.event.SaltEvent (ZMQ pub/push with its
    own asyncio loop). Both have the same shape.

So instead of patching one call site, I push the responsibility down
to the classes that actually own the unsafe handles. Each of them
now registers an os.register_at_fork(after_in_child=...) handler.
After any fork, the handler iterates the live instances (tracked in
a weakref.WeakSet) and drops the inherited references — _channel,
_auth, subscriber, pusher — without calling .close(). Next
use rebuilds them lazily.
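
As a rough sketch of that mechanism (assuming a POSIX platform with os.fork; TrackedClient and its attribute names are hypothetical and only mimic the shape described above):

```python
import os
import weakref


class TrackedClient:
    """Hypothetical class holding a handle that must not survive fork."""

    _instances = weakref.WeakSet()
    _atfork_registered = False

    def __init__(self):
        self._channel = object()  # placeholder for an inherited handle
        type(self)._instances.add(self)
        type(self)._register_atfork()

    @classmethod
    def _register_atfork(cls):
        # Idempotent: register the process-wide hook only once.
        if not cls._atfork_registered:
            os.register_at_fork(after_in_child=cls._after_fork_in_child)
            cls._atfork_registered = True

    @classmethod
    def _after_fork_in_child(cls):
        # Drop references only; nothing is closed here.
        for inst in list(cls._instances):
            inst._channel = None


def fork_and_check(client):
    """Fork once; return (cleared_in_child, kept_in_parent)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: report what the handler left behind, then exit immediately.
        os.write(write_fd, b"1" if client._channel is None else b"0")
        os._exit(0)
    os.close(write_fd)
    flag = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return flag == b"1", client._channel is not None
```

The WeakSet means tracking does not keep instances alive, so the handler only ever touches objects that still exist at fork time.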

Why no .close() in the handler

SyncWrapper.close() calls io_loop.close(all_fds=True) and the
asyncio loop's close. After fork, the child's IOLoop and asyncio
loop are byte-for-byte copies of the parent's structures, with
internal pointers into what was the parent's heap. Calling close on
those copies risks tearing down state that the parent still owns.
The child has its own copy of the FD table, so we don't need to
close anything to be safe — it's enough to make sure no Salt code
uses the inherited handles. Dropping the references achieves that;
GC reclaims the rest when the child exits.

File-by-file summary

salt/fileclient.py. RemoteClient.channel and .auth become
properties backed by _channel / _auth. The eager
ReqChannel.factory(...) call in __init__ is preserved because
existing tests assert it runs exactly once at construction; the
property only kicks in to lazily rebuild after the at-fork handler
clears _channel. _refresh_channel, destroy, etc. were updated
to go through the private attribute so they don't accidentally
trigger lazy recreation. FSClient overrides both properties to
return its in-process FSChan without lazy-rebuilding as a remote
ReqChannel, and is deliberately not added to the _instances
WeakSet.

salt/crypt.py. AsyncAuth.instance_map and SAuth.instances are
replaced with empty maps in the child. creds_map is intentionally
preserved — AES creds remain valid across fork, and reusing them
saves one re-auth roundtrip on the first master RPC after fork.
Registration is idempotent and triggered from __new__ (i.e. on
first auth, not at module import) so processes that never auth
don't carry the at-fork handler.
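
A simplified sketch of that shape, assuming a hypothetical Auth class with plain dict maps (the real classes use their own map types and keys). It shows idempotent registration from __new__, a singleton map that is replaced in the child, and a credentials cache that is left alone.

```python
import os


class Auth:
    """Hypothetical singleton-per-key class; not Salt's AsyncAuth."""

    instance_map = {}   # per-key singletons; reset in a forked child
    creds_map = {}      # cached credentials; deliberately kept across fork
    _atfork_registered = False

    def __new__(cls, key):
        # Registration happens on first construction, not at import.
        cls._register_atfork()
        if key not in cls.instance_map:
            cls.instance_map[key] = super().__new__(cls)
        return cls.instance_map[key]

    @classmethod
    def _register_atfork(cls):
        if cls._atfork_registered:
            return
        os.register_at_fork(after_in_child=cls._after_fork_in_child)
        cls._atfork_registered = True

    @classmethod
    def _after_fork_in_child(cls):
        # Replace the map instead of mutating or closing inherited objects.
        cls.instance_map = {}
```

Calling Auth._after_fork_in_child() directly simulates what a forked child would see: the next Auth(key) builds a new instance, while creds_map still holds whatever was cached before.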

salt/utils/event.py. SaltEvent.subscriber/pusher/cpub/cpush are
nulled in the child. These attributes are already nullable and
connect_pub()/connect_pull() already handle the "first connection"
case, so no property machinery is needed — the lazy reconnect path
exists. __init__ registers the instance in _instances right
before the conditional connect_pub() call so that even
listen=True instances are tracked before subscriber creation.
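
For illustration, the "already nullable, already lazy" pattern looks roughly like this. Event, connect_pub and _drop_inherited below are hypothetical simplifications, not the real SaltEvent methods or signatures.

```python
class Event:
    """Hypothetical event bus client with a lazily connected subscriber."""

    def __init__(self, listen=False):
        self.subscriber = None
        self.cpub = False
        if listen:
            self.connect_pub()

    def connect_pub(self):
        # Connect only if there is no live subscriber yet.
        if not self.cpub:
            self.subscriber = object()  # placeholder for a real socket
            self.cpub = True
        return self.cpub

    def _drop_inherited(self):
        # What an at-fork handler would do: forget, do not close.
        self.subscriber = None
        self.cpub = False


event = Event(listen=True)
parent_subscriber = event.subscriber
event._drop_inherited()     # state as seen in a forked child
event.connect_pub()         # first use after fork reconnects
```

Because the connect path already handles a missing subscriber, resetting the attributes is enough for the next call to build a fresh one.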

This last one is defensive. I couldn't find a current call site
that forks while holding a SaltEvent with an active subscriber —
SSHClient.event is the closest but uses listen=False and the
child never reads. But it's a latent version of the same bug, and
the fix is small enough that I'd rather close it now than wait for
someone to file the next issue.

What's not covered

salt.client.LocalClient has a similar structure (it ends up
holding a MasterPubChannel with its own ZMQ state) but I couldn't
find a call site that forks while holding it, so I left it alone.
If that ever changes the same pattern would apply in <20 lines.

The handler fires on every fork that returns control to the Python
interpreter. According to the os.register_at_fork documentation, a
typical subprocess launch (fork+exec) does not trigger it, because
the child never re-enters the interpreter. Where it does fire, the
work is O(n) in the number of live tracked instances, and n is
typically 1 or 2 in practice, so the overhead should be negligible.
I'm not expecting measurable impact on cmd.run-heavy workloads but I
haven't done a formal benchmark.
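
Which launch paths actually run an at-fork handler can be checked empirically. The sketch below assumes a POSIX platform; the helper names are made up for illustration. The Python documentation for os.register_at_fork states that handlers run only when control is expected to return to the interpreter, and that a typical subprocess launch does not trigger them.

```python
import os
import subprocess
import sys

read_fd, write_fd = os.pipe()
os.set_blocking(read_fd, False)


def _mark_child():
    # Runs in the child right after fork, if the interpreter invokes handlers.
    os.write(write_fd, b"x")


os.register_at_fork(after_in_child=_mark_child)


def handler_fired(spawn):
    """Run spawn() (which must wait for its child) and report a handler hit."""
    spawn()
    try:
        return os.read(read_fd, 1) == b"x"
    except BlockingIOError:
        return False


def plain_fork():
    pid = os.fork()
    if pid == 0:
        os._exit(0)
    os.waitpid(pid, 0)


def plain_subprocess():
    # No preexec_fn: the child goes straight to exec.
    subprocess.run([sys.executable, "-c", "pass"], check=True)


fork_fired = handler_fired(plain_fork)
subprocess_fired = handler_fired(plain_subprocess)
```

Comparing fork_fired and subprocess_fired on the target interpreter shows whether command-execution paths pay for the handler at all.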

3006.x doesn't have the forkserver path that 3007.x added in
71dba06. On 3007.x and forward, call_parallel() already forces
an explicit fork context for forkserver mode, and the at-fork
handler covers that correctly. Forkserver children that go through
the forkserver pool process directly (unrelated worker pools) start
with no live tracked instances and the handler is a no-op there.

Verification

Locally on Python 3.11 against 3006.x:

  • tests/pytests/unit/fileclient/ — 28 passed, 1 skipped
  • tests/pytests/unit/test_fileclient.py + test_event.py — 7 passed
  • tests/pytests/unit/test_crypt.py — 11 passed
  • A small os.fork() + multiprocessing.Process (fork start method)
    smoke harness confirming the child sees _channel = None,
    _auth = "", subscriber/pusher cleared, and that the parent
    retains its own state untouched.

The functional regression test
(test_parallel_file_managed_from_master) uses
@pytest.mark.timeout(60) as the primary assertion. Without the fix
it hangs; with the fix it completes well under a second.

Public API

client.channel, client.auth, event.subscriber/pusher/cpub/cpush
remain readable and writable from outside, so callers that touch
them (mostly tests, a few internal Salt paths) keep working.
AsyncAuth.instance_map and SAuth.instances keep their type and
class-attribute placement; their only observable difference is that
in a forked child they happen to be empty rather than carrying
parent state.

I deliberately kept the comments in the code short and pointed
at the why rather than the what; the property and at-fork pattern
itself is small enough to read top-to-bottom.

Happy to split this into smaller PRs (per-class) if reviewers want
each of the three classes considered separately, but they all share
the same idiom and I think it reads better as one unit.

co-cy commented May 3, 2026

Fixes #68940
Likely also fixes #65709 — same fork-inherited ZMQ socket race,
just on cp.cache_file instead of cp.hash_file (cmd.script with
parallel: True). Author of #65709 should retest after this lands.

Comment thread salt/utils/event.py
self.pending_tags = []
self.pending_events = []
self.__load_cache_regex()
type(self)._register_atfork()

Contributor comment:

This is incredibly suspect. AI Slop?

Author reply:

Hi @dwoz, thanks for the review.
type(self)._register_atfork() is used intentionally here to resolve the actual class of the instance at runtime, which matters if a subclass ever overrides _register_atfork or maintains its own _atfork_registered flag. Using type(self) over a hardcoded class name preserves correct behavior under inheritance, and over self._register_atfork() it makes the class-level nature of the call explicit at the call site (and avoids accidental shadowing by an instance attribute).
That said, I agree the idiom is uncommon and can look odd at a glance. Happy to change it — a few alternatives:

  • self._register_atfork(): shortest, relies on the descriptor protocol to bind the correct class. Fine in practice.
  • self.__class__._register_atfork(): equivalent to type(self), slightly more conventional in some codebases.
  • Move registration out of __init__ entirely, e.g. into __init_subclass__ or module-level init, so the hook is registered once per class instead of on every instantiation. This is probably the cleanest fix architecturally, since os.register_at_fork only needs to be called once.

Let me know which direction you'd prefer and I'll update the PR.
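
To make the inheritance point concrete, here is a small sketch with hypothetical classes. A list stands in for the real os.register_at_fork call so the effect is easy to observe; the per-class flag lookup is one possible way to do it, not necessarily what the PR does.

```python
registered = []  # stand-in for os.register_at_fork calls


class Base:
    _atfork_registered = False

    def __init__(self):
        # Resolves to the runtime class, so subclasses get their own entry.
        type(self)._register_atfork()

    @classmethod
    def _register_atfork(cls):
        # Look in the class's own namespace so a subclass does not
        # inherit the parent's "already registered" flag.
        if not cls.__dict__.get("_atfork_registered", False):
            registered.append(cls.__name__)
            cls._atfork_registered = True


class Child(Base):
    pass


Base()
Child()
Child()  # second instantiation is a no-op
```

Each class registers once, however many instances are created, which is the behaviour the idempotent flag is meant to guarantee.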

It is important to me that the product is improved. I'm new here; if you have any general practices, I'd like to get acquainted with them.

@twangboy twangboy added this to the Sulpher v3006.25 milestone May 16, 2026
… tests

The two functional tests (test_parallel_file_managed_from_master and
test_parallel_cmd_script_from_master) do not actually guard the
regression: Salt's in-process functional harness uses FSClient, which
has no ZeroMQ socket, so the fork-inherited-socket deadlock cannot
occur there. Verified locally -- both tests pass with AND without the
fix, so they would not catch a regression.

Replace them with tests/pytests/unit/test_fork_safety.py, which asserts
the fix's actual contract: after os.fork(), the at-fork handler must
clear the inherited handles in the child while the parent keeps its
own.

- test_remoteclient_drops_channel_in_forked_child: the direct
  saltstack#68940/saltstack#65709 path (cp.hash_file / cp.cache_file go through
  RemoteClient's REQ channel). Child sees _channel/_auth cleared.
- test_asyncauth_sauth_clear_singletons_but_keep_creds_in_forked_child:
  AsyncAuth.instance_map / SAuth.instances reset in child; creds_map
  deliberately preserved (documented behaviour).
- test_saltevent_drops_sockets_in_forked_child: subscriber/pusher
  None and cpub/cpush False in child.

These are deterministic and fast (no master daemon, no ZeroMQ, no
race): each fails with the fix reverted and passes with it in place,
so they are real regression guards. Verified both directions locally.

Refs: saltstack#68940, saltstack#65709

co-cy commented May 17, 2026

Follow-up on the regression tests — I have to flag something I found while validating them, and I've reworked the approach.

The two functional tests did not actually guard the regression. I got my local salt-factories harness working (it was a pytest-version/plugin issue on my side, unrelated to this PR) and ran test_parallel_file_managed_from_master and test_parallel_cmd_script_from_master against the code with the fix reverted — both still passed. The reason: Salt's in-process functional harness resolves salt:// through FSClient, which has no ZeroMQ REQ socket, so the fork-inherited-socket deadlock simply cannot occur there. A green run in that harness says nothing about whether the fix is present.

So in 82ff79c259 I removed both functional parity tests and added tests/pytests/unit/test_fork_safety.py, which asserts the fix's actual contract directly: after os.fork(), the at-fork handler must clear the inherited handles in the child while the parent keeps its own.

These are deterministic and fast (no master daemon, no ZeroMQ, no race window). I verified both directions locally: all three fail with the fix reverted and pass with it in place — i.e. they are real regression guards, which the functional tests were not.

Reproducing the literal deadlock end-to-end would need a real master + ZeroMQ transport, and "did not hang" is an inherently flaky assertion (a green run only means the race didn't fire that time). Asserting the at-fork invariant is the deterministic equivalent of the manual os.fork() smoke harness described in the PR notes. Happy to also keep an integration-level repro if you'd prefer one in addition, but I'd argue it would be a weaker guard than these.
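
The general shape of such a test can be sketched as follows, assuming a POSIX platform. Holder, _live and run_in_forked_child are hypothetical names and this is not the content of test_fork_safety.py; it only shows how a child-side assertion can be reported back through the exit status.

```python
import os


class Holder:
    """Hypothetical object owning a handle that must be dropped on fork."""

    def __init__(self):
        self.handle = object()


_live = []


def _clear_in_child():
    for holder in _live:
        holder.handle = None


os.register_at_fork(after_in_child=_clear_in_child)


def run_in_forked_child(check):
    """Run check() in a forked child; True if it returned a truthy value."""
    pid = os.fork()
    if pid == 0:
        try:
            ok = bool(check())
        except BaseException:
            ok = False
        # Exit without running the parent's cleanup or test teardown.
        os._exit(0 if ok else 1)
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def test_child_drops_handle_parent_keeps_it():
    holder = Holder()
    _live.append(holder)
    assert run_in_forked_child(lambda: holder.handle is None)
    assert holder.handle is not None
```

A test of this kind needs no master, no sockets and no timing window: it fails as soon as the handler stops clearing the attribute.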

Labels

test:full Run the full test suite

3 participants