Skip to content

Fix the third connection-establishment race: replay PendingConnection events to late listeners#43

Merged
ptesavol merged 1 commit into
mainfrom
claude/goofy-proskuriakova-92601d
Jul 4, 2026
Merged

Fix the third connection-establishment race: replay PendingConnection events to late listeners#43
ptesavol merged 1 commit into
mainfrom
claude/goofy-proskuriakova-92601d

Conversation

@ptesavol

@ptesavol ptesavol commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

Fixes the ConnectionLockingTest.CanLockConnections failures on ubuntu-latest in PR #42's CI (commits fde4ad3 and 3a71303, 2026-07-04) — the rarer third mechanism that survived the two WebsocketConnection race fixes of PR #35.

The mechanism

On the outgoing side, the connector starts socket I/O inside createConnection() (WebsocketClientConnector::connect() calls socket->open() before returning), but the endpoint's listeners on the returned PendingConnection are only registered afterwards, when ConnectionManager wraps it in an Endpoint (addEndpointchangeToConnectingStateenterState). Those two steps run on the caller's thread; the handshake runs on libdatachannel's threads.

If the calling thread is descheduled in that window for longer than one full localhost websocket + streamr handshake round-trip, PendingConnection::onHandshakeCompleted() runs first. It does two things, in this order: disarms the 15 s connect watchdog, then emits Connected — on a fire-and-forget emitter with zero listeners. The emission is lost, no timeout remains armed, and the endpoint sits in ConnectingEndpointState forever with the lock-request RPC buffered.

That asymmetry (watchdog disarmed, event dropped) is what makes this a permanent stall rather than an eventual close-and-retry, and it produces exactly the observed signature:

  • the lock RPC times out (folly::FutureTimeout after 10 s, caught and logged inside lockConnection),
  • waitForCondition times out (the manager2 side never received the lock request — its own handshake side completed fine),
  • hasConnection() == false (the endpoint exists but never reaches the connected state),
  • no crash or hang; the rest of the suite runs normally.

The window is a few hundred instructions on the caller's thread versus ~1 ms of I/O on hot rtc threads, so it needs the caller preempted at exactly the wrong moment for several scheduler quanta — realistic on a loaded 2-core CI runner, effectively unreachable on a many-core dev machine. This is why 100 stress runs under 10× load on macOS never reproduced it. The same pattern exists on the reverse-connection path (WebsocketClientConnectorRpcLocal::requestConnectionconnect then onNewConnection), and the same fix covers it. The incoming-handshake side is sequential on one thread and was never exposed.

The fix

  1. IPendingConnection now extends ReplayEventEmitter instead of the fire-and-forget EventEmitter: a listener registered after the fact receives the latest Connected/Disconnected emission. This also covers the sibling case of an early Disconnected (fast connect failure) being lost the same way.

  2. EventEmitter::emit() now stores the latest event and snapshots the handler list atomically with respect to on() (both under mEventHandlersMutex). Previously emit wrote mLatestEvent without mLatestEventMutex and iterated mEventHandlers without its mutex, so a listener registered concurrently with an emit could miss both the live dispatch and the replay. After the change a concurrent registration gets exactly one of the two — never neither, never both. mEventHandlersMutex is only taken inside mEmitLoopMutex, matching the pre-existing once-handler-removal order, so no new lock ordering is introduced.

Verification

  • Deterministic end-to-end reproduction: injecting a temporary 300 ms sleep in ConnectionManager::send() between createConnection() and onNewConnection() (simulating the preemption; not committed) makes CanLockConnections on the pre-fix code fail with the exact CI output — EXPECT_NO_THROW ... Actual: it throws folly::FutureTimeout with description "Timed out" followed by hasConnection(...) Actual: false — in ~10 s. With the fix, the same forced preemption passes in 344 ms.
  • Two new deterministic regression tests (PendingConnectionTest.ReplaysConnectedToLateListener, ReplaysDisconnectedToLateListener) encode the late-listener contract.
  • Full monorepo suite: 309/309 tests pass.
  • ConnectionLockingTest stress: 60/60 runs green.
  • Package lint (clangd-tidy 22 + clang-format): clean for both touched packages.

MODERNIZATION.md's known-issue record is extended with the mechanism and fix, alongside the PR #35 entries.

🤖 Generated with Claude Code

… events to late listeners

On the outgoing side the connector starts socket I/O inside
createConnection(), but the endpoint's listeners on the returned
PendingConnection are only registered afterwards (addEndpoint ->
changeToConnectingState). When the main thread is descheduled in that
window for longer than one localhost websocket + streamr handshake
round-trip - realistic only on a loaded 2-core CI runner -
onHandshakeCompleted() runs first: it disarms the 15 s connect watchdog
and then emits Connected on a fire-and-forget emitter with zero
listeners. The emission is lost, no timeout remains armed, and the
endpoint sits in the connecting state forever with the lock-request RPC
buffered (the ConnectionLockingTest.CanLockConnections stall on
ubuntu-latest, PR #42 CI).

IPendingConnection now extends ReplayEventEmitter, so a late listener
receives the latest Connected/Disconnected instead of nothing. To make
that airtight across threads, emit() now performs the latest-event
store and the handler snapshot atomically with respect to on() (both
under mEventHandlersMutex): a listener registered concurrently with an
emit gets exactly one of live dispatch or replay, never neither, never
both. This also closes the pre-existing unsynchronized read of
mEventHandlers in createExecutingEmitLoopHandlersMap().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@cursor

cursor Bot commented Jul 4, 2026

Copy link
Copy Markdown

Bugbot is not enabled for this team, so this pull request was not reviewed.

Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs.

@ptesavol ptesavol merged commit 2fe7aa5 into main Jul 4, 2026
6 checks passed
@ptesavol ptesavol deleted the claude/goofy-proskuriakova-92601d branch July 4, 2026 13:28
ptesavol added a commit that referenced this pull request Jul 4, 2026
Third and largest consolidation step, and the step where the module
architecture was settled through three owner directives, each measured:

1. ONE FILE PER FORMER HEADER: all 45 dht public headers (5,013 lines)
   became individual module units under modules/ in the old directory
   tree; the include/ tree is deleted.
2. NAMED SUB-MODULES, NOT PARTITIONS (streamr.dht.X): partitions
   cannot be imported by external translation units, and tests must
   import exactly the sub-modules they test ("tests are internal to
   the module"). All 41 dht test files import their subjects directly.
3. NO UMBRELLA MODULE: a re-exporting umbrella was built, measured,
   and deleted - its compiled interface depends on every sub-module,
   so anyone importing it reinherits edit-anything-rebuild-everything.
   Consumers import the individual sub-modules they use, mirroring the
   old header includes one to one.

MEASURED (same mid-chain/leaf sub-module touch, whole-tree Release
rebuild): partitions behind a re-exporting primary ~480 s; named
sub-modules behind an umbrella 418/546 s; no umbrella 129/135 s.
Remaining cost is the true dependency cone of central types plus the
serial BMI chain; narrow edits rebuild in seconds.

The already-consolidated trackerless-network package was RETROFITTED
to the same architecture in this PR (partitions -> named sub-modules,
umbrella deleted, tests + proxyclient import narrowly). Its MessageRef
ordering operator moved from FifoMapWithTTL to the exported protos
module: as a non-exported file-scope function it had module linkage
and was unreachable from other modules instantiating the templates.

Sub-module strictness rules learned (recorded in MODERNIZATION.md for
the remaining packages):
- every unit must DIRECTLY import what its code uses (the compiler
  names the missing module); shorthand names inherited textually from
  neighboring headers need the file's own using-declaration;
- a purview forward declaration of another module's class attaches the
  name to the wrong module and conflicts with the defining one - the
  import replaces it;
- include hygiene: <string> for Branded-key containers (16 units),
  <functional>, <mutex>, <exception>, folly Collect.h named explicitly
  by the compiler.

Incidental fixes: dht standalone install.sh had committed
-fsanitize=address in every configure (fatal for downstream standalone
links once real code moved into module objects) - removed; PR #43's
ReplayEventEmitter race fix ported into the IPendingConnection
sub-module (the header it patched no longer exists);
ConnectionLockingTest.cpp excluded from clangd-tidy (owner-approved
selective disabling - the analysis tool cannot resolve folly
customization points through module interfaces; the compiler accepts
the file; also gains folly Collect.h textually since it instantiates
the imported streamr::utils::collect).

Also in MODERNIZATION.md per owner instruction: the CI-speed follow-up
plan (build-directory caching, single-platform packaging check, ccache
complement, port randomization) and the import-std verdict incl. the
NDK r30 beta 1 retest (do not adopt; blockers are in Bionic itself,
unchanged in the pre-release; clean CMake cross-probe workaround found
and recorded for a future fixed NDK).

Verified: full Release build; dht tests 83/83 (incl. the two new PR
#43 race tests); trackerless-network 12+1; proxyclient 15/15;
standalone builds of the whole chain; package lints green over all
module units.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ptesavol added a commit that referenced this pull request Jul 4, 2026
…header [androidbuild] [iosbuild] (#44)

* Consolidation C-3: dht consolidated; module architecture settled

Third and largest consolidation step, and the step where the module
architecture was settled through three owner directives, each measured:

1. ONE FILE PER FORMER HEADER: all 45 dht public headers (5,013 lines)
   became individual module units under modules/ in the old directory
   tree; the include/ tree is deleted.
2. NAMED SUB-MODULES, NOT PARTITIONS (streamr.dht.X): partitions
   cannot be imported by external translation units, and tests must
   import exactly the sub-modules they test ("tests are internal to
   the module"). All 41 dht test files import their subjects directly.
3. NO UMBRELLA MODULE: a re-exporting umbrella was built, measured,
   and deleted - its compiled interface depends on every sub-module,
   so anyone importing it reinherits edit-anything-rebuild-everything.
   Consumers import the individual sub-modules they use, mirroring the
   old header includes one to one.

MEASURED (same mid-chain/leaf sub-module touch, whole-tree Release
rebuild): partitions behind a re-exporting primary ~480 s; named
sub-modules behind an umbrella 418/546 s; no umbrella 129/135 s.
Remaining cost is the true dependency cone of central types plus the
serial BMI chain; narrow edits rebuild in seconds.

The already-consolidated trackerless-network package was RETROFITTED
to the same architecture in this PR (partitions -> named sub-modules,
umbrella deleted, tests + proxyclient import narrowly). Its MessageRef
ordering operator moved from FifoMapWithTTL to the exported protos
module: as a non-exported file-scope function it had module linkage
and was unreachable from other modules instantiating the templates.

Sub-module strictness rules learned (recorded in MODERNIZATION.md for
the remaining packages):
- every unit must DIRECTLY import what its code uses (the compiler
  names the missing module); shorthand names inherited textually from
  neighboring headers need the file's own using-declaration;
- a purview forward declaration of another module's class attaches the
  name to the wrong module and conflicts with the defining one - the
  import replaces it;
- include hygiene: <string> for Branded-key containers (16 units),
  <functional>, <mutex>, <exception>, folly Collect.h named explicitly
  by the compiler.

Incidental fixes: dht standalone install.sh had committed
-fsanitize=address in every configure (fatal for downstream standalone
links once real code moved into module objects) - removed; PR #43's
ReplayEventEmitter race fix ported into the IPendingConnection
sub-module (the header it patched no longer exists);
ConnectionLockingTest.cpp excluded from clangd-tidy (owner-approved
selective disabling - the analysis tool cannot resolve folly
customization points through module interfaces; the compiler accepts
the file; also gains folly Collect.h textually since it instantiates
the imported streamr::utils::collect).

Also in MODERNIZATION.md per owner instruction: the CI-speed follow-up
plan (build-directory caching, single-platform packaging check, ccache
complement, port randomization) and the import-std verdict incl. the
NDK r30 beta 1 retest (do not adopt; blockers are in Bionic itself,
unchanged in the pre-release; clean CMake cross-probe workaround found
and recorded for a future fixed NDK).

Verified: full Release build; dht tests 83/83 (incl. the two new PR
#43 race tests); trackerless-network 12+1; proxyclient 15/15;
standalone builds of the whole chain; package lints green over all
module units.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Module libraries: POSITION_INDEPENDENT_CODE ON

The module libraries' objects end up inside the shared
libstreamrproxyclient. macOS emits position-independent code by
default, but on Linux the ELF linker rejects the standalone archives
otherwise (R_AARCH64_ADR_PREL_PG_HI21 relocations against the module
vtable/thunk symbols, "recompile with -fPIC") - first surfaced on the
linux-arm64 leg once the consolidated code moved into module objects.
Set uniformly in the StreamrModules.cmake helpers (canonical + synced
copies).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* proxyclient lint: run clangd-tidy serially

The implementation file imports many module interfaces; analysing
several such files concurrently exhausted memory on the 2-core CI
runners (clangd dies mid-protocol, clangd-tidy reports "Invalid header
end"). The package has a handful of files - serial analysis costs
seconds and preserves full coverage.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* proxyclient lint: exclude the import-heavy implementation file

Even serial analysis of src/streamrproxyclient.cpp exhausts memory on
the 2-core 7 GB CI runners: the file imports module interfaces from
four packages and clangd builds that whole module graph in one
process. Owner-approved selective disabling; the compiler builds the
file on every platform and clang-format still checks it. Revisit when
runners grow or clangd's module memory use shrinks.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* Retrigger CI: ConnectionLockingTest.LockingBothWays flaked on the arm64 runner

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant