Fix the third connection-establishment race: replay PendingConnection events to late listeners#43
Merged
Merged
Conversation
… events to late listeners On the outgoing side the connector starts socket I/O inside createConnection(), but the endpoint's listeners on the returned PendingConnection are only registered afterwards (addEndpoint -> changeToConnectingState). When the main thread is descheduled in that window for longer than one localhost websocket + streamr handshake round-trip - realistic only on a loaded 2-core CI runner - onHandshakeCompleted() runs first: it disarms the 15 s connect watchdog and then emits Connected on a fire-and-forget emitter with zero listeners. The emission is lost, no timeout remains armed, and the endpoint sits in the connecting state forever with the lock-request RPC buffered (the ConnectionLockingTest.CanLockConnections stall on ubuntu-latest, PR #42 CI). IPendingConnection now extends ReplayEventEmitter, so a late listener receives the latest Connected/Disconnected instead of nothing. To make that airtight across threads, emit() now performs the latest-event store and the handler snapshot atomically with respect to on() (both under mEventHandlersMutex): a listener registered concurrently with an emit gets exactly one of live dispatch or replay, never neither, never both. This also closes the pre-existing unsynchronized read of mEventHandlers in createExecutingEmitLoopHandlersMap(). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Bugbot is not enabled for this team, so this pull request was not reviewed. Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs. |
ptesavol
added a commit
that referenced
this pull request
Jul 4, 2026
Third and largest consolidation step, and the step where the module
architecture was settled through three owner directives, each measured:
1. ONE FILE PER FORMER HEADER: all 45 dht public headers (5,013 lines)
became individual module units under modules/ in the old directory
tree; the include/ tree is deleted.
2. NAMED SUB-MODULES, NOT PARTITIONS (streamr.dht.X): partitions
cannot be imported by external translation units, and tests must
import exactly the sub-modules they test ("tests are internal to
the module"). All 41 dht test files import their subjects directly.
3. NO UMBRELLA MODULE: a re-exporting umbrella was built, measured,
and deleted - its compiled interface depends on every sub-module,
so anyone importing it reinherits edit-anything-rebuild-everything.
Consumers import the individual sub-modules they use, mirroring the
old header includes one to one.
MEASURED (same mid-chain/leaf sub-module touch, whole-tree Release
rebuild): partitions behind a re-exporting primary ~480 s; named
sub-modules behind an umbrella 418/546 s; no umbrella 129/135 s.
Remaining cost is the true dependency cone of central types plus the
serial BMI chain; narrow edits rebuild in seconds.
The already-consolidated trackerless-network package was RETROFITTED
to the same architecture in this PR (partitions -> named sub-modules,
umbrella deleted, tests + proxyclient import narrowly). Its MessageRef
ordering operator moved from FifoMapWithTTL to the exported protos
module: as a non-exported file-scope function it had module linkage
and was unreachable from other modules instantiating the templates.
Sub-module strictness rules learned (recorded in MODERNIZATION.md for
the remaining packages):
- every unit must DIRECTLY import what its code uses (the compiler
names the missing module); shorthand names inherited textually from
neighboring headers need the file's own using-declaration;
- a purview forward declaration of another module's class attaches the
name to the wrong module and conflicts with the defining one - the
import replaces it;
- include hygiene: <string> for Branded-key containers (16 units),
<functional>, <mutex>, <exception>, folly Collect.h named explicitly
by the compiler.
Incidental fixes: dht standalone install.sh had committed
-fsanitize=address in every configure (fatal for downstream standalone
links once real code moved into module objects) - removed; PR #43's
ReplayEventEmitter race fix ported into the IPendingConnection
sub-module (the header it patched no longer exists);
ConnectionLockingTest.cpp excluded from clangd-tidy (owner-approved
selective disabling - the analysis tool cannot resolve folly
customization points through module interfaces; the compiler accepts
the file; also gains folly Collect.h textually since it instantiates
the imported streamr::utils::collect).
Also in MODERNIZATION.md per owner instruction: the CI-speed follow-up
plan (build-directory caching, single-platform packaging check, ccache
complement, port randomization) and the import-std verdict incl. the
NDK r30 beta 1 retest (do not adopt; blockers are in Bionic itself,
unchanged in the pre-release; clean CMake cross-probe workaround found
and recorded for a future fixed NDK).
Verified: full Release build; dht tests 83/83 (incl. the two new PR
#43 race tests); trackerless-network 12+1; proxyclient 15/15;
standalone builds of the whole chain; package lints green over all
module units.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ptesavol
added a commit
that referenced
this pull request
Jul 4, 2026
…header [androidbuild] [iosbuild] (#44) * Consolidation C-3: dht consolidated; module architecture settled Third and largest consolidation step, and the step where the module architecture was settled through three owner directives, each measured: 1. ONE FILE PER FORMER HEADER: all 45 dht public headers (5,013 lines) became individual module units under modules/ in the old directory tree; the include/ tree is deleted. 2. NAMED SUB-MODULES, NOT PARTITIONS (streamr.dht.X): partitions cannot be imported by external translation units, and tests must import exactly the sub-modules they test ("tests are internal to the module"). All 41 dht test files import their subjects directly. 3. NO UMBRELLA MODULE: a re-exporting umbrella was built, measured, and deleted - its compiled interface depends on every sub-module, so anyone importing it reinherits edit-anything-rebuild-everything. Consumers import the individual sub-modules they use, mirroring the old header includes one to one. MEASURED (same mid-chain/leaf sub-module touch, whole-tree Release rebuild): partitions behind a re-exporting primary ~480 s; named sub-modules behind an umbrella 418/546 s; no umbrella 129/135 s. Remaining cost is the true dependency cone of central types plus the serial BMI chain; narrow edits rebuild in seconds. The already-consolidated trackerless-network package was RETROFITTED to the same architecture in this PR (partitions -> named sub-modules, umbrella deleted, tests + proxyclient import narrowly). Its MessageRef ordering operator moved from FifoMapWithTTL to the exported protos module: as a non-exported file-scope function it had module linkage and was unreachable from other modules instantiating the templates. Sub-module strictness rules learned (recorded in MODERNIZATION.md for the remaining packages): - every unit must DIRECTLY import what its code uses (the compiler names the missing module); shorthand names inherited textually from neighboring headers need the file's own using-declaration; - a purview forward declaration of another module's class attaches the name to the wrong module and conflicts with the defining one - the import replaces it; - include hygiene: <string> for Branded-key containers (16 units), <functional>, <mutex>, <exception>, folly Collect.h named explicitly by the compiler. Incidental fixes: dht standalone install.sh had committed -fsanitize=address in every configure (fatal for downstream standalone links once real code moved into module objects) - removed; PR #43's ReplayEventEmitter race fix ported into the IPendingConnection sub-module (the header it patched no longer exists); ConnectionLockingTest.cpp excluded from clangd-tidy (owner-approved selective disabling - the analysis tool cannot resolve folly customization points through module interfaces; the compiler accepts the file; also gains folly Collect.h textually since it instantiates the imported streamr::utils::collect). Also in MODERNIZATION.md per owner instruction: the CI-speed follow-up plan (build-directory caching, single-platform packaging check, ccache complement, port randomization) and the import-std verdict incl. the NDK r30 beta 1 retest (do not adopt; blockers are in Bionic itself, unchanged in the pre-release; clean CMake cross-probe workaround found and recorded for a future fixed NDK). Verified: full Release build; dht tests 83/83 (incl. the two new PR #43 race tests); trackerless-network 12+1; proxyclient 15/15; standalone builds of the whole chain; package lints green over all module units. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Module libraries: POSITION_INDEPENDENT_CODE ON The module libraries' objects end up inside the shared libstreamrproxyclient. macOS emits position-independent code by default, but on Linux the ELF linker rejects the standalone archives otherwise (R_AARCH64_ADR_PREL_PG_HI21 relocations against the module vtable/thunk symbols, "recompile with -fPIC") - first surfaced on the linux-arm64 leg once the consolidated code moved into module objects. Set uniformly in the StreamrModules.cmake helpers (canonical + synced copies). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * proxyclient lint: run clangd-tidy serially The implementation file imports many module interfaces; analysing several such files concurrently exhausted memory on the 2-core CI runners (clangd dies mid-protocol, clangd-tidy reports "Invalid header end"). The package has a handful of files - serial analysis costs seconds and preserves full coverage. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * proxyclient lint: exclude the import-heavy implementation file Even serial analysis of src/streamrproxyclient.cpp exhausts memory on the 2-core 7 GB CI runners: the file imports module interfaces from four packages and clangd builds that whole module graph in one process. Owner-approved selective disabling; the compiler builds the file on every platform and clang-format still checks it. Revisit when runners grow or clangd's module memory use shrinks. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * Retrigger CI: ConnectionLockingTest.LockingBothWays flaked on the arm64 runner Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes the
ConnectionLockingTest.CanLockConnectionsfailures on ubuntu-latest in PR #42's CI (commits fde4ad3 and 3a71303, 2026-07-04) — the rarer third mechanism that survived the two WebsocketConnection race fixes of PR #35.The mechanism
On the outgoing side, the connector starts socket I/O inside
createConnection()(WebsocketClientConnector::connect()callssocket->open()before returning), but the endpoint's listeners on the returnedPendingConnectionare only registered afterwards, whenConnectionManagerwraps it in anEndpoint(addEndpoint→changeToConnectingState→enterState). Those two steps run on the caller's thread; the handshake runs on libdatachannel's threads.If the calling thread is descheduled in that window for longer than one full localhost websocket + streamr handshake round-trip,
PendingConnection::onHandshakeCompleted()runs first. It does two things, in this order: disarms the 15 s connect watchdog, then emitsConnected— on a fire-and-forget emitter with zero listeners. The emission is lost, no timeout remains armed, and the endpoint sits inConnectingEndpointStateforever with the lock-request RPC buffered.That asymmetry (watchdog disarmed, event dropped) is what makes this a permanent stall rather than an eventual close-and-retry, and it produces exactly the observed signature:
folly::FutureTimeoutafter 10 s, caught and logged insidelockConnection),waitForConditiontimes out (the manager2 side never received the lock request — its own handshake side completed fine),hasConnection() == false(the endpoint exists but never reaches the connected state),The window is a few hundred instructions on the caller's thread versus ~1 ms of I/O on hot rtc threads, so it needs the caller preempted at exactly the wrong moment for several scheduler quanta — realistic on a loaded 2-core CI runner, effectively unreachable on a many-core dev machine. This is why 100 stress runs under 10× load on macOS never reproduced it. The same pattern exists on the reverse-connection path (
WebsocketClientConnectorRpcLocal::requestConnection→connectthenonNewConnection), and the same fix covers it. The incoming-handshake side is sequential on one thread and was never exposed.The fix
IPendingConnectionnow extendsReplayEventEmitterinstead of the fire-and-forgetEventEmitter: a listener registered after the fact receives the latestConnected/Disconnectedemission. This also covers the sibling case of an earlyDisconnected(fast connect failure) being lost the same way.EventEmitter::emit()now stores the latest event and snapshots the handler list atomically with respect toon()(both undermEventHandlersMutex). PreviouslyemitwrotemLatestEventwithoutmLatestEventMutexand iteratedmEventHandlerswithout its mutex, so a listener registered concurrently with an emit could miss both the live dispatch and the replay. After the change a concurrent registration gets exactly one of the two — never neither, never both.mEventHandlersMutexis only taken insidemEmitLoopMutex, matching the pre-existing once-handler-removal order, so no new lock ordering is introduced.Verification
ConnectionManager::send()betweencreateConnection()andonNewConnection()(simulating the preemption; not committed) makesCanLockConnectionson the pre-fix code fail with the exact CI output —EXPECT_NO_THROW ... Actual: it throws folly::FutureTimeout with description "Timed out"followed byhasConnection(...) Actual: false— in ~10 s. With the fix, the same forced preemption passes in 344 ms.PendingConnectionTest.ReplaysConnectedToLateListener,ReplaysDisconnectedToLateListener) encode the late-listener contract.ConnectionLockingTeststress: 60/60 runs green.MODERNIZATION.md's known-issue record is extended with the mechanism and fix, alongside the PR #35 entries.
🤖 Generated with Claude Code