Fix the two races behind the flaky socket integration tests#35
Merged
Conversation
The intermittent failures/hangs of WebsocketClientServerTest and ConnectionLockingTest (and the ConnectionManagerTest cast) were two defects in WebsocketConnection, not test timing assumptions: 1. Dropped first frames. setSocket() attached the rtc message callback immediately, but the application's Data listener is only registered later (inside the Connected event the server emits). A frame arriving in between was emitted into a listener-less EventEmitter and silently lost - the ConnectionLocking handshake never completed (FutureTimeout after 10 s) and the websocket test's promise never resolved (hang). Fix: setSocket() no longer attaches the message callback; new startReceiving() attaches it - the server calls it right after emitting Connected to the application (libdatachannel queues frames internally until the callback attaches and flushes them synchronously at attach, so nothing is lost), the client before open() as before. Also emit Connected from setSocket() when the handshake completed before the onOpen callback was attached (rtc does not retro-fire open; mConnectedEmitted keeps the paths idempotent). 2. Teardown deadlock. close()/destroy() called socket->resetCallbacks() while holding mMutex. rtc holds its callback mutex during callback invocation and resetCallbacks() blocks until in-flight callbacks return - and our callbacks lock mMutex: a classic AB-BA inversion (main: mMutex -> callback mutex; rtc thread: callback mutex -> mMutex), seen as tests hanging after "WebSocket close timeout". Fix: mark destroyed and copy the socket pointer under mMutex, then reset/close outside it. Evidence: pre-fix, WebsocketClientServerTest hung 3/200 local runs and ConnectionLockingTest failed 2/100 (Debug). Post-fix: 0/500 and 0/200 (Debug), 0/150 and 0/75 (Release), full dht suite 81/81 and trackerless-network suites green in both configs. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
|
Bugbot is not enabled for this team, so this pull request was not reviewed. Enable Bugbot in the Cursor dashboard to get automatic reviews on future PRs. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The intermittent macOS CI failures of
WebsocketClientServerTest/ConnectionLockingTest(the "flaky socket tests" that have forced retriggers on most PRs, including #34) turn out to be two real defects inWebsocketConnection, not test timing or port assumptions. Both reproduce locally and both are fixed at the library level — no test was modified.Bug 1 — first frames silently dropped (the hangs/timeouts)
setSocket()attached the rtc message callback immediately, but the application'sDatalistener is only registered later — inside theConnectedevent the server emits. A frame arriving in that window was pumped throughemit<Data>with zero listeners and vanished:WebsocketClientServerTest.TestClientCanTrasmitMessageToServer: the test's promise never resolves → hang until the ctest timeout.ConnectionLockingTest.CanLockConnections: the first frame on the wire is the handshake request → connection never materializes →folly::FutureTimeoutafter 10 s,hasConnection() == false(exactly the CI failure output).Fix:
setSocket()no longer attaches the message callback. A newstartReceiving()attaches it — the server calls it right after emittingConnectedto the application, the client beforeopen()(its listeners are registered beforeconnect()). This loses nothing: libdatachannel queues incoming frames internally until a message callback is attached and flushes them synchronously at attach (Channel::onMessage→flushPendingMessages()), so the rtc queue bridges the gap.While reading the rtc source I also closed a sibling hole: rtc does not retro-fire the open callback, so a handshake completing before
setSocket()attachedonOpenwould permanently swallowConnected.setSocket()now checksreadyState()after attaching and emitsConnecteditself in that case (mConnectedEmittedkeeps the two paths idempotent).Bug 2 — teardown deadlock (the "close timeout" hangs)
close()/destroy()calledsocket->resetCallbacks()while holdingmMutex. rtc'ssynchronized_callbackholds its own mutex during callback invocation, and assignment/reset blocks until in-flight callbacks return — and our rtc callbacks lockmMutex. Classic AB-BA inversion:mMutex→ waits on callback mutex (insideresetCallbacks())mMutex(inside ouronClosed)Observed as tests hanging forever after
WebSocket close timeoutin the rtc log — this was still reproducible after fixing bug 1 (3 hangs in the first 107 post-fix runs), which is what exposed it as a separate mechanism.Fix: mark
mDestroyedand copy the socket pointer undermMutex, then callresetCallbacks()/close()outside it.Evidence
The
test.shretry/timeout mitigations (--repeat until-pass:2 --timeout 300) are left in place as general hygiene. MODERNIZATION.md's known-issue record is updated with the resolution.🤖 Generated with Claude Code