Fix the two races behind the flaky socket integration tests #35

Merged: ptesavol merged 1 commit into main from fix/flaky-socket-tests on Jul 4, 2026


ptesavol commented on Jul 4, 2026


The intermittent macOS CI failures of WebsocketClientServerTest / ConnectionLockingTest (the "flaky socket tests" that have forced retriggers on most PRs, including #34) turn out to be two real defects in WebsocketConnection, not test timing or port assumptions. Both reproduce locally and both are fixed at the library level — no test was modified.

Bug 1 — first frames silently dropped (the hangs/timeouts)

setSocket() attached the rtc message callback immediately, but the application's Data listener is only registered later — inside the Connected event the server emits. A frame arriving in that window was pumped through emit<Data> with zero listeners and vanished:

  • WebsocketClientServerTest.TestClientCanTrasmitMessageToServer: the test's promise never resolves → hang until the ctest timeout.
  • ConnectionLockingTest.CanLockConnections: the first frame on the wire is the handshake request → connection never materializes → folly::FutureTimeout after 10 s, hasConnection() == false (exactly the CI failure output).

Fix: setSocket() no longer attaches the message callback. A new startReceiving() attaches it: the server calls it right after emitting Connected to the application, and the client calls it before open() (its listeners are registered before connect()). This loses nothing: libdatachannel queues incoming frames internally until a message callback is attached and flushes them synchronously at attach (Channel::onMessage → flushPendingMessages()), so the rtc queue bridges the gap.
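
For illustration, the queue-then-flush behavior the fix relies on can be sketched as follows. This is a simplified, single-threaded stand-in, not libdatachannel's implementation: BufferingChannel, deliver() and receiveWithLateAttach() are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a channel that buffers incoming frames until a
// message callback is attached. Not libdatachannel's actual class.
class BufferingChannel {
public:
    using MessageCallback = std::function<void(const std::string&)>;

    // Called by the "network" side when a frame arrives.
    void deliver(std::string frame) {
        if (callback_) {
            callback_(frame);
        } else {
            pending_.push(std::move(frame));  // no callback yet: keep the frame
        }
    }

    // Attaching the callback flushes everything queued so far, in order.
    void onMessage(MessageCallback callback) {
        callback_ = std::move(callback);
        while (callback_ && !pending_.empty()) {
            std::string frame = std::move(pending_.front());
            pending_.pop();
            callback_(frame);
        }
    }

private:
    MessageCallback callback_;
    std::queue<std::string> pending_;
};

// A frame that arrives before any listener exists is still received once the
// callback is attached late (the startReceiving() moment described above).
inline std::vector<std::string> receiveWithLateAttach() {
    BufferingChannel channel;
    std::vector<std::string> received;

    channel.deliver("handshake-request");  // arrives before any listener
    channel.onMessage([&received](const std::string& frame) {
        received.push_back(frame);
    });
    assert(received.size() == 1);          // queued frame flushed at attach
    channel.deliver("second-frame");       // arrives after attach
    return received;
}
```

With this shape, delaying the attach until the application's listeners exist costs nothing, because the early frame waits in the queue instead of being emitted to zero listeners.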

While reading the rtc source I also closed a sibling hole: rtc does not retro-fire the open callback, so a handshake completing before setSocket() attached onOpen would permanently swallow Connected. setSocket() now checks readyState() after attaching and emits Connected itself in that case (mConnectedEmitted keeps the two paths idempotent).
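
A minimal sketch of how a single flag can keep the two Connected paths idempotent. It assumes an atomic flag with exchange(); the PR only states that mConnectedEmitted exists, so the type and the names ConnectedNotifier and emitConnectedOnce() are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper: emits "Connected" at most once, no matter which of
// the two paths (open callback or post-attach state check) gets there first.
class ConnectedNotifier {
public:
    explicit ConnectedNotifier(std::function<void()> onConnected)
        : onConnected_(std::move(onConnected)) {}

    // Returns true only for the call that actually emitted.
    bool emitConnectedOnce() {
        if (connectedEmitted_.exchange(true)) {
            return false;  // the other path already emitted
        }
        onConnected_();
        return true;
    }

private:
    std::function<void()> onConnected_;
    std::atomic<bool> connectedEmitted_{false};  // assumption: atomic flag
};

// Simulates both paths firing for the same connection.
inline int countEmissionsFromBothPaths() {
    int emitted = 0;
    ConnectedNotifier notifier([&emitted] { ++emitted; });

    bool first = notifier.emitConnectedOnce();   // path 1: open callback
    bool second = notifier.emitConnectedOnce();  // path 2: readyState() check
    assert(first && !second);
    return emitted;
}
```

Whichever path wins the exchange emits; the other becomes a no-op, so the application never sees Connected twice.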

Bug 2 — teardown deadlock (the "close timeout" hangs)

close()/destroy() called socket->resetCallbacks() while holding mMutex. rtc's synchronized_callback holds its own mutex during callback invocation, and assignment/reset blocks until in-flight callbacks return — and our rtc callbacks lock mMutex. Classic AB-BA inversion:

  • main thread: mMutex → waits on callback mutex (inside resetCallbacks())
  • rtc thread: callback mutex → waits on mMutex (inside our onClosed)

Observed as tests hanging forever after WebSocket close timeout in the rtc log — this was still reproducible after fixing bug 1 (3 hangs in the first 107 post-fix runs), which is what exposed it as a separate mechanism.

Fix: mark mDestroyed and copy the socket pointer under mMutex, then call resetCallbacks()/close() outside it.
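
The pattern of this fix can be sketched as below. It is a simplified illustration under stated assumptions, not the actual WebsocketConnection code: FakeSocket, Connection and destroyTwiceIsSafe() are hypothetical, and the fake resetCallbacks() does not really block on in-flight callbacks.

```cpp
#include <cassert>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical socket. In the real library, resetCallbacks() may block until
// in-flight callbacks return; here it only records that it was called.
struct FakeSocket {
    bool callbacksReset = false;
    bool closed = false;
    void resetCallbacks() { callbacksReset = true; }
    void close() { closed = true; }
};

class Connection {
public:
    explicit Connection(std::shared_ptr<FakeSocket> socket)
        : socket_(std::move(socket)) {}

    void destroy() {
        std::shared_ptr<FakeSocket> socket;
        {
            // Hold the mutex only to flip state and copy the pointer.
            std::lock_guard<std::mutex> lock(mutex_);
            if (destroyed_) {
                return;
            }
            destroyed_ = true;
            socket = socket_;
        }
        // Mutex released: a callback waiting on mutex_ can now finish, so a
        // blocking resetCallbacks() cannot form the AB-BA cycle.
        if (socket) {
            socket->resetCallbacks();
            socket->close();
        }
    }

    // Mirrors what a callback would do: take the mutex and read state.
    bool isDestroyed() {
        std::lock_guard<std::mutex> lock(mutex_);
        return destroyed_;
    }

private:
    std::mutex mutex_;
    bool destroyed_ = false;
    std::shared_ptr<FakeSocket> socket_;
};

inline bool destroyTwiceIsSafe() {
    auto socket = std::make_shared<FakeSocket>();
    Connection connection(socket);
    connection.destroy();
    connection.destroy();  // second call returns early under the mutex
    assert(socket->callbacksReset && socket->closed);
    return connection.isDestroyed();
}
```

The key design choice is that no call that can wait on another thread's callback is made while mutex_ is held.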

Evidence

| Loop | Pre-fix | Post-fix |
| --- | --- | --- |
| WebsocketClientServerTest ×N (Debug) | 3/200 hung | 0/500 |
| ConnectionLockingTest ×N (Debug) | 2/100 failed | 0/200 |
| Both suites (Release, CI config) | | 0/225 |
| Full streamr-dht suite | | 81/81 (Debug and Release) |
| trackerless-network suites | | green (Debug and Release) |

The test.sh retry/timeout mitigations (--repeat until-pass:2 --timeout 300) are left in place as general hygiene. MODERNIZATION.md's known-issue record is updated with the resolution.

🤖 Generated with Claude Code

The intermittent failures/hangs of WebsocketClientServerTest and
ConnectionLockingTest (and the ConnectionManagerTest cast) were two
defects in WebsocketConnection, not test timing assumptions:

1. Dropped first frames. setSocket() attached the rtc message callback
   immediately, but the application's Data listener is only registered
   later (inside the Connected event the server emits). A frame
   arriving in between was emitted into a listener-less EventEmitter
   and silently lost - the ConnectionLocking handshake never completed
   (FutureTimeout after 10 s) and the websocket test's promise never
   resolved (hang). Fix: setSocket() no longer attaches the message
   callback; new startReceiving() attaches it - the server calls it
   right after emitting Connected to the application (libdatachannel
   queues frames internally until the callback attaches and flushes
   them synchronously at attach, so nothing is lost), the client before
   open() as before. Also emit Connected from setSocket() when the
   handshake completed before the onOpen callback was attached (rtc
   does not retro-fire open; mConnectedEmitted keeps the paths
   idempotent).

2. Teardown deadlock. close()/destroy() called socket->resetCallbacks()
   while holding mMutex. rtc holds its callback mutex during callback
   invocation and resetCallbacks() blocks until in-flight callbacks
   return - and our callbacks lock mMutex: a classic AB-BA inversion
   (main: mMutex -> callback mutex; rtc thread: callback mutex ->
   mMutex), seen as tests hanging after "WebSocket close timeout".
   Fix: mark destroyed and copy the socket pointer under mMutex,
   then reset/close outside it.

Evidence: pre-fix, WebsocketClientServerTest hung 3/200 local runs and
ConnectionLockingTest failed 2/100 (Debug). Post-fix: 0/500 and 0/200
(Debug), 0/150 and 0/75 (Release), full dht suite 81/81 and
trackerless-network suites green in both configs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
ptesavol merged commit 7c451ab into main on Jul 4, 2026
6 checks passed