Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: [web3] only ever send RPC socket messages when the socket is open #29195

Merged
merged 1 commit into from
Jan 31, 2023
Merged

fix: [web3] only ever send RPC socket messages when the socket is open #29195

merged 1 commit into from
Jan 31, 2023

Conversation

steveluscher
Copy link
Contributor

@steveluscher steveluscher commented Dec 10, 2022

Problem

Our rpc-websockets dependency does not guard against the underlying socket not being OPEN before it tries to call send() on the socket itself.

  • When the readyState of a socket is CONNECTING and you try to call send(), it will fatal.
  • When the readyState of a socket is CLOSING or CLOSED and you try to call send(), it will discard the call silently. A console error results sometime later, but in no event does it fatal your program.

All of this can lead to corruption of the internal state of the RPC; it can think that it's connected, able to send messages, and is owed responses, when it is not.

Summary of Changes

  • Subclass rpc-websockets and wrap the call() and notify() methods in code that checks the readyState before trying to call send().

Fixes #25578, #27167, solana-labs/solana-web3.js#1106.

@github-actions github-actions bot added the web3.js Related to the JavaScript client label Dec 10, 2022
@codecov
Copy link

codecov bot commented Dec 10, 2022

Codecov Report

Merging #29195 (ff48353) into master (1064bcb) will decrease coverage by 0.3%.
The diff coverage is 50.0%.

@@            Coverage Diff            @@
##           master   solana-labs/solana#29195     +/-   ##
=========================================
- Coverage    76.5%    76.3%   -0.3%     
=========================================
  Files          54       56      +2     
  Lines        3129     3151     +22     
  Branches      470      475      +5     
=========================================
+ Hits         2396     2406     +10     
- Misses        568      578     +10     
- Partials      165      167      +2     

@Jac0xb
Copy link

Jac0xb commented Dec 28, 2022

What ever happened with this? @steveluscher

@steveluscher
Copy link
Contributor Author

I'm pretty sure that it works @Jac0xb, but I don't have a local repro yet. I asked @laterbreh, @Disperito, and @gallynaut to help me give it a shot, but I think that all of them have moved on.

If anyone is reading this and has a repro of the bug still, can you try to upgrade to an experimental release, by running yarn add @solana/web3.js@pr-29195? That's a package I custom built from this PR for you to test.

@steveluscher
Copy link
Contributor Author

Alright, I'm sufficiently bullish on this change that I'm going to ship it tomorrow, even though I haven't been able to scare up someone with a repro to test it out.

@steveluscher steveluscher merged commit 8921b0a into solana-labs:master Jan 31, 2023
@steveluscher steveluscher deleted the web3-never-send-rpc-socket-messages-unless-socket-is-open branch January 31, 2023 17:03
@grabbou
Copy link

grabbou commented Feb 12, 2023

@steveluscher thank you for debugging and working on this PR! I have some questions, now that I am experiencing this too. I would love to help debug this and get it eventually fixed, but to do so, I would like to better understand some context (especially in relation to this PR). FWIW, the issue I am seeing is happening in context of React Native.

I think we're having two issues here happening potentially at the same time. The way I see it is that uncaught exceptions are causing the code handling WebSocket connection/subscriptions to enter infinite loop. Checking state for connected and throwing an Error (or rejecting a promise) otherwise sounds like not changing anything but just throwing a custom error instead of the one from an underlying call function.

From my debugging, this is the line that is called infinitely as soon as first call throws an error:
https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5990-L5992

Was the idea here that it is RpcWebSocket that gets corrupted while there is an error, so it is better to check for the state ahead of calling its call method and throwing an error ourselves?

If yes, I think a (better?) long-term fix could be to either:

  • fix an underlying issue in the RpcWebSocket
  • implement a special workaround in web3.js code (around where we call call for the first time) to guard against infinite loops

There can be other errors happening with WebSockets as well, so guarding against readyState sounds potentially dangerous.


Also, another thing that I wanted to point out that might be relevant here is - I noticed that you mentioned race conditions in one of the issues. I am currently observing a behavior when this PR doesn't actually work - the code itself is fine, but the underlyingSocket.readyState is 1, leading to call, and then, inside I am getting an actual error of readyState being 0 (connecting). Something must be getting out of sync here.

@@ -36,6 +35,7 @@ import fetchImpl, {Response} from './fetch-impl';
import {DurableNonce, NonceAccount} from './nonce-account';
import {PublicKey} from './publickey';
import {Signer} from './keypair';
import RpcWebSocketClient from './rpc-websocket';
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create RpcWebSocketClient like this later:

this._rpcWebSocket = new RpcWebSocketClient(this._rpcWsEndpoint, {
      autoconnect: false,
      max_reconnects: Infinity,
    });

it seems that max_reconnects: Infinity is a bit misleading here, as it is going to be overwritten by explicitly hardcoded max_reconnects: 5 in the rpc-websocket.ts file?

@grabbou
Copy link

grabbou commented Feb 13, 2023

Quick follow up to my previous comment, after further debugging:

implement a special workaround in web3.js code (around where we call call for the first time) to guard against infinite loops

It looks like there's already a notion of it in the code, it's just that it doesn't seem to work in this case:

https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L6009-L6017

The reason this doesn't work is that it checks for rpcWebSocketGeneration

https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5942-L5944

and that will return false only after the onclose event was fired:

https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5828-L5831

In other words, this will return false-positive results for all the errors that happen while connection was either open or connecting.

After doing debugging and console logging different connection states (hell yeah), I discovered that in my particular local scenario, the following happens:

  • open event fired
  • RpcWebSocket ready is true
  • RpcWebSocket socket.readyState is 1 (connected)
  • close event fired with code 1001 and reason Stream end encountered
  • open event fired
  • RpcWebSocket ready is true
  • RpcWebSocket socket.readyState is 0 (connecting)

Now any call to call will lead to the following error:
https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/rpc-websocket.ts#L52-L60

And that will enter infinite loop. That's why this PR has no effect (unfortunately) on the original issue.

It looks to me that the open event is fired prematurely by rpc-websocket following close event (the connection is still connecting).

Now, I started wondering why the close event was fired and it wasn't triggered by the server (apparently). It sounds like we call this._updateSubscriptions() immediately from open event and since there are no subscriptions registered at that point, it automatically closes the connection.

https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5908-L5911

AFTER I COMMENTED OUT THIS CODE, EVERYTHING WORKS AS EXPECTED

It sounds like I may have entered into a state where my connection keeps on being created and automatically closed which causes an error. After commenting out that entire code (disconnect on no subscriptions), the error completely went away!

@grabbou
Copy link

grabbou commented Feb 13, 2023

Note there is another issue of WebSockets (and rpc-websockets) reporting ready: true while underlying socket is still readyState: 0, but I think this might be a race condition resulting of the fact we just try to connect/re-connect to fast.

TL;DR because this thread is really long now:

  1. Revert this PR
  2. Comment out https://github.com/solana-labs/solana-web3.js/blob/67f3bf855325daec91489763471cdda535b29e25/packages/library-legacy/src/connection.ts#L5908-L5911
  3. Profit

This is definitely not the right way to do it, but... I guess with a little bit of help we can make sure that its intent is preserved w/o causing this issue.

@steveluscher
Copy link
Contributor Author

@grabbou, this is awesome debugging. Are you close to being able to share a repro so that I can comb over these findings too?

@grabbou
Copy link

grabbou commented Feb 19, 2023

Hey, yes, I was taking my time since my workaround was working "just fine" and I was totally overdue with some fun... I will send something shortly!

@steveluscher
Copy link
Contributor Author

fix an underlying issue in the RpcWebSocket

I'm pretty sure that elpheria/rpc-websockets#138 fixes the source of corruption of the subscriptions state machine here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
web3.js Related to the JavaScript client
Projects
None yet
Development

Successfully merging this pull request may close these issues.

web3.js: rpc-websockets has been observed entering a corrupt state
3 participants