Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of type:disconnect messages #1762

Merged
merged 4 commits into from
Apr 17, 2024

Conversation

filmaj
Copy link
Contributor

@filmaj filmaj commented Mar 21, 2024

This PR addresses two, maybe three bugs, some of which are breaking changes (but I want to release a new major version of this library soon, so now is the time):

  1. When calling disconnect() and already disconnected, do not raise an exception.
  2. BREAKING CHANGE: When receiving a too_many_websockets type:disconnect message from Slack, do not raise it as a slack_event lifecycle event. Also refactor the internal state machine events, combine the reason:warning and reason:refresh_requested type: disconnect messages into one event (since they are handled the same).

Previous Working Notes

Expanding the socket-mode integration tests as I try to help a Salesforce team get to the root of some rare exceptions they are seeing in their socket mode app. It seems if the backend returns "too many websockets" this package may get itself into an unrecoverable loop.

Question: what should Bolt / this package do if it receives a type:disconnect, reason:too_many_websockets message? Reconnecting is probably bad (as we could inadvertently cause a stampede of reconnecting->disconnecting events)?

Answer: probably the same as python Slack SDK and Java Slack SDK, which is to reconnect for all type:disconnect messages received from Slack backend.

I see the Python library has some mentions of too_many_websockets, and specifically calls out to close the previous connection first before opening a new one. I do not think node-slack-sdk socket-mode does this.. it does the opposite: creates a new connection first, and then closes the old one (based on this code).

Here is an excerpt of the production logs that I am trying to reproduce via these tests:

2024-03-19T07:54:32.276559+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"warning","debug_info":{"host":"applink-9"}}
2024-03-19T07:54:32.276617+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received "disconnect" (warning) message - creating the second connection
2024-03-19T07:54:32.276667+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Cancelled the job waiting for ping from Slack
2024-03-19T07:54:32.276717+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Terminated the heart beat job
2024-03-19T07:54:32.276717+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.276740+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.276881+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.277031+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.277065+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.277084+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.277128+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.277501+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.284315+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.284316+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.313400+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.313527+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.313528+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.313529+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.333581+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.345322+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-1"}}
2024-03-19T07:54:32.346427+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.347037+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.347154+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined
2024-03-19T07:54:32.348089+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.348144+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.348213+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.348214+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.348332+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.348333+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.348333+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.349712+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.349714+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.349714+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.372319+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.372493+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.372495+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.372524+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.388933+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.401919+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-13"}}
2024-03-19T07:54:32.402230+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.402614+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.402697+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined
2024-03-19T07:54:32.403304+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
2024-03-19T07:54:32.403352+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Reconnecting to Slack ...
2024-03-19T07:54:32.403458+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
2024-03-19T07:54:32.403528+00:00 app[app.1]: [INFO]  socket-mode:SocketModeClient:0 Going to establish a new connection to Slack ...
2024-03-19T07:54:32.403610+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
2024-03-19T07:54:32.403661+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
2024-03-19T07:54:32.403725+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') start
2024-03-19T07:54:32.404132+00:00 app[app.1]: [DEBUG]  bolt-app http request url: https://slack.com/api/apps.connections.open
2024-03-19T07:54:32.404196+00:00 app[app.1]: [DEBUG]  bolt-app http request body: {}
2024-03-19T07:54:32.404281+00:00 app[app.1]: [DEBUG]  bolt-app http request headers: {}
2024-03-19T07:54:32.424951+00:00 app[app.1]: [DEBUG]  bolt-app http response received
2024-03-19T07:54:32.425081+00:00 app[app.1]: [DEBUG]  bolt-app http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=XXX&app_id=XXX","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-03-19T07:54:32.425126+00:00 app[app.1]: [DEBUG]  bolt-app apiCall('apps.connections.open') end
2024-03-19T07:54:32.425230+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
2024-03-19T07:54:32.440719+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
2024-03-19T07:54:32.450793+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-14"}}
2024-03-19T07:54:32.450982+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.451351+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))
2024-03-19T07:54:32.451362+00:00 app[app.1]: [DEBUG]   Error details: TypeError: Cannot read properties of undefined (reading 'event'), retry num: undefined, retry reason: undefined

@filmaj filmaj added semver:patch enhancement M-T: A feature request for new functionality tests M-T: Testing work only draft pkg:socket-mode applies to `@slack/socket-mode` labels Mar 21, 2024
@filmaj filmaj self-assigned this Mar 21, 2024
@seratch
Copy link
Member

seratch commented Mar 22, 2024

@filmaj Thanks for thorough investigation on this!

I see the Python library has some mentions of too_many_websockets, and specifically calls out to close the previous connection first before opening a new one

The main reason behind the comment is to prevent errors when an app scales out to 10 instances; the maximum number of allowed connections per app. When 10 app instances are already up and running, it's worth first closing a connection before starting a new one. That's the reason why Python SDK does so.

If the reported situation is exactly the same, aligning with Python SDK could be beneficial. However, if the root cause is different, we might need to apply other improvements as well.

Generally speaking, when an app encounters the too_many_websockets error, the only available resolution is to reduce the number of active connections. It's viable to resolve the situation solely through client-side SDK changes.

If this library does not properly close some connections when swtiching them, we can definitely improve the logic to prevent such occurrences. However, I haven't yet identified such a bug exists. I've refactored some parts of the Node library a while ago, but I didn't modify the reconnecting logic at all then. We may need to improve it to make it even more stable. I will also check it out when I have time for it.

@filmaj
Copy link
Contributor Author

filmaj commented Mar 22, 2024

Thanks for your comments @seratch. I think this library closes connections cleanly, so I am not sure that is something we need to consider.

It makes sense to me to close the connection first before opening a new one if we have a max-connection constraint. Perhaps the fact the node library does not do this is simply an oversight from many years ago. In any case, given what you've said, it makes sense to me to make this change in this library.

I know it is a nervous proposition to make a significant behaviour change like that in this library, but at least the good news is that the library is now ready for a new major release (dropping EOL node versions support), so perhaps now is the time to make such a behaviour change? Additionally, with the existence of the new socket server integration tests, I have more confidence making behaviour changes as we can more fully test the state machine transitions for different, more complex sequences between socket peers.

@filmaj
Copy link
Contributor Author

filmaj commented Mar 22, 2024

One more note here. The following excerpt from the above logs is from bolt-js - not from socket-mode - so the interaction is a bit complex to tease out:

2024-03-19T07:54:32.345322+00:00 app[app.1]: [DEBUG]  socket-mode:SocketModeClient:0 Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-1"}}
2024-03-19T07:54:32.346427+00:00 app[app.1]: [DEBUG]  bolt-app undefined
2024-03-19T07:54:32.347037+00:00 app[app.1]: [ERROR]   An unhandled error occurred while Bolt processed (type: undefined, error: TypeError: Cannot read properties of undefined (reading 'event'))

My guess is that socket-mode, when it receives a type:disconnect,reason:too_many_websockets actually continues to propagate the event up as a full slack_event that the socket-mode module exposes (see bottom of Lifecycle Events documentation). This is different behaviour from when it receives other kinds of type:disconnect messages; for these messages it will NOT propagate it up as a slack_event. I think this is also a mistake, and leads to the above error through this code path in bolt-js (and in particular through the catch clause), and specifically:

  1. body, retryNum and retryReason are null or undefined when too_many_websockets gets raised as a slack_event. Everything except ack and type here would be null/undefined, I believe.
  2. bolt attempts to process the raised slack_event from the socket mode module
  3. in developer mode (which I'm pretty sure the app from which the above logs I posted come from has enabled), the body gets logged out. Bolt therefore logs out [DEBUG] bolt-app undefined
  4. bolt tries to getTypeAndConversation on the undefined body
  5. this method inspects a body.event subproperty as a first step. Since body is undefined, this leads to an exception...
  6. .. and explains the last log we see above: Cannot read properties of undefined (reading 'event')

@filmaj filmaj force-pushed the expanded-socket-mode-integration-tests branch from 65b4cd3 to ddfee0c Compare April 1, 2024 19:45
1. When calling disconnect() and already disconnected, do not raise an exception.
2. When receiving a too_many_websockets type:disconnect message from Slack, do not raise it as a slack_event lifecycle event. Also refactor the internal state machine events, combine the reason:warning and reason:refresh_requested type:disconnect messages into one event (since they are handled the same).
Tests for these too
@filmaj filmaj changed the title socket-mode: add test for reconnecting event via integration tests socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of too_many_websockets message Apr 1, 2024
@filmaj filmaj added this to the socket-mode@2.0 milestone Apr 1, 2024
@filmaj filmaj marked this pull request as ready for review April 1, 2024 20:43
@filmaj filmaj requested a review from a team April 1, 2024 20:46
Copy link
Member

@seratch seratch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

…Disconnect and ClientExplicitDisconnect. For all ServerExplicitDisconnect events, try to reconnect.
@filmaj filmaj changed the title socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of too_many_websockets message socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of type:disconnect messages Apr 2, 2024
@filmaj
Copy link
Contributor Author

filmaj commented Apr 17, 2024

@seratch I got further reports from customers that raising type:disconnect messages as slack_event can cause exceptions in their bolt app (as I described in this comment).

Question for you: do you think preventing the raising of type:disconnect messages as a slack_event would qualify as a major change? Perhaps it would be considered a patch change? I was thinking of reverting the Event enum name changes in this PR and just keeping:

  • no exception thrown when calling disconnect() and already disconnected, and
  • do not raise type:disconnect messages as a slack_event

.. and releasing a new patch release in the 1.3.x socket-mode line.

What do you think @seratch ?

I think this would help customers experiencing issues today while giving us space to work on the python-based port for the next major version.

@seratch
Copy link
Member

seratch commented Apr 17, 2024

@filmaj I don’t think it’s a major change in this context. So, we can ship it as a patch release!

@filmaj filmaj merged commit 3ebb6ce into main Apr 17, 2024
17 checks passed
@filmaj filmaj deleted the expanded-socket-mode-integration-tests branch April 17, 2024 18:56
filmaj added a commit that referenced this pull request Apr 17, 2024
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762) also add integration tests to this maintenance branch
filmaj added a commit that referenced this pull request Apr 17, 2024
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762)

also add integration tests to this maintenance branch
filmaj added a commit that referenced this pull request Apr 17, 2024
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762) also add integration tests to this maintenance branch
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement M-T: A feature request for new functionality pkg:socket-mode applies to `@slack/socket-mode` semver:major tests M-T: Testing work only
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants