-
Notifications
You must be signed in to change notification settings - Fork 661
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
socket-mode: do not throw exception when calling disconnect() and already disconnected; do not raise slack_event in case of type:disconnect messages #1762
Conversation
@filmaj Thanks for thorough investigation on this!
The main reason behind the comment is to prevent errors when an app scales out to 10 instances; the maximum number of allowed connections per app. When 10 app instances are already up and running, it's worth first closing a connection before starting a new one. That's the reason why Python SDK does so. If the reported situation is exactly the same, aligning with Python SDK could be beneficial. However, if the root cause is different, we might need to apply other improvements as well. Generally speaking, when an app encounters the too_many_websockets error, the only available resolution is to reduce the number of active connections. It's viable to resolve the situation solely through client-side SDK changes. If this library does not properly close some connections when swtiching them, we can definitely improve the logic to prevent such occurrences. However, I haven't yet identified such a bug exists. I've refactored some parts of the Node library a while ago, but I didn't modify the reconnecting logic at all then. We may need to improve it to make it even more stable. I will also check it out when I have time for it. |
Thanks for your comments @seratch. I think this library closes connections cleanly, so I am not sure that is something we need to consider. It makes sense to me to close the connection first before opening a new one if we have a max-connection constraint. Perhaps the fact the node library does not do this is simply an oversight from many years ago. In any case, given what you've said, it makes sense to me to make this change in this library. I know it is a nervous proposition to make a significant behaviour change like that in this library, but at least the good news is that the library is now ready for a new major release (dropping EOL node versions support), so perhaps now is the time to make such a behaviour change? Additionally, with the existence of the new socket server integration tests, I have more confidence making behaviour changes as we can more fully test the state machine transitions for different, more complex sequences between socket peers. |
One more note here. The following excerpt from the above logs is from bolt-js - not from socket-mode - so the interaction is a bit complex to tease out:
My guess is that socket-mode, when it receives a
|
65b4cd3
to
ddfee0c
Compare
1. When calling disconnect() and already disconnected, do not raise an exception. 2. When receiving a too_many_websockets type:disconnect message from Slack, do not raise it as a slack_event lifecycle event. Also refactor the internal state machine events, combine the reason:warning and reason:refresh_requested type:disconnect messages into one event (since they are handled the same). Tests for these too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
…Disconnect and ClientExplicitDisconnect. For all ServerExplicitDisconnect events, try to reconnect.
@seratch I got further reports from customers that raising Question for you: do you think preventing the raising of
.. and releasing a new patch release in the 1.3.x socket-mode line. What do you think @seratch ? I think this would help customers experiencing issues today while giving us space to work on the python-based port for the next major version. |
@filmaj I don’t think it’s a major change in this context. So, we can ship it as a patch release! |
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762) also add integration tests to this maintenance branch
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762) also add integration tests to this maintenance branch
…eady disconnected; do not raise slack_event in case of type:disconnect messages (#1762) also add integration tests to this maintenance branch
This PR addresses two, maybe three bugs, some of which are breaking changes (but I want to release a new major version of this library soon, so now is the time):
disconnect()
and already disconnected, do not raise an exception.too_many_websockets
type:disconnect
message from Slack, do not raise it as aslack_event
lifecycle event. Also refactor the internal state machine events, combine thereason:warning
andreason:refresh_requested
type: disconnect messages
into one event (since they are handled the same).Previous Working Notes
Expanding the socket-mode integration tests as I try to help a Salesforce team get to the root of some rare exceptions they are seeing in their socket mode app. It seems if the backend returns "too many websockets" this package may get itself into an unrecoverable loop.
Question: what should Bolt / this package do if it receives a
type:disconnect, reason:too_many_websockets
message? Reconnecting is probably bad (as we could inadvertently cause a stampede of reconnecting->disconnecting events)?Answer: probably the same as python Slack SDK and Java Slack SDK, which is to reconnect for all
type:disconnect
messages received from Slack backend.I see the Python library has some mentions of
too_many_websockets
, and specifically calls out to close the previous connection first before opening a new one. I do not think node-slack-sdk socket-mode does this.. it does the opposite: creates a new connection first, and then closes the old one (based on this code).Here is an excerpt of the production logs that I am trying to reproduce via these tests: