New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix #1243 Socket Mode reconnection issue #1465
Conversation
|
||
/* eslint-disable @typescript-eslint/indent, newline-per-chained-call */ | ||
|
||
private connectingSubmachine: Configuration<string, string> = Finity.configure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
connectingSubmachine and refreshingSubmachine are mostly the same. Perhaps, we can have a factory method, which returns a new Configuration instance instead of duplicated code. Thoughts? @stevengill
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that having the two is the root cause of the reconnection issue. I've removed the refreshingSubmachine and simplified the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a lot of sense! In theory there should be no difference between initially-connecting and reconnecting.
} | ||
// Start heartbeat to keep track of the websocket connection continuing to be alive | ||
this.startPeriodicallySendingPingToSlack(); | ||
this.startMonitoringPingFromSlack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This used to be heartbeat()
. I've renamed the method to be more specific.
this.badConnection = false; | ||
} | ||
// Start heartbeat to keep track of the websocket connection continuing to be alive | ||
this.startPeriodicallySendingPingToSlack(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the newly added one.
this.logger.debug('Tearing down WebSocket connections finished'); | ||
} | ||
|
||
private startPeriodicallySendingPingToSlack(): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is the newly added validation logic.
private heartbeat(): void { | ||
if (this.pingTimeout !== undefined) { | ||
clearTimeout(this.pingTimeout); | ||
private startMonitoringPingFromSlack(): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed
It seems that the new validation logic does not work in all the possible patterns. I will do more testing and improve the logic tomorrow . |
|
||
if (this.clientOptions.retryConfig === undefined) { | ||
// For faster retries of apps.connections.open API calls for reconnecting | ||
this.clientOptions.retryConfig = { retries: 100, factor: 1.3 }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved the interval to run the WSS URL fetching API calls.
}); | ||
}) | ||
.on('server disconnect warning') | ||
.transitionTo('connecting').withCondition(() => this.autoReconnectEnabled) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed from refreshing-connection
to connecting
. The whole logic is much simpler now.
// currently have two websockets, so tear down the older one | ||
this.websocket.removeAllListeners('open'); | ||
this.websocket.removeAllListeners('close'); | ||
this.websocket.removeAllListeners('error'); | ||
this.websocket.removeAllListeners('message'); | ||
this.websocket.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old connection needs to be closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo good catch
// only one websocket to tear down | ||
this.websocket.removeAllListeners('open'); | ||
this.websocket.removeAllListeners('close'); | ||
this.websocket.removeAllListeners('error'); | ||
this.websocket.removeAllListeners('message'); | ||
this.websocket.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
The current revision still does not work in some scenarios. I will check it in detail tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! The removal of reconnecting logic and consolidating around a single connecting state machine is a great improvement.
I tested by running the simple.js
example you provided, seeing the ping/pongs from both the client and the server side, disconnecting my wifi, and after some time reconnecting my wifi. It seemed to work well.
Just for my own education: when the client receives a ping from the server, the client does not respond with a pong - is that correct? Or am I missing something? When watching the debug logs in the simple example, I noticed that for the client-initiated heartbeat, we log out both pings and pongs, but for the server-initiated heartbeat, we only log the pings. Is that by design?
|
||
/* eslint-disable @typescript-eslint/indent, newline-per-chained-call */ | ||
|
||
private connectingSubmachine: Configuration<string, string> = Finity.configure() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes a lot of sense! In theory there should be no difference between initially-connecting and reconnecting.
// currently have two websockets, so tear down the older one | ||
this.websocket.removeAllListeners('open'); | ||
this.websocket.removeAllListeners('close'); | ||
this.websocket.removeAllListeners('error'); | ||
this.websocket.removeAllListeners('message'); | ||
this.websocket.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooo good catch
@stevengill @filmaj
Once we merge these, I will rebase this PR again. Also, today I've applied a bunch of changes and now I don't see any issues with the latest code here. If you have time tomorrow, can you check the code for me? Thanks for your time! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still looks good, and thanks for splitting the PRs out - that made reviewing a lot easier. I did leave one question around a specific debug log message you added that I can't seem to reproduce when testing locally... here's the relevant log output during my test, shortly before, during and after I disconnected and reconnected my wifi from my laptop:
[DEBUG] socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966865067))
[DEBUG] socket-mode:SocketModeClient:0 Received ping from Slack server (data: Ping from applink-78ccf745b6-56dzm)
[DEBUG] socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966866733)
[DEBUG] socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966868399)
[DEBUG] socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966870067)
[DEBUG] socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966871733)
[INFO] socket-mode:SocketModeClient:0 A pong wasn't received from the server before the timeout of 5000ms!
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connected:reconnecting
[DEBUG] socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966873400)
[INFO] socket-mode:SocketModeClient:0 A pong wasn't received from the server before the timeout of 5000ms!
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
[DEBUG] socket-mode:SocketModeClient:0 Since only the primary WebSocket exists, going to tear it down...
[DEBUG] socket-mode:SocketModeClient:0 Tearing down WebSocket connections finished
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connecting
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
[DEBUG] socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
[DEBUG] web-api:WebClient:0 apiCall('apps.connections.open') start
[DEBUG] web-api:WebClient:0 will perform http request
[WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG] web-api:WebClient:0 will perform http request
[WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG] web-api:WebClient:0 will perform http request
[WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG] web-api:WebClient:0 will perform http request
[WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG] web-api:WebClient:0 will perform http request
[WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG] web-api:WebClient:0 will perform http request
[DEBUG] web-api:WebClient:0 http response received
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
[DEBUG] socket-mode:SocketModeClient:0 received a message on the WebSocket: {"type":"hello","num_connections":2,"debug_info":{"host":"applink-78ccf745b6-vxtsq","build_number":3,"approximate_connection_time":18060},"connection_info":{"app_id":"A02DH5S1NM6"}}
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connected
[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connected:ready
[DEBUG] socket-mode:SocketModeClient:0 Received ping from Slack server (data: Ping from applink-78ccf745b6-56dzm)
[DEBUG] socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966866733))
[DEBUG] socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966868399))
I was expecting to see a DEBUG log for "Reconnecting to Slack..." but it's not there... not sure why?
@@ -20,11 +20,40 @@ import { UnrecoverableSocketModeStartError } from './UnrecoverableSocketModeStar | |||
|
|||
const packageJson = require('../package.json'); // eslint-disable-line import/no-commonjs, @typescript-eslint/no-var-requires | |||
|
|||
// These enum values are used only in the state machine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this 👍
.state('reconnecting') | ||
.state(State.Reconnecting) | ||
.onEnter(() => { | ||
this.logger.debug('Reconnecting to Slack ...'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good but is it essentially a duplicate of the global state machine's generic onStateEnter
event logger at the very top of this state machine's definition? In fact, after running the simple example you provided, this log does not come up at all...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't see this log, the SocketModeClient code might be a bit older one. I've added this log mainly for debugging. We can remove it if it sounds duplicated with other logs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for splitting the PR. I imagine it greatly improved the readability!
I tested this locally with simple.js
and temporarily disconnected my wi-fi. Works great! I think this is a huge improvement for those facing the socket mode disconnected issue. 🎉 If we have missed a use-case, then we can continue to refine it later.
@filmaj Can you run |
.state('reconnecting') | ||
.state(State.Reconnecting) | ||
.onEnter(() => { | ||
this.logger.debug('Reconnecting to Slack ...'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you don't see this log, the SocketModeClient code might be a bit older one. I've added this log mainly for debugging. We can remove it if it sounds duplicated with other logs.
this.websocket.removeAllListeners('open'); | ||
this.websocket.removeAllListeners('close'); | ||
this.websocket.removeAllListeners('error'); | ||
this.websocket.removeAllListeners('message'); | ||
this.websocket.close(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we can improve this switching logic to be a bit safer.
- have a local variable for the old connection and assign the existing connection to it
- switch
this.websocekt
to the new connection - setTimeout
- remove the listeners from the old connection
- close the old one
In the latest revision, I've removed this state "connected:reconnecting". Having two "reconnecting" state in the state machine was an issue in the code yesterday |
Co-authored-by: Fil Maj <maj.fil@gmail.com>
5f56cf1
to
e8a8f78
Compare
Summary
This pull request resolves #1243
The cause of the issue is that the
heartbeat()
method can work only when a SocketModeClient receives ping from Slack server. When the connection is unable to receive ping from Slack for some reason (stale connection etc.), the reconnecting logic no longer works.This pull request adds additional connection validation, which sends ping from
SocketModeClient
side and checks if the client receives pong from Slack server. If the client does not receive pong for a certain amount of time, the client does reconnection anyway. This is a proven approach in Python and Java SDK.Requirements (place an
x
in each[ ]
)