Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #1243 Socket Mode reconnection issue #1465

Merged
merged 6 commits into from Apr 15, 2022

Conversation

seratch
Copy link
Member

@seratch seratch commented Apr 13, 2022

Summary

This pull request resolves #1243

The cause of the issue is that the heartbeat() method can work only when a SocketModeClient receives ping from Slack server. When the connection is unable to receive ping from Slack for some reason (stale connection etc.), the reconnecting logic no longer works.

This pull request adds additional connection validation, which sends ping from SocketModeClient side and checks if the client receives pong from Slack server. If the client does not receive pong for a certain amount of time, the client does reconnection anyway. This is a proven approach in Python and Java SDK.

Requirements (place an x in each [ ])

@seratch seratch added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented pkg:socket-mode applies to `@slack/socket-mode` labels Apr 13, 2022
@seratch seratch added this to the socket-mode@1.3.0 milestone Apr 13, 2022
@seratch seratch self-assigned this Apr 13, 2022

/* eslint-disable @typescript-eslint/indent, newline-per-chained-call */

private connectingSubmachine: Configuration<string, string> = Finity.configure()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

connectingSubmachine and refreshingSubmachine are mostly the same. Perhaps, we can have a factory method, which returns a new Configuration instance instead of duplicated code. Thoughts? @stevengill

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that having the two is the root cause of the reconnection issue. I've removed the refreshingSubmachine and simplified the logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense! In theory there should be no difference between initially-connecting and reconnecting.

}
// Start heartbeat to keep track of the websocket connection continuing to be alive
this.startPeriodicallySendingPingToSlack();
this.startMonitoringPingFromSlack();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This used to be heartbeat(). I've renamed the method to be more specific.

this.badConnection = false;
}
// Start heartbeat to keep track of the websocket connection continuing to be alive
this.startPeriodicallySendingPingToSlack();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the newly added one.

this.logger.debug('Tearing down WebSocket connections finished');
}

private startPeriodicallySendingPingToSlack(): void {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is the newly added validation logic.

private heartbeat(): void {
if (this.pingTimeout !== undefined) {
clearTimeout(this.pingTimeout);
private startMonitoringPingFromSlack(): void {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@seratch seratch removed the request for review from mwbrooks April 13, 2022 11:17
@seratch
Copy link
Member Author

seratch commented Apr 13, 2022

It seems that the new validation logic does not work in all the possible patterns. I will do more testing and improve the logic tomorrow .


if (this.clientOptions.retryConfig === undefined) {
// For faster retries of apps.connections.open API calls for reconnecting
this.clientOptions.retryConfig = { retries: 100, factor: 1.3 };
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved the interval to run the WSS URL fetching API calls.

});
})
.on('server disconnect warning')
.transitionTo('connecting').withCondition(() => this.autoReconnectEnabled)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from refreshing-connection to connecting. The whole logic is much simpler now.

// currently have two websockets, so tear down the older one
this.websocket.removeAllListeners('open');
this.websocket.removeAllListeners('close');
this.websocket.removeAllListeners('error');
this.websocket.removeAllListeners('message');
this.websocket.close();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old connection needs to be closed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo good catch

// only one websocket to tear down
this.websocket.removeAllListeners('open');
this.websocket.removeAllListeners('close');
this.websocket.removeAllListeners('error');
this.websocket.removeAllListeners('message');
this.websocket.close();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above

@seratch
Copy link
Member Author

seratch commented Apr 13, 2022

The current revision still does not work in some scenarios. I will check it in detail tomorrow

Copy link
Contributor

@filmaj filmaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! The removal of reconnecting logic and consolidating around a single connecting state machine is a great improvement.

I tested by running the simple.js example you provided, seeing the ping/pongs from both the client and the server side, disconnecting my wifi, and after some time reconnecting my wifi. It seemed to work well.

Just for my own education: when the client receives a ping from the server, the client does not respond with a pong - is that correct? Or am I missing something? When watching the debug logs in the simple example, I noticed that for the client-initiated heartbeat, we log out both pings and pongs, but for the server-initiated heartbeat, we only log the pings. Is that by design?

packages/socket-mode/src/SocketModeClient.ts Outdated Show resolved Hide resolved

/* eslint-disable @typescript-eslint/indent, newline-per-chained-call */

private connectingSubmachine: Configuration<string, string> = Finity.configure()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a lot of sense! In theory there should be no difference between initially-connecting and reconnecting.

packages/socket-mode/src/SocketModeClient.ts Outdated Show resolved Hide resolved
packages/socket-mode/src/SocketModeClient.ts Outdated Show resolved Hide resolved
// currently have two websockets, so tear down the older one
this.websocket.removeAllListeners('open');
this.websocket.removeAllListeners('close');
this.websocket.removeAllListeners('error');
this.websocket.removeAllListeners('message');
this.websocket.close();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooo good catch

@seratch
Copy link
Member Author

seratch commented Apr 14, 2022

@stevengill @filmaj
I've moved some unrelated refactoring changes to the following smaller PRs:

Once we merge these, I will rebase this PR again. Also, today I've applied a bunch of changes and now I don't see any issues with the latest code here. If you have time tomorrow, can you check the code for me? Thanks for your time!

Copy link
Contributor

@filmaj filmaj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good, and thanks for splitting the PRs out - that made reviewing a lot easier. I did leave one question around a specific debug log message you added that I can't seem to reproduce when testing locally... here's the relevant log output during my test, shortly before, during and after I disconnected and reconnected my wifi from my laptop:

[DEBUG]  socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966865067))
[DEBUG]  socket-mode:SocketModeClient:0 Received ping from Slack server (data: Ping from applink-78ccf745b6-56dzm)
[DEBUG]  socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966866733)
[DEBUG]  socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966868399)
[DEBUG]  socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966870067)
[DEBUG]  socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966871733)
[INFO]  socket-mode:SocketModeClient:0 A pong wasn't received from the server before the timeout of 5000ms!
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connected:reconnecting
[DEBUG]  socket-mode:SocketModeClient:0 Sent ping to Slack: Ping from client (1649966873400)
[INFO]  socket-mode:SocketModeClient:0 A pong wasn't received from the server before the timeout of 5000ms!
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: reconnecting
[DEBUG]  socket-mode:SocketModeClient:0 Since only the primary WebSocket exists, going to tear it down...
[DEBUG]  socket-mode:SocketModeClient:0 Tearing down WebSocket connections finished
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticating
[DEBUG]  socket-mode:SocketModeClient:0 Going to retrieve a new WSS URL ...
[DEBUG]  web-api:WebClient:0 apiCall('apps.connections.open') start
[DEBUG]  web-api:WebClient:0 will perform http request
[WARN]  web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG]  web-api:WebClient:0 will perform http request
[WARN]  web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG]  web-api:WebClient:0 will perform http request
[WARN]  web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG]  web-api:WebClient:0 will perform http request
[WARN]  web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG]  web-api:WebClient:0 will perform http request
[WARN]  web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com
[DEBUG]  web-api:WebClient:0 will perform http request
[DEBUG]  web-api:WebClient:0 http response received
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:authenticated
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connecting:handshaking
[DEBUG]  socket-mode:SocketModeClient:0 received a message on the WebSocket: {"type":"hello","num_connections":2,"debug_info":{"host":"applink-78ccf745b6-vxtsq","build_number":3,"approximate_connection_time":18060},"connection_info":{"app_id":"A02DH5S1NM6"}}
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connected
[DEBUG]  socket-mode:SocketModeClient:0 Transitioning to state: connected:ready
[DEBUG]  socket-mode:SocketModeClient:0 Received ping from Slack server (data: Ping from applink-78ccf745b6-56dzm)
[DEBUG]  socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966866733))
[DEBUG]  socket-mode:SocketModeClient:0 Received pong from Slack server (data: Ping from client (1649966868399))

I was expecting to see a DEBUG log for "Reconnecting to Slack..." but it's not there... not sure why?

@@ -20,11 +20,40 @@ import { UnrecoverableSocketModeStartError } from './UnrecoverableSocketModeStar

const packageJson = require('../package.json'); // eslint-disable-line import/no-commonjs, @typescript-eslint/no-var-requires

// These enum values are used only in the state machine
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this 👍

.state('reconnecting')
.state(State.Reconnecting)
.onEnter(() => {
this.logger.debug('Reconnecting to Slack ...');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good but is it essentially a duplicate of the global state machine's generic onStateEnter event logger at the very top of this state machine's definition? In fact, after running the simple example you provided, this log does not come up at all...?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't see this log, the SocketModeClient code might be a bit older one. I've added this log mainly for debugging. We can remove it if it sounds duplicated with other logs.

Copy link
Member

@mwbrooks mwbrooks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for splitting the PR. I imagine it greatly improved the readability!

I tested this locally with simple.js and temporarily disconnected my wi-fi. Works great! I think this is a huge improvement for those facing the socket mode disconnected issue. 🎉 If we have missed a use-case, then we can continue to refine it later.

@seratch
Copy link
Member Author

seratch commented Apr 14, 2022

@filmaj Can you run link.sh again? The logs shows a bit older behavior.

.state('reconnecting')
.state(State.Reconnecting)
.onEnter(() => {
this.logger.debug('Reconnecting to Slack ...');
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't see this log, the SocketModeClient code might be a bit older one. I've added this log mainly for debugging. We can remove it if it sounds duplicated with other logs.

this.websocket.removeAllListeners('open');
this.websocket.removeAllListeners('close');
this.websocket.removeAllListeners('error');
this.websocket.removeAllListeners('message');
this.websocket.close();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably we can improve this switching logic to be a bit safer.

  • have a local variable for the old connection and assign the existing connection to it
  • switch this.websocekt to the new connection
  • setTimeout
    • remove the listeners from the old connection
    • close the old one

@seratch
Copy link
Member Author

seratch commented Apr 14, 2022

@filmaj

[DEBUG] socket-mode:SocketModeClient:0 Transitioning to state: connected:reconnecting

In the latest revision, I've removed this state "connected:reconnecting". Having two "reconnecting" state in the state machine was an issue in the code yesterday

seratch and others added 2 commits April 15, 2022 08:30
Co-authored-by: Fil Maj <maj.fil@gmail.com>
@seratch seratch merged commit 65744d2 into slackapi:main Apr 15, 2022
@seratch seratch deleted the issue-1243-socket-mode branch April 15, 2022 04:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented pkg:socket-mode applies to `@slack/socket-mode`
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Socket mode: Bolt doesn't handle WebSocket disconnections gracefully
3 participants