Add ability to detect a half-open connection#50039
Conversation
| if latency.positive? | ||
| ActiveSupport::Notifications.instrument( | ||
| "connection.latency", | ||
| value: latency, | ||
| connection_identifier: connection_identifier, | ||
| identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") } | ||
| ) | ||
| end |
There was a problem hiding this comment.
I'm on the fence about this, maybe it would be better to expose this in some other way to the app?
There was a problem hiding this comment.
I would leave it for another PR. Probably, adding a Connection class API to process pongs would work:
def handle_pong(payload)
pong(payload) if respond_to?(:pong)
endThere was a problem hiding this comment.
Ok. I'll leave this as is for now then.
|
Great! For the changelog just put the change at the beginning of the file. That line that exists now is good. |
b8c88fc to
f30422a
Compare
|
@brunoprietog thank you for the instructions. |
3e0e72f to
c476b1f
Compare
c476b1f to
f34ec2a
Compare
palkan
left a comment
There was a problem hiding this comment.
Great PR 👍
Thanks for such a detailed description and references to previous discussions.
Left some minor comments/suggestions.
| if latency.positive? | ||
| ActiveSupport::Notifications.instrument( | ||
| "connection.latency", | ||
| value: latency, | ||
| connection_identifier: connection_identifier, | ||
| identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") } | ||
| ) | ||
| end |
There was a problem hiding this comment.
I would leave it for another PR. Probably, adding a Connection class API to process pongs would work:
def handle_pong(payload)
pong(payload) if respond_to?(:pong)
end| end | ||
|
|
||
| def expects_pong_response_to_heartbeat? | ||
| protocol&.start_with?("actioncable-v1.1-") |
There was a problem hiding this comment.
We can move this into the initializer and store as @expects_pongs or similar
There was a problem hiding this comment.
I've changed the logic to set @expects_pongs in #handle_open just after @protocol is set.
It seems to be safe so set both @protocol and @expects_pongs in the initializer with the latest version of faye-websocket-driver. I assume that there is a good reason why @protocol is set in #handle_open.
Until I figure out why it isn't set in the initialzer I'd like to initialize @expects_pongs there.
| }, | ||
| "default_mount_path": "/cable", | ||
| "protocols": [ | ||
| "actioncable-v1.1-json", |
There was a problem hiding this comment.
I would stick to dashes: actioncable-v1-1-json. The mix of both dashes and dots looks a bit odd to me.
There was a problem hiding this comment.
I agree that at first it looks odd, but I went back and forth on this and landed on v1.1.
v1.1 is, to me, clearer at communicating that this is a minor revision of the protocol than v1-1.
And I like that the dashes still split the string into protocol_name, version and encoding.
f34ec2a to
5ebb42c
Compare
| }, | ||
| "default_mount_path": "/cable", | ||
| "protocols": [ | ||
| "actioncable-v1.1-json", |
There was a problem hiding this comment.
I agree that at first it looks odd, but I went back and forth on this and landed on v1.1.
v1.1 is, to me, clearer at communicating that this is a minor revision of the protocol than v1-1.
And I like that the dashes still split the string into protocol_name, version and encoding.
| end | ||
|
|
||
| def expects_pong_response_to_heartbeat? | ||
| protocol&.start_with?("actioncable-v1.1-") |
There was a problem hiding this comment.
I've changed the logic to set @expects_pongs in #handle_open just after @protocol is set.
It seems to be safe so set both @protocol and @expects_pongs in the initializer with the latest version of faye-websocket-driver. I assume that there is a good reason why @protocol is set in #handle_open.
Until I figure out why it isn't set in the initialzer I'd like to initialize @expects_pongs there.
| if latency.positive? | ||
| ActiveSupport::Notifications.instrument( | ||
| "connection.latency", | ||
| value: latency, | ||
| connection_identifier: connection_identifier, | ||
| identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") } | ||
| ) | ||
| end |
There was a problem hiding this comment.
Ok. I'll leave this as is for now then.
1153654 to
1c3af4c
Compare
|
|
||
| @ready_state = CLOSING unless @ready_state == CLOSED | ||
| @driver.close(reason, code) | ||
| begin_close(reason, code) if force && (@driver.state != :closed) |
There was a problem hiding this comment.
I had to add a way to force the connection to close.
When the WebSocket version is Hybi, the driver's #close behaviour changes such that it never emits a close event.
Because of this the connection is never closed and just lingers:
13:24:46 web.1 | I, [2023-12-08T13:24:46.507291 #440625] INFO -- : [LATENCY] [device:Device-586795065] 0.008601ms
13:24:49 web.1 | I, [2023-12-08T13:24:49.505572 #440625] INFO -- : [LATENCY] [device:Device-586795065] 0.006799ms
13:24:58 web.1 | I, [2023-12-08T13:24:58.499496 #440625] INFO -- : MISSED PONG
13:24:58 web.1 | I, [2023-12-08T13:24:58.499582 #440625] INFO -- : Called Connection#close with reason: heartbeat_timeout and reconnect: true
13:24:58 web.1 | I, [2023-12-08T13:24:58.499659 #440625] INFO -- : Called WebSocket#close
13:24:58 web.1 | I, [2023-12-08T13:24:58.502117 #440625] INFO -- : Called SocketClient#close
13:25:01 web.1 | I, [2023-12-08T13:25:01.499804 #440625] INFO -- : MISSED PONG
13:25:01 web.1 | I, [2023-12-08T13:25:01.499849 #440625] INFO -- : Called Connection#close with reason: heartbeat_timeout and reconnect: true
13:25:01 web.1 | I, [2023-12-08T13:25:01.499926 #440625] INFO -- : Called WebSocket#close
13:25:01 web.1 | I, [2023-12-08T13:25:01.501111 #440625] INFO -- : Called SocketClient#close
When force is set to true, if the driver didn't close the connection immediately, #begin_close will be called which will then remove the connection from the server and do all necessary cleanup - effectively forcing the connection to close.
The Driver implements three versions of the WebSocket protocol: Hybi, Draft76 & Draft75
Draft76 & Draft75 immediately change the state to :closed. Only Hybi changes it to :closing. So the force argument is only useful for Hybi WebSockets.
There was a problem hiding this comment.
I wonder if it would be better to use #shutdown: https://github.com/faye/websocket-driver-ruby/blob/58d4ed742749dfb3b4f568df36a5365088361220/lib/websocket/driver/hybi.rb#L258 ?
Invoking event callback directly doesn't seem right.
There was a problem hiding this comment.
I looked into that, but calling a private method also doesn't seem right.
I'd rather invoke the callback directly, or open up a PR against the driver to add the ability to force close a connection.
There was a problem hiding this comment.
calling a private method also doesn't seem right.
Ah, it's private; good point.
There was a problem hiding this comment.
The people over at faye-websocket-dirver gave me a very nice explanation as to what's going on here.
The TL;DR - The Hybi driver implements the final spec of WebSockets, which states that the client has to respond to a close message and therefore there is no way to force a connection closed except to close the socket and trigger any callbacks yourself. Since they just implement the protocol and don't work with sockets threads timeouts or anything similar there is nothing that they can do to force a connection to close.
They also stressed that state shouldn't be used. Even though it's a public method, it's only intended to be used in their test suite and is intentionally undocumented. Because of that I'll remove the state check when force is true.
Given what we know now I think that triggering the callback manually seems to be the best option in this particular case.
|
|
||
| @ready_state = CLOSING unless @ready_state == CLOSED | ||
| @driver.close(reason, code) | ||
| begin_close(reason, code) if force && (@driver.state != :closed) |
There was a problem hiding this comment.
I wonder if it would be better to use #shutdown: https://github.com/faye/websocket-driver-ruby/blob/58d4ed742749dfb3b4f568df36a5365088361220/lib/websocket/driver/hybi.rb#L258 ?
Invoking event callback directly doesn't seem right.
|
|
||
| def handle_open | ||
| @protocol = websocket.protocol | ||
| @expects_pong = @protocol&.starts_with?("actioncable-v1.1-") |
There was a problem hiding this comment.
Let's extract @protocol&.starts_with?("actioncable-v1.1-") to a helper method which we may update in the future (to add more protocols) without touching this method.
There was a problem hiding this comment.
I've turned @expects_pong into expects_pong? which memorizes the value
1c3af4c to
9f759e7
Compare
| latency = Time.now.to_f - payload["message"].to_f | ||
|
|
||
| if latency.positive? | ||
| ActiveSupport::Notifications.instrument( |
There was a problem hiding this comment.
Why do we need this instrumentation?
There was a problem hiding this comment.
We don't. It's a way to expose the measured latency to the app so that it can be processed.
E.g. I track the latency of each Device and mark ones that have a latency of 1s or more as problematic for another team to look into.
Do you think it would be worth exposing the latency via a dedicated interface or do you think it shouldn't be exposed at all?
There was a problem hiding this comment.
I'm just trying to understand the goal. Maybe we should document this new instrumentation in the guide. https://guides.rubyonrails.org/active_support_instrumentation.html#action-cable
There was a problem hiding this comment.
Oh, I didn't know these were documented, thank you for pointing that out.
I've updated the guide and changed the instrumentation name to be in line with the other action cable instrumentations.
There was a problem hiding this comment.
IMO, latency instrumentation is out of the scope of this feature.
Also, we already have the Server#open_connections_statistics and Connection#statistics to keep track (and poll for) connection metrics. Probably, we should add latency there and let the application decide on how to use it instead of triggering tons of AS events.
1dc5637 to
fc2f543
Compare
|
Would this PR benefit from some help? This is exactly what I want 🙇 |
|
I think this just got stuck in review. |
|
Any update on this? Running into something like this at work right now. Happy to help if needed |
ActionCable doesn't have a mechanism to detect if its connection is half-open (if there isn't a client listening on the other end). A half-open connection can linger around for up to half an hour using up system resources. This patch adds a new revision of the protocol - actioncable-v1.1-json - that instructs the client to respond to the server's heartbeat PING messages with a PONG message. If a PONG message isn't received by the server within two heartbeats the connections is assumed to be half-open and is closed. Fixes rails#45112 Fixes rails#24908 Fixes rails#29307 Fixes rails#32828
ef969c4 to
7328303
Compare
|
Rebased to main and updated the code a bit. This is in a mergeable state again. |
| def close(reason: nil, reconnect: true, force: false) | ||
| unless force |
There was a problem hiding this comment.
We only use force: true when we disconnect a slow client (not sending pongs on time), why do we skip the disconnect message? (And it becomes useless, I mean, the constant in the INTERNAL).
Missing pongs happen not only when the client is unreachable, but also in case of slow network or the client not catching up with the number of messages to consume/send. Sending the disconnect message is important in that case.
| unless expects_pong? | ||
| return transmit(type: ActionCable::INTERNAL[:message_types][:ping], message: Time.now.to_i) | ||
| end | ||
|
|
||
| if connection_idle? | ||
| return close(reason: ::ActionCable::INTERNAL[:disconnect_reasons][:heartbeat_timeout], force: true) | ||
| end |
There was a problem hiding this comment.
Can't we combine these two clauses? LIke, if missing_pong? (which is false for non-ponging clients)
|
|
||
| @ready_state = CLOSING unless @ready_state == CLOSED | ||
| @driver.close(reason, code) | ||
| begin_close(reason, code) if force |
There was a problem hiding this comment.
Calling @driver.close(...) emits the close event right away (if the socket in the closeable state—so, the driver is just a state machine), we don't need to call this manually (and, thus, force is not needed here, too).
| latency = Time.now.to_f - payload["message"].to_f | ||
|
|
||
| if latency.positive? | ||
| ActiveSupport::Notifications.instrument( |
There was a problem hiding this comment.
IMO, latency instrumentation is out of the scope of this feature.
Also, we already have the Server#open_connections_statistics and Connection#statistics to keep track (and poll for) connection metrics. Probably, we should add latency there and let the application decide on how to use it instead of triggering tons of AS events.
Motivation / Background
I have IoT devices connecting to my Rails app via ActionCable, and I want to log when a device connects or disconnects.
This is already doable via
Cahnnel#subscribedandChannel#unsubscribed; and it works pretty well when a device properly connects, subscribes, and disconnects.But when a device improperly disconnects - e.g. if it gets unplugged from the network - its ActionCable connection doesn't close and can linger in the system for up to half an hour.
This means that the server is using up resources to manage this connection, and send messages to it, when there isn't a client to receive them (since a client will reconnect if it misses 2 heartbeats which is about 6 sec). Some of these issues have been commented on in #45112 and #24908
Currently, ActionCable doesn't have a mechanism to detect if its connection is half-open (if there is a client listening on the other end).
This PR changes the protocol so that a client has to respond to a PING message with a PONG message. If it doesn't do so within 2 heartbeats (6 sec) the connection is assumed to be half-open and is closed.
Fixes #45112
Fixes #24908
Fixes #29307
Fixes #32828
Detail
Why not client-initiated heartbeats?
I've considered implementing client-initiated heartbeats (and did a proof of concept for it) as was suggested in #45112 and in #24908, but opted for PONG responses because Chrome recently added a feature that throttles setTimeout and setInterval to run at most once a minute if a tab isn't in focus.
This throttling would cause clients to reconnect loop if the half-open threshold is set to below 1 min. And heartbeats with intervals greater than 1 min aren't always desirable, like in my case where I want to know that a device went offline within seconds. Chrome's throttling also prompted socket.io to change it's heartbeat to be server-initated.
Backwards and forwards compatibility
To ensure that clients that don't know about PONG messages can still work with servers that do, and for clients that do know about PONG messages to work with servers that don't; I've created a minor revision of the protocol -
actioncable-v1.1-json.If a client sends that in a
Sec-WebSocket-Protocolheader to the server and the server responds withactioncable-v1.1-jsonboth the client and server know that they should send and expect PONG messages.If the client doesn't send
Sec-WebSocket-Protocol: actioncable-v1.1-jsonor the server doesn't respond withSec-WebSocket-Protocol: actioncable-v1.1-jsonthen the behavior fromactioncable-v1-jsonis used.Why is this logic on the connection object instead of the stream object?
In #45112 there was a suggestion to implement this as a timeout in the event-loop / select-loop.
From my understanding implementing it that way wouldn't be that much different from the implementation in this PR.
There would be some kind of last_message_read_at timestamp that's checked in each iteration, and if a threshold is exceeded the socket would be closed.
I implemented everything on the connection object because doimainwise it makes sense to me that a connection keeps the heartbeat and responds to it.
Though I can see the benefits of having everything in the event loop. I have no strong opinion on this.
What about dropped message detection from #24908?
I've opted not to tackle that in this PR as that seems like it could become a mini-framework within ActionCable.
What about client-side latency measurements from #24908?
I've opted not to include that in this PR. But this could probably be done by adding an
echotype message (or repurposingping) that just echos back what the client sends to the server.What about client-set heartbeat frequency from #24908?
I've opted not to include that in this PR as it's an unrelated feature to this one.
Why not native WebSocket PING and PONG messages like #32828?
It currently isn't possible to process native PING and PONG messages in browsers, so an additional sub-protocol level heartbeat would be needed for JS clients in browsers to be able to detect that they have been disconnected. Therefore it makes more sense to have just the one sub-protocol level heartbeat for everything.
What about the changelog?
The CHANGELOG file contains a link to the changes in Rails 7.1. I'm not sure what to do.Checklist
Before submitting the PR make sure the following are checked:
[Fix #issue-number]