Skip to content

Add ability to detect a half-open connection#50039

Open
monorkin wants to merge 1 commit into
rails:mainfrom
monorkin:add-pong-response-to-heartbeat-ping-messages
Open

Add ability to detect a half-open connection#50039
monorkin wants to merge 1 commit into
rails:mainfrom
monorkin:add-pong-response-to-heartbeat-ping-messages

Conversation

@monorkin

@monorkin monorkin commented Nov 13, 2023

Copy link
Copy Markdown
Contributor

Motivation / Background

I have IoT devices connecting to my Rails app via ActionCable, and I want to log when a device connects or disconnects.

This is already doable via Cahnnel#subscribed and Channel#unsubscribed; and it works pretty well when a device properly connects, subscribes, and disconnects.

But when a device improperly disconnects - e.g. if it gets unplugged from the network - its ActionCable connection doesn't close and can linger in the system for up to half an hour.

This means that the server is using up resources to manage this connection, and send messages to it, when there isn't a client to receive them (since a client will reconnect if it misses 2 heartbeats which is about 6 sec). Some of these issues have been commented on in #45112 and #24908

Currently, ActionCable doesn't have a mechanism to detect if its connection is half-open (if there is a client listening on the other end).

This PR changes the protocol so that a client has to respond to a PING message with a PONG message. If it doesn't do so within 2 heartbeats (6 sec) the connection is assumed to be half-open and is closed.

Fixes #45112
Fixes #24908
Fixes #29307
Fixes #32828

Detail

Why not client-initiated heartbeats?

I've considered implementing client-initiated heartbeats (and did a proof of concept for it) as was suggested in #45112 and in #24908, but opted for PONG responses because Chrome recently added a feature that throttles setTimeout and setInterval to run at most once a minute if a tab isn't in focus.

This throttling would cause clients to reconnect loop if the half-open threshold is set to below 1 min. And heartbeats with intervals greater than 1 min aren't always desirable, like in my case where I want to know that a device went offline within seconds. Chrome's throttling also prompted socket.io to change it's heartbeat to be server-initated.

Backwards and forwards compatibility

To ensure that clients that don't know about PONG messages can still work with servers that do, and for clients that do know about PONG messages to work with servers that don't; I've created a minor revision of the protocol - actioncable-v1.1-json.

If a client sends that in a Sec-WebSocket-Protocol header to the server and the server responds with actioncable-v1.1-json both the client and server know that they should send and expect PONG messages.

If the client doesn't send Sec-WebSocket-Protocol: actioncable-v1.1-json or the server doesn't respond with Sec-WebSocket-Protocol: actioncable-v1.1-json then the behavior from actioncable-v1-json is used.

Why is this logic on the connection object instead of the stream object?

In #45112 there was a suggestion to implement this as a timeout in the event-loop / select-loop.
From my understanding implementing it that way wouldn't be that much different from the implementation in this PR.
There would be some kind of last_message_read_at timestamp that's checked in each iteration, and if a threshold is exceeded the socket would be closed.

I implemented everything on the connection object because doimainwise it makes sense to me that a connection keeps the heartbeat and responds to it.

Though I can see the benefits of having everything in the event loop. I have no strong opinion on this.

What about dropped message detection from #24908?

I've opted not to tackle that in this PR as that seems like it could become a mini-framework within ActionCable.

What about client-side latency measurements from #24908?

I've opted not to include that in this PR. But this could probably be done by adding an echo type message (or repurposing ping) that just echos back what the client sends to the server.

What about client-set heartbeat frequency from #24908?

I've opted not to include that in this PR as it's an unrelated feature to this one.

Why not native WebSocket PING and PONG messages like #32828?

It currently isn't possible to process native PING and PONG messages in browsers, so an additional sub-protocol level heartbeat would be needed for JS clients in browsers to be able to detect that they have been disconnected. Therefore it makes more sense to have just the one sub-protocol level heartbeat for everything.

What about the changelog?

The CHANGELOG file contains a link to the changes in Rails 7.1. I'm not sure what to do.

Checklist

Before submitting the PR make sure the following are checked:

  • This Pull Request is related to one change. Changes that are unrelated should be opened in separate PRs.
  • Commit message has a detailed description of what changed and why. If this PR fixes a related issue include it in the commit message. Ex: [Fix #issue-number]
  • Tests are added or updated if you fix a bug or add a feature.
  • CHANGELOG files are updated for the changed libraries if there is a behavior change or additional feature. Minor bug fixes and documentation changes should not be included.

Comment on lines +308 to +315
if latency.positive?
ActiveSupport::Notifications.instrument(
"connection.latency",
value: latency,
connection_identifier: connection_identifier,
identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") }
)
end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm on the fence about this, maybe it would be better to expose this in some other way to the app?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave it for another PR. Probably, adding a Connection class API to process pongs would work:

def handle_pong(payload)
   pong(payload) if respond_to?(:pong)
end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll leave this as is for now then.

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
@brunoprietog

Copy link
Copy Markdown
Contributor

Great! For the changelog just put the change at the beginning of the file. That line that exists now is good.

*   Something

    *Stanko Krtalić*

@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from b8c88fc to f30422a Compare November 15, 2023 06:29
@monorkin

monorkin commented Nov 15, 2023

Copy link
Copy Markdown
Contributor Author

@brunoprietog thank you for the instructions.
I've updated the changelog.

monorkin added a commit to monorkin/actioncable-presence-detection-demo that referenced this pull request Nov 15, 2023
@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch 3 times, most recently from 3e0e72f to c476b1f Compare November 15, 2023 11:08
Comment thread actioncable/CHANGELOG.md Outdated
Comment thread actioncable/CHANGELOG.md Outdated
@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from c476b1f to f34ec2a Compare November 21, 2023 14:57

@palkan palkan left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR 👍

Thanks for such a detailed description and references to previous discussions.

Left some minor comments/suggestions.

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment on lines +308 to +315
if latency.positive?
ActiveSupport::Notifications.instrument(
"connection.latency",
value: latency,
connection_identifier: connection_identifier,
identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") }
)
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave it for another PR. Probably, adding a Connection class API to process pongs would work:

def handle_pong(payload)
   pong(payload) if respond_to?(:pong)
end

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
end

def expects_pong_response_to_heartbeat?
protocol&.start_with?("actioncable-v1.1-")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this into the initializer and store as @expects_pongs or similar

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the logic to set @expects_pongs in #handle_open just after @protocol is set.

It seems to be safe so set both @protocol and @expects_pongs in the initializer with the latest version of faye-websocket-driver. I assume that there is a good reason why @protocol is set in #handle_open.

Until I figure out why it isn't set in the initialzer I'd like to initialize @expects_pongs there.

},
"default_mount_path": "/cable",
"protocols": [
"actioncable-v1.1-json",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would stick to dashes: actioncable-v1-1-json. The mix of both dashes and dots looks a bit odd to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that at first it looks odd, but I went back and forth on this and landed on v1.1.
v1.1 is, to me, clearer at communicating that this is a minor revision of the protocol than v1-1.
And I like that the dashes still split the string into protocol_name, version and encoding.

@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from f34ec2a to 5ebb42c Compare December 8, 2023 11:13

@monorkin monorkin left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback @palkan
I've updated the PR and responded to some of your comments.
And sorry for the disjointed comment threads, I force pushed before commenting.

},
"default_mount_path": "/cable",
"protocols": [
"actioncable-v1.1-json",

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that at first it looks odd, but I went back and forth on this and landed on v1.1.
v1.1 is, to me, clearer at communicating that this is a minor revision of the protocol than v1-1.
And I like that the dashes still split the string into protocol_name, version and encoding.

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
end

def expects_pong_response_to_heartbeat?
protocol&.start_with?("actioncable-v1.1-")

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the logic to set @expects_pongs in #handle_open just after @protocol is set.

It seems to be safe so set both @protocol and @expects_pongs in the initializer with the latest version of faye-websocket-driver. I assume that there is a good reason why @protocol is set in #handle_open.

Until I figure out why it isn't set in the initialzer I'd like to initialize @expects_pongs there.

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment on lines +308 to +315
if latency.positive?
ActiveSupport::Notifications.instrument(
"connection.latency",
value: latency,
connection_identifier: connection_identifier,
identifiers: identifiers.index_with { |id| instance_variable_get("@#{id}") }
)
end

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I'll leave this as is for now then.

Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
Comment thread actioncable/lib/action_cable/connection/base.rb Outdated
@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch 2 times, most recently from 1153654 to 1c3af4c Compare December 8, 2023 13:38

@ready_state = CLOSING unless @ready_state == CLOSED
@driver.close(reason, code)
begin_close(reason, code) if force && (@driver.state != :closed)

@monorkin monorkin Dec 8, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to add a way to force the connection to close.

When the WebSocket version is Hybi, the driver's #close behaviour changes such that it never emits a close event.

Because of this the connection is never closed and just lingers:

13:24:46 web.1  | I, [2023-12-08T13:24:46.507291 #440625]  INFO -- : [LATENCY] [device:Device-586795065] 0.008601ms                                                                                                                                                                 
13:24:49 web.1  | I, [2023-12-08T13:24:49.505572 #440625]  INFO -- : [LATENCY] [device:Device-586795065] 0.006799ms                                                                                                                                                                 
13:24:58 web.1  | I, [2023-12-08T13:24:58.499496 #440625]  INFO -- : MISSED PONG                                                                                                                                                                                                    
13:24:58 web.1  | I, [2023-12-08T13:24:58.499582 #440625]  INFO -- : Called Connection#close with reason: heartbeat_timeout and reconnect: true                                                                                                                                     
13:24:58 web.1  | I, [2023-12-08T13:24:58.499659 #440625]  INFO -- : Called WebSocket#close                                                                                                                                                                                         
13:24:58 web.1  | I, [2023-12-08T13:24:58.502117 #440625]  INFO -- : Called SocketClient#close
13:25:01 web.1  | I, [2023-12-08T13:25:01.499804 #440625]  INFO -- : MISSED PONG
13:25:01 web.1  | I, [2023-12-08T13:25:01.499849 #440625]  INFO -- : Called Connection#close with reason: heartbeat_timeout and reconnect: true
13:25:01 web.1  | I, [2023-12-08T13:25:01.499926 #440625]  INFO -- : Called WebSocket#close
13:25:01 web.1  | I, [2023-12-08T13:25:01.501111 #440625]  INFO -- : Called SocketClient#close

When force is set to true, if the driver didn't close the connection immediately, #begin_close will be called which will then remove the connection from the server and do all necessary cleanup - effectively forcing the connection to close.

The Driver implements three versions of the WebSocket protocol: Hybi, Draft76 & Draft75
Draft76 & Draft75 immediately change the state to :closed. Only Hybi changes it to :closing. So the force argument is only useful for Hybi WebSockets.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to use #shutdown: https://github.com/faye/websocket-driver-ruby/blob/58d4ed742749dfb3b4f568df36a5365088361220/lib/websocket/driver/hybi.rb#L258 ?

Invoking event callback directly doesn't seem right.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into that, but calling a private method also doesn't seem right.
I'd rather invoke the callback directly, or open up a PR against the driver to add the ability to force close a connection.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling a private method also doesn't seem right.

Ah, it's private; good point.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The people over at faye-websocket-dirver gave me a very nice explanation as to what's going on here.

The TL;DR - The Hybi driver implements the final spec of WebSockets, which states that the client has to respond to a close message and therefore there is no way to force a connection closed except to close the socket and trigger any callbacks yourself. Since they just implement the protocol and don't work with sockets threads timeouts or anything similar there is nothing that they can do to force a connection to close.

They also stressed that state shouldn't be used. Even though it's a public method, it's only intended to be used in their test suite and is intentionally undocumented. Because of that I'll remove the state check when force is true.

Given what we know now I think that triggering the callback manually seems to be the best option in this particular case.


@ready_state = CLOSING unless @ready_state == CLOSED
@driver.close(reason, code)
begin_close(reason, code) if force && (@driver.state != :closed)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it would be better to use #shutdown: https://github.com/faye/websocket-driver-ruby/blob/58d4ed742749dfb3b4f568df36a5365088361220/lib/websocket/driver/hybi.rb#L258 ?

Invoking event callback directly doesn't seem right.


def handle_open
@protocol = websocket.protocol
@expects_pong = @protocol&.starts_with?("actioncable-v1.1-")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract @protocol&.starts_with?("actioncable-v1.1-") to a helper method which we may update in the future (to add more protocols) without touching this method.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've turned @expects_pong into expects_pong? which memorizes the value

@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from 1c3af4c to 9f759e7 Compare January 4, 2024 10:15
latency = Time.now.to_f - payload["message"].to_f

if latency.positive?
ActiveSupport::Notifications.instrument(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this instrumentation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't. It's a way to expose the measured latency to the app so that it can be processed.

E.g. I track the latency of each Device and mark ones that have a latency of 1s or more as problematic for another team to look into.

Do you think it would be worth exposing the latency via a dedicated interface or do you think it shouldn't be exposed at all?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just trying to understand the goal. Maybe we should document this new instrumentation in the guide. https://guides.rubyonrails.org/active_support_instrumentation.html#action-cable

@monorkin monorkin Jan 6, 2024

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didn't know these were documented, thank you for pointing that out.
I've updated the guide and changed the instrumentation name to be in line with the other action cable instrumentations.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, latency instrumentation is out of the scope of this feature.

Also, we already have the Server#open_connections_statistics and Connection#statistics to keep track (and poll for) connection metrics. Probably, we should add latency there and let the application decide on how to use it instead of triggering tons of AS events.

@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from 1dc5637 to fc2f543 Compare January 6, 2024 12:43
@rails-bot rails-bot Bot added the docs label Jan 6, 2024
@bensheldon

Copy link
Copy Markdown
Contributor

Would this PR benefit from some help? This is exactly what I want 🙇

@monorkin

monorkin commented Sep 5, 2025

Copy link
Copy Markdown
Contributor Author

I think this just got stuck in review.
I have the bandwidth to update it and bring it to a mergable state again.

@DRBragg

DRBragg commented Jan 6, 2026

Copy link
Copy Markdown
Contributor

Any update on this? Running into something like this at work right now. Happy to help if needed

ActionCable doesn't have a mechanism to detect if its connection is
half-open (if there isn't a client listening on the other end).

A half-open connection can linger around for up to half an hour using
up system resources.

This patch adds a new revision of the protocol - actioncable-v1.1-json -
that instructs the client to respond to the server's heartbeat PING
messages with a PONG message. If a PONG message isn't received by the
server within two heartbeats the connections is assumed to be half-open
and is closed.

Fixes rails#45112
Fixes rails#24908
Fixes rails#29307
Fixes rails#32828
@monorkin monorkin force-pushed the add-pong-response-to-heartbeat-ping-messages branch from ef969c4 to 7328303 Compare May 6, 2026 14:43
@monorkin

monorkin commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Rebased to main and updated the code a bit. This is in a mergeable state again.

Comment on lines +130 to +131
def close(reason: nil, reconnect: true, force: false)
unless force

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only use force: true when we disconnect a slow client (not sending pongs on time), why do we skip the disconnect message? (And it becomes useless, I mean, the constant in the INTERNAL).

Missing pongs happen not only when the client is unreachable, but also in case of slow network or the client not catching up with the number of messages to consume/send. Sending the disconnect message is important in that case.

Comment on lines +161 to +167
unless expects_pong?
return transmit(type: ActionCable::INTERNAL[:message_types][:ping], message: Time.now.to_i)
end

if connection_idle?
return close(reason: ::ActionCable::INTERNAL[:disconnect_reasons][:heartbeat_timeout], force: true)
end

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we combine these two clauses? LIke, if missing_pong? (which is false for non-ponging clients)


@ready_state = CLOSING unless @ready_state == CLOSED
@driver.close(reason, code)
begin_close(reason, code) if force

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling @driver.close(...) emits the close event right away (if the socket in the closeable state—so, the driver is just a state machine), we don't need to call this manually (and, thus, force is not needed here, too).

latency = Time.now.to_f - payload["message"].to_f

if latency.positive?
ActiveSupport::Notifications.instrument(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, latency instrumentation is out of the scope of this feature.

Also, we already have the Server#open_connections_statistics and Connection#statistics to keep track (and poll for) connection metrics. Probably, we should add latency there and let the application decide on how to use it instead of triggering tons of AS events.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

7 participants