Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ActionCable: Sporadically not receiving confirmation for (re)subscription to same channel #38668

Closed
stevschmid opened this issue Mar 6, 2020 · 15 comments · Fixed by #41581
Closed
Assignees
Labels

Comments

@stevschmid
Copy link

stevschmid commented Mar 6, 2020

I've been playing around with ActionCable and ran into a strange phenomenon. From time to time, the js triggers depending on action cable would suddenly stop. In my setup, I create a subscription ( consumer.subscriptions.create) in a Stimulus controller during Controller::connect() and unsubscribe (subscription.unsubscribe) in Controller::disconnect(). One of the websocket messages reloads the page (via Turbolinks). Turbolinks replaces parts of the DOM, thus recreating the Stimulus controller, which in turn leads to the resubscription to the same channel.

After a long debugging session, I finally noticed that sometimes the subscription would not get confirmed. Turns out that resubscribing (i.e. unsubscribe quickly followed by subscribe) to the same channel sometimes results in the server not confirming the subscription. Screenshots from Chrome WS Messages in Network tab.

Normal behavior in (confirmation received):

image

Unexpected behavior (no confirmation received):

image

I digged deeper into the ActionCable source, added some debug messages and it seems that in those "subscription losing" cases the two messages get handled by different threads at the same time, and the subscribe message gets handled before the unsubscribe message. This is problematic since

return if subscriptions.key?(id_key)
is preventing the creation of a new subscription, so the subscribe message is practically ignored. Meanwhile, a tiny bit later, the unsubscribe message will remove the only existing subscription.

So this race condition leads to an unexpected "loss of subscription". And it seems that even the ~40ms delay between unsubscribe and subscribe (see previous screenshot) can cause such a loss in unfortunate circumstances. I encountered this on Rails 5.2.4 with the aforementioned Stimulus+Turbolinks setup, and recreated this issue in a fresh new Rails app with 6.0.2.1, using this piece of code to automatically resubscribe until connected() would not get called anymore (i.e. in case of lack of confirmation). Setting the DELAY_RESUBSCRIBE in the Gist example to 0 causes the problem to appear nearly instantly for me. But even with a value of 10, such a loss occurs after a couple of minutes.

image

Culprit seems to be this return, which according to blame has its origin in another race condition addressing PR #26547. Removing this line solves the issue for me, since subscribe simply creates a second subscription for the same channel, whereas the subsequent unsubscribe pops on of those, leaving me with one subscription. So in that case the order of the handling does not matter. I didn't have time yet to dig into PR #26547 and find out why that return is necessary.

Steps to reproduce

unsubscribe and quickly (re)subscribe to the same ActionCable channel. See Gist

Expected behavior

Resubscription is always confirmed

image

Actual behavior

Resubscription is sporadically not getting confirmed

image

System configuration

Rails version: Tested 5.2.4 and 6.0.2.1

Ruby version: ruby 2.6.1p33

@rails-bot
Copy link

rails-bot bot commented Jun 5, 2020

This issue has been automatically marked as stale because it has not been commented on for at least three months.
The resources of the Rails team are limited, and so we are asking for your help.
If you can still reproduce this error on the 6-0-stable branch or on master, please reply with all of the information you have about it in order to keep the issue open.
Thank you for all your contributions.

@rails-bot rails-bot bot added the stale label Jun 5, 2020
@rails-bot rails-bot bot closed this as completed Jun 12, 2020
@xuanchien
Copy link

I got the same issue in one project today (also with Stimulus). I had to work around this by delaying the "create subscription" process a bit using setTimeout (500ms). I am not sure if there is any other better way to do it.

@jonsgreen
Copy link

jonsgreen commented Feb 5, 2021

I am finding this same problem in a new rails 6.1.1 app which is including turbo via hotwire. @stevschmid I am curious if you ever got a satisfying response to this issue?

@stevschmid
Copy link
Author

@jonsgreen No response, unfortunately. In the end we decided to omit websocket/actioncable for our fairly small use case.

@nicastelo
Copy link

Seeing the same on 5.2.4

@diegodelvalle
Copy link

We're having the same problem on rails 5.2.4.4, we're not using hotwire though.

@sergey-koba-mobidev
Copy link

same problem Hotwire rails 6.1

@sergey-koba-mobidev
Copy link

Here is the dirty hack solution for Turbo-rails if someone faces this problem with Hotwire
hotwired/turbo-rails#170

@glennfu
Copy link

glennfu commented May 28, 2021

I can confirm this still happens on Rails 6.0.3. I hope this can get reopened.

@zzak
Copy link
Member

zzak commented May 28, 2021

@glennfu Could you give #41581 a try and see if it fixes your issue?

@glennfu
Copy link

glennfu commented May 29, 2021

I'll definitely give this a try on Monday. On the one hand I like that the solution makes the client more resilient to any connection issue, but I still feel like waiting for half a second and trying again is just a slow bandaid. My real world use-case is experiencing exactly the symptoms this Issue describes, but the behavior is basically just a reload of the page. It sends a request to the server, and the response is a Turbolinks redirect to the same url. I feel like there's got to be a way to more properly reload a page without Actioncable tripping over itself. Like can I synchronously tell Actioncable to tear down, and then do my request, to ensure that the unsubscribe gets properly handled before the page gets reloaded with minimal delay?

@zzak
Copy link
Member

zzak commented May 30, 2021

@glennfu It might help if you could share your use-case in that PR with code examples. 🙏

@glennfu
Copy link

glennfu commented Jun 1, 2021

I have an app up for you: glennfu/cable_test@ebb2e23

Here's the approach:

  1. Load page - connection triggers a job, observe that the job sends the socket a message
  2. Press button to reload page - 5% of the time, subscription will not be re-created and Job will not be enqueued

The PR definitely fixes it. I tested it by grabbing the action_cable.js file from the PR and dropping it directly into my node_modules folder

I'm unsatisfied with this as a best solution because it's just reacting to the problem and retrying rather than preventing the problem. My 3rd commit on that repo you can see an alternate solution: At the time of the redirect, I send the controller an event to force an unsubscribe in an attempt to synchronously unsubscribe before the Turbolinks.visit() to avoid any chance of a race condition. I feel like there should be a better way say "hey ActionCable, it's time to tear down everything, we're about to leave the page" but I can't find the appropriate way to do that. I tried reading through the source but I can't figure out what's needed here. If that can be done, it would absolutely be my preferred solution to my current problem.

Despite me wanting a server-side solution to prevent the problem, this PR is still very worthwhile as there can be other ways a subscription is missed, and retrying and recovering is great. Either a network glitch, or fast tab switching (I've confirmed both Chrome and Safari can break subscriptions if you switch tabs as a page is loading)

Also in the original Issue here it was mentioned "Culprit seems to be this return" - I also tested removing that, however I often end up with 2 subscriptions instead of 0, so while it's definitely related to the problem here, removing it isn't a solution.

Lastly (I know I have a lot here, sorry), you might notice in my Channel I'm calling stop_all_streams. If you comment this out, the message sends grow out of control and never clean up. I've never seen this suggested in any ActionCable guide which makes me think I'm doing something wrong. Am I? Or is this a related bug?

@zzak
Copy link
Member

zzak commented Jun 2, 2021

@glennfu Thank you for this additional context, I need some time to sort things out but please feel free to ping me if you haven't heard back. I will also be keeping a close eye on #41581 if anything changes I will let you know 🙇

@glennfu
Copy link

glennfu commented Jun 8, 2021

Just wanted to follow up here and say I've applied the PR to my production environment where I was logging hundreds of failures to connect per day and now they're pretty much all connecting now. Big win.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants