
fix: handle Presence list call timeout without crashing channel #360

Merged: 1 commit into rc-dev from fix/channel-presence-exit on Dec 1, 2022

Conversation

@w3b6x9 (Member) commented Nov 30, 2022

What kind of change does this PR introduce?

Bug fix

What is the current behavior?

The Phoenix Presence shard can be overwhelmed if many channels on the same topic call Presence.list at or around the same time. This leads to shard timeout errors after 5 seconds, which cause the channel to crash.

What is the new behavior?

The channel will check whether the topic is currently tracking any presences by reading the ETS table directly. If there are presences, it reverts back to the Presence.list call because only the shard state has access to down replicas.
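A minimal sketch of that flow (not the PR's exact diff; `table` and `MyApp.Presence` are hypothetical stand-ins, and the assumption that tracker entries are keyed by `{topic, pid, key}` as in Phoenix.Tracker.State is mine):

```elixir
# Hedged sketch, not the PR diff. Assumes the shard's ETS table stores
# entries keyed by {topic, pid, key}; `table` and MyApp.Presence are
# hypothetical stand-ins for the real handles in the app.
def list_presences(table, topic) do
  # Cheap read-only check: is anything tracked for this topic at all?
  case :ets.select_count(table, [{{{topic, :_, :_}, :_, :_}, [], [true]}]) do
    # Nothing tracked: skip the GenServer.call into the shard entirely.
    0 -> %{}
    # Presences exist: go through Presence.list/1, since only the shard
    # state knows about down replicas.
    _ -> MyApp.Presence.list(topic)
  end
end
```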

@abc3 marked this pull request as ready for review on November 30, 2022 17:46
@chasers (Contributor) commented Nov 30, 2022

idk do we even know for sure this is an issue? Presence has Shards for a reason. I'd rather see some instrumentation around this call before we up and bypass a bunch of shit in the lib.

Also, what if they change the underlying ETS structure and this breaks on a version update?

@w3b6x9 force-pushed the fix/channel-presence-exit branch 2 times, most recently from cd37848 to 887f1ad on November 30, 2022 23:58
@w3b6x9 (Member, Author) commented Dec 1, 2022

> idk do we even know for sure this is an issue?
> I'd rather see some instrumentation around this call before we up and bypass a bunch of shit in the lib.

This is 100% an issue. See logs from this month. The same developers keep reaching out about how they are having repeated Realtime issues and this has been the culprit for them.

> Presence has Shards for a reason.

Right, but if channels are all trying to sync/track Presence over a single topic, then it all goes through a single shard process. Additional shards won't help in that scenario: all channels listening to a given topic will be routed to the same shard process on that node.
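A rough illustration of that routing (assuming the phash2-style topic-to-shard mapping used by Phoenix.Tracker; not an excerpt from the library):

```elixir
# Illustration only: with hash-based routing, every channel on the same topic
# lands on the same shard, so extra shards don't spread a single hot topic.
pool_size = 4
Enum.map(1..3, fn _ -> :erlang.phash2("realtime:some_topic", pool_size) end)
#=> e.g. [2, 2, 2] — the same shard number every time
```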

> Also, what if they change the underlying ETS structure and this breaks on a version update?

Well, the Tracker state ETS table hasn't changed in 6 years, but I've updated this PR to account for that case so it reverts back to calling Presence.list. See https://github.com/supabase/realtime/pull/360/files#diff-2a3a48674821fef237a3451af5d33ca977ee5c7d2f6bc8ca18bd7f18b44a7a26R639 and https://github.com/supabase/realtime/pull/360/files#diff-2a3a48674821fef237a3451af5d33ca977ee5c7d2f6bc8ca18bd7f18b44a7a26R241-R242.
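As a hypothetical sketch of that safety net (not the PR's code), the direct ETS read can be guarded so that a failed read simply degrades to the regular Presence.list path:

```elixir
# Hypothetical sketch: if the raw ETS read raises (e.g. the table is gone or
# renamed in a future phoenix_pubsub release), treat the state as unknown and
# fall back to the regular shard call.
defp tracked?(table, topic) do
  :ets.select_count(table, [{{{topic, :_, :_}, :_, :_}, [], [true]}]) > 0
rescue
  ArgumentError -> true # unknown: be conservative and let Presence.list run
end
```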

@w3b6x9 requested review from abc3 and chasers on December 1, 2022 21:29
@chasers (Contributor) commented Dec 1, 2022

Hmm, yeah, that happened when SIN had connection issues.

Did you see the dirty_list?

https://github.com/phoenixframework/phoenix_pubsub/blob/ca2b47c8cf31324b0bf96cea862058f783a3e7bd/lib/phoenix/tracker/shard.ex#L78
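For reference, a hedged sketch of what calling the linked function might look like; the shard-name lookup is a hypothetical helper, and `dirty_list/2`'s signature and behavior are read off the linked source rather than a documented public API:

```elixir
# Assumed from the linked shard.ex source (internal, undocumented API):
# dirty_list/2 presumably reads presence state straight from ETS instead of
# going through a GenServer.call, sidestepping the 5s call timeout.
shard = shard_name_for(MyApp.Presence, topic) # hypothetical lookup helper
Phoenix.Tracker.Shard.dirty_list(shard, topic)
```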

@chasers (Contributor) commented Dec 1, 2022

This is all the same stuff I did when trying to use Tracker to track rate limits across the cluster. Could never get it working.

@w3b6x9 merged commit 02a8266 into rc-dev on Dec 1, 2022
@w3b6x9 deleted the fix/channel-presence-exit branch on December 1, 2022 22:58
w3b6x9 pushed a commit that referenced this pull request on Dec 22, 2022: "Pass JWT errors up so we can log them"