[SURE-7122] Excessive WebSocket activity when watching resources with permission by name #41809
Comments
I've managed to reproduce this given the instructions above. TL;DR: the spamming is caused by the lack of […]. In terms of how the UI handles watching resources over sockets and the […]
After reproducing the error, it seems like Rancher is failing to send the […]. We'd expect to see something like […] (ignore the revision […]).
Need to reproduce this with 2.7.5 out.
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
Still relevant.
@richard-cox IIUC, in the reproduced case above the loop happens on a Steve endpoint (/v1). How likely is it that the same happens in Norman as well?
@moio The Norman side should be safe given that it works differently (server side, a single rancher→kube watch on all resources, rather than a per-client, per-resource rancher watch). The UI does send per-resource watch messages for Norman, though that was an oversight. The mechanism might get used or cleaned up (UI side) as part of rancher/dashboard#7906.
Align behavior with plain Watch, which also does the same. Fixes rancher/rancher#41809 Signed-off-by: Silvio Moioli <silvio@moioli.net>
@richard-cox rancher/steve#141 does produce the results you suggested.
@moio Looks good, the […]
Right, at timestamp 21:45:19.611
Right, but on different resource types ([…]). The actual next resub of […]. I reproduced this once again and can confirm the only WS traffic I see on […]. I uploaded the HAR files for further inspection here. To my understanding, this all seems good - please do correct me if you see anything suspicious.
Sorry, I worded my comment badly. The original resource.stop/resource.error --> resource.start (with new revision number) process for configmaps looks good, so from 21:45:19.605 ending 21:45:19.647. My questions were around the resource.stop for configmaps at 21:45:39.639, which is immediately followed by a resub with the previously subscribed resource version. That looks like the old behaviour. In the […] Would this be something to do with the way the dev/test environment is set up to trigger the issue? We can go over this tomorrow in our sync up.
Update: Discussed with Silvio. What I thought was the old behaviour was in fact the correct behaviour, given the socket watch timeout being set to 20s rather than 30m (as stated in the description). The patch results look good!
User reported success with the […]
@richard-cox: @MSpencer87 discovered another side of this bug and I feel like I need your help to figure out why it happens. Assuming the same reproduction instructions above, […]. However, at that point reloading the ConfigMap detail page gets into an infinite loop of […]. Here is the HAR file and a repro video for you: https://drive.google.com/drive/folders/1-EiXH9x5F9AIsj8YFD23rwUs_LdbsPz6?usp=sharing
Could this be a dashboard problem?
Investigated, and the fix has revealed a bug in the UI which is reproducible after refreshing on an old resource's detail page (probably edit as well - my instance died, so I can't confirm). Normally we fetch a list of resources and watch with the revision from that fetch. If that becomes 'too old', minus a few other bits, we make a fresh fetch to get the latest resources and revision... and then watch with that revision. However, if we fetch a single resource, we watch with its own revision, which is probably very old. If so, the […]
Bonus fun: the request to fetch the single resource is most probably namespaced... but that part doesn't make it to the request we make... so the URL is broken anyway.
@MSpencer87 given the last comment from @richard-cox, are you OK considering this done and addressing the remaining problem you found in the follow-up issue rancher/dashboard#10540?
@MSpencer87 can you report test results here please?
I was able to successfully reproduce on 2.7.4 and verify using v2.7.6-debug-41809-2; the watch errors surrounding […]. Steps to reproduce:
Steps for validation:
Please advise: opening the browser console to the Network tab and filtering for […]
@MSpencer87 Correct. The fix reveals a UI bug which has been addressed in the versions below. Those issues are open, but the changes are available in -head builds.
2.7 is rancher/dashboard#10567 but blocked on #44656
Retested on 2.8.2 -> 2.8.3 and confirmed the same results as @MSpencer87's comment. Also encountered the same WebSocket "configmap" traffic every ~20 seconds.
Internal reference: SURE-7122
TL;DR
When a user lists a resource in a namespace in which they have permission to see only certain elements, excessive WebSocket activity between the browser and Rancher is generated, because resource version synchronization is lost and retried multiple times per second.
This results in excessive memory usage from the Rancher pod (#41225) and possibly excessive load on the Kubernetes API Server (#41663).
Reproducer setup
Install Rancher 2.7.4 with default options.
Edit the Rancher deployment (one way to do this is sketched below). Edit the main container section to add the --debug command line argument (in args) and the CATTLE_WATCH_TIMEOUT_SECONDS=20 environment variable (in env). Both changes make it easier to see symptoms without altering functionality significantly. The end result should look like the following:
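The exact command and manifest are not included here; this is a minimal sketch of the edit, assuming Rancher was installed with the standard Helm chart into the cattle-system namespace (the existing argument shown is illustrative only):

```sh
# Open the Rancher deployment for editing
# (assumes the default cattle-system namespace and deployment name "rancher")
kubectl -n cattle-system edit deployment rancher

# Inside the main "rancher" container spec, the two additions end up roughly as:
#
#   args:
#     - "--http-listen-port=80"               # existing arguments stay untouched (illustrative)
#     - "--debug"                             # added
#   env:
#     - name: CATTLE_WATCH_TIMEOUT_SECONDS    # added
#       value: "20"
```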
Wait until the deployment settles.
Create a user testuser: Users & Authentication -> Users -> Create
Username: testuser
New Password: testusertestusertestuser
Confirm Password: testusertestusertestuser
Click on Create
Grant testuser membership of the local cluster: Explore Cluster -> local -> Cluster and Project Members -> Cluster Membership -> Add
Select Member: testuser
Cluster Permissions: Member
Click on Create
Create two ConfigMaps, test1 and test2, and grant testuser permission by name to only one of them (one possible way to do this is sketched below).
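The manifests for this step are not included here; the following is a hypothetical sketch of one way to set it up, assuming the default namespace and plain Kubernetes RBAC with a subject named testuser. A real Rancher install may bind the role to the local user's generated principal instead, so adapt the subject as needed.

```sh
# Create the two ConfigMaps (contents are irrelevant for the reproducer)
kubectl create configmap test1 --from-literal=key=value
kubectl create configmap test2 --from-literal=key=value

# Grant read access by name to test1 only
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: read-test1-only
  namespace: default
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["test1"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-test1-only
  namespace: default
subjects:
- kind: User
  name: testuser          # hypothetical subject; see the note above
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: read-test1-only
  apiGroup: rbac.authorization.k8s.io
EOF

# Optional sanity check (requires impersonation rights in the test cluster)
kubectl auth can-i get configmaps/test1 --as testuser   # expected: yes
kubectl auth can-i get configmaps/test2 --as testuser   # expected: no
```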
As testuser, navigate to Clusters -> local -> More Resources -> Core -> ConfigMaps.
Note: it is expected that only the test1 ConfigMap should be visible at this point, but not test2.
Note: if the test cluster uses self-signed SSL certificates, make sure a WebKit-based browser such as Chrome or Safari is used. As of December 2023, Firefox has a bug that prevents WebSockets from working correctly in that case, so the symptoms will not emerge.
Reproducer outputs
Check the logs of the rancher pod for the presence of the string "too old" (one way to do this is sketched below).
Expected result: up to one line is produced every 20 seconds.
Actual result: many lines are produced per second.
Navigating away from the ConfigMap list will not stop the error flow; logging out will.
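One possible way to do the log check, assuming Rancher runs as the rancher deployment in the cattle-system namespace:

```sh
# Stream the Rancher pod logs and show only the lines mentioning "too old"
kubectl -n cattle-system logs -f deploy/rancher | grep "too old"
```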