
[SURE-7122] Excessive WebSocket activity when watching resources with permission by name #41809

Closed
moio opened this issue Jun 9, 2023 · 18 comments · Fixed by rancher/steve#141
Labels: kind/bug Issues that are defects reported by users or that we know have reached a real release

@moio
Contributor

moio commented Jun 9, 2023

Internal reference: SURE-7122

TL;DR

When a user lists a resource in a namespace in which they have permission to see only certain objects, excessive WebSocket activity is generated between the browser and Rancher, because resource version synchronization is lost and retried multiple times per second.

This results in excessive memory usage in the Rancher pod (#41225) and possibly excessive load on the Kubernetes API Server (#41663).

Reproducer setup

  • install Rancher 2.7.4 with default options

  • edit the Rancher deployment, e.g. via:

kubectl edit --namespace cattle-system deployment/rancher

Edit the main container section to add the --debug command line argument (in args) and the CATTLE_WATCH_TIMEOUT_SECONDS=20 environment variable (in env). Both changes make it easier to see symptoms without altering functionality significantly.

The end result should look like the following:

      - args:
        - --http-listen-port=80
        - --https-listen-port=443
        - --add-local=true
        - --debug
        env:
        - name: CATTLE_WATCH_TIMEOUT_SECONDS
          value: "20"
        - name: CATTLE_NAMESPACE
          value: cattle-system
        - name: CATTLE_PEER_SERVICE
          value: rancher

Wait until the deployment settles.

  • create a test user with username testuser:

Users & Authentication -> Users -> Create
Username: testuser
New Password: testusertestusertestuser
Confirm Password: testusertestusertestuser
Click on Create

  • allow the new user to access the local cluster:

Explore Cluster -> local -> Cluster and Project Members -> Cluster Membership -> Add
Select Member: testuser
Cluster Permissions: Member
Click on Create

  • create a namespace with two ConfigMaps in it, and give testuser access to only one of them:
export USER_ID=`kubectl get users -o json | jq --raw-output '.items[] | select(.username=="testuser")| .metadata.name'`

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  annotations:
  labels:
    kubernetes.io/metadata.name: test
  name: test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test1
  namespace: test
data:
  a: "1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test2
  namespace: test
data:
  a: "2"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: test
  name: cm-reader
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["test1"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cm-reader
  namespace: test
subjects:
- kind: User
  name: $USER_ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: cm-reader
  apiGroup: rbac.authorization.k8s.io
EOF
  • Log into Rancher as testuser
  • list ConfigMaps:

Clusters -> local -> More Resources -> Core -> ConfigMaps

Note: at this point only the test1 ConfigMap is expected to be visible, not test2

  • Wait a few minutes

Note: if the test cluster uses self-signed SSL certificates, make sure to use a WebKit-based browser such as Chrome or Safari. As of December 2023, Firefox has a bug that prevents WebSockets from working correctly, so the symptoms will not emerge.

Reproducer outputs

  • Observe logs of the rancher pod for the presence of the string "too old":
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Expected result: up to one line every 20 seconds is produced

Actual result: many lines produced per second

2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
...

Navigating away from the ConfigMap list does not stop the error flow; logging out does.

@richard-cox
Member

I've managed to reproduce this given the instructions above.

TL;DR: the spamming is caused by Rancher not sending the resource.error "too old" message; that's not something the UI can tackle.

In terms of how the UI handles watching resources over sockets and the resource.stop messages from rancher...

  1. rancher -- kube socket closes (roughly every 30 minutes, or in some other specific scenarios)
  2. rancher --> browser: a resource.stop message is sent for every subscribed resource
  3. browser --> rancher: a watch-style message is sent for every subscribed resource. These either succeed because the revision is still valid (a rancher --> browser resource.start message is sent) or fail because it's too old, in which case we carry on with the steps below
  4. rancher --> browser: resource.stop AND resource.error messages are sent for every failed watch with a bad revision. The resource.error contains the "too old" message.
  5. the browser makes an HTTP request to fetch the latest resource for every failed watch
  6. browser --> rancher: a watch-style message is sent with the latest valid revision returned via the HTTP request

After reproducing the error, it seems that rancher is failing to send the resource.error "too old" message at step 4. This means the UI tries to watch with the stale revision... which is rejected... and so on ad nauseam.
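
For illustration, a minimal TypeScript sketch of the client-side flow described in steps 1-6 above (not the actual dashboard code; the message names follow the list above, while the class and helper names are hypothetical):

type SteveMessage = {
  name: 'resource.start' | 'resource.stop' | 'resource.error';
  resourceType: string;
  data?: { error?: string };
};

class ResourceWatcher {
  private revisions = new Map<string, string>(); // last revision sent per resource type

  constructor(
    private send: (msg: object) => void,                                  // WebSocket send
    private fetchLatest: (type: string) => Promise<{ revision: string }>, // HTTP re-fetch (step 5)
  ) {}

  watch(type: string, revision: string) {
    this.revisions.set(type, revision);
    this.send({ resourceType: type, resourceVersion: revision }); // steps 3 and 6: (re)subscribe
  }

  async onMessage(msg: SteveMessage) {
    if (msg.name === 'resource.stop') {
      // step 2: the server dropped the watch; re-subscribe with the revision we already hold
      this.watch(msg.resourceType, this.revisions.get(msg.resourceType) ?? '');
    } else if (msg.name === 'resource.error' && msg.data?.error?.includes('too old')) {
      // steps 4-6: the revision is stale; fetch a fresh revision over HTTP and re-subscribe with it
      const { revision } = await this.fetchLatest(msg.resourceType);
      this.watch(msg.resourceType, revision);
    }
  }
}

In the reproduced case the resource.error branch never runs because the message is never sent, so the resource.stop branch keeps re-subscribing with the same stale revision indefinitely.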

[screenshot]

We'd expect to see something like (ignore the revision 1 used in the re-watch, testing in a borked system)...

[screenshot]

@gaktive
Member

gaktive commented Jul 21, 2023

Need to reproduce this now that 2.7.5 is out.

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@moio
Contributor Author

moio commented Sep 20, 2023

Still relevant

@moio
Contributor Author

moio commented Dec 6, 2023

@richard-cox IIUC in the reproduced case above the loop happens on a Steve endpoint (/v1). How likely is it that the same happens in Norman as well?

@richard-cox
Member

@moio The Norman side should be safe given that it works differently (server side, there is a single rancher--kube watch on all resources, rather than a per-client, per-resource client--rancher watch). The UI does send per-resource watch messages for Norman, though that was an oversight. The mechanism might get used or cleaned up (UI side) as part of rancher/dashboard#7906.
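
A rough sketch of the difference between the two models (illustrative TypeScript only, not Rancher code; all names are made up):

// Norman-style: a single server-side rancher--kube watch per resource type, fanned out
// to all clients. The server owns the resourceVersion, so no individual browser ever
// re-subscribes with a stale revision of its own.
async function fanOutSharedWatch(
  upstream: AsyncIterable<object>,
  clients: Array<(event: object) => void>,
) {
  for await (const event of upstream) {
    clients.forEach((sendToClient) => sendToClient(event));
  }
}

// Steve-style (/v1): every browser session opens its own watch and tracks its own
// resourceVersion, so each session can independently fall into the stale-revision
// re-subscribe loop described in this issue.
function perClientWatch(
  startWatch: (resourceVersion: string) => AsyncIterable<object>,
  clientRevision: string,
) {
  return startWatch(clientRevision);
}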

moio added a commit to rancher/steve that referenced this issue Dec 22, 2023
Align behavior with plain Watch, which also does the same.

Fixes rancher/rancher#41809

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@moio moio changed the title from "Excessive WebSocket activity, memory usage, goroutine leak when watching resources with permission by name" to "Excessive WebSocket activity when watching resources with permission by name" Dec 22, 2023
@moio
Contributor Author

moio commented Dec 22, 2023

@richard-cox rancher/steve#141 does produce the results you suggested.

Without patch:
[screenshot]

With patch:
[screenshot]

@richard-cox
Member

@moio Looks good, the resource.error with error: too old is there for the configmap resource and a new re-sub (resource.start) is sent with a different resource version. There is though a resource.stop for configmap very quickly afterwards, and the ui then resubs with the same resourceVersion as the first time. Does that resource.start --> resource.stop pattern continue further down the request log?

@moio
Contributor Author

moio commented Jan 4, 2024

There is though a resource.stop for configmap very quickly afterwards

Right, at timestamp 21:45:19.611

[...] and the ui then resubs with the same resourceVersion as the first time

Right, but on different resource types (namespaces then navlinks at 21:45:19.589) after their respective separate resource.stop events which happen predictably every 20s (I assume because CATTLE_WATCH_TIMEOUT_SECONDS=20, per reproducer conditions).

The actual next resub of configmaps happens at 21:45:39.640 (about 20s later) and uses the updated resourceVersion (180993).

I reproduced this once again and can confirm that the only WS traffic I see on configmaps occurs about every 20s.

Here already filtered:
[screenshot]

I uploaded the HAR files for further inspection here.

To my understanding, this seems all good - please do correct me if you see anything suspicious.

@moio moio changed the title from "Excessive WebSocket activity when watching resources with permission by name" to "[SURE-7122] Excessive WebSocket activity when watching resources with permission by name" Jan 5, 2024
@richard-cox
Member

richard-cox commented Jan 8, 2024

Sorry, I worded my comment badly.

The original resource.stop/resource.error --> resource.start (with new revision number) process for configmaps looks good, from 21:45:19.605 to 21:45:19.647. My questions were around the resource.stop for configmaps at 21:45:39.639, which is immediately followed by a resub with the previously subscribed resource version. That looks like the old behaviour.

In the patched.har file it's more prominent: lots of the old behaviour (resource.stop --> resource.start with a bad revision), one instance of the new/fixed behaviour, and then a few more of the old behaviour.

Would this be something to do with the way the dev/test environment is set up to trigger the issue? We can go over this tomorrow in our sync-up.

Update: discussed with Silvio. What I thought was the old behaviour was in fact the correct behaviour, given that the socket watch timeout is set to 20s rather than 30m (as stated in the description). The patch results look good!

@moio
Contributor Author

moio commented Jan 10, 2024

User reported success with the v2.7.6-debug-41809-2 image.

@MSpencer87 MSpencer87 self-assigned this Jan 11, 2024
@moio
Contributor Author

moio commented Feb 5, 2024

@richard-cox: @MSpencer87 discovered another side of this bug and I need your help to figure out why it happens.

Assuming the same reproduction instructions above, too old resource version errors are now (as of v2.7.6-debug-41809-2) handled correctly when navigating to the ConfigMap list page. Clicking on an individual ConfigMap to go to the ConfigMap detail page also works correctly.

However, at that point reloading the ConfigMap detail page gets into an infinite loop of too old resource version errors. AFAICS the error is delivered back to the UI, but for some reason resubscribing happens with the same old revision over and over:

[screenshot]

Here is the HAR file and a repro video for you: https://drive.google.com/drive/folders/1-EiXH9x5F9AIsj8YFD23rwUs_LdbsPz6?usp=sharing

Could this be a dashboard problem?

moio added a commit to moio/steve that referenced this issue Mar 1, 2024
Align behavior with plain Watch, which also does the same.

Fixes rancher/rancher#41809

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@richard-cox
Member

richard-cox commented Mar 1, 2024

Investigated, and the fix has revealed a bug in the UI which is reproducible after refreshing an old resource's detail page (probably edit as well - my instance died and I can't confirm).

Normally we fetch a list of resources and watch with the revision from that fetch. If that becomes 'too old', minus a few other bits, we make a fresh fetch to get the latest resources and revision... and then watch with that revision.

However, if we fetch a single resource, we watch with its own revision, which is probably very old. If so, the too old message will prompt the UI to re-fetch the resource to get the latest copy and revision... and then watch with that resource's revision. That revision isn't going to be the latest... so it will probably be too old again. Rinse and repeat.
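
As a rough sketch (assumed shapes and hypothetical helpers, not the dashboard's actual store code) of the two revision choices just described:

interface ListResult { revision: string }                          // collection revision from a list call
interface SingleResult { metadata: { resourceVersion: string } }   // the object's own revision

// List page: watch with the collection revision. When it goes "too old",
// re-listing returns a fresh collection revision, so the re-subscribe converges.
async function watchAfterList(fetchList: () => Promise<ListResult>, watch: (rev: string) => void) {
  const list = await fetchList();
  watch(list.revision);
}

// Detail page (the buggy path): watch with the object's own resourceVersion.
// An object that has not changed recently keeps an old resourceVersion, so
// re-fetching the same object returns the same old revision and the
// "too old" -> re-fetch -> re-watch cycle never converges.
async function watchAfterGet(fetchOne: () => Promise<SingleResult>, watch: (rev: string) => void) {
  const single = await fetchOne();
  watch(single.metadata.resourceVersion);
}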

Bonus fun. The request to fetch the single resource is most probably namespaced... but that part doesn't make it to the request we make... so the URL is broken anyway.

I'll update this comment with the UI GitHub issue: rancher/dashboard#10540

@moio moio modified the milestones: v2.8.x, v2.8-Next1 Mar 1, 2024
@moio moio added the kind/bug label (Issues that are defects reported by users or that we know have reached a real release) and removed the bug label Mar 1, 2024
@moio moio modified the milestones: v2.8-Next1, v2.9-Next1 Mar 1, 2024
@moio
Contributor Author

moio commented Mar 1, 2024

@MSpencer87, given the last comment from @richard-cox, are you OK with considering this done and addressing the remaining problem you found in the follow-up issue rancher/dashboard#10540?

@moio
Contributor Author

moio commented Mar 4, 2024

@MSpencer87 can you report test results here please?

@MSpencer87
Contributor

MSpencer87 commented Mar 8, 2024

I was able to successfully reproduce this on 2.7.4 and verify using v2.7.6-debug-41809-2 that the watch errors surrounding too old resource versions have been resolved.

Steps to reproduce:

  1. Bring up HA Rancher on 2.7.4 on rke1 v1.25.9-rancher2-1
  2. Set up Rancher following the reproducer steps above:
    • Edit the Rancher deployment with --debug and CATTLE_WATCH_TIMEOUT_SECONDS=20
    • Create a new testuser with member permissions on the local cluster
    • Create a namespace with two ConfigMaps, allowing testuser access to only one
  3. Log in as testuser, open the browser console and navigate to view the ConfigMaps:
    Clusters -> local -> More Resources -> Core -> ConfigMaps
  4. Following the logs, watch errors are observed via:
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Steps for validation:

  1. Bring up HA Rancher with rancherImageTag=v2.7.6-debug-41809-2
  2. Set up Rancher following the reproducer steps above:
    • Edit the Rancher deployment with --debug and CATTLE_WATCH_TIMEOUT_SECONDS=20
    • Create a new testuser with member permissions on the local cluster
    • Create a namespace with two ConfigMaps, allowing testuser access to only one
  3. Log in as testuser, open the browser console and navigate to view the ConfigMaps:
    Clusters -> local -> More Resources -> Core -> ConfigMaps
  4. Following the logs, no watch errors are observed via:
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Please advise: opening the browser console on the Network tab, filtering for subscribe?sockId= and viewing Messages with the filter configmap will still produce message spam. This also seems to impact memory usage on the Rancher pod. @moio @richard-cox, can we confirm that the remaining WebSocket message spam is expected behavior to be addressed with rancher/dashboard#10540? If so, this can be closed.

@richard-cox
Member

@MSpencer87 Correct. The fix reveals a UI bug which has been addressed in the versions below. Those issues are open, but the changes are available in -head builds.

2.7 is rancher/dashboard#10567 but blocked on #44656

@git-ival

Retested on 2.8.2 -> 2.8.3 and confirmed the same results as in @MSpencer87's comment. Also encountered the same WebSocket "configmap" traffic approximately every 20 seconds.

@git-ival git-ival self-assigned this Mar 28, 2024