
[SURE-7122] Excessive WebSocket activity when watching resources with permission by name #41809

Closed
moio opened this issue Jun 9, 2023 · 18 comments · Fixed by rancher/steve#141
Labels: kind/bug Issues that are defects reported by users or that we know have reached a real release

@moio
Contributor

moio commented Jun 9, 2023

Internal reference: SURE-7122

TL;DR

When a user lists a resource in a namespace in which they have permission to see only certain objects, excessive WebSocket activity is generated between the browser and Rancher, because resource version synchronization is lost and retried multiple times per second.

This results in excessive memory usage in the Rancher pod (#41225) and possibly excessive load on the Kubernetes API Server (#41663).

Reproducer setup

  • install Rancher 2.7.4 with default options

  • edit the Rancher deployment, e.g. via:

kubectl edit --namespace cattle-system deployment/rancher

Edit the main container section to add the --debug command line argument (in args) and the CATTLE_WATCH_TIMEOUT_SECONDS=20 environment variable (in env). Both changes make it easier to see symptoms without altering functionality significantly.

The end result should look like the following:

      - args:
        - --http-listen-port=80
        - --https-listen-port=443
        - --add-local=true
        - --debug
        env:
        - name: CATTLE_WATCH_TIMEOUT_SECONDS
          value: "20"
        - name: CATTLE_NAMESPACE
          value: cattle-system
        - name: CATTLE_PEER_SERVICE
          value: rancher

Wait until the deployment settles.

  • create a test user with username testuser:

Users & Authentication -> Users -> Create
Username: testuser
New Password: testusertestusertestuser
Confirm Password: testusertestusertestuser
Click on Create

  • allow the new user to access the local cluster:

Explore Cluster -> local -> Cluster and Project Members -> Cluster Membership -> Add
Select Member: testuser
Cluster Permissions: Member
Click on Create

  • create a namespace with two ConfigMaps in it, and give testuser access to only one of them:
export USER_ID=`kubectl get users -o json | jq --raw-output '.items[] | select(.username=="testuser")| .metadata.name'`

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  annotations:
  labels:
    kubernetes.io/metadata.name: test
  name: test
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test1
  namespace: test
data:
  a: "1"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: test2
  namespace: test
data:
  a: "2"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: test
  name: cm-reader
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["test1"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cm-reader
  namespace: test
subjects:
- kind: User
  name: $USER_ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: cm-reader
  apiGroup: rbac.authorization.k8s.io
EOF
  • Log into Rancher as testuser
  • list ConfigMaps:

Clusters -> local -> More Resources -> Core -> ConfigMaps

Note: at this point only the test1 ConfigMap is expected to be visible, not test2

  • Wait a few minutes

Note: if the test cluster uses self-signed SSL certificates, make sure to use a WebKit-based browser such as Chrome or Safari. As of December 2023, Firefox has a bug that prevents WebSockets from working correctly, so the symptoms will not emerge.

Reproducer outputs

  • Observe logs of the rancher pod for the presence of the string "too old":
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Expected result: up to one line every 20 seconds is produced

Actual result: many lines produced per second

2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
2023/06/09 14:30:58 [DEBUG] event watch error: too old resource version: 16581 (19656)
...

Navigating away from the ConfigMap list does not stop the error flow; logging out does.

@richard-cox
Member

I've managed to reproduce this given the instructions above.

TL;DR: the spamming is caused by Rancher not sending the resource.error "too old" message; that's not something the UI can tackle.

In terms of how the UI handles watching resources over sockets and the resource.stop messages from rancher...

  1. rancher -- kube socket closes (roughly every 30 minutes, or in some other specific scenarios)
  2. rancher --> browser: a resource.stop message is sent for every subscribed resource
  3. browser --> rancher: a watch-style message is sent for every subscribed resource. These either succeed because the revision is still valid (a rancher --> browser resource.start message is sent) or fail because it's too old, in which case we carry on with the steps below
  4. rancher --> browser: resource.stop AND resource.error messages are sent for every failed watch with a bad revision. The resource.error contains the "too old" message.
  5. the browser makes an HTTP request to fetch the latest resource for every failed watch
  6. browser --> rancher: a watch-style message is sent with the latest valid revision returned via the HTTP request

After reproducing the error, it seems that rancher is failing to send the resource.error "too old" message at step 4. This means the UI tries to watch with the stale revision... which is rejected... and so on ad nauseam.
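
For illustration, a minimal TypeScript sketch of the client-side flow described in steps 1-6 above (not the actual dashboard code; the message names follow the list above, while the class and helper names are hypothetical):

type SteveMessage = {
  name: 'resource.start' | 'resource.stop' | 'resource.error';
  resourceType: string;
  data?: { error?: string };
};

class ResourceWatcher {
  private revisions = new Map<string, string>(); // last revision sent per resource type

  constructor(
    private send: (msg: object) => void,                                  // WebSocket send
    private fetchLatest: (type: string) => Promise<{ revision: string }>, // HTTP re-fetch (step 5)
  ) {}

  watch(type: string, revision: string) {
    this.revisions.set(type, revision);
    this.send({ resourceType: type, resourceVersion: revision }); // steps 3 and 6: (re)subscribe
  }

  async onMessage(msg: SteveMessage) {
    if (msg.name === 'resource.stop') {
      // step 2: the server dropped the watch; re-subscribe with the revision we already hold
      this.watch(msg.resourceType, this.revisions.get(msg.resourceType) ?? '');
    } else if (msg.name === 'resource.error' && msg.data?.error?.includes('too old')) {
      // steps 4-6: the revision is stale; fetch a fresh revision over HTTP and re-subscribe with it
      const { revision } = await this.fetchLatest(msg.resourceType);
      this.watch(msg.resourceType, revision);
    }
  }
}

In the reproduced case the resource.error branch never runs because the message is never sent, so the resource.stop branch keeps re-subscribing with the same stale revision indefinitely.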

[screenshot]

We'd expect to see something like (ignore the revision 1 used in the re-watch, testing in a borked system)...

[screenshot]

@gaktive
Member

gaktive commented Jul 21, 2023

Need to reproduce this now that 2.7.5 is out.

@github-actions
Contributor

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.

@moio
Contributor Author

moio commented Sep 20, 2023

Still relevant

@moio
Contributor Author

moio commented Dec 6, 2023

@richard-cox IIUC in the reproduced case above the loop happens on a Steve endpoint (/v1). How likely is it that the same happens in Norman as well?

@richard-cox
Member

@moio The Norman side should be safe given that it works differently (server side, there is a single rancher--kube watch on all resources, rather than a per-client, per-resource client--rancher watch). The UI does send per-resource watch messages for Norman, though that was an oversight. The mechanism might get used or cleaned up (UI side) as part of rancher/dashboard#7906.
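
A rough sketch of the difference between the two models (illustrative TypeScript only, not Rancher code; all names are made up):

// Norman-style: a single server-side rancher--kube watch per resource type, fanned out
// to all clients. The server owns the resourceVersion, so no individual browser ever
// re-subscribes with a stale revision of its own.
async function fanOutSharedWatch(
  upstream: AsyncIterable<object>,
  clients: Array<(event: object) => void>,
) {
  for await (const event of upstream) {
    clients.forEach((sendToClient) => sendToClient(event));
  }
}

// Steve-style (/v1): every browser session opens its own watch and tracks its own
// resourceVersion, so each session can independently fall into the stale-revision
// re-subscribe loop described in this issue.
function perClientWatch(
  startWatch: (resourceVersion: string) => AsyncIterable<object>,
  clientRevision: string,
) {
  return startWatch(clientRevision);
}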

moio added a commit to rancher/steve that referenced this issue Dec 22, 2023
Align behavior with plain Watch, which also does the same.

Fixes rancher/rancher#41809

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@moio moio changed the title from "Excessive WebSocket activity, memory usage, goroutine leak when watching resources with permission by name" to "Excessive WebSocket activity when watching resources with permission by name" Dec 22, 2023
@moio
Contributor Author

moio commented Dec 22, 2023

@richard-cox rancher/steve#141 does produce the results you suggested.

Without patch:
[screenshot]

With patch:
[screenshot]

@richard-cox
Member

@moio Looks good, the resource.error with error: too old is there for the configmap resource and a new re-sub (resource.start) is sent with a different resource version. There is though a resource.stop for configmap very quickly afterwards, and the ui then resubs with the same resourceVersion as the first time. Does that resource.start --> resource.stop pattern continue further down the request log?

@moio
Contributor Author

moio commented Jan 4, 2024

There is though a resource.stop for configmap very quickly afterwards

Right, at timestamp 21:45:19.611

[...] and the ui then resubs with the same resourceVersion as the first time

Right, but on different resource types (namespaces then navlinks at 21:45:19.589) after their respective separate resource.stop events which happen predictably every 20s (I assume because CATTLE_WATCH_TIMEOUT_SECONDS=20, per reproducer conditions).

The actual next resub of configmaps happens at 21:45:39.640 (about 20s later) and uses the updated resourceVersion (180993).

I reproduced this once again and can confirm that the only WS traffic I see on configmaps occurs about every 20s.

Here already filtered:
[screenshot]

I uploaded the HAR files for further inspection here.

To my understanding, this seems all good - please do correct me if you see anything suspicious.

@moio moio changed the title from "Excessive WebSocket activity when watching resources with permission by name" to "[SURE-7122] Excessive WebSocket activity when watching resources with permission by name" Jan 5, 2024
@richard-cox
Member

richard-cox commented Jan 8, 2024

Sorry, I worded my comment badly.

The original resource.stop/resource.error --> resource.start (with new revision number) process for configmaps looks good, from 21:45:19.605 to 21:45:19.647. My questions were around the resource.stop for configmaps at 21:45:39.639, which is immediately followed by a resub with the previously subscribed resource version. That looks like the old behaviour.

In the patched.har file it's more prominent: lots of the old behaviour (resource.stop --> resource.start with a bad revision), one instance of the new/fixed behaviour, and then a few more of the old behaviour.

Would this be something to do with the way the dev/test environment is set up to trigger the issue? We can go over this tomorrow in our sync-up.

Update: discussed with Silvio. What I thought was the old behaviour was in fact the correct behaviour, given that the socket watch timeout is set to 20s rather than 30m (as stated in the description). The patch results look good!

@moio
Contributor Author

moio commented Jan 10, 2024

User reported success with the v2.7.6-debug-41809-2 image.

@MSpencer87 MSpencer87 self-assigned this Jan 11, 2024
@moio
Contributor Author

moio commented Feb 5, 2024

@richard-cox: @MSpencer87 discovered another side of this bug and I need your help to figure out why it happens.

Assuming the same reproduction instructions above, too old resource version errors are now (as of v2.7.6-debug-41809-2) handled correctly when navigating to the ConfigMap list page. Clicking on an individual ConfigMap to go to the ConfigMap detail page also works correctly.

However, at that point reloading the ConfigMap detail page gets into an infinite loop of too old resource version errors. AFAICS the error is delivered back to the UI, but for some reason resubscribing happens with the same old revision over and over:

[screenshot]

Here is the HAR file and a repro video for you: https://drive.google.com/drive/folders/1-EiXH9x5F9AIsj8YFD23rwUs_LdbsPz6?usp=sharing

Could this be a dashboard problem?

moio added a commit to moio/steve that referenced this issue Mar 1, 2024
Align behavior with plain Watch, which also does the same.

Fixes rancher/rancher#41809

Signed-off-by: Silvio Moioli <silvio@moioli.net>
@richard-cox
Member

richard-cox commented Mar 1, 2024

Investigated, and the fix has revealed a bug in the UI which is reproducible after refreshing an old resource's detail page (probably edit as well - my instance died and I can't confirm).

Normally we fetch a list of resources and watch with the revision from that fetch. If that becomes 'too old', minus a few other bits, we make a fresh fetch to get the latest resources and revision... and then watch with that revision.

However, if we fetch a single resource, we watch with its own revision, which is probably very old. If so, the too old message will prompt the UI to re-fetch the resource to get the latest copy and revision... and then watch with that resource's revision. That revision isn't going to be the latest... so it will probably be too old again. Rinse and repeat.
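
As a rough sketch (assumed shapes and hypothetical helpers, not the dashboard's actual store code) of the two revision choices just described:

interface ListResult { revision: string }                          // collection revision from a list call
interface SingleResult { metadata: { resourceVersion: string } }   // the object's own revision

// List page: watch with the collection revision. When it goes "too old",
// re-listing returns a fresh collection revision, so the re-subscribe converges.
async function watchAfterList(fetchList: () => Promise<ListResult>, watch: (rev: string) => void) {
  const list = await fetchList();
  watch(list.revision);
}

// Detail page (the buggy path): watch with the object's own resourceVersion.
// An object that has not changed recently keeps an old resourceVersion, so
// re-fetching the same object returns the same old revision and the
// "too old" -> re-fetch -> re-watch cycle never converges.
async function watchAfterGet(fetchOne: () => Promise<SingleResult>, watch: (rev: string) => void) {
  const single = await fetchOne();
  watch(single.metadata.resourceVersion);
}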

Bonus fun. The request to fetch the single resource is most probably namespaced... but that part doesn't make it to the request we make... so the URL is broken anyway.

I'll update this comment with the UI GitHub issue: rancher/dashboard#10540

@moio moio modified the milestones: v2.8.x, v2.8-Next1 Mar 1, 2024
@moio moio added the kind/bug label (Issues that are defects reported by users or that we know have reached a real release) and removed the bug label Mar 1, 2024
@moio moio modified the milestones: v2.8-Next1, v2.9-Next1 Mar 1, 2024
@moio
Contributor Author

moio commented Mar 1, 2024

@MSpencer87, given the last comment from @richard-cox, are you OK with considering this done and addressing the remaining problem you found in the follow-up issue rancher/dashboard#10540?

@moio
Contributor Author

moio commented Mar 4, 2024

@MSpencer87 can you report test results here please?

@MSpencer87
Contributor

MSpencer87 commented Mar 8, 2024

I was able to successfully reproduce this on 2.7.4 and verify using v2.7.6-debug-41809-2 that the watch errors surrounding too old resource versions have been resolved.

Steps to reproduce:

  1. Bring up HA Rancher on 2.7.4 on rke1 v1.25.9-rancher2-1
  2. Set up Rancher following the reproducer steps above:
    • Edit the Rancher deployment with --debug and CATTLE_WATCH_TIMEOUT_SECONDS=20
    • Create a new testuser with member permissions on the local cluster
    • Create a namespace with two ConfigMaps, allowing testuser access to only one
  3. Log in as testuser, open the browser console and navigate to view the ConfigMaps:
    Clusters -> local -> More Resources -> Core -> ConfigMaps
  4. Following the logs, watch errors are observed via:
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Steps for validation:

  1. Bring up HA Rancher with rancherImageTag=v2.7.6-debug-41809-2
  2. Set up Rancher following the reproducer steps above:
    • Edit the Rancher deployment with --debug and CATTLE_WATCH_TIMEOUT_SECONDS=20
    • Create a new testuser with member permissions on the local cluster
    • Create a namespace with two ConfigMaps, allowing testuser access to only one
  3. Log in as testuser, open the browser console and navigate to view the ConfigMaps:
    Clusters -> local -> More Resources -> Core -> ConfigMaps
  4. Following the logs, no watch errors are observed via:
kubectl logs -f --namespace=cattle-system deployment/rancher | grep "too old"

Please advise: opening the browser console on the Network tab, filtering for subscribe?sockId= and viewing Messages with the filter configmap will still produce message spam. This also seems to impact memory usage on the Rancher pod. @moio @richard-cox, can we confirm that the remaining WebSocket message spam is expected behavior to be addressed with rancher/dashboard#10540? If so, this can be closed.

@richard-cox
Member

@MSpencer87 Correct. The fix reveals a UI bug which has been addressed in the versions below. Those issues are open, but the changes are available in -head builds.

2.7 is rancher/dashboard#10567 but blocked on #44656

@git-ival

Retested on 2.8.2 -> 2.8.3 and confirmed the same results as in @MSpencer87's comment. Also encountered the same WebSocket "configmap" traffic approximately every 20 seconds.

@git-ival git-ival self-assigned this Mar 28, 2024