Steve should only send back metadata for counts that have changed #36681

Closed
nwmac opened this issue Mar 1, 2022 · 8 comments
Labels
area/scalability 10k or bust · kind/enhancement Issues that improve or augment existing functionality · release-note Note this issue in the milestone's release notes · team/area1

Comments

nwmac (Member) commented Mar 1, 2022

Currently, the Steve API sends count metadata over the websocket every time resource count metadata changes. This results in roughly 24K of data (a typical minimum) being sent over the socket that the UI has to process.

This is always the full payload of count metadata.

It would be more efficient if Steve only sent metadata for the resources whose counts have changed, rather than the entire document.
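
For a sense of what that payload contains, here is a rough Go sketch of the counts document; the type and field names are illustrative guesses based on the /v1/counts output, not Steve's actual definitions. With one entry per resource type in the cluster, serializing the whole map on every change adds up quickly.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative shape of the counts document. Field names are guesses based
// on the /v1/counts output, not Steve's actual type definitions.
type countsPayload struct {
	Counts map[string]resourceCount `json:"counts"`
}

type resourceCount struct {
	Summary    summary            `json:"summary"`
	Namespaces map[string]summary `json:"namespaces,omitempty"`
}

type summary struct {
	Count int `json:"count"`
}

func main() {
	// A real cluster has one entry per resource type, each potentially
	// broken down by namespace, which is why the full document is large.
	p := countsPayload{Counts: map[string]resourceCount{
		"configmap": {Summary: summary{Count: 12}},
		"pod":       {Summary: summary{Count: 48}},
	}}
	out, _ := json.Marshal(p)
	fmt.Println(string(out))
}
```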

@nwmac nwmac added this to the v2.6.5 milestone Mar 1, 2022
@cbron cbron assigned samjustus and unassigned cbron Mar 3, 2022
@samjustus samjustus modified the milestones: v2.6.5, v2.6.6 Mar 9, 2022
@cbron cbron added area/scalability 10k or bust and removed feature/performance labels Apr 26, 2022
samjustus (Collaborator) commented

@nwmac punting to 2.7

@gaktive gaktive added the JIRA To be used in correspondence with the internal ticketing system. label Oct 12, 2022
gaktive (Member) commented Oct 12, 2022

Internal reference: SURE-5394

@samjustus samjustus removed the JIRA To be used in correspondence with the internal ticketing system. label Oct 13, 2022
@zube zube bot added the kind/enhancement Issues that improve or augment existing functionality label Dec 12, 2022
git-ival commented

Hi @nwmac, I have a few questions about the specifics of this issue.

  1. When you say "24K of data", I assume you mean KB as in kilobytes. Is that correct?
  2. Was this observed on a Rancher cluster with or without downstream clusters?
  3. Were the resource updates coming from Downstream or Local resources?
  4. How many resources of different types existed in the cluster? (Secrets, Projects, Pods, etc.)
  5. Can you provide some details on the Rancher cluster's specifics so that we can reproduce more easily and test any fixes?

MbolotSuse (Contributor) commented

@git-ival This hasn't been fully merged yet. Once it's ready for testing, I'll put together a validation template that gives you more information for reproducing the issue.

MbolotSuse (Contributor) commented

Validation Template

Root Cause

Steve maintains a map of the current counts for various resource types in memory. Previously, each time this map changed, Steve sent the full map of counts over the websocket, so a large object was delivered every time a resource was added or deleted.
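
To make the root cause concrete, here is a minimal sketch in Go of the diff the fix needs to compute: given the previous and current snapshots of the in-memory count map, only the entries that differ should cross the websocket. The map shape and function name are simplified assumptions, not Steve's code.

```go
package main

import "fmt"

// changedCounts returns only the entries whose value differs between the
// previous and current snapshots. Resource types that disappeared entirely
// are reported with a zero count. This is a simplified illustration, not
// Steve's implementation.
func changedCounts(prev, cur map[string]int) map[string]int {
	delta := map[string]int{}
	for resource, n := range cur {
		if old, ok := prev[resource]; !ok || old != n {
			delta[resource] = n
		}
	}
	for resource := range prev {
		if _, ok := cur[resource]; !ok {
			delta[resource] = 0
		}
	}
	return delta
}

func main() {
	prev := map[string]int{"pod": 48, "configmap": 12, "serviceaccount": 40}
	cur := map[string]int{"pod": 48, "configmap": 12, "serviceaccount": 41}
	// Only the serviceaccount entry changed, so only it needs to be sent.
	fmt.Println(changedCounts(prev, cur))
}
```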

What was fixed, or what changes have occurred

  • Steve only sends the counts that have changed. For example, if you create a service account, you will only get the counts (over the websocket) for serviceAccounts, and not for pods or deployments.
  • Counts are now "de-bounced" for five seconds, meaning that a counts update is sent at most once every five seconds and contains all of the changed counts from the last five seconds (see the sketch after this list). This was explicitly requested by the UI in Steve should throttle sending of count metadata #36682.
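
A minimal sketch of that debounce behaviour, again with the payload simplified to a map of resource type to count; the function and channel names are hypothetical and this is not Steve's code:

```go
package main

import (
	"fmt"
	"time"
)

// debounceCounts collects per-resource count changes from updates and emits
// the merged set at most once per interval.
func debounceCounts(updates <-chan map[string]int, interval time.Duration, emit func(map[string]int)) {
	pending := map[string]int{}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case u, ok := <-updates:
			if !ok {
				// Flush whatever is left when the source closes.
				if len(pending) > 0 {
					emit(pending)
				}
				return
			}
			for resource, count := range u {
				pending[resource] = count // later changes overwrite earlier ones
			}
		case <-ticker.C:
			if len(pending) > 0 {
				emit(pending)
				pending = map[string]int{}
			}
		}
	}
}

func main() {
	updates := make(chan map[string]int)
	go func() {
		updates <- map[string]int{"serviceaccount": 41}
		updates <- map[string]int{"configmap": 13}
		close(updates)
	}()
	debounceCounts(updates, 5*time.Second, func(changed map[string]int) {
		fmt.Println("send over websocket:", changed)
	})
}
```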

Areas or cases that should be tested

  • Basic UI functionality. For example:
    • Are the counts on the sidebar accurate?
    • Do they stay accurate for a resource type (e.g. pods) if I create/update/delete a resource of that same type (i.e. make a new pod)?
    • Do they stay accurate for a resource type if I create/update/delete a resource of an unrelated type?

What areas could experience regressions

  • The resource counts that the UI presents
  • The presence of sidebars allowing a user to filter for a specific resource type (e.g. Workloads -> Deployments)

Are the repro steps accurate/minimal?

Yes, they are included here for convenience.

  1. Run rancher/rancher:v2.7.0 (docker install or HA)
  2. Complete basic setup steps through setting a new admin password
  3. Open your browser dev console and filter by websocket connections. Find the websocket connection over which the counts resource is requested.
  4. Add a new resource to the local cluster (for example, create a serviceAccount/configmap/namespace).
  5. Observe that you received counts for every resource type in the cluster.

Q/A

  1. I don't think the exact size is important here - the salient point is that it sent a large JSON payload over the websocket on a frequent basis.
  2. The counts resource is provided by Steve and can be obtained in both local and downstream clusters. I would guess that this issue is substantially worse in the local cluster (where there is more activity and more overall resources), but you can likely notice it by navigating to downstream clusters in the UI and looking at the websocket connections to find where the counts for that cluster are coming from.
  3. As stated above, you can get updates from both local and downstream connections, depending on which cluster you requested the counts for.
  4. You can check this in the clusters you have running with `kubectl api-resources -o wide | wc -l`. From my testing on a basic Rancher install, this number comes out to about 172.
  5. This issue (counts for all resources) is reproducible on every Rancher setup (Docker install, HA install, many downstreams, few downstreams). If you are looking for repro tips to make this as bad as possible from a scaling perspective, I would go with:
  • Create many resources very fast. Secrets/configmaps/tokens are all good candidates for this (see the client-go sketch after this list).
  • Create/delete clusters. Cluster creation/deletion seems to create/delete many different resources of varying types (RBAC, core types, etc.). Because the operation is relatively "noisy", it can produce many of these count updates.
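
If it helps, here is a hedged client-go sketch of the first tip (creating many resources very fast); the kubeconfig path, namespace, and object count are placeholders for your environment:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the kubeconfig for the target (local or downstream) cluster.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Create a burst of configmaps so the counts stream has plenty to report.
	for i := 0; i < 200; i++ {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("count-stress-%d", i)},
			Data:       map[string]string{"index": fmt.Sprint(i)},
		}
		if _, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
	}
}
```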

If you would like more specific guidance, please let me know.

floatingman (Contributor) commented

Ran through the validation template and observed the correct behavior.

@zube zube bot closed this as completed Dec 23, 2022
@MbolotSuse MbolotSuse added the release-note Note this issue in the milestone's release notes label Mar 10, 2023
MbolotSuse (Contributor) commented

Release Note

Rancher maintains a /v1/counts endpoint that the UI uses to display resource counts. The UI subscribes to count changes for all resources through a websocket so that it receives updated counts as they happen.

Previously, each message from this socket included all counts for every resource type in the cluster, even if the count had only changed for one specific resource type. This forced the UI to re-process resource counts for every resource type at a high frequency, causing a significant performance impact. Now, Rancher only sends back a count for a resource type if the count has changed from the previously known number, improving UI performance.
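
For anyone who wants to gauge the size of the full document themselves, here is a small Go sketch that fetches /v1/counts once over HTTP. The RANCHER_URL and RANCHER_TOKEN environment variables and the bearer-token header are assumptions about how you reach and authenticate to your install; watching the websocket in the browser dev console (as in the validation template above) is still the way to observe the per-change messages.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	// RANCHER_URL and RANCHER_TOKEN are placeholders for your environment.
	url := os.Getenv("RANCHER_URL") + "/v1/counts"
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	// Bearer-token auth is an assumption about how you authenticate.
	req.Header.Set("Authorization", "Bearer "+os.Getenv("RANCHER_TOKEN"))

	// Skip TLS verification only for throwaway test installs with
	// self-signed certificates.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}
	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("GET /v1/counts returned %d bytes\n", len(body))
}
```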

samjustus (Collaborator) commented

/backport 2023-Q2-v2.6.x
