
Inconsistencies with Cluster metrics across the UI #5430

Closed

aalves08 opened this issue Mar 17, 2022 · 8 comments
aalves08 (Contributor) commented Mar 17, 2022

System
v2.6.3

Describe the bug
Several inconsistencies with Cluster metrics have been found throughout the UI (homepage vs Cluster Dashboard vs nodes list view, the latter for single-node clusters).

1) In the homepage cluster table, the "usage" for RKE2 clusters is always zero (FIXED in 2.6.4 already)
2) Values for Reserved (MEM and CPU) in the Cluster Dashboard for an RKE2 cluster are broken, i.e. they always display 0 (FIXED in 2.6.4 already)
3) The number of PODS used in the homepage cluster table is inconsistent with the number displayed in the Cluster Dashboard, for all cluster types
4) In the homepage cluster table we use MB for MEM values below 1 GiB, while the Cluster Dashboard only uses GiB (see the unit sketch after this list)
5) Inconsistency in the number of "used" PODS between the Cluster Dashboard and the nodes list view (applicable to single machine count/node clusters)
6) In the homepage cluster table we show the "reserved" values rather than the "usage" values shown in the Cluster Dashboard
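
Item 4 boils down to unit formatting. A minimal sketch of one way to keep the two views consistent, using a hypothetical helper (illustrative only, not the dashboard's actual formatter):

```ts
// Hypothetical formatter: always render memory in GiB so the homepage
// table and the Cluster Dashboard agree on units, even below 1 GiB.
const GIB = 1024 ** 3;

function formatMemoryGiB(bytes: number, precision = 2): string {
  return `${(bytes / GIB).toFixed(precision)} GiB`;
}

// 512 MiB renders as "0.50 GiB" instead of switching to "512 MB".
console.log(formatMemoryGiB(512 * 1024 ** 2));
```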

To Reproduce
1) Create an RKE2 cluster -> check Cluster Dashboard for numbers -> compare with homepage (always zero)
2) Create an RKE2 cluster -> create a deployment with MEM and CPU reserved -> check Cluster Dashboard
3) Create an RKE2 cluster -> check Cluster Dashboard for POD numbers -> compare with homepage
4) Create an RKE2 cluster -> check Cluster Dashboard for MEM units -> compare with homepage
5) Create an RKE2 cluster with a single node -> create a deployment -> check Cluster Dashboard and compare with nodes list view
6) Create an RKE2 cluster -> check Cluster Dashboard for numbers -> compare with homepage

Expected Result

  1. Should display correct metrics
  2. Should display correct metrics
  3. POD numbers should match
  4. Units should be consistent (GiB)
  5. Should show the same PODS values in both places (applicable to single machine count/node clusters)
  6. Since it's a "very expensive" operation for the frontend to fetch the correct "usage" for MEM, CPU, and PODS on the homepage (at least 2 extra API requests per cluster), we decided to remove the "reserved" information for MEM and CPU until we have a proper technical solution for displaying usage in the homepage cluster list (see the request-count sketch below)
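
To make the cost in item 6 concrete, here is an illustrative sketch of what fetching live usage per cluster could look like. The proxy base path and endpoint shapes are assumptions for illustration, not the dashboard's actual code:

```ts
// Assumed per-cluster Kubernetes proxy base path (illustrative only).
async function fetchClusterUsage(clusterId: string) {
  const base = `/k8s/clusters/${clusterId}`;
  // Extra request 1: live node metrics (requires metrics-server in the cluster).
  const nodeMetrics = await fetch(`${base}/apis/metrics.k8s.io/v1beta1/nodes`);
  // Extra request 2: the pod list, needed to count running PODS.
  const pods = await fetch(`${base}/api/v1/pods`);
  return { nodeMetrics: await nodeMetrics.json(), pods: await pods.json() };
}
```

With N clusters on the homepage this means 2 * N additional requests on every load, which is why "usage" was pulled from the list view for now.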

Additional Information

  • Main objectives are to show the "usage" everywhere for MEM, CPU, and PODS (in the applicable views), with consistent numbers and units throughout the UI.
@aalves08 aalves08 added this to the v2.6.5 milestone Mar 17, 2022
@aalves08 aalves08 self-assigned this Mar 17, 2022
aalves08 (Contributor, Author) commented:

@nwmac issue created about the errors we found while digging into cluster metrics.

@gaktive I believe this issue relates better to SURE-4148 and SURE-4090. What do you think?

xhejtman commented:

I see a problem with these metrics as well (2.6.3-patch2). In Cluster Explorer, I see metric numbers like CPU and memory, but they are not current; they seem to be from some point in time and are either not updated or updated at random.

aalves08 (Contributor, Author) commented:

BE response to possibly adding status.usage to the git repo object response:
https://suse.slack.com/archives/C02CX064EBX/p1648551618026609

gaktive (Member) commented Apr 4, 2022

Another possible scenario: SURE-4301 brings up a case where a k8s upgrade to 1.21.8 changes the monitoring values.

aalves08 (Contributor, Author) commented Apr 5, 2022

@gaktive I think we should look into SURE-4301 for 2.6.6. The PR for this issue is already open, and I think SURE-4301 will need quite a bit of time to investigate.

Heads up: we are going to remove the "usage" for CPU and MEM on the homepage. Technically, the values we were displaying there were reserved rather than used resources, and actual usage is the data clients really want to look at. To show that data on the homepage we would need changes to the BE. Unfortunately, this mismatch has been a big source of inconsistency claims; the sketch below illustrates the difference.
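
A rough sketch of the distinction, with deliberately simplified types (conceptual only, not Rancher's actual implementation): "reserved" sums the resource requests of scheduled pods, while "usage" is live consumption as reported by something like metrics-server.

```ts
interface ResourceAmount { cpuMillis: number; memBytes: number }

// Shared summation helper over CPU and memory amounts.
function sum(items: ResourceAmount[]): ResourceAmount {
  return items.reduce(
    (acc, i) => ({
      cpuMillis: acc.cpuMillis + i.cpuMillis,
      memBytes: acc.memBytes + i.memBytes,
    }),
    { cpuMillis: 0, memBytes: 0 }
  );
}

// "Reserved": sum of the resource *requests* of scheduled pods -- what
// the scheduler has set aside, regardless of real consumption.
const reservedTotal = (podRequests: ResourceAmount[]) => sum(podRequests);

// "Usage": sum of live per-node consumption (e.g. from metrics-server) --
// the number clients actually want, but more expensive to fetch.
const usageTotal = (nodeUsage: ResourceAmount[]) => sum(nodeUsage);
```

Same summation, different data source: the two totals disagree whenever pods consume more or less than they request, which is exactly the mismatch behind the inconsistency claims.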

So for now, clients should look at the Cluster Dashboard to get proper metrics for their clusters.

FYI @nwmac

gaktive (Member) commented Apr 20, 2022

Upon discussion, we're OK to remove the usage side if the results are wrong. We just need to put this in the release notes for 2.6.5. I see the docs ticket in place, so that's helpful.

igomez06 commented May 3, 2022

Setup:
Rancher Version: v2.6.5-rc6
Kubernetes Version: RKE2 v1.22.8+rke2r1
HA Install

Steps:

  1. Created an RKE2 cluster; POD numbers are consistent between Cluster Dashboard and homepage
  2. Same RKE2 cluster; MEM units are consistent between Cluster Dashboard and homepage
  3. Same RKE2 cluster; numbers (memory, cores, pods, etc.) are consistent between Cluster Dashboard and homepage
  4. Created a single-node RKE2 cluster, then created a deployment; pod numbers are consistent between Cluster Dashboard and nodes list view
  5. "Reserved" is no longer present on the homepage, as expected

@zube zube bot closed this as completed May 3, 2022
ellerydb commented:

> Upon discussion, we're OK to remove the usage side if the results are wrong. We just need to put this in the release notes for 2.6.5. I see the docs ticket in place, so that's helpful.

As a Rancher platform owner, it was good to have a holistic view of usage across all clusters.

@zube zube bot removed the [zube]: Done label Aug 2, 2022