
Rancher-Monitoring: Query on Cluster Grafana Dashboard reports incorrect values #24343

Open
toddexp opened this issue Dec 2, 2019 · 12 comments
Labels: area/monitoring, internal, kind/bug, priority/1, team/observability&backup, [zube]: To Triage

Comments

@toddexp commented Dec 2, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (least amount of steps as possible):
Enable cluster monitoring
View cluster Grafana dashboard

Result:
The Pod CPU Usage and All Process CPU Usage sections of the dashboard are inaccurate. The values on these graphs are roughly doubled because of the queries being used.

Other details that may be helpful:
Rancher monitoring appears to expose similar metrics in multiple ways, and because of the queries used to build the graphs, multiple overlapping series are summed together.

Incorrect Pod CPU Usage query: sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)

Incorrect All Process CPU Usage query: sum (rate (container_cpu_usage_seconds_total{namespace!="",pod_name!="",node=~"^$Node$"}[5m])) by (namespace, pod_name)

Corrected Pod CPU Usage query: sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name) (added container_name!="" to the query)

Corrected All Process CPU Usage query: sum (rate (container_cpu_usage_seconds_total{namespace!="",pod_name!="",container_name!="",node=~"^$Node$"}[5m])) by (namespace, pod_name) (added container_name!="" to the query)
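
A quick way to see what the extra container_name!="" filter excludes (a rough diagnostic of my own, not part of the dashboards): counting the series per container_name value for one pod shows the pod-level aggregate series with container_name="" sitting alongside the real container series, and that aggregate is what gets summed on top of the containers.

# <your-pod> is a placeholder, substitute an actual pod name on the node
count by (container_name) (container_cpu_usage_seconds_total{pod_name="<your-pod>",node=~"^$Node$"})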

There was a similar issue opened for daemonset Grafana graphs: #20162

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.3.0
  • Installation option (single install/HA): single install

gz#15744

@loganhz added this to the v2.4 milestone Dec 3, 2019
@loganhz added the kind/bug, area/monitoring, and team/cn labels Dec 3, 2019
@loganhz removed this from the v2.4 milestone Dec 16, 2019
@loganhz removed the team/cn label Dec 16, 2019
@lxkaka commented Dec 19, 2019

Same issue, and the memory metric is also exposed twice.

@lxkaka commented Dec 19, 2019

@toddexp I think the filter that should be added is image!="".

@toddexp commented Dec 19, 2019

In my environment, the metrics in Prometheus that are used by the memory and CPU queries on the cluster page in Grafana do not carry image labels. When I applied image!="", the query returned 0 results.

Taking a closer look, it does appear that the memory graphs are also affected and those queries are incorrect. However, it seems to be only a minor increase in the memory metrics, as opposed to the CPU metrics, which are doubled.

I am not sure what Rancher has added for metrics gathering, but there are entries like these carrying the same metrics: container="POD",container_name="POD",endpoint="https-metrics",job="expose-kubelets-metrics",namespace="cattle-prometheus",pod="exporter-node-cluster-monitoring-5jrd4",pod_name="exporter-node-cluster-monitoring-5jrd4",service="expose-kubelets-metrics"

Since these are not filtered out in the Grafana dashboards, we are picking up duplicate data. For the memory metrics this only adds a very small amount, but for CPU these additional series double the reported usage.
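
If it helps anyone quantify this on their own cluster, here is a rough diagnostic sketch (my own queries, assuming the label set described above, not something taken from the shipped dashboards). The first query isolates the pod-level aggregate series (container_name=""), the second isolates the pause-container series (container_name="POD"); whatever they return is what the unfiltered panels add on top of the real per-container series.

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name="",node=~"^$Node$"}[5m])) by (pod_name)

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name="POD",node=~"^$Node$"}[5m])) by (pod_name)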

@lxkaka commented Dec 20, 2019

@loganhz is there any plan to resolve this bug?

@toddexp commented Apr 2, 2020

I was hoping that this bug had been corrected along with that fix. I just installed Rancher 2.4.2 with monitoring v0.1.0, and the Grafana graphs still incorrectly show double what they should.

@lxkaka commented Apr 5, 2020

Still following this issue.

@chanjarster commented Sep 29, 2020


Same issue on Rancher v2.3.6: container_cpu_usage_seconds_total and container_memory_working_set_bytes are doubled:

container_cpu_usage_seconds_total{container="uniauth",container_name="uniauth",cpu="total",endpoint="https-metrics",id="/kubepods/burstable/podc4939924-829b-4b7b-98e3-9f00840409e3/8d95c26df64e54686e7707c3ee7d018cfd03fc0c61bd6818db19eb4a14c619dd",image="harbor.supwisdom.com/uniauth/uniauth@sha256:b86c971c07f42864675cffc22386a0e8ddcb920e6025ee01d6053e18de4bca23",instance="192.168.116.117:10250",job="expose-kubelets-metrics",name="k8s_uniauth_uniauth-test-backend-7d895686d4-mzbzn_uniauth-test_c4939924-829b-4b7b-98e3-9f00840409e3_0",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

container_cpu_usage_seconds_total{container="uniauth",container_name="uniauth",endpoint="https-metrics",instance="192.168.116.117:10250",job="expose-kubelets-metrics",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--
container_memory_working_set_bytes{container="uniauth",container_name="uniauth",endpoint="https-metrics",id="/kubepods/burstable/podc4939924-829b-4b7b-98e3-9f00840409e3/8d95c26df64e54686e7707c3ee7d018cfd03fc0c61bd6818db19eb4a14c619dd",image="harbor.supwisdom.com/uniauth/uniauth@sha256:b86c971c07f42864675cffc22386a0e8ddcb920e6025ee01d6053e18de4bca23",instance="192.168.116.117:10250",job="expose-kubelets-metrics",name="k8s_uniauth_uniauth-test-backend-7d895686d4-mzbzn_uniauth-test_c4939924-829b-4b7b-98e3-9f00840409e3_0",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

container_memory_working_set_bytes{container="uniauth",container_name="uniauth",endpoint="https-metrics",instance="192.168.116.117:10250",job="expose-kubelets-metrics",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

while the queries are:

sum by (container_name)(rate(container_cpu_usage_seconds_total{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod"}[5m]))

sum by(container_name) (container_memory_working_set_bytes{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod"})

So the results are all doubled, which is really confusing.
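
A quick check that would confirm the duplication (a sketch to run directly against Prometheus, not one of the dashboard queries): count the series per container and look for anything above 1. In the dump above, each container has one series with the id/image/name labels and a second one without them.

count by (namespace, pod_name, container_name) (container_cpu_usage_seconds_total{container_name!="",container_name!="POD"}) > 1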

@chanjarster

I added id!="" to the query and the problem was resolved. But since the Grafana dashboards imported by Rancher are not modifiable, I had to import a fixed dashboard, which is not convenient.
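
For reference, this is what that workaround would look like applied to the two queries quoted above (just a sketch of the same id!="" filter, everything else unchanged):

sum by (container_name)(rate(container_cpu_usage_seconds_total{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod",id!=""}[5m]))

sum by (container_name) (container_memory_working_set_bytes{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod",id!=""})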

@deniseschannon removed this from the v2.4 - Backlog milestone Jan 29, 2021
@Turb0Fly

I've got a 2.4.8 Rancher installation and pod memory usage is indeed effectively doubled when viewed in Grafana at the cluster level.
When going into the Pods dashboard and selecting the namespace and the proper pod, it is displayed accurately.

@samail commented May 21, 2021

Same on version 2.5.5.
This can be fixed by changing the query aggregation from sum to max.

For example, for Pod CPU Usage:
From

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)

To

max (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)
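
The same change would presumably carry over to the memory panels, for example (assuming the panel uses container_memory_working_set_bytes with the same label filters, which I have not verified):

max (container_memory_working_set_bytes{pod_name!="",container_name!="POD",node=~"^$Node$"}) by (pod_name)

max works here because none of the remaining series for a pod is larger than the pod-level container_name="" aggregate, so taking the max per pod effectively selects the pod total instead of adding that aggregate on top of the per-container series.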

@Tejeev changed the title from "Rancher-Monitoring: Query Issue on Cluster Grafana Dashboard" to "Rancher-Monitoring: Query on Cluster Grafana Dashboard reports incorrect values" Jun 26, 2021
@samjustus added the team/observability&backup label and removed the team/opni label Feb 1, 2024
@perjham commented Mar 12, 2024

Hello, this behavior is still present in Rancher 2.9 with the Rancher monitoring stack, chart version "104.0.2+up45.31.1", on RHEL 9. Any advice?

@MKlimuszka added this to the v2.9-Next1 milestone Mar 19, 2024
@MKlimuszka

#44726 is a similar, possibly duplicate, issue that has a PR.
