
Rancher-Monitoring: Query on Cluster Grafana Dashboard reports incorrect values #24343

Open
toddexp opened this issue Dec 2, 2019 · 12 comments
Labels: area/monitoring, internal, kind/bug, priority/1, team/observability&backup, [zube]: To Triage

Comments

@toddexp commented Dec 2, 2019

What kind of request is this (question/bug/enhancement/feature request):
Bug

Steps to reproduce (least amount of steps as possible):
Enable cluster monitoring
View cluster Grafana dashboard

Result:
The Pod CPU Usage and All Process CPU Usage sections of the dashboard are inaccurate. The values on these graphs are roughly doubled because of the queries being used.

Other details that may be helpful:
Rancher monitoring appears to expose similar metrics in multiple ways, and because of the queries used to build the graphs, multiple overlapping series are summed together.

Incorrect Pod CPU Usage query: sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)

Incorrect All Process CPU Usage query: sum (rate (container_cpu_usage_seconds_total{namespace!="",pod_name!="",node=~"^$Node$"}[5m])) by (namespace, pod_name)

Corrected Pod CPU Usage query: sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name) (added container_name!="" to the query)

Corrected All Process CPU Usage query: sum (rate (container_cpu_usage_seconds_total{namespace!="",pod_name!="",container_name!="",node=~"^$Node$"}[5m])) by (namespace, pod_name) (added container_name!="" to the query)
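
A quick way to see what the extra container_name!="" filter excludes (a rough diagnostic of my own, not part of the dashboards): counting the series per container_name value for one pod shows the pod-level aggregate series with container_name="" sitting alongside the real container series, and that aggregate is what gets summed on top of the containers.

# <your-pod> is a placeholder, substitute an actual pod name on the node
count by (container_name) (container_cpu_usage_seconds_total{pod_name="<your-pod>",node=~"^$Node$"})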

There was a similar issue opened for daemonset Grafana graphs: #20162

Environment information

  • Rancher version (rancher/rancher or rancher/server image tag, or shown bottom left in the UI): 2.3.0
  • Installation option (single install/HA): single install

gz#15744

@loganhz added this to the v2.4 milestone Dec 3, 2019
@loganhz added the kind/bug, area/monitoring, and team/cn labels Dec 3, 2019
@loganhz removed this from the v2.4 milestone Dec 16, 2019
@loganhz removed the team/cn label Dec 16, 2019
@lxkaka commented Dec 19, 2019

Same issue, and the memory metric is also exposed twice.

@lxkaka commented Dec 19, 2019

@toddexp I think the filter that should be added is image!="".

@toddexp commented Dec 19, 2019

In my environment, the metrics in Prometheus that are used by the memory and CPU queries on the cluster page in Grafana do not carry image labels. When I applied image!="", the query returned 0 results.

Taking a closer look, it does appear that the memory graphs are also affected and those queries are incorrect. However, it seems to be only a minor increase in the memory metrics, as opposed to the CPU metrics, which are doubled.

I am not sure what Rancher has added for metrics gathering, but there are entries like these carrying the same metrics: container="POD",container_name="POD",endpoint="https-metrics",job="expose-kubelets-metrics",namespace="cattle-prometheus",pod="exporter-node-cluster-monitoring-5jrd4",pod_name="exporter-node-cluster-monitoring-5jrd4",service="expose-kubelets-metrics"

Since these are not filtered out in the Grafana dashboards, we are picking up duplicate data. For the memory metrics this only adds a very small amount, but for CPU these additional series double the reported usage.
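
If it helps anyone quantify this on their own cluster, here is a rough diagnostic sketch (my own queries, assuming the label set described above, not something taken from the shipped dashboards). The first query isolates the pod-level aggregate series (container_name=""), the second isolates the pause-container series (container_name="POD"); whatever they return is what the unfiltered panels add on top of the real per-container series.

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name="",node=~"^$Node$"}[5m])) by (pod_name)

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name="POD",node=~"^$Node$"}[5m])) by (pod_name)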

@lxkaka commented Dec 20, 2019

@loganhz is there any plan to resolve this bug?

@toddexp commented Apr 2, 2020

I was hoping that this bug had been corrected along with that fix. I just installed Rancher 2.4.2 with monitoring v0.1.0, and the Grafana graphs still incorrectly show double what they should.

@lxkaka commented Apr 5, 2020

Still following this issue.

@chanjarster commented Sep 29, 2020


Same issue on Rancher v2.3.6: container_cpu_usage_seconds_total and container_memory_working_set_bytes are doubled:

container_cpu_usage_seconds_total{container="uniauth",container_name="uniauth",cpu="total",endpoint="https-metrics",id="/kubepods/burstable/podc4939924-829b-4b7b-98e3-9f00840409e3/8d95c26df64e54686e7707c3ee7d018cfd03fc0c61bd6818db19eb4a14c619dd",image="harbor.supwisdom.com/uniauth/uniauth@sha256:b86c971c07f42864675cffc22386a0e8ddcb920e6025ee01d6053e18de4bca23",instance="192.168.116.117:10250",job="expose-kubelets-metrics",name="k8s_uniauth_uniauth-test-backend-7d895686d4-mzbzn_uniauth-test_c4939924-829b-4b7b-98e3-9f00840409e3_0",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

container_cpu_usage_seconds_total{container="uniauth",container_name="uniauth",endpoint="https-metrics",instance="192.168.116.117:10250",job="expose-kubelets-metrics",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--
container_memory_working_set_bytes{container="uniauth",container_name="uniauth",endpoint="https-metrics",id="/kubepods/burstable/podc4939924-829b-4b7b-98e3-9f00840409e3/8d95c26df64e54686e7707c3ee7d018cfd03fc0c61bd6818db19eb4a14c619dd",image="harbor.supwisdom.com/uniauth/uniauth@sha256:b86c971c07f42864675cffc22386a0e8ddcb920e6025ee01d6053e18de4bca23",instance="192.168.116.117:10250",job="expose-kubelets-metrics",name="k8s_uniauth_uniauth-test-backend-7d895686d4-mzbzn_uniauth-test_c4939924-829b-4b7b-98e3-9f00840409e3_0",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

container_memory_working_set_bytes{container="uniauth",container_name="uniauth",endpoint="https-metrics",instance="192.168.116.117:10250",job="expose-kubelets-metrics",namespace="uniauth-test",node="k8sworker01",pod="uniauth-test-backend-7d895686d4-mzbzn",pod_name="uniauth-test-backend-7d895686d4-mzbzn",service="expose-kubelets-metrics"}
--

while the queries are:

sum by (container_name)(rate(container_cpu_usage_seconds_total{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod"}[5m]))

sum by(container_name) (container_memory_working_set_bytes{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod"})

So the results are all doubled, which is really confusing.
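
A quick check that would confirm the duplication (a sketch to run directly against Prometheus, not one of the dashboard queries): count the series per container and look for anything above 1. In the dump above, each container has one series with the id/image/name labels and a second one without them.

count by (namespace, pod_name, container_name) (container_cpu_usage_seconds_total{container_name!="",container_name!="POD"}) > 1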

@chanjarster

I added id!="" to the query and the problem was resolved. But since the Grafana dashboards imported by Rancher are not modifiable, I had to import a fixed dashboard, which is not convenient.
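
For reference, this is what that workaround would look like applied to the two queries quoted above (just a sketch of the same id!="" filter, everything else unchanged):

sum by (container_name)(rate(container_cpu_usage_seconds_total{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod",id!=""}[5m]))

sum by (container_name) (container_memory_working_set_bytes{namespace="$namespace",container_name!="",container_name=~"$container",container_name!="POD",pod_name="$pod",id!=""})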

@deniseschannon removed this from the v2.4 - Backlog milestone Jan 29, 2021
@Turb0Fly

I've got a 2.4.8 Rancher installation and pod memory usage is indeed effectively doubled when viewed in Grafana at the cluster level.
When going into the Pods dashboard and selecting the namespace and the proper pod, it is displayed accurately.

@samail commented May 21, 2021

Same on version 2.5.5.
This can be fixed by changing the query aggregation from sum to max.

For example, for Pod CPU Usage:
From

sum (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)

To

max (rate (container_cpu_usage_seconds_total{pod_name!="",container_name!="POD",node=~"^$Node$"}[5m])) by (pod_name)
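
The same change would presumably carry over to the memory panels, for example (assuming the panel uses container_memory_working_set_bytes with the same label filters, which I have not verified):

max (container_memory_working_set_bytes{pod_name!="",container_name!="POD",node=~"^$Node$"}) by (pod_name)

max works here because none of the remaining series for a pod is larger than the pod-level container_name="" aggregate, so taking the max per pod effectively selects the pod total instead of adding that aggregate on top of the per-container series.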

@Tejeev changed the title from "Rancher-Monitoring: Query Issue on Cluster Grafana Dashboard" to "Rancher-Monitoring: Query on Cluster Grafana Dashboard reports incorrect values" Jun 26, 2021
@samjustus added the team/observability&backup label and removed the team/opni label Feb 1, 2024
@perjham commented Mar 12, 2024

Hello, this behavior is still present in Rancher 2.9 with the Rancher monitoring stack, chart version "104.0.2+up45.31.1", on RHEL 9. Any advice?

@MKlimuszka added this to the v2.9-Next1 milestone Mar 19, 2024
@MKlimuszka

#44726 is a similar, possibly duplicate, issue that has a PR.
