
Kernel consumption too high #1279

Open · jochen-schuettler opened this issue Feb 29, 2024 · 25 comments
Labels: kind/bug

Comments

@jochen-schuettler

What happened?

The measured consumption for the kernel is much too high, beyond what the machine is physically capable of.
[screenshot]

What did you expect to happen?

Correct measurement for kernel.

How can we reproduce it (as minimally and precisely as possible)?

Install Kepler 0.7.7 on AWS ECS. Observe the measured kernel consumption over some days.

Anything else we need to know?

No response

Kepler image tag

0.7.7

Kubernetes version

1.24

Cloud provider or bare metal

AWS EKS / ECS

OS version

Amazon Linux 2

Install tools

Helm

Kepler deployment config

Unchanged from Helm chart.

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI version: v1.12.0-eksbuild.2, CSI version: v1.14.0-eksbuild.1
jochen-schuettler added the kind/bug label on Feb 29, 2024
@rootfs (Contributor) commented Mar 4, 2024

I suspect this is an overflow issue. Did you see this issue before, or only during this period?
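
For what it's worth, a minimal sketch of the hypothesis (illustrative only, not Kepler's actual code): if a cumulative counter resets or wraps between two samples and the delta is taken with unsigned arithmetic, the result jumps to an absurdly large number, which then propagates into the energy figures.

```python
# Illustration of a counter reset turning into a huge delta under
# unsigned 64-bit arithmetic. Not Kepler code, just the failure mode.
UINT64_MOD = 2**64

def naive_delta_u64(prev: int, curr: int) -> int:
    # Buggy: assumes the counter is monotonic; a reset (curr < prev)
    # wraps the negative difference to a value near 1.8e19.
    return (curr - prev) % UINT64_MOD

def guarded_delta(prev: int, curr: int) -> int:
    # Treat a reset as "the counter started over at curr".
    return curr - prev if curr >= prev else curr

prev, curr = 123_456_789, 42          # counter reset between two scrapes
print(naive_delta_u64(prev, curr))    # ~1.8e19 -- absurd "joules"
print(guarded_delta(prev, curr))      # 42
```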

@jochen-schuettler (Author)

It started with the upgrade 0.7.3 -> 0.7.7, while 0.7.3 had other issues. The problem is ongoing.
[screenshot]

@jochen-schuettler (Author)

Namespace "tools" is also affected.
[screenshot]

@jochen-schuettler (Author)

Compare namespace "kepler" in the same time frame:
[screenshot]

@marceloamaral (Collaborator)

@jochen-schuettler thanks for investigating this.

  • Are you running on bare metal or a VM?
  • Can you check these metrics in Prometheus (see the query sketch after the list):
    • rate(kepler_container_bpf_cpu_time_ms_total[1m])
    • rate(kepler_container_task_clock_ms_total[1m])
    • rate(kepler_container_package_joules_total[1m])
    • rate(kepler_container_dram_joules_total[1m])
    • rate(kepler_container_other_joules_total[1m])
    • rate(kepler_node_package_joules_total[1m])
    • rate(kepler_node_dram_joules_total[1m])
    • rate(kepler_node_other_joules_total[1m])
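
For convenience, a minimal sketch of pulling these rates through the Prometheus HTTP API (the URL below is an assumption; adjust it, and the label fallback, to your setup):

```python
# Hedged sketch: evaluate the listed rate() expressions via Prometheus' HTTP API.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption -- point at your Prometheus

QUERIES = [
    "rate(kepler_container_bpf_cpu_time_ms_total[1m])",
    "rate(kepler_container_task_clock_ms_total[1m])",
    "rate(kepler_container_package_joules_total[1m])",
    "rate(kepler_container_dram_joules_total[1m])",
    "rate(kepler_container_other_joules_total[1m])",
    "rate(kepler_node_package_joules_total[1m])",
    "rate(kepler_node_dram_joules_total[1m])",
    "rate(kepler_node_other_joules_total[1m])",
]

for q in QUERIES:
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": q})
    with urllib.request.urlopen(url) as resp:
        series_list = json.load(resp)["data"]["result"]
    print(q)
    for s in series_list:
        label = s["metric"].get("container_name") or s["metric"].get("instance", "")
        print(f"  {label}: {s['value'][1]}")
```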

@jochen-schuettler (Author)

It's a VM on AWS ECS. I don't have access to Prometheus, only Grafana. This is the first metric, rate(kepler_container_bpf_cpu_time_ms_total[1m]):
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_task_clock_ms_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_package_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_dram_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_other_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_package_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_dram_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_other_joules_total[1m])
[screenshot]

@marceloamaral (Collaborator)

@sunya-ch the power models are using bpf_cpu_time, right?
Any idea about the huge numbers?

@jochen-schuettler (Author)

(No change with v0.7.8.)

@jochen-schuettler (Author)

@sunya-ch: see above, any idea?

@jochen-schuettler (Author)

The issue is still open. @marceloamaral, @rootfs: can I help you investigate in any way?

@TWpgo commented Apr 4, 2024

[screenshot]
I also get some really unrealistic measurements from time to time.

@jochen-schuettler (Author) commented Apr 8, 2024

Additional info: the VM is running on an AMD EPYC 7R13 processor (stepping 1). This is not included in cpus.yaml, so maybe the problem is related to that?
Edit: after further reading I realized that "Milan, Zen 3" is equivalent to the EPYC 7R13, so the CPU is in the list after all.
1.) But does Kepler know that?
2.) Can we somehow see which CPU model was used? (See the sketch below.)
3.) Also, is it possible these peaks happen when the VM host changes? Maybe AWS EKS shifts a VM quite freely between physical hosts, leading to steps in CPU time? (This is a total guess from an outsider's point of view.)
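
A hedged sketch for question 2: the simplest first check is the model string the node itself exposes; if I remember correctly, Kepler also reports what it detected via the labels of the kepler_node_info metric, though the exact label names may differ by version.

```python
# Hedged sketch: print the CPU model string visible on the node.
# Run on the node, or in a pod that can read the host's /proc.
def cpu_model(path: str = "/proc/cpuinfo") -> str:
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    # Expected on this instance type: something like "AMD EPYC 7R13 Processor"
    print(cpu_model())
```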

@rootfs (Contributor) commented Apr 9, 2024

Maybe there is some overflow during extended tests? @vprashar2929 have you seen this in any tests?

@rootfs (Contributor) commented Apr 9, 2024

@jochen-schuettler can you get the Prometheus stat kepler_container_joules_total over the period of the spikes? This stat is used to calculate the power consumption.
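
A hedged sketch of how that could be pulled for the spike window, aggregated per namespace so the spikes stand out (the Prometheus URL and label names are assumptions, as in the earlier snippet):

```python
# Hedged sketch: range-query kepler_container_joules_total as a per-namespace
# rate over a window covering the spikes, via Prometheus' HTTP API.
import json
import time
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"   # assumption
QUERY = "sum by (container_namespace) (rate(kepler_container_joules_total[5m]))"

end = int(time.time())
start = end - 6 * 3600               # last 6 hours; widen to cover the spikes

params = urllib.parse.urlencode(
    {"query": QUERY, "start": start, "end": end, "step": "60"}
)
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}") as resp:
    for series in json.load(resp)["data"]["result"]:
        ns = series["metric"].get("container_namespace", "?")
        peak = max(float(v) for _, v in series["values"])
        print(f"{ns}: peak {peak:.1f} W")
```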

@jochen-schuettler (Author)

[screenshot]

@vprashar2929 (Contributor)

@rootfs I have seen this before; in the case of VMs, Kepler does report higher energy usage. I think it started only after 0.7.2. See #1142.

@jochen-schuettler (Author)

We see occurrences in a local setup as well, not only for the kernel namespace, on both bare metal and VM. Looking at bpf_cpu_time_ms there reveals rapid rises over a few minutes (a sanity check for those rises is sketched below). Kepler 0.7.2 seems OK, yes.
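
A hedged sketch of the sanity check I mean: summed over all containers, the reported BPF CPU time per second cannot physically exceed 1000 ms times the number of CPUs on the node, so anything above that ceiling points at a counter reset/overflow rather than real load (the Prometheus URL is an assumption, as before).

```python
# Hedged sketch: compare the summed BPF CPU-time rate against the node's
# physical ceiling (1000 ms of CPU time per second per core).
import json
import os
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"      # assumption
NUM_CPUS = os.cpu_count() or 1          # match the node under investigation
CEILING_MS_PER_S = 1000 * NUM_CPUS

query = "sum(rate(kepler_container_bpf_cpu_time_ms_total[1m]))"
params = urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{params}") as resp:
    result = json.load(resp)["data"]["result"]

observed = float(result[0]["value"][1]) if result else 0.0
print(f"observed {observed:.0f} ms/s vs ceiling {CEILING_MS_PER_S} ms/s")
if observed > CEILING_MS_PER_S:
    print("-> physically impossible: counter reset/overflow, not real load")
```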
