
Kernel consumption too high #1279

Open · jochen-schuettler opened this issue Feb 29, 2024 · 25 comments
Labels: kind/bug

Comments

@jochen-schuettler

What happened?

The measured consumption for the kernel is much too high, beyond what the machine is physically capable of.
[screenshot]

What did you expect to happen?

Correct measurement for kernel.

How can we reproduce it (as minimally and precisely as possible)?

Install Kepler 0.7.7 on AWS ECS. Observe the measured kernel consumption over some days.

Anything else we need to know?

No response

Kepler image tag

0.7.7

Kubernetes version

1.24

Cloud provider or bare metal

AWS EKS / ECS

OS version

Amazon Linux 2

Install tools

Helm

Kepler deployment config

Unchanged from Helm chart.

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

CNI version: v1.12.0-eksbuild.2, CSI version: v1.14.0-eksbuild.1
jochen-schuettler added the kind/bug label on Feb 29, 2024
@rootfs (Contributor) commented Mar 4, 2024

I suspect this is an overflow issue. Did you see this issue before, or only during this period?
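
For what it's worth, a minimal sketch of the hypothesis (illustrative only, not Kepler's actual code): if a cumulative counter resets or wraps between two samples and the delta is taken with unsigned arithmetic, the result jumps to an absurdly large number, which then propagates into the energy figures.

```python
# Illustration of a counter reset turning into a huge delta under
# unsigned 64-bit arithmetic. Not Kepler code, just the failure mode.
UINT64_MOD = 2**64

def naive_delta_u64(prev: int, curr: int) -> int:
    # Buggy: assumes the counter is monotonic; a reset (curr < prev)
    # wraps the negative difference to a value near 1.8e19.
    return (curr - prev) % UINT64_MOD

def guarded_delta(prev: int, curr: int) -> int:
    # Treat a reset as "the counter started over at curr".
    return curr - prev if curr >= prev else curr

prev, curr = 123_456_789, 42          # counter reset between two scrapes
print(naive_delta_u64(prev, curr))    # ~1.8e19 -- absurd "joules"
print(guarded_delta(prev, curr))      # 42
```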

@jochen-schuettler (Author)

It started with the upgrade 0.7.3 -> 0.7.7, while 0.7.3 had other issues. The problem is ongoing.
[screenshot]

@jochen-schuettler (Author)

Namespace "tools" is also affected.
[screenshot]

@jochen-schuettler (Author)

Compare namespace "kepler" in the same time frame:
[screenshot]

@marceloamaral (Collaborator)

@jochen-schuettler thanks for investigating this.

  • Are you running on bare metal or a VM?
  • Can you check these metrics in Prometheus (see the query sketch after the list):
    • rate(kepler_container_bpf_cpu_time_ms_total[1m])
    • rate(kepler_container_task_clock_ms_total[1m])
    • rate(kepler_container_package_joules_total[1m])
    • rate(kepler_container_dram_joules_total[1m])
    • rate(kepler_container_other_joules_total[1m])
    • rate(kepler_node_package_joules_total[1m])
    • rate(kepler_node_dram_joules_total[1m])
    • rate(kepler_node_other_joules_total[1m])
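
For convenience, a minimal sketch of pulling these rates through the Prometheus HTTP API (the URL below is an assumption; adjust it, and the label fallback, to your setup):

```python
# Hedged sketch: evaluate the listed rate() expressions via Prometheus' HTTP API.
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumption -- point at your Prometheus

QUERIES = [
    "rate(kepler_container_bpf_cpu_time_ms_total[1m])",
    "rate(kepler_container_task_clock_ms_total[1m])",
    "rate(kepler_container_package_joules_total[1m])",
    "rate(kepler_container_dram_joules_total[1m])",
    "rate(kepler_container_other_joules_total[1m])",
    "rate(kepler_node_package_joules_total[1m])",
    "rate(kepler_node_dram_joules_total[1m])",
    "rate(kepler_node_other_joules_total[1m])",
]

for q in QUERIES:
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": q})
    with urllib.request.urlopen(url) as resp:
        series_list = json.load(resp)["data"]["result"]
    print(q)
    for s in series_list:
        label = s["metric"].get("container_name") or s["metric"].get("instance", "")
        print(f"  {label}: {s['value'][1]}")
```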

@jochen-schuettler (Author)

It's a VM on AWS ECS. I don't have access to Prometheus, only Grafana. This is the first metric, rate(kepler_container_bpf_cpu_time_ms_total[1m]):
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_task_clock_ms_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_package_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_dram_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_container_other_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_package_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_dram_joules_total[1m])
[screenshot]

@jochen-schuettler (Author)

rate(kepler_node_other_joules_total[1m])
[screenshot]

@marceloamaral (Collaborator)

@sunya-ch the power models are using bpf_cpu_time, right?
Any idea about the huge numbers?

@jochen-schuettler (Author)

(No change with v0.7.8.)

@jochen-schuettler (Author)

@sunya-ch: see above, any idea?

@jochen-schuettler (Author)

The issue is still open. @marceloamaral, @rootfs: can I help you investigate in any way?

@TWpgo commented Apr 4, 2024

[screenshot]
I also get some really unrealistic measurements from time to time.

@jochen-schuettler (Author) commented Apr 8, 2024

Additional info: the VM is running on an AMD EPYC 7R13 processor (stepping 1). This is not included in cpus.yaml, so maybe the problem is related to that?
Edit: after further reading I realized that "Milan, Zen 3" is equivalent to the EPYC 7R13, so the CPU is in the list after all.
1.) But does Kepler know that?
2.) Can we somehow see which CPU model was used? (See the sketch below.)
3.) Also, is it possible these peaks happen when the VM host changes? Maybe AWS EKS shifts a VM quite freely between physical hosts, leading to steps in CPU time? (This is a total guess from an outsider's point of view.)
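
A hedged sketch for question 2: the simplest first check is the model string the node itself exposes; if I remember correctly, Kepler also reports what it detected via the labels of the kepler_node_info metric, though the exact label names may differ by version.

```python
# Hedged sketch: print the CPU model string visible on the node.
# Run on the node, or in a pod that can read the host's /proc.
def cpu_model(path: str = "/proc/cpuinfo") -> str:
    with open(path) as f:
        for line in f:
            if line.startswith("model name"):
                return line.split(":", 1)[1].strip()
    return "unknown"

if __name__ == "__main__":
    # Expected on this instance type: something like "AMD EPYC 7R13 Processor"
    print(cpu_model())
```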

@rootfs (Contributor) commented Apr 9, 2024

Maybe there is some overflow during extended tests? @vprashar2929 have you seen this in any tests?

@rootfs (Contributor) commented Apr 9, 2024

@jochen-schuettler can you get the Prometheus stat kepler_container_joules_total over the period of the spikes? This stat is used to calculate the power consumption.
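
A hedged sketch of how that could be pulled for the spike window, aggregated per namespace so the spikes stand out (the Prometheus URL and label names are assumptions, as in the earlier snippet):

```python
# Hedged sketch: range-query kepler_container_joules_total as a per-namespace
# rate over a window covering the spikes, via Prometheus' HTTP API.
import json
import time
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"   # assumption
QUERY = "sum by (container_namespace) (rate(kepler_container_joules_total[5m]))"

end = int(time.time())
start = end - 6 * 3600               # last 6 hours; widen to cover the spikes

params = urllib.parse.urlencode(
    {"query": QUERY, "start": start, "end": end, "step": "60"}
)
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query_range?{params}") as resp:
    for series in json.load(resp)["data"]["result"]:
        ns = series["metric"].get("container_namespace", "?")
        peak = max(float(v) for _, v in series["values"])
        print(f"{ns}: peak {peak:.1f} W")
```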

@jochen-schuettler (Author)

[screenshot]

@vprashar2929 (Contributor)

@rootfs I have seen this before; in the case of VMs, Kepler does report higher energy usage. I think it started only after 0.7.2. See #1142.

@jochen-schuettler (Author)

We see occurrences in a local setup as well, not only for the kernel namespace, on both bare metal and VM. Looking at bpf_cpu_time_ms there reveals rapid rises over a few minutes (a sanity check for those rises is sketched below). Kepler 0.7.2 seems OK, yes.
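
A hedged sketch of the sanity check I mean: summed over all containers, the reported BPF CPU time per second cannot physically exceed 1000 ms times the number of CPUs on the node, so anything above that ceiling points at a counter reset/overflow rather than real load (the Prometheus URL is an assumption, as before).

```python
# Hedged sketch: compare the summed BPF CPU-time rate against the node's
# physical ceiling (1000 ms of CPU time per second per core).
import json
import os
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"      # assumption
NUM_CPUS = os.cpu_count() or 1          # match the node under investigation
CEILING_MS_PER_S = 1000 * NUM_CPUS

query = "sum(rate(kepler_container_bpf_cpu_time_ms_total[1m]))"
params = urllib.parse.urlencode({"query": query})
with urllib.request.urlopen(f"{PROM_URL}/api/v1/query?{params}") as resp:
    result = json.load(resp)["data"]["result"]

observed = float(result[0]["value"][1]) if result else 0.0
print(f"observed {observed:.0f} ms/s vs ceiling {CEILING_MS_PER_S} ms/s")
if observed > CEILING_MS_PER_S:
    print("-> physically impossible: counter reset/overflow, not real load")
```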
