Kepler not reporting correct process name in metrics #1354

Open
vprashar2929 opened this issue Apr 16, 2024 · 1 comment
Labels
kind/bug report bug issue

Comments

@vprashar2929
Contributor

What happened?

When Kepler is deployed on a machine using the latest image, it currently reports the wrong process name in the exported metrics.

Attaching some screenshots for reference:

  • Actual process name and PID running on the system:
ps -ef | grep 75577
qemu       75577       1  8 Apr15 ?        01:10:07 /usr/bin/qemu-system-x86_64 -name guest=fedora39,debug-threads=on -S 

Output from pstree command:

pstree -p | grep qemu
           |-qemu-system-x86(75577)-+-{qemu-system-x86}(75605)
           |                        |-{qemu-system-x86}(75617)
           |                        |-{qemu-system-x86}(75618)
           |                        |-{qemu-system-x86}(75619)
           |                        |-{qemu-system-x86}(75620)
           |                        |-{qemu-system-x86}(75622)
           |                        |-{qemu-system-x86}(109718)
           |                        |-{qemu-system-x86}(109719)
           |                        |-{qemu-system-x86}(109720)
           |                        `-{qemu-system-x86}(109721)
  • For PID 75577, kepler_process_platform_joules_total reports command="CPU 0/KVM", which is wrong:

Screenshot 2024-04-16 at 1 21 16 PM

What did you expect to happen?

Kepler should report the correct command name in the metrics that it exports.

How can we reproduce it (as minimally and precisely as possible)?

Run Kepler either on Kubernetes or locally using the docker-compose setup available here: https://github.com/sustainable-computing-io/kepler/tree/main/hackdocker-compose

Anything else we need to know?

No response

Kepler image tag

latest

Kubernetes version

$ kubectl version
# paste output here

Cloud provider or bare metal

OS version

# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here

Install tools

Kepler deployment config

For on kubernetes:

$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE} 
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE} 

For standalone:

put your Kepler command argument here

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

@vprashar2929 vprashar2929 added the kind/bug report bug issue label Apr 16, 2024
@dave-tucker
Collaborator

I know why this is 🎉
See: https://github.com/sustainable-computing-io/kepler/blob/main/bpfassets/libbpf/src/kepler.bpf.c#L247C3-L247C23

As @vimalk78 found out, from eBPF we record the:

  • PID (as seen by the kernel)
  • TGID (as seen by the kernel)
  • Comm

From the perspective of userland, the PID is actually what the kernel calls the TGID - you'll notice that we accidentally on-purpose switch the order of these fields in the definition of the struct:
https://github.com/sustainable-computing-io/kepler/blob/main/pkg/bpf/types.go#L49-L50

TL;DR: the comm that we record belongs to the pid as the kernel sees it (a single thread), not as userland sees it, so you will indeed get values like CPU 0/KVM.
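To see the mismatch outside of Kepler, a small procfs sketch may help. It is not Kepler code: the helper names `thread_comms` and `process_comm` are hypothetical, and it only assumes the standard Linux layout where `/proc/<tgid>/comm` holds the main thread's name and `/proc/<tgid>/task/<tid>/comm` holds each thread's own name.

```python
from pathlib import Path
from typing import Dict


def read_comm(path: Path) -> str:
    """Return the contents of a procfs comm file without the trailing newline."""
    return path.read_text().rstrip("\n")


def process_comm(tgid: int, proc_root: str = "/proc") -> str:
    """Name userland associates with the process (the kernel's TGID)."""
    return read_comm(Path(proc_root) / str(tgid) / "comm")


def thread_comms(tgid: int, proc_root: str = "/proc") -> Dict[int, str]:
    """Map each kernel-level PID (thread ID) of the process to its own comm."""
    task_dir = Path(proc_root) / str(tgid) / "task"
    return {int(t.name): read_comm(t / "comm") for t in task_dir.iterdir()}
```

On the affected host, comparing `process_comm(75577)` with the values in `thread_comms(75577)` should show whether one of the threads carries the `CPU 0/KVM` name that ended up in the metric.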

I think the fix required here is going to be either:

  1. Don't record the comm from eBPF and look it up from procfs instead
  2. Only set the comm if pid == tgid
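
A rough sketch of how the two options could look, written in Python for illustration only rather than as the actual Kepler change. `Sample`, `procfs_comm` and `resolve_comms` are hypothetical names, and combining option 2 with a procfs fallback from option 1 is my own assumption about how they might fit together.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable, NamedTuple, Optional


class Sample(NamedTuple):
    pid: int   # PID as the kernel sees it, i.e. the thread ID
    tgid: int  # TGID as the kernel sees it, i.e. the PID userland sees
    comm: str  # comm of the task identified by `pid`


def procfs_comm(tgid: int, proc_root: str = "/proc") -> Optional[str]:
    """Option 1: look the name up from procfs; None if the process is gone."""
    try:
        return (Path(proc_root) / str(tgid) / "comm").read_text().rstrip("\n")
    except OSError:
        return None


def resolve_comms(
    samples: Iterable[Sample],
    lookup: Callable[[int], Optional[str]] = procfs_comm,
) -> Dict[int, str]:
    """Option 2: trust a sample's comm only when pid == tgid.

    Processes for which only non-main threads were sampled fall back to
    `lookup` (assumed here to be the procfs read from option 1).
    """
    comms: Dict[int, str] = {}
    unresolved = set()
    for s in samples:
        if s.pid == s.tgid:
            comms[s.tgid] = s.comm
            unresolved.discard(s.tgid)
        elif s.tgid not in comms:
            unresolved.add(s.tgid)
    for tgid in unresolved:
        name = lookup(tgid)
        if name is not None:
            comms[tgid] = name
    return comms
```

With this shape, a sample taken on a vCPU thread never overwrites the process name; it either waits for a sample from the main thread or is resolved through procfs.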

I'm going to try and verify this theory on my development machine at some point later this week.
