
chore(pkg/bpf): Replace libbpfgo with cilium/ebpf #1438

Open · wants to merge 1 commit into main from cilium-ebpf

Conversation

@dave-tucker (Collaborator) commented May 15, 2024

cilium/ebpf is a pure Go eBPF package and is used in a number of popular
cloud-native projects. The benefits to Kepler are:

  1. Bytecode is built using bpf2go and the C and Go structs are kept in
    sync automatically
  2. There is no requirement for Cgo anymore and therefore no requirement
    to have libbpf or libelf installed to compile and/or to be
    dynamically linked at runtime
  3. Simplified packaging as the correct bytecode is contained within the
    kepler binary

Overall I'm happy with this change, but there is only one thing that
bugs me.

We have to check in the bytecode object files (e.g. kepler.bpfel.o) or
the Go tooling (go lint/go vet) complains about the missing files.
I couldn't reliably get `go generate ./...` to work to compile these
files in CI. This is something which should be relatively easy to fix
in the Makefile/CI environment before we cut a release.
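
For reference, the bpf2go wiring behind point 1 typically looks like the directive below. The package name, C file path, and the -type event flag are illustrative assumptions, not the exact values used in this PR:

// pkg/bpf/gen.go (illustrative sketch, not the PR's exact file)
package bpf

// bpf2go invokes clang on the C source and emits kepler_bpfel.go /
// kepler_bpfeb.go plus kepler.bpfel.o / kepler.bpfeb.o. The generated Go code
// embeds the bytecode and declares Go structs that mirror the named C types,
// so the two stay in sync and no Cgo, libbpf, or libelf is needed at runtime.
//go:generate go run github.com/cilium/ebpf/cmd/bpf2go -type event kepler kepler.bpf.c -- -I./include

Running `go generate ./...` (or an equivalent Makefile target) regenerates both the Go bindings and the object files.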

Depends-On: #1435
Fixes: #1424

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 7 times, most recently from 5fa556a to 42303d8 on May 15, 2024 17:53
@vimalk78 (Collaborator) commented:

@rootfs @dave-tucker which release is this targeted for?

@dave-tucker (Collaborator, Author) commented:

I don't have a target in mind. But I doubt it will be ready for review until Friday at the earliest.

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 2 times, most recently from 08d0a91 to d031ddd on May 16, 2024 12:56
@dave-tucker dave-tucker marked this pull request as ready for review May 16, 2024 12:56
@dave-tucker dave-tucker changed the title Migrate to cilium/ebpf chore(pkg/bpf): Replace libbpfgo with cilium/ebpf May 16, 2024
@dave-tucker dave-tucker marked this pull request as draft May 16, 2024 14:24
@dave-tucker (Collaborator, Author) commented:

I thought I was close, but I think I found an issue in #1443 that causes failures when perf events can't be opened for some reason. I've cherry-picked that here to see if it makes CI green - it applies to the libbpfgo code too, so it's another PR.

github-actions bot commented May 20, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Key Modifications:

  1. Replacement of libbpfgo with cilium/ebpf: The Kepler project has replaced libbpfgo with cilium/ebpf, eliminating Cgo, libbpf, and libelf dependencies. This change introduces bytecode object files that need to be checked in.
  2. Internal updates to cilium/ebpf package: Various internal updates have been made to the cilium/ebpf package, including changes to sysenc, tracefs, unix, and link modules.
  3. Update to .pre-commit-config.yaml: The exclude pattern in the repos configuration has been updated to exclude bpf/include/ instead of libbpf/include/.

Impact on Codebase:

  • The replacement of libbpfgo with cilium/ebpf may affect dependent code due to internal changes and removals of functions and structures.
  • The updates to the cilium/ebpf package are internal and do not affect the external interface or behavior of the code.

Observations and Suggestions:

  • It's essential to review the internal changes to the cilium/ebpf package to ensure they do not introduce any breaking changes or compatibility issues.
  • Consider adding tests to verify the functionality of the updated code, especially for the replaced dependencies.
  • The introduction of bytecode object files may require additional documentation or guidelines for developers to ensure proper usage and maintenance.

Overall, this pull request appears to be a significant refactoring effort to modernize the Kepler project's dependencies and internal implementation. A thorough review and testing of the changes are necessary to ensure the codebase remains stable and maintainable.

@dave-tucker dave-tucker marked this pull request as ready for review May 20, 2024 16:12
Review threads (resolved): pkg/bpf/kepler_bpfel.go, packaging/rpm/kepler.spec, .gitignore
@sthaha (Collaborator) left a comment

lgtm!

@sthaha (Collaborator) commented May 21, 2024

@dave-tucker we may want to look at this crash before merging.
output of make build


❯ sudo ./_output/bin/kepler

I0522 09:33:18.758775 2437499 gpu.go:38] Trying to initialize GPU collector using dcgm
W0522 09:33:18.758911 2437499 gpu_dcgm.go:104] There is no DCGM daemon running in the host: libdcgm.so not Found
W0522 09:33:18.758976 2437499 gpu_dcgm.go:108] Could not start DCGM. Error: libdcgm.so not Found
I0522 09:33:18.758997 2437499 gpu.go:45] Error initializing dcgm: not able to connect to DCGM: libdcgm.so not Found
I0522 09:33:18.759008 2437499 gpu.go:38] Trying to initialize GPU collector using nvidia-nvml
I0522 09:33:18.759111 2437499 gpu.go:45] Error initializing nvidia-nvml: failed to init nvml. ERROR_LIBRARY_NOT_FOUND
I0522 09:33:18.759121 2437499 gpu.go:38] Trying to initialize GPU collector using dummy
I0522 09:33:18.759134 2437499 gpu.go:42] Using dummy to obtain gpu power
I0522 09:33:18.760886 2437499 exporter.go:85] Kepler running on version: v0.7.10-5-gf1097c0e2c498-dirty
I0522 09:33:18.760897 2437499 config.go:283] using gCgroup ID in the BPF program: true
I0522 09:33:18.760963 2437499 config.go:285] kernel version: 6.7
I0522 09:33:18.761045 2437499 config.go:310] The Idle power will be exposed. Are you running on Baremetal or using single VM per node?
I0522 09:33:18.761127 2437499 redfish.go:169] failed to get redfish credential file path
I0522 09:33:18.761632 2437499 acpi.go:71] Could not find any ACPI power meter path. Is it a VM?
I0522 09:33:18.763071 2437499 exporter.go:94] Number of CPUs: 8
I0522 09:33:19.257525 2437499 watcher.go:67] Using in cluster k8s config
I0522 09:33:19.257541 2437499 watcher.go:74] failed to get config: unable to load in-cluster configuration, KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined
I0522 09:33:19.257553 2437499 watcher.go:126] k8s APIserver watcher was not enabled
I0522 09:33:19.257642 2437499 prometheus_collector.go:96] Registered Container Prometheus metrics
I0522 09:33:19.257688 2437499 prometheus_collector.go:101] Registered VM Prometheus metrics
I0522 09:33:19.257724 2437499 prometheus_collector.go:105] Registered Node Prometheus metrics
I0522 09:33:19.257826 2437499 node_platform_energy.go:54] Failed to create Regressor/AbsPower Power Model to estimate Node Platform Power: open /var/lib/kepler/data/acpi_AbsPowerModel.json: no such file or directory
I0522 09:33:19.257965 2437499 exporter.go:175] starting to listen on 0.0.0.0:8888
I0522 09:33:19.257980 2437499 exporter.go:181] Started Kepler in 497.120003ms
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x854f72]

goroutine 53 [running]:
github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18b8388, 0x7}, 0x0)
        kepler/pkg/collector/stats/types/types.go:108 +0x32
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16a0640?, 0xc000190150, 0x1?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b40fe8, 0xc000692f00}, 0xc0000736d0?)
        kepler/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bfe4c0, 0x0?)
        kepler/pkg/collector/metric_collector.go:191 +0x5d
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bfe4c0)
        kepler/pkg/collector/metric_collector.go:155 +0x54
github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
        kepler/pkg/collector/metric_collector.go:106 +0x53
github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
        kepler/pkg/manager/manager.go:74 +0x7b
created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
        kepler/pkg/manager/manager.go:66 +0x65

@sthaha (Collaborator) commented May 21, 2024

I get the same using the compose file

❯ cd manifests/compose/dev
❯ docker compose up --build kepler-dev

kepler-dev-1  | I0521 23:32:42.275436 2436610 process_bpf_collector.go:88] process worker (pid=2202720, cgroup=1466240) has 0 task clock time 2344773 CPU cycles, 883246 instructions, 232766 cache misses, 68 page cache hits
kepler-dev-1  | panic: runtime error: invalid memory address or nil pointer dereference
kepler-dev-1  | [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8788f2]
kepler-dev-1  |
kepler-dev-1  | goroutine 30 [running]:
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/stats/types.(*UInt64StatCollection).AddDeltaStat(0x0, {0x18ee6d1, 0x7}, 0x0)
kepler-dev-1  |         /workspace/pkg/collector/stats/types/types.go:108 +0x32
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.updateSWCounters(0x16d05e0?, 0xc0003e01c0, 0x219c60?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:40 +0xc5
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector/resourceutilization/bpf.UpdateProcessBPFMetrics({0x1b7aea8, 0xc0005977c0}, 0xc00048aed0?)
kepler-dev-1  |         /workspace/pkg/collector/resourceutilization/bpf/process_bpf_collector.go:129 +0xa39
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateProcessResourceUtilizationMetrics(0xc002bcc500, 0x0?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:191 +0x5d
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).updateResourceUtilizationMetrics(0xc002bcc500)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:155 +0x54
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/collector.(*Collector).Update(0xb2d05e00?)
kepler-dev-1  |         /workspace/pkg/collector/metric_collector.go:106 +0x53
kepler-dev-1  | github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start.func1()
kepler-dev-1  |         /workspace/pkg/manager/manager.go:74 +0x7b
kepler-dev-1  | created by github.com/sustainable-computing-io/kepler/pkg/manager.(*CollectorManager).Start in goroutine 1
kepler-dev-1  |         /workspace/pkg/manager/manager.go:66 +0x65
kepler-dev-1 exited with code 2

@dave-tucker (Collaborator, Author) commented:

@sthaha I just force pushed to fix a small bug in the cleanup logic (forgetting to clear the enabledHardwareCounters set).
I know that segfault happens when there is a disagreement between what pkg/bpf declares as supported and what pkg/collector/stats uses to populate the map on the Stats struct, but that should have been taken care of in #1443.

I've tried locally on Fedora 40 with both running the binary and compose and can't reproduce. I've also tried on RHEL 9.4 and RHEL 9.2 and still can't reproduce. I'm also not seeing anything in the logs stating that any of the eBPF probes couldn't be loaded 🤔

TL;DR I'm a bit stuck on fixing this as I can't reproduce the issue.
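
For context, the panic in the traces above is a method call on a nil *UInt64StatCollection. The sketch below shows a defensive lookup of that shape; the type and field names are simplified stand-ins rather than Kepler's actual code, and the real fix is keeping pkg/bpf and pkg/collector/stats in agreement as described for #1443:

package main

import "log"

// UInt64StatCollection is a simplified stand-in for Kepler's stats type; a
// nil receiver here is exactly what triggers the SIGSEGV in the traces above.
type UInt64StatCollection struct{ stats map[string]uint64 }

func (s *UInt64StatCollection) AddDeltaStat(key string, delta uint64) {
	s.stats[key] += delta // panics if s is nil
}

func main() {
	// Counters that pkg/bpf reported as supported (hypothetical contents).
	usage := map[string]*UInt64StatCollection{}
	counter := "cpu_instructions" // hypothetical counter name

	// Defensive lookup: only add the delta when the collection was initialised,
	// i.e. both sides agree that the counter is supported.
	if sc, ok := usage[counter]; ok && sc != nil {
		sc.AddDeltaStat("socket0", 42)
	} else {
		log.Printf("skipping counter %q: not initialised in Stats", counter)
	}
}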

@sthaha (Collaborator) commented May 23, 2024

I tried with the latest commit, 6d8539b, and it runs fine.

@dave-tucker (Collaborator, Author) commented:

This is ready to go. I'm holding off on rebasing until #1481 has merged, so I'm marking this as a draft until then.

@dave-tucker dave-tucker marked this pull request as draft June 4, 2024 18:32
klog.Infof("Number of CPUs: %d", numCPU)
for _, m := range specs.Maps {
// Only resize maps that have a MaxEntries of NUM_CPUS constant
if m.MaxEntries == 128 {
Collaborator left a review comment:

Can we check the condition by map name prefix, if not by map name? The size comparison does not make it clear why we are resizing the map according to the number of CPUs.

@dave-tucker (Collaborator, Author) replied Jun 10, 2024:

Well, 128 is indeed the magic NUM_CPUS constant that we define on the eBPF side.
If you use NUM_CPUS as your max-entries value, the contract is that userspace will resize your map to the number of CPUs.

I prefer this to keeping a list of map names and/or prefixes around, which could get stale.
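
For illustration, the contract described above maps onto cilium/ebpf roughly as follows. This is a sketch under assumptions (object file path, logging), not the PR's exact code:

package main

import (
	"log"
	"runtime"

	"github.com/cilium/ebpf"
)

func main() {
	// Load the compiled bytecode; in Kepler the bpf2go-generated loader would
	// normally provide this spec (the path here is an assumption).
	spec, err := ebpf.LoadCollectionSpec("kepler.bpfel.o")
	if err != nil {
		log.Fatal(err)
	}

	numCPU := uint32(runtime.NumCPU())
	for name, m := range spec.Maps {
		// NUM_CPUS is defined as 128 on the eBPF side; any map declared with
		// that MaxEntries value is resized to the real CPU count by userspace.
		if m.MaxEntries == 128 {
			m.MaxEntries = numCPU
			log.Printf("resized map %s to %d entries", name, numCPU)
		}
	}

	coll, err := ebpf.NewCollection(spec)
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()
}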

@dave-tucker (Collaborator, Author) commented:

Snyk checks are failing, but not due to changes in this PR:
It seems they are complaining about license issues in yaml.v1.

That comes in via:

$ go mod graph | grep gopkg.in/yaml.v1
howett.net/plist@v1.0.0 gopkg.in/yaml.v1@v1.0.0-20140924161607-9f9df34309c0

Which is used by:

$ go mod why howett.net/plist       
# howett.net/plist
github.com/sustainable-computing-io/kepler/cmd/validator
github.com/jaypipes/ghw
github.com/jaypipes/ghw/pkg/block
howett.net/plist

The other issues are either in:

  • Packages that aren't in our dependency tree: i.e. gogs.io/gogs
  • Dependencies of our dependencies - most of which pre-date this PR

@sthaha (Collaborator) commented Jun 12, 2024

Snyk checks are failing, but not due to changes in this PR:
It seems they are complaining about license issues in yaml.v1.

I can't see the details of the failure ... but I wonder why the other PRs pass CI while this fails 🤔

@dave-tucker (Collaborator, Author) commented:

Snyk checks are failing, but not due to changes in this PR:
It seems they are complaining about license issues in yaml.v1.

I can't see the details of the failure ... but I wonder why the other PRs pass CI while this fails 🤔

I think anything that touches go.mod causes it to rescan dependencies and then fail. See #1516

@marceloamaral (Collaborator) commented:

@dave-tucker can you sync this PR with main?

@dave-tucker (Collaborator, Author) commented:

@dave-tucker can you sync this PR with main?

Done!

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 3 times, most recently from d64c96f to 1aac060 on June 14, 2024 13:24
@dave-tucker dave-tucker mentioned this pull request Jun 14, 2024
@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 3 times, most recently from 5fa77d5 to 138a422 on June 17, 2024 14:19
@sthaha (Collaborator) commented Jun 18, 2024

LGTM

@dave-tucker dave-tucker force-pushed the cilium-ebpf branch 2 times, most recently from 274f339 to a00fa0d on June 18, 2024 12:48
Commit message: as in the PR description above.
Signed-off-by: Dave Tucker <dave@dtucker.co.uk>