Skip to content

[DO NOT MERGE] Ringbuf redux #1687

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

dave-tucker
Copy link
Collaborator

Brings back the ringbuf work, just to see if it's something we want to continue with or not.

rootfs and others added 30 commits May 19, 2024 13:56
Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: sustainable-computing-bot <bot@sustainable-computing.io>
removes debugging logging from GPU metrics

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
…d-verbose

fix: remove logging while collecting GPU metrics
feat(ci)add equinix metal instance to CI
Signed-off-by: sustainable-computing-bot <bot@sustainable-computing.io>
This PR:
* Corrects the query for `bpf_cpu_time_ms_total` metric
* Fixes the display issue for panels showing
  CPU time, cache miss, CPU cycles and instructions

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
…x-dev-dash

fix(dev-dashboard): update and correct metrics in dev dashboard
Signed-off-by: sustainable-computing-bot <bot@sustainable-computing.io>
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
This commit simplifies the containerID from Cgroup path parsing logic.
All container IDs from all supported runtimes (including podman) are
64 character alphanumeric strings.

As such we can simply strip the `.scope` suffix and take the last 64
characters.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Signed-off-by: Huamin Chen <hchen@redhat.com>
…puting-io/validator

chore(validator)update validator usage; remove job from prom query
…g-io#1473)

This commit adds a `label` - `version` to `kepler_exporter_build_info`
metrics to expose the version of kepler installed on the machine / node.

Fixes: sustainable-computing-io#1449

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
* add vm name option to validator

Signed-off-by: Huamin Chen <hchen@redhat.com>

* validator: externalize the query

Signed-off-by: Huamin Chen <hchen@redhat.com>

---------

Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
…-io#1477)

Previously only a `MapSize` (lesser than size of processs) of entries
used to be copied from processes bpf map. The right use of a batch size
less than the size of map is as follows:

Ref: https://lwn.net/Articles/797808/
```
    p_key = NULL;
    p_next_key = &key;
    while (true) {
       err = bpf_map_lookup_batch(fd, p_key, &p_next_key, keys, values,
                                  &batch_size, elem_flags, flags);
       if (err) ...
       if (p_next_key) break; // done
       if (!p_key) p_key = p_next_key;
    }
```

This PR fixes the issue by creating a entries byte array with the same
size as the `processes` map.

Additionally, this commit makes use of `bytes.Reader` instead of
bytes.Buffer to avoid unnecessary copy of ephemeral byte array.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
Signed-off-by: sustainable-computing-bot <bot@sustainable-computing.io>
…nable-computing-io#1461)

* Adding New Metric Cases to Case module

Signed-off-by: Kaiyi <kaiyiliu21@gmail.com>

* WIP: Add Test Cases for Prometheus, Cases, and Stresser

I included basic pytest test cases for pure methods used for
producing error metrics from prometheus response data, cleaning
data from prometheus response data, and generating prom queries
for validation. I will also include basic test cases for
stresser related methods like for connecting to vm, copying
files to vm, and running the scripts for vm.

The purpose of these test cases are to ensure the code made
during PRs do not break the functionality of the validator.

Reported-by: Kaiyi <kaliu@redhat.com>
Signed-off-by: Kaiyi <kaiyiliu21@gmail.com>

* Include additional test cases for prometheus and stresser.

Included fixtures for cases, prometheus, and stresser. Included
additional test cases for prometheus and stresser.

Signed-off-by: Kaiyi <kaiyiliu21@gmail.com>

* Skip Cases test

Due to previous changes to the Cases API, this test case needs to be updated in future.

Signed-off-by: Kaiyi <kaiyiliu21@gmail.com>

---------

Signed-off-by: Kaiyi <kaiyiliu21@gmail.com>
…ing gpu metrics (sustainable-computing-io#1480)

Signed-off-by: marceloamaral <marcelo.amaral1@ibm.com>
Signed-off-by: sustainable-computing-bot <bot@sustainable-computing.io>
sthaha and others added 26 commits August 1, 2024 10:14
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…unused-config

cleanup(config): remove unused kernel src dir
The resolves pid 0 to system_processes without (throwing) any error
since there is no command associated with it. This commit fixes the
following log

I0802 03:24:03.647656  733341 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running: stat /proc/0: no such file or directory, set comm=system_processes
I0802 03:24:03.648079  733341 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running, set comm=system_processes
0802 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running: stat /proc/0: no such file or directory, set comm=system_processes
...
I0802 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running, set comm=system_processes

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…lve-pid-0

fix: resolve pid 0 to system_processes
…#1671)

Processing events in the same goroutine as the ring buffer reader
requires acquiring a mutex, which blocks ringbuf event processing
causing a backlog. To avoid this, send events via a buffered
channel to a dedicated event processing goroutine to ensure
that the ringbuf remains unblocked. This has decreased CPU
load from 1-3% on my machine to 0-1% CPU load.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Bumps the go-dependencies group with 2 updates: [github.com/onsi/gomega](https://github.com/onsi/gomega) and [golang.org/x/sys](https://github.com/golang/sys).


Updates `github.com/onsi/gomega` from 1.34.0 to 1.34.1
- [Release notes](https://github.com/onsi/gomega/releases)
- [Changelog](https://github.com/onsi/gomega/blob/master/CHANGELOG.md)
- [Commits](onsi/gomega@v1.34.0...v1.34.1)

Updates `golang.org/x/sys` from 0.22.0 to 0.23.0
- [Commits](golang/sys@v0.22.0...v0.23.0)

---
updated-dependencies:
- dependency-name: github.com/onsi/gomega
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: go-dependencies
- dependency-name: golang.org/x/sys
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: go-dependencies
...

Signed-off-by: dependabot[bot] <support@github.com>
This commit fixes dashboards to show MAPE and MSE for the time range
selected in the panel. Thus the dashboards now show the MSE and MAPE
that the validator script would produce if it was run for the same time
range.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
This commit modifies validator in the following way

1. It now makes use of TestResult object to hold all information about
   the test, so that other formats (e.g. json) can be easily generated.

2. Modifies generation of Markdown to make use of the TestResult dataclass

3. Fixes prometheus timestamp based filter to include samples that are
   within the next scrape of the first series.

4. Markdown result now produces a summary table

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…r-refactor

chore(validator): store test result in an object
…puting-io/dependabot/go_modules/go-dependencies-258b52e974

build(deps): bump the go-dependencies group with 2 updates
Add a workflow that runs the pre-commit jobs to
cover the case where a developer hasn't opted into
pre-commit. Also add the conventional commit message
hook and enable more golang linters. The commits also
tidies up the markdown files that had line length
issues.

Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
check metrics before and after e2e tests
…uting-io#1671)"

This reverts commit c99e399.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
This reverts commit cde7833.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
This reverts commit ca4cdf6.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
…ert-ringbuf

Revert "Merge pull request sustainable-computing-io#1628 from dave-tucker/new-ebpf"
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…dator-node-info

chore(validator): fix missing node-info in md
Signed-off-by: Anthony Harivel <aharivel@redhat.com>
By default the name chosen to define the Virtual Machine ID is the one
found in /proc/<pid>/cgroup/.

Many IaaS platform like Openstack use metadata in libvirt to customise
among other things the instance's ID.

Allow the user to use the metadata in libvirt to set the name of the
vm metrics and gives a better user experience.

Set "LIBVIRT_METADATA_URI" with the uri of the XML namespace identifier.
Set "LIBVIRT_METADATA_TOKEN" to choose the string used for VM ID. By
default, the token "name" is used.

Signed-off-by: Anthony Harivel <aharivel@redhat.com>
This uses a ring buffer to send events from eBPF back to userspace.
Doing so allows for our eBPF probes to complete quickly, and pushes
all delta calculation and summary back into userspace.

Moves the logic that resolves comm from procfs into its own pkg.
In addition, a cache is added to avoid hitting procfs every time
we update process metrics.

Removes the bpftest pkg and includes the test programs in the
main kepler.bpf.c file. This prevents drift in the go generate
flags and generally makes it easier to write tests.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
The resolves pid 0 to system_processes without (throwing) any error
since there is no command associated with it. This commit fixes the
following log

I0802 03:24:03.647656  733341 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running: stat /proc/0: no such file or directory, set comm=system_processes
I0802 03:24:03.648079  733341 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running, set comm=system_processes
0802 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running: stat /proc/0: no such file or directory, set comm=system_processes
...
I0802 process_bpf_collector.go:100] failed to resolve comm for PID 0: process not running, set comm=system_processes

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…#1671)

Processing events in the same goroutine as the ring buffer reader
requires acquiring a mutex, which blocks ringbuf event processing
causing a backlog. To avoid this, send events via a buffered
channel to a dedicated event processing goroutine to ensure
that the ringbuf remains unblocked. This has decreased CPU
load from 1-3% on my machine to 0-1% CPU load.

Signed-off-by: Dave Tucker <dave@dtucker.co.uk>
Copy link
Contributor

github-actions bot commented Aug 9, 2024

🤖 SeineSailor

Here is a concise summary of the pull request changes:

The "[DO NOT MERGE] Ringbuf redux" pull request reintroduces the ring buffer functionality to the codebase, affecting the build process, testing procedures, and internal implementation. Key modifications include:

  1. Build and testing changes: Modifications to the Makefile, go.mod, and go.sum files, as well as changes to the testing procedures, with the SUDO_TEST_PKGS variable now listing pkg/bpf instead of pkg/bpftest.
  2. Ring buffer functionality: Addition of a ring buffer map, perf event arrays, and related functions for getting hardware event counters. New functions for handling IRQ, page cache hits, and process free events have been added.
  3. eBPF exporter changes: Introduction of a new flag, bpfDebugMetricsEnabled, which enables debug metrics for eBPF. A new goroutine is started to handle errors from the eBPF exporter's Start method.
  4. Process metrics and communication resolution changes: Updates to the updateSWCounters function, processesData structure, and comm variable replacement with processComm from the commResolver.
  5. Vendor package updates: Upgrades to the github.com/cilium/ebpf package from v0.15.0 to v0.16.0, which may involve changes in the underlying implementation, affecting the behavior of the code.
  6. Testing script changes: Updates to the run-tests.sh script to include a quote_env function and modifications to the preserved_env array.

Observations and suggestions for improvement:

  • The pull request includes a mix of build process changes, new functionality, and vendor package updates, which may make it challenging to review and test.
  • It would be beneficial to break down the changes into smaller, more focused pull requests to facilitate easier review and testing.
  • The impact of the vendor package updates on the codebase's behavior should be thoroughly tested and documented.
  • The new ring buffer functionality and eBPF exporter changes may require additional testing and validation to ensure they do not introduce any regressions or issues.

@vimalk78
Copy link
Collaborator

should we start a new branch for this change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.