Skip to content

Conversation

vimalk78
Copy link
Collaborator

  • brings back .dockerignore file
  • fix the logic of adding accelerator device map
  • added few log statements

sunya-ch and others added 30 commits August 16, 2024 10:25
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1

feat: add model_name attribute to ComponentModelWeights
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…r-longer-test

chore(validator): run stress test for longer
This commit introduces a workflow for testing ACPI functionality
using Equinix self-hosted runners. The workflow deploys Kepler using
mock-acpi compose setup and runs validator to ensure functionality.

Key-features:
- Workflow is triggered on pull requests that include a specific commit
  message `/test-acpi`.
- Environment setup is handled by ansible.

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…puting-io/add-acpi-wkf

feat(ci): implement mock-ACPI workflow
…server-patch-1

fix: format ComponentModelWeights
This commit moves model_weights from/var/lib/kepler/data/ to its own
directory - var/lib/kepler/data/model_weights/ this allows additional data
like machine-spec to be stored its own directory.

Additionally this change fixes the blank cpu.yaml that gets created when
running compose files.

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…k-cpu-yaml

chore: move model_weights to its own directory
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit resolves two key issues with the mock-acpi workflow:
- Checkout correct branch: The workflow previously checkout out the
  default branch when triggered by a pull request. This fix ensures
  that the correct pull request branch is checked out during CI run.
- Attach workflow to pull request checks: The workflow was not
  being reflected under pull request checks. With this fix, the workflow
  will be correctly attached, ensuring its status visible and reported
  under pull request checks.

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…eanup-exporter-globals

chore: cleanup globals in exporter
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1

fix: set default trainer only for local regressor
…pi-wk-status

fix(ci): ensure proper status reporting for mock-acpi workflow
This commit addresses an issue where the `cleanup` and `final status`
jobs were incorrectly dependent on the `create-runner` job, leading to
their premature execution. This fix ensures that both `cleanup` and
`final status` job run independently and only after all necessary
preceding jobs have finished

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1

feat: add --disable-power-meter option
…x-job-flow

fix(ci): ensure independent execution of cleanup and status jobs
…idation

feat: Export validation result as json object.
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit addresses the issue of multiple jobs defined in the
`mock-acpi` workflow, which were unintentionally executing in parallel
due to sequence constraints. By consolidating the workflow into a single
job, we ensure that the tasks are executed sequentially

Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…puting-io/fix-flow

fix(ci): consolidate mock-acpi workflow into single job
Updates:
  * kepler_model_server to 0.7.11-2.
  * Models to 0.7.11 SGDRegressor
  * Adds DISABLE_POWER_METER env to disable power meters when running
    vm compose on a BM (just in case)

Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…ms-0.7.11-2

chore(compose): bumpup model-server to v0.7.11-2
Add cosign to two additional image building workflows to resolve
intermittent signing issue.  Uses GitHub OIDC token for signing,
and narrows scope of privileges to only what is necessary to
write token.

Signed-off-by: Arthur Savage <arthursavage47@gmail.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
vprashar2929 and others added 15 commits December 5, 2024 10:17
This commit migrates the mock-acpi workflow to use the
GitHub runner instead of the Equinix self-hosted runner.
Since the workflow is designed for testing ACPI functionality,
using a mock, a self-hosted runner is not required.

Running the workflow on the GitHub runner, which operates as a VM,
enables execution on every pull request, ensuring consistent validation
of ACPI functionality for Kepler.

Signed-off-by: vprashar2929 <vibhu.sharma2929@gmail.com>
…pi-wkf

chore(ci): migrate mock-acpi workflow to GH runner
* [test]: add test case on package cgroup

Signed-off-by: Sam Yuan <yy19902439@126.com>

* [fix]: update cache setting logic when error happen

Signed-off-by: Sam Yuan <yy19902439@126.com>

* [fix]: fix lint

Signed-off-by: Sam Yuan <yy19902439@126.com>

---------

Signed-off-by: Sam Yuan <yy19902439@126.com>
…om-scaph

feat(compose): add fallback scrape protocol for Scaphandre service
Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Sam Yuan <yy19902439@126.com>
Signed-off-by: Mario Vazquez <mavazque@redhat.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
…race

feat(sensor): support NVIDIA Grace Hopper
…e-computing-io#1889)

Bumps the github-actions group with 4 updates: [actions/checkout](https://github.com/actions/checkout), [actions/attest-build-provenance](https://github.com/actions/attest-build-provenance), [actions/attest-sbom](https://github.com/actions/attest-sbom) and [codecov/codecov-action](https://github.com/codecov/codecov-action).


Updates `actions/checkout` from 3 to 4
- [Release notes](https://github.com/actions/checkout/releases)
- [Commits](actions/checkout@v3...v4)

Updates `actions/attest-build-provenance` from 1 to 2
- [Release notes](https://github.com/actions/attest-build-provenance/releases)
- [Changelog](https://github.com/actions/attest-build-provenance/blob/main/RELEASE.md)
- [Commits](actions/attest-build-provenance@v1...v2)

Updates `actions/attest-sbom` from 1 to 2
- [Release notes](https://github.com/actions/attest-sbom/releases)
- [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md)
- [Commits](actions/attest-sbom@v1...v2)

Updates `codecov/codecov-action` from 5.0.7 to 5.1.1
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5.0.7...v5.1.1)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/attest-build-provenance
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: actions/attest-sbom
  dependency-type: direct:production
  update-type: version-update:semver-major
  dependency-group: github-actions
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ble-computing-io#1892)

Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.26.0 to 0.31.0.
- [Commits](golang/crypto@v0.26.0...v0.31.0)

---
updated-dependencies:
- dependency-name: golang.org/x/crypto
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sam Yuan <yy19902439@126.com>
…stainable-computing-io#1896)

Bumps the github-actions group with 1 update: [anchore/sbom-action](https://github.com/anchore/sbom-action).


Updates `anchore/sbom-action` from 0.17.8 to 0.17.9
- [Release notes](https://github.com/anchore/sbom-action/releases)
- [Changelog](https://github.com/anchore/sbom-action/blob/main/RELEASE.md)
- [Commits](anchore/sbom-action@v0.17.8...v0.17.9)

---
updated-dependencies:
- dependency-name: anchore/sbom-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…e-computing-io#1900)

Bumps the github-actions group with 2 updates: [actions/upload-artifact](https://github.com/actions/upload-artifact) and [codecov/codecov-action](https://github.com/codecov/codecov-action).


Updates `actions/upload-artifact` from 4.4.3 to 4.5.0
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](actions/upload-artifact@v4.4.3...v4.5.0)

Updates `codecov/codecov-action` from 5.1.1 to 5.1.2
- [Release notes](https://github.com/codecov/codecov-action/releases)
- [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md)
- [Commits](codecov/codecov-action@v5.1.1...v5.1.2)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: github-actions
- dependency-name: codecov/codecov-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
@vimalk78 vimalk78 requested a review from rootfs January 21, 2025 20:00
Copy link
Contributor

github-actions bot commented Jan 21, 2025

🤖 SeineSailor

Here is a concise summary of the pull request changes:

Summary: The "Fix accelerator device map" pull request refactors the accelerator device map logic, improving internal diagnostics and adding log statements for better visibility. Key modifications include corrections to device map addition logic and the introduction of the GetAllDevices() method to the Registry type.

Impact: These changes enhance the code's internal functionality without affecting the external interface or behavior, except for the added GetAllDevices() method. The readded .dockerignore file ensures proper Docker container setup.

Observations: The changes are focused on internal improvements, suggesting a refinement of the existing implementation rather than a significant architectural shift. The addition of log statements and diagnostics will aid in debugging and troubleshooting.

Suggestions: Consider adding tests to validate the corrected device map logic and the new GetAllDevices() method to ensure their correctness and reliability.

@rootfs
Copy link
Contributor

rootfs commented Jan 22, 2025

can you sign DCO?

Signed-off-by: Vimal Kumar <vimal78@gmail.com>
@vimalk78
Copy link
Collaborator Author

can you sign DCO?

@rootfs signed DCO

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding .dockerignore will lead to incorrect version computation #1809

if err != nil {
klog.Infof("There is no DCGM daemon running in the host: %s", err)
// embedded mode is not recommended for production per https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-1321521995
klog.Info("Attempting to inilialize dcgm in Embedded mode.")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit typo: initialize

Comment on lines +135 to +136
j, _ := json.Marshal(r.GetAllDevices())
klog.V(5).Infof("Accelerator Registry AllDevices: %s", string(j))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we marshal only if klog.V(5).IsEnabled() ?

Comment on lines +129 to +130
} else {
m[d] = deviceStartup
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we panic here instead? The accelerator has ready been defined and there is now an attempt to overwrite it. IMHO, we should panic and aligns with the Must in MustRegister.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.