-
Notifications
You must be signed in to change notification settings - Fork 215
Fix accelerator device map #1913
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix accelerator device map #1913
Conversation
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1 feat: add model_name attribute to ComponentModelWeights
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…r-longer-test chore(validator): run stress test for longer
This commit introduces a workflow for testing ACPI functionality using Equinix self-hosted runners. The workflow deploys Kepler using mock-acpi compose setup and runs validator to ensure functionality. Key-features: - Workflow is triggered on pull requests that include a specific commit message `/test-acpi`. - Environment setup is handled by ansible. Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…puting-io/add-acpi-wkf feat(ci): implement mock-ACPI workflow
…server-patch-1 fix: format ComponentModelWeights
This commit moves model_weights from/var/lib/kepler/data/ to its own directory - var/lib/kepler/data/model_weights/ this allows additional data like machine-spec to be stored its own directory. Additionally this change fixes the blank cpu.yaml that gets created when running compose files. Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…k-cpu-yaml chore: move model_weights to its own directory
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit resolves two key issues with the mock-acpi workflow: - Checkout correct branch: The workflow previously checkout out the default branch when triggered by a pull request. This fix ensures that the correct pull request branch is checked out during CI run. - Attach workflow to pull request checks: The workflow was not being reflected under pull request checks. With this fix, the workflow will be correctly attached, ensuring its status visible and reported under pull request checks. Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…eanup-exporter-globals chore: cleanup globals in exporter
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1 fix: set default trainer only for local regressor
…pi-wk-status fix(ci): ensure proper status reporting for mock-acpi workflow
This commit addresses an issue where the `cleanup` and `final status` jobs were incorrectly dependent on the `create-runner` job, leading to their premature execution. This fix ensures that both `cleanup` and `final status` job run independently and only after all necessary preceding jobs have finished Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
Signed-off-by: Sunyanan Choochotkaew <sunyanan.choochotkaew1@ibm.com>
…server-patch-1 feat: add --disable-power-meter option
…x-job-flow fix(ci): ensure independent execution of cleanup and status jobs
…idation feat: Export validation result as json object.
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: Maryam Tahhan <mtahhan@redhat.com>
This commit addresses the issue of multiple jobs defined in the `mock-acpi` workflow, which were unintentionally executing in parallel due to sequence constraints. By consolidating the workflow into a single job, we ensure that the tasks are executed sequentially Signed-off-by: Vibhu Prashar <vibhu.sharma2929@gmail.com>
…puting-io/fix-flow fix(ci): consolidate mock-acpi workflow into single job
Updates: * kepler_model_server to 0.7.11-2. * Models to 0.7.11 SGDRegressor * Adds DISABLE_POWER_METER env to disable power meters when running vm compose on a BM (just in case) Signed-off-by: Sunil Thaha <sthaha@redhat.com>
…ms-0.7.11-2 chore(compose): bumpup model-server to v0.7.11-2
Add cosign to two additional image building workflows to resolve intermittent signing issue. Uses GitHub OIDC token for signing, and narrows scope of privileges to only what is necessary to write token. Signed-off-by: Arthur Savage <arthursavage47@gmail.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
This commit migrates the mock-acpi workflow to use the GitHub runner instead of the Equinix self-hosted runner. Since the workflow is designed for testing ACPI functionality, using a mock, a self-hosted runner is not required. Running the workflow on the GitHub runner, which operates as a VM, enables execution on every pull request, ensuring consistent validation of ACPI functionality for Kepler. Signed-off-by: vprashar2929 <vibhu.sharma2929@gmail.com>
…pi-wkf chore(ci): migrate mock-acpi workflow to GH runner
* [test]: add test case on package cgroup Signed-off-by: Sam Yuan <yy19902439@126.com> * [fix]: update cache setting logic when error happen Signed-off-by: Sam Yuan <yy19902439@126.com> * [fix]: fix lint Signed-off-by: Sam Yuan <yy19902439@126.com> --------- Signed-off-by: Sam Yuan <yy19902439@126.com>
…om-scaph feat(compose): add fallback scrape protocol for Scaphandre service
Signed-off-by: Huamin Chen <hchen@redhat.com>
Signed-off-by: Sam Yuan <yy19902439@126.com>
Signed-off-by: Mario Vazquez <mavazque@redhat.com>
…e-paths Fixes paths for grace arm
Signed-off-by: Huamin Chen <hchen@redhat.com>
…race feat(sensor): support NVIDIA Grace Hopper
…e-computing-io#1889) Bumps the github-actions group with 4 updates: [actions/checkout](https://github.com/actions/checkout), [actions/attest-build-provenance](https://github.com/actions/attest-build-provenance), [actions/attest-sbom](https://github.com/actions/attest-sbom) and [codecov/codecov-action](https://github.com/codecov/codecov-action). Updates `actions/checkout` from 3 to 4 - [Release notes](https://github.com/actions/checkout/releases) - [Commits](actions/checkout@v3...v4) Updates `actions/attest-build-provenance` from 1 to 2 - [Release notes](https://github.com/actions/attest-build-provenance/releases) - [Changelog](https://github.com/actions/attest-build-provenance/blob/main/RELEASE.md) - [Commits](actions/attest-build-provenance@v1...v2) Updates `actions/attest-sbom` from 1 to 2 - [Release notes](https://github.com/actions/attest-sbom/releases) - [Changelog](https://github.com/actions/attest-sbom/blob/main/RELEASE.md) - [Commits](actions/attest-sbom@v1...v2) Updates `codecov/codecov-action` from 5.0.7 to 5.1.1 - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](codecov/codecov-action@v5.0.7...v5.1.1) --- updated-dependencies: - dependency-name: actions/checkout dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/attest-build-provenance dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: actions/attest-sbom dependency-type: direct:production update-type: version-update:semver-major dependency-group: github-actions - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ble-computing-io#1892) Bumps [golang.org/x/crypto](https://github.com/golang/crypto) from 0.26.0 to 0.31.0. - [Commits](golang/crypto@v0.26.0...v0.31.0) --- updated-dependencies: - dependency-name: golang.org/x/crypto dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Signed-off-by: Sam Yuan <yy19902439@126.com>
…stainable-computing-io#1896) Bumps the github-actions group with 1 update: [anchore/sbom-action](https://github.com/anchore/sbom-action). Updates `anchore/sbom-action` from 0.17.8 to 0.17.9 - [Release notes](https://github.com/anchore/sbom-action/releases) - [Changelog](https://github.com/anchore/sbom-action/blob/main/RELEASE.md) - [Commits](anchore/sbom-action@v0.17.8...v0.17.9) --- updated-dependencies: - dependency-name: anchore/sbom-action dependency-type: direct:production update-type: version-update:semver-patch dependency-group: github-actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…e-computing-io#1900) Bumps the github-actions group with 2 updates: [actions/upload-artifact](https://github.com/actions/upload-artifact) and [codecov/codecov-action](https://github.com/codecov/codecov-action). Updates `actions/upload-artifact` from 4.4.3 to 4.5.0 - [Release notes](https://github.com/actions/upload-artifact/releases) - [Commits](actions/upload-artifact@v4.4.3...v4.5.0) Updates `codecov/codecov-action` from 5.1.1 to 5.1.2 - [Release notes](https://github.com/codecov/codecov-action/releases) - [Changelog](https://github.com/codecov/codecov-action/blob/main/CHANGELOG.md) - [Commits](codecov/codecov-action@v5.1.1...v5.1.2) --- updated-dependencies: - dependency-name: actions/upload-artifact dependency-type: direct:production update-type: version-update:semver-minor dependency-group: github-actions - dependency-name: codecov/codecov-action dependency-type: direct:production update-type: version-update:semver-patch dependency-group: github-actions ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
🤖 SeineSailor Here is a concise summary of the pull request changes: Summary: The "Fix accelerator device map" pull request refactors the accelerator device map logic, improving internal diagnostics and adding log statements for better visibility. Key modifications include corrections to device map addition logic and the introduction of the Impact: These changes enhance the code's internal functionality without affecting the external interface or behavior, except for the added Observations: The changes are focused on internal improvements, suggesting a refinement of the existing implementation rather than a significant architectural shift. The addition of log statements and diagnostics will aid in debugging and troubleshooting. Suggestions: Consider adding tests to validate the corrected device map logic and the new |
33fdffa
to
40b36a1
Compare
can you sign DCO? |
…sustainable-computing-io#1809)" This reverts commit 91c8e62. Signed-off-by: Vimal Kumar <vimal78@gmail.com>
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
40b36a1
to
9ca44bf
Compare
@rootfs signed DCO |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding .dockerignore will lead to incorrect version computation #1809
if err != nil { | ||
klog.Infof("There is no DCGM daemon running in the host: %s", err) | ||
// embedded mode is not recommended for production per https://github.com/NVIDIA/dcgm-exporter/issues/22#issuecomment-1321521995 | ||
klog.Info("Attempting to inilialize dcgm in Embedded mode.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit typo: initialize
j, _ := json.Marshal(r.GetAllDevices()) | ||
klog.V(5).Infof("Accelerator Registry AllDevices: %s", string(j)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we marshal only if klog.V(5).IsEnabled() ?
} else { | ||
m[d] = deviceStartup |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we panic here instead? The accelerator has ready been defined and there is now an attempt to overwrite it. IMHO, we should panic and aligns with the Must
in MustRegister.
.dockerignore
file