Set Up a CI Pipeline for H100 #526
Pull Request Overview
The purpose of this PR is to introduce a new CI pipeline for the H100 environment. Key changes include:
- Adding a new performance benchmark file (perf_ndmv5.jsonl) for latency tests.
- Updating multiple Azure Pipelines YAML configuration files to include new jobs, container image substitutions, and hardware‐specific job definitions (e.g., H100 and A100 variants).
- Creating and adjusting pipeline templates (ut.yaml, ut-npkit.yaml, nccl-test.yaml, integration-test.yaml) to accommodate the new hardware environment.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| test/deploy/perf_ndmv5.jsonl | Added new performance benchmark data for latency tests. |
| .azure-pipelines/ut.yml | Updated job names and container image substitution to support new hardware targets. |
| .azure-pipelines/nccl-api-test.yaml | Introduced H100-specific parameters including adjusted nvccGencode value. |
| .azure-pipelines/integration-test.yml | Added an H100 integration job and updated baseline file references. |
| Other pipeline template files | Refactored templates to support the new CI pipeline structure for H100. |
Comments suppressed due to low confidence (3)
.azure-pipelines/ut.yml:13
- [nitpick] Verify that the use of 'UnitTestA100' accurately reflects the intended hardware target. If this job is meant for H100 tests, consider aligning the name consistently to avoid potential confusion.
+- job: UnitTestA100
.azure-pipelines/nccl-api-test.yaml:55
- [nitpick] Confirm that the 'nvccGencode' value '-gencode=arch=compute_90,code=sm_90' is correct for the H100 target and reflects the latest hardware specifications.
nvccGencode: "-gencode=arch=compute_90,code=sm_90"
.azure-pipelines/integration-test.yml:25
- [nitpick] Ensure that the consistent variable substitution syntax '$(containerImage)' is used across all pipeline files, as mixed syntaxes can potentially lead to discrepancies.
image: $(containerImage)
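On the `nvccGencode` nitpick above: H100 (Hopper) has compute capability 9.0 and A100 (Ampere) has 8.0, so `compute_90`/`sm_90` is the expected pair for the H100 job. A minimal sketch of how such a flag could be derived per hardware target is shown here; the helper name `nvcc_gencode` and the `TARGETS` table are hypothetical and are not part of the pipeline files in this PR.

```python
# Hypothetical helper: build an nvcc -gencode flag from a compute capability.
def nvcc_gencode(major: int, minor: int = 0) -> str:
    cc = f"{major}{minor}"
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


# Compute capabilities of the two GPU types mentioned in this PR.
TARGETS = {"A100": (8, 0), "H100": (9, 0)}

for gpu, (major, minor) in TARGETS.items():
    print(gpu, nvcc_gencode(major, minor))
```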
@@ -0,0 +1,3 @@
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
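The three baseline rows appear to be internally consistent. The sketch below recomputes the bandwidth fields under two assumptions that the file itself does not state: that `size` is in bytes, `time` in microseconds, and bandwidth in GB/s, and that `busBw` follows the NCCL-tests convention for allreduce, busBw = algBw × 2(n−1)/n. The function name `derived_bandwidths` is made up for illustration.

```python
import json

# The three baseline rows from perf_ndmv5.jsonl, copied verbatim.
BASELINE = """\
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
"""


def derived_bandwidths(row):
    """Recompute (algBw, busBw) from the other fields of one baseline row."""
    n = row["ranks"]
    # Assumption: bytes / microseconds / 1e3 gives GB/s.
    alg_bw = row["size"] / row["time"] / 1e3
    # Assumption: allreduce bus bandwidth factor 2(n-1)/n, as in NCCL-tests.
    bus_bw = row["algBw"] * 2 * (n - 1) / n
    return alg_bw, bus_bw


rows = [json.loads(line) for line in BASELINE.splitlines() if line.strip()]
for row in rows:
    alg_bw, bus_bw = derived_bandwidths(row)
    print(f'size={row["size"]}: algBw {row["algBw"]} vs {alg_bw:.2f}, '
          f'busBw {row["busBw"]} vs {bus_bw:.2f}')
```

Under these assumptions the recomputed values agree with the recorded ones to within about 0.01 GB/s, which is consistent with rounding to two decimals.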
A few questions about this file:
- I noticed that perf_ndmv4.jsonl has more test cases, with 8 and 16 ranks and different kernels. Is there a reason we only test with 8 ranks and kernel 6 for H100 validation?
- Do we have performance tests for collectives other than allreduce in the CI pipeline?
Oh, this is because most of the algos are designed for ndv4; we haven't tuned them for ndv5. Only algo6 (allreduce with small message sizes) is reasonable for ndv5.
In the future we will move to DSL-based algos and add perf regression tests based on that. For now, the perf file is added so that the CI pipeline passes and some functionality gets tested.
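For context, a baseline file like this one could back a simple latency regression check. The following is only a sketch of that idea, not the project's actual comparison logic: the function `find_regressions`, the 10% tolerance, and the shape of the `measured` dictionary are all hypothetical.

```python
import json


def find_regressions(baseline_lines, measured, tolerance=0.10):
    """Return (key, value) pairs for rows that are missing or slower than baseline.

    `measured` maps (name, kernel, ranks, size) to a measured time in the same
    unit as the baseline "time" field. Both the key layout and the default
    tolerance are assumptions made for this example.
    """
    failures = []
    for line in baseline_lines:
        entry = json.loads(line)
        key = (entry["name"], entry["kernel"], entry["ranks"], entry["size"])
        got = measured.get(key)
        if got is None:
            failures.append((key, "missing"))
        elif got > entry["time"] * (1 + tolerance):
            failures.append((key, got))
    return failures


# One baseline row taken from perf_ndmv5.jsonl; the measured time is invented.
baseline = ['{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, '
            '"algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}']
measured = {("allreduce", 6, 8, 24576): 6.2}
print(find_regressions(baseline, measured))
```

Keying on name, kernel, ranks, and size would keep such a check usable if more kernels or rank counts are added to the baseline later.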