Set Up a CI Pipeline for H100 #526

Merged
merged 26 commits into from
May 15, 2025
Conversation

Binyang2014
Contributor

@Binyang2014 Binyang2014 commented May 15, 2025

Set Up a CI Pipeline for H100

@Binyang2014 Binyang2014 marked this pull request as ready for review May 15, 2025 20:34

@Copilot Copilot AI left a comment

Pull Request Overview

The purpose of this PR is to introduce a new CI pipeline for the H100 environment. Key changes include:

  • Adding a new performance benchmark file (perf_ndmv5.jsonl) for latency tests.
  • Updating multiple Azure Pipelines YAML configuration files to include new jobs, container image substitutions, and hardware‐specific job definitions (e.g., H100 and A100 variants).
  • Creating and adjusting pipeline templates (ut.yaml, ut-npkit.yaml, nccl-test.yaml, integration-test.yaml) to accommodate the new hardware environment.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/deploy/perf_ndmv5.jsonl | Added new performance benchmark data for latency tests. |
| .azure-pipelines/ut.yml | Updated job names and container image substitution to support new hardware targets. |
| .azure-pipelines/nccl-api-test.yaml | Introduced H100-specific parameters including adjusted nvccGencode value. |
| .azure-pipelines/integration-test.yml | Added an H100 integration job and updated baseline file references. |
| Other pipeline template files | Refactored templates to support the new CI pipeline structure for H100. |
Comments suppressed due to low confidence (3)

.azure-pipelines/ut.yml:13

  • [nitpick] Verify that the use of 'UnitTestA100' accurately reflects the intended hardware target. If this job is meant for H100 tests, consider aligning the name consistently to avoid potential confusion.
+- job: UnitTestA100

.azure-pipelines/nccl-api-test.yaml:55

  • [nitpick] Confirm that the 'nvccGencode' value '-gencode=arch=compute_90,code=sm_90' is correct for the H100 target and reflects the latest hardware specifications.
nvccGencode:      "-gencode=arch=compute_90,code=sm_90"

.azure-pipelines/integration-test.yml:25

  • [nitpick] Ensure that the consistent variable substitution syntax '$(containerImage)' is used across all pipeline files, as mixed syntaxes can potentially lead to discrepancies.
image: $(containerImage)
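
On the `nvccGencode` nitpick above: H100 has CUDA compute capability 9.0 and A100 has 8.0, so `compute_90`/`sm_90` matches the H100 target. The sketch below shows how the flag follows from the compute capability. The helper name `nvcc_gencode` and the lookup table are illustrative assumptions and are not part of this PR's pipeline files.

```python
# Hypothetical helper (not from the PR): derive the nvcc -gencode flag
# from a GPU's CUDA compute capability.
COMPUTE_CAPABILITY = {
    "A100": (8, 0),  # Ampere
    "H100": (9, 0),  # Hopper
}


def nvcc_gencode(gpu: str) -> str:
    """Return the -gencode flag for a GPU listed in COMPUTE_CAPABILITY."""
    major, minor = COMPUTE_CAPABILITY[gpu]
    arch = f"{major}{minor}"
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"


if __name__ == "__main__":
    for gpu in ("A100", "H100"):
        print(gpu, nvcc_gencode(gpu))
```

For `"H100"` the helper produces the same string the pipeline passes as `nvccGencode`.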

@Binyang2014 Binyang2014 changed the title Create a CI pipeline for H100 Create a CI pipeline on H100 May 15, 2025
@Binyang2014 Binyang2014 changed the title Create a CI pipeline on H100 Set Up a CI Pipeline for H100 May 15, 2025
@Binyang2014 Binyang2014 merged commit a18e91c into main May 15, 2025
23 of 25 checks passed
@Binyang2014 Binyang2014 deleted the binyli/ci branch May 15, 2025 21:50
@@ -0,0 +1,3 @@
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
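
The three baseline records above can be sanity-checked against each other. This is a minimal sketch, assuming `size` is in bytes, `time` is in microseconds, and both bandwidths are in GB/s (the file does not state units, but the numbers are consistent with that reading). It also assumes the usual nccl-tests convention for allreduce, busBw = algBw × 2(n − 1)/n. The function names are illustrative.

```python
import json

# The three records added by this PR in test/deploy/perf_ndmv5.jsonl.
BASELINE = """\
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
"""


def expected_bandwidths(size_bytes: float, time_us: float, ranks: int):
    """Recompute (algBw, busBw) in GB/s under the assumed units."""
    alg_bw = size_bytes / time_us / 1e3        # bytes/us is MB/s; /1e3 gives GB/s
    bus_bw = alg_bw * 2 * (ranks - 1) / ranks  # allreduce bus-bandwidth factor
    return alg_bw, bus_bw


def is_consistent(rec: dict, tol: float = 0.02) -> bool:
    """True if the recorded bandwidths match the recomputed ones within tol GB/s."""
    alg_bw, bus_bw = expected_bandwidths(rec["size"], rec["time"], rec["ranks"])
    return abs(alg_bw - rec["algBw"]) < tol and abs(bus_bw - rec["busBw"]) < tol


records = [json.loads(line) for line in BASELINE.splitlines() if line.strip()]

if __name__ == "__main__":
    for rec in records:
        print(rec["size"], is_consistent(rec))
```

The tolerance absorbs rounding in the recorded `time` values, which are stored to two decimal places.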

A few questions about this file:

  • I noticed in perf_ndmv4.jsonl, we have more test cases with 8 and 16 ranks and different kernels. Is there a reason we only test with 8 ranks and use kernel 6 for H100 validation?

  • Do we have performance tests for other collectives besides allreduce in the CI pipeline?

Contributor Author

Oh, this is because most of the algos are designed for ndv4; we don't tune the algos for ndv5. Only algo6 (allreduce with small message sizes) is reasonable for ndv5.
In the future we will move to DSL-based algos and add perf regression tests based on that. For now, the perf file is added to pass the CI pipeline and test some functionality.
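
As an illustration of how a latency baseline like this one could gate a CI run, here is a minimal sketch. It is not the project's actual checker: the function names, the matching key, the 5% tolerance, and the measured values are all hypothetical. Only the baseline record is taken from this PR.

```python
import json

# One baseline record from this PR, plus hypothetical measured results.
BASELINE = '{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}'
MEASURED_OK = '{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "size":24576, "time":6.30}'
MEASURED_SLOW = '{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "size":24576, "time":7.00}'


def load_jsonl(text: str) -> list:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def record_key(rec: dict) -> tuple:
    """Fields assumed to identify a test case."""
    return (rec["name"], rec["kernel"], rec["ranks"], rec["ranksPerNode"], rec["size"])


def find_regressions(baseline: list, measured: list, tolerance: float = 0.05) -> list:
    """Return (key, baseline_time, measured_time) for latency cases slower than allowed."""
    reference = {record_key(r): r for r in baseline}
    regressions = []
    for rec in measured:
        ref = reference.get(record_key(rec))
        if ref is None or ref.get("target") != "latency":
            continue  # no matching latency baseline for this case
        if rec["time"] > ref["time"] * (1 + tolerance):
            regressions.append((record_key(rec), ref["time"], rec["time"]))
    return regressions


if __name__ == "__main__":
    baseline = load_jsonl(BASELINE)
    print(find_regressions(baseline, load_jsonl(MEASURED_OK)))
    print(find_regressions(baseline, load_jsonl(MEASURED_SLOW)))
```

Matching on name, kernel, rank counts, and size means that cases without a baseline entry are skipped rather than failed, which fits a baseline file that covers only the kernel-6 allreduce cases.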

3 participants