Set Up a CI Pipeline for H100 #526

Binyang2014 · 2025-05-15T18:08:59Z

Set Up a CI Pipeline for H100

Copilot

Pull Request Overview

The purpose of this PR is to introduce a new CI pipeline for the H100 environment. Key changes include:

Adding a new performance benchmark file (perf_ndmv5.jsonl) for latency tests.
Updating multiple Azure Pipelines YAML configuration files to include new jobs, container image substitutions, and hardware‐specific job definitions (e.g., H100 and A100 variants).
Creating and adjusting pipeline templates (ut.yaml, ut-npkit.yaml, nccl-test.yaml, integration-test.yaml) to accommodate the new hardware environment.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

File	Description
test/deploy/perf_ndmv5.jsonl	Added new performance benchmark data for latency tests.
.azure-pipelines/ut.yml	Updated job names and container image substitution to support new hardware targets.
.azure-pipelines/nccl-api-test.yaml	Introduced H100-specific parameters including adjusted nvccGencode value.
.azure-pipelines/integration-test.yml	Added an H100 integration job and updated baseline file references.
Other pipeline template files	Refactored templates to support the new CI pipeline structure for H100.

Comments suppressed due to low confidence (3)

.azure-pipelines/ut.yml:13

[nitpick] Verify that the use of 'UnitTestA100' accurately reflects the intended hardware target. If this job is meant for H100 tests, consider aligning the name consistently to avoid potential confusion.

+- job: UnitTestA100

.azure-pipelines/nccl-api-test.yaml:55

[nitpick] Confirm that the 'nvccGencode' value '-gencode=arch=compute_90,code=sm_90' is correct for the H100 target and reflects the latest hardware specifications.

nvccGencode:      "-gencode=arch=compute_90,code=sm_90"
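As a quick cross-check of the flag above: H100 has compute capability 9.0 and A100 has 8.0, so the `-gencode` string follows directly from the target GPU. The helper below is only an illustrative sketch; the names `COMPUTE_CAPABILITY` and `nvcc_gencode` are hypothetical and not part of the pipeline files.

```python
# Compute capability per GPU, written the way nvcc expects it:
# A100 is 8.0 (sm_80), H100 is 9.0 (sm_90).
# This table and the helper name are illustrative, not taken from the PR.
COMPUTE_CAPABILITY = {"A100": 80, "H100": 90}


def nvcc_gencode(gpu):
    """Build the nvcc -gencode flag for a single GPU architecture."""
    cc = COMPUTE_CAPABILITY[gpu]
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


print(nvcc_gencode("H100"))  # -gencode=arch=compute_90,code=sm_90
print(nvcc_gencode("A100"))  # -gencode=arch=compute_80,code=sm_80
```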

.azure-pipelines/integration-test.yml:25

[nitpick] Ensure that the consistent variable substitution syntax '$(containerImage)' is used across all pipeline files, as mixed syntaxes can potentially lead to discrepancies.

image: $(containerImage)
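For context on the "mixed syntaxes" remark: Azure Pipelines has three substitution forms, macro `$(var)`, template expression `${{ ... }}`, and runtime expression `$[ ... ]`, and they are evaluated at different stages. A rough way to see which forms a file uses is sketched below. The sample YAML and the function name are hypothetical, and the macro pattern would also match shell command substitution inside inline scripts, so treat the result as a hint only.

```python
import re

# One pattern per Azure Pipelines substitution form.
PATTERNS = {
    "macro": re.compile(r"\$\(([A-Za-z_][\w.]*)\)"),   # $(containerImage)
    "template": re.compile(r"\$\{\{\s*(.+?)\s*\}\}"),  # ${{ parameters.x }}
    "runtime": re.compile(r"\$\[\s*(.+?)\s*\]"),       # $[ variables.x ]
}


def substitution_styles(yaml_text):
    """Return {style: [expressions]} for each substitution form found."""
    found = {}
    for style, pattern in PATTERNS.items():
        matches = pattern.findall(yaml_text)
        if matches:
            found[style] = matches
    return found


# Hypothetical snippet, not copied from the PR's pipeline files.
sample = """\
container:
  image: $(containerImage)
steps:
  - script: echo ${{ parameters.nvccGencode }}
"""

print(substitution_styles(sample))
```

A file that reports more than one style is not necessarily wrong, since template expressions and macros are often combined on purpose.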

mahdiehghazim · 2025-05-16T21:29:58Z

test/deploy/perf_ndmv5.jsonl

@@ -0,0 +1,3 @@
+{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8,  "algBw":3.98,  "busBw":6.96,   "size":24576,      "time":6.18,    "target":"latency"}
+{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8,  "algBw":7.42,  "busBw":12.99,  "size":49152,      "time":6.62,    "target":"latency"}
+{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8,  "algBw":10.67, "busBw":18.68,  "size":73728,      "time":6.91,    "target":"latency"}
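The three records can be sanity-checked against each other. The sketch below assumes `size` is in bytes, `time` is in microseconds, and both bandwidths are in GB/s (the file does not state units), and uses the usual allreduce bus-bandwidth factor 2(n-1)/n from nccl-tests. Function names are illustrative.

```python
import json

# The three records from the diff above (leading "+" removed).
PERF_LINES = """\
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":7.42, "busBw":12.99, "size":49152, "time":6.62, "target":"latency"}
{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, "algBw":10.67, "busBw":18.68, "size":73728, "time":6.91, "target":"latency"}
"""


def alg_bw(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (ASSUMES bytes and microseconds)."""
    return size_bytes / time_us / 1e3  # bytes/us equals MB/s; /1e3 gives GB/s


def allreduce_bus_bw(alg_bw_gbps, ranks):
    """Bus bandwidth for allreduce: algBw * 2(n-1)/n."""
    return alg_bw_gbps * 2 * (ranks - 1) / ranks


def inconsistent(records, tol=0.02):
    """Return records whose stored bandwidths differ from recomputed ones."""
    bad = []
    for r in records:
        alg = alg_bw(r["size"], r["time"])
        bus = allreduce_bus_bw(alg, r["ranks"])
        if abs(alg - r["algBw"]) > tol or abs(bus - r["busBw"]) > tol:
            bad.append(r)
    return bad


records = [json.loads(line) for line in PERF_LINES.splitlines() if line.strip()]
print(inconsistent(records))  # an empty list means the fields agree within tol
```

The tolerance absorbs rounding in the stored two-decimal values.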


A few questions about this file:

I noticed in perf_ndmv4.jsonl, we have more test cases with 8 and 16 ranks and different kernels. Is there a reason we only test with 8 ranks and use kernel 6 for H100 validation?

Do we have performance tests for other collectives besides allreduce in the CI pipeline?

Binyang2014 · 2025-05-16

This is because most of the algos were designed for ndv4, and we haven't tuned them for ndv5. Only algo6 (allreduce with small message sizes) is reasonable on ndv5.
In the future we will move to DSL-based algos and add perf regression tests based on those. For now, the perf file is added so that the CI pipeline passes and exercises some functionality.
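To make the idea of a perf regression check against a baseline file concrete, here is a minimal sketch. It is not the project's actual checker: the matching key, the 5% threshold, and the function names are all assumptions chosen for illustration.

```python
import json

# Fields assumed to identify one test case in a perf_*.jsonl baseline.
KEY_FIELDS = ("name", "kernel", "ranks", "ranksPerNode", "size")


def load_baseline(text):
    """Index baseline records (one JSON object per line) by test-case key."""
    baseline = {}
    for line in text.splitlines():
        if line.strip():
            rec = json.loads(line)
            baseline[tuple(rec[f] for f in KEY_FIELDS)] = rec
    return baseline


def find_regressions(baseline, measured, threshold=0.05):
    """Return (key, baseline_time, measured_time) for slower-than-allowed cases.

    A case regresses when its time exceeds the baseline time by more than
    `threshold` (relative). Cases with no baseline entry are skipped here.
    """
    regressions = []
    for rec in measured:
        key = tuple(rec[f] for f in KEY_FIELDS)
        base = baseline.get(key)
        if base is not None and rec["time"] > base["time"] * (1 + threshold):
            regressions.append((key, base["time"], rec["time"]))
    return regressions


baseline = load_baseline(
    '{"name":"allreduce", "kernel":6, "ranks":8, "ranksPerNode":8, '
    '"algBw":3.98, "busBw":6.96, "size":24576, "time":6.18, "target":"latency"}'
)
case = {"name": "allreduce", "kernel": 6, "ranks": 8, "ranksPerNode": 8, "size": 24576}

# Hypothetical measurements: 6.30 is within 5% of 6.18, 7.00 is not.
print(find_regressions(baseline, [dict(case, time=6.30)]))
print(find_regressions(baseline, [dict(case, time=7.00)]))
```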

Binyang2014 added 26 commits May 14, 2025 01:33

7042043 setup ci for h100
e93b724 WIP
f1e0ae1 WIP
12d4a45 WIP
93d6f99 WIP
4f02ef3 WIP
cc540b0 WIP
257d63a WIP
0cd7d0c fix
5e55e1d WIP
9600029 update for codegen
2ca6dc9 update for h100
26b2791 update
20dd90d WIP
6488dd5 WIP
6a129db WIP
075f5c3 fix
73b81a2 WIP
970e1d0 WIP
e086949 update
35d3e67 WIP
d07aed6 update
8711b39 WIP
cdae260 WIP
c6f0b9c WIP
b442ab8 WIP

Binyang2014 marked this pull request as ready for review May 15, 2025 20:34

Binyang2014 requested review from Copilot, chhwang and seagater May 15, 2025 20:34

Binyang2014 requested review from caiomcbr and mahdiehghazim May 15, 2025 20:35

Copilot AI reviewed May 15, 2025

Binyang2014 changed the title ~~Create a CI pipeline for H100~~ Create a CI pipeline on H100 May 15, 2025

Binyang2014 changed the title ~~Create a CI pipeline on H100~~ Set Up a CI Pipeline for H100 May 15, 2025

chhwang approved these changes May 15, 2025

Binyang2014 merged commit a18e91c into main May 15, 2025
23 of 25 checks passed

Binyang2014 deleted the binyli/ci branch May 15, 2025 21:50

mahdiehghazim reviewed May 16, 2025

Binyang2014 May 16, 2025
