
Conversation

kwen2501
Contributor

kwen2501 commented May 29, 2025

Stack from ghstack (oldest at bottom):

Problem

Running nvshmem_all_to_all_vdev on 4 x H100s (fully connected with NVSwitch).
Before:

```
Size (MiB)   Time (us)   BusBw (GB/s)
         0       32.29          16.23
         1       33.01          31.76
         2       33.01          63.54
         4       33.83         123.97
         8       49.83         168.34
        16       80.82         207.59
        32      178.66         187.82
        64      335.79         199.86
       128      646.72         207.54
       256     1268.77         211.57
       512     2511.14         213.80
      1024     4998.31         214.82
      2048     9964.49         215.51
      4096    19892.34         215.91
```

215 GB/s does not reach the speed-of-light (SOL) bandwidth of NV18 (350-400 GB/s).
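
For reference, the BusBw column here is consistent with simply dividing the per-rank message size by the measured time. A minimal sketch that reproduces one row (variable names are illustrative, not from the actual benchmark code):

```cuda
// Hedged sketch: reproduce a BusBw entry from the Size and Time columns,
// assuming BusBw = message size / time (which matches the rows above).
#include <cstdio>

int main() {
    // Row "1" of the Before table: 1 MiB moved in 33.01 us.
    const double bytes  = 1.0 * (1 << 20);       // 1 MiB in bytes
    const double time_s = 33.01 * 1e-6;          // 33.01 us in seconds
    const double busbw  = bytes / time_s / 1e9;  // GB/s
    printf("BusBw: %.2f GB/s\n", busbw);         // ~31.77, matching the table's 31.76
    return 0;
}
```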

Change

If the number of peers decreases (say, from 8 to 4), we do not reduce the number of CTAs; instead, we shift the spare CTAs toward the data-parallel dimension (see the sketch after the results below).

After:

```
Size (MiB)   Time (us)   BusBw (GB/s)
         0       25.01          20.96
         1       25.70          40.80
         2       25.76          81.42
         4       28.87         145.26
         8       40.79         205.64
        16       61.46         272.97
        32      111.82         300.06
        64      202.40         331.57
       128      382.56         350.84
       256      739.11         363.19
       512     1450.79         370.05
      1024     2873.13         373.72
      2048     5719.50         375.47
      4096    11395.65         376.90
```

If we look at the MoE-relevant region, say 32 MiB, we see the bandwidth improve from 187 to 300 GB/s.
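
To illustrate the change, here is a minimal CUDA sketch of the CTA-mapping idea, under the assumption that the kernel launches a fixed-size grid and derives each CTA's (peer, data slice) assignment from blockIdx.x; all names and constants are hypothetical, not the actual nvshmem_all_to_all_vdev code:

```cuda
// Hedged sketch of the CTA mapping described under "Change" (illustrative
// names, not the actual kernel). The grid size stays fixed; when the peer
// count drops, the spare CTAs shift to the data-parallel dimension.
constexpr int kTotalCtas = 32;  // fixed total, independent of world size

__global__ void all_to_all_sketch(int npeers) {
    int ctas_per_peer = kTotalCtas / npeers;    // 8 peers -> 4; 4 peers -> 8
    int peer     = blockIdx.x / ctas_per_peer;  // which peer this CTA serves
    int dp_slice = blockIdx.x % ctas_per_peer;  // which slice of that peer's data
    if (peer >= npeers) return;                 // guard for non-divisible grids
    // Each CTA would then copy slice `dp_slice` of the chunk headed to `peer`,
    // e.g. copy_slice(dst, src, peer, dp_slice, /*nslices=*/ctas_per_peer);
}

int main() {
    // Same grid whether there are 8 peers or 4; only the per-peer count changes.
    all_to_all_sketch<<<kTotalCtas, 256>>>(/*npeers=*/4);
    cudaDeviceSynchronize();
    return 0;
}
```

With a fixed grid, going from 8 peers to 4 doubles the CTAs working on each peer's data, which is what lets the kernel approach the SOL numbers in the After table.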

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels May 29, 2025
@pytorch-bot

pytorch-bot bot commented May 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154580

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d181c33 with merge base 241f8dc:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request May 29, 2025
ghstack-source-id: 849d7b5
Pull Request resolved: #154580
@kwen2501 requested review from fduwjj, fegin and ngimel May 29, 2025 01:30
@ngimel
Collaborator

ngimel commented May 29, 2025

What about our Triton-based all2allv? Does it need this tuning too?

@kwen2501 added the ciflow/trunk label May 29, 2025
@kwen2501
Contributor Author

What about our Triton-based all2allv? Does it need this tuning too?

It probably needs this tuning too. On the other hand, I am not 100% sure about its maintenance or move-into-core plan --
the nvshmem_all_to_all_vdev impl seems to show lower latency than the Triton impl (this is probably due to an algorithm difference rather than an NVSHMEM-vs-Triton difference).
But I can try tuning it in the symm-mem-recipe repo.

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025
Pull Request resolved: pytorch#154580
Approved by: https://github.com/ngimel
@github-actions bot deleted the gh/kwen2501/163/head branch June 29, 2025 02:22
