
Conversation

kwen2501
Contributor

kwen2501 commented May 29, 2025

Stack from ghstack (oldest at bottom):

Problem

Running nvshmem_all_to_all_vdev on 4 x H100s (fully connected with NVSwitch).
Before:

```
Size (MiB)   Time (us)   BusBw (GB/s)
         0       32.29          16.23
         1       33.01          31.76
         2       33.01          63.54
         4       33.83         123.97
         8       49.83         168.34
        16       80.82         207.59
        32      178.66         187.82
        64      335.79         199.86
       128      646.72         207.54
       256     1268.77         211.57
       512     2511.14         213.80
      1024     4998.31         214.82
      2048     9964.49         215.51
      4096    19892.34         215.91
```

215 GB/s does not reach the speed-of-light (SOL) bandwidth of NV18 (350-400 GB/s).
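
For reference, the BusBw column here is consistent with simply dividing the per-rank message size by the measured time. A minimal sketch that reproduces one row (variable names are illustrative, not from the actual benchmark code):

```cuda
// Hedged sketch: reproduce a BusBw entry from the Size and Time columns,
// assuming BusBw = message size / time (which matches the rows above).
#include <cstdio>

int main() {
    // Row "1" of the Before table: 1 MiB moved in 33.01 us.
    const double bytes  = 1.0 * (1 << 20);       // 1 MiB in bytes
    const double time_s = 33.01 * 1e-6;          // 33.01 us in seconds
    const double busbw  = bytes / time_s / 1e9;  // GB/s
    printf("BusBw: %.2f GB/s\n", busbw);         // ~31.77, matching the table's 31.76
    return 0;
}
```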

Change

If the number of peers decreases (say, from 8 to 4), we do not reduce the number of CTAs; instead, we shift the spare CTAs toward the data-parallel dimension (see the sketch after the results below).

After:

```
Size (MiB)   Time (us)   BusBw (GB/s)
         0       25.01          20.96
         1       25.70          40.80
         2       25.76          81.42
         4       28.87         145.26
         8       40.79         205.64
        16       61.46         272.97
        32      111.82         300.06
        64      202.40         331.57
       128      382.56         350.84
       256      739.11         363.19
       512     1450.79         370.05
      1024     2873.13         373.72
      2048     5719.50         375.47
      4096    11395.65         376.90
```

If we look at the MoE-relevant region, say 32 MiB, we see the bandwidth improve from 187 to 300 GB/s.
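
To illustrate the change, here is a minimal CUDA sketch of the CTA-mapping idea, under the assumption that the kernel launches a fixed-size grid and derives each CTA's (peer, data slice) assignment from blockIdx.x; all names and constants are hypothetical, not the actual nvshmem_all_to_all_vdev code:

```cuda
// Hedged sketch of the CTA mapping described under "Change" (illustrative
// names, not the actual kernel). The grid size stays fixed; when the peer
// count drops, the spare CTAs shift to the data-parallel dimension.
constexpr int kTotalCtas = 32;  // fixed total, independent of world size

__global__ void all_to_all_sketch(int npeers) {
    int ctas_per_peer = kTotalCtas / npeers;    // 8 peers -> 4; 4 peers -> 8
    int peer     = blockIdx.x / ctas_per_peer;  // which peer this CTA serves
    int dp_slice = blockIdx.x % ctas_per_peer;  // which slice of that peer's data
    if (peer >= npeers) return;                 // guard for non-divisible grids
    // Each CTA would then copy slice `dp_slice` of the chunk headed to `peer`,
    // e.g. copy_slice(dst, src, peer, dp_slice, /*nslices=*/ctas_per_peer);
}

int main() {
    // Same grid whether there are 8 peers or 4; only the per-peer count changes.
    all_to_all_sketch<<<kTotalCtas, 256>>>(/*npeers=*/4);
    cudaDeviceSynchronize();
    return 0;
}
```

With a fixed grid, going from 8 peers to 4 doubles the CTAs working on each peer's data, which is what lets the kernel approach the SOL numbers in the After table.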

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k

@pytorch-bot bot added the oncall: distributed and release notes: distributed (c10d) labels May 29, 2025
@pytorch-bot

pytorch-bot bot commented May 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/154580

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit d181c33 with merge base 241f8dc:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

kwen2501 added a commit that referenced this pull request May 29, 2025
ghstack-source-id: 849d7b5
Pull Request resolved: #154580
@kwen2501 requested review from fduwjj, fegin and ngimel May 29, 2025 01:30
@ngimel
Collaborator

ngimel commented May 29, 2025

What about our Triton-based all2allv? Does it need this tuning too?

@kwen2501 added the ciflow/trunk label May 29, 2025
@kwen2501
Contributor Author

What about our Triton-based all2allv? Does it need this tuning too?

It probably needs this tuning too. On the other hand, I am not 100% sure about its maintenance or move-into-core plan --
the nvshmem_all_to_all_vdev impl seems to show lower latency than the Triton impl (this is probably due to an algorithm difference rather than an NVSHMEM-vs-Triton difference).
But I can try tuning it in the symm-mem-recipe repo.

@kwen2501
Contributor Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

iupaikov-amd pushed a commit to ROCm/pytorch that referenced this pull request Jun 4, 2025
Pull Request resolved: pytorch#154580
Approved by: https://github.com/ngimel
@github-actions bot deleted the gh/kwen2501/163/head branch June 29, 2025 02:22
