
Conversation

Contributor

@pragupta pragupta commented Jun 24, 2025

The biggest bottleneck we found with two-shot allreduce was that the compiler was serializing all of the load operations. To avoid these load delays, we de-serialize the loads so that they can be issued independently (see the sketch after the results below). Along with this improvement, we also found that a different block and thread size gives a nice performance boost on AMD GPUs. Here are the bandwidth numbers I am getting with this PR:
[image: table of allreduce bandwidth per tensor size, comparing the baseline and fbgemm numbers against this PR]

The green rows are the tensor sizes we care about, because two-shot is only used for larger sizes (one-shot is used for smaller ones). As the table shows, our baseline numbers were consistently underperforming relative to the fbgemm numbers. With this de-serialization change, most of the green tensor sizes show a performance boost (positive %). One tensor size shows a regression, but it is within the error margin.
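To make the "de-serialize the loads" idea concrete, here is a minimal HIP/CUDA-style sketch of the reduction step of a two-shot allreduce. This is not the kernel from this PR: the function name, the `buffers` argument, and `kWorldSize` are illustrative assumptions, and it presumes `float4`-aligned symmetric-memory buffers. The point is simply that all peer loads are issued into independent registers before any accumulation happens, so the memory requests can overlap instead of each load waiting on the previous one.

```cpp
// Minimal sketch only; all names and the buffer layout are assumptions,
// not the actual PR kernel.
template <int kWorldSize>
__global__ void reduce_scatter_sketch(float4* const* buffers,  // one symmetric-memory buffer per rank
                                       float4* out,
                                       size_t numel4) {
  const size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
  for (size_t i = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < numel4; i += stride) {
    // De-serialized loads: issue every peer load into its own register first,
    // so the requests can be in flight concurrently.
    float4 v[kWorldSize];
#pragma unroll
    for (int r = 0; r < kWorldSize; ++r) {
      v[r] = buffers[r][i];
    }
    // Accumulate only after all loads have been issued.
    float4 acc = v[0];
#pragma unroll
    for (int r = 1; r < kWorldSize; ++r) {
      acc.x += v[r].x;
      acc.y += v[r].y;
      acc.z += v[r].z;
      acc.w += v[r].w;
    }
    out[i] = acc;
  }
}
```

The second change described above is a launch-configuration tweak: on AMD GPUs the kernel is launched with a different block count and threads-per-block than the defaults used elsewhere. The exact values are in the diff and are not reproduced here.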

Co-authored-by: @amd-hhashemi
pytorch/FBGEMM#4072

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd


pytorch-bot bot commented Jun 24, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156746

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Pending

As of commit e4c42ba with merge base aa280ea:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for PyTorch), oncall: distributed (add this issue/PR to distributed oncall triage queue), and release notes: distributed (c10d) labels Jun 24, 2025
@pragupta pragupta force-pushed the pg-symm-perf-remote branch from 269df93 to cbf66d9 on June 24, 2025 at 20:16
@pragupta pragupta changed the title [ROCm][SymmetricMemory] De-serialize loads and stores to improve performance [ROCm][SymmetricMemory] De-serialize loads to improve performance Jun 24, 2025
@jeffdaily jeffdaily added the ciflow/periodic-rocm-mi300 (trigger "distributed" config CI on ROCm MI300) label Jun 24, 2025
@pytorch-bot pytorch-bot bot removed the ciflow/periodic-rocm-mi300 label Jun 25, 2025
@pragupta pragupta changed the title [ROCm][SymmetricMemory] De-serialize loads to improve performance [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce Jun 30, 2025
@jeffdaily jeffdaily added the ciflow/periodic-rocm-mi300 label Jun 30, 2025
@jeffdaily jeffdaily marked this pull request as ready for review June 30, 2025 21:24
@jeffdaily jeffdaily added the ciflow/trunk (trigger trunk jobs on your pull request) label Jun 30, 2025

pytorch-bot bot commented Jun 30, 2025

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/trunk label Jun 30, 2025
@jeffdaily
Collaborator

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk label Jun 30, 2025
@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.

@jeffdaily
Collaborator

@pytorchbot merge -i "rocm-only change, all rocm CI is passing"


pytorch-bot bot commented Jul 1, 2025

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: rocm-only change, all rocm CI is passing

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

@jeffdaily
Collaborator

@pytorchbot merge -f "rocm-only change, all rocm CI is passing"

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status here.
