[ROCm][SymmetricMemory] Performance improvements for two-shot allreduce #156746

pragupta · 2025-06-24T19:52:35Z

The biggest bottleneck we found in two-shot allreduce was that the compiler was serializing all of the load operations for some reason. To avoid these load delays, we de-serialized the loads. We also found that on AMD GPUs a different block and thread size gives a nice performance boost. Here are the bandwidth numbers I am getting with this PR:

The green rows are the tensor sizes we are interested in, because two-shot is only used for larger sizes (one-shot is used for smaller sizes). As we can see, our baseline numbers consistently underperformed the fbgemm numbers. With the de-serialization change, most of the green tensor sizes show a performance boost (positive %). One tensor size shows a negative change, but it is within the error margin.
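For readers unfamiliar with the one-shot/two-shot distinction, here is a minimal host-side sketch in plain Python. It is not the kernel from this PR: it assumes two-shot means "reduce-scatter, then all-gather" over equally sized chunks, and the function names `one_shot_allreduce` and `two_shot_allreduce` are hypothetical, chosen only for illustration. Ranks are simulated as lists in one process.

```python
def one_shot_allreduce(buffers):
    """Simulated one-shot: every rank reads all peers' full buffers and reduces."""
    world_size = len(buffers)
    n = len(buffers[0])
    reduced = [sum(buffers[p][i] for p in range(world_size)) for i in range(n)]
    # Every rank ends up with its own copy of the full reduced buffer.
    return [list(reduced) for _ in range(world_size)]


def two_shot_allreduce(buffers):
    """Simulated two-shot (assumed: reduce-scatter followed by all-gather)."""
    world_size = len(buffers)
    n = len(buffers[0])
    assert n % world_size == 0, "sketch assumes the size divides evenly"
    chunk = n // world_size

    # Shot 1: rank r reduces only its own chunk across all peers.
    reduced_chunks = []
    for r in range(world_size):
        lo, hi = r * chunk, (r + 1) * chunk
        # Collect every peer's copy of the chunk first, then reduce.
        # The loads are independent of each other, which is the property
        # the PR relies on when it stops them from being serialized.
        loaded = [buffers[p][lo:hi] for p in range(world_size)]
        reduced_chunks.append([sum(vals) for vals in zip(*loaded)])

    # Shot 2: every rank gathers each owner's reduced chunk.
    full = [x for r in range(world_size) for x in reduced_chunks[r]]
    return [list(full) for _ in range(world_size)]


if __name__ == "__main__":
    # Hypothetical input: 4 ranks, 8 elements each.
    bufs = [[rank + i for i in range(8)] for rank in range(4)]
    assert two_shot_allreduce(bufs) == one_shot_allreduce(bufs)
```

In this sketch each rank reads `n` elements per peer in one-shot, but only `n / world_size` elements per peer in the first shot of two-shot, which is one common explanation for why two-shot is preferred at larger sizes.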

co-authored by: @amd-hhashemi
pytorch/FBGEMM#4072

cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang @naromero77amd

pytorch-bot · 2025-06-24T19:52:39Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/156746

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 1 Pending

As of commit e4c42ba with merge base aa280ea:

NEW FAILURES - The following jobs have failed:

pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 1, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 2, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.
pull / linux-jammy-py3.13-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.
pull / linux-jammy-py3.9-clang12 / test (dynamo_wrapped, 1, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.
pull / linux-jammy-py3.9-clang12 / test (dynamo_wrapped, 2, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.
pull / linux-jammy-py3.9-clang12 / test (dynamo_wrapped, 3, 3, linux.2xlarge) (gh)
Found the unimplemented_v2 or unimplemented_v2_with_warning calls below that don't match the registry in graph_break_registry.json.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot · 2025-06-30T21:25:15Z

To add the ciflow label ciflow/trunk please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

jeffdaily · 2025-06-30T21:39:55Z

@pytorchbot merge

pytorchmergebot · 2025-06-30T21:42:07Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorchmergebot · 2025-06-30T21:42:23Z

Merge failed

Reason: 6 mandatory check(s) failed. The first few are:

Dig deeper by viewing the failures on hud

Details for Dev Infra team

Raised by workflow job

Failing merge rule: Core Maintainers

jeffdaily · 2025-07-01T00:34:44Z

@pytorchbot merge -i "rocm-only change, all rocm CI is passing"

pytorch-bot · 2025-07-01T00:34:46Z

❌ 🤖 pytorchbot command failed:

@pytorchbot: error: unrecognized arguments: rocm-only change, all rocm CI is passing

usage: @pytorchbot [-h] {merge,revert,rebase,label,drci,cherry-pick} ...

Try @pytorchbot --help for more info.

jeffdaily · 2025-07-01T00:35:33Z

@pytorchbot merge -f "rocm-only change, all rocm CI is passing"

pytorchmergebot · 2025-07-01T00:37:08Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status here

pytorch-bot bot added the module: rocm (AMD GPU support for Pytorch), oncall: distributed (Add this issue/PR to distributed oncall triage queue), and release notes: distributed (c10d) (release notes category) labels Jun 24, 2025

pytorchbot added the open source label Jun 24, 2025

pragupta and others added 3 commits June 24, 2025 20:11

two shot perf improvement

fe4b6e9

stop serlization of consecutive ld_vec.

a81b87a

Only add de-serialization of ld_vec for ROCm

cbf66d9

pragupta force-pushed the pg-symm-perf-remote branch from 269df93 to cbf66d9 Compare June 24, 2025 20:16

pragupta changed the title ~~[ROCm][SymmetricMemory] De-serialize loads and stores to improve performance~~ [ROCm][SymmetricMemory] De-serialize loads to improve performance Jun 24, 2025

amd-hhashemi added 2 commits June 24, 2025 20:49

stop serlization of consecutive ld_vec (improve-1)

b42c0a0

stop serlization of consecutive ld_vec (improve-2)

2ce9980

jeffdaily added the ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 label Jun 24, 2025

correction to last

c4a23db

pytorch-bot bot removed the ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 label Jun 25, 2025

Adjust block and thread size for two shot

e4c42ba

pragupta changed the title ~~[ROCm][SymmetricMemory] De-serialize loads to improve performance~~ [ROCm][SymmetricMemory] Performance improvements for two-shot allreduce Jun 30, 2025

jeffdaily added the ciflow/periodic-rocm-mi300 Trigger "distributed" config CI on ROCm MI300 label Jun 30, 2025

jeffdaily approved these changes Jun 30, 2025

View reviewed changes

jeffdaily marked this pull request as ready for review June 30, 2025 21:24

jeffdaily added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 30, 2025

pytorch-bot bot removed the ciflow/trunk Trigger trunk jobs on your pull request label Jun 30, 2025

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 30, 2025

pytorchmergebot added the merging label Jun 30, 2025

pytorchmergebot removed the merging label Jun 30, 2025

pytorchmergebot added the merging label Jul 1, 2025

pytorchmergebot added the Merged label Jul 1, 2025

pytorchmergebot closed this in 6dc2b22 Jul 1, 2025

pytorchmergebot removed the merging label Jul 1, 2025
