@jataylo jataylo commented Nov 18, 2025

We have observed poor performance in cases of heavily contended atomics.

To address this while minimizing kernel overhead, this PR proposes an FX pass that replaces the index_put operation with an alternative partitioned scatter approach.

Algorithm:

  1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
  2. Assign to partitions: partition_id = operation_id % num_partitions
  3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
  4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
  5. Perform partitioned scatter with reduced contention
  6. Reduce across partitions: sum(partitions, dim=scatter_dim)
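The steps above can be sketched in eager PyTorch for the `scatter_dim == 0` case. Note `partitioned_index_add` is a hypothetical helper written for illustration; the actual pass rewrites the FX graph rather than calling such a function:

```python
import torch

def partitioned_index_add(out, indices, values, num_partitions=4):
    # Sketch of the partitioned scatter for scatter_dim == 0 only.
    n = out.shape[0]
    # Step 3: expanded buffer of size num_partitions * dim_size along dim 0.
    expanded = torch.zeros((num_partitions * n,) + out.shape[1:],
                           dtype=out.dtype, device=out.device)
    # Steps 1-2: enumerate scatter ops and assign partitions round-robin.
    op_ids = torch.arange(indices.numel(), device=indices.device)
    # Step 4: shift each index into its partition's slice of the buffer.
    adjusted = indices + (op_ids % num_partitions) * n
    # Step 5: the scatter now spreads atomics over num_partitions slices.
    expanded.index_add_(0, adjusted, values)
    # Step 6: reduce the partitions back to the original output shape.
    return out + expanded.view(num_partitions, n, *out.shape[1:]).sum(dim=0)
```

The result matches a plain `index_add_` up to floating-point summation order, since each scatter op lands in exactly one partition and the partitions are summed at the end.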

This reduces atomic contention at the cost of extra memory. To keep that cost bounded, we have built heuristics around the total number of partitions for the expanded buffer, and we cap how large the expanded tensors can be (currently 10% of GPU memory).
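A minimal sketch of the partition-count side of such a heuristic, assuming the 10% memory cap described above (function name and signature are hypothetical; the real pass embeds this logic differently):

```python
def choose_num_partitions(out_bytes, total_gpu_bytes,
                          max_partitions=8, mem_fraction=0.10):
    # The expanded buffer holds num_partitions copies of the output, so it
    # uses num_partitions * out_bytes. Cap that at mem_fraction of GPU
    # memory (the PR currently uses 10%) and at a fixed partition limit.
    budget = int(total_gpu_bytes * mem_fraction)
    by_memory = budget // out_bytes
    return max(1, min(max_partitions, by_memory))
```

For the benchmark's output buffer (501 x 100 float32 = ~200 KB), the memory cap is far from binding on any modern GPU, so the partition limit dominates; the cap matters for large outputs.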

Note that the heuristic cannot be perfect, since the true indices data is not known at compile time. In real-world models the indices will contain duplicates and will not be uniformly distributed, which increases atomic contention; this cannot currently be modelled, so we estimate contention from the input and output buffer sizes.

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops into 3 separate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention.

We gathered two sets of results: one with partitioned_scatter_enabled=True, the other with partitioned_scatter_enabled=False.
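A CPU-runnable sketch of the benchmark shape described above (the linked gist is the authoritative version; helper names here are illustrative):

```python
import torch

def make_inputs(N, D, n, hi, device="cpu"):
    # The gist uses N=1_000_000, D=100, n=501 on GPU. `hi` sets the top of
    # the randint range, i.e. the contention level: hi=1 means every index
    # is 0 (maximum contention); hi=n spreads indices over the whole output.
    values = torch.randn(N, D, dtype=torch.float32, device=device)
    indices = torch.randint(0, hi, (N,), dtype=torch.int64, device=device)
    output = torch.zeros(n, D, dtype=torch.float32, device=device)
    return values, indices, output

def run_index_adds(values, indices, output):
    # Three index_add ops into three separate buffers, as described above.
    outs = [output.clone() for _ in range(3)]
    for o in outs:
        o.index_add_(0, indices, values)
    return outs
```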

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) | speedup |
|---|---|---|---|---|
| 0-0 | 85.52 | 28.50 | 3.55 | 8.03 |
| 0-1 | 46.99 | 15.66 | 2.47 | 6.33 |
| 0-3 | 25.16 | 8.31 | 2.20 | 3.78 |
| 0-7 | 12.92 | 4.32 | 1.63 | 2.66 |
| 0-15 | 6.66 | 4.24 | 1.60 | 2.66 |
| 0-31 | 3.43 | 3.19 | 1.33 | 2.40 |
| 0-63 | 1.79 | 1.62 | 1.32 | 1.23 |
| 0-127 | 1.76 | 1.59 | 1.24 | 1.28 |
| 0-255 | 1.73 | 1.32 | 1.24 | 1.07 |
| 0-500 | 1.61 | 1.27 | 1.23 | 1.04 |

Note there are improvements to make after this lands:

  1. Add dynamic shape support; we need to be conservative here so as not to explode memory usage.
  2. Update IR and codegen directly to avoid the iota op and the need to update indices via torch ops; we can likely do this in store codegen itself.
  3. Develop new implementations for memory-constrained environments.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot bot commented Nov 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168073

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit fedb7f1 with merge base 65b9892:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jataylo jataylo changed the title Add partitioned buffer approach for scatter add op [TESTING] Add partitioned buffer approach for scatter add op Nov 18, 2025
@jataylo jataylo added ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-perf-test-nightly-rocm-mi300 Trigger inductor perf tests on ROCm MI300 ciflow/rocm-mi200 Trigger "default" config CI on ROCm MI200 labels Nov 18, 2025
jataylo commented Nov 18, 2025

Note this PR is not ready for review. We need to fix UTs, check whether NVIDIA GPUs benefit from this, and see if the iota can be optimised out.

@jataylo jataylo changed the title [TESTING] Add partitioned buffer approach for scatter add op [TESTING] [ROCm] Add partitioned buffer approach for scatter add op Nov 18, 2025
@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch release notes: fx release notes category labels Nov 18, 2025
jataylo commented Nov 21, 2025

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased scatter-add-opt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout scatter-add-opt && git pull --rebase)

@jataylo jataylo removed ciflow/inductor ciflow/rocm Trigger "default" config CI on ROCm labels Nov 23, 2025
@jataylo jataylo removed ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/rocm-mi200 Trigger "default" config CI on ROCm MI200 labels Nov 23, 2025
@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/rocm Trigger "default" config CI on ROCm labels Nov 23, 2025
jataylo commented Nov 23, 2025

Temporarily disabling the feature to get comparative perf dashboard data.
