@jataylo jataylo commented Nov 18, 2025

We have observed poor performance in cases of heavily contended atomics.

To address this while minimizing kernel overhead, this PR proposes an FX pass that replaces the index_put operation with an alternative partitioned scatter approach.

Algorithm:

  1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
  2. Assign to partitions: partition_id = operation_id % num_partitions
  3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
  4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
  5. Perform partitioned scatter with reduced contention
  6. Reduce across partitions: sum(partitions, dim=scatter_dim)
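The steps above can be sketched in eager PyTorch for the `scatter_dim == 0` case. Note `partitioned_index_add` is a hypothetical helper written for illustration; the actual pass rewrites the FX graph rather than calling such a function:

```python
import torch

def partitioned_index_add(out, indices, values, num_partitions=4):
    # Sketch of the partitioned scatter for scatter_dim == 0 only.
    n = out.shape[0]
    # Step 3: expanded buffer of size num_partitions * dim_size along dim 0.
    expanded = torch.zeros((num_partitions * n,) + out.shape[1:],
                           dtype=out.dtype, device=out.device)
    # Steps 1-2: enumerate scatter ops and assign partitions round-robin.
    op_ids = torch.arange(indices.numel(), device=indices.device)
    # Step 4: shift each index into its partition's slice of the buffer.
    adjusted = indices + (op_ids % num_partitions) * n
    # Step 5: the scatter now spreads atomics over num_partitions slices.
    expanded.index_add_(0, adjusted, values)
    # Step 6: reduce the partitions back to the original output shape.
    return out + expanded.view(num_partitions, n, *out.shape[1:]).sum(dim=0)
```

The result matches a plain `index_add_` up to floating-point summation order, since each scatter op lands in exactly one partition and the partitions are summed at the end.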

This reduces atomic contention at the cost of extra memory. To keep that cost bounded, we have built heuristics around the total number of partitions for the expanded buffer, and we cap how large the expanded tensors can be (currently 10% of GPU memory).
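A minimal sketch of the partition-count side of such a heuristic, assuming the 10% memory cap described above (function name and signature are hypothetical; the real pass embeds this logic differently):

```python
def choose_num_partitions(out_bytes, total_gpu_bytes,
                          max_partitions=8, mem_fraction=0.10):
    # The expanded buffer holds num_partitions copies of the output, so it
    # uses num_partitions * out_bytes. Cap that at mem_fraction of GPU
    # memory (the PR currently uses 10%) and at a fixed partition limit.
    budget = int(total_gpu_bytes * mem_fraction)
    by_memory = budget // out_bytes
    return max(1, min(max_partitions, by_memory))
```

For the benchmark's output buffer (501 x 100 float32 = ~200 KB), the memory cap is far from binding on any modern GPU, so the partition limit dominates; the cap matters for large outputs.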

Note that the heuristic cannot be perfect, since the true indices data is not known at compile time. In real-world models the indices will contain duplicates and will not be uniformly distributed, which increases atomic contention; this cannot currently be modelled, so we estimate contention from the input and output buffer sizes.

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops into 3 separate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention.

We gathered two sets of results: one with partitioned_scatter_enabled=True, the other with partitioned_scatter_enabled=False.
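A CPU-runnable sketch of the benchmark shape described above (the linked gist is the authoritative version; helper names here are illustrative):

```python
import torch

def make_inputs(N, D, n, hi, device="cpu"):
    # The gist uses N=1_000_000, D=100, n=501 on GPU. `hi` sets the top of
    # the randint range, i.e. the contention level: hi=1 means every index
    # is 0 (maximum contention); hi=n spreads indices over the whole output.
    values = torch.randn(N, D, dtype=torch.float32, device=device)
    indices = torch.randint(0, hi, (N,), dtype=torch.int64, device=device)
    output = torch.zeros(n, D, dtype=torch.float32, device=device)
    return values, indices, output

def run_index_adds(values, indices, output):
    # Three index_add ops into three separate buffers, as described above.
    outs = [output.clone() for _ in range(3)]
    for o in outs:
        o.index_add_(0, indices, values)
    return outs
```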

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) | speedup |
|---|---|---|---|---|
| 0-0 | 85.52 | 28.50 | 3.55 | 8.03 |
| 0-1 | 46.99 | 15.66 | 2.47 | 6.33 |
| 0-3 | 25.16 | 8.31 | 2.20 | 3.78 |
| 0-7 | 12.92 | 4.32 | 1.63 | 2.66 |
| 0-15 | 6.66 | 4.24 | 1.60 | 2.66 |
| 0-31 | 3.43 | 3.19 | 1.33 | 2.40 |
| 0-63 | 1.79 | 1.62 | 1.32 | 1.23 |
| 0-127 | 1.76 | 1.59 | 1.24 | 1.28 |
| 0-255 | 1.73 | 1.32 | 1.24 | 1.07 |
| 0-500 | 1.61 | 1.27 | 1.23 | 1.04 |

Note there are improvements to make after this lands:

  1. Add dynamic shape support; we need to be conservative here so as not to explode memory usage.
  2. Update IR and codegen directly to avoid the iota op and the need to update indices via torch ops; we can likely do this in store codegen itself.
  3. Develop new implementations for memory-constrained environments.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @hongxiayang @naromero77amd @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

pytorch-bot bot commented Nov 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/168073

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Unrelated Failure

As of commit fedb7f1 with merge base 65b9892:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@jataylo jataylo changed the title Add partitioned buffer approach for scatter add op [TESTING] Add partitioned buffer approach for scatter add op Nov 18, 2025
@jataylo jataylo added ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/inductor-perf-test-nightly-rocm-mi300 Trigger inductor perf tests on ROCm MI300 ciflow/rocm-mi200 Trigger "default" config CI on ROCm MI200 labels Nov 18, 2025
jataylo commented Nov 18, 2025

Note this PR is not ready for review. We need to fix UTs, check whether NVIDIA GPUs benefit from this, and see if the iota can be optimised out.

@jataylo jataylo changed the title [TESTING] Add partitioned buffer approach for scatter add op [TESTING] [ROCm] Add partitioned buffer approach for scatter add op Nov 18, 2025
@pytorch-bot pytorch-bot bot added ciflow/rocm Trigger "default" config CI on ROCm module: rocm AMD GPU support for Pytorch release notes: fx release notes category labels Nov 18, 2025
jataylo commented Nov 21, 2025

@pytorchbot rebase

@pytorchmergebot

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot

Successfully rebased scatter-add-opt onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout scatter-add-opt && git pull --rebase)

@jataylo jataylo removed ciflow/inductor ciflow/rocm Trigger "default" config CI on ROCm labels Nov 23, 2025
@jataylo jataylo removed ciflow/inductor-rocm Trigger "inductor" config CI on ROCm ciflow/rocm-mi300 Trigger "default" config CI on ROCm MI300 ciflow/rocm-mi200 Trigger "default" config CI on ROCm MI200 labels Nov 23, 2025
@pytorch-bot pytorch-bot bot added ciflow/inductor ciflow/rocm Trigger "default" config CI on ROCm labels Nov 23, 2025
jataylo commented Nov 23, 2025

Temporarily disabling the feature to get comparative perf dashboard data.
