Conversation

@jerrymannil
Contributor

@jerrymannil jerrymannil commented Jun 14, 2024

The current implementation is very specific to MI100, which causes performance degradation on other GPUs.

Fixes #128631

Benchmarking on MI300X:

```
Before: 1918.5126953125 ms
After:  0.8285150527954102 ms
```

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang

@jerrymannil jerrymannil requested a review from eqy as a code owner June 14, 2024 22:10
@pytorch-bot

pytorch-bot bot commented Jun 14, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128750

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 36a1e1a with merge base bca2cf0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla bot commented Jun 14, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: jeffdaily / name: Jeff Daily (36a1e1a)
  • ✅ login: jerrymannil / name: Jerry Mannil (b59ef58)

@pytorch-bot pytorch-bot bot added the module: rocm (AMD GPU support for Pytorch) and release notes: cuda (release notes category) labels Jun 14, 2024
@jeffdaily jeffdaily changed the title [ROCM] Fix fp32 atomicAdd for non-MI100 GPUs [ROCm] Fix fp32 atomicAdd for non-MI100 GPUs Jun 14, 2024
@jeffdaily jeffdaily added the ciflow/rocm (Trigger "default" config CI on ROCm) label Jun 15, 2024
@xw285cornell
Contributor

This is awesome! Can you add some benchmark results for this change?

@facebook-github-bot
Contributor

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@mikaylagawarecki mikaylagawarecki added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Jun 17, 2024
@jerrymannil
Contributor Author

jerrymannil commented Jun 17, 2024

@xw285cornell

Benchmarking on MI300X:

```
Before: 1918.5126953125 ms
After:  0.8285150527954102 ms
```

@jerrymannil
Contributor Author

@eqy Can you take a look?

@jataylo jataylo requested review from malfet and xw285cornell June 18, 2024 14:43
@xw285cornell
Contributor

very nice, thank you for the contribution!

@facebook-github-bot
Contributor

@pytorchbot merge -f 'Landed internally'

(Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

@pytorchmergebot
Collaborator

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as a last resort; instead, consider -i/--ignore-current to continue the merge while ignoring current failures. That allows currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status.

mhalk pushed a commit to mhalk/pytorch that referenced this pull request Oct 29, 2024
Current implementation is very specific to MI100.
This is causing performance degradation for other GPUs.

Fixes pytorch#128631

Benchmarking on MI300X:
```
Before:  1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: pytorch#128750
Approved by: https://github.com/xw285cornell

(cherry picked from commit 1f0a68b)
jithunnair-amd pushed a commit to ROCm/pytorch that referenced this pull request Mar 17, 2025

Labels

  • ciflow/rocm — Trigger "default" config CI on ROCm
  • Merged
  • module: rocm — AMD GPU support for Pytorch
  • open source
  • release notes: cuda — release notes category
  • triaged — This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Development

Successfully merging this pull request may close these issues.

torch.gather can be slow on AMD with duplicated index

7 participants