[ROCm] Fix fp32 atomicAdd for non-MI100 GPUs #128750
Fixes #128631. The current fp32 atomicAdd implementation is specific to MI100, which causes performance degradation on other GPUs.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128750
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 36a1e1a with merge base bca2cf0.
This comment was automatically generated by Dr. CI and updates every 15 minutes.

This is awesome! Can you add some benchmark results for this change?

@xw285cornell has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Benchmarking on MI300X

@eqy Can you take a look?

Very nice, thank you for the contribution!

@pytorchbot merge -f 'Landed internally' (Initiating merge automatically since Phabricator Diff has merged, using force because this PR might not pass merge_rules.json but landed internally)

Merge started
Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Current implementation is very specific to MI100. This is causing performance degradation for other GPUs.

Fixes pytorch#128631

Benchmarking on MI300X:
```
Before: 1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: pytorch#128750
Approved by: https://github.com/xw285cornell

(cherry picked from commit 1f0a68b)
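
To put the MI300X timings from the commit message in perspective, the two numbers can be compared directly. This is a minimal sketch: the variable names are illustrative, and the PR does not say which workload was timed, so the result only describes that one benchmark.

```python
# Timings reported in the commit message for MI300X, in milliseconds.
before_ms = 1918.5126953125
after_ms = 0.8285150527954102

# Speedup factor: how many times faster the "after" run was.
speedup = before_ms / after_ms

# Share of the original runtime removed by the change, as a percentage.
reduction_pct = (1 - after_ms / before_ms) * 100

print(f"speedup: {speedup:.0f}x")
print(f"runtime reduction: {reduction_pct:.2f}%")
```

The ratio works out to roughly 2,300x for this benchmark; other GPUs and workloads may see a different gain.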
cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @dllehr-amd @jataylo @hongxiayang