Use atomicAdd for bfloat16 in Ampere and above #84981
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/84981. Note: links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit eeff148. This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -6,6 +6,10 @@
#include <ATen/NumericUtils.h>
#if !(defined(USE_ROCM) || ((defined(CUDA_VERSION) && CUDA_VERSION < 11000) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800))))
#include <cuda_bf16.h>
I'm not sure why this is needed; if it is removed, the build fails with the complaint that __nv_bfloat16 isn't defined, even though the included c10/util/BFloat16.h header should also include it...
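For readers puzzling over the guard itself, the condition can be restated as a plain predicate. This is an illustrative host-side sketch (the function name is invented), not code from the PR: the native cuda_bf16.h path is only taken off ROCm, on CUDA toolkit 11.0 or newer, and on sm_80 (Ampere) or newer.

```cpp
#include <cassert>

// Host-side restatement of the #if guard above (name is hypothetical).
constexpr bool use_native_bf16_atomics(bool is_rocm, int cuda_version, int cuda_arch) {
    return !(is_rocm || cuda_version < 11000 || cuda_arch < 800);
}

static_assert(use_native_bf16_atomics(false, 11000, 800), "Ampere + CUDA 11.0 takes the native path");
static_assert(!use_native_bf16_atomics(false, 10020, 800), "CUDA 10.2 is too old");
static_assert(!use_native_bf16_atomics(false, 11080, 750), "Turing lacks native bf16 atomics");
static_assert(!use_native_bf16_atomics(true, 11080, 900), "ROCm is excluded");
```

Everything else falls through to the software fallback discussed below.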
        return bsum + val;
      });
#else
  __nv_bfloat16 r = atomicAdd(reinterpret_cast<__nv_bfloat16*>(address), *reinterpret_cast<__nv_bfloat16*>(&val));
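For context on what the pre-Ampere branch has to do instead: a 16-bit element cannot be updated atomically on its own, so it goes through a CAS loop on the enclosing 32-bit word. The following is a host-side C++ model of that fallback, not the actual ATen device code — all names are made up, std::atomic stands in for the CUDA atomics, and casting raw storage to std::atomic is not strictly portable; it only mirrors the shape of the device-side loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>

// float -> bf16 with round-to-nearest-even (NaN corner cases ignored).
static uint16_t float_to_bf16(float f) {
    uint32_t b;
    std::memcpy(&b, &f, sizeof b);
    b += 0x7FFFu + ((b >> 16) & 1u);
    return static_cast<uint16_t>(b >> 16);
}

static float bf16_to_float(uint16_t h) {
    uint32_t b = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &b, sizeof f);
    return f;
}

// CAS-loop add on a 16-bit element: locate the enclosing 4-byte word,
// splice the updated half back in, and retry until the exchange sticks.
static void atomic_add_bf16(uint16_t* addr, float val) {
    const bool high = (reinterpret_cast<uintptr_t>(addr) & 2u) != 0;
    auto* word = reinterpret_cast<std::atomic<uint32_t>*>(
        reinterpret_cast<uintptr_t>(addr) & ~uintptr_t{3});
    uint32_t old = word->load(std::memory_order_relaxed);
    uint32_t next;
    do {
        const uint16_t cur = high ? uint16_t(old >> 16) : uint16_t(old & 0xFFFFu);
        const uint16_t sum = float_to_bf16(bf16_to_float(cur) + val);
        next = high ? ((old & 0x0000FFFFu) | (uint32_t(sum) << 16))
                    : ((old & 0xFFFF0000u) | sum);
    } while (!word->compare_exchange_weak(old, next));  // `old` is refreshed on failure
}
```

Every failed exchange costs a full re-read and re-round, which is the root of the slowness discussed later in this thread.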
This is very clunky; I should be able to use the functions in c10/util/BFloat16-inl.h, but I was tripping over the syntax.
What are the perf numbers you are getting with this?
Better than before, but still very much the wrong order of magnitude according to the microbenchmark (on an A6000).
Would bfloat162 atomicAdd help? (This is what's used for half.) https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT162__ARITHMETIC.html#group__CUDA__MATH____BFLOAT162__ARITHMETIC_1g550f52c89d672213390e9bfd8a3c42bf
Not sure what this entails; naively changing the cast types yields a misaligned address error... Will check the generated code.
Yeah, you cannot naively change the type. You should use an approach similar to fastAtomicAdd for fp16, where you apply atomicAdd to the properly aligned reinterpret-casted half2 (or bfloat162) pointer, adding 0 to the other half.
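A host-side sketch of the trick being described (all names invented; the real implementation is fastAtomicAdd in KernelUtils.cuh): operate on the aligned 32-bit pair and put zero in the lane you don't want to change, so one word-wide add updates only the targeted element. On the GPU the final store would be a single atomicAdd on a __nv_bfloat162 pointer; this sketch is non-atomic, assumes a little-endian, 4-byte-aligned, even-length buffer, and skips the tensor-boundary handling the real code needs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

static uint16_t float_to_bf16(float f) {
    uint32_t b;
    std::memcpy(&b, &f, sizeof b);
    b += 0x7FFFu + ((b >> 16) & 1u);  // round to nearest even (NaN ignored)
    return static_cast<uint16_t>(b >> 16);
}

static float bf16_to_float(uint16_t h) {
    uint32_t b = static_cast<uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &b, sizeof f);
    return f;
}

// Word-wide "bfloat162 add": two independent bf16 lane additions.
static uint32_t add_bf16x2(uint32_t a, uint32_t b) {
    const uint16_t lo = float_to_bf16(bf16_to_float(uint16_t(a)) + bf16_to_float(uint16_t(b)));
    const uint16_t hi = float_to_bf16(bf16_to_float(uint16_t(a >> 16)) + bf16_to_float(uint16_t(b >> 16)));
    return (uint32_t(hi) << 16) | lo;
}

// Add `val` to element `index`: the untouched lane gets +0.0, so the
// word-wide add changes only the element we care about.
static void fast_add_bf16(uint16_t* tensor, size_t index, float val) {
    const bool low_lane = (index % 2) == 0;
    uint32_t* word = reinterpret_cast<uint32_t*>(tensor + (index & ~size_t{1}));
    const uint32_t v = float_to_bf16(val);
    const uint32_t addend = low_lane ? v : (v << 16);
    *word = add_bf16x2(*word, addend);
}
```

This is why the naive cast misaligns: the 32-bit (or wider) atomic must land on the aligned pair base, not on an arbitrary 16-bit element address.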
I can confirm that I already tried reimplementing pytorch/aten/src/ATen/native/cuda/KernelUtils.cuh, lines 25 to 64 at 9d11552.
One more interesting thing that I keep running into (irrelevant to this) is the lack of a defined typecast between the two. As far as atomicAdds for bfloat go, I'm getting roughly 100x the latency I get with fp16.
I may be completely reading this wrong, but it seems like all atomic operations for bfloat16 are implemented with CAS, as opposed to float16 and float162, which seem to have a dedicated add instruction.

I ended up switching the kernel I'm struggling with to float32 at all times (since it's just a backwards kernel for a really small weight tensor, so returning a float32 tensor isn't too big of a deal). It is just a temporary solution for my use case to drop that 10ms latency back down to 100us. It's almost the same latency as float16 now, but this particular kernel doesn't even use half2 right now, so no harm done.

That said, I will try to look into this some more and see if I can figure something else out. Also, this PR could probably be merged with #80340.
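The "CAS instead of a dedicated add" observation is consistent with the contention behavior: a hardware add succeeds on every attempt, while a CAS loop forces every loser to re-read and retry after each winner. A deterministic, worst-case host-side simulation of that difference (purely illustrative, unrelated to the actual PTX or the 100x figure above):

```cpp
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Every thread adds 1 to the same location. Returns {final value, attempts}.
// Native add: one attempt per thread. CAS loop, worst-case interleaving
// (all threads read before any exchange): attempts grow quadratically.
static std::pair<int, int> simulate(int num_threads, bool native_add) {
    int mem = 0;
    int attempts = 0;
    if (native_add) {
        for (int t = 0; t < num_threads; ++t) {
            ++attempts;
            ++mem;  // one hardware add, never retries
        }
        return {mem, attempts};
    }
    std::vector<int> pending(num_threads);
    std::iota(pending.begin(), pending.end(), 0);
    std::vector<int> snapshot(num_threads, 0);
    while (!pending.empty()) {
        for (int t : pending) snapshot[t] = mem;  // all threads read first
        std::vector<int> retry;
        for (int t : pending) {
            ++attempts;
            if (snapshot[t] == mem) ++mem;        // CAS succeeds for one thread
            else retry.push_back(t);              // stale snapshot: go around again
        }
        pending = std::move(retry);
    }
    return {mem, attempts};
}
```

For 8 contending threads the native path takes 8 attempts and the worst-case CAS path 8+7+...+1 = 36, and the gap widens with warp-scale contention — the same qualitative shape as the microbenchmark numbers in this thread.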
/easycla

As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
I think CUDA v11.8 might help resolve the speed issue, by the way. Notice how CUDA v11.7 only had a CAS instruction for bf16, whereas in v11.8 we get this nice little addition, which I think is all we needed.
I'm seeing the following PTX generated with just a standalone call: … However, the slowdown remains, so I suspect it might be another kernel/operation that is slow.
@eqy Interesting. I'm assuming you're on 11.8?

Yes, this is on 11.8, but I'm not 100% sure my build is correct and need to recheck.

Yeah, I just noticed there's a new NGC image with the latest torch. I'm going to give it a shot with my example and see if I observe anything different.
So I'm still confirming, because my environment's a bit of a mess, but in my case the issue is partially resolved, I guess. I will note that I'm not building torch from your commits though (and not using the vectorized bfloat162 ops or anything, just plain …).

Update: I can confirm I'm still facing the issue on my end even on CUDA 11.8.

NGC pytorch:22.08 (cu117)
NGC pytorch:22.10 (cu118)
Force-pushed from 6ad5cc2 to a1caf6e.
Thanks for checking on your end @alihassanijr. On …

Dropping the [WIP] label now.
@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.

Merge failed. Reason: This PR is too stale; the last push date was more than 3 days ago. Please rebase and try again. You can rebase by leaving the following comment on this PR. Details for Dev Infra team: raised by workflow job.
@pytorchmergebot rebase

@pytorchbot successfully started a rebase job. Check the current status here.

Successfully rebased; force-pushed from a29a23d to eeff148.
@pytorchmergebot merge

Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
WIP to fix extremely slow `scatter_add` issue vs. fp16. The current changes seem to improve performance, but it still appears to lag behind the fp16 equivalent. CC @ngimel @ptrblck Pull Request resolved: pytorch#84981 Approved by: https://github.com/ngimel