CUDA BFloat16 TopK #44755
Conversation
💊 Dr. CI: as of commit ce32659, the commit was recently pushed and builds are still pending (see the Dr. CI page for details).
aten/src/THC/THCDeviceUtils.cuh (outdated)

```diff
@@ -39,6 +41,15 @@ __device__ __forceinline__ T doLdg(const T* p) {
 #endif
 }
 
+template <>
+__device__ __forceinline__ c10::BFloat16 doLdg<c10::BFloat16>(const c10::BFloat16* p) {
+#if __CUDA_ARCH__ >= 350
```
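The hunk above is truncated; a plausible completion of the specialization looks like the sketch below. It assumes c10::BFloat16's bits-plus-from_bits constructor and is illustrative, not necessarily the exact PR code.

```cuda
template <>
__device__ __forceinline__ c10::BFloat16 doLdg<c10::BFloat16>(const c10::BFloat16* p) {
#if __CUDA_ARCH__ >= 350
  // __ldg has no bfloat16 overload: pull the raw 16 bits through the
  // read-only data cache, then reinterpret them as a BFloat16.
  return c10::BFloat16(
      __ldg(reinterpret_cast<const unsigned short*>(p)),
      c10::BFloat16::from_bits());
#else
  return *p;
#endif
}
```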
Do you need the #if here? torch is only supported on __CUDA_ARCH__ >= 350.
I think this is actually equivalent to #ifndef __HIP_PLATFORM_HCC__?
Then it should say so? __ldg doesn't provide a performance benefit these days, but I guess you still need to load a short and construct the bfloat16 from bits on CUDA, while HIP can handle it natively?
Yes, I will change this to #ifndef __HIP_PLATFORM_HCC__, and maybe remove it later (needs benchmarking). On HIP it is just *p, so it's OK.
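In other words, the revised body would swap the guard roughly like this (same assumptions as the sketch above):

```cuda
#ifndef __HIP_PLATFORM_HCC__
  // CUDA: load the raw bits via __ldg and rebuild the BFloat16 from them.
  return c10::BFloat16(
      __ldg(reinterpret_cast<const unsigned short*>(p)),
      c10::BFloat16::from_bits());
#else
  return *p;  // HIP handles the bfloat16 load natively
#endif
```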
fixed
Just to let you know, changing the #if does break the build for the NVIDIA GRID K520 GPU (compute capability 3.0, below the 3.5 floor the old __CUDA_ARCH__ >= 350 guard encoded). I understand that is not a supported CUDA architecture, though.
Sorry about that, but as you note it is not a supported architecture.
Codecov Report
```
@@            Coverage Diff            @@
##           master   #44755   +/-   ##
=======================================
  Coverage   68.07%   68.08%           
=======================================
  Files         384      384           
  Lines       49765    49774    +9     
=======================================
+ Hits        33879    33890    +11    
+ Misses      15886    15884    -2     
```
Continue to review full report at Codecov.
If you observe weird numerical behavior with bfloat16 topk, https://github.com/pytorch/pytorch/blame/b85568a54a9c60986235ad1e0cc5dffc71b9d5b1/aten/src/ATen/native/cuda/SortingRadixSelect.cuh#L147-L163 is the main suspect. @ngimel, you remember our adventures with that for fp16. The same fix was also necessary for bfloat16, and @gchanan included it in his PR, but we had no way to test bfloat16 on CUDA at the time.
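For readers following the link: the convert/deconvert pair in SortingRadixSelect.cuh applies the standard radix-sort bit trick, mapping float bit patterns to unsigned integers whose ordering matches floating-point order. Below is a hedged sketch of that trick for a 16-bit type, with hypothetical names (toOrdered/fromOrdered are not the PR's identifiers):

```cuda
#include <cstdint>

// Monotonic mapping from a 16-bit float's bit pattern to uint16_t so that
// unsigned comparison agrees with floating-point ordering: negatives get
// all bits flipped, positives get only the sign bit flipped.
__device__ __forceinline__ uint16_t toOrdered(uint16_t bits) {
  uint16_t mask = (bits & 0x8000) ? 0xFFFF : 0x8000;
  return bits ^ mask;
}

// Inverse mapping: recover the original bit pattern after selection.
__device__ __forceinline__ uint16_t fromOrdered(uint16_t bits) {
  uint16_t mask = (bits & 0x8000) ? 0x8000 : 0xFFFF;
  return bits ^ mask;
}
```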
@mcarilli Tests on CI are passing, so it should be OK? Do you think we need more tests beyond the existing unit tests?
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
CI is enough; get_all_dtypes is testing bfloat16, right?
@ngimel Yes, by default it includes everything, unless you say otherwise.
There are internal build failures.
Let me benchmark and remove __ldg.
The problem is not __ldg, I believe; it's fromBits. I have no idea why --expt-relaxed-constexpr is not passed in internal builds, or how it used to work. Maybe just returning to the __CUDA_ARCH__ guard is the way to go ;-)
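For context on --expt-relaxed-constexpr: nvcc only lets device code call a constexpr host function when that flag is passed, and the from-bits construction is such a call. A minimal illustration with hypothetical names:

```cuda
// A host-side constexpr function, analogous in spirit to BFloat16's from_bits.
constexpr int answer() { return 42; }

// Without --expt-relaxed-constexpr, nvcc rejects this call because answer()
// is a host function; with the flag, constexpr host functions become
// callable from device code.
__global__ void kernel(int* out) {
  *out = answer();
}
```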
The solution for the …
@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.