[ROCm] topk and sort fixes #12337
Conversation
* Topk part 1: fix intrinsics for a 64-lane wavefront (#224). 64 lanes in a wavefront require an intrinsics change.
* Disable in-place sorting on ROCm (#237). It is known to hang; use the Thrust fallback. Skip one test that fails with the fallback.
* Topk fixes (#239), illustrated in the sketch after this list:
  * The PTX spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf), Sec. 9.7.1.19 (bfe) and 9.7.1.20 (bfi), requires pos and len to be limited to 0...255.
  * The same spec, Sec. 9.7.1.19, requires the extracted bits to be placed in the LSBs.
  * Correct the logic for getLaneMaskLe; the previous logic returned 0x0 instead of 0xffffffffffffffff for lane 63.
  * Round up blockDim.x (ceil-divide by the warp size) to prevent a negative index into smem.
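A minimal sketch (hypothetical names, not the actual PyTorch/THC code) of two of the fixes above: clamping bfe's pos/len operands to 0...255 with the result placed in the LSBs, and the getLaneMaskLe off-by-one, where shifting 1 left by lane + 1 is undefined for lane 63 on a 64-lane wavefront and came out as 0x0 instead of all ones:

```cpp
#include <cstdint>

// Sketch of a software bit-field extract following PTX bfe semantics
// (Sec. 9.7.1.19): pos and len are limited via their low 8 bits, and the
// extracted field is returned in the least-significant bits.
// Hypothetical helper, not the actual THC implementation.
__device__ inline uint32_t bfeSketch(uint32_t val, uint32_t pos, uint32_t len) {
  pos &= 0xff;                        // limit pos to 0...255
  len &= 0xff;                        // limit len to 0...255
  if (len == 0) return 0;
  if (pos >= 32) return 0;            // field starts past the MSB
  val >>= pos;                        // move the field into the LSBs
  if (len >= 32) return val;          // whole remaining word
  return val & ((1u << len) - 1);     // mask off bits above the field
}

// getLaneMaskLe: all-ones up to and including the caller's lane.
// Buggy version: for lane == 63, 1ull << 64 is undefined behavior and
// evaluates to 0 here, so the mask became 0x0.
__device__ inline uint64_t laneMaskLeBuggy(unsigned lane) {
  return (1ull << (lane + 1)) - 1;
}

// Fixed version: shift all-ones right instead; the shift amount stays in
// 0...63, so lane 63 correctly yields 0xffffffffffffffff.
__device__ inline uint64_t laneMaskLeFixed(unsigned lane) {
  return ~0ull >> (63 - lane);
}
```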
@pytorchbot retest this please
Just FYI, when you submit these PRs, please use full URLs for issues; they are cross-linking to the wrong issues now.
```diff
@@ -207,7 +213,7 @@ __device__ void exclusiveBinaryPrefixScan(T* smem, bool in, T* out, T* carry, Bi
   *out -= (T) in;

   // The outgoing carry for all threads is the last warp's sum
-  *carry = smem[(blockDim.x / SCAN_UTILS_WARP_SIZE) - 1];
+  *carry = smem[THCCeilDiv<int>(blockDim.x, SCAN_UTILS_WARP_SIZE) - 1];
```
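The ceil-division matters when blockDim.x is smaller than the wavefront size (64 on ROCm): plain integer division yields 0, so the smem index becomes -1. A sketch of the round-up idiom THCCeilDiv uses, assuming positive operands:

```cpp
// Round-up integer division (sketch of the THCCeilDiv idiom; assumes a, b > 0).
template <typename T>
__host__ __device__ T ceilDivSketch(T a, T b) {
  return (a + b - 1) / b;
}

// Example with a 32-thread block on a 64-lane ROCm wavefront:
//   old: (32 / 64) - 1             == -1  -> negative smem index
//   new: ceilDivSketch(32, 64) - 1 ==  0  -> the single partial-warp sum
```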
@pytorchbot retest this please
ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Summary:
* Topk part 1: fix intrinsics for a 64-lane wavefront (#224). 64 lanes in a wavefront require an intrinsics change.
* Disable in-place sorting on ROCm (#237). It is known to hang; use the Thrust fallback. Skip one test that fails with the fallback.
* Topk fixes (#239):
  * The PTX spec (https://docs.nvidia.com/cuda/pdf/ptx_isa_6.3.pdf), Sec. 9.7.1.19 (bfe) and 9.7.1.20 (bfi), requires pos and len to be limited to 0...255.
  * The same spec, Sec. 9.7.1.19, requires the extracted bits to be placed in the LSBs.
  * Correct the logic for getLaneMaskLe; the previous logic returned 0x0 instead of 0xffffffffffffffff for lane 63.
  * Round up blockDim.x to prevent a negative index into smem.

@bddppq @ezyang Note the one additional skipped test resulting from using the Thrust sort fallback for all sizes. We are working on getting bitonic sort to work properly (and always); until then, this test needs to be skipped on ROCm.

Pull Request resolved: pytorch/pytorch#12337
Differential Revision: D10259481
Pulled By: ezyang
fbshipit-source-id: 5c8dc6596d7a3103ba7b4b550cba895f38c8148e
@bddppq @ezyang
Note the one additional skipped test resulting from using the Thrust sort fallback for all sizes. We are working on getting bitonic sort to work properly (and always); until then, this test needs to be skipped on ROCm.