[RFC] [SIMT] Add CUDA warp-level intrinsics to Taichi #4631
Comments
Extension: Add warp size query and control. Warp-level intrinsics exist in Vulkan and Metal, and on those platforms some devices use a warp size different from 32; some devices even allow custom warp sizes. (subgroup size control & subgroup operations)
@bobcao3 Can't agree more! :-)
Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into adding higher-level intrinsics directly, such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives. Reference: https://www.khronos.org/blog/vulkan-subgroup-tutorial
Would love to see this!
One addition to this proposal: warp intrinsics are a great add-on, but in the meantime, we also need a design to formalize our parallelization strategy. Right now it's quite vague to users how a Taichi for iteration is mapped to a GPU thread (TL;DR: it's backend-dependent..) I think we need to offer an explicit spec on this (cc @strongoier).
Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.
I want to take care of |
Continuing discussions on @bobcao3's question:
My opinion: I agree that exposing masks can be extra trouble for users and can harm portability. Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful? If not, then maybe we should not expose masks.
Also vote for hiding the masks beneath Taichi's interface. The masks are extremely troublesome and hard to understand, especially in Taichi, where we have hidden a great many parallelization details for elegant parallel programming. The prerequisite for exposing masks is a set of more direct APIs to manipulate parallelization.
Special stencil patterns covering specific near neighbors (star stencil etc.) might need special masks, but such optimizations can be handled internally in Taichi. We can also quickly add the mask APIs when needed.
According to the CUDA API, the masking behavior is really unexpected. An active thread executing an instruction while it is not in the mask yields undefined behavior, so the mask is only a convergence requirement. Now comes the tricky part: there is no explicit convergence requirement in CUDA, thus the mask must be queried every time we've taken a branch. Using the ALL mask in divergent control flow can result in a GPU hang, while using
Masks in vector processing like AVX512 or RISC-V Vectors are very different from CUDA's.
I would like to take care of |
Got a naive question. If I want to implement a task in this issue or another open issue, how do I know whether somebody is already doing the same work as me?
Good question. As long as nobody says "I'll take this task" and the issue has no assignee, you are safe to assume that nobody is working on it. Before you start coding, it would be nice to leave a comment "let me implement XXXX" so that people know you are working on it :-) |
Recent NV gpus (Ampere and later) also support |
Wow, that sounds quite attractive. Thanks for pointing this out. We need to dispatch the code according to compute capability. One place to look at: taichi/taichi/runtime/llvm/locked_task.h (Line 28 in d82ea90)
@qiao-bo Could you add this to the feature list and coordinate its development? Many thanks! |
@yuanming-hu @masahi It turns out to be a bit difficult to support the new warp reduce intrinsics at this moment. For example, the migration to LLVM 12 is on our roadmap, yet even LLVM 12 may still lack support for this warp reduce ;). For the purpose of this issue, I suggest moving this feature proposal to another issue for later work. WDYT?
Sounds good - we probably need to postpone the implementation until we have LLVM >= 13. (If someone insists on implementing that, they can also consider using inline PTX assembly.)
I will take care of |
I'll take care of |
I'm working on |
Update: since we are approaching the v1.1.0 release, I would like to post an intermediate summary on this issue. Thanks to our contributors, the list of warp-level intrinsics has been fully implemented. The milestone has also been achieved, namely using the intrinsics to implement a parallel scan (https://github.com/taichi-dev/taichi_benchmark/blob/main/pbf/src/taichi/scan.py), thanks to @YuCrazing. As the next step, the following related tasks are planned:
In the long term, we plan to provide high-level primitives that are backend-agnostic and can abstract over CUDA warp intrinsics, Vulkan subgroups, Metal SIMD groups, CPU vectorization, etc. Since this issue is meant to address CUDA warp-level intrinsics, maybe we can use another issue to track the progress of the mentioned tasks?
Hi, I wanted to know if anyone is working on adding support for the |
Maybe you can use a structure similar to how TextureStmt returns vec4... |
Can you share the link to it? I can't find |
(For people who are familiar with CUDA/LLVM, this is a good starting issue. For most intrinsics, you will only need to write < 10 LoC to implement the API, and < 50 LoC to test it. Come join us! :-)
Intro
There has been an increasing Taichi user need for writing high-performance SIMT kernels. For these use cases, it is fine to sacrifice a certain level of portability.
Currently, when running on CUDA, Taichi already follows the SIMT execution model. However, it lacks support for warp-level and block-level intrinsics (e.g., `__ballot_sync` and `__syncthreads`) that are often needed in fancy SIMT kernels.

Implementation plan

`__syncthreads` and add explicit shared memory support. We may even consider TensorCore and ray-tracing intrinsics.

List of CUDA warp-level intrinsics
We plan to implement all of the following warp-level intrinsics:

- `__all_sync` (should be named `all_nonzero` in our API to avoid conflict with `all` in Python) (by @varinic, [SIMT] Add all_sync warp intrinsics #4718)
- `__any_sync` (should be named `any_nonzero` to avoid conflict with `any` in Python) (by @varinic, [SIMT] Add any_sync warp intrinsics #4719)
- `__uni_sync` (should be named `unique`) (by @0xzhang, [SIMT] Add uni_sync warp intrinsics #4927 (comment))
- `__ballot_sync` (by @Wimacs, [SIMT] Add ballot_sync warp intrinsics #4641)
- `__shfl_sync (i32)` (by @varinic, [SIMT] Add shfl_sync_i32/f32 warp intrinsics #4717)
- `__shfl_sync (f32)` (by @varinic, [SIMT] Add shfl_sync_i32/f32 warp intrinsics #4717)
- `__shfl_up_sync (i32)` (by @YuCrazing, [SIMT] Add shfl_up_i32/f32 warp level intrinsics #4632)
- `__shfl_up_sync (f32)` (by @YuCrazing, [SIMT] Add shfl_up_i32/f32 warp level intrinsics #4632)
- `__shfl_down_sync (i32)` (by @yuanming-hu, [SIMT] Implement ti.simt.warp.shfl_down_i32 and add stubs for other warp-level intrinsics #4616)
- `__shfl_down_sync (f32)` (by @caic99, [SIMT] Add shfl_down_f32 intrinsic. #4819)
- `__shfl_xor_sync` (by @varinic, WIP, [SIMT] Add shfl_xor_i32 warp intrinsics #4642)
- `__match_any_sync` (by @galeselee, [SIMT] Add match_any warp intrinsics #4921)
- `__match_all_sync` (by @galeselee, [SIMT] Add match_all warp intrinsics #4961)
- `__activemask` (by @galeselee, [SIMT] Add activemask warp intrinsics #4918)
- `__syncwarp` (by @galeselee, [SIMT] Add syncwarp warp intrinsics #4917)

See here and CUDA doc for more details :-)
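To make the semantics of the listed vote intrinsics concrete, here is a tiny pure-Python model (helper names are illustrative, not the Taichi or CUDA API) of what `__ballot_sync` and `__all_sync` compute across a 32-lane warp:

```python
FULL_MASK = 0xFFFFFFFF

def ballot_sync(mask, predicates):
    """Model of __ballot_sync: bit i of the result is set iff lane i
    participates (bit i of mask is set) and its predicate is nonzero."""
    out = 0
    for lane, pred in enumerate(predicates):
        if (mask >> lane) & 1 and pred:
            out |= 1 << lane
    return out

def all_sync(mask, predicates):
    """Model of __all_sync: true iff every participating lane's predicate is nonzero."""
    return all(p for lane, p in enumerate(predicates) if (mask >> lane) & 1)

preds = [lane % 2 for lane in range(32)]     # odd lanes hold a true predicate
print(hex(ballot_sync(FULL_MASK, preds)))    # → 0xaaaaaaaa
print(all_sync(FULL_MASK, preds))            # → False
```

On a real GPU the mask argument also carries the convergence requirement discussed in the comments above; this model only captures the data semantics.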
API
We may pick one of the following API formats, depending on whether warp-level and block-level intrinsics should be put under the same namespace:

- `ti.simt.X`, such as `ti.simt.ballot()` and `ti.simt.warp_sync()`
- `ti.simt.warp.X`, such as `ti.simt.warp.ballot()` and `ti.simt.warp.sync()`

Please let me know which one you guys prefer :-)
Example
Computing sum of all values in a warp using `shfl_down`:
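The original code block did not survive the page extraction. As a stand-in, here is a pure-Python sketch of the tree-reduction pattern this example refers to; `shfl_down` below only simulates the lane-to-lane data movement that the GPU intrinsic performs, so this is an illustration of the algorithm, not the actual Taichi kernel:

```python
WARP_SIZE = 32

def shfl_down(vals, offset):
    """Simulate __shfl_down_sync over a full warp: lane i reads the value
    from lane i + offset (unchanged when the source lane is out of range)."""
    n = len(vals)
    return [vals[i + offset] if i + offset < n else vals[i] for i in range(n)]

def warp_sum(vals):
    """Tree reduction: after log2(32) = 5 halving steps, lane 0 holds the
    sum of all 32 lane values."""
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]

print(warp_sum(list(range(32))))  # → 496
```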
Steps and how we collaborate
`InternalFuncCallExpression` and `InternalFuncStmt`. One issue is that in the LLVM codegen the generated function takes `RuntimeContext *`, which is not needed. We need to make that optional. (Update: this is done in [SIMT] Implement ti.simt.warp.shfl_down_i32 and add stubs for other warp-level intrinsics #4616)

`ti.simt.warp.X`.

Currently we are at step 2. For everyone who wants to contribute to this, please take one single intrinsic function to implement per PR. That will simplify review and testing.
Please leave a comment (e.g., "I'll take care of `ti.simt.warp.shfl`!") in this PR, so that other community members know you are working on it and we avoid duplicated work.

For example, if you wish to implement `ballot`, fill in taichi/python/taichi/lang/simt.py (Lines 20 to 22 in 8497320) and taichi/tests/python/test_simt.py (Lines 23 to 26 in 8497320).
An example PR: #4632
What we already have
Scaffold code and `shfl_down_i32`

I went ahead and implemented #4616.
LLVM -> NVVM -> PTX code path
We already have a bunch of functions that wrap most of these intrinsics: taichi/taichi/llvm/llvm_context.cpp (Lines 355 to 369 in bee97d5)
Therefore, in most cases an intrinsic can be implemented in just 3-4 lines of code (plus tests): we simply call these existing functions. For example: taichi/python/taichi/lang/simt/warp.py (Lines 81 to 88 in 22d1895)
Milestone
Implement GPU parallel scan (prefix sum)? That would be very useful in particle simulations.
Ideas are welcome!
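A comment in the thread reports that this milestone has since been reached with warp intrinsics (scan.py in taichi_benchmark). For intuition only, here is a pure-Python model of a single warp's inclusive scan in the Hillis-Steele style; the names are illustrative, and the real implementation moves data with shuffle intrinsics on the GPU:

```python
WARP_SIZE = 32

def shfl_up(vals, offset):
    """Model of __shfl_up_sync: lane i reads from lane i - offset
    (value unchanged when the source lane is out of range)."""
    return [vals[i - offset] if i - offset >= 0 else vals[i]
            for i in range(len(vals))]

def warp_inclusive_scan(vals):
    """Hillis-Steele inclusive scan within one warp: after log2(32) = 5
    doubling steps, lane i holds the sum of lanes 0..i."""
    assert len(vals) == WARP_SIZE
    offset = 1
    while offset < WARP_SIZE:
        shifted = shfl_up(vals, offset)
        # Only lanes with a valid source lane accumulate at this step.
        vals = [v + s if i >= offset else v
                for i, (v, s) in enumerate(zip(vals, shifted))]
        offset *= 2
    return vals

print(warp_inclusive_scan([1] * 32))  # → [1, 2, 3, ..., 32]
```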
Future steps: making Taichi (kind of) a superset of CUDA!
- `__syncthreads`, `__threadfence`, etc.
- `ti.raw_kernel`, something that provides a 1:1 mapping to a `__global__` CUDA kernel

Appendix: List of higher-level primitives (in Vulkan, Metal, etc.; implemented as helpers in CUDA)
Some of these exist in CUDA directly; however, the scope of execution (i.e. the mask) is not involved and `sync` behavior is guaranteed, therefore they cannot be mapped 1:1 to CUDA and helper functions are needed. (Reference: https://www.youtube.com/watch?v=fP1Af0u097o where Nvidia talked about implementing these in the drivers)

- `subgroupBarrier`: execution barrier
- `subgroupMemoryBarrier`: memory fence
- `subgroupElect`: elect a single invocation as leader (very useful in atomic reduction)
- `subgroupAll`
- `subgroupAny`
- `subgroupAllEqual`
- `subgroupBroadcast` (might be tricky, as the `id` that is broadcast from is a compile-time constant)
- `subgroupBroadcastFirst` (uses the lowest-id active invocation)
- `ballot` options (`GL_KHR_shader_subgroup_ballot`)
- `subgroupAdd`
- `subgroupMul`
- `subgroupMin`
- `subgroupMax`
- `subgroupAnd`
- `subgroupOr`
- `subgroupXor`
- `subgroupShuffle`
- `subgroupShuffleXor`
- `subgroupShuffleUp`
- `subgroupShuffleDown`
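To make a few of these semantics concrete, here is a pure-Python model over a small subgroup, with `active` marking which invocations are in the execution scope (illustrative names, not any real shader API):

```python
def subgroup_elect(active):
    """Model of subgroupElect: true only for the lowest-indexed active invocation."""
    first = next((i for i, a in enumerate(active) if a), None)
    return [a and i == first for i, a in enumerate(active)]

def subgroup_broadcast_first(active, vals):
    """Model of subgroupBroadcastFirst: every active invocation reads the
    value held by the lowest-indexed active invocation."""
    first = next(i for i, a in enumerate(active) if a)
    return [vals[first] if a else v for a, v in zip(active, vals)]

def subgroup_add(active, vals):
    """Model of subgroupAdd: every active invocation receives the sum
    over all active invocations; inactive lanes are untouched."""
    total = sum(v for a, v in zip(active, vals) if a)
    return [total if a else v for a, v in zip(active, vals)]

active = [True, False, True, True]
print(subgroup_elect(active))                          # → [True, False, False, False]
print(subgroup_broadcast_first(active, [5, 6, 7, 8]))  # → [5, 6, 5, 5]
print(subgroup_add(active, [1, 2, 3, 4]))              # → [8, 2, 8, 8]
```

Note how the execution scope replaces CUDA's explicit mask argument: the driver tracks which invocations are active, which is exactly why these primitives need helper wrappers on CUDA.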