
[RFC] [SIMT] Add CUDA warp-level intrinsics to Taichi #4631

Open · 28 of 37 tasks
yuanming-hu opened this issue Mar 25, 2022 · 25 comments
Labels: discussion · feature request · RFC · welcome contribution

@yuanming-hu (Member) commented Mar 25, 2022

(For people who are familiar with CUDA/LLVM, this is a good starting issue. For most intrinsics, you will only need to write < 10 LoC to implement the API, and < 50 LoC to test it. Come join us! :-)

Intro

There has been increasing demand from Taichi users for writing high-performance SIMT kernels. For these use cases, it is fine to sacrifice a certain level of portability.

Currently, when running on CUDA, Taichi already follows the SIMT execution model. However, it lacks support for warp-level and block-level intrinsics (e.g., __ballot_sync and __syncthreads) that are often needed in fancy SIMT kernels.

Implementation plan

  • As a first step, this issue covers CUDA warp-level intrinsics only.
  • In the longer term, we may consider supporting other backends such as SPIR-V, Metal, AMDGPU, etc. We may also consider other intrinsics such as __syncthreads and add explicit shared-memory support. We may even consider TensorCore and ray-tracing intrinsics.

List of CUDA warp-level intrinsics

We plan to implement all of the following warp-level intrinsics:

See here and the CUDA docs for more details :-)

API

We may pick one of the following API formats, depending on whether warp-level and block-level intrinsics should be put under the same namespace:

  1. ti.simt.X, such as ti.simt.ballot() and ti.simt.warp_sync()
  2. ti.simt.warp.X, such as ti.simt.warp.ballot() and ti.simt.warp.sync()
  3. Other ideas?

Please let me know which one you guys prefer :-)

Example

Computing the sum of all values in a warp using shfl_down:

@ti.func
def warp_reduce(val):
    mask = ti.u32(0xFFFFFFFF)
    # assuming warp_size = 32 and no outside warp divergence
    val += ti.simt.warp.shfl_down(mask, val, 16)
    val += ti.simt.warp.shfl_down(mask, val, 8)
    val += ti.simt.warp.shfl_down(mask, val, 4)
    val += ti.simt.warp.shfl_down(mask, val, 2)
    val += ti.simt.warp.shfl_down(mask, val, 1)
    return val
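
For illustration, a driver kernel could look roughly like the sketch below. This assumes that a 32-iteration range-for with the block size pinned to 32 maps one iteration to one thread of a single warp; both the ti.loop_config usage and that mapping are assumptions, not guaranteed behavior.

arr = ti.field(ti.i32, shape=32)
warp_sum = ti.field(ti.i32, shape=())

@ti.kernel
def reduce_one_warp():
    ti.loop_config(block_dim=32)  # pin the block to a single warp (assumed usage)
    for i in range(32):
        s = warp_reduce(arr[i])
        if i % 32 == 0:           # after the shfl_down ladder, lane 0 holds the total
            warp_sum[None] = s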

Steps and how we collaborate

  1. Implement the infrastructure for the intrinsics. We will use InternalFuncCallExpression and InternalFuncStmt. One issue is that in the LLVM codegen the generated function takes RuntimeContext *, which is not needed. We need to make that optional. (Update: this is done in [SIMT] Implement ti.simt.warp.shfl_down_i32 and add stubs for other warp-level intrinsics #4616)
  2. Implement all the intrinsics and add corresponding test cases
  3. Decide which namespace to use, and put all the intrinsics into that namespace. Until we reach a consensus, let's use ti.simt.warp.X.
  4. Add documentation

Currently we are at step 2. If you want to contribute, please take a single intrinsic function to implement per PR; that simplifies review and testing.

Please leave a comment (e.g., "I'll take care of ti.simt.warp.shfl!") in this issue, so that other community members know that you are working on it and we avoid duplicate work.

For example, if you wish to implement ballot, fill in

def ballot():
    # TODO
    pass

and

@test_utils.test(arch=ti.cuda)
def test_ballot():
    # TODO
    pass

An example PR: #4632
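
For reference, a filled-in pair might look roughly like the sketch below, using the already-implemented shfl_down_i32 from #4616. The thread-to-iteration mapping, the ti.loop_config call, and the test_utils import path are assumptions rather than confirmed details.

import taichi as ti
from tests import test_utils  # import path assumed

@test_utils.test(arch=ti.cuda)
def test_shfl_down_i32():
    a = ti.field(dtype=ti.i32, shape=32)

    @ti.kernel
    def foo():
        ti.loop_config(block_dim=32)  # one warp
        for i in range(32):
            # each lane reads the value held by the lane one position above it
            a[i] = ti.simt.warp.shfl_down_i32(ti.u32(0xFFFFFFFF), i, 1)

    foo()
    for i in range(31):
        assert a[i] == i + 1
    assert a[31] == 31  # the top lane has no source lane, so it keeps its own value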

What we already have

Scaffold code and shfl_down_i32

I went ahead and implemented this in #4616

LLVM -> NVVM -> PTX code path

We already have a bunch of functions that wrap most of these intrinsics:

patch_intrinsic("warp_barrier", Intrinsic::nvvm_bar_warp_sync, false);
patch_intrinsic("block_memfence", Intrinsic::nvvm_membar_cta, false);
patch_intrinsic("grid_memfence", Intrinsic::nvvm_membar_gl, false);
patch_intrinsic("system_memfence", Intrinsic::nvvm_membar_sys, false);
patch_intrinsic("cuda_ballot", Intrinsic::nvvm_vote_ballot);
patch_intrinsic("cuda_ballot_sync", Intrinsic::nvvm_vote_ballot_sync);
patch_intrinsic("cuda_shfl_down_sync_i32",
Intrinsic::nvvm_shfl_sync_down_i32);
patch_intrinsic("cuda_shfl_down_sync_f32",
Intrinsic::nvvm_shfl_sync_down_f32);
patch_intrinsic("cuda_match_any_sync_i32",
Intrinsic::nvvm_match_any_sync_i32);

Therefore, in most cases the intrinsics can be implemented in just 3-4 lines of code (plus tests) by simply calling these functions. For example,

def shfl_down_f32(mask, val, offset):
    # lane offset is 31 for warp size 32
    return impl.call_internal("cuda_shfl_down_sync_f32",
                              mask,
                              val,
                              offset,
                              31,
                              with_runtime_context=False)
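
Following the same pattern, another wrapper could be sketched as follows; the argument order (mask, predicate) is assumed from Intrinsic::nvvm_vote_ballot_sync and is not a finalized signature:

def ballot_sync(mask, predicate):
    # thin wrapper over the existing runtime function listed above
    return impl.call_internal("cuda_ballot_sync",
                              mask,
                              predicate,
                              with_runtime_context=False)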

Milestone

Implement a GPU parallel scan (prefix sum)? That would be very useful in particle simulations.
Ideas are welcome!
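
As a rough starting point, a warp-level inclusive scan (Hillis-Steele style) might look like the sketch below. It assumes a shfl_up wrapper analogous to shfl_down and a lane-id query such as ti.global_thread_idx() % 32; both are assumptions rather than settled API.

@ti.func
def warp_inclusive_scan(val):
    mask = ti.u32(0xFFFFFFFF)
    lane = ti.global_thread_idx() % 32  # lane id within the warp (assumed helper)
    for i in ti.static(range(5)):       # offsets 1, 2, 4, 8, 16
        offset = 1 << i
        n = ti.simt.warp.shfl_up_i32(mask, val, offset)  # assumed wrapper name
        if lane >= offset:
            val += n
    return val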

Future steps: making Taichi (kind of) a superset of CUDA!

  1. Explicit shared memory operation support
  2. Other block-level and grid-level intrinsics: __syncthreads, __threadfence, etc.
  3. ti.raw_kernel, something that provides 1:1 mapping to a __global__ CUDA kernel

Appendix: List of higher-level primitives (available in Vulkan, Metal, etc., and implemented as helpers in CUDA)

Some of these exist in CUDA directly; however, the execution scope (i.e., the mask) is not involved and sync behavior is guaranteed, so they cannot be mapped 1:1 onto CUDA and helper functions are needed (a sketch of one such helper follows the list below). (Reference: https://www.youtube.com/watch?v=fP1Af0u097o, where Nvidia talks about implementing these in the drivers.)

  • subgroupBarrier Execution barrier
  • subgroupMemoryBarrier Memory fence
  • subgroupElect Elect a single invocation as leader (very useful in atomic reduction)
  • subgroupAll
  • subgroupAny
  • subgroupAllEqual
  • subgroupBroadcast (might be tricky, as the id being broadcast from must be a compile-time constant)
  • subgroupBroadcastFirst (uses the lowest-id active invocation)
  • Other ballot options (GL_KHR_shader_subgroup_ballot)
  • Subgroup arithmetic (Useful in reduction primitives)
    • subgroupAdd
    • subgroupMul
    • subgroupMin
    • subgroupMax
    • subgroupAnd
    • subgroupOr
    • subgroupXor
  • Subgroup inclusive scan arithmetic (Like subgroup arithmetic, but the result is an inclusive scan)
  • subgroupShuffle
  • subgroupShuffleXor
  • subgroupShuffleUp
  • subgroupShuffleDown
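
As an illustration of the kind of helper needed, subgroupAll could be emulated on the CUDA backend by filling in the execution scope explicitly. The wrapper names active_mask and all_nonzero below are assumptions, not confirmed Taichi API.

@ti.func
def subgroup_all(predicate):
    # CUDA requires the execution scope to be spelled out as a mask,
    # so the helper queries the currently active lanes first.
    mask = ti.simt.warp.active_mask()  # assumed wrapper for __activemask()
    return ti.simt.warp.all_nonzero(mask, predicate)  # assumed wrapper for __all_sync()
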
@bobcao3 (Collaborator) commented Mar 25, 2022

Extension: add warp-size query and control. Warp-level intrinsics exist in Vulkan and Metal, and on those platforms some devices use a warp size different from 32; some devices even allow custom warp sizes (subgroup size control & subgroup operations).

@yuanming-hu (Member, Author):

@bobcao3 Can't agree more! :-)

@bobcao3 (Collaborator) commented Mar 25, 2022

Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into directly adding higher-level intrinsics such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives. Reference: https://www.khronos.org/blog/vulkan-subgroup-tutorial

@AmesingFlank (Collaborator) commented Mar 26, 2022

Would love to see this!
Btw Metal has pretty good warp intrinsics support as well (they call it SIMD-group). See table 6.13 in https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf

@k-ye (Member) commented Mar 26, 2022

One addition to this proposal: warp intrinsics are a great add-on, but in the meantime we also need a design that formalizes our parallelization strategy. Right now it is quite vague to users how a Taichi for iteration is mapped to a GPU thread (TL;DR: it is backend-dependent). I think we need to offer an explicit spec for this (cc @strongoier).

@bobcao3 (Collaborator) commented Mar 26, 2022

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.

@Wimacs (Contributor) commented Mar 27, 2022

I want to take care of the __ballot_sync intrinsic!

@yuanming-hu (Member, Author):

Continuing discussions on @bobcao3's question:

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks. We can hard-code it to all lanes for now, but due to the complexity of the scheduling and the non-guaranteed lock-step execution, using the right mask probably requires the compiler to figure out whether there can be divergence (when there is divergence, we need to run int mask = __match_any_sync(__activemask(), data); to get the right mask). I think handing masks over to the user may make it significantly harder to code, while also breaking compatibility with non-CUDA devices.

My opinion: I agree exposing masks can be extra trouble for users, and can harm portability. Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful? If not then maybe we should not expose masks.

@turbo0628 (Member) commented Mar 27, 2022

I agree exposing masks can be extra trouble for users, and can harm portability.

Also vote for hiding the masks beneath Taichi's interface.

The masks are extremely troublesome and hard to understand, especially in Taichi, since we have hidden many parallelization details in favor of elegant parallel programming. The prerequisite for exposing masks is a set of more direct APIs for manipulating parallelization.

Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful?

Special stencil patterns covering specific near neighbors (star stencil etc.) might need special masks, but such optimizations can be handled internally in Taichi. We can also quickly add the mask APIs when needed.

@bobcao3 (Collaborator) commented Mar 27, 2022

According to the CUDA API, the masking behavior is not what one might expect. An active thread executing an instruction while it is not in the mask yields undefined behavior, so the mask is only a convergence requirement. Now comes the tricky part: there is no explicit convergence requirement in CUDA, so the mask must be queried every time we have taken a branch. Using the ALL mask in divergent control flow can result in a GPU hang, while using __activemask() does not guarantee reconvergence after branching. So we should definitely hide the mask, but it also seems quite tricky to implement masks internally. I would say we need to maintain a mask variable once we encounter an IfStmt.

@bobcao3 (Collaborator) commented Mar 27, 2022

Masks in vector processing, such as AVX512 or RISC-V Vector, are very different from CUDA's.

@varinic (Contributor) commented Mar 27, 2022

I would like to take care of the __shfl_xor_sync intrinsic!

@DongqiShen (Contributor):

I've got a naive question: if I want to implement a task in this issue (or another open issue), how do I know whether somebody is already doing the same work?

@yuanming-hu (Member, Author):

I've got a naive question: if I want to implement a task in this issue (or another open issue), how do I know whether somebody is already doing the same work?

Good question. As long as nobody says "I'll take this task" and the issue has no assignee, you are safe to assume that nobody is working on it. Before you start coding, it would be nice to leave a comment "let me implement XXXX" so that people know you are working on it :-)

@masahi (Contributor) commented Apr 25, 2022

Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into directly adding higher-level intrinsics such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives.

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of these intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, the slide on page 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x:
https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

@yuanming-hu (Member, Author):

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of these intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, the slide on page 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x:
https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

Wow, that sounds quite attractive. Thanks for pointing this out. We need to dispatch the code according to compute capability. One place to look at:

if (cuda_compute_capability() < 70) {

@qiao-bo Could you add this to the feature list and coordinate its development? Many thanks!

@qiao-bo added this to the Taichi v1.1.0 milestone Apr 26, 2022
@qiao-bo (Collaborator) commented Apr 27, 2022

@yuanming-hu @masahi It turns out to be a bit difficult to support the new warp-reduce intrinsics at this moment. For example, __reduce_add_sync (i32) needs to be mapped to redux.sync.add.s32. This new redux instruction is only supported since LLVM 13 (https://github.com/llvm/llvm-project/blob/release/13.x/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td). We also tried bypassing NVVM and using PTX asm directly in our runtime, but LLVM 10 wouldn't let us because of the PTX JIT compilation.

The migration to LLVM 12 is on our roadmap. Nevertheless, it may still lack support for this warp reduce ;). For the purposes of this issue, I suggest moving this feature proposal to a separate issue for later work. WDYT?

@yuanming-hu (Member, Author):

Sounds good - we probably need to postpone the implementation until we have LLVM >= 13.

(If someone insists on implementing it sooner, they can also consider using inline PTX assembly.)

@galeselee (Contributor):

I will take care of the __syncwarp intrinsic.

@0xzhang (Collaborator) commented May 6, 2022

I'll take care of __uni_sync.

@galeselee (Contributor):

I will take care of the __syncwarp intrinsic.

I'm working on match_all.

@qiao-bo (Collaborator) commented Jul 5, 2022

Update: since we are approaching the v1.1.0 release, I would like to give an intermediate summary of this issue.

Thanks to our contributors, the list of warp-level intrinsics has been fully implemented. The milestone has also been achieved, namely using the intrinsics to implement a parallel scan (https://github.com/taichi-dev/taichi_benchmark/blob/main/pbf/src/taichi/scan.py), thanks to @YuCrazing.

As the next step, the following related tasks are planned:

  • Add more examples to utilize the warp intrinsics
  • Document the intrinsics on the Taichi docs website
  • Block-level support, i.e., explicit shared memory support (@turbo0628)
  • raw_kernel support
  • HW-supported warp intrinsics on NV GPUs

In the long term, we plan to provide high-level primitives that are backend-agnostic and can abstract over CUDA warp intrinsics, Vulkan subgroups, Metal SIMD-groups, CPU vectorization, etc.

Since this issue is meant to address CUDA warp-level intrinsics, maybe we can use another issue to track the progress of the mentioned tasks?

@alasin commented Jan 2, 2024

Hi, I wanted to know whether anyone is working on adding support for the subgroup* operations listed above. I can add support for some of the simple ones (shuffle*), but it would be great if someone could look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot), as I'm not sure how to implement them (the return type is a uvec4) and I need them for a project.

@bobcao3 (Collaborator) commented Jan 2, 2024

Hi, I wanted to know whether anyone is working on adding support for the subgroup* operations listed above. I can add support for some of the simple ones (shuffle*), but it would be great if someone could look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot), as I'm not sure how to implement them (the return type is a uvec4) and I need them for a project.

Maybe you can use a structure similar to how TextureStmt returns vec4...

@alasin commented Jan 3, 2024

Maybe you can use a structure similar to how TextureStmt returns vec4...

Can you share the link to it? I can't find TextureStmt while searching.
