
[RFC] [SIMT] Add CUDA warp-level intrinsics to Taichi #4631

Open · 28 of 37 tasks
yuanming-hu opened this issue Mar 25, 2022 · 25 comments
Labels: discussion · feature request · RFC · welcome contribution

@yuanming-hu (Member) commented Mar 25, 2022

(For people who are familiar with CUDA/LLVM, this is a good starting issue. For most intrinsics, you will only need to write < 10 LoC to implement the API, and < 50 LoC to test it. Come join us! :-)

Intro

There has been increasing demand from Taichi users for writing high-performance SIMT kernels. For these use cases, it is fine to sacrifice a certain level of portability.

Currently, when running on CUDA, Taichi already follows the SIMT execution model. However, it lacks support for warp-level and block-level intrinsics (e.g., __ballot_sync and __syncthreads) that are often needed in fancy SIMT kernels.

Implementation plan

  • As a first step, this issue covers CUDA warp-level intrinsics only.
  • In the longer term, we may consider supporting other backends such as SPIR-V, Metal, AMDGPU, etc. We may also consider other intrinsics such as __syncthreads and add explicit shared-memory support. We may even consider TensorCore and ray-tracing intrinsics.

List of CUDA warp-level intrinsics

We plan to implement all of the following warp-level intrinsics:

See here and the CUDA docs for more details :-)

API

We may pick one of the following API formats, depending on whether warp-level and block-level intrinsics should be put under the same namespace:

  1. ti.simt.X, such as ti.simt.ballot() and ti.simt.warp_sync()
  2. ti.simt.warp.X, such as ti.simt.warp.ballot() and ti.simt.warp.sync()
  3. Other ideas?

Please let me know which one you guys prefer :-)

Example

Computing the sum of all values in a warp using shfl_down:

@ti.func
def warp_reduce(val):
    mask = ti.u32(0xFFFFFFFF)
    # assuming warp_size = 32 and no outside warp divergence
    val += ti.simt.warp.shfl_down(mask, val, 16)
    val += ti.simt.warp.shfl_down(mask, val, 8)
    val += ti.simt.warp.shfl_down(mask, val, 4)
    val += ti.simt.warp.shfl_down(mask, val, 2)
    val += ti.simt.warp.shfl_down(mask, val, 1)
    return val
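
For illustration, a driver kernel could look roughly like the sketch below. This assumes that a 32-iteration range-for with the block size pinned to 32 maps one iteration to one thread of a single warp; both the ti.loop_config usage and that mapping are assumptions, not guaranteed behavior.

arr = ti.field(ti.i32, shape=32)
warp_sum = ti.field(ti.i32, shape=())

@ti.kernel
def reduce_one_warp():
    ti.loop_config(block_dim=32)  # pin the block to a single warp (assumed usage)
    for i in range(32):
        s = warp_reduce(arr[i])
        if i % 32 == 0:           # after the shfl_down ladder, lane 0 holds the total
            warp_sum[None] = s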

Steps and how we collaborate

  1. Implement the infrastructure for the intrinsics. We will use InternalFuncCallExpression and InternalFuncStmt. One issue is that in the LLVM codegen the generated function takes RuntimeContext *, which is not needed. We need to make that optional. (Update: this is done in [SIMT] Implement ti.simt.warp.shfl_down_i32 and add stubs for other warp-level intrinsics #4616)
  2. Implement all the intrinsics and add corresponding test cases
  3. Decide which namespace to use, and put all the intrinsics into that namespace. Until we reach a consensus, let's use ti.simt.warp.X.
  4. Add documentation

Currently we are at step 2. If you want to contribute, please take a single intrinsic function to implement per PR; that simplifies review and testing.

Please leave a comment (e.g., "I'll take care of ti.simt.warp.shfl!") in this issue, so that other community members know that you are working on it and we avoid duplicate work.

For example, if you wish to implement ballot, fill in

def ballot():
    # TODO
    pass

and

@test_utils.test(arch=ti.cuda)
def test_ballot():
    # TODO
    pass

An example PR: #4632
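
For reference, a filled-in pair might look roughly like the sketch below, using the already-implemented shfl_down_i32 from #4616. The thread-to-iteration mapping, the ti.loop_config call, and the test_utils import path are assumptions rather than confirmed details.

import taichi as ti
from tests import test_utils  # import path assumed

@test_utils.test(arch=ti.cuda)
def test_shfl_down_i32():
    a = ti.field(dtype=ti.i32, shape=32)

    @ti.kernel
    def foo():
        ti.loop_config(block_dim=32)  # one warp
        for i in range(32):
            # each lane reads the value held by the lane one position above it
            a[i] = ti.simt.warp.shfl_down_i32(ti.u32(0xFFFFFFFF), i, 1)

    foo()
    for i in range(31):
        assert a[i] == i + 1
    assert a[31] == 31  # the top lane has no source lane, so it keeps its own value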

What we already have

Scaffold code and shfl_down_i32

I went ahead and implemented this in #4616

LLVM -> NVVM -> PTX code path

We already have a bunch of functions that wrap most of these intrinsics:

patch_intrinsic("warp_barrier", Intrinsic::nvvm_bar_warp_sync, false);
patch_intrinsic("block_memfence", Intrinsic::nvvm_membar_cta, false);
patch_intrinsic("grid_memfence", Intrinsic::nvvm_membar_gl, false);
patch_intrinsic("system_memfence", Intrinsic::nvvm_membar_sys, false);
patch_intrinsic("cuda_ballot", Intrinsic::nvvm_vote_ballot);
patch_intrinsic("cuda_ballot_sync", Intrinsic::nvvm_vote_ballot_sync);
patch_intrinsic("cuda_shfl_down_sync_i32",
Intrinsic::nvvm_shfl_sync_down_i32);
patch_intrinsic("cuda_shfl_down_sync_f32",
Intrinsic::nvvm_shfl_sync_down_f32);
patch_intrinsic("cuda_match_any_sync_i32",
Intrinsic::nvvm_match_any_sync_i32);

Therefore, in most cases the intrinsics can be implemented in just 3-4 lines of code (plus tests) by simply calling these functions. For example,

def shfl_down_f32(mask, val, offset):
    # lane offset is 31 for warp size 32
    return impl.call_internal("cuda_shfl_down_sync_f32",
                              mask,
                              val,
                              offset,
                              31,
                              with_runtime_context=False)
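
Following the same pattern, another wrapper could be sketched as follows; the argument order (mask, predicate) is assumed from Intrinsic::nvvm_vote_ballot_sync and is not a finalized signature:

def ballot_sync(mask, predicate):
    # thin wrapper over the existing runtime function listed above
    return impl.call_internal("cuda_ballot_sync",
                              mask,
                              predicate,
                              with_runtime_context=False)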

Milestone

Implement a GPU parallel scan (prefix sum)? That would be very useful in particle simulations.
Ideas are welcome!
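
As a rough starting point, a warp-level inclusive scan (Hillis-Steele style) might look like the sketch below. It assumes a shfl_up wrapper analogous to shfl_down and a lane-id query such as ti.global_thread_idx() % 32; both are assumptions rather than settled API.

@ti.func
def warp_inclusive_scan(val):
    mask = ti.u32(0xFFFFFFFF)
    lane = ti.global_thread_idx() % 32  # lane id within the warp (assumed helper)
    for i in ti.static(range(5)):       # offsets 1, 2, 4, 8, 16
        offset = 1 << i
        n = ti.simt.warp.shfl_up_i32(mask, val, offset)  # assumed wrapper name
        if lane >= offset:
            val += n
    return val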

Future steps: making Taichi (kind of) a superset of CUDA!

  1. Explicit shared memory operation support
  2. Other block-level and grid-level intrinsics: __syncthreads, __threadfence, etc.
  3. ti.raw_kernel, something that provides 1:1 mapping to a __global__ CUDA kernel

Appendix: List of higher-level primitives (available in Vulkan, Metal, etc., and implemented as helpers in CUDA)

Some of these exist in CUDA directly; however, the execution scope (i.e., the mask) is not involved and sync behavior is guaranteed, so they cannot be mapped 1:1 onto CUDA and helper functions are needed (a sketch of one such helper follows the list below). (Reference: https://www.youtube.com/watch?v=fP1Af0u097o, where Nvidia talks about implementing these in the drivers.)

  • subgroupBarrier Execution barrier
  • subgroupMemoryBarrier Memory fence
  • subgroupElect Elect a single invocation as leader (very useful in atomic reduction)
  • subgroupAll
  • subgroupAny
  • subgroupAllEqual
  • subgroupBroadcast (might be tricky, as the id being broadcast from must be a compile-time constant)
  • subgroupBroadcastFirst (uses the lowest-id active invocation)
  • Other ballot options (GL_KHR_shader_subgroup_ballot)
  • Subgroup arithmetic (Useful in reduction primitives)
    • subgroupAdd
    • subgroupMul
    • subgroupMin
    • subgroupMax
    • subgroupAnd
    • subgroupOr
    • subgroupXor
  • Subgroup inclusive scan arithmetic (Like subgroup arithmetic, but the result is an inclusive scan)
  • subgroupShuffle
  • subgroupShuffleXor
  • subgroupShuffleUp
  • subgroupShuffleDown
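
As an illustration of the kind of helper needed, subgroupAll could be emulated on the CUDA backend by filling in the execution scope explicitly. The wrapper names active_mask and all_nonzero below are assumptions, not confirmed Taichi API.

@ti.func
def subgroup_all(predicate):
    # CUDA requires the execution scope to be spelled out as a mask,
    # so the helper queries the currently active lanes first.
    mask = ti.simt.warp.active_mask()  # assumed wrapper for __activemask()
    return ti.simt.warp.all_nonzero(mask, predicate)  # assumed wrapper for __all_sync()
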
@bobcao3 (Collaborator) commented Mar 25, 2022

Extension: add warp-size query and control. Warp-level intrinsics exist in Vulkan and Metal, and on those platforms some devices use a warp size different from 32; some devices even allow custom warp sizes (subgroup size control & subgroup operations).

@yuanming-hu (Member, Author):

@bobcao3 Can't agree more! :-)

@bobcao3 (Collaborator) commented Mar 25, 2022

Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into directly adding higher-level intrinsics such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives. Reference: https://www.khronos.org/blog/vulkan-subgroup-tutorial

@AmesingFlank (Collaborator) commented Mar 26, 2022

Would love to see this!
Btw Metal has pretty good warp intrinsics support as well (they call it SIMD-group). See table 6.13 in https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf

@k-ye (Member) commented Mar 26, 2022

One addition to this proposal: warp intrinsics are a great add-on, but in the meantime we also need a design that formalizes our parallelization strategy. Right now it is quite vague to users how a Taichi for iteration is mapped to a GPU thread (TL;DR: it is backend-dependent). I think we need to offer an explicit spec for this (cc @strongoier).

@bobcao3 (Collaborator) commented Mar 26, 2022

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks.

@Wimacs (Contributor) commented Mar 27, 2022

I want to take care of the __ballot_sync intrinsic!

@yuanming-hu (Member, Author):

Continuing discussions on @bobcao3's question:

Should we remove the mask part from the intrinsics? It seems like only CUDA and AVX512 support these masks. We can hard-code it to all lanes for now, but due to the complexity of the scheduling and the non-guaranteed lock-step execution, using the right mask probably requires the compiler to figure out whether there can be divergence (when there is divergence, we need to run int mask = __match_any_sync(__activemask(), data); to get the right mask). I think handing masks over to the user may make it significantly harder to code, while also breaking compatibility with non-CUDA devices.

My opinion: I agree exposing masks can be extra trouble for users, and can harm portability. Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful? If not then maybe we should not expose masks.

@turbo0628 (Member) commented Mar 27, 2022

I agree exposing masks can be extra trouble for users, and can harm portability.

Also vote for hiding the masks beneath Taichi's interface.

The masks are extremely troublesome and hard to understand, especially in Taichi, since we have hidden many parallelization details in favor of elegant parallel programming. The prerequisite for exposing masks is a set of more direct APIs for manipulating parallelization.

Does anyone know a frequent CUDA use case where explicitly specifying the masks is helpful?

Special stencil patterns covering specific near neighbors (star stencil etc.) might need special masks, but such optimizations can be handled internally in Taichi. We can also quickly add the mask APIs when needed.

@bobcao3 (Collaborator) commented Mar 27, 2022

According to the CUDA API, the masking behavior is not what one might expect. An active thread executing an instruction while it is not in the mask yields undefined behavior, so the mask is only a convergence requirement. Now comes the tricky part: there is no explicit convergence requirement in CUDA, so the mask must be queried every time we have taken a branch. Using the ALL mask in divergent control flow can result in a GPU hang, while using __activemask() does not guarantee reconvergence after branching. So we should definitely hide the mask, but it also seems quite tricky to implement masks internally. I would say we need to maintain a mask variable once we encounter an IfStmt.

@bobcao3 (Collaborator) commented Mar 27, 2022

Masks in vector processing, such as AVX512 or RISC-V Vector, are very different from CUDA's.

@varinic (Contributor) commented Mar 27, 2022

I would like to take care of the __shfl_xor_sync intrinsic!

@DongqiShen (Contributor):

I've got a naive question: if I want to implement a task in this issue (or another open issue), how do I know whether somebody is already doing the same work?

@yuanming-hu (Member, Author):

I've got a naive question: if I want to implement a task in this issue (or another open issue), how do I know whether somebody is already doing the same work?

Good question. As long as nobody says "I'll take this task" and the issue has no assignee, you are safe to assume that nobody is working on it. Before you start coding, it would be nice to leave a comment "let me implement XXXX" so that people know you are working on it :-)

@masahi (Contributor) commented Apr 25, 2022

Changes I would like to see: in addition to using CUDA's warp-level primitives, we should look into directly adding higher-level intrinsics such as subgroup add, subgroup scan, etc. These are supported in Vulkan, and the device driver will provide an optimal implementation depending on the device. On CUDA we can provide our own wrappers for these higher-level primitives.

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of these intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, the slide on page 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x:
https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

@yuanming-hu (Member, Author):

Recent NVIDIA GPUs (Ampere and later) also support reduce_sync variants of these intrinsics: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-reduce-functions
In particular, the slide on page 47 of this deck says the __reduce_op_sync warp intrinsics are faster than a warp-shuffle-based implementation by 10x:
https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21170-cuda-on-nvidia-ampere-gpu-architecture-taking-your-algorithms-to-the-next-level-of-performance.pdf?t=eyJscyI6ImdzZW8iLCJsc2QiOiJodHRwczpcL1wvd3d3Lmdvb2dsZS5jb21cLyIsIm5jaWQiOiJlbS1hbm5vLTkyMTMzOS12dDIwIn0

Wow, that sounds quite attractive. Thanks for pointing this out. We need to dispatch the code according to compute capability. One place to look at:

if (cuda_compute_capability() < 70) {

@qiao-bo Could you add this to the feature list and coordinate its development? Many thanks!

@qiao-bo added this to the Taichi v1.1.0 milestone Apr 26, 2022
@qiao-bo (Collaborator) commented Apr 27, 2022

@yuanming-hu @masahi It turns out to be a bit difficult to support the new warp-reduce intrinsics at this moment. For example, __reduce_add_sync (i32) needs to be mapped to redux.sync.add.s32. This new redux instruction is only supported since LLVM 13 (https://github.com/llvm/llvm-project/blob/release/13.x/llvm/lib/Target/NVPTX/NVPTXIntrinsics.td). We also tried bypassing NVVM and using PTX asm directly in our runtime, but LLVM 10 wouldn't let us because of the PTX JIT compilation.

The migration to LLVM 12 is on our roadmap. Nevertheless, it may still lack support for this warp reduce ;). For the purposes of this issue, I suggest moving this feature proposal to a separate issue for later work. WDYT?

@yuanming-hu (Member, Author):

Sounds good - we probably need to postpone the implementation until we have LLVM >= 13.

(If someone insists on implementing it sooner, they can also consider using inline PTX assembly.)

@galeselee (Contributor):

I will take care of the __syncwarp intrinsic.

@0xzhang (Collaborator) commented May 6, 2022

I'll take care of __uni_sync.

@galeselee (Contributor):

I will take care of the __syncwarp intrinsic.

I'm working on match_all.

@qiao-bo (Collaborator) commented Jul 5, 2022

Update: since we are approaching the v1.1.0 release, I would like to give an intermediate summary of this issue.

Thanks to our contributors, the list of warp-level intrinsics has been fully implemented. The milestone has also been achieved, namely using the intrinsics to implement a parallel scan (https://github.com/taichi-dev/taichi_benchmark/blob/main/pbf/src/taichi/scan.py), thanks to @YuCrazing.

As the next step, the following related tasks are planned:

  • Add more examples to utilize the warp intrinsics
  • Document the intrinsics on the Taichi docs website
  • Block-level support, i.e., explicit shared memory support (@turbo0628)
  • raw_kernel support
  • HW-supported warp intrinsics on NV GPUs

In the long term, we plan to provide high-level primitives that are backend-agnostic and can abstract over CUDA warp intrinsics, Vulkan subgroups, Metal SIMD-groups, CPU vectorization, etc.

Since this issue is meant to address CUDA warp-level intrinsics, maybe we can use another issue to track the progress of the mentioned tasks?

@alasin commented Jan 2, 2024

Hi, I wanted to know whether anyone is working on adding support for the subgroup* operations listed above. I can add support for some of the simple ones (shuffle*), but it would be great if someone could look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot), as I'm not sure how to implement them (the return type is a uvec4) and I need them for a project.

@bobcao3 (Collaborator) commented Jan 2, 2024

Hi, I wanted to know whether anyone is working on adding support for the subgroup* operations listed above. I can add support for some of the simple ones (shuffle*), but it would be great if someone could look into the ballot ops (supported by GL_KHR_shader_subgroup_ballot), as I'm not sure how to implement them (the return type is a uvec4) and I need them for a project.

Maybe you can use a structure similar to how TextureStmt returns vec4...

@alasin commented Jan 3, 2024

Maybe you can use a structure similar to how TextureStmt returns vec4...

Can you share the link to it? I can't find TextureStmt while searching.
