Is there any plan to support "__syncthreads_and", "__syncthreads_or", and "__syncthreads_count"? #8289

wanmeihuali · 2023-07-17T00:55:32Z

I see that PR-4631 adds CUDA warp-level intrinsics to Taichi. Is there any plan to add CUDA block-level intrinsics as well, e.g. __syncthreads_and, __syncthreads_or, and __syncthreads_count? I have tried adding these intrinsics to Taichi myself; if you think they are acceptable, I can open a PR.

Describe the solution you'd like (if any)
API in ti.simt.block:

def sync_all_nonzero(predicate):    # maps to __syncthreads_and
def sync_any_nonzero(predicate):    # maps to __syncthreads_or
def sync_count_nonzero(predicate):  # maps to __syncthreads_count
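To make the intended return values concrete, here is a minimal pure-Python sketch of what each call should yield for one thread block, following the CUDA documentation for the three intrinsics. It is only a reference model, not the Taichi implementation: the `block_*` helper names are hypothetical, the list stands in for the per-thread predicate values of a single block, and the barrier itself is not modeled.

```python
from typing import Sequence


def block_all_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_and: non-zero iff every thread's predicate is non-zero."""
    return int(all(p != 0 for p in predicates))


def block_any_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_or: non-zero iff at least one thread's predicate is non-zero."""
    return int(any(p != 0 for p in predicates))


def block_count_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_count: number of threads whose predicate is non-zero."""
    return sum(1 for p in predicates if p != 0)


if __name__ == "__main__":
    # Hypothetical block of four threads; each entry is one thread's predicate.
    block = [1, 0, 3, 0]
    print(block_all_nonzero(block))    # 0: not every predicate is non-zero
    print(block_any_nonzero(block))    # 1: at least one predicate is non-zero
    print(block_count_nonzero(block))  # 2: two predicates are non-zero
```

On the device, every thread in the block would receive the same result after the barrier; the sketch only captures that shared value.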

My current implementation:
See: master...wanmeihuali:taichi:master

Additional comments
Limitation: similar intrinsics probably exist on Metal and/or other platforms, but I don't have such a device or development environment, so my implementation currently supports only CUDA.

Thanks!


…reads_count" from CUDA. (#8297)

Issue: #8289

### Brief Summary

From the CUDA documentation:

Devices of compute capability 2.x and higher support three variations of __syncthreads() described below.

```cpp
int __syncthreads_count(int predicate);
```

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns the number of threads for which predicate evaluates to non-zero.

```cpp
int __syncthreads_and(int predicate);
```

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for all of them.

```cpp
int __syncthreads_or(int predicate);
```

is identical to __syncthreads() with the additional feature that it evaluates predicate for all threads of the block and returns non-zero if and only if predicate evaluates to non-zero for any of them.

This PR adds these three operations for CUDA only. The API looks like:

```python
def sync_all_nonzero(predicate):    # __syncthreads_and
def sync_any_nonzero(predicate):    # __syncthreads_or
def sync_count_nonzero(predicate):  # __syncthreads_count
```

The predicate is always expected to be ti.int32.

### Walkthrough

Overall, the code is adapted from the CUDA warp operations, so the implementation is straightforward. I added tests similar to those for the warp operations, and all of them pass on my local machine.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

wanmeihuali added the "feature request" label Jul 17, 2023

jim19930609 added the welcome contribution label Jul 21, 2023

wanmeihuali mentioned this issue Jul 23, 2023

[simt] Support "__syncthreads_and", "__syncthreads_or", and "__syncthreads_count" from CUDA. #8297

Merged

lin-hitonami closed this as completed Oct 31, 2023
