Is there any plan to support "__syncthreads_and", "__syncthreads_or", and "__syncthreads_count"? #8289

Closed
wanmeihuali opened this issue Jul 17, 2023 · 0 comments
Labels
feature request Suggest an idea on this project welcome contribution

Comments

@wanmeihuali
Contributor

I see that PR-4631 adds CUDA warp-level intrinsics to Taichi. Is there any plan to add CUDA block-level intrinsics as well, e.g. __syncthreads_and, __syncthreads_or, and __syncthreads_count? I have tried adding these intrinsics to Taichi; if you think they are acceptable, I can create a PR.

Describe the solution you'd like (if any)
API in ti.simt.block:

def sync_all_nonzero(predicate):

def sync_any_nonzero(predicate):

def sync_count_nonzero(predicate):

My current implementation:
See: master...wanmeihuali:taichi:master

Additional comments
Limitation: similar intrinsics probably exist on Metal and/or other platforms as well, but I don't have such a device or dev environment, so my implementation currently supports CUDA only.

Thanks!

@wanmeihuali added the feature request label on Jul 17, 2023
lin-hitonami pushed a commit that referenced this issue Oct 31, 2023
…reads_count" from CUDA. (#8297)

Issue: #8289 

### Brief Summary

From the CUDA documentation:
Devices of compute capability 2.x and higher support three variations of
__syncthreads() described below.
```cpp
int __syncthreads_count(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns the number
of threads for which predicate evaluates to non-zero.
```cpp
int __syncthreads_and(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if
and only if predicate evaluates to non-zero for all of them.
```cpp
int __syncthreads_or(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if
and only if predicate evaluates to non-zero for any of them.
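To make the three return values concrete, here is a minimal pure-Python model of the reductions described above. It is only a sketch of the semantics, not Taichi or CUDA code: the function names `block_count_nonzero`, `block_all_nonzero`, and `block_any_nonzero` are made up for illustration, and the list stands in for the predicates of all threads in one block. The real intrinsics only promise a non-zero result for the and/or variants; this model returns 1 for simplicity.

```python
from typing import Sequence


def block_count_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_count: number of threads with a non-zero predicate."""
    return sum(1 for p in predicates if p != 0)


def block_all_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_and: non-zero iff every predicate is non-zero."""
    return int(all(p != 0 for p in predicates))


def block_any_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_or: non-zero iff at least one predicate is non-zero."""
    return int(any(p != 0 for p in predicates))


# One entry per thread of a hypothetical 8-thread block.
# Any non-zero value counts as "true", including negative ones.
preds = [1, 0, 7, -3, 0, 2, 0, 5]
print(block_count_nonzero(preds), block_all_nonzero(preds), block_any_nonzero(preds))
```

In this model the and/or variants are just special cases of the count: "all" means the count equals the block size, and "any" means the count is greater than zero.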

This PR adds these three operations for CUDA only. The API looks
like:
```python
def sync_all_nonzero(predicate): ...    # __syncthreads_and

def sync_any_nonzero(predicate): ...    # __syncthreads_or

def sync_count_nonzero(predicate): ...  # __syncthreads_count
```
The predicate is always expected to be of type ti.int32.
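A typical use for this kind of operation is a block-wide convergence loop, where every thread keeps iterating until all threads report that they are done. The sketch below simulates that pattern on the CPU with Python threads and `threading.Barrier`. It is an illustration under stated assumptions, not the Taichi implementation: `BlockSim`, `worker`, and `run_block` are hypothetical names, and only the method names mirror the API above.

```python
import threading


class BlockSim:
    """Hypothetical CPU stand-in for one thread block (not Taichi or CUDA code)."""

    def __init__(self, num_threads: int) -> None:
        self.num_threads = num_threads
        self._barrier = threading.Barrier(num_threads)
        self._preds = [0] * num_threads

    def sync_count_nonzero(self, tid: int, predicate: int) -> int:
        self._preds[tid] = predicate
        self._barrier.wait()  # every thread has published its predicate
        count = sum(1 for p in self._preds if p != 0)
        self._barrier.wait()  # every thread has read before the slots are reused
        return count

    def sync_all_nonzero(self, tid: int, predicate: int) -> int:
        return int(self.sync_count_nonzero(tid, predicate) == self.num_threads)


def worker(block: BlockSim, tid: int, values: list, rounds: list) -> None:
    x = values[tid]
    n = 0
    # All threads see the same result, so they all leave the loop together.
    while not block.sync_all_nonzero(tid, int(x < 1.0)):
        if x >= 1.0:
            x *= 0.5  # only threads that have not converged keep working
        n += 1
    values[tid] = x
    rounds[tid] = n


def run_block(start: list) -> tuple:
    block = BlockSim(len(start))
    values = list(start)
    rounds = [0] * len(start)
    threads = [
        threading.Thread(target=worker, args=(block, tid, values, rounds))
        for tid in range(len(start))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return values, rounds


values, rounds = run_block([1.0, 2.0, 3.0, 4.0])
print(values, rounds)
```

Because the result is uniform across the block, every thread executes the same number of rounds and reaches each barrier the same number of times. That mirrors the requirement that all threads of a block reach the synchronization call.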
### Walkthrough
Overall, the code is adapted from the CUDA warp operations, and the
implementation is straightforward. I added tests similar to those for
the warp operations, and all tests pass on my local machine.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Projects
Status: Done
3 participants