I see that PR-4631 adds CUDA warp-level intrinsics to Taichi. Is there any plan to add CUDA block-level intrinsics as well, e.g. __syncthreads_and, __syncthreads_or, and __syncthreads_count? I have tried adding these intrinsics to Taichi; if you think they are acceptable, I can create a PR.
Describe the solution you'd like (if any)
API in ti.simt.block:
Additional comments
Limitation: similar intrinsics should also exist in Metal and/or other platforms, but I don't have such a device or dev environment, so my implementation currently supports only CUDA.
Thanks!
…reads_count" from CUDA. (#8297)
Issue: #8289
### Brief Summary
From the CUDA documentation:
Devices of compute capability 2.x and higher support three variations of
__syncthreads() described below.
```cpp
int __syncthreads_count(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns the number
of threads for which predicate evaluates to non-zero.
```cpp
int __syncthreads_and(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if
and only if predicate evaluates to non-zero for all of them.
```cpp
int __syncthreads_or(int predicate);
```
is identical to __syncthreads() with the additional feature that it
evaluates predicate for all threads of the block and returns non-zero if
and only if predicate evaluates to non-zero for any of them.
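To make the "barrier plus reduction" behavior quoted above concrete, here is a minimal CPU-side sketch in plain Python. It is not Taichi or CUDA code: `run_block` and `predicate_fn` are hypothetical names introduced only for illustration, and Python threads with `threading.Barrier` merely stand in for the threads of one block. It models `__syncthreads_count`: every thread publishes its vote, waits at the barrier, and then reads the same block-wide count.

```python
import threading


def run_block(num_threads, predicate_fn):
    """Model one block: each thread votes, synchronizes, then reads the count.

    Hypothetical helper for illustration only; real GPU blocks do not use
    Python threads.
    """
    votes = [0] * num_threads        # one vote slot per simulated thread
    results = [None] * num_threads   # value each thread sees after the barrier
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Publish 1 if this thread's predicate is non-zero, else 0.
        votes[tid] = int(predicate_fn(tid) != 0)
        # Wait until every thread in the block has published its vote.
        barrier.wait()
        # No thread writes to `votes` after the barrier, so this read is safe.
        results[tid] = sum(votes)

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    # Odd-numbered threads vote non-zero; every thread observes the same count.
    print(run_block(8, lambda tid: tid % 2))
```

The point of the sketch is that the return value is uniform across the block, which is what makes these intrinsics usable for block-wide decisions such as a shared loop exit.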
This PR adds just these three operations, for CUDA only. The API looks
like:
```python
def sync_all_nonzero(predicate): ...  # __syncthreads_and
def sync_any_nonzero(predicate): ...  # __syncthreads_or
def sync_count_nonzero(predicate): ...  # __syncthreads_count
```
The predicate is always expected to be ti.int32.
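As a rough reference for what each of the three operations is expected to return, the following plain-Python model reduces a list holding one integer predicate per thread of a block. The `model_*` function names are hypothetical and exist only in this sketch; they take the whole block's predicates at once, unlike the per-thread API above. One assumption to note: the model returns exactly 1 for "non-zero", whereas the CUDA documentation only promises a non-zero value.

```python
from typing import Sequence


def model_sync_all_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_and: 1 only if every thread's predicate is non-zero."""
    return int(all(p != 0 for p in predicates))


def model_sync_any_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_or: 1 if at least one thread's predicate is non-zero."""
    return int(any(p != 0 for p in predicates))


def model_sync_count_nonzero(predicates: Sequence[int]) -> int:
    """Model of __syncthreads_count: number of threads with a non-zero predicate."""
    return sum(1 for p in predicates if p != 0)


if __name__ == "__main__":
    block = [1, 0, 5, 0]  # one int32-style predicate per simulated thread
    print(model_sync_all_nonzero(block))    # not all threads voted non-zero
    print(model_sync_any_nonzero(block))    # at least one thread voted non-zero
    print(model_sync_count_nonzero(block))  # two threads voted non-zero
```

A model like this can serve as the expected-value side of a test: run a kernel that calls the real intrinsic, then compare what each thread received against the reduction computed on the host.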
### Walkthrough
Overall, the code is adapted from the CUDA warp operations, and the
implementation is pretty straightforward. I added tests similar to
those for the warp operations, and all tests pass on my local
machine.
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
My current implementation:
See: master...wanmeihuali:taichi:master