[Perf] Shared Array APIs #5338
Comments
- ^ Please move the existing TLS & BLS code to alloca statements as well
- Can't wait to do that
- I suggest not mixing scratch pad with
- In the context of this discussion, WDYT about using the Mat type internally? We can decide the final API exposed to users when things get ready.
- I think for demonstration it is fine. However, when it comes to merging the code in, it may not be the best choice to complicate
- +1. Matrix and vector are meant to be mathematical types. In this case, though, we need a "buffer storage" type, and a dedicated type for that (before unifying all the pointers).
- As there's already a scratch pad in the codebase, let's use the Shared Array instead!
TL;DR
I would like to request a scratch pad (shared memory in CUDA) API. A prototype using it delivers ~2x n-body simulation performance on CUDA GPUs, on par with the best CUDA implementations.
The Problem
The Taichi kernel function for n-body simulation is demonstrated as follows:
We should note that the inner `nBodies` loop is factorized into three smaller loops: `LOOP 2`, `LOOP BLOCK` and `LOOP UNROLL`. This is *partial loop unrolling*, which helps build a more GPU-friendly instruction flow, but we are not going to dive into the details here.

Ideally we want to fetch data from DRAM into the scratch pad at `FETCH`, and replace the `LOADS FROM DRAM` with local loads from the scratch pad. The target code should be equivalent to the following CUDA kernel:

Note 1: in the loading stage, all threads read the same position in shared memory, so a broadcast is triggered (see the official documentation). This is why the shared memory implementation is so much faster than the global memory one for the n-body simulation.

Note 2: accessing shared memory with a stride of 3 32-bit floats does not cause bank conflicts; see the official documentation.
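To see why the stride-3 access pattern is conflict-free: shared memory banks are word-indexed modulo 32, and gcd(3, 32) = 1, so 32 consecutive threads land in 32 distinct banks. A quick sketch in plain Python, modeling the standard 32-bank layout (the helper name is illustrative):

```python
NUM_BANKS = 32  # 32-bit-word-wide banks on current CUDA GPUs

def banks_touched(stride_words, num_threads=32):
    """Bank index of the first word each thread in a warp accesses."""
    return [(t * stride_words) % NUM_BANKS for t in range(num_threads)]

# Stride of 3 words (one packed xyz triple per thread): all 32 banks are
# distinct because gcd(3, 32) == 1, so there are no bank conflicts.
assert len(set(banks_touched(3))) == 32

# Contrast: a stride of 2 words maps the warp onto only 16 banks,
# i.e. a 2-way bank conflict.
assert len(set(banks_touched(2))) == 16
```

The same reasoning holds for any odd stride, which is why padding a row to an odd word count is a common conflict-avoidance trick.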
Currently we cannot easily implement this in Taichi. We need a set of new APIs to enable the optimization.
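The intended transformation can be sketched as a plain-Python CPU model (illustrative only; `BLOCK_DIM`, the body layout, and the force formula are assumptions, not the actual kernel): the `j` loop is blocked, each tile is first fetched into a scratch-pad buffer, and the inner loop then reads only from that buffer.

```python
import random

BLOCK_DIM = 8      # illustrative stand-in for threads per GPU block
SOFTENING = 1e-9

def forces_naive(pos):
    """Reference: every pairwise interaction loads straight from 'DRAM' (pos)."""
    out = []
    for i in range(len(pos)):
        fx = fy = fz = 0.0
        for j in range(len(pos)):
            dx, dy, dz = (pos[j][k] - pos[i][k] for k in range(3))
            inv_r3 = (dx * dx + dy * dy + dz * dz + SOFTENING) ** -1.5
            fx += dx * inv_r3; fy += dy * inv_r3; fz += dz * inv_r3
        out.append((fx, fy, fz))
    return out

def forces_tiled(pos):
    """Same computation, but the j loop is blocked: each tile is FETCHed into a
    scratch-pad buffer first, and the inner loop loads only from the pad."""
    n = len(pos)
    out = []
    for i in range(n):
        fx = fy = fz = 0.0
        for base in range(0, n, BLOCK_DIM):
            pad = pos[base:base + BLOCK_DIM]   # FETCH: DRAM -> scratch pad
            for pj in pad:                     # loads now hit the pad only
                dx, dy, dz = (pj[k] - pos[i][k] for k in range(3))
                inv_r3 = (dx * dx + dy * dy + dz * dz + SOFTENING) ** -1.5
                fx += dx * inv_r3; fy += dy * inv_r3; fz += dz * inv_r3
        out.append((fx, fy, fz))
    return out

random.seed(0)
bodies = [tuple(random.random() for _ in range(3)) for _ in range(32)]
# Blocking preserves the summation order here, so the results match exactly.
for a, b in zip(forces_naive(bodies), forces_tiled(bodies)):
    assert all(abs(x - y) < 1e-9 for x, y in zip(a, b))
```

On a real GPU the pad would be a `__shared__` buffer filled cooperatively by the block's threads; the Python model only shows which loads move from DRAM to the pad.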
The Hack
In fact, most of the utilities are already present in the codebase. Therefore I have made a prototype that manipulates the CHI IR to demonstrate how this works in Taichi.
Local Array Allocation
In CUDA, the shared memory is allocated through the following statement:
In current Taichi code base, by setting the
bls_size
in theOffloadedStmt
, the following allocation code snippet will be invoked at codegen:I think we need a statement for this, a specialized
AllocaStmt
should work.Data fetch into scratch pad
CUDA code:
See this part for the hack.
where `index_xyz` is the 0/1/2 offset for the `x`, `y` and `z` elements.

In this part, we first get the corresponding `snode` from the actual `GlobalPtrStmt`, and make the new statements with the `snode` and `threadIdx.x`. Then we get the pointer to the BLS buffer via `BlockLocalPtrStmt`.

I think the `BlockLocalPtrStmt` doesn't make sense and should be removed. It can be replaced with a `GlobalPtrStmt` or `GlobalVariableStmt`, with proper pass-in arguments.

Replace Load Statements
Finally, the statements defined in the `LOAD FROM DRAM` code section should be replaced with loads from the scratch pad. See here.
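The replacement step can be modeled as a tiny IR rewrite pass. This is a toy sketch, not the real CHI IR pass; the tuple-based statement encoding and the `pad` buffer name are invented for illustration:

```python
# Toy IR rewrite (illustrative, not Taichi's actual pass): statements are
# (op, buffer, index) tuples; loads that hit the cached global range are
# rewritten to read the block-local pad at the tile-relative offset.
def replace_loads(stmts, cached_buffer, tile_base, tile_size):
    out = []
    for op, buf, idx in stmts:
        if op == "load" and buf == cached_buffer and tile_base <= idx < tile_base + tile_size:
            out.append(("load", "pad", idx - tile_base))   # LOAD FROM DRAM -> pad
        else:
            out.append((op, buf, idx))
    return out

stmts = [("load", "bodies", 130), ("load", "bodies", 10), ("store", "acc", 0)]
rewritten = replace_loads(stmts, "bodies", tile_base=128, tile_size=64)
assert rewritten[0] == ("load", "pad", 2)       # inside the tile: rewritten
assert rewritten[1] == ("load", "bodies", 10)   # outside the tile: untouched
assert rewritten[2] == ("store", "acc", 0)      # non-loads pass through
```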
The final performance data is sound!
API Proposal
I'd like to write the following Taichi code for this optimization:
It seems that we only need to modify the `AllocaStmt` to invoke the correct allocation code, and account for the scratch pad in the coming Mat/Vec types.
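A minimal sketch of the "specialized `AllocaStmt`" idea, assuming a toy IR rather than Taichi's real one (the `scope` field and the `codegen_alloca` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class AllocaStmt:
    """Toy allocation statement: one node kind, two allocation scopes."""
    name: str
    num_elems: int
    dtype: str = "float"
    scope: str = "local"   # "local" -> thread-private, "shared" -> block scratch pad

def codegen_alloca(stmt: AllocaStmt) -> str:
    """Pick the CUDA allocation form from the statement's scope tag."""
    qualifier = "__shared__ " if stmt.scope == "shared" else ""
    return f"{qualifier}{stmt.dtype} {stmt.name}[{stmt.num_elems}];"

# A thread-local array and a block-level scratch pad from the same statement kind:
print(codegen_alloca(AllocaStmt("acc", 4)))                       # float acc[4];
print(codegen_alloca(AllocaStmt("pad", 64 * 3, scope="shared")))  # __shared__ float pad[192];
```

The point of the sketch is only that a scope tag on the existing allocation statement is enough for codegen to choose between thread-private and block-shared storage.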