In [1]:
import dace
from IPython.display import Code
from dace.transformation import pass_pipeline


Here you can choose any of the following three programs to see how streams are assigned in the SDFG and how synchronization tasklets are added where required.
You can also change, for example, the StorageType of one input. As long as the GPU is still involved (i.e. you do not end up with a direct CPU-to-CPU copy),
a synchronization tasklet should be added.

Note: test1 is a special case, because it has only one connected component. I thought it would be nicer to use the default stream (nullptr) in this case
instead of creating a new stream.
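To make the idea of "one stream per connected component" concrete, here is a minimal plain-Python sketch. It is not the implementation of the pass: the function name `assign_streams`, the string node labels, and the use of `None` to stand for the default (nullptr) stream are all assumptions made for illustration.

```python
from collections import defaultdict


def assign_streams(nodes, edges):
    """Hypothetical helper: one stream index per weakly connected component.

    Returns a dict node -> stream index. If the graph has a single component,
    every node maps to None, standing in for the default (nullptr) stream.
    """
    # Direction is irrelevant for weak connectivity, so store both directions.
    adjacency = defaultdict(set)
    for src, dst in edges:
        adjacency[src].add(dst)
        adjacency[dst].add(src)

    component_of = {}
    next_component = 0
    for start in nodes:
        if start in component_of:
            continue
        # Depth-first traversal labels everything reachable from `start`.
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component_of:
                continue
            component_of[node] = next_component
            stack.extend(n for n in adjacency[node] if n not in component_of)
        next_component += 1

    # Special case: a single component does not need its own stream.
    if next_component <= 1:
        return {node: None for node in nodes}
    return component_of


# Two independent copies (B -> A and D -> C) end up on two streams.
two_components = assign_streams(["B", "A", "D", "C"], [("B", "A"), ("D", "C")])

# A single copy (B -> A) stays on the default stream.
one_component = assign_streams(["B", "A"], [("B", "A")])
```

In this sketch, `two_components` maps `B` and `A` to stream 0 and `D` and `C` to stream 1, while `one_component` maps both nodes to `None`.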

In [2]:
@dace.program
def test1(A: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          B: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global
         ):
    A[:] = B[:]

In [3]:
@dace.program
def test2(A: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          B: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          C: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          D: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global
         ):
    
    for i in dace.map[0:10] @ dace.dtypes.ScheduleType.GPU_Device:
        A[i] = B[i]
    
    for j in dace.map[0:10] @ dace.dtypes.ScheduleType.GPU_Device:
        C[j] = D[j]

In [4]:
@dace.program
def test3(A: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          B: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          C: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global,
          D: dace.uint32[10] @ dace.dtypes.StorageType.GPU_Global
         ):
    
    A[:] = B[:]
    
    for i in dace.map[0:3] @ dace.dtypes.ScheduleType.Sequential:
        for j in dace.map[0:10] @ dace.dtypes.ScheduleType.GPU_Device:
            C[j] = D[j]

Choose below which program to generate the SDFG from. The resulting SDFG does not yet contain any synchronization tasklets.
The old codegen would figure out on its own where synchronization has to occur. We will make this explicit, as you wanted :).

In [5]:
# Choose one program (uncomment the line you want)
# sdfg = test1.to_sdfg()
# sdfg = test2.to_sdfg()
sdfg = test3.to_sdfg()
sdfg

Now we apply the pass to see the change:

In [6]:
# import the pass
from dace.transformation.passes.gpustream_scheduling import NaiveGPUStreamScheduler

# Define the backend stream access expression that is passed to the pass below.
# (I set this explicitly so that any change to the access expression is easier to detect in the future.)
gpu_stream_access_template = "__state->gpu_context->streams[{gpu_stream}]"

# Initialize and configure GPU stream scheduling pass
gpu_stream_pass = NaiveGPUStreamScheduler()
gpu_stream_pass.set_gpu_stream_access_template(gpu_stream_access_template)
assigned_streams = gpu_stream_pass.apply_pass(sdfg, None)
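The template above contains a `{gpu_stream}` placeholder. As a small sketch of how such a template turns into a concrete access expression, the snippet below fills the placeholder with Python's `str.format`. That the pass substitutes the stream index in exactly this way is an assumption, and the helper `stream_access` is hypothetical.

```python
# Same template string as in the cell above.
gpu_stream_access_template = "__state->gpu_context->streams[{gpu_stream}]"


def stream_access(stream_index: int) -> str:
    """Hypothetical helper: expand the template for one stream index."""
    return gpu_stream_access_template.format(gpu_stream=stream_index)


print(stream_access(1))  # prints: __state->gpu_context->streams[1]
```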


Look at which nodes get assigned to which streams - as expected, right?

In [7]:
assigned_streams

{AccessNode (B): 0,
 AccessNode (A): 0,
 AccessNode (D): 1,
 MapEntry (test3_10[i=0:3]): 1,
 MapEntry (test3_10_4_11[j=0:10]): 1,
 Tasklet (assign_12_12): 1,
 MapExit (test3_10_4_11[j=0:10]): 1,
 MapExit (test3_10[i=0:3]): 1,
 AccessNode (C): 1}
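To read the assignment per stream instead of per node, the mapping can be inverted. The sketch below uses plain strings in place of the real node objects, so it runs without DaCe. The helper `group_by_stream` is hypothetical, and the example dictionary is only an excerpt of the output above.

```python
from collections import defaultdict


def group_by_stream(assigned):
    """Hypothetical helper: invert a node -> stream mapping into stream -> [nodes]."""
    groups = defaultdict(list)
    for node, stream in assigned.items():
        # str() also works for real node objects, which print as shown above.
        groups[stream].append(str(node))
    return dict(groups)


# Excerpt of the mapping shown above, with strings as stand-ins for the nodes.
example = {
    "AccessNode (B)": 0,
    "AccessNode (A)": 0,
    "AccessNode (D)": 1,
    "AccessNode (C)": 1,
}
by_stream = group_by_stream(example)
```

With the real result, `group_by_stream(assigned_streams)` should work the same way, since it only relies on the mapping being a dictionary.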

Look at the extended SDFG: the synchronization is now explicit, and it is no longer the codegen's job to figure it out and implement it.

In [8]:
sdfg 

And you can also inspect the corresponding code. Just ensure that you are using the experimental codegen:

In [9]:
from dace.config import Config

assert Config.get('compiler', 'cuda', 'implementation') == "experimental"

AssertionError: 

In [None]:
Code(sdfg.generate_code()[0].clean_code, language='cpp')

In [None]:
Code(sdfg.generate_code()[1].clean_code, language='cpp')