# Simple Memory Copy

This example demonstrates a basic memory copy operation where **shared memory** is used as an intermediate buffer.  
It serves as the simplest possible scenario to test whether the `DefaultSharedMemorySync()` pass correctly inserts synchronization.

The goal is to observe shared memory behavior in a minimal setting.


First, we import the required modules:

In [1]:
import dace
from IPython.display import Code
from dace.transformation.passes.shared_memory_synchronization import DefaultSharedMemorySync

## Inspiration

Below is the SDFG I used for inspiration. The goal is to replace `k` with a shared memory array later.

In [2]:
@dace.program
def simpleCopy(A: dace.float64[32] @ dace.dtypes.StorageType.GPU_Global, B: dace.float64[32] @ dace.dtypes.StorageType.GPU_Global, C: dace.float64[32] @ dace.dtypes.StorageType.GPU_Global):
    for i in dace.map[0:32:32] @ dace.dtypes.ScheduleType.GPU_Device:
        for j in dace.map[0:32] @ dace.dtypes.ScheduleType.GPU_ThreadBlock:
            k = A[j]
            B[j] = k

simpleCopy.to_sdfg()


The same copy built with the SDFG API, now using the shared memory array `S` as the intermediate buffer:

In [3]:
def simpleCopy_smem():
    # Create SDFG and state
    sdfg = dace.SDFG("simpleCopy_smem")
    state = sdfg.add_state("main")

    # Add arrays
    sdfg.add_array("A", (32,), dace.uint32, storage=dace.dtypes.StorageType.GPU_Global)
    sdfg.add_array("B", (32,), dace.uint32, storage=dace.dtypes.StorageType.GPU_Global)
    sdfg.add_array("S", (32,), dace.uint32, storage=dace.dtypes.StorageType.GPU_Shared, transient=True, lifetime=dace.dtypes.AllocationLifetime.Scope)

    # Add access nodes
    a_acc = state.add_access("A")
    b_acc = state.add_access("B")
    s_acc = state.add_access("S")

    # GPU Device map
    gpu_map_entry, gpu_map_exit = state.add_map(
        "gpu_map",
        dict(i="0:32:32"),
        schedule=dace.dtypes.ScheduleType.GPU_Device,
    )

    # GPU thread-block map
    tb_map_entry, tb_map_exit = state.add_map(
        "tb",
        dict(j="0:32"),
        schedule=dace.dtypes.ScheduleType.GPU_ThreadBlock,
    )

    # Add tasklets for A -> S -> B
    tasklet1 = state.add_tasklet(
        "copy_to_shared",
        inputs={"__inp"},
        outputs={"__out"},
        code="__out = __inp;",
        language=dace.dtypes.Language.CPP
    )

    tasklet2 = state.add_tasklet(
        "copy_to_global",
        inputs={"__inp"},
        outputs={"__out"},
        code="__out = __inp;",
        language=dace.dtypes.Language.CPP
    )


    # Edges
    state.add_edge(a_acc, None, gpu_map_entry, None, dace.Memlet("A[0:32]"))
    state.add_edge(gpu_map_entry, None, tb_map_entry, None, dace.Memlet("A[0:32]"))
    state.add_edge(tb_map_entry, None, tasklet1, "__inp", dace.Memlet("A[j]"))
    state.add_edge(tasklet1, "__out", s_acc, None, dace.Memlet("S[j]"))
    state.add_edge(s_acc, None, tasklet2, "__inp", dace.Memlet("S[j]"))
    state.add_edge(tasklet2, "__out", tb_map_exit, None, dace.Memlet("B[j]"))
    state.add_edge(tb_map_exit, None, gpu_map_exit, None, dace.Memlet("B[0:32]"))
    state.add_edge(gpu_map_exit, None, b_acc, None, dace.Memlet("B[0:32]"))

    sdfg.fill_scope_connectors()
    return sdfg

sdfg = simpleCopy_smem()
sdfg

## Adding Synchronization Barriers

A simple pass is used to insert the synchronization tasklets. We observe that the synchronization tasklet is inserted after
the shared memory access node and before the assignment tasklet that reads from it, ensuring that the threads wait until all data is in shared memory before
using it. (Note that in this case synchronization would not be strictly necessary, since each thread reads the same position in shared memory
that it wrote. Here we only care that the barrier is inserted correctly after a shared memory access node is used.)

In [4]:
DefaultSharedMemorySync().apply_pass(sdfg, None)
sdfg
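
To illustrate the note above on when the barrier actually matters, here is a minimal pure-Python sketch. It does not use DaCe or a GPU. It simulates one thread block whose threads run strictly one after another, which is one schedule a GPU is allowed to produce when no barrier is present. The `shift` parameter is a hypothetical addition for illustration: `shift=0` mirrors this notebook (each thread reads the position it wrote), while `shift=1` makes each thread read its neighbour's position.

```python
def run_block(n, shift, use_barrier):
    """Simulate one thread block copying A -> S -> B, one thread at a time.

    Thread j writes S[j] and then reads S[(j + shift) % n].
    """
    A = list(range(1, n + 1))  # input values, all non-zero
    S = [0] * n                # stand-in for shared memory, 0 = not yet written
    B = [0] * n                # output

    if use_barrier:
        # Phase 1: every thread writes its element to S.
        for j in range(n):
            S[j] = A[j]
        # The barrier sits here: phase 2 starts only after all writes are done.
        for j in range(n):
            B[j] = S[(j + shift) % n]
    else:
        # No barrier: each thread writes and reads back to back.
        for j in range(n):
            S[j] = A[j]
            B[j] = S[(j + shift) % n]
    return A, B


def is_correct(n, shift, use_barrier):
    """Check whether B received the values it was supposed to read."""
    A, B = run_block(n, shift, use_barrier)
    expected = [A[(j + shift) % n] for j in range(n)]
    return B == expected


print(is_correct(32, 0, use_barrier=False))  # same index, no barrier
print(is_correct(32, 1, use_barrier=False))  # neighbour index, no barrier
print(is_correct(32, 1, use_barrier=True))   # neighbour index, with barrier
```

With `shift=0` the copy is correct even without a barrier, which matches the note above. With `shift=1` and no barrier, thread 0 reads `S[1]` before thread 1 has written it, so the result is wrong. Separating the two phases fixes it.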

The generated code, in which the `__syncthreads();` tasklet is correctly placed:

In [5]:
Code(sdfg.generate_code()[1].clean_code)
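
Rather than checking the generated code only by eye, the barrier can also be counted programmatically. The helper below is a small sketch, not part of DaCe. `count_barriers` is a hypothetical name, and the `sample` string is a hand-written stand-in for generated code, not actual DaCe output. It assumes the barrier appears literally as `__syncthreads()` in the source text.

```python
import re


def count_barriers(source: str) -> int:
    """Count `__syncthreads()` calls in a piece of CUDA source text.

    This is a plain text match, so an occurrence inside a comment
    would be counted as well.
    """
    # Allow optional whitespace around the parentheses.
    return len(re.findall(r"\b__syncthreads\s*\(\s*\)", source))


# Hand-written stand-in for a kernel body (hypothetical, for illustration).
sample = """
__shared__ unsigned int S[32];
S[j] = A[j];
__syncthreads();
B[j] = S[j];
"""

print(count_barriers(sample))
```

In this notebook, the same helper could be applied to the string shown in the cell above, for example `count_barriers(sdfg.generate_code()[1].clean_code)`.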

In [6]:
#Code(sdfg.generate_code()[0].clean_code)