# Library Nodes

`LibraryNode`s abstract common operations so that they can be reused easily across SDFGs and Data-Centric programs. This tutorial covers creating `LibraryNode`s with different implementations (called *expansions* or `ExpandTransformation`s) and using them in SDFGs or Data-Centric programs.

For this tutorial, we use as an example the SDDMM (sampled dense-dense matrix multiplication) operation:
$$\bm{D} = \bm{A} \odot \left(\bm{B} \times \bm{C}\right)$$
$\bm{A}$ is a sparse matrix, while $\bm{B}$ and $\bm{C}$ are dense matrices. The output $\bm{D}$ is the Hadamard (element-wise) product of $\bm{A}$ and the matrix product of $\bm{B}$ and $\bm{C}$, and has the same sparsity pattern as $\bm{A}$. Effectively, $\bm{A}$ *samples* (or filters) the dense product $\bm{B} \times \bm{C}$. Assuming $\bm{A}$ is in CSR format, the SDDMM algorithm is as follows:

```python
# A (D) has shape (M, N) with nnz non-zero values
# A_data (D_data) is the non-zero values of A (D)
# A_indices (D_indices) is the column indices of A (D)
# A_indptr (D_indptr) is the row pointers of A (D)
# B has shape (M, K)
# C has shape (K, N)
D_data = np.zeros_like(A_data)
D_indices = np.copy(A_indices)
D_indptr = np.copy(A_indptr)
for i in range(M):
    for j in range(A_indptr[i], A_indptr[i + 1]):
        for k in range(K):
            D_data[j] += B[i, k] * C[k, A_indices[j]]
        D_data[j] *= A_data[j]
```
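
The loop nest above can be checked against the dense definition on a tiny input. The following is a minimal NumPy sketch, not part of DaCe; the helper names `dense_to_csr` and `sddmm_csr` are hypothetical and exist only for this illustration.

```python
import numpy as np


def dense_to_csr(mat):
    """Convert a dense 2D array to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in mat:
        cols = np.nonzero(row)[0]
        data.extend(row[cols])
        indices.extend(cols)
        indptr.append(len(indices))
    return (np.array(data, dtype=mat.dtype), np.array(indices, dtype=np.int32),
            np.array(indptr, dtype=np.int32))


def sddmm_csr(A_data, A_indices, A_indptr, B, C):
    """Same computation as the loop nest above; returns the CSR arrays of D."""
    D_data = np.zeros_like(A_data)
    for i in range(len(A_indptr) - 1):
        for j in range(A_indptr[i], A_indptr[i + 1]):
            # Dot product of row i of B with the column of C selected by A
            D_data[j] = A_data[j] * np.dot(B[i, :], C[:, A_indices[j]])
    return D_data, A_indices.copy(), A_indptr.copy()


A = np.array([[2.0, 0.0, 0.0],
              [0.0, 0.0, 3.0]])   # M=2, N=3, nnz=2
B = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # M=2, K=2
C = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # K=2, N=3

A_data, A_indices, A_indptr = dense_to_csr(A)
D_data, D_indices, D_indptr = sddmm_csr(A_data, A_indices, A_indptr, B, C)

# Dense reference: Hadamard product of A with the full product B @ C
D_dense = A * (B @ C)
```

Only the two entries selected by the non-zeros of `A` are ever computed, which is the point of the sampled formulation: the full `M x N` product is never materialized.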

We start by creating a `LibraryNode` that represents the SDDMM operation. We create a class that inherits from `dace.sdfg.nodes.LibraryNode` and decorate it with `@dace.library.node`. The class must include an `implementations` dictionary and a `default_implementation` string, both of which we discuss later. The `LibraryNode`'s initialization method must call the initialization method of the super-class and pass the node's name, location, inputs, and outputs. The inputs and the outputs are the node's connector names.

In [1]:
import dace

from dace import library
from dace.sdfg import nodes
from dace.transformation import ExpandTransformation
from typing import Dict


@library.node
class MySDDMM(nodes.LibraryNode):

    # We will fill those later
    implementations: Dict[str, ExpandTransformation] = {}
    default_implementation: str = None

    def __init__(self, name, location=None):
        super().__init__(name,
                         location=location,
                         inputs={'_a_data', '_a_indices', '_a_indptr', '_b', '_c'},
                         outputs={'_d_data', '_d_indices', '_d_indptr'})


A `LibraryNode` can have different implementations (expansions), either generic or specialized for specific architectures. These implementations can use the SDFG API, but they can also be written as Data-Centric programs. We start by creating a *pure* expansion, which is an implementation that does not use any components external to DaCe, such as libraries. We write this expansion as a Data-Centric Python program:

In [2]:
@library.expansion
class MySDDMMPureExpansion(ExpandTransformation):

    environments = []

    @staticmethod
    def expansion(node, state, sdfg):

        # Find shapes and datatypes of inputs and outputs

        # A matrix
        a_indptr_name = list(state.in_edges_by_connector(node, '_a_indptr'))[0].data.data
        a_indptr_arr = sdfg.arrays[a_indptr_name]
        a_data_name = list(state.in_edges_by_connector(node, '_a_data'))[0].data.data
        a_data_arr = sdfg.arrays[a_data_name]
        a_rowsp1 = a_indptr_arr.shape[0]
        a_nnz = a_data_arr.shape[0]
        a_dtype = a_data_arr.dtype

        # B matrix
        b_name = list(state.in_edges_by_connector(node, '_b'))[0].data.data
        b_arr = sdfg.arrays[b_name]
        b_rows = b_arr.shape[0]
        b_cols = b_arr.shape[1]
        b_dtype = b_arr.dtype

        # C matrix
        c_name = list(state.in_edges_by_connector(node, '_c'))[0].data.data
        c_arr = sdfg.arrays[c_name]
        c_rows = c_arr.shape[0]
        c_cols = c_arr.shape[1]
        c_dtype = c_arr.dtype

        # D matrix
        # We assume that it has the same shape and datatype as A

        @dace.program
        def sddmm_pure(_a_data: a_dtype[a_nnz], _a_indices: dace.int32[a_nnz], _a_indptr: dace.int32[a_rowsp1],
                       _b: b_dtype[b_rows, b_cols], _c: c_dtype[c_rows, c_cols],
                       _d_data: a_dtype[a_nnz], _d_indices: dace.int32[a_nnz], _d_indptr: dace.int32[a_rowsp1]):

            _d_data[:] = 0
            _d_indices[:] = _a_indices
            _d_indptr[:] = _a_indptr

            for i in dace.map[0:a_rowsp1 - 1]:
                for j in dace.map[_a_indptr[i]:_a_indptr[i + 1]]:
                    for k in dace.map[0:b_cols]:
                        _d_data[j] += _b[i, k] * _c[k, _a_indices[j]]
                    _d_data[j] *= _a_data[j]

        return sddmm_pure.to_sdfg()



To enable the above expansion, we add it to the `implementations` dictionary:

In [3]:
@library.node
class MySDDMM(nodes.LibraryNode):

    implementations: Dict[str, ExpandTransformation] = {'pure': MySDDMMPureExpansion}
    default_implementation: str = None

    def __init__(self, name, location=None):
        super().__init__(name,
                         location=location,
                         inputs={'_a_data', '_a_indices', '_a_indptr', '_b', '_c'},
                         outputs={'_d_data', '_d_indices', '_d_indptr'})


Now that there is at least one expansion for the `LibraryNode`, we can use it in an SDFG like any other `CodeNode` or `Tasklet`. However, it is also possible to automate its use in Data-Centric Python programs. We use as an example the inference formula for a single layer of the Vanilla Attention (VA) Graph Neural Network (GNN):
$$\bm{H}^\prime = \sigma\left(\bm{A} \odot \left(\bm{H} \times \bm{H}^T\right) \times \bm{H} \times \bm{W}\right)$$
We implement the above formula as a Data-Centric Python program:

In [4]:
import numpy as np

# A is N x N, H is N x K0, W is K0 x K1, H' is N x K1
N, K0, K1, NNZ = (dace.symbol(s) for s in ('N', 'K0', 'K1', 'NNZ'))

@dace.program
def va_inference_layer(A_data: dace.float32[NNZ], A_indices: dace.int32[NNZ], A_indptr: dace.int32[N + 1],
                       H: dace.float32[N, K0],
                       W: dace.float32[K0, K1],
                       H_prime: dace.float32[N, K1]):
    
    # S = A \odot (H \times H^T)
    # S_data = np.empty_like(A_data)
    # S_indices = np.empty_like(A_indices)
    # S_indptr = np.empty_like(A_indptr)
    # dace.sddmm_op(A_data, A_indices, A_indptr, W, H, np.transpose(H), S_data, S_indices, S_indptr)
    S_data, S_indices, S_indptr = dace.sddmm_op(A_data, A_indices, A_indptr, H, np.transpose(H))

    H_prime[:] = np.maximum(0, dace.csrmm_op(S_data, S_indices, S_indptr, H) @ W)
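
For orientation, the two sparse steps in the program above (an SDDMM followed by a CSR-times-dense product) can be written out in plain NumPy. This is only a reference sketch on a hand-written 3 x 3 input; the function name `va_layer_csr` is hypothetical and the loops are not what DaCe generates.

```python
import numpy as np


def va_layer_csr(A_data, A_indices, A_indptr, H, W):
    """Reference VA layer operating directly on the CSR arrays of A."""
    n = len(A_indptr) - 1
    # SDDMM step: S = A ⊙ (H × H^T), evaluated only at the non-zeros of A
    S_data = np.empty_like(A_data)
    for i in range(n):
        for j in range(A_indptr[i], A_indptr[i + 1]):
            S_data[j] = A_data[j] * np.dot(H[i], H[A_indices[j]])
    # CSRMM step: T = S × H, where S shares the sparsity pattern of A
    T = np.zeros_like(H)
    for i in range(n):
        for j in range(A_indptr[i], A_indptr[i + 1]):
            T[i] += S_data[j] * H[A_indices[j]]
    # Dense projection followed by ReLU
    return np.maximum(0, T @ W)


# Small hand-written CSR matrix: [[1, 0, 2], [0, 0, 0], [0, 3, 0]]
A_data = np.array([1.0, 2.0, 3.0])
A_indices = np.array([0, 2, 1], dtype=np.int32)
A_indptr = np.array([0, 2, 2, 3], dtype=np.int32)
A_dense = np.array([[1.0, 0.0, 2.0], [0.0, 0.0, 0.0], [0.0, 3.0, 0.0]])

rng = np.random.default_rng(0)
H = rng.standard_normal((3, 2))
W = rng.standard_normal((2, 2))

out = va_layer_csr(A_data, A_indices, A_indptr, H, W)
# Dense evaluation of the same formula, for comparison
dense_ref = np.maximum(0, (A_dense * (H @ H.T)) @ H @ W)
```

The second row of `A` is empty, so the corresponding row of the output is zero regardless of `H` and `W`.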



To be able to convert the above program to an SDFG, we need to define `dace.sddmm_op` and `dace.csrmm_op`:

In [5]:
from dace.frontend.common import op_repository 


@op_repository.replaces('dace.sddmm_op')
def sddmm_libnode(pv: 'ProgramVisitor',
                  sdfg: dace.SDFG,
                  state: dace.SDFGState,
                  A_data: str,
                  A_indices: str,
                  A_indptr: str,
                  B: str,
                  C: str):
    # Input access nodes
    A_data_acc, A_indices_acc, A_indptr_acc, B_acc, C_acc = (
        state.add_access(n) for n in (A_data, A_indices, A_indptr, B, C))
    # Output D
    A_data_arr = sdfg.arrays[A_data]
    A_indices_arr = sdfg.arrays[A_indices]
    A_indptr_arr = sdfg.arrays[A_indptr]
    D_data, D_data_arr = sdfg.add_temp_transient_like(A_data_arr)
    D_indices, D_indices_arr = sdfg.add_temp_transient_like(A_indices_arr)
    D_indptr, D_indptr_arr = sdfg.add_temp_transient_like(A_indptr_arr)
    D_data_acc, D_indices_acc, D_indptr_acc = (state.add_access(n) for n in (D_data, D_indices, D_indptr))

    libnode = MySDDMM('sddmm')
    state.add_node(libnode)

    # Connect nodes
    state.add_edge(A_indptr_acc, None, libnode, '_a_indptr', dace.Memlet(A_indptr))
    state.add_edge(A_indices_acc, None, libnode, '_a_indices', dace.Memlet(A_indices))
    state.add_edge(A_data_acc, None, libnode, '_a_data', dace.Memlet(A_data))
    state.add_edge(B_acc, None, libnode, '_b', dace.Memlet(B))
    state.add_edge(C_acc, None, libnode, '_c', dace.Memlet(C))
    state.add_edge(libnode, '_d_data', D_data_acc, None, dace.Memlet(D_data))
    state.add_edge(libnode, '_d_indices', D_indices_acc, None, dace.Memlet(D_indices))
    state.add_edge(libnode, '_d_indptr', D_indptr_acc, None, dace.Memlet(D_indptr))

    return [D_data, D_indices, D_indptr]


@op_repository.replaces('dace.csrmm_op')
def csrmm_libnode(pv: 'ProgramVisitor',
                  sdfg: dace.SDFG,
                  state: dace.SDFGState,
                  A_data: str,
                  A_indices: str,
                  A_indptr: str,
                  B: str):
    # Input access nodes
    A_data_acc, A_indices_acc, A_indptr_acc, B_acc = (state.add_access(n) for n in (A_data, A_indices, A_indptr, B))
    # Output C
    A_indptr_arr = sdfg.arrays[A_indptr]
    rows = A_indptr_arr.shape[0] - 1
    cols = sdfg.arrays[B].shape[1]
    A_data_arr = sdfg.arrays[A_data]
    dtype = A_data_arr.dtype
    C, C_arr = sdfg.add_temp_transient([rows, cols], dtype)
    C_acc = state.add_write(C)

    from dace.libraries.sparse import CSRMM
    libnode = CSRMM('csrmm')
    state.add_node(libnode)

    # Connect nodes
    state.add_edge(A_indptr_acc, None, libnode, '_a_rows', dace.Memlet(A_indptr))
    state.add_edge(A_indices_acc, None, libnode, '_a_cols', dace.Memlet(A_indices))
    state.add_edge(A_data_acc, None, libnode, '_a_vals', dace.Memlet(A_data))
    state.add_edge(B_acc, None, libnode, '_b', dace.Memlet(B))
    state.add_edge(libnode, '_c', C_acc, None, dace.Memlet(C))

    return [C]


In [6]:
sdfg = va_inference_layer.to_sdfg()

In [7]:
sdfg.save('sddmm_tutorial.sdfg')

'99ee1e0d5dcc44525626f871a7d195f0443d299da43f07e7ae19afa26ab3ca9f'

In [8]:
func = sdfg.compile()

Automatically expanded library node "_Transpose_" with implementation "pure".


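AttributeError: type object 'MySDDMM' has no attribute '_dace_library_name'

Compilation fails because `MySDDMM` is not yet associated with a DaCe library. We register it with the `sparse` library: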

In [9]:
from dace.library import register_node
register_node(MySDDMM, dace.libraries.sparse)

In [10]:
func = sdfg.compile()

Automatically expanded library node "_Transpose_" with implementation "pure".


ValueError: No implementation or default implementation specified.

In [11]:
MySDDMM._dace_library_name

'sparse'

In [12]:
sdfg = va_inference_layer.to_sdfg()
func = sdfg.compile()

Automatically expanded library node "_Transpose_" with implementation "pure".


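ValueError: No implementation or default implementation specified.

The error persists because no default implementation has been chosen. We select `pure` as the default implementation for both `MySDDMM` and `CSRMM`: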

In [13]:
MySDDMM.default_implementation = 'pure'
dace.libraries.sparse.CSRMM.default_implementation = 'pure'
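
The role of `default_implementation` can be illustrated with a simplified model of how an expansion might be selected: a per-node choice first, then the class-wide default, and an error if neither is set. The classes below are hypothetical stand-ins written for this sketch; they are not DaCe code and omit details such as configuration overrides.

```python
from typing import Callable, Dict, Optional


class ToyLibraryNode:
    """Hypothetical stand-in for a library node class (not DaCe code)."""
    implementations: Dict[str, Callable[[], str]] = {}
    default_implementation: Optional[str] = None

    def __init__(self, implementation: Optional[str] = None):
        # A per-node choice takes precedence over the class-wide default
        self.implementation = implementation

    def select_implementation(self) -> str:
        name = self.implementation
        if name is None:
            name = type(self).default_implementation
        if name is None:
            raise ValueError('No implementation or default implementation specified.')
        if name not in self.implementations:
            raise KeyError(f'Unknown implementation: {name}')
        return name


class ToySDDMM(ToyLibraryNode):
    implementations = {'pure': lambda: 'expanded with pure'}


node = ToySDDMM()
try:
    node.select_implementation()
    failed = False
except ValueError:
    failed = True  # no default set yet, as in the ValueError shown earlier

# Setting the class attribute affects nodes that were already created
ToySDDMM.default_implementation = 'pure'
chosen = node.select_implementation()
```

In this model the default is a class attribute, so assigning it after the class has been defined changes the behavior of every node without an explicit implementation.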

In [14]:
sdfg = va_inference_layer.to_sdfg()
func = sdfg.compile()

Automatically expanded library node "_Transpose_" with implementation "pure".
Automatically expanded library node "sddmm" with implementation "pure".


Automatically expanded library node "csrmm" with implementation "pure".
Automatically expanded library node "_MatMult_" with implementation "specialize".
Automatically expanded library node "_MatMult_gemm" with implementation "pure".
-- Configuring done
-- Generating done
-- Build files have been written to: /home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/build

[ 25%] Building CXX object CMakeFiles/va_inference_layer.dir/home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer.cpp.o
In file included from /home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/dace.h:14,
                 from /home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer.cpp:2:
/home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/types.h: In constructor 'dace::half::half(float)':
             uint32_t x = *((uint32_t*)&f);
                           ~^~~~~~~~~~~~~~
/home/alziogas/Projects/

In [15]:
import numpy as np
from scipy import sparse

rng = np.random.default_rng(42)
A = sparse.random(1000, 1000, density=0.01, dtype=np.float32, format='csr', random_state=rng)
A.data[:] = 1
H = rng.random((1000, 128), dtype=np.float32)
W = rng.random((128, 128), dtype=np.float32)

val = np.empty((1000, 128), dtype=np.float32)
func(A_data=A.data.copy(), A_indices=A.indices.copy(), A_indptr=A.indptr.copy(), H=H, W=W, H_prime=val, N=1000, K0=128, K1=128, NNZ=A.nnz)

ref = np.maximum(0, (A.toarray() * (H @ H.T)) @ H @ W)

np.allclose(ref, val)
np.linalg.norm(ref - val) / np.linalg.norm(ref)

9.3466355e-08
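
A notebook displays only the value of the last expression in a cell, so the `np.allclose` result above is not shown. A minimal sketch of a check that reports both quantities is given below; the helper name `check_against_reference` and the small test arrays are assumptions made for this illustration.

```python
import numpy as np


def check_against_reference(ref, val, rtol=1e-5, atol=1e-8):
    """Return (all_close, relative_error) for a result `val` and reference `ref`."""
    all_close = bool(np.allclose(ref, val, rtol=rtol, atol=atol))
    rel_err = float(np.linalg.norm(ref - val) / np.linalg.norm(ref))
    return all_close, rel_err


ref_small = np.array([1.0, 2.0, 3.0], dtype=np.float32)
val_small = ref_small.copy()
val_small[0] += np.float32(1e-6)  # perturb one entry slightly

ok, err = check_against_reference(ref_small, val_small)
print(ok, err)
```

The relative error in the Frobenius norm summarizes the overall deviation, while `np.allclose` checks every element against the tolerance individually.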

In [16]:
from dace.transformation.auto.auto_optimize import auto_optimize

from dace.libraries.sparse import CSRMM
CSRMM.default_implementation = 'cuSPARSE'

sdfg = va_inference_layer.to_sdfg()
auto_optimize(sdfg, dace.DeviceType.GPU)

Applied 1 GPUTransformSDFG.
Hello
['cuBLAS', 'cuSolverDn', 'cuSPARSE', 'GPUAuto', 'CUB', 'pure']
Automatically expanded library node "_Transpose_" with implementation "cuBLAS".
Automatically expanded library node "sddmm" with implementation "pure".
Automatically expanded library node "csrmm" with implementation "cuSPARSE".
Automatically expanded library node "_MatMult_gemm" with implementation "cuBLAS".




In [18]:
func = sdfg.compile()
func(A_data=A.data.copy(), A_indices=A.indices.copy(), A_indptr=A.indptr.copy(), H=H, W=W, H_prime=val, N=1000, K0=128, K1=128, NNZ=A.nnz)

np.allclose(ref, val)
np.linalg.norm(ref - val) / np.linalg.norm(ref)



-- Found CUDA: /usr/local/cuda (found version "11.6")
-- Local CUDA architectures detected: 75
-- Configuring done
-- Generating done
-- Build files have been written to: /home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/build

[ 20%] Building NVCC (Device) object CMakeFiles/cuda_compile_1.dir/__/__/tutorials/.dacecache/va_inference_layer/src/cuda/cuda_compile_1_generated_va_inference_layer_0_cuda.cu.o

[ 40%] Building CXX object CMakeFiles/va_inference_layer_0.dir/home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer_0.cpp.o
In file included from /home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/dace.h:14,
                 from /home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer_0.cpp:2:
/home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/types.h: In constructor 'dace::half::half(float)':
             uint32_t x = *((uint32_t*)&f);
                     

8.7761656e-08

In [30]:
import copy
from dace.libraries.blas.blas_helpers import cublas_type_metadata, to_cublas_computetype
from dace.libraries.sparse.environments import cuSPARSE

@library.expansion
class MySDDMMCuSPARSEExpansion(ExpandTransformation):

    environments = [cuSPARSE]

    @staticmethod
    def expansion(node, state, sdfg):

        # Find shapes and datatypes of inputs and outputs

        # A matrix
        a_indptr_name = list(state.in_edges_by_connector(node, '_a_indptr'))[0].data.data
        a_indptr_arr = sdfg.arrays[a_indptr_name]
        a_indices_name = list(state.in_edges_by_connector(node, '_a_indices'))[0].data.data
        a_indices_arr = sdfg.arrays[a_indices_name]
        a_data_name = list(state.in_edges_by_connector(node, '_a_data'))[0].data.data
        a_data_arr = sdfg.arrays[a_data_name]
        a_rowsp1 = a_indptr_arr.shape[0]
        a_nnz = a_data_arr.shape[0]
        a_dtype = a_data_arr.dtype

        # B matrix
        b_name = list(state.in_edges_by_connector(node, '_b'))[0].data.data
        b_arr = sdfg.arrays[b_name]
        b_rows = b_arr.shape[0]
        b_cols = b_arr.shape[1]
        b_dtype = b_arr.dtype

        # C matrix
        c_name = list(state.in_edges_by_connector(node, '_c'))[0].data.data
        c_arr = sdfg.arrays[c_name]
        c_rows = c_arr.shape[0]
        c_cols = c_arr.shape[1]
        c_dtype = c_arr.dtype

        # D matrix
        # We assume that it has the same shape and datatype as A
        d_indptr_name = list(state.out_edges_by_connector(node, '_d_indptr'))[0].data.data
        d_indptr_arr = sdfg.arrays[d_indptr_name]
        d_indices_name = list(state.out_edges_by_connector(node, '_d_indices'))[0].data.data
        d_indices_arr = sdfg.arrays[d_indices_name]
        d_data_name = list(state.out_edges_by_connector(node, '_d_data'))[0].data.data
        d_data_arr = sdfg.arrays[d_data_name]

        # If buffers are not on the GPU, copy them
        needs_copy = any(desc.storage not in (dace.StorageType.GPU_Global, dace.StorageType.CPU_Pinned)
                         for desc in (a_data_arr, b_arr, c_arr, d_data_arr))
        
        dtype = a_data_arr.dtype.base_type
        cdtype = cublas_type_metadata(dtype)[1]
        compute = f'CUDA_R_{to_cublas_computetype(dtype)}'
        handle = '__dace_cusparse_handle'

        call_prefix = cuSPARSE.handle_setup_code(node)
        call_suffix = ''

        # Deal with complex input constants
        if dtype in (dace.complex64, dace.complex128):
            alpha = f'{dtype.ctype}(1, 0)'
            beta = f'{dtype.ctype}(0, 0)'
        else:
            alpha = f'{dtype.ctype}(1)'
            beta = f'{dtype.ctype}(0)'

        # Set pointer mode to host
        call_prefix += f'''cusparseSetPointerMode(__dace_cusparse_handle, CUSPARSE_POINTER_MODE_HOST);
        {cdtype} alpha = {alpha};
        {cdtype} beta = {beta};
        '''

        arr_prefix = ''
        if needs_copy:
            arr_prefix = '_conn'

        call = f"""
            // Please note that cuSPARSE defines SDDMM as (AxB)oC, while we defined it earlier as Ao(BxC),
            // where 'o' is the Hadamard product. In other words, in cuSPARSE, C is the sparse matrix that samples
            // the dense product of A and B. We will continue using our notation here, but please keep this in mind.

            // Copy/set output
            cudaMemcpy({arr_prefix}_d_indptr, {arr_prefix}_a_indptr, {a_rowsp1} * sizeof(int32_t), cudaMemcpyDeviceToDevice);
            cudaMemcpy({arr_prefix}_d_indices, {arr_prefix}_a_indices, {a_nnz} * sizeof(int32_t), cudaMemcpyDeviceToDevice);
            cudaMemcpy({arr_prefix}_d_data, {arr_prefix}_a_data, {a_nnz} * sizeof({dtype.ctype}), cudaMemcpyDeviceToDevice);
            
            cusparseSpMatDescr_t matA;
            cusparseDnMatDescr_t matB, matC;
            void*                dBuffer    = NULL;
            size_t               bufferSize = 0;

            // Create sparse matrix A (D) in CSR format
            dace::sparse::CheckCusparseError( cusparseCreateCsr(&matA, {a_rowsp1 - 1}, {c_cols}, {a_nnz},
                                                {arr_prefix}_d_indptr, {arr_prefix}_d_indices, {arr_prefix}_d_data,
                                                CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
                                                CUSPARSE_INDEX_BASE_ZERO, {compute}) );
            // Create dense matrix B
            dace::sparse::CheckCusparseError( cusparseCreateDnMat(&matB, {b_rows}, {b_cols}, {b_cols}, {arr_prefix}_b,
                                                {compute}, CUSPARSE_ORDER_ROW) );
            // Create dense matrix C
            dace::sparse::CheckCusparseError( cusparseCreateDnMat(&matC, {c_rows}, {c_cols}, {c_cols}, {arr_prefix}_c,
                                                {compute}, CUSPARSE_ORDER_ROW) );
            // allocate an external buffer if needed
            dace::sparse::CheckCusparseError( cusparseSDDMM_bufferSize(
                                                {handle},
                                                CUSPARSE_OPERATION_NON_TRANSPOSE,
                                                CUSPARSE_OPERATION_NON_TRANSPOSE,
                                                &alpha, matB, matC, &beta, matA, {compute},
                                                CUSPARSE_SDDMM_ALG_DEFAULT, &bufferSize) );
            cudaMalloc(&dBuffer, bufferSize);

            // execute SDDMM
            dace::sparse::CheckCusparseError( cusparseSDDMM(
                                                {handle},
                                                CUSPARSE_OPERATION_NON_TRANSPOSE,
                                                CUSPARSE_OPERATION_NON_TRANSPOSE,
                                                &alpha, matB, matC, &beta, matA, {compute},
                                                CUSPARSE_SDDMM_ALG_DEFAULT, dBuffer) );

            // destroy matrix/vector descriptors
            dace::sparse::CheckCusparseError( cusparseDestroySpMat(matA) );
            dace::sparse::CheckCusparseError( cusparseDestroyDnMat(matB) );
            dace::sparse::CheckCusparseError( cusparseDestroyDnMat(matC) );
            cudaFree(dBuffer);
        """

        code = (call_prefix + call + call_suffix)
        tasklet = dace.sdfg.nodes.Tasklet(
            node.name,
            node.in_connectors,
            node.out_connectors,
            code,
            language=dace.dtypes.Language.CPP,
        )

        # If buffers are not on the GPU, copy them
        if needs_copy:
            nsdfg = dace.SDFG('nested_sddmm')
            copies = [('_a_indptr', a_indptr_arr), ('_a_indices', a_indices_arr), ('_a_data', a_data_arr), ('_b', b_arr),
                      ('_c', c_arr), ('_d_indptr', d_indptr_arr), ('_d_indices', d_indices_arr), ('_d_data', d_data_arr),]
            for name, desc in copies:
                if isinstance(desc, dace.data.View):
                    dcopy = desc.as_array()
                else:
                    dcopy = copy.deepcopy(desc)
                dcopy.lifetime = dace.AllocationLifetime.Scope
                dcopy_gpu = copy.deepcopy(dcopy)
                dcopy.transient = False
                nsdfg.add_datadesc(name, dcopy)
                dcopy_gpu.transient = True
                dcopy_gpu.storage = dace.StorageType.GPU_Global
                nsdfg.add_datadesc(name + '_gpu', dcopy_gpu)
            nstate = nsdfg.add_state()
            har, hac, had, hb, hc, hdr, hdc, hdd = (nstate.add_access(n) for n in (
                '_a_indptr', '_a_indices', '_a_data', '_b', '_c', '_d_indptr', '_d_indices', '_d_data'))
            gar, gac, gad, gb, gc, gdr, gdc, gdd = (nstate.add_access(n) for n in (
                '_a_indptr_gpu', '_a_indices_gpu', '_a_data_gpu', '_b_gpu', '_c_gpu',
                '_d_indptr_gpu', '_d_indices_gpu', '_d_data_gpu'))

            # Reset code and connectors
            tasklet.in_connectors = {"_conn" + k: None for k in tasklet.in_connectors}
            tasklet.out_connectors = {"_conn" + k: None for k in tasklet.out_connectors}

            nstate.add_nedge(har, gar, dace.Memlet.from_array('_a_indptr', a_indptr_arr))
            nstate.add_nedge(hac, gac, dace.Memlet.from_array('_a_indices', a_indices_arr))
            nstate.add_nedge(had, gad, dace.Memlet.from_array('_a_data', a_data_arr))
            nstate.add_nedge(hb, gb, dace.Memlet.from_array('_b', b_arr))
            nstate.add_nedge(hc, gc, dace.Memlet.from_array('_c', c_arr))

            nstate.add_edge(gar, None, tasklet, '_conn_a_indptr', dace.Memlet.from_array('_a_indptr_gpu', a_indptr_arr))
            nstate.add_edge(gac, None, tasklet, '_conn_a_indices', dace.Memlet.from_array('_a_indices_gpu', a_indices_arr))
            nstate.add_edge(gad, None, tasklet, '_conn_a_data', dace.Memlet.from_array('_a_data_gpu', a_data_arr))
            nstate.add_edge(gb, None, tasklet, '_conn_b', dace.Memlet.from_array('_b_gpu', b_arr))
            nstate.add_edge(gc, None, tasklet, '_conn_c', dace.Memlet.from_array('_c_gpu', c_arr))
            nstate.add_edge(tasklet, '_conn_d_indptr', gdr, None, dace.Memlet.from_array('_d_indptr_gpu', d_indptr_arr))
            nstate.add_edge(tasklet, '_conn_d_indices', gdc, None, dace.Memlet.from_array('_d_indices_gpu', d_indices_arr))
            nstate.add_edge(tasklet, '_conn_d_data', gdd, None, dace.Memlet.from_array('_d_data_gpu', d_data_arr))

            nstate.add_nedge(gdr, hdr, dace.Memlet.from_array('_d_indptr', d_indptr_arr))
            nstate.add_nedge(gdc, hdc, dace.Memlet.from_array('_d_indices', d_indices_arr))
            nstate.add_nedge(gdd, hdd, dace.Memlet.from_array('_d_data', d_data_arr))

            return nsdfg
        # End of copy to GPU

        return tasklet
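
One point in the expansion above deserves care. Assuming `cusparseSDDMM` follows the semantics given in the NVIDIA documentation, it computes `alpha * (X × Y) ⊙ spy(S) + beta * S`, where `spy(S)` is the 0/1 sparsity pattern of the sparse matrix. With `beta = 0`, the stored values of the sparse matrix are therefore ignored. The dense NumPy emulation below is a sketch under that assumption; both function names are hypothetical.

```python
import numpy as np


def sddmm_tutorial(S, X, Y):
    """Definition used in this tutorial: S ⊙ (X × Y)."""
    return S * (X @ Y)


def sddmm_cusparse_like(alpha, X, Y, beta, S):
    """Dense emulation of the assumed cuSPARSE semantics:
    alpha * (X × Y) ⊙ spy(S) + beta * S, with spy(S) the 0/1 pattern of S."""
    pattern = (S != 0).astype(S.dtype)
    return alpha * (X @ Y) * pattern + beta * S


X = np.array([[1.0, 2.0], [3.0, 4.0]])
Y = np.eye(2)
S_ones = np.array([[1.0, 0.0], [0.0, 1.0]])     # non-zero values all equal to 1
S_general = np.array([[2.0, 0.0], [0.0, 3.0]])  # general non-zero values

same_for_ones = np.array_equal(sddmm_tutorial(S_ones, X, Y),
                               sddmm_cusparse_like(1.0, X, Y, 0.0, S_ones))
same_for_general = np.array_equal(sddmm_tutorial(S_general, X, Y),
                                  sddmm_cusparse_like(1.0, X, Y, 0.0, S_general))
```

Under this reading, the two definitions agree when all non-zero values of the sparse matrix are 1, which is the case for the test input in this tutorial (`A.data[:] = 1`). For a general-valued `A`, the result would presumably need an additional element-wise scaling by the values of `A`.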


In [31]:
@library.node
class MySDDMM(nodes.LibraryNode):

    implementations: Dict[str, ExpandTransformation] = {'pure': MySDDMMPureExpansion, 'cuSPARSE': MySDDMMCuSPARSEExpansion}
    default_implementation: str = 'cuSPARSE'

    def __init__(self, name, location=None):
        super().__init__(name,
                         location=location,
                         inputs={'_a_data', '_a_indices', '_a_indptr', '_b', '_c'},
                         outputs={'_d_data', '_d_indices', '_d_indptr'})

In [32]:
sdfg = va_inference_layer.to_sdfg()
auto_optimize(sdfg, dace.DeviceType.GPU)

func = sdfg.compile()
func(A_data=A.data.copy(), A_indices=A.indices.copy(), A_indptr=A.indptr.copy(), H=H, W=W, H_prime=val, N=1000, K0=128, K1=128, NNZ=A.nnz)

np.allclose(ref, val)
np.linalg.norm(ref - val) / np.linalg.norm(ref)

Applied 1 GPUTransformSDFG.
Hello
['cuBLAS', 'cuSolverDn', 'cuSPARSE', 'GPUAuto', 'CUB', 'pure']
Automatically expanded library node "_Transpose_" with implementation "cuBLAS".
Automatically expanded library node "sddmm" with implementation "cuSPARSE".
Automatically expanded library node "csrmm" with implementation "cuSPARSE".
Automatically expanded library node "_MatMult_gemm" with implementation "cuBLAS".
[ 20%] Building CXX object CMakeFiles/va_inference_layer_1.dir/home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer_1.cpp.o
In file included from /home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/dace.h:14,
                 from /home/alziogas/Projects/dace/tutorials/.dacecache/va_inference_layer/src/cpu/va_inference_layer_1.cpp:2:
/home/alziogas/Projects/dace/dace/codegen/../runtime/include/dace/types.h: In constructor 'dace::half::half(float)':
             uint32_t x = *((uint32_t*)&f);
                           ~^~~~~~~~

8.7761656e-08