# Getting Started with DaCe

DaCe is a Python library for optimizing code with ease, scaling from a single core to a full supercomputer. Using data-centric transformations, it can automatically map code to CPUs, GPUs, and FPGAs.

Let's get started with DaCe by importing it:

In [1]:
import dace

A data-centric program can be generated from several general-purpose and domain-specific languages. Our main frontend, however, is Python/numpy. To define a program, we take an existing function on numpy arrays and decorate it with `@dace.program`:

In [2]:
@dace.program
def getstarted(A):
    return A + A

Running our dace program, we will see several outputs and a prompt. These are the available transformations we can apply. For the first step, we opt to apply none (press Enter) and proceed with compilation and running:

In [3]:
import numpy as np
a = np.random.rand(2, 3)
a

array([[0.97547066, 0.95040688, 0.01353542],
       [0.85183968, 0.42836647, 0.76228063]])

In [4]:
getstarted(a)

-- Configuring done (0.5s)
You have changed variables that require your cache to be deleted.
Configure will be re-run and you may have to reset some variables.
The following variables have changed:
CMAKE_CXX_COMPILER= /opt/homebrew/opt/llvm/bin/clang++
-- The C compiler identification is Clang 20.1.7
-- The CXX compiler identification is Clang 20.1.7
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/homebrew/opt/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/homebrew/opt/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found

array([[1.95094132, 1.90081375, 0.02707083],
       [1.70367937, 0.85673294, 1.52456126]])

The result is, as expected, `2*A`.

Now, let's inspect the intermediate representation of the data-centric program, its Stateful Dataflow Multigraph (SDFG):

In [5]:
getstarted.to_sdfg(a)

You can drag the handle at the bottom right to make the SDFG frame larger.

Notice the following four elements in the graph:

1. **State** (blue region): This is the control flow part of the application, represented as a state machine. Since there is no control flow in the data-centric representation of `A+A`, we see only one state encompassing the computation.
2. **Arrays** (circular nodes) and **Memlets** (arrows): These nodes represent disjoint N-dimensional memory regions (similar to numpy `ndarray`s), and the edges represent data that is moved throughout the state. Hovering over a memlet will show more information about the subset being moved.
3. **Tasklets** (octagon): These nodes represent the computational parts of the graph. Zooming into one shows its code (an addition in this case). Tasklets act as pure functions that can only work with the data flowing in and out through their **connectors** (cyan circles on the node).
4. **Maps** (trapezoid): Anything enclosed between the two trapezoid nodes (the map *scope*) is replicated the number of times specified on the node (in our case, `2*3` times). This creates parametric parallelism in the graph, and maps can be nested within each other for efficient parallelization and distribution of work.
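
To make the roles of the map and the tasklet concrete, the following plain-Python sketch mimics what the graph expresses for a 2x3 input. It is not DaCe code or generated output: the names `tasklet_add` and `run_map` are hypothetical, nested lists stand in for arrays, and the loop runs sequentially, whereas a real map scope may execute its iterations in parallel.

```python
from itertools import product


def tasklet_add(in1, in2):
    """Stand-in for the tasklet: a pure function of its input connectors."""
    return in1 + in2


def run_map(A):
    """Sequentially emulate a map over every (i, j) index of a 2-D input."""
    rows, cols = len(A), len(A[0])
    out = [[0.0] * cols for _ in range(rows)]
    # The map scope: one tasklet instance per point of the iteration space.
    for i, j in product(range(rows), range(cols)):
        # Each "memlet" moves a single element into or out of the tasklet.
        out[i][j] = tasklet_add(A[i][j], A[i][j])
    return out


A = [[0.5, 1.0, 1.5],
     [2.0, 2.5, 3.0]]
result = run_map(A)  # 2 * 3 = 6 tasklet invocations
```

Since no iteration reads a value written by another, the six invocations are independent, which is the property that lets a map be parallelized.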

Unfortunately (or fortunately in some cases), this graph is specialized for a specific size of array (as given to it), and will not work on other sizes. To compile a program that works with general sizes, we'll need to use symbolic sizes.

## Symbols

DaCe includes a symbolic math engine (extending SymPy) to support symbolic expressions for sizes, ranges, accesses, and more. 

Any number of symbols can be used throughout a computation. Defining a symbol is as easy as calling:

In [6]:
N = dace.symbol('N')

which we can now use in any computation or definition. For example, annotating the argument types of our function from above yields a version that works with any size:

In [7]:
@dace.program
def getstarted_sym(A: dace.float64[N, 2*N]):
    return A + A

In [8]:
getstarted_sym.to_sdfg()

When this program is called, the shape of any array that matches `Nx2N` is used to infer the value of `N` automatically before the function is invoked:

In [9]:
getstarted_sym(np.random.rand(100, 200))

-- The C compiler identification is Clang 20.1.7
-- The CXX compiler identification is Clang 20.1.7
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /opt/homebrew/opt/llvm/bin/clang - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /opt/homebrew/opt/llvm/bin/clang++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found OpenMP_CXX: -fopenmp=libomp (found version "5.1")
-- Found OpenMP: TRUE (found version "5.1") found components: CXX
-- Configuring done (2.3s)
-- Generating done (0.0s)
-- Build files have been written to: /Users/sophieblock/dev_packages/dace/tutorials/.dacecache/getstarted_sym/build

[ 25%] Building CXX object CM

array([[1.24778744, 0.62633637, 0.78074174, ..., 0.94280647, 0.16114633,
        0.19611979],
       [0.29612183, 1.85844457, 1.5315922 , ..., 1.02211127, 1.2505371 ,
        0.76519006],
       [0.57405681, 1.04549932, 1.26368478, ..., 0.04879747, 1.98218855,
        0.58572712],
       ...,
       [0.54871819, 1.81324495, 0.38762628, ..., 0.67777597, 1.51170197,
        0.72363192],
       [0.32642433, 0.47425769, 0.98985848, ..., 0.07857811, 0.09764287,
        0.0192146 ],
       [0.32490961, 0.24263521, 0.1650494 , ..., 0.94200318, 1.47921343,
        0.1774446 ]], shape=(100, 200))
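
The shape matching described above can be pictured with a small plain-Python sketch. This is an illustration only, not DaCe's inference code: `infer_n` is a hypothetical helper that checks a concrete shape against the annotated `[N, 2*N]` pattern.

```python
def infer_n(shape):
    """Return N if `shape` matches (N, 2*N); raise ValueError otherwise."""
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D array, got shape {shape}")
    n, second = shape
    if second != 2 * n:
        raise ValueError(f"shape {shape} does not match (N, 2*N)")
    return n


n_value = infer_n((100, 200))  # same shape as the call above, so N is 100

try:
    infer_n((100, 150))  # 150 != 2 * 100, so no value of N fits
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```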

## Performance

Since our SDFG is now symbolic, there is no need to recompile it for every array size. We can instead pre-compile the graph into a shared library (an .so, .dll, or .dylib file, depending on the platform):

In [10]:
csdfg = getstarted_sym.compile()

-- Configuring done (0.1s)
-- Generating done (0.0s)
-- Build files have been written to: /Users/sophieblock/dev_packages/dace/tutorials/.dacecache/getstarted_sym/build

[ 25%] Building CXX object CMakeFiles/getstarted_sym_0.dir/Users/sophieblock/dev_packages/dace/tutorials/.dacecache/getstarted_sym/src/cpu/getstarted_sym_0.cpp.o
[ 50%] Linking CXX shared library libgetstarted_sym_0.dylib
[ 50%] Built target getstarted_sym_0
[ 75%] Building CXX object CMakeFiles/dacestub_getstarted_sym_0.dir/tools/dacestub.cpp.o
[100%] Linking CXX shared library libdacestub_getstarted_sym_0.dylib
[100%] Built target dacestub_getstarted_sym_0

Unlike the decorated function, however, a compiled SDFG must be invoked like an SDFG: with keyword arguments only, and with symbol values such as `N` passed explicitly:

In [11]:
b = csdfg(A=np.random.rand(10,20), N=np.int32(10))

We can now compare the performance of the compiled code with numpy on large arrays:

In [12]:
tester = np.random.rand(2000, 4000)

In [13]:
%timeit tester + tester

7.36 ms ± 240 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit csdfg(A=tester, N=np.int32(2000))

1.12 ms ± 19 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
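
On the figures reported above, 7.36 ms / 1.12 ms works out to a speedup of roughly 6.6x for the compiled SDFG on that machine. Outside a notebook, where the `%timeit` magic is unavailable, a comparable measurement can be taken with the standard-library `timeit` module. The sketch below is one possible setup, not part of the tutorial: `time_ms` is a hypothetical helper, it reports the best run rather than the mean and standard deviation that `%timeit` prints, and the timed workload is a placeholder.

```python
import timeit


def time_ms(fn, repeat=7, number=100):
    """Return the best per-call time of `fn` in milliseconds."""
    # timeit.repeat returns one total time (in seconds) per run of `number` calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number * 1e3


def placeholder_workload():
    """Placeholder; swap in the numpy or compiled-SDFG call to compare them."""
    return sum(x + x for x in range(1000))


elapsed = time_ms(placeholder_workload, repeat=3, number=10)
print(f"{elapsed:.4f} ms per call")
```

In a real comparison one would pass `lambda: tester + tester` and `lambda: csdfg(A=tester, N=np.int32(2000))` to `time_ms` in turn.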


## Explicit Dataflow

One can specify explicit dataflow in DaCe using the `for i in dace.map[begin:end]:` syntax, and write tasklets manually using `with dace.tasklet:`. Here is a real-world example (Scattering Self-Energies) with an 8-dimensional parallel computation:

In [15]:
# Declaration of symbolic variables
Nkz, NE, Nqz, Nw, N3D, NA, NB, Norb = (
    dace.symbol(name, dtype=dace.int32)
    for name in ['Nkz', 'NE', 'Nqz', 'Nw', 'N3D', 'NA', 'NB', 'Norb'])


@dace.program
def sse_sigma(neigh_idx: dace.int32[NA, NB],
              dH: dace.complex128[NA, NB, N3D, Norb, Norb],
              G: dace.complex128[Nkz, NE, NA, Norb, Norb],
              D: dace.complex128[Nqz, Nw, NA, NB, N3D, N3D],
              Sigma: dace.complex128[Nkz, NE, NA, Norb, Norb]):

    # Declaration of Map scope
    for k, E, q, w, i, j, a, b in dace.map[0:Nkz, 0:NE, 0:Nqz, 0:Nw,
                                           0:N3D, 0:N3D, 0:NA, 0:NB]:
        dHG = G[k - q, E - w, neigh_idx[a, b]] @ dH[a, b, i]
        dHD = dH[a, b, j] * D[q, w, a, b, i, j]
        Sigma[k, E, a] += dHG @ dHD
        
sse_sigma.to_sdfg()
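
One way to reason about this map is to enumerate its iteration space. The plain-Python sketch below does so for small, made-up sizes (the values in `sizes` are assumptions chosen only to keep the loop short) and counts how many iterations write to each `Sigma[k, E, a]` entry. Because the map also ranges over `q`, `w`, `i`, `j`, and `b`, many iterations accumulate into the same output, which is why the update is written with `+=`.

```python
from collections import Counter
from itertools import product
from math import prod

# Hypothetical, tiny sizes; the real problem sizes are not given in the text.
sizes = {'Nkz': 2, 'NE': 3, 'Nqz': 2, 'Nw': 2, 'N3D': 3, 'NA': 2, 'NB': 2}

# Same order as the map: k, E, q, w, i, j, a, b (i and j both span N3D).
extents = [sizes[name]
           for name in ('Nkz', 'NE', 'Nqz', 'Nw', 'N3D', 'N3D', 'NA', 'NB')]

writes = Counter()
for k, E, q, w, i, j, a, b in product(*(range(n) for n in extents)):
    writes[(k, E, a)] += 1  # this iteration accumulates into Sigma[k, E, a]

total_iterations = prod(extents)
writes_per_output = sizes['Nqz'] * sizes['Nw'] * sizes['N3D'] ** 2 * sizes['NB']
```

With these sizes the map has 864 iterations spread over 12 distinct `Sigma[k, E, a]` entries, so each entry receives 72 contributions. A parallel implementation therefore has to treat the update as a reduction rather than as independent writes.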