# Getting Started with DaCe

DaCe is a Python library for optimizing code so that it scales from a single core to a full supercomputer. Using data-centric transformations, it can automatically map code to CPUs, GPUs, and FPGAs.

Let's get started with DaCe by importing it:

In [1]:
import dace

A data-centric program can be generated from several general-purpose and domain-specific languages. Our main frontend, however, is Python/numpy. To define a program, we take an existing function on numpy arrays and decorate it with `@dace.program`:

In [2]:
@dace.program
def getstarted(A):
    return A + A

When we run our DaCe program, it prints a list of the transformations that can be applied, followed by a prompt. For the first step, we opt to apply none (press Enter) and proceed with compilation and running:

In [3]:
import numpy as np
a = np.random.rand(2, 3)
a

array([[0.55928079, 0.99304442, 0.65794934],
       [0.58411335, 0.77098752, 0.30517518]])

In [4]:
getstarted(a)

Applied 1 StateFusion.
0. Transformation FPGATransformSDFG in getstarted
1. Transformation FPGATransformState in BinOp_3
2. Transformation GPUTransformLocalStorage in _Add__map[__i0=0:2, __i1=0:3]
3. Transformation GPUTransformMap in _Add__map[__i0=0:2, __i1=0:3]
4. Transformation GPUTransformSDFG in getstarted
5. Transformation MapExpansion in _Add__map: ['__i0', '__i1']
6. Transformation MapTiling in _Add__map: ['__i0', '__i1']
7. Transformation NestSDFG in getstarted
8. Transformation StripMining in _Add__map: ['__i0', '__i1']
9. Transformation Vectorization in 0 -> 1 -> 2
Select the pattern to apply (0 - 9 or name$id): 
You did not select a valid option. Quitting optimization ...
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/dace/tutorials/.dacecache/getstarted/build

[ 50%] Built target getstarted
[100%] Built target dacestub_getstarted



array([[1.11856158, 1.98608884, 1.31589868],
       [1.1682267 , 1.54197505, 0.61035035]])

The result is, as expected, `2*A`.

Now, let's inspect the intermediate representation of the data-centric program, its Stateful Dataflow Multigraph (SDFG):

In [6]:
getstarted.to_sdfg(a)

Applied 1 StateFusion.


You can drag the handle at the bottom right to make the SDFG frame larger.

Notice the following four elements in the graph:

1. **State** (blue region): This is the control flow part of the application, represented as a state machine. Since there is no control flow in the data-centric representation of `A+A`, we see only one state encompassing the computation.
2. **Arrays** (circular nodes) and **Memlets** (arrows): Array nodes represent disjoint N-dimensional memory regions (similar to numpy `ndarray`s), and memlet edges represent data that is moved within the state. Hovering over a memlet will show more information about the subset being moved.
3. **Tasklets** (octagon): These nodes represent the computational parts of the graph. Zooming into one will show its code (an addition operation in this case). Tasklets act as pure functions that can only work with the data coming into or out of their **connectors** (cyan circles on the node).
4. **Maps** (trapezoid): A map consists of an entry node and an exit node. Everything enclosed between them (the map *scope*) is replicated as many times as the range on the node specifies (in our case, `2*3` times). This creates parametric parallelism in the graph, and maps can be nested within each other for efficient parallelization and distribution of work.
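
To make the map and tasklet semantics concrete, here is a minimal plain-Python sketch of what the graph for `A + A` describes. It is a conceptual emulation only, not DaCe's generated code or API: `add_tasklet` and `run_map` are hypothetical names, and the loop runs sequentially even though every point of a map is independent and could run in parallel.

```python
from itertools import product


def add_tasklet(a, b):
    """Stand-in for the tasklet: a pure function of its input connectors."""
    return a + b


def run_map(A):
    """Emulate the map scope of `A + A` for a 2-D nested list."""
    rows, cols = len(A), len(A[0])
    out = [[0.0] * cols for _ in range(rows)]
    # The map range [0:rows, 0:cols] is the Cartesian product of both axes;
    # each point is one independent instance of the scope's contents.
    for i, j in product(range(rows), range(cols)):
        out[i][j] = add_tasklet(A[i][j], A[i][j])
    return out


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
result = run_map(A)
print(result)  # [[2.0, 4.0, 6.0], [8.0, 10.0, 12.0]]
```

For the `2 x 3` input, the loop body executes six times, matching the `2*3` replications of the map scope.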

Unfortunately (or fortunately in some cases), this graph is specialized for the array size it was given and will not work with other sizes. To compile a program that works with general sizes, we'll need to use symbolic sizes.

## Symbols

DaCe includes a symbolic math engine (extending SymPy) to support symbolic expressions for sizes, ranges, accesses, and more. 

Any number of symbols can be used throughout a computation. Defining a symbol is as easy as calling:

In [7]:
N = dace.symbol('N')

We can now use this symbol in computations and definitions. For example, annotating the argument types of our function from above yields a version that works with any size:

In [8]:
@dace.program
def getstarted_sym(A: dace.float64[N, 2*N]):
    return A + A

In [9]:
getstarted_sym.to_sdfg()

Applied 1 StateFusion.


When we call this program, DaCe infers the value of `N` from any array argument whose shape matches `N x 2N`, and then compiles and invokes the function:

In [10]:
getstarted_sym(np.random.rand(100, 200))

Applied 1 StateFusion.
0. Transformation FPGATransformSDFG in getstarted_sym
1. Transformation FPGATransformState in BinOp_3
2. Transformation GPUTransformLocalStorage in _Add__map[__i0=0:N, __i1=0:2*N]
3. Transformation GPUTransformMap in _Add__map[__i0=0:N, __i1=0:2*N]
4. Transformation GPUTransformSDFG in getstarted_sym
5. Transformation MapExpansion in _Add__map: ['__i0', '__i1']
6. Transformation MapTiling in _Add__map: ['__i0', '__i1']
7. Transformation NestSDFG in getstarted_sym
8. Transformation StripMining in _Add__map: ['__i0', '__i1']
9. Transformation Vectorization in 0 -> 1 -> 2
Select the pattern to apply (0 - 9 or name$id): 
You did not select a valid option. Quitting optimization ...
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/dace/tutorials/.dacecache/getstarted_sym/build

[ 50%] Built target getstarted_sym
[100%] Built target dacestub_getstarted_sym



array([[1.7585673 , 1.4819243 , 0.81444086, ..., 0.50401245, 0.53381245,
        1.09545598],
       [0.35190908, 1.72341969, 1.94377987, ..., 1.39787272, 1.82237336,
        1.12328556],
       [1.48051736, 0.40447843, 0.03104017, ..., 0.92004259, 0.1404551 ,
        0.54104224],
       ...,
       [1.17761424, 0.80583588, 1.48981887, ..., 1.19151846, 1.76003948,
        1.43971579],
       [0.79070708, 1.5751959 , 1.72468958, ..., 0.52099313, 0.50643533,
        0.58934414],
       [0.78405278, 1.17276108, 1.27954205, ..., 1.69494209, 0.49935545,
        0.33314416]])
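
The size inference used in the call above can be pictured as solving the annotated shape `[N, 2*N]` against the shape of the actual argument. The sketch below is plain Python under that assumption; `infer_n` is a hypothetical helper, not a DaCe function, and how DaCe itself reports a mismatching shape is not shown in this tutorial.

```python
def infer_n(shape):
    """Hypothetical helper: solve shape == (N, 2*N) for N, or raise."""
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D shape, got {shape}")
    n, cols = shape
    # The second dimension must be exactly twice the first.
    if cols != 2 * n:
        raise ValueError(f"shape {shape} does not match (N, 2*N)")
    return n


print(infer_n((100, 200)))  # 100
```

A `100 x 200` array therefore gives `N = 100`, whereas a `100 x 150` array has no consistent value of `N` and is rejected by this sketch.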

## Performance

We would rather not recompile our symbolic SDFG every time it is called. Instead, we can pre-compile the graph into a shared library (an .so/.dll file):

In [11]:
csdfg = getstarted_sym.compile()

Applied 1 StateFusion.
0. Transformation FPGATransformSDFG in getstarted_sym
1. Transformation FPGATransformState in BinOp_3
2. Transformation GPUTransformLocalStorage in _Add__map[__i0=0:N, __i1=0:2*N]
3. Transformation GPUTransformMap in _Add__map[__i0=0:N, __i1=0:2*N]
4. Transformation GPUTransformSDFG in getstarted_sym
5. Transformation MapExpansion in _Add__map: ['__i0', '__i1']
6. Transformation MapTiling in _Add__map: ['__i0', '__i1']
7. Transformation NestSDFG in getstarted_sym
8. Transformation StripMining in _Add__map: ['__i0', '__i1']
9. Transformation Vectorization in 0 -> 1 -> 2
Select the pattern to apply (0 - 9 or name$id): 
You did not select a valid option. Quitting optimization ...
-- Configuring done
-- Generating done
-- Build files have been written to: /path/to/dace/tutorials/.dacecache/getstarted_sym/build

[ 50%] Built target getstarted_sym
[100%] Built target dacestub_getstarted_sym



A compiled SDFG, however, has to be invoked like an SDFG: with keyword arguments only, including explicit values for its symbols (here `N`):

In [12]:
b = csdfg(A=np.random.rand(10,20), N=np.int32(10))

We can now compare the performance of the compiled code with numpy on a large array:

In [13]:
tester = np.random.rand(2000, 4000)

In [14]:
%timeit tester + tester

81.4 ms ± 568 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [15]:
%timeit csdfg(A=tester, N=np.int32(2000))

9.92 ms ± 96.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
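
The two measurements above (81.4 ms versus 9.92 ms per loop, roughly an 8x difference on the machine that produced this output) rely on the IPython `%timeit` magic. Outside a notebook, a similar measurement can be sketched with the standard-library `timeit` module. The helper name `time_per_call` is hypothetical, and unlike `%timeit`, which reports mean ± standard deviation, this sketch reports the best observed time.

```python
import timeit


def time_per_call(fn, repeat=7, number=10):
    """Return the best observed time per call of `fn`, in seconds."""
    # Each entry of `totals` is the total time for `number` calls.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number


# Stand-in workload so the sketch runs without numpy or DaCe.
data = list(range(100_000))
elapsed = time_per_call(lambda: [x + x for x in data])
print(f"{elapsed * 1e3:.3f} ms per call")
```

With the objects from this tutorial, the same helper could be called as `time_per_call(lambda: csdfg(A=tester, N=np.int32(2000)))`.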


## Explicit Dataflow

Explicit dataflow can be specified in DaCe with the `for i in dace.map[begin:end]:` syntax, and tasklets can be written manually using `with dace.tasklet:`. Here is a real-world example (Scattering Self-Energies) with an 8-dimensional parallel computation:

In [16]:
# Declaration of symbolic variables
Nkz, NE, Nqz, Nw, N3D, NA, NB, Norb = (
    dace.symbol(name)
    for name in ['Nkz', 'NE', 'Nqz', 'Nw', 'N3D', 'NA', 'NB', 'Norb'])


@dace.program
def sse_sigma(neigh_idx: dace.int32[NA, NB],
              dH: dace.complex128[NA, NB, N3D, Norb, Norb],
              G: dace.complex128[Nkz, NE, NA, Norb, Norb],
              D: dace.complex128[Nqz, Nw, NA, NB, N3D, N3D],
              Sigma: dace.complex128[Nkz, NE, NA, Norb, Norb]):

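    # Declaration of Map scope
    for k, E, q, w, i, j, a, b in dace.map[0:Nkz, 0:NE, 0:Nqz, 0:Nw,
                                           0:N3D, 0:N3D, 0:NA, 0:NB]:
        # Product of two Norb x Norb matrices
        dHG = G[k - q, E - w, neigh_idx[a, b]] @ dH[a, b, i]
        # Norb x Norb matrix scaled by a scalar entry of D
        dHD = dH[a, b, j] * D[q, w, a, b, i, j]
        # Accumulate into the Norb x Norb block Sigma[k, E, a]
        Sigma[k, E, a] += dHG @ dHD

sse_sigma.to_sdfg()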

Applied 5 StateFusion, 1 MergeArrays, 1 InlineSDFG.
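
Note that the map in `sse_sigma` has eight parameters, while `Sigma[k, E, a]` is indexed by only three of them. Iterations that differ only in `q`, `w`, `i`, `j`, or `b` therefore add into the same output block. The plain-Python sketch below shows this accumulation pattern on a much smaller, hypothetical 2-D case; it runs sequentially and says nothing about how DaCe schedules or resolves such concurrent updates.

```python
from itertools import product


def accumulate(values):
    """Emulate `out[i] += values[i][j]` inside a map over (i, j)."""
    rows, cols = len(values), len(values[0])
    out = [0.0] * rows
    for i, j in product(range(rows), range(cols)):
        # Every j with the same i targets the same output element.
        out[i] += values[i][j]
    return out


totals = accumulate([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(totals)  # [3.0, 7.0, 11.0]
```

Each output element receives one contribution per value of the parameter that does not appear in its index, just as each `Sigma[k, E, a]` receives one contribution per combination of `q`, `w`, `i`, `j`, and `b`.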
