# FPGA programming with DaCe

In this tutorial, we will see how a developer can write code using the Python DaCe frontend and generate efficient code for FPGAs.
We will discuss:
- how to parse, transform, and optimize the code for FPGA devices with maximal control (for experienced FPGA users)
- how to get this done automatically by the DaCe auto-optimization heuristics (for non-experienced users, or to quickly get a working example).

Let's start with `ATAX`, a matrix-transpose and vector multiplication kernel included in the Polybench suite, which computes $y = A^T Ax$.

Following the [Numpy API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb) tutorial, we start by writing the DaCe program as a regular Python function, decorated with `dace.program` and with explicit type annotations.

In [None]:
import dace

In [None]:
M, N = 24, 24

@dace.program
def atax(A: dace.float32[M, N], x: dace.float32[N]):
    return (A @ x) @ A
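
As a sanity check on the formulation, the plain-Python sketch below shows on a small example that `(A @ x) @ A` yields the same vector as $A^T (A x)$. It does not use DaCe, and the helper names `matvec`, `vecmat` and `transpose` are introduced here only for illustration.

```python
def matvec(A, x):
    """Matrix-vector product A x, with A stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def vecmat(v, A):
    """Vector-matrix product v A, which equals A^T v."""
    return [sum(v[i] * A[i][j] for i in range(len(A))) for j in range(len(A[0]))]

def transpose(A):
    """Return A^T as a new list of rows."""
    return [list(col) for col in zip(*A)]

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]   # M = 2 rows, N = 3 columns
x = [1.0, 0.0, -1.0]    # length N

y_program = vecmat(matvec(A, x), A)               # mirrors (A @ x) @ A
y_textbook = matvec(transpose(A), matvec(A, x))   # A^T (A x)
```

Both expressions produce a vector of length `N`, which matches the shapes declared in the type annotations of `atax`.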

## Vanilla execution

Let's start by compiling the program for FPGA without applying any particular optimization. First of all we can parse the program to build its SDFG and have a look at it:

In [None]:
sdfg = atax.to_sdfg()
sdfg

At this point, we need to transform it for FPGA execution: 

In [None]:
from dace.transformation.interstate import FPGATransformSDFG
sdfg.apply_transformations(FPGATransformSDFG)
sdfg

This transformation takes care of creating additional pre- and post-states that perform memory transfers between host and device.
The actual computation is now scheduled to be executed on the FPGA as an FPGA kernel, and memories accessed by the transformed subgraph are replaced with their FPGA equivalents.

Now we can compile and run it (commented out here for the sake of executing in the Jupyter notebook).

## Manually optimize for FPGA execution

Notice that the current SDFG contains two library nodes, which represent two (generic) matrix multiplications. Library nodes are high-level nodes: during compilation and optimization, they are expanded by replacing them with a subgraph, lowering them towards a concrete implementation of their behavior. For FPGA, it is convenient to do this explicitly.

First of all, we specialize the two generic matrix multiplications. In this case they are actually two matrix-vector multiplications (one of them transposed).

In [None]:
sdfg = atax.to_sdfg()
sdfg.apply_transformations(FPGATransformSDFG)
sdfg.expand_library_nodes(recursive=False)
sdfg

For all matrix-vector multiplications (`gemv` and `gemvt`) we can use the `FPGA_Accumulate` expansion. This FPGA-oriented expansion iterates over the input matrix in simple row-major order (with optional tiling). The user can also specify a different expansion for each library node; please refer to the documentation to see [all available FPGA expansions](https://spcldace.readthedocs.io/en/latest/optimization/fpga.html#available-fpga-expansions). We now choose the expansion and apply it. Since this implementation makes use of BRAMs to store intermediate results, whose size must be known at compile time, we also need to "specialize" the size of our input data.

In [None]:
from dace.libraries.blas import Gemv
Gemv.default_implementation = "FPGA_Accumulate"
sdfg.expand_library_nodes()
sdfg.specialize(dict(M=M, N=N))
sdfg
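
To give an intuition for a row-major traversal that accumulates into a fixed-size buffer, here is a minimal plain-Python sketch for the transposed product $y = A^T v$. It is only a conceptual model under our own assumptions, not the code produced by the `FPGA_Accumulate` expansion: the function name `gemv_transposed_row_major` and the `n_cols` parameter are hypothetical, the list `partial` merely stands in for on-chip memory, and tiling is omitted.

```python
def gemv_transposed_row_major(A, v, n_cols):
    """Compute y = A^T v while reading A strictly row by row.

    `n_cols` plays the role of the size known ahead of time: the buffer of
    partial results is allocated before any data is read.
    """
    partial = [0.0] * n_cols              # fixed-size buffer (stand-in for BRAM)
    for i, row in enumerate(A):           # rows are visited in order ...
        if len(row) != n_cols:
            raise ValueError("row length does not match n_cols")
        for j, a_ij in enumerate(row):    # ... and so are the elements of each row
            partial[j] += a_ij * v[i]     # accumulate the partial result for y[j]
    return partial                        # y is written out once, at the end

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
v = [-2.0, -2.0]
y = gemv_transposed_row_major(A, v, n_cols=3)
```

Because the buffer has to exist before the first element arrives, its length cannot remain symbolic, which is the reason for the `specialize` call in the cell above.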

In the resulting SDFG, we can notice how the two `gemv` nodes have been replaced by the corresponding implementations.
Note that in this computation the memory access patterns (to the inputs `A` and `x` and to the output `return`) are known a priori. We can therefore decouple them from the computation by creating streaming memory accessors, for the benefit of a simplified circuit implementation. DaCe offers the `StreamingMemory` transformation, which does this automatically.

In [None]:
from dace.transformation.dataflow import StreamingMemory
from dace.transformation.interstate import InlineSDFG
sdfg.apply_transformations_repeated([InlineSDFG, StreamingMemory],
                                    [{}, {'storage': dace.StorageType.FPGA_Local}],
                                    print_report=True)
sdfg

As we can notice from the SDFG, the transformation was applied 3 times: for the reads from `A` (transposed and non-transposed), for the reads from `x`, and for the writes of the final result to memory. While applying the transformation, we also inlined ("flattened") the SDFG so that data access patterns can be fully analyzed, and we specified that the resulting streams must be stored in FPGA local memory (BRAM).

In more complicated use cases, this transformation can also be useful to exploit burst mode in the memory controller (see the [transformation documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingMemory)), or to broadcast off-chip memory to multiple processing elements.
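
To make the idea of decoupling memory accesses from computation more concrete, the following plain-Python sketch models the readers as generators and the computation as a consumer of streams. This is only an analogy of what `StreamingMemory` achieves, not DaCe-generated code; the function names `read_row_major`, `read_repeated` and `matvec_from_streams` are made up for this illustration.

```python
def read_row_major(A):
    """Memory reader: walks A in a fixed, known order, one element at a time."""
    for row in A:
        for value in row:
            yield value

def read_repeated(x, times):
    """Memory reader for x: the vector is re-read once per matrix row."""
    for _ in range(times):
        yield from x

def matvec_from_streams(a_stream, x_stream, rows, cols):
    """Compute kernel: consumes two streams and never computes a memory address."""
    for _ in range(rows):
        acc = 0.0
        for _ in range(cols):
            acc += next(a_stream) * next(x_stream)
        yield acc  # results leave the kernel through a stream as well

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
x = [1.0, 0.0, -1.0]
result = list(matvec_from_streams(read_row_major(A), read_repeated(x, len(A)),
                                  rows=2, cols=3))
```

The kernel only pops values in the order in which they arrive, so all addressing logic lives in the readers. This separation is only possible because the access order is known before the computation starts.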

It may happen that subsequent computations share data through off-chip memory. If the memory access patterns are analyzable, we can avoid this undesirable situation by using the `StreamingComposition` transformation. Similar to `StreamingMemory`, this transformation analyzes data access patterns and, when applicable, converts two connected computations into two separate processing elements with a stream connecting them, removing the need for off-chip accesses and enabling the concurrent execution of the two components. This transformation does not apply in the considered use case, but the interested reader can refer to the related [documentation](https://spcldace.readthedocs.io/en/latest/source/dace.transformation.dataflow.html#dace.transformation.dataflow.streaming_memory.StreamingComposition).
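
As a rough software analogy of two processing elements connected by a stream, the sketch below runs two stages concurrently and links them with a small bounded FIFO instead of a full intermediate array. It is a generic example (scale, then offset) and not the ATAX program, it involves no DaCe API, and the names `scale` and `offset` as well as the FIFO depth are arbitrary choices made for illustration.

```python
import queue
import threading

def scale(values, stream):
    """First processing element: pushes each result as soon as it is computed."""
    for value in values:
        stream.put(2.0 * value)

def offset(count, stream, out):
    """Second processing element: pops results in the order they were produced."""
    for i in range(count):
        out[i] = stream.get() + 1.0

a = [1.0, 2.0, 3.0, 4.0]
c = [0.0] * len(a)
stream = queue.Queue(maxsize=2)   # small FIFO replacing a full intermediate array

stage1 = threading.Thread(target=scale, args=(a, stream))
stage2 = threading.Thread(target=offset, args=(len(a), stream, c))
stage1.start()
stage2.start()
stage1.join()
stage2.join()
```

This works because the second stage consumes the intermediate values in exactly the order in which the first stage produces them, which is the kind of property the access pattern analysis has to establish.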

Finally, since in this case we have multiple memory buffers being accessed concurrently, we can distribute them across different memory banks (if the target device supports more than one memory bank).

In [None]:
from dace.transformation.auto.auto_optimize import fpga_auto_opt
fpga_auto_opt.fpga_rr_interleave_containers_to_banks(sdfg, num_banks=4, memory_type="DDR")

The `fpga_auto_opt` module contains FPGA-specific optimizations. Another example of an automatic optimization that can be applied is `fpga_global_to_local`, which changes the storage of containers allocated in global memory to local memory when this is possible.
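
For intuition about distributing buffers over banks, the sketch below assigns containers to banks in round-robin order. We assume here, as the `rr` in `fpga_rr_interleave_containers_to_banks` suggests, that the policy is round-robin; the helper `round_robin_banks` and the container names are hypothetical and only illustrate the policy, not the DaCe implementation.

```python
def round_robin_banks(container_names, num_banks):
    """Assign the i-th container to bank i % num_banks and return the mapping."""
    if num_banks < 1:
        raise ValueError("num_banks must be at least 1")
    return {name: index % num_banks for index, name in enumerate(container_names)}

# Hypothetical container names, chosen for illustration only
banks = round_robin_banks(["fpga_A", "fpga_x", "fpga_tmp", "fpga_y", "fpga_z"],
                          num_banks=4)
```

With four banks, the fifth container wraps around to bank 0, so buffers that are accessed concurrently are spread as evenly as the number of banks allows.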

Finally, we can execute the program (here commented out).

## Auto-Optimization

While the discussion above enables an experienced programmer to tune the FPGA execution of their program, in many cases a good level of optimization can be achieved automatically by applying the DaCe auto-optimization heuristics. When targeting FPGA devices, the auto-optimizer applies a set of simplifications to the SDFG and then applies the transformations discussed above, with the exception of `StreamingMemory` (and `StreamingComposition`, when applicable). Note that these automatically applied transformations cannot currently be tuned by the user. Let's start again from the SDFG parsing:

In [None]:
from dace.transformation.auto.auto_optimize import auto_optimize
sdfg = atax.to_sdfg()
sdfg = auto_optimize(sdfg, dace.dtypes.DeviceType.FPGA)
sdfg.expand_library_nodes()
sdfg.specialize(dict(M=M, N=N))
sdfg

Note that there is no need to explicitly expand library nodes; here we did so only to show the resulting SDFG. The program can then be executed as before.

## Hardware Execution

By default, DaCe is configured to execute FPGA programs in software emulation mode. This behavior can be changed through the DaCe configuration settings, by setting the compilation mode either programmatically or via an environment variable. The available methods are described in the [Configuring DaCe documentation](https://spcldace.readthedocs.io/en/latest/setup/config.html) and in the compilation configuration schema for [Xilinx](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.xilinx.mode) and [Intel](https://spcldace.readthedocs.io/en/latest/source/config_schema.html#envvar-compiler.xilinx.mode) FPGAs.

For example, to specify hardware execution via an environment variable, the user can execute their DaCe program as follows:

In [None]:
$ DACE_compiler_xilinx_mode=hardware python path_to_my_dace_program.py

This will trigger the hardware compilation flow, which generates the bitstream and executes the program on an FPGA-equipped machine. Note that if the bitstream was not previously compiled (or if the DaCe program has changed), synthesis may require several hours, depending on the complexity of the generated FPGA program and on the capabilities of the machine.
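
If the program is launched from a Python script rather than from a shell, the same environment variable can be passed to the child process. The sketch below only builds the environment; the helper name `hardware_mode_env` is our own, and the launch itself is left commented out because it would start the hardware compilation flow.

```python
import os

def hardware_mode_env(base_env=None):
    """Return a copy of the environment with Xilinx hardware mode selected."""
    env = dict(os.environ if base_env is None else base_env)
    env["DACE_compiler_xilinx_mode"] = "hardware"
    return env

# Small demonstration on an explicit base environment
env = hardware_mode_env({"PATH": "/usr/bin"})

# To launch the program with this setting (not executed here):
# import subprocess, sys
# subprocess.run([sys.executable, "path_to_my_dace_program.py"],
#                env=hardware_mode_env(), check=True)
```

Copying the environment instead of modifying `os.environ` in place keeps the setting local to the launched program.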