# FPGA programming with DaCe

In this tutorial, we will see how a developer can write code using the Python DaCe frontend and generate efficient code for FPGAs.
We will discuss:
- how to parse, transform and optimize the code for FPGA devices with maximal control (for experienced FPGA users)
- how to get this done automatically with the DaCe auto-optimization heuristics (for less experienced users, or to quickly get a working example).

Let's start with `ATAX`, a matrix-transpose and vector multiplication kernel included in the Polybench suite, which computes $y = A^T Ax$.

Following the [Numpy API](https://nbviewer.jupyter.org/github/spcl/dace/blob/master/tutorials/numpy_frontend.ipynb) tutorial, we start by writing the DaCe program as a regular Python function, decorated with `dace.program` and given explicit type annotations.

In [None]:
import dace
import numpy as np

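# Matrix dimensions: A is M x N, x has N entries
M, N = 24, 24

@dace.program
def atax(A: dace.float32[M, N], x: dace.float32[N]):
    # (A @ x) @ A evaluates A^T (A x) without forming an explicit transpose
    return (A @ x) @ A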
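
The expression `(A @ x) @ A` may not look like $A^T Ax$ at first sight. For a one-dimensional vector `t`, however, `t @ A` yields the same entries as $A^T t$. The following dependency-free sketch checks this on a small integer example; the helper names `matvec`, `vecmat` and `transpose` are illustrative only and are not part of DaCe.

```python
def matvec(A, x):
    """Compute A x for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def vecmat(t, A):
    """Compute t A, which has the same entries as A^T t."""
    n_rows, n_cols = len(A), len(A[0])
    return [sum(t[i] * A[i][j] for i in range(n_rows)) for j in range(n_cols)]

def transpose(A):
    """Return A^T as a new list of rows."""
    return [list(col) for col in zip(*A)]

A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]

y_program = vecmat(matvec(A, x), A)             # mirrors (A @ x) @ A
y_formula = matvec(transpose(A), matvec(A, x))  # A^T (A x)
assert y_program == y_formula
```

Both forms consist of two matrix-vector products, one of which uses the transposed matrix.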

Then we can parse the program to build its SDFG and have a look at it:

In [None]:
sdfg = atax.to_sdfg()
sdfg

At this point, we need to transform it for FPGA execution: 

In [None]:
from dace.transformation.interstate import FPGATransformSDFG
sdfg.apply_transformations(FPGATransformSDFG)
sdfg

This transformation creates additional pre- and post-states that perform memory transfers between host and device.
The actual computation is now scheduled to be executed on the FPGA as an FPGA kernel, and memories accessed by the transformed subgraph are replaced with their FPGA equivalents.

Notice that the current SDFG contains two library nodes: high-level nodes that, in this case, represent two (generic) matrix multiplications. During compilation and optimization, library nodes are expanded by replacing them with a subgraph, lowering them towards a concrete implementation of their behavior. For FPGA, it is convenient to do this explicitly.

First of all, we specialize the two generic matrix multiplications. In this case, they are in fact two matrix-vector multiplications (one of them transposed):

In [None]:
sdfg.expand_library_nodes(recursive=False)
sdfg

For all matrix-vector multiplications (`gemv` and `gemvt`) we can use the `FPGA_Accumulate` expansion. This FPGA-oriented expansion iterates over the input matrix in simple row-major order (with optional tiling). The user can also specify a different expansion for each library node. Please refer to the documentation to see [all available FPGA expansions](https://spcldace.readthedocs.io/en/latest/optimization/fpga.html#available-fpga-expansions). We now choose the expansion and apply it:

In [None]:
from dace.libraries.blas import Gemv
Gemv.default_implementation = "FPGA_Accumulate"
sdfg.expand_library_nodes()
sdfg
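
To build some intuition for a row-major traversal with optional tiling, here is a minimal pure-Python sketch of a matrix-vector product that reads the matrix row by row and accumulates each output entry. It only illustrates the access order; it is not the code DaCe generates. The function name `gemv_row_major` and the column-only tiling scheme are assumptions made for illustration.

```python
def gemv_row_major(A, x, tile=None):
    """Compute y = A x, reading A row by row (row-major order).

    Hypothetical illustration: `tile` splits each row into column
    chunks that are visited left to right. It changes only how the
    loop is blocked, not the result.
    """
    n_cols = len(x)
    tile = tile or n_cols  # no tiling: a single tile spans the whole row
    y = []
    for row in A:
        acc = 0  # accumulator for one output entry
        for start in range(0, n_cols, tile):
            for j in range(start, min(start + tile, n_cols)):
                acc += row[j] * x[j]
        y.append(acc)  # each entry of y is written exactly once
    return y

A = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]

y_untiled = gemv_row_major(A, x)
y_tiled = gemv_row_major(A, x, tile=2)
assert y_untiled == y_tiled
```

Because every element of the matrix is read once, in the order in which it is laid out, this kind of traversal corresponds to a simple streaming access pattern.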