# Making firmware with hls4ml

This is the fun part! Now we'll convert our Keras model into highly parallel FPGA firmware with hls4ml, using only a few lines of code. This is the basic flow:

<img src="images/hls4ml_conifer.png" alt="hls4ml" width="1000" align="center"/>

hls4ml takes your TF/Keras/ONNX/PyTorch model and converts it into C++ code that a high-level synthesis (HLS) tool can read. It automatically adds pipelining stages to your hardware depending on the target FPGA and the clock period constraint (at the LHC we are usually on a 2 or 5 nanosecond clock).

After converting your model with hls4ml, you'll see that a lot of C++ code gets generated, which is then passed to Vivado HLS for compilation.

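One thing that will become important is the *reuse factor*. The `ReuseFactor` is our mechanism for tuning the parallelism: it sets how many times each multiplier is reused, so a higher value needs fewer resources at the cost of a longer latency.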

<img src="images/reuse.png" alt="reuse" width="600" align="center"/>

Since the FPGA on the Pynq board is very small, we need to use each multiplier several times, so we set a high reuse factor to make the design fit on the board.
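
To make the trade-off concrete, here is a rough sketch of how the reuse factor changes the number of multipliers a single dense layer needs. It assumes one multiplier per weight at a reuse factor of 1 and plain ceiling division. `multipliers_needed` is an illustrative helper, not an hls4ml function, and hls4ml may additionally adjust the requested reuse factor to a value it considers valid for the layer. The 57 by 32 layer size is taken from the multiplication count quoted in this section.

```python
import math


def multipliers_needed(n_in, n_out, reuse_factor):
    """Rough count of parallel multipliers for a dense layer (illustrative only)."""
    n_mult = n_in * n_out  # one multiplication per weight
    return math.ceil(n_mult / reuse_factor)


# Assumed layer size: 57 inputs, 32 outputs
for rf in (1, 4, 16, 64):
    print(f"ReuseFactor={rf:>2}: ~{multipliers_needed(57, 32, rf)} multipliers")
```

The count drops roughly in proportion to the reuse factor, which is why a large value helps on a small device.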

================================================================

Available resources on the Xilinx Zynq XC7Z020 SoC (FPGA part number `xc7z020clg400-1`) on the TUL PYNQ-Z2 development board:
```
+-----------------+---------+-------+--------+-------+-----+
|       Name      | BRAM_18K| DSP48E|   FF   |  LUT  | URAM|
+-----------------+---------+-------+--------+-------+-----+
|Available        |      280|    220|  106400|  53200|    0|
+-----------------+---------+-------+--------+-------+-----+
```
(Just considering the number of multiplications, our model would use $57 \times 32 + 32 \times 16 + 16 \times 3 + 3 \times 16 + 16 \times 32 + 32 \times 57 = 4768$ multiplications, and we only have 220 DSPs!)
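
The count above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: the layer shapes are read off the sum quoted above, and it assumes every multiplication maps to exactly one DSP, which real synthesis results need not follow.

```python
import math

# (n_in, n_out) per dense layer, read off the sum quoted above
layer_shapes = [(57, 32), (32, 16), (16, 3), (3, 16), (16, 32), (32, 57)]
available_dsps = 220  # from the resource table above

total_mults = sum(n_in * n_out for n_in, n_out in layer_shapes)

# Assumption: one DSP per multiplication, same reuse factor for every layer
min_uniform_reuse = math.ceil(total_mults / available_dsps)

print(f"Total multiplications: {total_mults}")
print(f"Smallest uniform reuse factor that could fit: {min_uniform_reuse}")
```

Under these assumptions a fully parallel design is far out of reach for this device, so some reuse is unavoidable.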


Let's make bitfiles for both the large and the compressed model, setting the reuse factor to 64:

In [None]:
import hls4ml
import util
import h5py
import numpy as np
import tensorflow as tf

from qkeras.utils import _add_supported_quantized_objects
co = {}; _add_supported_quantized_objects(co)

# Apply the rounding and saturation modes below to the outputs of Activation layers
hls4ml.model.optimizer.OutputRoundingSaturationMode.layers = ['Activation']
hls4ml.model.optimizer.OutputRoundingSaturationMode.rounding_mode = 'AP_RND'
hls4ml.model.optimizer.OutputRoundingSaturationMode.saturation_mode = 'AP_SAT'

with h5py.File('Ato4l_dataset.h5', 'r') as file:
    signal_test_data = np.array(file['Data'])

# First the baseline:
autoencoder = tf.keras.models.load_model('baseline_ae.h5')

# Then the compressed model:
q_autoencoder = tf.keras.models.load_model('compressed_ae.h5', custom_objects=co)

In [None]:
# Let's convert the baseline model into a bitfile for PYNQ
config = hls4ml.utils.config_from_keras_model(autoencoder, granularity='name')
config['Model']['Strategy'] = 'Resource'

for layer in config['LayerName'].keys():
    config['LayerName'][layer]['ReuseFactor'] = 64  # Use the same resources multiple times. This is necessary here because the FPGA is small.
hls_model = hls4ml.converters.convert_from_keras_model(autoencoder,
                                                         hls_config=config,
                                                         backend='VivadoAccelerator', #You need this backend to generate firmware for Zynq
                                                         output_dir='/mnt/data/thaarres/baseline_ae_pynq',
                                                         board='pynq-z2') # This is our FPGA!
                                                   
hls_model.compile()

y_hls4ml = hls_model.predict(np.ascontiguousarray(signal_test_data))
hls_model.build(csim=False, synth=True, export=True)
hls4ml.templates.VivadoAcceleratorBackend.make_bitfile(hls_model)

# Package the model and some test data to be moved over to the Pynq
util.package(hls_model, signal_test_data, y_hls4ml)
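
The per-layer loop used above can be wrapped in a small helper so the same setting is applied consistently to both models. This is a sketch on a plain dictionary: `set_reuse_factor` and the layer names in `example_config` are hypothetical, and the dictionary only mimics the `'LayerName'` structure that the loop above relies on.

```python
def set_reuse_factor(config, reuse_factor):
    """Set the same ReuseFactor on every layer of an hls4ml-style config dict."""
    for layer_cfg in config['LayerName'].values():
        layer_cfg['ReuseFactor'] = reuse_factor
    return config


# Hypothetical stand-in for the dict returned by config_from_keras_model
example_config = {
    'Model': {'Strategy': 'Resource'},
    'LayerName': {'dense_1': {}, 'dense_2': {'ReuseFactor': 1}},
}

set_reuse_factor(example_config, 64)
print(example_config['LayerName'])
```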



In [None]:
# Let's convert the QKeras model into a bitfile for PYNQ
config = hls4ml.utils.config_from_keras_model(q_autoencoder, granularity='name')
config['Model']['Strategy'] = 'Resource'
for layer in config['LayerName'].keys():
    config['LayerName'][layer]['ReuseFactor'] = 64
q_hls_model = hls4ml.converters.convert_from_keras_model(q_autoencoder,
                                                         hls_config=config,
                                                         backend='VivadoAccelerator',
                                                         output_dir='/mnt/data/thaarres/qkeras_ae_pynq',
                                                         board='pynq-z2')                                                
q_hls_model.compile()

y_q_hls4ml = q_hls_model.predict(np.ascontiguousarray(signal_test_data))
q_hls_model.build(csim=False, synth=True, export=True)
hls4ml.templates.VivadoAcceleratorBackend.make_bitfile(q_hls_model)
util.package(q_hls_model, signal_test_data, y_q_hls4ml)
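
Before moving anything to the board, it can be worth checking that the bit-accurate emulation from `predict` stays close to the Keras output. The sketch below uses synthetic arrays so that it runs on its own; `mean_abs_difference` is an illustrative helper, and in the notebook the inputs would be, for example, the Keras predictions on `signal_test_data` and `y_q_hls4ml`.

```python
import numpy as np


def mean_abs_difference(y_ref, y_hls):
    """Mean absolute difference between two prediction arrays of equal shape."""
    y_ref = np.asarray(y_ref, dtype=float)
    y_hls = np.asarray(y_hls, dtype=float)
    if y_ref.shape != y_hls.shape:
        raise ValueError(f"shape mismatch: {y_ref.shape} vs {y_hls.shape}")
    return float(np.mean(np.abs(y_ref - y_hls)))


# Synthetic stand-ins for the Keras and hls4ml outputs (assumed shape: 100 x 57)
rng = np.random.default_rng(0)
y_keras_demo = rng.normal(size=(100, 57))
y_hls_demo = y_keras_demo + rng.normal(scale=0.01, size=(100, 57))

print(mean_abs_difference(y_keras_demo, y_hls_demo))
```

A difference that is large compared to the spread of the outputs would suggest that the chosen fixed-point precision is too narrow.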

We made it and have our bitfile! From the synthesis reports we can check how many resources each network consumes, as well as the total latency.
We can get the latency from the C synthesis report in `qkeras_ae_pynq/myproject_prj/solution1/syn/report/myproject_csynth.rpt` and the resource consumption from `qkeras_ae_pynq/util.rpt`.

### Baseline model

```
+ Latency (from myproject_csynth.rpt) : 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+----------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline |
    |   min   |   max   |    min   |    max   | min | max |   Type   |
    +---------+---------+----------+----------+-----+-----+----------+
    |      525|      531| 3.702 us | 3.744 us |   64|   64| dataflow |
    +---------+---------+----------+----------+-----+-----+----------+

+ Utilization (from util.rpt) : 
    * Summary:    
+---------------+---------------+-----------+-------------+---------------+------------+----------+------------
|   Total LUTs  |   Logic LUTs  |  LUTRAMs  |     SRLs    |      FFs      |   RAMB36   |  RAMB18  |  DSP Blocks
+---------------+---------------+-----------+-------------+---------------+------------+----------+------------
| 38533(72.43%) | 37501(70.49%) | 22(0.13%) | 1010(5.80%) | 50970(47.90%) | 32(22.86%) | 5(1.79%) | 208(94.55%)
    
 ```

### Quantized and pruned model:
```
+ Latency (from myproject_csynth.rpt) : 
    * Summary: 
    +---------+---------+----------+----------+-----+-----+----------+
    |  Latency (cycles) |  Latency (absolute) |  Interval | Pipeline |
    |   min   |   max   |    min   |    max   | min | max |   Type   |
    +---------+---------+----------+----------+-----+-----+----------+
    |      507|      513| 3.575 us | 3.617 us |   64|   64| dataflow |
    +---------+---------+----------+----------+-----+-----+----------+
 
 + Utilization (from util.rpt) : 
    * Summary: 
+---------------+---------------+-----------+------------+---------------+------------+----------+-------------
|   Total LUTs  |   Logic LUTs  |  LUTRAMs  |    SRLs    |      FFs      |   RAMB36   |  RAMB18  |  DSP Blocks 
+---------------+---------------+-----------+------------+---------------+------------+----------+-------------
| 36309(68.25%) | 36053(67.77%) | 22(0.13%) | 234(1.34%) | 45152(42.44%) | 28(20.00%) | 3(1.07%) | 112(50.91%) 

```

So we see that, despite having nearly the same latency, the resource consumption of the compressed model is noticeably smaller: the DSP usage drops from 208 to 112, roughly half. That in turn means that we don't need to reuse the same multipliers as many times as we specified by setting the reuse factor to 64. So quantization can improve the latency by allowing us to use a *lower reuse factor*.
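
The savings can be quantified directly from the two utilization summaries above. The numbers in this sketch are copied from those reports; only the percentage calculation is added.

```python
# Numbers copied from the two utilization summaries above
baseline = {'Total LUTs': 38533, 'FFs': 50970, 'DSP Blocks': 208}
compressed = {'Total LUTs': 36309, 'FFs': 45152, 'DSP Blocks': 112}

# Relative reduction in percent for each resource type
savings = {
    name: 100.0 * (baseline[name] - compressed[name]) / baseline[name]
    for name in baseline
}

for name, pct in savings.items():
    print(f"{name}: {baseline[name]} -> {compressed[name]} ({pct:.1f}% fewer)")
```

The DSP blocks show by far the largest reduction, which is the resource that the reuse factor is mainly protecting.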

## Compare to Level-1 trigger FPGA

Deployment on the Pynq is of course a toy study and the FPGAs we have in the Level-1 trigger are much much bigger (and much much more expensive). For fun, let's see what the latency would be with a fully parallel implementation (reuse factor of 1) on a Xilinx VU9P FPGA.

Available resources on the Xilinx Virtex UltraScale+ VU9P FPGA:
```
+---------------------+---------+-------+---------+---------+-----+
|         Name        | BRAM_18K| DSP48E|    FF   |   LUT   | URAM|
+---------------------+---------+-------+---------+---------+-----+
|Available            |     4320|   6840|  2364480|  1182240|  960|
+---------------------+---------+-------+---------+---------+-----+
```
Thirty times more DSPs available than on the Pynq!
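
As a quick sanity check of why a reuse factor of 1 becomes plausible here, the sketch below compares the DSP counts of the two devices with the multiplication count quoted earlier for this model. It assumes one DSP per multiplication, which is only a rough guide to what synthesis will actually produce.

```python
pynq_dsps = 220     # XC7Z020, from the first resource table
vu9p_dsps = 6840    # VU9P, from the table above
total_mults = 4768  # multiplication count quoted earlier for this model

print(f"DSP ratio VU9P / Pynq: {vu9p_dsps / pynq_dsps:.1f}")
# Assumption: one DSP per multiplication at reuse factor 1
print(f"Fits at reuse factor 1 on Pynq: {total_mults <= pynq_dsps}")
print(f"Fits at reuse factor 1 on VU9P: {total_mults <= vu9p_dsps}")
```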

In [None]:
# Baseline model
config = hls4ml.utils.config_from_keras_model(autoencoder, granularity='name')
config['Model']['Strategy'] = 'Latency'

for layer in config['LayerName'].keys():
    config['LayerName'][layer]['ReuseFactor'] = 1  # Fully parallel: no multiplier is reused. The much larger FPGA makes this possible.
hls_model = hls4ml.converters.convert_from_keras_model(autoencoder,
                                                         hls_config=config,
                                                         output_dir='/mnt/data/thaarres/baseline_ae_vu9p',
                                                         part='xcvu9p-flgb2104-2l-e') # L1T FPGA!
                                                   
hls_model.compile()
hls_model.build(csim=False, synth=True, vsynth=True)

# Compressed model
config = hls4ml.utils.config_from_keras_model(q_autoencoder, granularity='name')
config['Model']['Strategy'] = 'Latency'

for layer in config['LayerName'].keys():
    config['LayerName'][layer]['ReuseFactor'] = 1  # Fully parallel: no multiplier is reused. The much larger FPGA makes this possible.
q_hls_model = hls4ml.converters.convert_from_keras_model(q_autoencoder,
                                                         hls_config=config,
                                                         output_dir='/mnt/data/thaarres/qkeras_ae_vu9p',
                                                         part='xcvu9p-flgb2104-2l-e') # L1T FPGA!

q_hls_model.compile()
q_hls_model.build(csim=False, synth=True, vsynth=True)


The latency can still be found in `qkeras_ae_vu9p/myproject_prj/solution1/syn/report/myproject_csynth.rpt`, but the resources are now listed in `qkeras_ae_vu9p/vivado_synth.rpt`:

### Baseline model
```
+ Latency (from myproject_csynth.rpt) : 
    * Summary: 
    +---------+---------+-----------+-----------+-----+-----+----------+
    |  Latency (cycles) |   Latency (absolute)  |  Interval | Pipeline |
    |   min   |   max   |    min    |    max    | min | max |   Type   |
    +---------+---------+-----------+-----------+-----+-----+----------+
    |       17|       17| 85.000 ns | 85.000 ns |    1|    1| function |
    +---------+---------+-----------+-----------+-----+-----+----------+

+ Utilization (from vivado_synth.rpt) : 
    * Summary:  
    
 ```

### Quantized and pruned model:
```
+ Latency (from myproject_csynth.rpt) : 
    * Summary: 
    +---------+---------+-----------+-----------+-----+-----+----------+
    |  Latency (cycles) |   Latency (absolute)  |  Interval | Pipeline |
    |   min   |   max   |    min    |    max    | min | max |   Type   |
    +---------+---------+-----------+-----------+-----+-----+----------+
    |       14|       14| 70.000 ns | 70.000 ns |    1|    1| function |
    +---------+---------+-----------+-----------+-----+-----+----------+
    
+ Utilization (from vivado_synth.rpt) : 
    * Summary:      

```

So we can see that with a completely parallel implementation, the algorithm runs significantly faster (70 to 85 ns instead of roughly 3.6 to 3.7 us) and would be within the L1 requirements to run in the Global trigger! And since no one would let you deploy an algorithm that uses almost the full board, quantization and pruning are KEY tools for edge ML!
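
The absolute latencies in the reports are just the cycle counts multiplied by the clock period. The sketch below reproduces the VU9P numbers and compares them with the Pynq baseline. The 5 ns period is inferred from the reports above (17 cycles correspond to 85 ns), and `latency_ns` is an illustrative helper.

```python
def latency_ns(cycles, clock_period_ns):
    """Latency in nanoseconds for a given cycle count and clock period."""
    return cycles * clock_period_ns


clock_ns = 5.0  # inferred from the VU9P reports above (17 cycles = 85 ns)

vu9p_baseline = latency_ns(17, clock_ns)
vu9p_compressed = latency_ns(14, clock_ns)
pynq_baseline_ns = 3744.0  # max latency of the baseline model on the Pynq

print(f"VU9P baseline:   {vu9p_baseline:.0f} ns")
print(f"VU9P compressed: {vu9p_compressed:.0f} ns")
print(f"Speed-up over the Pynq baseline: ~{pynq_baseline_ns / vu9p_baseline:.0f}x")
```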