# Guide to Thresholding, Subtraction, and Averaging in CustomLogic 

[CustomLogic User guide](https://documentation.euresys.com/Products/COAXLINK/COAXLINK_16_0/en-us/Content/11_Pdf/D209ET-Coaxlink_CustomLogic_User_Guide-eGrabber-16.0.2.2128.pdf)

Written by: Ryan Forelli

Last Modified: 1/4/2023

## 1. Introduction

The CustomLogic toolkit features a Vivado reference design that allows users to implement custom image processing and manipulation on the onboard XCKU035 FPGA. CustomLogic supports the 3602 Coaxlink Octo and the 3603 Coaxlink Quad CXP-12. This repository provides reference designs for the 3602 Coaxlink Octo. The CustomLogic package and these reference designs have the following software and hardware requirements. A few presets in eGrabber Studio are also required.

### 1.a. Software Requirements
- Vivado 2018.3
- eGrabber 22.1 or higher (although any version in the past few years should work)

### 1.b. Hardware Requirements
- **Single** 3602 Coaxlink **Octo** (we can switch to two if these examples work)
- Phantom S710 camera

### 1.c. eGrabber Studio Presets
- Set pixel format to Mono 12-bit
- Set unpacking mode (under "Data Stream" menu) to "Lsb"
- Set camera resolution to 260x320

---------------------------

## 2. Intro to High-level Synthesis

The designs included in this repository rely heavily on high-level synthesis (HLS). Essentially, HLS is C++ for FPGAs, which greatly streamlines the implementation of more complex algorithms. The downside is that we cannot optimize our designs to the same degree we could with a hardware description language (HDL). The C++ code we write, together with strategically placed pragmas, goes through a C-synthesis process that translates our design into a register-transfer level (RTL) description that is not very human-readable. Some resources to get started with HLS are included below. While many functions and constructs that are valid in standard C++ compilers are also valid in HLS, many are not (e.g. malloc, printf, etc.). However, the behavior of HLS code can be "simulated" by compiling it with g++ as usual, in which case these functions can be used. It is important to remember that we are writing hardware in HLS, not software. The processing architecture is not fixed on an FPGA, so many of the standard practices for optimizing a program for a CPU do not apply or take different forms in HLS. These guides provide good explanations.

- https://docs.xilinx.com/v/u/en-US/ug998-vivado-intro-fpga-design-hls
- https://indico.cern.ch/event/857790/attachments/1929374/3199760/HLS_Tutorial.pdf

---------------------------

## 3. Repository Structure


The repository contains four reference designs.
- **01_fw_sample_thresholding**: This design performs pixel thresholding. Currently, any pixel less than or equal to 2047 is set to 0. Any pixel greater than 2047 remains unchanged.
- **02_fw_sample_subtraction**: This design subtracts every other image. Given frames: f0, f1, f2, f3,..., f0 and f2 will be transferred to the host like normal, f1 will contain the subtraction result f1-f0 and f3 will contain f3-f2. Currently, saturation logic is enabled to prevent overflows. For example, 400 − 3010 = 0. All arithmetic is unsigned 12-bit.
- **03_fw_sample_average_9frames**: This design averages every 9 frames, storing the result in the 10th frame. For example, n9 = sum(n0, n1,... n8) / 9. The accumulator in this design is sized at 16 bits to cover the possible range of unsigned 12-bit additions of ten numbers.
- **03_fw_sample_average_499frames**: Same as above, except this design averages every 499 frames, storing the result in the 500th frame. For example, n499 = sum(n0, n1,... n498) / 499. The accumulator in this design is sized at **21** bits to cover the possible range of unsigned 12-bit additions of five hundred numbers. This bit width MUST be increased if the length of the addition or the pixel format changes.

---------------------------

## 4. Getting started

### 4.a. Thresholding

Let's start with **01_fw_sample_thresholding**. Looking at the project structure, we have:

- **01_readme**: CustomLogic readme
- **02_coaxlink**: Encrypted CoaXPress IP and other proprietary firmware. We will not touch this.
- **03_scripts**: Vivado project build scripts. We will be running scripts in here but no modification should be necessary.
- **04_ref_design**: CustomLogic top-level and sub-module HDL designs. Since we are using HLS, we shouldn't need to touch this.
- **05_model_design_hls**: Contains all HLS for this project. We will be working here.
- **06_release**: The final bitstream for the FPGA will appear in this folder. Note that I have included precompiled bitstreams for all projects.  

Looking at **05_model_design_hls**, we will see a few more folders.
- srcs: Our HLS source C++ files.
- scripts: Contains our C-synthesis script. We should not have to touch this unless we want to disable C-simulation.
- tb_data: Contains testbench data for c-simulation.

Only a few of the HLS files are important to us at the moment. myproject_axi.cpp is the top-level file. We should never have to change anything in there unless we have to communicate with the DRAM interface. The important file is **myproject.cpp**. 

In **myproject.cpp**, scroll to the function myproject(). First we'll see the input data stream and output data stream provided by CustomLogic (see 04_ref_design/CustomLogic.vhd for top-level interfaces). These data streams transfer C structs of type ``video_if_t``. We can see what data is stored in these structs by looking in CustomLogic.h.

![image.png](attachment:43a0657c-ce43-4264-a06d-f079398cbdd7.png)

We can see that the image data is stored in ``Data``. The type of this member is ``DataMono12``, which is really just a 128-bit wide integer. This is because the data stream on the Euresys Octo is 128 bits wide, meaning we get access to 128 bits of data every clock cycle. We are recording 12-bit pixels, each stored in 2 bytes, so every cycle we get 128 / 16 = 8 pixels. This is extremely important to remember. Also note that this file defines a macro ``MONO16PIX``, which allows us to "index" into this 128-bit wide bit string to retrieve individual pixels. The ``User`` field contains side-channel info defined by Euresys. The four bits of this field tell us where in the image each packet comes from, although this is not relevant for our task.
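
To make the packing concrete, here is a small Python model of how eight 16-bit pixel slots fit into one 128-bit stream word. It is only a sketch: the helper names are hypothetical, and the layout (pixel 0 in the lowest 16 bits) is an assumption, so check the actual ``MONO16PIX`` definition in CustomLogic.h before relying on it.

```python
PIXEL_BITS = 16        # each Mono12 pixel occupies a 16-bit slot
WORD_BITS = 128        # width of one stream word
PIXELS_PER_WORD = WORD_BITS // PIXEL_BITS  # 128 / 16 = 8


def get_pixel(word: int, i: int) -> int:
    """Return pixel i of a 128-bit word (assumed: pixel 0 sits in the low 16 bits)."""
    return (word >> (PIXEL_BITS * i)) & 0xFFFF


def set_pixel(word: int, i: int, value: int) -> int:
    """Return a copy of word with pixel slot i replaced by value."""
    shift = PIXEL_BITS * i
    return (word & ~(0xFFFF << shift)) | ((value & 0xFFFF) << shift)


def pack_word(pixels):
    """Pack up to eight pixel values into one 128-bit integer."""
    word = 0
    for i, p in enumerate(pixels):
        word = set_pixel(word, i, p)
    return word


def unpack_word(word: int):
    """Split a 128-bit integer back into its eight pixel values."""
    return [get_pixel(word, i) for i in range(PIXELS_PER_WORD)]
```

For example, ``unpack_word(pack_word([0, 1, 2, 3, 4, 5, 6, 4095]))`` returns the original eight values.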

Refocusing on myproject(), we will see several function calls and a variable that defines the packed depth of the data stream (which is just image_resolution / 8_pixels_per_data_packet). The first function, ``read_pixel_data()``, reads the raw data stream and forwards the relevant packets (packets with image data) to us by checking the side-channel User data. Next, we pass the image to ``pix_threshold()``, which is where the magic happens.
![image.png](attachment:1e0537fd-787d-49d0-82d4-1504de19996c.png)


Looking at ``pix_threshold()``, we immediately begin looping through the image data, one packet at a time. After a packet is read from the function's input stream ``StreamIn``, we loop through the captured packet's bit string using MONO16PIX. For each pixel, we check whether the pixel's value is greater than 2047. If so, we leave it alone; otherwise, we set the pixel to 0. Next, we attach the side-channel User info to the data packet and write it to the output stream.
![image.png](attachment:b61c10d0-7d16-4c2d-b816-3ae829e03b39.png)
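
The per-packet logic can be mirrored in a few lines of Python. This is a behavioral sketch rather than the HLS source: ``pix_threshold_model`` is a hypothetical name, packets are represented as plain lists of eight pixel values, and the side-channel User data is left out.

```python
THRESHOLD = 2047  # pixels must be strictly greater than this to survive


def pix_threshold_model(packets, threshold=THRESHOLD):
    """Zero every pixel that is not greater than threshold, packet by packet."""
    return [[p if p > threshold else 0 for p in packet] for packet in packets]
```

Calling ``pix_threshold_model([[0, 2046, 2047, 2048, 4095, 1, 3000, 2047]])`` returns ``[[0, 0, 0, 2048, 4095, 0, 3000, 0]]``.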

This example is fairly simple, and we could have placed this ternary statement in read_pixel_data, but as we increase the complexity of our algorithm, we would likely run into difficulties with scheduling and maintaining pipeline constraints during synthesis. Pipelining at the functional level (within ``myproject()``) and spreading the constituent components of the algorithm across multiple functions when using a streaming architecture typically yields the highest throughput. None of these examples are complex enough to warrant more than one function, but we can see the benefit when implementing neural networks, for example (see [hls4ml](https://fastmachinelearning.org/hls4ml/)).

### 4.b. Synthesis & Implementation

Before moving to the more complex examples, deploy and verify this example to ensure there are no higher-level issues. 

To execute C-simulation and C-synthesis, run ``vivado_hls 05_model_design_hls/scripts/run_hls.tcl``. Remember to use Vivado (HLS) 2018.3!

Looking at ``vivado_hls.log``, we will see the C-simulation run first. It is a lengthy file due to the size of the images. If we ctrl+F "Received Image", we will see the image that will be returned to the host. The image supplied to the testbench (found in ``tb_data/tb_input_features.dat``) consists of a repeating arithmetic sequence, 0,1,2,3...4095,0,1,2,3...4095... If our thresholding code works, we should see the sequence 0,0,0,0,...2048,2049,...4095,0,0,0,0,...2048,2049,...4095... As expected, this is the image we receive.
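
If you want to check the received image against an independent reference, the expected sequence is easy to regenerate. The snippet below is a sketch under two assumptions: the ramp restarts every 4096 pixels exactly as described above, and the image holds 260 x 320 pixels. The function names are hypothetical.

```python
def ramp_image(num_pixels, period=4096):
    """Repeating 0, 1, ..., 4095 ramp, as described for the testbench input."""
    return [i % period for i in range(num_pixels)]


def expected_threshold_output(pixels, threshold=2047):
    """Reference result: keep pixels above the threshold, zero the rest."""
    return [p if p > threshold else 0 for p in pixels]


image = ramp_image(260 * 320)
expected = expected_threshold_output(image)
```

Here ``expected[2047]`` is 0 and ``expected[2048]`` is 2048, which is the transition to look for in the log.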

![image.png](attachment:22839bf0-3bb1-4f89-b86d-b22a58dff88b.png)


Once C-synthesis is complete, we can run logic synthesis and place & route. These processes compile a gate-level netlist of the resources required to implement the design and map the design to the targeted FPGA. The script we will run also generates the bitstream. It takes a few hours, so ``nohup`` may be useful when executing this command: ``nohup vivado -mode tcl -source 03_scripts/run_impl.tcl &``.


Once the process completes, the first thing we will want to check is the folder ``06_release``. If a .bit file appeared or was updated, the process was successful. If not, check ``nohup.out`` for the error message. Once you've verified the bitstream was generated, let's check the timing report, ``post_route_phys_opt_timing_summary.rpt``. The timing report tells us whether the synthesis tools were able to implement our design under the desired timing constraints. The frame grabber uses a 250MHz clock (4ns period), so achieving timing closure can be tricky. Even with simple designs, poorly written HLS or HDL can result in timing failure. In ``post_route_phys_opt_timing_summary.rpt``, scroll down to the "Design Timing Summary". If the "WNS(ns)" or "TNS(ns)" value is negative, then timing failed. Here, we can see our worst negative slack (WNS) is 0.51ns, indicating timing closure.

![image.png](attachment:6f1e8e73-a19f-4816-9386-d97b37f01426.png)


### 4.c. Deployment

Now we can deploy the bitstream (which should be named ``CoaxlinkOcto_1cam.bit``) to the frame grabber. Follow these steps for deployment.
1. Open **Euresys Coaxlink Firmware Manager** on host PC.
2. Drag and drop the bitstream to the upload box in **Euresys Coaxlink Firmware Manager**.
3. In **Euresys Coaxlink Firmware Manager**, navigate to the **Coaxlink Cards** page via the sidebar menu icon.
4. Select the Octo frame grabber.
5. Select **Install firmware variant...**.
6. Select **proceed** and **Ok**.
7. Once completed, restart the host PC.


Now we can capture a buffer in eGrabber Studio. Reminder: set the presets listed at the start of the guide before continuing. It is recommended to run the frame grabber at a low frame rate (~100fps) while testing to verify the algorithm's behavior. At higher frame rates, setting ``BufferPartCount`` to a relatively high number is recommended to take advantage of batching and to reduce processing overhead. I also recommend capturing in raw or tiff format for these tests. When you capture a buffer, you should see the darker regions of the image clamped to completely dark.

---------------------------


## 5. Subtraction Firmware Example

Once you verify the thresholding example works, we can move on to the image subtraction example project, ``02_fw_sample_subtraction``. The only substantial difference between the first example and this example is that ``pix_threshold()`` changes to ``pix_subtract()``. Just like last time, we loop through the image stream. This time, we have an arbitrator which just tells us whether we're buffering the current frame or subtracting a previously buffered frame from the current frame. Note that buffered frames (the subtrahends) are also transmitted back to host in this example. 

When the arbitrator is 0, we write all pixels in the current packet to the buffer ``buf``. 

![image.png](attachment:c5aedf9c-51f4-4a2c-9c2b-7c5269e5af3d.png)

When the next frame arrives (arbitrator=1), we subtract the buffered frame from the current frame, and forward the result to the output. Note that the operands are cast to unsigned fixed-point which supports overflow modes (e.g. AP_SAT, [more info here](https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/Fixed-Point-Identifier-Summary?tocId=2KrIwwS1HQvpiafHrUBMHg)) as described earlier. 

![image.png](attachment:e8fc4c52-9796-45d4-ba8c-3ca624aa38e8.png)
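
The buffering and saturating subtraction can be modeled in Python as follows. This is a behavioral sketch, not the HLS implementation: the function names are hypothetical, frames are flat lists of 12-bit pixel values, and the arbitrator is assumed to simply alternate between 0 and 1 starting at 0.

```python
def sat_sub_u12(a, b):
    """Unsigned subtraction that clamps at 0 instead of wrapping around."""
    return max(a - b, 0)


def subtract_frames_model(frames):
    """Pass even frames through; replace each odd frame with (odd - previous even)."""
    out = []
    buf = None
    for idx, frame in enumerate(frames):
        if idx % 2 == 0:          # arbitrator == 0: buffer and forward the frame
            buf = list(frame)
            out.append(list(frame))
        else:                     # arbitrator == 1: current frame minus buffered frame
            out.append([sat_sub_u12(cur, prev) for cur, prev in zip(frame, buf)])
    return out
```

With this model, ``sat_sub_u12(400, 3010)`` gives 0, matching the saturation example from the repository overview, and ``subtract_frames_model([[10, 3010], [15, 400]])`` gives ``[[10, 3010], [5, 0]]``.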

The rest of this function is similar to ``pix_threshold()``. When testing this bitstream, even frames should be unchanged, and I would guess odd frames (the subtraction result) will appear completely dark if the camera field of view is static.

Note that this design failed timing. Achieving timing closure can be a weeks-long task of experimenting with HLS and synthesis strategies. Because the TNS here is very small in magnitude, the failure likely won't affect the implementation's behavior. For a final implementation, we will want to ensure the design meets timing.

![image.png](attachment:2a6fb816-cc17-4f79-83b8-52ce4ae90bb7.png)

---------------------------

## 6. Averaging Firmware Example (9 frames)

Now let's move on to the image averaging example project, ``03_fw_sample_average_9frames``. This project averages every 9 frames, storing the average in the 10th frame. In this project, we will be looking at ``pix_average()``. Here, we have a frame counter and the same loop to read the data stream. The number of frames to average is ``num_frames``. For the first 9 frames, we simply add each pixel to a running sum "image" called ``sum``. Note that all of these frames are also sent to the output stream back to the host as well.

![image.png](attachment:b86759c2-fb88-410c-8dfd-c258bfb4b7e3.png)


On the 10th frame, we calculate the average of the 9 frames, and forward the result back to the host in place of the 10th frame.

![image.png](attachment:0b987fc3-79c4-4380-b1da-897166340249.png)

Follow the steps from the pixel thresholding example for synthesis & deployment. Remember to check the timing results. If there are only a few failing endpoints in the timing report, it's likely the design will behave correctly. My implementation yields a likely-acceptable TNS. If these last two examples fail to behave properly, try the 499 frame example, since it achieved timing closure.

![image.png](attachment:d285af93-d337-4016-8401-a92210f8dc5f.png)

In testing this example, I would expect every 10th image to appear fairly similar to the preceding 9 assuming a static camera view.

IMPORTANT NOTE: Each pixel of the running-sum "image" is sized at 16 bits. This is sufficient to cover the full range of the addition of ten 12-bit unsigned integers. If ``num_frames`` changes, this bit width MUST be resized appropriately. ceil(log2(num_frames*((2^12)-1))) should yield the required bit width.
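
As a quick way to size the accumulator for other frame counts or pixel formats, the rule above can be evaluated in Python. This is a small sketch with a hypothetical helper name; it uses ``int.bit_length()`` on the largest possible sum, which gives the number of bits needed to hold that value and agrees with the formula above for 12-bit pixels.

```python
def accumulator_bits(num_frames, pixel_bits=12):
    """Bits needed to hold the sum of num_frames maximum-valued pixels."""
    max_sum = num_frames * ((1 << pixel_bits) - 1)
    return max_sum.bit_length()


for n in (9, 10, 499, 500):
    print(n, "frames ->", accumulator_bits(n), "bits")
```

This reports 16 bits for 9 or 10 frames and 21 bits for 499 or 500 frames, consistent with the sizes used in the two averaging designs.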


### 6.a. C-simulation

The C-simulation for this project supplies 9 images. The first five are all 0, and the last four are constant 4000. Thus the averaged image should be constant 1777.

![image.png](attachment:ebb0fcfa-935c-469e-98d8-681db0db494f.png)
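
The expected value can be reproduced with a short Python model of the averaging behavior. Treat it as a sketch under stated assumptions: ``average_frames_model`` is a hypothetical name, frames are flat lists of pixel values, the division truncates (which is what the expected 1777 implies), and a placeholder frame is appended to stand in for the frame that gets replaced by the average.

```python
def average_frames_model(frames, num_frames):
    """Forward num_frames frames while summing them, then emit their truncated average."""
    out = []
    sums = []
    for count, frame in enumerate(frames):
        pos = count % (num_frames + 1)
        if pos == 0:
            sums = [0] * len(frame)        # start a new running-sum "image"
        if pos < num_frames:
            sums = [s + p for s, p in zip(sums, frame)]
            out.append(list(frame))        # summed frames still go to the host
        else:
            out.append([s // num_frames for s in sums])  # replaces this frame
    return out


pixels = 8  # tiny frame size for illustration only
frames = [[0] * pixels] * 5 + [[4000] * pixels] * 4 + [[123] * pixels]
result = average_frames_model(frames, num_frames=9)
print(result[9][0])  # 16000 // 9 = 1777
```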

---------------------------

## 7. Averaging Firmware Example (499 frames)

Before moving on, let's generate some testbench data for this next example design. It is not already included due to GitHub file size limits.

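```python
num_sets = 250            # frames per constant value (250 + 250 = 500 frames)
values_per_set = 83200    # pixels per frame (260 x 320)

values_0 = [0] * values_per_set
values_4000 = [4000] * values_per_set

# 250 all-zero frames followed by 250 frames of constant 4000
all_sets = [values_0] * num_sets + [values_4000] * num_sets

# Write one frame per line, with pixel values separated by spaces
with open("03_fw_sample_average_499frames/05_model_design_hls/tb_data/tb_input_features.dat", "w") as file:
    for set_values in all_sets:
        line = " ".join(map(str, set_values))
        file.write(line + "\n")
```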

Now let's move to averaging every 499 frames, ``03_fw_sample_average_499frames``. This project averages every 499 frames, storing the average in the 500th frame. We will still be looking at ``pix_average()``. Again, we have a frame counter and the same loop to read the data stream. The number of frames to average is ``num_frames``. For the first 499 frames, we simply add each pixel to a running sum "image" called ``sum``. Note that all of these frames are also sent to the output stream back to the host as well.

![image.png](attachment:b86759c2-fb88-410c-8dfd-c258bfb4b7e3.png)


On the 500th frame, we calculate the average of the 499 frames, and forward the result back to the host in place of the 500th frame.

![image.png](attachment:0b987fc3-79c4-4380-b1da-897166340249.png)

IMPORTANT NOTE: Each pixel of the running-sum "image" is sized at 21 bits. This is sufficient to cover the full range of the addition of five hundred 12-bit unsigned integers. If ``num_frames`` changes, this bit width MUST be resized appropriately. ceil(log2(num_frames*((2^12)-1))) should yield the required bit width.

Looking at the timing report, we can see this example achieved timing closure. You may notice that even though this example and the previous example are nearly identical, one met timing while the other did not. These are instances where taking advantage of and experimenting with the various synthesis directives and strategies available in the Vivado Design Suite comes in handy. See [here](https://docs.xilinx.com/r/en-US/ug904-vivado-implementation/Using-Directives?tocId=dV9wYjuIP6n9oUJhkoHuRg) for more info. See ``03_scripts/run_impl.tcl`` to see the strategy, directives, and physical optimization loop we use for these designs.

![image.png](attachment:3575fe1f-9abd-4ab5-b89e-ff14e5ccff15.png)



### 7.a. C-simulation

The C-simulation for this project supplies 499 images. The first two hundred fifty are all 0, and the last two hundred forty-nine are constant 4000. Thus the averaged image should be constant 1995.

![image.png](attachment:a005944a-8381-46cf-b80b-82e9cd2a8445.png)

### 7.b. Averaging Validation

To validate the averaging examples, import the raw image buffer in Python, calculate the average of the first 9 or 499 images, and compare it with the 10th or 500th image. Use integer arithmetic.
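
A minimal sketch of that check is shown below. It rests on several assumptions that should be confirmed against your actual capture: the raw file is headerless, frames are stored back to back, each pixel is a 16-bit little-endian unsigned value, and the firmware divides with truncation. The function names and the file name in the usage note are hypothetical.

```python
import struct


def load_frames(raw: bytes, pixels_per_frame: int):
    """Split a raw buffer of 16-bit little-endian pixels into a list of frames."""
    n = len(raw) // 2
    pixels = struct.unpack(f"<{n}H", raw[: 2 * n])
    last_start = n - pixels_per_frame
    return [
        list(pixels[i : i + pixels_per_frame])
        for i in range(0, last_start + 1, pixels_per_frame)
    ]


def check_average(frames, num_frames):
    """True if frame number num_frames equals the integer mean of the frames before it."""
    expected = [sum(column) // num_frames for column in zip(*frames[:num_frames])]
    return expected == frames[num_frames]
```

A typical call would read the capture with ``open("capture.raw", "rb").read()``, pass the bytes to ``load_frames(raw, 260 * 320)``, and then run ``check_average(frames, 9)`` or ``check_average(frames, 499)``.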

---------------------------

## 8. Conclusion

Judging from experience, it's fairly unlikely all of these examples will work 100% first try. I am always available over email, rff224@lehigh.edu. If these examples go well, merging the subtraction and averaging shouldn't be too difficult. We may need to ask Eric (from Euresys) about DMAing only the frames we want/discarding 499 frames and how that works with the host control.


#### A note about pragmas

Throughout the HLS source code you will find preprocessor directives called pragmas. In HLS, they allow us to control how our C structures are implemented in hardware. For example, we applied ``ARRAY_PARTITION`` pragmas to restructure how our arrays are partitioned among the FPGA's BRAMs. Partitioning large arrays can increase throughput (as a consequence of having more read/write ports). For example, I observed a 16x latency decrease for the averaging function just by cyclically partitioning the running sum array. See [here](https://docs.xilinx.com/r/en-US/ug1399-vitis-hls/HLS-Pragmas) for a complete list of pragmas and more info. Note that some of the pragmas in this documentation are invalid in 2018.3, since the docs refer to Vitis 2023.
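
To build intuition for why cyclic partitioning helps here, the Python sketch below models which memory bank each array element lands in under cyclic and block partitioning. The helper names and the factor of 8 are illustrative assumptions, not values taken from the source code; the point is only that the eight pixels of one packet fall into eight different banks under a cyclic scheme, so they can be accessed in the same cycle.

```python
def cyclic_bank(index, factor):
    """Cyclic partitioning: consecutive elements rotate through the banks."""
    return index % factor, index // factor      # (bank, offset within bank)


def block_bank(index, factor, size):
    """Block partitioning: each bank holds one contiguous chunk of the array."""
    block_len = -(-size // factor)               # ceiling division
    return index // block_len, index % block_len


packet_indices = range(16, 24)   # the eight pixels of the third packet
cyclic_banks = {cyclic_bank(i, 8)[0] for i in packet_indices}
block_banks = {block_bank(i, 8, 83200)[0] for i in packet_indices}
print(len(cyclic_banks), len(block_banks))
```

With the cyclic mapping the packet touches eight distinct banks, while with the block mapping all eight pixels compete for a single bank.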

------------------