**Vitis Accel Examples: Systolic Array** 

### Reference:

https://github.com/Xilinx/Vitis_Accel_Examples/tree/master/cpp_kernels/systolic_array

### 1. Introduction

The systolic array is a hardware-efficient architecture for matrix multiplication. Compared to a direct-mapped design, a systolic array combines parallel processing with pipelining: a network of computing units rhythmically processes data and passes it on through the system. Below is an illustration of a systolic array processor.



Figure. A 4 x 4 systolic array of processors for matrix multiplication.

(Source: http://ecelabs.njit.edu/ece459/lab3.php)

The two matrices A and B are shifted into the boundary processors in column 1 and row 1, respectively, as shown in the figure. Leading and trailing 0s are added to the rows and columns so that elements  $a_{ir}$  and  $b_{rj}$  arrive at processor  $P_{ij}$  simultaneously and the operation  $a_{ir} \times b_{rj}$  can be performed. Each  $P_{ij}$  is initialized to 0 for all  $i, j = 1, 2, 3, 4$ . In the end, processor  $P_{ij}$  contains  $c_{ij}$  for  $1 \le i, j \le 4$ . This algorithm takes O(n) time for  $n \times n$  matrices.
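The data flow described above can be sketched as a cycle-by-cycle simulation in plain C++. This is an illustrative model and not code from the Vitis example: the name `systolic_multiply`, the `Matrix` alias, and the 0-based indexing are assumptions made here. With row i of A delayed by i cycles and column j of B delayed by j cycles, the pair a[i][r], b[r][j] meets at P[i][j] in cycle t = i + j + r, so all products are accumulated after 3n - 2 cycles.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;  // array size used in the figure
using Matrix = std::array<std::array<int, N>, N>;

// Hypothetical helper: simulates an N x N systolic array cycle by cycle.
// Stores the product in c and returns the number of cycles simulated.
inline std::size_t systolic_multiply(const Matrix& a, const Matrix& b, Matrix& c) {
    for (auto& row : c) row.fill(0);           // every P[i][j] starts at 0
    const std::size_t cycles = 3 * N - 2;      // last pair arrives at t = 3 * (N - 1)
    for (std::size_t t = 0; t < cycles; ++t) {
        for (std::size_t i = 0; i < N; ++i) {
            for (std::size_t j = 0; j < N; ++j) {
                if (t < i + j) continue;       // leading zeros: no data has arrived yet
                const std::size_t r = t - i - j;
                if (r >= N) continue;          // trailing zeros: the streams have ended
                c[i][j] += a[i][r] * b[r][j];  // one MAC per processor per cycle
            }
        }
    }
    return cycles;
}
```

For the 4 x 4 case the function simulates 10 cycles, which grows linearly with n as stated above.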

### 2. Vitis HLS Implementation

Source Code

```
Kernel code:
--- mmult.cpp Matrix multiplication kernel

Host code:
--- host.cpp Host program
--- xcl2.cpp xcl2 function file
--- xcl2.hpp xcl2 header file
```

In the matrix multiplication kernel, the matrices A and B are first copied into local buffers inside the kernel so that global memory is read in burst mode. Because the loop bounds are variables, we use #pragma HLS LOOP\_TRIPCOUNT to tell the tool how many iterations each loop performs, so that latency can be estimated. Both min and max are set to c\_size  $\times$  c\_size = 256 (with c\_size = 16), matching the  $16 \times 16$  input matrices defined in the host program.

```
// Read Input A
readA:
    for (int loc = 0, i = 0, j = 0; loc < a_row * a_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size * c_size max = c_size * c_size
        if (j == a_col) {  // end of a row: move to the start of the next row
            i++;
            j = 0;
        }
        localA[i][j] = a[loc];
    }

// Read Input B
readB:
    for (int loc = 0, i = 0, j = 0; loc < b_row * b_col; loc++, j++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size * c_size max = c_size * c_size
        if (j == b_col) {  // end of a row: move to the start of the next row
            i++;
            j = 0;
        }
        localB[i][j] = b[loc];
    }
```

• The LOOP\_TRIPCOUNT pragma or directive is for analysis only and does not affect the results of synthesis. Without it, however, the tool cannot determine the iteration count of a loop with variable bounds, so the latency figures in the synthesis report are missing or incorrect.

The systolic matrix multiplication is performed by three nested for loops. Each iteration of the innermost loop performs a single multiply-and-accumulate (MAC) operation.
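The loop nest described above can be sketched as follows. This is a simplified model written for this report, not the kernel source itself: the function name `mmult_core`, the constant `MAX_SIZE = 16`, and the omission of all pragmas are assumptions made here for illustration.

```cpp
constexpr int MAX_SIZE = 16;  // assumed to match the 16 x 16 matrices of this lab

// Hypothetical compute core: accumulates C = A x B one MAC at a time.
// A is a_row x a_col and B is a_col x b_col; both live in MAX_SIZE x MAX_SIZE buffers.
void mmult_core(const int localA[MAX_SIZE][MAX_SIZE],
                const int localB[MAX_SIZE][MAX_SIZE],
                int localC[MAX_SIZE][MAX_SIZE],
                int a_row, int a_col, int b_col) {
    // Outer loop: one step of the shared dimension per iteration
    for (int k = 0; k < a_col; k++) {
        // Middle and inner loops: visit every processing element P[i][j]
        for (int i = 0; i < MAX_SIZE; i++) {
            for (int j = 0; j < MAX_SIZE; j++) {
                // Previous partial sum (0 on the first step)
                int last = (k == 0) ? 0 : localC[i][j];
                // Elements outside the valid matrix region contribute 0
                int a_val = (i < a_row) ? localA[i][k] : 0;
                int b_val = (j < b_col) ? localB[k][j] : 0;
                // Single multiply-and-accumulate
                localC[i][j] = last + a_val * b_val;
            }
        }
    }
}
```

For 16 x 16 inputs the innermost statement executes 16 x 16 x 16 = 4096 times when nothing is unrolled.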

Finally, the output matrix C is written to global memory in burst mode.
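The write-back step can be sketched in the same style as the read loops shown earlier. The function name `write_c` and the parameter names `c_row` and `c_col` are hypothetical; the point is only that the output addresses are visited sequentially, which is the access pattern that allows burst transfers to be inferred.

```cpp
constexpr int MAX_SIZE = 16;  // assumed local buffer dimension

// Hypothetical write-back step: copies the valid c_row x c_col region of localC
// to a linear output buffer in row-major order.
void write_c(const int localC[MAX_SIZE][MAX_SIZE], int* c, int c_row, int c_col) {
    for (int loc = 0, i = 0, j = 0; loc < c_row * c_col; loc++, j++) {
        if (j == c_col) {  // end of a row: move to the start of the next row
            i++;
            j = 0;
        }
        c[loc] = localC[i][j];  // loc increases by 1 each iteration (sequential access)
    }
}
```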

The following optimization pragmas are applied to the mmult kernel:

```
#pragma HLS ARRAY_PARTITION variable = localA dim = 1 complete
// The matrix localA is partitioned into multiple arrays in the row direction.

#pragma HLS ARRAY_PARTITION variable = localB dim = 2 complete
// The matrix localB is partitioned into multiple arrays in the column direction.

#pragma HLS ARRAY_PARTITION variable = localC dim = 0 complete
// The matrix localC is partitioned completely. That is to say, it will be
// implemented using a register array with a size of 16 x 16 x 32 bits.
```

### HLS UNROLL

Unrolling a loop creates multiple copies of the loop body, which allows the MAC operations in the for loop to be processed in parallel.
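A minimal sketch of where these pragmas could sit in the kernel is given below. It assumes that the two inner loops are the ones being unrolled and that the kernel takes flat row-major pointers; the function name `mmult_sketch` and its signature are hypothetical, and the interface pragmas of a real kernel are left out. A standard C++ compiler ignores the HLS pragmas, so the same code can be checked functionally on a CPU.

```cpp
constexpr int MAX_SIZE = 16;  // assumed maximum matrix dimension (16 x 16 in this lab)

// Hypothetical kernel sketch: A is a_row x a_col, B is a_col x b_col, C is a_row x b_col.
// All dimensions are assumed to be between 1 and MAX_SIZE.
void mmult_sketch(const int* a, const int* b, int* c, int a_row, int a_col, int b_col) {
    const int b_row = a_col;  // inner dimensions must match

    int localA[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable = localA dim = 1 complete
    int localB[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable = localB dim = 2 complete
    int localC[MAX_SIZE][MAX_SIZE];
#pragma HLS ARRAY_PARTITION variable = localC dim = 0 complete

    // Read A and B from global memory into the partitioned local buffers
    for (int loc = 0, i = 0, j = 0; loc < a_row * a_col; loc++, j++) {
        if (j == a_col) { i++; j = 0; }
        localA[i][j] = a[loc];
    }
    for (int loc = 0, i = 0, j = 0; loc < b_row * b_col; loc++, j++) {
        if (j == b_col) { i++; j = 0; }
        localB[i][j] = b[loc];
    }

    // The k loop stays sequential; the i and j loops are marked for full unrolling,
    // so in hardware each k step updates all MAX_SIZE x MAX_SIZE accumulators at once.
    for (int k = 0; k < a_col; k++) {
        for (int i = 0; i < MAX_SIZE; i++) {
#pragma HLS UNROLL
            for (int j = 0; j < MAX_SIZE; j++) {
#pragma HLS UNROLL
                int last = (k == 0) ? 0 : localC[i][j];
                int a_val = (i < a_row) ? localA[i][k] : 0;
                int b_val = (j < b_col) ? localB[k][j] : 0;
                localC[i][j] = last + a_val * b_val;
            }
        }
    }

    // Write the valid region of C back to global memory
    for (int loc = 0, i = 0, j = 0; loc < a_row * b_col; loc++, j++) {
        if (j == b_col) { i++; j = 0; }
        c[loc] = localC[i][j];
    }
}
```

The partitioning matches the access pattern of the unrolled loops: a given k step reads one element from every row of localA and one from every column of localB, and updates every element of localC.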

### 3. Vitis HLS Build Flow

The build flow is the same as in Lab 3, except that the program arguments in the run configurations should be "./binary\_container\_1.xclbin" only. See 2022.1-Workbook-Lab3.pdf for further details.

#### Source:

https://github.com/bol-edu/course-lab_3/blob/2022.1/2022.1-Workbook-Lab3.pdf

### 4. Result Comparison

### **Analysis of the Hardware Emulation Result**

In this section, we analyze the effect of our optimizations in detail by comparing the hardware emulation results of the baseline design and the optimized design. The latency drops from 5102 to 1019 cycles, and the kernel execution time in the simulation drops from 0.027 to 0.020. The effect of each optimization technique is described below.

#### Before:



#### After:





### Effect of array partition

In the Bind Storage Report, we can see that after array partitioning, the BRAMs that held A and B in the original solution are each replaced with 16 register files, one per row of A and one per column of B. In addition, the BRAM for C is replaced entirely with registers because of the **complete** partition argument in the pragma. As a result, the FF (flip-flop) and LUT counts in the Resource Estimate of the synthesis report increase drastically.

#### Before:



#### After:



### Effect of loop unrolling

In the original solution, the system calculates a single MAC at each cycle, so the trip count is 4096 ( $16 \times 16 \times 16$ ). This part accounts for 80% of the total latency and must be parallelized to increase throughput.

• Note that in the original solution the computation is automatically pipelined, so the design actually uses more computing resources than a single MAC unit.

With the UNROLL pragma, the computation is performed in parallel. In the optimized design, the systolic module runs in a pipelined, parallel fashion. Its initiation interval (II), the number of clock cycles before the module can accept new input data, is 1, which means the design is **fully pipelined** to increase throughput.

The latency of the systolic computation can be calculated from the iteration latency (IL), the trip count (TC), and the initiation interval (II). With IL = 5, II = 1, and TC = 16, a latency of 19 cycles is obtained:

$$\text{IL} - \text{II} + \text{II} \times (\text{TC} - 1) = 5 - 1 + 1 \times (16 - 1) = 19$$



Loop unrolling reduces the latency of this part by about 99.5% (from 4100 to 19 cycles).
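The arithmetic in this section can be checked with a few lines of C++. The helper names `pipelined_latency` and `reduction_percent` are introduced here for illustration only; the formula is the one used in this report, and the input values are the ones quoted from the synthesis and emulation results.

```cpp
// Latency of the pipelined loop as computed in this report:
// IL - II + II * (TC - 1)
constexpr int pipelined_latency(int iteration_latency, int ii, int trip_count) {
    return iteration_latency - ii + ii * (trip_count - 1);
}

// Percentage reduction when going from `before` to `after` cycles.
constexpr double reduction_percent(double before, double after) {
    return 100.0 * (before - after) / before;
}

// Compile-time check with IL = 5, II = 1, TC = 16
static_assert(pipelined_latency(5, 1, 16) == 19, "latency of the unrolled systolic loop");
```

Here `reduction_percent(4100, 19)` evaluates to roughly 99.5 for the systolic loop alone, and `reduction_percent(5102, 1019)` to roughly 80 for the whole kernel.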

### 5. GitHub Link:

https://github.com/yochenglin/HLS_LabB_systolic_array

### 6. Appendix

**Run Hardware Emulation** 



**Run Hardware**



[Console output redirected to file /mnt/HLSNAS/O3.RJWQGv/labB\_yocheng/systolic\_array\_opt/systolic\_array/Hardware/SystemDebuggi

Found Platform

Platform Name: Xilinx

INFO: Reading /mnt/HLSNAS/03.RJWQGv/labB\_yocheng/systolic\_array\_opt/systolic\_array\_system/Hardware/binary\_container\_1.xclbin

Trying to program device[0]: xilinx\_u50\_gen3x16\_xdma\_base\_5

Device[0]: program successful!

TEST PASSED

#### Run hardware success!