# SOC Design HW3 FIR Design m111061547

蕭翔

# Outline:

| Overview       |
|----------------|
| Block diagram  |
| Operation      |
| Resource usage |
| Report         |
| CitUuh link    |

Date:2023/10/22

## 1. Overview

In this digital signal processing (DSP) finite impulse response (FIR) design, we have a total of 600 data points and 11 coefficients. However, we are constrained to use only one multiplier and one adder for our implementation.

The input data is received through the AXI-Stream protocol, which operates under the ARM architecture. It is essential to store this incoming data in a memory block known as Data Bram.

On the other hand, the FIR filter's coefficients, represented as "taps," are supplied through the AXI-Lite protocol.

The main objective of this design is to efficiently compute the convolution of the input data and the filter coefficients while adhering to the hardware constraints. The resulting output data is then transmitted using the AXI-Stream protocol.



# 2. Block diagram

The block diagram depicts the key components of our system. This includes interfaces for data\_ram, tap\_ram, and the host-side (testbench).

Tap\_ram serves as a straightforward storage unit for the 11 filter coefficients received via AXI-Lite.

Data\_ram presents a more intricate challenge, handling 600 data points. It employs a shift\_ram design to control data\_A and facilitate data processing within the data flow block.



## 3. Operation

## FSM:

IDLE State: Initially, the FSM is in the IDLE state, waiting for the first data to arrive. It remains in this state until it detects wdata with a value of 32'd600, at which point it transitions to the TAP2BRAM state.

TAP2BRAM State: In the TAP2BRAM state, the FSM is responsible for storing the filter coefficients (taps). These coefficients are received via the AXI-Lite interface and are written into the tap\_ram. Once the storage is complete, wvalid is asserted, and when rdata equals 0 (indicating ap\_start), and wdata is set to 1, the FSM transitions to the Compute state.

Compute State: In the Compute state, the DSP FIR filtering operation commences. This is where the convolution between the input data and the filter taps is calculated.

Done State: When sm\_tlast is asserted, indicating the end of the data stream, the FSM transitions to the done state. In this state, ap\_done is asserted, indicating that the computation is finished, and the FSM returns to the IDLE state, awaiting the next data input.



# **Configuration Read:**

### ap idle:

When rdata is 32'd4, signifying the ap\_idle state, the FSM transitions to the TAP2BRAM state as soon as it reads a data length of 600.



#### ap start:

When transitioning into the Compute state, an rdata value of 32'b0 indicates that ap\_start has been read, signaling the commencement of DSP computation.

STAT:1→2
TAP2BRAM→Compute

rdata:



#### ap done:

Upon entering the Done state, an rdata value of 2 is detected, signifying the completion of DSP computation. Subsequently, as the final sm\_tdata is processed, the FSM returns to the IDLE state, with rdata reading 4, indicating the system is in the ap idle state.



## TAP\_RAM:

#### **AXI-lite** write



I've added a control signal, tap\_data\_wen, which synchronizes awready and wready during a handshake when awvalid is set to 1 from the host.

For tap\_WE, it becomes 4'b1111 when both wvalid and wready are 1, allowing wdata to be written into tap Di with a one-clock delay.

Additionally, I've implemented an address adjustment for tap\_A. When awaddr is equal to or greater than 12'h020, (awaddr - 12'h020) is used as the address for tap\_A. This ensures a clear address mapping for tap\_BRAM, spanning from 12'h000 to 12'h028

#### **AXI-lite read**



I've set up a tap\_rdata\_cnt to cycle through values 0-2. When it reaches 1, arready is asserted, and rvalid is delayed by one clock cycle to align their timing. With rvalid and rready both high, you can read the rdata corresponding to the tap\_address.

Additionally, when arready and arvalid are both 1, araddr is received. To correctly access tap\_ram, we subtract 12'h020 from araddr, aligning with the address configuration we've established.

These adjustments ensure synchronized data transfer and address calculations in our system.

# **Shift\_RAM:**



I've implemented a data storage approach for data\_ram, resembling a FIFO design. It includes a fifo\_ptr\_cnt to control data\_A access. When ss\_tready is set to 1, it enables the writing of ss\_tdata into data\_ram when data we is 4'b1111.

When fifo\_ptr\_cnt reaches 11, data\_A is reset to zero. This signifies that the multiplication operation has consumed the initial 11 data points, and we require new data for further DSP computation. If fifo\_ptr\_cnt t is not equal to 11, we multiply its value by 4 and assign it to data\_A, facilitating the shift ram functionality as depicted in the diagram.

## **DSP** calculation:



Our DSP computation follows the structure illustrated in the diagram. It involves a delay of 3 clock cycles from data\_RAM to feed data\_Xn. Tap\_A provides the corresponding address to read tap\_Do, and we employ a "coef" register to store these 11 taps. Subsequently, we perform pairwise multiplications and accumulate the results. The final accumulated value is sent as sm\_tdata for comparison with the golden data.

Additionally, I've implemented a "cal\_cnt" to count from 1 to 11 to ensure that data\_ram receives all 11 data points. When "cal\_cnt" reaches 4, an "enable\_one\_cycle\_done" signal is set. At this point, sm\_tdata's value is sent for comparison. Finally, when "gold\_cnt" reaches 600, indicating the completion of 600 calculations, "sm\_last" is asserted, signifying the conclusion of DSP computation and the transition to the "done\_state."

These steps outline the core of our DSP computation process, providing an efficient and structured approach to achieving our desired results.

## 4. Resource usage

#### FF and LUT:

| 1. Slice Logic        |     |   |   |  |   |           |   |      |
|-----------------------|-----|---|---|--|---|-----------|---|------|
| : "                   | :   | : |   |  |   | Available | : | :    |
| Slice LUTs*           | 268 | • | 0 |  | 0 |           | ï | 0.50 |
| LUT as Logic          | 268 | 1 | 0 |  | 0 | 53200     | Ī | 0.50 |
| LUT as Memory         | 0   | 1 | 0 |  | 0 | 17400     | Ī | 0.00 |
| Slice Registers       | 299 | 1 | 0 |  | 0 | 106400    | Ī | 0.28 |
| Register as Flip Flop | 299 | 1 | 0 |  | 0 | 106400    | Ī | 0.28 |
| Register as Latch     | 0   | 1 | 0 |  | 0 | 106400    | I | 0.00 |
| F7 Muxes              | 0   | T | 0 |  | 0 | 26600     | I | 0.00 |
| F8 Muxes<br>+         | 0   | 1 | 0 |  | 0 | 13300     | 1 | 0.00 |

#### Bram:

| 2. Memory                                    |                 |             |             |                       |                              |
|----------------------------------------------|-----------------|-------------|-------------|-----------------------|------------------------------|
| Site Type                                    | Used            | <br>  Fixed | Prohibited  | Available             | +<br>  Util%                 |
| Block RAM Tile<br>  RAMB36/FIF0*<br>  RAMB18 | 0<br>  0<br>  0 |             | 0<br>0<br>0 | 140<br>  140<br>  280 | 0.00  <br>  0.00  <br>  0.00 |

#### DSP:

# 5. Report

Timing Clk cycle:7.5ns Critical path:

```
Max Delay Paths
Slack (MET) :
                         0.876ns (required time - arrival time)
  Source:
                        data_Xn_reg[16]/C
                         (rising edge-triggered cell FDCE clocked by Sysclk {rise@0.000ns fall@4.000ns period=8.000ns})
 Destination:
                        mult10__1/PCIN[0]
                          (rising edge-triggered cell DSP48E1 clocked by Sysclk {rise@0.000ns fall@4.000ns period=8.000ns})
                        Sysclk
  Path Group:
                       Setup (Max at Slow Process Corner)
  Path Type:
  Requirement:
                      8.000ns (Sysclk rise@8.000ns - Sysclk rise@0.000ns)
  Data Path Delay:
                        5.544ns (logic 4.689ns (84.582%) route 0.855ns (15.418%))
                      1 (DSP48E1=1)
-0.145ns (DCD - SCD + CPR)
  Logic Levels:
  Clock Path Skew:
   Destination Clock Delay (DCD): 2.128ns = ( 10.128 - 8.000 )
    Source Clock Delay (SCD): 2.456ns
    Clock Pessimism Removal (CPR):
                                   0.184ns
  Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter
                          (TSJ):
                                   0.071ns
    Total Input Jitter
                          (TIJ):
                                    0.000ns
    Discrete Jitter
                           (DJ):
                                    0.000ns
    Phase Error
                           (PE):
                                    0.000ns
```

#### 6. GitHub link

https://github.com/sonnyshiau/lab fir