# The Polynomial Evaluation Accelerator

Noah Olson, Erin Quartararo, Taruni Sanjay, Cole Schneider

Team 1

ENEE408C Section 0101

Shuvra Bhattacharyya

December 12, 2022

# Contents

- 1 Team Contributions
  - 1.1 Noah Olson
  - 1.2 Erin Quartararo
  - 1.3 Taruni Sanjay
  - 1.4 Cole Schneider
- 2 Executive Summary
- 3 Goals and Design Overview
- 4 Realistic Constraints
- 5 Engineering Standards
- 6 Alternative Designs and Design Choices
- 7 Technical Analysis for System and Subsystems
  - 7.1 Top-Level Modules
  - 7.2 Low-Level Modules
- 8 Design Validation for System and Subsystems
- 9 Test Plan
  - 9.1 LIDE-C Testing
  - 9.2 LIDE-V Testing
- 10 Project Planning and Management
- 11 Conclusion
- 12 References

# 1 Team Contributions

## 1.1 Noah Olson

Early contributions to the project included the final breakdown of bits for the control input and how to handle N and S as one-dimensional arrays within Verilog. For the LIDE-C implementation, Noah worked with Erin to write the PEA actor header file (lide\_c\_PEA.h) and, within the source file (lide\_c\_PEA.c), the constructor function, destructor function, enable function, and the struct for the PEA actor. Noah also handled the unit testing of EVB within the LIDE-C implementation.

Noah's contributions to the LIDE-V implementation included finalizing the upper-level modules, enable\_PEA.v and invoke\_PEA.v. He also designed the full system test bench. After compilation, Noah handled the system testing of the enable\_PEA.v, invoke\_PEA.v, firing\_state\_FSM\_2.v, mem\_controller.v, get\_command\_FSM\_3.v, and STP\_FSM\_3.v modules. During the system-level testing, Noah edited the port lists of all the modules above and fixed Verilog run-time errors within them. Within firing\_state\_FSM\_2.v, mem\_controller.v, get\_command\_FSM\_3.v, and STP\_FSM\_3.v, Noah added and removed states and signals to achieve the desired functionality for the whole system. Noah also assisted in the system testing of both EVP\_FSM\_3.v and EVB\_FSM\_3.v. His final contribution was the implementation of a third design for the hardware.

I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination.

Noah Olson

## 1.2 Erin Quartararo

For the software implementation, Erin and Noah jointly wrote the header file (lide\_c\_PEA.h) and the invoke function. Erin wrote the enable function and applied error handling. Once this was finished, Erin set up the CMake files. For testing the software implementation, Erin designed one test for STP, one test for EVP, and five tests for multi-instruction testing.

For the hardware implementation, Erin led the planning of the whole-system design, using mind maps and charts to plan module interactions. After this planning stage was finished, Erin wrote initial implementations for all instruction-related modules. These modules included get\_command\_FSM\_3.v, the module used to split a command token into its proper segments, as well as STP\_FSM\_3.v, EVP\_FSM\_3.v, EVB\_FSM\_3.v, and RST\_FSM\_3.v. She also made revisions as necessary to firing\_state\_FSM\_2.v, the module responsible for handling interactions between instruction FSMs (and RAM), and the invoke top-level module. When testing of individual modules began, Erin tested and debugged the EVP and EVB modules. Once full-system testing began, Erin designed a few tests for each instruction, debugging along the way. She wrote three multi-instruction, full-system tests and confirmed their proper operation.

Erin also contributed to writing and revising this report.

I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination.

Erin Quartararo

## 1.3 Taruni Sanjay

For the software implementation, Taruni completed the unit testing for the EVP and RST instructions. Taruni wrote three unit tests for the EVP instruction and three for the RST instruction within the LIDE-C implementation, ran the test commands to check each test's output for equivalence with the expected output, and confirmed that these instructions work.

For the hardware implementation, Taruni drew the design hierarchy diagrams for the entire project. Taruni and Cole then drew the state transition diagrams for every instruction during the planning stage. During the implementation stage, Taruni assisted Cole with the enable\_PEA.v and invoke\_PEA.v modules. Taruni then wrote the second-level finite-state machine module, firing\_state\_FSM2.v, which acts as the binding element between the top-level and lowest-level modules in the design. Taruni categorized each instruction into firing and wait states in firing\_state\_FSM2.v, defining the functionality of each state, and wrote the output signal assignments for every state mode by declaring output registers and wires to pass data between the upper-level and lower-level modules. Taruni also received some help from Erin, Noah, and Cole during the module instantiation phase in the firing state.

For documentation, Taruni documented the source code of the hardware component of design 1.

Finally, Taruni led the implementation of multiple sections of the report and project planning.

I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination.

Taruni Sanjay

## 1.4 Cole Schneider

Cole made active contributions during all phases of the project. These include planning, implementation, verification, and analysis of results.

During the initial planning phase, he proposed design decisions that stuck throughout all versions of the Polynomial Evaluation Accelerator (PEA)'s implementation, such as the contents of the command token.

During the implementation phase, Cole contributed to modules at the top level, creating diagrams for the interface of the PEA and writing drafts of the invoke\_PEA.v and enable\_PEA.v modules. However, Cole's main contributions were at the lower levels of the PEA. He designed and implemented the mem\_controller.v module, as well as the interactions between the lower-level finite state machines and the multiple memory modules. One such contribution was the design of a RAM module with additional reset capabilities, N\_ram.v. Thanks to the familiarity he gained during the planning and implementation phases, Cole was able to develop a second design for the PEA that tweaked the memory layout of the original PEA implementation.

During the verification phase, Cole designed multiple unit tests and system-level tests for both the software and hardware components. For the software component, he designed unit tests for the STP function. For the hardware component, he designed multiple unit and system-level tests, from which he discovered bugs and other unintended functionalities across several modules. He applied the appropriate fixes, and in some cases, added new elements to the design. These include the multiplexers found in the firing\_state\_FSM2.v module and a revamp of the state transitions and outputs for multiple finite state machines.

For the results and concluding phase of the project, Cole ran the synthesis and implementation processes on all three of the PEA designs. He used the Vivado design suite to generate the reports needed to analyze the designs and compare their performance.

I pledge on my honor that I have not given or received any unauthorized assistance on this assignment/examination.

Cole Schneider

# 2 Executive Summary

The overall interface of the Polynomial Evaluation Accelerator (PEA) consists of two input dual-port first-in, first-out (FIFO) buffers, the PEA itself, and two output dual-port buffers. The input buffers are named Data and Command, and the output buffers are named Result and Status. The project has two components: hardware (Verilog-based) and software (C-based). The software component implements the PEA as a LIDE-C actor, while the hardware component implements the PEA in Verilog and reuses applicable tests from the LIDE-C version of the actor. The Verilog component is developed using LIDE-V, which integrates lightweight dataflow programming with the Verilog hardware description language. The Verilog-based implementation can be viewed as an application-specific accelerator dedicated to polynomial evaluation.

Performance and resource requirements (hardware cost) are the two main concerns for the hardware implementation. For the purposes of this project, performance is the average latency for processing streams of instructions and data. Specific streams of benchmark inputs are described in detail in this documentation. Simulation and synthesis are done using Xilinx Vivado. Each module was tested independently before the modules were combined into one integrated system, which was then used for validation and testing.

The software component uses CMake (a cross-platform, compiler-independent build system generator) to generate executable files. Unit testing in C is done for every instruction, and tests are validated with the dxtest command. Unit testing in the hardware component generates Vivado files. The synthesis and implementation are carried out in a lib subdirectory. The test results for hardware and software are shown in the transcripts.txt file and the diagnostics.txt file, respectively, located in the corresponding test directory.

Alternative designs were generated for the hardware component; their functional differences show that the accelerator can be implemented successfully in more than one way. Accelerators of this kind are used in FPGA system designs, such as those targeting the Artix-7 FPGA. Overall, the findings gave the group a real-world perspective on related problems.

# 3 Goals and Design Overview

The major goal of our Pareto design (design\_1) is to reduce space and time redundancy in the implementation as far as possible. The design hierarchy was designed first, keeping in mind the previous mini-projects implemented for this course. The structure is hierarchical and concurrent. The populations of the input FIFO buffers are interfaced with the topmost level of the finite state machine (FSM).

The enable file, enable\_PEA.v, handles receiving tokens from the two input FIFOs, the Data buffer and the Command buffer, and ultimately passing them to the testbench. It consists of the following:

1. Declaration of the port list signals, including the word size and buffer size parameters.
2. Declaration of the state modes SETUP\_INSTR and INSTR, along with the instruction modes implemented later in the level 3 finite-state machines (FSMs).
3. Conditions that check the read and write addresses of the command and data RAMs to ensure that there are enough tokens to continue firing the PEA actor.
4. Assertion of the enable signal (driven high) according to the conditions for STP, EVP, EVB, and RST.

The Verilog file invoke\_PEA.v implements the topmost level of the finite state machine. It implements the following:

1. Declaration of the port list, consisting of the input FIFO populations and the write commands for the output FIFO buffers, Result and Status.
2. Implementation of the state modes that make up one complete operation cycle in the top module.
3. Instantiation of the nested FSM for the actor firing state.
4. Updating of the current state, covering the complete state evolution of the top-level FSM module (including leaf-level state conditions).
5. Execution of the nested FSM and assignment to the output signals.

The Verilog file firing\_state\_FSM2.v contains the actor modes and a state transition pattern equivalent to the described mode transition illustrations. It implements the following:

1. Declaration of the port list, which consists of the population commands of the input FIFOs, the output commands, and a reset instruction signal.
2. Declaration of the state modes and instruction states that allow the actor to fire in the nested finite state machine (FSM).
3. Instantiation of RAM modules for Data, Command, the N vector, and the S vector.
4. Instantiation of the mux modules rd\_addr\_S\_MUX.v, rd\_addr\_data\_MUX.v, and output\_MUX.v.
5. Instantiation of the level 3 modules Get-Command, STP, EVP, EVB, and RESET.
6. Condition for the RESET instruction to return to the initial state, and state evolution of the nested FSM: the initial state (SETUP\_INSTR), followed by INSTR with decision branches STP, EVP, EVB, and RST.
7. Firing of all the instruction modes and initialization of the next-state modules in order to complete one operational cycle.
8. Conditional assignment of all output signals, and of the consumption and production rates of the input and output FIFOs, respectively.

The lowest-level modules perform the computation of evaluating the polynomial in the accelerator. From a high-level point of view, the third-level modules comprise the following:

1. The GET\_COMMAND module splits the input command into its segments and updates the address of the next command to be read.
2. The STP\_COMMAND instruction reads the first input data token and updates the data read address.
3. The coefficient states increment the S RAM address for each subsequent token. Error handling is the final state of the STP instruction.
4. The EVP instruction reads the relevant tokens from the N vector RAM and updates the index counter of the S RAM.
5. Within EVP, the sum of the monomials is computed first, and then the exponent is calculated.
6. The EVB instruction consists of enabling the EVP function, firing it, and finally updating the counter of the b value.
7. The RST instruction consists of two state modes: one drives the reset signal high and the other drives it low.
8. The mem\_controller.v module provides the interface between the FIFO and the RAM modules.
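As a rough software analogue of the EVP steps above, the sketch below evaluates one coefficient vector at a point x by alternating between adding the current monomial to a running sum and updating the running power of x. The function name, the argument types, and the use of wrap-around 32-bit arithmetic are assumptions made for illustration; they are not taken from lide\_c\_PEA.c or EVP\_FSM\_3.v, and the report does not describe overflow behavior.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PEA_MAX_COEFFS 11  /* words per coefficient vector, per the S RAM description */

/* Hypothetical helper: evaluates c[0] + c[1]*x + ... + c[n-1]*x^(n-1).
 * Unsigned arithmetic is used so that overflow wraps modulo 2^32
 * instead of being undefined; this is an assumption, not PEA behavior. */
int32_t pea_evp_sketch(const int16_t *coeffs, int n, int16_t x)
{
    assert(coeffs != NULL && n >= 0 && n <= PEA_MAX_COEFFS);

    uint32_t sum = 0;                       /* running sum of monomials */
    uint32_t power = 1;                     /* running power of x, starts at x^0 */
    uint32_t ux = (uint32_t)(int32_t)x;     /* sign-extend, then reinterpret */

    for (int i = 0; i < n; i++) {
        uint32_t c = (uint32_t)(int32_t)coeffs[i];
        sum += c * power;                   /* add the current monomial */
        power *= ux;                        /* advance to the next exponent */
    }
    return (int32_t)sum;                    /* result tokens are 32 bits wide */
}
```

For example, calling `pea_evp_sketch` with coefficients {1, 2, 3} and x = 2 evaluates 1 + 2·2 + 3·4.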

The major challenge was generating the output signals, since a number of wires were initially left with loose ends. Any wire connected to an unknown or undeclared signal resulted in an 'x' (unknown) value or a 'z' (high-impedance) value. From a design point of view, the connections between all the FSMs were tricky because of the abundance of signals, registers, and wires. Resolving these issues worked against our goal of reducing space redundancy, because the solution required us to add more multiplexers for the data address and the S vectors.

# 4 Realistic Constraints

As a group, our aim was to build a design with good performance and reduced latency. The key constraints are as follows:

Figure 1: High-Level Design Hierarchy of the Polynomial Evaluation Accelerator

- The project requirements state the desired sizes for the word argument and the buffer depth of the FIFO. During the course of our project planning, we kept a keen eye on the bit widths and FIFO capacities of the input and output buffers. Both input buffers are 16 bits wide, and both output buffers are 32 bits wide. Another integral constraint is reducing space redundancy in our design. Our test results also confirm that there is no ambiguity in bit widths in any of the state modes.

- In the software implementation of our project, the enable and invoke functions were the two most important parts. The enable function checks and compares the sizes of the input buffers with their populations. The invoke function handles the computation of all the instructions required for the actor to be generated, fired, and terminated. Together, these functions give readers a clear picture of how the actor is implemented, and they are therefore an important constraint to consider.
- The software component of our project is written entirely in C. C is a procedural language that is convenient for implementing designs close to the hardware level. The hardware component is implemented entirely in Verilog HDL, which makes designs easy to simulate, debug, and synthesize. The Xilinx Vivado tool supports this Verilog flow. The Verilog-2001 standard is used throughout the project.
- Testing for our project is done in the Vivado suite. The most important constraints while testing a design in Vivado are the clock period and the I/O timing constraints. Vivado is quite user-friendly because it shows all details regarding slack and timing constraints after synthesis is done. Vivado also generates netlist files, which show details pertaining to look-up tables and multipliers, as well as the operating frequency and critical path for the design.

# 5 Engineering Standards

The software implementation uses the LIDE-C package provided through the DSPCAD research lab handouts. LIDE (the DSPCAD Lightweight Dataflow Environment) is a flexible, lightweight design environment that allows designers to experiment with dataflow-based approaches for the design and implementation of digital signal processing (DSP) systems. LIDE contains libraries of dataflow graph elements (primitive actors, hierarchical actors, and edges) and utilities that assist designers in modeling, simulating, and implementing DSP systems using formal dataflow techniques. The Verilog implementation uses the LIDE-V dataflow environment, through which users can run 'makeme' and 'runme' files to compile their code. The testing for Verilog and C is done using DICE packages, which are user-friendly and fast. Unit testing is confirmed by running the 'dxtest' command for all unit tests collectively. 'dxitsout' makes sure that the correct output is equal to the output obtained after running the driver function. LIDE-V also supports interactive project development in Xilinx Vivado, so that designs can be viewed and debugged along with the Tcl console.

C remains a mainstay of low-level programming and is closely tied to Unix-like systems. It sits between high-level and low-level languages, offering structured programming alongside direct control of memory. Abstract data types and interfaces, which this project relies on, are straightforward to express in C, and the actor implementation makes good use of them. This project required a large body of code broken down into small fragments, which works well in C. For instance, the enable and invoke functions are two major blocks in lide\_c\_pea.c that can be clearly told apart. C also has a smaller standard library than most higher-level languages and is widely used in embedded programming.

Benefits of using Verilog: Verilog is a standard hardware description language for the design of digital subsystems, from flip-flops to microcontrollers. It simplifies the design schematic and makes it easier for the programmer to lay out the hierarchy of the design in code. In our project, we used top-down and bottom-up design at the behavioral, register-transfer, and gate levels of abstraction. Our project uses the Verilog-2001 standard.

Benefits of using Xilinx Vivado: Vivado can verify and simulate design integrity at different stages of the design process, including behavioral simulation, post-synthesis functional and timing simulation, and post-implementation functional and timing simulation. We use the Vivado tool suite to generate the design schematic by running behavioral simulation, synthesis, and post-synthesis simulation. Vivado supports both Windows and Linux and offers powerful debugging features aimed at verification needs.

# 6 Alternative Designs and Design Choices

An alternative design to the original PEA uses a different layout for the RAM used to store the coefficient vectors, which we denote S RAM. The original PEA stores the coefficient vectors in a linear fashion, indexing each 16-bit word within a one-dimensional array of 88 words: with each word representing a single coefficient, there are 11 words per coefficient vector and a total of 8 coefficient vectors. This one-dimensional array method is shown below in Figure 2.



Figure 2: The S RAM layout of the original design, which stores all CV coefficients in a linear fashion

The alternative design stores the coefficients in a two-dimensional, 8x11 array of words. This method is shown below in Figure 3.



Figure 3: The S RAM layout of design 2, which stores all 88 CV coefficients in an 8x11 two-dimensional array
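The two layouts can be contrasted with a small C sketch of how an S RAM address might be formed in each case. The function and type names are hypothetical, and the linear formula (vector index times 11 plus coefficient index) is an assumption about how the 88-word array is indexed; the Verilog modules may compute the address differently.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CV        8   /* coefficient vectors */
#define COEFFS_PER_CV 11  /* words per coefficient vector */

/* Design 1 (assumed indexing): one flat array of 88 words,
 * so the address is produced by an addition. */
uint8_t s_ram_linear_addr(uint8_t cv, uint8_t coeff)
{
    assert(cv < NUM_CV && coeff < COEFFS_PER_CV);
    return (uint8_t)(cv * COEFFS_PER_CV + coeff);  /* 0..87 fits in 7 bits */
}

/* Design 2: the row and column are kept as separate address fields. */
typedef struct {
    uint8_t row;  /* coefficient vector, 0..7 (3 bits) */
    uint8_t col;  /* coefficient within the vector, 0..10 (4 bits) */
} s_ram_2d_addr_t;

s_ram_2d_addr_t s_ram_2d_addr(uint8_t cv, uint8_t coeff)
{
    assert(cv < NUM_CV && coeff < COEFFS_PER_CV);
    s_ram_2d_addr_t addr = { cv, coeff };
    return addr;
}
```

In the two-dimensional form the row can be taken directly from the instruction, so only the column counter needs to be incremented while stepping through a vector.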

The motivation behind this design change is to reduce the combinational logic required to read and write from the S RAM module. In the original design, a 7-bit adder is used within the STP, EVP, and EVB modules to determine the read and write address inputs for the S RAM module. The second design does not require this 7-bit adder. Instead, two addresses are used: the coefficient vector address (the row), determined by the command token, and the specific coefficient address (the column), determined by a 4-bit adder. Reducing the complexity and the number of gates of the adder circuit decreases the resource utilization of the Artix-7 FPGA. Shown below in Figure 4 are the utilization reports generated using the Vivado implementation process. Notice that the utilization percentage for design 1 is slightly larger than for design 2, indicating that the alternative design is slightly more efficient.

| Site Type              | Available | Design 1 Used | Design 1 Util% | Design 2 Used | Design 2 Util% |
|------------------------|-----------|---------------|----------------|---------------|----------------|
| Slice LUTs             | 20800     | 334           | 1.61           | 328           | 1.58           |
| LUT as Logic           | 20800     | 110           | 0.53           | 104           | 0.50           |
| LUT as Memory          | 9600      | 224           | 2.33           | 224           | 2.33           |
| LUT as Distributed RAM |           | 224           |                | 224           |                |
| LUT as Shift Register  |           | 0             |                | 0             |                |
| Slice Registers        | 41600     | 78            | 0.19           | 76            | 0.18           |
| Register as Flip Flop  | 41600     | 43            | 0.10           | 42            | 0.10           |
| Register as Latch      | 41600     | 35            | 0.08           | 34            | 0.08           |
| F7 Muxes               | 16300     | 16            | 0.10           | 16            | 0.10           |
| F8 Muxes               | 8150      | 8             | 0.10           | 8             | 0.10           |

The Fixed and Prohibited columns are 0 for every reported entry in both designs.

Figure 4: The utilization reports for design 1 (left) and design 2 (right), generated by the Vivado implementation process

The original design uses 110 Logic LUTs, 43 registers as flip flops, and 35 registers as latches. The alternate design uses 104 Logic LUTs, 42 registers as flip flops, and 34 registers as latches, which is a minuscule improvement. However, despite the reduction in LUTs and registers, the additional read and write address signals require an increase in power consumption. Shown below in Figure 5 are the routed power reports generated by the Vivado implementation process for each design. Notice that the total on-chip power consumption is lower for design 1 than for design 2.

| Metric                   | Design 1 | Design 2     |
|--------------------------|----------|--------------|
| Total On-Chip Power (W)  | 0.072    | 0.087        |
| Design Power Budget (W)  |          | Unspecified* |
| Power Budget Margin (W)  | NA       | NA           |
| Dynamic (W)              | 0.012    | 0.026        |
| Device Static (W)        | 0.060    | 0.060        |
| Effective TJA (C/W)      | 5.0      | 5.0          |
| Max Ambient (C)          | 99.6     | 99.6         |
| Junction Temperature (C) | 25.4     | 25.4         |
| Confidence Level         | Low      | Low          |
| Design Nets Matched      | NA       | NA           |

Figure 5: The routed power reports for design 1 (left) and design 2 (right)

The original design has a total on-chip power of 72 mW (0.072 W), while the alternate design uses 87 mW (0.087 W), an increase of approximately 20%. This increase is likely due to the increased number of signals at the interfaces of the third-level FSM modules, such as STP, EVP, and EVB, and of the S RAM. The slight improvement in resource utilization does not outweigh this large increase in power consumption. As such, the alternative design requires a greater amount of resources overall than the original design. The performance of both designs is the same, despite the alternative design requiring additional resources. The performance can be compared using the timing reports shown in Figure 6. The worst negative slack (WNS) and total negative slack (TNS) are 0 ns for both designs when a 20 ns clock period is used.

Since the WNS is 0 ns for both designs at a 20 ns clock period, the maximum possible clock frequency is $\frac{1}{20\ \text{ns}} = 50\ \text{MHz}$ for both designs. Due to the similar data flow of both designs, the number of clock cycles required to provide output to the external buffers is the same. Given that the maximum possible speed of computation and the required number of cycles are the same for each design, both designs perform equally.

| Design | WNS (ns) | TNS (ns) | TNS Failing Endpoints | TNS Total Endpoints | WHS (ns) | THS (ns) |
|--------|----------|----------|-----------------------|---------------------|----------|----------|
| 1      | 0.000    | 0.000    | 0                     | 1830                | 0.017    | 0.000    |
| 2      | 0.000    | 0.000    |                       |                     |          |          |

Figure 6: The routed timing reports for design 1 (top) and design 2 (bottom)
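The frequency figure above follows from the usual relation between the constrained clock period, the worst negative slack, and the shortest period the design can sustain. The helper below is a minimal sketch of that arithmetic; the function name is hypothetical and is not part of any Vivado or LIDE interface.

```c
#include <assert.h>

/* Hypothetical helper: highest clock frequency (in MHz) implied by a
 * timing report, given the constrained period and the WNS in ns.
 * Minimum achievable period = constrained period - WNS, so a negative
 * WNS lengthens the period and a positive WNS shortens it. */
double fmax_mhz(double period_ns, double wns_ns)
{
    double min_period_ns = period_ns - wns_ns;
    assert(min_period_ns > 0.0);
    return 1000.0 / min_period_ns;  /* 1 / (1 ns) = 1000 MHz */
}
```

With a 20 ns period and a WNS of 0 ns, `fmax_mhz(20.0, 0.0)` gives 50 MHz, matching the value quoted for designs 1 and 2.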

Given that designs 1 and 2 perform equally well and that the second design is less efficient with resource usage, the original design is better. This is reflected in the Pareto diagram in Figure 7, in which the original design lies on a greater utility curve than the alternative design despite the two having equal performance.



Figure 7: The Pareto diagram with the utilities of designs 1 and 2 plotted against performance and resource efficiency

The change in S RAM layout is not the only additional design. A third design, which focuses on reducing the number of states in the third-level FSMs, has also been explored.

During the verification phase of the project, additional states such as STATE\_IDLE were added to modules at the third level. Further analysis of the behavioral description of these modules indicated that certain states could be combined or removed entirely to achieve the same behavioral function. As such, STATE\_IDLE was removed in the STP and EVP modules, and the GET\_NEXT\_COEFF and COMPUTE\_EXPONENT states were combined in the EVP FSM.

However, this design cannot be evaluated since it does not meet timing requirements. Based on the timing report shown in Figure 8, the worst negative slack value is negative, at -13.441 ns. For the design to be valid, this value must be non-negative, and thus design 3 is invalid.

Figure 8: The timing report for the third PEA design

Throughout the planning, implementation, and verification phases of this project, many design decisions and paths were considered. The alternative designs, 2 and 3, grew out of ideas and observations made during the design process. Unfortunately, design 2 is less resource-efficient than the original design, and design 3, although it has the potential to outperform the original, could not be brought to meet timing before the project deadline.

# 7 Technical Analysis for System and Subsystems

#### 7.1 Top-Level Modules

Our design, as shown in Figure 10, is organized as a hierarchy that we derived as a group over the past few months. The data and command input buffers send their respective tokens into the finite-state machines. The enable module reads in the addresses for the data and command FIFOs as two signals each and checks whether the population of the command and data FIFOs is greater than the value of the second argument to decide whether EVB can be evaluated. Each finite-state machine at the lowest level is independent and performs its own computation. The enable module was initially implemented in a way very similar to the prior assignments completed during the class. After detailed discussion and a better understanding of the project, we concluded that it is best to check how full each FIFO buffer is by comparing its read and write addresses. In the SETUP\_INSTR mode, enable is driven high if the difference between the command write address and the command read address is greater than one. In the subsequent modes, STP and EVB check whether the difference is greater than the second argument (the value of b), and EVP checks whether the difference is greater than one. With these measures in place, the enable module takes in the command address and data address tokens and prepares them for firing later on.
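The enable conditions described above can be summarized as a small behavioral model. The following Python sketch is not the project's Verilog. It assumes that a FIFO's population is its write address minus its read address with no wrap-around, that SETUP\_INSTR looks at the command FIFO, and that STP, EVB, and EVP look at the data FIFO. All function and parameter names are hypothetical.

```python
def population(write_addr: int, read_addr: int) -> int:
    """Tokens written but not yet read (assumes addresses never wrap)."""
    return write_addr - read_addr


def enable(mode: str, cmd_wr: int, cmd_rd: int,
           data_wr: int, data_rd: int, arg2: int = 0) -> bool:
    """Return True when the PEA may fire in the given mode.

    Thresholds follow the text: "greater than one" for SETUP_INSTR and EVP,
    "greater than the second argument" for STP and EVB. Which FIFO each
    mode inspects is an assumption of this sketch.
    """
    cmd_pop = population(cmd_wr, cmd_rd)
    data_pop = population(data_wr, data_rd)
    if mode == "SETUP_INSTR":
        return cmd_pop > 1
    if mode in ("STP", "EVB"):
        return data_pop > arg2
    if mode == "EVP":
        return data_pop > 1
    return False  # unknown modes never enable


if __name__ == "__main__":
    # Five data tokens waiting, second argument b = 4: EVB may fire.
    print(enable("EVB", cmd_wr=0, cmd_rd=0, data_wr=5, data_rd=0, arg2=4))
```

Keeping the threshold comparison in one place makes it easy to see that only the STP and EVB conditions depend on the command's second argument.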

The top module with the invoke function indicates the completion of one full operation cycle. Its states are Idle, Firing start, and Firing wait. The first always block checks for the reset instruction and transitions to the Idle state, and the second always block performs the state transitions once the invoke signal is enabled. The FSM first transitions to the firing state, in which tokens are fired, and once the output signal is enabled, it transitions to the wait state. The input signal assignment is also driven high. This top module is easy to understand and clearly separates the functionality of the first and second levels of the finite-state machine hierarchy.
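The three top-level states can be illustrated with a short transition function. This is a Python sketch under stated assumptions rather than the actual module: the firing-start state is assumed to last a single step, and the wait state is assumed to hold until a hypothetical `done` signal arrives from the second-level FSM. The signal names are illustrative only.

```python
from enum import Enum


class InvokeState(Enum):
    IDLE = 0
    FIRING_START = 1
    FIRING_WAIT = 2


def next_invoke_state(state, reset, invoke, done):
    """Return the next top-level state (all inputs are booleans)."""
    if reset:
        return InvokeState.IDLE
    if state is InvokeState.IDLE:
        return InvokeState.FIRING_START if invoke else InvokeState.IDLE
    if state is InvokeState.FIRING_START:
        # Assumption: the start pulse lasts one step, then the FSM waits.
        return InvokeState.FIRING_WAIT
    # FIRING_WAIT: hold until the (hypothetical) done signal arrives.
    return InvokeState.IDLE if done else InvokeState.FIRING_WAIT


def trace(inputs, state=InvokeState.IDLE):
    """Apply a list of (reset, invoke, done) tuples and record each state."""
    visited = []
    for reset, invoke, done in inputs:
        state = next_invoke_state(state, reset, invoke, done)
        visited.append(state.name)
    return visited


if __name__ == "__main__":
    steps = [(False, True, False), (False, False, False), (False, False, True)]
    print(trace(steps))
```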

The firing state module binds the top module to the lowest-level (instruction) modules. It first checks the reset instruction and begins in the 'Start' state mode. The second segment comprises two modes: SETUP\_INSTR and INSTR. The main difference between this firing-state FSM and those in earlier assignments is the addition of the "get command" states: GET\_COMMAND\_START, GET\_COMMAND\_WAIT, and GET\_COMMAND\_FINISH. "Get command" is not an instruction in its own right; it is used for pre-empting tokens to start the computation. The output signal assignment shows the signal that comes in from the parent finite-state machine module.

#### 7.2 Low-Level Modules

The third-level FSMs include instruction modules, a "get command" module, and RAM modules. Normal operation involves close interaction and passing of signals with the firing\_state\_FSM\_2 module. These signals include read and write addresses for the different RAMs, enable signals to allow reading and writing to and from RAM, and enable and done signals for each instruction module. Result and status are outputted by each instruction module to be passed up to the output FIFOs. Depending on each module's implementation, other signals may be passed as well, but the ones listed are the common ones.

The get\_command\_FSM\_3 module takes a command token, intentionally read in during one of its states, as input. In a subsequent state, the module splits the command token into an instruction, a first argument, and a second argument. These three command components are the outputs of this module. The outputted instruction is used by firing\_state\_FSM\_2 to determine which instruction will be processed next, and that instruction's enable signal is set accordingly. Only one instruction can be enabled at a time. The STP, EVP, and EVB modules take the first and second arguments as inputs. Each performs its respective operation and outputs a result and status accordingly.
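Splitting a command token into its three fields amounts to shifting and masking. The sketch below shows the idea in Python; the field widths (2-bit instruction, 3-bit first argument, 5-bit second argument) and the field order are hypothetical placeholders, since the actual bit breakdown is not given in this section.

```python
# Hypothetical field widths; the real breakdown of the control input may differ.
INSTR_BITS = 2
ARG1_BITS = 3
ARG2_BITS = 5


def split_command(token: int):
    """Split a command token into (instruction, arg1, arg2).

    Assumed layout, most significant field first: instruction | arg1 | arg2.
    """
    arg2 = token & ((1 << ARG2_BITS) - 1)
    arg1 = (token >> ARG2_BITS) & ((1 << ARG1_BITS) - 1)
    instr = (token >> (ARG2_BITS + ARG1_BITS)) & ((1 << INSTR_BITS) - 1)
    return instr, arg1, arg2


def pack_command(instr: int, arg1: int, arg2: int) -> int:
    """Inverse of split_command, handy for building test tokens."""
    return (instr << (ARG2_BITS + ARG1_BITS)) | (arg1 << ARG2_BITS) | arg2


if __name__ == "__main__":
    token = pack_command(1, 3, 7)
    print(token, split_command(token))
```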

Each instruction module features three always blocks. The first always block depends upon the positive edge of the clock and the negative edge of the reset signal; on a reset, it resets the state to STATE\_IDLE or STATE\_START and resets counters, temporary registers, and anything else that requires clearing; on a positive clock edge, it pushes all temporary "next" values into their respective "final" values, which could be either registers local to the module or outputs. The second always block determines the next state, sometimes depending upon whether certain conditions are met. For example, EVP's next state is set to STATE\_END when the counter tracking the index of S currently being processed reaches N, the size of the coefficient vector. The third always block changes the values of registers and outputs depending upon the current state of the module. In data-retrieval-related or data-sending-related states, this may involve setting read enable or write enable signals high and changing read or write addresses. In computation-related states, this may involve performing a computation and updating registers or outputs as a result.
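The three-block structure can be mimicked in software to make the separation of concerns concrete. The following Python sketch is a behavioral stand-in for an EVP-like FSM, not the project's Verilog. It assumes that the coefficient at index i multiplies x to the power i, and it omits RAM accesses and status tokens. The register names (`index`, `coeff`, `power`) are hypothetical.

```python
class EvpModel:
    """Behavioral stand-in for an instruction FSM with three 'always blocks'."""

    REGISTERS = ("index", "coeff", "power", "result")

    def __init__(self, coeffs, x):
        self.coeffs = list(coeffs)  # one coefficient vector from S
        self.n = len(self.coeffs)   # N: size of the coefficient vector
        self.x = x
        self.reset()

    def reset(self):
        # Block 1, reset branch: clear the state and all registers.
        self.state = "STATE_START"
        self.index, self.coeff, self.power, self.result = 0, 0, 1, 0

    def _next_state(self):
        # Block 2: choose the next state from the current state and registers.
        if self.state == "STATE_START":
            return "GET_NEXT_COEFF"
        if self.state == "GET_NEXT_COEFF":
            return "STATE_END" if self.index == self.n else "COMPUTE_EXPONENT"
        if self.state == "COMPUTE_EXPONENT":
            return "GET_NEXT_COEFF"
        return "STATE_END"

    def _next_values(self):
        # Block 3: compute "next" register values for the current state.
        nxt = {name: getattr(self, name) for name in self.REGISTERS}
        if self.state == "GET_NEXT_COEFF" and self.index < self.n:
            nxt["coeff"] = self.coeffs[self.index]
        elif self.state == "COMPUTE_EXPONENT":
            nxt["result"] = self.result + self.coeff * self.power
            nxt["power"] = self.power * self.x
            nxt["index"] = self.index + 1
        return nxt

    def clock(self):
        # Block 1, clock branch: commit all "next" values at once.
        next_state, nxt = self._next_state(), self._next_values()
        self.state = next_state
        for name, value in nxt.items():
            setattr(self, name, value)


def evaluate(coeffs, x, max_cycles=1000):
    """Clock the model until STATE_END and return the accumulated result."""
    fsm = EvpModel(coeffs, x)
    for _ in range(max_cycles):
        if fsm.state == "STATE_END":
            return fsm.result
        fsm.clock()
    raise RuntimeError("FSM did not reach STATE_END")


if __name__ == "__main__":
    print(evaluate([1, 2, 3], 2))  # 1 + 2*2 + 3*2**2
```

Computing the next state and the next register values from the same pre-clock snapshot, then committing both together, mirrors how non-blocking assignments in a clocked block behave.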

The design utilizes four separate RAM modules. They store the set of coefficient vectors (S), the degree of each polynomial in S (N), data tokens, and command tokens. The memory controller unit senses whenever there are idle data or command tokens in the respective input FIFOs and loads them into the data and command RAM modules. The N RAM and S RAM modules are used when calls are made to STP, EVP, or EVB. For STP, the selected CV has its degree taken from the get command module and written into the specified N RAM address. Data tokens are then taken from the data RAM and written into the S RAM module. EVP and EVB interact with the RAM modules in a similar manner, reading values from the N RAM at the address specified in the command token, and from the S RAM in an incremental fashion, starting at the first coefficient of the specified CV. After STP, EVP, and EVB are done, write enable signals are driven high, along with result and status tokens.
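The interaction between STP and the N and S RAMs can be sketched with two Python lists. This is only an illustration: the fixed-stride S RAM layout, the capacity constants, and the treatment of N as the number of stored coefficients are all assumptions of the sketch, and the class and method names are hypothetical.

```python
class CoefficientStore:
    """Toy model of the N RAM and S RAM (layout is an assumption)."""

    def __init__(self, num_cvs=4, max_n=8):
        self.max_n = max_n                    # hypothetical slot size per CV
        self.n_ram = [0] * num_cvs            # one length entry per CV
        self.s_ram = [0] * (num_cvs * max_n)  # fixed-stride coefficient slots

    def stp(self, cv_index, n, data_ram, data_rd_addr=0):
        """Store n data tokens as coefficient vector cv_index.

        Returns the data read address after the copy.
        """
        if n > self.max_n:
            raise ValueError("coefficient vector does not fit in its slot")
        self.n_ram[cv_index] = n
        base = cv_index * self.max_n
        for i in range(n):
            self.s_ram[base + i] = data_ram[data_rd_addr + i]
        return data_rd_addr + n

    def read_cv(self, cv_index):
        """Read a CV back incrementally, starting at its first coefficient."""
        n = self.n_ram[cv_index]
        base = cv_index * self.max_n
        return [self.s_ram[base + i] for i in range(n)]


if __name__ == "__main__":
    store = CoefficientStore()
    store.stp(1, 3, [5, 6, 7, 8])
    print(store.read_cv(1))
```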

# 8 Design Validation for System and Subsystems

The testbench inputs used to verify the design cover a large breadth of cases at the system level, ensuring a high level of confidence in the correct performance of the whole system. Various cases of each input were used, such as STP commands with large and small amounts of data inputs. Multi-instruction tests were also used to verify system performance, addressing behavior in unexpected corner cases.

These system-level testbenches were built from existing unit tests at the subsystem level. Every module on the third FSM layer, including the memory controller, STP, EVP, EVB, and RST, was given its own thorough unit testing to ensure that functionality was correct. Incremental integration and testing allowed corner cases to be covered at every level.

From these testbenches, the performance of design variations could be evaluated based on the amount of time required to write result and status tokens to the output FIFOs when given the same test inputs. Figure 9 shows that the time benchmark differs between designs. Notice that the result reaches the value 13 later for design 1 (TIME:185) than for design 3 (TIME:175). This indicates that design 3 performed better in the behavioral simulation.

TIME:183 STATE\_EVP:8, STATE\_STP:0, result:0
TIME:185 STATE\_EVP:10, STATE\_STP:0, result:13
TIME:187 STATE\_EVP:0, STATE\_STP:0, result:13
TIME:189 STATE\_EVP:0, STATE\_STP:0, result:0

TIME:173 STATE\_EVP:8, STATE\_STP:1, result:0
TIME:175 STATE\_EVP:10, STATE\_STP:1, result:13
TIME:177 STATE\_EVP:1, STATE\_STP:1, result:13
TIME:179 STATE\_EVP:1, STATE\_STP:1, result:0

Figure 9: The EVP testbench output in BASH for design 1 (top) and design 3 (bottom) for an EVP unit test
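The comparison drawn from Figure 9 can be automated by scanning the monitor output for the first time stamp at which the expected result appears. The Python sketch below uses the log lines from Figure 9 as input; the helper name `first_time_with_result` is hypothetical, and the times are in whatever simulation time units the testbench prints.

```python
import re

# Matches one monitor entry such as "TIME:185 STATE_EVP:10, STATE_STP:0, result:13".
ENTRY = re.compile(
    r"TIME:(\d+)\s+STATE_EVP:(\d+),\s*STATE_STP:(\d+),\s*result:(-?\d+)"
)


def first_time_with_result(log: str, expected: int):
    """Return the earliest TIME at which result equals expected, or None."""
    for match in ENTRY.finditer(log):
        if int(match.group(4)) == expected:
            return int(match.group(1))
    return None


DESIGN_1 = """
TIME:183 STATE_EVP:8, STATE_STP:0, result:0
TIME:185 STATE_EVP:10, STATE_STP:0, result:13
TIME:187 STATE_EVP:0, STATE_STP:0, result:13
TIME:189 STATE_EVP:0, STATE_STP:0, result:0
"""

DESIGN_3 = """
TIME:173 STATE_EVP:8, STATE_STP:1, result:0
TIME:175 STATE_EVP:10, STATE_STP:1, result:13
TIME:177 STATE_EVP:1, STATE_STP:1, result:13
TIME:179 STATE_EVP:1, STATE_STP:1, result:0
"""

if __name__ == "__main__":
    t1 = first_time_with_result(DESIGN_1, 13)
    t3 = first_time_with_result(DESIGN_3, 13)
    print(t1, t3, t1 - t3)  # prints "185 175 10"
```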

Using a combination of thorough unit testing, system testing, and time benchmarking, the performance of the PEA can be evaluated for each design. Evaluation by simulation time allows for each design to be compared performance-wise in a consistent manner.

## 9 Test Plan

For testing, we used the unit testing tools 'dxtest', 'dxmktest', and 'dxitsout' that come with the DICE testing suite. The organization of both the LIDE-C and LIDE-V testing directories followed the DICE testing conventions, with a separate testing sub-tree for each main testing area. Within each testing sub-tree were individual testing sub-directories (ITSs), one for each test. Our approach to testing both the LIDE-C and LIDE-V implementations of the PEA was the Incremental Development Workflow (IDW). This involved making incremental changes to the code and then testing it to ensure the desired outcome was reached.

#### 9.1 LIDE-C Testing

The testing for each command (STP, EVP, EVB, RST) and for multiple-instruction sequences was divided among the group members. Each member was tasked with generating an ITS testing suite that would validate that the command was working for various inputs. After the code was compiled, we created lide\_c\_pea\_driver.c.

This driver used the CFDF canonical scheduler lide\_c\_util\_simple\_scheduler, along with lide\_c\_util\_guarded\_exe, to continue invoking the PEA actor when there were multiple instructions. For each test, the results and statuses of the respective input command were directed into a file out.txt within each ITS. This output was then validated, and the tests passed.

#### 9.2 LIDE-V Testing

For testing the hardware implementation, we took advantage of Verilog's hierarchical design structure and tested different modules separately before testing the entire system together. This method allowed us to verify that the individual modules were generating the correct results and statuses for EVP and EVB, while also ensuring that STP was writing to the CV's RAM properly. To build up to whole-system testing, each module was incrementally instantiated after the previous modules' signals and states were behaving as expected. The order of incorporating the modules into the whole system was as follows: enable\_PEA.v, invoke\_PEA.v, firing\_state\_FSM\_2.v, mem\_controller.v, single\_port\_ram.v, get\_command\_FSM\_3.v, STP\_FSM\_3.v, EVP\_FSM\_3.v, EVB\_FSM\_3.v, and RST\_FSM3.v. During whole-system testing, '\$monitor' statements and Vivado timing diagrams were used to create state transition diagrams, watch read/write addresses, and view the timing of various enable signals.

## 10 Project Planning and Management

The project planning and scheduling are represented chronologically with the help of a Gantt chart.

This Gantt chart provides a systematic way of depicting the allotment of work among group members and deadlines for every task.

## 11 Conclusion

Our project comprises three designs: design\_1 is the Pareto design, and design\_2 and design\_3 are the non-Pareto designs. We used performance and latency measurements as our main benchmarks. The Pareto design is the reference on which the two non-Pareto designs were built. The software implementation is the common reference point for the functionality of our project, and we drew on it repeatedly to build the hardware design and its alternatives. We aimed to create something that could eventually scale up in the FPGA system and subsystem design domain. We hope to continue applying the technical and soft skills we learned during this course.



Figure 10: Project Planning and scheduling
