# Lab report Digital Design (EDA322)

Writing Guidelines

Group TUE\_PM\_8 Niklas Gustafsson Oskar Lundström

February 14, 2017

# Contents

| 1            | Introduction |                                                 | 1   |  |
|--------------|--------------|-------------------------------------------------|-----|--|
| 2            | Met          | chod                                            | d 2 |  |
|              | 2.1          | Arithmetic and Logic Unit (ALU)                 | 2   |  |
|              | 2.2          | Top-level Design                                | 4   |  |
|              | 2.3          | Controller                                      | 6   |  |
|              | 2.4          | Processor's Testbench                           | 8   |  |
|              | 2.5          | ChAcc on Nexys 3 board                          | 9   |  |
|              | 2.6          | Performance, Area and Power Analysis (Optional) | 11  |  |
| 3            | Ana          | nalysis 1:                                      |     |  |
| $\mathbf{A}$ | Appendix 1   |                                                 | 13  |  |

#### 1 Introduction

This part will introduce the reader to the report.

At the beginning, describe what the purpose of this lab report is. Then describe briefly what each section discusses and finally summarize the most important conclusions.

The purpose of this report is to describe the steps of building the ChAcc processor. Each section is intended to describe the most important parts of the work as well as the different challenges we faced.

In lab 2 we created an RCA and a CLA from full-adders. We then used this in combination with extra logic to create an ALU that can perform some basic operations such as addition, bitwise AND-ing and equality-testing.

In lab 3 we implemented register and a memory with generic sizes. We also designed a bus using multiplexers instead of tri-state buffers. These components were then combined to create the datapath of the processor.

In lab 4 we engineered the controller of the processor. We started out with a state diagram describing our controller as a mealy FSM and then realized it in VHDL with three processes. The datapath now had a functioning controller so we now tested it with a provided testbench.

In lab 5 we designed a testbench for the processor. We were provided with a program and expected outputs of some control signals. The testbench loads the program, executes it and compares the output with the expected values.

In lab 6 we synthesized our design of the processor. We encountered warnings from the synthesis tool which we had previously not seen. These errors were corrected and then we generated a bitsream which we loaded to the FPGA. We also made sure the program worked as expected.

Our main conclusions after completing all of the labs are that:

- You need to follow the specifications for control signals exactly, otherwise the instructions will either do nothing or do something completely wrong.
- Unlike when programming, there are more non-obvious things to take care of when describing hardware.
- test

### 2 Method

#### 2.1 Arithmetic and Logic Unit (ALU)

When we built our ALU we began by constructing the most basic component, the full-adder. We started by minimizing the boolean expression for the sum and the carry out signals using a Karnaugh diagram. We came to the conclusion that a full-adder should be built like described in the picture below.



Figure 1: A full-adder.

You can also build a full-adder using two half-adders and some extra logic. We came to this conclusion by connecting the resulting signal of the first half-adder to the second one along with the carry in signal. After this we just added the extra logic that was needed, which was an OR gate.



Figure 2: A full-adder made from two half-adders.

We created an 8-bit ripple carry adder (RCA) by connecting eight full-adders together according to the diagram below. The idea is to simply connect the carry out signal of the first full-adder to the carry in signal of the second one, and so on. The RCA was used in our original design of the ALU but later we replaced it with a carry lookahead adder (CLA).

The ALU includes a comparator unit which is used to check whether the operands are equal or not. You can implement it by using the structural or dataflow style of VHDL. We decided to do the dataflow implementation



Figure 3: A ripple carry adder.

because we thought it would be easier and that it would result in a more clean code. In order to check whether the operands are not equal we performed a bitwise XOR on the operands and OR:ed the results. This was used as the resulting NEQ signal. The EQ signal was simply an inversion of the NEQ signal.

If we were to do this in structural VHDL we would have to create a component which cheked if two certain bits were not equal. We would then have to create eight such components. The input as a whole is equal if all 8 bits are equal, otherwise not.

The operation A minus B is performed by converting B to its 2-complement and then adding it with A. By passing B through 8 XOR gates, each having SUB as its second input, we can form B:s 1-complement when subtracting. We connect SUB to carry-in to form the 2-complement.



Figure 4: Logic for performing subtraction with an adder.

One thing that we learned was that in VHDL the entity names in a project have to be unique for all references to work properly.

We also learnt why the critical path for an n-bit RCA is 2n + 1 gates. For every full adder the carry-out depth depends on the inputs (3 gates deep) and the carry-in (2+previous depth). The depth is the maximum of these. For the first the depth becomes 3, for the second 5 and so on. In other words, for the first adder the inputs are the fartherest away, but for the following it is the carry-in instead.

#### 2.2 Top-level Design

#### Registers

We made a generic implementation of a register by letting a variable hold an integer representing the number of bits the register is able to hold. The vectors for in- and output to the register are adjusted after this variable. We implemented our register using behavioral VHDL, i.e using a process statement, since that seemed to result in the simplest and cleanest code. Another reason for using a process statement is that we know that assigning signals values in a process statement result in flip-flops when the code is synthesized.

#### Memory

We also made our implementation of the memory generic. We have two integer variables, one describing the number of address bits  $(addr\_width)$  and one describing the number of bits each memory location can hold  $(data\_width)$ . The number of bits in the input signals for address and data are adjusted after these variables. The memory itself is an array of a type which we specified. The number of entries in this type of array is  $2^{addr\_width}$  and the size of each entry is  $data\_width$ . The read operation is asynchronous and the write operation is synchronous.



Figure 5: The operation of the memory.

In figure 5 interesting things happen at the red numbers.

- 1. The read is asynchronous, so the output is changed directly after the address is changed.
- 2. The write is synchronous, so the write does not happen until the positive edge.
- 3. When the input is changed, no write happens at the negative edge.
- 4. If we changed the address to something else, and then change back, the value previously written is still saved.

#### Bus

The bus was implemented with a mux. The 4 indata signals each had 8 bits. Bit 0 from each goes to a 4 to 1 mux, bit 1 from each to another mux and so on, i.e. the signals were muxed bitwise. The muxes are inferred by

the tools since we used a WITH SELECT statement. The mux select signal was created from a one-hot representation of the 4 control signals. Based on which bit was set, the corresponding mux select signal was created. If more than 1 bit are set, the bus error signal is set.

We used a multiplexer for the bus since if 2 or more control signals are set, the bus will not take an undefined value, as opossed to a tri-state version. If 0 or more than 1 control signals are set, we put 0's on the bus since we will not read from it anyway if that is the case.

#### Datapath

The last step was to connect all components of the datapath as described by figure 6. In the datapath entity we had signals with the same names as on that figure (made it easier to code if the names were the same). We then instatiated all components and connected them according to the figure. Some components, such as the registers, have different datawidths in different places. But since they were generic it was easy to have multiple versions.



Figure 6: The datapath of the processor.

#### 2.3 Controller

#### The FSM

We chose to design our controller as a Mealy FSM. We did this in order to reduce the amount of states (fewer flip-flops). If we would've chosen a Moore design we would roughly need one state for each combination of opcode and stage, although we would be able to minimize it to some extent. Fewer states means that we only have to keep track of five states, namely the stages (FE,DE,DE\*,EX,ME), and hence the resulting VHDL code as well as the circuit itself becomes simpler. The statements for assigning the output signals a value are rather long. This would however be the case no matter which design we chose. If we had chosen a Moore design each row in these statements would represent a state and hence the code would be equally long.



Figure 7: The state diagram.

#### Testing

When we first ran the testbench we noticed a problem. The contents of the data memory did not match the specifications of what it should contain after running the testbench. We started troubleshooting and found three problems.

- 1. We found that our EQ and NEQ signals to the controller weren't assigned at all. We fixed this but the problem remained.
- 2. We continued to troubleshoot the controller by examining the waveform diagram of the controller and datapath when running the test-

bench. While doing this we noticed that after about 500ns the opcodes didn't match the instructions that should be executed. It turned out that the dmRd signal didn't have the correct value when the AND instruction was executed. In the processors specification document this signal was a "don't care" for this opcode when it really should've been set to 1, so we changed it. A few days after writing the code we checked the processors specification document and noticed that this had been changed there as well. We assume that this simply was a mistake.

3. The instruction for the jump statement was placed at memory location 256 (index starts from 0). This meant that when we jumped to memory location 255 a NOOP instruction was executed, the PC overflowed and then the program started from the beginning. We corrected this by moving the contents of memory location 256 to memory location 255. After this the testbench ran successfully.

In figure 8 we see the waveform diagram of instructions 1 to 3 in the test program being executed. If you compare the diagram with the processors specification document it's clear that the signals have the correct value at the correct time. You can also see that it's first after the fetch stage is completed that the opcode changes. This is because the fetch stage has to be completed before the opcode can be sent from the FE/DE register to the controller.



Figure 8: The waveform diagram for executing instructions 1 to 3 in the test program.

Note that when the opcode changes from 1 to 7 in figure 8 a spike occurs for the signal dataLd. This is an effect of our Mealy design. The output signal only depends on state and input (opcode). Since the stage is DE and the opcode is 1 for a brief moment the dataLd signal becomes 1. When the opcode changes to 7 dataLd becomes 0. This is not a problem since the spike occurs after the positive edge of the clock. Since everything is synchronous the control signals have corrected themselves before the next positive edge, meaning that no errors will be made.

#### 2.4 Processor's Testbench

The testbench is used to verify the processor's correct operation. By simulating inputs to the processor and examining the outputs you can determine whether an operation was carried out in a correct fashion or not. In practice this is done by writing an assembler program which you load into the memory of the processor. The verification is done by comparing the actual outputs with the expected outputs while running the program on the processor. If any of the outputs take an unexpected value an error has been found.

When designing the testbench we initially thought that there were huge issues with the flag register. We got no output to the flag2seg signal which indicated that something was wrong with the flag register. This seemed a little bit odd however, since the flag register had worked just fine previously. When checking the VHDL code for the processor it turned out that we'd forgotten to connect the flag2seg signal to the flag register. After connecting this our testbench ran successfully.

#### How the testbench works

- Generating input. The input to the processor are the control signlas (externalIn, clk, master\_load\_enable and aresetn) and the instruction and data memory files. The control signals are assigned values, some periodically, in the architecture of the testbench. The memory content is read when the memory entities are instantiated. Based on the memory content different output is expected, hence we consider those files an input from the testbench.
- Reading expected outputs. The expected output values are read from text files into five different arrays, one for each output signals to check. The reading is done inside a process in the testbench architecture. For each file, while it has not reached the end, read the current line. That line is parsed as an 8-bit vector and is stored in the array at the current index. The index is incremented and the same procedure is repeated.
- Comparing outputs and expected outputs. The output signals of interest have one process each, sensitive to the signal. When that signal changes the process is triggered and its value is compared to its expected value in the array by using asserts. If they are equal, the index for that array is incremented. Otherwise the testbench stops and reports the error. The output signals that are checked are DataMemOut, flag2seg, pc2seg, acc2seg and disp2seg.

#### 2.5 ChAcc on Nexys 3 board

#### Synthesis vs simulation

We encountered three problems in particular:

- In the process that assigned the next state we did not have *else* clauses and *when others* statement. We thought we didn't need this since we had exhautsed all possibilites already. When synthesizing, latches were created because we did not have these statements and hence we added them.
- When synthesizing we got warnings that we did not have all signals
  used in the process in the sensitivity list. This warning did not occur
  when simulating so it did not affect the correctness when just simulating. This will not change the behaviour of the program logically but
  we added it to avoid glitches.
- When we ran the program on the FPGA we observed the PC values and opcode values. They did not match the program. It appeared that the values jumped more than one step each cycle. The lab assistant suggested that we should make sure that the load control signals were set only if <code>master\_load\_enable</code> was set. This fixed the problem. If, for instance, <code>pcLd</code> does not depend on <code>master\_load\_enable</code> PC will be incremented at the pace of the main clock. But the stage would not change if <code>master\_load\_enable</code> was 0. This is of course unwanted beaviour. If the stage would change independently of <code>master\_load\_enable</code> the hardware would function correctly, but we would not be able to toggle the progress manually and observe changes ourselves.

When we simulated our processor all of our tests were completed successfully. We didn't get any indications that we should make any of the changes mentioned in bullets 1 and 2. It was when we synthesized our design that we got warnings about the sensitivity list and not exhausting all possibilities. Since it's hardware we're dealing with we have to make sure that the VHDL we write is possible to synthesize into well functioning hardware. If the synthesis tool ask us to make certain changes it's a good idea to follow its advice.

#### Verifying the correctness

We began the verification by observing the PC value. Since it should just go from 0 to 6 it would be easy to detect an error and follow the progress. But instead in jumped between 0 and 6 directly, and sometimes to 1 and

2, indicating that something was wrong. This lead us to the conclusions above. After fixing it, we observed PC again and this time it went from 0 all the way to 6 and the started over again from 0, as we expected. Next we observed the opcode. The opcode was also easy to follow since we could just compare it to the <code>inst\_mem.mif</code> file. The opcode sequence was correct as well. Finally we observed the <code>Display</code> sequence. It correctly showed the Fibonacci sequence. Table 1 shows the sequence for all three seven segment displays. This was the same sequence we saw in Modelsim.

| pc      | 0, 1, 2, 3, 4, 5, 6, 0,              |
|---------|--------------------------------------|
| opcode  | 6, 2, F, 7, 1, 7, C, 6,              |
| display | 0, 2, 3, 5, 8, D, 15, 22, 37, 59, 90 |

Table 1: These are the sequences of the seven segment displays on the FPGA. The values are hexdecimal.

## 2.6 Performance, Area and Power Analysis (Optional)

 $(\max: 2 pages)$ 

To be announced in the Lab7PM.

## 3 Analysis

After completing lab 2 to 5 we have a processor which can perform some basic operations. The processor consists of an ALU, registers, memories, muxes, a bus and a controller. Through careful testing and debugging we have verified that each component works as intended, both individually and together. We used waveforms to test the components and a testbench we designed ourselves to test the processor as a whole.

We've made two projects for each of our labs, one using an ALU based on an RCA and one based on a CLA. Although we haven't been able to notice any difference in performance (which we didn't expect to be able to) it's been interesting to see both implementations and how they accomplish the same task. Since the critical path of the CLA is shorter the clock frequency can be increased and the performance enhanced.

When testing our controller in lab 4 we had some issues. In order to solve this we needed to use waveforms to check what instructions were being executed as well as if the correct state and control signals were set. By examining the waveforms we were able to track the error to a certain control signal which wasn't set when it should've been. This gave us valuable experience in how to use simulations and waveforms to debug errors.

It was a good learning experience to design the controller as an FSM. Previously we had only designed smaller circuits for the purpose of learning how to do it, and not for use in practice. Designing an FSM for use in practice gave use a better understanding of the characteristics of a mealy FSM compared to a moore FSM (number of states and timing).

The hardware and the testbench have been described in VHDL. We have gained experience in this and also seen the difference between programming and describing hardware in code. Throughout this lab series we have kept in mind to write the VHDL in such a way that no unwnated hardware (e.g. latches) is synthesized. This was new to us compared to programming, where you can write more freely.

We realize that if the goal was to design an ASIC much more work would be needed. For example we would need to describe the placement of the components on the chip. And if we were to make some errors (which we have done) it would cost both money and time.

# A Appendix

(max: 4 pages)

In the appendix, you can include extra figures or tables that don't fit in the main body of the lab report.