

# CE/CZ3001

# **LAB: 4**

In lab 4, you will modify the 5-stage pipelined processor of Lab-3 to include CBZ and B instructions. In this lab you will understand the working of the 5-stage pipelined datapath and simulate the CBZ and B instructions along with the previous R-type and D-type instructions. You will be provided with Verilog code for the datapath. CBZ and B instructions changes the flow of the program thus introducing control hazards. You need to understand the effect of control hazard and eliminate them for the correct result in pipeline.



Fig. 1 Five-stage Pipelined CPU implementation diagram for R, D, CB and B type instructions

# I. <u>FIVE STAGE PIPELINED IMPLEMENTATION OF R, D, CB AND B TYPE</u> INSTRUCTION FORMAT FOR 64 BIT CPU—UNDERSTANDING CONTROL HAZARD

In this part, we are adding both "CBZ" and "B" instructions to the data path along with the previous set of instructions covered in previous labs based on the five-stage pipelined CPU shown in Fig. 1. As both the instructions change the ordering of instructions by changing the content of Program Counter (PC), we need to be careful about the control hazards introduced by these instructions.

<u>CB format CBZ instruction format:</u> (Note the opcode used as per define.v= 8'b00000111) Meaning: meaning: [Rt] == 0, then branch to the target location with respect to program counter value. Compare the content of register Rt with zero in the ALU stage and if they are same (ALU zero flag=1), PC



is updated to a new location whose address is "PC+ sign-extended(Address)", called as branch target address. Note that PC is the instruction's PC. Also note that CBZ uses PC relative addressing. The bit assignments to different fields of CB-format are shown below.

| opcode | Address | Rt    |
|--------|---------|-------|
| 31     | 24 23   | 5 4 0 |

#### Example:

1. CBZ X5, #3 (meaning: [X5] == 0, then branch to the location PC + (3). Here '3' is an immediate value. Note that here we are not multiplying the immediate address by 4 as the instruction memory is not byte addressing. The machine format is 07000065<sub>H</sub> (given below is binary format).

| 00000111 | 000000000000000011 | 00101 |
|----------|--------------------|-------|
| 31       | 24 23              | 5 4 0 |

Do note that the comparison of the value (X5 == 0) is done in the EXE cycle and the PC counter is also updated in the same EXE cycle as per Fig. 1. This indicates that there will be a penalty of two cycles after each CBZ for the CBZ to take a decision. Hence to go for a conservative way to remove control hazard, we need to insert two NOPs after every CBZ instruction to remove the control hazard.

<u>B format instruction format:</u> (Note the opcode used as per define.v= 6'b001000)

Meaning: meaning: B offset (Unconditional Branch to the target address which is calculated by "PC + sign-extended(Address[25:0])"). Please note that the PC is the instruction's PC. Unconditional branch also uses PC relative addressing mode. The bit assignments to different fields of B-format are shown below.

| opcode |       | Address |   |
|--------|-------|---------|---|
| 31     | 26 25 |         | 0 |

#### Example:

1. B #3 (meaning: branch to the immediate address 3 and the address to be inputted to the program counter is calculated as "PC+(3)". Here '3' is an address. Note that here we are not multiplying the immediate address by 4 as the instruction memory is not byte addressing. The machine format is 20000003<sub>H</sub> (given below is binary format).

| 001000 | 00 0000 0000 0000 0000 0000 0011 |   |
|--------|----------------------------------|---|
| 31     | 26 25                            | 0 |

Do note that the program counter (PC) is the instruction's PC of jump instruction and the targeting address will be updated to program counter in the EXE stage as per Fig. 1. This indicates that there will be a penalty of two cycles after each B for the B to take a decision, which is the same as CBZ. Hence to go for a conservative way to remove the control hazard, we need to insert two NOPs after every B instruction to remove the control hazard.

#### Details of the code given

In the first stage, program counter 'PC.v' provides address to instruction memory. Note that to calculate the targeting address, CBZ and B both require its own PC. Thus, we use a pipeline "IF\_IDstage.v" to store current PC and pass it to the following stages. In the second stage, the instruction is read, control signals are generated. For both CBZ and B instructions, the instruction is decoded to generate the control signals and to get the data from the RF (B instruction don't need the data from the RF). The third stage comprises of comparison of the content of registers using 'ALU.v'. Please note that a zero flag has been inserted to



ALU such that whenever the output of ALU=0, then the 'zero\_flag=1'. The zero flag result and the branch control signal are ANDed to generate the derived control signal 'PCSrc'. 'PCSrc' selects between the branch target address and nPC (PC+1) to update the program counter. The program counter is updated in execute stage and we need to add two NOPs to reduce the branch penalty cycles. The rest of the instructions works in the similar manner as explained in previous labs.

## **Analysis:**

- 1. Double click RTL schematic to view the circuit diagram for the five-stage pipelined implementation for CB and B type instructions. The RTL schematic should closely resemble Fig. 1.
- 2. Test the 5-stage pipelined datapath implementation by generating a test bench. Test bench can be generated in the 'simulation mode' by right clicking the top module and adding 'New source' to be 'Verilog test fixture'. Name the Verilog test bench and choose the corresponding top module to generate the test bench. When you do the simulation, ensure that you have targeted the testbench file (left-click the file in Hierarchy). Once the test bench is generated, you need to check the bit-width of the inputs and outputs, add the clock and the test inputs as shown below.

Note: You need to change the simulation run time. In simulation mode, click Process → Process Properties and change "Simulation Run Time" to 4000 ns.

- 3. Inserting CLK signal: Insert 'always #15 clk = ~clk;' before the 'initial' statement.
- 4. Add the following test vectors on the portion "//Add the stimulus"

#25 rst = 1;

#25 rst=0:

As there are some random instructions are already in IMEM (imem\_test0.txt) and random data in DMEM (dmem\_txt0.txt), we can simulate and see the functionality of the datapath.

- 5. You can add new instructions and data by modifying the text files for both IMEM and DMEM.
- 6. Analyse how PC counter is updated during the execution of a CBZ and B instruction? How many NOPs are inserted to remove hazards for both instructions?

#### Example for verification of CBZ and B

1. Convert the following pseudo code to LEGv8 instructions, to be executed in the given 5 stage pipelined datapath in Fig. 1.

You can use <a href="http://www.ntu.edu.sg/home/smitha/OPCoder/OPCoder/converter.html">http://www.ntu.edu.sg/home/smitha/OPCoder/OPCoder/converter.html</a> to convert the ARM instructions to machine code in binary or hexadecimal number system. Please do note to make corresponding changes to your opcode part of the web link (if needed) according to the opcode that you have used in 'define.v' file.

For your easiness to convert to LEGv8 code, we have provided the instructions and its hexadecimal conversions in Fig.2 and also given the same as "imem\_test\_ex.txt".



| 64 bit PC address (in | 32 bit instruction      | instructions      | Meaning                  |
|-----------------------|-------------------------|-------------------|--------------------------|
| hex)                  | from IMEM               |                   |                          |
| 0000000000000000      | 001F03FF                | NOP               |                          |
| 0000000000000001      | 00A06021                | LDUR X1, [X1, #6] | Load terminat. index "4" |
| 00000000000000000     | 00A07042                | LDUR X2, [X2, #7] | Load a const value "1"   |
| 0000000000000003      | 00A08063                | LDUR X3, [X3, #8] | Load a const value "5"   |
| 00000000000000004     | 00A00024                | LDUR X4, [X1, #0] | Load A[i]                |
| 00000000000000005     | 00030084                | ADD X4, X4, X3    | Add A[i] and "5"         |
| 00000000000000006     | 00C00024                | STUR X4, [X1, #0] | Store A[i]               |
| 0000000000000007      | 00220021                | SUB X1, X1, X2    | Subtract index by 1      |
| 80000000000000000     | 07000081(this           | CBZ X1, #4        | When termination         |
|                       | instruction may be      |                   | index=0, exit            |
|                       | different due to where  |                   |                          |
|                       | you want to branch. It  |                   |                          |
|                       | should be outside the   |                   |                          |
|                       | loop to exit out from   |                   |                          |
|                       | the loop)               |                   |                          |
| 00000000000000009     | 23FFFFFB(this           | B #-5             | Branch to PC address 4,  |
|                       | instruction may be      |                   | to load the next value.  |
|                       | different after you add |                   |                          |
|                       | NOPs)                   |                   |                          |

Fig. 2 Instructions in the IMEM

- 2. In order to verify the operation, the LEGv8 instructions for the above pseudocode are given in Fig. 2 and data values in Fig. 3 (given in "dmem\_test\_ex.txt"). Note that Array [A] starts from memory address location [4] of DMEM. Please note that all registers are initialized to zero during rest. Hence there is no need to reinitialize the loop pointer.
- 3. Choose the input files as "imem\_test\_ex.txt" in imemory.v and "dmem\_test\_ex.txt" in dmemory.v for the simulation. Do note that the set of instructions given in Fig. 2 has data and control dependencies. In order to get the correct result while running in a five stage pipelined architecture; we need to add NOPs where ever necessary to remove hazards. Once the NOPs are correctly inserted to remove the data dependencies and control depandancies, we will get the updated result in DMEM as can be seen in Fig. 4

| 64 bit address (in hex) | 64 bit data from DMEM |
|-------------------------|-----------------------|
| 0000000000000000        | 000000000000000       |
| 0000000000000001        | 000000000000000       |
| 00000000000000002       | 000000000000000       |
| 0000000000000003        | 0000000000000000      |
| 0000000000000004        | 0000000000000000      |
| 0000000000000005        | 000000000000000       |
| 00000000000000006       | 0000000000000004      |
| 0000000000000007        | 0000000000000001      |
| 0000000000000000        | 0000000000000005      |
| 0000000000000009        | 000000000000000       |
| 00000000000000A         | 0000000000000000      |
| 000000000000000B        | 0000000000000000      |

Fig. 3 Data in the DMEM

4. You can note that writing back to DMEM register is done using STUR. Note that we are not writing back to the text file. Hence the update is only visible at the DMEM register as shown below.



If all dependencies are removed then you can get the result as shown in Fig. 4.



Fig. 4 DMEM register after executing the modified instructions in Fig. 2 for data in Fig. 3

5. If NOPs are not correctly inserted the result written back to DMEM will be wrong.

### **EVALUATION**

- 1) Find the execution time of the instructions given in Fig. 2 by adding NOPs to remove both data and control dependencies.
- 2) Do loop unrolling by 2 and 4 (maximal loop unrolling) and reordering and find the total execution time. The results of both the programs should be the same as indicated in Fig. 4.