## HW 05: RISC-V with GCD Accelerator

Due: December 21st, 23:59

### Goals

In this assignment, you will implement GCD in Verilog and C, partitioning the application into hardware and software parts. For the hardware part, wrap your Verilog modules in two ways: one using the **Pico Co-Processor Interface (PCPI)** and the other using **memory-mapped I/O (MMAP)**. For the software part, compile your C code with the RISC-V toolchain. The program runs on PicoRV32, a CPU core that implements the RISC-V ISA. Profiling tools are provided so you can see how these architectures affect performance.

#### References

PicoRV32: https://github.com/cliffordwolf/picorv32 (MUST READ!!!)

RISC-V: https://riscv.org/specifications/

## Setup

- 1. Log in only to ic54~58. We have built the RISC-V cross-compiler toolchain on these machines under /users/course/2017F/cs412500/tools/riscv32i
- 2. Download the source code from iLMS
- 3. In the picorv32 directory, compile the program and run a Verilog simulation with:

```
make pcpi
```

- 4. You should see the following output:

```
EBREAK instruction at 0x000006C4
[register dump: pc and x1 through x31]
Number of fast external IRQs counted: 0
Number of slow external IRQs counted: 0
Number of timer IRQs counted: 20
TRAP after 317614 clock cycles
ALL TESTS PASSED.
```

## Directory Structure (Note: you only need to modify the orange files!)



# **Detailed Explanation of the Multest Example**

**start.S** is the entry point of our program. It contains the startup routine which is responsible for initializing and calling the rest of the program. If you don't want to run all functions in each simulation, undefine the corresponding macros at the top of the file.

Without underlying operating system support, we can't use the standard I/O library. Fortunately, *print.c* predefines some basic functions that handle I/O by writing values to address 0x10000000, leaving the actual output to *testbench.v*.

*multest.c* tests 4 kinds of RISC-V standard integer multiplication instructions, each in both a software and a hardware implementation. MUL performs a 32-bit multiplication and places the lower 32 bits in the destination register. MULH, MULHU, and MULHSU perform the same multiplication but return the upper 32 bits of the full 64-bit product, for signed-signed, unsigned-unsigned and signed-unsigned multiplication, respectively.
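
As a reading aid, the four results can be modeled on a host machine with 64-bit arithmetic. This is only a sketch of the instruction semantics described above, not the code in *multest.c*; the `ref_*` names are hypothetical.

```c
#include <stdint.h>

/* Reference models of the four multiply instructions (hypothetical names).
 * Operands are passed as raw 32-bit register values. Reinterpreting a value
 * as int32_t assumes a two's complement target, which every mainstream
 * compiler provides. */

/* MUL: lower 32 bits of the product (identical for signed and unsigned). */
uint32_t ref_mul(uint32_t a, uint32_t b) {
    return (uint32_t)((uint64_t)a * (uint64_t)b);
}

/* MULH: upper 32 bits of signed x signed. */
uint32_t ref_mulh(uint32_t a, uint32_t b) {
    int64_t p = (int64_t)(int32_t)a * (int64_t)(int32_t)b;
    return (uint32_t)((uint64_t)p >> 32);
}

/* MULHU: upper 32 bits of unsigned x unsigned. */
uint32_t ref_mulhu(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * (uint64_t)b) >> 32);
}

/* MULHSU: upper 32 bits of signed a x unsigned b. */
uint32_t ref_mulhsu(uint32_t a, uint32_t b) {
    int64_t p = (int64_t)(int32_t)a * (int64_t)b;
    return (uint32_t)((uint64_t)p >> 32);
}
```

With both operands set to 0xFFFFFFFF, the three upper-half variants give three different answers, which is a quick way to check that the signedness of each operand is handled correctly.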

When the compiler encounters a multiplication operator "\*" in a program, it looks up the corresponding routine in the GCC software library (libgcc) and generates assembly code that calls it. In contrast, hard\_mul() is a wrapper for the assembly instruction MUL. After picorv32 decodes the instruction, the picorv32\_pcpi\_mul PCPI submodule is activated, and the multiplication is performed by this hardware coprocessor.
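
To see why the software path costs more cycles, here is a minimal shift-and-add multiply in C. It illustrates the kind of loop a software multiplication routine has to execute on a core without a multiplier. It is an assumption-laden sketch, not the actual library code, and `soft_mul_sketch` is a hypothetical name.

```c
#include <stdint.h>

/* Shift-and-add multiplication: one loop iteration per bit of b.
 * Returns the lower 32 bits of a * b, like the MUL instruction. */
uint32_t soft_mul_sketch(uint32_t a, uint32_t b) {
    uint32_t result = 0;
    while (b != 0) {
        if (b & 1u) {
            result += a;   /* add the shifted multiplicand for each set bit */
        }
        a <<= 1;           /* next bit has twice the weight */
        b >>= 1;
    }
    return result;
}
```

Each iteration expands into several RISC-V instructions (test, add, two shifts, branch), so a single "\*" can take dozens of instructions in software.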

Run the simulation again, but this time also generate a trace log and a waveform file:

```
make pcpi_fsdb
```

To measure the latency of each instruction, we've inserted **tick()** calls to read the current CPU cycle count. The figure below is the final output. The first line displays the two integers we'd like to multiply. The following lines show the output value and the cycle count for each kind of multiplication. We can observe a substantial speedup with the hardware implementations.

#### Save the trace log in readable format:

```
python3 showtrace.py testbench.trace firmware/firmware.elf > trace.txt
```

Now we can examine the hexadecimal assembly in *trace.txt*. The 4 entries in each row are, from left to right, the destination, the current program counter, the hexadecimal instruction, and the decoded assembly instruction. The destination field starts with a symbol indicating the value type of the subsequent digits: ">" means "branch target", "@" means "load/store address" and "=" means "ALU or register output".

| dest.     | PC       | hex inst. | inst.                  |
|-----------|----------|-----------|------------------------|
| >00000a68 | 0000046c | 5fc000ef  | `jal ra,a68 <multest>` |
| =002f3f50 | 00000a68 | 7171      | `addi sp,sp,-176`      |
| @002f3ffc | 00000a6a | d706      | `sw ra,172(sp)`        |
| =00000470 | 00000a6a | d706      | `sw ra,172(sp)`        |
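
If you want to post-process *trace.txt*, the prefix symbol of the destination field can be decoded with a small helper like the one below. This is a sketch based only on the three symbols listed above; the enum and function names are hypothetical.

```c
#include <stddef.h>

/* Meaning of the first character of the destination field in trace.txt. */
enum trace_dest_kind {
    DEST_BRANCH_TARGET,   /* '>' */
    DEST_MEM_ADDRESS,     /* '@' load/store address */
    DEST_ALU_OR_REG,      /* '=' ALU or register output */
    DEST_UNKNOWN          /* anything else, including an empty field */
};

enum trace_dest_kind trace_dest_kind(const char *dest_field) {
    if (dest_field == NULL) {
        return DEST_UNKNOWN;
    }
    switch (dest_field[0]) {
    case '>': return DEST_BRANCH_TARGET;
    case '@': return DEST_MEM_ADDRESS;
    case '=': return DEST_ALU_OR_REG;
    default:  return DEST_UNKNOWN;
    }
}
```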

# **Memory Interface**

In this assignment, you need to design an accelerator dedicated to GCD calculation. In the test bench, both the CPU and the accelerator are connected to the same memory through PicoRV32's native memory interface, a valid-ready interface that can run one memory transfer at a time. Refer to the GitHub page of PicoRV32 for further information. The accelerator can be controlled through either PCPI or MMAP. Both are introduced in the following sections.

## **Memory Layout**

Some memory regions are reserved for special purposes. Their relationships are summarized in the table below:

| Address Range | Memory Region Description |
|---------------|---------------------------|
| 0x00000000~0x0000FFFF | Text region for our program |
| 0x00BF0000~0x00C00000 | Stack region. The stack pointer initially points to the end of the memory |
| 0x10000000 | The remaining addresses can be used for memory-mapped I/O. For instance, 0x10000000 is reserved for console output |
| 0x20000000 | start.S writes here to test the correctness of programs |
| *The following addresses are needed only for MMAP (Method 2)!* | |
| 0x40000000 | GCD_MMAP_READ_STATUS: read the status of GCD core |
| 0x40000004 | GCD_MMAP_READ_Y: read the GCD output |
| 0x40000008 | GCD_MMAP_WRITE_A: write first integer |
| 0x4000000C | GCD_MMAP_WRITE_B: write second integer |
| 0x40000010 | GCD_MMAP_START: write anything here to start calculation |
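
The layout above can be captured in a small address classifier, which is handy when reading a trace and asking "what did this store hit?". The sketch below hard-codes the ranges from the table; the enum and function names are hypothetical and do not appear in the provided sources.

```c
#include <stdint.h>

enum mem_region {
    REGION_TEXT,      /* 0x00000000 to 0x0000FFFF */
    REGION_STACK,     /* 0x00BF0000 to 0x00C00000 */
    REGION_CONSOLE,   /* 0x10000000 */
    REGION_TEST,      /* 0x20000000, written by start.S */
    REGION_GCD_MMAP,  /* 0x40000000 to 0x40000010, Method 2 only */
    REGION_OTHER
};

enum mem_region classify_address(uint32_t addr) {
    if (addr <= 0x0000FFFFu) return REGION_TEXT;
    if (addr >= 0x00BF0000u && addr <= 0x00C00000u) return REGION_STACK;
    if (addr == 0x10000000u) return REGION_CONSOLE;
    if (addr == 0x20000000u) return REGION_TEST;
    if (addr >= 0x40000000u && addr <= 0x40000010u) return REGION_GCD_MMAP;
    return REGION_OTHER;
}
```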

### Method 1: PCPI

The Pico Co-Processor Interface (PCPI) can be used to implement non-branching instructions in external cores. For the purpose of each signal, refer to the GitHub page of PicoRV32. PCPI cores can be connected inside or outside the CPU module. Three PCPI submodules have already been implemented in *picorv32.v*. Trace their code carefully and finish your own PCPI core in *pcpi.v*.
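
A PCPI core decides whether to accept an instruction by inspecting the bit fields of the instruction word. As an illustration, the C model below checks the standard RISC-V encoding of the multiply instructions (opcode 0110011, funct7 0000001, funct3 selecting the variant). It is a software sketch of the decode step only, with hypothetical names; the encoding of the GCD instruction is defined by the provided files and is not assumed here.

```c
#include <stdint.h>

enum mul_op { MUL_OP_NONE, MUL_OP_MUL, MUL_OP_MULH, MUL_OP_MULHSU, MUL_OP_MULHU };

/* Returns which multiply instruction the word encodes, or MUL_OP_NONE if a
 * multiply coprocessor should ignore it. */
enum mul_op decode_mul_insn(uint32_t insn) {
    uint32_t opcode = insn & 0x7Fu;          /* bits  6:0  */
    uint32_t funct3 = (insn >> 12) & 0x7u;   /* bits 14:12 */
    uint32_t funct7 = (insn >> 25) & 0x7Fu;  /* bits 31:25 */

    if (opcode != 0x33u || funct7 != 0x01u) {
        return MUL_OP_NONE;                  /* not an M-extension R-type */
    }
    switch (funct3) {
    case 0: return MUL_OP_MUL;
    case 1: return MUL_OP_MULH;
    case 2: return MUL_OP_MULHSU;
    case 3: return MUL_OP_MULHU;
    default: return MUL_OP_NONE;             /* funct3 4..7 are DIV/REM */
    }
}
```

For example, `mul a0,a0,a1` is encoded as 0x02B50533 and is accepted, while `add a0,a0,a1` (0x00B50533) differs only in funct7 and is rejected.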



### **Method 2: MMAP**

Devices that support MMAP are controlled through ordinary read and write accesses. Reads and writes to specific memory addresses are interpreted as control signals for register configuration or I/O. For instance, writing to 0x40000010 triggers the calculation of our GCD module in the diagram below.
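
On the software side, an MMAP driver usually boils down to a few volatile loads and stores. The sketch below follows the register offsets from the memory layout table and makes one loud assumption: a nonzero value in the status register means the result is ready. Check your own state machine for the real meaning. The function and enum names are hypothetical.

```c
#include <stdint.h>

/* Word offsets from the base address 0x40000000 (see the memory layout). */
enum {
    GCD_REG_STATUS = 0,  /* 0x40000000: GCD_MMAP_READ_STATUS */
    GCD_REG_Y      = 1,  /* 0x40000004: GCD_MMAP_READ_Y */
    GCD_REG_A      = 2,  /* 0x40000008: GCD_MMAP_WRITE_A */
    GCD_REG_B      = 3,  /* 0x4000000C: GCD_MMAP_WRITE_B */
    GCD_REG_START  = 4   /* 0x40000010: GCD_MMAP_START */
};

/* The base pointer is a parameter so the routine can also be exercised
 * against a plain array on a host machine. */
uint32_t gcd_mmap_sketch(volatile uint32_t *base, uint32_t a, uint32_t b) {
    base[GCD_REG_A] = a;
    base[GCD_REG_B] = b;
    base[GCD_REG_START] = 1;              /* any value starts the calculation */
    while (base[GCD_REG_STATUS] == 0) {
        /* ASSUMPTION: status stays 0 while busy and becomes nonzero when done */
    }
    return base[GCD_REG_Y];
}
```

On the target, the call would look like `gcd_mmap_sketch((volatile uint32_t *)0x40000000, a, b)`. The `volatile` qualifier matters: without it the compiler may hoist the status read out of the loop or drop the stores entirely.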



#### Run the simulation of MMAP example:

Comment out ENABLE\_GCD\_PCPI and uncomment ENABLE\_GCD\_MMAP in start.S, then run

make mmap

# **Problem Description**

Euclidean Algorithm Using Subtraction Only: http://www.naturalnumbers.org/EuclidSubtract.html
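
For reference, a minimal C sketch of the subtraction-only method is shown below: repeatedly subtract the smaller value from the larger until both are equal. The zero-input handling and the optional step counter are additions made for this sketch, and `gcd_subtract` is a hypothetical name, not one from the provided files.

```c
#include <stdint.h>
#include <stddef.h>

/* GCD by successive subtraction. If steps is non-NULL, it receives the
 * number of subtractions performed. */
uint32_t gcd_subtract(uint32_t a, uint32_t b, uint32_t *steps) {
    uint32_t n = 0;

    /* Guard: with a zero operand the loop below would never terminate.
     * By convention gcd(x, 0) = x. */
    if (a == 0 || b == 0) {
        if (steps != NULL) *steps = 0;
        return a != 0 ? a : b;
    }

    while (a != b) {
        if (a > b) {
            a -= b;
        } else {
            b -= a;
        }
        n++;
    }

    if (steps != NULL) *steps = n;
    return a;
}
```

The step counter makes the worst case visible: `gcd_subtract(1, 1000, &steps)` needs 999 subtractions, so the running time grows with the magnitude of the inputs, not with their bit width.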

# **Working Items**

- 1. Go through all the files and study the purpose of each one. Follow the **TODO** marks to complete the program.
- 2. Implement the hardware version of GCD in *gcd\_pcpi.v* and *gcd\_mmap.v*. The necessary signals have already been declared for you, but you must design your own state machine. The GCD module takes two 32-bit integers as inputs. Use the successive subtraction method to obtain the final result.
- 3. (For **Method 1**) Use the customized instruction hard\_gcd() in *firmware/gcd\_pcpi.c* to initiate the GCD accelerator.
- 4. (For **Method 2**) Access the MMAP address region in *firmware/gcd\_mmap.c* to control the GCD accelerator.
- 5. Implement the software version of GCD in *gcd\_pcpi.c* or *gcd\_mmap.c* for comparison.
- 6. Profile your program.

### **Questions & Discussion**

- 1. What are the advantages of running applications on a hardware accelerator?
- 2. Which version scales better as the GCD input values grow larger?
- 3. Given a specific application, under what circumstances does its software version outperform its hardware version?
- 4. Roughly describe the hardware architecture of PicoRV32 (picorv32.v).
- 5. Optional:
  - i. PicoRV32 has several Verilog module parameters (e.g., FAST\_MEMORY, etc.). Try different combinations of them and explain their functions.
  - ii. Discuss anything interesting you've discovered.