### Designing a Simple Pipelined RISC-V CPU

(coding the hardware in BSV)

or

## Learning BSV for Digital Design

(using a Simple Pipelined RISC-V CPU as a running example)

Rishiyur S. Nikhil Bluespec, Inc.



© 2023-2024 Rishiyur S.Nikhil

**DRAFT:** January 27, 2024, 12:30h EDT **Please do not circulate without the author's permission.** 

## Contents

| 1 | Intro | oduction                                                                                        | 1-1  |
|---|-------|-------------------------------------------------------------------------------------------------|------|
| 2 |       | RISC-V ISA; and apiling Programs for the ISA                                                    | 2-1  |
| 3 |       | gn Space for RISC-V interpreters: n Software Functional Simulators to High-Performance Hardware | 3-1  |
|   | 3.1   | The RISC-V designs in this book                                                                 | 3-2  |
|   | 3.2   | Abstract algorithm for interpreting an ISA                                                      | 3-3  |
|   | 3.3   | Plan for the order in which we tackle topics                                                    | 3-4  |
| 4 | BSV   | : Combinational circuits for the RISC-V step functions                                          | 4-1  |
|   | 4.1   | Introduction                                                                                    | 4-1  |
|   | 4.2   | BSV: Bit Vectors                                                                                | 4-2  |
|   | 4.3   | BSV: Boolean values                                                                             | 4-3  |
|   |       | 4.3.1 BSV: Caution: Bool and Bit#(1) are different types                                        | 4-4  |
|   |       | 4.3.2 BSV: Example: recognizing legal RISC-V BRANCH instructions                                | 4-4  |
|   |       | 4.3.3 BSV: Combinational circuits and primitives                                                | 4-5  |
|   | 4.4   | BSV: Functions                                                                                  | 4-6  |
|   | 4.5   | BSV: A small testbench to test our code                                                         | 4-7  |
|   | 4.6   | $BSV: {\tt enum}\ types$                                                                        | 4-9  |
|   |       | 4.6.1 deriving (Bits)                                                                           | 4-9  |
|   |       | 4.6.2 deriving (Eq)                                                                             | 4-10 |
|   |       | 4.6.3 deriving (FShow)                                                                          | 4-10 |
|   | 4.7   | BSV: Syntax of Identifiers                                                                      | 4-10 |
|   | 4.8   | BSV: Syntax of comments                                                                         | 4-10 |
|   | 4.9   | BSV: if-then-else statements and hardware multiplexers                                          | 4-11 |
|   |       | 4.9.1 Parallel multiplexers and MUX synthesis                                                   | 4-12 |
|   | 4.10  | BSV: Sharing code for RV32 and RV64 <i>via</i> parameterization                                 | 4-14 |
|   |       | 4.10.1 BSV: Numeric types                                                                       | 4-14 |
|   |       | 4.10.2 BSV: Type synonyms                                                                       | 4-15 |
|   |       | 4.10.3 BSV: The numeric value corresponding to a numeric type                                   | 4-15 |
|   |       | 4.10.4 BSV: Conditional compilation                                                             | 4-15 |

4 CONTENTS

| 5 | Stru<br>Mer | ${ m (ct\ types\ (BSV)} \ { m nory\ requests\ and\ responses\ (RISC-V)}$ | 5-1 |
|---|-------------|--------------------------------------------------------------------------|-----|
|   | 5.1         | RISC-V: structs communicated between steps                               | 5-1 |
|   | 5.2         | BSV: struct types                                                        | 5-2 |
|   |             | 5.2.1 Creating struct values                                             | 5-3 |
|   |             | 5.2.2 Selecting struct fields                                            | 5-4 |
|   |             | 5.2.3 Updating struct fields using assignment                            | 5-5 |
|   | 5.3         | RISC-V: Memory Requests and Responses; IMem and DMem                     | 5-5 |
|   |             | 5.3.1 Separation of IMem and DMem (Harvard Architecture)                 | 5-5 |
|   |             | 5.3.2 RISC-V: Memory Requests                                            | 5-6 |
|   |             | 5.3.3 RISC-V: Address Alignment                                          | 5-6 |
|   |             | 5.3.4 RISC-V: Memory Responses                                           | 5-8 |
|   | 5.4         | RISC-V: The Decode function                                              | 5-8 |
| 6 | The         | Fetch function: memory requests and responses                            | 6-1 |
|   | 6.1         | Introduction                                                             | 6-1 |
|   | 6.2         | RISC-V: the result-type of the Fetch function                            | 6-2 |
|   | 6.3         | RISC-V: The Fetch Function                                               | 6-2 |
|   |             | 6.3.1 BSV: Don't-care values                                             | 6-3 |
| 7 | BSV         | 7: Modules and Interfaces;                                               |     |
|   |             | m and Fife Interface and Module Skeleton                                 | 7-1 |
|   | 7.1         | Introduction                                                             | 7-1 |
|   | 7.2         | BSV: Modules: state, behavior and interfaces                             | 7-1 |
|   |             | 7.2.1 BSV: Interface declarations                                        | 7-2 |
|   |             | 7.2.2 BSV: Module declarations                                           | 7-3 |
|   | 7.3         | BSV: Example: Registers                                                  | 7-3 |
|   |             | 7.3.1 BSV: Reg#(t), the register interface                               | 7-3 |
|   |             | 7.3.2 BSV: $mkReg(v)$ , a register module (consructor)                   | 7-4 |
|   |             | 7.3.3 BSV: Registers are strongly typed                                  | 7-4 |
|   |             | 7.3.4 BSV: Syntactic shorthands for register access                      | 7-5 |
|   | 7.4         | BSV: Example: FIFOs                                                      | 7-5 |
|   |             | 7.4.1 BSV: FIF0F#( $t$ ), the FIFO interface                             | 7-6 |
|   |             | 7.4.2 BSV: mkFIF0F, a fifo module (constructor)                          | 7-6 |
|   |             | 7.4.3 BSV: FIFOFs are strongly typed                                     | 7-7 |
|   |             | 7.4.4 BSV: Semi-FIFO interfaces for each end of a FIFO                   | 7-7 |
|   | 7.5         | BSV: Example: Register-files                                             | 7-7 |
|   |             | 7.5.1 RSV: Porfila#(+1, +0), the register file interface                 | 7 7 |

|    |      |                                                                | 5    |
|----|------|----------------------------------------------------------------|------|
|    |      | 7.5.2 BSV: mkRegFileFull, a register-file module (constructor) | 7-8  |
|    |      | 7.5.3 BSV: What about the RISC-V register x0?                  | 7-8  |
|    | 7.6  | RISC-V: The interface for the Drum and Fife CPU modules        | 7-9  |
|    | 7.7  | RISC-V: A skeleton CPU module for Drum                         | 7-9  |
|    | 7.8  | BSV: Interface-transformer functions                           | 7-10 |
| 8  |      | 7: Finite State Machines (FSMs); Drum CPU                      | 8-1  |
|    | 8.1  | Introduction                                                   | 8-1  |
|    | 8.2  | BSV: Behavior in Modules                                       | 8-2  |
|    | 8.3  | BSV: Finite State Machines (FSMs)                              | 8-2  |
|    |      | 8.3.1 Sequential FSMs, Concurrent FSMs, and Digital Hardware   | 8-2  |
|    | 8.4  | BSV: StmtFSM                                                   | 8-3  |
|    |      | 8.4.1 BSV: Actions and the Action type                         | 8-3  |
|    |      | 8.4.2 BSV: StmtFSM: sequences of actions                       | 8-4  |
|    |      | 8.4.3 BSV: StmtFSM: if-then-elses                              | 8-4  |
|    |      | 8.4.4 BSV: StmtFSM: while-loops                                | 8-5  |
|    |      | 8.4.5 BSV: StmtFSM: mkAutoFSM: a simple FSM module constructor | 8-5  |
|    | 8.5  | RISC-V: The Drum CPU module                                    | 8-5  |
|    | 8.6  | RISC-V: Compiling Drum BSV code to Verilog                     | 8-6  |
|    | 8.7  | RISC-V: Compiling and linking Verilog using Verilator          | 8-6  |
|    | 8.8  | RISC-V: Verilog simulation of Drum                             | 8-6  |
|    | 8.9  | RISC-V: Implementing Drum Verilog on an FPGA                   | 8-6  |
| 9  | Pend | ding (to be written): Drum                                     | 9-1  |
|    | 9.1  | Drum: Remaining functions                                      | 9-1  |
|    | 9.2  | Drum: Traps, and basic CSRs                                    | 9-2  |
|    | 9.3  | Drum: Interrupts                                               | 9-2  |
|    | 9.4  | Measurements for Performance Analysis                          | 9-2  |
| 10 | Peno | ding (to be written): Fife                                     | 10-1 |
|    | 10.1 | Fife: pipelined execution                                      | 10-2 |
|    | 10.2 | PC-prediction (control speculation)                            | 10-3 |
|    | 10.3 | Register read/write hazards in Fife                            | 10-3 |
|    | 10.4 | Speculative vs. non-speculative memory ops                     | 10-3 |
|    | 10.5 | Fife: CSRs                                                     | 10-4 |
|    | 10.6 | Fife: Interrupts                                               | 10-4 |

6 CONTENTS

| 11           | Pend                                       | ling (to be written): Advanced topics, possibly in "Book 2"?   | 11-1  |  |  |  |  |  |  |  |
|--------------|--------------------------------------------|----------------------------------------------------------------|-------|--|--|--|--|--|--|--|
|              | 11.1                                       | Advanced branch prediction                                     | 11-1  |  |  |  |  |  |  |  |
|              | 11.2                                       | Caches                                                         | 11-1  |  |  |  |  |  |  |  |
|              | 11.3                                       | Compressed instructions                                        | 11-2  |  |  |  |  |  |  |  |
|              | 11.4                                       | AMO operations                                                 | 11-2  |  |  |  |  |  |  |  |
|              | 11.5                                       | Multiple Privilege levels                                      | 11-2  |  |  |  |  |  |  |  |
|              | 11.6                                       | Memory Protection with PMPs                                    | 11-2  |  |  |  |  |  |  |  |
|              | 11.7                                       | Virtual Memory                                                 | 11-2  |  |  |  |  |  |  |  |
|              |                                            | 11.7.1 RISC-V ISA Formal Specification                         | 11-2  |  |  |  |  |  |  |  |
| A            | Reso                                       | ources: Documents and Tools                                    | A-1   |  |  |  |  |  |  |  |
|              | A.1                                        | GitHub                                                         | A-1   |  |  |  |  |  |  |  |
|              | A.2                                        | RISC-V ISA (Instruction Set Architecture) Specifications       | A-1   |  |  |  |  |  |  |  |
|              | A.3                                        | RISC-V Assembly Language Manuals                               | A-1   |  |  |  |  |  |  |  |
|              | A.4                                        | RISC-V GNU tools, including riscv-gcc compiler                 | A-2   |  |  |  |  |  |  |  |
|              | A.5                                        | BSV                                                            | A-2   |  |  |  |  |  |  |  |
|              |                                            | A.5.1 "BSV By Example" book (free downloadable PDF)            | A-3   |  |  |  |  |  |  |  |
|              |                                            | A.5.2 <b>BSV</b> Tutorial                                      | A-3   |  |  |  |  |  |  |  |
|              |                                            | A.5.3 MIT Course Material                                      | A-4   |  |  |  |  |  |  |  |
|              |                                            | A.5.4 University of Cambridge Examples                         | A-4   |  |  |  |  |  |  |  |
|              |                                            | A.5.5 $bsc$ download and installation; $bsc$ and $BSV$ manuals | A-4   |  |  |  |  |  |  |  |
|              | A.6 Verilator (or other Verilog simulator) |                                                                |       |  |  |  |  |  |  |  |
|              | A.7                                        | Amazon AWS                                                     | A-5   |  |  |  |  |  |  |  |
|              | A.8                                        | Xilinx Vivado                                                  | A-6   |  |  |  |  |  |  |  |
|              | A.9                                        | RISC-V textbooks                                               | A-6   |  |  |  |  |  |  |  |
| В            | Why                                        | BSV?                                                           | B-1   |  |  |  |  |  |  |  |
|              | B.1                                        | Why BSV instead of some other Hardware Design Language?        | B-2   |  |  |  |  |  |  |  |
|              |                                            | B.1.1 A better computational model                             | B-3   |  |  |  |  |  |  |  |
|              |                                            | B.1.2 Modern language features                                 | B-4   |  |  |  |  |  |  |  |
|              |                                            | B.1.3 Comparison with C++-based High Level Synthesis           | B-4   |  |  |  |  |  |  |  |
| $\mathbf{C}$ | Glos                                       | sary                                                           | C-1   |  |  |  |  |  |  |  |
| Inc          | dex of                                     | f BSV topics INDEX-B                                           | SSV-1 |  |  |  |  |  |  |  |
| Bil          | bliogr                                     | raphy I                                                        | BIB-1 |  |  |  |  |  |  |  |

## Chapter 1

## Introduction

"Digital Design" and "CPU Design" (or "Computer Architecture") are traditionally taught separately, usually in that order, with separate textbooks. Digital Design is usually taught using one of the traditional hardware design languages Verilog, SystemVerilog or VHDL, and often makes use of small, often artificial examples. CPU Design is often taught without actually designing hardware, relying instead on textbooks, abstract schematics, and simulators implemented in software.

This book takes a different approach: we learn about simple CPU architectures by designing them with a modern Hardware Design Language (HDL) called BSV, learning Digital Design as an ongoing, intertwined accompanying topic. Each Digital Design example will be taken directly from the CPU Design, so that the example's use-case (context) is always perfectly clear, and the reader always has a clear sense of the purpose of the example.

The CPU we design here will execute instructions from the RISC-V Instruction Set Architecture (ISA), which is an industrial-strength ISA (with many commercial implementations). Our designs will be simple (typical of small, embedded systems and micro-controllers, not laptops/workstations or servers). Nevertheless, it will be powerful enough to execute Linux, an industrial-strength operating system.

Figure 1.1 shows the plan for topics covered in this book.

- The first step is to understand the RISC-V ISA itself. What are RISC-V instructions, how are they coded in bits, and what do they mean? This topic is not a focus of this book (for which there are plenty of textbooks available), but understanding the ISA is of course a prerequisite to informing our design. The RISC-V ISA has many options; our focus will be on a "standard" suite:
  - From the RISC-V Unprivileged ISA spec: basic integer arithmetic and logic operations; branch and jump; load and store; integer multiply and divide; atomic memory operations; floating-point operations; compressed instructions (so-called RV32IMAFDC and RV64IMAFDC).
  - From the RISC-V Privileged ISA spec: handling traps and interrupts; Control and Status Registers (CSRs); Machine, Supervisor and User Privilege levels.
- In order to run actual RISC-V programs on our implementations, we need to undertand how to use the *riscv-qcc* compiler to compile C and RISC-V Assembly Language



Figure 1.1: Topics covered (in red text in red box)

programs into RISC-V binaries (so-called "ELF" files). Another useful tool is *riscv-objdump*, which can disassemble the binary back into assembly-language text. This is useful for debugging our implementation, so that we can understand execution instruction-by-instruction, and diagnose anything that goes wrong.

So, far, all this is not implementation-specific, i.e., it is generic information about RISC-V.

- A RISC-V CPU and system can be *modeled* in a simulator coded in C (say). Such a C-based simulator is compiled (with *gcc*, say) and run like any other C program. We will not be discussing this much in this book.
- We will code our hardware design in the BSV HDL. We will use BSV not just for the CPU itself, but also for the "system" components around it: an interconnect, Memory, UART and GPIO. Later we will discuss MMUs (Memory Management Units) and Caches, and possibly other devices and accelerators.
- We will learn how to use the bsc compiler to translate our BSV code into Verilog RTL.
- We will learn how our Verilog RTL can directly be simulated in a Verilog simulator. We will use the free, open-source "Verilator" simulator, but you can also run it on any other Verilog simulator, available from a number of providers.

This will provide an exact, cycle-by-cycle accurate simulation of the very same design that we'll run later on an FPGA. This is invaluable for debugging the hardware design, because the turnaround time to fix a problem and run a new simulation is very short (minutes) compared to creating a new version for an FPGA (several hours).

Of course, Verilog simulation will run much more slowly (10,000x or more slower) compared to an FPGA, and so is useful primarily for early debugging and analysis of the design, running on small RISC-V programs.

- When we execute our Verilog RTL hardware design in Verilog simulation (where the
  hardware design itself is executing a RISC-V binary program), it will produce a trace
  file describing events during the simulation. We will learn how to analyze these traces
  to identify bugs and bottlenecks in our design, from which we can correct design errors
  and possibly improve performance.
- We will learn now to process our Verilog RTL through an FPGA synthesis tool to create an FPGA bitfile which can then be loaded into an FPGA and executed.
   Although it can be synthesized and run on a number of FPGAs from different vendors, in this book we'll discuss how to build and run it for an FPGA on the Amazon AWS cloud.
- Our Verilog RTL can also be processed through ASIC synthesis tools targeting ASIC fabrication. We will not be dissussing this much in this book.

We will create two hardware designs. The first design is "Drum", a non-pipelined implementation which will familiarize us all the basic concepts and flows (the RISC-V ISA, preparing and running a RISC-V binary to run on the design, analysing traces), without being distracted by the complexities of pipelineing for high performance. The second design is "Fife", which has a five-to-six stage, mostly in-order pipeline, which is a microarchitectural change focused on higher performance (speed) than Drum. Both will execute exactly the same binaries; the only difference will be in Fife's superior performance (speed). Both designs will share a large part of the BSV code implementing the essential functionality for executing the RISC-V ISA.

As we work through the two designs, we will concurrently learn how to code in BSV, the HDL for our designs. BSV is a modern, high-level HDL taking inspiration from modern software programming languages, in particular the Haskell functional programming language and a class of formal specification languages for concurrent programming (including Term Rewriting Systems, Unity, TLA+, and Event-B). BSV is not just for CPU design; like Verilog and SystemVerilog, it is a "universal" language for any digital design, whether related to CPUs or not.

Please see Appendix A for a detailed listing of resources (documents and software tools) needed for this book.

## Chapter 2

# The RISC-V ISA; and Compiling Programs for the ISA

This chapter contains only generic RISC-V information, not specific to this book. This chapter can be safely skipped by those already familiar with these generic topics, or who choose to learn it elsewhere:

- RISC-V ISA specification documents. Please see Appendix A.2 for links.
- RISC-V Assembly Language manuals Please see Appendix A.3 for links.
- RISC-V Gnu Tools and related documentation. Please see Appendix A.4 for links.
- RISC-V textbooks. Please see Appendix A.9 for links.

Lecture slides exist for this chapter; need to render into narrative text.

NOTE:

This chapter will be written later, since there is much information available in the given links.

## Chapter 3

## Design Space for RISC-V interpreters: From Software Functional Simulators to High-Performance Hardware

Any artefact/engine that executes the instructions of any ISA is an *interpreter* for that ISA. The classical meaning of an interpreter is an algorithm (program) that examines/traverses a data structure that is itself the represention of a target program, and performs actions accordingly. In our case, the target program is a RISC-V binary and the data structure is an array or RISC-V instructions. The algorithm examines RISC-V instructions in the array, conceptually one-instruction-at-a-time, and performs the instruction's actions.

Any algorithm can be implemented in software or in hardware. Further, the boundary is fluid: parts of the algorithm can be implemented in software, cooperating with other parts that are implemented in hardware ("accelerators"). The choice between software and hardware implementation is pragmatic (speed, power, cost, cost of debugging and modification, cost of redesign, etc.); functionally there is no theoretical difference.

When we implement an ISA interpreter in software, we call it a "simulator". When we implement it in hardware, we call it a hardware implementation. Both software simulators and hardware implementations can vary widely in microarchitecture. Some design options are:

- Sequential or pipelined? One full instruction at-a-time, or multiple instructions flowing through a pipe, each at a more advanced step in its execution than the one behind it.
- Predictive (in pipelined implementations)? *E.g.*, predict what instructions to fetch while a BRANCH/JUMP flows through the pipe before we know the actual next-instruction determined the BRANCH/JUMP.
- Superscalar/VLIW? Fetch and execute more than one instruction in parallel, taking care to preserve sequential ISA semantics.
- Out-of-order? Execute each instruction as soon as its input data is available, without waiting for prior instructions which may still be waiting for their inputs.

For the same microarchitecture, a software simulator is typically *much slower* than a hardware implementation. This is because it involves (at least) two layers of simulation. The software simulator is itself a program that is being interpreted, perhaps directly in hardware. That program (the simulator), in turn, is interpreting the target ISA. The two interpreters need not and may not be for the same ISA. For example, if we run a RISC-V software simulator on a modern server, the lower level may be an x86 or ARM interpreter (*i.e.*, the CPU in in the server). A software simulator written in Python or Java involves three layers of ISAs, *e.g.*, hardware x86/ARM interpreting x86/ARM instructions representing a program to interpret bytecode (second level ISA), which, in turn represents an interpreter for RISC-V programs. Every additional layer of interpretation can slow down overall performance by possibly orders of magnitude.

Paradoxically, adding any of the microarchitectural details mentioned in the list above will normally slow down a software simulator but speed up a hardware implementation. This is because those microarchitetural details expose more parallelism and concurrency in the interpretation algorithm. Hardware implementations actually execute these parallel actions in parallel, whereas a software simulator (written, say, in C/C++) may execute them sequentially (i.e., modeling parallelism but in fact being sequential). Of course, the extra hardware speed is not free: it needs more hardware and more complexity in the design (cost, power consumption).

#### 3.1 The RISC-V designs in this book

In this book we will focus on two simple hardware implementations, Both designs are coded in BSV, a free, open-source, modern, High-Level Hardware Design Language (HLHDL). BSV code can be compiled into Verilog, which can then be run on any Verilog simulator, or can be further processed by FPGA tools to run on FPGAs, or by ASIC tools for ASIC implementations. For more discussion of our choice of BSV, please see Appendix B.

Our first hardware RISC-V implementation—"Drum"—will be a simple one-full-instruction-at-a-time interpereter, almost a direct transliteration into BSV code of the generic ISA execution algorithm to be described next in Section 3.2. It does not implement any interesting microarchitectural feature, not even pipelining, which is the most basic microarchitectural feature of most CPU implementations. Lacking microarchitectural features, in fact the BSV code will look very similar to what you might write in C/C++ for a purely functional RISC-V simulator. Being written in BSV, however, we can compile and run it on actual hardware (FPGAs, ASICs).

Drum will not be fast compared to other hardware CPUs, because of lack of microarchitectural features, but we should still be able to run it at several 100 MHz on an FPGA, which will make it faster than many software functional simulators. It will be small (silicon area, and therefore low power as well). Drum is covered from Chapter 5.4 through Chapter 9.

Our second implementation—"Fife"—adds pipelining. Pipelining introduces new complications because of potential interaction between instructions that are at different stages in the pipe. We can focus on these new complications because all the functional aspects of RISC-V ISA execution have already been addressed in Drum. In fact, we will reuse the functional code from Drum without change. Drum is covered from Chapter 10 through Chapter 10.

For both Drum and Fife, we will focus initially on only the RV32I option of the RISC-V ISA. Please refer to the specification document "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA" [14]. In particular, look at Chapter 24 "RV32/64G Instruction Set Listings", and the first table therein, entitled "RV32I Base Instruction Set", showing forty instructions. These instructions are describe in more detail in the same document in Chapter 2 "RV32I Base Integer Instruction Set, Version 2.1".

We will extend this with just enough functionality to be able to recover from illegal instructions (*i.e.*, an instruction outside the set of forty RV32I instructions) and to handle interrupts. This minimal functionality will be taken from the specification document "The RISC-V Instruction Set Manual Volume II: Privileged Architecture" [15].

Beyond this book, we extend Drum and Fife to handle RV64I and more Unprivileged ISA options—M: integer multiply/divide, A: atomics, FD: single-and double-precision floating point, and C: compressed. We also handle more privileged ISA options—Privilege levels (M: Machine, S: Supervisor and U:User; full complement of Control and Status Registers (CSRs); Virtual Memory). With these extensions, Drum and Fife will be able to a full-feature Operating System (OS), such as Linux.

#### 3.2 Abstract algorithm for interpreting an ISA

From our previous study of the RISC-V ISA, we know that the basic integer "architectural state" of a RISC-V CPU is very simple:

- A "program counter" (PC) indicating the address in memory of the next instruction to be executed.
- A "register file" consisting of 32 general purpose registers (GPRs), each containing data.

The PC and each register are either 32-bits wide (in the RV32 option of RISC-V) or 64-bits wide (in the RV64 option). For simplicity, we'll focus on RV32 here, but everything we discuss also applies to RV64.

Interpreting a program involves the repetition of a few simple steps,<sup>1</sup> illustrated in Figure 3.1:

- The "Fetch" step reads the current value of the PC and uses that value as an address in memory from which to read an instruction. Then, we proceed to the "Decode" step.
- The "Decode" step examines the fetched instruction to check if it is legal, to classify its major category (such as Control, Integer Arithmetic/Logic, or Memory), and to extract some properties such as which GPRs it reads (if any) and which GPR it writes (if any). Then, we proceed to the "Register-Read and Dispatch" step.
- The "Register-Read and Dispatch" step reads the GPRs for the instruction's inputs. Then, we proceed to one of the "Execute" steps, based on the category of the opcode in the instruction (Branch/Jump, Integer Arithmetic/Logic, or Memory).

<sup>&</sup>lt;sup>1</sup>We prefer the word "step" here instead of "stage", which we will reserve to refer to stages in a hardware pipeline such as Fife.



Figure 3.1: Simple interpretation of RISC-V instructions

- The "Execute Control" step is used for conditional-branch and jump instructions. For the former it evaluates the branch condition and, if true, and updates the PC to the branch-target PC. For jump instructions it updates the PC to the jump-target PC. Then, it goes back to the Fetch step to interpret the next instruction.
- The "Execute Integer Arithmetic and Logic" step is used for integer arithmetic and logic operations (addition, subtraction, boolean ops, shifts, *etc.*). Then, we proceed to the "Register-Write and Dispatch" step.
- The "Execute Memory Ops" step calculates a memory address based on an input value (that was read from a GPR) and reads or writes memory at that address. Then, we proceed to the "Register-Write and Increment PC" step.
- The "Register-Write and Increment PC" step writes the result from the previous Execute step back into a GPR, and increments the PC. Then, it goes back to the Fetch step to interpret the next instruction.

Thus we repeat these steps forever, instruction after instruction, starting each time at the Fetch step.

### 3.3 Plan for the order in which we tackle topics

This book serves two concurrent purposes: learning how to implement the RISC-V ISA and, specifically, how to implement it by coding it in BSV ("BSV learning"). The order in which we tackle topics is guided by the BSV-learning purpose, not by the step-by-step organization of Figure 3.1.

At the center of each step in Figure 3.1 are pure functions to decide what kind of instruction each 32-bit instruction is, perform arithmetic instructions, calculate addresses in

memory, calculate conditions on whether to branch or not, etc. These pure functions are "combinational" functions, which we tackle in the next couple of chapters.

Note, we are *not* going to descend to the level of simple logic gates, how to optimize them, or how to implement higher-level combinational functions such as adders and multiplexers in terms of gates. These activities are today routinely handled by excellent compilers ("synthesis tools"). Our lowest-level combinational circuits, the ones we take as primitives, will be so-called "RTL-level" operators for arithmetic, shifts and logic operators on bitvectors (+, -, <<, >>, &&, |||, ^, !, verb||, and so on).

## Chapter 4

# BSV: Combinational circuits for the RISC-V step functions

#### 4.1 Introduction

It is useful to start with the Decode step of Figure 3.1 because it involves bit-vectors, operations on bit-vectors, conditionals to classify instructions into classes, and enum types to name and encode instruction classes.

The inputs to the Decode step as depicted in Figure 3.1 are:

- A 32-bit piece of data—a RISC-V instruction—that has become available by reading it from memory at the PC address.<sup>1</sup>
- Any additional information passed on from the Fetch step.

The outputs of the Decode step have information needed by the next step (Register-Read and Dispatch). For a RISC-V instruction, useful information includes:

- Was the Fetch itself successful, or did it encounter a memory error; if so, what kind of memory error?
- Is it a legal 32-bit instruction?
- If legal, what is its broad classification: Control (Branch or Jump)? Integer Arithmetic or Logic? Memory Access? This will help in choosing the next step to which we must dispatch to execute the instruction.
- Does it have zero, one or two input registers? If so, which ones? This will help the next step in reading registers.
- Does it have zero or one output registers? If so, which one? This will help the final Register Write step in writing back a value to a register.

To compute these values, we need to examine "slices" of the 32-bit instruction ("bit vector"), such as the 7-bit "opcode" slice, the 5-bit "rs1", "rs2" and "rd" slices, and so on. We need to be able to compare these slices to constants (e.g., "Is the opcode a BRANCH opcode?"). We need to do things conditionally, e.g., if it is a BRANCH instruction, then it has an rs1 and rs2 slice but no rd slice, but if it is a JAL instruction it has an rd slice but no rs1 or

 $<sup>^{1}</sup>$ When implementing the so-called "C" RISC-V ISA extension ("compressed instructions"), instructions can also be 16-bits, but we ignore that for now.

rs2. Finally, as in all good programming languages, we'd need to be able to package all this functionality inside a "function" with clearly specified input(s) and output(s). In the next several sections—4.2, 4.3, 4.4, 4.9, —we will learn the BSV concepts needed to code these ideas.

#### 4.2 BSV: Bit Vectors

In BSV, as in many programming languages, every value has a *type*. The simplest, and lowest-level type in BSV is the bit-vector (a vector made up of a particular number bits). Later we will see that in BSV one can define more abstract types such as integers, booleans, vectors and arrays, lists, structs (records), tagged unions (algebraic types), trees, and so on. However, ultimately, any such value is represented in hardware as a bit-vector.

The BSV statement:

```
Bit #(32) pc_val;
```

declares the identifier pc\_val to have the type Bit#(32), *i.e.*, a bit-vector of 32 bits. The general syntax is similar to C or Verilog:

type identifier;

The BSC type Bit#(32) is roughly equivalent to the C type uint32\_t. Unlike C, where only a few sizes are available, all multiples of 8 bits—(uint8\_t, uint16\_t, uint32\_t and uint64\_t)—bit-vectors in BSV can have any size (Bit#(3), Bit#(51), Bit#(512), ...).

The bits in a BSV bit-vector of size n are indexed from n-1 (most-significant bit) to 0 (least-significant bit). You can extract a *slice* of a bit-vector using usual Verilog notation:

```
Bit #(32) pc_val;
Bit #(12) page_offset = pc_val [11:0];
```

In the second line, we extract 12 bits of pc\_val to get a bit-vector of size 12. BSV is *strongly typed* with respect to sizes, *i.e.*, it is very strict about matching sizes. For example, this statement:

```
Bit #(12) page_offset = pc_val [10:0];
```

will be reported as a type-error by the *bsc* compiler because the slice-expression on the right-hand side has type Bit#(11) which does not match the declared type Bit#(12).

BSV bit-vectors can be compared for equality and inequality. BSV bit-vectors are synonymous with unsigned integers, and so a number of other operations are also available on bit-vectors. Examples:

```
Bit \#(12) x, a, b, c, d, e, f;
1
2
       // Comparison ops: result type is Bool
3
       if (a == b) ...;
                               // equality
4
       if (a != b) ...;
                               // not-equal to
       if (a < b) ...;
                               // less-than
       if (a <= b) ...;
                               // less-than-or-equal-to
       if (a > b) ...;
                               // greather-than
       if (a >= b) ...;
                               // greater-than-or-equal-to
9
10
       // Arithmetic ops: result type is Bit #(12)
11
       x = a + b - c * d;
                               // add, subtract, multiply
12
       // Bitwise logic ops: result type is Bit #(12)
            AND OR
                      unary INVERT
                                      XOR XNOR XNOR
15
                                           d ^~ e ~^ f;
       x = a & b |
                         (~ c)
16
17
       // Shifts
18
       x = (a << 3) & (b >> 14);
                                     // left- and right-shift
19
```

Please see the BSV Language Reference Guide [1], Section 10.3, "Unary and binary operators" for a full list of available unary and binary operators. Unlike Haskell, in BSV you cannot define new unary or binary infix operators.

In such expressions, as usual bit-vector sizes must match exactly, else we'll get a type error, e.g., we cannot compare a Bit#(12) value with Bit#(11) value. Unlike C and Verilog, BSV does not implicitly extend or truncate bit-vectors to match sizes.

Two functions are available to zero-extend and truncate bit-vectors.

```
Bit #(12) a;
1
      Bit #(10) b;
2
                                       // Type error: mismatched sizes
      b = a;
3
      a = b;
                                       // Type error: mismatched sizes
                                      // Ok; truncates a to Bit #(10), then assigns
      b = truncate (a);
                                      // Ok; extends b to Bit #(12), then assigns
      a = zeroExtend (b);
6
      if (a == zeroExtend (b)) ...
                                      // Ok
7
                                      // Ok
      if (truncate (a) < b) ...
```

The functions truncate() and zeroExtend() are *polymorphic* in that they will truncate/extend by the appropriate amount as demanded by the context.

#### 4.3 BSV: Boolean values

In BSV, Bool is the type of a Boolean value. It has the usual boolean operators && (Boolean/logical AND), | | (Boolean/logical OR) and ! (Boolean/logical NOT).

#### 4.3.1 BSV: Caution: Bool and Bit#(1) are different types

BSV is unlike languages like C and Python which are very loose about what can be used as a boolean value. For example in C, any non-zero numeric value or pointer is considered "True".

In BSV, Bool and Bit#(1) are *distinct* types, *i.e.*, *bsc*'s type-checking will complain if one is used where the other is expected. This is because not all Bit#(1) values are meaningful as Boolean values.

The Boolean/logical operators mentioned above (such as &&) operate on Bool types and are distinct from the bit-wise logic operators mentioned earlier (such as &), which operate on Bit#(n) types.

Note that bitwise comparison operators, such as in the example if (a <= b) ... shown in Section 4.2 above, take Bit#(n) arguments and produce Bool results.

#### 4.3.2 BSV: Example: recognizing legal RISC-V BRANCH instructions

The RISC-V ISA has a family of six conditional-branch instructions. Figure 4.1 is an excerpt from the Unprivileged ISA specification document [14]. The first line just gives us

| 31  | 27       | 26   | 25 | 24 |     | 20 | 19 | 15 | 14   | 12  | 11           | 7                   | 6    | 0    |                       |
|-----|----------|------|----|----|-----|----|----|----|------|-----|--------------|---------------------|------|------|-----------------------|
| imr | n[12 10] | :5]  |    |    | rs2 |    | rs | 1  | func | ct3 | imm          | 4:1 11]             | opc  | ode  | B-type                |
|     |          |      |    |    |     |    |    |    |      |     |              |                     |      |      | -                     |
| imn | n[12 10] | :5]  |    |    | rs2 |    | rs | 1  | 00   | 0   | $_{\rm imm}$ | $4{:}1 11] \;\; \;$ | 1100 | 0011 | BEQ                   |
| imn | n[12 10] | :5]  |    |    | rs2 |    | rs | 1  | 00   | 1   | imm[         | 4:1 11]             | 1100 | 0011 | BNE                   |
| imn | n[12 10] | :5]  |    |    | rs2 |    | rs | 1  | 10   | 0   | imm[         | 4:1 11]             | 1100 | 0011 | $\operatorname{BLT}$  |
|     | n[12 10] |      |    |    | rs2 |    | rs | 1  | 10   | 1   | $_{ m imm}[$ | 4:1 11]             | 1100 | 0011 | $_{\mathrm{BGE}}$     |
| imn | n[12 10] | :5]  |    |    | rs2 |    | rs | 1  | 11   | 0   | imm[         | 4:1 11]             | 1100 | 0011 | $\operatorname{BLTU}$ |
| imn | n[12 10] | :5]_ |    |    | rs2 |    | rs | 1  | 11   | 1   | imm[         | 4:1 11]             | 1100 | 0011 | BGEU                  |

Figure 4.1: RISC-V conditional BRANCH instructions

the names of the various slices of a 32-bit BRANCH-type instruction, and the subsequent lines describe the six instructions. Note that they only differ in the funct3 slice, where they use only six of the possible eight 3-bit codes.

Assuming instr is a 32-bit instruction, we can write BSV code to compute whether instr is or is not a legal BRANCH instruction:

```
Bit #(7) opcode_BRANCH = 7'b_110_0011;

Bit #(7) opcode = instr [6:0];

Bit #(3) funct3 = instr [14:12];

Bool legal = (opcode == opcode_BRANCH)

&& (funct3 != 3'b010)

&& (funct3 != 3'b011));
```

Line 1 defines opcode\_BRANCH as a 7-bit constant whose binary value is 1100011. The '7b prefix indicates that the number should be read as a binary, not decimal, number. The "\_" underscore characters are present merely for our (human) readability, and have no semantic significance. Lines 3-4 extract relevant slices, and finally lines 5-7 define the desired legality condition.

Figure 4.2 shows the hardware circuit described by the code. Some observations:



Figure 4.2: Testing for a legal BRANCH instruction

- Lines with arrow-heads in the figure represent bundles of one or more wires, also called "buses". For buses that have more than one wire, we show a small diagonal cross-hatch labeled with the number of wires (such as "3" or "7").
- Names/identifiers in BSV code that are bound to values are simply names for buses (in most software programming languages names represent memory locations; this is *not* the case in BSV).

#### 4.3.3 BSV: Combinational circuits and primitives

Figure 4.2 is an example of a so-called *combinational* circuit. In general, a combinational circuit is any interconnection of combinational primitive "operators" that *does not contain cycles* (*i.e.*, a bus connecting back to an earlier part of the circuit). Examples of combinational primitive operators in BSV include comparisons (like == and !=), boolean operations (like &&), bit-slicing ([n1:n2]) truncation and extension, arithmetic (like +, -, \*), shifts (<< and >>), and multiplexers (discussed in Section 4.9, later).

In BSV, and Verilog/SystgemVerilog RTL, we consider such operators as "primitive". In fact, such operators must themselves be implemented using lower-level circuit primitives such as AND, OR, and NOT gates which, in turn, must be implemented with even lower-level circuit structures such as transistors. We do not concern ourselves with such lower-level implementation because nowadays this is performed for us automatically by excellent so-called "synthesis" tools.

#### BSV: Combinational circuits have no side-effects (are "pure")

There is no "storage" in a combinational circuit, nor any concept of "updating" any storage (no "side-effects"). When a 32-bit value is presented at the input (top) of the circuit in Figure 4.2, conceptually we "instantly" see the 1-bit result at the output (bottom) of the

circuit, i.e., a combinational circuit is conceptually a pure, instantaneous, mathematical function from intputs to outputs. If we change the 32-bit value presented at the input, conceptually the output changes instantaneously in response.

NOTE:

Circuits are physical artefacts and we cannot escape physics. Electrical signals will take some finite time to propagate from inputs to outputs through wires and silicon. This propagation delay will place a limit on the "clock speed" at which we are able to run a digital circuit. We ignore this for the moment, and discuss this in detail later.

#### BSV: Data types identify combinational circuits

In BSV, the output type of any circuit that produces a value will have one of three forms (the latter two will be discussed in detail later):

- ... some value type ...: Bit#(n), Bool, or types discussed later such as structs, enums, tagged unions, vectors, and so on.
- Action
- ActionValue #( ... some value type ... )

Circuits whose output types are of the first form are *guaranteed* by BSV's type-checking system to be pure combinational circuits—they cannot have any side-effects such as updating a register or memory location, enqueuing or dequeuing from a FIFO, outputting a value to a device or display, *etc.* Circuits with Action and ActionValue#() types, on the other hand, may have side-effects as part of the process of producing their output value.

BSV is unusual amongst programming languages in precisely tracking "pure" (no side-effects) vs. "impure" constructs through type-checking. In this regard it is most similar to Haskell, where potentially impure expressions must have certain monadic types (typically in the "IO monad"). The importance of tracking purity will become clear when we discuss rule and method conditions and scheduling, later.

#### 4.4 BSV: Functions

The fragment of code shown above can be packaged into a BSV function, specifying argument(s) with precise types and a precise result type:

Functions are invoked using the "application" syntax commonly used in most programming languages:

```
Bit #(32) x, y;

Bool result_x = is_legal_BRANCH (x);

Bool result_y = is_legal_BRANCH (y);
```

BSV function definition and application syntax is similar to SystemVerilog.

#### 4.5 BSV: A small testbench to test our code

Here is a small program to run our is\_legal\_BRANCH() function on a few tests:

```
import StmtFSM :: *;
1
2
    function Bool is_legal_BRANCH (Bit #(32) instr);
3
        ... as shown earlier ...
     endfunction
5
     (* synthesize *)
    module mkTop (Empty);
9
       mkAutoFSM (
10
           seq
11
12
                 Bit #(32) instr_BEQ = {7'h0, 5'h9, 5'h8, 3'b000, 5'h3, 7'b_110_0011};
13
                 $display ("instr_BEQ %08h => %0d", instr_BEQ,
14
                            is_legal_BRANCH (instr_BEQ));
15
              endaction
16
17
              action
18
                 Bit #(32) instr_BNE = {7'h0, 5'h9, 5'h8, 3'b001, 5'h3, 7'b_110_0011};
19
                 $display ("instr_BNE %08h => ", instr_BNE,
20
                            fshow (is_legal_BRANCH (instr_BNE)));
21
              endaction
22
23
              action
24
                 Bit #(32) instr_ILL_op = {7'h0, 5'h9, 5'h8, 3'b100, 5'h3, 7'b_110_0000};
25
                 $display ("instr_ILL_op %08h => ", instr_ILL_op,
26
                            fshow (is_legal_BRANCH (instr_ILL_op)));
27
              endaction
28
29
              action
30
                 Bit #(32) instr_ILL_f3 = {7'h0, 5'h9, 5'h8, 3'b010, 5'h3, 7'b_110_0011};
31
                 $display ("instr_ILL_f3 %08h => %0d", instr_ILL_f3,
32
                            is_legal_BRANCH (instr_ILL_f3));
33
              endaction
34
           endseq);
35
36
    endmodule
37
```

For the moment, don't try to understand all these boilerplate constructs in detail. Briefly, mkAutoFSM is like a sequential program (discussed in more detail in Section 8). It performs a sequence of four actions. In each action we define a 32-bit instruction with standard Verilog bit-concatenation syntax. For example, instr\_BEQ is defined as a 32-bit value by concatenating a 7-bit hex 0 as an "immediate" value, a 5-bit hex 9 for rs2, a 5-bit hex 8 for rs1, a 3-bit 0 for funct3, a 5-bit hex 7 for rd, and a 7-bit binary value for the branch opcode. instr\_BEQ and instr\_BNE are legal branch instruction encodings. instr\_ILL\_op is not a legal branch instruction because it has the wrong 7-bit opcode in the opcode slice. instr\_ILL\_f3 is not a legal branch instruction because it has an illegal 3-bit value in the funct3 slice.

In each action, the \$display() prints the instruction in hex format, and prints the Bool result of applying is\_legal\_Branch() to the instruction. In two of the \$display()s we print the Bool value as a decimal integer (%0d format). In the other two \$display()s we use fshow() to print booleans as "True" or "False".

Suppose this code is in a file Top.bsv. We can now compile, link and execute the design (in simulation) as follows:

```
# ---- Compile BSV source code
1
    $ bsc -u -sim Top.bsv
2
    checking package dependencies
3
    compiling Top.bsv
4
    code generation for mkTop starts
    Elaborated module file created: mkTop.ba
6
    All packages are up to date.
7
8
    # ---- Link to form a simulation exeutable
9
    $ bsc -sim -e mkTop -o ./exe_HW_bsim
10
    Bluesim object created: mkTop.{h,o}
11
    Bluesim object created: model_mkTop.{h,o}
12
    Simulation shared library created: exe_HW_bsim.so
    Simulation executable created: ./exe_HW_bsim
14
15
    # ---- Execute the simulator
16
    $ ./exe_HW_bsim
17
    instr_BEQ 009401e3 => 1
18
    instr_BNE 009411e3 => True
19
    instr_ILL_op 009441e0 => False
20
    instr_ILL_f3 009421e3 => 0
```

#### Exercise 4.1:

Extend the testbench to test more 32-bit values with is\_legal\_BRANCH().

#### Exercise 4.2:

Refer to the "RV32I Base Instruction Set" listing in "Chapter 24 RV32/64G Instruction Set Listings" in the RISC-V Unprivileged ISA specification document [14]. It lists 40 RV32I instructions. Similar to is\_legal\_BRANCH(), write BSV code for the following functions:

```
function Bool is_legal_JAL (Bit #(32) instr);
    ... acccepts JAL

function Bool is_legal_JALR (Bit #(32) instr);
    ... acccepts JALR

function Bool is_legal_IALU (Bit #(32) instr);
    ... acccepts LUI, AUIPC, ADDI, SLTI, ..., OR, AND

function Bool is_legal_Mem (Bit #(32) instr);
    ... accepts LB, LH, LW, LBU, LHU, SB, SH, SW
```

Ignore FENCE, ECALL and EBREAK instructions; for the moment we'll treat them as illegal instructions.

#### Exercise 4.3:

Extend the testbench to test more 32-bit values with all the is\_legal\_XXX() functions.

4.6 BSV: enum types

In Figure 3.1, in the Register-Read and Dispatch step, we need to know whether the incoming instruction is illegal, a control instruction (branch or jump), an integer arithmetic/logic instruction, or a memory-accessing instruction. We could think of coding these "classes" using numbers (0 for illegal, 1 for control, 2 for IALU, 3 for Mem), but it is more readable, and cleaner, to use an "enum" type (similar to enum types in SystemVerilog and C):

```
typedef enum {OPCLASS_ILLEGAL,

OPCLASS_CONTROL, // BRANCH, JAL, JALR

OPCLASS_IALU,

OPCLASS_MEM } // LOAD, STORE, AMO

OPCLASS
deriving (Bits, Eq, FShow);
```

This defines a type OpClass containing four constants, OPCLASS\_ILLEGAL, OPCLASS\_CONTROL, and so on.

#### **4.6.1** deriving (Bits)

Because we said "deriving(Bits)", the *bsc* compiler will automatically represent them with the obvious codes 0, 1, 2 and 3 in a minimal bit-width (Bit#(2)). For other codings, we would *not* say "deriving(Bits)", and we would provide an explicit mapping function into codes (see "typeclass instances", later).

#### 4.6.2 deriving (Eq)

Because we said "deriving(Eq)", the *bsc* compiler will automatically define the "equality" (and "inequality") functions for values of this new type, in the natural and obvious way. For other definitions of equality/inequality, we would *not* say "deriving(Eq)", and we would define equality/inequality as we wish (see "typeclass instances", later).

#### 4.6.3 deriving (FShow)

Given a value v of type OpClass, if we directly print it (e.g., in a \$display() statement), it will print its numeric code (0, 1, ...). Because we said "deriving(FShow)", the bsc compiler will automatically define an "fshow()" function for this type: if we print fshow(v), it will print its symbolic name from the enum declaration instead (i.e., it will print the labels OPCLASS\_ILLEGAL, OPCLASS\_CONTROL, and so on).

#### 4.7 BSV: Syntax of Identifiers

The syntax of an identifier (name) in BSV follows the same conventions as in many programming languages: any sequence of alphabets, digits and underscore characters, with the first letter always being an alphabet.

BSV follows the Haskell system where an identifier has a different roles depending on whether its first letter is lower-case or upper-case. An upper-case first-letter represents a *constant*, either a value constant or a type constant. All variables (value variables or type variables) begin with a lower-case letter.

In the enum type-definition in Section 4.6, the identifiers OPCLASS\_ILLEGAL, OPCLASS\_CONTROL, OPCLASS\_IALU, OPCLASS\_MEM are all value constants (they all begin with an upper-case letter). The identifier OpClass (and identifiers seen earlier: Bool and Bit) are all type constants. The identifiers Bits, Eq. and FShow are all typeclass constants.

Other variables seen earlier, like x, y, a, b, opcode, and result\_x are all ordinary value variables.

#### 4.8 BSV: Syntax of comments

Comments in BSV have the same syntactic conventions as in Verilog, System Verilog and  $\mathrm{C/C}++:$ 

• A pair of forward-slashes ("//") begins a comment that spans to the end of the current line.

There are many examples of this in the code fragments already shown above.

• A region of text spanning multiple lines can be a comment if preceded by "/\*" (forward-slash and asterisk) and followed by "\*/" (asterisk and forward-slash).

This form is often used to "comment-out" a region of text during debugging or trying out alternatives.

#### 4.9 BSV: if-then-else statements and hardware multiplexers

In most programming languages, "if-then-else" is a so-called "control" construct: depending on the boolean condition, either the then-arm or the else-arm is executed (not both!).

In BSV an "if-then-else" represents a hardware *multiplexer*. The then-arm and else-arm each represent hardware that computes some value. The if-then-else construct simply selects the output of one of the two arms and passes it on as its output. Stated another way, the if-then-else "multiplexes" the two arm-outputs into a single output. In programming-language terms, *both* arms of the conditional are always "executed"—each arm represents an actual piece of hardware that is continuously computing its output.

The data type of the condition in an if-then-else must always exactly be Bool (not Bit#(1), not an integer, etc.). The types of the two arms of the conditional must be exactly the same, and this is also the type of the output of the output of the if-then-else.

For example, here is a function that distinguishes CONTROL instructions from other instructions, returning an OpClass (Section 4.6):

```
function OpClass instr_opclass (Bit #(32) instr);
      OpClass result;
2
      if (is_legal_BRANCH (instr)
3
          || is_legal_JAL (instr)
4
          || is_legal_JALR (instr))
5
         result = OPCLASS_CONTROL;
6
      else
7
         result = OPCLASS_ILLEGAL;
      return result;
9
   endfunction
```

This can also be written using so-called "conditional expressions" (using the same syntax as in SystemVerilog and C):

It's a matter of taste and style whether one uses if-then-else expressions or C-style conditional expressions. It may also depend on the size of the sub-epressions. The primary goal should be readability.

Both these code fragments represent the same hardware, shown in Figure 4.3. The 32-bit instr argument is fed into the circuits for is\_legal\_BRANCH() (hardware schematic in



Figure 4.3: If-then-else is a multiplexer

Figure 4.2), is\_legal\_JAL() and is\_legal\_JALR() which are OR'd to produce a Bool output which, in turn, is used to select one of two 2-bit constant values, producing a final 2-bit result. The multiplexer, also called a "MUX" for short, is a primitive combinational circuit.

If-then-elses and conditional expressions can of course be nested:

```
function Bool instr_opclass (Bit #(32) instr);
1
       OpClass result;
       if (is_legal_BRANCH (instr)
3
          || is_legal_JAL (instr)
4
          || is_legal_JALR (instr))
5
          result = OPCLASS_CONTROL;
6
       else if (is_legal_IALU (instr))
          result = OPCLASS_IALU;
       else if (is_legal_Mem (instr))
          result = OPCLASS_MEM;
10
       else
11
          result = OPCLASS_ILLEGAL;
12
       return result;
13
    endfunction
14
```

This represents a cascade of multiplexers in hardware, as shown in Figure 4.4

#### 4.9.1 Parallel multiplexers and MUX synthesis

The circuit in Figure 4.4 has a serial structure—the OPCLASS\_CONTROL branch has priority, and only if its condition is False can one of the other results flow through. Also observe that the longest path length increases *linearly* with number of classes—here, OPCLASS\_ILLEGAL flows through all three multiplexers.

But we know from RISC-V instruction encodings that the OPCLASS\_CONTROL, OPCLASS\_IALU and OPCLASS\_MEM conditions are *mutually exclusive*; no instruction simultaneously falls into more than one such class. In such situations (mutually exclusive conditions) it is possible to create more a efficient circuit called a *parallel MUX*. An exercise below shows how to create a parallel MUX explicitly, but in many cases downstream RTL-to-lower-level-hardware synthesis tools will do this automatically.



Figure 4.4: Nested if-then-elses become cascaded multiplexers

#### Exercise 4.4:

Write a testbench for the instr\_opclass() function: pass in different 32-bit instructions to produce the op class, and print out the op class. When printing the class, try printing it as an integer, and also using fshow().

#### Exercise 4.5:

Write a new version of the instr\_opclass() function that expresses a parallel MUX instead of a priority MUX. The key ideas are:

- Define a value x\_CONTROL that is either OPCLASS\_CONTROL, or 0 (of the same bit-width) if the Bool values of is\_legal\_BRANCH(), is\_legal\_JAL() and is\_legal\_JALR() are all False. We can implement this by replicating the 1-bit Bool condition to the width of the OpClass type and bitwise-AND'ing this with OPCLASS\_CONTROL.
- Similarly, define values x\_IALU and x\_MEM that is either OPCLASS\_IALU/OPCLASS\_MEM or 0 if the Bool value of is\_legal\_IALU()/is\_legal\_MEM() is False.
- Define a value x\_ILLEGAL that is either OPCLASS\_ILLEGAL, or 0 if any of the previous conditions is True.
- Finally, just bitwise-OR the four x\_XXXvalues together to produce there result.

This kind of MUX is also called an AND-OR mux because of its structure. Note that it relies for correct operation on *precisely one* of the bitwise-OR arguments being True. Here, we are assured of this because of the mutual exclusivity of the conditions.

#### Exercise 4.6:

Sketch out a schematic diagram for the hardware of the parallel MUX in the previous exercise.

The schematic will clearly reveal the parallel nature of the MUX compared to the serial nature of the priority MUX shown earlier in the schematic in Figure 4.4.

How many multiplexers does each OPCLASS\_XXX value flow through? How many bitwise-OR operators are needed? Note that we can organize the bitwise-OR operators as a balanced binary tree, so that the path through the circuit grows *logarithmically*, not linearly, with the number of classes.

#### Exercise 4.7:

Test your new version of the instr\_opclass() function in your testbench.

#### 4.10 BSV: Sharing code for RV32 and RV64 via parameterization

The RISC-V ISA is actually two ISAs—a 32-bit ISA called RV32 and a 64-bit ISA called RV64. These are not randomly different ISAs; they have been carefully engineered to overlap as much as possible:

- Most of the RV32 instructions are exactly the same in RV64
- Three R32 instructions are slightly different in RV64—the shift instructions SLLI, SRLI and SRAI have 5-bit shift-amounts in RV32 (allowing up to 32-bit shifts), whereas they have 6-bit shift-amounts in RV64 (allowing up to 64-bit shifts).
- RV64 adds several new instructions that compute on 64-bit values.

Because of this large overlap of RV32 and RV64, we would like to share BSV code as much as possible between RV32 and RV64, *i.e.*, we would like to parameterize our BSV code so that it can be re-used between RV32 and RV64 implementations.

#### 4.10.1 BSV: Numeric types

We have mentioned the type Bit#(n) frequently so far, representing bit-vectors of width n bits. All our examples showed a particular n, such as:

```
Bit #(32) instr;
Bit #(32) pc_val;
```

The first declaration is fine for both RV32 and RV64, since instructions are 32-bits wide in both. However, the second declaration only works in RV32, since the program counter is 64-bits wide in RV64 (type Bit#(64)).

The "32" or "64" argument to the Bit#(n) type is a numeric type. Although syntactically they look just like the values 32 and 64, when used inside a type-expression like Bit#(n), they are not values, but numeric types. BSV's type-system carefully distinguishes between these two cases because numeric-types usually say something about hardware structure, which cannot be changed once created! So, while we can perform arbitrary arithmetic on numeric values, we cannot do so on numeric types.<sup>2</sup>

<sup>&</sup>lt;sup>2</sup>A limited form of arithmetic is possible on numeric types. Consider a generic function that takes two

#### 4.10.2 BSV: Type synonyms

In BSV we can define a new symbolic name for an existing type, and then we can use that symbolic names in place of the existing type. Example, from RV32 code:

```
typedef 32 XLEN; // new name for numeric type 32

Bit #(XLEN) pc_val;
Bit #(XLEN) rs1_val; // Value read from register rs1 in register file
Bit #(XLEN) rs2_val; // Value read from register rs2 in register file
Bit #(XLEN) rd_val; // Value written to register rd in register file
```

By changing the single definition in line 1 to:

```
typedef 64 XLEN; // new name for numeric type 32
```

the remaining code will work for RV64 as well.

#### 4.10.3 BSV: The numeric value corresponding to a numeric type

Although BSV keeps a strict separation of numeric types and numeric values (and limits the available arithmetic on the former), it is always safe to convert a numeric type into the corresponding numeric value, since these values are all known statically (at compile time). The built-in pseudo-function valueOf() is provided for this:

```
Integer xlen = valueOf (XLEN);
```

Here, xlen is an ordinary value variable whose integer value is the same as that expressed by the numeric type XLEN.

#### 4.10.4 BSV: Conditional compilation

Just like in Verilog, SystemVerilog and C/C++, the *bsc* compiler runs BSV source code through a "pre-processor" before compilation, which can perform simple text ("macro") substitutions. Using this facility, we can pass an argument to the compiler that has the effect of configuring the source code for RV32 or RV64 (the following code is from Fife/Drum's Arch.bsv file:

arguments of type Bit#(m) and Bit#(n) and returns the concatenation of these bit-vectors: its output type is Bit#(m+n). By limiting the available arithmetic *bsc* can resolve it completely "statically", *i.e.*, at compile time, before it even compiles to Verilog RTL. We ignore this for now, and discuss it later.

```
'ifdef RV32
1
2
        typedef 32 XLEN;
3
4
    'elsif RV64
5
6
        typedef 64 XLEN;
7
8
    'endif
9
10
        Integer xlen = valueOf (XLEN);
11
```

As in Verilog and SystemVerilog, pre-processor directives begin with a 'character (backtick) (analogous to #ifdef in the C/C++ pre-processor).

When we invoke the *bsc* compiler, we can pass it command line arguments -DRV32 or -DRV64; the pre-processor will then select the appropriate typedef line. Thus, we can write common code that will work for both RV32 and RV64. The integer value xlen will have the numeric value 32 or 64 depending on how it was compiled.

Pre-processor macros allow us to conditionally compile different source text based on the macro definitions we supply to the compiler. We can also compile alternatives based on the value xlen

```
if (xlen == 32) begin
    ... code that must execute if we are in RV32 mode ...
end
else begin
    ... code that must execute if we are in RV64 mode ...
end
end
```

Whenever possible, it is preferable to use the if(xlen==...) form instead of the 'ifdef form for conditional compilation because (a), the code is more readable and (b), as we experience in many languages, pre-processor macros can be quite dodgy (scoping, inadvertant variable capture, inadvertant surprises due to associativity of infix operators, ...).

Note that in the if(xlen==...) form both arms of the conditional must type-check correctly, whether xlen is 32 or 64. There are ways to achieve this with judicious use of bit-slicing, extend() and truncate(); we will point them out as we encounter them. If the two arms cannot both type-check whether xlen is 32 or 64, we may have to resort to the 'ifdef form.

There is zero run-time overhead in using the if(xlen==...) form because the *bsc* compiler will evalute the if-condition statically and reduce the if-then-else to just the relevant arm.

# Chapter 5

# Struct types (BSV) Memory requests and responses (RISC-V)

# 5.1 RISC-V: structs communicated between steps

Various kinds of information need to be communicated between the steps of Figure 3.1—program counter values, instructions, values read from registers, values to be written back to registers, and so on. struct data types (short for "structures") are suitable for bundling together heterogeneous collections of values. (This is the same concept in C and SystemVerilog; it is also called a "record" in some programming languages.) Each component of a struct is called a "field" or a "member" of the struct.



Figure 5.1: Simple interpretation of RISC-V instructions (Fig. 3.1 with arrows annotated with struct types)

Figure 5.1 annotates Figure 3.1 with struct types communicated on each of the black arrows between steps, and each of the red arrows to and from memory. In this and the next few

chapters we will flesh out the details of all these struct types. We will use exactly the same struct types for Fife and Drum, *i.e.*, whether the implementation is pipelined or not.

#### 5.2 BSV: struct types

Consider the black arrow from the Decode step to the Register-read-and-Dispatch step of Figure 5.1. We want to communicate several values, including:

- The current PC. This will be needed by BRANCH, JAL, JALR and AUIPC instructions to compute addresses that are offset from the current PC. It will be needed for any traps (exceptions) that may occur, which save PC for the trap-handler.
- An exception flag, indicating:
  - whether an error was encountered in the Fetch-to-Memory-to-Decode path, or
  - whether the Decode step's analysis indicates that the instruction is not legal (unrecognized 32-bit code).

When the exception flag is true, a cause field provides more detail.

- If there is no exception (no Fetch memory error; instruction is legal), other fields provide more analytical detail for subsequent steps:
  - The "fall-through" PC, *i.e.*, the address of the next instruction following this one in memory. For RV32I and RV64I, this will always be PC+4, since all instructions are 4-bytes long. For most instructions, the fall-through PC is indeed the unique next PC. For conditional BRANCH instructions, this is the next PC if the BRANCH is not taken. For JAL and JALR instructions (unconditional jumps), this is the "return address" saved by the instruction in a register.
  - The instruction itself. This will be needed for opcode details, the rs1, rs2 and rd register indexes, immediate values, etc.

The next several fields are derived by analyzing the instruction. They could be rederived wherever needed by re-analyzing the instruction, but we perform that work just once in the Decode step and communicate the results.

- The OpClass. This indicates to which next-step in Figure 5.1 we dispatch for subsequent actions.
- Whether the instruction reads rs1 and/or rs2 register values. This will be needed to control reading from the register file.
- Whether the instruction writes an rd register value. This will be needed to control writing to the register file.
- The "immediate" value in the instruction. Refer to the top of each page of "Table 24.2 Instruction listing for RISC-V" in the Unprivileged Spec [14], which shows that the I-, S-, B-, U- and J-type instructions have immediate values of different sizes and encode them in different ways ("bit-swizzled"). We untangle these once in the Decode stage, and pass on the clarified results in the imm field.

 $<sup>^{1}</sup>$ If we implement the "C" RISC-V ISA extension (compressed instructions), the correct fall-through PC may be PC+2.

This heterogeneous collection of values is most conveniently expressed as a struct type:

```
typedef struct {Bit #(XLEN)
1
2
                      Bool
                                    exception;
3
                      Bit #(XLEN)
                                                  // Memory exception during Fetch;
                                    cause;
4
                                                  // Illegal instr decoded.
5
                      // If not exception
6
                      Bit #(XLEN)
                                    fallthru_pc;
7
                      Bit #(32)
                                    instr;
                      OpClass
                                    opclass;
9
                      Bool
                                    has_rs1;
10
                      Bool
                                    has_rs2;
11
                      Bool
                                    has_rd;
12
                      Bit #(32)
                                    imm;
                                                    // Canonical (bit-swizzled)
13
    } D_to_RR
14
    deriving (Bits, FShow);
15
```

Because we said "deriving(Bits)", the *bsc* compiler will automatically work out a representation for D\_to\_RR values in bits, using the straightforward method of simply concatenating the bit-vectors of each field into a bit-vector for the whole struct. The total bit-size of a D\_to\_RR struct value is simply the sum of the individual bit-sizes of the fields. If we had not said "deriving(Bits)", we could explicitly provide some other custom representation in bits.<sup>2</sup>

Because we said "deriving (FShow)", the bsc compiler will automatically define an "fshow()" function for this type: if we print fshow(v), it will print something like this:

```
D_to_RR {inum=..., pc=..., instr=..., ... }
```

#### 5.2.1 Creating struct values

We can create a new value of type D\_to\_RR with syntax like this:

```
D_to_RR x = D_to_RR {pc: ... value of field ..., exception: ... value of field ..., cause: ... value of field ..., fallthru_pc: ... value of field ..., instr: ... value of field ..., ...};
```

 $<sup>^2</sup>$ In C/C++, compilers will often "pad" out fields (insert unused bits between fields) to be aligned on byte and word boundaries, for more efficient access in byte-structured memories; thus, a struct's size in C/C++ may be larger than the sum of the field sizes, and may even vary depending on the compiler's target architecture. In hardware design, these values may reside in wires, registers, FIFOs, etc which have no "byte-structured" bias, and so we do not play any such "padding" games.

The right-hand side is sometimes called a "struct expression", *i.e.*, it is an expression which, when evaluated, produces a struct value.

The repetition of D\_to\_RR above seems verbose; the left-hand side instance is the type, and the right-hand side instance is the "struct constructor" (think of it as a function that takes the field values as arguments and returns a struct value). The *bsc* compiler's type-analysis is able to infer the type from the right-hand side, so we can just use the keyword "let":

```
let x = D_{to}RR \{pc:
                                            ... value of field ... ,
                            exception:
                                            ... value of field ... ,
2
                            cause:
                                            ... value of field ... ,
3
                            fallthru_pc:
                                           ... value of field ... ,
4
                            instr:
                                            ... value of field ... ,
5
                             ...};
6
```

The order in which the field values are given does not matter; the bsc compiler will put the fields into the correct offsets in the struct value.

Not all field values need be given in a struct expression. The *bsc* compiler will issue a warning for each unspecified field, and insert an "unspecified" (and unpredictable) values. You can indicate that a field is intentionally left unspecified (and suppress the compiler warning) using a "?":

In the above example, the exception cause field is meaningless when the exception field is false, and we indicate explicitly this with "?".<sup>3</sup>

#### 5.2.2 Selecting struct fields

Struct fields can be selected using the usual "dot" notation common to SystemVerilog and C/C++:

```
x.pc
x.instr
```

<sup>&</sup>lt;sup>3</sup>In the RTL languages Verilog, SystemVerilog and VHDL, there is a concept of "X" (unassigned) values. In BSV there is no such concept; thus "?" repesents some legitimate—but unspecified—binary value. Note that even in the RTL languages, "X" has meaning only in simulators, not in real hardware, where it also becomes some legitimate—but unspecified—binary value.

#### 5.2.3 Updating struct fields using assignment

Struct fields can be updated with assignment using the usual "dot" notation common to SystemVerilog and C/C++:

```
x.pc = ... new value ...;
x.instr = ... new value ...;
```

#### 5.3 RISC-V: Memory Requests and Responses; IMem and DMem

In Figure 5.1, the Mem\_Req and Mem\_Rsp structs are used in two places. The Fetch step issues a memory request, and the corresponding memory response is received by the Decode step. Similarly, the "Execute Memory Ops" steps issues a memory request and consumes a memory response. We use the shorthand term "IMem" for the first context (for Instruction Memory) and "DMem" for the latter context (for Data Memory).

#### 5.3.1 Separation of IMem and DMem (Harvard Architecture)

The separation of memory channels for instructions and data (loads/stores) is quite standard in modern CPU architectures, and is informally called a "Harvard Architecture". The term refers to the architecture of the Harvard Mark I computer, designed and built by Harvard University and IBM in the 1940s (the term itself was coined much later). It sometimes refers just to separate, concurrent paths to memory for instructions and data, and sometimes also to physically separate memories for instructions and data (more discussion in Wikipedia: https://en.wikipedia.org/wiki/Harvard\_architecture).

Modern software is typically not "self-modifying", *i.e.*, instructions and data are placed in different areas of memory, and load/store instructions never write into the instruction area, *i.e.*, programs never over-write instructions in memory. This allows separate hardware for memory access for instructions vs. memory access for data, which can run concurrently, *i.e.*, we may fetch an instruction at the same time as we are accessing data memory for a previous load/store instruction (we will see this in Fife). We can also tune and optimize each memory path separately for their different dynamic behavioral patterns. In some systems we can also *protect* the instruction memory area, *i.e.*, enforce in hardware the policy of not over-writing instructions.

This view of strict separation of IMem and DMem has to be tempered somewhat when considering languages like JavaScript, Python etc. that employ so-called "JIT" compiling ("Just-In-Time"). The run-time systems of such languages generate instructions on-the-fly, i.e.,, LOAD/STORE instructions produce data through the DMem channel that will (soon) be fetched as instructions. But even in these systems, there is a strict protocol of phases. During a code-generation phase, the data produced is considered as ordinary data. Then there is a deliberately executed phase-change, where the virtual memory protections of the data-pages just written are changed so that they are now viewed as read-only instruction pages, after which these new instructions can be fetched.

#### 5.3.2 RISC-V: Memory Requests

A memory-request is either for reading data ("LOAD") or for writing data ("STORE").<sup>4</sup> We can express these request-types using an enum type (similar to enums in C or SystemVerilog):

An IMem request is for one 32-bit instruction (four bytes).<sup>5</sup> A DMem request may be for one, two or four bytes.<sup>6</sup> We express these request-size options using an enum type:

```
typedef enum {MEM_1B, MEM_2B, MEM_4B} Mem_Req_Size
deriving (Eq, FShow, Bits);
```

A memory request bundles a request type, a size, and an address. For memory-writes, we also bundle the data to be stored. We express this bundle using a struct:

```
typedef struct {Mem_Req_Type req_type;

Mem_Req_Size size;

Bit #(XLEN) addr;

Bit #(XLEN) data; // Only for STORE

Mem_Req
deriving (Eq, FShow, Bits);
```

For STORE requests of 1 and 2 bytes (*i.e.*, smaller than the data field) we assume the data is passed in the least-significant bytes of the data field.

This is the information sent to Memory from the Read-PC-and-Fetch step and also from the Execute-Memory-Ops step in Figure 5.1.

#### 5.3.3 RISC-V: Address Alignment

Although nowadays we think of all computer memories in units of 8-bit bytes and being byte-addressed, in practice in hardware, it is usually simpler if memory-requests are aligned

<sup>&</sup>lt;sup>4</sup>When implementing the "A" RISC-V ISA extension (Atomic Ops), memory requests can also be for LR (Load-Reserved), SC (Store-Conditional), AMOSWAP, AMOADD, AMOXOR, AMOAND, AMOOR, AMOMIN, AMOMAX, AMOMINU, and AMOMAXU.

<sup>&</sup>lt;sup>5</sup>When implementing the "C" RISC-V ISA extension (compressed instructions), instructions can also be 16-bits (2 bytes). When implementing more sophisticated Fetch units, we may actually fetch much larger chunks, such as a full cache line.

<sup>&</sup>lt;sup>6</sup>In RV64, and with the "D" RISC-V ISA extension for double-precision floating-point even in RV32, memory-requests can also be for eight bytes.

<sup>&</sup>lt;sup>7</sup>Some early computers, until about the late 1970s, had other memory granularities—multiples of 6, 7, 9 bits, *etc*. Those were the days of bespoke memories for each computer design. Mass-production of memory chips resulted in standardization to 8-bit bytes.

to an address according to the request size. Specifically, the address for a 2-byte request should be even, *i.e.*, the least significant bit of the address, addr[0], should be zero. The address for a 4-byte request should have zero in the two least significant bits (addr[1:0]) and the address for an 8-byte request should have zero in the three least significant bits (addr[2:0]).

We can see why address-alignment is desirable. Memory implementations (chips) are usually architected to retrieve multiple bytes at a time (e.g., 64 bytes) so that all those bytes can share addressing and control circuitry. With such an organization, a misaligned access request may straddle the boundaries of such "naturally sized" units and so may require two consecutive reads/writes. Caches are usually organized to hold multi-byte cache lines (e.g.,, 32 bytes) in order to share the addressing and miss-handling circuitry, and to move data efficiently in and out of the cache. Again, a misaligned access request may straddle a cache-line boundary, and may require two consecutive accesses, which may hit or miss independently. Virtual memory systems are usually organized in pages, units of typically 4K-8K bytes, in order to share virtual-memory handling circuitry, and to move data efficiently between main memory and disks. Again, a misaligned access may straddle a page boundary, and may require two consecutive accesses, which may hit or page-fault independently and differently. In short, misaligned accesses add significant complexity to memory-system hardware design.

We can organize our software so that misaligned accesses are exceedingly rare. Most software is produced by compilers, and the compiler ensures that instructions and data are placed in memory at aligned addresses, possibly by padding gaps between "adjacent" smaller-sized data (such as a pair of 1-byte-sized fields in a struct). This padding may waste a few bytes of memory, but pays back in greater speed and reduced complexity.

Although misaligned accesses are rare, we cannot always guarantee their absence in software, since software can calculate an arbitrary address before performing a memory access. It seems wasteful to have to pay for extra hardware complexity (with attendant loss in overall performance) for such rare cases. In many computer systems, therefore, these rare misaligned accesses are relegated to software handling:

- The memory system simply refuses to handle a misaligned access, and returns a "misaligned" error instead.
- The CPU, receiving such an error response, undergoes a "trap" which directs it to piece of software called an trap-handler (or exception handler). The trap-handler (in software) performs the memory access in multiple smaller pieces (in the worst case, of size 1 byte), each of which is aligned. In other words, the trap-handler "completes" the original memory access before resuming the main-line code that attempted it.

The RISC-V ISA specification does not forbid misaligned accesses nor prescribe how they should be handled. Some implementations will handle it in hardware, and other implementations will return a "misaligned" error and rely on a trap handler to complete the access. Some implementations may only run software where it has been proven to not generate misaligned memory requests, and therefore may not even contain a trap-handler for misaligned accesses.

#### 5.3.4 RISC-V: Memory Responses

The response from memory for any request may be to report success, an alignment error, or some other error. Examples of "other errors" are:

- Absence of memory at the given address. For example, although RV32I addresses are 32-bits, which can address 4GiB of memory, we may provision our system with something smaller, say 1 GiB.
- An unsupported operation. E.g., an attempt to write into a read-only memory (ROM).
- Corruption of data, due to electrical glitches, environmental electromagnetic pulses, etc.. These errors are usually detected with some kind of error-detecting code, such as parity bits.

These different memory response-types can be encoded in an enum type:

A memory-response contains the response-type and, for a LOAD request with an OK, the data that was read from memory. This can be expressed in a struct:

For LOAD requests of 1 and 2 bytes (*i.e.*, smaller than the data field) we assume the data is returned in the least-significant bytes of the data field.

#### 5.4 RISC-V: The Decode function

We are now in a position to write fn\_D, the Decode function in Figure 5.1:

```
function D_to_RR
fn_D (F_to_D x_F_to_D, Mem_Rsp rsp_IMem);

Bit #(32) instr = truncate (rsp_IMem.data);
Bit #(5) rd = instr_rd (instr);

let fallthru_pc = x_F_to_D.pc + 4;
```

```
8
       // Baseline info to next stage
9
       let y = D_{to}RR \{pc:
                                         x_F_to_D.pc,
10
11
                          exception:
                                         False,
12
                          cause:
                                         ?,
13
14
                          // not-exception
15
                          fallthru_pc: fallthru_pc,
16
                          instr:
                                        instr,
17
                                        ?,
                          opclass:
18
                          has_rs1:
                                        False,
19
                          has_rs2:
                                        False,
20
                          has_rd:
                                        False,
21
                          writes_mem: False,
22
                          imm:
                                        0};
23
24
       if (rsp_IMem.rsp_type == MEM_RSP_MISALIGNED) begin
25
           y.exception = True;
26
                        = cause_INSTRUCTION_ADDRESS_MISALIGNED;
           y.cause
27
       end
28
       else if (rsp_IMem.rsp_type == MEM_RSP_ERR) begin
29
          y.exception = True;
30
           y.cause
                       = cause_INSTRUCTION_ACCESS_FAULT;
31
       end
32
       else if (is_legal_LUI (instr) || is_legal_AUIPC (instr)) begin
33
           y.opclass = OPCLASS_IALU;
34
           y.has_rd = non_zero_rd;
35
          y.imm
                     = zeroExtend (instr_imm_U (instr));
36
       end
37
       else if (is_legal_BRANCH (instr)) begin
38
          y.opclass = OPCLASS_CONTROL;
39
          y.has_rs1 = True;
40
          y.has_rs2 = True;
41
                     = zeroExtend (instr_imm_B (instr));
42
       end
43
       else if (is_legal_JAL (instr)) begin
44
           y.opclass = OPCLASS_CONTROL;
45
          y.has_rd = non_zero_rd;
46
                     = zeroExtend (instr_imm_J (instr));
           y.imm
47
       end
48
       else if (is_legal_JALR (instr)) begin
49
          y.opclass = OPCLASS_CONTROL;
50
          y.has_rs1 = True;
51
          y.has_rd = non_zero_rd;
52
                     = zeroExtend (instr_imm_I (instr));
          y.imm
53
       end
54
       else if (is_legal_LOAD (instr)) begin
55
```

```
y.opclass = OPCLASS_MEM;
56
           y.has_rs1 = True;
57
           y.has_rd = non_zero_rd;
58
                      = zeroExtend (instr_imm_I (instr));
           y.imm
59
        end
60
        else if (is_legal_STORE (instr)) begin
61
           y.opclass
                         = OPCLASS_MEM;
62
           y.has_rs1
                         = True;
63
           y.has_rs2
                         = True;
64
           y.writes_mem = True;
65
                         = zeroExtend (instr_imm_S (instr));
           y.imm
66
        end
67
        else if (is_legal_OP_IMM (instr)) begin
68
           y.opclass = OPCLASS_IALU;
69
           y.has_rs1 = True;
70
           y.has_rd = non_zero_rd;
71
           y.imm
                      = zeroExtend (instr_imm_I (instr));
72
       \quad \text{end} \quad
73
        else if (is_legal_OP (instr)) begin
74
           y.opclass = OPCLASS_IALU;
75
           y.has_rs1 = True;
76
           y.has_rs2 = True;
77
           y.has_rd = non_zero_rd;
78
        end
79
        else if (is_legal_ECALL (instr) || is_legal_EBREAK (instr)) begin
80
           y.opclass = OPCLASS_SYSTEM;
81
        end
82
        else if (is_legal_FENCE (instr) || is_legal_FENCE_I (instr)) begin
83
           y.opclass = OPCLASS_FENCE;
84
           y.has_rs1 = True;
85
           y.has_rd = non_zero_rd;
86
        end
87
        else begin
88
           y.exception = True;
89
                        = cause_ILLEGAL_INSTRUCTION;
           y.cause
90
        end
91
92
        return y;
93
    endfunction
94
```

Note the use of functions <code>instr\_imm\_I()</code>, <code>instr\_imm\_S()</code>, <code>instr\_imm\_B()</code>, <code>instr\_imm\_U()</code> and <code>instr\_imm\_J()</code>, which extract the "immediate" field from a 32-bit instruction. For the different classes of instruction (I, S, B, U, J), the immediates are coded differently, and have different widths. These functions decode the bits and zero-extend them to XLEN bits (the width of <code>d\_to\_rr.imm</code>). Later, in the "execute" code for each instruction, we will pick out the relevant bits out of the XLEN bits, and zero-extend or sign-extend them as needed for the particular instruction.

An exercise below suggests that you write the code for these instr\_imm\_X functions.

Observe that the entire fn\_D() function is just a (large) combinational circuit—it is an acyclic composition of smaller combinational circuits, many of which we've seen earlier.

#### Exercise 5.1:

Write the functions instr\_imm\_I(), instr\_imm\_S(), instr\_imm\_B(), instr\_imm\_U() and instr\_imm\_J().

#### Exercise 5.2:

Write a testbench for fn\_D(), apply it on a number of 32-bit values (instructions) and print the results.

# Chapter 6

# The Fetch function: memory requests and responses

#### 6.1 Introduction

In this chapter we discuss the Fetch function of Figure 3.1, which we repeat here for convenience, adding annotatinos on the arrows to show the struct types they communicate. All these struct declarations can be found in the file: src\_Common/Inter\_Stage.bsv.



Figure 6.1: Simple interpretation of RISC-V instructions (same as Fig. 3.1, with arrows annotated with struct types)

The Fetch function *per se* is fairly simple, even trivial. Most of our discussion is about the messages sent from the Fetch step to memory (memory requests) and the messages returned from memory to the Decode step (memory responses). The outputs of Read-PC-and-Fetch step are:

• A memory request to memory, to read an instruction.

• Some additional information passed on to the Decode step for subsequent use.

## 6.2 RISC-V: the result-type of the Fetch function

In Figure 6.1, the information passed from the Fetch stage to the Decode stage is just the PC:

```
typedef struct {
   Bit #(XLEN) pc;
} F_to_D
deriving (Bits, FShow);
```

It might seem like overkill to define a struct for just one field like this, but it has the following advantages:

- It becomes easy to add more fields later, should we need to do so. In particular, for Fife we will need to add some branch-prediction information. We may also wish to add temporary fields that aid in debugging.
- Stronger type-checking: each new struct type is distinct from all other types in a BSV program. Thus, the type-checker will catch any error where we may inadvertantly pass some unrelated and irrelevant XLEN-wide value in place of an F\_to\_D struct.
- There is *no* runtime cost: an F\_to\_D value occupies the same XLEN bits as the pc field by itself.

BSV structs can be nested. The Fetch function's result is a nested struct containing two structs, one being the memory-request to be sent to memory, and the other being the information to be passed to the Decode step:

```
typedef struct {
   F_to_D to_D;
   Mem_Req mem_req;
} Result_F
deriving (Bits, FShow);
```

#### 6.3 RISC-V: The Fetch Function

Finally, we can write the Fetch function, which is very simple, almost trivial. Its input is the current value of the program counter (PC), and it returns a Result\_F. The PC is used as the address from which to read memory.

```
function Result_F fn_F (Bit #(XLEN)
1
        Result_F y = ?;
2
3
        // Info to next stage
4
        y.to_D = F_to_D \{pc: pc\};
6
        // Request to IMem
        y.mem_req = Mem_Req {req_type: MEM_LOAD,
8
                               size:
                                          MEM_4B,
9
                                addr:
                                          pc,
10
                                                       // don't-care value
                                data :
                                          ?};
11
          return y;
12
    endfunction
13
```

#### 6.3.1 BSV: Don't-care values

Since we are performing a LOAD, the value carried in the data field of the request is irrelevant. Instead of placing some particular but arbitrary value (such as "0") into that field, we use BSV's "?" notation for a don't-care value.

First, this conveys to the human reader that the value in this field is irrelevant.

Second, it can result in more efficient circuitry. If we had said "0", for example, the *bsc* compiler would have had to create circuitry that ensured that that field value was 0. By saying "?", the *bsc* compiler is allowed to omit all that circuitry.

Third, in places where it does not result in additional hardware, the bsc compiler usually injects the specific value 'h\_AAAA\_AAAA (of suitable bit-width). While debugging, observing such a value in some computation is often a clue that something is wrong.

NOTE:

Verilog and SystemVerilog have a concept of "X" values. Each bit of a register or wire carries an "X" value until it has been assigned a specific binary value (0 or 1). However, note that this is *only in simulation*, where the simulator can and does model 3-valued logic (0, 1 and X) for each bit, and is able to propagate X values through operators, registers, *etc.* Hardware only implements 2-valued logic—every bit is either 0 or 1. Thus, this is an artefact that is only useful during debugging in simulation.

BSV only has 2-valued logic; there is no concept of an "X" value. A BSV "?" expression has some specific, but potentially unpredictable, binary value.

# Chapter 7

# BSV: Modules and Interfaces; Drum and Fife Interface and Module Skeleton

#### 7.1 Introduction

It is good engineering practice to organize the code for any non-trivial system, whether in hardware or software, into a well-structured network of smaller, manageable *modules*. Each module should have a clear, independent specification so that it can be understood on its own, and so that it can transparently be substituted by another module with the same functionality but perhaps other desirable properties (*e.g.*, speed, area, power). The external specification of a module—its "interface"—should not rely on, and ideally note even mention, internal implementation details of the module.

For example, each of the units shown in Figure 7.1 could be a separate module.

In this chapter we discuss interfaces and modules in BSV, including a few that are provided by the BSV libraries and are useful in implementing our RISC-V CPUs. At the end of this chapter, we show the interface for the Drum and Fife CPUs—everything excluding the Memory in Figure 7.1.

#### 7.2 BSV: Modules: state, behavior and interfaces

Modules encapsulate modifiable state, internal behavior (rules) and external behavior (interface methods). In this sense they are similar to "objects" in object-oriented programming languages such as C++, Java, and Python. A BSV module is like an object constructor; a module instance is like an object; and its interface is a set of methods that can be invoked like functions/procedures.

Modules and interfaces clearly separate the concerns of externally-visible functionality ("external API", what a module does) vs. internal implementation details (how the module does it).



Figure 7.1: Simple interpretation of RISC-V instructions (same as Fig. 3.1)

BSV modules are typically organized in a *hierarchy*—in Drum and Fife, the top-level system will be a module, instantiating, as sub-modules, a CPU module (Drum or Fife) and a memory system module; the CPU module, in turn, will instantiate sub-modules such as a register-file module, and so on. A clean, common interface allows us transparently to substitute the Drum CPU module for the Fife CPU module and *vice versa*.

In the next several sections we describe the concepts of BSV modules and interfaces. These sections may require re-reading a couple of times; the concepts become properly internalized only after seeing/using/creating several examples. Temporal behavior will be discussed in the next chapter.

#### 7.2.1 BSV: Interface declarations

An interface declaration in BSV declares a new interface, and looks like this:

interface interface-type;

... method and sub-interface declarations ...

endinterface

The interface represents the external view of a module, *i.e.*, it declares a set of methods that can be invoked from an external context. Each method declaration only lists its arguments and their types, and the method's overall result type. The body of the method is defined elsewhere, within a specific module that offers this interface type.

Interfaces can be nested (can contain sub-interfaces which themselves have methods or sub-sub-interfaces, and so on). This is just a syntactic abstraction mechanism; ultimately, all interactions with a module are through its methods, whether at the top level of the interface type, in a sub-interface, in a sub-sub-interface, etc.

An interface declaration just defines a new type. There can be any number of module declarations each of which offers the same interface. For example, the BSV library contains a repertoire of FIFO modules, all of which have the same FIFO interface type.

Section 7.3.1 shows the interface declaration type for register modules. Section 7.4.1 shows the interface declaration type for FIFO modules.

#### 7.2.2 BSV: Module declarations

A module declaration in BSV describes a module with a particular interface type:

```
(* synthesize *)
module module-name ( interface-type );
... instantiation of module state (registers, FIFOs, other sub-modules) ...
... behavior (rules and FSMs) ...
... interface (API) method implementations ...
endmodule
```

The (\* synthesize \*) attribute at the top has no semantic significance. It is merely an indication to the *bsc* compiler, when compiling to Verilog, to create a Verilog module for this BSV module, and not to "inline" it into any parent module.

A module declaration declares a module constructor; it can be invoked multiple times to obtain multiple *instances*.

## 7.3 BSV: Example: Registers

BSV treats all "state elements" (components that store persistent values) uniformly as modules with interfaces. This includes "primitives" like ordinary registers.

A register is the simplest storage element in digital hardware, a single memory cell containing a single value (represented as a bit-vector). We can (over-)write with a new value, and we can read out the value stored by the most recent write.

#### 7.3.1 BSV: Reg#(t), the register interface

The standard register interface type in BSV has two methods:

```
interface Reg #(t);
method t _read();
method Action _write (t x);
endinterface
```

Here, "t" is the type of value stored in the register (discussed in more detail below).

The \_read() method (with no arguments) just returns the value stored in the register, of type t. The \_write() method takes one argument, a value of type t, and stores it in the register, over-writing any previous value and holding the new value until over-written by the next \_write().

#### 7.3.2 BSV: mkReg(v), a register module (consructor)

A standard BSV library register module is mkReg. It is used to instantiate a new register, with a specified reset value, using a statement like this:

```
Reg #(Bit #(XLEN)) pc <- mkReg (0);
```

Here we declare a new identifier pc with interface type Reg#(Bit#(XLEN)) (the register interface type) and bind it to the interface offered by a newly instantiated register. The "0" argument to mkReg() specifies the reset-value of the register, *i.e.*, the value held in the register immediately after the hardware has been reset.

In general any module is instantiated using this syntax:

```
interface-type module-name <- module-constructor;
```

Different module-constructors may or may not have arguments. For example, this instantiation uses a different BSV library register constructor:

```
Reg #(Bit #(XLEN)) pc <- mkRegU;
```

mkRegU instantiates a register with an unspecified (unpredictable) reset value, and hence does not need an argument.

#### 7.3.3 BSV: Registers are strongly typed

Unlike Verilog and System Verilog, BSV registers are "strongly typed". Each register instance can only hold values of one particular type.

Further, the register-contents type need not be Bits#(); it can more generally it can be any BSV type that has a representation in bits. Thus, the type of a value in a register can be an enum, a struct, a nested struct, etc., if we have used a deriving(Bits) declaration (or its explicit analog) to ensure that it has a representation in bits.

#### 7.3.4 BSV: Syntactic shorthands for register access

Registers are so ubuiquitous in digital design that BSV provides some special syntactic shorthands for reading and writing registers.

Just mentioning a register in an expression can be used as a shorthand for invoking its **\_read** method. Thus, the expression:

```
rg_pc + 4
```

is shorthand for:

```
rg_pc._read + 4
```

To invoke the \_write method on a register, one can use a conventional assignment statement. Thus, the expression:

```
rg_pc._write (v)
```

can be written like this:<sup>1</sup>

```
rg_pc <= v
```

To recap, a statement like this:

```
rg_pc <= rg_pc + 4
```

is shorthand for:

```
rg_pc._write (rg_pc_read + 4)
```

## 7.4 BSV: Example: FIFOs

FIFOs (First-in-First-out) elements are queues of values and are broadly useful in any hardware design (arguably as useful as registers). We can enqueue a new value into a FIFO at the tail (back) of the queue, and dequeue a value from the head (front) of the queue. Most BSV FIFOs are automatically "flow-controlled", *i.e.*, it is impossible to enqueue into a full FIFO and to dequeue from an empty FIFO.

<sup>&</sup>lt;sup>1</sup>Rather than use "=" or ":=" common in software programming languages, we use "<=", which is the Verilog/SystemVerilog notation for "delayed assignment".

#### 7.4.1 BSV: FIFOF#(t), the FIFO interface

A standard FIFO interface type in the BSV library is:

```
interface FIFOF #(t);
method Bool notEmpty();
method Bool notFull();
method t first();
method Action deq();
method Action enq (t x);
method Action clear();
endinterface
```

Here, "t" is the type of value stored in the register (discussed in more detail below).

The notEmpty() and notFull() are simple predicates to test if a FIFOF is empty or full, respectively.

The first() and deq() methods are used to access the head of the queue. They are only available if the FIFO is not empty. The first() method returns the value at the head of the queue. This is non-destructive, *i.e.*, it does not modify the FIFO. The deq() method modifies the FIFO: it discards the value at head of the queue and advances the queue.

The enq() method is used to access the tail of the queue, and is only available if the FIFO is not full. It modifies the FIFO by appending the argument x to the tail of the queue.

The clear method is used to empty the queue immediately (discard all its contents).

#### 7.4.2 BSV: mkFIF0F, a fifo module (constructor)

The BSV library contains many different FIFO modules (constructors): single-element FIFOs, FIFOs of a specified depth (queue length), FIFOs with and without automatic flow-control, etc. In Drum we use this one:

```
FIFOF #(Mem_Req) f_to_IMem <- mkFIFOF;
```

Here we declare a new identifier f\_to\_IMem with interface type FIFOF#(Bit#(XLEN)) (the FIFO interface type) and bind it to the interface offered by a newly instantiated FIFO.

Different module-constructors may or may not have arguments. For example, this instantiation uses a different BSV library FIFO constructor:

```
FIFOF #(RR_to_RW) f_RR_to_RW <- mkSizedFIFOF (8);
```

This instantiates a FIFOF of specific queue-depth (8). Note that module constructor arguments can play different roles. In mkReg above, the argument (0) became a dynamic value, the value held by the register after reset. Here, the argument (8) only describes *structure*, *i.e.*, the size of the FIFO.

#### 7.4.3 BSV: FIFOFs are strongly typed

Each BSV FIFOF instance can only hold values of one particular type.

Further, the FIFO-contents type need not be Bits#(); it can more generally it can be *any* BSV type that has a representation in bits. Thus, the type of values in a FIFO can be an enum, a struct, a nested struct, *etc.*, if we have used a deriving(Bits) declaration (or its explicit analog) to ensure that it has a representation in bits.

#### 7.4.4 BSV: Semi-FIFO interfaces for each end of a FIFO

FIFOs are often used to connect two separate modules together, for one module to communicate values to the next one. For example, the Fetch step communicates memory requests to memory. In this situation, one module only interacts with the "enqueue" side, and the other module only interacts with the "dequeue" side.

In these situations we will also find it useful to use the following "Semi-FIFO" interfaces interfaces for each "end" of a FIFO queue:

```
interface FIFOF_0 #(t);
method Bool notEmpty();
method t first();
method Action deq();
endinterface
```

```
interface FIFOF_I #(t);
method Bool notFull();
method Action enq (t x);
endinterface
```

There is no extra hardware implied here; these are simply limited "views", or abstractions of an existing FIFO interface.

## 7.5 BSV: Example: Register-files

A register-file is an array of registers with a common pair of methods to read or write a particular register identified by an index, which is an argument to the read and write methods.

#### 7.5.1 BSV: Regfile#(t1, t2), the register-file interface

A standard register-file interface type in the BSV library is:

```
interface RegFile #(type index_t, type data_t);
method Action upd (index_t addr, data_t d);
method data_t sub (index_t addr);
endinterface: RegFile
```

Here, "index\_t" is the type we will use to identify one of the registers in the register-file. For RISC-V, since we have 32 registers, we will use Bit#(5) as the index type.

"data\_t" is the type of value stored in each of the registers. For RISC-V, this will be Bit#(XLEN).

The upd(j,v) method allows us to store the value v in the j'th register. The sub(j) method returns the current value v in the j'th register.

#### 7.5.2 BSV: mkRegFileFull, a register-file module (constructor)

The BSV library contains some different register-file modules (constructors). In Drum we use this one.

```
RegFile #(Bit #(5), Bit #(XLEN)) gprs <- mkRegFileFull;
```

Here we declare a new identifier gprs with interface type RegFile#(Bit#(5),Bit#(XLEN)) (the register-file interface type) and bind it to the interface offered by a newly instantiated register-file. The number of registers in the register-file is known from the full range of the index type Bit#(5), *i.e.*, it will have 32 registers, indexed from 0 to 31.

#### 7.5.3 BSV: What about the RISC-V register x0?

In RISC-V, register x0 (index 0) is defined as "always zero". Any value written to x0 is ignored/discarded, and any read from x0 always returns 0. So, presumably, we do not need an actual register to hold this value, just some circuitry to ensure that we always get 0 when we tey to "read" from x0.

In the previous section, we used the module mkRegFileFull to instantiate a register-file with 32 registers (inferring 32 from the full range of the index type Bit#(5)). Instead, we could use an alternate register-file module from the BSV library that allows us to provide, as module constructor arguments, the lower and upper indexes of interest. This instantiate exactly 31 registers indexed from 1 to 31, thereby saving XLEN bits of register state in our hardware.

```
RegFile #(Bit #(5), Bit #(XLEN)) gprs <- mkRegFile (1, 31);
```

Either way, we need circuitry to implement the "always zero" semantics of x0.

#### 7.6 RISC-V: The interface for the Drum and Fife CPU modules

By sharing a common interface between the Drum and Fife CPU modules, we can easily substitute one for the other in the overall system. We now have all the pieces in place to describe the interface:

```
interface CPU_IFC;
interface FIFOF_O #(Mem_Req) fo_IMem_req;
interface FIFOF_I #(Mem_Rsp) fi_IMem_rsp;

interface FIFOF_O #(Mem_Req) fo_DMem_req;
interface FIFOF_I #(Mem_Rsp) fi_DMem_rsp;
endinterface
```

The interface is simple:

- A FIFOF\_O to carry memory requests instructions (out-bound from the CPU to the memory);
- A FIFOF\_I to carry memory responses containing instructions (in-bound from memory to the CPU);
- A FIFO\_O to carry memory requests from load/store instructions (out-bound from the CPU to the memory);
- A FIFOF\_I to carry corresponding load/store memory responses (in-bound from memory to the CPU).

#### 7.7 RISC-V: A skeleton CPU module for Drum

Here is a skeleton of the CPU module for Drum. We will fill in the missing pieces in subsequent chapters.

```
(* synthesize *)
    module mkCPU (CPU_IFC);
2
       // =======
3
       // STATE
4
5
       // Paths to and from memory
6
       FIFOF #(Mem_Req) f_to_IMem
                                       <- mkFIFOF;
       FIFOF #(Mem_Rsp) f_from_IMem <- mkFIFOF;</pre>
       FIFOF #(Mem_Req) f_to_DMem
                                       <- mkFIFOF;
       FIFOF #(Mem_Rsp) f_from_DMem <- mkFIFOF;</pre>
10
11
       // The integer register-file
12
       RegFile #(XLEN)
                            gprs <- mkRegFileFull;</pre>
13
14
       // The program counter
15
       Reg #(Bit #(XLEN)) rg_pc <- mkReg (0);</pre>
```

```
17
       // Inter-step registers
                                                                     <- mkRegU;
       Reg #(F_to_D)
                                         rg_F_to_D
19
       Reg #(D_to_RR)
                                                                     <- mkRegU;
                                         rg_D_to_RR
20
                                                                     <- mkRegU;
       Reg #(RR_to_EX_IALU)
                                         rg_RR_to_EX_IALU
21
       Reg #(RR_to_EX_DMem_Req)
                                         rg_RR_to_EX_DMem_Req
                                                                     <- mkRegU;
22
       Reg #(EX_DMem_Req_to_EX_DMem_Rsp) rg_EX_DMem_Req_to_EX_DMem_Rsp <- mkRegU;</pre>
23
24
25
       // BEHAVIOR
26
       ... // This section will code the dynamic "behavior" of the module
28
       ... // and will be discussed in the next chapter
29
30
       31
       // INTERFACE
32
33
       // One of the sub-interfaces
34
       interface FIFOF_O #(Mem_Req) fo_IMem_req;
35
          method Bool notEmpty();
36
             return f_to_IMem.notEmpty;
37
          endmethod
38
39
          method t first();
40
             return f_to_IMem.first;
41
          endmethod
42
43
          method Action deq();
44
             f_to_IMem.deq;
45
          endmethod
46
       endinterface
47
48
       ... // and similarly for the fi_IMem_rsp sub-interface
49
       ... // and similarly for the fo_DMem_req sub-interface
50
       ... // and similarly for the fi_DMem_rsp sub-interface
51
52
    endmodule
53
54
```

#### 7.8 BSV: Interface-transformer functions

The idea of "viewing" the output-side of a FIFOF interface as a FIFOF\_O interface can be expressed in a BSV function:

```
function FIFO_0 #(t) to_FIFOF_0 (FIFOF #(t) f);
interface FIFOF_0 #(Mem_Req) fo_IMem_req;
```

```
method Bool notEmpty();
3
              return f.notEmpty;
4
          endmethod
5
6
          method t first();
              return f.first;
          endmethod
9
10
          method Action deq();
11
              f.deq;
12
          endmethod
13
       endinterface
14
    endinterface
15
```

#### Exercise 7.1:

Write a similar function to transform a FIFOF interface into a FIFOF\_I interface, with notFull and enq methods.

With these functions in hand, we can simplify the INTERFACE section of our Drum CPU module just four lines:

```
(* synthesize *)
1
   module mkCPU (CPU_IFC);
2
3
      ... STATE and BEHAVIOR ...
4
      // ======
                        ______
6
      // INTERFACE
      interface fo_IMem_req = to FIFOF_O (f_to_IMem);
9
      interface fi_IMem_req = to FIFOF_O (f_from_IMem);
10
      interface fo_DMem_req = to FIFOF_O (f_to_DMem);
11
      interface fi_DMem_req = to FIFOF_O (f_from_DMem);
12
   endmodule
13
```

# Chapter 8

# BSV: Finite State Machines (FSMs); The Drum CPU

#### 8.1 Introduction

So far, we have only been discussing pure combinational functions, for which there is no concept of time. Combinational functions are just "instantaneous", pure mathematical functions transforming input values to output values. However, a CPU, as shown in Figure 8.1 represents a *processs*, a behavior that evolves over time. For example the Drum CPU ex-



Figure 8.1: Simple interpretation of RISC-V instructions (same as Fig. 3.1)

ecutes one full instruction after another, and the black arrows in the diagram represent an infinite loop. For each instruction, first it performs a Fetch operation, which sends a request to memory. Some time later, the memory sends back a response, which is then processed by the Decode step, before proceeding to the next step, and so on. After the Execute Control and Register-write steps it loops back to the Fetch step for the next instruction.

In the Fife CPU, on the other hand, each of the yellow boxes represents a "stage" and the black arrows represent messages sent from one stage to another. Each stage is itself an infinite loop, consuming incoming messages and producing outgoing messages. Thus, while the Decode state is working on the first instruction, the Fetch stage is already working on the second instruction, and so on. There is a sequence of instructions in this pipeline, advancing on every "tick". A pipelined CPU also needs some additional state to manage interactions between multiple instructions in the pipeline, such as "epochs", "scoreboards" and "store-buffers" (see Figure 10.1).

In this chapter we build up to creating the Drum CPU.

#### 8.2 BSV: Behavior in Modules

We repeat here, as a reminder, the display from Section 7.2.2 that shows the overall components of a BSV module to remind us of where the independent behaviours (processes) of a module reside textually in a module declaration:

```
(* synthesize *)
module module-name ( interface-type );
    ... instantiation of module state (registers, FIFOs, other sub-modules) ...
    ... behavior (rules and FSMs) ...
    ... interface (API) method implementations ...
endmodule
```

## 8.3 BSV: Finite State Machines (FSMs)

Finite state machines (FSMs) are a classical concept in digital hardware design. An FSM is an artefact that can, at any time, be in one of a number of possible *states*. One state is often distinguised as the *start* state, the initial state of the FSM.

From each state, an FSM can *transition* to another state; the choice of destination state may depend on predicates on the current state and on external inputs to the FSM.

One classical notation for FSMs is "bubble-and-arrow" diagrams: bubbles represent states, and arrows represent transitions between states. Thus, an FSM is a specification of a *process* that, over time, moves from state to to state. The process can be infinite, with an transitions back to earlier states.

For Drum, we will interpret Figure 8.1 as a bubble-and-arrow diagram for an FSM. Each of the yellow boxes will be a state, and the black arrows represent state-transitions.

#### 8.3.1 Sequential FSMs, Concurrent FSMs, and Digital Hardware

Classical FSMs in the literature are *sequential* FSMs—every transition is from the current state to a unique, particular next state.<sup>1</sup>.

 $<sup>^{1}</sup>$ The transition can be *non-deterministic* in choice of next-state from a set of states, but once the choice is made, the FSM is again in a particular state

Most non-trivial digital hardware systems are *concurrent FSMs*. Different module instances are separate FSMs, each running their own process(es). These separate FSMs may communicate with each other *via* shared state (registers, fifos, register files, *etc.*).

For Fife, we will interpret Figure 8.1 as a set of concurrent FSMs. Each of the yellow boxes will be an FSM, and the black arrows represent communication between FSMs.

#### 8.4 BSV: StmtFSM

The central BSV construct for temporal behavior (processes) is the "rule". Collections of rules can express any sequential or concurrent FSM. However, because Drum is a simple, structured sequential process, we can use a simpler, higher-level BSV notation called "StmtFSM" (which, in turn, is just converted into rules by the *bsc* compiler).

StmtFSM is a sub-language within BSV for expressing *structured*, mostly sequential processes.<sup>2</sup> The constructs are similar to most common sequential programming languages: blocks to express temporally sequenced actions, if-then-elses, while-loops and for-loops.

(Note: we already briefly encountered a simple StmtFSM—the testbench in Section 4.5.)

#### 8.4.1 BSV: Actions and the Action type

The fundamental building-block for StmtFSM is the "action", which is a statement/expression of type Action. Some common examples:

```
rg_pc <= rg_pc + 4;  // Assignment to a register

f_F_to_D.deq;  // Dequeue a fifo

f_D_to_RR.enq (v);  // Enqueue into a fifo

$display ("Hello, World!");  // Print something (in simulation only)
```

As discussed in Section 7.3.4 the first assignment statement is syntactic shorthand for:

```
rg_pc._write (rg_pc._read + 4)
```

i.e., it is an invocation of the register \_write method which, as described in Section 7.3.1 has type Action. Similarly, as described in Section 7.4.1, fifo enq and deq methods have return-type Action, so the statements f\_D\_to\_RR.enq (v) and f\_D\_to\_RR.enq (v) have type Action.

\$display() is a built-in construct in BSV that also has type Action.

<sup>&</sup>lt;sup>2</sup>StmtFSM can also express structured fork-join parallelism, but we do not need that for Drum.

#### BSV: Action-blocks: grouping actions into larger actions

The Action type is recursive: it is either a primitive action (like those described just above), or it is a collection of things of type Action, collected using an action block (bracketed by the BSV keywords action and endaction). For example the above primitive actions can be collected into a single entity which itself has type Action:

```
action

rg_pc <= rg_pc + 4;  // Assignment to a register

f_F_to_D.deq;  // Dequeue a fifo

f_D_to_RR.enq (v);  // Enqueue into a fifo

$display ("Hello, World!");  // Print something (in simulation only)

endaction
```

Although the actions in an action block must be written in some textual order, there is no temporal ordering of these actions. All the primitive actions in an action block (either directly in the block or, recursively in a sub-block) occur "instantly" and "simultaneously".

#### 8.4.2 BSV: StmtFSM: sequences of actions

Our first construct that has temporal behavior is the **seq-endseq** block. Each item in the block is typically an entity of type |Action|, and they are performed sequentially, one after another.

The testbench in Section 4.5 contains an example of a seq block:

```
seq
... action 1 ...

... action 2 ...

... action 3 ...

... action 4 ...

endseq
```

The seq block itself has type Stmt. The items in a block can have type Action or the type Stmt, *i.e.*,, blocks can be nested.

#### 8.4.3 BSV: StmtFSM: if-then-elses

Conditional execution can be expressed with traditional if-then-else notation:

Note: in Section 4.9 we described ordinary BSV if-then-else expressions. StmtFSM uses the same notation, but there is no ambiguity—the context always clearly distinguishes what we mean, because there is no overlap between ordinary expressions and StmtFSM constructions.

#### 8.4.4 BSV: StmtFSM: while-loops

Repetitive processes can be expressed with traditional while-loop notation:

```
while (... Bool expression ...)
conduction while (... expression of type Stmt ...
```

(Note: StmtFSM also has for-loop notation, but we do not need it for Drum.)

#### 8.4.5 BSV: StmtFSM: mkAutoFSM: a simple FSM module constructor

Given an entity of type Stmt, we can pass it as an argument to to the module constructor mkAutoFSM

```
mkAutoFSM (... expression of type Stmt ...);
```

This creates an FSM with the behavior specified by the Stmt argument. The FSM starts running immediately as we come out of reset, starting at the first statement, and terminates when we fall through the last statement.

#### 8.5 RISC-V: The Drum CPU module

The outline of the CPU module for Drum was shown in the display in Section 7.7. Here we fill in the BEHAVIOR section that we elided in that display.

```
(* synthesize *)
1
  module mkCPU (CPU_IFC);
2
3
4
     // STATE
     ... // shown earlier
     9
     // BEHAVIOR
10
11
    mkAutoFSM (
12
       seq
13
         fa_Fetch_step;
```

```
fa_Decode_step;
15
              fa_Register_Read_and_Despatch_step;
16
17
              if (rg_D_to_RR.opclass == OPCLASS_ILLEGAL) action
18
                 $display ("%Od: RR.rl_RR: ILLEGAL INSTR", cur_cycle);
19
                 $display (" ", fshow_D_to_RR (x));
20
                 $finish (1);
21
              endaction
22
23
              else if (result_RR.redirect)
24
                 rg_pc <= y.redirected_pc;
25
26
              else if (result_RR.to_RW.merge_source == MERGE_EX_IALU) seq
27
                 rg_RR_to_EX_IALU <= result_RR.to_EX_IALU;</pre>
28
                 fa_EX_IALU_step;
29
                 fa_RR_to_RW_step;
30
31
              else if (result_RR.to_RW.merge_source == MERGE_EX_MEM)
32
                 f_to_DMem.enq (result_RR.to_DMem);
33
                 fa_EX_Mem_step;
34
                 fa_RR_to_RW_step
35
           endseq);
36
37
38
       // INTERFACE
39
40
        ... // shown earlier
41
    endmodule
42
43
```

- 8.6 RISC-V: Compiling Drum BSV code to Verilog
- 8.7 RISC-V: Compiling and linking Verilog using Verilator
- 8.8 RISC-V: Verilog simulation of Drum
- 8.9 RISC-V: Implementing Drum Verilog on an FPGA

# Chapter 9

# Pending (to be written): Drum

This is a place-holder chapter with a TODO list of pending possible topics to be written.



Figure 9.1: Simple interpretation of RISC-V instructions (same as Fig. 3.1)

# 9.1 Drum: Remaining functions

Register-read and write

EX IALU

DMem

Retire

#### 9.2 Drum: Traps, and basic CSRs

Note: These (implementation-independent) explanations may already be done in Chapter 2 on the RISC-V ISA:

- Explanation of illegal instruction traps
- Explanation of Fetch memory-access traps
- Explanation of LOAD/STORE/AMO memory-access traps

Overview of how a trap is handled, using minimal set of CSRs (Control and Status Registers): MCAUSE, MTVAL, MEPC, MTVEC, MCAUSE, MRET.

CSR register file and CSRRxx instructions to access them.

CSR (Control Status Registers)

### 9.3 Drum: Interrupts

Interrupts in Drum:

- General concepts: MIP and MIE registers, MSTATUS.interrupt-enable-bit.
- Interrupts are initially disabled using the MSTATUS.interrupt-enable bit immediately; CSRxx can be used to re-enable.
- Interrupts are handled just like traps; the only question is: when to check for interrupts and respond.
- How does MIE bit return to 0?

#### 9.4 Measurements for Performance Analysis

MTIME, MCYCLE, MINSTRET

"hpmcounter" CSRs for other events.

# Chapter 10

# Pending (to be written): Fife

This is a place-holder chapter with a TODO list of pending possible topics to be written.

Figure 10.1 annotates the abstract execution algorithm in Figure 3.1 with some specifics for the pipelined implementation in Fife. In particular:



Figure 10.1: Pipelined interpretation of RISC-V instructions (same as Fig. 3.1)

- Each of the yellow boxes is now itself a *process*. Each yellow box is an infinite loop that consumes data from input FIFOs and produces data on output FIFOs. All yellow boxes run concurrently, which is how we get pipelined behavior.
- Pipelining requires that the Fetch unit fetches the "next" instruction before it even knows what the current instruction is (it could be a BRANCH or a JAL/JALR which transfers control to a different PC that is not PC+4. Fetching the instruction, and executing it, could result in a exception in which case, again, the "next" instruction will be at the PC for the trap handler, not the current PC+4.

We may choose to respond to an external interrupt in which case, again, the "next" instruction will be at the PC for the trap/interrupt handler, not the current PC+4.

As a result, we may fetch a "wrong-path" instruction. The **epoch** register is for correcting wrong paths.

- Pipelining implies that an instruction reaching the RR step, and needing to read register J may have to wait for an older instruction that is further down the pipe that has not yet completed to write a value into register J. The scoreboard is for dealing with this.
- Instructions may attempt to write to memory. The store buffer holds these updates until we can deterimine whether the instruction is a wrong-path instruction or not. If it's a wrong path instruction, we can discard the update, else commit it to memory.
- MMIO accesses may not be "memory-like". A LOAD can have a side effect; a STORE may have side-effects more than just writing a value, such as starting a motor or launching a rocket! Finally, a STORE to address A followed by a LOAD from address A may not return the same value. For such accesses, the "execute DMem" stage must defer the access until later when we are sure it is not a wrong-path instruction.



Figure 10.2: Actions in the "Retire" stage of Fife

## 10.1 Fife: pipelined execution

Fife: Basic concepts of pipelined execution of instructions, i.e., pipelined implementation of the abstract algorithm of Figure 3.1.

Fife: parallel execution pipes for "Execute" steps in Figure 10.1

- Dispatching to different pipes after Register-Read
- Results from "Execute" pipes may be ready in different order from dispatch.
- Tagging each pipe so "Retire" stage can retire in-order

## 10.2 PC-prediction (control speculation)

#### PC-prediction

- The need for at least a minimal PC-prediction in Fetch stage: for many instructions, next-PC may not be known until deep in the pipeline; if so, what should Fetch do in the meantime?
  - for conditional-branches (BRANCH) and jumps (JAL and JALR), next-PC is not known until "Control" step of the pipe. What should Fetch do in the meantime?
  - If there is a trap, next-PC changes from the normal next-PC.
    - \* An instruction-fetch may trap, which we will discover only in memory-response arriving at the Decode stage
    - \* The decode stage may find that the fetched "instruction" is not a legal instruction. This causes a trap.
    - \* An instruction may trap in the Execute stage (e.g.,, LOAD/STORE/AMO may trap in memory access)
- The simplest predictor: PC+4
- Handling mispredictions: epochs, carrying epochs on instructions, discarding wrongepoch. How many epochs are needed?
- More advanced predictors

# 10.3 Register read/write hazards in Fife

Scoreboards, scoreboard management, stalling-while-scoreboard busy.

Brief discussion of bypassing?

## 10.4 Speculative vs. non-speculative memory ops

"Execute Memory Ops" stage is always speculative (because of mispredictions, and because older instructions in other pipes may trap).

- Need for store-buffer and final commit/discard from Retire-stage
- Mem-system performs store only in store-buffer
- Retire stage sends commit/discard message to store-buffer. Discuss: do we need 'discard' messages, or are 'commit' messages enough?

#### What about MMIO?

- LOADs may have side-effects
- LOADs may not be idempotent
- STOREs may have side-effects in addition to value stored
- LOAD may not return most-recently stored value

So, these cannot be done speculatively, i.e., in "Execute Memory Ops" stage.

- 'Execute Memory ops' stage defers the request by sending it back as another kind of 'response'
- Retire step performs it once we know it is non-speculative.

#### 10.5 Fife: CSRs

- CSRRxx are read-modify-write operations
- CSRRxx access may not be memory-like (side-effecting reads, read may not return last written value,
- ... (a bit like MMIO issues)

Hard to pipeline, so execute in Retire stage, as FSM.

CSRRxx instructions should be rare, so FSM exec does not affect overall performance.

## 10.6 Fife: Interrupts

Sample for interrupts in Retire stage, fix up CSRs and and redirect.

Retire stage already has infra for CSR update and redirection, so this is a small incremental change.

# Chapter 11

# Pending (to be written): Advanced topics, possibly in "Book 2"?

This is a place-holder chapter with a TODO list of pending possible topics to be written.

Goal: Linux capability, better performance.

## 11.1 Advanced branch prediction

Is a form of online machine-learning.

Branch instruction hints.

Branch-Target buffers (BTBs)

Return-Address Stacks (RASs)

Hysteresis in prediction.

#### 11.2 Caches

Separate I- and D-caches.

FENCE.I: for "manual" I- and D-Mem coherence.

FENCE: flushing caches for for devices.

Multi-level caches and cache hierarchies.

Cache-coherence.

Non-blocking caches.

# 11.3 Compressed instructions

Implementation: wholly in Decode stage. Common shared between Fife and Drum.

Affect of compressed instructions on Fetch.

Affect of compressed instructions on PC-prediction.

Affect of non-32-bit alignment of compressed instructions.

## 11.4 AMO operations

Note: These (implementation-independent) explanations may already be done in Chapter 2 on the RISC-V ISA:

- Explanation of LR and SC instructions
- Explanation of AMOxxx instructions
- Implementation in the memory system

### 11.5 Multiple Privilege levels

Machine, Supervisor, User

Interrupt/trap delegation.

Hypervisor

## 11.6 Memory Protection with PMPs

Page Tables, TLBs

## 11.7 Virtual Memory

Page Tables, TLBs

#### 11.7.1 RISC-V ISA Formal Specification

Sail model

# Appendix A

# **Resources: Documents and Tools**

This appendix describes all the resources relevant to this course.

#### A.1 GitHub

We will be using GitHub extensively. Course materials will be provided in a public GitHub repository, and GitHub's "discussion" facilities can be used to for questions and answers, visible to all.

For students who do not already know how to use GitHub, we will teach the basics.

More detailed documentation can be found starting at: https://docs.github.com/en/get-started/quickstart

# A.2 RISC-V ISA (Instruction Set Architecture) Specifications

We will refer to the Unpriviledged ISA very frequently, so you may wish to download a copy of the PDF for your laptop, and/or print a copy. The Privileged ISA document is not needed until later.

- "The RISC-V Instruction Set Manual Volume I: Unprivileged ISA".

  Bibliography entry [14] contains a link to riscv.org from which to download a PDF.
- "The RISC-V Instruction Set Manual Volume II: Privileged Architecture"

  Bibliography entry [15] contains a link to riscv.org from which to download a PDF.

# A.3 RISC-V Assembly Language Manuals

We will not do very much assembly language programming, and we will teach whatever notation we need during the course.

There are several RISC-V Assembly Language manuals available online, and some in bookstores; download them only if you prefer a local copy:

• "RISC-V Assembly Programmer's Manual", Palmer Dabbelt, Michael Clark and Alex Bradbury.

Bibliography entry [3] contains a link to online manual.

- "RISC-V ASSEMBLY LANGUAGE Programmer Manual Part I", Shakti RISC-V Team, Indian Institute of Technology, Madras, India. Please see bibliography entry [13] for link from which to download a PDF.
- "An Introduction to Assembly Programming with RISC-V", Edson Borin.

  Bibliography entry [2] contains a link from which to download a PDF.
- "RISC-V Assembly Language", Anthony J. Dos Reis. Bibliography entry [5]. Available in bookstores.

## A.4 RISC-V GNU tools, including riscv-gcc compiler

We will be using the GNU tool chain, specifically the *gcc* compiler and linker, and the *objdump* tool for disassembling an ELF file.

During the course we will show you how to install and use these tools.

The use of these tools is mostly the same as when targeting any target architecture, including well-known architectures like x86 and ARM; the student can find voluminous tutorial materials available on the GNU tool chain on web and in books.

qcc has some specific options for RISC-V; these are documented here:

- https://gcc.gnu.org/onlinedocs/gcc/RISC-V-Options.html
- https://gcc.gnu.org/onlinedocs/gcc/gcc-command-options/machine-dependent-options/risc-v-options.html

It is also useful to know how to use the GNU debugger tool, gdb. Again, the student can find voluminous tutorial materials available on on web and in books.

#### A.5 BSV

In this course, we design the hardware of our RISC-V pipelined CPU using the High Level Hardware Description Language **BSV**. The reasons for our choice (instead of using Verilog, SystemVerilog or VHDL) are discussed in more detail in Appendix B of this document, as well as in the Introduction of the "BSV by Example" book described below.

No advance knowledge of **BSV** is needed for this course; we will teach all necessary **BSV** concepts during the course as we go along.

However, for those who would like to study **BSV** on their own, or wish to view additional **BSV** materials, the following sections provide some resources.

#### A.5.1 "BSV By Example" book (free downloadable PDF)

This book takes the student through a series of small, targeted  ${\bf BSV}:$  examples:

BSV by Example, by Rishiyur S. Nikhil and Kathy R. Czeck, 2010.

Quoting from the Introduction:

"This book is intended to be a gentle introduction to BSV."

"This book tries to take you into the BSV language one small step at a time. Each section includes a complete, executable (and synthesizable) BSV program, and tries to focus on just one feature of the language"

A bound copy of the book can be purchased on Amazon, but a PDF copy of the book and a tar file containing all the BSV program examples in the book can be downloaded for free from the GitHub BSVLang repository at:

https://github.com/BSVLang/Main/tree/master/Tutorials/BSV\_Training

- Book (PDF): repository/Tutorials/BSV\_by\_Example\_Book/bsv\_by\_example.pdf
- Machine-readable version of all examples in the book:
   repository/Tutorials/BSV\_by\_Example\_Book/bsv\_by\_example\_appendix.tar.gz

#### A.5.2 BSV Tutorial

A BSV self-paced tutorial is available in the GitHub BSVLang repository:

https://github.com/BSVLang/Main/tree/master/Tutorials/BSV\_Training

in the directory repository/Tutorials/BSV\_Training/ which looks like this:

```
BSV_Training/
Build/
Example_Programs/
Common
Eg02a_HelloWorld
...
Eg03a_Bubblesort
...
Eg04a_MicroArchs
...
Eg05a_CRegs_Greater_Concurrency
...
Eg06a_Mergesort
...
Eg09a_AXI4_Stream
Reference
```

Each of the Eg\* directories cotains a complete example, along with documentation explaining the example, and instructions on how to compile and Verilog-simulate it. The Reference directory contains a collection of lecture slide decks explaining the BSV language.

#### A.5.3 MIT Course Material

Massachusetts Institute of Technology (MIT) periodically teaches courses on using **BSV** for digital hardware design. The following link:

http://csg.csail.mit.edu/6.375/6\_375\_2013\_www/handouts.html

contains downloadable material:

- PDFs of slide decks for 12 lectures
- PDFs of slide decks for 4 tutorials classes
- PDFs and codes for 6 laboratories

#### A.5.4 University of Cambridge Examples

Prof. Simon Moore of University of Cambridge, UK, uses **BSV** in his teaching and research. Several of his **BSV** examples can be found here:

```
https://www.cl.cam.ac.uk/~swm11/examples/bluespec/
```

These examples are somewhat more advanced than the ones in the previous sections.

#### A.5.5 bsc download and installation; bsc and BSV manuals

bsc is free and open-source, and can be downloaded and installed as described in BSV's GitHub web site https://github.com/B-Lang-org/bsc.

On the main page of that repository you will find links to the following documents (same links also given here):

• The "BSV Language Reference Guide". This document describes the syntax and semantics of BSV.

```
PDF: https://github.com/B-Lang-org/bsc/releases/latest/download/BSV_lang_ref_guide.pdf
```

• The "BSC Libraries Reference Guide". This document describes the extensive set of libraries and IP (Intellectual Property blocks) available to the **BSV** user.

```
PDF: https://github.com/B-Lang-org/bsc/releases/latest/download/bsc_libraries_ref_guide.pdf
```

• The "BSC User Guide". This document describes how to use the *bsc* compiler, which compiles our hardware descriptions written in **BSV** into Verilog (which can then be simulated or synthesizes using standard Verilog tools).

```
PDF: https://github.com/B-Lang-org/bsc/releases/latest/download/bsc_user_guide.pdf
```

We will be using the Language Reference Guide and Librares Reference Guide extensively, so you may wish to download a copy for your laptop.

## A.6 Verilator (or other Verilog simulator)

We will be doing Verilog simulations extensively during this course. For low cost (free), and uniformity, we will be using Verilator.

During the course, we will show you how to install Verilator and use it.

The Verilator web site, https://www.veripool.org/verilator/, contains instructions on how to install Verilator, and also links to PDF and HTML manuals for Verilator. Version 5.004, or any more recent version, will be suitable.

You can use other Verilog simulators if you prefer, but you should independently know how to use them because we cannot offer support during the course. Some possibilities:

• Icarus Verilog, also known as "iverilog". This is a very good, free and open-source, easy-to-use Verilog simulator, but is quite slow compared to other Verilog simulators and so may be less useful for large designs.

```
https://steveicarus.github.io/iverilog/usage/getting_started.html
```

• Commercial simulators from Synopsys, Cadence or Siemens/Mentor Graphics), Aldec, and others. Each of these needs a paid license.

#### A.7 Amazon AWS

All hands-on work in this course will be run on the Amazon AWS cloud. This way, everyone in the course has a common, stable, predictable environment and we do not have to waste any time dealing with the countless variations in environments found on different laptops and servers.

During the course, we will explain all necessary concepts as we go along, including how to set them AWS instances and use them.

The Amazon AWS cloud offers, on the "AWS Marketplace" a vast variety of choices for virtual machines or, to use AWS terminology, *instances*. We expect to use the following kinds of instances:

- A: An instance running the latest version of Ubuntu (Linux).
- B: A so-called "F1 instance", also running Ubuntu. F1 instances have attached FPGAs.
- C: An instance running the so-called "AWS FPGA Developer AMI" available in the AWS Marketplace. This runs CentOS (Linux) and comes pre-installed with Xilinx Vivado tools, which we will use for creating FPGA bitfile images during the course.

In Amazon's pricing, (B) is the most expensive, and so we will use that only when we actually run on FPGA. For general development and simulation activities, we'll use (A) which is much cheaper. We will use (C) whenever we're creating a new FPGA bitfile image.

The standard Amazon documentation is can be found here:

- "Set up to use Amazon EC2" https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/get-set-up-for-amazon-ec2.
- "Tutorial: Get started with Amazon EC2 Linux instances" https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2\_GetStarted.html

#### A.8 Xilinx Vivado

The FPGAs on Amazon AWS F1 instances are Xilinx Ultrascale FPGAs. Thus, when we build bitfiles on Amazon AWS, we will be using Xilinx Vivado tools (which are provided by AWS for zero incremental cost on AWS FPGA Developer AMI instances, see A.7).

During this course, we will explain all necessary concepts as we go along.

When building a bitfile, it is particularly useful to understand how to interpret the Vivado timing and resource reports. The timing report indicates:

- whether or not our design has successfully met our desired frequency target (MHz), and
- if it did not, which part of our circuit is the likely culprit, which needs to be fixed.

The resource report indicates the "size" or our design (how may LUTs, flip-flops, BRAMs, DSPs, etc.).

For more details, Xilinx has extensive documentation for which a good starting point is the "Vivado Design Suite Overview" at

https://docs.xilinx.com/r/en-US/ug910-vivado-getting-started/Vivado-Design-Suite-Overview.

#### A.9 RISC-V textbooks

advanced textbook.

This course is self-contained, and it is not necessary to acquire any textbooks.

The following list is provide only as a courtesy and convenience. All these books are written using the RISC-V instruction set as examples, and are available in bookstores.

- "The RISC-V Reader: An Open Architecture Atlas", David Patterson and Andrew Waterman, Strawberry Canyon, 2017. Available in bookstores.

  Bibliography entry [4].
- "Computer Organization and Design RISC-V Edition (2nd Edition): The Hardware Software Interface", David A. Patterson and John L. Hennessy Morgan Kaufman, 2020. Available in bookstores.

  Bibliography entry [11].
- "Computer Architecture: A Quantitative Approach, 6th Edition", John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2017. Available in bookstores.

  Bibliography entry [6]. This is the "classic" textbook on computer architecture, a more

# Appendix B

# Why BSV?

The BSV language is a modern, high-level, hardware description language with a strong formal semantic basis. It is *fully synthesizable*, *i.e.*, the *bsc* compiler can also compile your source code into Verilog [8], which we regard as the "assembly language" of hardware design. That Verilog code can then be further compiled by ASIC synthesis tools such as Synopsys' Design Compiler or FPGA synthesis tools such as Xilinx Vivado or Altera Quartus, for implementation in ASICs and FPGAs, respectively.

BSV is very suitable for describing architectures precisely and succinctly, and has all the conveniences of modern advanced programming languages such as expressive user-defined types, strong type checking, polymorphism, object orientation and even higher order functions during static elaboration.

All computation in BSV is expressed using "Rules". For many people, this takes a little acclimatization because it is *very* different from traditional programming models (such as in C++ or Java) which are based on sequential processes. But over time, it becomes *the* natural way to think about hardware computation, which is based on massive, fine-grained, heterogeneous parallelism. Complex and high-speed hardware designs are full of very subtle issues of concurrency and ordering, and BSV's computational model is one of the best vehicles with which to study and understand this.

Modern hardware systems-on-a-chip (SoCs) have so much hardware on a single chip that it is useful to conceptualize them, analyze them and design them as distributed systems rather than as globally synchronous systems (the traditional view), i.e., where architectural components are loosely coupled and communicate with messages, instead of attempting instantaneous access to global state. This is because the delay in communicating a signal across a chip is now comparable to the clock periods of individual modules. Again, BSV's computational model is well suited to this style of design.

A key to architecting complex systems and reusable modules, whether in software or in hardware, is powerful *interfaces*. Module interfaces in BSV are object-oriented (based on methods), polymorphic/generic, and capture certain computational protocols. This facilitates creating highly reusable modules, enables quick experimentation with alternatives structures, and allows designs to be changed gracefully over time as the requirements and specifications evolve.

Architectural models written in BSV are fully executable. They can be simulated in the Bluesim<sup>TM</sup> simulator; they can be synthesized to Verilog and simulated on a Verilog simu-

lator; and they can be further synthesized to run on FPGAs or be etched into ASIC silicon, as illustrated in Fig. 1.1. Even when the final target is an ASIC, the ability to run on FPGAs enables early architectural exploration, early development of the software that will later run on the ASIC, and much more extensive and early verification of the correctness of the design. Students are also very excited to see their designs actually running on real FPGA hardware.

In this book, we teach the use of BSV for the design of complex hardware modules and systems by going in detail through a series of examples, and exploring basic concepts as needed along the way, such as combinational circuits, pipelines, data types, modularity and complex concurrency. At every stage the student is encouraged to run the designs at least in simulation, but preferably also on FPGAs.

By the end of the course we will have seen all the source code for a complete simple, pipelined RISC-V CPU along with a small "SoC" (System-on-a-Chip) including an interconnect, a connection to memory, and a few devices such as a UART.

#### Who uses BSV?

BSV has been used in teaching and research at major universities, including MIT (Massachusetts Institute of Technology, USA), University of Cambridge (UK), Indian Institute of Technology, Madras (India), Indian Institute of Technology, Mumbai (India), Seoul National University (South Korea), University of Texas at Austin (USA), Carnegie Mellon University (USA), Georgia Institute of Technology (USA), Cornell University (USA) and Technical University of Darmstadt (Germany).

BSV has been used to design major IP components in commercial ASICs from Texas Instruments, ST Microelectronics and Google. It has been used for FPGA-based modeling at IBM, Intel, Qualcomm, Microsoft Research, several DARPA projects, and others. It is being used for commercial RISC-V processors from InCore Semiconductors (Shakti line of RISC-V processors, India) and The C-DAC (Center for Development of Advance Computing) (Vega line of RISC-V processors, India).

## B.1 Why BSV instead of some other Hardware Design Language?

The rest of this chapter is intended for those interested in comparing BSV's approach to other approaches (Verilog, SystemVerilog, VHDL, and SystemC), and can be safely skipped by others who just want to get on with learning BSV.

One may be curious why the material in this book could not have been covered using one of the more widely known languages for hardware design: Verilog [8], VHDL [7], SystemVerilog [10], SystemC [9]. There are several reasons, outlined below.

In the following paragraphs, we will refer to all the above languages, or at least their synthesizable subsets, as "RTL" (Register Transfer Level languages).

#### B.1.1 A better computational model

Paradoxically, the formal definitions of the semantics of traditional hardware design languages (HDLs)— Verilog, SystemVerilog, VHDL and SystemC— are not in terms of hardware concepts, but in terms of software simulation on conventional computers. Like traditional software programming languages, they are defined in terms of sequential statement execution, with traditional conditionals, loops, and procedure calls and returns, reading and writing conventional variables. Programs can have multiple concurrent processes (e.g., "always blocks" in Verilog), but each of them is defined with traditional sequential programming semantics.

Digital hardware, on the other hand, has a quite different computation model. It consists of hundreds, if not thousands of concurrent "state machines" that transform the current state of the hardware, implemented using registers, memories and FIFOs. By and large, there is no sequencing of these state machines based on program counters or statement sequences. Rather, these state machines are independent and "reactive", *i.e.*, each one performs an action whenever certain conditions hold, e.g., when a register holds a particular value, or a value is available in a FIFO, etc.

To bridge this rather large gap from conventional sequential processes to concurrent reactive state machines requires a major mental shift. One must severely restrict code to only a much smaller so-called *synthesizable subset* of a conventional HDL. Processes are restricted to simple clocked loops: "always @posedge CLK ...", also known as an "always-block". Even more draconian is a transposition away from the natural concurrent state machine view to a *state element-centric* view: even though a state element may be read and written by multiple state machines, all updates to that state element must be concentrated in a single always-block, usually in a large conditional construct (if-then-else, case, ...) that describes all the different contributions of different state machines. This transposition, from the natural state machine-centric view to the rather unnatural state element-centric view, is necessary because in the the synthesizable subset there is no synchronization between always-blocks; the programmer has to plan every detail of how to resolve (arbitrate) competing updates to each state element.

In other words, in conventional HDLs, neither the simulation view (sequential processes) nor the synthesizable view (state element-centric always blocks) are a natural way to model hardware behavior.

BSV programs, instead, directly express the natural model of hardware—concurrent state machines. Each "rule" in BSV is a reactive state transition that awaits some condition on the hardware state and then takes an action to transforms the state. Further, each rule is an atomic transaction, i.e., the details of how one arbitrates competing accesses from multiple rules to common shared state is left to the compiler. This kind of arbitration logic, which is hand-written in other HDLs, is a major source of bugs.

In BSV, unlike in other HDLs, the semantics are identical whether you execute in simulation or in hardware—there is no mental gear shift necessary, and simulation behavior is always identical to synthesized hardware behavior.

Finally, the Rules computation model uniquely encourages *refinement*, a powerful design methodology. We initially create a high-level, approximate model of a target design, using a few large rules. Both the level of micro-architectural detail and the range of functionality

are approximated (abstracted). Often such a model can be written in less than a day, and it can immediately be executed to verify functional correctness. Then, over time, we incrementally add architectural detail—for example, pipeline registers and state machines with more, smaller steps—and the original rules (large step state transitions) are replaced by more, smaller rules (small step state transitions). The atomic semantics of rules makes this a robust methodology, *i.e.*,, a refinement does not have a large ripple effect. This is in quite dramatic contrast to the difficulty in changing RTL, which is notoriously brittle and unforgiving.

Refinement allows early and continous confidence in functional correctness and completeness, since we execute the code very frequently. Refinement allows mid-course corrections in functionality, after observing execution on real data. Refinement allows separating functionality from performance, achieving functionality early and holding it constant while we improve performance to meet a performance target (by target performance we mean some desired targets for speed, area, and power).

#### **B.1.2** Modern language features

The field of programming languages has seen tremendous progress since the early days (1950s). Modern high-level languages have advanced type systems (polymorphism, type-classes and overloading, functional types, and so on). Modern high-level languages have strong mechanisms for encapsulation and abstraction (such as object-orientation) which promote the separation of concerns between externally visible behavior and internal representation choices. Modern high-level languages make frequent use of higher-order functions—functions whose arguments and results can themselves be functions and data structures whose components can be functions.

Unfortunately, practically none of these powerful features are present in the synthesizable subsets of conventional HDLs<sup>1</sup>. BSV, on the other hand, adopts the full power of the Haskell functional programming language [12]: algebraic types, functional types, polymorphic types, typeclasses, higher-order functions, and recursive and monadic static elaboration. This delivers unprecedented expressive power, type safety and type flexibility in a hardware design language.

#### B.1.3 Comparison with C++-based High Level Synthesis

Recently, some tools have become available under the rubric of "High Level Synthesis" (HLS) that claim to shield you from this mental gear shift from simulation to hardware. Designs are written in a traditional sequential programming language (typically C++), and an HLS tool automatically compiles this into a hardware implementation. While beautiful in concept, there are many serious limitations in practice, which are discussed below.

<sup>&</sup>lt;sup>1</sup>SystemVerilog and SystemC have object-orientation, polymorphism, and overloading, but these are typically used only in simulation for verification environments of hardware designs, not for actual hardware design itself.

#### C++ codes need significant rewriting

C++ HLS tools will rarely accept arbitrary, off-the-shelf C++ codes and produce good hardware implementations. C++ codes often require significant restructuring to achieve good results.

First, the tools only accept a limited subset of C++ syntax. In particular, these tools are very averse to any kind of pointer-based argument passing or data structures, unless all the pointers can be resolved by the compiler (*i.e.*,, the compiler statically knows the addresses represented by the pointers). This is because, while C++ normally executes on machines that provide the abstraction of a single large memory with a single address space (so a pointer is fundamentally an address, and dynamic allocation and relocation are easy), hardware designs typically use hundreds or thousands of individual memory units, from registers to register files to SRAMs, DRAMs, Flash memories, and ROMs, each with its own address space.

Second, most C++ codes written for conventional execution rely deeply on sequential execution. For example, they may re-use a variable (multiple reads and writes in different phases of the code). Many of these programming techniques, often a good idea for higher performance and smaller memory footprints in conventional execution, are exactly the opposite of what is needed for hardware implementation, which is highly parallel.

Overall, for good results, one must develop a keen sense of the hardware implementation impact of various "styles" of writing C++ code. Small changes in style can mean the difference between a terrible implementation and an acceptable one. One vendor insists that any team adopting their tool should not consist solely of C++ experts, but must also include hardware engineers.

#### Narrow range of applicability due to automatic parallelization

C++ is, by official definition, a completely sequential language. Hardware, on the other hand, relies on massive, fine-grain parallelism. It is the HLS tool that has to pull off this magical transformation.

C++ HLS tools rely on a body of knowledge called CDFG Analysis (Control and Data Flow Graph Analysis). After parsing and typechecking, the C++ program is represented internally in a data structure called the CDFG. This CDFG, initially directly reflecting the sequential nature of the source, is analyzed and transformed into a parallel representation from which, eventually, hardware is generated.

It turns out that this transformation only works well for a narrow range of program structures—cleanly nested for-loops with fixed iteration bounds, operating on dense rectangular arrays. Of course, many signal-processing and image-processing applications do have this structure, and C++ HLS tools have found their greatest success in this arena.

But the moment we step outside this sweet spot, towards sparse arrays or programs that are highly control-dominated, these tools fall off a cliff. Most hardware design in fact involves components that don't fall into the C++ HLS sweet spot: CPUs, cache systems, switched interconnects, flash memory and disk controllers, high-speed I/O controllers for Ethernet, PCIe, USB, and so on. For example, we are unaware of any project using C++ HLS for CPU design, whereas there are over a dozen such projects using BSV.

#### Lack of "Algotecture": Architectural transparency and predictability

Most people with some training in Computer Science are familiar with the idea that Algorithms are Job One—when writing performance-critical software, the first-order concern is to design a good algorithm. Further, creating a good algorithm is a creative act; compilers don't automatically create good algorithms for you<sup>2</sup>.

Unfortunately, because most of our codes run on classical von Neumann machines, many people forget that, when the execution platform changes, our old algorithms may no longer be any good—the assumptions about the cost of fundamental operations may no longer valid and in fact may be wildly different, requiring a complete re-think of the algorithm.

This bring up a fundamental difference between software design and hardware design. In software, you are given a particular target architecture (CPU, GPU, cluster, vector machine, ...), and the designer's job is to design a good algorithm for that fixed architecture. In hardware, on the other hand, the designer's job is to design the algorithm and the architecture jointly. In other words, for hardware designers, algorithm and architecture are joined at the hip; it is meaningless to separate these activities. We thus use the term Algorithm to describe this integrated activity.

Unfortunately, most C++ HLS tools provide very narrow visibility and control into architecture. For example, directives for loop unrolling and loop fusion may allow you to express some variation in iterative vs. parallel vs. pipelined structures. But, basically, it's the tool that chooses the architecture, and you have some weak knobs to guide its choices. A common syndrome with C++ HLS tools is that one quickly produces an implementation, but it is terrible in area or performance, and this is followed by a long tail of activity in which the designer tweaks the knobs every which way in an effor to beat it down into the desired performance envelope.

In contrast, with BSV, architectural choices (like algorithmic choices) are in the hands of the designer, where it should be. There are no surprises with respect to architecture; performance is never a mystery, and the designer can quickly improve it and converge to an acceptable solution.

#### **Summary**

In summary, it is our experience that BSV is a much better language for complex hardware design, whether control or data oriented, whether for modeling or architectural exploration or final implementation, or for synthesizable on-FPGA verification transactors. Following the philosophy of DSLs (Domain Specific Languages), BSV is very much an expressive DSL beautifully suited for hardware design, whereas sequential C++ is certainly not (it was never intended to be!).

<sup>&</sup>lt;sup>2</sup>Of course, there is research in this area, but this starts entering the realm of Artificial Intelligence.

# Appendix C

# Glossary

- **ASIC** Application-Specific Integrated Circuit. A kind of electronic device that represents a desired digital circuit directly in silicon and has been fabricated for that purpose (not customizable and general-purpose like an FPGA).
- API Application Programming Interface. Term commonly used in many programming languages, methodologies and protocols to describe the set of functions/procedures/methods used to interact with a module/object by external entitities (from outside the module/object). The API clearly separates external concerns from internal concerns. External concerns are about "what" a method does or sequence of methods do: what are their argument and result types, and what do they (abstractly) achieve. Internal concerns are about "how" methods do what they are supposed to do. This separation of concerns also allow transparently substituting a module implementation with an alternate implementation (e.g., for greater efficiency) without disturbing the external context.
- **BSV**, **BH** An open-source, modern, High-Level HDL. Two optional syntaxes (choose to one's taste): BSV has traditional Verilog-like syntax, BH has traditional Haskell-like syntax.
  - **CPU** Central Processing Unit. The computational element of a computer.
  - CSRs Control and Status Registers. These are special registers in the ISA, most of which are accessibly only while executing at higher privilege levels (Machine and Supervisor). Certain key CSRs play a central role in disciplined transition between privilege levels, in virtual memory, and in memory protection.
  - **DRAM** Dynamic Random Access Memory. A kind of silicon chip that implements memory. Compared to SRAM, is larger (number of bits), denser (bits per silicon area), cheaper (\$ per bit), uses less power (watts per bit) and is more complex to operate (needing regular refreshing etc.). Usually off-chip (not part of an ASIC or an FPGA).
  - **FPGA** Field Programmable Gate Array. A kind of electronic device that has configurable circuits that can be customized to represent any desired digital design. These are catalog parts available from several vendors.
- **FPGA Board** A circuit board containing one or more FPGAs, a power supply, and DRAM memories. Often contains other facilities such as GPIO, UARTS, JTAG, PCIe bus connections, Ethernet connection, USB connection, Flash memory, and so on.

- **GPIO** General Purpose Input Output. An electronic device attached to a computer system. When the CPU stores a byte/word to a GPIO address, the bits of the word appear as electronic signals from the device, and can be used as an *actuator*—switch on/off a back of LED lamps, a relay, a motor, *etc...* When the CPU loads a byte/word from a GPIO address, it can read the state of a *sensor*—switches, photocells, motor speed, temperature, *etc...*
- **GPR** General Purpose Register. For RISC-V, just a synonym for the basic register set holding integers. They are "general purpose" in the sense that software is free to use them in any way (in contrast with some earlier ISAs that restricted certain registers to certain roles, such as holding addresses).
- **HDL** Hardware Design Language. A language in which one can represent circuits, and for which there are tools that can render a program into actual circuits for FPGAs and ASICs. Examples include: BSV, BH, Chisel, Verilog, SystemVerilog, VHDL.
- HLHDL High-Level Hardware Design Language. An HDL with higher-levels of abstraction and more powerful constructs and semantics compared to the traditional HDLs Verilog, SystemVerilog and VHDL, in the same sense that modern software programming languages (Java, Python, Javascript, Haskell, OCaml, ...) have higher-levels of abstraction than C/C++ which, in turn, have higher levels of abstraction than Assembly Language. Examples include BSV, BH (the Haskell-syntax variant of BSV), Chisel, and HLS.
  - **HLS** High Level Synthesis. The term typically used for tools and methodology that compile C/C++/SystemC programs into hardware. HLS can be fragile in that it works best only on certain subsets of C/C++ ("simple rectangular loop and array" algorithms), and require certain coding styles and directives.
  - **ISA** Instruction Set Architecture. A specification of instructions: how an instruction is coded in bits; "architectural state" (PC, registers *etc.*); what it means to execute an instruction; assembly language syntax. The specification is described independently of any particular implementation, traditionally in a manual with text and diagrams, occasionally and recently also in a formal-specification language.
    - An ISA can (and typically does) have many possible implementations, varying widely in speed, size, power, cost, technology (ASIC, FPGA), etc. Examples of famous ISAs and vendors who supply implementations include RISC-V (diverse vendors), x86 (Intel and AMD), ARM (Arm, Apple, Samsung, others), Sparc (Sun, Oracle, Fujitsu, others), MIPS (MIPS, Inc.), Power and PowerPC (IBM, others), ...
- Microarchitecture The structural and behavioral details of an ISA implementation that are below the level of abstraction of the ISA, i.e., not demanded by the ISA but chosen by the implementor for practical reasons (speed, power, area, cost, ...). Examples: pipelines, branch prediction, scoreboards, register renaming, out-of-order execution, superscalarity, instruction fission and fusion, replicated execution units, store-buffers, ...
  - **OS** Operating System. Can vary from small, embedded, real-time OSs such as FreeRTOS, to more capable embedded OSs like Zephyr, to secure micro-kernels like seL4, to full-featured OSs like Linux, Windows, MacOS, Solaris, AIX, etc.

RISC-V A particular standard ISA. Originated circa 2008-2010 in research at University of California, Berkeley, and subsequently spun out (2010s) into an international non-profit consortium "RISC-V International" (RVI) headquartered in Switzerland (https://riscv.org).

Unlike other well-known ISAs, the RISC-V ISA is an *open* standard, *i.e.*, implementors do not need to pay any license fee in order to use the ISA, which is one of the factors behind its wide adoption by hundreds of vendors.

**RTL** Register-Transfer Level/Language. This is a level of abstraction of describing hardware that assumes that the available primitive components are clocked registers and combinational circuits for multiplexers, and basic arithmetic and logic functions (adders, subtractors, boolean operations, shifters, etc.).

This is a higher level of abstraction than AND/OR/XOR/NOT gates which, in turn, are a higher level of abstraction than transistors which, in turn, are a higher level of abstraction than silicon regions. Each layer of abstraction is automatically compiled to a lower layer using various tools.

- RVI RISC-V International. See entry for RISC-V.
- **SoC** System-on-a-chip. Refers to a complete computing system on a chip, including one or more CPUs (with MMUs and caches), shared caches, interconnects, DRAM interface, JTAG, accelerators and devices, etc.
- **SRAM** Static Random Access Memory. A kind of silicon chip that implements memory. See DRAM above for comparison. Usually on-chip in an ASIC or an FPGA.
- SystemVerilog One of the major HDLs. Originally created in the 2000s as a proper superset of Verilog (and thereby subsuming Verilog), and incorporating many features from VHDL; incorporated some modern features from object-oriented software programming languages (principally used in verification testbenches in simulation only); then an IEEE standard that has gone through several versions. Can be used for both analog and digital circuits. Some features can only be used in simulation (a "synthesizable subset" can be rendered into hardware).
  - **UART** Universal Asynchronous Receiver/Transmitter. An electronic device attached to a computer system through which the CPU can read ASCII characters from a keyboard and send ASCII characters to a display screen. Typically used for the main console of a computer system.
  - Verilog One of the two grand old HDLs (the other is VHDL). Originally created in the 1980s; then an IEEE standard that has gone through several versions; then subsumed by SystemVerilog. Can be used for both analog and digital circuits. Some features can only be used in simulation (a "synthesizable subset" can be rendered into hardware).
  - VHDL One of the two grand old HDLs (the other is Verilog). Originally created in the 1980s; then an IEEE standard that has gone through several versions. Many features were adopted by SystemVerilog. Can be used for both analog and digital circuits. Some features can only be used in simulation (a "synthesizable subset" can be rendered into hardware).

# Index

| Address alignment, 5-6                 | interface transformer from FIF0F, 7- $10$ |
|----------------------------------------|-------------------------------------------|
| BSV                                    | FIFOF_I semi-fifo interface, 7-7          |
| ?, the don't care value, 6-3           | FIF0F_0: semi-fifo interface, 7-7         |
| AAAA_AAAA, the default don't care      | Finite State Machines, 8-2                |
| value, 6-3                             | FSMs, 8-2                                 |
| Action: primitive type of actions, 8-3 | concurrent vs sequential, 8-2             |
| actions, 8-3                           | sequential $vs.$ concurrent, 8-2          |
| Bit Vectors, 4-2                       | Stmt: type of argument to FSM mod-        |
| extend, $4-3$                          | ule constructors, 8-4                     |
| operators, 4-2                         | functions                                 |
| slices, 4-2                            | application, 4-7                          |
| truncate, 4-3                          | definition, 4-6                           |
| zeroExtend, 4-3                        | Haskell similarity, 4-6                   |
| Bool, 4-3                              | identifer syntax, 4-10                    |
| operators, 4-3                         | first letter lower or upper case, 4-10    |
| bus (hardware, bundle of wires), 4-5   | if-then-else, 4-11                        |
| Combinational circuits, 4-5            | nested, 4-12                              |
| data types, 4-6                        | Interface                                 |
| purity, 4-5                            | FIFOF FIFO interface, 7-6                 |
| Combinational primitives, 4-5          | Reg register interface, 7-3               |
| Comments, 4-10                         | RegFile register-file interface, 7-7      |
| Conditional compilation, 4-15          | type, 7-2                                 |
| deriving                               | Interface transformer functions, 7-10     |
| Bits, 5-3                              | let                                       |
| Fshow, 5-3                             | binding an identifier with implicit       |
| deriving Bits, 4-9                     | type declaration, 5-4                     |
| deriving Eq. 4-10                      | member                                    |
| deriving FShow, 4-10                   | of a struct, 5-1                          |
| \$display has Action type, 8-3         | mkFIFOF                                   |
| enum types, 4-9                        | instantiation, 7-6                        |
| field                                  | module (constructor), 7-6                 |
| of a struct, $5-1$                     | reset value, 7-6                          |
| FIFO, <b>7-5</b>                       | mkReg                                     |
| FIFOF                                  | instantiation, 7-4                        |
| type of stored value, 7-6              | module (constructor), 7-4                 |
| FIFOF interface, 7-6                   | reset value, 7-4                          |
| FIF0F interface methods, 7-6           | mkRegU                                    |
| FIFOF_O                                | module (constructor), 7-4                 |

| mkSizedFIFOF                          | struct                                 |
|---------------------------------------|----------------------------------------|
| module (constructor), 7-6             | type declaration, 5-3                  |
| Module, 7-1                           | (* synthesize *) attribute, 7-3        |
| behavior, 7-3, 8-2                    | Testbenches, 4-7                       |
| instance, 7-3                         | FSMs, 4-7                              |
| instantiation, 7-4                    | Typeclass                              |
| interface, 7-3                        | instance, 4-9                          |
| mkFIFOF module (constructor), 7-6     | Typeclasses, 4-9                       |
| mkReg module (constructor), 7-4       | Types                                  |
| mkRegFileFull module (construc-       | Bit #(n), 4-2                          |
| tor), 7-8                             | Bool, 4-3                              |
| state, 7-3                            | interface, 7-2                         |
| multiplexers, 4-11                    | numeric, 4-14                          |
| cascaded/serial/priority, 4-11        | of combinational circuits, 4-6         |
| parallel, 4-12                        | synonyms, 4-15                         |
| MUX, 4-11                             | valueOf: value of a numeric type,      |
| parameterization, 4-14                | 4-15                                   |
| Propagation delay, 4-6                | valueOf: value of a numeric type, 4-15 |
| RegFile, 7-7                          | <b>01</b> /                            |
| Register, 7-3                         | DMem, Data Memory5-5                   |
| _read method, 7-3                     | Drum                                   |
| Reg register interface, 7-3           | as an FSM, $8-2$                       |
| _write method, 7-3                    | CPU interface, 7-9                     |
| Register-file, 7-7                    | CPU module, 8-5                        |
| methods, 7-7                          | Decode function, 5-8                   |
| RegFile interface, 7-7                | Skeleton module, 7-9                   |
| RegFile register-file interface, 7-7  | Dif.                                   |
| type of index, 7-7                    | Fife                                   |
| type of stored value, 7-7             | as a set of concurrent FSMs, 8-3       |
| Semi-FIFO                             | CPU interface, 7-9                     |
| FIFOF_I semi-fifo interface, 7-7      | Skeleton module, 7-9                   |
| FIFOF_0 semi-fifo interface, 7-7      | FIFO                                   |
| StmtFSM, 8-3                          | strongly-typed, 7-7                    |
| for: loop-repetition of actions, 8-5  | Harvard architecture, 5-5              |
| if-then-else: conditional actions,    | Self-modifying code, 5-5               |
| 8-4                                   | ben-modifying code, 5-5                |
| seq endseq: sequences of ac-          | IMem, Instruction Memory5-5            |
| tions, 8-4                            | ,                                      |
| while: loop-repetition of actions, 8- | JIT                                    |
| 5                                     | Just-in-time compiling, 5-5            |
| struct                                | Just-in-time compiling                 |
| entire struct values, 5-3             | JIT, 5-5                               |
| field assignment/update, 5-5          | _                                      |
| field selection, 5-4                  | let                                    |
| hetoregeneous collection of values,   | BSV: binding an identifier with im-    |
| 5-1                                   | plicit type declaration, 5-4           |
| Nested, 6-2                           | Memory                                 |
| Single-field structs, 6-2             | Address alignment, 5-6                 |
|                                       | 11441000 41101110110, 0 0              |

```
Request, 5-6
Response, 5-7

Register
<= register assignment, 7-5
implicit register read, 7-5
strongly-typed, 7-4

RISC-V
Architectural state, 3-3
DMem (data memory), 5-5
Drum CPU module, 8-5
Drum skeleton module, 7-9
Fife skeleton module, 7-9
IMem (instruction memory), 5-5
x0: Special "always zero" register, 7-8
```

# **Bibliography**

- [1] Bluespec, Inc. BSV Guide, 2022 (first version 2000).
- [2] E. Borin. An Introduction to Assembly Programming with RISC-V. 2021 (Revised: May 9, 2022). PDF online: https://riscv-programming.org/book.html.
- [3] P. Dabbelt, M. Clark, and A. Bradbury. RISC-V Assembly Programmer's Manual, Recent update: Jun 29, 2023. Online: https://github.com/riscv-non-isa/riscv-asm-manual/blob/master/riscv-asm.md.
- [4] P. David and W. Andrew. The RISC-V Reader: An Open Architecture Atlas. Strawberry Canyon, 2017. ISBN-10: 0999249118, ISBN-13: 978-0999249116. Available in bookstores.
- [5] A. J. Dos Reis. RISC-V Assembly Language. 2019. ISBN-10: 1088462006, ISBN-13: 978-1088462003. Available on Amazon.com.
- [6] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 6th Edition. Morgan Kaufmann. The Morgan Kaufmann Series in Computer Architecture and Design. ISBN-10: 0128119055, ISBN-13: 978-0128119051. Available in bookstores.
- [7] IEEE. IEEE Standard VHDL Language Reference Manual, IEEE Std 1076-1993, 2002.
- [8] IEEE. IEEE Standard Verilog (R) Hardware Description Language, 2005. IEEE Std 1364-2005.
- [9] IEEE. IEEE Standard for Standard SystemC Language Reference Manual, January 9 2012. IEEE Std 1666-2011.
- [10] IEEE. IEEE Standard for System Verilog—Unified Hardware Design, Specification and Verification Language, 21 February 2013. IEEE Std 1800-2012.
- [11] D. A. Patterson and J. L. Hennessy. Computer Organization and Design RISC-V Edition (2nd Edition): The Hardware Software Interface. Morgan Kaufman, 2020. The Morgan Kaufmann Series in Computer Architecture and Design. ISBN-10: 0128203315, ISBN-13: 978-0128203316. Available in bookstores.
- [12] S. Peyton Jones (Editor). Haskell 98 Language and Libraries: The Revised Report. Cambridge University Press, May 5 2003. haskell.org.

BIB-2 BIBLIOGRAPHY

[13] SHAKTI Development Team @ iitm '20 shakti.org.in (Indian Institute of Technology, Madras, India). RISC-V ASSEMBLY LANGUAGE, Programmer Manual Part I, 2020. PDF online: https://shakti.org.in/docs/risc-v-asm-manual.pdf.

- [14] A. Waterman and K. Asanović. The RISC-V Instruction Set Manual Volume I: Unprivileged ISA, December 13 2019. Document Version 20191213.. PDF online (and newer versions, if any): https://riscv.org/technical/specifications/.
- [15] A. Waterman, K. Asanović, and J. Hauser. The RISC-V Instruction Set Manual Volume II: Privileged Architecture, December 4 2021. Document Version 20211203. PDF online (and newer versions, if any): https://riscv.org/technical/specifications/.