# ECE 4750 Computer Architecture, Fall 2013

## Lab 1: Iterative Integer Multiplier

School of Electrical and Computer Engineering Cornell University

revision: 2014-09-14-00-24

The first lab assignment is a warmup lab where you will design two implementations of an integer iterative multiplier: the baseline design is a fixed-latency implementation that always takes the same number of cycles, and the alternative design is a variable-latency implementation that exploits properties of the input operands to reduce the execution time. You are required to implement the baseline and alternative designs, verify the designs using an effective testing strategy, and perform an evaluation comparing the two implementations. As with all lab assignments, the majority of your grade will be determined by the lab report. You should consult the course lab assignment assessment rubric for more information about the expectations for all lab assignments and how they will be assessed.

This lab is designed to give you experience with:

- the Verilog hardware modeling framework;
- abstraction levels including functional- and register-transfer-level modeling;
- design principles including modularity, hierarchy, and encapsulation;
- design patterns including message interfaces, control/datapath split, and FSM control;
- agile design methodologies including incremental development and test-driven development.

Your experience in this lab assignment will create a solid foundation for completing the rest of the lab assignments ultimately culminating in the implementation of a complete multicore processor.

This handout assumes that you have read and understand the course tutorials and the lab assessment rubric. To get started, you should access the ECE computing resources and perform the following steps:

```
% source setup-ece4750.sh
% ece4750-lab-admin start ece4750-lab1-imul
% mkdir ${HOME}/ece4750/ece4750-lab1-imul/build
% cd ${HOME}/ece4750/ece4750-lab1-imul/build
% ../configure
% make
% make check
```

All of the tests should pass except for the tests related to the iterative multipliers you will be implementing in this lab. For this lab you will be working in the lab1-imul subproject which includes the following files:

- lab1-imul-msgs.v multiplier message types
- lab1-imul-msgs.t.v unit tests for multiplier message types
- lab1-imul-IntMulFL.v functional-level multiplier
- lab1-imul-IntMulBase.v fixed-latency multiplier baseline design
- lab1-imul-IntMulAlt.v variable-latency multiplier alternative design
- lab1-imul-test-harness.v multiplier test harness shared across designs

- lab1-imul-IntMulFL.t.v functional-level multiplier unit tests
- lab1-imul-IntMulBase.t.v fixed-latency multiplier unit tests
- lab1-imul-IntMulAlt.t.v variable-latency multiplier unit tests
- lab1-imul-input-gen.py Python script to generate random input datasets
- lab1-imul-sim-harness multiplier simulator harness shared across designs
- lab1-imul-sim-base multiplier simulator for fixed-latency multiplier
- lab1-imul-sim-alt multiplier simulator for variable-latency multiplier

You can use the ece4750-lab-admin script and the git command to add members to your lab group, use your own group repository to collaborate with your group members, and ultimately prepare your submission for uploading to CMS. You can also use GitHub, but you must still use the exce4750-lab-admin script to start your lab, update your lab if the staff releases any revisions, and prepare your submission.

#### 1. Introduction

You learned about basic digital logic design in your previous coursework, and in this warmup lab we will put this knowledge into practice by building multiple implementations of an integer multiplier. Although it is certainly possible to design a single-cycle combinational integer multiplier, such an implementation is likely to be on the critical path in an aggressive design. Later in the course we will learn about pipelining as a technique to improve cycle time (i.e., clock frequency) while maintaining high throughput, but in this lab we take a different approach. Our designs will iteratively calculate the multiplication using a series of add and shift operations. The baseline design always uses a fixed number of steps, while the variable latency design is able to improve performance by exploiting structure in the input operands. The iterative approach will enable improved cycle time compared to a single-cycle design, but at reduced throughput (as measured in average cycles per transaction).

We have provided you with a functional-level model of an integer multiplier as shown in Figure 1. You can find this functional-level implementation in lab1-imul-IntMulFL.v and the associated unit tests in lab1-imul-test-harness.t.v and lab1-imul-IntMulFL.t.v. This implementation always takes a single-cycle and uses a higher-level modeling style compared to the register-transfer-level (RTL) modeling you will be using in your designs. These varying levels of modeling (i.e., functional-level versus RTL) are an example of abstraction. The interfaces for all three designs (i.e., functional-



Figure 1: Functional-Level Implementation of Integer Multplier – Input and output use latency-insensitive val/rdy interfaces. The input message includes two 32-bit operands; output message is a 32-bit result. Clock and reset signals are not shown.

```
1 def imul( a, b ):
2
3    result = 0
4    for i in range(32):
5        if b & 0x1 == 1:
6            result += a
7            a = a << 1
8            b = b >> 1
9
10    return result
```

Figure 2: Iterative Multiplication Algorithm – Iteratively use shifts and subtractions to calculate the partial-products over time. This is executable Python code.

level model, fixed-latency RTL model, and variable-latency RTL model) are identical. Each multiplier should take as input a 64-bit message with two fields containing the 32-bit operands. The message type is defined in lab1-imul-msgs.v. All implementations should treat the input operands and the result as two's complement numbers and thus should be able to handle both signed and unsigned multiplication. In addition, both the input and output interface use the val/rdy microprotocol to control when new inputs can be sent to the multiplier and when the multiplier has a new result ready to be consumed. This is an example of the *encapsulation design principle* in general, and more specifically the latency-insensitive *message interface design pattern*. We are hiding implementation details (i.e., the multiplier latency) from the interface using the val/rdy microprotocol. Another module should be able to send messages to the multiplier and never explicitly be concerned with how many cycles the implementation takes to execute a multiply transaction and return the result message.

### 2. Baseline Design

The baseline design for this lab assignment is a fixed-latency iterative integer multiplier. As with all of the baseline designs in this course, we provide sufficient details in the lab handout such that your primary focus is simply on implementing the design. Figure 2 illustrates the iterative multiplication algorithm using "pseudocode" which is really executable Python code. Try out this algorithm on your own and make sure you understand how it works before starting to implement the baseline design. Note that while this Python code will work fine with positive integers it will produce what looks like very large numbers when multiplying negative numbers. This is due to the fact that Python provides infinitely sized integers, with negative numbers being represented in two's complement with an infinite number of ones extending towards the most-significant bit. It is relatively straightforward to handle negative numbers in Python by explicitly checking the sign-bit and adding some additional masking and two's complement conversion logic. Since the real hardware will not need to do this it is not shown in the algorithm.

We will be decomposing the baseline design into two separate modules: the datapath which has paths for moving data through various arithmetic blocks, muxes, and registers; and the control unit which is in charge of managing the movement of data through the datapath. This decomposition is an example of the *modular design principle* in general, and more specifically the *control/datapath split design pattern*. Your datapath module, control unit module, as well as the parent module that connects the datapath and control unit together should all be placed in lab1-imul-IntMulBase.v.

The datapath for the baseline design is shown in Figure 3. The blue signals represent control/status signals for communicating between the datapath and the control unit. Your datapath module should instantiate a child module for each of the blocks in the datapath diagram; in other words, you must use a structural design style in the datapath. Although you are free to develop your own modules to use in the datapath, we recommend using the ones provided for you in the VC library, the Verilog Component library. You can find out more about the VC library by looking in the vc subproject. The collections of modules that you might be particularly interested in for this lab are as follows:

- vc-muxes.v collection of muxes
- vc-regs.v collection of registers
- vc-arithmetic.v collection of arithmetic units

Note that while you should implement the datapath by structurally composing child modules, those child modules themselves can simply use RTL modeling. For example, you do not need to explicitly model an adder with gates, simply instantiate an adder module which uses the + operator in a combinational concurrent block. Again in your final design, there should be a one-to-one correspon-

dence between the datapath diagram and the structural implementation in your code. This recursive modular decomposition is an example of the *hierarchy design principle*.

The IDLE state is responsible for consuming the message from the input interface and placing the input operands into the input registers; the CALC state is responsible for iteratively using adds and shifts to calculate the multiplication; and the DONE state is responsible for sending the message out the output interface. Note that if you implement the FSM exactly as shown each multiply should take 35 cycles: one for the IDLE state, 32 for iterative calculation, one cycle to realize the calculation is done, and one cycle for the DONE state. The extra cycle to realize the calculation is done is because we are not using Mealy transitions from the CALC to DONE states. We need to wait for the counter to reach 32 and then we move into the idle state. It is relatively straight-forward to eliminate this extra bubble (although not required). Your control unit should be structured into three parts: a sequential concurrent block for just the state element, a combinational concurrent block for state transitions, and a combinational concurrent block for state outputs. You will probably want to use a counter to keep track of the 32 cycles required for the iterative calculation in the CALC state. The control unit for your iterative multiplier is an example of the *finite-state-machine design pattern*.

You may want to consider implementing a simpler version of the fixed-latency integer multiplier before trying to implement the fully featured design. For example, you could implement a version of the multiplier that does not handle the val/rdy signals correctly and test this without random delays. Once you have the simpler version working and passing the tests, then you could add the additional complexity required to handle the val/rdy signals correctly. This is an example of an *incremental development design methodology*.



Figure 3: Datapath for Fixed-Latency Iterative Integer Multiplier – All datpath components are 32-bits wide. Shifters are constant one-bit shifters. We use registered inputs with a minimal of logic before the registers.



Figure 4: Control FSM for Fixed-Latency Iterative Integer Multiplier – Hybrid Moore/Mealy FSM with Mealy transitions in the CALC state.

### 3. Alternative Design

The fixed-latency iterative integer multiplier should always take 34–35 cycles to compute the result. While it is possible to optimize away the cycle to realize the calculation is done and to eliminate the IDLE/DONE states, fundamentally this algorithm is limited by the 32 cycles required for the iterative calculation. For your alternative design you should implement a variable latency iterative multiplier, which takes advantage of the structure in some pairs of input operands. More specifically, if an input operand has many consecutive zeros we don't need to shift one bit per cycle; instead we can shift the B register multiple bits in one step and directly jump to the next required addition. You should try to be reasonably aggressive in terms of improving the performance, but do not try to squeeze too much logic into a single clock period. Your critical path should not be longer than an adder plus some muxing, or a shifter and some muxing. A critical path with an adder, a shifter, and some muxing might be acceptable, but this will certainly result in a longer cycle time (i.e., lower clock frequency) so you must discuss these trade-offs qualitatively in your lab report.

As with all of the alternative designs in this course, we simply sketch out the requirements and you are responsible for both the design and the actual implementation. Your implementation should leverage the design principles, patterns, and methodologies you used for the baseline design. We have provided the top-level module interface for your variable latency multiplier design in the file lab1-imul-IntMulAlt.v. Most of your code for the alternative design should be placed in this file. If you need any additional datapath or control components, then you should refactor this logic into its own module and file.

### 4. Testing Strategy

We provide you one directed test for your multiplier, and you are responsible for developing the rest of the verification. Writing tests is one of the most important and challenging parts of computer architecture. Designers often spend far more time designing tests than they do designing the actual hardware. You will want to initially write tests using the functional-level model. Once these tests are working on the functional-level model, you can move on to testing the baseline and alternative designs. Our emphasis on testing and more specifically on writing the tests first in order to motivate a specific implementation is an example of the *test-driven design methodology*.

The following commands illustrate how to run all of tests for the entire project, how to run just the tests for this lab, and how to run just the tests for the functional-level model, baseline design, and alternative design.

```
% cd ${HOME}/ece4750/ece4750-lab1-imul/build
% make check
% make lab1-imul-check
% make lab1-imul-IntMulFL-test && ./lab1-imul-IntMulFL-test
% make lab1-imul-IntMulBase-test && ./lab1-imul-IntMulBase-test
% make lab1-imul-IntMulAlt-test && ./lab1-imul-IntMulAlt-test
```

You will add your directed tests to lab1-imul-test-harness.v. Since this harness is shared across the functional-level model, the baseline design, and the alternative design you can write your tests once and reuse them to test all three models. You will be adding more test cases. Do not just make the given test case larger. Some suggestions for what you might want to test are listed below. Each of these would probably be a separate test case.

- Combinations of multiplying zero, one, and negative one
- Small negative numbers × small positive numbers

```
//-----
  // Test Case: random small
5 `VC_TEST_CASE_BEGIN( N, "random small" )
   init_rand_delays( 0, 0 );
  include "lab1-imul-gen-input_small.py.v"
  run_test;
10 end
11 VC_TEST_CASE_END
13 //-----
  // Test Case: random small w/ random delays
16
  `VC_TEST_CASE_BEGIN( N, "random small w/ random delays" )
17
  init_rand_delays( 3, 14 );
19
    include "lab1-imul-gen-input_small.py.v"
  run_test;
22 end
23 VC_TEST_CASE_END
```

Code at https://github.com/cbatten/x/blob/master/code-lab1-random-testing.v

**Figure 5: Randomized Test Cases for the Multplier –** The file lab1-imul-gen-input\_small.py.v is generated by a Python script and contains a Verilog snippet with 100 random test inputs and the corresponding correct outputs. You will need to change N to be the appropriate test case number.

- Small positive numbers × small negative numbers
- Small negative numbers × small negative numbers
- Large positive numbers × large positive numbers
- Large positive numbers × large negative numbers
- Large negative numbers × large positive numbers
- Large negative numbers × large negative numbers
- Multiplying numbers with the low order bits masked off
- Multiplying numbers with middle bits masked off
- Multiplying sparse numbers with many zeros but few ones
- Multiplying dense numbers with many ones but few zeros
- Tests specifically designed to trigger corner cases in your alternative design
- Testing all or some of the above using random source and sink delays

Once you have finished writing your directed tests you should move on to writing random tests. We have created a simple Python-based random test generation system for you to use. The lab1-imul-gen-input.py script takes one argument specifying which random dataset to generate and prints out a snippet of Verilog code suitable for use in a test case. Currently the script only generates one random dataset comprised of small random numbers multiplied together. You can see how the script works like this:

```
% cd ${HOME}/ece4750/ece4750-lab1-imul/build % python ../lab1-imul/lab1-imul-gen-input.py small
```

This should print out a snippet of Verilog code that uses the init task to initialize the test source and sink in the test harness. The build system is setup to automatically run this script and generate a Verilog file named lab1-imul-gen-input\_small.py.v. You can try it out like this:

```
% cd ${HOME}/ece4750/ece4750-lab1-imul/build
```

```
% make lab1-imul-gen-input_small.py.v
% cat lab1-imul-gen-input_small.py.v
```

You can use the Verilog code shown in Figure 5 in your lab1-imul-test-harness.v to include this generated Verilog file as a test case. To add more random testing, you will need to modify the lab1-imul-gen-input.py script to handle additional datasets, modify the lab1-imul.mk to list the new .py.v files, and of course modify the lab1-imul-test-harness.v to include a new test case. Each random test should be a separate test case. You will want to use both zero and non-zero random delays in the test source and sink with your random testing. The kinds of random testing should be similar in spirit to the directed testing discussed above. You might want to consider randomly generating 32b values and then masking off different bits to create different variations. For example, you can mask off the low bits, the high bits, the low and high bits, or just randomly force some bits to zero or one to create numbers with more or less zeros.

In addition to testing the design as a whole, if you added any new datapath or control components you must write unit tests for these new components!

#### 5. Evaluation

Once you have verified the functionality of the baseline and alternate design, you should then use the provided simulator to evaluate these two designs. You can run the simulator like this:

```
% cd ${HOME}/ece4750/ece4750-lab1-imul/build
% make
% ./lab1-imul-sim-base +input=small +stats
% ./lab1-imul-sim-alt +input=small +stats
```

The simulator will display the total number of cycles to execute the specified input dataset. You should study the line traces (with the +verbose=1 option) and possibly the waveforms (with the +dump-vcd=<file-name> option) to understand the reason why each design performs as it does on the various patterns.

You can run simulations for all of the input datasets like this:

```
% cd ${HOME}/ece4750/ece4750-lab1-imul/build % make lab1-imul-eval
```

We only provide a single random dataset for evaluating your design, but clearly this is not sufficient. You will need to use more datasets to make a compelling comparative evaluation of how your baseline and alternative design perform, especially since the alternative design has data-dependent behavior. We recommend maybe six or so datasets for evaluation. Obviously these datasets need to be carefully chosen to highlight the differences between the baseline and alternative designs. We actually recommend you reuse some of the random datasets you already developed for verification. You will just need to modify the lab1-imul-sim-harness.v to reference these additional random datasets. Search for where we include lab1-imul-gen-input\_small.py.v and add additional else clauses to include other datasets based on the command line parameter given to the simulator. You will also need to add the name of the new dataset to the lab1\_imul\_eval\_inputs make variable in lab1-imul.mk if you want the build system to use this dataset when you run make lab1-imul-eval.

#### 6. Extensions

Students can pursue extensions to demonstrate creativity and a deeper understanding of the course material. Extensions only add a small bonus to your overall lab grade. They should not be at-

tempted unless your baseline design, alternative design, verification, evaluation, and lab report are completely finished. You can do any kind of extension you like, but some ideas are listed below.

- Design a three-input or four-input multiplier unit by composing two or three instances of your two-input multiplier unit. Make sure you verify that the val/rdy interfaces for your new multiplier work correctly.
- Design a multiplier that takes two 32-bit inputs but produces a 64-bit output. You will need to explicitly manage the situation where one or both of inputs are negative, so make sure you carefully test both positive and negative inputs.
- Design either a fixed-latency or variable-latency divider module using a similar approach as used in this lab. Instead of shift and add, you will need to use shift and subtract. Potentially add support for signed and unsigned division operations and/or signed and unsigned remainder operations.
- Design a unified multiplier/divider module with a single datapath and control unit, such that multiplication and division share as much of the datapath as possible.

### Acknowledgments

This lab was created by Christopher Batten, Shreesha Srinath, and Ji Kim as part of the course ECE 4750 Computer Architecture at Cornell University.