### Georgia Institute of Techology

## Schools of Computer Science and Electrical & Computer Engineering

CS 4290/6290, ECE 4100/6100: Spring 2012 (Conte)

**Project 2: Tomasulo Algorithm Pipelined Processor** 

Due: Before 11:55 PM on 30 March 2012

**VERSION: 1.2** 

# **Rules**

- 1. All students (CS 4290/6290, ECE 4100/6100) must work *alone*.
- 2. Sharing of code between students is viewed as cheating and will receive appropriate action in accordance with University policy. It is acceptable for you to compare your results with other students to help debug your program. It is not acceptable to collaborate on the simulator design or the final experiments.

# **Project Description:**

In this project, you will construct a simulator for an out-of-order superscalar processor that uses the Tomasulo algorithm and fetches *N* instructions per cycle. Then you will use the simulator to find the appropriate number of function units and fetch rate for each benchmark.

# **Specification of simulator:**

## 1. INPUT FORMAT

The input traces will be given in the form:

```
<address> <function unit type> <dest reg #> <src1 reg #> <src2 reg#> <address> <function unit type> <dest reg #> <src1 reg #> <src2 reg#> ...

Where
```

```
<address> is the address of the instruction (in hex)

<function unit type> is either "0", "1" or "2"

<dest reg #>, <src1 reg#> and <src2 reg #> are integers in the range [0..127]
```

**note:** if any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its <dest reg #>)

#### For example:

ab120024 0 1 2 3 ab120028 1 4 1 3 ab12002c 2 -1 4 7

#### Means:

"operation type 0" R1, R2, R3 "operation type 1" R4, R1, R3

#### **Command-line parameters**

Your project should include a Makefile which builds binary in your project's root directory named *tomasulo\_sim*. The program should run from this root directory as:

/tomasulo\_sim N k0 k1 k2 R trace\_file

The command line parameters are as follows:

- N Fetch rate (instructions per cycle)
- trace\_file Path name to the trace file
- k0 Number of #0 Function Units
- k1 Number of #1 Function Units
- k2 Number of #2 Function Units
- R Number of result buses

Traces will be placed on the website.

# 2. FUNCTION UNIT MIX

| Function unit type | Number of units* | Latency |
|--------------------|------------------|---------|
| 0                  | parameter: k0    | 1       |

<sup>&</sup>quot;operation type 2" -, R4, R7 no destination register!

| 1 | parameter: k1 | 2 |
|---|---------------|---|
| 2 | parameter: k2 | 3 |

#### Notes:

- The number of function units is a parameter of the simulation and should be adjustable along the range of 1 to 3 units each (see experiments below).
- Function units of types 1 and 2 are pipelined into stages. (2 stages for function unit type 1; 3 stages for function unit type 2).

## 3. PIPELINE TIMING

Assume the following pipeline structure:

| Stage | Name                     | Number of cycles per instruction                                  |
|-------|--------------------------|-------------------------------------------------------------------|
| 1     | Instruction Fetch/Decode | 1                                                                 |
| 2     | Dispatch                 | 1 (Because the dispatch and scheduling queues are both unlimited) |
| 3     | Scheduling               | Variable, depends on data dependencies                            |
| 4     | Execute                  | Depends on function unit type, see table above                    |
| 5     | State update             | Variable, depends on data dependencies (see notes 6 and 7 below)  |

#### **Details:**

- 1. You should explicitly model the dispatch and scheduling queues
- 2. The sizes of the dispatch and scheduling queues are unlimited.
- 3. If there are multiple independent instructions ready to fire during the same cycle in the scheduling queue, service them in tag order (i.e., lowest tag value to highest). (This will guarantee that your results can match ours.)
- 4. A fired instruction remains in the scheduling queue until it completes.

- 5. The fire rate (issue rate) is only limited by the number of available FUs.
- 6. There are R result buses (also called Common Data Buses, or CDBs). If an instruction wants to retire but all result buses are used, it must wait an additional cycle. All conflicts should be resolved in tag order.
- 7. Assume that instructions retire in the same order that they complete. Instruction retirement is unconstrained (imprecise interrupts are possible).
- 8. During State Update, values are committed to the register file.
- 9. **NOTE:** All instructions MUST spend at least one full cycle in each stage of each unit

#### 3.1 FETCH/DECODE UNIT

The fetch/decode rate is N instructions per cycle. Each cycle, the Fetch/Decode unit can always supply N instructions to the Dispatch unit. Since the dispatch queue length is unlimited, there is room for these instructions in the dispatch queue.

#### 3.2 WHEN TO UPDATE THE CLOCK

Note that the actual hardware has the following structure:

- Instruction Fetch/Decode PIPELINE REGISTER
- Dispatch PIPELINE REGISTER
- Scheduling PIPELINE REGISTER
- Execute PIPELINE REGISTER
- (Execute) // types 2 and 3 (PIPELINE REGISTER)
- (Execute) // type 3 (PIPELINE REGISTER)
- State update

Instruction movement only happens when the latches are clocked, which occurs at the rising edge of each clock cycle. You must simulate the same behavior of the pipeline latches, even if you do not model the actual latches. For example, if an instruction is ready to move from scheduling to execute, the motion only takes effect at the beginning of the next clock cycle.

Note that the latching is not strict between the dispatch unit and the scheduling unit, since both units use the scheduling queues. For example, if the dispatch unit inserts an instruction into one of the scheduling queues during clock cycle J, that instruction must spend  $at \ least \ cycle \ J+1$  in the scheduling unit.

In addition, assume each clock cycle is divided into two half cycles (you do not need to explicitly model this, but please make sure your simulator follows this ordering of events):

| Cycle half  | Action                                                                                                        |  |
|-------------|---------------------------------------------------------------------------------------------------------------|--|
| first half  | The register file is written via a result bus                                                                 |  |
|             | Any independent instruction in scheduling queue is marked to fire                                             |  |
|             | The dispatch unit reserves slots in the scheduling queues                                                     |  |
|             |                                                                                                               |  |
| second half | The register file is read by Dispatch                                                                         |  |
|             | Scheduling queues are updated via a result bus (and can mark newly independent instructions as ready to fire) |  |
|             | The state update unit deletes completed instructions from scheduling queue                                    |  |

## 3.3 OPERATION OF THE DISPATCH QUEUE

Note that the dispatch queue is scanned from head to tail (in program order). When an instruction is inserted into the scheduling queue, it is deleted from the dispatch queue.

#### 3.4 TAG GENERATION

Tags are generated sequentially for every instruction, beginning with 0. The first instruction in a trace will use a tag value of 0, the second will use 1, etc. The traces are sufficiently short so that you should not need to reuse tags.

#### 3.5 BUS ARBITRATION

There are **R result buses** (common data buses), which means that up to R instructions that complete in the same cycle may update the schedule queues and register file in the following cycle. Unless we agree on some ordering of this updating, multiple different (and valid) machine states can result. This will complicate validation

considerably. Therefore, assume the order of update is the same as the order of the tag values.

#### Example:

Tag values of instructions waiting to complete: 100, 102, 110, 104 where R = 3

#### Action:

- The instruction with the tag value = 100 updates the schedule queue/register file, then instruction 102, then 104, all of these happen in parallel, at the same time, and at the beginning of the next cycle!
- On the next cycle the instruction with the tag value = 110 updates the schedule queue/register file

### 3.7 STATISTICS (OUTPUT)

The simulator outputs the following statistics after completion of the experimental run:

- 1. Average number and Standard Deviation of instructions fired per cycle
- 2. Total number of instructions in the trace
- 3. Total simulated run time for the input: the run time is the total number of cycles from when the first instruction entered the instruction fetch/decode unit until when the last instruction completed.
- 4. Average number and Standard Deviation of instructions completed per cycle (IPC)
- 5. Maximum and Average scheduling queue size
- 6. The cycle-by-cycle behavior for all cycles of execution.

### 4. EXPERIMENTS

#### 4.1 VALIDATION

Your simulator must match the validation output that we will place on-line.

#### **4.2 RUNS**

For each trace on T-Square, use your simulator as follows:

- 1. Vary the following parameters:
  - o Number of FUs of each type (k0, k1, k2) of either 1 or 2 units each.
  - o The number of result buses, R.
  - o fetch rate, N = 4 or 8
- 2. Based on the results, select the **least amount of hardware** (as measured by total number of FUs (kO+kI+k2) and result buses @) for **each** benchmark trace that

provides **nearly the highest value** for retired IPC. As with project 1, justify your design choice.

#### 4.3 What to turn in

- 1. The output and statistics for the validation run to show your simulator matches ours. (TA will define format of the output in an announcement to the class.)
- 2. The experimental results as described above. (As before, there are multiple answers for each trace file, so I will know which students "collaborated" inappropriately!)
- 3. The commented source code for the simulator program.

#### 4.4 Hints

- Work out by hand what should occur in each unit at each cycle for an example set of instructions (e.g., the first 10 instructions in one of the traces). **Do this** *before* **you begin writing your program!**
- The algorithm in the notes is not intended to be implemented in C/C++/Java. Each "step" in the algorithm represents activity that occurs *in parallel* with other steps (due to the pipelining of the machine). Therefore, you will run into difficulty if you translate the algorithm from the notes to C/C++/Java directly.
- Start early: this is a difficult project.
- Keep a counter for each line in the trace file. Use this for the Tomasulo tags.
- Make each pipeline stage into a function (method)
- Execute the procedures in *reverse order* from the flow of instructions (e.g., execute the state update procedure, then the procedure for the second stage of execution, then the procedure for the first stage of execution, etc.)
- It should help to have a data structure for all of the pipeline latches (e.g., a two dimensional array, one dimension for the number of execution units, another dimension for the number of pipeline stages)
- Add a "debug" mode to your simulator that will print out what occurs during each simulated execution cycle. The debug mode should match the style of our example verification output.
- Check the webpage frequently for updates and notes. There will be a lot of clarifications required to get everyone matching our output.

## 4.5 Grading

- 0% You do not hand in anything by the due date
- +50% Your simulator doesn't run, does not work, but you hand in *significant* commented code
- +25% Your simulator matches the validation output posted on the website
- +20% You ran all experiments outlined above (your simulator *must be* validated first!)
- +5% You justified each design with graphs or tables and a persuasive argument