Student Name: Store Arstortick

Student ID: 57677

## COMP 303 - Computer Architecture: HW3

Due: Nov 13, 2019, 4:00 pm

Instructor: Didem Unat

Corresponding TAs: Aditya Sasongko

**Notes:** This is an individual assignment (no groups). You may discuss the problems with your peers but the submitted work must be your own work. No late assignment will be accepted. Submit a SOFT copy of your assignment to the blackboard \*AND\* submit a HARD copy of it to the COMP 303 mailboxes or bring it to the class. This assignment is worth 4% of your total grade.

## Problem 1

(25+25+10 pts) Consider a six-stage pipeline (IF, ID, EX, M1, M2, WB) processor where memory operations take two pipeline stages, so load result values are not available until after the M2 (memory) stage is completed. For this problem we will use the following assembly language sequence.

```
myloop:

ADD r1, r3, r1

LW r1, 0(r1)

SW r2, 0(r1)

SUBI r4, r4, 4

LW r3, 0(r4)

ADD r2, r2, r1

BNEZ r4, myloop

SUB r1, r3, r1

OR r5, r1, r2

...
```

- a) Draw the pipeline timing diagram using the table below for the code sequence above.
  - Start with the first instruction of the loop (line 1) and draw the timing diagram through one full loop iteration including the first ADD of the second loop iteration.
  - Assume branches are resolved in the execution stage (EX) and branches are NOT taken.
  - Show all stalls with an X and all the flushes with an F.
  - Show the forwardings with an arrow needed to help eliminate stalls and fill the forwarding table with source and destination pipeline registers.

| Cycles (time) |  |
|---------------|--|
|               |  |

|      | C1 | C2 | СЗ  | C4 | C5 | C6 | C7 | C8  | C9 | C10 | C11 | C12      | C13 | C14 | C15 | C16 | C17 | C18 |
|------|----|----|-----|----|----|----|----|-----|----|-----|-----|----------|-----|-----|-----|-----|-----|-----|
| ADD  | 15 | ID | EX. | M1 | M2 | WB |    |     |    |     |     |          |     |     |     |     |     |     |
| LW   |    | IF | ID  | EX | M1 | MZ | WB |     |    |     |     |          |     |     |     |     |     |     |
| SW   |    |    | 15  | ID | X  | X  | EX | MI  | W2 | WB  |     |          |     |     |     |     |     |     |
| SUBI |    |    |     | 15 | X  | ×  | 10 | EX. | M1 | MZ  | WB  |          |     |     |     |     |     |     |
| LW   |    |    |     |    | X  | X  | F  | 12  | EX | M1  | M2  | wß       |     |     |     |     |     |     |
| ADD  |    |    |     |    |    |    |    | F   | 10 | EX  | M1  | M2       | WB  |     |     |     |     |     |
| Bnez |    |    |     |    |    |    |    | ·   | IF | 10  | EX  | M        | M2  | MB  |     |     |     |     |
| SUB  |    |    |     |    |    |    |    |     |    | IF  | D   | <u>և</u> | μ   | F   | F   |     |     |     |
| OR   |    |    |     |    |    |    |    |     |    |     | IF  | щ        | ۴   | F   | F   | F   |     |     |
| ADD  |    |    |     |    |    |    |    |     |    |     |     | ۲        | ID  | EX  | M   | M2  | B   |     |
|      |    |    |     |    |    |    |    |     |    |     |     |          |     |     |     |     |     |     |

| Source Instruction<br>(e.g. ADD r1, r3, r1) | Source Location<br>(e.g. ID/EX) | Destination<br>Instruction | Destination Location<br>(e.g. ID/EX) | Due to Register<br>(e.g. R1) |
|---------------------------------------------|---------------------------------|----------------------------|--------------------------------------|------------------------------|
| Addrling, r1                                | EXIMEM                          | Lw <1,0(1)                 | 10/EX                                | r1                           |
| LM 41,0 LM)                                 | MEM/WB                          | Sw r2,0(1)                 | IDIEX                                | -1                           |
| SUB1 +4,+4,4                                | EX /MEM                         | BNEZ ryimyloop             | IDIEX                                | <b>r</b> 4                   |
| SUB  +4, +4, 4                              | EX/MEM                          | LW r3,0(14)                | IDIEX                                | ry                           |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |

Student Name: Sitare Arslantick Page 2 of 6

b) Now assume that the architecture uses branch delay slots to try to eliminate all the branch penalty and compiler rearranges the code for the delay slots as follows:

Note that instructions 6 and 7 (lines 6 and 7) are "delay slot" instructions which means that they will be executed even when the branch in line 5 is taken. Draw the pipeline timing diagram and fill the forwarding table for this code. Start with the first instruction of the loop (line 1) and draw the timing diagram through one full loop iteration including the first ADD of the second loop iteration.

|      |    | _  | Cycles | (time) |    |    | _        |            |        |     |     |                |     |     |     |     |     |     |
|------|----|----|--------|--------|----|----|----------|------------|--------|-----|-----|----------------|-----|-----|-----|-----|-----|-----|
|      |    |    |        |        |    | ,  |          |            |        |     |     |                |     |     |     |     |     |     |
|      | C1 | C2 | C3     | C4     | C5 | C6 | C7       | C8         | C9     | C10 | C11 | C12            | C13 | C14 | C15 | C16 | C17 | C18 |
| ADD  | IF | ID | EX.    | M      | M2 | WB |          |            |        |     |     |                |     |     |     |     |     |     |
| LW   |    | IF | ID     | EX     | M  | M2 | WB       |            |        |     |     |                |     |     |     |     |     |     |
| SW   |    |    | 16     | ID     | X  | X  | EX       | <b>M</b> 1 | M2     | WB  |     |                |     |     |     |     |     |     |
| SUBI |    |    |        | F      | X  | X  | <u>J</u> | X          | M      | M2  | WB  |                |     |     |     |     |     |     |
| Bnez |    |    |        |        | X  | X  | Ŧ        | g,         | 以<br>以 | 47  | M2  | WB             |     |     |     |     |     |     |
| LW   |    |    |        |        |    |    |          | ī          | שו     | EX  | M1  | M <sub>2</sub> | WB  |     |     |     |     |     |
| ADD  |    |    |        |        |    |    |          |            | IF     | 10  | EX  | M1             | M2  | WB  |     |     |     |     |
| ADD  |    |    |        |        |    |    |          |            |        | IF  | סו  | EX             | M1  | MJ  | MB  |     |     |     |
|      | ·  |    |        |        |    |    | ·        |            |        |     | ·   |                |     |     |     |     |     |     |

| Source Instruction<br>(e.g. ADD r1, r3, r1) | Source Location<br>(e.g. ID/EX) | Destination<br>Instruction | Destination Location<br>(e.g. ID/EX) | Due to Register<br>(e.g. R1) |
|---------------------------------------------|---------------------------------|----------------------------|--------------------------------------|------------------------------|
| ADD r1, r3, r1                              | EX/MEM1                         | LW 1110(11)                | IDIEX                                | r1                           |
| Lw r1,0(1)                                  | MEM2/WB                         | swr2,0(11)                 | IDIEX                                | -1                           |
| SUBI 14,14,4                                | EX/MEM1                         | BNEZ 14, myloop            | IDIEX                                | 4                            |
| Subl chirty 4                               | MEM1/MEM2                       | LW +3,0(14)                | 1D/EX                                | ۳4                           |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |
|                                             |                                 |                            |                                      |                              |

c) Fill the table with RAW, WAW, and WAR dependencies for the above code example in (b) where delay slot instructions are used.

|                               | RAW                          |                    |                                | WAR                         |                    |                                | WAW                          |                    |
|-------------------------------|------------------------------|--------------------|--------------------------------|-----------------------------|--------------------|--------------------------------|------------------------------|--------------------|
| From<br>Instruction<br>(Read) | To<br>Instruction<br>(Write) | Due to<br>Register | From<br>Instruction<br>(Write) | To<br>instruction<br>(Read) | Due to<br>Register | From<br>Instruction<br>(Write) | To<br>Instruction<br>(Write) | Due to<br>Register |
| Lw (1,0(s1)                   | Add-1, r3, r1                | r1                 | All richid                     | Lw r1, 0(11)                | r1                 | AJJ H,13,11                    | Lw r1,0(r1)                  | r1                 |
| Sw r210(11)                   | LW +1,0(r1)                  | r1                 | SUB rl, r3, c1                 | A11 c2,r2,r4                | r1                 |                                |                              |                    |
| BNEZ ry,myloop                | 5081 24,24,4                 | ۲4                 |                                |                             |                    |                                |                              |                    |
| OR +5,+1,+2                   | sub r1, r3,r1                | r1                 |                                |                             |                    |                                |                              |                    |
|                               |                              |                    |                                |                             |                    |                                |                              |                    |
|                               |                              |                    |                                |                             |                    |                                |                              |                    |
|                               |                              |                    |                                |                             |                    |                                |                              |                    |
|                               |                              |                    |                                |                             |                    |                                |                              |                    |

## Problem 2



(24 pts) In figure above, unpipelined computation hardware is shown. On each 320 ps cycle, the system spends 300 ps evaluation of a combinational logic function and 20 ps storing the results in an output register. As a result, the latency of the hardware is 320 ps and throughput is 3.125 GIPS (giga instructions per second).

We would like to divide this combinational logic into separate logics; sequence of six blocks, named A to F, having delays of 80, 30, 60, 50, 70, and 10 ps, respectively as illustrated below:



Then we can create pipeline versions of this design by inserting pipeline registers between pairs of these blocks. Different combinations of pipeline depth (how many stages) and maximum throughput arise, depending on where we insert the pipeline registers. Assume that each pipeline register has a delay of 20 ps.

a) Inserting a single register gives a 2-stage pipeline. Where should the register be inserted to maximize throughput? What would be the clock cycle time, throughput and latency?

Between Cond D.

Cycletime= & +30+60+20=190 ps

Lotency: toleltime=320+20=340ps

TH = 1/190 = 01005263 ~ 5.263 GPS

b) Where should two registers be inserted to maximize the throughput of a 3-stage pipeline?

b) Where should two registers be inserted to maximize the throughput of a 3-stage pipeli. What would be the clock cycle time, throughput and latency?

Lotercy = 340+20=360 ps Cycletime = 80+30+20=60+50+20=130ps - 7 692 6178

c) Where should three registers be inserted to maximize the throughput of a 4-stage pipeline? What would be the clock cycle time, throughput and latency? A\_B ~ C\_D and D\_E

Cycletime = 30160+20=110 ps + Throughput = 1/110 = 0,00 90 91 = 9,091

d) What is the minimum number of stages that would yield a design with the maximum GIPS achievable throughput? Describe this design, its clock cycle time, throughput, and its la-

tency. Ascycletime & THT

4 registers -> CT=100 ps

moximize further.

tope

TH = 1/100 = 0.01000

5 Stopes maximizes TH.

Student Name: Latency = 380+20 = 400ps.

Page 5 of 6

Sitare Arslantürk

## Problem 3



(16 pts) For the MIPS datapath shown above there parts are malfunctioning. For each one: Provide a snippet of code that will fail and that will still work. Explain in words the negative consequence of cutting the line relative to the working, unmodified processor.

a) Line 1:

| Failed Code     | Working Code | Explanation (be concise)                                                                                 |  |  |
|-----------------|--------------|----------------------------------------------------------------------------------------------------------|--|--|
| LW \$11 0(\$62) | Mod 7        | The multiplexer receives only the ALU outcome, so load operations will fail to write back the true value |  |  |

b) Line 2:

| Failed Code        | Working Code     | Explanation (be concise)                                         |  |  |
|--------------------|------------------|------------------------------------------------------------------|--|--|
| odd \$41,\$42,\$43 | SW \$51, 0(\$52) | His the Regumentations<br>Cutithe register contbe<br>written on. |  |  |

c) Line 3: Explain the consequences of replacing the value with 8.

PC+8 would go to the second next instruction, each time skipping to jetch the next instruction which is located at PC+4.