### ILP: COMPILER-BASED TECHNIQUES

Mahdi Nazm Bojnordi

**Assistant Professor** 

School of Computing

University of Utah



#### Overview

- Announcements
  - Homework 2 submission deadline: Feb. 13th
  - Homework 1 solutions will be released soon

- This lecture
  - Program execution
  - Loop optimization
  - Superscalar pipelines
  - Software pipelining

### Big Picture

- Goal: improving performance
  - Performance = $(IPC \times F) / IC$
  - Increasing IPC:
    1. Improve ILP
    2. Exploit more ILP
  - Increasing F:
    1. Deeper pipelines
    2. Faster technology

[Figure: system layers from software (ILP and IC; code generation) to hardware (IPC; architecture, circuit/device), shown above a five-stage pipeline: instruction fetch, instruction decode, execute, memory access, write back.]
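The performance relation above can be tried out numerically. The following is a minimal sketch; the function name `performance` and the workload numbers (a 2 GHz clock, one billion instructions) are hypothetical values chosen for illustration, not figures from the lecture.

```python
def performance(ipc, freq_hz, instruction_count):
    """Program runs completed per second: (IPC x F) / IC."""
    return ipc * freq_hz / instruction_count

# Hypothetical baseline: IPC of 0.5 at 2 GHz, 1e9 dynamic instructions.
base = performance(ipc=0.5, freq_hz=2e9, instruction_count=1e9)

# Doubling IPC (hardware or scheduling) doubles performance.
better_ipc = performance(ipc=1.0, freq_hz=2e9, instruction_count=1e9)

# Halving the instruction count (software) has the same effect.
fewer_instructions = performance(ipc=0.5, freq_hz=2e9, instruction_count=0.5e9)

print(base, better_ipc, fewer_instructions)
```

The sketch treats IPC, F, and IC as independent, which is a simplification: in practice a deeper pipeline that raises F often lowers IPC.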

### Big Picture

- Goal: improving performance

**Architectural Techniques:**

- Deep pipelining
  - Ideal speedup = n times for an n-stage pipeline
- Exploiting ILP
  - Dynamic scheduling (HW)
  - Static scheduling (SW)

[Figure: software (ILP and IC) and hardware (IPC) layers.]


### Processor Pipeline

Necessary stall cycles between dependent instructions

[Figure: pipeline with IF, ID, EX, MEM, and WB stages, including FP/integer multiply and FP/integer divider units, next to a table of producer/consumer stall cycles. © 2007 Elsevier, Inc. All rights reserved.]

### Program

#### Loop book-keeping overheads

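```
Loop: L.D    F0, 0(R1)      ; F0 = array element
      ADD.D  F4, F0, F2     ; add the scalar s held in F2
      S.D    F4, 0(R1)      ; store the result
      DADDUI R1, R1, #-8    ; move the pointer down by 8 bytes (one double)
      BNE    R1, R2, Loop   ; repeat while R1 != R2
```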

| Producer | Consumer | Stalls |
|----------|----------|--------|
| Load     | Any      | 1      |
| fp.ALU   | Any      | 3      |
| fp.ALU   | Store    | 2      |
| int.ALU  | Branch   | 1      |

Goal: adding *s* to all of the array elements

[Figure: array *m* with elements indexed 0 through 999, and the scalar *s*.]

#### **Execution Schedule**

#### Diverse impact of stall cycles on performance

| Label | Instruction | Operands     |
|-------|-------------|--------------|
| Loop: | L.D         | F0, 0(R1)    |
|       | ADD.D       | F4, F0, F2   |
|       | S.D         | F4, 0(R1)    |
|       | DADDUI      | R1, R1, #-8  |
|       | BNE         | R1, R2, Loop |

| Producer | Consumer | Stalls |
|----------|----------|--------|
| Load     | Any      | 1      |
| fp.ALU   | Any      | 3      |
| fp.ALU   | Store    | 2      |
| int.ALU  | Branch   | 1      |

```
Loop: L.D    F0, 0(R1)
      stall
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall
```

#### Schedule 1:

5 stall cycles

3 loop body instructions

2 loop counter instructions
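The stall count of Schedule 1 can be reproduced with a small model of an in-order, single-issue pipeline. This is a sketch under stated assumptions: the tuple encoding of instructions and the names `required_stalls` and `count_stalls` are hypothetical, dependences across loop iterations are ignored, and the cycle after the branch is counted as a stall whenever no instruction follows the branch (as in the Schedule 1 listing; the stall table itself does not list it).

```python
def required_stalls(producer, consumer):
    """Stall cycles between a producer and a dependent consumer (stall table)."""
    if producer == "load":
        return 1                                  # Load -> Any
    if producer == "fp":
        return 2 if consumer == "store" else 3    # fp.ALU -> Store / Any
    if producer == "int" and consumer == "branch":
        return 1                                  # int.ALU -> Branch
    return 0

def count_stalls(schedule):
    """Count stall cycles for one pass through a loop body."""
    issue_of = {}      # register -> (issue cycle, class) of its latest producer
    next_cycle = 0
    stalls = 0
    for kind, dest, sources in schedule:
        issue = next_cycle
        for reg in sources:
            if reg in issue_of:
                cycle, producer = issue_of[reg]
                issue = max(issue, cycle + 1 + required_stalls(producer, kind))
        stalls += issue - next_cycle
        if dest is not None:
            issue_of[dest] = (issue, kind)
        next_cycle = issue + 1
    # Assumption: one delay cycle follows a branch unless an instruction fills it.
    if schedule and schedule[-1][0] == "branch":
        stalls += 1
    return stalls

schedule_1 = [
    ("load",   "F0", ["R1"]),         # L.D    F0, 0(R1)
    ("fp",     "F4", ["F0", "F2"]),   # ADD.D  F4, F0, F2
    ("store",  None, ["F4", "R1"]),   # S.D    F4, 0(R1)
    ("int",    "R1", ["R1"]),         # DADDUI R1, R1, #-8
    ("branch", None, ["R1", "R2"]),   # BNE    R1, R2, Loop
]
print(count_stalls(schedule_1))
```

The model only counts stalls; it does not decide where an instruction may legally move, which is the compiler's job in the schedules that follow.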

### Loop Optimization

#### Re-ordering and changing immediate values

#### Schedule 1:

```
Loop: L.D    F0, 0(R1)
      stall
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall
```

5 stall cycles

3 loop body instructions

2 loop counter instructions

#### Schedule 2:

```
Loop: L.D    F0, 0(R1)
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2
      stall
      BNE    R1, R2, Loop
      S.D    F4, 8(R1)      ; offset is 8 because R1 was already decremented
```

1 stall cycle

3 loop body instructions

2 loop counter instructions
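Moving `DADDUI` above the store is only legal if the store offset changes from 0 to 8. The sketch below checks this on a small array by modeling both instruction orders. It is illustrative only: R1 is modeled as an array index (one element corresponds to 8 bytes), R2 is modeled as the hypothetical sentinel index -1, and the function names are invented for this example.

```python
def run_original(mem, s):
    """Original order: store through 0(R1), then decrement R1."""
    mem = list(mem)
    r1, r2 = len(mem) - 1, -1      # index form of the pointers in R1 and R2
    while True:
        f0 = mem[r1]               # L.D    F0, 0(R1)
        f4 = f0 + s                # ADD.D  F4, F0, F2
        mem[r1] = f4               # S.D    F4, 0(R1)
        r1 -= 1                    # DADDUI R1, R1, #-8
        if r1 == r2:               # BNE    R1, R2, Loop
            break
    return mem

def run_reordered(mem, s):
    """Reordered: R1 is decremented early, so the store uses 8(R1)."""
    mem = list(mem)
    r1, r2 = len(mem) - 1, -1
    while True:
        f0 = mem[r1]               # L.D    F0, 0(R1)
        r1 -= 1                    # DADDUI R1, R1, #-8
        f4 = f0 + s                # ADD.D  F4, F0, F2
        taken = r1 != r2           # BNE    R1, R2, Loop (branch decision)
        mem[r1 + 1] = f4           # S.D    F4, 8(R1), executed in the delay slot
        if not taken:
            break
    return mem

data = [3, 1, 4, 1, 5]
print(run_original(data, 10))
print(run_reordered(data, 10))
```

Both functions assume a non-empty array, matching the do-while shape of the assembly loop.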

## Loop Unrolling

#### Reducing loop overhead by unrolling

```
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, Loop
```

```
do {
    m[i-0] = m[i-0] + s;
    m[i-1] = m[i-1] + s;
    m[i-2] = m[i-2] + s;
    m[i-3] = m[i-3] + s;
    i = i-4;
} while (i > 0);
```
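The effect of unrolling on loop overhead can be seen by counting loop-control steps. The following sketch compares a rolled loop with a 4-way unrolled loop. The function names are hypothetical, and the unrolled version assumes the array length is a multiple of 4 (true for the 1000-element array in this example); a general compiler would also emit a cleanup loop for leftover elements.

```python
def add_scalar_rolled(m, s):
    """One element per iteration; returns (result, loop iterations)."""
    out = list(m)
    iterations = 0
    i = len(out) - 1
    while i >= 0:
        out[i] = out[i] + s
        i -= 1
        iterations += 1
    return out, iterations

def add_scalar_unrolled4(m, s):
    """Four elements per iteration; returns (result, loop iterations)."""
    out = list(m)
    assert len(out) % 4 == 0, "sketch assumes the length is a multiple of 4"
    iterations = 0
    i = len(out) - 1
    while i > 0:
        out[i - 0] = out[i - 0] + s
        out[i - 1] = out[i - 1] + s
        out[i - 2] = out[i - 2] + s
        out[i - 3] = out[i - 3] + s
        i -= 4
        iterations += 1
    return out, iterations

m = list(range(1000))
rolled, rolled_iters = add_scalar_rolled(m, 5)
unrolled, unrolled_iters = add_scalar_unrolled4(m, 5)
print(rolled == unrolled, rolled_iters, unrolled_iters)
```

Each iteration carries one pointer update and one branch, so the unrolled loop executes a quarter of the loop-counter instructions for the same work.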

Goal: adding *s* to all of the array elements




#### Schedule 3:

14 stall cycles

12 loop body instructions

2 loop counter instructions
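The 14 stall cycles of Schedule 3 follow directly from the stall table when the unrolled loop is left in its original order. The arithmetic is sketched below. The constant names are hypothetical, and the branch delay cycle is an assumption carried over from the Schedule 1 listing rather than an entry in the stall table.

```python
UNROLL = 4            # copies of the loop body
LOAD_TO_FP = 1        # Load -> Any: L.D followed by its ADD.D
FP_TO_STORE = 2       # fp.ALU -> Store: ADD.D followed by its S.D
INT_TO_BRANCH = 1     # int.ALU -> Branch: DADDUI followed by BNE
BRANCH_DELAY = 1      # assumption: unfilled cycle after BNE

stalls = UNROLL * (LOAD_TO_FP + FP_TO_STORE) + INT_TO_BRANCH + BRANCH_DELAY
instructions = 3 * UNROLL + 2          # 12 loop body + 2 loop counter
cycles = instructions + stalls
cycles_per_element = cycles / UNROLL

print(stalls, instructions, cycles, cycles_per_element)
```

Unrolling alone removes loop-counter overhead per element but leaves every dependent pair adjacent, which is why reordering is still needed.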

### Instruction Reordering

#### Eliminating stall cycles by unrolling and scheduling

Unrolled loop:

| Label | Instruction | Operands      |
|-------|-------------|---------------|
| Loop: | L.D         | F0, 0(R1)     |
|       | ADD.D       | F4, F0, F2    |
|       | S.D         | F4, 0(R1)     |
|       | L.D         | F6, -8(R1)    |
|       | ADD.D       | F8, F6, F2    |
|       | S.D         | F8, -8(R1)    |
|       | L.D         | F10, -16(R1)  |
|       | ADD.D       | F12, F10, F2  |
|       | S.D         | F12, -16(R1)  |
|       | L.D         | F14, -24(R1)  |
|       | ADD.D       | F16, F14, F2  |
|       | S.D         | F16, -24(R1)  |
|       | DADDUI      | R1, R1, #-32  |
|       | BNE         | R1, R2, Loop  |

Unrolled and scheduled loop:

| Label | Instruction | Operands      |
|-------|-------------|---------------|
| Loop: | L.D         | F0, 0(R1)     |
|       | L.D         | F6, -8(R1)    |
|       | L.D         | F10, -16(R1)  |
|       | L.D         | F14, -24(R1)  |
|       | ADD.D       | F4, F0, F2    |
|       | ADD.D       | F8, F6, F2    |
|       | ADD.D       | F12, F10, F2  |
|       | ADD.D       | F16, F14, F2  |
|       | S.D         | F4, 0(R1)     |
|       | S.D         | F8, -8(R1)    |
|       | DADDUI      | R1, R1, #-32  |
|       | S.D         | F12, 16(R1)   |
|       | BNE         | R1, R2, Loop  |
|       | S.D         | F16, 8(R1)    |

### IPC Limit

#### Eliminating stall cycles by unrolling and scheduling

#### Schedule 4:

0 stall cycles

12 loop body instructions

2 loop counter instructions

- Benefit: IPC = 1
- Cost: more instructions (larger code)
- Cost: more registers
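The four schedules can be compared on cycles per array element and on IPC. The sketch below uses the instruction and stall counts stated for each schedule; the dictionary layout and the function name `summarize` are hypothetical.

```python
schedules = {
    "Schedule 1": {"instructions": 5,  "stalls": 5,  "elements": 1},
    "Schedule 2": {"instructions": 5,  "stalls": 1,  "elements": 1},
    "Schedule 3": {"instructions": 14, "stalls": 14, "elements": 4},
    "Schedule 4": {"instructions": 14, "stalls": 0,  "elements": 4},
}

def summarize(schedule):
    """Cycles per loop iteration, cycles per array element, and IPC."""
    cycles = schedule["instructions"] + schedule["stalls"]
    return {
        "cycles": cycles,
        "cycles_per_element": cycles / schedule["elements"],
        "ipc": schedule["instructions"] / cycles,
    }

summary = {name: summarize(s) for name, s in schedules.items()}
for name, row in summary.items():
    print(name, row["cycles"], row["cycles_per_element"], round(row["ipc"], 2))
```

Schedule 4 reaches the scalar limit of one instruction per cycle, so further gains require issuing more than one instruction per cycle.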




### Summary of Scalar Pipelines

- Upper bound on throughput
  - IPC <= 1
- Unified pipeline for all functional units
  - Underutilized resources
- Inefficient freeze policy
  - A stall cycle delays all the following instructions
- Pipeline hazards
  - Stall cycles result in limited throughput

### Superscalar Pipelines

- Separate integer and floating point pipelines
  - An instruction packet is fetched every cycle
    - Very large instruction word (VLIW)
  - Each instruction packet has one fp. slot and one int. slot
  - The compiler's job is to find instructions for the slots
  - IPC <= 2



#### Forming instruction packets


- Ideally, the number of empty slots is zero

| Label | Integer slot        | Floating-point slot |
|-------|---------------------|---------------------|
| Loop: | L.D F0, 0(R1)       | NOP                 |
|       | L.D F6, -8(R1)      | NOP                 |
|       | L.D F10, -16(R1)    | ADD.D F4, F0, F2    |
|       | L.D F14, -24(R1)    | ADD.D F8, F6, F2    |
|       | DADDUI R1, R1, #-32 | ADD.D F12, F10, F2  |
|       | S.D F4, 32(R1)      | ADD.D F16, F14, F2  |
|       | S.D F8, 24(R1)      | NOP                 |
|       | S.D F12, 16(R1)     | NOP                 |
|       | BNE R1, R2, Loop    | NOP                 |
|       | S.D F16, 8(R1)      | NOP                 |

#### Schedule 5:

0 stall cycles

8 loop body packets

2 loop overhead cycles

IPC = 1.4
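The IPC of Schedule 5 follows from counting useful operations across the packets. The sketch below encodes each packet as an (integer slot, floating-point slot) pair; this encoding and the variable names are hypothetical.

```python
NOP = "NOP"

# One tuple per cycle: (integer slot, floating-point slot).
packets = [
    ("L.D F0, 0(R1)",       NOP),
    ("L.D F6, -8(R1)",      NOP),
    ("L.D F10, -16(R1)",    "ADD.D F4, F0, F2"),
    ("L.D F14, -24(R1)",    "ADD.D F8, F6, F2"),
    ("DADDUI R1, R1, #-32", "ADD.D F12, F10, F2"),
    ("S.D F4, 32(R1)",      "ADD.D F16, F14, F2"),
    ("S.D F8, 24(R1)",      NOP),
    ("S.D F12, 16(R1)",     NOP),
    ("BNE R1, R2, Loop",    NOP),
    ("S.D F16, 8(R1)",      NOP),
]

cycles = len(packets)
useful = sum(op != NOP for packet in packets for op in packet)
empty_slots = 2 * cycles - useful
ipc = useful / cycles

print(cycles, useful, empty_slots, ipc)
```

Six of the twenty slots are empty because this loop has only four floating-point operations for every ten integer-slot operations, so the IPC stays below the two-slot limit.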



### Software Pipelining

```
Loop: L.D    F0, 0(R1)
      stall
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      stall
      BNE    R1, R2, Loop
      stall
```



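| Stage | Original iteration | Software-pipelined loop  |
|-------|--------------------|--------------------------|
| SD    | (1)                | Loop: S.D F4, 0(R1)      |
| ADD   | (2)                | ADD.D F4, F0, F2         |
| LD    | (3)                | L.D F0, -16(R1)          |
| ADDI  |                    | DADDUI R1, R1, #-8       |
| BNE   |                    | BNE R1, R2, Loop         |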



**Prologue and Epilogue?**
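One possible shape for the prologue and epilogue is sketched below. It models the software-pipelined loop on a Python list, with variables `f0` and `f4` standing in for the registers. The function name is hypothetical, the array is assumed to have at least two elements, and this is one way to fill and drain the pipeline rather than the only one.

```python
def add_scalar_software_pipelined(m, s):
    """Store element i, add for element i-1, load element i-2 per iteration."""
    out = list(m)
    n = len(out)
    assert n >= 2, "sketch assumes at least two elements"

    # Prologue: start the first two iterations to fill the pipeline.
    f0 = out[n - 1]            # L.D   for the first element
    f4 = f0 + s                # ADD.D for the first element
    f0 = out[n - 2]            # L.D   for the second element

    # Steady state: each pass mixes work from three original iterations.
    for i in range(n - 1, 1, -1):
        out[i] = f4            # S.D   F4, 0(R1)     (iteration 1)
        f4 = f0 + s            # ADD.D F4, F0, F2    (iteration 2)
        f0 = out[i - 2]        # L.D   F0, -16(R1)   (iteration 3)

    # Epilogue: drain the two iterations still in flight.
    out[1] = f4                # S.D   for element 1
    f4 = f0 + s                # ADD.D for element 0
    out[0] = f4                # S.D   for element 0
    return out

print(add_scalar_software_pipelined([3, 1, 4, 1, 5], 10))
```

Inside the steady-state loop no instruction consumes a value produced in the same pass, which is what removes the dependence stalls without unrolling the code.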