### **Caches I**

CSE 351 Spring 2019

#### **Instructor:**

**Ruth Anderson** 

### **Teaching Assistants:**

Gavin Cai
Jack Eggleston
John Feltrup
Britt Henderson
Richard Jiang
Jack Skalitzky
Sophie Tian
Connie Wang
Sam Wolfson
Casey Xing
Chin Yeoh









**Alt text:** I looked at some of the data dumps from vulnerable sites, and it was ... bad. I saw emails, passwords, password hints. SSL keys and session cookies. Important servers brimming with visitor IPs. Attack ships on fire off the shoulder of Orion, c-beams glittering in the dark near the Tannhäuser Gate. I should probably patch OpenSSL.

http://xkcd.com/1353/

### **Administrivia**

- Homework 3 due TONIGHT, Wednesday (5/8)
- Mid-quarter survey due Thursday (5/9)
- Lab 3, due Wednesday (5/15)
  - Bring your laptops to section tomorrow!
  - Download lab3 file and untar it before section
- Midterm Grading in progress, grades coming soon
  - Solutions posted on website
  - Rubric and grades will be found on Gradescope
  - Regrade requests will be open for a short time after grade release via Gradescope

## Roadmap

#### C:

```
car *c = malloc(sizeof(car));
c->miles = 100;
c->gals = 17;
float mpg = get_mpg(c);
free(c);
```

#### Java:

Memory & data
Integers & floats
x86 assembly
Procedures & stacks
Executables
Arrays & structs

### Memory & caches

Processes
Virtual memory
Memory allocation
Java vs. C

Assembly language:

```
get_mpg:
   pushq %rbp
   movq %rsp, %rbp
   ...
   popq %rbp
   ret
```

Machine code:

OS:



Computer system:







### **Aside: Units and Prefixes**

- Here focusing on large numbers (exponents > 0)
- Note that  $10^3 \approx 2^{10}$
- SI prefixes are ambiguous if base 10 or 2
- IEC prefixes are unambiguously base 2

SIZE PREFIXES (10<sup>x</sup> for Disk, Communication; 2<sup>x</sup> for Memory)

| SI Size          | Prefix | Symbol | IEC Size        | Prefix | Symbol |
|------------------|--------|--------|-----------------|--------|--------|
| 10 <sup>3</sup>  | Kilo-  | K      | 2 <sup>10</sup> | Kibi-  | Ki     |
| 10 <sup>6</sup>  | Mega-  | M      | $2^{20}$        | Mebi-  | Mi     |
| 10 <sup>9</sup>  | Giga-  | G      | 2 <sup>30</sup> | Gibi-  | Gi     |
| 10 <sup>12</sup> | Tera-  | T      | 2 <sup>40</sup> | Tebi-  | Ti     |
| 10 <sup>15</sup> | Peta-  | P      | 2 <sup>50</sup> | Pebi-  | Pi     |
| 10 <sup>18</sup> | Exa-   | Е      | $2^{60}$        | Exbi-  | Ei     |
| 10 <sup>21</sup> | Zetta- | Z      | 2 <sup>70</sup> | Zebi-  | Zi     |
| 10 <sup>24</sup> | Yotta- | Y      | 2 <sup>80</sup> | Yobi-  | Yi     |

### **How to Remember?**

- Will be given to you on Final reference sheet
- Mnemonics
  - There unfortunately isn't one well-accepted mnemonic
    - But that shouldn't stop you from trying to come with one!
  - Killer Mechanical Giraffe Teaches Pet, Extinct Zebra to Yodel
  - Kirby Missed Ganondorf Terribly, Potentially Exterminating
     Zelda and Yoshi
  - xkcd: Karl Marx Gave The Proletariat Eleven Zeppelins, Yo
    - https://xkcd.com/992/
  - Post your best on Piazza!

# How does execution time grow with SIZE?

```
int array[SIZE];
int sum = 0;
for (int i = 0; i < 200000; i++)
  for (int j = 0; j < SIZE; j++)
    sum += array[j];
                           Time
                       Plot
                                      (SIZE
```

### **Actual Data**



# Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
- Program optimizations that consider caches

### **Processor-Memory Gap**



**Problem: Processor-Memory Bottleneck** 

addq (2vax), 2vcx

Processor performance doubled about every 18 months



Bus latency / bandwidth evolved much slower

Main Memory

Core 2 Duo:

Can process at least 256 Bytes/cycle



Core 2 Duo:

**Bandwidth** 

2 Bytes/cycle

Latency

100-200 cycles (80-60ns)



**Problem: lots of waiting on memory** 

cycle: single machine step (fixed-time)

# **Problem: Processor-Memory Bottleneck**



cycle: single machine step (fixed-time)

## Cache 5

- Pronunciation: "cash"
  - We abbreviate this as "\$"
- English: A hidden storage space for provisions, weapons, and/or treasures
- Computer: Memory with short access time used for the storage of frequently or recently used instructions (i-cache/I\$) or data (d-cache/D\$)

L16: Caches I

 More generally: Used to optimize data transfers between any system elements with different characteristics (network interface cache, I/O cache, etc.)

### **General Cache Mechanics**





**Memory** 

3

7

11

**15** 

**General Cache Concepts: Miss** 





5

9

6

10

14

0

4

8

**Outa in block b is needed** 

Block b is not in cache: Miss!

- Block b is fetched from memory
- **(3)** Block b is stored in cache
  - Placement policy: determines where b goes
  - Replacement policy: determines which block gets evicted (victim)
- (4) Data is returned to CPU

# **Why Caches Work**

Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

block

# **Why Caches Work**

 Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently

### Temporal locality:

Recently referenced items are *likely* to be referenced again in the near future

e.g. like for / while loops

# **Why Caches Work**

 Locality: Programs tend to use data and instructions with addresses near or equal to those they have used recently













block

### **Example: Any Locality?**

```
sum = 0;
for (i = 0; i < n; i++)
{
   sum += a[i];
}
return sum;</pre>
```

### Data:

Temporal: sum referenced in each iteration

Spatial: array a[] accessed in stride-1 pattern

### Instructions:

Temporal: cycle through loop repeatedly

Spatial: reference instructions in sequence

```
int sum_array_rows(int a[M][N])
{
   int i, j, sum = 0;

   for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];

   return sum;
}</pre>
```





#### M = 3, N=4

| a[0][0] | a[0][1] | a[0][2] | a[0][3] |
|---------|---------|---------|---------|
| a[1][0] | a[1][1] | a[1][2] | a[1][3] |
|         |         |         |         |

#### **Access Pattern:**

#### Layout in Memory



**Note:** 76 is just one possible starting address of array a

```
int sum_array_cols(int a[M][N])
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
```





```
M = 3 N = 4
```

a[0][0] a[0][2] a[0][1] a[0][3]

a[1][0] a[1][1] a[1][2] a[1][3]

a[2][0]||a[2][1]| a[2][2] a[2][3]

**Access Pattern:** 

1) a[0][0] stride = ? a[1][0]

3) a[2][0]

stride-4 stride-N

a[0][1] a[1][1]

a[2][1]

a[0][2]

a[1][2]

a[2][2]

a[0][3] a[1][3] 11)

12) a[2][3]

What is wrong with this code?

How can it be fixed?



What is wrong with this code?

How can it be fixed?

```
inner loop: i → stride-L

j → stride-1

k → stride-N*L
```

### Layout in Memory (M = ?, N = 3, L = 4)



CSE351, Spring 2019

# **Cache Performance Metrics**



- Huge difference between a cache hit and a cache miss
  - Could be 100x speed difference between accessing cache and main memory (measured in clock cycles)
- Miss Rate (MR)
  - Fraction of memory references not found in cache (misses / accesses) = 1 - Hit Rate
- Hit Time (HT)
  - Time to deliver a block in the cache to the processor
    - Includes time to determine whether the block is in the cache
- Miss Penalty (MP)
  - Additional time required because of a miss

### **Cache Performance**

- Two things hurt the performance of a cache:
  - Miss rate and miss penalty
- Average Memory Access Time (AMAT): average time to access memory considering both hits and misses

```
AMAT ≠ Hit time + Miss rate × Miss penalty
(abbreviated AMAT = HT + MR \times MP)
                                               HT+HR +MT*MR
                                               HT*(1-MR)+(HT+MP)* MR
                                                 HT-HILLPAR+HILLPAR +MP*MR
```

• 99% hit rate twice as good as 97% hit rate!

(because of how drastically slower it is to move stuff from memory into the

Assume HT of 1 clock cycle and MP of 100 clock cycles

## **Peer Instruction Question**



 Processor specs: 200 ps clock, MP of 50 clock cycles, MR of 0.02 misses/instruction, and HT of 1 clock cycle

- Which improvement would be best?
  - Vote at <a href="http://pollev.com/rea">http://pollev.com/rea</a>
  - **A.** 190 ps clock



### Can we have more than one cache?

- Why would we want to do that?
  - Avoid going to memory!

- O optimize L1 for fast HT O optimize L2 for low MR
- Typical performance numbers:
  - Miss Rate
    - L1 MR = 3-10%
    - L2 MR = Quite small (e.g. < 1%), depending on parameters, etc.
  - Hit Time
    - L1 HT = 4 clock cycles
    - L2 HT = 10 clock cycles
  - Miss Penalty
    - P = 50-200 cycles for missing in L2 & going to main memory
    - Trend: increasing!

# An Example Memory Hierarchy



### **Summary**

### Memory Hierarchy

- Successively higher levels contain "most used" data from lower levels
- Exploits temporal and spatial locality
- Caches are intermediate storage levels used to optimize data transfers between any system elements with different characteristics

### Cache Performance

- Ideal case: found in cache (hit)
- Bad case: not found in cache (miss), search in next level
- Average Memory Access Time (AMAT) = HT + MR × MP
  - Hurt by Miss Rate and Miss Penalty