#### Cache Memories

- Cache memories are small, fast SRAM-based memories managed automatically in hardware
  - Hold frequently accessed blocks of main memory
- CPU looks first for data in cache
- Typical system structure:







Smaller, faster, more expensive memory caches a subset of the blocks

Data is copied in block-sized transfer units

#### Memory

| 0     | 1         | 2         | 3         |
|-------|-----------|-----------|-----------|
| 4     | 5         | 6         | 7         |
| 8     | 9         | 10        | 11        |
| 12    | 13        | 14        | 15        |
| • • • | • • • • • | • • • • • | • • • • • |

#### General Cache Organization

B bytes per cache block (the data)

Direct mapped: one line per set. Assume: cache block size 8 bytes.

#### Address of int:

t tag bits, set index 0...01, block offset 100



If tag doesn't match: old line is evicted and replaced
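The address split used for a lookup can be sketched in C. This is a minimal illustration, assuming a cache with 2^s sets and 2^b bytes per block; the type `addr_fields` and the function `split_address` are hypothetical names, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: the three fields of an address as the cache sees it. */
typedef struct {
    uint64_t tag;     /* remaining high-order bits */
    uint64_t set;     /* s set-index bits */
    uint64_t offset;  /* low b bits: byte position inside the block */
} addr_fields;

/* Split addr for a cache with 2^s sets and 2^b bytes per block. */
addr_fields split_address(uint64_t addr, unsigned s, unsigned b)
{
    assert(s + b < 64);  /* keep every shift below the word width */
    addr_fields f;
    f.offset = addr & ((1ULL << b) - 1);
    f.set    = (addr >> b) & ((1ULL << s) - 1);
    f.tag    = addr >> (s + b);
    return f;
}
```

For example, with s = 2 and b = 1, address 7 (0111 in binary) splits into tag 0, set 3, offset 1, and address 8 (1000) into tag 1, set 0, offset 0.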



M=16 bytes (4-bit addresses), B=2 bytes/block, S=4 sets, E=1 line/set

| t=1 | s=2 | b=1 |
|-----|-----|-----|
| x   | xx  | x   |

Address trace (reads, one byte per read):

| Address | Binary                 | Result | Cache update                             |
|---------|------------------------|--------|------------------------------------------|
| 0       | $[0\underline{00}0_2]$ | miss   | Set 0 loads M[0-1], tag 0                |
| 1       | $[0\underline{00}1_2]$ | hit    |                                          |
| 7       | $[0\underline{11}1_2]$ | miss   | Set 3 loads M[6-7], tag 0                |
| 8       | $[1\underline{00}0_2]$ | miss   | Set 0 evicts M[0-1], loads M[8-9], tag 1 |
| 0       | $[0\underline{00}0_2]$ | miss   | Set 0 evicts M[8-9], loads M[0-1], tag 0 |

Initially every line is invalid (V = 0). Cache state after the last access:

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 0   | M[0-1] |
| Set 1 |   |     |        |
| Set 2 |   |     |        |
| Set 3 | 1 | 0   | M[6-7] |
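
A direct-mapped lookup like the one traced above can be sketched as a tiny simulator. This is an illustration only, assuming the same toy parameters (S = 4, B = 2, 4-bit addresses); `cache_line` and `simulate_direct_mapped` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SETS   4  /* S = 4 sets */
#define SET_BITS   2  /* s = 2 index bits */
#define BLOCK_BITS 1  /* B = 2 bytes per block, so b = 1 offset bit */

typedef struct {
    bool valid;
    unsigned tag;
} cache_line;

/* Simulate reads on a direct-mapped cache (E = 1), starting cold.
   hit[i] records whether access i hit; the return value is the hit count. */
size_t simulate_direct_mapped(const unsigned *trace, size_t n, bool *hit)
{
    assert(trace != NULL && hit != NULL);
    cache_line cache[NUM_SETS] = {0};  /* every line starts invalid */
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> BLOCK_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[i] >> (BLOCK_BITS + SET_BITS);
        cache_line *line = &cache[set];
        hit[i] = line->valid && line->tag == tag;
        if (hit[i]) {
            hits++;
        } else {
            /* Miss: the old line (if any) is evicted and replaced. */
            line->valid = true;
            line->tag = tag;
        }
    }
    return hits;
}
```

Running it on the trace 0, 1, 7, 8, 0 gives one hit (the read of address 1); the final read of 0 misses because address 8 mapped to the same set and evicted it.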

#### E-way Set Associative Cache (E = 2)

E = 2: Two lines per set

Assume: cache block size 8 bytes

Address of short int: t tag bits, set index 0...01, block offset 100

Each line in the set holds a valid bit (v), a tag, and block bytes 0-7. The block offset locates the data within the matching line.











#### No match:

- One line in set is selected for eviction and replacement
- Replacement policies: random, least recently used (LRU), ...
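
LRU replacement can be sketched with a per-line timestamp. This is a minimal illustration, not the hardware mechanism: it assumes a toy 2-way cache (S = 2 sets, B = 2 bytes per block, 4-bit addresses), and `cache_line` and `simulate_lru` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_SETS   2  /* S = 2 sets, so s = 1 index bit */
#define NUM_WAYS   2  /* E = 2 lines per set */
#define SET_BITS   1
#define BLOCK_BITS 1  /* B = 2 bytes per block */

typedef struct {
    bool valid;
    unsigned tag;
    size_t last_used;  /* 0 = never used; larger = more recent */
} cache_line;

/* Simulate reads on a 2-way set associative cache with LRU replacement.
   hit[i] records whether access i hit; the return value is the hit count. */
size_t simulate_lru(const unsigned *trace, size_t n, bool *hit)
{
    assert(trace != NULL && hit != NULL);
    cache_line cache[NUM_SETS][NUM_WAYS] = {0};
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> BLOCK_BITS) & (NUM_SETS - 1);
        unsigned tag = trace[i] >> (BLOCK_BITS + SET_BITS);
        cache_line *lines = cache[set];
        cache_line *found = NULL;
        cache_line *victim = &lines[0];
        for (int w = 0; w < NUM_WAYS; w++) {
            if (lines[w].valid && lines[w].tag == tag)
                found = &lines[w];
            /* Invalid lines have last_used == 0, so empty lines are
               filled before any valid line is evicted. */
            if (lines[w].last_used < victim->last_used)
                victim = &lines[w];
        }
        hit[i] = (found != NULL);
        if (found) {
            hits++;
        } else {
            victim->valid = true;  /* evict the least recently used line */
            victim->tag = tag;
            found = victim;
        }
        found->last_used = i + 1;  /* timestamps start at 1 */
    }
    return hits;
}
```

On the trace 0, 1, 7, 8, 0 this sketch reports two hits: address 8 fills the empty second line of set 0 instead of evicting the block holding address 0, so the final read of 0 hits.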



M=16 bytes (4-bit addresses), B=2 bytes/block, S=2 sets, E=2 blocks/set

| t=2 | s=1 | b=1 |
|-----|-----|-----|
| xx  | x   | x   |

Address trace (reads, one byte per read):

| Address | Binary                 | Result | Cache update                                   |
|---------|------------------------|--------|------------------------------------------------|
| 0       | $[00\underline{0}0_2]$ | miss   | Set 0 loads M[0-1], tag 00                     |
| 1       | $[00\underline{0}1_2]$ | hit    |                                                |
| 7       | $[01\underline{1}1_2]$ | miss   | Set 1 loads M[6-7], tag 01                     |
| 8       | $[10\underline{0}0_2]$ | miss   | Set 0 loads M[8-9], tag 10, into its free line |
| 0       | $[00\underline{0}0_2]$ | hit    |                                                |

Initially every line is invalid (V = 0). Cache state after the last access:

|       | V | Tag | Block  |
|-------|---|-----|--------|
| Set 0 | 1 | 00  | M[0-1] |
|       | 1 | 10  | M[8-9] |
| Set 1 | 1 | 01  | M[6-7] |
|       | 0 |     |        |

### What about writes?



- Multiple copies of data exist:
  - L1, L2, L3, Main Memory, Disk
- What to do on a write-hit?
  - *Write-through* (write immediately to memory, expensive)
  - Write-back (defer write to memory until replacement of line)
    - Need a dirty bit (line different from memory or not?)
- What to do on a write-miss?
  - Write-allocate (load into cache, update line in cache)
    - Good if more writes to the location follow
  - *No-write-allocate* (writes straight to memory, does not load into cache)
- Typical
  - Write-back + Write-allocate (i.e. update cache until eviction)
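
The difference in memory traffic between the two write-hit policies can be sketched by counting writes that reach memory. This is a deliberately simplified model, assuming a cache with a single line and a final flush at the end so memory ends up current; the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool valid;
    bool dirty;      /* line differs from memory */
    unsigned block;  /* block number currently cached */
} cache_line;

/* Write-back + write-allocate on a one-line cache (an assumption made to
   keep the sketch short). blocks[i] is the block written by access i.
   Returns how many block writes reach memory. */
size_t memory_writes_write_back(const unsigned *blocks, size_t n)
{
    assert(blocks != NULL || n == 0);
    cache_line line = {0};
    size_t mem_writes = 0;
    for (size_t i = 0; i < n; i++) {
        if (!(line.valid && line.block == blocks[i])) {
            /* Write miss: write the old line back only if it is dirty,
               then allocate the new block in the cache. */
            if (line.valid && line.dirty)
                mem_writes++;
            line.valid = true;
            line.block = blocks[i];
        }
        line.dirty = true;  /* update the cached copy only */
    }
    if (line.valid && line.dirty)
        mem_writes++;       /* assumed final flush */
    return mem_writes;
}

/* Write-through: every write goes to memory immediately. */
size_t memory_writes_write_through(const unsigned *blocks, size_t n)
{
    (void)blocks;
    return n;
}
```

For five writes to blocks 5, 5, 5, 9, 9, the write-back model sends 2 writes to memory and write-through sends 5, which is why write-back pays off when more writes to the same location follow.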

### Cache Performance Metrics

#### Miss Rate

- Fraction of memory references not found in cache (misses / accesses) = 1 - hit rate
- Typical numbers:
  - 3-10% for L1
  - < 1% for L2, depending on size, etc.

#### Hit Time

- Time to deliver a line in the cache to the processor
  - includes time to determine whether the line is in the cache
- Typical numbers:
  - 4 clock cycles for L1
  - 10 clock cycles for L2

#### Miss Penalty

- Additional time required because of a miss
  - typically 50-200 cycles for main memory

### Let's think about those numbers

- Huge difference between a hit and a miss
- Would you believe 99% hits is twice as good as 97%?
  - Consider:
    - cache hit time of 1 cycle
    - miss penalty of 100 cycles
  - Average access time:
    - 97% hits: 1 cycle + 0.03 \* 100 cycles = **4 cycles**
    - 99% hits: 1 cycle + 0.01 \* 100 cycles = **2 cycles**

- This is why "miss rate" is used instead of "hit rate"
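
The average access time arithmetic can be written as a one-line function. A minimal sketch; the function name `average_access_time` is an assumption for illustration.

```c
#include <assert.h>

/* Average access time in cycles:
   hit time + miss rate * miss penalty. */
double average_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}
```

With a 1-cycle hit time and a 100-cycle miss penalty, a miss rate of 0.03 gives 4 cycles and a miss rate of 0.01 gives 2 cycles (up to floating-point rounding).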

Writing Cache Friendly Code

## Matrix Multiplication Example



### Description:

- Multiply N x N matrices
- Matrix elements are doubles (8 bytes)
- $O(N^3)$  total operations
- N reads per source element
- N values summed per destination

Variable `sum` is held in a register:

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;                   /* accumulator stays in a register */
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;               /* one store per (i, j) */
  }
}
```

# Miss Rate Analysis for Matrix Multiply



#### Assume:

- Block size = 32B (big enough for 4 doubles)
- Matrix dimension (N) is "very large"
- Cache is not even big enough to hold multiple rows

### Analysis Method:

Look at access pattern of inner loop



# Layout of C Arrays in Memory (review)

C arrays are allocated in row-major order: each row in contiguous memory locations

### Stepping through columns in one row:

- `for (i = 0; i < n; i++) sum += a[0][i];`
- accesses successive, contiguous elements
- if block size > sizeof($a_{ij}$) bytes, exploit spatial locality
  - miss rate = sizeof($a_{ij}$) / block size

### Stepping through rows in one column:

- `for (i = 0; i < n; i++) sum += a[i][0];`
- accesses distant elements (stride-rowsize pattern)
- no spatial locality!
  - miss rate = 1 (i.e. 100%)
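
Both miss rates can be checked with a small counting model. This is a sketch under strong assumptions: the array starts block-aligned at byte 0, the cache is cold, and only the most recently touched block stays cached. The function `count_misses` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Count misses for n accesses at byte offsets 0, stride, 2*stride, ...
   assuming only the most recently touched block is cached. */
size_t count_misses(size_t n, size_t stride_bytes, size_t block_bytes)
{
    assert(block_bytes > 0);
    size_t misses = 0;
    size_t cached_block = (size_t)-1;  /* sentinel: nothing cached yet */
    for (size_t i = 0; i < n; i++) {
        size_t block = (i * stride_bytes) / block_bytes;
        if (block != cached_block) {
            misses++;
            cached_block = block;
        }
    }
    return misses;
}
```

For 1024 accesses to 8-byte doubles with 32-byte blocks, a stride of 8 bytes yields 256 misses (rate 0.25 = 8 / 32), while a stride of one full row (assumed here to be 1024 doubles, so 8192 bytes) misses on all 1024 accesses.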

# Matrix Multiplication (ijk)

```
/* ijk */
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];  /* A row-wise, B column-wise */
    c[i][j] = sum;
  }
}
```

### Misses per inner loop iteration:

| A    | B   | C   |
|------|-----|-----|
| 0.25 | 1.0 | 0.0 |

#### Inner loop:

- A (i,\*): row-wise
- B (\*,j): column-wise
- C (i,j): fixed

#### Access order for double[2][2]:

```
sum += A[0][0] * B[0][0]
sum += A[0][1] * B[1][0]
sum += A[0][0] * B[0][1]
sum += A[0][1] * B[1][1]
sum += A[1][0] * B[0][0]
sum += A[1][1] * B[1][0]
sum += A[1][0] * B[0][1]
sum += A[1][1] * B[1][1]
```

# Matrix Multiplication (jik)

```
/* jik */
for (j=0; j<n; j++) {
  for (i=0; i<n; i++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];  /* A row-wise, B column-wise */
    c[i][j] = sum;
  }
}
```

#### Inner loop:

- A (i,\*): row-wise
- B (\*,j): column-wise
- C (i,j): fixed

### Misses per inner loop iteration:

| A    | B   | C   |
|------|-----|-----|
| 0.25 | 1.0 | 0.0 |

Effectively same as ijk

# Matrix Multiplication (kij)

```
/* kij: c must be zero-initialized before the loops */
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];                 /* fixed during the inner loop */
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];    /* B and C both row-wise */
  }
}
```

#### Inner loop:

- A (i,k): fixed
- B (k,\*): row-wise
- C (i,\*): row-wise

### Misses per inner loop iteration:

| A   | B    | C    |
|-----|------|------|
| 0.0 | 0.25 | 0.25 |

#### Access order for double[2][2]:

```
C[0][0] += r * B[0][0]
C[0][1] += r * B[0][1]
C[1][0] += r * B[0][0]
C[1][1] += r * B[0][1]
C[0][0] += r * B[1][0]
C[0][1] += r * B[1][1]
C[1][0] += r * B[1][0]
C[1][1] += r * B[1][1]
```

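# Matrix Multiplication (ikj)

```
/* ikj: c must be zero-initialized before the loops */
for (i=0; i<n; i++) {
  for (k=0; k<n; k++) {
    r = a[i][k];                 /* fixed during the inner loop */
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];    /* B and C both row-wise */
  }
}
```

### Inner loop:

- A (i,k): fixed
- B (k,\*): row-wise
- C (i,\*): row-wise

### Misses per inner loop iteration:

| A   | B    | C    |
|-----|------|------|
| 0.0 | 0.25 | 0.25 |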

Effectively same as kij

## Matrix Multiplication (jki)

```
/* jki: c must be zero-initialized before the loops */
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];                 /* fixed during the inner loop */
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;    /* A and C both column-wise */
  }
}
```

#### Inner loop:

- A (\*,k): column-wise
- B (k,j): fixed
- C (\*,j): column-wise

### Misses per inner loop iteration:

| A   | B   | C   |
|-----|-----|-----|
| 1.0 | 0.0 | 1.0 |

#### Access order for double[2][2]:

```
C[0][0] += A[0][0] * r
C[1][0] += A[1][0] * r
C[0][0] += A[0][1] * r
C[1][0] += A[1][1] * r
C[0][1] += A[0][0] * r
C[1][1] += A[1][0] * r
C[0][1] += A[0][1] * r
C[1][1] += A[1][1] * r
```

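# Matrix Multiplication (kji)

```
/* kji: c must be zero-initialized before the loops */
for (k=0; k<n; k++) {
  for (j=0; j<n; j++) {
    r = b[k][j];                 /* fixed during the inner loop */
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;    /* A and C both column-wise */
  }
}
```

### Inner loop:

- A (\*,k): column-wise
- B (k,j): fixed
- C (\*,j): column-wise

### Misses per inner loop iteration:

| A   | B   | C   |
|-----|-----|-----|
| 1.0 | 0.0 | 1.0 |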

Effectively same as jki

### Summary of Matrix Multiplication

### ijk (& jik):

```
for (i=0; i<n; i++) {
  for (j=0; j<n; j++) {
    sum = 0.0;
    for (k=0; k<n; k++)
      sum += a[i][k] * b[k][j];
    c[i][j] = sum;
  }
}
```

- 2 loads, 0 stores
- misses/iter = **1.25**

### kij (& ikj):

```
for (k=0; k<n; k++) {
  for (i=0; i<n; i++) {
    r = a[i][k];
    for (j=0; j<n; j++)
      c[i][j] += r * b[k][j];
  }
}
```

- 2 loads, 1 store
- misses/iter = **0.5**

### jki (& kji):

```
for (j=0; j<n; j++) {
  for (k=0; k<n; k++) {
    r = b[k][j];
    for (i=0; i<n; i++)
      c[i][j] += a[i][k] * r;
  }
}
```

- 2 loads, 1 store
- misses/iter = **2.0**
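
The three loop orders compute the same product and differ only in memory access pattern. The sketch below writes them as standalone functions so they can be timed or compared. It is an illustration under assumptions not in the slides: matrices are stored as flat row-major arrays, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* All functions treat a, b, c as n x n row-major matrices:
   element (i, j) lives at index i*n + j. */

/* ijk: A row-wise, B column-wise, sum kept in a register. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] = sum;
        }
}

/* kij: B and C row-wise. c must be zeroed by the caller. */
void matmul_kij(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            double r = a[i*n + k];
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += r * b[k*n + j];
        }
}

/* jki: A and C column-wise. c must be zeroed by the caller. */
void matmul_jki(size_t n, const double *a, const double *b, double *c)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t j = 0; j < n; j++)
        for (size_t k = 0; k < n; k++) {
            double r = b[k*n + j];
            for (size_t i = 0; i < n; i++)
                c[i*n + j] += a[i*n + k] * r;
        }
}
```

Timing each function for large n (for example with `clock()` from `<time.h>`) is one way to compare the orders empirically; the measured gap depends on the machine and compiler settings.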

## Matrix Multiply Performance

## Cache Summary



### Cache memories can have significant performance impact

- You can write your programs to exploit this!
  - Focus on the inner loops, where bulk of computations and memory accesses occur.
  - Try to maximize spatial locality by reading data objects sequentially with stride 1.
  - This will be the next lab.