## **Caches II**

CSE 351 Spring 2019

#### **Instructor:**

**Ruth Anderson** 

#### **Teaching Assistants:**

Gavin Cai
Jack Eggleston
John Feltrup
Britt Henderson
Richard Jiang
Jack Skalitzky
Sophie Tian
Connie Wang
Sam Wolfson
Casey Xing
Chin Yeoh



### **Administrivia**

- Lab 3, due Wednesday (5/15)
- Homework 4, due Wed (5/22) (Structs, Caches)

L17: Caches II

- Midterm Grading completed
  - You should have received an email from Gradescope
  - Solutions posted on website
  - Rubric and grades will be found on Gradescope
  - Regrade requests will be open for a short time after grade release via Gradescope

# **Memory Hierarchies**

- Some fundamental and enduring properties of hardware and software systems:
  - Faster storage technologies almost always cost more per byte and have lower capacity
  - The gaps between memory technology speeds are widening
    - True for: registers  $\leftrightarrow$  cache, cache  $\leftrightarrow$  DRAM, DRAM  $\leftrightarrow$  disk, etc.
  - Well-written programs tend to exhibit good locality
- These properties complement each other beautifully
  - They suggest an approach for organizing memory and storage systems known as a <u>memory hierarchy</u>
    - For each level k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1

# **An Example Memory Hierarchy**



# **An Example Memory Hierarchy**





# **Intel Core i7 Cache Hierarchy**

#### Processor package



Block size:

64 bytes for all caches

L1 i-cache and d-cache:

32 KiB, 8-way,

Access: 4 cycles

L2 unified cache:

256 KiB, 8-way,

Access: 11 cycles

L3 unified cache:

8 MiB, 16-way,

Access: 30-40 cycles

# Making memory accesses fast!

- Cache basics
- Principle of locality
- Memory hierarchies
- Cache organization
  - Direct-mapped (sets; index + tag)
  - Associativity (ways)
  - Replacement policy
  - Handling writes
- Program optimizations that consider caches

# **Cache Organization (1)**

**Note:** The textbook uses "B" for block size

- $\bullet$  Block Size (K): unit of transfer between \$ and Mem
  - Given in bytes and always a power of 2 (e.g. 64 B)
  - Blocks consist of adjacent bytes (differ in address by 1)



# **Cache Organization (1)**

**Note:** The textbook uses "b" for offset bits

 $\bullet$  Block Size (K): unit of transfer between \$ and Mem

L17: Caches II

- Given in bytes and always a power of 2 (e.g. 64 B)
- Blocks consist of adjacent bytes (differ in address by 1)
  - Spatial locality!
- Offset field

64 bytes

• Low-order  $\log_2(K) = K$  bits of address tell you which byte within a block

- (address) mod  $2^n = n$  lowest bits of address
- (address) modulo (# of bytes in a block)

How many bits do I need to specify every byte in a block?

m-k bits k bits

m-bit address: Block Number Block Offset (refers to byte in memory)

## **Peer Instruction Question**

24145

- \* If we have 6-bit addresses and block size K = 4 B, which block and byte does 0x15 refer to?
  - Vote at: <a href="http://PollEv.com/rea">http://PollEv.com/rea</a>

|    | <b>Block Num</b> | <b>Block Offset</b> | $0x$ $\frac{1}{2}$                                             |
|----|------------------|---------------------|----------------------------------------------------------------|
| A. | 1                | 1                   | address: Ob O 1 O 1 O 1  block num offset  (value 5) (value 1) |
| B. | 1                | 5                   | (value 5) (value 1)                                            |
| C. | 5                | 1                   | offset width = logy (K) = logy (4) = 2 bis                     |
| D. | 5                | 5                   |                                                                |

E. We're lost...



# **Cache Organization (2)**

- Cache Size (C): amount of data the \$ can store
  - Cache can only hold so much data (subset of next level)
  - Given in bytes (C) or number of blocks (C/K)
  - Example: C = 32 KiB =  $512^6$  blocks if using 64-B blocks  $2^5 \times 2^{10} = 2^{15} B \times \frac{1 \text{ block}}{2^6 B} = 2^9 \text{ blocks}$
- Where should data go in the cache?
  - We need a mapping from memory addresses to specific locations in the cache to make checking the cache for an address fast
- What is a data structure that provides fast lookup?
  - Hash table!

# **Review: Hash Tables for Fast Lookup**

L17: Caches II



Place Data in Cache by Hashing Address



# Place Data in Cache by Hashing Address



## **Practice Question**

- \* 6-bit addresses, block size K = 4 B, and our cache holds S = 4 blocks.
- A request for address 0x2A results in a cache miss. Which index does this block get loaded into and which 3 other addresses are loaded along with it?

10 1010 --> has offset 2 which means that there are two

No voting for this question



# Place Data in Cache by Hashing Address



# **Tags Differentiate Blocks in Same Index**



# **Checking for a Requested Address**

- CPU sends address request for chunk of data
  - Address and requested data are not the same thing!
    - Analogy: your friend ≠ his or her phone number



Note: t and s sizes will change based on hash function

## Cache Puzzle

### Vote at http://pollev.com/rea

- Based on the following behavior, which of the following block sizes is NOT possible for our cache?
  - Cache starts empty, also known as a cold cache
  - Access (addr: hit/miss) stream: hit: block with data already in \$ miss: data not in \$, pulls block containing data

• (14: miss), (15: hit), (16: miss)

Deams

16 is in a

L>3 16 is in a different block

- A. 4 bytes
- B. 8 bytes√
- **C.** 16 bytes √
- D. 32 bytes
- E. We're lost...



Kmin = 2B, Kmax = 16B

# **Direct-Mapped Cache**



01 01

01 10

01 11

1000

10**01** 

10**10** 

10 11

1100

11 01

11 10

11 11

# **Direct-Mapped Cache Problem**





- What happens if we access the following addresses?
  - **8**, 24, 8, 24, 8, ...?

Thrashing! We have to keep loading and evicting the same data because they map to the same set in the cache!

- Conflict in cache (misses!)
- Rest of cache goes unused
- Solution?

# **Associativity**

- What if we could store data in any place in the cache?
  - More complicated hardware = more power consumed, slower
- So we combine the two ideas:
  - Each address maps to exactly one set

Each set can store block in more than one way

2-way:
4 sets,
2 blocks each
Set
0
1
2



