# CSC 252: Computer Organization Spring 2022: Lecture 22

Instructor: Yuhao Zhu

Department of Computer Science
University of Rochester

#### **Announcements**

- Cache problem set: <a href="https://www.cs.rochester.edu/courses/252/spring2022/handouts.html">https://www.cs.rochester.edu/courses/252/spring2022/handouts.html</a>
  - Not to be turned in. Won't be graded.
- Assignment 4 due April 8.

| SUN | MON | TUE | WED | THU   | FRI              | SAT |
|-----|-----|-----|-----|-------|------------------|-----|
| 27  | 28  | 29  | 30  | 31    | Apr 1            | 2   |
| 3   | 4   | 5   | 6   | Today | 8 (Due)          | 9   |

#### So Far...

- VM basic concepts and operation
- Other critical benefits of VM
- Address translation

#### **Address Translation**



- Translate address from a VA to PA
  - Enforce permissions
  - Fetch from disk

Virtual address (issued by CPU)



Physical address (what will be used to access the DRAM)

- 1) Processor sends virtual address to MMU
- 2-3) MMU fetches PTE from page table in memory
- 4) MMU sends physical address to cache/memory
- 5) Cache/memory sends data word to processor

VA: virtual address, PA: physical address, PTE: page table entry, PTEA: PTE address
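The numbered steps above can be sketched in a few lines of Python. This is only an illustration, not how an MMU is built: the 4 KB page size, the dictionary-based `page_table`, the `(valid, ppn)` entry format, and the `PageFault` exception are all assumptions made for the example.

```python
PAGE_SIZE = 4096      # assumed 4 KB pages
OFFSET_BITS = 12      # log2(PAGE_SIZE)


class PageFault(Exception):
    """Hypothetical signal that the page is not in DRAM (valid bit clear)."""


def translate(va, page_table):
    """Translate a VA to a PA using a flat page table.

    page_table is an illustrative dict mapping VPN -> (valid, ppn).
    """
    # Step 1: the MMU receives the VA and splits it into VPN and page offset.
    vpn = va >> OFFSET_BITS
    offset = va & (PAGE_SIZE - 1)
    # Steps 2-3: fetch the PTE for this VPN from the page table in memory.
    valid, ppn = page_table.get(vpn, (False, None))
    if not valid:
        # The OS would now fetch the page from disk and retry.
        raise PageFault(hex(va))
    # Step 4: the PA keeps the page offset and replaces the VPN with the PPN.
    return (ppn << OFFSET_BITS) | offset


page_table = {0x5: (True, 0x2A), 0x6: (False, None)}
pa = translate(0x5123, page_table)
print(hex(pa))
```

Note that the page offset passes through unchanged; only the page number is translated.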

# **Today**

- Three Virtual Memory Optimizations
  - TLB
  - Virtually-indexed, physically-tagged cache
  - Page the page table (a.k.a., multi-level page table)
- Case-study: Intel Core i7/Linux example

# Speeding up Address Translation


- Problem: Every memory load/store requires two memory accesses: one to fetch the PTE, another for the actual data
  - The PTE access is pure overhead
  - Can we speed it up?
- Page table entries (PTEs) are already cached in the L1 data cache like any other memory data. But:
  - PTEs may be evicted by other data references
  - Even a PTE hit still costs a small L1 delay

























- Solution: Translation Lookaside Buffer (TLB)
  - Think of it as a dedicated cache for page table
  - Small set-associative hardware cache in MMU
  - Contains complete page table entries for a small number of pages


| Tag | Set Index |
|-----|-----------|






A Conventional Data Cache




## Speeding up Translation with a TLB


































A TLB hit eliminates a memory access
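To make the hit/miss behavior concrete, here is a minimal sketch of a set-associative TLB in Python. The class name `TLB`, the 4-set, 2-way geometry, the LRU replacement policy, and the dictionary `page_table` are all assumptions for illustration; real TLBs differ in size and replacement policy.

```python
class TLB:
    """Toy set-associative TLB keyed by VPN (illustrative only)."""

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # Each set holds up to `ways` (tag, ppn) pairs, least recently used first.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def lookup(self, vpn, page_table):
        index = vpn % self.num_sets    # low VPN bits select the set
        tag = vpn // self.num_sets     # remaining VPN bits form the tag
        entries = self.sets[index]
        for i, (t, ppn) in enumerate(entries):
            if t == tag:
                self.hits += 1
                entries.append(entries.pop(i))   # mark as most recently used
                return ppn
        # Miss: pay the extra memory access to read the PTE from the page table.
        self.misses += 1
        ppn = page_table[vpn]
        if len(entries) == self.ways:
            entries.pop(0)             # evict the least recently used entry (assumed policy)
        entries.append((tag, ppn))
        return ppn


page_table = {vpn: vpn + 100 for vpn in range(16)}   # hypothetical VPN -> PPN mapping
tlb = TLB()
for vpn in [1, 1, 2, 1, 5, 1]:
    tlb.lookup(vpn, page_table)
print(tlb.hits, tlb.misses)
```

Because of locality, repeated references to the same few pages hit in the TLB and skip the page table access entirely.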














#### Performance Issue in VM

- Address translation and cache accesses are serialized
  - First translate from VA to PA
  - Then use PA to access cache
  - Slow! Can we speed it up?











Figure: a virtual address consists of a virtual page number (VPN) and a page offset; translation produces the physical address.








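- Assume a 4 KB page size and a 16-byte cache line.
- To start the cache lookup before translation finishes, the set index must come from the page offset, which is the same in the VA and the PA.
- The 12-bit page offset minus 4 block-offset bits leaves 8 bits of set index, so the cache can have at most 256 sets, which limits the cache size
- Increasing the cache size then requires increasing associativity
  - Not ideal because that requires comparing more tags
- Solutions?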
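The size limit can be checked with a short calculation. The sketch below assumes the same numbers as the slide (4 KB pages, 16-byte lines); the function names `vipt_limits` and `cache_size` are made up for this example.

```python
def vipt_limits(page_size=4096, line_size=16):
    """Return (index_bits, num_sets) when the set index must fit in the page offset."""
    offset_bits = page_size.bit_length() - 1   # 12 bits of page offset
    block_bits = line_size.bit_length() - 1    # 4 bits select a byte within the line
    index_bits = offset_bits - block_bits      # what is left for the set index
    return index_bits, 1 << index_bits


def cache_size(num_sets, ways, line_size=16):
    """Total cache capacity in bytes."""
    return num_sets * ways * line_size


index_bits, num_sets = vipt_limits()
sizes = {ways: cache_size(num_sets, ways) for ways in (1, 2, 4, 8)}
print(index_bits, num_sets, sizes)
```

With the set count fixed at 256, each way holds exactly one page (4 KB), so capacity only grows by adding ways.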






- What if we use 9 bits for the set index? More sets now.
- How can this still work?
- The least significant bit of the VPN and of the PPN must be the same
- That is: an even-numbered virtual page must be mapped to an even-numbered physical page, and an odd-numbered virtual page to an odd-numbered physical page
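A small sketch of why the low page-number bit must match: with a 9-bit set index and 4 block-offset bits, the index reaches one bit into the page number. The helper names `set_index` and `mapping_is_safe` and the example page numbers are assumptions for illustration.

```python
OFFSET_BITS = 12   # assumed 4 KB pages
BLOCK_BITS = 4     # assumed 16-byte cache lines


def set_index(addr, index_bits=9):
    """Extract the set index field that sits just above the block offset."""
    return (addr >> BLOCK_BITS) & ((1 << index_bits) - 1)


def mapping_is_safe(vpn, ppn, extra_bits=1):
    """True if the page-number bits that spill into the set index agree."""
    mask = (1 << extra_bits) - 1
    return (vpn & mask) == (ppn & mask)


offset = 0x1A4
va = (6 << OFFSET_BITS) | offset          # virtual page 6 (even)
pa_even = (10 << OFFSET_BITS) | offset    # mapped to physical page 10 (even)
pa_odd = (11 << OFFSET_BITS) | offset     # mapped to physical page 11 (odd)

print(set_index(va) == set_index(pa_even))   # same set: virtual indexing is safe
print(set_index(va) == set_index(pa_odd))    # different set: virtual index would be wrong
```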


# Where Does Page Table Live?


- It needs to be at a specific location where we can find it
  - In main memory, with its start address stored in a special register (PTBR)
- Assume 4 KB pages, 48-bit virtual addresses, and 8-byte PTEs
  - 2<sup>48</sup> / 2<sup>12</sup> = 2<sup>36</sup> PTEs in a page table
  - 2<sup>36</sup> × 8 bytes = 512 GB per page table!
- Problem: Page tables are huge
  - One table per process!
  - Storing them all in main memory wastes space
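The 512 GB figure follows directly from the stated assumptions. A quick check, with the function name `flat_page_table_bytes` invented for this example:

```python
def flat_page_table_bytes(va_bits=48, page_size=4096, pte_size=8):
    """Return (number of PTEs, total bytes) for a single-level page table."""
    offset_bits = page_size.bit_length() - 1     # 12 bits for a 4 KB page
    num_ptes = 1 << (va_bits - offset_bits)      # one PTE per virtual page
    return num_ptes, num_ptes * pte_size


num_ptes, total = flat_page_table_bytes()
print(num_ptes == 2**36, total // 2**30, "GB")
```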

- Observation: Only a small number of pages (working set) are accessed during a certain period of time, due to locality
- Put only the relevant page table entries in main memory
- Idea: Put the page table in virtual memory and swap it just like data






#### Effectively: A 2-Level Page Table

- Level 1 table:
  - Always in physical memory at a known location.
  - Each L1 PTE points to the start address of an L2 page table.
  - Bring that table to memory on-demand.
- Level 2 table:
  - Each PTE points to an actual data page



Figure: example virtual memory layout with allocated pages VP 0 to VP 2047, followed by unallocated pages, and a final allocated page VP 9215.







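- Level 2 page table size (all Level 2 tables together, assuming 32-bit virtual addresses, 4 KB pages, and 4-byte PTEs):
  - $2^{32} / 2^{12} \times 4 = 4 \text{ MB}$
- Level 1 page table size:
  - $(2^{32} / 2^{12} \times 4) / 2^{12} \times 4 = 4 \text{ KB}$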
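The same arithmetic in Python, assuming 32-bit virtual addresses, 4 KB pages, and 4-byte PTEs as the formulas imply. The function names and the choice of three resident Level 2 tables are hypothetical, picked only to show how little memory is needed when most Level 2 tables stay out of DRAM.

```python
def two_level_sizes(va_bits=32, page_size=4096, pte_size=4):
    """Return (total L2 bytes, number of L2 tables, L1 bytes)."""
    num_pages = (1 << va_bits) // page_size     # 2^20 virtual pages
    l2_total = num_pages * pte_size             # one L2 PTE per virtual page
    num_l2_tables = l2_total // page_size       # each L2 table fills one page
    l1_size = num_l2_tables * pte_size          # one L1 PTE per L2 table
    return l2_total, num_l2_tables, l1_size


def resident_bytes(l2_tables_in_memory, page_size=4096, l1_size=4096):
    """Memory used when only some L2 tables are kept in DRAM."""
    return l1_size + l2_tables_in_memory * page_size


l2_total, num_l2_tables, l1_size = two_level_sizes()
print(l2_total // 2**20, "MB,", num_l2_tables, "tables,", l1_size // 2**10, "KB")
print(resident_bytes(3) // 2**10, "KB resident with 3 L2 tables")
```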

### How to Access a 2-Level Page Table?






### Translating with a k-level Page Table
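A k-level walk can be sketched as follows: the VPN is cut into k index fields, and each field selects an entry in the table at that level. This is a simplified model, not the hardware page walker: tables are nested Python dicts, a missing entry stands in for a page fault, and the names `split_vpn` and `walk` and the 2-level, 10-bit example geometry are assumptions.

```python
def split_vpn(va, levels, bits_per_level, offset_bits=12):
    """Split a VA into per-level table indices (top level first) and a page offset."""
    offset = va & ((1 << offset_bits) - 1)
    vpn = va >> offset_bits
    mask = (1 << bits_per_level) - 1
    indices = [
        (vpn >> (bits_per_level * (levels - 1 - i))) & mask
        for i in range(levels)
    ]
    return indices, offset


def walk(va, root, levels, bits_per_level, offset_bits=12):
    """Follow one table per level; the last level yields the PPN.

    Returns the PA, or None if any level has no entry (a page fault).
    """
    indices, offset = split_vpn(va, levels, bits_per_level, offset_bits)
    node = root
    for idx in indices:
        node = node.get(idx)
        if node is None:
            return None
    return (node << offset_bits) | offset


# Hypothetical 2-level table for 32-bit addresses: 10 + 10 index bits, 12 offset bits.
root = {3: {7: 0x55}}
va = (3 << 22) | (7 << 12) | 0xABC
print(hex(walk(va, root, levels=2, bits_per_level=10)))
```

Each level adds one memory access on a TLB miss, which is why the TLB matters even more with multi-level tables.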




### Intel Core i7 Memory System



























### Core i7 Level 4 Page Table Entries



Each entry references a 4 KB child page. Significant fields:

**P:** Child page is present in memory (1) or not (0)

**R/W:** Read-only or read-write access permission for child page

**U/S:** User or supervisor mode access

**WT:** Write-through or write-back cache policy for this page

**A:** Reference bit (set by MMU on reads and writes, cleared by software)

**D:** Dirty bit (set by MMU on writes, cleared by software)

**Page physical base address:** 40 most significant bits of physical page address (forces pages to be 4KB aligned)

**XD:** Disable or enable instruction fetches from this page.
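As an illustration, the fields above can be pulled out of a 64-bit entry with shifts and masks. The bit positions used here (P at bit 0, R/W at bit 1, U/S at bit 2, WT at bit 3, A at bit 5, D at bit 6, base address in bits 12 to 51, XD at bit 63) are not stated on the slide; they are assumed from the standard x86-64 page table entry layout, and the function name `decode_pte` and the sample value are hypothetical.

```python
def decode_pte(pte):
    """Decode selected fields of a 64-bit level 4 PTE (bit positions assumed, see above)."""
    return {
        "P": bool(pte & 1),             # bit 0: child page present
        "R/W": bool((pte >> 1) & 1),    # bit 1: 1 = read-write, 0 = read-only
        "U/S": bool((pte >> 2) & 1),    # bit 2: 1 = user mode access allowed
        "WT": bool((pte >> 3) & 1),     # bit 3: 1 = write-through, 0 = write-back
        "A": bool((pte >> 5) & 1),      # bit 5: referenced
        "D": bool((pte >> 6) & 1),      # bit 6: dirty
        # Bits 12-51: 40-bit physical page number, shifted back to a 4 KB aligned address.
        "base": ((pte >> 12) & ((1 << 40) - 1)) << 12,
        "XD": bool((pte >> 63) & 1),    # bit 63: 1 = instruction fetches disabled
    }


# Hypothetical entry: present, writable, user-accessible, referenced, dirty, no-execute.
pte = (1 << 63) | (0x12345 << 12) | 0b1100111
fields = decode_pte(pte)
print(fields["P"], hex(fields["base"]), fields["XD"])
```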