# Multiprocessors and Caching

CS/COE 1541 (Fall 2020) Wonsun Ahn



#### Two ways to use multiple processors

- Distributed (Memory) System
  - Processors do not share memory (and by extension data)
  - o Processors exchange data through network messages
  - Programming standards:
    - Message Passing Interface (MPI) C/C++ API for exchanging messages
  - Data exchange protocols: TCP/IP, UDP/IP, Ethernet, ...
- Shared Memory System (a.k.a. Multiprocessor System)
  - Processors share memory (and by extension data)
  - Programming standards:
    - Pthreads (POSIX threads), Java threads APIs for threading
    - OpenMP Compiler #pragma directives for parallelization
  - o Cache coherence protocol: protocol for exchanging data among caches
  - → Just like Ethernet, caches are part of a larger network of caches



#### Shared Data Review

- What bad thing can happen when you have shared data?
- Dataraces!
  - You learned it in CS/COE 449. Yes, you did.



```
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) { shared++; }
  return NULL;
int main() {
 pthread t t;
  // Child thread starts running add
 pthread create (&t, NULL, add, NULL);
  // Main thread starts running add
  add (NULL);
  // Wait until child thread completes
  pthread join(t, NULL);
  printf("shared=%d\n", shared);
  return 0;
```

```
bash-4.2$ ./datarace
shared=1085894
bash-4.2$ ./datarace
shared=1101173
bash-4.2$ ./datarace
shared=1065494
```

- Q) What do you expect from running this? Maybe shared=2000000?
- A) Nondeterministic result! Due to datarace on shared.































- Why did this occur in the first place?
- Because data was replicated to CPU registers and each worked on its own copy!
  - When two threads do shared++; initially shared = 1





```
pthread mutex t lock;
int shared = 0;
void *add(void *unused) {
  for (int i=0; i < 1000000; i++) {
      pthread mutex lock(&lock);
      shared++;
      pthread mutex unlock(&lock);
  return NULL;
int main() {
```

```
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
bash-4.2$ ./datarace
shared=2000000
```

- Data race is fixed! Now shared is always 2000000.
- Problem solved? No! CPU registers is not the only place replication happens!



• What happens if caches sit in between processors and memory?



• Let's say CPU 0 first fetches shared for incrementing





• Then CPU 0 increments shared 100 times to 100





Pittsburgh

• Then CPU 2 gets hold of the mutex and fetches shared from L3



• Then CPU 2 increments shared 100 times to 100 again





• Clearly this is wrong. L1 caches of CPU 0 and CPU 2 are incoherent.





#### Cache Incoherence: Problem with Write-back Private

- This problem does not occur with a shared cache.
  - All processors share and work on a single copy of data.



- The problem exists only with private caches.
- This problem does not occur with write-through private caches.
  - o The private copy is always kept **consistent** with lower memory.
- The problem exists for mostly write-back private caches.
  - Private copy is at times inconsistent with lower memory.
  - o **Incoherence** occurs when private copies differ from each other.
    - → Means processors return different values for same location!



# Cache Coherence



#### Cache Coherence

- Cache coherence (loosely defined):
  - All processors of system should see the same view of memory
  - o Copies of values cached by processors should adhere to this rule
- Each ISA has a different definition of what that "view" means
  - Memory consistency model: definition of what that "view" is
  - That's a discussion for another day...
- All models agree on one thing:
  - That a change in value should reflect on all copies (eventually)



# Implementing Cache Coherence

- How do you guarantee all processors have the same view?
- Cache coherence protocol: A protocol, or set of rules, that all caches must follow to ensure coherence between caches
  - MSI (Modified-Shared-Invalid)
  - MESI (Modified-Exclusive-Shared-Invalid)
  - ... often named after the states in cache controller FSM
- Three states of **MSI** protocol (maintained for each block):
  - Modified: Dirty. Only this cache has copy.
  - Shared: Clean. Other caches may have copy.
  - Invalid: Block contains no data.



# MSI Snoopy Cache Coherence Protocol



- Each processor monitors (snoops) the activity on the bus
  - o In much the same way as how nodes snoop the Ethernet
- Cache state changes in response to both:
  - Read / writes from the local processor
  - Read misses / write misses from remote processors it snoops



#### MSI: Example

- All bus activity is show in blue. Cache changes block state in response.
- Bus activity is generated only for cache misses, or for invalidates
- Other caches must maintain coherence by monitoring that bus activity

| Event In P1's cache In P2's cache |                                          |                                          |
|-----------------------------------|------------------------------------------|------------------------------------------|
|                                   | L = invalid                              | L = invalid                              |
| P1 writes 10 to A                 |                                          | Read Exclusive A (from write in P1)      |
| (write miss)                      | $L \leftarrow A = 10 \text{ (modified)}$ | L = invalid                              |
| P1 reads A                        |                                          |                                          |
| (read hit)                        | $L \leftarrow A = 10 \text{ (modified)}$ | L = invalid                              |
| P2 reads A                        | Read A (from read in P2)                 |                                          |
| (read miss)                       | L ← A = 10 (shared)                      | L ← A = 10 (shared)                      |
| P2 writes 20 to A                 | Invalidate A (from write in F            | 2)                                       |
| (write hit)                       | L = invalid                              | $L \leftarrow A = 20 \text{ (modified)}$ |
| P2 writes 40 to A                 |                                          |                                          |
| (write hit)                       | L = invalid                              | L ← A = 40 (modified)                    |
| P1 write 50 to A                  |                                          | Read Exclusive A (from write in P1)      |
| <sub>ity of</sub> (write miss)    | $L \leftarrow A = 50 \text{ (modified)}$ | L = invalid                              |

#### Cache Controller FSM for MSI Protocol

Processor activity in red, Bus activity in blue



# TLB Coherence



#### How about TLBs?

- We said TLBs are also a type of cache that caches PTEs.
  - So what happens if a processor changes a PTE?
  - How does that change get propagated to other processor TLBs?
- Unfortunately, there is no hardware coherence for TLBs. 🕾
- That means software (the OS) must handle the coherence
  - Which is of course much much slower



#### TLB shootdown

- In order to update a PTE (page table entry)
  - Initiator OS must first flush its own TLB
  - Send IPIs (Inter-processor interrupts) to other processors
    - To flush the TLBs for all other processors too
  - Source of significant performance overhead

TLB Flushes in Linux and FreeBSD



\* Courtesy of Nadav Amit et al. at VMWare

