# CSC 252: Computer Organization Spring 2021: Lecture 25

Instructor: Yuhao Zhu

Department of Computer Science
University of Rochester

#### **Announcements**

| SUN | MON | TUE          | WED          | THU                | FRI | SAT   |
|-----|-----|--------------|--------------|--------------------|-----|-------|
| 25  | 26  | 27<br>Today  | 28<br>A5 Due | 29<br>Last Lecture | 30  | May 1 |
| 2   | 3   | 4            | 5            | 6                  | 7   | 8     |
| 9   | 10  | 11           | 12<br>Final  | 13                 | 14  | 15    |

- Final exam: May 12, 19:15 -- 22:15; online.
- Past exams & problem sets: <a href="https://www.cs.rochester.edu/courses/252/spring2021/handouts.html">https://www.cs.rochester.edu/courses/252/spring2021/handouts.html</a>
- The exam will be electronic, on Gradescope, but we will send you a PDF version so that you can work offline in case
  - 1) you don't have Internet access at the exam time, or
  - 2) you lose Internet access during the exam.
  - In that case, write your answers on scratch paper, take pictures, and send us the pictures.

- Open book test: any sort of paper-based product, e.g., book, notes, magazine, old tests.
- Exams are designed to test your ability to apply what you have learned and not your memory (though a good memory could help).
- Nothing electronic (including laptop, cell phone, calculator, etc) other than the computer you use to take the exam.
- **Nothing biological**, including your roommate, husband, wife, your hamster, another professor, etc.
- "I don't know" gets 15% partial credit. Must erase everything else.

### Today

- From process to threads
  - Basic thread execution model
- Multi-threading programming
- Hardware support of threads
  - Single core
  - Multi-core
  - Cache coherence

# Shared Variables in Threaded C Programs

- One great thing about threads is that they can share the same program variables.
- Question: Which variables in a threaded C program are shared?
- Intuitively, the answer is as simple as "global variables are shared" and "stack variables are private". Not so simple in reality.

Figure: thread memory model.

- Thread 1 (main thread): stack 1; thread 1 context: data registers, condition codes, SP1, PC1
- Thread 2 (peer thread): stack 2; thread 2 context: data registers, condition codes, SP2, PC2
- Shared code and data: run-time heap, read/write data, read-only code/data
- Kernel context: VM structures, descriptor table, brk pointer

```
/* sharing */
#include <stdio.h>
#include <pthread.h>

char **ptr; /* global var */

void *thread(void *vargp)
{
    long myid = (long)vargp;
    static int cnt = 0;

    printf("[%ld]: %s (cnt=%d)\n",
         myid, ptr[myid], ++cnt);
    return NULL;
}

int main()
{
    long i;
    pthread_t tid;
    char *msgs[2] = {
        "Hello from foo",
        "Hello from bar"
    };

    ptr = msgs; /* peer threads reach msgs (on main's stack) through ptr */
    for (i = 0; i < 2; i++)
        pthread_create(&tid,
            NULL,
            thread,
            (void *)i);
    pthread_exit(NULL);
}
```


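| Memory region                            | Variable | Referenced by |
|------------------------------------------|----------|---------------|
| Main thread stack                        | i        | main          |
|                                          | tid      | main          |
|                                          | msgs     | main, p0, p1  |
| Peer thread 0 stack                      | myid     | p0            |
| Peer thread 1 stack                      | myid     | p1            |
| Memory mapped region for shared libraries |          |               |
| Runtime heap (malloc)                    |          |               |
| Uninitialized data (.bss)                | ptr      | main, p0, p1  |
| Initialized data (.data)                 | cnt      | p0, p1        |
| Program text (.text)                     |          |               |

Here main is the main thread, and p0 and p1 are peer threads 0 and 1.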
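The `msgs` row is the surprising one: a stack variable ends up shared. The sketch below is not from the slides; `shared_ptr`, `writer`, and `stack_sharing_demo` are hypothetical names. It shows, under those assumptions, a peer thread writing to a variable that lives on another thread's stack, reached through a global pointer in the same way `ptr` reaches `msgs`.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Global pointer, visible to every thread (plays the role of ptr). */
static long *shared_ptr;

/* Peer thread: writes through the global pointer into its creator's stack. */
static void *writer(void *vargp)
{
    (void)vargp;
    *shared_ptr = 42;
    return NULL;
}

/* Returns the value of a local (stack) variable after a peer thread has run. */
long stack_sharing_demo(void)
{
    long local = 0;              /* lives on the calling thread's stack */
    pthread_t tid;
    int rc;

    shared_ptr = &local;         /* publish the stack address globally */
    rc = pthread_create(&tid, NULL, writer, NULL);
    assert(rc == 0);
    rc = pthread_join(tid, NULL);   /* wait, so local is still alive while written */
    assert(rc == 0);
    return local;                /* modified by the peer thread */
}
```

Calling `stack_sharing_demo()` returns the value written by the peer thread, which is why "stack variables are private" is only a rule of thumb.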

# **Synchronizing Threads**

- Shared variables are handy...
- ...but introduce the possibility of nasty synchronization errors.

#### Improper Synchronization

```
/* badcnt.c */
#include "csapp.h" /* CS:APP wrappers: Pthread_create, Pthread_join */

void *thread(void *vargp); /* Thread routine: runs the counter loop */

/* Global shared variable */
volatile long cnt = 0; /* Counter */

int main(int argc, char **argv)
{
    pthread_t tid1, tid2;
    long niters = 10000;

    Pthread_create(&tid1, NULL,
        thread, &niters);
    Pthread_create(&tid2, NULL,
        thread, &niters);
    Pthread_join(tid1, NULL);
    Pthread_join(tid2, NULL);

    /* Check result */
    if (cnt != (2 * 10000))
        printf("BOOM! cnt=%ld\n", cnt);
    else
        printf("OK cnt=%ld\n", cnt);
    exit(0);
}
```


```
linux> ./badcnt
OK cnt=20000
linux> ./badcnt
BOOM! cnt=13051
```

cnt should be 20,000.

What went wrong?

#### **Assembly Code for Counter Loop**

C code for counter loop in thread i

```
for (i = 0; i < niters; i++)
    cnt++;
```

#### Asm code for thread i

```
    movq  (%rdi), %rcx        # H_i: Head
    testq %rcx, %rcx
    jle   .L2
    movl  $0, %eax
.L3:
    movq  cnt(%rip), %rdx     # L_i: Load cnt
    addq  $1, %rdx            # U_i: Update cnt
    movq  %rdx, cnt(%rip)     # S_i: Store cnt
    addq  $1, %rax            # T_i: Tail
    cmpq  %rcx, %rax
    jne   .L3
.L2:
```

#### **Concurrent Execution**

 Key observation: In general, any sequentially consistent interleaving is possible, but some give an unexpected result!

| i (thread) | instr<sub>i</sub> | %rdx<sub>1</sub> | %rdx<sub>2</sub> | cnt<br>(shared) | Note                      |
|------------|-------------------|------------------|------------------|-----------------|---------------------------|
| 1          | L<sub>1</sub>     | 0                | -                | 0               | Thread 1 critical section |
| 1          | U<sub>1</sub>     | 1                | -                | 0               | Thread 1 critical section |
| 1          | S<sub>1</sub>     | 1                | -                | 1               | Thread 1 critical section |
| 2          | L<sub>2</sub>     | -                | 1                | 1               | Thread 2 critical section |
| 2          | U<sub>2</sub>     | -                | 2                | 1               | Thread 2 critical section |
| 2          | S<sub>2</sub>     | -                | 2                | 2               | Thread 2 critical section |

### **Concurrent Execution (cont)**

 A legal (feasible) but undesired ordering: two threads increment the counter, but the result is 1 instead of 2

| i (thread) | instr <sub>i</sub> | %rdx <sub>1</sub> | %rdx <sub>2</sub> | cnt<br>(shared) |
|------------|--------------------|-------------------|-------------------|-----------------|
| 1          | L <sub>1</sub>     | 0                 | -                 | 0               |
| 1          | U <sub>1</sub>     | 1                 | -                 | 0               |
| 2          | L <sub>2</sub>     | -                 | 0                 | 0               |
| 1          | S <sub>1</sub>     | 1                 | -                 | 1               |
| 2          | U<sub>2</sub>     | -                 | 1                 | 1               |
| 2          | S <sub>2</sub>     | -                 | 1                 | 1               |

```
L_i: movq cnt(%rip), %rdx
U_i: addq $1, %rdx
S_i: movq %rdx, cnt(%rip)
```
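The two orderings can be replayed deterministically without any real threads. The following is a minimal sketch, assuming a simulated machine state (`struct state` and the helper names `L`, `U`, `S` are hypothetical and only mirror the instruction labels above): each thread has its own copy of `%rdx`, and `cnt` is the single shared location.

```c
#include <assert.h>

/* Simulated machine state: one shared counter, one private %rdx per thread. */
struct state {
    long cnt;      /* shared variable */
    long rdx[2];   /* rdx[0] belongs to thread 1, rdx[1] to thread 2 */
};

/* t is the thread index: 0 for thread 1, 1 for thread 2. */
static void L(struct state *s, int t) { assert(t == 0 || t == 1); s->rdx[t] = s->cnt; } /* load   */
static void U(struct state *s, int t) { assert(t == 0 || t == 1); s->rdx[t] += 1; }     /* update */
static void S(struct state *s, int t) { assert(t == 0 || t == 1); s->cnt = s->rdx[t]; } /* store  */

/* Ordering L1 U1 S1 L2 U2 S2: each critical section runs to completion. */
long safe_interleaving(void)
{
    struct state s = {0, {0, 0}};
    L(&s, 0); U(&s, 0); S(&s, 0);
    L(&s, 1); U(&s, 1); S(&s, 1);
    return s.cnt;
}

/* Ordering L1 U1 L2 S1 U2 S2: thread 2 loads cnt before thread 1 stores it. */
long lost_update_interleaving(void)
{
    struct state s = {0, {0, 0}};
    L(&s, 0); U(&s, 0);
    L(&s, 1);               /* reads the stale value 0 */
    S(&s, 0);
    U(&s, 1); S(&s, 1);     /* overwrites thread 1's update */
    return s.cnt;
}
```

`safe_interleaving()` yields 2, while `lost_update_interleaving()` yields 1, matching the final `cnt` column of the two tables.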


#### **Critical Section**

- Code section (a sequence of instructions) where no more than one thread should be executing concurrently.
  - Critical section refers to code, but its intention is to protect data!
- Threads need to have *mutually exclusive* access to critical section. That is, the execution of the critical section must be *atomic*: instructions in a CS either are executed entirely without interruption or not executed at all.

```
    movq  (%rdi), %rcx        # H_i: Head
    testq %rcx, %rcx
    jle   .L2
    movl  $0, %eax
.L3:
    movq  cnt(%rip), %rdx     # L_i: Load cnt    \
    addq  $1, %rdx            # U_i: Update cnt   | critical section wrt cnt
    movq  %rdx, cnt(%rip)     # S_i: Store cnt   /
    addq  $1, %rax            # T_i: Tail
    cmpq  %rcx, %rax
    jne   .L3
.L2:
```

#### **Enforcing Mutual Exclusion**

- We must coordinate/synchronize the execution of the threads
  - i.e., need to guarantee mutually exclusive access for each critical section.
- Classic solution:
  - Semaphores/mutex (Edsger Dijkstra)
- Other approaches
  - Condition variables
  - Monitors (Java)
  - CSC 254/258 discuss these

#### Basic idea:

- Associate each shared variable (or related set of shared variables) with a unique variable, called a **semaphore**, initially 1.
- Every time a thread tries to enter the critical section, it first checks the semaphore value. If it is 1, the thread decrements the semaphore value to 0 (through a **P operation**) and enters the critical section. If it is 0, the thread waits.
- Every time a thread exits the critical section, it increments the semaphore value back to 1 (through a **V operation**) so that another thread is allowed to enter the critical section.
- No more than one thread can be in the critical section at a time.

#### Terminology

- A binary semaphore (i.e., one whose value can only be 0 or 1) is also called a mutex.
- Think of the P operation as "locking" and V as "unlocking".

## **Proper Synchronization**

Define and initialize a mutex for the shared variable cnt:

```
volatile long cnt = 0; /* Counter */
sem_t mutex; /* Semaphore that protects cnt */
Sem_init(&mutex, 0, 1); /* mutex = 1 */
```

Surround critical section with P and V:

```
for (i = 0; i < niters; i++) {
    P(&mutex);
    cnt++;
    V(&mutex);
}
```

```
linux> ./goodcnt 10000
OK cnt=20000
linux> ./goodcnt 10000
OK cnt=20000
linux>
```

Warning: It's orders of magnitude slower than badcnt.c.
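For readers who want to try this without the course wrappers, here is a minimal sketch that expresses `P` and `V` with the POSIX calls `sem_wait` and `sem_post`. It assumes a Linux-like system with unnamed POSIX semaphores; `run_goodcnt` is a hypothetical helper name, not part of goodcnt.c.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

static volatile long cnt;   /* shared counter */
static sem_t mutex;         /* binary semaphore that protects cnt */

/* P and V expressed with the POSIX semaphore calls. */
static void P(sem_t *s) { int rc = sem_wait(s); assert(rc == 0); (void)rc; }
static void V(sem_t *s) { int rc = sem_post(s); assert(rc == 0); (void)rc; }

/* Thread routine: increments cnt niters times, one P/V pair per increment. */
static void *thread(void *vargp)
{
    long i, niters = *(long *)vargp;

    for (i = 0; i < niters; i++) {
        P(&mutex);
        cnt++;              /* critical section */
        V(&mutex);
    }
    return NULL;
}

/* Runs two counting threads and returns the final value of cnt. */
long run_goodcnt(long niters)
{
    pthread_t tid1, tid2;
    int rc;

    cnt = 0;
    rc = sem_init(&mutex, 0, 1);    /* mutex = 1 */
    assert(rc == 0);
    rc = pthread_create(&tid1, NULL, thread, &niters);
    assert(rc == 0);
    rc = pthread_create(&tid2, NULL, thread, &niters);
    assert(rc == 0);
    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    sem_destroy(&mutex);
    return cnt;
}
```

With `niters = 10000`, `run_goodcnt` returns 20000 on every run, because the load, update, and store of `cnt` can no longer interleave.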


- Wouldn't there be a problem when multiple threads access the mutex? How do we ensure exclusive access to the mutex itself?
- Hardware MUST provide mechanisms for atomic accesses to the mutex variable.
  - Checking the mutex value and setting it must form one atomic unit: either both are performed or neither is.
  - On x86: atomic read-modify-write instructions (e.g., `xchg`, or `cmpxchg` with the `lock` prefix) provide test-and-set.

```
for (i = 0; i < niters; i++) {
    P(&mutex);
    cnt++;
    V(&mutex);
}
```

```
function Lock(boolean *lock) {
    while (test_and_set(lock) == 1);
}
```

### Deadlock

- Def: A process/thread is deadlocked if and only if it is waiting for a condition that will never be true
- General to concurrent/parallel programming (threads, processes)
- Typical Scenario
  - Processes 1 and 2 each need two resources (A and B) to proceed
  - Process 1 acquires A, waits for B
  - Process 2 acquires B, waits for A
  - Both will wait forever!

# **Deadlocking With Semaphores**

```
void *count(void *varqp)
    int i:
    int id = (int) varqp;
    for (i = 0; i < NITERS; i++) {
        P(&mutex[id]); P(&mutex[1-id]);
        cnt++:
        V(&mutex[id]); V(&mutex[1-id]);
    return NULL;
int main()
    pthread_t tid[2];
    Sem_init(&mutex[0], 0, 1); /* mutex[0] = 1 */
    Sem_init(&mutex[1], 0, 1); /* mutex[1] = 1 */
    Pthread_create(&tid[0], NULL, count, (void*) 0);
    Pthread_create(&tid[1], NULL, count, (void*) 1);
    Pthread_join(tid[0], NULL);
    Pthread_join(tid[1], NULL);
    printf("cnt=%d\n", cnt);
    exit(0);
```

```
Tid[0]:      Tid[1]:
P(s0);       P(s1);
P(s1);       P(s0);
cnt++;       cnt++;
V(s0);       V(s1);
V(s1);       V(s0);
```

# **Avoiding Deadlock**

Acquire shared resources in the same order

```
Tid[0]:      Tid[1]:
P(s0);       P(s0);
P(s1);       P(s1);
cnt++;       cnt++;
V(s0);       V(s1);
V(s1);       V(s0);
```
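One way to enforce a single acquisition order is to sort the lock indices before locking. The sketch below illustrates this with pthread mutexes rather than the semaphores used on the slide; `lock_both`, `unlock_both`, `run_ordered_count`, and the value of `NITERS` are assumptions made for the example.

```c
#include <assert.h>
#include <pthread.h>

#define NITERS 100000   /* hypothetical iteration count */

static pthread_mutex_t mutex[2] = {
    PTHREAD_MUTEX_INITIALIZER,
    PTHREAD_MUTEX_INITIALIZER
};
static long cnt;        /* protected by holding both mutexes */

/* Always lock the lower-numbered mutex first, whatever order the caller asks for. */
static void lock_both(int a, int b)
{
    int first = (a < b) ? a : b;
    int second = (a < b) ? b : a;

    pthread_mutex_lock(&mutex[first]);
    pthread_mutex_lock(&mutex[second]);
}

static void unlock_both(int a, int b)
{
    pthread_mutex_unlock(&mutex[a]);
    pthread_mutex_unlock(&mutex[b]);   /* release order does not matter */
}

static void *count(void *vargp)
{
    int i, id = (int)(long)vargp;

    for (i = 0; i < NITERS; i++) {
        lock_both(id, 1 - id);   /* thread 1 requests (1, 0) but locks 0 then 1 */
        cnt++;
        unlock_both(id, 1 - id);
    }
    return NULL;
}

/* Runs both threads to completion and returns the final count. */
long run_ordered_count(void)
{
    pthread_t tid[2];
    long i;
    int rc;

    cnt = 0;
    for (i = 0; i < 2; i++) {
        rc = pthread_create(&tid[i], NULL, count, (void *)i);
        assert(rc == 0);
    }
    for (i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);
    return cnt;
}
```

Because both threads lock `mutex[0]` before `mutex[1]`, neither can hold one lock while waiting for the other in the opposite order, so `run_ordered_count()` always terminates.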

 Signal handlers are concurrent with main program and may share the same global data structures.

```
static int x = 5;
static int y;

void handler(int sig)
{
    x = 10;
}

int main(int argc, char **argv)
{
    int pid;

    Signal(SIGCHLD, handler);
    if ((pid = Fork()) == 0) { /* Child */
        Execve("/bin/date", argv, NULL);
    }
    if (x == 5)
        y = x * 2; // You'd expect y == 10
    exit(0);
}
```

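#### What if the following happens:

- The parent process evaluates the test `if (x == 5)`, which succeeds
- Before the assignment runs, the OS delivers SIGCHLD and executes the handler, which sets x to 10
- When control returns to the parent process, it computes x * 2 with the new value, so y == 20!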

# Fixing the Signal Handling Bug

```
static int x = 5;
static int y;

void handler(int sig)
{
    x = 10;
}

int main(int argc, char **argv)
{
    int pid;
    sigset_t mask_all, prev_all;

    sigfillset(&mask_all);
    signal(SIGCHLD, handler);
    if ((pid = Fork()) == 0) { /* Child */
        Execve("/bin/date", argv, NULL);
    }
    Sigprocmask(SIG_BLOCK, &mask_all, &prev_all); /* block all signals */
    if (x == 5)
        y = x * 2; // You'd expect y == 10
    Sigprocmask(SIG_SETMASK, &prev_all, NULL);    /* restore previous mask */
    exit(0);
}
```

 Block all signals before accessing a shared, global data structure.
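The effect of blocking can be observed in a single process. The following sketch makes two substitutions that are assumptions of the example, not part of the slide: it uses `SIGUSR1` sent with `raise` instead of a `SIGCHLD` from a child, and plain POSIX calls instead of the capitalized wrappers. `blocked_section_demo` and `x_after_unblock` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t x = 5;   /* shared with the handler */
static int x_seen_after;              /* value of x once signals are unblocked */

static void handler(int sig)
{
    (void)sig;
    x = 10;
}

/* Returns y computed inside the blocked region. */
int blocked_section_demo(void)
{
    struct sigaction sa;
    sigset_t mask_all, prev_all;
    int y = 0, rc;

    x = 5;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR1, &sa, NULL);
    assert(rc == 0);

    sigfillset(&mask_all);
    sigprocmask(SIG_BLOCK, &mask_all, &prev_all);  /* enter protected region */
    raise(SIGUSR1);        /* the signal arrives now but stays pending */
    if (x == 5)
        y = x * 2;         /* x cannot change between the test and the use */
    sigprocmask(SIG_SETMASK, &prev_all, NULL);     /* pending handler runs here */

    x_seen_after = x;
    return y;
}

/* Value of x observed right after the signal mask was restored. */
int x_after_unblock(void)
{
    return x_seen_after;
}
```

The handler still runs, only later: `y` is computed from the old value of `x`, and `x` becomes 10 once the mask is restored.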

```
/* mutex: a semaphore assumed to be declared and initialized to 1 elsewhere */
static int x = 5;
static int y;

void handler(int sig)
{
    P(&mutex);
    x = 10;
    V(&mutex);
}

int main(int argc, char **argv)
{
    int pid;
    sigset_t mask_all, prev_all;

    signal(SIGCHLD, handler);
    if ((pid = Fork()) == 0) { /* Child */
        Execve("/bin/date", argv, NULL);
    }
    P(&mutex);
    if (x == 5)
        y = x * 2; // You'd expect y == 10
    V(&mutex);
    exit(0);
}
```

- This implementation can deadlock.
- If SIGCHLD arrives while the main program holds the mutex, the signal handler blocks in P waiting for that mutex.
- Key: the signal handler runs in the same process/thread as the main program, and the kernel forces the handler to finish before returning to the main program. The main program therefore never reaches V, and the handler waits forever.

# Summary of Multi-threading Programming

- Concurrent/parallel threads access shared variables
- Need to protect concurrent accesses to guarantee correctness
- Semaphores (e.g., mutex) provide a simple solution
- Can lead to deadlock if not careful
- Take CSC 254/258 to know more about avoiding deadlocks (and parallel programming in general)

## Thread-level Parallelism (TLP)

- Thread-Level Parallelism
  - Splitting a task into independent sub-tasks
  - Each thread is responsible for a sub-task
- Example: Parallel summation of n numbers
  - Partition values 0, ..., n-1 into t ranges, ⌊n/t⌋ values in each range
  - Each of t threads processes one range (sub-task)
  - Sum all sub-sums in the end
- Question: if you parallelize your work N ways, do you always get an N-times speedup?
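The parallel summation example above can be sketched with Pthreads roughly as follows. This is a minimal illustration, not the course's reference code: `parallel_sum`, `sum_range`, `struct range`, and `MAX_THREADS` are names made up here, and the last thread is assumed to pick up the leftover values when t does not divide n.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64

/* Work description for one thread: sum the values in [lo, hi). */
struct range {
    long lo, hi;
    long sum;       /* private sub-sum, written only by the owning thread */
};

static void *sum_range(void *arg)
{
    struct range *r = arg;
    long s = 0;
    for (long v = r->lo; v < r->hi; v++)
        s += v;
    r->sum = s;
    return NULL;
}

/* Sum 0, ..., n-1 using t threads (1 <= t <= MAX_THREADS, n >= 0). */
long parallel_sum(long n, int t)
{
    pthread_t tid[MAX_THREADS];
    struct range r[MAX_THREADS];
    long chunk, total = 0;

    if (t < 1 || t > MAX_THREADS)
        return -1;
    chunk = n / t;                      /* floor(n/t) values per range */

    for (int i = 0; i < t; i++) {
        r[i].lo = i * chunk;
        /* the last thread also takes the n mod t leftover values */
        r[i].hi = (i == t - 1) ? n : (i + 1) * chunk;
        r[i].sum = 0;
        pthread_create(&tid[i], NULL, sum_range, &r[i]);
    }
    for (int i = 0; i < t; i++) {
        pthread_join(tid[i], NULL);
        total += r[i].sum;              /* sequential step: combine the sub-sums */
    }
    return total;
}
```

Each thread writes only its own `sum` slot, so no lock is needed while the ranges are processed; the final combining loop is the part that stays sequential.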



- Maximum speedup limited by the sequential portion
- Main cause: Non-parallelizable operations on data
- Parallel portion is usually not perfectly parallel as well
  - e.g., Synchronization overhead

```
Each thread:
loop {
    Compute
    P(A)
    Update shared data
    V(A)
}
```
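The per-thread loop above might look like this in C. It is a sketch under assumptions: a Pthreads mutex plays the role of semaphore `A` (lock for `P(A)`, unlock for `V(A)`), the "Compute" step is reduced to a trivial placeholder, and `run_workers`, `NTHREADS`, and `NITERS` are illustrative names.

```c
#include <pthread.h>
#include <stddef.h>

#define NTHREADS 4
#define NITERS   100000

static long shared_total = 0;
static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;  /* stands in for semaphore A */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        long local = 1;               /* "Compute": private work, runs in parallel */
        pthread_mutex_lock(&A);       /* P(A) */
        shared_total += local;        /* "Update shared data": one thread at a time */
        pthread_mutex_unlock(&A);     /* V(A) */
    }
    return NULL;
}

/* Runs NTHREADS workers and returns the final shared total. */
long run_workers(void)
{
    pthread_t tid[NTHREADS];

    shared_total = 0;
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return shared_total;
}
```

The result is correct, but every update passes through the lock one thread at a time, so the time spent between `P(A)` and `V(A)` behaves like sequential work no matter how many threads run.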








### Amdahl's Law

- Gene Amdahl (1922-2015). Giant in computer architecture
- Captures the difficulty of using parallelism to speed things up
- Amdahl's Law
  - f: Parallelizable fraction of a program
  - N: Number of processors (i.e., maximal achievable speedup)

$$\text{Speedup} = \frac{1}{(1 - f) + \frac{f}{N}}$$

- Completely parallelizable (f = 1): Speedup = N
- Completely sequential (f = 0): Speedup = 1
- Mostly parallelizable (f = 0.9, N = 1000): Speedup ≈ 9.9
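As a quick way to check the three cases above, Amdahl's Law can be written as a small C function. The function names `amdahl_speedup` and `amdahl_limit` are illustrative; the second one evaluates the limit as N grows without bound, which is 1 / (1 - f).

```c
/* Speedup on n processors when a fraction f of the program is parallelizable. */
double amdahl_speedup(double f, double n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

/* Upper bound on the speedup as n grows without bound (requires f < 1). */
double amdahl_limit(double f)
{
    return 1.0 / (1.0 - f);
}
```

For f = 0.9 the limit is 10, which is why 1000 processors yield a speedup of only about 9.9: the 10% sequential portion dominates.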


### Can A Single Core Support Multi-threading?

Need to multiplex between different threads (time slicing)



- Can single-core multi-threading provide any performance gains?
- If Thread A has a cache miss and the pipeline gets stalled, switch to Thread C. The core keeps doing useful work during the miss, which improves the overall performance.
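The benefit of switching on a miss can be seen in a toy model. The sketch below is an assumption-laden illustration, not a description of real hardware: every thread repeats a fixed number of bursts, each burst is `compute` cycles on the core followed by a `stall`-cycle miss, and switching threads is assumed to cost nothing. The name `switch_on_miss_cycles` is made up for this example.

```c
#define MAX_THREADS 8

/*
 * Toy model (assumption): each thread runs `bursts` times
 *   `compute` cycles on the core, then waits `stall` cycles for a miss.
 * Whenever the running thread misses, the core switches for free to the
 * lowest-numbered thread that is ready. Returns the cycle at which every
 * thread has finished its last miss, or -1 for an invalid thread count.
 */
long switch_on_miss_cycles(int nthreads, int bursts, long compute, long stall)
{
    long ready_at[MAX_THREADS] = {0};  /* cycle at which each thread can run again */
    int  left[MAX_THREADS];            /* bursts each thread still has to run */
    long now = 0, finish = 0;
    int  remaining;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    for (int i = 0; i < nthreads; i++)
        left[i] = bursts;
    remaining = nthreads * bursts;

    while (remaining > 0) {
        int  pick = -1;
        long earliest = -1;

        for (int i = 0; i < nthreads; i++) {
            if (left[i] == 0)
                continue;
            if (ready_at[i] <= now) { pick = i; break; }
            if (earliest < 0 || ready_at[i] < earliest)
                earliest = ready_at[i];
        }
        if (pick < 0) {                /* every thread is stalled: the core idles */
            now = earliest;
            continue;
        }
        now += compute;                /* run one burst of the chosen thread */
        ready_at[pick] = now + stall;  /* then it misses and must wait */
        left[pick]--;
        remaining--;
        if (ready_at[pick] > finish)
            finish = ready_at[pick];
    }
    return finish;
}
```

With 3 bursts of 10 compute cycles and 10 stall cycles, one thread alone takes 60 cycles, so running two such threads back to back would take 120. Interleaving the two threads on a miss finishes in 70 cycles in this model, because each thread's stall is hidden behind the other thread's compute.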



### When to Switch?

- Coarse grained
  - Event based, e.g., switch on L3 cache miss
  - Quantum based (every thousands of cycles)
- Fine grained
  - Cycle by cycle
  - Thornton, "CDC 6600: Design of a Computer," 1970.
  - Burton Smith, "A pipelined, shared resource MIMD computer," ICPP 1978. The HEP machine. A seminal paper that shows that using multithreading can avoid branch prediction.
- Either way, need to save/restore thread context upon switching.

- One big bonus of fine-grained switching: no need for a branch predictor!
  - Context switching overhead would be very high! Use separate hardware contexts for each thread (e.g., separate register files).
  - GPUs do this (among other things). More later.

[Figure: pipeline diagram of the stalling approach]

[Figure: pipeline diagram of the branch prediction approach]

#### The fine-grained multi-threading approach

```
Cycle                              1  2  3  4  5  6  7  8  9
xorq %rax, %rax                    F  D  E  M  W
Inst x from TID=1                     F  D  E  M  W
Inst y from TID=2                        F  D  E  M  W
jne L1           # Not taken                F  D  E  M  W
Inst x+1 from TID=1                            F  D  E  M  W
Inst y+1 from TID=2                               F  D  E  M
irmovq $1, %rax  # Fall Through                      F  D  E
Inst x+2 from TID=1                                     F  D
Inst y+2 from TID=2                                        F
```

## Multi-threading Illustration (so far...)



### Modern Single-Core: Superscalar

- Typically has multiple functional units, which allows decoding and issuing multiple instructions at the same time
- Called "Superscalar"



# From Scalar to Multi-Scalar Multi-threading



[Figure: functional-unit usage under multi-threading; legend: Context Switch, Thread 2]

## Simultaneous Multi-Threading (SMT)

- Intel calls it Hyper-Threading.
- Replicate enough hardware structures to process K instruction streams, i.e., threads: K copies of all registers, with the functional units shared.
- SMT = Superscalar + Multi-threading




### Conventional Multi-threading vs. Hyper-threading



[Figure: conventional multi-threading vs. SMT; legend: Context Switch, Thread 2]

- Multiple threads actually execute in parallel (even with one single core)
- No/little context switch overhead


### Multi-Threading on a Multi-core Processor



- Each core can run multiple threads, mostly through coarse-grained switching.
- Fine-grained switching on conventional multicore CPU is too costly.

### Combine Multi-core with SMT

- Common for laptop/desktop/server machines. E.g., 2 physical cores, each core has 2 hyper-threads => 4 virtual cores.
- Not for mobile processors (hyper-threading is costly to implement)



## Asymmetric Multiprocessor (AMP)

Offers a large performance-energy trade-off space

[Figure: energy consumption vs. performance]

## Asymmetric Chip-Multiprocessor (ACMP)

Already used in commodity devices (e.g., Samsung Galaxy S6, iPhone 7)




### The Issue

- Assume that we have a multi-core processor. Thread 0 runs on Core 0, and Thread 1 runs on Core 1.
- Threads share variables: e.g., Thread 0 writes to an address, followed by Thread 1 reading.
- Each read should receive the value last written by anyone
- Basic question: If multiple cores access the same data, how do they ensure they all see a consistent state?



- Without caches, the issue is (theoretically) solvable by using a mutex.
- ...because there is only one copy of x in the entire system. Accesses to x in memory are serialized by the mutex.

- **Issue**: there are multiple copies of the same data in the system, and they could have different values at the same time.
- **Idea**: ensure multiple copies have same value, i.e., *coherent*
- **How?** Two options:
  - Update: push new value to all copies (in other caches)
  - Invalidate: invalidate other copies (in other caches)
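The difference between the two options can be sketched with a toy model of one memory word and one private cached copy per core. Everything here is an assumption made for illustration: the caches are write-through, there are only two cores, and names such as `struct toy_system`, `core_write`, and `read_after_write` are hypothetical; real protocols track more state per cache block.

```c
#include <stdbool.h>

#define NCORES 2

enum policy { NO_COHERENCE, WRITE_UPDATE, WRITE_INVALIDATE };

/* One memory word plus one private cached copy per core (toy model). */
struct toy_system {
    int  memory;
    int  cached[NCORES];
    bool valid[NCORES];
};

static void sys_init(struct toy_system *s, int value)
{
    s->memory = value;
    for (int c = 0; c < NCORES; c++)
        s->valid[c] = false;
}

static int core_read(struct toy_system *s, int core)
{
    if (!s->valid[core]) {               /* miss: fetch the word from memory */
        s->cached[core] = s->memory;
        s->valid[core] = true;
    }
    return s->cached[core];              /* hit: return the private copy */
}

static void core_write(struct toy_system *s, int core, int value, enum policy p)
{
    s->cached[core] = value;
    s->valid[core] = true;
    s->memory = value;                   /* write-through (assumption) */
    for (int c = 0; c < NCORES; c++) {
        if (c == core)
            continue;
        if (p == WRITE_UPDATE && s->valid[c])
            s->cached[c] = value;        /* update: push the new value to the copy */
        else if (p == WRITE_INVALIDATE)
            s->valid[c] = false;         /* invalidate: the next read will miss */
    }
}

/* Both cores cache x = 5, core 0 writes 10, then core 1 reads x. */
int read_after_write(enum policy p)
{
    struct toy_system s;

    sys_init(&s, 5);
    core_read(&s, 0);
    core_read(&s, 1);
    core_write(&s, 0, 10, p);
    return core_read(&s, 1);
}
```

Without any coherence action core 1 keeps returning its stale copy (5). With update it sees 10 immediately in its own cache, and with invalidate it misses and re-fetches 10 from memory.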

# Readings: Cache Coherence

#### Most helpful

- Culler and Singh, Parallel Computer Architecture
  - Chapter 5.1 (pp. 269-283), Chapter 5.3 (pp. 291-305)
- Patterson & Hennessy, Computer Organization and Design
  - Chapter 5.8 (pp. 534-538 in 4<sup>th</sup> and 4<sup>th</sup> revised eds.)
- Papamarcos and Patel, "A low-overhead coherence solution for multiprocessors with private cache memories," ISCA 1984.

#### Also very useful

- Censier and Feautrier, "A new solution to coherence problems in multicache systems," IEEE Trans. Computers, 1978.
- Goodman, "Using cache memory to reduce processor-memory traffic," ISCA 1983.
- Laudon and Lenoski, "The SGI Origin: a ccNUMA highly scalable server," ISCA 1997.
- Martin et al, "Token coherence: decoupling performance and correctness," ISCA 2003.
- Baer and Wang, "On the inclusion properties for multi-level cache hierarchies," ISCA 1988.

- Hardware-guaranteed cache coherence is complex to implement.
- Can the programmers ensure cache coherence themselves?
- Key: ISA must provide cache flush/invalidate instructions
  - FLUSH-LOCAL A: Flushes/invalidates the cache block containing address A from a processor's local cache.
  - FLUSH-GLOBAL A: Flushes/invalidates the cache block containing address A from all other processors' caches.
  - FLUSH-CACHE X: Flushes/invalidates all blocks in cache X.
- Classic example: TLB
  - Hardware does not guarantee that TLBs of different cores are coherent
  - ISA provides instructions for the OS to flush PTEs
  - Called "TLB shootdown"