Chapter 7: Shared Memory Parallel Programming with OpenMP

<b>Some basics</b>
* Shared memory system => All the cores can access all the memory locations
* OpenMP ("Open Multi-Processing") is a directive-based API for shared-memory parallel programming
* An instance of a program running on a processor is called a thread (vs. a process in MPI)
* By analogy, in Python the multiprocessing module starts separate processes (each with its own memory space), while the threading module starts threads that share the memory space of a single process

* The block of code that follows #pragma omp parallel (the parallel directive) is run by every thread in the team.
* The num_threads clause specifies the number of threads.
* The team of threads, the master (original) thread plus thread_count-1 slave threads, executes the block. After all the threads have finished, the slave threads are terminated and the master thread continues.
* omp_get_thread_num => rank or id of a thread
* omp_get_num_threads => total number of threads
* Compile on stromboli: <b> gcc -g -Wall -fopenmp -o omp_hello omp_hello.c </b> 
* Submit on stromboli: submit_script.sh
```bash
#!/bin/bash

./omp_hello 4 > output

```
* The parallel directive is simply ignored if the compiler does not support OpenMP. However, including omp.h would then cause an error, so we can guard the include:

```c
#ifdef _OPENMP
#include <omp.h>
#endif
```
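As a rough illustration of the points above, here is a minimal sketch of what the core of omp_hello.c might look like. The function name `hello_team` and the fallback values in the `#else` branch are assumptions made for this example; they are not taken from the original program.

```c
#include <assert.h>
#include <stdio.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: starts a team of (up to) thread_count threads,
   lets each one print a greeting, and returns the actual team size. */
int hello_team(int thread_count) {
    int team_size = 1;

    assert(thread_count >= 1);

#pragma omp parallel num_threads(thread_count)
    {
#ifdef _OPENMP
        int my_rank = omp_get_thread_num();   /* rank (id) of this thread */
        int nthreads = omp_get_num_threads(); /* total number of threads  */
#else
        /* Assumed fallback when compiled without OpenMP: one thread. */
        int my_rank = 0;
        int nthreads = 1;
#endif
        printf("Hello from thread %d of %d\n", my_rank, nthreads);

        /* Only the master thread writes the shared variable, so no race. */
        if (my_rank == 0)
            team_size = nthreads;
    }
    /* The other threads have ended here; the master continues alone. */
    return team_size;
}
```

A `main` function could parse the thread count from the command line (for example with `strtol(argv[1], NULL, 10)`) and pass it to `hello_team`, which would match the `./omp_hello 4` invocation in the submit script. The order of the printed lines is not deterministic.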

7.1 False Sharing and Padding

* Symmetric Multi-Processor (SMP): a shared address space with equal time access for each processor; OS treats every processor the same way.
* Non-Uniform Memory Access multiprocessor (NUMA): Different memory regions have different access costs. (near and far memory)

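<u> In practice, any multiprocessor CPU with a cache behaves like a NUMA system, because access time depends on whether the data already sits in a nearby cache.</u>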

* OpenMP is a multi-threading, shared address model; threads communicate by sharing variables
* The OS scheduler decides which thread runs when, interleaving the threads for fairness
* To avoid race conditions, synchronization can be used but it is expensive.
* Change how data is accessed to minimize the need for synchronization
* <i> An SPMD (single program, multiple data) program can be a good solution. </i>

<b><i> If independent data elements happen to sit on the same cache line, each update causes the cache line to slosh back and forth between threads; this is called false sharing. <br/>
Pad arrays so that the elements you use are on distinct cache lines. <br/>
Padding arrays requires deep knowledge of the cache architecture; systems have different cache-line sizes, so software performance may fall apart when the code moves to another machine.
</i></b>
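The padding idea can be sketched as follows, in SPMD style. This is only an illustration: the function name `padded_sum`, the limit `MAX_THREADS`, and the value `PAD = 8` (which assumes 64-byte cache lines and 8-byte doubles) are assumptions for this example and would need to be adapted to the actual machine.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#endif

#define MAX_THREADS 8
#define PAD 8 /* assumption: 64-byte cache line / 8-byte double */

/* Hypothetical example: returns 1 + 2 + ... + n. Each thread adds a
   cyclic share of the terms into its own padded slot. */
double padded_sum(long n, int thread_count) {
    /* Only column 0 of each row is used; the other PAD-1 doubles keep
       neighbouring threads' slots 64 bytes apart. */
    double partial[MAX_THREADS][PAD] = {{0.0}};
    int team_size = 1;

    assert(thread_count >= 1 && thread_count <= MAX_THREADS);

#pragma omp parallel num_threads(thread_count)
    {
#ifdef _OPENMP
        int my_rank = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
#else
        int my_rank = 0; /* assumed fallback without OpenMP */
        int nthreads = 1;
#endif
        if (my_rank == 0)
            team_size = nthreads;

        /* SPMD: same code in every thread, the rank selects the data. */
        for (long i = my_rank + 1; i <= n; i += nthreads)
            partial[my_rank][0] += (double)i;
    }

    /* Back in the master thread: combine the per-thread slots. */
    double total = 0.0;
    for (int t = 0; t < team_size; t++)
        total += partial[t][0];
    return total;
}
```

Without the second dimension, `partial[0]`, `partial[1]`, ... would be adjacent doubles, likely sharing a cache line, and every update would invalidate that line for the other threads. Accumulating into a local variable inside each thread and writing the slot once would avoid the problem without any padding.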

7.3 Scope of variables and the reduction clause
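* Shared scope (global): the variable can be accessed by all the threads in the team
* Private scope (local): each thread has its own copy of the variable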
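A small sketch of the two scopes, using the `shared` and `private` clauses explicitly. The function `fill_squares` and its parameters are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical example: writes out[i] = i * i for 0 <= i < n. */
void fill_squares(int *out, int n, int thread_count) {
    int i; /* declared outside the block, so shared unless made private */

    assert(out != NULL && n >= 0 && thread_count >= 1);

    /* out and n are shared: every thread sees the same array and length.
       i is private: each thread gets its own loop counter. */
#pragma omp parallel num_threads(thread_count) shared(out, n) private(i)
    {
        /* Variables declared inside the block are private automatically. */
#ifdef _OPENMP
        int my_rank = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
#else
        int my_rank = 0; /* assumed fallback without OpenMP */
        int nthreads = 1;
#endif
        /* Each thread writes a disjoint set of elements, so the shared
           array needs no synchronization here. */
        for (i = my_rank; i < n; i += nthreads)
            out[i] = i * i;
    }
}
```

If `i` were left shared, all threads would update the same counter and the loop would race; making it private (or declaring it inside the block) avoids that.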

Three ways we can make an operation thread safe

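* Critical: only one thread at a time may execute the critical section while the others wait, so the operation can be safely done and stored.
* Atomic: A special case that applies only to updates of the form x \<op\>= \<expression\>, x++, ++x, --x, x--. It can be faster than an ordinary critical section because it is designed to exploit special hardware instructions.
* Reduction: OpenMP creates a private copy of the variable for each thread and, at the end of the parallel block, combines the private copies into the shared variable using the specified operator. The operator must be one of the predefined ones (for example +, *, &, |, ^, &&, ||); overloading is not available.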

```c
// just before the operation
#pragma omp critical
#pragma omp atomic

// before the code block
#pragma omp parallel num_threads(thread_count) reduction(+: result)
```
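To compare the three approaches side by side, here is a hedged sketch that computes the same sum with each of them. The function names (`my_share`, `sum_critical`, `sum_atomic`, `sum_reduction`) are hypothetical and exist only for this illustration.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: the calling thread's cyclic share of 1 + ... + n. */
static long my_share(long n) {
#ifdef _OPENMP
    int my_rank = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
#else
    int my_rank = 0; /* assumed fallback without OpenMP */
    int nthreads = 1;
#endif
    long my_sum = 0;
    for (long i = my_rank + 1; i <= n; i += nthreads)
        my_sum += i;
    return my_sum;
}

long sum_critical(long n, int thread_count) {
    long result = 0;
    assert(thread_count >= 1);
#pragma omp parallel num_threads(thread_count)
    {
        long my_sum = my_share(n);
        /* One thread at a time executes the next statement. */
#pragma omp critical
        result += my_sum;
    }
    return result;
}

long sum_atomic(long n, int thread_count) {
    long result = 0;
    assert(thread_count >= 1);
#pragma omp parallel num_threads(thread_count)
    {
        long my_sum = my_share(n);
        /* The update has the form x += expression, so atomic applies. */
#pragma omp atomic
        result += my_sum;
    }
    return result;
}

long sum_reduction(long n, int thread_count) {
    long result = 0;
    assert(thread_count >= 1);
    /* Each thread adds into a private copy of result (initialized to 0);
       the copies are combined with + at the end of the block. */
#pragma omp parallel num_threads(thread_count) reduction(+: result)
    result += my_share(n);
    return result;
}
```

All three functions should return the same value; for example, with n = 100 each returns 5050 regardless of the number of threads. Note that the partial sum is computed before the critical or atomic update, so the threads only serialize on the single addition to `result`.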

Difference between MPI and OpenMP
<table>
<tr>
    <td>MPI</td>
    <td>OpenMP</td>
</tr>
<tr>
    <td>Distributed memory</td>
    <td>Shared memory</td>
</tr>
</table>