## Serial to Parallel: OpenMP

The simplest and most common parallel pattern is to take a serial program and convert it into a parallel program.

* My code’s not running fast enough:
  * Video: data delays produce jitter, stalls
  * Web: page render time causes user loss, discomfort
  * Batch processing, indexing, analysis not completing in time 
  * High-throughput finance: other models running faster and beating mine to a decision -> lose arbitrage opportunity
* This leads to a natural software engineering process
  * Profile code: find out what’s slow
  * Parallelize slow part(s) only
  * Migrate from serial implementation to parallel implementation
* It's not the best process
  * Serial to parallel doesn’t produce the best designs
  * The best parallel implementation may require a totally different design that cannot be reached by incrementally refactoring the serial implementation
* It's just the easiest
  * Compared to a clean-slate redesign

### What is OpenMP

Parallel programming environment (not language) for:
* Master/slave and/or fork/join execution model
* Loop parallelism patterns
* Thread parallelism in shared-memory architectures

But this doesn’t mean anything yet.

It’s the simplest approach to parallelism:
* Write a serial program in a language that you know (C/C++ or Fortran)
* Add directives to parallelize portions of the code
* Get a parallel program that computes the exact same result (_serial to parallel equivalence_)

Merits:
* Incremental parallelism
* Simple to use
* Portable (for the most part)

Limitations:
* Difficult to manage memory usage
* No distributed capabilities

## Block Parallelism

The fundamental principle in OpenMP is to parallelize a _block_: multiple instances of the block run concurrently, one per thread.

Refer to file `openmp/block.c`.

### The OpenMP Toolchain

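There is no separate toolchain; OpenMP support is built into the compiler.

* Add `-fopenmp` to the compiler command line
  * generates code from the `#pragma omp` directives
  * links the OpenMP runtime library (`libgomp` for gcc, `libomp` for clang)
* `#include <omp.h>` to import the OpenMP runtime symbols into your source code

Example compile command lines:
  * `gcc -fopenmp -O3 program.c` (gcc)
  * `gcc -Xpreprocessor -fopenmp -O3 -lomp program.c` (clang on macOS)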

#### (Aside) Compiler Optimization

All compilers, including gcc/g++, have optimization flags that must be set to get good performance.

Optimization level `-O*`:
  * `-O0` (default) = reduce compilation time and make debugging produce the expected results
  * `-O1` = simple optimizations that don’t take a lot of compile time
  * `-O2` = rewrite loops, follow jump pointers, inline small functions, no time/space tradeoffs
  * `-O3` = vectorize, inline functions, branch prediction

When debugging, use `-O0` so that the compiled code matches the source you are stepping through.

For performance, use `-O3` so that the compiler vectorizes the code for the processor.

### Memory Model and Hardware

OpenMP is a parallel programming environment that:
  * creates _threads_ that run on multiple processor cores

#### Shared Memory

* Coherent read/write to common memory from multiple cores/processors
  * Coherent = repeatable read, read last write, ...
  * Abstraction that there is a single memory for all processors
  * Data sharing by reading/writing to memory
* Hardware that provides this abstraction is called a shared-memory architecture (typically a “single machine”)
  * Even if there are different physical memories
  * Non-uniform memory access (NUMA) architectures are typical today
    * with different latency and throughput from cores to memory locations
    * but the appearance (semantics) of a single, unified memory
    
<img src="https://hpc.llnl.gov/sites/default/files/numa.gif" width="512" title="Non-Uniform Memory Architecture" />


### OpenMP onto Accelerators (no more)

Accelerators are co-processors (not the main CPU)
  * Offload compute-intensive tasks onto specialized hardware
  * Power dense and cheaper when compared with main CPU
  * Often limited programming models or compute capabilities

Xeon Phi **was** the Intel accelerator architecture. It was cancelled in 2020.
  * supposed to be easier to program than GPUs because it ran lightly modified x86 code
  * was programmable via OpenMP with "offload" compiler directives
  * GPUs won. They became easier to program due to frameworks and dominate the ML market.


### Loop Parallelism

Loop parallelism is a form of parallelism and _programming pattern_ that derives parallel tasks from the iterations of loops.

* Most common use and programming pattern for OpenMP
  * add parallel directives to a for loop
  * OpenMP divides the loop’s iterations into _chunks_ assigned to threads
* Merits of loop parallelism
  * __Sequential equivalence__: parallel program is equivalent to a serial program (easy to write and maintain, good tools)
  * __Refactoring__: Incremental conversion of a serial program to a parallel program (easy to test and debug)
* Drawbacks of loop parallelism
  * __Memory utilization__: if loop access patterns don’t match cache hierarchy, programs often require massive restructuring
  
### #pragma omp parallel for

Refer to `openmp/loop.c`.

xeus-cling (the C/C++ notebook environment) is not working with OpenMP. The following cell will not run.

```cpp
#include <iostream>
#include <omp.h>

{
  // Each thread runs its share of the 100 iterations
  #pragma omp parallel for
  for ( int i=0; i<100; i++ )
  {
    std::cout << "OMP Thread# " << omp_get_thread_num() << " loop variable " << i << "\n";
  }
}
```

OpenMP divides the iterations of the loop into contiguous _chunks_ assigned to threads
  * the number of threads is derived from the environment (for example, the `OMP_NUM_THREADS` variable)
  * chunks are (by default) sequential, which leads to _coalesced_ and _sequential_ memory utilization
  
__STOP__ And learn about the <a href="./Lec05_cache_hierarchy.ipynb">Cache Hierarchy</a>.  Then start again.