## OpenMP: Compilers and Toolchain

Recall that an OpenMP program consists of:
1. Compiler directives that generate parallel code
2. Library calls to configure the runtime
  * You need to `#include <omp.h>` to get the symbols/functions

* For gcc/g++, OpenMP is built in.  Use the `-fopenmp` option.
```sh
gcc -fopenmp program.c
```
* For clang, OpenMP is only partly built in.  On macOS, the flag must be passed through to the preprocessor and the OpenMP runtime library (`libomp`) must be linked explicitly.
```sh
clang -Xpreprocessor -fopenmp -lomp program.c 
```

### Compiler Optimization

The goal is to develop __fast__ and parallel code.  OpenMP makes it parallel; compilers make it fast.  They apply optimizations such as:
  * loop unrolling
  * function inlining
  * static branch prediction (laying out code around the likely path)
  * vectorization

These can affect performance by __large constant factors__.

The degree of optimization is specified at compile time.  It is typical to debug unoptimized code and deploy optimized code.  In gcc/g++:
  * `-O0` (default) = reduce compilation time and make debugging produce the expected results
  * `-O1` = simple optimizations that don't take a lot of compile time
  * `-O2` = rewrite loops, thread jumps, inline small functions; nearly all optimizations that avoid time/space tradeoffs
  * `-O3` = everything in `-O2` plus more aggressive vectorization, function inlining, and loop transformations
  * `-g` = include debugging symbols (can be combined with any optimization level)
  


### Things we learned in `stencil.c`

1. Compiler optimization matters
2. Different compilers produce different code
  * clang rewrote the loop, gcc didn't
3. Iteration order matters
  * if your compiler doesn't rewrite the loop
  * many things will prevent your compiler from rewriting (more on this later)
4. Caching affects performance (particularly initial runs)
  * Measure on a warm cache.
  * The initial run will often load data into processor cache from memory.