## Loop Optimization

This lecture works almost entirely by example, using the file [stencil.c](./openmp/omp_c/stencil.c).

### Loop iteration order

Dense 2-d data must be serialized to memory, i.e. stored in a linear order.
The two serialization strategies are named for the dimension (row versus column)
whose elements are contiguous in memory: row-major order stores each row contiguously, column-major order each column.

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Row_and_column_major_order.svg" width=256 title="Row versus column major order." />

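Choosing a memory-efficient loop order has a big impact on performance. Depending on the order, either:
  * successive loop iterations access adjacent elements (sequential access, which uses each cache line fully), or
  * successive loop iterations access strided elements (each access may touch a different cache line)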
  
Our examples do indexing in row-major order `array[x*DIM+y]`
  * placing y in the inner loop leads to sequential access
  * placing x in the inner loop leads to strided access
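
The two loop orders can be sketched as follows. This is a minimal illustration, not code from `stencil.c`: the function names and the value of `DIM` are assumptions made for the example.

```c
#include <stddef.h>

#define DIM 512  /* hypothetical size, chosen only for this sketch */

/* Sequential access: y is the inner loop, so consecutive iterations
   touch adjacent elements array[x*DIM + y], array[x*DIM + y + 1], ... */
double sum_y_inner(const double *array)
{
    double sum = 0.0;
    for (size_t x = 0; x < DIM; x++)
        for (size_t y = 0; y < DIM; y++)
            sum += array[x * DIM + y];
    return sum;
}

/* Strided access: x is the inner loop, so consecutive iterations
   are DIM elements apart in memory. */
double sum_x_inner(const double *array)
{
    double sum = 0.0;
    for (size_t y = 0; y < DIM; y++)
        for (size_t x = 0; x < DIM; x++)
            sum += array[x * DIM + y];
    return sum;
}
```

Both functions visit the same elements and differ only in the order of the memory accesses. Timing them on a large array is a quick way to see the effect; the size of the gap depends on `DIM` and the cache sizes of the machine.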
  
Each order is also associated with the programming languages that use it: C and C++ are row-major, while Fortran and MATLAB are column-major.
  
<img src="https://images.slideplayer.com/23/6540072/slides/slide_3.jpg" width=512 title="from Edgar Gabriel at UH" />

There are many conventions about loop ordering, and they get confusing.  Reason carefully about how the loop variables are enumerated and how the data is laid out.  E.g., images are almost always in Fortran order, so programming them in C looks weird.

#### Manual Optimizations

The file demonstrates successive optimizations to a loop. 
These are mostly considered compiler optimizations in CS, but 
for OpenMP it makes sense to do them by hand.

### Loop Unrolling

Loop unrolling is a time-space tradeoff typically made by compilers:
  * time: eliminates the branch instructions that evaluate the loop condition
  * space: makes a bigger program with more statements

This example unrolls the entire stencil (5x5), eliminating the two inner loops.
  * how many instructions are saved?
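
To keep the listing short, the sketch below unrolls a hypothetical 3x3 summing stencil rather than the 5x5 one in `stencil.c`; the function names, the stencil weights, and `DIM` are all assumptions made for illustration. The idea is the same: the two inner loops over the neighborhood are replaced by the accesses written out explicitly.

```c
#include <stddef.h>

#define DIM 64  /* hypothetical grid size for this sketch */

/* Rolled version: two inner loops walk the 3x3 neighborhood. */
void stencil_rolled(const double *in, double *out)
{
    for (size_t x = 1; x < DIM - 1; x++) {
        for (size_t y = 1; y < DIM - 1; y++) {
            double acc = 0.0;
            for (size_t i = 0; i < 3; i++)
                for (size_t j = 0; j < 3; j++)
                    acc += in[(x + i - 1) * DIM + (y + j - 1)];
            out[x * DIM + y] = acc;
        }
    }
}

/* Unrolled version: the nine accesses are written out, so the inner
   loop counters, comparisons, and branches disappear. */
void stencil_unrolled(const double *in, double *out)
{
    for (size_t x = 1; x < DIM - 1; x++) {
        for (size_t y = 1; y < DIM - 1; y++) {
            const double *r0 = &in[(x - 1) * DIM + y];  /* row above */
            const double *r1 = &in[x * DIM + y];        /* center row */
            const double *r2 = &in[(x + 1) * DIM + y];  /* row below */
            out[x * DIM + y] =
                r0[-1] + r0[0] + r0[1] +
                r1[-1] + r1[0] + r1[1] +
                r2[-1] + r2[0] + r2[1];
        }
    }
}
```

Both versions skip the boundary cells and write the same values to the interior of `out`. Counting the loop-control instructions that the rolled version executes per output element is one way to approach the question above.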

### Loop Fusion

Replace multiple loops with a single one.
* for OpenMP this reduces thread startup costs.
* shown in `fused_stencil_sum_omp()`
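
As a rough sketch of the idea (this is not the body of `fused_stencil_sum_omp()`; the function names and the scale-then-sum workload are hypothetical), the first function below runs two parallel loops back to back, and the second fuses them into one:

```c
#include <stddef.h>

/* Two separate parallel loops: each `parallel for` opens and closes
   its own parallel region, and the array is traversed twice. */
double scale_then_sum(double *a, size_t n, double k)
{
    double sum = 0.0;
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Fused: one parallel region and a single pass over the data. */
double scale_and_sum_fused(double *a, size_t n, double k)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        a[i] *= k;
        sum += a[i];
    }
    return sum;
}
```

Fusion is only valid here because iteration `i` of the second loop depends only on iteration `i` of the first. If the second loop read neighboring elements, as a stencil does, the loops could not be merged this simply. Compile with an OpenMP flag such as `-fopenmp` (GCC) to enable the pragmas; without it the code still runs, serially.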

### Separate dependencies

* Use reductions (shown in `max_el_reduce()`)
  * Note that the compiler actually does a reasonable job of this at `-O3`
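
A minimal sketch of a max reduction follows. It is not the code of `max_el_reduce()`; the function name and signature are assumptions, and the `max` reduction operator requires OpenMP 3.1 or later.

```c
#include <stddef.h>

/* Returns the largest element of a[0..n-1]. Assumes n >= 1.
   Without the reduction clause, every thread would read and write the
   shared max_val, creating a dependency between iterations (and a race).
   With it, each thread keeps a private maximum and OpenMP combines the
   private values when the loop ends. */
double max_element_reduction(const double *a, size_t n)
{
    double max_val = a[0];
    #pragma omp parallel for reduction(max:max_val)
    for (size_t i = 1; i < n; i++) {
        if (a[i] > max_val)
            max_val = a[i];
    }
    return max_val;
}
```

The reduction clause separates the per-iteration work from the single shared result, which is what lets the iterations run independently.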

### Other Optimizations

* Split loops -- rarely effective

