## Loop Optimization in OpenMP

This lecture works almost entirely by example, using the file [stencil.c](./openmp/omp_c/stencil.c).

### Loop iteration order

For 2-d dense data, the array must be serialized to memory, i.e. in a linear order.
The serialization strategies are named by which dimension (row versus column) 
occurs sequentially in memory.

<img src="https://upload.wikimedia.org/wikipedia/commons/4/4d/Row_and_column_major_order.svg" width=256 title="Row versus column major order." />

Choosing a memory-efficient loop order has a big impact on performance. Depending on the order, either:
  * successive loop iterations access adjacent elements (cache friendly), or
  * successive loop iterations access strided elements (each access may touch a different cache line).
  
The different orders are also associated with programming languages that use these conventions.
  
<img src="https://images.slideplayer.com/23/6540072/slides/slide_3.jpg" width=512 title="from Edgar Gabriel at UH" />

There are many conventions about loop ordering and they get confusing.  Reason carefully about how the loop variables are enumerated and how the data are laid out.  For example, images are almost always in Fortran order, so programming them in C looks weird.
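As a minimal sketch of the two conventions, the offset of element `(row, col)` in a flat array can be written as follows. The function names are hypothetical and are not taken from `stencil.c`.

```c
#include <stddef.h>

/* Row-major (C order): the elements of one row are adjacent in memory. */
size_t row_major_offset(size_t row, size_t col, size_t ncols)
{
    return row * ncols + col;
}

/* Column-major (Fortran order): the elements of one column are adjacent. */
size_t col_major_offset(size_t row, size_t col, size_t nrows)
{
    return col * nrows + row;
}
```

In row-major order, stepping `col` by one moves one element in memory. In column-major order, stepping `row` by one does.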

### Sequential access in `stencil.c`

We provide two routines that show the difference between sequential and strided access in C.

Which of the following performs sequential access?

```c
void initializeyx ( double* array )
{
    /* Initialize the array to random values */
    for (int y=0; y<DIM; y++) {
        for (int x=0; x<DIM; x++) {
            array[x*DIM+y] = (double)rand()/RAND_MAX;
        }        
    }
}

void initializexy ( double* array )
{
    /* Initialize the array to random values */
    for (int x=0; x<DIM; x++) {
        for (int y=0; y<DIM; y++) {
            array[x*DIM+y] = (double)rand()/RAND_MAX;
        }        
    }
}
```
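One way to check your answer without timing anything is to count how often consecutive accesses land on a different cache line. The sketch below replays the index pattern `x*DIM+y` for both loop orders. It assumes a small `DIM`, 64-byte cache lines, 8-byte doubles, and a line-aligned array. The helper names are hypothetical.

```c
#include <stddef.h>

#define DIM 64               /* assumed small grid; stencil.c defines its own DIM */
#define DOUBLES_PER_LINE 8   /* assumes 64-byte cache lines and 8-byte doubles */

/* Count how often two consecutive accesses fall on different cache lines.
   x_is_outer != 0 replays initializexy, x_is_outer == 0 replays initializeyx. */
static long line_changes(int x_is_outer)
{
    long changes = 0;
    size_t prev_line = (size_t)-1;   /* sentinel: no line touched yet */

    for (int outer = 0; outer < DIM; outer++) {
        for (int inner = 0; inner < DIM; inner++) {
            int x = x_is_outer ? outer : inner;
            int y = x_is_outer ? inner : outer;
            size_t line = ((size_t)x * DIM + y) / DOUBLES_PER_LINE;
            if (line != prev_line) {
                changes++;
            }
            prev_line = line;
        }
    }
    return changes;
}

long line_changes_xy(void) { return line_changes(1); }
long line_changes_yx(void) { return line_changes(0); }
```

Under these assumptions the order with `y` innermost changes cache lines once every eight accesses, while the order with `x` innermost changes lines on every access.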

### A Parallel Stencil

A common pattern in numerical computing is to compute a [compact stencil](https://en.wikipedia.org/wiki/Compact_stencil). 

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c5/CompactStencil.svg/300px-CompactStencil.svg.png" width=256 title="Compact stencil." />

The following function computes an average over a compact stencil at each (well defined) cell in a 2-d grid.  This computation pattern is used frequently in convolutional neural networks.

```c
void stencil_average ( double* input_ar, double* output_ar )
{
    double partial = 0.0;

    for (int x=HWIDTH; x<DIM-HWIDTH; x++) {
        for (int y=HWIDTH; y<DIM-HWIDTH; y++) {
            for (int xs=-1*HWIDTH; xs<=HWIDTH; xs++) {
                for (int ys=-1*HWIDTH; ys<=HWIDTH; ys++) {
                    partial += input_ar[DIM*(x+xs)+(y+ys)];
                }   
            }   
            output_ar[DIM*x+y] = partial/((2*HWIDTH+1)*(2*HWIDTH+1));
            partial=0.0;
        }       
    }
}
```
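The function above is serial. A minimal sketch of a parallel version follows, assuming the outer loop is the one to parallelize. The name `stencil_average_omp` and the values of `DIM` and `HWIDTH` are assumptions for illustration. Note that `partial` moves inside the loop body, so each thread gets its own copy rather than sharing one accumulator.

```c
#define DIM 8      /* assumed grid size; stencil.c defines its own */
#define HWIDTH 1   /* assumed half width, giving a 3x3 stencil */

void stencil_average_omp(const double *input_ar, double *output_ar)
{
    /* Each thread handles a block of x values; rows of output are independent. */
    #pragma omp parallel for
    for (int x = HWIDTH; x < DIM - HWIDTH; x++) {
        for (int y = HWIDTH; y < DIM - HWIDTH; y++) {
            double partial = 0.0;   /* declared in the loop body, so private per thread */
            for (int xs = -HWIDTH; xs <= HWIDTH; xs++) {
                for (int ys = -HWIDTH; ys <= HWIDTH; ys++) {
                    partial += input_ar[DIM * (x + xs) + (y + ys)];
                }
            }
            output_ar[DIM * x + y] = partial / ((2 * HWIDTH + 1) * (2 * HWIDTH + 1));
        }
    }
}
```

Compiled without OpenMP support the pragma is ignored and the function runs serially with the same result.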

### Manual Optimizations in `stencil.c`

The file demonstrates successive optimizations to a loop. 
These are mostly considered compiler optimizations in CS, but 
for OpenMP it makes sense to do them by hand.

### Loop Unrolling

Loop unrolling is a time-space tradeoff typically made by compilers:
  * time: eliminate the branch instructions that evaluate the loop condition
  * space: make a bigger program with more statements

The example in `stencil.c` unrolls the entire 5x5 stencil, eliminating the two inner loops.
  * How many instructions are saved?
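To keep the idea readable, here is a sketch of the same technique on a smaller 3x3 stencil (the file uses 5x5, which has 25 terms). The function name and `DIM` value are assumptions for illustration. The two inner loops are replaced by nine explicit loads, so no loop counters or loop-condition branches remain for the stencil itself.

```c
#define DIM 8   /* assumed grid size; stencil.c defines its own */

void stencil_average_unrolled_3x3(const double *in, double *out)
{
    for (int x = 1; x < DIM - 1; x++) {
        for (int y = 1; y < DIM - 1; y++) {
            const double *c = &in[DIM * x + y];   /* center of the stencil */

            /* Fully unrolled 3x3 neighborhood: one row per line. */
            double sum = c[-DIM - 1] + c[-DIM] + c[-DIM + 1]
                       + c[-1]       + c[0]    + c[1]
                       + c[DIM - 1]  + c[DIM]  + c[DIM + 1];

            out[DIM * x + y] = sum / 9.0;
        }
    }
}
```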

### Loop Fusion

Replace multiple loops with a single one.
* for OpenMP this reduces thread startup costs.
* shown in `fused_stencil_sum_omp()`
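
A minimal sketch of the idea, using two hypothetical functions that are not taken from `stencil.c`: the first starts two parallel regions, the second does the same work in one.

```c
/* Two loops: each "parallel for" pays its own fork/join overhead. */
double scale_then_sum(double *a, int n, double factor)
{
    double sum = 0.0;

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
    }

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

/* Fused: one parallel region, and each a[i] is touched once while it is in cache. */
double scale_and_sum_fused(double *a, int n, double factor)
{
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        a[i] *= factor;
        sum += a[i];
    }
    return sum;
}
```

Fusion is only valid when iteration `i` of the second loop depends on nothing beyond iteration `i` of the first, as is the case here.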

### Separate dependencies

* Use reductions (shown in `max_el_reduce()`)
  * Note that the compiler actually does a reasonable job of this at `-O3`
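
As a sketch of what a reduction looks like (this is not the code of `max_el_reduce()`; the function name below is hypothetical), each thread keeps a private running maximum and OpenMP combines them at the end of the loop. The `max` reduction operator in C assumes OpenMP 3.1 or later.

```c
/* Returns the largest element of a[0..n-1]; assumes n >= 1. */
double max_element_sketch(const double *a, int n)
{
    double max_val = a[0];

    /* Without the reduction clause, every thread would race on max_val. */
    #pragma omp parallel for reduction(max:max_val)
    for (int i = 1; i < n; i++) {
        if (a[i] > max_val) {
            max_val = a[i];
        }
    }
    return max_val;
}
```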

### Other Optimizations

* Split loops -- rarely effective



### Loop Scheduling

This is really an aside. I just want you to know that it exists.

The full loop directive can include a `schedule` clause, which specifies a scheduling kind and an optional chunk size:
```c
#pragma omp parallel for schedule(kind [,chunk size])
```
in which `kind` can be one of:
* `static`: divide the loop into equal-sized chunks, assigned to threads before the loop runs
* `dynamic`: build an internal work queue and dispatch one chunk at a time as threads become free
* `guided`: dynamic scheduling with decreasing chunk size for load balance
* `auto`: the compiler or runtime chooses the schedule
* `runtime`: the schedule is chosen at run time, e.g. through the `OMP_SCHEDULE` environment variable
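
A small sketch of where a non-default schedule might help: in a triangular loop, iteration `i` does `i+1` units of work, so equal-sized static chunks leave the threads with unequal loads. The function name and the chunk size of 16 are arbitrary choices for illustration.

```c
/* Counts the inner iterations of a triangular loop nest: n*(n+1)/2 in total. */
long triangular_work(int n)
{
    long total = 0;

    /* dynamic: threads grab 16 iterations at a time as they finish earlier ones */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            total += 1;
        }
    }
    return total;
}
```

The result is the same for every schedule kind. Only the assignment of iterations to threads changes.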
