## Loop Dependencies

In OpenMP we come to think of loops as _parallel_ executions of the iterations.
In serial programming, they are ordered executions of code.  To match these abstractions,
we need loop iterations to be _independent_ of each other.  This involves working around or
eliminating dependencies.

### The Loop "Recipe"
* Find the bottlenecks (profile)
* Eliminate _loop carried dependencies_
* Parallelize the loops
  * Semantically neutral directives are very helpful.  This is perhaps the main factor behind OpenMP’s success.
* Optimize the loop schedule
  * Load balance, avoid task skew, amortize startup
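
As an illustration of the last recipe step, here is a minimal sketch of a loop whose iterations cost different amounts of work. The function name `triangular_row_sums` and the chunk size of 16 are assumptions made for this example, not values from the course material. A `dynamic` schedule hands out chunks as threads become free (load balance), and the chunk size keeps the scheduling overhead per iteration small (amortized startup).

```c
#include <assert.h>
#include <stddef.h>

/* Iteration i does O(i) work, so equal-sized static blocks would leave
   the thread holding the last block with most of the work. */
void triangular_row_sums(long *row_sum, int n)
{
    assert(row_sum != NULL);

    /* dynamic: idle threads grab the next chunk of 16 iterations.
       The chunk size is an arbitrary starting point; tune it by profiling. */
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++) {
        long s = 0;
        for (int j = 0; j <= i; j++) {
            s += j;
        }
        row_sum[i] = s;   /* each iteration writes only its own element */
    }
}
```

Because the directive is semantically neutral, the same function compiles and gives the same result without OpenMP enabled.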

### Loop Carried Dependencies

A _loop carried dependency_ arises when one iteration of a loop depends upon the computations of other iterations, i.e. the dependency is between different iterations of the loop.

* Can be addressed via loop rewriting
  * Can’t my compiler do this?
* Removable dependencies
  * Code transformations
* Separable dependencies
  * Accumulation operations (mean, sum, count)
  * Extrema (max, min)
  * Connections to the reduce in map/reduce
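
The separable dependencies listed above map onto OpenMP's `reduction` clause. The following is a minimal sketch; the function names `array_sum` and `array_max` are made up for this example. Each thread works on a private partial result, and OpenMP combines the partial results when the loop ends, much like the reduce step in map/reduce.

```c
#include <assert.h>

/* Accumulation: every iteration updates sum, but the order of the
   additions does not matter, so the dependency is separable. */
double array_sum(const double *x, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}

/* Extremum: the max reduction operator is available in C from OpenMP 3.1. */
double array_max(const double *x, int n)
{
    assert(n > 0);            /* the maximum of an empty array is undefined */
    double best = x[0];
    #pragma omp parallel for reduction(max:best)
    for (int i = 1; i < n; i++) {
        if (x[i] > best) {
            best = x[i];
        }
    }
    return best;
}
```

Note that a floating-point sum may differ in the last bits between runs, because the combination order of the partial sums is not fixed.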

Dependent loop
  
```c
int offset1 = c;
int offset2 = 0;

for ( int i=0; i<N; i++ )
{
  offset1 = offset1 + 1;
  d[offset1] = big_time_work ( offset1 );
  offset2 = offset2 + i;
  a[offset2] = other_big_calc ( offset2 );
}
```

and a semantically equivalent loop with no dependencies.

```c
for ( int i=0; i<N; i++ )
{
  d[c+i+1] = big_time_work ( c+i+1 );   // offset1 is incremented before it is used
  a[(i*i+i)/2] = other_big_calc ( (i*i+i)/2 );
}
```
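
One way to gain confidence in such a rewrite is to compare the closed-form index expressions against the original recurrence. The sketch below does that; the helper names are hypothetical and introduced only for this check. In the dependent loop `offset1` is incremented before it is used, so iteration `i` sees `c + i + 1`, and `offset2` accumulates `0 + 1 + ... + i`, which is `(i*i + i)/2`.

```c
#include <assert.h>

/* Replays the serial recurrence and reports the offsets that
   iteration i (0-based) of the dependent loop would use. */
void offsets_by_recurrence(int c, int i, int *offset1, int *offset2)
{
    assert(i >= 0);
    int o1 = c;
    int o2 = 0;
    for (int k = 0; k <= i; k++) {
        o1 = o1 + 1;
        o2 = o2 + k;
    }
    *offset1 = o1;
    *offset2 = o2;
}

/* Closed forms that depend only on i, so iterations are independent. */
int offset1_closed(int c, int i) { return c + i + 1; }
int offset2_closed(int i)        { return (i * i + i) / 2; }
```

Checking the two against each other for a range of `i` catches off-by-one mistakes before the loop is parallelized.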

### Types of Dependencies (from [Wikipedia](https://en.wikipedia.org/wiki/Loop_dependence_analysis))

Dependencies are data/ordering relationships between statements.  We will present each dependency by showing simple statements and then a corresponding loop carried dependency.

__True Dependency__: write before read

```c
S1: a = 5;
S2: b = a;
```
In a loop, earlier iterations write a value before later iterations read it.
```c
 for(j = 1; j < n; j++)
    S1: a[j] = a[j-1];
```

__Anti Dependency__: read before write
```c
S1: a = b;
S2: b = 5;
```
In a loop, earlier iterations read a value that later iterations will write.  The danger in parallelization is that the later iteration would run first and overwrite the value before it is read.
```c
 for(j = 0; j < n; j++)
    S1: b[j] = b[j+1];
```

__Output Dependency__: write after write
```c
S1: c = 8; 
S2: c = 15;
```
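The second statement must run after the first so that the first does not overwrite its result.  The first write can actually be discarded.  In a loop, `S2` in iteration `j` and `S1` in iteration `j+1` write the same element.
```c
 for(j = 0; j < n; j++) {
    S1: c[j] = j;
    S2: c[j+1] = 5;
 }
```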

There is also __input dependence__ and __control flow dependence__.  They are important but less relevant for OpenMP transformations.

### What to do with dependencies?

True dependencies are true and must be preserved.  Output and anti-dependencies can be handled with temporary variables. 

Serial loop with an anti-dependency
```c
for(i = 0; i < n; i++) {
    x = (b[i] + c[i]) / 2;
    a[i] = a[i+1] + x;
}
```
This can be avoided by creating a read-only copy of the `a` array so that the read and the write of the same element, which happen in different iterations, don't compete.
```c
#pragma omp parallel for
for(i = 0; i < n; i++) {
    a_copy[i] = a[i+1];
}
#pragma omp parallel for private(x)
for(i = 0; i < n; i++) {
    x = (b[i] + c[i]) / 2;
    a[i] = a_copy[i] + x;
}
```
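
For reference, here is a self-contained sketch of the same transformation that can be checked against the serial loop. The function names, the `double` element type, and the heap-allocated copy are assumptions for this example. The array `a` must hold `n + 1` elements because iteration `n - 1` reads `a[n]`.

```c
#include <assert.h>
#include <stdlib.h>

/* Serial reference: iteration i reads a[i+1] before iteration i+1 writes it. */
void shift_add_serial(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++) {
        double x = (b[i] + c[i]) / 2;
        a[i] = a[i + 1] + x;
    }
}

/* Parallel version: snapshot the values to be read, then update a. */
void shift_add_parallel(double *a, const double *b, const double *c, int n)
{
    if (n <= 0) {
        return;
    }
    double *a_copy = malloc((size_t)n * sizeof *a_copy);
    assert(a_copy != NULL);

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a_copy[i] = a[i + 1];          /* reads only, no thread writes a here */
    }

    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double x = (b[i] + c[i]) / 2;  /* declared in the body, so private */
        a[i] = a_copy[i] + x;
    }
    free(a_copy);
}
```

The price of removing the anti-dependency is one extra pass over the data and `n` elements of temporary storage.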

Serial loop with a _flow dependency_ (a true dependency carried between loop iterations): each iteration reads the `a[i-1]` value that the previous iteration wrote.  This can be addressed by __loop skewing__.
```c
for(i = 1; i < n; i++) {
    b[i] = b[i] + a[i-1];
    a[i] = a[i] + c[i];
}
```
We need to make sure that each element of `a` is updated before it is read.  This can be done without copying by changing the iteration discipline: do one operation before the loop, skew the remaining operations so that the write and the read of the same element fall in the same iteration, and do one operation after the loop.
```c
b[1] = b[1] + a[0];
#pragma omp parallel for
for(i = 1; i < n-1; i++) {
    a[i] = a[i] + c[i];
    b[i+1] = b[i+1] + a[i];
}
a[n-1] = a[n-1] + c[n-1];
```
This converts a loop `for { AB }` into `A for { BA } B` and is thus called skewing.
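
A skewed loop is easy to get wrong at its boundaries, so it helps to compare it against the serial form. The sketch below does this under the assumption that the serial loop runs from `i = 1` to `n - 1`, so that `a[i-1]` stays in bounds; the function names and the `double` element type are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Serial reference: iteration i reads a[i-1], which iteration i-1 updated. */
void update_serial(double *a, double *b, const double *c, int n)
{
    for (int i = 1; i < n; i++) {
        b[i] = b[i] + a[i - 1];
        a[i] = a[i] + c[i];
    }
}

/* Skewed form: the write of a[i] and its only read share one iteration. */
void update_skewed(double *a, double *b, const double *c, int n)
{
    assert(a != NULL && b != NULL && c != NULL);
    if (n < 2) {
        return;                          /* the serial loop does nothing */
    }
    b[1] = b[1] + a[0];                  /* peeled first b update */
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++) {
        a[i] = a[i] + c[i];
        b[i + 1] = b[i + 1] + a[i];      /* reads the a[i] written just above */
    }
    a[n - 1] = a[n - 1] + c[n - 1];      /* peeled last a update */
}
```

Inside the parallel loop, iteration `i` touches only `a[i]`, `c[i]`, and `b[i+1]`, so no two iterations share an element.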


### Scoping

OpenMP provides several data-scoping clauses.  See __NoteBook__: <a href="openmp/ompvariables.ipynb">Loop Parallelism</a>.  These include:
  * private: create an uninitialized per-thread copy of an externally scoped variable
  * firstprivate: create a per-thread copy of an externally scoped variable, initialized with the value it had before the loop
  * lastprivate: create a per-thread copy of an externally scoped variable and copy out the value from the sequentially last iteration
  
These clauses are useful with `for` loops and are needed for externally scoped variables __when they are being updated__ inside the loop.  If the variable is not updated in the loop, it can remain shared, and every thread can/will cache it.
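
The three clauses can be seen together in one loop. This is a minimal sketch; the function `scoping_demo` and its parameters are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Writes out[i] = in[i]*in[i] + base and returns the value that the
   sequentially last iteration stored (0 if there are no iterations). */
int scoping_demo(const int *in, int *out, int n, int base)
{
    assert(in != NULL && out != NULL);
    if (n <= 0) {
        return 0;
    }
    int tmp;        /* scratch value, updated in every iteration */
    int last = 0;   /* receives the value from iteration n-1 */

    #pragma omp parallel for private(tmp) firstprivate(base) lastprivate(last)
    for (int i = 0; i < n; i++) {
        tmp = in[i] * in[i];   /* private: uninitialized copy, assign before use */
        out[i] = tmp + base;   /* firstprivate: copy starts with the caller's value */
        last = out[i];         /* lastprivate: iteration n-1's value is copied out */
    }
    return last;
}
```

Without `private(tmp)`, all threads would write the same `tmp` and race with each other. The arrays `in` and `out` stay shared, which is safe because `in` is only read and each iteration writes a distinct element of `out`.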
  
Loop variables should not be updated inside the loop body.  Doing so produces incorrect results.
  * Assume that OpenMP computes the complete set of iterations before the loop starts executing.
  