## Warps and Loop Unrolling

### Review Data Decomposition

CUDA defines an equivalence between data (_grid_) and execution (_thread_).

<img src="https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png" />

The programmer models their problem as a grid of data for which one thread is allocated per cell.  This is largely dictated by the requirement for coalesced memory access.

The __Thread Block__ is an intermediate level of decomposition; all threads of a block run on a single streaming multiprocessor (SM).
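The one-thread-per-cell mapping comes down to a small piece of index arithmetic. The following is a host-side C sketch of it, not kernel code: it mirrors the expression `blockIdx.x * blockDim.x + threadIdx.x` that a one-dimensional kernel typically evaluates. The function names `global_index` and `blocks_needed` are made up for this illustration.

```c
#include <assert.h>

/* Host-side sketch: the global cell index a CUDA thread would compute as
   blockIdx.x * blockDim.x + threadIdx.x in a 1-D launch. */
unsigned int global_index(unsigned int block_idx, unsigned int block_dim,
                          unsigned int thread_idx)
{
    assert(thread_idx < block_dim);   /* a thread index lies inside its block */
    return block_idx * block_dim + thread_idx;
}

/* Number of blocks needed so that each of n cells gets its own thread.
   The division rounds up, so the last block may be partly idle. */
unsigned int blocks_needed(unsigned int n, unsigned int block_dim)
{
    assert(block_dim > 0);
    return (n + block_dim - 1) / block_dim;
}
```

Because consecutive threads get consecutive indices, neighbouring threads touch neighbouring cells, which is the access pattern that coalescing rewards.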

### Scheduling Threads (Warps)

<img src="https://www.3dgep.com/wp-content/uploads/2011/11/Dual-Warp-Scheduler.png" width=512 />

* CUDA threads are mapped onto the hardware 32 threads at a time; each group of 32 is a _warp_:
  * a 16-thread half-warp is launched concurrently
  * the half-warp matches the cache line size, i.e. if each thread reads/writes a contiguous element and the access is aligned, the access is coalesced
  * so the memory architecture dictates scheduling
* Interleaving multiple warps lets the scheduler issue one instruction per clock cycle even when instructions are long-running:
  * executing an instruction actually takes many clock cycles
  * this is the same principle as processor pipelining
  
Warp execution is the __SIMD__ part of the GPU.  All threads in a warp execute the same instruction at the same time.
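To make the grouping concrete, here is a host-side C sketch of how a thread index in a one-dimensional block maps to a warp and to a lane within that warp. The constant `WARP_SIZE` and the helper names are assumptions for this example; in device code the width is available as the built-in `warpSize`.

```c
#include <assert.h>

#define WARP_SIZE 32   /* illustrative constant; device code exposes warpSize */

/* Which warp of the block a thread belongs to (threads 0-31 form warp 0). */
unsigned int warp_id(unsigned int tid) { return tid / WARP_SIZE; }

/* Position of the thread inside its warp, in the range 0..31. */
unsigned int lane_id(unsigned int tid) { return tid % WARP_SIZE; }

/* Number of warps needed for a block; a partial warp still occupies a whole one. */
unsigned int warps_per_block(unsigned int block_dim)
{
    assert(block_dim > 0);
    return (block_dim + WARP_SIZE - 1) / WARP_SIZE;
}
```

A block of 48 threads therefore occupies two warps, the second of which is only half full.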

#### What does this mean for unrolling?

Consider the inner loop of a CUDA kernel that performs a reduction in shared memory:
  * each iteration uses half as many threads as the one before (from half the thread block down to a single thread)
  * results are merged toward thread 0
  * all threads in the block synchronize on each iteration

```c
for (unsigned int j=blockDim.x >> 1; j>0; j>>=1)
{
  if (tid < j)
    SharedData[tid] += SharedData[tid+j];
  __syncthreads();
}
```
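To see what this loop computes, it can help to run the same index pattern on the host. The sketch below is a sequential C simulation, not the kernel itself: the array `data` stands in for `SharedData`, `n` stands in for `blockDim.x`, and the inner loop over `tid` replaces the threads that would run in parallel. The name `reduce_in_place` is made up for this example, and the sketch assumes `n` is a power of two.

```c
#include <assert.h>

/* Host-side simulation of the in-block tree reduction.
   data plays the role of SharedData, n the role of blockDim.x.
   The sum of all n elements ends up in data[0], which is returned. */
int reduce_in_place(int *data, unsigned int n)
{
    assert(n > 0 && (n & (n - 1)) == 0);   /* n must be a power of two */

    for (unsigned int j = n >> 1; j > 0; j >>= 1) {
        /* On the GPU every thread with tid < j does this at once.
           Reads come from [j, 2j) and writes go to [0, j), so the
           order in which we visit tid does not matter. */
        for (unsigned int tid = 0; tid < j; tid++)
            data[tid] += data[tid + j];
        /* The kernel calls __syncthreads() here before the next step. */
    }
    return data[0];
}
```

For eight elements the loop runs three steps (`j` = 4, 2, 1), halving the number of active threads each time.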

Now unroll the loop fully, here for a block of 256 threads:

```c
if (tid < 128)
    SharedData[tid] += SharedData[tid+128];
__syncthreads(); 
if (tid < 64)
    SharedData[tid] += SharedData[tid+64];
__syncthreads();

...

if (tid < 1)
    SharedData[tid] += SharedData[tid+1];
__syncthreads();
```

OK, but what do we know once 32 or fewer threads remain?
  * only threads 0-31 are active
  * they all belong to one warp
  * on hardware that runs a warp in lockstep, the SIMD property guarantees that they are synchronous
  * so we can remove `__syncthreads()`
  
```c
if (tid < 128)
    SharedData[tid] += SharedData[tid+128];
__syncthreads();
if (tid < 64)
    SharedData[tid] += SharedData[tid+64];
__syncthreads();

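if (tid < 32)
{
  // Warp-synchronous: no barrier between steps. SharedData must be declared
  // volatile so that every access really goes to shared memory. On Volta and
  // later, lockstep is not guaranteed and __syncwarp() is needed between the
  // reads and the writes.
  SharedData[tid] += SharedData[tid+32];
  SharedData[tid] += SharedData[tid+16];
  SharedData[tid] += SharedData[tid+8];
  SharedData[tid] += SharedData[tid+4];
  SharedData[tid] += SharedData[tid+2];
  SharedData[tid] += SharedData[tid+1];
}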
```

* We just eliminated:
    * 5 branching `if` statements
    * 6 barriers (with contention)
    * 1/3 of all instructions in the kernel
* One third seems like far too many. How is that possible?
    * an instruction is charged against the whole warp, no matter how few of its threads do useful work
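The counts of eliminated `if` statements and barriers can be checked with a little arithmetic. The sketch below is a host-side C tally under the assumptions used above (a power-of-two block of at least 64 threads, one `if` and one barrier per unrolled step, and a single `if (tid < 32)` guarding the warp-synchronous tail). The `Cost` type and both function names are hypothetical.

```c
#include <assert.h>

/* Tally of control-flow overhead in an unrolled reduction. */
typedef struct {
    unsigned int ifs;
    unsigned int barriers;
} Cost;

/* Fully unrolled version: every step has its own if and its own barrier. */
Cost cost_full_unroll(unsigned int block_dim)
{
    assert(block_dim >= 64 && (block_dim & (block_dim - 1)) == 0);
    Cost c = {0, 0};
    for (unsigned int j = block_dim >> 1; j > 0; j >>= 1) {
        c.ifs++;
        c.barriers++;
    }
    return c;
}

/* Warp-synchronous version: steps with j > 32 keep their if and barrier,
   all remaining steps share one if (tid < 32) and use no barrier. */
Cost cost_warp_synchronous(unsigned int block_dim)
{
    assert(block_dim >= 64 && (block_dim & (block_dim - 1)) == 0);
    Cost c = {0, 0};
    for (unsigned int j = block_dim >> 1; j > 32; j >>= 1) {
        c.ifs++;
        c.barriers++;
    }
    c.ifs++;   /* the single if (tid < 32) */
    return c;
}
```

For a block of 256 threads the difference between the two tallies matches the figures in the list above.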
   

## Divergence and Conditionals


Control-flow statements (`if`, `for`, `do`, `while`) are expensive in CUDA. Even a simple conditional incurs extra overhead to decide which threads in a warp run and which don't.  It gets worse when there is __divergence__.

When a SIMD warp reaches an `if` statement on which its threads diverge:
```c
if (a[i] > C)
{
    action;
} else {
    otheraction;
}
```
the two actions are serialized.  
  * the `if` test runs as an instruction on every thread
  * `action` runs on the threads that pass, while the others sit idle
  * then `otheraction` runs on the remaining threads, while the first group sits idle
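The serialization described above can be modelled on the host. The following C sketch is a simplified model of one warp, not a description of the actual hardware: it evaluates the test for all 32 lanes into a bit mask, then makes one pass per branch that has at least one taker. The placeholder operations (`+ 1` for `action`, `- 1` for `otheraction`) and the name `warp_if_else` are invented for this example.

```c
#include <assert.h>

#define WARP_SIZE 32   /* illustrative constant */

/* Simplified host-side model of one warp executing
       if (a[lane] > C) out[lane] = a[lane] + 1;   (stands in for action)
       else             out[lane] = a[lane] - 1;   (stands in for otheraction)
   Returns the number of serialized passes: 1 if all lanes agree, 2 if they diverge. */
int warp_if_else(const int *a, int C, int *out)
{
    assert(a != 0 && out != 0);

    /* Step 1: the test runs on every lane and yields an active mask. */
    unsigned int mask = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (a[lane] > C)
            mask |= 1u << lane;

    int passes = 0;

    /* Step 2: lanes that passed run the first branch; the rest are masked off. */
    if (mask != 0) {
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if ((mask >> lane) & 1u)
                out[lane] = a[lane] + 1;
        passes++;
    }

    /* Step 3: the remaining lanes run the other branch. */
    if (mask != 0xFFFFFFFFu) {
        for (int lane = 0; lane < WARP_SIZE; lane++)
            if (!((mask >> lane) & 1u))
                out[lane] = a[lane] - 1;
        passes++;
    }
    return passes;
}
```

When every lane takes the same branch, the model makes a single pass; as soon as one lane disagrees, the warp pays for both.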
  
  
### Eliminating Conditionals

Many conditional statements can be eliminated:

* by loop unrolling
* with predicated code
  * convert a conditional into a predicate and an operation
  * all threads execute the same instructions and get different results
  * this is a cool and powerful pattern.
  
```c
  /* branch version */
  if ( src[i] < V )
    j++;

  /* predicated version */
  bool b = (src[i] < V);
  j+=b;
```
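Placed in a complete function, the two variants look as follows. This is a plain C sketch that counts the elements of an array below a threshold; the function names are hypothetical, and `<stdbool.h>` is only needed because this is C rather than CUDA C++. Whether a compiler already turns the branch version into predicated code on its own depends on the compiler and target, so treat the second form as a way of stating the intent explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Branch version: the increment is skipped when the test fails. */
size_t count_below_branch(const int *src, size_t n, int V)
{
    assert(src != NULL || n == 0);
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        if (src[i] < V)
            j++;
    }
    return j;
}

/* Predicated version: every element performs the same addition,
   adding 1 when the predicate holds and 0 when it does not. */
size_t count_below_predicated(const int *src, size_t n, int V)
{
    assert(src != NULL || n == 0);
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        bool b = (src[i] < V);
        j += b;   /* a bool converts to 0 or 1 */
    }
    return j;
}
```

Both functions return the same count; the predicated one has no data-dependent branch in the loop body.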