## Warps, Divergence, and Loop Unrolling

### Review Data Decomposition

CUDA defines an equivalence between data (_grid_) and execution (_thread_).

<img src="https://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png" />

The programmer models their problem as a grid of data for which one thread is allocated per cell.  This is largely dictated by the requirement for coalesced memory access.

The __Thread Block__ is an intermediate level of decomposition: all threads of a block run on a single streaming multiprocessor.
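
The one-thread-per-cell mapping can be sketched for a 1D grid as follows. This is a minimal illustration, not a tuned implementation: the names `scale` and `scaleOnDevice`, the block size of 256, and the array length are assumptions chosen for the example, and error checking is omitted for brevity.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per cell of a 1D data grid.
__global__ void scale(float *data, float factor, int n)
{
  // Global cell index: offset of this block plus position inside it.
  // Neighbouring threads touch neighbouring elements, so accesses can coalesce.
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)  // the last block may extend past the end of the data
    data[i] *= factor;
}

// Hypothetical host helper: copy in, launch, copy back (error checks omitted).
void scaleOnDevice(float *host, float factor, int n)
{
  const int threadsPerBlock = 256;  // assumed block size
  const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
  float *dev;
  cudaMalloc(&dev, n * sizeof(float));
  cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
  scale<<<blocks, threadsPerBlock>>>(dev, factor, n);
  // A blocking copy on the default stream waits for the kernel to finish.
  cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(dev);
}

int main(void)
{
  float v[1000];
  for (int i = 0; i < 1000; ++i)
    v[i] = (float)i;
  scaleOnDevice(v, 2.0f, 1000);
  printf("v[10] = %.1f\n", v[10]);
  return 0;
}
```

With 1000 elements and 256 threads per block, the launch uses 4 blocks, and the `i < n` guard keeps the 24 surplus threads in the last block from writing out of bounds.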

### Scheduling Threads (Warps)

<img src="https://www.3dgep.com/wp-content/uploads/2011/11/Dual-Warp-Scheduler.png" />

* CUDA threads are mapped onto hardware in _warps_ of 32 threads:
  * the 16 threads of a half-warp are launched concurrently
  * a half-warp matches the cache line size, i.e. if each thread reads or writes a contiguous element and the access is aligned, the accesses are coalesced
  * so the memory architecture dictates scheduling
* Interleaving multiple warps lets the scheduler issue one instruction per clock cycle even when instructions are long running:
  * executing a single instruction actually takes many clock cycles
  * same principle as processor pipelining
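
The grouping of threads into warps can be observed from inside a kernel. The sketch below assumes a one-dimensional thread block, where consecutive `threadIdx.x` values fall into the same warp; `warpInfo` and `queryWarpLayout` are hypothetical names, while `warpSize` is the CUDA built-in variable. Error checking is omitted.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel: record which warp and lane each thread belongs to.
__global__ void warpInfo(int *warpId, int *laneId)
{
  int tid = threadIdx.x;
  warpId[tid] = tid / warpSize;  // index of the warp within the block
  laneId[tid] = tid % warpSize;  // position of the thread within its warp
}

// Hypothetical host helper: launch one block of n threads and fetch the layout.
void queryWarpLayout(int *hWarp, int *hLane, int n)
{
  int *dWarp, *dLane;
  cudaMalloc(&dWarp, n * sizeof(int));
  cudaMalloc(&dLane, n * sizeof(int));
  warpInfo<<<1, n>>>(dWarp, dLane);
  cudaMemcpy(hWarp, dWarp, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaMemcpy(hLane, dLane, n * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dWarp);
  cudaFree(dLane);
}

int main(void)
{
  int hWarp[96], hLane[96];  // 96 threads, i.e. three warps of 32
  queryWarpLayout(hWarp, hLane, 96);
  // Print the threads around the first warp boundary.
  for (int t = 30; t < 35; ++t)
    printf("thread %d -> warp %d, lane %d\n", t, hWarp[t], hLane[t]);
  return 0;
}
```

Threads 0 to 31 report warp 0 and thread 32 starts warp 1 at lane 0, which is the boundary that matters when reasoning about divergence within a warp.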

#### What does this mean for unrolling?

Let's consider the inner loop of a CUDA kernel that performs a reduction in shared memory:
  * use half as many threads in each iteration (from half the thread block down to a single thread)
  * merge results toward thread 0
  * synchronize all threads within the block on each iteration


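```c
// j is the number of active threads; it halves on every iteration.
for (unsigned int j=blockDim.x >> 1; j>0; j>>=1)
{
  // Each active thread folds in the element j positions above it.
  if (tid < j)
    SharedData[tid] += SharedData[tid+j];
  // Barrier for the whole block, reached by active and idle threads alike.
  __syncthreads();
}
```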
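
To run the loop, it needs a kernel around it that loads shared memory and writes the result back. The following is one possible harness, offered as a sketch under stated assumptions: the kernel name `blockSum`, the helper `blockSumOnDevice`, the fixed block size `BLOCK` (which must be a power of two for the halving to cover every element), and the integer element type are all chosen for illustration. Error checking is omitted.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define BLOCK 256  // assumed block size; must be a power of two

// Hypothetical kernel: each block sums BLOCK consecutive input elements.
__global__ void blockSum(const int *in, int *out)
{
  __shared__ int SharedData[BLOCK];
  unsigned int tid = threadIdx.x;

  // Stage one element per thread into shared memory.
  SharedData[tid] = in[blockIdx.x * blockDim.x + tid];
  __syncthreads();

  // The reduction loop from the text: active threads halve each iteration.
  for (unsigned int j = blockDim.x >> 1; j > 0; j >>= 1)
  {
    if (tid < j)
      SharedData[tid] += SharedData[tid + j];
    __syncthreads();
  }

  // Thread 0 holds the block total.
  if (tid == 0)
    out[blockIdx.x] = SharedData[0];
}

// Hypothetical host helper: reduce blocks * BLOCK inputs to one sum per block.
void blockSumOnDevice(const int *hIn, int *hOut, int blocks)
{
  int *dIn, *dOut;
  size_t inBytes = (size_t)blocks * BLOCK * sizeof(int);
  cudaMalloc(&dIn, inBytes);
  cudaMalloc(&dOut, blocks * sizeof(int));
  cudaMemcpy(dIn, hIn, inBytes, cudaMemcpyHostToDevice);
  blockSum<<<blocks, BLOCK>>>(dIn, dOut);
  cudaMemcpy(hOut, dOut, blocks * sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(dIn);
  cudaFree(dOut);
}

int main(void)
{
  const int blocks = 4;
  int hIn[blocks * BLOCK], hOut[blocks];
  for (int i = 0; i < blocks * BLOCK; ++i)
    hIn[i] = 1;  // every block should therefore sum to BLOCK
  blockSumOnDevice(hIn, hOut, blocks);
  for (int b = 0; b < blocks; ++b)
    printf("block %d sum = %d\n", b, hOut[b]);
  return 0;
}
```

With a block of 256 threads the loop runs 8 iterations. Once `j` drops below 32, all remaining active threads sit in the first warp, yet every thread in the block still reaches `__syncthreads()` on each iteration.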