## CUDA Memory Usage

Lecture derived from https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html.


### Memory Access Patterns



  
#### Coalesced Memory Accesses
<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/coalesced-access.png" />
A coalesced access is one in which all threads of a warp access addresses within a single cache line.  The CUDA warp (SIMD execution context) then performs the whole transfer in parallel as a single memory access.

Similar access patterns may take twice as long when they are not aligned to a cache line:

<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/unaligned-sequential-addresses.png" />

Here the warp's addresses straddle two cache lines, so the access takes two memory references.  The effect can be seen in the following kernel, which shifts every thread's index by `offset`:

```c
__global__ void offsetCopy(float *odata, float* idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
```

This kernel produces the following performance chart.  Every 32 offsets, the access happens to be aligned again and achieves higher throughput.


<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/performance-of-offsetcopy-kernel.png" />
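A sweep like the one in the chart can be sketched as follows. This is a minimal sketch, not the guide's benchmark: the array size, block size, offset range, and the helper names `effectiveBandwidthGBs` and `runOffsetSweep` are assumptions made for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Same kernel as above, repeated so the sketch compiles on its own.
__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}

// Hypothetical helper: bytes moved per nanosecond equals GB/s.
double effectiveBandwidthGBs(size_t bytes, float ms)
{
    return static_cast<double>(bytes) / (ms * 1.0e6);
}

// Hypothetical driver: times one copy per offset and prints the bandwidth.
void runOffsetSweep()
{
    const int n = 1 << 24;     // elements copied per launch (assumed)
    const int maxOffset = 32;  // largest offset tested (assumed)
    const int threads = 256;   // threads per block (assumed)

    // Allocate n + maxOffset elements so shifted accesses stay in bounds.
    float *idata, *odata;
    cudaMalloc(&idata, (n + maxOffset) * sizeof(float));
    cudaMalloc(&odata, (n + maxOffset) * sizeof(float));
    cudaMemset(idata, 0, (n + maxOffset) * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int offset = 0; offset <= maxOffset; ++offset) {
        offsetCopy<<<n / threads, threads>>>(odata, idata, offset);  // warm-up
        cudaEventRecord(start);
        offsetCopy<<<n / threads, threads>>>(odata, idata, offset);  // timed run
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // Each element is read once and written once.
        size_t bytes = 2 * static_cast<size_t>(n) * sizeof(float);
        printf("offset %2d: %8.2f GB/s\n", offset, effectiveBandwidthGBs(bytes, ms));
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(idata);
    cudaFree(odata);
}
```

Calling `runOffsetSweep()` from `main` prints one line per offset. The exact shape of the curve depends on the device, so the numbers will not necessarily match the chart.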

#### Banked Memory

CUDA memory is _banked_ and caches are _direct mapped_.

Only one thread at a time can access memory at a given bank offset, and every memory address maps to exactly one bank offset; this is what direct mapped means here.  So, when accessing data with a stride of 2, we get half the throughput.  The following image demonstrates such a conflict: the warp's addresses span twice as many cache lines, so it takes two memory accesses to read the data.

<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/adjacent-threads-accessing-memory-with-stride-of-2.png" />

This results in a throughput collapse when accessing strided data.  At a stride of 32 words, each cache line that is transferred yields only one useful word.

<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/performance-of-stridecopy-kernel.png" />
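The strided pattern can be sketched as below. The `strideCopy` kernel follows the Best Practices Guide; `linesTouchedByWarp` is a hypothetical host-side model added here for illustration. It assumes 32 threads per warp, 4-byte words, 128-byte cache lines, and an aligned base address, which matches the simplified picture in these notes rather than any particular device.

```cuda
// Kernel from the Best Practices Guide: thread i touches element i * stride.
__global__ void strideCopy(float *odata, float *idata, int stride)
{
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}

// Hypothetical model: number of distinct cache lines one warp touches
// when its threads read 4-byte words `stride` words apart (stride >= 0).
int linesTouchedByWarp(int stride)
{
    const int threadsPerWarp = 32;  // assumed warp width
    const int wordsPerLine = 32;    // assumed: 128-byte line / 4-byte word

    int count = 0;
    int prevLine = -1;
    for (int t = 0; t < threadsPerWarp; ++t) {
        int line = (t * stride) / wordsPerLine;  // line holding thread t's word
        if (line != prevLine) {                  // lines only increase with t
            ++count;
            prevLine = line;
        }
    }
    return count;
}
```

Under these assumptions a stride of 1 touches one line, a stride of 2 touches two, and a stride of 32 touches 32, so the fraction of transferred data that is actually used falls as 1 / `linesTouchedByWarp(stride)`.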

### On Chip and Off Chip Memory

CUDA devices have very little hardware-managed cache (an inconsequential amount of L1 and L2).  However, modest amounts of "programmable" cache, aka _shared memory_, are available on each streaming multiprocessor.  To first order, think of memory as being either on-chip (shared memory and registers) or off-chip (all other).

<img src="https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/graphics/memory-spaces-on-cuda-device.png" />

From the perspective of Roofline performance, transfers from device/texture/read-only memory compete for the off-chip memory bandwidth and limit performance.  In contrast, references to on-chip memory increase operational intensity when they replace off-chip accesses: the same amount of computing is done with fewer off-chip accesses.
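As an illustration of replacing off-chip accesses with on-chip ones, here is a minimal sketch of a 1D stencil that stages its input in shared memory. The kernel, the constants `RADIUS` and `BLOCK`, and the checking helper `countStencilMismatches` are all assumptions made for this example. Each output sums 2 * `RADIUS` + 1 neighbours, but each block reads only `BLOCK` + 2 * `RADIUS` values from off-chip memory instead of `BLOCK` * (2 * `RADIUS` + 1).

```cuda
#include <vector>
#include <cuda_runtime.h>

#define RADIUS 3    // stencil half-width (assumed)
#define BLOCK  256  // threads per block; the launch must use this value

__global__ void stencil1d(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK + 2 * RADIUS];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + RADIUS;                   // index inside the tile

    // Each thread stages its own element; out-of-range reads become 0.
    tile[l] = (g < n) ? in[g] : 0.0f;
    // The first RADIUS threads also stage the halo on both sides.
    if (threadIdx.x < RADIUS) {
        int left = g - RADIUS;
        int right = g + BLOCK;
        tile[l - RADIUS] = (left >= 0 && left < n) ? in[left] : 0.0f;
        tile[l + BLOCK] = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // the whole tile must be loaded before anyone reads it

    if (g < n) {
        float sum = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            sum += tile[l + k];  // served from on-chip shared memory
        out[g] = sum;
    }
}

// Hypothetical check: runs the kernel and counts outputs that differ
// from a straightforward CPU version of the same sum.
int countStencilMismatches(int n)
{
    std::vector<float> h_in(n), h_out(n);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 7);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    stencil1d<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float ref = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k) {
            int j = i + k;
            if (j >= 0 && j < n) ref += h_in[j];
        }
        if (ref != h_out[i]) ++bad;
    }
    return bad;
}
```

With these values the off-chip reads per output drop from 7 to roughly 1.02 while the arithmetic stays the same, which is the increase in operational intensity described above. The estimate ignores whatever reuse the hardware caches would have provided on their own.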

#### Registers

You, the programmer, cannot program the registers directly. They are used by the compiler to place data. Some programming practices give the compiler plenty of data to put in registers.  You should think of registers as fast scratch space for the compiler.
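One such practice, sketched below under stated assumptions, is to keep per-thread data in scalars and in small fixed-size arrays whose indices become compile-time constants once the loops are unrolled. The kernel `sumSquares`, the constant `ITEMS`, and the helper `countSumSquaresMismatches` are hypothetical names for this example. The compiler is free to decide otherwise, so treat register placement here as likely rather than guaranteed; compiling with `nvcc -Xptxas -v` reports the registers used per thread and any spills.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define ITEMS 4  // values each thread holds at once (assumed, small and fixed)

__global__ void sumSquares(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int total = gridDim.x * blockDim.x;

    float v[ITEMS];  // fixed size, only indexed by unrolled loop counters
    #pragma unroll
    for (int i = 0; i < ITEMS; ++i) {
        int idx = tid + i * total;  // neighbouring threads read neighbouring words
        v[i] = (idx < n) ? in[idx] : 0.0f;
    }

    float acc = 0.0f;  // scalar accumulator, a natural register candidate
    #pragma unroll
    for (int i = 0; i < ITEMS; ++i)
        acc += v[i] * v[i];

    out[tid] = acc;
}

// Hypothetical check against a CPU version of the same computation.
int countSumSquaresMismatches(int blocks, int threads)
{
    int total = blocks * threads;
    int n = total * ITEMS - 3;  // deliberately leaves a ragged tail
    std::vector<float> h_in(n), h_out(total);
    for (int i = 0; i < n; ++i) h_in[i] = float(i % 5);

    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, total * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    sumSquares<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out.data(), d_out, total * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    int bad = 0;
    for (int t = 0; t < total; ++t) {
        float ref = 0.0f;
        for (int i = 0; i < ITEMS; ++i) {
            int idx = t + i * total;
            if (idx < n) ref += h_in[idx] * h_in[idx];
        }
        if (ref != h_out[t]) ++bad;
    }
    return bad;
}
```

If `v` were instead indexed by a value only known at run time, the compiler would generally have to place it in addressable memory rather than in registers.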

__GPUs have many registers__ in comparison with CPUs  in whi



