# Kernel 2: Global Memory Coalescing

Take a moment to look back at our [discussion on warps and SMs](cuda_preface.ipynb), as that is key to the optimisation we will introduce in this first improved kernel.

As we know, the threads in a block are grouped into warps of 32, and this grouping is based on consecutive threadId. You might be familiar with threadIdx.x and threadIdx.y from the example in the previous notebook, but you might be wondering what exactly "consecutive" means for the threadId as a whole.

Let's take another look at our trusty 4x4 matrix to map this out.

![](../../images/GEMM1/index8.png)

As you can see, it is probably as you imagined: the numbering runs through the block row by row. But if we think about

a) algorithmically finding this for a thread

b) considering the possibility of a 3rd (z) dimension

we end up with the following threadId calculation:

In [None]:
threadId = threadIdx.x + (blockDim.x)*(threadIdx.y) + (blockDim.x)*(blockDim.y)*(threadIdx.z)

This is very similar to the indexing we have seen in the past: the second and third terms count all the threads we have already passed (the full rows before the current one, and the full x-y planes before the current z), and then we add the current column.
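To make the formula concrete, here is a small host-side C++ sketch of it (no GPU needed). `Dim3`, `flat_thread_id` and `warp_of` are names made up for this illustration; on the device you would use the built-in `threadIdx` and `blockDim` instead, and the warp size of 32 is assumed.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's built-in threadIdx / blockDim values.
struct Dim3 {
    unsigned x, y, z;
};

constexpr unsigned kWarpSize = 32;  // assumed warp size

// Flattened thread ID inside a block: x varies fastest, then y, then z.
unsigned flat_thread_id(Dim3 thread_idx, Dim3 block_dim) {
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y &&
           thread_idx.z < block_dim.z);
    return thread_idx.x + block_dim.x * thread_idx.y +
           block_dim.x * block_dim.y * thread_idx.z;
}

// Warp a thread lands in: consecutive flat IDs are chunked into groups of 32.
unsigned warp_of(Dim3 thread_idx, Dim3 block_dim) {
    return flat_thread_id(thread_idx, block_dim) / kWarpSize;
}
```

For the 4x4 block in the picture, the thread at x = 1, y = 2 gets `flat_thread_id({1, 2, 0}, {4, 4, 1})`, which is 1 + 4 * 2 = 9.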

#### Global Memory Coalescing
This lets us introduce a concept that is vital to achieving peak bandwidth. Threads in a warp have consecutive threadIds, so if we index our data by threadId, they also access consecutive global memory addresses. These sequential accesses by the threads of a warp can then be grouped together and executed as one. This powerful technique is known as **Global Memory Coalescing**.

This works by the GPU coalescing requests for consecutive global memory addresses into fewer, larger memory transactions, with an example from Simon's work being shown below:

![](../../images/GEMM1/coalescing1.png)

The first thing to note is that, for simplicity, the warp size has been reduced to 8 threads instead of the actual 32.

In the naive approach, each thread in the warp makes an individual access to global memory, even though the memory the threads need is contiguous. This means there are 8 memory accesses per warp.

Instead, we can group the memory accesses together, since they are to contiguous memory locations. As you can see at the bottom, four consecutive accesses are grouped into a single one, reducing the traffic to just two 32B loads instead of eight 8B loads.

You might be wondering why we limit ourselves to groups of four. Why not load for the whole warp at once and get even more efficient memory access? Good question. The answer comes down to the GPU's limitations, as well as the simplified model used in this example.

GPUs support 32B, 64B and 128B memory accesses, so the number of per-thread accesses that can be coalesced into one transaction depends on the size of each thread's access.

Let's play around with some examples to ground this intuition.

    i) as a thought experiment, if each thread wanted 32B of contiguous data from global memory (that is bytes, not bits: a single 32-bit float is only 4B), the requests of 4 neighbouring threads could be combined into a single 128B transaction

    ii) we could also consider the case where each of the 32 threads in the warp simply wants a 4B float, so now the GPU can fit all 32 memory accesses into a single 128B access, loading the data for the entire warp in one swoop. Our exploratory idea above is possible!

    iii) let's say that instead the data was of type double, so 8 bytes each. Since the total size of the memory accesses is now 8*32 = 256B, we will need two 128B transactions to accommodate it

So with global memory coalescing, we simply try to reduce the memory accesses happening by grouping them together as much as we can into one of the larger GPU-supported accesses.
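The arithmetic behind these cases can be sketched in a few lines of host-side C++. This is only the simplified model from the text (contiguous, aligned accesses and 128B as the largest transaction), not a description of what the hardware does internally, and `transactions_needed` is a made-up helper name.

```cpp
#include <cassert>

// Simplified model: the largest transaction mentioned in the text is 128 bytes.
constexpr unsigned kMaxTransactionBytes = 128;

// Number of 128B transactions needed when `num_threads` threads each read
// `bytes_per_thread` bytes from one contiguous, 128B-aligned region.
unsigned transactions_needed(unsigned bytes_per_thread, unsigned num_threads) {
    assert(bytes_per_thread > 0 && num_threads > 0);
    const unsigned total_bytes = bytes_per_thread * num_threads;
    // Round up: a partially filled transaction still has to be issued.
    return (total_bytes + kMaxTransactionBytes - 1) / kMaxTransactionBytes;
}
```

With this, `transactions_needed(4, 32)` gives 1 (a full warp of floats) and `transactions_needed(8, 32)` gives 2 (a full warp of doubles).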

Side note: this is also why uncoalesced access hurts so much for the small access sizes in cases ii and iii (4B and 8B). The smallest access size supported is 32B, so a lone 4B or 8B access still fetches a full 32B and the remaining bytes are simply wasted, which means bad bandwidth utilization.

As we saw, we use a slightly different thread indexing calculation this time around to make this happen, so let's scrutinize the thread indexing for the previous naive kernel, to see why coalescing was not possible then.

In [None]:
const uint row = blockIdx.x * blockDim.x + threadIdx.x;
const uint col = blockIdx.y * blockDim.y + threadIdx.y;

Consecutive threadIds differ in threadIdx.x, which here selects the row. So consecutive threads step through the rows first, and each warp walks down a column of its block, essentially traversing it column-major instead of row-major, i.e. something like this:

![](../../images/GEMM1/naivememoryaccess.png)

Since the threads in a warp share a column and differ in row, they need the same column of B but different rows of A. That is, on iteration i every thread needs B[i, col], but each one needs a different element of A: A[row1, i], A[row2, i], etc. With row-major storage these elements sit K floats apart, so they are NOT contiguous, and the memory access pattern becomes something like this:

![](../../images/GEMM1/coalescing2.png)


Since the memory accesses are not contiguous, we cannot coalesce here.

If we instead use the indexing introduced above, where we go in row-major order, we get something like this:

![](../../images/GEMM1/coalescememoryaccess.png)

Now the threads in a warp need the same row of A but different columns of B. On iteration i, every thread needs A[row, i], while the elements needed from B are B[i, col0], B[i, col1], ... (as seen in the diagram for B). These are contiguous in memory, which allows memory coalescing to happen in this case.
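To see the difference in numbers, here is a rough host-side C++ sketch that counts how many distinct 128B segments one warp touches under each pattern. It assumes row-major float matrices whose base address is 128B-aligned, and the function names are invented for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kSegmentBytes = 128;  // simplified transaction size

// Count how many distinct 128B segments a set of byte addresses falls into.
std::size_t segments_touched(const std::vector<std::size_t>& byte_addresses) {
    std::set<std::size_t> segments;
    for (std::size_t addr : byte_addresses) segments.insert(addr / kSegmentBytes);
    return segments.size();
}

// Byte offsets of A[row, i] for 32 threads in consecutive rows (naive pattern).
// A is row-major with K floats per row, so neighbours are K * 4 bytes apart.
std::vector<std::size_t> strided_a_reads(std::size_t K, std::size_t i) {
    assert(i < K);
    std::vector<std::size_t> addrs;
    for (std::size_t row = 0; row < 32; ++row)
        addrs.push_back((row * K + i) * sizeof(float));
    return addrs;
}

// Byte offsets of B[i, col] for 32 threads in consecutive columns (coalesced pattern).
std::vector<std::size_t> contiguous_b_reads(std::size_t N, std::size_t i) {
    assert(N >= 32);
    std::vector<std::size_t> addrs;
    for (std::size_t col = 0; col < 32; ++col)
        addrs.push_back((i * N + col) * sizeof(float));
    return addrs;
}
```

For K = N = 1024, `segments_touched(strided_a_reads(1024, 0))` comes out as 32 (one segment per thread), while `segments_touched(contiguous_b_reads(1024, 0))` comes out as 1.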

Before we talk about the code change for this kernel, there is one small caveat to address, shown in the following illustration:

![](../../images/GEMM1/coalescecaveat.png)

In both these cases, even though the memory access pattern is different (seems a bit off in the first one), the memory access can be coalesced!

This is because the only condition for full coalescing is that the threads within the warp access consecutive addresses overall; the order of these accesses within the warp does not have to be consecutive. This intuitively makes sense: even in the first case, one larger grouped access fetches exactly the same memory as it would if the threads accessed it in the order of their numbering.

Also, to clarify a point that I did not catch at first: coalescing is flexible. It is not the case that if the accesses of a warp aren't all contiguous, we can't coalesce at all and have to fall back to the naive memory access. Rather, the hardware coalesces as much as it can.

Let's quickly go through an example. Say we have a warp of 32 threads, where the first 31 threads all make contiguous 4B memory accesses but the last (32nd) thread makes a 4B access far away. The only penalty is one extra transaction for this last thread. The first 128B transaction has 31*4 = 124B used and 4B wasted, and the second 128B transaction has only 4B used and the other 124B wasted. As we can see, that is a lot of wasted bandwidth: a whole 128B access worth in total.
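Both points (order within the warp does not matter, and a single stray access costs one extra transaction) can be checked with a small host-side C++ sketch. It uses the same simplified model as the text, where every touched 128B segment costs one full transaction; all names and the far-away offset of 4096 are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

constexpr std::size_t kSegmentBytes = 128;  // transaction size in this simplified model
constexpr std::size_t kFloatBytes = 4;      // each thread reads one 4B float

struct WarpLoadCost {
    std::size_t transactions;  // distinct 128B segments that must be fetched
    std::size_t wasted_bytes;  // fetched bytes that no thread asked for
};

// Cost of one warp-wide load, given the byte address each thread reads.
// Assumes distinct, 4B-aligned addresses, so no read straddles a segment.
WarpLoadCost warp_load_cost(const std::vector<std::size_t>& byte_addresses) {
    assert(!byte_addresses.empty());
    std::set<std::size_t> segments;
    for (std::size_t addr : byte_addresses) segments.insert(addr / kSegmentBytes);
    const std::size_t fetched = segments.size() * kSegmentBytes;
    const std::size_t used = byte_addresses.size() * kFloatBytes;
    return {segments.size(), fetched - used};
}

// 32 threads covering one contiguous 128B range, but in a scrambled order.
std::vector<std::size_t> scrambled_warp() {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < 32; ++t) addrs.push_back(((t * 7) % 32) * kFloatBytes);
    return addrs;
}

// 31 threads read contiguously, the 32nd reads 4B from a far-away address.
std::vector<std::size_t> warp_with_outlier() {
    std::vector<std::size_t> addrs;
    for (std::size_t t = 0; t < 31; ++t) addrs.push_back(t * kFloatBytes);
    addrs.push_back(4096);  // arbitrary far-away byte offset
    return addrs;
}
```

Here `warp_load_cost(scrambled_warp())` reports 1 transaction with 0 wasted bytes, and `warp_load_cost(warp_with_outlier())` reports 2 transactions with 128 wasted bytes, matching the walkthrough.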

### Code
So, as you will probably have noticed, we only really focused on using a different indexing approach and not much else, and sure enough, that's all we need to change in the code!


In [None]:
const int row = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
const int col = blockIdx.y * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);

// same inner body

if (row < M && col < N) {
    float tmp = 0.0;
    for (int i = 0; i < K; i++) {
        tmp += A[row * K + i] * B[i * N + col];
    }

    C[row * N + col] = alpha * tmp + beta * C[row * N + col];
}

I know that wasn't as straightforward as you expected, but stay with me now.

And then we could call it like so, which also sheds some light on the BLOCKSIZE constant you were probably going to ask about.

In [None]:
dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
dim3 blockDim(32 * 32);
sgemm_coalescing<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);

Hmm, notice anything different? Go on, take another look. 

Yep! We have made the blocks 1-dimensional now, with the same amount of threads, effectively flattening them.
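One detail of the launch configuration worth spelling out is `CEIL_DIV`, which is not defined in this notebook. Assuming it is the usual "integer division, rounded up" macro, the grid size works out as in this host-side C++ sketch (`GridShape` and `grid_for` are hypothetical helpers, only here to make the arithmetic concrete).

```cpp
#include <cassert>

// Assumed definition: integer division rounded up (not shown in this notebook).
#define CEIL_DIV(M, N) (((M) + (N) - 1) / (N))

// Number of 32x32 tiles along each launch dimension.
struct GridShape {
    unsigned x;  // tiles along the first launch dimension (covers M)
    unsigned y;  // tiles along the second launch dimension (covers N)
};

GridShape grid_for(unsigned M, unsigned N) {
    assert(M > 0 && N > 0);
    return {CEIL_DIV(M, 32u), CEIL_DIV(N, 32u)};
}
```

Rounding up is what guarantees that a matrix whose size is not a multiple of 32 is still fully covered: `grid_for(100, 33)` gives a 4 x 2 grid, and the bounds check inside the kernel takes care of the threads that fall outside the matrix.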

Let's digest all this one by one.

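So, why 1D? With a 2D block, the matrix direction that a warp runs along is decided implicitly by which of threadIdx.x and threadIdx.y we use for the row and which for the column, since threadIdx.x is the component that varies within a warp. By flattening the block into 1D, we take control of that mapping ourselves: warp w is simply threadIdx.x = 32w ... 32w + 31, and we choose the arithmetic that places those 32 consecutive threads on 32 consecutive columns of one row of the tile. (A 2D block with the roles of threadIdx.x and threadIdx.y swapped would produce the same access pattern.)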

Again, looking at the indexing:

In [None]:
const int row = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
const int col = blockIdx.y * BLOCKSIZE + (threadIdx.x % BLOCKSIZE);

BLOCKSIZE is 32 in this case, and NOT 32*32. Even though the block is one-dimensional and threadIdx.x spans 32*32 = 1024 threads, BLOCKSIZE is the side length of the tile: blocks are just a way of conceptually mapping the threads inside them onto ACTUAL entries in the matrix.

What this means is that a block being one-dimensional does NOT constrain its threads to cover a one-dimensional region, i.e. a single row of the output matrix. A block is just a collection of threads, and we decide how to map those threads onto our data structure. In this case, we logically rebuild the same 32x32 2D tiles as before on the output matrix, one per block.

A good analogy: think of a CUDA block as 1024 sports players standing in a line; these are our 'threads'. Now we need to arrange them on a field in a formation, and this formation can be anything: we can make any shape out of the players by telling each one where to stand. This is all that is happening here. We address our threads by one-dimensional indices, but then arrange them as tiles on the output matrix.

So now, looking at our indexing code:
- The row for a thread is found by first offsetting to the first row of the block's tile, simply by skipping blockIdx.x * 32 rows. Now that we have arrived at the start of the tile for the current thread's block, we find the exact offset within it by integer division of threadIdx.x by 32.
- The column for a thread is found by, again, first offsetting to the first column of the block's tile, skipping blockIdx.y * 32 columns. Then we find the exact offset by taking threadIdx.x modulo 32.

What this does is simply mould the threads into the 2D tile as intended.

Example (thread numbers are equal to threadIdx.x, since there is only an index in one dimension):

    - Thread 0 -> (row, col) within the tile = (0,0)
    - Thread 1 -> (0,1)
    - Thread 2 -> (0,2)
    ...
    - Thread 31 -> (0,31)
    - Thread 32 -> (1,0)
    - Thread 33 -> (1,1)
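The same mapping can be written as a tiny host-side C++ function, assuming the coalesced layout where the integer division gives the row and the remainder gives the column inside the tile. `TileCoord` and `tile_coord` are names invented for this sketch.

```cpp
#include <cassert>

constexpr unsigned BLOCKSIZE = 32;  // tile edge length, matching the launch above

struct TileCoord {
    unsigned row;
    unsigned col;
};

// Position inside the 32x32 tile for a 1D thread index in [0, 1024).
TileCoord tile_coord(unsigned thread_idx_x) {
    assert(thread_idx_x < BLOCKSIZE * BLOCKSIZE);
    return {thread_idx_x / BLOCKSIZE,   // row changes once every 32 threads
            thread_idx_x % BLOCKSIZE};  // column changes with every thread
}
```

So the 32 threads of one warp (say 32 to 63) all share row 1 and cover columns 0 to 31, which is exactly the contiguous pattern we were after.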


<sub> Small optional note about low-level implications that I thought was pretty cool. Making this change, i.e. GMEM coalescing, does not show up as any special instruction in the assembly code, since access coalescing is all done at runtime by the hardware. This intuitively makes sense: coalescing requires contiguous access, which cannot be guaranteed at compile time, since we pass the matrix pointers as function arguments at runtime. </sub>

The last point is that you might have noticed we did not mention shared memory (SMEM) at all. It is simply not used here, which leaves room for more optimisations! However, it should be noted that even though data is not stored in SMEM to be used by other threads, caches still exist.

So when one warp performs a large coalesced load, the data it fetches also passes through the caches. Let's say the next warp needs the same row (but different columns): this row was just loaded by the previous warp, so it is likely to still be in the L1 cache, giving us a cache hit and no extra trip all the way to global memory.

So, even though this is just the second kernel, given the caching opportunities, it can run surprisingly well, especially on smaller problems.

However, the limitation is that caching is opportunistic. The L1 cache is pretty tiny (the L2 cache is bigger, but still limited), and the row data of A and B for large matrices exceeds both pretty quickly, which means a lot of frequent cache misses.

And there we have it! 2nd kernel down. Get some more coffee, take a little walk around, and whenever you're ready, I'll see you on Kernel 3.


