# Kernel 5: 2D Blocktiling

As mentioned, our objective now is to increase arithmetic intensity: essentially, we want more work done per load we make.

We are going to up the ante, with our threads evolving into multitasking agents: each thread now computes a whopping 8x8 grid of elements of C. It's come a long way since its single-element days.

Let's step through the code first, since I think we are getting familiar with the flow of these 'chunk-based' kernels. Afterwards, we can flesh out some intuition with further explanation and diagrams.

In [None]:
float threadResults[TM * TN] = {0.0};

Pretty much the same as before, but now we are storing a mini grid of results for each thread, rather than a single column.

__Note: TM and TN are both 8 in our example here__
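
To make the per-thread grid concrete, here is a small host-side C++ sketch (plain C++, so it runs without a GPU) of how many threads a block needs and which corner of the block tile each thread owns. The tile sizes BM = BN = 128 and the threadRow / threadCol formulas are assumptions about how this kind of kernel is usually laid out, not something shown in the snippets here.

```cpp
#include <cassert>

// Assumed block tile sizes (hypothetical; only TM = TN = 8 is given in the text).
constexpr unsigned BM = 128, BN = 128; // block tile of C
constexpr unsigned TM = 8, TN = 8;     // per-thread tile of C

// Each thread owns TM * TN outputs, so a block needs this many threads.
constexpr unsigned threadsPerBlock() { return (BM * BN) / (TM * TN); }

struct TileOrigin { unsigned row, col; };

// Top-left corner (inside the block tile) of the TM x TN grid owned by a thread.
// Assumes threads are laid out row-major over a (BM / TM) x (BN / TN) grid.
TileOrigin tileOrigin(unsigned threadIdx) {
  assert(threadIdx < threadsPerBlock());
  const unsigned threadRow = threadIdx / (BN / TN);
  const unsigned threadCol = threadIdx % (BN / TN);
  return {threadRow * TM, threadCol * TN};
}
```

Under these assumptions a block has 256 threads, and thread 1 owns the 8x8 grid starting at row 0, column 8 of the block tile.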

In [None]:
float regM[TM] = {0.0};
float regN[TN] = {0.0};

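Something new here: regM and regN are small thread-private arrays that cache the As and Bs entries a thread is currently working on. Because they are private to the thread, the compiler can keep them in registers, which are the fastest storage we can use.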

Now, taking a look at the (very similar) outer loop:

In [None]:
// outer loop
for (uint bkIdx = 0; bkIdx < K; bkIdx += BK) {
  // populate the SMEM caches
  for (uint loadOffset = 0; loadOffset < BM; loadOffset += strideA) {
    As[(innerRowA + loadOffset) * BK + innerColA] =
        A[(innerRowA + loadOffset) * K + innerColA];
  }
  for (uint loadOffset = 0; loadOffset < BK; loadOffset += strideB) {
    Bs[(innerRowB + loadOffset) * BN + innerColB] =
        B[(innerRowB + loadOffset) * N + innerColB];
  }
  __syncthreads();

  // ... rest of outer loop body
}


So, we can see the outer loop is the same, just advancing through the columns of A and rows of B chunk by chunk.

However, the loading into the SMEM caches is now wholly different: As and Bs each get a separate loading loop, which we did not need before.

This is because each thread now loads multiple elements instead of just one. Evidently, it still won't load all the elements it needs by itself; that is the power of shared memory, since other threads will also load in elements that a thread will need.

Each thread goes through both As and Bs and loads one element per stride, as we slide along the whole of As (and Bs) in strides.

Let's take a look at what is happening visually, for As:

![](../../images/GEMM1/strideloading.png)

Each thread loads only one entry per cycle (iteration), with each cycle covering a sub-chunk of As that moves down by strideA rows.

Similarly, this is what happens for Bs, with each sub-chunk moving down its rows by strideB.
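
If you want to convince yourself that the strided loads really cover As with no gaps and no double writes, the loop can be replayed on the host. This is only a sketch: BM = 128, BK = 8, 256 threads per block, and the innerRowA / innerColA / strideA formulas are assumptions about how the full kernel defines them, since they are not shown in the snippet above.

```cpp
#include <cassert>
#include <vector>

// Assumed sizes (hypothetical; not stated in the snippets above).
constexpr unsigned BM = 128, BK = 8, NUM_THREADS = 256;

// Replays the As-loading loop for every thread of one block and
// returns how many times each slot of As was written.
std::vector<unsigned> countAsWrites() {
  std::vector<unsigned> writes(BM * BK, 0);
  const unsigned strideA = NUM_THREADS / BK; // rows of As covered per iteration
  for (unsigned tid = 0; tid < NUM_THREADS; ++tid) {
    const unsigned innerRowA = tid / BK; // assumed thread-to-row mapping
    const unsigned innerColA = tid % BK; // assumed thread-to-column mapping
    for (unsigned loadOffset = 0; loadOffset < BM; loadOffset += strideA) {
      const unsigned idx = (innerRowA + loadOffset) * BK + innerColA;
      assert(idx < writes.size()); // every load must stay inside As
      ++writes[idx];
    }
  }
  return writes;
}

// True if every slot of As is written exactly once by the block.
bool everySlotWrittenOnce() {
  for (unsigned w : countAsWrites()) {
    if (w != 1) return false;
  }
  return true;
}

// Number of As entries a single thread loads per outer-loop iteration.
constexpr unsigned loadsPerThread() { return BM / (NUM_THREADS / BK); }
```

With these numbers strideA is 32, so each thread makes 4 loads into As and the 256 threads together fill all 1024 slots.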

Notice that even though each thread computes a whole 8x8 grid of output entries, it loads entries that span the whole of the current As buffer. Some of these will never be used by the thread itself, but it still loads them into shared memory for the other threads.

This really highlights how the first step of blocktiling is collaboration-focused, with our thread workers being strong in arms.

Then the __syncthreads() call, again, just ensures that all the workers are done collaboratively loading into SMEM, and can be aligned to start using the fruits of their labour for dot products.

Looking further into the rest of the outer loop body:

In [None]:
  // first inner loop
  for (uint dotIdx = 0; dotIdx < BK; ++dotIdx) {
    // load relevant As & Bs entries into registers
    for (uint i = 0; i < TM; ++i) {
      regM[i] = As[(threadRow * TM + i) * BK + dotIdx];
    }
    for (uint i = 0; i < TN; ++i) {
      regN[i] = Bs[dotIdx * BN + threadCol * TN + i];
    }
    // perform outer product on register cache, accumulate

  }

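For our first inner loop, we step dotIdx through the BK dimension of the cached tiles. On each step, the thread copies the TM entries of As and the TN entries of Bs that it needs from shared memory into its regM and regN registers.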

In [None]:
    // into threadResults
    for (uint resIdxM = 0; resIdxM < TM; ++resIdxM) {
      for (uint resIdxN = 0; resIdxN < TN; ++resIdxN) {
        threadResults[resIdxM * TN + resIdxN] +=
            regM[resIdxM] * regN[resIdxN];
      }
    }

Jeez, the loops keep on coming. This is still inside the same first inner loop. This is also a good time to remind you about the raw code living in the kernels/ folder, under the same name as this .ipynb file: it is a lot easier to look at the code in its entirety, as it can be jarring to see it in chunks like this, especially with the number of loops we are seeing.
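
To see why the register caches pay off, here is a host-side C++ sketch of one thread's inner loop that also counts shared-memory reads and multiply-adds. It is a simplified stand-in rather than the kernel itself: BK = 8 is an assumed value, the Counts struct and accumulateTile function are hypothetical helpers, and the as / bs arguments hold only the slices this one thread would read (TM x BK and BK x TN).

```cpp
#include <cassert>
#include <vector>

constexpr unsigned TM = 8, TN = 8; // per-thread tile, as in the text
constexpr unsigned BK = 8;         // assumed tile depth (hypothetical)

struct Counts { unsigned smemReads = 0, fmas = 0; };

// One thread's work for one block tile: for each dotIdx, cache TM values of As
// and TN values of Bs, then add their outer product to the TM x TN results.
// as is TM x BK and bs is BK x TN, both row-major.
Counts accumulateTile(const std::vector<float>& as, const std::vector<float>& bs,
                      std::vector<float>& threadResults) {
  assert(as.size() == TM * BK && bs.size() == BK * TN);
  assert(threadResults.size() == TM * TN);
  Counts c;
  float regM[TM], regN[TN];
  for (unsigned dotIdx = 0; dotIdx < BK; ++dotIdx) {
    for (unsigned i = 0; i < TM; ++i) { regM[i] = as[i * BK + dotIdx]; ++c.smemReads; }
    for (unsigned i = 0; i < TN; ++i) { regN[i] = bs[dotIdx * TN + i]; ++c.smemReads; }
    for (unsigned m = 0; m < TM; ++m) {
      for (unsigned n = 0; n < TN; ++n) {
        threadResults[m * TN + n] += regM[m] * regN[n];
        ++c.fmas;
      }
    }
  }
  return c;
}
```

Per dotIdx step that is TM + TN = 16 reads feeding TM * TN = 64 multiply-adds, so each value pulled out of shared memory is reused several times from registers.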

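So this first inner loop is just two steps per dotIdx: load a slice of As and a slice of Bs into registers, then add their outer product into threadResults.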

In [None]:
  __syncthreads();

  // advance blocktile
  A += BK;     // move BK columns to right
  B += BK * N; // move BK rows down

Again, these are the standard maintenance commands we have been running for the previous few kernels, inside the main outer loop: they ensure that warps don't rush ahead and corrupt the SMEM caches while others are still reading them, and that the pointers are moved appropriately for the next iteration.
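
Putting the pieces together, here is a host-side C++ emulation of the whole flow, checked against a plain triple loop. Treat it as a sketch under assumptions: the tile sizes are shrunk so it runs quickly, the innerRow / innerCol / threadRow / threadCol formulas and the final write-back are guesses at what the full kernel does, any alpha / beta scaling is ignored, and emulateBlocktiling and naiveGemm are hypothetical names. Splitting the work into a load phase and a compute phase stands in for __syncthreads().

```cpp
#include <cassert>
#include <vector>

// Small assumed tile sizes so the emulation runs quickly (hypothetical values).
constexpr unsigned BM = 16, BN = 16, BK = 4, TM = 4, TN = 4;
constexpr unsigned NUM_THREADS = (BM * BN) / (TM * TN);
static_assert(NUM_THREADS % BK == 0 && NUM_THREADS % BN == 0, "strides must be whole rows");

// Host-side emulation: C (MxN) = A (MxK) * B (KxN), all row-major.
std::vector<float> emulateBlocktiling(const std::vector<float>& A, const std::vector<float>& B,
                                      unsigned M, unsigned N, unsigned K) {
  assert(M % BM == 0 && N % BN == 0 && K % BK == 0);
  std::vector<float> C(M * N, 0.0f);
  const unsigned strideA = NUM_THREADS / BK, strideB = NUM_THREADS / BN;
  for (unsigned by = 0; by < M / BM; ++by) {
    for (unsigned bx = 0; bx < N / BN; ++bx) {
      std::vector<float> As(BM * BK), Bs(BK * BN);
      std::vector<float> results(NUM_THREADS * TM * TN, 0.0f); // one grid per thread
      unsigned aOff = by * BM * K; // start of this block's rows of A
      unsigned bOff = bx * BN;     // start of this block's columns of B
      for (unsigned bkIdx = 0; bkIdx < K; bkIdx += BK) {
        // Phase 1: every thread does its strided loads (ends at __syncthreads()).
        for (unsigned tid = 0; tid < NUM_THREADS; ++tid) {
          const unsigned innerRowA = tid / BK, innerColA = tid % BK;
          const unsigned innerRowB = tid / BN, innerColB = tid % BN;
          for (unsigned off = 0; off < BM; off += strideA)
            As[(innerRowA + off) * BK + innerColA] = A[aOff + (innerRowA + off) * K + innerColA];
          for (unsigned off = 0; off < BK; off += strideB)
            Bs[(innerRowB + off) * BN + innerColB] = B[bOff + (innerRowB + off) * N + innerColB];
        }
        // Phase 2: every thread accumulates outer products from the shared tiles.
        for (unsigned tid = 0; tid < NUM_THREADS; ++tid) {
          const unsigned threadRow = tid / (BN / TN), threadCol = tid % (BN / TN);
          float* threadResults = &results[tid * TM * TN];
          float regM[TM], regN[TN];
          for (unsigned dotIdx = 0; dotIdx < BK; ++dotIdx) {
            for (unsigned i = 0; i < TM; ++i) regM[i] = As[(threadRow * TM + i) * BK + dotIdx];
            for (unsigned i = 0; i < TN; ++i) regN[i] = Bs[dotIdx * BN + threadCol * TN + i];
            for (unsigned m = 0; m < TM; ++m)
              for (unsigned n = 0; n < TN; ++n)
                threadResults[m * TN + n] += regM[m] * regN[n];
          }
        }
        aOff += BK;     // A += BK: move BK columns to the right
        bOff += BK * N; // B += BK * N: move BK rows down
      }
      // Write each thread's grid back to its place in C (assumed write-back step).
      for (unsigned tid = 0; tid < NUM_THREADS; ++tid) {
        const unsigned threadRow = tid / (BN / TN), threadCol = tid % (BN / TN);
        for (unsigned m = 0; m < TM; ++m)
          for (unsigned n = 0; n < TN; ++n)
            C[(by * BM + threadRow * TM + m) * N + bx * BN + threadCol * TN + n] =
                results[tid * TM * TN + m * TN + n];
      }
    }
  }
  return C;
}

// Plain triple-loop reference for comparison.
std::vector<float> naiveGemm(const std::vector<float>& A, const std::vector<float>& B,
                             unsigned M, unsigned N, unsigned K) {
  std::vector<float> C(M * N, 0.0f);
  for (unsigned i = 0; i < M; ++i)
    for (unsigned j = 0; j < N; ++j)
      for (unsigned k = 0; k < K; ++k)
        C[i * N + j] += A[i * K + k] * B[k * N + j];
  return C;
}
```

Calling emulateBlocktiling and naiveGemm on the same inputs (with M and N multiples of 16 and K a multiple of 4) should give matching results, which is a handy sanity check on the index arithmetic before debugging on the GPU.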