# 8) Blocked Matrix-Matrix Multiplication

Last time:

- CPU optimization

Today:

1. [Blocked matrix-matrix multiply](#blocked-matrix-matrix-multiply)
2. [Blocking for registers](#blocking-for-registers)

## 1. Blocked matrix-matrix multiply 

In this lecture, we are primarily concerned with optimizing the code for small matrix-matrix multiplies, that is, problems that fit in the cache; efficient use of the cache for large matrices is a separate topic.

:::{tip}
For this lecture, watch [video 2.2.1](https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-basic-idea.html) from the LAFF course to get the basic idea.
:::

The key concept for this unit is blocked matrix-matrix multiply. Namely, we will think of the matrices as partitioned into a set of blocks:

$$
  A =
   \begin{bmatrix}
    \begin{array}{c|c|c|c}
     A_{11} & A_{12} & \dots  & A_{1K}\\
     \hline
     A_{21} & A_{22} & \dots  & A_{2K}\\
     \hline
     \vdots & \vdots & \ddots & \vdots\\
     \hline
     A_{M1} & A_{M2} & \dots  & A_{MK}
    \end{array}
   \end{bmatrix},
   B =
   \begin{bmatrix}
    \begin{array}{c|c|c|c}
     B_{11} & B_{12} & \dots  & B_{1N}\\
     \hline
     B_{21} & B_{22} & \dots  & B_{2N}\\
     \hline
     \vdots & \vdots & \ddots & \vdots\\
     \hline
     B_{K1} & B_{K2} & \dots  & B_{KN}
    \end{array}
   \end{bmatrix},
   C =
   \begin{bmatrix}
    \begin{array}{c|c|c|c}
      C_{11} & C_{12} & \dots  & C_{1N}\\
      \hline
      C_{21} & C_{22} & \dots  & C_{2N}\\
      \hline
      \vdots & \vdots & \ddots & \vdots\\
      \hline
      C_{M1} & C_{M2} & \dots  & C_{MN}
    \end{array}
   \end{bmatrix},
$$

where each block of $C$ is of size $m_{b} \times n_{b}$. The matrices $A$ and $B$ are partitioned conformally into $M \times K$ and $K \times N$ blocks, respectively; the block size associated with $K$ is $k_{b}$, so each block $A_{IP}$ is of size $m_{b} \times k_{b}$ and each block $B_{PJ}$ is of size $k_{b} \times n_{b}$.

With this partitioning of the matrix, the update for block $IJ$ of $C$ is then

$$
C_{IJ} := C_{IJ} + \sum_{P=1}^{K} A_{IP} B_{PJ}
$$
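The block update above can be sketched in a few lines of Python. This is only a minimal illustration, not a tuned kernel: the function names `blocked_matmul` and `naive_matmul`, the list-of-lists storage, and the use of a plain triple loop for the inner block multiply are all assumptions made for this example. Block sizes that do not divide the matrix dimensions evenly are handled by clipping the last block.

```python
def blocked_matmul(A, B, C, mb, nb, kb):
    """Update C := C + A B in place, one block C_IJ at a time.

    A is m x k, B is k x n, C is m x n, all stored as lists of rows.
    mb, nb, kb are the block sizes (hypothetical parameter names).
    """
    m, k, n = len(A), len(B), len(B[0])
    for i0 in range(0, m, mb):            # block row I of C
        for j0 in range(0, n, nb):        # block column J of C
            for p0 in range(0, k, kb):    # block index P in the sum
                # Inner block multiply: C_IJ := C_IJ + A_IP B_PJ
                for i in range(i0, min(i0 + mb, m)):
                    for j in range(j0, min(j0 + nb, n)):
                        acc = C[i][j]
                        for p in range(p0, min(p0 + kb, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] = acc
    return C


def naive_matmul(A, B):
    """Reference product A B with the standard triple loop."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


# Small integer example so that the comparison is exact.
A = [[i + j for j in range(4)] for i in range(6)]      # 6 x 4
B = [[i * j + 1 for j in range(8)] for i in range(4)]  # 4 x 8
C = [[0] * 8 for _ in range(6)]                        # 6 x 8, starts at zero
blocked_matmul(A, B, C, mb=2, nb=2, kb=2)
assert C == naive_matmul(A, B)
```

The three outer loops walk over the blocks and the three inner loops perform one block multiply, so the arithmetic is identical to the unblocked product; only the order in which the entries are visited changes.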

Three questions:

 - What sizes to pick for $m_{b}$, $n_{b}$, and $k_{b}$ to have a matrix-matrix multiplication that makes sense (dimension-wise)?
 - What sizes to pick for $m_{b}$, $n_{b}$, and $k_{b}$ in order to most efficiently use the available resources?
 - Which of the previous forms of matrix-matrix multiply should be used for the inner block multiply $A_{IP} B_{PJ}$?

### Example:

If we partition the matrix $C$ into $3 \times 4$ blocks (that is, $3$ blocks in the row direction and $4$ blocks in the column direction):

$$
C =
   \begin{bmatrix}
    \begin{array}{c|c|c|c}
      C_{11} & C_{12} & C_{13}  & C_{14}\\
      \hline
      C_{21} & C_{22} & C_{23}  & C_{24}\\
      \hline
      C_{31} & C_{32} & C_{33}  & C_{34}
    \end{array}
   \end{bmatrix},
$$

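then we _have to_ partition the matrix $A$ into $3$ blocks in the row direction and the matrix $B$ into $4$ blocks in the column direction, because block $C_{IJ}$ takes its rows from block row $I$ of $A$ and its columns from block column $J$ of $B$: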

$$
A =
   \begin{bmatrix}
    \begin{array}{cccc}
      &  &   & \\
      \hline
      &  &   & \\
      \hline
      &  &   & 
    \end{array}
   \end{bmatrix},
   B =
   \begin{bmatrix}
    \begin{array}{c|c|c|c}
      &  &  & \\
      &  &  & \\
      &  &  & \\
    \end{array}
   \end{bmatrix}
$$
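To make the conformity requirement concrete, the following sketch computes the block grids of $C$, $A$, and $B$ from the matrix and block sizes. The helper name `block_grid` and the sizes in the usage lines ($C$ of size $6 \times 8$, inner dimension $10$, $m_b = n_b = 2$, $k_b = 5$) are assumptions chosen for illustration; they reproduce the $3 \times 4$ partition of $C$ from the example.

```python
def block_grid(m, n, k, mb, nb, kb):
    """Return the number of blocks (rows, columns) for C, A, and B.

    C is m x n, A is m x k, B is k x n; mb, nb, kb are the block sizes.
    Ceiling division accounts for a smaller last block when a block size
    does not divide the corresponding dimension.
    """
    M = -(-m // mb)   # block rows of C and of A
    N = -(-n // nb)   # block columns of C and of B
    K = -(-k // kb)   # block columns of A and block rows of B
    return {"C": (M, N), "A": (M, K), "B": (K, N)}


# Hypothetical sizes: C is 6 x 8 with 2 x 2 blocks, inner dimension 10 with kb = 5.
grid = block_grid(m=6, n=8, k=10, mb=2, nb=2, kb=5)
assert grid["C"] == (3, 4)      # the 3 x 4 partition of C from the example
assert grid["A"][0] == 3        # A is forced to have 3 block rows
assert grid["B"][1] == 4        # B is forced to have 4 block columns
```

The partition of $C$ fixes $M$ and $N$ only; the number of blocks $K$ in the shared dimension remains a free choice, as long as $A$ and $B$ use the same one.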