# Background

## Memory Wall

- The number of transistors on a microchip has roughly doubled every two years (Moore's law), and as a result the capacity for performing arithmetic operations has increased exponentially.

    - A **transistor** is a tiny electronic switch — the basic “work unit” of digital logic and memory. Modern CPUs and GPUs have billions of them.

    - More transistors enable more logic gates, memory cells, and specialized units → greater parallelism, higher throughput, and more complex computations.

- However, data movement capacity **(memory bandwidth)** hasn’t increased as fast, creating the **memory wall** — a key bottleneck in deep learning and Tensor Core workloads.
    - Bandwidth is how much **data per second** your memory system can deliver to (and accept from) the **compute units**
    - **Analogy**:
        - Think of a highway:
            - Cars = **bytes**
            - Number of lanes = **bus width**
            - Speed limit = **clock rate**
            - Toll booths/merges = **memory controllers/channels**
        - How many cars reach the city each second? That's the bandwidth
    - Do not confuse **bandwidth** with **latency**:
        - **Latency** is how long one trip takes
        - **Bandwidth** is how many bytes per second you can keep flowing

- Hence, if we want to fully utilize Tensor Cores, we must deliver more bytes per second from DRAM to the compute units, or do more work on each byte we move.
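
To make the highway analogy concrete, peak bandwidth is roughly the bus width (lanes) times the per-pin data rate (speed limit). The sketch below uses a 256-bit bus at 10 Gbit/s per pin. These are assumed, illustrative values rather than figures from the text above, and `peak_bandwidth_gb_per_s` is a hypothetical helper name.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, data_rate_gbit_per_pin: float) -> float:
    """Theoretical peak bandwidth in GB/s: lanes (bits) x speed (Gbit/s per lane) / 8."""
    return bus_width_bits * data_rate_gbit_per_pin / 8.0


# Assumed, illustrative values: a 256-bit bus at 10 Gbit/s per pin.
bandwidth = peak_bandwidth_gb_per_s(256, 10.0)
print(f"{bandwidth:.0f} GB/s")  # 320 GB/s
```

Note that this number says nothing about latency: a wider or faster bus moves more bytes per second, but a single trip does not necessarily get shorter.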

![image.png](../../images/GEMM2/simple_computer.png)

## Roofline Charts 

- The roofline model states that performance will be limited by **one of two things**:
    - **Compute-bound:**
        - Data is readily available in fast memory
        - The bottleneck is the number of floating-point operations per second (**FLOP/s**) your hardware can execute
        - Adding more **bandwidth** here won’t help; only faster **ALUs** (Arithmetic Logic Units, which perform the computations), **Tensor Cores**, or more **parallelism** will improve performance
    - **Memory-bound:**
        - The **compute units can’t be fed fast enough** because data has to be fetched from slow memory (**DRAM**)
        - The bottleneck is **memory bandwidth (β bytes/sec)**
        - Doubling **compute** units won’t help unless you also increase your **bandwidth** or improve data reuse in fast memory.
    - Plot Interpretation:
        - **X-axis:** Operational intensity (also called arithmetic or computational intensity) = **FLOPs** per byte transferred (how much work you do per byte fetched)
        - **Y-axis:** Achievable performance (**FLOP/s**)
        - The **roof** has two segments:
            - **Sloped line** → memory-bound region
            - **Flat line** → compute-bound region
            - Where you land depends on how much computation you can do before you have to fetch more data


![image.png](../../images/GEMM2/roof_line.png)

## Explanation
- Any given computation has a certain number of FLOPs that need to be performed. For example, to multiply an $M \times K$ matrix by a $K \times N$ matrix we need to perform about $2MNK$ FLOPs.
    - For each output element $(i, j)$ you do $K$ multiplies and $K - 1$ adds, which is $\approx 2K$ FLOPs (1 FLOP, or floating-point operation, per multiply and 1 FLOP per add).
    - There are $M \cdot N$ outputs, so the total is $\approx (MN)(2K) = 2MNK$ FLOPs.
- The more FLOPs/sec our algorithm can achieve, the faster we can get the matrix multiplication done.
- The roofline model gives us an upper bound on the FLOPs/sec we can achieve, subject to $\tau$ and $\beta$ which are fixed properties of our hardware.
    - $\tau$ (tau) = the **peak compute throughput** of your device for a given datatype/op (e.g., **FP32** (32-bit floating point), **FP16/Tensor**)
        - $\tau$ is typically a large number. For example, for the T4 GPU's FP16 Tensor Cores, $\tau$ = 65,000,000,000,000 FLOP/s = 65 TFLOP/s (1 TFLOP = $10^{12}$ FLOPs). Units: FLOP/s
    - $\beta$ (beta) = the **peak sustained memory bandwidth** between a given memory level and the cores (e.g., DRAM <-> SM). Units: bytes/s
- We will refer to achieved FLOPs/sec as $T$ for throughput, and the upper bound on $T$ as $T_{max}$
- The maximum FLOP/sec we can achieve ($T_{max}$) is modeled as a function of a variable called *computational intensity* ($I$), which is a property of the algorithm we will write.
    - *Computational Intensity*: FLOPs done per byte moved between that memory level and the cores (reads + writes).
        - Units: FLOP/byte
- This metric measures the "data reuse" of our algorithm
    - For each byte moved from slow memory to fast memory, how many FLOPs do we perform on it.
- The roofline model says the upper bound on FLOPs/sec we can achieve ($T_{max}$) is the minimum of our computational intensity times memory bandwidth and the peak floating point throughput of our hardware:
                                    
    $
    T_{max} = \min(\beta \cdot I, \tau)
    $

- The ridge point (where the sloped line → memory-bound meets the flat line → compute-bound) is given by:

    $
    I^* = \frac{\tau}{\beta}
    $

- The roofline model says there are two ways $T_{max}$ can be limited:
    - $T_{max}$ can never exceed $\tau$. Even if we perform infinitely many operations on each byte we move into fast memory, we are still limited by the peak floating point throughput of our hardware.
        - When $\tau$ is our limiting factor we are *compute-bound*. This is a great place to be.
    - $T_{max}$ may also be limited by our memory bandwidth times the computational intensity of our algorithm. If $\tau$ were infinite, the achieved floating point throughput would simply be the number of bytes/sec being moved into fast memory times the number of FLOPs performed per byte moved: $\beta \cdot I$.
        - When we multiply $\beta$ and $I$, the units cancel out to give FLOP/sec.
            - $(\text{bytes/s}) \cdot (\text{FLOP/byte}) = \text{FLOP/s}$
        - If $\beta \cdot I < \tau$, or equivalently $I < \tau/\beta$, then we are *memory-bound*, meaning we are limited by how fast we can feed our compute units.
        - In this situation we should rewrite our algorithm to increase *computational intensity* $I$ in order to make it compute-bound.
    - Why do we try to increase $I$ and not our bandwidth?
        - Since $\tau$ and $\beta$ are fixed by the hardware, $I$ is the only quantity we can change in software, so we increase $I$ in order to become compute-bound.
            - Otherwise, we would have to switch to hardware with higher bandwidth.
            - Additionally, this approach is better because an algorithm that maximizes *computational intensity* on hardware with less bandwidth will also perform better on new hardware than an algorithm that doesn't.
- How do we maximize *computational intensity*?
    - In practice, this means moving a chunk of data from slow memory to fast memory, and then performing as many useful operations on it as allowed by our algorithm.
    - Maximizing the number of operations on a chunk of data means we use that data until it won't be needed in an operation again, i.e., after we fetch it the first time, we won't fetch it again.
    - As a result, we reduce the number of trips to slow memory, and our performance now depends on how many operations we can perform on each byte that is moved into fast memory -> we move toward being **compute-bound**

- **TL;DR:**
    - **Fast memory** (shared memory in the SM) is physically close to the compute units.
    - **Slow memory** (DRAM) is farther away, so accessing it takes longer.
    - Peak compute **τ (FLOP/s)** is achievable only when your arithmetic intensity **I** is high enough and the cores are kept busy
    - Achievable performance: $T_{max} = \min(\beta \cdot I, \tau)$. To beat the memory wall, reduce DRAM traffic and increase reuse so that $I > \tau/\beta$, keeping work in **fast memory**.
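
The formulas in this section can be sketched in a few lines of Python. The function names are hypothetical, the values of $\tau$ and $\beta$ are made-up round numbers, and the byte count assumes the best case where every FP16 matrix element moves between slow and fast memory exactly once, which real kernels only approximate.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Approximate FLOPs for an (m x k) @ (k x n) matmul: one multiply + one add per term."""
    return 2 * m * n * k


def gemm_min_bytes(m: int, n: int, k: int, bytes_per_elem: int = 2) -> int:
    """Best-case traffic (assumption): read A and B once, write C once."""
    return bytes_per_elem * (m * k + k * n + m * n)


def roofline(intensity: float, tau: float, beta: float) -> float:
    """T_max = min(beta * I, tau), in FLOP/s."""
    return min(beta * intensity, tau)


# Made-up hardware numbers, chosen only to keep the arithmetic readable.
tau = 10e12         # peak compute: 10 TFLOP/s
beta = 100e9        # peak bandwidth: 100 GB/s
ridge = tau / beta  # I* = tau / beta, in FLOP/byte

for size in (64, 4096):
    intensity = gemm_flops(size, size, size) / gemm_min_bytes(size, size, size)
    bound = "compute-bound" if intensity >= ridge else "memory-bound"
    t_max = roofline(intensity, tau, beta)
    print(f"N={size}: I={intensity:.1f} FLOP/byte, "
          f"T_max={t_max / 1e12:.2f} TFLOP/s ({bound})")
```

With these numbers the small matmul lands on the sloped part of the roof and the large one on the flat part, because the best-case intensity of a square FP16 matmul grows linearly with its size ($I = N/3$ under the assumptions above).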


## Rooflines for NVIDIA Tesla T4
- We will plug in some numbers specific to our GPU, and look at the resulting roofline model to inform us on how to approach designing our algorithm.
    - On a real computer, there isn't just a single $\tau$ or $\beta$.
    - There are multiple compute ceilings ($\tau$) for different *instruction paths/data types* (FFMA FP32 vs HMMA FP16/BF16) and multiple bandwidth ceilings ($\beta$) for different *memory levels* (HBM/DRAM, L2, L1 cache, shared memory).


### Tensor Core vs. FFMA

- **Tensor Cores** are NVIDIA's specialized hardware units designed for matrix multiply-accumulate (MMA).
    - A Tensor Core computes a small tile operation like $C_{tile} \mathrel{+{=}} A_{tile} \times B_{tile}$ in one instruction at **warp scope** (more on this later)
    - Instead of doing scalar operations one-by-one on CUDA cores, a Tensor Core performs many fused multiply-adds in parallel on fixed-size tiles

- **FFMA** (FP32 Fused Multiply-Add)
    - A single instruction on CUDA cores that computes $d = a \times b + c$ with one rounding at the end.
        - One rounding (fused) means less rounding error than a separate multiplication followed by an addition.

### Side-by-side example (4×4 matrix)

Suppose we want to multiply two $4 \times 4$ matrices $A$ and $B$ and accumulate into $C$.

- **Using only FFMAs (CUDA cores):**
  - Each element of $C$ is a dot product of one row of $A$ and one column of $B$:  

    $
    C[i,j] = \sum_{k=0}^{3} A[i,k] \times B[k,j]
    $

  - Each dot product has **4 multiply–adds**, so computing one $C[i,j]$ requires **4 FFMAs**.  
  - Since $C$ is $4 \times 4$, it has **16 elements total**.  
  - Therefore the total work is  

    $
    16 \times 4 = 64 \;\; \text{FFMAs across the whole matrix.}
    $

  - Each FFMA is of the form $d = a \times b + c$, updating one scalar at a time.

- **Using one Tensor Core instruction (HMMA):**
  - Instead of 64 separate scalar instructions, the entire warp issues a single **HMMA instruction** that updates the whole $4 \times 4$ tile of $C$ at once:  

    $
    C_{4\times4} \mathrel{+{=}} A_{4\times4} \times B_{4\times4}
    $

  - Under the hood, this one instruction bundles together all 64 multiply–adds required for the tile.  
  - To the programmer, it’s **one warp-level instruction** instead of 64 separate scalar FFMAs.
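
The FFMA count above can be checked with a plain Python emulation of the scalar path. This is not GPU code, Python does not fuse the rounding the way a real FFMA does, and `matmul_accumulate_fma` is a hypothetical helper that only counts how many scalar multiply-adds the triple loop issues.

```python
def matmul_accumulate_fma(a, b, c):
    """Compute C += A @ B for square matrices using scalar multiply-adds.

    Returns the number of multiply-adds issued.
    """
    n = len(a)
    fma_count = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # One scalar multiply-add of the form d = a * b + c.
                c[i][j] = a[i][k] * b[k][j] + c[i][j]
                fma_count += 1
    return fma_count


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[0.0] * 4 for _ in range(4)]

count = matmul_accumulate_fma(a, identity, c)
print(count)   # 64
print(c == a)  # True, since A @ I == A
```

The Tensor Core path performs the same 64 multiply-adds. The difference is that they are bundled behind one warp-level instruction instead of 64 scalar ones.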



- In order to design our roofline model, we first need to know the global memory bandwidth $\beta_{gmem}$ of our device. 
    - NVIDIA spec sheets report *theoretical* memory bandwidth, which is never achievable in practice. So instead, we use a benchmark.
    - According to ["Dissecting the NVidia Turing T4 GPU via Microbenchmarking"](https://arxiv.org/pdf/1903.07486), the achievable memory bandwidth of the T4 is 220 GB/sec (about 69% of the 320 GB/sec theoretical memory bandwidth)

- Next, we look at the peak floating point throughput with and without the tensor core.
    - As with memory, the theoretical numbers are not achievable in practice.
    - Instead, we use the throughput of the cuBLAS (a matrix multiplication library) half precision and single precision GEMM kernels as the achievable floating point throughput numbers.
        - The half precision kernel uses **tensor cores** while the single precision kernel doesn't
    - The half precision kernel is built on **HMMA.1688**
        - This instruction performs a single small hardware-accelerated matmul
    - The single precision kernel is built on **FFMA**
    - According to the benchmarks obtained by Alex, the tensor core **HMMA.1688** throughput is 49439 GFLOP/sec, which we will call $\tau_{HMMA}$.
    - The non-tensor core FFMA throughput is 7455 GFLOP/sec which we will call $\tau_{FFMA}$
    - These are respectively 76% and 92% of the theoretical peak throughputs
        - HMMA is 76% of Tensor-core theoretical peak
        - FFMA is 92% of CUDA-core theoretical peak
    - We get the resulting roofline model 


![image.png](../../images/GEMM2/t4_roofline.png)

- From the plot it is clear that writing a kernel that achieves peak FLOP/sec is harder with tensor core instructions than with fused multiply-add instructions
    - This comes from the fact that reaching the peak tensor core throughput $\tau_{HMMA}$ needs ~6.6x more arithmetic intensity than reaching the peak fused multiply-add throughput $\tau_{FFMA}$
    - The balance points indicate that with FFMA instructions we need to perform ~33 FLOPs per byte fetched from DRAM to become compute-bound, whereas with the tensor cores we need to perform ~224 FLOPs per byte fetched from DRAM.
        - This means that if we took a kernel that reached the peak FLOP/sec achievable with FFMA instructions, simply replacing the fused multiply-adds in the inner loop with tensor core instructions would not be sufficient to get high tensor core utilization.
        - We would additionally need to improve the code that moves data around to increase computational intensity by a factor of about 6.6.
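
The balance points and their ratio follow directly from the benchmark numbers quoted above. A minimal sketch (the constant and function names are hypothetical):

```python
BETA_GMEM = 220e9   # achievable DRAM bandwidth, bytes/s
TAU_HMMA = 49439e9  # achieved tensor core throughput, FLOP/s
TAU_FFMA = 7455e9   # achieved FFMA throughput, FLOP/s


def balance_point(tau: float, beta: float) -> float:
    """Ridge point I* = tau / beta, in FLOP/byte."""
    return tau / beta


hmma = balance_point(TAU_HMMA, BETA_GMEM)
ffma = balance_point(TAU_FFMA, BETA_GMEM)
print(f"HMMA: {hmma:.1f} FLOP/byte")  # HMMA: 224.7 FLOP/byte
print(f"FFMA: {ffma:.1f} FLOP/byte")  # FFMA: 33.9 FLOP/byte
print(f"ratio: {hmma / ffma:.1f}x")   # ratio: 6.6x
```

Because both balance points share the same $\beta$, their ratio is simply $\tau_{HMMA} / \tau_{FFMA}$.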

### Shared memory vs. L2 cache vs. global memory
- It is crucial to understand our computer's memory hierarchy if we want to write an optimized kernel for the tensor cores.
    - The roofline model simplifies the memory hierarchy down to two storage types, one large and slow, the other small and fast.
    - In reality, there are more memory levels, each with a different bandwidth and capacity.
        

![image.png](../../images/GEMM2/t4_memory_hierarchy.png)

- It is critical to use the faster and smaller levels of the memory hierarchy effectively in order to increase **arithmetic intensity**, and thus move from being memory-bound to compute-bound.
    - This requires ingenuity because of the limited size of on-chip memory. For instance, on the Tesla T4 the **shared memory** has about **16.6× the bandwidth of global (DRAM) memory**.
    - However, each streaming multiprocessor (SM) has only **64 KiB** of it.  
        - **1 KiB (kibibyte) = 1024 bytes**
    - When multiplying large matrices, 64 KiB is only enough to fit a **tiny tile of A, B, and C** at once. Efficient kernels must therefore reuse these tiles heavily before loading new ones from global memory.
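
To get a feel for how small "tiny" is, the sketch below computes the largest square tile size such that one tile each of A, B, and C fits in 64 KiB. It assumes FP16 elements (2 bytes), square tiles, and that all three tiles live in shared memory at once. These are simplifying assumptions for illustration, not a description of how a tuned kernel lays out its tiles, and `largest_square_tile` is a hypothetical helper.

```python
import math

SHARED_MEM_BYTES = 64 * 1024  # 64 KiB per SM
BYTES_PER_ELEM = 2            # FP16 (assumption)


def largest_square_tile(capacity_bytes: int, bytes_per_elem: int, num_tiles: int = 3) -> int:
    """Largest t such that num_tiles tiles of t x t elements fit in capacity_bytes."""
    return math.isqrt(capacity_bytes // (num_tiles * bytes_per_elem))


t = largest_square_tile(SHARED_MEM_BYTES, BYTES_PER_ELEM)
used = 3 * t * t * BYTES_PER_ELEM
print(t, used)  # 104 64896
```

Under these assumptions a tile is only about 104 elements on a side, while the matrices being multiplied can have thousands of rows and columns.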

![image.png](../../images/GEMM2/t4_memory_roofline1.png)

The plot compares the balance point of tensor cores with respect to:
- **Global memory (DRAM)**
    - Largest and slowest level of the memory hierarchy
- **L2 Cache**
    - Stores recently accessed data from DRAM, and is shared between the 40 SMs on the T4
- **Shared Memory**
    - Fast on-chip memory local to each SM

- All of these balance points are with respect to the tensor cores.
- Global memory has a balance point of 224, which means we need 224 FLOPs per byte fetched from DRAM in order to keep our tensor cores busy.
- The L2 cache has a balance point of 38, which is a much more manageable number
    - If a good number of our memory accesses can hit the L2 cache rather than going all the way to global memory, we are more likely to become compute bound.
    - Why not try and get memory access to hit shared memory?
        - Shared memory isn't large enough to hold the full matrices, so we use it to store tiles of the matrices instead.
- Instead, shared memory is used as an explicitly managed cache that holds small portions of the input matrices local to a particular SM.
- Within the SM, threads load their own local portion of the matrices from shared memory into registers
    - Registers are where data must reside in order to be computed on.
    - For example, if we want to perform an add (a + b), we load the values from shared memory into registers, compute the sum, and store it in a register. Eventually the result is written back to global memory.
- When shared memory is operating at full bandwidth, its balance point is 13. 
    - This means we need to cache enough data in registers to perform 13 FLOPs for each byte read from shared memory.
    - The SMs have enough register memory, which allows us to do that.
- Our challenge will be to enable shared memory to operate at full bandwidth
    - In practice this means organizing the data layout in such a way that we can read it and write it without bank conflicts.
- Once shared memory is at full bandwidth, sufficient arithmetic intensity will be easy to achieve.
- However, despite the balance point of shared memory being 13, it alone is not fast enough to achieve peak tensor core throughput.
    - If we stopped at shared memory reuse, we would hit a ceiling: the Tensor Cores can still execute faster than shared memory can feed them.
    - So we bring in registers, as they are much faster than shared memory.
    - This is possible because each SM on the T4 has **tens of thousands of registers** (65,536 in total), and each thread can use hundreds of them. Registers are the fastest memory on the GPU, with massive aggregate bandwidth across all threads. This makes them capable of reusing values many times and feeding the Tensor Cores at full speed, unlike shared memory which would eventually become a bottleneck.

- These balance point numbers (224, 38, 13) all come from the formula:

  $
  I^* = \frac{\tau}{\beta}
  $

  where $\tau$ is the compute throughput of the Tensor Cores and $\beta$ is the memory bandwidth of that level of the hierarchy.

- For the T4, the achievable Tensor Core throughput is about $\tau_{HMMA} = 49{,}439$ GFLOP/s.  
  - If we divide this by the DRAM bandwidth (220 GB/s), we get the global memory balance point $\approx 224$ FLOPs/byte.  
  - Dividing by the L2 bandwidth gives $\approx 38$ FLOPs/byte.  
  - Dividing by the shared memory bandwidth gives $\approx 13$ FLOPs/byte.
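
Since the L2 and shared memory bandwidths are not stated explicitly here, one way to sanity-check the balance points is to invert the formula and back out the bandwidth each one implies. The sketch below does that. Because the balance points are rounded, the results are approximate, and `implied_bandwidth_gb_per_s` is a hypothetical helper.

```python
TAU_HMMA_GFLOPS = 49439.0  # achieved tensor core throughput, GFLOP/s

# Balance points quoted in the text, in FLOP/byte.
balance_points = {"global memory": 224, "L2 cache": 38, "shared memory": 13}


def implied_bandwidth_gb_per_s(tau_gflops: float, balance_point: float) -> float:
    """Invert I* = tau / beta to get beta = tau / I* (GFLOP/s over FLOP/byte = GB/s)."""
    return tau_gflops / balance_point


for level, i_star in balance_points.items():
    beta = implied_bandwidth_gb_per_s(TAU_HMMA_GFLOPS, i_star)
    print(f"{level}: ~{beta:.0f} GB/s")
```

The global memory result lands within about 1 GB/s of the 220 GB/s benchmark figure, which is what we would expect from a balance point rounded to 224.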

