# Background

## Memory Wall

- The number of transistors on a microchip has doubled roughly every two years (Moore's law), and as a result the capacity for performing arithmetic operations has increased exponentially.

    - A **transistor** is a tiny electronic switch — the basic “work unit” of digital logic and memory. Modern CPUs and GPUs have billions of them.

    - More transistors enable more logic gates, memory cells, and specialized units → greater parallelism, higher throughput, and more complex computations.

- However, data movement capacity **(memory bandwidth)** hasn’t increased as fast, creating the **memory wall** — a key bottleneck in deep learning and Tensor Core workloads.
    - Bandwidth is how much **data per second** your memory system can deliver to (and accept from) the **compute units**
    - **Analogy**:
        - Think of a highway:
            - Cars = **bytes**
            - Number of lanes = **bus width**
            - Speed limit = **clock rate**
            - Toll booths/merges = **memory controllers/channels**
        - How many cars reach the city each second? That's the bandwidth
    - Do not confuse **bandwidth** with **latency**:
        - **Latency** is how long one trip takes
        - **Bandwidth** is how many bytes per second you can keep flowing

- Hence, to fully utilize Tensor Cores we must keep them fed with data: either move bytes between DRAM and the compute units faster, or do more work on each byte that is moved.

![image.png](../../images/GEMM2/simple_computer.png)
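
The highway analogy and the bandwidth/latency distinction can be sketched numerically. This is a minimal illustration, not a model of any specific device: the bus width, transfer rate, and latency below are assumed values, and `peak_bandwidth` and `transfer_time` are hypothetical helper names.

```python
def peak_bandwidth(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak bandwidth in bytes/s: lanes (bus width) times speed limit (transfer rate)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second


def transfer_time(num_bytes: float, latency_s: float, bandwidth: float) -> float:
    """Time for one request: a fixed latency plus the time to stream the bytes."""
    return latency_s + num_bytes / bandwidth


# Assumed, illustrative numbers (not measurements of any specific device).
BUS_WIDTH_BITS = 256          # number of "lanes"
TRANSFERS_PER_SECOND = 10e9   # "speed limit": transfers per second on each lane
LATENCY_S = 500e-9            # one "trip" takes 500 ns

bandwidth = peak_bandwidth(BUS_WIDTH_BITS, TRANSFERS_PER_SECOND)
print(f"peak bandwidth: {bandwidth / 1e9:.0f} GB/s")

# A tiny request is dominated by latency, a huge one by bandwidth.
for num_bytes in (64, 1_000_000_000):
    t = transfer_time(num_bytes, LATENCY_S, bandwidth)
    print(f"{num_bytes:>13,} bytes -> {t * 1e6:.2f} us")
```

With these assumed values the small request costs almost exactly one latency, while the large one is governed almost entirely by bandwidth.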

## Roofline Charts 

- The roofline model states that performance is limited by **one of two things**:
    - **Compute-bound:**
        - Data is readily available in fast memory
        - The bottleneck is the number of floating-point operations per second (**FLOP/s**) your hardware can execute
        - Adding more **bandwidth** here won’t help; only faster **ALUs** (Arithmetic Logic Units, which perform the computations), **Tensor Cores**, or more **parallelism** will improve performance
    - **Memory-bound:**
        - The **compute units can’t be fed fast enough** because data has to be fetched from slow memory (**DRAM**)
        - The bottleneck is **memory bandwidth (β bytes/sec)**
        - Doubling **compute** units won’t help unless you also increase your **bandwidth** or improve data reuse in fast memory.
    - Plot Interpretation:
        - **X-axis:** Operational intensity = **FLOPs** per byte transferred (how much work you do per byte fetched); referred to below as *computational intensity* ($I$)
        - **Y-axis:** Achievable performance (**FLOP/s**)
        - The **roof** has two segments:
            - **Sloped line →** memory-bound region
            - **Flat line** → compute-bound region
            - Where you land depends on how much computation you can do before you have to fetch more data


![image.png](../../images/GEMM2/roof_line.png)
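
The two segments of the roof can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the peak compute value is the T4 figure quoted in these notes, the bandwidth of 320 GB/s is an assumed illustrative value, and `roofline` and `regime` are hypothetical helper names.

```python
def roofline(intensity: float, tau: float, beta: float) -> float:
    """Upper bound on FLOP/s: the sloped segment (beta * I) capped by the flat one (tau)."""
    return min(beta * intensity, tau)


def regime(intensity: float, tau: float, beta: float) -> str:
    """Which segment of the roof a given intensity lands on."""
    return "memory-bound" if intensity < tau / beta else "compute-bound"


TAU = 65e12    # peak compute in FLOP/s (the T4 figure used in these notes)
BETA = 320e9   # memory bandwidth in bytes/s (assumed for illustration)

print(f"ridge point: {TAU / BETA:.3f} FLOP/byte")
for intensity in (1, 10, 100, 1000):
    bound = roofline(intensity, TAU, BETA)
    print(f"I = {intensity:>4} FLOP/byte -> {bound / 1e12:6.2f} TFLOP/s ({regime(intensity, TAU, BETA)})")
```

Below the ridge point the bound grows linearly with intensity; above it the bound stays at the peak compute value no matter how much more work is done per byte.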

- **Explanation**
    - Any given computation has a certain number of FLOPs that need to be performed. For example, to multiply an $M \times K$ matrix by a $K \times N$ matrix we need to perform about $2 * M * N * K$ FLOPs.
        - Each output element $(i, j)$ takes $K$ multiplies and $K - 1$ adds, which is $\approx 2K$ FLOPs (1 FLOP, or floating-point operation, per multiply and 1 per add).
        - There are $M * N$ outputs, so the total is $\text{FLOPs} \approx (M * N) * (2K) = 2 * M * N * K$.
    - The more FLOP/s our algorithm can achieve, the faster we can get the matrix multiplication done.
    - The roofline model gives us an upper bound on the FLOP/s we can achieve, subject to $\tau$ and $\beta$, which are fixed properties of our hardware.
        - $\tau$ (tau) = the **peak compute throughput** of your device for a given datatype/op (e.g., **FP32** (32-bit floating point) or **FP16 on Tensor Cores**). Units: FLOP/s
            - $\tau$ is typically a large number. For example, for the T4 GPU using FP16 Tensor Cores, $\tau$ = 65,000,000,000,000 FLOP/s (65 TFLOP/s).
        - $\beta$ (beta) = the **peak sustained memory bandwidth** between a given memory level and the cores (e.g., DRAM <-> SM). Units: bytes/s
    - We will refer to achieved FLOP/s as $T$ (for throughput), and to the upper bound on $T$ as $T_{max}$.
    - The maximum FLOP/s we can achieve ($T_{max}$) is modeled as a function of a variable called *computational intensity* ($I$), which is a property of the algorithm we write.
        - Units: FLOP/byte
        - FLOPs done per byte moved between that memory level and the cores (reads + writes).
    - This metric measures the "data reuse" of our algorithm in units of FLOP/byte.
        - For each byte moved from slow memory to fast memory, how many FLOPs do we perform on it?
    - The roofline model says the upper bound on FLOP/s ($T_{max}$) we can achieve is the minimum of our computational intensity times memory bandwidth, and the peak floating point throughput of our hardware:
        - $T_{max} = \min(\beta * I, \tau)$
    - The roofline model says there are two ways $T_{max}$ can be limited:
        - $T_{max}$ can never exceed $\tau$. Even if we perform infinitely many operations on each byte we move into fast memory, we are still limited by the peak floating point throughput of our hardware.
            - When $\tau$ is our limiting factor we are *compute-bound*. This is a great place to be.
        - $T_{max}$ may also be limited by our memory bandwidth times the computational intensity of our algorithm. If $\tau$ were infinite, the achieved floating point throughput would simply be the number of bytes/s being moved into fast memory times the number of FLOPs performed per byte moved: $\beta * I$.
            - When we multiply $\beta$ and $I$, the units cancel out to give FLOP/s.
                - $(\text{bytes/s}) * (\text{FLOP/byte}) = \text{FLOP/s}$
            - If $\beta * I < \tau$, or equivalently $I < \tau/\beta$, then we are *memory-bound*, meaning we are limited by how fast we can feed our compute units.
            - In this situation we should rewrite our algorithm to increase $I$ in order to make it compute-bound.
        - Why do we try to increase $I$ and not our bandwidth?
            - $\tau$ and $\beta$ are fixed by the hardware, so $I$ is the only quantity we control. We increase $I$ in order to become compute-bound.
                - Otherwise, we would have to switch to hardware with higher bandwidth.
                - Additionally, this approach is better because an algorithm that maximizes *computational intensity* on hardware with less bandwidth will also perform better on new hardware than an algorithm that doesn't.
        - The ridge point, where the sloped (memory-bound) line meets the flat (compute-bound) line, is at $I = \tau/\beta$.
    - How do we maximize *computational intensity*?
        - In practice, this means moving a chunk of data from slow memory to fast memory, and then performing as many useful operations on it as allowed by our algorithm.
        - Maximizing the number of operations on a chunk of data means we use that data until it won't be needed in any further operation, i.e., after we fetch it the first time, we never fetch it again.
        - As a result, we reduce the number of trips to slow memory, and our performance now depends on how many operations we can perform on each byte that is moved into fast memory, which pushes us toward being **compute-bound**.
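
To make the effect of data reuse concrete, the sketch below compares two extremes for a matrix multiplication: no reuse (every output element re-fetches its row of A and column of B from slow memory) and ideal reuse (each matrix crosses the slow-memory boundary exactly once). The matrix size, FP16 element size, and 320 GB/s bandwidth are assumptions chosen for illustration, and all function names are hypothetical.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """One multiply and one add per (i, j, k) triple: about 2*M*N*K FLOPs."""
    return 2 * m * n * k


def bytes_no_reuse(m: int, n: int, k: int, elem_bytes: int) -> int:
    """Each of the M*N outputs reads K elements of A and K of B, then writes one result."""
    return m * n * (2 * k + 1) * elem_bytes


def bytes_ideal_reuse(m: int, n: int, k: int, elem_bytes: int) -> int:
    """A and B are each read once and C is written once."""
    return (m * k + k * n + m * n) * elem_bytes


TAU = 65e12    # peak FLOP/s (the T4 figure used in these notes)
BETA = 320e9   # bytes/s (assumed for illustration)
M = N = K = 4096
ELEM_BYTES = 2  # FP16

flops = gemm_flops(M, N, K)
for label, moved in (
    ("no reuse", bytes_no_reuse(M, N, K, ELEM_BYTES)),
    ("ideal reuse", bytes_ideal_reuse(M, N, K, ELEM_BYTES)),
):
    intensity = flops / moved               # FLOP/byte
    t_max = min(BETA * intensity, TAU)      # roofline bound
    print(f"{label:>11}: I = {intensity:8.2f} FLOP/byte, T_max = {t_max / 1e12:6.2f} TFLOP/s")
```

Under these assumptions the no-reuse variant sits far to the left of the ridge point $\tau/\beta$, while the ideal-reuse variant sits to the right of it. Real kernels land somewhere in between, depending on how much of the data they manage to keep in fast memory.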

- **TL;DR:**
    - **Fast memory** (shared memory in the SM) is physically close to the compute units.
    - **Slow memory** (DRAM) is farther away, so accessing it takes longer.
    - Peak compute **τ (FLOP/s)** is achievable only when your computational intensity **I** is high enough and the cores are kept busy
    - Achievable performance: $T_{max} = \min(\beta * I, \tau)$. To beat the memory wall, reduce DRAM traffic and increase reuse so that $I > \tau/\beta$, keeping work in **fast memory**.
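
One common way to increase reuse is to work on square tiles held in fast memory. The sketch below is a simplified, single-level model, not a description of any particular kernel: it assumes each step loads one `b x b` tile of A and one of B, performs `2 * b**3` FLOPs on them, and ignores writes of C. The bandwidth value and the helper names are assumptions.

```python
import math


def tile_intensity(b: int, elem_bytes: int) -> float:
    """FLOP/byte for one tile step: 2*b^3 FLOPs over two b x b tiles loaded."""
    flops = 2 * b**3
    bytes_moved = 2 * b * b * elem_bytes
    return flops / bytes_moved  # simplifies to b / elem_bytes


def min_tile_for_compute_bound(tau: float, beta: float, elem_bytes: int) -> int:
    """Smallest b with tile_intensity(b) >= tau / beta under this simplified model."""
    return math.ceil(elem_bytes * tau / beta)


TAU = 65e12     # peak FLOP/s (the T4 figure used in these notes)
BETA = 320e9    # bytes/s (assumed for illustration)
ELEM_BYTES = 2  # FP16

b = min_tile_for_compute_bound(TAU, BETA, ELEM_BYTES)
print(f"ridge point: {TAU / BETA:.3f} FLOP/byte")
print(f"tile side b = {b}: I = {tile_intensity(b, ELEM_BYTES):.2f} FLOP/byte")
```

In this model the intensity grows linearly with the tile side, so a larger tile moves the algorithm to the right on the roofline chart. Whether a tile of that size fits in fast memory is a separate constraint that the sketch does not check.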
