# Background

## Memory Wall

- The number of transistors on a microchip has roughly doubled every two years (Moore’s law), and as a result the capacity for performing arithmetic operations has grown exponentially.

    - A **transistor** is a tiny electronic switch — the basic “work unit” of digital logic and memory. Modern CPUs and GPUs have billions of them.

    - More transistors enable more logic gates, memory cells, and specialized units → greater parallelism, higher throughput, and more complex computations.

- However, data movement capacity (memory bandwidth) hasn’t increased as fast, creating the **memory wall** — a key bottleneck in deep learning and Tensor Core workloads.

- Hence, to fully utilize Tensor Cores, we must keep them supplied with data, which means minimizing traffic between DRAM and the compute units and reusing data once it is in fast memory.

## Roofline Charts 

- The roofline model states that performance is limited by **one of two things**:
    - **Compute-bound:**
        - Data is readily available in fast memory
        - The bottleneck is the number of floating-point operations per second (**FLOP/s**) your hardware can execute
        - Adding more **bandwidth** here won’t help; only faster **ALUs** (arithmetic logic units, which perform the computations), **Tensor Cores**, or more **parallelism** will improve performance
    - **Memory-bound:**
        - The **compute units can’t be fed fast enough** because data has to be fetched from slow memory (**DRAM**)
        - The bottleneck is **memory bandwidth (β bytes/sec)**
        - Doubling **compute** units won’t help unless you also increase your **bandwidth** or improve data reuse in fast memory.
    - Plot Interpretation:
        - **X-axis:** Operational intensity = **FLOPs** per byte transferred (how much work you do per byte fetched)
        - **Y-axis:** Achievable performance (**FLOP/s**)
        - The **roof** has two segments:
            - **Sloped line** → memory-bound region
            - **Flat line** → compute-bound region
        - Where you land depends on how much computation you can do before you have to fetch more data
- **Two-level memory model in GPUs:**
    - **Fast memory** (shared memory in the SM) is physically close to the compute units.
    - **Slow memory** (DRAM) is farther away, so accessing it takes longer.
    - Computation can only be performed at peak rate (**τ FLOP/s**) on data already in fast memory.
    - Slow memory can transfer data into fast memory at **β bytes/sec**
    - Because of the memory wall, we want every byte fetched from slow memory to do as much work as possible.
        - Hence, since **τ** (in FLOP/s) is typically much larger than **β** (in bytes/sec), we want to maximize time spent operating on data in **fast memory** and minimize **slow memory accesses**
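
The two-level model above can be turned into a rough lower bound on run time: a kernel needs at least `flops / τ` seconds of compute and at least `bytes / β` seconds of slow-memory traffic, and the larger of the two dominates. The sketch below is only an illustration. The function name `estimate_time` and the hardware numbers are hypothetical placeholders rather than measurements of any real GPU, and perfect overlap of compute with memory transfers is an assumption.

```python
def estimate_time(flops, bytes_moved, tau, beta):
    """Lower bound on run time (seconds) under the two-level memory model.

    Assumes compute and slow-memory transfers overlap perfectly, so the
    slower of the two determines the total time.
    """
    compute_time = flops / tau        # time if all data were already in fast memory
    memory_time = bytes_moved / beta  # time to stream the data from slow memory
    bound = "compute" if compute_time >= memory_time else "memory"
    return max(compute_time, memory_time), bound


# Hypothetical hardware, chosen only for round numbers.
TAU = 100e12   # peak compute rate in FLOP/s
BETA = 1e12    # slow-memory bandwidth in bytes/s

# Element-wise addition of two float32 vectors with n elements:
# n additions, and per element 2 loads + 1 store of 4 bytes each.
n = 1_000_000
t, bound = estimate_time(flops=n, bytes_moved=12 * n, tau=TAU, beta=BETA)
print(f"vector add: {t * 1e6:.2f} us, {bound}-bound")
```

Under these assumed numbers the memory time is roughly a thousand times the compute time, so the compute units would sit idle for almost the entire kernel. This is the situation the memory wall describes.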

![Roofline chart](images\GEMM2\roof_line.png)
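
The roof in the chart can be written as a single expression: achievable performance is `min(τ, β · I)`, where `I` is the operational intensity in FLOPs per byte. The sloped and flat segments meet at the ridge point `I = τ / β`. A minimal Python sketch follows. The function names and the hardware numbers are hypothetical and serve only to illustrate the shape of the curve.

```python
def attainable_flops(intensity, tau, beta):
    """Roofline bound: performance (FLOP/s) at a given operational intensity."""
    return min(tau, beta * intensity)


def ridge_point(tau, beta):
    """Intensity (FLOPs/byte) where the sloped and flat segments meet."""
    return tau / beta


# Hypothetical hardware, chosen only for round numbers.
TAU = 100e12   # peak compute rate in FLOP/s
BETA = 1e12    # slow-memory bandwidth in bytes/s

for intensity in (0.25, 10, 100, 1000):
    perf = attainable_flops(intensity, TAU, BETA)
    region = "memory-bound" if intensity < ridge_point(TAU, BETA) else "compute-bound"
    print(f"I = {intensity:>7} FLOPs/byte -> {perf / 1e12:6.2f} TFLOP/s ({region})")
```

With these assumed numbers, a kernel needs at least 100 FLOPs per byte fetched from DRAM to reach the flat part of the roof. Below that intensity, performance scales linearly with `I` and extra compute units go unused.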