The L2 Cache Barrier in MLA Serving

Code and profiling data for the paper "The Hidden Bottleneck in MLA Serving: Reconstruction GEMMs, INT4 Quantization, and the L2 Cache Barrier" (Robert Zhang).

Key Claim

INT4 quantization of the MLA reconstruction weights fails to outperform FP16 cuBLAS because the 16 MB weight matrix fits inside the H100's 50 MB L2 cache. The weights are therefore served at L2 bandwidth (~12 TB/s) rather than HBM bandwidth (3.35 TB/s), so the HBM traffic that quantization saves is irrelevant. An INT4 speedup only materializes once the weights exceed L2 capacity.
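
The arithmetic behind this claim can be sketched in a few lines. The sketch uses only the figures quoted above (16 MB weights, 50 MB L2, ~12 TB/s L2, 3.35 TB/s HBM). The helper names are illustrative and not part of this repo, and the model assumes, as a simplification, that weights which fit in L2 stay fully resident there and that 1 MB = 10^6 bytes.

```python
# Back-of-envelope model of where the reconstruction weights are served from.
# All constants are the figures quoted in this README; the helper names are
# hypothetical and do not appear in the repo.

L2_CAPACITY_MB = 50.0   # H100 L2 cache size
L2_BW_TBPS = 12.0       # approximate L2 bandwidth
HBM_BW_TBPS = 3.35      # HBM bandwidth


def serving_tier(weight_mb: float) -> str:
    """Assume weights stay L2-resident whenever they fit (a simplification)."""
    return "L2" if weight_mb <= L2_CAPACITY_MB else "HBM"


def read_time_us(weight_mb: float, bw_tbps: float) -> float:
    """Lower bound on the time to read the weights once, in microseconds."""
    # MB / (TB/s) = 1e6 B / (1e12 B/s) = 1e-6 s, i.e. microseconds.
    return weight_mb / bw_tbps


if __name__ == "__main__":
    fp16_mb = 16.0
    int4_mb = fp16_mb / 4  # 4-bit vs 16-bit, ignoring scales and zero points
    for label, mb in [("FP16", fp16_mb), ("INT4", int4_mb)]:
        tier = serving_tier(mb)
        bw = L2_BW_TBPS if tier == "L2" else HBM_BW_TBPS
        print(f"{label}: {mb:.0f} MB from {tier}, >= {read_time_us(mb, bw):.2f} us")
```

Under these assumptions the FP16 weights can be read in roughly 1.3 µs from L2, compared with roughly 4.8 µs if they had to come from HBM, so the bytes that INT4 removes were not being fetched from HBM in the first place.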

Repo Structure

kernels/          Triton/PyTorch transformer kernels and benchmarks
profiling/        Microbenchmarks, NCU profiling scripts, and results
paper/            LaTeX source and figures

Requirements

NVIDIA H100 80 GB SXM5 (L2 cache = 50 MB); some experiments also run on A100
PyTorch ≥ 2.1 with CUDA
Triton ≥ 3.0

pip install torch triton

Reproducing Key Experiments

L2 barrier sweep (scales the weight matrix from 8 MB to 128 MB, crossing the 50 MB L2 boundary):

cd profiling
python bench_l2_barrier.py
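
For orientation, the sweep points might look like the sketch below. This is not the code in bench_l2_barrier.py: the doubling schedule, the fixed inner dimension K = 4096, and the reading of the sizes as FP16 footprints in MiB are all assumptions made for illustration.

```python
# Hypothetical sweep schedule; the actual script may use different sizes
# and matrix shapes.

L2_CAPACITY_MIB = 50      # H100 L2 size quoted in this README
BYTES_PER_FP16 = 2
K = 4096                  # assumed inner dimension, for illustration only


def sweep_sizes_mib(start=8, stop=128):
    """Doubling schedule from start to stop (inclusive)."""
    sizes = []
    size = start
    while size <= stop:
        sizes.append(size)
        size *= 2
    return sizes


def rows_for_size(size_mib, k=K):
    """Rows of a (rows x k) FP16 matrix that occupies size_mib MiB."""
    return size_mib * 2**20 // (BYTES_PER_FP16 * k)


if __name__ == "__main__":
    for size in sweep_sizes_mib():
        side = "fits in L2" if size <= L2_CAPACITY_MIB else "exceeds L2"
        print(f"{size:>4} MiB  ({rows_for_size(size)} x {K} FP16)  {side}")
```

With this schedule the 32 MiB and 64 MiB points bracket the L2 capacity, which is where a transition from L2-served to HBM-served weights would be expected to show up.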

INT4 batched GEMM benchmark:

python bench_int4_bmm.py

MLA reconstruction profiling:

python profile_mla_reconstruction.py

NCU kernel profiling (requires ncu on PATH):

bash ncu_profile.sh
python analyze_ncu.py

End-to-end profiling:

bash profile_e2e.sh

See profiling/RUNBOOK.md for full instructions and profiling/RESULTS.md for a summary of findings.

Paper

The LaTeX source is in paper/. Build with:

cd paper
pdflatex main.tex && bibtex main && pdflatex main.tex && pdflatex main.tex
