📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉
Light-field imaging application for plenoptic cameras
[ICLR 2025] Palu: Compressing KV-Cache with Low-Rank Projection
Light field geometry estimator for plenoptic cameras
xKV: Cross-Layer SVD for KV-Cache Compression
An efficient and scalable attention module that reduces memory usage and improves inference speed in large language models, implementing Multi-Head Latent Attention (MLA) as a drop-in replacement for traditional multi-head attention (MHA); see the sketch after the repository list.
A Mixture of Experts model with latent attention designed for efficient training and inference.
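
As a rough illustration of what the MLA repositories above implement, here is a minimal PyTorch sketch of Multi-Head Latent Attention, assuming a DeepSeek-V2-style formulation in which keys and values are reconstructed from a small shared latent vector so the KV cache only stores that latent. All class and dimension names (MultiHeadLatentAttention, d_latent, etc.) are illustrative and not taken from any of the listed repositories; the decoupled RoPE key path and causal masking are omitted for brevity.

```python
# Minimal sketch of Multi-Head Latent Attention (MLA).
# Assumption: DeepSeek-V2-style down-projection of hidden states into a small
# KV latent that is cached, with per-head K/V reconstructed by up-projection.
# Names and dimensions are illustrative, not from the repositories above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiHeadLatentAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-project hidden states to a shared KV latent (this is what gets cached).
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-project the latent back to per-head keys and values.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_latent_cache=None):
        b, t, _ = x.shape
        # Compress new tokens into the latent and append to the cached latents.
        new_latent = self.kv_down(x)                      # (b, t, d_latent)
        if kv_latent_cache is not None:
            latent = torch.cat([kv_latent_cache, new_latent], dim=1)
        else:
            latent = new_latent

        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)

        # Standard scaled dot-product attention over the reconstructed K/V
        # (causal masking omitted in this sketch).
        out = F.scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, t, -1)
        # Return the latent as the updated KV cache.
        return self.out_proj(out), latent


# Usage: the cache holds d_latent = 64 floats per token instead of
# 2 * d_model = 1024 for a conventional MHA KV cache.
mla = MultiHeadLatentAttention()
x = torch.randn(2, 16, 512)
y, cache = mla(x)
```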