xlite-dev
Pinned repositories
- Awesome-LLM-Inference Public
📖 A curated list of awesome LLM/VLM inference papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, parallelism, prefix caching, chunked prefill, PD disaggregation, etc. 🎉🎉
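Several of the listed papers (FlashAttention in particular) rest on the online-softmax recurrence, which computes a softmax-weighted sum in one streaming pass without materializing all scores. A minimal pure-Python sketch, using scalar values for brevity (function names are illustrative, not from any of the repos):

```python
import math

def naive_attention_row(scores, values):
    # Two-pass reference: softmax over all scores, then the weighted sum.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / z

def online_attention_row(scores, values):
    # One pass: keep a running max m, normalizer z, and accumulator acc,
    # rescaling both whenever a new maximum appears. This is the recurrence
    # that lets FlashAttention process K/V in tiles with O(1) extra memory.
    m, z, acc = float("-inf"), 0.0, 0.0
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        w = math.exp(s - m_new)
        z = z * scale + w
        acc = acc * scale + w * v
        m = m_new
    return acc / z
```

Both functions return the same value; the streaming variant simply never needs the full score vector at once.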
- CUDA-Learn-Notes Public
📚 Modern CUDA learning notes with PyTorch: 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with the WMMA, MMA, and CuTe APIs (achieving 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
- hgemm-tensorcores-mma Public
⚡️ HGEMM written from scratch on Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak ⚡️ performance.
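For readers new to the repo, the computation every HGEMM kernel there optimizes is just C = A × B. A scalar Python reference (the actual kernels do the same math in fp16 on Tensor Cores, tiling these loops across warps and WMMA/MMA fragments; this sketch is only the correctness baseline, not the repo's code):

```python
def gemm_ref(A, B):
    # Reference GEMM: C[M x N] = A[M x K] @ B[K x N], row-major lists of lists.
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(K):          # i-k-j order keeps B row access contiguous
            a = A[i][k]
            for j in range(N):
                C[i][j] += a * B[k][j]
    return C
```

Tensor Core kernels are typically validated against exactly this kind of reference, with a tolerance for fp16 rounding.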
- lite.ai.toolkit Public
🛠 A lite C++ toolkit containing 100+ awesome AI models (Stable-Diffusion, FaceFusion, the YOLO series, face/object detection, segmentation, matting, etc.), supporting the MNN, ORT, and TensorRT backends. 🎉🎉
- ffpa-attn-mma Public
📚 FFPA (Split-D): yet another faster flash prefill attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑ 🎉 vs. SDPA EA.
- Awesome-Diffusion-Inference Public
📖 A curated list of awesome diffusion inference papers with code: sampling, caching, multi-GPU parallelism, etc. 🎉🎉
- SageAttention Public Forked from thu-ml/SageAttention
Quantized attention that achieves speedups of 2.1-3.1x and 2.7-5.1x over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
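The core idea behind such quantized attention is cheap low-precision matmuls: map tensors to int8 with a per-tensor scale before the Q·K product, then dequantize. An illustrative symmetric-quantization sketch in pure Python (this shows the general technique only, not SageAttention's actual per-block scheme or smoothing tricks):

```python
def quantize_sym_int8(xs):
    # Symmetric per-tensor int8 quantization: x ≈ scale * q, with q in [-127, 127].
    amax = max(abs(x) for x in xs) or 1.0   # guard against an all-zero tensor
    scale = amax / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    # Recover an approximation of the original values.
    return [scale * v for v in q]
```

Round-trip error is bounded by half the quantization step (scale / 2), which is why such schemes can leave end-to-end metrics essentially unchanged.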
- statistic-learning-R-note Public
📒 "Statistical Learning Methods (Li Hang): Notes from Principles to Implementation" (《统计学习方法-李航: 笔记-从原理到实现》): 200-page PDF notes with detailed explanations of the math formulas and R-language implementations of many of the algorithms. 🎉