Stars
Optimized communication collectives for the Cerebras waferscale engine
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
A Datacenter Scale Distributed Inference Serving Framework
SGLang is a fast serving framework for large language models and vision language models.
High-throughput offline inference for MoE models with limited GPU memory.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
MoBA: Mixture of Block Attention for Long-Context LLMs
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list).
Bag of Tricks for Inference-time Computation of LLM Reasoning
KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch -> CUDA translation problems.
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
Fully open reproduction of DeepSeek-R1
PyTorch library for cost-effective, fast and easy serving of MoE models.
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Tile primitives for speedy kernels
An extension of TVMScript for writing simple, high-performance GPU kernels with tensor cores.
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak performance.⚡️
Fast and memory-efficient exact attention
Benchmark tests supporting the TiledCUDA library.
An implementation of Flash Attention using CuTe.
📚 200+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma and ⚡️hgemm with WMMA, MMA and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
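For the vLLM entry above, a minimal sketch of its offline generation API is shown below; the model name facebook/opt-125m and the sampling settings are illustrative assumptions, and exact signatures may vary between vLLM versions.

```python
# Minimal offline-generation sketch with vLLM (assumes vLLM is installed
# and the facebook/opt-125m checkpoint is obtainable from Hugging Face).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # load the model into the engine
params = SamplingParams(temperature=0.8, max_tokens=64) # example sampling settings

# Generate completions for a batch of prompts and print the first candidate of each.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```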