DD-DuDa
Showing results

Optimized communication collectives for the Cerebras wafer-scale engine

Python · 6 stars · Updated Jun 5, 2024

A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.

C++ · 24 stars · Updated Mar 26, 2025
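The low-bit KV cache idea can be sketched in plain NumPy: store the floating-point cache as 4-bit codes plus a per-vector scale and zero-point, and dequantize on read. This is an illustrative sketch of the general technique, not this repository's actual kernels; the function names and the per-vector grouping are assumptions.

```python
import numpy as np

def quantize_kv_int4(kv: np.ndarray):
    """Asymmetric 4-bit quantization along the last (head_dim) axis.

    Illustrative only: the cache is stored as 4-bit codes plus a
    per-vector scale and minimum (zero-point).
    """
    lo = kv.min(axis=-1, keepdims=True)
    hi = kv.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 15.0          # 4 bits -> 16 levels
    scale = np.where(scale == 0, 1.0, scale)
    codes = np.clip(np.round((kv - lo) / scale), 0, 15).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv_int4(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
k_cache = rng.standard_normal((2, 8, 64)).astype(np.float32)  # (heads, seq, head_dim)
codes, scale, lo = quantize_kv_int4(k_cache)
k_hat = dequantize_kv_int4(codes, scale, lo)
err = np.abs(k_hat - k_cache).max()
print(f"4 bits/value instead of 32; max abs error ≈ {err:.3f}")
```

In a real system the codes would additionally be bit-packed (two values per byte), which is where the memory savings and bandwidth win come from.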

A Datacenter Scale Distributed Inference Serving Framework

Rust · 3,368 stars · 232 forks · Updated Mar 28, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python · 12,622 stars · 1,387 forks · Updated Mar 29, 2025

A Python Compiler Design Toolkit

Python · 327 stars · 86 forks · Updated Mar 29, 2025

High-throughput offline inference for MoE models with limited GPU memory.

Python · 9 stars · Updated Mar 25, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda · 5,109 stars · 536 forks · Updated Mar 28, 2025

FlashMLA: Efficient MLA decoding kernels

C++ · 11,387 stars · 811 forks · Updated Mar 1, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python · 1,695 stars · 101 forks · Updated Mar 7, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 43,032 stars · 6,535 forks · Updated Mar 29, 2025

Bag of Tricks for Inference-time Computation of LLM Reasoning

Python · 7 stars · 3 forks · Updated Mar 3, 2025

KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch → CUDA problems.

Python · 239 stars · 22 forks · Updated Mar 29, 2025

SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs

Cuda · 9 stars · 2 forks · Updated Mar 25, 2025

Fully open reproduction of DeepSeek-R1

Python · 23,470 stars · 2,136 forks · Updated Mar 29, 2025

MLIR For Beginners tutorial

C++ · 934 stars · 84 forks · Updated Feb 7, 2025

PyTorch library for cost-effective, fast and easy serving of MoE models.

Python · 157 stars · 12 forks · Updated Mar 27, 2025

A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.

C++ · 861 stars · 63 forks · Updated Mar 29, 2025

More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression

Python · 11 stars · Updated Jan 15, 2025

Tile primitives for speedy kernels

Cuda · 2,194 stars · 131 forks · Updated Mar 29, 2025

An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.

Python · 51 stars · 2 forks · Updated Jul 23, 2024

An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization

Python · 124 stars · 7 forks · Updated Feb 2, 2025
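The W4A8-style recipe this entry describes can be illustrated with a generic symmetric-quantization sketch: 4-bit per-output-channel weights, 8-bit per-token (dynamically scaled) activations, an integer matmul, then a rescale by the product of the scales. This is a NumPy toy under those assumptions, not this repository's algorithm; `sym_quant` and the chosen bit widths are illustrative.

```python
import numpy as np

def sym_quant(x, bits, axis):
    """Symmetric quantization with one scale per slice along `axis`."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32)).astype(np.float32)   # weights (out_dim, in_dim)
X = rng.standard_normal((4, 32)).astype(np.float32)    # activations (tokens, in_dim)

Wq, w_scale = sym_quant(W, bits=4, axis=1)   # W4: per-output-channel, static
Xq, x_scale = sym_quant(X, bits=8, axis=1)   # A8: per-token, dynamic (at runtime)

# Integer matmul, then dequantize the result with both scales.
Y_int = Xq.astype(np.int32) @ Wq.T.astype(np.int32)
Y = Y_int.astype(np.float32) * x_scale * w_scale.T
Y_ref = X @ W.T
print("max relative error:", np.abs(Y - Y_ref).max() / np.abs(Y_ref).max())
```

Static quantization would instead fix `x_scale` offline from calibration data, trading some accuracy for skipping the per-token max at inference time.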

⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️

Cuda · 63 stars · 3 forks · Updated Mar 25, 2025

Fast and memory-efficient exact attention

Python · 67 stars · 7 forks · Updated Mar 3, 2025
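The core trick behind memory-efficient exact attention is the online softmax: process K/V in tiles while maintaining a running max, a running normalizer, and an output accumulator, so the full score vector is never materialized. A single-query NumPy sketch of that idea (illustrative only; the block size and names are assumptions, and real kernels fuse this into on-chip tiles):

```python
import numpy as np

def naive_attention(q, K, V):
    """Reference: materializes all scores, then softmax."""
    s = K @ q
    p = np.exp(s - s.max())
    return (p / p.sum()) @ V

def flash_attention_1q(q, K, V, block=4):
    """Online-softmax attention for one query vector.

    Visits K/V one tile at a time, rescaling the running statistics
    whenever a new, larger score maximum appears.
    """
    m = -np.inf                                   # running max of scores
    l = 0.0                                       # running softmax denominator
    acc = np.zeros(V.shape[1], dtype=np.float64)  # unnormalized output
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q                      # scores for this tile only
        m_new = max(m, s.max())
        p = np.exp(s - m_new)
        correction = np.exp(m - m_new)            # rescale old stats to new max
        l = l * correction + p.sum()
        acc = acc * correction + p @ V[i:i+block]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K = rng.standard_normal((16, 8))
V = rng.standard_normal((16, 8))
q = rng.standard_normal(8)
print(np.allclose(flash_attention_1q(q, K, V), naive_attention(q, K, V)))  # prints True
```

The two functions agree to floating-point tolerance: the result is exact attention, only the memory-access pattern changes.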

Benchmark tests supporting the TiledCUDA library.

Cuda · 16 stars · 2 forks · Updated Nov 19, 2024

Implements Flash Attention using CuTe.

Cuda · 74 stars · 4 forks · Updated Dec 17, 2024

📚200+ Tensor/CUDA Cores Kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS/FA2 🎉🎉).

Cuda · 3,055 stars · 327 forks · Updated Mar 27, 2025