Stars
Optimized communication collectives for the Cerebras waferscale engine
A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache.
A Datacenter Scale Distributed Inference Serving Framework
SGLang is a fast serving framework for large language models and vision language models.
High-throughput offline inference for MoE models with limited GPU memory.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
MoBA: Mixture of Block Attention for Long-Context LLMs
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list).
Bag of Tricks for Inference-time Computation of LLM Reasoning
KernelBench: Can LLMs Write GPU Kernels? A benchmark of Torch -> CUDA translation problems.
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
Fully open reproduction of DeepSeek-R1
PyTorch library for cost-effective, fast and easy serving of MoE models.
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels.
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression
Tile primitives for speedy kernels
An extension of TVMScript for writing simple, high-performance GPU kernels with tensor cores.
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA and CuTe APIs, achieving peak performance.⚡️
Fast and memory-efficient exact attention
Benchmark tests supporting the TiledCUDA library.
An implementation of Flash Attention using CuTe.
📚 200+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma and ⚡️hgemm with WMMA, MMA and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
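For the vLLM entry above, a minimal sketch of its offline generation API is shown below; the model name facebook/opt-125m and the sampling settings are illustrative assumptions, and exact signatures may vary between vLLM versions.

```python
# Minimal offline-generation sketch with vLLM (assumes vLLM is installed
# and the facebook/opt-125m checkpoint is obtainable from Hugging Face).
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                    # load the model into the engine
params = SamplingParams(temperature=0.8, max_tokens=64) # example sampling settings

# Generate completions for a batch of prompts and print the first candidate of each.
outputs = llm.generate(["The capital of France is"], params)
for out in outputs:
    print(out.outputs[0].text)
```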