
- Data Science and Analytic Thrust, Information Hub, HKUST(GZ)
- Guangzhou
- https://www.zhihu.com/people/peijieDong
- https://pprp.github.io
- https://scholar.google.com/citations?user=TqS6s4gAAAAJ
Lists (32)
Attention
C++
CSBasic
DataAug
Dataset
diffusion
Distill
GPT
🗡️ Graph
Graph Structure Learning;
👹incremental
incremental learning
📥 interest
KAN
⭐ life
lightweight
👍 Meta
MLP
NAS
Object Detection
optimization
PEFT
LORA
🌟 Prune
quant
sparse_training
layer freeze
SPP
SSL
SSM
symbol
template
TestTimeAdaptation
utils
VIT
数字人 (digital human)
Starred repositories
📚 200+ Tensor/CUDA Core kernels: ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (98%~100% of cuBLAS/FA2 TFLOPS 🎉).
A repository aimed at pruning DeepSeek V3, R1 and R1-zero to a usable size
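For intuition, the simplest primitive that pruning pipelines like this build on is magnitude pruning: zero the weights with the smallest absolute value under a target sparsity. A generic PyTorch sketch of the idea, not this repo's DeepSeek-specific expert/layer pruning:

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the `sparsity` fraction of entries with the smallest |w|.
    LLM-scale pruning is usually structured (experts, heads, layers),
    but the magnitude-scoring idea is the same."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4, 8)
w_half = magnitude_prune(w, sparsity=0.5)  # roughly half the entries zeroed
```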
XAttention: Block Sparse Attention with Antidiagonal Scoring
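Going by the title, the scoring step ranks attention blocks by the sum of their antidiagonal entries and keeps only high-scoring blocks. A toy sketch of that scoring under my reading of it (block size and function name are illustrative, not the repo's API):

```python
import torch

def antidiagonal_block_scores(scores: torch.Tensor, block: int = 8):
    """Score each (block x block) tile of an attention-score map by the
    sum of its antidiagonal; low-scoring tiles can then be skipped."""
    n = scores.shape[-1]                      # assumes n divisible by block
    tiles = scores.reshape(n // block, block, n // block, block)
    tiles = tiles.permute(0, 2, 1, 3)         # (row_tile, col_tile, block, block)
    anti = torch.flip(torch.eye(block), dims=[1])
    return (tiles * anti).sum(dim=(-1, -2))   # one score per tile

s = torch.randn(64, 64).softmax(dim=-1)
importance = antidiagonal_block_scores(s)     # shape (8, 8)
```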
Build effective agents using Model Context Protocol and simple workflow patterns
ZO2 (Zeroth-Order Offloading): Full Parameter Fine-Tuning 175B LLMs with 18GB GPU Memory
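The zeroth-order trick behind such small memory footprints fits in a few lines: the gradient is estimated from two forward passes with antithetic weight perturbations, so no activations are kept for backprop. A minimal MeZO-style sketch, where `loss_fn` and the hyperparameters are illustrative rather than ZO2's actual API:

```python
import torch

@torch.no_grad()
def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One zeroth-order SGD step: two forward passes, no backward pass.
    Re-seeding the RNG replays the same random direction z three times,
    so z is never stored -- the core memory-saving trick."""
    params = [p for p in model.parameters() if p.requires_grad]

    def perturb(scale):
        gen = torch.Generator().manual_seed(seed)
        for p in params:
            z = torch.randn(p.shape, generator=gen).to(p.device)
            p.add_(scale * eps * z)

    perturb(+1)                                  # theta + eps*z
    loss_plus = float(loss_fn(model, batch))
    perturb(-2)                                  # theta - eps*z
    loss_minus = float(loss_fn(model, batch))
    perturb(+1)                                  # restore theta

    g = (loss_plus - loss_minus) / (2 * eps)     # directional derivative along z
    gen = torch.Generator().manual_seed(seed)
    for p in params:
        z = torch.randn(p.shape, generator=gen).to(p.device)
        p.add_(-lr * g * z)
```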
A Datacenter Scale Distributed Inference Serving Framework
😎 A curated list of tensor decomposition resources for model compression.
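The simplest member of that family is a truncated SVD of a linear layer's weight, which swaps one wide matmul for two thin ones; a generic sketch, not tied to any particular entry in the list:

```python
import torch

def factorize_linear(W: torch.Tensor, rank: int):
    """Approximate W (out x in) as A @ B with A: (out x r), B: (r x in).
    Parameters drop from out*in to r*(out + in)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]       # fold singular values into A
    B = Vh[:rank, :]
    return A, B

W = torch.randn(1024, 4096)
A, B = factorize_linear(W, rank=64)  # ~8% of the original parameters
x = torch.randn(4096)
y_full, y_low = W @ x, A @ (B @ x)   # low-rank path: two thin matmuls
```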
YaRN: Efficient Context Window Extension of Large Language Models
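The core of YaRN, as the paper describes it, is "NTK-by-parts" interpolation of RoPE: high-frequency dimensions keep their original rotation speed, low-frequency dimensions are slowed by the context-extension factor, with a linear ramp in between. A rough sketch under those assumptions (parameter names and the ramp bounds `alpha`/`beta` are illustrative defaults, and YaRN's additional attention-temperature scaling is omitted):

```python
import math
import torch

def yarn_frequencies(dim=128, base=10000.0, scale=8.0,
                     orig_ctx=4096, alpha=1.0, beta=32.0):
    """Interpolate RoPE inverse frequencies 'by parts': leave fast dims
    alone, divide slow dims by `scale`, ramp linearly in between."""
    inv_freq = base ** (-torch.arange(0, dim, 2) / dim)   # theta_d
    wavelength = 2 * math.pi / inv_freq
    rotations = orig_ctx / wavelength                     # turns per original context
    # gamma = 0 -> fully interpolate (theta/scale); gamma = 1 -> keep theta
    gamma = ((rotations - alpha) / (beta - alpha)).clamp(0.0, 1.0)
    return inv_freq * (gamma + (1.0 - gamma) / scale)

freqs = yarn_frequencies()   # plug into a standard RoPE implementation
```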
An open-source solution for full parameter fine-tuning of DeepSeek-V3/R1 671B, including complete code and scripts from training to inference, as well as some practical experiences and conclusions.
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization.
A series of math-specific large language models based on the Qwen2 series.
A lightweight data processing framework built on DuckDB and 3FS.
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Analyze computation-communication overlap in V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
fanshiqing / grouped_gemm
Forked from tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
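Semantically, a grouped GEMM fuses many independent matmuls with different per-group row counts (e.g., tokens routed to each MoE expert) into one kernel launch. A plain-PyTorch reference for what it computes, not the library's CUTLASS-backed API:

```python
import torch

def grouped_gemm_reference(xs, ws):
    """Reference semantics: one matmul per (x_i, w_i) pair, where each
    group may have a different M dimension (tokens per expert)."""
    return [x @ w for x, w in zip(xs, ws)]

# e.g., 3 experts receiving 5, 2, and 9 tokens of hidden size 16
xs = [torch.randn(m, 16) for m in (5, 2, 9)]
ws = [torch.randn(16, 32) for _ in range(3)]
ys = grouped_gemm_reference(xs, ws)  # output shapes: (5,32), (2,32), (9,32)
```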
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
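"Fine-grained scaling" means each small block of a tensor gets its own FP8 scale factor instead of one per-tensor scale, which keeps outliers from blowing up quantization error. A numerics-only simulation in PyTorch (the block size and helper names are illustrative; DeepGEMM does this inside fused kernels):

```python
import torch

def quant_fp8_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize each `block`-wide slice of x to FP8 e4m3 with its own
    scale (assumes x.numel() is divisible by `block`)."""
    xb = x.reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 448.0  # e4m3 max
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q.reshape(x.shape), scale

def dequant(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    return (q.float().reshape(-1, block) * scale).reshape(q.shape)

x = torch.randn(256, 256)
q, s = quant_fp8_blockwise(x)
err = (dequant(q, s) - x).abs().max()  # small: per-block scales bound the error
```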
A pandoc LaTeX template to convert markdown files to PDF or LaTeX.
enhuiz / eisvogel
Forked from Wandmalfarbe/pandoc-latex-template
A pandoc LaTeX template to convert markdown files to PDF or LaTeX.
D^2-MoE: Delta Decompression for MoE-based LLMs Compression
DeepEP: an efficient expert-parallel communication library