xlite-dev

Develop ML/AI toolkits and ML/AI/CUDA Learning resources.

Pinned

  1. LeetCUDA Public

    📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

    Cuda 5.7k 597

  2. lite.ai.toolkit Public

    🛠 A lite C++ AI toolkit: 100+ models with MNN, ORT and TRT, including Det, Seg, Stable-Diffusion, Face-Fusion, etc.🎉

    C++ 4.2k 751

  3. Awesome-LLM-Inference Public

    📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉

    Python 4.3k 295

  4. Awesome-DiT-Inference Public

    📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉

    Python 329 18

  5. torchlm Public

    💎An easy-to-use PyTorch library for face landmarks detection: training, evaluation, inference, and 100+ data augmentations.🎉

    Python 261 25

  6. ffpa-attn Public

    ⚡️FFPA: Extend FlashAttention-2 with Split-D, achieve ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA.🎉

    Cuda 192 8

Repositories

Showing 10 of 31 repositories
  • SageAttention Public Forked from thu-ml/SageAttention

    Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.

    Cuda 0 Apache-2.0 166 0 0 Updated Jul 21, 2025
  • LeetCUDA Public

    📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉

    Cuda 5,654 GPL-3.0 597 8 0 Updated Jul 21, 2025
  • diffusers Public Forked from huggingface/diffusers

    🤗 Diffusers: State-of-the-art diffusion models for image, video, and audio generation in PyTorch and FLAX.

    Python 0 Apache-2.0 6,188 0 0 Updated Jul 21, 2025
  • flux-faster Public

    A forked version of flux-fast that makes flux-fast even faster with cache-dit, 3.3x speedup on NVIDIA L20.

    Python 14 0 0 0 Updated Jul 18, 2025
  • flux-fast Public Forked from huggingface/flux-fast

    A forked version of flux-fast that makes flux-fast even faster with cache-dit.

    Python 5 8 0 0 Updated Jul 18, 2025
  • torchlm Public

    💎An easy-to-use PyTorch library for face landmarks detection: training, evaluation, inference, and 100+ data augmentations.🎉

    Python 261 MIT 25 15 0 Updated Jul 16, 2025
  • cache-dit Public Forked from vipshop/cache-dit

    🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers🔥

    Python 4 4 0 0 Updated Jul 15, 2025
  • SpargeAttn Public Forked from thu-ml/SpargeAttn

    SpargeAttention: A training-free sparse attention that can accelerate any model inference.

    Cuda 6 Apache-2.0 48 0 0 Updated Jul 14, 2025
  • nunchaku Public Forked from nunchaku-tech/nunchaku

    [ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models

    Python 2 Apache-2.0 125 0 0 Updated Jul 14, 2025
  • Awesome-DiT-Inference Public

    📚A curated list of Awesome Diffusion Inference Papers with Codes: Sampling, Cache, Quantization, Parallelism, etc.🎉

    Python 329 GPL-3.0 18 0 0 Updated Jul 14, 2025
