v0.1.1

Pre-release

@yanfeiwong yanfeiwong released this 07 Jun 15:57
· 22 commits to main since this release

Release Notes (v0.1.1)

Adafactor8Bit v0.1.1

This release focuses on low-level CUDA kernel optimizations that improve training throughput while preserving the memory savings of 8-bit quantization:

  • Kernel Fusion: EMA updates, quantization, norm computation, and parameter updates are fused into four dedicated kernels, reducing kernel launch overhead and intermediate tensor allocations
  • Zero-Materialization 2D Variance: For the factorized second moment (the outer product of row and column statistics), v_ij is computed on the fly inside the kernels, so the full [R, C] variance matrix is never materialized and peak memory usage drops
  • Mixed-Precision Support: The apply_update kernel supports in-place parameter updates for FP32, FP16, and BF16 tensors, which suits mixed-precision training
  • Memory Access Optimization: The warp-level reduction implementation is improved, and float4/uchar4 vectorized loads are guaranteed to meet their alignment requirements, increasing effective memory bandwidth utilization
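
To illustrate the zero-materialization item, here is a minimal pure-Python sketch of computing v_ij on the fly. It assumes the usual Adafactor factorization, v_ij = row_mean[i] * col_mean[j] / mean(row_mean); the release notes do not state the exact formula the kernels use, and the names `row_col_means` and `scaled_update` are hypothetical, not part of Adafactor8Bit.

```python
import math

def row_col_means(sq):
    """Row and column means of a dense [R, C] matrix of squared gradients."""
    n_rows, n_cols = len(sq), len(sq[0])
    row_mean = [sum(row) / n_cols for row in sq]
    col_mean = [sum(sq[i][j] for i in range(n_rows)) / n_rows
                for j in range(n_cols)]
    return row_mean, col_mean

def scaled_update(grad, row_mean, col_mean, eps=1e-30):
    """Return grad[i][j] / sqrt(v_ij), deriving each v_ij when it is needed.

    Assumption: v_ij = row_mean[i] * col_mean[j] / mean(row_mean).
    Only the O(R + C) statistics are read; no [R, C] matrix of v is stored.
    The returned list stands in for the update a kernel would apply in place.
    """
    inv_mean = len(row_mean) / sum(row_mean)
    out = []
    for i, g_row in enumerate(grad):
        r_i = row_mean[i] * inv_mean  # per-row factor, computed once per row
        out.append([g / math.sqrt(max(r_i * c_j, eps))
                    for g, c_j in zip(g_row, col_mean)])
    return out
```

When the squared gradient is exactly rank one, the factorization reproduces it exactly, so every scaled entry has magnitude 1. That makes a convenient sanity check for a sketch like this.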
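
The warp-level reduction mentioned in the last item can be pictured with a small host-side simulation. This is only a sketch of the common shuffle-down tree pattern (the one usually written with `__shfl_down_sync` in CUDA), not the kernel code from this release; `warp_reduce_sum` is a hypothetical name.

```python
WARP_SIZE = 32

def warp_reduce_sum(lanes):
    """Simulate a shuffle-down tree reduction over one 32-lane warp."""
    assert len(lanes) == WARP_SIZE
    vals = list(lanes)
    offset = WARP_SIZE // 2
    while offset > 0:
        # Each lane reads the value held by lane (i + offset). A lane whose
        # source is out of range gets its own value back, which is how
        # __shfl_down_sync behaves; those upper lanes are never used again.
        shifted = [vals[i + offset] if i + offset < WARP_SIZE else vals[i]
                   for i in range(WARP_SIZE)]
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]  # lane 0 holds the full sum after log2(32) = 5 steps
```

For a norm computation, each lane would first accumulate the squares of its own elements, and the per-lane partial sums would then be combined this way.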
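
The alignment requirement for vectorized loads comes from the vector types themselves: a float4 load needs a 16-byte-aligned address and a uchar4 load a 4-byte-aligned one. The release does not say how misaligned or odd-sized buffers are handled, so the sketch below shows one common approach as an assumption: process a scalar head until the address is aligned, then groups of four, then a scalar tail. `split_for_vec4` is a hypothetical helper.

```python
def split_for_vec4(byte_address, numel, elem_bytes):
    """Split numel elements into (scalar head, vec4 groups, scalar tail).

    elem_bytes is 4 for float (float4 = 16 bytes) and 1 for unsigned char
    (uchar4 = 4 bytes). The vector body starts at the first address that is
    a multiple of 4 * elem_bytes.
    """
    vec_bytes = 4 * elem_bytes
    misalign = byte_address % vec_bytes
    if misalign % elem_bytes != 0:
        # The buffer is not even element-aligned, so no vector-aligned
        # address can be reached: fall back to scalar access throughout.
        return numel, 0, 0
    head = min(((vec_bytes - misalign) % vec_bytes) // elem_bytes, numel)
    groups = (numel - head) // 4
    tail = numel - head - 4 * groups
    return head, groups, tail
```

For example, a float buffer that starts 8 bytes past a 16-byte boundary needs two scalar elements before the first float4 load is legal.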