v0.1.1
Pre-release
Release Notes (v0.1.1)
Adafactor8Bit v0.1.1
This release focuses on low-level CUDA kernel optimizations that improve training throughput while preserving the memory savings of 8-bit quantization:
- Kernel Fusion: EMA updates, quantization, norm computation, and parameter updates are fused into four dedicated kernels, reducing kernel launch overhead and intermediate tensor allocations
- Zero-Materialization 2D Variance: For the outer product of the factorized row and column second-moment statistics, `v_ij` is computed on the fly inside the kernel, so the full `[R, C]` variance matrix is never materialized and peak memory usage is lower
- Mixed-Precision Support: The `apply_update` kernel now supports in-place parameter updates for FP32, FP16, and BF16 tensors, which suits mixed-precision training
- Memory Access Optimization: The warp-level reduction implementation is improved, and `float4`/`uchar4` vectorized loads are guaranteed to meet alignment requirements, increasing effective memory bandwidth utilization
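
The zero-materialization idea can be illustrated with a minimal plain-Python sketch. This is not the release's CUDA code: the function names are hypothetical, and the normalization (row and column means of squared gradients, divided by the mean of the row statistics) is the standard Adafactor factorization, assumed here because the notes do not spell out the exact formula. The sketch compares a reference path that builds the full `[R, C]` matrix with a path that rebuilds each `v_ij` from one row value and one column value at the point of use.

```python
import math


def factored_stats(grad, eps=1e-30):
    """Row and column means of squared gradients (assumed Adafactor-style stats)."""
    rows, cols = len(grad), len(grad[0])
    row = [sum(g * g for g in grad[i]) / cols + eps for i in range(rows)]
    col = [sum(grad[i][j] ** 2 for i in range(rows)) / rows + eps
           for j in range(cols)]
    return row, col


def update_materialized(grad, row, col):
    """Reference path: allocate the full [R, C] variance matrix, then divide."""
    row_mean = sum(row) / len(row)
    v = [[r * c / row_mean for c in col] for r in row]  # O(R*C) extra memory
    return [[g / math.sqrt(v_ij) for g, v_ij in zip(g_row, v_row)]
            for g_row, v_row in zip(grad, v)]


def update_on_the_fly(grad, row, col):
    """Fused path: v_ij is rebuilt per element and never stored as a matrix."""
    inv_row_mean = 1.0 / (sum(row) / len(row))
    out = []
    for i, g_row in enumerate(grad):
        scale_i = row[i] * inv_row_mean  # per-row factor, reused across columns
        out.append([g / math.sqrt(scale_i * col[j])
                    for j, g in enumerate(g_row)])
    return out


if __name__ == "__main__":
    grad = [[1.0, -2.0, 0.5], [0.25, 3.0, -1.5]]
    row, col = factored_stats(grad)
    print(update_on_the_fly(grad, row, col))
```

Both paths need only the `R + C` factorized statistics as persistent state; the difference is that the second one never holds an `R x C` temporary. In a real kernel the per-element work would be done by individual threads rather than a nested loop.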