Release v0.1.1 · yanfeiwong/adafactor-8bit

Release Notes (v0.1.1)

Adafactor8Bit v0.1.1

This release focuses on low-level CUDA kernel optimizations that improve training throughput while preserving the memory savings of 8-bit quantization:

Kernel fusion: EMA updates, quantization, norm computation, and parameter updates are fused into four dedicated kernels, reducing kernel launch overhead and intermediate tensor allocations.
Zero-materialization 2D variance: For the outer product of the factorized row and column second-moment statistics, v_ij is computed on the fly inside the kernel, so the full [R, C] variance matrix is never materialized and peak memory usage drops.
Mixed-precision support: The apply_update kernel now supports in-place parameter updates for FP32, FP16, and BF16 tensors, which suits mixed-precision training.
Memory access optimization: The warp-level reduction has been improved, and float4/uchar4 vectorized loads are guaranteed to meet alignment requirements, increasing memory bandwidth utilization.
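
The zero-materialization item above can be illustrated in plain Python. This is only a sketch, assuming the standard Adafactor factorization v_ij = R_i * C_j / sum(R), where R and C are the row and column sums of the squared gradient; the release notes do not spell out the exact formula. The function names `factored_stats` and `factored_update` are hypothetical and not part of this package, and EMA decay and 8-bit quantization are left out.

```python
import math


def factored_stats(grad):
    """Row and column sums of the squared gradient (no EMA, illustration only)."""
    row_sum = [sum(g * g for g in row) for row in grad]
    col_sum = [sum(row[j] ** 2 for row in grad) for j in range(len(grad[0]))]
    return row_sum, col_sum


def factored_update(grad, row_sum, col_sum, eps=1e-30):
    """Return grad / sqrt(v), with v_ij = row_sum[i] * col_sum[j] / sum(row_sum).

    v_ij is recomputed for each element from the R + C factored statistics,
    so no [R, C] variance matrix is ever allocated.
    """
    total = sum(row_sum)
    out = []
    for i, g_row in enumerate(grad):
        row_scale = row_sum[i] / total  # shared by every element of row i
        out.append([
            g / math.sqrt(row_scale * col_sum[j] + eps)
            for j, g in enumerate(g_row)
        ])
    return out


if __name__ == "__main__":
    grad = [[1.0, 2.0], [2.0, 4.0]]
    row_sum, col_sum = factored_stats(grad)
    print(factored_update(grad, row_sum, col_sum))
```

Here the squared gradient is rank one, so the factorization reproduces it exactly and every normalized update comes out at approximately 1. A real kernel would presumably write the result back into the parameter tensor in place rather than returning a new matrix.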
