v0.1.2

Pre-release

@yanfeiwong released this 07 Jun 17:04 · 21 commits to main since this release

Release Notes (v0.1.2)

Adafactor8Bit v0.1.2

Fixes the row/column factorization for N-D tensors, reduces peak memory usage, and improves robustness:

  • N-D Tensor Factorization Fix: Fixed a bug where 3D/4D tensors (e.g., Conv2d, MoE weights) were incorrectly flattened during row/column factorization. Factorization now runs strictly along the last two dimensions (dim=-1, -2), and the CUDA kernel uses index mapping to handle the leading batch dimensions, so the factored statistics are mathematically correct for batched weights.
  • Peak Memory Optimization: Replaced grad.square().mean() with torch.norm, which avoids materializing a full-size squared-gradient tensor in eager mode. This lowers peak memory usage during training, most noticeably for large matrices.
  • Streamlined Reduction Pipeline: The update norm is now accumulated directly into a global scalar with atomicAdd, which removes the extra memory allocation and the PyTorch-level .sum() kernel call.
  • Edge Cases & Robustness: 0-D scalar tensors are now supported (previously they crashed with an IndexError). Non-contiguous memory views are now intercepted and automatically aligned, preventing implicit out-of-bounds access.
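
The first two items can be illustrated with a small PyTorch sketch. It is not the library's actual code: the helper name `factored_second_moments` and the fallback for tensors with fewer than two dimensions are assumptions made for illustration. It computes row and column mean squares along the last two dimensions only, treats all leading dimensions as batch, and recovers each mean square from a norm (`norm² / n`) rather than from `grad.square().mean()`.

```python
import torch


def factored_second_moments(grad: torch.Tensor):
    """Hypothetical helper: row/column mean squares over the last two dims.

    Leading dimensions (e.g. the expert axis of an MoE weight) are kept as
    batch dimensions instead of being flattened into the rows.
    """
    if grad.dim() < 2:
        # 0-D scalars and 1-D vectors have no row/column structure.
        # Assumed fallback for illustration: an unfactored element-wise statistic.
        # Indexing grad.shape[-1] on a 0-D tensor would raise IndexError.
        return grad.square(), None

    n_rows, n_cols = grad.shape[-2], grad.shape[-1]

    # mean(g^2) along a dim equals ||g||_2^2 / n along that dim, so the
    # reduction can be done without allocating a full-size grad.square().
    row_ms = torch.norm(grad, p=2, dim=-1).square() / n_cols  # shape [..., n_rows]
    col_ms = torch.norm(grad, p=2, dim=-2).square() / n_rows  # shape [..., n_cols]
    return row_ms, col_ms


# MoE-like weight gradient: (experts, out_features, in_features)
grad = torch.randn(4, 16, 32)
row_ms, col_ms = factored_second_moments(grad)
print(row_ms.shape, col_ms.shape)
```

For the `(4, 16, 32)` gradient above, the row statistic keeps one value per expert and output row, and the column statistic keeps one value per expert and input column. Flattening the leading dimensions into the rows, as the old behaviour is described, would instead mix statistics across experts.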