v0.1.2
Pre-release
Release Notes (v0.1.2)
Adafactor8Bit v0.1.2
Fixes the mathematical factorization for N-D tensors and further reduces peak memory while improving robustness:
- N-D Tensor Mathematical Fix: Fixed a bug where 3D/4D tensors (e.g., Conv2d, MoE weights) were incorrectly flattened during row/column factorization. Factorization now runs strictly along the last two dimensions (`dim=-1` and `dim=-2`), and the CUDA kernels use index mapping to handle the leading batch dimensions, which keeps the algorithm mathematically correct.
- Peak Memory Optimization: Replaced `grad.square().mean()` with `torch.norm`, which avoids materializing the large intermediate squared-gradient tensor in Eager Mode and significantly reduces peak memory usage during training.
- Streamlined Reduction Pipeline: The update norm is now accumulated directly into a global scalar with `atomicAdd`, eliminating the extra memory allocation and the PyTorch-level `.sum()` kernel call.
- Edge Cases & Robustness: Added support for 0-D scalar tensors (preventing `IndexError` crashes), and added safety checks with automatic alignment for non-contiguous memory views to prevent implicit out-of-bounds access.
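
The N-D fix and the 0-D edge case can be illustrated with a small pure-Python model of the indexing. This is only a sketch, not the library's kernel code: the helper names `factored_stat_shapes` and `stat_offsets` are hypothetical, the tensor is assumed to be contiguous in row-major order, and the choice to skip factorization for tensors with fewer than two dimensions is an assumption (the notes above only state that 0-D tensors no longer crash).

```python
def factored_stat_shapes(shape):
    """Shapes of the row and column statistics when factoring along the
    last two dims and treating all leading dims as batch dims.

    Returns None for 0-D and 1-D shapes (assumption: such tensors are
    not factored at all)."""
    shape = tuple(shape)
    if len(shape) < 2:
        return None
    *batch, rows, cols = shape
    # Row statistic reduces over dim=-1, column statistic over dim=-2.
    return (*batch, rows), (*batch, cols)


def stat_offsets(flat_index, shape):
    """Map a flat offset into a contiguous (..., R, C) tensor to the flat
    offsets of the matching entries in the row and column statistics."""
    *batch, rows, cols = shape
    # Split the flat offset into (batch slice, row, column).
    b, rem = divmod(flat_index, rows * cols)
    r, c = divmod(rem, cols)
    # Each batch slice owns `rows` row entries and `cols` column entries.
    return b * rows + r, b * cols + c


# A batched weight of shape (8, 128, 256), e.g. one matrix per expert:
row_shape, col_shape = factored_stat_shapes((8, 128, 256))
# row_shape == (8, 128), col_shape == (8, 256)

# 0-D scalar: nothing to factor, and no indexing into an empty shape.
scalar_case = factored_stat_shapes(())
# scalar_case is None

# Element [1, 2, 3] of a (2, 3, 4) tensor sits at flat offset 1*12 + 2*4 + 3 = 23.
row_off, col_off = stat_offsets(23, (2, 3, 4))
# row_off == 5 (batch 1, row 2), col_off == 7 (batch 1, column 3)
```

Because the batch index `b` enters both offsets, each slice along the leading dimensions keeps its own row and column statistics instead of sharing them with other slices, which is what flattening the leading dimensions into the factored pair would break.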