Release v0.1.2 · yanfeiwong/adafactor-8bit

Adafactor8Bit v0.1.2

This release fixes the row/column factorization for N-D tensors, lowers peak memory usage, and improves robustness:

N-D Tensor Factorization Fix: Fixed a bug where 3D/4D tensors (e.g., Conv2d or MoE weights) were incorrectly flattened during row/column factorization. Factorization now runs strictly along the last two dimensions (dim=-2 and dim=-1), and the CUDA kernels use index mapping to handle the leading batch dimensions, so the factored statistics are mathematically correct.
Peak Memory Optimization: Replaced grad.square().mean() with torch.norm, which avoids materializing a full-size squared-gradient tensor in eager mode and significantly reduces peak memory usage during training.
Streamlined Reduction Pipeline: The update norm is now accumulated directly into a global scalar with atomicAdd, removing the extra memory allocation and the PyTorch-level .sum() kernel call.
Edge Cases & Robustness: Added support for 0-D scalar tensors, which previously crashed with an IndexError. Non-contiguous memory views are now intercepted and automatically aligned, preventing implicit out-of-bounds access.
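
The PyTorch-level changes above can be illustrated with a minimal sketch. This is not the library's actual code: the helper name factored_second_moment is hypothetical, torch.linalg.vector_norm stands in for the torch.norm call mentioned in the notes, and calling .contiguous() is only an assumption about what "automatically aligned" means. The sketch computes row and column mean-square statistics along the last two dimensions, falls back to a single unfactored value for 0-D and 1-D tensors, and relies on the identity mean(g^2) = ||g||^2 / n.

```python
import torch


def factored_second_moment(grad: torch.Tensor):
    """Row/column mean-square statistics over the last two dims.

    Hypothetical helper for illustration only. Returns (row_ms, col_ms)
    for tensors with ndim >= 2, or (full_ms, None) for 0-D and 1-D
    tensors, which cannot be factored.
    """
    # Assumption: non-contiguous views are copied into a contiguous
    # layout up front, as a raw-pointer kernel would require.
    if not grad.is_contiguous():
        grad = grad.contiguous()

    if grad.ndim < 2:
        # 0-D and 1-D tensors: a single unfactored mean square.
        flat = grad.reshape(-1)
        return torch.linalg.vector_norm(flat).square() / flat.numel(), None

    rows, cols = grad.shape[-2], grad.shape[-1]
    # ||g||_2^2 / n equals mean(g^2); the norm is a reduction, so no
    # full-size g^2 tensor is written out first.
    row_ms = torch.linalg.vector_norm(grad, dim=-1).square() / cols  # (..., rows)
    col_ms = torch.linalg.vector_norm(grad, dim=-2).square() / rows  # (..., cols)
    return row_ms, col_ms


# A 3-D gradient, e.g. 4 stacked 8x16 weight matrices: the leading
# dimension is kept as a batch dimension rather than flattened.
g = torch.randn(4, 8, 16)
row_ms, col_ms = factored_second_moment(g)
print(row_ms.shape, col_ms.shape)  # torch.Size([4, 8]) torch.Size([4, 16])
```

Reducing over dim=-1 and dim=-2 separately keeps every leading dimension intact, which is the behavior the N-D fix describes. The atomicAdd change lives in the CUDA kernels and is not shown here.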
