Releases: yanfeiwong/adafactor-8bit
v0.1.9
Release Notes (v0.1.9)
New Features
- Fira Limiter Support for the Adafactor Path: Added the `enable_fira_for_adafactor` option. The standard Adafactor update path can now use the Fira Norm-Growth Limiter to smooth parameter updates and reduce loss spikes. In many training setups this makes external gradient clipping (e.g. `clip_grad_norm_`) unnecessary, which simplifies the training pipeline.
- Configurable `fira_margin`: Added the global `fira_margin` parameter (default: `0.01`). The 1% norm-growth threshold that was previously hardcoded in the APOLLO path is now configurable and applies to both the APOLLO and Adafactor update paths, so users can tune how much norm growth is tolerated for a given workload.
Documentation & Examples
- Advanced Hybrid Routing Example: Added `examples/advanced_usage.py`, which demonstrates hybrid routing strategies for architectures with different parameter types and tensor shapes. The example also shows how to combine APOLLO, Adafactor, quantization, and the Fira Limiter in a single optimizer configuration.
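The norm-growth limiting described above can be sketched in a few lines. This is a minimal illustration, not the library's implementation: it assumes the limiter caps the update norm at `(1 + fira_margin)` times the previous step's norm, and the function name `limit_norm_growth` is hypothetical.

```python
import math

def limit_norm_growth(update, prev_norm, fira_margin=0.01):
    """Rescale `update` so its L2 norm grows by at most (1 + fira_margin) per step.

    Returns the (possibly rescaled) update and the norm to carry into the next step.
    Assumption: the threshold is (1 + fira_margin) * previous norm.
    """
    norm = math.sqrt(sum(u * u for u in update))
    if prev_norm is not None and prev_norm > 0.0:
        max_norm = (1.0 + fira_margin) * prev_norm
        if norm > max_norm:
            scale = max_norm / norm
            update = [u * scale for u in update]   # direction is unchanged
            norm = max_norm
    return update, norm

# First step: nothing to compare against, so the update passes through.
_, prev = limit_norm_growth([3.0, 4.0], prev_norm=None)
# Second step: a sudden 10x jump is capped to 1% growth over the previous norm.
spiky, new_norm = limit_norm_growth([30.0, 40.0], prev_norm=prev)
```

Raising `fira_margin` loosens the cap and lets the update norm recover faster after a quiet stretch; lowering it makes the limiter stricter.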
v0.1.8
Release Notes (v0.1.8)
🌕 "Tranquility Base": APOLLO Low-Rank Projection and the Fira Shock Absorber
"One small step for optimizer state, one giant leap for convergence."
- APOLLO Random Subspace Projection: Introduced the `apollo_rank` parameter. When enabled, the optimizer projects gradients into a low-rank subspace to estimate second-moment scaling factors. Compared with Adafactor's default assumption that rows and columns are independent, random projection captures richer covariance information and accelerates convergence at very low memory overhead.
- Fira Norm-Growth Limiter Integration: Equipped the APOLLO path with a "shock absorber". By dynamically capping the norm growth rate of the scaled gradients, it suppresses the abrupt gradient jumps (loss spikes) caused by periodic refreshes of the projection matrix, which helps keep low-rank training stable.
- Extreme VRAM Compression Option (`apollo_factorize`): Offers an experimental "row/column factorization within the low-rank subspace" option. Relying on the norm-preserving property of random projections, it applies Adafactor's row/column factorization inside the low-rank space to reduce optimizer state memory even further.
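The norm-preserving property mentioned above can be checked with a small experiment. The sketch below assumes a Gaussian projection matrix with entries drawn from N(0, 1/rank); the helper names are hypothetical and the library's actual projection may be constructed differently.

```python
import math
import random

def random_projection(rank, dim, seed=0):
    """Gaussian projection matrix with entries ~ N(0, 1/rank), shape [rank, dim]."""
    rng = random.Random(seed)
    std = 1.0 / math.sqrt(rank)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(rank)]

def project(matrix, vector):
    """Matrix-vector product: maps a dim-sized vector to a rank-sized one."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def l2_norm(vector):
    return math.sqrt(sum(v * v for v in vector))

grad_column = [math.sin(i) for i in range(512)]   # stand-in for one gradient column
proj = random_projection(rank=128, dim=512)
ratio = l2_norm(project(proj, grad_column)) / l2_norm(grad_column)
# `ratio` lands close to 1: the projected vector has roughly the same norm.
```

Because norms survive the projection approximately, statistics computed in the 128-dimensional space can stand in for those of the full 512-dimensional gradient, which is the idea that `apollo_rank` and `apollo_factorize` build on.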
v0.1.7
Release Notes (v0.1.7)
CUDA Edge-Case Defense & State Robustness Hardening
- Global Norm Overflow Protection: Added a log-domain safety threshold clamp in the CUDA kernels. Under extremely sparse gradients a single-element update could become excessively large, overflow the FP32 accumulator to INF, and invalidate the global update; the clamp prevents this.
- EMA State Anti-Collapse Hardening: Added hard clamps on the upper and lower log-domain bounds in the fused quantization kernel, closing the paths by which loss spikes or low-level floating-point anomalies could cause scale explosions and undefined behavior (UB).
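A log-domain clamp of this kind might look like the sketch below. The bounds `LOG2_MIN` and `LOG2_MAX`, the function name, and the handling of NaN are assumptions made for illustration; the kernel's real thresholds are not stated in these notes.

```python
import math

LOG2_MIN, LOG2_MAX = -60.0, 60.0   # hypothetical bounds, chosen only for illustration

def safe_exp2(log2_value):
    """Clamp a log2-domain value before exponentiating so the result stays finite."""
    if math.isnan(log2_value):
        # Assumption: map NaN to the smallest allowed scale instead of propagating it.
        return 2.0 ** LOG2_MIN
    clamped = min(max(log2_value, LOG2_MIN), LOG2_MAX)
    return 2.0 ** clamped
```

Clamping before the exponential, rather than after, means an infinite or absurdly large exponent never reaches the accumulator in the first place.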
v0.1.6
Release Notes (v0.1.6)
CUDA Numerical Stability Enhancements & Kernel Performance Optimization
- Log-Space Computation in CUDA Kernels: Refactored the 1D and 2D parameter update logic, replacing linear-space multiplications and divisions with log-space additions and subtractions while keeping the math strictly equivalent.
- Underflow Mitigation for Small Variances: Fixed the distortion in which the product of very small variances collapsed to zero in extreme gradient scenarios, improving numerical robustness for long-tailed distributions.
- CUDA Math Instruction Optimization: Adjusted the low-level math instruction mix, using hardware SFU instructions in place of some of the more expensive exponential and square-root operations to reduce kernel execution overhead.
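The underflow problem and the log-space fix can be reproduced on the CPU. The sketch assumes the standard Adafactor reconstruction `v_ij = r_i * c_j / mean(r)` and uses float64 with a deliberately tiny value, since Python has no native FP32; the function names are hypothetical.

```python
import math

def factored_variance_linear(row_var, col_var, row_mean):
    """Linear-space reconstruction: v_ij = r_i * c_j / mean(r)."""
    return row_var * col_var / row_mean

def factored_variance_log(row_var, col_var, row_mean):
    """Same quantity in log space: exp2(log2 r_i + log2 c_j - log2 mean(r)).

    Inputs must be strictly positive for log2 to be defined.
    """
    return 2.0 ** (math.log2(row_var) + math.log2(col_var) - math.log2(row_mean))

tiny = 1e-200   # small enough that tiny * tiny underflows even in float64
linear = factored_variance_linear(tiny, tiny, tiny)   # intermediate product becomes 0.0
logged = factored_variance_log(tiny, tiny, tiny)      # stays near 1e-200
```

The two functions are mathematically identical; only the order in which rounding happens differs, and the log-space version never forms the out-of-range intermediate product.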
v0.1.5
Release Notes (v0.1.5)
Long Training Support, Compilation Compatibility & Documentation Improvements
- Support for long-term continual training: Added the `beta2` parameter, which removes the hard coupling between the EMA decay and the training step count. This prevents the optimizer from becoming sluggish as the EMA window inflates over very long training runs.
- JIT compilation bypass switch: Added the `use_cuda_kernel` parameter. It allows JIT compilation to be disabled explicitly in environments without a CUDA compiler, falling back directly to the pure PyTorch implementation. This release also fixes the stall caused by repeated retries after a compilation failure.
- Documentation improvements.
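To see why a step-coupled decay inflates the EMA window, consider the schedule from the original Adafactor recipe, `1 - step**(-0.8)`. The sketch below assumes that schedule is what the optimizer used before `beta2` was added; the function names are hypothetical.

```python
def scheduled_beta2(step, decay_rate=-0.8):
    """Step-dependent decay from the original Adafactor recipe: 1 - step**decay_rate."""
    return 1.0 - step ** decay_rate

def ema_window(beta2):
    """Rough number of recent steps an EMA with decay `beta2` averages over."""
    return 1.0 / (1.0 - beta2)

# The window keeps growing with the step count under the schedule...
for step in (10, 1_000, 100_000):
    print(step, round(ema_window(scheduled_beta2(step))))

# ...while a fixed beta2 keeps it constant no matter how long training runs.
fixed_window = ema_window(0.999)
```

With the schedule, the second-moment estimate reacts more and more slowly as training proceeds; a fixed `beta2` trades that for a constant reaction time.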
v0.1.4
Release Notes (v0.1.4)
Mathematical Logic Alignment & Training Robustness Enhancements
- Strict clamping alignment: Removed the hard truncation, restored the `eps1²` clamping mechanism to match the official implementation, and added FP32 underflow protection.
- Fixed state pollution bug: Corrected an issue in the non-quantized 1D path where in-place operations accidentally overwrote the EMA state.
- Added NaN/Inf defense: Added sanitization of extreme gradients in the CUDA operators to prevent loss spikes from destroying local quantization states.
- Added decoupled weight decay: Provides a `decoupled_weight_decay` option.
- Default parameter alignment: Aligned the default values of `eps` and `relative_step` with the official PyTorch defaults.
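Decoupled weight decay, in the general AdamW sense, shrinks the parameter directly instead of adding a decay term to the gradient. The sketch below shows that general idea only; the function name is hypothetical and the exact order of operations behind `decoupled_weight_decay` is an assumption.

```python
def step_with_decoupled_decay(param, update, lr, weight_decay):
    """Shrink the parameter directly, then apply the optimizer update.

    The decay term never enters the gradient, so it is not rescaled by the
    adaptive second-moment statistics.
    """
    decayed = [p * (1.0 - lr * weight_decay) for p in param]
    return [p - lr * u for p, u in zip(decayed, update)]

# With a zero update, only the decay acts: each weight shrinks by lr * weight_decay.
new_param = step_with_decoupled_decay([1.0, -2.0], update=[0.0, 0.0],
                                      lr=0.1, weight_decay=0.5)
```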
v0.1.3
Release Notes (v0.1.3)
Fundamental quantization logic overhaul, introducing log-space mapping and on-the-fly CUDA kernels to improve numerical stability and memory efficiency:
- Log-Space Quantization Overhaul: Migrated the quantization of the second moment (variance) from linear space to log space (`log2`/`exp2`). This better fits the long-tailed distribution of variances and mitigates the weight oscillations caused by very small variances being truncated to zero.
- Zero-Materialization CUDA Kernels: Removed intermediate tensor instantiation on the Python side. Dequantization, EMA updates, and requantization are now performed on the fly inside CUDA, reducing memory read/write overhead and peak memory usage.
- Streamlined State Management: Changed the `step` counter from a GPU tensor to a native Python scalar, reducing device-side memory overhead and making the control flow more lightweight.
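The difference between linear and log-space 8-bit codes is easy to see on a single small variance. This sketch assumes a fixed log2 range `[LOG2_LO, LOG2_HI]` and simple round-to-nearest codes; the constants and function names are hypothetical and the library's actual scaling scheme may differ.

```python
import math

LOG2_LO, LOG2_HI = -40.0, 0.0   # hypothetical quantization range in the log2 domain

def quantize_linear(v, v_max=1.0):
    """Linear 8-bit code: evenly spaced levels between 0 and v_max."""
    return round(min(max(v / v_max, 0.0), 1.0) * 255)

def dequantize_linear(q, v_max=1.0):
    return q / 255 * v_max

def quantize_log(v):
    """Log-space 8-bit code: evenly spaced levels in log2(v); requires v > 0."""
    x = min(max(math.log2(v), LOG2_LO), LOG2_HI)
    return round((x - LOG2_LO) / (LOG2_HI - LOG2_LO) * 255)

def dequantize_log(q):
    return 2.0 ** (LOG2_LO + q / 255 * (LOG2_HI - LOG2_LO))

small = 1e-8
lin = dequantize_linear(quantize_linear(small))   # collapses to 0.0
log = dequantize_log(quantize_log(small))         # stays within a few percent of 1e-8
```

A variance that dequantizes to zero turns the following division by its square root into an unbounded update, which is the oscillation the log-space mapping avoids.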
v0.1.2
Release Notes (v0.1.2)
Adafactor8Bit v0.1.2
Fixes the mathematical factorization for N-D tensors and further improves peak memory usage and robustness:
- N-D Tensor Mathematical Fix: Fixed a bug where the dimensions of 3D/4D tensors (e.g., Conv2d and MoE weights) were incorrectly flattened during row/column factorization. Factorization now runs strictly along the last two dimensions (`dim=-1, -2`), and the CUDA code supports batch dimensions through index mapping, so the algorithm is mathematically correct.
- Peak Memory Optimization: Replaced `grad.square().mean()` with `torch.norm`, which avoids materializing the large intermediate squared-gradient tensor in Eager Mode and significantly reduces peak memory usage during training.
- Streamlined Reduction Pipeline: Update norms are now accumulated directly into a global scalar with `atomicAdd`, eliminating an extra memory allocation and the PyTorch-level `.sum()` kernel call.
- Edge Cases & Robustness: Added support for 0-D scalar tensors (avoiding `IndexError` crashes), and added safe interception and automatic alignment of non-contiguous memory views to prevent implicit out-of-bounds access.
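Factorizing along the last two dimensions while keeping the batch dimension can be sketched on nested lists. This is an illustrative sketch, not the library's code: it assumes a `[B, R, C]` input and mean-of-squares statistics, and the function name is hypothetical.

```python
def factored_stats(grad):
    """Row/column mean squares over the last two dims of a [B, R, C] nested list.

    Returns (row_stats, col_stats) with shapes [B, R] and [B, C]; the leading
    batch dimension is kept, not flattened into the rows.
    """
    row_stats, col_stats = [], []
    for matrix in grad:                       # iterate over the batch dimension
        n_rows, n_cols = len(matrix), len(matrix[0])
        row_stats.append([sum(x * x for x in row) / n_cols for row in matrix])
        col_stats.append([sum(matrix[i][j] ** 2 for i in range(n_rows)) / n_rows
                          for j in range(n_cols)])
    return row_stats, col_stats

grad = [[[1.0, 2.0], [3.0, 4.0]],      # batch item 0
        [[0.0, 0.0], [2.0, 2.0]]]      # batch item 1
rows, cols = factored_stats(grad)
```

Flattening the batch into the row dimension would instead produce a single set of column statistics shared by both batch items, mixing statistics from unrelated matrices.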
v0.1.1
Release Notes (v0.1.1)
Adafactor8Bit v0.1.1
This release focuses on low-level CUDA kernel optimizations, improving training throughput while preserving the memory advantage of 8-bit quantization:
- Kernel Fusion: Fused EMA updates, quantization, norm computation, and parameter updates into four dedicated kernels, reducing kernel launch overhead and intermediate tensor allocations.
- Zero-Materialization 2D Variance: For the outer product of the row and column variances, `v_ij` is computed on the fly inside the kernel, so the full `[R, C]` variance matrix is never materialized and peak memory usage drops.
- Mixed-Precision Support: The `apply_update` kernel supports in-place parameter updates for FP32, FP16, and BF16 tensors, which suits mixed-precision training.
- Memory Access Optimization: Improved the warp-level reduction implementation and ensured that `float4`/`uchar4` vectorized loads meet their alignment requirements, increasing memory bandwidth utilization.
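The zero-materialization idea can be illustrated without CUDA: each `v_ij` is rebuilt from the two factor vectors at the moment it is needed. The sketch assumes the standard Adafactor reconstruction `v_ij = r_i * c_j / mean(r)`; the function names are hypothetical and the real kernels operate on quantized state.

```python
def variance_at(row_var, col_var, row_mean, i, j):
    """Reconstruct one entry v_ij = r_i * c_j / mean(r) from the two factor vectors."""
    return row_var[i] * col_var[j] / row_mean

def scaled_update(grad, row_var, col_var):
    """Divide each gradient entry by sqrt(v_ij), computing v_ij one element at a time."""
    row_mean = sum(row_var) / len(row_var)    # computed once, reused for every element
    return [[g / variance_at(row_var, col_var, row_mean, i, j) ** 0.5
             for j, g in enumerate(row)]
            for i, row in enumerate(grad)]

row_var = [1.0, 3.0]          # length R
col_var = [2.0, 8.0]          # length C
update = scaled_update([[1.0, 2.0], [3.0, 4.0]], row_var, col_var)
```

Only the `R + C` factor values are stored; the `R * C` variance entries exist only transiently, one at a time.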
v0.1.0 - Initial Release
Initial Release