v0.1.8
Release Notes (v0.1.8)
🌕 “Tranquility Base”: APOLLO Low-Rank Projection and Fira Shock Absorber
"One small step for optimizer state, one giant leap for convergence."
- APOLLO Random Subspace Projection: Introduced the `apollo_rank` parameter. When enabled, the optimizer projects gradients into a random low-rank subspace and estimates the second-moment scaling factors there. Compared with Adafactor's default factorization, which assumes the row and column statistics are independent, the random projection captures richer covariance information and speeds up convergence at very low memory cost.
- Fira Norm-Growth Limiter Integration: Equipped the APOLLO path with a "shock absorber". By dynamically capping the growth rate of the scaled gradient's norm, it suppresses the abrupt gradient jumps (loss spikes) triggered by periodic refreshes of the projection matrix, keeping low-rank training stable.
- Extreme VRAM Compression Option (`apollo_factorize`): An experimental "row/column factorization within the low-rank subspace" option. Relying on the approximately norm-preserving property of random projections, it applies Adafactor's row/column factorization inside the low-rank space, pushing optimizer state memory compression to its limits.
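To make the first two items concrete, here is a minimal NumPy sketch of the general idea: an APOLLO-style column-wise scaling computed from Adam statistics kept in a random low-rank space, followed by a Fira-style norm-growth limiter. It is an illustration under stated assumptions, not this library's implementation. The function names, the `rank` variable (standing in for `apollo_rank`), the refresh interval of 10 steps, and the growth cap `gamma=1.01` are all hypothetical choices made for the example.

```python
import numpy as np


def apollo_scaled_gradient(grad, proj, exp_avg, exp_avg_sq, step,
                           beta1=0.9, beta2=0.999, eps=1e-8):
    """Scale `grad` column-wise using Adam statistics kept in a low-rank space.

    grad:       (m, n) full gradient
    proj:       (r, m) random projection with entries ~ N(0, 1/r)
    exp_avg:    (r, n) first moment of the projected gradient (updated in place)
    exp_avg_sq: (r, n) second moment of the projected gradient (updated in place)
    """
    low = proj @ grad                               # (r, n) projected gradient
    exp_avg *= beta1
    exp_avg += (1.0 - beta1) * low
    exp_avg_sq *= beta2
    exp_avg_sq += (1.0 - beta2) * low**2
    m_hat = exp_avg / (1.0 - beta1**step)           # Adam bias correction
    v_hat = exp_avg_sq / (1.0 - beta2**step)
    low_update = m_hat / (np.sqrt(v_hat) + eps)     # Adam update in low-rank space
    # Per-column ratio of update norm to gradient norm, both measured in low-rank space.
    scale = np.linalg.norm(low_update, axis=0) / (np.linalg.norm(low, axis=0) + eps)
    return grad * scale                             # (n,) scale broadcasts over rows


def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Cap the update norm at gamma * prev_norm; return (update, resulting norm)."""
    norm = float(np.linalg.norm(update))
    if prev_norm is not None and norm > gamma * prev_norm:
        update = update * (gamma * prev_norm / norm)
        norm = gamma * prev_norm
    return update, norm


# Toy run on the loss 0.5 * ||W||^2, whose gradient is W itself.
rng = np.random.default_rng(0)
m, n, rank = 64, 32, 4                              # `rank` plays the role of apollo_rank
weight = rng.standard_normal((m, n))
exp_avg = np.zeros((rank, n))
exp_avg_sq = np.zeros((rank, n))
prev_norm = None
norms = []
for step in range(1, 21):
    if (step - 1) % 10 == 0:                        # periodic projection refresh (assumed interval)
        proj = rng.standard_normal((rank, m)) / np.sqrt(rank)
    grad = weight.copy()
    update = apollo_scaled_gradient(grad, proj, exp_avg, exp_avg_sq, step)
    update, prev_norm = limit_norm_growth(update, prev_norm)
    norms.append(prev_norm)
    weight -= 0.01 * update
```

In this sketch the optimizer state holds `2 * rank * n` numbers instead of the `2 * m * n` needed for full Adam moments. An Adafactor-style factorization of the `rank x n` second moment, which is what the `apollo_factorize` item describes, would store only `rank + n` numbers for that part. The limiter guarantees that the update norm grows by at most a factor of `gamma` per step, including on the step where the projection matrix is resampled.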