v0.1.8

@yanfeiwong released this 13 Jun 03:13

Release Notes (v0.1.8)

🌕 “Tranquility Base”: APOLLO Low-Rank Projection and Fira Shock Absorber

"One small step for optimizer state, one giant leap for convergence."

  • APOLLO Random Subspace Projection: Adds the apollo_rank parameter. When it is set, the optimizer projects each gradient into a random low-rank subspace and estimates the second-moment scaling factors there. Unlike Adafactor's default factorization, which assumes that row and column statistics are independent, the random projection retains more of the gradient's covariance structure. The aim is faster convergence at a very small optimizer-state memory cost.
  • Fira Norm-Growth Limiter Integration: Equips the APOLLO path with a "shock absorber". It caps how fast the norm of the scaled gradient may grow from one step to the next, which suppresses the gradient spikes (and the resulting loss spikes) that periodic refreshes of the projection matrix can cause, and so helps keep low-rank training stable.
  • Extreme VRAM Compression Option (apollo_factorize): An experimental option for row/column factorization inside the low-rank subspace. It relies on the approximately norm-preserving property of random projections to apply Adafactor's factorization to the projected second moment, shrinking optimizer-state memory further.
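
The three features above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not this project's implementation: the function names (apollo_scaled_gradient, limit_norm_growth, factored_second_moment), the Adam-style hyperparameters, and the growth threshold gamma=1.01 are all hypothetical choices made for the example. Only apollo_rank and apollo_factorize are names taken from the release notes, and they appear here only in comments.

```python
import numpy as np


def apollo_scaled_gradient(grad, proj, m, v, step,
                           beta1=0.9, beta2=0.999, eps=1e-8):
    """Scale a 2-D gradient using Adam statistics kept in a low-rank space.

    grad: (n_rows, n_cols) full gradient
    proj: (rank, n_rows) random projection; rank plays the role of apollo_rank
    m, v: (rank, n_cols) moments of the projected gradient
    step: 1-based step count, used for bias correction
    """
    low = proj @ grad                                # project to (rank, n_cols)
    m = beta1 * m + (1.0 - beta1) * low              # first moment, low-rank
    v = beta2 * v + (1.0 - beta2) * low ** 2         # second moment, low-rank
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    adapted = m_hat / (np.sqrt(v_hat) + eps)         # Adam-style update, low-rank
    # One scaling factor per column: how much Adam changed that column's norm.
    scale = np.linalg.norm(adapted, axis=0) / (np.linalg.norm(low, axis=0) + eps)
    # Apply the low-rank estimate to the full-rank gradient.
    return grad * scale, m, v


def limit_norm_growth(update, prev_norm, gamma=1.01):
    """Cap the step-to-step growth of the update norm (assumed threshold gamma)."""
    norm = float(np.linalg.norm(update))
    if prev_norm is not None and prev_norm > 0.0 and norm > gamma * prev_norm:
        update = update * (gamma * prev_norm / norm)  # rescale onto the cap
        norm = gamma * prev_norm
    return update, norm


def factored_second_moment(sq):
    """Adafactor-style rank-1 reconstruction of a non-negative matrix.

    Assumed to mirror what an option like apollo_factorize would do with the
    projected squared gradient: keep only row and column means, not the matrix.
    """
    row = sq.mean(axis=1)                            # (rank,)
    col = sq.mean(axis=0)                            # (n_cols,)
    return np.outer(row, col) / max(row.mean(), 1e-30)
```

In a training loop these pieces would be chained: scale the gradient, pass the result through the limiter while carrying the returned norm forward as prev_norm, then apply the learning rate. Redrawing proj every few hundred steps is what makes the limiter useful, since the scaling factors can jump right after a refresh.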