v0.1.3

Pre-release

@yanfeiwong yanfeiwong released this 08 Jun 04:01
· 19 commits to main since this release

Release Notes (v0.1.3)

This release overhauls the core quantization logic, introducing log-space mapping and on-the-fly CUDA kernels to improve numerical stability and memory efficiency:

  • Log-Space Quantization Overhaul: Migrated quantization of the second moment (variance) from linear space to log space (log2/exp2). Log space better fits the long-tailed distribution of variances and mitigates the weight oscillations caused by very small variances being truncated to zero, improving numerical stability.
  • Zero-Materialization CUDA Kernels: Removed intermediate tensor materialization on the Python side. Dequantization, the EMA update, and requantization now run on the fly entirely within CUDA, reducing memory read/write traffic and peak memory usage.
  • Streamlined State Management: Changed the step counter from a GPU tensor to a native Python scalar, reducing device-side memory overhead and making the control flow more lightweight.
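
To illustrate why log-space quantization helps with very small variances, here is a minimal pure-Python sketch. It is not this project's kernel code: the 8-bit code width, the per-list min/max scaling, and all function names are assumptions chosen for illustration. It compares a uniform grid over the raw values with a uniform grid over their log2.

```python
import math

LEVELS = 255  # assumed 8-bit unsigned codes; the release does not state the bit width


def quantize_linear(values):
    """Map values to integer codes on a uniform grid over [0, max(values)]."""
    scale = max(values) / LEVELS
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_linear(codes, scale):
    """Recover approximate values from linear codes."""
    return [c * scale for c in codes]


def quantize_log2(values, eps=1e-30):
    """Map values to integer codes on a uniform grid over log2(value)."""
    logs = [math.log2(max(v, eps)) for v in values]  # eps guards against log2(0)
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / LEVELS
    if step == 0.0:
        step = 1.0  # all values identical; any positive step works
    codes = [round((x - lo) / step) for x in logs]
    return codes, lo, step


def dequantize_log2(codes, lo, step):
    """Recover approximate values from log-space codes via exp2."""
    return [2.0 ** (lo + c * step) for c in codes]


# Hypothetical second-moment values spanning twelve orders of magnitude.
variances = [1e-12, 1e-8, 1e-4, 1.0]

lin_codes, lin_scale = quantize_linear(variances)
lin_restored = dequantize_linear(lin_codes, lin_scale)

log_codes, log_lo, log_step = quantize_log2(variances)
log_restored = dequantize_log2(log_codes, log_lo, log_step)

for v, a, b in zip(variances, lin_restored, log_restored):
    print(f"true={v:.3e}  linear={a:.3e}  log2={b:.3e}")
```

In this sketch the linear grid rounds every value below roughly 1/510 of the maximum to code 0, so the three small variances come back as exactly zero, while the log-space grid keeps each value nonzero with a bounded relative error. In an Adam-style update, where the step is divided by the square root of the second moment, a variance that collapses to zero can produce an oversized step, which is one plausible reading of the oscillation the release notes describe.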