v0.1.3
Pre-release
Release Notes (v0.1.3)
Fundamental overhaul of the quantization logic, introducing log-space mapping and on-the-fly CUDA kernels to improve numerical stability and memory efficiency:
- Log-Space Quantization Overhaul: Quantization of the second moment (variance) now happens in log space (`log2`/`exp2`) instead of linear space. This better fits the long-tailed distribution of variances and mitigates the weight oscillations caused by very small variances being truncated to zero.
- Zero-Materialization CUDA Kernels: Intermediate tensors are no longer instantiated on the Python side. Dequantization, the EMA update, and requantization are all performed on the fly inside CUDA, reducing memory read/write traffic and peak memory usage.
- Streamlined State Management: The `step` counter is now a native Python scalar rather than a GPU tensor, reducing device-side memory overhead and making the control flow more lightweight.
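
The effect of the log-space change can be illustrated with a small pure-Python sketch. This is not the project's CUDA code: the 8-bit code width, the clamping range `LOG_MIN`/`LOG_MAX`, and all function names below are assumptions chosen for illustration, since the release notes do not specify them. It shows a small variance collapsing to zero under linear quantization but surviving a `log2`/`exp2` round trip, and a per-element dequantize, EMA update, requantize step of the kind a fused kernel might run per thread.

```python
import math

# Hypothetical parameters; the actual bit width and range are not
# stated in the release notes.
LEVELS = 255        # 8-bit codes 0..255 (assumption)
LOG_MIN = -40.0     # assumed lower clamp on log2(variance)
LOG_MAX = 0.0       # assumed upper clamp on log2(variance)


def quantize_linear(v, v_max):
    """Uniform quantization of v on [0, v_max]."""
    t = min(max(v / v_max, 0.0), 1.0)
    return round(t * LEVELS)


def dequantize_linear(code, v_max):
    return code / LEVELS * v_max


def quantize_log(v):
    """Uniform quantization of log2(v) on [LOG_MIN, LOG_MAX]."""
    log_v = math.log2(max(v, 2.0 ** LOG_MIN))  # floor avoids log2(0)
    log_v = min(log_v, LOG_MAX)
    t = (log_v - LOG_MIN) / (LOG_MAX - LOG_MIN)
    return round(t * LEVELS)


def dequantize_log(code):
    """Inverse mapping via exp2."""
    return 2.0 ** (LOG_MIN + code / LEVELS * (LOG_MAX - LOG_MIN))


def fused_second_moment_step(code, grad, beta2):
    """Dequantize, apply the EMA update, and requantize one element.

    Only the quantized code goes in and out; the full-precision
    variance exists only as a local temporary.
    """
    v = dequantize_log(code)
    v = beta2 * v + (1.0 - beta2) * grad * grad
    return quantize_log(v)


small_variance = 1e-8

# Linear space: the value falls below half a quantization step.
linear_roundtrip = dequantize_linear(quantize_linear(small_variance, 1.0), 1.0)

# Log space: the value survives with bounded relative error.
log_roundtrip = dequantize_log(quantize_log(small_variance))

# The step counter is a plain Python int, not a device tensor.
state = {"step": 0, "v_code": quantize_log(small_variance)}
for grad in (1e-4, 2e-4, 1e-4):
    state["step"] += 1
    state["v_code"] = fused_second_moment_step(state["v_code"], grad, beta2=0.999)

print(linear_roundtrip, log_roundtrip, state)
```

Under these assumed parameters the log-space grid has a constant step in `log2(v)`, so the round-trip error is relative rather than absolute, which is why values many orders of magnitude below the maximum are still representable.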