[2/N] feat: streaming quantize for svdquant#956

Merged
DefTruth merged 5 commits into main from dev on Apr 7, 2026

@DefTruth DefTruth commented Apr 7, 2026

tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_returns_module_state_dict PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_repairs_invalid_smooth_scales PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_rejects_unsupported_geometry PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_state_dict_loads_into_module PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_streaming_matches_eager_state_dict PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_low_precision_svd_requires_fallback_when_unsupported PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_runtime_rank32_beats_rank0 SVDQ linear module accuracy report

rank mae rmse max_abs rel_l2 cosine latency_ms
0 0.247247 0.314898 1.296875 0.153771 0.988331 0.000000
16 0.116185 0.148158 0.593750 0.072349 0.997398 0.000000
32 0.055915 0.071060 0.304688 0.034700 0.999400 0.000000
128 0.004684 0.007475 0.031250 0.003650 0.999994 0.000000

PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_toymodel_rank_accuracy_roundtrip_report SVDQ ToyModel profiling config

num_heads embed_dim batch seq_len high_precision fp32_fallback
32 4096 8 1024 False True

SVDQ ToyModel quantization latency

rank quantization_s
0 0.158318
16 9.710871
32 9.704566
128 9.742677

SVDQ ToyModel accuracy report

rank mae rmse max_abs rel_l2 cosine latency_ms
-1 0.000000 0.000000 0.000000 0.000000 1.000000 11.750928
0 0.015257 0.019142 0.113281 0.322592 0.951864 5.514978
16 0.014535 0.018235 0.101562 0.307310 0.956059 5.682464
32 0.014254 0.017881 0.111328 0.301334 0.957608 5.767454
128 0.013537 0.016981 0.111328 0.286182 0.961555 6.225797

PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_streaming_memory_peak_is_lower SVDQ streaming memory benchmark

rank cpu_calibration_gib streaming_peak_gib eager_peak_gib savings_pct
32 10.00 0.5426 10.2927 94.73

PASSED
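
For readers of the accuracy tables above, the column names correspond to common error metrics between a reference output and a quantized output. The following is a minimal sketch of those metrics using their standard definitions; `error_report` is a hypothetical helper, and the actual test utilities in this PR may compute the columns differently.

```python
import math
from typing import Dict, Sequence


def error_report(ref: Sequence[float], out: Sequence[float]) -> Dict[str, float]:
    """Standard error metrics between a reference and an approximate output.

    Hypothetical helper for illustration; not taken from the PR's test utilities.
    """
    diff = [o - r for o, r in zip(out, ref)]
    n = len(diff)
    ref_norm = math.sqrt(sum(r * r for r in ref))
    out_norm = math.sqrt(sum(o * o for o in out))
    diff_norm = math.sqrt(sum(d * d for d in diff))
    dot = sum(r * o for r, o in zip(ref, out))
    return {
        "mae": sum(abs(d) for d in diff) / n,   # mean absolute error
        "rmse": diff_norm / math.sqrt(n),       # root mean squared error
        "max_abs": max(abs(d) for d in diff),   # worst single-element error
        "rel_l2": diff_norm / ref_norm,         # L2 error relative to the reference
        "cosine": dot / (ref_norm * out_norm),  # cosine similarity
    }


report = error_report(ref=[3.0, 4.0], out=[3.0, 0.0])
```

Under these definitions an output identical to the reference gives zero for every error column and a cosine of 1.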

DefTruth commented Apr 7, 2026

@copilot summary

Copilot AI commented Apr 7, 2026

PR Summary: Streaming Quantize Mode for SVDQuant

This PR adds a streaming quantization mode to the SVDQuant (svdq) pipeline, reducing peak GPU memory during quantization by processing calibration activations one chunk at a time instead of loading them all onto the device at once.

Key Changes

quantizer.py

  • Added streaming: bool = True parameter to quantize_linear_svdq_w4a4 — when enabled, activation tensors are moved to device one-by-one and the running span is computed incrementally (fold + discard), avoiding the cost of materializing all activations on GPU simultaneously.
  • Added high_precision: bool and fp32_fallback: bool parameters (replacing the old fast_svd flag) to give finer control over numerical precision during SVD and scale computation.
  • Introduced _resolve_math_dtype helper and threaded math_dtype / output_dtype through compute_smooth_scale and _compute_group_scales to avoid redundant casts.
  • Simplified standardize_calibration_activations — device/dtype placement is now deferred to the quantizer loop.
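
The fold-and-discard idea behind streaming mode can be sketched in plain Python. This is an illustration only: it assumes the "running span" is a per-channel maximum of absolute activation values, and `eager_span`, `streaming_span`, and `make_chunks` are hypothetical names rather than functions from quantizer.py.

```python
from typing import Iterable, Iterator, List

Chunk = List[List[float]]  # rows x channels


def eager_span(chunks: List[Chunk]) -> List[float]:
    """Materialize every row at once, then reduce (the high-memory path)."""
    rows = [row for chunk in chunks for row in chunk]
    return [max(abs(row[c]) for row in rows) for c in range(len(rows[0]))]


def streaming_span(chunks: Iterable[Chunk]) -> List[float]:
    """Fold one chunk at a time into a running per-channel max, then drop it."""
    running: List[float] = []
    for chunk in chunks:  # only the current chunk is alive inside the loop
        for row in chunk:
            if not running:
                running = [0.0] * len(row)
            running = [max(r, abs(v)) for r, v in zip(running, row)]
    return running


def make_chunks() -> Iterator[Chunk]:
    """Toy calibration data: two chunks with two channels each."""
    yield [[1.0, -4.0], [0.5, 2.0]]
    yield [[-3.0, 1.0]]


# Both paths see the same data, so the reduced statistic matches.
assert streaming_span(make_chunks()) == eager_span(list(make_chunks()))
```

Because the reduction is associative, the streaming path only ever holds one chunk plus the running statistic, which is why its peak memory does not grow with the size of the calibration set.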

lowrank.py

  • Replaced fast_svd: bool with high_precision: bool + fp32_fallback: bool.
    • high_precision=True → float64 SVD.
    • fp32_fallback=True → float32 SVD.
    • Both False → SVD in the native weight dtype (e.g. bfloat16), with a clear RuntimeError on backends that don't support it.
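
The flag-to-dtype mapping above can be written as a small decision function. This is a sketch under assumptions: `resolve_svd_dtype` and its `native_supported` argument are hypothetical, dtypes are represented as strings to keep the example dependency-free, and the precedence when both flags are set (here `high_precision` wins) is a guess rather than something the summary states.

```python
def resolve_svd_dtype(
    weight_dtype: str,
    high_precision: bool,
    fp32_fallback: bool,
    native_supported: bool = False,
) -> str:
    """Pick the dtype used for the SVD (hypothetical helper, not the PR's code)."""
    if high_precision:
        return "float64"  # assumption: takes priority over fp32_fallback
    if fp32_fallback:
        return "float32"
    if not native_supported:
        # Mirrors the described behavior: fail loudly instead of silently upcasting.
        raise RuntimeError(
            f"SVD in {weight_dtype} is not supported on this backend; "
            "enable fp32_fallback or high_precision."
        )
    return weight_dtype


# The ToyModel profiling config in this PR uses high_precision=False, fp32_fallback=True.
chosen = resolve_svd_dtype("bfloat16", high_precision=False, fp32_fallback=True)
```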

test_svdquant_quantizer.py + _svdq_test_utils.py

  • New test test_svdquant_quantizer_streaming_matches_eager_state_dict: verifies that streaming and eager modes produce identical state dicts.
  • New test test_svdquant_streaming_memory_peak_is_lower: benchmarks peak GPU memory; with 10 GiB of calibration data, streaming peaks at ~0.54 GiB vs ~10.29 GiB for eager (~95% savings).
  • New test test_svdquant_quantizer_low_precision_svd_requires_fallback_when_unsupported: checks that low-precision SVD requires a fallback on backends that do not support it.
  • Environment-variable knobs (CACHE_DIT_SVDQ_TEST_*) to control precision, memory size, and streaming threshold in CI.
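
The savings figure can be reproduced from the two peak values in the benchmark output. A minimal sketch, assuming the percentage is the reduction relative to the eager peak; `savings_pct` is a hypothetical helper name.

```python
def savings_pct(streaming_peak_gib: float, eager_peak_gib: float) -> float:
    """Peak-memory reduction of streaming relative to eager, in percent."""
    return 100.0 * (eager_peak_gib - streaming_peak_gib) / eager_peak_gib


# Peak values as reported by the streaming memory benchmark (rank 32).
pct = savings_pct(streaming_peak_gib=0.5426, eager_peak_gib=10.2927)
print(f"{pct:.2f}")
```

The inputs here are the rounded values printed in the table, so the result agrees with the reported figure only up to rounding.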

@DefTruth DefTruth changed the title [2/N] feat: streaming quantize mode for svdq [2/N] feat: streaming quantization for svdquant Apr 7, 2026
@DefTruth DefTruth changed the title [2/N] feat: streaming quantization for svdquant [2/N] feat: streaming quantize for svdquant Apr 7, 2026
@DefTruth DefTruth merged commit 003b224 into main Apr 7, 2026
4 checks passed
@DefTruth DefTruth deleted the dev branch April 7, 2026 05:38
@DefTruth DefTruth mentioned this pull request Apr 9, 2026
7 tasks