[2/N] feat: streaming quantize for svdquant by DefTruth · Pull Request #956 · vipshop/cache-dit

DefTruth · 2026-04-07T04:04:01Z

tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_returns_module_state_dict PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_repairs_invalid_smooth_scales PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_rejects_unsupported_geometry PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_state_dict_loads_into_module PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_streaming_matches_eager_state_dict PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_low_precision_svd_requires_fallback_when_unsupported PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_quantizer_runtime_rank32_beats_rank0 SVDQ linear module accuracy report

rank	mae	rmse	max_abs	rel_l2	cosine
0	0.247247	0.314898	1.296875	0.153771	0.988331
16	0.116185	0.148158	0.593750	0.072349	0.997398
32	0.055915	0.071060	0.304688	0.034700	0.999400
128	0.004684	0.007475	0.031250	0.003650	0.999994

PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_toymodel_rank_accuracy_roundtrip_report SVDQ ToyModel profiling config

num_heads	embed_dim	batch	seq_len	high_precision	fp32_fallback
32	4096	8	1024	False	True

SVDQ ToyModel quantization latency

rank	quantization_s
0	0.158318
16	9.710871
32	9.704566
128	9.742677

SVDQ ToyModel accuracy report

rank	mae	rmse	max_abs	rel_l2	cosine	latency_ms
-1	0.000000	0.000000	0.000000	0.000000	1.000000	11.750928
0	0.015257	0.019142	0.113281	0.322592	0.951864	5.514978
16	0.014535	0.018235	0.101562	0.307310	0.956059	5.682464
32	0.014254	0.017881	0.111328	0.301334	0.957608	5.767454
128	0.013537	0.016981	0.111328	0.286182	0.961555	6.225797

PASSED
tests/kernels/test_svdquant_quantizer.py::test_svdquant_streaming_memory_peak_is_lower SVDQ streaming memory benchmark

rank	cpu_calibration_gib	streaming_peak_gib	eager_peak_gib	savings_pct
32	10.00	0.5426	10.2927	94.73

PASSED

DefTruth · 2026-04-07T04:04:22Z

@copilot summary

Copilot · 2026-04-07T04:05:38Z

@copilot summary

PR Summary: Streaming Quantize Mode for SVDQuant

This PR adds a streaming quantization mode to the SVDQuant (svdq) pipeline, reducing peak GPU memory during quantization by processing calibration activations one chunk at a time instead of loading them all onto the device at once.

Key Changes

quantizer.py

Added a streaming: bool = True parameter to quantize_linear_svdq_w4a4. When enabled, activation tensors are moved to the device one at a time and each is folded into the running span and then discarded, so the full calibration set is never resident on the GPU at once.
Added high_precision: bool and fp32_fallback: bool parameters (replacing the old fast_svd flag) to give finer control over numerical precision during SVD and scale computation.
Introduced _resolve_math_dtype helper and threaded math_dtype / output_dtype through compute_smooth_scale and _compute_group_scales to avoid redundant casts.
Simplified standardize_calibration_activations — device/dtype placement is now deferred to the quantizer loop.
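
The fold-and-discard idea can be illustrated with a minimal, framework-free sketch. It assumes the "span" is a per-channel absolute maximum, which this summary does not state, and uses nested lists as stand-ins for activation tensors. The names fold_span, streaming_span and eager_span are hypothetical and are not part of the cache-dit API.

```python
from typing import Iterable, List, Sequence

# Stand-in for an activation tensor: rows are tokens, columns are channels.
Chunk = Sequence[Sequence[float]]


def fold_span(running: List[float], chunk: Chunk) -> List[float]:
    """Fold one chunk into the running per-channel absolute maximum (assumed span)."""
    for row in chunk:
        if len(row) != len(running):
            raise ValueError("channel count mismatch")
        for c, value in enumerate(row):
            running[c] = max(running[c], abs(value))
    return running


def streaming_span(chunks: Iterable[Chunk], num_channels: int) -> List[float]:
    """Visit chunks one at a time; each chunk can be freed once it is folded."""
    running = [0.0] * num_channels
    for chunk in chunks:  # only the current chunk needs to be resident
        fold_span(running, chunk)
    return running


def eager_span(chunks: Iterable[Chunk], num_channels: int) -> List[float]:
    """Materialize every row first, then reduce (the memory-hungry path)."""
    all_rows = [row for chunk in chunks for row in chunk]
    return [
        max((abs(row[c]) for row in all_rows), default=0.0)
        for c in range(num_channels)
    ]


calibration = [
    [[0.5, -2.0], [1.5, 0.25]],
    [[-3.0, 1.0]],
]
print(streaming_span(iter(calibration), 2))  # [3.0, 2.0]
```

Because a maximum is associative, folding chunk by chunk gives the same result as reducing over everything at once, which is the property a streaming-versus-eager equality test relies on.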

lowrank.py

Replaced fast_svd: bool with high_precision: bool + fp32_fallback: bool.
- high_precision=True → float64 SVD.
- fp32_fallback=True → float32 SVD.
- Both False → SVD in the native weight dtype (e.g. bfloat16), with a clear RuntimeError on backends that don't support it.
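
The flag-to-dtype mapping above could be sketched as follows. This is an illustration only: resolve_svd_dtype is a hypothetical name, dtypes are represented as plain strings, and giving high_precision priority over fp32_fallback when both are set is an assumption that the summary does not confirm.

```python
def resolve_svd_dtype(native_dtype: str, high_precision: bool, fp32_fallback: bool) -> str:
    """Pick the dtype used for the SVD from the two precision flags.

    Assumption: high_precision wins when both flags are True.
    """
    if high_precision:
        return "float64"
    if fp32_fallback:
        return "float32"
    # Neither flag set: stay in the weight's own dtype. Per the summary, a
    # backend that cannot run SVD in this dtype raises a RuntimeError.
    return native_dtype


# The ToyModel profiling config in the test log uses
# high_precision=False, fp32_fallback=True.
print(resolve_svd_dtype("bfloat16", high_precision=False, fp32_fallback=True))  # float32
```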

test_svdquant_quantizer.py + _svdq_test_utils.py

New test test_svdquant_quantizer_streaming_matches_eager_state_dict, which verifies that streaming and eager modes produce identical state dicts.
New test test_svdquant_streaming_memory_peak_is_lower, which benchmarks GPU peak memory: with 10 GiB of calibration data at rank 32, streaming peaks at about 0.54 GiB versus about 10.29 GiB for eager (94.73% savings).
New test test_svdquant_quantizer_low_precision_svd_requires_fallback_when_unsupported.
Environment-variable knobs (CACHE_DIT_SVDQ_TEST_*) to control precision, memory size, and streaming threshold in CI.
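
The savings figure in the memory benchmark can be reproduced from the two reported peaks. The formula below is an assumption about how savings_pct is derived (relative reduction of the streaming peak against the eager peak); the test itself may compute it from unrounded measurements.

```python
def savings_pct(streaming_peak_gib: float, eager_peak_gib: float) -> float:
    """Relative reduction in peak memory, in percent (assumed definition)."""
    if eager_peak_gib <= 0:
        raise ValueError("eager peak must be positive")
    return (1.0 - streaming_peak_gib / eager_peak_gib) * 100.0


# Peaks reported for rank 32 in the benchmark table.
print(f"{savings_pct(0.5426, 10.2927):.2f}")  # 94.73
```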

[2/N] feat: streaming quantize mode for svdq

abf54ec

Copilot started work on behalf of DefTruth April 7, 2026 04:04

Copilot finished work on behalf of DefTruth April 7, 2026 04:06

DefTruth added 4 commits April 7, 2026 04:08

[2/N] feat: streaming quantize mode for svdq

805cce6

[2/N] feat: streaming quantize mode for svdq

ff97740

[2/N] feat: streaming quantize mode for svdq

f76057f

[2/N] feat: streaming quantize mode for svdq

231b594

DefTruth changed the title ~~[2/N] feat: streaming quantize mode for svdq~~ [2/N] feat: streaming quantization for svdquant Apr 7, 2026

DefTruth changed the title ~~[2/N] feat: streaming quantization for svdquant~~ [2/N] feat: streaming quantize for svdquant Apr 7, 2026

DefTruth merged commit 003b224 into main Apr 7, 2026
4 checks passed

DefTruth deleted the dev branch April 7, 2026 05:38

DefTruth mentioned this pull request Apr 9, 2026: [RFC] v1.4.0 Roadmap #856 (Open, 7 tasks)

DefTruth merged 5 commits into main from dev
