Add scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig #3110
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3110
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending as of commit be24ef2 with merge base 5cbbd73. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks! Can you add some unit tests for this path as well?
Added a new E2E test.
""" | ||
Uses `torchao.quantization.quant_primitives._choose_qparams_and_quantize_scale_only_hqq` | ||
""" | ||
HQQ = "hqq" |
There are two variants of HQQ today, which makes it confusing:
- `_choose_qparams_and_quantize_affine_hqq` -- used in version 1 of the config when `use_hqq=True`
- `_choose_qparams_and_quantize_scale_only_hqq` -- this one
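For intuition, here is a minimal sketch (not torchao's actual implementation) of what a scale-only HQQ-style refinement could look like: with no zero point, only the scale `s` is alternately refined against the rounded weights. The function name and the alternating least-squares update are illustrative assumptions; the real `_choose_qparams_and_quantize_scale_only_hqq` may use a different objective.

```python
import torch

def scale_only_hqq_sketch(w: torch.Tensor, qmin: int = -8, qmax: int = 7, iters: int = 10):
    # Illustrative only: alternately refine a per-row scale s (no zero point)
    # against rounded weights q, minimizing ||w - s * q||^2.
    s = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-12)  # naive init
    for _ in range(iters):
        q = torch.clamp(torch.round(w / s), qmin, qmax)  # quantize with current scale
        # closed-form least-squares update of s for fixed q
        s = (w * q).sum(-1, keepdim=True) / (q * q).sum(-1, keepdim=True).clamp_min(1e-12)
    return q.to(torch.int8), s

w = torch.randn(4, 32)
q, s = scale_only_hqq_sketch(w)
print((w - s * q).pow(2).mean())  # reconstruction error after refinement
```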
These configs are used in ExecuTorch, and they never supported HQQ in any previous version.
`use_hqq` was used in version 1 of the int4 server config.
If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT.
> If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT
Yeah, let's do HQQ_SCALE_ONLY then
This is great, btw. Why would one ever not use HQQ_SCALE_ONLY on XNNPACK in ExecuTorch if it improves accuracy compared to the naive approach? It doesn't take a long time to quantize, right? Roughly a few minutes? Hope we can make the 0.14 branch cut, which is Oct 6.
test/quantization/quantize_/workflows/intx/test_intx_unpacked_to_int8_tensor.py
```python
# can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum)
# after python 3.10 is end of life (https://devguide.python.org/versions/)
class IntxChooseQParamsAlgorithm(str, Enum):
```
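(Aside on the `(str, Enum)` pattern referenced above: mixing in `str` makes members compare equal to their string values, which is the behavior `StrEnum` provides natively on Python 3.11+. A small hedged illustration; the `HQQ_SCALE_ONLY` member reflects the name agreed on above, and the full member set in the PR may differ.)

```python
from enum import Enum

class IntxChooseQParamsAlgorithm(str, Enum):
    AFFINE = "affine"
    HQQ_SCALE_ONLY = "hqq_scale_only"

# The str mixin gives StrEnum-like equality even on Python 3.10:
assert IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY == "hqq_scale_only"
assert IntxChooseQParamsAlgorithm("hqq_scale_only") is IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY
```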
Also, do we need to introduce yet another class?
Can we just extend the existing Int4ChooseQParamsAlgorithm and add affine and hqq_scale_only?
And then rename/promote Int4ChooseQParamsAlgorithm to IntxChooseQParamsAlgorithm in a follow-up PR?
In torchao's refactor (removing AffineQuantizedTensor), the direction is for subclasses to not share higher-level abstractions, but instead define their own enums. This is how packing format works as well (intx_packing_format for the intx subclass, and int4_packing_format for the int4 subclass).
I'll let @jerryzh168 comment here as well
Will defer to @jerryzh168 then
yeah we want local abstractions instead of global abstractions unless it's required.
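To make the local-abstractions point concrete, here is a hedged sketch of the direction described (member sets are illustrative, not the exact ones in torchao): each tensor subclass owns its own small enum, mirroring how `int4_packing_format` and `intx_packing_format` are kept separate.

```python
from enum import Enum

# Sketch: separate, subclass-local enums instead of one shared global enum.
class Int4ChooseQParamsAlgorithm(str, Enum):   # owned by the int4 subclass
    HQQ = "hqq"

class IntxChooseQParamsAlgorithm(str, Enum):   # owned by the intx subclass
    AFFINE = "affine"
    HQQ_SCALE_ONLY = "hqq_scale_only"
```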
Yes, I want to measure on some more models and benchmarks, and if the results are neutral to positive, I'll update the default in etLLM.
LGTM, will defer the question to @jerryzh168
btw, are the changes in this file tested as well?
I think this existing unit test covers them: https://github.com/pytorch/ao/blob/main/test/quantization/test_qat.py#L2321
This PR introduces a scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig, which we find improves model quality on MMLU from 0.50 to 0.55 when quantizing Gemma3-4B.
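For context, a hedged usage sketch of how the new algorithm might be selected from the config; `quantize_` and `IntxWeightOnlyConfig` are real torchao APIs, but the `intx_choose_qparams_algorithm` keyword is assumed from the enum name and may not match the final signature.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, IntxWeightOnlyConfig

model = nn.Sequential(nn.Linear(128, 256))

config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    # assumed keyword; check the PR diff for the exact parameter name
    intx_choose_qparams_algorithm="hqq_scale_only",
)
quantize_(model, config)  # replaces Linear weights with scale-only-HQQ-quantized tensors
```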