Conversation

metascroy (Contributor)

This PR introduces a scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig, which we find improves model quality on MMLU from 0.50 to 0.55 when quantizing Gemma3-4B.
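For readers unfamiliar with the method: HQQ (half-quadratic quantization) alternates between quantizing weights with the current quantization parameters and re-fitting those parameters to minimize reconstruction error; the scale-only variant fits a per-group scale and no zero point. The sketch below is only an illustration of that idea, not torchao's `_choose_qparams_and_quantize_scale_only_hqq` — the function name is made up, and it uses a plain least-squares scale update in place of HQQ's actual robust objective:

```python
import random

def hqq_scale_only_sketch(w, qmin=-8, qmax=7, iters=10):
    """Illustrative scale-only quantization for one weight group (a list of floats).

    Alternates two steps:
      1) quantize with the current scale (round + clamp, no zero point)
      2) refit the scale in closed form: argmin_s sum_i (w_i - s*q_i)^2
    """
    # Initialize the scale from the weight range.
    scale = max(abs(x) for x in w) / max(abs(qmin), abs(qmax))
    q = []
    for _ in range(iters):
        # Step 1: quantize with the current scale.
        q = [min(max(round(x / scale), qmin), qmax) for x in w]
        denom = sum(qi * qi for qi in q)
        if denom == 0:
            break
        # Step 2: least-squares refit of the scale against the original weights.
        scale = sum(wi * qi for wi, qi in zip(w, q)) / denom
    return q, scale

random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(64)]
q, s = hqq_scale_only_sketch(w)
mse = sum((wi - s * qi) ** 2 for wi, qi in zip(w, q)) / len(w)
```

Because each step is optimal for the other variable held fixed, the reconstruction error is non-increasing across iterations, which is why this tends to beat a one-shot round-to-nearest scale.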

@metascroy metascroy requested a review from jerryzh168 October 1, 2025 23:25

pytorch-bot bot commented Oct 1, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3110

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 1 Pending

As of commit be24ef2 with merge base 5cbbd73:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 1, 2025
@metascroy metascroy added the topic: new feature Use this tag if this PR adds a new feature label Oct 1, 2025
jerryzh168 (Contributor) commented Oct 1, 2025

Thanks! Add some unit tests for this path as well?

@mergennachin mergennachin self-requested a review October 2, 2025 01:06
metascroy (Author)

> Thanks! Add some unit tests for this path as well?

Added a new E2E test.

"""
Uses `torchao.quantization.quant_primitives._choose_qparams_and_quantize_scale_only_hqq`
"""
HQQ = "hqq"


There are two variants of HQQ today, which makes it confusing:

- _choose_qparams_and_quantize_affine_hqq -- used in version 1 of the config when use_hqq=True
- _choose_qparams_and_quantize_scale_only_hqq -- this one

scottroy replied Oct 2, 2025

These configs are used in ExecuTorch, and never supported HQQ in any previous version.

use_hqq was used in version 1 of the int4 server config.

If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT

> If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT

Yeah, let's do HQQ_SCALE_ONLY then

mergennachin
This is great btw

Why would one ever not do HQQ_scale_only on XNNPACK on ET if it improves the accuracy compared to naive? It doesn't take a long time to quantize, right? Roughly a few minutes?

Hope we can make it to the 0.14 branch cut, which is Oct 6.


# can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum)
# after python 3.10 is end of life (https://devguide.python.org/versions/)
class IntxChooseQParamsAlgorithm(str, Enum):
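For context on the comment in that snippet: the `(str, Enum)` mixin makes every member also a string, which is exactly what `enum.StrEnum` provides natively from Python 3.11 on. A minimal sketch of the pattern (the member names and values here are assumptions based on this thread, not the merged code):

```python
from enum import Enum

# Members mixed in with str behave as plain strings, so configs can
# accept either the enum member or its string value interchangeably.
class IntxChooseQParamsAlgorithm(str, Enum):
    AFFINE = "affine"                  # assumed value, mirrors HQQ = "hqq" above
    HQQ_SCALE_ONLY = "hqq_scale_only"  # assumed value

# str-mixin members compare equal to their string values...
print(IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY == "hqq_scale_only")  # True
# ...and lookup by value recovers the member itself.
print(IntxChooseQParamsAlgorithm("hqq_scale_only") is IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY)  # True
```

Switching to `StrEnum` later would only change the base class; the equality and value-lookup behavior shown here stays the same.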


Also, do we need to introduce yet another class?

Can we just extend existing Int4ChooseQParamsAlgorithm and add affine and hqq_scale_only?

And then rename/promote Int4ChooseQParamsAlgorithm to IntxChooseQParamsAlgorithm in a follow-up PR?

metascroy (Author)

In torchao's refactor (removing AffineQuantizedTensor), the direction is for subclasses not to share higher-level abstractions but instead to define their own enums. This is how packing formats work as well (intx_packing_format for the intx subclass, int4_packing_format for the int4 subclass).

I'll let @jerryzh168 comment here as well


Will defer to @jerryzh168 then

jerryzh168 (Contributor)

Yeah, we want local abstractions instead of global abstractions unless sharing is required.

metascroy (Author)

> This is great btw
>
> Why would one ever not do HQQ_scale_only on XNNPACK on ET if it improves the accuracy compared to naive? It doesn't take a long time to quantize, right? Roughly a few minutes?
>
> Hope we can make it to the 0.14 branch cut, which is Oct 6.

Yes, I want to measure on some more models and benchmarks and if results are neutral to positive, I'll update the default in etLLM.

mergennachin left a comment

LGTM, will defer the question to @jerryzh168



jerryzh168 (Contributor)

btw, are the changes in this file tested as well?

@metascroy metascroy merged commit 01849b2 into main Oct 2, 2025
18 checks passed