Add scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig #3110
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/3110
Note: Links to docs will display an error until the docs builds have been completed. ⏳ No Failures, 1 Pending as of commit be24ef2 with merge base 5cbbd73. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks! Can you add some unit tests for this path as well?
Added a new E2E test.
""" | ||
Uses `torchao.quantization.quant_primitives._choose_qparams_and_quantize_scale_only_hqq` | ||
""" | ||
HQQ = "hqq" |
There are two variants of HQQ today, which makes it confusing:
- `_choose_qparams_and_quantize_affine_hqq` -- used in version 1 of the config when `use_hqq=True`
- `_choose_qparams_and_quantize_scale_only_hqq` -- this one
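For intuition, here is a minimal sketch (not torchao's actual implementation) of what a scale-only HQQ-style refinement could look like: with no zero point, only the scale `s` is alternately refined against the rounded weights. The function name and the alternating least-squares update are illustrative assumptions; the real `_choose_qparams_and_quantize_scale_only_hqq` may use a different objective.

```python
import torch

def scale_only_hqq_sketch(w: torch.Tensor, qmin: int = -8, qmax: int = 7, iters: int = 10):
    # Illustrative only: alternately refine a per-row scale s (no zero point)
    # against rounded weights q, minimizing ||w - s * q||^2.
    s = (w.abs().amax(dim=-1, keepdim=True) / qmax).clamp_min(1e-12)  # naive init
    for _ in range(iters):
        q = torch.clamp(torch.round(w / s), qmin, qmax)  # quantize with current scale
        # closed-form least-squares update of s for fixed q
        s = (w * q).sum(-1, keepdim=True) / (q * q).sum(-1, keepdim=True).clamp_min(1e-12)
    return q.to(torch.int8), s

w = torch.randn(4, 32)
q, s = scale_only_hqq_sketch(w)
print((w - s * q).pow(2).mean())  # reconstruction error after refinement
```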
These configs are used in ExecuTorch, and they never supported HQQ in any previous version.
`use_hqq` was used in version 1 of the int4 server config.
If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT.
> If you think it would be clearer, I could call the algorithm enum HQQ_SCALE_ONLY or HQQ_NO_ZERO_POINT
Yeah, let's do HQQ_SCALE_ONLY then
This is great, btw. Why would one ever not use HQQ_SCALE_ONLY on XNNPACK in ExecuTorch if it improves accuracy compared to the naive approach? It doesn't take a long time to quantize, right? Roughly a few minutes? Hope we can make the 0.14 branch cut, which is Oct 6.
test/quantization/quantize_/workflows/intx/test_intx_unpacked_to_int8_tensor.py
```python
# can switch to StrEnum (https://docs.python.org/3/library/enum.html#enum.StrEnum)
# after python 3.10 is end of life (https://devguide.python.org/versions/)
class IntxChooseQParamsAlgorithm(str, Enum):
```
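(Aside on the `(str, Enum)` pattern referenced above: mixing in `str` makes members compare equal to their string values, which is the behavior `StrEnum` provides natively on Python 3.11+. A small hedged illustration; the `HQQ_SCALE_ONLY` member reflects the name agreed on above, and the full member set in the PR may differ.)

```python
from enum import Enum

class IntxChooseQParamsAlgorithm(str, Enum):
    AFFINE = "affine"
    HQQ_SCALE_ONLY = "hqq_scale_only"

# The str mixin gives StrEnum-like equality even on Python 3.10:
assert IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY == "hqq_scale_only"
assert IntxChooseQParamsAlgorithm("hqq_scale_only") is IntxChooseQParamsAlgorithm.HQQ_SCALE_ONLY
```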
Also, do we need to introduce yet another class?
Can we just extend the existing Int4ChooseQParamsAlgorithm and add affine and hqq_scale_only?
And then rename/promote Int4ChooseQParamsAlgorithm to IntxChooseQParamsAlgorithm in a follow-up PR?
In torchao's refactor (removing AffineQuantizedTensor), the direction is for subclasses to not share higher-level abstractions, but instead define their own enums. This is how packing format works as well (intx_packing_format for the intx subclass, and int4_packing_format for the int4 subclass).
I'll let @jerryzh168 comment here as well
Will defer to @jerryzh168 then
yeah we want local abstractions instead of global abstractions unless it's required.
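To make the local-abstractions point concrete, here is a hedged sketch of the direction described (member sets are illustrative, not the exact ones in torchao): each tensor subclass owns its own small enum, mirroring how `int4_packing_format` and `intx_packing_format` are kept separate.

```python
from enum import Enum

# Sketch: separate, subclass-local enums instead of one shared global enum.
class Int4ChooseQParamsAlgorithm(str, Enum):   # owned by the int4 subclass
    HQQ = "hqq"

class IntxChooseQParamsAlgorithm(str, Enum):   # owned by the intx subclass
    AFFINE = "affine"
    HQQ_SCALE_ONLY = "hqq_scale_only"
```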
Yes, I want to measure on some more models and benchmarks, and if the results are neutral to positive, I'll update the default in etLLM.
LGTM, will defer the question to @jerryzh168
btw, are the changes in this file tested as well?
I think this existing unit test covers them: https://github.com/pytorch/ao/blob/main/test/quantization/test_qat.py#L2321
This PR introduces a scale-only version of the HQQ algorithm for IntxWeightOnlyConfig/Int8DynamicActivationIntxWeightConfig, which we find improves model quality on MMLU from 0.50 to 0.55 when quantizing Gemma3-4B.
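For context, a hedged usage sketch of how the new algorithm might be selected from the config; `quantize_` and `IntxWeightOnlyConfig` are real torchao APIs, but the `intx_choose_qparams_algorithm` keyword is assumed from the enum name and may not match the final signature.

```python
import torch
import torch.nn as nn
from torchao.quantization import quantize_, IntxWeightOnlyConfig

model = nn.Sequential(nn.Linear(128, 256))

config = IntxWeightOnlyConfig(
    weight_dtype=torch.int4,
    # assumed keyword; check the PR diff for the exact parameter name
    intx_choose_qparams_algorithm="hqq_scale_only",
)
quantize_(model, config)  # replaces Linear weights with scale-only-HQQ-quantized tensors
```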