Make HQQ default PTQ quantization in ExecuTorch #14834
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14834
Note: Links to docs will display an error until the docs builds have been completed.
❌ 6 New Failures, 23 Cancelled Jobs, 2 Unrelated Failures as of commit 55a9588 with merge base d4129b7.
NEW FAILURES - The following jobs have failed.
CANCELLED JOBS - The following jobs were cancelled. Please retry.
FLAKY - The following job failed but was likely due to flakiness present on trunk.
BROKEN TRUNK - The following job failed but was present on the merge base.
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@metascroy has imported this pull request. If you are a Meta employee, you can view this in D84020605.
This is awesome!! Thanks for working on this.
Can we make this part of release/1.0?
How did you test Gemma3 with etllm? (from your PR description)
I published quantized checkpoints for all of these models on my personal Hugging Face page: https://huggingface.co/metascroy. So you can run something like the command below to re-create the numbers.
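For illustration, here is a sketch of that evaluation command, adapted from the lm_eval invocation in the PR summary below; `<quantized_model_id>` is a placeholder for one of the checkpoint repo ids published above, not a name taken from this thread.

```
# Evaluate a published quantized checkpoint on MMLU with lm-evaluation-harness.
# <quantized_model_id> is a placeholder Hugging Face repo id (pick one of the
# quantized checkpoints from https://huggingface.co/metascroy).
HF_HUB_DISABLE_XET=1 lm_eval \
  --model hf \
  --model_args pretrained=<quantized_model_id> \
  --tasks mmlu \
  --device cuda:0 \
  --batch_size 8
```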
I'll cherry-pick, but I need to land the AO diff train before this so we don't break internal.
Summary: We see a 1% to 10% improvement in MMLU accuracy by using the HQQ algorithm for quantization, depending on the model. We make this the default quantization algorithm in the LLM export script:

```
MODEL          INT8-INT4   INT8-INT4-HQQ
-----------------------------------------
Gemma3-4B      0.50        0.55
SmolLM3-3B     0.5397      0.5546
Phi4-mini-4B   0.6050      0.6156
Qwen3-4B       0.6526      0.6585
Llama1B        0.2892      0.3153
```

Evaluations were done using lm_eval with the following command:

```
HF_HUB_DISABLE_XET=1 lm_eval --model hf --model_args pretrained=quantized_model_id --tasks mmlu --device cuda:0 --batch_size 8
```

Here quantized_model_id corresponds to one of the above model/quant-scheme combinations uploaded to my Hugging Face account.

Pull Request resolved: #14834
Reviewed By: mergennachin
Differential Revision: D84020605
Pulled By: metascroy
b7a6f5f to 6c63fcd (Compare)
@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84020605.
6c63fcd to c379dab (Compare)
c379dab to 7017fe3 (Compare)
7017fe3 to 55a9588 (Compare)
@pytorchbot cherry-pick --onto release/1.0 -c critical
@pytorchbot cherry-pick --onto release/1.0 -c "critical"
Cherry picking #14834
Command
Details for Dev Infra team
Raised by workflow job
Differential Revision: D84020605 Pull Request resolved: pytorch#14834 (cherry picked from commit d39992f)