
Conversation

metascroy (Contributor) commented Oct 6, 2025

Depending on the model, we see a 1% to 10% improvement in MMLU accuracy by using the HQQ algorithm for quantization.

We make HQQ the default quantization algorithm in the LLM export script:

| Model        | INT8-INT4 | INT8-INT4-HQQ |
|--------------|-----------|---------------|
| Gemma3-4B    | 0.50      | 0.55          |
| SmolLM3-3B   | 0.5397    | 0.5546        |
| Phi4-mini-4B | 0.6050    | 0.6156        |
| Qwen3-4B     | 0.6526    | 0.6585        |
| Llama1B      | 0.2892    | 0.3153        |

Evaluations were done using lm_eval with the following command:

HF_HUB_DISABLE_XET=1 lm_eval --model hf --model_args pretrained=quantized_model_id --tasks mmlu --device cuda:0 --batch_size 8

Here quantized_model_id corresponds to one of the model/quant-scheme combinations above, uploaded to my Hugging Face account.
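
For background on what the new default is doing, here is a minimal sketch of the half-quadratic (proximal) update that HQQ uses to refine affine quantization parameters, in plain PyTorch. The function names, group layout, and hyperparameters are illustrative assumptions, not the torchao/ExecuTorch implementation:

```
# Sketch of HQQ-style zero-point refinement for group-wise low-bit affine quantization.
# Illustrative only: names, defaults, and layout are assumptions, not this PR's code.
import torch


def shrink_lp(x: torch.Tensor, beta: float, p: float = 0.7) -> torch.Tensor:
    # Generalized soft-threshold: proximal step for a sparsity-promoting l_p penalty, p < 1.
    return torch.sign(x) * torch.relu(
        x.abs() - (1.0 / beta) * x.abs().clamp_min(1e-8).pow(p - 1)
    )


@torch.no_grad()
def hqq_refine(w: torch.Tensor, nbits: int = 4, group_size: int = 32,
               iters: int = 20, beta: float = 1e4, kappa: float = 1.01):
    """Refine the zero-point of a min/max affine quantizer by alternating a
    sparsifying shrinkage of the reconstruction error with a closed-form
    zero-point update (the half-quadratic step)."""
    assert w.numel() % group_size == 0
    qmin, qmax = 0, 2 ** nbits - 1
    wg = w.reshape(-1, group_size).float()          # (num_groups, group_size)

    # Standard min/max affine parameters as the starting point.
    w_min = wg.amin(dim=1, keepdim=True)
    w_max = wg.amax(dim=1, keepdim=True)
    scale = (qmax - qmin) / (w_max - w_min).clamp_min(1e-8)   # note: 1 / step size
    zero = -w_min * scale

    for _ in range(iters):
        w_q = torch.round(wg * scale + zero).clamp(qmin, qmax)   # quantize
        w_r = (w_q - zero) / scale                               # dequantize
        w_e = shrink_lp(wg - w_r, beta)                          # sparsified residual
        zero = torch.mean(w_q - (wg - w_e) * scale, dim=1, keepdim=True)
        beta *= kappa

    w_q = torch.round(wg * scale + zero).clamp(qmin, qmax)
    return w_q, scale, zero
```

The idea is that, rather than keeping the zero-point implied by each group's min/max, HQQ refines it against a sparsity-promoting (l_p, p < 1) reconstruction error without any calibration data, which is what the accuracy deltas in the table above reflect.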


pytorch-bot bot commented Oct 6, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14834

Note: Links to docs will display an error until the docs builds have been completed.

❌ 6 New Failures, 23 Cancelled Jobs, 2 Unrelated Failures

As of commit 55a9588 with merge base d4129b7:

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.
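
As an aside, the rebase suggested above is, in a typical checkout, roughly the following (a sketch assuming origin points at pytorch/executorch and you are on the PR branch):

```
git fetch origin viable/strict
git rebase origin/viable/strict
git push --force-with-lease   # update the PR branch
```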

meta-cla bot added the CLA Signed label Oct 6, 2025

meta-codesync bot commented Oct 6, 2025

@metascroy has imported this pull request. If you are a Meta employee, you can view this in D84020605.

mergennachin (Contributor) left a comment:

This is awesome!! Thanks for working on this

mergennachin (Contributor) commented:

Can we make this part of the release/1.0?

jackzhxng added the release notes: quantization label Oct 7, 2025
jackzhxng (Contributor) left a comment:

How did you test gemma3 with etllm? (from your pr description)

metascroy (Contributor, Author) replied:

> How did you test gemma3 with etllm? (from your PR description)

I published quantized checkpoints for all of these models on my personal Hugging Face page: https://huggingface.co/metascroy

So you can do something like:

HF_HUB_DISABLE_XET=1 lm_eval --model hf --model_args pretrained=metascroy/gemma-3-4b-it-INT8-INT4-HQQ --tasks mmlu --device cuda:0 --batch_size 8

to reproduce the numbers.
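
The same thing can be done from Python instead of the CLI; here is a sketch using lm_eval's simple_evaluate entry point (result-dict keys vary a bit across lm_eval versions):

```
import os

os.environ["HF_HUB_DISABLE_XET"] = "1"  # same workaround as the CLI invocation above

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=metascroy/gemma-3-4b-it-INT8-INT4-HQQ",
    tasks=["mmlu"],
    device="cuda:0",
    batch_size=8,
)

# Recent lm_eval versions report accuracy under the "acc,none" key for the mmlu group.
print(results["results"]["mmlu"])
```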

metascroy (Contributor, Author) replied:

> Can we make this part of the release/1.0?

I'll cherry pick, but I need to land the AO diff train before this so we don't break internal.

metascroy added a commit that referenced this pull request Oct 9, 2025
Summary:
Depending on the model, we see a 1% to 10% improvement in MMLU accuracy by using the HQQ algorithm for quantization.

We make HQQ the default quantization algorithm in the LLM export script:

```
MODEL			INT8-INT4		INT8-INT4-HQQ
---------------------------------------------
Gemma3-4B		0.50			0.55
SmolLM3-3B		0.5397			0.5546
Phi4-mini-4B	0.6050			0.6156
Qwen3-4B		0.6526			0.6585
Llama1B			0.2892			0.3153
```

Evaluations were done using lm_eval with the following command:
```
HF_HUB_DISABLE_XET=1 lm_eval --model hf --model_args pretrained=quantized_model_id --tasks mmlu --device cuda:0 --batch_size 8
```

Here quantized_model_id corresponds to one of the model/quant-scheme combinations above, uploaded to my Hugging Face account.

Pull Request resolved: #14834

Reviewed By: mergennachin

Differential Revision: D84020605

Pulled By: metascroy

meta-codesync bot commented Oct 9, 2025

@metascroy has exported this pull request. If you are a Meta employee, you can view the originating Diff in D84020605.

meta-codesync bot merged commit d39992f into main Oct 10, 2025
257 of 291 checks passed
meta-codesync bot deleted the make-hqq-default branch October 10, 2025 00:36
metascroy (Contributor, Author) commented:

@pytorchbot cherry-pick --onto release/1.0 -c critical

metascroy (Contributor, Author) commented:

@pytorchbot cherry-pick --onto release/1.0 -c "critical"

pytorchbot (Collaborator) commented:

Cherry picking #14834

Command git -C /home/runner/work/executorch/executorch cherry-pick -x -X theirs d39992f6d971e3548ee3ffe943d9224f63979126 returned non-zero exit code 1

hint: Recursive merging with submodules currently only supports trivial cases.
hint: Please manually handle the merging of each conflicted submodule.
hint: This can be accomplished with the following steps:
hint:  - go to submodule (third-party/ao), and either merge commit 01849b2b1
hint:    or update to an existing commit which has merged those changes
hint:  - come back to superproject and run:
hint:
hint:       git add third-party/ao
hint:
hint:    to record the above merge or update
hint:  - resolve any other conflicts in the superproject
hint:  - commit the resulting index in the superproject
hint:
hint: Disable this message with "git config set advice.submoduleMergeConflict false"
Auto-merging examples/models/llama/export_llama_lib.py
Failed to merge submodule third-party/ao (commits don't follow merge-base)
CONFLICT (submodule): Merge conflict in third-party/ao
error: could not apply d39992f6d9... Make HQQ default PTQ quantization in ExecuTorch
hint: After resolving the conflicts, mark them with
hint: "git add/rm <pathspec>", then run
hint: "git cherry-pick --continue".
hint: You can instead skip this commit with "git cherry-pick --skip".
hint: To abort and get back to the state before "git cherry-pick",
hint: run "git cherry-pick --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
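
Concretely, the manual resolution described in the hints boils down to roughly the following (a sketch; which third-party/ao commit to pin depends on what release/1.0 expects):

```
# Run inside the checkout where the cherry-pick stopped.
cd third-party/ao
git fetch origin
git checkout 01849b2b1        # or any commit that already contains the merged change
cd ../..
git add third-party/ao        # record the resolved submodule pointer
# resolve any remaining conflicts in the superproject, then:
git cherry-pick --continue
```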
Details for Dev Infra team: raised by workflow job.

metascroy added a commit to metascroy/executorch that referenced this pull request Oct 10, 2025
Differential Revision: D84020605

Pull Request resolved: pytorch#14834

(cherry picked from commit d39992f)