Use FusedMovingAvgObsFakeQuantize instead of FakeQuantize for faster QAT #14740

navsud · 2025-10-02T01:18:43Z

Summary:
FusedMovingAvgObsFakeQuantize speeds up QAT by fusing FakeQuantize and MovingAverageMinMaxObserver into one CUDA op, so using it should give a good speedup. This change updates the QAT qconfigs accordingly.

Tested on a Llama model on HTP and got a ~4x QAT speedup.

Differential Revision: D83583655
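
For readers unfamiliar with the two classes, here is a minimal sketch of the swap using the public `torch.ao.quantization` API. The arguments in `common_args` are illustrative assumptions and not the actual values used in `qconfig.py`; the sketch runs on CPU only to show that both constructors accept the same arguments and observe the same range, not to reproduce the speedup.

```python
import torch
from torch.ao.quantization import (
    FakeQuantize,
    FusedMovingAvgObsFakeQuantize,
    MovingAverageMinMaxObserver,
)

# Assumed, illustrative settings (not the PR's real qconfig arguments).
common_args = dict(
    observer=MovingAverageMinMaxObserver,
    quant_min=0,
    quant_max=255,
    dtype=torch.quint8,
    qscheme=torch.per_tensor_affine,
)

# Before: separate observer + fake-quantize steps.
unfused_ctr = FakeQuantize.with_args(**common_args)
# After: the same arguments, handed to the fused module.
fused_ctr = FusedMovingAvgObsFakeQuantize.with_args(**common_args)

unfused = unfused_ctr()
fused = fused_ctr()

torch.manual_seed(0)
x = torch.randn(4, 16)

# One training-mode forward pass: each module updates its running
# min/max and returns the fake-quantized activation.
y_unfused = unfused(x)
y_fused = fused(x)

scale_unfused, zp_unfused = unfused.calculate_qparams()
scale_fused, zp_fused = fused.calculate_qparams()

print("max abs difference:", (y_unfused - y_fused).abs().max().item())
print("scales:", scale_unfused.item(), scale_fused.item())
```

Because the constructor arguments are unchanged, the swap is a one-line change per qconfig, as in the diff under review.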

pytorch-bot · 2025-10-02T01:18:47Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14740

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit d95e62f with merge base c997fe4:

NEW FAILURE - The following job has failed:

pull / test-samsung-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t fb0466d9e960b7defa6fc6ae3822e6c03e39f551d189f590eb707fae16bb8509 /exec failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot · 2025-10-02T01:18:55Z

@navsud has exported this pull request. If you are a Meta employee, you can view the originating Diff in D83583655.

cccclai · 2025-10-03T00:36:42Z

backends/qualcomm/quantizer/qconfig.py

 ) -> QuantizationConfig:
    extra_args: Dict[str, Any] = {"eps": 2**-20}
-    act_fake_quant_ctr = FakeQuantize.with_args(
+    act_fake_quant_ctr = FusedMovingAvgObsFakeQuantize.with_args(


What's the difference between FakeQuantize and FusedMovingAvgObsFakeQuantize?

navsud · 2025-10-03

FusedMovingAvgObsFakeQuantize, as the name suggests, combines FakeQuantize and MovingAverageMinMaxObserver into a single op, which makes it faster than running the two as separate ops.
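
To make the answer concrete, the following is a simplified pure-Python model of the math involved, not PyTorch's actual kernel. All function names here are hypothetical, and the 8-bit range and averaging constant of 0.01 are assumptions for illustration. `unfused_step` runs the observer update and the fake-quantize step as separate calls; `fused_step` does the same arithmetic in a single call, which is the idea behind the fused op.

```python
import math

AVERAGING_CONST = 0.01  # assumed for illustration
QMIN, QMAX = 0, 255     # assumed unsigned 8-bit range


def new_state():
    """Running range before any batch has been observed."""
    return {"min": math.inf, "max": -math.inf}


def observe(state, xs):
    """Step 1 (observer): update the running min/max with a moving average."""
    lo, hi = min(xs), max(xs)
    if math.isinf(state["min"]):  # first batch: take the batch range as-is
        state["min"], state["max"] = lo, hi
    else:
        state["min"] += AVERAGING_CONST * (lo - state["min"])
        state["max"] += AVERAGING_CONST * (hi - state["max"])


def qparams(state):
    """Derive an affine scale and zero point from the running range."""
    lo, hi = min(state["min"], 0.0), max(state["max"], 0.0)
    scale = max((hi - lo) / (QMAX - QMIN), 1e-8)
    zero_point = max(QMIN, min(QMAX, QMIN - round(lo / scale)))
    return scale, zero_point


def fake_quantize(xs, scale, zero_point):
    """Step 2 (fake quantize): quantize, clamp, and dequantize each value."""
    out = []
    for x in xs:
        q = max(QMIN, min(QMAX, round(x / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def unfused_step(state, xs):
    """Observer and fake-quantize as separate calls."""
    observe(state, xs)
    scale, zero_point = qparams(state)
    return fake_quantize(xs, scale, zero_point)


def fused_step(state, xs):
    """The same arithmetic in one call, standing in for the fused op."""
    lo, hi = min(xs), max(xs)
    if math.isinf(state["min"]):
        state["min"], state["max"] = lo, hi
    else:
        state["min"] += AVERAGING_CONST * (lo - state["min"])
        state["max"] += AVERAGING_CONST * (hi - state["max"])
    rng_lo, rng_hi = min(state["min"], 0.0), max(state["max"], 0.0)
    scale = max((rng_hi - rng_lo) / (QMAX - QMIN), 1e-8)
    zp = max(QMIN, min(QMAX, QMIN - round(rng_lo / scale)))
    return [(max(QMIN, min(QMAX, round(x / scale) + zp)) - zp) * scale for x in xs]


if __name__ == "__main__":
    a, b = new_state(), new_state()
    batch = [-1.0, 0.0, 0.5, 1.0]
    print(unfused_step(a, batch) == fused_step(b, batch))
```

In this toy model the two paths cost the same; in PyTorch the gain comes from replacing several separately launched ops with one, which this sketch does not measure.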

navsud requested a review from cccclai as a code owner October 2, 2025 01:18

meta-cla bot added the CLA Signed label Oct 2, 2025

facebook-github-bot added fb-exported meta-exported labels Oct 2, 2025

navsud force-pushed the export-D83583655 branch from 698f6bf to aa9860c October 2, 2025 19:38

navsud added the release notes: none label Oct 2, 2025

navsud requested review from haowhsu-quic and shewu-quic October 2, 2025 19:39

navsud force-pushed the export-D83583655 branch from aa9860c to 5e2a9ef October 2, 2025 20:00

navsud requested a review from lucylq as a code owner October 2, 2025 20:00

navsud requested a review from jackzhxng as a code owner October 2, 2025 20:00

navsud force-pushed the export-D83583655 branch from 5e2a9ef to c1f908c October 2, 2025 23:28

billmguo approved these changes Oct 3, 2025

cccclai reviewed Oct 3, 2025

navsud force-pushed the export-D83583655 branch from c1f908c to 4cbcb42 October 3, 2025 00:44

navsud force-pushed the export-D83583655 branch from 4cbcb42 to f7dc1f0 October 3, 2025 01:05

navsud force-pushed the export-D83583655 branch from f7dc1f0 to d95e62f October 3, 2025 01:38

billmguo approved these changes Oct 3, 2025

facebook-github-bot merged commit e652746 into pytorch:main Oct 3, 2025
131 of 133 checks passed
