
Conversation

Contributor

@Ninja91 Ninja91 commented Aug 25, 2025

Stack from ghstack (oldest at bottom):

This diff adds a 16A8W (16-bit activations, 8-bit weights) quantization configuration utility to the ExecuTorch ARM backend, following the feedback on D79746479.

## Key Changes

**1. New Quantization Configuration Function**

- Add `get_16a8w_quantization_config()` in `fbcode/executorch/backends/arm/quantizer/arm_quantizer.py`.
- Uses 16-bit activations with `HistogramObserver`, giving better precision than the default 8A8W config (see the spec sketch below).
- Keeps 8-bit weights with `MinMaxObserver`/`PerChannelMinMaxObserver`, so weight memory stays the same.
- Supported by TOSA through the [EXT-INT16 extension/profile](https://www.mlplatform.org/tosa/tosa_spec.html#_conv2d).
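For illustration, here is a rough sketch of the kind of activation/weight spec pairing described above, built on the `torch.ao` observer and `QuantizationSpec` classes. The concrete ranges, qschemes, and observer arguments are assumptions for illustration, not necessarily what the new helper returns.

```python
import torch
from torch.ao.quantization.observer import HistogramObserver, PerChannelMinMaxObserver
from torch.ao.quantization.quantizer import QuantizationSpec

# 16-bit activation spec: per-tensor symmetric, calibrated with a HistogramObserver.
# Ranges, qscheme, and eps are illustrative assumptions, not the exact values in this diff.
act_qspec_16bit = QuantizationSpec(
    dtype=torch.int16,
    quant_min=-(2**15),
    quant_max=2**15 - 1,
    qscheme=torch.per_tensor_symmetric,
    is_dynamic=False,
    observer_or_fake_quant_ctr=HistogramObserver.with_args(eps=2**-12),
)

# 8-bit weight spec: per-channel symmetric, as in the existing 8A8W config.
weight_qspec_8bit = QuantizationSpec(
    dtype=torch.int8,
    quant_min=-127,
    quant_max=127,
    qscheme=torch.per_channel_symmetric,
    ch_axis=0,
    is_dynamic=False,
    observer_or_fake_quant_ctr=PerChannelMinMaxObserver.with_args(eps=2**-12),
)
```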

## Benefits

- **Better Precision**: 16-bit activations carry more precision than 8-bit ones, which helps preserve accuracy when activations are carried across time steps, e.g. in recurrent neural networks. A minimal end-to-end usage sketch follows below.
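As a usage example, here is a minimal sketch of plugging the new config into the standard PT2E prepare/convert flow. The `TOSAQuantizer` class, its `set_global()` method, the `TosaSpecification` import, and the profile string are assumptions based on the existing Arm quantizer API and may need adjusting; only `get_16a8w_quantization_config()` itself comes from this diff.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed imports: TOSAQuantizer/TosaSpecification follow the existing Arm
# backend API; only get_16a8w_quantization_config() is new in this diff.
from executorch.backends.arm.quantizer.arm_quantizer import (
    TOSAQuantizer,
    get_16a8w_quantization_config,
)
from executorch.backends.arm.tosa_specification import TosaSpecification


class TinyNet(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(16, 8)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = TinyNet().eval()
example_inputs = (torch.randn(1, 16),)

# Hypothetical TOSA integer-profile string; the int16 extension must be enabled.
tosa_spec = TosaSpecification.create_from_string("TOSA-1.0+INT+int16")
quantizer = TOSAQuantizer(tosa_spec)
quantizer.set_global(get_16a8w_quantization_config())

exported = torch.export.export(model, example_inputs)
prepared = prepare_pt2e(exported.module(), quantizer)
prepared(*example_inputs)  # calibration pass feeds the HistogramObserver
quantized = convert_pt2e(prepared)
```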

Differential Revision: [D79763381](https://our.internmc.facebook.com/intern/diff/D79763381/)

cc @digantdesai @freddan80 @per @zingo @oscarandersson8218

@Ninja91 Ninja91 requested a review from digantdesai as a code owner August 25, 2025 18:56
Ninja91 added a commit that referenced this pull request Aug 25, 2025

pytorch-bot bot commented Aug 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13641

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures

As of commit bc36615 with merge base 9053089:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Aug 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79763381


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@Ninja91
Contributor Author

Ninja91 commented Aug 25, 2025

The older PR was #13175, which was closed.

Ninja91 added a commit to Ninja91/executorch that referenced this pull request Aug 25, 2025
Ninja91 added a commit to Ninja91/executorch that referenced this pull request Aug 25, 2025
@mergennachin mergennachin added the partner: arm label Aug 25, 2025
Ninja91 added a commit to Ninja91/executorch that referenced this pull request Aug 25, 2025
Ninja91 added a commit to Ninja91/executorch that referenced this pull request Aug 25, 2025
Ninja91 added a commit to Ninja91/executorch that referenced this pull request Aug 25, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79763381

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79763381

Contributor

@digantdesai digantdesai left a comment

Typically you add an accompanying test with a PR, so that each PR (and the commits inside it) is self-contained.

@Ninja91
Contributor Author

Ninja91 commented Aug 26, 2025

@digantdesai I see some unrelated tests failing: https://github.com/pytorch/executorch/actions/runs/17227752037/job/48875652934?pr=13641.

Any guidance on merging this?

Ninja91 added a commit that referenced this pull request Aug 27, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79763381

@Ninja91 Ninja91 merged commit b8ee1d2 into gh/Ninja91/1/base Aug 27, 2025
109 of 112 checks passed
@Ninja91 Ninja91 deleted the gh/Ninja91/1/head branch August 27, 2025 17:19
@Ninja91 Ninja91 temporarily deployed to cherry-pick-bot August 27, 2025 17:19 — with GitHub Actions Inactive
Ninja91 added a commit that referenced this pull request Aug 27, 2025
Ninja91 added a commit that referenced this pull request Aug 27, 2025
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D79763381

lucylq pushed a commit that referenced this pull request Sep 2, 2025
This PR was created by the merge bot to help merge the original PR into the main branch.
ghstack PR number: #13641 by @Ninja91
^ Please use this as the source of truth for the PR details, comments, and reviews
ghstack PR base: https://github.com/pytorch/executorch/tree/gh/Ninja91/1/base
ghstack PR head: https://github.com/pytorch/executorch/tree/gh/Ninja91/1/head
Merge bot PR base: https://github.com/pytorch/executorch/tree/main
Merge bot PR head: https://github.com/pytorch/executorch/tree/gh/Ninja91/1/orig
@diff-train-skip-merge

Co-authored-by: Nitin Jain <jainnitin@meta.com>
Labels: CLA Signed, fb-exported, partner: arm