
Enables the per_tensor lowering patterns for weight pre-packing #2391


Open · choudhary-devang wants to merge 2 commits into main

Conversation

@choudhary-devang (Collaborator) commented Jun 17, 2025

This PR is an extension of PR #2139.

Major changes:
1) Introduced a lowering pattern for "per_tensor" quantized weights.
2) Modified the original API get_default_arm_inductor_quantization_config to let the user choose between "per_tensor" and "per_channel" granularity for the model's weight quantization.

Supported shapes:

  1. s8:s8:f32 (per_tensor / per_channel) - input: s8, weight: s8, output: f32
  2. u8:s8:f32 (per_tensor / per_channel) - input: u8, weight: s8, output: f32

Tested and verified on different models:

  • BERT
  • ResNet
  • ViT
  • Custom models

Example script for reference:

import torch
from transformers import BertModel
from torchao.quantization.pt2e.quantize_pt2e import prepare_pt2e, convert_pt2e
import torchao.quantization.pt2e.quantizer.arm_inductor_quantizer as aiq
from torchao.quantization.pt2e.quantizer.arm_inductor_quantizer import ArmInductorQuantizer
import torch._inductor.config as config
# Enable C++ wrapper for Inductor
config.cpp_wrapper = True
config.freezing = True

model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)

# Set the model to eval mode
model = model.eval()

# Create the data, using dummy data here as an example
traced_bs = 32
seq_length = 128
x = torch.randint(0, 10000, (traced_bs, seq_length))
attention_mask = torch.ones((traced_bs, seq_length))
example_inputs = (x, attention_mask)

# Capture the FX Graph to be quantized
with torch.no_grad():
    exported_model = torch.export.export_for_training(model, example_inputs).module()
    # Set up the quantizer and prepare the model for post-training quantization
    quantizer = ArmInductorQuantizer()
    quantizer.set_global(aiq.get_default_arm_inductor_quantization_config(is_dynamic=True, is_per_channel=True))
    prepared_model = prepare_pt2e(exported_model, quantizer)
    converted_model = convert_pt2e(prepared_model)
    converted_model = torch.compile(converted_model)
    with torch.profiler.profile(record_shapes=True) as prof:
        for _ in range(200):
            converted_model(*example_inputs)
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))
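
The script above uses is_per_channel=True, i.e. per_channel weight quantization. A minimal sketch of selecting the new per_tensor path instead, via the is_per_channel flag added to get_default_arm_inductor_quantization_config in this PR (only the quantizer configuration changes; the rest of the script is identical):

# Per_tensor weight quantization: only the config call changes.
quantizer = ArmInductorQuantizer()
quantizer.set_global(
    aiq.get_default_arm_inductor_quantization_config(
        is_dynamic=True,
        is_per_channel=False,  # per_tensor quantized weights (the new lowering path)
    )
)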

Results

Model    FP32      quant (int8)    Speedup
resnet   62.967    44.482          1.415561
bert     103.879   71.953          1.443706
vit      69.031    59.973          1.151035

All times in seconds (speedup = FP32 time / int8 time, e.g. 103.879 / 71.953 ≈ 1.44 for bert), measured on an AWS Graviton3E 32-core instance.

Pip list: (environment screenshot attached to the PR; not reproduced here)

cc: @jerryzh168, @fadara01, @Xia-Weiwen


pytorch-bot bot commented Jun 17, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2391

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit e51e9ec with merge base 11ce634:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 17, 2025
@choudhary-devang (Collaborator Author)

Hi @jerryzh168, @fadara01, @Xia-Weiwen, could you please review this PR?
Thank you!

@jerryzh168 (Contributor)

Thanks! Can you add some tests in https://github.com/pytorch/ao/tree/main/test/quantization/pt2e?

@jerryzh168 jerryzh168 added the topic: improvement Use this tag if this PR is an improvement (doesn't fit into any of the other categories) label Jun 26, 2025
@choudhary-devang (Collaborator Author)

Hi @jerryzh168,
I have added test cases specific to these changes; to keep them separate, they live in a new file: ao/test/quantization/pt2e/test_arm_inductor_quantizer_per_tensor.py.
Could you please review?
Thank you!

@fadara01
Thanks for your PR!
Do we see any speedups (against fp32) for e.g. bert / resnet50 as a result of this lowering?
Do we need to do any work in PyTorch (qconv and qlinear) to support such lowerings?

@choudhary-devang (Collaborator Author)

> Thanks for your PR! Do we see any speedups (against fp32) for e.g. bert / resnet50 as a result of this lowering? Do we need to do any work in PyTorch (qconv and qlinear) to support such lowerings?

Hi @fadara01, thanks for the response.
I have updated the description to include some of the details; we don't need any changes in PyTorch.
For my experimentation I used pip install torch torchvision.

To recreate the experiment:
FP32 script

import torch
from transformers import BertModel

# model loading
model_name = "bert-base-uncased"
model = BertModel.from_pretrained(model_name)
# Create the data, using dummy data here as an example
traced_bs = 32
seq_length = 128
x = torch.randint(0, 10000, (traced_bs, seq_length))
attention_mask = torch.ones((traced_bs, seq_length))
example_inputs = (x, attention_mask)

# Inference
with torch.no_grad():
    model = torch.compile(model)
    with torch.profiler.profile(record_shapes=True) as prof:
        for _ in range(200):
            model(*example_inputs)
print(prof.key_averages(group_by_input_shape=True).table(sort_by="self_cpu_time_total"))

Quant script: identical to the example script in the PR description above.

Current setup
Kernel (oneDNN verbose):
onednn_verbose,v1,primitive,exec,cpu,matmul,lowp_gemm:acl,undef,src:s8:a:blocked:ab::f0 wei:s8::blocked:ab::f0 bia:f32:a:blocked:ab::f0_mask2 dst:f32:a:blocked:ab::f0,attr-scratchpad:user attr-scales:src0:0:f32+wei:0:f32 attr-zero-points:src0:0:s32,,50x512:512x1000,0.224854
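
For reference, the verbose line above can be reproduced by enabling oneDNN's verbose mode (the standard ONEDNN_VERBOSE environment variable) before running the quantized model:

import os

# Each executed oneDNN primitive prints one line, like the
# matmul/lowp_gemm:acl entry above. Set this before importing torch
# so it takes effect when oneDNN initializes.
os.environ["ONEDNN_VERBOSE"] = "1"

import torch
# ... then run the quant script as above ...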

@fadara01 commented Jul 21, 2025

Ahhh that's amazing! I remember doing a PoC for this exact thing back in the day and I had to tweak qlinear/qconv, hence my question.

@choudhary-devang (Collaborator Author)

Hi @jerryzh168, @fadara01, could you please approve and merge this change?
Thank you!
