🐛 Describe the bug
I am trying to quantize a neural network and deploy it on an Android device. I expected the quantized model to have a lower inference time than the unquantized model, but I do not see that speedup. What could cause this?
The operating system I use:
22.04.1-Ubuntu
The Android device I use:
Redmi K50 with a Dimensity 8100 octa-core CPU (max 2.85 GHz), 8.0 GB RAM, Android 14.
The results I got (the numbers fluctuate between runs, but show no clear speedup):
FP32 MobileNet avg: 5.234 ms
INT8 MobileNet avg: 5.639 ms
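Since the averages fluctuate between runs, a single mean per model makes it hard to tell a real difference from noise. Below is a minimal sketch of one way to compare two sets of per-run latencies, assuming each run's time is logged individually. The helpers `summarize` and `clearly_faster` are hypothetical names, the threshold is a rough heuristic rather than a statistical test, and the sample values are placeholders, not measurements from this device.

```python
import statistics


def summarize(latencies_ms):
    """Return median, mean, and sample standard deviation of per-run latencies."""
    return {
        "median": statistics.median(latencies_ms),
        "mean": statistics.fmean(latencies_ms),
        "stdev": statistics.stdev(latencies_ms),
    }


def clearly_faster(a_ms, b_ms):
    """True if median(a) is below median(b) by more than the larger run-to-run spread.

    Rough heuristic only (an assumption of this sketch), not a significance test.
    """
    a, b = summarize(a_ms), summarize(b_ms)
    return b["median"] - a["median"] > max(a["stdev"], b["stdev"])


# Hypothetical placeholder values, NOT measurements from the device above.
runs_a = [5.1, 5.3, 5.2, 5.4, 5.2]
runs_b = [5.5, 5.7, 5.6, 5.8, 5.6]

print(summarize(runs_a))
print(clearly_faster(runs_a, runs_b))
```

Collecting the per-run times would mean timing each `forward` call separately in the benchmark loop instead of timing the whole loop once.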
Python packages I am using (pip list output):
Package Version
------------------------- --------------------
absl-py 2.3.1
accelerate 1.9.0
aiohappyeyeballs 2.6.1
aiohttp 3.12.13
aiosignal 1.3.2
annotated-types 0.7.0
antlr4-python3-runtime 4.9.3
anyio 4.7.0
appdirs 1.4.4
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 3.0.0
async-lru 2.0.4
async-timeout 5.0.1
attrs 24.3.0
audioread 3.0.1
babel 2.16.0
beautifulsoup4 4.13.4
bleach 6.2.0
brotlicffi 1.0.9.2
cattrs 25.1.1
certifi 2025.6.15
cffi 1.17.1
charset-normalizer 3.3.2
coloredlogs 15.0.1
comm 0.2.1
contourpy 1.3.2
coremltools 8.3.0
cppimport 22.8.2
cupy-cuda12x 13.5.1
cycler 0.12.1
datasets 4.0.0
debugpy 1.8.11
decorator 5.1.1
defusedxml 0.7.1
diffusers 0.34.0
dill 0.3.8
dllist 2.0.0
exceptiongroup 1.2.0
execnet 2.1.1
executing 0.8.3
executorch 0.7.0
expecttest 0.3.0
fastjsonschema 2.20.0
fastrlock 0.8.3
filelock 3.18.0
flatbuffers 25.2.10
fonttools 4.58.4
frozenlist 1.7.0
fsspec 2025.3.0
grpcio 1.74.0
h11 0.16.0
h5py 3.14.0
hf-xet 1.1.5
httpcore 1.0.9
httpx 0.28.1
huggingface-hub 0.33.4
humanfriendly 10.0
hydra-core 1.3.2
hypothesis 6.138.2
idna 3.10
imageio 2.37.0
importlib_metadata 8.7.0
iniconfig 2.1.0
ipykernel 6.29.5
ipython 8.30.0
jedi 0.19.2
Jinja2 3.1.6
joblib 1.5.1
json5 0.9.25
jsonschema 4.25.0
jsonschema-specifications 2023.7.1
jupyter_client 8.6.3
jupyter_core 5.8.1
jupyter-events 0.12.0
jupyter-lsp 2.2.5
jupyter_server 2.16.0
jupyter_server_terminals 0.5.3
jupyterlab 4.4.4
jupyterlab_pygments 0.3.0
jupyterlab_server 2.27.3
kiwisolver 1.4.8
lazy_loader 0.4
librosa 0.11.0
lightning 2.5.2
lightning-utilities 0.14.3
llvmlite 0.44.0
lpips 0.1.4
Mako 1.2.3
Markdown 3.8.2
markdown-it-py 3.0.0
MarkupSafe 3.0.2
matplotlib 3.10.3
matplotlib-inline 0.1.6
mdurl 0.1.2
mistune 3.1.2
ml_dtypes 0.5.1
mpmath 1.3.0
msgpack 1.1.1
multidict 6.5.1
multiprocess 0.70.16
nbclient 0.10.2
nbconvert 7.16.6
nbformat 5.10.4
ncnn 1.0.20250503
nest-asyncio 1.6.0
networkx 3.4.2
ninja 1.11.1.4
notebook 7.4.4
notebook_shim 0.2.4
numba 0.61.2
numpy 2.2.6
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu11 11.8.89
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-ml-py 12.575.51
nvidia-modelopt 0.33.0
nvidia-modelopt-core 0.33.0
nvidia-nccl-cu12 2.27.3
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvtx-cu12 12.8.90
nvitop 1.5.1
omegaconf 2.3.0
onnx 1.18.0
onnx_graphsurgeon 0.5.8
onnx-ir 0.1.2
onnxruntime-gpu 1.22.0
onnxscript 0.3.0
onnxsim 0.4.36
opencv-python 4.11.0.86
overrides 7.4.0
packaging 25.0
pandas 2.3.0
pandocfilters 1.5.0
parameterized 0.9.0
parso 0.8.4
peft 0.16.0
pexpect 4.9.0
pillow 11.2.1
pip 25.1
platformdirs 4.3.7
pluggy 1.6.0
pnnx 20250725
polygraphy 0.49.26
pooch 1.8.2
portalocker 3.2.0
prometheus_client 0.21.1
prompt-toolkit 3.0.43
propcache 0.3.2
protobuf 6.31.1
psutil 5.9.0
ptyprocess 0.7.0
PuLP 3.2.1
pure-eval 0.2.2
py-cpuinfo 9.0.0
py3nvml 0.2.7
pyaml 25.7.0
pyarrow 21.0.0
pybind11 3.0.0
pycparser 2.21
pycuda 2025.1.1
pydantic 2.11.7
pydantic_core 2.33.2
Pygments 2.19.1
pyparsing 3.2.3
pypesq 1.2.4
PySocks 1.7.1
pytest 8.4.1
pytest-rerunfailures 15.1
pytest-xdist 3.8.0
python-dateutil 2.9.0.post0
python-json-logger 3.2.1
python_speech_features 0.6
pytools 2025.2.2
pytorch-lightning 2.5.2
pytorch-msssim 1.0.0
pytz 2025.2
PyWavelets 1.8.0
PyYAML 6.0.2
pyzmq 26.2.0
quanto 0.2.0
referencing 0.30.2
regex 2024.11.6
requests 2.32.4
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rich 14.0.0
rpds-py 0.22.3
ruamel.yaml 0.18.14
ruamel.yaml.clib 0.2.12
safetensors 0.5.3
scikit-image 0.25.2
scikit-learn 1.7.0
scipy 1.15.3
seaborn 0.13.2
Send2Trash 1.8.2
setuptools 78.1.1
siphash24 1.7
six 1.17.0
sniffio 1.3.0
sortedcontainers 2.4.0
soundfile 0.13.1
soupsieve 2.5
soxr 0.5.0.post1
stack-data 0.2.0
sympy 1.14.0
tabulate 0.9.0
tensorboard 2.20.0
tensorboard-data-server 0.7.2
tensorboardX 2.6.4
tensorrt 10.12.0.36
tensorrt_cu12 10.12.0.36
tensorrt_cu12_bindings 10.12.0.36
tensorrt_cu12_libs 10.12.0.36
terminado 0.17.1
thop 0.1.1.post2209072238
threadpoolctl 3.6.0
tifffile 2025.5.10
tinycss2 1.4.0
tokenizers 0.21.2
tomli 2.2.1
torch 2.8.0
torch_tensorrt 2.8.0
torchao 0.12.0
torchaudio 2.8.0
torchinfo 1.8.0
torchlibrosa 0.1.0
torchmetrics 1.7.3
torchprofile 0.0.4
torchvision 0.23.0
tornado 6.5.1
tqdm 4.67.1
traitlets 5.14.3
transformers 4.53.3
triton 3.4.0
typing_extensions 4.12.2
typing-inspection 0.4.1
tzdata 2025.2
urllib3 2.5.0
wcwidth 0.2.13
webencodings 0.5.1
websocket-client 1.8.0
Werkzeug 3.1.3
wheel 0.45.1
xmltodict 0.14.2
xxhash 3.5.0
yarl 1.20.1
zipp 3.23.0
I followed the ExecuTorch documentation. The Python script that generates the quantized and unquantized models is:
```python
import torch
import torchvision.models as models
from torch.export import export, export_for_training, ExportedProgram
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from executorch.exir import (
    EdgeCompileConfig,
    EdgeProgramManager,
    ExecutorchProgramManager,
    to_edge_transform_and_lower,
)
from executorch.exir.backend.backend_api import to_backend
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e


def quantize(model, example_inputs):
    """This is the official recommended flow for quantization in pytorch 2.0 export"""
    print(f"Original model: {model}")
    quantizer = XNNPACKQuantizer()
    # if we set is_per_channel to True, we also need to add out_variant of quantize_per_channel/dequantize_per_channel
    operator_config = get_symmetric_quantization_config(is_per_channel=False)
    quantizer.set_global(operator_config)
    m = prepare_pt2e(model, quantizer)
    # calibration
    m(*example_inputs)
    m = convert_pt2e(m)
    print(f"Quantized model: {m}")
    # make sure we can export to flat buffer
    return m


# Network and input
mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

# Unquantized model to .pte
exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs)
edge: EdgeProgramManager = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
)
exec_prog = edge.to_executorch()
with open("../checkpoint/android/xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)

# Quantized model to .pte
mobilenet_v2 = export_for_training(mobilenet_v2, sample_inputs).module()  # 2-stage export for quantization path
quantized_mobilenetv2 = quantize(mobilenet_v2, sample_inputs)
edge = to_edge_transform_and_lower(
    export(quantized_mobilenetv2, sample_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
    partitioner=[XnnpackPartitioner()]
)
exec_prog = edge.to_executorch()
with open("../checkpoint/android/qs8_xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)
```
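One quick sanity check before benchmarking is to compare the sizes of the two exported files. Int8 weights take one byte each instead of four, so the quantized `.pte` would be expected to be noticeably smaller than the float one if quantization took effect. This is a minimal sketch using the two output paths from the script above. `size_ratio` is a hypothetical helper, not part of ExecuTorch, and the expected ratio is an assumption, not a measured value.

```python
import os


def size_ratio(float_path, quant_path):
    """Return the float file size divided by the quantized file size (both in bytes)."""
    return os.path.getsize(float_path) / os.path.getsize(quant_path)


if __name__ == "__main__":
    # Paths taken from the export script above.
    float_pte = "../checkpoint/android/xnnpack_mobilenetv2.pte"
    quant_pte = "../checkpoint/android/qs8_xnnpack_mobilenetv2.pte"
    # Only report when both files exist, so the sketch is safe to run anywhere.
    if os.path.exists(float_pte) and os.path.exists(quant_pte):
        print(f"float/quantized size ratio: {size_ratio(float_pte, quant_pte):.2f}x")
```

A ratio close to 1 would suggest the weights were not stored as int8. The same check also helps confirm which file is which before the files are copied into the Android assets folder.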
The Java code I used:
```java
package com.example.pytorch2android;

import android.content.Context;
import android.os.Bundle;
import android.util.Log;
import android.widget.TextView;

import org.pytorch.executorch.EValue;
import org.pytorch.executorch.Module;
import org.pytorch.executorch.Tensor;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Locale;

import androidx.appcompat.app.AppCompatActivity;

public class MainActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        executorchAndroidMobileNet();
    }

    private void executorchAndroidMobileNet() {
        Module modelInt8_mobilenet = null;
        Module modelFloat_mobilenet = null;
        int height = 224;
        int width = 224;
        try {
            // Load ExecuTorch module from assets
            String etPath_float = assetFilePath(this, "qs8_xnnpack_mobilenetv2.pte");
            String etPath_int8 = assetFilePath(this, "xnnpack_mobilenetv2.pte");
            modelFloat_mobilenet = Module.load(etPath_float);
            modelInt8_mobilenet = Module.load(etPath_int8);
        } catch (IOException e) {
            Log.e("ExecuTorch", "Error reading assets", e);
            finish();
            return;
        }

        // ------------------------------------------------
        // Prepare dummy input
        float[] input = new float[1 * 3 * height * width];
        Tensor inputTensor = Tensor.fromBlob(input, new long[]{1, 3, height, width});
        EValue inputEValue = EValue.from(inputTensor);

        // ------------------------------------------------
        // Benchmark ExecuTorch INT8 model
        double floatTimeMs = benchmarkExecuTorch_mobilenet(modelFloat_mobilenet, inputEValue);
        double int8TimeMs = benchmarkExecuTorch_mobilenet(modelInt8_mobilenet, inputEValue);

        String resultText = String.format(
                Locale.US,
                "FP32 MobileNet avg: %.3f ms\nINT8 MobileNet avg: %.3f ms\nSpeedup CNN: %.2fx",
                floatTimeMs, int8TimeMs, floatTimeMs / int8TimeMs
        );

        // Show results on UI
        TextView textView = findViewById(R.id.text3);
        textView.setText(resultText);
        Log.i("ExecuTorchBenchmark", resultText);
    }

    private double benchmarkExecuTorch_mobilenet(Module model, EValue input) {
        // Warm-up
        for (int i = 0; i < 10; i++) {
            model.forward(input);
        }

        // Timed runs
        int numRuns = 100;
        long startTime = System.nanoTime();
        for (int i = 0; i < numRuns; i++) {
            model.forward(input);
        }
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;
        return (totalTime / (double) numRuns) / 1_000_000.0;
    }

    public static String assetFilePath(Context context, String assetName) throws IOException {
        File file = new File(context.getFilesDir(), assetName);
        // Always overwrite (useful for debugging and model updates)
        try (InputStream is = context.getAssets().open(assetName)) {
            try (OutputStream os = new FileOutputStream(file, false)) {
                byte[] buffer = new byte[4 * 1024];
                int read;
                while ((read = is.read(buffer)) != -1) {
                    os.write(buffer, 0, read);
                }
                os.flush();
            }
        }
        Log.i("AssetLoader", "Copied asset to " + file.getAbsolutePath() + " size=" + file.length());
        return file.getAbsolutePath();
    }
}
```
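In the Java code as posted, `etPath_float` is assigned `qs8_xnnpack_mobilenetv2.pte` and `etPath_int8` is assigned `xnnpack_mobilenetv2.pte`. The two labels in the output may therefore be swapped relative to the files the Python script writes. The short sketch below shows how the reported averages (5.234 ms and 5.639 ms) read under each labelling. The swap is an assumption drawn from the asset names, not something confirmed on the device, and `speedup` is a hypothetical helper.

```python
def speedup(float_ms, int8_ms):
    """Speedup of the INT8 model over the FP32 model (>1 means INT8 is faster)."""
    return float_ms / int8_ms


# Averages reported in this issue, in milliseconds, under the labels as printed.
labelled_fp32_ms = 5.234
labelled_int8_ms = 5.639

# Reading the output exactly as labelled.
as_labelled = speedup(labelled_fp32_ms, labelled_int8_ms)
# Assumption: the labels are swapped because of the asset-name mix-up.
if_swapped = speedup(labelled_int8_ms, labelled_fp32_ms)

print(f"as labelled: {as_labelled:.3f}x, if swapped: {if_swapped:.3f}x")
# prints "as labelled: 0.928x, if swapped: 1.077x"
```

Under either reading the difference is below 10%, which is within the range where run-to-run fluctuation matters.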
Versions of the Java packages I used:
[versions]
agp = "8.12.0"
fbjni = "0.7.0"
junit = "4.13.2"
junitVersion = "1.1.5"
espressoCore = "3.5.1"
appcompat = "1.7.1"
material = "1.10.0"
activity = "1.8.0"
constraintlayout = "2.1.4"
executorch_android = "0.7.0"
soloader = "0.12.1"
Versions