Quantization of mobilenet_v2 on an Android device shows no acceleration #13502

@MiZhou22

🐛 Describe the bug

I am trying to quantize a neural network and deploy it on an Android device. I expected the quantized model to have a lower inference time than the unquantized model, but the measured latencies are nearly identical. What could explain this?

The operating system I use:
22.04.1-Ubuntu

The Android device I use:
Redmi K50 with a Dimensity 8100 octa-core CPU (up to 2.85 GHz), 8.0 GB RAM, Android 14.

The results I got (the numbers fluctuate between runs, but show no clear speedup):
FP32 MobileNet avg: 5.234 ms
INT8 MobileNet avg: 5.639 ms
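
For reference, the ratio implied by the two averages above can be computed directly. This is only a minimal sketch: the helper name `speedup` is hypothetical, and the inputs are the two averages as labelled in the report.

```python
def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline.

    A value below 1.0 means the candidate was slower than the baseline.
    """
    if baseline_ms <= 0 or candidate_ms <= 0:
        raise ValueError("latencies must be positive")
    return baseline_ms / candidate_ms


fp32_avg_ms = 5.234  # reported FP32 average
int8_avg_ms = 5.639  # reported INT8 average

ratio = speedup(fp32_avg_ms, int8_avg_ms)
print(f"INT8 speedup over FP32: {ratio:.2f}x")
```

With these two numbers the ratio comes out slightly below 1.0, which matches the observation that the quantized run is not faster.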

Python packages I am using (output of pip list):

Package                   Version
------------------------- --------------------
absl-py                   2.3.1
accelerate                1.9.0
aiohappyeyeballs          2.6.1
aiohttp                   3.12.13
aiosignal                 1.3.2
annotated-types           0.7.0
antlr4-python3-runtime    4.9.3
anyio                     4.7.0
appdirs                   1.4.4
argon2-cffi               21.3.0
argon2-cffi-bindings      21.2.0
asttokens                 3.0.0
async-lru                 2.0.4
async-timeout             5.0.1
attrs                     24.3.0
audioread                 3.0.1
babel                     2.16.0
beautifulsoup4            4.13.4
bleach                    6.2.0
brotlicffi                1.0.9.2
cattrs                    25.1.1
certifi                   2025.6.15
cffi                      1.17.1
charset-normalizer        3.3.2
coloredlogs               15.0.1
comm                      0.2.1
contourpy                 1.3.2
coremltools               8.3.0
cppimport                 22.8.2
cupy-cuda12x              13.5.1
cycler                    0.12.1
datasets                  4.0.0
debugpy                   1.8.11
decorator                 5.1.1
defusedxml                0.7.1
diffusers                 0.34.0
dill                      0.3.8
dllist                    2.0.0
exceptiongroup            1.2.0
execnet                   2.1.1
executing                 0.8.3
executorch                0.7.0
expecttest                0.3.0
fastjsonschema            2.20.0
fastrlock                 0.8.3
filelock                  3.18.0
flatbuffers               25.2.10
fonttools                 4.58.4
frozenlist                1.7.0
fsspec                    2025.3.0
grpcio                    1.74.0
h11                       0.16.0
h5py                      3.14.0
hf-xet                    1.1.5
httpcore                  1.0.9
httpx                     0.28.1
huggingface-hub           0.33.4
humanfriendly             10.0
hydra-core                1.3.2
hypothesis                6.138.2
idna                      3.10
imageio                   2.37.0
importlib_metadata        8.7.0
iniconfig                 2.1.0
ipykernel                 6.29.5
ipython                   8.30.0
jedi                      0.19.2
Jinja2                    3.1.6
joblib                    1.5.1
json5                     0.9.25
jsonschema                4.25.0
jsonschema-specifications 2023.7.1
jupyter_client            8.6.3
jupyter_core              5.8.1
jupyter-events            0.12.0
jupyter-lsp               2.2.5
jupyter_server            2.16.0
jupyter_server_terminals  0.5.3
jupyterlab                4.4.4
jupyterlab_pygments       0.3.0
jupyterlab_server         2.27.3
kiwisolver                1.4.8
lazy_loader               0.4
librosa                   0.11.0
lightning                 2.5.2
lightning-utilities       0.14.3
llvmlite                  0.44.0
lpips                     0.1.4
Mako                      1.2.3
Markdown                  3.8.2
markdown-it-py            3.0.0
MarkupSafe                3.0.2
matplotlib                3.10.3
matplotlib-inline         0.1.6
mdurl                     0.1.2
mistune                   3.1.2
ml_dtypes                 0.5.1
mpmath                    1.3.0
msgpack                   1.1.1
multidict                 6.5.1
multiprocess              0.70.16
nbclient                  0.10.2
nbconvert                 7.16.6
nbformat                  5.10.4
ncnn                      1.0.20250503
nest-asyncio              1.6.0
networkx                  3.4.2
ninja                     1.11.1.4
notebook                  7.4.4
notebook_shim             0.2.4
numba                     0.61.2
numpy                     2.2.6
nvidia-cublas-cu12        12.8.4.1
nvidia-cuda-cupti-cu12    12.8.90
nvidia-cuda-nvrtc-cu11    11.8.89
nvidia-cuda-nvrtc-cu12    12.8.93
nvidia-cuda-runtime-cu12  12.8.90
nvidia-cudnn-cu12         9.10.2.21
nvidia-cufft-cu12         11.3.3.83
nvidia-cufile-cu12        1.13.1.3
nvidia-curand-cu12        10.3.9.90
nvidia-cusolver-cu12      11.7.3.90
nvidia-cusparse-cu12      12.5.8.93
nvidia-cusparselt-cu12    0.7.1
nvidia-ml-py              12.575.51
nvidia-modelopt           0.33.0
nvidia-modelopt-core      0.33.0
nvidia-nccl-cu12          2.27.3
nvidia-nvjitlink-cu12     12.8.93
nvidia-nvtx-cu12          12.8.90
nvitop                    1.5.1
omegaconf                 2.3.0
onnx                      1.18.0
onnx_graphsurgeon         0.5.8
onnx-ir                   0.1.2
onnxruntime-gpu           1.22.0
onnxscript                0.3.0
onnxsim                   0.4.36
opencv-python             4.11.0.86
overrides                 7.4.0
packaging                 25.0
pandas                    2.3.0
pandocfilters             1.5.0
parameterized             0.9.0
parso                     0.8.4
peft                      0.16.0
pexpect                   4.9.0
pillow                    11.2.1
pip                       25.1
platformdirs              4.3.7
pluggy                    1.6.0
pnnx                      20250725
polygraphy                0.49.26
pooch                     1.8.2
portalocker               3.2.0
prometheus_client         0.21.1
prompt-toolkit            3.0.43
propcache                 0.3.2
protobuf                  6.31.1
psutil                    5.9.0
ptyprocess                0.7.0
PuLP                      3.2.1
pure-eval                 0.2.2
py-cpuinfo                9.0.0
py3nvml                   0.2.7
pyaml                     25.7.0
pyarrow                   21.0.0
pybind11                  3.0.0
pycparser                 2.21
pycuda                    2025.1.1
pydantic                  2.11.7
pydantic_core             2.33.2
Pygments                  2.19.1
pyparsing                 3.2.3
pypesq                    1.2.4
PySocks                   1.7.1
pytest                    8.4.1
pytest-rerunfailures      15.1
pytest-xdist              3.8.0
python-dateutil           2.9.0.post0
python-json-logger        3.2.1
python_speech_features    0.6
pytools                   2025.2.2
pytorch-lightning         2.5.2
pytorch-msssim            1.0.0
pytz                      2025.2
PyWavelets                1.8.0
PyYAML                    6.0.2
pyzmq                     26.2.0
quanto                    0.2.0
referencing               0.30.2
regex                     2024.11.6
requests                  2.32.4
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rich                      14.0.0
rpds-py                   0.22.3
ruamel.yaml               0.18.14
ruamel.yaml.clib          0.2.12
safetensors               0.5.3
scikit-image              0.25.2
scikit-learn              1.7.0
scipy                     1.15.3
seaborn                   0.13.2
Send2Trash                1.8.2
setuptools                78.1.1
siphash24                 1.7
six                       1.17.0
sniffio                   1.3.0
sortedcontainers          2.4.0
soundfile                 0.13.1
soupsieve                 2.5
soxr                      0.5.0.post1
stack-data                0.2.0
sympy                     1.14.0
tabulate                  0.9.0
tensorboard               2.20.0
tensorboard-data-server   0.7.2
tensorboardX              2.6.4
tensorrt                  10.12.0.36
tensorrt_cu12             10.12.0.36
tensorrt_cu12_bindings    10.12.0.36
tensorrt_cu12_libs        10.12.0.36
terminado                 0.17.1
thop                      0.1.1.post2209072238
threadpoolctl             3.6.0
tifffile                  2025.5.10
tinycss2                  1.4.0
tokenizers                0.21.2
tomli                     2.2.1
torch                     2.8.0
torch_tensorrt            2.8.0
torchao                   0.12.0
torchaudio                2.8.0
torchinfo                 1.8.0
torchlibrosa              0.1.0
torchmetrics              1.7.3
torchprofile              0.0.4
torchvision               0.23.0
tornado                   6.5.1
tqdm                      4.67.1
traitlets                 5.14.3
transformers              4.53.3
triton                    3.4.0
typing_extensions         4.12.2
typing-inspection         0.4.1
tzdata                    2025.2
urllib3                   2.5.0
wcwidth                   0.2.13
webencodings              0.5.1
websocket-client          1.8.0
Werkzeug                  3.1.3
wheel                     0.45.1
xmltodict                 0.14.2
xxhash                    3.5.0
yarl                      1.20.1
zipp                      3.23.0

I followed the ExecuTorch documentation. The Python script that generates the quantized and unquantized models is:

import torch
import torchvision.models as models

from torch.export import export, export_for_training, ExportedProgram
from torchvision.models.mobilenetv2 import MobileNet_V2_Weights
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    get_symmetric_quantization_config,
    XNNPACKQuantizer,
)
from executorch.exir import EdgeCompileConfig, EdgeProgramManager, to_edge_transform_and_lower
from torchao.quantization.pt2e.quantize_pt2e import convert_pt2e, prepare_pt2e

def quantize(model, example_inputs):
    """This is the official recommended flow for quantization in pytorch 2.0 export"""
    print(f"Original model: {model}")
    quantizer = XNNPACKQuantizer()
    # if we set is_per_channel to True, we also need to add out_variant of quantize_per_channel/dequantize_per_channel
    operator_config = get_symmetric_quantization_config(is_per_channel=False)
    quantizer.set_global(operator_config)
    m = prepare_pt2e(model, quantizer)
    # Calibration: a single forward pass over the example inputs (random data here)
    m(*example_inputs)
    m = convert_pt2e(m)
    print(f"Quantized model: {m}")
    return m

# Network and input
mobilenet_v2 = models.mobilenetv2.mobilenet_v2(weights=MobileNet_V2_Weights.DEFAULT).eval()
sample_inputs = (torch.randn(1, 3, 224, 224), )

# Unquantized model to .pte
exported_program: ExportedProgram = export(mobilenet_v2, sample_inputs)
edge: EdgeProgramManager = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
)
exec_prog = edge.to_executorch()

with open("../checkpoint/android/xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)


# Quantized model to .pte
mobilenet_v2 = export_for_training(mobilenet_v2, sample_inputs).module() # 2-stage export for quantization path
quantized_mobilenetv2 = quantize(mobilenet_v2, sample_inputs)
edge = to_edge_transform_and_lower(
    export(quantized_mobilenetv2, sample_inputs),
    compile_config=EdgeCompileConfig(_check_ir_validity=False),
    partitioner=[XnnpackPartitioner()]
)
exec_prog = edge.to_executorch()

with open("../checkpoint/android/qs8_xnnpack_mobilenetv2.pte", "wb") as file:
    exec_prog.write_to_file(file)
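
One quick sanity check on the two exported files is to compare their sizes: int8 weights occupy one byte each versus four for float32, so an export whose weights were actually quantized would be expected to be noticeably smaller. The sketch below is a heuristic under that assumption, not an ExecuTorch API. `pte_size_report` is a hypothetical helper, and the paths are the ones used in the script above.

```python
import os


def pte_size_report(float_path: str, quant_path: str) -> dict:
    """Compare the on-disk sizes of a float and a quantized .pte file."""
    float_bytes = os.path.getsize(float_path)
    quant_bytes = os.path.getsize(quant_path)
    if quant_bytes == 0:
        raise ValueError(f"{quant_path} is empty")
    return {
        "float_bytes": float_bytes,
        "quant_bytes": quant_bytes,
        # Ratio > 1.0 means the quantized file is smaller than the float file.
        "ratio": float_bytes / quant_bytes,
    }


# Paths taken from the export script; only checked if both files exist.
FLOAT_PTE = "../checkpoint/android/xnnpack_mobilenetv2.pte"
QUANT_PTE = "../checkpoint/android/qs8_xnnpack_mobilenetv2.pte"

if os.path.exists(FLOAT_PTE) and os.path.exists(QUANT_PTE):
    print(pte_size_report(FLOAT_PTE, QUANT_PTE))
```

A ratio close to 1.0 would suggest that the weights in the second file were not stored as int8. A clearly larger ratio says nothing about latency by itself, only that quantized weights were written.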

The Java code I used:

package com.example.pytorch2android;

import android.content.Context;
import android.os.Bundle;
import android.util.Log;
import android.widget.TextView;

import org.pytorch.executorch.EValue;
import org.pytorch.executorch.Module;
import org.pytorch.executorch.Tensor;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Locale;

import androidx.appcompat.app.AppCompatActivity;

public class MainActivity extends AppCompatActivity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);
        executorchAndroidMobileNet();
    }

    private void executorchAndroidMobileNet() {
        Module modelInt8_mobilenet = null; 
        Module modelFloat_mobilenet = null;  
        int height = 224;
        int width = 224;

        try {
            // Load ExecuTorch modules from assets:
            // the float path points to the unquantized export, the int8 path to the qs8 export
            String etPath_float = assetFilePath(this, "xnnpack_mobilenetv2.pte");
            String etPath_int8 = assetFilePath(this, "qs8_xnnpack_mobilenetv2.pte");
            modelFloat_mobilenet = Module.load(etPath_float);
            modelInt8_mobilenet = Module.load(etPath_int8);
        } catch (IOException e) {
            Log.e("ExecuTorch", "Error reading assets", e);
            finish();
            return;
        }

        // ------------------------------------------------
        // Prepare dummy input
        float[] input = new float[1 * 3 * height * width];
        Tensor inputTensor = Tensor.fromBlob(input, new long[]{1, 3, height, width});
        EValue inputEValue = EValue.from(inputTensor);

        // ------------------------------------------------
        // Benchmark the FP32 and INT8 ExecuTorch models
        double floatTimeMs = benchmarkExecuTorch_mobilenet(modelFloat_mobilenet, inputEValue);
        double int8TimeMs = benchmarkExecuTorch_mobilenet(modelInt8_mobilenet, inputEValue);

        String resultText = String.format(
                Locale.US,
                "FP32 MobileNet avg: %.3f ms\nINT8 MobileNet avg: %.3f ms\nSpeedup CNN: %.2fx",
                floatTimeMs, int8TimeMs, floatTimeMs/ int8TimeMs
        );

        // Show results on UI
        TextView textView = findViewById(R.id.text3);
        textView.setText(resultText);
        Log.i("ExecuTorchBenchmark", resultText);
    }

    private double benchmarkExecuTorch_mobilenet(Module model, EValue input) {
        // Warm-up
        for (int i = 0; i < 10; i++) {
            model.forward(input);
        }

        // Timed runs
        int numRuns = 100;
        long startTime = System.nanoTime();
        for (int i = 0; i < numRuns; i++) {
            model.forward(input);
        }
        long endTime = System.nanoTime();
        long totalTime = endTime - startTime;

        return (totalTime / (double) numRuns) / 1_000_000.0;
    }


    public static String assetFilePath(Context context, String assetName) throws IOException {
        File file = new File(context.getFilesDir(), assetName);

        // Always overwrite (useful for debugging and model updates)
        try (InputStream is = context.getAssets().open(assetName)) {
            try (OutputStream os = new FileOutputStream(file, false)) {
                byte[] buffer = new byte[4 * 1024];
                int read;
                while ((read = is.read(buffer)) != -1) {
                    os.write(buffer, 0, read);
                }
                os.flush();
            }
        }

        Log.i("AssetLoader", "Copied asset to " + file.getAbsolutePath() + " size=" + file.length());
        return file.getAbsolutePath();
    }
}
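
The benchmark loop above reports a single mean over 100 runs. Since the numbers fluctuate from run to run, it can help to keep each run's latency and report the spread as well. The following is a minimal sketch of the same warm-up plus timed-run structure, written in Python to match the export script. `time_runs` and `summarize` are hypothetical helpers, and the lambda is only a stand-in for the `model.forward(input)` call.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(fn: Callable[[], object], warmup: int = 10, runs: int = 100) -> List[float]:
    """Call fn `warmup` times untimed, then return `runs` per-call latencies in ms."""
    for _ in range(warmup):
        fn()
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter_ns()
        fn()
        latencies_ms.append((time.perf_counter_ns() - start) / 1_000_000.0)
    return latencies_ms


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Return mean, median, sample standard deviation, min and max in ms."""
    return {
        "mean": statistics.mean(latencies_ms),
        "median": statistics.median(latencies_ms),
        "stdev": statistics.stdev(latencies_ms) if len(latencies_ms) > 1 else 0.0,
        "min": min(latencies_ms),
        "max": max(latencies_ms),
    }


# Stand-in workload; on device this would be the model's forward call.
stats = summarize(time_runs(lambda: sum(range(1000)), warmup=2, runs=20))
print(stats)
```

If the difference between the two models' means is smaller than the standard deviation of either, the measurement alone cannot tell them apart.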

Versions of the Java/Android dependencies I used:
[versions]
agp = "8.12.0"
fbjni = "0.7.0"
junit = "4.13.2"
junitVersion = "1.1.5"
espressoCore = "3.5.1"
appcompat = "1.7.1"
material = "1.10.0"
activity = "1.8.0"
constraintlayout = "2.1.4"
executorch_android = "0.7.0"
soloader = "0.12.1"

