Describe the issue
Hi everyone,
I'm using matmul_nbits_quantizer with an 8-bit setting, and quantization itself completes without errors on my side. However, when I run the resulting model on an Android device, it crashes with the following error message:
[E:onnxruntime:, sequential_executor.cc:572 ExecuteKernel] Non-zero status code returned while running MatMulNBits node. Name:'/k_proj/MatMul_Q8' Status Message: /home/iamj/Downloads/onnxruntime/onnxruntime/contrib_ops/cpu/quantization/matmul_nbits.cc:442 Status onnxruntime::contrib::MatMulNBits<float>::ComputeBUnpacked(const Tensor *, const Tensor *, const Tensor *, const Tensor *, const Tensor *, const Tensor *, Tensor *, AllocatorPtr &, concurrency::ThreadPool *, const MatMulComputeHelper &) const [T1 = float] nbits_ == 4 was false. Only 4b quantization is supported for unpacked compute.
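In case it helps with triage, a quick sketch like the one below (the model path is a placeholder, not my actual path) can be used to check which bits and accuracy_level attributes each MatMulNBits node ended up with in the exported graph:

import onnx

# Placeholder path to the 8-bit quantized model.
model = onnx.load("model_quant8.onnx")

# Print the bits / accuracy_level attributes of every MatMulNBits node.
for node in model.graph.node:
    if node.op_type == "MatMulNBits":
        attrs = {a.name: a.i for a in node.attribute if a.name in ("bits", "accuracy_level")}
        print(node.name, attrs)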
The same model quantized to 4 bits works fine.
Details about my setup:
- ONNX Runtime version: 1.22.0 (Android libonnxruntime.so built from source)
- Quantization config: bits=8
- Quantizer initialization:
from onnxruntime.quantization import matmul_nbits_quantizer, quant_utils

# model: onnx.ModelProto loaded earlier; quanted_model_path: output path.
quant_config = matmul_nbits_quantizer.DefaultWeightOnlyQuantConfig(
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
)
quant_config.bits = 8
quant = matmul_nbits_quantizer.MatMulNBitsQuantizer(
    model,
    block_size=32,
    is_symmetric=False,
    accuracy_level=4,
    quant_format=quant_utils.QuantFormat.QOperator,
    algo_config=quant_config,
    nodes_to_exclude=None,
)
quant.process()
quant.model.save_model_to_file(
    quanted_model_path,
    True,  # use_external_data_format
)
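As a side note, the error can presumably also be checked without a device. The sketch below (the path and zero-filled dummy feeds are placeholders of mine, not from my actual setup) loads the 8-bit model with a desktop CPU build of ONNX Runtime and runs it once, which should exercise the same MatMulNBits kernel:

import numpy as np
import onnxruntime as ort

# Placeholder path to the 8-bit quantized model.
sess = ort.InferenceSession("model_quant8.onnx", providers=["CPUExecutionProvider"])

# Zero-filled dummy inputs, just to force every kernel to execute once;
# dynamic dimensions are set to 1 and integer inputs get int64 zeros.
feeds = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.int64 if "int64" in inp.type else np.float32
    feeds[inp.name] = np.zeros(shape, dtype=dtype)

sess.run(None, feeds)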
Thanks in advance for your help!
To reproduce
//
Urgency
//
Platform
Android
OS Version
14
ONNX Runtime Installation
Built from Source
Compiler Version (if 'Built from Source')
CMake 3.31.6, NDK 26.3
Package Name (if 'Released Package')
None
ONNX Runtime Version or Commit ID
1.22.0
ONNX Runtime API
C++/C
Architecture
ARM64
Execution Provider
Default CPU
Execution Provider Library Version
1.22.0