Describe the issue
There is a performance gap between onnxruntime-genai / onnxruntime QNN EP and Genie from the QNN SDK.
This was observed on an Android Snapdragon 8 Elite device (DSP arch v79).
Consider the AI Hub Phi-3.5-mini model.
| | onnxruntime-genai | Genie |
|---|---|---|
| Token generation rate (tokens/second) | 11.7679 | 17.136786 |
| Prompt processing rate (tokens/second) | 65.4206 | 374.111481 |
This model is split into four parts that run on the NPU. Consider just the second part during token generation. With QNN basic profiling enabled, I observed latencies like the following:
| | onnxruntime-genai | Genie |
|---|---|---|
| Accelerator (execute excluding wait) time (microseconds) | 18351 | 14318 |
| QNN (execute) time (microseconds) | 19235 | 16438 |
However, when the performance mode/profile is changed from "burst" to "balanced", the latencies from onnxruntime-genai and Genie are much closer.
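For reference, on the onnxruntime side these settings are passed as QNN EP provider options. A minimal sketch using the ONNX Runtime C++ API ("profiling_level" and "htp_performance_mode" are documented QNN EP option keys; the model path is a placeholder):

```cpp
#include <onnxruntime_cxx_api.h>

#include <string>
#include <unordered_map>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "qnn_perf_repro"};
  Ort::SessionOptions session_options;

  // QNN EP provider options corresponding to the measurements above.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "libQnnHtp.so"},
      {"profiling_level", "basic"},       // emits the execute/accelerator timings quoted above
      {"htp_performance_mode", "burst"},  // switch to "balanced" to reproduce the smaller gap
  };
  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Placeholder path; the actual runs load the precompiled context binaries.
  Ort::Session session{env, "model_part_2.onnx", session_options};
  return 0;
}
```

In the actual repro these options come from the configuration files in each directory rather than from direct API calls; the sketch just shows which knobs are in play.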
The suspicion is that a difference in the handling of the "burst" performance mode/profile is contributing to the observed performance gap. Much of the Genie source code is available with the SDK, but this part is handled opaquely by calling into the backend extensions library. We could use some help from Qualcomm folks to investigate this further.
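On the Genie side, the analogous knob appears to live in the HTP backend-extensions config file that ships with AI Hub Genie bundles and is consumed by the extensions library. A sketch of such a file (key names as seen in AI Hub bundles; the soc_model and latency values are illustrative for Snapdragon 8 Elite):

```json
{
  "devices": [
    {
      "soc_model": 69,
      "dsp_arch": "v79",
      "cores": [
        { "core_id": 0, "perf_profile": "burst", "rpc_control_latency": 100 }
      ]
    }
  ],
  "memory": { "mem_type": "shared_buffer" }
}
```

Swapping "perf_profile" between "burst" and "balanced" here is the Genie-side counterpart of the EP option; what the backend extensions library does with it internally is the part we cannot inspect.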
To reproduce
Code versions:
Genie SDK: v2.33.0.250327
onnxruntime: aada488
onnxruntime-genai: c1d04ea0
Download context binaries for Snapdragon 8 Elite from here.
ort_qnn_ep_issue.zip contains directories with additional configuration and model files. After copying in the context binaries, each directory can be run with either Genie or onnxruntime-genai.
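For the onnxruntime-genai side, generation can be driven and timed along these lines (a sketch against the C++ API around commit c1d04ea0; the directory name and prompt are placeholders):

```cpp
#include "ort_genai.h"

#include <chrono>
#include <iostream>

int main() {
  // Hypothetical directory containing genai_config.json and the model parts.
  auto model = OgaModel::Create("phi3.5-mini-qnn");
  auto tokenizer = OgaTokenizer::Create(*model);

  auto sequences = OgaSequences::Create();
  tokenizer->Encode("<|user|>\nTell me a story.<|end|>\n<|assistant|>\n", *sequences);

  auto params = OgaGeneratorParams::Create(*model);
  params->SetSearchOption("max_length", 256);

  auto generator = OgaGenerator::Create(*model, *params);
  generator->AppendTokenSequences(*sequences);

  // The first GenerateNextToken() covers prompt processing; the remaining
  // iterations approximate the steady-state token generation rate.
  auto start = std::chrono::steady_clock::now();
  size_t tokens = 0;
  while (!generator->IsDone()) {
    generator->GenerateNextToken();
    ++tokens;
  }
  auto seconds =
      std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  std::cout << tokens / seconds << " tokens/second (prompt + decode combined)\n";
  return 0;
}
```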
Urgency
No response
Platform
Android
OS Version
15
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
aada488
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
Other / Unknown
Execution Provider Library Version
QNN 2.33
Model File
No response
Is this a quantized model?
Yes