
[Performance] [QNN EP] Performance gap between onnxruntime QNN EP and Genie from QNN SDK. #24417

Open
@edgchen1


Describe the issue

There is a performance gap between onnxruntime-genai / onnxruntime QNN EP and Genie from the QNN SDK.

This was observed on an Android Snapdragon 8 Elite device (DSP arch v79).

Consider the AI Hub Phi-3.5-mini model.

| Metric | onnxruntime-genai | Genie |
| --- | --- | --- |
| Token generation rate (tokens/second) | 11.7679 | 17.136786 |
| Prompt processing rate (tokens/second) | 65.4206 | 374.111481 |
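
To put a number on the gap, the rates above can be turned into ratios. This is a minimal sketch that only restates the reported figures; the dictionary layout and the `speedup` helper are illustrative names, not part of either tool.

```python
# Throughput figures copied from the measurements above (tokens/second).
rates = {
    "token generation": {"onnxruntime-genai": 11.7679, "Genie": 17.136786},
    "prompt processing": {"onnxruntime-genai": 65.4206, "Genie": 374.111481},
}


def speedup(row):
    """Return how many times faster Genie is than onnxruntime-genai."""
    return row["Genie"] / row["onnxruntime-genai"]


speedups = {name: speedup(row) for name, row in rates.items()}

for name, factor in speedups.items():
    print(f"{name}: Genie is {factor:.2f}x faster")
```

On these numbers Genie is roughly 1.46x faster at token generation and roughly 5.72x faster at prompt processing.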

This model is split into four parts that run on the NPU. Let's consider just the second part during token generation. With QNN basic profiling enabled, I observed latencies like these:

| Metric | onnxruntime-genai | Genie |
| --- | --- | --- |
| Accelerator (execute excluding wait) time (microseconds) | 18351 | 14318 |
| QNN (execute) time (microseconds) | 19235 | 16438 |
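
One way to read these latencies is to split the QNN execute time into the accelerator portion and whatever is left over. The sketch below does that arithmetic. It assumes that the accelerator time is fully contained in the QNN execute time, which the profiling output does not state explicitly, and the names `breakdown` and `remainder` are illustrative.

```python
# Latencies copied from the measurements above (microseconds), second model
# part, token generation.
latency_us = {
    "onnxruntime-genai": {"accelerator": 18351, "qnn_execute": 19235},
    "Genie": {"accelerator": 14318, "qnn_execute": 16438},
}


def breakdown(sample):
    """Split QNN execute time into accelerator time and the remainder."""
    return {
        "accelerator": sample["accelerator"],
        # Assumption: time not spent on the accelerator is wait/host overhead.
        "remainder": sample["qnn_execute"] - sample["accelerator"],
    }


parts = {name: breakdown(sample) for name, sample in latency_us.items()}

# Differences between the two runtimes (onnxruntime-genai minus Genie).
accelerator_gap_us = (
    parts["onnxruntime-genai"]["accelerator"] - parts["Genie"]["accelerator"]
)
execute_gap_us = (
    latency_us["onnxruntime-genai"]["qnn_execute"]
    - latency_us["Genie"]["qnn_execute"]
)

for name, part in parts.items():
    print(name, part)
print("accelerator gap (us):", accelerator_gap_us)
print("QNN execute gap (us):", execute_gap_us)
```

Under that assumption, the accelerator portion is 4033 microseconds longer with onnxruntime-genai, while the remainder is actually smaller (884 versus 2120 microseconds), so the overall 2797-microsecond gap would come from time spent on the accelerator itself.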

However, when the performance mode/profile is changed from "burst" to "balanced", the latencies of onnxruntime-genai and Genie become more similar.
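
For reference, on the onnxruntime side the performance mode and profiling level are QNN EP provider options. The sketch below builds the two option sets being compared. `htp_performance_mode`, `profiling_level`, and `backend_path` are documented QNN EP option keys; the `qnn_provider_options` helper and the exact values used in the attached configuration are assumptions, since the issue does not show them.

```python
def qnn_provider_options(performance_mode="burst", profiling_level="basic"):
    """Build a QNN EP provider-options dict (hypothetical helper)."""
    return {
        # HTP (NPU) backend library on Android; assumed to be on the library path.
        "backend_path": "libQnnHtp.so",
        # Performance mode under comparison, e.g. "burst" or "balanced".
        "htp_performance_mode": performance_mode,
        # "basic" corresponds to the QNN basic profiling mentioned above.
        "profiling_level": profiling_level,
    }


burst = qnn_provider_options("burst")
balanced = qnn_provider_options("balanced")

# Only the performance mode should differ between the two runs.
changed_keys = {key for key in burst if burst[key] != balanced[key]}
print(changed_keys)
```

A dict like this can be passed as provider options when creating an onnxruntime session with the QNN EP, for example through `AppendExecutionProvider("QNN", ...)` in the C++ API. Keeping every other option identical makes the performance mode the only variable between runs.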

We suspect that a difference in how the "burst" performance mode/profile is handled contributes to the observed performance gap. Much of the Genie source code is available with the SDK, but this part is handled opaquely by a call into the backend extensions library. We could use some help from Qualcomm folks to investigate this further.

To reproduce

Code versions:
Genie SDK: v2.33.0.250327
onnxruntime: aada488
onnxruntime-genai: c1d04ea0

Download context binaries for Snapdragon 8 Elite from here.

ort_qnn_ep_issue.zip has directories with additional configuration and model files. After copying over the context binaries, the appropriate directory can be run with either Genie or onnxruntime-genai.

Urgency

No response

Platform

Android

OS Version

15

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

aada488

ONNX Runtime API

C++

Architecture

ARM64

Execution Provider

Other / Unknown

Execution Provider Library Version

QNN 2.33

Model File

No response

Is this a quantized model?

Yes

Labels

ep:QNN (issues related to QNN execution provider), performance (issues related to performance regressions), platform:mobile (issues related to ONNX Runtime mobile; typically submitted using template)