Describe the issue
There is a performance gap between onnxruntime-genai / onnxruntime QNN EP and Genie from the QNN SDK.
This was observed on an Android Snapdragon 8 Elite device (DSP arch v79).
Consider the AI Hub Phi-3.5-mini model.
| | onnxruntime-genai | Genie |
|---|---|---|
| Token generation rate (tokens/second) | 11.7679 | 17.136786 |
| Prompt processing rate (tokens/second) | 65.4206 | 374.111481 |
This model is split into four parts that run on the NPU. Consider just the second part during token generation. With QNN basic profiling enabled, I observed latencies like the following:
| | onnxruntime-genai | Genie |
|---|---|---|
| Accelerator (execute excluding wait) time (microseconds) | 18351 | 14318 |
| QNN (execute) time (microseconds) | 19235 | 16438 |
However, when the performance mode/profile is changed from "burst" to "balanced", the latencies from onnxruntime-genai and Genie are much closer.
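For reference, on the onnxruntime side these settings are passed as QNN EP provider options. A minimal sketch using the ONNX Runtime C++ API ("profiling_level" and "htp_performance_mode" are documented QNN EP option keys; the model path is a placeholder):

```cpp
#include <onnxruntime_cxx_api.h>

#include <string>
#include <unordered_map>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "qnn_perf_repro"};
  Ort::SessionOptions session_options;

  // QNN EP provider options corresponding to the measurements above.
  std::unordered_map<std::string, std::string> qnn_options{
      {"backend_path", "libQnnHtp.so"},
      {"profiling_level", "basic"},       // emits the execute/accelerator timings quoted above
      {"htp_performance_mode", "burst"},  // switch to "balanced" to reproduce the smaller gap
  };
  session_options.AppendExecutionProvider("QNN", qnn_options);

  // Placeholder path; the actual runs load the precompiled context binaries.
  Ort::Session session{env, "model_part_2.onnx", session_options};
  return 0;
}
```

In the actual repro these options come from the configuration files in each directory rather than from direct API calls; the sketch just shows which knobs are in play.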
The suspicion is that a difference in the handling of the "burst" performance mode/profile is contributing to the observed performance gap. Much of the Genie source code is available with the SDK, but this part is handled opaquely by calling into the backend extensions library. We could use some help from Qualcomm folks to investigate this further.
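On the Genie side, the analogous knob appears to live in the HTP backend-extensions config file that ships with AI Hub Genie bundles and is consumed by the extensions library. A sketch of such a file (key names as seen in AI Hub bundles; the soc_model and latency values are illustrative for Snapdragon 8 Elite):

```json
{
  "devices": [
    {
      "soc_model": 69,
      "dsp_arch": "v79",
      "cores": [
        { "core_id": 0, "perf_profile": "burst", "rpc_control_latency": 100 }
      ]
    }
  ],
  "memory": { "mem_type": "shared_buffer" }
}
```

Swapping "perf_profile" between "burst" and "balanced" here is the Genie-side counterpart of the EP option; what the backend extensions library does with it internally is the part we cannot inspect.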
To reproduce
Code versions:
Genie SDK: v2.33.0.250327
onnxruntime: aada488
onnxruntime-genai: c1d04ea0
Download context binaries for Snapdragon 8 Elite from here.
ort_qnn_ep_issue.zip contains directories with additional configuration and model files. After copying in the context binaries, each directory can be run with either Genie or onnxruntime-genai.
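For the onnxruntime-genai side, generation can be driven and timed along these lines (a sketch against the C++ API around commit c1d04ea0; the directory name and prompt are placeholders):

```cpp
#include "ort_genai.h"

#include <chrono>
#include <iostream>

int main() {
  // Hypothetical directory containing genai_config.json and the model parts.
  auto model = OgaModel::Create("phi3.5-mini-qnn");
  auto tokenizer = OgaTokenizer::Create(*model);

  auto sequences = OgaSequences::Create();
  tokenizer->Encode("<|user|>\nTell me a story.<|end|>\n<|assistant|>\n", *sequences);

  auto params = OgaGeneratorParams::Create(*model);
  params->SetSearchOption("max_length", 256);

  auto generator = OgaGenerator::Create(*model, *params);
  generator->AppendTokenSequences(*sequences);

  // The first GenerateNextToken() covers prompt processing; the remaining
  // iterations approximate the steady-state token generation rate.
  auto start = std::chrono::steady_clock::now();
  size_t tokens = 0;
  while (!generator->IsDone()) {
    generator->GenerateNextToken();
    ++tokens;
  }
  auto seconds =
      std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
  std::cout << tokens / seconds << " tokens/second (prompt + decode combined)\n";
  return 0;
}
```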
Urgency
No response
Platform
Android
OS Version
15
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
aada488
ONNX Runtime API
C++
Architecture
ARM64
Execution Provider
Other / Unknown
Execution Provider Library Version
QNN 2.33
Model File
No response
Is this a quantized model?
Yes