Fix QNN runner KV cache bitwidth in Android JNI#17622

Closed
infil00p wants to merge 1 commit intopytorch:mainfrom
baseweight:fix-qnn-kv-bitwidth

Conversation

@infil00p infil00p commented Feb 23, 2026

Summary

The QNN runner was hardcoded to use Runner<uint16_t>, but all current Llama quantization recipes use annotate_kv_8bit for 8-bit KV cache. This mismatch caused the KV cache data to be misinterpreted, resulting in degenerate output (repetitive real words like "Nigeria Nigeria...") while the model otherwise ran correctly on the HTP NPU.
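For intuition, here is a minimal, hypothetical C++ sketch (not ExecuTorch code; names are invented for illustration) of why the wrong element width corrupts every KV cache entry instead of crashing: each 16-bit load fuses two adjacent 8-bit entries, so the model keeps running on garbage values.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration: an 8-bit quantized KV cache buffer read
// through a 16-bit element type. Each uint16_t load spans two adjacent
// 8-bit entries, so every value the runner sees is wrong regardless of
// endianness -- the model still runs, but decoding degenerates.
uint16_t read_as_uint16(const uint8_t* kv_cache, int index) {
  uint16_t v;
  std::memcpy(&v, kv_cache + 2 * index, sizeof(v));
  return v;
}
```

With a buffer like `{0x11, 0x22, 0x33, 0x44}`, no 16-bit read ever equals any of the original 8-bit entries, which matches the observed symptom: plausible-looking but degenerate output rather than a crash.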

Authored with Claude

Test plan

I built the AAR and used it in the LlamaDemo in executorch-examples. Testing requires QAIRT 2.43.0, which is newer than the QAIRT version used to build the most recent ExecuTorch release. This was tested on a OnePlus 15 running a Snapdragon 8 Elite Gen 5 (SM8850).

cc @kirklandsign @cbilgin @cccclai

pytorch-bot bot commented Feb 23, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17622

Note: Links to docs will display an error until the docs builds have been completed.

❌ 7 New Failures, 1 Cancelled Job, 5 Unrelated Failures

As of commit af1f892 with merge base 298311e:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Feb 23, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@nil-is-all nil-is-all added the module: android Issues related to Android code, build, and execution label Feb 23, 2026
@nil-is-all nil-is-all added module: extension Issues related to code under extension/ module: qnn Issues related to Qualcomm's QNN delegate and code under backends/qualcomm/ labels Feb 23, 2026
  executorch::extension::Module::LoadMode::MmapUseMlockIgnoreErrors);
  std::string decoder_model = "llama3"; // use llama3 for now
- runner_ = std::make_unique<example::Runner<uint16_t>>( // QNN runner
+ runner_ = std::make_unique<example::Runner<uint8_t>>( // QNN runner (8-bit KV cache)
Contributor
I think we need a better way to handle this... I remember Llama uses a 16-bit KV cache and Qwen uses an 8-bit KV cache. cc: @haowhsu-quic

Collaborator
Yes, we probably need a branch here to dispatch the runner correctly.

Contributor

abhinaykukkadapu commented Feb 24, 2026
@haowhsu-quic Can we detect it by dynamically querying get_kv_io_bit_width from the model, if the method exists, and do something like this (defaulting to 8-bit)?

example::KvBitWidth kv_bitwidth = example::KvBitWidth::kWidth8;
if (module->method_names()->count("get_kv_io_bit_width") > 0) {
  kv_bitwidth = static_cast<example::KvBitWidth>(
      module->get("get_kv_io_bit_width")
          .get()
          .toScalar()
          .to<int64_t>());
}

if (kv_bitwidth == example::KvBitWidth::kWidth16) {
  runner_ = std::make_unique<example::Runner<uint16_t>>(...);
} else {
  runner_ = std::make_unique<example::Runner<uint8_t>>(...);
}

Contributor
@haowhsu-quic is there an update on dynamically querying the bit-width? Is there anything you would recommend the PR's author do?

Collaborator
Hi @abhinaykukkadapu , we actually do this in our runner (https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/qnn_llama_runner.cpp).

example::KvBitWidth kv_bitwidth = example::KvBitWidth::kWidth8;
if (module->method_names()->count("get_kv_io_bit_width") > 0) {
  kv_bitwidth = static_cast<example::KvBitWidth>(
    module->get("get_kv_io_bit_width").get().toScalar().to<int64_t>());
}

Maybe the PR's author can try to follow this approach.

Contributor
@infil00p could you try this approach and request review once done?

Contributor
@infil00p bringing this up again

abhinaykukkadapu pushed a commit that referenced this pull request Apr 7, 2026
Summary:
The QNN runner in the Android JNI layer was hardcoded to use
Runner<uint16_t>, but models can be exported with either 8-bit or
16-bit KV caches. This mismatch caused the KV cache data to be
misinterpreted, resulting in gibberish output in the Android demo app
while the same model worked correctly via the CLI runner.

This change mirrors the dynamic KV bitwidth detection already present
in qnn_llama_runner.cpp by querying the model's get_kv_io_bit_width
method and instantiating the correct Runner<uint8_t> or
Runner<uint16_t> accordingly. Also passes temperature_ to the Runner
constructor which was previously omitted.

Fixes #18571
Closes #17622

Test Plan:
- Built Android AAR with QNN support (SDK 2.37) — jni_layer_llama.cpp
  compiles cleanly with both Runner<uint8_t> and Runner<uint16_t>
  template instantiations
- Unit tests pass (gradlew testDebugUnitTest)
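The dispatch the commit describes can be sketched roughly as follows. This is a simplified stand-in, not the actual jni_layer_llama.cpp code: KvBitWidth, RunnerBase, Runner, and the constructor arguments are assumed from the thread for illustration only.

```cpp
#include <cstdint>
#include <memory>

namespace example {
// Assumed stand-ins for the real ExecuTorch types mentioned in the thread.
enum class KvBitWidth : int64_t { kWidth8 = 8, kWidth16 = 16 };

struct RunnerBase {
  virtual ~RunnerBase() = default;
  virtual int kv_element_bytes() const = 0;
};

template <typename KvT>
struct Runner : RunnerBase {
  explicit Runner(float temperature) : temperature_(temperature) {}
  int kv_element_bytes() const override { return sizeof(KvT); }
  float temperature_;  // forwarded, per the commit's note that it was omitted
};
}  // namespace example

// Branch on the detected bitwidth, defaulting to 8-bit, and forward the
// temperature to whichever Runner specialization is instantiated.
std::unique_ptr<example::RunnerBase> make_runner(
    example::KvBitWidth kv_bitwidth, float temperature) {
  if (kv_bitwidth == example::KvBitWidth::kWidth16) {
    return std::make_unique<example::Runner<uint16_t>>(temperature);
  }
  return std::make_unique<example::Runner<uint8_t>>(temperature);
}
```

The design point is that both template instantiations are compiled into the binary, and the choice between them is deferred to runtime based on what the loaded model reports via get_kv_io_bit_width.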
abhinaykukkadapu pushed a commit to abhinaykukkadapu/executorch that referenced this pull request Apr 7, 2026
@github-project-automation github-project-automation bot moved this from Todo to Done in ExecuTorch Android Apr 8, 2026