Fix QNN runner KV cache bitwidth in Android JNI#17622
Conversation
The QNN runner was hardcoded to use Runner<uint16_t>, but all current Llama quantization recipes use annotate_kv_8bit for an 8-bit KV cache. This mismatch caused the KV cache data to be misinterpreted, resulting in degenerate output (repetitive real words like "Nigeria Nigeria...") while the model otherwise ran correctly on the HTP NPU.

Authored with Claude
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🔗 Test artifacts and rendered results: hud.pytorch.org/pr/pytorch/executorch/17622
CI status as of commit af1f892 (merge base 298311e): ❌ 7 new failures, 1 cancelled job, 5 unrelated failures; some jobs failed due to flakiness present on trunk.
     executorch::extension::Module::LoadMode::MmapUseMlockIgnoreErrors);
     std::string decoder_model = "llama3"; // use llama3 for now
-    runner_ = std::make_unique<example::Runner<uint16_t>>( // QNN runner
+    runner_ = std::make_unique<example::Runner<uint8_t>>( // QNN runner (8-bit KV cache)
I think we need a better way to handle this. I remember Llama uses a 16-bit KV cache and Qwen uses an 8-bit KV cache. cc: @haowhsu-quic
Yes, probably need a branch here to dispatch runner correctly.
@haowhsu-quic Can we detect it by dynamically querying get_kv_io_bit_width from the model if the method exists, and do something like this (defaulting to 8-bit)?
example::KvBitWidth kv_bitwidth = example::KvBitWidth::kWidth8;
if (module->method_names()->count("get_kv_io_bit_width") > 0) {
kv_bitwidth = static_cast<example::KvBitWidth>(
module->get("get_kv_io_bit_width")
.get()
.toScalar()
.to<int64_t>());
}
if (kv_bitwidth == example::KvBitWidth::kWidth16) {
runner_ = std::make_unique<example::Runner<uint16_t>>(...)
} else {
runner_ = std::make_unique<example::Runner<uint8_t>>(...)
}
@haowhsu-quic is there an update on dynamically querying the bit-width? Is there anything you'd recommend the PR's author do?
Hi @abhinaykukkadapu, we actually do this in our runner (https://github.com/pytorch/executorch/blob/main/examples/qualcomm/oss_scripts/llama/qnn_llama_runner.cpp):
example::KvBitWidth kv_bitwidth = example::KvBitWidth::kWidth8;
if (module->method_names()->count("get_kv_io_bit_width") > 0) {
kv_bitwidth = static_cast<example::KvBitWidth>(
module->get("get_kv_io_bit_width").get().toScalar().to<int64_t>());
}

Maybe the PR's author can try to follow this approach.
@infil00p could you try this approach and request review once done?
Summary: The QNN runner in the Android JNI layer was hardcoded to use Runner<uint16_t>, but models can be exported with either 8-bit or 16-bit KV caches. This mismatch caused the KV cache data to be misinterpreted, resulting in gibberish output in the Android demo app while the same model worked correctly via the CLI runner.

This change mirrors the dynamic KV bitwidth detection already present in qnn_llama_runner.cpp by querying the model's get_kv_io_bit_width method and instantiating the correct Runner<uint8_t> or Runner<uint16_t> accordingly. It also passes temperature_ to the Runner constructor, which was previously omitted.

Fixes #18571
Closes #17622

Test Plan:
- Built the Android AAR with QNN support (SDK 2.37); jni_layer_llama.cpp compiles cleanly with both Runner<uint8_t> and Runner<uint16_t> template instantiations
- Unit tests pass (gradlew testDebugUnitTest)
Test plan
I built the AAR and used it in the LlamaDemo app in executorch-examples. Testing requires QAIRT 2.43.0, which is newer than the QAIRT version used to build the last ExecuTorch release. Tested on a OnePlus 15 running a Snapdragon 8 Elite Gen 5 (SM8850).
cc @kirklandsign @cbilgin @cccclai