
Fix double-subtraction of pos_ in TextLLMRunner::generate() (#18727)

Merged
meta-codesync[bot] merged 1 commit into main from export-D99742232 on Apr 7, 2026

Conversation

@kirklandsign (Contributor) commented Apr 6, 2026

Summary:

When seq_len is set and pos_ > 0 (multi-turn conversations),
max_context_len was pre-adjusted by subtracting pos_, but
resolve_max_new_tokens then subtracted only num_prompt_tokens
instead of the full occupied position count. This caused
min(seq_len, max_context_len) to use a too-large max_context_len,
producing more tokens than allowed by seq_len.

Fix: use the raw metadata value for max_context_len and pass pos_
(which includes prompt tokens after prefill) to
resolve_max_new_tokens, matching multimodal_runner's behavior.

Differential Revision: D99742232
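
To make the failure mode concrete, here is a minimal standalone sketch. It assumes resolve_max_new_tokens() resolves the budget as min(seq_len, max_context_len) minus the occupied-position argument; the stand-in resolver and the numbers are illustrative, not the library's code.

```cpp
#include <algorithm>
#include <cstdio>

// Hypothetical stand-in for GenerationConfig::resolve_max_new_tokens(),
// assuming min(seq_len, max_context_len) minus the occupied count.
int resolve(int seq_len, int max_context_len, int occupied) {
  return std::max(std::min(seq_len, max_context_len) - occupied, 0);
}

int main() {
  // Turn two of a conversation: 100 positions already in the KV cache,
  // plus a 20-token prompt just prefilled.
  const int max_context_len = 4096; // raw model metadata
  const int seq_len = 128;          // caller-requested total budget
  const int prior_pos = 100;        // pos_ before this turn's prefill
  const int num_prompt_tokens = 20;
  const int pos = prior_pos + num_prompt_tokens; // pos_ after prefill: 120

  // Before the fix: max_context_len pre-adjusted by pos_, but only the
  // prompt tokens subtracted inside the resolver.
  int before = resolve(seq_len, max_context_len - prior_pos, num_prompt_tokens);
  // After the fix: raw max_context_len, full occupied count (pos_).
  int after = resolve(seq_len, max_context_len, pos);

  std::printf("before fix: %d\n", before); // 108: 120 + 108 blows past seq_len
  std::printf("after fix:  %d\n", after);  // 8: 120 + 8 == seq_len
  return 0;
}
```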

pytorch-bot Bot commented Apr 6, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18727

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Pending, 3 Unrelated Failures

As of commit 2332004 with merge base 19f7ff2.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Copilot AI review requested due to automatic review settings April 6, 2026 23:26
meta-cla Bot added the CLA Signed label Apr 6, 2026
meta-codesync Bot commented Apr 6, 2026

@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99742232.

github-actions Bot commented Apr 6, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Copilot AI left a comment


Pull request overview

Fixes token budget resolution in TextLLMRunner::generate() for multi-turn conversations when seq_len is set, aligning behavior with multimodal_runner so occupied KV-cache positions are correctly accounted for.

Changes:

  • Stop pre-adjusting max_context_len by pos_; use the raw metadata value instead.
  • Resolve max_new_tokens using the full occupied position count (pos_ after prefill), and tighten the max-context prefill guard accordingly.
  • Add a regression test covering the multi-turn + seq_len case.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File: extension/llm/runner/text_llm_runner.cpp
  Corrects max token resolution by using the raw max_context_len and passing occupied positions (pos_) into resolve_max_new_tokens().

File: extension/llm/runner/test/test_text_llm_runner.cpp
  Adds regression coverage to ensure seq_len limits respect prior-turn pos_ occupancy (sketched below).
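
As a rough illustration of the property that regression coverage pins down, here is a sketch, not the PR's actual test (which drives the full runner); the header path, the exact expected value, and the min-based resolution semantics are assumptions.

```cpp
#include <cstdint>

#include <gtest/gtest.h>

// Assumed header path for GenerationConfig; the PR names the type and
// resolve_max_new_tokens() but not where they are declared.
#include <executorch/extension/llm/runner/irunner.h>

using executorch::extension::llm::GenerationConfig;

TEST(TextLLMRunnerTest, SeqLenRespectsPriorTurnOccupancy) {
  GenerationConfig config;
  config.seq_len = 128; // total positions the caller allows

  const int32_t max_context_len = 4096; // raw metadata value
  const int32_t pos = 120; // 100 prior-turn positions + 20 prompt tokens

  // With the fix, the full occupied count is passed, so the resolved
  // budget keeps total positions within seq_len: 128 - 120 = 8
  // (assuming min(seq_len, max_context_len) - occupied semantics).
  EXPECT_EQ(config.resolve_max_new_tokens(max_context_len, pos), 8);
}
```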


Comment on lines +171 to 175 (extension/llm/runner/text_llm_runner.cpp):

   // Resolve max_new_tokens. pos_ now reflects all occupied positions
   // (including prompt tokens just prefilled).
   int max_new_tokens =
 -     config.resolve_max_new_tokens(max_context_len, num_prompt_tokens);
 +     config.resolve_max_new_tokens(max_context_len, pos_);
Copilot AI commented Apr 6, 2026

GenerationConfig::resolve_max_new_tokens() takes int32_t parameters documented as num_prompt_tokens, but this call passes pos_ (an int64_t occupied-position count). This relies on implicit narrowing conversions and on a broader interpretation of the parameter than the API/docstring (and the pybinding arg name) suggests. Consider updating resolve_max_new_tokens to accept an int64_t occupied token count (or adding a new helper with clearer naming) and adjusting the documentation/bindings to avoid truncation risk and confusion.
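
A sketch of the direction this suggests, assuming the min-based resolution semantics described in the PR summary; the struct name, the -1 "not set" sentinels, and the clamping policy are hypothetical, not code from the PR or the library:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical config with an overload that accepts the occupied count
// as int64_t and narrows explicitly, rather than relying on implicit
// narrowing at the call site.
struct GenerationConfigSketch {
  int32_t seq_len = -1;        // -1 means "not set" (assumed convention)
  int32_t max_new_tokens = -1; // -1 means "not set" (assumed convention)

  int32_t resolve_max_new_tokens(
      int32_t max_context_len, int64_t num_occupied_positions) const {
    // Clamp before narrowing so a large 64-bit occupied count cannot
    // wrap into a bogus 32-bit value.
    const int32_t occupied = static_cast<int32_t>(std::min<int64_t>(
        std::max<int64_t>(num_occupied_positions, 0), max_context_len));
    int32_t budget = max_context_len - occupied;
    if (seq_len != -1) {
      budget = std::min(budget, seq_len - occupied);
    }
    if (max_new_tokens != -1) {
      budget = std::min(budget, max_new_tokens);
    }
    return std::max(budget, 0);
  }
};
```

With this shape, a call site like config.resolve_max_new_tokens(max_context_len, pos_) stays well-defined even though pos_ is 64-bit.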

meta-codesync Bot changed the title from "Fix double-subtraction of pos_ in TextLLMRunner::generate()" to "Fix double-subtraction of pos_ in TextLLMRunner::generate() (#18727)" Apr 7, 2026
meta-codesync Bot pushed a commit that referenced this pull request Apr 7, 2026
meta-codesync Bot force-pushed the export-D99742232 branch from 6b8cca8 to 6deda58 on April 7, 2026 06:15
meta-codesync Bot pushed a commit that referenced this pull request Apr 7, 2026
Copilot AI review requested due to automatic review settings April 7, 2026 18:15
meta-codesync Bot force-pushed the export-D99742232 branch from 6deda58 to 009b11d on April 7, 2026 18:15
kirklandsign review requested due to automatic review settings April 7, 2026 18:15
meta-codesync Bot pushed a commit that referenced this pull request Apr 7, 2026
meta-codesync Bot force-pushed the export-D99742232 branch from 009b11d to 784d607 on April 7, 2026 18:16
@larryliu0820 (Contributor) left a comment


Review automatically exported from Phabricator review in Meta.

meta-codesync Bot merged commit 5ba654f into main Apr 7, 2026
161 of 170 checks passed
meta-codesync Bot deleted the export-D99742232 branch on April 7, 2026 20:15
jpiat pushed a commit to jpiat/executorch that referenced this pull request Apr 14, 2026

Labels

CLA Signed, fb-exported, meta-exported


3 participants