Enable infinite generation with RoPE position remapping for attention sink (#19011)
meta-codesync[bot] merged 1 commit into main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19011
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 3 Unrelated Failures as of commit 9ae6844 with merge base 1d37abd.
NEW FAILURE - The following job has failed:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@kirklandsign has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100728748.
This PR needs a
7cce9a4 to 2a34458
2a34458 to a6472e5
Pull request overview
This PR enables “infinite” token generation for LLaMA attention-sink models by remapping RoPE positions into a bounded range aligned with the KV-cache ring buffer, preventing out-of-bounds indexing when decoding past max_context_len.
Changes:
- Add RoPE position remapping logic in RopeWithAttentionSink.get_freqs: sink positions preserved; window positions wrapped into [sink_size, sink_size + 2*window_size), as sketched below.
- Add an end-to-end test that generates beyond max_context_len and validates outputs remain finite.
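A minimal standalone sketch of this mapping (a hypothetical helper, using the sink_size=4 / window_size=16 values from the test below, not the module's actual code):

```python
# Hypothetical illustration of the remapping rule described above.
sink_size, window_size = 4, 16
ring_size = 2 * window_size  # 32

def remap(pos: int) -> int:
    # Sink positions are preserved; window positions wrap into
    # [sink_size, sink_size + ring_size).
    if pos < sink_size:
        return pos
    return sink_size + (pos - sink_size) % ring_size

# Raw positions grow without bound; remapped positions stay in [0, 36).
print([remap(p) for p in (0, 3, 4, 35, 36, 100)])  # [0, 3, 4, 35, 4, 4]
```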
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| examples/models/llama/source_transformation/attention_sink.py | Implements RoPE position remapping for attention-sink + ring-buffer KV cache to avoid OOB past max_context_len. |
| examples/models/llama/source_transformation/test_attention_sink.py | Adds E2E regression coverage for generating beyond max_context_len. |
```python
assert input_pos is not None
# Use torch._check for export compatibility (data-dependent guard)
torch._check(input_pos[0].item() + seq_len <= self.max_context_length)
return super().get_freqs(input_pos, seq_len)
if not self.params.use_kv_cache:
    return self.freqs_cos[:seq_len], self.freqs_sin[:seq_len]
```
```python
self.sink_size = sink_size
# max_context_len from params is used for RoPE frequencies (should be large)
self.max_context_length = self.params.max_context_len
self.ring_size = window_size * 2
```
```python
def test_beyond_max_context_len(self):
    """Generate tokens beyond max_context_len with RoPE position remapping."""
    sink_size = 4
    window_size = 16
    # KV cache size = 36, max_context_len = 64
    # Generate 100 tokens — well beyond max_context_len
    args = self._make_args(max_context_len=64)
    model = self._build_model(args, sink_size, window_size, use_custom_sdpa=False)

    outputs = self._run_generation(model, args, num_tokens=100)

    self.assertEqual(len(outputs), 97)  # 1 prefill + 96 decode steps
    for out in outputs:
        self.assertTrue(
            torch.isfinite(out).all(),
            "Output contains non-finite values beyond max_context_len",
        )
```
```python
# Dynamic shape: input_pos is [start_pos], remap and narrow
input_pos_item = input_pos[-1].item()
if input_pos_item < self.sink_size:
    remapped_item = input_pos_item
else:
    remapped_item = (
        self.sink_size
        + (input_pos_item - self.sink_size) % self.ring_size
    )
```
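For reference, here is a hedged sketch of how such a remapped scalar might be used to slice a bounded frequency table (the tensor name and head_dim are placeholders, not the module's real fields). With max_context_len = 64 as in the test above, the raw decode position 100 would index out of bounds, while the remapped one stays in range:

```python
import torch

# Placeholder table standing in for the precomputed RoPE frequencies.
max_context_len, head_dim = 64, 8  # head_dim is a made-up example value
freqs_cos = torch.randn(max_context_len, head_dim // 2)

sink_size, window_size = 4, 16
ring_size = 2 * window_size

raw_pos, seq_len = 100, 1  # single-token decode step far past max_context_len
# freqs_cos.narrow(0, raw_pos, seq_len) would fail: start 100 >= table size 64.
remapped = sink_size + (raw_pos - sink_size) % ring_size  # 4
freqs = freqs_cos.narrow(0, remapped, seq_len)  # in bounds
print(freqs.shape)  # torch.Size([1, 4])
```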
a6472e5 to a451868
a451868 to db1328f
db1328f to cdf3644
cdf3644 to 5faf3a6
5faf3a6 to bfff183
bfff183 to 311be20
311be20 to 9ae6844
Summary:
Previously, attention sink models could not generate beyond max_context_len
because RoPE used the raw monotonic input_pos to index into the pre-computed
freqs_cis table, causing OOB when pos >= max_context_len.
This change adds position remapping in RopeWithAttentionSink:
- Sink token positions (< sink_size) are preserved as-is
- Window token positions are wrapped into the ring buffer range [sink_size, sink_size + ring_size) using modular arithmetic
The 2x ring buffer (ring_size = 2 * window_size) ensures the live window
of tokens never spans a wrap boundary, preserving correct relative
distances in RoPE space.
This enables attention sink models to generate indefinitely — the KV cache
ring buffer recycles space while RoPE positions stay bounded.
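The wrap-boundary claim can be spot-checked with a small script. This is only a sketch under one stated assumption: the live window advances in window_size-aligned chunks, as the eviction scheme described above implies. Within any such chunk, remapped positions stay consecutive, so relative distances in RoPE space are unchanged:

```python
sink_size, window_size = 4, 16
ring_size = 2 * window_size  # the 2x ring buffer from the summary above

def remap(pos: int) -> int:
    return pos if pos < sink_size else sink_size + (pos - sink_size) % ring_size

# Assumption: the live window advances in window_size-aligned chunks.
# Check that no chunk straddles a wrap boundary in remapped space.
for k in range(10):
    start = sink_size + k * window_size
    remapped = [remap(p) for p in range(start, start + window_size)]
    deltas = [b - a for a, b in zip(remapped, remapped[1:])]
    assert all(d == 1 for d in deltas), f"wrap inside chunk {k}"
print("no window-aligned chunk crosses a wrap boundary")
```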
Reviewed By: lucylq
Differential Revision: D100728748