
Add page-aligned prefill scheduling. #19878


Open

py4 wants to merge 1 commit into main from the 0618_prototype branch

Conversation


@py4 py4 commented Jun 19, 2025

Purpose

Page-aligned prefill scheduling

We are a team from Google extending vLLM with our JAX/Pallas kernels. Our benchmarks show that for chunked prefill, writing computed tokens during prefill by page rather than by token gives around a 5% end-to-end speedup per step. Intuitively, when updating the cache, one-index-per-token produces far more index entries than one-index-per-page. For example, with a cache layout of NUM_PAGES x PAGE_SIZE, writing a number of tokens that is not a multiple of PAGE_SIZE (say PAGE_SIZE + 1) requires both a page index and an index within the page for each token, whereas writing a multiple of PAGE_SIZE only needs page indices.
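
A minimal standalone sketch (not vLLM or Pallas code; PAGE_SIZE and the helper name below are made up for this example) of how many index entries each approach needs:

# Illustrative only: the cache index entries needed to write `num_tokens`
# starting at `start_token` into a cache laid out as NUM_PAGES x PAGE_SIZE.
PAGE_SIZE = 16  # made-up page size for the example

def write_indices(start_token: int, num_tokens: int):
    end_token = start_token + num_tokens
    if start_token % PAGE_SIZE == 0 and num_tokens % PAGE_SIZE == 0:
        # Page-aligned write: one index entry per page.
        return list(range(start_token // PAGE_SIZE, end_token // PAGE_SIZE))
    # Unaligned write: one (page, offset-within-page) entry per token.
    return [(t // PAGE_SIZE, t % PAGE_SIZE) for t in range(start_token, end_token)]

print(len(write_indices(0, PAGE_SIZE + 1)))   # 17 per-token entries
print(len(write_indices(0, 2 * PAGE_SIZE)))   # 2 page indices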

This motivated us to add a flag to vLLM that makes the scheduler schedule prefill tokens in a page-aligned fashion. We are opening this early PR to get feedback on the change and make sure we can land it properly in vLLM.
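
For illustration, a hypothetical way to enable the feature programmatically, assuming the EngineArgs field mirrors the new CLI flag (the model name and other arguments here are placeholders):

from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # the optimization targets chunked prefill
    page_aligned_scheduling=True,              # new flag proposed in this PR
)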

Test Plan

Will add.

Test Result

Will add.

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @py4, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an optional page-aligned prefill scheduling mechanism to vLLM. The primary goal is to enhance end-to-end performance, particularly for chunked prefill operations, by optimizing how KV cache updates are handled. By aligning token writes to page boundaries, the system aims to reduce the overhead associated with cache indexing.

Highlights

  • Feature Introduction: Introduced a new page_aligned_scheduling boolean flag within SchedulerConfig to enable or disable a new prefill scheduling strategy.
  • Configuration and CLI Exposure: The page_aligned_scheduling option is now accessible via EngineArgs and can be configured through a new command-line argument, --page-aligned-scheduling.
  • Core Scheduler Logic Update: Modified the scheduler's logic to ensure that prefill segments for both running and pending requests are scheduled in page-aligned chunks when the new flag is enabled. This involves adjusting the number of tokens to be processed based on the KV cache's block size (page size).
  • Token Budget Management: Updated the token budget consumption mechanism to accurately reflect the page-aligned allocation. When page-aligned scheduling is active, the budget is reduced by the full page-aligned amount, even if the actual number of tokens in the segment is less than a full page.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 label Jun 19, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces page-aligned scheduling for prefill operations, aiming to improve performance by aligning token scheduling with memory page sizes. The changes involve adding a new configuration flag and modifying the scheduler logic in vllm/v1/core/sched/scheduler.py.

Key feedback points:

  • A debug print statement should be removed.
  • There's an opportunity to refactor duplicated page-alignment logic to enhance maintainability.

Crucially, the PR description mentions that test plans and results will be added. Comprehensive testing will be essential to validate this significant change to the scheduling mechanism, covering various scenarios and edge cases related to page alignment and budget calculations. Performance benchmarks demonstrating the claimed 5% speed-up would also be valuable.

Comment on lines +220 to +230
# Make sure prefills are page-aligned.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    assert request.num_computed_tokens + num_new_tokens <= request.num_tokens_with_spec
    # If this is not the last segment of the prefill, schedule fewer tokens
    # but keep them page-aligned. This is guaranteed not to exceed the budget.
    if request.num_computed_tokens + num_new_tokens < request.num_tokens_with_spec:
        num_new_tokens = (num_new_tokens // PAGE_SIZE) * PAGE_SIZE

    # We deduct the page-aligned amount from the budget, hence the `math.ceil` here.
    if math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE > token_budget:
        req_index += 1
        break

Contributor


medium

The logic for page-aligned scheduling, specifically adjusting num_new_tokens and checking against token_budget (lines 220-231 for RUNNING requests), is substantially similar to the logic for WAITING requests (lines 435-448).

To improve maintainability and reduce redundancy, consider refactoring this common logic into a private helper method. This method could take parameters like the request, current num_new_tokens, token_budget, PAGE_SIZE, and a way to determine if it's the last segment of the prefill. It could return the adjusted num_new_tokens and a boolean indicating if scheduling is possible within the budget.

For example:

def _align_tokens_for_prefill(self, request_total_tokens: int,
                              current_computed_tokens: int,
                              num_new_tokens: int, token_budget: int,
                              page_size: int) -> tuple[int, bool]:
    # ... (combined logic here)
    # Returns (adjusted_num_new_tokens, can_schedule_within_budget)

This would make the main schedule method cleaner and easier to follow.
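
One possible body for that helper, sketched here as a standalone function with illustrative names (the real method would live on the scheduler); it simply combines the alignment and budget checks from the diff above:

import math

def _align_tokens_for_prefill(request_total_tokens: int,
                              current_computed_tokens: int,
                              num_new_tokens: int,
                              token_budget: int,
                              page_size: int,
                              page_aligned: bool) -> tuple[int, bool]:
    # Decode steps (a single token) and non-page-aligned mode pass through.
    if num_new_tokens == 1 or not page_aligned:
        return num_new_tokens, num_new_tokens <= token_budget
    # Round non-final prefill chunks down to a whole number of pages.
    if current_computed_tokens + num_new_tokens < request_total_tokens:
        num_new_tokens = (num_new_tokens // page_size) * page_size
    # The budget is charged for the full page-aligned amount.
    budget_needed = math.ceil(num_new_tokens / page_size) * page_size
    return num_new_tokens, budget_needed <= token_budget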

Comment on lines +303 to +306
# For prefills, deduct the page-aligned amount from the budget.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    token_budget -= math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE
else:
    token_budget -= num_new_tokens
Contributor


medium

The logic for decrementing token_budget based on page-aligned scheduling (lines 303-307) is duplicated for WAITING requests (lines 513-517).

If the page-alignment logic (mentioned in a separate comment for lines 220-231 and 435-448) is refactored into a helper method, that helper could also return the amount of budget to consume. This would centralize the budget calculation and further reduce duplication.

@py4 py4 force-pushed the 0618_prototype branch from 7315026 to 5dc1e27 Compare June 19, 2025 19:08
@py4 py4 changed the title Add page-aligned prefill. Add page-aligned prefill scheduling. Jun 19, 2025
@robertgshaw2-redhat
Collaborator

This generally looks like a reasonable change to me. I think we should consider profiling whether this is something we can just turn on by default.

One note: one tricky thing about the scheduler is that we keep adding if ... else logic, which makes the code hard to read.

This PR effectively introduces another scheduling constraint related to the "request state" rather than to the "batch state". We currently have 2 other "request state" constraints (long prefill threshold, max_model_len). Perhaps we could have a single utility function that enforces the three "request state" constraints, which could be used in both the RUNNING and WAITING loops.
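
As a rough sketch of that idea (names are illustrative, not existing vLLM APIs), such a utility might clamp the chunk once per request before the budget is consumed:

def clamp_new_tokens_for_request(num_new_tokens: int,
                                 num_computed_tokens: int,
                                 num_total_tokens: int,
                                 max_model_len: int,
                                 long_prefill_token_threshold: int,
                                 page_size: int,
                                 page_aligned: bool) -> int:
    # 1. Never schedule past the model's context length.
    num_new_tokens = min(num_new_tokens, max_model_len - num_computed_tokens)
    # 2. Cap long prefills so a single request cannot monopolize a step.
    if long_prefill_token_threshold > 0:
        num_new_tokens = min(num_new_tokens, long_prefill_token_threshold)
    # 3. Page alignment (this PR): round non-final prefill chunks down
    #    to a whole number of pages.
    if (page_aligned and num_new_tokens > 1
            and num_computed_tokens + num_new_tokens < num_total_tokens):
        num_new_tokens = (num_new_tokens // page_size) * page_size
    return num_new_tokens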
