
Add page-aligned prefill scheduling. #19878


Open

py4 wants to merge 1 commit into main from the 0618_prototype branch

Conversation


@py4 py4 commented Jun 19, 2025

Purpose

Page-aligned prefill scheduling

We are a team from Google extending vLLM with our JAX/Pallas kernels. Our benchmarks show that for chunked prefill, writing computed tokens during prefill by page rather than by token gives around a 5% end-to-end speedup per step. Intuitively, when updating the cache, one-index-per-token produces far more index entries than one-index-per-page. For example, with a cache layout of NUM_PAGES x PAGE_SIZE, writing a number of tokens that is not a multiple of PAGE_SIZE (say PAGE_SIZE + 1) requires both a page index and an index within the page for each token, whereas writing a multiple of PAGE_SIZE only needs page indices.
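
A minimal standalone sketch (not vLLM or Pallas code; PAGE_SIZE and the helper name below are made up for this example) of how many index entries each approach needs:

# Illustrative only: the cache index entries needed to write `num_tokens`
# starting at `start_token` into a cache laid out as NUM_PAGES x PAGE_SIZE.
PAGE_SIZE = 16  # made-up page size for the example

def write_indices(start_token: int, num_tokens: int):
    end_token = start_token + num_tokens
    if start_token % PAGE_SIZE == 0 and num_tokens % PAGE_SIZE == 0:
        # Page-aligned write: one index entry per page.
        return list(range(start_token // PAGE_SIZE, end_token // PAGE_SIZE))
    # Unaligned write: one (page, offset-within-page) entry per token.
    return [(t // PAGE_SIZE, t % PAGE_SIZE) for t in range(start_token, end_token)]

print(len(write_indices(0, PAGE_SIZE + 1)))   # 17 per-token entries
print(len(write_indices(0, 2 * PAGE_SIZE)))   # 2 page indices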

This motivated us to add a flag to vLLM that makes the scheduler schedule prefill tokens in a page-aligned fashion. We are opening this early PR to get feedback on the change and make sure we can land it properly in vLLM.
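
For illustration, a hypothetical way to enable the feature programmatically, assuming the EngineArgs field mirrors the new CLI flag (the model name and other arguments here are placeholders):

from vllm.engine.arg_utils import EngineArgs

engine_args = EngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_chunked_prefill=True,               # the optimization targets chunked prefill
    page_aligned_scheduling=True,              # new flag proposed in this PR
)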

Test Plan

Will add.

Test Result

Will add.

(Optional) Documentation Update


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @py4, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an optional page-aligned prefill scheduling mechanism to vLLM. The primary goal is to enhance end-to-end performance, particularly for chunked prefill operations, by optimizing how KV cache updates are handled. By aligning token writes to page boundaries, the system aims to reduce the overhead associated with cache indexing.

Highlights

  • Feature Introduction: Introduced a new page_aligned_scheduling boolean flag within SchedulerConfig to enable or disable a new prefill scheduling strategy.
  • Configuration and CLI Exposure: The page_aligned_scheduling option is now accessible via EngineArgs and can be configured through a new command-line argument, --page-aligned-scheduling.
  • Core Scheduler Logic Update: Modified the scheduler's logic to ensure that prefill segments for both running and pending requests are scheduled in page-aligned chunks when the new flag is enabled. This involves adjusting the number of tokens to be processed based on the KV cache's block size (page size).
  • Token Budget Management: Updated the token budget consumption mechanism to accurately reflect the page-aligned allocation. When page-aligned scheduling is active, the budget is reduced by the full page-aligned amount, even if the actual number of tokens in the segment is less than a full page.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@mergify mergify bot added the v1 label Jun 19, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces page-aligned scheduling for prefill operations, aiming to improve performance by aligning token scheduling with memory page sizes. The changes involve adding a new configuration flag and modifying the scheduler logic in vllm/v1/core/sched/scheduler.py.

Key feedback points:

  • A debug print statement should be removed.
  • There's an opportunity to refactor duplicated page-alignment logic to enhance maintainability.

Crucially, the PR description mentions that test plans and results will be added. Comprehensive testing will be essential to validate this significant change to the scheduling mechanism, covering various scenarios and edge cases related to page alignment and budget calculations. Performance benchmarks demonstrating the claimed 5% speed-up would also be valuable.

Comment on lines +220 to +230
# Make sure prefills are page-aligned.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    assert request.num_computed_tokens + num_new_tokens <= request.num_tokens_with_spec
    # If this is not the last segment of the prefill, schedule fewer tokens
    # but keep them page-aligned. This is guaranteed not to exceed the budget.
    if request.num_computed_tokens + num_new_tokens < request.num_tokens_with_spec:
        num_new_tokens = (num_new_tokens // PAGE_SIZE) * PAGE_SIZE

    # We deduct the page-aligned amount from the budget, hence the `math.ceil` here.
    if math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE > token_budget:
        req_index += 1
        break

Contributor


medium

The logic for page-aligned scheduling, specifically adjusting num_new_tokens and checking against token_budget (lines 220-231 for RUNNING requests), is substantially similar to the logic for WAITING requests (lines 435-448).

To improve maintainability and reduce redundancy, consider refactoring this common logic into a private helper method. This method could take parameters like the request, current num_new_tokens, token_budget, PAGE_SIZE, and a way to determine if it's the last segment of the prefill. It could return the adjusted num_new_tokens and a boolean indicating if scheduling is possible within the budget.

For example:

def _align_tokens_for_prefill(self, request_total_tokens: int,
                              current_computed_tokens: int,
                              num_new_tokens: int, token_budget: int,
                              page_size: int) -> tuple[int, bool]:
    # ... (combined logic here)
    # Returns (adjusted_num_new_tokens, can_schedule_within_budget)

This would make the main schedule method cleaner and easier to follow.
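
One possible body for that helper, sketched here as a standalone function with illustrative names (the real method would live on the scheduler); it simply combines the alignment and budget checks from the diff above:

import math

def _align_tokens_for_prefill(request_total_tokens: int,
                              current_computed_tokens: int,
                              num_new_tokens: int,
                              token_budget: int,
                              page_size: int,
                              page_aligned: bool) -> tuple[int, bool]:
    # Decode steps (a single token) and non-page-aligned mode pass through.
    if num_new_tokens == 1 or not page_aligned:
        return num_new_tokens, num_new_tokens <= token_budget
    # Round non-final prefill chunks down to a whole number of pages.
    if current_computed_tokens + num_new_tokens < request_total_tokens:
        num_new_tokens = (num_new_tokens // page_size) * page_size
    # The budget is charged for the full page-aligned amount.
    budget_needed = math.ceil(num_new_tokens / page_size) * page_size
    return num_new_tokens, budget_needed <= token_budget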

Comment on lines +303 to +306
# For prefills, deduct the page-aligned amount from the budget.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    token_budget -= math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE
else:
    token_budget -= num_new_tokens
Contributor


medium

The logic for decrementing token_budget based on page-aligned scheduling (lines 303-307) is duplicated for WAITING requests (lines 513-517).

If the page-alignment logic (mentioned in a separate comment for lines 220-231 and 435-448) is refactored into a helper method, that helper could also return the amount of budget to consume. This would centralize the budget calculation and further reduce duplication.

@py4 py4 force-pushed the 0618_prototype branch from 7315026 to 5dc1e27 Compare June 19, 2025 19:08
@py4 py4 changed the title Add page-aligned prefill. Add page-aligned prefill scheduling. Jun 19, 2025
@robertgshaw2-redhat
Collaborator

This generally looks like a reasonable change to me. I think we should consider profiling whether this is something we can just turn on by default.

One note: one tricky thing about the scheduler is that we keep adding if ... else logic, which makes the code hard to read.

This PR effectively introduces another scheduling constraint related to the "request state" rather than to the "batch state". We currently have 2 other "request state" constraints (long prefill threshold, max_model_len). Perhaps we could have a single utility function that enforces the three "request state" constraints, which could be used in both the RUNNING and WAITING loops.
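
As a rough sketch of that idea (names are illustrative, not existing vLLM APIs), such a utility might clamp the chunk once per request before the budget is consumed:

def clamp_new_tokens_for_request(num_new_tokens: int,
                                 num_computed_tokens: int,
                                 num_total_tokens: int,
                                 max_model_len: int,
                                 long_prefill_token_threshold: int,
                                 page_size: int,
                                 page_aligned: bool) -> int:
    # 1. Never schedule past the model's context length.
    num_new_tokens = min(num_new_tokens, max_model_len - num_computed_tokens)
    # 2. Cap long prefills so a single request cannot monopolize a step.
    if long_prefill_token_threshold > 0:
        num_new_tokens = min(num_new_tokens, long_prefill_token_threshold)
    # 3. Page alignment (this PR): round non-final prefill chunks down
    #    to a whole number of pages.
    if (page_aligned and num_new_tokens > 1
            and num_computed_tokens + num_new_tokens < num_total_tokens):
        num_new_tokens = (num_new_tokens // page_size) * page_size
    return num_new_tokens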
