Add page-aligned prefill scheduling. #19878
base: main
Conversation
Summary of Changes
Hello @py4, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an optional page-aligned prefill scheduling mechanism to vLLM. The primary goal is to enhance end-to-end performance, particularly for chunked prefill operations, by optimizing how KV cache updates are handled. By aligning token writes to page boundaries, the system aims to reduce the overhead associated with cache indexing.
Highlights
- Feature Introduction: Introduced a new `page_aligned_scheduling` boolean flag within `SchedulerConfig` to enable or disable a new prefill scheduling strategy.
- Configuration and CLI Exposure: The `page_aligned_scheduling` option is now accessible via `EngineArgs` and can be configured through a new command-line argument, `--page-aligned-scheduling` (see the sketch after this list).
- Core Scheduler Logic Update: Modified the scheduler's logic to ensure that prefill segments for both running and pending requests are scheduled in page-aligned chunks when the new flag is enabled. This involves adjusting the number of tokens to be processed based on the KV cache's block size (page size).
- Token Budget Management: Updated the token budget consumption mechanism to accurately reflect the page-aligned allocation. When page-aligned scheduling is active, the budget is reduced by the full page-aligned amount, even if the actual number of tokens in the segment is less than a full page.
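For illustration, enabling the option might look like the following. This is a minimal sketch: apart from the new `page_aligned_scheduling` flag named above, the model name and other arguments are placeholders.

```python
from vllm.engine.arg_utils import EngineArgs

# Sketch: enable the new flag programmatically. Equivalent CLI usage:
#   vllm serve <model> --page-aligned-scheduling
engine_args = EngineArgs(
    model="facebook/opt-125m",     # placeholder model
    enable_chunked_prefill=True,   # the feature targets chunked prefill
    page_aligned_scheduling=True,  # new flag introduced by this PR
)
```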
Code Review
This pull request introduces page-aligned scheduling for prefill operations, aiming to improve performance by aligning token scheduling with memory page sizes. The changes involve adding a new configuration flag and modifying the scheduler logic in `vllm/v1/core/sched/scheduler.py`.
Key feedback points:
- A debug print statement should be removed.
- There's an opportunity to refactor duplicated page-alignment logic to enhance maintainability.
Crucially, the PR description mentions that test plans and results will be added. Comprehensive testing will be essential to validate this significant change to the scheduling mechanism, covering various scenarios and edge cases related to page alignment and budget calculations. Performance benchmarks demonstrating the claimed 5% speed-up would also be valuable.
```python
# Make sure prefills are page-aligned.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    assert request.num_computed_tokens + num_new_tokens <= request.num_tokens_with_spec
    # If it's not the last segment of the prefill, schedule fewer tokens
    # but keep them page-aligned. It's guaranteed not to surpass the budget.
    if request.num_computed_tokens + num_new_tokens < request.num_tokens_with_spec:
        num_new_tokens = (num_new_tokens // PAGE_SIZE) * PAGE_SIZE

    # We want to deduct the page-aligned amount from the budget, hence
    # the `math.ceil` here.
    if math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE > token_budget:
        req_index += 1
        break
```
The logic for page-aligned scheduling, specifically adjusting `num_new_tokens` and checking against `token_budget` (lines 220-231 for RUNNING requests), is substantially similar to the logic for WAITING requests (lines 435-448).
To improve maintainability and reduce redundancy, consider refactoring this common logic into a private helper method. This method could take parameters like the request, the current `num_new_tokens`, `token_budget`, `PAGE_SIZE`, and a way to determine whether it's the last segment of the prefill. It could return the adjusted `num_new_tokens` and a boolean indicating whether scheduling is possible within the budget.
For example:

```python
def _align_tokens_for_prefill(self, request_total_tokens: int,
                              current_computed_tokens: int,
                              num_new_tokens: int, token_budget: int,
                              page_size: int) -> tuple[int, bool]:
    # ... (combined logic here)
    # Returns (adjusted_num_new_tokens, can_schedule_within_budget)
```
This would make the main `schedule` method cleaner and easier to follow.
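A minimal sketch of what such a helper's body might look like, mirroring the alignment and budget logic from the diff above (the name and signature are the reviewer's proposal, not code from this PR):

```python
import math

def _align_tokens_for_prefill(
    self,
    request_total_tokens: int,
    current_computed_tokens: int,
    num_new_tokens: int,
    token_budget: int,
    page_size: int,
) -> tuple[int, bool]:
    """Round a prefill chunk to page boundaries and check the budget.

    Returns (adjusted_num_new_tokens, can_schedule_within_budget).
    """
    # Decode steps (a single token) are never realigned.
    if num_new_tokens != 1:
        # If this is not the last segment of the prefill, round the chunk
        # down to a page multiple so every write fills whole pages.
        if current_computed_tokens + num_new_tokens < request_total_tokens:
            num_new_tokens = (num_new_tokens // page_size) * page_size
        # The budget is charged the page-aligned amount, so check the
        # rounded-up size against what is left.
        if math.ceil(num_new_tokens / page_size) * page_size > token_budget:
            return num_new_tokens, False
    return num_new_tokens, True
```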
```python
# For prefills, deduct the page-aligned amount from the budget.
if num_new_tokens != 1 and self.scheduler_config.page_aligned_scheduling:
    token_budget -= math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE
else:
    token_budget -= num_new_tokens
```
The logic for decrementing `token_budget` based on page-aligned scheduling (lines 303-307) is duplicated for WAITING requests (lines 513-517).
If the page-alignment logic (mentioned in a separate comment for lines 220-231 and 435-448) is refactored into a helper method, that helper could also return the amount of budget to consume. This would centralize the budget calculation and further reduce duplication.
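To make the budget accounting concrete, here is a small worked example with hypothetical numbers (`PAGE_SIZE` and the token count are illustrative, not values from the PR):

```python
import math

PAGE_SIZE = 16        # hypothetical page size
num_new_tokens = 40   # hypothetical prefill segment, not page-aligned

# The segment occupies ceil(40 / 16) = 3 pages, so the budget is charged
# for 3 * 16 = 48 tokens even though only 40 are actually written.
charged = math.ceil(num_new_tokens / PAGE_SIZE) * PAGE_SIZE
assert charged == 48
```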
This generally looks like a reasonable change to me. I think we should consider profiling whether this is something we can just turn on by default. One note: one thing that is a bit tricky about the scheduler is that we keep adding if ... else logic, which makes the code hard to read. This PR effectively introduces another scheduling constraint related to the "request state" rather than to the "batch state". We currently have two other "request state" constraints (long prefill threshold, max_model_len). Perhaps we could have a single utility function that enforces the three "request state" constraints and can be used in the RUNNING and WAITING loops.
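As a rough illustration of that suggestion, the three "request state" constraints could be folded into one clamp-style helper shared by both loops. This is only a sketch: the method name is invented, and it assumes `long_prefill_token_threshold` and `max_model_len` are available on the scheduler config.

```python
def _clamp_new_tokens_for_request(
    self,
    num_new_tokens: int,
    num_computed_tokens: int,
    num_total_tokens: int,
) -> int:
    """Apply all per-request ("request state") limits in one place."""
    # 1. Long-prefill threshold: cap how much of a long prompt a single
    #    step may prefill.
    threshold = self.scheduler_config.long_prefill_token_threshold
    if threshold > 0:
        num_new_tokens = min(num_new_tokens, threshold)
    # 2. max_model_len: never schedule past the model's context window.
    num_new_tokens = min(
        num_new_tokens,
        self.scheduler_config.max_model_len - num_computed_tokens)
    # 3. Page alignment (this PR): round non-final prefill chunks down
    #    to a multiple of the KV-cache page size.
    if (num_new_tokens != 1
            and self.scheduler_config.page_aligned_scheduling
            and num_computed_tokens + num_new_tokens < num_total_tokens):
        num_new_tokens = (num_new_tokens // PAGE_SIZE) * PAGE_SIZE
    return num_new_tokens
```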
Purpose
Page-aligned prefill scheduling
We are a team from Google extending vLLM with our JAX/Pallas kernels. Our benchmarks show that for chunked prefill, writing computed tokens to the KV cache by page rather than by token gives around a 5% end-to-end speed-up per step. Intuitively, updating the cache with one index entry per token produces far more index entries than one entry per page. For example, if the cache layout is NUM_PAGES x PAGE_SIZE and you write a token count that is not a multiple of PAGE_SIZE, you need both a page index and an offset within the page; if you write an exact multiple of PAGE_SIZE, page indices alone suffice.
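As a rough illustration of the indexing argument (not the actual kernel code; `PAGE_SIZE` and the counting functions are hypothetical):

```python
PAGE_SIZE = 16  # hypothetical page size in a NUM_PAGES x PAGE_SIZE layout

def index_entries_per_token(num_tokens: int) -> int:
    # One (page, offset-within-page) entry per written token.
    return num_tokens

def index_entries_per_page(num_tokens: int) -> int:
    # One page index per full page; only valid for page-aligned writes.
    assert num_tokens % PAGE_SIZE == 0
    return num_tokens // PAGE_SIZE

# Writing 4 pages' worth of tokens:
print(index_entries_per_token(4 * PAGE_SIZE))  # 64 index entries
print(index_entries_per_page(4 * PAGE_SIZE))   # 4 index entries
```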
This motivated us to add a flag to vLLM that makes the scheduler schedule prefill tokens in a page-aligned fashion. This is an early PR because we wanted early feedback on the change and to make sure we can land it properly in vLLM.
Test Plan
Will add.
Test Result
Will add.
(Optional) Documentation Update