[V1] Decouple GPU and TPU InputBatch
#19778
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Summary of Changes
Hello @afeldman-nm, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a foundational architectural change by decoupling the `InputBatch` data structure implementations for TPU and GPU. This separation is a prerequisite for integrating a new logitsprocs programming model, which necessitates `InputBatch` modifications that are incompatible with TPU. The changes involve creating a dedicated TPU-specific `InputBatch` class and updating relevant modules to conditionally use the appropriate batch implementation based on the hardware platform.
Highlights
- **Dedicated TPU InputBatch:** A new file, `vllm/v1/worker/tpu_input_batch.py`, has been introduced. This file contains a distinct `InputBatch` class implementation specifically tailored for operation on Tensor Processing Units (TPUs).
- **GPU-Specific InputBatch Clarification:** The existing `vllm/v1/worker/gpu_input_batch.py` has been updated with a clarifying comment, explicitly stating that it defines the `InputBatch` for GPU, thereby formalizing the separation of concerns between the GPU and TPU batch implementations.
- **Dynamic InputBatch Selection:** The `vllm/v1/worker/lora_model_runner_mixin.py` module now includes conditional import logic. It dynamically imports either the `TPUInputBatch` or `GPUInputBatch` based on the detected `current_platform.device_name`, ensuring that the correct batch structure is used at runtime depending on the hardware platform (see the sketch after this list).
- **TPU Component Alignment:** `vllm/v1/sample/tpu/metadata.py` and `vllm/v1/worker/tpu_model_runner.py` have been updated to exclusively import and use the newly defined `tpu_input_batch.InputBatch`. This change aligns TPU-specific components with their dedicated batch implementation, reinforcing the decoupling.
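A minimal sketch of that platform-conditional selection (illustrative only; the exact guard and aliases used in `lora_model_runner_mixin.py` may differ):

```python
# Illustrative sketch of the conditional import described above; the exact
# guard in vllm/v1/worker/lora_model_runner_mixin.py may differ.
from vllm.platforms import current_platform

if current_platform.device_name == "tpu":
    # TPU-specific batch implementation
    from vllm.v1.worker.tpu_input_batch import InputBatch
else:
    # Default GPU batch implementation
    from vllm.v1.worker.gpu_input_batch import InputBatch
```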
Code Review
This pull request introduces a dedicated `InputBatch` implementation for TPU, decoupling it from the GPU version. It involves creating a new `tpu_input_batch.py` file and updating imports and logic in related files (`metadata.py`, `lora_model_runner_mixin.py`, `tpu_model_runner.py`). The changes are a necessary step towards a TPU-specific programming model. The review identified several high-severity issues related to potential correctness problems in sampling metadata handling, generator state management, and logprobs calculation, as well as a duplicate code block. A medium-severity issue regarding an undocumented assumption in the `condense` method was also found.
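For context on that last point: "condense" here refers to compacting the batch after requests are removed. A rough, self-contained sketch of the idea (not the actual vLLM code; the real `InputBatch.condense` updates many more per-request fields):

```python
def condense(req_ids: list, token_ids: list, empty_req_indices: list[int]) -> int:
    """Illustrative sketch of batch condensing, not the actual vLLM code.
    Fills holes left by removed requests by moving requests down from the
    tail of the batch, keeping it densely packed. Returns the live count.

    Assumes `empty_req_indices` is sorted in descending order -- the kind
    of implicit precondition the review flags as undocumented.
    """
    num_reqs = len(req_ids) - len(empty_req_indices)  # live requests
    last_req_index = len(req_ids) - 1
    while empty_req_indices:
        # Skip holes sitting at the tail of the batch.
        while last_req_index in empty_req_indices:
            last_req_index -= 1
        empty_index = empty_req_indices.pop()  # smallest remaining hole
        if empty_index >= last_req_index:
            break  # all remaining holes are at or past the tail
        # Move the last live request into the hole.
        req_ids[empty_index] = req_ids[last_req_index]
        token_ids[empty_index] = token_ids[last_req_index]
        last_req_index -= 1
    return num_reqs
```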
Thanks @afeldman-nm, LGTM.

> This PR adds a new `tpu_input_batch.py`, which is identical to `gpu_input_batch.py` but with all but TPU-oriented logic stripped out.

presumably you meant something like "but with non-TPU-supported logic stripped out"?
Thanks @afeldman-nm.
As discussed, we can look at how to reconcile this again once the LogitsProcessor rework is done (and hopefully support LPs on TPU somehow).
Certainly. To be clear, vLLM's TPU logic currently supports certain hard-coded logits processors; the gap that needs to be filled is specifically to (1) support the new logitsprocs programming model introduced in #16728, and (2) support logitsprocs extensibility. Basically, to reach feature parity with GPU.
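To make that gap concrete, here is a rough sketch of what a pluggable logits processor can look like under such a programming model (hypothetical names and signatures; see #16728 for the actual interface):

```python
from abc import ABC, abstractmethod

import torch


class LogitsProcessor(ABC):
    """Hypothetical interface sketch; the actual programming model is
    defined in #16728, not here."""

    @abstractmethod
    def update_state(self, batch_update) -> None:
        """React to requests being added, removed, or moved in the batch."""

    @abstractmethod
    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        """Transform the (num_reqs, vocab_size) logits tensor."""


class MinTokensLogitsProcessor(LogitsProcessor):
    """Illustrative example: block EOS until a request has produced a
    minimum number of tokens (one of the classic hard-coded processors)."""

    def __init__(self, eos_token_id: int):
        self.eos_token_id = eos_token_id
        self.rows_to_mask: list[int] = []  # batch rows still under min_tokens

    def update_state(self, batch_update) -> None:
        # Recompute rows_to_mask from the batch update (details elided).
        ...

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        if self.rows_to_mask:
            logits[self.rows_to_mask, self.eos_token_id] = float("-inf")
        return logits
```

Extensibility then amounts to letting users register such processors per request, rather than baking a fixed set into the model runner.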
LGTM, thanks for separating TPU and GPU `InputBatch`!
Thanks for the reviews and merge @njhill @yaochengji @aarnphm
Purpose
Decouple the TPU and GPU `InputBatch` implementations. This PR adds a new `tpu_input_batch.py`, which mirrors `gpu_input_batch.py` but with all non-TPU-supported logic stripped out. Rationale: (1) TPU code lags GPU code in terms of feature support, so giving GPU a separate input batch from TPU lets it adopt more advanced features; (2) this PR is a prerequisite for the new logitsprocs programming model introduced in #16728, which makes `InputBatch` modifications that are incompatible with TPU.

Test Plan
Covered by existing GPU and TPU unit tests.
Test Result
N/A
(Optional) Documentation Update
N/A