[Core] Performance: Use list[np.ndarray] instead of list[list[int]] for output tokens for GC optimization #26368
Conversation
CC @yeqcharlotte @houseroad @WoosukKwon @njhill for RFC on the proposal and high level code changes, but the PR might not be completely ready yet
💡 Codex Review
Here are some automated review suggestions for this pull request.
Code Review
This pull request introduces a clever optimization to reduce garbage collection overhead by changing the representation of single output tokens from a list to an integer. The use of Union[int, list[int]] is a good way to handle both single and multiple token outputs efficiently. The changes are consistently applied across the codebase. My only feedback is to add unit tests for the new utility functions to ensure the correctness and robustness of this performance-critical change.
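The `Union[int, list[int]]` representation the review refers to can be consumed with a simple isinstance dispatch. The sketch below is illustrative only (the helper name `append_output_tokens` is hypothetical, not from the PR): a single sampled token is stored as a bare `int` so no one-element list has to be allocated and tracked per request.

```python
from typing import Union


def append_output_tokens(dest: list[int], out: Union[int, list[int]]) -> None:
    # Hypothetical helper: a bare int means "one sampled token",
    # avoiding a one-element GC-tracked list per request.
    if isinstance(out, int):
        dest.append(out)
    else:
        dest.extend(out)


tokens: list[int] = []
append_output_tokens(tokens, 5)        # single-token fast path
append_output_tokens(tokens, [6, 7])   # multi-token path (e.g. spec decode)
print(tokens)  # [5, 6, 7]
```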
Resolves #26369
Gentle nudge @houseroad for the review :P
This pull request has merge conflicts that must be resolved before it can be merged.
thanks for the change. hhh it's hard to keep classes around to avoid GC. wondering what's the actual gain we see on bigger models? we can play with DeepSeek V3, Qwen-235B etc. and try it for both throughput- and latency-sensitive use cases. wondering if you folks have a strong opinion on this @heheda12345 @njhill
I am a bit hesitant on this PR. I feel it's over-optimization. To achieve low latency, we probably should invest in:
I am worried that optimizations like the one in this PR can pile up and make the code much harder for people to understand without actually improving the perf very significantly.
Thanks @zhuohan123 for your suggestions; I'm fully aligned on the tradeoff between code complexity and performance, but I'd love to better clarify the change in this PR.
Next step: @zhuohan123, given the new data and the clarification I provided, do you feel this is worth investigating further? If yes, I could explore the np.array suggestion more. Appreciate it.
@yeqcharlotte To be very honest, this change might only be helpful for small models and large-batch-size scenarios, so the effect could be small for large models like DeepSeek V3 and Qwen 235B :/
The "V1 Test others" job has been failing since this PR: https://buildkite.com/vllm/ci/builds/39126/steps/canvas?sid=019a84f2-ee8a-46bd-9845-3bc144e4cf4b
Sorry for the inconvenience. I'll take a look later in the day for a quick fix forward. If needed, please feel free to revert to unblock early.
The fix seems to be straightforward enough; opened #28771
…or output tokens for GC optimization (vllm-project#26368) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
…or output tokens for GC optimization (vllm-project#26368) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Signed-off-by: George D. Torres <gdavtor@gmail.com>
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Signed-off-by: George D. Torres <gdavtor@gmail.com>
@Jialin @DarkLight1337 @njhill A dumb question: Can we further optimize the data structure by using two numpy arrays,
@WoosukKwon Thanks for the suggestion! But I think there are some tradeoffs to your proposal, and it might not be the most critical GC bottleneck as of now. Purely from a GC perspective, since numpy arrays don't even go through GC, the original approach would only introduce one object (the list that contains sampled_tokens) per decode batch. However, I feel there are a few downsides:
Recently I found a pretty good way (with tracemalloc) to analyze which code and objects introduce the largest GC overhead. I'll update gc_utils.py accordingly to make it accessible to others. With that, we could evaluate GC bottlenecks easily and optimize them in order, which would let us reduce GC costs more objectively (given that every such optimization unfortunately has tradeoffs, we should only target the critical ones and have a way to justify the criticality of each issue). WDYT? :)
…ge (#575) sampled_token_ids was changed from list[list[int]] to list[np.ndarray]: vllm-project/vllm#26368 Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
This reverts commit 98b4d38. Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
…#29121) Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com> Signed-off-by: LuminolT <lumischen01@gmail.com>
Purpose
The default setup of GC is (700, 10, 10), which means a gen-0 collection is triggered once the net number of newly allocated container objects exceeds 700, a gen-1 collection after every 10 gen-0 collections, and a gen-2 collection after every 10 gen-1 collections.
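These knobs can be inspected and tuned at runtime (a quick illustrative snippet, not part of this PR; note that the exact defaults can vary across CPython versions):

```python
import gc

# Inspect the current generational thresholds.
t0, t1, t2 = gc.get_threshold()
print((t0, t1, t2))  # commonly (700, 10, 10) on CPython

# Raising the gen-0 threshold trades memory for fewer collections,
# which matters in serving loops that allocate many short-lived lists.
gc.set_threshold(50_000, t1, t2)
print(gc.get_threshold()[0])  # 50000

gc.set_threshold(t0, t1, t2)  # restore the original settings
```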
In large-batch-size scenarios (small models), each batch could be as large as 1024, which means gen-0 GC will be triggered every decode cycle, gen-1 GC every 10 decode cycles, and gen-2 GC every 100 decode cycles, which is very inefficient!
In this PR, we change output tokens from list[list[int]] to list[np.ndarray] to cut the object count per batch from <batch_size> down to 1, which significantly reduces GC overhead, especially for large-batch-size use cases.
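The object-count claim can be seen directly with `gc.is_tracked` (an illustrative sketch, not vLLM code): small Python lists are tracked by the cyclic collector, while numpy arrays are not, so a batch of per-request token lists adds `<batch_size>` tracked objects per decode step, whereas a batch of ndarrays adds only the outer list.

```python
# Sketch: numpy arrays are not tracked by CPython's cyclic GC,
# while per-request token lists are.
import gc

import numpy as np

tokens_as_list = [1, 2, 3]                              # GC-tracked container
tokens_as_array = np.array([1, 2, 3], dtype=np.int64)   # not GC-tracked

print(gc.is_tracked(tokens_as_list))   # True
print(gc.is_tracked(tokens_as_array))  # False

# A decode batch of 1024 requests:
batch_lists = [[i] for i in range(1024)]           # list[list[int]]
batch_arrays = [np.array([i]) for i in range(1024)]  # list[np.ndarray]
tracked_lists = sum(gc.is_tracked(x) for x in batch_lists)
tracked_arrays = sum(gc.is_tracked(x) for x in batch_arrays)
print(tracked_lists, tracked_arrays)  # 1024 0
```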
Test Plan & Test Result
Request Throughput Change
With the change, GC overhead improved significantly.
