
Conversation

@Jialin (Collaborator) commented Oct 7, 2025

Purpose

The default GC threshold setup is (700, 10, 10), which means:

  • if allocations - deallocations >= 700 in generation 0, GC0 is triggered
  • GC1 is triggered after every 10 GC0 runs
  • GC2 is triggered after every 10 GC1 runs

In large-batch-size scenarios (small models), each batch could produce as many as 1024 objects, which means GC0 is triggered every decode cycle, GC1 every 10 decode cycles, and GC2 every 100 decode cycles, which is very inefficient!
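For reference, these thresholds can be inspected and tuned from Python (a generic CPython snippet, not part of this PR):

```python
import gc

# CPython's default generational thresholds, as described above.
print(gc.get_threshold())  # (700, 10, 10) by default
print(gc.get_count())      # current (gen0, gen1, gen2) allocation counters

# One blunt mitigation is raising the gen-0 threshold so collections run
# less often; this PR instead reduces the number of tracked objects.
gc.set_threshold(10_000, 10, 10)
```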

In this PR, we change output tokens from list[list[int]] to list[np.ndarray] to cut the object count from <batch_size> down to 1, which significantly reduces GC overhead, especially for large-batch-size use cases.
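The reason this helps: CPython's cyclic GC only tracks container objects, so replacing per-request list[int] objects with untracked values shrinks the set each collection must scan. A quick stdlib illustration (using array.array as a stand-in for a numeric numpy array, which to my understanding is likewise untracked):

```python
import gc
from array import array

tokens_as_list = [1, 2, 3]
tokens_as_buffer = array("i", [1, 2, 3])

print(gc.is_tracked(tokens_as_list))    # True: lists are GC-tracked
print(gc.is_tracked(12345))             # False: plain ints are not
print(gc.is_tracked(tokens_as_buffer))  # False: flat int buffers are not
```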

Test Plan & Test Result

  • Test 1: facebook/125m TP1 Input 48 Output 2000
  • Test 2: facebook/125m TP1 Input 48 Output 500
  • Test 3: Llama3 8B TP2 Input 48 Output 500

Request Throughput Change

  • Test 1: facebook/125m Long Output: 30+% (from 10.04 request/s -> 13.11 request/s)
  • Test 2: facebook/125m Short Output: 5+% (from 48.95 request/s -> 51.66 request/s)
  • Test 3: Llama3 8B: 6+% (from 28.93 request/s -> 30.70 request/s)

With this change, GC overhead improved significantly.
(Screenshot: GC timing comparison, 2025-10-16)


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@Jialin (Collaborator, Author) commented Oct 7, 2025

CC @yeqcharlotte @houseroad @WoosukKwon @njhill for RFC on the proposal and high level code changes, but the PR might not be completely ready yet

  • Callers might not be updated completely
  • Unit tests are still WIP

@chatgpt-codex-connector (bot) commented:

💡 Codex Review

Here are some automated review suggestions for this pull request.


@gemini-code-assist (bot) commented:

Code Review

This pull request introduces a clever optimization to reduce garbage collection overhead by changing the representation of single output tokens from a list to an integer. The use of Union[int, list[int]] is a good way to handle both single and multiple token outputs efficiently. The changes are consistently applied across the codebase. My only feedback is to add unit tests for the new utility functions to ensure the correctness and robustness of this performance-critical change.

@Jialin Jialin changed the title Single output token [Core] Performance: Use int instead of list[int] for single output token for GC optimization Oct 7, 2025
@Jialin (Collaborator, Author) commented Oct 7, 2025

Resolves #26369

@Jialin (Collaborator, Author) commented Oct 13, 2025

Gentle nudge @houseroad for the review :P

@mergify (bot) commented Oct 14, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Jialin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 14, 2025
@Jialin Jialin force-pushed the single_output_token branch from 8512558 to 02914f9 Compare October 14, 2025 07:30
@mergify mergify bot removed the needs-rebase label Oct 14, 2025
@yeqcharlotte (Collaborator) commented Oct 14, 2025

thanks for the change. hhh it's hard to keep classes around to avoid GC. wondering what the actual gain is on bigger models? we can play with deepseek v3, qwen-235b etc. and try it for both throughput and latency sensitive use cases.

wondering if you folks have a strong opinion on this @heheda12345 @njhill

@zhuohan123 (Member) commented Oct 15, 2025

I am a bit hesitant about this PR. I feel it's over-optimization. To achieve low latency, we should probably invest in:

  • Replacing data structures with numpy arrays
  • Async scheduling
  • If Python overhead is actually that serious, seriously thinking about rewriting the scheduler in C++

I am worried that optimizations like the one in this PR can pile up and make the code much harder for people to understand without actually improving the perf very significantly.

@Jialin (Collaborator, Author) commented Oct 16, 2025

I am a bit hesitant about this PR. I feel it's over-optimization. To achieve low latency, we should probably invest in:

  • Replacing data structures with numpy arrays
  • Async scheduling
  • If Python overhead is actually that serious, seriously thinking about rewriting the scheduler in C++

I am worried that optimizations like the one in this PR can pile up and make the code much harder for people to understand without actually improving the perf very significantly.

Thanks @zhuohan123 for the suggestions; I'm totally aligned on the trade-off between code complexity and performance. But I would love to clarify the change in this PR:

  • Motivation: reduce GC collection frequency by avoiding a list[int] allocation for single-element outputs, not because of the performance of list[int] itself. The PR summary has more details about the rationale.
  • Improvements: we've included a few more validations in the PR summary; overall, we believe the improvement is significant for small models and large-batch-size scenarios (facebook/opt125m 30+% throughput improvement and Llama 8B 6+% throughput improvement).
  • Alternatives:
    • Replace with numpy arrays: that's a great call I didn't think of earlier. I think it's cleaner, doesn't involve GC, and is friendlier for both non-spec-decoding and spec-decoding scenarios. But it also requires some code changes, so I would love to get more of your opinions before investing more time in this direction.
    • Async scheduling won't help in this scenario: as long as the data structure stays as-is, GC will kick in regardless and pause the whole process.
    • Just rewriting the scheduler in C++ might not help either: as long as we still use a Python interface and pass single output tokens around as list[int], GC will still happen.
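The "pauses the whole process" point can be checked directly: CPython exposes gc.callbacks, so each collection can be timed. A minimal, self-contained sketch (generic CPython, not from this PR):

```python
import gc
import time

pauses = []    # (generation, seconds) for each observed collection
start = [0.0]  # mutable cell so the callback can record the start time

def on_gc(phase, info):
    if phase == "start":
        start[0] = time.perf_counter()
    elif phase == "stop":
        pauses.append((info["generation"], time.perf_counter() - start[0]))

gc.callbacks.append(on_gc)
# Allocate enough container objects to cross the gen-0 threshold many
# times; every one of these pauses would stall async-scheduled work too.
junk = [[i] for i in range(100_000)]
gc.callbacks.remove(on_gc)

print(f"{len(pauses)} collections observed")
```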

Next step: @zhuohan123, per the new data and clarifications I provided, do you feel this is worth further investigation? If yes, I could explore the np.ndarray suggestion more. Appreciate it.

@Jialin (Collaborator, Author) commented Oct 16, 2025

we can play with deepseek v3, qwen-235b etc. and try it for both throughput and latency sensitive use cases.

@yeqcharlotte To be very honest, this change might only be helpful for small models and large-batch-size scenarios, so the effect could be small for large models like deepseek v3 and qwen 235b :/


@Jialin (Collaborator, Author) commented Nov 15, 2025

V1 Test others has been failing since this PR: https://buildkite.com/vllm/ci/builds/39126/steps/canvas?sid=019a84f2-ee8a-46bd-9845-3bc144e4cf4b

Sorry for the inconvenience. I'll take a look later in the day for a quick fix forward. If needed, please feel free to revert for early unblock.

@DarkLight1337 DarkLight1337 mentioned this pull request Nov 15, 2025
5 tasks
@DarkLight1337 (Member) commented:

The fix seems to be straightforward enough, opened #28771

njhill added a commit that referenced this pull request Nov 15, 2025
…[int]] for output tokens for GC optimization (#26368)"

This reverts commit 186352b.
DarkLight1337 pushed a commit to DarkLight1337/vllm that referenced this pull request Nov 15, 2025
…or output tokens for GC optimization (vllm-project#26368)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
vllm-bot pushed a commit that referenced this pull request Nov 15, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
…or output tokens for GC optimization (vllm-project#26368)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: George D. Torres <gdavtor@gmail.com>
geodavic pushed a commit to geodavic/vllm that referenced this pull request Nov 16, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: George D. Torres <gdavtor@gmail.com>
@WoosukKwon (Collaborator) commented:

@Jialin @DarkLight1337 @njhill A dumb question: Can we further optimize the data structure by using two numpy arrays, sampled_tokens: int32[num_reqs, max_num_generated_tokens] and num_sampled_tokens: int32[num_reqs]?

@Jialin (Collaborator, Author) commented Nov 16, 2025

@Jialin @DarkLight1337 @njhill A dumb question: Can we further optimize the data structure by using two numpy arrays, sampled_tokens: int32[num_reqs, max_num_generated_tokens] and num_sampled_tokens: int32[num_reqs]?

@WoosukKwon Thanks for the suggestion! But I think there are some trade-offs to your proposal, and it might not be the most critical GC bottleneck as of now.

Purely from a GC perspective, since numpy arrays don't go through GC at all, the original approach would only introduce one object (the list that contains sampled_tokens) per decode batch.
Your proposal could remove even that object (that's the sweet part :) )

However, I feel there are a few downsides:

  • memory overhead, since we need to pad all lanes to max_decode_tokens
  • we might need to introduce more handy utils to manipulate the sampled tokens (so the maintenance cost also goes up)
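As a rough stdlib sketch of the proposed two-array layout (array.array standing in here for the proposed int32 numpy arrays; all names are illustrative, not from the vLLM codebase):

```python
import gc
from array import array

num_reqs, max_tokens = 4, 8

# Flat, preallocated [num_reqs x max_tokens] buffer, padded with -1.
# Note the padding downside: memory is reserved for every lane up front.
sampled_tokens = array("i", [-1]) * (num_reqs * max_tokens)
num_sampled_tokens = array("i", [0]) * num_reqs

def append_token(req: int, tok: int) -> None:
    n = num_sampled_tokens[req]
    sampled_tokens[req * max_tokens + n] = tok
    num_sampled_tokens[req] = n + 1

def tokens_for(req: int) -> list[int]:
    base = req * max_tokens
    return list(sampled_tokens[base : base + num_sampled_tokens[req]])

append_token(0, 42)
append_token(0, 7)
append_token(2, 13)
print(tokens_for(0))                  # [42, 7]
print(gc.is_tracked(sampled_tokens))  # False: the buffer never enters GC
```

This also hints at the maintenance-cost downside: even this toy version needs helper functions just to read tokens back out.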

Recently I found a pretty good way (with tracemalloc) to analyze which code paths and objects introduce the largest GC overhead. I'll update gc_utils.py accordingly to make it accessible to others. With that, we could evaluate GC bottlenecks easily and optimize them in order, and justify GC optimizations more objectively (given that these optimizations unfortunately all have trade-offs, we should only take on the critical ones, with a way to demonstrate the criticality of each).
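A minimal sketch of that kind of analysis with the stdlib tracemalloc (not the actual gc_utils.py change):

```python
import tracemalloc

tracemalloc.start(5)  # record up to 5 stack frames per allocation

# Stand-in workload: lots of small lists, like per-token output lists.
workload = [[i] for i in range(50_000)]

snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics("lineno")
for stat in top[:3]:
    print(stat)  # file:line with size=... and count=... of live allocations

tracemalloc.stop()
```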

WDYT? :)

iboiko-habana pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Nov 17, 2025
…ge (#575)

sampled_token_ids was changed from list[list[int]] to list[Union[int, list[int]]]:
vllm-project/vllm#26368

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
…or output tokens for GC optimization (vllm-project#26368)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: Bram Wasti <bwasti@meta.com>
bwasti pushed a commit to bwasti/vllm that referenced this pull request Nov 17, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: Bram Wasti <bwasti@meta.com>
afierka-intel pushed a commit to afierka-intel/vllm-gaudi that referenced this pull request Nov 18, 2025
…ge (vllm-project#575)

sampled_token_ids was changed from list[list[int]] to list[Union[int, list[int]]]:
vllm-project/vllm#26368

Signed-off-by: Paweł Olejniczak <polejniczakx@habana.ai>
Jialin added a commit to Jialin/vllm that referenced this pull request Nov 20, 2025
This reverts commit 98b4d38.

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Jialin added a commit to Jialin/vllm that referenced this pull request Nov 21, 2025
This reverts commit 98b4d38.

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
vllm-bot pushed a commit that referenced this pull request Nov 21, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
LuminolT pushed a commit to LuminolT/vllm that referenced this pull request Nov 21, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: LuminolT <lumischen01@gmail.com>
ywang96 pushed a commit to ywang96/vllm that referenced this pull request Nov 23, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
lpapavassiliou pushed a commit to lpapavassiliou/vllm that referenced this pull request Nov 24, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
RunkaiTao pushed a commit to RunkaiTao/vllm that referenced this pull request Nov 24, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: Runkai Tao <rt572@physics.rutgers.edu>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
…or output tokens for GC optimization (vllm-project#26368)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
bringlein pushed a commit to bringlein/vllm that referenced this pull request Nov 26, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…or output tokens for GC optimization (vllm-project#26368)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>
devpatelio pushed a commit to SumanthRH/vllm that referenced this pull request Nov 29, 2025
…#29121)

Signed-off-by: Jialin Ouyang <Jialin.Ouyang@gmail.com>

Labels

kv-connector · ready (ONLY add when PR is ready to merge / full CI is needed) · speculative-decoding · v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants