Delay GPU->CPU sync in sampling #1337
Conversation
cc @zhuohan123
Updating the PR!
Yes, we could certainly optimize that as well - I would like to do that in a follow-up.
@zhuohan123 Updated, PTAL!
LGTM! Thank you for your contribution! Left some small comments based on recent PRs. Let's refactor the bloated input_metadata in a future PR.
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
Need to fix max_prompt_len in the worker
Any progress on this PR?
@hanzhi713 I am on holiday currently but will wrap it up next week!
Should be good now
LGTM! Thank you for your contribution!
@Yard1 @zhuohan123 Have you tested the sampling performance on an AWQ model? I found that the AWQ model's first sampling step is much slower than the FP16 model's, but it is faster for the following tokens. I can't figure out the exact bottleneck :(
@Yard1 @zhuohan123 Sorry, I am not familiar with this. Assume we have 3 prompts with lengths 9, 8, and 6. In the prompt (prefill) stage, selected_token_indices=[5, 14, 23], but I think selected_token_indices should be [8, 16, 22]. Below is a diagram, which is hopefully more intuitive. I have been struggling to understand this, and I would appreciate your interpretation.
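A minimal sketch of the arithmetic behind this question, assuming the prefill hidden states of the three prompts are simply concatenated without any padding; it only reproduces the indices the commenter expects, not vLLM's actual prefill layout:

```python
# Expected last-token indices for prompts of lengths 9, 8 and 6 when the
# prefill hidden states are concatenated back to back: cumulative length - 1.
prompt_lens = [9, 8, 6]
expected, offset = [], 0
for length in prompt_lens:
    expected.append(offset + length - 1)
    offset += length
print(expected)  # [8, 16, 22] -- the observed value in the question is [5, 14, 23]
```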
This PR preallocates the tensors used to prune the hidden states and to index the samples by sampling type, allowing us to delay the GPU->CPU sync slightly longer (up until torch.multinomial in the sampler). This should fix the slight performance regression introduced in #1309.
Technically, we can delay the sync even further by reordering some more operations, but that's left for a future PR.
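To illustrate the general idea, here is a minimal, hedged sketch (not the actual vLLM sampler) of precomputing index tensors during input preparation and keeping all selection on the GPU, so that nothing forces a device-to-host read before torch.multinomial. The helper names and shapes below are illustrative assumptions.

```python
import torch

def build_selected_token_indices(prompt_lens, device):
    # Index of the last token of each prompt in the flattened hidden-state
    # tensor (cumulative length minus one), built once on the CPU during
    # input preparation and moved to the GPU up front.
    indices, offset = [], 0
    for length in prompt_lens:
        indices.append(offset + length - 1)
        offset += length
    return torch.tensor(indices, dtype=torch.long, device=device)

def sample_last_tokens(hidden_states, selected_token_indices, lm_head_weight):
    # Everything here stays on the GPU; no .item()/.tolist() call reads a
    # value back to the CPU. Per the PR description, the first unavoidable
    # GPU->CPU sync happens inside torch.multinomial.
    pruned = hidden_states.index_select(0, selected_token_indices)
    logits = pruned @ lm_head_weight.t()
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

# Example with the prompt lengths discussed above (9, 8, 6).
device = "cuda" if torch.cuda.is_available() else "cpu"
hidden_states = torch.randn(9 + 8 + 6, 16, device=device)  # [num_tokens, hidden]
lm_head_weight = torch.randn(32, 16, device=device)        # [vocab, hidden]
idx = build_selected_token_indices([9, 8, 6], device)
next_tokens = sample_last_tokens(hidden_states, idx, lm_head_weight)  # shape [3, 1]
```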