Conversation

@Yard1 (Collaborator) commented Oct 12, 2023

This PR preallocates the tensors used to prune hidden states and to index samples by sampling type, allowing us to delay the GPU->CPU sync slightly longer (up until torch.multinomial in the sampler).

Should fix the slight performance regression introduced in #1309

Technically, we can delay the sync even further by reordering some more operations, but that's left for a future PR.
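
A minimal sketch of the general idea, with hypothetical names (`sample`, `selected_token_indices`, `greedy_mask` are illustrative, not necessarily the PR's actual code): keep the pruning and sampling-type index tensors on the GPU so no early `.cpu()`/`.tolist()` call forces a sync.

```python
import torch

def sample(logits: torch.Tensor,
           selected_token_indices: torch.Tensor,
           greedy_mask: torch.Tensor) -> torch.Tensor:
    # Prune logits with a preallocated GPU index tensor; no host copy here,
    # so nothing forces a GPU->CPU sync yet.
    pruned = logits.index_select(0, selected_token_indices)
    probs = torch.softmax(pruned, dim=-1)

    # Greedy picks also stay on the GPU.
    greedy = torch.argmax(pruned, dim=-1)

    # Per the PR description, the sync is now delayed until around
    # torch.multinomial in the sampler.
    random = torch.multinomial(probs, num_samples=1).squeeze(-1)
    return torch.where(greedy_mask, greedy, random)
```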

@Yard1 (Collaborator, Author) commented Oct 12, 2023

cc @zhuohan123

@Yard1 (Collaborator, Author) commented Oct 16, 2023

updating the PR!

@hanzhi713 (Contributor) commented

This would be a nice improvement to have. Since you're adding more h2ds, a small optimization would be to batch the h2ds together and make them async.

[attached screenshot]
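
One way such batching could look, as a hedged sketch (the `batched_h2d` helper and its signature are assumptions for illustration, not code from this PR): stage all the small arrays in a single pinned CPU buffer and issue one non-blocking copy so it can overlap with GPU compute.

```python
import torch

def batched_h2d(arrays: list[list[int]], device: torch.device) -> list[torch.Tensor]:
    lengths = [len(a) for a in arrays]
    # Single pinned staging buffer so the copy can be truly asynchronous.
    staging = torch.empty(sum(lengths), dtype=torch.long, pin_memory=True)
    offset = 0
    for a in arrays:
        staging[offset:offset + len(a)] = torch.tensor(a, dtype=torch.long)
        offset += len(a)
    # One async host->device copy instead of one small copy per array.
    gpu = staging.to(device, non_blocking=True)
    return list(gpu.split(lengths))
```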

@Yard1 (Collaborator, Author) commented Oct 16, 2023

Yes, we could certainly optimize that as well; I would like to do that in a follow-up.

@Yard1 (Collaborator, Author) commented Oct 16, 2023

@zhuohan123 Updated, PTAL!

@zhuohan123 (Member) left a comment

LGTM! Thank you for your contribution! Left some small comments based on recent PRs. Let's refactor the bloated input_metadata in a future PR.

@Yard1 marked this pull request as draft on October 17, 2023 02:14

@Yard1 (Collaborator, Author) commented Oct 17, 2023

Need to fix max_prompt_len in worker

@hanzhi713 (Contributor) commented

Any progress on this PR?

@Yard1 (Collaborator, Author) commented Oct 25, 2023

@hanzhi713 I am currently on holiday but will wrap it up next week!

@Yard1 (Collaborator, Author) commented Oct 27, 2023

Should be good now

@Yard1 marked this pull request as ready for review on October 27, 2023 23:27
@zhuohan123 (Member) left a comment

LGTM! Thank you for your contribution!

@zhuohan123 merged commit 15f5632 into vllm-project:main on Oct 30, 2023
@gesanqiu (Contributor) commented Nov 2, 2023

@Yard1 @zhuohan123 Have you tested the sampling performance on an AWQ model? I found that the AWQ model's first sampling step is much slower than the FP16 model's, but faster for the following tokens. I can't figure out the exact bottleneck :(

@zhaoyang-star (Contributor) commented Nov 11, 2023

@Yard1 @zhuohan123 Sorry, I am not familiar with selected_token_indices.

Assume we have 3 prompts with lengths 9, 8, and 6. In the prompt (prefill) stage, selected_token_indices=[5, 14, 23], but I think selected_token_indices should be [8, 16, 22]. Below is a diagram, which is hopefully more intuitive. I have been struggling to understand this, and I would appreciate your interpretation.
[diagram: selected_token_ids]
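
For reference, a minimal sketch of how the commenter's expected indices arise (this only reproduces the expectation stated above, under the assumption of a flattened, unpadded batch; it does not explain the observed values):

```python
import torch

prompt_lens = [9, 8, 6]
# Last-token position of each prompt in the flattened batch:
# cumulative lengths minus one.
expected = torch.cumsum(torch.tensor(prompt_lens), dim=0) - 1
print(expected.tolist())  # [8, 16, 22]
```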

hongxiayang pushed a commit to hongxiayang/vllm that referenced this pull request Feb 13, 2024