perf: Improve vLLM backend performance by using a separate thread for responses #46
Conversation
@Tabrizian Can you add a description of the code changes in the PR? Also include the performance improvement you observed and in what cases.
@tanmayv25 I updated the PR description.
@kthui So you've gotten perf results close to Iman's after the sync?
yes |
LGTM!
I tested Qwen2-7B-chat on an A100 80GB and this PR did not help: the gap compared to a vLLM-only deployment is nearly 40% at concurrency 64 (comparing vLLM alone, Triton + vLLM with streaming, and Triton + vLLM without streaming).
GPU utilization with Triton + vLLM is significantly lower than with vLLM alone.
What does the PR do?
Triton's output token throughput for the generate endpoint increases by 18% at concurrency 50. There is still a small gap between the vLLM-only and the vLLM + Triton solutions.
The model is llama-2-7b.
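To illustrate the pattern named in the PR title (sending responses from a separate thread so the generation path is not blocked on response delivery), here is a minimal standard-library sketch. The names `ResponseDispatcher` and `send_fn` are hypothetical, not the backend's actual identifiers; in the real backend the worker thread would call Triton's response sender.

```python
import queue
import threading

_SENTINEL = object()  # tells the worker thread to stop


class ResponseDispatcher:
    """Illustrative only: queue responses and deliver them from a dedicated
    thread so the request/generation path never blocks on sending."""

    def __init__(self, send_fn):
        self._send_fn = send_fn  # in the real backend this would wrap Triton's response sender
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def put(self, response):
        # Called from the generation path; returns immediately.
        self._queue.put(response)

    def _drain(self):
        # Runs on the worker thread; the potentially slow send happens here.
        while True:
            item = self._queue.get()
            if item is _SENTINEL:
                break
            self._send_fn(item)

    def close(self):
        # Flush remaining items, then stop the worker.
        self._queue.put(_SENTINEL)
        self._worker.join()


if __name__ == "__main__":
    dispatcher = ResponseDispatcher(send_fn=print)
    for chunk in ("Hello", ", ", "world"):
        dispatcher.put(chunk)
    dispatcher.close()
```

A thread-safe queue decouples the producer from the sender, so a burst of generated tokens does not stall generation while earlier responses are still being delivered.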
Changes:
Send responses from a separate thread in the vLLM backend.
Next steps:
Checklist
PR title is of format `<commit_type>: <Title>`.
Commit Type: check the conventional commit type box here and add the label to the GitHub PR.
Related PRs:
N/A
Where should the reviewer start?
N/A
Test plan:
This is a performance improvement; existing test cases should be sufficient to cover any possible issues.
Caveats:
N/A
Background
N/A
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A