
[Usage]: How to batch requests to chat models with OpenAI server? #4746

Closed
sidjha1 opened this issue May 10, 2024 · 2 comments
Labels: usage (How to use vllm)

Comments


sidjha1 commented May 10, 2024

Your current environment

...

How would you like to use vllm

I am serving a chat model (e.g., Llama 3 70B Instruct) with the OpenAI-compatible server. However, the v1/chat/completions endpoint does not accept a batch, whereas the v1/completions endpoint does. Is the proper way to batch to send requests to v1/chat/completions from multiple threads, or should I apply the chat template myself and send the batch to the v1/completions endpoint? Thanks!
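
For reference, a minimal sketch of the first option (multiple threads each sending a request to v1/chat/completions), assuming the openai Python client (>= 1.0) and a vLLM server at http://localhost:8000/v1; the model name and prompts are placeholders:

```python
# Rough sketch of the threaded approach (assumptions: openai>=1.0 client,
# vLLM OpenAI-compatible server at http://localhost:8000/v1,
# placeholder model name and prompts).
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in two sentences.",
]

def chat(prompt: str) -> str:
    # One /v1/chat/completions request per thread; the in-flight requests
    # are batched together inside the serving engine.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(chat, prompts))

for prompt, answer in zip(prompts, answers):
    print(f"{prompt}\n-> {answer}\n")
```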

sidjha1 added the usage label on May 10, 2024
simon-mo (Collaborator) commented:

Either approach works, and performance should be similar, since there is a single engine performing the batching under the hood.
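
For the second route, a minimal sketch of applying the chat template client-side and sending the whole batch to v1/completions (assuming transformers is installed on the client and using the same placeholder server and model as above):

```python
# Rough sketch of the chat-template + /v1/completions route (assumptions:
# transformers installed client-side, placeholder server URL and model name).
from openai import OpenAI
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

questions = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in two sentences.",
]

# Render each conversation into a plain prompt string using the model's
# chat template, with the generation prompt for the assistant turn appended.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

# /v1/completions accepts a list of prompts, so the batch goes in one request.
response = client.completions.create(model=MODEL, prompt=prompts, max_tokens=256)
for choice in sorted(response.choices, key=lambda c: c.index):
    print(choice.text.strip())
```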


sidjha1 commented May 11, 2024

Great, thanks @simon-mo!

sidjha1 closed this as completed on May 11, 2024