
[Usage]: How to batch requests to chat models with OpenAI server? #4746

Closed
sidjha1 opened this issue May 10, 2024 · 2 comments
Labels: usage (How to use vllm)

Comments


sidjha1 commented May 10, 2024

Your current environment

...

How would you like to use vllm

I am serving a chat model (e.g., Llama 3 70B Instruct) with the OpenAI-compatible server. However, the v1/chat/completions endpoint does not accept a batch, whereas the v1/completions endpoint does. Is the proper way to batch to send requests to v1/chat/completions from multiple threads, or should I apply the chat template myself and send the batch to the v1/completions endpoint? Thanks!
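
For reference, a minimal sketch of the first option (multiple threads each sending a request to v1/chat/completions), assuming the openai Python client (>= 1.0) and a vLLM server at http://localhost:8000/v1; the model name and prompts are placeholders:

```python
# Rough sketch of the threaded approach (assumptions: openai>=1.0 client,
# vLLM OpenAI-compatible server at http://localhost:8000/v1,
# placeholder model name and prompts).
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

prompts = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in two sentences.",
]

def chat(prompt: str) -> str:
    # One /v1/chat/completions request per thread; the in-flight requests
    # are batched together inside the serving engine.
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return response.choices[0].message.content

with ThreadPoolExecutor(max_workers=8) as pool:
    answers = list(pool.map(chat, prompts))

for prompt, answer in zip(prompts, answers):
    print(f"{prompt}\n-> {answer}\n")
```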

sidjha1 added the usage label on May 10, 2024
simon-mo (Collaborator) commented:

Either approach works, and performance should be similar, since there is a single engine performing the batching under the hood.
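
For the second route, a minimal sketch of applying the chat template client-side and sending the whole batch to v1/completions (assuming transformers is installed on the client and using the same placeholder server and model as above):

```python
# Rough sketch of the chat-template + /v1/completions route (assumptions:
# transformers installed client-side, placeholder server URL and model name).
from openai import OpenAI
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

questions = [
    "What is the capital of France?",
    "Summarize the plot of Hamlet in two sentences.",
]

# Render each conversation into a plain prompt string using the model's
# chat template, with the generation prompt for the assistant turn appended.
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

# /v1/completions accepts a list of prompts, so the batch goes in one request.
response = client.completions.create(model=MODEL, prompt=prompts, max_tokens=256)
for choice in sorted(response.choices, key=lambda c: c.index):
    print(choice.text.strip())
```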


sidjha1 commented May 11, 2024

Great, thanks @simon-mo!

sidjha1 closed this as completed on May 11, 2024