Your current environment
...
How would you like to use vLLM
I am serving a chat model (e.g. Llama 3 70B Instruct) with the OpenAI-compatible server. The v1/chat/completions endpoint does not accept a batch of conversations, whereas the v1/completions endpoint accepts a batch of prompts. Is the proper way to get batching to send concurrent requests to v1/chat/completions from multiple threads, or should I apply the chat template myself and send the rendered prompts to v1/completions as a batch? Thanks!
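For reference, here is a minimal sketch of the first approach: fanning out concurrent requests to v1/chat/completions with the official openai client and asyncio (which works the same as multiple threads from the server's point of view). The base URL, model name, and prompts are placeholders for illustration, not taken from the issue.

```python
import asyncio
from openai import AsyncOpenAI

# Placeholder base URL / API key; adjust to your vLLM server.
client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def chat(prompt: str) -> str:
    resp = await client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-70B-Instruct",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

async def main(prompts: list[str]) -> list[str]:
    # Fire all requests concurrently; the server can then group them
    # internally, so no client-side batch API is required.
    return await asyncio.gather(*(chat(p) for p in prompts))

if __name__ == "__main__":
    print(asyncio.run(main(["Hello!", "What is vLLM?"])))
```

And a sketch of the second approach: rendering the chat template client-side with transformers' apply_chat_template and sending one batched request to v1/completions, which (per the OpenAI completions spec) accepts a list of prompts. Again, the model name and URL are assumptions.

```python
import requests
from transformers import AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-70B-Instruct"  # placeholder
tok = AutoTokenizer.from_pretrained(MODEL)

user_msgs = ["Hello!", "What is vLLM?"]
# Render each conversation with the model's chat template,
# appending the assistant generation prompt.
prompts = [
    tok.apply_chat_template(
        [{"role": "user", "content": m}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for m in user_msgs
]

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": MODEL, "prompt": prompts, "max_tokens": 256},
)
print([c["text"] for c in resp.json()["choices"]])
```

The second approach trades one round trip for the burden of keeping the client-side template in sync with the model; the first leaves templating to the server.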