Description
Your current environment
A100, CUDA 12.1. Just running simple asynchronous inference with a standard LLM model.
How would you like to use vllm
Hi, I have been using the async engine for inference, and it is convenient to handle everything in a queue, which emulates incoming requests.
My question is: can I handle a batch of requests? Since it is a wrapper around the LLM engine, I do not see why not,
so that each request becomes engine.async([prompt1, prompt2, prompt3]) -> [gen1, gen2, gen3] instead of engine.async([prompt1]) -> [gen1]. I want to make sure I can maintain the queue without causing issues with the requests received.
Finally, one more suggestion: perhaps I could copy the engine's async code along with the wrapper and modify the queuing code from there. Would that be easier instead?
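
For context, here is a minimal sketch of what I mean, assuming the AsyncLLMEngine interface where engine.generate(prompt, sampling_params, request_id) is an async generator of RequestOutput (exact imports and signatures may differ across vLLM versions, and the model name is just a placeholder): each prompt in the batch is submitted as its own request and the results are gathered, relying on the engine's continuous batching to group them internally.

```python
import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def generate_one(engine: AsyncLLMEngine, prompt: str,
                       params: SamplingParams) -> str:
    # Each prompt becomes its own request with a unique request_id;
    # the engine interleaves all in-flight requests in its own batches.
    request_id = str(uuid.uuid4())
    final_output = None
    async for output in engine.generate(prompt, params, request_id):
        final_output = output  # keep only the last (finished) RequestOutput
    return final_output.outputs[0].text


async def generate_batch(engine: AsyncLLMEngine,
                         prompts: list[str]) -> list[str]:
    params = SamplingParams(temperature=0.8, max_tokens=64)
    # Submitting all prompts concurrently is enough: vLLM's continuous
    # batching groups them at the engine level, so no batched call is needed.
    return await asyncio.gather(
        *(generate_one(engine, p, params) for p in prompts))


async def main() -> None:
    # "facebook/opt-125m" is just a placeholder model for this sketch.
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="facebook/opt-125m"))
    results = await generate_batch(engine, ["prompt1", "prompt2", "prompt3"])
    print(results)


if __name__ == "__main__":
    asyncio.run(main())
```

If that is the intended pattern, I can keep my external queue and simply fan each incoming batch out into individual requests.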
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.