Serving multiple concurrent queries with vLLM model instances using FastAPI workers #11065
Example Code

```python
from fastapi import FastAPI
from langchain.llms import VLLM

app = FastAPI()
app.llm = VLLM(
    model="zephyr-7B-beta-AWQ",
    quantization="awq",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_new_tokens=1024,
    batch_size=8,
    top_k=50,
    top_p=0.9,
    temperature=0.1,
    repetition_penalty=1.1,
    stream=True,
    vllm_kwargs={"max_model_len": 8192},
)

@app.get("/")
def read_root():
    return {"LLM": app.llm}
```

Description

I would like to process concurrent user requests using workers that share the vLLM model's memory and respond in parallel, so that 'n' users hitting the API endpoint get responses at the same time. Can someone suggest how to achieve concurrency in FastAPI when serving vLLM requests?

Operating System: Linux
Operating System Details: No response
FastAPI Version: 0.100.1
Pydantic Version: 1.10.13
Python Version: 3.11.4
Additional Context: No response
Replies: 2 comments
Any luck?
The way to load the vLLM model for use by FastAPI at startup (even with multiple potential workers) is to use lifespan events to handle this cold start: https://fastapi.tiangolo.com/advanced/events/#lifespan. Otherwise you will face issues when these objects are forked.