Serving multiple concurrent queries with vLLM model instances using FastAPI workers #11065
Example Code

```python
from fastapi import FastAPI
from langchain.llms import VLLM

app = FastAPI()
app.llm = VLLM(
    model="zephyr-7B-beta-AWQ",
    quantization="awq",
    trust_remote_code=True,
    tensor_parallel_size=1,
    max_new_tokens=1024,
    batch_size=8,
    top_k=50,
    top_p=0.9,
    temperature=0.1,
    repetition_penalty=1.1,
    stream=True,
    vllm_kwargs={"max_model_len": 8192},
)

@app.get("/")
def read_root():
    return {"LLM": app.llm}
```

Description

I would like to process concurrent user requests using workers that share the vLLM model's memory and respond in parallel, so that 'n' users hitting the API endpoint get responses at the same time. Can someone suggest how to achieve concurrency in FastAPI when serving vLLM requests?

Operating System: Linux
Operating System Details: No response
FastAPI Version: 0.100.1
Pydantic Version: 1.10.13
Python Version: 3.11.4
Additional Context: No response
Replies: 2 comments
Any luck?
The way to load the vLLM model for use by FastAPI at startup (even with multiple potential workers) is to use lifespan events to handle this cold start: https://fastapi.tiangolo.com/advanced/events/#lifespan. Otherwise you will face issues when these objects are forked.