
Serving ML models with multiple workers linearly adds the RAM's load. #2425

Closed
sevakharutyunyan opened this issue Nov 26, 2020 · 12 comments
Labels: question (Question or problem), question-migrate

Comments

@sevakharutyunyan

sevakharutyunyan commented Nov 26, 2020

Recently, we deployed an ML model with FastAPI and encountered an issue.

The code looks like this:

import json

from fastapi import File, UploadFile
from ocr_pipeline.model.ocr_wrapper import OcrWrapper

ocr_wrapper = OcrWrapper(**config.model_load_params)  # loads a 1.5 GB PyTorch model

...  # `config` and the FastAPI instance `api` are defined in the parts omitted here

@api.post('/')
async def predict(file: UploadFile = File(...)):
    preds = ocr_wrapper.predict(file.file, **config.model_predict_params)
    return json.dumps({"data": preds})

Started with the command below, the app consumes at least 3 GB of RAM:

gunicorn --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api

Is there any way to scale the number of workers without consuming too much RAM?

ENVIRONMENT:
Ubuntu 18.04
Python 3.6.9

fastapi==0.61.2
uvicorn==0.12.2
gunicorn==20.0.4
uvloop==0.14.0

@tiangolo

sevakharutyunyan added the question (Question or problem) label on Nov 26, 2020
sevakharutyunyan changed the title from "Number of workers linearly add RAM load." to "Serving ML models with multiple workers linearly add RAM load." on Nov 26, 2020
sevakharutyunyan changed the title from "Serving ML models with multiple workers linearly add RAM load." to "Serving ML models with multiple workers linearly adds the RAM's load." on Nov 26, 2020
@ycd
Contributor

ycd commented Nov 26, 2020

I believe this issue is a duplicate of #596. Have you tried the workarounds over there, like trying with Python 3.8?

@sevakharutyunyan
Author

I believe this issue is a duplicate of #596. Have you tried the workarounds over there, like trying with Python 3.8?

There is no problem with RAM consumption growing forever. Once it reaches (num_workers) * (model_size), it stops there.

@ycd
Contributor

ycd commented Nov 26, 2020

Well, I think it's pretty normal, since you are loading 1.5 GB directly into memory in two separate worker processes.

Approximately how long does it take to answer one request?

@sevakharutyunyan
Author

Well, I think it's pretty normal, since you are loading 1.5 GB directly into memory in two separate worker processes.

Approximately how long does it take to answer one request?

Yeah, I know it's normal behavior to load the model in two separate workers. The question is: is there any way to make sure that all workers use the same model, rather than each keeping its own copy? It's even more important when inference is done on a GPU.

@hackwithharsha

Not exactly sure if this is about using shared memory among different workers, but we recently faced a similar issue while running Celery workers. The fix, which uses shared memory, is discussed in the following Stack Overflow post. Not sure whether this technique will help in your scenario or not.

https://stackoverflow.com/questions/9565542/share-memory-areas-between-celery-workers-on-one-machine
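
For illustration, a minimal sketch of the shared-memory idea from that post, using Python's multiprocessing.shared_memory (Python 3.8+, so newer than the Python 3.6.9 in this report); the array contents and the "model_weights" name are made up, and whether a full PyTorch model can be shared this way is a separate question.

import numpy as np
from multiprocessing import shared_memory

# "Parent" side: copy the stand-in weights into a named shared-memory block.
weights = np.ones((1000, 1000), dtype=np.float32)  # stand-in for real model weights
shm = shared_memory.SharedMemory(create=True, size=weights.nbytes, name="model_weights")
shared = np.ndarray(weights.shape, dtype=weights.dtype, buffer=shm.buf)
shared[:] = weights[:]

# "Worker" side: attach to the same block by name instead of holding another copy.
attached = shared_memory.SharedMemory(name="model_weights")
view = np.ndarray((1000, 1000), dtype=np.float32, buffer=attached.buf)
print(view.sum())  # use `view` for inference-like work

# Cleanup: drop the array views, close each handle, and unlink once.
del shared, view
attached.close()
shm.close()
shm.unlink()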

@raphaelauv
Contributor

raphaelauv commented Nov 27, 2020

This is not a FastAPI-specific question (more of a gunicorn one); it's about sharing memory between processes.

The solution is to load the model into RAM before gunicorn forks the workers, so you need to use --preload:

gunicorn --workers 2 --preload --worker-class=uvicorn.workers.UvicornWorker app.main:api

Your main.py file inside the app folder:

def create_app():
    MY_MODEL.load("model_path")
    app = FastAPI()
    app.include_router(my_router)
    return app
api = create_app()
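
For concreteness, a minimal sketch of how the snippet from the original report could be laid out for this: OcrWrapper comes from that report, while the *_PARAMS dicts are placeholders for the OP's elided config object.

import json

from fastapi import FastAPI, File, UploadFile
from ocr_pipeline.model.ocr_wrapper import OcrWrapper

MODEL_LOAD_PARAMS = {}     # placeholder for config.model_load_params
MODEL_PREDICT_PARAMS = {}  # placeholder for config.model_predict_params

def create_app():
    # With --preload, create_app() runs once in the gunicorn master, so the
    # 1.5 GB model is loaded a single time and the forked workers share the
    # memory pages copy-on-write.
    ocr_wrapper = OcrWrapper(**MODEL_LOAD_PARAMS)
    app = FastAPI()

    @app.post('/')
    async def predict(file: UploadFile = File(...)):
        preds = ocr_wrapper.predict(file.file, **MODEL_PREDICT_PARAMS)
        return json.dumps({"data": preds})

    return app

api = create_app()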

If you have more questions about gunicorn, Python, fork, copy-on-write, Python reference counting, or memory leaks -> Stack Overflow.

You can very probably close this issue, thank you :)

@cosimo

cosimo commented Dec 3, 2020

This is not a FastAPI-specific question (more of a gunicorn one); it's about sharing memory between processes.

That's right.

The solution is to load the model into RAM before gunicorn forks the workers, so you need to use --preload:

gunicorn --workers 2 --preload --worker-class=uvicorn.workers.UvicornWorker app.main:api

Your main.py file inside the app folder:

def create_app():
    MY_MODEL.load("model_path")
    app = FastAPI()
    app.include_router(my_router)
    return app
api = create_app()

Since I have the same problem, I'm going to try this, though I suspect it will not be a viable solution.

Trying to share PyTorch models in that way causes them to stop working. When the model is used for inference, the worker almost always hangs, resulting in a timeout and a new worker being spawned.

Here's a thread discussing the same issue: benoitc/gunicorn#2157
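
One mitigation sometimes suggested for fork-related PyTorch hangs (an assumption here, not something confirmed in this thread) is to restrict PyTorch's intra-op thread pool in each worker after the fork, for example from a gunicorn server hook. A minimal sketch of such a gunicorn.conf.py:

# gunicorn.conf.py -- sketch only; thread pools inherited across fork() are a
# common source of inference deadlocks, so shrink PyTorch's pool per worker.
workers = 2
preload_app = True
worker_class = "uvicorn.workers.UvicornWorker"

def post_fork(server, worker):
    import torch
    torch.set_num_threads(1)

It would be started with something like: gunicorn -c gunicorn.conf.py app.main:api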

If you have more questions about gunicorn, Python, fork, copy-on-write, Python reference counting, or memory leaks -> Stack Overflow.

I don't see how this is useful.

@raphaelauv
Contributor

raphaelauv commented Dec 3, 2020

Since I have the same problem, I'm going to try this, though I suspect it will not be a viable solution.

The preload option works and does what it is supposed to do; if it does not work in your case, that's a problem in your library or code.

Since I have the same problem, I'm going to try this, though I suspect it will not be a viable solution.

This is working in production with a model taking more than 40 GB of RAM, shared by 8 workers.

I don't see how this is useful.

That makes explicit the fact that --preload is not magic and will not necessarily work easily, depending on the memory to be shared, as in your PyTorch problem.
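
To check how much of a worker's memory is actually being shared after --preload, one rough approach on Linux is to compare each gunicorn worker's Rss with its Pss. A minimal sketch, assuming /proc/<pid>/smaps_rollup is available (kernel 4.14+):

import sys

def rss_pss_kb(pid):
    """Return (Rss, Pss) in kB, read from /proc/<pid>/smaps_rollup."""
    rss = pss = 0
    with open("/proc/%d/smaps_rollup" % pid) as f:
        for line in f:
            if line.startswith("Rss:"):
                rss = int(line.split()[1])
            elif line.startswith("Pss:"):
                pss = int(line.split()[1])
    return rss, pss

if __name__ == "__main__":
    # Usage: python memcheck.py <worker_pid> [<worker_pid> ...]
    for pid in map(int, sys.argv[1:]):
        rss, pss = rss_pss_kb(pid)
        print("pid=%d Rss=%d kB Pss=%d kB" % (pid, rss, pss))

If Pss is much smaller than Rss for the workers, most of their resident memory (e.g. the preloaded model pages) really is shared.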

@cosimo

cosimo commented Dec 3, 2020

This is working in production with a model taking more than 40 GB of RAM, shared by 8 workers.

Great to hear; then please share as much detail as you can about that setup, because evidently it's not working for everyone, despite --preload working correctly.

Is it a PyTorch model? Is it a pipeline? In my case, I use a SentenceTransformer model and only use it to get embeddings (model.encode()), not to run full inference. Having more details about this could help both me and the OP find a solution.

That might be more useful in the thread I mentioned (or elsewhere) rather than here, since it's not a FastAPI problem.

@cosimo

cosimo commented Dec 3, 2020

Just found out that if I change my app methods from:

@app.post("/clusters", response_model=ClusteringResponse)
async def cluster(request: ClusteringRequest, model=Depends(get_model)):
    """Cluster a list of text sentences"""
    ...

to:

@app.post("/clusters", response_model=ClusteringResponse)
def cluster(request: ClusteringRequest, model=Depends(get_model)):
    """Cluster a list of text sentences"""
    ...

removing the async qualifier, the model does indeed work as expected.

@sevakharutyunyan are you able to verify if this works for you?
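
A plausible explanation, offered as an assumption rather than something established in this thread: FastAPI runs a plain def endpoint in its threadpool, while a blocking model call inside an async def endpoint runs directly on the event loop and can stall the worker. If keeping async def is preferred, the blocking call can be offloaded explicitly; a minimal sketch using run_in_threadpool, with a stub model standing in for the real one:

from typing import List

from fastapi import Depends, FastAPI
from fastapi.concurrency import run_in_threadpool

app = FastAPI()

class StubModel:
    # Stand-in for the real (preloaded) model; replace with the actual object.
    def encode(self, sentences):
        return [[float(len(s))] for s in sentences]

model_instance = StubModel()

def get_model():
    return model_instance

@app.post("/clusters")
async def cluster(sentences: List[str], model=Depends(get_model)):
    # The endpoint stays async, but the blocking encode() call runs in the
    # threadpool so it does not block the event loop.
    embeddings = await run_in_threadpool(model.encode, sentences)
    return {"embeddings": embeddings}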

@sevakharutyunyan
Author

Just found out that if I change my app methods from:

@app.post("/clusters", response_model=ClusteringResponse)
async def cluster(request: ClusteringRequest, model=Depends(get_model)):
    """Cluster a list of text sentences"""
    ...

to:

@app.post("/clusters", response_model=ClusteringResponse)
def cluster(request: ClusteringRequest, model=Depends(get_model)):
    """Cluster a list of text sentences"""
    ...

removing the async qualifier, the model does indeed work as expected.

@sevakharutyunyan are you able to verify if this works for you?

Removing async doesn't help. The --preload option for gunicorn indeed works for a small network, but not in every case.

@tiangolo
Owner

tiangolo commented Nov 9, 2022

Thanks for the help here everyone! 👏 🙇

And thanks for closing the issue 👍

If anyone is having other related problems, please create new issues following the template. ☕

Sorry for the long delay! 🙈 I wanted to personally address each issue/PR and they piled up through time, but now I'm checking each one in order.

tiangolo reopened this on Feb 28, 2023
Repository owner locked and limited conversation to collaborators on Feb 28, 2023
tiangolo converted this issue into discussion #7069 on Feb 28, 2023

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
