This issue was moved to a discussion.
You can continue the conversation there. Go to discussion →
Serving ML models with multiple workers increases RAM usage linearly. #2425
Comments
I believe this issue is a duplicate of #596. Have you tried the workarounds over there, like trying with Python 3.8?
Well, I think it's pretty normal, since you are loading 1.5 GB directly into memory in two separate workers. Approximately how long does it take to answer one request?
Not exactly sure if it is about using shared memory among different threads. We recently faced a similar issue while running Celery workers. The fix is discussed in the following Stack Overflow post about using shared memory. Not sure whether this technique will help in your scenario or not: https://stackoverflow.com/questions/9565542/share-memory-areas-between-celery-workers-on-one-machine
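As a rough illustration of the idea in that post — a sketch only, assuming the model weights can be expressed as a single NumPy array and that all workers run on the same machine with Python 3.8+ — the standard library's multiprocessing.shared_memory lets one process publish the array and the others attach to it by name instead of loading their own copy:

    import numpy as np
    from multiprocessing import shared_memory

    # Producer side: load the weights once and copy them into a named shared block.
    weights = np.load("weights.npy")  # hypothetical weights file
    shm = shared_memory.SharedMemory(create=True, size=weights.nbytes, name="model_weights")
    src = np.ndarray(weights.shape, dtype=weights.dtype, buffer=shm.buf)
    src[:] = weights[:]

    # Consumer side (each worker): attach to the block by name. The shape and dtype
    # must be agreed on out of band, e.g. stored in a small metadata file.
    existing = shared_memory.SharedMemory(name="model_weights")
    view = np.ndarray(weights.shape, dtype=weights.dtype, buffer=existing.buf)
    # `view` maps the same physical memory, so there is no per-worker copy of the weights.

Whether this actually helps depends on whether the model framework can rebuild its objects around an externally managed buffer, which is the caveat raised later in this thread for PyTorch.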
This is not a FastAPI-specific question (it's more of a gunicorn one); it's about sharing memory between processes. The solution is to load the model into RAM before gunicorn forks the workers, so you need to use --preload:

    gunicorn --workers 2 --preload --worker-class=uvicorn.workers.UvicornWorker app.main:api

Your main.py file inside the app folder:

    # MY_MODEL and my_router are defined elsewhere in your code
    from fastapi import FastAPI

    def create_app():
        MY_MODEL.load("model_path")
        app = FastAPI()
        app.include_router(my_router)
        return app

    api = create_app()

If you have more questions about gunicorn, Python, fork, copy-on-write, Python reference counting, or memory leaks -> Stack Overflow. You can very probably CLOSE this issue, thank you :)
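For completeness, a self-contained sketch of what that app/main.py could look like — the Model wrapper, the /predict route, and the payload shape below are illustrative assumptions, not taken from the comment above:

    # app/main.py -- minimal sketch; names and paths are illustrative
    from fastapi import APIRouter, FastAPI


    class Model:
        """Hypothetical wrapper around whatever loads the real model (joblib, torch, ...)."""

        def __init__(self):
            self._model = None

        def load(self, path: str) -> None:
            # The heavy allocation happens here. With --preload it runs once in the
            # gunicorn master, and the workers inherit the pages via copy-on-write.
            self._model = object()  # placeholder for e.g. joblib.load(path)

        def predict(self, payload: dict) -> dict:
            # Placeholder inference; call self._model here in a real app.
            return {"prediction": None, "received": payload}


    MY_MODEL = Model()
    my_router = APIRouter()


    @my_router.post("/predict")
    def predict(payload: dict) -> dict:
        return MY_MODEL.predict(payload)


    def create_app() -> FastAPI:
        MY_MODEL.load("model_path")  # load before the app is created and before the fork
        app = FastAPI()
        app.include_router(my_router)
        return app


    api = create_app()

Run it with the gunicorn command above; because of --preload, the module-level api = create_app() executes once in the master process rather than once per worker.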
That's right.
Since I have the same problem, I'm going to try this. I suspect, though, that this will not be a viable solution: trying to share PyTorch models that way causes them to stop working. Whenever the model is used for inference (not always, but almost always), the worker hangs, resulting in a timeout and a consequent new worker spawn. Here's a thread discussing the same issue: benoitc/gunicorn#2157
I don't see how this is useful.
The preload option works and does what it is supposed to do; if it does not work in your case, that's a problem in your library or code.
This is working in production with a model taking more than 40 GB of RAM, shared by 8 workers.
That said, --preload is not magic and will not always work easily, depending on the memory to be shared, as with your PyTorch problem.
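One way to sanity-check whether --preload is actually sharing the model pages — purely a sketch, assuming a Linux host, that psutil is installed, and that it runs as the same user that owns the gunicorn processes — is to compare each worker's RSS with its USS (the memory unique to that process):

    import psutil

    # Inspect running gunicorn processes; run this while the workers are serving.
    for proc in psutil.process_iter(["pid", "cmdline"]):
        cmdline = " ".join(proc.info["cmdline"] or [])
        if "gunicorn" in cmdline:
            mem = proc.memory_full_info()  # on Linux this includes uss and pss
            print(proc.pid, f"rss={mem.rss >> 20} MiB", f"uss={mem.uss >> 20} MiB")
    # If preload is working, each worker's USS stays small while its RSS still
    # counts the model pages shared copy-on-write with the master.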
Great to hear; then please share as much detail as you can about that, because evidently it's not working for everyone, despite ... Is it a PyTorch model? Is it a pipeline? In my case, I use a ... That might be more useful in the thread I mentioned (or elsewhere) rather than here, since it's not a FastAPI problem.
Just found out that if I change my app methods from:
...
to:
...
removing the ... — @sevakharutyunyan, are you able to verify if this works for you?
removing ...
Thanks for the help here everyone! 👏 🙇 And thanks for closing the issue 👍 If anyone is having other related problems, please create new issues following the template. ☕
Recently, we deployed an ML model with FastAPI and encountered an issue.
The code looks like this:
...
The command above consumes a minimum of 3 GB of RAM.
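As a rough illustration of this kind of setup — the file layout, the joblib-based loading, and the run command below are assumptions for the sketch, not the exact code from this report — each worker imports the module and therefore holds its own copy of the model:

    # main.py -- sketch of the problematic setup
    import joblib  # illustrative; the report does not say which framework is used
    from fastapi import FastAPI

    model = joblib.load("model.joblib")  # loaded once per worker process

    app = FastAPI()


    @app.post("/predict")
    def predict(payload: dict) -> dict:
        # Illustrative inference call; the real endpoint depends on the model type.
        return {"prediction": model.predict([payload["features"]]).tolist()}

Run command assumed for the sketch:

    # Two workers -> two full copies of the model in RAM, which is why usage grows linearly.
    gunicorn -w 2 -k uvicorn.workers.UvicornWorker main:app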
Is there any way to scale the number of workers without consuming too much RAM?
ENVIRONMENT:
Ubuntu 18.04
Python 3.6.9
fastapi==0.61.2
uvicorn==0.12.2
gunicorn==20.0.4
uvloop==0.14.0
@tiangolo