
Serving ML models with multiple workers increases RAM usage linearly with the number of workers #2425

@sevakharutyunyan

Description


Recently, we deployed an ML model with FastAPI and ran into a memory issue when scaling out workers.

The code looks roughly like this:

import json

from fastapi import FastAPI, File, UploadFile

from ocr_pipeline.model.ocr_wrapper import OcrWrapper

api = FastAPI()
# `config` is the project's settings object (definition omitted here)
ocr_wrapper = OcrWrapper(**config.model_load_params)  # loads a 1.5 GB PyTorch model at import time

...

@api.post('/')
async def predict(file: UploadFile = File(...)):
    preds = ocr_wrapper.predict(file.file, **config.model_predict_params)
    return json.dumps({"data": preds})

With two workers, the command below consumes at least 3 GB of RAM, since each worker process loads its own copy of the model:

gunicorn --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api
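For reference, the per-process memory can be inspected with something like the following; the RSS column should show each worker holding roughly a full copy of the model:

ps aux | grep gunicorn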

Is there any way to scale the number of workers without consuming too much RAM?
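One direction that might help, sketched here as an assumption rather than a tested fix: gunicorn's --preload flag imports the application (and therefore loads the model) once in the master process before forking, so workers can share the model's memory pages via copy-on-write. How much RAM this actually saves depends on whether those pages get written to after the fork (Python reference counting and framework initialization can gradually force copies), and if the model lives on a GPU, CUDA generally cannot be initialized before the fork.

gunicorn --preload --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api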

ENVIRONMENT:
Ubuntu 18.04
Python 3.6.9

fastapi==0.61.2
uvicorn==0.12.2
gunicorn==20.0.4
uvloop==0.14.0

