
Serving ML models with multiple workers increases RAM usage linearly with the number of workers #2425

@sevakharutyunyan

Description


Recently, we deployed an ML model with FastAPI and ran into a memory issue when scaling out workers.

The code looks roughly like this:

import json

from fastapi import FastAPI, File, UploadFile

from ocr_pipeline.model.ocr_wrapper import OcrWrapper

api = FastAPI()
# `config` is the project's settings object (definition omitted here)
ocr_wrapper = OcrWrapper(**config.model_load_params)  # loads a 1.5 GB PyTorch model at import time

...

@api.post('/')
async def predict(file: UploadFile = File(...)):
    preds = ocr_wrapper.predict(file.file, **config.model_predict_params)
    return json.dumps({"data": preds})

With two workers, the command below consumes at least 3 GB of RAM, since each worker process loads its own copy of the model:

gunicorn --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api
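For reference, the per-process memory can be inspected with something like the following; the RSS column should show each worker holding roughly a full copy of the model:

ps aux | grep gunicorn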

Is there any way to scale the number of workers without consuming too much RAM?
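One direction that might help, sketched here as an assumption rather than a tested fix: gunicorn's --preload flag imports the application (and therefore loads the model) once in the master process before forking, so workers can share the model's memory pages via copy-on-write. How much RAM this actually saves depends on whether those pages get written to after the fork (Python reference counting and framework initialization can gradually force copies), and if the model lives on a GPU, CUDA generally cannot be initialized before the fork.

gunicorn --preload --workers 2 --worker-class=uvicorn.workers.UvicornWorker app.main:api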

ENVIRONMENT:
Ubuntu 18.04
Python 3.6.9

fastapi==0.61.2
uvicorn==0.12.2
gunicorn==20.0.4
uvloop==0.14.0

