Gunicorn preload flag not working with Stanza #570

Closed · hsilveiro opened this issue Dec 17, 2020 · 6 comments

@hsilveiro

hsilveiro commented Dec 17, 2020

Hello,
We have been developing a FastAPI application that uses some external libraries to perform NLP tasks such as tokenization. On top of this, we launch the service with Gunicorn so that requests can be handled in parallel.
However, we are having difficulties using Stanza with Gunicorn’s --preload flag active.
Using this flag is a requirement for us: since Stanza models can be large, we want them to be loaded only once, in the Gunicorn master process, so that the workers can access the models previously loaded there.

The difficulty we are facing is that Gunicorn workers hang when trying to run inference with a model that was loaded beforehand by the master process.

We’ve done some research and debugging but weren’t able to find a solution. We did notice that the worker hangs when the code reaches the prediction step in PyTorch.
Although we are talking about Stanza here, the same problem also occurred with the Sentence Transformers library, and both of them use PyTorch.

More details follow:

Environment:

FastAPI version: 0.54.2
Gunicorn version: 20.0.4
Uvicorn version: 0.12.3
Python version: 3.7
Stanza version: 1.1.1
OS: macOS Catalina 10.15.6

Steps executed:

  • Gunicorn command:
gunicorn --workers 1 --worker-class uvicorn.workers.UvicornWorker --max-requests=0 --max-requests-jitter=0 --timeout=120 --keep-alive=2 \
     --log-level=info --access-logfile - --preload -b 0.0.0.0:8010 my_app:app
  • The code that runs before the workers are launched:
def initialize_application() -> None:
    ...
    model = stanza.Pipeline(
        lang=cls._TOKENIZER_MODEL_LANGUAGES[language],
        package=cls._MODEL_TYPE[language],
        processors=cls._TOKENIZER_MODEL,
        tokenize_no_ssplit=True,
    )

This way, the model is loaded only once, in the master process.
Once the workers are launched, they should have access to the previously loaded model without having to load it themselves (saving computational resources).
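
For reference, here is a minimal self-contained sketch of the pattern we are describing (the module name, endpoint, and tokenize-only pipeline are illustrative, not our exact code):

# my_app.py -- illustrative sketch only; the real application is larger
import stanza
from fastapi import FastAPI

# Loaded at import time, i.e. in the Gunicorn master process when --preload is used;
# the workers are then forked and inherit the already-loaded model.
nlp = stanza.Pipeline(lang="en", processors="tokenize", tokenize_no_ssplit=True)

app = FastAPI()

@app.get("/tokenize")
def tokenize(text: str):
    # This is the kind of call that hangs in a worker when --preload is active.
    doc = nlp(text)
    return {"tokens": [token.text for sentence in doc.sentences for token in sentence.tokens]}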

The problem happens when we receive a request that uses the model loaded at startup. The worker responsible for handling the request is not able to run inference with the model, so it hangs until the timeout occurs.

After analyzing and debugging the code, we traced the following steps up to the point where it stops working (a stripped-down sketch of the triggering pattern follows this list):

  1. Our code calls the process() method of the Pipeline class, in Stanza’s core.py file.
  2. That call dispatches to the processor-specific process() method, in this case in tokenize_processor.py, class TokenizeProcessor.
  3. That method calls output_predictions() in utils.py, which runs the PyTorch model.
  4. After a few more steps it reaches model.py, class Tokenizer(nn.Module), method forward(self, x, feats), at the line nontok = F.logsigmoid(-tok0). This line ends up calling native (C++) PyTorch code, which we did not investigate further.
Of course, if we remove the --preload flag, everything runs smoothly. We want to avoid removing it, though, because of the additional computational resources it would require (the models would be duplicated in every worker).

We looked through several other issues that could be related to this one, such as:
benoitc/gunicorn#2157
tiangolo/fastapi#2425
tiangolo/fastapi#596
benoitc/gunicorn#2124
and others...

After trying multiple approaches, we weren’t able to solve the issue. Do you have any suggestions for handling this, or other tests we can run to give you more information?

Thanks in advance.

P.S.: I also opened issues on the Gunicorn and PyTorch GitHub pages.

@AngledLuffa (Collaborator)

I'm afraid I will have to ask around our lab... I have zero experience doing such a thing. The one time I served any deep learning models as part of a website, it was just Flask and a custom built model.

I suppose that if I can't find any answers, I might find some time over the holidays to work on an old personal project of mine which could be improved with a pytorch model. (Then again, what can't be improved with pytorch models? aside from gunicorn services)

@AngledLuffa (Collaborator)

A couple of my labmates reported successfully serving CPU-only torch models with gunicorn. They also pointed out that any parallelism benefit will be limited if you have multiple workers using the same GPU model anyway. How do you feel about trying CPU stanza?

Most of the models have their own cpu flag, such as

parser.add_argument('--cuda', type=bool, default=torch.cuda.is_available())

and the whole Pipeline has the use_gpu flag, which you would set to False in this case:

def __init__(self, lang='en', dir=DEFAULT_MODEL_DIR, package='default', processors={}, logging_level='INFO', verbose=None, use_gpu=True, **kwargs):
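
For example, a CPU-only tokenize pipeline would look something like the following minimal sketch (the language and processor choice here are just placeholders):

import stanza

# use_gpu=False forces the whole pipeline to run on the CPU.
nlp = stanza.Pipeline(lang="en", processors="tokenize", use_gpu=False)
doc = nlp("This sentence is tokenized entirely on the CPU.")
print([token.text for token in doc.sentences[0].tokens])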

@hsilveiro (Author)

Hi again,
We are only using the CPU, but thanks for the warning; I added use_gpu=False when creating the Pipeline.
However, the problem remains.
Do you have any other suggestions?
Thanks

@AngledLuffa (Collaborator)

AngledLuffa commented Dec 22, 2020 via email

@stale

stale bot commented Feb 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 20, 2021
@stale

stale bot commented Feb 27, 2021

This issue has been automatically closed due to inactivity.

@stale stale bot closed this as completed Feb 27, 2021