
TorchServe fails to start multiple worker threads on multiple GPUs with large model. #71

Closed
fbbradheintz opened this issue Feb 26, 2020 · 14 comments
Labels: bug, close_after_merge, launch blocker, staged_release

Comments

@fbbradheintz (Contributor)

On a c5.12xlarge instance, I was able to run 16 instances of the FairSeq English-to-German translation model, all simultaneously running translations. This model's weights take up about 2.5GB on disk (though its resident footprint in memory seems smaller).

Attempting a similar feat on a p3.8xlarge turned out to be impossible. I could get a single instance running, but when I attempted to start even 4 workers, they crashed repeatedly with CUDA out-of-memory errors:

2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Torch worker started.
2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Python runtime: 3.6.6
2020-02-26 02:46:34,454 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fairseq_model_1.0 State change WORKER_STOPPED -> WORKER_STARTED
2020-02-26 02:46:34,454 [INFO ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2020-02-26 02:46:34,455 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Connection accepted: /tmp/.ts.sock.9001.
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Backend worker process die.
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - Traceback (most recent call last):
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 163, in <module>
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     worker.run_server()
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 141, in run_server
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     self.handle_connection(cl_socket)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 105, in handle_connection
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service, result, code = self.load_model(msg)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_service_worker.py", line 83, in load_model
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     service = model_loader.load(model_name, model_dir, handler, gpu, batch_size)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/ts/model_loader.py", line 107, in load
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     entry_point(None, service.context)
2020-02-26 02:46:38,734 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 120, in handle
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     raise e
2020-02-26 02:46:38,735 [INFO ] epollEventLoopGroup-4-15 org.pytorch.serve.wlm.WorkerThread - 9001 Worker disconnected. WORKER_STARTED
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 109, in handle
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     _service.initialize(context)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/tmp/models/fa2ad7fda70376da33595966d0cf3c38702ea6d1/fairseq_handler.py", line 73, in initialize
2020-02-26 02:46:38,735 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Backend worker monitoring thread interrupted or backend worker process died.
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at org.pytorch.serve.wlm.WorkerThread.run(WorkerThread.java:128)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     state_dict = torch.load(model_pt_path, map_location=self.device)
2020-02-26 02:46:38,735 [WARN ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.BatchAggregator - Load model failed: fairseq_model, error: Worker died.
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
2020-02-26 02:46:38,735 [DEBUG] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-fairseq_model_1.0 State change WORKER_STARTED -> WORKER_STOPPED
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 702, in _legacy_load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     result = unpickler.load()
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 665, in persistent_load
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     deserialized_objects[root_key] = restore_location(obj, location)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0 org.pytorch.serve.wlm.WorkerThread - Retry worker: 9001 in 21 seconds.
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 740, in restore_location
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return default_restore_location(storage, str(map_location))
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 156, in default_restore_location
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     result = fn(storage, location)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return storage_type(obj.size())
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -   File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle -     return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)

Digging through the logs, it appears that TorchServe is attempting to start all workers on the same GPU. The following is the output of grep GPU ts_log.log:

2020-02-26 02:45:13,375 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 3.03 GiB already allocated; 19.12 MiB free; 3.03 GiB reserved in total by PyTorch)
2020-02-26 02:45:13,375 [INFO ] W-9000-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 3.16 GiB already allocated; 19.12 MiB free; 3.16 GiB reserved in total by PyTorch)
2020-02-26 02:45:13,383 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.60 GiB already allocated; 19.12 MiB free; 2.60 GiB reserved in total by PyTorch)
2020-02-26 02:45:26,382 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 15.75 GiB total capacity; 3.05 GiB already allocated; 19.12 MiB free; 3.05 GiB reserved in total by PyTorch)
2020-02-26 02:45:26,519 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.75 GiB total capacity; 2.47 GiB already allocated; 19.12 MiB free; 2.47 GiB reserved in total by PyTorch)
2020-02-26 02:45:38,899 [INFO ] W-9003-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.89 GiB already allocated; 7.12 MiB free; 2.89 GiB reserved in total by PyTorch)
2020-02-26 02:45:38,904 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 15.75 GiB total capacity; 2.65 GiB already allocated; 7.12 MiB free; 2.65 GiB reserved in total by PyTorch)
2020-02-26 02:45:52,201 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 15.75 GiB total capacity; 3.05 GiB already allocated; 19.12 MiB free; 3.05 GiB reserved in total by PyTorch)
2020-02-26 02:45:59,593 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:08,962 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:21,358 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)
2020-02-26 02:46:38,735 [INFO ] W-9001-fairseq_model_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - RuntimeError: CUDA out of memory. Tried to allocate 802.00 MiB (GPU 0; 15.75 GiB total capacity; 1.69 GiB already allocated; 605.12 MiB free; 1.69 GiB reserved in total by PyTorch)

Note that multiple workers (W-9000, W-9001, W-9003) are shown, but only one GPU turns up (GPU 0). The p3.8xlarge has 4 GPUs.

I attempted to fix this with Management API arguments such as number_gpu=4, but nothing worked; the result was the same every time.

@mycpuorg (Collaborator)

We only do a round-robin distribution of workers across GPUs:
https://github.com/pytorch/serve/blob/master/frontend/server/src/main/java/org/pytorch/serve/wlm/WorkLoadManager.java#L152

Looks like it needs testing with large models.
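
For reference, the assignment amounts to roughly the following (a simplified Python sketch of the linked Java code in WorkLoadManager; names are illustrative):

def assign_gpu(worker_number, num_gpus):
    # Round-robin: worker 0 -> GPU 0, worker 1 -> GPU 1, ..., wrapping around.
    if num_gpus == 0:
        return None  # CPU-only host
    return worker_number % num_gpus

The frontend only picks a gpu_id per worker; the handler still has to use that id when it builds its torch.device, otherwise every worker ends up on GPU 0.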

@fbbradheintz (Contributor, Author) commented on Feb 26, 2020

It would be helpful for testing and diagnosis if we also got logging when a worker successfully attached to a GPU - right now I can only see the intended device target when there's a failure. There's no way to verify correct behavior from the logs.

fbbradheintz changed the title from "TorchServe fails to start multiple worker threads on GPU with large model." to "TorchServe fails to start multiple worker threads on multiple GPUs with large model." on Feb 27, 2020
fbbradheintz added the launch blocker and bug labels on Feb 27, 2020
@mycpuorg (Collaborator)

@fbbradheintz Can you share the model archive you tested this with?

@fbbradheintz (Contributor, Author)

The files are very large. It will take me a while to upload them.

@fbbradheintz (Contributor, Author)

@mycpuorg - Your email inbox should contain a Dropbox link to a folder containing:

  • CPU and GPU copies of the Fairseq en->de translation model
  • Code for this model
  • A model handler
  • Sample input
  • A README with the command line I used for testing.

Please let me know if there's anything else you need in relation to this.

@harshbafna (Contributor)

We are currently analyzing this issue with the shared .mar file.

@harshbafna (Contributor) commented on Mar 3, 2020

@fbbradheintz :

We need to specify the GPU id when creating the torch.device object so that every worker uses a different GPU. The following small change in your custom model handler resolves the issue:

  • Set device to None in the handler's __init__ method:
self.device = None
  • Add the following line in your handler's initialize method, so that each worker uses a different GPU device (a fuller handler sketch is shown after this list):
self.device = torch.device("cuda:" + str(properties.get("gpu_id")) if torch.cuda.is_available() else "cpu")
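
For clarity, here is a minimal sketch of the resulting handler pattern, assuming a handler along the lines of the fairseq_handler.py shown in the traceback above (the class name and the model.pt file name are illustrative, not taken from the shared archive):

import os
import torch

class FairseqTranslationHandler:  # illustrative name
    def __init__(self):
        # Do not pin a device here; the assigned GPU is only known once the worker starts.
        self.device = None
        self.model = None
        self.initialized = False

    def initialize(self, context):
        properties = context.system_properties
        # Use the GPU that TorchServe assigned to this worker (round-robin),
        # falling back to CPU when CUDA is unavailable.
        self.device = torch.device(
            "cuda:" + str(properties.get("gpu_id"))
            if torch.cuda.is_available()
            else "cpu"
        )
        model_dir = properties.get("model_dir")
        model_pt_path = os.path.join(model_dir, "model.pt")  # illustrative file name
        state_dict = torch.load(model_pt_path, map_location=self.device)
        # ... rebuild the model from state_dict and move it to self.device ...
        self.initialized = True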

After the above changes, I was able to register the fairseq model with TorchServe.

ubuntu@ip-172-31-13-12:~$ curl http://127.0.0.1:8081/models/fairseq_model
[
  {
    "modelName": "fairseq_model",
    "modelVersion": "1.0",
    "modelUrl": "fairseq_gpu.mar",
    "runtime": "python3",
    "minWorkers": 4,
    "maxWorkers": 4,
    "batchSize": 1,
    "maxBatchDelay": 100,
    "loadedAtStartup": true,
    "workers": [
      {
        "id": "9000",
        "startTime": "2020-03-03T17:03:04.999Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 0
      },
      {
        "id": "9001",
        "startTime": "2020-03-03T17:03:05.001Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 0
      },
      {
        "id": "9002",
        "startTime": "2020-03-03T17:03:05.001Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 0
      },
      {
        "id": "9003",
        "startTime": "2020-03-03T17:03:05.002Z",
        "status": "READY",
        "gpu": true,
        "memoryUsage": 0
      }
    ]
  }
]

Could you please try the above changes and confirm?

Note: We have also updated the MNIST example's custom handler with a similar change in the "stage_release" branch.

harshbafna added the triaged_wait label on Mar 3, 2020
mycpuorg added the staged_release label on Mar 3, 2020
@chauhang (Contributor) commented on Mar 4, 2020

@harshbafna Is this the Java properties or context.system_properties? It would help to update one of the examples to use GPUs via the proposed solution.

@harshbafna (Contributor)

@chauhang: It is context.system_properties, as indicated in the previous comment. We have already updated the custom handler example (MNIST digit classifier) in the "stage_release" branch:

https://github.com/pytorch/serve/blob/stage_release/examples/image_classifier/mnist/mnist_handler.py

All the default handlers have been updated to use the same pattern as well.

@fbbradheintz (Contributor, Author)

I can confirm that this works with the changes prescribed by @harshbafna - thank you for the added guidance.

The thing I don't see is documentation. If we don't tell people about this quirk in custom handlers, they're going to do the same thing I did and file a bug like I did.

Please document this, preferably in examples/image_classifier/mnist/README.md, and then we can close this after merge.

@harshbafna (Contributor)

@fbbradheintz,
We have updated the documentation in the stage_release branch. Instead of adding the details to mnist/README, we added them to the custom_service documentation and referenced the mnist_handler there.

Could you please review the following commit: 7a029ef

mycpuorg added this to the Experimental_Launch milestone on Mar 6, 2020
@fbbradheintz (Contributor, Author)

Looks good! Please close after merging.

fbbradheintz added the close_after_merge label and removed the triaged_wait label on Mar 9, 2020
lokeshgupta1975 added this to High priority in IssuesBacklog on Mar 12, 2020
mycpuorg added a commit that referenced this issue Mar 12, 2020
Stage release - Merge 

Periodic merge from the stage_release branch. The highlights of this merge are listed below.
When things settle down a bit I expect this to slow down, and we can give each merge a new RC version.

Highlights:
Fixed:

    Benchmarks have dependency on Mxnet #72
    TorchServe fails to start multiple worker threads on multiple GPUs with large model #71
    Java concurrency crash when attempting batch processing #66
    Add handler for audio models with an example #60
    Undocumented options for config.properties #55
    Can't log custom metrics #53
    Add example for Custom Service #49

WIP:

    benchmark dependencies install script failing on fresh ubuntu 18.04 #36
    Unable to install on GPU machine #30
    Incorrect docs for --model-store option #23
IssuesBacklog automation moved this from High priority to Closed Mar 12, 2020
@okgrammer

Hi @fbbradheintz, I am interested in deploying my Fairseq models with TorchServe. Would it be possible to share some of the examples or pointers you used above? I noticed that in the MNIST example a custom model file is given, but it is only a single file. What is the correct way to do it when the model definition spans multiple files, as in Fairseq?

@harshbafna (Contributor)

@okgrammer, you can supply all the other dependency files via the --extra-files parameter, as a comma-separated list, when creating the model archive.

You can refer to the image_segmenter example, which supplies multiple .py files via the --extra-files parameter when creating the model archive.
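
For illustration, a multi-file model archive can be built with a command along these lines (the file names here are placeholders, not the actual Fairseq files):

torch-model-archiver --model-name fairseq_model --version 1.0 \
    --serialized-file model.pt \
    --handler fairseq_handler.py \
    --extra-files "fairseq_model_def.py,data_utils.py,vocab.txt" \
    --export-path model_store

Everything passed via --extra-files is extracted into the same model directory as the handler, so the handler can import or open those files in its initialize method.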
