Dataloader killed by signal at the end of training #43455

Closed
xkszltl opened this issue Aug 22, 2020 · 9 comments
Labels: module: dataloader, triaged

xkszltl (Contributor) commented Aug 22, 2020

🐛 Bug

Recently we hit a new issue after updating PyTorch:

  • Our job is based on Detectron2 and DDP across multiple nodes, with 4 dataloader workers per process (a rough sketch of this shape follows the list).
  • It consistently crashes at the end of training:
    • One node's dataloader is killed by a signal during shutdown.
    • The other nodes lose their connection to that node and hang with an NCCL error.
  • The issue only occurs with:
    • Multi-node jobs.
    • Very small datasets.
    • 8 GPUs per node (8*4 workers in total); 4-GPU jobs work just fine.
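
A rough, hypothetical sketch of that shape (not the actual Detectron2 job): one process per rank joins a process group and iterates a DataLoader with 4 workers over a tiny dataset. The backend, dataset, and world size here are stand-ins chosen only to make the snippet self-contained and runnable on CPU.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def run(rank, world_size):
    # Minimal process-group setup; the real job uses NCCL across >= 2 nodes.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # ~100 samples, matching the "very small dataset" condition in the report.
    dataset = TensorDataset(torch.arange(98).float())
    loader = DataLoader(dataset, batch_size=2, num_workers=4)
    for _ in loader:
        pass

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # stand-in for the 8-GPU-per-node, multi-node layout
    mp.spawn(run, args=(world_size,), nprocs=world_size)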

Here's the stack trace:

[08/22 14:35:16 d2.data.common]: Serializing 98 elements to byte tensors and concatenating them all ...
[08/22 14:35:16 d2.data.common]: Serialized dataset takes 0.04 MiB
[08/22 14:35:16 d2.evaluation.evaluator]: Start inference on 7 images
ERROR [08/22 14:35:21 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/train_loop.py", line 133, in train
    self.after_step()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/train_loop.py", line 154, in after_step
    h.after_step()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/hooks.py", line 350, in after_step
    self._do_eval()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/hooks.py", line 324, in _do_eval
    results = self._func()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/defaults.py", line 352, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.model)
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/defaults.py", line 518, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/evaluation/evaluator.py", line 135, in inference_on_dataset
    for idx, inputs in enumerate(data_loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 400, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1006, in _next_data
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1119, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 47, in wait
    if not wait([self.sentinel], timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 25563) is killed by signal: Terminated. 

I think this PR is probably the reason: #39869
It may be too aggressive. Since the issue is limited to very small datasets (< 100 images), maybe the workers are still busy warming up and shouldn't be forced to quit that fast?
Also, unlike other places that use MP_STATUS_CHECK_INTERVAL, the dataloader runs user-supplied callbacks (e.g. the batcher), so a flat 5-second timeout doesn't make much sense.
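
For reference, the shutdown behavior under discussion boils down to a join-then-terminate loop with MP_STATUS_CHECK_INTERVAL as the per-worker grace period. The following is a simplified, standalone sketch of that pattern, not the actual torch/utils/data/dataloader.py source; slow_worker is just a stand-in for a worker that is still busy.

import multiprocessing as mp
import time

MP_STATUS_CHECK_INTERVAL = 5.0  # the constant discussed in this thread

def slow_worker():
    # Stand-in for a dataloader worker that is still busy (e.g. warming up).
    time.sleep(30)

if __name__ == "__main__":
    workers = [mp.Process(target=slow_worker) for _ in range(4)]
    for w in workers:
        w.start()

    # Each worker gets MP_STATUS_CHECK_INTERVAL seconds to exit on its own,
    # and is sent SIGTERM if it is still alive afterwards.
    for w in workers:
        w.join(timeout=MP_STATUS_CHECK_INTERVAL)
        if w.is_alive():
            w.terminate()
        w.join()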

Environment

  • PyTorch Version (e.g., 1.0): master
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): cmake+ninja+gcc-8
  • Python version: 3.6.9
  • CUDA/cuDNN version: 11.0/8
  • GPU models and configuration: V100x8 per node, >=2 nodes

cc @ssnl @VitalyFedyunin

xkszltl (Contributor, Author) commented Aug 23, 2020

So far I have only found 2 ways to get it working:

  1. Only use 1 worker.
  2. Set MP_STATUS_CHECK_INTERVAL to a very large value (e.g. 300) before creating the dataloader (see the sketch below).
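
A minimal sketch of workaround 2, assuming a tiny stand-in dataset in place of the real one. Note that MP_STATUS_CHECK_INTERVAL lives in the private torch.utils.data._utils module, so overriding it may break on future releases.

import torch
import torch.utils.data as data
import torch.utils.data._utils as _utils

# Raise the worker status-check / shutdown timeout from the default 5 s to 300 s.
_utils.MP_STATUS_CHECK_INTERVAL = 300

# Tiny placeholder dataset; the real job uses a Detectron2 dataset instead.
dataset = data.TensorDataset(torch.arange(98).float())
loader = data.DataLoader(dataset, batch_size=2, num_workers=4)

for batch in loader:
    pass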

ssnl (Collaborator) commented Aug 24, 2020

Well, actually, now that I think about it, the patch you linked should not matter, per:

// When an error happened in DataLoader methods and Python starts to exit, the
// error trace will keep the loader alive, and Python may kill the children
// processes first before deleting the loader object. Then the cleaning up
// methods in DataLoader.__del__ are not yet called, and SIGCHILD will print an
// error saying a worker is killed by SIGTERM. So we suppress SIGTERM from main
// loader process here to avoid this by _exit(EXIT_SUCCESS). Note that if we
// exit with nonzero code, the loader SIGCHLD handler may report RuntimeError
// again, and then it defeats the whole purpose.
static void handler_SIGTERM(int sig, siginfo_t *info, void *ctx)
{
  if (info->si_pid == getppid()) {
    _exit(EXIT_SUCCESS);
  }
  struct sigaction sa{};
  sa.sa_handler = SIG_DFL;
  sa.sa_flags = 0;
  if (sigemptyset(&sa.sa_mask) != 0 || sigaction(SIGTERM, &sa, nullptr) != 0) {
    _exit(EXIT_FAILURE);
  } else {
    raise(SIGTERM);
  }
}

@ngimel added the module: dataloader and triaged labels on Aug 24, 2020
xkszltl (Contributor, Author) commented Aug 25, 2020

@ssnl
I don't think I understand this part. Could you clarify?
When I set torch.utils.data._utils.MP_STATUS_CHECK_INTERVAL = 300 (the default is 5) before creating the dataloader, the issue disappears, which suggests the worker is being killed too early.

ssnl (Collaborator) commented Aug 25, 2020

@xkszltl It means that if the worker is killed by the parent, it won't report an error, so the patch you linked should be irrelevant. Additionally, the killing added in that patch should generally not happen if all processes communicate properly with each other; it should really be viewed as a last resort.

xkszltl (Contributor, Author) commented Aug 25, 2020

I don't know why the handler does not intercept SIGTERM in this case.
Regarding the relation to the PR: I agree the kill is designed as a last resort, but it clearly fires too early and interferes with a normally running dataloader.

To be clear, the behavior reproduces reliably on our side, and the only difference between the working and non-working code is the change to MP_STATUS_CHECK_INTERVAL. So it is not a case of something in our code crashing or being killed by accident.

ssnl (Collaborator) commented Aug 25, 2020

A lot of places depend on that constant; are you sure it is caused by that part? I also do not think that waiting 5 minutes per worker is reasonable. 5 seconds should be fine unless something has gone wrong with interprocess communication.

xkszltl (Contributor, Author) commented Aug 26, 2020 via email

ssnl (Collaborator) commented Aug 26, 2020

That is a good point. But still, a SIGTERM from the parent process shouldn't trigger the error, so it's unclear to me what causes the problem. Maybe you can try the linked patch and see if it solves the issue?

xkszltl (Contributor, Author) commented Aug 26, 2020

Two questions about that fix:

  1. What difference does moving the termination into finally make? To me it sounds like the first few workers get a longer grace period (while we are still waiting on workers later in the list).
  2. I'm not very familiar with multiprocessing in Python, but for if w.is_alive(): w.terminate(), without a mutex, does it make sense to check? is_alive() may change between the if and terminate(), right? (See the sketch after this list.)
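
A standalone sketch of the check-then-terminate pattern in question (not the torch source). My understanding is that the race is benign in this context: if the worker exits between is_alive() and terminate(), terminate() just sends SIGTERM to a child that has exited but has not yet been joined (a zombie), which has no effect.

import multiprocessing as mp
import time

def worker():
    time.sleep(0.1)

if __name__ == "__main__":
    w = mp.Process(target=worker)
    w.start()
    w.join(timeout=0.05)  # may time out before the worker finishes

    # The worker can exit right here, between the join timeout and the
    # liveness check, or between the check and terminate(); either way the
    # parent still holds the child's pid until the final join() reaps it.
    if w.is_alive():
        w.terminate()
    w.join()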

shaibagon pushed a commit to shaibagon/pytorch that referenced this issue Dec 3, 2020
…torch#43462)

Summary:
Fixes pytorch#43455

Pull Request resolved: pytorch#43462

Reviewed By: bdhirsh

Differential Revision: D25277759

Pulled By: VitalyFedyunin

fbshipit-source-id: 0bb0d87374c0403853d71aac2c242374bfc7acf2