Dataloader killed by signal at the end of training #43455

Closed
xkszltl opened this issue Aug 22, 2020 · 9 comments
Labels: module: dataloader, triaged

xkszltl (Contributor) commented Aug 22, 2020

🐛 Bug

Recently we hit a new issue after updating PyTorch:

  • Our job is based on Detectron2 and DDP across multiple nodes, with 4 dataloader workers per process (a rough sketch of this shape follows the list).
  • It consistently crashes at the end of training:
    • One node's dataloader is killed by a signal during shutdown.
    • The other nodes lose their connection to that node and hang with an NCCL error.
  • The issue only occurs with:
    • Multi-node jobs.
    • Very small datasets.
    • 8 GPUs per node (8*4 workers in total); 4-GPU jobs work just fine.
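
A rough, hypothetical sketch of that shape (not the actual Detectron2 job): one process per rank joins a process group and iterates a DataLoader with 4 workers over a tiny dataset. The backend, dataset, and world size here are stand-ins chosen only to make the snippet self-contained and runnable on CPU.

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

def run(rank, world_size):
    # Minimal process-group setup; the real job uses NCCL across >= 2 nodes.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # ~100 samples, matching the "very small dataset" condition in the report.
    dataset = TensorDataset(torch.arange(98).float())
    loader = DataLoader(dataset, batch_size=2, num_workers=4)
    for _ in loader:
        pass

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2  # stand-in for the 8-GPU-per-node, multi-node layout
    mp.spawn(run, args=(world_size,), nprocs=world_size)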

Here's the stack trace:

[08/22 14:35:16 d2.data.common]: Serializing 98 elements to byte tensors and concatenating them all ...
[08/22 14:35:16 d2.data.common]: Serialized dataset takes 0.04 MiB
[08/22 14:35:16 d2.evaluation.evaluator]: Start inference on 7 images
ERROR [08/22 14:35:21 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/train_loop.py", line 133, in train
    self.after_step()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/train_loop.py", line 154, in after_step
    h.after_step()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/hooks.py", line 350, in after_step
    self._do_eval()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/hooks.py", line 324, in _do_eval
    results = self._func()
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/defaults.py", line 352, in test_and_save_results
    self._last_eval_results = self.test(self.cfg, self.model)
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/engine/defaults.py", line 518, in test
    results_i = inference_on_dataset(model, data_loader, evaluator)
  File "/mnt/scratch/PyTorchDetector/local_rank_0/detectron2/detectron2/evaluation/evaluator.py", line 135, in inference_on_dataset
    for idx, inputs in enumerate(data_loader):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 400, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1006, in _next_data
    self._shutdown_workers()
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/dataloader.py", line 1119, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 124, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 47, in wait
    if not wait([self.sentinel], timeout):
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 911, in wait
    ready = selector.select(timeout)
  File "/usr/lib/python3.6/selectors.py", line 376, in select
    fd_event_list = self._poll.poll(timeout)
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 25563) is killed by signal: Terminated. 

I think this PR is probably the reason: #39869
It may be too aggressive. Since the issue is limited to very small datasets (< 100 images), maybe the workers are still busy warming up and shouldn't be forced to quit that fast?
Also, unlike other places that use MP_STATUS_CHECK_INTERVAL, the dataloader runs user-supplied callbacks (e.g. the batcher), so a flat 5-second timeout doesn't make much sense.
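
For reference, the shutdown behavior under discussion boils down to a join-then-terminate loop with MP_STATUS_CHECK_INTERVAL as the per-worker grace period. The following is a simplified, standalone sketch of that pattern, not the actual torch/utils/data/dataloader.py source; slow_worker is just a stand-in for a worker that is still busy.

import multiprocessing as mp
import time

MP_STATUS_CHECK_INTERVAL = 5.0  # the constant discussed in this thread

def slow_worker():
    # Stand-in for a dataloader worker that is still busy (e.g. warming up).
    time.sleep(30)

if __name__ == "__main__":
    workers = [mp.Process(target=slow_worker) for _ in range(4)]
    for w in workers:
        w.start()

    # Each worker gets MP_STATUS_CHECK_INTERVAL seconds to exit on its own,
    # and is sent SIGTERM if it is still alive afterwards.
    for w in workers:
        w.join(timeout=MP_STATUS_CHECK_INTERVAL)
        if w.is_alive():
            w.terminate()
        w.join()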

Environment

  • PyTorch Version (e.g., 1.0): master
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): source
  • Build command you used (if compiling from source): cmake+ninja+gcc-8
  • Python version: 3.6.9
  • CUDA/cuDNN version: 11.0/8
  • GPU models and configuration: V100x8 per node, >=2 nodes

cc @ssnl @VitalyFedyunin

xkszltl (Contributor, Author) commented Aug 23, 2020

So far I have only found 2 ways to get it working:

  1. Only use 1 worker.
  2. Set MP_STATUS_CHECK_INTERVAL to a very large value (e.g. 300) before creating the dataloader (see the sketch below).
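
A minimal sketch of workaround 2, assuming a tiny stand-in dataset in place of the real one. Note that MP_STATUS_CHECK_INTERVAL lives in the private torch.utils.data._utils module, so overriding it may break on future releases.

import torch
import torch.utils.data as data
import torch.utils.data._utils as _utils

# Raise the worker status-check / shutdown timeout from the default 5 s to 300 s.
_utils.MP_STATUS_CHECK_INTERVAL = 300

# Tiny placeholder dataset; the real job uses a Detectron2 dataset instead.
dataset = data.TensorDataset(torch.arange(98).float())
loader = data.DataLoader(dataset, batch_size=2, num_workers=4)

for batch in loader:
    pass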

ssnl (Collaborator) commented Aug 24, 2020

Well, actually, now that I think about it, the patch you linked should not matter, per:

// When an error happened in DataLoader methods and Python starts to exit, the
// error trace will keep the loader alive, and Python may kill the children
// processes first before deleting the loader object. Then the cleaning up
// methods in DataLoader.__del__ are not yet called, and SIGCHILD will print an
// error saying a worker is killed by SIGTERM. So we suppress SIGTERM from main
// loader process here to avoid this by _exit(EXIT_SUCCESS). Note that if we
// exit with nonzero code, the loader SIGCHLD handler may report RuntimeError
// again, and then it defeats the whole purpose.
static void handler_SIGTERM(int sig, siginfo_t *info, void *ctx)
{
  if (info->si_pid == getppid()) {
    _exit(EXIT_SUCCESS);
  }
  struct sigaction sa{};
  sa.sa_handler = SIG_DFL;
  sa.sa_flags = 0;
  if (sigemptyset(&sa.sa_mask) != 0 || sigaction(SIGTERM, &sa, nullptr) != 0) {
    _exit(EXIT_FAILURE);
  } else {
    raise(SIGTERM);
  }
}

@ngimel added the module: dataloader and triaged labels on Aug 24, 2020
xkszltl (Contributor, Author) commented Aug 25, 2020

@ssnl
I don't think I understand this part. Could you clarify?
When I set torch.utils.data._utils.MP_STATUS_CHECK_INTERVAL = 300 (the default is 5) before creating the dataloader, the issue disappears, which suggests the worker is being killed too early.

ssnl (Collaborator) commented Aug 25, 2020

@xkszltl It means that if the worker is killed by the parent, it won't report an error, so the patch you linked should be irrelevant. Additionally, the killing added in that patch should generally not happen if all processes communicate properly with each other; it should really be viewed as a last resort.

xkszltl (Contributor, Author) commented Aug 25, 2020

I don't know why the handler does not intercept SIGTERM in this case.
Regarding the relation to the PR: I agree the kill is designed as a last resort, but it clearly fires too early and interferes with a normally running dataloader.

To be clear, the behavior reproduces reliably on our side, and the only difference between the working and non-working code is the change to MP_STATUS_CHECK_INTERVAL. So it is not a case of something in our code crashing or being killed by accident.

ssnl (Collaborator) commented Aug 25, 2020

A lot of places depend on that constant; are you sure it is caused by that part? I also do not think that waiting 5 minutes per worker is reasonable. 5 seconds should be fine unless something has gone wrong with interprocess communication.

xkszltl (Contributor, Author) commented Aug 26, 2020 via email

ssnl (Collaborator) commented Aug 26, 2020

That is a good point. But still, a SIGTERM from the parent process shouldn't trigger the error, so it's unclear to me what causes the problem. Maybe you can try the linked patch and see if it solves the issue?

xkszltl (Contributor, Author) commented Aug 26, 2020

Two questions about that fix:

  1. What difference does moving the termination into finally make? To me it sounds like the first few workers get a longer grace period (while we are still waiting on workers later in the list).
  2. I'm not very familiar with multiprocessing in Python, but for if w.is_alive(): w.terminate(), without a mutex, does it make sense to check? is_alive() may change between the if and terminate(), right? (See the sketch after this list.)
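
A standalone sketch of the check-then-terminate pattern in question (not the torch source). My understanding is that the race is benign in this context: if the worker exits between is_alive() and terminate(), terminate() just sends SIGTERM to a child that has exited but has not yet been joined (a zombie), which has no effect.

import multiprocessing as mp
import time

def worker():
    time.sleep(0.1)

if __name__ == "__main__":
    w = mp.Process(target=worker)
    w.start()
    w.join(timeout=0.05)  # may time out before the worker finishes

    # The worker can exit right here, between the join timeout and the
    # liveness check, or between the check and terminate(); either way the
    # parent still holds the child's pid until the final join() reaps it.
    if w.is_alive():
        w.terminate()
    w.join()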

shaibagon pushed a commit to shaibagon/pytorch that referenced this issue Dec 3, 2020
…torch#43462)

Summary:
Fixes pytorch#43455

Pull Request resolved: pytorch#43462

Reviewed By: bdhirsh

Differential Revision: D25277759

Pulled By: VitalyFedyunin

fbshipit-source-id: 0bb0d87374c0403853d71aac2c242374bfc7acf2