
Training is stuck at some point; I'm not sure if it is a PyTorch problem #140

Closed
bowenc0221 opened this issue Aug 31, 2018 · 9 comments

Comments

@bowenc0221

Expected results

Training should run to completion without issues.

Actual results

Training is stuck at [Step 553061 / 720000]. GPU utilization is 0%, but the memory is not released.
I waited for 2 days but it did not resume, so I killed the job.
The problem seems to be caused by a dataloader deadlock; I got the following message when I killed the job:

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f33d805aa20>>
Traceback (most recent call last):
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
Process Process-19:
    self._shutdown_workers()
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 340, in _shutdown_workers
    self.worker_result_queue.put(None)
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/queues.py", line 346, in put
    with self._wlock:
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt:
Traceback (most recent call last):
  File "tools/train_net_step.py", line 415, in main
    input_data = next(dataiterator)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 276, in __next__
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 61, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/queues.py", line 346, in put
    with self._wlock:
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

INFO train_net_step.py: 442: Save ckpt on exception ...
INFO train_net_step.py: 135: save model: Outputs/e2e_faster_rcnn_R-50-FPN_1x/Aug20-13-57-33_ifp-gup-03_step/ckpt/model_step553070.pth
INFO train_net_step.py: 444: Save ckpt done.
Traceback (most recent call last):
  File "tools/train_net_step.py", line 424, in main
    net_outputs = maskRCNN(**input_data)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 108, in forward
    outputs = [self.module(*inputs[0], **kwargs[0])]
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 144, in forward
    return self._forward(data, im_info, roidb, **rpn_kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 175, in _forward
    box_feat = self.Box_Head(blob_conv, rpn_ret)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/fast_rcnn_heads.py", line 110, in forward
    sampling_ratio=cfg.FAST_RCNN.ROI_XFORM_SAMPLING_RATIO
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 291, in roi_feature_transform
    resolution, resolution, sc, sampling_ratio)(bl_in, rois)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/roi_xfrom/roi_align/functions/roi_align.py", line 16, in forward
    def forward(self, features, rois):
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 10494) is killed by signal: Killed.
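
If this hangs again, it should be possible to inspect the stuck process without killing it by registering a stack-dump signal handler near the top of tools/train_net_step.py. A minimal sketch (not part of the original script):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints the stack of every Python thread
# in the main process to stderr without terminating the run.
faulthandler.register(signal.SIGUSR1, all_threads=True)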

Detailed steps to reproduce

E.g.:

CUDA_VISIBLE_DEVICES=2 python3 tools/train_net_step.py --dataset coco2017 --cfg configs/baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml --bs 2 --nw 4

System information

  • Operating system: Ubuntu 16.04.4
  • CUDA version: 9.1
  • cuDNN version: 7
  • GPU models (for all devices if they are not all the same): 1080ti
  • python version: 3.6.5
  • pytorch version: 0.4.0
  • Anything else that seems relevant: I did not modify any code.
@ShethR

ShethR commented Sep 7, 2018

@bowenc0221, I am facing similar issues. This usually happens when GPU memory usage approaches its limit, i.e., when the GPU is almost out of memory.
Another problem I am facing is that GPU memory usage gradually grows over the course of training until the training crashes or hangs.
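
One way to tell whether that growth is real tensor allocation or just the caching allocator would be to log something like the following every few hundred steps (a sketch only; the helper name and call site are mine, and the torch.cuda counters require a reasonably recent PyTorch):

import torch

def log_gpu_memory(step, device=0):
    # Bytes currently held by live tensors vs. the peak observed so far.
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print('[step %d] allocated %.1f MiB, peak %.1f MiB' % (step, alloc, peak))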

@bowenc0221
Author

@ShethR Thanks for your reply.
For the second problem, I guess it is because the spatial resolution of a minibatch keeps changing during training, especially when padding is needed.
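
If it is the caching allocator holding on to many differently-sized blocks, occasionally returning the unused cache to the driver might stop the reported usage from creeping up. A sketch only, not code from this repo, and the interval is arbitrary:

import torch

def maybe_release_cached_blocks(step, every=1000):
    # empty_cache() frees unused cached blocks; it cannot free memory
    # still held by live tensors.
    if step % every == 0:
        torch.cuda.empty_cache()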

@taksau

taksau commented Sep 12, 2018

I also ran into this problem on a TITAN V and a V100 with 2 images per GPU.

@ShethR

ShethR commented Sep 16, 2018

As @bowenc0221 mentioned, this problem is caused by a deadlock inside the dataloader when num_workers > 1.
Using num_workers=1 temporarily solves this issue for me.
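
In this repo the worker count comes from the --nw flag shown in the reproduce command above. In plain PyTorch terms the workaround is roughly the following; the TensorDataset is just a stand-in for the repo's actual roidb dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real one is built from the COCO roidb.
dataset = TensorDataset(torch.randn(8, 3, 224, 224), torch.zeros(8))

# num_workers=1 keeps a single background worker process;
# num_workers=0 loads in the main process and rules out worker
# deadlocks entirely, at the cost of slower data loading.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=1)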

@bowenc0221
Author

@ShethR If setting num_workers=1 solves the problem, there is probably an issue in the __getitem__ function. (I'm not sure whether the deadlock is caused by conflicting operations across different worker processes.)
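
A typical pattern that only bites with multiple workers is a file handle, database connection, or lock created in __init__ and then inherited by every forked worker. A hypothetical sketch of the lazy-open pattern that avoids this (none of these names come from this repo):

import torch
from torch.utils.data import Dataset

class LazyFileDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self._file = None  # opened lazily, once per worker process

    def _ensure_open(self):
        if self._file is None:
            # h5py.File / lmdb.open would go here; open() is a stand-in.
            self._file = open(self.paths[0], 'rb')

    def __getitem__(self, idx):
        self._ensure_open()
        # ... read and decode sample idx from the file ...
        return torch.zeros(3, 224, 224), 0

    def __len__(self):
        return len(self.paths)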

@SherlockHolmes221

> As @bowenc0221 mentioned, this problem is caused by a deadlock inside the dataloader when num_workers > 1.
> Using num_workers=1 temporarily solves this issue for me.

Actually, setting num_workers=1 does not solve the problem for me. Any other method? Thanks

@RolandZhu

Same here, any updates? Thanks.

@rajatkoner08

Same here. In multi-GPU training, one GPU's utilization drops to 0 while the rest stay at 100%. There is no error message. Any update?

@inkzk

inkzk commented Aug 26, 2021

> Same here. In multi-GPU training, one GPU's utilization drops to 0 while the rest stay at 100%. There is no error message. Any update?

Same here... any solution yet?
