
Training is stuck at some point; I'm not sure if it is a PyTorch problem #140

Closed
bowenc0221 opened this issue Aug 31, 2018 · 9 comments

Comments

@bowenc0221

Expected results

Training should run to completion without issues.

Actual results

Training is stuck at [Step 553061 / 720000]. GPU utilization is 0%, but the memory is not released.
I waited for 2 days but it did not resume, so I killed the job.
The problem seems to be caused by a dataloader deadlock; I got the following message when I killed the job:

Exception ignored in: <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f33d805aa20>>
Traceback (most recent call last):
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 349, in __del__
Process Process-19:
    self._shutdown_workers()
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 340, in _shutdown_workers
    self.worker_result_queue.put(None)
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/queues.py", line 346, in put
    with self._wlock:
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt:
Traceback (most recent call last):
  File "tools/train_net_step.py", line 415, in main
    input_data = next(dataiterator)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 276, in __next__
    raise StopIteration
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 61, in _worker_loop
    data_queue.put((idx, samples))
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/queues.py", line 346, in put
    with self._wlock:
  File "/home/bcheng/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 96, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt

INFO train_net_step.py: 442: Save ckpt on exception ...
INFO train_net_step.py: 135: save model: Outputs/e2e_faster_rcnn_R-50-FPN_1x/Aug20-13-57-33_ifp-gup-03_step/ckpt/model_step553070.pth
INFO train_net_step.py: 444: Save ckpt done.
Traceback (most recent call last):
  File "tools/train_net_step.py", line 424, in main
    net_outputs = maskRCNN(**input_data)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/nn/parallel/data_parallel.py", line 108, in forward
    outputs = [self.module(*inputs[0], **kwargs[0])]
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 144, in forward
    return self._forward(data, im_info, roidb, **rpn_kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 175, in _forward
    box_feat = self.Box_Head(blob_conv, rpn_ret)
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/fast_rcnn_heads.py", line 110, in forward
    sampling_ratio=cfg.FAST_RCNN.ROI_XFORM_SAMPLING_RATIO
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/model_builder.py", line 291, in roi_feature_transform
    resolution, resolution, sc, sampling_ratio)(bl_in, rois)
  File "/home/bcheng/Codes/github/Detectron.pytorch/lib/modeling/roi_xfrom/roi_align/functions/roi_align.py", line 16, in forward
    def forward(self, features, rois):
  File "/home/bcheng/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 178, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 10494) is killed by signal: Killed.
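
If this hangs again, it should be possible to inspect the stuck process without killing it by registering a stack-dump signal handler near the top of tools/train_net_step.py. A minimal sketch (not part of the original script):

import faulthandler
import signal

# After this, `kill -USR1 <pid>` prints the stack of every Python thread
# in the main process to stderr without terminating the run.
faulthandler.register(signal.SIGUSR1, all_threads=True)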

Detailed steps to reproduce

E.g.:

CUDA_VISIBLE_DEVICES=2 python3 tools/train_net_step.py --dataset coco2017 --cfg configs/baselines/e2e_faster_rcnn_R-50-FPN_1x.yaml --bs 2 --nw 4

System information

  • Operating system: Ubuntu 16.04.4
  • CUDA version: 9.1
  • cuDNN version: 7
  • GPU models (for all devices if they are not all the same): 1080ti
  • python version: 3.6.5
  • pytorch version: 0.4.0
  • Anything else that seems relevant: I did not modify any code.
@ShethR

ShethR commented Sep 7, 2018

@bowenc0221, I am facing similar issues. This usually happens when GPU memory usage approaches its limit, i.e., when the GPU is almost out of memory.
Another problem I am facing is that GPU memory usage gradually grows over the course of training until the training crashes or hangs.
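
One way to tell whether that growth is real tensor allocation or just the caching allocator would be to log something like the following every few hundred steps (a sketch only; the helper name and call site are mine, and the torch.cuda counters require a reasonably recent PyTorch):

import torch

def log_gpu_memory(step, device=0):
    # Bytes currently held by live tensors vs. the peak observed so far.
    alloc = torch.cuda.memory_allocated(device) / 1024 ** 2
    peak = torch.cuda.max_memory_allocated(device) / 1024 ** 2
    print('[step %d] allocated %.1f MiB, peak %.1f MiB' % (step, alloc, peak))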

@bowenc0221
Author

@ShethR Thanks for your reply.
For the second problem, I guess it is because the spatial resolution of a minibatch keeps changing during training, especially when padding is needed.
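
If it is the caching allocator holding on to many differently-sized blocks, occasionally returning the unused cache to the driver might stop the reported usage from creeping up. A sketch only, not code from this repo, and the interval is arbitrary:

import torch

def maybe_release_cached_blocks(step, every=1000):
    # empty_cache() frees unused cached blocks; it cannot free memory
    # still held by live tensors.
    if step % every == 0:
        torch.cuda.empty_cache()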

@taksau

taksau commented Sep 12, 2018

I also ran into this problem on a TITAN V and a V100 with 2 images per GPU.

@ShethR

ShethR commented Sep 16, 2018

As @bowenc0221 mentioned, this problem is caused by a deadlock inside the dataloader when num_workers > 1.
Using num_workers=1 temporarily solves this issue for me.
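
In this repo the worker count comes from the --nw flag shown in the reproduce command above. In plain PyTorch terms the workaround is roughly the following; the TensorDataset is just a stand-in for the repo's actual roidb dataset:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset; the real one is built from the COCO roidb.
dataset = TensorDataset(torch.randn(8, 3, 224, 224), torch.zeros(8))

# num_workers=1 keeps a single background worker process;
# num_workers=0 loads in the main process and rules out worker
# deadlocks entirely, at the cost of slower data loading.
loader = DataLoader(dataset, batch_size=2, shuffle=True, num_workers=1)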

@bowenc0221
Author

@ShethR If setting num_workers=1 solves the problem, there is probably an issue in the __getitem__ function. (I'm not sure whether the deadlock is caused by conflicting operations across different worker processes.)
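
A typical pattern that only bites with multiple workers is a file handle, database connection, or lock created in __init__ and then inherited by every forked worker. A hypothetical sketch of the lazy-open pattern that avoids this (none of these names come from this repo):

import torch
from torch.utils.data import Dataset

class LazyFileDataset(Dataset):
    def __init__(self, paths):
        self.paths = paths
        self._file = None  # opened lazily, once per worker process

    def _ensure_open(self):
        if self._file is None:
            # h5py.File / lmdb.open would go here; open() is a stand-in.
            self._file = open(self.paths[0], 'rb')

    def __getitem__(self, idx):
        self._ensure_open()
        # ... read and decode sample idx from the file ...
        return torch.zeros(3, 224, 224), 0

    def __len__(self):
        return len(self.paths)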

@SherlockHolmes221

> As @bowenc0221 mentioned, this problem is caused by a deadlock inside the dataloader when num_workers > 1.
> Using num_workers=1 temporarily solves this issue for me.

Actually, setting num_workers=1 does not solve the problem for me. Any other method? Thanks

@RolandZhu

Same here, any updates? Thanks.

@rajatkoner08

Same here. In multi-GPU training, one GPU's utilization drops to 0 while the rest stay at 100%. There is no error message. Any update?

@inkzk

inkzk commented Aug 26, 2021

> Same here. In multi-GPU training, one GPU's utilization drops to 0 while the rest stay at 100%. There is no error message. Any update?

Same here... any solution yet?
