
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one #331

Closed
feitiandemiaomi opened this issue Jun 13, 2019 · 10 comments
Labels
bug Something isn't working

Comments

@feitiandemiaomi

Hi, my friend, when I train on my data a strange error happens, shown as follows:

Traceback (most recent call last):
File "train.py", line 342, in <module>
accumulate=opt.accumulate,
File "train.py", line 224, in train
pred = model(imgs)
File "/home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, kwargs)
File "/home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 392, in forward
self.reducer.prepare_for_backward([])
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of forward). You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /pytorch/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: std::function<std::string ()>::operator()() const + 0x11 (0x7f6a5435c441 in /home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x2a (0x7f6a5435bd7a in /home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #2: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&) + 0x5ec (0x7f6a54e8283c in /home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x6c52bd (0x7f6a54e782bd in /home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x130cfc (0x7f6a548e3cfc in /home/data/anaconda3/envs/caffe2_py36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #33: __libc_start_main + 0xf0 (0x7f6a594a8830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #34: python() [0x4009e9]

This is a puzzling problem, because when I use my second dataset it is OK. I have tried making a dataset that includes only one image, but it still fails. Is it related to the size of the dataset or the batch size?
I don't know why it happens or how to solve it. Could you give me some advice? Thanks a lot.
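
The error message itself points at DistributedDataParallel's unused-parameter detection. Below is a minimal sketch of that suggestion, assuming the model is wrapped with torch.nn.parallel.DistributedDataParallel somewhere in train.py; the variable names here are placeholders, not the repository's actual code:

import torch.nn as nn

model = nn.parallel.DistributedDataParallel(
    model,                        # the YOLO model being trained
    device_ids=[local_rank],      # placeholder: this process's GPU index
    output_device=local_rank,
    find_unused_parameters=True,  # let the reducer tolerate parameters that did
)                                 # not contribute to the current forward pass

Whether this is the right fix depends on why some parameters end up unused, so it is better treated as a diagnostic step than a guaranteed solution.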

@feitiandemiaomi feitiandemiaomi added the bug Something isn't working label Jun 13, 2019
@feitiandemiaomi
Author

feitiandemiaomi commented Jun 13, 2019

I found that a dataset with an odd number of images reproduces the above error; an even number is OK.

@glenn-jocher
Member

@feitiandemiaomi the error doesn't look familiar. But you do need Python 3.7 now, I see you are on 3.6.

@remindchobits

I also ran into this problem. Just as @feitiandemiaomi said, adjusting the dataset to an even total number is a temporary workaround. There may be a bug somewhere?

@feitiandemiaomi
Author

@remindchobits It may be related to 'Accumulate gradient for x batches before optimizing'.

@gwestner94

Hi @feitiandemiaomi, I hit the same issue when training a custom model.
For me, your workaround of an even-sized training set didn't solve it; only after downgrading PyTorch to 1.0.0 did I get a training run to work.
With version 1.1.0 the training crashes with the mentioned error after the first mAP calculation.
Is there any update on a fix for this?

@glenn-jocher
Member

@feitiandemiaomi @gwestner94 the accumulate variable is the number of batches to accumulate before an optimizer step. So for example batch_size 16 accumulate 4 will produce an effective batch_size of 64 (one optimizer step every 64 images).
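
For illustration only, a bare-bones version of that accumulation pattern (not the exact train.py code, just the arithmetic described above):

accumulate = 4      # batches to accumulate before an optimizer step
batch_size = 16     # images per batch -> effective batch size of 16 * 4 = 64

for i, (imgs, targets) in enumerate(dataloader):
    loss = compute_loss(model(imgs), targets)   # placeholder loss function
    loss.backward()                             # gradients sum across batches
    if (i + 1) % accumulate == 0:
        optimizer.step()
        optimizer.zero_grad()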

@feitiandemiaomi
Author

@gwestner94 I cannot find a solution, it's strange. If you find one, please share it with us. Thanks a lot.

@feitiandemiaomi
Author

feitiandemiaomi commented Jun 26, 2019

@glenn-jocher I see. Could you reproduce the above error by using a dataset with an odd number of images?

@glenn-jocher
Member

glenn-jocher commented Jun 26, 2019

@feitiandemiaomi @remindchobits this repository is verified to work as intended on COCO2014, which by default is an odd-numbered dataset of 117263 images. Please note that most technical problems are due to:

  • Your changes to the default repository. If your issue is not reproducible in a fresh git clone version of this repository we can not debug it. Before going further run this code and ensure your issue persists:
sudo rm -rf yolov3  # remove existing repo
git clone https://github.com/ultralytics/yolov3 && cd yolov3 # git clone latest
python3 detect.py  # verify detection
python3 train.py  # verify training (a few batches only)
# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible with COCO data we can not debug it. Visit our Custom Training Tutorial for exact details on how to format your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of training and testing data.
  • Your environment. If your issue is not reproducible in a GCP Quickstart Guide VM we can not debug it. Ensure you meet the requirements specified in the README: Unix, MacOS, or Windows with Python >= 3.7, PyTorch >= 1.1, etc. You can also use our Google Colab Notebook to test your code in a working environment.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

@H-YunHui

H-YunHui commented Aug 5, 2019

@feitiandemiaomi
@remindchobits
@gwestner94
I have the same problem. If the number of training images is odd, the error occurs; if it's even, it's OK.
How did you solve this?
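
Not a confirmed fix, but since the failure tracks the odd/even dataset size, one thing worth trying is dropping the final incomplete batch so that every process runs the same number of forward passes. A sketch using the standard torch.utils.data.DataLoader argument (whether this addresses the underlying DDP issue is an open question):

from torch.utils.data import DataLoader

dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=True,   # discard the last, smaller batch instead of feeding an
)                     # uneven amount of work to the replicas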
