pytorch was blocked at loss.backward #16654

Open · woodenwatcher opened this issue Feb 1, 2019 · 7 comments

Labels: needs reproduction, oncall: distributed, triaged

Comments

@woodenwatcher

The pytorch version is 0.4.0

The Python call stack is as follows:

File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 290, in <module>
    process(sys.argv[1], sys.argv[2], int(sys.argv[3]), sys.argv[4])
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 212, in process
    solver.solve()
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 162, in solve
    loss, accB, accN, lossC = self.iterate(XB, YB, XN, YN, True)
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 120, in iterate
    loss.backward()
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 341, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/distributed/__init__.py", line 306, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)
  File "<string>", line 1, in <module>

The C++ call stack is as follows:

Thread 1 (Thread 0x7ff1f47c4740 (LWP 1386)):
#0  0x00007ff1f45c3a8b in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ff1f45b1f7f in update_get_addr () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ff1f45c8b38 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ff1e6882538 in ncclGroupEnd () at misc/group.cu:165
#4  0x00007ff1e681f27f in thd::DataChannelNccl::_getNcclResourcePair(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int) () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#5  0x00007ff1e68209dd in thd::DataChannelNccl::allReduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, THDReduceOp, int) () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#6  0x00007ff1e67f7a57 in THDAllReduceMultiGPU () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#7  0x00007ff1e667f36a in THDPModule_allReduceMultiGPU (_unused=<optimized out>, args=<optimized out>) at torch/csrc/distributed/Module.cpp:389
#8  0x00007ff1f42b6615 in PyEval_EvalFrameEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#9  0x00007ff1f42b84e9 in PyEval_EvalCodeEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#10 0x00007ff1f42b5482 in PyEval_EvalFrameEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#11 0x00007ff1f42b84e9 in PyEval_EvalCodeEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#12 0x00007ff1f4240fda in function_call () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#13 0x00007ff1f421c773 in PyObject_Call () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#14 0x00007ff1f421d053 in PyObject_CallFunctionObjArgs () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#15 0x00007ff1e622f40d in operator() (__closure=0x7ff0c8c00660) at torch/csrc/autograd/python_engine.cpp:197
#16 std::_Function_handler<void(), THPEngine_queue_callback(PyObject*, PyObject*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2039
#17 0x00007ff1e6202cdb in operator() (this=<optimized out>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2439
#18 torch::autograd::Engine::execute (this=this@entry=0x7ff1e9df7880 <engine>, input_roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/engine.cpp:552
#19 0x00007ff1e623013c in torch::autograd::python::PythonEngine::execute (this=this@entry=0x7ff1e9df7880 <engine>, roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/python_engine.cpp:61
#20 0x00007ff1e6230dc5 in THPEngine_run_backward (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/python_engine.cpp:169

Can anybody help solve this problem?
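For reference, a hypothetical minimal sketch of the kind of DistributedDataParallel setup this traceback implies (one process per GPU, NCCL backend, gradients all-reduced inside loss.backward()); this is not the actual training script, and it is written against the current torch.distributed API rather than the 0.4.0 API shown in the traceback:

# Hypothetical minimal sketch, not the actual script: DDP with the NCCL backend,
# one process per GPU, where the reported hang would occur inside loss.backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DistributedDataParallel(nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 10).cuda(rank)
        y = torch.randn(32, 1).cuda(rank)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # the reported hang is inside the NCCL all-reduce triggered here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)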

@zou3519 added the oncall: distributed label Feb 4, 2019
@ustctf-zz

Almost the same issue here; waiting for a response...

@soumith
Member

soumith commented Feb 23, 2019

You are both probably running into a distributed deadlock bug.
Please try the latest version of PyTorch (v1.0.1).
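As a quick sanity check before retrying, this confirms which build and NCCL version are actually being imported (a minimal snippet; the exact output of torch.cuda.nccl.version() can vary between builds):

# Minimal check of the installed PyTorch build before retrying on v1.0.1+.
import torch
print(torch.__version__)          # e.g. "1.0.1"
print(torch.version.cuda)         # CUDA version this wheel was built against
print(torch.cuda.nccl.version())  # NCCL version bundled with this build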

@tuyaao

tuyaao commented Mar 23, 2019

Hi,
I am using PyTorch 1.0.1.post2, and it still blocks at the loss.backward step.
Has anyone fixed this problem?

@mrshenli
Contributor

mrshenli commented Jun 3, 2019

Hi @tuyaao @ustctf, could you please share a minimal repro of this problem?

@mrshenli added the needs reproduction label Jun 3, 2019
@jerryzh168 added the triaged label Jun 4, 2019
@mrshenli
Contributor

Hey @tuyaao @ustctf, we upgraded the NCCL submodule and fixed some deadlocks recently. Can you check whether the same error still occurs with the nightly build?
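If it still hangs on nightly, enabling NCCL's debug output before the process group is created shows which collective each rank is stuck in; a small sketch (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, not PyTorch-specific):

# Set NCCL debug logging before torch.distributed initializes the NCCL backend,
# so each rank reports the collectives it launches; useful for locating a hang.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL")  # restrict output to collective calls

import torch.distributed as dist
# dist.init_process_group("nccl", ...)  # initialize as usual after setting the env vars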

@xjcl

xjcl commented Sep 15, 2019

I seem to run into this problem with PyTorch 1.2.0 (8 August 2019).

I'm sharing a PyTorch neural network model between a main thread which trains the model and a number of worker threads which eval the model to generate training samples (à la AlphaGo).

My main thread gets stuck (deadlocked) on loss.backward() or model.parameters() on just the 3rd iteration. However, my worker threads run almost to completion (they are just waiting on the main thread to finish), so I'm very confused how a deadlock can occur here.

More details: This happens every time, independent of the seed. Ctrl+C only stops the 3 worker threads but not the main thread (strange!), so I cannot get a stack trace. This also seems to happen only with >= 3 workers.

Also see my question on StackOverflow: https://stackoverflow.com/questions/57940151/pytorch-sharing-a-model-between-threads-do-i-need-to-lock-it-myself
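For context, a hypothetical sketch of this sharing pattern (not my actual code): a single nn.Module shared between one training thread and several worker threads that only evaluate it to generate training samples:

# Hypothetical sketch of the sharing pattern described above (not the actual code).
import threading
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
samples, samples_lock = [], threading.Lock()
stop = threading.Event()

def worker():
    # Workers only read the shared model, under no_grad, to produce samples.
    while not stop.is_set():
        x = torch.randn(1, 10)
        with torch.no_grad():
            y = model(x).argmax(dim=1)
        with samples_lock:
            samples.append((x, y))

def train(iterations=3):
    for _ in range(iterations):
        with samples_lock:
            batch = samples[-32:] or [(torch.randn(1, 10), torch.zeros(1, dtype=torch.long))]
        x = torch.cat([b[0] for b in batch])
        y = torch.cat([b[1] for b in batch])
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()   # the hang described above occurred around here on the 3rd iteration
        opt.step()
    stop.set()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
train()
for w in workers:
    w.join()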

@xjcl

xjcl commented Sep 17, 2019

I got the issue to disappear by removing a torch.multiprocessing.Lock from my code and relying on a less elegant solution. I still don't know how it occurred or why the Lock should have anything to do with it: the Lock shouldn't affect the nn.Module behavior, and it was acquired in all 3 iterations, yet the first 2 went fine and it only interacted(?) with loss.backward() or model.parameters() on the 3rd.
