pytorch was blocked at loss.backward #16654

Open · woodenwatcher opened this issue Feb 1, 2019 · 7 comments

Labels: needs reproduction, oncall: distributed, triaged

Comments

@woodenwatcher

The pytorch version is 0.4.0

The Python call stack is as follows:

File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 290, in <module>
    process(sys.argv[1], sys.argv[2], int(sys.argv[3]), sys.argv[4])
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 212, in process
    solver.solve()
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 162, in solve
    loss, accB, accN, lossC = self.iterate(XB, YB, XN, YN, True)
  File "/home/admin/code/PROPAGATE/tools/0_train_base.py", line 120, in iterate
    loss.backward()
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/tensor.py", line 93, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/autograd/__init__.py", line 90, in backward
    allow_unreachable=True)  # allow_unreachable flag
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/nn/parallel/distributed.py", line 341, in reduction_fn_nccl
    group=self.nccl_reduction_group_id)
  File "/home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/distributed/__init__.py", line 306, in all_reduce_multigpu
    return torch._C._dist_all_reduce_multigpu(tensor_list, op, group)
  File "<string>", line 1, in <module>

The C++ call stack is as follows:

Thread 1 (Thread 0x7ff1f47c4740 (LWP 1386)):
#0  0x00007ff1f45c3a8b in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
#1  0x00007ff1f45b1f7f in update_get_addr () from /lib64/ld-linux-x86-64.so.2
#2  0x00007ff1f45c8b38 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
#3  0x00007ff1e6882538 in ncclGroupEnd () at misc/group.cu:165
#4  0x00007ff1e681f27f in thd::DataChannelNccl::_getNcclResourcePair(std::vector<at::Tensor, std::allocator<at::Tensor> >&, int) () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#5  0x00007ff1e68209dd in thd::DataChannelNccl::allReduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, THDReduceOp, int) () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#6  0x00007ff1e67f7a57 in THDAllReduceMultiGPU () from /home/admin/code/PROPAGATE/rpm/anaconda/lib/python2.7/site-packages/torch/_C.so
#7  0x00007ff1e667f36a in THDPModule_allReduceMultiGPU (_unused=<optimized out>, args=<optimized out>) at torch/csrc/distributed/Module.cpp:389
#8  0x00007ff1f42b6615 in PyEval_EvalFrameEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#9  0x00007ff1f42b84e9 in PyEval_EvalCodeEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#10 0x00007ff1f42b5482 in PyEval_EvalFrameEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#11 0x00007ff1f42b84e9 in PyEval_EvalCodeEx () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#12 0x00007ff1f4240fda in function_call () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#13 0x00007ff1f421c773 in PyObject_Call () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#14 0x00007ff1f421d053 in PyObject_CallFunctionObjArgs () from /home/admin/code/PROPAGATE/rpm/anaconda/bin/../lib/libpython2.7.so.1.0
#15 0x00007ff1e622f40d in operator() (__closure=0x7ff0c8c00660) at torch/csrc/autograd/python_engine.cpp:197
#16 std::_Function_handler<void(), THPEngine_queue_callback(PyObject*, PyObject*)::<lambda()> >::_M_invoke(const std::_Any_data &) (__functor=...) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2039
#17 0x00007ff1e6202cdb in operator() (this=<optimized out>) at /opt/rh/devtoolset-3/root/usr/include/c++/4.9.2/functional:2439
#18 torch::autograd::Engine::execute (this=this@entry=0x7ff1e9df7880 <engine>, input_roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/engine.cpp:552
#19 0x00007ff1e623013c in torch::autograd::python::PythonEngine::execute (this=this@entry=0x7ff1e9df7880 <engine>, roots=..., inputs=..., keep_graph=<optimized out>, create_graph=<optimized out>, outputs=...) at torch/csrc/autograd/python_engine.cpp:61
#20 0x00007ff1e6230dc5 in THPEngine_run_backward (self=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at torch/csrc/autograd/python_engine.cpp:169

Can anybody help solve this problem?
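For reference, a hypothetical minimal sketch of the kind of DistributedDataParallel setup this traceback implies (one process per GPU, NCCL backend, gradients all-reduced inside loss.backward()); this is not the actual training script, and it is written against the current torch.distributed API rather than the 0.4.0 API shown in the traceback:

# Hypothetical minimal sketch, not the actual script: DDP with the NCCL backend,
# one process per GPU, where the reported hang would occur inside loss.backward().
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = DistributedDataParallel(nn.Linear(10, 1).cuda(rank), device_ids=[rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):
        x = torch.randn(32, 10).cuda(rank)
        y = torch.randn(32, 1).cuda(rank)
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad()
        loss.backward()   # the reported hang is inside the NCCL all-reduce triggered here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    torch.multiprocessing.spawn(run, args=(world_size,), nprocs=world_size)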

@zou3519 added the oncall: distributed label Feb 4, 2019
@ustctf-zz

Almost the same issue here; waiting for a response...

@soumith
Member

soumith commented Feb 23, 2019

You are both probably running into a distributed deadlock bug.
Please try the latest version of PyTorch (v1.0.1).
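As a quick sanity check before retrying, this confirms which build and NCCL version are actually being imported (a minimal snippet; the exact output of torch.cuda.nccl.version() can vary between builds):

# Minimal check of the installed PyTorch build before retrying on v1.0.1+.
import torch
print(torch.__version__)          # e.g. "1.0.1"
print(torch.version.cuda)         # CUDA version this wheel was built against
print(torch.cuda.nccl.version())  # NCCL version bundled with this build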

@tuyaao

tuyaao commented Mar 23, 2019

Hi,
I am using PyTorch 1.0.1.post2, and it still blocks at the loss.backward step.
Has anyone fixed this problem?

@mrshenli
Contributor

mrshenli commented Jun 3, 2019

Hi @tuyaao @ustctf, could you please share a minimal repro of this problem?

@mrshenli added the needs reproduction label Jun 3, 2019
@jerryzh168 added the triaged label Jun 4, 2019
@mrshenli
Contributor

Hey @tuyaao @ustctf, we upgraded the NCCL submodule and fixed some deadlocks recently. Can you check whether the same error still occurs with the nightly build?
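If it still hangs on nightly, enabling NCCL's debug output before the process group is created shows which collective each rank is stuck in; a small sketch (NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables, not PyTorch-specific):

# Set NCCL debug logging before torch.distributed initializes the NCCL backend,
# so each rank reports the collectives it launches; useful for locating a hang.
import os
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "COLL")  # restrict output to collective calls

import torch.distributed as dist
# dist.init_process_group("nccl", ...)  # initialize as usual after setting the env vars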

@xjcl

xjcl commented Sep 15, 2019

I seem to run into this problem with PyTorch 1.2.0 (8 August 2019).

I'm sharing a PyTorch neural network model between a main thread which trains the model and a number of worker threads which eval the model to generate training samples (à la AlphaGo).

My main thread gets stuck (deadlocked) on loss.backward() or model.parameters() on just the 3rd iteration. However, my worker threads run almost to completion (they are just waiting on the main thread to finish), so I'm very confused how a deadlock can occur here.

More details: This happens every time, independent of the seed. Ctrl+C only stops the 3 worker threads but not the main thread (strange!), so I cannot get a stack trace. This also seems to happen only with >= 3 workers.

Also see my question on StackOverflow: https://stackoverflow.com/questions/57940151/pytorch-sharing-a-model-between-threads-do-i-need-to-lock-it-myself
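For context, a hypothetical sketch of this sharing pattern (not my actual code): a single nn.Module shared between one training thread and several worker threads that only evaluate it to generate training samples:

# Hypothetical sketch of the sharing pattern described above (not the actual code).
import threading
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
samples, samples_lock = [], threading.Lock()
stop = threading.Event()

def worker():
    # Workers only read the shared model, under no_grad, to produce samples.
    while not stop.is_set():
        x = torch.randn(1, 10)
        with torch.no_grad():
            y = model(x).argmax(dim=1)
        with samples_lock:
            samples.append((x, y))

def train(iterations=3):
    for _ in range(iterations):
        with samples_lock:
            batch = samples[-32:] or [(torch.randn(1, 10), torch.zeros(1, dtype=torch.long))]
        x = torch.cat([b[0] for b in batch])
        y = torch.cat([b[1] for b in batch])
        loss = F.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()   # the hang described above occurred around here on the 3rd iteration
        opt.step()
    stop.set()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
train()
for w in workers:
    w.join()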

@xjcl

xjcl commented Sep 17, 2019

I got the issue to disappear by removing a torch.multiprocessing.Lock from my code and relying on a less elegant solution. I still don't know how it occurred or why the Lock should have anything to do with it: the Lock shouldn't affect the nn.Module behavior, and it was acquired in all 3 iterations, yet the first 2 went fine and it only interacted(?) with loss.backward() or model.parameters() on the 3rd.
