error when running multi-gpu #49
Comments
I'm also running into this issue, @Tramac. Any help would be greatly appreciated!
Did you get an error like this?
Me too, with PyTorch 1.1.0 on Ubuntu 16.04.
I modified the code to make multi-GPU parallel training possible, but using:
Traceback (most recent call last):
I ran this program successfully with one GPU, but it fails with multiple GPUs.
I would appreciate any help. Thank you so much!
The log is as follows:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing its output (the return value of `forward`). You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`. If you already have this argument set, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable). (prepare_for_backward at /opt/conda/conda-bld/pytorch_1556653114079/work/torch/csrc/distributed/c10d/reducer.cpp:408)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f91725b4dc5 in /home/maobinjie/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10d::Reducer::prepare_for_backward(std::vector<torch::autograd::Variable, std::allocator<torch::autograd::Variable> > const&)
frame #2: <unknown function> + 0x6cb6c8 (0x7f91a1b5c6c8 in /home/maobinjie/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #3: <unknown function> + 0x12d07a (0x7f91a15be07a in /home/maobinjie/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
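
As the error message itself suggests, one possible workaround is to pass `find_unused_parameters=True` when wrapping the model in `DistributedDataParallel`. Below is a minimal sketch of that idea, not the code from this repository. It assumes a single CPU process with the `gloo` backend so it can run without GPUs, and `TwoHeadNet` and `run_two_steps` are hypothetical names made up for illustration. The model deliberately has a branch that `forward` never uses, which is the situation the error describes.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHeadNet(nn.Module):
    """Hypothetical model with a branch that forward() never touches."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head = nn.Linear(8, 2)
        self.aux_head = nn.Linear(8, 2)  # unused in forward(): gets no gradient

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))


def run_two_steps():
    """Run two training iterations under DDP and return the loss values."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Assumption: a one-process group on CPU, rendezvous through a temp file.
        init_method = "file://" + os.path.join(tmp_dir, "ddp_init")
        dist.init_process_group(
            "gloo", init_method=init_method, rank=0, world_size=1
        )
        try:
            # The flag asks DDP to detect parameters that did not contribute
            # to the output and to skip waiting for their gradients.
            model = DDP(TwoHeadNet(), find_unused_parameters=True)
            optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
            losses = []
            for _ in range(2):  # the reported error appears on a later iteration
                optimizer.zero_grad()
                loss = model(torch.randn(4, 8)).sum()
                loss.backward()
                optimizer.step()
                losses.append(loss.item())
            return losses
        finally:
            dist.destroy_process_group()


if __name__ == "__main__":
    print(run_two_steps())
```

The flag adds an extra traversal of the autograd graph on every iteration, so an alternative worth considering is to make sure every registered parameter actually contributes to the value returned by `forward`, or to remove the unused submodules.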