RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. #16
The full error message reads:

> RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by (1) passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`; (2) making sure all `forward` function outputs participate in calculating loss. If you already have done the above two steps, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).

Adding the argument
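The fix the error message itself suggests can be sketched as follows. This is a minimal single-process illustration, not the project's actual training code: the `TwoHead` toy model is an assumption, chosen so that one `forward` output deliberately does not feed into the loss, which is exactly the situation that triggers this `RuntimeError` unless `find_unused_parameters=True` is passed to `DistributedDataParallel`:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHead(torch.nn.Module):
    """Toy model whose second head does not contribute to the loss."""

    def __init__(self):
        super().__init__()
        self.used = torch.nn.Linear(4, 2)
        self.unused = torch.nn.Linear(4, 2)

    def forward(self, x):
        return self.used(x), self.unused(x)


# Single-process "gloo" group on CPU, just so DDP can be constructed locally.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Without find_unused_parameters=True, ignoring the second output below
# would raise "Expected to have finished reduction in the prior iteration".
model = DDP(TwoHead(), find_unused_parameters=True)

for _ in range(2):  # two iterations, to exercise DDP's reduction bookkeeping
    out_used, _ = model(torch.randn(8, 4))  # second output intentionally unused
    loss = out_used.sum()
    model.zero_grad()
    loss.backward()

dist.destroy_process_group()
```

Note that `find_unused_parameters=True` adds overhead (DDP traverses the autograd graph every iteration), so the cleaner long-term fix is usually option (2) from the error text: make every `forward` output participate in the loss.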
In your command, '--nproc_per_node=1' indicates you only have 1 GPU, but you set '--gpus "0,1"', which requires 2 GPUs.
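The mismatch described above can be made concrete with a small check. The `check_launch_args` helper below is hypothetical (it is not part of `torch.distributed.launch`); it only illustrates why the process count must match the number of GPU ids in the `--gpus` list:

```python
# Hypothetical sanity check, not part of torch.distributed.launch itself,
# illustrating why --nproc_per_node must match the number of GPU ids.
def check_launch_args(nproc_per_node: int, gpus: str) -> int:
    """Return the GPU count if consistent with nproc_per_node, else raise."""
    gpu_ids = [g.strip() for g in gpus.split(",") if g.strip()]
    if nproc_per_node != len(gpu_ids):
        raise ValueError(
            f"--nproc_per_node={nproc_per_node} spawns {nproc_per_node} "
            f"process(es), but --gpus '{gpus}' names {len(gpu_ids)} GPU(s)"
        )
    return len(gpu_ids)


check_launch_args(2, "0,1")  # consistent: 2 processes, 2 GPUs
```

With the command from this issue, `check_launch_args(1, "0,1")` would raise, which is exactly the inconsistency being pointed out: either set `--nproc_per_node=2`, or restrict `--gpus` to a single id.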
Did you fix your problem? Can I close this issue now?
Hi, I think I ran into the same problem here.
I'm running in a Docker environment with 4 GPUs, but it does not even work in the single-GPU setting. Please help.
Running it gives the following `RuntimeError`.