Hi, sorry for bothering!

I have successfully run the example and also converted my own code from single-node DataParallel to DistributedDataParallel mode. Training runs without any error, but the loss does not decrease and the accuracy looks wrong (it eventually drops to 0).

The surprising part is that on a single GPU with batch size 128 and LR = 0.001 everything converges fine, but when I launch with mpirun across 2 nodes × 8 GPUs = 16 GPUs, the loss does not decrease.

What should I do? I have already tried scaling the LR up and down by a factor of 10 from the original value, with no improvement.
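For context, here is a minimal sketch of the kind of DDP training loop described above. This is not the reporter's actual code (which is not shown in the issue); the model/dataset names and the environment-variable handling are assumptions, and the launcher (mpirun or torchrun) is expected to export the usual rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE).

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

def train(model, dataset, epochs=10, lr=0.001, batch_size=128):
    # One process per GPU. The local rank comes from the launcher:
    # torchrun sets LOCAL_RANK, Open MPI sets OMPI_COMM_WORLD_LOCAL_RANK.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ.get(
        "LOCAL_RANK", os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)))
    torch.cuda.set_device(local_rank)

    model = model.cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Without a DistributedSampler every rank iterates over the full dataset,
    # a common cause of "runs without error but does not converge" reports.
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)

    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        sampler.set_epoch(epoch)  # reshuffle shards differently each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()   # DDP all-reduces gradients during backward
            optimizer.step()
```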