Data Parallel #2
ActNorm uses the statistics of an individual batch to initialize its parameters, so in a DataParallel scenario it scrambles model training (much like batch norm). If you forward one batch on a single GPU (without a backward pass), ActNorm will be initialized properly, and you can then use DataParallel to train your model. I found this enables multi-GPU training (without it, the model is not trainable).
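For readers hitting the same issue, here is a minimal sketch of the data-dependent initialization that causes this, assuming a typical Glow-style ActNorm rather than this repository's exact code. The key point: each DataParallel replica sees only its own slice of the batch, so each replica would compute different statistics.

```python
import torch
from torch import nn

class ActNorm(nn.Module):
    """Sketch of Glow-style activation normalization with
    data-dependent initialization."""

    def __init__(self, num_channels):
        super().__init__()
        self.loc = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.register_buffer("initialized", torch.tensor(0, dtype=torch.uint8))

    def initialize(self, x):
        # Set loc/scale so the first batch comes out zero-mean,
        # unit-variance per channel. This is the batch-dependent step:
        # under DataParallel each replica would run it on a different
        # slice of the batch and end up with different parameters.
        with torch.no_grad():
            flat = x.permute(1, 0, 2, 3).contiguous().view(x.shape[1], -1)
            mean = flat.mean(1).view(1, -1, 1, 1)
            std = flat.std(1).view(1, -1, 1, 1)
            self.loc.data.copy_(-mean)
            self.scale.data.copy_(1.0 / (std + 1e-6))

    def forward(self, x):
        if self.initialized.item() == 0:
            self.initialize(x)
            self.initialized.fill_(1)
        return self.scale * (x + self.loc)
```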
Aha! I thought it had something to do with
You can check f8805e7; this is my workaround. If you can forward 1 batch on 1 GPU, this will work. I think you can use this even with torch.no_grad, so maybe this is not a problem.
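In other words, the workaround is to run the data-dependent initialization once on a single device before wrapping the model. A minimal sketch, where `model` and `loader` are placeholders for your Glow model and data loader:

```python
import torch
from torch import nn

device = torch.device("cuda:0")
model = model.to(device)

# Run exactly one forward pass on a single GPU so every ActNorm layer
# initializes from the same batch; no_grad suffices because only the
# data-dependent init matters here, not gradients.
first_batch, _ = next(iter(loader))
with torch.no_grad():
    model(first_batch.to(device))

# DataParallel re-broadcasts the master copy's parameters to all
# replicas on every forward call, so the initialized values are shared.
model = nn.DataParallel(model)
```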
Thank you for the change, but it didn't quite work for me. My understanding is that the problem has something to do with the two GPUs having different weights after initialization. I don't think calling forward on individual GPUs would synchronize their weights.
I have a similar problem running the code. Running on a single GPU works fine, but logdet has different values in the multi-GPU case.
Thank you for your code. It looks like you tried to use `nn.DataParallel` but didn't quite include it in there. Can you tell me about your experience with it? For some reason, the loss kept increasing when I used `nn.DataParallel` with 2 GPUs, regardless of batch size. To make it run with your code, I changed your `calc_loss` a little bit by expanding `logdet` to have the same size as `log_p`. I also tried `logdet.mean()`, but that didn't work either. Here, I'm not really sure why the `logdet` values are different for the 2 GPUs, as it seems to depend only on shared weights.
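For context, a hedged sketch of that change, assuming a `calc_loss` that follows the usual Glow bits-per-pixel objective (the exact function in this repository may differ). The shape mismatch arises because `nn.DataParallel` gathers `log_p` with one entry per sample but the scalar `logdet` with one entry per replica:

```python
from math import log

def calc_loss(log_p, logdet, image_size, n_bins):
    n_pixel = image_size * image_size * 3

    # Change described above: collapse the per-replica logdet values and
    # broadcast to the same size as log_p. (Using logdet.mean() alone is
    # the other variant mentioned in the comment.)
    logdet = logdet.mean().expand(log_p.shape)

    # Glow objective in nats, then converted to bits per pixel.
    loss = -log(n_bins) * n_pixel + logdet + log_p
    return (
        (-loss / (log(2) * n_pixel)).mean(),
        (log_p / (log(2) * n_pixel)).mean(),
        (logdet / (log(2) * n_pixel)).mean(),
    )
```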