Data Parallel #2
ActNorm uses the statistics of an individual batch to initialize its parameters, so in a DataParallel scenario it scrambles model training (much like batch norm). If you forward one batch on a single GPU (without a backward pass), ActNorm will be initialized properly, and you can then use DataParallel to train your model. I found this enables multi-GPU training (without it, the model is not trainable).
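For readers hitting the same issue, here is a minimal sketch of the data-dependent initialization that causes this, assuming a typical Glow-style ActNorm rather than this repository's exact code. The key point: each DataParallel replica sees only its own slice of the batch, so each replica would compute different statistics.

```python
import torch
from torch import nn

class ActNorm(nn.Module):
    """Sketch of Glow-style activation normalization with
    data-dependent initialization."""

    def __init__(self, num_channels):
        super().__init__()
        self.loc = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.register_buffer("initialized", torch.tensor(0, dtype=torch.uint8))

    def initialize(self, x):
        # Set loc/scale so the first batch comes out zero-mean,
        # unit-variance per channel. This is the batch-dependent step:
        # under DataParallel each replica would run it on a different
        # slice of the batch and end up with different parameters.
        with torch.no_grad():
            flat = x.permute(1, 0, 2, 3).contiguous().view(x.shape[1], -1)
            mean = flat.mean(1).view(1, -1, 1, 1)
            std = flat.std(1).view(1, -1, 1, 1)
            self.loc.data.copy_(-mean)
            self.scale.data.copy_(1.0 / (std + 1e-6))

    def forward(self, x):
        if self.initialized.item() == 0:
            self.initialize(x)
            self.initialized.fill_(1)
        return self.scale * (x + self.loc)
```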
Aha! I thought it had something to do with
You can check f8805e7; this is my workaround. If you can forward 1 batch on 1 GPU, this will work. I think you can use this even with torch.no_grad, so maybe this is not a problem.
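In other words, the workaround is to run the data-dependent initialization once on a single device before wrapping the model. A minimal sketch, where `model` and `loader` are placeholders for your Glow model and data loader:

```python
import torch
from torch import nn

device = torch.device("cuda:0")
model = model.to(device)

# Run exactly one forward pass on a single GPU so every ActNorm layer
# initializes from the same batch; no_grad suffices because only the
# data-dependent init matters here, not gradients.
first_batch, _ = next(iter(loader))
with torch.no_grad():
    model(first_batch.to(device))

# DataParallel re-broadcasts the master copy's parameters to all
# replicas on every forward call, so the initialized values are shared.
model = nn.DataParallel(model)
```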
Thank you for the change, but it didn't quite work for me. My understanding is that the problem has something to do with the two GPUs having different weights after initialization. I don't think calling forward on individual GPUs would synchronize their weights.
I have a similar problem running the code. Running on a single GPU works fine, but logdet has different values in the multi-GPU case.
Thank you for your code. It looks like you tried to use `nn.DataParallel` but didn't quite include it in there. Can you tell me about your experience with it? For some reason, the loss kept increasing when I used `nn.DataParallel` with 2 GPUs, regardless of batch size. To make it run with your code, I changed your `calc_loss` a little bit by expanding `logdet` to have the same size as `log_p`. I also tried `logdet.mean()`, but that didn't work either. Here, I'm not really sure why the `logdet` values are different for the 2 GPUs, as it seems to depend only on shared weights.
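For context, a hedged sketch of that change, assuming a `calc_loss` that follows the usual Glow bits-per-pixel objective (the exact function in this repository may differ). The shape mismatch arises because `nn.DataParallel` gathers `log_p` with one entry per sample but the scalar `logdet` with one entry per replica:

```python
from math import log

def calc_loss(log_p, logdet, image_size, n_bins):
    n_pixel = image_size * image_size * 3

    # Change described above: collapse the per-replica logdet values and
    # broadcast to the same size as log_p. (Using logdet.mean() alone is
    # the other variant mentioned in the comment.)
    logdet = logdet.mean().expand(log_p.shape)

    # Glow objective in nats, then converted to bits per pixel.
    loss = -log(n_bins) * n_pixel + logdet + log_p
    return (
        (-loss / (log(2) * n_pixel)).mean(),
        (log_p / (log(2) * n_pixel)).mean(),
        (logdet / (log(2) * n_pixel)).mean(),
    )
```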