NaN output with distributed training #8

choltz95 · 2021-12-06T02:41:53Z

Hi, thanks for this project. I'm opening this for future investigation. Training on more than one GPU with the basic PyTorch DDP demo on CIFAR-10 produces NaN outputs after a few epochs. Training on a single GPU works fine within the DDP framework.

The implementation from [meliketoy](https://github.com/meliketoy/wide-resnet.pytorch) works fine, but uses more GPU memory.

choltz95 · 2021-12-06T02:47:18Z

Looks like that other repo also suffers from the same issue; it just takes longer.

choltz95 · 2021-12-06T17:30:56Z

Just in case someone ever has a similar issue with DDP: you need to ensure proper usage of `DDP.no_sync`. In my case, it was necessary for adversarial training.
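For anyone landing here later, below is a minimal sketch of what using `no_sync()` in an adversarial training step might look like with PyTorch's `DistributedDataParallel`. It is not the code from this repo or from my training script. The FGSM-style attack, the helper names `fgsm_train_step` and `demo`, the `eps` value, and the toy linear model are all assumptions chosen for illustration. The idea is that the attack pass only needs the gradient with respect to the input, so its forward and backward run inside `no_sync()`, while the real training pass runs outside it so gradients are all-reduced as usual.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


def fgsm_train_step(ddp_model, optimizer, x, y, eps=8 / 255):
    """One FGSM-style adversarial training step (hypothetical helper)."""
    # Attack pass: only the gradient w.r.t. the input is needed, so both the
    # forward and the backward run inside no_sync() to skip the all-reduce.
    x_adv = x.clone().detach().requires_grad_(True)
    with ddp_model.no_sync():
        attack_loss = F.cross_entropy(ddp_model(x_adv), y)
        # autograd.grad w.r.t. the input leaves parameter .grad untouched.
        (grad_x,) = torch.autograd.grad(attack_loss, x_adv)
    # Assumption: inputs are scaled to [0, 1].
    x_adv = (x + eps * grad_x.sign()).clamp(0.0, 1.0).detach()

    # Training pass: outside no_sync(), so DDP averages gradients across ranks.
    optimizer.zero_grad(set_to_none=True)
    loss = F.cross_entropy(ddp_model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()


def demo():
    """Single-process CPU smoke test using the gloo backend."""
    store = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store}", rank=0, world_size=1
    )
    try:
        torch.manual_seed(0)
        # Toy stand-in for the real network (assumption, not Wide-ResNet).
        model = DDP(nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)))
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        x = torch.rand(4, 3, 8, 8)
        y = torch.randint(0, 10, (4,))
        return fgsm_train_step(model, optimizer, x, y)
    finally:
        dist.destroy_process_group()
```

Two things to keep in mind with this pattern: the PyTorch docs note that the forward pass has to be inside the `no_sync()` context as well, not just the backward. And if the attack pass calls `.backward()` instead of `torch.autograd.grad`, the parameter gradients it accumulates locally should be zeroed before the training pass, otherwise they get folded into the next synchronized all-reduce.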

choltz95 closed this as completed Dec 6, 2021
