For instance, when I use the code from @csarofeen's fp16 example, everything works fine on 1 GPU for both --fp16 and regular 32-bit training. On 2 GPUs, 32-bit training still works fine, but 16-bit training is broken.
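For reference, here is a minimal sketch (not @csarofeen's actual example) of the kind of setup this describes: an fp16 model replicated across GPUs with nn.DataParallel. The model and sizes below are placeholders.

```python
import torch
import torch.nn as nn

# Placeholder model; the real code is the fp16 example referenced above.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

model = model.cuda().half()        # fp16 parameters and activations (--fp16 case)
model = nn.DataParallel(model)     # replicate across the 2 GPUs

inputs = torch.randn(32, 512).cuda().half()
output = model(inputs)             # fine on 1 GPU; unstable on 2 GPUs per the report
```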
Training becomes unstable or produces slower learning curves, and the validation loss is often NaN.
Tested with several setups, including 1 and 2 Titan Vs with CUDA 9.1 on driver 390.xx and CUDA 9.0 on driver 384.xx.
I tried adding torch.cuda.synchronize() around the fp16-specific lines, as well as casting the half-precision output back to float before passing it to the criterion. No luck with either idea.
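A sketch of those two workarounds, with a placeholder model and criterion just to make the snippet self-contained (the real training loop is the one from the fp16 example above):

```python
import torch
import torch.nn as nn

# Placeholder model/criterion standing in for the actual training code.
model = nn.DataParallel(nn.Linear(512, 10).cuda().half())
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(32, 512).cuda().half()
targets = torch.randint(0, 10, (32,)).cuda()

output = model(inputs)

torch.cuda.synchronize()                    # workaround 1: force a sync around the fp16 steps

loss = criterion(output.float(), targets)   # workaround 2: cast the half output back to fp32
loss.backward()

torch.cuda.synchronize()
```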