NaN causes summary writing to fail #3212
Comments
NaNs usually indicate something wrong with your training. Perhaps your learning rate is too high, perhaps you have invalid data, or maybe you have an invalid operation such as a divide by zero. TensorFlow refusing to write any NaNs is a warning that something has gone wrong with your training. If you still suspect there is an underlying bug, please provide a reproducible test case (as small as possible), plus information about your environment (see the issue submission template).
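For anyone hitting this thread, a minimal sketch of how one might catch NaNs early, assuming the TF 1.x graph API (at the time of this issue the initializer was still `tf.initialize_all_variables`); the toy regression model is only there to make the snippet self-contained and is not from the original report:

```python
import numpy as np
import tensorflow as tf

# Toy linear-regression graph just to illustrate the guard; swap in your own
# model, loss, and input pipeline.
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(tf.zeros([1, 1]))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))

# check_numerics raises InvalidArgumentError the moment the wrapped tensor
# contains a NaN or Inf, so you learn the exact step at which things break.
checked_loss = tf.check_numerics(loss, message="loss became non-finite")
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        batch_x = np.random.rand(32, 1).astype(np.float32)
        batch_y = 2.0 * batch_x
        # Bad input records are a common NaN source, so guard the data too.
        assert np.all(np.isfinite(batch_x)), "non-finite values in input batch"
        _, loss_val = sess.run([train_op, checked_loss],
                               feed_dict={x: batch_x, y: batch_y})
```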
Thank you for your reply, @aselle! Here is a snapshot of my current training. I turned off summary writing so that training can continue; if it is turned on, the program crashes after about 4 or 5 epochs. There are many NaNs in val_acc. I agree that the learning rate is too big (I use Adadelta(0.1)), because the curve is not smooth, which suggests the update steps are too large. Although there are many NaNs in val_acc, the process keeps going: the accuracy is going up and the loss is going down. I am not sure whether that is normal. Another question: because I use Adadelta, I can't see the effective learning rate. I have read papers that just use vanilla SGD with a learning-rate scheduler. So which is better, SGD with an LR scheduler or an adaptive optimizer (Adam, Adadelta, RMSProp, ...)?
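For reference, a hedged sketch of the "vanilla SGD with a learning-rate scheduler" setup mentioned above, using TF 1.x's `tf.train.exponential_decay`. The decay numbers are made up for illustration and `loss` stands in for your own loss tensor; whether this beats Adam/Adadelta/RMSProp is problem-dependent:

```python
import tensorflow as tf

# `loss` is assumed to be your model's scalar loss tensor.
global_step = tf.Variable(0, trainable=False, name="global_step")

# Start at 0.1 and multiply by 0.95 every 10,000 steps (illustrative values).
learning_rate = tf.train.exponential_decay(
    learning_rate=0.1,
    global_step=global_step,
    decay_steps=10000,
    decay_rate=0.95,
    staircase=True)

train_op = (tf.train.GradientDescentOptimizer(learning_rate)
            .minimize(loss, global_step=global_step))

# Unlike Adadelta's internal accumulators, the decayed rate is an ordinary
# tensor, so you can log it and see the schedule in TensorBoard.
tf.summary.scalar("learning_rate", learning_rate)
```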
In my opinion, it would be better if the summary writers didn't crash the program when they are passed a NaN. Alternatively, the optimization ops should be the ones that complain about NaN values.
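Until that changes, a possible client-side workaround is simply not to run the summary op when the monitored value is non-finite. A sketch, assuming a TF 1.x training loop where `sess`, `train_op`, `loss`, `merged` (the merged summary op), `writer`, `feed`, and `num_steps` already exist in your own code:

```python
import numpy as np

for step in range(num_steps):
    _, loss_val = sess.run([train_op, loss], feed_dict=feed)

    # Skip the summary op entirely when the loss is NaN/Inf; writing a NaN
    # would otherwise abort the whole run.
    if np.isfinite(loss_val):
        summary_str = sess.run(merged, feed_dict=feed)
        writer.add_summary(summary_str, step)
    else:
        print("step %d: loss is non-finite, skipping summary" % step)
```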
I agree that writing a NaN should not crash the program. I am guessing the crash is due to the histogram summary op? We should fix this. However, I am out for the next two weeks, and this sounds like a TensorFlow kernel issue rather than a TensorBoard issue. I'm going to unassign myself. |
Yes, the histogram summary op can be expensive and might cause crashes. Closing this for now - please feel free to refile if there's a more specific bug to look at. |
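If the histogram op is indeed what aborts the run, another workaround is to sanitize tensors before they reach it. A sketch, assuming TF 1.x summary ops (`tf.summary.histogram`; older releases spelled this `tf.histogram_summary`), where the helper name `safe_histogram` is made up for illustration:

```python
import tensorflow as tf

def safe_histogram(name, tensor):
    """Histogram summary that maps NaN/Inf to zero instead of crashing."""
    finite = tf.where(tf.is_finite(tensor),
                      tensor,
                      tf.zeros_like(tensor))
    return tf.summary.histogram(name, finite)

# Example: track a weight matrix that might temporarily contain NaNs.
weights = tf.Variable(tf.truncated_normal([128, 10]), name="weights")
weights_summary = safe_histogram("weights", weights)
```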
Specifically, @benwu232, if you confirm that the issue is that


I find that when I try to write some weights or biases to summaries, they cannot be NaN, or the program crashes. When I turn off summary writing, everything runs normally, except that the loss sometimes becomes NaN.
Is it normal for some variables (e.g. weights, biases) to be NaN at some point while the whole training process still runs OK? If it is, please modify the summary-writing functions so they don't crash in this situation.
Thanks!
Ben
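For what it's worth, a minimal repro along the lines aselle asked for might look like the sketch below (TF 1.x summary API assumed; the expectation, per this issue, is that evaluating the histogram op on a NaN raises InvalidArgumentError):

```python
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[3])
hist = tf.summary.histogram("x", x)

with tf.Session() as sess:
    writer = tf.summary.FileWriter("/tmp/nan_repro", sess.graph)
    # Feeding a NaN should trigger the failure described in this issue
    # when the histogram summary op is evaluated.
    summary_str = sess.run(hist, feed_dict={x: np.array([1.0, np.nan, 2.0],
                                                        dtype=np.float32)})
    writer.add_summary(summary_str, global_step=0)
    writer.close()
```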