NaN causes writing summary to fail #3212

Closed
benwu232 opened this issue Jul 7, 2016 · 7 comments
Labels: stat:awaiting response (Status - Awaiting response from author)

Comments


benwu232 commented Jul 7, 2016

I find that when I write some weights or biases into summaries, they cannot be NaN, or the program crashes. When I turn off writing summaries, everything runs normally, although the loss is sometimes NaN.

Is it normal for some variables (e.g. weights, biases) to be NaN at some point while the whole training process still runs OK? If it is, please modify the summary-writing functions so they don't crash in this situation.

Thanks!
Ben
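
For context, a minimal sketch of the situation described above (not taken from the original report; written against the 2016-era summary API, with the "weights" tag and /tmp log directory purely illustrative):

```python
import numpy as np
import tensorflow as tf

# Hypothetical reproduction: a histogram summary over a tensor that may
# contain NaN, as would happen if a weight diverged during training.
values = tf.placeholder(tf.float32, shape=[4])
hist = tf.histogram_summary("weights", values)

with tf.Session() as sess:
    writer = tf.train.SummaryWriter("/tmp/nan_demo")
    # Running the summary op on data containing NaN raises an
    # InvalidArgumentError ("Nan in summary histogram"), i.e. the crash
    # reported in this issue.
    summary = sess.run(hist, feed_dict={
        values: np.array([0.1, np.nan, 0.3, 0.4], dtype=np.float32)})
    writer.add_summary(summary, global_step=0)
```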

aselle added the stat:awaiting response label on Jul 7, 2016

aselle commented Jul 7, 2016

NaNs usually indicate something wrong with your training. Perhaps your learning rate is too high, perhaps you have invalid data, or maybe you have an invalid operation like a divide by zero. TensorFlow refusing to write any NaNs is a warning that something has gone wrong with your training.

If you still suspect there is an underlying bug, you need to provide us with a reproducible test case (as small as possible), plus information about your environment (please see the issue submission template).
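
As an aside (not from this thread), one common source of such NaNs is a hand-written loss that takes log(0) or divides by zero; clipping the inputs is a typical guard. A rough sketch, with the shapes and epsilon chosen only for illustration:

```python
import tensorflow as tf

# Illustrative guard against log(0) in a hand-written cross-entropy.
predictions = tf.placeholder(tf.float32, shape=[None, 10])
labels = tf.placeholder(tf.float32, shape=[None, 10])

# Clip predictions away from 0 so tf.log never produces -inf (and NaN downstream).
safe_preds = tf.clip_by_value(predictions, 1e-10, 1.0)
cross_entropy = -tf.reduce_mean(
    tf.reduce_sum(labels * tf.log(safe_preds), reduction_indices=[1]))
```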


benwu232 commented Jul 7, 2016

Thank you for your reply! @aselle

Here is a snapshot of my current training.
[two screenshots of the training run]

I turned off summary writing so that the training can continue. If it is turned on, the program crashes after about 4 or 5 epochs.

There are many NaNs in val_acc. I agree that the learning rate is too big (I use Adadelta with 0.1), because the curve is not smooth, which suggests the step size is too large. Although there are many NaNs in val_acc, the process keeps going: the accuracy goes up and the loss goes down. I am not sure whether this is normal.

Another question: because I use Adadelta, I can't see the real learning rate. I have read some papers that just use vanilla SGD with a learning rate scheduler. So which is better, SGD with an LR scheduler or other optimizers (Adam, Adadelta, RMSProp, ...)?
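
For context on that question, a hedged sketch of what "vanilla SGD with a learning rate scheduler" can look like in TensorFlow, using tf.train.exponential_decay; the toy loss and the decay numbers below are placeholders, not recommendations:

```python
import tensorflow as tf

# SGD whose learning rate decays on a schedule driven by global_step.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    0.1, global_step, decay_steps=10000, decay_rate=0.96, staircase=True)

w = tf.Variable(5.0)
loss = tf.square(w)  # stand-in for a real model loss

# minimize() increments global_step, which in turn advances the schedule.
train_op = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)
```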


ibab commented Jul 12, 2016

In my opinion, it would be better if the summary writers didn't crash the program when they are passed a NaN.
Instead, it would be nice if the NaN values were stored and then flagged somehow in TensorBoard.
Users could then use TensorBoard to find the source of the NaNs more easily.

Alternatively, it should be the optimization ops that complain about NaN values.
That seems more natural than having the monitoring ops complain.
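
One possible user-side approximation of this idea (illustrative only, using the pre-1.0 ops tf.select and tf.histogram_summary; this is not behaviour TensorFlow implements): replace NaNs before the histogram and expose a NaN count as a separate scalar so the problem stays visible in TensorBoard.

```python
import tensorflow as tf

weights = tf.Variable(tf.truncated_normal([784, 10]))

# Mask out NaNs so the histogram summary cannot fail, but count them so the
# divergence is still flagged in TensorBoard.
nan_mask = tf.is_nan(weights)
nan_count = tf.reduce_sum(tf.cast(nan_mask, tf.float32))
clean_weights = tf.select(nan_mask, tf.zeros_like(weights), weights)

tf.histogram_summary("weights", clean_weights)
tf.scalar_summary("weights/nan_count", nan_count)
```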

aselle removed the stat:awaiting response label on Jul 19, 2016
concretevitamin (Contributor) commented:

Dan, assigning it to you to take a look at the TensorBoard feature request.

@ibab @benwu232 You could use tf.check_numerics() to catch these errors during training. Would this op work?
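
A minimal sketch of that suggestion (the variable shape and message are illustrative): tf.check_numerics() returns its input but makes the run fail with a clear message as soon as the tensor contains NaN or Inf, instead of failing later in the summary writer.

```python
import tensorflow as tf

weights = tf.Variable(tf.truncated_normal([784, 10]))

# Identity op that raises InvalidArgumentError if `weights` contains NaN or Inf.
checked_weights = tf.check_numerics(weights, "weights contain NaN/Inf")

# Alternatively, add a check for every floating-point tensor in the graph and
# run the returned op alongside the training step.
check_op = tf.add_check_numerics_ops()
```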

concretevitamin added the stat:awaiting response label on Jul 19, 2016

teamdandelion commented Jul 20, 2016

I agree that writing a NaN should not crash the program. I am guessing the crash is due to the histogram summary op? We should fix this.

However, I am out for the next two weeks, and this sounds like a TensorFlow kernel issue rather than a TensorBoard issue. I'm going to unassign myself.

teamdandelion removed their assignment on Jul 20, 2016
concretevitamin (Contributor) commented:

Yes, the histogram summary op can be expensive and might cause crashes.

Closing this for now - please feel free to refile if there's a more specific bug to look at.

teamdandelion (Contributor) commented:

Specifically, @benwu232, if you confirm that the issue is that tf.histogram_summary crashes when passed a NaN, please open an issue on that specific problem and we will triage it there.
