Invalid Loss error when training on the GPU #51
Comments
Great update. I don't have access to a GPU right now, so I haven't run the CUDA unit-tests for a while. One initial theory: cuda-batchnorm is (must be) different from Keras-cpu-batchnorm, since the latter accepts a mask and, as far as I know, cuda-batchnorm unfortunately doesn't. I'm not sure whether Keras calls the cuda-batchnorm primitives, though. Are you using batchnorm? Otherwise, the same goes for masking in general.

Another theory is that the machine epsilon is different on CPU and GPU, so I recommend setting it explicitly. As a general recommendation, I also recommend clipping the log-likelihood.
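A minimal sketch of those two knobs using standard `keras.backend` calls; the clipped loss below is a hypothetical stand-in that works on a per-timestep likelihood, not the actual WTTE-RNN loss signature:

```python
import keras.backend as K

# Raise the machine epsilon used by the backend (default is 1e-7);
# float32 GPU kernels can be less forgiving than float64 CPU runs.
K.set_epsilon(1e-6)

def clipped_negloglik(y_true, y_pred):
    # Hypothetical stand-in: treat y_pred as a per-timestep likelihood and
    # clip it away from 0 and 1 before taking the log, so log(0) can't occur.
    likelihood = K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon())
    return -K.mean(K.log(likelihood))
```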
@ragulpr Thanks for your reply. I am not using batchnorm, and you are correct that CuDNN does not accept masking. I will try the epsilon and log-likelihood clipping and let you know how it goes.
If you find anything inside WTTE not working properly with the GPU it would be very good to know, so thanks a lot for raising the issue. For general NaN-avoidance there are many other git-issues with recommendations; a couple of top-of-the-list remedies are sketched below for further reference.
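Two generic remedies, sketched with plain Keras building blocks (the clip value and optimizer choice are arbitrary examples, not values from the WTTE-RNN docs):

```python
from keras.optimizers import RMSprop
from keras.callbacks import TerminateOnNaN

# Clip each gradient element so a single bad batch can't blow up the weights.
optimizer = RMSprop(lr=0.001, clipvalue=0.5)

# Stop training as soon as the loss becomes NaN/inf, so the offending
# epoch/batch is easy to pinpoint instead of polluting later weights.
nan_guard = TerminateOnNaN()

# model.compile(loss=..., optimizer=optimizer)
# model.fit(..., callbacks=[nan_guard])
```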
Another idea I forgot: I've had problems getting the GPU to respect the random seed I set for it, but that might be a PyTorch problem. If you repeat the experiment using different seeds on the CPU, maybe you get the same NaN failures?
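A small sketch of that experiment, assuming the TF 1.x era API that CuDNNGRU belongs to (the seed values and helper name are made up):

```python
import random
import numpy as np
import tensorflow as tf

def set_all_seeds(seed):
    # Seed every RNG that Keras/TensorFlow training can touch.
    random.seed(seed)
    np.random.seed(seed)
    tf.set_random_seed(seed)  # tf.random.set_seed(seed) on TF 2.x

for seed in (1, 2, 3, 4):
    set_all_seeds(seed)
    # ...rebuild and fit the model on CPU here, then record whether
    # the loss went NaN for this seed.
```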
Hi, @dongishan called my attention to this post recently, and it came to mind today while working with the GPU. I have also observed numerical instabilities in the loss function when using it (we use the same cluster). I have observed this for WTTE-RNN, but also for an extension of it that I wrote for a Gaussian-based loss function. I had not commented anything until now because my main hypothesis was that those instabilities were due to my data being contaminated/badly pre-processed (I use real industrial data). But today I started comparing the GPU and the CPU, and initial results show that the loss is much more stable on the CPU.

My architecture is quite simple, with a large batch size: two stacked 50-neuron LSTMs with regularisation and a TimeDistributed 100-neuron dense layer. I use tanh everywhere as the activation function.

With regard to numerical instability in the WTTE-RNN case, I have usually been successful avoiding it by normalizing the times to event and by using the continuous log-likelihood. For some reason, in my datasets the discrete mode was more prone to numerical instability. I prefer that to clipping.

Update: I have now run 4 experiments (10000 epochs each) and observed some loss instabilities in the CPU case as well, but to a much lesser extent than in the GPU case.
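For reference, a minimal sketch of the normalization idea (the names and the scaling choice are illustrative, not taken from the WTTE-RNN package):

```python
import numpy as np

def normalize_tte(tte, scale=None):
    """Scale time-to-event targets so the Weibull parameters stay in a
    benign range; keep `scale` to map predictions back afterwards."""
    tte = np.asarray(tte, dtype="float64")
    if scale is None:
        scale = np.nanmean(tte)  # or a robust quantile of the observed TTEs
    return tte / scale, scale

# Example: the predicted Weibull alpha can later be multiplied by `scale`
# to map predictions back to the original time unit.
tte_norm, scale = normalize_tte([3.0, 10.0, 27.0, 150.0])
```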
There may be a lot of reasons for numerical instability, as pointed out, so it would be very helpful if we could find a GPU/CPU-reproducible example. Could it have anything to do with the contents of your keras.json file? Maybe the GPU is using float32 and the CPU float64, or similar?
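A quick way to check (and, if needed, pin) the precision from Python instead of editing keras.json, using standard `keras.backend` calls:

```python
import keras.backend as K

# Print what the current environment actually uses; run this in both the
# CPU and the GPU setup and compare.
print(K.floatx())    # e.g. 'float32'
print(K.epsilon())   # e.g. 1e-07

# To force the same dtype everywhere, set it before building the model.
K.set_floatx("float64")
K.set_epsilon(1e-8)
```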
What could be the reason for the Invalid Loss error appearing in GPU training but not in CPU training?
I've successfully trained the WTTE-RNN algorithm on a CPU using a GRU RNN on the C-MAPSS dataset. However, when doing the same on an NVIDIA GPU with CuDNNGRU, I get the Invalid Loss error at around epoch 20 of 100.
I am using Keras with the TensorFlow backend, and the WTTE-RNN version is 1.1.1.
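One way to narrow this down is to keep everything identical and only swap the recurrent layer, since a plain GRU also runs on the GPU, just without the fused CuDNN kernel. A sketch with made-up layer sizes, and with the WTTE output activation and loss left out:

```python
from keras.models import Sequential
from keras.layers import Dense, GRU, TimeDistributed

def build_model(use_cudnn=False, n_features=24, n_units=50):
    model = Sequential()
    if use_cudnn:
        from keras.layers import CuDNNGRU
        model.add(CuDNNGRU(n_units, return_sequences=True,
                           input_shape=(None, n_features)))
    else:
        # Plain GRU goes through the generic (non-fused) code path; note it
        # is not bit-for-bit identical to the CuDNN variant's gate math.
        model.add(GRU(n_units, activation="tanh", return_sequences=True,
                      input_shape=(None, n_features)))
    # Two outputs per timestep would feed the Weibull alpha/beta via the
    # WTTE output activation (omitted here).
    model.add(TimeDistributed(Dense(2)))
    return model
```

If the plain GRU is stable on the GPU while CuDNNGRU is not, that points at the fused kernel rather than at the loss or the data.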