BatchNormalization fails when an NVIDIA GPU is used and the training set size and the batch size are "well set" #46205
Comments
I have tried this in Colab with TF GPU version 2.4 and I am not seeing any issue. Please find the gist here. Colab is using CUDA 10.1. Thanks!
Strange, I just reran the same box (I mean the same block in the notebook you linked) and got this: 2/2 [==============================] - 0s 5ms/step - loss: 19.9350
@enyecz I ran @ravikyram's gist and saw a number (instead of nan as you mentioned). Please check the gist here. Did you run the code (i) in Colab, (ii) locally with CUDA 11.0, or (iii) both in Colab and locally? Thanks!
It seems to me that Colab doesn't like pip. If you run pip first to install tensorflow-gpu, you don't get a GPU later, and when running on CPU there is no bug. In that case, you need to wait for the timeout, and your GPUs come back with TF 2.4.0 (that's why I saw nan in the very same notebook). nan-bug.mp4
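To rule out the silent CPU fallback described above, you can check which devices TensorFlow actually sees before training. A minimal sketch; tf.config.list_physical_devices is standard TensorFlow API, the assertion message is illustrative:

import tensorflow as tf

# An empty list means the run has silently fallen back to CPU,
# where the bug does not reproduce.
gpus = tf.config.list_physical_devices('GPU')
print('Visible GPUs:', gpus)
assert gpus, 'No GPU visible; this run will not exercise the GPU code path'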
I was able to reproduce the issue with TF v2.5-gpu; please find the gist here. Thanks!
The issue still reproduces in TF 2.7. Attaching a gist for reference. Thanks!
It says: 2/2 [==============================] - 13s 26ms/step - loss: 18.6861. So yes, this is still a bug.
BUMP: the bug is still there in TF 2.8.2. I ran into it in a new project (and lost half an hour before I recognized the problem and facepalmed). How can this not be a problem for others?
System information
Describe the current behavior
When the size of the training set and the batch size are set so that the last batch contains only one element (i.e. len(training_set) % batch_size == 1), one of the weights of the batch normalization layer (the moving variance) is set to nan. The problem disappears when using the CPU.
Describe the expected behavior
Have valid weights computed regardless of the size of the training set.
Standalone code to reproduce the issue
from tensorflow import keras
import numpy as np

# A trivial model: Conv2D followed by BatchNormalization.
inp = keras.layers.Input((1, 1, 1))
mid = keras.layers.Conv2D(1, (1, 1))(inp)
out = keras.layers.BatchNormalization()(mid)
mod = keras.models.Model(inputs=inp, outputs=out)
mod.compile(optimizer='adam', loss='mse')

# 9 samples with batch_size=8: the last batch holds a single element
# (9 % 8 == 1), which triggers the bug on GPU.
data = np.reshape(np.arange(9, dtype=np.float32), (9, 1, 1, 1))
mod.fit(data, data, batch_size=8)

# layers[2] is the BatchNormalization layer; its weights are
# [gamma, beta, moving_mean, moving_variance], so index 3 is the
# moving variance. This prints nan when training on GPU.
print(mod.layers[2].get_weights()[3])
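A possible workaround (a sketch only, not verified against this bug) is to build the input with tf.data and drop the final partial batch, so no batch ever contains a single element; drop_remainder is a standard argument of Dataset.batch:

import tensorflow as tf

# Dropping the remainder avoids the 1-element final batch
# (9 % 8 == 1) that triggers the nan moving variance on GPU.
ds = tf.data.Dataset.from_tensor_slices((data, data)).batch(8, drop_remainder=True)
mod.fit(ds)
print(mod.layers[2].get_weights()[3])  # expected to stay finite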