
BatchNormalization fails when an nvidia GPU is used and the training size and the batch size are "well set" #46205

Open
enyecz opened this issue Jan 6, 2021 · 9 comments
Labels
comp:keras Keras related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.7 Issues related to TF 2.7.0 type:bug Bug


enyecz commented Jan 6, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary installed by pip3
  • TensorFlow version (use command below): 2.4.0
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.0/8.0
  • GPU model and memory: Nvidia 1080Ti 11GB

Describe the current behavior
When the size of the training set and the batch size are set so that the last batch contains only one element (i.e. len(training_set) % batch_size == 1), one of the weights of the BatchNormalization layer (the moving variance) is set to nan. The problem disappears when running on the CPU.
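A minimal sketch of one plausible mechanism (my own guess, not confirmed anywhere in this thread): if the per-batch variance is computed with the unbiased (n - 1) correction, a final batch of size one divides by zero and produces nan.

import numpy as np

# Hypothetical illustration: the biased variance of a single sample is 0,
# but the unbiased estimate divides by n - 1 = 0 and yields nan.
last_batch = np.array([8.0])        # the lone element of the final batch
print(np.var(last_batch))           # 0.0  (divide by n)
print(np.var(last_batch, ddof=1))   # nan  (divide by n - 1)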

Describe the expected behavior
Have valid weights computed regardless of the size of the training set.

Standalone code to reproduce the issue
from tensorflow import keras
import numpy as np

# Minimal model: a 1x1 convolution followed by batch normalization.
inp = keras.layers.Input((1, 1, 1))
mid = keras.layers.Conv2D(1, (1, 1))(inp)
out = keras.layers.BatchNormalization()(mid)

mod = keras.models.Model(inputs=inp, outputs=out)
mod.compile(optimizer='adam', loss='mse')

# 9 samples with batch_size=8, so the last batch contains a single element.
data = np.reshape(np.arange(9), (9, 1, 1, 1))

mod.fit(data, data, batch_size=8)
# BatchNormalization weights are [gamma, beta, moving_mean, moving_variance];
# index 3 (the moving variance) is nan when training on the GPU.
print(mod.layers[2].get_weights()[3])
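A possible workaround sketch (my own suggestion, not an official fix): feed the model through tf.data and drop the incomplete final batch, so BatchNormalization never updates its statistics from a single sample. The drop_remainder flag is a standard tf.data option; the rest simply reuses the model built in the snippet above.

import numpy as np
import tensorflow as tf

data = np.reshape(np.arange(9), (9, 1, 1, 1)).astype(np.float32)

# Batch the data ourselves and drop the trailing single-element batch.
ds = tf.data.Dataset.from_tensor_slices((data, data)).batch(8, drop_remainder=True)

mod.fit(ds)                               # mod is the model built above
print(mod.layers[2].get_weights()[3])     # moving variance stays finite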

@enyecz enyecz added the type:bug Bug label Jan 6, 2021
@ravikyram ravikyram added comp:keras Keras related issues TF 2.4 for issues related to TF 2.4 labels Jan 6, 2021
@ravikyram
Contributor

I tried this in Colab with TF GPU version 2.4 and I am not seeing any issue. Please find the gist here. Colab is using CUDA 10.1. Thanks!

@ravikyram ravikyram assigned jvishnuvardhan and unassigned ravikyram Jan 6, 2021

enyecz commented Jan 6, 2021

Strange, I just reran the same cell (the same block in the notebook you linked) and got this:

2/2 [==============================] - 0s 5ms/step - loss: 19.9350
[nan]

@jvishnuvardhan
Contributor

@enyecz I ran @ravikyram's gist and saw a number (instead of the nan you mentioned). Please check the gist here.

Did you run the code (i) in Colab, (ii) locally with CUDA 11.0, or (iii) both in Colab and locally? Thanks!

@jvishnuvardhan jvishnuvardhan added the stat:awaiting response Status - Awaiting response from author label Jan 6, 2021

enyecz commented Jan 6, 2021

It seems to me that Colab doesn't play well with pip. If you run pip first to install tensorflow-gpu, the GPU is no longer available, and when running on the CPU there is no bug. In that case, you need to wait for the runtime timeout, after which the GPU comes back with TF 2.4.0 (that is why I saw nan in the very same notebook).
Please take a look at the video I made, which shows the whole case. In it I use tf.test.is_gpu_available() to show that the GPU is lost.

nan-bug.mp4
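For reference, a quick check (my own snippet, not from the video) of whether the runtime actually has a GPU; tf.test.is_gpu_available() is deprecated in newer TF releases in favour of the call below.

import tensorflow as tf

# An empty list means the runtime fell back to the CPU, which hides the bug.
print(tf.config.list_physical_devices('GPU'))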

@jvishnuvardhan
Contributor

@enyecz I agree with you. I just tried with TF2.3-gpu, which results in nan. Please check the gist here.

There is some issue with TF2.4-gpu when it is pip-installed in Colab. We will look into it. Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Jan 7, 2021
@sushreebarsa
Contributor

I was able to reproduce the issue with TF v2.5-gpu. Please find the gist here. Thanks!

@mohantym
Contributor

It still reproduces in TF 2.7. Attaching a gist for reference. Thanks!

@mohantym mohantym added the TF 2.7 Issues related to TF 2.7.0 label Nov 25, 2021

enyecz commented Nov 25, 2021

It says:

2/2 [==============================] - 13s 26ms/step - loss: 18.6861
[nan]

So yes. This is still a bug.


enyecz commented Jun 26, 2022

BUMP: The bug is still there with TF 2.8.2. I ran into it in a new project (and lost half an hour before I recognized the problem and had a facepalm). How can this not be a problem for others?

@tilakrayal tilakrayal removed the TF 2.4 for issues related to TF 2.4 label Dec 1, 2022