
BatchNormalization fails when an nvidia GPU is used and the training size and the batch size are "well set" #46205

Open
enyecz opened this issue Jan 6, 2021 · 9 comments
Labels
comp:keras Keras related issues stat:awaiting tensorflower Status - Awaiting response from tensorflower TF 2.7 Issues related to TF 2.7.0 type:bug Bug


enyecz commented Jan 6, 2021

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 20.04
  • TensorFlow installed from (source or binary): binary installed by pip3
  • TensorFlow version (use command below): 2.4.0
  • Python version: 3.8.5
  • CUDA/cuDNN version: 11.0/8.0
  • GPU model and memory: Nvidia 1080Ti 11GB

Describe the current behavior
When the size of the training set and the batch size are set so that the last batch contains only one element (i.e. len(training_set) % batch_size == 1), one of the weights of the BatchNormalization layer (the moving variance) is set to nan. The problem disappears when running on the CPU.
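A minimal sketch of one plausible mechanism (my own guess, not confirmed anywhere in this thread): if the per-batch variance is computed with the unbiased (n - 1) correction, a final batch of size one divides by zero and produces nan.

import numpy as np

# Hypothetical illustration: the biased variance of a single sample is 0,
# but the unbiased estimate divides by n - 1 = 0 and yields nan.
last_batch = np.array([8.0])        # the lone element of the final batch
print(np.var(last_batch))           # 0.0  (divide by n)
print(np.var(last_batch, ddof=1))   # nan  (divide by n - 1)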

Describe the expected behavior
Have valid weights computed regardless of the size of the training set.

Standalone code to reproduce the issue
from tensorflow import keras
import numpy as np

# Minimal model: a 1x1 convolution followed by batch normalization.
inp = keras.layers.Input((1, 1, 1))
mid = keras.layers.Conv2D(1, (1, 1))(inp)
out = keras.layers.BatchNormalization()(mid)

mod = keras.models.Model(inputs=inp, outputs=out)
mod.compile(optimizer='adam', loss='mse')

# 9 samples with batch_size=8, so the last batch contains a single element.
data = np.reshape(np.arange(9), (9, 1, 1, 1))

mod.fit(data, data, batch_size=8)
# BatchNormalization weights are [gamma, beta, moving_mean, moving_variance];
# index 3 (the moving variance) is nan when training on the GPU.
print(mod.layers[2].get_weights()[3])
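A possible workaround sketch (my own suggestion, not an official fix): feed the model through tf.data and drop the incomplete final batch, so BatchNormalization never updates its statistics from a single sample. The drop_remainder flag is a standard tf.data option; the rest simply reuses the model built in the snippet above.

import numpy as np
import tensorflow as tf

data = np.reshape(np.arange(9), (9, 1, 1, 1)).astype(np.float32)

# Batch the data ourselves and drop the trailing single-element batch.
ds = tf.data.Dataset.from_tensor_slices((data, data)).batch(8, drop_remainder=True)

mod.fit(ds)                               # mod is the model built above
print(mod.layers[2].get_weights()[3])     # moving variance stays finite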

@enyecz enyecz added the type:bug Bug label Jan 6, 2021
@ravikyram ravikyram added comp:keras Keras related issues TF 2.4 for issues related to TF 2.4 labels Jan 6, 2021
@ravikyram
Contributor

I tried this in Colab with TF GPU version 2.4 and I am not seeing any issue. Please find the gist here. Colab is using CUDA 10.1. Thanks!

@ravikyram ravikyram assigned jvishnuvardhan and unassigned ravikyram Jan 6, 2021

enyecz commented Jan 6, 2021

Strange, I just reran the same cell (the same block in the notebook you linked) and got this:

2/2 [==============================] - 0s 5ms/step - loss: 19.9350
[nan]

@jvishnuvardhan
Contributor

@enyecz I ran @ravikyram's gist and saw a number (instead of the nan you mentioned). Please check the gist here.

Did you run the code (i) in Colab, (ii) locally with CUDA 11.0, or (iii) both in Colab and locally? Thanks!

@jvishnuvardhan jvishnuvardhan added the stat:awaiting response Status - Awaiting response from author label Jan 6, 2021

enyecz commented Jan 6, 2021

It seems to me that Colab doesn't play well with pip. If you run pip first to install tensorflow-gpu, the GPU is no longer available, and when running on the CPU there is no bug. In that case, you need to wait for the runtime timeout, after which the GPU comes back with TF 2.4.0 (that is why I saw nan in the very same notebook).
Please take a look at the video I made, which shows the whole case. In it I use tf.test.is_gpu_available() to show that the GPU is lost.

nan-bug.mp4
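For reference, a quick check (my own snippet, not from the video) of whether the runtime actually has a GPU; tf.test.is_gpu_available() is deprecated in newer TF releases in favour of the call below.

import tensorflow as tf

# An empty list means the runtime fell back to the CPU, which hides the bug.
print(tf.config.list_physical_devices('GPU'))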

@jvishnuvardhan
Contributor

@enyecz I agree with you. I just tried with TF2.3-gpu, which results in nan. Please check the gist here.

There is some issue with TF2.4-gpu when it is pip-installed in Colab. We will look into it. Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Jan 7, 2021
@sushreebarsa
Contributor

I was able to reproduce the issue with TF v2.5-gpu. Please find the gist here. Thanks!

@mohantym
Contributor

It still reproduces in TF 2.7. Attaching a gist for reference. Thanks!

@mohantym mohantym added the TF 2.7 Issues related to TF 2.7.0 label Nov 25, 2021

enyecz commented Nov 25, 2021

It says:

2/2 [==============================] - 13s 26ms/step - loss: 18.6861
[nan]

So yes. This is still a bug.


enyecz commented Jun 26, 2022

BUMP: The bug is still there with TF 2.8.2. I ran into it in a new project (and lost half an hour before I recognized the problem and had a facepalm). How can this not be a problem for others?

@tilakrayal tilakrayal removed the TF 2.4 for issues related to TF 2.4 label Dec 1, 2022