
Tensorflow operations: Invalid data type according to Tensorflow Profiler #45946

Open
RocaVincent opened this issue Dec 23, 2020 · 12 comments
Labels: comp:gpu (GPU related issues), comp:tensorboard (Tensorboard related issues), stat:awaiting tensorflower (Status - Awaiting response from tensorflower), TF 2.4 (for issues related to TF 2.4), type:bug (Bug)

Comments

@RocaVincent

Hi,

I use TensorFlow 2.5 built from source with CUDA 11.1 and cuDNN 8 on Ubuntu 20.04. My GPU is an Nvidia Quadro RTX 6000.

I have been running into out-of-memory (OOM) problems on the GPU while training a CycleGAN model. To track down memory leaks in my code, I ran it with a profiler trace, and I now see strange results in TensorBoard's memory breakdown table (Memory Profile tab): some TensorFlow operations have an "INVALID" data type, no region type, and no shape. This suggests there are bugs in some TensorFlow operations.

Below is a minimal code sample that reproduces this kind of error.

import tensorflow as tf
from tensorflow import keras  # use the Keras bundled with TensorFlow

IMAGE_SHAPE = [256,256,3]

def Discriminator():
    return keras.Sequential([
        keras.layers.Flatten(input_shape=IMAGE_SHAPE),
        keras.layers.Dense(1, activation="sigmoid")
    ])

def Generator():
    return keras.Sequential([
        keras.layers.Conv2D(filters=IMAGE_SHAPE[-1], kernel_size=3, strides=1, padding="same", use_bias=False,
                           input_shape=IMAGE_SHAPE)
    ])

generator_BtoA = Generator()
discriminator_A = Discriminator()

loss_obj = keras.losses.MeanSquaredError()

discriminator_A_optimizer = keras.optimizers.Adam(0.0002)

BATCH_SIZE = 32

@tf.function
def train_step():
    # Train the discriminator on one batch of random input images.
    imagesA = tf.random.uniform([BATCH_SIZE]+IMAGE_SHAPE)
    imagesB = tf.random.uniform([BATCH_SIZE]+IMAGE_SHAPE)
    fakesA = generator_BtoA(imagesB, training=False)
    with tf.GradientTape(persistent=True) as tape:
        disc_fakesA = discriminator_A(fakesA, training=True)
        discA_loss = loss_obj(tf.zeros_like(disc_fakesA), disc_fakesA)
    gradients_discA = tape.gradient(discA_loss, discriminator_A.trainable_variables)
    discriminator_A_optimizer.apply_gradients(zip(gradients_discA, discriminator_A.trainable_variables))


from tensorflow.profiler.experimental import Trace as Trace_profiler, start as start_profiler, stop as stop_profiler

start_profiler("my_logdir/")
with Trace_profiler("train", step_num=1, _r=-1):
    train_step()
stop_profiler()

With this code, I get the following results in the Memory Profile tab:

| Op Name | Allocation Size (GiBs) | Requested Size (GiBs) | Occurrences | Region type | Data type | Shape |
| --- | --- | --- | --- | --- | --- | --- |
| sequential/conv2d/Conv2D | 0.227 | 0.227 | 1 | | INVALID | |
| sequential/conv2d/Conv2D | 0.039 | 0.039 | 1 | | INVALID | |
| sequential/conv2d/Conv2D | 0.023 | 0.023 | 1 | output | float | [32,3,256,256] |
| sequential/conv2d/Conv2D-0-TransposeNHWCToNCHW-LayoutOptimizer | 0.023 | 0.023 | 1 | output | float | [32,3,256,256] |

How do you interpret these results?

@amahendrakar
Contributor

@RocaVincent,

I have got Out of Memory problems with the GPU and the training of a CycleGAN model.

To resolve the out of memory error, please try any one of the methods to limit GPU memory as shown in this guide.

Also, please go through these guides for tensorboard profiling tool and memory profile summary for more information. Thanks!
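For example, a minimal sketch of one of the approaches from that guide (enabling memory growth, so TensorFlow allocates GPU memory on demand instead of reserving the whole GPU up front):

import tensorflow as tf

# Sketch: enable memory growth on every visible GPU before any op runs.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)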

@amahendrakar amahendrakar added comp:gpu GPU related issues stat:awaiting response Status - Awaiting response from author TF 2.4 for issues related to TF 2.4 type:support Support issues and removed type:bug Bug labels Dec 24, 2020
@RocaVincent
Author

@amahendrakar

To resolve the out of memory error, please try any one of the methods to limit GPU memory as shown in this guide.

I have already tried allowing memory growth, but the problem is still there.

Also, please go through these guides for tensorboard profiling tool and memory profile summary for more information.

That's what I've done, and it's what I explained in the first post, together with a minimal code sample that reproduces the profiler results; you may have missed it.

I'm curious to know whether others get the same results with this code (run it and open the my_logdir directory with the profiler).
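For anyone reproducing this, a minimal way to inspect the resulting trace (assuming the TensorBoard Profile plugin is installed):

pip install tensorboard_plugin_profile   # TensorBoard Profile plugin, if not already present
tensorboard --logdir my_logdir/
# then open the Profile dashboard and select the memory_profile tool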

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 26, 2020
@amahendrakar
Contributor

I did not face any out-of-memory errors when running the code; however, I did get similar results with the memory profiler on TF v2.3, TF v2.4, and TF-nightly. Please find the gist of it here.


Thanks!

@amahendrakar amahendrakar added the comp:tensorboard Tensorboard related issues label Dec 27, 2020
@rmothukuru
Contributor

@RocaVincent,
Can you please respond to the above comment? Thanks!

@rmothukuru rmothukuru added the stat:awaiting response Status - Awaiting response from author label Jan 4, 2021
@RocaVincent
Author

@rmothukuru
My issue is about the strange results the profiler gives for these convolutions: an invalid data type, no shape, and no region type. Of course, I don't get OOM errors with this simple code either, but it points to problems that may cause OOM with bigger models.

@rmothukuru rmothukuru assigned bmd3k and unassigned rmothukuru Jan 5, 2021
@rmothukuru rmothukuru added stat:awaiting tensorflower Status - Awaiting response from tensorflower type:bug Bug and removed stat:awaiting response Status - Awaiting response from author type:support Support issues labels Jan 5, 2021
@ckluk-github

Hi,
I think I replied to you regarding this issue in tensorflow/profiler#255.
We are still investigating (delayed due to the holidays).
Thanks.

@bmd3k bmd3k assigned ckluk-github and unassigned bmd3k Jan 8, 2021
@tinducvo

I have this issue with Conv1D also. My model should be using under 200MB, but I have INVALID shape when profiling and the heap usage spikes to 4GB.

@sushreebarsa
Contributor

I was able to replicate the issue in TF v2.5; please find the gist here. Thanks!

@bastienjalbert

bastienjalbert commented Sep 19, 2021

I have this issue with Conv1D also. My model should be using under 200MB, but I have INVALID shape when profiling and the heap usage spikes to 4GB.

I got almost exactly the same thing, with Conv2D. About 4 GB of memory get consumed, even though the network needs no more than 200 MB to run...


Has anybody found a solution, or does anyone have an idea about this?

@xsqian

xsqian commented Oct 24, 2021

I am running TensorFlow 2.4.3 and see the same issue. Has anybody found a solution or an idea for this issue?

@yichunkuo-pony

I have a similar issue. Any update? @ckluk-github

@isabellahuang

Running into the same issue as well.
