
Tensorflow 2.1 Error “when finalizing GeneratorDataset iterator” - a memory leak? #37515

Closed. Tuxius opened this issue Mar 11, 2020 · 19 comments
Labels: comp:data (tf.data related issues), TF 2.1 (for tracking issues in 2.1 release), type:performance (Performance Issue)

@Tuxius commented Mar 11, 2020

Reopening issue #35100, as more and more people report still having the same problem:

Problem description

I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set is growing, I have to start using a Generator because I do not have enough RAM to hold all pictures. I have coded the Generator based on this tutorial.

It seems to work fine, until my program all of a sudden gets killed without an error message:

Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed

Looking at the growing memory consumption with Linux's top, I suspect a memory leak?

What I have tried

  • The suggestion to switch to a TF nightly build. It did not help for me; downgrading to TF 2.0.1 did not help either.

  • There is a discussion suggesting that it is important that 'steps_per_epoch' and 'batch_size' correspond (whatever that exactly means) - I played with it without finding any improvement (see the sketch after this list).

  • Trying to narrow it down by watching how the sizes of all variables in my Generator develop
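
[Editor's note: a minimal sketch of the steps_per_epoch / batch_size relationship that discussion refers to. The numbers below are assumptions for illustration, not taken from this issue, except that the log above shows 43 steps per epoch.]

# Illustrative only: how steps_per_epoch relates to batch size for a fixed-size training set.
num_samples = 1376        # hypothetical training-set size
batch_size = 32           # hypothetical batch size
steps_per_epoch = num_samples // batch_size   # 43 full batches per epoch
# A keras.utils.Sequence returns exactly this value from __len__, so passing
# steps_per_epoch=len(training_generator) to fit() keeps the two consistent.
print(steps_per_epoch)    # -> 43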

Relevant code snippets

import configparser
import math

import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')

        self.dim = (int(config['Basics']['PicHeight']),int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()        


    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y, [None]

being called by

        training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6, 
                    epochs=self._Epochs, 
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])
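
[Editor's note: the DataGenerator snippet above refers to on_epoch_end and __data_generation but omits them. Purely as a hypothetical sketch of the tutorial pattern the author mentions -- not the author's actual code -- they typically look like the following (assumes import os and import numpy as np).]

    def on_epoch_end(self):
        'Hypothetical sketch: reshuffle the sample indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Hypothetical sketch: load one batch of images and labels from disk'
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)
        for i, ID in enumerate(list_IDs_temp):
            img = tf.keras.preprocessing.image.load_img(
                os.path.join(self.dir, ID), target_size=self.dim)
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = self.labels[ID]
        return X, y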

I have tried running the exact same code under Windows 10, which gives me the following error:

Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
  File "run.py", line 83, in <module>
    main()
  File "run.py", line 70, in main
    accuracy, num_of_classes = train_Posture(unique_name)
  File "run.py", line 31, in train_Posture
    acc = neuro.train(picdb, train_ids, test_ids, "Posture")
  File "A:\200307 3rd Try\neuro.py", line 161, in train
    callbacks=[LoggingCallback(self.logger.debug)])
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_764]

Function call stack:
distributed_function

2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Tuxius added the type:bug label Mar 11, 2020
Saduf2019 self-assigned this Mar 12, 2020
Saduf2019 added the TF 2.1, comp:data and type:performance labels and removed the type:bug label Mar 12, 2020
Saduf2019 assigned ymodak and unassigned Saduf2019 Mar 12, 2020
ymodak assigned jsimsa and unassigned ymodak Mar 13, 2020
jsimsa assigned aaudiber and unassigned jsimsa Mar 14, 2020
@jsimsa (Contributor) commented Mar 14, 2020

@aaudiber could you please take a look? thank you

@Tuxius (Author) commented Mar 17, 2020

Trying to find a workaround, I reduced the number of epochs to 1 and instead used a loop, which gives me a slightly different error, but still a memory leak:

Start training
Starting Epoch 1 of 25
2020-03-17 21:30:09.914586: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 4113907200 exceeds 10% of free system memory.
2020-03-17 21:30:10.268434: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 4113907200 exceeds 10% of free system memory.
43/43 [==============================] - 111s 3s/step - loss: 871.6517 - accuracy: 0.0999
Starting Epoch 2 of 25
43/43 [==============================] - 116s 3s/step - loss: 136.0917 - accuracy: 0.1930
Starting Epoch 3 of 25
43/43 [==============================] - 113s 3s/step - loss: 67.1135 - accuracy: 0.2776
Starting Epoch 4 of 25
43/43 [==============================] - 116s 3s/step - loss: 50.1236 - accuracy: 0.3205
Starting Epoch 5 of 25
43/43 [==============================] - 120s 3s/step - loss: 24.6999 - accuracy: 0.4353
Starting Epoch 6 of 25
43/43 [==============================] - 120s 3s/step - loss: 21.4684 - accuracy: 0.4484
Starting Epoch 7 of 25
2020-03-17 21:43:49.960918: W tensorflow/core/framework/op_kernel.cc:1737] OP_REQUIRES failed at matmul_op.cc:481 : Resource exhausted: OOM when allocating tensor with shape[1279200,804] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Killed

Here is the small change I made to the code (original above):

        for i in range(0,self._Epochs):
            print("Epoch {} of {}".format(i+1,self._Epochs))
            self.model.fit(x=training_generator,
                        use_multiprocessing=False,
                        workers=6, 
                        epochs=1, 
                        steps_per_epoch = len(training_generator),
                        callbacks=[LoggingCallback(self.logger.debug)])
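
[Editor's note: one mitigation sometimes suggested for memory growth across repeated fit() calls -- not something tried in this thread, and untested against this particular leak -- is to force garbage collection between iterations, with import gc added at the top of the module.]

        for i in range(0, self._Epochs):
            print("Epoch {} of {}".format(i + 1, self._Epochs))
            self.model.fit(x=training_generator,
                        use_multiprocessing=False,
                        workers=6,
                        epochs=1,
                        steps_per_epoch = len(training_generator),
                        callbacks=[LoggingCallback(self.logger.debug)])
            gc.collect()   # release Python-side objects left over from the previous fit() call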

@aaudiber (Contributor) commented:

@Tuxius Is it possible to reproduce the issue using fake data? If you can provide a minimal, self-contained repro, that will help a lot in finding the root cause.

@yakhyo commented Mar 18, 2020

I also have the same issue: Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

@wendell-hom commented:

I am seeing this issue as well. Memory increases at the beginning of each epoch and fills up quickly.

@lcalle commented Mar 19, 2020

I am having similar issues (Error ... finalizing GeneratorDataset iterator ...) with the same script I have successfully run before. I recently upgraded to TensorFlow 2.1. I just downgraded to an earlier TensorFlow version and the script works without error; I downgraded via "conda install tensorflow=1.15.0" (miniconda3, Python 3.7, numpy 1.18.1).

Note: I use Keras in R to access the TensorFlow backend functions. Is it possible that there is some conflict between Keras and TensorFlow?

I am putting together a sample script/data to try to reproduce this error for the group here. Hopefully we can identify a solution.

I can confirm that I got my script working using the procedure below (from terminal):

  • conda install tensorflow=2.0.0
  • conda install -c conda-forge keras=2.3.0
  • source ~/path/miniconda3/bin/activate root
  • Rscript ~/path/DL_script.R

My script runs successfully without error. I tested separate versions of TensorFlow, and it appears that TensorFlow 2.1 is not compatible with my Keras version, or there is some other conflict that was resolved by the procedure above. I also verified that installing TensorFlow via conda was sufficient -- I did not have to specify "conda install tensorflow-gpu" to get TensorFlow to use the native GPU on my system. From the terminal, "nvidia-smi" shows that the GPU is being used when running my code, and from within R, "tf$test$is_gpu_available()" returns TRUE.

Hopefully this helps people.

@zhangyaochn commented:

Got the same problem here, and it only happens when I specify the number of workers. But removing that argument slows down the process.

@Tuxius (Author) commented Mar 21, 2020

Yes, I can confirm that setting the number of workers to 1, or just leaving out the argument completely, solves the problem! It doesn't crash anymore and the memory consumption is stable. It only crashes with workers set to >1.
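
[Editor's note: for readers skimming for the interim workaround, applied to the author's earlier fit() call it amounts to this sketch.]

        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=1,                    # or omit the workers argument entirely
                    epochs=self._Epochs,
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])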

@zhangyaochn commented:

I updated TF to tf-nightly. The error was gone.

@Tuxius (Author) commented Mar 22, 2020

I also updated to tf-nightly 2.2.0.dev20200319, but I still get a crash with workers > 1. I tried several other recent nightlies as well and still get crashes. It only runs for me with workers = 1 :-(

@Tuxius (Author) commented Mar 23, 2020

@aaudiber: I created a minimal, self-contained repro for you:

run.py:

import tensorflow as tf
import datagenerator

PicX = 300
PicY = 300
Color = (255,255,255)

def main():
    print("Starting a minimal, self-contained error reproduction")
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(PicX, PicY, 3)))
    model.add(tf.keras.layers.Dense(600, activation='relu'))    
    model.add(tf.keras.layers.Dense(150, activation='relu'))        
    model.add(tf.keras.layers.Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    training_generator = datagenerator.DataGenerator(100, PicX, PicY, Color)
    print("Starting training")
    model.fit(x=training_generator, workers=1, epochs=50, steps_per_epoch = len(training_generator))
    print("Fit without error with one worker!")
    model.fit(x=training_generator, workers=6, epochs=50, steps_per_epoch = len(training_generator))
    print("Fit without error with six worker!") #For me it crashed before

if __name__ == '__main__':
    main()   

datagenerator.py:

import tensorflow as tf
import numpy as np
from PIL import Image, ImageDraw

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, BatchSize, PicX, PicY, Color):
        self._BatchSize = BatchSize
        self._dim = (PicX, PicY)
        self._Color = Color
        
    def __len__(self):
        return 100
        
    def create_random_form(self):
        img = Image.new('RGB', self._dim, (50,50,50))
        draw = ImageDraw.Draw(img)
        label = np.random.randint(3)
        x0 = np.random.randint(int((self._dim[0]-5)/2))+1
        x1 = np.random.randint(int((self._dim[0]-5)/2))+int(self._dim[0]/2)
        y0 = np.random.randint(int((self._dim[1]-5)/2))
        y1 = np.random.randint(int((self._dim[1]-5)/2))+int(self._dim[1]/2)
        if label == 0:
            draw.rectangle((x0,y0,x1,y1), fill=self._Color)
        elif label == 1:
            draw.ellipse((x0,y0,x1,y1), fill=self._Color)                
        else:
            draw.polygon([(x0,y0),(x0,y1),(x1,y1)], fill=self._Color)     
        return img, label
        
    def __getitem__(self, index):
        X = np.empty((self._BatchSize, *self._dim, 3))
        y = np.empty((self._BatchSize), dtype=int)
        for i in range(0,self._BatchSize):
            img, label = self.create_random_form()
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = label
        return X, y

For me this works with workers = 1 but crashes with workers = 6 ...
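
[Editor's note: a workaround not mentioned in this thread, offered only as a sketch built on the repro above: feed the Sequence through tf.data instead of Keras' multi-worker enqueuer. This gives prefetching rather than true multi-process loading, so it is a trade-off.]

import tensorflow as tf
import datagenerator

gen = datagenerator.DataGenerator(100, 300, 300, (255, 255, 255))

def batch_iterator():
    # Iterate over the Sequence forever; fit() ends each epoch after steps_per_epoch batches.
    while True:
        for i in range(len(gen)):
            X, y = gen[i]
            yield X.astype('float32'), y.astype('int32')

dataset = tf.data.Dataset.from_generator(
    batch_iterator,
    output_types=(tf.float32, tf.int32),
    output_shapes=((None, 300, 300, 3), (None,)))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# model.fit(dataset, epochs=50, steps_per_epoch=len(gen))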

@jsimsa (Contributor) commented Mar 23, 2020

Thank you very much for providing the reproduction and narrowing it down to the use of the workers argument. The GeneratorDataset warning is a red herring. The root cause is a memory leak in Keras, which I created a fix for and verified that it resolves your issue. The fix should be submitted later this week.

@wendell-hom commented:

I used "conda install tensorflow-gpu" to install my tensorflow environment.
How do I consume this fix into my conda env? =)

@ybagdasa commented:

I'm using the tensorflow image tensorflow/tensorflow:2.1.0-gpu-py3 from docker hub: https://hub.docker.com/r/tensorflow/tensorflow/tags/?page=1

I'm also interested in consuming the fix. I'm using tf.keras.

geetachavan1 pushed a commit to geetachavan1/tensorflow that referenced this issue Mar 24, 2020
Fixes: tensorflow#37515
PiperOrigin-RevId: 302568217
Change-Id: I28d0eaf3602fea0461901680df24899f135ce649
@ybagdasa commented:

@geetachavan1 Thanks!

@Zen3515 commented Apr 21, 2020

@wendell-hom wrote:

I used "conda install tensorflow-gpu" to install my tensorflow environment.
How do I consume this fix into my conda env? =)

This issue happened to me today, and I also happen to use conda, so I thought I would share my setup with you as well:

conda create -n tf22 python=3.7 cudnn cupti cudatoolkit=10.1.243
conda activate tf22
pip install tensorflow==2.2.0rc3

For TF 2, TensorFlow already supports the GPU, provided it can load all the required libraries.
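
[Editor's note: a quick Python-side check that TF 2.x actually found the GPU libraries; this is standard TensorFlow API, nothing specific to this issue.]

import tensorflow as tf

# An empty list here means the CUDA/cuDNN libraries could not be loaded.
print(tf.config.list_physical_devices('GPU'))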

@sharifulgeo commented Feb 6, 2021

I am on Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)] on win32 with TensorFlow 2.1.0. While executing this project https://github.com/JackonYang/captcha-tensorflow I am getting this error: 2021-02-06 07:19:20.301919: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled. :(

@pentanol2 commented:

It's 2022 and there is still no clear answer to this problem. I get the same message on all versions of TensorFlow and with all kinds of GPU configurations. Nothing works. It happens at the end of each training run.

@zacharyfrederick commented:

Setting use_multiprocessing=True fixed this for me.
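
[Editor's note: applied to the repro above, that change would look like the sketch below; as this thread shows, results vary.]

    model.fit(x=training_generator,
              use_multiprocessing=True,   # worker processes instead of threads
              workers=6,
              epochs=50,
              steps_per_epoch=len(training_generator))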
