
Tensorflow 2.1 Error “when finalizing GeneratorDataset iterator” - a memory leak? #37515

Closed. Tuxius opened this issue Mar 11, 2020 · 19 comments
Labels: comp:data (tf.data related issues), TF 2.1 (for tracking issues in 2.1 release), type:performance (Performance Issue)

@Tuxius commented Mar 11, 2020

Reopening issue #35100, as more and more people report still having the same problem:

Problem description

I am using TensorFlow 2.1.0 for image classification under CentOS Linux. As my image training data set is growing, I have to start using a Generator because I do not have enough RAM to hold all pictures. I have coded the Generator based on this tutorial.

It seems to work fine, until my program all of a sudden gets killed without an error message:

Epoch 6/30
2020-03-08 13:28:11.361785: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
43/43 [==============================] - 54s 1s/step - loss: 5.6839 - accuracy: 0.4669
Epoch 7/30
2020-03-08 13:29:05.511813: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 7/43 [===>..........................] - ETA: 1:04 - loss: 4.3953 - accuracy: 0.5268Killed

Looking at the growing memory consumption with Linux's top, I suspect a memory leak?

What I have tried

  • The suggestion to switch to a TF nightly build. It did not help for me; downgrading to TF 2.0.1 did not help either.

  • There is a discussion suggesting that it is important that 'steps_per_epoch' and 'batch_size' correspond (whatever that exactly means) - I played with it without finding any improvement (see the sketch after this list).

  • Trying to narrow it down by watching how the sizes of all variables in my Generator develop
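
[Editor's note: a minimal sketch of the steps_per_epoch / batch_size relationship that discussion refers to. The numbers below are assumptions for illustration, not taken from this issue, except that the log above shows 43 steps per epoch.]

# Illustrative only: how steps_per_epoch relates to batch size for a fixed-size training set.
num_samples = 1376        # hypothetical training-set size
batch_size = 32           # hypothetical batch size
steps_per_epoch = num_samples // batch_size   # 43 full batches per epoch
# A keras.utils.Sequence returns exactly this value from __len__, so passing
# steps_per_epoch=len(training_generator) to fit() keeps the two consistent.
print(steps_per_epoch)    # -> 43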

Relevant code snippets

import configparser
import math

import tensorflow as tf

class DataGenerator(tf.keras.utils.Sequence):
    'Generates data for Keras'
    def __init__(self, list_IDs, labels, dir, n_classes):
        'Initialization'
        config = configparser.ConfigParser()
        config.sections()
        config.read('config.ini')

        self.dim = (int(config['Basics']['PicHeight']),int(config['Basics']['PicWidth']))
        self.batch_size = int(config['HyperParameter']['batchsize'])
        self.labels = labels
        self.list_IDs = list_IDs
        self.dir = dir
        self.n_channels = 3
        self.n_classes = n_classes
        self.on_epoch_end()        


    def __len__(self):
        'Denotes the number of batches per epoch'
        return math.floor(len(self.list_IDs) / self.batch_size)

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]

        # Find list of IDs
        list_IDs_temp = [self.list_IDs[k] for k in indexes]

        # Generate data
        X, y = self.__data_generation(list_IDs_temp)

        return X, y, [None]

being called by

        training_generator = datagenerator.DataGenerator(train_files, labels, dir, len(self.class_names))
        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=6, 
                    epochs=self._Epochs, 
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])
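
[Editor's note: the DataGenerator snippet above refers to on_epoch_end and __data_generation but omits them. Purely as a hypothetical sketch of the tutorial pattern the author mentions -- not the author's actual code -- they typically look like the following (assumes import os and import numpy as np).]

    def on_epoch_end(self):
        'Hypothetical sketch: reshuffle the sample indexes after each epoch'
        self.indexes = np.arange(len(self.list_IDs))
        np.random.shuffle(self.indexes)

    def __data_generation(self, list_IDs_temp):
        'Hypothetical sketch: load one batch of images and labels from disk'
        X = np.empty((self.batch_size, *self.dim, self.n_channels))
        y = np.empty((self.batch_size), dtype=int)
        for i, ID in enumerate(list_IDs_temp):
            img = tf.keras.preprocessing.image.load_img(
                os.path.join(self.dir, ID), target_size=self.dim)
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = self.labels[ID]
        return X, y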

I have tried running the exact same code under Windows 10, which gives me the following error:

Epoch 9/30
2020-03-08 20:49:37.555692: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
41/41 [==============================] - 75s 2s/step - loss: 2.0167 - accuracy: 0.3133
Epoch 10/30
2020-03-08 20:50:52.986306: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
 1/41 [..............................] - ETA: 2:36 - loss: 1.6237 - accuracy: 0.39062020-03-08 20:50:57.689373: W tensorflow/core/framework/op_kernel.cc:1655] OP_REQUIRES failed at matmul_op.cc:480 : Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
2020-03-08 20:50:57.766163: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Resource exhausted: OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[{{node MatMul_6}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 2/41 [>.............................] - ETA: 2:02 - loss: 1.6237 - accuracy: 0.3906Traceback (most recent call last):
  File "run.py", line 83, in <module>
    main()
  File "run.py", line 70, in main
    accuracy, num_of_classes = train_Posture(unique_name)
  File "run.py", line 31, in train_Posture
    acc = neuro.train(picdb, train_ids, test_ids, "Posture")
  File "A:\200307 3rd Try\neuro.py", line 161, in train
    callbacks=[LoggingCallback(self.logger.debug)])
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 342, in fit
    total_epochs=epochs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2.py", line 128, in run_one_epoch
    batch_outs = execution_function(iterator)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\keras\engine\training_v2_utils.py", line 98, in execution_function
    distributed_function(input_fn))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\def_function.py", line 599, in _call
    return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 2363, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1611, in _filtered_call
    self.captured_inputs)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Users\Frank\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[1279200,322] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
         [[node MatMul_6 (defined at A:\200307 3rd Try\neuro.py:161) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_764]

Function call stack:
distributed_function

2020-03-08 20:51:00.785175: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
Tuxius added the type:bug label Mar 11, 2020
Saduf2019 self-assigned this Mar 12, 2020
Saduf2019 added the TF 2.1, comp:data and type:performance labels and removed the type:bug label Mar 12, 2020
Saduf2019 assigned ymodak and unassigned Saduf2019 Mar 12, 2020
ymodak assigned jsimsa and unassigned ymodak Mar 13, 2020
jsimsa assigned aaudiber and unassigned jsimsa Mar 14, 2020
@jsimsa (Contributor) commented Mar 14, 2020

@aaudiber could you please take a look? thank you

@Tuxius (Author) commented Mar 17, 2020

Trying to find a workaround, I reduced the number of epochs to 1 and instead used a loop, which gives me a slightly different error, but still a memory leak:

Start training
Starting Epoch 1 of 25
2020-03-17 21:30:09.914586: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 4113907200 exceeds 10% of free system memory.
2020-03-17 21:30:10.268434: W tensorflow/core/framework/cpu_allocator_impl.cc:81] Allocation of 4113907200 exceeds 10% of free system memory.
43/43 [==============================] - 111s 3s/step - loss: 871.6517 - accuracy: 0.0999
Starting Epoch 2 of 25
43/43 [==============================] - 116s 3s/step - loss: 136.0917 - accuracy: 0.1930
Starting Epoch 3 of 25
43/43 [==============================] - 113s 3s/step - loss: 67.1135 - accuracy: 0.2776
Starting Epoch 4 of 25
43/43 [==============================] - 116s 3s/step - loss: 50.1236 - accuracy: 0.3205
Starting Epoch 5 of 25
43/43 [==============================] - 120s 3s/step - loss: 24.6999 - accuracy: 0.4353
Starting Epoch 6 of 25
43/43 [==============================] - 120s 3s/step - loss: 21.4684 - accuracy: 0.4484
Starting Epoch 7 of 25
2020-03-17 21:43:49.960918: W tensorflow/core/framework/op_kernel.cc:1737] OP_REQUIRES failed at matmul_op.cc:481 : Resource exhausted: OOM when allocating tensor with shape[1279200,804] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
Killed

Here is the small change I made to the code (original above):

        for i in range(0,self._Epochs):
            print("Epoch {} of {}".format(i+1,self._Epochs))
            self.model.fit(x=training_generator,
                        use_multiprocessing=False,
                        workers=6, 
                        epochs=1, 
                        steps_per_epoch = len(training_generator),
                        callbacks=[LoggingCallback(self.logger.debug)])
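
[Editor's note: one mitigation sometimes suggested for memory growth across repeated fit() calls -- not something tried in this thread, and untested against this particular leak -- is to force garbage collection between iterations, with import gc added at the top of the module.]

        for i in range(0, self._Epochs):
            print("Epoch {} of {}".format(i + 1, self._Epochs))
            self.model.fit(x=training_generator,
                        use_multiprocessing=False,
                        workers=6,
                        epochs=1,
                        steps_per_epoch = len(training_generator),
                        callbacks=[LoggingCallback(self.logger.debug)])
            gc.collect()   # release Python-side objects left over from the previous fit() call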

@aaudiber (Contributor) commented:

@Tuxius Is it possible to reproduce the issue using fake data? If you can provide a minimal, self-contained repro, that will help a lot in finding the root cause.

@yakhyo commented Mar 18, 2020

I also have the same issue: Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

@wendell-hom commented:

I am seeing this issue as well. Memory increases at the beginning of each epoch and fills up quickly.

@lcalle commented Mar 19, 2020

I am having similar issues (Error ... finalizing GeneratorDataset iterator ...) with the same script I have successfully run before. I recently upgraded to TensorFlow 2.1. I just downgraded to an earlier TensorFlow version and the script works without error; I downgraded via "conda install tensorflow=1.15.0" (miniconda3, Python 3.7, numpy 1.18.1).

Note: I use Keras in R to access the TensorFlow backend functions. Is it possible that there is some conflict between Keras and TensorFlow?

I am putting together a sample script/data to try to reproduce this error for the group here. Hopefully we can identify a solution.

I can confirm that I got my script working using the procedure below (from terminal):

  • conda install tensorflow=2.0.0
  • conda install -c conda-forge keras=2.3.0
  • source ~/path/miniconda3/bin/activate root
  • Rscript ~/path/DL_script.R

My script runs successfully without error. I tested separate versions of TensorFlow, and it appears that TensorFlow 2.1 is not compatible with my Keras version, or there is some other conflict that was resolved by the procedure above. I also verified that installing TensorFlow via conda was sufficient -- I did not have to specify "conda install tensorflow-gpu" to get TensorFlow to use the native GPU on my system. From the terminal, "nvidia-smi" shows that the GPU is being used when running my code, and from within R, "tf$test$is_gpu_available()" returns TRUE.

Hopefully this helps people.

@zhangyaochn commented:

Got the same problem here, and it only happens when I specify the number of workers. But removing that argument slows down the process.

@Tuxius (Author) commented Mar 21, 2020

Yes, I can confirm that setting the number of workers to 1, or just leaving out the argument completely, solves the problem! It doesn't crash anymore and the memory consumption is stable. It only crashes with workers set to >1.
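
[Editor's note: for readers skimming for the interim workaround, applied to the author's earlier fit() call it amounts to this sketch.]

        self.model.fit(x=training_generator,
                    use_multiprocessing=False,
                    workers=1,                    # or omit the workers argument entirely
                    epochs=self._Epochs,
                    steps_per_epoch = len(training_generator),
                    callbacks=[LoggingCallback(self.logger.debug)])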

@zhangyaochn commented:

I updated TF to tf-nightly. The error was gone.

@Tuxius (Author) commented Mar 22, 2020

I also updated to tf-nightly 2.2.0.dev20200319, but I still get a crash with workers > 1. I tried several other recent nightlies as well and still get crashes. It only runs for me with workers = 1 :-(

@Tuxius (Author) commented Mar 23, 2020

@aaudiber: I created a minimal, self-contained repro for you:

run.py:

import tensorflow as tf
import datagenerator

PicX = 300
PicY = 300
Color = (255,255,255)

def main():
    print("Starting a minimal, self-contained error reproduction")
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=(PicX, PicY, 3)))
    model.add(tf.keras.layers.Dense(600, activation='relu'))    
    model.add(tf.keras.layers.Dense(150, activation='relu'))        
    model.add(tf.keras.layers.Dense(3, activation='softmax'))
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    training_generator = datagenerator.DataGenerator(100, PicX, PicY, Color)
    print("Starting training")
    model.fit(x=training_generator, workers=1, epochs=50, steps_per_epoch = len(training_generator))
    print("Fit without error with one worker!")
    model.fit(x=training_generator, workers=6, epochs=50, steps_per_epoch = len(training_generator))
    print("Fit without error with six worker!") #For me it crashed before

if __name__ == '__main__':
    main()   

datagenerator.py:

import tensorflow as tf
import numpy as np
from PIL import Image, ImageDraw

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, BatchSize, PicX, PicY, Color):
        self._BatchSize = BatchSize
        self._dim = (PicX, PicY)
        self._Color = Color
        
    def __len__(self):
        return 100
        
    def create_random_form(self):
        img = Image.new('RGB', self._dim, (50,50,50))
        draw = ImageDraw.Draw(img)
        label = np.random.randint(3)
        x0 = np.random.randint(int((self._dim[0]-5)/2))+1
        x1 = np.random.randint(int((self._dim[0]-5)/2))+int(self._dim[0]/2)
        y0 = np.random.randint(int((self._dim[1]-5)/2))
        y1 = np.random.randint(int((self._dim[1]-5)/2))+int(self._dim[1]/2)
        if label == 0:
            draw.rectangle((x0,y0,x1,y1), fill=self._Color)
        elif label == 1:
            draw.ellipse((x0,y0,x1,y1), fill=self._Color)                
        else:
            draw.polygon([(x0,y0),(x0,y1),(x1,y1)], fill=self._Color)     
        return img, label
        
    def __getitem__(self, index):
        X = np.empty((self._BatchSize, *self._dim, 3))
        y = np.empty((self._BatchSize), dtype=int)
        for i in range(0,self._BatchSize):
            img, label = self.create_random_form()
            X[i,] = tf.keras.preprocessing.image.img_to_array(img) / 255.0
            y[i] = label
        return X, y

For me this works with workers = 1 but crashes with workers = 6 ...
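
[Editor's note: a workaround not mentioned in this thread, offered only as a sketch built on the repro above: feed the Sequence through tf.data instead of Keras' multi-worker enqueuer. This gives prefetching rather than true multi-process loading, so it is a trade-off.]

import tensorflow as tf
import datagenerator

gen = datagenerator.DataGenerator(100, 300, 300, (255, 255, 255))

def batch_iterator():
    # Iterate over the Sequence forever; fit() ends each epoch after steps_per_epoch batches.
    while True:
        for i in range(len(gen)):
            X, y = gen[i]
            yield X.astype('float32'), y.astype('int32')

dataset = tf.data.Dataset.from_generator(
    batch_iterator,
    output_types=(tf.float32, tf.int32),
    output_shapes=((None, 300, 300, 3), (None,)))
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

# model.fit(dataset, epochs=50, steps_per_epoch=len(gen))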

@jsimsa (Contributor) commented Mar 23, 2020

Thank you very much for providing the reproduction and narrowing it down to the use of the workers argument. The GeneratorDataset warning is a red herring. The root cause is a memory leak in Keras, which I created a fix for and verified that it resolves your issue. The fix should be submitted later this week.

@wendell-hom commented:

I used "conda install tensorflow-gpu" to install my tensorflow environment.
How do I consume this fix into my conda env? =)

@ybagdasa commented:

I'm using the tensorflow image tensorflow/tensorflow:2.1.0-gpu-py3 from docker hub: https://hub.docker.com/r/tensorflow/tensorflow/tags/?page=1

I'm also interested in consuming the fix. I'm using tf.keras.

geetachavan1 pushed a commit to geetachavan1/tensorflow that referenced this issue Mar 24, 2020
Fixes: tensorflow#37515
PiperOrigin-RevId: 302568217
Change-Id: I28d0eaf3602fea0461901680df24899f135ce649
@ybagdasa commented:

@geetachavan1 Thanks!

@Zen3515 commented Apr 21, 2020

@wendell-hom wrote:

I used "conda install tensorflow-gpu" to install my tensorflow environment.
How do I consume this fix into my conda env? =)

This issue happened to me today, and I also happen to use conda, so I thought I would share my setup with you as well:

conda create -n tf22 python=3.7 cudnn cupti cudatoolkit=10.1.243
conda activate tf22
pip install tensorflow==2.2.0rc3

For TF 2, TensorFlow already supports the GPU, provided it can load all the required libraries.
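
[Editor's note: a quick Python-side check that TF 2.x actually found the GPU libraries; this is standard TensorFlow API, nothing specific to this issue.]

import tensorflow as tf

# An empty list here means the CUDA/cuDNN libraries could not be loaded.
print(tf.config.list_physical_devices('GPU'))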

@sharifulgeo commented Feb 6, 2021

I am on Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 14:00:49) [MSC v.1915 64 bit (AMD64)] on win32 with TensorFlow 2.1.0. While executing this project https://github.com/JackonYang/captcha-tensorflow I am getting this error: 2021-02-06 07:19:20.301919: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled. :(

@pentanol2 commented:

It's 2022 and there is still no clear answer to this problem. I get the same message on all versions of TensorFlow and with all kinds of GPU configurations. Nothing works. It happens at the end of each training run.

@zacharyfrederick commented:

Setting use_multiprocessing=True fixed this for me.
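
[Editor's note: applied to the repro above, that change would look like the sketch below; as this thread shows, results vary.]

    model.fit(x=training_generator,
              use_multiprocessing=True,   # worker processes instead of threads
              workers=6,
              epochs=50,
              steps_per_epoch=len(training_generator))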
