Error occurred when finalizing GeneratorDataset iterator #35100

Closed
olk opened this issue Dec 13, 2019 · 51 comments
Labels: comp:data (tf.data related issues) · TF 2.1 (for tracking issues in 2.1 release) · type:bug (Bug)

@olk

olk commented Dec 13, 2019

System information

  • OS Platform and Distribution: Arch Linux, 5.4.2-arch1-1-ARCH
  • TensorFlow installed from: binary
  • TensorFlow version: 2.1.0rc0-1
  • Keras version: 2.2.4-tf
  • Python version: 3.8
  • GPU model and memory: 2x GTX 1080 Ti 11GB

Describe the current behavior
Executing TensorFlow's MNIST handwriting example produces the error below. The error disappears if the code does not use OneDeviceStrategy or MirroredStrategy.

W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled

Code to reproduce the issue

import tensorflow as tf
import tensorflow_datasets as tfds
import time

from tensorflow.keras.optimizers import Adam

def build_model():
    filters = 48
    units = 24
    kernel_size = 7
    learning_rate = 1e-4
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(filters=filters, kernel_size=(kernel_size, kernel_size), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(learning_rate), metrics=['accuracy'])
    return model

datasets, info = tfds.load(name='mnist', with_info=True, as_supervised=True)
mnist_train, mnist_test = datasets['train'], datasets['test']

num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples

strategy = tf.distribute.OneDeviceStrategy(device='/gpu:0')

BUFFER_SIZE = 10000
BATCH_SIZE = 32

def scale(image, label):
    image = tf.cast(image, tf.float32)
    image /= 255
    return image, label

train_dataset = mnist_train.map(scale).shuffle(BUFFER_SIZE).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
eval_dataset = mnist_test.map(scale).repeat().batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

with strategy.scope():
    model = build_model()

epochs = 5
start = time.perf_counter()
model.fit(
    train_dataset,
    validation_data=eval_dataset,
    steps_per_epoch=num_train_examples/epochs,
    validation_steps=num_test_examples/epochs,
    epochs=epochs)
elapsed = time.perf_counter() - start
print('elapsed: {:0.3f}'.format(elapsed))
@gadagashwini-zz gadagashwini-zz self-assigned this Dec 16, 2019
@gadagashwini-zz gadagashwini-zz added the TF 2.1 for tracking issues in 2.1 release label Dec 16, 2019
@gadagashwini-zz
Contributor

gadagashwini-zz commented Dec 16, 2019

@olk, I tried reproducing the reported issue but it worked as expected. Please take a look at the gist. Thanks!

@gadagashwini-zz gadagashwini-zz added stat:awaiting response Status - Awaiting response from author comp:data tf.data related issues labels Dec 16, 2019
@olk
Author

olk commented Dec 16, 2019

Upgraded to TensorFlow 2.1.0-rc1; I still get the error.
Please note that I run the example on real hardware (not Colab).

@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Dec 17, 2019
@olk
Author

olk commented Dec 17, 2019

I guess this issue is related to using TensorFlow with Python 3.8.

@olk
Author

olk commented Dec 17, 2019

I've downgraded my system:

  • Python 3.7.4
  • Tensorflow-2.1.0-rc1

Still facing the error:

Train for 30000.0 steps, validate for 5000.0 steps
Epoch 1/2
2019-12-17 19:21:54.361240: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-12-17 19:21:55.824790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-17 19:21:56.980785: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
30000/30000 [==============================] - 115s 4ms/step - loss: 0.0856 - accuracy: 0.9761 - val_loss: 0.0376 - val_accuracy: 0.9879
Epoch 2/2
29990/30000 [============================>.] - ETA: 0s - loss: 0.0152 - accuracy: 0.99582019-12-17 19:25:28.372294: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
30000/30000 [==============================] - 111s 4ms/step - loss: 0.0152 - accuracy: 0.9958 - val_loss: 0.0375 - val_accuracy: 0.9889
2019-12-17 19:25:40.010887: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
2019-12-17 19:25:40.031138: W tensorflow/core/kernels/data/generator_dataset_op.cc:103] Error occurred when finalizing GeneratorDataset iterator: Cancelled: Operation was cancelled
elapsed: 226.391

Seems to be related to TensorFlow 2.1.0-rc1.

@ymodak ymodak assigned jsimsa and unassigned ymodak Dec 17, 2019
@GeoMarky

GeoMarky commented Dec 17, 2019

I have the same issue. Originally I was using:

tensorflow/tensorflow:nightly-gpu-py3

which has:
2.1.0-dev20191106

Then I tried upgrading tensorflow in the container with:

https://files.pythonhosted.org/packages/a9/fa/8ac34cf1369deb4f523a80eeb86ec0be3dd44139bfb42c45dd3829d6aff5/tf_nightly_gpu-2.1.0.dev20191217-cp36-cp36m-manylinux2010_x86_64.whl

I still have the same issue.

@jsimsa jsimsa assigned qlzh727 and guptapriya and unassigned jsimsa Dec 18, 2019
@jsimsa
Contributor

jsimsa commented Dec 18, 2019

@guptapriya @qlzh727 this seems to be an issue related to tf.distribute + tf.keras. In particular, as far as I can tell, the user code does not use tf.data.Dataset.from_generator but the error indicates that GeneratorDataset is used. Could you please triage? Thanks.

@guptapriya
Contributor

The error log suggests that the training completed fine, but something at the end caused this error. Neither the training nor the validation dataset uses generators, so it does seem weird that there is a generator-related error. Also, it seems to be just a warning, since the user's print statement at the end ("elapsed...") did get printed as well.

@jsimsa is tf.data.Dataset.from_generator the only time generator_dataset_op is used? Or could there be something else that could trigger it?

@rchao could it be something related to any of the fault tolerance callbacks?

@andrew-bydlon

andrew-bydlon commented Dec 18, 2019

I can verify this error with Python 3.8 and python-tensorflow-opt-cuda 2.1.0rc1-2 on Arch Linux. Oddly, the error is not present if you import only the generator from TensorFlow and everything else from Keras.

@jsimsa
Contributor

jsimsa commented Dec 19, 2019

@guptapriya I realized that GeneratorDataset is used by the multi-device iterator. This seems related to the newly added support for cancellation in tf.data.

The good news is that, as you pointed out, the warning is superfluous. The bad news is that, as far as I can tell, this warning will be present for all tf.distribute jobs in TF 2.1 (given how tf.data cancellation is implemented). I will look into having a fix for this cherrypicked into TF 2.1.
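
Until such a fix lands, one way to hide the superfluous warning (a sketch, not an official fix from this thread) is to raise TensorFlow's C++ log threshold before importing it; TF_CPP_MIN_LOG_LEVEL=2 silences INFO and WARNING messages from the C++ runtime, which is where this log line originates:

import os

# Must be set before TensorFlow is imported.
# '2' hides INFO and WARNING messages from the C++ runtime
# (e.g. generator_dataset_op.cc), leaving errors visible.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import tensorflow as tf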

@guptapriya
Contributor

Ah, great, thanks @jsimsa.


@spate141

@jsimsa Any update on this? I'm getting this exact message, and it looks like my model.fit is not processing the validation dataset during training.

@ybagdasa

ybagdasa commented Mar 28, 2020 via email

@Serhiy-Shekhovtsov

Had the same problem: a memory leak and a crash after some number of epochs. The ModelCheckpoint callback looks like the culprit; removing it solved the issue.

@mikechen66

Regarding "I guess this issue is related to using TensorFlow with Python 3.8":

It is not related to Python 3.8. I have the same problem with Python 3.7.4.

@leszekmp

leszekmp commented Aug 1, 2020

I found a reason for the problem on my computer - YMMV. I was using the ModelCheckpoint callback to save the best model, and if there was a model with that name already in the folder, I got the error. Removing or renaming the model with that name fixed the issue. Windows 10 system, Python 3.7.4.
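
A minimal sketch of that workaround, assuming a hypothetical checkpoint filename best_model.h5 (adjust to whatever your ModelCheckpoint writes): remove or rename any stale file from a previous run before training starts.

import os
import tensorflow as tf

checkpoint_path = 'best_model.h5'  # hypothetical path; use your own filename

# Remove a leftover checkpoint from a previous run so the callback
# starts from a clean slate.
if os.path.exists(checkpoint_path):
    os.remove(checkpoint_path)

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(checkpoint_path, save_best_only=True)
# model.fit(..., callbacks=[checkpoint_cb])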

@a-arbabian

Adding this code snippet fixes this issue for me when using RTX GPUs:

devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

This is something I have to do in my training scripts as well. Might help someone 👍
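
A slightly more general variant of that snippet (a sketch that extends the comment above, following the pattern in the TensorFlow GPU guide) enables memory growth on every visible GPU and tolerates the case where the devices were already initialized:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
try:
    for gpu in gpus:
        # Allocate GPU memory on demand instead of reserving it all up front.
        tf.config.experimental.set_memory_growth(gpu, True)
except RuntimeError as e:
    # Memory growth must be configured before the GPUs are initialized.
    print(e)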

@Gare-Ng

Gare-Ng commented Oct 3, 2020

Ideas from Stack Overflow. I just directly copied the code from deeplearning.ai in Colab. A part of it goes like this:

train_generator = train_datagen.flow_from_directory(
    'horse-or-human/',  # This is the source directory for training images
    target_size=(300, 300),  # All images will be resized to 300x300
    batch_size=128,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

history = model.fit(
    train_generator,
    steps_per_epoch=8,
    epochs=15,
    verbose=1)

There are 1027 images, and 128 * 8 = 1024 is less than 1027. When I set steps_per_epoch to 9, the error disappeared.
So, for me the problem arises when the batch size and the number of steps (iterations) do not match the dataset size.
At least this is one of the cases for the error.
Here is the original answer: https://stackoverflow.com/questions/60000573/error-occurred-when-finalizing-generatordataset-iterator-cancelled-operation-w
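
A minimal sketch of that fix, using the numbers from the comment above: round the step count up so that steps_per_epoch * batch_size covers the whole dataset.

import math

num_images = 1027   # training-set size from the comment above
batch_size = 128

# Round up so that steps_per_epoch * batch_size >= num_images;
# otherwise the generator is exhausted before the epoch ends.
steps_per_epoch = math.ceil(num_images / batch_size)   # -> 9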

ghadj added a commit to ghadj/Social-Distancing-YOLO that referenced this issue Nov 24, 2020
+ problem arises when there is a wrong correspondence between the batch
  size and steps (iterations)
+ (steps) x (batch size) must be >= number of training images
+ use ceiling to ensure that after the division the above condition still applies

Reference:
+ tensorflow/tensorflow#35100
@rcx986635

I have the same problem (using model.fit() with a NumPy generator rather than a keras.utils.Sequence).
I'm using Red Hat Linux, Python 3.6, and TensorFlow 2.4.1.
Then I got this error:
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 215, in _deepcopy_list
append(deepcopy(a, memo))
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 274, in _reconstruct
y = func(*args)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/tensor_shape.py", line 190, in init
if value < 0:
RecursionError: maximum recursion depth exceeded in comparison
2021-03-26 14:50:45.148650: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: Failed precondition: Python interpreter state is not initialized. The process may be terminated.
[[{{node PyFunc}}]]

@rcx986635

devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True)

This did not work for me.

@rcx986635

This might be helpful:
Epoch 00001: loss improved from inf to 93.23533, saving model to model/best_model.h5
Traceback (most recent call last):
File "train_v2.py", line 1070, in train
callbacks=callbacks_list)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1145, in fit
callbacks.on_epoch_end(epoch, epoch_logs)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 428, in on_epoch_end
callback.on_epoch_end(epoch, logs)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 1344, in on_epoch_end
self._save_model(epoch=epoch, logs=logs)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 1396, in _save_model
self.model.save(filepath, overwrite=True, options=self._options)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 2002, in save
signatures, options, save_traces)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/saving/save.py", line 154, in save_model
model, filepath, overwrite, include_optimizer)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/saving/hdf5_format.py", line 115, in save_model_to_hdf5
model_metadata = saving_utils.model_metadata(model, include_optimizer)
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/saving/saving_utils.py", line 155, in model_metadata
model_config['config'] = model.get_config()
File "/root/python_env/anaconda3/lib/python3.6/site-packages/tensorflow/python/keras/engine/functional.py", line 650, in get_config
return copy.deepcopy(get_network_config(self))
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/root/python_env/anaconda3/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)

@prajwaljpj

(Quoting @rcx986635's traceback above.)

Running into the same error...

Here is some information about my configuration:
Distribution:

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.5 LTS
Release:        18.04
Codename:       bionic

GPU Driver:

Fri Apr 16 12:13:45 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    27W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Python version:

Python 3.8.0

Tensorflow version:

# Tensorflow-2.4.1 compiled from source with gcc and no TensorRT
tensorflow @ file:///home/ubuntu/projects/tensorflow/mywhl/tensorflow-2.4.1-cp38-cp38-linux_x86_64.whl
tensorflow-addons==0.12.1
tensorflow-estimator==2.4.0

Any thoughts?

@Melkeydev

The following solved the issue for me:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession


def fix_gpu():
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    session = InteractiveSession(config=config)


fix_gpu()

Call this function at the start of your script

@magicianfromriga

Hi,
I seem to be facing the same issue with Leela Chess Zero training.

message.txt
I have tried various fixes for this problem (changing the parser, the CUDA version, the Python version, the TF version, etc.), but it continues to persist. Please help me rectify this issue.

My current PC configuration:
8 GB RAM
Intel i7-4790K
NVIDIA RTX 2070 SUPER
1TB SSD

My current requirement setup:
CUDA 11.3
CUDNN 8.2.1
Python 3.9.5
TF-Nightly GPU (2.7.0 dev)

As you can see from the attachment, the training progresses fine. But when it reaches a checkpoint an Assertion Error is thrown. I have tried this fix:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

def fix_gpu():
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    session = InteractiveSession(config=config)

fix_gpu()

But it doesn't seem to work. Please help.

@Nafees-060

Getting the same error:

[[{{node PyFunc}}]]
    [[IteratorGetNext]] [Op:__inference_train_function_6832]
2022-03-13 01:17:08.463663: W tensorflow/core/kernels/data/generator_dataset_op.cc:107] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated.
    [[{{node PyFunc}}]]

I have tried every possible way to fix it, but could not succeed. Honestly, I don't know why this error is occurring. I would request that this issue be reopened.

@mrk1992

mrk1992 commented Apr 3, 2022

I solved this problem (in TensorFlow 2.5).

Suppose a file named 'train.py' is to be run (this is an example).

When running .py files in the terminal, select the GPU(s) explicitly:

$ CUDA_VISIBLE_DEVICES=0 python train.py  # Use GPU 0.
$ CUDA_VISIBLE_DEVICES=1 python train.py  # Use GPU 1.
$ CUDA_VISIBLE_DEVICES=2,3 python train.py  # Use GPUs 2 and 3.

Alternatively, add these 3 lines at the top of the train.py file:

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

@ghost

ghost commented Oct 28, 2022

(Quoting @Melkeydev's fix_gpu snippet above.)

Worked for me.
Thanks for sharing

@pangyuteng

pangyuteng commented Apr 9, 2023

For what it's worth (or not)...

I saw this same "error" (or is it actually a warning, given the W at the front of the log line?) in one of my training log files:

W tensorflow/core/kernels/data/generator_dataset_op.cc:108] Error occurred when finalizing GeneratorDataset iterator: FAILED_PRECONDITION: Python interpreter state is not initialized. The process may be terminated. [[{{node PyFunc}}]]

But the exit code of the training script was 0, and it turns out training stopped because tf.keras.callbacks.EarlyStopping kicked in and I had not set verbose=1 in the EarlyStopping callback.
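
A minimal sketch of making that stop visible (assuming a typical val_loss monitor; adjust to your own setup): pass verbose=1 so the callback logs the epoch at which it halts training.

import tensorflow as tf

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    verbose=1)  # prints a message such as "Epoch N: early stopping" when it triggers

# model.fit(..., callbacks=[early_stopping])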

--

Also I was not able to replicate this error with the code provided in the original post
#35100 (comment)

versions used:

container: tensorflow/tensorflow:2.10.0-gpu-jupyter
tensorflow_datasets==4.9.0
