Error occurred when finalizing GeneratorDataset iterator #35100
Upgraded to TensorFlow 2.1.0-rc1; still getting the error. |
I guess this issue is related to using TensorFlow with Python 3.8. |
I've downgraded my system and am still facing the error; it seems to be related to tensorflow-2.1.0-rc1. |
I have the same issue. Originally I was using: tensorflow/tensorflow:nightly-gpu-py3 which has: Then I tried upgrading tensorflow in the container with: I still have the same issue. |
@guptapriya @qlzh727 this seems to be an issue related to tf.distribute + tf.keras. In particular, as far as I can tell, the user code does not use |
The error log suggests that the training completed fine, but something at the end caused this error. Neither the training nor the validation dataset uses generators, so it does seem weird that there is a generator-related error. Also, it seems like it's just a warning, since the user's print statement at the end ("elapsed..") did get printed as well. @jsimsa is tf.data.Dataset.from_generator the only time generator_dataset_op is used? Or could there be something else that could trigger it? @rchao could it be something related to any of the fault tolerance callbacks? |
I can verify this error with python 3.8 and python-tensorflow-opt-cuda 2.1.0rc1-2 on arch linux. This error is weirdly not present if you import only the generator from tensorflow, and everything else from Keras. |
@guptapriya I realized that generator dataset is used in multi-device iterator. This seems related to newly added support for cancellation in tf.data. The good news is that, as you pointed out, the warning is superfluous. The bad news is that, as far as I can tell, this warning will be present for all tf.distribute jobs in TF 2.1 (given how tf.data cancellation is implemented). I will look into having a fix for this cherrypicked into TF 2.1. |
Ah, great, thanks @jsimsa. |
@jsimsa Any update on this? I'm getting this exact message and it looks like my model.fit is not doing its thing on the validation dataset during training. |
For me the results aren't reproducible from run to run either, even with
tf.random.set_seed(), but I suspect it has to do with multiple workers for
my image augmentation generator.
…On Sat, Mar 28, 2020, flydragon2018 wrote: "but the training result seems unstable; train/val loss/accuracy swings up and down too much."
|
Had the same problem. Memory leak and crash after some number of epochs. Looks like the |
It is not related to Python 3.8. I have the same problem with Python 3.7.4 |
I found a reason for the problem on my computer - YMMV. I was using the ModelCheckpoint callback to save the best model, and if there was a model with that name already in the folder, I got the error. Removing or renaming the model with that name fixed the issue. Windows 10 system, Python 3.7.4. |
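A minimal sketch of that workaround (the checkpoint filename here is hypothetical, not from the comment): delete any stale model file from a previous run before ModelCheckpoint writes to the same path.

```python
from pathlib import Path

checkpoint_path = Path("best_model.h5")  # hypothetical checkpoint name

# Remove a leftover model file from a previous run so ModelCheckpoint
# writes to a clean path instead of colliding with the old file.
if checkpoint_path.exists():
    checkpoint_path.unlink()
```

Renaming the old file instead of deleting it works just as well if you want to keep the previous best model around.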
Adding this code snippet fixes this issue for me when using RTX GPUs:
This is something I have to do in my training scripts as well. Might help someone 👍 |
Ideas from Stack Overflow. I just directly copied the code from deeplearning.ai in Colab. A part of it goes like this: history = model.fit( |
The problem arises when there is a wrong correspondence between the batch size and the steps (iterations):
+ (steps) x (batch size) must be >= the number of training images
+ use ceiling so that the condition above still holds after the division
Reference: + tensorflow/tensorflow#35100
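That ceiling rule can be sketched in plain Python (the variable and function names here are illustrative, not from the thread):

```python
import math

def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Smallest step count such that steps * batch_size >= num_images."""
    return math.ceil(num_images / batch_size)

# e.g. 1000 training images with a batch size of 32:
print(steps_per_epoch(1000, 32))  # -> 32, since 31 * 32 = 992 < 1000
```

Passing this value as steps_per_epoch to model.fit ensures an epoch never ends before the whole dataset has been seen.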
I have the same problem (using model.fit() with a NumPy generator rather than a keras.Sequence). |
devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(devices[0], True) |
It might be helpful. |
Running into the same error... Here is some information about my configuration:

Distributor ID: Ubuntu
Description: Ubuntu 18.04.5 LTS
Release: 18.04
Codename: bionic

GPU driver: Fri Apr 16 12:13:45 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... Off | 00000000:00:07.0 Off | 0 |
| N/A 34C P0 27W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+

CUDA version: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
Python version: Python 3.8.0

TensorFlow version: # Tensorflow-2.4.1 compiled from source with gcc and no TensorRT
tensorflow @ file:///home/ubuntu/projects/tensorflow/mywhl/tensorflow-2.4.1-cp38-cp38-linux_x86_64.whl
tensorflow-addons==0.12.1
tensorflow-estimator==2.4.0 Any thoughts? |
The following solved the issue for me:
Call this function at the start of your script |
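The function itself did not survive extraction; a common form of this workaround (a sketch assuming the standard tf.config memory-growth API, not necessarily the commenter's exact code) is:

```python
import tensorflow as tf

def enable_memory_growth():
    # Ask TensorFlow to allocate GPU memory on demand instead of
    # grabbing the whole card up front at startup.
    for gpu in tf.config.experimental.list_physical_devices('GPU'):
        tf.config.experimental.set_memory_growth(gpu, True)

enable_memory_growth()  # call before any op creates a GPU context
```

Note that memory growth must be set before the GPU is first used, which is why it belongs at the very top of the script.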
Hi, message.txt My current PC configuration: My current requirement setup: As you can see from the attachment, the training progresses fine. But when it reaches a checkpoint an Assertion Error is thrown. I have tried this fix: from tensorflow.compat.v1 import ConfigProto def fix_gpu(): fix_gpu() But it doesn't seem to work. Please help. |
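The body of that fix_gpu() function was lost above; the widely circulated version of this workaround (a sketch using the tf.compat.v1 session-config API, not necessarily this commenter's exact code) looks like:

```python
from tensorflow.compat.v1 import ConfigProto, InteractiveSession

def fix_gpu():
    # Let the GPU memory pool grow as needed rather than pre-allocating
    # all of it, which avoids some CUDA initialization failures.
    config = ConfigProto()
    config.gpu_options.allow_growth = True
    InteractiveSession(config=config)

fix_gpu()
```

This is the TF1-style equivalent of tf.config.experimental.set_memory_growth; as the commenter notes, it does not resolve the issue in every setup.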
Getting the same error;
I've tried every possible way to fix it, without success. Honestly, I don't know why this error is occurring. Could you please reopen this issue? |
I solved this problem (in TensorFlow 2.5). Suppose you have a file named 'train.py' to run (as an example). When running .py files in the terminal: $ CUDA_VISIBLE_DEVICES=0 python train.py # Use GPU 0.
$ CUDA_VISIBLE_DEVICES=1 python train.py # Use GPU 1.
$ CUDA_VISIBLE_DEVICES=2,3 python train.py # Use GPUs 2 and 3.
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"]="0" |
Worked for me. |
For what it's worth (or not)... I saw this same "error" (or is this actually a warning - per the But the exit code of the train script is -- Also, I was not able to replicate this error with the code provided in the original post. Versions used:
|
System information
Describe the current behavior
executing Tensorflow's MNIST handwriting example produces error:
the error disappears if the code doesn't use OneDeviceStrategy or MirroredStrategy
Code to reproduce the issue