[Mirrored Strategy] You dataset iterator ran out of data; interrupting training. #30636

edwardyehuang · 2019-07-12T08:39:23Z

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
TensorFlow installed from (source or binary): pip, tensorflow-gpu 2.0 beta1
TensorFlow version (use command below): 2.0 beta1
Python version: 3.7
GPU model and memory: Titan RTX x 2 (2 x 24GB) / P100 x 2 (2 x 16GB)

The error occurs in keras.Model.fit when using MirroredStrategy with a TensorFlow dataset for both training and validation.
A single GPU works fine, with or without MirroredStrategy (when using MirroredStrategy, set devices to /gpu:0). The problem only occurs when using multiple GPUs.

The error displayed:
[training_arrays.py 325] Your dataset iterator ran out of data; interrupting training. Make sure that your iterator can generate at least "validation_steps * epochs" batches.

Currently, the only solution that works for me is to manually set "validation_steps" in keras.Model.fit.
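A minimal sketch of that workaround, assuming the number of validation examples is known in advance. The helper names (`steps_per_pass`, `fit_with_explicit_validation_steps`) and their parameters are hypothetical and only illustrate the arithmetic; the one thing taken from the report is the `validation_steps` argument of `keras.Model.fit`.

```python
import math


def steps_per_pass(num_examples, per_replica_batch_size, num_replicas,
                   drop_remainder=False):
    """Number of global batches needed to cover num_examples once.

    Assumes the dataset is batched with the global batch size
    (per-replica batch size times the number of replicas).
    """
    global_batch_size = per_replica_batch_size * num_replicas
    if drop_remainder:
        # A short final batch is discarded, so round down.
        return num_examples // global_batch_size
    # A short final batch is kept, so round up.
    return math.ceil(num_examples / global_batch_size)


def fit_with_explicit_validation_steps(model, train_ds, val_ds,
                                       num_val_examples,
                                       per_replica_batch_size, num_replicas,
                                       epochs=1):
    """Call model.fit with validation_steps set by hand instead of inferred."""
    validation_steps = steps_per_pass(
        num_val_examples, per_replica_batch_size, num_replicas)
    return model.fit(
        train_ds,
        epochs=epochs,
        validation_data=val_ds,
        validation_steps=validation_steps,
    )
```

Whether to round up or down depends on how the validation dataset was batched; if the final partial batch is dropped, pass `drop_remainder=True` to `steps_per_pass` when computing the value.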

Using repeat and/or take on the TensorFlow dataset does not work. Setting the validation batch size to 2 (1 for each GPU) does not work either.

A similar issue was reported in #25254, but it has been closed.

edwardyehuang · 2019-07-12T08:42:17Z

As mentioned in #25254, the official example code does not work either.

isaprykin · 2019-07-24T17:06:03Z

Keras should be able to figure out validation_steps. (cc @omalleyt12 )

Could you please share the code so that we have more complete details of this case while we are looking at it? Thanks.

edwardyehuang · 2019-07-24T20:52:53Z

> Keras should be able to figure out validation_steps. (cc @omalleyt12 )
>
> Could you please share the code so that we have more complete details of this case while we are looking at it? Thanks.

When I do not pass validation_steps:

Without MirroredStrategy: works
With MirroredStrategy, single card: works
With MirroredStrategy, multiple cards with device = /gpu:0: works
With MirroredStrategy, multiple cards with device = /gpu:1: works
With MirroredStrategy, multiple cards: error

edwardyehuang · 2019-07-24T20:58:28Z

I think Keras may not handle the case where the number of validation examples is odd.
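To make that hypothesis concrete, here is a toy model of how an odd number of examples could leave one replica without data on the last step. This is an illustration only, not TensorFlow's actual sharding logic; the function name and the in-order filling of replicas are assumptions.

```python
def split_batches_across_replicas(num_examples, global_batch_size, num_replicas):
    """Return, for each step, how many examples each replica would receive.

    Toy model only: every global batch is cut into equal per-replica slices,
    and a short final batch fills the replicas in order.
    """
    if global_batch_size % num_replicas != 0:
        raise ValueError("global_batch_size must be divisible by num_replicas")
    per_replica = global_batch_size // num_replicas
    steps = []
    remaining = num_examples
    while remaining > 0:
        batch = min(global_batch_size, remaining)
        remaining -= batch
        shares = []
        for _ in range(num_replicas):
            share = min(per_replica, batch)
            shares.append(share)
            batch -= share
        steps.append(shares)
    return steps


# 5 validation examples, global batch size 2 (1 per GPU), 2 GPUs:
# in this toy model the last step leaves the second replica with nothing.
for step, shares in enumerate(split_batches_across_replicas(5, 2, 2)):
    print(step, shares)
```

With an even count such as 4 examples, every replica receives one example on every step, which would be consistent with the single-GPU configurations working.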

Hugh0120 · 2019-07-26T06:56:32Z

Same here.

weichen456 · 2019-08-20T08:12:38Z

I have the same problem.

samvdj · 2019-08-30T14:20:31Z

Same issue

ymodak · 2019-09-11T23:12:24Z

@edwardyehuang Can you please test it against the latest TF 2.0 nightly build? Thanks!

pip install tf-nightly-2.0-preview

jvishnuvardhan · 2019-10-08T17:05:55Z

@edwardyehuang Is this still an issue? Can you check with TF2.0 and/or tf-nightly and let us know whether the issue persists with latest TF version. Thanks!

jvishnuvardhan · 2019-10-17T00:09:39Z

@edwardyehuang Is this still an issue? If not, please close the issue. Thanks!

tensorflow-bot · 2019-10-17T00:22:15Z

Are you satisfied with the resolution of your issue?
Yes
No

keunwoochoi · 2019-10-30T18:40:08Z

Is this really fixed? With TF 2.0.0 + keras.fit + tf.dataset for validation data + multi-GPU, I'm having the same issue.

edwardyehuang · 2019-10-30T23:09:28Z

Fix confirmed in the 2.0 stable version.

Dabuk · 2019-11-05T22:22:42Z

What's the fix? I want to use it in 1.14.

edwardyehuang · 2019-11-28T00:10:08Z

I found that Keras cannot estimate the size of the data when it is read from TFRecord files.
Usually, Keras fit shows the message: Train for xxx steps, validate for xxxx steps
If I do not set "validation_steps" in model.fit and use TFRecord files, it only shows: Train for xxx steps
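One way to supply the missing step count is to count the validation batches once up front. The sketch below assumes the validation dataset is finite, already batched, and can be iterated eagerly (as `tf.data.Dataset` objects can in TF 2.x). `count_batches` is a hypothetical helper, not a Keras or TensorFlow API, and the list of batches is a stand-in for a real dataset.

```python
def count_batches(batched_dataset):
    """Count the batches in a finite iterable by walking it once."""
    return sum(1 for _ in batched_dataset)


# Stand-in for a batched validation dataset read from TFRecord files;
# any finite iterable of batches behaves the same way here.
fake_val_batches = [[0, 1], [2, 3], [4]]
validation_steps = count_batches(fake_val_batches)
print(validation_steps)  # 3

# The result would then be passed explicitly, e.g.
# model.fit(train_ds, validation_data=val_ds, validation_steps=validation_steps)
```

Counting costs one full pass over the data, and it must happen before any `repeat()` is applied, since an infinite dataset would never finish iterating.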

edwardyehuang · 2019-11-28T00:11:13Z

However, validation still completes; it just prints some ugly warning messages, and the program does not stop.

YekaiLiu · 2020-02-26T01:54:50Z

try:
    # %tensorflow_version only exists in Colab; this enables TF 1.x there
    %tensorflow_version 1.x
except Exception:
    pass

import tensorflow as tf
print(tf.__version__)

# Just try not to use TF 2.x

tensorflowbutler · 2021-06-10T18:13:44Z

Hi There,

We are checking to see if you still need help on this issue, as you are using an older version of tensorflow(1.x) which is officially considered as end of life. We recommend that you upgrade to 2.4 or later version and let us know if the issue still persists in newer versions.

This issue will be closed automatically 7 days from now. If you still need help with this issue, Please open a new issue for any help you need against 2.x, and we will get you the right help.

google-ml-butler · 2021-06-11T01:36:34Z

Are you satisfied with the resolution of your issue?
Yes
No

edwardyehuang changed the title ~~You dataset iterator ran out of data; interrupting training.~~ [Mirrored Strategy] You dataset iterator ran out of data; interrupting training. Jul 12, 2019

oanush self-assigned this Jul 15, 2019

oanush added the 2.0.0-beta0, comp:dist-strat (Distribution Strategy related issues), and type:bug (Bug) labels Jul 15, 2019

oanush assigned ymodak and unassigned oanush Jul 23, 2019

jvishnuvardhan added the stat:awaiting response (Status - Awaiting response from author) and TF 2.0 (Issues relating to TensorFlow 2.0) labels and removed the TF 2.0.0-beta0 label Oct 8, 2019

edwardyehuang closed this as completed Oct 17, 2019

edwardyehuang reopened this Nov 26, 2019

tensorflowbutler removed the stat:awaiting response (Status - Awaiting response from author) label Nov 26, 2019

edwardyehuang closed this as completed Jun 11, 2021
