[Mirrored Strategy] You dataset iterator ran out of data; interrupting training. #30636

Closed
edwardyehuang opened this issue Jul 12, 2019 · 19 comments
Labels: comp:dist-strat (Distribution Strategy related issues), TF 2.0 (Issues relating to TensorFlow 2.0), type:bug (Bug)

Comments

@edwardyehuang
Contributor

edwardyehuang commented Jul 12, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow installed from (source or binary): pip (tensorflow-gpu 2.0 beta1)
  • TensorFlow version (use command below): 2.0 beta1
  • Python version: 3.7
  • GPU model and memory: Titan RTX x 2 (2 x 24GB) / P100 x 2 (2 x 16GB)

The error occurs in keras.Model.fit
when using MirroredStrategy with a TensorFlow dataset for both training and validation.
A single GPU card works fine, with or without MirroredStrategy (when using MirroredStrategy, I set devices = /gpu:0). The problem only occurs when using multiple GPU cards.

The error displayed:
[training_arrays.py 325] Your dataset iterator ran out of data; interrupting training. Make sure that your iterator can generate at least "validation_steps * epochs" batches.

Currently, the only solution that works for me is to manually set "validation_steps" in keras.Model.fit.

Using repeat and/or take on the TensorFlow dataset does not work. Setting the validation batch size to 2 (1 for each GPU) does not work either.
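The workaround described above (passing validation_steps by hand) might look roughly like the sketch below. This is not the reporter's code: the helper names steps_per_pass and fit_with_validation_steps are hypothetical, and the sketch assumes that the number of validation examples is known in advance and that the validation dataset is batched with the global batch size (per-replica batch size times the number of GPUs).

```python
import math


def steps_per_pass(num_examples, global_batch_size, drop_remainder=False):
    """Number of batches needed to go through `num_examples` once."""
    if num_examples <= 0 or global_batch_size <= 0:
        raise ValueError("num_examples and global_batch_size must be positive")
    if drop_remainder:
        # The final partial batch is discarded, so round down.
        return num_examples // global_batch_size
    # The final partial batch is kept, so round up.
    return math.ceil(num_examples / global_batch_size)


def fit_with_validation_steps(model, train_ds, val_ds, num_val_examples,
                              global_batch_size, epochs=1):
    """Call model.fit with validation_steps set explicitly.

    Assumptions (hypothetical inputs, not from the issue): `model` is a
    compiled tf.keras.Model built inside strategy.scope(), and `train_ds`
    and `val_ds` are batched tf.data datasets.
    """
    return model.fit(
        train_ds,
        epochs=epochs,
        validation_data=val_ds,
        validation_steps=steps_per_pass(num_val_examples, global_batch_size),
    )
```

For example, 1001 validation examples with a global batch size of 64 give steps_per_pass(1001, 64), which rounds up so that the final partial batch is still evaluated.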

A similar issue was reported in #25254, but it was closed.

@edwardyehuang
Contributor Author

edwardyehuang commented Jul 12, 2019

As mentioned in #25254, the official example code does not work either.

@edwardyehuang edwardyehuang changed the title from "You dataset iterator ran out of data; interrupting training." to "[Mirrored Strategy] You dataset iterator ran out of data; interrupting training." Jul 12, 2019
@oanush oanush self-assigned this Jul 15, 2019
@oanush oanush added 2.0.0-beta0 comp:dist-strat Distribution Strategy related issues type:bug Bug labels Jul 15, 2019
@oanush oanush assigned ymodak and unassigned oanush Jul 23, 2019
@isaprykin
Contributor

Keras should be able to figure out validation_steps. (cc @omalleyt12 )

Could you please share the code so that we have more complete details of this case while we are looking at it? Thanks.

@edwardyehuang
Contributor Author

> Keras should be able to figure out validation_steps. (cc @omalleyt12 )
>
> Could you please share the code so that we have more complete details of this case while we are looking at it? Thanks.

When I do not pass validation_steps:

Without MirroredStrategy: works
With MirroredStrategy, single card: works
With MirroredStrategy, multiple cards with device = /gpu:0: works
With MirroredStrategy, multiple cards with device = /gpu:1: works
With MirroredStrategy, multiple cards: error
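The configurations listed above can be reproduced by varying the devices argument of tf.distribute.MirroredStrategy. A rough sketch, assuming TensorFlow 2.x is installed and two GPUs are visible; the names CONFIGS and make_strategy are hypothetical and do not come from the reporter's code. TensorFlow is imported lazily so the configuration table can be inspected without it.

```python
# Device lists for the configurations described above (hypothetical names).
# An empty tuple means "no strategy"; None means "use every visible GPU".
CONFIGS = {
    "no strategy": (),
    "mirrored, /gpu:0 only": ("/gpu:0",),
    "mirrored, /gpu:1 only": ("/gpu:1",),
    "mirrored, all GPUs": None,  # the case reported to fail
}


def make_strategy(devices):
    """Return a MirroredStrategy for `devices`, or None for the no-strategy case."""
    if devices == ():
        return None

    import tensorflow as tf  # imported lazily; requires TensorFlow 2.x

    if devices is None:
        # With no explicit device list, MirroredStrategy uses all visible GPUs.
        return tf.distribute.MirroredStrategy()
    return tf.distribute.MirroredStrategy(devices=list(devices))
```

When a strategy is returned, the model would be built and compiled inside strategy.scope() before calling fit.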

@edwardyehuang
Contributor Author

I think this may be because Keras does not handle the case where the number of validation examples is odd.
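The hypothesis above can be made concrete with a little arithmetic: how many examples land in the final batch, and whether that batch divides evenly across the GPUs. The sketch below only illustrates the counting; the function name last_batch_split is hypothetical, the code does not reproduce TensorFlow's internal batch splitting, and the thread does not confirm that this is the actual cause.

```python
def last_batch_split(num_examples, global_batch_size, num_replicas):
    """Return (size of the final batch, whether it divides evenly across replicas)."""
    remainder = num_examples % global_batch_size
    # If the data divides evenly, the final batch is a full batch.
    last = remainder if remainder else global_batch_size
    return last, last % num_replicas == 0


# 101 examples, global batch size 2, 2 GPUs: the final batch holds 1 example.
print(last_batch_split(101, 2, 2))
# 100 examples with the same settings: the final batch is full.
print(last_batch_split(100, 2, 2))
```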

@Hugh0120

Same here.

@weichen456

I have the same problem.

@samvdj

samvdj commented Aug 30, 2019

Same issue

@ymodak
Contributor

ymodak commented Sep 11, 2019

@edwardyehuang Can you please test it against the latest TF 2.0 nightly build? Thanks!

pip install tf-nightly-2.0-preview

@jvishnuvardhan
Contributor

@edwardyehuang Is this still an issue? Can you check with TF 2.0 and/or tf-nightly and let us know whether the issue persists with the latest TF version? Thanks!

@jvishnuvardhan jvishnuvardhan added stat:awaiting response Status - Awaiting response from author TF 2.0 Issues relating to TensorFlow 2.0 and removed TF 2.0.0-beta0 labels Oct 8, 2019
@jvishnuvardhan
Contributor

@edwardyehuang Is this still an issue? If not, please close the issue. Thanks!

@tensorflow-bot

Are you satisfied with the resolution of your issue?

@keunwoochoi

Is this really fixed? With TF 2.0.0 + Keras fit + a tf.data dataset for validation data + multi-GPU, I'm having the same issue.

@edwardyehuang
Contributor Author

Fix confirmed in the 2.0 stable version.

@Dabuk

Dabuk commented Nov 5, 2019

What's the fix? I want to use it in 1.14.

@edwardyehuang edwardyehuang reopened this Nov 26, 2019
@tensorflowbutler tensorflowbutler removed the stat:awaiting response Status - Awaiting response from author label Nov 26, 2019
@edwardyehuang
Contributor Author

edwardyehuang commented Nov 28, 2019

I found that Keras cannot estimate the number of steps when the data comes from a TFRecord file.
Usually, Keras fit shows the message: Train for xxx steps, validate for xxxx steps
If I do not set "validation_steps" in model.fit and use TFRecord data, it only shows: Train for xxx steps
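One possible way to supply the missing step count is to count the validation batches once up front and pass the result as validation_steps. The sketch below assumes eager TF 2.x, where a finite, batched tf.data dataset can be iterated like any Python iterable; the helper name count_batches and the names paths and parse_fn in the comments are hypothetical. Counting costs one full pass over the data.

```python
def count_batches(dataset):
    """Count elements by iterating once; `dataset` must be finite (not repeated)."""
    return sum(1 for _ in dataset)


# Hypothetical usage with a TFRecord-backed validation set:
#   val_ds = tf.data.TFRecordDataset(paths).map(parse_fn).batch(global_batch_size)
#   model.fit(train_ds, validation_data=val_ds,
#             validation_steps=count_batches(val_ds))
```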

@edwardyehuang
Contributor Author

edwardyehuang commented Nov 28, 2019

However, validation still completes; it just prints some ugly warning messages, and the program does not stop.

@YekaiLiu

try:
    # %tensorflow_version only exists in Colab; this selects TF 1.x.
    %tensorflow_version 1.x
except Exception:
    pass

import tensorflow as tf
print(tf.__version__)

# Just try not to use TF 2.x.

@tensorflowbutler
Member

Hi There,

We are checking to see if you still need help on this issue, as you are using an older version of tensorflow(1.x) which is officially considered as end of life. We recommend that you upgrade to 2.4 or later version and let us know if the issue still persists in newer versions.

This issue will be closed automatically 7 days from now. If you still need help with this issue, Please open a new issue for any help you need against 2.x, and we will get you the right help.
