tf.estimator.train's incompatibility with distributed training on Cloud ML Engine is not well-documented #23017

Closed
logicchains opened this issue Oct 16, 2018 · 8 comments
Labels: stale, stat:awaiting response, stat:contribution welcome, type:docs-bug

Comments

@logicchains

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Fedora 28
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device:
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): 1.11.0
  • Python version: 3.5.0
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory:
  • Exact command to reproduce:

The page https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details#tensorflow-config notes that "The tf.estimator.train method doesn't work with distributed training on Cloud ML Engine. Please use train_and_evaluate instead." This is not documented on the Estimator page (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) or anywhere else I've seen. I believe it would help to document it more prominently: I can't be the only one who skipped the "Using TF_CONFIG for Distributed Training Details" page and then wasted time debugging why a model wouldn't work when distributed.

If possible, it would be even more helpful to make tf.estimator.train raise an exception, or at least log a warning, when it is run in a distributed ML Engine context. It's unreasonable to expect users to figure this out themselves: from an API perspective, there's no reason to expect train_and_evaluate to work where train fails (one might reasonably assume train_and_evaluate simply calls train).
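
A minimal sketch of the kind of check suggested above, assuming the TF_CONFIG layout described on the Cloud ML Engine page (a JSON object whose `cluster` entry maps job names such as `master`, `worker`, and `ps` to lists of host addresses). The helper names `is_distributed` and `warn_if_distributed` are hypothetical and not part of TensorFlow; the sketch uses only the Python standard library and does not call Estimator itself.

```python
import json
import os
import warnings


def is_distributed(tf_config_json=None):
    """Return True if TF_CONFIG describes a cluster with more than one task.

    Hypothetical helper: reads the TF_CONFIG environment variable unless a
    JSON string is passed in explicitly (handy for testing).
    """
    raw = tf_config_json if tf_config_json is not None else os.environ.get("TF_CONFIG", "")
    if not raw:
        return False
    try:
        config = json.loads(raw)
    except ValueError:
        # Malformed TF_CONFIG: treat as non-distributed in this sketch.
        return False
    if not isinstance(config, dict):
        return False
    cluster = config.get("cluster") or {}
    # Count every host listed under every job name (master, worker, ps, ...).
    total_tasks = sum(len(hosts) for hosts in cluster.values())
    return total_tasks > 1


def warn_if_distributed(tf_config_json=None):
    """Emit a warning when a multi-task cluster is configured; return the check result."""
    distributed = is_distributed(tf_config_json)
    if distributed:
        warnings.warn(
            "TF_CONFIG describes a distributed cluster; Estimator.train is "
            "documented as unsupported here, use train_and_evaluate instead."
        )
    return distributed
```

A caller could invoke `warn_if_distributed()` just before calling `train`; whether to warn or raise is a design choice left open in this sketch.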

@MarkDaoust
Member

Thanks for reporting this @logicchains.

You seem to have a clear opinion on how this can be improved. Is there any chance you can send a PR with a fix? The api_docs are generated from the Python docstrings.

@karmel this is outside my area of expertise. Can you assign someone with a little more subject matter knowledge? That fix would have to go in the estimator repo, right?

@MarkDaoust removed their assignment on Oct 23, 2018
@tensorflowbutler
Member

Nagging Assignee @karmel: It has been 29 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@karmel added the type:docs-bug and stat:contribution welcome labels on Dec 3, 2018
@karmel removed their assignment on Dec 3, 2018
@karmel

karmel commented Dec 3, 2018

Help with clarifying docs is always appreciated. @logicchains, let us know if you would like to contribute a fix here.

@logicchains
Author

I'm happy to submit a change to the documentation if that's suitable. Would that go in https://github.com/tensorflow/estimator?

@karmel

karmel commented Dec 4, 2018

Yes, thank you.

@chunduriv self-assigned this on Mar 17, 2022
@chunduriv
Contributor

@logicchains, sorry for the late response.

There is limited support for training with Estimator using all strategies except TPUStrategy. Basic training and evaluation should work, but a number of advanced features such as v1.train.Scaffold do not. There may also be a number of bugs in this integration, and there are no plans to actively improve this support (the focus is on Keras and custom training loop support). If at all possible, you should prefer to use tf.distribute with those APIs instead.

Using tf.distribute.Strategy with Estimator has limited support. For more details please refer here. Thanks!

@chunduriv added the stat:awaiting response label on Mar 17, 2022
@google-ml-butler

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

@google-ml-butler (bot) added the stale label on Mar 24, 2022
@google-ml-butler

Closing as stale. Please reopen if you'd like to work on this further.

No branches or pull requests

6 participants