tf.estimator.train's incompatibility with distributed training on Cloud ML Engine is not well-documented #23017
Comments
Thanks for reporting this @logicchains. You seem to have a clear opinion on how this can be improved; is there any chance you can send a PR with a fix? The api_docs are generated from the Python docstrings. @karmel, this is outside my area of expertise. Can you assign someone with a little more subject-matter knowledge? That fix would have to go in the estimator repo, right?
Nagging Assignee @karmel: It has been 29 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
Help with clarifying docs is always appreciated. @logicchains, let us know if you would like to contribute a fix here.
I'm happy to submit a change to the documentation if that's suitable. That would be https://github.com/tensorflow/estimator?
Yes, thank you.
@logicchains, sorry for the late response. Using tf.distribute.Strategy with Estimator has limited support. For more details please refer here. Thanks!
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you. |
Closing as stale. Please reopen if you'd like to work on this further. |
System information
The page https://cloud.google.com/ml-engine/docs/tensorflow/distributed-training-details#tensorflow-config notes that "The tf.estimator.train method doesn't work with distributed training on Cloud ML Engine. Please use train_and_evaluate instead." This is not documented on the Estimator page (https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator) or anywhere else I've seen. I believe it would be helpful to document it more prominently, as I can't be the only one who didn't read the "Using TF_CONFIG for Distributed Training Details" page and wasted time debugging why a model wouldn't work when distributed.
If possible, it would be even more helpful to make tf.estimator.train raise an exception, or log a warning, when run in a distributed ML Engine context. It's unreasonable to expect the user to figure this out themselves, as from an API perspective there's no reason to expect train_and_evaluate would work where train fails (one might reasonably assume train_and_evaluate calls train).
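The warning the reporter asks for could plausibly be approximated by inspecting the TF_CONFIG environment variable, which Cloud ML Engine sets for distributed jobs (per the linked documentation page). A minimal sketch, using only the standard library; the helper name in_distributed_context and the warning text are hypothetical, not part of the TensorFlow API:

```python
import json
import os
import warnings

def in_distributed_context():
    """Return True when TF_CONFIG describes a multi-node cluster,
    as Cloud ML Engine sets it for distributed jobs.
    (Hypothetical helper, not a TensorFlow API.)"""
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})
    # Count every task address across all job types (chief, worker, ps, ...).
    num_tasks = sum(len(addrs) for addrs in cluster.values())
    return num_tasks > 1

def warn_if_distributed():
    """Emit the kind of warning the issue requests before train() runs."""
    if in_distributed_context():
        warnings.warn(
            "tf.estimator.train does not work with distributed training "
            "on Cloud ML Engine; use train_and_evaluate instead."
        )

# Example: a TF_CONFIG shaped like what Cloud ML Engine would set
# for a distributed job with a chief, two workers, and a parameter server.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "chief": ["host0:2222"],
        "worker": ["host1:2222", "host2:2222"],
        "ps": ["host3:2222"],
    },
    "task": {"type": "worker", "index": 0},
})
print(in_distributed_context())  # True for the cluster above
```

A check like this, called at the top of train, would turn the silent failure the reporter hit into an actionable message.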