tf.estimator.train_and_evaluate does not run evaluation when the distribution strategy is CollectiveAllReduceStrategy #27857
Comments
I am also seeing this same problem with tf.contrib.distribute.MirroredStrategy(). Nearly the same exact setup as above, but reading from tfrecords instead of a generator.
Ping @yuefengz
Btw could you try tf-nightly and reproduce the error?
@byronyi, thanks for the reply. I tried installing tf-nightly into a clean virtual env (as well as the conda env that I normally use). Unfortunately, I'm unable to train an estimator with a distribution strategy at all (training without a distribution strategy seems to work normally). A few details below (I want to avoid adding too much of an unrelated issue). Sorry this doesn't reproduce the error exactly or give you code that runs out of the box, but I've included enough to hopefully give you an idea of what I'm doing.
Trying that out, I get this:
3) Trying MultiWorkerAllReduce, I get:
System information Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
Found the solution (to at least my problem, and hopefully the one above). If you don't dedicate a separate machine/task as an evaluator, you need to specify one of your machines/tasks as "master", NOT "chief" or "worker". In _TrainingExecutor (which is called by the distribute coordinator) there are separate methods run_chief and run_master; run_chief does not call estimator.evaluate, while run_master does!
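As a rough illustration of the workaround described above, the sketch below builds a TF_CONFIG value in which the first host takes the "master" task type and the remaining hosts are workers. The helper name make_tf_config and the host addresses are hypothetical and not from the original report; only the "cluster"/"task" layout of TF_CONFIG is standard.

```python
import json
import os


def make_tf_config(hosts, task_index, first_task_type="master"):
    """Build a TF_CONFIG dict: hosts[0] is `first_task_type`, the rest are workers.

    `task_index` is the position of the current process in `hosts`.
    (Hypothetical helper, for illustration only.)
    """
    cluster = {first_task_type: [hosts[0]]}
    if len(hosts) > 1:
        cluster["worker"] = list(hosts[1:])
    if task_index == 0:
        task = {"type": first_task_type, "index": 0}
    else:
        # Worker indices start at 0 within the "worker" job.
        task = {"type": "worker", "index": task_index - 1}
    return {"cluster": cluster, "task": task}


# Placeholder addresses; substitute the real machines.
hosts = ["host0.example:2222", "host1.example:2222"]

# Must be set before the Estimator / RunConfig is constructed on this machine.
os.environ["TF_CONFIG"] = json.dumps(make_tf_config(hosts, task_index=0))
```

Each machine would run the same script with its own task_index. Whether the "master" path evaluates in a given TensorFlow version is the commenter's observation, not something this sketch verifies.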
@wudixiaotie Is this still an issue?
Please note "master" is not officially supported by distribution strategy or Estimator. If you want to run evaluation, you need to have an "evaluator" task with
@devinkmoore Are you seeing any loss scaling related issue or is that resolved?
@yuefengz Any TF 2.x code examples to configure the "evaluator" task? Thanks.
Do you mean specifying "evaluator" in TF_CONFIG in addition to "chief", "worker", and "ps"?
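For anyone with the same question, here is a minimal sketch of how an "evaluator" task is usually expressed in TF_CONFIG for Estimator-based training, assuming the convention from the RunConfig documentation that the evaluator is a separate process and is not listed in the training cluster. The host addresses and the helper tf_config_for are hypothetical.

```python
import json
import os

# Hypothetical addresses; replace with the real hosts of the training cluster.
# Note that the evaluator does not appear here: it is assumed to run as a
# separate process outside the training cluster.
TRAINING_CLUSTER = {
    "chief": ["host0.example:2222"],
    "worker": ["host1.example:2222", "host2.example:2222"],
}


def tf_config_for(task_type, task_index=0):
    """Return the TF_CONFIG JSON string for one process (illustrative helper)."""
    return json.dumps({
        "cluster": TRAINING_CLUSTER,
        "task": {"type": task_type, "index": task_index},
    })


# On the machine dedicated to evaluation, set this before building the Estimator.
os.environ["TF_CONFIG"] = tf_config_for("evaluator")
```

The chief and workers would set the same cluster with their own task entries, for example tf_config_for("worker", 1), and every process would then call train_and_evaluate with the same model directory so the evaluator can pick up new checkpoints.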
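System information
OS Platform and Distribution: CentOS Linux release 7.3.1611
TensorFlow installed from (source or binary): binary
TensorFlow version: 1.13.1
Python version: 2.7.5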
The TensorFlow version was obtained with:
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
('v1.13.1-0-g6612da8951', '1.13.1')
Describe the current behavior
When I run an Estimator in distributed mode with CollectiveAllReduceStrategy, the train_and_evaluate API does not run evaluation after the model saves a checkpoint.
Describe the expected behavior
train_and_evaluate should run evaluation after the model saves a checkpoint.
Code to reproduce the issue