train_and_evaluate does not preserve the Dataset iterator state across train/eval #19062
Comments
Relates to #15448.
Just today I came across this problem, and I can't think of a workaround. I think it is also related to #13895.
@JuanjoAlegria the workaround is to run the TF cluster on a single machine with just two tasks: a chief doing the training, and an evaluator doing solely the evaluation. Each task should be started in a separate process. Update: I suspect that multiprocessing is not actually required here.
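A sketch of the TF_CONFIG setup this workaround describes (the port and the exact cluster layout are placeholder assumptions, not taken from the thread): each of the two processes sets its own TF_CONFIG before calling tf.estimator.train_and_evaluate.

```python
import json
import os

# Hypothetical single-machine cluster: one chief task doing the training.
# The evaluator is deliberately NOT listed in the cluster spec; tf.estimator
# treats an "evaluator" task as a standalone process that polls the model
# directory for new checkpoints and evaluates each one.
cluster = {"chief": ["localhost:2222"]}

chief_env = {"cluster": cluster, "task": {"type": "chief", "index": 0}}
evaluator_env = {"cluster": cluster, "task": {"type": "evaluator", "index": 0}}

# In each of the two processes, set TF_CONFIG before constructing the
# Estimator, then call tf.estimator.train_and_evaluate(...) as usual.
os.environ["TF_CONFIG"] = json.dumps(chief_env)
```

With this split, the training input pipeline lives entirely in the chief process, so evaluation in the other process cannot reset its iterator state.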
Is there any news about this issue? I agree with @superbobry that, if this is not a bug, it is at least an issue that should be highlighted in the docs. Thanks!
This also caused problems with the cache. See #18266.
Nagging Assignee @ispirmustafa: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.
Hi @superbobry
@ispirmustafa So is #18266 also solved?
It fixes the train input_fn part, but not evaluate.
@ispirmustafa is there also a plan for evaluate?
Thanks @ispirmustafa, that is a welcome change. Looks like this documentation should be updated then: tensorflow/tensorflow/python/estimator/training.py, lines 280 to 284 at commit 3edb609.
Thanks for looking into this, @ispirmustafa! Am I right that the training graph is no longer persisted in local mode? If so, how did you solve the iterator persistence issue? Also, for future reference, here is the commit fixing the train part: 3edb609.
Hi @guillaumekln, yes, that should be updated; we'll do that.
Hi @bhack,
Hi @superbobry,
@ispirmustafa yes, thanks for clarifying this. I think we can close the issue once the documentation clearly reflects the evaluation graph re-creation. One final question: is there a reason for not caching the evaluation graph in the hook? I.e., do we really need to recreate it?
Keeping the evaluation graph would require hijacking the existing Estimator API.
I'm sorry, I think I'm missing the context to understand what needs to be hijacked.
The way we keep the training graph is that we call estimator.train only once. We need to call evaluation repeatedly (once per checkpoint). So if we wanted to keep the same graph, we would need to access the graph created within estimator.evaluate. Estimator recreates the graph on every call of evaluate/train by design.
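A plain-Python sketch of the control flow described in this comment (the function names are illustrative, not the actual TF source): training is a single call whose graph lives for the whole run, while evaluation is a fresh call, and hence a fresh graph, per checkpoint.

```python
def train_and_evaluate_sketch(train_once, evaluate, checkpoints):
    """Illustrates the local-mode call pattern: train once, evaluate per checkpoint."""
    train_once()  # called exactly once: the training graph survives the whole run
    results = []
    for ckpt in checkpoints:
        # Each call builds a brand-new evaluation graph, by Estimator design;
        # caching it would require reaching into estimator.evaluate internals.
        results.append(evaluate(ckpt))
    return results

calls = []
out = train_and_evaluate_sketch(
    train_once=lambda: calls.append("train"),
    evaluate=lambda ckpt: calls.append(("eval", ckpt)) or ckpt,
    checkpoints=[100, 200, 300],
)
```

This is why the iterator fix could be made on the training side (one long-lived call) but not symmetrically on the evaluation side.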
Am I understanding this right? Suppose my input_fn returns a Dataset that reads a long file line by line. From the explanation above, it sounds like the latest version of train_and_evaluate doesn't interrupt the input_fn used in training (or somehow preserves it) during eval, and thus the input_fn keeps reading the long file line by line independently. Is this the case?
@formigone Your statement is correct: the train input_fn will not be impacted by eval. This is the case for TF versions >= 1.9.
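A minimal plain-Python model of the >= 1.9 behaviour confirmed here (no TensorFlow involved; an ordinary generator stands in for the Dataset reading a file line by line): the training input pipeline keeps its position across eval phases.

```python
def line_reader(lines):
    # Stands in for an input_fn's Dataset that reads a long file line by line.
    for line in lines:
        yield line

lines = [f"line-{i}" for i in range(8)]
reader = line_reader(lines)

consumed = []
for phase in range(2):
    # Train for a few steps...
    consumed.append([next(reader) for _ in range(3)])
    # ...then an eval phase runs. In TF >= 1.9 this does not touch the
    # training input pipeline, so `reader` keeps its position and the
    # next train phase resumes where the previous one left off.
```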
Now that tensorflow/tensorflow#19062 is fixed for training, there is no need to always force distributed mode. Users can just call ``tf.estimator.train_and_evaluate`` or ``Estimator.*`` to run locally.
System information
Describe the problem
When the input function is based on a one-shot dataset iterator, the training phase always starts from the beginning of the iterator. That is, the iterator state gets reset between the train and eval phases. Therefore, if the dataset is big enough, training would only ever see the subset of the data that can be processed within eval_spec.throttle_secs.

I think the issue is caused by the fact that the graph is persisted before transitioning to the next phase, and restored upon reentering training. However, I find this behaviour a bit counterintuitive, so if it is not a bug, it should be mentioned in the train_and_evaluate docs.

Source code / logs
Here is a small example demonstrating the issue:
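(The original snippet did not survive this export. As a stand-in, here is a plain-Python sketch of the reported behaviour, with an ordinary generator playing the role of the one-shot Dataset iterator: rebuilding the input pipeline when training re-enters restarts the data from the beginning.)

```python
def make_one_shot_iterator(data):
    # Stands in for Dataset.make_one_shot_iterator(): all position state
    # lives in the returned iterator, not in the data itself.
    return iter(data)

data = list(range(100))

it = make_one_shot_iterator(data)
train_phase_1 = [next(it) for _ in range(5)]
# ... an eval phase runs; on re-entering training the graph is rebuilt ...
it = make_one_shot_iterator(data)   # iterator state is lost here
train_phase_2 = [next(it) for _ in range(5)]
# Both phases read the same head of the dataset; elements 5..99 are never
# seen if each train phase only lasts eval_spec.throttle_secs.
```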
The code produces the following log output (I've omitted irrelevant lines):