
train_and_evaluate does not preserve the Dataset iterator state across train/eval #19062

Closed
superbobry opened this issue May 3, 2018 · 23 comments

@superbobry
Member

superbobry commented May 3, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
  • Python version: 3.6
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce: see below

Describe the problem

When the input function is based on a one-shot dataset iterator, each training phase starts from the beginning of the iterator. That is, the iterator state gets reset between the train and eval phases. Therefore, if the dataset is big enough, training only ever sees the subset of the data that can be processed within eval_spec.throttle_secs.

I think the issue is caused by the graph being persisted before transitioning to the next phase and restored upon re-entering training. Either way, I find the behaviour counterintuitive, so if this is not a bug, it should be mentioned in the train_and_evaluate docs.

Source code / logs

Here is a small example demonstrating the issue:

import tensorflow as tf


def input_fn(data):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    dataset = dataset.batch(batch_size=1)
    x = dataset.make_one_shot_iterator().get_next()
    # tf.Print logs each consumed element, so the log output shows whether
    # the iterator resumes or restarts after an evaluation.
    return {"x": tf.Print(x, [x])}, x


if __name__ == "__main__":
    model = tf.estimator.LinearRegressor(feature_columns=[
        tf.feature_column.numeric_column("x")
    ])

    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: input_fn(list(range(2**20))),
        max_steps=2)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: input_fn([42]),
        steps=1,
        start_delay_secs=1,
        throttle_secs=1)

    tf.logging.set_verbosity("INFO")
    tf.train.create_global_step()
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

The code produces the following log output (I've omitted irrelevant lines):

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[0]
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt.
INFO:tensorflow:loss = 0.0, step = 1
INFO:tensorflow:Loss for final step: 0.0.
...
INFO:tensorflow:Starting evaluation at 2018-05-03-16:21:51
...
INFO:tensorflow:Finished evaluation at 2018-05-03-16:21:51
...
INFO:tensorflow:Restoring parameters from /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt-1
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[0]
INFO:tensorflow:Saving checkpoints for 2 into /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt.
INFO:tensorflow:loss = 0.0, step = 2
INFO:tensorflow:Loss for final step: 0.0.
...
@tatianashp tatianashp assigned ispirmustafa and unassigned tatianashp May 4, 2018
@tatianashp tatianashp added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 4, 2018
@superbobry
Member Author

Relates to #15448.

@JuanjoAlegria

Just today I came across this problem, and I can't think of a workaround. I think this is also related to #13895.

@superbobry
Member Author

superbobry commented May 14, 2018

@JuanjoAlegria the workaround is to run a TF cluster on a single machine with just two tasks: a chief doing the training and an evaluator doing only the evaluation. Each task should be started in a separate process, e.g. via ProcessPoolExecutor.

Update: I suspect that multiprocessing is not strictly required here; a ThreadPoolExecutor will probably do just as well.
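
For reference, here is a minimal sketch of that workaround, assuming a hypothetical single-machine setup: the cluster spec, port, model_dir and hyperparameters below are placeholders, and the input_fn mirrors the one from the original report.

import json
import os
from concurrent.futures import ProcessPoolExecutor

import tensorflow as tf

# Hypothetical single-machine "cluster": the chief does the training, the
# evaluator only evaluates. The port and model_dir are placeholders.
CLUSTER = {"chief": ["localhost:2222"]}
MODEL_DIR = "/tmp/train_and_evaluate_workaround"


def input_fn(data):
    dataset = tf.data.Dataset.from_tensor_slices(data).batch(1)
    x = dataset.make_one_shot_iterator().get_next()
    return {"x": x}, x


def run_task(task_type):
    # Each process sets its own TF_CONFIG before creating the estimator.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": 0},
    })
    model = tf.estimator.LinearRegressor(
        feature_columns=[tf.feature_column.numeric_column("x")],
        model_dir=MODEL_DIR)  # shared, so the evaluator sees the checkpoints
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: input_fn(list(range(2 ** 20))), max_steps=100)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: input_fn([42]), steps=1, throttle_secs=1)
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_task, t) for t in ("chief", "evaluator")]
        for future in futures:
            future.result()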

@JuanjoAlegria

Is there any news about this issue? I agree with @superbobry that, if this is not a bug, it is at least behaviour that should be highlighted in the docs.

Thanks!

@bhack
Contributor

bhack commented Jun 7, 2018

This also caused problems with caching. See #18266.

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@ispirmustafa
Contributor

Hi @superbobry,
We have updated the behavior of train_and_evaluate. Now train is called only once, so you should not have caching problems in the train input_fn.
That said, the eval input_fn will still be called for every evaluation.

@bhack
Contributor

bhack commented Jun 25, 2018

@ispirmustafa So is #18266 also solved?

@ispirmustafa
Contributor

ispirmustafa commented Jun 26, 2018 via email

@bhack
Contributor

bhack commented Jun 26, 2018

@ispirmustafa Is there also a plan for evaluate?

@guillaumekln
Contributor

Thanks @ispirmustafa, that is a welcome change. Looks like this documentation should be updated then:

Overfitting: In order to avoid overfitting, it is recommended to set up the
training `input_fn` to shuffle the training data properly. It is also
recommended to train the model a little longer, say multiple epochs, before
performing evaluation, as the input pipeline starts from scratch for each
training. It is particularly important for local training and evaluation.

@superbobry
Member Author

Thanks for looking into this, @ispirmustafa! Am I right that the training graph is no longer persisted in local mode? If not, how did you solve the iterator persistence issue?

Also, for future reference, here's the commit fixing the train part: 3edb609.

@ispirmustafa
Contributor

Hi @guillaumekln, yes, that should be updated. We'll do that.

@ispirmustafa
Contributor

Hi @bhack,
For evaluate, we don't have any plans to change the graph re-creation.

@ispirmustafa
Contributor

Hi @superbobry,
I guess by 'persisting' you mean whether we keep the same graph across multiple train calls.
No, we're not doing that. We call train only once and do the evaluation via a hook (a listener).
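
For anyone curious about the mechanism, here is a rough, illustrative sketch of the listener-based approach (this is not TensorFlow's actual implementation; the class and argument names are made up). A CheckpointSaverListener runs inside the single train() call after every checkpoint save and triggers an evaluation, so the training graph and its dataset iterator are never torn down:

import tensorflow as tf


class EvalAfterCheckpointListener(tf.train.CheckpointSaverListener):
    """Illustrative listener: evaluate whenever a new checkpoint is written."""

    def __init__(self, estimator, eval_input_fn, eval_steps=None):
        self._estimator = estimator
        self._eval_input_fn = eval_input_fn
        self._eval_steps = eval_steps

    def after_save(self, session, global_step_value):
        # Called in the training process right after a checkpoint is saved.
        # evaluate() builds a fresh eval graph from that checkpoint while the
        # training graph (and its iterator) stays alive.
        self._estimator.evaluate(self._eval_input_fn, steps=self._eval_steps)


# Usage sketch (estimator, train_input_fn and eval_input_fn are placeholders):
# estimator.train(
#     train_input_fn,
#     max_steps=1000,
#     saving_listeners=[EvalAfterCheckpointListener(estimator, eval_input_fn, 1)])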

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 27, 2018
@superbobry
Member Author

@ispirmustafa yes, thanks for clarifying this. I think we can close the issue once the documentation clearly reflects the evaluation graph re-creation.

One final question: is there a reason for not caching the evaluation graph in the hook? i.e. do we really need to recreate it?

@ispirmustafa
Contributor

ispirmustafa commented Jun 27, 2018 via email

@superbobry
Member Author

I'm sorry, I think I'm missing the context to understand what needs to be hijacked.

@ispirmustafa
Contributor

ispirmustafa commented Jul 2, 2018 via email

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 30 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@formigone
Contributor

formigone commented Aug 14, 2018

Am I understanding this right? Suppose my input_fn returns a tf.data.TextLineDataset(filename).map(lambda line: tf.decode_csv(line)) that references a CSV with, say, 1,000,000 lines, and it shuffles with dataset.shuffle(buffer_size=256). If the evaluation runs after a few hundred steps of training, when training resumes, will the input_fn used in training start reading the CSV from where it left off before the evaluation, or will it start over from the first line?

From the explanation above, it sounds like the latest version of train_and_evaluate doesn't interrupt the input_fn used in training (or somehow preserves it) during eval, so the input_fn keeps reading the long file line by line independently. Is this the case?

@ispirmustafa
Contributor

@formigone Your statement is correct: the train input_fn will not be impacted by eval. This is the case for TF versions >= 1.9.
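
To make that concrete, here is a sketch of the kind of train input_fn described above (the file name and column spec are placeholders). With TF >= 1.9 this pipeline is built once per train_and_evaluate call, so it keeps its position across the interleaved evaluations; the repeat() simply keeps it from ever running out:

import tensorflow as tf


def train_input_fn(filename="data.csv"):
    # Placeholder CSV with a single float column per line.
    dataset = tf.data.TextLineDataset(filename)
    dataset = dataset.map(
        lambda line: tf.decode_csv(line, record_defaults=[[0.0]])[0])
    dataset = dataset.shuffle(buffer_size=256)
    dataset = dataset.repeat()   # keep cycling over the file across epochs
    dataset = dataset.batch(32)
    x = dataset.make_one_shot_iterator().get_next()
    return {"x": x}, x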

superbobry pushed a commit to criteo/tf-yarn that referenced this issue Sep 6, 2018
Now that tensorflow/tensorflow#19062 is fixed for training, there is no
need for always forcing distributed mode. Users can just call
``tf.estimator.train_and_evaluate`` or ``Estimator.*`` to run locally.