
train_and_evaluate does not preserve the Dataset iterator state across train/eval #19062

Closed
superbobry opened this issue May 3, 2018 · 23 comments

@superbobry
Member

superbobry commented May 3, 2018

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): N/A
  • TensorFlow installed from (source or binary): binary
  • TensorFlow version (use command below): v1.8.0-0-g93bc2e2072 1.8.0
  • Python version: 3.6
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • CUDA/cuDNN version: N/A
  • GPU model and memory: N/A
  • Exact command to reproduce: see below

Describe the problem

When the input function is based on a one-shot dataset iterator, each training phase starts from the beginning of the iterator. That is, the iterator state gets reset between the train and eval phases. Therefore, if the dataset is big enough, training only ever sees the subset of the data that can be processed within eval_spec.throttle_secs.

I think the issue is caused by the graph being persisted before transitioning to the next phase and restored upon re-entering training. Either way, I find the behaviour counterintuitive, so if this is not a bug, it should be mentioned in the train_and_evaluate docs.

Source code / logs

Here is a small example demonstrating the issue:

import tensorflow as tf


def input_fn(data):
    dataset = tf.data.Dataset.from_tensor_slices(data)
    dataset = dataset.batch(batch_size=1)
    x = dataset.make_one_shot_iterator().get_next()
    # tf.Print logs each consumed element, so the log output shows whether
    # the iterator resumes or restarts after an evaluation.
    return {"x": tf.Print(x, [x])}, x


if __name__ == "__main__":
    model = tf.estimator.LinearRegressor(feature_columns=[
        tf.feature_column.numeric_column("x")
    ])

    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: input_fn(list(range(2**20))),
        max_steps=2)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: input_fn([42]),
        steps=1,
        start_delay_secs=1,
        throttle_secs=1)

    tf.logging.set_verbosity("INFO")
    tf.train.create_global_step()
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

The code produces the following log output (I've omitted irrelevant lines):

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1 secs (eval_spec.throttle_secs) or training is finished.
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[0]
INFO:tensorflow:Saving checkpoints for 1 into /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt.
INFO:tensorflow:loss = 0.0, step = 1
INFO:tensorflow:Loss for final step: 0.0.
...
INFO:tensorflow:Starting evaluation at 2018-05-03-16:21:51
...
INFO:tensorflow:Finished evaluation at 2018-05-03-16:21:51
...
INFO:tensorflow:Restoring parameters from /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt-1
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[0]
INFO:tensorflow:Saving checkpoints for 2 into /var/folders/wr/s7brqkzj74v4pmdwkwn19321l34xg2/T/tmpgtq1ap9_/model.ckpt.
INFO:tensorflow:loss = 0.0, step = 2
INFO:tensorflow:Loss for final step: 0.0.
...
@tatianashp tatianashp assigned ispirmustafa and unassigned tatianashp May 4, 2018
@tatianashp tatianashp added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label May 4, 2018
@superbobry
Member Author

Relates to #15448.

@JuanjoAlegria

Just today I came across this problem, and I can't think of a workaround. I think this is also related to #13895.

@superbobry
Member Author

superbobry commented May 14, 2018

@JuanjoAlegria the workaround is to run a TF cluster on a single machine with just two tasks: a chief doing the training and an evaluator doing only the evaluation. Each task should be started in a separate process, e.g. via ProcessPoolExecutor.

Update: I suspect that multiprocessing is not strictly required here; a ThreadPoolExecutor will probably do just as well.
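
For reference, here is a minimal sketch of that workaround, assuming a hypothetical single-machine setup: the cluster spec, port, model_dir and hyperparameters below are placeholders, and the input_fn mirrors the one from the original report.

import json
import os
from concurrent.futures import ProcessPoolExecutor

import tensorflow as tf

# Hypothetical single-machine "cluster": the chief does the training, the
# evaluator only evaluates. The port and model_dir are placeholders.
CLUSTER = {"chief": ["localhost:2222"]}
MODEL_DIR = "/tmp/train_and_evaluate_workaround"


def input_fn(data):
    dataset = tf.data.Dataset.from_tensor_slices(data).batch(1)
    x = dataset.make_one_shot_iterator().get_next()
    return {"x": x}, x


def run_task(task_type):
    # Each process sets its own TF_CONFIG before creating the estimator.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": CLUSTER,
        "task": {"type": task_type, "index": 0},
    })
    model = tf.estimator.LinearRegressor(
        feature_columns=[tf.feature_column.numeric_column("x")],
        model_dir=MODEL_DIR)  # shared, so the evaluator sees the checkpoints
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: input_fn(list(range(2 ** 20))), max_steps=100)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: input_fn([42]), steps=1, throttle_secs=1)
    tf.estimator.train_and_evaluate(model, train_spec, eval_spec)


if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(run_task, t) for t in ("chief", "evaluator")]
        for future in futures:
            future.result()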

@JuanjoAlegria

Is there any news about this issue? I agree with @superbobry that, if this is not a bug, it is at least behaviour that should be highlighted in the docs.

Thanks!

@bhack
Contributor

bhack commented Jun 7, 2018

This also caused problems with caching. See #18266.

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 14 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@ispirmustafa
Contributor

Hi @superbobry,
We have updated the behavior of train_and_evaluate. Now train is called only once, so you should not have caching problems in the train input_fn.
That said, the eval input_fn will still be called for every evaluation.

@bhack
Contributor

bhack commented Jun 25, 2018

@ispirmustafa So is #18266 also solved?

@ispirmustafa
Contributor

ispirmustafa commented Jun 26, 2018 via email

@bhack
Contributor

bhack commented Jun 26, 2018

@ispirmustafa Is there also a plan for evaluate?

@guillaumekln
Contributor

Thanks @ispirmustafa, that is a welcome change. Looks like this documentation should be updated then:

Overfitting: In order to avoid overfitting, it is recommended to set up the
training `input_fn` to shuffle the training data properly. It is also
recommended to train the model a little longer, say multiple epochs, before
performing evaluation, as the input pipeline starts from scratch for each
training. It is particularly important for local training and evaluation.

@superbobry
Member Author

Thanks for looking into this, @ispirmustafa! Am I right that the training graph is no longer persisted in local mode? If not, how did you solve the iterator persistence issue?

Also, for future reference, here's the commit fixing the train part: 3edb609.

@ispirmustafa
Contributor

Hi @guillaumekln, yes, that should be updated. We'll do that.

@ispirmustafa
Contributor

Hi @bhack,
For evaluate, we don't have any plans to change the graph re-creation.

@ispirmustafa
Contributor

Hi @superbobry,
I guess by 'persisting' you mean whether we keep the same graph across multiple train calls.
No, we're not doing that. We call train only once and do the evaluation via a hook (a listener).
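
For anyone curious about the mechanism, here is a rough, illustrative sketch of the listener-based approach (this is not TensorFlow's actual implementation; the class and argument names are made up). A CheckpointSaverListener runs inside the single train() call after every checkpoint save and triggers an evaluation, so the training graph and its dataset iterator are never torn down:

import tensorflow as tf


class EvalAfterCheckpointListener(tf.train.CheckpointSaverListener):
    """Illustrative listener: evaluate whenever a new checkpoint is written."""

    def __init__(self, estimator, eval_input_fn, eval_steps=None):
        self._estimator = estimator
        self._eval_input_fn = eval_input_fn
        self._eval_steps = eval_steps

    def after_save(self, session, global_step_value):
        # Called in the training process right after a checkpoint is saved.
        # evaluate() builds a fresh eval graph from that checkpoint while the
        # training graph (and its iterator) stays alive.
        self._estimator.evaluate(self._eval_input_fn, steps=self._eval_steps)


# Usage sketch (estimator, train_input_fn and eval_input_fn are placeholders):
# estimator.train(
#     train_input_fn,
#     max_steps=1000,
#     saving_listeners=[EvalAfterCheckpointListener(estimator, eval_input_fn, 1)])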

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Jun 27, 2018
@superbobry
Member Author

@ispirmustafa yes, thanks for clarifying this. I think we can close the issue once the documentation clearly reflects the evaluation graph re-creation.

One final question: is there a reason for not caching the evaluation graph in the hook? i.e. do we really need to recreate it?

@ispirmustafa
Contributor

ispirmustafa commented Jun 27, 2018 via email

@superbobry
Member Author

I'm sorry, I think I'm missing the context to understand what needs to be hijacked.

@ispirmustafa
Contributor

ispirmustafa commented Jul 2, 2018 via email

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 15 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@tensorflowbutler
Member

Nagging Assignee @ispirmustafa: It has been 30 days with no activity and this issue has an assignee. Please update the label and/or status accordingly.

@formigone
Contributor

formigone commented Aug 14, 2018

Am I understanding this right? Suppose my input_fn returns a tf.data.TextLineDataset(filename).map(lambda line: tf.decode_csv(line)) that references a CSV with, say, 1,000,000 lines, and it shuffles with dataset.shuffle(buffer_size=256). If the evaluation runs after a few hundred steps of training, when training resumes, will the input_fn used in training start reading the CSV from where it left off before the evaluation, or will it start over from the first line?

From the explanation above, it sounds like the latest version of train_and_evaluate doesn't interrupt the input_fn used in training (or somehow preserves it) during eval, so the input_fn keeps reading the long file line by line independently. Is this the case?

@ispirmustafa
Contributor

@formigone Your statement is correct: the train input_fn will not be impacted by eval. This is the case for TF versions >= 1.9.
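
To make that concrete, here is a sketch of the kind of train input_fn described above (the file name and column spec are placeholders). With TF >= 1.9 this pipeline is built once per train_and_evaluate call, so it keeps its position across the interleaved evaluations; the repeat() simply keeps it from ever running out:

import tensorflow as tf


def train_input_fn(filename="data.csv"):
    # Placeholder CSV with a single float column per line.
    dataset = tf.data.TextLineDataset(filename)
    dataset = dataset.map(
        lambda line: tf.decode_csv(line, record_defaults=[[0.0]])[0])
    dataset = dataset.shuffle(buffer_size=256)
    dataset = dataset.repeat()   # keep cycling over the file across epochs
    dataset = dataset.batch(32)
    x = dataset.make_one_shot_iterator().get_next()
    return {"x": x}, x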

superbobry pushed a commit to criteo/tf-yarn that referenced this issue Sep 6, 2018
Now that tensorflow/tensorflow#19062 is fixed for training, there is no
need for always forcing distributed mode. Users can just call
``tf.estimator.train_and_evaluate`` or ``Estimator.*`` to run locally.