
Training takes > 1 day on Boston Housing example using 8 GPU machine #61

Closed
zjost opened this issue Jan 23, 2019 · 6 comments
Labels: bug Something isn't working


zjost commented Jan 23, 2019

Using TF 1.9.0 and running the example notebook. I think the problem is in the eval spec definition, which has this code section:

eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
    steps=None,
    start_delay_secs=1,
    throttle_secs=1,
)

This seems to cause evaluation every second, leading to a ginormous tf.events file (>20 GB).


cweill commented Jan 23, 2019

Evaluation should only occur whenever a checkpoint gets written. However, if one of those settings causes a checkpoint to be written every second, then you will see the reported issue. Do you have any suggestions for resolving this? Perhaps adding a comment to the notebook? A PR is welcome.
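
For context, evaluation is wired up roughly like this (model_fn, TRAIN_STEPS, and the RunConfig value below are placeholders for illustration, not the notebook's exact code):

config = tf.estimator.RunConfig(save_checkpoints_steps=5000)  # checkpoint every 5000 steps
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn("train", training=True, batch_size=BATCH_SIZE),
    max_steps=TRAIN_STEPS,
)
# Evaluation uses the eval_spec above; it should only fire when a new
# checkpoint appears (and, depending on TF version, after throttle_secs).
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)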


zjost commented Jan 24, 2019

When I changed throttle_secs to something like 30, that fixed the issue and allowed training to finish fairly quickly. However, it seems the same problem didn't exist in your notebook, so I wonder if something changed between TF versions that alters the behavior.

I'll also note that my training curves had much less data too, not just the eval curves. It's unclear to me how to control the TensorBoard write frequency for training information separately from eval information.

I'm happy to make a PR to change throttle_secs, but I'm not sure that's the right approach since the behavior seems different between TF versions. What do you think?


cweill commented Jan 28, 2019

I agree, there are many knobs you can tune. RunConfig.save_summary_steps controls how often summaries (training information) are saved. The EvalSpec fields above, in conjunction with RunConfig.save_checkpoints_steps, control evaluation frequency, because an evaluation only occurs when a new checkpoint is created. And last but not least, RunConfig.log_step_count_steps controls how often the global_steps/sec metric is written for TensorBoard.
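
For example, something along these lines (the step counts are purely illustrative, not recommendations):

config = tf.estimator.RunConfig(
    save_summary_steps=1000,      # how often training summaries are written
    save_checkpoints_steps=5000,  # how often checkpoints (and therefore evaluations) happen
    log_step_count_steps=1000,    # how often global_steps/sec is logged
)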

If you think you have a fix, feel free to send a PR, and I'll have a look.


zjost commented Jan 29, 2019

Cool, let me dig into it a bit and see if there's a sensible approach that works across versions (within reason).


zjost commented Apr 12, 2019

I think the root cause is explained in this issue. Relevant quote, on the changes to train_and_evaluate between TF versions 1.9 and 1.10:

My understanding is it now runs evaluation via a tf.train.CheckpointSaverListener. As such, the evaluation frequency is determined by the saving frequency. You can see the new functionality here: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/python/estimator/training.py#L667

When I run the example, the logs say:

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1 secs (eval_spec.throttle_secs) or training is finished.

Whereas the example notebook says:

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.

It seems TF versions >= 1.10 use a different mechanism for deciding when to evaluate, based on checkpoint writing rather than time. Here are the relevant code blocks for 1.10 and 1.9.

The same linked issue gives a workaround based on making the training input_fn finite, which triggers the eval, but it doesn't seem worth the effort to implement. I recommend just changing throttle_secs to something more sensible than 1 sec, such as 30 secs. This shouldn't impact TF versions >= 1.10 unless 5000 training steps complete faster than the new value, since evaluation only occurs when both a new checkpoint is available and the last evaluation happened more than throttle_secs ago.
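
Concretely, the change would just be the same EvalSpec as in the notebook, with a larger throttle_secs:

eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
    steps=None,
    start_delay_secs=1,
    throttle_secs=30,  # was 1; avoids re-evaluating every second on TF 1.9
)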

Would you support this change?


cweill commented Apr 12, 2019

@zjost: If you have a fix for v1.9/1.10, feel free to send a PR. The team will have a look. Is this still an issue with adanet v0.6.1?

@cweill cweill self-assigned this Apr 12, 2019
@cweill cweill added the bug Something isn't working label Apr 12, 2019
cweill pushed a commit that referenced this issue Apr 22, 2019
…_secs since this causes rapid re-evaluation in TF 1.9

Resolving #61. Just changing `throttle_secs` from 1 to 30 in `tf.estimator.EvalSpec`.

PiperOrigin-RevId: 244016318
@zjost zjost closed this as completed Aug 22, 2019