Training takes > 1 day on Boston Housing example using 8 GPU machine #61
Comments
Evaluation should only occur whenever a checkpoint gets written. However, if one of those causes a checkpoint to be written every second, then you will have the reported issue. Do you have any suggestions for resolving this? Perhaps adding a comment to the notebook? A PR is welcome.
When I changed I'll also note that my training data curves had much less data too, not just the eval curves. It's unclear to me how to separately control the TensorBoard write frequency of training information vs. eval information. I'm happy to make a PR to change
I agree, there are many knobs you can tune. If you think you have a fix, feel free to send a PR, and I'll have a look.
Cool, let me dig into it a bit and see if there's a sensible approach that works across versions (within reason).
I think the root cause is explained in this issue. Relevant quote related to changes to
When I run the example, the logs say:
Whereas the example notebook says:
It seems TF versions >= 1.10 use a different mechanism for deciding when to evaluate, one based on checkpoint writing rather than elapsed time. Here are the relevant code blocks for 1.10 and 1.9. The same linked issue gives a work-around related to making the training Would you support this change?
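To make the difference between the two mechanisms concrete, here is a toy model in plain Python. It is not TensorFlow's code: the function names are made up for illustration, the run length and checkpoint interval are assumed values, and it ignores the time evaluation itself takes. It only counts how often evaluation would fire under a time-based policy versus a checkpoint-triggered policy that is throttled by `throttle_secs`.

```python
def count_time_based_evals(train_secs, throttle_secs):
    """Toy model of time-based evaluation: one eval per throttle interval."""
    return train_secs // throttle_secs


def count_checkpoint_based_evals(train_secs, checkpoint_every_secs, throttle_secs):
    """Toy model of checkpoint-triggered evaluation.

    An eval is attempted each time a checkpoint is written, and skipped if
    fewer than `throttle_secs` seconds have passed since the previous eval.
    """
    evals = 0
    last_eval = None
    for t in range(checkpoint_every_secs, train_secs + 1, checkpoint_every_secs):
        if last_eval is None or t - last_eval >= throttle_secs:
            evals += 1
            last_eval = t
    return evals


if __name__ == "__main__":
    one_hour = 3600  # assumed training duration, in seconds
    print(count_time_based_evals(one_hour, throttle_secs=1))    # 3600
    print(count_time_based_evals(one_hour, throttle_secs=30))   # 120
    # Assumes a checkpoint every 600 seconds.
    print(count_checkpoint_based_evals(one_hour, 600, throttle_secs=1))  # 6
```

Under these assumptions, a one-second throttle gives thousands of evaluations per hour with the time-based policy, but only a handful when evaluation waits for a checkpoint.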
@zjost: If you have a fix for v1.9/1.10, feel free to send a PR. The team will have a look. Is this still an issue with adanet v0.6.1?
…_secs since this causes rapid re-evaluation in TF 1.9. Resolving #61. Just changing `throttle_secs` from 1 to 30 in `tf.estimator.EvalSpec`. PiperOrigin-RevId: 244016318
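For readers who want to apply the same change locally, a minimal sketch of an eval spec with the larger throttle might look like the following. The helper name `build_eval_spec` and the `eval_input_fn` argument are hypothetical and not taken from the notebook; only `tf.estimator.EvalSpec` and its `throttle_secs` argument come from the commit message above.

```python
def build_eval_spec(eval_input_fn, throttle_secs=30):
    """Builds an EvalSpec that evaluates at most once every `throttle_secs`.

    `eval_input_fn` is a placeholder for whatever evaluation input function
    your own code defines; it is an assumption, not the notebook's name.
    """
    # Imported lazily so this sketch can be loaded without TensorFlow installed.
    import tensorflow as tf

    return tf.estimator.EvalSpec(
        input_fn=eval_input_fn,
        steps=None,                   # evaluate on the full eval dataset
        throttle_secs=throttle_secs,  # was 1 before the fix described above
    )
```

The result would be passed as the `eval_spec` argument of `tf.estimator.train_and_evaluate`, alongside your existing estimator and train spec.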
I'm using TF 1.9.0 and running the example notebook. I think the problem is in the eval spec definition, which has this code section:
This seems to trigger evaluation every second, which leads to a ginormous tf.events file (>20 GB).