Description
System information
- What is the top-level directory of the model you are using: research/object_detection
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker on Linux Ubuntu 16.04
- TensorFlow installed from (source or binary): In docker, tensorflow/tensorflow:1.10.0-devel-gpu
- TensorFlow version (use command below): 1.10.0
- Bazel version (if compiling from source): N/A
- CUDA/cuDNN version: V9.0.176 / 7
- GPU model and memory: GeForce GTX 1060
- Exact command to reproduce: N/A
Describe the problem
The eval_interval_secs field in eval.proto has no effect in estimator-based training. I found that it works if you pass eval_interval_secs to EvalSpec.throttle_secs, so I sent PR #5144.
However, tensorflow/tensorflow@3edb609#diff-bc4a1638bbcd88997adf5e723b8609c7 was merged in TensorFlow 1.10, and it changes the way the checkpoint-saving frequency is customized.
For now, if you want to change how often checkpoints are saved, RunConfig.save_checkpoints_secs and RunConfig.save_checkpoints_steps are the preferred settings according to estimator/training.py#L672, and EvalSpec.throttle_secs no longer defines the checkpoint-saving frequency. It only defines the minimum time interval between evaluations. For more detail, see my summary in #5139 (comment).
Currently, there is no proper way to customize checkpoint saving and evaluation in estimator-based training. I think being able to configure both is important because I have been suffering from the memory leak in #5139 for a while. By changing the evaluation frequency, I can keep my training process from being killed mid-run.
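To illustrate the separation described above, here is a minimal sketch of how the two intervals could be split into keyword arguments for RunConfig and EvalSpec. The helper name split_schedule and the example values are hypothetical and are not part of the object detection API; the sketch only assumes that RunConfig accepts save_checkpoints_secs or save_checkpoints_steps (not both at once) and that EvalSpec accepts throttle_secs.

```python
def split_schedule(eval_interval_secs,
                   save_checkpoints_secs=None,
                   save_checkpoints_steps=None):
    """Hypothetical helper: build kwargs for RunConfig and EvalSpec separately.

    Checkpoint frequency goes to RunConfig; eval_interval_secs only sets the
    minimum time between evaluations via EvalSpec.throttle_secs.
    """
    if save_checkpoints_secs is not None and save_checkpoints_steps is not None:
        # Assumption: only one of the two checkpoint settings should be given.
        raise ValueError(
            "set only one of save_checkpoints_secs / save_checkpoints_steps")

    run_config_kwargs = {}
    if save_checkpoints_secs is not None:
        run_config_kwargs["save_checkpoints_secs"] = save_checkpoints_secs
    if save_checkpoints_steps is not None:
        run_config_kwargs["save_checkpoints_steps"] = save_checkpoints_steps

    eval_spec_kwargs = {"throttle_secs": eval_interval_secs}
    return run_config_kwargs, eval_spec_kwargs


# Example values (made up): checkpoint every 10 minutes,
# evaluate at most once every 30 minutes.
run_kwargs, eval_kwargs = split_schedule(
    eval_interval_secs=1800, save_checkpoints_secs=600)

# With TensorFlow 1.10 installed, these would be used roughly as:
#   config = tf.estimator.RunConfig(model_dir=model_dir, **run_kwargs)
#   eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, **eval_kwargs)
```

Keeping the two dictionaries separate mirrors the point made above: the checkpoint schedule and the evaluation throttle are configured through different objects.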