[Feature Request] A clear way to control the frequency of saving checkpoints and evaluation #5303

@bleqdyce

Description

System information

  • What is the top-level directory of the model you are using: research/object_detection
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Docker on Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): In docker, tensorflow/tensorflow:1.10.0-devel-gpu
  • TensorFlow version (use command below): 1.10.0
  • Bazel version (if compiling from source): N/A
  • CUDA/cuDNN version: V9.0.176 / 7
  • GPU model and memory: GeForce GTX 1060
  • Exact command to reproduce: N/A

Describe the problem

The eval_interval_secs field in eval.proto has no effect in estimator-based training. I found that it works if you pass eval_interval_secs to EvalSpec.throttle_secs, so I sent PR #5144.

However, tensorflow/tensorflow@3edb609#diff-bc4a1638bbcd88997adf5e723b8609c7 was merged in TensorFlow 1.10, and it changes how the checkpoint-saving frequency is customized.

As of now, if you want to change how often checkpoints are saved, RunConfig.save_checkpoints_secs and RunConfig.save_checkpoints_steps are the preferred settings according to estimator/training.py#L672, and EvalSpec.throttle_secs no longer defines the checkpoint-saving frequency. It only defines the minimum time interval between evaluations. For more detail, see my summary in #5139 (comment).
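To make the interaction concrete, here is a small pure-Python model of the behavior described above. It is a simplified sketch, not TensorFlow's actual code: it assumes a checkpoint is written every save_checkpoints_secs, that each new checkpoint is a candidate for evaluation, and that an evaluation is skipped when less than throttle_secs has passed since the previous one. The function name simulate_schedule is hypothetical.

```python
def simulate_schedule(total_secs, save_checkpoints_secs, throttle_secs):
    """Return (checkpoint_times, eval_times) for a simplified training run.

    Assumption: evaluation can only happen right after a checkpoint is
    saved, and only if at least `throttle_secs` have elapsed since the
    previous evaluation.
    """
    checkpoint_times = []
    eval_times = []
    last_eval = None
    t = save_checkpoints_secs
    while t <= total_secs:
        checkpoint_times.append(t)
        # throttle_secs acts as a minimum gap between evaluations.
        if last_eval is None or t - last_eval >= throttle_secs:
            eval_times.append(t)
            last_eval = t
        t += save_checkpoints_secs
    return checkpoint_times, eval_times


if __name__ == "__main__":
    # Checkpoint every 10 minutes, evaluate at most every 30 minutes.
    ckpts, evals = simulate_schedule(3600, 600, 1800)
    print("checkpoints:", ckpts)
    print("evaluations:", evals)
```

In this model, a throttle_secs smaller than save_checkpoints_secs does not make evaluation more frequent: evaluations then simply coincide with checkpoints.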

Currently, we don't have a proper way to customize checkpoint saving and evaluation in estimator-based training. I think being able to configure both is important, because I have been suffering from the memory leak in #5139 for a while. By changing the evaluation frequency, I can keep my training process from being killed mid-run.
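For illustration, the two knobs could be wired up roughly as follows. This is a sketch under assumptions, not the object_detection API's actual code: it assumes the TensorFlow 1.x tf.estimator API, the helper names schedule_kwargs and build_config_and_eval_spec are hypothetical, and the default values 600 and 1800 are arbitrary.

```python
def schedule_kwargs(save_checkpoints_secs, eval_interval_secs):
    """Split the two intervals into kwargs for RunConfig and EvalSpec."""
    # Checkpoint frequency belongs to RunConfig.
    run_config_kwargs = {"save_checkpoints_secs": save_checkpoints_secs}
    # throttle_secs is only the minimum interval between evaluations.
    eval_spec_kwargs = {"throttle_secs": eval_interval_secs}
    return run_config_kwargs, eval_spec_kwargs


def build_config_and_eval_spec(model_dir, eval_input_fn,
                               save_checkpoints_secs=600,
                               eval_interval_secs=1800):
    """Build a RunConfig and an EvalSpec with separate intervals."""
    # Imported lazily so schedule_kwargs stays usable without TensorFlow.
    import tensorflow as tf

    run_kwargs, eval_kwargs = schedule_kwargs(
        save_checkpoints_secs, eval_interval_secs)
    run_config = tf.estimator.RunConfig(model_dir=model_dir, **run_kwargs)
    eval_spec = tf.estimator.EvalSpec(input_fn=eval_input_fn, **eval_kwargs)
    return run_config, eval_spec
```

The resulting run_config would be passed to the Estimator constructor and eval_spec to tf.estimator.train_and_evaluate.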

