
Training takes > 1 day on Boston Housing example using 8 GPU machine #61

Closed
zjost opened this issue Jan 23, 2019 · 6 comments
Labels: bug Something isn't working


zjost commented Jan 23, 2019

Using TF 1.9.0 and running the example notebook. I think the problem is in the eval spec definition, which has this code section:

eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
    steps=None,
    start_delay_secs=1,
    throttle_secs=1,
)

This seems to cause evaluation every second, leading to a ginormous tf.events file (>20 GB).


cweill commented Jan 23, 2019

Evaluation should only occur whenever a checkpoint gets written. However, if one of those settings causes a checkpoint to be written every second, then you will see the reported issue. Do you have any suggestions for resolving this? Perhaps adding a comment to the notebook? A PR is welcome.
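
For context, evaluation is wired up roughly like this (model_fn, TRAIN_STEPS, and the RunConfig value below are placeholders for illustration, not the notebook's exact code):

config = tf.estimator.RunConfig(save_checkpoints_steps=5000)  # checkpoint every 5000 steps
estimator = tf.estimator.Estimator(model_fn=model_fn, config=config)
train_spec = tf.estimator.TrainSpec(
    input_fn=input_fn("train", training=True, batch_size=BATCH_SIZE),
    max_steps=TRAIN_STEPS,
)
# Evaluation uses the eval_spec above; it should only fire when a new
# checkpoint appears (and, depending on TF version, after throttle_secs).
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)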


zjost commented Jan 24, 2019

When I changed throttle_secs to something like 30, that fixed the issue and allowed training to finish fairly quickly. However, it seems the same problem didn't exist in your notebook, so I wonder if something changed between TF versions that alters the behavior.

I'll also note that my training curves had much less data too, not just the eval curves. It's unclear to me how to control the TensorBoard write frequency for training information separately from eval information.

I'm happy to make a PR to change throttle_secs, but I'm not sure that's the right approach since the behavior seems different between TF versions. What do you think?


cweill commented Jan 28, 2019

I agree, there are many knobs you can tune. RunConfig.save_summary_steps controls how often summaries (training information) are saved. The EvalSpec fields above, in conjunction with RunConfig.save_checkpoints_steps, control evaluation frequency, because an evaluation only occurs when a new checkpoint is created. And last but not least, RunConfig.log_step_count_steps controls how often the global_steps/sec metric is written for TensorBoard.
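
For example, something along these lines (the step counts are purely illustrative, not recommendations):

config = tf.estimator.RunConfig(
    save_summary_steps=1000,      # how often training summaries are written
    save_checkpoints_steps=5000,  # how often checkpoints (and therefore evaluations) happen
    log_step_count_steps=1000,    # how often global_steps/sec is logged
)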

If you think you have a fix, feel free to send a PR, and I'll have a look.


zjost commented Jan 29, 2019

Cool, let me dig into it a bit and see if there's a sensible approach that works across versions (within reason).


zjost commented Apr 12, 2019

I think the root cause is explained in this issue. Relevant quote, on the changes to train_and_evaluate between TF versions 1.9 and 1.10:

My understanding is it now runs evaluation via a tf.train.CheckpointSaverListener. As such, the evaluation frequency is determined by the saving frequency. You can see the new functionality here: https://github.com/tensorflow/tensorflow/blob/r1.10/tensorflow/python/estimator/training.py#L667

When I run the example, the logs say:

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after 1 secs (eval_spec.throttle_secs) or training is finished.

Whereas the example notebook says:

INFO:tensorflow:Start train and evaluate loop. The evaluate will happen after every checkpoint. Checkpoint frequency is determined based on RunConfig arguments: save_checkpoints_steps 5000 or save_checkpoints_secs None.

It seems TF versions >= 1.10 use a different mechanism for deciding when to evaluate, based on checkpoint writing rather than time. Here are the relevant code blocks for 1.10 and 1.9.

The same linked issue gives a workaround based on making the training input_fn finite, which triggers the eval, but it doesn't seem worth the effort to implement. I recommend just changing throttle_secs to something more sensible than 1 sec, such as 30 secs. This shouldn't impact TF versions >= 1.10 unless 5000 training steps complete faster than the new value, since evaluation only occurs when both a new checkpoint is available and the last evaluation happened more than throttle_secs ago.
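
Concretely, the change would just be the same EvalSpec as in the notebook, with a larger throttle_secs:

eval_spec = tf.estimator.EvalSpec(
    input_fn=input_fn("test", training=False, batch_size=BATCH_SIZE),
    steps=None,
    start_delay_secs=1,
    throttle_secs=30,  # was 1; avoids re-evaluating every second on TF 1.9
)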

Would you support this change?


cweill commented Apr 12, 2019

@zjost: If you have a fix for v1.9/1.10, feel free to send a PR. The team will have a look. Is this still an issue with adanet v0.6.1?

@cweill cweill self-assigned this Apr 12, 2019
@cweill cweill added the bug Something isn't working label Apr 12, 2019
cweill pushed a commit that referenced this issue Apr 22, 2019
…_secs since this causes rapid re-evaluation in TF 1.9

Resolving #61. Just changing `throttle_secs` from 1 to 30 in `tf.estimator.EvalSpec`.

PiperOrigin-RevId: 244016318
@zjost zjost closed this as completed Aug 22, 2019