Resuming training from saved checkpoint produces different result than uninterrupted training #27049

Closed
lostmsu opened this issue Mar 22, 2019 · 6 comments
Assignees: jvishnuvardhan
Labels: comp:apis (Highlevel API related issues), type:bug (Bug)

Comments

lostmsu commented Mar 22, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (Google Colab)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow installed from (source or binary): Google Colab
  • TensorFlow version: 1.13.1
  • Python version: Python 3

Describe the current behavior
Loading a model saved by tf.keras.callbacks.ModelCheckpoint with tf.keras.models.load_model and then resuming training produces different results than running the same training uninterrupted (i.e. without the save/restore step in the middle).

Describe the expected behavior
Saving and restoring the model should allow training to resume as if there had been no interruption in the first place.

Code to reproduce the issue
Google Colab
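
For readers without the notebook open, here is a minimal sketch of the save/resume pattern being described; the model, data, and epoch counts are illustrative stand-ins, not the notebook's actual contents:

    import numpy as np
    import tensorflow as tf

    # Illustrative model and data; the real notebook is linked above.
    x = np.random.rand(32, 4).astype('float32')
    y = np.random.rand(32, 1).astype('float32')
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # First run: train to epoch 50, saving the full model (weights + optimizer state).
    checkpoint = tf.keras.callbacks.ModelCheckpoint('weights.{epoch:02d}.ckpt', verbose=1)
    model.fit(x, y, epochs=50, shuffle=False, callbacks=[checkpoint])

    # "Interrupted" run: reload the checkpoint and continue to epoch 100.
    restored = tf.keras.models.load_model('weights.50.ckpt')
    restored.fit(x, y, initial_epoch=50, epochs=100, shuffle=False)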

Other info / logs
No interruption:

Epoch 49/100
 - 0s - loss: 3.5190 - val_loss: 3.3597
Epoch 50/100
 - 0s - loss: 3.4090 - val_loss: 3.2668
Epoch 51/100
 - 0s - loss: 3.2637 - val_loss: 3.1623
Epoch 52/100
 - 0s - loss: 3.0962 - val_loss: 2.9975

With interruption:

Epoch 49/50
 - 0s - loss: 3.5190 - val_loss: 3.3597
Epoch 50/50
Epoch 00050: saving model to weights.50.ckpt
 - 0s - loss: 3.4090 - val_loss: 3.2668

... load_model('weights.50.ckpt') ...

Epoch 51/100
 - 0s - loss: 3.2637 - val_loss: 3.3816
Epoch 52/100
 - 0s - loss: 3.3175 - val_loss: 3.1457

The model does not have any random elements, so it looks like the optimizer state is lost on restore.

@jvishnuvardhan jvishnuvardhan self-assigned this Mar 25, 2019
@jvishnuvardhan jvishnuvardhan added comp:apis Highlevel API related issues type:bug Bug labels Mar 25, 2019
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 25, 2019
@allenlavoie (Member) commented:

Model.fit is taking random slices of the data and batching them together. If you control for that, e.g. make all of the examples identical, can you still reproduce?

lostmsu (Author) commented Mar 25, 2019

@allenlavoie the model will converge instantly if there is just one training sample.
I added shuffle=False to the .fit(...) calls; the issue persisted (notebook updated).

Also, this should not have changed anything at all, since the documentation says shuffle "Has no effect when steps_per_epoch is not None.", and steps_per_epoch was 1 in my original sample anyway. Indeed, the losses after this change are the same.
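
As a sketch of that control (shapes, values, and the epoch count are illustrative; steps_per_epoch=1 mirrors the original sample, and identical examples remove any data-dependent randomness):

    import numpy as np
    import tensorflow as tf

    # Identical examples so batch composition cannot matter.
    x = np.ones((8, 4), dtype='float32')
    y = np.ones((8, 1), dtype='float32')
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat()

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # shuffle is documented to have no effect once steps_per_epoch is set.
    model.fit(dataset, epochs=5, steps_per_epoch=1, shuffle=False)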

@allenlavoie (Member) commented:

Hrm. It looks like ModelCheckpoint is passing include_optimizer to Model.save, so the optimizer state should be saved with the checkpoint. @tanzhenyu, any ideas?
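
For context, include_optimizer=True is the default for Model.save; it writes the optimizer configuration and slot variables (e.g. Adam's moment estimates) alongside the model weights. A hedged sketch of checking that the optimizer state round-trips, using the optimizer get_weights() API of the TF 1.x / early 2.x versions discussed in this thread (model and file name are illustrative):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')
    model.fit(np.ones((8, 4), 'float32'), np.ones((8, 1), 'float32'), epochs=1)

    # include_optimizer=True is the default; written here for emphasis.
    model.save('model.h5', include_optimizer=True)
    restored = tf.keras.models.load_model('model.h5')

    # The restored optimizer should report the same state as the original.
    for a, b in zip(model.optimizer.get_weights(), restored.optimizer.get_weights()):
        print(np.allclose(a, b))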

@tanzhenyu (Contributor) commented:

@lostmsu The previous Keras optimizers are problematic. We have provided a new set of optimizers under exactly the same names, fully backward compatible for users. Unfortunately, the new ones did not make it into TF 1.13.1. If you do !pip install tf-nightly-gpu-2.0-preview, then you should see identical behavior.

That said, you will need to change one line from tf.set_random_seed(seed) to tf.random.set_seed(seed).
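
Concretely, the suggested change looks like this in the notebook (the seed value is a placeholder for whatever the notebook already uses):

    # In a Colab cell, install the TF 2.0 preview with the rewritten optimizers:
    # !pip install tf-nightly-gpu-2.0-preview

    import tensorflow as tf

    seed = 42  # placeholder; keep the notebook's existing seed value

    # TF 1.x:  tf.set_random_seed(seed)
    # TF 2.x:  the same function now lives under tf.random
    tf.random.set_seed(seed)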

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 26, 2019
lostmsu (Author) commented Mar 26, 2019

@tanzhenyu Confirming: this appears to be fixed in the 2.0 preview.

@lostmsu lostmsu closed this as completed Mar 26, 2019