Resuming training from saved checkpoint produces different result than uninterrupted training #27049

Closed
lostmsu opened this issue Mar 22, 2019 · 6 comments
Assignees: jvishnuvardhan
Labels: comp:apis (Highlevel API related issues), type:bug (Bug)

Comments

lostmsu commented Mar 22, 2019

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (Google Colab)
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Google Colab
  • TensorFlow installed from (source or binary): Google Colab
  • TensorFlow version: 1.13.1
  • Python version: Python 3

Describe the current behavior
Loading a model saved by tf.keras.callbacks.ModelCheckpoint with tf.keras.models.load_model and then resuming training produces different results than running the same training uninterrupted (i.e. without the save/restore step in the middle).

Describe the expected behavior
Saving and restoring the model should allow training to resume as if there had been no interruption in the first place.

Code to reproduce the issue
Google Colab
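
For readers without the notebook open, here is a minimal sketch of the save/resume pattern being described; the model, data, and epoch counts are illustrative stand-ins, not the notebook's actual contents:

    import numpy as np
    import tensorflow as tf

    # Illustrative model and data; the real notebook is linked above.
    x = np.random.rand(32, 4).astype('float32')
    y = np.random.rand(32, 1).astype('float32')
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # First run: train to epoch 50, saving the full model (weights + optimizer state).
    checkpoint = tf.keras.callbacks.ModelCheckpoint('weights.{epoch:02d}.ckpt', verbose=1)
    model.fit(x, y, epochs=50, shuffle=False, callbacks=[checkpoint])

    # "Interrupted" run: reload the checkpoint and continue to epoch 100.
    restored = tf.keras.models.load_model('weights.50.ckpt')
    restored.fit(x, y, initial_epoch=50, epochs=100, shuffle=False)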

Other info / logs
No interruption:

Epoch 49/100
 - 0s - loss: 3.5190 - val_loss: 3.3597
Epoch 50/100
 - 0s - loss: 3.4090 - val_loss: 3.2668
Epoch 51/100
 - 0s - loss: 3.2637 - val_loss: 3.1623
Epoch 52/100
 - 0s - loss: 3.0962 - val_loss: 2.9975

With interruption:

Epoch 49/50
 - 0s - loss: 3.5190 - val_loss: 3.3597
Epoch 50/50
Epoch 00050: saving model to weights.50.ckpt
 - 0s - loss: 3.4090 - val_loss: 3.2668

... load_model('weights.50.ckpt') ...

Epoch 51/100
 - 0s - loss: 3.2637 - val_loss: 3.3816
Epoch 52/100
 - 0s - loss: 3.3175 - val_loss: 3.1457

The model does not have any random elements, so it looks like the optimizer state is lost on restore.

@jvishnuvardhan jvishnuvardhan self-assigned this Mar 25, 2019
@jvishnuvardhan jvishnuvardhan added comp:apis Highlevel API related issues type:bug Bug labels Mar 25, 2019
@jvishnuvardhan jvishnuvardhan added the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 25, 2019
@allenlavoie (Member) commented:

Model.fit is taking random slices of the data and batching them together. If you control for that, e.g. make all of the examples identical, can you still reproduce?

lostmsu (Author) commented Mar 25, 2019

@allenlavoie the model will converge instantly if there is just one training sample.
I added shuffle=False to the .fit(...) calls; the issue persisted (notebook updated).

Also, this should not have changed anything at all, since the documentation says shuffle "Has no effect when steps_per_epoch is not None.", and steps_per_epoch was 1 in my original sample anyway. Indeed, the losses after this change are the same.
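
As a sketch of that control (shapes, values, and the epoch count are illustrative; steps_per_epoch=1 mirrors the original sample, and identical examples remove any data-dependent randomness):

    import numpy as np
    import tensorflow as tf

    # Identical examples so batch composition cannot matter.
    x = np.ones((8, 4), dtype='float32')
    y = np.ones((8, 1), dtype='float32')
    dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(8).repeat()

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')

    # shuffle is documented to have no effect once steps_per_epoch is set.
    model.fit(dataset, epochs=5, steps_per_epoch=1, shuffle=False)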

@allenlavoie (Member) commented:

Hrm. It looks like ModelCheckpoint is passing include_optimizer to Model.save, so the optimizer state should be saved with the checkpoint. @tanzhenyu, any ideas?
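
For context, include_optimizer=True is the default for Model.save; it writes the optimizer configuration and slot variables (e.g. Adam's moment estimates) alongside the model weights. A hedged sketch of checking that the optimizer state round-trips, using the optimizer get_weights() API of the TF 1.x / early 2.x versions discussed in this thread (model and file name are illustrative):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer='adam', loss='mse')
    model.fit(np.ones((8, 4), 'float32'), np.ones((8, 1), 'float32'), epochs=1)

    # include_optimizer=True is the default; written here for emphasis.
    model.save('model.h5', include_optimizer=True)
    restored = tf.keras.models.load_model('model.h5')

    # The restored optimizer should report the same state as the original.
    for a, b in zip(model.optimizer.get_weights(), restored.optimizer.get_weights()):
        print(np.allclose(a, b))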

@tanzhenyu (Contributor) commented:

@lostmsu The previous Keras optimizers are problematic. We have provided a new set of optimizers under exactly the same names, fully backward compatible for users. Unfortunately, the new ones did not make it into TF 1.13.1. If you do !pip install tf-nightly-gpu-2.0-preview, then you should see identical behavior.

That said, you will need to change one line from tf.set_random_seed(seed) to tf.random.set_seed(seed).
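
Concretely, the suggested change looks like this in the notebook (the seed value is a placeholder for whatever the notebook already uses):

    # In a Colab cell, install the TF 2.0 preview with the rewritten optimizers:
    # !pip install tf-nightly-gpu-2.0-preview

    import tensorflow as tf

    seed = 42  # placeholder; keep the notebook's existing seed value

    # TF 1.x:  tf.set_random_seed(seed)
    # TF 2.x:  the same function now lives under tf.random
    tf.random.set_seed(seed)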

@tensorflowbutler tensorflowbutler removed the stat:awaiting tensorflower Status - Awaiting response from tensorflower label Mar 26, 2019
lostmsu (Author) commented Mar 26, 2019

@tanzhenyu Confirming: this appears to be fixed in the 2.0 preview.

@lostmsu lostmsu closed this as completed Mar 26, 2019