Resuming training from a saved checkpoint produces different results than uninterrupted training #27049
Comments
Model.fit is taking random slices of the data and batching them together. If you control for that, e.g. make all of the examples identical, can you still reproduce?
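For instance, "controlling for that" could look something like the following sketch (the data, model, and parameter values here are illustrative, not from the reporter's notebook):

```python
import numpy as np
import tensorflow as tf

# Illustrative only: make every example identical and disable shuffling so
# that batch composition cannot be a source of run-to-run differences.
x = np.ones((256, 4), dtype=np.float32)
y = np.ones((256, 1), dtype=np.float32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=5, batch_size=32, shuffle=False)
```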
@allenlavoie the model will converge instantly if there's just one training sample. Also, this should not have changed anything at all, since …
Hrm. It looks like it's passing include_optimizer to Model.save. @tanzhenyu, any ideas?
@lostmsu The previous Keras optimizers are problematic. We have provided a new set of optimizers under exactly the same names that are fully backward compatible for users. Unfortunately, the new ones did not make it into TF 1.13.1. If you do !pip install tf-nightly-gpu-2.0-preview, you should see identical behavior. That said, you need to change one line from tf.set_random_seed(seed) to tf.random.set_seed(seed).
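Concretely, the one-line change amounts to something like this (a minimal sketch; `seed` stands for whatever value the notebook already uses):

```python
# TF 1.13.x API:
tf.set_random_seed(seed)

# TF 2.0 preview (after `pip install tf-nightly-gpu-2.0-preview`):
tf.random.set_seed(seed)
```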
@tanzhenyu Confirming, this seems to be fixed in the 2.0 preview.
System information
Describe the current behavior
Loading a model with tf.keras.models.load_model, produced with tf.keras.callbacks.ModelCheckpoint, and resuming training produces different results from running the training without a save model + restore model step in the middle.
Describe the expected behavior
Saving and restoring the model should allow training to resume as if there had been no interruption in the first place.
Code to reproduce the issue
Google Colab
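The linked Colab is the authoritative reproduction; the following is a reduced, illustrative sketch of the same pattern (the model, data, seed, and file names are assumptions, not the notebook's actual code):

```python
import numpy as np
import tensorflow as tf

seed = 42
tf.set_random_seed(seed)  # tf.random.set_seed(seed) on the 2.0 preview
np.random.seed(seed)

x = np.random.rand(1024, 8).astype(np.float32)
y = (x.sum(axis=1, keepdims=True) > 4).astype(np.float32)

def make_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss="binary_crossentropy")
    return model

# Uninterrupted baseline: 10 epochs straight through.
baseline = make_model()
baseline.save_weights("init.h5")  # reuse the same initial weights in both runs
baseline.fit(x, y, epochs=10, batch_size=32, shuffle=False, verbose=0)

# Interrupted run: 5 epochs, checkpoint via ModelCheckpoint, reload, 5 more.
interrupted = make_model()
interrupted.load_weights("init.h5")
interrupted.fit(
    x, y, epochs=5, batch_size=32, shuffle=False, verbose=0,
    callbacks=[tf.keras.callbacks.ModelCheckpoint("checkpoint.h5")],
)
restored = tf.keras.models.load_model("checkpoint.h5")
restored.fit(x, y, epochs=5, batch_size=32, shuffle=False, verbose=0)

# On TF 1.13.1 the two final losses diverge; on the 2.0 preview they match.
print(baseline.evaluate(x, y, verbose=0))
print(restored.evaluate(x, y, verbose=0))
```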
Other info / logs
No interruption:
With interruption:
The model does not have any random elements, so it looks like the optimizer state is being lost on save/restore.
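One way to check this (a hypothetical sketch, not code from the original Colab; all names and values are assumptions) is to compare the optimizer's slot variables before saving and after reloading:

```python
import numpy as np
import tensorflow as tf

# Hypothetical check: compare the optimizer's slot variables (e.g. Adam
# moments) before saving the model and after reloading it.
x = np.random.rand(64, 8).astype(np.float32)
y = np.random.rand(64, 1).astype(np.float32)

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer=tf.keras.optimizers.Adam(), loss="mse")
model.fit(x, y, epochs=1, verbose=0)

before = model.optimizer.get_weights()
model.save("model_with_optimizer.h5")  # include_optimizer=True by default

restored = tf.keras.models.load_model("model_with_optimizer.h5")
after = restored.optimizer.get_weights()

# If the optimizer state survived the round trip, every array should match.
print(len(before) == len(after) and
      all(np.array_equal(b, a) for b, a in zip(before, after)))
```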