Restoring from checkpoints is broken in TF 1.13.1 #27937
The output of the model is the correctly restored output, even though the variables didn't exist yet when the restore was requested. Is that clear? Happy to hear ideas for better documentation. It's a somewhat tricky API, I know, but restore-on-create is a bit of a trilemma: we either need to require input shapes at Layer construction so we can create variables immediately, or we can do deferred restoration, or we can require symbolic construction of the computation first (the TF 1.x approach), which gives us enough information to create the variables. We decided not to take the first path since it's annoying to have to specify, and we turned on eager by default so the third path isn't available (although you can optionally specify an input_shape to the first Layer in Sequential and it'll build everything right away).
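For illustration, a minimal sketch of the difference between the two build behaviours (layer and input sizes here are arbitrary):

```python
import tensorflow as tf

tf.enable_eager_execution()

# Without an input shape, variables don't exist until the first call,
# so a restore requested before that call has to be deferred.
deferred = tf.keras.Sequential([tf.keras.layers.Dense(1)])
print(len(deferred.variables))  # 0 -- nothing to restore into yet

# With input_shape on the first layer, Sequential builds immediately,
# so variables exist (and can be restored) at construction time.
built = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
print(len(built.variables))  # 2 -- kernel and bias already created
```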
Thanks for the quick response. I understand what you are saying, and I think this should be given a line or two of explanation in the TensorFlow Eager tutorial. Since the tutorial uses a plain variable, which doesn't require a shape, this problem doesn't show up until you use the API in practice. It could definitely use more visibility. Also, given that the weights aren't restored until model input is provided, I need to compare weights after restoration and input but before the gradient step, correct? At that point they should have taken on the restored values, so I can assert that they are equal.
Thanks, yes, it sounds like the eager guide should have a blurb about deferred restoration and a reference to the checkpointing guide. The main reason it doesn't is presumably the order in which they were written. On checking that values are restored, yes, that makes sense to me. Something like this:
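A minimal sketch of that check, assuming a one-layer Sequential model and a /tmp checkpoint path (both are placeholder choices):

```python
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

def make_model():
    return tf.keras.Sequential([tf.keras.layers.Dense(1)])

x = tf.ones([1, 3])

# Build a model and remember its current variable values.
model = make_model()
model(x)  # the first call creates the variables
saved_values = [v.numpy() for v in model.variables]
path = tf.train.Checkpoint(model=model).save('/tmp/ckpt/demo')

# Fresh model: the restore is requested before any variables exist.
fresh = make_model()
status = tf.train.Checkpoint(model=fresh).restore(path)

# The forward pass creates the variables; restore-on-create fills them in.
fresh(x)

# After input but before any gradient step, values should match the checkpoint.
for saved, restored in zip(saved_values, fresh.variables):
    np.testing.assert_allclose(saved, restored.numpy())
```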
You could also directly check the optimizer's slot variables rather than running two steps and checking both.
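For reference, a sketch of reading those slot variables directly, assuming an AdamOptimizer (whose `'m'` and `'v'` slots are its moment accumulators); the model and step here are placeholders:

```python
import tensorflow as tf

tf.enable_eager_execution()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
optimizer = tf.train.AdamOptimizer(0.01)

# One gradient step, so that Adam creates its slot variables.
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(model(tf.ones([1, 3])) ** 2)
grads = tape.gradient(loss, model.variables)
optimizer.apply_gradients(zip(grads, model.variables))

# The slots now exist and can be compared against checkpointed values.
for var in model.variables:
    print(var.name, optimizer.get_slot(var, 'm').numpy().mean())
```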
After running into an issue where I did not know that my model was not being loaded, I think that this kind of silent failure should be surfaced loudly. As an example, in my case, my model was not loading the weights but it wasn't telling me, and the only check I was doing was that the weights weren't empty. Well, I found out that the weights were just random inits, and since the loading error is not fatal, I was debugging for hours why my model was not performing properly. This is especially weird because I was reloading the model in the same file with no changes being made to it. It also passed the checks I had in place.
Looks like something to do with the optimizer, but it should tell me when this happens, because otherwise (until this experience) I would have just kept assuming it was working. Also, the weirdest part was that some other variables, such as epoch (an int), were being restored properly while this wasn't, so the behaviour is also inconsistent across what is being restored.
When should it warn, though, given that a deferred restore may still complete later? Can you share a reproduction for the unchanged-file issue you ran into? That sounds like a bug.
Or how does this sound? We can print a warning on program exit by default if a checkpoint was partially loaded. Status objects will have an "allow_partial" which silences the warning.
I think that's a good middle-of-the-road solution, as it provides relevant information to the user while also not causing a fatal error, which might be undesirable for some. As for the potential bug, I'll try to create an MVP of it later this week, as it's in a fairly complicated system.
…ores
Will eventually (on __del__, so maybe at program shutdown) complain about values in the checkpoint which weren't used with restore-on-create. Adds an expect_partial() method to status objects to silence these warnings for the case where a partial restore was intended. Following up on #27937
PiperOrigin-RevId: 245992963
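A sketch of the resulting API, with a placeholder model and checkpoint path:

```python
import tensorflow as tf

tf.enable_eager_execution()

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
checkpoint = tf.train.Checkpoint(model=model)

# When a partial restore is intended, expect_partial() keeps the
# shutdown-time warning about unused checkpoint values quiet.
status = checkpoint.restore('/tmp/ckpt/demo-1')  # placeholder path
status.expect_partial()
```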
Thanks for the feedback. We have a warning for partial checkpoint restores now (in the latest nightly), and the eager guide now mentions deferred restoration and points to the checkpoint guide.
System information
Describe the current behavior
I am unable to restore the weights of any of my tf.keras models, but ONLY when restoring into a newly initialized model. If I change the weights and then restore without reinitializing the model, the restore works properly. Furthermore, a SILENT error is thrown when this happens; I only see it if I print the status of the restore myself.
Describe the expected behavior
The weights should restore without running into an error. And if an error does occur, it should be logged without me having to print it myself.
Code to reproduce the issue
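A minimal sketch of the reported failure mode, assuming an Adam optimizer and a /tmp checkpoint path (model sizes, optimizer, and paths are placeholder choices):

```python
import numpy as np
import tensorflow as tf

tf.enable_eager_execution()

def make_model():
    return tf.keras.Sequential([tf.keras.layers.Dense(4, activation='relu'),
                                tf.keras.layers.Dense(1)])

x = tf.random_normal([8, 3])
y = tf.random_normal([8, 1])

# Train briefly, then save model and optimizer state together.
model = make_model()
optimizer = tf.train.AdamOptimizer(0.01)
for _ in range(3):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(model(x) - y))
    grads = tape.gradient(loss, model.variables)
    optimizer.apply_gradients(zip(grads, model.variables))
trained = [v.numpy() for v in model.variables]
path = tf.train.Checkpoint(model=model, optimizer=optimizer).save('/tmp/repro/ckpt')

# Restore into a freshly initialized model: the failing case.
fresh = make_model()
status = tf.train.Checkpoint(model=fresh,
                             optimizer=tf.train.AdamOptimizer(0.01)).restore(path)
fresh(x)  # variables are created here; deferred restore should fill them in

print(status)  # printing the status was the only visible hint of a problem
for a, b in zip(trained, [v.numpy() for v in fresh.variables]):
    np.testing.assert_allclose(a, b)  # fails if weights were silently skipped
```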
Other info / logs