Recover callback history when training is interrupted #50516
Comments
@dennymarcels Thanks for creating the issue. Keras has moved to a new repository, https://github.com/keras-team/keras/issues, which is dedicated to Keras development. The Keras team announced the move earlier in the TF forum. The link is given below. Regarding this feature, you could write a custom callback to checkpoint and recover at any time. For example, you can use Please let me know what you think. Also, please provide a use case for further discussion. Thanks!
Hey @jvishnuvardhan, thank you for replying! I don't think writing a custom callback would work, because my model is itself customized and I can only save its weights. I can certainly load the weights if training is interrupted, but all callbacks will be reset, meaning none of them knows what the best loss so far was, or in which epoch it occurred. Also, if you feel it is more appropriate and you have the power to do so, would you mind moving this request to the Keras repository?
@dennymarcels When you write a custom callback, you can inherit from the ModelCheckpoint callback and write the weights based on the performance of a metric. Please check this example. When you want to load weights, you can choose the weights of the last iteration/batch/epoch or the best weights, as mentioned in this TF tutorial. If this is still an issue, please open in As the entire Keras team is focused on that repo, you would get a faster response and resolution. Thanks!
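The gap discussed in the two comments above (the weights survive an interruption, but the callback's notion of "best so far" does not) can be illustrated with a minimal sketch. It is not the Keras API: `ResumableBestTracker` and the JSON sidecar file are hypothetical names, and the class is plain Python that only mirrors the `on_epoch_end(epoch, logs)` hook. In real use, the same logic would presumably live in a subclass of the checkpoint callback, with the weights saved whenever the method reports an improvement.

```python
import json
import os


class ResumableBestTracker:
    """Tracks the best monitored value and persists it to a JSON sidecar file.

    Hypothetical helper: the method name mirrors the Keras
    on_epoch_end(epoch, logs) hook, but this class is plain Python.
    """

    def __init__(self, state_path, monitor="val_loss"):
        self.state_path = state_path
        self.monitor = monitor
        self.best = float("inf")  # assumes a lower value is better
        self.best_epoch = None
        # Restore earlier state if a previous run left a sidecar file behind.
        if os.path.exists(state_path):
            with open(state_path) as f:
                state = json.load(f)
            self.best = state["best"]
            self.best_epoch = state["best_epoch"]

    def on_epoch_end(self, epoch, logs=None):
        """Return True when the monitored value improved."""
        current = (logs or {}).get(self.monitor)
        if current is None or current >= self.best:
            return False
        self.best, self.best_epoch = current, epoch
        # Write to a temporary file and rename it, so an interruption
        # cannot leave a half-written state file.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"best": self.best, "best_epoch": self.best_epoch}, f)
        os.replace(tmp_path, self.state_path)
        return True
```

A tracker created after a restart with the same `state_path` starts from the stored best value and epoch rather than from infinity, so a worse epoch after the restart is not mistaken for an improvement.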
I did post an issue there, and while writing it I realized I had not been that clear here. Thank you nonetheless. Closing.
I could not find a solution for my need, so I believe this should be implemented as a feature.
System information
Describe the feature and the current behavior/state.
I was wondering if it would be possible to checkpoint and recover the callback history, so that my callbacks can continue whatever they were tracking when training is interrupted for any reason.
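As a rough illustration of what "checkpoint and recover the callback history" could mean in practice, the sketch below appends one record per finished epoch to a file and reloads the records after a restart. `EpochHistoryLog` and its methods are hypothetical names, not part of Keras, and the sketch assumes the `logs` values are plain Python numbers that `json` can serialize.

```python
import json
import os


class EpochHistoryLog:
    """Appends one JSON line per finished epoch and can reload them later.

    Hypothetical helper, written in plain Python for illustration.
    """

    def __init__(self, path):
        self.path = path

    def on_epoch_end(self, epoch, logs=None):
        # Assumes logs holds plain Python numbers (JSON-serializable).
        record = {"epoch": epoch, **(logs or {})}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def load(self):
        """Return every recorded epoch, oldest first."""
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f if line.strip()]

    def next_epoch(self):
        """Return the zero-based epoch index at which training should resume."""
        history = self.load()
        return history[-1]["epoch"] + 1 if history else 0
```

After an interruption, a new `EpochHistoryLog` pointed at the same file recovers the full per-epoch history, and the value from `next_epoch()` is the kind of number one would pass as the `initial_epoch` argument of `Model.fit` so that epoch counting continues where it stopped.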
Will this change the current api? How?
I don't know.
Who will benefit with this feature?
Whoever performs long training sessions.
Any Other info.
NA