Recover callback history when training is interrupted #50516

Closed
dennymarcels opened this issue Jun 29, 2021 · 4 comments
Labels: comp:keras (Keras related issues), stat:awaiting response (Status - Awaiting response from author), type:feature (Feature requests)

Comments

@dennymarcels

I could not find an existing solution for this need, so I believe it should be implemented as a feature.

System information

  • TensorFlow version (you are using): 2
  • Are you willing to contribute it (Yes/No): Yes

Describe the feature and the current behavior/state.
I was wondering if it would be possible to checkpoint and recover the callback history, so that my callbacks can resume whatever they were tracking after training is interrupted for any reason.

Will this change the current api? How?
I don't know.

Who will benefit with this feature?
Whoever performs long training sessions.

Any Other info.
NA

@dennymarcels dennymarcels added the type:feature Feature requests label Jun 29, 2021
@saikumarchalla saikumarchalla added the comp:keras Keras related issues label Jul 1, 2021
@jvishnuvardhan
Contributor

@dennymarcels Thanks for creating the issue. Keras has moved to a new repository, https://github.com/keras-team/keras/issues, which is dedicated to Keras development. The Keras team announced the move earlier on the TF forum; the link is given below.

https://discuss.tensorflow.org/t/keras-project-moved-to-new-repository-in-https-github-com-keras-team-keras/1999

Regarding this feature, you could write a custom callback to checkpoint and recover at any time. For example, you can use on_train_begin to checkpoint at the start of training, and the other available methods to checkpoint at different points during model training/testing, as described here: https://keras.io/guides/writing_your_own_callbacks/.
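
As a rough illustration of the custom-callback idea, here is a minimal sketch that saves a callback's own tracking state to a JSON file after every epoch and reloads it in on_train_begin. The class name `ResumableBestTracker`, the `state_path` argument, and the JSON layout are hypothetical and not part of Keras. The sketch assumes the monitored metric is a loss (lower is better), and it falls back to a plain `object` base class only so that it can be exercised without TensorFlow installed.

```python
import json
import os

try:
    # Use the real Keras base class when TensorFlow is installed.
    from tensorflow import keras
    _Base = keras.callbacks.Callback
except ImportError:
    # Fallback so the sketch can be exercised without TensorFlow.
    _Base = object


class ResumableBestTracker(_Base):
    """Hypothetical callback that persists its tracking state to a JSON file."""

    def __init__(self, state_path, monitor="loss"):
        super().__init__()
        self.state_path = state_path  # assumed location of the state file
        self.monitor = monitor        # key looked up in the epoch logs
        self.best = float("inf")      # assumes lower is better
        self.best_epoch = None
        self.last_epoch = -1

    def on_train_begin(self, logs=None):
        # Restore state left behind by an earlier, interrupted run, if any.
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                state = json.load(f)
            self.best = state["best"]
            self.best_epoch = state["best_epoch"]
            self.last_epoch = state["last_epoch"]

    def on_epoch_end(self, epoch, logs=None):
        self.last_epoch = epoch
        value = (logs or {}).get(self.monitor)
        if value is not None and float(value) < self.best:
            self.best = float(value)
            self.best_epoch = epoch
        # Write to a temporary file first so an interruption during the
        # write does not leave a truncated state file behind.
        tmp_path = self.state_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(
                {
                    "best": self.best,
                    "best_epoch": self.best_epoch,
                    "last_epoch": self.last_epoch,
                },
                f,
            )
        os.replace(tmp_path, self.state_path)
```

A new instance pointed at the same `state_path` picks up the previous best value and epoch when training starts again. The model weights would still need to be restored separately, for example with `model.load_weights`.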

Please let me know what you think. Also, please share a use case for further discussion. Thanks!

@jvishnuvardhan jvishnuvardhan added the stat:awaiting response Status - Awaiting response from author label Jul 1, 2021
@dennymarcels
Author

dennymarcels commented Jul 2, 2021

Hey @jvishnuvardhan thank you for replying!

I don't think writing a custom callback would work, because my model is itself customized and I can only save its weights. I can certainly load the weights after training is interrupted, but all callbacks will be reset, so none of them knows the best loss so far or the epoch in which it occurred.

Also, if you feel that is more appropriate and you have the power to do so, would you mind moving this request to the Keras repository?

@jvishnuvardhan
Contributor

@dennymarcels When you write a custom callback, you can inherit from the ModelCheckpoint callback and write the weights based on the performance of a metric. Please check this example. When you want to load weights, you can choose the weights of the last iteration/batch/epoch or the best weights, as mentioned in this TF tutorial.

If this is still an issue, please open it in keras-team/keras. I am not able to move the issue to that repo because it is not part of the tensorflow/tensorflow repository and I don't have permission. It is easy for you to open it there and reference this issue. Thanks!

As the entire Keras team is focused on that repo, you would get a faster response and resolution. Thanks!

@dennymarcels
Author

I did post an issue there, and while writing it I figured I was not that clear here. Thank you nonetheless. Closing.
