Resumable trainer? #349

Closed
StephenArk30 opened this issue Apr 21, 2021 · 2 comments · Fixed by #350
Labels: enhancement (Feature that is not a new algorithm or an algorithm enhancement)

Comments

@StephenArk30 (Contributor) commented Apr 21, 2021

I wonder if it is a common demand to save training progress and resume it. In my case, I was bothered by an unstable server that often shut down and wasted my training runs. I made some changes to the trainer and, if this is a common demand, I will create a PR. But a discussion is needed to decide what params should be added to the trainer, like begin_step, last_gradient_step, etc. Should we read them from the log, or from a slot function that users implement (like load_fn)?

Also, what data should be saved as the progress? Buffer, policy, and what else?

@Trinkle23897 (Collaborator) commented Apr 21, 2021

> I wonder if it is a common demand to save training progress and resume it.

Definitely worth supporting it!

> an unstable server that often shut down and wasted my training runs.

Because of a kill event?

> Also, what data should be saved as the progress? Buffer, policy, and what else?

  • optim status (if we call torch.save(policy), is the optim also saved into the .pth file? not sure)
  • policy parameters
  • logger
  • buffer
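
A minimal sketch of what such a checkpoint could look like, assuming the usual state_dict pattern for the policy and optimizer and plain pickling for the buffer (the function name, file layout, and key names are illustrative, not an existing Tianshou API):

```python
import pickle
import torch

# Illustrative checkpoint writer: everything the list above mentions in one place.
def save_checkpoint(path, policy, optim, buffer, env_step, gradient_step):
    torch.save({
        "policy": policy.state_dict(),   # policy parameters
        "optim": optim.state_dict(),     # optimizer status
        "env_step": env_step,            # logger counters needed to resume
        "gradient_step": gradient_step,
    }, path)
    # The replay buffer can be large, so keep it in a separate file.
    with open(path + ".buffer.pkl", "wb") as f:
        pickle.dump(buffer, f)
```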

> But a discussion is needed to decide what params should be added to the trainer, like begin_step, last_gradient_step, etc. Should we read them from the log, or from a slot function that users implement (like load_fn)?

I don't think we need a large amount of extra params -- loading everything before trainer init would be fine, like the current approach for policy load/save. And we can get env_step/gradient_step from the logger.
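
As a sketch of that "load everything before trainer init" flow, with illustrative names matching the checkpoint writer above:

```python
import pickle
import torch

# Illustrative restore: run this before constructing the trainer, mirroring
# the current policy load/save approach; the counters go back into the logger.
def restore_checkpoint(path, policy, optim):
    ckpt = torch.load(path, map_location="cpu")
    policy.load_state_dict(ckpt["policy"])
    optim.load_state_dict(ckpt["optim"])
    with open(path + ".buffer.pkl", "rb") as f:
        buffer = pickle.load(f)
    return buffer, ckpt["env_step"], ckpt["gradient_step"]
```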

My current thought is that we can modify the trainer logic in test_episode: save (logger/buffer/policy) first and then go on testing. And sure, adding something like save_all_every_epoch: bool = False would be fine.
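
Hypothetical usage if the trainer grew such a flag (save_all_every_epoch is only the name proposed above, not an existing trainer parameter, and the other arguments are abbreviated boilerplate):

```python
from tianshou.trainer import offpolicy_trainer

# save_all_every_epoch is the flag proposed in this thread, not a finalized API.
result = offpolicy_trainer(
    policy,
    train_collector,
    test_collector,
    max_epoch=100,
    step_per_epoch=10000,
    step_per_collect=10,
    episode_per_test=10,
    batch_size=64,
    save_all_every_epoch=True,  # checkpoint logger/buffer/policy every epoch
)
```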

Trinkle23897 added the enhancement label on Apr 21, 2021
@StephenArk30 (Contributor, Author) commented:

PR here @Trinkle23897, please take a look.

Trinkle23897 linked a pull request on Apr 23, 2021 that will close this issue.