
How to resume train.py from checkpoint files #8

Closed
yuanzhaozy opened this issue Sep 19, 2023 · 2 comments

Comments

@yuanzhaozy

I ran "python train.py", but the training job did not seem to finish normally. How can I resume the training process from a checkpoint file? Many thanks!

@zaixizhang
Owner

Hi, could you show the training error? You can reload the model from the latest checkpoint.

@yuanzhaozy
Author

Hi Zaixi, thanks for your reply. I have many checkpoint files in the ".log/train_model_2023_xx_xx__xx_xx_xx/checkpoints" directory. If I want to reload the model from the latest checkpoint, should I just modify the checkpoint parameter in the ./config/train_model.yml file, or are there other parameters that need to be changed as well? Also, what output file is produced when the training job finishes normally?

model:
  hidden_channels: 256
  random_alpha: False
  checkpoint: None
  refinement: True
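For reference, here is a minimal sketch of what reloading the latest checkpoint usually looks like in a PyTorch training script. The ".pt" extension and the "model"/"optimizer"/"iteration" keys are assumptions on my side, not necessarily what train.py saves, so the key names and file pattern may need to be adjusted to match the actual checkpoint format:

import glob
import os
import torch

def load_latest_checkpoint(ckpt_dir, model, optimizer=None, device="cpu"):
    """Restore model (and optionally optimizer) state from the newest checkpoint in ckpt_dir."""
    # Assumes checkpoints are saved as *.pt files; adjust the pattern if needed.
    ckpt_paths = sorted(glob.glob(os.path.join(ckpt_dir, "*.pt")), key=os.path.getmtime)
    if not ckpt_paths:
        raise FileNotFoundError(f"No checkpoint files found in {ckpt_dir}")
    latest = ckpt_paths[-1]
    ckpt = torch.load(latest, map_location=device)
    # Many training scripts store a dict with 'model' / 'optimizer' / 'iteration' keys;
    # these key names are hypothetical here and should match what train.py writes.
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and "optimizer" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("iteration", 0)

With something along these lines, resuming would amount to pointing the checkpoint field in ./config/train_model.yml at the latest checkpoint file instead of None; whether any other parameters need to change depends on how train.py consumes that field.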

As for the training error, the last few lines of the log.txt file are as follows:
[2023-09-20 17:14:15,012::train::INFO] [Train] Iter 403505 | Loss 0.499874 | Loss(Pred) 0.073919 | Loss(comb) 0.530171 | Loss(Focal) 0.120495 | Loss(Dm) 0.152404 | Loss(Tor) -0.404686 | Loss(SR) 0.027571 | Orig_grad_norm 7.849844
[2023-09-20 17:14:15,216::train::INFO] [Train] Iter 403506 | Loss 1.401024 | Loss(Pred) 0.652173 | Loss(comb) 0.669408 | Loss(Focal) 0.069054 | Loss(Dm) 0.134527 | Loss(Tor) -0.171648 | Loss(SR) 0.047510 | Orig_grad_norm 24.358858
[2023-09-20 17:14:15,410::train::INFO] [Train] Iter 403507 | Loss 3.349940 | Loss(Pred) 2.343825 | Loss(comb) 0.722585 | Loss(Focal) 0.113704 | Loss(Dm) 0.483718 | Loss(Tor) -0.346563 | Loss(SR) 0.032671 | Orig_grad_norm 53.380486
[2023-09-20 17:14:15,599::train::INFO] [Train] Iter 403508 | Loss 1.348553 | Loss(Pred) 0.644946 | Loss(comb) 0.499146 | Loss(Focal) 0.110188 | Loss(Dm) 0.195771 | Loss(Tor) -0.131030 | Loss(SR) 0.029531 | Orig_grad_norm 15.907043

It looks like an accidental interruption.

By the way, could you give an estimated training time and the hardware it was measured on? I would also really appreciate it if you could provide the pre-trained model. Many thanks.
