
How to resume train.py from checkpoint files #8

Closed
yuanzhaozy opened this issue Sep 19, 2023 · 2 comments

Comments

@yuanzhaozy

I ran "python train.py", but the training job did not seem to finish normally. How can I resume the training process from a checkpoint file? Many thanks!

@zaixizhang
Owner

Hi, could you show the training error? You can reload the model from the latest checkpoint.

@yuanzhaozy
Author

Hi Zaixi, thanks for your reply. I have many checkpoint files in the ".log/train_model_2023_xx_xx__xx_xx_xx/checkpoints" directory. If I want to reload the model from the latest checkpoint, should I just modify the checkpoint parameter in the ./config/train_model.yml file, or are there other parameters that need to be changed as well? Also, what output file is produced when the training job finishes normally?

model:
  hidden_channels: 256
  random_alpha: False
  checkpoint: None
  refinement: True
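For reference, here is a minimal sketch of what reloading the latest checkpoint usually looks like in a PyTorch training script. The ".pt" extension and the "model"/"optimizer"/"iteration" keys are assumptions on my side, not necessarily what train.py saves, so the key names and file pattern may need to be adjusted to match the actual checkpoint format:

import glob
import os
import torch

def load_latest_checkpoint(ckpt_dir, model, optimizer=None, device="cpu"):
    """Restore model (and optionally optimizer) state from the newest checkpoint in ckpt_dir."""
    # Assumes checkpoints are saved as *.pt files; adjust the pattern if needed.
    ckpt_paths = sorted(glob.glob(os.path.join(ckpt_dir, "*.pt")), key=os.path.getmtime)
    if not ckpt_paths:
        raise FileNotFoundError(f"No checkpoint files found in {ckpt_dir}")
    latest = ckpt_paths[-1]
    ckpt = torch.load(latest, map_location=device)
    # Many training scripts store a dict with 'model' / 'optimizer' / 'iteration' keys;
    # these key names are hypothetical here and should match what train.py writes.
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and "optimizer" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("iteration", 0)

With something along these lines, resuming would amount to pointing the checkpoint field in ./config/train_model.yml at the latest checkpoint file instead of None; whether any other parameters need to change depends on how train.py consumes that field.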

As for the training error, the last few lines of the log.txt file are as follows:
[2023-09-20 17:14:15,012::train::INFO] [Train] Iter 403505 | Loss 0.499874 | Loss(Pred) 0.073919 | Loss(comb) 0.530171 | Loss(Focal) 0.120495 | Loss(Dm) 0.152404 | Loss(Tor) -0.404686 | Loss(SR) 0.027571 | Orig_grad_norm 7.849844
[2023-09-20 17:14:15,216::train::INFO] [Train] Iter 403506 | Loss 1.401024 | Loss(Pred) 0.652173 | Loss(comb) 0.669408 | Loss(Focal) 0.069054 | Loss(Dm) 0.134527 | Loss(Tor) -0.171648 | Loss(SR) 0.047510 | Orig_grad_norm 24.358858
[2023-09-20 17:14:15,410::train::INFO] [Train] Iter 403507 | Loss 3.349940 | Loss(Pred) 2.343825 | Loss(comb) 0.722585 | Loss(Focal) 0.113704 | Loss(Dm) 0.483718 | Loss(Tor) -0.346563 | Loss(SR) 0.032671 | Orig_grad_norm 53.380486
[2023-09-20 17:14:15,599::train::INFO] [Train] Iter 403508 | Loss 1.348553 | Loss(Pred) 0.644946 | Loss(comb) 0.499146 | Loss(Focal) 0.110188 | Loss(Dm) 0.195771 | Loss(Tor) -0.131030 | Loss(SR) 0.029531 | Orig_grad_norm 15.907043

It looks like an accidental interruption.

By the way, could you give an estimated training time and the hardware it was measured on? I would also really appreciate it if you could provide the pre-trained model. Many thanks.
