Resume Training? #13

WilliamLwj · 2020-05-11T14:24:09Z

Hi, I am wondering whether it is possible to resume training from the saved checkpoint. Based on the code, I think I just need to redefine the scheduler myself. Is there anything you think I missed?

Thank you so much for your code btw.

rdspring1 · 2020-05-11T15:14:37Z

First, you need to save the optimizer state along with the model. You also need to compute the step variable, which is the current iteration of the training process:

    step = train_corpus.batch_num * epoch + current_batch

    torch.save({
            'epoch': epoch,
            'step': step,
            'model_state_dict': model.state_dict(),
            'optimizer_state_dict': optimizer.state_dict(),
            }, os.path.join(args.save, "optim_model.pt"))
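
To make the step formula concrete, here is a small sketch in plain Python. The helper names `global_step` and `split_step` are hypothetical (they are not part of this repository), and `batch_num` stands in for `train_corpus.batch_num`. It assumes `epoch` and `current_batch` are both zero-based.

```python
def global_step(epoch, current_batch, batch_num):
    """Iteration index across the whole run (zero-based epoch and batch assumed)."""
    return batch_num * epoch + current_batch


def split_step(step, batch_num):
    """Inverse of global_step: recover (epoch, current_batch) from a step."""
    return divmod(step, batch_num)


# Example with made-up numbers: 100 batches per epoch, batch 25 of epoch 3.
step = global_step(epoch=3, current_batch=25, batch_num=100)
print(step)                   # 325
print(split_step(step, 100))  # (3, 25)
```

Storing `step` in the checkpoint means the resumed run does not have to recompute it, but the inverse is handy for checking that the saved `epoch` and `step` agree.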

When resuming:
Note: You need to move the model to the GPU after loading from disk.

    if args.resume:
        print("Loading model from checkpoint")
        sys.stdout.flush()
        # Load the checkpoint written by the torch.save call above (same filename).
        checkpoint = torch.load(os.path.join(args.save, "optim_model.pt"), map_location="cpu")
        model.load_state_dict(checkpoint['model_state_dict'])
        optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        # Continue with the epoch after the last one saved.
        start_epoch = checkpoint['epoch'] + 1
        step = checkpoint['step']

You also need to modify the learning rate scheduler. Change last_iter from -1 to the current step minus 1:

    scheduler = LinearLR(optimizer, base_lr=args.lr*args.scale, max_iters=train_corpus.batch_num*args.epochs, last_iter=checkpoint['step']-1, min_lr=1e-8)
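
To see why `last_iter` is set to one less than the current step, here is a minimal sketch with a stand-in scheduler. `LinearDecay` is hypothetical and is not the repository's `LinearLR`. It only assumes that each call to `step()` first increments `last_iter` and then derives the learning rate from it, decaying linearly from `base_lr` toward `min_lr`. It also assumes the saved step is the index of the next iteration to run.

```python
class LinearDecay:
    """Hypothetical stand-in for a linear LR schedule (not the repo's LinearLR)."""

    def __init__(self, base_lr, max_iters, last_iter=-1, min_lr=1e-8):
        self.base_lr = base_lr
        self.max_iters = max_iters
        self.last_iter = last_iter
        self.min_lr = min_lr

    def step(self):
        # Advance one iteration, then compute the learning rate for it.
        self.last_iter += 1
        remaining = 1.0 - self.last_iter / self.max_iters
        return max(self.min_lr, self.base_lr * remaining)


# Uninterrupted run: 10 iterations starting from last_iter=-1.
full = LinearDecay(base_lr=0.1, max_iters=10)
full_lrs = [full.step() for _ in range(10)]

# Resumed run: the checkpoint says the next iteration is step 4.
resume_step = 4
resumed = LinearDecay(base_lr=0.1, max_iters=10, last_iter=resume_step - 1)
resumed_lrs = [resumed.step() for _ in range(10 - resume_step)]

# The resumed schedule matches the tail of the uninterrupted one.
print(resumed_lrs == full_lrs[resume_step:])  # True
```

Under these assumptions, leaving `last_iter` at -1 would restart the decay from `base_lr` instead of continuing it.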

There are also some minor variable tweaks, but this is the general setup.

WilliamLwj · 2020-05-11T16:29:10Z

Thank you so much!!

WilliamLwj closed this as completed May 11, 2020
