
About resume #37

Closed
philokey opened this issue May 3, 2018 · 17 comments

Comments

@philokey

philokey commented May 3, 2018

Hello, I tried to resume training with this command:

 python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_faster_rcnn_R-101-FPN_1x.yaml --use_tfboard --load_ckpt  Outputs/e2e_faster_rcnn_R-101-FPN_1x/May02-12-15-12_faster_step/ckpt/model_step69999.pth --resume

However, it throws a runtime error:

Traceback (most recent call last):
  File "tools/train_net_step.py", line 367, in main
    optimizer.step()
  File "/home/philokey/.virtualenvs/py3/lib/python3.5/site-packages/torch/optim/sgd.py", line 94, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:271

How can I solve this problem?

@roytseng-tw
Owner

Have you changed anything? I can't reproduce this error myself.

@philokey
Author

philokey commented May 4, 2018

No. I just use res101 instead of res50, and I use 4 GPUs.

@roytseng-tw
Owner

roytseng-tw commented May 4, 2018

I tested with res101 and everything works fine. BTW, what's your PyTorch version?

@philokey
Author

philokey commented May 4, 2018

That's strange. My PyTorch version is 0.3.1.

@roytseng-tw
Owner

Maybe you could send me a link to download your checkpoint, and I'll test it.

@philokey
Author

philokey commented May 4, 2018

Thank you very much.
I uploaded my checkpoint here.

@roytseng-tw
Owner

Hi, the way PyTorch stores the optimizer state is not really robust. If the order of params in param_groups changes, the per-param state will not be restored with the correct correspondence, which results in the size mismatch error above. Not sure whether this has been improved in PyTorch 0.4.

You must have updated (git pull) the code between training and resuming, and the order of params in param_groups changed.
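
A minimal sketch of that failure mode with a toy two-layer model (written against a recent PyTorch; the positional matching in Optimizer.load_state_dict works the same way in 0.3.x):

import torch
import torch.nn as nn

# Toy model only for illustration; it has nothing to do with Detectron.pytorch.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One update populates the momentum buffers in the optimizer state.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()
saved = optimizer.state_dict()

# param_groups[i]['params'] records param ids in the order the parameters were
# passed in, and load_state_dict re-attaches the saved per-param state to the
# current parameters purely by that position. If a code change between saving
# and resuming reorders the parameters, each momentum buffer gets paired with a
# parameter of a different shape, and the next optimizer.step() fails with the
# "sizes do not match" error above.
reordered = torch.optim.SGD(list(model.parameters())[::-1], lr=0.1, momentum=0.9)
reordered.load_state_dict(saved)  # loads without complaint, but the pairing is wrong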

@roytseng-tw
Owner

I'll put in a remedy for that.

@roytseng-tw
Owner

Interestingly, I found that the order of param ids in the checkpoint, ckpt['optimizer']['param_groups'][i]['params'], is not changed. But when I match the shapes of the params in ckpt['model'] against the momentum buffers in ckpt['optimizer']['state'][param_id]['momentum_buffer'], using the fact that the model state_dict (parameters) is insertion-ordered, those shapes do not match. So your checkpoint seems corrupted somehow.
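
Roughly, that check looks like this (a sketch: it assumes the checkpoint is the dict with 'model' and 'optimizer' entries mentioned above, and that every key in the model state_dict corresponds to a trainable parameter; non-parameter buffers, if the model had any, would need to be filtered out first):

import torch

ckpt = torch.load('model_step69999.pth', map_location=lambda storage, loc: storage)

# The model state_dict is insertion-ordered, so zipping its keys against the
# saved param-id order recovers the name <-> param-id correspondence.
param_names = list(ckpt['model'].keys())
saved_ids = [pid for group in ckpt['optimizer']['param_groups'] for pid in group['params']]

for name, pid in zip(param_names, saved_ids):
    buf = ckpt['optimizer']['state'].get(pid, {}).get('momentum_buffer')
    if buf is not None and buf.size() != ckpt['model'][name].size():
        print('mismatch:', name, pid, tuple(buf.size()), tuple(ckpt['model'][name].size()))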

Do you face the same issue if you start a new training run and then try to resume from any of the new checkpoints? (BTW, pressing Ctrl+C to terminate the program will also save a checkpoint.)

@philokey
Author

philokey commented May 5, 2018

Yes, the param and momentum_buffer shapes mismatch in the checkpoint. I get an AssertionError as follows:

Traceback (most recent call last):
  File "tools/train_net_step.py", line 436, in <module>
    main()
  File "tools/train_net_step.py", line 305, in main
    misc_utils.ensure_optimizer_ckpt_params_order(param_names, checkpoint)
  File "/home/philokey/Project/Detectron.pytorch/lib/utils/misc.py", line 73, in ensure_optimizer_ckpt_params_order
    ' param_name: {}, param_id: {}'.format(key, saved_p_id))
AssertionError: param and momentum_buffer shape mismatch in checkpoint. param_name: Conv_Body.conv_top.weight, param_id: 140364424130056

It seems there may be a bug in torch.save; maybe I should update my PyTorch version.

@roytseng-tw
Owner

What's your PyTorch version? It's recommended to use 0.3.1 for now.

@philokey
Author

philokey commented May 5, 2018

I use PyTorch 0.3.1, Ubuntu 16.04, Titan Xp, Python 3.5, cuDNN 7, CUDA 9.0.

@roytseng-tw
Owner

Are you still facing the same issue now, or is it solved after retraining?

@philokey
Author

philokey commented May 6, 2018

Unfortunately, I'm still facing the same issue after retraining: param and momentum_buffer shape mismatch in checkpoint.

@roytseng-tw
Owner

roytseng-tw commented May 6, 2018

I may not be able to help you any further.

Here are some suggestions:

  1. Record the actual mapping of param name to param id, to see if the saved optimizer checkpoint is really wrong (a rough sketch follows this list).
  2. Try doing the same thing (loading an optimizer checkpoint) in other PyTorch projects.
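
For the first suggestion, a rough sketch of how the mapping could be recorded at save time (the helper below is hypothetical, not part of this repo; in PyTorch 0.3.x the optimizer state_dict keys its per-param state by Python id(), which is the large number shown in the assertion above):

import torch

def save_ckpt_with_name_map(net, optimizer, ckpt_path):
    # Hypothetical helper: alongside the usual model/optimizer state, record an
    # explicit param-name -> param-id mapping so the saved optimizer state can
    # be cross-checked later against ckpt['optimizer']['param_groups'].
    name_to_id = {name: id(p) for name, p in net.named_parameters() if p.requires_grad}
    torch.save({
        'model': net.state_dict(),
        'optimizer': optimizer.state_dict(),
        'param_name_to_id': name_to_id,
    }, ckpt_path)

When resuming, compare ckpt['param_name_to_id'] against ckpt['optimizer']['param_groups'][i]['params'] and the shapes of the corresponding momentum buffers, to see whether the saved state really points at the wrong parameters.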

Closing the issue since it's not a bug related to this project. Feel free to open it again if you find it is!

@luohuan2uestc

@philokey I'm facing the same problem. Have you solved it? Thanks!

@choasup

choasup commented Dec 17, 2018
