
About resume #37

Closed
philokey opened this issue May 3, 2018 · 17 comments

Comments

@philokey

philokey commented May 3, 2018

Hello, I tried to resume training with this command:

 python tools/train_net_step.py --dataset coco2017 --cfg configs/e2e_faster_rcnn_R-101-FPN_1x.yaml --use_tfboard --load_ckpt  Outputs/e2e_faster_rcnn_R-101-FPN_1x/May02-12-15-12_faster_step/ckpt/model_step69999.pth --resume

However, it throws a runtime error:

Traceback (most recent call last):
  File "tools/train_net_step.py", line 367, in main
    optimizer.step()
  File "/home/philokey/.virtualenvs/py3/lib/python3.5/site-packages/torch/optim/sgd.py", line 94, in step
    buf.mul_(momentum).add_(1 - dampening, d_p)
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:271

How can I solve this problem?

@roytseng-tw
Owner

Have you changed anything? I can't reproduce this error myself.

@philokey
Author

philokey commented May 4, 2018

No. I just use res101 instead of res50, and I use 4 GPUs.

@roytseng-tw
Owner

roytseng-tw commented May 4, 2018

I tested with res101 and everything works fine. BTW, what's your PyTorch version?

@philokey
Author

philokey commented May 4, 2018

That's strange. My PyTorch version is 0.3.1.

@roytseng-tw
Owner

Maybe you could send me a link to download your checkpoint, and I'll test it.

@philokey
Author

philokey commented May 4, 2018

Thank you very much.
I uploaded my checkpoint here.

@roytseng-tw
Owner

Hi, the way PyTorch stores the optimizer state is not really robust. If the order of params in param_groups changes, the per-param state will not be restored with the correct correspondence, which results in the size mismatch error above. Not sure whether this has been improved in PyTorch 0.4.

You must have updated (git pull) the code between training and resuming, and the order of params in param_groups changed.
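
A minimal sketch of that failure mode with a toy two-layer model (written against a recent PyTorch; the positional matching in Optimizer.load_state_dict works the same way in 0.3.x):

import torch
import torch.nn as nn

# Toy model only for illustration; it has nothing to do with Detectron.pytorch.
model = nn.Sequential(nn.Linear(4, 8), nn.Linear(8, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One update populates the momentum buffers in the optimizer state.
model(torch.randn(3, 4)).sum().backward()
optimizer.step()
saved = optimizer.state_dict()

# param_groups[i]['params'] records param ids in the order the parameters were
# passed in, and load_state_dict re-attaches the saved per-param state to the
# current parameters purely by that position. If a code change between saving
# and resuming reorders the parameters, each momentum buffer gets paired with a
# parameter of a different shape, and the next optimizer.step() fails with the
# "sizes do not match" error above.
reordered = torch.optim.SGD(list(model.parameters())[::-1], lr=0.1, momentum=0.9)
reordered.load_state_dict(saved)  # loads without complaint, but the pairing is wrong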

@roytseng-tw
Owner

I'll put in a remedy for that.

@roytseng-tw
Owner

Interestingly, I found that the order of param ids in the checkpoint, ckpt['optimizer']['param_groups'][i]['params'], is not changed. But when I match the shapes of the params in ckpt['model'] against the momentum buffers in ckpt['optimizer']['state'][param_id]['momentum_buffer'], using the fact that the model state_dict (parameters) is insertion-ordered, those shapes do not match. So your checkpoint seems corrupted somehow.
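
Roughly, that check looks like this (a sketch: it assumes the checkpoint is the dict with 'model' and 'optimizer' entries mentioned above, and that every key in the model state_dict corresponds to a trainable parameter; non-parameter buffers, if the model had any, would need to be filtered out first):

import torch

ckpt = torch.load('model_step69999.pth', map_location=lambda storage, loc: storage)

# The model state_dict is insertion-ordered, so zipping its keys against the
# saved param-id order recovers the name <-> param-id correspondence.
param_names = list(ckpt['model'].keys())
saved_ids = [pid for group in ckpt['optimizer']['param_groups'] for pid in group['params']]

for name, pid in zip(param_names, saved_ids):
    buf = ckpt['optimizer']['state'].get(pid, {}).get('momentum_buffer')
    if buf is not None and buf.size() != ckpt['model'][name].size():
        print('mismatch:', name, pid, tuple(buf.size()), tuple(ckpt['model'][name].size()))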

Do you face the same issue if you start a new training run and then try to resume from any of the new checkpoints? (BTW, pressing Ctrl+C to terminate the program will also save a checkpoint.)

@philokey
Author

philokey commented May 5, 2018

Yes, the param and momentum_buffer shapes mismatch in the checkpoint. I get an AssertionError as follows:

Traceback (most recent call last):
  File "tools/train_net_step.py", line 436, in <module>
    main()
  File "tools/train_net_step.py", line 305, in main
    misc_utils.ensure_optimizer_ckpt_params_order(param_names, checkpoint)
  File "/home/philokey/Project/Detectron.pytorch/lib/utils/misc.py", line 73, in ensure_optimizer_ckpt_params_order
    ' param_name: {}, param_id: {}'.format(key, saved_p_id))
AssertionError: param and momentum_buffer shape mismatch in checkpoint. param_name: Conv_Body.conv_top.weight, param_id: 140364424130056

It seems there may be a bug in torch.save; maybe I should update my PyTorch version.

@roytseng-tw
Owner

What's your PyTorch version? It's recommended to use 0.3.1 for now.

@philokey
Author

philokey commented May 5, 2018

I use PyTorch 0.3.1, Ubuntu 16.04, Titan Xp, Python 3.5, cuDNN 7, CUDA 9.0.

@roytseng-tw
Owner

Are you still facing the same issue now, or is it solved after retraining?

@philokey
Author

philokey commented May 6, 2018

Unfortunately, I'm still facing the same issue after retraining: param and momentum_buffer shape mismatch in checkpoint.

@roytseng-tw
Owner

roytseng-tw commented May 6, 2018

I may not be able to help you any further.

Here are some suggestions:

  1. Record the actual mapping of param name to param id, to see if the saved optimizer checkpoint is really wrong (a rough sketch follows this list).
  2. Try doing the same thing (loading an optimizer checkpoint) in other PyTorch projects.
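
For the first suggestion, a rough sketch of how the mapping could be recorded at save time (the helper below is hypothetical, not part of this repo; in PyTorch 0.3.x the optimizer state_dict keys its per-param state by Python id(), which is the large number shown in the assertion above):

import torch

def save_ckpt_with_name_map(net, optimizer, ckpt_path):
    # Hypothetical helper: alongside the usual model/optimizer state, record an
    # explicit param-name -> param-id mapping so the saved optimizer state can
    # be cross-checked later against ckpt['optimizer']['param_groups'].
    name_to_id = {name: id(p) for name, p in net.named_parameters() if p.requires_grad}
    torch.save({
        'model': net.state_dict(),
        'optimizer': optimizer.state_dict(),
        'param_name_to_id': name_to_id,
    }, ckpt_path)

When resuming, compare ckpt['param_name_to_id'] against ckpt['optimizer']['param_groups'][i]['params'] and the shapes of the corresponding momentum buffers, to see whether the saved state really points at the wrong parameters.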

Closing the issue since it's not a bug related to this project. Feel free to open it again if you find it is!

@luohuan2uestc

@philokey I'm facing the same problem. Have you solved it? Thanks!

@choasup

choasup commented Dec 17, 2018
