About resume #37
Comments
Have you changed anything? I can't reproduce this error myself.
No. I just used res101 instead of res50, and I trained with 4 GPUs.
I tested with res101 and everything works fine. BTW, what's your PyTorch version?
That's strange. My PyTorch version is 0.3.1.
Maybe you could send me a link to download your checkpoint, and I'll test it.
Thank you very much.
Hi, the way PyTorch stores the optimizer state is not very robust. The saved state is matched to parameters by their position in param_groups, so if the order of params in param_groups changes, the state is restored against the wrong parameters, which produces the size-mismatch error above. I'm not sure whether this has been improved in PyTorch 0.4. You must have updated (git pull) the code between training and resuming, and that changed the order of params in param_groups.
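The positional matching described above can be reproduced with a small sketch (tested against a recent PyTorch; `NetV1`/`NetV2` are hypothetical stand-ins for the code before and after a `git pull` that reorders parameter registration):

```python
import torch
import torch.nn as nn

# Same layers, registered in a different order -- as might happen
# after a code update between training and resuming.
class NetV1(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

class NetV2(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc2 = nn.Linear(8, 2)  # registered first this time
        self.fc1 = nn.Linear(4, 8)

net = NetV1()
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
net.fc2(net.fc1(torch.randn(3, 4))).sum().backward()
opt.step()  # creates a momentum_buffer for every param
ckpt = opt.state_dict()

# "Resume" with params in the new order: the saved state is matched
# purely by position, so fc1's buffer is attached to fc2's weight.
net2 = NetV2()
opt2 = torch.optim.SGD(net2.parameters(), lr=0.1, momentum=0.9)
opt2.load_state_dict(ckpt)

p = opt2.param_groups[0]["params"][0]   # fc2.weight, shape (2, 8)
buf = opt2.state[p]["momentum_buffer"]  # fc1.weight's buffer, shape (8, 4)
print(buf.shape == p.shape)             # shapes disagree -> mismatch on step()
```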
I'll work on a remedy for that.
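One possible remedy (my sketch under stated assumptions, not necessarily the fix the author applied) is to key the saved optimizer state by parameter name instead of position, so a reordered param_groups no longer scrambles the buffers. The helper names `save_optimizer_by_name` / `load_optimizer_by_name` are hypothetical:

```python
import torch
import torch.nn as nn

def save_optimizer_by_name(model, optimizer):
    # Map each param's state to the param's name rather than its position.
    id_to_name = {id(p): n for n, p in model.named_parameters()}
    return {id_to_name[id(p)]: s for p, s in optimizer.state.items()}

def load_optimizer_by_name(model, optimizer, named_state):
    # Restore state by matching names (with a shape sanity check),
    # so the registration order of parameters no longer matters.
    params = dict(model.named_parameters())
    for name, state in named_state.items():
        p = params.get(name)
        if p is None:
            continue  # param removed by a code update; drop its state
        shapes_ok = all(
            not torch.is_tensor(v) or v.shape == p.shape
            for v in state.values()
        )
        if shapes_ok:
            optimizer.state[p] = state

# Usage: save after a step, then restore into a fresh optimizer.
model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
opt.step()

named = save_optimizer_by_name(model, opt)
fresh = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
load_optimizer_by_name(model, fresh, named)
```

Matching by name also degrades gracefully: a parameter that was added or removed between runs simply starts with empty state instead of inheriting another parameter's buffers.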
Interestingly, I found that the order of param ids in the checkpoint can change between runs. Are you facing the same issue if you restart a new training and then try to resume from one of the new checkpoints? (BTW, pressing Ctrl+C to terminate the program also saves a checkpoint.)
Yes, the param and momentum_buffer shapes mismatch in the checkpoint. I get an AssertionError as follows:
It seems there are some bugs in torch.save; maybe I should update my torch version.
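Before upgrading, a checkpoint can be screened for exactly this mismatch. A minimal sketch, assuming a recent PyTorch `state_dict()` layout where params are referenced by integer index in save order (`check_resume_compat` is a hypothetical helper, not part of the project):

```python
import torch
import torch.nn as nn

def check_resume_compat(model, opt_state_dict):
    # Pair saved state entries with the model's params by position and
    # report indices whose momentum_buffer shape disagrees with the param.
    saved_ids = [i for g in opt_state_dict["param_groups"] for i in g["params"]]
    mismatched = []
    for saved_id, p in zip(saved_ids, model.parameters()):
        state = opt_state_dict["state"].get(saved_id, {})
        buf = state.get("momentum_buffer")
        if buf is not None and buf.shape != p.shape:
            mismatched.append(saved_id)
    return mismatched

model = nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
model(torch.randn(3, 4)).sum().backward()
opt.step()
ckpt = opt.state_dict()

print(check_resume_compat(model, ckpt))  # [] -> safe to resume
```

A non-empty result means the checkpoint's state would be restored against the wrong parameters, which is the situation the AssertionError above is guarding against.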
What's your PyTorch version? It's recommended to use 0.3.1 for now.
I use PyTorch 0.3.1, Ubuntu 16.04, a Titan Xp, Python 3.5, cuDNN 7, and CUDA 9.0.
Are you still facing the same issue now, or was it solved after retraining?
Unfortunately, I'm still facing the same issue after retraining: param and momentum_buffer shape mismatch in checkpoint.
I may not be able to help you any further. Here are some suggestions:
Closing the issue since it's not a bug in this project. Feel free to reopen it if you find it is!
@philokey I am facing the same problem.
Hello, I tried to resume the training by using this command:
However, it throws a runtime error:
How can I solve this problem?