solve the size mismatch issue of the generated *.ckpt files based on … #21

Open: 1 commit to merge into main

Amber-Heung

…Luc Giraud(Inria) and Paul Mycek(Cerfacs)'s suggestion

@astanziola
Member

Hi @Amber-Heung, thanks a lot for the PR!

Just for my understanding: does this fix the problem with loading those old weights, or is it more generally a problem with training the model from scratch?

I'm struggling to see how this can fix a size issue.

@Amber-Heung
Author

Hi @astanziola, there is no problem when training the model downloaded from the main branch. The size mismatch only appears once training has finished and you try to load the weights generated during training.

@astanziola
Member

Interesting! Thanks a lot, I'll add a quick test for this (this repository really needs some testing) and merge.

@Amber-Heung
Author

Hi @astanziola, I should explain more clearly the size mismatch issue we observed when loading the trained weights.
In my case, I used 2 NVIDIA P100 GPUs to train the model downloaded from the main branch. Afterwards, 4 *.ckpt files were stored in the /checkpoint folder, in two different sizes: a smaller one of around 1513683 and a larger one of around 3799505. Loading a smaller *.ckpt file into the trained model works without any problem, but loading a larger one produces the following error, printed in red:
size mismatch for source: copying a param with shape torch.Size([32, 2, 96, 96]) from checkpoint, the shape in current model is torch.Size([1, 2, 96, 96])
After applying the change shown in this pull request, I ran the training again with train.py, and this time all the *.ckpt files stored in the /checkpoint folder have a similar size of around 1513491, which is close to the smaller size from before this slight modification.
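
For anyone who wants to check which entries of a checkpoint disagree with the current model before loading it, here is a minimal sketch. It is not code from this repository: the helper name `find_shape_mismatches` and the `weight` entry are made up for illustration, and only the two `source` shapes come from the error message above. It works on plain dictionaries mapping names to shapes; with PyTorch these could presumably be built from a state dict with something like `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def find_shape_mismatches(checkpoint_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for entries whose shapes differ.

    Names that exist only in the checkpoint are ignored here; this sketch
    only reports entries present on both sides with different shapes.
    """
    mismatches = {}
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        if model_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(ckpt_shape), tuple(model_shape))
    return mismatches


# "source" shapes are copied from the error message; "weight" is a
# hypothetical entry added only to show a matching case.
checkpoint_shapes = {"source": (32, 2, 96, 96), "weight": (8, 2, 3, 3)}
model_shapes = {"source": (1, 2, 96, 96), "weight": (8, 2, 3, 3)}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, (ckpt_shape, model_shape) in mismatches.items():
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

In this example only `source` is reported, and the two shapes differ only in their leading dimension (32 versus 1), which matches the error shown above.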
