
prefix size for Vit/ResNet #45

Closed
scfrank opened this issue Jun 27, 2022 · 1 comment

scfrank commented Jun 27, 2022

Hi - I'm having trouble with a prefix_size mismatch in my own finetuned model, which uses CLIP features from ViT-B/32. I'm training only the transformer mapping network (no GPT-2 finetuning), using the commands given in the README.

My understanding is that CLIP with ViT-B/32 produces 512-dimensional features (prefix_size = 512) while the RN50x4 ResNet encoder produces 640-dimensional features (prefix_size = 640), see e.g. the CLIP paper (Radford et al.), Appendix F, and also this line in train.py:

prefix_dim = 640 if args.is_rn else 512
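
(To double-check those dimensions empirically - this is just an illustrative sketch using the clip package directly, not code from the repo:)

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
for name in ("ViT-B/32", "RN50x4"):
    clip_model, _ = clip.load(name, device=device, jit=False)
    # encode a dummy image at the encoder's native resolution and inspect the feature size
    dummy = torch.zeros(1, 3, clip_model.visual.input_resolution,
                        clip_model.visual.input_resolution, device=device)
    with torch.no_grad():
        feat = clip_model.encode_image(dummy)
    print(name, feat.shape[-1])  # expect 512 for ViT-B/32 and 640 for RN50x4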

However, in my prediction script, which I've essentially copied from the transformer inference notebook (https://github.com/rmokady/CLIP_prefix_caption/blob/main/notebooks/transformer_inference.ipynb), the prefix size is set to 640:

model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=640,
                          num_layers=8, mapping_type='transformer')

This worked with your pretrained COCO model, but not with my finetuned model, where I get a dimensionality mismatch between 512 and 640. Can you help me out here? Should I be using prefix_size = 640 or 512 for training/inference?

Thank you!
Stella

PS: FYI, there are a few typos in the commands in the README.md, where num_layres should be num_layers.

scfrank commented Jun 28, 2022

I solved the problem: I had missed the CLIP model loading step at inference time.

In case someone else runs into this problem: the transformer_inference notebook loads the RN50x4 ResNet encoder:

clip_model, preprocess = clip.load("RN50x4", device=device, jit=False)

Changing this to ViT-B/32, and changing the prefix size from 640 to 512 (ClipCaptionPrefix(prefix_size=512)), makes everything work correctly:

clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
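
For reference, a minimal sketch of keeping the encoder and the caption model consistent (assuming the same torch/clip imports, prefix_length, device, and ClipCaptionPrefix definition as in the notebook; model_path is a placeholder for your finetuned checkpoint):

clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
prefix_size = 512  # use 640 instead if you load "RN50x4"; matches prefix_dim in train.py
# clip_model.visual.output_dim should also report 512/640 if you prefer not to hard-code it
model = ClipCaptionPrefix(prefix_length, clip_length=40, prefix_size=prefix_size,
                          num_layers=8, mapping_type='transformer')
model.load_state_dict(torch.load(model_path, map_location=torch.device('cpu')))  # model_path: placeholder
model = model.eval().to(device)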
