
BLIP-2: Cannot replicate "modality gap" results with pretrained vs. new Q-Former #544

Closed
gregor-ge opened this issue Oct 5, 2023 · 0 comments


Hello,

I find the modality-gap results in Figure 5 of the BLIP-2 paper very interesting, and I was looking into ways to improve the stage-1 pretraining to get better results with the LLM. However, I fail to replicate those results even with your stage-1 checkpoint in my smaller-scale setup: the "new" Q-Former works just as well as, if not better than, the pre-trained one:

[attached figure: training curves comparing the pre-trained and the newly initialized Q-Former]

  • I am using the HuggingFace implementation, and I converted the "blip2" checkpoint with an adapted version of Niels Rogge's conversion script (a sketch of the resulting comparison setup follows this list).
  • I am training with 2M captions sampled from the 14M BLIP WebCapFilt dataset with batch size 128.
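For concreteness, here is a minimal sketch of how I set up the two variants with the HuggingFace transformers API. The checkpoint path is a placeholder for my converted stage-1 checkpoint, and the re-initialization relies on the internal `_init_weights` helper and the default Q-Former `initializer_range` of 0.02; this illustrates the comparison, it is not my exact training code.

```python
import torch
from transformers import Blip2ForConditionalGeneration

# Placeholder path: the stage-1 checkpoint converted with the adapted
# conversion script (not an official Hub model id).
CKPT = "path/to/converted-blip2-stage1"

# Variant A: Q-Former weights kept as converted from the stage-1 checkpoint.
model_pretrained = Blip2ForConditionalGeneration.from_pretrained(CKPT)

# Variant B: same architecture, but the Q-Former and the learned query
# tokens are re-initialized from scratch before stage-2 training.
model_new = Blip2ForConditionalGeneration.from_pretrained(CKPT)
model_new.qformer.apply(model_new.qformer._init_weights)  # random re-init of all Q-Former submodules
torch.nn.init.normal_(model_new.query_tokens, mean=0.0, std=0.02)  # default initializer_range
```

Both variants then go through identical stage-2 training, so any gap between the curves should be attributable to the Q-Former initialization alone.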

Is it possible that the observed gap between the pre-trained and the newly initialized Q-Former only emerges with significantly more training samples?

Thank you for your help!
