Hello,
I find the modality gap results in Figure 5 of the BLIP-2 paper very interesting and have been looking into ways to improve the stage 1 pretraining to get better results with the LLM. However, I fail to replicate those results in my smaller-scale setup, even with your stage 1 checkpoint: the "new" Q-Former works just as well as, if not better than, the pre-trained one.
I am using the HuggingFace implementation, and I converted the "blip2" checkpoint with an adapted version of Niels Rogge's conversion script.
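Roughly, the comparison I am running looks like the sketch below (simplified, not my exact code: the checkpoint name is the public Hub release rather than my converted stage 1 weights, and the re-initialization scheme is only an approximation of a "new" Q-Former):

```python
import copy

from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Public BLIP-2 release on the Hub; a locally converted stage 1 checkpoint
# would be passed here instead.
ckpt = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(ckpt)
pretrained = Blip2ForConditionalGeneration.from_pretrained(ckpt)

# Baseline with a newly initialized Q-Former: same architecture, random
# weights, while the vision encoder and LLM keep their pre-trained parameters.
scratch = copy.deepcopy(pretrained)
scratch.qformer.apply(scratch._init_weights)            # re-initialize all Q-Former weights
scratch.query_tokens.data.normal_(mean=0.0, std=0.02)   # re-initialize the learned query tokens
```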
I am training with 2M captions sampled from the 14M BLIP WebCapFilt dataset with a batch size of 128.
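The stage 2 training itself is essentially the standard captioning setup with the ViT and LLM frozen; here is a condensed sketch of it (dataset handling, collate function, and hyperparameters are placeholders, not my exact configuration):

```python
import torch
from torch.utils.data import DataLoader


def train_stage2(model, processor, caption_dataset, device="cuda"):
    """Train only the Q-Former pathway with a captioning (LM) loss."""
    model.to(device).train()

    # Freeze everything, then unfreeze the Q-Former, query tokens, and projection.
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.qformer, model.language_projection):
        for p in module.parameters():
            p.requires_grad = True
    model.query_tokens.requires_grad = True

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4, weight_decay=0.05
    )
    # Assumes the dataset yields (PIL image, caption string) pairs.
    loader = DataLoader(
        caption_dataset, batch_size=128, shuffle=True, collate_fn=lambda b: tuple(zip(*b))
    )

    for images, captions in loader:
        inputs = processor(
            images=list(images), text=list(captions), padding=True, return_tensors="pt"
        ).to(device)
        # Mask padding tokens out of the language-modeling loss.
        labels = inputs["input_ids"].masked_fill(
            inputs["input_ids"] == processor.tokenizer.pad_token_id, -100
        )
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```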
Is it possible that the observed gap between the pre-trained and the newly initialized Q-Former only emerges with significantly more training samples?
Thank you for your help!