Hello,
I find the modality gap results in Figure 5 of the BLIP-2 paper very interesting and have been looking into ways to improve the stage 1 pretraining to get better results with the LLM. However, I fail to replicate those results in my smaller-scale setup, even with your stage 1 checkpoint: the "new" Q-Former works just as well as, if not better than, the pre-trained one.
I am using the HuggingFace implementation, and I converted the "blip2" checkpoint with an adapted version of Niels Rogge's conversion script.
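Roughly, the comparison I am running looks like the sketch below (simplified, not my exact code: the checkpoint name is the public Hub release rather than my converted stage 1 weights, and the re-initialization scheme is only an approximation of a "new" Q-Former):

```python
import copy

from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Public BLIP-2 release on the Hub; a locally converted stage 1 checkpoint
# would be passed here instead.
ckpt = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(ckpt)
pretrained = Blip2ForConditionalGeneration.from_pretrained(ckpt)

# Baseline with a newly initialized Q-Former: same architecture, random
# weights, while the vision encoder and LLM keep their pre-trained parameters.
scratch = copy.deepcopy(pretrained)
scratch.qformer.apply(scratch._init_weights)            # re-initialize all Q-Former weights
scratch.query_tokens.data.normal_(mean=0.0, std=0.02)   # re-initialize the learned query tokens
```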
I am training with 2M captions sampled from the 14M BLIP WebCapFilt dataset with a batch size of 128.
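The stage 2 training itself is essentially the standard captioning setup with the ViT and LLM frozen; here is a condensed sketch of it (dataset handling, collate function, and hyperparameters are placeholders, not my exact configuration):

```python
import torch
from torch.utils.data import DataLoader


def train_stage2(model, processor, caption_dataset, device="cuda"):
    """Train only the Q-Former pathway with a captioning (LM) loss."""
    model.to(device).train()

    # Freeze everything, then unfreeze the Q-Former, query tokens, and projection.
    for p in model.parameters():
        p.requires_grad = False
    for module in (model.qformer, model.language_projection):
        for p in module.parameters():
            p.requires_grad = True
    model.query_tokens.requires_grad = True

    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4, weight_decay=0.05
    )
    # Assumes the dataset yields (PIL image, caption string) pairs.
    loader = DataLoader(
        caption_dataset, batch_size=128, shuffle=True, collate_fn=lambda b: tuple(zip(*b))
    )

    for images, captions in loader:
        inputs = processor(
            images=list(images), text=list(captions), padding=True, return_tensors="pt"
        ).to(device)
        # Mask padding tokens out of the language-modeling loss.
        labels = inputs["input_ids"].masked_fill(
            inputs["input_ids"] == processor.tokenizer.pad_token_id, -100
        )
        loss = model(**inputs, labels=labels).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```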
Is it possible that the observed gap between the pre-trained and the newly initialized Q-Former only emerges with significantly more training samples?
Thank you for your help!