Why don't the ViT-L/14 models in blip2 (pretrain_vitL) and blip2_t5 (pretrain_flant5xl_vitL) have the same number of layers as when instantiating a BLIP2 model with vit_model = 'clip_L'?
#609
Hi!

I have checked the vision part of both blip2 pretrain_vitL and blip2_t5 pretrain_flant5xl_vitL and noticed that there are 21 residual attention blocks.

Meanwhile, when instantiating a base BLIP2 model, we can specify the vision encoder to be a "clip_L" model, which is handled by `init_vision_encoder`. As you can see, it creates the ViT-L model by calling the function `create_clip_vit_L`. You can see in the function's code that there are 23 layers, which I could confirm by downloading the model's .pth file and loading it.
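One quick way to compare layer counts across checkpoints is to count the distinct residual-block indices that appear in a state dict's keys. This is a generic sketch, not LAVIS code; the key pattern (here assumed to be `resblocks.<i>.`, as in OpenAI CLIP-style ViTs) may need adjusting for a given checkpoint:

```python
import re

def count_resblocks(state_dict_keys, pattern=r"resblocks\.(\d+)\."):
    """Count the distinct residual-block indices that appear in state-dict keys."""
    indices = set()
    for key in state_dict_keys:
        m = re.search(pattern, key)
        if m:
            indices.add(int(m.group(1)))
    return len(indices)

# Synthetic example: keys shaped like CLIP ViT parameter names
keys = [f"transformer.resblocks.{i}.attn.in_proj_weight" for i in range(23)]
print(count_resblocks(keys))  # → 23
```

Running this on both the BLIP2 visual encoder's state dict and the raw CLIP ViT-L state dict would make the 21-vs-23 discrepancy explicit.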
I ran the following code to check if the weights in the vision encoder of blip2 pretrain_vitL are the same as their counterparts in the CLIP ViT-L checkpoint:

```python
import torch
from lavis.models import load_model

model = load_model("blip2", "pretrain_vitL")
blip2_pretrain_vitl_state_dict = model.state_dict()
# clip_vit_l_path points to the downloaded CLIP ViT-L .pth file
clip_vitL_state_dict = torch.load(clip_vit_l_path, map_location="cpu")

# Verify that the parameters' names are the same
ve_blip2_pretrain_vitl_keys = [
    k.replace("visual_encoder.", "")
    for k in blip2_pretrain_vitl_state_dict.keys()
    if "visual_encoder" in k
]
all_keys_in = sum(1 for k in ve_blip2_pretrain_vitl_keys if k in clip_vitL_state_dict)
print(len(ve_blip2_pretrain_vitl_keys) == all_keys_in)

# Verify that the weights are the same (print any mismatching parameter)
for k in ve_blip2_pretrain_vitl_keys:
    if not bool((blip2_pretrain_vitl_state_dict[f"visual_encoder.{k}"] == clip_vitL_state_dict[k]).all()):
        print(k)
```
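Comparing the key sets in both directions (rather than only checking containment one way) would also show exactly which parameters exist in one checkpoint but not the other, e.g. any trailing residual blocks. A small illustrative helper, using made-up key names rather than real LAVIS ones:

```python
def diff_keys(a_keys, b_keys, prefix="visual_encoder."):
    """Strip a prefix from the first key set, then diff against the second.

    Returns (keys only in a, keys only in b).
    """
    a = {k[len(prefix):] for k in a_keys if k.startswith(prefix)}
    b = set(b_keys)
    return sorted(a - b), sorted(b - a)

# Illustrative key names, not from an actual checkpoint
only_a, only_b = diff_keys(
    ["visual_encoder.resblocks.0.weight", "visual_encoder.resblocks.1.weight"],
    ["resblocks.0.weight"],
)
print(only_a, only_b)  # → ['resblocks.1.weight'] []
```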
I was wondering why there is such a difference.