
OWL-ViT pre-trained models cannot accept some of the longest descriptions #13

HarukiNishimura-TRI opened this issue Feb 13, 2024 · 3 comments

@HarukiNishimura-TRI

Dear authors,

Thank you for your work and the release of the d-cube dataset.

I was trying to run a pre-trained OWL-ViT model (e.g. "google/owlvit-base-patch32") on the dataset and found that the following sentences yield a RuntimeError.

 ID: 140, TEXT: "a person who wears a hat and holds a tennis racket on the tennis court",
 ID: 146, TEXT: "the player who is ready to bat with both feet leaving the ground in the room",
 ID: 253, TEXT: "a person who plays music with musical instrument surrounded by spectators on the street",
 ID: 342, TEXT: "a fisher who stands on the shore and whose lower body is not submerged by water",
 ID: 348, TEXT: "a person who stands on the stage for speech but don't open their mouths",
 ID: 355, TEXT: "a person with a pen in one hand but not looking at the paper",
 ID: 356, TEXT: "a billiard ball with no numbers or patterns on its surface on the table",
 ID: 364, TEXT: "a person standing at the table of table tennis who is not waving table tennis rackets",
 ID: 404, TEXT: "a water polo player who is in the water but does not hold the ball",
 ID: 405, TEXT: "a barbell held by a weightlifter that has not been lifted above the head",
 ID: 412, TEXT: "a person who wears a helmet and sling equipment but is not on the sling",
 ID: 419, TEXT: "person who kneels on one knee and proposes but has nothing in his hand"

A typical error message is shown at the bottom. It seems that the pre-trained model uses max_position_embeddings = 16 in OwlViTTextConfig, which is not long enough to accept the descriptions above as input. All the models available on Hugging Face seem to use max_position_embeddings = 16. Did you encounter the same issue when running your experiments for the paper? If so, how did you handle it in the evaluation process?
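
As a quick check, here is a minimal sketch (my own, assuming the standard Hugging Face transformers tokenizer API) showing that such a description tokenizes to more than 16 positions:

```python
# Minimal sketch, assuming the standard Hugging Face transformers API:
# count the tokens the OWL-ViT tokenizer produces for one failing description.
from transformers import OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")

text = "a person who wears a hat and holds a tennis racket on the tennis court"
# Tokenize without truncation; the count includes the BOS and EOS tokens.
input_ids = processor.tokenizer(text)["input_ids"]
print(len(input_ids))  # more than 16 (the traceback below reports 18), exceeding max_position_embeddings
```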

Thanks in advance.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[129], line 1
----> 1 results = get_prediction(processor, model, image, [text_list[0]])

Cell In[11], line 13, in get_prediction(processor, model, image, captions, cpu_only)
      9 with torch.no_grad():
     10     inputs = processor(text=[captions], images=image, return_tensors="pt").to(
     11         device
     12     )
---> 13     outputs = model(**inputs)
     14 target_size = torch.Tensor([image.size[::-1]]).to(device)
     15 results = processor.post_process_object_detection(
     16     outputs=outputs, target_sizes=target_size, threshold=0.05
     17 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1640, in OwlViTForObjectDetection.forward(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states, return_dict)
   1637 return_dict = return_dict if return_dict is not None else self.config.return_dict
   1639 # Embed images and text queries
-> 1640 query_embeds, feature_map, outputs = self.image_text_embedder(
   1641     input_ids=input_ids,
   1642     pixel_values=pixel_values,
   1643     attention_mask=attention_mask,
   1644     output_attentions=output_attentions,
   1645     output_hidden_states=output_hidden_states,
   1646 )
   1648 # Text and vision model outputs
   1649 text_outputs = outputs.text_model_output

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1385, in OwlViTForObjectDetection.image_text_embedder(self, input_ids, pixel_values, attention_mask, output_attentions, output_hidden_states)
   1376 def image_text_embedder(
   1377     self,
   1378     input_ids: torch.Tensor,
   (...)
   1383 ) -> Tuple[torch.FloatTensor]:
   1384     # Encode text and image
-> 1385     outputs = self.owlvit(
   1386         pixel_values=pixel_values,
   1387         input_ids=input_ids,
   1388         attention_mask=attention_mask,
   1389         output_attentions=output_attentions,
   1390         output_hidden_states=output_hidden_states,
   1391         return_dict=True,
   1392     )
   1394     # Get image embeddings
   1395     last_hidden_state = outputs.vision_model_output[0]

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:1163, in OwlViTModel.forward(self, input_ids, pixel_values, attention_mask, return_loss, output_attentions, output_hidden_states, return_base_image_embeds, return_dict)
   1155 vision_outputs = self.vision_model(
   1156     pixel_values=pixel_values,
   1157     output_attentions=output_attentions,
   1158     output_hidden_states=output_hidden_states,
   1159     return_dict=return_dict,
   1160 )
   1162 # Get embeddings for all text queries in all batch samples
-> 1163 text_outputs = self.text_model(
   1164     input_ids=input_ids,
   1165     attention_mask=attention_mask,
   1166     output_attentions=output_attentions,
   1167     output_hidden_states=output_hidden_states,
   1168     return_dict=return_dict,
   1169 )
   1171 text_embeds = text_outputs[1]
   1172 text_embeds = self.text_projection(text_embeds)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:798, in OwlViTTextTransformer.forward(self, input_ids, attention_mask, position_ids, output_attentions, output_hidden_states, return_dict)
    796 input_shape = input_ids.size()
    797 input_ids = input_ids.view(-1, input_shape[-1])
--> 798 hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
    800 # num_samples, seq_len = input_shape  where num_samples = batch_size * num_max_text_queries
    801 # OWLVIT's text model uses causal mask, prepare it here.
    802 # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
    803 causal_attention_mask = _create_4d_causal_attention_mask(
    804     input_shape, hidden_states.dtype, device=hidden_states.device
    805 )

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File /venv/lib/python3.9/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File /venv/lib/python3.9/site-packages/transformers/models/owlvit/modeling_owlvit.py:332, in OwlViTTextEmbeddings.forward(self, input_ids, position_ids, inputs_embeds)
    329     inputs_embeds = self.token_embedding(input_ids)
    331 position_embeddings = self.position_embedding(position_ids)
--> 332 embeddings = inputs_embeds + position_embeddings
    334 return embeddings

RuntimeError: The size of tensor a (18) must match the size of tensor b (16) at non-singleton dimension 1

@Charles-Xie
Member

Hi Haruki,

Thanks for your interest.
Regarding your question: for OWL-ViT, we did skip those sentences, using a try-except block during evaluation. The other methods we evaluated do not have this constraint on the input length, so no such handling was needed for them. I think simply truncating the input to 16 tokens might be a better solution, and we will give that a try.
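
Something like the following rough sketch (not our exact evaluation code; it reuses the get_prediction helper and variables from your traceback):

```python
# Rough sketch of the skip-on-error handling (not the exact evaluation code).
# get_prediction, processor, model, image and text_list are assumed to be the
# same objects as in the traceback above.
results_per_caption = {}
for caption in text_list:
    try:
        results_per_caption[caption] = get_prediction(processor, model, image, [caption])
    except RuntimeError:
        # Raised when the tokenized caption is longer than
        # max_position_embeddings (16 for the released OWL-ViT checkpoints).
        results_per_caption[caption] = None  # description skipped
```
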
If you have further questions, please feel free to email me.

Best regards,
Chi

@HarukiNishimura-TRI
Author

Hi Chi,

Thank you for the clarification. So you omitted those sentences for the inter-scenario case as well?

Regards,
Haruki

@Charles-Xie
Member

> Hi Chi,
>
> Thank you for the clarification. So you omitted those sentences for the inter-scenario case as well?
>
> Regards, Haruki

@HarukiNishimura-TRI Yes, I think so, for OWL-ViT. For inference with OWL-ViT, I think it would be better to truncate the descriptions to 16 tokens and run inference on the truncated text.
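
For example, here is a rough sketch (my own suggestion, not the code we used for the paper) of truncating a description to the 16-token limit with the Hugging Face tokenizer:

```python
# Rough sketch (not the code used for the paper): truncate a description to the
# 16-token limit of the released OWL-ViT text encoder before inference.
from transformers import OwlViTProcessor

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")

def truncate_caption(caption: str, max_len: int = 16) -> str:
    # Tokenize with truncation so the sequence (including BOS/EOS special
    # tokens) fits the position-embedding table, then decode back to a string.
    ids = processor.tokenizer(caption, truncation=True, max_length=max_len)["input_ids"]
    return processor.tokenizer.decode(ids, skip_special_tokens=True)

# The truncated string can then be passed to the processor as usual, e.g.
# processor(text=[[truncate_caption(c) for c in captions]], images=image, return_tensors="pt")
```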
