What is the difference between clip_text_features and clip_word_tokens? #27

kinredon opened this issue Aug 16, 2023 · 0 comments

Hi, thanks for your excellent work.

I am reading the code of this project, but I am confused about the difference between clip_text_features and clip_word_tokens in these lines:

clip_text_features, clip_word_tokens = \
    text_encoder.encode_pseudo_text(pseudo_text, end_token_ids,
                                    text_pe=True, normalize=True,
                                    return_word_tokens=True)

As I understand it, clip_text_features are the CLIP text encoder's features for the whole text, while clip_word_tokens are the features for the particular class names (using end_token_ids as the index). So clip_text_features can represent the text feature for the whole bag of regions, whereas clip_word_tokens represent the text features for individual proposals. Do I understand this correctly?
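
To check my understanding, here is a minimal sketch (my own illustration, not the repository's code; the shapes and tensor names are my assumptions) of how an end-of-text index can gather one sentence-level feature per pseudo text in the LNC layout:

import torch

# hypothetical shapes: L = sequence length, N = number of pseudo texts, C = embedding dim
L, N, C = 77, 4, 512
token_features = torch.randn(L, N, C)        # transformer output in LNC layout
end_token_ids = torch.randint(5, 20, (N,))   # assumed end-token position of each pseudo text

# one feature per pseudo text, gathered at its end-token position
clip_text_features = token_features[end_token_ids, torch.arange(N)]   # shape (N, C)

If I read it correctly, this is the same gather pattern as x[cls_indices, torch.arange(x.shape[1])] in the forward function quoted below.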

More importantly, the implementation behind clip_word_tokens confuses me. In these lines:

def forward(self, x, return_tokens=False, cls_indices=None, attn_masks=None):
    # att: the usual attention output; tokens: an extra per-word output,
    # returned only when return_tokens=True
    att, tokens = self.attention(self.ln_1(x), return_tokens, attn_masks=attn_masks)
    if return_tokens:
        assert cls_indices is not None
        if not isinstance(cls_indices, int):
            assert len(cls_indices) == x.shape[1]   # x: LNC
        # gather the *input* embeddings at the class-token positions, one per sequence
        cls_tokens = x[cls_indices, torch.arange(x.shape[1])]
        # residual comes from the class-token inputs rather than from tokens itself,
        # then the standard MLP sub-block is applied
        tokens = cls_tokens[None] + tokens
        tokens = tokens + self.mlp(self.ln_2(tokens))

        # main path: the usual pre-LN transformer residual block
        x = x + att
        x = x + self.mlp(self.ln_2(x))

        return x, tokens
    else:
        assert tokens is None
        x = x + att
        # x = x + self.attention(self.ln_1(x))
        x = x + self.mlp(self.ln_2(x))

        return x, None
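
If it helps to state my tentative reading (my own sketch and assumptions, not confirmed against the paper): the tokens branch seems to reuse the structure of the standard pre-LN residual block, except that the first residual comes from the input at the class-token positions rather than from the tokens themselves. Here is a runnable toy version under those assumptions; I substitute a plain attention output for the per-word tokens because I am unsure how self.attention computes them, which is exactly part of my question:

import torch
import torch.nn as nn

# toy stand-ins for the block's sub-modules (my assumption of the structure, not the repo's classes)
L, N, C = 16, 2, 64
ln_1, ln_2 = nn.LayerNorm(C), nn.LayerNorm(C)
mlp = nn.Sequential(nn.Linear(C, 4 * C), nn.GELU(), nn.Linear(4 * C, C))
attn = nn.MultiheadAttention(C, num_heads=4)   # expects (L, N, C) with batch_first=False

x = torch.randn(L, N, C)                       # LNC, as in the quoted forward
cls_indices = torch.randint(0, L, (N,))        # one class-token position per sequence

# main path: standard pre-LN transformer residual block
h = ln_1(x)
att, _ = attn(h, h, h, need_weights=False)
x_out = x + att
x_out = x_out + mlp(ln_2(x_out))

# tokens branch: same sub-blocks, but the first residual is the class-token *input*
tokens = att                                   # stand-in for the per-word attention output
cls_tokens = x[cls_indices, torch.arange(N)]   # (N, C), gathered from the input x
tokens = cls_tokens[None] + tokens             # mirrors cls_tokens[None] + tokens above
tokens = tokens + mlp(ln_2(tokens))            # mirrors tokens + self.mlp(self.ln_2(tokens))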

Could the author provide some explanation, or point to a paper, for this implementation? That would really help me a lot, thanks!
