
Discussion closed #411

Closed
MaHuanAAA opened this issue Jun 7, 2024 · 1 comment

Comments

@MaHuanAAA

For most VLMs, I find that the image feature representations are fixed. Although cross-attention is mentioned in the Qwen-VL paper, in practice it appears to be just an adapter. Why don't these models use the text question as the query (`q`) to perform cross-attention for image feature extraction? There must be some unacceptable drawback to this approach.
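
For concreteness, here is a minimal sketch of what I mean by question-conditioned cross-attention (PyTorch; the module, names, and dimensions are my own illustration, not taken from Qwen-VL or any other released model):

```python
# Rough sketch: the question tokens act as queries over image patch features,
# so the resulting image representation depends on the question instead of
# being a fixed encoding. All names/dimensions here are illustrative.
import torch
import torch.nn as nn

class QuestionConditionedPooler(nn.Module):
    def __init__(self, text_dim=768, image_dim=1024, num_heads=8):
        super().__init__()
        # Project image patch features into the text embedding space.
        self.img_proj = nn.Linear(image_dim, text_dim)
        # Cross-attention: queries from the question, keys/values from the image.
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, question_tokens, image_patches):
        # question_tokens: (B, T_text, text_dim) -- embeddings of the question
        # image_patches:   (B, T_img, image_dim) -- patch features from a vision encoder
        kv = self.img_proj(image_patches)
        # Each question token gathers the image information relevant to it.
        attended, _ = self.cross_attn(query=question_tokens, key=kv, value=kv)
        return attended  # (B, T_text, text_dim): question-dependent image features

# Tiny usage example with random tensors.
pooler = QuestionConditionedPooler()
q = torch.randn(2, 16, 768)      # question token embeddings
img = torch.randn(2, 256, 1024)  # ViT-style patch features
print(pooler(q, img).shape)      # torch.Size([2, 16, 768])
```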

Can anyone explain the reason behind this?

@KDD2018

KDD2018 commented Jun 7, 2024

+1

MaHuanAAA changed the title from "[Discussion] Why do current large-scale VL models rely on self-attention within images for feature extraction rather than using cross-attention based on the content of the question? What is the reason behind this?" to "Discussion closed" on Jun 7, 2024