[Discussion] Why do current large-scale VL models rely on self-attention within images for feature extraction rather than using cross-attention based on the content of the question?
For most VLMs, I find that the image feature representations are fixed. Although cross-attention is mentioned in the Qwen-VL paper, in practice it appears to be just an adapter. Why don't these models use the text question as the query (`q`) to perform cross-attention for image feature extraction? There must be some unacceptable drawback to this approach.
Can anyone explain the reason behind this?
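
To make the question concrete, here is a minimal PyTorch sketch of what I have in mind. The class, names, and dimensions are invented for illustration and don't come from any particular model:

```python
import torch
import torch.nn as nn

class QuestionConditionedPooler(nn.Module):
    """Hypothetical sketch: pool image features with the question as the query.

    Contrast with the usual pipeline, where image tokens are produced once by
    a vision encoder (plus a small adapter) regardless of the question asked.
    """

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, question_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # q comes from the text question; k/v come from the image patches,
        # so the extracted image representation depends on what is asked.
        fused, _ = self.attn(query=question_tokens,
                             key=image_tokens,
                             value=image_tokens)
        return fused

# Usage: one question (12 tokens) attending over 256 image patch features.
pooler = QuestionConditionedPooler()
q = torch.randn(1, 12, 768)     # question token embeddings
img = torch.randn(1, 256, 768)  # ViT patch features
out = pooler(q, img)            # shape (1, 12, 768), question-dependent
```

One obvious cost of this design is that the image features can no longer be precomputed or cached: they would have to be re-extracted for every new question about the same image. Is that the main drawback, or is there something deeper?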