[Discussion] Why do current large-scale VL models rely on self-attention within images for feature extraction rather than using cross-attention based on the content of the question?
For most VLMs, I find that the image feature representations are fixed. Although cross-attention is mentioned in the Qwen-VL paper, in practice it appears to be just an adapter. Why don't these models use the text question as the query (`q`) to perform cross-attention for image feature extraction? There must be some unacceptable drawback to this approach.
Can anyone explain the reason behind this?
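
To make the question concrete, here is a minimal PyTorch sketch of what I have in mind. The class, names, and dimensions are invented for illustration and don't come from any particular model:

```python
import torch
import torch.nn as nn

class QuestionConditionedPooler(nn.Module):
    """Hypothetical sketch: pool image features with the question as the query.

    Contrast with the usual pipeline, where image tokens are produced once by
    a vision encoder (plus a small adapter) regardless of the question asked.
    """

    def __init__(self, d_model: int = 768, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, question_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # q comes from the text question; k/v come from the image patches,
        # so the extracted image representation depends on what is asked.
        fused, _ = self.attn(query=question_tokens,
                             key=image_tokens,
                             value=image_tokens)
        return fused

# Usage: one question (12 tokens) attending over 256 image patch features.
pooler = QuestionConditionedPooler()
q = torch.randn(1, 12, 768)     # question token embeddings
img = torch.randn(1, 256, 768)  # ViT patch features
out = pooler(q, img)            # shape (1, 12, 768), question-dependent
```

One obvious cost of this design is that the image features can no longer be precomputed or cached: they would have to be re-extracted for every new question about the same image. Is that the main drawback, or is there something deeper?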