About the details of visual abastractor #10

LiuRicky · 2023-05-02T03:56:52Z

First of all, thanks for your great work.
From the paper, I see learnable queries in visual abastractor. I think it may be similar to Perceiver in Flamingo or Q-Former in BLIP-2. But I don't find the implementation in your code about learnable queries (mPLUG_OwlVisualAbstractorEncoder and mPLUG_OwlVisualAbstractorModel in modeling_mplug_owl.py).
I am curious about the details of visual abastractor. In other words, is it seems to Q-Former or Perceiver? The details do not contain in your paper and I cannot find in the code.
Thanks again.

LiuRicky · 2023-05-02T03:59:15Z

Could you give a detailed description (in text) of the implementation of your visual abstractor?

LukeForeverYoung · 2023-05-03T06:21:56Z

We put the query_embed in mPLUG_OwlModel and pass it to Visual Abstractor during forward. The implement of Visual Abstractor is similar to the Perceiver in Flamingo, except that we use FFNs the same as LLAMA.
Referred to mPLUG and mPLUG-2, we apply abstractor to reduce the length of token length and help model to learn visual knowledge in language space.

MAGAer13 · 2023-05-03T06:28:41Z

First of all, thanks for your great work. From the paper, I see learnable queries in visual abastractor. I think it may be similar to Perceiver in Flamingo or Q-Former in BLIP-2. But I don't find the implementation in your code about learnable queries (mPLUG_OwlVisualAbstractorEncoder and mPLUG_OwlVisualAbstractorModel in modeling_mplug_owl.py). I am curious about the details of visual abastractor. In other words, is it seems to Q-Former or Perceiver? The details do not contain in your paper and I cannot find in the code. Thanks again.

Hi, just for additional claim. The aim of visual abstractor is to reduce the number of patches for images which would result in a large number of token (256 for ViT/L-14 with 224x224 resolution) for the LLM. The maximum token for LLMs such as LLaMA, Bloom are 2048 where 256 is relatively large number for it. However, it did not happen to flamingo since it utilizes cross-attention. So the purpose is different. Besides, since we want to learn some useful features such as region or object features from the image, as practiced by mPLUG-2, which also leverages similar idea and verified by the visualization of attention map for the learnable queries.

MAGAer13 assigned LukeForeverYoung May 2, 2023

MAGAer13 added the question Further information is requested label May 2, 2023

LukeForeverYoung closed this as completed May 3, 2023

MAGAer13 mentioned this issue May 17, 2023

what is the network architecture of the visual abstractor? #40

Closed

jp1924 mentioned this issue Apr 17, 2024

Question about Abstractor's FFN and Attention #219

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About the details of visual abastractor #10

About the details of visual abastractor #10

LiuRicky commented May 2, 2023

LiuRicky commented May 2, 2023

LukeForeverYoung commented May 3, 2023 •

edited

MAGAer13 commented May 3, 2023

About the details of visual abastractor #10

About the details of visual abastractor #10

Comments

LiuRicky commented May 2, 2023

LiuRicky commented May 2, 2023

LukeForeverYoung commented May 3, 2023 • edited

MAGAer13 commented May 3, 2023

LukeForeverYoung commented May 3, 2023 •

edited