About the details of visual abstractor #10

Closed
LiuRicky opened this issue May 2, 2023 · 3 comments
Labels: question (Further information is requested)

@LiuRicky

LiuRicky commented May 2, 2023

First of all, thanks for your great work.
From the paper, I see learnable queries in the visual abstractor. I think it may be similar to the Perceiver in Flamingo or the Q-Former in BLIP-2, but I cannot find the implementation of the learnable queries in your code (mPLUG_OwlVisualAbstractorEncoder and mPLUG_OwlVisualAbstractorModel in modeling_mplug_owl.py).
I am curious about the details of the visual abstractor. In other words, is it similar to the Q-Former or the Perceiver? These details are not in the paper and I cannot find them in the code.
Thanks again.

@LiuRicky
Author

LiuRicky commented May 2, 2023

Could you give a detailed description (in text) of the implementation of your visual abstractor?

@MAGAer13 added the question label (Further information is requested) on May 2, 2023
@LukeForeverYoung
Collaborator

LukeForeverYoung commented May 3, 2023

We put the query_embed in mPLUG_OwlModel and pass it to the Visual Abstractor during forward. The implementation of the Visual Abstractor is similar to the Perceiver in Flamingo, except that we use FFNs the same as in LLaMA.
Following mPLUG and mPLUG-2, we apply the abstractor to reduce the token length and to help the model learn visual knowledge in the language space.
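
For readers looking for a concrete picture, here is a minimal sketch of what such a Perceiver-style abstractor could look like, assuming learnable queries cross-attend over the ViT patch features and then pass through a LLaMA-style (SwiGLU) FFN. The class names, number of queries, layer count, and hidden sizes below are illustrative assumptions, not the actual code in modeling_mplug_owl.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAbstractorLayerSketch(nn.Module):
    """Illustrative Perceiver-style layer: the queries cross-attend to the image
    patch features and then go through a LLaMA-style (SwiGLU) FFN. All names and
    sizes here are assumptions, not the actual mPLUG-Owl implementation."""

    def __init__(self, hidden_size=1024, num_heads=16, ffn_mult=4):
        super().__init__()
        self.norm_q = nn.LayerNorm(hidden_size)
        self.norm_kv = nn.LayerNorm(hidden_size)
        self.cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # Gated FFN as in LLaMA (SwiGLU) instead of the usual GELU MLP.
        inner = ffn_mult * hidden_size
        self.gate_proj = nn.Linear(hidden_size, inner, bias=False)
        self.up_proj = nn.Linear(hidden_size, inner, bias=False)
        self.down_proj = nn.Linear(inner, hidden_size, bias=False)
        self.norm_ffn = nn.LayerNorm(hidden_size)

    def forward(self, queries, patch_feats):
        # queries:     (B, num_queries, hidden)
        # patch_feats: (B, num_patches, hidden) -- ViT outputs
        kv = self.norm_kv(patch_feats)
        attn_out, _ = self.cross_attn(self.norm_q(queries), kv, kv)
        x = queries + attn_out
        h = self.norm_ffn(x)
        return x + self.down_proj(F.silu(self.gate_proj(h)) * self.up_proj(h))


class VisualAbstractorSketch(nn.Module):
    """Stack of layers; the learnable query_embed itself is created in the
    top-level model (as described above) and passed in at forward time."""

    def __init__(self, hidden_size=1024, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            VisualAbstractorLayerSketch(hidden_size) for _ in range(num_layers))

    def forward(self, query_embed, patch_feats):
        # query_embed: (1, num_queries, hidden) learnable parameter, expanded per batch
        q = query_embed.expand(patch_feats.size(0), -1, -1)
        for layer in self.layers:
            q = layer(q, patch_feats)
        return q  # (B, num_queries, hidden): fixed-length visual tokens for the LLM


# Usage (shapes only): 256 ViT-L/14 patch tokens compressed to, e.g., 64 query tokens.
query_embed = nn.Parameter(torch.randn(1, 64, 1024) * 0.02)  # would live in mPLUG_OwlModel
abstractor = VisualAbstractorSketch()
out = abstractor(query_embed, torch.randn(2, 256, 1024))      # -> (2, 64, 1024)
```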

@MAGAer13
Collaborator

MAGAer13 commented May 3, 2023

> I am curious about the details of the visual abstractor. In other words, is it similar to the Q-Former or the Perceiver?

Hi, just an additional note. The aim of the visual abstractor is to reduce the number of patch tokens per image, which would otherwise be a large number of tokens for the LLM (256 for ViT-L/14 at 224x224 resolution). The maximum sequence length for LLMs such as LLaMA and BLOOM is 2048, so 256 tokens per image is relatively large. This problem does not arise in Flamingo, since it feeds visual features to the LLM through cross-attention rather than as input tokens, so the purpose is different. Besides, we want the learnable queries to capture useful features from the image, such as region or object features, following mPLUG-2, which leverages a similar idea; this is verified by visualizing the attention maps of the learnable queries.
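
For concreteness, the token arithmetic above can be checked directly; the 64-query figure below is an assumed example for illustration, not the official configuration:

```python
# ViT-L/14 at 224x224 resolution: (224 / 14)^2 patch tokens per image.
patches_per_image = (224 // 14) ** 2      # = 256
llm_context = 2048                        # max sequence length of LLaMA / BLOOM
print(patches_per_image / llm_context)    # 0.125 -> raw patches use 1/8 of the context

num_queries = 64                          # assumed value, for illustration only
print(num_queries / llm_context)          # 0.03125 -> far smaller share per image
```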
