
The difference with BLIP2? #7

Closed
hello451 opened this issue Apr 17, 2023 · 6 comments

Comments

@hello451

It seems that MiniGPT-4 is just BLIP-2 with the LLM swapped out for an open-source GPT-style model?

@TsuTikgiau
Collaborator

The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We noticed that BLIP-2's training strategy is not enough to align the vision module well with a powerful LLM like Vicuna, and it seriously hurts Vicuna's text generation ability. Therefore, we propose a novel way to collect a small yet high-quality image-description dataset, created by the model itself and polished by ChatGPT. After a traditional image-text training stage like BLIP-2's, we further fine-tune MiniGPT-4 on this dataset with conversation prompts, so that MiniGPT-4 can generate coherent answers to users' questions and becomes much more usable. This fine-tuning stage is very efficient and can be finished in 7 minutes on a single A100, yet its effect is significant.
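For readers curious what the second-stage data could look like in practice, here is a minimal sketch of wrapping one curated image-description pair into a conversation-style prompt. The template string, placeholder token, and function name below are illustrative assumptions, not necessarily the repository's actual format.

```python
# Hypothetical sketch of second-stage conversation-style formatting.
# IMG_PLACEHOLDER and the "###Human/###Assistant" template are assumptions
# for illustration, not the repo's exact prompt format.
IMG_PLACEHOLDER = "<ImageHere>"

def build_conversation_sample(description: str) -> dict:
    """Wrap one curated image description into a chat-style training sample."""
    prompt = (
        f"###Human: <Img>{IMG_PLACEHOLDER}</Img> "
        "Describe this image in detail. "
        "###Assistant: "
    )
    # At training time the placeholder would be replaced by projected image
    # embeddings, and the loss computed only on the assistant's answer.
    return {"prompt": prompt, "target": description}

sample = build_conversation_sample("A golden retriever runs across a sunlit meadow.")
```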

Another important finding is that we don't fine-tune the Q-Former as BLIP-2 does; instead, we directly use the Q-Former that was already aligned with FlanT5 and only train a single projection layer. We show that such a simple linear layer is enough to let Vicuna see the image, which makes our training very efficient.
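To make the "single projection layer" idea concrete, here is a minimal PyTorch sketch. The module names, call signatures, and dimensions (`vit`, `qformer`, `llm`, `qformer_dim`, `llm_dim`) are placeholders assumed for illustration, not the repository's actual API.

```python
import torch
import torch.nn as nn

class FrozenBackboneProjector(nn.Module):
    """Sketch: frozen vision encoder + frozen Q-Former, one trainable linear layer.

    `vit`, `qformer`, and `llm` stand in for the pretrained BLIP-2 vision
    encoder, the Q-Former checkpoint aligned with FlanT5, and Vicuna.
    Only `self.proj` receives gradients.
    """

    def __init__(self, vit, qformer, llm, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vit, self.qformer, self.llm = vit, qformer, llm
        for module in (self.vit, self.qformer, self.llm):
            for p in module.parameters():
                p.requires_grad = False  # keep all pretrained weights frozen
        # The only trainable parameters: map Q-Former query outputs
        # into the LLM's token-embedding space.
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patch_feats = self.vit(image)          # (B, num_patches, vit_dim)
            query_out = self.qformer(patch_feats)  # (B, num_queries, qformer_dim)
        return self.proj(query_out)                # (B, num_queries, llm_dim)
```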

@vateye

Can you provide more samples for Stage-1 training to verify that Stage-2 is needed?

@TsuTikgiau
Collaborator

We plan to update our paper in 2 days to provide some qualitative and quantitative comparisons between stage-1 and stage-2. Stay tuned!

@Pilot-LH

Excellent job! I have some inquiries regarding the model:

  1. Based on my understanding, BLIP-2 has two training stages, and in this model you are directly using the pre-trained Q-Former from BLIP-2's second stage, aligned with FlanT5. Please correct me if I am mistaken.
  2. It is intriguing that only a linear layer needs to be trained on top of BLIP-2's components, rather than the Q-Former itself. When you mentioned "after the traditional image-text training stage like BLIP-2 did," were you referring to the first stage, the second stage, or both stages of BLIP-2? As far as I know, the first stage of BLIP-2 is not traditional, and it is the key to BLIP-2's success (as shown in Figure 5 of the BLIP-2 paper).
  3. Fine-tuning on a small but high-quality dataset appears to be quite effective. Could BLIP-2 also benefit from this approach? I ask because Vicuna is not entirely open source.

@TsuTikgiau
Collaborator

@Pilot-LH Thanks for your interest!
A1. Yes, you are correct: we directly use the Q-Former that was aligned with FlanT5 XXL in our model.
A2. Here I mean the second stage of BLIP-2, as our first-stage pre-training is quite similar to BLIP-2's second-stage training. The difference is that we only train one linear layer.
A3. This is a good question. We haven't tried this, but I think the reason it works in our case is that Vicuna alone is already a close-to-ChatGPT-level model with powerful conversation ability. The second-stage fine-tuning reactivates this ability when visual input is given, so the training is light. In contrast, Flan-T5's conversation ability is weak, so I guess Flan-T5 would first need to learn how to chat well with humans, and our small dataset is not enough to teach it that. I expect there will soon be fully open-sourced LLMs that work like Vicuna, since the recipe for building Vicuna is clear, and I think our training method can be applied directly once such an LLM is ready.

@Pilot-LH

Thank you for your response. I now have a clear understanding of the model.
I agree with your point that this approach can be applied to other large language models (LLMs).
In my opinion, one of the major challenges for the open-source community is to reproduce LLaMA. Once this is accomplished, there will likely be much more advanced models available than the current Vicuna.
