
The difference with BLIP2? #7

Closed
hello451 opened this issue Apr 17, 2023 · 6 comments

Comments

@hello451

It seems that MiniGPT-4 is just BLIP-2 with the LLM swapped out for an open-source GPT-style model?

@TsuTikgiau
Collaborator

The main difference between MiniGPT-4 and BLIP-2 is the training strategy. We noticed that BLIP-2's training strategy is not enough to align the vision module well with a powerful LLM like Vicuna, and it seriously hurts Vicuna's text generation ability. Therefore, we propose a novel way to collect a small yet high-quality image-description dataset, created by the model itself and polished by ChatGPT. After a traditional image-text training stage like BLIP-2's, we further fine-tune MiniGPT-4 on this dataset with conversation prompts, so that MiniGPT-4 can generate coherent answers to users' questions and becomes much more usable. This fine-tuning stage is very efficient and can be finished in 7 minutes on a single A100, yet its effect is significant.
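For readers curious what the second-stage data could look like in practice, here is a minimal sketch of wrapping one curated image-description pair into a conversation-style prompt. The template string, placeholder token, and function name below are illustrative assumptions, not necessarily the repository's actual format.

```python
# Hypothetical sketch of second-stage conversation-style formatting.
# IMG_PLACEHOLDER and the "###Human/###Assistant" template are assumptions
# for illustration, not the repo's exact prompt format.
IMG_PLACEHOLDER = "<ImageHere>"

def build_conversation_sample(description: str) -> dict:
    """Wrap one curated image description into a chat-style training sample."""
    prompt = (
        f"###Human: <Img>{IMG_PLACEHOLDER}</Img> "
        "Describe this image in detail. "
        "###Assistant: "
    )
    # At training time the placeholder would be replaced by projected image
    # embeddings, and the loss computed only on the assistant's answer.
    return {"prompt": prompt, "target": description}

sample = build_conversation_sample("A golden retriever runs across a sunlit meadow.")
```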

Another important finding is that we don't fine-tune the Q-Former as BLIP-2 does; instead, we directly use the Q-Former that was already aligned with FlanT5 and only train a single projection layer. We show that such a simple linear layer is enough to let Vicuna see the image, which makes our training very efficient.
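To make the "single projection layer" idea concrete, here is a minimal PyTorch sketch. The module names, call signatures, and dimensions (`vit`, `qformer`, `llm`, `qformer_dim`, `llm_dim`) are placeholders assumed for illustration, not the repository's actual API.

```python
import torch
import torch.nn as nn

class FrozenBackboneProjector(nn.Module):
    """Sketch: frozen vision encoder + frozen Q-Former, one trainable linear layer.

    `vit`, `qformer`, and `llm` stand in for the pretrained BLIP-2 vision
    encoder, the Q-Former checkpoint aligned with FlanT5, and Vicuna.
    Only `self.proj` receives gradients.
    """

    def __init__(self, vit, qformer, llm, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.vit, self.qformer, self.llm = vit, qformer, llm
        for module in (self.vit, self.qformer, self.llm):
            for p in module.parameters():
                p.requires_grad = False  # keep all pretrained weights frozen
        # The only trainable parameters: map Q-Former query outputs
        # into the LLM's token-embedding space.
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def encode_image(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patch_feats = self.vit(image)          # (B, num_patches, vit_dim)
            query_out = self.qformer(patch_feats)  # (B, num_queries, qformer_dim)
        return self.proj(query_out)                # (B, num_queries, llm_dim)
```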

@vateye

Can you provide more samples for Stage-1 training to verify that Stage-2 is needed?

@TsuTikgiau
Collaborator

We plan to update our paper in 2 days to provide some qualitative and quantitative comparisons between stage-1 and stage-2. Stay tuned!

@Pilot-LH

Excellent job! I have some inquiries regarding the model:

  1. Based on my understanding, BLIP-2 has two training stages, and in this model you are directly using the pre-trained Q-Former from BLIP-2's second stage, aligned with FlanT5. Please correct me if I am mistaken.
  2. It is intriguing that only a linear layer needs to be trained on top of BLIP-2's components, rather than the Q-Former itself. When you mentioned "after the traditional image-text training stage like BLIP-2 did," were you referring to the first stage, the second stage, or both stages of BLIP-2? As far as I know, the first stage of BLIP-2 is not traditional, and it is the key to BLIP-2's success (as shown in Figure 5 of the BLIP-2 paper).
  3. Fine-tuning on a small but high-quality dataset appears to be quite effective. Could BLIP-2 also benefit from this approach? I ask because Vicuna is not entirely open source.

@TsuTikgiau
Collaborator

@Pilot-LH Thanks for your interest!
A1. Yes, you are correct: we directly use the Q-Former that was aligned with FlanT5 XXL in our model.
A2. Here I mean the second stage of BLIP-2, as our first-stage pre-training is quite similar to BLIP-2's second-stage training. The difference is that we only train one linear layer.
A3. This is a good question. We haven't tried this, but I think the reason it works in our case is that Vicuna alone is already a close-to-ChatGPT-level model with powerful conversation ability. The second-stage fine-tuning reactivates this ability when visual input is given, so the training is light. In contrast, Flan-T5's conversation ability is weak, so I guess Flan-T5 would first need to learn how to chat well with humans, and our small dataset is not enough to teach it that. I expect there will soon be fully open-sourced LLMs that work like Vicuna, since the recipe for building Vicuna is clear, and I think our training method can be applied directly once such an LLM is ready.

@Pilot-LH

Thank you for your response. I now have a clear understanding of the model.
I agree with your point that this approach can be applied to other large language models (LLMs).
In my opinion, one of the major challenges for the open-source community is to reproduce LLaMA. Once this is accomplished, there will likely be much more advanced models available than the current Vicuna.
