Details about VLM baselines #4

lazyLuizi · 2024-05-07T13:15:31Z

Thanks for sharing your great work!
I have a few questions about your work, especially regarding the baselines.

Did you fine-tune the VLMs reported in Table 1? I got confused because Section 3.2 primarily mentioned EMMA for the training details.
Could you share the prompt you used? I tried to generate action sequences with InstructBLIP, but it did not work well for me.

Also, I am looking forward to your code release!

stevenyangyj · 2024-06-03T13:24:24Z

Sorry for the late reply; I have been too busy to respond. Let me answer your questions one by one:

Yes, I did. You can find the fine-tuning configuration in the appendix of the paper. I followed the same fine-tuning procedure as InstructBLIP, except that I removed the text input to the Q-Former.
I have also shared the prompts I used in the appendix of the paper; the format is exactly the same as the one I used in the experiments. I would not expect an original InstructBLIP model to work before it is fine-tuned and aligned with the environment dynamics via our proposed DAgger-DPO algorithm (Algorithm 1 in the paper).
I have released the code for DAgger-DPO; please refer to this

Best
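
For readers following this thread: the reply above says the VLM only works after it is aligned with DAgger-DPO. As a rough illustration of the preference-loss half of that name, here is a minimal sketch of the generic DPO loss for a single (chosen, rejected) pair. It is not taken from the released code; the function name `dpo_loss`, its arguments, and the default `beta` are hypothetical, and how Algorithm 1 in the paper builds its preference pairs is not described in this thread.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one preference pair (illustrative sketch only).

    Each argument is the summed log-probability of a whole action sequence
    under either the policy being trained or a frozen reference model.
    `beta` is an assumed default, not a value from the paper.
    """
    # How much more the policy favours each sequence than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = chosen_gain - rejected_gain

    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # The policy has moved towards the chosen sequence relative to the
    # reference, so the loss falls below log(2), its value at zero margin.
    print(dpo_loss(-0.5, -2.0, -1.0, -1.0))
```

The loss equals log(2) when the policy and the reference agree, and shrinks as the policy raises the chosen sequence's likelihood relative to the rejected one.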
