Question about merlot model #4
Comments
thanks for all the kind words!
[pos0], [image_unk_1], [image_unk_0], [pos3] (written in the right order). hope that helps :)
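To make the token sequence above concrete, here is a minimal sketch of how such a sequence could be produced. It is not the actual MERLOT code: the function name `scramble_position_tokens`, its parameters, and the use of plain strings in place of learned embeddings are all assumptions made for illustration. It assumes each scrambled frame gets its own unique `[image_unk_k]` token while untouched frames keep `[pos_t]`.

```python
import random


def scramble_position_tokens(num_frames, num_scrambled, rng):
    """Return one position-token name per frame, listed in true temporal order.

    Hypothetical helper for illustration only; strings stand in for embeddings.
    """
    # Start with the true position token for every frame.
    tokens = [f"[pos{t}]" for t in range(num_frames)]
    # Pick which frames lose their true position.
    chosen = rng.sample(range(num_frames), num_scrambled)
    # Assign unique unk ids in random order so the id does not reveal the position.
    unk_ids = list(range(num_scrambled))
    rng.shuffle(unk_ids)
    for frame_idx, unk_id in zip(chosen, unk_ids):
        tokens[frame_idx] = f"[image_unk_{unk_id}]"
    return tokens


rng = random.Random(0)
print(scramble_position_tokens(num_frames=4, num_scrambled=2, rng=rng))
```

With 4 frames and 2 scrambled, one possible result is the sequence shown above, where the two middle frames carry unk tokens and the first and last keep their true positions.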
Dear Rowan,
Thanks a lot!!
Thank you!!
Dear Rowan,
Hi, I came across this paper recently and think it is of great value. I understand nearly all of its details except the model. I know the details are in the code, but I am not familiar with TensorFlow; if you could explain the points below, I would find the code much easier to follow. Could you answer my questions when you have time?
1. What does chunk mean in the code? Does it represent the maximum number of segments a video is split into?
2. In 3.2, you say that MERLOT takes multiple unordered video frames as input, but in the Joint Vision-Language Encoder part, you say that position embeddings are added to the vision components. Do you mean that, when fed into the model, an image and its corresponding sentence have the same position embedding?
3. In 3.3, in the Temporal Reordering part, I understand the core idea, but I am not sure about the method. Is it correct that you randomly choose i frames and then replace the position embeddings of these frames with the same embedding [image_unk_0]?
Best regards,
Zihao