
MVM for CLIP feature #12

Closed
vateye opened this issue Nov 28, 2022 · 3 comments
vateye commented Nov 28, 2022

Hi, I would like to know how to compute the loss between the VideoSwin and CLIP features in the latest paper. The Swin family uses a patch size of 4×4, whereas ViT uses 16, so how do you compute a loss (e.g., an L1 loss) between the two?

Thanks.

@tsujuifu (Owner)
Hi vateye,

Thanks for your interest in our empirical study.

We use VidSwin and CLIP (ViT-B/32), both with a patch size of 32.
For example, a 224² video frame yields 7² = 49 visual features.
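The arithmetic above can be sketched as follows (a minimal check, assuming square frames and non-overlapping patches):

```python
# With a shared patch size of 32, VidSwin and CLIP ViT-B/32 produce
# feature grids of the same resolution, so a per-patch loss lines up 1:1.
frame_size = 224
patch_size = 32

patches_per_side = frame_size // patch_size  # 224 / 32 = 7
num_patches = patches_per_side ** 2          # 7^2 = 49 visual features per frame

print(patches_per_side, num_patches)  # → 7 49
```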


vateye commented Dec 7, 2022

Thanks! Meanwhile, I am curious how you apply Random Masking and Blockwise Masking together at a 30% ratio. Is there any code for this? These two seem to be disjoint masking strategies.


tsujuifu commented Dec 7, 2022

This code only includes BM and AM.
During pre-training, we randomly choose between them; you can also use a different sampling probability.
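A minimal sketch of what "randomly choosing between strategies per sample" could look like. All function names here are illustrative, not from the released code; it pairs a uniform random mask with a simple square blockwise mask on the 7×7 patch grid discussed above, at a 30% target ratio:

```python
import random
import numpy as np

def random_mask(num_patches: int, ratio: float) -> np.ndarray:
    """Uniformly mask `ratio` of the patches, at arbitrary positions."""
    mask = np.zeros(num_patches, dtype=bool)
    idx = np.random.choice(num_patches, size=int(num_patches * ratio), replace=False)
    mask[idx] = True
    return mask

def blockwise_mask(grid: int, ratio: float) -> np.ndarray:
    """Mask one contiguous square block covering roughly `ratio` of a grid x grid layout."""
    mask = np.zeros((grid, grid), dtype=bool)
    side = max(1, round(grid * ratio ** 0.5))   # block side so that area ~= ratio
    r = random.randint(0, grid - side)
    c = random.randint(0, grid - side)
    mask[r:r + side, c:c + side] = True
    return mask.reshape(-1)

def sample_mask(grid: int = 7, ratio: float = 0.3) -> np.ndarray:
    """Pick one masking strategy at random per sample (equal probability here)."""
    if random.random() < 0.5:
        return random_mask(grid * grid, ratio)
    return blockwise_mask(grid, ratio)

mask = sample_mask()
print(mask.shape, int(mask.sum()))
```

Each training sample draws exactly one strategy, so the two masks never need to be combined; adjusting the `0.5` threshold changes the mixing probability.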
