
MVM for CLIP feature #12

Closed
vateye opened this issue Nov 28, 2022 · 3 comments
vateye commented Nov 28, 2022

Hi, I would like to know how to compute the loss between the VideoSwin and CLIP features in the latest paper. The Swin family uses a patch size of 4×4, whereas ViT uses 16, so how do you compute a loss (e.g., an L1 loss) between the two?

Thanks.

@tsujuifu (Owner)
Hi vateye,

Thanks for your interest in our empirical study.

We use VidSwin and CLIP (ViT-B/32), both with a patch size of 32.
For example, a 224² video frame yields 7² = 49 visual features.
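The arithmetic above can be sketched as follows (a minimal check, assuming square frames and non-overlapping patches):

```python
# With a shared patch size of 32, VidSwin and CLIP ViT-B/32 produce
# feature grids of the same resolution, so a per-patch loss lines up 1:1.
frame_size = 224
patch_size = 32

patches_per_side = frame_size // patch_size  # 224 / 32 = 7
num_patches = patches_per_side ** 2          # 7^2 = 49 visual features per frame

print(patches_per_side, num_patches)  # → 7 49
```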


vateye commented Dec 7, 2022

Thanks! Meanwhile, I am curious how you apply Random Masking and Blockwise Masking together at a 30% ratio. Is there any code for this? These two seem to be disjoint masking strategies.


tsujuifu commented Dec 7, 2022

This code only includes BM and AM.
During pre-training, we randomly choose between them; you can also use a different sampling probability.
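A minimal sketch of what "randomly choosing between strategies per sample" could look like. All function names here are illustrative, not from the released code; it pairs a uniform random mask with a simple square blockwise mask on the 7×7 patch grid discussed above, at a 30% target ratio:

```python
import random
import numpy as np

def random_mask(num_patches: int, ratio: float) -> np.ndarray:
    """Uniformly mask `ratio` of the patches, at arbitrary positions."""
    mask = np.zeros(num_patches, dtype=bool)
    idx = np.random.choice(num_patches, size=int(num_patches * ratio), replace=False)
    mask[idx] = True
    return mask

def blockwise_mask(grid: int, ratio: float) -> np.ndarray:
    """Mask one contiguous square block covering roughly `ratio` of a grid x grid layout."""
    mask = np.zeros((grid, grid), dtype=bool)
    side = max(1, round(grid * ratio ** 0.5))   # block side so that area ~= ratio
    r = random.randint(0, grid - side)
    c = random.randint(0, grid - side)
    mask[r:r + side, c:c + side] = True
    return mask.reshape(-1)

def sample_mask(grid: int = 7, ratio: float = 0.3) -> np.ndarray:
    """Pick one masking strategy at random per sample (equal probability here)."""
    if random.random() < 0.5:
        return random_mask(grid * grid, ratio)
    return blockwise_mask(grid, ratio)

mask = sample_mask()
print(mask.shape, int(mask.sum()))
```

Each training sample draws exactly one strategy, so the two masks never need to be combined; adjusting the `0.5` threshold changes the mixing probability.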
