Cannot replicate text-to-video R@1 results on MSR-VTT dataset #3
Comments
Part of Docker File 1 (this is for CUDA 11.1):
Part of Docker File 2 (this is for CUDA 10.2):
Both images listed above are tested and get around 47.0 on the MSR-VTT R@1 metric (ViT-B/32).
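Since the Dockerfile contents themselves are not quoted above, the following is only a rough sketch of the kind of environment pinning the two images imply; the PyTorch/torchvision versions and extra packages are assumptions chosen to match the two CUDA versions, not the maintainer's actual files.

```bash
# Hypothetical environment setup, not the maintainer's actual Dockerfiles.
# Image 1: CUDA 11.1 -- install a cu111 build of PyTorch (versions are guesses).
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 \
    -f https://download.pytorch.org/whl/torch_stable.html

# Image 2: CUDA 10.2 -- the default PyPI wheels for this release are built against CUDA 10.2.
pip install torch==1.8.1 torchvision==0.9.1

# Common extras used by CLIP4Clip-style retrieval code (also an assumption).
pip install ftfy regex tqdm opencv-python boto3 requests pandas
```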
Thanks a lot for your response and for sharing the settings. Really appreciate your help. In addition to those, could you please also share the config for the ViT-B/16 experiments on MSR-VTT? Specifically: (1) what is the video compression fps? (2) what are the training batch size, max frames, and max words? (3) did you use multiple compute nodes with more than 8 GPUs to finish the ViT-B/16 training? It looks like ViT-B/16 consumes a lot more GPU memory, which could make it very difficult to fit on a single node with 8 GPUs. Thanks again and looking forward to your reply.
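For reference, the kind of launch we have in mind for a two-node ViT-B/16 run is sketched below; the script name and flag names follow common CLIP4Clip-style training code, and the values are just the usual ViT-B/32 defaults, so please treat everything here as an assumption rather than your actual command.

```bash
# Hypothetical two-node launch (run on node 0; repeat with --node_rank=1 on node 1).
# Script name, flags, and values mirror CLIP4Clip-style code and are assumptions.
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 \
    --master_addr="${MASTER_ADDR}" --master_port=29500 \
    main_task_retrieval.py \
    --do_train --num_thread_reader=8 \
    --epochs=5 --batch_size=128 --lr=1e-4 --coef_lr=1e-3 \
    --max_words=32 --max_frames=12 \
    --datatype=msrvtt --pretrained_clip_name=ViT-B/16 \
    --output_dir=ckpts/msrvtt_vitb16
```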
Hi! I also want to ask this question. Did you get an answer?
Thanks for your attention.
Dear authors,
First of all, thanks a lot for the great work and for sharing the code. I know there has been a long ongoing discussion about replicating the experimental results on the MSR-VTT dataset, but I want to create a new thread to make things clear, especially regarding the configurations needed to replicate the results reported in the paper.
So far we have tried many different hyper-parameter settings (such as changing the batch size, compressing videos at different fps, different values for the top-k frame token selection, random seeds, etc.), and the best text-to-video R@1 results we can get are 46.5 and 47.9 for ViT-B/32 and ViT-B/16 respectively, which falls short of the 47.0 and 49.4 reported in the paper.
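For concreteness, the compression step we have been varying is essentially the standard ffmpeg re-encoding shown below (resample the frame rate and rescale the short side to 224 pixels); the paths and exact values are placeholders rather than a confirmed setting.

```bash
# Illustrative compression command; paths and values are placeholders.
# Resample to 3 fps and scale the short side to 224 while keeping the aspect ratio.
ffmpeg -y -i input_video.mp4 -r 3 \
    -vf "scale='if(gt(iw,ih),-2,224)':'if(gt(iw,ih),224,-2)'" \
    -c:v libx264 -an output_video.mp4
```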
I understand there are many factors that can affect model training and hence the final evaluation results. Still, could you please share your configs, including the shell scripts with the hyper-parameters that give the best R@1, the video compression commands, and, if possible, the model checkpoint(s) for the best evaluation epoch(s)? We would like to use exactly the same setting to understand where the problem could be.
Thanks again for your help, and looking forward to your reply.