Cannot replicate text-to-video R@1 results on MSR-VTT dataset #3

Closed
jzk66596 opened this issue Oct 7, 2022 · 5 comments

@jzk66596 commented Oct 7, 2022

Dear authors,

First of all, thanks a lot for the great work and for sharing the code. I know there has been a long ongoing discussion about replicating the experimental results on the MSR-VTT data, but I want to start a new thread to make things clear, especially regarding the configurations needed to replicate the results reported in the paper.

So far we have tried many different hyper-parameter settings (batch size, video compression at different fps, different values of top-k frame token selection, random seeds, etc.), and the best text-to-video R@1 results we can get are 46.5 and 47.9 for ViT-B/32 and ViT-B/16 respectively, which fall short of the 47.0 and 49.4 reported in the paper.
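(For context, text-to-video R@1 is the percentage of text queries whose ground-truth video is ranked first. Below is a minimal sketch of the standard computation, assuming a precomputed text-to-video similarity matrix with the ground truth on the diagonal; the function name and the 1000x1000 size, matching the MSR-VTT 1k-A test split, are illustrative only:)

import numpy as np

def recall_at_k(sim, k=1):
    # sim: (num_texts, num_videos); the ground-truth video for text i is column i.
    order = np.argsort(-sim, axis=1)        # videos sorted by descending similarity
    gt = np.arange(sim.shape[0])[:, None]
    ranks = np.argmax(order == gt, axis=1)  # rank of the correct video (0 = best)
    return 100.0 * np.mean(ranks < k)

sim = np.random.randn(1000, 1000)           # stand-in for real model scores
print(f"R@1 = {recall_at_k(sim):.1f}")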

I understand there are many factors that can affect model training and hence the final evaluation results. Still, could you please share your configs, including the shell scripts with the hyper-parameters for the best R@1, the video compression commands, and, if possible, the model checkpoint for the best evaluation epoch(s)? We would like to use the exact same setting to understand where the problem could be.

Thanks again for your help, and looking forward to your reply.

@LiuRicky (Owner) commented Oct 17, 2022

Part of Docker File 1, this is for CUDA 11.1:

RUN source ~/.bashrc && conda activate env-3.8.8 \
    && pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html \
    && pip install timm==0.4.12 transformers==4.15.0 fairscale==0.4.4 pycocoevalcap decord \
    && conda install -y ruamel_yaml \
    && pip install numpy opencv-python Pillow pyyaml requests scikit-image scipy tqdm regex easydict scikit-learn \
    && pip install mmcv terminaltables tensorboardX python-magic faiss-gpu imageio-ffmpeg \
    && pip install yacs Cython tensorboard gdown termcolor tabulate xlrd==1.2.0 \
    && pip install ffmpeg-python librosa pydub pytorch_lightning torchlibrosa \
    && pip install gpustat einops ftfy boto3 pandas \
    && pip install git+https://github.com/openai/CLIP.git

@LiuRicky (Owner) commented Oct 17, 2022

Part of Docker File 2, this is for CUDA 10.2:

RUN source ~/.bashrc && conda activate env-3.6.8 \
    && pip install torch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 tensorflow==2.3.0 transformers==4.15.0 mxnet==1.9.0 \
    && pip install numpy opencv-python Pillow pyyaml requests scikit-image scipy tqdm regex easydict scikit-learn \
    && pip install mmcv terminaltables tensorboardX python-magic faiss-gpu imageio-ffmpeg \
    && pip install yacs Cython tensorboard gdown termcolor tabulate xlrd==1.2.0 \
    && pip install ffmpeg-python librosa pydub pytorch_lightning torchlibrosa \
    && pip install line_profiler imagehash cos-python-sdk-v5 thop einops timm pycm \
    && pip install moviepy openpyxl lmdb \
    && pip install qqseg==1.14.1 jieba \
    && pip install ftfy regex tqdm scipy opencv-python boto3 requests pandas

Both images listed above have been tested and reach around 47.0 R@1 on MSR-VTT (ViT-B/32).
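
(As a quick sanity check inside either container, a short snippet like the following, which is not part of the original setup, confirms that the pinned PyTorch build sees CUDA:)

import torch

print(torch.__version__)           # e.g. 1.10.0+cu111 (image 1) or 1.8.0 (image 2)
print(torch.version.cuda)          # CUDA version the wheel was built against
print(torch.cuda.is_available())   # should print True on a GPU machine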

@jzk66596 (Author) commented

> Part of Docker File 2, this is for CUDA 10.2: RUN source ~/.bashrc && conda activate env-3.6.8 && pip install … (quoted from the comment above)
> Both images listed above have been tested and reach around 47.0 R@1 on MSR-VTT (ViT-B/32).

Thanks a lot for your response and for sharing the settings. I really appreciate your help. In addition to those, could you please also share the config for the ViT-B/16 experiments on MSR-VTT?

Specifically: (1) what is the video compression fps? (2) what are the training batch size, max frames, and max words? (3) did you use multiple compute nodes with >8 GPUs to finish the ViT-B/16 training? ViT-B/16 consumes considerably more GPU memory, which could make it very difficult to fit on a single node with 8 GPUs.

Thanks again and looking forward to your reply.

@tiesanguaixia commented

> Part of Docker File 2, this is for CUDA 10.2: RUN source ~/.bashrc && … (quoted from the comments above)
> Thanks a lot for your response and for sharing the settings. … could you please also share the config for the ViT-B/16 experiments on MSR-VTT? … (quoted from the reply above)

Hi! I also want to ask this question. Did you get an answer?

@LiuRicky (Owner) commented Mar 29, 2023

> Specifically: (1) what is the video compression fps? (2) what are the training batch size, max frames, and max words? (3) did you use multiple compute nodes with >8 GPUs to finish the ViT-B/16 training? … (quoted from the comments above)
> Hi! I also want to ask this question. Did you get an answer?

Thanks for your attention.

  1. Video compression fps = 3 (a sketch of such a command follows below).
  2. I think all the settings you asked about can be found in the implementation details section of the paper; please check that section for more. For example, the training batch size is 128.
  3. Sure, you can use multiple nodes to train ViT-B/16, but a single node with 8 large-memory GPUs can also finish the training.
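
(For concreteness, here is a sketch of the kind of fps-3 compression this implies, using the ffmpeg available in the images above; the exact command was not shared, so the paths and flags below are assumptions:)

import subprocess
from pathlib import Path

src_dir = Path("msrvtt/videos")        # hypothetical input directory
dst_dir = Path("msrvtt/videos_fps3")   # hypothetical output directory
dst_dir.mkdir(parents=True, exist_ok=True)

for video in sorted(src_dir.glob("*.mp4")):
    # -r 3 resamples the output to 3 frames per second; -y overwrites existing files.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(video), "-r", "3", str(dst_dir / video.name)],
        check=True,
    )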
