This is the official source code for our paper: Learn to Understand Negation in Video Retrieval.
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install the required packages.
```bash
conda create -n py37 python==3.7 -y
conda activate py37
git clone git@github.com:ruc-aimc-lab/nT2VR.git
cd nT2VR
pip install -r requirements.txt
```
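After installation, you can optionally verify the environment (assuming requirements.txt installs PyTorch, which this repo depends on):

```bash
# optional sanity check: should print the PyTorch version and CUDA availability
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```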
- For MSR-VTT, the official data can be found at this link. The raw videos are available in the sharing from Frozen in Time. We follow the official MSR-VTT 3k split and the MSR-VTT 1k split (described in the JSFusion paper).
- For VATEX, the official data can be found at this link. We follow the split used in HGR.
- We extract frames from the videos before training, at a rate of one frame every 0.5 seconds, using the scripts from video-cnn-feat. Each data folder should also contain a file that maps frame ids to image paths (see the example id.imagepath.txt; the prefix of each frame id should be consistent with the video id).
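For illustration, each line of that file pairs a frame id with an image path, along these lines (hypothetical ids and paths; follow the layout of the provided id.imagepath.txt):

```
video7768_0 frames/video7768/video7768_0.jpg
video7768_1 frames/video7768/video7768_1.jpg
```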
Download the data for training & evaluation on nT2V. We use the prefixes "msrvtt10k" and "msrvtt1kA" to distinguish the MSR-VTT 3k split from the MSR-VTT 1k split.
- The training data augmented by the negator is named "**.caption.negation.txt". The negated and composed test query sets are named "**.negated.txt" and "**.composed.txt", respectively.
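As a concrete illustration with the prefixes above (hypothetical file names; check the downloaded data for the exact ones):

```
msrvtt10ktrain.caption.negation.txt   # negator-augmented training captions
msrvtt10ktest.negated.txt             # negated test queries
msrvtt10ktest.composed.txt            # composed test queries
```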
We provide scripts for evaluating zero-shot CLIP, CLIP*, their +boolean variants, and CLIP-bnl on nT2V:
- CLIP: the original model, used in a zero-shot setting.
- CLIP*: CLIP fine-tuned on text-to-video retrieval data with a retrieval loss.
- CLIP-bnl: CLIP fine-tuned with the proposed negation learning (see the sketch below).
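For intuition only, here is a minimal PyTorch-style sketch of a margin-based negation objective: the original caption should stay closer to its video than the negator-generated negated caption. This is an illustration under that assumption, not the exact loss used in the paper or in this repository.

```python
import torch
import torch.nn.functional as F

def negation_margin_loss(video_emb, cap_emb, neg_cap_emb, margin=0.2):
    """Illustrative only: encourage sim(video, caption) to exceed
    sim(video, negated caption) by at least `margin`.
    All inputs are (batch, dim) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(cap_emb, dim=-1)
    n = F.normalize(neg_cap_emb, dim=-1)
    sim_pos = (v * c).sum(dim=-1)  # cosine similarity with the original caption
    sim_neg = (v * n).sum(dim=-1)  # cosine similarity with the negated caption
    return F.relu(margin - (sim_pos - sim_neg)).mean()
```

Here are the checkpoints and performances of CLIP, CLIP* and CLIP-bnl: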
Performance on the MSR-VTT 3k split:

Model Checkpoint | Original R@1 | R@5 | R@10 | mAP | Negated R@1 | R@5 | R@10 | mAP | Composed R@1 | R@5 | R@10 | mAP
---|---|---|---|---|---|---|---|---|---|---|---|---
CLIP | 20.8 | 40.3 | 49.7 | 0.305 | 1.5 | 2.5 | 2.9 | 0.020 | 6.9 | 24.2 | 35.6 | 0.160
CLIP* | 27.7 | 53.0 | 64.2 | 0.398 | 0.5 | 1.1 | 1.1 | 0.008 | 11.4 | 33.3 | 46.2 | 0.225
CLIP (boolean) | -- | -- | -- | -- | 18.8 | 37.5 | 46.2 | 0.118 | 5.9 | 16.7 | 23.9 | 0.116
CLIP* (boolean) | -- | -- | -- | -- | 25.3 | 47.1 | 56.1 | 0.236 | 13.5 | 33.7 | 45.5 | 0.243
CLIP-bnl | 28.4 | 53.7 | 64.6 | 0.404 | 5.0 | 6.9 | 6.9 | 0.057 | 15.3 | 40.0 | 53.3 | 0.274
Performance on the MSR-VTT 1k split:

Model Checkpoint | Original R@1 | R@5 | R@10 | mAP | Negated R@1 | R@5 | R@10 | mAP | Composed R@1 | R@5 | R@10 | mAP
---|---|---|---|---|---|---|---|---|---|---|---|---
CLIP | 31.6 | 54.2 | 64.2 | 0.422 | 1.4 | 1.4 | 1.5 | 0.017 | 12.9 | 35.0 | 46.2 | 0.237
CLIP* | 41.1 | 69.8 | 79.9 | 0.543 | 0.0 | 1.7 | 1.0 | 0.006 | 17.3 | 46.8 | 61.2 | 0.310
CLIP (boolean) | -- | -- | -- | -- | 26.4 | 46.2 | 56.8 | 0.354 | 6.3 | 18.4 | 25.9 | 0.129
CLIP* (boolean) | -- | -- | -- | -- | 35.9 | 59.5 | 65.2 | 0.463 | 17.6 | 42.0 | 52.0 | 0.291
CLIP-bnl | 42.1 | 68.4 | 79.6 | 0.546 | 12.2 | 11.7 | 14.4 | 0.121 | 24.8 | 57.6 | 68.8 | 0.391
Performance on VATEX:

Model Checkpoint | Original R@1 | R@5 | R@10 | mAP | Negated R@1 | R@5 | R@10 | mAP | Composed R@1 | R@5 | R@10 | mAP
---|---|---|---|---|---|---|---|---|---|---|---|---
CLIP | 41.4 | 72.9 | 82.7 | 0.555 | 1.9 | 2.1 | 2.2 | 0.018 | 10.5 | 28.3 | 41.3 | 0.201
CLIP* | 56.8 | 88.4 | 94.4 | 0.703 | 0.2 | 0.4 | 0.7 | 0.004 | 14.2 | 39.2 | 53.3 | 0.266
CLIP (boolean) | -- | -- | -- | -- | 32.5 | 57.2 | 64.5 | 0.431 | 5.0 | 18.0 | 25.6 | 0.116
CLIP* (boolean) | -- | -- | -- | -- | 25.3 | 47.1 | 56.1 | 0.353 | 14.1 | 34.4 | 45.1 | 0.243
CLIP-bnl | 57.6 | 88.3 | 94.0 | 0.708 | 14.0 | 11.7 | 8.6 | 0.125 | 16.6 | 39.9 | 53.9 | 0.284
- To evaluate zero-shot CLIP, run the script clip.sh:
```bash
# use 'rootpath' to specify the path to the data folder
cd shell/test
bash clip.sh
```
- To evaluate CLIP*, run the script clipft.sh:
```bash
# use 'rootpath' to specify the path to the data folder
# use 'model_path' to specify the path to the model checkpoint
cd shell/test
bash clipft.sh
```
- To evaluate zero-shot CLIP+boolean, run the script clip_bool.sh (a sketch of the boolean score composition follows this list):
```bash
cd shell/test
bash clip_bool.sh
```
- To evaluate CLIP*+boolean, run the script clipft_bool.sh:
```bash
cd shell/test
bash clipft_bool.sh
```
- To evaluate CLIP-bnl, run the script clip_bnl.sh:
```bash
cd shell/test
bash clip_bnl.sh
```
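For intuition, the +boolean baselines can be thought of as treating negation as a boolean NOT over retrieval scores: the query is split into a positive part and a negated part, and videos matching the negated part are penalized. Below is a hypothetical sketch of such a score composition; the exact rule implemented in clip_bool.sh / clipft_bool.sh may differ.

```python
import torch

def boolean_compose(sim_positive, sim_negated, alpha=1.0):
    """Hypothetical boolean NOT re-scoring (illustration only).
    sim_positive / sim_negated: (num_queries, num_videos) similarity
    matrices for the positive part and the negated part of each query."""
    return sim_positive - alpha * sim_negated
```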
- To train CLIP-bnl on the MSR-VTT 3k split, run:
```bash
# use 'rootpath' to specify the path to the data folder
cd shell/train
bash msrvtt7k_clip_bnl.sh
```
- To train CLIP-bnl on the MSR-VTT 1k split, run:
```bash
cd shell/train
bash msrvtt9k_clip_bnl.sh
```
- To train CLIP-bnl on VATEX, run:
```bash
cd shell/train
bash vatex_clip_bnl.sh
```
- Additionally, the training script for CLIP* is clipft.sh.
- Install the additional packages:
```bash
cd negationdata
pip install -r requirements.txt
```
- Download the checkpoint of the negation scope detection model, which is built on NegBERT.
- Run the script prepare_data.sh:
```bash
# use 'rootpath' to specify the path to the data folder
# use 'cache_dir' to specify the path to the models used by the negation scope detection model
cd negationdata
bash prepare_data.sh
```
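For reference, negation scope detection is commonly cast as token classification. A minimal, hypothetical sketch with Hugging Face transformers is below; the checkpoint path and label scheme are placeholders, not the released model's actual interface:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# hypothetical local path: point this at the downloaded negation scope
# detection checkpoint (cf. 'cache_dir' above)
ckpt = "./negbert-scope-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForTokenClassification.from_pretrained(ckpt)

sentence = "a man is singing and not playing a guitar"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, seq_len, num_labels)
label_ids = logits.argmax(dim=-1)[0].tolist()  # per-token predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(list(zip(tokens, label_ids)))  # inspect which tokens fall inside the negation scope
```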
If you find our work useful, please cite:
```bibtex
@inproceedings{mm22-nt2vr,
  title     = {Learn to Understand Negation in Video Retrieval},
  author    = {Ziyue Wang and Aozhu Chen and Fan Hu and Xirong Li},
  booktitle = {ACMMM},
  year      = {2022},
}
```
If you encounter any issues when running the code, please feel free to reach out to us:
- Ziyue Wang (ziyuewang@ruc.edu.cn)
- Aozhu Chen (caz@ruc.edu.cn)
- Fan Hu (hufan_hf@ruc.edu.cn)