PyTorch code for the paper "SelfAlign: Towards Fast and Accurate Image-Text Retrieval with Self-Supervised Fine-Grained Alignment". It is built on top of VSRN and CAMERA.
Image-text retrieval requires the system to bridge the heterogeneous gap between vision and language for accurate retrieval while keeping the network lightweight enough for efficient retrieval. Existing trade-off solutions mainly incorporate cross-modal interactions into the independent-embedding framework or leverage stronger pre-trained encoders, which still demand time-consuming similarity measurement or heavyweight model structures at retrieval time. In this work, we propose SelfAlign, an image-text alignment module on top of the independent-embedding framework, which improves retrieval accuracy while maintaining retrieval efficiency, without extra supervision.
SelfAlign contains two collaborative sub-modules that force image-text alignment at both the concept level and the context level via self-supervised contrastive learning. It requires no cross-modal embedding interactions during training and keeps independent image and text encoders during retrieval.
At comparable time cost, SelfAlign consistently boosts the accuracy of state-of-the-art independent-embedding models by 9.1%, 4.2% and 6.6% on Flickr30K, MSCOCO1K and MSCOCO5K, respectively. Its retrieval accuracy also outperforms most existing interactive-embedding models, with an orders-of-magnitude decrease in retrieval time.
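As background for the alignment objective described above, here is a minimal sketch of a symmetric InfoNCE-style contrastive loss between image and text embeddings. This is an illustration of the general technique, not the paper's exact implementation; the function name, embedding shapes, and the `temperature` value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (batch, dim) tensors; row i of each is a matched pair.
    temperature: illustrative scaling constant, not taken from the paper.
    """
    # Normalize so that dot products are cosine similarities.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Pairwise similarity matrix of shape (batch, batch).
    logits = img_emb @ txt_emb.t() / temperature
    # Ground-truth matches lie on the diagonal.
    targets = torch.arange(img_emb.size(0))
    # Cross-entropy in both retrieval directions (image->text, text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Pulling matched pairs together on the diagonal while pushing apart in-batch negatives is what lets both encoders stay independent at retrieval time: similarity reduces to a single dot product.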
We recommend the following dependencies:
- Python 2.7
- PyTorch (0.4.1)
- Transformers 2.1.1
- NumPy (>1.12.1)
- Punkt Sentence Tokenizer:
import nltk
nltk.download('punkt')
Download the dataset files and pre-trained models. We use the splits produced by Andrej Karpathy.
We follow the bottom-up attention model and SCAN to obtain image features for a fair comparison. More details about data pre-processing (optional) can be found here. All the data needed to reproduce the experiments in the paper, including image features and vocabularies, can be downloaded from SCAN using:
wget https://scanproject.blob.core.windows.net/scan-data/data.zip
You can also get the data from Google Drive: https://drive.google.com/drive/u/1/folders/1os1Kr7HeTbh8FajBNegW8rjJf6GIhFqC. We refer to the path of the extracted data.zip files as $DATA_PATH.
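As a quick sanity check that $DATA_PATH extracted correctly, the snippet below loads one split of precomputed features. The file layout (`<data_name>/<split>_ims.npy` and `<split>_caps.txt`) follows SCAN's data.zip convention and is an assumption here; verify it against your extracted files.

```python
import os
import numpy as np

def load_precomp_features(data_path, data_name, split):
    """Load precomputed image region features and captions for one split.

    Assumes the SCAN-style layout: <data_path>/<data_name>/<split>_ims.npy
    and <split>_caps.txt (one caption per line). Adjust if your extraction
    differs.
    """
    base = os.path.join(data_path, data_name)
    images = np.load(os.path.join(base, '%s_ims.npy' % split))
    with open(os.path.join(base, '%s_caps.txt' % split)) as f:
        captions = [line.strip() for line in f]
    return images, captions
```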
We use the BERT code from BERT-pytorch. Please follow the instructions here to convert the Google BERT model to a PyTorch save file, whose path we refer to as $BERT_PATH.
Go to the directory ./camera_SelfAlign and run train_SelfAlign.py:
For Flickr30K:
python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --logger_name runs/flickr --data_name f30k_precomp --num_epochs 30 --lr_update 10
For MSCOCO:
python train.py --data_path $DATA_PATH --bert_path $BERT_PATH --logger_name runs/coco --data_name coco_precomp --num_epochs 40 --lr_update 20
Modify the model_path and data_path in the evaluate_models.py file, then run:
python evaluate_models.py
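The evaluation script reports Recall@K for both retrieval directions. For reference, here is a minimal standalone sketch of what that metric computes; it is independent of the repo's actual evaluation code, and the function name is illustrative.

```python
import numpy as np

def recall_at_k(sims, k):
    """Recall@K for image-to-text retrieval.

    sims: (n, n) similarity matrix where sims[i, j] scores image i against
    text j, and the ground-truth match for image i is text i (diagonal).
    Returns the percentage of queries whose match ranks in the top k.
    """
    # Rank candidate texts for each image by descending similarity.
    ranks = np.argsort(-sims, axis=1)
    # A query is a hit if its matching text appears among the top k.
    hits = [i in ranks[i, :k] for i in range(sims.shape[0])]
    return 100.0 * np.mean(hits)
```

Text-to-image recall is the same computation on the transposed similarity matrix.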
Go to the directory ./vsrn_SelfAlign and run train_SelfAlign.py:
For Flickr30K:
python train.py --data_path $DATA_PATH --logger_name runs/flickr --data_name f30k_precomp --lr_update 10
For MSCOCO:
python train.py --data_path $DATA_PATH --logger_name runs/coco --data_name coco_precomp --lr_update 15
Modify the model_path and data_path in the evaluate_models.py file, then run:
python evaluate_models.py