
How to extract features to do image retrieval #5

Open
eugeneware opened this issue Jan 25, 2021 · 4 comments


@eugeneware

Thank you for this amazing piece of work.

I'm interested in using VILLA or UNITER to do image retrieval.

I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

I note that in your paper you publish image retrieval and text retrieval metrics.

I've run the code as noted in the UNITER repo:

# text annotation preprocessing
bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann

# image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY

# image preprocessing
bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

Thanks for any help or guidance you can provide.

@zhegan27

@eugeneware , thanks for your inquiry. For UNITER & VILLA, both image and text need to be fed into the model, so text-only features cannot be obtained. This design gives better performance, since multimodal fusion happens at an early stage. However, inference can be very slow, since each text query needs to be fused with every candidate image to get a similarity score.
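
In rough pseudocode, the cost looks like this (score_pair below is a hypothetical stand-in for one full UNITER/VILLA forward pass over a text-image pair, not a function from this repo):

# Cross-encoder retrieval: every candidate image must be fused with the query
# text, so a single query costs len(image_feats) full forward passes.
def rank_images(text, image_feats, score_pair):
    scores = {img_id: score_pair(text, feat) for img_id, feat in image_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)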

From my understanding, what you want to do is get text and image features separately, and then do a dot product for image retrieval. So my suggestion is that you first try using BERT for text feature extraction, then train an image retrieval model on top of it. Actually, my colleagues at Microsoft recently submitted a paper to NAACL 2021, and they have done pre-training in this new way so that image retrieval can be very fast. The paper is still under review, though.
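
A minimal sketch of that two-tower setup, assuming the image embeddings are precomputed offline by a separate image encoder and already live in the same space as the BERT text embedding (both are assumptions for illustration, not something UNITER/VILLA provides):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_text(query):
    # Use the [CLS] token as a single text embedding (one common choice).
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**inputs)
    return F.normalize(out.last_hidden_state[:, 0], dim=-1)  # (1, 768)

def retrieve(query, image_embeds, top_k=10):
    # image_embeds: (num_images, 768) tensor precomputed offline (placeholder).
    sims = image_embeds @ encode_text(query).squeeze(0)  # dot-product similarity
    return torch.topk(sims, k=top_k).indices

In practice the two towers would be trained jointly with a contrastive loss so that matching text and image vectors end up close, but the retrieval step itself stays a single matrix multiply.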

Hope it helps. Thanks.

Best,
Zhe

@eugeneware

Thanks so much for your reply @zhegan27. So, to clarify, the image retrieval metrics in the paper were created by taking each text query, running it against every single image in the image corpus to get a similarity/ranking score, and then ordering the results by best match? If that's the case, that wouldn't work in a low-latency inference environment.

But when I look at the UniterModel base class, I can see code that allows you to pass in only text tokens, only image features, or both. Is it likely that the text-only representation and the image-only representation would not be similar in the shared embedding space?

Are you saying that feeding in just image features to pre-compute image embeddings, and then retrieving those embeddings by cosine distance/dot product against an embedding computed from just the text tokens, is unlikely to work?
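
To be concrete, what I had in mind is roughly the following (encode_image_only and encode_text_only are hypothetical wrappers around single-modality forward passes plus pooling, not existing functions in the repo, and each is assumed to return a 1-D vector):

import torch
import torch.nn.functional as F

def build_index(image_features, encode_image_only):
    # Pre-compute one normalized embedding per image offline.
    return {img_id: F.normalize(encode_image_only(feat), dim=0)
            for img_id, feat in image_features.items()}

def query(text_tokens, index, encode_text_only, top_k=10):
    # At inference time, embed just the text and rank images by dot product.
    q = F.normalize(encode_text_only(text_tokens), dim=0)
    scores = {img_id: torch.dot(q, v).item() for img_id, v in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]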

Thanks again for your help.

zhegan27 commented Feb 3, 2021

@eugeneware , sorry, I am busy with paper deadlines this week and will get back to you this weekend or early next week. Thanks for your understanding.

@eugeneware

@zhegan27 I completely understand. Good luck with your paper deadline. I really appreciate you being so generous with your time.
