
How to extract features to do image retrieval #5

Open
eugeneware opened this issue Jan 25, 2021 · 4 comments


@eugeneware

Thank you for this amazing piece of work.

I'm interested in using VILLA or UNITER to do image retrieval.

I'd like to pre-extract features from VILLA for a folder of images and then retrieve them at inference time by using a text query.

I note that in your paper you publish image retrieval and text retrieval metrics.

I've run the code as noted in the UNITER repo:

# text annotation preprocessing
bash scripts/create_txtdb.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/ann

# image feature extraction (Tested on Titan-Xp; may not run on latest GPUs)
bash scripts/extract_imgfeat.sh $PATH_TO_IMG_FOLDER $PATH_TO_IMG_NPY

# image preprocessing
bash scripts/create_imgdb.sh $PATH_TO_IMG_NPY $PATH_TO_STORAGE/img_db

Most of the scripts and examples I can see in the repo require both images and text to be presented to the model.

Do you have any examples or advice on how to get text-only representations/features that could be used to then retrieve images by their pre-encoded features?

Thanks for any help or guidance you can provide.

@zhegan27

@eugeneware , thanks for your inquiry. For UNITER & VILLA, both image and text need to be fed into the model, so text-only features cannot be obtained. This design gives better performance, since multimodal fusion happens at an early stage. However, inference can be very slow, since each text query needs to be fused with every candidate image to get a similarity score.
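
In rough pseudocode, the cost looks like this (score_pair below is a hypothetical stand-in for one full UNITER/VILLA forward pass over a text-image pair, not a function from this repo):

# Cross-encoder retrieval: every candidate image must be fused with the query
# text, so a single query costs len(image_feats) full forward passes.
def rank_images(text, image_feats, score_pair):
    scores = {img_id: score_pair(text, feat) for img_id, feat in image_feats.items()}
    return sorted(scores, key=scores.get, reverse=True)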

From my understanding, what you want to do is get text and image features separately, and then do a dot product for image retrieval. So my suggestion is that you first try using BERT for text feature extraction, then train an image retrieval model on top of it. Actually, my colleagues at Microsoft recently submitted a paper to NAACL 2021, and they have done pre-training in this new way so that image retrieval can be very fast. The paper is still under review, though.
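
A minimal sketch of that two-tower setup, assuming the image embeddings are precomputed offline by a separate image encoder and already live in the same space as the BERT text embedding (both are assumptions for illustration, not something UNITER/VILLA provides):

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_text(query):
    # Use the [CLS] token as a single text embedding (one common choice).
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        out = text_encoder(**inputs)
    return F.normalize(out.last_hidden_state[:, 0], dim=-1)  # (1, 768)

def retrieve(query, image_embeds, top_k=10):
    # image_embeds: (num_images, 768) tensor precomputed offline (placeholder).
    sims = image_embeds @ encode_text(query).squeeze(0)  # dot-product similarity
    return torch.topk(sims, k=top_k).indices

In practice the two towers would be trained jointly with a contrastive loss so that matching text and image vectors end up close, but the retrieval step itself stays a single matrix multiply.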

Hope it helps. Thanks.

Best,
Zhe

@eugeneware

Thanks so much for your reply @zhegan27. So, to clarify, the image retrieval metrics in the paper were created by taking each text query, running it against every single image in the image corpus to get a similarity/ranking score, and then ordering the results by best match? If that's the case, that wouldn't work in a low-latency inference environment.

But when I look at the UniterModel base class, I can see code that allows you to pass in only text tokens, only image features, or both. Is it likely that the text-only representation and the image-only representation would not be similar in the shared embedding space?

Are you saying that feeding in just image features to pre-compute image embeddings, and then retrieving those embeddings by cosine distance/dot product against an embedding computed from just the text tokens, is unlikely to work?
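
To be concrete, what I had in mind is roughly the following (encode_image_only and encode_text_only are hypothetical wrappers around single-modality forward passes plus pooling, not existing functions in the repo, and each is assumed to return a 1-D vector):

import torch
import torch.nn.functional as F

def build_index(image_features, encode_image_only):
    # Pre-compute one normalized embedding per image offline.
    return {img_id: F.normalize(encode_image_only(feat), dim=0)
            for img_id, feat in image_features.items()}

def query(text_tokens, index, encode_text_only, top_k=10):
    # At inference time, embed just the text and rank images by dot product.
    q = F.normalize(encode_text_only(text_tokens), dim=0)
    scores = {img_id: torch.dot(q, v).item() for img_id, v in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]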

Thanks again for your help.

zhegan27 commented Feb 3, 2021

@eugeneware , sorry, I am busy with paper deadlines this week and will get back to you this weekend or early next week. Thanks for your understanding.

@eugeneware

@zhegan27 I completely understand. Good luck with your paper deadline. I really appreciate you being so generous with your time.
