Overview

This project contains an implementation of the VSE++ losses [Faghri, Fartash et al., 2017], a technique for learning visual-semantic embeddings for cross-modal retrieval, and an implementation of t-SNE [van der Maaten et al., 2008]. It was written as a school project for the Signal Learning and Multimedia class in 2019.

It is applied to the MSCOCO image captioning dataset [Lin, Tsung-Yi et al., 2014], specifically the val2014 data, which contains 40k images annotated with five captions each. We also used ResNet50 features [He, Kaiming et al., 2016] and GloVe embeddings [Pennington, Jeffrey et al., 2014].

A good introduction to representation learning is [Bengio, Y. et al., 2013].

Features

Image retrieval
Caption retrieval
VSE++ losses (Loss-Sum-Hinge and Loss-Max-Hinge)
t-SNE for captions in a 2D scatter
t-SNE for captions in a 3D scatter
t-SNE for images in a 2D grid
t-SNE for images in a 2D scatter
t-SNE for both captions and images in a 2D scatter
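
As an illustration of the retrieval and loss features listed above, here is a minimal pure-Python sketch. It is not the project's code: the function names (`similarity_matrix`, `retrieve`, `vse_hinge_loss`), the use of cosine similarity, and the default margin of 0.2 are assumptions made for this example. It assumes image `i` and caption `i` form the matching pair, so positive scores sit on the diagonal of the score matrix. Following the VSE++ paper, the sum variant adds the hinge cost of every negative, while the max variant keeps only the hardest negative for each image and each caption.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(image_embs, caption_embs):
    """scores[i][j] = similarity between image i and caption j."""
    return [[cosine_similarity(img, cap) for cap in caption_embs]
            for img in image_embs]


def retrieve(query_scores, k=1):
    """Indices of the k best-scoring candidates for one query, best first."""
    order = sorted(range(len(query_scores)),
                   key=lambda j: query_scores[j], reverse=True)
    return order[:k]


def vse_hinge_loss(scores, margin=0.2, max_violation=False):
    """Hinge-based ranking loss over a square score matrix.

    max_violation=False: sum of hinge costs over all negatives (sum variant).
    max_violation=True:  only the hardest negative per query (max variant).
    """
    n = len(scores)
    total = 0.0
    for i in range(n):
        positive = scores[i][i]
        # Image i against every non-matching caption (row i).
        caption_costs = [max(0.0, margin + scores[i][j] - positive)
                         for j in range(n) if j != i]
        # Caption i against every non-matching image (column i).
        image_costs = [max(0.0, margin + scores[j][i] - positive)
                       for j in range(n) if j != i]
        if max_violation:
            total += max(caption_costs, default=0.0) + max(image_costs, default=0.0)
        else:
            total += sum(caption_costs) + sum(image_costs)
    return total
```

Caption retrieval for image `i` then amounts to `retrieve(scores[i], k)`, and image retrieval for caption `j` to `retrieve([row[j] for row in scores], k)`. Because every hinge term is non-negative, the max variant never exceeds the sum variant on the same score matrix.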

Installation

The project requires python3, python3-pip, the packages listed in requirements.txt, and a recent version of git that supports git-lfs.

To install the required packages:

pip3 install -r requirements.txt

Usage

A notebook is available, and each feature is illustrated by an example in the test directory.

References

[Faghri, Fartash et al., 2017] Faghri, Fartash et al. “VSE++: Improving Visual-Semantic Embeddings with Hard Negatives.” BMVC (2017).
[van der Maaten et al., 2008] van der Maaten, L.J.P., and G.E. Hinton. “Visualizing Data Using t-SNE.” Journal of Machine Learning Research 9 (2008): 2579–2605.
[Lin, Tsung-Yi et al., 2014] Lin, Tsung-Yi et al. “Microsoft COCO: Common Objects in Context.” Lecture Notes in Computer Science (2014): 740–755.
[Bengio, Y. et al., 2013] Bengio, Y., A. Courville, and P. Vincent. “Representation Learning: A Review and New Perspectives.” IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8 (2013): 1798–1828.
[He, Kaiming et al., 2016] He, Kaiming et al. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).
[Pennington, Jeffrey et al., 2014] Pennington, Jeffrey, Richard Socher, and Christopher Manning. “GloVe: Global Vectors for Word Representation.” EMNLP 14 (2014): 1532–1543. doi:10.3115/v1/D14-1162.

Authors

Charly Lamothe
Guillaume Ollier
Balthazar Casalé

