This repo provides the PyTorch source code of our paper: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data (ICLR 2024). Check out the project page here!
Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, C3 (Connect, Collapse, Corrupt), to bridge the modality gap and enhance the interchangeability of embeddings.
Figure: Overview of the motivation behind our approach.
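For a rough intuition of how the three steps act on embeddings, here is a minimal PyTorch sketch (not the code from this repo): random tensors stand in for outputs of a frozen contrastive encoder such as CLIP, and the noise scale is an illustrative assumption rather than a value from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 1000, 512

# Step 1 (Connect): obtain embeddings from the shared contrastive space.
# Random placeholders stand in for real paired encoder outputs here.
image_embs = F.normalize(torch.randn(n, d), dim=-1)
text_embs = F.normalize(torch.randn(n, d), dim=-1)

# The modality gap is approximately a constant offset between modality centers.
gap = image_embs.mean(dim=0) - text_embs.mean(dim=0)
print(f"modality gap norm: {gap.norm():.3f}")

# Step 2 (Collapse): close the gap by re-centering each modality
# (equivalently, shifting one modality by the mean difference).
image_collapsed = F.normalize(image_embs - image_embs.mean(dim=0), dim=-1)
text_collapsed = F.normalize(text_embs - text_embs.mean(dim=0), dim=-1)

# Step 3 (Corrupt): add Gaussian noise to the training-side (text) embeddings
# so a downstream decoder tolerates the residual image-text mismatch.
noise_std = 0.08  # illustrative value, not taken from the paper
text_corrupted = F.normalize(
    text_collapsed + noise_std * torch.randn_like(text_collapsed), dim=-1
)

# A captioning decoder would then be trained on text_corrupted and, at test
# time, conditioned on image_collapsed instead.
```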
- Reproduce embedding geometry analysis results here.
- Reproduce image captioning results here.
- Reproduce image generation results here.
- Reproduce ImageBind results in the `imagebind` branch.
If you use this repo in your research, please cite it as follows:
```
@inproceedings{C3,
  title={Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data},
  author={Zhang, Yuhui and Sui, Elaine and Yeung-Levy, Serena},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```