This repo provides the PyTorch source code of our paper: Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data (ICLR 2024). Check out the project page here!
Building cross-modal applications is challenging due to limited paired multi-modal data. Recent works have shown that leveraging a pre-trained multi-modal contrastive representation space enables cross-modal tasks to be learned from uni-modal data. This is based on the assumption that contrastive optimization makes embeddings from different modalities interchangeable. However, this assumption is under-explored due to the poorly understood geometry of the multi-modal contrastive space, where a modality gap exists. In our study, we provide a theoretical explanation of this space's geometry and introduce a three-step method, C3 (Connect, Collapse, Corrupt), to bridge the modality gap and enhance the interchangeability of embeddings.
Figure: Overview of the motivation behind our approach.
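For a rough intuition of how the three steps act on embeddings, here is a minimal PyTorch sketch (not the code from this repo): random tensors stand in for outputs of a frozen contrastive encoder such as CLIP, and the noise scale is an illustrative assumption rather than a value from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 1000, 512

# Step 1 (Connect): obtain embeddings from the shared contrastive space.
# Random placeholders stand in for real paired encoder outputs here.
image_embs = F.normalize(torch.randn(n, d), dim=-1)
text_embs = F.normalize(torch.randn(n, d), dim=-1)

# The modality gap is approximately a constant offset between modality centers.
gap = image_embs.mean(dim=0) - text_embs.mean(dim=0)
print(f"modality gap norm: {gap.norm():.3f}")

# Step 2 (Collapse): close the gap by re-centering each modality
# (equivalently, shifting one modality by the mean difference).
image_collapsed = F.normalize(image_embs - image_embs.mean(dim=0), dim=-1)
text_collapsed = F.normalize(text_embs - text_embs.mean(dim=0), dim=-1)

# Step 3 (Corrupt): add Gaussian noise to the training-side (text) embeddings
# so a downstream decoder tolerates the residual image-text mismatch.
noise_std = 0.08  # illustrative value, not taken from the paper
text_corrupted = F.normalize(
    text_collapsed + noise_std * torch.randn_like(text_collapsed), dim=-1
)

# A captioning decoder would then be trained on text_corrupted and, at test
# time, conditioned on image_collapsed instead.
```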
- Reproduce embedding geometry analysis results here.
- Reproduce image captioning results here.
- Reproduce image generation results here.
- Reproduce ImageBind results in the `imagebind` branch.
If you use this repo in your research, please cite it as follows:
```
@inproceedings{C3,
  title={Connect, Collapse, Corrupt: Learning Cross-Modal Tasks with Uni-Modal Data},
  author={Zhang, Yuhui and Sui, Elaine and Yeung-Levy, Serena},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}
```