DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, Yujun Shen
Project Page | Paper | Data
- [2024/03/27] Long captions (LLaVA-1.5, InstructBLIP, and ShareGPT4V) of CC3M are released here!
- 🔥 Exploring how language-image pre-training could benefit from long captions.
- 🔥 Strong improvements on semantic segmentation, image-text retrieval, and image understanding in MLLMs.
- 🔥 DreamLIP trained with 30M image-text pairs achieves performance on par with, or even better than, CLIP trained with 400M pairs (a sketch of the core idea follows below).
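For intuition, here is a minimal, hypothetical sketch of contrastive pre-training with sub-captions sampled from long captions. It is not the official DreamLIP implementation: the sentence-splitting heuristic, the loss formulation, and all function names are illustrative assumptions.

```python
import random

import torch
import torch.nn.functional as F


def sample_subcaption(long_caption: str) -> str:
    """Pick one sentence from a long caption as a training target.

    Splitting on '.' is a simplification; the actual pipeline may
    segment sub-captions differently.
    """
    sentences = [s.strip() for s in long_caption.split(".") if s.strip()]
    return random.choice(sentences) if sentences else long_caption


def contrastive_loss(image_feats: torch.Tensor,
                     text_feats: torch.Tensor,
                     logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of image / sub-caption pairs.

    image_feats: (B, D) L2-normalized image embeddings.
    text_feats:  (B, D) L2-normalized embeddings of one sub-caption
                 sampled per image from its long caption.
    """
    logits = logit_scale * image_feats @ text_feats.t()          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)  # diagonal entries are positives
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2
```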
- [x] Release long captions of CC3M.
- [ ] Release long captions of CC12M, YFCC15M, Laion20M, and COYO4M.
- [ ] Upload pretrained weights of ViT-B/16 and ViT-B/32 trained on CC3M, CC12M, YFCC15M, and merged-30M (a loading sketch is given below).
- [ ] Release evaluation code.
- [ ] Release training code.
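Once the checkpoints are uploaded, loading one will presumably look like the following. This is a sketch assuming the released weights are OpenCLIP-compatible; the file name is a placeholder, not a released artifact.

```python
import torch
import open_clip

# Sketch, assuming an OpenCLIP-compatible ViT-B/16 checkpoint;
# "dreamlip_vitb16_cc3m.pt" is a placeholder file name.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16")
state = torch.load("dreamlip_vitb16_cc3m.pt", map_location="cpu")
# Some checkpoints nest the weights under a "state_dict" key.
model.load_state_dict(state.get("state_dict", state))
model.eval()
```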
| Dataset | Raw | InstructBLIP | LLaVA-1.5 | ShareGPT4V | ALL |
|---|---|---|---|---|---|
| CC3M | TODO | TODO | TODO | TODO | Link |
| CC12M | TODO | TODO | TODO | TODO | TODO |
| YFCC15M | TODO | TODO | TODO | TODO | TODO |
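A downloaded caption file might be read as below. The schema is only a guess (a JSON mapping from image key to per-captioner texts); check the released files for the actual format and keys.

```python
import json

# Hypothetical schema: {image_id: {"raw": ..., "instructblip": ...,
# "llava": ..., "sharegpt4v": ...}}. The released files may differ.
with open("cc3m_long_captions.json") as f:
    records = json.load(f)

# Preview the first few entries and their captions from each source.
for image_id, caps in list(records.items())[:3]:
    print(image_id)
    for source, caption in caps.items():
        print(f"  {source}: {caption[:80]}...")
```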
TODO
@article{DreamLIP,
  title={DreamLIP: Language-Image Pre-training with Long Captions},
  author={Zheng, Kecheng and Zhang, Yifei and Wu, Wei and Lu, Fan and Ma, Shuailei and Jin, Xin and Chen, Wei and Shen, Yujun},
  journal={arXiv preprint arXiv:2403.17007},
  year={2024}
}
We thank the authors of InstructBLIP, ShareGPT4V, and LLaVA for releasing their pretrained models and code.