  • A summary of mainstream multi-modal pre-trained big models.

Year 2024

  • Pegasus-v1 Technical Report, arXiv:2404.14687, Raehyuk Jung, Hyojun Go, Jaehyuk Yi, Jiho Jang, Daniel Kim, Jay Suh, Aiden Lee, Cooper Han, Jae Lee, Jeff Kim, Jin-Young Kim, Junwan Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong, Andrei Popescu, Esther Kim, EK Yoon, Genie Heo, Henry Choi, Jenna Kang, Kevin Han, Noah Seo, Sunny Nguyen, Ryan Won, Yeonhoo Park, Anthony Giuliani, Dave Chung, Hans Yoon, James Le, Jenny Ahn, June Lee, Maninder Saini, Meredith Sanders, Soyoung Lee, Sue Kim, Travis Couture [Paper]

  • MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training, Brandon McKinzie, Zhe Gan, Jean-Philippe Fauconnier, Sam Dodge, Bowen Zhang, Philipp Dufter, Dhruti Shah, Xianzhi Du, Futang Peng, Floris Weers, Anton Belyi, Haotian Zhang, Karanjeet Singh, Doug Kang, Hongyu Hè, Max Schwarzer, Tom Gunter, Xiang Kong, Aonan Zhang, Jianyu Wang, Chong Wang, Nan Du, Tao Lei, Sam Wiseman, Mark Lee, Zirui Wang, Ruoming Pang, Peter Grasch, Alexander Toshev, Yinfei Yang [Paper]

  • EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters, Quan Sun, Jinsheng Wang, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Xinlong Wang [Paper] [Code]

  • InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Xilin Wei, Songyang Zhang, Haodong Duan, Maosong Cao, Wenwei Zhang, Yining Li, Hang Yan, Yang Gao, Xinyue Zhang, Wei Li, Jingwen Li, Kai Chen, Conghui He, Xingcheng Zhang, Yu Qiao, Dahua Lin, Jiaqi Wang [Paper] [Code]

Year 2023

  • PaLI-3 Vision Language Models: Smaller, Faster, Stronger, [Paper]

  • Fuyu-8B: A Multimodal Architecture for AI Agents, [https://www.adept.ai/blog/fuyu-8b]

  • OtterHD: A High-Resolution Multi-modality Model, Bo Li, Peiyuan Zhang, Jingkang Yang, Yuanhan Zhang, Fanyi Pu, Ziwei Liu [Paper] [Code]

  • Towards Medical Artificial General Intelligence via Knowledge-Enhanced Multimodal Pretraining, Bingqian Lin et al. [Paper]

  • CLIP^2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data, Yihan Zeng, Chenhan Jiang, Jiageng Mao, Jianhua Han, Chaoqiang Ye, Qingqiu Huang, Dit-Yan Yeung, Zhen Yang, Xiaodan Liang, Hang Xu [Paper]

  • PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents, Weixiong Lin, Ziheng Zhao, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, and Weidi Xie, [Paper]

  • HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention, Shijie Geng, Jianbo Yuan, Yu Tian, Yuxiao Chen, Yongfeng Zhang, ICLR 2023 [Paper] [Code]

  • FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks, Xiao Han, Xiatian Zhu, Licheng Yu, Li Zhang, Yi-Zhe Song, Tao Xiang [Paper] [Code]

  • Prismer: A Vision-Language Model with An Ensemble of Experts, Shikun Liu, Linxi Fan, Edward Johns, Zhiding Yu, Chaowei Xiao, Anima Anandkumar [Paper] [Code]

  • StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training, Yuechen Yu, Yulin Li, Chengquan Zhang, Xiaoqiang Zhang, Zengyuan Guo, Xiameng Qin, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang [Paper] [Code]

  • Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training, Dezhao Luo, Jiabo Huang, Shaogang Gong, Hailin Jin, and Yang Liu [Paper]

  • RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training, Zheng Yuan, Qiao Jin, Chuanqi Tan, Zhengyun Zhao, Hongyi Yuan, Fei Huang, Songfang Huang [Paper]

  • "Language Is Not All You Need: Aligning Perception with Language Models." arXiv preprint arXiv:2302.14045 (2023). Huang, Shaohan, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv et al. [Paper] [Code]

  • Improving Medical Speech-to-Text Accuracy with Vision-Language Pre-training Model, Jaeyoung Huh, Sangjoon Park, Jeong Eun Lee, Jong Chul Ye [Paper]

  • Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning, CVPR 2023, Yang, Antoine and Nagrani, Arsha and Seo, Paul Hongsuck and Miech, Antoine and Pont-Tuset, Jordi and Laptev, Ivan and Sivic, Josef and Schmid, Cordelia, [Paper] [Project]

  • Knowledge-enhanced Visual-Language Pre-training on Chest Radiology Images, Xiaoman Zhang, Chaoyi Wu, Ya Zhang, Yanfeng Wang, Weidi Xie, [arXiv]

Year 2022

  • FLAVA: A Foundational Language And Vision Alignment Model, Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela, CVPR 2022 [Paper] [Project] [Code]

  • Position-guided Text Prompt for Vision-Language Pre-training, Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan [Paper] [Code]

  • MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training, Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, Weidi Xie [Paper] [Code]

  • Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models, Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang [Paper]

  • HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training, Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, Fei Huang, [Paper] [Model]

  • Million-scale Object Detection with Large Vision Model, Feng Lin, Wenze Hu, Yaowei Wang, Yonghong Tian, Guangming Lu, Fanglin Chen, Yong Xu and Xiaoyu Wang, [Paper] [Code]

  • VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning, Qiushi Zhu, Long Zhou, Ziqiang Zhang, Shujie Liu, Binxing Jiao, Jie Zhang, Lirong Dai, Daxin Jiang, Jinyu Li, Furu Wei [Paper] [Code]

  • SIMLA: Single-Stream Multi-Level Alignment for Vision-Language Pretraining, ECCV 2022 (NEC Labs), Zaid Khan, Vijay Kumar, Xiang Yu, Samuel Schulter, Manmohan Chandraker, and Yun Fu [Paper] [Code] [Project]

  • VindLU: A Recipe for Effective Video-and-Language Pretraining, Feng Cheng, Xizi Wang, Jie Lei, David Crandall, Mohit Bansal, Gedas Bertasius [Paper] [Code]

  • CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet, Xiaoyi Dong*, Jianmin Bao, Ting Zhang, Dongdong Chen, Shuyang Gu, Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Yu [Paper] [Code]

  • REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory, Ziniu Hu, Ahmet Iscen, Chen Sun, Zirui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi [Paper]

  • Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization, Junru Wu et al. [Paper]

  • Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese, An Yang et al. [Paper] [Code]

  • Generative Negative Text Replay for Continual Vision-Language Pretraining, [Paper]

  • GRIT-VLP: GRouped mIni-baTch sampling for Efficient Vision-Language Pre-training, ECCV 2022, [Paper] [Code]

  • INSTRUCTION-FOLLOWING AGENTS WITH JOINTLY PRE-TRAINED VISION-LANGUAGE MODELS, Hao Liu et al. [Paper] [Code]

  • FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning, Suvir Mirchandani, et al. [Paper]

  • ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts, Zhida Feng et al. [Paper]

  • Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision, [Paper]

  • MedCLIP: Contrastive Learning from Unpaired Medical Images and Text, Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, Jimeng Sun [Paper] [Code]

  • Contrastive Language-Image Pre-Training with Knowledge Graphs, [Paper]

  • Vision-Language Pre-training: Basics, Recent Advances, and Future Trends, Zhe Gan, Linjie Li, Chunyuan Li, Lijuan Wang, Zicheng Liu, Jianfeng Gao [Paper] [Recent Advances in Vision-and-Language Pre-training In conjunction with CVPR 2022]

  • Non-Contrastive Learning Meets Language-Image Pre-Training, Jinghao Zhou, Li Dong, Zhe Gan, Lijuan Wang, Furu Wei [Paper]

  • Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training, Wenliang Dai, Zihan Liu, Ziwei Ji, Dan Su, Pascale Fung [Paper]

  • Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning, Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu, [Paper] [Code]

  • MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model, Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang, [Paper] [Code]

  • CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training, Tianyu Huang, Bowen Dong, Yunhan Yang, Xiaoshui Huang, Rynson W.H. Lau, Wanli Ouyang, Wangmeng Zuo [Paper] [Code]

  • F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models, Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, Anelia Angelova [Paper] [Code]

  • Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study, Ziyuan Qin, Huahui Yi, Qicheng Lao, Kang Li [Paper]

  • ERNIE-ViL 2.0: Multi-View Contrastive Learning for Image-Text Pre-training, Bin Shan, Weichong Yin, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang, [Paper] [Code]

  • OmniVL: One Foundation Model for Image-Language and Video-Language Tasks, Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, Lu Yuan [Paper]

  • Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-Training, Zhihong Chen, Yuhao Du, Jinpeng Hu, Yang Liu, Guanbin Li, Xiang Wan, and Tsung-Hui Chang, MICCAI-2022. [Paper] [Code]

  • Exploring Visual Interpretability for Contrastive Language-Image Pre-training, Yi Li, Hualiang Wang, Yiqun Duan, Hang Xu, Xiaomeng Li, [Paper] [Code]

  • PaLI: A Jointly-Scaled Multilingual Language-Image Model, Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, Radu Soricut [Paper]

  • CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment, arXiv:2209.06430, Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, Jiebo Luo [Paper] [Code]

  • RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection, Hangjie Yuan et al. [Paper]

  • An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling, Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, Zicheng Liu, [Paper]

  • Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment, Mustafa Shukor, Guillaume Couairon, Matthieu Cord, [Paper] [Code]

  • COYO-700M: Image-Text Pair Dataset [Paper] [Code]

  • Image as a Foreign Language: BEIT Pretraining for All Vision and Vision-Language Tasks, Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al., arXiv:2208.10442, 2022. [Paper] [Code]

  • Pix4Point: Image Pretrained Transformers for 3D Point Cloud Understanding, Guocheng Qian, Xingdi Zhang, Abdullah Hamdi, Bernard Ghanem [Paper] [Code]

  • VLMAE: Vision-Language Masked Autoencoder, Sunan He, Taian Guo, Tao Dai, Ruizhi Qiao, Chen Wu, Xiujun Shu, Bo Ren, arXiv:2208.09374 [Paper]

  • Li, Juncheng, et al. "Fine-Grained Semantically Aligned Vision-Language Pre-Training." arXiv preprint arXiv:2208.02515 (2022). [Paper]

  • GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training, Jaeseok Byun, Taebaek Hwang, Jianlong Fu, and Taesup Moon, arXiv:2208.04060 [Paper] [Code]

  • Wang, Tengfei, et al. "Pretraining is All You Need for Image-to-Image Translation." arXiv preprint arXiv:2205.12952 (2022). [Paper] [Code]

  • Wang, Jinpeng, et al. "Object-aware Video-language Pre-training for Retrieval." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. [Paper] [Code]

  • See Finer, See More: Implicit Modality Alignment for Text-based Person Retrieval, Xiujun Shu, Wei Wen, Haoqian Wu, Keyu Chen, Yiran Song, Ruizhi Qiao, Bo Ren, Xiao Wang, The 2nd Workshop on Real-World Surveillance: Applications and Challenges, ECCVW-2022
    [Paper] [Code]

  • Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training, 2022 European Conference on Computer Vision (ECCV 2022), Haoxuan You*, Luowei Zhou*, Bin Xiao*, Noel Codella*, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan. [Paper] [Code]

  • Zhao, Tiancheng, et al. "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations." arXiv preprint arXiv:2207.00221 (2022). [Paper] [Code]

  • DemoVLP: Revitalize Region Feature for Democratizing Video-Language Pre-training, Guanyu Cai, Yixiao Ge, Alex Jinpeng Wang, Rui Yan, Xudong Lin, Ying Shan, Lianghua He, Xiaohu Qie, Jianping Wu, Mike Zheng Shou [Paper] [Code]

  • Yan, Rui, et al. "Video-Text Pre-training with Learned Regions." arXiv preprint arXiv:2112.01194 (2021). [Paper] [Code]

  • Wang, Alex Jinpeng, et al. "All in one: Exploring unified video-language pre-training." arXiv preprint arXiv:2203.07303 (2022). [Paper] [Code]

  • Egocentric Video-Language Pretraining, Kevin Qinghong Lin and Alex Jinpeng Wang and Mattia Soldan and Michael Wray and Rui Yan and Eric Zhongcong Xu and Difei Gao and Rongcheng Tu and Wenzhe Zhao and Weijie Kong and Chengfei Cai and Hongfa Wang and Dima Damen and Bernard Ghanem and Wei Liu and Mike Zheng Shou, arXiv-2022 [Paper] [Code]

  • LocVTP: Video-Text Pre-training for Temporal Localization (ECCV 2022), Meng Cao, Tianyu Yang, Junwu Weng, Can Zhang, Jue Wang and Yuexian Zou.
    [Paper] [Code]

  • Gui L, Wang B, Huang Q, et al. KAT: A Knowledge Augmented Transformer for Vision-and-Language. arXiv preprint arXiv:2112.08614, 2021. [Paper] [Code]

| NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 64 | pyramidCLIP | arXiv-2022 | image-text | CNN+Trans | CS | Hierarchical image-text contrastive learning | - |
| 65 | VLC | arXiv-2022 | image-text | ViT | MIM, MLM, ITM | Built on top of MAE and does not require training on ImageNet | [Code] |
| 66 | VLCDoC | arXiv-2022 | image-text | Trans | CS | Contrastive pre-training for document classification | - |
| 67 | MVP | arXiv-2022 | image-text | ViT | MIM | Multimodality-guided visual pre-training leads to impressive gains | - |
| 68 | COTS | arXiv-2022 | image-text | Trans | CS, KLD, MVLM | Token- and task-level interaction are proposed to enhance cross-modal interaction | - |
| 69 | Flamingo | arXiv-2022 | image-text | NFNet | CS | An architecture for accepting arbitrarily interleaved visual data and text as input | [Code] |
| 70 | BLIP | arXiv-2022 | image-text | BERT | CS, MML, MLM | Proposes the multimodal mixture of encoder-decoder and a captioning-filtering scheme | [Code] |
| 71 | TCL | CVPR-2022 | image-text | BERT | CMA, IMC, LMI, ITM, MLM | The first work to consider local structure information for multi-modality representation learning | [Code] |
| 72 | SCALE | CVPR-2022 | image, text, table, video, audio | BERT | MRP, MLM, MEM, MFP, MAM | A unified model to handle five modalities | [Code] |
| 73 | Clinical-BERT | AAAI-2022 | image-text | BERT | CD, MMM, MLM, IMM | The first work to learn domain knowledge during pre-training for the medical domain | - |
| 74 | ProbES | ACL-2022 | image-text | LSTM, ViLBERT | Ranking loss | Prompt-based learning for VLN based on CLIP | [Code] |
| 75 | VLP-MABSA | ACL-2022 | image-text | BERT | MLM, AOE, MRM, AOG, MSP | Task-specific VL-PTMs for multimodal aspect-based sentiment analysis | [Code] |
| 76 | R2D2 | arXiv-2022 | image-text | ViT, BERT | GCPR, FGR, MLM | A two-way distillation strategy is proposed, i.e., target- and feature-guided distillation | - |
| 77 | DeFILIP | arXiv-2022 | image-text | ViT, ResNet | CS | A benchmark for CLIP and its variants | [Code] |
| 78 | CoCa | arXiv-2022 | image-text | Trans | CS, ITG | Jointly pre-trains an image-text model with contrastive loss and captioning loss | - |
| 79 | HiVLP | arXiv-2022 | image-text | Trans | LRM, HRL, VLM | Accelerates image-text retrieval via hierarchical retrieval | - |
| 80 | CLIP-Event | CVPR-2022 | image-text | Trans | CS | Considers event structural knowledge and prompts in the pre-training phase | [Code] |
| 81 | AudioCLIP | ICASSP-2022 | image-text-audio | Trans | CS | Builds a triplet-modality PTM in the style of CLIP | [Code] |
| 82 | VL-BEiT | arXiv-2022 | image-text | Trans | MLM, MIM, MVLM | Pre-trains on both monomodal and multimodal data using a shared Transformer | [Code] |
| 83 | MV-GPT | arXiv-2022 | image-text | BERT | MLM, LG | Pre-trains a multi-modal video encoder and a sentence decoder jointly | - |
| 84 | MMKD | arXiv-2022 | image-text | BERT | ITM | Iteratively executes knowledge discovery and model pre-training for continuous learning | - |
| 85 | GLIPv2 | arXiv-2022 | image-text | Swin, BERT | PGL, CS, MLM | Serves both the localization and understanding tasks | [Code] |
| 86 | LIMoE | arXiv-2022 | image-text | Trans | CS | Multi-modal pre-training with a sparse mixture-of-experts model | - |
| 87 | VLMixer | arXiv-2022 | image-text | Trans | MLM, CMCL, MTM | Implicit cross-modal alignment learning in unpaired VLP | [Code] |
| 88 | ProtoCLIP | arXiv-2022 | image-text | Trans | CS | Combines the CLIP loss and prototypical supervision for VLP | [Code] |
| 89 | i-Code | arXiv-2022 | image-text-audio | Trans | MLM, MVM, MSM, CS | Handles different combinations of modalities (single-, dual-, and triple-modality) in a single representation space | - |
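
Many of the objectives abbreviated above as CS (e.g., CLIP, ALIGN, CoCa, Flamingo, ProtoCLIP) are variants of a symmetric image-text contrastive loss. The snippet below is a minimal PyTorch sketch of that CLIP-style InfoNCE objective, assuming two encoders that already produce pooled image and text embeddings; the function name and tensor shapes are illustrative and not taken from any of the listed codebases.

```python
import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(image_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    The i-th image and i-th text form the positive pair; every other
    in-batch combination acts as a negative.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)

    loss_i2t = F.cross_entropy(logits, targets)          # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)      # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_style_contrastive_loss(img, txt))
```

Models such as FILIP and DeCLIP in the 2021 table below start from this same loss and add finer-grained token-level interaction or extra self-supervision on top of it.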

Year 2021

| NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 25 | XGPT | NLPCC-2021 | image-text | Trans | IC, MLM, IDA, MOR | Novel IDA pre-training; shares parameters between encoder and decoder | - |
| 26 | ERNIE-ViL | AAAI-2021 | image-text | Trans | MOC, AttP, RelP, MLM, MOR, MML | Uses the knowledge obtained from scene graphs | [Code] |
| 27 | KVL-BERT | KBS-2021 | image-text | BERT | MOC, MLM | Integrates commonsense knowledge for visual commonsense reasoning | - |
| 28 | VinVL | CVPR-2021 | image-text | Trans | MTL, 3-way CS | Verifies that visual features matter in VLP, i.e., a strong object detector brings better results | [Code] |
| 29 | VL-T5 | ICML-2021 | image-text | Trans | MLM, VQA, MML, VG, GC | Unified framework for VL via generating texts | [Code] |
| 30 | ViLT | ICML-2021 | image-text | Trans | MLM, MML | Uses linear embedding only for a fast VL transformer | [Code] |
| 31 | ALIGN | ICML-2021 | image-text | EfficientNet, BERT | CS | Milestone for image-text pre-training using noisy data | - |
| 32 | Kaleido-BERT | CVPR-2021 | image-text | Trans | MLM, MML, AKPM | Uses a saliency detector to generate multi-grained patches | [Code] |
| 33 | MDETR | ICCV-2021 | image-text | CNN+Trans | STP, MML | An end-to-end text-modulated detection system | [Code] |
| 34 | SOHO | CVPR-2021 | image-text | CNN+Trans | MLM, MOR, MML | Uses a dynamically updated visual dictionary for vision-language alignment | [Code] |
| 35 | E2E-VLP | ACL-2021 | image-text | Trans | OBD, ITG | The first end-to-end pre-trained model for V+L understanding and generation | - |
| 36 | PIM | NeurIPS-2021 | image-text | Trans | MLM, MML, MOR | Proposes an inter-modality flow metric to measure and reveal vision and language fusion | - |
| 37 | CLIP-ViL | arXiv-2021 | image-text | Trans | MLM, VQA, MML | Takes the CLIP visual encoder as its visual backbone | [Code] |
| 38 | ALBEF | NeurIPS-2021 | image-text | Trans | CS, GR | Designs a momentum model to address noisy data | [Code] |
| 39 | SimVLM | arXiv-2021 | image-text | Trans | PrefixLM | Simple VL model using a single PrefixLM pre-training objective only | - |
| 40 | MURAL | arXiv-2021 | image-text | Trans | CS | Adopts a multi-task contrastive learning objective (image-text, text-text) | - |
| 41 | VLMo | arXiv-2021 | image-text | Trans | MLM, MML, CS | Jointly learns a visual encoder, a text encoder, and a fusion encoder | [Code] |
| 42 | METER | CVPR-2022 | image-text | Trans | MLM, MOR, MOC, MML | An empirical study on VLP | [Code] |
| 43 | CLIP | ICML-2021 | image-text | ResNet, Trans | CS | Milestone for image-text pre-training using noisy data | [Code] |
| 44 | Frozen | ICCV-2021 | video/image-text | Trans | MML | Flexibly trained on both images and videos with captions jointly | [Code] |
| 45 | RegionLearner | arXiv-2021 | video-text | Trans | MML | Implicitly learns object regions without position supervision | [Code] |
| 46 | DALL-E | ICML-2021 | image-text | Trans | ELB | Achieves high-quality image generation without using any of the training labels | [Code] |
| 47 | BriVL | arXiv-2021 | image-text | Trans | InfoNCE | First large-scale Chinese multi-modal pre-training model | [Code] |
| 48 | M6 | arXiv-2021 | image-text | Trans | LM | The largest pretrained model in Chinese | - |
| 49 | CogView | NeurIPS-2021 | image-text | Trans | NLL | The first open-source large text-to-image transformer | [Code] |
| 50 | VATT | NeurIPS-2021 | video, audio, text | Trans | NCE, MIL-NCE | Modality-specific or modality-agnostic triplet-modality pre-trained model | [Code] |
| 51 | OPT | arXiv-2021 | image, audio, text | Trans | MLM, MVM, MoLM, MAM, DTR, DIR | The first pre-trained model that connects the three modalities of text, vision, and audio | - |
| 52 | Florence | arXiv-2021 | image-text | CoSwin | UniCL | Expands the representations from coarse-to-fine, static-to-dynamic, and RGB-to-MM | - |
| 53 | ROSITA | MM-2021 | image-text | Trans | SKM, MLM, MRM | Incorporates both cross- and intra-modal knowledge, and proposes the SKM strategy | - |
| 54 | GilBERT | IR-2021 | image-text | BERT | MLM, MOR | Employs image-to-text captioning and text-to-image synthesis in VLP | - |
| 55 | U-VisualBERT | NAACL-2021 | image-text | Trans, BERT | GR, MML | Unpaired image-text data for pre-training | [Code] |
| 56 | M3P | CVPR-2021 | image-text | BERT | xMLM, MC-MLM, MC-MRM | Multitask, multilingual, multimodal pre-training | [Code] |
| 57 | NUWA | arXiv-2021 | image-text | Trans | T2I, T2V, V2V | A 3D transformer framework that handles image, text, and video simultaneously | [Code] |
| 58 | GLIP | CVPR-2022 | image-text | BERT | CS | Unifies detection and grounding by reformulating object detection as phrase grounding | [Code] |
| 59 | RegionCLIP | CVPR-2022 | image-text | Trans | Distillation loss, CS | Learns region-level visual representations based on CLIP | [Code] |
| 60 | DeCLIP | ICLR-2022 | image-text | ViT | InfoNCE, SS, MVS, NNS | Learns generic visual features in a data-efficient way | [Code] |
| 61 | SLIP | arXiv-2021 | image-text | ViT | CS, InfoNCE | Combines self-supervised learning and CLIP pre-training in a multi-task framework | [Code] |
| 62 | FILIP | arXiv-2021 | image-text | ViT | CS | Achieves finer-level alignment using a cross-modal late interaction scheme | - |
| 63 | SemVLP | arXiv-2021 | image-text | Trans | MLM, MOP, ITM, QA | Fuses the single- and two-stream architectures | - |
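
Row 43 above is the original CLIP. As a usage reference, the following is a hedged zero-shot classification example with the publicly released openai/clip-vit-base-patch32 checkpoint via the Hugging Face transformers API; the candidate prompts and the placeholder image are illustrative only.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the released CLIP ViT-B/32 checkpoint (weights download on first use).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224), color="gray")  # replace with a real photo
candidate_labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=candidate_labels, images=image,
                   return_tensors="pt", padding=True)
outputs = model(**inputs)

# One row per image, one column per candidate caption; softmax gives label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(candidate_labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```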

Year 2020

| NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 08 | Unicoder-VL | AAAI-2020 | image-text | Trans | GR, MML, MOC | Single transformer encoder for VLP | [Code] |
| 09 | VLP | AAAI-2020 | image-text | Trans | BiDT, Seq2seq | Unified encoder-decoder network architecture | [Code] |
| 10 | UNITER | ECCV-2020 | image-text | Trans | MRA, MML | Proposes an OT-based word-region alignment objective | [Code] |
| 11 | 12-IN-1 | CVPR-2020 | image-text | Trans | CS, GR | Trains jointly on 12 different datasets in a multi-task learning manner | [Code] |
| 12 | VisDial-BERT | ECCV-2020 | image-text | Trans | MLM, NSP, MIR | Pre-training on an image-text corpus and finetuning on visual dialog | [Code] |
| 13 | ImageBERT | arXiv-2020 | image-text | Trans | MOC, MLM, MML, MOR | Indicates that multi-stage pre-training works better | - |
| 14 | PREVALENT | CVPR-2020 | image-text | Trans | MLM, AP | Pre-training for vision-and-language navigation | [Code] |
| 15 | InterBERT | arXiv-2020 | image-text | Trans | MSM, MOC, ITM-hn | Finds that all-attention works better than co-attention for modal interaction | [Code] |
| 16 | PixelBERT | arXiv-2020 | image-text | CNN, Trans | MLM, MML | First to align vision and language at the pixel and text level | - |
| 17 | OSCAR | ECCV-2020 | image-text | Trans | CS, MLM | Uses object tags as anchor points to align image regions with word embeddings | [Code] |
| 18 | FashionBERT | RDIR-2020 | image-text | BERT | MLM, MOR, MML | Uses image patches for the fashion domain instead of RoIs | [Code] |
| 19 | VILLA | NeurIPS-2020 | image-text | Trans | MLM, MOR, MML | Pre-training with adversarial learning | [Code] |
| 20 | UniVL | arXiv-2020 | video-text | Trans | MLM, MFM, MML, ITG | A unified model for multimodal understanding and generation | [Code] |
| 21 | HERO | EMNLP-2020 | video-text | Trans | MLM, MFM, VSM, FOM | Hierarchical Transformer-based model trained with the newly proposed VSM and FOM | [Code] |
| 22 | MMFT-BERT | EMNLP-2020 | image-text | BERT | Classification | Adopts a multimodal fusion Transformer for modality fusion | [Code] |
| 23 | ActBERT | CVPR-2020 | image-text | Trans | CS, GR | Extracts actions explicitly as one of the inputs | - |
| 24 | UNIMO | arXiv-2020 | image-text | Trans | CS | Adapts to single- and multi-modal understanding and generation tasks effectively | [Code] |
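
Several rows in these tables list ITM-style objectives (e.g., ITM-hn for InterBERT, NSP for VisDial-BERT): a binary decision on whether an image and a caption belong together. The sketch below shows one common way to implement it, assuming a fusion encoder that yields a pooled [CLS] feature per image-text pair and negatives built by pairing images with captions from other samples; the class and helper names are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    """Binary classifier over the fused [CLS] feature of an image-text pair."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, 2)   # 0 = mismatched pair, 1 = matched pair

    def forward(self, cls_feature: torch.Tensor) -> torch.Tensor:
        return self.fc(cls_feature)

def itm_loss(head: ITMHead,
             matched_feats: torch.Tensor,
             mismatched_feats: torch.Tensor) -> torch.Tensor:
    # matched_feats: fused features of aligned pairs; mismatched_feats: features
    # where each image was paired with another sample's caption.
    feats = torch.cat([matched_feats, mismatched_feats], dim=0)
    labels = torch.cat([torch.ones(len(matched_feats), dtype=torch.long),
                        torch.zeros(len(mismatched_feats), dtype=torch.long)])
    return F.cross_entropy(head(feats), labels)

# Toy usage with random fused features standing in for a real fusion encoder.
head = ITMHead()
print(itm_loss(head, torch.randn(4, 768), torch.randn(4, 768)))
```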

Year 2019 and Before

| NO. | Model | Publish | Modality | Architecture | Objective | Highlights | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 01 | VisualBERT | arXiv-2019 | image-text | Trans, BERT | GR, MML | A simple and strong baseline for VLP | [Code] |
| 02 | ViLBERT | NeurIPS-2019 | image-text | Trans | CS, GR | First to adopt co-attention for MM pre-training | [Code] |
| 03 | LXMERT | EMNLP-2019 | image-text | Trans | QA, MOR, MOC, MML, MLM | Proposes a cross-modality encoder for vision-language pre-training | [Code] |
| 04 | B2T2 | EMNLP-2019 | image-text | ResNet, BERT | MML, GR | Embeds bounding boxes into the text transformer in an early fusion manner | [Code] |
| 05 | VL-BERT | ICLR-2019 | image-text | BERT | GR, MOC | MM PTMs and Faster R-CNN are jointly trained | [Code] |
| 06 | VideoBERT | ICCV-2019 | video-text | BERT | MLM | A simple model for video-text feature learning | [Code] |
| 07 | CBT | arXiv-2019 | video-text | Trans | NCE | Self-supervised contrastive bidirectional Transformer | - |
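
MLM appears as a pre-training objective throughout these tables (e.g., LXMERT, VideoBERT, UNITER, ViLT): the text tokens of an image-text pair are partially masked and the model predicts them conditioned on the remaining text and the visual input. A simplified PyTorch sketch of the masking and loss computation follows; real BERT-style implementations additionally use an 80/10/10 mask/random/keep split, which is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def mask_tokens(token_ids: torch.Tensor, mask_token_id: int,
                mask_prob: float = 0.15):
    """Simplified BERT-style corruption: replace ~15% of positions with [MASK]
    and return labels that are ignored (-100) everywhere else."""
    mask = torch.rand(token_ids.shape) < mask_prob
    labels = token_ids.clone()
    labels[~mask] = -100                      # loss only on masked positions
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id
    return corrupted, labels

def mlm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab_size) from the multimodal transformer.
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

# Toy usage: random token ids and random "model" logits.
ids = torch.randint(0, 1000, (2, 16))
corrupted, labels = mask_tokens(ids, mask_token_id=103)
logits = torch.randn(2, 16, 1000)
print(mlm_loss(logits, labels))
```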