Awesome-Multimodal-Papers

A curated list of awesome Multimodal studies.

Contribution

If you have published a high-quality paper or come across one that you think is valuable, feel free to contribute! To submit a paper, please open an issue and include the following information in the specified format:

{
    "title": paper title,
    "url": paper URL,
    "venue": the venue where the paper was published, such as ICML 2025, CVPR 2025 or arXiv,
    "category": one or more relevant categories from our directory, or feel free to propose a new, more suitable category,
    "code": [Optional] code URL,
    "project_page": [Optional] project page URL,
    "dataset": [Optional] HuggingFace Dataset URL,
    "collections": [Optional] HuggingFace Collections URL
}

Awesome-Multimodal-Papers

Visual Understanding

Title	Venue	Date	Code	Supplement
Diversity-Guided MLP Reduction for Efficient Large Vision Transformers (DGMR)	arXiv	2025-06-10
Learning Compact Vision Tokens for Efficient Large Multimodal Models (LLaVA-STF)	arXiv	2025-06-08
✨ InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models	arXiv	2025-04-14
InternVL-X: Advancing and Accelerating InternVL Series with Efficient Visual Token Compression (Baidu)	arXiv	2025-03-27		-
M-LLM Based Video Frame Selection for Efficient Video Understanding (CMU)	arXiv	2025-02-27	-	-
✨ Qwen2.5 VL	-	2025-01-26
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling	arXiv	2025-01-21		-
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding (Spatial-Temporal Compression)	arXiv	2025-01-14		-
LLaVA-Octopus: Unlocking Instruction-Driven Adaptive Projector Fusion for Video Understanding	arXiv	2025-01-09		-
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos	arXiv	2025-01-07
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models (Video Token Compression)	arXiv	2024-12-30
✨ Apollo: An Exploration of Video Understanding in Large Multimodal Models (Exploration) (Meta)	arXiv	2024-12-13
CompCap: Improving Multimodal Large Language Models with Composite Captions (Meta)	arXiv	2024-12-09	-	-
✨ Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling (InternVL 2.5)	arXiv	2024-12-06
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs	arXiv	2024-10-21	-
[Model, Dataset] Personalized Visual Instruction Tuning (PVIT, PVIT-3M)	arXiv	2024-10-09
✨ Video Instruction Tuning With Synthetic Data (LLaVA-Video, LLaVA-NeXT Series)	arXiv	2024-10-03
EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions	arXiv	2024-09-26	-
Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model (MGLMM, Alibaba)	arXiv	2024-09-20
POINTS: Improving Your Vision-language Model with Affordable Strategies (WeChat)	arXiv	2024-09-07		-
✨ xGen-MM (BLIP-3): A Family of Open Large Multimodal Models	arXiv	2024-08-16
✨ LLaVA-OneVision: Easy Visual Task Transfer (LLaVA-NeXT Series)	arXiv	2024-08-06
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k, by ByteDance)	arXiv	2024-07-30
✨ InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output	arXiv	2024-07-03		-
TokenPacker: Efficient Visual Projector for Multimodal LLM	arXiv	2024-07-02		-
✨ Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (Cambrian, Data Rationing)	arXiv	2024-06-24
✨ Long Context Transfer from Language to Vision (LongVA, by Ziwei Liu, Chunyuan Li)	arXiv	2024-06-24
Generative Visual Instruction Tuning	arXiv	2024-06-17		-
✨ VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding	arXiv	2024-06-13
✨ 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (Apple)	arXiv	2024-06-13
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs	arXiv	2024-06-11		-
Wings: Learning Multimodal LLMs without Text-only Forgetting	arXiv	2024-06-05	-	-
Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG)	arXiv	2024-06-05	-	-
PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM	arXiv	2024-06-05		-
OLIVE: Object Level In-Context Visual Embeddings	ACL 2024	2024-06-02		-
X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA)	arXiv	2024-05-29	-
ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models	arXiv	2024-05-24		-
Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models	arXiv	2024-05-24	-	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability	arXiv	2024-05-23
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Meta)	arXiv	2024-05-16
CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts	arXiv	2024-05-09
ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google)	arXiv	2024-05-05
✨ What matters when building vision-language models? (Idefics2)	arXiv	2024-05-03	-
MANTIS: Interleaved Multi-Image Instruction Tuning	arXiv	2024-05-02
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs	CVPR 2024 Workshop	2024-04-23	-
Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models	arXiv	2024-04-19
MoVA: Adapting Mixture of Vision Experts to Multimodal Context	arXiv	2024-04-19		-
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models	arXiv	2024-04-18	-
LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation? (LaDiC)	NAACL 2024	2024-04-16		-
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15		-
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2)	arXiv	2024-04-11	-	-
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series)	arXiv	2024-04-09
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI)	arXiv	2024-04-08	-	-
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding	CVPR 2024	2024-04-08
Koala: Key frame-conditioned long video-LLM	CVPR 2024	2024-04-05
MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens	arXiv	2024-04-04
LongVLM: Efficient Long Video Understanding via Large Language Models	arXiv	2024-04-04		-
InternVideo2: Scaling Foundation Models for Multimodal Video Understanding	ECCV 2024	2024-03-22		-
VideoAgent: Long-form Video Understanding with Large Language Model as Agent (key frame)	arXiv	2024-03-15	-	-
✨ MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (Apple)	arXiv	2024-03-14	-	-
UniCode: Learning a Unified Codebook for Multimodal Large Language Models	arXiv	2024-03-14	-	-
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context	arXiv	2024-03-08	-
Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models	arXiv	2023-03-05		-
RegionGPT: Towards Region Understanding Vision Language Model	CVPR 2024	2024-03-04	-
All in an Aggregated Image for In-Image Learning	arXiv	2024-02-28		-
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners	CVPR 2024	2024-02-27
TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages	arXiv	2024-02-25	-	-
LLMBind: A Unified Modality-Task Integration Framework	arXiv	2024-02-22	-	-
✨ ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA)	arXiv	2024-02-18
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model	arXiv	2024-02-06		-
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices	arXiv	2023-12-28		-
Gemini: A Family of Highly Capable Multimodal Models	arXiv	2023-12-19	-
✨ Osprey: Pixel Understanding with Visual Instruction Tuning	CVPR 2024	2023-12-15		-
✨ VILA: On Pre-training for Visual Language Models (NVIDIA, MIT)	CVPR 2024	2023-12-12		-
Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models	arXiv	2023-12-11
Prompt Highlighter: Interactive Control for Multi-Modal LLMs	CVPR 2024	2023-12-07
PixelLM: Pixel Reasoning with Large Multimodal Model	CVPR 2024	2023-12-04
APoLLo : Unified Adapter and Prompt Learning for Vision Language Models	EMNLP 2023	2023-12-04
LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models	arXiv	2023-11-28
PG-Video-LLaVA: Pixel Grounding Large Video-Language Models	arXiv	2023-11-22		-
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions	arXiv	2023-11-21
LION : Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge	CVPR 2024	2023-11-20
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection	arXiv	2023-11-16		-
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration	arXiv	2023-11-07		-
MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning	arXiv	2023-10-14
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret)	ICLR 2024	2023-10-11		-
✨ Improved Baselines with Visual Instruction Tuning (LLaVA-1.5)	arXiv	2023-10-05
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning	ICLR 2024	2023-09-14		-
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond	arXiv	2023-08-24
Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint)	ICLR 2024	2023-08-23		-
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	-
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS 2023	2023-05-11		-
MultiModal-GPT: A Vision and Language Model for Dialogue with Humans	arXiv	2023-05-08		-
VPGTrans: Transfer Visual Prompt Generator across LLMs	NeurIPS 2023	2023-05-02
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	arXiv	2023-04-27		-
✨ MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	ICLR 2024	2023-04-20
✨ Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17
Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1)	NeurIPS 2023	2023-02-27		-
Multimodal Chain-of-Thought Reasoning in Language Models	arXiv	2023-02-02		-
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models	ICML 2023	2023-01-30		-
Flamingo: a Visual Language Model for Few-Shot Learning	NeurIPS 2022	2022-04-29		-

Omni Understanding

Title	Venue	Date	Code	Supplement
Ming-Omni: A Unified Multimodal Model for Perception and Generation (Ant Group)	arXiv	2025-06-11
✨ Qwen2.5-Omni Technical Report	arXiv	2025-03-26
Baichuan-Omni-1.5 Technical Report	arXiv	2025-01-26		-
R1-Omni: Explainable Omni-Multimodal Emotion Recognition with Reinforcement Learning	arXiv	2025-03-07
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference	arXiv	2025-02-25
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment (THU, Tencent Hunyuan, NTU S-Lab)	arXiv	2025-02-06
[Benchmark] WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs (Xiaohongshu, SJTU)	arXiv	2025-02-06
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback	arXiv	2024-12-20
[Survey] From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality (HIT, Peng Cheng Lab)	arXiv	2024-12-16	-	-
OMCAT: Omni Context Aware Transformer (OCTAV, OMCAT) (NVIDIA)	arXiv	2024-10-15	-
Baichuan-Omni Technical Report	arXiv	2024-10-11		-
[Benchmark] OmniBench: Towards The Future of Universal Omni-Language Models	arXiv	2024-09-23
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces	arXiv	2024-07-16
Explore the Limits of Omni-modal Pretraining at Scale (MiCo)	arXiv	2024-06-13
ViT-Lens: Towards Omni-modal Representations (TencentARC)	CVPR 2024	2023-08-20
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	NeurIPS 2023	2023-05-29		-
ImageBind: One Embedding Space To Bind Them All	CVPR 2023	2023-05-09
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	TPAMI 2024	2023-04-17

Unified Understanding and Generation

Title	Venue	Date	Code	Supplement
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities	arXiv	2025-05-26	-
✨ MMaDA: Multimodal Large Diffusion Language Models (ByteDance Seed)	arXiv	2025-05-21
Unified Reward Model for Multimodal Understanding and Generation (UnifiedReward) (Fudan, Shanghai AI Lab)	arXiv	2025-03-07
✨ Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling (by deepseek)	arXiv	2025-01-29		-
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation (Meta)	arXiv	2024-12-19	-	-
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding	arXiv	2024-12-12	coming soon	-
ILLUME: Illuminating Your LLMs to See, Draw, and Self-Enhance	arXiv	2024-12-09	-
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation (TencentARC)	arXiv	2024-12-05		-
Liquid: Language Models are Scalable Multi-modal Generators (Bytedance)	arXiv	2024-12-05		arXiv
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation (ByteDance)	arXiv	2024-12-04
✨ Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation (by deepseek)	-	2024-10-17		-
✨ Emu3: Next-Token Prediction is All You Need	arXiv	2024-09-27
✨ Show-o: One Single Transformer to Unify Multimodal Understanding and Generation	arXiv	2024-08-22
An Image is Worth 32 Tokens for Reconstruction and Generation (TiTok, by ByteDance)	arXiv	2024-06-11
✨ Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models	arXiv	2024-05-27
VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing	-	2024-04-25
✨ SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation	arXiv	2024-04-22		-
AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	arXiv	2024-02-19
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization	arXiv	2024-02-05
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action	arXiv	2023-12-28
Generative Multimodal Models are In-Context Learners (Emu2)	CVPR 2024	2023-12-20
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation	arXiv	2023-11-30
LLMGA: Multimodal Large Language Model based Generation Assistant	arXiv	2023-11-27
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation	arXiv	2023-12-14		-
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	ICLR 2024	2023-10-04
✨ MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens	arXiv	2023-10-03
DreamLLM: Synergistic Multimodal Comprehension and Creation	ICLR 2024	2023-09-20
NExT-GPT: Any-to-Any Multimodal LLM	arXiv	2023-09-11
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT)	ICLR 2024	2023-09-09		-
Planting a SEED of Vision in Large Language Model	ICLR 2024	2023-07-16
Generative Pretraining in Multimodality (Emu1)	ICLR 2024	2023-07-11		-
Generating Images with Multimodal Language Models (GILL)	NeurIPS 2023	2023-05-26
Any-to-Any Generation via Composable Diffusion (CoDi-1)	NeurIPS 2023	2023-05-19
Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe)	ICML 2023	2023-01-31

Diffusion MLLM

Title	Venue	Date	Code	Supplement
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model	arXiv	2025-05-29		-
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities	arXiv	2025-05-26	-
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding (NUS)	arXiv	2025-05-22		-
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning (Gaoling)	arXiv	2025-05-22
LaViDa: A Large Diffusion Language Model for Multimodal Understanding (UCLA, Panasonic AI, Salesforce, Adobe)	arXiv	2025-05-22
MMaDA: Multimodal Large Diffusion Language Models (ByteDance Seed)	arXiv	2025-05-21

Multimodal Embedding/Retrieval

Title	Venue	Date	Code	Supplement
Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval (UNITE)	arXiv	2025-05-26
[Benchmark] MIEB: Massive Image Embedding Benchmark	arXiv	2025-04-14
[Data, Model, Benchmark] IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval	arXiv	2025-04-01		-
LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning	arXiv	2025-03-04
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval (CoT)	arXiv	2025-02-28	-	-
[Model, Dataset] Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval (FDCA, FineCVR-1M)	ICLR 2025	2025-02-26
[Benchmark, Model] MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos (MomentSeeker, V-Embedder) (Gaoling)	arXiv	2025-02-18
[Data, Model, Benchmark] Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval (Vis-IR task, VIRA, UniSE, MVRB)	arXiv	2025-02-17	-	-
[Data, Model] Learning Fine-Grained Representations through Textual Token Disentanglement in Composed Video Retrieval (FineCVR-1M, FDCA)	ICLR 2025	2025-01-23
[Benchmark] CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval (CaReBench)	CVPR 2025	2024-12-31
MINIMA: Modality Invariant Image Matching	CVPR 2025	2024-12-27
✨ GME: Improving Universal Multimodal Retrieval by Multimodal LLMs (Tongyi Lab)	CVPR 2025	2024-12-22
✨ [Dataset, Model] MegaPairs: Massive Data Synthesis For Universal Multimodal Retrieval (MagaPairs, BGE-VL)	arXiv	2024-12-19
Reason-before-Retrieve: One-Stage Reflective Chain-of-Thoughts for Training-Free Zero-Shot Composed Image Retrieval (OSrCIR) (CoT)	CVPR 2025	2024-12-15		-
✨ LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant	arXiv	2024-12-02
MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs (NVIDIA)	ICLR 2025 Poster	2024-11-04	-
OMCAT: Omni Context Aware Transformer	arXiv	2024-10-15	-
✨ VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks (TIGER Lab)	arXiv	2024-10-07
✨ E5-V: Universal Embeddings with Multimodal Large Language Models	arXiv	2024-07-17		-
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces	arXiv	2024-07-16
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models (NVIDIA)	ICLR 2025	2024-05-27	-
MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions (Google DeepMind)	ICML 2024 Oral	2024-03-28
DREAM: Improving Video-Text Retrieval Through Relevance-Based Augmentation Using Large Foundation Models	NAACL	2024-04-07	-	-
Composed Video Retrieval via Enriched Context and Discriminative Embeddings	CVPR 2024	2024-03-25		-
✨ UniIR: Training and Benchmarking Universal Multimodal Information Retrievers (TIGER Lab)	ECCV 2024	2023-11-28
CoVR-2: Automatic Data Construction for Composed Video Retrieval&CoVR: Learning Composed Video Retrieval from Web Video Captions	TPAMI 2024 & AAAI 2024	2023-08-23
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset	NeurIPS 2023	2023-05-29		-
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset	TPAMI2024	2023-04-17
✨ (QB-Norm) Cross Modal Retrieval with Querybank Normalisation	CVPR 2022	2021-12-23
✨ (DSL) Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss	arXiv	2021-09-09		-

Image Understanding Benchmark

Title	Venue	Date	Code	Supplement
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning	arXiv	2024-06-18	-
LOVA3: Learning to Visual Question Answering, Asking and Assessment	arXiv	2024-05-23		-
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI	arXiv	2024-04-24
BLINK: Multimodal Large Language Models Can See but Not Perceive	arXiv	2024-04-18
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench)	ICLR 2024	2023-10-11		-
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination))	arXiv	2023-09-25
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial)	ECCV 2024	2023-08-30
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension	CVPR 2024	2023-07-30		-

Video Understanding Benchmark

Title	Venue	Date
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos	ACL 2025 Main	2025-05-26
TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models	arXiv	2024-10-30
AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark (AuroraCap, VDC)	arXiv	2024-10-24
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models	arXiv	2024-10-14
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding	NeurIPS 2024	2024-09-26
Tarsier: Recipes for Training and Evaluating Large Video Description Models (Tarsier, Dream1k) (ByteDance)	arXiv	2024-07-30
VideoVista: A Versatile Benchmark for Video Understanding and Reasoning (VideoVista)	arXiv	2024-06-17
VELOCITI: Can Video-Language Models Bind Semantic Concepts through Time?	arXiv	2024-06-16
MLVU: A Comprehensive Benchmark for Multi-Task Long Video Understanding	arXiv	2024-06-06
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (Video-MME)	arXiv	2024-05-31
TempCompass: Do Video LLMs Really Understand Videos?	arXiv	2024-03-01
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (MVBench, VideoChat2)	CVPR 2024 Highlight	2023-11-28
EgoSchema: A Diagnostic Benchmark for Very Long-form Video Language Understanding	NeurIPS 2023	2023-08-17
Perception Test: A Diagnostic Benchmark for Multimodal Video Models (Perception Test, by Google DeepMind)	NeurIPS 2023	2023-05-23

Audio

Title	Venue	Date	Code	Supplement
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	EMNLP 2023 (Findings)	2023-05-18

Multimodal Dialogue

Title	Venue	Date	Code	Supplement
DialogGen: Multi-modal Interactive Dialogue System for Multi-turn Text-to-Image Generation	arXiv	2024-03-13		-
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue	arXiv	2023-09-14	-	-
PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts	ACL 2023	2023-05-24
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14		Dataset
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10		Dataset
Multimodal Dialogue Response Generation (Divter)	ACL 2022	2021-10-16	-	-
Maria: A Visual Experience Powered Conversational Agent	ACL 2021	2021-05-27		-
Multi-Modal Open-Domain Dialogue	EMNLP 2021	2020-10-02	-	-
Open Domain Dialogue Generation with Latent Images	AAAI 2021	2020-04-04	-	-
Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog	WWW 2020	2020-03-10		-

Multimodal Learning

Title	Venue	Date	Code	Supplement
Video as the New Language for Real-World Decision Making	arXiv	2024-02-27	-	-
Tokenize Anything via Prompting	arXiv	2023-12-14		-
LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment	ICLR 2024	2023-10-03		-
ImageBind: One Embedding Space To Bind Them All	CVPR 2023	2023-05-09
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks	CVPR 2023	2022-11-17		-
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks (BEiT-3)	CVPR 2023	2022-08-22		-
BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers	arXiv	2022-08-12		-
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning	AAAI 2023	2022-06-17
OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework	ICML 2022	2022-02-07		-
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation	ICML 2022	2022-01-28		-
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks	CVPR 2022	2021-12-02		-
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation (ALBEF)	NeurIPS 2021	2021-07-16
BEiT: BERT Pre-Training of Image Transformers	ICLR 2022	2021-06-15		-
Learning Transferable Visual Models From Natural Language Supervision	ICML 2021	2021-02-26
ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision	ICML 2021	2021-02-05		-

Image Generation

Title	Venue	Date	Code	Supplement
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing	arXiv	2025-03-13		-
✨ OmniGen: Unified Image Generation	arXiv	2024-09-17
✨ Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens (by Kaiming He, DeepMind, MIT)	arXiv	2024-10-17	-	-
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
FreeU: Free Lunch in Diffusion U-Net (FreeU, by Ziwei Liu)	CVPR 2024 Oral	2023-09-20
Lazy Diffusion Transformer for Interactive Image Editing	arXiv	2024-04-18	-
Salient Object-Aware Background Generation using Text-Guided Diffusion Models	CVPR 2024 Workshop	2024-04-15		-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15
UNIAA: A Unified Multi-modal Image Aesthetic Assessment Baseline and Benchmark (UNIAA-LLaVA, UNIAA-Bench)	arXiv	2024-04-15	-	-
PMG: Personalized Multimodal Generation with Large Language Models	WWW 2024	2024-04-07	-	-
Identity Decoupling for Multi-Subject Personalization of Text-to-Image Models	arXiv	2024-04-05
Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models	CVPR 2024	2024-04-05	-	-
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction (VAR)	arXiv	2024-04-03
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation (HuaWei, Enze Xie)	arXiv	2024-03-07
Multi-LoRA Composition for Image Generation	arXiv	2024-02-26
PIXART-δ: Fast and Controllable Image Generation with Latent Consistency Models (HuaWei, Enze Xie)	arXiv	2024-01-10
Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model	AAAI 2024	2023-12-19		-
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models (Tencent Xintao Wang)	arXiv	2023-12-11
InstructAny2Pix: Flexible Visual Editing via Multimodal Instruction Following	arXiv	2023-12-11
Emu Edit: Precise Image Editing via Recognition and Generation Tasks	arXiv	2023-11-16	-
BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis	EMNLP 2023	2023-11-12
AnyText: Multilingual Visual Text Generation And Editing	ICLR 2024	2023-11-06		-
EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs	arXiv	2023-10-13		-
Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models	arXiv	2023-10-11
PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis (HuaWei, Enze Xie)	ICLR 2024 Spotlight	2023-09-30
IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models	arXiv	2023-08-13
Kosmos-G: Generating Images in Context with Multimodal Large Language Models	arXiv	2023-10-04
Improving Image Generation with Better Captions (DALL-E 3)	OpenAI	2023	-	-
Scaling up GANs for Text-to-Image Synthesis (GigaGAN)	CVPR 2023	2023-05-09
Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet)	ICCV 2023	2023-02-10		-
Scalable Diffusion Models with Transformers (DiT)	ICCV 2023	2022-12-19
InstructPix2Pix: Learning to Follow Image Editing Instructions	CVPR 2023	2022-11-17
All are Worth Words: A ViT Backbone for Diffusion Models (U-ViT, first Diffsuion Transformer) (RUC, Chongxuan Li)	CVPR 2023	2022-09-25		-
DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation	CVPR 2023	2022-08-25
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding (Imagen)	NeurIPS 2022	2022-05-23
Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2)	OpenAI	2022-04-13		-
High-Resolution Image Synthesis with Latent Diffusion Models (LDM, Stable Diffusion)	CVPR 2022	2021-12-20		-
GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models	ICML 2022	2021-12-20		-
NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion	ECCV 2022	2021-11-24		-
SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations	ICLR 2022	2021-08-02
CogView: Mastering Text-to-Image Generation via Transformers	NeurIPS 2021	2021-05-26		-
Zero-Shot Text-to-Image Generation (DALL-E 1)	ICML 2021	2021-02-24
Taming Transformers for High-Resolution Image Synthesis (VQ-GAN)	CVPR 2021	2020-12-17

Video Generation

Title	Venue	Date	Code
[Dataset] Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists	arXiv	2025-02-10
LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment	arXiv	2024-12-06
[Dataset] VidGen-1M: A Large-Scale Dataset for Text-to-video Generation	arXiv	2024-08-05
[Dataset] MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08	Video Generation
VIMI: Grounding Video Generation through Multi-modal Instruction	arXiv	2024-07-08
[Dataset] OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation	ICLR 2025	2024-07-02
Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any)	arXiv	2024-05-09
StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text (Long Video Generation)	arXiv	2024-03-21
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks	arXiv	2024-03-21
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation (FRESCO) (NTU, Ziwei Liu)	CVPR 2024	2024-03-19
Latte: Latent Diffusion Transformer for Video Generation (Latte) (NTU, Ziwei Liu)	arXiv	2024-01-05
FreeInit: Bridging Initialization Gap in Video Diffusion Models (FreeInit) (NTU, Ziwei Liu)	arXiv	2023-12-12
VideoBooth: Diffusion-based Video Generation with Image Prompts (VideoBooth) (NTU, Ziwei Liu)	arXiv	2023-12-01
VBench: Comprehensive Benchmark Suite for Video Generative Models [Benchmark] (VBench) (NTU, Ziwei Liu)	CVPR 2024	2023-11-29
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets (SVD)	arXiv	2023-11-25
SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction (NTU, Ziwei Liu)	ICLR 2024	2023-10-31
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling (FreeNoise) (NTU, Ziwei Liu)	ICLR 2024	2023-10-23
LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models (LaVie) (NTU, Ziwei Liu)	arXiv	2023-09-26

Multimodal Dataset

Title	Venue	Date	Code	Supplement
[Benchmark & Dataset] VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models	arXiv	2025-04-21	Visual Reasoning Benchmark & Dataset
[Dataset] Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists	arXiv	2025-02-10
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data	arXiv	2024-10-24	-
LVD-2M: A Long-take Video Dataset with Temporally Dense Captions	NeurIPS 2024	2024-10-14
VidGen-1M: A Large-Scale Dataset for Text-to-video Generation	arXiv	2024-08-05
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions (Mira)	arXiv	2024-07-08	Video Generation
OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation	ICLR 2025	2024-07-02
GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension	IJCAI 2024	2024-06-26	-
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens	arXiv	2024-06-17
CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation	arXiv	2024-06-15		-
What If We Recaption Billions of Web Images with LLaMA-3? (Recap-DataComp-1B)	arXiv	2024-06-12		[
TextSquare: Scaling up Text-Centric Visual Instruction Tuning	arXiv	2024-04-19	Visual Instruction Tuning	-
HQ-Edit: A High-Quality Dataset for Instruction-based Image Editing	arXiv	2024-04-15	Instruction Image Editing
AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset)	arXiv	2024-04-15	Aesthetic Multi-Modality Instruction Tuning
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers	CVPR 2024	2024-02-29	video-caption
ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model	arXiv	2024-02-18	GPT4V-synthesized Data
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch	ACL 2024 Main	2024-01-20	Multimodal Empathetic Dialogue
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation	ICLR 2024	2023-07-13
SVIT: Scaling up Visual Instruction Tuning	arXiv	2023-07-09	Instruction Tuning
Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset)	arXiv	2023-06-26	Grounded image-text pairs
M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning	arXiv	2023-06-07	Instruction Tuning
Visual Instruction Tuning (LLaVA)	NeurIPS 2023	2023-04-17	Instruction Tuning
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved with Text	NeurIPS D&B 2023	2023-04-14	Interleaved Image-Text
AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation (Wiki)	TREC 2023 Workshop	2023-04-04
TikTalk: A Multi-Modal Dialogue Dataset for Real-World Chitchat	ACM MM 2023	2023-01-14	Multimodal Dialogue
MMDialog: A Large-scale Multi-turn Dialogue Dataset Towards Multi-modal Open-domain Conversation	ACL 2023	2022-11-10	Multimodal Dialogue
LAION-5B: An open large-scale dataset for training next generation image-text models	NeurIPS 2022	2022-10-16	Image-Text Pairs
LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs	NeurIPS Workshop 2021	2021-11-03	Image-Text Pairs
MMConv: An Environment for Multimodal Conversational Search across Multiple Domains	ACM SIGIR 2021	2021-07	Multimodal Dialogue
PhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling	ACL 2021	2021-07-06	Open-domain Multimodal Dialogue
Image-Chat: Engaging Grounded Conversations	ACL 2020	2018-11-02	Multimodal Dialogue

Multimodal Survey

Title	Venue	Date	Supplement	Latest Update
Discrete Diffusion in Large Language and Multimodal Models: A Survey	arXiv	2025-06-16		-
A Survey on Bridging VLMs and Synthetic Data	OpenReview	2025-05-16		-
From Specific-MLLM to Omni-MLLM: A Survey about the MLLMs alligned with Multi-Modality (HIT, Peng Cheng Lab)	arXiv	2024-12-16	-
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding (TikTok)	arXiv	2024-09-27	-
Video Diffusion Models: A Survey	arXiv	2024-05-06	-
Theoretical research on generative diffusion models: an overview	arXiv	2024-04-13	-	-
A Review of Multi-Modal Large Language and Vision Models	arXiv	2024-03-28	-	-
The (R)Evolution of Multimodal Large Language Models: A Survey	arXiv	2024-02-19	-	-
MM-LLMs: Recent Advances in MultiModal Large Language Models	arXiv	2024-01-24	-	2024-02-20
Multimodal Large Language Models: A Survey	IEEE BigData 2023	2023-11-22	-	-
Multimodal Foundation Models: From Specialists to General-Purpose Assistants	CVPR 2023	2023-09-18	-	-
Understanding Deep Learning	-	2023	-	-
Large Multimodal Models: Notes on CVPR 2023 Tutorial	CVPR 2023	2023-06-26	-	-
A Survey on Multimodal Large Language Models	arXiv	2023-06-23	-	2024-04-01
Multimodal Deep Learning	arXiv	2023-01-12	-	-
Diffusion Models: A Comprehensive Survey of Methods and Applications	ACM Computing Surveys	2022-09-02	-	2024-02-06
Multimodal Learning with Transformers: A Survey	IEEE TPAMI 2023	2022-01-13	-	2023-05-10
Multimodal Machine Learning: A Survey and Taxonomy	IEEE PAMI 2019	2017-05-26	-	2017-08-01

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Awesome-Multimodal-Papers

Visual Understanding

Omni Understanding

Unified Understanding and Generation

Diffusion MLLM

Multimodal Embedding/Retrieval

Image Understanding Benchmark

Video Understanding Benchmark

Audio

Multimodal Dialogue

Multimodal Learning

Image Generation

Video Generation

Multimodal Dataset

Multimodal Survey

About

Uh oh!

Releases

Packages

Contributors 3

License

friedrichor/Awesome-Multimodal-Papers

Folders and files

Latest commit

History

Repository files navigation

Awesome-Multimodal-Papers

Visual Understanding

Omni Understanding

Unified Understanding and Generation

Diffusion MLLM

Multimodal Embedding/Retrieval

Image Understanding Benchmark

Video Understanding Benchmark

Audio

Multimodal Dialogue

Multimodal Learning

Image Generation

Video Generation

Multimodal Dataset

Multimodal Survey

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Packages