A curated list of awesome Multimodal studies.
## Contribution
If you have published a high-quality paper, or come across one that you think is valuable, feel free to contribute! To submit a paper, open an issue that includes the following information in the format below:
```json
{
  "title": "paper title",
  "url": "paper URL",
  "venue": "the venue where the paper was published, e.g., ICML 2025, CVPR 2025, or arXiv",
  "category": "one or more relevant categories from our directory, or propose a new, more suitable category",
  "code": "[Optional] code URL",
  "project_page": "[Optional] project page URL",
  "dataset": "[Optional] HuggingFace Dataset URL",
  "collections": "[Optional] HuggingFace Collections URL"
}
```
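For example, a completed submission for one of the papers listed below might look like the following sketch. The arXiv URL, category name, and repository path are illustrative placeholders, not confirmed values; copy the real links from the paper page when submitting.

```json
{
  "title": "MMaDA: Multimodal Large Diffusion Language Models",
  "url": "https://arxiv.org/abs/XXXX.XXXXX",
  "venue": "arXiv",
  "category": "Diffusion Language Models",
  "code": "https://github.com/<org>/<repo>"
}
```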
Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model | arXiv | 2025-05-29 | - | |
FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities | arXiv | 2025-05-26 | - | |
Dimple: Discrete Diffusion Multimodal Large Language Model with Parallel Decoding (NUS) | arXiv | 2025-05-22 | - | |
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning (Gaoling) | arXiv | 2025-05-22 | - | |
LaViDa: A Large Diffusion Language Model for Multimodal Understanding (UCLA, Panasonic AI, Salesforce, Adobe) | arXiv | 2025-05-22 | - | |
MMaDA: Multimodal Large Diffusion Language Models (ByteDance Seed) | arXiv | 2025-05-21 | - | |

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning | arXiv | 2024-06-18 | - | |
LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | - | |
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI | arXiv | 2024-04-24 | - | |
BLINK: Multimodal Large Language Models Can See but Not Perceive | arXiv | 2024-04-18 | - | |
Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret-Bench) | ICLR 2024 | 2023-10-11 | - | |
Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 | - | |
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations (AffectVisDial) | ECCV 2024 | 2023-08-30 | - | |
SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension | CVPR 2024 | 2023-07-30 | - | |

Title | Venue | Date | Code | Supplement |
---|---|---|---|---|
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | EMNLP 2023 (Findings) | 2023-05-18 | - | |