
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

arXiv | Website

This repository is designed to collect and categorize papers related to Multimodal Retrieval-Augmented Generation (RAG) according to our survey paper: Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation. Given the rapid growth in this field, we will continuously update both the paper and this repository to serve as a resource for researchers working on future projects.

📢 News

  • June 2, 2025: An enhanced version of our paper is now available on arXiv! This update also includes new related papers and covers additional topics such as agentic interaction and audio-centric retrieval.
  • May 15, 2025: Our paper has been accepted to the Findings of ACL 2025.
  • April 18, 2025: The project website for this survey is now live.
  • February 17, 2025: We released the first survey on Multimodal Retrieval-Augmented Generation. Feel free to cite it, contribute, or open a pull request to add recent related papers!

📑 List of Contents


🔎 General Pipeline

(Figure: general pipeline of Multimodal Retrieval-Augmented Generation)

🌿 Taxonomy of Recent Advances and Enhancements

(Figure: taxonomy of recent advances and enhancements in Multimodal RAG)

⚙ Taxonomy of Application Domains

(Figure: taxonomy of Multimodal RAG application domains)

📝 Abstract

Large Language Models (LLMs) suffer from hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information for improved factual grounding. With advances in multimodal learning, Multimodal RAG extends this approach by incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges beyond those in unimodal RAG.

This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, benchmarks, metrics, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We review training strategies, robustness enhancements, loss functions, and agent-based approaches in detail, and explore the diverse scenarios in which Multimodal RAG is applied. In addition, we outline open challenges and future directions to guide research in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases.
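To make the pipeline described above concrete, here is a minimal sketch of a multimodal RAG loop in Python. It is illustrative only and does not correspond to any specific system covered in the survey; `encoder`, `index`, and `llm` are hypothetical stand-ins for a CLIP-style multimodal encoder, a vector store, and a multimodal LLM.

```python
# Minimal, illustrative multimodal RAG loop (not any specific surveyed system).
from dataclasses import dataclass

@dataclass
class Document:
    content: bytes   # raw text, image, audio, or video payload
    modality: str    # "text", "image", "audio", or "video"
    caption: str     # short textual description used when building the prompt

def multimodal_rag(query: str, encoder, index, llm, k: int = 5) -> str:
    """Retrieve cross-modal evidence for `query`, then generate a grounded answer."""
    # 1. Retrieval: embed the query into the shared multimodal embedding space.
    query_vec = encoder.encode_text(query)

    # 2. Search: top-k nearest documents across all modalities (e.g., MIPS or cosine).
    retrieved: list[Document] = index.search(query_vec, k)

    # 3. Augmentation: fuse the retrieved evidence into the prompt.
    context = "\n".join(f"[{doc.modality}] {doc.caption}" for doc in retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    # 4. Generation: the multimodal LLM conditions on the prompt and the raw media.
    return llm.generate(prompt, attachments=[doc.content for doc in retrieved])
```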

📊 Overview of Popular Datasets

🖼 Image-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| LAION-400M | 400M image–text pairs; used for pre-training multimodal models. | Image, Text | LAION-400M |
| Conceptual Captions (CC) | 15M image–caption pairs; multilingual English–German image descriptions. | Image, Text | Conceptual Captions |
| CIRR | 36,554 triplets from 21,552 images; focuses on natural image relationships. | Image, Text | CIRR |
| MS-COCO | 330K images with captions; used for caption-to-image and image-to-caption generation. | Image, Text | MS-COCO |
| Flickr30K | 31K images annotated with five English captions per image. | Image, Text | Flickr30K |
| Multi30K | 30K German captions from native speakers and human-translated captions. | Image, Text | Multi30K |
| NoCaps | For zero-shot image captioning evaluation; 15K images. | Image, Text | NoCaps |
| LAION-5B | 5B image–text pairs used as external memory for retrieval. | Image, Text | LAION-5B |
| COCO-CN | 20,341 images for cross-lingual tagging and captioning with Chinese sentences. | Image, Text | COCO-CN |
| CIRCO | 1,020 queries with an average of 4.53 ground truths per query; for composed image retrieval. | Image, Text | CIRCO |
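
As a sketch of how image–text pairs like these are typically embedded for cross-modal retrieval, the snippet below uses the CLIP implementation from Hugging Face `transformers`; this is a widely available encoder chosen for illustration, not one mandated by the survey, and the image path is a placeholder.

```python
# Sketch: embedding image–text pairs with CLIP for cross-modal retrieval.
# The checkpoint is a public CLIP model; the image path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder; e.g., an MS-COCO image
texts = ["a dog playing in a park", "a bowl of fruit on a table"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# L2-normalized embeddings can be stored in a vector index for MIPS/cosine retrieval.
img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
print(img_emb @ txt_emb.T)  # cosine similarity between the image and each caption
```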

🎞 Video-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| BDD-X | 77 hours of driving videos with expert textual explanations; for explainable driving behavior. | Video, Text | BDD-X |
| YouCook2 | 2,000 cooking videos with aligned descriptions; focused on video–text tasks. | Video, Text | YouCook2 |
| ActivityNet | 20,000 videos with multiple captions; used for video understanding and captioning. | Video, Text | ActivityNet |
| SoccerNet | Videos and metadata for 550 soccer games; includes transcribed commentary and key event annotations. | Video, Text | SoccerNet |
| MSR-VTT | 10,000 videos with 20 captions each; a large video description dataset. | Video, Text | MSR-VTT |
| MSVD | 1,970 videos with approximately 40 captions per video. | Video, Text | MSVD |
| LSMDC | 118,081 video–text pairs from 202 movies; a movie description dataset. | Video, Text | LSMDC |
| DiDeMo | 10,000 videos with four concatenated captions per video; with temporal localization of events. | Video, Text | DiDeMo |
| Breakfast | 1,712 videos of breakfast preparation; one of the largest fully annotated video datasets. | Video, Text | Breakfast |
| COIN | 11,827 instructional YouTube videos across 180 tasks; for comprehensive instructional video analysis. | Video, Text | COIN |
| MSRVTT-QA | Video question answering benchmark built on MSR-VTT. | Video, Text | MSRVTT-QA |
| MSVD-QA | 1,970 video clips with approximately 50.5K QA pairs; video QA dataset. | Video, Text | MSVD-QA |
| ActivityNet-QA | 58,000 human-annotated QA pairs on 5,800 videos; benchmark for video QA models. | Video, Text | ActivityNet-QA |
| EPIC-KITCHENS-100 | 700 videos (100 hours of cooking activities) for online action prediction; egocentric vision dataset. | Video, Text | EPIC-KITCHENS-100 |
| Ego4D | 4.3M video–text pairs for egocentric videos; massive-scale egocentric video dataset. | Video, Text | Ego4D |
| HowTo100M | 136M video clips with captions from 1.2M YouTube videos; for learning text–video embeddings. | Video, Text | HowTo100M |
| Charades-Ego | 68,536 activity instances from ego–exo videos; used for evaluation. | Video, Text | Charades-Ego |
| ActivityNet Captions | 20K videos with an average of 3.7 temporally localized sentences per video; dense captioning of events in videos. | Video, Text | ActivityNet Captions |
| VATEX | 34,991 videos, each with multiple captions; a multilingual video-and-language dataset. | Video, Text | VATEX |
| Charades | 9,848 video clips with textual descriptions; a multimodal research dataset. | Video, Text | Charades |
| WebVid | 10M video–text pairs (refined to WebVid-Refined-1M). | Video, Text | WebVid |
| Youku-mPLUG | Chinese dataset with 10M video–text pairs (refined to Youku-Refined-1M). | Video, Text | Youku-mPLUG |

🔊 Audio-Text

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| LibriSpeech | 1,000 hours of read English speech with corresponding text; ASR corpus based on audiobooks. | Audio, Text | LibriSpeech |
| SpeechBrown | 55K paired speech–text samples; 15 categories covering diverse topics from religion to fiction. | Audio, Text | SpeechBrown |
| AudioCaps | 46K audio clips paired with human-written text captions. | Audio, Text | AudioCaps |
| AudioSet | 2M human-labeled sound clips from YouTube across diverse audio event classes (e.g., music or environmental sounds). | Audio | AudioSet |

🩺 Medical

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| MIMIC-CXR | 125,417 labeled chest X-rays with reports; widely used for medical imaging research. | Image, Text | MIMIC-CXR |
| CheXpert | 224,316 chest radiographs of 65,240 patients; focused on medical analysis. | Image, Text | CheXpert |
| MIMIC-III | Health-related data from over 40K patients; includes clinical notes and structured data. | Text | MIMIC-III |
| IU-Xray | 7,470 pairs of chest X-rays and corresponding diagnostic reports. | Image, Text | IU-Xray |
| PubLayNet | 100,000 training samples and 2,160 test samples; used for document layout analysis. | Image, Text | PubLayNet |

👗 Fashion

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| Fashion-IQ | 77,684 images across three categories; evaluated with Recall@10 and Recall@50 metrics. | Image, Text | Fashion-IQ |
| FashionGen | 260.5K image–text pairs of fashion images and item descriptions. | Image, Text | FashionGen |
| VITON-HD | 83K images for virtual try-on; high-resolution clothing items dataset. | Image, Text | VITON-HD |
| Fashionpedia | 48,000 fashion images annotated with segmentation masks and fine-grained attributes. | Image, Text | Fashionpedia |
| DeepFashion | Approximately 800K diverse fashion images for pseudo triplet generation. | Image, Text | DeepFashion |

💡 QA

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| VQA | 400K QA pairs with images for visual question-answering tasks. | Image, Text | VQA |
| PAQ | 65M text-based QA pairs; a large-scale dataset for open-domain QA tasks. | Text | PAQ |
| ELI5 | 270K complex questions augmented with web pages and images; designed for long-form QA tasks. | Text | ELI5 |
| OK-VQA | 14K questions requiring external knowledge for visual question answering tasks. | Image, Text | OK-VQA |
| WebQA | 46K queries requiring reasoning across text and images; multimodal QA dataset. | Text, Image | WebQA |
| Infoseek | Fine-grained visual knowledge retrieval using a Wikipedia-based knowledge base (~6M passages). | Image, Text | Infoseek |
| ClueWeb22 | 10 billion web pages organized into subsets; a large-scale web corpus for retrieval tasks. | Text | ClueWeb22 |
| MOCHEG | 15,601 claims annotated with truthfulness labels and accompanied by textual and image evidence. | Text, Image | MOCHEG |
| VQA v2 | 1.1M questions (augmented with VG-QA questions) for fine-tuning VQA models. | Image, Text | VQA v2 |
| A-OKVQA | Benchmark for visual question answering using world knowledge; around 25K questions. | Image, Text | A-OKVQA |
| XL-HeadTags | 415K news headline–article pairs spanning 20 languages across six diverse language families. | Text | XL-HeadTags |
| SEED-Bench | 19K multiple-choice questions with accurate human annotations across 12 evaluation dimensions. | Text | SEED-Bench |

🌎 Other

| Name | Statistics and Description | Modalities | Link |
|---|---|---|---|
| ImageNet | 14M labeled images across thousands of categories; used as a benchmark in computer vision research. | Image | ImageNet |
| Oxford Flowers102 | Dataset of flowers with 102 categories for fine-grained image classification tasks. | Image | Oxford Flowers102 |
| Stanford Cars | Images of different car models (five examples per model); used for fine-grained categorization tasks. | Image | Stanford Cars |
| GeoDE | 61,940 images from 40 classes across six world regions; emphasizes geographic diversity in object recognition. | Image | GeoDE |

📄 Papers

📚 RAG-related Surveys

👓 Retrieval Strategies Advances

🔍 Efficient Search and Similarity Retrieval

❓ Maximum Inner Product Search (MIPS) (see the sketch after this list)
💫 Multi-Modal Encoders
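
Below is a minimal MIPS sketch over embeddings such as those produced by the multi-modal encoders above, assuming FAISS is installed; the dimensionality and vectors are random placeholders rather than outputs of any particular encoder.

```python
# Minimal Maximum Inner Product Search (MIPS) sketch with FAISS.
# Embeddings are random placeholders; in practice they come from a multimodal encoder.
import faiss
import numpy as np

d = 512                                               # embedding dimension
corpus = np.random.rand(10_000, d).astype("float32")  # placeholder corpus embeddings
query = np.random.rand(1, d).astype("float32")        # placeholder query embedding

index = faiss.IndexFlatIP(d)          # exact inner-product index
index.add(corpus)                     # index the corpus vectors
scores, ids = index.search(query, 5)  # top-5 documents by inner product
print(ids, scores)

# Note: L2-normalize the vectors first if cosine similarity is desired instead.
```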

🎨 Modality-Centric Retrieval

📋 Text-Centric
📸 Vision-Centric
🎥 Video-Centric
📰 Document Retrieval and Layout Understanding

🥇🥈 Re-ranking Strategies

🎯 Optimized Example Selection
🧮 Relevance Score Evaluation
⏳ Filtering Mechanisms
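
To illustrate the flavor of these steps, here is a small hedged sketch that fuses per-modality relevance scores with fixed weights and then filters weak candidates; the weights and threshold are arbitrary examples, not values from any surveyed paper.

```python
# Hedged sketch: re-rank retrieved candidates by fusing per-modality relevance
# scores, then filter out weak hits. Weights and threshold are arbitrary examples.
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    text_score: float    # e.g., dense text similarity or BM25
    visual_score: float  # e.g., CLIP image-query similarity

def rerank(candidates: list[Candidate],
           w_text: float = 0.6, w_visual: float = 0.4,
           threshold: float = 0.3) -> list[Candidate]:
    """Weighted score fusion, threshold filtering, then sort by fused relevance."""
    fused = [(w_text * c.text_score + w_visual * c.visual_score, c) for c in candidates]
    kept = [(s, c) for s, c in fused if s >= threshold]  # filtering mechanism
    kept.sort(key=lambda pair: pair[0], reverse=True)    # relevance re-ranking
    return [c for _, c in kept]
```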

🛠 Fusion Mechanisms

🎰 Score Fusion and Alignment

⚔ Attention-Based Mechanisms

🧩 Unified Frameworks

🚀 Augmentation Techniques

💰 Context-Enrichment

🎡 Adaptive and Iterative Retrieval

🤖 Generation Techniques

🧠 In-Context Learning

👨‍⚖️ Reasoning

🤺 Instruction Tuning

📂 Source Attribution and Evidence Transparency

🔧 Training Strategies and Loss Functions

🛡️ Robustness and Noise Management

🛠 Tasks Addressed by Multimodal RAGs

🩺 Healthcare and Medicine

💻 Software Engineering

🕶️ Fashion and E-Commerce

🤹 Entertainment and Social Computing

🚗 Emerging Applications

📏 Evaluation Metrics

📊 Retrieval Performance

One such metric is the minimum of precision (+P) and sensitivity (Se), which provides a balanced measure of retrieval performance.
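
Written out with the standard definitions of precision and sensitivity over true positives (TP), false positives (FP), and false negatives (FN):

$$
\text{Score} = \min(+P,\; Se), \qquad +P = \frac{TP}{TP + FP}, \qquad Se = \frac{TP}{TP + FN}
$$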

📝 Fluency and Readability

✅ Relevance and Accuracy

🖼️ Image-related Metrics

🎵 Audio-related Metrics

🔗 Text Similarity and Overlap Metrics

📊 Statistical Metrics

⚙️ Efficiency and Computational Performance

🏥 Domain-Specific Metrics


This README is a work in progress and will be completed soon. Stay tuned for more updates!


🔗 Citations

If you find our paper or repository useful, please cite the paper:

@misc{abootorabi2025askmodalitycomprehensivesurvey,
      title={Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation}, 
      author={Mohammad Mahdi Abootorabi and Amirhosein Zobeiri and Mahdi Dehghani and Mohammadali Mohammadkhani and Bardia Mohammadi and Omid Ghahroodi and Mahdieh Soleymani Baghshah and Ehsaneddin Asgari},
      year={2025},
      eprint={2502.08826},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.08826}, 
}

📧 Contact

If you have questions, please send an email to mahdi.abootorabi2@gmail.com.

⭐ Star History

Star History Chart
