MaskRAG is a multimodal, RAG-like framework for referring expression segmentation (RES). It targets the "segmentation hallucination" problem of existing [SEG]-based methods with two core modules: a Mask Retrieval Module, which encodes region features with customized language templates for finer-grained scene perception, and a Mask Augmentation Module, which combines multi-granularity semantic fusion with an adaptive routing mechanism. MaskRAG achieves state-of-the-art performance on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.
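To make the retrieval idea concrete, here is a minimal, dependency-free sketch of one way region features could be pooled from candidate masks, ranked against a query embedding, and written into a language template. It is not the MaskRAG implementation (the code has not been released). The function names, the mean-pooling and cosine-similarity choices, and the template string are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]   # H x W x C image features (hypothetical layout)
Mask = List[List[int]]            # H x W binary mask


def masked_average_pool(features: FeatureMap, mask: Mask) -> Vector:
    """Average the feature vectors of all pixels where the mask is 1."""
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for pixel, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += pixel[c]
    if count == 0:
        return total  # empty mask: return the zero vector
    return [t / count for t in total]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_masks(query: Vector, regions: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return (region index, similarity) for the k regions closest to the query."""
    scored = [(cosine(query, vec), idx) for idx, vec in enumerate(regions)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, ties by index
    return [(idx, score) for score, idx in scored[:k]]


def fill_template(index: int, score: float) -> str:
    """Hypothetical template; the templates MaskRAG actually uses are not given here."""
    return f"<region {index}> matches the expression with similarity {score:.2f}"


# Toy 2x2 feature map with 2 channels: top row is [1, 0], bottom row is [0, 1].
features = [[[1.0, 0.0], [1.0, 0.0]],
            [[0.0, 1.0], [0.0, 1.0]]]
candidate_masks = [
    [[1, 1], [0, 0]],  # top row only
    [[0, 0], [1, 1]],  # bottom row only
    [[1, 1], [1, 1]],  # whole image
]
region_vectors = [masked_average_pool(features, m) for m in candidate_masks]
query_embedding = [1.0, 0.0]  # stand-in for an encoded referring expression
top_regions = retrieve_masks(query_embedding, region_vectors, k=2)
prompts = [fill_template(idx, score) for idx, score in top_regions]
```

In this toy case the top-row mask ranks first and the whole-image mask second. In a real system the pixel features and query embedding would come from learned encoders rather than hand-written lists.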
| Model | RefCOCO Val | RefCOCO+ Val | RefCOCOg Val | Avg. |
|---|---|---|---|---|
| MaskRAG (4B) | 80.7 | 75.8 | 77.6 | 77.8 |
| MaskRAG (8B) | 82.1 | 77.0 | 78.3 | 79.0 |
Coming soon.
If you find this work useful, please cite:
@article{he2025maskrag,
  title={MaskRAG: Mask Retrieval Augmented Generation for MLLM-based Referring Expression Segmentation},
  author={He, Zhongjiang and Zhao, An and Tang, Canhui and Sun, Hao and Sun, Hongbo and Yuan, Ye and Liang, Kongming and Ma, Zhanyu},
  year={2025}
}