We introduce MFC-Bench, a comprehensive Multimodal Fact-Checking testbed designed to evaluate LVLMs on identifying factual inconsistencies and counterfactual scenarios. MFC-Bench encompasses a wide range of visual and textual queries, organized into three binary classification tasks: Manipulation Classification, Out-of-Context Classification, and Veracity Classification.
- The Manipulation Classification task targets various alterations such as face swapping, face attribute editing, background changing, image generation, entity replacement, and style transfer. (Dataset: manipulation-mfc-bench)
- The Out-of-Context Classification task focuses on identifying false connections between an image and a text that may each be true on its own. (Dataset: ooc-mfc-bench)
- The Veracity Classification task is the multimodal counterpart of textual claim verification: it classifies the veracity of a claim given the visual evidence, leveraging the inherent knowledge embedded in LVLMs. (Dataset: veracity-mfc-bench) A data-loading sketch covering all three subtasks follows this list.
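Below is a minimal sketch of how the three subtasks might be loaded and turned into yes/no queries. The dataset identifiers come from the parentheses above, but whether they resolve as Hugging Face dataset IDs, as well as the `test` split and the `image`/`claim` field names, are assumptions made only for illustration:

```python
# Sketch only: dataset IDs, split name, field names, and prompt wording below
# are assumptions for illustration, not the released MFC-Bench schema.
from datasets import load_dataset

TASK_PROMPTS = {
    "manipulation-mfc-bench": "Has this image been manipulated (e.g., face swap, attribute edit, background change)? Answer yes or no.",
    "ooc-mfc-bench": "Is this image used out of context with respect to the accompanying text? Answer yes or no.",
    "veracity-mfc-bench": "Given the image as evidence, is the claim factually true? Answer yes or no.",
}

def build_queries(dataset_id: str):
    """Yield (image, prompt) pairs for one MFC-Bench subtask."""
    data = load_dataset(dataset_id, split="test")   # split name is assumed
    for example in data:
        claim = example.get("claim", "")            # field name is assumed
        prompt = f"Claim: {claim}\n{TASK_PROMPTS[dataset_id]}"
        yield example["image"], prompt              # field name is assumed
```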
To provide an exhaustive perspective on the current state of LVLMs in multimodal fact-checking, we conducted evaluations on 18 representative, accessible LVLMs. Among the open-source LVLMs, we adopt representative models such as Emu2, InternVL, CogVLM, LLaVA-NeXT, InstructBLIP, Pixtral, MiniCPM-V-2.6, LLaVA-OneVision, Molmo, Qwen-VL, Qwen2-VL, Yi-VL, and xGen-MM.
Five of the most powerful closed-source LVLMs, GPT-4o, GPT-4V, Claude3.5-Sonnet, Claude3-Haiku, and Gemini-1.5-Pro, are also included in our testing scope.
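For the closed-source models, queries are issued through their public APIs. Here is a minimal sketch for GPT-4o using the OpenAI Python client; the prompt wording and decoding settings are our own illustration, not the exact evaluation harness used in the paper:

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4o(image_path: str, prompt: str) -> str:
    """Send one image plus a fact-checking prompt and return the raw verdict text."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        temperature=0,  # deterministic decoding; actual settings may differ
    )
    return response.choices[0].message.content
```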
To explore the effect of different prompting strategies such as Chain-of-Thought (CoT) and In-Context Learning (ICL), we evaluated the following four prompt settings on MFC-Bench: Zero-shot, Zero-shot with CoT, Few-shot, and Few-shot with CoT.
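A sketch of how the four prompt settings could be assembled; the instruction text, the CoT trigger phrase, and the format of the in-context demonstrations are placeholders rather than the exact prompts used in the benchmark:

```python
COT_SUFFIX = "Let's think step by step, then give the final answer (yes/no)."

def build_prompt(claim: str, demos=None, use_cot: bool = False) -> str:
    """Compose zero-/few-shot prompts, optionally with a CoT trigger.

    demos: optional list of (claim, answer) pairs for few-shot / ICL prompting.
    """
    parts = []
    if demos:  # few-shot: prepend worked examples
        for demo_claim, demo_answer in demos:
            parts.append(f"Claim: {demo_claim}\nAnswer: {demo_answer}")
    parts.append(f"Claim: {claim}")
    parts.append(COT_SUFFIX if use_cot else "Answer yes or no.")
    return "\n\n".join(parts)

# The four settings evaluated on MFC-Bench:
# zero_shot     = build_prompt(claim)
# zero_shot_cot = build_prompt(claim, use_cot=True)
# few_shot      = build_prompt(claim, demos=demos)
# few_shot_cot  = build_prompt(claim, demos=demos, use_cot=True)
```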
Each model's justification was evaluated by GPT-4 and human subjects across four dimensions: Misleadingness (M), Informativeness (I), Soundness (S), and Readability (R). A 5-point Likert scale was used, where 1 indicates the lowest quality and 5 the highest for Informativeness, Soundness, and Readability; the scale is reversed for Misleadingness (1 is best).
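Because the Misleadingness scale runs in the opposite direction, a small helper like the following (our own illustration, not the paper's scoring script) can map all four dimensions onto a common higher-is-better scale before averaging:

```python
def normalize_ratings(ratings: dict[str, float]) -> dict[str, float]:
    """Map 5-point Likert ratings so that higher is always better.

    ratings: {"M": ..., "I": ..., "S": ..., "R": ...} on a 1-5 scale,
    where Misleadingness (M) is reversed (1 = least misleading = best).
    """
    out = dict(ratings)
    out["M"] = 6 - ratings["M"]  # reverse: 1 -> 5, 5 -> 1
    return out

# Example: a justification rated M=1, I=4, S=5, R=4 becomes
# {"M": 5, "I": 4, "S": 5, "R": 4}, i.e., a mean quality of 4.5.
```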
If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)
@misc{wang2024mfcbenchbenchmarkingmultimodalfactchecking,
      title={MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models},
      author={Shengkang Wang and Hongzhan Lin and Ziyang Luo and Zhen Ye and Guang Chen and Jing Ma},
      year={2024},
      eprint={2406.11288},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.11288},
}