MFC-Bench

arXiv: 2406.11288 · License: Apache 2.0

(Figure: overall architecture of MFC-Bench)

📑 Introduction

We introduce MFC-Bench, a comprehensive multimodal fact-checking testbed designed to evaluate LVLMs on identifying factual inconsistencies and counterfactual scenarios. MFC-Bench encompasses a wide range of visual and textual queries, organized into three binary classification tasks: Manipulation Classification, Out-of-Context Classification, and Veracity Classification.

  1. The Manipulation Classification task targets various alterations such as face swapping, face attribute editing, background changing, image generation, entity replacement, and style transfer. (Dataset: manipulation-mfc-bench)
  2. The Out-of-Context Classification task focuses on identifying a false connection between an image and a text that may each be true on its own. (Dataset: ooc-mfc-bench)
  3. The Veracity Classification task is the multimodal counterpart of classifying the veracity of a textual claim given visual evidence, leveraging the knowledge embedded in LVLMs. (Dataset: veracity-mfc-bench; see the loading sketch after this list)
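
As a rough illustration (not part of the official code), the three splits could be loaded as shown below, assuming they are distributed as Hugging Face datasets. The hub paths ("<org>/...") and the field names are placeholders and should be checked against the actual dataset links above.

# Hedged sketch: loading the three MFC-Bench tasks as binary-classification data.
# The hub paths ("<org>/...") and column names are assumptions, not confirmed by the repo.
from datasets import load_dataset

TASKS = {
    "manipulation": "<org>/manipulation-mfc-bench",  # placeholder hub path
    "ooc": "<org>/ooc-mfc-bench",                    # placeholder hub path
    "veracity": "<org>/veracity-mfc-bench",          # placeholder hub path
}

def load_task(task_name: str):
    """Load one MFC-Bench task; each example is assumed to pair an image with
    a textual claim and a binary label."""
    return load_dataset(TASKS[task_name], split="test")

if __name__ == "__main__":
    ds = load_task("veracity")
    print(ds[0].keys())  # inspect the real field names; the ones assumed above may differ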

🎯 Dataset Construction

(Figure: dataset construction)

🔍 Methodology

To provide an exhaustive perspective on the current state of LVLMs in the context of multimodal fact-checking, we conducted evaluations on 18 representative, accessible LVLMs. Among the open-source models, we adopt representative models such as Emu2, InternVL, CogVLM, LLaVA-NeXT, InstructBLIP, Pixtral, MiniCPM-V-2.6, LLaVA-OneVision, Molmo, Qwen-VL, Qwen2-VL, Yi-VL, and xGen-MM.

Five of the most powerful closed-source LVLMs, GPT-4o, GPT-4V, Claude3.5-Sonnet, Claude3-Haiku, and Gemini-1.5-Pro, are also included in our testing scope.

To explore the effect of different prompting strategies such as Chain-of-Thought (CoT) and In-Context Learning (ICL), we evaluated the following four prompting methods on MFC-Bench: Zero-shot, Zero-shot with CoT, Few-shot, and Few-shot with CoT.

(Figure: the four prompt types)
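
For concreteness, the sketch below shows how the four prompting setups might be assembled for a binary fact-checking query. The template wording and the few-shot demonstration are illustrative assumptions, not the exact prompts used in the paper.

# Illustrative prompt builders for the four settings evaluated on MFC-Bench.
# Template wording and demonstrations are assumptions, not the paper's exact prompts.

COT_SUFFIX = "Let's think step by step, then answer True or False."

def zero_shot(claim: str, use_cot: bool = False) -> str:
    prompt = (
        "Given the attached image, is the following claim factually correct? "
        f"Claim: {claim}\nAnswer True or False."
    )
    return f"{prompt}\n{COT_SUFFIX}" if use_cot else prompt

def few_shot(claim: str, demos: list[tuple[str, str]], use_cot: bool = False) -> str:
    # demos: (demonstration claim, gold answer) pairs shown before the query (ICL).
    lines = [f"Claim: {c}\nAnswer: {a}" for c, a in demos]
    lines.append(zero_shot(claim, use_cot=use_cot))
    return "\n\n".join(lines)

# Example usage with a hypothetical demonstration:
demos = [("The photo shows the 2016 Olympics opening ceremony.", "False")]
print(few_shot("The image was taken in Paris.", demos, use_cot=True))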

🔦 Experiment Results

1. Zero-Shot

(Figure: main zero-shot results)

2. Model Interpretability

Each model's justification was evaluated by GPT-4 and by human subjects across four dimensions: Misleadingness (M), Informativeness (I), Soundness (S), and Readability (R). A 5-point Likert scale was used, where 1 indicates the lowest quality and 5 the highest for Informativeness, Soundness, and Readability; for Misleadingness the scale is reversed, so lower is better.

(Figure: model interpretability ratings)
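
To make the rating protocol concrete, here is a minimal sketch of how a GPT-4 judge could be prompted to score a justification on the four dimensions. The rubric wording, the JSON output format, and the parsing are assumptions, not the official evaluation script.

# Hedged sketch of a GPT-4-as-judge rubric for the four dimensions (M, I, S, R).
# Rubric wording and output parsing are assumptions, not the official script.
import json
from openai import OpenAI  # assumes the official openai>=1.0 client is installed

RUBRIC = """Rate the justification below on a 1-5 Likert scale for each dimension:
- Misleadingness (M): 1 = not misleading at all, 5 = highly misleading (lower is better)
- Informativeness (I): 1 = lowest quality, 5 = highest quality
- Soundness (S): 1 = lowest quality, 5 = highest quality
- Readability (R): 1 = lowest quality, 5 = highest quality
Return only a JSON object like {"M": 1, "I": 5, "S": 4, "R": 5}."""

def judge(justification: str, model: str = "gpt-4") -> dict:
    client = OpenAI()  # requires OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\nJustification:\n{justification}"}],
    )
    # Assumes the judge replies with the requested JSON object.
    return json.loads(response.choices[0].message.content)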

3. Zero-Shot with CoT

(Figure: zero-shot CoT results)

4. Few-Shot

(Figure: few-shot results)

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@misc{wang2024mfcbenchbenchmarkingmultimodalfactchecking,
      title={MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models}, 
      author={Shengkang Wang and Hongzhan Lin and Ziyang Luo and Zhen Ye and Guang Chen and Jing Ma},
      year={2024},
      eprint={2406.11288},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.11288}, 
}
