Evaluating MLLM Robustness on Multimodal Math Reasoning with Adversarial Perturbations

📰 News

  • [January 2025] The RoMMath paper has been accepted to the NAACL 2025 Main Conference!

👋 Overview

RoMMath is the first benchmark designed to evaluate the robustness of multimodal large language models (MLLMs) on math reasoning, particularly under adversarial perturbations across both the text and vision modalities.

📌 “Are MLLMs robust when solving math problems under adversarial attacks in the textual and visual contexts?”

🌟 Highlights

  • 🔢 4,800 expert-annotated examples covering high-school-level geometry, functions, and statistics
  • ⚠️ 7 adversarial perturbation types: 4 text-level + 3 vision-level
  • 🧠 Fine-grained error types and diagnostic analysis
  • 📊 Robustness evaluation of 17 top-performing MLLMs
  • 🧪 In-context learning and prompting strategies explored to boost performance

🧩 Benchmark Structure

Each RoMMath sample includes the following (an illustrative record is sketched after the list):

  • 📌 A math word problem with both text and visual components
  • 🧾 Answer label (multiple-choice or free-form)
  • 🧪 7 distinct adversarial variants per problem
  • ✅ Human-verified data quality
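
For illustration only, a record could look like the sketch below; every field name here is a hypothetical assumption and may not match the repository's actual data files:

```python
# Hypothetical RoMMath record layout -- all field names are illustrative
# assumptions, NOT the repository's actual schema.
sample = {
    "id": "testmini-geometry-0001",
    "subject": "geometry",             # geometry | functions | statistics
    "question": "In the figure, AB is parallel to CD. Find angle x.",
    "image_path": "images/testmini-geometry-0001.png",
    "answer_type": "multiple-choice",  # or "free-form"
    "options": ["30", "45", "60", "90"],
    "answer": "60",
    "perturbation": "original",        # or one of the seven adversarial types
}
```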

🎯 Perturbation Types

Text-level (a toy sketch follows this list):

  1. Lexical – replace words with uncommon synonyms
  2. Structure – reorder the information or alter the grammar
  3. Semantic Complexification – add semantic complexity to the problem statement
  4. Interference Introduction – insert distracting textual information
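
As a toy illustration of the lexical idea (not the authors' pipeline), the snippet below swaps a word for a rarer WordNet synonym; the "longest lemma is least common" heuristic is an assumption made for brevity:

```python
# Toy lexical perturbation: swap a word for an uncommon WordNet synonym.
# NOT the RoMMath pipeline -- a minimal sketch using NLTK's WordNet.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def uncommon_synonym(word: str) -> str:
    """Return a rarer-looking WordNet synonym of `word`, else `word` itself."""
    lemmas = {
        lemma.name().replace("_", " ")
        for synset in wn.synsets(word)
        for lemma in synset.lemmas()
    }
    lemmas.discard(word)
    # Crude heuristic: treat the longest lemma as the least common synonym.
    return max(lemmas, key=len) if lemmas else word

print(uncommon_synonym("angle"))  # prints some rarer WordNet lemma
```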

Vision-level (a toy sketch follows this list):

  1. Low-level perturbation – add noise or alter brightness/color
  2. Vision-dominant interpretation – move key information into the image
  3. Visual interference – overlay irrelevant symbols or visual noise
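
Similarly, here is a minimal sketch of a low-level vision perturbation (a brightness shift plus Gaussian pixel noise); again, this is only an illustration, not the authors' exact procedure:

```python
# Toy low-level vision perturbation: brightness shift + Gaussian pixel noise.
# NOT the RoMMath pipeline -- a minimal sketch with Pillow and NumPy.
import numpy as np
from PIL import Image, ImageEnhance

def perturb(path: str, noise_std: float = 10.0, brightness: float = 1.3) -> Image.Image:
    img = Image.open(path).convert("RGB")
    img = ImageEnhance.Brightness(img).enhance(brightness)  # brighten by 30%
    arr = np.asarray(img, dtype=np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)      # additive noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

# "figure.png" is a placeholder path, not a file shipped with the repo.
perturb("figure.png").save("figure_perturbed.png")
```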

🧬 Dataset Overview

| Category    | TestMini | Test  |
|-------------|----------|-------|
| Original    | 200      | 400   |
| Adversarial | 1,400    | 2,800 |
| Total       | 1,600    | 3,200 |

Each original problem has seven adversarial variants (200 × 7 = 1,400 for TestMini; 400 × 7 = 2,800 for Test), giving 4,800 examples in total.

🚀 Quickstart

🧰 Step 0: Install Environment

```bash
conda create --name rommath python=3.10
conda activate rommath
pip install -r requirements.txt
```

🤖 Step 1: Run Inference

```bash
bash scripts/vllm_small.sh
```
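
If you would rather call vLLM directly than go through the script, multimodal inference looks roughly like the sketch below. The model name and prompt template are placeholders (vLLM's documented LLaVA-1.5 format), not necessarily what scripts/vllm_small.sh actually runs:

```python
# Rough sketch of direct vLLM multimodal inference; the model and prompt
# template are placeholders, not necessarily what scripts/vllm_small.sh uses.
from vllm import LLM, SamplingParams
from PIL import Image

llm = LLM(model="llava-hf/llava-1.5-7b-hf")        # placeholder MLLM
params = SamplingParams(temperature=0.0, max_tokens=512)

outputs = llm.generate(
    {
        "prompt": "USER: <image>\nSolve the problem and state the final answer.\nASSISTANT:",
        "multi_modal_data": {"image": Image.open("figure.png")},
    },
    params,
)
print(outputs[0].outputs[0].text)
```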

📈 Step 2: Evaluate Accuracy

```bash
python acc_evaluation.py
```
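
acc_evaluation.py is not reproduced here; as a rough idea, per-perturbation accuracy over a predictions file could be computed as below, where the file layout and field names are assumptions rather than the script's actual format:

```python
# Hedged sketch of per-perturbation accuracy; the JSONL layout and field
# names below are assumptions, NOT acc_evaluation.py's actual format.
import json
from collections import defaultdict

correct, total = defaultdict(int), defaultdict(int)

with open("predictions.jsonl") as f:       # hypothetical output of Step 1
    for line in f:
        rec = json.loads(line)
        group = rec["perturbation"]        # "original" or an adversarial type
        total[group] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[group] += 1

for group in sorted(total):
    print(f"{group:>26}: {correct[group] / total[group]:.1%} ({total[group]})")
```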

✍️ Citation

If you use our work, please cite us:

```bibtex
@inproceedings{zhao-etal-2025-multimodal,
    title = "Are Multimodal {LLM}s Robust Against Adversarial Perturbations? {R}o{MM}ath: A Systematic Evaluation on Multimodal Math Reasoning",
    author = "Zhao, Yilun  and
      Gan, Guo  and
      Wang, Chengye  and
      Zhao, Chen  and
      Cohan, Arman",
    editor = "Chiruzzo, Luis  and
      Ritter, Alan  and
      Wang, Lu",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.582/",
    doi = "10.18653/v1/2025.naacl-long.582",
    pages = "11653--11665",
    ISBN = "979-8-89176-189-6",
    abstract = "We introduce RoMMath, the first benchmark designed to evaluate the capabilities and robustness of multimodal large language models (MLLMs) in handling multimodal math reasoning, particularly when faced with adversarial perturbations. RoMMath consists of 4,800 expert-annotated examples, including an original set and seven adversarial sets, each targeting a specific type of perturbation at the text or vision levels. We evaluate a broad spectrum of 17 MLLMs on RoMMath and uncover a critical challenge regarding model robustness against adversarial perturbations. Through detailed error analysis by human experts, we gain a deeper understanding of the current limitations of MLLMs. Additionally, we explore various approaches to enhance the performance and robustness of MLLMs, providing insights that can guide future research efforts."
}
```
