
CrossMath: Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap


Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen

Overview

Is there a reasoning gap between the textual and visual modalities in VLMs? We say yes!

We introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons.

CrossMath

Quick Start

Setup and Dependencies

Create environments:

bash build_enviroments.sh --name [Your/env/name]

Prepare Data

The test data is in data/, and is also available on the Hugging Face Hub as xuyige/CrossMath.
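The test sets are JSONL files (one JSON object per line). A minimal sketch of loading such a file, assuming hypothetical field names (`question`, `answer`, `image`) since the actual schema of `testset_hr.jsonl` is not documented here:

```python
import json

# Hypothetical sample records; the real field names in
# data/Original/testset_hr.jsonl may differ.
sample_jsonl = "\n".join([
    json.dumps({"question": "2 + 3 = ?", "answer": "5", "image": "imgs/q1.png"}),
    json.dumps({"question": "7 * 6 = ?", "answer": "42", "image": "imgs/q2.png"}),
])

def load_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = load_jsonl(sample_jsonl)
print(len(records), records[0]["answer"])
```

In practice you would pass the contents of one of the files under data/ to `load_jsonl`.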

Evaluation

Original Style

For the original image style, use the following command for batch evaluation:

python batch_inference_qwen35.py \
    --test_file "data/Original/testset_hr.jsonl" \
    --model_name Qwen/Qwen3.5-9B \
    --adapter_dir None \
    --modality image \
    --max_new_tokens 16384 \
    --num_return_sequence 4 \
    --log_suffix "hr"

where --adapter_dir can be used to load LoRA adapters, --modality is one of [image, hybrid, text], --num_return_sequence generates multiple sequences in parallel, and --log_suffix "hr" saves the predicted answers to files hr_run_1.log, hr_run_2.log, ...
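The README does not specify how the multiple sequences from `--num_return_sequence` are aggregated; one common approach for reasoning benchmarks is self-consistency, i.e. a majority vote over the sampled answers. A minimal sketch of that idea (an assumption, not necessarily what the repo's scripts do):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent sampled answer (self-consistency style)."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. four sampled answers when --num_return_sequence is 4
print(majority_vote(["42", "41", "42", "42"]))  # -> 42
```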

Change Image Styles

To change the image style, change the following arguments:

| Image Style | `--test_file` | `--log_suffix` |
| --- | --- | --- |
| Original Style | `data/Original/testset_hr.jsonl` | `"hr"` |
| Without Border | `data/Noborder/testset_noborder.jsonl` | `"noborder"` |
| With Significant Background | `data/Beige/testset_hr_beige.jsonl` | `"beige"` |
| Change Font and Color | `data/Altstyle/testset_altstyle.jsonl` | `"altstyle"` |
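To evaluate all four styles in one go, the table above can be scripted. A small sketch that builds one command line per style (it only constructs the commands; launching them, e.g. via `subprocess`, is left to the reader):

```python
# Mapping from image style to (--test_file, --log_suffix), taken from the table above.
STYLES = {
    "Original Style":              ("data/Original/testset_hr.jsonl",       "hr"),
    "Without Border":              ("data/Noborder/testset_noborder.jsonl", "noborder"),
    "With Significant Background": ("data/Beige/testset_hr_beige.jsonl",    "beige"),
    "Change Font and Color":       ("data/Altstyle/testset_altstyle.jsonl", "altstyle"),
}

def build_command(test_file, log_suffix):
    """Assemble the batch-evaluation command for one image style."""
    return (
        "python batch_inference_qwen35.py "
        f'--test_file "{test_file}" '
        "--model_name Qwen/Qwen3.5-9B "
        "--adapter_dir None "
        "--modality image "
        "--max_new_tokens 16384 "
        "--num_return_sequence 4 "
        f'--log_suffix "{log_suffix}"'
    )

commands = [build_command(f, s) for f, s in STYLES.values()]
for cmd in commands:
    print(cmd)
```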

Compute Metrics

To compute the evaluation metrics, run:

python calc_metric.py \
    --test_file "data/Original/testset_hr.jsonl" \
    --num_return_sequences 4 \
    --log_suffix "hr"

The script will then compute metrics from the logs with the suffix "hr", i.e., the log files named hr_run_1.log, hr_run_2.log, ...
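With several runs per setting, a natural summary metric is accuracy averaged over runs. A sketch of that aggregation, assuming a hypothetical log format of one tab-separated `prediction<TAB>gold` pair per line (the actual format of hr_run_*.log is not documented here):

```python
# Assumed log format: one "prediction\tgold" pair per line (hypothetical).
def run_accuracy(lines):
    """Fraction of lines whose prediction matches the gold answer."""
    pairs = [line.split("\t") for line in lines if line.strip()]
    correct = sum(1 for pred, gold in pairs if pred == gold)
    return correct / len(pairs)

runs = [
    ["5\t5", "42\t42", "7\t8"],  # hr_run_1.log -> 2/3 correct
    ["5\t5", "41\t42", "8\t8"],  # hr_run_2.log -> 2/3 correct
]
mean_acc = sum(run_accuracy(r) for r in runs) / len(runs)
print(round(mean_acc, 4))  # -> 0.6667
```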

Citation

If you find this work helpful, please cite:

@article{xu2026crossmathbench,
	title={Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap},
	author={Xu, Yige and Wang, Yongjie and Wu, Zizhuo and Song, Kaisong and Lin, Jun and Shen, Zhiqi},
	journal={arXiv preprint arXiv:2604.16256},
	year={2026}
}
