
CrossMath: Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap


Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen

Overview

Is there a reasoning gap between the textual and visual modalities in VLMs? We say yes!

We introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons.

CrossMath

Quick Start

Setup and Dependencies

Create environments:

bash build_enviroments.sh --name [Your/env/name]

Prepare Data

The test data is in data/, and is also available on the Hugging Face Hub as xuyige/CrossMath.
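The test sets are JSONL files (one JSON object per line). A minimal sketch of loading such a file, assuming hypothetical field names (`question`, `answer`, `image`) since the actual schema of `testset_hr.jsonl` is not documented here:

```python
import json

# Hypothetical sample records; the real field names in
# data/Original/testset_hr.jsonl may differ.
sample_jsonl = "\n".join([
    json.dumps({"question": "2 + 3 = ?", "answer": "5", "image": "imgs/q1.png"}),
    json.dumps({"question": "7 * 6 = ?", "answer": "42", "image": "imgs/q2.png"}),
])

def load_jsonl(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = load_jsonl(sample_jsonl)
print(len(records), records[0]["answer"])
```

In practice you would pass the contents of one of the files under data/ to `load_jsonl`.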

Evaluation

Original Style

For the original image style, use the following command for batch evaluation:

python batch_inference_qwen35.py \
    --test_file "data/Original/testset_hr.jsonl" \
    --model_name Qwen/Qwen3.5-9B \
    --adapter_dir None \
    --modality image \
    --max_new_tokens 16384 \
    --num_return_sequence 4 \
    --log_suffix "hr"

where --adapter_dir can be used to load LoRA adapters, --modality is one of [image, hybrid, text], --num_return_sequence generates multiple sequences in parallel, and --log_suffix "hr" saves the predicted answers to files hr_run_1.log, hr_run_2.log, ...
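The README does not specify how the multiple sequences from `--num_return_sequence` are aggregated; one common approach for reasoning benchmarks is self-consistency, i.e. a majority vote over the sampled answers. A minimal sketch of that idea (an assumption, not necessarily what the repo's scripts do):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent sampled answer (self-consistency style)."""
    return Counter(predictions).most_common(1)[0][0]

# e.g. four sampled answers when --num_return_sequence is 4
print(majority_vote(["42", "41", "42", "42"]))  # -> 42
```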

Change Image Styles

To change the image style, change the following arguments:

| Image Style | `--test_file` | `--log_suffix` |
| --- | --- | --- |
| Original Style | `data/Original/testset_hr.jsonl` | `"hr"` |
| Without Border | `data/Noborder/testset_noborder.jsonl` | `"noborder"` |
| With Significant Background | `data/Beige/testset_hr_beige.jsonl` | `"beige"` |
| Change Font and Color | `data/Altstyle/testset_altstyle.jsonl` | `"altstyle"` |
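To evaluate all four styles in one go, the table above can be scripted. A small sketch that builds one command line per style (it only constructs the commands; launching them, e.g. via `subprocess`, is left to the reader):

```python
# Mapping from image style to (--test_file, --log_suffix), taken from the table above.
STYLES = {
    "Original Style":              ("data/Original/testset_hr.jsonl",       "hr"),
    "Without Border":              ("data/Noborder/testset_noborder.jsonl", "noborder"),
    "With Significant Background": ("data/Beige/testset_hr_beige.jsonl",    "beige"),
    "Change Font and Color":       ("data/Altstyle/testset_altstyle.jsonl", "altstyle"),
}

def build_command(test_file, log_suffix):
    """Assemble the batch-evaluation command for one image style."""
    return (
        "python batch_inference_qwen35.py "
        f'--test_file "{test_file}" '
        "--model_name Qwen/Qwen3.5-9B "
        "--adapter_dir None "
        "--modality image "
        "--max_new_tokens 16384 "
        "--num_return_sequence 4 "
        f'--log_suffix "{log_suffix}"'
    )

commands = [build_command(f, s) for f, s in STYLES.values()]
for cmd in commands:
    print(cmd)
```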

Compute Metrics

To compute the evaluation metrics, run:

python calc_metric.py \
    --test_file "data/Original/testset_hr.jsonl" \
    --num_return_sequences 4 \
    --log_suffix "hr"

The script will then compute metrics from the logs with the suffix "hr", i.e., the log files named hr_run_1.log, hr_run_2.log, ...
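With several runs per setting, a natural summary metric is accuracy averaged over runs. A sketch of that aggregation, assuming a hypothetical log format of one tab-separated `prediction<TAB>gold` pair per line (the actual format of hr_run_*.log is not documented here):

```python
# Assumed log format: one "prediction\tgold" pair per line (hypothetical).
def run_accuracy(lines):
    """Fraction of lines whose prediction matches the gold answer."""
    pairs = [line.split("\t") for line in lines if line.strip()]
    correct = sum(1 for pred, gold in pairs if pred == gold)
    return correct / len(pairs)

runs = [
    ["5\t5", "42\t42", "7\t8"],  # hr_run_1.log -> 2/3 correct
    ["5\t5", "41\t42", "8\t8"],  # hr_run_2.log -> 2/3 correct
]
mean_acc = sum(run_accuracy(r) for r in runs) / len(runs)
print(round(mean_acc, 4))  # -> 0.6667
```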

Citation

If you find this work helpful, please cite:

@article{xu2026crossmathbench,
	title={Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap},
	author={Xu, Yige and Wang, Yongjie and Wu, Zizhuo and Song, Kaisong and Lin, Jun and Shen, Zhiqi},
	journal={arXiv preprint arXiv:2604.16256},
	year={2026}
}
