Authors: Yige Xu, Yongjie Wang, Zizhuo Wu, Kaisong Song, Jun Lin, Zhiqi Shen.
Is there a reasoning gap between the textual modality and the vision modality in VLMs? We say yes!
We introduce CrossMath, a novel multimodal reasoning benchmark designed for controlled cross-modal comparisons.
Create the environment:
bash build_enviroments.sh --name [Your/env/name]

The test data is in data/ and is also available on Hugging Face under the name xuyige/CrossMath.
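To take a quick look at a test file before running inference, a minimal sketch like the following reads a JSONL file line by line. It assumes only that each non-empty line holds one JSON object; the helper name `read_jsonl` is hypothetical and not part of the repository, and no particular field names are assumed.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Return a list with one parsed JSON value per non-empty line of `path`."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the README; adjust it for the other image styles.
    path = Path("data/Original/testset_hr.jsonl")
    if path.exists():
        records = read_jsonl(path)
        print(f"{len(records)} records")
        if records and isinstance(records[0], dict):
            print("fields:", sorted(records[0].keys()))
    else:
        print(f"{path} not found")
```

Printing the field names of the first record is a cheap way to confirm the schema without assuming it in advance.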
For the original image style, use the following command for batch evaluation:
python batch_inference_qwen35.py \
--test_file "data/Original/testset_hr.jsonl" \
--model_name Qwen/Qwen3.5-9B \
--adapter_dir None \
--modality image \
--max_new_tokens 16384 \
--num_return_sequence 4 \
--log_suffix "hr"

Here, --adapter_dir can be used to load LoRA adapters, --modality can be one of [image, hybrid, text], --num_return_sequence generates multiple sequences in parallel, and --log_suffix "hr" saves the predicted answers to the files hr_run_1.log, hr_run_2.log, ...
To change the image style, set the following arguments:
| Image Style | --test_file | --log_suffix |
| --- | --- | --- |
| Original Style | data/Original/testset_hr.jsonl | "hr" |
| Without Border | data/Noborder/testset_noborder.jsonl | "noborder" |
| With Significant Background | data/Beige/testset_hr_beige.jsonl | "beige" |
| Change Font and Color | data/Altstyle/testset_altstyle.jsonl | "altstyle" |
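To evaluate all four image styles in one go, the table above can be turned into a small driver. The sketch below only builds and prints the commands; it reuses the flag values from the batch evaluation example, and the names `STYLES` and `build_command` are hypothetical helpers, not part of the repository.

```python
import shlex

# (--test_file, --log_suffix) pairs copied from the table above.
STYLES = {
    "Original Style": ("data/Original/testset_hr.jsonl", "hr"),
    "Without Border": ("data/Noborder/testset_noborder.jsonl", "noborder"),
    "With Significant Background": ("data/Beige/testset_hr_beige.jsonl", "beige"),
    "Change Font and Color": ("data/Altstyle/testset_altstyle.jsonl", "altstyle"),
}


def build_command(test_file, log_suffix, modality="image"):
    """Build the argument list for one batch_inference_qwen35.py run."""
    return [
        "python", "batch_inference_qwen35.py",
        "--test_file", test_file,
        "--model_name", "Qwen/Qwen3.5-9B",
        "--adapter_dir", "None",
        "--modality", modality,
        "--max_new_tokens", "16384",
        "--num_return_sequence", "4",
        "--log_suffix", log_suffix,
    ]


if __name__ == "__main__":
    for style, (test_file, log_suffix) in STYLES.items():
        # Print instead of running; pass the list to subprocess.run to execute.
        print(f"# {style}")
        print(shlex.join(build_command(test_file, log_suffix)))
```

Keeping the arguments as a list rather than a single string avoids shell quoting problems if the commands are later passed to `subprocess.run`.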
To compute the evaluation metrics, run:
python calc_metric.py \
--test_file "data/Original/testset_hr.jsonl" \
--num_return_sequences 4 \
--log_suffix "hr"

The script then computes the metrics from the logs with the suffix "hr", that is, from the log files named hr_run_1.log, hr_run_2.log, ...
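Before computing metrics, it can help to check that the expected log files are present. The sketch below follows the hr_run_1.log, hr_run_2.log, ... naming described above. It assumes, without confirmation from the scripts, that there is one log file per returned sequence and that the logs sit in the current working directory; `expected_logs` and `missing_logs` are hypothetical helpers.

```python
from pathlib import Path


def expected_logs(log_suffix, num_runs, log_dir="."):
    """Return the paths <log_suffix>_run_1.log ... <log_suffix>_run_<num_runs>.log."""
    return [Path(log_dir) / f"{log_suffix}_run_{i}.log" for i in range(1, num_runs + 1)]


def missing_logs(log_suffix, num_runs, log_dir="."):
    """Return the expected log paths that do not exist on disk."""
    return [p for p in expected_logs(log_suffix, num_runs, log_dir) if not p.exists()]


if __name__ == "__main__":
    # Assumption: 4 returned sequences produce hr_run_1.log to hr_run_4.log.
    missing = missing_logs("hr", 4)
    if missing:
        print("missing logs:", ", ".join(str(p) for p in missing))
    else:
        print("all expected logs found")
```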
If you find this work helpful, please cite:
@article{xu2026crossmathbench,
title={Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap},
author={Xu, Yige and Wang, Yongjie and Wu, Zizhuo and Song, Kaisong and Lin, Jun and Shen, Zhiqi},
journal={arXiv preprint arXiv:2604.16256},
year={2026}
}
