MM-Text2SQL-Bench is a benchmark for image-grounded Text-to-SQL and multimodal database reasoning. It studies a setting in which answering a natural-language question requires joint reasoning over an image, a database schema, a database instance, and the question itself.
Unlike conventional Text-to-SQL, where the query can often be inferred from text and schema alone, MM-Text2SQL-Bench focuses on cases where part of the SQL semantics is determined by visual evidence. The benchmark is designed to evaluate whether models can identify image-conditioned constraints, connect them to database fields, and generate executable SQL queries that return correct answers.
Each example in MM-Text2SQL-Bench contains:
- an image
- a natural-language question
- a database schema
- a database instance
- a gold SQL query
- the execution answer
- optional structured visual evidence
A simplified JSON-style representation is shown below: { "id": "sample_0001", "image": "images/sample_0001.jpg", "question": "Which store has the highest sales among the products shown in the image?", "schema": "schemas/sample_0001_schema.json", "database": "databases/sample_0001.sqlite", "sql": "SELECT store_name FROM sales ...", "answer": ["Downtown Store"], "evidence": { "visual_entities": ["product_A", "product_B"], "candidate_columns": ["product_name", "category"], "grounded_conditions": ["product_name IN (...)"] } }
The benchmark is built to study multimodal reasoning beyond standard visual question answering and beyond standard Text-to-SQL. In our setting, the image is not decorative side information. Instead, it provides part of the grounding signal required to construct the correct SQL query.
MM-Text2SQL-Bench is motivated by the observation that many real-world analytical questions require both visual understanding and structured data access. Existing Text-to-SQL benchmarks assume that the full query intent is recoverable from text and schema, while conventional multimodal QA benchmarks often do not require executable database querying.
This benchmark fills that gap by introducing a task where the model must jointly:
- identify visually grounded conditions,
- map those conditions to relevant schema elements,
- generate executable SQL, and
- retrieve the correct answer through database execution.
Given an image I, a natural-language question q, a database schema S, and a database instance D, the goal is to generate a SQL query y such that:
yis syntactically valid,yis executable onD,ycorrectly captures the semantics ofq, and- at least part of
yis grounded in the visual content ofI.
We evaluate models under three protocols:
The model receives the question and the schema only. This protocol measures how far a model can go without explicit visual grounding.
The model receives the question, schema, and structured visual evidence extracted from the image. This protocol evaluates whether intermediate visual grounding improves SQL generation.
The model directly receives the image, question, and schema. This protocol evaluates integrated multimodal reasoning without manually curated intermediate evidence.
Due to file size and storage limitations, the full dataset is not directly hosted in this repository.
If you need the complete dataset, please download it from the following Google Drive link:
Google Drive: https://drive.google.com/drive/folders/1SUGDUgZI2FJ_nVGUP5DpmXBKo6sj5vcj?usp=sharing
This repository only provides the dataset description, schema files, sample examples, and related documentation.
For full experiments, training, or evaluation, please use the complete dataset from Google Drive.
MM-Text2SQL-Bench/
├── README.md
├── LICENSE
├── CITATION.cff
├── docs/
│ ├── index.md
│ ├── dataset.md
│ ├── examples.md
│ ├── baselines.md
│ ├── download.md
│ ├── ethics.md
│ └── assets/
├── data/
│ ├── sample/
│ └── README.md
├── dataset_card/
│ └── DATASET_CARD.md
├── baselines/
├── scripts/
├── supplementary/
└── paper/
---