ImageMining is a benchmark for evaluating multimodal agents on integrating high-density visual understanding with autonomous deep search. Unlike traditional VQA tasks, ImageMining requires models to actively mine visual inputs through agentic behaviors: multi-step tool calls such as cropping localized regions, magnifying minute details, and cross-referencing visual inputs to refine search queries.
The benchmark tests the paradigm of "think with image, deep search with image": reasoning is anchored in visual context rather than in textual shortcuts or parametric knowledge. Task performance correlates strongly with the precision of on-image tool use. Models are evaluated across a "Deep-Wide-Search" spectrum that spans search breadth across sources and reasoning depth within images.
| Metric | Count |
|---|---|
| Total test cases | 217 |
| Domains | 7 |
| Sub-categories | 23 |
| Reasoning types | 5 |
Each entry in `data.jsonl` contains the following fields:
| Field | Description |
|---|---|
| `id` | Unique identifier |
| `category_l1` | Primary domain category |
| `category_l2` | Fine-grained sub-category |
| `difficulty_tags` | List of reasoning types required |
| `image` | Corresponding image filename in `images/` |
| `question` | Question text (English) |
| `answer` | Ground-truth answer (English) |
| `question_zh` | Original question (Chinese) |
| `answer_zh` | Original answer (Chinese) |
| `reasoning` | Step-by-step reasoning chain (English) |
| `reasoning_zh` | Original reasoning chain (Chinese) |
| `need_image_before_search` | Whether visual input is needed before starting the search (`yes`/`no`) |
| `need_image_during_search` | Whether visual input is needed during the search process (`yes`/`no`) |
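The fields above can be loaded with a few lines of Python. This is a minimal sketch; the `load_entries` helper and the default `data.jsonl` path are illustrative assumptions, not part of an official loader.

```python
import json

def load_entries(path="data.jsonl"):
    """Parse one JSON object per line from the dataset file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Usage (assumes data.jsonl is in the current directory):
# entries = load_entries()
# during = [e for e in entries if e["need_image_during_search"] == "yes"]
# print(f"{len(during)} of {len(entries)} cases need images during search")
```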
```json
{
  "id": 1,
  "category_l1": "Social & Humanities",
  "category_l2": "Politics",
  "difficulty_tags": ["Event Reasoning", "Image Retrieval Reasoning"],
  "image": "1.png",
  "question": "While holding an important position, this singer bought a rock record during a visit to China. What English text was written on the cover of that album?",
  "answer": "DOU WEI BLACK DREAM",
  "question_zh": "这位歌手在他担任要职时...",
  "answer_zh": "DOU WEI BLACK DREAM",
  "reasoning": "1. An image search reveals that the singer is Blinken...",
  "reasoning_zh": "1. 搜图得到该歌手是布林肯...",
  "need_image_before_search": "yes",
  "need_image_during_search": "yes"
}
```
| Category | Count |
|---|---|
| Rich Text | 41 |
| Science | 40 |
| Place | 36 |
| Social & Humanities | 31 |
| Product | 31 |
| Entertainment & Sports | 25 |
| Nature | 13 |
Top sub-categories include: Place (36), Others (36), Complex Posters and Drawings (13), Electronics & Digital (10), Biology (9), Astronomy (8), Chemistry (8), Physics (7), Arts (7), Document (7), among others.
| Type | Count |
|---|---|
| Image Retrieval Reasoning | 102 |
| Text Reasoning | 99 |
| Object Recognition | 99 |
| Event Reasoning | 66 |
| Spatiotemporal Reasoning | 66 |
Note: Each test case may involve multiple reasoning types.
Descriptions:
- Object Recognition: Fine-grained identification of flora, fauna, artifacts, and real-world objects.
- Spatiotemporal Reasoning: Geographic and temporal deduction grounded in visual cues.
- Event Reasoning: Comprehension of news events, political milestones, and product launches.
- Text Reasoning: Reasoning over embedded rich text such as academic papers, financial reports, and documents.
- Image Retrieval Reasoning: Cross-referencing visual inputs to retrieve specific artworks, imagery, or information through search.
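Because cases are multi-label, the per-type counts above sum to more than the 217 test cases. The counts can be recomputed from parsed entries with a sketch like the following (the `count_reasoning_types` helper is an assumption; only the `difficulty_tags` field comes from the schema):

```python
from collections import Counter

def count_reasoning_types(entries):
    """Tally each reasoning type across all difficulty_tags lists."""
    counts = Counter()
    for entry in entries:
        # A case with N tags contributes N counts, so the column
        # totals exceed the number of test cases.
        counts.update(entry["difficulty_tags"])
    return counts
```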
```
ImageMining/
├── data.jsonl    # Main dataset (217 entries)
├── images/       # Associated images (217 files, downloaded separately)
│   ├── 1.png
│   ├── 2.jpg
│   └── ...
└── README.md
```
Download Images: The associated images can be downloaded from Tsinghua Cloud. After downloading, please extract the files into the images/ directory.
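After extraction, a quick consistency check confirms that every image referenced in the dataset is present on disk. This is a sketch under the layout above; the `missing_images` helper and default paths are assumptions.

```python
import json
from pathlib import Path

def missing_images(jsonl_path="data.jsonl", image_dir="images"):
    """Return image filenames referenced in the dataset but absent on disk."""
    root = Path(image_dir)
    missing = []
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                name = json.loads(line)["image"]
                if not (root / name).exists():
                    missing.append(name)
    return missing

# Usage: print(missing_images())  # an empty list means extraction succeeded
```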
- Visual-First Reasoning: Questions are designed so that visual input is essential — models cannot solve them through text alone.
- Multi-Step Search: Success requires iterative search with visual feedback, including localized cropping and magnification of details.
- Diverse Domains: Covers 7 domains and 23 sub-categories spanning science, politics, culture, nature, and more.
- Annotated Reasoning Chains: Each entry includes a human-verified step-by-step reasoning process for interpretability.
This dataset is released for research purposes. Please refer to the license file for details.