🧠 MedSPOT

A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI

Rozain Shakeel¹, Abdul Rahman Mohammad Ali², Muneeb Mushtaq¹, Tausifa Jan Saleem³, Tajamul Ashraf^1,4*

^* Corresponding author

¹ Gaash Research Lab, National Institute of Technology Srinagar, India
² e& Group, UAE
³ Mohammad Bin Zayed University of Artificial Intelligence (MBZUAI), UAE
⁴ King Abdullah University of Science and Technology (KAUST), Saudi Arabia

✨ Overview

MedSPOT introduces a novel benchmark for workflow-aware sequential grounding in clinical GUI environments, enabling evaluation of models in realistic, multi-step medical interaction scenarios.

🚀 News

📅 Mar 18, 2026 — MedSPOT benchmark officially released! Check out the paper
📅 Mar 13, 2026 — Evaluation scripts are now available on GitHub

--

Overview

MedSPOT is a benchmark for evaluating Multimodal Large Language Models (MLLMs) on GUI grounding tasks in medical imaging software. It evaluates models on their ability to localize and interact with UI elements across 10 medical imaging applications including 3DSlicer, DICOMscope, Weasis, MITK, and others.

Evaluation Protocol

Tasks are evaluated sequentially, if a model fails a step, the task terminates early. This reflects real-world GUI interaction where errors compound.

Metrics

Metric	Description
TCA (Task Completion Accuracy)	Fraction of tasks where ALL steps are completed correctly in sequence
SHR (Step Hit Rate)	Per-step accuracy across all evaluated steps
S1A (Step 1 Accuracy)	Accuracy on the first step of each task

Installation

Common Dependencies

pip install torch>=2.0 transformers>=4.40 pillow tqdm

Each model should be evaluated in its recommended environment.
Follow the official setup instructions for the specific model you are evaluating.

Dataset Structure

MedSPOT-Bench/
  Annotations/
    3DSlicer_Annotation.json
    DICOMscope_Annotation.json
    Weasis_Annotation.json
    ...
  Images/
    3DSlicer/
    DICOMscope/
    Weasis/
    ...

📚 Check out our dataset here!

Annotation Format

Each annotation JSON follows this format:

{
  "tasks": [
    {
      "task_overview": "Import new DICOM.",
      "steps": [
        {
          "step_id": 1,
          "image_path": "Images/MicroDicom/MD_(1).png",
          "instruction": "Click on the File menu in the top toolbar.",
          "actions": [
            {
              "type": "click",
              "target": "file_menu",
              "bbox": [
                0.22,
                0.39,
                2.79,
                2.51
              ]
            }
          ]
        },
        {
          "step_id": 2,
          "image_path": "Images/MicroDicom/MD_(2).png",
          "instruction": "Click on 'Open' to browse files.",
          "actions": [
            {
              "type": "click",
              "target": "open_option",
              "bbox": [
                1.12,
                11.03,
                18.1,
                2.9
              ]
            }
          ]
        },
        {
          "step_id": 3,
          "image_path": "Images/MicroDicom/MD_(3).png",
          "instruction": "Select the desired DICOM file from the file explorer.",
          "actions": [
            {
              "type": "click",
              "target": "select_file",
              "bbox": [
                12.4,
                18.23,
                4.02,
                2.33
              ]
            }
          ]
        },
      ]
    }
  ]
}

Evaluation

🔬 Each script evaluates one model on the full benchmark. All scripts share the same interface:

Script	Model
`evaluate_gui_actor.py`	GUI-Actor
`evaluate_gpt5.py`	GPT-5
`evaluate_gpt4omini.py`	GPT-4o-mini
`evaluate_tars.py`	UI-TARS
`evaluate_cogagent.py`	CogAgent-9B
`evaluate_qwen2vl.py`	Qwen2-VL
`evaluate_qwen2_5vl.py`	Qwen2.5-VL
`evaluate_qwen3vl0.py`	Qwen3-VL
`evaluate_gemma3_27B.py`	Gemma3-27B
`evaluate_llama.py`	Llama-3.2-11B
`evaluate_osatlas.py`	OS-Atlas
`evaluate_seeclick.py`	SeeClick
`evaluate_uground.py`	UGround
`evaluate_Aguvis.py`	Aguvis-7B

HuggingFace Models

🤗 Models are loaded directly from Hugging Face at runtime. Make sure you are logged in before running any evaluation

huggingface-cli login

For gated models, request access on the model's HuggingFace page beforehand

OpenAI Models (GPT-5, GPT-4o-mini)

Export your OpenAI API key before running evaluations:

export OPENAI_API_KEY=your_api_key

Results

📊 Results are saved in the following structure:

results/
  ModelName/
    SoftwareName/
      task_results.json
      task_metrics.json
      failure_statistics.json
    overall_dataset_metrics.json

References

📝 If you find MedSPOT useful in your research, please consider citing our paper:

@misc{medspot,
      title={MedSPOT: A Workflow-Aware Sequential Grounding Benchmark for Clinical GUI}, 
      author={Rozain Shakeel and Abdul Rahman Mohammad Ali and Muneeb Mushtaq and Tausifa Jan Saleem and Tajamul Ashraf},
      year={2026},
      eprint={2603.19993},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.19993}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
Evaluation_scripts		Evaluation_scripts
Images		Images
MedPSOT_bench/Annotations		MedPSOT_bench/Annotations
.DS_Store		.DS_Store
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

🧠 MedSPOT

✨ Overview

🚀 News

Table of Contents

Overview

Evaluation Protocol

Metrics

Installation

Common Dependencies

Dataset Structure

Annotation Format

Evaluation

HuggingFace Models

OpenAI Models (GPT-5, GPT-4o-mini)

Results

References

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

🧠 MedSPOT

✨ Overview

🚀 News

Table of Contents

Overview

Evaluation Protocol

Metrics

Installation

Common Dependencies

Dataset Structure

Annotation Format

Evaluation

HuggingFace Models

OpenAI Models (GPT-5, GPT-4o-mini)

Results

References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages