This repository provides the official evaluation suite of MiMo-Embodied, designed to support rigorous and reproducible evaluation for embodied AI and autonomous driving tasks.
Built on top of the excellent lmms-eval framework, this repository extends the evaluation pipeline with MiMo-specific model integration, benchmark support, and evaluation workflows for embodied and driving scenarios.
MiMo-Embodied is a powerful cross-embodied vision-language model that demonstrates state-of-the-art performance in both autonomous driving and embodied AI tasks, representing the first open-source VLM that integrates these two critical areas.
This repository is for evaluation only. It does not contain model training code.
We use a custom `mivllm` model class built on top of the original vLLM implementation in lmms-eval, tailored for MiMo models. Compared with the default implementation, it:
- improves data loading efficiency
- enables finer control over image and video preprocessing
- supports MiMo-specific inference settings such as `max_model_len`, `gpu_memory_utilization`, and `max_num_seqs`
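For reference, the settings above are typically passed through lmms-eval's `--model_args` flag. The sketch below is illustrative only (the setting values are assumptions; in practice, the launcher script described later sets these for you):

```bash
# Hypothetical direct invocation of lmms-eval with the mivllm adapter.
python -m lmms_eval \
    --model mivllm \
    --model_args pretrained=XiaomiMiMo/MiMo-Embodied-7B,max_model_len=32768,gpu_memory_utilization=0.9,max_num_seqs=8 \
    --tasks cvbench_boxed \
    --batch_size 1 \
    --output_path ./eval_results
```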
This evaluation suite supports embodied AI benchmarks covering key capabilities such as:
- affordance prediction
- task planning
- spatial understanding
This evaluation suite also supports autonomous driving benchmarks covering key capabilities such as:
- environmental perception
- status prediction
- driving planning
- driving knowledge-based QA
The framework supports:
- single-GPU evaluation
- multi-GPU evaluation
- multi-node distributed evaluation
- batch evaluation across multiple tasks
This repository focuses on the evaluation of embodied AI and autonomous driving tasks.
| Category | Benchmarks |
|---|---|
| Affordance Prediction | Where2Place (where2place_point), RoboAfford-Eval (roboafford), Part-Afford (part_affordance), RoboRefIt (roborefit), VABench-Point (vabench_point_box) |
| Task Planning | EgoPlan2 (egoplan), RoboVQA (robovqa), Cosmos (cosmos_reason1_boxed) |
| Spatial Understanding | CV-Bench (cvbench_boxed), ERQA (erqa_boxed), EmbSpatial (embspatialbench), SAT (sat), RoboSpatial (robospatial), RefSpatial (refspatialbench), CRPE (crpe_relation), MetaVQA (metavqa_eval), VSI-Bench (vsibench_boxed) |
| Benchmarks |
|---|
| CODA-LM (codalm) |
| Drama (drama) |
| DriveAction (drive_action_boxed_detail) |
| LingoQA (lingoqa_boxed) |
| nuScenes-QA (nuscenesqa) |
| OmniDrive (omnidrive) |
| NuInstruct (nuinstruct) |
| DriveLM (drivelm) |
| MAPLM (maplm) |
| BDD-X (bddx) |
| MME-RealWorld (mme_realworld) |
| IDKB (idkb) |
A more detailed task list is maintained in `mimovl_docs/tasks.md`.
```bash
# Step 1: Create conda environment
conda create -n lmms-eval python=3.10 -y
conda activate lmms-eval

# Step 2: Install PyTorch (adjust CUDA version as needed)
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# Step 3: Install vLLM
pip install vllm==0.7.3

# Step 4: Install the evaluation framework
git clone https://github.com/XiaomiMiMo/MiMo-Embodied.git
cd MiMo-Embodied
pip install -e . && pip uninstall -y opencv-python-headless
pip install -r requirements.txt

# Step 5 (optional but recommended)
pip install xformers==0.0.28.post3
```

For many benchmarks, images are already packaged in the corresponding Hugging Face dataset, so no additional local path configuration is required.
For some benchmarks with large image/video assets, the released config YAML uses a placeholder local path such as:

```yaml
img_root: "/path/to/your/image_or_video_data"
```

Before running evaluation for these benchmarks, please manually update `img_root` in the corresponding task YAML file to point to your local image/video directory.
For example:

```yaml
dataset_path: Zray26/bdd_x_testing_caption
task: "bddx"
test_split: test
dataset_kwargs:
  token: True
output_type: generate_until
img_root: "/path/to/your/image_or_video_data"
doc_to_visual: !function utils.doc_to_visual
doc_to_text: !function utils.doc_to_text
doc_to_target: !function utils.doc_to_target
process_results: !function utils.process_test_results_for_submission
```

A typical task folder is organized as:
```text
lmms_eval/tasks/<task_name>/
├── <task_name>.yaml
└── utils.py
```

For example:

```text
lmms_eval/tasks/bddx/
├── bddx.yaml
└── utils.py
```
Please check the YAML file of each benchmark on a case-by-case basis and fill in `img_root` wherever local image/video assets are required.
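To quickly locate the task configs that still contain the placeholder path, something along these lines can help (a minimal sketch; the target directory `/data/bddx/images` is a hypothetical example):

```bash
# List task YAMLs that still use the img_root placeholder.
grep -rln 'img_root: "/path/to/your/image_or_video_data"' lmms_eval/tasks/

# Point one of them at your local data directory (example for bddx).
sed -i 's|/path/to/your/image_or_video_data|/data/bddx/images|' lmms_eval/tasks/bddx/bddx.yaml
```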
The main evaluation launcher is:

```bash
bash mimovl_docs/eval_mimo_vl_args.sh <model_path> <task_name> <output_dir> [disable_thinking]
```

For example:

```bash
bash mimovl_docs/eval_mimo_vl_args.sh \
    XiaomiMiMo/MiMo-Embodied-7B \
    cvbench_boxed \
    ./eval_results
```

For tasks evaluated in no-think mode, run:
```bash
bash mimovl_docs/eval_mimo_vl_args.sh \
    XiaomiMiMo/MiMo-Embodied-7B \
    <task_name> \
    ./eval_results \
    true
```

This corresponds to `disable_thinking_user=true`.

The launcher supports distributed evaluation through environment variables:
```bash
export NNODES=1
export NODE_RANK=0
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29500
export NPROC_PER_NODE=8
```

Then run:

```bash
bash mimovl_docs/eval_mimo_vl_args.sh \
    <model_path> \
    <task_name> \
    <output_dir>
```
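As an illustration, a two-node run sets the same variables on each node, differing only in `NODE_RANK` (a sketch assuming the launcher consumes the standard torch-style variables shown above; the hostname is hypothetical):

```bash
# Node 0 (also the rendezvous master)
export NNODES=2 NODE_RANK=0 MASTER_ADDR=node0.example.com MASTER_PORT=29500 NPROC_PER_NODE=8
bash mimovl_docs/eval_mimo_vl_args.sh <model_path> <task_name> <output_dir>

# Node 1
export NNODES=2 NODE_RANK=1 MASTER_ADDR=node0.example.com MASTER_PORT=29500 NPROC_PER_NODE=8
bash mimovl_docs/eval_mimo_vl_args.sh <model_path> <task_name> <output_dir>
```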
To run multiple tasks sequentially, edit the task list in `tools/submit/batch_run.py`, then launch:
```bash
python tools/submit/batch_run.py \
    --input <model_path> \
    --eval_results_dir <output_dir>
```

To disable thinking mode in batch evaluation:
```bash
python tools/submit/batch_run.py \
    --input <model_path> \
    --eval_results_dir <output_dir> \
    --disable_thinking_user
```

This evaluation suite supports both thinking and no-think evaluation settings, depending on the benchmark protocol.
For embodied AI benchmarks, the following task is evaluated under no-think mode:
- RoboVQA (robovqa)
For autonomous driving benchmarks, the following tasks are evaluated under no-think mode:
- CODA-LM (codalm)
- IDKB (idkb)
- OmniDrive (omnidrive)
- NuInstruct (nuinstruct)
- DriveLM (drivelm)
- MAPLM (maplm)
- nuScenes-QA (nuscenesqa)
- BDD-X (bddx)
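For convenience, these can be run back-to-back with the launcher in no-think mode (a minimal sketch using the task names listed above):

```bash
# Run every no-think driving benchmark sequentially with thinking disabled.
for task in codalm idkb omnidrive nuinstruct drivelm maplm nuscenesqa bddx; do
    bash mimovl_docs/eval_mimo_vl_args.sh XiaomiMiMo/MiMo-Embodied-7B "$task" ./eval_results true
done
```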
For these tasks, the model is evaluated with `disable_thinking_user=true`.

All evaluations use `--model mivllm` with MiMo-specific inference settings such as `max_model_len`, `gpu_memory_utilization`, and `max_num_seqs`. The default preprocessing and generation settings are:
```python
PATCH_SIZE = 28
IMAGE_MAX_TOKENS = 4096
IMAGE_MAX_PIXELS = 3211264
VIDEO_MAX_TOKENS = 4096
VIDEO_MAX_PIXELS = 3211264
VIDEO_TOTAL_MAX_TOKENS = 16384
VIDEO_TOTAL_MAX_PIXELS = 12845056
VIDEO_FPS = 2
VIDEO_MAX_FRAMES = 256

max_new_tokens = 32768
```
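As a point of reference, the pixel budgets above are simply the token budgets multiplied by the patch area, assuming each visual token corresponds to one 28 × 28 pixel patch:

```bash
echo $((4096 * 28 * 28))    # 3211264  = IMAGE_MAX_PIXELS / VIDEO_MAX_PIXELS
echo $((16384 * 28 * 28))   # 12845056 = VIDEO_TOTAL_MAX_PIXELS
```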
- 1 × NVIDIA A100 (80GB), or
- 1 × NVIDIA H20
MiMo-Embodied demonstrates superior performance across 17 benchmarks in three key embodied AI capabilities: Task Planning, Affordance Prediction, and Spatial Understanding, significantly surpassing existing open-source embodied VLM models and rivaling closed-source models.
Additionally, MiMo-Embodied excels in 12 autonomous driving benchmarks across three key capabilities: Environmental Perception, Status Prediction, and Driving Planning, significantly outperforming both existing open-source and closed-source VLM models.
Moreover, evaluation on 8 general visual understanding benchmarks confirms that MiMo-Embodied retains and even strengthens its general capabilities, showing that domain-specialized training enhances rather than diminishes overall model proficiency.
Results marked with * are obtained using our evaluation framework.
The following table explains how the reported numbers in the evaluation tables are computed from the corresponding result.json files.
Unless otherwise specified:
- reported scores are shown in percentage format
- percentage scores are computed as `metric × 100`
- if a benchmark contains multiple subtasks, the reported score is the arithmetic mean of the corresponding subtask metrics (see the sketch below)
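For instance, a multi-subtask score such as Cosmos (mean of 5 `exact_match` values × 100) can be reproduced along these lines. This is a sketch only: the file path and the key layout of `result.json` assumed here are hypothetical and may differ from the actual output.

```bash
python3 - <<'EOF'
import json

# Hypothetical layout: {"results": {"<subtask>": {"exact_match": <float>}, ...}}
with open("eval_results/cosmos_reason1_boxed/result.json") as f:  # hypothetical path
    results = json.load(f)["results"]

scores = [v["exact_match"] for v in results.values()]
print(f"reported score: {100 * sum(scores) / len(scores):.2f}")
EOF
```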
| Benchmark Name (Table) | Task Name (Eval Script) | Metric in result.json | How Table Score Is Computed | Mode | Notes |
|---|---|---|---|---|---|
| Where2Place | where2place_point | accuracy | accuracy × 100 | think | |
| RoboAfford-Eval | roboafford | accuracy | accuracy × 100 | think | |
| Part-Afford | part_affordance | accuracy | accuracy × 100 | think | |
| RoboRefIt | roborefit | accuracy | accuracy × 100 | think | |
| VABench-Point | vabench_point_box | accuracy | accuracy × 100 | think | |
| EgoPlan2 | egoplan | accuracy | accuracy × 100 | think | |
| RoboVQA | robovqa | robovqa_score | robovqa_score × 100 | no-think | |
| Cosmos | cosmos_reason1_boxed | exact_match from 5 subtasks | mean(exact_match of 5 subtasks) × 100 | think | |
| CV-Bench | cvbench_boxed | accuracy | accuracy × 100 | think | |
| ERQA | erqa_boxed | exact_match | exact_match × 100 | think | |
| EmbSpatial | embspatialbench | accuracy | accuracy × 100 | think | |
| SAT | sat | accuracy | accuracy × 100 | think | |
| RoboSpatial | robospatial | accuracy from 3 subtasks | mean(accuracy of 3 subtasks) × 100 | think | |
| RefSpatial | refspatialbench | refspatial-bench-location, refspatial-bench-placement | mean(refspatial-bench-location, refspatial-bench-placement) × 100 | think | |
| CRPE | crpe_relation | accuracy | accuracy × 100 | think | |
| MetaVQA | metavqa_eval | accuracy | accuracy × 100 | think | |
| VSI-Bench | vsibench_boxed | vsibench_score | vsibench_score × 100 | think | |
| CODA-LM | codalm | jsonl results for 3 subtasks | Export jsonl files for the three subtasks, then follow the official CODA-LM evaluation pipeline to compute the final score | no-think | Official evaluation instructions: https://github.com/DLUT-LYZ/CODA-LM/tree/main/evaluation |
| Drama | drama | drama_ACC@0.5 | drama_ACC@0.5 × 100 | think | |
| DriveAction | drive_action_boxed_detail | drive_action_Overall_acc | drive_action_Overall_acc × 100 | think | |
| LingoQA | lingoqa_boxed | lingo_judge_acc | lingo_judge_acc × 100 | think | |
| nuScenes-QA | nuscenesqa | exist, count, object, status, comparison | mean(exist, count, object, status, comparison) × 100 | no-think | These category scores are read from accuracy_extract in result.json. |
| OmniDrive | omnidrive | Bleu_1, ROUGE_L, CIDEr | mean(Bleu_1, ROUGE_L, CIDEr) × 100 | no-think | |
| NuInstruct | nuinstruct | bleu | bleu × 100 | no-think | |
| DriveLM | drivelm | jsonl results | Prepare prediction results, then follow the official CODA-LM evaluation pipeline to compute the final score | no-think | Official evaluation instructions: https://github.com/DLUT-LYZ/CODA-LM/tree/main/evaluation |
| MAPLM | maplm | maplm_FRM, maplm_QNS | mean(maplm_FRM, maplm_QNS) | no-think | maplm_FRM and maplm_QNS are already reported on the 0–100 scale. |
| BDD-X | bddx | Bleu_4, ROUGE_L, CIDEr | mean(Bleu_4, ROUGE_L, CIDEr) × 100 | no-think | |
| MME-RealWorld | mme_realworld | mme_realworld_score from 2 subtasks | mean(mme_realworld_score of 2 subtasks) × 100 | think | |
| IDKB | idkb | Metrics from 6 subtasks | mean(IDKB_multi_no_image_val, IDKB_multi_with_image_val, IDKB_qa_no_image_val, IDKB_qa_with_image_val, IDKB_single_no_image_val, IDKB_single_with_image_val) | no-think | Subtask scores are computed as follows: IDKB_multi_no_image_val = acc × 100; IDKB_multi_with_image_val = acc × 100; IDKB_qa_no_image_val = mean(rouge_1, rouge_l, semscore); IDKB_qa_with_image_val = mean(rouge_1, rouge_l, semscore); IDKB_single_no_image_val = acc × 100; IDKB_single_with_image_val = acc × 100. |
```text
.
├── lmms_eval/                      # Core evaluation framework
│   ├── models/                     # Model adapters, including mivllm
│   ├── tasks/                      # Task definitions and configs
│   ├── api/                        # API interfaces
│   └── ...
├── mimovl_docs/
│   ├── eval_mimo_vl_args.sh        # Main evaluation launcher
│   └── tasks.md                    # Task documentation
├── tools/submit/                   # Batch evaluation runners
├── patches/                        # Environment patches
├── assets/                         # README assets
├── requirements.txt
├── setup.py
├── pyproject.toml
└── README.md
```
```bibtex
@misc{hao2025mimoembodiedxembodiedfoundationmodel,
  title={MiMo-Embodied: X-Embodied Foundation Model Technical Report},
  author={Xiaomi Embodied Intelligence Team},
  year={2025},
  eprint={2511.16518},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.16518},
}

@misc{mimoembodiedeval2025,
  title={The Evaluation Suite of Xiaomi MiMo-Embodied},
  author={Xiaomi Embodied Intelligence Team},
  year={2025},
  url={https://github.com/XiaomiMiMo/MiMo-Embodied}
}
```





