The Inspect AI for robotics
An open-source evaluation framework for physical AI and VLA (vision-language-action) models.
Define a robotics benchmark once, then run any policy against any compatible embodiment — a real robot or a simulator — with reproducible logs and first-class Rerun visualization.
LLM evaluations have a single swappable input: the model. Robotics evaluations have two — and Inspect Robots makes both first-class and orthogonal:
🧠 Policy — the VLA |
The "brain". Maps an observation + instruction to an action chunk (a horizon of actions executed open-loop, as π0 / ACT / diffusion policies do). |
🦾 Embodiment — the robot or sim |
The "body + world". Produces observations, executes actions, owns the action/observation spaces and control rate. Real-robot-first; sims are a stricter special case. |
A Task — a dataset of Scenes (initial conditions, instructions, success
targets) plus scorers — is defined independently of both. Before any rollout,
Inspect Robots checks the (policy, embodiment) pair is compatible (action/observation
spaces, semantics, control rate, scene realizability) and fails fast if not.
pip install inspect-robots # core (numpy only)
pip install "inspect-robots[rerun]" # + Rerun visualizationNo hardware or simulator needed — the dependency-free CubePick mock world
exercises the whole stack:
from inspect_robots import eval
from inspect_robots.mock import CubePickEmbodiment, ScriptedPolicy
from inspect_robots.scene import Scene
from inspect_robots.scorer import success_at_end
from inspect_robots.task import Task
task = Task(
name="cubepick-reach",
scenes=[Scene(id=f"layout-{i}", instruction="reach the cube", init_seed=i) for i in range(5)],
scorer=success_at_end(),
max_steps=80,
)
# The two swappable inputs: a policy (VLA) and an embodiment (robot/sim).
(log,) = eval(task, ScriptedPolicy(), CubePickEmbodiment())
print(log.status, log.results.metrics) # success {'success_at_end': 1.0}…or from the command line (components resolve from a registry):
inspect-robots list # registered components
inspect-robots run --task cubepick-reach --policy scripted --embodiment cubepick
inspect-robots inspect logs/cubepick-reach_*.json # results table- 🌍 Real-world first. Interfaces assume real-robot reality — human-in-the-loop reset, no privileged success oracle, wall-clock control rate. Simulators just offer more (seeding, privileged success, rendering) via opt-in capabilities.
- 🔁 Reproducible. Every run yields an immutable, schema-versioned
EvalLogwith the resolved config, git revision, and package versions — re-readable across releases, and re-scorable offline. - 🪶 Light core. Depends only on NumPy. Rerun and simulator/VLA backends are optional extras and separately installable plugins.
- 🛑 Safe unattended. An explicit error taxonomy separates "record and continue" from "halt and require a human", so a faulted robot never auto-advances overnight.
- 🎞️ Rerun visualization. Stream camera images, 3D poses, joint/action
time-series, and success markers to a
.rrdrecording. - 🧩 Pluggable. Ship
inspect-robots-maniskillorinspect-robots-openvlaas separate packages — entry points make them appear ininspect-robots listautomatically. - ⚙️ VLA-native. Action chunking, open-loop execution, and ACT/ALOHA temporal ensembling are built in, with action semantics (control mode, rotation representation, gripper, frame) that make compatibility and ensembling correct.
If you know Inspect AI, you already know Inspect Robots.
| Inspect AI | Inspect Robots |
|---|---|
Model |
Policy (VLA) + Embodiment (two inputs) |
Task = dataset + solver + scorer |
Task = scenes + controller + scorer |
Sample |
Scene |
Solver chain |
Controller middleware (chunking, ensembling, smoothing) |
eval() → EvalLog |
eval() → EvalLog |
@task / @solver / @scorer + registry |
@task / @policy / @embodiment / @scorer + entry points |
This repository is the framework (the "Inspect AI for robotics"). Concrete benchmarks (the "Inspect Evals for robotics") and backend adapters live in separate plugin packages.
Full guides and an auto-generated API reference live at
inspectrobots.org.
LLM-friendly versions: llms.txt
and llms-full.txt.
uv venv && uv pip install -e ".[dev]"
uv run pre-commit install # ruff + mypy on commit, 100% coverage on push
uv run pytest --cov # 100% coverage required
uv run ruff check . && uv run mypyPre-commit hooks and a blocking CI coverage gate keep main green. See
CONTRIBUTING.md and the design docs in plans/.