strands-robots v0.4.0

@cagataycali cagataycali released this 17 Jun 07:51
· 271 commits to main since this release
5e3fe43

150+ commits since v0.3.8 (Feb 20). This is the release where strands-robots
stops being "policy inference glue for a single arm on your desk" and becomes a
platform for simulating, evaluating, and deploying to a fleet - without
swapping libraries at each stage. The through-line of this cycle was closing the gap
between "it runs a model" and "it runs a robot you'd trust in a room with people."

Python >=3.12 required (LeRobot >=0.5.0 floor).
Install: pip install strands-robots
Everything: pip install "strands-robots[all]"


The story of this release

Three forces drove almost every PR:

  1. The sim-to-real loop was broken in the middle. You could load a policy and you
    could talk to hardware, but there was no first-class simulator in between - so you
    couldn't develop a policy without a physical robot, and you couldn't reproduce a
    failure without the exact bench setup. We built a MuJoCo backend and a Robot()
    factory so the same code path runs in sim and on metal.

  2. "One robot" assumptions don't survive contact with a fleet. The moment more than
    one robot, or more than one operator, is involved, you need identity, presence,
    replay protection, an audit trail, and a human able to hit stop. We built the mesh
    and its AWS IoT transport around that reality, and spent a large fraction of the
    cycle hardening it rather than adding surface.

  3. "It produced an action array" is not "it succeeded." We added a real evaluation
    protocol (LIBERO) so policy quality is a number we can defend, not a vibe from
    watching a render.

Everything below ladders up to one of those three.


Simulation: so you can develop without a robot on the bench

Why: Iterating on a policy against physical hardware is slow, unsafe, and
unreproducible. A bug that only shows up at 50 Hz on a real servo is nearly impossible
to bisect. We needed a simulator that is faithful enough to trust and that
exposes the same agent-tool surface as the rest of the library.

  • MuJoCo backend as an AgentTool with 50+ actions (#85) on a foundation of
    models, the Policy/engine ABCs, a factory, a model registry, and asset
    auto-download (#84, #105/#106). The agent drives the world the same way it drives a
    real robot - world building, robot inject/eject, cameras, stepping, rollout, render.
  • Control-rate substepping (#353) and stepping physics for the full control
    period in eval (#429). Why it matters: position-servo policies were silently
    failing to track because we stepped the sim once per action instead of for the whole
    control interval - the policy looked broken when the integration was. This is the
    kind of bug that costs a week on hardware; we paid it down once, in sim.
  • Render fidelity fixes - blown-out white ground (#428), ground-plane z-fighting on
    attach (#360), conditional ground-strip + tendon scale + RNG parity (#400).
    Why it matters: renders are the policy's input in vision models. A blown-out
    frame isn't cosmetic - it's feeding the policy garbage and poisoning eval.
  • Naming/identity correctness: register sim robots under the user's name not the
    canonical one (#435), structured errors from resolvers (#417), optional robot_name
    across the state family (#412). Why: the sim has to address robots the way the user
    thinks about them, or multi-robot scenes become a guessing game.
  • SimEngine.describe() discovery surface (#407) - so an agent can ask "what can I do
    here?" instead of failing on an unknown action key. We also started surfacing valid
    actuator names on errors and propagating failures into run_policy status (#436),
    because a silent zero-action on failure is one of the most dangerous defaults in robotics.
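
The substepping fix above is easier to see with a toy model. What follows is a minimal sketch, not the MuJoCo backend's code: ToyServo, apply_action, the 2 ms physics step, the 50 Hz control rate, and the servo gain are all assumptions chosen for illustration. It compares stepping the physics once per action with stepping it for the whole control period.

```python
from dataclasses import dataclass


@dataclass
class ToyServo:
    """First-order position servo: every physics step closes a fixed
    fraction of the gap between the joint and its commanded target."""
    position: float = 0.0
    gain_per_step: float = 0.3  # assumed value, illustration only

    def step(self, target: float) -> None:
        self.position += self.gain_per_step * (target - self.position)


def apply_action(servo: ToyServo, target: float, physics_dt: float,
                 control_dt: float, substep: bool) -> float:
    """Apply one policy action and return the joint position afterwards."""
    # With substepping, physics runs for the whole control period;
    # without it, the servo only gets a single physics step per action.
    n_steps = max(1, round(control_dt / physics_dt)) if substep else 1
    for _ in range(n_steps):
        servo.step(target)
    return servo.position


PHYSICS_DT = 0.002  # 2 ms physics step (assumed)
CONTROL_DT = 0.02   # 50 Hz control rate (assumed)

once = apply_action(ToyServo(), 1.0, PHYSICS_DT, CONTROL_DT, substep=False)
full = apply_action(ToyServo(), 1.0, PHYSICS_DT, CONTROL_DT, substep=True)
print(f"one step per action: {once:.3f}")
print(f"full control period: {full:.3f}")
```

In this toy model a single physics step closes 30% of the gap to the target, while stepping the full control period closes about 97%. A policy judged on the first number looks broken even though its actions are fine.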

Robot() factory: one entry point

Why: Before this, choosing sim vs. real meant choosing a different code path, and
the path that touches a physical motor was just as easy to invoke by accident as the
safe one. That's backwards.

  • Robot() factory + top-level lazy imports (#86; hygiene follow-ups #145).
    Sim is the default; real hardware is an explicit opt-in (mode="real" /
    STRANDS_ROBOT_MODE=real). Lazy imports keep import strands_robots cheap so the
    factory doesn't drag the entire ML stack into a process that just wants to talk to a
    serial port.
  • Ergonomics for send_action, add_robot, render, get_robot (#431) - the
    paper cuts that make the difference between an API you demo and an API you live in.
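
The "sim by default, hardware by explicit opt-in" rule can be sketched as a small resolver. This is an illustration under stated assumptions, not the factory's implementation: resolve_mode is a hypothetical helper, and both the precedence (explicit argument over STRANDS_ROBOT_MODE) and the rejection of unknown values are guesses that these notes do not confirm.

```python
import os
from typing import Mapping, Optional

VALID_MODES = ("sim", "real")


def resolve_mode(mode: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the execution mode; hardware is never selected implicitly.

    Assumed precedence: explicit argument, then STRANDS_ROBOT_MODE,
    then the "sim" default.
    """
    env = os.environ if env is None else env
    chosen = mode if mode is not None else env.get("STRANDS_ROBOT_MODE", "sim")
    chosen = chosen.strip().lower()
    if chosen not in VALID_MODES:
        # Fail loudly rather than guess: a typo must not reach a motor.
        raise ValueError(f"unknown mode {chosen!r}; expected one of {VALID_MODES}")
    return chosen


print(resolve_mode(env={}))                              # sim
print(resolve_mode(env={"STRANDS_ROBOT_MODE": "real"}))  # real
print(resolve_mode("real", env={}))                      # real
```

The point of the design is that every path that ends in "real" requires someone to have typed the word "real" somewhere.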

Mesh + AWS IoT: because a fleet is a security problem, not a networking problem

Why: Connecting robots is easy. Connecting robots safely - where a stale command
can't be replayed, a compromised peer can't impersonate another, an operator can always
intervene, and every actuation is on an audit trail - is the actual job. We treated the
mesh as a trust boundary from day one, which is why the hardening PRs outnumber the
feature PRs here.

  • Core mesh - session, presence, RPC, streams, wiring, + AWS IoT transport (#101).
    Zenoh for the LAN, MQTT5/mTLS for the cloud, Device Shadow mirror, S3 camera offload,
    account-wide Fleet Provisioning.
  • The #195 hardening split landed as a deliberate sequence so each layer could be
    reviewed in isolation: PKI helpers + conftest (#220), payload validation / action
    allowlist (#223), Zenoh + ACL config with mTLS/downsampling/low-pass (#224),
    tamper-evident HMAC audit log with per-peer sequence + rotation (#221), cross-transport
    dedup + monotonic TTL + strict mode (#222), replay caches + override-resume + safety
    topic handlers (#225), and robot_mesh HITL via tool_context.interrupt + per-action
    rate limit (#227). Why split: security review fatigue is real; a 9-part series each
    reviewable in an afternoon catches more than one 4000-line PR nobody reads to the end.
  • Human-in-the-loop done right (#227, #411): a declined approval must not consume a
    rate-limit slot (or an operator's "no" could lock out a legitimate e-stop), and the
    operator's literal reply is never echoed back into the LLM context (that turns a human
    into a prompt-injection channel). Read-only actions are audited too (#411) - operators
    need "the agent read N frames at time T", not just actuation logs.
  • Replay/lockout safety pins: estop engages even when the per-issuer replay cache is
    full (#263/#339) - the cache bounds memory, never safety; resume-cache fairness mirror
    (#342); check-then-set estop replay lock (#273/#361); poison records on every degraded
    audit path so a stream gap is attributable, not silent (#410). Why this obsession:
    in this domain the failure mode isn't "wrong answer", it's "arm moves when it shouldn't"
    or "stop didn't take." Fail-loud, fail-safe, always.
  • AWS IoT provisioning hardening: CA pin + thing-name regex + scoped policy (#228),
    deny-by-default Fleet Provisioning hook (#333), operator-shadow + response publishes
    scoped to the device's own ThingName (#334, #336), atomic break-glass marker with
    explicit symlink reject (#388/#402). Why: fleet provisioning is the blast radius -
    a permissive default here means one bad cert owns the whole account.
  • Teleop integrity: validate input frames before apply (#332), route every teleop
    publish through the single Mesh.publish() chokepoint (#452). One door, guarded.
  • Sim is a first-class mesh peer: tell() dispatch maps to run_policy/start_policy
    (#304), sim joint state bridges to child peers (#422), sim cameras publish JPEG frames
    (#425). Why: if sim and real don't look identical on the mesh, your fleet tooling
    can't be tested in sim - which defeats the point of having a sim.
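
To make "tamper-evident HMAC audit log with per-peer sequence" concrete, here is a minimal sketch of the general technique using only the standard library. It is not the code from #221: the AuditLog class, its record fields, and the chaining scheme are assumptions, and rotation is left out entirely. Each record's HMAC also covers the previous record's tag, so an edited or dropped record breaks verification.

```python
import hashlib
import hmac
import json
from collections import defaultdict


class AuditLog:
    """Append-only log: each record carries a per-peer sequence number and
    an HMAC that also covers the previous record's tag (a hash chain)."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._next_seq = defaultdict(int)
        self._prev_tag = b""
        self.records = []

    def _tag(self, body: dict, prev_tag: bytes) -> str:
        payload = json.dumps(body, sort_keys=True).encode("ascii")
        return hmac.new(self._key, prev_tag + payload, hashlib.sha256).hexdigest()

    def append(self, peer: str, action: str) -> dict:
        body = {"peer": peer, "seq": self._next_seq[peer], "action": action}
        self._next_seq[peer] += 1
        tag = self._tag(body, self._prev_tag)
        self._prev_tag = tag.encode("ascii")
        record = {**body, "tag": tag}
        self.records.append(record)
        return record

    def verify(self, records) -> bool:
        prev_tag = b""
        expected_seq = defaultdict(int)
        for rec in records:
            body = {k: rec[k] for k in ("peer", "seq", "action")}
            if rec["seq"] != expected_seq[rec["peer"]]:
                return False  # gap or reorder in this peer's sequence
            if not hmac.compare_digest(rec["tag"], self._tag(body, prev_tag)):
                return False  # record edited, or chain broken
            expected_seq[rec["peer"]] += 1
            prev_tag = rec["tag"].encode("ascii")
        return True


log = AuditLog(key=b"demo-key-not-for-production")
log.append("arm-1", "estop")
log.append("arm-2", "get_state")
log.append("arm-1", "resume")

tampered = [dict(r) for r in log.records]
tampered[1]["action"] = "move"
dropped = log.records[:1] + log.records[2:]

print(log.verify(log.records))  # True
print(log.verify(tampered))     # False
print(log.verify(dropped))      # False
```

Note that the dropped-record case passes the per-peer sequence check (arm-1 still goes 0, 1) and is caught only by the chain, which is why the sketch uses both.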

Policies: more brains, one socket

Why: The field is moving fast and no single policy wins everywhere - VLAs for
open-ended manipulation, classical planners for collision-aware motion. The Policy ABC
exists so adding a new brain doesn't fork the stack. This cycle we proved the abstraction
holds by hanging very different things off it.

  • NVIDIA Cosmos 3 omnimodal VLA (#317) with both a service backend (msgpack/websockets,
    GPU-isolated) and an in-process diffusers backend (#458) for when you'd rather not
    run a sidecar. We re-anchor IK on the achieved EE pose each step (#462) - why: over a
    long rollout, integrating the model's relative pose deltas drifts; anchoring on where the
    arm actually is bounds the tracking error instead of letting it compound.
  • MoveIt2Policy under [moveit2] (#305) and cuRobo migrated to the main API
    (#442) - collision-aware planners living under the same ABC as the VLAs, so an agent can
    choose "plan a safe path" vs. "imitate" without changing how it calls a policy.
  • GR00T N1.7 EA (#93) plus the unglamorous-but-essential wire-format fixes: service
    (B, T, ...) shape + float32 state (#149), container lifecycle (#152), command-builder
    flags (#150, #155). Why these are in a release at all: an off-by-one in the observation
    tensor shape doesn't error - it silently degrades inference. Pinning the wire format is
    what makes "GR00T support" a claim instead of a hope.
  • LeRobot local direct-HF inference with RTC (#56), device-resolution + postprocessor
    warnings (#430), and the LeRobot 0.5.2 recording pipeline overhaul - synchronized
    multi-robot, action-horizon batching, a camera-recorder race fix, full embodiment
    coverage (#366). Why: recording is how you get training data; a race in the recorder
    is silent data loss you only discover when your dataset trains a worse policy.
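
The "one socket" idea can be sketched with a small abstract base class. Everything here is hypothetical: the Policy class below is a stand-in, not the library's Policy ABC, and get_actions, ImitationPolicy, PlannerPolicy, and run_once are names invented for the illustration. It only shows why a shared interface lets a caller swap a learned policy for a planner without changing the call site.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Sequence


class Policy(ABC):
    """Hypothetical stand-in for a shared policy interface."""

    @abstractmethod
    def get_actions(self, observation: Dict[str, Sequence[float]],
                    instruction: str) -> List[List[float]]:
        """Return a short horizon of joint targets."""


class ImitationPolicy(Policy):
    """Toy 'learned' policy: nudges every joint toward zero."""

    def get_actions(self, observation, instruction):
        return [[0.9 * q for q in observation["joints"]]]


class PlannerPolicy(Policy):
    """Toy 'planner': three linearly interpolated waypoints to a goal."""

    def __init__(self, goal: Sequence[float]) -> None:
        self.goal = list(goal)

    def get_actions(self, observation, instruction):
        start = observation["joints"]
        return [[s + (g - s) * k / 3 for s, g in zip(start, self.goal)]
                for k in (1, 2, 3)]


def run_once(policy: Policy, observation, instruction: str):
    # The call site is identical whichever kind of policy is plugged in.
    return policy.get_actions(observation, instruction)


obs = {"joints": [1.0, -2.0]}
print(run_once(ImitationPolicy(), obs, "pick up the cube"))
print(run_once(PlannerPolicy(goal=[0.0, 0.0]), obs, "pick up the cube"))
```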

Evaluation: turning "looks good" into a number

Why: Without a benchmark, "the policy is better now" is unfalsifiable. We adopted a
benchmark-agnostic eval protocol (#129) and a LIBERO adapter + BDDL parser (#130), then
spent real effort making the eval honest:

  • Load the actual LIBERO scene MJCFs (#165) - evaluating against the wrong world is
    worse than not evaluating.
  • Snapshot/restore canonical qpos for procedural scenes (#168), agree with robosuite's
    check_success (#173), reach success_rate > 0 on our engine (#175).
  • Pack the gripper as the 2-element array the model was trained on (#162), bridge
    EE FK into state for libero_panda (#161), per-episode reseed for reproducibility (#180).
  • We retired libero_offscreen_render once our engine was byte-equivalent to upstream
    (#186) - why keep two renderers when one is provably the same?
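
Per-episode reseeding is what makes a success rate reproducible and a single failing episode replayable. A minimal sketch of the idea, assuming a hypothetical evaluate() loop and a toy episode in place of a real LIBERO rollout; the seed-derivation formula is an arbitrary choice for illustration, not the one used in #180.

```python
import random
from typing import Callable, List


def episode_seed(base_seed: int, episode: int) -> int:
    """Derive a distinct, reproducible seed for each episode."""
    return base_seed * 1_000_003 + episode


def evaluate(run_episode: Callable[[random.Random], bool],
             n_episodes: int, base_seed: int) -> dict:
    if n_episodes <= 0:
        raise ValueError("n_episodes must be positive")
    outcomes: List[bool] = []
    for ep in range(n_episodes):
        # A fresh RNG per episode: the outcome of episode k does not
        # depend on how much randomness earlier episodes consumed.
        rng = random.Random(episode_seed(base_seed, ep))
        outcomes.append(run_episode(rng))
    return {"success_rate": sum(outcomes) / n_episodes, "outcomes": outcomes}


def toy_episode(rng: random.Random) -> bool:
    """Stand-in for a rollout: succeeds when the sampled noise is small."""
    return rng.random() < 0.7


first = evaluate(toy_episode, n_episodes=20, base_seed=7)
second = evaluate(toy_episode, n_episodes=20, base_seed=7)
print(first["success_rate"], first == second)

# Replay episode 13 alone, without running episodes 0..12 first.
replayed = toy_episode(random.Random(episode_seed(7, 13)))
print(replayed == first["outcomes"][13])
```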

Device Connect: the mesh transport for when you have real infrastructure

Why: The built-in Zenoh mesh is great for getting started, but organizations with
device fleets already have discovery/RPC/safety infrastructure. Device Connect (#370)
plugs into that as the primary transport and falls back to Zenoh when it isn't installed -
so you can start simple and graduate without a rewrite. CI installs the matching
device-connect packages from source while they're pre-release, so the integration is
tested against the version it actually targets.
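
The "primary transport with a fallback when it isn't installed" pattern can be sketched with importlib. This is an assumption-laden illustration, not the library's selection logic: select_transport is a hypothetical helper, and the real module names to probe are not stated in these notes, so the demo uses a standard-library module as a stand-in for the fallback.

```python
import importlib.util
from typing import Sequence


def is_installed(module_name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def select_transport(candidates: Sequence[str]) -> str:
    """Return the first importable module name, in order of preference."""
    for name in candidates:
        if is_installed(name):
            return name
    raise RuntimeError(f"no mesh transport available; tried {list(candidates)}")


# Stand-ins: a module that does not exist plays the missing primary
# transport, and "json" plays the fallback that is always present.
chosen = select_transport(["not_a_real_transport_module", "json"])
print(chosen)  # json
```

Probing with find_spec instead of a try/except around the import keeps the check cheap and avoids loading a heavy dependency just to learn that it exists.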

Docs, CLI, and the boring stuff that makes it usable

  • Full MkDocs Material site + Pages CI (#160), Device Connect (#449) and security
    (#465) pages, README rewritten for the v0.x Robot/mesh/sim story (#371). Why: a
    platform nobody can onboard to isn't a platform.
  • strands-robots doctor (#419) - why: 90% of "it doesn't work" is an environment
    problem (missing CUDA, wrong torch, no OpenGL). A diagnostic that still exits non-zero
    on failure under NO_COLOR/TERM=dumb (#443), so CI can gate on it, turns those tickets
    into self-service.
  • 5 hero examples + hub-to-hardware walkthrough (#432, #381, #459) - the examples are
    the spec for the ergonomics work; if the hero path is ugly, the API is wrong.
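
The doctor exit-code point comes down to keeping presentation and status separate. A minimal sketch of that separation, with hypothetical names (use_color, run_doctor) and made-up checks; it is not the strands-robots doctor implementation.

```python
import os
import sys
from typing import Callable, Dict, Mapping, Optional


def use_color(env: Mapping[str, str]) -> bool:
    """Colour is presentation only: a non-empty NO_COLOR or TERM=dumb disables it."""
    return not env.get("NO_COLOR") and env.get("TERM") != "dumb"


def run_doctor(checks: Dict[str, Callable[[], bool]],
               env: Optional[Mapping[str, str]] = None) -> int:
    """Run every check and return a process exit code (0 = all passed)."""
    env = os.environ if env is None else env
    color = use_color(env)
    failed = 0
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing check counts as a failure
        label = "ok" if ok else "FAIL"
        if color:
            label = f"\033[{'32' if ok else '31'}m{label}\033[0m"
        print(f"[{label}] {name}")
        failed += 0 if ok else 1
    # The exit code depends only on the results, never on the colour setting.
    return 1 if failed else 0


demo_checks = {
    "python >= 3.12": lambda: sys.version_info >= (3, 12),
    "always fails (demo)": lambda: False,
}
plain = run_doctor(demo_checks, env={"NO_COLOR": "1", "TERM": "dumb"})
fancy = run_doctor(demo_checks, env={"TERM": "xterm-256color"})
print(plain, fancy)  # 1 1
```

A CLI entry point would pass the return value to sys.exit(), which is what lets a CI job gate on the result.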

Build, CI & the security baseline

Why: This library tells an LLM to move physical motors. The supply chain and the
input-validation story are not optional.

  • Python >=3.12, uv as the hatch installer (#83), ruff replacing
    black+isort+flake8 (#73), 11 granular optional-dependency groups (#14) - so you install
    the brain you need, not the entire NVIDIA stack to talk to a serial port.
  • NVIDIA Thor/Jetson GPU torch (#374): a targeted torch 2.11 override fixing the
    sm_110 cuBLAS bug, with UV_TORCH_BACKEND=auto. Why the complexity: the naive PyPI
    torch wheel is CPU-only on Thor, so inference silently runs on CPU - "works but 50x too
    slow" is the worst kind of bug.
  • Security baseline (#185, #189): CodeQL security-and-quality (catches the LLM-input →
    subprocess/XML/path taint class), Dependency Review hard-failing on high/critical CVEs,
    an LLM-input-safety annotation check, SHA-pinned actions + Dependabot. Plus path
    validation on every filesystem-writing tool (#91). The pypa/gh-action-pypi-publish pin
    is non-negotiable - a moving release branch there is exactly the tj-actions supply-chain
    pattern.
  • A large test-coverage push (hardware_robot, assets, lerobot_*, pose/serial tools,
    cli, doctor, registry, mesh sensors, predicates) and ASCII-only tool output enforced
    everywhere (#434). Why ASCII: agents read these strings programmatically; emojis are
    tokenizer noise and a stray combining mark breaks downstream parsing.
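
Path validation for a filesystem-writing tool usually means "resolve first, then check containment". A minimal sketch of that technique with pathlib, assuming a hypothetical safe_write_path helper and a single allowed base directory; the checks actually added in #91 may differ.

```python
import tempfile
from pathlib import Path


def safe_write_path(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, or raise ValueError.

    resolve() collapses ".." segments and follows symlinks, so the
    containment check runs on the real destination, not on the string
    the caller (or an LLM) supplied.
    """
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    if base not in candidate.parents:
        raise ValueError(f"path escapes {base}: {user_path!r}")
    return candidate


with tempfile.TemporaryDirectory() as workdir:
    print(safe_write_path(workdir, "episodes/run_001.json").name)
    for bad in ("../outside.txt", "/etc/passwd", "a/../../outside.txt"):
        try:
            safe_write_path(workdir, bad)
        except ValueError:
            print(f"rejected: {bad}")
```

An absolute user path replaces the base when joined, which is why the check is done after joining and resolving instead of by inspecting the input string.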

Upgrade notes

  • Python 3.12+ required.
  • Extras are granular - groot-service, cosmos3-service, cosmos3-diffusers,
    cosmos3-sim, moveit2, curobo, lerobot, sim-mujoco, mesh, mesh-iot,
    device-connect, or all.
  • cuRobo is not on PyPI - install from source (NVlabs/curobo); the [curobo] extra is
    intentionally a no-op until a real release exists (the PyPI nvidia-curobo is a squatter).
  • Hardware is opt-in - Robot() defaults to sim; pass mode="real" or set
    STRANDS_ROBOT_MODE=real deliberately.
  • Thor/Jetson - see the README Installation section for the UV_TORCH_BACKEND /
    torchcodec CUDA-index caveat.