wlu03/ycomb-deepmind-hack

🌍 WorldSim — Multimodal World-to-Robot-Interaction Engine

Generate a visually rich 3D environment from text, image, or video — then drop a robot inside it. MuJoCo handles physics. Gaussian splats handle appearance. Claude handles semantic understanding.


What This Is

WorldSim is a pipeline that takes any of the following as input:

  • A text description ("a cluttered kitchen counter")
  • A photograph of a real space
  • A video walkthrough of an environment

And produces:

  • A photorealistic reconstructed world (World Labs Gaussian splat)
  • A set of interactive 3D objects with physics properties (Meshy AI)
  • A MuJoCo scene with collision geometry, mass, friction, joints, and affordances
  • A synchronized dual renderer so the world looks realistic while physics stay stable
  • A robot that can push, grasp, knock over, and manipulate everything in the scene

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                          INPUT LAYER                                    │
│              Text Prompt  │  Image/Photo  │  Video                      │
└──────────────┬────────────┴───────────────┴────────────┬────────────────┘
               │                                         │
               ▼                                         ▼
┌──────────────────────────┐             ┌───────────────────────────────┐
│   SCENE GENERATION       │             │     OBJECT GENERATION         │
│   World Labs API         │             │     Meshy AI API              │
│                          │             │                               │
│  → Gaussian splat (SPZ)  │             │  → GLB mesh (visual)          │
│  → Collider mesh (GLB)   │             │  → OBJ mesh (physics)         │
│  → Panorama image        │             │  → Affordance metadata        │
└──────────┬───────────────┘             └──────────────┬────────────────┘
           │                                            │
           ▼                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC LABELING LAYER                               │
│                    Claude API (claude-sonnet-4-6)                       │
│                                                                         │
│  For each object extracted from scene:                                  │
│  • Label type (table, mug, door, drawer, obstacle…)                     │
│  • Estimate mass, friction, restitution from material                   │
│  • Assign joint type (free, hinge, slider, fixed, welded)               │
│  • Detect grasp affordance (can robot pick this up?)                    │
│  • Determine static vs. dynamic                                         │
│  • Output structured ObjectManifest JSON                                │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    MESH PROCESSING LAYER                                │
│                    trimesh + CoACD + Open3D                             │
│                                                                         │
│  Visual mesh (GLB)    →  kept for renderer, group=1 in MJCF            │
│  Collision mesh (GLB) →  CoACD convex decomposition → 8–16 hulls       │
│                          simplify, center, normalize scale              │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    MJCF COMPOSITION LAYER                               │
│                    mjcf_composer (Python)                               │
│                                                                         │
│  • Write <asset> block — all meshes registered                          │
│  • Write worldbody — static environment geom                            │
│  • Write per-object <body> with freejoint or hinge/slider               │
│  • Attach visual geom (group=1) + collision geom (group=0) per body     │
│  • Include robot MJCF via <include> or direct embed                     │
│  • Write actuators, sensors, cameras                                    │
│  • Output: scene.xml + assets/ directory                                │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
┌───────────────────────┐     ┌──────────────────────────────────────────┐
│   MUJOCO PHYSICS      │     │   VISUAL RENDERER                        │
│   (headless)          │     │   (Three.js / @luma-ai/three-gaussian-   │
│                       │ ←── │    splats or Nerfstudio viewer)          │
│   • Rigid body sim    │sync │                                          │
│   • Contact / friction│ ──► │   • Renders Gaussian splat (SPZ)         │
│   • Robot control     │     │   • Overlays object poses from MuJoCo    │
│   • Grasping / push   │     │   • Camera stream for vision agents      │
└───────────────────────┘     └──────────────────────────────────────────┘

Repository Structure

worldsim/
│
├── pipeline/
│   ├── ingest.py              # Entry point — takes text/image/video
│   ├── world_builder.py       # World Labs API: scene → splat + collider mesh
│   ├── object_generator.py    # Meshy AI API: text prompts → GLB objects
│   ├── semantic_labeler.py    # Claude API: assigns affordances + physics props
│   ├── mesh_processor.py      # CoACD decomposition, mesh cleanup, normalization
│   ├── mjcf_composer.py       # Writes the final scene.xml
│   └── scene_manifest.py      # ObjectManifest dataclass + JSON schema
│
├── renderer/
│   ├── splat_renderer/        # Three.js app rendering the SPZ splat
│   │   ├── index.html
│   │   ├── main.js            # Loads splat, syncs object poses from MuJoCo
│   │   └── pose_bridge.js     # WebSocket bridge ← MuJoCo pose stream
│   └── mujoco_viewer.py       # Optional: MuJoCo passive viewer for debug
│
├── robots/
│   ├── franka_panda.xml       # Franka Emika Panda (manipulation)
│   ├── ur5.xml                # Universal Robots UR5
│   └── spot.xml               # Boston Dynamics Spot (locomotion)
│
├── scene_output/              # Generated per run
│   ├── scene.xml              # Final MuJoCo MJCF
│   ├── manifest.json          # ObjectManifest for all objects
│   ├── assets/
│   │   ├── world_mesh.obj     # World Labs collision mesh
│   │   ├── world_splat.spz    # World Labs Gaussian splat (renderer only)
│   │   ├── pano.jpg           # Panorama (skybox)
│   │   └── objects/
│   │       ├── mug_visual.glb
│   │       ├── mug_collision.obj
│   │       ├── chair_visual.glb
│   │       └── chair_collision.obj
│   └── preview.png
│
├── tools/
│   ├── worldlabs_generator/   # App: text/image/video → World Labs world
│   ├── meshy_importer/        # App: text prompts → MuJoCo objects
│   └── scene_inspector/       # App: browse + edit ObjectManifest
│
├── tests/
│   ├── test_mesh_processor.py
│   ├── test_mjcf_composer.py
│   └── test_semantic_labeler.py
│
├── .env.example
├── requirements.txt
└── README.md

Core Data Structure — ObjectManifest

This is the central data contract that flows through the entire pipeline. Every object in the scene has one of these.

@dataclass
class ObjectManifest:
    # Identity
    id: str                        # UUID
    name: str                      # "wooden_chair_01"
    semantic_label: str            # "chair" | "mug" | "door" | "table" | ...
    source: str                    # "meshy" | "world_labs" | "manual"

    # Assets
    visual_mesh_path: str          # .glb — used by renderer
    collision_mesh_path: str       # .obj — convex hulls for MuJoCo
    thumbnail_url: str | None

    # Placement
    position: list[float]          # [x, y, z] in world frame
    orientation_quat: list[float]  # [w, x, y, z]
    scale: float                   # uniform scale applied to mesh

    # Physics properties
    mass_kg: float                 # estimated from material + size
    friction: float                # sliding friction coefficient
    restitution: float             # bounciness 0–1
    is_static: bool                # True = fixed geom, False = free body

    # Joint type
    joint_type: str                # "free" | "hinge" | "slider" | "fixed" | "weld"
    joint_axis: list[float] | None # [0,0,1] for hinge around Z
    joint_range: list[float] | None  # [min, max] in radians or meters

    # Affordances (for robot planning)
    is_graspable: bool             # robot can pick it up
    grasp_points: list[dict]       # [{pos, normal, aperture_mm}]
    is_pushable: bool
    is_openable: bool              # drawer, door, lid
    support_surface: bool          # can objects be placed on top?
    contains_objects: bool         # is it a container?

    # Semantic extras
    material_class: str            # "wood" | "metal" | "ceramic" | "fabric"
    size_class: str                # "small" | "medium" | "large"
    notes: str                     # free-form from Claude labeling pass
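Since manifest.json is just serialized dataclasses, a round-trip through `dataclasses.asdict` and `json` is enough to persist and restore an object. A minimal sketch with a trimmed-down manifest (field names match the dataclass above; the values and the `ceramic_mug_01` object are made up for illustration):

```python
# Round-trip sketch: a trimmed ObjectManifest to/from JSON.
# Field names follow the full dataclass above; values are illustrative.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ObjectManifest:
    id: str
    name: str
    semantic_label: str
    mass_kg: float
    friction: float
    is_static: bool
    joint_type: str
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])
    orientation_quat: list = field(default_factory=lambda: [1.0, 0.0, 0.0, 0.0])

mug = ObjectManifest(
    id="example-uuid",           # a real UUID in practice
    name="ceramic_mug_01",
    semantic_label="mug",
    mass_kg=0.35,
    friction=0.6,
    is_static=False,
    joint_type="free",
)

# asdict produces a JSON-safe dict; the round trip is lossless
payload = json.dumps(asdict(mug))
restored = ObjectManifest(**json.loads(payload))
```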

Pipeline — Step by Step

Step 1 · Scene Generation (World Labs)

Input: text prompt, image, or video
Output: world_splat.spz, world_mesh.glb, pano.jpg

from pipeline.world_builder import WorldBuilder

builder = WorldBuilder(api_key=WORLDLABS_KEY)
world = builder.generate(
    input_type="image",          # "text" | "image" | "video"
    source="photo.jpg",          # path or URL
    display_name="Kitchen Scene",
    text_hint="A cluttered home kitchen counter"
)
# world.collider_mesh_url → download as world_mesh.glb
# world.splat_url         → download as world_splat.spz
# world.pano_url          → download as pano.jpg

The collider mesh becomes the static environment geom in MuJoCo.
The splat goes directly to the Three.js renderer — MuJoCo never sees it.


Step 2 · Object Generation (Meshy AI)

Input: list of text prompts
Output: {name}_visual.glb + {name}_collision.obj per object

from pipeline.object_generator import ObjectGenerator

gen = ObjectGenerator(api_key=MESHY_KEY)
objects = gen.generate_batch([
    "ceramic coffee mug with handle",
    "wooden kitchen chair",
    "stainless steel kettle",
    "small cardboard box",
])
# Each returns visual GLB + raw mesh for decomposition

Step 3 · Semantic Labeling (Claude)

Input: object name, thumbnail, scene context
Output: filled ObjectManifest for each object

from pipeline.semantic_labeler import SemanticLabeler

labeler = SemanticLabeler(api_key=ANTHROPIC_KEY)
manifests = labeler.label_all(objects, scene_context="home kitchen")

Claude receives: object name, thumbnail image, scene description.
Claude returns structured JSON matching ObjectManifest schema:

mass_kg: 0.35        ← "ceramic mug, medium size, ~350g"
friction: 0.6        ← "ceramic on wood surface"
joint_type: "free"   ← movable object
is_graspable: true   ← cylindrical body, handle present
grasp_points: [...]  ← handle grasp + body grasp positions
support_surface: false
is_openable: false
material_class: "ceramic"
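Because the labels come back from an LLM, it is worth sanity-checking the JSON before accepting it into the pipeline. A minimal validation sketch (not the repo's actual validator; the field names come from ObjectManifest, the bounds are assumptions):

```python
# Sketch of a sanity check on Claude's structured labeling output.
# Field names follow ObjectManifest; the plausibility bounds are assumptions.
import json

REQUIRED = {"mass_kg", "friction", "joint_type", "is_graspable", "material_class"}
JOINT_TYPES = {"free", "hinge", "slider", "fixed", "weld"}

def validate_label(raw: str) -> dict:
    """Parse one labeling response and reject obviously bad values."""
    data = json.loads(raw)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["joint_type"] not in JOINT_TYPES:
        raise ValueError(f"unknown joint_type: {data['joint_type']}")
    if not 0.0 < data["mass_kg"] < 500.0:  # reject absurd mass estimates
        raise ValueError(f"implausible mass: {data['mass_kg']}")
    return data

label = validate_label(json.dumps({
    "mass_kg": 0.35, "friction": 0.6, "joint_type": "free",
    "is_graspable": True, "material_class": "ceramic",
}))
```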

Step 4 · Mesh Processing (CoACD + trimesh)

Input: {name}_visual.glb from Meshy
Output: {name}_collision.obj — convex hull decomposition

from pipeline.mesh_processor import MeshProcessor

proc = MeshProcessor()
for obj in manifests:
    proc.decompose(
        glb_path=obj.visual_mesh_path,
        out_path=obj.collision_mesh_path,
        max_hulls=16,           # 8–16 is the sweet spot
        threshold=0.05,         # CoACD concavity threshold
    )

Why CoACD and not V-HACD?
CoACD produces fewer, better-fitting hulls at the same fidelity. V-HACD is available as a fallback via --decomposer vhacd.
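The `--decomposer` flag name comes from the text above; how it might dispatch between backends is sketched here as an assumption, not the actual CLI:

```python
# Hedged sketch of the --decomposer fallback: the flag name is from the
# README, the argparse dispatch itself is an assumption.
import argparse

def pick_decomposer(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--decomposer", choices=["coacd", "vhacd"],
                        default="coacd")
    args = parser.parse_args(argv)
    return args.decomposer
```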

Dual-mesh pattern in MuJoCo:

<!-- Visual-only geom (rendered, no collision) -->
<geom name="mug_vis" type="mesh" mesh="mug_visual"
      contype="0" conaffinity="0" group="1" rgba="1 1 1 1"/>

<!-- Collision-only geom (physics, invisible in nice render) -->
<geom name="mug_col" type="mesh" mesh="mug_collision"
      contype="1" conaffinity="3" group="0" mass="0.35"
      friction="0.6 0.02 0.001"/>
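The same visual/collision pair can be emitted programmatically. A sketch using `xml.etree` (the attribute values follow the snippet above; the `dual_geoms` helper is hypothetical, not the mjcf_composer API):

```python
# Sketch: emit the dual-mesh geom pair from the snippet above with
# xml.etree. The dual_geoms helper is an illustrative assumption.
import xml.etree.ElementTree as ET

def dual_geoms(body: ET.Element, name: str, mass: float, friction: float):
    # Visual-only geom: group 1, contacts disabled
    ET.SubElement(body, "geom", name=f"{name}_vis", type="mesh",
                  mesh=f"{name}_visual", contype="0", conaffinity="0",
                  group="1", rgba="1 1 1 1")
    # Collision-only geom: group 0, carries mass and friction
    ET.SubElement(body, "geom", name=f"{name}_col", type="mesh",
                  mesh=f"{name}_collision", contype="1", conaffinity="3",
                  group="0", mass=str(mass),
                  friction=f"{friction} 0.02 0.001")

body = ET.Element("body", name="mug", pos="0 0 0.8")
dual_geoms(body, "mug", mass=0.35, friction=0.6)
xml = ET.tostring(body, encoding="unicode")
```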

Step 5 · MJCF Composition

Input: ObjectManifest[] + world mesh paths
Output: scene.xml

from pipeline.mjcf_composer import MJCFComposer

composer = MJCFComposer()
composer.set_environment(world_mesh="assets/world_mesh.obj", pano="assets/pano.jpg")
composer.add_robot("robots/franka_panda.xml", pos=[0, 0, 0.9])
for m in manifests:
    composer.add_object(m)
composer.write("scene_output/scene.xml")

Key MJCF decisions made per ObjectManifest:

is_static   joint_type   MuJoCo result
true        any          <geom> directly in <worldbody> — no body, no joint
false       "free"       <body><freejoint/> — full 6-DOF rigid body
false       "hinge"      <body><joint type="hinge" axis="..."/> — door, drawer
false       "slider"     <body><joint type="slide"/> — sliding drawer
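The table above reduces to a small dispatch. A sketch of how the composer might pick elements per manifest (the function name and return shape are illustrative, not the actual mjcf_composer API):

```python
# Sketch of the per-object MJCF decision table above. Returns the kind
# of element plus the MuJoCo joint tag; names here are illustrative.
def mjcf_plan(is_static: bool, joint_type: str):
    if is_static:
        return ("worldbody-geom", None)  # fixed geom, no body, no joint
    joints = {"free": "freejoint", "hinge": "hinge", "slider": "slide"}
    if joint_type not in joints:
        raise ValueError(f"unsupported joint_type: {joint_type}")
    return ("body", joints[joint_type])
```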

Step 6 · Synchronized Dual Render

MuJoCo runs headlessly. A lightweight WebSocket server streams body poses at 60 Hz to the Three.js renderer, which composites the synced object transforms over the Gaussian splat.

MuJoCo (Python)                Three.js (Browser)
     │                               │
     │  pose_stream {                │
     │    "mug": [x,y,z,qw,qx,qy,qz] │
     │    "chair": [...]             │
     │  }  ─────────── WebSocket ──► │  update object transforms
     │                               │  render splat + objects

For robot training / headless RL: use MuJoCo's built-in offscreen renderer — no Three.js needed.
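One way the pose_stream payload in the diagram above could be packed on the Python side (the message shape follows the diagram; the `pack_poses` helper and the `type` field are assumptions):

```python
# Sketch: pack MuJoCo body poses as the pose_stream JSON message from
# the diagram above. The framing ("type" field) is an assumption.
import json

def pack_poses(poses: dict) -> str:
    """poses maps body name -> [x, y, z, qw, qx, qy, qz] (7 floats)."""
    for name, p in poses.items():
        assert len(p) == 7, f"{name}: expected xyz + wxyz quaternion"
    return json.dumps({"type": "pose_stream", "poses": poses})

msg = pack_poses({"mug": [0.1, 0.0, 0.8, 1.0, 0.0, 0.0, 0.0]})
decoded = json.loads(msg)
```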


Tech Stack

Layer                  Technology                                  Purpose
Scene generation       World Labs API                              Text/image/video → 3D world + collider mesh
Object generation      Meshy AI API                                Text → 3D GLB objects
Semantic labeling      Claude API (claude-sonnet-4-6)              Affordances, physics props, joint types
Physics simulation     MuJoCo 3.x                                  Rigid body, contacts, grasping, robot control
Convex decomposition   CoACD                                       Mesh → convex hulls for stable MuJoCo contact
Mesh processing        trimesh                                     Load/convert/normalize GLB/OBJ meshes
Visual rendering       Three.js + @luma-ai/three-gaussian-splats   Render Gaussian splat
Pose bridge            Python websockets + JS                      Stream body transforms from MuJoCo → renderer
Robot models           MuJoCo Menagerie                            Franka, UR5, Spot, Leap Hand

Installation

git clone https://github.com/yourname/worldsim
cd worldsim

# Python deps
pip install -r requirements.txt
# Key: mujoco trimesh coacd requests anthropic websockets numpy pillow open3d

# Node deps (for splat renderer)
cd renderer/splat_renderer
npm install

# Copy and fill in API keys
cp .env.example .env

.env:

WORLDLABS_API_KEY=wlt_xxxxxxxxxxxxxxxxxxxxx
MESHY_API_KEY=msy_xxxxxxxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxx

Quick Start

# Generate a full scene from a photo
python -m pipeline.ingest \
  --input-type image \
  --source my_room.jpg \
  --objects "wooden chair, coffee mug, small table lamp, cardboard box" \
  --robot franka_panda \
  --output scene_output/

# Launch MuJoCo viewer
python -m mujoco.viewer --mjcf=scene_output/scene.xml

# Launch visual renderer (splat + synced objects)
cd renderer/splat_renderer && npm start

Design Decisions & Tradeoffs

Why not use the Gaussian splat for physics?

Gaussian splats have no hard surface boundaries; they are clouds of soft, overlapping Gaussian primitives. MuJoCo contact resolution requires a watertight triangle mesh. The World Labs collider GLB is purpose-built for this.

Why CoACD over a single convex hull?

A single convex hull wraps the entire object in one hull (imagine a shrink-wrap around a chair — it fills the space between the legs). CoACD produces multiple hulls that approximate the actual shape, giving correct contact behavior when objects are stacked, nested, or have concavities.

Why dual-mesh (visual + collision)?

Raw Meshy GLBs are high-poly (50k–300k triangles) and concave. MuJoCo slows down significantly with high-poly collision geoms and produces incorrect contacts on concave surfaces. Separating visual from collision lets you have photorealistic appearance at near-zero physics cost.

Why Claude for semantic labeling?

Physics properties (mass, friction, joint type, affordances) are fundamentally semantic. A "ceramic mug" should have different mass and friction than a "metal kettle" even if their meshes are similar. Claude can reason about material class, object function, and robot interaction affordances from a name + thumbnail — no training required.

Why not just use Isaac Sim or Genesis?

Both are valid alternatives. WorldSim's advantage is the World Labs → photorealistic novel environments path: you can generate a scene from a single photo of any real space, not just procedural or pre-built environments. Isaac Sim has better out-of-box robot libraries; Genesis has faster parallel simulation. WorldSim is optimized for world reconstruction fidelity and open API composability.


Roadmap

  • Articulated object support — detect and model doors, drawers with correct hinge/slider joints
  • Object detection from scene — use a vision model to auto-extract objects from the World Labs thumbnail, rather than requiring manual prompt list
  • RL training loop — parallel MuJoCo environments, reward shaping for manipulation tasks
  • Real-to-sim transfer — depth camera / LiDAR scan → mesh → WorldSim scene
  • Deformable objects — soft-body support for cloth, foam, food items
  • Tactile sensors — per-geom contact force readout for gripper feedback
  • Scene editing UI — drag-and-drop object placement, physics property tweaking
  • Genesis backend — swap MuJoCo for Genesis for massively parallel GPU sim

License

MIT — see LICENSE.


Citation

If you use WorldSim in research:

@software{worldsim2026,
  title   = {WorldSim: Multimodal World-to-Robot-Interaction Engine},
  year    = {2026},
  url     = {https://github.com/yourname/worldsim}
}
