Generate a visually rich 3D environment from text, image, or video — then drop a robot inside it. MuJoCo handles physics. Gaussian splats handle appearance. Claude handles semantic understanding.
WorldSim is a pipeline that takes any of the following as input:
- A text description ("a cluttered kitchen counter")
- A photograph of a real space
- A video walkthrough of an environment
And produces:
- A photorealistic reconstructed world (World Labs Gaussian splat)
- A set of interactive 3D objects with physics properties (Meshy AI)
- A MuJoCo scene with collision geometry, mass, friction, joints, and affordances
- A synchronized dual renderer so the world looks realistic while physics stay stable
- A robot that can push, grasp, knock over, and manipulate everything in the scene
┌─────────────────────────────────────────────────────────────────────────┐
│ INPUT LAYER │
│ Text Prompt │ Image/Photo │ Video │
└──────────────┬────────────┴───────────────┴────────────┬────────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌───────────────────────────────┐
│ SCENE GENERATION │ │ OBJECT GENERATION │
│ World Labs API │ │ Meshy AI API │
│ │ │ │
│ → Gaussian splat (SPZ) │ │ → GLB mesh (visual) │
│ → Collider mesh (GLB) │ │ → OBJ mesh (physics) │
│ → Panorama image │ │ → Affordance metadata │
└──────────┬───────────────┘ └──────────────┬────────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SEMANTIC LABELING LAYER │
│ Claude API (claude-sonnet-4-6) │
│ │
│ For each object extracted from scene: │
│ • Label type (table, mug, door, drawer, obstacle…) │
│ • Estimate mass, friction, restitution from material │
│ • Assign joint type (free, hinge, slider, fixed, welded) │
│ • Detect grasp affordance (can robot pick this up?) │
│ • Determine static vs. dynamic │
│ • Output structured ObjectManifest JSON │
└──────────────────────────┬──────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MESH PROCESSING LAYER │
│ trimesh + CoACD + Open3D │
│ │
│ Visual mesh (GLB) → kept for renderer, group=1 in MJCF │
│ Collision mesh (GLB) → CoACD convex decomposition → 8–16 hulls │
│ simplify, center, normalize scale │
└──────────────────────────┬──────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MJCF COMPOSITION LAYER │
│ mjcf_composer (Python) │
│ │
│ • Write <asset> block — all meshes registered │
│ • Write worldbody — static environment geom │
│ • Write per-object <body> with freejoint or hinge/slider │
│ • Attach visual geom (group=1) + collision geom (group=0) per body │
│ • Include robot MJCF via <include> or direct embed │
│ • Write actuators, sensors, cameras │
│ • Output: scene.xml + assets/ directory │
└──────────────────────────┬──────────────────────────────────────────────┘
│
┌──────────────┴──────────────┐
▼ ▼
┌───────────────────────┐ ┌──────────────────────────────────────────┐
│ MUJOCO PHYSICS │ │ VISUAL RENDERER │
│ (headless) │ │ (Three.js / @luma-ai/three-gaussian- │
│ │ ←── │ splats or Nerfstudio viewer) │
│ • Rigid body sim │sync │ │
│ • Contact / friction │ ──► │ • Renders Gaussian splat (SPZ) │
│ • Robot control │ │ • Overlays object poses from MuJoCo │
│ • Grasping / push │ │ • Camera stream for vision agents │
└───────────────────────┘ └──────────────────────────────────────────┘
worldsim/
│
├── pipeline/
│ ├── ingest.py # Entry point — takes text/image/video
│ ├── world_builder.py # World Labs API: scene → splat + collider mesh
│ ├── object_generator.py # Meshy AI API: text prompts → GLB objects
│ ├── semantic_labeler.py # Claude API: assigns affordances + physics props
│ ├── mesh_processor.py # CoACD decomposition, mesh cleanup, normalization
│ ├── mjcf_composer.py # Writes the final scene.xml
│ └── scene_manifest.py # ObjectManifest dataclass + JSON schema
│
├── renderer/
│ ├── splat_renderer/ # Three.js app rendering the SPZ splat
│ │ ├── index.html
│ │ ├── main.js # Loads splat, syncs object poses from MuJoCo
│ │ └── pose_bridge.js # WebSocket bridge ← MuJoCo pose stream
│ └── mujoco_viewer.py # Optional: MuJoCo passive viewer for debug
│
├── robots/
│ ├── franka_panda.xml # Franka Emika Panda (manipulation)
│ ├── ur5.xml # Universal Robots UR5
│ └── spot.xml # Boston Dynamics Spot (locomotion)
│
├── scene_output/ # Generated per run
│ ├── scene.xml # Final MuJoCo MJCF
│ ├── manifest.json # ObjectManifest for all objects
│ ├── assets/
│ │ ├── world_mesh.obj # World Labs collision mesh
│ │ ├── world_splat.spz # World Labs Gaussian splat (renderer only)
│ │ ├── pano.jpg # Panorama (skybox)
│ │ └── objects/
│ │ ├── mug_visual.glb
│ │ ├── mug_collision.obj
│ │ ├── chair_visual.glb
│ │ └── chair_collision.obj
│ └── preview.png
│
├── tools/
│ ├── worldlabs_generator/ # App: text/image/video → World Labs world
│ ├── meshy_importer/ # App: text prompts → MuJoCo objects
│ └── scene_inspector/ # App: browse + edit ObjectManifest
│
├── tests/
│ ├── test_mesh_processor.py
│ ├── test_mjcf_composer.py
│ └── test_semantic_labeler.py
│
├── .env.example
├── requirements.txt
└── README.md
This is the central data contract that flows through the entire pipeline. Every object in the scene has one of these.
```python
from dataclasses import dataclass


@dataclass
class ObjectManifest:
    # Identity
    id: str                          # UUID
    name: str                        # "wooden_chair_01"
    semantic_label: str              # "chair" | "mug" | "door" | "table" | ...
    source: str                      # "meshy" | "world_labs" | "manual"

    # Assets
    visual_mesh_path: str            # .glb — used by renderer
    collision_mesh_path: str         # .obj — convex hulls for MuJoCo
    thumbnail_url: str | None

    # Placement
    position: list[float]            # [x, y, z] in world frame
    orientation_quat: list[float]    # [w, x, y, z]
    scale: float                     # uniform scale applied to mesh

    # Physics properties
    mass_kg: float                   # estimated from material + size
    friction: float                  # sliding friction coefficient
    restitution: float               # bounciness 0–1
    is_static: bool                  # True = fixed geom, False = free body

    # Joint type
    joint_type: str                  # "free" | "hinge" | "slider" | "fixed" | "weld"
    joint_axis: list[float] | None   # [0, 0, 1] for hinge around Z
    joint_range: list[float] | None  # [min, max] in radians or meters

    # Affordances (for robot planning)
    is_graspable: bool               # robot can pick it up
    grasp_points: list[dict]         # [{pos, normal, aperture_mm}]
    is_pushable: bool
    is_openable: bool                # drawer, door, lid
    support_surface: bool            # can objects be placed on top?
    contains_objects: bool           # is it a container?

    # Semantic extras
    material_class: str              # "wood" | "metal" | "ceramic" | "fabric"
    size_class: str                  # "small" | "medium" | "large"
    notes: str                       # free-form from Claude's labeling pass
```

Input: text prompt, image, or video
Output: world_splat.spz, world_mesh.glb, pano.jpg
```python
from pipeline.world_builder import WorldBuilder

builder = WorldBuilder(api_key=WORLDLABS_KEY)
world = builder.generate(
    input_type="image",          # "text" | "image" | "video"
    source="photo.jpg",          # path or URL
    display_name="Kitchen Scene",
    text_hint="A cluttered home kitchen counter",
)

# world.collider_mesh_url → download as world_mesh.glb
# world.splat_url         → download as world_splat.spz
# world.pano_url          → download as pano.jpg
```

The collider mesh becomes the static environment geom in MuJoCo. The splat goes directly to the Three.js renderer — MuJoCo never sees it.
Input: list of text prompts
Output: {name}_visual.glb + {name}_collision.obj per object
```python
from pipeline.object_generator import ObjectGenerator

gen = ObjectGenerator(api_key=MESHY_KEY)
objects = gen.generate_batch([
    "ceramic coffee mug with handle",
    "wooden kitchen chair",
    "stainless steel kettle",
    "small cardboard box",
])
# Each returns a visual GLB + a raw mesh for decomposition
```

Input: object name, thumbnail, scene context
Output: filled ObjectManifest for each object
```python
from pipeline.semantic_labeler import SemanticLabeler

labeler = SemanticLabeler(api_key=ANTHROPIC_KEY)
manifests = labeler.label_all(objects, scene_context="home kitchen")
```

Claude receives: object name, thumbnail image, scene description.
Claude returns structured JSON matching ObjectManifest schema:
```
mass_kg: 0.35           ← "ceramic mug, medium size, ~350 g"
friction: 0.6           ← "ceramic on wood surface"
joint_type: "free"      ← movable object
is_graspable: true      ← cylindrical body, handle present
grasp_points: [...]     ← handle grasp + body grasp positions
support_surface: false
is_openable: false
material_class: "ceramic"
```
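LLM-estimated physics values occasionally land outside ranges MuJoCo simulates stably, so it is worth clamping Claude's output before the manifest reaches the composer. A minimal sketch — the helper name and the specific bounds are illustrative assumptions, not part of the pipeline:

```python
# Hypothetical sanity pass over Claude's JSON output before MJCF composition.
# Bounds are illustrative defaults, not values prescribed by the pipeline.
VALID_JOINTS = {"free", "hinge", "slider", "fixed", "weld"}

def sanitize_physics(manifest: dict) -> dict:
    out = dict(manifest)
    # Clamp into ranges that keep the rigid-body solver well behaved.
    out["mass_kg"] = min(max(float(out.get("mass_kg", 1.0)), 0.01), 500.0)
    out["friction"] = min(max(float(out.get("friction", 0.5)), 0.05), 2.0)
    out["restitution"] = min(max(float(out.get("restitution", 0.0)), 0.0), 1.0)
    # Fall back to a safe joint type if the label is outside the schema.
    if out.get("joint_type") not in VALID_JOINTS:
        out["joint_type"] = "fixed" if out.get("is_static") else "free"
    return out
```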
Input: {name}_visual.glb from Meshy
Output: {name}_collision.obj — convex hull decomposition
```python
from pipeline.mesh_processor import MeshProcessor

proc = MeshProcessor()
for obj in manifests:
    proc.decompose(
        glb_path=obj.visual_mesh_path,
        out_path=obj.collision_mesh_path,
        max_hulls=16,      # 8–16 is the sweet spot
        threshold=0.05,    # CoACD concavity threshold
    )
```

**Why CoACD and not V-HACD?** CoACD produces fewer, better-fitting hulls at the same fidelity. V-HACD is available as a fallback via `--decomposer vhacd`.
Dual-mesh pattern in MuJoCo:
```xml
<!-- Visual-only geom (rendered, no collision) -->
<geom name="mug_vis" type="mesh" mesh="mug_visual"
      contype="0" conaffinity="0" group="1" rgba="1 1 1 1"/>

<!-- Collision-only geom (physics, invisible in the nice render) -->
<geom name="mug_col" type="mesh" mesh="mug_collision"
      contype="1" conaffinity="3" group="0" mass="0.35"
      friction="0.6 0.02 0.001"/>
```

Input: ObjectManifest[] + world mesh paths
Output: scene.xml
```python
from pipeline.mjcf_composer import MJCFComposer

composer = MJCFComposer()
composer.set_environment(world_mesh="assets/world_mesh.obj", pano="assets/pano.jpg")
composer.add_robot("robots/franka_panda.xml", pos=[0, 0, 0.9])
for m in manifests:
    composer.add_object(m)
composer.write("scene_output/scene.xml")
```

Key MJCF decisions made per ObjectManifest:
| `is_static` | `joint_type` | MuJoCo result |
|---|---|---|
| `true` | any | `<geom>` directly in `<worldbody>` — no body, no joint |
| `false` | `"free"` | `<body><freejoint/>` — full 6-DOF rigid body |
| `false` | `"hinge"` | `<body><joint type="hinge" axis="..."/>` — door, drawer |
| `false` | `"slider"` | `<body><joint type="slide"/>` — sliding drawer |
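The branching above can be sketched with stdlib XML building. This is an illustrative reduction assuming dict-shaped manifests; the real `mjcf_composer` also registers assets, attaches both geom groups, and writes joint ranges:

```python
# Illustrative sketch of the per-object MJCF branching (not the repo's code).
import xml.etree.ElementTree as ET

def body_for(manifest: dict) -> ET.Element:
    if manifest["is_static"]:
        # Static objects become plain geoms in <worldbody>: no body, no joint.
        return ET.Element("geom", name=manifest["name"], type="mesh",
                          mesh=f'{manifest["name"]}_collision')
    body = ET.Element("body", name=manifest["name"],
                      pos=" ".join(str(v) for v in manifest["position"]))
    jt = manifest["joint_type"]
    if jt == "free":
        ET.SubElement(body, "freejoint")
    elif jt in ("hinge", "slider"):
        # MJCF calls the prismatic joint "slide".
        ET.SubElement(body, "joint",
                      type="hinge" if jt == "hinge" else "slide",
                      axis=" ".join(str(v) for v in manifest["joint_axis"]))
    # "fixed" / "weld": no joint element — the body is rigid to its parent.
    ET.SubElement(body, "geom", type="mesh",
                  mesh=f'{manifest["name"]}_collision', group="0")
    return body
```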
MuJoCo runs headlessly. A lightweight WebSocket server streams body poses at 60 Hz to the Three.js renderer, which overlays the simulated objects on the Gaussian splat.

```
MuJoCo (Python)                            Three.js (Browser)
      │                                        │
      │   pose_stream {                        │
      │     "mug":   [x,y,z,qw,qx,qy,qz],      │
      │     "chair": [...]                     │
      │   } ────────── WebSocket ─────────►    │   update object transforms
      │                                        │   render splat + objects
```
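The message packing itself reduces to a few lines of stdlib JSON; the sketch below matches the wire format in the diagram (the function name is illustrative). One convention trap worth noting: MuJoCo's `xquat` is ordered [w, x, y, z], while `THREE.Quaternion` takes (x, y, z, w), so the JS side must reorder before applying the transform.

```python
import json

def pack_poses(poses: dict) -> str:
    """poses: body name -> (pos[3], quat_wxyz[4]) as read from MuJoCo xpos/xquat.

    Emits the flat [x, y, z, qw, qx, qy, qz] arrays shown in the diagram;
    the receiving JS reorders the quaternion to (x, y, z, w) for Three.js.
    """
    return json.dumps({name: list(pos) + list(quat)
                       for name, (pos, quat) in poses.items()})
```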
For robot training / headless RL: use MuJoCo's built-in offscreen renderer — no Three.js needed.
| Layer | Technology | Purpose |
|---|---|---|
| Scene generation | World Labs API | Text/image/video → 3D world + collider mesh |
| Object generation | Meshy AI API | Text → 3D GLB objects |
| Semantic labeling | Claude API (claude-sonnet-4-6) | Affordances, physics props, joint types |
| Physics simulation | MuJoCo 3.x | Rigid body, contacts, grasping, robot control |
| Convex decomposition | CoACD | Mesh → convex hulls for stable MuJoCo contact |
| Mesh processing | trimesh | Load/convert/normalize GLB/OBJ meshes |
| Visual rendering | Three.js + @luma-ai/three-gaussian-splats | Render Gaussian splat |
| Pose bridge | Python websockets + JS | Stream body transforms from MuJoCo → renderer |
| Robot models | MuJoCo Menagerie | Franka, UR5, Spot, Leap Hand |
```shell
git clone https://github.com/yourname/worldsim
cd worldsim

# Python deps
pip install -r requirements.txt
# Key: mujoco trimesh coacd requests anthropic websockets numpy pillow open3d

# Node deps (for splat renderer)
cd renderer/splat_renderer
npm install

# Copy and fill in API keys
cp .env.example .env
```

.env:

```
WORLDLABS_API_KEY=wlt_xxxxxxxxxxxxxxxxxxxxx
MESHY_API_KEY=msy_xxxxxxxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxx
```
```shell
# Generate a full scene from a photo
python -m pipeline.ingest \
  --input-type image \
  --source my_room.jpg \
  --objects "wooden chair, coffee mug, small table lamp, cardboard box" \
  --robot franka_panda \
  --output scene_output/

# Launch the MuJoCo viewer
python -m mujoco.viewer --mjcf=scene_output/scene.xml

# Launch the visual renderer (splat + synced objects)
cd renderer/splat_renderer && npm start
```

**Why does MuJoCo need a separate collider mesh?** Gaussian splats have no hard surface boundaries — they are probability distributions, not surfaces. MuJoCo contact resolution requires a closed triangle mesh. The World Labs collider GLB is purpose-built for this.
**Why convex decomposition instead of one convex hull?** A single convex hull shrink-wraps the entire object (imagine shrink-wrap around a chair: it fills the space between the legs). CoACD produces multiple hulls that approximate the actual shape, giving correct contact behavior when objects are stacked, nested, or have concavities.
**Why separate visual and collision meshes?** Raw Meshy GLBs are high-poly (50k–300k triangles) and concave. MuJoCo slows down significantly with high-poly collision geoms and produces incorrect contacts on concave surfaces. Separating visual from collision gives photorealistic appearance at near-zero physics cost.
**Why use Claude for physics properties?** Physics properties (mass, friction, joint type, affordances) are fundamentally semantic. A "ceramic mug" should have different mass and friction than a "metal kettle" even if their meshes are similar. Claude can reason about material class, object function, and robot interaction affordances from a name + thumbnail — no training required.
**Why WorldSim instead of Isaac Sim or Genesis?** Both are valid alternatives. WorldSim's advantage is the World Labs path to photorealistic novel environments: you can generate a scene from a single photo of any real space, not just procedural or pre-built environments. Isaac Sim has better out-of-box robot libraries; Genesis has faster parallel simulation. WorldSim is optimized for world reconstruction fidelity and open API composability.
- Articulated object support — detect and model doors, drawers with correct hinge/slider joints
- Object detection from scene — use a vision model to auto-extract objects from the World Labs thumbnail, rather than requiring manual prompt list
- RL training loop — parallel MuJoCo environments, reward shaping for manipulation tasks
- Real-to-sim transfer — depth camera / LiDAR scan → mesh → WorldSim scene
- Deformable objects — soft-body support for cloth, foam, food items
- Tactile sensors — per-geom contact force readout for gripper feedback
- Scene editing UI — drag-and-drop object placement, physics property tweaking
- Genesis backend — swap MuJoCo for Genesis for massively parallel GPU sim
MIT — see LICENSE.
If you use WorldSim in research:
```bibtex
@software{worldsim2026,
  title = {WorldSim: Multimodal World-to-Robot-Interaction Engine},
  year  = {2026},
  url   = {https://github.com/yourname/worldsim}
}
```