wlu03/ycomb-deepmind-hack

🌍 WorldSim — Multimodal World-to-Robot-Interaction Engine

Generate a visually rich 3D environment from text, image, or video — then drop a robot inside it. MuJoCo handles physics. Gaussian splats handle appearance. Claude handles semantic understanding.


What This Is

WorldSim is a pipeline that takes any of the following as input:

  • A text description ("a cluttered kitchen counter")
  • A photograph of a real space
  • A video walkthrough of an environment

And produces:

  • A photorealistic reconstructed world (World Labs Gaussian splat)
  • A set of interactive 3D objects with physics properties (Meshy AI)
  • A MuJoCo scene with collision geometry, mass, friction, joints, and affordances
  • A synchronized dual renderer so the world looks realistic while physics stay stable
  • A robot that can push, grasp, knock over, and manipulate everything in the scene

Architecture Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                          INPUT LAYER                                    │
│              Text Prompt  │  Image/Photo  │  Video                      │
└──────────────┬────────────┴───────────────┴────────────┬────────────────┘
               │                                         │
               ▼                                         ▼
┌──────────────────────────┐             ┌───────────────────────────────┐
│   SCENE GENERATION       │             │     OBJECT GENERATION         │
│   World Labs API         │             │     Meshy AI API              │
│                          │             │                               │
│  → Gaussian splat (SPZ)  │             │  → GLB mesh (visual)          │
│  → Collider mesh (GLB)   │             │  → OBJ mesh (physics)         │
│  → Panorama image        │             │  → Affordance metadata        │
└──────────┬───────────────┘             └──────────────┬────────────────┘
           │                                            │
           ▼                                            ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    SEMANTIC LABELING LAYER                               │
│                    Claude API (claude-sonnet-4-6)                       │
│                                                                         │
│  For each object extracted from scene:                                  │
│  • Label type (table, mug, door, drawer, obstacle…)                     │
│  • Estimate mass, friction, restitution from material                   │
│  • Assign joint type (free, hinge, slider, fixed, welded)               │
│  • Detect grasp affordance (can robot pick this up?)                    │
│  • Determine static vs. dynamic                                         │
│  • Output structured ObjectManifest JSON                                │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    MESH PROCESSING LAYER                                │
│                    trimesh + CoACD + Open3D                             │
│                                                                         │
│  Visual mesh (GLB)    →  kept for renderer, group=1 in MJCF            │
│  Collision mesh (GLB) →  CoACD convex decomposition → 8–16 hulls       │
│                          simplify, center, normalize scale              │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                    MJCF COMPOSITION LAYER                               │
│                    mjcf_composer (Python)                               │
│                                                                         │
│  • Write <asset> block — all meshes registered                          │
│  • Write worldbody — static environment geom                            │
│  • Write per-object <body> with freejoint or hinge/slider               │
│  • Attach visual geom (group=1) + collision geom (group=0) per body     │
│  • Include robot MJCF via <include> or direct embed                     │
│  • Write actuators, sensors, cameras                                    │
│  • Output: scene.xml + assets/ directory                                │
└──────────────────────────┬──────────────────────────────────────────────┘
                           │
            ┌──────────────┴──────────────┐
            ▼                             ▼
┌───────────────────────┐     ┌──────────────────────────────────────────┐
│   MUJOCO PHYSICS      │     │   VISUAL RENDERER                        │
│   (headless)          │     │   (Three.js / @luma-ai/three-gaussian-   │
│                       │ ←── │    splats or Nerfstudio viewer)          │
│   • Rigid body sim    │sync │                                          │
│   • Contact / friction│ ──► │   • Renders Gaussian splat (SPZ)         │
│   • Robot control     │     │   • Overlays object poses from MuJoCo    │
│   • Grasping / push   │     │   • Camera stream for vision agents      │
└───────────────────────┘     └──────────────────────────────────────────┘

Repository Structure

worldsim/
│
├── pipeline/
│   ├── ingest.py              # Entry point — takes text/image/video
│   ├── world_builder.py       # World Labs API: scene → splat + collider mesh
│   ├── object_generator.py    # Meshy AI API: text prompts → GLB objects
│   ├── semantic_labeler.py    # Claude API: assigns affordances + physics props
│   ├── mesh_processor.py      # CoACD decomposition, mesh cleanup, normalization
│   ├── mjcf_composer.py       # Writes the final scene.xml
│   └── scene_manifest.py      # ObjectManifest dataclass + JSON schema
│
├── renderer/
│   ├── splat_renderer/        # Three.js app rendering the SPZ splat
│   │   ├── index.html
│   │   ├── main.js            # Loads splat, syncs object poses from MuJoCo
│   │   └── pose_bridge.js     # WebSocket bridge ← MuJoCo pose stream
│   └── mujoco_viewer.py       # Optional: MuJoCo passive viewer for debug
│
├── robots/
│   ├── franka_panda.xml       # Franka Emika Panda (manipulation)
│   ├── ur5.xml                # Universal Robots UR5
│   └── spot.xml               # Boston Dynamics Spot (locomotion)
│
├── scene_output/              # Generated per run
│   ├── scene.xml              # Final MuJoCo MJCF
│   ├── manifest.json          # ObjectManifest for all objects
│   ├── assets/
│   │   ├── world_mesh.obj     # World Labs collision mesh
│   │   ├── world_splat.spz    # World Labs Gaussian splat (renderer only)
│   │   ├── pano.jpg           # Panorama (skybox)
│   │   └── objects/
│   │       ├── mug_visual.glb
│   │       ├── mug_collision.obj
│   │       ├── chair_visual.glb
│   │       └── chair_collision.obj
│   └── preview.png
│
├── tools/
│   ├── worldlabs_generator/   # App: text/image/video → World Labs world
│   ├── meshy_importer/        # App: text prompts → MuJoCo objects
│   └── scene_inspector/       # App: browse + edit ObjectManifest
│
├── tests/
│   ├── test_mesh_processor.py
│   ├── test_mjcf_composer.py
│   └── test_semantic_labeler.py
│
├── .env.example
├── requirements.txt
└── README.md

Core Data Structure — ObjectManifest

This is the central data contract that flows through the entire pipeline. Every object in the scene has one of these.

@dataclass
class ObjectManifest:
    # Identity
    id: str                        # UUID
    name: str                      # "wooden_chair_01"
    semantic_label: str            # "chair" | "mug" | "door" | "table" | ...
    source: str                    # "meshy" | "world_labs" | "manual"

    # Assets
    visual_mesh_path: str          # .glb — used by renderer
    collision_mesh_path: str       # .obj — convex hulls for MuJoCo
    thumbnail_url: str | None

    # Placement
    position: list[float]          # [x, y, z] in world frame
    orientation_quat: list[float]  # [w, x, y, z]
    scale: float                   # uniform scale applied to mesh

    # Physics properties
    mass_kg: float                 # estimated from material + size
    friction: float                # sliding friction coefficient
    restitution: float             # bounciness 0–1
    is_static: bool                # True = fixed geom, False = free body

    # Joint type
    joint_type: str                # "free" | "hinge" | "slider" | "fixed" | "weld"
    joint_axis: list[float] | None # [0,0,1] for hinge around Z
    joint_range: list[float] | None  # [min, max] in radians or meters

    # Affordances (for robot planning)
    is_graspable: bool             # robot can pick it up
    grasp_points: list[dict]       # [{pos, normal, aperture_mm}]
    is_pushable: bool
    is_openable: bool              # drawer, door, lid
    support_surface: bool          # can objects be placed on top?
    contains_objects: bool         # is it a container?

    # Semantic extras
    material_class: str            # "wood" | "metal" | "ceramic" | "fabric"
    size_class: str                # "small" | "medium" | "large"
    notes: str                     # free-form from Claude labeling pass
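Since manifest.json is just serialized dataclasses, a round-trip through `dataclasses.asdict` and `json` is enough to persist and restore an object. A minimal sketch with a trimmed-down manifest (field names match the dataclass above; the values and the `ceramic_mug_01` object are made up for illustration):

```python
# Round-trip sketch: a trimmed ObjectManifest to/from JSON.
# Field names follow the full dataclass above; values are illustrative.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ObjectManifest:
    id: str
    name: str
    semantic_label: str
    mass_kg: float
    friction: float
    is_static: bool
    joint_type: str
    position: list = field(default_factory=lambda: [0.0, 0.0, 0.0])
    orientation_quat: list = field(default_factory=lambda: [1.0, 0.0, 0.0, 0.0])

mug = ObjectManifest(
    id="example-uuid",           # a real UUID in practice
    name="ceramic_mug_01",
    semantic_label="mug",
    mass_kg=0.35,
    friction=0.6,
    is_static=False,
    joint_type="free",
)

# asdict produces a JSON-safe dict; the round trip is lossless
payload = json.dumps(asdict(mug))
restored = ObjectManifest(**json.loads(payload))
```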

Pipeline — Step by Step

Step 1 · Scene Generation (World Labs)

Input: text prompt, image, or video
Output: world_splat.spz, world_mesh.glb, pano.jpg

from pipeline.world_builder import WorldBuilder

builder = WorldBuilder(api_key=WORLDLABS_KEY)
world = builder.generate(
    input_type="image",          # "text" | "image" | "video"
    source="photo.jpg",          # path or URL
    display_name="Kitchen Scene",
    text_hint="A cluttered home kitchen counter"
)
# world.collider_mesh_url → download as world_mesh.glb
# world.splat_url         → download as world_splat.spz
# world.pano_url          → download as pano.jpg

The collider mesh becomes the static environment geom in MuJoCo.
The splat goes directly to the Three.js renderer — MuJoCo never sees it.


Step 2 · Object Generation (Meshy AI)

Input: list of text prompts
Output: {name}_visual.glb + {name}_collision.obj per object

from pipeline.object_generator import ObjectGenerator

gen = ObjectGenerator(api_key=MESHY_KEY)
objects = gen.generate_batch([
    "ceramic coffee mug with handle",
    "wooden kitchen chair",
    "stainless steel kettle",
    "small cardboard box",
])
# Each returns visual GLB + raw mesh for decomposition

Step 3 · Semantic Labeling (Claude)

Input: object name, thumbnail, scene context
Output: filled ObjectManifest for each object

from pipeline.semantic_labeler import SemanticLabeler

labeler = SemanticLabeler(api_key=ANTHROPIC_KEY)
manifests = labeler.label_all(objects, scene_context="home kitchen")

Claude receives: object name, thumbnail image, scene description.
Claude returns structured JSON matching ObjectManifest schema:

mass_kg: 0.35        ← "ceramic mug, medium size, ~350g"
friction: 0.6        ← "ceramic on wood surface"
joint_type: "free"   ← movable object
is_graspable: true   ← cylindrical body, handle present
grasp_points: [...]  ← handle grasp + body grasp positions
support_surface: false
is_openable: false
material_class: "ceramic"
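Because the labels come back from an LLM, it is worth sanity-checking the JSON before accepting it into the pipeline. A minimal validation sketch (not the repo's actual validator; the field names come from ObjectManifest, the bounds are assumptions):

```python
# Sketch of a sanity check on Claude's structured labeling output.
# Field names follow ObjectManifest; the plausibility bounds are assumptions.
import json

REQUIRED = {"mass_kg", "friction", "joint_type", "is_graspable", "material_class"}
JOINT_TYPES = {"free", "hinge", "slider", "fixed", "weld"}

def validate_label(raw: str) -> dict:
    """Parse one labeling response and reject obviously bad values."""
    data = json.loads(raw)
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["joint_type"] not in JOINT_TYPES:
        raise ValueError(f"unknown joint_type: {data['joint_type']}")
    if not 0.0 < data["mass_kg"] < 500.0:  # reject absurd mass estimates
        raise ValueError(f"implausible mass: {data['mass_kg']}")
    return data

label = validate_label(json.dumps({
    "mass_kg": 0.35, "friction": 0.6, "joint_type": "free",
    "is_graspable": True, "material_class": "ceramic",
}))
```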

Step 4 · Mesh Processing (CoACD + trimesh)

Input: {name}_visual.glb from Meshy
Output: {name}_collision.obj — convex hull decomposition

from pipeline.mesh_processor import MeshProcessor

proc = MeshProcessor()
for obj in manifests:
    proc.decompose(
        glb_path=obj.visual_mesh_path,
        out_path=obj.collision_mesh_path,
        max_hulls=16,           # 8–16 is the sweet spot
        threshold=0.05,         # CoACD concavity threshold
    )

Why CoACD and not V-HACD?
CoACD produces fewer, better-fitting hulls at the same fidelity. V-HACD is available as a fallback via --decomposer vhacd.
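The `--decomposer` flag name comes from the text above; how it might dispatch between backends is sketched here as an assumption, not the actual CLI:

```python
# Hedged sketch of the --decomposer fallback: the flag name is from the
# README, the argparse dispatch itself is an assumption.
import argparse

def pick_decomposer(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--decomposer", choices=["coacd", "vhacd"],
                        default="coacd")
    args = parser.parse_args(argv)
    return args.decomposer
```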

Dual-mesh pattern in MuJoCo:

<!-- Visual-only geom (rendered, no collision) -->
<geom name="mug_vis" type="mesh" mesh="mug_visual"
      contype="0" conaffinity="0" group="1" rgba="1 1 1 1"/>

<!-- Collision-only geom (physics, invisible in nice render) -->
<geom name="mug_col" type="mesh" mesh="mug_collision"
      contype="1" conaffinity="3" group="0" mass="0.35"
      friction="0.6 0.02 0.001"/>
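The same visual/collision pair can be emitted programmatically. A sketch using `xml.etree` (the attribute values follow the snippet above; the `dual_geoms` helper is hypothetical, not the mjcf_composer API):

```python
# Sketch: emit the dual-mesh geom pair from the snippet above with
# xml.etree. The dual_geoms helper is an illustrative assumption.
import xml.etree.ElementTree as ET

def dual_geoms(body: ET.Element, name: str, mass: float, friction: float):
    # Visual-only geom: group 1, contacts disabled
    ET.SubElement(body, "geom", name=f"{name}_vis", type="mesh",
                  mesh=f"{name}_visual", contype="0", conaffinity="0",
                  group="1", rgba="1 1 1 1")
    # Collision-only geom: group 0, carries mass and friction
    ET.SubElement(body, "geom", name=f"{name}_col", type="mesh",
                  mesh=f"{name}_collision", contype="1", conaffinity="3",
                  group="0", mass=str(mass),
                  friction=f"{friction} 0.02 0.001")

body = ET.Element("body", name="mug", pos="0 0 0.8")
dual_geoms(body, "mug", mass=0.35, friction=0.6)
xml = ET.tostring(body, encoding="unicode")
```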

Step 5 · MJCF Composition

Input: ObjectManifest[] + world mesh paths
Output: scene.xml

from pipeline.mjcf_composer import MJCFComposer

composer = MJCFComposer()
composer.set_environment(world_mesh="assets/world_mesh.obj", pano="assets/pano.jpg")
composer.add_robot("robots/franka_panda.xml", pos=[0, 0, 0.9])
for m in manifests:
    composer.add_object(m)
composer.write("scene_output/scene.xml")

Key MJCF decisions made per ObjectManifest:

is_static   joint_type   MuJoCo result
true        any          <geom> directly in <worldbody> — no body, no joint
false       "free"       <body><freejoint/> — full 6-DOF rigid body
false       "hinge"      <body><joint type="hinge" axis="..."/> — door, drawer
false       "slider"     <body><joint type="slide"/> — sliding drawer
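The table above reduces to a small dispatch. A sketch of how the composer might pick elements per manifest (the function name and return shape are illustrative, not the actual mjcf_composer API):

```python
# Sketch of the per-object MJCF decision table above. Returns the kind
# of element plus the MuJoCo joint tag; names here are illustrative.
def mjcf_plan(is_static: bool, joint_type: str):
    if is_static:
        return ("worldbody-geom", None)  # fixed geom, no body, no joint
    joints = {"free": "freejoint", "hinge": "hinge", "slider": "slide"}
    if joint_type not in joints:
        raise ValueError(f"unsupported joint_type: {joint_type}")
    return ("body", joints[joint_type])
```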

Step 6 · Synchronized Dual Render

MuJoCo runs headlessly. A lightweight WebSocket server streams body poses at 60 Hz to the Three.js renderer, which composites the synced object transforms over the Gaussian splat.

MuJoCo (Python)                Three.js (Browser)
     │                               │
     │  pose_stream {                │
     │    "mug": [x,y,z,qw,qx,qy,qz] │
     │    "chair": [...]             │
     │  }  ─────────── WebSocket ──► │  update object transforms
     │                               │  render splat + objects

For robot training / headless RL: use MuJoCo's built-in offscreen renderer — no Three.js needed.
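One way the pose_stream payload in the diagram above could be packed on the Python side (the message shape follows the diagram; the `pack_poses` helper and the `type` field are assumptions):

```python
# Sketch: pack MuJoCo body poses as the pose_stream JSON message from
# the diagram above. The framing ("type" field) is an assumption.
import json

def pack_poses(poses: dict) -> str:
    """poses maps body name -> [x, y, z, qw, qx, qy, qz] (7 floats)."""
    for name, p in poses.items():
        assert len(p) == 7, f"{name}: expected xyz + wxyz quaternion"
    return json.dumps({"type": "pose_stream", "poses": poses})

msg = pack_poses({"mug": [0.1, 0.0, 0.8, 1.0, 0.0, 0.0, 0.0]})
decoded = json.loads(msg)
```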


Tech Stack

Layer                  Technology                                  Purpose
Scene generation       World Labs API                              Text/image/video → 3D world + collider mesh
Object generation      Meshy AI API                                Text → 3D GLB objects
Semantic labeling      Claude API (claude-sonnet-4-6)              Affordances, physics props, joint types
Physics simulation     MuJoCo 3.x                                  Rigid body, contacts, grasping, robot control
Convex decomposition   CoACD                                       Mesh → convex hulls for stable MuJoCo contact
Mesh processing        trimesh                                     Load/convert/normalize GLB/OBJ meshes
Visual rendering       Three.js + @luma-ai/three-gaussian-splats   Render Gaussian splat
Pose bridge            Python websockets + JS                      Stream body transforms from MuJoCo → renderer
Robot models           MuJoCo Menagerie                            Franka, UR5, Spot, Leap Hand

Installation

git clone https://github.com/yourname/worldsim
cd worldsim

# Python deps
pip install -r requirements.txt
# Key: mujoco trimesh coacd requests anthropic websockets numpy pillow open3d

# Node deps (for splat renderer)
cd renderer/splat_renderer
npm install

# Copy and fill in API keys
cp .env.example .env

.env:

WORLDLABS_API_KEY=wlt_xxxxxxxxxxxxxxxxxxxxx
MESHY_API_KEY=msy_xxxxxxxxxxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxxxxxxxxxx

Quick Start

# Generate a full scene from a photo
python -m pipeline.ingest \
  --input-type image \
  --source my_room.jpg \
  --objects "wooden chair, coffee mug, small table lamp, cardboard box" \
  --robot franka_panda \
  --output scene_output/

# Launch MuJoCo viewer
python -m mujoco.viewer --mjcf=scene_output/scene.xml

# Launch visual renderer (splat + synced objects)
cd renderer/splat_renderer && npm start

Design Decisions & Tradeoffs

Why not use the Gaussian splat for physics?

Gaussian splats have no hard surface boundaries; they are clouds of soft, overlapping Gaussian primitives. MuJoCo contact resolution requires a watertight triangle mesh. The World Labs collider GLB is purpose-built for this.

Why CoACD over a single convex hull?

A single convex hull wraps the entire object in one hull (imagine a shrink-wrap around a chair — it fills the space between the legs). CoACD produces multiple hulls that approximate the actual shape, giving correct contact behavior when objects are stacked, nested, or have concavities.

Why dual-mesh (visual + collision)?

Raw Meshy GLBs are high-poly (50k–300k triangles) and concave. MuJoCo slows down significantly with high-poly collision geoms and produces incorrect contacts on concave surfaces. Separating visual from collision lets you have photorealistic appearance at near-zero physics cost.

Why Claude for semantic labeling?

Physics properties (mass, friction, joint type, affordances) are fundamentally semantic. A "ceramic mug" should have different mass and friction than a "metal kettle" even if their meshes are similar. Claude can reason about material class, object function, and robot interaction affordances from a name + thumbnail — no training required.

Why not just use Isaac Sim or Genesis?

Both are valid alternatives. WorldSim's advantage is the World Labs → photorealistic novel environments path: you can generate a scene from a single photo of any real space, not just procedural or pre-built environments. Isaac Sim has better out-of-box robot libraries; Genesis has faster parallel simulation. WorldSim is optimized for world reconstruction fidelity and open API composability.


Roadmap

  • Articulated object support — detect and model doors, drawers with correct hinge/slider joints
  • Object detection from scene — use a vision model to auto-extract objects from the World Labs thumbnail, rather than requiring manual prompt list
  • RL training loop — parallel MuJoCo environments, reward shaping for manipulation tasks
  • Real-to-sim transfer — depth camera / LiDAR scan → mesh → WorldSim scene
  • Deformable objects — soft-body support for cloth, foam, food items
  • Tactile sensors — per-geom contact force readout for gripper feedback
  • Scene editing UI — drag-and-drop object placement, physics property tweaking
  • Genesis backend — swap MuJoCo for Genesis for massively parallel GPU sim

License

MIT — see LICENSE.


Citation

If you use WorldSim in research:

@software{worldsim2026,
  title   = {WorldSim: Multimodal World-to-Robot-Interaction Engine},
  year    = {2026},
  url     = {https://github.com/yourname/worldsim}
}
