SimScaleAI

End-to-end robotics AI training and simulation platform — from physics simulation to foundation model training, reinforcement learning, and deployment evaluation.

Built to demonstrate the full stack of robotic AI infrastructure: simulation environments, distributed training pipelines, foundation model architectures (VLM/VLA/BC), RL training, synthetic data generation, and experiment tooling.

Architecture

┌──────────────────────────────────────────────────────────┐
│                      SimScaleAI CLI                      │
│        simscale train | eval | datagen | rl | viz        │
├──────────────┬────────────────┬──────────────────────────┤
│  Simulation  │    Training    │          Models          │
│  (MuJoCo 3)  │ Infrastructure │   (BC, VLA, Diffusion)   │
│              │ (PyTorch DDP)  │                          │
│  • Reach     │  • Distributed │  • Behavior Cloning      │
│  • PickPlace │  • AMP/FSDP    │  • Vision-Language-Action│
│  • Juggle    │  • Checkpoint  │  • Diffusion Policy Head │
│  • ClothFold │  • WandB log   │  • Model Registry        │
│  • Humanoid  │  • VLA train   │                          │
│    Walk      │                │                          │
│  • Domain    │                │                          │
│    Randomize │                │                          │
├──────────────┴────────────────┼──────────────────────────┤
│      RL Pipeline              │   Synthetic Data Gen     │
│  • PPO Agent                  │  • Domain randomization  │
│  • GAE Advantages             │  • Multi-modal capture   │
│  • Closed-loop eval           │  • Language instructions │
│  • Reward function library    │  • HDF5 export           │
└───────────────────────────────┴──────────────────────────┘

Quick Start

Installation

# Clone
git clone https://github.com/rk-edge/SimScaleAI.git
cd SimScaleAI

# Install (with all optional dependencies)
pip install -e ".[all]"

# Or minimal install
pip install -e .

Try It Out

# List available environments and models
simscale list-envs
simscale list-models

# Generate synthetic training data
simscale datagen --env-name reach --n-episodes 100 --output data/reach.h5

# Train a Behavior Cloning model
simscale train --model bc --dataset data/reach.h5 --max-steps 1000

# Train a VLA model (with dummy data)
simscale train --model vla --max-steps 500

# Evaluate a checkpoint in simulation
simscale eval checkpoints/final.pt --env-name reach --n-episodes 20

# Train an RL agent (PPO)
simscale rl --env-name reach --total-steps 50000

Python API

from simscaleai.sim import make_env
from simscaleai.models import ModelRegistry
from simscaleai.rl.agents.ppo import PPOAgent

# Create simulation environment
env = make_env("reach", render_mode="human")
obs, info = env.reset()

# Step through the environment
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)

# Create a model
model = ModelRegistry.create("vla", image_size=128, action_dim=4)

# Train RL agent
agent = PPOAgent(obs_dim=20, action_dim=4)
agent.train(env)

Project Structure

SimScaleAI/
├── simscaleai/
│   ├── sim/                    # Simulation environments
│   │   ├── base_env.py         # Abstract MuJoCo environment
│   │   ├── factory.py          # Environment registry & factory
│   │   ├── domain_randomization.py  # DR config & pipeline
│   │   ├── assets/             # MJCF robot/scene files (auto-generated)
│   │   └── envs/
│   │       ├── reach_env.py    # Reach task (move EE to target)
│   │       ├── pick_place_env.py # Pick-and-place manipulation
│   │       ├── juggle_env.py   # 3-ball juggling
│   │       ├── cloth_fold_env.py # Deformable cloth folding
│   │       └── humanoid_walk_env.py # Bipedal locomotion + curriculum
│   ├── training/               # ML training infrastructure
│   │   ├── trainer.py          # Distributed training loop (DDP/AMP)
│   │   ├── train_vla.py        # Language-conditioned VLA pipeline
│   │   └── data/
│   │       └── dataset.py      # HDF5 trajectory datasets
│   ├── models/                 # Foundation model architectures
│   │   ├── registry.py         # Model registry (@register_model)
│   │   ├── bc.py               # Behavior Cloning (imitation learning)
│   │   ├── vla.py              # Vision-Language-Action model
│   │   └── policy_heads/
│   │       ├── mlp_head.py     # Standard MLP action head
│   │       └── diffusion_head.py # Diffusion Policy action head
│   ├── rl/                     # Reinforcement learning
│   │   ├── evaluator.py        # Closed-loop simulation evaluation
│   │   ├── agents/
│   │   │   └── ppo.py          # PPO with GAE
│   │   └── rewards/
│   │       └── rewards.py      # Composable reward functions
│   ├── datagen/                # Synthetic data generation
│   │   ├── generator.py        # Single-process dataset pipeline
│   │   └── parallel_generator.py # Multi-worker scalable generation
│   ├── eval/                   # Evaluation & benchmarking
│   │   └── transfer_benchmark.py # Sim-to-real transfer matrix
│   └── tools/
│       └── cli.py              # Typer CLI entry point
├── tests/                      # Unit & integration tests
├── .github/workflows/ci.yml    # CI pipeline
├── pyproject.toml              # Package config & dependencies
└── README.md

Visualization

SimScaleAI includes built-in visualization tools accessible via CLI and Python API. See docs/visualization.md for the full guide.

Simulation Environment Rendering

Grid view of a reach-task rollout with the Franka Panda arm:

simscale viz-env --env-name reach --n-steps 20 --save env_grid.png

Environment Grid

Camera Modalities

RGB, depth, and segmentation outputs from the wrist camera:

simscale viz-cameras --env-name reach --save cameras.png

Camera Modalities

Dataset Statistics

Episode length, reward distributions, and per-dimension action histograms from an HDF5 dataset:

simscale viz-dataset data/reach.h5 --save dataset_stats.png

Dataset Statistics

Trajectory Timeline

Single episode breakdown — observations, actions, and rewards over time:

simscale viz-trajectory data/reach.h5 --episode 0 --save trajectory.png

Trajectory

Training Metrics

Loss curves with smoothed overlays from BC/VLA training:

Training Metrics

RL Training Progress

PPO reward curves, episode lengths, and policy/value losses:

RL Training


Juggle Environment — Evaluation Results

A 3-ball juggling task using the Franka Panda arm with a flat paddle. Three policies were trained and evaluated over 20 episodes each:

Juggle Environment

Juggle Trajectory

| Metric | Scripted (Expert) | BC (Imitation) | PPO (RL) |
|---|---|---|---|
| Mean Reward | 95.9 ± 61.3 | 95.7 ± 61.0 | 94.9 ± 61.2 |
| Mean Episode Length | 53.4 ± 62.7 | 53.4 ± 62.7 | 52.3 ± 63.1 |
| Max Balls Airborne | 3 | 3 | 3 |
| Best Episode Reward | 279.9 | 278.8 | 278.8 |
| Worst Episode Reward | 63.0 | 62.9 | 62.9 |

Takeaways:

  • BC nearly matches the expert — trained on only 200 demonstration episodes (loss 0.38 → 0.06).
  • PPO is competitive with 50K timesteps; more training would likely surpass imitation.
  • All policies achieve 3 balls airborne simultaneously in their best episodes.

Pick-and-Place — Full Pipeline Results

The core manipulation benchmark: Franka Panda picks a 3cm red cube and places it at a randomized green target, using damped pseudoinverse IK and a kinematic grasp lock.

Pick-and-Place Multi-View

Pick-and-Place Rollout

Training Summary

| Model | Data | Steps | Loss (start → end) |
|---|---|---|---|
| BC | 200 eps, 37.8K steps | 3,000 | 0.385 → 0.010 |
| BC-DR | 200 eps, 58.8K steps (domain-randomized) | 3,000 | 0.396 → 0.009 |
| PPO | Online (50K env steps) | 50,000 | reward −43.4 → −6.6 |
| VLA | 37.8K steps + language instructions | 2,000 | 0.300 → 0.064 |

Evaluation (50 episodes each)

| Policy | Reward | Success | Avg Length |
|---|---|---|---|
| Scripted (expert) | 145.1 ± 148.9 | 20.0% | 261 |
| BC (imitation) | 55.6 ± 76.0 | 0.0% | 300 |
| BC-DR (domain-randomized, eval on DR env) | −35.4 ± 99.3 | 0.0% | 300 |
| PPO (RL, 50K steps) | −7.6 ± 40.1 | 0.0% | 300 |
| VLA (language-conditioned) | −46.7 ± 50.0 | 0.0% | 300 |

Pick-and-Place Reward Curves

Pick-and-Place Camera Modalities

Takeaways:

  • Scripted policy achieves 20% success — pick‑and‑place is significantly harder than reaching, requiring precise multi‑phase coordination (approach → descend → grasp → lift → transport → place).
  • BC captures the motion pattern (positive reward) but hasn't generalized grasping from 200 demos — more data and longer training would improve this.
  • PPO learns to avoid penalties (reward near 0) but hasn't discovered the full grasp→lift sequence in 50K steps — contact‑rich manipulation typically needs millions of steps.
  • VLA demonstrates language‑conditioned action prediction — a 1.4M parameter model that fuses vision + language + state through a transformer to output actions.
  • Domain randomization systematically varies physics (friction, mass, damping, gains), geometry (object size), and visuals (lighting, camera, materials) for sim‑to‑real transfer.

Sim‑to‑Real Transfer Benchmark

Systematic evaluation of how policies trained under different conditions transfer to unseen environment variations. Tests robustness across 4 eval conditions (Clean → Heavy DR) with per‑parameter sensitivity ablation.

Transfer Matrix (Policy × Eval Condition)

| Policy | Clean | Light DR | Default DR | Heavy DR |
|---|---|---|---|---|
| Scripted | 121.1 ± 131.4 | 158.4 ± 229.0 | 162.7 ± 269.6 | −60.6 ± 128.5 |
| BC | 50.8 ± 126.7 | 108.0 ± 56.2 | −88.8 ± 79.7 | −123.5 ± 57.7 |
| BC-DR | 276.9 ± 137.8 | 282.8 ± 127.7 | −8.1 ± 108.0 | −104.9 ± 117.9 |
| PPO | 3.4 ± 34.7 | 4.6 ± 48.8 | −43.3 ± 64.8 | −121.5 ± 64.4 |

Per‑Parameter Sensitivity (Reward Drop)

| Parameter | Scripted Drop | BC-DR Drop |
|---|---|---|
| Gains (kp) | +145.2 (most sensitive) | +329.6 (most sensitive) |
| Obj Size | −9.0 | +76.1 |
| Damping | −28.4 | +63.3 |
| Friction | −84.6 | +47.3 |
| Mass | −18.7 | +33.5 |
| Gravity | −7.2 | +34.4 |
| Lighting | +1.8 (least sensitive) | +33.0 |

Key Findings:

  • BC‑DR dominates on clean + light DR — training with DR actually improves clean performance (276.9 vs 50.8 for vanilla BC), showing DR acts as regularization.
  • Actuator gains are the most critical parameter — randomizing kp alone costs the Scripted policy about 145 reward and BC-DR about 330. This suggests actuator calibration is the #1 priority for sim‑to‑real transfer.
  • Lighting has near-zero impact on state-based policies (expected — no images in the observation).
  • Heavy DR breaks all policies — extreme randomization (mass 0.3–3×, friction 0.4–2×, gains 0.5–2×) exceeds what any policy trained with default ranges can handle.
  • DR improves generalization — BC‑DR transfers better than vanilla BC across every condition.

Transfer Heatmap

Ablation Sensitivity — Scripted

Ablation Sensitivity — BC-DR


Deformable Object Manipulation: Cloth Folding

Genuinely frontier research territory: autonomous cloth folding with learned policies on physically-accurate deformable simulation.

Cloth Fold Views

Physics

Uses MuJoCo 3.x <flexcomp> for real-time FEM cloth simulation:

  • 8×8 vertex grid (64 vertices, 192 DOFs) with edge damping + self-collision
  • Kinematic grasp lock: cloth edge vertices attached to end-effector during manipulation
  • Body-frame ↔ world coordinate mapping for accurate vertex kinematics

Task

Pick up one edge of a 17.5cm × 17.5cm cloth and fold it onto the opposite edge:

| Stage | Description |
|---|---|
| 1. Approach | Move EE above the far edge of the cloth |
| 2. Grasp | Close gripper to lock 8 edge vertices to EE |
| 3. Lift | Lift edge slightly above the table |
| 4. Fold | Sweep grasped edge toward the target edge (−X direction) |
| 5. Release | Open gripper — cloth should remain folded |

Cloth Fold Scripted Sequence

Cloth Deformation Sequence

Results

| Metric | Scripted Expert | BC (learned) |
|---|---|---|
| Success rate | 100% | 100% |
| Steps to fold | 77 | 372 |
| Final fold distance | 0.028 m | 0.010 m |
| Mean reward | 129.6 | 615.6 |

Key insight: BC successfully learns to fold cloth but takes ~5× longer than the scripted expert. The learned policy starts with cautious movements (action magnitude ≈ 0.11) then accelerates near the goal (≈ 0.40), demonstrating emergent precision—it learns to be careful with deformable objects.

Fold Distance Comparison

Vertex Comparison

Training: 100 expert demos (7,700 timesteps) → 5,000 BC steps on MPS → 222-dim state → 4-dim delta actions.


Humanoid Locomotion — PPO + Curriculum Learning

Bipedal humanoid walking using a custom 18-DOF MJCF model trained from scratch with PPO and automatic curriculum advancement.

Humanoid Model

Humanoid Model

Custom-authored MuJoCo MJCF with:

  • 18 actuated joints: 3-DOF hips, 1-DOF knees, 2-DOF ankles, 2-DOF shoulders, 1-DOF elbows (× 2 limbs)
  • Free-floating torso (6-DOF freejoint) — total 25 qpos, 24 qvel
  • Torque-controlled motors (gear ratio 100, ctrl range [-1, 1])
  • Sensors: foot contact (touch), torso IMU (gyro + accelerometer)

Observation Space (49-dim)

| Component | Dimensions | Description |
|---|---|---|
| Torso height | 1 | Center-of-mass z position |
| Torso orientation | 4 | Quaternion (w, x, y, z) |
| Joint positions | 18 | All actuated joint angles |
| Torso linear velocity | 3 | CoM velocity (x, y, z) |
| Torso angular velocity | 3 | Gyroscope reading |
| Joint velocities | 18 | Actuated joint angular velocities |
| Foot contacts | 2 | Binary ground-contact flags |
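
Concretely, the 49 dimensions concatenate in the table's order. A minimal sketch — the function and argument names are illustrative, not the repo's actual API:

```python
import numpy as np

def build_observation(torso_z, torso_quat, qpos_joints,
                      lin_vel, ang_vel, qvel_joints, foot_contacts):
    """Assemble the 49-dim humanoid observation in table order."""
    obs = np.concatenate([
        [torso_z],       # 1  torso height (CoM z)
        torso_quat,      # 4  orientation quaternion (w, x, y, z)
        qpos_joints,     # 18 actuated joint angles
        lin_vel,         # 3  CoM linear velocity
        ang_vel,         # 3  gyroscope reading
        qvel_joints,     # 18 actuated joint velocities
        foot_contacts,   # 2  binary ground-contact flags
    ]).astype(np.float32)
    assert obs.shape == (49,)
    return obs
```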

Reward Shaping

reward = forward_vel × 1.25 × stage_scale
       + alive_bonus (5.0/step)
       + height_bonus (2.0 × min(z/1.3, 1))
       − energy_cost (0.01 × |ctrl·vel|)
       − ctrl_cost (0.001 × |ctrl|²)
       + fall_penalty (−100 on termination)
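
A runnable sketch of this shaped reward using the coefficients above; the exact term definitions (e.g. how |ctrl·vel| is reduced over joints) are assumptions, not the repo's code:

```python
import numpy as np

def humanoid_reward(forward_vel, torso_z, ctrl, joint_vel,
                    stage_scale=1.0, terminated=False):
    """Shaped locomotion reward; coefficients from the pseudocode above."""
    reward = forward_vel * 1.25 * stage_scale          # forward progress
    reward += 5.0                                      # alive bonus per step
    reward += 2.0 * min(torso_z / 1.3, 1.0)            # capped height bonus
    reward -= 0.01 * np.abs(ctrl * joint_vel).sum()    # energy cost |ctrl·vel|
    reward -= 0.001 * np.square(ctrl).sum()            # quadratic control cost
    if terminated:
        reward -= 100.0                                # fall penalty
    return float(reward)
```

The large per-step alive bonus relative to the energy and control costs is what makes staying upright strictly better than falling early.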

Curriculum Learning

Automatic stage progression based on 20-episode rolling reward average:

| Stage | Objective | Forward Reward Scale | External Forces | Advancement Threshold |
|---|---|---|---|---|
| 0 — Stand | Learn to stay upright | 0.1× | None | avg reward ≥ 40 |
| 1 — Walk | Walk forward | 1.0× | None | avg reward ≥ 120 |
| 2 — Robust | Walk under perturbation | 1.0× | Random pushes every 100 steps | — (final stage) |
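
The advancement rule can be sketched in a few lines — the window size and thresholds come from the table; the class itself is illustrative, not the repo's implementation:

```python
from collections import deque

STAGE_THRESHOLDS = {0: 40.0, 1: 120.0}   # stage -> avg reward to advance

class Curriculum:
    """Advance when the rolling reward average clears the stage threshold."""
    def __init__(self, window=20):
        self.stage = 0
        self.rewards = deque(maxlen=window)

    def record_episode(self, episode_reward):
        self.rewards.append(episode_reward)
        threshold = STAGE_THRESHOLDS.get(self.stage)   # None at final stage
        if (threshold is not None
                and len(self.rewards) == self.rewards.maxlen
                and sum(self.rewards) / len(self.rewards) >= threshold):
            self.stage += 1          # stand -> walk -> robust
            self.rewards.clear()     # re-accumulate under the new stage
        return self.stage
```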

Training Results (1M steps, CPU, 563s)

| Metric | Value |
|---|---|
| Training FPS | 1,778 steps/s |
| Curriculum 0→1 | Advanced at step 174K (avg reward 43.3) |
| Eval reward | 73.6 ± 37.7 |
| Eval episode length | 34 steps (0.68s upright) |
| Random baseline | 17 steps (0.34s) — 2× improvement |
| Max episode length | 44 steps |

Training curves:

Humanoid Training Curves

Trained policy rollout:

Humanoid Rollout

Key insights:

  • Curriculum works: Agent learns Stage 0 (standing) in 174K steps, then transitions to Stage 1 (walking) automatically.
  • Reward shaping is critical: Large alive bonus (5.0/step) + fall penalty (−100) prevents "die-fast" degenerate strategies common in locomotion RL.
  • CPU > MPS for small models: 1,778 FPS on CPU vs 223 FPS on MPS — the MPS kernel launch overhead dominates for lightweight networks (49→256→256→18).
  • Humanoid locomotion is hard: Even with 1M steps, the agent balances for ~0.7s. Production humanoid controllers (DeepMind, Agility) use 10B+ steps with massively parallel GPU simulation.

# Train humanoid locomotion
python -m scripts.train_humanoid_walk --total-steps 1000000 --device cpu

# Evaluate
python -m scripts.train_humanoid_walk --eval-only

Key Features

1. Physics Simulation (MuJoCo)

  • Gymnasium-compatible environments for Franka Panda arm and humanoid robot
  • Reach, pick-and-place, 3-ball juggling, cloth folding (deformable), and bipedal humanoid walking
  • MuJoCo 3.x <flexcomp> for real-time FEM cloth simulation (64 vertices, 192 DOFs)
  • Custom 18-DOF humanoid MJCF with torque control, foot contact sensors, and IMU
  • Damped pseudoinverse IK with kinematic grasp lock
  • Multi-camera rendering (RGB, depth, segmentation)
  • Configurable via YAML — swap tasks without code changes
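
The damped pseudoinverse amounts to dq = Jᵀ(JJᵀ + λ²I)⁻¹ dx, where the damping λ keeps the solve well-conditioned near singularities. A minimal NumPy sketch (λ and shapes are assumptions, not the repo's values):

```python
import numpy as np

def damped_pinv_step(J, dx, damping=0.05):
    """One damped-pseudoinverse IK step: dq = J^T (J J^T + λ² I)^{-1} dx."""
    JJt = J @ J.T
    reg = damping ** 2 * np.eye(JJt.shape[0])   # Tikhonov regularization
    return J.T @ np.linalg.solve(JJt + reg, dx)
```

Iterating small dq steps while re-linearizing J drives the end-effector toward the target pose without blowing up at singular configurations.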

2. Domain Randomization Pipeline

  • Configurable DomainRandomizationConfig with 15+ randomization targets
  • Visual: lighting direction/color, camera pose/FOV, material colors
  • Physics: friction, mass, damping, actuator gains
  • Geometry: object size, table position
  • Dynamics: gravity noise, timestep variation
  • Nominal-value caching for relative randomization
  • Integrated into base environment — enabled with domain_randomization=True
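
One way to implement "nominal-value caching for relative randomization": cache each nominal parameter once, then sample multiplicative scales against it so repeated resets never drift. The class name and ranges below are illustrative, not DomainRandomizationConfig's actual interface:

```python
import numpy as np

class RelativeRandomizer:
    """Sample parameters as multipliers of cached nominal values."""
    def __init__(self, model_params, ranges, seed=0):
        # Cache nominals once, at construction — never mutated afterwards.
        self.nominal = {k: np.asarray(v, dtype=float)
                        for k, v in model_params.items()}
        self.ranges = ranges                  # name -> (low, high) scale
        self.rng = np.random.default_rng(seed)

    def sample(self):
        out = {}
        for name, (lo, hi) in self.ranges.items():
            scale = self.rng.uniform(lo, hi)
            out[name] = self.nominal[name] * scale   # relative to nominal
        return out
```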

3. Distributed Training Infrastructure (PyTorch)

  • PyTorch DDP for multi-GPU distributed training
  • Mixed precision (AMP) with BFloat16/Float16
  • Warmup + cosine decay learning rate schedule
  • Checkpoint save/resume with full optimizer state
  • WandB and TensorBoard logging
  • Config-driven via Hydra-style system
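
The warmup + cosine decay schedule is a standard recipe; a generic sketch, not necessarily the repo's exact implementation:

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps      # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return min_lr + (base_lr - min_lr) * cosine
```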

4. Foundation Model Architectures

  • Behavior Cloning (BC): State and image-conditioned imitation learning
  • Vision-Language-Action (VLA): 1.4M parameter transformer — ViT encoder + char-level language encoder + fusion transformer → robot actions, conditioned on natural language instructions (inspired by RT-2/OpenVLA)
  • Diffusion Policy Head: Denoising diffusion for multi-modal action distributions
  • Model Registry: Add new architectures with @register_model("name")

5. Reinforcement Learning

  • PPO agent with Generalized Advantage Estimation (GAE)
  • Curriculum learning: automatic stage progression (stand → walk → robust)
  • Reward shaping library with alive bonus, energy cost, fall penalty
  • Closed-loop evaluation (model controls robot in real-time)
  • Composable reward function library
  • Vectorized environment support
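
GAE blends TD residuals with exponentially decaying weights (γλ); a textbook sketch with illustrative hyperparameters:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    `values` has len(rewards) + 1 entries (bootstrap value appended).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        gae = delta + gamma * lam * nonterminal * gae
        adv[t] = gae
    return adv
```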

6. Scalable Data Generation

  • Parallel workers: N processes × 1 MuJoCo env each, near-linear scaling
  • Sharded HDF5: Each worker writes its own shard, optional merge
  • Resume: Skip completed shards on re-run (fault-tolerant)
  • Domain randomization for diverse training data
  • Configurable compression (gzip levels 1-9)
  • Generation config saved as JSON for reproducibility

Architecture:

  Coordinator (main process)
      │
      ├── Worker 0  ──►  shard_00000.h5   (episodes 0–249)
      ├── Worker 1  ──►  shard_00001.h5   (episodes 250–499)
      ├── Worker 2  ──►  shard_00002.h5   (episodes 500–749)
      └── Worker 3  ──►  shard_00003.h5   (episodes 750–999)
                                │
                         merge (optional)
                                │
                          dataset.h5

Each worker runs its own MuJoCo physics instance — no shared memory, no GIL contention, no I/O locks. Episodes are divided evenly with remainders distributed round-robin. Shards are self-contained HDF5 files marked complete=True on finish, enabling fault-tolerant resume.
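
The even split with round-robin remainders can be sketched as a pure function (a hypothetical helper, not the actual parallel_generator code):

```python
def shard_ranges(n_episodes, n_workers):
    """Half-open [start, end) episode ranges, one per worker shard."""
    base, rem = divmod(n_episodes, n_workers)
    ranges, start = [], 0
    for w in range(n_workers):
        count = base + (1 if w < rem else 0)   # round-robin the remainder
        ranges.append((start, start + count))
        start += count
    return ranges

# Matches the diagram above: shard_ranges(1000, 4)
# -> [(0, 250), (250, 500), (500, 750), (750, 1000)]
```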

Scaling Benchmark (200 episodes, pick-and-place, Mac Mini M2):

| Workers | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 13.7s | 14.6 ep/s | 1.0x |
| 4 | 3.9s | 51.8 ep/s | 3.5x |
| 8 | 2.8s | 72.6 ep/s | 4.9x |

# Parallel data generation (auto-detects CPU count)
python -m simscaleai.datagen.parallel_generator \
    --env pick_place --episodes 10000 --workers 8 \
    --output-dir data/pick_place_10k --policy scripted

7. CLI Tooling

  • simscale train — launch any training experiment
  • simscale eval — closed-loop checkpoint evaluation
  • simscale datagen — generate datasets (single-process)
  • python -m simscaleai.datagen.parallel_generator — scalable parallel generation
  • simscale rl — RL agent training
  • simscale list-envs / list-models — discover components
  • simscale viz-env / viz-cameras / viz-dataset / viz-trajectory / viz-live — visualization

Configuration

All models have debug configs that run on CPU/MPS and full configs for GPU:

# Debug (runs on your Mac)
model = ModelRegistry.create("vla",
    image_size=64, embed_dim=64, num_heads=2, num_layers=2
)

# Full scale (for cloud GPU)
model = ModelRegistry.create("vla",
    image_size=224, embed_dim=1024, num_heads=16, num_layers=24
)

Running Tests

# All tests
pytest tests/ -v

# Skip slow tests
pytest tests/ -v -m "not slow"

# With coverage
pytest tests/ --cov=simscaleai --cov-report=term-missing

Tech Stack

| Component | Technology |
|---|---|
| Physics Simulation | MuJoCo 3.x |
| ML Framework | PyTorch 2.x |
| Environment Interface | Gymnasium |
| Experiment Config | Hydra / OmegaConf |
| Logging | WandB / TensorBoard |
| Data Format | HDF5 (h5py) |
| CLI | Typer + Rich |
| Testing | pytest |
| Linting | Ruff |
| CI/CD | GitHub Actions |

License

Apache 2.0 — see LICENSE.
