End-to-end robotics AI training and simulation platform — from physics simulation to foundation model training, reinforcement learning, and deployment evaluation.
Built to demonstrate the full stack of robotic AI infrastructure: simulation environments, distributed training pipelines, foundation model architectures (VLM/VLA/BC), RL training, synthetic data generation, and experiment tooling.
┌──────────────────────────────────────────────────────────┐
│ SimScaleAI CLI │
│ simscale train | eval | datagen | rl | viz │
├──────────────┬───────────────┬───────────────────────────┤
│ Simulation │ Training │ Models │
│ (MuJoCo 3) │ Infrastructure│ (BC, VLA, Diffusion) │
│ │ (PyTorch DDP) │ │
│ • Reach │ • Distributed │ • Behavior Cloning │
│ • PickPlace │ • AMP/FSDP │ • Vision-Language-Action │
│ • Juggle │ • Checkpoint │ • Diffusion Policy Head │
│ • ClothFold │ • WandB log │ • Model Registry │
│ • Humanoid │ • VLA train │ │
│ Walk │ │ │
│ • Domain │ │ │
│ Randomize │ │ │
├──────────────┼───────────────┼───────────────────────────┤
│ RL Pipeline │ Synthetic Data Gen │
│ • PPO Agent │ • Domain randomization │
│ • GAE Advantages │ • Multi-modal capture │
│ • Closed-loop eval │ • Language instructions │
│ • Reward function library │ • HDF5 export │
└──────────────────────────────┴───────────────────────────┘
# Clone
git clone https://github.com/rk-edge/SimScaleAI.git
cd SimScaleAI
# Install (with all optional dependencies)
pip install -e ".[all]"
# Or minimal install
pip install -e .

# List available environments and models
simscale list-envs
simscale list-models
# Generate synthetic training data
simscale datagen --env-name reach --n-episodes 100 --output data/reach.h5
# Train a Behavior Cloning model
simscale train --model bc --dataset data/reach.h5 --max-steps 1000
# Train a VLA model (with dummy data)
simscale train --model vla --max-steps 500
# Evaluate a checkpoint in simulation
simscale eval checkpoints/final.pt --env-name reach --n-episodes 20
# Train an RL agent (PPO)
simscale rl --env-name reach --total-steps 50000

from simscaleai.sim import make_env
from simscaleai.models import ModelRegistry
from simscaleai.rl.agents.ppo import PPOAgent
# Create simulation environment
env = make_env("reach", render_mode="human")
obs, info = env.reset()
# Step through the environment
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
# Create a model
model = ModelRegistry.create("vla", image_size=128, action_dim=4)
# Train RL agent
agent = PPOAgent(obs_dim=20, action_dim=4)
agent.train(env)

SimScaleAI/
├── simscaleai/
│ ├── sim/ # Simulation environments
│ │ ├── base_env.py # Abstract MuJoCo environment
│ │ ├── factory.py # Environment registry & factory
│ │ ├── domain_randomization.py # DR config & pipeline
│ │ ├── assets/ # MJCF robot/scene files (auto-generated)
│ │ └── envs/
│ │ ├── reach_env.py # Reach task (move EE to target)
│ │ ├── pick_place_env.py # Pick-and-place manipulation
│ │ ├── juggle_env.py # 3-ball juggling
│ │ ├── cloth_fold_env.py # Deformable cloth folding
│ │ └── humanoid_walk_env.py # Bipedal locomotion + curriculum
│ ├── training/ # ML training infrastructure
│ │ ├── trainer.py # Distributed training loop (DDP/AMP)
│ │ ├── train_vla.py # Language-conditioned VLA pipeline
│ │ └── data/
│ │ └── dataset.py # HDF5 trajectory datasets
│ ├── models/ # Foundation model architectures
│ │ ├── registry.py # Model registry (@register_model)
│ │ ├── bc.py # Behavior Cloning (imitation learning)
│ │ ├── vla.py # Vision-Language-Action model
│ │ └── policy_heads/
│ │ ├── mlp_head.py # Standard MLP action head
│ │ └── diffusion_head.py # Diffusion Policy action head
│ ├── rl/ # Reinforcement learning
│ │ ├── evaluator.py # Closed-loop simulation evaluation
│ │ ├── agents/
│ │ │ └── ppo.py # PPO with GAE
│ │ └── rewards/
│ │ └── rewards.py # Composable reward functions
│ ├── datagen/ # Synthetic data generation
│ │ ├── generator.py # Single-process dataset pipeline
│ │ └── parallel_generator.py # Multi-worker scalable generation
│ ├── eval/ # Evaluation & benchmarking
│ │ └── transfer_benchmark.py # Sim-to-real transfer matrix
│ └── tools/
│ └── cli.py # Typer CLI entry point
├── tests/ # Unit & integration tests
├── .github/workflows/ci.yml # CI pipeline
├── pyproject.toml # Package config & dependencies
└── README.md
SimScaleAI includes built-in visualization tools accessible via CLI and Python API. See docs/visualization.md for the full guide.
Grid view of a reach-task rollout with the Franka Panda arm:
simscale viz-env --env-name reach --n-steps 20 --save env_grid.png

RGB, depth, and segmentation outputs from the wrist camera:

simscale viz-cameras --env-name reach --save cameras.png

Episode length, reward distributions, and per-dimension action histograms from an HDF5 dataset:

simscale viz-dataset data/reach.h5 --save dataset_stats.png

Single episode breakdown — observations, actions, and rewards over time:

simscale viz-trajectory data/reach.h5 --episode 0 --save trajectory.png

Loss curves with smoothed overlays from BC/VLA training:
PPO reward curves, episode lengths, and policy/value losses:
A 3-ball juggling task using the Franka Panda arm with a flat paddle. Three policies were trained and evaluated over 20 episodes each:
| Metric | Scripted (Expert) | BC (Imitation) | PPO (RL) |
|---|---|---|---|
| Mean Reward | 95.9 ± 61.3 | 95.7 ± 61.0 | 94.9 ± 61.2 |
| Mean Episode Length | 53.4 ± 62.7 | 53.4 ± 62.7 | 52.3 ± 63.1 |
| Max Balls Airborne | 3 | 3 | 3 |
| Best Episode Reward | 279.9 | 278.8 | 278.8 |
| Worst Episode Reward | 63.0 | 62.9 | 62.9 |
Takeaways:
- BC nearly matches the expert — trained on only 200 demonstration episodes (loss 0.38 → 0.06).
- PPO is competitive with 50K timesteps; more training would likely surpass imitation.
- All policies achieve 3 balls airborne simultaneously in their best episodes.
The core manipulation benchmark: Franka Panda picks a 3cm red cube and places it at a randomized green target, using damped pseudoinverse IK and a kinematic grasp lock.
| Model | Data | Steps | Loss (start → end) |
|---|---|---|---|
| BC | 200 eps, 37.8K steps | 3,000 | 0.385 → 0.010 |
| BC-DR | 200 eps, 58.8K steps (domain‑randomized) | 3,000 | 0.396 → 0.009 |
| PPO | Online (50K env steps) | 50,000 | reward −43.4 → −6.6 |
| VLA | 37.8K steps + language instructions | 2,000 | 0.300 → 0.064 |
| Policy | Reward | Success | Avg Length |
|---|---|---|---|
| Scripted (expert) | 145.1 ± 148.9 | 20.0% | 261 |
| BC (imitation) | 55.6 ± 76.0 | 0.0% | 300 |
| BC‑DR (domain‑randomized, eval on DR env) | −35.4 ± 99.3 | 0.0% | 300 |
| PPO (RL, 50K steps) | −7.6 ± 40.1 | 0.0% | 300 |
| VLA (language‑conditioned) | −46.7 ± 50.0 | 0.0% | 300 |
Takeaways:
- Scripted policy achieves 20% success — pick‑and‑place is significantly harder than reaching, requiring precise multi‑phase coordination (approach → descend → grasp → lift → transport → place).
- BC captures the motion pattern (positive reward) but hasn't generalized grasping from 200 demos — more data and longer training would improve this.
- PPO learns to avoid penalties (reward near 0) but hasn't discovered the full grasp→lift sequence in 50K steps — contact‑rich manipulation typically needs millions of steps.
- VLA demonstrates language‑conditioned action prediction — a 1.4M parameter model that fuses vision + language + state through a transformer to output actions.
- Domain randomization systematically varies physics (friction, mass, damping, gains), geometry (object size), and visuals (lighting, camera, materials) for sim‑to‑real transfer.
Systematic evaluation of how policies trained under different conditions transfer to unseen environment variations. Tests robustness across 4 eval conditions (Clean → Heavy DR) with per‑parameter sensitivity ablation.
| Policy | Clean | Light DR | Default DR | Heavy DR |
|---|---|---|---|---|
| Scripted | 121.1 ± 131.4 | 158.4 ± 229.0 | 162.7 ± 269.6 | −60.6 ± 128.5 |
| BC | 50.8 ± 126.7 | 108.0 ± 56.2 | −88.8 ± 79.7 | −123.5 ± 57.7 |
| BC‑DR | 276.9 ± 137.8 | 282.8 ± 127.7 | −8.1 ± 108.0 | −104.9 ± 117.9 |
| PPO | 3.4 ± 34.7 | 4.6 ± 48.8 | −43.3 ± 64.8 | −121.5 ± 64.4 |
| Parameter | Scripted Drop | BC‑DR Drop |
|---|---|---|
| Gains (kp) | +145.2 (most sensitive) | +329.6 (most sensitive) |
| Obj Size | −9.0 | +76.1 |
| Damping | −28.4 | +63.3 |
| Friction | −84.6 | +47.3 |
| Mass | −18.7 | +33.5 |
| Gravity | −7.2 | +34.4 |
| Lighting | +1.8 (least sensitive) | +33.0 |
Key Findings:
- BC‑DR dominates on clean + light DR — training with DR actually improves clean performance (276.9 vs 50.8 for vanilla BC), showing DR acts as regularization.
- Actuator gains are the most critical parameter — randomizing `kp` alone drops Scripted reward by +145 and BC‑DR by +330. This suggests actuator calibration is the #1 priority for sim‑to‑real transfer.
- Lighting has near-zero impact on state-based policies (expected — no images in the observation).
- Heavy DR breaks all policies — extreme randomization (mass 0.3–3×, friction 0.4–2×, gains 0.5–2×) exceeds what any policy trained with default ranges can handle.
- DR improves generalization — BC‑DR transfers better than vanilla BC across every condition.
Genuinely frontier research territory: autonomous cloth folding with learned policies on physically-accurate deformable simulation.
Uses MuJoCo 3.x `<flexcomp>` for real-time FEM cloth simulation:
- 8×8 vertex grid (64 vertices, 192 DOFs) with edge damping + self-collision
- Kinematic grasp lock: cloth edge vertices attached to end-effector during manipulation
- Body-frame ↔ world coordinate mapping for accurate vertex kinematics
Pick up one edge of a 17.5cm × 17.5cm cloth and fold it onto the opposite edge:
| Stage | Description |
|---|---|
| 1. Approach | Move EE above the far edge of the cloth |
| 2. Grasp | Close gripper to lock 8 edge vertices to EE |
| 3. Lift | Lift edge slightly above the table |
| 4. Fold | Sweep grasped edge toward the target edge (−X direction) |
| 5. Release | Open gripper — cloth should remain folded |
| Metric | Scripted Expert | BC (learned) |
|---|---|---|
| Success rate | 100% | 100% |
| Steps to fold | 77 | 372 |
| Final fold distance | 0.028m | 0.010m |
| Mean reward | 129.6 | 615.6 |
Key insight: BC successfully learns to fold cloth but takes ~5× longer than the scripted expert. The learned policy starts with cautious movements (action magnitude ≈ 0.11) then accelerates near the goal (≈ 0.40), demonstrating emergent precision—it learns to be careful with deformable objects.
Training: 100 expert demos (7,700 timesteps) → 5,000 BC steps on MPS → 222-dim state → 4-dim delta actions.
Bipedal humanoid walking using a custom 21-DOF MJCF model trained from scratch with PPO and automatic curriculum advancement.
Custom-authored MuJoCo MJCF with:
- 21 actuated joints: 3-DOF hips, 1-DOF knees, 2-DOF ankles, 2-DOF shoulders, 1-DOF elbows (× 2 limbs)
- Free-floating torso (6-DOF freejoint) — total 25 qpos, 24 qvel
- Torque-controlled motors (gear ratio 100, ctrl range [-1, 1])
- Sensors: foot contact (touch), torso IMU (gyro + accelerometer)
| Component | Dimensions | Description |
|---|---|---|
| Torso height | 1 | Center-of-mass z position |
| Torso orientation | 4 | Quaternion (w,x,y,z) |
| Joint positions | 18 | All actuated joint angles |
| Torso linear velocity | 3 | CoM velocity (x,y,z) |
| Torso angular velocity | 3 | Gyroscope reading |
| Joint velocities | 18 | Actuated joint angular velocities |
| Foot contacts | 2 | Binary ground-contact flags |
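The components in the table sum to a 49-dimensional state vector. A minimal sketch of how such an observation could be assembled (field names here are illustrative, not the actual SimScaleAI API):

```python
import numpy as np

def build_humanoid_obs(data):
    """Assemble the 49-dim observation described in the table above.

    `data` is assumed to expose MuJoCo-style readings; the dict keys
    are hypothetical stand-ins for the real environment internals.
    """
    obs = np.concatenate([
        [data["torso_z"]],           # 1  — torso center-of-mass height
        data["torso_quat"],          # 4  — orientation quaternion (w, x, y, z)
        data["joint_pos"],           # 18 — actuated joint angles
        data["torso_linvel"],        # 3  — CoM linear velocity
        data["torso_angvel"],        # 3  — gyroscope reading
        data["joint_vel"],           # 18 — actuated joint velocities
        data["foot_contact"],        # 2  — binary ground-contact flags
    ])
    assert obs.shape == (49,)        # 1 + 4 + 18 + 3 + 3 + 18 + 2 = 49
    return obs
```

This 49-dim vector is what feeds the 49→256→256→18 policy network mentioned in the performance notes.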
reward = forward_vel × 1.25 × stage_scale
+ alive_bonus (5.0/step)
+ height_bonus (2.0 × min(z/1.3, 1))
− energy_cost (0.01 × |ctrl·vel|)
− ctrl_cost (0.001 × |ctrl|²)
+ fall_penalty (−100 on termination)
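The shaped reward above can be written out directly. A sketch using the stated coefficients (function and argument names are illustrative, not the SimScaleAI API):

```python
import numpy as np

def humanoid_reward(forward_vel, torso_z, ctrl, joint_vel, stage_scale, fell):
    """Shaped locomotion reward matching the formula above.

    Coefficients come from the reward definition; the signature is a
    hypothetical stand-in for the environment's internal computation.
    """
    r = forward_vel * 1.25 * stage_scale           # forward progress term
    r += 5.0                                       # alive bonus per step
    r += 2.0 * min(torso_z / 1.3, 1.0)             # height bonus, capped at nominal height
    r -= 0.01 * np.abs(ctrl * joint_vel).sum()     # energy cost |ctrl·vel|
    r -= 0.001 * np.square(ctrl).sum()             # control cost |ctrl|²
    if fell:
        r -= 100.0                                 # fall penalty on termination
    return r
```

The large alive bonus relative to the control costs is what makes "stay upright doing nothing" a viable Stage 0 strategy, as discussed in the curriculum notes below.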
Automatic stage progression based on 20-episode rolling reward average:
| Stage | Objective | Forward Reward Scale | External Forces | Advancement Threshold |
|---|---|---|---|---|
| 0 — Stand | Learn to stay upright | 0.1× | None | avg reward ≥ 40 |
| 1 — Walk | Walk forward | 1.0× | None | avg reward ≥ 120 |
| 2 — Robust | Walk under perturbation | 1.0× | Random pushes every 100 steps | — |
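The stage-advancement logic in the table can be sketched as a small state machine over a rolling 20-episode window (class and attribute names are illustrative, not the SimScaleAI API):

```python
from collections import deque

class CurriculumSketch:
    """Automatic stage progression as described above: advance when the
    20-episode rolling reward average clears the stage threshold."""

    THRESHOLDS = {0: 40.0, 1: 120.0}          # stage -> advancement threshold
    FORWARD_SCALE = {0: 0.1, 1: 1.0, 2: 1.0}  # per-stage forward reward scale

    def __init__(self):
        self.stage = 0
        self.window = deque(maxlen=20)         # rolling episode returns

    def on_episode_end(self, episode_return):
        self.window.append(episode_return)
        threshold = self.THRESHOLDS.get(self.stage)  # final stage has none
        if (threshold is not None
                and len(self.window) == self.window.maxlen
                and sum(self.window) / len(self.window) >= threshold):
            self.stage += 1
            self.window.clear()  # re-fill the window under the new stage
        return self.stage
```

Clearing the window on advancement means the agent must re-demonstrate competence under the new stage's harder conditions before advancing again.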
| Metric | Value |
|---|---|
| Training FPS | 1,778 steps/s |
| Curriculum 0→1 | Advanced at step 174K (avg reward 43.3) |
| Eval reward | 73.6 ± 37.7 |
| Eval episode length | 34 steps (0.68s upright) |
| Random baseline | 17 steps (0.34s) — 2× improvement |
| Max episode length | 44 steps |
Training curves:
Trained policy rollout:
Key insights:
- Curriculum works: Agent learns Stage 0 (standing) in 174K steps, then transitions to Stage 1 (walking) automatically.
- Reward shaping is critical: Large alive bonus (5.0/step) + fall penalty (−100) prevents "die-fast" degenerate strategies common in locomotion RL.
- CPU > MPS for small models: 1,778 FPS on CPU vs 223 FPS on MPS — the MPS kernel launch overhead dominates for lightweight networks (49→256→256→18).
- Humanoid locomotion is hard: Even with 1M steps, the agent balances for ~0.7s. Production humanoid controllers (DeepMind, Agility) use 10B+ steps with massively parallel GPU simulation.
# Train humanoid locomotion
python -m scripts.train_humanoid_walk --total-steps 1000000 --device cpu
# Evaluate
python -m scripts.train_humanoid_walk --eval-only

- Gymnasium-compatible environments for Franka Panda arm and humanoid robot
- Reach, pick-and-place, 3-ball juggling, cloth folding (deformable), and bipedal humanoid walking
- MuJoCo 3.x `<flexcomp>` for real-time FEM cloth simulation (64 vertices, 192 DOFs)
- Custom 21-DOF humanoid MJCF with torque control, foot contact sensors, and IMU
- Damped pseudoinverse IK with kinematic grasp lock
- Multi-camera rendering (RGB, depth, segmentation)
- Configurable via YAML — swap tasks without code changes
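The damped pseudoinverse IK mentioned above is the standard damped-least-squares update. A generic sketch (not the exact SimScaleAI implementation):

```python
import numpy as np

def damped_pinv_ik_step(J, ee_err, damping=0.05):
    """One damped-least-squares IK step: dq = Jᵀ (J Jᵀ + λ²I)⁻¹ e.

    J:      (3, n) end-effector position Jacobian
    ee_err: (3,)   target position minus current EE position
    The damping term λ keeps the solve well-conditioned near
    singularities at the cost of slightly slower convergence.
    """
    JJt = J @ J.T
    dq = J.T @ np.linalg.solve(JJt + (damping ** 2) * np.eye(JJt.shape[0]), ee_err)
    return dq  # joint-space update, typically scaled and clipped before applying
```

With `damping=0` this reduces to the plain pseudoinverse; a small positive λ trades a little accuracy for robustness when the arm approaches a singular configuration.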
- Configurable `DomainRandomizationConfig` with 15+ randomization targets
- Visual: lighting direction/color, camera pose/FOV, material colors
- Physics: friction, mass, damping, actuator gains
- Geometry: object size, table position
- Dynamics: gravity noise, timestep variation
- Nominal-value caching for relative randomization
- Integrated into base environment — enabled with `domain_randomization=True`
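The relative-randomization scheme (multipliers applied to cached nominal values) can be sketched as follows; the dataclass and field names here are hypothetical stand-ins for the real `DomainRandomizationConfig`:

```python
import dataclasses
import random

@dataclasses.dataclass
class DRConfigSketch:
    """Illustrative subset of a domain-randomization config.

    Ranges are multipliers on cached nominal values, so randomization
    is relative rather than absolute.
    """
    friction_range: tuple = (0.8, 1.2)
    mass_range: tuple = (0.8, 1.2)

def randomize_physics(nominal_friction, nominal_mass, cfg, rng):
    """Sample per-episode physics from nominal values and config ranges."""
    friction = nominal_friction * rng.uniform(*cfg.friction_range)
    mass = nominal_mass * rng.uniform(*cfg.mass_range)
    return friction, mass
```

Caching nominals matters: repeatedly multiplying an already-randomized value would compound drift episode over episode, whereas re-sampling from the nominal keeps every episode within the configured band.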
- PyTorch DDP for multi-GPU distributed training
- Mixed precision (AMP) with BFloat16/Float16
- Warmup + cosine decay learning rate schedule
- Checkpoint save/resume with full optimizer state
- WandB and TensorBoard logging
- Config-driven via Hydra-style system
- Behavior Cloning (BC): State and image-conditioned imitation learning
- Vision-Language-Action (VLA): 1.4M parameter transformer — ViT encoder + char-level language encoder + fusion transformer → robot actions, conditioned on natural language instructions (inspired by RT-2/OpenVLA)
- Diffusion Policy Head: Denoising diffusion for multi-modal action distributions
- Model Registry: Add new architectures with `@register_model("name")`
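A decorator-based registry of this shape is a small amount of code. A minimal sketch in the spirit of `@register_model("name")` (the real implementation lives in `models/registry.py` and may differ):

```python
class RegistrySketch:
    """Toy model registry: decorator registers a class under a string
    key; create() instantiates by name with keyword overrides."""
    _models = {}

    @classmethod
    def register(cls, name):
        def decorator(model_cls):
            cls._models[name] = model_cls   # map "name" -> class
            return model_cls                # leave the class usable as-is
        return decorator

    @classmethod
    def create(cls, name, **kwargs):
        return cls._models[name](**kwargs)

@RegistrySketch.register("toy_bc")
class ToyBC:
    def __init__(self, action_dim=4):
        self.action_dim = action_dim
```

The payoff is that CLI flags like `--model vla` resolve to architectures through one lookup table, so adding a model never touches the training loop.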
- PPO agent with Generalized Advantage Estimation (GAE)
- Curriculum learning: automatic stage progression (stand → walk → robust)
- Reward shaping library with alive bonus, energy cost, fall penalty
- Closed-loop evaluation (model controls robot in real-time)
- Composable reward function library
- Vectorized environment support
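The GAE computation used by the PPO agent follows the standard backward recursion. A generic sketch, assuming `values` carries one extra bootstrap entry for the final state:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    δ_t = r_t + γ V(s_{t+1})(1 − done_t) − V(s_t)
    A_t = δ_t + γλ (1 − done_t) A_{t+1}

    rewards, dones: length T; values: length T+1 (bootstrap at the end).
    Generic formulation — not the exact SimScaleAI implementation.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + values[:T]   # value-function regression targets
    return adv, returns
```

Zeroing the recursion at `done` boundaries is what lets one flat rollout buffer span multiple episodes without leaking credit across resets.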
- Parallel workers: N processes × 1 MuJoCo env each, near-linear scaling
- Sharded HDF5: Each worker writes its own shard, optional merge
- Resume: Skip completed shards on re-run (fault-tolerant)
- Domain randomization for diverse training data
- Configurable compression (gzip levels 1-9)
- Generation config saved as JSON for reproducibility
Architecture:
Coordinator (main process)
│
├── Worker 0 ──► shard_00000.h5 (episodes 0–249)
├── Worker 1 ──► shard_00001.h5 (episodes 250–499)
├── Worker 2 ──► shard_00002.h5 (episodes 500–749)
└── Worker 3 ──► shard_00003.h5 (episodes 750–999)
│
merge (optional)
│
dataset.h5
Each worker runs its own MuJoCo physics instance — no shared memory,
no GIL contention, no I/O locks. Episodes are divided evenly with
remainders distributed round-robin. Shards are self-contained HDF5 files
marked complete=True on finish, enabling fault-tolerant resume.
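The even-split-with-round-robin-remainder scheme above is easy to state precisely. A sketch (function name is illustrative):

```python
def shard_episodes(n_episodes, n_workers):
    """Divide episodes across workers as described above: even base
    split, with the remainder handed out round-robin to the first
    workers. Returns a (start_episode, count) pair per worker."""
    base, rem = divmod(n_episodes, n_workers)
    shards, start = [], 0
    for w in range(n_workers):
        count = base + (1 if w < rem else 0)  # first `rem` workers get one extra
        shards.append((start, count))
        start += count
    return shards
```

Because each shard is a contiguous, self-contained episode range, a crashed worker can be re-run for exactly its range, which is what makes the `complete=True` resume scheme work.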
Scaling Benchmark (200 episodes, pick-and-place, Mac Mini M2):
| Workers | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 13.7s | 14.6 ep/s | 1.0x |
| 4 | 3.9s | 51.8 ep/s | 3.5x |
| 8 | 2.8s | 72.6 ep/s | 4.9x |
# Parallel data generation (auto-detects CPU count)
python -m simscaleai.datagen.parallel_generator \
--env pick_place --episodes 10000 --workers 8 \
  --output-dir data/pick_place_10k --policy scripted

- `simscale train` — launch any training experiment
- `simscale eval` — closed-loop checkpoint evaluation
- `simscale datagen` — generate datasets (single-process)
- `python -m simscaleai.datagen.parallel_generator` — scalable parallel generation
- `simscale rl` — RL agent training
- `simscale list-envs` / `list-models` — discover components
- `simscale viz-env` / `viz-cameras` / `viz-dataset` / `viz-trajectory` / `viz-live` — visualization
All models have debug configs that run on CPU/MPS and full configs for GPU:
# Debug (runs on your Mac)
model = ModelRegistry.create("vla",
image_size=64, embed_dim=64, num_heads=2, num_layers=2
)
# Full scale (for cloud GPU)
model = ModelRegistry.create("vla",
image_size=224, embed_dim=1024, num_heads=16, num_layers=24
)

# All tests
pytest tests/ -v
# Skip slow tests
pytest tests/ -v -m "not slow"
# With coverage
pytest tests/ --cov=simscaleai --cov-report=term-missing

| Component | Technology |
|---|---|
| Physics Simulation | MuJoCo 3.x |
| ML Framework | PyTorch 2.x |
| Environment Interface | Gymnasium |
| Experiment Config | Hydra / OmegaConf |
| Logging | WandB / TensorBoard |
| Data Format | HDF5 (h5py) |
| CLI | Typer + Rich |
| Testing | pytest |
| Linting | Ruff |
| CI/CD | GitHub Actions |
Apache 2.0 — see LICENSE.