End-to-end robotics AI training and simulation platform — from physics simulation to foundation model training, reinforcement learning, and deployment evaluation.
Built to demonstrate the full stack of robotic AI infrastructure: simulation environments, distributed training pipelines, foundation model architectures (VLM/VLA/BC), RL training, synthetic data generation, and experiment tooling.
┌──────────────────────────────────────────────────────────┐
│ SimScaleAI CLI │
│ simscale train | eval | datagen | rl | viz │
├──────────────┬───────────────┬───────────────────────────┤
│ Simulation │ Training │ Models │
│ (MuJoCo 3) │ Infrastructure│ (BC, VLA, Diffusion) │
│ │ (PyTorch DDP) │ │
│ • Reach │ • Distributed │ • Behavior Cloning │
│ • PickPlace │ • AMP/FSDP │ • Vision-Language-Action │
│ • Juggle │ • Checkpoint │ • Diffusion Policy Head │
│ • ClothFold │ • WandB log │ • Model Registry │
│ • Humanoid │ • VLA train │ │
│ Walk │ │ │
│ • Domain │ │ │
│ Randomize │ │ │
├──────────────┼───────────────┼───────────────────────────┤
│ RL Pipeline │ Synthetic Data Gen │
│ • PPO Agent │ • Domain randomization │
│ • GAE Advantages │ • Multi-modal capture │
│ • Closed-loop eval │ • Language instructions │
│ • Reward function library │ • HDF5 export │
└──────────────────────────────┴───────────────────────────┘
# Clone
git clone https://github.com/rk-edge/SimScaleAI.git
cd SimScaleAI
# Install (with all optional dependencies)
pip install -e ".[all]"
# Or minimal install
pip install -e .

# List available environments and models
simscale list-envs
simscale list-models
# Generate synthetic training data
simscale datagen --env-name reach --n-episodes 100 --output data/reach.h5
# Train a Behavior Cloning model
simscale train --model bc --dataset data/reach.h5 --max-steps 1000
# Train a VLA model (with dummy data)
simscale train --model vla --max-steps 500
# Evaluate a checkpoint in simulation
simscale eval checkpoints/final.pt --env-name reach --n-episodes 20
# Train an RL agent (PPO)
simscale rl --env-name reach --total-steps 50000

from simscaleai.sim import make_env
from simscaleai.models import ModelRegistry
from simscaleai.rl.agents.ppo import PPOAgent
# Create simulation environment
env = make_env("reach", render_mode="human")
obs, info = env.reset()
# Step through the environment
action = env.action_space.sample()
obs, reward, terminated, truncated, info = env.step(action)
# Create a model
model = ModelRegistry.create("vla", image_size=128, action_dim=4)
# Train RL agent
agent = PPOAgent(obs_dim=20, action_dim=4)
agent.train(env)

SimScaleAI/
├── simscaleai/
│ ├── sim/ # Simulation environments
│ │ ├── base_env.py # Abstract MuJoCo environment
│ │ ├── factory.py # Environment registry & factory
│ │ ├── domain_randomization.py # DR config & pipeline
│ │ ├── assets/ # MJCF robot/scene files (auto-generated)
│ │ └── envs/
│ │ ├── reach_env.py # Reach task (move EE to target)
│ │ ├── pick_place_env.py # Pick-and-place manipulation
│ │ ├── juggle_env.py # 3-ball juggling
│ │ ├── cloth_fold_env.py # Deformable cloth folding
│ │ └── humanoid_walk_env.py # Bipedal locomotion + curriculum
│ ├── training/ # ML training infrastructure
│ │ ├── trainer.py # Distributed training loop (DDP/AMP)
│ │ ├── train_vla.py # Language-conditioned VLA pipeline
│ │ └── data/
│ │ └── dataset.py # HDF5 trajectory datasets
│ ├── models/ # Foundation model architectures
│ │ ├── registry.py # Model registry (@register_model)
│ │ ├── bc.py # Behavior Cloning (imitation learning)
│ │ ├── vla.py # Vision-Language-Action model
│ │ └── policy_heads/
│ │ ├── mlp_head.py # Standard MLP action head
│ │ └── diffusion_head.py # Diffusion Policy action head
│ ├── rl/ # Reinforcement learning
│ │ ├── evaluator.py # Closed-loop simulation evaluation
│ │ ├── agents/
│ │ │ └── ppo.py # PPO with GAE
│ │ └── rewards/
│ │ └── rewards.py # Composable reward functions
│ ├── datagen/ # Synthetic data generation
│ │ ├── generator.py # Single-process dataset pipeline
│ │ └── parallel_generator.py # Multi-worker scalable generation
│ ├── eval/ # Evaluation & benchmarking
│ │ └── transfer_benchmark.py # Sim-to-real transfer matrix
│ └── tools/
│ └── cli.py # Typer CLI entry point
├── tests/ # Unit & integration tests
├── .github/workflows/ci.yml # CI pipeline
├── pyproject.toml # Package config & dependencies
└── README.md
SimScaleAI includes built-in visualization tools accessible via CLI and Python API. See docs/visualization.md for the full guide.
Grid view of a reach-task rollout with the Franka Panda arm:
simscale viz-env --env-name reach --n-steps 20 --save env_grid.png

RGB, depth, and segmentation outputs from the wrist camera:

simscale viz-cameras --env-name reach --save cameras.png

Episode length, reward distributions, and per-dimension action histograms from an HDF5 dataset:

simscale viz-dataset data/reach.h5 --save dataset_stats.png

Single episode breakdown — observations, actions, and rewards over time:

simscale viz-trajectory data/reach.h5 --episode 0 --save trajectory.png

Loss curves with smoothed overlays from BC/VLA training:
PPO reward curves, episode lengths, and policy/value losses:
A 3-ball juggling task using the Franka Panda arm with a flat paddle. Three policies were trained and evaluated over 20 episodes each:
| Metric | Scripted (Expert) | BC (Imitation) | PPO (RL) |
|---|---|---|---|
| Mean Reward | 95.9 ± 61.3 | 95.7 ± 61.0 | 94.9 ± 61.2 |
| Mean Episode Length | 53.4 ± 62.7 | 53.4 ± 62.7 | 52.3 ± 63.1 |
| Max Balls Airborne | 3 | 3 | 3 |
| Best Episode Reward | 279.9 | 278.8 | 278.8 |
| Worst Episode Reward | 63.0 | 62.9 | 62.9 |
Takeaways:
- BC nearly matches the expert — trained on only 200 demonstration episodes (loss 0.38 → 0.06).
- PPO is competitive with 50K timesteps; more training would likely surpass imitation.
- All policies achieve 3 balls airborne simultaneously in their best episodes.
The core manipulation benchmark: Franka Panda picks a 3cm red cube and places it at a randomized green target, using damped pseudoinverse IK and a kinematic grasp lock.
| Model | Data | Steps | Loss (start → end) |
|---|---|---|---|
| BC | 200 eps, 37.8K steps | 3,000 | 0.385 → 0.010 |
| BC-DR | 200 eps, 58.8K steps (domain‑randomized) | 3,000 | 0.396 → 0.009 |
| PPO | Online (50K env steps) | 50,000 | reward −43.4 → −6.6 |
| VLA | 37.8K steps + language instructions | 2,000 | 0.300 → 0.064 |
| Policy | Reward | Success | Avg Length |
|---|---|---|---|
| Scripted (expert) | 145.1 ± 148.9 | 20.0% | 261 |
| BC (imitation) | 55.6 ± 76.0 | 0.0% | 300 |
| BC‑DR (domain‑randomized, eval on DR env) | −35.4 ± 99.3 | 0.0% | 300 |
| PPO (RL, 50K steps) | −7.6 ± 40.1 | 0.0% | 300 |
| VLA (language‑conditioned) | −46.7 ± 50.0 | 0.0% | 300 |
Takeaways:
- Scripted policy achieves 20% success — pick‑and‑place is significantly harder than reaching, requiring precise multi‑phase coordination (approach → descend → grasp → lift → transport → place).
- BC captures the motion pattern (positive reward) but hasn't generalized grasping from 200 demos — more data and longer training would improve this.
- PPO learns to avoid penalties (reward near 0) but hasn't discovered the full grasp→lift sequence in 50K steps — contact‑rich manipulation typically needs millions of steps.
- VLA demonstrates language‑conditioned action prediction — a 1.4M parameter model that fuses vision + language + state through a transformer to output actions.
- Domain randomization systematically varies physics (friction, mass, damping, gains), geometry (object size), and visuals (lighting, camera, materials) for sim‑to‑real transfer.
Systematic evaluation of how policies trained under different conditions transfer to unseen environment variations. Tests robustness across 4 eval conditions (Clean → Heavy DR) with per‑parameter sensitivity ablation.
| Policy | Clean | Light DR | Default DR | Heavy DR |
|---|---|---|---|---|
| Scripted | 121.1 ± 131.4 | 158.4 ± 229.0 | 162.7 ± 269.6 | −60.6 ± 128.5 |
| BC | 50.8 ± 126.7 | 108.0 ± 56.2 | −88.8 ± 79.7 | −123.5 ± 57.7 |
| BC‑DR | 276.9 ± 137.8 | 282.8 ± 127.7 | −8.1 ± 108.0 | −104.9 ± 117.9 |
| PPO | 3.4 ± 34.7 | 4.6 ± 48.8 | −43.3 ± 64.8 | −121.5 ± 64.4 |
| Parameter | Scripted Drop | BC‑DR Drop |
|---|---|---|
| Gains (kp) | +145.2 (most sensitive) | +329.6 (most sensitive) |
| Obj Size | −9.0 | +76.1 |
| Damping | −28.4 | +63.3 |
| Friction | −84.6 | +47.3 |
| Mass | −18.7 | +33.5 |
| Gravity | −7.2 | +34.4 |
| Lighting | +1.8 (least sensitive) | +33.0 |
Key Findings:
- BC‑DR dominates on clean + light DR — training with DR actually improves clean performance (276.9 vs 50.8 for vanilla BC), showing DR acts as regularization.
- Actuator gains are the most critical parameter — randomizing `kp` alone drops Scripted reward by +145 and BC‑DR by +330. This suggests actuator calibration is the #1 priority for sim‑to‑real transfer.
- Lighting has near-zero impact on state-based policies (expected — no images in the observation).
- Heavy DR breaks all policies — extreme randomization (mass 0.3–3×, friction 0.4–2×, gains 0.5–2×) exceeds what any policy trained with default ranges can handle.
- DR improves generalization — BC‑DR transfers better than vanilla BC across every condition.
Genuinely frontier research territory: autonomous cloth folding with learned policies on physically-accurate deformable simulation.
Uses MuJoCo 3.x `<flexcomp>` for real-time FEM cloth simulation:
- 8×8 vertex grid (64 vertices, 192 DOFs) with edge damping + self-collision
- Kinematic grasp lock: cloth edge vertices attached to end-effector during manipulation
- Body-frame ↔ world coordinate mapping for accurate vertex kinematics
Pick up one edge of a 17.5cm × 17.5cm cloth and fold it onto the opposite edge:
| Stage | Description |
|---|---|
| 1. Approach | Move EE above the far edge of the cloth |
| 2. Grasp | Close gripper to lock 8 edge vertices to EE |
| 3. Lift | Lift edge slightly above the table |
| 4. Fold | Sweep grasped edge toward the target edge (−X direction) |
| 5. Release | Open gripper — cloth should remain folded |
| Metric | Scripted Expert | BC (learned) |
|---|---|---|
| Success rate | 100% | 100% |
| Steps to fold | 77 | 372 |
| Final fold distance | 0.028m | 0.010m |
| Mean reward | 129.6 | 615.6 |
Key insight: BC successfully learns to fold cloth but takes ~5× longer than the scripted expert. The learned policy starts with cautious movements (action magnitude ≈ 0.11) then accelerates near the goal (≈ 0.40), demonstrating emergent precision—it learns to be careful with deformable objects.
Training: 100 expert demos (7,700 timesteps) → 5,000 BC steps on MPS → 222-dim state → 4-dim delta actions.
Bipedal humanoid walking using a custom 21-DOF MJCF model trained from scratch with PPO and automatic curriculum advancement.
Custom-authored MuJoCo MJCF with:
- 21 actuated joints: 3-DOF hips, 1-DOF knees, 2-DOF ankles, 2-DOF shoulders, 1-DOF elbows (× 2 limbs)
- Free-floating torso (6-DOF freejoint) — total 25 qpos, 24 qvel
- Torque-controlled motors (gear ratio 100, ctrl range [-1, 1])
- Sensors: foot contact (touch), torso IMU (gyro + accelerometer)
| Component | Dimensions | Description |
|---|---|---|
| Torso height | 1 | Center-of-mass z position |
| Torso orientation | 4 | Quaternion (w,x,y,z) |
| Joint positions | 18 | All actuated joint angles |
| Torso linear velocity | 3 | CoM velocity (x,y,z) |
| Torso angular velocity | 3 | Gyroscope reading |
| Joint velocities | 18 | Actuated joint angular velocities |
| Foot contacts | 2 | Binary ground-contact flags |
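The components in the table sum to a 49-dimensional state vector. A minimal sketch of how such an observation could be assembled (field names here are illustrative, not the actual SimScaleAI API):

```python
import numpy as np

def build_humanoid_obs(data):
    """Assemble the 49-dim observation described in the table above.

    `data` is assumed to expose MuJoCo-style readings; the dict keys
    are hypothetical stand-ins for the real environment internals.
    """
    obs = np.concatenate([
        [data["torso_z"]],           # 1  — torso center-of-mass height
        data["torso_quat"],          # 4  — orientation quaternion (w, x, y, z)
        data["joint_pos"],           # 18 — actuated joint angles
        data["torso_linvel"],        # 3  — CoM linear velocity
        data["torso_angvel"],        # 3  — gyroscope reading
        data["joint_vel"],           # 18 — actuated joint velocities
        data["foot_contact"],        # 2  — binary ground-contact flags
    ])
    assert obs.shape == (49,)        # 1 + 4 + 18 + 3 + 3 + 18 + 2 = 49
    return obs
```

This 49-dim vector is what feeds the 49→256→256→18 policy network mentioned in the performance notes.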
reward = forward_vel × 1.25 × stage_scale
+ alive_bonus (5.0/step)
+ height_bonus (2.0 × min(z/1.3, 1))
− energy_cost (0.01 × |ctrl·vel|)
− ctrl_cost (0.001 × |ctrl|²)
+ fall_penalty (−100 on termination)
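The shaped reward above can be written out directly. A sketch using the stated coefficients (function and argument names are illustrative, not the SimScaleAI API):

```python
import numpy as np

def humanoid_reward(forward_vel, torso_z, ctrl, joint_vel, stage_scale, fell):
    """Shaped locomotion reward matching the formula above.

    Coefficients come from the reward definition; the signature is a
    hypothetical stand-in for the environment's internal computation.
    """
    r = forward_vel * 1.25 * stage_scale           # forward progress term
    r += 5.0                                       # alive bonus per step
    r += 2.0 * min(torso_z / 1.3, 1.0)             # height bonus, capped at nominal height
    r -= 0.01 * np.abs(ctrl * joint_vel).sum()     # energy cost |ctrl·vel|
    r -= 0.001 * np.square(ctrl).sum()             # control cost |ctrl|²
    if fell:
        r -= 100.0                                 # fall penalty on termination
    return r
```

The large alive bonus relative to the control costs is what makes "stay upright doing nothing" a viable Stage 0 strategy, as discussed in the curriculum notes below.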
Automatic stage progression based on 20-episode rolling reward average:
| Stage | Objective | Forward Reward Scale | External Forces | Advancement Threshold |
|---|---|---|---|---|
| 0 — Stand | Learn to stay upright | 0.1× | None | avg reward ≥ 40 |
| 1 — Walk | Walk forward | 1.0× | None | avg reward ≥ 120 |
| 2 — Robust | Walk under perturbation | 1.0× | Random pushes every 100 steps | — |
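The stage-advancement logic in the table can be sketched as a small state machine over a rolling 20-episode window (class and attribute names are illustrative, not the SimScaleAI API):

```python
from collections import deque

class CurriculumSketch:
    """Automatic stage progression as described above: advance when the
    20-episode rolling reward average clears the stage threshold."""

    THRESHOLDS = {0: 40.0, 1: 120.0}          # stage -> advancement threshold
    FORWARD_SCALE = {0: 0.1, 1: 1.0, 2: 1.0}  # per-stage forward reward scale

    def __init__(self):
        self.stage = 0
        self.window = deque(maxlen=20)         # rolling episode returns

    def on_episode_end(self, episode_return):
        self.window.append(episode_return)
        threshold = self.THRESHOLDS.get(self.stage)  # final stage has none
        if (threshold is not None
                and len(self.window) == self.window.maxlen
                and sum(self.window) / len(self.window) >= threshold):
            self.stage += 1
            self.window.clear()  # re-fill the window under the new stage
        return self.stage
```

Clearing the window on advancement means the agent must re-demonstrate competence under the new stage's harder conditions before advancing again.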
| Metric | Value |
|---|---|
| Training FPS | 1,778 steps/s |
| Curriculum 0→1 | Advanced at step 174K (avg reward 43.3) |
| Eval reward | 73.6 ± 37.7 |
| Eval episode length | 34 steps (0.68s upright) |
| Random baseline | 17 steps (0.34s) — 2× improvement |
| Max episode length | 44 steps |
Training curves:
Trained policy rollout:
Key insights:
- Curriculum works: Agent learns Stage 0 (standing) in 174K steps, then transitions to Stage 1 (walking) automatically.
- Reward shaping is critical: Large alive bonus (5.0/step) + fall penalty (−100) prevents "die-fast" degenerate strategies common in locomotion RL.
- CPU > MPS for small models: 1,778 FPS on CPU vs 223 FPS on MPS — the MPS kernel launch overhead dominates for lightweight networks (49→256→256→18).
- Humanoid locomotion is hard: Even with 1M steps, the agent balances for ~0.7s. Production humanoid controllers (DeepMind, Agility) use 10B+ steps with massively parallel GPU simulation.
# Train humanoid locomotion
python -m scripts.train_humanoid_walk --total-steps 1000000 --device cpu
# Evaluate
python -m scripts.train_humanoid_walk --eval-only

- Gymnasium-compatible environments for Franka Panda arm and humanoid robot
- Reach, pick-and-place, 3-ball juggling, cloth folding (deformable), and bipedal humanoid walking
- MuJoCo 3.x `<flexcomp>` for real-time FEM cloth simulation (64 vertices, 192 DOFs)
- Custom 21-DOF humanoid MJCF with torque control, foot contact sensors, and IMU
- Damped pseudoinverse IK with kinematic grasp lock
- Multi-camera rendering (RGB, depth, segmentation)
- Configurable via YAML — swap tasks without code changes
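The damped pseudoinverse IK mentioned above is the standard damped-least-squares update. A generic sketch (not the exact SimScaleAI implementation):

```python
import numpy as np

def damped_pinv_ik_step(J, ee_err, damping=0.05):
    """One damped-least-squares IK step: dq = Jᵀ (J Jᵀ + λ²I)⁻¹ e.

    J:      (3, n) end-effector position Jacobian
    ee_err: (3,)   target position minus current EE position
    The damping term λ keeps the solve well-conditioned near
    singularities at the cost of slightly slower convergence.
    """
    JJt = J @ J.T
    dq = J.T @ np.linalg.solve(JJt + (damping ** 2) * np.eye(JJt.shape[0]), ee_err)
    return dq  # joint-space update, typically scaled and clipped before applying
```

With `damping=0` this reduces to the plain pseudoinverse; a small positive λ trades a little accuracy for robustness when the arm approaches a singular configuration.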
- Configurable `DomainRandomizationConfig` with 15+ randomization targets
- Visual: lighting direction/color, camera pose/FOV, material colors
- Physics: friction, mass, damping, actuator gains
- Geometry: object size, table position
- Dynamics: gravity noise, timestep variation
- Nominal-value caching for relative randomization
- Integrated into base environment — enabled with `domain_randomization=True`
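The relative-randomization scheme (multipliers applied to cached nominal values) can be sketched as follows; the dataclass and field names here are hypothetical stand-ins for the real `DomainRandomizationConfig`:

```python
import dataclasses
import random

@dataclasses.dataclass
class DRConfigSketch:
    """Illustrative subset of a domain-randomization config.

    Ranges are multipliers on cached nominal values, so randomization
    is relative rather than absolute.
    """
    friction_range: tuple = (0.8, 1.2)
    mass_range: tuple = (0.8, 1.2)

def randomize_physics(nominal_friction, nominal_mass, cfg, rng):
    """Sample per-episode physics from nominal values and config ranges."""
    friction = nominal_friction * rng.uniform(*cfg.friction_range)
    mass = nominal_mass * rng.uniform(*cfg.mass_range)
    return friction, mass
```

Caching nominals matters: repeatedly multiplying an already-randomized value would compound drift episode over episode, whereas re-sampling from the nominal keeps every episode within the configured band.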
- PyTorch DDP for multi-GPU distributed training
- Mixed precision (AMP) with BFloat16/Float16
- Warmup + cosine decay learning rate schedule
- Checkpoint save/resume with full optimizer state
- WandB and TensorBoard logging
- Config-driven via Hydra-style system
- Behavior Cloning (BC): State and image-conditioned imitation learning
- Vision-Language-Action (VLA): 1.4M parameter transformer — ViT encoder + char-level language encoder + fusion transformer → robot actions, conditioned on natural language instructions (inspired by RT-2/OpenVLA)
- Diffusion Policy Head: Denoising diffusion for multi-modal action distributions
- Model Registry: Add new architectures with `@register_model("name")`
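A decorator-based registry of this shape is a small amount of code. A minimal sketch in the spirit of `@register_model("name")` (the real implementation lives in `models/registry.py` and may differ):

```python
class RegistrySketch:
    """Toy model registry: decorator registers a class under a string
    key; create() instantiates by name with keyword overrides."""
    _models = {}

    @classmethod
    def register(cls, name):
        def decorator(model_cls):
            cls._models[name] = model_cls   # map "name" -> class
            return model_cls                # leave the class usable as-is
        return decorator

    @classmethod
    def create(cls, name, **kwargs):
        return cls._models[name](**kwargs)

@RegistrySketch.register("toy_bc")
class ToyBC:
    def __init__(self, action_dim=4):
        self.action_dim = action_dim
```

The payoff is that CLI flags like `--model vla` resolve to architectures through one lookup table, so adding a model never touches the training loop.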
- PPO agent with Generalized Advantage Estimation (GAE)
- Curriculum learning: automatic stage progression (stand → walk → robust)
- Reward shaping library with alive bonus, energy cost, fall penalty
- Closed-loop evaluation (model controls robot in real-time)
- Composable reward function library
- Vectorized environment support
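The GAE computation used by the PPO agent follows the standard backward recursion. A generic sketch, assuming `values` carries one extra bootstrap entry for the final state:

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation:
    δ_t = r_t + γ V(s_{t+1})(1 − done_t) − V(s_t)
    A_t = δ_t + γλ (1 − done_t) A_{t+1}

    rewards, dones: length T; values: length T+1 (bootstrap at the end).
    Generic formulation — not the exact SimScaleAI implementation.
    """
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        last = delta + gamma * lam * nonterminal * last
        adv[t] = last
    returns = adv + values[:T]   # value-function regression targets
    return adv, returns
```

Zeroing the recursion at `done` boundaries is what lets one flat rollout buffer span multiple episodes without leaking credit across resets.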
- Parallel workers: N processes × 1 MuJoCo env each, near-linear scaling
- Sharded HDF5: Each worker writes its own shard, optional merge
- Resume: Skip completed shards on re-run (fault-tolerant)
- Domain randomization for diverse training data
- Configurable compression (gzip levels 1-9)
- Generation config saved as JSON for reproducibility
Architecture:
Coordinator (main process)
│
├── Worker 0 ──► shard_00000.h5 (episodes 0–249)
├── Worker 1 ──► shard_00001.h5 (episodes 250–499)
├── Worker 2 ──► shard_00002.h5 (episodes 500–749)
└── Worker 3 ──► shard_00003.h5 (episodes 750–999)
│
merge (optional)
│
dataset.h5
Each worker runs its own MuJoCo physics instance — no shared memory,
no GIL contention, no I/O locks. Episodes are divided evenly with
remainders distributed round-robin. Shards are self-contained HDF5 files
marked complete=True on finish, enabling fault-tolerant resume.
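The even-split-with-round-robin-remainder scheme above is easy to state precisely. A sketch (function name is illustrative):

```python
def shard_episodes(n_episodes, n_workers):
    """Divide episodes across workers as described above: even base
    split, with the remainder handed out round-robin to the first
    workers. Returns a (start_episode, count) pair per worker."""
    base, rem = divmod(n_episodes, n_workers)
    shards, start = [], 0
    for w in range(n_workers):
        count = base + (1 if w < rem else 0)  # first `rem` workers get one extra
        shards.append((start, count))
        start += count
    return shards
```

Because each shard is a contiguous, self-contained episode range, a crashed worker can be re-run for exactly its range, which is what makes the `complete=True` resume scheme work.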
Scaling Benchmark (200 episodes, pick-and-place, Mac Mini M2):
| Workers | Time | Throughput | Speedup |
|---|---|---|---|
| 1 | 13.7s | 14.6 ep/s | 1.0x |
| 4 | 3.9s | 51.8 ep/s | 3.5x |
| 8 | 2.8s | 72.6 ep/s | 4.9x |
# Parallel data generation (auto-detects CPU count)
python -m simscaleai.datagen.parallel_generator \
--env pick_place --episodes 10000 --workers 8 \
  --output-dir data/pick_place_10k --policy scripted

- `simscale train` — launch any training experiment
- `simscale eval` — closed-loop checkpoint evaluation
- `simscale datagen` — generate datasets (single-process)
- `python -m simscaleai.datagen.parallel_generator` — scalable parallel generation
- `simscale rl` — RL agent training
- `simscale list-envs` / `list-models` — discover components
- `simscale viz-env` / `viz-cameras` / `viz-dataset` / `viz-trajectory` / `viz-live` — visualization
All models have debug configs that run on CPU/MPS and full configs for GPU:
# Debug (runs on your Mac)
model = ModelRegistry.create("vla",
image_size=64, embed_dim=64, num_heads=2, num_layers=2
)
# Full scale (for cloud GPU)
model = ModelRegistry.create("vla",
image_size=224, embed_dim=1024, num_heads=16, num_layers=24
)

# All tests
pytest tests/ -v
# Skip slow tests
pytest tests/ -v -m "not slow"
# With coverage
pytest tests/ --cov=simscaleai --cov-report=term-missing

| Component | Technology |
|---|---|
| Physics Simulation | MuJoCo 3.x |
| ML Framework | PyTorch 2.x |
| Environment Interface | Gymnasium |
| Experiment Config | Hydra / OmegaConf |
| Logging | WandB / TensorBoard |
| Data Format | HDF5 (h5py) |
| CLI | Typer + Rich |
| Testing | pytest |
| Linting | Ruff |
| CI/CD | GitHub Actions |
Apache 2.0 — see LICENSE.