AI-native production simulation platform for distributed systems.
Sim runs in your CI/CD pipeline: it ingests Kubernetes and Terraform configs, generates chaos scenarios with heuristics and AI, simulates failure propagation through your service graph, and surfaces blast radius findings before code ships.
K8s / Terraform manifests
│
▼
@sim/parser ← parse manifests into a typed ServiceGraph
│
▼
@sim/generator ← heuristic + AI chaos scenario generation
│
▼
@sim/executor ← model-based failure propagation simulation
│
▼
@sim/reporter ← blast radius findings + markdown report
│
┌────┴────┐
▼ ▼
@sim/api @sim/cli ← HTTP server and command-line interface
│
▼
@sim/action ← GitHub Actions integration (PR comments + commit status)
| Package | Description |
|---|---|
| @sim/core | Shared TypeScript types: ServiceGraph, ChaosScenarioConfig, ExecutionResult, BlastRadiusReport |
| @sim/parser | Parses Kubernetes manifests and Terraform state files into a ServiceGraph |
| @sim/generator | Generates chaos scenarios from a ServiceGraph: deterministic heuristics plus an optional Claude AI novelty pass |
| @sim/executor | Model-based failure propagation simulation: latency, error rate, pod kill, network partition, CPU/memory stress, DNS errors |
| @sim/reporter | Converts ExecutionResult[] into a BlastRadiusReport with root-cause traces, mitigations, risk score, and markdown body |
| @sim/api | REST HTTP server: POST /v1/simulate, POST /v1/report, POST /v1/pipeline |
| @sim/cli | CLI tool: sim parse <dir> and sim run <dir> |
| @sim/action | GitHub Actions composite action that posts blast radius reports as PR comments |

Test coverage: 117 tests across all packages, all passing.
pnpm install
pnpm build
# Parse your K8s manifests into a service graph
sim parse ./k8s
# Full pipeline: parse → generate → simulate → report (text output)
sim run ./k8s
# JSON output (pipe to jq, scripts, etc.)
sim run ./k8s --output json
# Limit scenario count
sim run ./k8s --max-total 10

# Start the server
pnpm --filter @sim/api start # listens on PORT (default 3000)
# Health check
curl http://localhost:3000/health
# Run a single scenario
curl -X POST http://localhost:3000/v1/simulate \
  -H 'Content-Type: application/json' \
  -d '{ "scenario": { ... }, "graph": { ... } }'
# Full pipeline from a service graph
curl -X POST http://localhost:3000/v1/pipeline \
  -H 'Content-Type: application/json' \
  -d '{ "graph": { ... }, "options": { "heuristicsOnly": true } }'

Add to .github/workflows/sim.yml:
- name: Run Sim Blast Radius Analysis
  uses: your-org/sim@main
  with:
    github-token: ${{ secrets.GITHUB_TOKEN }}
    anthropic-api-key: ${{ secrets.ANTHROPIC_API_KEY }} # optional; enables AI scenarios
    infra-path: 'k8s/'
    fail-on: 'critical' # fail the check if there are any critical findings
    max-scenarios: '20'

Outputs: risk-score, critical-count, high-count, medium-count, low-count, total-scenarios, report-json-path
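
The fail-on input and the per-severity count outputs suggest a simple gating rule. The sketch below is one way such a gate could work, not the action's actual implementation: the `Severity` ordering, the `Counts` shape, and the `shouldFail` helper are hypothetical, and it assumes fail-on acts as a minimum-severity threshold (only the 'critical' value is documented above).

```typescript
// Hypothetical severity scale, ordered from least to most severe.
type Severity = "low" | "medium" | "high" | "critical";
const SEVERITY_ORDER: Severity[] = ["low", "medium", "high", "critical"];

// Hypothetical shape mirroring the low-count ... critical-count outputs.
type Counts = Record<Severity, number>;

// Assumption: the check fails when any finding at or above `failOn` exists.
function shouldFail(counts: Counts, failOn: Severity): boolean {
  const threshold = SEVERITY_ORDER.indexOf(failOn);
  return SEVERITY_ORDER.some(
    (severity, index) => index >= threshold && counts[severity] > 0,
  );
}
```

Under this assumption, a run with one high finding and no critical findings passes with fail-on 'critical' and fails with fail-on 'high'.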
pnpm install
# Build all packages
pnpm build
# Run all tests (117 tests)
pnpm test
# Type check
pnpm typecheck
# Lint + format check
pnpm lint
pnpm format:check

| Fault | Effect on target |
|---|---|
| latency | Adds N ms to response time |
| error_rate | X% of requests return errors |
| pod_kill | Reduces replica count (single-replica → full outage) |
| network_partition | Complete network isolation (30 s connection timeout) |
| cpu_stress | 5× latency increase, small error rate |
| memory_stress | Periodic OOM kills → 15% error rate |
| dns_error | DNS resolution failure → 100% error rate |

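
As a rough illustration of the fault table, the effects can be written as a pure function from a service's state to its degraded state. This is a sketch under stated assumptions rather than the @sim/executor code: the `Fault` and `ServiceState` types and `applyFault` are hypothetical, the "small error rate" for cpu_stress is assumed to be 1%, and a network partition is assumed to fail every request after the 30 s timeout.

```typescript
// Hypothetical types for illustration; not the actual @sim/core definitions.
type Fault =
  | { kind: "latency"; addedMs: number }
  | { kind: "error_rate"; percent: number }
  | { kind: "pod_kill"; killed: number }
  | { kind: "network_partition" }
  | { kind: "cpu_stress" }
  | { kind: "memory_stress" }
  | { kind: "dns_error" };

type ServiceState = { latencyMs: number; errorRate: number; replicas: number };

const CPU_STRESS_ERROR_RATE = 0.01; // assumption: the table only says "small"

function applyFault(s: ServiceState, f: Fault): ServiceState {
  switch (f.kind) {
    case "latency":
      return { ...s, latencyMs: s.latencyMs + f.addedMs };
    case "error_rate":
      return { ...s, errorRate: Math.max(s.errorRate, f.percent / 100) };
    case "pod_kill": {
      // Losing the last replica is a full outage.
      const replicas = Math.max(0, s.replicas - f.killed);
      return { ...s, replicas, errorRate: replicas === 0 ? 1 : s.errorRate };
    }
    case "network_partition":
      // Assumption: callers wait out the 30 s timeout, then fail.
      return { ...s, latencyMs: 30000, errorRate: 1 };
    case "cpu_stress":
      return {
        ...s,
        latencyMs: s.latencyMs * 5,
        errorRate: Math.max(s.errorRate, CPU_STRESS_ERROR_RATE),
      };
    case "memory_stress":
      return { ...s, errorRate: Math.max(s.errorRate, 0.15) };
    case "dns_error":
      return { ...s, errorRate: 1 };
  }
}
```

Modeling each fault as a tagged union member keeps the switch exhaustive, so adding a new fault type without handling it becomes a compile-time error.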
Failures propagate outward through the service graph with a 0.7× latency factor and 0.8× error-rate factor per hop. Circuit breakers open after 30 s of sustained >80% error rate on a callee.
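
The per-hop attenuation can be sketched as a breadth-first walk from the faulted service to its callers. This is a minimal illustration of the stated factors, not the executor's real algorithm: the `Impact` type, the `callers` map, and both function names are hypothetical, and it assumes "outward" means toward upstream callers, with each service taking the impact from its shortest path to the target.

```typescript
// Hypothetical shape for illustration; not the actual @sim/core types.
type Impact = { addedLatencyMs: number; errorRate: number };

const LATENCY_FACTOR = 0.7; // per-hop latency attenuation
const ERROR_FACTOR = 0.8; // per-hop error-rate attenuation

// `callers` maps each service to the services that call it directly.
function propagate(
  callers: Map<string, string[]>,
  target: string,
  initial: Impact,
): Map<string, Impact> {
  const impacts = new Map<string, Impact>([[target, initial]]);
  const queue: string[] = [target];
  while (queue.length > 0) {
    const current = queue.shift()!;
    const impact = impacts.get(current)!;
    for (const caller of callers.get(current) ?? []) {
      if (impacts.has(caller)) continue; // assumption: first (shortest) path wins
      impacts.set(caller, {
        addedLatencyMs: impact.addedLatencyMs * LATENCY_FACTOR,
        errorRate: impact.errorRate * ERROR_FACTOR,
      });
      queue.push(caller);
    }
  }
  return impacts;
}

// Breaker rule from the text: >80% callee error rate sustained for 30 s.
function breakerOpens(calleeErrorRate: number, sustainedSeconds: number): boolean {
  return calleeErrorRate > 0.8 && sustainedSeconds >= 30;
}
```

For a chain web → api → db with a full outage on db (1000 ms added latency, 100% errors), this sketch gives api roughly 700 ms and 80% errors, and web roughly 490 ms and 64% errors. How an open breaker changes downstream propagation is not specified above, so the sketch leaves it out.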

- ci.yml: lint → typecheck → test → build on every push/PR to main
- deploy.yml: staging deploy on main; production deploy on version tags (v*)
- sim.yml: blast radius analysis on PRs that touch k8s/, terraform/, or infra/