
🇰🇷 한국어 | 🇺🇸 English

xm

AI coding agents fail silently. They skip planning, miss context, and never check the result.
xm makes that harder to do.


A plugin toolkit for Claude Code. It adds the steps a senior engineer never skips: plan before coding, review before merging, verify before declaring done.

/xm:build plan "Build a REST API with JWT auth"
→ PRD → task decomposition → parallel agents → verified ✅




Install

Prerequisites

xm uses Bun as its JavaScript runtime for testing, the dashboard server, and script execution.

Why Bun?

  • Fast startup — scripts and tests launch instantly with no JIT warmup
  • Built-in test runner — bun test works out of the box, no extra devDependencies
  • Native TypeScript/ESM — runs .ts and .mjs files directly without transpilation
  • Zero-config HTTP server — powers x-dashboard with no npm dependencies
# macOS / Linux
curl -fsSL https://bun.sh/install | bash

# Homebrew
brew install oven-sh/bun/bun

# Windows
powershell -c "irm bun.sh/install.ps1 | iex"

After installation, verify with bun --version (requires v1.0+).

Node.js >= 18 is still required for Claude Code itself. Bun is used for xm's own tooling.

Plugin Setup

/plugin marketplace add x-mesh/xm
/plugin install xm@xm -s user

First-Run Init (Global)

After installing, run once per machine to copy the trace-session hook into ~/.claude/hooks/ and register Skill matchers in ~/.claude/settings.json:

/xm init              # install trace-session hook into ~/.claude/
/xm init status       # verify install state
/xm init uninstall    # remove hook + settings entries
/xm init --no-hooks   # CLI-only install (no-op today — reserved)

Idempotent: safe to re-run. Existing hooks (e.g. mem-mesh) are preserved, and each write creates a timestamped backup of settings.json. Traces land in each project's .xm/traces/.
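For orientation, a minimal sketch of that merge behavior, assuming the standard Claude Code hooks schema in settings.json (the exact matcher and entries xm writes may differ):

import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

const settingsPath = path.join(os.homedir(), '.claude', 'settings.json');
const settings = JSON.parse(fs.readFileSync(settingsPath, 'utf8'));

// Timestamped backup of settings.json before every write, as described above.
fs.copyFileSync(settingsPath, `${settingsPath}.${Date.now()}.bak`);

// Register the trace hook under a Skill matcher only if it is not already
// there; pre-existing hooks (e.g. mem-mesh) are left untouched.
const hookCmd = path.join(os.homedir(), '.claude', 'hooks', 'xm-trace-session.mjs');
settings.hooks ??= {};
settings.hooks.PreToolUse ??= [];
const present = settings.hooks.PreToolUse.some(
  (entry) => entry.hooks?.some((h) => h.command === hookCmd),
);
if (!present) {
  settings.hooks.PreToolUse.push({
    matcher: 'Skill',
    hooks: [{ type: 'command', command: hookCmd }],
  });
}
fs.writeFileSync(settingsPath, JSON.stringify(settings, null, 2));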

The same install flow is available from a terminal via xm init (see Terminal CLI).

Terminal CLI (optional)

Install the xm umbrella CLI to run commands directly from your shell — useful for the dashboard, sync, memory, traces, etc., without entering Claude Code:

# Local install from this repo
bash xm/scripts/install.sh

# Or remote
curl -fsSL https://raw.githubusercontent.com/x-mesh/xm/main/xm/scripts/install.sh | bash

The installer writes ~/.local/bin/xm (override with XM_BIN_DIR; ensure it is on your PATH). When the claude CLI is on PATH, it also runs claude plugin install <p>@xm -s user for every plugin in marketplace.json (x-build, x-agent, x-op, x-solver, x-review, x-trace, x-memory, x-eval, x-probe, x-humble, x-dashboard, xm); run /reload-plugins inside Claude Code afterward to activate them. If claude is not on PATH, only the CLI wrapper is installed and the plugin list is printed for manual install.

Global hook install (xm init)

The terminal xm init subcommand mirrors the /xm init slash command, installing the Skill-tracing hook into user-scoped ~/.claude/ once per machine:

xm init                 # install trace-session hook into ~/.claude/
xm init status          # verify install state
xm init uninstall       # remove hook + settings entries
xm init --no-hooks      # CLI-only install (no-op today — reserved)

Writes ~/.claude/hooks/xm-trace-session.mjs and merges PreToolUse/PostToolUse Skill matchers into ~/.claude/settings.json (existing hooks such as mem-mesh are preserved; a timestamped backup is created on every write). Use the bash route when you are outside Claude Code; otherwise /xm init is the preferred entry point.

# Dashboard
xm dashboard                       # start — uses ~/.xm/projects.json registry (all registered projects)
xm dashboard --scan ~/work         # legacy multi-project mode: scan ~/work for .xm/ dirs (depth 4)
XM_DASHBOARD_SCAN=~/work xm dashboard   # same, persisted via env var
xm dashboard stop                  # stop it
xm dashboard open                  # open it in your browser

# Project registry (~/.xm/projects.json)
xm project import ~/work           # one-shot bulk-register all .xm/ projects under ~/work
xm project list                    # show registered projects
xm project add [<path>]            # register CWD or given path
xm project remove <id|path>        # unregister
xm project archive <id>            # hide from dashboard without deleting
xm project gc                      # drop entries whose path no longer exists

# Sync, memory, and session state
xm sync push           # push .xm/ state to your sync server
xm sync pull           # pull state from your sync server
xm memory <subcmd>     # save | recall | inject | list
xm build <subcmd>      # build status / list / ...
xm trace <subcmd>      # execution traces
xm solver <subcmd>     # structured problem solving
xm handoff [reason]    # save session state
xm handon              # restore session state
xm which               # show resolved lib paths
xm version
xm help

The CLI dispatches to plugin libs in ~/.claude/plugins/cache/xm/ (or $XM_LIB), so the Claude Code plugin must be installed first. The sync subcommand reuses the x-sync plugin lib, so you do not need to run x-sync/install.sh client separately.

Project Registry (xm project)

The dashboard reads from a machine-local registry at ~/.xm/projects.json. Once populated, xm dashboard shows every registered project without --scan.

  • First-time setup: run xm project import ~/work (or any root) to bulk-register every existing .xm/ project. Idempotent — re-running only updates last_seen.
  • Auto-registration: when you run any xm command from a project directory, the dispatcher self-registers it. New projects appear in the dashboard without explicit action.
  • Worktrees: a worktree of an already-registered repo is collapsed onto the main repo entry. Running xm from any worktree updates the same registry entry — no duplicates.
  • Resolution priority: --scan flag → ~/.xm/projects.json → legacy ~/.xm/config.json scan_roots → CWD only.
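A minimal sketch of that resolution order; the registry field names here are assumptions, not the actual ~/.xm/projects.json schema:

import fs from 'node:fs';
import os from 'node:os';
import path from 'node:path';

function readJson(file) {
  try { return JSON.parse(fs.readFileSync(file, 'utf8')); } catch { return null; }
}

// --scan flag, then the registry, then legacy scan_roots, then CWD only.
function resolveProjectRoots(scanFlag) {
  if (scanFlag) return { mode: 'scan', roots: [scanFlag] };
  const registry = readJson(path.join(os.homedir(), '.xm', 'projects.json'));
  if (registry?.projects?.length) return { mode: 'registry', projects: registry.projects };
  const legacy = readJson(path.join(os.homedir(), '.xm', 'config.json'));
  if (legacy?.scan_roots?.length) return { mode: 'scan', roots: legacy.scan_roots };
  return { mode: 'cwd', roots: [process.cwd()] };
}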

Multi-Tool Install (Cursor / Codex / Kiro / Antigravity / OpenCode)

xm is published as a Claude Code marketplace plugin, but its 16 SKILLs can also be rendered into rule/steering formats consumed by other AI coding tools. A single source compiler (xm/lib/install/install-cli.mjs) emits per-tool artifacts.

# Interactive picker (scope + targets)
xm install
# or, when invoking the compiler directly
node xm/lib/install/install-cli.mjs --interactive

# Preview what would be installed (no fs writes)
node xm/lib/install/install-cli.mjs --list

# Install for one or more tools, project-local (default)
node xm/lib/install/install-cli.mjs --target cursor,codex,kiro,antigravity,opencode

# User-global install (~/.cursor/, ~/.codex/, ~/.kiro/, ~/.gemini/, ~/.config/opencode/)
node xm/lib/install/install-cli.mjs --target cursor --global

# Re-hash installed files against the manifest (R-SEC-13/15)
node xm/lib/install/install-cli.mjs --verify --target cursor

# Remove all xm-managed files; user content in AGENTS.md is preserved
node xm/lib/install/install-cli.mjs --uninstall --target cursor,codex

Per-tool layout:

| Tool | Skills | Slash invocation | Hook |
|---|---|---|---|
| Cursor | .cursor/rules/xm-*.mdc (frontmatter: description, alwaysApply) | agent-requested | .cursor/hooks.json (camelCase events) |
| Codex CLI | .codex/prompts/xm-*.md (project) or ~/.codex/prompts/xm-*.md (--global) + AGENTS.md index (≤ 16 KiB block) | /prompts:xm-<plug> | .codex/hooks.json / ~/.codex/hooks.json (requires codex features enable hooks or [features] hooks=true) |
| Kiro | .kiro/steering/xm-*.md (frontmatter: inclusion: auto\|manual) | n/a | .kiro/hooks/xm-*.kiro.hook (informational only — Kiro cannot block) |
| Antigravity | .agent/skills/xm-*.md (project) or ~/.gemini/antigravity/skills/xm-*.md (--global) + shared AGENTS.md index | agent-requested | not supported (no programmable hook API) |
| OpenCode | .opencode/skills/xm-*/SKILL.md (project) or ~/.config/opencode/skills/xm-*/SKILL.md (--global) | native skill discovery | not emitted |

Safety:

  • <!-- xm:BEGIN v2 --> ... <!-- xm:END --> markers isolate xm content inside files shared with the user (AGENTS.md). Pre-existing user content is preserved.
  • Existing files are rotated to .bak, .bak.1, .bak.2 (max 3 generations) on first overwrite; symlinked targets abort the install.
  • Lock files use O_EXCL atomic creation with a 60-second stale TTL (see the sketch after this list).
  • Each install writes a manifest under the target's xm/manifest.json directory (for example .cursor/xm/manifest.json or ~/.config/opencode/xm/manifest.json) with SHA-256 + HMAC self-checksum. --verify recomputes hashes; --uninstall rolls back exactly the recorded files.
  • R-SEC-02 supply-chain guard: source SKILL.md hashes are verified against xm/skills.checksums.json before render. --allow-unverified bypasses with a flagged audit entry.
  • Installs are idempotent — re-running with the same arguments produces zero diff.
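The lock bullet names a classic technique; here is a minimal sketch of O_EXCL acquisition with a stale TTL (file name, payload, and the unguarded steal are illustrative, not xm's exact implementation):

import fs from 'node:fs';

const STALE_MS = 60_000; // the 60-second stale TTL from the list above

function acquireLock(lockPath) {
  try {
    // flag 'wx' opens with O_CREAT | O_EXCL: creation fails if the file exists.
    fs.writeFileSync(lockPath, String(process.pid), { flag: 'wx' });
    return true;
  } catch (err) {
    if (err.code !== 'EEXIST') throw err;
    // If the holder crashed, the lock goes stale; steal it after the TTL.
    const ageMs = Date.now() - fs.statSync(lockPath).mtimeMs;
    if (ageMs > STALE_MS) {
      fs.unlinkSync(lockPath); // note: two stealers can race here in a real impl
      return acquireLock(lockPath);
    }
    return false;
  }
}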

See docs/multi-tool-install.md for the complete guide — capability matrix, per-tool install steps, manual verification in each IDE, security model, troubleshooting. The full design (PRD v2.1) is at .xm/build/projects/multi-tool-install/phases/02-plan/PRD.md.

Auto-propagation on update

xm update automatically re-renders skills to every installed LLM target (Cursor, Codex CLI, Kiro, Antigravity, OpenCode) when their global manifests are present. Per-file SHA-256 diffing skips unchanged targets. Pass --no-propagate to update only the Claude plugin.

xm update                   # update plugin + propagate to all installed targets
xm update --no-propagate    # Claude-only update, skip fan-out
xm install --propagate      # re-render every installed manifest target on demand
xm install --list-installed # print installed manifest inventory as JSON
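The per-file SHA-256 diffing described above amounts to hashing each installed artifact and comparing against the recorded manifest; a sketch with assumed manifest fields (files[].path, files[].sha256):

import { createHash } from 'node:crypto';
import fs from 'node:fs';

const sha256 = (file) =>
  createHash('sha256').update(fs.readFileSync(file)).digest('hex');

// A target is skipped when every installed file still matches its manifest hash.
function targetIsCurrent(manifest) {
  return manifest.files.every(
    (f) => fs.existsSync(f.path) && sha256(f.path) === f.sha256,
  );
}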

Quick Start

/xm:build plan "Build a REST API with JWT auth"

That single line:

  1. Creates a project + generates a PRD with requirements
  2. Auto-decomposes into tasks with done criteria
  3. Presents the plan for review (user approval)
  4. Agents execute tasks in parallel → quality verification

Want to skip Research/PRD and go straight to execution? Use --quick:

/xm:build plan "Build a REST API with JWT auth" --quick

Failed? Run /xm:build run again. Completed tasks are skipped; only the remaining ones execute.

Step-by-step tutorial (5 minutes)
# 1. Initialize
/xm:build init my-project

# 2. Gather requirements (optional but recommended)
/xm:build discuss --mode interview
# → Agent asks clarifying questions, generates CONTEXT.md

# 3. Generate PRD + decompose into tasks
/xm:build plan "Build a user auth system with JWT"
# → Auto-generates PRD + task list

# 4. Validate the plan
/xm:build plan-check
# → Checks 11 dimensions (atomicity, coverage, scope-clarity, ...)

# 5. Execute
/xm:build run
# → Agents execute tasks in parallel (DAG order)

# 6. Verify
/xm:build quality                  # test/lint/build checks
/xm:build verify-traceability      # R# ↔ Task ↔ AC matrix

# 7. Done!
/xm:build status

Why xm?

Most AI coding tools work off a checklist: SQL injection, null check, N+1. Checklists find patterns. Senior engineers find problems.

The difference is the questions they ask before they act. Before filing a security finding: can an attacker actually reach this path? Before debugging: when did this last work? Before raising severity: am I inflating this because I'm not sure?

xm bakes those questions into every agent prompt. Agents end up reasoning about context instead of pattern-matching through a list.

Before & After examples

Code review (x-review):

| | Checklist agent | xm agent |
|---|---|---|
| Finding | [Medium] src/api.ts:42 — Possible SQL injection | [Critical] src/api.ts:42 — req.query.id inserted directly into SQL template literal. Public API endpoint with no auth middleware. |
| Fix | Validate input. | db.query('SELECT * FROM users WHERE id = $1', [req.query.id]) |
| Why | (missing) | Unauthenticated public endpoint, input flows directly to query sink |

Planning (x-build):

| | Without principles | With principles |
|---|---|---|
| Approach | "Using microservices because it's modern" | "Monolith with module boundaries — no constraint requires separate deployment" |
| Risk | "Security risks" | "JWT secret rotation may invalidate active sessions — mitigate with grace period" |
| Done criteria | "Auth works properly" | "JWT endpoint returns 401 for expired token, refresh rotation tested" |

Debugging (x-solver):

| | Typical AI | xm |
|---|---|---|
| First action | Generate 5 hypotheses | Describe current state + find last known-good baseline |
| Evidence | "It seems like the issue is..." | "git bisect shows regression in commit abc1234, confirmed by test output" |
| Stuck | Retry same approach | Switch layer (was checking app code → now check infra/config) |

Thinking principles at a glance
| When you... | xm principle | Tool |
|---|---|---|
| Review code | Context determines severity — same pattern, different risk depending on exposure | x-review |
| Review code | No evidence, no finding — trace it in the diff or don't report it | x-review |
| Review code | When in doubt, downgrade — over-reporting erodes trust | x-review |
| Plan a project | Decide what NOT to build first — scope by exclusion | x-build |
| Plan a project | Name the risk, schedule it first — fail fast, not fail late | x-build |
| Plan a project | Can't verify it? Can't ship it — every task needs done criteria | x-build |
| Solve a problem | Diagnose state before hypothesizing — what's happening, not what's wrong | x-solver |
| Solve a problem | Anchor to known good — no baseline, no chase | x-solver |
| Solve a problem | Compound signals — never conclude from one log line | x-solver |
| Reflect | Why happened · Why found late · What to change in the process | x-humble |

How a senior engineer debugs — the thinking protocol embedded in x-solver:

DIAGNOSE ──→ HYPOTHESIZE ──→ TEST ──→ REFINE ──→ RESOLVE ──→ REFLECT
  1. "What's happening right now?" — Describe the observable state, not the problem.
  2. "When did it last work?" — Find the baseline. No baseline = find one first.
  3. "Why?" — with evidence — Corroborate from different sources. No evidence? Stop.
  4. "Stuck? Change the lens." — All hypotheses from the same layer? Look at a different one.
  5. "Show me it works." — Execution is the only proof.
  6. "Why did we miss this?" — Retrospect via x-humble.

Shared References

Common reference material lives in references/ (synced to marketplace as xm/references/). Skills pull these in on demand — progressive disclosure keeps each SKILL.md lean.

| Reference | Used by |
|---|---|
| ask-user-question-rule.md | 7 plugins (Dark-Theme rule for AskUserQuestion) |
| trace-recording.md | 9 plugins (trace hook protocol) |
| dimension-anchors.md | x-op strategies, x-review lenses, x-eval rubrics |
| self-score-protocol.md | all x-op strategies, x-agent solve/consensus |
| finding-severity.md | x-review, CLAUDE.md code review principles |

Plugins

Each plugin below is installable individually or bundled via xm.

| Plugin | Purpose | Key command |
|---|---|---|
| x-build | Project lifecycle & PRD pipeline | /xm:build plan "goal" |
| x-op | 17 multi-agent strategies | /xm:op debate "A vs B" |
| x-review | Judgment-based code review | /xm:review diff |
| x-solver | Structured problem solving | /xm:solver init "bug" |
| x-probe | Evidence-grade premise validation | /xm:probe "idea" |
| x-eval | Quality scoring & benchmarks | /xm:eval score file |
| x-humble | Structured retrospective | /xm:humble reflect |
| x-agent | Agent primitives & teams | /xm:agent fan-out "task" |
| x-trace | Execution tracing & cost | /xm:trace timeline |
| x-memory | Cross-session memory | /xm:memory inject |
| x-sync | Multi-machine .xm/ sync | xm sync push |
| x-ship | Release automation & squash | /xm:ship auto |
| x-humanize | Remove AI writing patterns | /xm:humanize audit text |
| xm | Bundle + config + pipeline | /xm pipeline release |

x-build

Takes a project from idea to verified delivery. Generates the PRD, runs deliberation modes, attaches a written acceptance contract to every task, and gates execution on quality.

/xm:build init my-api
/xm:build discuss --mode interview       # Multi-round requirements interview
/xm:build plan "Build a REST API with JWT auth"
/xm:build run                             # Agents execute in DAG order
Research ──→ PRD ──→ Plan ──→ Execute ──→ Verify ──→ Close
 [discuss]  [quality]  [critique]  [contract]  [quality]  [auto]
  interview   consensus   validate    adapt     verify-contracts
  validate
Features & commands
| Feature | Description |
|---|---|
| Multi-mode deliberation | discuss with 5 modes: interview, assumptions, validate, critique, adapt |
| PRD generation | Auto-generates 8-section PRD from research artifacts |
| PRD quality gate | On-demand judge panel — rubric-based scoring with guidance |
| Planning principles | Scope by exclusion, fail-fast risk ordering, plan as hypothesis, intent over implementation, verify or don't ship |
| Consensus review | 4-agent review (architect, critic, planner, security) until agreement |
| Acceptance contracts | done_criteria per task — auto-derived from PRD, verified at close |
| Strategy-tagged tasks | Tasks with --strategy flag execute via x-op with quality verification |
| Team execution | --team routes tasks to hierarchical teams (x-agent team system) |
| DAG execution | Tasks run in dependency order, parallel where possible |
| Cost forecasting | Per-task $ estimate with complexity-adjusted confidence |
| Quality dashboard | Per-task scores + project average in status output |
| Traceability matrix | R# ↔ Task ↔ AC ↔ Done Criteria with gap detection |
| Scope creep detection | Warns when new tasks overlap with PRD "Out of Scope" items |
| Error recovery | Auto-retry with exponential backoff, circuit breaker, git rollback (see the sketch after the command table below) |
| plan-check (11 dims) | atomicity, deps, coverage (incl. done_criteria), granularity (upper bound >15), completeness, context, naming (44-verb dict), tech-leakage, scope-clarity (Out of Scope match), risk-ordering (DAG-based), overall |
| Domain-aware done_criteria | Auto-generated based on task domain, size tier, and PRD NFR targets |

| Category | Commands |
|---|---|
| Project | init, list, status, next [--json], close, dashboard |
| Phase | phase next/set, gate pass/fail, checkpoint, handoff --full, handon |
| Plan | plan "goal", plan-check [--strict], prd-gate [--threshold N], consensus [--round N] |
| Tasks | tasks add [--deps] [--size] [--strategy] [--team] [--done-criteria], tasks done-criteria, tasks list, tasks remove [--cascade], tasks update [--no-commit], tasks reopen <id> --reason "..." [--cascade], later add/list/promote/dismiss/verify-scope |
| Steps | steps compute/status/next |
| Execute | run, run --json, run-status |
| Verify | quality, verify-coverage, verify-traceability, verify-contracts, verify-review-fix [--init] |
| Analysis | forecast, metrics, decisions, summarize |
| Export | export --format md/csv/jira/confluence, import |
| Release | release detect, release squash, release bump, release commit, release test, release trace, release diff-report |
| Settings | mode developer/normal, config set/get/show |
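The Error recovery row above combines retry with exponential backoff and a circuit breaker. A minimal sketch of those two mechanisms together; thresholds, delays, and names are assumptions rather than x-build's actual values:

// Shared breaker state: opens after `threshold` consecutive failures and
// rejects new work until the cooldown elapses.
const breaker = { failures: 0, openedAt: 0, threshold: 3, cooldownMs: 60_000 };

async function runTask(task, maxRetries = 3) {
  const open = breaker.failures >= breaker.threshold &&
               Date.now() - breaker.openedAt < breaker.cooldownMs;
  if (open) throw new Error('circuit breaker is OPEN'); // cf. Troubleshooting below

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const result = await task();
      breaker.failures = 0; // any success closes the breaker
      return result;
    } catch (err) {
      breaker.failures += 1;
      breaker.openedAt = Date.now();
      if (attempt === maxRetries) throw err;
      // Exponential backoff: 1s, 2s, 4s, ...
      await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** attempt));
    }
  }
}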

x-op

17 multi-agent strategies. Each one self-scores its output and can delegate verification to x-eval.

/xm:op refine "Payment API design" --rounds 4 --verify
/xm:op tournament "Best approach" --agents 6 --bracket double
/xm:op debate "REST vs GraphQL"
/xm:op investigate "Redis vs Memcached" --depth deep
/xm:op compose "brainstorm | tournament | refine" --topic "v2 plan"
| Category | Strategies |
|---|---|
| Collaboration | refine, brainstorm, socratic |
| Competition | tournament, debate, council |
| Pipeline | chain, distribute, scaffold, compose, decompose |
| Analysis | review, red-team, persona, hypothesis, investigate |
| Meta | monitor |

Quality features:

  • Confidence Gate: Pre-execution 4-question checklist — blocks underspecified tasks before wasting agent tokens
  • Self-Score + 4Q Check: Every strategy auto-scores (1-10) then verifies evidence, requirements, assumptions, consistency
  • --verify: Delegates quality verification to x-eval using the strategy's default rubric
  • Result Persistence: Strategy results saved to .xm/op/ — viewable in x-dashboard
  • Compose presets: --preset analysis-deep, --preset security-audit, --preset consensus
  • Output Quality Contract: Evidence-based, falsifiable, dimension-tagged arguments with per-category Dimension Anchors
All 17 strategies
| Strategy | Pattern | Best for |
|---|---|---|
| refine | Diverge → converge → verify | Iterating on a design |
| tournament | Compete → seed → bracket → winner | Picking the best solution |
| chain | A → B → C with conditional branching | Multi-step analysis |
| review | Parallel multi-perspective (dynamic scaling) | Code review |
| debate | Pro vs Con + Judge → verdict | Trade-off decisions |
| red-team | Attack → defend → re-attack | Security hardening |
| brainstorm | Free ideation → cluster → vote | Feature exploration |
| distribute | Split → parallel → merge | Large parallel tasks |
| council | Weighted deliberation → consensus | Multi-stakeholder decisions |
| socratic | Question-driven deep inquiry | Challenging assumptions |
| persona | Multi-role perspective analysis | Requirements from all angles |
| scaffold | Design → dispatch → integrate | Top-down implementation |
| compose | Strategy piping (A \| B \| C) | Complex workflows |
| decompose | Recursive split → leaf parallel → assemble | Large implementations |
| hypothesis | Generate → falsify → adopt | Bug diagnosis, root cause |
| investigate | Multi-angle → cross-validate → gap analysis | Unknown exploration |
| monitor | Observe → analyze → auto-dispatch | Change surveillance |

Which strategy should I use?
| Situation | Strategy | Why |
|---|---|---|
| Iterate on a design | refine | Diverge → converge → verify |
| Pick the best solution | tournament | Compete → anonymous vote |
| Code review | review | Multi-perspective parallel review |
| REST vs GraphQL tradeoff | debate | Pro/con + judge verdict |
| Find a bug's root cause | hypothesis | Generate → falsify → adopt |
| Large feature implementation | decompose | Recursive split → parallel → merge |
| Security hardening | red-team | Attack → defend → report |
| Feature brainstorming | brainstorm | Free ideation → cluster → vote |
| Unknown territory exploration | investigate | Multi-angle → gap analysis |

Not sure? Run /xm:op list to see all strategies with descriptions.

Options
--rounds N              Round count (default 4)
--preset quick|thorough|deep|analysis-deep|security-audit|consensus
--agents N              Number of agents (default: agent_max_count)
--model sonnet|opus     Agent model
--target <file>         Review/red-team/monitor target
--depth shallow|deep|exhaustive   Investigation depth
--verify                Delegate quality validation to x-eval
--threshold N           Quality threshold (default 7)
--vote                  Enable voting (brainstorm)
--dry-run               Show execution plan only
--resume                Resume from checkpoint
--explain               Include decision trace
--pipe <strategy>       Chain strategies (compose)

x-review

Multi-perspective code review that reasons about each finding instead of pattern-matching it against a checklist.

/xm:review diff                     # Review last commit
/xm:review diff HEAD~3              # Review last 3 commits
/xm:review pr 142                   # Review GitHub PR
/xm:review file src/auth.ts         # Review specific file
/xm:review diff --specialists       # Enhance lenses with domain specialist agents
| Feature | Description |
|---|---|
| 4 default lenses | security, logic, perf, tests (expandable to 7: +architecture, docs, errors) |
| --specialists | Injects matching specialist agent rules (security-agent, performance-agent, qa-agent, etc.) as lens preambles for deeper domain expertise |
| Judgment framework | Each lens has principles, judgment criteria, severity calibration, ignore conditions |
| Why-line requirement | Every finding must cite which severity criterion applies — no vague reports |
| Challenge stage | Leader validates each finding's severity before final report |
| Consensus elevation | 2+ agents report same issue → severity promoted + [consensus] tag |
| Recall Boost | After severity filtering, second pass scans 6 categories (stubs, contradictions, cross-refs, silent behavior changes, missing error paths, off-by-one) as [Observation] tags |
| --thorough | Dedicated recall agent with fresh context, 10 observations max, aggressive auto-promotion |
| Severity disambiguation | Architecture lens: "this diff introduced it" → Medium vs "follows existing convention" → Low |
| Verdict | LGTM (0 Critical, 0 High, Medium ≤ 3) / Request Changes (High 1-2 or Medium > 3) / Block (1+ Critical or High > 2) — see the sketch below |

Review principles: Context determines severity · No evidence = no finding · No fix direction = no finding · When in doubt, downgrade
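The Verdict row is effectively a three-way threshold function; transcribing it directly:

// LGTM: 0 Critical, 0 High, Medium <= 3.
// Request Changes: High 1-2, or Medium > 3.
// Block: 1+ Critical, or High > 2.
function verdict({ critical = 0, high = 0, medium = 0 }) {
  if (critical >= 1 || high > 2) return 'Block';
  if (high >= 1 || medium > 3) return 'Request Changes';
  return 'LGTM';
}

verdict({ medium: 4 }); // 'Request Changes'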


x-solver

4 strategies for working through a problem. Auto-picks one based on what the problem actually looks like.

/xm:solver init "Memory leak in React component"
/xm:solver classify          # Auto-recommend strategy
/xm:solver solve             # Execute with agents
| Strategy | Pattern | Best for |
|---|---|---|
| decompose | Break → solve leaves → merge | Complex multi-faceted problems |
| iterate | Diagnose → hypothesis → test → refine | Bugs, debugging, root cause |
| constrain | Elicit → candidates → score → select | Design decisions, tradeoffs |
| pipeline | Auto-detect → route to best strategy | When unsure |

DIAGNOSE → HYPOTHESIZE → TEST → REFINE → RESOLVE → x-humble
[state+baseline] [falsifiable] [one var] [switch/revert] [exec verify] [why late?]

x-probe

Should you build this? Probe before you commit. Grades every premise on the evidence behind it, runs a pre-mortem, and only returns PROCEED when no fatal assumption is left.

/xm:probe "Build a payment system"    # Full probe session
/xm:probe verdict                      # Show last verdict
/xm:probe list                         # Past probes
FRAME ──→ PROBE ──→ STRESS ──→ VERDICT
[premises]  [socratic]  [pre-mortem]  [PROCEED/RETHINK/KILL]
                        [inversion]
                        [alternatives]
Features
| Feature | Description |
|---|---|
| 6 thinking principles | Default is NO, kill cheaply, evidence with provenance, pre-mortem, code is expensive, ask don't answer |
| Premise extraction | Auto-identifies 3-7 assumptions with evidence grades (assumption/heuristic/data-backed/validated), ordered by fragility then evidence |
| Socratic probing | Grade-calibrated questioning — heavy on assumptions, light on validated premises |
| 3-agent stress test | Pre-mortem (failure scenarios) + inversion (reasons NOT to) + alternatives (without code) |
| Domain detection | Auto-classifies idea domain (technology/business/market) for specialized questions |
| Reclassification triggers | Grade auto-upgrades/downgrades based on user evidence during probing |
| Verdict | PROCEED / RETHINK / KILL with evidence summary — fatal+assumption blocks PROCEED |
| Verdict schema v2 | Structured JSON with domain, evidence grades, gaps — consumed by x-solver/x-humble/xm:memory |
| x-build link | PROCEED auto-injects validated premises, evidence gaps, and kill criteria into CONTEXT.md |
| x-humble link | KILL triggers retrospective on why the idea reached probe stage |

x-eval

Score outputs against rubrics, benchmark strategies head-to-head, and measure how quality moves between commits.

/xm:eval score output.md --rubric code-quality     # Judge panel scoring
/xm:eval score output.md --rubric code-quality \
  --assert "handles empty input" \
  --assert "no global state"               # + binary outcome assertions (HARD FAIL gate)
/xm:eval compare old.md new.md --judges 5          # A/B comparison
/xm:eval bench "Find bugs" --strategies "refine,debate,tournament" --trials 5
                                                  # pass@k/pass^k reliability metrics
/xm:eval diff --from abc1234 --quality              # Change measurement
/xm:eval diff --baseline v1.5.0                     # Regression check vs pinned tag
/xm:eval consistency x-review                       # Test specific plugin consistency
/xm:eval report --sample-transcript 2              # Dump judge rationales to audit scores
/xm:eval calibrate --rubric code-quality            # Human-vs-judge bias check
Commands & rubrics
| Command | What it does |
|---|---|
| score | N judges score content against rubric (1-10, weighted avg, consensus σ); --assert adds binary HARD FAIL gates; judges may return N/A for inapplicable criteria (weight renormalized) |
| compare | A/B comparison with position bias mitigation |
| bench | strategies × models × trials with pass@k/pass^k reliability metrics, σ-aware recommendation, broken-task warning, and Score/$ optimization |
| diff | Git-based change analysis + optional before/after quality comparison; --baseline <tag> flags regressions (delta ≤ -0.5 → ⛔) for CI gates |
| consistency | Measure plugin output consistency across repeated runs |
| rubric | Create/list custom evaluation rubrics |
| report | Aggregated evaluation history |
| calibrate | Human-vs-judge calibration loop: surfaces per-criterion bias (inflate/deflate); systematic bias ≥ 1.0 triggers explicit guidance and gates automated judge use |

Built-in rubrics: code-quality, review-quality, plan-quality, general — each declares a pass_threshold (7.0–8.0) used by bench to compute pass@k / pass^k. Custom rubrics may override via the pass_threshold field.
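For reference, the conventional estimators behind those metrics, assuming x-eval follows the standard definitions (n trials, c passes): pass@k is the chance that at least one of k sampled trials passes, pass^k the chance that all k pass.

// Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k), in stable product form.
function passAtK(n, c, k) {
  if (n - c < k) return 1; // fewer failures than k: some pass is guaranteed
  let probAllFail = 1;
  for (let i = n - c + 1; i <= n; i++) probAllFail *= 1 - k / i;
  return 1 - probAllFail;
}

// Naive pass^k estimator: empirical pass rate raised to the k-th power.
const passPowK = (n, c, k) => (c / n) ** k;

passAtK(5, 3, 2);  // 0.9
passPowK(5, 3, 2); // 0.36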

Audit trail: score and bench preserve per-judge rationales in .xm/eval/results/; read them via report --sample-transcript N to verify scores aren't just aggregate vibes.

Domain presets: api-design, frontend-design, data-pipeline, security-audit, architecture-review

Bias-aware judging: High-confidence x-humble lessons (confirmed 3+) surfaced as optional judge context


x-humble

Learn from failures together. The retrospective process is the point — not the list of rules left at the end.

/xm:humble reflect              # Full session retrospective
/xm:humble review "why scaffold?"  # Deep-dive on specific decision
/xm:humble lessons              # View accumulated lessons
/xm:humble apply L3             # Apply lesson to CLAUDE.md
CHECK-IN ──→ RECALL ──→ IDENTIFY ──→ ANALYZE ──→ ALTERNATIVE ──→ COMMIT
[accountability]  [summary]  [failures]   [root cause]  [steelman]    [KEEP/STOP/START]
Features
| Feature | Description |
|---|---|
| Phase 0 Check-In | Verify previous COMMIT items before new retrospective |
| Root cause analysis | Why it happened · Why it was discovered late · What process should change |
| Bias analysis | 7 cognitive biases detected (anchoring, confirmation, sunk cost, ...) |
| Cross-session patterns | Recurring bias tags surfaced automatically |
| Steelman Protocol | User proposes alternative first, agent strengthens it |
| Comfortable Challenger | Agent challenges self-rationalization directly |
| KEEP/STOP/START | Lessons stored, optionally applied to CLAUDE.md |
| x-solver link | After problem solving, auto-suggests retrospective for non-trivial problems |
| Action Quality Contract | Every action must be verifiable, scoped, and traced to root cause. Action Type Taxonomy: PROCESS, PROMPT, CONTEXT, TOOL, CALIBRATION |

x-dashboard

Web dashboard for .xm/ project state. Browse builds, probes, solvers, reviews, evals, humble lessons, traces, memory, and costs in one read-only view. No build chain to set up.


bun x-dashboard/lib/x-dashboard-server.mjs              # Start (standalone)
bun x-dashboard/lib/x-dashboard-server.mjs --stop       # Stop
/xm:dashboard                                       # Start from Claude Code
Browser ──→ Bun HTTP :19841 ──→ .xm/ (read-only)
  │
  ├── Home (summary + cost widget)
  ├── Builds (projects list + detail + tasks + context docs + PRD)
  ├── Probes (history + detail + diff between two verdicts)
  ├── Solvers (list + detail with phase data)
  ├── Traces (timeline + token/cost per span)
  ├── Memory (decisions with search/filter)
  └── Config
Features
| Feature | Description |
|---|---|
| Multi-root workspaces | --scan ~/work or scan_roots in ~/.xm/config.json — view all projects across directories |
| Probe verdict diff | Side-by-side comparison of two probe runs with premises change highlighting |
| Cost/token dashboard | Aggregate cost by model (haiku/sonnet/opus) and date from x-trace data |
| Brutalism UI | Hard shadows, monospace accents, dark/light toggle |
| Search | Cross-data search across projects, tasks, probes, solvers, context docs |
| Export | Download project/probe/solver detail as markdown |
| Auto-refresh | 3-second polling with ETag/304 — no scroll/focus reset |
| Accessibility | Skip-link, ARIA labels, keyboard navigation, focus indicators |
| Zero dependencies | Vanilla HTML/JS/CSS, Bun HTTP server, no npm packages |
| Session handoff card | Full handoff display — commits, decisions, quality scores, test status, blockers, stashes (collapsible) |
| Multi-root session state | Fetches handoff from all workspaces in parallel, shows most recent |

x-agent

Agent primitives and autonomous behaviors on top of Claude Code's native Agent tool. Use primitives when you want to control the steps; switch to autonomous behaviors when you'd rather let agents find the path themselves (stigmergy via a shared board).

# Primitives
/xm:agent fan-out "Find bugs in this code" --agents 5
/xm:agent delegate security "Review src/auth.ts"
/xm:agent broadcast "Review this PR" --roles "security,perf,logic"

# Autonomous behaviors
/xm:agent research "Redis pub/sub limits" --budget 5
/xm:agent solve "CI-only test failure in auth" --agents 3
/xm:agent consensus "JWT vs Session for auth" --agents 4
/xm:agent swarm "Increase test coverage to 80%" --agents 5

# Team
/xm:agent team create eng --template engineering
/xm:agent team assign eng "Build payment system"
| Layer | Commands | What it does |
|---|---|---|
| Primitives | fan-out, delegate, broadcast | Direct agent control — parallel, specialized, or role-based |
| Autonomous | research, solve, consensus, swarm | Goal-driven — agents explore, adapt, and converge on their own |
| Team | team create/assign/status | Hierarchical: Team Leader (opus) → Members |
| Presets | 15 role presets | Cross-cutting roles injected into all layers |

Key distinction: x-op = conductor with a score (leader controls every phase). x-agent = jazz band (agents listen to each other and adapt).

Autonomous options: --budget N (max rounds), --depth shallow|deep|exhaustive, --focus <hint>, --web (allow web search).

Model auto-routing: architect → opus, executor → sonnet, scanner → haiku. Override with --model.


x-trace

See what your agents actually did. Walk the timeline, check the cost, replay any past run.

/xm:trace timeline              # Agent execution timeline
/xm:trace cost                  # Token/cost breakdown per agent
/xm:trace replay <id>           # Replay a past execution
/xm:trace diff <id1> <id2>      # Compare two execution runs

x-memory

Persist decisions and patterns across sessions. Auto-inject relevant context on start.

/xm:memory save --type decision "Redis for caching — ACID not required, read-heavy"
/xm:memory save --type failure "Auth middleware order matters — apply before rate limiter"
/xm:memory list                 # List all memories (--type, --tag filters)
/xm:memory show mem-001         # Show full memory content
/xm:memory recall "auth"        # Search past decisions and patterns
/xm:memory forget mem-003       # Delete a memory
/xm:memory inject               # Auto-inject relevant memories into current context
/xm:memory export --format json # Export memories to JSON or Markdown
/xm:memory import backup.json   # Import memories with dedup
/xm:memory stats                # Show memory statistics by type
| Type | Purpose | Auto-injected |
|---|---|---|
| decision | Architecture/tech choices with rationale | On related file changes |
| failure | Past mistakes with lessons | On similar patterns |
| pattern | Reusable solutions | On matching context |

x-sync

Synchronize .xm/ project data across multiple machines via a central API server.

Server Deployment

Option A: Docker (recommended for remote)

# One-line deploy
XM_SYNC_API_KEY=secret docker compose -f x-sync/docker-compose.yml up -d

# Or pull from GHCR
docker run -d -p 19842:19842 -e XM_SYNC_API_KEY=secret \
  -v x-sync-data:/root/.xm/sync jinwoo/xm-sync:latest

Option B: Standalone install

# Install to ~/.local/bin/x-sync-server
curl -fsSL https://raw.githubusercontent.com/x-mesh/xm/main/x-sync/install.sh | bash -s server

# Run
XM_SYNC_API_KEY=secret x-sync-server --port 19842

Client Setup

# Install CLI
curl -fsSL https://raw.githubusercontent.com/x-mesh/xm/main/x-sync/install.sh | bash -s client

# Configure
x-sync setup

# Use
x-sync push     # push .xm/ to server
x-sync pull     # pull other machines' data
x-sync status   # show config and sync state

Or use directly in Claude Code: /xm:sync push, /xm:sync pull, /xm:sync setup

| Feature | Detail |
|---|---|
| Push | SHA-256 hash dedup, batch POST |
| Pull | Since-timestamp incremental, skip own machine |
| Auth | API key (X-Api-Key header) |
| Storage | SQLite WAL on server |
| Offline | SessionEnd hook queues to .sync-queue/, drains on next push |
| Machine ID | Auto-generated from hostname, stored in ~/.xm/sync.json |
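A minimal sketch of the push half of that table; the endpoint path and payload fields are assumptions, only the X-Api-Key header and the hash-dedup idea come from the table:

import { createHash } from 'node:crypto';
import fs from 'node:fs';

const sha256 = (buf) => createHash('sha256').update(buf).digest('hex');

// Push one .xm/ file; the server can dedup by content hash before storing.
async function pushFile(serverUrl, relPath) {
  const data = fs.readFileSync(relPath);
  const res = await fetch(`${serverUrl}/push`, { // hypothetical endpoint
    method: 'POST',
    headers: {
      'X-Api-Key': process.env.XM_SYNC_API_KEY,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      path: relPath,
      sha256: sha256(data),
      data: data.toString('base64'),
    }),
  });
  if (!res.ok) throw new Error(`push failed: ${res.status}`);
}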

x-ship

Release automation: squash WIP commits, bump the version, push. Works on xm marketplace plugins and on standalone projects (Node.js, Rust, Python, Go).

/xm:ship                # Interactive: test → review → release
/xm:ship auto           # Squash + bump + push, no gates
/xm:ship status         # Show commits since last release
/xm:ship patch          # Explicit patch bump
| Feature | Description |
|---|---|
| Release CLI | 7 subcommands: detect, diff-report, squash, bump, test, commit, trace |
| WIP squash | Auto-classifies WIP commits (tm(), fixup!, wip:) and squashes them |
| Quality gates | Optional test + review gates before release |
| Standalone support | Auto-detects package.json, Cargo.toml, pyproject.toml, go.mod |
| Release metrics | Records version, bump type, test/review results to .xm/traces/ |
| Diff-based analysis | Per-commit diff report for intelligent squash grouping |

x-humanize

Detect AI-writing patterns and rewrite generated text into natural human prose. The catalog draws from Wikipedia's "Signs of AI writing" list (English) and observed Korean AI-slop conventions.

/xm:humanize audit <text>          # Report AI patterns only, no rewrite
/xm:humanize light <text>          # Minimal edits, preserve original structure
/xm:humanize <text>                # Default: medium intensity rewrite
/xm:humanize strong <text>         # Rebuild prose aggressively, preserve facts
/xm:humanize voice <file> <text>   # Match voice of sample file
/xm:humanize --lang ko <text>      # Force Korean output
| Feature | Description |
|---|---|
| Pattern catalog | Korean (KO-1 ~ KO-40) + English (EN-1 ~ EN-22), each tagged with severity (High/Medium/Low). Korean covers translation-ese, mechanical parallelism, hedging tics, formal-tone overuse, emoji bullets, etc. |
| Genre-aware filter | Six genres (column / report / blog / formal / marketing / README) drop findings the genre legitimately uses — formal register (격식체) in formal docs, 1) 2) 3) in technical docs, em-dashes in essays. Threshold knobs (KO-26 recommendation-style endings 5→8 in formal, KO-39 quotation marks 5→8 in marketing). |
| Change-rate guardrails | < 30% proceed · 30-50% warn and re-verify fact inventory · > 50% hard stop, refuse to output. Length-aware: short inputs use absolute change-count thresholds (5 / 10) instead of percentages. See the sketch after this table. |
| Auto-downshift | When KO-26 (recommendation-style endings) hits ≥ 5 and KO-31 (short-sentence monotony) fires 5+ times consecutively, force light intensity even if the user asked for medium or strong — prevents the change-rate budget from blowing up on a single paragraph. |
| Fact inventory | Named entities, metrics, dates, citations recorded before rewrite. The rewrite must restore any dropped fact and never fabricate one. Vague claims stay vague rather than become specific. |
| Voice calibration | Voice sample overrides genre rules — match the user's sentence-length distribution, vocabulary level, and transition habits. Avoids "clean but soulless" output. |
| Anti-AI audit pass | Required Step 5 — internally asks "what still makes this obviously AI-generated?" and revises once more. Catches leftover em-dashes, sycophantic openers, trailing chatbot disclaimers. |
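The guardrail row reduces to a small decision function; a direct transcription of the thresholds above (the short-input cutoff itself is an assumption, the doc only says "short"):

// Returns 'proceed' | 'warn' | 'stop' per the change-rate guardrails.
function changeRateGuardrail({ changedChars, totalChars, changeCount }) {
  const shortInput = totalChars < 500; // assumed cutoff
  if (shortInput) {
    // Short inputs: absolute change counts, not percentages.
    if (changeCount > 10) return 'stop';
    return changeCount > 5 ? 'warn' : 'proceed';
  }
  const rate = changedChars / totalChars;
  if (rate > 0.5) return 'stop';    // hard stop, refuse to output
  if (rate >= 0.3) return 'warn';   // re-verify the fact inventory
  return 'proceed';
}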

Principles: Meaning preserved 100% · Span-grounded edits only (no fix direction = no finding) · Genre kept (column ↛ essay) · Over-polish refused (>50% change rate)


Quality & Learning Pipeline

Each plugin's thinking principles feed the next one. What gets caught in review turns into a planning constraint; what fails in solve turns into a humble lesson.

Example: building a payment API

  1. x-build plan → PRD goal has "and"? Split into two projects. (planning principle)
  2. x-build consensus → critic finds "retry logic not specified for payment gateway timeout" (thinking)
  3. x-build run → agents execute with done_criteria as acceptance contracts
  4. x-review diff → finds unhandled error path, Challenge stage validates it's genuinely High (judgment)
  5. x-solver iterate → diagnoses state, anchors to last passing test, traces with evidence (thinking protocol)
  6. x-humble reflect → "Why was the retry gap found during review, not planning?" → lesson saved (retrospective)
Full pipeline diagram
x-probe → Premise Validation (PROCEED/RETHINK/KILL)
     ↓
x-build plan → PRD Quality Gate (7.0+) → Consensus Review (4 agents)
     ↓
x-build tasks done-criteria → Acceptance contracts from PRD
     ↓
x-op strategy --verify → Judge Panel (bias-aware) → Auto-retry
     ↓
x-eval score → Per-task quality tracking → Project quality dashboard
     ↓
x-build verify-contracts → Done criteria fulfillment check
     ↓
x-humble reflect → Root cause + bias analysis → KEEP/STOP/START lessons
     ↓
lessons → CLAUDE.md + x-eval judge context → Next session applies patterns
| Component | Mechanism |
|---|---|
| Self-Score | Every x-op strategy auto-scores against its mapped rubric |
| --verify loop | Judge panel (bias-aware) → fail → feedback → re-execute (max 2) |
| PRD consensus | architect + critic + planner + security with principle-backed prompts |
| Acceptance contracts | done_criteria auto-derived from PRD → injected into agents → verified at close |
| Auto-handoff | Phase transitions preserve decisions, discard exploration noise |
| plan-check (11 dims) | atomicity, deps, coverage (incl. done_criteria), granularity (upper bound >15), completeness, context, naming (44-verb dict), tech-leakage, scope-clarity (Out of Scope match), risk-ordering (DAG-based), overall |
| Quality dashboard | x-build status shows per-task scores + project avg |
| Domain rubrics | 5 presets (api-design, frontend, data-pipeline, security, architecture) |
| Bias-aware judging | x-humble lessons (confirmed 3+) inform judge context |
| x-eval diff | Measures how skills changed + quality delta |

Benchmarks

Empirical consistency measurements across all plugins. Run with /xm:eval consistency.

| Plugin | Strategy | Consistency | Status |
|---|---|---|---|
| x-eval | rubric-scoring | 0.957 | PASS |
| x-humble | retrospective | 0.950 | PASS |
| x-op | debate | 0.930 | PASS |
| x-solver | decompose | 0.917 | PASS |
| x-review | multi-lens review | 0.890 | PASS |
| x-probe | premise-extraction | 0.826 | PASS |
| x-build | planning | 0.824 | PASS |
Average: 0.899 | All 7 plugins PASS | Verdict consistency: 100%

A/B vs vanilla Claude Code: xm matches vanilla F1 (0.857) with superior precision (1.0 vs 0.75).

Full data: benchmarks/


Architecture

xm/                              Marketplace repo
├── x-build/                        Project harness + PRD pipeline
├── x-op/                           Strategy orchestration (17 strategies)
├── x-eval/                         Quality evaluation + diff
├── x-humble/                       Structured retrospective
├── x-solver/                       Problem solving (4 strategies)
├── x-agent/                        Agent primitives & teams
├── x-probe/                        Premise validation (probe before build)
├── x-review/                       Code review orchestrator
├── x-trace/                        Execution tracing
├── x-memory/                       Cross-session memory
├── x-sync/                         Multi-machine .xm/ sync server
├── xm/                          Bundle (all skills) + shared config + server
└── .claude-plugin/marketplace.json  11 plugins registered
How it works
SKILL.md (spec)  →  Claude (orchestrator)  →  Agent Tool (execution)
       ↕                      ↕
x-build CLI (state)  ←  tasks update (callback)
  • SKILL.md: Orchestration spec that Claude reads. Defines plan→run flow, agent spawn patterns, error recovery.
  • x-build CLI: State management layer. Persists tasks/phases/checkpoints as JSON in .xm/build/. Does not spawn agents directly.
  • Claude: Interprets SKILL.md, spawns agents via Agent Tool, calls CLI callbacks on completion.
  • Persistent Server: Bun HTTP server caches CLI calls for fast repeated responses. AsyncLocalStorage for per-request isolation.
  • Bundle sync: scripts/sync-bundle.sh enforces standalone ↔ bundle file synchronization.

Agent Catalog

37 specialist agents ship with xm, split into core roles and domain experts. Plugins pull them in automatically when extra context would help. x-op refine injects them by topic; x-review picks them up with --specialists.

/xm agents list                        # List all 37 specialists
/xm agents match "payment API design"  # Find best agents for a topic
/xm agents get security --slim         # Show a specialist's rules
| Tier | Agents |
|---|---|
| Core | api-designer, compliance, database, dependency-manager, deslop, developer-experience, devops, docs, frontend, performance, qa, refactor, reviewer, security, sre, tech-lead, ux-reviewer |
| Domain | ai-coding-dx, analytics, blockchain, data-pipeline, data-visualization, eks, embedded-iot, event-driven, finops, gamedev, i18n, kubernetes, macos, mlops, mobile, monorepo, oke, prompt-engineer, search, serverless |

Catalog located at xm/agents/catalog.json. Each agent has a full rules file and a slim version (~30 lines) for prompt injection.


Configuration

/xm config set agent_max_count 10              # 10 agents parallel
/xm config set team_default_leader_model opus  # Team Leader model
/xm config set team_max_members 5              # Max members per team
/xm config show

Settings stored in .xm/config.json (project-level).

Cost Efficiency

Spend is controlled with two knobs. Model profiles decide which model handles which role; budget guards stop a run before it blows past the cap.

/xm config set model_profile economy           # Sonnet-centric, maximum savings
/xm config set model_profile default           # Default — Opus-centric (Opus 4.7 era)
/xm config set model_profile max               # Opus everywhere, quality-first
/xm config set budget '{"max_usd": 5.0}'       # Set session budget limit

The model_profile key expresses cost intent (how much to spend) on a single axis. Legacy names balanced and performance are auto-mapped to default and max respectively.

| Profile | architect | executor | designer | explorer | writer | Notes |
|---|---|---|---|---|---|---|
| economy | sonnet | sonnet | sonnet | haiku | haiku | ~70-85% savings vs default |
| default | opus | opus | sonnet | sonnet | haiku | Opus-centric baseline |
| max | opus | opus | opus | sonnet | haiku | ~1.5-2x vs default |

Key roles shown; the full mapping includes reviewer, security, designer, debugger, writer. See MODEL_PROFILES in source.

Script-only commands (config show, version, agents list, …) still route to haiku regardless of profile (see Model Guardrail in xm/skills/kit/SKILL.md).

Profile changes now automatically rewrite SKILL.md frontmatter model: fields and body markers (<!-- managed-model: <role> -->) via xm/lib/skill-frontmatter-sync.mjs — Claude Code runtime then enforces the chosen model deterministically per skill turn. Mapping table: xm/lib/skill-model-map.json.


Per-role overrides: /xm config set model_overrides '{"architect": "opus"}' on top of any profile.

Budget guards warn at 80% usage and block execution at 100%, tracked via session metrics. Rolling spend is tracked in .xm/spend-cache.json over a configurable window (budget.window_hours, default 24h). Per-project caps use budget.projects:

/xm config set budget '{"max_usd": 5.0, "window_hours": 48, "projects": {"my-proj": {"max_usd": 2.0}}}'
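A sketch of the guard those settings imply, assuming .xm/spend-cache.json holds timestamped entries (entries[].ts and entries[].usd are illustrative field names):

import fs from 'node:fs';

// Warn at 80% of the cap, block at 100%, over a rolling window.
function checkBudget(config, project) {
  const budget = config.budget ?? {};
  const cap = budget.projects?.[project]?.max_usd ?? budget.max_usd ?? Infinity;
  const windowMs = (budget.window_hours ?? 24) * 3_600_000;
  const cache = JSON.parse(fs.readFileSync('.xm/spend-cache.json', 'utf8'));
  const spent = cache.entries
    .filter((e) => Date.now() - e.ts < windowMs)
    .reduce((sum, e) => sum + e.usd, 0);
  if (spent >= cap) return 'block';
  if (spent >= 0.8 * cap) return 'warn';
  return 'ok';
}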

Cost vs Quality Benchmark

Same coding task (rateLimiter — sliding window) across three models:

| Criterion | haiku | sonnet | opus |
|---|---|---|---|
| Correctness | ✅ works | ✅ works | ✅ works |
| Edge cases (0, negative) | partial | ✅ full | ✅ full |
| Edge cases (NaN, Infinity, float) | | | ✅ isFinite + floor |
| Code quality | 6/10 | 8/10 | 9/10 |
| Estimated cost (medium task) | $0.07 | $0.81 | $4.05 |

Takeaway: haiku will write code that runs, but you'll be the one finding the edge cases. sonnet covers most production work fine. opus is what you reach for when robustness matters more than the cost. Profiles let you pick which trade-off you're making: economy (sonnet-centric), default (opus-centric), or max (all-opus). For a workload-specific estimate, run /xm:build forecast.

Automatic Model Routing

xm picks the cheapest model that can actually handle the request. Plain display commands fall to haiku (~78% cheaper); anything that needs to think escalates to sonnet or opus.

| Task type | Model | Examples |
|---|---|---|
| Display/query | haiku | config show, version, agents list, status, task list |
| Interactive wizard | sonnet | config (interactive), init, setup, auto-route confirmation |
| Reasoning | sonnet (escalates to opus when budget allows) | plan, run, strategy execution, code review |

Principle: if the output is determined by a script (not LLM reasoning), use haiku. The model is a messenger, not a thinker.

Cost-Aware Routing

The selection chain has three levels: model_overrides → profile → fallback. Every decision gets stamped with a correlation ID (ce-XXXXXXXX), so you can trace it back to the outcome later. Reach for model_overrides when you want to pin a specific role to a specific model regardless of profile.
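A sketch of that three-level chain; the profile map is a subset of the table in Cost Efficiency, while the final fallback and the function names are assumptions:

import { randomBytes } from 'node:crypto';

const PROFILES = {
  economy: { architect: 'sonnet', executor: 'sonnet', explorer: 'haiku' },
  default: { architect: 'opus', executor: 'opus', explorer: 'sonnet' },
  max:     { architect: 'opus', executor: 'opus', explorer: 'sonnet' },
};

// model_overrides -> profile -> fallback, stamped with a ce-XXXXXXXX id.
function pickModel(role, config) {
  const model =
    config.model_overrides?.[role] ??
    PROFILES[config.model_profile ?? 'default']?.[role] ??
    'sonnet'; // assumed fallback
  const correlationId = `ce-${randomBytes(4).toString('hex')}`; // 8 hex chars
  return { role, model, correlationId };
}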


Troubleshooting

Circuit breaker is OPEN
/xm:build circuit-breaker reset    # Manual reset
"No steps computed"
/xm:build steps compute            # Build DAG from task dependencies
plan-check shows errors
  1. Read each error message
  2. Fix: /xm:build tasks update <id> --done-criteria "..." or add missing tasks
  3. Re-run: /xm:build plan-check
"Cannot run — current phase is Plan"
/xm:build phase next               # Advance to Execute phase
/xm:build run                      # Then run
Task stuck in RUNNING
/xm:build tasks update <id> --status failed --error-msg "timeout"
/xm:build run                      # Will retry or skip

Contributing

Contributions are welcome. See the issues page for open tasks.


Requirements

  • Claude Code (Node.js >= 18 bundled)
  • macOS, Linux, or Windows
  • No external dependencies

License

MIT © x-mesh
