A content-addressed skill registry for AI coding agents, with LLM-judged loadout evaluation and autonomous GitHub discovery.
Treats skills as a measured, versioned portfolio instead of individually selected markdown files. Pins every skill version by SHA-256, bundles them into priority-ordered loadouts, and uses an LLM as a blind A/B judge to rank which loadouts actually work on your historical tasks.
┌──────────────────────────────────────────────────────────┐
│ 43 tests green · 8/10 LLM-judge attribution on V1 │
│ 11 real production skills harvested · 342+ audited │
│ Python stdlib only · Zero dependencies · MIT │
└──────────────────────────────────────────────────────────┘
git clone https://github.com/techieharry/ToolMaster.git
cd ToolMaster
# Pin some real skills into the content-addressed store
python -m toolmaster pin skills/refactor
python -m toolmaster pin skills/bug-fix
python -m toolmaster pin skills/test-generator
# Build two deliberately differentiated loadouts
python -m toolmaster loadout create refactor_stack <refactor-hash>
python -m toolmaster loadout create bugfix_stack <bug-fix-hash> <test-generator-hash>
# Ask the offer engine which loadout fits a task
python -m toolmaster offer "write a failing test for an off-by-one bug"
# → ranks loadouts by BM25F + outcome data, returns canonical/iterated/sideways
# One-shot: offer + dispatch to a specialist agent
python -m toolmaster autopilot "refactor the long handler function"
# → picks canonical loadout, calls Haiku 4.5 with skills loaded as system prompt,
#   returns specialist output, writes recording so future rankings learn

Every agent tooling system today — Anthropic's Skills spec, wshobson/agents, obra/superpowers, Multica, LobeHub — treats skills as individual, mutable, name-addressed assets. You install them one at a time, match them by keyword, and evaluate them one at a time.
Nobody asks:
- Does this stack of skills beat that stack on this type of task?
- Which combinations actually produce lower edit distances in practice?
- Can the same outcome signal drive recommendations without relying on download counts?
ToolMaster does.
| Primitive | ToolMaster | Anthropic Skills | wshobson/agents | obra/superpowers | Multica | LobeHub |
|---|---|---|---|---|---|---|
| Content-addressed skill versions (SHA-256) | ✅ | ❌ | ❌ | ❌ | ⚠ lockfile only | ❌ |
| Loadouts as the unit of composition | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Loadout-vs-loadout LLM eval | ✅ 8/10 on survival test | ❌ | per-skill only | ❌ | ❌ | ❌ |
| Autonomous GitHub scout + safety audit | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Cross-project data flywheel | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| Skill dispatch to specialist agents (cost-reduced) | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
No competitor has more than one of these. ToolMaster has all six.
The core question: does loadout-level evaluation actually produce actionable insight, or does it collapse to noise?
Setup:
- 7 real seed skills pinned from `skills/` (`refactor`, `bug-fix`, `code-review`, `data-analysis`, `document-writer`, `test-generator`, `api-integration`)
- Two deliberately differentiated loadouts: `refactor_stack` (refactor + code-review + document-writer) vs `bugfix_stack` (bug-fix + test-generator + data-analysis)
- 10 recorded tasks (5 refactor-coded, 5 bug-fix-coded)
- Blind A/B LLM judge via `anthropic/claude-haiku-4.5` through OpenRouter, position randomized per-task to eliminate bias
Result: 8/10 correct attribution.
- All 5 bug-fix tasks → `bugfix_stack` (5/5)
- 3/5 refactor tasks → `refactor_stack`, 1 tie, 1 defensible miss
- Per-task reasoning from the judge was specific and coherent — cited actual skill names ("Loadout A's refactor skill directly addresses the core task...") and task characteristics, not generic text
Full driver: tests/run_v1_survival.py · Results: docs/v1-survival-results.md
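The comparison step itself is deliberately simple: for each recorded task, the two loadouts are randomly assigned to the "A" and "B" positions before the judge sees them, and the verdict is mapped back afterwards. A minimal sketch of that shuffle-and-unmap shape, assuming a generic `ask_judge` callable (these names are illustrative, not the actual compare.py API):

```python
import random

def judge_task(task, loadout_x, loadout_y, ask_judge):
    """Blind A/B comparison for one recorded task.

    `ask_judge` is any callable taking (task, loadout_a, loadout_b) and
    returning "A", "B", or "tie". All names here are illustrative, not the
    actual compare.py API.
    """
    # Randomize which loadout sits in the "A" slot so the judge cannot
    # systematically favor whichever option is presented first.
    if random.random() < 0.5:
        a, b = loadout_x, loadout_y
    else:
        a, b = loadout_y, loadout_x

    verdict = ask_judge(task, a, b)   # positional verdict: "A" | "B" | "tie"

    # Map the positional verdict back onto the real loadouts.
    if verdict == "A":
        return a
    if verdict == "B":
        return b
    return "tie"
```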
Standard SKILL.md (Anthropic spec, zero changes)
↓
ToolMaster Registry — content-addressed, SHA-256, immutable versions
↓
Loadouts — named, priority-ordered stacks of skill hashes
↓
Replay-Eval — blind A/B LLM judge, loadout vs loadout, result cached
↓
Offer Engine (V2) — canonical / iterated / sideways ranking
↓
Delegate Primitive — specialist agent dispatch with pinned loadout
↓
Autonomous Layer — scout crawls GitHub, audits via LLM, writes proposals
Full diagram + data flow + model routing: ARCHITECTURE.md
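To make the middle of that stack concrete, here is a deliberately toy version of the ranking step: score each loadout on text relevance to the task plus a boost from recorded outcomes, then sort. Term overlap stands in for the real BM25F matcher and a plain win rate stands in for the outcome data; the canonical/iterated/sideways tiering on top of this ranking lives in offer.py and docs/roguelike-selection.md, and none of the names below are the real API:

```python
from collections import Counter

def score_loadout(task_text, loadout, outcomes):
    """Toy ranking: term overlap (stand-in for the BM25F matcher) plus a
    boost from past win rate on recorded tasks (stand-in for outcome data).
    Illustrative only; the real logic lives in matcher.py / suggest.py / offer.py.
    """
    task_terms = Counter(task_text.lower().split())
    skill_terms = Counter(
        word
        for skill in loadout["skills"]   # assumed shape: {"name": ..., "description": ...}
        for word in (skill["name"] + " " + skill["description"]).lower().split()
    )
    overlap = sum(min(task_terms[t], skill_terms[t]) for t in task_terms)

    wins, total = outcomes.get(loadout["name"], (0, 0))
    outcome_boost = (wins / total) if total else 0.0   # cold start: no boost

    return overlap + 2.0 * outcome_boost               # weight is arbitrary here

def rank_loadouts(task_text, loadouts, outcomes):
    return sorted(loadouts, key=lambda l: score_loadout(task_text, l, outcomes), reverse=True)
```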
toolmaster/ Python package, 14 modules, stdlib only
├── store.py Content-addressed SHA-256 store
├── loadout.py Named priority-ordered stacks + 7 agent targets
├── record.py Task recordings for replay-eval
├── compare.py Blind A/B LLM judge + cost preview + result cache
├── offer.py V2 offer engine (canonical/iterated/sideways)
├── delegate.py Skill dispatch primitive + autopilot
├── suggest.py Skill ranking with BM25F + performance boost
├── matcher.py 7-signal BM25F (exact/prefix/phrase/bm25f/jaccard/fuzzy)
├── quality.py Skill quality gate (0-100, critical-fail on security)
├── protocol.py Agent session lifecycle (checkin/out/used/return)
├── scout.py GitHub crawl + LLM audit + re-engineer + propose
├── sync.py Cross-project harvest + push
├── global_watcher.py Background daemon (poll/sync/scout scheduler)
└── cli.py 19 commands
tests/
├── test_v1_survival.py 43 stdlib-only tests, all green
└── run_v1_survival.py End-to-end survival driver (LLM judge)
skills/ 7 seed skills (refactor, bug-fix, ...)
hooks/ Claude Code lifecycle hooks + daemon launcher
docs/
├── v1-survival-results.md Full survival test output with per-task reasoning
├── discovery-sources.md Catalog of 3-tier scout discovery pipeline
├── competitive-analysis.md Full competitive landscape
├── immutable-parts.md Why content-addressing is load-bearing
├── roguelike-selection.md Design of the three-offer V2 UX
└── toolbox-master.md V3 background job design
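The load-bearing idea in store.py (see docs/immutable-parts.md) is plain content addressing: a skill version's identity is the SHA-256 of its bytes, so editing a skill produces a new immutable version instead of mutating the old one, and loadouts keep pointing at exactly what they pinned. A minimal sketch of that idea, with an assumed store location and layout that are illustrative rather than the real on-disk format:

```python
import hashlib
from pathlib import Path

STORE = Path.home() / ".toolmaster" / "store"   # illustrative location, not the real layout

def pin(skill_dir: str) -> str:
    """Hash a skill's files into one digest and copy them under that digest.

    The digest becomes the immutable version id: change one byte in SKILL.md
    and you get a new hash, so loadouts that pinned the old hash are untouched.
    """
    h = hashlib.sha256()
    files = sorted(p for p in Path(skill_dir).rglob("*") if p.is_file())
    for p in files:
        h.update(p.relative_to(skill_dir).as_posix().encode())  # file names affect identity
        h.update(p.read_bytes())
    digest = h.hexdigest()

    dest = STORE / digest
    if not dest.exists():                        # already pinned → no-op
        for p in files:
            target = dest / p.relative_to(skill_dir)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(p.read_bytes())
    return digest
```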
git clone https://github.com/techieharry/ToolMaster.git
cd ToolMaster
pip install -e .

Or run without install:

python -m toolmaster <command>

Python 3.10+. Zero runtime dependencies — toolmaster is pure stdlib. Tests are stdlib unittest (no pytest).
python -m toolmaster loadout apply refactor_stack --target claude # .claude/skills/
python -m toolmaster loadout apply refactor_stack --target cursor # .cursor/rules/
python -m toolmaster loadout apply refactor_stack --target codex # .codex/skills/
python -m toolmaster loadout apply refactor_stack --target aider # .aider/conventions/
python -m toolmaster loadout apply refactor_stack --target windsurf # .windsurf/rules/
python -m toolmaster loadout apply refactor_stack --target continue # .continue/context/
python -m toolmaster loadout apply refactor_stack --target agents    # .agents/skills/

# Single scout cycle: crawls 19 watched repos + GitHub topics + awesome-list mining
python -m toolmaster scout
# Persistent background daemon:
# Windows: ./hooks/run-watcher.ps1 start
#   Unix/Bash: ./hooks/run-watcher.sh start

The scout fetches external skill candidates, runs a static security pre-scan (prompt injection, shell execution, credential references, obfuscation), then an LLM safety audit via Haiku, and finally writes import/extract_techniques/reject proposals to ~/.toolmaster/proposals/. See docs/discovery-sources.md for the full pipeline.
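The static pre-scan is the cheap gate before any LLM call: pattern checks over the candidate skill text for the four categories above. A sketch of that shape, with illustrative patterns and return format (not scout.py's actual rules):

```python
import re

# Illustrative red-flag patterns, grouped by the categories the scout screens for.
RED_FLAGS = {
    "prompt_injection": [r"ignore (all|previous) instructions", r"disregard your system prompt"],
    "shell_execution":  [r"\bsubprocess\.", r"\bos\.system\(", r"\brm\s+-rf\b", r"curl .*\|\s*(sh|bash)"],
    "credentials":      [r"api[_-]?key", r"\bAKIA[0-9A-Z]{16}\b", r"-----BEGIN .*PRIVATE KEY-----"],
    "obfuscation":      [r"base64\.b64decode", r"\\x[0-9a-fA-F]{2}(\\x[0-9a-fA-F]{2}){8,}"],
}

def pre_scan(skill_text: str) -> dict[str, list[str]]:
    """Return the categories (and matching patterns) that fire on a candidate skill.

    An empty dict means the candidate passed the static gate; anything else is
    escalated to the LLM safety audit or rejected outright.
    """
    hits: dict[str, list[str]] = {}
    for category, patterns in RED_FLAGS.items():
        matched = [p for p in patterns if re.search(p, skill_text, re.IGNORECASE)]
        if matched:
            hits[category] = matched
    return hits
```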
| Phase | Status | Notes |
|---|---|---|
| V1 — CLI MVP | ✅ shipped | Store, loadouts, recording, compare. Survival test passed 8/10. |
| V2 — Offer engine | ✅ code shipped | Cold-start tested; warm-path needs accumulated data |
| V2 — Delegate primitive | ✅ shipped | Specialist agent dispatch with pinned loadout |
| V2 — Multi-agent targets | ✅ 7 agents | Claude, Cursor, Codex, Aider, Windsurf, Continue, generic |
| V2 — Scout expansion | ✅ shipped | 19 repos + topic + awesome-list mining, auth'd |
| V3 — Background jobs | 🔨 designed | Dedup, scout (built), iterate (designed) |
See roadmap.md for details.
python -m unittest tests.test_v1_survival -v
# Ran 43 tests in 1.5s — OK

All tests are stdlib-only and isolated to tmp directories. No network required.
MIT — see LICENSE. Built as open-source infrastructure; contributions welcome.
Built by Haris Yusuf in Toronto, with Claude Code (Opus 4.6 / 1M context) as the pair-programming co-author. Every commit credits both.
The problem I had: 12+ active projects with 20+ SKILL.md files scattered across them, zero visibility into which ones worked, which combinations worked, or which versions were "last-good" when I edited something and it got worse. Manual skill curation didn't scale past ~5 skills. I wanted a system that measures the skill portfolio the way I measure code quality — per-version, per-composition, per-outcome.
Nothing existing did that, so I built it.
Open to roles in AI developer tooling, agent infrastructure, LLM evaluation, and Python / system design work. Reach me via GitHub issues or the contact info on my profile.