An empirical benchmark comparing skills generated by five different skill synthesis systems for Claude Code. We measured whether injecting a generated skill improves task success rate and token efficiency compared to a no-skill baseline.
Measured head-to-head against 4 competing systems on tasks developers actually ship.
| Metric | Result |
|---|---|
| Token usage vs no-skill baseline | 169K tokens vs 897K baseline |
| Highest score against 4 competing systems | 78% combined avg (4 skills × 2 models) |
| Structural quality vs competitor average | 16/16 across 8 structural dimensions |
**Total token usage (lower is better):**

```
MEGA Code        ████░░░░░░░░░░░░░░░░  169K  ← 81% reduction
HF Upskill       ████████████████░░░░  763K
anthropic-skill  █████████████████░░░  826K
Baseline         ██████████████████░░  897K
skill-factory    ██████████████████████████████  1,448K
skill-builder    ██████████████████████████████████████████  2,024K
```
**Combined average score (higher is better):**

```
MEGA Code        ████████████████  78%  ← #1
HF Upskill       ██████████████░░  70%
anthropic-skill  █████████████░░░  65%
Baseline         █████████████░░░  65%
skill-builder    ██████████░░░░░░  50%
skill-factory    █████████░░░░░░░  43%
```
Skills are reusable instruction packages (SKILL.md files) for Claude Code-style agents. Different systems generate these skills in different ways — some from prompts, some from real coding sessions. This evaluation measures which approach produces the most effective skills.
The evaluation uses 4 task areas from a full-stack chat application project:
| Skill | Domain |
|---|---|
| Build Gemini Chat App | Full-Stack / API Integration |
| Multi-Agent Chat Interface | Backend / AI Agents |
| React Multi Chat Sessions | Frontend / State Management |
| Ruff FastAPI Setup | Code Quality / DevOps |
Each task is evaluated with 5 standardized test cases (generated by Claude Opus via HF Upskill), for a total of 20 evaluations per model per system.
| System | Type | Repository |
|---|---|---|
| MEGA Code | Fully autonomous (extracted from real coding session) | wisdomgraph/mega-code |
| HF Upskill | Semi-automatic (generated from prompt files) | huggingface/upskill |
| anthropic-skill-creator | Interactive (5-stage pipeline) | anthropics/skills |
| claude-code-skill-factory | Semi-automatic (interactive builder) | alirezarezvani/claude-code-skill-factory |
| skill-builder | Interactive (conversational) | metaskills/skill-builder |
All systems were evaluated under identical conditions using HF Upskill's eval harness:
- Same test cases per skill across all systems
- Same evaluation models — Claude Haiku 4.5 and Claude Sonnet 4.5
- Same harness — `upskill eval`
Three conditions per skill: baseline (no skill), competitor skill, and MEGA Code skill.
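As an illustrative sketch of that matrix (the directory layout and skill name here are hypothetical, not the repo's actual paths), evaluating every system's copy of one skill against the shared tests looks like:

```shell
# Dry-run sketch of the evaluation loop: each system's copy of a skill is
# scored against the same shared tests.json with both models.
# Paths and the skill name are illustrative; DRY_RUN=echo prints the
# commands instead of executing them.
DRY_RUN=echo
SKILL=build-gemini-chat-app
for SYSTEM in mega-code upskill anthropic-skill-creator \
              claude-code-skill-factory skill-builder; do
  $DRY_RUN upskill eval "generated_skills/${SYSTEM}/${SKILL}/SKILL.md" \
    --tests "generated_skills/${SYSTEM}/${SKILL}/tests.json" \
    -m haiku -m sonnet -v
done
```

Drop `DRY_RUN=echo` to actually run the harness; the baseline condition is the same tests with no skill injected.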
For full methodology details, see Skill Quality Experiment.md.
| System | Combined Tokens | Combined Avg Score | Tokens / Score Point |
|---|---|---|---|
| Base (No Skill) | 896,553 | 65% | 13,793 |
| MEGA Code | 169,122 | 78% | 2,168 |
| HF Upskill | 763,082 | 70% | 10,901 |
| anthropic-skill-creator | 825,970 | 65% | 12,707 |
| claude-code-skill-factory | 1,447,816 | 43% | 33,670 |
| skill-builder | 2,023,507 | 50% | 40,470 |
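The Tokens / Score Point column is just combined tokens divided by the combined average score; the table values can be reproduced with a quick arithmetic check:

```shell
# Recompute Tokens / Score Point for two rows of the table above.
awk 'BEGIN {
  printf "MEGA Code: %d tokens/point\n", 169122 / 78
  printf "Baseline:  %d tokens/point\n", 896553 / 65
}'
```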
| System | Sonnet Tokens | Sonnet Score | Haiku Tokens | Haiku Score |
|---|---|---|---|---|
| Base (No Skill) | 91,246 | 70% | 805,307 | 60% |
| MEGA Code | 91,907 | 80% | 77,215 | 75% |
| HF Upskill | 65,958 | 70% | 697,124 | 70% |
| anthropic-skill-creator | 188,562 | 60% | 637,408 | 70% |
| claude-code-skill-factory | 443,020 | 40% | 1,004,796 | 45% |
| skill-builder | 394,208 | 50% | 1,629,299 | 50% |
- MEGA Code achieves the highest combined score — 78% vs 65% baseline (+13pp) with the lowest total token usage (169K vs 897K baseline).
- HF Upskill is the next strongest — 70% combined score with moderate token usage.
- anthropic-skill-creator shows mixed results — matches baseline overall (65%) but excels on specific tasks.
- claude-code-skill-factory and skill-builder underperform — both increase token usage substantially and score below baseline.
- Haiku benefits the most from MEGA Code skills — 75% score (+15pp) with 90% token reduction.
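The headline reduction figures follow directly from the token tables above:

```shell
# Verify the 81% combined and 90% Haiku token reductions quoted above.
awk 'BEGIN {
  printf "Combined reduction: %.0f%%\n", 100 * (896553 - 169122) / 896553
  printf "Haiku reduction:    %.0f%%\n", 100 * (805307 - 77215) / 805307
}'
```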
For full transparency, the complete interaction transcripts used to generate each system's skills are available in used_prompts/. These let you inspect the exact inputs, intermediate steps, and outputs of every skill generation session.
| System | Transcripts |
|---|---|
| MEGA Code | Session trace — skill extraction from real coding session |
| HF Upskill | Session files — session.json and history_skill_gen.json per skill |
| anthropic-skill-creator | Transcripts — full 5-stage pipeline interactions |
| claude-code-skill-factory | Transcripts — interactive builder sessions |
| skill-builder | Transcripts — conversational skill generation sessions |
The original chat application development session (used by MEGA Code) is also available as a transcript.
- Claude Code v2.1.72+
- Python 3.12+ with uv
- Node.js v22.17.0+
- `ANTHROPIC_API_KEY` environment variable set
```shell
pip install upskill==0.2.1
```

Use the prompt files in used_prompts/ to generate skills and standardized test cases:
```shell
upskill generate "<contents-of-prompt-file>" --test-gen-model opus
```

This produces a skill_meta.json with the skill and 5 test cases per skill. These test cases are the canonical set used for all systems.
Generate competitor skills by following each system's workflow (see Systems and Generation Methods) using the same prompt content.
Copy the test cases from HF Upskill's skill_meta.json into a tests.json for each skill directory, so every system is evaluated against the same tests.
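A self-contained sketch of that copy step (the `tests` key is an assumption about skill_meta.json's schema; adjust the extraction to the actual layout):

```shell
# Demo of extracting the canonical test cases into tests.json.
# The sample skill_meta.json below is fabricated; the real file is produced
# by `upskill generate` and its schema may differ.
TMP=$(mktemp -d)
cat > "$TMP/skill_meta.json" <<'EOF'
{"skill": "ruff-fastapi-setup", "tests": [{"name": "case-1"}, {"name": "case-2"}]}
EOF

# Pull the "tests" array out into a standalone tests.json.
python3 - "$TMP" <<'EOF'
import json, os, sys
d = sys.argv[1]
meta = json.load(open(os.path.join(d, "skill_meta.json")))
with open(os.path.join(d, "tests.json"), "w") as f:
    json.dump(meta["tests"], f, indent=2)
EOF

cat "$TMP/tests.json"
```

Repeat the copy for each system's skill directory so every system is scored against identical tests.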
```shell
upskill eval <skill-path>/SKILL.md --tests <tests.json> -m haiku -m sonnet -v

# Export results
upskill runs --csv results_performance.csv
```

Raw result CSVs are in results/. Generated skills for all systems are in generated_skills/.
```
├── Skill Quality Experiment.md   # Full experiment document
├── results_summary.md            # Condensed results
├── generated_skills/             # All skills organized by system
│   ├── mega-code/
│   ├── upskill/
│   ├── claude-code-skill-factory/
│   ├── skill-builder/
│   └── anthropic-skill-creator/
├── results/                      # Raw evaluation CSVs
└── used_prompts/                 # Prompt files and full interaction transcripts
    └── <system>/                 # Per-system generation transcripts
```
This project is licensed under the Apache License 2.0.