wisdomgraph/mega-code-eval

Skill Quality Evaluation

An empirical benchmark comparing skills generated by five different skill synthesis systems for Claude Code. We measured whether injecting a generated skill improves task success rate and token efficiency compared to a no-skill baseline.

Real Work. Real Results.

Measured head-to-head against 4 competing systems on tasks developers actually ship.

  • 1/5 the token usage of the no-skill baseline (169K tokens vs 897K)
  • #1 combined score among all evaluated systems (78% average across 4 skills × 2 models)
  • 3× the structural quality of the competitor average (16/16 across 8 structural dimensions)

Token Usage

MEGA Code        ████░░░░░░░░░░░░░░░░  169K   ← 81% reduction
HF Upskill       ████████████████░░░░  763K
anthropic-skill  █████████████████░░░  826K
Baseline         ██████████████████░░  897K
skill-factory    ██████████████████████████████  1,448K
skill-builder    ██████████████████████████████████████████  2,024K

Combined Score

MEGA Code        ████████████████  78%   ← #1
HF Upskill       ██████████████░░  70%
anthropic-skill  █████████████░░░  65%
Baseline         █████████████░░░  65%
skill-builder    ██████████░░░░░░  50%
skill-factory    █████████░░░░░░░  43%

What Are Skills?

Skills are reusable instruction packages (SKILL.md files) for Claude Code-style agents. Different systems generate these skills in different ways — some from prompts, some from real coding sessions. This evaluation measures which approach produces the most effective skills.
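
What a generated skill file looks like varies by system. As a purely hypothetical sketch (the frontmatter fields and section names here are illustrative, not taken from any specific generator):

```markdown
---
name: ruff-fastapi-setup
description: Set up Ruff linting and formatting for a FastAPI project
---

# Ruff FastAPI Setup

## When to use this skill
The user is adding linting, formatting, or CI checks to a FastAPI codebase.

## Instructions
1. Add ruff as a development dependency.
2. Configure rule selection in pyproject.toml.
3. Run `ruff check` and `ruff format` before committing.
```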

Dataset

The evaluation uses 4 task areas from a full-stack chat application project:

| Skill | Domain |
| --- | --- |
| Build Gemini Chat App | Full-Stack / API Integration |
| Multi-Agent Chat Interface | Backend / AI Agents |
| React Multi Chat Sessions | Frontend / State Management |
| Ruff FastAPI Setup | Code Quality / DevOps |

Each task is evaluated with 5 standardized test cases (generated by Claude Opus via HF Upskill), for a total of 20 evaluations per model per system.

Systems Compared

| System | Type | Repository |
| --- | --- | --- |
| MEGA Code | Fully autonomous (extracted from real coding session) | wisdomgraph/mega-code |
| HF Upskill | Semi-automatic (generated from prompt files) | huggingface/upskill |
| anthropic-skill-creator | Interactive (5-stage pipeline) | anthropics/skills |
| claude-code-skill-factory | Semi-automatic (interactive builder) | alirezarezvani/claude-code-skill-factory |
| skill-builder | Interactive (conversational) | metaskills/skill-builder |

Methodology

All systems were evaluated under identical conditions using HF Upskill's eval harness:

  • Same test cases per skill across all systems
  • Same evaluation models — Claude Haiku 4.5 and Claude Sonnet 4.5
  • Same harness: upskill eval

Three conditions per skill: baseline (no skill), competitor skill, and MEGA Code skill.

For full methodology details, see Skill Quality Experiment.md.

Results

| System | Combined Tokens | Combined Avg Score | Tokens / Score Point |
| --- | ---: | ---: | ---: |
| Base (No Skill) | 896,553 | 65% | 13,793 |
| MEGA Code | 169,122 | 78% | 2,168 |
| HF Upskill | 763,082 | 70% | 10,901 |
| anthropic-skill-creator | 825,970 | 65% | 12,707 |
| claude-code-skill-factory | 1,447,816 | 43% | 33,670 |
| skill-builder | 2,023,507 | 50% | 40,470 |

Per-Model Breakdown

| System | Sonnet Tokens | Sonnet Score | Haiku Tokens | Haiku Score |
| --- | ---: | ---: | ---: | ---: |
| Base (No Skill) | 91,246 | 70% | 805,307 | 60% |
| MEGA Code | 91,907 | 80% | 77,215 | 75% |
| HF Upskill | 65,958 | 70% | 697,124 | 70% |
| anthropic-skill-creator | 188,562 | 60% | 637,408 | 70% |
| claude-code-skill-factory | 443,020 | 40% | 1,004,796 | 45% |
| skill-builder | 394,208 | 50% | 1,629,299 | 50% |
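
The Combined columns appear to follow directly from the per-model numbers: token counts sum across the two models, scores are averaged (rounded half-up), and Tokens / Score Point divides the total by the combined score. A sketch of that arithmetic:

```python
def combine(sonnet_tokens, sonnet_score, haiku_tokens, haiku_score):
    """Reconstruct the Combined columns from the per-model breakdown."""
    total_tokens = sonnet_tokens + haiku_tokens
    avg_score = int((sonnet_score + haiku_score) / 2 + 0.5)  # round half-up
    return total_tokens, avg_score, round(total_tokens / avg_score)

combine(91_907, 80, 77_215, 75)  # MEGA Code row → (169122, 78, 2168)
```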

Key Findings

  • MEGA Code achieves the highest combined score — 78% vs 65% baseline (+13pp) with the lowest total token usage (169K vs 897K baseline).
  • HF Upskill is the next strongest — 70% combined score with moderate token usage.
  • anthropic-skill-creator shows mixed results — matches baseline overall (65%) but excels on specific tasks.
  • claude-code-skill-factory and skill-builder underperform — both increase token usage substantially and score below baseline.
  • Haiku benefits the most from MEGA Code skills — 75% score (+15pp) with 90% token reduction.

Full Interaction Transcripts

For full transparency, the complete interaction transcripts used to generate each system's skills are available in used_prompts/. These let you inspect the exact inputs, intermediate steps, and outputs of every skill generation session.

| System | Transcripts |
| --- | --- |
| MEGA Code | Session trace: skill extraction from a real coding session |
| HF Upskill | Session files: session.json and history_skill_gen.json per skill |
| anthropic-skill-creator | Full 5-stage pipeline interactions |
| claude-code-skill-factory | Interactive builder sessions |
| skill-builder | Conversational skill generation sessions |

The original chat application development session (used by MEGA Code) is also available as a transcript.

Reproducing the Results

Prerequisites

  • Claude Code v2.1.72+
  • Python 3.12+ with uv
  • Node.js v22.17.0+
  • ANTHROPIC_API_KEY environment variable set

1. Install the evaluation harness

pip install upskill==0.2.1

2. Generate test cases and competitor skills

Use the prompt files in used_prompts/ to generate skills and standardized test cases:

upskill generate "<contents-of-prompt-file>" --test-gen-model opus

This produces a skill_meta.json with the skill and 5 test cases per skill. These test cases are the canonical set used for all systems.

Generate competitor skills by following each system's workflow (see Systems Compared) using the same prompt content.

3. Standardize test cases

Copy the test cases from HF Upskill's skill_meta.json into a tests.json for each skill directory, so every system is evaluated against the same tests.
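
Step 3 can be sketched in a few lines, assuming the canonical tests sit under a "test_cases" key in skill_meta.json (the key name is an assumption; check the schema your upskill version emits):

```python
import json
from pathlib import Path

def standardize_tests(meta_path, skill_dirs):
    """Copy the canonical test cases into a tests.json in each skill directory."""
    # "test_cases" is an assumed key name; adjust to the actual
    # skill_meta.json schema produced by `upskill generate`.
    tests = json.loads(Path(meta_path).read_text())["test_cases"]
    for skill_dir in skill_dirs:
        Path(skill_dir, "tests.json").write_text(json.dumps(tests, indent=2))
    return tests
```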

4. Run evaluations

upskill eval <skill-path>/SKILL.md --tests <tests.json> -m haiku -m sonnet -v

# Export results
upskill runs --csv results_performance.csv

5. Compare against our results

Raw result CSVs are in results/. Generated skills for all systems are in generated_skills/.
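
To compare your own runs against the published CSVs, an aggregation along these lines works; the column names ("system", "tokens", "score") are assumptions, so check the actual headers in results/ first:

```python
import csv
import io
from collections import defaultdict

def summarize(csv_text):
    """Total tokens and mean score per system from a results CSV string."""
    tokens = defaultdict(int)   # system -> summed token count
    scores = defaultdict(list)  # system -> per-run scores
    for row in csv.DictReader(io.StringIO(csv_text)):
        tokens[row["system"]] += int(row["tokens"])
        scores[row["system"]].append(float(row["score"]))
    return {s: (tokens[s], sum(scores[s]) / len(scores[s])) for s in tokens}
```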

Repository Structure

├── Skill Quality Experiment.md    # Full experiment document
├── results_summary.md             # Condensed results
├── generated_skills/              # All skills organized by system
│   ├── mega-code/
│   ├── upskill/
│   ├── claude-code-skill-factory/
│   ├── skill-builder/
│   └── anthropic-skill-creator/
├── results/                       # Raw evaluation CSVs
├── used_prompts/                  # Prompt files and full interaction transcripts
│   ├── <system>/                  #   Per-system generation transcripts

License

This project is licensed under the Apache License 2.0.
