wisdomgraph/mega-code-eval

Skill Quality Evaluation

An empirical benchmark comparing skills generated by five different skill synthesis systems for Claude Code. We measured whether injecting a generated skill improves task success rate and token efficiency compared to a no-skill baseline.

Real Work. Real Results.

Measured head-to-head against 4 competing systems on tasks developers actually ship.

  • 1/5 the token usage of the no-skill baseline (169K tokens vs 897K)
  • #1 combined score among all evaluated systems (78% average across 4 skills × 2 models)
  • 3× the structural quality of the competitor average (16/16 across 8 structural dimensions)

Token Usage

MEGA Code        ████░░░░░░░░░░░░░░░░  169K   ← 81% reduction
HF Upskill       ████████████████░░░░  763K
anthropic-skill  █████████████████░░░  826K
Baseline         ██████████████████░░  897K
skill-factory    ██████████████████████████████  1,448K
skill-builder    ██████████████████████████████████████████  2,024K

Combined Score

MEGA Code        ████████████████  78%   ← #1
HF Upskill       ██████████████░░  70%
anthropic-skill  █████████████░░░  65%
Baseline         █████████████░░░  65%
skill-builder    ██████████░░░░░░  50%
skill-factory    █████████░░░░░░░  43%

What Are Skills?

Skills are reusable instruction packages (SKILL.md files) for Claude Code-style agents. Different systems generate these skills in different ways — some from prompts, some from real coding sessions. This evaluation measures which approach produces the most effective skills.
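
What a generated skill file looks like varies by system. As a purely hypothetical sketch (the frontmatter fields and section names here are illustrative, not taken from any specific generator):

```markdown
---
name: ruff-fastapi-setup
description: Set up Ruff linting and formatting for a FastAPI project
---

# Ruff FastAPI Setup

## When to use this skill
The user is adding linting, formatting, or CI checks to a FastAPI codebase.

## Instructions
1. Add ruff as a development dependency.
2. Configure rule selection in pyproject.toml.
3. Run `ruff check` and `ruff format` before committing.
```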

Dataset

The evaluation uses 4 task areas from a full-stack chat application project:

| Skill | Domain |
| --- | --- |
| Build Gemini Chat App | Full-Stack / API Integration |
| Multi-Agent Chat Interface | Backend / AI Agents |
| React Multi Chat Sessions | Frontend / State Management |
| Ruff FastAPI Setup | Code Quality / DevOps |

Each task is evaluated with 5 standardized test cases (generated by Claude Opus via HF Upskill), for a total of 20 evaluations per model per system.

Systems Compared

| System | Type | Repository |
| --- | --- | --- |
| MEGA Code | Fully autonomous (extracted from real coding session) | wisdomgraph/mega-code |
| HF Upskill | Semi-automatic (generated from prompt files) | huggingface/upskill |
| anthropic-skill-creator | Interactive (5-stage pipeline) | anthropics/skills |
| claude-code-skill-factory | Semi-automatic (interactive builder) | alirezarezvani/claude-code-skill-factory |
| skill-builder | Interactive (conversational) | metaskills/skill-builder |

Methodology

All systems were evaluated under identical conditions using HF Upskill's eval harness:

  • Same test cases per skill across all systems
  • Same evaluation models — Claude Haiku 4.5 and Claude Sonnet 4.5
  • Same harness: upskill eval

Three conditions per skill: baseline (no skill), competitor skill, and MEGA Code skill.

For full methodology details, see Skill Quality Experiment.md.

Results

| System | Combined Tokens | Combined Avg Score | Tokens / Score Point |
| --- | ---: | ---: | ---: |
| Base (No Skill) | 896,553 | 65% | 13,793 |
| MEGA Code | 169,122 | 78% | 2,168 |
| HF Upskill | 763,082 | 70% | 10,901 |
| anthropic-skill-creator | 825,970 | 65% | 12,707 |
| claude-code-skill-factory | 1,447,816 | 43% | 33,670 |
| skill-builder | 2,023,507 | 50% | 40,470 |

Per-Model Breakdown

| System | Sonnet Tokens | Sonnet Score | Haiku Tokens | Haiku Score |
| --- | ---: | ---: | ---: | ---: |
| Base (No Skill) | 91,246 | 70% | 805,307 | 60% |
| MEGA Code | 91,907 | 80% | 77,215 | 75% |
| HF Upskill | 65,958 | 70% | 697,124 | 70% |
| anthropic-skill-creator | 188,562 | 60% | 637,408 | 70% |
| claude-code-skill-factory | 443,020 | 40% | 1,004,796 | 45% |
| skill-builder | 394,208 | 50% | 1,629,299 | 50% |
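
The Combined columns appear to follow directly from the per-model numbers: token counts sum across the two models, scores are averaged (rounded half-up), and Tokens / Score Point divides the total by the combined score. A sketch of that arithmetic:

```python
def combine(sonnet_tokens, sonnet_score, haiku_tokens, haiku_score):
    """Reconstruct the Combined columns from the per-model breakdown."""
    total_tokens = sonnet_tokens + haiku_tokens
    avg_score = int((sonnet_score + haiku_score) / 2 + 0.5)  # round half-up
    return total_tokens, avg_score, round(total_tokens / avg_score)

combine(91_907, 80, 77_215, 75)  # MEGA Code row → (169122, 78, 2168)
```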

Key Findings

  • MEGA Code achieves the highest combined score — 78% vs 65% baseline (+13pp) with the lowest total token usage (169K vs 897K baseline).
  • HF Upskill is the next strongest — 70% combined score with moderate token usage.
  • anthropic-skill-creator shows mixed results — matches baseline overall (65%) but excels on specific tasks.
  • claude-code-skill-factory and skill-builder underperform — both increase token usage substantially and score below baseline.
  • Haiku benefits the most from MEGA Code skills — 75% score (+15pp) with 90% token reduction.

Full Interaction Transcripts

For full transparency, the complete interaction transcripts used to generate each system's skills are available in used_prompts/. These let you inspect the exact inputs, intermediate steps, and outputs of every skill generation session.

| System | Transcripts |
| --- | --- |
| MEGA Code | Session trace: skill extraction from a real coding session |
| HF Upskill | Session files: session.json and history_skill_gen.json per skill |
| anthropic-skill-creator | Full 5-stage pipeline interactions |
| claude-code-skill-factory | Interactive builder sessions |
| skill-builder | Conversational skill generation sessions |

The original chat application development session (used by MEGA Code) is also available as a transcript.

Reproducing the Results

Prerequisites

  • Claude Code v2.1.72+
  • Python 3.12+ with uv
  • Node.js v22.17.0+
  • ANTHROPIC_API_KEY environment variable set

1. Install the evaluation harness

pip install upskill==0.2.1

2. Generate test cases and competitor skills

Use the prompt files in used_prompts/ to generate skills and standardized test cases:

upskill generate "<contents-of-prompt-file>" --test-gen-model opus

This produces a skill_meta.json with the skill and 5 test cases per skill. These test cases are the canonical set used for all systems.

Generate competitor skills by following each system's workflow (see Systems Compared) using the same prompt content.

3. Standardize test cases

Copy the test cases from HF Upskill's skill_meta.json into a tests.json for each skill directory, so every system is evaluated against the same tests.
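
Step 3 can be sketched in a few lines, assuming the canonical tests sit under a "test_cases" key in skill_meta.json (the key name is an assumption; check the schema your upskill version emits):

```python
import json
from pathlib import Path

def standardize_tests(meta_path, skill_dirs):
    """Copy the canonical test cases into a tests.json in each skill directory."""
    # "test_cases" is an assumed key name; adjust to the actual
    # skill_meta.json schema produced by `upskill generate`.
    tests = json.loads(Path(meta_path).read_text())["test_cases"]
    for skill_dir in skill_dirs:
        Path(skill_dir, "tests.json").write_text(json.dumps(tests, indent=2))
    return tests
```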

4. Run evaluations

upskill eval <skill-path>/SKILL.md --tests <tests.json> -m haiku -m sonnet -v

# Export results
upskill runs --csv results_performance.csv

5. Compare against our results

Raw result CSVs are in results/. Generated skills for all systems are in generated_skills/.
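
To compare your own runs against the published CSVs, an aggregation along these lines works; the column names ("system", "tokens", "score") are assumptions, so check the actual headers in results/ first:

```python
import csv
import io
from collections import defaultdict

def summarize(csv_text):
    """Total tokens and mean score per system from a results CSV string."""
    tokens = defaultdict(int)   # system -> summed token count
    scores = defaultdict(list)  # system -> per-run scores
    for row in csv.DictReader(io.StringIO(csv_text)):
        tokens[row["system"]] += int(row["tokens"])
        scores[row["system"]].append(float(row["score"]))
    return {s: (tokens[s], sum(scores[s]) / len(scores[s])) for s in tokens}
```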

Repository Structure

├── Skill Quality Experiment.md    # Full experiment document
├── results_summary.md             # Condensed results
├── generated_skills/              # All skills organized by system
│   ├── mega-code/
│   ├── upskill/
│   ├── claude-code-skill-factory/
│   ├── skill-builder/
│   └── anthropic-skill-creator/
├── results/                       # Raw evaluation CSVs
├── used_prompts/                  # Prompt files and full interaction transcripts
│   ├── <system>/                  #   Per-system generation transcripts

License

This project is licensed under the Apache License 2.0.
