An agent skill for eval-driven development of LLM-powered applications.
The eval-driven-dev skill guides your coding agent through the full QA loop for LLM applications:
- Understand the app — read the codebase, trace the data flow, learn what the app is supposed to do
- Instrument it — add `enable_storage()` and `@observe` so every run is captured to a local SQLite database
- Build a dataset — save representative traces as test cases with `pixie dataset save`
- Write eval tests — generate `test_*.py` files with `assert_dataset_pass` and appropriate evaluators
- Run the tests — `pixie test` to run all evals and report per-case scores
- Investigate failures — look up the stored trace for each failure, diagnose, fix, repeat
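To make the "instrument it" step concrete, here is a toy stand-in for what the skill wires up. pixie's real `enable_storage()` and `@observe` differ in implementation and signature (this sketch is not the package's API), but the idea is the same: every decorated call is recorded as a trace row in a local SQLite database.

```python
import functools
import json
import sqlite3

# Hypothetical path for illustration; the real package chooses its own storage location.
DB_PATH = "traces.db"

def enable_storage(db_path: str = DB_PATH) -> None:
    """Create the local trace store (toy version)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS traces "
        "(id INTEGER PRIMARY KEY, fn TEXT, inputs TEXT, output TEXT)"
    )
    conn.commit()
    conn.close()

def observe(fn):
    """Record each call's inputs and output as a trace row (toy version)."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        result = fn(*args, **kwargs)
        conn = sqlite3.connect(DB_PATH)
        conn.execute(
            "INSERT INTO traces (fn, inputs, output) VALUES (?, ?, ?)",
            (fn.__name__, json.dumps([list(args), kwargs]), json.dumps(result)),
        )
        conn.commit()
        conn.close()
        return result
    return wrapper

enable_storage()

@observe
def answer(question: str) -> str:
    # stand-in for the real LLM call / agent logic
    return f"echo: {question}"

answer("What does the app do?")
```

Once runs are captured like this, each stored trace can later be promoted to a dataset test case.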
```
npx openskills install yiouli/pixie-qa
```

The accompanying Python package is installed automatically by the skill when it's used.
When developing a Python-based AI project, open a conversation and say something like:
"setup QA for my agent"
Your coding agent will read your code, instrument it, build a dataset from a few real runs, write and run eval-based tests, investigate failures, and fix them.
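The eval-test step of that loop can be illustrated with a minimal stand-in. pixie's `assert_dataset_pass` and the `pixie test` runner replace all of this (the function bodies and names below are illustrative, not the package's API), but conceptually an eval test replays each saved case through the app, scores the result with an evaluator, and reports per-case scores:

```python
# Toy stand-in for an eval test over a saved dataset.

def answer(question: str) -> str:
    # stand-in for the instrumented app under test
    return "Paris" if "France" in question else "unknown"

# a "dataset" of saved cases: (input, expected) pairs
DATASET = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Atlantis?", "unknown"),
]

def exact_match(output: str, expected: str) -> float:
    """A simple evaluator: 1.0 on exact match, else 0.0."""
    return 1.0 if output == expected else 0.0

def assert_dataset_pass(dataset, evaluator, threshold=1.0):
    """Run every case, collect per-case scores, fail below threshold."""
    scores = {q: evaluator(answer(q), exp) for q, exp in dataset}
    failures = {q: s for q, s in scores.items() if s < threshold}
    assert not failures, f"failing cases: {failures}"
    return scores

scores = assert_dataset_pass(DATASET, exact_match)
```

When a case fails, the stored trace for that run is what you would inspect to diagnose the failure before fixing and re-running.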
The pixie-qa Python package (imported as `pixie`) is what Claude installs and uses inside your project. For the package API and CLI reference, see `docs/package.md`.