feat: wire-protocol module + first-party Python client #11
Merged
…ients
Adds src/wire/: Zod schemas as the single contract, pure handler functions, a Hono HTTP server, stdio RPC for batch use, and OpenAPI 3.1 emission. The CLI binary (agent-eval) wraps both transports.
Schemas: JudgeRequest, JudgeResult, Rubric, RubricDimension, FailureMode, ListRubricsResponse, VersionResponse, ErrorResponse, all with .describe() field-level docs that flow through to the generated OpenAPI.
Methods exposed: judge, listRubrics, version.
Built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics are also accepted via the same endpoint.
Both transports route to identical handlers. The TS runtime is the source of truth; clients in other languages are generated from openapi.json.
24 new tests; 576/576 pass.
…ocked to npm
clients/python/ is pip-installable as `tangle-agent-eval`, version-locked to @tangle-network/agent-eval. It is a thin transport adapter: every judgement runs in the Node runtime, marshalled over HTTP or stdio RPC. There is no Python-side eval logic, which prevents drift by construction.
API:
- Client.judge(content=..., rubric_name="anti-slop") -> JudgeResult
- Client.list_rubrics() -> ListRubricsResponse
- Client.version() -> VersionResponse
The client auto-detects an HTTP server and falls back to a subprocess. pydantic v2 models mirror the Zod schemas; a mutual-exclusion refinement (rubric_name XOR rubric) validates client-side before any transport fires.
11/11 tests pass, including 4 real subprocess integration tests against the bundled CLI (no mocks).
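The mutual-exclusion rule (rubric_name XOR rubric) can be illustrated with a minimal sketch. This is not the client's actual pydantic model: `JudgeRequestSketch` and `build_payload` are hypothetical names, a stdlib dataclass stands in for pydantic, and the wire field names and inline-rubric shape are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class JudgeRequestSketch:
    """Hypothetical stand-in for the client's request model (not the real class)."""

    content: str
    rubric_name: Optional[str] = None        # name of a built-in rubric, e.g. "anti-slop"
    rubric: Optional[Dict[str, Any]] = None  # inline rubric definition (shape assumed)

    def __post_init__(self) -> None:
        # XOR: exactly one of rubric_name / rubric must be set.
        if (self.rubric_name is None) == (self.rubric is None):
            raise ValueError("provide exactly one of rubric_name or rubric")


def build_payload(req: JudgeRequestSketch) -> Dict[str, Any]:
    """Drop unset fields so only the chosen rubric selector is serialized."""
    fields = {
        "content": req.content,
        "rubric_name": req.rubric_name,
        "rubric": req.rubric,
    }
    return {key: value for key, value in fields.items() if value is not None}
```

Because the check runs at construction time, an invalid request fails before any HTTP call or subprocess is started, which matches the "validates client-side before any transport fires" behaviour described above.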
…E rewrite
Drew's directive: SKILL.md is dense by design (footgun bible) and overloaded as the sole onboarding doc. Split the audiences:
- README.md: human entry point. What it is, who it's for, 30-second quickstart for both wire-protocol and in-process TS use.
- docs/concepts.md (NEW): 5-minute mental model. Vocabulary table, three-layer eval explained, rubric + verifier basics, trace model.
- docs/wire-protocol.md (NEW): full HTTP/RPC reference with request/response examples for every endpoint. Adding-a-method recipe.
- SKILL.md: vocabulary section added at top so agents have plain-English definitions of every term used in the directives.
- CLAUDE.md: split-audience pointer instead of single SKILL.md redirect.
Principle: every term defined in plain English with one example. Every endpoint has copy-pasteable input/output. No jargon left undefined.
The verify job typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks the npm and PyPI package versions, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), and uploads artifacts. publish-npm depends on verify; publish-pypi depends on publish-npm. If anything breaks, neither package ships. PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.
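The workflow's actual version-lock step is not shown on this page. As a rough sketch of what such a check could look like, assuming the Python package declares a static `version = "..."` line in its pyproject.toml (an assumption, as are the function names):

```python
import json
import re


def npm_version(package_json_text: str) -> str:
    """Read the "version" field from the text of a package.json file."""
    return json.loads(package_json_text)["version"]


def pypi_version(pyproject_text: str) -> str:
    """Read a statically declared version line from pyproject.toml text.

    Assumption: the version is a literal string, not a dynamic field.
    """
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no static version found in pyproject.toml")
    return match.group(1)


def check_version_lock(package_json_text: str, pyproject_text: str) -> str:
    """Return the shared version, or raise if the two packages disagree."""
    npm = npm_version(package_json_text)
    pypi = pypi_version(pyproject_text)
    if npm != pypi:
        raise ValueError(f"version mismatch: npm={npm} pypi={pypi}")
    return npm
```

In a CI step, the two arguments would come from reading package.json and the Python client's pyproject.toml; a raised exception fails the job, so neither publish job runs.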
Summary
Adds a wire-protocol layer (`src/wire/`) so non-TypeScript clients can drive agent-eval over HTTP or stdio RPC, plus a first-party Python client at `clients/python/` that publishes as `tangle-agent-eval` to PyPI, version-locked to the npm package. The TypeScript runtime stays the single source of truth: clients in other languages are transport adapters, not ports.
Architecture
- Zod schemas (`src/wire/schemas.ts`) are the contract.
- Handlers (`src/wire/handlers.ts`) are pure functions; both transports route to them.
- OpenAPI 3.1 emission (`pnpm openapi`).
- CLI binary (`agent-eval`): `serve | rpc | rpc-batch | openapi | version`.
What's exposed today
`judge`, `listRubrics`, `version`, `/healthz`, `/openapi.json`. One built-in rubric: `anti-slop` (voice/signal quality for technical-buyer audiences). Inline rubrics are also supported on `/v1/judge`.
Adding a method is mechanical (~6 small edits); the recipe is in `docs/wire-protocol.md`.
Docs overhaul
Drew's directive: SKILL.md was overloaded as the only onboarding doc, dense by design (footgun bible). Split the audiences:
- `README.md`: human entry point. What it is, who it's for, two quickstarts (wire-protocol + in-process TS).
- `docs/concepts.md` (new): 5-min mental model. Vocabulary table, three-layer eval, rubrics, verifiers, traces.
- `docs/wire-protocol.md` (new): full HTTP/RPC reference with copy-pasteable input/output for every endpoint. Adding-a-method recipe.
- `SKILL.md`: same agent directives, but with a vocabulary section at the top so every term used is defined in plain English.
- `CLAUDE.md`: split-audience pointer.
Principle applied throughout: every term defined in plain English with one example. Every endpoint has copy-pasteable I/O. No jargon undefined.
CI
Dual-publish workflow (`.github/workflows/publish.yml`):
- `verify` typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks npm vs PyPI, installs the Python client, runs its tests (including real subprocess integration against `dist/cli.js`), and uploads artifacts.
- `publish-npm` depends on `verify`.
- `publish-pypi` depends on `publish-npm`.
If anything breaks, neither package ships. Versions stay locked.
PyPI uses trusted publishing (OIDC); npm uses `NPM_TOKEN`.
Tests
- `tests/wire/` (schemas, server, rpc).
- `clients/python/tests/`, including 4 real subprocess integration tests against the bundled CLI (no mocks).
Not in this PR (queued for follow-on)
- … (`runScenario`, `runBuilderSession`, `listJudges`), pending shape review of the judge endpoint.
- `datamodel-code-generator` for Python models when the surface grows past ~10 endpoints.
- … `AGENT_EVAL_LIVE=1`.
- … `dist/openapi.json` when the demand shows up.
Test plan
- … (`src/wire/schemas.ts`); once landed, breaking changes bump `WIRE_VERSION`.
- `anti-slop` rubric weights and failure modes match the autoresearch-loop intent.
- `verify` job passes, including the Python integration tests.
- `pnpm build && node dist/cli.js serve`, then `pip install -e clients/python && python -c "from tangle_agent_eval import Client; print(Client().version().version)"`.
- `echo '{}' | node dist/cli.js rpc listRubrics | jq` returns the anti-slop rubric.
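The stdio RPC check in the test plan can also be driven from Python. The following is a sketch only, not the shipped client: `rpc_call` is a hypothetical helper, and it assumes the CLI reads JSON params on stdin, writes a single JSON document to stdout (as the `| jq` pipe suggests), and exits non-zero on failure.

```python
import json
import subprocess
from typing import Any, Dict, Optional, Sequence


def rpc_call(
    method: str,
    params: Optional[Dict[str, Any]] = None,
    cmd: Sequence[str] = ("node", "dist/cli.js", "rpc"),
    timeout: float = 30.0,
) -> Any:
    """Run one RPC method through the CLI and return the parsed JSON result.

    Equivalent in spirit to: echo '<params>' | node dist/cli.js rpc <method>
    """
    proc = subprocess.run(
        [*cmd, method],                  # method name is the final CLI argument
        input=json.dumps(params or {}),  # params go to the child's stdin as JSON
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    if proc.returncode != 0:
        # Assumption: a non-zero exit code signals an RPC failure.
        raise RuntimeError(
            f"rpc {method} failed ({proc.returncode}): {proc.stderr.strip()}"
        )
    return json.loads(proc.stdout)
```

After `pnpm build`, `rpc_call("listRubrics")` run from the repository root would be expected to return the rubric listing that the test plan checks for. The `cmd` parameter exists so the command can be swapped out in tests.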