feat: wire-protocol module + first-party Python client by drewstone · Pull Request #11 · tangle-network/agent-eval

drewstone · 2026-04-25T23:21:11Z

Summary

Adds a wire-protocol layer (src/wire/) so non-TypeScript clients can drive agent-eval over HTTP or stdio RPC, plus a first-party Python client at clients/python/, published to PyPI as tangle-agent-eval and version-locked to the npm package. The TypeScript runtime stays the single source of truth: clients in other languages are transport adapters, not ports.

Architecture

your code (any language)
        │
        ▼
   thin transport client  ──HTTP──▶  agent-eval serve   ──┐
        │                                                  │
        └─────subprocess────────▶  agent-eval rpc        ──┤
                                                           ▼
                                              same TS handlers, same rubrics,
                                              same scoring code

Schemas (Zod, src/wire/schemas.ts) are the contract.
Handlers (src/wire/handlers.ts) are pure functions; both transports route to them.
OpenAPI is auto-emitted from the schemas (pnpm openapi).
CLI binary (agent-eval): serve | rpc | rpc-batch | openapi | version.
Python client mirrors the schemas as pydantic v2 models, validates client-side, and falls back from HTTP → subprocess.
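
The HTTP-then-subprocess fallback described above could look roughly like the sketch below. It is not the shipped client: the server address, the function names, and the use of `rpc judge` as the stdio method are assumptions for illustration (only `rpc listRubrics` and the /v1/judge path appear in this PR).

```python
import json
import subprocess
import urllib.error
import urllib.request
from typing import Any, Callable

Payload = dict[str, Any]
Transport = Callable[[Payload], Payload]

# Hypothetical defaults: the host and port are assumptions, not documented values.
JUDGE_URL = "http://127.0.0.1:8080/v1/judge"
RPC_JUDGE_CMD = ["node", "dist/cli.js", "rpc", "judge"]


def judge_over_http(payload: Payload) -> Payload:
    """POST the request as JSON to a running `agent-eval serve`."""
    request = urllib.request.Request(
        JUDGE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def judge_over_stdio(payload: Payload) -> Payload:
    """Pipe the request into the CLI on stdin and parse the JSON on stdout."""
    completed = subprocess.run(
        RPC_JUDGE_CMD,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


def judge(
    payload: Payload,
    http: Transport = judge_over_http,
    stdio: Transport = judge_over_stdio,
) -> Payload:
    """Try HTTP first; use the subprocess only if no server is reachable."""
    try:
        return http(payload)
    except urllib.error.HTTPError:
        # The server answered with an error status: surface it, do not retry.
        raise
    except OSError:
        # Connection refused, DNS failure, or timeout: fall back to stdio.
        return stdio(payload)
```

Passing the transports as parameters keeps the fallback logic testable without a server or a Node install; either path ends in the same TS handlers.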

What's exposed today

judge, listRubrics, version, /healthz, /openapi.json. One built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics also supported on /v1/judge.
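
Because a judge request names either a built-in rubric or carries an inline one, the client can reject invalid combinations before any transport fires. The sketch below shows that rubric_name XOR rubric check with a stdlib dataclass; the real client uses pydantic v2 models, and the inline rubric's shape here (a plain dict with a made-up `name` key in the usage example) is a placeholder, not the actual schema.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class JudgeRequest:
    """Illustrative stand-in for the JudgeRequest wire schema."""

    content: str
    rubric_name: Optional[str] = None        # built-in rubric, e.g. "anti-slop"
    rubric: Optional[dict[str, Any]] = None  # inline rubric; real shape not shown

    def __post_init__(self) -> None:
        # Exactly one of the two rubric fields must be provided.
        if (self.rubric_name is None) == (self.rubric is None):
            raise ValueError("provide exactly one of rubric_name or rubric")
```

For example, `JudgeRequest(content="draft text", rubric_name="anti-slop")` constructs normally, while omitting both rubric fields or passing both raises `ValueError`.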

Adding a method is mechanical (~6 small edits) — recipe in docs/wire-protocol.md.

Docs overhaul

Drew's directive: SKILL.md is dense by design (a footgun bible) and was overloaded as the only onboarding doc. Split the audiences:

README.md — human entry point. What it is, who it's for, two quickstarts (wire-protocol + in-process TS).
docs/concepts.md (new) — 5-min mental model. Vocabulary table, three-layer eval, rubrics, verifiers, traces.
docs/wire-protocol.md (new) — full HTTP/RPC reference with copy-pasteable input/output for every endpoint. Adding-a-method recipe.
SKILL.md — same agent directives, but with a vocabulary section at the top so every term used is defined in plain English.
CLAUDE.md — split-audience pointer.

Principle applied throughout: every term defined in plain English with one example. Every endpoint has copy-pasteable I/O. No jargon undefined.

CI

Dual-publish workflow (.github/workflows/publish.yml):

verify typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks npm vs PyPI, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), uploads artifacts.
publish-npm depends on verify.
publish-pypi depends on publish-npm. If verify fails, neither package ships; if publish-npm fails, the PyPI package does not ship either. Versions stay locked.

PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.
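
A version-lock step of the kind the verify job performs might be sketched as follows. This is an assumption about how such a check could work, not the workflow's actual script: it assumes the npm version lives in package.json and the Python version in a `version = "..."` line of pyproject.toml, and the function names are hypothetical.

```python
import json
import re


def npm_version(package_json_text: str) -> str:
    """Read the "version" field from package.json contents."""
    return json.loads(package_json_text)["version"]


def pypi_version(pyproject_text: str) -> str:
    """Read the first `version = "..."` line from pyproject.toml contents.

    A regex keeps the sketch dependency-free; a TOML parser would be more robust.
    """
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    if match is None:
        raise ValueError("no version line found in pyproject.toml")
    return match.group(1)


def check_version_lock(package_json_text: str, pyproject_text: str) -> str:
    """Return the shared version, or exit non-zero if the two differ."""
    npm = npm_version(package_json_text)
    pypi = pypi_version(pyproject_text)
    if npm != pypi:
        raise SystemExit(f"version mismatch: npm {npm} vs PyPI {pypi}")
    return npm
```

Run early in verify, a mismatch fails the job before either publish job starts.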

Tests

24 new TS tests in tests/wire/ (schemas, server, rpc).
11 Python tests in clients/python/tests/ — including 4 real subprocess integration tests against the bundled CLI (no mocks).
Existing 576 tests still pass.

Not in this PR (queued for follow-on)

Wire-surface expansion (runScenario, runBuilderSession, listJudges) — pending shape review of the judge endpoint.
Switch to datamodel-code-generator for Python models when the surface grows past ~10 endpoints.
Live LLM integration test gated by AGENT_EVAL_LIVE=1.
Other-language clients (Rust, Go) — generate from dist/openapi.json when the demand shows up.
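
For the live-test item above, gating on AGENT_EVAL_LIVE=1 could be done with a skip decorator along these lines. This is only a sketch of one possible approach using unittest; the helper and test names are hypothetical, and the test body is a placeholder rather than a real LLM call.

```python
import os
import unittest
from typing import Mapping


def live_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True only when AGENT_EVAL_LIVE is set to exactly "1"."""
    return environ.get("AGENT_EVAL_LIVE") == "1"


@unittest.skipUnless(live_enabled(), "set AGENT_EVAL_LIVE=1 to run live LLM tests")
class LiveJudgeTest(unittest.TestCase):
    def test_live_judge(self) -> None:
        # Placeholder: a real test would call the client against a live model.
        self.skipTest("live call not implemented in this sketch")
```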

Test plan

Review the wire-protocol shape (src/wire/schemas.ts) — once landed, breaking changes bump WIRE_VERSION.
Confirm anti-slop rubric weights and failure modes match the autoresearch-loop intent.
CI: verify job passes including the Python integration tests.
Local: pnpm build && node dist/cli.js serve then pip install -e clients/python && python -c "from tangle_agent_eval import Client; print(Client().version().version)".
Local: echo '{}' | node dist/cli.js rpc listRubrics | jq returns the anti-slop rubric.

…ients

Adds src/wire/: Zod schemas as the single contract, pure handler functions, Hono HTTP server, stdio RPC for batch use, and OpenAPI 3.1 emission. CLI binary (agent-eval) wraps both transports.

Schemas: JudgeRequest, JudgeResult, Rubric, RubricDimension, FailureMode, ListRubricsResponse, VersionResponse, ErrorResponse, all with .describe() field-level docs that flow through to the generated OpenAPI.

Methods exposed: judge, listRubrics, version. Built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics also accepted via the same endpoint.

Both transports route to identical handlers. The TS runtime is the source of truth; clients in other languages are generated from openapi.json.

24 new tests, 576/576 pass.

…ocked to npm

clients/python/: pip-installable as `tangle-agent-eval`, version-locked to @tangle-network/agent-eval. Thin transport adapter: every judgement runs in the Node runtime, marshalled over HTTP or stdio RPC. No Python-side eval logic, which prevents drift by construction.

API:
Client.judge(content=..., rubric_name="anti-slop") -> JudgeResult
Client.list_rubrics() -> ListRubricsResponse
Client.version() -> VersionResponse

Auto-detects HTTP server, falls back to subprocess. pydantic v2 models mirror the Zod schemas; mutual-exclusion refinement (rubric_name XOR rubric) validates client-side before any transport fires.

11/11 tests pass, including 4 real subprocess integration tests against the bundled CLI (no mocks).

…E rewrite

Drew's directive: SKILL.md is dense by design (footgun bible) and overloaded as the sole onboarding doc. Split the audiences:

- README.md: human entry point. What it is, who it's for, 30-second quickstart for both wire-protocol and in-process TS use.
- docs/concepts.md (NEW): 5-minute mental model. Vocabulary table, three-layer eval explained, rubric + verifier basics, trace model.
- docs/wire-protocol.md (NEW): full HTTP/RPC reference with request/response examples for every endpoint. Adding-a-method recipe.
- SKILL.md: vocabulary section added at top so agents have plain-English definitions of every term used in the directives.
- CLAUDE.md: split-audience pointer instead of single SKILL.md redirect.

Principle: every term defined in plain English with one example. Every endpoint has copy-pasteable input/output. No jargon left undefined.

verify job typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks npm and PyPI package versions, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), uploads artifacts. publish-npm depends on verify; publish-pypi depends on publish-npm. If anything breaks, neither package ships. PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.

drewstone added 4 commits April 25, 2026 16:50

drewstone merged commit fa8fc94 into main Apr 25, 2026

drewstone deleted the feat/wire-protocol-and-python-client branch May 8, 2026 14:49
