
feat: wire-protocol module + first-party Python client #11

Merged: drewstone merged 4 commits into main from feat/wire-protocol-and-python-client on Apr 25, 2026
Conversation

@drewstone
Contributor

Summary

Adds a wire-protocol layer (src/wire/) so non-TypeScript clients can drive agent-eval over HTTP or stdio RPC, plus a first-party Python client at clients/python/ that publishes to PyPI as tangle-agent-eval, version-locked to the npm package. The TypeScript runtime stays the single source of truth: clients in other languages are transport adapters, not ports.

Architecture

your code (any language)
        │
        ▼
   thin transport client  ──HTTP──▶  agent-eval serve   ──┐
        │                                                  │
        └─────subprocess────────▶  agent-eval rpc        ──┤
                                                           ▼
                                              same TS handlers, same rubrics,
                                              same scoring code
  • Schemas (Zod, src/wire/schemas.ts) are the contract.
  • Handlers (src/wire/handlers.ts) are pure functions; both transports route to them.
  • OpenAPI is auto-emitted from the schemas (pnpm openapi).
  • CLI binary (agent-eval): serve | rpc | rpc-batch | openapi | version.
  • Python client mirrors the schemas as pydantic v2 models, validates client-side, and falls back from HTTP → subprocess.
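
The HTTP → subprocess fallback in the last bullet can be pictured as an ordered list of transports tried in turn. The following is a minimal sketch only, not the client's actual code: `TransportUnavailable`, `call_with_fallback`, and the two stand-in transports are hypothetical names invented for illustration.

```python
from typing import Any, Callable, Dict, List

class TransportUnavailable(Exception):
    """Hypothetical error: the transport could not reach the runtime at all."""

# A transport takes (method, params) and returns the decoded JSON result.
Transport = Callable[[str, Dict[str, Any]], Dict[str, Any]]

def call_with_fallback(
    method: str, params: Dict[str, Any], transports: List[Transport]
) -> Dict[str, Any]:
    """Try each transport in order; move on only when one is unreachable."""
    errors: List[str] = []
    for transport in transports:
        try:
            return transport(method, params)
        except TransportUnavailable as exc:
            errors.append(str(exc))
    raise TransportUnavailable("; ".join(errors) or "no transports configured")

# Stand-ins for illustration: the HTTP transport finds no server,
# so the call falls through to the subprocess transport.
def http_transport(method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    raise TransportUnavailable("no server listening")

def subprocess_transport(method: str, params: Dict[str, Any]) -> Dict[str, Any]:
    return {"method": method, "via": "subprocess"}

result = call_with_fallback("version", {}, [http_transport, subprocess_transport])
```

Only an unreachable transport triggers the fallback in this sketch; any other exception propagates, so a genuine error from the runtime is not retried on a second transport.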

What's exposed today

Methods: judge, listRubrics, version, plus the HTTP routes /healthz and /openapi.json. One built-in rubric: anti-slop (voice/signal quality for technical-buyer audiences). Inline rubrics are also supported on /v1/judge.

Adding a method is mechanical (~6 small edits) — recipe in docs/wire-protocol.md.

Docs overhaul

Drew's directive: SKILL.md was overloaded as the only onboarding doc, dense by design (footgun bible). Split the audiences:

  • README.md — human entry point. What it is, who it's for, two quickstarts (wire-protocol + in-process TS).
  • docs/concepts.md (new) — 5-min mental model. Vocabulary table, three-layer eval, rubrics, verifiers, traces.
  • docs/wire-protocol.md (new) — full HTTP/RPC reference with copy-pasteable input/output for every endpoint. Adding-a-method recipe.
  • SKILL.md — same agent directives, but with a vocabulary section at the top so every term used is defined in plain English.
  • CLAUDE.md — split-audience pointer.

Principle applied throughout: every term defined in plain English with one example. Every endpoint has copy-pasteable I/O. No jargon undefined.

CI

Dual-publish workflow (.github/workflows/publish.yml):

  • verify typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks npm vs PyPI, installs the Python client, runs its tests (including real subprocess integration against dist/cli.js), uploads artifacts.
  • publish-npm depends on verify.
  • publish-pypi depends on publish-npm. If anything breaks, neither package ships. Versions stay locked.
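
A version-lock check of the kind the verify job performs could look roughly like the sketch below. It is an illustration under assumptions, not the workflow's actual script: it assumes the npm version lives in a package.json `version` field and the Python version is a static `version = "..."` line in a pyproject.toml, and all function names are hypothetical.

```python
import json
import re

def npm_version(package_json_text: str) -> str:
    """Read the "version" field from package.json contents."""
    return json.loads(package_json_text)["version"]

def pypi_version(pyproject_text: str) -> str:
    """Read a static `version = "..."` line from pyproject.toml contents."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, re.MULTILINE)
    if match is None:
        raise ValueError("no static version field found in pyproject.toml")
    return match.group(1)

def assert_versions_locked(package_json_text: str, pyproject_text: str) -> str:
    """Return the shared version, or raise if the two packages disagree."""
    npm = npm_version(package_json_text)
    pypi = pypi_version(pyproject_text)
    if npm != pypi:
        raise ValueError(f"version mismatch: npm {npm} != PyPI {pypi}")
    return npm
```

Running a check like this before either publish job means a mismatch stops the pipeline before anything ships.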

PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.

Tests

  • 24 new TS tests in tests/wire/ (schemas, server, rpc).
  • 11 Python tests in clients/python/tests/ — including 4 real subprocess integration tests against the bundled CLI (no mocks).
  • Existing 576 tests still pass.

Not in this PR (queued for follow-on)

  • Wire-surface expansion (runScenario, runBuilderSession, listJudges) — pending shape review of the judge endpoint.
  • Switch to datamodel-code-generator for Python models when the surface grows past ~10 endpoints.
  • Live LLM integration test gated by AGENT_EVAL_LIVE=1.
  • Other-language clients (Rust, Go) — generate from dist/openapi.json when the demand shows up.

Test plan

  • Review the wire-protocol shape (src/wire/schemas.ts) — once landed, breaking changes bump WIRE_VERSION.
  • Confirm anti-slop rubric weights and failure modes match the autoresearch-loop intent.
  • CI: verify job passes including the Python integration tests.
  • Local: pnpm build && node dist/cli.js serve, then in a second shell pip install -e clients/python && python -c "from tangle_agent_eval import Client; print(Client().version().version)".
  • Local: echo '{}' | node dist/cli.js rpc listRubrics | jq returns the anti-slop rubric.
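
The stdio RPC check in the last item can also be driven from Python without the client package. This is a sketch under assumptions: it mirrors the `node dist/cli.js rpc <method>` invocation shown above, sends the params as JSON on stdin, and assumes stdout is a single JSON document (as the `jq` pipe suggests). `build_rpc_command` and `run_rpc` are hypothetical helper names.

```python
import json
import subprocess
from typing import Any, Dict, List, Optional

def build_rpc_command(method: str, cli_path: str = "dist/cli.js") -> List[str]:
    """Build the argv for one stdio RPC call against the built CLI."""
    return ["node", cli_path, "rpc", method]

def run_rpc(
    method: str,
    params: Optional[Dict[str, Any]] = None,
    cli_path: str = "dist/cli.js",
) -> Any:
    """Send params as JSON on stdin and decode the JSON written to stdout."""
    completed = subprocess.run(
        build_rpc_command(method, cli_path),
        input=json.dumps(params or {}),
        capture_output=True,
        text=True,
        check=True,  # a non-zero exit raises CalledProcessError
    )
    return json.loads(completed.stdout)
```

After `pnpm build`, `run_rpc("listRubrics")` is the Python equivalent of the shell pipeline above.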

…ients

Adds src/wire/ — Zod schemas as the single contract, pure handler functions,
Hono HTTP server, stdio RPC for batch use, and OpenAPI 3.1 emission. CLI
binary (agent-eval) wraps both transports.

Schemas: JudgeRequest, JudgeResult, Rubric, RubricDimension, FailureMode,
ListRubricsResponse, VersionResponse, ErrorResponse — all with .describe()
field-level docs that flow through to the generated OpenAPI.

Methods exposed: judge, listRubrics, version. Built-in rubric: anti-slop
(voice/signal quality for technical-buyer audiences). Inline rubrics also
accepted via the same endpoint.

Both transports route to identical handlers. The TS runtime is the source
of truth; clients in other languages are generated from openapi.json.

24 new tests, 576/576 pass.
…ocked to npm

clients/python/ — pip-installable as `tangle-agent-eval`, version-locked to
@tangle-network/agent-eval. Thin transport adapter: every judgement runs in
the Node runtime, marshalled over HTTP or stdio RPC. No Python-side eval
logic — preventing drift by construction.

API:
  Client.judge(content=..., rubric_name="anti-slop") -> JudgeResult
  Client.list_rubrics() -> ListRubricsResponse
  Client.version() -> VersionResponse

Auto-detects HTTP server, falls back to subprocess. pydantic v2 models mirror
the Zod schemas; mutual-exclusion refinement (rubric_name XOR rubric)
validates client-side before any transport fires.
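
The rubric_name XOR rubric refinement can be illustrated with a stdlib dataclass standing in for the pydantic v2 model. This is a sketch, not the client's model: `JudgeRequestSketch` is a hypothetical name, the fields are limited to those named in this message, and treating "neither supplied" as an error is an assumption about how the refinement behaves.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass(frozen=True)
class JudgeRequestSketch:
    """Stand-in request model: exactly one rubric source must be given."""
    content: str
    rubric_name: Optional[str] = None       # name of a built-in rubric
    rubric: Optional[Dict[str, Any]] = None  # inline rubric definition

    def __post_init__(self) -> None:
        # XOR: reject the request when both or neither field is set,
        # before any transport is touched.
        if (self.rubric_name is None) == (self.rubric is None):
            raise ValueError("provide exactly one of rubric_name or rubric")

request = JudgeRequestSketch(content="draft text", rubric_name="anti-slop")
```

Failing at construction time keeps malformed requests from ever reaching the HTTP or subprocess transport.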

11/11 tests pass — including 4 real subprocess integration tests against
the bundled CLI (no mocks).
…E rewrite

Drew's directive: SKILL.md is dense by design (footgun bible) and overloaded
as the sole onboarding doc. Split the audiences:

- README.md: human entry point. What it is, who it's for, 30-second quickstart
  for both wire-protocol and in-process TS use.
- docs/concepts.md (NEW): 5-minute mental model. Vocabulary table, three-layer
  eval explained, rubric + verifier basics, trace model.
- docs/wire-protocol.md (NEW): full HTTP/RPC reference with request/response
  examples for every endpoint. Adding-a-method recipe.
- SKILL.md: vocabulary section added at top so agents have plain-English
  definitions of every term used in the directives.
- CLAUDE.md: split-audience pointer instead of single SKILL.md redirect.

Principle: every term defined in plain English with one example. Every
endpoint has copy-pasteable input/output. No jargon left undefined.

verify job typechecks JS, runs JS tests, builds, emits OpenAPI, version-locks
npm and PyPI package versions, installs the Python client, runs its tests
(including real subprocess integration against dist/cli.js), uploads
artifacts. publish-npm depends on verify; publish-pypi depends on
publish-npm. If anything breaks, neither package ships.

PyPI uses trusted publishing (OIDC); npm uses NPM_TOKEN.
@drewstone drewstone merged commit fa8fc94 into main Apr 25, 2026
@drewstone drewstone deleted the feat/wire-protocol-and-python-client branch May 8, 2026 14:49