feat(bench): EnterpriseOps-Gym adapter — deployable SQL state-checker (the gate's non-coding middle-band domain) by drewstone · Pull Request #157 · tangle-network/agent-runtime

drewstone · 2026-06-04T21:56:23Z

Wires EnterpriseOps-Gym (ServiceNow-AI) as a real BenchmarkAdapter through the unifier. It's the gate's ideal domain: a deployable programmatic verifier — the agent's tool-call transcript is replayed against a freshly-seeded gym MCP server, then the task's own SQL state-checks run via /api/sql-runner (NOT an answer-oracle, NOT an LLM-judge). Graded per-verifier (score = passes/total; resolved = all-pass), covering the enterprise-ops user story. loadTasks pulls the HF dataset with a committed offline fixture; judge fails loud with the exact docker pull/run/seed step when the gym server is unreachable — never a fabricated score. One line in adapters.ts. Verified: bench tsc 0, fixture test 5/5, lint clean. Note: dataset card Apache-2.0 vs paper CC-BY-NC-SA — confirm before redistribution.
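The fail-loud behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the adapter's actual code: `GymUnreachableError`, `Probe`, and `judgeOrFail` are hypothetical names, the probe is injected so the sketch runs without a live server, and the error text only gestures at the docker pull/run/seed step rather than reproducing the real command.

```typescript
// Hypothetical error type: signals that no grading happened at all.
class GymUnreachableError extends Error {
  constructor(baseUrl: string, cause: unknown) {
    super(
      `EnterpriseOps-Gym server unreachable at ${baseUrl} (${String(cause)}). ` +
        `Pull, run, and seed the gym container, then re-run. No score was produced.`
    );
    this.name = "GymUnreachableError";
  }
}

// Hypothetical probe: resolves when the gym server answers, rejects otherwise.
type Probe = (baseUrl: string) => Promise<void>;

// Run the state checks only if the server is reachable; otherwise throw.
// There is deliberately no fallback branch that returns a default score.
async function judgeOrFail<T>(
  baseUrl: string,
  probe: Probe,
  runChecks: () => Promise<T>
): Promise<T> {
  try {
    await probe(baseUrl);
  } catch (cause) {
    throw new GymUnreachableError(baseUrl, cause);
  }
  return runChecks();
}
```

The point of the shape is that an unreachable environment surfaces as an exception the caller must handle, so a missing server can never be mistaken for a score of zero.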

The gate's ideal NON-coding domain: a deployable programmatic verifier (the agent's tool-call transcript is replayed against a freshly-seeded gym MCP server, then the task's own SQL state-checks run via /api/sql-runner), NOT an answer-oracle. Graded per-verifier (score = passes/total = the bench's verifier_level_pass_rate; resolved = all-pass = overall_success_rate). Covers the enterprise-ops user story (itsm/hr/csm/calendar/email/drive/teams).

Real adapter on the existing pattern: loadTasks pulls the HF dataset (ServiceNow-AI/EnterpriseOps-Gym) with a committed offline fixtures fallback; judge invokes the live-env driver and FAILS LOUD with the exact docker pull/run/seed step when the gym server is unreachable, never a fabricated score. Registered in adapters.ts (one line). goldArtifact is undefined (the oracle is the seeded DB state, not a portable transcript; documented, not faked).

Note: dataset card says Apache-2.0; the paper lists CC-BY-NC-SA. Confirm before redistribution.

Verified: bench tsc 0, fixture test 5/5, repo lint clean.
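The per-verifier grading rule (score = passes/total, resolved = all-pass) could look something like the sketch below. `VerifierResult`, `Grade`, and `gradeVerifiers` are hypothetical names chosen for illustration; the adapter's real types are not shown in this PR text.

```typescript
// Hypothetical shape for the outcome of one SQL state-check.
interface VerifierResult {
  name: string;
  passed: boolean;
}

interface Grade {
  score: number; // passes / total
  resolved: boolean; // true only when every verifier passed
}

function gradeVerifiers(results: VerifierResult[]): Grade {
  if (results.length === 0) {
    // Nothing was checked, so refuse rather than invent a score.
    throw new Error("no verifier results to grade");
  }
  const passes = results.filter((r) => r.passed).length;
  return {
    score: passes / results.length,
    resolved: passes === results.length,
  };
}
```

For example, a task with four state-checks of which three pass would grade as `score = 0.75` and `resolved = false`.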

Cuts the 58-commit backlog on main into a published release. Headline surface:

- runToolLoop / streamToolLoop: bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym; runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)
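As a rough picture of what "runBenchmarks over one ADAPTERS registry" and "one line in adapters.ts" might mean, here is a minimal sketch. The names `BenchmarkAdapter`, `loadTasks`, `judge`, `ADAPTERS`, and `runBenchmarks` come from the PR text, but every signature and type below is an assumption, and `stubAdapter` is a hypothetical stand-in rather than the real EnterpriseOps-Gym adapter.

```typescript
// Assumed shapes; the real interfaces in agent-runtime may differ.
interface Task {
  id: string;
}
interface Judgement {
  score: number;
  resolved: boolean;
}
interface BenchmarkAdapter {
  loadTasks(): Promise<Task[]>;
  judge(task: Task): Promise<Judgement>;
}

// Hypothetical stand-in adapter with one canned task.
const stubAdapter: BenchmarkAdapter = {
  loadTasks: async () => [{ id: "demo-task" }],
  judge: async () => ({ score: 1, resolved: true }),
};

// Registering a benchmark is a single entry in the registry.
const ADAPTERS: Record<string, BenchmarkAdapter> = {
  "enterpriseops-gym": stubAdapter,
};

// Look each name up in the registry and judge every task it loads.
async function runBenchmarks(
  names: string[]
): Promise<Record<string, Judgement[]>> {
  const out: Record<string, Judgement[]> = {};
  for (const name of names) {
    const adapter = ADAPTERS[name];
    if (!adapter) throw new Error(`unknown benchmark adapter: ${name}`);
    const results: Judgement[] = [];
    for (const task of await adapter.loadTasks()) {
      results.push(await adapter.judge(task));
    }
    out[name] = results;
  }
  return out;
}
```

Keeping a single registry means the runner needs no per-benchmark branching; adding a benchmark touches only the adapter file and the one registry entry.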

drewstone merged commit d1d26d3 into main Jun 4, 2026
1 check passed

drewstone deleted the feat/enterpriseops-gym-adapter branch June 4, 2026 21:56

drewstone mentioned this pull request Jun 6, 2026

chore(release): agent-runtime 0.45.0 #176

Merged

drewstone merged 1 commit into main from feat/enterpriseops-gym-adapter

1 participant
