
feat(bench): EnterpriseOps-Gym adapter — deployable SQL state-checker (the gate's non-coding middle-band domain)#157

Merged: drewstone merged 1 commit into main from feat/enterpriseops-gym-adapter on Jun 4, 2026

@drewstone
Contributor

Wires EnterpriseOps-Gym (ServiceNow-AI) in as a real BenchmarkAdapter through the unifier. It is the gate's ideal non-coding domain because it ships a deployable programmatic verifier: the agent's tool-call transcript is replayed against a freshly seeded gym MCP server, then the task's own SQL state-checks run via /api/sql-runner (not an answer oracle, not an LLM judge). Grading is per verifier (score = passes/total; resolved = all-pass) and covers the enterprise-ops user story. loadTasks pulls the HF dataset and falls back to a committed offline fixture; judge fails loud with the exact docker pull/run/seed step when the gym server is unreachable, and never fabricates a score. Registration is one line in adapters.ts. Verified: bench tsc 0, fixture test 5/5, lint clean. Note: the dataset card says Apache-2.0 but the paper lists CC-BY-NC-SA; confirm before redistribution.

The gate's ideal NON-coding domain: a deployable programmatic verifier (the
agent's tool-call transcript is replayed against a freshly-seeded gym MCP server,
then the task's own SQL state-checks run via /api/sql-runner), NOT an
answer-oracle. Graded per-verifier (score = passes/total = the bench's
verifier_level_pass_rate; resolved = all-pass = overall_success_rate). Covers the
enterprise-ops user story (itsm/hr/csm/calendar/email/drive/teams).
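
The per-verifier grading rule above can be sketched as follows. This is a minimal illustration and not the adapter's actual code: the `VerifierResult` and `Grade` types and the `gradeVerifiers` function are hypothetical names. It also assumes that a task with no verifiers counts as unresolved, which the PR does not specify.

```typescript
// Hypothetical shape for the outcome of one SQL state-check (not the adapter's real type).
interface VerifierResult {
  name: string;
  passed: boolean;
}

// Hypothetical grade shape mirroring the two metrics named in the description.
interface Grade {
  score: number; // passes / total (the verifier-level pass rate)
  resolved: boolean; // true only when every verifier passed
}

function gradeVerifiers(results: VerifierResult[]): Grade {
  const total = results.length;
  if (total === 0) {
    // Assumption: no verifiers means unresolved, rather than a vacuous pass.
    return { score: 0, resolved: false };
  }
  const passes = results.filter((r) => r.passed).length;
  return { score: passes / total, resolved: passes === total };
}

// Two of three checks pass: partial score, task not resolved.
const example = gradeVerifiers([
  { name: "incident_created", passed: true },
  { name: "assignee_set", passed: true },
  { name: "priority_updated", passed: false },
]);
console.log(example.score, example.resolved);
```

Keeping `score` and `resolved` separate lets a run report partial progress without counting a partially correct end state as a success.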

Real adapter on the existing pattern: loadTasks pulls the HF dataset
(ServiceNow-AI/EnterpriseOps-Gym) with a committed offline fixtures fallback;
judge invokes the live-env driver and FAILS LOUD with the exact docker pull/run/
seed step when the gym server is unreachable — never a fabricated score.
Registered in adapters.ts (one line). goldArtifact undefined (the oracle is the
seeded DB state, not a portable transcript — documented, not faked).
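
A rough sketch of the two behaviors described above, the fixture fallback in loadTasks and the fail-loud path in judge, might look like this. All names here (`GymUnreachableError`, `requireGym`, `loadWithFallback`, the probe callback, and the localhost URL) are hypothetical and chosen for illustration; the real adapter's signatures are not shown in this PR. It also assumes the fallback is triggered by a failed remote load, which may differ from the actual trigger.

```typescript
// Hypothetical error type; the adapter's real error class is not shown in this PR.
class GymUnreachableError extends Error {
  constructor(baseUrl: string) {
    super(
      `EnterpriseOps-Gym server not reachable at ${baseUrl}. ` +
        "Run the documented docker pull/run/seed step, then retry. No score was produced."
    );
    this.name = "GymUnreachableError";
  }
}

// Throws instead of returning a placeholder score when the gym cannot be reached.
// `probe` is an injected health check (an assumption, so the sketch needs no network).
async function requireGym(baseUrl: string, probe: () => Promise<boolean>): Promise<void> {
  let reachable = false;
  try {
    reachable = await probe();
  } catch {
    reachable = false; // a failing probe counts as unreachable
  }
  if (!reachable) throw new GymUnreachableError(baseUrl);
}

type TaskSource = "remote" | "fixture";

// Tries the remote dataset first, then falls back to the committed offline fixture.
async function loadWithFallback<T>(
  remote: () => Promise<T[]>,
  fixture: () => T[]
): Promise<{ tasks: T[]; source: TaskSource }> {
  try {
    return { tasks: await remote(), source: "remote" };
  } catch {
    return { tasks: fixture(), source: "fixture" };
  }
}

// Usage: the remote load fails, so the fixture is used and the source is reported.
loadWithFallback<string>(
  async () => {
    throw new Error("offline");
  },
  () => ["fixture-task"]
).then((r) => console.log(r.source));

// Usage: an unreachable gym surfaces as an error, never as a score.
requireGym("http://localhost:8000", async () => false).catch((err: Error) =>
  console.error(err.name)
);
```

Reporting the task source alongside the tasks makes it visible when a run used the offline fixture instead of the full dataset.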

Note: dataset card says Apache-2.0; the paper lists CC-BY-NC-SA — confirm before
redistribution. Verified: bench tsc 0, fixture test 5/5, repo lint clean.
@drewstone drewstone merged commit d1d26d3 into main Jun 4, 2026
1 check passed
@drewstone drewstone deleted the feat/enterpriseops-gym-adapter branch June 4, 2026 21:56
drewstone added a commit that referenced this pull request Jun 6, 2026
Cuts the 58-commit backlog on main into a published release. Headline surface:
- runToolLoop / streamToolLoop — bounded turn-level tool-dispatch loop (#137)
- RSI agent tree: recursive Agent.act, Supervisor keystone, runProgram, the
  adaptive-driver channel (#139/#151/#165)
- optimization API collapsed onto agent-eval selfImprove; the runtime keeps the
  CODE-surface ImprovementDriver you pass as driver (#172)
- deployable benchmark adapters: AppWorld, commit0, aec-bench, EnterpriseOps-Gym;
  runBenchmarks over one ADAPTERS registry (#153/#156/#157)
- agent-eval floor raised to >=0.83.0 (#175)