proofbench is a deterministic evaluation harness for LLM agents. A task passes only when the SHA-256 digest of the agent's final answer matches the task's stored oracle digest. There is no LLM judge, no probabilistic grading path, and no scorer state that can drift between runs.
Most agent evaluations delegate correctness to another model. That makes the result non-deterministic, gameable, expensive to reproduce, and hard to compare across machines or time. proofbench moves pass/fail authority into a cryptographic verifier: the score is byte-identical across runs, and an agent cannot pass a task without producing the expected preimage.
This design provides the building blocks for reproducible LLM and agent evaluation at scale: fixed task suites, replayable adapter outputs, machine-readable scorecards, deterministic regression gates, and failure modes that can be counted without re-judging transcripts.
Each task stores the SHA-256 digest of the accepted final answer. The runner
asks an adapter for an answer, extracts the last FINAL_[A-Z0-9_]+ token, hashes
that extracted token, and compares it with the oracle digest.
That final-answer anchoring matters. If an agent mentions the right token early and later changes its answer, proofbench scores the later answer. The evaluated object is the agent's final committed answer, not the best substring in the transcript.
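The extraction, hashing, and comparison steps described above can be sketched in TypeScript as follows. This is a minimal illustration, not the contents of src/oracle.ts: the names `extractFinalAnswer`, `sha256Hex`, and `verifyFinalAnswer` are hypothetical, and the sketch assumes the oracle digest is stored as hex over the UTF-8 bytes of the token.

```typescript
import { createHash } from "node:crypto";

// Token pattern from the description above; the `g` flag collects every match.
const FINAL_TOKEN = /FINAL_[A-Z0-9_]+/g;

// Return the last FINAL_... token in the transcript, or null if there is none.
function extractFinalAnswer(transcript: string): string | null {
  const matches = transcript.match(FINAL_TOKEN);
  if (matches === null) return null;
  return matches[matches.length - 1] ?? null;
}

// Hex-encoded SHA-256 of the token's UTF-8 bytes (the encoding is an assumption).
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Pass only when the last token's digest equals the stored oracle digest.
function verifyFinalAnswer(transcript: string, oracleDigest: string): boolean {
  const token = extractFinalAnswer(transcript);
  return token !== null && sha256Hex(token) === oracleDigest.toLowerCase();
}

const oracleDigest = sha256Hex("FINAL_BETA");

// The right token appears early, but the agent ends on a different answer.
console.log(verifyFinalAnswer("maybe FINAL_BETA, no: FINAL_ALPHA", oracleDigest)); // false

// The agent ends on the right token, so the task passes.
console.log(verifyFinalAnswer("maybe FINAL_ALPHA, answer: FINAL_BETA", oracleDigest)); // true
```

Only the last match is hashed, so a transcript that lists many candidate tokens gains nothing unless the one it ends on is correct.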
The current examples are synthetic security-flavored primitives. They are not a product benchmark, a frontier-model leaderboard, or a replacement for broad agent task suites. The point is the scoring contract: deterministic, replayable, and resistant to answer farming because the hash is not useful without the answer preimage.
- Deterministic oracle: src/oracle.ts verifies answers with SHA-256.
- Replay adapters: ReplayAdapter loads fixed outputs from JSON; EchoAdapter supports simple CLI smoke paths.
- Failure taxonomy: current failures distinguish missing_final_answer from answer_hash_mismatch.
- Last-match answer anchoring: only the final FINAL_... token is scored.
- Empty-suite rejection: task files must contain at least one task.
- JSON scorecards: runs produce stable artifacts for CI, comparison, or later reporting.
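
Because scorecards are deterministic, a CI step can compare a new scorecard against a stored baseline without re-judging anything. The sketch below shows one way such a comparison might look. The scorecard shape (`results` entries with `id` and `passed`) and the function name `findRegressions` are assumptions made for illustration; the project's real schema may differ.

```typescript
// Hypothetical scorecard shape, for illustration only.
interface TaskResult {
  id: string;
  passed: boolean;
}

interface Scorecard {
  results: TaskResult[];
}

// Return the ids of tasks that passed in the baseline but do not pass in the
// candidate. A task that is missing from the candidate counts as a regression.
function findRegressions(baseline: Scorecard, candidate: Scorecard): string[] {
  const candidatePassed = new Set(
    candidate.results.filter((r) => r.passed).map((r) => r.id),
  );
  return baseline.results
    .filter((r) => r.passed && !candidatePassed.has(r.id))
    .map((r) => r.id);
}

const baseline: Scorecard = {
  results: [
    { id: "t1", passed: true },
    { id: "t2", passed: true },
  ],
};
const candidate: Scorecard = {
  results: [
    { id: "t1", passed: true },
    { id: "t2", passed: false },
  ],
};

// Only t2 regressed between the two scorecards.
console.log(findRegressions(baseline, candidate).join(","));
```

A gate built on this would fail the build whenever the returned list is non-empty, for example by setting a nonzero exit code.
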
- Oracle: src/oracle.ts owns hashing and answer verification. This is the sole authority on pass/fail.
- Runner: src/runner.ts iterates tasks, calls the adapter, anchors the final answer, assigns failure labels, and returns a scorecard.
- Adapters: src/adapters.ts defines the agent boundary. The included replay adapter makes runs reproducible; provider adapters can be added behind the same interface.
- CLI and reporting: src/cli.ts, src/io.ts, and src/report.ts load suites, write scorecards, and render concise text reports.
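
The division of labor above can be sketched as a small adapter interface plus a runner loop. Everything here is an illustrative assumption rather than the project's actual code: the `Task`, `Adapter`, `TaskResult`, and `Scorecard` shapes, the field name `answerSha256`, and the function names are hypothetical. The adapter is synchronous to keep the sketch short; a provider-backed adapter would more likely return a Promise.

```typescript
import { createHash } from "node:crypto";

// Hypothetical task shape: the suite stores only the digest of the answer.
interface Task {
  id: string;
  prompt: string;
  answerSha256: string;
}

// The agent boundary: anything that turns a task into raw output text.
interface Adapter {
  answer(task: Task): string;
}

// Replays fixed outputs keyed by task id, so every run sees identical input.
class ReplayAdapter implements Adapter {
  private readonly outputs: Record<string, string>;
  constructor(outputs: Record<string, string>) {
    this.outputs = outputs;
  }
  answer(task: Task): string {
    return this.outputs[task.id] ?? "";
  }
}

type Failure = "missing_final_answer" | "answer_hash_mismatch";

interface TaskResult {
  id: string;
  passed: boolean;
  failure?: Failure;
}

interface Scorecard {
  total: number;
  passed: number;
  results: TaskResult[];
}

const sha256Hex = (text: string): string =>
  createHash("sha256").update(text, "utf8").digest("hex");

// Anchor on the last FINAL_... token, then label the outcome.
function scoreOutput(task: Task, output: string): TaskResult {
  const tokens = output.match(/FINAL_[A-Z0-9_]+/g);
  const last = tokens === null ? undefined : tokens[tokens.length - 1];
  if (last === undefined) {
    return { id: task.id, passed: false, failure: "missing_final_answer" };
  }
  return sha256Hex(last) === task.answerSha256
    ? { id: task.id, passed: true }
    : { id: task.id, passed: false, failure: "answer_hash_mismatch" };
}

// Reject empty suites, score every task, and aggregate into a scorecard.
function runSuite(tasks: Task[], adapter: Adapter): Scorecard {
  if (tasks.length === 0) {
    throw new Error("task suite must contain at least one task");
  }
  const results = tasks.map((task) => scoreOutput(task, adapter.answer(task)));
  const passed = results.filter((r) => r.passed).length;
  return { total: results.length, passed, results };
}

const tasks: Task[] = [
  { id: "t1", prompt: "first", answerSha256: sha256Hex("FINAL_ONE") },
  { id: "t2", prompt: "second", answerSha256: sha256Hex("FINAL_TWO") },
  { id: "t3", prompt: "third", answerSha256: sha256Hex("FINAL_THREE") },
];
const adapter = new ReplayAdapter({
  t1: "answer: FINAL_ONE",
  t2: "I could not decide",
  t3: "answer: FINAL_WRONG",
});
const scorecard = runSuite(tasks, adapter);
console.log(JSON.stringify(scorecard, null, 2));
```

In this replay, t1 passes, t2 is labeled missing_final_answer, and t3 is labeled answer_hash_mismatch, so the two failure modes can be counted separately.
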

    npm install
    npm test

Run a single replay manually:

    npm run build
    node dist/cli.js run examples/tasks.json --answers examples/answers-good.json --out reports/scorecard.json
    node dist/cli.js report reports/scorecard.json

Run the demo:

    npm run demo

The demo runs two deterministic replays:
- examples/answers-good.json solves every task and reports a 100% score.
- examples/answers-bad.json intentionally omits a final answer for one task, producing a scorecard that exposes the failure taxonomy.
The bad replay exits nonzero when run directly because it did not solve the suite. The demo treats that nonzero status as expected and still prints the scorecard.

    npm test

The test suite builds TypeScript and then runs Node's built-in test runner. It covers deterministic hashing, exact verification, last-match answer anchoring, good and bad replay behavior, and empty-suite rejection through the CLI.
proofbench contains toy tasks and fake answer tokens. It does not include live targets, exploit payloads, private prompts, private datasets, customer data, or model-provider integrations. The repository demonstrates focused evaluation primitives that can be extended into larger regression-gated agent eval systems.