raceksd-source/proofbench

proofbench

proofbench is a deterministic evaluation harness for LLM agents. A run passes only when the agent's final answer matches a SHA-256 oracle. There is no LLM judge, no probabilistic grading path, and no scorer state that can drift between runs.

Most agent evaluations delegate correctness to another model. That makes the result non-deterministic, gameable, expensive to reproduce, and hard to compare across machines or time. proofbench moves pass/fail authority into a cryptographic verifier: given the same adapter outputs, the score is byte-identical across runs, and an agent cannot pass a task without producing the expected preimage.

The same contract supports reproducible LLM and agent evaluation at scale: fixed task suites, replayable adapter outputs, machine-readable scorecards, deterministic regression gates, and failure modes that can be counted without re-judging transcripts.

What It Measures

Each task stores the SHA-256 digest of the accepted final answer. The runner asks an adapter for an answer, extracts the last FINAL_[A-Z0-9_]+ token, hashes that extracted token, and compares it with the oracle digest.
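As a rough illustration of the hash-and-compare step described above, the sketch below hashes a candidate token and checks it against a stored digest. The helper names sha256Hex and verifyAnswer and the token FINAL_DEMO_TOKEN are assumptions made for this example; the actual exports of src/oracle.ts may look different.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex-encoded SHA-256 of a UTF-8 string.
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hypothetical helper: true only when the candidate hashes to the oracle digest.
function verifyAnswer(candidate: string, oracleDigest: string): boolean {
  return sha256Hex(candidate) === oracleDigest.toLowerCase();
}

// A task file would store only the digest, never the token itself.
const storedDigest = sha256Hex("FINAL_DEMO_TOKEN");

console.log(verifyAnswer("FINAL_DEMO_TOKEN", storedDigest)); // true
console.log(verifyAnswer("FINAL_OTHER_TOKEN", storedDigest)); // false
```

Because the comparison is exact string equality on the digest, there is no partial credit and no tolerance to tune.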

That final-answer anchoring matters. If an agent mentions the right token early and later changes its answer, proofbench scores the later answer. The evaluated object is the agent's final committed answer, not the best substring in the transcript.
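Last-match anchoring can be sketched in a few lines. The function name extractFinalAnswer and the sample transcripts are hypothetical; only the FINAL_[A-Z0-9_]+ pattern comes from the description above.

```typescript
// Pattern named in this README; the global flag collects every occurrence.
const FINAL_TOKEN = /FINAL_[A-Z0-9_]+/g;

// Hypothetical helper: returns the last matching token, or null if none exists.
function extractFinalAnswer(transcript: string): string | null {
  const matches = transcript.match(FINAL_TOKEN) ?? [];
  return matches[matches.length - 1] ?? null;
}

const changedMind = "I think it is FINAL_ALPHA_1. On reflection: FINAL_BETA_2";
console.log(extractFinalAnswer(changedMind)); // FINAL_BETA_2
console.log(extractFinalAnswer("no committed answer here")); // null
```

In this sketch an early correct token followed by a later wrong one yields the wrong one, which is the behavior the anchoring rule asks for.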

The current examples are synthetic security-flavored primitives. They are not a product benchmark, a frontier-model leaderboard, or a replacement for broad agent task suites. The point is the scoring contract: deterministic, replayable, and resistant to answer farming because a stored digest is of no use without the answer preimage that produced it.

Features

  • Deterministic oracle: src/oracle.ts verifies answers with SHA-256.
  • Replay adapters: ReplayAdapter loads fixed outputs from JSON; EchoAdapter supports simple CLI smoke paths.
  • Failure taxonomy: current failures distinguish missing_final_answer from answer_hash_mismatch.
  • Last-match answer anchoring: only the final FINAL_... token is scored.
  • Empty-suite rejection: task files must contain at least one task.
  • JSON scorecards: runs produce stable artifacts for CI, comparison, or later reporting.
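The two failure labels listed above could be assigned along the following lines. This is a minimal sketch: the TaskResult shape and the scoreOutput name are assumptions, and only the label strings and the token pattern are taken from this README.

```typescript
import { createHash } from "node:crypto";

// Label strings come from the README; the result shape is hypothetical.
type FailureLabel = "missing_final_answer" | "answer_hash_mismatch";
type TaskResult = { passed: true } | { passed: false; failure: FailureLabel };

// Hypothetical scorer for a single adapter output.
function scoreOutput(output: string, oracleDigest: string): TaskResult {
  const tokens = output.match(/FINAL_[A-Z0-9_]+/g) ?? [];
  const finalToken = tokens[tokens.length - 1];
  if (finalToken === undefined) {
    // The agent never committed to an answer.
    return { passed: false, failure: "missing_final_answer" };
  }
  const digest = createHash("sha256").update(finalToken, "utf8").digest("hex");
  return digest === oracleDigest
    ? { passed: true }
    : { passed: false, failure: "answer_hash_mismatch" };
}

const oracle = createHash("sha256").update("FINAL_OK", "utf8").digest("hex");
console.log(JSON.stringify(scoreOutput("answer: FINAL_OK", oracle)));
console.log(JSON.stringify(scoreOutput("I am not sure", oracle)));
console.log(JSON.stringify(scoreOutput("FINAL_OK, no wait, FINAL_NOPE", oracle)));
```

Keeping the labels in a closed union type means a new failure mode has to be added deliberately, and counts per label stay comparable between runs.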

Architecture

  • Oracle: src/oracle.ts owns hashing and answer verification. This is the sole authority on pass/fail.
  • Runner: src/runner.ts iterates tasks, calls the adapter, anchors the final answer, assigns failure labels, and returns a scorecard.
  • Adapters: src/adapters.ts defines the agent boundary. The included replay adapter makes runs reproducible; provider adapters can be added behind the same interface.
  • CLI and reporting: src/cli.ts, src/io.ts, and src/report.ts load suites, write scorecards, and render concise text reports.
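To make the component boundaries above concrete, here is a compact sketch of an adapter interface, a replay adapter, and a runner loop. Every name in it (Task, Adapter, FixedReplayAdapter, Scorecard, runSuite, and the field names) is hypothetical and chosen for illustration; the real interfaces in src/ may be shaped differently.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shapes; the real task and scorecard schemas may differ.
interface Task { id: string; prompt: string; answerSha256: string; }
interface Adapter { answer(task: Task): Promise<string>; }
interface Scorecard { total: number; passed: number; failures: Record<string, string>; }

// Replays fixed outputs keyed by task id, so a run needs no live model.
class FixedReplayAdapter implements Adapter {
  private readonly outputs: Record<string, string>;
  constructor(outputs: Record<string, string>) {
    this.outputs = outputs;
  }
  async answer(task: Task): Promise<string> {
    return this.outputs[task.id] ?? "";
  }
}

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

async function runSuite(tasks: Task[], adapter: Adapter): Promise<Scorecard> {
  // Empty suites are rejected rather than scored as a vacuous pass.
  if (tasks.length === 0) throw new Error("task suite must contain at least one task");
  const card: Scorecard = { total: tasks.length, passed: 0, failures: {} };
  for (const task of tasks) {
    const output = await adapter.answer(task);
    const tokens = output.match(/FINAL_[A-Z0-9_]+/g) ?? [];
    const finalToken = tokens[tokens.length - 1];
    if (finalToken === undefined) {
      card.failures[task.id] = "missing_final_answer";
    } else if (sha256Hex(finalToken) !== task.answerSha256) {
      card.failures[task.id] = "answer_hash_mismatch";
    } else {
      card.passed += 1;
    }
  }
  return card;
}

const demoTasks: Task[] = [
  { id: "t1", prompt: "toy task one", answerSha256: sha256Hex("FINAL_ONE") },
  { id: "t2", prompt: "toy task two", answerSha256: sha256Hex("FINAL_TWO") },
];
const replay = new FixedReplayAdapter({ t1: "Answer: FINAL_ONE", t2: "I give up" });
runSuite(demoTasks, replay).then((card) => console.log(JSON.stringify(card)));
```

Because the runner depends only on the Adapter interface, a provider-backed adapter could be swapped in without touching the scoring path.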

Quickstart

npm install
npm test

Run a single replay manually:

npm run build
node dist/cli.js run examples/tasks.json --answers examples/answers-good.json --out reports/scorecard.json
node dist/cli.js report reports/scorecard.json

One-Command Demo

npm run demo

The demo runs two deterministic replays:

  • examples/answers-good.json solves every task and reports a 100% score.
  • examples/answers-bad.json intentionally omits a final answer for one task, producing a scorecard that exposes the failure taxonomy.

The bad replay exits nonzero when run directly because it did not solve the suite. The demo treats that nonzero status as expected and still prints the scorecard.

Tests

npm test

The test suite builds the TypeScript sources and then runs Node's built-in test runner. It covers deterministic hashing, exact verification, last-match answer anchoring, good and bad replay behavior, and empty-suite rejection through the CLI.

Scope Boundary

proofbench contains toy tasks and fake answer tokens. It does not include live targets, exploit payloads, private prompts, private datasets, customer data, or model-provider integrations. The repository demonstrates focused evaluation primitives that can be extended into larger regression-gated agent eval systems.

About

Deterministic answer-grounding oracle for LLM outputs — SHA-256 as sole authority, no LLM-judge.
