Eval Bench runs deterministic eval cases before proof recording.
U27-S05
EVAL BENCH
CLASS: SYSTEM
OPERATING_POSITION: 04/07
FUNCTION: Eval Case Execution + Result Artifact Generation
REF_ID: U27-S05-EVAL-BENCH
SOURCE_STATUS: PUBLIC_PACKAGE
ACCESS_STATUS: CLEARED_FOR_EXTERNAL_USE
This repository is a released Unit27 field kit: visible, inspectable, and intended for orientation, testing, and practical use. Controlled protocol materials remain outside this source package.
It answers one narrow question:
Did the declared eval cases run, and what did they record?
Use Eval Bench after a handoff packet defines what should be checked and before Proof Ledger records durable proof.
It is useful when a repo needs deterministic eval results instead of a vague statement that something was tested.
Example:
Problem: The handoff names acceptance checks, but there is no eval result artifact.
Result: Eval Bench runs the declared cases and writes a report Proof Ledger can record.
The current public release is GitHub-first. Run it from a local checkout:
git clone https://github.com/unit27research/unit27-eval-bench
cd unit27-eval-bench
pip install -e .
eval-bench demo
cat eval-bench-demo/u27/EVAL_REPORT.md
On your own repo:
eval-bench init
eval-bench run
eval-bench inspect u27/eval_results.json
Eval Bench writes:
- evals/eval_cases.json
- u27/eval_results.json
- u27/EVAL_REPORT.md
- u27/eval_evidence/*.txt
- evals/proof_cases.json
Eval Bench is designed as a deterministic eval runner, not a benchmark platform or proof ledger.
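As a rough illustration, the artifact paths listed above can be checked after a run with a few lines of Python. This is a minimal sketch, not part of Eval Bench: the helper name `missing_artifacts` is hypothetical, and it assumes every listed path exists after a full run, with at least one evidence file. The source does not say which command writes which file.

```python
from pathlib import Path

# Paths taken from the "Eval Bench writes" list, relative to the repo root.
EXPECTED_FILES = [
    "evals/eval_cases.json",
    "u27/eval_results.json",
    "u27/EVAL_REPORT.md",
    "evals/proof_cases.json",
]
EVIDENCE_GLOB = "u27/eval_evidence/*.txt"


def missing_artifacts(root):
    """Return the expected artifact paths that are absent under root.

    Hypothetical helper for illustration; not an Eval Bench API.
    """
    root = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    # Assumption: a completed run leaves at least one evidence .txt file.
    if not any(root.glob(EVIDENCE_GLOB)):
        missing.append(EVIDENCE_GLOB)
    return missing
```

Calling `missing_artifacts(".")` from the repo root returns an empty list when every listed artifact is present, and the names of the absent ones otherwise.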
Stack Engine -> Context Engine -> Handoff Engine -> Eval Bench -> Proof Ledger -> Boundary Engine -> u27-check
Eval Bench sits after handoff generation and before proof recording. It runs declared eval cases and produces result artifacts. Proof Ledger remains responsible for durable proof records.
eval-bench init
eval-bench run
eval-bench inspect u27/eval_results.json
eval-bench demo
Exit codes:
0 = success
1 = one or more eval cases failed
2 = input or inspection error
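A wrapper script or CI step can branch on these exit codes. The sketch below copies the three meanings from the list above; the function names `interpret_exit_code` and `run_and_interpret` are hypothetical and are not provided by Eval Bench.

```python
import subprocess

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "success",
    1: "one or more eval cases failed",
    2: "input or inspection error",
}


def interpret_exit_code(code):
    """Map an exit code to its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def run_and_interpret(argv):
    """Run a command (list of arguments) and return (exit_code, meaning)."""
    completed = subprocess.run(argv)
    return completed.returncode, interpret_exit_code(completed.returncode)
```

For example, `run_and_interpret(["eval-bench", "run"])` would return the exit code of the run together with its meaning, assuming `eval-bench` is installed and on `PATH`.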
{
  "id": "cli-smoke-test",
  "claim": "The primary CLI command returns usable output.",
  "command": "python3 -c \"print('eval bench smoke ok')\"",
  "expected_exit": 0,
  "limits": ["This eval confirms command execution, not full product correctness."]
}
Eval Bench is released as part of the Unit27 public tooling channel. CI is configured to verify the unit test suite and wheel contents before changes are considered ready.
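To make the `cli-smoke-test` case format above concrete, here is one way a case with `command` and `expected_exit` fields could be evaluated. This is a sketch under stated assumptions, not Eval Bench's actual implementation: the function `run_case` and the shape of the dictionary it returns are hypothetical, and the real result format in `u27/eval_results.json` may differ.

```python
import shlex
import subprocess


def run_case(case):
    """Run one eval case and compare its exit code with expected_exit.

    Hypothetical illustration; the returned keys are assumptions.
    """
    # Split the command string into arguments without invoking a shell.
    argv = shlex.split(case["command"])
    completed = subprocess.run(argv, capture_output=True, text=True)
    return {
        "id": case["id"],
        "passed": completed.returncode == case["expected_exit"],
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
    }
```

A case passes here only when the observed exit code equals `expected_exit`, which matches the stated limit of the example: it confirms command execution, not full product correctness.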
Eval Bench does not:
- Decide which project should be built
- Generate agent handoff packets
- Record durable proof
- Check public claims
- Perform launch QA
- Replace Proof Ledger, Boundary Engine, or u27-check
PYTHONPATH=src python3 -m unittest discover -s tests
eval-bench demo
eval-bench inspect eval-bench-demo/u27/eval_results.json
PYTHONPATH=src python3 -m unittest discover -s tests
PYTHONPATH=src python3 -m eval_bench.cli demo --root examples/sample-project
PYTHONPATH=src python3 -m eval_bench.cli inspect examples/sample-project/u27/eval_results.json
python3 -m pip wheel . --no-deps --no-build-isolation -w /tmp/eval-bench-wheel
python3 scripts/verify_wheel.py /tmp/eval-bench-wheel/unit27_eval_bench-0.1.0-py3-none-any.whl
License: MIT