U27-S05 // Eval Bench

Eval Bench runs deterministic eval cases before proof recording.

U27-S05
EVAL BENCH

CLASS: SYSTEM
OPERATING_POSITION: 04/07
FUNCTION: Eval Case Execution + Result Artifact Generation
REF_ID: U27-S05-EVAL-BENCH

Release Status

SOURCE_STATUS: PUBLIC_PACKAGE
ACCESS_STATUS: CLEARED_FOR_EXTERNAL_USE

This repository is a released Unit27 field kit: visible, inspectable, and intended for orientation, testing, and practical use. Controlled protocol materials remain outside this source package.

It answers one narrow question:

Did the declared eval cases run, and what did they record?

Why Use It

Use Eval Bench after a handoff packet defines what should be checked and before Proof Ledger records durable proof.

It is useful when a repo needs deterministic eval results instead of a vague statement that something was tested.

Example:

Problem: The handoff names acceptance checks, but there is no eval result artifact.
Result: Eval Bench runs the declared cases and writes a report Proof Ledger can record.

60-Second Start

The current public release is GitHub-first. Run it from a local checkout:

git clone https://github.com/unit27research/unit27-eval-bench
cd unit27-eval-bench
pip install -e .
eval-bench demo
cat eval-bench-demo/u27/EVAL_REPORT.md

On your own repo:

eval-bench init
eval-bench run
eval-bench inspect u27/eval_results.json

What It Does

Eval Bench writes:

evals/eval_cases.json
u27/eval_results.json
u27/EVAL_REPORT.md
u27/eval_evidence/*.txt
evals/proof_cases.json

It is a deterministic eval runner: it executes declared cases and writes result artifacts. It is not a benchmark platform or a proof ledger.
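As a quick sanity check after a run, the listed artifacts can be looked up on disk. The sketch below is not part of Eval Bench. The helper name `missing_artifacts` is hypothetical, and the only thing taken from this README is the set of paths. It assumes those paths are relative to the repo root.

```python
from pathlib import Path

# Artifact paths as listed in this README; evidence files are matched by glob.
EXPECTED_FILES = [
    "evals/eval_cases.json",
    "u27/eval_results.json",
    "u27/EVAL_REPORT.md",
    "evals/proof_cases.json",
]
EVIDENCE_GLOB = "u27/eval_evidence/*.txt"


def missing_artifacts(root):
    """Return the listed artifact paths that do not exist under root.

    Hypothetical helper for illustration; Eval Bench itself may expose
    a different way to confirm its outputs.
    """
    root = Path(root)
    missing = [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
    # At least one evidence file is expected once cases have run.
    if not any(root.glob(EVIDENCE_GLOB)):
        missing.append(EVIDENCE_GLOB)
    return missing
```

Calling `missing_artifacts(".")` from the repo root returns an empty list when every listed artifact is present.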

System Position

Stack Engine -> Context Engine -> Handoff Engine -> Eval Bench -> Proof Ledger -> Boundary Engine -> u27-check

Eval Bench sits after handoff generation and before proof recording. It runs declared eval cases and produces result artifacts. Proof Ledger remains responsible for durable proof records.

CLI

eval-bench init
eval-bench run
eval-bench inspect u27/eval_results.json
eval-bench demo

Exit codes:

0 = success
1 = one or more eval cases failed
2 = input or inspection error
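A script that wraps the CLI can branch on these codes. The following is a minimal sketch, assuming `eval-bench` is on `PATH`. The function names `describe_exit` and `run_eval_bench` are hypothetical, and only the three code meanings come from the table above.

```python
import subprocess

# Meanings copied from the exit-code table above.
EXIT_MEANINGS = {
    0: "success",
    1: "one or more eval cases failed",
    2: "input or inspection error",
}


def describe_exit(code):
    """Translate an eval-bench exit code into its documented meaning."""
    return EXIT_MEANINGS.get(code, f"undocumented exit code {code}")


def run_eval_bench(args=("run",)):
    """Invoke the eval-bench CLI and return (exit code, meaning).

    Assumes the eval-bench executable is installed and on PATH.
    """
    completed = subprocess.run(["eval-bench", *args])
    return completed.returncode, describe_exit(completed.returncode)
```

Any code outside the table is reported as undocumented rather than guessed at.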

Case Shape

{
  "id": "cli-smoke-test",
  "claim": "The primary CLI command returns usable output.",
  "command": "python3 -c \"print('eval bench smoke ok')\"",
  "expected_exit": 0,
  "limits": ["This eval confirms command execution, not full product correctness."]
}
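To make the case shape concrete, here is a rough sketch of how a single case could be evaluated: run `command`, then compare the process exit code with `expected_exit`. This is an illustration, not Eval Bench's implementation. The function `run_case`, the result keys, and the choice to treat `id`, `claim`, `command`, and `expected_exit` as required are all assumptions.

```python
import shlex
import subprocess

# Assumption: these four keys are required; "limits" is treated as optional.
REQUIRED_KEYS = ("id", "claim", "command", "expected_exit")


def run_case(case):
    """Run one case's command and compare its exit code to expected_exit."""
    missing = [key for key in REQUIRED_KEYS if key not in case]
    if missing:
        raise ValueError(f"case is missing keys: {missing}")
    # shlex.split avoids shell=True; the real runner may execute commands differently.
    completed = subprocess.run(
        shlex.split(case["command"]), capture_output=True, text=True
    )
    return {
        "id": case["id"],
        "passed": completed.returncode == case["expected_exit"],
        "exit_code": completed.returncode,
        "stdout": completed.stdout,
    }
```

A case passes only when the observed exit code equals `expected_exit`, which matches the stated limit of the example: it confirms command execution, not full product correctness.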

Reliability

Eval Bench is released as part of the Unit27 public tooling channel. CI runs the unit test suite and verifies the wheel contents before changes are considered ready.

What It Does Not Do

Eval Bench does not:

Decide which project should be built
Generate agent handoff packets
Record durable proof
Check public claims
Perform launch QA
Replace Proof Ledger, Boundary Engine, or u27-check

Verify

PYTHONPATH=src python3 -m unittest discover -s tests
eval-bench demo
eval-bench inspect eval-bench-demo/u27/eval_results.json

Acceptance

PYTHONPATH=src python3 -m unittest discover -s tests
PYTHONPATH=src python3 -m eval_bench.cli demo --root examples/sample-project
PYTHONPATH=src python3 -m eval_bench.cli inspect examples/sample-project/u27/eval_results.json
python3 -m pip wheel . --no-deps --no-build-isolation -w /tmp/eval-bench-wheel
python3 scripts/verify_wheel.py /tmp/eval-bench-wheel/unit27_eval_bench-0.1.0-py3-none-any.whl

License

MIT
