[codex] P12-S4: ship public eval harness #155
Merged
Summary
This sprint turns Alice quality into a reproducible public eval surface instead of leaving quality claims spread across ad hoc fixtures and one-off checks.
It adds persisted eval suites, cases, runs, and results; a local public eval runner; checked-in fixture catalog and baseline report artifacts; and CLI/API surfaces for suite listing, run execution, and report inspection.
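As a rough illustration of how suites, cases, runs, and results relate, here is a minimal in-memory sketch. Every class name, field, and the `run_suite` helper are hypothetical; the PR does not show the real schema or runner, so treat this as one plausible shape rather than the shipped model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    # Hypothetical fields: a stable key plus the expected answer for the case.
    case_key: str
    expected: str


@dataclass(frozen=True)
class EvalSuite:
    # A suite is an ordered group of cases.
    suite_key: str
    cases: tuple[EvalCase, ...]


@dataclass(frozen=True)
class EvalResult:
    # One result per executed case.
    case_key: str
    passed: bool


@dataclass
class EvalRun:
    # A run records the results of executing one suite once.
    suite_key: str
    results: list[EvalResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        # An empty run reports 0.0 instead of dividing by zero.
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)


def run_suite(suite: EvalSuite, answer_fn: Callable[[EvalCase], str]) -> EvalRun:
    """Execute every case in order and record whether the answer matched."""
    run = EvalRun(suite_key=suite.suite_key)
    for case in suite.cases:
        actual = answer_fn(case)
        run.results.append(EvalResult(case.case_key, actual == case.expected))
    return run
```

In this sketch a run is derived entirely from the suite definition and the system under test, which is the property that makes a stored run reproducible.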
Why This Changed
Phase 12 needed a repeatable evidence boundary for retrieval, mutation, contradiction, and open-loop quality. After P12-S1 through P12-S3, the repo had stronger behavior but not a public harness that could reproduce and explain those quality claims in a stable way.
User And Operator Impact
Operators can now list public suites, run the harness, inspect stored runs, and compare emitted reports against the checked-in baseline artifact. The checked-in fixture catalog is the current branch source of truth for suite definitions and ordering, and runtime sync prunes stale persisted suite/case rows.
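The catalog-as-source-of-truth behavior described above can be sketched as a small planning step that decides which persisted suite keys to insert, keep, or prune. The function name `plan_suite_sync` and the idea of identifying suites by a string key are assumptions for illustration; the actual sync logic in the branch may differ.

```python
def plan_suite_sync(
    catalog_keys: list[str], persisted_keys: list[str]
) -> dict[str, list[str]]:
    """Plan a sync in which the checked-in catalog wins over persisted rows.

    Hypothetical helper: keys present only in storage are treated as stale
    and scheduled for pruning; catalog ordering is preserved for the rest.
    """
    catalog = set(catalog_keys)
    persisted = set(persisted_keys)
    return {
        # New suites, in catalog order.
        "insert": [key for key in catalog_keys if key not in persisted],
        # Suites already stored, in catalog order.
        "keep": [key for key in catalog_keys if key in persisted],
        # Stale rows that no longer appear in the catalog.
        "prune": sorted(persisted - catalog),
    }
```

For example, with a catalog of `["retrieval", "mutation", "contradiction"]` and persisted keys `["mutation", "legacy_suite"]`, the plan inserts `retrieval` and `contradiction`, keeps `mutation`, and prunes `legacy_suite`.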
Validation
./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q
./.venv/bin/python scripts/check_control_doc_truth.py
rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines
Upgrade Overview
Protected Areas
Compatibility Impact
This branch adds eval-specific storage and public evaluation surfaces without reopening shipped retrieval, mutation, or contradiction implementations beyond testability hooks. Existing shipped runtime behavior becomes measurable through fixture-backed suites and baseline artifacts.
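One way to make "measurable against a baseline" concrete is a small diff between an emitted report and the checked-in baseline. The report shape used here (a mapping from suite key to `passed`/`total` counts) is an assumption; the real format of the baseline artifact is not shown in this PR.

```python
import json


def diff_against_baseline(report: dict, baseline: dict) -> list[str]:
    """List suites whose summary differs between a report and the baseline.

    Hypothetical shape: {suite_key: {"passed": int, "total": int}}.
    Suites missing on either side are reported with a value of None.
    """
    differences = []
    for suite_key in sorted(set(report) | set(baseline)):
        got = report.get(suite_key)
        want = baseline.get(suite_key)
        if got != want:
            differences.append(f"{suite_key}: baseline={want!r} report={got!r}")
    return differences


# Inline JSON stands in for the artifact files in this sketch.
baseline = json.loads('{"retrieval": {"passed": 3, "total": 3}}')
report = json.loads('{"retrieval": {"passed": 2, "total": 3}}')
regressions = diff_against_baseline(report, baseline)
```

An empty list means the report matches the baseline exactly; any entry names the suite that drifted and shows both values.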
Migration / Rollout
Apply the Alembic migration for eval_suites, eval_cases, eval_runs, and eval_results before using persisted run storage. The current branch exposes /v1/evals/* and a checked-in JSON baseline artifact, but final API and artifact-format policy still remain explicit Control Tower decisions.
Operator Action
Operators can use the current branch surfaces to list suites, run the harness, inspect recent runs, and read a stored run report through CLI or API. The current branch treats eval/fixtures/public_eval_suites.json as the authoritative suite catalog and eval/baselines/public_eval_harness_v1.json as the checked-in baseline artifact for this sprint.
Validation
The approved branch head passed the focused eval-runner regression slice, the control-doc truth check, and the local-path scrub listed above.
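For readers without ripgrep, the local-path scrub can be approximated in Python. This is a sketch of the same idea as the `rg -n` command listed under Validation, not the check the branch actually ran; the helper names `scan_text` and `scan_paths` are hypothetical, and the pattern in the example is a shortened stand-in for the one in that command.

```python
import re
from pathlib import Path


def scan_text(name: str, text: str, regex: re.Pattern) -> list[tuple[str, int, str]]:
    """Return (name, line_number, line) for every line that matches."""
    return [
        (name, number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def scan_paths(paths: list[str], pattern: str) -> list[tuple[str, int, str]]:
    """Scan files and directory trees for lines that leak local paths."""
    regex = re.compile(pattern)
    hits = []
    for raw in paths:
        root = Path(raw)
        if root.is_file():
            files = [root]
        elif root.is_dir():
            files = sorted(p for p in root.rglob("*") if p.is_file())
        else:
            files = []  # Missing paths are skipped in this sketch.
        for file in files:
            try:
                text = file.read_text(encoding="utf-8")
            except (UnicodeDecodeError, OSError):
                continue  # Skip binary or unreadable files.
            hits.extend(scan_text(str(file), text, regex))
    return hits


# Example on an in-memory document instead of real files.
sample = "See docs/evals for details.\nGenerated at /Users/example/project\n"
leaks = scan_text("sample.md", sample, re.compile(r"/Users|Desktop/Codex"))
```

An empty result plays the same role as `rg` finding no matches: no machine-specific paths remain in the scanned documents and artifacts.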
Rollback
Roll back by reverting this squash merge and the eval-harness migration together so the runtime, persisted eval rows, fixture catalog expectations, and baseline artifact stay aligned.