[codex] P12-S4: ship public eval harness by samrusani · Pull Request #155 · samrusani/AliceBot

samrusani · 2026-04-14T18:36:52Z

Summary

This sprint turns Alice quality into a reproducible public eval surface instead of leaving quality claims spread across ad hoc fixtures and one-off checks.

It adds persisted eval suites, cases, runs, and results; a local public eval runner; a checked-in fixture catalog and baseline report artifacts; and CLI/API surfaces for suite listing, run execution, and report inspection.
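As a rough illustration of how suites, cases, runs, and results relate to each other, here is a minimal in-memory sketch. Every class and field name below is hypothetical and chosen for illustration; it is not the schema or model code that this PR ships.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvalCase:
    # Hypothetical fields: one case belongs to exactly one suite.
    case_key: str
    suite_key: str
    expected: dict


@dataclass
class EvalSuite:
    # Hypothetical fields: a suite groups ordered cases.
    suite_key: str
    title: str
    cases: list[EvalCase] = field(default_factory=list)


@dataclass
class EvalResult:
    # One result per case per run.
    case_key: str
    passed: bool
    detail: str = ""


@dataclass
class EvalRun:
    # A run covers one or more suites and collects their results.
    run_id: str
    suite_keys: list[str]
    results: list[EvalResult] = field(default_factory=list)

    def pass_rate(self) -> Optional[float]:
        """Fraction of passing results, or None when the run is empty."""
        if not self.results:
            return None
        return sum(r.passed for r in self.results) / len(self.results)
```

The point of the sketch is only the relationship: a run references suites, and each result points back at a case, which is what lets a stored run be turned into a report later.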

Why This Changed

Phase 12 needed a repeatable evidence boundary for retrieval, mutation, contradiction, and open-loop quality. After P12-S1 through P12-S3, the repo had stronger behavior but not a public harness that could reproduce and explain those quality claims in a stable way.

User And Operator Impact

Operators can now list public suites, run the harness, inspect stored runs, and compare emitted reports against the checked-in baseline artifact. The checked-in fixture catalog is the current branch source of truth for suite definitions and ordering, and runtime sync prunes stale persisted suite/case rows.
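The catalog-as-source-of-truth behavior described above can be sketched as a pure function: rows whose keys no longer appear in the catalog are pruned, and the surviving rows take their order from the catalog. The function name, the `suite_key` and `position` fields, and the dict-based storage are all assumptions made for this sketch, not the branch's actual sync code.

```python
from __future__ import annotations


def sync_suites(
    catalog: list[dict], persisted: dict[str, dict]
) -> tuple[dict[str, dict], list[str]]:
    """Return (synced rows keyed by suite_key, sorted list of pruned keys).

    `catalog` stands in for the checked-in fixture catalog and `persisted`
    for the stored suite rows; both shapes are hypothetical.
    """
    catalog_keys = {suite["suite_key"] for suite in catalog}

    # Anything persisted but absent from the catalog is stale.
    pruned = sorted(key for key in persisted if key not in catalog_keys)

    # Rebuild rows in catalog order so ordering follows the checked-in file.
    synced: dict[str, dict] = {}
    for position, suite in enumerate(catalog):
        synced[suite["suite_key"]] = {**suite, "position": position}
    return synced, pruned
```

A case-level sync would presumably follow the same pattern, keyed by suite and case together.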

Validation

./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q
./.venv/bin/python scripts/check_control_doc_truth.py
rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines

Upgrade Overview

Protected Areas

Compatibility Impact

This branch adds eval-specific storage and public evaluation surfaces without reopening shipped retrieval, mutation, or contradiction implementations beyond testability hooks. Existing shipped runtime behavior becomes measurable through fixture-backed suites and baseline artifacts.

Migration / Rollout

Apply the Alembic migration for eval_suites, eval_cases, eval_runs, and eval_results before using persisted run storage. The current branch exposes /v1/evals/* and a checked-in JSON baseline artifact, but final API and artifact-format policy remain explicit Control Tower decisions.
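One way to guard against using persisted run storage before the migration has been applied is a small pre-flight check for the four tables named above. The sketch below uses an in-memory SQLite database purely as a stand-in so that it is self-contained; the project's real database, connection handling, and column layout are not shown in this PR, and the helper name is hypothetical.

```python
from __future__ import annotations

import sqlite3

# Table names are taken from the migration note; everything else is assumed.
REQUIRED_TABLES = ("eval_suites", "eval_cases", "eval_runs", "eval_results")


def missing_eval_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the required eval tables that do not exist yet, in a fixed order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return [table for table in REQUIRED_TABLES if table not in present]
```

An empty return value would mean all four tables exist; on a different database engine the catalog query would need to change accordingly.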

Operator Action

Operators can use the current branch surfaces to list suites, run the harness, inspect recent runs, and read a stored run report through CLI or API. The current branch treats eval/fixtures/public_eval_suites.json as the authoritative suite catalog and eval/baselines/public_eval_harness_v1.json as the checked-in baseline artifact for this sprint.
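Comparing an emitted report against the checked-in baseline can be approximated with a recursive dictionary diff. The sketch below assumes only that both documents are JSON objects; the keys used in the usage note (`suites`, `passed`, `total`) are invented for illustration and do not reflect the actual format of eval/baselines/public_eval_harness_v1.json.

```python
from __future__ import annotations


def diff_reports(baseline: dict, report: dict, path: str = "") -> list[str]:
    """List human-readable differences between two JSON-like dicts."""
    diffs: list[str] = []
    for key in sorted(set(baseline) | set(report)):
        here = f"{path}.{key}" if path else key
        if key not in report:
            diffs.append(f"missing in report: {here}")
        elif key not in baseline:
            diffs.append(f"unexpected in report: {here}")
        elif isinstance(baseline[key], dict) and isinstance(report[key], dict):
            # Recurse into nested objects and keep the dotted path.
            diffs.extend(diff_reports(baseline[key], report[key], here))
        elif baseline[key] != report[key]:
            diffs.append(f"changed: {here} ({baseline[key]!r} -> {report[key]!r})")
    return diffs
```

In practice the two inputs would come from `json.load` on the baseline file and on a freshly emitted report; an empty list means the report matches the baseline exactly. Lists are compared as whole values here, which may be too strict if the report format includes unordered collections.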

Validation

The approved branch head passed the focused eval-runner regression slice, the control-doc truth check, and the local-path scrub listed above.

Rollback

Roll back by reverting this squash merge and the eval-harness migration together so the runtime, persisted eval rows, fixture catalog expectations, and baseline artifact stay aligned.

P12-S4: ship public eval harness

a99422b

samrusani merged commit dd77643 into main Apr 14, 2026
4 checks passed

samrusani deleted the codex/p12-s4-public-eval-harness branch April 14, 2026 18:37

samrusani merged 1 commit into main from codex/p12-s4-public-eval-harness
