[codex] P12-S4: ship public eval harness #155

Merged: samrusani merged 1 commit into main from codex/p12-s4-public-eval-harness on Apr 14, 2026

@samrusani
Owner

Summary

This sprint turns Alice quality into a reproducible public eval surface instead of leaving quality claims spread across ad hoc fixtures and one-off checks.

It adds persisted eval suites, cases, runs, and results; a local public eval runner; checked-in fixture catalog and baseline report artifacts; and CLI/API surfaces for suite listing, run execution, and report inspection.
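As a rough illustration of how the four persisted record types relate, here is a minimal in-memory sketch. It is not the shipped schema: the class names, every field name, and the `pass_rate` helper are hypothetical and chosen only to show the suite → case and run → result relationships.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_key: str   # hypothetical stable identifier within a suite
    expected: dict  # hypothetical expected-outcome payload


@dataclass
class EvalSuite:
    suite_key: str  # hypothetical suite identifier
    cases: list[EvalCase] = field(default_factory=list)


@dataclass
class EvalResult:
    case_key: str   # which case this result belongs to
    passed: bool


@dataclass
class EvalRun:
    suite_key: str  # which suite this run executed
    results: list[EvalResult] = field(default_factory=list)

    def pass_rate(self) -> float:
        """Fraction of passing results; 0.0 for a run with no results."""
        if not self.results:
            return 0.0
        return sum(r.passed for r in self.results) / len(self.results)
```

A run would then hold one result per case of the suite it executed, and a report could be derived from the run alone.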

Why This Changed

Phase 12 needed a repeatable evidence boundary for retrieval, mutation, contradiction, and open-loop quality. After P12-S1 through P12-S3, the repo had stronger behavior but not a public harness that could reproduce and explain those quality claims in a stable way.

User And Operator Impact

Operators can now list public suites, run the harness, inspect stored runs, and compare emitted reports against the checked-in baseline artifact. The checked-in fixture catalog is the current branch source of truth for suite definitions and ordering, and runtime sync prunes stale persisted suite/case rows.
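The catalog-as-source-of-truth behavior can be pictured with a small sketch. This assumes suites are identified by a string key; `plan_catalog_sync` and its return shape are hypothetical and do not reflect the actual sync code in this branch.

```python
def plan_catalog_sync(
    catalog_keys: list[str], persisted_keys: set[str]
) -> dict[str, list[str]]:
    """Plan a sync from the checked-in catalog to persisted rows.

    Hypothetical helper: returns the keys to insert, the stale keys to
    prune, and the ordering to apply (taken from the catalog).
    """
    catalog_set = set(catalog_keys)
    return {
        # New rows are inserted in catalog order.
        "insert": [k for k in catalog_keys if k not in persisted_keys],
        # Persisted rows that the catalog no longer defines are stale.
        "prune": sorted(persisted_keys - catalog_set),
        # The catalog decides the final ordering.
        "order": list(catalog_keys),
    }
```

The point of the sketch is the direction of authority: persisted rows never add to or reorder the catalog; they are only brought into line with it.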

Validation

  • ./.venv/bin/pytest tests/unit/test_public_evals.py tests/unit/test_20260414_0060_phase12_public_eval_harness.py tests/unit/test_cli.py tests/unit/test_main.py tests/integration/test_public_evals_api.py tests/integration/test_cli_integration.py tests/integration/test_retrieval_evaluation_api.py -q
  • ./.venv/bin/python scripts/check_control_doc_truth.py
  • rg -n "/Users|samirusani|Desktop/Codex" RULES.md ARCHITECTURE.md CURRENT_STATE.md .ai/handoff/CURRENT_STATE.md PRODUCT_BRIEF.md ROADMAP.md docs/evals eval/fixtures eval/baselines

Upgrade Overview

Protected Areas

  • memory schema
  • evidence pipeline
  • trust rules
  • promotion logic
  • continuity APIs

Compatibility Impact

This branch adds eval-specific storage and public evaluation surfaces without reopening shipped retrieval, mutation, or contradiction implementations beyond testability hooks. Existing shipped runtime behavior becomes measurable through fixture-backed suites and baseline artifacts.

Migration / Rollout

Apply the Alembic migration for eval_suites, eval_cases, eval_runs, and eval_results before using persisted run storage. The current branch exposes /v1/evals/* and a checked-in JSON baseline artifact, but the final API and artifact-format policy remain explicit Control Tower decisions.

Operator Action

Operators can use the current branch surfaces to list suites, run the harness, inspect recent runs, and read a stored run report through CLI or API. The current branch treats eval/fixtures/public_eval_suites.json as the authoritative suite catalog and eval/baselines/public_eval_harness_v1.json as the checked-in baseline artifact for this sprint.
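For comparing an emitted report against the checked-in baseline, a generic top-level JSON diff is often enough as a first pass. The sketch below makes no assumption about the report schema beyond "a JSON object"; `load_report` and `diff_reports` are hypothetical helpers, not part of the shipped CLI or API.

```python
import json
from pathlib import Path


def load_report(path: str) -> dict:
    """Read a JSON report file into a dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def diff_reports(baseline: dict, emitted: dict) -> dict[str, tuple]:
    """Map each differing top-level key to (baseline_value, emitted_value).

    A key missing on one side is reported with None on that side, so a
    missing key and an explicit null are treated as equal.
    """
    diffs = {}
    for key in sorted(baseline.keys() | emitted.keys()):
        if baseline.get(key) != emitted.get(key):
            diffs[key] = (baseline.get(key), emitted.get(key))
    return diffs
```

An operator could call `diff_reports(load_report("eval/baselines/public_eval_harness_v1.json"), load_report(emitted_path))`, where `emitted_path` is wherever the local run wrote its report; an empty result means the two reports agree at the top level.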

Validation

The approved branch head passed the focused eval-runner regression slice, the control-doc truth check, and the local-path scrub listed above.

Rollback

Roll back by reverting this squash merge and the eval-harness migration together so the runtime, persisted eval rows, fixture catalog expectations, and baseline artifact stay aligned.

@samrusani samrusani merged commit dd77643 into main Apr 14, 2026
4 checks passed
@samrusani samrusani deleted the codex/p12-s4-public-eval-harness branch April 14, 2026 18:37