Eval: 10 mock sites + routine-compile eval + full-eval benchmark report #56

Merged: softpudding merged 14 commits into main from eval/full-20260411-benchmark on Apr 11, 2026

@softpudding (Owner)

Summary

Snapshots the exact state of feat/optimize-browser-routine that the 20260411 full-eval benchmark was run against. This PR is a superset of the eval work currently landed between main and this branch.

Noteworthy contents (13 commits total):

Mock eval sites (10 new hard sites):

  • c451049 / cdf21cf — 6 codex-authored hard mock sites: Amazon, Booking, Drive, GitHub, Gmail + tracker/server plumbing + 13 test-case YAMLs.
  • 895c376 — 4 follow-up mock sites: MapQuest, StayBnB, TaskFlow, VidHub + 8 test-case YAMLs + eval/AGENTS.md gotchas doc.

Routine compile/replay eval infra:

  • 5d82823, 45c07e3, 12ba733, 3483695, 25b3a2e, c1f9f50, cc0f520 — routine record/replay eval support, TechForum interaction expansion, compiled-replay fixtures, eval server surrogate fix, record-compile-replay workflow docs, compiler ask_user prompting + Finviz fixture fixes.

Full-eval 20260411 benchmark (this commit, 37a38f1):

  • eval/reports/20260411_full_eval.md — 105-run benchmark across qwen3.5-flash, qwen3.5-plus, qwen3.6-plus on the full 35-test dataset. Documents seven recurring OpenBrowser agent issues (missing drag primitive, stale-DOM disorientation, terminal error interpretation, feedback-loop blindness, instruction-precision drift, missing completion signal, 8765 HTTP channel fragility) with per-test evidence and proposed fixes.
  • eval/evaluation_report.json refreshed with 20260411 numbers (105 tests, 82.86% pass rate).
  • eval/routine_eval/evaluate_routine_compile.py main-report build/write helpers factored out so partial runs still emit a valid report.
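As a quick consistency check on the figures above, the 105 runs follow from 35 tests across 3 models. The pass count is not stated in this summary; the value below is inferred from the reported 82.86% and should be treated as an assumption.

```python
# Figures taken from the summary: 35 tests, 3 models, 82.86% pass rate.
TESTS = 35
MODELS = 3
REPORTED_PASS_RATE = 82.86  # percent, as written in evaluation_report.json

runs = TESTS * MODELS

# Assumption: the pass count is inferred by rounding, it is not quoted in the PR.
implied_passes = round(REPORTED_PASS_RATE / 100 * runs)

# Recompute the rate from the inferred count to confirm it round-trips.
recomputed_rate = round(implied_passes / runs * 100, 2)

print(runs, implied_passes, recomputed_rate)
```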

Test plan

  • Full 105-run eval completed (~3h, 82.9% avg pass rate across models) — see eval/reports/20260411_full_eval.md
  • All 35 tests load without schema errors
  • Reviewer sanity-checks the observation report and mock site implementations

🤖 Generated with Claude Code

softpudding and others added 14 commits April 9, 2026 20:17
- Fix _format_agent_event to match actual SSE event types (ActionEvent/
  ObservationEvent/MessageEvent) and use frontend-matching labels
  (STEP/ASK/RESULT/AGENT)
- Collapse multiline event bodies to single lines with "|" separator
  for readable terminal output
- Add --full-events flag for no-truncation mode (default 500 chars)
- Add _log() helper for immediate flushed stderr output at milestones
- Replace synthetic fixtures with real browser recordings
- Delete unused fixtures (techforum_upvote_clear, finviz_threshold_ambiguous)
- Update finviz_filter_clear expectations: require asking about sort
  direction (user's real goal is finding 20% monthly drops)
- Update replay YAML and golden routine for multi-filter Finviz workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
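
A minimal sketch of the single-line collapse and truncation described above. The function name `collapse_event_body`, the exact spacing around the "|" separator, and the "..." truncation marker are assumptions for illustration; only the separator character and the 500-character default come from the commit message.

```python
from typing import Optional


def collapse_event_body(body: str, max_chars: Optional[int] = 500) -> str:
    """Collapse a multiline event body to one line for terminal output.

    Hypothetical helper: joins non-empty lines with " | " and truncates
    to max_chars. Passing max_chars=None mimics a no-truncation mode
    such as the one --full-events enables.
    """
    parts = [line.strip() for line in body.splitlines() if line.strip()]
    flat = " | ".join(parts)
    if max_chars is not None and len(flat) > max_chars:
        flat = flat[:max_chars] + "..."
    return flat


print(collapse_event_body("click #submit\n  status: ok\n\nurl: /techforum/"))
```

With `max_chars=None` the body is returned in full, which is the behaviour one would expect from a no-truncation flag.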
Save two compiler-produced routines as static test assets and wire them
into evaluate_browser_agent.py via replay YAML test cases:

- techforum_search_upvote_agents.md: search "AI", upvote+collect agent
  posts, open comments (6 criteria, 10 pts)
- finviz_filter_sort_open.md: 5 filters, Performance view, sort by
  Perf Month ascending, open top 3 losers (8 criteria, 12 pts)

Both run with: uv run python eval/evaluate_browser_agent.py \
  --test replay_techforum_upvote --test replay_finviz_filter_simple

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix UnicodeEncodeError in eval server when tracker events contain
  emoji surrogates (e.g. 👍 from TechForum upvote buttons)
- Fix search_for_ai criterion: the search event fires on /techforum/,
  not /techforum/search.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
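
For context, a sketch of how a surrogate-related UnicodeEncodeError can arise and one way to avoid it. This is not necessarily the fix applied in the commit; `repair_surrogates` is a hypothetical helper. It assumes the tracker payload reaches Python as a str holding the two halves of a UTF-16 surrogate pair as separate code points, which UTF-8 refuses to encode.

```python
def repair_surrogates(text: str) -> str:
    """Recombine split UTF-16 surrogate pairs into real code points.

    Hypothetical helper. Round-tripping through UTF-16 with
    "surrogatepass" lets the decoder merge a high/low pair into the
    intended character; unpaired surrogates are substituted by the
    "replace" error handler instead of raising.
    """
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


raw = "\ud83d\udc4d"  # thumbs-up emoji split into two surrogate code points

try:
    raw.encode("utf-8")
except UnicodeEncodeError as exc:
    print("unrepaired string fails:", exc.reason)

fixed = repair_surrogates(raw)
print(fixed.encode("utf-8"))
```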
…tions

Bump agent-sdk to b7f39b82 which reframes the sorted-list asking rule
around position-based vs identity-based replay divergence with a worked
example, making the compiler reliably ask clarification questions.

Also fix the finviz_filter_clear fixture: remove the sort-direction
question from required expectations since the sort state is observable
from the trace (element class + keyframe values). Update intent_summary
and raw_intention to match.

Eval results: 2/2 pass, mean asking_behavior 1.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds four new browser-agent eval sites targeting interaction primitives
not yet covered by the existing suite: panel-state navigation and icon-only
transport buttons (MapQuest); dual-handle price slider drag, segmented
search popovers, and two-step booking checkout (StayBnB); HTML5 drag-and-drop
with hover-reveal inline editing (TaskFlow); auto-hide player controls,
thin-bar timeline scrub, nested settings popup, and hover-reveal volume
slider (VidHub).

Each site ships with 2 test-case YAMLs under eval/dataset/ scored against
tracker events, for 8 new tests total. eval/README.md documents each site
with main challenges. eval/AGENTS.md captures non-obvious implementation
gotchas learned during the build (stacking context with header popovers,
default-state event anti-pattern, tracker case normalization, dual-handle
drag, deep-link entry points). eval/SPEC_NEW_SITES.md is the design brief
the sites were generated from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…outine

Brings in 4 additional mock eval sites (MapQuest, StayBnB, TaskFlow, VidHub)
from the parallel new-mock-eval-sites worktree, alongside the 5 sites already
merged from codex/mock-eval-sites (gmail, drive, booking, github, amazon).

Conflict resolution in eval/server.py: kept the generic file-serving refactor
and DEFAULT_PORT / configurable-port main() from the target branch, and folded
the 4 new site entries into SITE_NAME_TO_BUCKET, /api/sites, /api/help, and
print_startup_info alongside the existing codex entries. The 4 new URL_MAPPINGS
entries for /mapquest/, /staybnb/, /taskflow/, /vidhub/ are redundant with
send_file's directory→index.html fallback but are kept for consistency with
the legacy dataflow/finviz/bluebook/northstar entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
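
To illustrate why the explicit mappings are redundant, here is a sketch of a directory→index.html fallback. The function `resolve_static_path` and the contents of `URL_MAPPINGS` below are assumptions for illustration and are not copied from eval/server.py.

```python
def resolve_static_path(url_path: str) -> str:
    """Map a request path to a relative file path.

    Hypothetical stand-in for the fallback inside send_file: a path
    ending in "/" is treated as a directory and served from its
    index.html.
    """
    relative = url_path.lstrip("/")
    if url_path.endswith("/"):
        return relative + "index.html"
    return relative


# Assumed shape of the explicit entries for the four new sites.
URL_MAPPINGS = {
    "/mapquest/": "mapquest/index.html",
    "/staybnb/": "staybnb/index.html",
    "/taskflow/": "taskflow/index.html",
    "/vidhub/": "vidhub/index.html",
}

# Each explicit entry resolves to what the fallback would produce anyway.
for url, target in URL_MAPPINGS.items():
    print(url, resolve_static_path(url) == target)
```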
- eval/reports/20260411_full_eval.md: 105-run benchmark across
  qwen3.5-flash, qwen3.5-plus, qwen3.6-plus on the full 35-test dataset
  (pre-existing + codex mock sites + follow-up mock sites). Documents
  seven recurring OpenBrowser agent issues with per-test evidence and
  proposed fixes.
- eval/evaluation_report.json: refreshed with the 20260411 run
  (105 tests, 82.86% pass rate).
- eval/routine_eval/evaluate_routine_compile.py: factor main-report
  build/write helpers so partial runs still emit a valid report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
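
A minimal sketch of the idea behind splitting report construction from report writing so that an interrupted run can still emit valid JSON. The names `build_report` and `write_report`, the report fields, and the atomic-rename strategy are all assumptions; the actual helpers in evaluate_routine_compile.py may differ.

```python
import json
import os
import tempfile
from pathlib import Path


def build_report(results: list, expected_total: int) -> dict:
    """Build a report dict from whatever results exist so far (hypothetical)."""
    completed = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "expected_total": expected_total,
        "completed": completed,
        "passed": passed,
        "pass_rate": round(100 * passed / completed, 2) if completed else 0.0,
        "partial": completed < expected_total,
        "results": results,
    }


def write_report(report: dict, path) -> None:
    """Write via a temp file and rename, so readers never see a half-written file."""
    path = Path(path)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(report, fh, indent=2)
    os.replace(tmp_name, path)


# Usage: call after every test so the file on disk is always valid JSON.
with tempfile.TemporaryDirectory() as tmp_dir:
    out = Path(tmp_dir) / "report.json"
    partial = build_report([{"name": "t1", "passed": True}], expected_total=2)
    write_report(partial, out)
    print(json.loads(out.read_text(encoding="utf-8"))["partial"])
```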
Satisfies the pre-commit black hook on files touched by earlier
mock-site and routine-eval commits on this branch. No logic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@softpudding softpudding merged commit 3d251b5 into main Apr 11, 2026
4 checks passed