Eval: 10 mock sites + routine-compile eval + full-eval benchmark report by softpudding · Pull Request #56 · softpudding/OpenBrowser

softpudding · 2026-04-11T13:53:51Z

Summary

Snapshots the exact state of feat/optimize-browser-routine that the 20260411 full-eval benchmark was run against. This PR is a superset of the eval work currently landed between main and this branch.

Noteworthy contents (13 commits total):

Mock eval sites (10 new hard sites):

c451049 / cdf21cf — 6 codex-authored hard mock sites: Amazon, Booking, Drive, GitHub, Gmail + tracker/server plumbing + 13 test-case YAMLs.
895c376 — 4 follow-up mock sites: MapQuest, StayBnB, TaskFlow, VidHub + 8 test-case YAMLs + eval/AGENTS.md gotchas doc.

Routine compile/replay eval infra:

5d82823, 45c07e3, 12ba733, 3483695, 25b3a2e, c1f9f50, cc0f520 — routine record/replay eval support, TechForum interaction expansion, compiled-replay fixtures, eval server surrogate fix, record-compile-replay workflow docs, compiler ask_user prompting + Finviz fixture fixes.

Full-eval 20260411 benchmark (this commit, 37a38f1):

eval/reports/20260411_full_eval.md — 105-run benchmark across qwen3.5-flash, qwen3.5-plus, qwen3.6-plus on the full 35-test dataset. Documents seven recurring OpenBrowser agent issues (missing drag primitive, stale-DOM disorientation, terminal error interpretation, feedback-loop blindness, instruction-precision drift, missing completion signal, 8765 HTTP channel fragility) with per-test evidence and proposed fixes.
eval/evaluation_report.json refreshed with 20260411 numbers (105 tests, 82.86% pass rate).
eval/routine_eval/evaluate_routine_compile.py main-report build/write helpers factored out so partial runs still emit a valid report.
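The idea of factoring report build/write helpers out so that partial runs still emit a valid report can be sketched as follows. This is a minimal illustration, not the code in evaluate_routine_compile.py: the names `build_report`, `write_report`, and `run_all`, and the report fields, are all hypothetical.

```python
import json
from pathlib import Path


def build_report(results, expected_total):
    """Summarize whatever results exist so far (hypothetical report schema)."""
    completed = len(results)
    passed = sum(1 for r in results if r["passed"])
    return {
        "expected_total": expected_total,
        "completed": completed,
        "passed": passed,
        "pass_rate": round(100.0 * passed / completed, 2) if completed else 0.0,
        "partial": completed < expected_total,
        "results": list(results),
    }


def write_report(report, path):
    """Write to a temp file first, then rename, so readers never see half a file."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(report, indent=2), encoding="utf-8")
    tmp.replace(path)


def run_all(cases, run_case, report_path):
    """Run every case; the report is written even if a case raises midway."""
    results = []
    try:
        for case in cases:
            results.append({"case": case, "passed": bool(run_case(case))})
    finally:
        # Executes on normal completion, on exceptions, and on Ctrl-C.
        write_report(build_report(results, len(cases)), report_path)
    return results
```

Because the build step only looks at the results gathered so far, an interrupted run produces a report flagged as partial rather than no report at all.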

Test plan

Full 105-run eval completed (~3h, 82.9% avg pass rate across models) — see eval/reports/20260411_full_eval.md
All 35 tests load without schema errors
Reviewer sanity-checks the observation report and mock site implementations
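As a quick consistency check on the figures above, the run count and pass rate can be reproduced with simple arithmetic. The pass count is inferred here from the reported percentage; it is not quoted from the report itself.

```python
MODELS = ["qwen3.5-flash", "qwen3.5-plus", "qwen3.6-plus"]
TESTS_PER_MODEL = 35
REPORTED_RATE = 82.86  # percent, as stated in eval/evaluation_report.json

total_runs = len(MODELS) * TESTS_PER_MODEL  # 3 models x 35 tests

# Integer pass counts whose rate, rounded to two decimals, matches the report.
candidates = [
    n for n in range(total_runs + 1)
    if round(100 * n / total_runs, 2) == REPORTED_RATE
]

print(total_runs, candidates)
```

Under these numbers the only pass count consistent with 82.86% of 105 runs is 87, which also rounds to the 82.9% quoted in the test plan.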

🤖 Generated with Claude Code

- Fix _format_agent_event to match actual SSE event types (ActionEvent/ObservationEvent/MessageEvent) and use frontend-matching labels (STEP/ASK/RESULT/AGENT)
- Collapse multiline event bodies to single lines with "|" separator for readable terminal output
- Add --full-events flag for no-truncation mode (default 500 chars)
- Add _log() helper for immediate flushed stderr output at milestones
- Replace synthetic fixtures with real browser recordings
- Delete unused fixtures (techforum_upvote_clear, finviz_threshold_ambiguous)
- Update finviz_filter_clear expectations: require asking about sort direction (user's real goal is finding 20% monthly drops)
- Update replay YAML and golden routine for multi-filter Finviz workflow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
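The body-collapsing and flushed-logging behavior described in this commit could look roughly like the sketch below. It is an illustration only: `collapse_body`, its `full` parameter (standing in for --full-events), and `log` are hypothetical names, and the real _format_agent_event and _log() may differ.

```python
import sys

DEFAULT_MAX_CHARS = 500  # default truncation length named in the commit message


def collapse_body(body, max_chars=DEFAULT_MAX_CHARS, full=False):
    """Join non-empty lines with ' | ' and truncate unless full=True."""
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    text = " | ".join(lines)
    if not full and len(text) > max_chars:
        text = text[:max_chars] + "..."
    return text


def log(message):
    """Print to stderr and flush at once so milestones show up immediately."""
    print(message, file=sys.stderr, flush=True)


example = collapse_body("clicked\n  #submit\n\nok")
```

Here `example` is `"clicked | #submit | ok"`; passing `full=True` skips truncation for long bodies.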

Save two compiler-produced routines as static test assets and wire them into evaluate_browser_agent.py via replay YAML test cases:

- techforum_search_upvote_agents.md: search "AI", upvote+collect agent posts, open comments (6 criteria, 10 pts)
- finviz_filter_sort_open.md: 5 filters, Performance view, sort by Perf Month ascending, open top 3 losers (8 criteria, 12 pts)

Both run with:

    uv run python eval/evaluate_browser_agent.py \
      --test replay_techforum_upvote --test replay_finviz_filter_simple

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Fix UnicodeEncodeError in eval server when tracker events contain emoji surrogates (e.g. 👍 from TechForum upvote buttons)
- Fix search_for_ai criterion: search event fires on /techforum/ not /techforum/search.html

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
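For context on the surrogate failure mode, the sketch below shows one way such an error can arise and be avoided in Python. It is not the fix applied in eval/server.py, whose details are not shown here; `sanitize_surrogates` and `encode_event` are hypothetical helpers.

```python
import json


def sanitize_surrogates(text):
    """Recombine UTF-16 surrogate pairs; replace unpaired ones with U+FFFD."""
    return text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")


def encode_event(event):
    """Serialize an event dict to UTF-8 bytes without raising on surrogates."""
    payload = json.dumps(event, ensure_ascii=False)
    return sanitize_surrogates(payload).encode("utf-8")


# A thumbs-up that arrived as two separate surrogate code points.
raw_label = "\ud83d\udc4d Upvote"

try:
    raw_label.encode("utf-8")
    raw_encode_failed = False
except UnicodeEncodeError:
    raw_encode_failed = True

encoded = encode_event({"label": raw_label})
```

Encoding the raw label directly raises UnicodeEncodeError, while the sanitized payload encodes cleanly and decodes back to a string containing the intended emoji.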

…tions

Bump agent-sdk to b7f39b82, which reframes the sorted-list asking rule around position-based vs identity-based replay divergence with a worked example, making the compiler reliably ask clarification questions.

Also fix the finviz_filter_clear fixture: remove the sort-direction question from required expectations, since the sort state is observable from the trace (element class + keyframe values). Update intent_summary and raw_intention to match.

Eval results: 2/2 pass, mean asking_behavior 1.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
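The position-based versus identity-based divergence mentioned in this commit can be illustrated with a small sketch. The tickers and numbers are invented for illustration, and the function names are hypothetical; this is not the worked example shipped in agent-sdk.

```python
def sort_by_perf(rows):
    """Sort ascending by monthly performance, so the biggest losers come first."""
    return sorted(rows, key=lambda r: r["perf_month"])


def replay_by_position(rows, n=3):
    """Position-based: open whatever occupies the top n rows at replay time."""
    return [r["ticker"] for r in sort_by_perf(rows)[:n]]


def replay_by_identity(rows, recorded_tickers):
    """Identity-based: open the same tickers that were opened while recording."""
    present = {r["ticker"] for r in rows}
    return [t for t in recorded_tickers if t in present]


# Made-up screener data at recording time and at a later replay.
at_record = [
    {"ticker": "AAA", "perf_month": -31.0},
    {"ticker": "BBB", "perf_month": -27.5},
    {"ticker": "CCC", "perf_month": -24.0},
    {"ticker": "DDD", "perf_month": -21.0},
]
at_replay = [
    {"ticker": "AAA", "perf_month": -22.0},
    {"ticker": "BBB", "perf_month": -35.0},
    {"ticker": "CCC", "perf_month": -20.5},
    {"ticker": "DDD", "perf_month": -29.0},
    {"ticker": "EEE", "perf_month": -40.0},
]

recorded = replay_by_position(at_record)
by_position = replay_by_position(at_replay)
by_identity = replay_by_identity(at_replay, recorded)
```

With this data the two strategies open different pages at replay time, which is why a recording alone cannot say which one the user meant and a clarification question is warranted.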

Adds four new browser-agent eval sites targeting interaction primitives not yet covered by the existing suite:

- MapQuest: panel-state navigation and icon-only transport buttons
- StayBnB: dual-handle price slider drag, segmented search popovers, and two-step booking checkout
- TaskFlow: HTML5 drag-and-drop with hover-reveal inline editing
- VidHub: auto-hide player controls, thin-bar timeline scrub, nested settings popup, and hover-reveal volume slider

Each site ships with 2 test-case YAMLs under eval/dataset/ scored against tracker events, for 8 new tests total.

eval/README.md documents each site with main challenges. eval/AGENTS.md captures non-obvious implementation gotchas learned during the build (stacking context with header popovers, default-state event anti-pattern, tracker case normalization, dual-handle drag, deep-link entry points). eval/SPEC_NEW_SITES.md is the design brief the sites were generated from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
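Scoring a test case against tracker events, with case normalization applied before comparison, might look like the sketch below. The criterion schema (`type`, `match`, `points`), the event fields, and the reading of "tracker case normalization" as lower-casing strings are all assumptions; the actual YAML format under eval/dataset/ is not shown in this PR text.

```python
def normalize(value):
    """Lower-case and trim strings so 'Done' and ' done' compare equal."""
    return value.strip().lower() if isinstance(value, str) else value


def event_matches(event, criterion):
    """True if the event has the criterion's type and all required fields."""
    if normalize(event.get("type")) != normalize(criterion["type"]):
        return False
    return all(
        normalize(event.get(key)) == normalize(expected)
        for key, expected in criterion.get("match", {}).items()
    )


def score(events, criteria):
    """Return (earned, possible) points for a list of tracker events."""
    earned = sum(
        c["points"] for c in criteria if any(event_matches(e, c) for e in events)
    )
    possible = sum(c["points"] for c in criteria)
    return earned, possible


# Hypothetical events and criteria for a drag-and-drop board test.
events = [{"type": "Card_Moved", "to": "Done"}, {"type": "card_renamed"}]
criteria = [
    {"type": "card_moved", "match": {"to": "done"}, "points": 3},
    {"type": "card_deleted", "points": 2},
]
result = score(events, criteria)
```

Normalizing both sides keeps a criterion from failing only because the site emitted `Card_Moved` where the YAML says `card_moved`.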

…outine

Brings in 4 additional mock eval sites (MapQuest, StayBnB, TaskFlow, VidHub) from the parallel new-mock-eval-sites worktree, alongside the 5 sites already merged from codex/mock-eval-sites (gmail, drive, booking, github, amazon).

Conflict resolution in eval/server.py: kept the generic file-serving refactor and DEFAULT_PORT / configurable-port main() from the target branch, and folded the 4 new site entries into SITE_NAME_TO_BUCKET, /api/sites, /api/help, and print_startup_info alongside the existing codex entries. The 4 new URL_MAPPINGS entries for /mapquest/, /staybnb/, /taskflow/, /vidhub/ are redundant with send_file's directory→index.html fallback but are kept for consistency with the legacy dataflow/finviz/bluebook/northstar entries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
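A directory-to-index.html fallback of the kind referred to here, which makes explicit per-site URL mappings unnecessary, can be sketched as follows. `resolve_static_path` is a hypothetical helper written for illustration; it is not the send_file implementation in eval/server.py.

```python
from pathlib import Path


def resolve_static_path(root, url_path):
    """Map a URL path to a file under root, or None if it is not servable."""
    root = Path(root).resolve()
    candidate = (root / url_path.lstrip("/")).resolve()
    # Refuse paths that escape the static root (for example via "..").
    if candidate != root and root not in candidate.parents:
        return None
    # A directory request such as /mapquest/ falls back to its index.html.
    if candidate.is_dir():
        candidate = candidate / "index.html"
    return candidate if candidate.is_file() else None
```

With a fallback like this, a request for `/mapquest/` resolves to `mapquest/index.html` without any dedicated mapping entry, while missing directories and paths outside the root return `None`.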

- eval/reports/20260411_full_eval.md: 105-run benchmark across qwen3.5-flash, qwen3.5-plus, qwen3.6-plus on the full 35-test dataset (pre-existing + codex mock sites + follow-up mock sites). Documents seven recurring OpenBrowser agent issues with per-test evidence and proposed fixes.
- eval/evaluation_report.json: refreshed with the 20260411 run (105 tests, 82.86% pass rate).
- eval/routine_eval/evaluate_routine_compile.py: factor main-report build/write helpers so partial runs still emit a valid report.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Satisfies the pre-commit black hook on files touched by earlier mock-site and routine-eval commits on this branch. No logic changes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

softpudding and others added 14 commits April 9, 2026 20:17

Expand TechForum eval frontend interactions

5d82823

Add routine record/replay eval support and expand Finviz fixtures

45c07e3

docs: document record-compile-replay workflow

c1f9f50

Add hard mock evaluation sites

c451049

Fix review findings in eval mocks

cdf21cf

Merge branch 'codex/mock-eval-sites' into feat/optimize-browser-routine

c26b7cc

Apply black formatting to eval + test files

13f1069


softpudding merged commit 3d251b5 into main Apr 11, 2026
4 checks passed

softpudding mentioned this pull request Apr 11, 2026

Remove eval SPEC design docs from the repo #57

Merged

2 tasks

softpudding merged 14 commits into main from eval/full-20260411-benchmark
