v0.2.0: Multi-model leaderboard, Layer 2 LLM-Judge, credibility fixes #1

Open
rhahn28 wants to merge 2 commits into main from v0.2.0-credibility-fixes

Conversation

@rhahn28 (Owner) commented Mar 26, 2026

Summary

  • Multi-model leaderboard: Claude Sonnet 4 (99.1%), Gemini 2.5 Flash (99.1%), Gemini 2.5 Pro (89.7%) alongside ABIGAIL v3 (100%)
  • Layer 2 LLM-Judge: 22 Tier 3 reasoning cases scored via calibrated Gemini Flash judge across 9 quality dimensions (71.1% overall, 95% CI: 63.1-78.3%)
  • Evaluator fixes: Negation-aware entity status detection (see the sketch after this list), word-boundary rejection type matching, claim range expansion for XML OA parsing
  • Statistical rigor: Bootstrap confidence intervals, Wilcoxon signed-rank tests, Cohen's d effect sizes, Bonferroni correction — all previously documented but unimplemented
  • Results schema unified: Harness now produces v0.2.0 schema matching all published results files (fixes reproducibility gap)
  • Externalized poison pills: Moved from hardcoded `config.py` to runtime-loaded JSON (`data/poison_pills.json`)
  • Test coverage: 157 tests passing (up from ~60), new suites for edge cases, metrics, harness, XML parser
  • HuggingFace adapter: Infrastructure for open-source model benchmarking (Mistral, Llama, etc.)
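
As a minimal sketch of the negation-aware detection mentioned above, assuming a simple regex-based extractor; the function name, status list, and negation window here are illustrative, not the repo's actual API:

```python
import re

# Illustrative status keywords; the real evaluator's vocabulary may differ.
STATUSES = ("micro", "small", "large")

# Negation within a few words before the status keyword catches the
# false-positive pattern called out above ("not a micro entity" -> "micro").
NEGATION = re.compile(r"\b(?:not|no longer|neither|without)\b(?:\s+\w+){0,3}\s*$")

def detect_entity_status(text: str) -> str | None:
    """Return the first non-negated entity status mentioned, else None."""
    for status in STATUSES:
        for match in re.finditer(rf"\b{status}\s+entity\b", text, re.IGNORECASE):
            if NEGATION.search(text[: match.start()]):
                continue  # negated mention, skip it
            return status
    return None

assert detect_entity_status("Applicant qualifies as a micro entity.") == "micro"
assert detect_entity_status("Applicant is not a micro entity.") is None
```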

What This Fixes (from external review)

  1. "Only the creator has been benchmarked" → 3 external models with published results
  2. "Results weren't generated by the harness" → Unified v0.2.0 schema
  3. "75% of evaluation framework doesn't exist" → Layer 2 live with scores
  4. "Entity status detection is broken" → Negation-aware extraction
  5. "Rejection matching is noisy" → Word-boundary + citation patterns
  6. "Statistical methods don't exist in code" → All 4 methods implemented
  7. "Poison pills are public" → Externalized to separate config
  8. "Tests are thin" → 157 tests, 2.6x coverage increase

Test plan

  • All 157 tests passing (`python -m pytest tests/ -v`)
  • Gemini Flash Layer 1: 99.1% (verified)
  • Gemini Pro Layer 1: 89.7% (verified)
  • Claude Sonnet Layer 1: 99.1% (verified)
  • Layer 2 LLM-Judge: 22 cases scored with bootstrap CIs (see the sketch after this list)
  • Re-run Gemini Pro with fixed evaluator for updated score
  • Update website leaderboard (abigail.app) to match README
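
For reference, a percentile-bootstrap CI of the kind cited above can be computed with a self-contained sketch like this; the per-case scores are made-up placeholders, not the published Layer 2 data:

```python
import random

def bootstrap_mean_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Made-up per-case scores on a 0-1 scale, standing in for 22 Tier 3 cases.
scores = [0.8, 0.6, 0.9, 0.7, 0.75, 0.65, 0.85, 0.7, 0.6, 0.9, 0.8,
          0.7, 0.55, 0.95, 0.7, 0.75, 0.8, 0.65, 0.7, 0.6, 0.85, 0.7]
lo, hi = bootstrap_mean_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```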

🤖 Generated with Claude Code

rhahn28 and others added 2 commits March 22, 2026 19:19
- Replace all Claude Opus/Sonnet references with system names (ABIGAIL v3)
- Underlying model architecture is a trade secret; benchmark measures outputs
- Expand METHODOLOGY.md with detailed Layer 2-4 implementation:
  - Layer 2: LLM-as-Judge with published rubrics (5 dimensions, 1-5 scale)
  - Layer 3: Comparative evaluation with position deblinding
  - Layer 4: Human calibration protocol (5 attorneys, Cohen's Kappa)
- Add Section 8: How to submit a system for evaluation
- Update README leaderboard to show real Layer 1 data only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sults

Addresses the key credibility gaps identified in external review:

- Fix results schema mismatch: BenchmarkResults.to_dict() now produces
  unified v0.2.0 schema matching all published results files
- Fix entity status detection: negation-aware regex prevents false positives
  like "not a micro entity" matching as "micro"
- Fix rejection type matching: word-boundary regex and statutory citation
  patterns replace naive substring matching
- Implement statistical methods: bootstrap CIs, Wilcoxon signed-rank,
  Cohen's d, Bonferroni correction (all claimed in METHODOLOGY.md)
- Add Layer 2 LLM-Judge: 22 Tier 3 reasoning cases scored via Gemini Flash
  judge (71.1% overall, 95% CI: 63.1-78.3%)
- Add multi-model leaderboard: Claude Sonnet 4 (99.1%), Gemini Flash (99.1%),
  Gemini Pro (89.7%) alongside ABIGAIL v3 (100%)
- Wire LLM judge client into run_benchmark.py for automatic Layer 2 scoring
- Add run_layer2.py for standalone Tier 3 reasoning evaluation
- Tests: 157 passing (up from ~60), covering edge cases, metrics, harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>