v0.2.0: Multi-model leaderboard, Layer 2 LLM-Judge, credibility fixes #1

Open
rhahn28 wants to merge 2 commits into main from v0.2.0-credibility-fixes

Conversation

@rhahn28 (Owner) commented Mar 26, 2026

Summary

  • Multi-model leaderboard: Claude Sonnet 4 (99.1%), Gemini 2.5 Flash (99.1%), Gemini 2.5 Pro (89.7%) alongside ABIGAIL v3 (100%)
  • Layer 2 LLM-Judge: 22 Tier 3 reasoning cases scored via calibrated Gemini Flash judge across 9 quality dimensions (71.1% overall, 95% CI: 63.1-78.3%)
  • Evaluator fixes: Negation-aware entity status detection (see the sketch after this list), word-boundary rejection type matching, claim range expansion for XML OA parsing
  • Statistical rigor: Bootstrap confidence intervals, Wilcoxon signed-rank tests, Cohen's d effect sizes, Bonferroni correction — all previously documented but unimplemented
  • Results schema unified: Harness now produces v0.2.0 schema matching all published results files (fixes reproducibility gap)
  • Externalized poison pills: Moved from hardcoded `config.py` to runtime-loaded JSON (`data/poison_pills.json`)
  • Test coverage: 157 tests passing (up from ~60), new suites for edge cases, metrics, harness, XML parser
  • HuggingFace adapter: Infrastructure for open-source model benchmarking (Mistral, Llama, etc.)
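
As a minimal sketch of the negation-aware detection mentioned above, assuming a simple regex-based extractor; the function name, status list, and negation window here are illustrative, not the repo's actual API:

```python
import re

# Illustrative status keywords; the real evaluator's vocabulary may differ.
STATUSES = ("micro", "small", "large")

# Negation within a few words before the status keyword catches the
# false-positive pattern called out above ("not a micro entity" -> "micro").
NEGATION = re.compile(r"\b(?:not|no longer|neither|without)\b(?:\s+\w+){0,3}\s*$")

def detect_entity_status(text: str) -> str | None:
    """Return the first non-negated entity status mentioned, else None."""
    for status in STATUSES:
        for match in re.finditer(rf"\b{status}\s+entity\b", text, re.IGNORECASE):
            if NEGATION.search(text[: match.start()]):
                continue  # negated mention, skip it
            return status
    return None

assert detect_entity_status("Applicant qualifies as a micro entity.") == "micro"
assert detect_entity_status("Applicant is not a micro entity.") is None
```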

What This Fixes (from external review)

  1. "Only the creator has been benchmarked" → 3 external models with published results
  2. "Results weren't generated by the harness" → Unified v0.2.0 schema
  3. "75% of evaluation framework doesn't exist" → Layer 2 live with scores
  4. "Entity status detection is broken" → Negation-aware extraction
  5. "Rejection matching is noisy" → Word-boundary + citation patterns
  6. "Statistical methods don't exist in code" → All 4 methods implemented
  7. "Poison pills are public" → Externalized to separate config
  8. "Tests are thin" → 157 tests, 2.6x coverage increase

Test plan

  • All 157 tests passing (`python -m pytest tests/ -v`)
  • Gemini Flash Layer 1: 99.1% (verified)
  • Gemini Pro Layer 1: 89.7% (verified)
  • Claude Sonnet Layer 1: 99.1% (verified)
  • Layer 2 LLM-Judge: 22 cases scored with bootstrap CIs (see the sketch after this list)
  • Re-run Gemini Pro with fixed evaluator for updated score
  • Update website leaderboard (abigail.app) to match README
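
For reference, a percentile-bootstrap CI of the kind cited above can be computed with a self-contained sketch like this; the per-case scores are made-up placeholders, not the published Layer 2 data:

```python
import random

def bootstrap_mean_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int(n_resamples * alpha / 2)]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lo, hi

# Made-up per-case scores on a 0-1 scale, standing in for 22 Tier 3 cases.
scores = [0.8, 0.6, 0.9, 0.7, 0.75, 0.65, 0.85, 0.7, 0.6, 0.9, 0.8,
          0.7, 0.55, 0.95, 0.7, 0.75, 0.8, 0.65, 0.7, 0.6, 0.85, 0.7]
lo, hi = bootstrap_mean_ci(scores)
print(f"mean={sum(scores) / len(scores):.3f}  95% CI=({lo:.3f}, {hi:.3f})")
```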

🤖 Generated with Claude Code

rhahn28 and others added 2 commits March 22, 2026 19:19
- Replace all Claude Opus/Sonnet references with system names (ABIGAIL v3)
- Underlying model architecture is a trade secret; benchmark measures outputs
- Expand METHODOLOGY.md with detailed Layer 2-4 implementation:
  - Layer 2: LLM-as-Judge with published rubrics (5 dimensions, 1-5 scale)
  - Layer 3: Comparative evaluation with position deblinding
  - Layer 4: Human calibration protocol (5 attorneys, Cohen's Kappa)
- Add Section 8: How to submit a system for evaluation
- Update README leaderboard to show real Layer 1 data only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sults

Addresses the key credibility gaps identified in external review:

- Fix results schema mismatch: BenchmarkResults.to_dict() now produces
  unified v0.2.0 schema matching all published results files
- Fix entity status detection: negation-aware regex prevents false positives
  like "not a micro entity" matching as "micro"
- Fix rejection type matching: word-boundary regex and statutory citation
  patterns replace naive substring matching
- Implement statistical methods: bootstrap CIs, Wilcoxon signed-rank,
  Cohen's d, Bonferroni correction (all claimed in METHODOLOGY.md)
- Add Layer 2 LLM-Judge: 22 Tier 3 reasoning cases scored via Gemini Flash
  judge (71.1% overall, 95% CI: 63.1-78.3%)
- Add multi-model leaderboard: Claude Sonnet 4 (99.1%), Gemini Flash (99.1%),
  Gemini Pro (89.7%) alongside ABIGAIL v3 (100%)
- Wire LLM judge client into run_benchmark.py for automatic Layer 2 scoring
- Add run_layer2.py for standalone Tier 3 reasoning evaluation
- Tests: 157 passing (up from ~60), covering edge cases, metrics, harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>