v0.2.0: Multi-model leaderboard, Layer 2 LLM-Judge, credibility fixes #1
Open
Conversation
- Replace all Claude Opus/Sonnet references with system names (ABIGAIL v3)
- Underlying model architecture is a trade secret; benchmark measures outputs
- Expand METHODOLOGY.md with detailed Layer 2-4 implementation:
  - Layer 2: LLM-as-Judge with published rubrics (5 dimensions, 1-5 scale)
  - Layer 3: Comparative evaluation with position deblinding
  - Layer 4: Human calibration protocol (5 attorneys, Cohen's Kappa)
- Add Section 8: How to submit a system for evaluation
- Update README leaderboard to show real Layer 1 data only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
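The rubric code itself isn't shown in this commit, but a minimal sketch of the Layer 2 design it describes (an LLM judge scoring each answer on five published dimensions, 1-5 scale) might look like the following. Every name here (`RUBRIC_DIMENSIONS`, `JudgeScore`, `score_response`, the judge client's `.complete()` method) is a hypothetical stand-in, not the repository's actual API:

```python
# Hedged sketch of a Layer 2 LLM-as-Judge scorer: five rubric dimensions,
# each rated 1-5 by a judge model. All names are illustrative assumptions,
# not taken from the repository.
from dataclasses import dataclass

RUBRIC_DIMENSIONS = [
    "legal_accuracy",
    "citation_correctness",
    "reasoning_depth",
    "completeness",
    "clarity",
]

JUDGE_PROMPT = """You are grading a patent-law answer against a rubric.
Question: {question}
Answer under review: {answer}
For the dimension "{dimension}", return a single integer from 1 (poor)
to 5 (excellent). Return only the integer."""

@dataclass
class JudgeScore:
    dimension: str
    score: int  # 1-5 per the published rubric

def score_response(judge, question: str, answer: str) -> list[JudgeScore]:
    """Ask the judge model to rate one answer on each rubric dimension."""
    scores = []
    for dim in RUBRIC_DIMENSIONS:
        prompt = JUDGE_PROMPT.format(question=question, answer=answer,
                                     dimension=dim)
        raw = judge.complete(prompt)  # assumed judge-client interface
        scores.append(JudgeScore(dim, int(raw.strip())))
    return scores
```

Scoring one dimension per call keeps each judge response to a single integer, which is easier to parse and audit than a free-form multi-dimension reply.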
…sults

Addresses the key credibility gaps identified in external review:

- Fix results schema mismatch: BenchmarkResults.to_dict() now produces unified v0.2.0 schema matching all published results files
- Fix entity status detection: negation-aware regex prevents false positives like "not a micro entity" matching as "micro"
- Fix rejection type matching: word-boundary regex and statutory citation patterns replace naive substring matching
- Implement statistical methods: bootstrap CIs, Wilcoxon signed-rank, Cohen's d, Bonferroni correction (all claimed in METHODOLOGY.md)
- Add Layer 2 LLM-Judge: 22 Tier 3 reasoning cases scored via Gemini Flash judge (71.1% overall, 95% CI: 63.1-78.3%)
- Add multi-model leaderboard: Claude Sonnet 4 (99.1%), Gemini Flash (99.1%), Gemini Pro (89.7%) alongside ABIGAIL v3 (100%)
- Wire LLM judge client into run_benchmark.py for automatic Layer 2 scoring
- Add run_layer2.py for standalone Tier 3 reasoning evaluation
- Tests: 157 passing (up from ~60), covering edge cases, metrics, harness

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
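Neither regex appears in the commit text, but the two matching fixes it names (negation-aware entity status detection, word-boundary rejection-type matching with citation patterns) could be sketched roughly as below. The patterns, entity list, and function names are illustrative assumptions, not the repository's code:

```python
# Hedged sketch of the two matching fixes described in the commit message.
# Patterns and names are assumptions, not the repository's actual code.
import re

# Negation-aware entity status detection: reject a hit like "micro entity"
# when the words immediately before it negate it ("not a micro entity").
ENTITY_TYPES = ["micro", "small", "large"]
NEGATION = re.compile(r"\b(?:not|no longer|isn't)\s+(?:a\s+)?$", re.IGNORECASE)

def detect_entity_status(text: str) -> str | None:
    for entity in ENTITY_TYPES:
        for m in re.finditer(rf"\b{entity}\s+entity\b", text, re.IGNORECASE):
            prefix = text[max(0, m.start() - 20):m.start()]
            if NEGATION.search(prefix):
                continue  # negated mention, skip it
            return entity
    return None

# Word-boundary rejection-type matching with statutory citation patterns,
# replacing naive substring checks ("102" would otherwise match in "1024").
REJECTION_PATTERNS = {
    "anticipation": re.compile(r"\b(?:35\s+U\.?S\.?C\.?\s+)?§?\s*102\b"),
    "obviousness":  re.compile(r"\b(?:35\s+U\.?S\.?C\.?\s+)?§?\s*103\b"),
}

def match_rejection_type(text: str) -> str | None:
    for rtype, pattern in REJECTION_PATTERNS.items():
        if pattern.search(text):
            return rtype
    return None
```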
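The four statistical methods listed (bootstrap CIs, Wilcoxon signed-rank, Cohen's d, Bonferroni correction) are standard; how the repository wires them in isn't shown here, but a compact NumPy/SciPy sketch with hypothetical function names would be:

```python
# Hedged sketch of the four statistical methods the commit claims to
# implement. Function names are illustrative, not the repository's.
import numpy as np
from scipy import stats

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    means = np.array([rng.choice(scores, size=len(scores), replace=True).mean()
                      for _ in range(n_boot)])
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

def paired_comparison(a, b):
    """Wilcoxon signed-rank test plus Cohen's d for paired model scores."""
    w = stats.wilcoxon(a, b)
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    d = diff.mean() / diff.std(ddof=1)  # Cohen's d on paired differences
    return w.statistic, w.pvalue, d

def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: reject H0 where p < alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

Bonferroni is the most conservative of the family-wise corrections; with only a handful of model-vs-model comparisons, that conservatism costs little power and keeps the leaderboard claims defensible.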
Summary
What This Fixes (from external review)
"Only the creator has been benchmarked"→ 3 external models with published results"Results weren't generated by the harness"→ Unified v0.2.0 schema"75% of evaluation framework doesn't exist"→ Layer 2 live with scores"Entity status detection is broken"→ Negation-aware extraction"Rejection matching is noisy"→ Word-boundary + citation patterns"Statistical methods don't exist in code"→ All 4 methods implemented"Poison pills are public"→ Externalized to separate config"Tests are thin"→ 157 tests, 2.6x coverage increaseTest plan
python -m pytest tests/ -v

🤖 Generated with Claude Code
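As one illustration of the edge cases the expanded suite is said to cover, tests for the two matching fixes might look like the following. The import path and test names are hypothetical; the real suite lives under tests/:

```python
# Hypothetical edge-case tests in the style the commit describes
# ("157 passing, covering edge cases, metrics, harness").
# The module path is an assumption, not the repository's layout.
from benchmark.extraction import detect_entity_status, match_rejection_type

def test_negated_entity_status_is_not_a_false_positive():
    # "not a micro entity" must not be detected as "micro"
    assert detect_entity_status("The applicant is not a micro entity.") != "micro"

def test_rejection_citation_requires_word_boundary():
    # "102" inside "1024" must not match; a real § 103 citation must
    assert match_rejection_type("a buffer of 1024 bytes") is None
    assert match_rejection_type("rejected under 35 U.S.C. § 103") == "obviousness"
```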