Fix validation script bugs across 13 validators #13
Merged
Ruiying-Ma merged 3 commits into main on Mar 12, 2026
Window size mismatches (comment vs code):
- PANCANCER_ATLAS/query1: 10 → 50 chars (matching comment)
- DEPS_DEV_V1/query1: 10 → 50 chars (matching comment)
- stockindex/query3: 20 → 50 chars (matching comment)

Forbidden-value false positives (stockindex/query1, query2):
- Old: reject if ANY forbidden value appears anywhere in output
- New: reject only if forbidden value appears BEFORE the correct answer
- Prevents penalizing correct answers that also provide context

Bare except clauses → except ValueError (6 files):
- GITHUB_REPOS/query1, stockmarket/query1, PANCANCER_ATLAS/query3, yelp/query1, yelp/query2, googlelocal/query3

Whitespace-stripping substring match → word-boundary regex (2 files):
- agnews/query1, music_brainz/query2
- Old: stripped ALL whitespace then substring matched (e.g., "therundown" matched inside "therundownfilm")
- New: case-insensitive word-boundary match via re.search(\b...\b)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Fixes bugs and code-quality issues in 13 validation scripts across 9 datasets, discovered during PromptQL benchmark evaluation (8 runs, 2 models).
Window size mismatches (comment says 50, code uses 10 or 20)
- PANCANCER_ATLAS/query1: start+10 → start+50 (matches docstring)
- DEPS_DEV_V1/query1: start+10 → start+50 (matches docstring)
- stockindex/query3: +20 → +50 (matches docstring)

These tight windows cause false negatives when the agent formats output as a table or multi-column layout.
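
To illustrate why a 10-character window fails on table-style output, here is a minimal sketch of a windowed lookup. The function name `value_near_label`, the `WINDOW` constant, and the sample row are hypothetical and are not taken from the validators themselves; only the 10 versus 50 character window sizes come from this PR.

```python
WINDOW = 50  # characters scanned after the label (hypothetical constant name)


def value_near_label(output: str, label: str, expected: str, window: int = WINDOW) -> bool:
    """Return True if `expected` occurs within `window` chars after `label`."""
    idx = output.lower().find(label.lower())
    if idx == -1:
        return False  # label not present at all
    start = idx + len(label)
    # Only the slice [start, start + window) is searched.
    return expected.lower() in output[start:start + window].lower()


# A made-up table row: column padding pushes the value past a 10-char window.
row = "| BRCA" + " " * 14 + "| 0.8123 |"
narrow = value_near_label(row, "BRCA", "0.8123", window=10)
wide = value_near_label(row, "BRCA", "0.8123", window=50)
```

With the padded row above, the 10-character window sees only spaces, while the 50-character window reaches the value.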
Forbidden-value false positives (stockindex/query1, query2)
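
The new rule rejects an output only when a forbidden value appears before the correct answer, so a correct answer followed by extra context still passes. The sketch below shows one way to express that ordering check; the function name `passes_forbidden_check` and the sample strings are assumptions for illustration, not the code in stockindex/query1 or query2.

```python
from __future__ import annotations


def passes_forbidden_check(output: str, correct: str, forbidden: list[str]) -> bool:
    """Accept unless a forbidden value precedes the first correct answer."""
    text = output.lower()
    answer_pos = text.find(correct.lower())
    if answer_pos == -1:
        return False  # the correct answer never appears
    for value in forbidden:
        pos = text.find(value.lower())
        # Reject only when the forbidden value comes first.
        if pos != -1 and pos < answer_pos:
            return False
    return True


# Made-up outputs: same values, different order.
with_context = "The answer is NASDAQ. For context, the S&P 500 ranked second."
wrong_first = "S&P 500, or possibly NASDAQ."
ok = passes_forbidden_check(with_context, "NASDAQ", ["S&P 500"])
rejected = passes_forbidden_check(wrong_first, "NASDAQ", ["S&P 500"])
```

Under the old "anywhere in output" rule, both sample outputs would have been rejected.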
Bare except: → except ValueError: (6 files)
Bare except catches SystemExit, KeyboardInterrupt, etc. Fixed in: GITHUB_REPOS/q1, stockmarket/q1, PANCANCER_ATLAS/q3, yelp/q1, yelp/q2, googlelocal/q3.
Whitespace-stripping match → word-boundary regex (agnews/q1, music_brainz/q2)
- Old: stripped all whitespace, then substring matched ("therundown" matched inside "therundownfilm")
- New: case-insensitive word-boundary regex (\bThe Rundown\b)
Test plan
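
The difference between the whitespace-stripping match and the word-boundary regex can be spot-checked with a small sketch like the one below. The helper names `old_match` and `new_match` are hypothetical; the sketch only assumes the behavior described in this PR (strip all whitespace and substring-match, versus a case-insensitive `\b...\b` search).

```python
import re


def old_match(expected: str, output: str) -> bool:
    """Old behavior: drop all whitespace, lowercase, then substring match."""
    def squash(s: str) -> str:
        return re.sub(r"\s+", "", s).lower()
    return squash(expected) in squash(output)


def new_match(expected: str, output: str) -> bool:
    """New behavior: case-insensitive match on word boundaries."""
    # re.escape keeps any regex metacharacters in `expected` literal.
    pattern = r"\b" + re.escape(expected) + r"\b"
    return re.search(pattern, output, flags=re.IGNORECASE) is not None


false_positive_old = old_match("The Rundown", "therundownfilm")
false_positive_new = new_match("The Rundown", "therundownfilm")
true_match_new = new_match("The Rundown", "I watched the rundown yesterday.")
```

Note that `\b` only behaves as intended when the expected string starts and ends with a word character, which holds for titles like the one used here.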
🤖 Generated with Claude Code