Fix validation script bugs across 13 validators #13

Merged
Ruiying-Ma merged 3 commits into main from fix/validation-script-bugs on Mar 12, 2026

@anushruthasura
Collaborator

Summary

Fixes bugs and code quality issues in 13 validation scripts across 7 datasets, discovered during PromptQL benchmark evaluation (8 runs, 2 models).

Window size mismatches (comment says 50, code uses 10 or 20)

  • PANCANCER_ATLAS/query1: start+10 → start+50 (matches docstring)
  • DEPS_DEV_V1/query1: start+10 → start+50 (matches docstring)
  • stockindex/query3: +20 → +50 (matches docstring)

These tight windows cause false negatives when the agent formats output as a table or multi-column layout, because column padding pushes the value past the end of the window.
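The effect of the window size can be illustrated with a minimal sketch. The helper name `value_near_label`, the `WINDOW` constant, and the sample row are hypothetical and are not taken from the validators themselves; the sketch only assumes that a validator looks for the expected value within a fixed number of characters after a label.

```python
WINDOW = 50  # characters searched after the label start (assumed to match the docstring)


def value_near_label(text: str, label: str, expected: str, window: int = WINDOW) -> bool:
    """Return True if `expected` occurs within `window` chars of the start of `label`."""
    start = text.find(label)
    if start == -1:
        return False  # label not present at all
    return expected in text[start:start + window]


# Hypothetical table-style row: padding separates the label from its value.
row = "TP53" + " " * 10 + "|" + " " * 8 + "42"

print(value_near_label(row, "TP53", "42", window=10))  # old, tight window
print(value_near_label(row, "TP53", "42"))             # widened window
```

With the 10-character window the padded value falls outside the slice, so a correct answer is rejected; with 50 characters it is found.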

Forbidden-value false positives (stockindex/query1, query2)

  • Old: reject if ANY forbidden value appears anywhere — penalizes correct answers that also provide context (e.g., "399001.SZ is most volatile; N225 is least volatile" → FAIL)
  • New: reject only if a forbidden value appears BEFORE the correct answer (positional check)
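A rough sketch of the positional check described above, under the assumption that both the correct answer and the forbidden values are matched as plain substrings. The function name `passes_positional_check` is hypothetical, and the real validators may normalize case or whitespace differently.

```python
from typing import Iterable


def passes_positional_check(text: str, correct: str, forbidden: Iterable[str]) -> bool:
    """Accept only if `correct` is present and no forbidden value precedes it."""
    pos = text.find(correct)
    if pos == -1:
        return False  # correct answer missing entirely
    prefix = text[:pos]  # everything before the first occurrence of the answer
    return not any(bad in prefix for bad in forbidden)


# Example from the description: a correct answer followed by extra context.
with_context = "399001.SZ is most volatile; N225 is least volatile"
# Hypothetical wrong answer: a forbidden value is offered first.
wrong_first = "N225 is most volatile, ahead of 399001.SZ"

print(passes_positional_check(with_context, "399001.SZ", ["N225"]))
print(passes_positional_check(wrong_first, "399001.SZ", ["N225"]))
```

The first response is accepted because N225 appears only after the correct ticker; the second is rejected because N225 comes first.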

Bare except: → except ValueError: (6 files)

Bare except catches SystemExit, KeyboardInterrupt, etc. Fixed in: GITHUB_REPOS/q1, stockmarket/q1, PANCANCER_ATLAS/q3, yelp/q1, yelp/q2, googlelocal/q3.
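As an illustration of the narrowed handler, here is a minimal sketch of a number-parsing helper. The name `parse_number` and the comma-stripping step are assumptions made for the example, not code from the listed files.

```python
from typing import Optional


def parse_number(token: str) -> Optional[float]:
    """Parse a numeric token, returning None when it is not a valid number."""
    try:
        return float(token.replace(",", ""))
    except ValueError:
        # Only malformed numbers are swallowed. SystemExit and
        # KeyboardInterrupt are no longer caught and propagate normally.
        return None


print(parse_number("1,234.5"))
print(parse_number("n/a"))
```

One consequence of the narrower clause is that unrelated errors, such as passing a non-string argument, now surface instead of being silently treated as a failed parse.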

Whitespace-stripping match → word-boundary regex (agnews/q1, music_brainz/q2)

  • Old: stripped ALL whitespace then used substring matching ("therundown" matched inside "therundownfilm")
  • New: case-insensitive word-boundary regex (\bThe Rundown\b)
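The difference between the two matching strategies can be sketched as follows. Both function names are hypothetical, and the "old" version is a reconstruction from the description above rather than the original code.

```python
import re


def contains_phrase_old(text: str, phrase: str) -> bool:
    """Reconstructed old behavior: strip all whitespace, then substring match."""
    def squash(s: str) -> str:
        return "".join(s.split()).lower()
    return squash(phrase) in squash(text)


def contains_phrase(text: str, phrase: str) -> bool:
    """New behavior: case-insensitive match anchored on word boundaries."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_phrase_old("therundownfilm", "The Rundown"))  # substring hit
print(contains_phrase("therundownfilm", "The Rundown"))      # no word boundary
print(contains_phrase("I watched the rundown last night.", "The Rundown"))
```

Note that `\b` only anchors correctly when the phrase begins and ends with a word character, so titles ending in punctuation would need a different boundary rule.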

Test plan

  • Wrote 19 unit tests covering all changed validators (pass/fail cases, edge cases)
  • All 19 tests pass
  • Cross-checked against 216 actual PromptQL benchmark responses (4 runs × 54 queries)
  • Verified all 13 files parse and import correctly
  • Confirmed result flips are correct (old false positives now properly rejected)

🤖 Generated with Claude Code

Window size mismatches (comment vs code):
- PANCANCER_ATLAS/query1: 10 → 50 chars (matching comment)
- DEPS_DEV_V1/query1: 10 → 50 chars (matching comment)
- stockindex/query3: 20 → 50 chars (matching comment)

Forbidden-value false positives (stockindex/query1, query2):
- Old: reject if ANY forbidden value appears anywhere in output
- New: reject only if forbidden value appears BEFORE the correct answer
- Prevents penalizing correct answers that also provide context

Bare except clauses → except ValueError (6 files):
- GITHUB_REPOS/query1, stockmarket/query1, PANCANCER_ATLAS/query3,
  yelp/query1, yelp/query2, googlelocal/query3

Whitespace-stripping substring match → word-boundary regex (2 files):
- agnews/query1, music_brainz/query2
- Old: stripped ALL whitespace then substring matched (e.g., "therundown"
  matched inside "therundownfilm")
- New: case-insensitive word-boundary match via re.search(\b...\b)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@Ruiying-Ma merged commit 247d2bd into main on Mar 12, 2026