Fix validation script bugs across 13 validators by anushruthasura · Pull Request #13 · ucbepic/DataAgentBench

anushruthasura · 2026-03-10T22:24:30Z

Summary

Fixes bugs and code quality issues in 13 validation scripts across 7 datasets, discovered during PromptQL benchmark evaluation (8 runs, 2 models).

Window size mismatches (comment says 50, code uses 10 or 20)

PANCANCER_ATLAS/query1: start+10 → start+50 (matches docstring)
DEPS_DEV_V1/query1: start+10 → start+50 (matches docstring)
stockindex/query3: +20 → +50 (matches docstring)

These tight windows cause false negatives when the agent formats output as a table or multi-column layout.
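A minimal sketch of why the window size matters, assuming the validators slice a fixed number of characters starting at the matched label. The helper name `value_in_window`, the label `GENE_A`, and the value `42.5` are hypothetical and are not taken from the repository:

```python
def value_in_window(text: str, anchor: str, expected: str, window: int = 50) -> bool:
    """Return True if `expected` appears within `window` characters
    counted from the start of any occurrence of `anchor`."""
    start = text.find(anchor)
    while start != -1:
        # Slice a fixed-size window beginning at the anchor itself.
        if expected in text[start:start + window]:
            return True
        start = text.find(anchor, start + 1)
    return False


# A table row whose column padding pushes the value past a 10-character window.
row = "| GENE_A" + " " * 7 + "| 42.5 |"

print(value_in_window(row, "GENE_A", "42.5", window=10))  # window too tight for the padded row
print(value_in_window(row, "GENE_A", "42.5", window=50))  # wider window covers the value
```

With a 10-character window the slice ends inside the column padding, so a correct answer is rejected; widening the window to 50 lets the same row pass.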

Forbidden-value false positives (stockindex/query1, query2)

Old: reject if ANY forbidden value appears anywhere — penalizes correct answers that also provide context (e.g., "399001.SZ is most volatile; N225 is least volatile" → FAIL)
New: reject only if a forbidden value appears BEFORE the correct answer (positional check)
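The positional check described above could look roughly like the following. This is a sketch under assumptions, not the code from the PR: the function name `answer_before_forbidden` and the case-insensitive comparison are illustrative choices.

```python
from typing import Iterable


def answer_before_forbidden(text: str, correct: str, forbidden: Iterable[str]) -> bool:
    """Pass only if `correct` is present and no forbidden value
    occurs earlier in the text than the first occurrence of `correct`."""
    lowered = text.lower()
    correct_pos = lowered.find(correct.lower())
    if correct_pos == -1:
        return False  # the correct answer is missing entirely
    for value in forbidden:
        pos = lowered.find(value.lower())
        # A forbidden value after the answer is treated as context, not as the answer.
        if pos != -1 and pos < correct_pos:
            return False
    return True


# The example from the PR description: correct answer first, context second.
print(answer_before_forbidden(
    "399001.SZ is most volatile; N225 is least volatile",
    correct="399001.SZ",
    forbidden=["N225"],
))
```

Under this rule, a forbidden value that leads the response still fails, while the same value mentioned later as supporting context does not.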

Bare `except:` → `except ValueError:` (6 files)

A bare `except:` catches every exception, including SystemExit and KeyboardInterrupt, which should normally propagate. Fixed in: GITHUB_REPOS/q1, stockmarket/q1, PANCANCER_ATLAS/q3, yelp/q1, yelp/q2, googlelocal/q3.
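As an illustration of the narrowed handler, here is a small sketch assuming the validators parse numeric tokens with `float()`. The helper `to_float` is hypothetical and only shows the pattern:

```python
from typing import Optional


def to_float(token: str) -> Optional[float]:
    """Parse a numeric token, returning None when it is not a number."""
    try:
        return float(token.replace(",", ""))
    except ValueError:
        # Only malformed numbers are swallowed here; KeyboardInterrupt,
        # SystemExit, and unrelated bugs still propagate to the caller.
        return None


print(to_float("1,234.5"))
print(to_float("n/a"))
```

Catching only `ValueError` keeps the "not a number" case quiet without hiding interrupts or genuine programming errors.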

Whitespace-stripping match → word-boundary regex (agnews/q1, music_brainz/q2)

Old: stripped ALL whitespace then used substring matching ("therundown" matched inside "therundownfilm")
New: case-insensitive word-boundary regex (\bThe Rundown\b)
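The difference between the two matching strategies can be sketched as follows. Both function names are hypothetical, and the "old" version is a reconstruction from the description above rather than the original code:

```python
import re


def old_style_match(text: str, title: str) -> bool:
    """Reconstruction of the old behaviour: drop all whitespace, then substring-match."""
    def squash(s: str) -> str:
        return "".join(s.split()).lower()
    return squash(title) in squash(text)


def title_mentioned(text: str, title: str) -> bool:
    """Case-insensitive match that requires word boundaries around the title."""
    pattern = r"\b" + re.escape(title) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


sample = "The Rundownfilm"
print(old_style_match(sample, "The Rundown"))   # substring match after stripping whitespace
print(title_mentioned(sample, "The Rundown"))   # boundary check on the original text
```

Note that `\b` only behaves as expected when the title begins and ends with a word character; titles that start or end with punctuation would need a different boundary test.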

Test plan

Wrote 19 unit tests covering all changed validators (pass/fail cases, edge cases)
All 19 tests pass
Cross-checked against 216 actual PromptQL benchmark responses (4 runs × 54 queries)
Verified all 13 files parse and import correctly
Confirmed result flips are correct (old false positives now properly rejected)

🤖 Generated with Claude Code

Window size mismatches (comment vs code):
- PANCANCER_ATLAS/query1: 10 → 50 chars (matching comment)
- DEPS_DEV_V1/query1: 10 → 50 chars (matching comment)
- stockindex/query3: 20 → 50 chars (matching comment)

Forbidden-value false positives (stockindex/query1, query2):
- Old: reject if ANY forbidden value appears anywhere in output
- New: reject only if forbidden value appears BEFORE the correct answer
- Prevents penalizing correct answers that also provide context

Bare except clauses → except ValueError (6 files):
- GITHUB_REPOS/query1, stockmarket/query1, PANCANCER_ATLAS/query3, yelp/query1, yelp/query2, googlelocal/query3

Whitespace-stripping substring match → word-boundary regex (2 files):
- agnews/query1, music_brainz/query2
- Old: stripped ALL whitespace then substring matched (e.g., "therundown" matched inside "therundownfilm")
- New: case-insensitive word-boundary match via re.search(\b...\b)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

anushruthasura mentioned this pull request Mar 10, 2026: Enable inline validation after agent runs #14 (Closed)

Ruiying-Ma added 2 commits March 11, 2026 19:26

discard changes

5bff93e

update comment

87b0f2f

Ruiying-Ma merged commit 247d2bd into main Mar 12, 2026

Ruiying-Ma merged 3 commits into main from fix/validation-script-bugs
