Enable inline validation after agent runs by anushruthasura · Pull Request #14 · ucbepic/DataAgentBench

anushruthasura · 2026-03-10T22:36:02Z

Summary

Uncomments the validate() method and its call in DataAgent.run() so that each agent run immediately validates its answer against ground truth. Results are logged to validation.jsonl in the run directory.

Previously, validation only happened externally via accuracy.py, meaning users got no immediate feedback on correctness during development or debugging.

Code change

Uncomment DataAgent.validate() method (was already fully implemented)
Uncomment the validation call in DataAgent.run()
Use self.logger.info instead of print for consistency with rest of codebase
Truncate logged answers to 200 chars to avoid flooding logs
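
A minimal sketch of the logging behaviour described above, not the actual DataAgent code. The names `truncate` and `log_validation`, the logger name, and the record fields are hypothetical. Only the 200-character limit, the use of a logger instead of print, and the validation.jsonl target come from this description. The sketch assumes the full answer is written to the JSONL file and only the logger message is truncated.

```python
import json
import logging

logger = logging.getLogger("data_agent_sketch")  # hypothetical logger name

MAX_LOGGED_CHARS = 200  # limit stated in the PR description


def truncate(text, limit=MAX_LOGGED_CHARS):
    """Cut text to `limit` characters, appending '...' when anything was cut."""
    text = str(text)
    if len(text) <= limit:
        return text
    return text[:limit] + "..."


def log_validation(is_valid, llm_answer, ground_truth, log_path):
    """Append one JSON line to log_path and log a truncated summary."""
    record = {
        "is_valid": bool(is_valid),
        "llm_answer": str(llm_answer),
        "ground_truth": str(ground_truth),
    }
    # Assumption: the JSONL file keeps the full values for later analysis.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    # Only the human-facing log line is shortened.
    logger.info(
        "validation is_valid=%s llm_answer=%r ground_truth=%r",
        record["is_valid"],
        truncate(llm_answer),
        truncate(ground_truth),
    )
    return record
```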

Broader validation architecture suggestions

While running PromptQL through all 54 queries (8 runs, 2 models), we identified several architectural patterns in the validation system worth discussing:

Ground truth is hardcoded in every validator

All 54 validate.py files hardcode their expected answers in Python code, despite ground_truth.csv files existing in every query directory. The central common_scaffold/validate/validate.py reads the CSV but never uses it — it just calls validate_mod.validate(llm_answer) which has its own hardcoded values. This creates a maintenance risk where the CSV and Python code can drift apart. Consider having validators read from ground_truth.csv as the single source of truth.
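
One way the single-source-of-truth suggestion could look, as a sketch only. The layout of ground_truth.csv is not described here, so the code assumes a header-less file in which every non-empty cell is an expected value. The `query_dir` argument and the substring check are also illustrative assumptions, not the benchmark's actual validator interface.

```python
import csv
from pathlib import Path


def load_ground_truth(query_dir):
    """Return every non-empty cell of ground_truth.csv in query_dir as a string."""
    csv_path = Path(query_dir) / "ground_truth.csv"
    with open(csv_path, newline="", encoding="utf-8") as f:
        return [cell.strip() for row in csv.reader(f) for cell in row if cell.strip()]


def validate(llm_answer, query_dir):
    """Hypothetical validator: pass only if every expected value appears in the answer."""
    expected = load_ground_truth(query_dir)
    answer = llm_answer.lower()
    # Collect the expected values that the answer does not mention.
    missing = [value for value in expected if value.lower() not in answer]
    return len(missing) == 0, missing
```

With this shape, updating an expected answer means editing the CSV only, so the Python code and the CSV cannot disagree.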

No partial credit for list-type queries

Validators for multi-item answers (e.g., "list the top 5 businesses with hours and ratings") are all-or-nothing — one missing item fails the entire validation. For benchmarking purposes, partial credit (e.g., "found 4/5 expected items") would give more signal about agent capability. We built this into our custom validator and found it very informative.
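
A sketch of what partial credit might look like, assuming expected items can be matched by case-insensitive substring search. This is not the custom validator mentioned above; the function name `partial_credit` and the result fields are hypothetical.

```python
def partial_credit(llm_answer, expected_items):
    """Score an answer by the fraction of expected items it mentions."""
    if not expected_items:
        raise ValueError("expected_items must not be empty")
    answer = llm_answer.lower()
    found = [item for item in expected_items if item.lower() in answer]
    missing = [item for item in expected_items if item.lower() not in answer]
    return {
        "found": found,
        "missing": missing,
        "score": len(found) / len(expected_items),
        # The strict all-or-nothing result stays available alongside the score.
        "is_valid": not missing,
    }
```

Reporting `score` next to the strict `is_valid` flag keeps existing pass/fail accuracy numbers comparable while exposing how close a failed answer was.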

Proximity window pattern is fragile

The "find name, look for value within N chars" pattern (used in ~10 validators) breaks with any non-trivial formatting — tables, bullet points, multi-line output. This was the #1 cause of false negatives in our testing. PR #13 fixes the worst cases (where comments and code disagreed), but the pattern itself is inherently fragile for agents that produce structured/formatted output. Consider either much larger windows (200+ chars) or switching to a "both values present anywhere in output" approach for simple name-value pairs.
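
The difference between the two approaches can be sketched as follows. Both functions are illustrative reconstructions of the patterns named above, not code from the repository's validators, and the default window of 50 characters is an assumed value.

```python
def within_window(answer, name, value, window=50):
    """True if `value` occurs within `window` characters after any occurrence of `name`."""
    text = answer.lower()
    name, value = name.lower(), value.lower()
    start = text.find(name)
    while start != -1:
        end = start + len(name)
        if value in text[end:end + window]:
            return True
        start = text.find(name, start + 1)
    return False


def both_present(answer, name, value):
    """True if `name` and `value` each occur anywhere in the answer."""
    text = answer.lower()
    return name.lower() in text and value.lower() in text


# A padded table row pushes the value outside a small window.
row = "Joe's Diner" + " " * 80 + "4.5"
print(within_window(row, "Joe's Diner", "4.5"))              # False
print(within_window(row, "Joe's Diner", "4.5", window=200))  # True
print(both_present(row, "Joe's Diner", "4.5"))               # True
```

The trade-off is that `both_present` no longer ties the value to the name, so it can pass when the value appears elsewhere in the output for an unrelated reason.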

Test plan

Verified DataAgent.py parses correctly (ast.parse)
Import and validate() function signature match existing call site
validation_log_path attribute already initialized in __init__
validate import already present at top of file (line 18)
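
The first check in the test plan can be reproduced with the standard-library `ast` module. The sketch below is an assumption about how such a check might be scripted; the helper names are hypothetical and the sample sources stand in for DataAgent.py.

```python
import ast


def parses_ok(source):
    """Return True if the source string is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


def class_has_method(source, class_name, method_name):
    """Return True if class_name directly defines a method called method_name."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return any(
                isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
                and item.name == method_name
                for item in node.body
            )
    return False
```

Because comments never reach the syntax tree, `class_has_method` reports a commented-out `validate()` as absent, which makes it a quick way to confirm the method was actually uncommented.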

🤖 Generated with Claude Code

Uncomment the validate() method and its call in DataAgent.run() so that each agent run immediately validates its answer against ground truth and logs the result to validation.jsonl. Previously validation only happened externally via accuracy.py, giving no immediate feedback during development/debugging.

Minor improvements:
- Use self.logger.info instead of print for consistency
- Truncate llm_answer/ground_truth to 200 chars in log output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

shreyashankar · 2026-03-12T01:23:45Z

Closing because the proposed code change doesn't match the PR description, and the PR objective is not urgent.

shreyashankar closed this Mar 12, 2026

anushruthasura wants to merge 1 commit into main from fix/enable-inline-validation

2 participants
