Skip to content

Enable inline validation after agent runs#14

Closed
anushruthasura wants to merge 1 commit intomainfrom
fix/enable-inline-validation
Closed

Enable inline validation after agent runs#14
anushruthasura wants to merge 1 commit intomainfrom
fix/enable-inline-validation

Conversation

@anushruthasura
Copy link
Copy Markdown
Collaborator

Summary

Uncomments the validate() method and its call in DataAgent.run() so that each agent run immediately validates its answer against ground truth. Results are logged to validation.jsonl in the run directory.

Previously, validation only happened externally via accuracy.py, meaning users got no immediate feedback on correctness during development or debugging.

Code change

  • Uncomment DataAgent.validate() method (was already fully implemented)
  • Uncomment the validation call in DataAgent.run()
  • Use self.logger.info instead of print for consistency with rest of codebase
  • Truncate logged answers to 200 chars to avoid flooding logs

Broader validation architecture suggestions

While running PromptQL through all 54 queries (8 runs, 2 models), we identified several architectural patterns in the validation system worth discussing:

Ground truth is hardcoded in every validator

All 54 validate.py files hardcode their expected answers in Python code, despite ground_truth.csv files existing in every query directory. The central common_scaffold/validate/validate.py reads the CSV but never uses it — it just calls validate_mod.validate(llm_answer) which has its own hardcoded values. This creates a maintenance risk where the CSV and Python code can drift apart. Consider having validators read from ground_truth.csv as the single source of truth.

No partial credit for list-type queries

Validators for multi-item answers (e.g., "list the top 5 businesses with hours and ratings") are all-or-nothing — one missing item fails the entire validation. For benchmarking purposes, partial credit (e.g., "found 4/5 expected items") would give more signal about agent capability. We built this into our custom validator and found it very informative.

Proximity window pattern is fragile

The "find name, look for value within N chars" pattern (used in ~10 validators) breaks with any non-trivial formatting — tables, bullet points, multi-line output. This was the #1 cause of false negatives in our testing. PR #13 fixes the worst cases (where comments and code disagreed), but the pattern itself is inherently fragile for agents that produce structured/formatted output. Consider either much larger windows (200+ chars) or switching to a "both values present anywhere in output" approach for simple name-value pairs.

Test plan

  • Verified DataAgent.py parses correctly (ast.parse)
  • Import and validate() function signature match existing call site
  • validation_log_path attribute already initialized in __init__
  • validate import already present at top of file (line 18)

🤖 Generated with Claude Code

Uncomment the validate() method and its call in DataAgent.run() so
that each agent run immediately validates its answer against ground
truth and logs the result to validation.jsonl.

Previously validation only happened externally via accuracy.py, giving
no immediate feedback during development/debugging.

Minor improvements:
- Use self.logger.info instead of print for consistency
- Truncate llm_answer/ground_truth to 200 chars in log output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@shreyashankar
Copy link
Copy Markdown
Collaborator

shreyashankar commented Mar 12, 2026

Closing because the proposed code change doesn't match the description of the PR; and the PR objective is not so urgent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants