Enable inline validation after agent runs #14
Closed
anushruthasura wants to merge 1 commit into main
Uncomment the validate() method and its call in DataAgent.run() so that each agent run immediately validates its answer against ground truth and logs the result to validation.jsonl. Previously validation only happened externally via accuracy.py, giving no immediate feedback during development/debugging.

Minor improvements:
- Use self.logger.info instead of print for consistency
- Truncate llm_answer/ground_truth to 200 chars in log output

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Collaborator
Closing because the proposed code change doesn't match the description of the PR, and the PR objective is not so urgent.
Summary
Uncomments the validate() method and its call in DataAgent.run() so that each agent run immediately validates its answer against ground truth. Results are logged to validation.jsonl in the run directory.

Previously, validation only happened externally via accuracy.py, meaning users got no immediate feedback on correctness during development or debugging.

Code change
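As a rough illustration of the behavior described in the summary, the sketch below appends one record per run to a JSONL file and logs a truncated summary. It is not the code from DataAgent.py: the function name log_validation, the record field names, and the choice to truncate only the logger message (rather than the stored record) are all assumptions made for the example.

```python
import json
import logging
from pathlib import Path

MAX_LOG_CHARS = 200  # truncation length taken from the commit message


def log_validation(llm_answer, ground_truth, is_valid, log_path, logger):
    """Append one validation record to a JSONL file and log a short summary.

    Hypothetical helper: field names and layout are illustrative only.
    """
    record = {
        "is_valid": bool(is_valid),
        "llm_answer": llm_answer,
        "ground_truth": ground_truth,
    }
    log_path = Path(log_path)
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # JSONL: one JSON object per line, appended so earlier runs are kept.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    # Assumption: only the log message is truncated, not the stored record.
    logger.info(
        "validation: is_valid=%s llm_answer=%r ground_truth=%r",
        record["is_valid"],
        str(llm_answer)[:MAX_LOG_CHARS],
        str(ground_truth)[:MAX_LOG_CHARS],
    )
    return record
```

Using a logger rather than print keeps the validation message in the same stream, and under the same level filtering, as the rest of the agent's output.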
- Uncomment the DataAgent.validate() method (was already fully implemented)
- Uncomment its call in DataAgent.run()
- Use self.logger.info instead of print for consistency with rest of codebase

Broader validation architecture suggestions
While running PromptQL through all 54 queries (8 runs, 2 models), we identified several architectural patterns in the validation system worth discussing:
Ground truth is hardcoded in every validator
All 54 validate.py files hardcode their expected answers in Python code, despite ground_truth.csv files existing in every query directory. The central common_scaffold/validate/validate.py reads the CSV but never uses it; it just calls validate_mod.validate(llm_answer), which has its own hardcoded values. This creates a maintenance risk where the CSV and Python code can drift apart. Consider having validators read from ground_truth.csv as the single source of truth.

No partial credit for list-type queries
Validators for multi-item answers (e.g., "list the top 5 businesses with hours and ratings") are all-or-nothing — one missing item fails the entire validation. For benchmarking purposes, partial credit (e.g., "found 4/5 expected items") would give more signal about agent capability. We built this into our custom validator and found it very informative.
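A minimal sketch of how partial credit could be computed for a list-type query is shown below. It is not the custom validator mentioned above; the function name partial_credit, the case-insensitive substring matching, and the business names in the usage lines are all hypothetical.

```python
def partial_credit(llm_answer, expected_items):
    """Return (found_items, total, score) for a list-type answer.

    An item counts as found if it appears anywhere in the answer,
    compared case-insensitively. Illustrative matching rule only.
    """
    text = llm_answer.lower()
    found = [item for item in expected_items if item.lower() in text]
    total = len(expected_items)
    score = len(found) / total if total else 0.0
    return found, total, score


# Usage with made-up data: three of five expected names appear in the answer.
expected = ["Cafe Uno", "Taco Hut", "Noodle Bar", "Book Barn", "Pie Shop"]
answer = "Top picks: Cafe Uno, Taco Hut and Book Barn."
found, total, score = partial_credit(answer, expected)
print(f"found {len(found)}/{total} expected items")  # found 3/5 expected items
```

A strict pass/fail result can still be derived from the same output (score == 1.0), so a fractional score adds signal without replacing the existing all-or-nothing check.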
Proximity window pattern is fragile
The "find name, look for value within N chars" pattern (used in ~10 validators) breaks with any non-trivial formatting — tables, bullet points, multi-line output. This was the #1 cause of false negatives in our testing. PR #13 fixes the worst cases (where comments and code disagreed), but the pattern itself is inherently fragile for agents that produce structured/formatted output. Consider either much larger windows (200+ chars) or switching to a "both values present anywhere in output" approach for simple name-value pairs.
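To make the fragility concrete, the sketch below contrasts a proximity-window check with a "both values present anywhere" check. Both functions are hypothetical reconstructions rather than the code in the existing validators, and the sketch assumes the window extends forward from the end of the name.

```python
def within_window(text, name, value, window=50):
    """True if `value` occurs within `window` chars after some occurrence of `name`."""
    start = text.find(name)
    while start != -1:
        end = start + len(name)
        if value in text[end:end + window]:
            return True
        start = text.find(name, start + 1)
    return False


def both_present(text, name, value):
    """True if `name` and `value` both occur anywhere in `text`."""
    return name in text and value in text


# A table-like row (made-up data) where column padding pushes the value
# far from the name.
row = "Cafe Uno" + " " * 60 + "4.5"
print(within_window(row, "Cafe Uno", "4.5", window=20))   # False
print(within_window(row, "Cafe Uno", "4.5", window=200))  # True
print(both_present(row, "Cafe Uno", "4.5"))               # True
```

The trade-off is that both_present can pass when the value belongs to a different entity in the same output, which is why it suits only simple name-value pairs, as noted above.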
Test plan
- DataAgent.py parses correctly (ast.parse)
- validate() function signature matches existing call site
- validation_log_path attribute already initialized in __init__
- validate import already present at top of file (line 18)

🤖 Generated with Claude Code