
fix: Isolate evaluator errors in run_evaluations#84

Merged
afarntrog merged 2 commits into strands-agents:main from afarntrog:82_duplicate_entries
Jan 13, 2026

Conversation

@afarntrog
Contributor

@afarntrog afarntrog commented Jan 12, 2026

Description

Previously, a single try/except wrapped both task execution and all evaluators, causing all evaluators to fail if any single evaluator threw an exception.

Now task execution and evaluator execution have separate error handling:

  • Task execution errors: Record failure for all evaluators and skip case
  • Evaluator errors: Only the failing evaluator is marked failed; others continue to run and can succeed independently
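The split described above can be sketched as follows. All names here (`run_case`, `task`, `evaluator.evaluate`, the result dict shape) are hypothetical stand-ins for the actual evaluation harness, not the real strands-agents API:

```python
# Sketch of the isolated error handling described above (hypothetical names,
# not the project's actual code).

def run_case(case, task, evaluators):
    results = {}

    # 1) Task execution gets its own try/except: if the task itself fails,
    #    every evaluator is marked failed and the case is skipped.
    try:
        output = task(case)
    except Exception as e:
        for evaluator in evaluators:
            results[evaluator.name] = {
                "passed": False,
                "error": f"Task execution error: {e}",
            }
        return results

    # 2) Each evaluator is wrapped separately, so one failing evaluator
    #    no longer takes down the others.
    for evaluator in evaluators:
        try:
            results[evaluator.name] = evaluator.evaluate(case, output)
        except Exception as e:
            results[evaluator.name] = {
                "passed": False,
                "error": f"Evaluator error: {e}",
            }
    return results
```

With this structure, an exception in one evaluator only marks that evaluator failed, while a task-level exception fails all of them up front.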

Related Issues

#82

Documentation PR

Type of Change

Bug fix

Testing

How have you tested the change? Verified that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@afarntrog
Contributor Author

The integ tests failure is unrelated: The test test_output_evaluator_multiple_cases runs 3 evaluations in parallel via run_evaluations_async, and the third one got throttled by AWS.

@strands-agent strands-agent left a comment

Code Review Summary

Overall Assessment: ✅ APPROVE - This is a well-structured bug fix with proper error isolation.

What This PR Does

The PR fixes a critical bug in run_evaluations() where a single failing evaluator would cause ALL evaluators to fail. The fix properly isolates errors:

  1. Task execution errors → All evaluators marked as failed (correct behavior - no output to evaluate)
  2. Individual evaluator errors → Only the failing evaluator is marked failed; others continue independently

Code Quality Assessment

✅ Strengths

  1. Clean separation of concerns: Task execution and evaluator execution now have distinct try/except blocks
  2. Appropriate error messages:
    • "Task execution error: {str(e)}" for task failures
    • "Evaluator error: {str(e)}" for evaluator failures
  3. Good test coverage: Added ThrowingEvaluator test fixture and test_experiment_run_evaluations_evaluator_error_isolated() that validates the exact behavior
  4. Minimal changes: Only 18 lines added to production code, focused and atomic
  5. Correct context preservation: Evaluator errors now use evaluation_context.model_dump() instead of case.model_dump(), preserving the actual task output
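A ThrowingEvaluator-style fixture of the kind mentioned above could look like the following. The class and method names are assumptions for illustration, not the project's actual test code:

```python
# Hypothetical sketch of the test fixtures used to exercise error isolation:
# one evaluator that always raises, one that always succeeds.

class ThrowingEvaluator:
    """Evaluator whose evaluate() always raises, to hit the error path."""
    name = "throwing"

    def evaluate(self, case, output):
        raise RuntimeError("intentional failure for testing")


class PassingEvaluator:
    """Control evaluator that always succeeds."""
    name = "passing"

    def evaluate(self, case, output):
        return {"passed": True}
```

Running both against the same case lets a test assert that the throwing evaluator is marked failed while the passing one still succeeds — the exact behavior this PR introduces.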

📝 Minor Observations (Non-blocking)

  1. Typo fixed: Nice catch on changing "An error occured" to "Evaluator error", which also removes the misspelling ("occured" → "occurred")

  2. Test completeness: The test covers the sync path. Consider adding an async test in a future PR:

    @pytest.mark.asyncio
    async def test_experiment_run_evaluations_async_evaluator_error_isolated():
        # Similar test for run_evaluations_async

CI Status

  • Pull Request and Push Action: SUCCESS
  • ⚠️ Secure Integration test: FAILURE (as noted by author - AWS throttling, unrelated to this change)

Verdict

This is a clean, well-tested bug fix that follows best practices. The error isolation pattern is exactly what's needed for robust evaluator execution. The integration test failure is confirmed unrelated (AWS throttling on parallel evaluations).

🦆


🤖 This is an experimental AI agent response from the Strands team, powered by Strands Agents. We're exploring how AI agents can help with community support and development. Your feedback helps us improve! If you'd prefer human assistance, please let us know.

@afarntrog afarntrog merged commit a1e62eb into strands-agents:main Jan 13, 2026
12 of 14 checks passed