
fix: Isolate evaluator errors in run_evaluations#84

Merged
afarntrog merged 2 commits into strands-agents:main from afarntrog:82_duplicate_entries
Jan 13, 2026

Conversation

@afarntrog
Contributor

@afarntrog afarntrog commented Jan 12, 2026

Description

Previously, a single try/except wrapped both task execution and all evaluators, causing all evaluators to fail if any single evaluator threw an exception.

Now task execution and evaluator execution have separate error handling:

  • Task execution errors: Record failure for all evaluators and skip case
  • Evaluator errors: Only the failing evaluator is marked failed; others continue to run and can succeed independently
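The split described above can be sketched as follows. All names here (`run_case`, `task`, `evaluator.evaluate`, the result dict shape) are hypothetical stand-ins for the actual evaluation harness, not the real strands-agents API:

```python
# Sketch of the isolated error handling described above (hypothetical names,
# not the project's actual code).

def run_case(case, task, evaluators):
    results = {}

    # 1) Task execution gets its own try/except: if the task itself fails,
    #    every evaluator is marked failed and the case is skipped.
    try:
        output = task(case)
    except Exception as e:
        for evaluator in evaluators:
            results[evaluator.name] = {
                "passed": False,
                "error": f"Task execution error: {e}",
            }
        return results

    # 2) Each evaluator is wrapped separately, so one failing evaluator
    #    no longer takes down the others.
    for evaluator in evaluators:
        try:
            results[evaluator.name] = evaluator.evaluate(case, output)
        except Exception as e:
            results[evaluator.name] = {
                "passed": False,
                "error": f"Evaluator error: {e}",
            }
    return results
```

With this structure, an exception in one evaluator only marks that evaluator failed, while a task-level exception fails all of them up front.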

Related Issues

#82

Documentation PR

Type of Change

Bug fix

Testing

How have you tested the change? Verified that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@afarntrog
Contributor Author

The integ tests failure is unrelated: The test test_output_evaluator_multiple_cases runs 3 evaluations in parallel via run_evaluations_async, and the third one got throttled by AWS.

@strands-agent strands-agent left a comment

Code Review Summary

Overall Assessment: ✅ APPROVE - This is a well-structured bug fix with proper error isolation.

What This PR Does

The PR fixes a critical bug in run_evaluations() where a single failing evaluator would cause ALL evaluators to fail. The fix properly isolates errors:

  1. Task execution errors → All evaluators marked as failed (correct behavior - no output to evaluate)
  2. Individual evaluator errors → Only the failing evaluator is marked failed; others continue independently

Code Quality Assessment

✅ Strengths

  1. Clean separation of concerns: Task execution and evaluator execution now have distinct try/except blocks
  2. Appropriate error messages:
    • "Task execution error: {str(e)}" for task failures
    • "Evaluator error: {str(e)}" for evaluator failures
  3. Good test coverage: Added ThrowingEvaluator test fixture and test_experiment_run_evaluations_evaluator_error_isolated() that validates the exact behavior
  4. Minimal changes: Only 18 lines added to production code, focused and atomic
  5. Correct context preservation: Evaluator errors now use evaluation_context.model_dump() instead of case.model_dump(), preserving the actual task output
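A ThrowingEvaluator-style fixture of the kind mentioned above could look like the following. The class and method names are assumptions for illustration, not the project's actual test code:

```python
# Hypothetical sketch of the test fixtures used to exercise error isolation:
# one evaluator that always raises, one that always succeeds.

class ThrowingEvaluator:
    """Evaluator whose evaluate() always raises, to hit the error path."""
    name = "throwing"

    def evaluate(self, case, output):
        raise RuntimeError("intentional failure for testing")


class PassingEvaluator:
    """Control evaluator that always succeeds."""
    name = "passing"

    def evaluate(self, case, output):
        return {"passed": True}
```

Running both against the same case lets a test assert that the throwing evaluator is marked failed while the passing one still succeeds — the exact behavior this PR introduces.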

📝 Minor Observations (Non-blocking)

  1. Typo fixed: Nice catch on changing "An error occured" to "Evaluator error", which also removes the misspelling ("occured" → "occurred")

  2. Test completeness: The test covers the sync path. Consider adding an async test in a future PR:

    @pytest.mark.asyncio
    async def test_experiment_run_evaluations_async_evaluator_error_isolated():
        # Similar test for run_evaluations_async

CI Status

  • Pull Request and Push Action: SUCCESS
  • ⚠️ Secure Integration test: FAILURE (as noted by author - AWS throttling, unrelated to this change)

Verdict

This is a clean, well-tested bug fix that follows best practices. The error isolation pattern is exactly what's needed for robust evaluator execution. The integration test failure is confirmed unrelated (AWS throttling on parallel evaluations).

🦆


🤖 This is an experimental AI agent response from the Strands team, powered by Strands Agents. We're exploring how AI agents can help with community support and development. Your feedback helps us improve! If you'd prefer human assistance, please let us know.

@afarntrog afarntrog merged commit a1e62eb into strands-agents:main Jan 13, 2026
12 of 14 checks passed