4 changes: 2 additions & 2 deletions AGENTS.md
@@ -193,13 +193,13 @@ The goal of this repository is to revamp this documentation repo so that it prov
│ │ ├── streaming
│ │ │ ├── async-iterators.md
│ │ │ ├── callback-handlers.md
-│ │ │ └── overview.md
+│ │ │ └── index.md
│ │ └── tools
│ │ ├── community-tools-package.md
│ │ ├── executors.md
│ │ ├── mcp-tools.md
│ │ ├── python-tools.md
-│ │ └── tools_overview.md
+│ │ └── index.md
│ ├── deploy
│ │ ├── deploy_to_amazon_ec2.md
│ │ ├── deploy_to_amazon_eks.md
2 changes: 1 addition & 1 deletion docs/user-guide/concepts/agents/agent-loop.md
@@ -110,7 +110,7 @@ Solutions:

### Inappropriate Tool Selection

-When the model consistently picks the wrong tool, the problem is usually ambiguous tool descriptions. Review the descriptions from the model's perspective. If two tools have overlapping descriptions, the model has no basis for choosing between them. See [Tools Overview](../tools/tools_overview.md) for guidance on writing effective descriptions.
+When the model consistently picks the wrong tool, the problem is usually ambiguous tool descriptions. Review the descriptions from the model's perspective. If two tools have overlapping descriptions, the model has no basis for choosing between them. See [Tools Overview](../tools/index.md) for guidance on writing effective descriptions.

### MaxTokensReachedException

@@ -233,7 +233,7 @@ See [Model Providers](models/nova_sonic.md) for provider-specific options.

`BidiAgent` supports many of the same constructs as `Agent`:

-- **[Tools](../../tools/tools_overview.md)**: Function calling works identically
+- **[Tools](../../tools/index.md)**: Function calling works identically
- **[Hooks](hooks.md)**: Lifecycle event handling with bidirectional-specific events
- **[Session Management](session-management.md)**: Conversation persistence across sessions
- **[Tool Executors](../../tools/executors.md)**: Concurrent and custom execution patterns
@@ -6,7 +6,7 @@ Bidirectional streaming events enable real-time monitoring and processing of aud

## Event Model

-Bidirectional streaming uses a different event model than [standard streaming](../../streaming/overview.md):
+Bidirectional streaming uses a different event model than [standard streaming](../../streaming/index.md):

**Standard Streaming:**

@@ -322,7 +322,7 @@ Events for tool execution during conversations. Bidirectional streaming reuses t

#### ToolUseStreamEvent

-Emitted when the model requests tool execution. See [Tools Overview](../../tools/tools_overview.md) for details.
+Emitted when the model requests tool execution. See [Tools Overview](../../tools/index.md) for details.

```python
{
2 changes: 1 addition & 1 deletion docs/user-guide/concepts/interrupts.md
@@ -156,7 +156,7 @@ agent = Agent(

```

-> ⚠️ Interrupts are not supported in [direct tool calls](./tools/tools_overview.md#direct-method-calls) (i.e., calls such as `agent.tool.my_tool()`).
+> ⚠️ Interrupts are not supported in [direct tool calls](./tools/index.md#direct-method-calls) (i.e., calls such as `agent.tool.my_tool()`).

### Components

2 changes: 1 addition & 1 deletion docs/user-guide/concepts/multi-agent/graph.md
@@ -334,7 +334,7 @@ async for event in graph.stream_async("Research and analyze market trends"):
print(f"Graph completed: {result.status}")
```

-See the [streaming overview](../streaming/overview.md#multi-agent-events) for details on all multi-agent event types.
+See the [streaming overview](../streaming/index.md#multi-agent-events) for details on all multi-agent event types.

## Graph Results

2 changes: 1 addition & 1 deletion docs/user-guide/concepts/multi-agent/swarm.md
@@ -225,7 +225,7 @@ async for event in swarm.stream_async("Design and implement a REST API"):
print(f"\nSwarm completed: {result.status}")
```

-See the [streaming overview](../streaming/overview.md#multi-agent-events) for details on all multi-agent event types.
+See the [streaming overview](../streaming/index.md#multi-agent-events) for details on all multi-agent event types.

## Swarm Results

2 changes: 1 addition & 1 deletion docs/user-guide/concepts/streaming/async-iterators.md
@@ -2,7 +2,7 @@

Async iterators provide asynchronous streaming of agent events, allowing you to process events as they occur in real-time. This approach is ideal for asynchronous frameworks where you need fine-grained control over async execution flow.

-For a complete list of available events including text generation, tool usage, lifecycle, and reasoning events, see the [streaming overview](./overview.md#event-types).
+For a complete list of available events including text generation, tool usage, lifecycle, and reasoning events, see the [streaming overview](./index.md#event-types).

## Basic Usage

2 changes: 1 addition & 1 deletion docs/user-guide/concepts/streaming/callback-handlers.md
@@ -4,7 +4,7 @@

Callback handlers allow you to intercept and process events as they happen during agent execution in Python. This enables real-time monitoring, custom output formatting, and integration with external systems through function-based event handling.

-For a complete list of available events including text generation, tool usage, lifecycle, and reasoning events, see the [streaming overview](./overview.md#event-types).
+For a complete list of available events including text generation, tool usage, lifecycle, and reasoning events, see the [streaming overview](./index.md#event-types).

> **Note:** For asynchronous applications, consider [async iterators](./async-iterators.md) instead.

2 changes: 1 addition & 1 deletion docs/user-guide/concepts/tools/custom-tools.md
@@ -363,7 +363,7 @@ Tools can access their execution context to interact with the invoking agent, cu

=== "Python"

-    Async tools can yield intermediate results to provide real-time progress updates. Each yielded value becomes a [streaming event](../streaming/overview.md), with the final value serving as the tool's return result:
+    Async tools can yield intermediate results to provide real-time progress updates. Each yielded value becomes a [streaming event](../streaming/index.md), with the final value serving as the tool's return result:

```python
from datetime import datetime
4 changes: 2 additions & 2 deletions docs/user-guide/deploy/operating-agents-in-production.md
@@ -48,7 +48,7 @@ agent = Agent(
)
```

-See [Adding Tools to Agents](../concepts/tools/tools_overview.md/#adding-tools-to-agents) and [Auto reloading tools](../concepts/tools/tools_overview.md#auto-loading-and-reloading-tools) for more information.
+See [Adding Tools to Agents](../concepts/tools/index.md/#adding-tools-to-agents) and [Auto reloading tools](../concepts/tools/index.md#auto-loading-and-reloading-tools) for more information.

### Security Considerations

@@ -150,6 +150,6 @@ Operating Strands agents in production requires careful consideration of configu

- [Conversation Management](../../user-guide/concepts/agents/conversation-management.md)
- [Streaming - Async Iterator](../../user-guide/concepts/streaming/async-iterators.md)
-- [Tool Development](../../user-guide/concepts/tools/tools_overview.md)
+- [Tool Development](../../user-guide/concepts/tools/index.md)
- [Guardrails](../../user-guide/safety-security/guardrails.md)
- [Responsible AI](../../user-guide/safety-security/responsible-ai.md)
190 changes: 97 additions & 93 deletions docs/user-guide/evals-sdk/eval-sop.md
@@ -231,47 +231,49 @@ Generates insights and recommendations:

The evaluation plan follows a comprehensive structured format with detailed analysis and implementation guidance:

-# Evaluation Plan for QA+Search Agent
-
-## 1. Evaluation Requirements
-- **User Input:** "generate an evaluation plan for this qa agent..."
-- **Interpreted Evaluation Requirements:** Evaluate the QA agent's ability to answer questions using web search capabilities...
-
-## 2. Agent Analysis
-| **Attribute** | **Details** |
-| :-------------------- | :---------------------------------------------------------- |
-| **Agent Name** | QA+Search |
-| **Purpose** | Answer questions by searching the web using Tavily API... |
-| **Core Capabilities** | Web search integration, information synthesis... |
-
-**Agent Architecture Diagram:**
-(Mermaid diagram showing User Query → Agent → WebSearchTool → Tavily API flow)
-
-## 3. Evaluation Metrics
-### Answer Quality Score
-- **Evaluation Area:** Final response quality
-- **Method:** LLM-as-Judge (using OutputEvaluator with custom rubric)
-- **Scoring Scale:** 0.0 to 1.0
-- **Pass Threshold:** 0.75 or higher
-
-## 4. Test Data Generation
-- **Simple Factual Questions**: Questions requiring basic web search...
-- **Multi-Step Reasoning Questions**: Questions requiring synthesis...
-
-## 5. Evaluation Implementation Design
-### 5.1 Evaluation Code Structure
-./ # Repository root directory
-├── requirements.txt # Consolidated dependencies
-└── eval/ # Evaluation workspace
-├── README.md # Running instructions
-├── run_evaluation.py # Strands Evals SDK implementation
-└── results/ # Evaluation outputs
-
-## 6. Progress Tracking
-### 6.1 User Requirements Log
-| **Timestamp** | **Source** | **Requirement** |
-| :------------ | :--------- | :-------------- |
-| 2025-12-01 | eval sop | Generate evaluation plan... |
+```markdown
+# Evaluation Plan for QA+Search Agent
+
+## 1. Evaluation Requirements
+- **User Input:** "generate an evaluation plan for this qa agent..."
+- **Interpreted Evaluation Requirements:** Evaluate the QA agent's ability to answer questions using web search capabilities...
+
+## 2. Agent Analysis
+| **Attribute** | **Details** |
+| :-------------------- | :---------------------------------------------------------- |
+| **Agent Name** | QA+Search |
+| **Purpose** | Answer questions by searching the web using Tavily API... |
+| **Core Capabilities** | Web search integration, information synthesis... |
+
+**Agent Architecture Diagram:**
+(Mermaid diagram showing User Query → Agent → WebSearchTool → Tavily API flow)
+
+## 3. Evaluation Metrics
+### Answer Quality Score
+- **Evaluation Area:** Final response quality
+- **Method:** LLM-as-Judge (using OutputEvaluator with custom rubric)
+- **Scoring Scale:** 0.0 to 1.0
+- **Pass Threshold:** 0.75 or higher
+
+## 4. Test Data Generation
+- **Simple Factual Questions**: Questions requiring basic web search...
+- **Multi-Step Reasoning Questions**: Questions requiring synthesis...
+
+## 5. Evaluation Implementation Design
+### 5.1 Evaluation Code Structure
+./ # Repository root directory
+├── requirements.txt # Consolidated dependencies
+└── eval/ # Evaluation workspace
+├── README.md # Running instructions
+├── run_evaluation.py # Strands Evals SDK implementation
+└── results/ # Evaluation outputs
+
+## 6. Progress Tracking
+### 6.1 User Requirements Log
+| **Timestamp** | **Source** | **Requirement** |
+| :------------ | :--------- | :-------------- |
+| 2025-12-01 | eval sop | Generate evaluation plan... |
+```

### Generated Test Cases
Test cases are generated in JSONL format with structured metadata:
@@ -288,58 +290,60 @@

The evaluation report provides comprehensive analysis with actionable insights:

-# Agent Evaluation Report for QA+Search Agent
-
-## Executive Summary
-- **Test Scale**: 2 test cases
-- **Success Rate**: 100%
-- **Overall Score**: 1.000 (Perfect)
-- **Status**: Excellent
-- **Action Priority**: Continue monitoring; consider expanding test coverage...
-
-## Evaluation Results
-### Test Case Coverage
-- **Simple Factual Questions (Geography)**: Questions requiring basic factual information...
-- **Simple Factual Questions (Sports/Time-sensitive)**: Questions requiring current event information...
-
-### Results
-| **Metric** | **Score** | **Target** | **Status** |
-| :---------------------- | :-------- | :--------- | :--------- |
-| Answer Quality Score | 1.00 | 0.75+ | Pass ✅ |
-| Overall Test Pass Rate | 100% | 75%+ | Pass ✅ |
-
-## Agent Success Analysis
-### Strengths
-- **Perfect Accuracy**: The agent correctly answered 100% of test questions...
-- **Evidence**: Both test cases scored 1.0/1.0 (perfect scores)
-- **Contributing Factors**: Effective use of web search tool...
-
-## Agent Failure Analysis
-### No Failures Detected
-The evaluation identified zero failures across all test cases...
-
-## Action Items & Recommendations
-### Expand Test Coverage - Priority 1 (Enhancement)
-- **Description**: Increase the number and diversity of test cases...
-- **Actions**:
-- [ ] Add 5-10 additional test cases covering edge cases
-- [ ] Include multi-step reasoning scenarios
-- [ ] Add test cases for error conditions
-
-## Artifacts & Reproduction
-### Reference Materials
-- **Agent Code**: `qa_agent/qa_agent.py`
-- **Test Cases**: `eval/test-cases.jsonl`
-- **Results**: `eval/results/.../evaluation_report.json`
-
-### Reproduction Steps
-source .venv/bin/activate
-python eval/run_evaluation.py
-
-## Evaluation Limitations and Improvement
-### Test Data Improvement
-- **Current Limitations**: Only 2 test cases, limited scenario diversity...
-- **Recommended Improvements**: Increase test case count to 10-20 cases...
+```markdown
+# Agent Evaluation Report for QA+Search Agent
+
+## Executive Summary
+- **Test Scale**: 2 test cases
+- **Success Rate**: 100%
+- **Overall Score**: 1.000 (Perfect)
+- **Status**: Excellent
+- **Action Priority**: Continue monitoring; consider expanding test coverage...
+
+## Evaluation Results
+### Test Case Coverage
+- **Simple Factual Questions (Geography)**: Questions requiring basic factual information...
+- **Simple Factual Questions (Sports/Time-sensitive)**: Questions requiring current event information...
+
+### Results
+| **Metric** | **Score** | **Target** | **Status** |
+| :---------------------- | :-------- | :--------- | :--------- |
+| Answer Quality Score | 1.00 | 0.75+ | Pass ✅ |
+| Overall Test Pass Rate | 100% | 75%+ | Pass ✅ |
+
+## Agent Success Analysis
+### Strengths
+- **Perfect Accuracy**: The agent correctly answered 100% of test questions...
+- **Evidence**: Both test cases scored 1.0/1.0 (perfect scores)
+- **Contributing Factors**: Effective use of web search tool...
+
+## Agent Failure Analysis
+### No Failures Detected
+The evaluation identified zero failures across all test cases...
+
+## Action Items & Recommendations
+### Expand Test Coverage - Priority 1 (Enhancement)
+- **Description**: Increase the number and diversity of test cases...
+- **Actions**:
+- [ ] Add 5-10 additional test cases covering edge cases
+- [ ] Include multi-step reasoning scenarios
+- [ ] Add test cases for error conditions
+
+## Artifacts & Reproduction
+### Reference Materials
+- **Agent Code**: `qa_agent/qa_agent.py`
+- **Test Cases**: `eval/test-cases.jsonl`
+- **Results**: `eval/results/.../evaluation_report.json`
+
+### Reproduction Steps
+source .venv/bin/activate
+python eval/run_evaluation.py
+
+## Evaluation Limitations and Improvement
+### Test Data Improvement
+- **Current Limitations**: Only 2 test cases, limited scenario diversity...
+- **Recommended Improvements**: Increase test case count to 10-20 cases...
+```

## Best Practices

@@ -313,5 +313,5 @@ def compare_agent_versions(cases: list, agents: dict) -> dict:
## Related Documentation

- [Quickstart Guide](../quickstart.md): Get started with Strands Evals
-- [Simulators Overview](../simulators/overview.md): Learn about simulators
+- [Simulators Overview](../simulators/index.md): Learn about simulators
- [Experiment Generator](../experiment_generator.md): Generate test cases automatically
2 changes: 1 addition & 1 deletion docs/user-guide/evals-sdk/simulators/user_simulation.md
@@ -662,7 +662,7 @@ while user_sim.has_next():

## Related Documentation

-- [Simulators Overview](overview.md): Learn about the ActorSimulator and simulator framework
+- [Simulators Overview](index.md): Learn about the ActorSimulator and simulator framework
- [Quickstart Guide](../quickstart.md): Get started with Strands Evals
- [Helpfulness Evaluator](../evaluators/helpfulness_evaluator.md): Evaluate conversation helpfulness
- [Goal Success Rate Evaluator](../evaluators/goal_success_rate_evaluator.md): Assess goal completion
11 changes: 6 additions & 5 deletions mkdocs.yml
@@ -34,6 +34,7 @@ theme:
- content.code.copy
- content.tabs.link
- content.code.select
+- navigation.indexes
- navigation.instant
- navigation.instant.prefetch
- navigation.instant.progress
@@ -84,7 +85,7 @@ nav:
- User Guide:
- Welcome: README.md
- Quickstart:
-- Overview: user-guide/quickstart/index.md
+- Getting Started: user-guide/quickstart/overview.md
- Python: user-guide/quickstart/python.md
- TypeScript: user-guide/quickstart/typescript.md
- Concepts:
@@ -97,7 +98,7 @@
- Structured Output: user-guide/concepts/agents/structured-output.md
- Conversation Management: user-guide/concepts/agents/conversation-management.md
- Tools:
-- Overview: user-guide/concepts/tools/tools_overview.md
+- Overview: user-guide/concepts/tools/index.md
- Creating Custom Tools: user-guide/concepts/tools/custom-tools.md
- Model Context Protocol (MCP): user-guide/concepts/tools/mcp-tools.md
- Executors: user-guide/concepts/tools/executors.md
@@ -121,7 +122,7 @@
- CLOVA Studio<sup> community</sup>: user-guide/concepts/model-providers/clova-studio.md
- FireworksAI<sup> community</sup>: user-guide/concepts/model-providers/fireworksai.md
- Streaming:
-- Overview: user-guide/concepts/streaming/quickstart.md
+- Overview: user-guide/concepts/streaming/index.md
- Async Iterators: user-guide/concepts/streaming/async-iterators.md
- Callback Handlers: user-guide/concepts/streaming/callback-handlers.md
- Multi-agent:
@@ -163,7 +164,7 @@
- Getting Started: user-guide/evals-sdk/quickstart.md
- Eval SOP: user-guide/evals-sdk/eval-sop.md
- Evaluators:
-- Overview: user-guide/evals-sdk/evaluators/overview.md
+- Overview: user-guide/evals-sdk/evaluators/index.md
- Output: user-guide/evals-sdk/evaluators/output_evaluator.md
- Trajectory: user-guide/evals-sdk/evaluators/trajectory_evaluator.md
- Interactions: user-guide/evals-sdk/evaluators/interactions_evaluator.md
@@ -175,7 +176,7 @@
- Custom: user-guide/evals-sdk/evaluators/custom_evaluator.md
- Experiment Generator: user-guide/evals-sdk/experiment_generator.md
- Simulators:
-- Overview: user-guide/evals-sdk/simulators/overview.md
+- Overview: user-guide/evals-sdk/simulators/index.md
- User Simulation: user-guide/evals-sdk/simulators/user_simulation.md
- How-To Guides:
- Experiment Management: user-guide/evals-sdk/how-to/experiment_management.md