[FEATURE] Trainable Strands Agents with Continuous Learning #923

@chenwuperth

Description

Problem Statement

Strands agents need the ability to learn from their workflow execution and operational experience. To do so, the Strands SDK needs to address two problems:

  • Trajectory Data Utilization: Collect agent execution traces into training datasets for continuous model improvement (a sketch of such a record follows after this list).
  • Trajectory-based Training: Fine-tune models using real agent execution data to build domain-specific agents that outperform generic models on customer tasks while reducing costs.
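
For illustration, a captured trajectory record might look like the sketch below; all class and field names here are hypothetical and not part of the current Strands SDK or rLLM APIs.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrajectoryStep:
    # One model turn: input, any tool invocations, and the model's output.
    prompt: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    tool_results: list[Any] = field(default_factory=list)
    response: str = ""

@dataclass
class Trajectory:
    # Full episode plus the outcome signal used as the training reward.
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_reward: float = 0.0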

This enhancement extends the current model-driven framework toward a learning-driven framework for continuous agent improvement.

Proposed Solution

Add training capabilities to the Strands Agent SDK to enable continuous learning through model fine-tuning on captured agent trajectories. This can be done by integrating open-source frameworks such as rLLM and veRL.

Example API usage

from rllm.agents import StrandsAgent
from rllm.environments.tools.strands_env import StrandsEnv
from rllm.rewards.reward_fn import math_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

from strands_tools import calculator

agent_args = {"tools": [calculator],
              "system_prompt": "You are a helpful assistant."}

# training_config, dataset, and validation_dataset are assumed to be defined
# elsewhere (training hyperparameters and the task datasets).
trainer = AgentTrainer(
    agent_class=StrandsAgent,
    env_class=StrandsEnv,
    agent_args=agent_args,
    env_args={"reward_fn": math_reward_fn},
    config=training_config,
    train_dataset=dataset,
    val_dataset=validation_dataset,
)

trainer.train()

Implementation Components

  1. Trajectory Capture System: Automatically record agent interactions, tool calls, and outcomes
  2. Training Environment: StrandsEnv wrapper that interfaces with the existing agent execution engine and the Strands SDK
  3. Training Pipeline: Integration with proven RL/SFT frameworks
  4. Reward Function Framework: Configurable reward systems for different domains (see the sketch after this list)
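
As a rough illustration of component 4, a reward function can be a plain callable that scores an agent's final answer against the task's expected outcome. The (task, final_answer) signature below is an assumption for this sketch, not a documented rLLM interface.

# Hypothetical binary-outcome reward for a math-style task (names are illustrative).
def math_correctness_reward(task: dict, final_answer: str) -> float:
    expected = str(task["answer"]).strip()
    # Exact-match reward: 1.0 for a correct final answer, 0.0 otherwise.
    return 1.0 if final_answer.strip() == expected else 0.0

Domain-specific rewards (e.g., code execution success or retrieval accuracy) could then be supplied through env_args without changing the training loop.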

Use Case

Primary Use Cases

  1. Performance Improvement through Experience
    • Learn which tool calls and parameter values work best in different situations
    • Improve the sequence/order of actions in workflows from final reward signals (success/failure)
    • Get better at parsing and responding to specific types of user requests
  2. Cost Optimization through Specialized Models
    • Train smaller, domain-specific models (e.g., 4B parameters) that can potentially match the performance of large API models (e.g., 120B+ parameters) on specific tasks, given sufficient trajectory and interaction data
    • Replace remote API calls with faster, lower-cost local inference on the same network
    • Reduce token usage through more efficient reasoning patterns learned from training data
  3. Operational Independence
    • Highly optimized, locally trained models can eliminate the rate-limiting constraints that throttle high-volume (multi-)agent deployments
    • Business Continuity and Consistency: Avoid workflow disruptions or quality variations when external API providers update models, deprecate versions, or change APIs that break existing integrations
  4. Domain Specialization: Train agents for specific business contexts and agentic workflows
    • Train agents on company-specific workflows, terminology, and success patterns
    • Adapt reasoning approaches to specific industry contexts (e.g., financial analysis vs. code generation)
    • Create specialized tool usage patterns based on actual operational requirements and constraints

Alternative Solutions

Manual Prompt and Context Engineering:

  • Pros: Current widely used approach, fine-grained control, no additional SDK complexity, immediate implementation
  • Cons: Labor-intensive, requires manual tuning for each use case, static rules do not always adapt, no learning from real interaction data, context strategies do not improve based on actual performance outcomes

Additional Context

Technical Considerations

  • Reward Design Complexity: Not all problems have easily quantifiable rewards
  • Scope Limitations: Primarily beneficial for local model fine-tuning scenarios
  • Infrastructure Requirements: Requires compute resources for training (SageMaker AI integration, etc.)
