[FEATURE] Trainable Strands Agents with Continuous Learning #923

@chenwuperth

Description

Problem Statement

Strands agents need the ability to learn from their workflow execution and operational experience. To do so, the Strands SDK needs to address two problems:

  • Trajectory Data Utilization: Collect agent execution traces into training datasets for continuous model improvement (a sketch of such a record follows after this list).
  • Trajectory-based Training: Fine-tune models using real agent execution data to build domain-specific agents that outperform generic models on customer tasks while reducing costs.
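
For illustration, a captured trajectory record might look like the sketch below; all class and field names here are hypothetical and not part of the current Strands SDK or rLLM APIs.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrajectoryStep:
    # One model turn: input, any tool invocations, and the model's output.
    prompt: str
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    tool_results: list[Any] = field(default_factory=list)
    response: str = ""

@dataclass
class Trajectory:
    # Full episode plus the outcome signal used as the training reward.
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_reward: float = 0.0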

This enhancement extends the current model-driven framework toward a learning-driven framework for continuous agent improvement.

Proposed Solution

Add training capabilities to the Strands Agent SDK to enable continuous learning through model fine-tuning on captured agent trajectories. This can be done by integrating open-source frameworks such as rLLM and veRL.

Example API usage

from rllm.agents import StrandsAgent
from rllm.environments.tools.strands_env import StrandsEnv
from rllm.rewards.reward_fn import math_reward_fn
from rllm.trainer.agent_trainer import AgentTrainer

from strands_tools import calculator

agent_args = {"tools": [calculator],
              "system_prompt": "You are a helpful assistant."}

# training_config, dataset, and validation_dataset are assumed to be defined
# elsewhere (training hyperparameters and the task datasets).
trainer = AgentTrainer(
    agent_class=StrandsAgent,
    env_class=StrandsEnv,
    agent_args=agent_args,
    env_args={"reward_fn": math_reward_fn},
    config=training_config,
    train_dataset=dataset,
    val_dataset=validation_dataset,
)

trainer.train()

Implementation Components

  1. Trajectory Capture System: Automatically record agent interactions, tool calls, and outcomes
  2. Training Environment: StrandsEnv wrapper that interfaces with the existing agent execution engine and the Strands SDK
  3. Training Pipeline: Integration with proven RL/SFT frameworks
  4. Reward Function Framework: Configurable reward systems for different domains (see the sketch after this list)
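
As a rough illustration of component 4, a reward function can be a plain callable that scores an agent's final answer against the task's expected outcome. The (task, final_answer) signature below is an assumption for this sketch, not a documented rLLM interface.

# Hypothetical binary-outcome reward for a math-style task (names are illustrative).
def math_correctness_reward(task: dict, final_answer: str) -> float:
    expected = str(task["answer"]).strip()
    # Exact-match reward: 1.0 for a correct final answer, 0.0 otherwise.
    return 1.0 if final_answer.strip() == expected else 0.0

Domain-specific rewards (e.g., code execution success or retrieval accuracy) could then be supplied through env_args without changing the training loop.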

Use Case

Primary Use Cases

  1. Performance Improvement through Experience
    • Learn which tool calls and parameter values work best in different situations
    • Improve the sequence/order of actions in workflows from final reward signals (success/failure)
    • Get better at parsing and responding to specific types of user requests
  2. Cost Optimization through Specialized Models
    • Train smaller, domain-specific models (e.g., 4B parameters) that can potentially match the performance of large API models (e.g., 120B+ parameters) on specific tasks, given sufficient trajectory and interaction data
    • Replace remote API calls with faster, lower-cost local inference on the same network
    • Reduce token usage through more efficient reasoning patterns learned from training data
  3. Operational Independence
    • Highly optimized, locally trained models can eliminate the rate-limiting constraints that throttle high-volume (multi-)agent deployments
    • Business Continuity and Consistency: Avoid workflow disruptions or quality variations when external API providers update models, deprecate versions, or change APIs that break existing integrations
  4. Domain Specialization: Train agents for specific business contexts and agentic workflows
    • Train agents on company-specific workflows, terminology, and success patterns
    • Adapt reasoning approaches to specific industry contexts (e.g., financial analysis vs. code generation)
    • Create specialized tool usage patterns based on actual operational requirements and constraints

Alternative Solutions

Manual Prompt and Context Engineering:

  • Pros: Current widely used approach, fine-grained control, no additional SDK complexity, immediate implementation
  • Cons: Labor-intensive, requires manual tuning for each use case, static rules do not always adapt, no learning from real interaction data, context strategies do not improve based on actual performance outcomes

Additional Context

Technical Considerations

  • Reward Design Complexity: Not all problems have easily quantifiable rewards
  • Scope Limitations: Primarily beneficial for local model fine-tuning scenarios
  • Infrastructure Requirements: Requires compute resources for training (SageMaker AI integration, etc.)
