# 11. (Appendix) Designing & Deploying Scalable Agentic Systems

## TL;DR (for practitioners)

If you’re in a hurry, here’s the high-level recipe:

1. **Define the agentic workflow** as roles (planner, tools executor, critic, router) and tools (APIs, databases, search, etc.).
2. **Use Amazon Bedrock** for:

   * Foundation models and **Agents for Bedrock** (single or multi-agent orchestrations). ([AWS Documentation][1])
   * Knowledge bases, guardrails, and (increasingly) **AgentCore** for runtime, memory, and observability.
3. **Use SageMaker** for classic MLOps:

   * Train/domain-adapt your models, embedding models, ranking models, safety filters.
   * Register them in **SageMaker Model Registry**, promote via CI/CD. ([AWS Documentation][2])
4. **Orchestrate the system** via:

   * **Agents for Bedrock / AgentCore** (built-in agent orchestration), plus
   * **AWS Step Functions** and **Lambda / ECS / EKS** for workflows around the agents. ([AWS Documentation][3])
5. **Persist memory and knowledge** using S3, DynamoDB, OpenSearch, and Bedrock Knowledge Bases.
6. **Instrument heavily** with **CloudWatch metrics, logs, and traces**, plus AgentCore / CloudWatch Application Signals for agent-level observability. ([Amazon Web Services, Inc.][4])
7. **Scale safely** using:

   * Horizontal scaling of stateless components,
   * Caching and model-routing (small vs large models),
   * Guardrails, policy enforcement, and staged rollouts (canaries / blue–green).

The rest of this markdown walks through all of this in depth, but still at a **no-code / architecture & MLOps** level.


## What Is an “Agentic System” (in Cloud Terms)?

In production, an **agentic system** is not just “an LLM with tools”. It’s a distributed application where:

* **Agents** = LLM-driven components with a *role* (e.g., planner, data retriever, reasoner, code executor, critic).
* **Tools** = APIs and services the agent can call (databases, SaaS APIs, internal microservices).
* **Memory** = Long-term and short-term context (user profile, task history, RAG documents, intermediate steps).
* **Orchestration** = Control flow across agents and tools: planning, retries, timeouts, routing, escalation.
* **Guardrails & Governance** = Policies for safety, cost, compliance, and auditability.

On AWS, this maps roughly to:

* **Amazon Bedrock** for foundation models, agents, knowledge bases, guardrails. ([Amazon Web Services, Inc.][5])
* **Agentic runtime & orchestration** via:

  * **Agents for Bedrock** and **Bedrock AgentCore** for agent workflows, tooling, memory, and observability. ([Amazon Web Services, Inc.][6])
  * **AWS Step Functions** and/or open-source frameworks (LangGraph, CrewAI, pydantic-ai, etc.) for higher-level workflows. ([AWS Documentation][3])
* **SageMaker** for training, evaluation, and lifecycle of supporting models (embeddings, rerankers, safety classifiers, etc.).
* **Cloud-native infra** for tools: Lambda, ECS/EKS, SQS, EventBridge, DynamoDB, S3, OpenSearch.


## Non-Functional Requirements for Agentic Systems

Before picking services, lock down your **NFRs**:

* **Latency:** Chat-like apps want sub-2s “first token” latency; back-office agents can tolerate more.
* **Throughput & concurrency:** How many parallel sessions? Spikes? Global vs regional traffic?
* **Reliability & fault-tolerance:** What if a tool is down? A model throttles? A region fails?
* **Cost constraints:** Per-session / per-user budget, cost allocation per team/product.
* **Safety & compliance:** PII, PCI, HIPAA/GDPR, data residency.
* **Governance:** Who can change prompts, tools, models? How are changes reviewed and rolled out?
* **Auditability:** Can you reconstruct what an agent did and why?

Everything else follows from this.


## A Reference Architecture for Agentic AI on AWS

Think in layers:

### Experience & API Layer

* **Channels**: Web/mobile app, internal console, Slack/Teams bot.
* **Ingress**:

  * **Amazon API Gateway / Amazon CloudFront** for public-facing APIs.
  * Auth via Cognito / SSO / IAM roles.

### Orchestration & Agents Layer

* **Amazon Bedrock Agents / AgentCore** for:

  * Defining agents with tools, knowledge bases, and guardrails.
  * Multi-agent collaboration (planner agent + domain experts). ([AWS Documentation][3])
* **AWS Step Functions** to:

  * Wrap the agent calls in durable workflows (start → plan → act → verify → respond).
  * Coordinate multiple services (logging, billing, notifications). ([Amazon Web Services, Inc.][7])
* Optional:

  * **Open-source frameworks** (LangGraph, CrewAI, pydantic-ai) hosted on ECS/EKS or Lambda, integrated with Bedrock models and AgentCore.

### Tools & Microservices Layer

* **Business tools**:

  * REST/GraphQL APIs on **Lambda, ECS, or EKS**.
  * Database-backed services on **RDS**, **DynamoDB**, etc.
* **Observability tools**:

  * CloudWatch Logs and Metrics.
  * Incident management systems, ticketing APIs (Jira, ServiceNow).

All tools must be:

* **Stateless**, with externalized state in DBs or queues,
* **Idempotent** or safely retryable,
* **Strongly authenticated/authorized** (IAM, VPC, private link).

### Memory, Knowledge, and State

* **Long-term memory & knowledge**:

  * **Amazon S3** as the data lake.
  * **Bedrock Knowledge Bases** for RAG over documents.
  * **OpenSearch** or vector DB (self-managed or via partner) for low-latency semantic search.
* **Short-term / session memory**:

  * **Amazon DynamoDB** or **ElastiCache (Redis)** for session state, conversation history pointers.
  * **AgentCore Memory** for managed agent memory without custom infra. ([Amazon Web Services, Inc.][6])

### Model & Data Science Layer (MLOps)

* **SageMaker**:

  * Data processing pipelines,
  * Model training (embeddings, ranking, safety filters),
  * Evaluation jobs,
  * **Model Registry** for versioning & approvals. ([AWS Documentation][2])
* **Integration with Bedrock**:

  * Use Bedrock models for core LLM tasks; SageMaker models as tools (e.g., classifier endpoints).


## Designing the Agentic Workflow

### Decompose the Use Case into Agent Roles

For any use case (e.g., “cloud ops copilot”, “biopharma business expert”), identify roles:

* **Router / Intent classifier**: Which agent or workflow should handle this?
* **Planner**: Breaks the task into steps and selects tools/agents.
* **Domain agents**: Specialized in support, billing, infra, legal, etc.
* **Critic / verifier**: Checks outputs (hallucinations, safety, consistency).
* **Summarizer / presenter**: Formats responses for end-users or APIs.

In Bedrock, each of these can be an **agent definition**, or you can use **multi-agent collaboration** within a single Bedrock Agents setup. ([Amazon Web Services, Inc.][8])

### Tool Design Principles

Tools define how agents act on the world. Design them as:

* **Clear, narrow APIs**: e.g., `get_ticket_status`, `scale_service`, `list_failed_deployments`.
* **Typed and validated** inputs/outputs, documented in OpenAPI where possible.
* **Secure by default**:

  * Use **AgentCore Identity** or scoped IAM roles for tools that access AWS or third-party services. ([Amazon Web Services, Inc.][6])
  * Enforce *least privilege* and explicit allow-lists.
* **Observable**: Log each call with correlation IDs and agent metadata.

### Memory & Knowledge Strategy

Avoid a single “magic memory” bucket.

* **Episodic memory**: Per-conversation or per-session context, stored in DynamoDB/Redis, with TTLs.
* **Semantic memory (RAG)**:

  * Knowledge Bases in Bedrock backed by S3/OpenSearch.
  * Clear separation between public knowledge, team knowledge, and per-tenant private knowledge.
* **Working memory**:

  * Intermediate steps, tool results, scratchpads; often ephemeral but logged for debugging.

### Safety, Guardrails & Policy Enforcement

Use **multiple layers**:

* **Bedrock Guardrails**:

  * Content filters, topic restrictions, safety settings. ([Amazon Web Services, Inc.][5])
* **Custom safety models**:

  * Trained on SageMaker (toxicity detectors, PII detectors, policy compliance).
* **Policy-aware tools**:

  * Tools themselves should validate that requested action is allowed for given user/role.
* **Audit logging**:

  * Every action and tool call traceable to user, agent, and policy decision.


## MLOps Lifecycle for Agentic AI

Agentic systems are **multi-model, multi-prompt, multi-tool**. Your MLOps must manage **all three**: models, prompts, and tools.

### Data & Feature Pipelines

* **Raw data** in S3 (logs, chat transcripts, tool responses, business metrics).
* **Feature stores & embeddings**:

  * Embedding pipelines in SageMaker for documents and user profiles.
  * Periodic jobs to refresh indexes in knowledge bases or vector stores.
* **Labelling & feedback loops**:

  * Human feedback on agent interactions (was this helpful/safe?).
  * Weak labels from monitoring (tool mismatch, errors, escalations).

### Model Training & Evaluation (SageMaker)

Typical supporting models:

* Embedding models for RAG.
* Rerankers or recommendation models.
* Safety filters (toxicity, PII).
* Routing models (which agent/tool/model to use).

Use **SageMaker Pipelines** to:

* Automate data preparation → training → evaluation → registration.
* Store trained models and metrics in **Model Registry**, with stage tags like *staging* and *production*. ([AWS Documentation][2])

### Prompt & Agent Versioning

Treat prompts and agent configs like code:

* Version-control prompts, tool schemas, and agent graphs in Git.
* Use environment-specific configs (dev / staging / prod).
* Run automated tests:

  * Regression tests on curated prompt suites.
  * “Safety tests” against red-team datasets.

### CI/CD for Agentic Systems

Typical flow:

1. Developer changes prompts/agents/tools in Git.
2. CI:

   * Lints prompts/configs.
   * Runs synthetic tests using Bedrock in a dev environment.
3. CD:

   * Deploys updated Bedrock agents or AgentCore configurations to staging.
   * Runs shadow traffic or A/B tests.
   * On approval, promotes to production.

AWS tools: CodeCommit/CodeBuild/CodePipeline or GitHub Actions + AWS CDK/CloudFormation.


## Example End-to-End System: Cloud Operations Copilot

Let’s ground this in a concrete (but code-free) example, inspired by AWS scenarios where agents triage **CloudWatch Logs** and mitigate incidents. ([Amazon Web Services, Inc.][9])

### Roles & Agents

* **Triage Agent**:

  * Reads error summaries from CloudWatch Logs Insights.
  * Classifies severity and suggests likely root cause.
* **Remediation Agent**:

  * Proposes runbooks or direct actions (restart service, roll back deployment).
  * Interfaces with Systems Manager Automation / Change Manager.
* **Communicator Agent**:

  * Drafts human-readable updates for Slack/Teams, incident tickets.

These can be **multi-agent collaborators inside Bedrock** (using Bedrock Agents multi-agent features), or orchestrated externally via Step Functions and AgentCore. ([Amazon Web Services, Inc.][8])

### Tools

* **Log analysis tool**:

  * Calls CloudWatch Logs Insights to query error patterns.
* **Deployment tool**:

  * Calls CodeDeploy / ECS APIs to roll back tasks.
* **Runbook tool**:

  * Looks up remediation steps stored in S3 or a knowledge base.
* **Notification tool**:

  * Sends updates via SNS/Slack webhook.

### Orchestration

* **Step Functions** orchestrates:

  1. Trigger from CloudWatch alarm or an operator.
  2. Invoke Triage Agent (Bedrock).
  3. Parallel branch:

     * Run more detailed diagnostics.
     * Notify on-call engineer.
  4. If low/medium severity, let Remediation Agent propose and possibly execute an automated runbook.
  5. Use Communicator Agent to update incident tickets and channels.

* All of this is instrumented with **CloudWatch metrics** and **traces**, with **AgentCore Observability** and CloudWatch Application Signals capturing agent steps and tool calls. ([Amazon Web Services, Inc.][10])


## Scaling Strategies for Agentic Workloads

### Concurrency, Throttling & Backpressure

* Understand **Bedrock limits** for model calls and configure:

  * Rate limits in API Gateway.
  * Concurrency limits in Lambda or ECS services.
* Use **queues** (SQS, EventBridge) for non-interactive work to smooth spikes.
* For interactive chat:

  * Use streaming responses from Bedrock to deliver tokens early while longer tools run in the background.

### Caching

* **Prompt-level caching**:

  * Cache “expensive” results (e.g., summarizing a static document) in DynamoDB or Redis.
* **Embedding caching**:

  * Avoid re-computing embeddings for unchanged documents or frequent queries.
* **Tool result caching**:

  * Cache stable API responses (configuration, catalogs).

### Model & Agent Routing for Cost/Latency

* **Tiered models** in Bedrock (small, medium, large; different vendors). ([Amazon Web Services, Inc.][5])
* Strategies:

  * Use a **smaller / cheaper model** for lightweight tasks, and escalate to larger models only when needed.
  * Use a **router model** or heuristic to choose which model or agent is appropriate.
* For multi-agent setups:

  * Avoid “agent explosion”: have a router that limits which agents get invoked per request.

### Horizontal Scaling of Tools and Orchestrators

* Design **stateless** orchestrator components:

  * Lambdas can scale quickly for spiky workloads.
  * ECS/EKS services for more predictable high-volume flows.
* Use **auto scaling policies** based on:

  * Queue depth,
  * Error rates,
  * Latency percentiles (p95, p99).

### Multi-Region and Multi-Account

* For global users:

  * Deploy agents in multiple regions close to users.
  * Use **Route 53** or CloudFront for routing.
* For large organizations:

  * Multi-account structure with central governance:

    * Central **Bedrock / CloudWatch observability** account,
    * Application accounts hosting tools and workloads. ([Medium][11])


## Observability, Monitoring, and Evaluation

Agentic systems are *complex*. You need **deep observability**:

### Telemetry Pillars

* **Metrics**:

  * Latency per step (LLM calls, tools).
  * Success/failure rates, retries.
  * Cost metrics (tokens, Bedrock usage).
* **Logs**:

  * Structured logs with correlation IDs.
  * Redacted inputs/outputs where necessary.
* **Traces**:

  * End-to-end traces from user request → agents → tools → response.

**AWS CloudWatch** and **AgentCore Observability** now provide features specifically for generative AI and agents (Application Signals, Bedrock observability, multi-framework support). ([Amazon Web Services, Inc.][4])

### Agentic-Specific Metrics

Track:

* **Tool utilization**:

  * Frequency and latency of each tool.
  * Error codes and failure rates.
* **Agent behavior**:

  * Number of steps per request.
  * Looping/oscillation detection (too many iterations).
  * Escalation rate (how often agents ask humans for help).
* **Quality & safety**:

  * User satisfaction scores (thumbs up/down).
  * Safety incidents (flagged content, blocked tool calls).
  * Consistency between tool outputs and final responses.

### Continuous Evaluation

* Maintain **golden test sets**:

  * Realistic scenarios with expected outputs or constraints.
* Periodically run:

  * Offline evaluations (accuracy, helpfulness, safety).
  * **Load tests** to ensure capacity and SLO adherence.
* Integrate evaluation with CI:

  * Block deployments if regressions in quality or safety exceed thresholds.


## Governance, Risk, and Compliance

### Permissions & Identity

* Use **IAM roles** with least privilege for:

  * Agents (via AgentCore Identity),
  * Tools,
  * Orchestrators (Lambdas / ECS tasks). ([Amazon Web Services, Inc.][6])
* Separate roles for:

  * Model access,
  * Data access,
  * Write vs read operations.

### Data Governance

* Encrypt data at rest (KMS) and in transit (TLS).
* Tag and isolate sensitive datasets; ensure correct residency.
* Implement **data retention policies**:

  * How long do you store prompts, tool calls, and transcripts?
  * Can users request deletion?

### Change Management & Rollback

* Treat **agents as deployable artifacts**:

  * Use staging and production environments.
  * Implement **canary deployments** (route a small percentage of traffic to new versions).
  * Have a **one-click rollback** path.

### Responsible AI

* Maintain **model cards** and **agent cards** describing:

  * Intended use,
  * Limitations,
  * Safety mitigations.
* Periodically review:

  * Bias assessments (especially if agents make high-stakes decisions).
  * Misuse patterns and new threat models.


## Implementation Checklist

Here’s a pragmatic sequence for building a real-world agentic system on AWS:

1. **Scope & design**

   * Define user journeys, agent roles, tools, success metrics.
2. **Choose AWS components**

   * Bedrock models + Agents/AgentCore for core agent logic.
   * SageMaker for supporting models and MLOps.
   * Step Functions + Lambda/ECS for orchestration & tools.
3. **Prototype**

   * Single-region, limited-traffic POC.
   * Simple logging, manual evaluation.
4. **Hardening**

   * Add guardrails and safety filters.
   * Introduce proper observability and tracing.
   * Build CI/CD and basic canary deployment.
5. **Scale-out**

   * Optimize for latency and cost (routing, caching).
   * Add multi-agent collaboration where beneficial.
   * Implement auto scaling.
6. **Continuous improvement**

   * Close the feedback loop from telemetry & user feedback to training data.
   * Regularly refresh knowledge bases and retrain supporting models.


## Common Pitfalls & Anti-Patterns

* **“Single giant agent”**: One agent doing everything with a massive prompt. Hard to debug, scale, and govern.
* **No explicit tool contracts**: Letting the LLM “invent” tool usage instead of having strict schemas.
* **Unbounded loops**: Agents that keep thinking & calling tools indefinitely—always add step and time limits.
* **Hidden state**: Storing critical context only in prompts, not in explicit memory/state stores.
* **No observability**: Debugging via ad hoc logs instead of structured metrics, logs, and traces.
* **Prompt sprawl**: Unversioned prompts directly edited in consoles; use Git and environments instead.


## Where SageMaker, Bedrock, and AgentCore Each Fit

To summarize their roles in an **agentic MLOps** stack:

* **Amazon Bedrock**

  * Foundation models (Claude, etc.).
  * Knowledge bases and guardrails.
  * **Agents for Bedrock** for integrated tool use and multi-agent workflows. ([AWS Documentation][1])

* **Amazon Bedrock AgentCore** (emerging platform)

  * Purpose-built runtime for agents.
  * Managed memory, identity, gateway, code interpreter, browser tool.
  * Deep observability for agent steps and interactions. ([Amazon Web Services, Inc.][6])

* **Amazon SageMaker**

  * Classic MLOps backbone:

    * Training pipelines,
    * Model Registry and approvals,
    * Batch/online endpoints for supporting models.
  * Ideal for domain-specific ML around your agents.

Together, they allow you to build **production-grade, scalable, observable, and governable agentic systems** that go far beyond “just calling an LLM”.

[1]: https://docs.aws.amazon.com/bedrock/latest/userguide/agents.html?utm_source=chatgpt.com "Automate tasks in your application using AI agents"
[2]: https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html?utm_source=chatgpt.com "Model Registration Deployment with Model Registry"
[3]: https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/multi-agent-collaboration.html?utm_source=chatgpt.com "Multi-agent collaboration - AWS Prescriptive Guidance"
[4]: https://aws.amazon.com/blogs/mt/observing-agentic-ai-workloads-using-amazon-cloudwatch/?utm_source=chatgpt.com "Observing Agentic AI workloads using ..."
[5]: https://aws.amazon.com/bedrock/?utm_source=chatgpt.com "Amazon Bedrock - Generative AI"
[6]: https://aws.amazon.com/bedrock/agentcore/?utm_source=chatgpt.com "Amazon Bedrock AgentCore"
[7]: https://aws.amazon.com/blogs/machine-learning/orchestrate-generative-ai-workflows-with-amazon-bedrock-and-aws-step-functions/?utm_source=chatgpt.com "Orchestrate generative AI workflows with Amazon Bedrock ..."
[8]: https://aws.amazon.com/blogs/machine-learning/build-an-intelligent-multi-agent-business-expert-using-amazon-bedrock/?utm_source=chatgpt.com "Build an intelligent multi-agent business expert using ..."
[9]: https://aws.amazon.com/blogs/mt/enable-cloud-operations-workflows-with-generative-ai-using-agents-for-amazon-bedrock-and-amazon-cloudwatch-logs/?utm_source=chatgpt.com "Enable cloud operations workflows with generative AI ..."
[10]: https://aws.amazon.com/blogs/machine-learning/build-trustworthy-ai-agents-with-amazon-bedrock-agentcore-observability/?utm_source=chatgpt.com "Build trustworthy AI agents with Amazon Bedrock ..."
[11]: https://medium.com/aws-in-plain-english/cloudwatchs-new-era-centralized-monitoring-for-ai-agents-and-multi-account-workloads-3b6bcb2c4656?utm_source=chatgpt.com "CloudWatch's New Era: Centralized Monitoring for AI ..."
