diff --git a/blog/ai-agent-unit-economics-cost-per-conversation-per-user-margin.md b/blog/ai-agent-unit-economics-cost-per-conversation-per-user-margin.md
new file mode 100644
index 0000000..ae4638a
--- /dev/null
+++ b/blog/ai-agent-unit-economics-cost-per-conversation-per-user-margin.md
@@ -0,0 +1,136 @@
+---
+title: "AI Agent Unit Economics: Cost Per Conversation, Cost Per User, and Margin Analysis"
+date: 2026-03-23
+author: Cycles Team
+tags: [unit-economics, costs, enterprise, margins, best-practices]
+description: "Model AI agent costs as business metrics — cost per conversation, cost per user, margin analysis — and use budget enforcement to bound variance."
+blog: true
+sidebar: false
+---

# AI Agent Unit Economics: Cost Per Conversation, Cost Per User, and Margin Analysis

A B2B SaaS company adds an AI copilot to its customer support product. It prices the feature at $15/user/month and estimates $3/user/month in LLM costs based on its pilot: 20 conversations per user per month, 6 turns per conversation, GPT-4o at ~$0.15 per conversation. Gross margin target: 80%.

Month one in production with 200 users: average cost per user is $4.20. Close enough. Month two: $6.80. Month three: $11.50. The average is not the problem — the distribution is. 70% of users cost under $3/month. 20% cost $8-25/month. 10% cost $40-120/month. One user triggered 340 conversations in a month — automated integration testing against the copilot endpoint. That single user cost $310.

The company's overall gross margin on the AI feature in month three: **23%** — far below its 80% target. The top 10% of users by cost consume 72% of total spend. Worse, three users each cost over $200, each one erasing the margin from more than a dozen light users. The product is profitable for most users and catastrophically unprofitable for a few — and there is no mechanism to distinguish between them at the point of execution.
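The month-three numbers can be reproduced from the distribution alone. A short sketch: the bucket shares come from the scenario above, while the per-bucket costs are assumed midpoints chosen for illustration.

```python
# Illustrative sketch: why the average hides the margin problem.
# Bucket shares come from the scenario; per-bucket costs are assumed midpoints.
price = 15.00  # $/user/month

buckets = [
    (0.70, 2.00),   # 70% of users, ~$2/month (the under-$3 group)
    (0.20, 16.00),  # 20% of users, ~$16/month (the $8-25 band)
    (0.10, 70.00),  # 10% of users, ~$70/month (the $40-120 band)
]

avg_cost = sum(share * cost for share, cost in buckets)
margin = (price - avg_cost) / price

print(f"average cost/user: ${avg_cost:.2f}")  # average cost/user: $11.60
print(f"blended margin: {margin:.0%}")        # blended margin: 23%
```

The average lands near the observed $11.50 and the blended margin near 23%, even though 70% of users are individually very profitable. Pricing against the average misses this entirely.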
+ + + +## From Tokens to Business Metrics + +Most engineering teams track cost at the wrong level of abstraction. They know their [per-token price](/blog/how-much-do-ai-agents-cost). They know their monthly API bill. They cannot answer: "What does it cost us to resolve one support ticket?" or "What is our cost per active user this month?" + +The translation requires four inputs: + +1. **Raw token cost** — per-model pricing from the provider +2. **Calls per unit of work** — how many LLM calls does one conversation, review, or document take? +3. **Units of work per user** — how many conversations, reviews, or documents does one user generate per month? +4. **Variance distribution** — what does the cost spread look like across users? + +The first three give you the average. The fourth determines whether the average is useful. + +| Use case | Unit of work | Avg calls/unit | Avg cost/unit | Median | P90 | P99 | P99/Median | +|---|---|---|---|---|---|---|---| +| Support copilot | Conversation | 9 | $0.21 | $0.08 | $0.45 | $3.80 | 47× | +| Code review agent | Pull request | 22 | $1.85 | $1.20 | $4.50 | $18.00 | 15× | +| Document processor | Document | 4 | $0.12 | $0.09 | $0.30 | $2.10 | 23× | + +The rightmost column — P99/Median — is the variance multiplier. For the support copilot, the most expensive 1% of conversations cost 47× the median. This ratio determines whether average-based pricing works or breaks. + +## Why Variance Destroys Margin Predictions + +If you price at 3× average cost — a standard SaaS margin target — you need the cost distribution to be tight enough that 3× average covers nearly all users. For normal distributions, it does. Agent cost distributions are not normal. They follow a heavy-tail pattern because: + +**Context window growth is superlinear.** Each turn in a conversation sends all previous turns. A 6-turn conversation sends ~21 message payloads total (1+2+3+4+5+6). A 20-turn conversation sends ~210. 
The cost scales with the square of conversation length, not linearly.

**Retries cluster.** A 5% overall failure rate sounds manageable. But failures are not evenly distributed — some conversations hit 50% failure rates because they trigger edge cases in tool execution. Those conversations cost 2-3× more than their content suggests, and the extra cost is invisible in average metrics.

**Tool call depth varies 10-50×.** A "what's my order status?" query makes 2 LLM calls. A "help me debug this integration" query makes 30+. Both are "one conversation" in your metrics.

**User behavior is unpredictable.** Some users send one message per conversation. Others send 40-message threads. Some users open 5 conversations per month. Others open 200. The variance in user behavior compounds the variance in per-conversation cost.

To illustrate, consider a generic AI feature with $4.00 average cost per user:

| Pricing at | Avg cost/user | Price/user | Margin (tight distribution, CV=0.5) | Margin (heavy-tail, CV=3.0) |
|---|---|---|---|---|
| 2× average cost | $4.00 | $8.00 | 65% | -15% |
| 3× average cost | $4.00 | $12.00 | 78% | 22% |
| 5× average cost | $4.00 | $20.00 | 87% | 55% |

CV is the coefficient of variation — standard deviation divided by mean. Tight distributions have CV < 1. Agent cost distributions typically have CV of 2-4. At CV=3.0, even pricing at 3× average only yields 22% margin in this illustration, because a heavy tail means the pilot's $4.00 average understates the realized production mean: a small number of high-cost users eat the profit from everyone else.

## Building a Unit Economics Dashboard

Four metrics every team running AI features should track:

**1. Cost per unit of work.** For a support copilot, this is cost per conversation. For a code review agent, cost per pull request. Track the median, P90, P95, and P99 — not just the average. The average masks the tail.

**2. Cost per active user per month.** Total spend attributed to a user over the month. Break this out by percentile: what does your P50 user cost? Your P90?
Your P99? The gap between P50 and P99 is your variance exposure. + +**3. Variance ratio (P95/Median).** A single number that captures how fat the tail is. If P95/Median < 5, your pricing model can rely on averages. If P95/Median > 10, averages are misleading and you need per-user budget enforcement. + +**4. Margin per user cohort.** Revenue minus cost, grouped by usage tier. This reveals whether your product is profitable for all users or subsidized by light users to cover heavy ones. + +Cycles' `Subject` hierarchy maps directly to these metrics. Each reservation and commit is tagged with a subject — tenant, workflow, agent — so cost attribution is structural, not inferred from API logs after the fact. + +```python +from runcycles import CyclesClient, CyclesConfig, Subject + +client = CyclesClient(CyclesConfig.from_env()) + +# Get balance for a specific user's monthly spend +balance = client.get_balance( + subject=Subject( + tenant="acme", + workflow="support-copilot", + agent=f"user-{user_id}", + ), +) + +cost_usd = balance.committed / 100_000_000 # microcents to dollars +``` + +With per-subject cost attribution, you can compute all four metrics directly: aggregate by conversation ID for cost-per-conversation, by user ID for cost-per-user, and by user cohort for margin analysis. No log parsing, no reconciliation against provider invoices. + +## How Budget Enforcement Bounds Variance + +You cannot control variance at the pricing layer. You must control it at the execution layer. Budget enforcement — a [runtime authority](/blog/ai-agent-budget-control-enforce-hard-spend-limits) that makes a deterministic allow/deny decision before every LLM call — transforms the cost distribution from unbounded heavy-tail to bounded exposure. + +Three enforcement strategies, each mapped to margin impact: + +**Per-conversation cap.** Set a $2.00 hard limit per conversation. 
Conversations that would have cost $0.45 (P90) pass through unaffected, but the $3.80 outliers (P99) are capped. The agent degrades gracefully — shorter responses, cheaper model fallback, or an explicit "I've reached my limit for this conversation, please start a new one" message. The tail is cut. + +**Per-user monthly cap.** Set a $15.00/month ceiling per user — matching the price point. Users who would have cost $80/month are bounded. The feature becomes profitable for every user, by definition. This is the same pattern used in [multi-tenant AI cost control](/blog/multi-tenant-ai-cost-control-per-tenant-budgets-quotas-isolation) for per-tenant isolation. + +**Tiered budgets by plan.** Free users get $2/month in agent budget. Pro users get $20/month. Enterprise gets custom limits. The budget enforcement implements the pricing model directly — the hard limit and the price point are the same number. + +| Strategy | Without cap | $2/conversation cap | $15/user/month cap | +|---|---|---|---| +| Avg cost/user/month | $11.50 | $4.80 | $4.80 | +| P99 cost/user/month | $120.00 | $14.00 | $15.00 | +| Worst-case user | $310.00 | $22.00 | $15.00 | +| Feature gross margin | 23% | 68% | 68% | +| Users hitting cap | 0% | 12% | 5% | + +The $15/user/month cap turns a 23% margin feature into a 68% margin feature — close to the 80% target — with only 5% of users ever hitting the limit. For those users, the agent [degrades gracefully](/how-to/how-to-think-about-degradation-paths-in-cycles-deny-downgrade-disable-or-defer) — it does not hard-stop. It can switch to a cheaper model, reduce response length, or defer non-critical tasks. + +## Cost Per Conversation as a Business KPI + +Token pricing is an engineering metric. Cost per conversation is a business KPI. Three patterns for using it: + +**Chargeback.** Enterprise customers pay for actual AI usage. 
Cycles' per-tenant tracking provides the billing data — every reservation and commit is scoped to a tenant, so cost attribution is automatic. The usage report is the invoice. See [Multi-Tenant AI Cost Control](/blog/multi-tenant-ai-cost-control-per-tenant-budgets-quotas-isolation) for the full chargeback model. + +**Feature-level P&L.** Treat the AI copilot as its own cost center. Track cost per conversation as COGS. Monitor margin weekly. Set alerts when margin drops below threshold. This is [Tier 3 of the cost management maturity model](/blog/ai-agent-cost-management-guide) — alerting on business metrics, not just raw spend. + +**Model routing by economics.** Route simple conversations to GPT-4o-mini ($0.15/1M input tokens) and complex conversations to GPT-4o ($2.50/1M input tokens). The routing decision is economic, not just capability-based. A simple "what's my order status?" query does not need a $2.50/1M-token model. A complex debugging session does. [Routing and enforcement complement each other](/blog/manifest-vs-cycles-routing-vs-runtime-authority) — the router picks the model, the runtime authority bounds the cost. 
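A minimal sketch of that routing decision, using the per-1M-input-token prices quoted above. The complexity heuristic is an illustrative assumption; a real router would use a classifier or the conversation's running cost.

```python
# Sketch: economics-driven model routing. Prices are the per-1M-input-token
# figures quoted above; the complexity heuristic is an illustrative assumption.
PRICES_PER_1M_INPUT = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50}  # $/1M input tokens

def pick_model(query: str, expected_turns: int) -> str:
    # Toy heuristic: multi-turn or debugging-style queries get the expensive
    # model; short lookups get the cheap one.
    complex_markers = ("debug", "integration", "error", "trace")
    is_complex = expected_turns > 3 or any(m in query.lower() for m in complex_markers)
    return "gpt-4o" if is_complex else "gpt-4o-mini"

def estimated_cost_usd(model: str, input_tokens: int) -> float:
    return PRICES_PER_1M_INPUT[model] * input_tokens / 1_000_000

print(pick_model("what's my order status?", expected_turns=1))         # gpt-4o-mini
print(pick_model("help me debug this integration", expected_turns=8))  # gpt-4o
print(f"{estimated_cost_usd('gpt-4o', 50_000):.4f}")                   # 0.1250
```

At 50K input tokens, the routing decision is worth roughly 17× in cost, which is why it belongs in the economics layer and not only in the capability layer.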
+ +## Next Steps + +- **[How Much Do AI Agents Actually Cost?](/blog/how-much-do-ai-agents-cost)** — raw provider pricing and per-scenario cost breakdowns +- **[AI Agent Cost Management: The Complete Guide](/blog/ai-agent-cost-management-guide)** — the five-tier maturity model from monitoring to hard enforcement +- **[Multi-Tenant AI Cost Control](/blog/multi-tenant-ai-cost-control-per-tenant-budgets-quotas-isolation)** — per-tenant budgets and chargeback models +- **[5 Real-World AI Agent Failures That Budget Controls Would Have Prevented](/blog/ai-agent-failures-budget-controls-prevent)** — what happens when variance is unbounded +- **[AI Agent Budget Control: Enforce Hard Spend Limits](/blog/ai-agent-budget-control-enforce-hard-spend-limits)** — the reserve-commit pattern for pre-execution enforcement +- **[Cycles vs LLM Proxies and Observability Tools](/blog/cycles-vs-llm-proxies-and-observability-tools)** — why dashboards cannot prevent the overspend diff --git a/blog/multi-agent-budget-control-crewai-autogen-openai-agents-sdk.md b/blog/multi-agent-budget-control-crewai-autogen-openai-agents-sdk.md new file mode 100644 index 0000000..7596cf1 --- /dev/null +++ b/blog/multi-agent-budget-control-crewai-autogen-openai-agents-sdk.md @@ -0,0 +1,188 @@ +--- +title: "Multi-Agent Budget Control for CrewAI, AutoGen, and OpenAI Agents SDK" +date: 2026-03-23 +author: Cycles Team +tags: [multi-agent, crewai, autogen, openai, budgets, engineering, best-practices] +description: "Multi-agent delegation chains create recursive cost exposure. Enforce per-agent budget boundaries in CrewAI, AutoGen, and OpenAI Agents SDK." +blog: true +sidebar: false +--- + +# Multi-Agent Budget Control for CrewAI, AutoGen, and OpenAI Agents SDK + +A team builds a research pipeline using CrewAI with three agents: a Planner that breaks topics into sub-questions, a Researcher that investigates each one, and a Writer that synthesizes the results. 
The Planner delegates 5 sub-questions per topic to the Researcher. For complex sub-questions, the Researcher delegates down to a Deep Analyst agent that makes 15 LLM calls per investigation. In development, one topic costs ~$3.50. + +In production, a batch of 40 topics kicks off overnight. The Researcher's delegation is non-deterministic — some topics trigger zero Deep Analyst calls, others trigger four. One topic causes all 5 sub-questions to delegate to the Deep Analyst, each triggering its own tool loop with retries. That single topic costs $89. + +| Layer | Calls (expected) | Calls (worst case) | Cost (expected) | Cost (worst case) | +|---|---|---|---|---| +| Planner | 2 | 2 | $0.30 | $0.30 | +| Researcher (5 sub-questions) | 40-60 | 40-60 | $2.50 | $2.50 | +| Deep Analyst (0-2 delegations) | 0-30 | 75 (5 × 15) | $0.70 | $47.00 | +| Retries (growing context) | ~5 | ~55 | — | $39.00 | +| **Total** | **~50-95** | **~190** | **$3.50** | **$89.00** | + +The Deep Analyst's cost is not linear in call count — each retry sends a longer context window, so later calls cost 3-5× more than early ones. That is why 190 calls cost $89, not $7. + +The 40-topic batch: $1,740 instead of the projected $140. Most topics cost $15-30 because production topics are more complex than the development test set. The provider dashboard shows the total. It does not show which agent in the delegation chain caused the blowout, or that delegation depth was the problem. + + + +## Why Delegation Chains Are Different from Fan-Out + +[Fan-out](/blog/langgraph-budget-control-durable-execution-retries-fan-out) creates parallel branches from a single parent — the total cost is the sum of the branches. Delegation chains create serial depth — Agent A calls Agent B calls Agent C. The cost is multiplicative because each delegator's retry and loop behavior wraps around the entire subtree below it. 
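The blast radius of a retry at the top of the chain can be sketched numerically. The call counts mirror the table above; the helper and its constants are illustrative.

```python
# Sketch: why serial delegation depth is multiplicative. Call counts mirror
# the research-pipeline table; constants are illustrative assumptions.
SUBQUESTIONS = 5
RESEARCHER_CALLS_PER_SUBQ = 10   # ~40-60 total across 5 sub-questions
ANALYST_CALLS = 15               # per Deep Analyst delegation

def topic_calls(analyst_delegations: int, planner_retries: int = 0) -> int:
    planner = 2
    subtree = SUBQUESTIONS * RESEARCHER_CALLS_PER_SUBQ + analyst_delegations * ANALYST_CALLS
    # A retry at the top of the chain replays the ENTIRE subtree below it.
    return planner + subtree * (1 + planner_retries)

print(topic_calls(analyst_delegations=1))                     # 67  (typical topic)
print(topic_calls(analyst_delegations=5))                     # 127 (all 5 sub-questions delegate)
print(topic_calls(analyst_delegations=5, planner_retries=1))  # 252 (plus one top-level retry)
```

Call count alone understates the damage: each replayed call carries a longer context window than the original, so dollars grow faster than calls.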
+ +If the Planner retries a failed topic, it re-executes the Researcher, which re-executes every Deep Analyst delegation. A single retry at the top of the chain replays every agent below it. This is the recursive version of the [retry storm pattern](/blog/ai-agent-failures-budget-controls-prevent) — except the blast radius grows with delegation depth, not retry count. + +| Property | Fan-out (parallel) | Delegation chain (serial depth) | +|---|---|---| +| Cost structure | Additive — sum of branches | Multiplicative — product of depths | +| Concurrency risk | Branches race on shared budget | Child inherits parent's remaining budget | +| Retry blast radius | One branch retries independently | Parent retries the entire child subtree | +| Visibility | Branches visible at one graph level | Depth hidden inside opaque agent calls | +| Budget scoping | Sub-budgets per branch | Budget must flow DOWN with diminishing allocation | + +## The Delegation Tax: Framework by Framework + +None of the major multi-agent frameworks enforce per-agent budgets. Each provides a delegation mechanism with no cost boundary between delegator and delegate. + +### CrewAI + +Agents in a Crew can delegate tasks to other agents via `allow_delegation=True`. When Agent A delegates to Agent B, the framework creates a new task execution context. There is no budget boundary between them — they share the same API key and the same global execution. The Crew has no concept of "Agent B's budget." A delegated agent can make unlimited LLM calls because nothing in the framework tracks per-agent spend. + +### AutoGen + +Multi-agent conversations use `GroupChat` or `initiate_chat()` chains. When an AssistantAgent sends work to another agent, the receiving agent runs its own LLM call loop. AutoGen tracks message counts but not token costs. The `max_consecutive_auto_reply` setting limits message rounds, not spend. 
A single reply that involves 5 tool calls and 5 LLM calls counts as 1 reply toward the limit — the cost inside that reply is invisible to the framework. + +### OpenAI Agents SDK + +The `handoff()` mechanism passes control from one agent to another. Each agent has its own system prompt and tool definitions. The SDK provides tracing via `RunContext` but no budget enforcement. A handoff chain of 3 agents, each making 10 tool calls, produces 30+ LLM calls with no per-agent ceiling. + +| Framework | Delegation mechanism | Built-in cost control | What's missing | +|---|---|---|---| +| CrewAI | `allow_delegation=True` | None | Per-agent spend limit | +| AutoGen | `initiate_chat()`, `GroupChat` | `max_consecutive_auto_reply` (count, not cost) | Token/dollar cap per agent | +| OpenAI Agents SDK | `handoff()` | None (tracing only) | Pre-execution budget check | + +The common gap: these frameworks control execution flow. They do not control execution cost. That requires a [runtime authority](/blog/ai-agent-budget-control-enforce-hard-spend-limits) that sits between each agent and the LLM provider, making a deterministic allow/deny decision before every call. + +## The Pattern: Hierarchical Budget Allocation for Delegation Chains + +The [reserve-commit lifecycle](/blog/ai-agent-budget-control-enforce-hard-spend-limits) already solves single-agent budget enforcement. For multi-agent delegation, the same pattern applies — but budget must flow down the chain with diminishing allocations. 
+ +``` +Run Budget: $25.00 +├── Planner: $2.00 (reserved from run) +├── Researcher (sub-question 1): $4.00 (reserved from run) +│ └── Deep Analyst: $2.00 (reserved from Researcher's allocation) +├── Researcher (sub-question 2): $4.00 +│ └── (no delegation — stays within $4.00) +├── Researcher (sub-question 3): $4.00 +│ └── Deep Analyst: $2.00 +├── Writer: $3.00 +└── Unallocated: $4.00 (safety margin) +``` + +Three design principles make this work: + +**Diminishing allocation.** Each delegation level gets a fraction of the parent's budget, not the full remaining balance. The Deep Analyst receives $2.00 carved from the Researcher's $4.00 — not $23.00 from the run's remaining budget. This bounds the blast radius of any single agent regardless of depth. + +**Pre-delegation reservation.** Before Agent A delegates to Agent B, Agent A reserves the sub-budget from its own allocation. If Agent A's remaining budget cannot fund the delegation, the delegation does not happen — the agent receives a clear budget-exhausted signal and can take an alternative path. This is enforcement before the action, not observation after. + +**Commit on return.** When the delegated agent completes, actual cost is committed and unused budget is released back to the parent. The Researcher reserved $2.00 for the Deep Analyst, but if the Deep Analyst only spent $1.30, the remaining $0.70 returns to the Researcher's pool. No budget is permanently locked. + +## What This Looks Like in Practice + +Cycles — a runtime authority for autonomous agents — integrates with any multi-agent framework through a budget-scoped handler per agent. Each agent in the delegation chain gets its own `Subject` in the Cycles hierarchy, creating a hard limit that survives across framework boundaries. 
For CrewAI, attach a handler to each agent's LLM:

```python
from langchain_openai import ChatOpenAI
from runcycles import CyclesClient, CyclesConfig, Subject
from budget_handler import CyclesBudgetHandler  # see integration guide

client = CyclesClient(CyclesConfig.from_env())

# Each agent gets a budget-scoped handler
def make_agent_llm(agent_name: str) -> ChatOpenAI:
    handler = CyclesBudgetHandler(
        client=client,
        subject=Subject(
            tenant="acme",
            workflow="research-pipeline",
            agent=agent_name,
        ),
    )
    return ChatOpenAI(model="gpt-4o", callbacks=[handler])

planner_llm = make_agent_llm("planner")        # bounded by planner's budget
researcher_llm = make_agent_llm("researcher")  # bounded by researcher's budget
analyst_llm = make_agent_llm("deep-analyst")   # bounded by analyst's budget
```

For AutoGen, wire the same handler into each agent's model configuration (a sketch: AutoGen's `llm_config` does not natively accept a `callbacks` key, so treat this wiring as illustrative of where the boundary belongs):

```python
from autogen import ConversableAgent

# Illustrative: each agent gets a budget-scoped LLM
researcher = ConversableAgent(
    name="researcher",
    llm_config={
        "model": "gpt-4o",
        "callbacks": [CyclesBudgetHandler(
            client=client,
            subject=Subject(
                tenant="acme",
                workflow="research-pipeline",
                agent="researcher",
            ),
        )],
    },
)
```

For the OpenAI Agents SDK, intercept each `handoff()` boundary:

```python
from runcycles import CyclesClient, CyclesConfig, Subject

# Before handoff, reserve a sub-budget for the child agent. The amount is
# carved from the parent's allocation upstream; `client` is configured as
# in the CrewAI example above.
def budget_handoff(parent_agent: str, child_agent: str, budget_usd: float):
    client.reserve(
        subject=Subject(
            tenant="acme",
            workflow="research-pipeline",
            agent=child_agent,
        ),
        amount=int(budget_usd * 100_000_000),  # USD to microcents
    )
```

The key is that each agent's LLM calls are bounded independently. A [counter variable in application memory is not enough](/blog/vibe-coding-budget-wrapper-vs-budget-authority) — it does not survive process restarts, does not handle concurrent agents, and does not provide atomic reservation semantics.
The runtime authority must be external to the framework. + +For the full callback handler implementation, see [Integrating Cycles with LangChain](/how-to/integrating-cycles-with-langchain). + +## What Happens Without Per-Agent Budgets + +The difference between debugging a $1,740 bill and preventing it. + +| Scenario | Without per-agent budget | With Cycles | +|---|---|---| +| Deep Analyst enters tool loop | 200+ calls, $89 per topic | Budget exhausted after ~15 calls, graceful denial | +| Planner retries failed delegation | Recreates entire child subtree at full cost | New sub-budget from parent's remaining allocation | +| 40-topic overnight batch | $1,740, discovered Monday morning | Each topic capped at $25, batch max = $1,000 | +| Debugging which agent overspent | Parse API logs, reconstruct delegation chain manually | Per-agent balance queries show where spend accumulated | +| Non-deterministic delegation depth | Cost variance of 25× between topics | Hard limit per agent regardless of delegation path | + +The research pipeline from the opening scenario would have stopped at $25 per topic. The Deep Analyst's tool loop would have hit its $2.00 sub-budget after ~15 calls instead of running to 75+. The overnight batch of 40 topics would have cost at most $1,000 — bounded exposure instead of an open-ended bill. 
+ +## Next Steps + +- **[LangGraph Budget Control for Durable Execution, Retries, and Fan-Out](/blog/langgraph-budget-control-durable-execution-retries-fan-out)** — budget enforcement for graph-based fan-out (the parallel counterpart to delegation chains) +- **[AI Agent Budget Control: Enforce Hard Spend Limits](/blog/ai-agent-budget-control-enforce-hard-spend-limits)** — the reserve-commit pattern that powers per-agent enforcement +- **[5 Real-World AI Agent Failures That Budget Controls Would Have Prevented](/blog/ai-agent-failures-budget-controls-prevent)** — retry storm and infinite loop cost math +- **[You Can Vibe Code a Budget Wrapper](/blog/vibe-coding-budget-wrapper-vs-budget-authority)** — why a per-agent counter is not the same as a runtime authority +- **[How Much Do AI Agents Actually Cost?](/blog/how-much-do-ai-agents-cost)** — raw provider pricing behind the cost math in this post +- **[Multi-Tenant AI Cost Control](/blog/multi-tenant-ai-cost-control-per-tenant-budgets-quotas-isolation)** — per-tenant budgets for teams running multi-agent systems in shared platforms