
blog: Reserving Authority When You Can't Pause#651

Open
amavashev wants to merge 3 commits into
main from
blog/voice-agent-budgets-when-you-cant-pause-to-reserve

Conversation

@amavashev
Contributor

Summary

First post in a new pillar: voice / realtime agent governance. Net-new surface for the corpus.

The post identifies the structural constraint: reserve-commit assumes the agent can wait synchronously for ALLOW, but voice agents can't — a 100ms sync gate on each audio frame would push the conversation past the ~700ms natural-feel threshold. The fix is not to abandon the gate; it is to position it where the latency budget can absorb it.

Four patterns covered:

  1. Predictive reservation, true-up later — reserve N minutes upfront, commit actual at call end
  2. Tier-aware gating — sync gate on slow-path tool calls, predictive reservation on fast-path audio
  3. Time-bounded floor authority — per-second auto-replenish for very high-throughput deployments
  4. Speculative commit with deny window — slow-path only, since audio is unrecoverable
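Pattern 1 can be sketched in a few lines. This is a minimal illustration, not an API from the post: `BudgetLedger`, `Reservation`, and the minutes-based unit are all assumed names for the sketch.

```python
import time
from dataclasses import dataclass

# Sketch of pattern 1: reserve predicted minutes before the call
# connects, commit actual usage at call end. BudgetLedger and its
# method names are illustrative, not an API from the post.

@dataclass
class Reservation:
    reservation_id: str
    reserved_minutes: float
    started_at: float

class BudgetLedger:
    def __init__(self, remaining_minutes: float):
        self.remaining_minutes = remaining_minutes

    def reserve(self, minutes: float, reservation_id: str) -> Reservation:
        # The synchronous ALLOW/DENY gate runs once, pre-call, where
        # latency is invisible -- never per audio frame.
        if minutes > self.remaining_minutes:
            raise RuntimeError("DENY: predicted call exceeds remaining budget")
        self.remaining_minutes -= minutes
        return Reservation(reservation_id, minutes, time.monotonic())

    def commit(self, res: Reservation, actual_minutes: float) -> None:
        # True-up: release the unused reservation (or absorb an overrun).
        self.remaining_minutes += res.reserved_minutes - actual_minutes

ledger = BudgetLedger(remaining_minutes=60.0)
res = ledger.reserve(10.0, "call-123")    # predict ~10 min up front
# ... the call runs with no per-frame gating ...
ledger.commit(res, actual_minutes=6.5)    # 3.5 unused minutes released
```

If actual usage exceeds the reservation, the same commit absorbs the overrun, which is why this pattern pairs with provider-side per-call caps as a backstop.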

Voice-specific failure modes (talking-to-itself loops, stuck conversations, premium-tier escalation runaway, hold-music wall-clock blind spots) get their own table. Mirrors the PocketOS two-layer fix: per-call provider caps + agent-side runtime authority.

Stack matrix for OpenAI Realtime / Vapi / Retell AI / ElevenLabs shows where the gate can sit in each.

Author: Albert Mavashev
Date: 2026-05-20
Word count: ~3,100 (body)

Reviews

  • Internal cycles 1–3 (scorecard 9.3/10)
  • Glossary auto-linker applied 7 contextual links
  • Codex external review: round 1 REVISE-MINOR (9 findings, 9 applied / 2 pushed back), round 2 SHIP

Codex verified upstream:

  • OpenAI Realtime API event surface (response.function_call_arguments.delta/.done are the actual events — my original `response.function_call` was wrong; fixed)
  • ElevenLabs / Vapi / Retell AI pricing pages
  • Cycles-docs internal targets

Cycle 1 fact-check caught and fixed:

  • OpenAI Realtime latency attribution softened (the OpenAI page is bot-blocked from WebFetch; reframed as industry observation with turn-taking research convergence)
  • Retell AI pricing range widened from $0.07-$0.15 to $0.07-$0.31 (original ceiling was understated)
  • Opener cost figure corrected ($390 in 17 minutes was implausible given the post's own pricing table; lowered to $90)
  • ElevenLabs $0.24/min ceiling clarified as requiring LLM + telephony at cost on top of burst hosting
  • Vapi $0.115-$0.42/min labeled as derived estimate
  • Provider-layer cap claim softened (pricing pages don't uniformly establish hard caps)

Per-dimension scores

Dimension Score
Factual accuracy 9.5
Credibility 9
Cross-links 9
SEO (title 40/51, desc 152/160) 9.5
Code accuracy 9
Structure & flow 9.5
Terminology 9.5
Tone & style 9.5

Overall: 9.3 / 10

Test plan

Dependencies and order

This post links to three sibling posts that are on PR branches awaiting merge: `/blog/agent-memory-writes-are-actions-too` (PR #648), `/blog/when-coding-agents-press-merge` (PR #649), `/blog/computer-use-agents-have-no-tool-boundary` (PR #650). Merge order: #648 → #649 → #650 → this PR, so the trilogy + this voice post all land with working cross-links.

Tags

Three new pillar tags introduced for the voice surface: `voice-agents`, `realtime`, `latency`. Future voice posts should align on these. Other tags (`budgets`, `runtime-authority`, `agents`, `engineering`, `RISK_POINTS`) match corpus convention.

amavashev added 2 commits May 15, 2026 13:00
New pillar post on voice / realtime agent budgets. Net-new surface
for the corpus — first post in the voice-agents pillar.

The post identifies the structural constraint: reserve-commit assumes
the agent can wait synchronously for ALLOW. Voice and realtime agents
can't — a 100ms sync gate on each audio frame would push the
conversation past the ~700ms natural-feel threshold. The fix is not to
abandon the gate; it is to position it where the latency budget can
absorb it.

Four patterns covered:
1. Predictive reservation, true-up later (reserve N minutes upfront,
   commit actual at call end)
2. Tier-aware gating (sync gate on the slow-path tool calls,
   predictive reservation on the fast-path audio)
3. Time-bounded floor authority (per-second auto-replenish for very
   high-throughput deployments)
4. Speculative commit with deny window (slow-path only, since audio is
   unrecoverable)

Stack matrix for OpenAI Realtime / Vapi / Retell AI / ElevenLabs shows
where the gate can sit in each. Voice-specific failure modes (talking-
to-itself loops, stuck conversations, premium-tier escalation runaway,
hold-music wall-clock blind spots) get their own table. Mirrors the
PocketOS two-layer fix: per-call provider caps + agent-side runtime
authority.

Internal cross-links to tracking-tokens-in-a-streaming-llm-response
(closest sibling), estimate-drift (calibration),
ai-agent-action-control (parent tier model),
retry-storms-and-idempotency, when-budget-runs-out, multi-tenant
cost control, plus the just-shipped trilogy (memory-writes, merge,
computer-use).

External citations: OpenAI Realtime delivery scale page, callsphere
pricing analysis, implicit acknowledgment of ElevenLabs/Vapi/Retell
pricing models.

Reviews: internal cycles 1-3 (scorecard 9.3/10), glossary linker added
7 contextual links. Cycle 1 fact-check caught and fixed:
- OpenAI Realtime latency attribution softened (the OpenAI page is
  bot-blocked from WebFetch, so "OpenAI targets X ms" became
  "production deployments typically land at X ms with industry
  guidance converging on ~700ms").
- Retell AI pricing range widened from $0.07-$0.15 to $0.07-$0.31
  (the original ceiling was understated).
- Opener cost figure corrected ($390 in 17 minutes was implausible
  given the post's own pricing table; lowered to $90).
- Retell AI capitalization standardized ("Retell AI" everywhere in
  brand-list contexts, not bare "Retell").

Three new pillar tags introduced: voice-agents, realtime, latency.
…-pause-to-reserve

Apply/skip tally: 9 applied, 2 pushed back.

Applied:
- `response.function_call` → `response.function_call_arguments.*`:
  the OpenAI Realtime API uses function-call output items and the
  function_call_arguments streaming events; my original event name
  was not a real Realtime server event. Fixed in both the prose and
  the stack-by-stack table.
- 80-150 ms relay hop: removed the specific band attribution. The
  OpenAI page does not state it. Generic phrasing: "a forwarding hop
  sized to fit inside the conversation's latency budget."
- ElevenLabs row: clarified the $0.08-$0.24/min framing. Hosting is
  $0.08/min flat or $0.16/min burst; the $0.24 ceiling derives once
  LLM and telephony layer on at cost.
- Vapi row: labeled the $0.115-$0.42/min range as an estimate (it's
  derived from $0.05/min orchestration plus a BYOK provider stack at
  cost; the actual all-in depends on provider choices).
- 17-minute "$1.50-$8.00 model spend alone": tightened to "against
  the per-minute stack rates above" since the rates in the table
  mix all-in / provider / orchestration models.
- Provider-layer caps: softened from "OpenAI, Vapi, Retell AI, and
  ElevenLabs all expose per-call or per-session limits" to "to
  whatever degree each provider exposes them — typically through
  per-session budget headers, dashboard caps, or programmatic
  limits." Pricing pages don't uniformly establish hard caps.
- "Most production voice teams use this only..." for speculative
  commit: softened to "This pattern is usually safer on the
  slow-path tool layer."
- Description trimmed 162 → 152 chars: changed "—" to ":", "sit
  synchronously in the path" to "sync on the hot path."
- `reserve-commit` glossary link: pointed to
  /protocol/how-reserve-commit-works-in-cycles instead of
  /glossary#reservation (reserve-commit is a lifecycle term, not the
  reservation entry).
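The event-name fix above implies a natural shape for pattern 2's tier-aware gate: pass audio deltas through untouched, gate only on completed tool-call arguments. A sketch, where `check_authority` stands in for the reserve-commit round-trip and any event fields beyond the type names are assumptions:

```python
import json

# Tier-aware gating sketch: fast-path audio is never gated; the sync
# ALLOW/DENY check runs only when a tool call's arguments complete,
# where the latency budget can absorb it. check_authority and the
# event payload fields are illustrative assumptions.

def check_authority(tool_name: str, args: dict) -> bool:
    # Placeholder for a synchronous reserve-commit ALLOW/DENY call.
    return args.get("amount", 0) <= 100

def handle_event(event: dict, audio_out, tool_runner):
    etype = event["type"]
    if etype == "response.audio.delta":
        # Fast path: forward the frame immediately, no gate.
        audio_out(event["delta"])
    elif etype == "response.function_call_arguments.done":
        # Slow path: the tool call already costs hundreds of ms, so a
        # sync gate here fits inside the existing latency budget.
        args = json.loads(event["arguments"])
        if check_authority(event["name"], args):
            return tool_runner(event["name"], args)
        return {"error": "DENY: over budget"}
```

The point of the split is that the `.done` event is the first moment the full arguments exist, so it is also the cheapest place to block.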

Skipped, with reason:
- Body cross-link count (11) above 5-8 pillar target: three of the
  eleven are the trilogy references in a single closing sentence
  that names the sibling extension series (memory-writes, merge,
  computer-use). They are coherent as a triple, not redundant.
- 2026-05-20 publish date: intentional sequence after the trilogy
  (5/16, 5/18, 5/19, 5/20).

Codex verified upstream: ElevenLabs/Vapi/Retell AI pricing pages,
OpenAI Realtime API event surface (function_call_arguments.delta /
.done are the actual streaming events), and the cycles-docs main-
branch internal targets. Sibling links to memory-writes, merge,
and computer-use treated as just-merged via PR #648-#650.
…o 2026-06-06

Moved from 2026-05-20 to 2026-06-06 to match the weekly publishing
cadence for the action-authority extension arc. Sequence: memory 5/16,
merge 5/23, computer-use 5/30, voice 6/06.
