feat(operator,platform): refactor operator agent loop and improve platform reliability#499
Conversation
Greptile SummaryMajor architectural refactoring replacing OpenCode CLI with direct LLM function-calling agent loop and browser pooling, plus platform reliability improvements. Operator Service Refactoring
Platform Reliability Improvements
Issues Found
Confidence Score: 3/5
|
| Filename | Overview |
|---|---|
| services/operator/app/services/agent_loop.py | New 824-line agent loop with direct LLM function-calling and Playwright browser automation, replacing OpenCode CLI |
| services/operator/app/services/browser_pool.py | New browser context pooling with persistent Chromium instance and semaphore-based concurrency control |
| services/operator/app/config.py | Removed workspace config fields, added vision config, changed headless default from True to False |
| services/platform/convex/lib/agent_response/generate_response.ts | Added 9-minute platform hard limit cap, improved retry logic with fast model, better timeout budget handling |
| services/platform/app/components/error-boundaries/core/error-boundary-base.tsx | Added auto-retry with exponential backoff, cleanup for timers in componentWillUnmount |
| services/platform/app/features/chat/hooks/use-chat-pending-state.ts | Simplified pending state logic to clear only when no assistant messages are in non-terminal state |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[User Request] --> B[BrowserService]
B --> C{Initialize?}
C -->|Yes| D[BrowserPool.initialize]
D --> E[Launch Chromium]
C -->|No| F[BrowserPool.acquire]
F --> G[Create BrowserContext]
G --> H[run_agent_loop]
H --> I{Agent Loop}
I --> J[Call LLM with tools]
J --> K{Response Type}
K -->|Tool Calls| L[Execute Playwright Tools]
L --> M[navigate/snapshot/click/etc]
M --> N{Continue?}
N -->|Yes, budget remains| I
N -->|Timeout/Max turns| O[Phase 2: Summarize]
K -->|Text Response| P[Return Response]
O --> P
P --> Q[BrowserPool.release]
Q --> R[Close Context]
R --> S[Return Results]
Last reviewed commit: faa8ad3
📝 WalkthroughWalkthroughThis PR refactors the Operator service from an OpenCode + Playwright MCP architecture to direct Playwright browser automation with an in-process LLM agent loop. Key changes include removing WorkspaceManager and vision_server modules, introducing BrowserPool for concurrent context management and agent_loop for orchestrating tool-calling agent interactions. Configuration removes workspace-related settings and adds vision API credentials. Additional platform-side changes introduce error boundary retry mechanisms with exponential backoff, propagate incomplete message signals through chat UI components, add useFastModel options to agent factories, refactor authentication flows using getAuthUserIdentity, and impose platform hard timeout limits on agent execution. Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
services/platform/app/features/chat/hooks/__tests__/use-chat-pending-state.test.ts (1)
8-19: 🧹 Nitpick | 🔵 TrivialPrefer
satisfies UIMessageoveras UIMessagein this test fixture helper.This provides compile-time validation while keeping type inference intact, aligning with the repository's guidance to avoid type casting.
♻️ Suggested refactor
- return { + const message = { key: overrides.id, role: 'assistant', text: '', _creationTime: Date.now(), status: 'success', parts: [], ...overrides, - } as UIMessage; + } satisfies UIMessage; + return message;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@services/platform/app/features/chat/hooks/__tests__/use-chat-pending-state.test.ts` around lines 8 - 19, The test helper createUIMessage currently casts the returned object with "as UIMessage"; replace that cast by using the TypeScript "satisfies UIMessage" operator on the object literal (keep the function signature returning UIMessage) so the object is validated at compile time without losing inference. Update the return expression in createUIMessage to use "satisfies UIMessage" instead of "as UIMessage" and ensure the function still returns UIMessage.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@services/operator/app/services/agent_loop.py`:
- Around line 736-743: The check `context is not None` is redundant because
`context: BrowserContext` is required; remove that condition and simplify
`should_batch = len(navigate_calls) > 1 and context is not None` to just
`should_batch = len(navigate_calls) > 1`, leaving the batching logic using
`navigate_calls`, `parsed_calls`, `_fetch_single_page`, and `urls` unchanged so
the parallel `asyncio.gather` and `nav_results` creation behave the same.
- Around line 453-460: The select_option branch in the tool handler (tool_name
== "select_option") currently calls _resolve_locator(...) and await
locator.select_option(value) but doesn't wait for page load like the click and
type_text branches; update the handler to await the same post-action load wait
(e.g. await page.wait_for_load_state("domcontentloaded") or the existing helper
used elsewhere) after await locator.select_option(value) to make behavior
consistent and prevent race conditions before returning the "Selected ..."
message.
- Around line 569-620: The function _call_llm_with_tools currently returns
immediately on httpx.TimeoutException instead of retrying; change the handler
for TimeoutException to mirror the general Exception branch (set last_error to a
descriptive timeout message, log a warning with attempt info, backoff with await
asyncio.sleep(2**attempt) and continue if attempt < retries, otherwise
break/allow final error logging), and rename the parameter _retries to retries
(and update all internal uses and any callers) so the retry loop uses the public
name; ensure final logger.error still reports the last_error after retries are
exhausted.
In `@services/operator/app/services/browser_pool.py`:
- Around line 75-76: Replace the two assert statements that check
self._semaphore and self._browser (used in BrowserPool methods around the
asserts at lines showing the checks) with explicit runtime checks that raise a
clear RuntimeError (or custom exception) if either is None; do the same for the
similar check at the other location (around the check near line 92).
Specifically, locate the methods referencing self._semaphore and self._browser
in browser_pool.py, remove the assert lines and insert explicit if-not checks
that raise RuntimeError with a helpful message (e.g., "BrowserPool not
initialized: _semaphore is None" / "_browser is None") so the error cannot be
silently skipped when Python runs with -O.
- Line 54: The disconnected callback creates a fire-and-forget task via
asyncio.create_task(self._on_browser_disconnected()), which can be orphaned
during fast shutdown; change this to store the created Task (e.g.
self._disconnect_task = asyncio.create_task(self._on_browser_disconnected()))
and ensure that any shutdown/cleanup method for BrowserPool (or equivalent)
checks for self._disconnect_task and awaits or cancels it (and awaits) so the
_on_browser_disconnected coroutine is not left running; update the
_on_browser_disconnected reference and any close/shutdown logic to handle a
pending task safely.
In `@services/operator/Dockerfile`:
- Around line 74-75: The Dockerfile has a standalone RUN chown -R pwuser:pwuser
/app that can be merged with adjacent RUN instructions to reduce image layers
(per Hadolint DL3059); update the Dockerfile by combining this chown into the
nearest preceding or following RUN that performs related setup (using shell
chaining like &&) so the ownership change and other setup steps execute in a
single RUN, referencing the existing RUN chown -R pwuser:pwuser /app command and
any neighboring RUN that installs/copies app files.
In
`@services/platform/app/components/error-boundaries/core/error-boundary-base.tsx`:
- Around line 60-71: The retryCount is only incremented in the timeout handler,
which means once a boundary recovers it still accumulates retries; update the
setTimeout callback in the error boundary's retry logic (where retryTimerId is
set and getRetryDelay is used) to reset retryCount to 0 when clearing
hasError/error and isRetrying, e.g. setState should set retryCount: 0 instead of
incrementing prev.retryCount so future independent errors start with fresh
retries.
In `@services/platform/convex/lib/agent_response/generate_response.ts`:
- Around line 194-206: Compute remainingPlatformMs = Math.max(platformDeadline -
Date.now(), 0) and use it when computing effectiveTimeoutMs and actionDeadline
to prevent negative timeouts; if remainingPlatformMs === 0 then throw an
AgentTimeoutError immediately (use AgentTimeoutError symbol) to abort early;
ensure rawTimeoutMs calculation still uses args.deadlineMs ?
Math.max(args.deadlineMs - Date.now(), 30_000) : agentConfig.timeoutMs but cap
effectiveTimeoutMs = Math.min(rawTimeoutMs, remainingPlatformMs) and set
actionDeadline = Math.min(args.deadlineMs ?? startTime + effectiveTimeoutMs,
platformDeadline). Also update the recovery timeout logic (the code referenced
around lines 689–725) to clamp any recovery allocation to remainingPlatformMs so
recovery never requests more time than the platform budget.
---
Outside diff comments:
In
`@services/platform/app/features/chat/hooks/__tests__/use-chat-pending-state.test.ts`:
- Around line 8-19: The test helper createUIMessage currently casts the returned
object with "as UIMessage"; replace that cast by using the TypeScript "satisfies
UIMessage" operator on the object literal (keep the function signature returning
UIMessage) so the object is validated at compile time without losing inference.
Update the return expression in createUIMessage to use "satisfies UIMessage"
instead of "as UIMessage" and ensure the function still returns UIMessage.
…tform reliability Restructure operator service with dedicated agent loop and browser pool, removing MCP vision server and workspace manager. Improve platform error boundaries, agent response generation, RLS context handling, branding queries, and chat pending state logic.
…hin platform budget
Use contextlib.suppress for CancelledError in BrowserPool shutdown and fix line formatting in OutputAccumulator.
8b3b831 to
f0de21c
Compare
…tform reliability (#499)
Summary
agent_loop.py) and browser pool (browser_pool.py), replacing the MCP vision server and workspace manager for cleaner separation of concernsChanges
Operator (Python)
agent_loop.py— dedicated agent loop service with structured task executionbrowser_pool.py— connection pooling for browser instancesworkspace_manager.pyandmcp/vision_server.py(consolidated into new modules)browser_service.pyby extracting pool managementuv.lockdependenciesPlatform (TypeScript/Convex)
generate_response.tswith better timeout budget handlingTest plan
test_agent_loop.py(728 lines) covering agent loop behaviortest_browser_pool.py(173 lines) covering browser pool managementtest_output_accumulator.pyandtest_phase2_summarization.pyfor new structureuse-chat-pending-state.test.tsfor refined pending state logic🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Improvements
Changes