Custom planner v0#2
Closed
YushaArif99 wants to merge 65 commits into
Closed
Conversation
d2dab4e to
6e1941e
Compare
… of the custom planner implementation
…nctionNode, and Plan
…r generated plans
…cation in planner
…lans based on natural language tasks
…flow, including support for pausing and resuming plans, handling task events, and integrating browser state broadcasting. **NOTE**: still a WIP...
…c module loading logic
…anner and threading for command acknowledgment
… function retrieval from the reimplement queue
…n assertions in the verification process
…n planner primitives
…d modification functions.
…tion and exploration scenarios
…ath in BrowserWorker
…until the event is set and returns the correct Primitive object.
… Planner, streamlining test suite.
…ification context, enhancing error handling and bridge function execution.
…ion only occurs if a valid main tab is returned, improving robustness of exploration flow.
…g the structure and content of the payload produced for function modifications.
…ker to enhance browser state communication.
…fying snapshot processing and data retrieval from the broadcast queue.
…ilitate efficient comparison of DOM structures and generate stable hashes for DOM subtrees.
…ner module, along with a helper function to retrieve it.
…ed versions of snapshots, including truncation of elements and optional DOM hashing.
…ducing a dispatcher for primitive and source code verification. Replace legacy fingerprinting and heuristic methods with streamlined checks for navigation and interactive primitives. Update payload structure for LLM verification to include detailed primitive information and DOM change summaries.
…cluding stability checks for _hash_dom, diff summary generation, and tiered verification logic for navigation, scrolling, and button interactions.
…rowser state during plan execution.
…fying plan execution and course correction handling through mocked dependencies and queues.
…ce stubbing of subsequent helper calls in zero_shot.py
…tralized definitions from sys_msg
2e5b9de to
2f4868c
Compare
0915ca5 to
85d10e2
Compare
djl11
added a commit
that referenced
this pull request
Feb 2, 2026
…tricter assertions FAILING TEST - requires cancel_running fix to pass. Changes: - Use 2s simulated LLM time (realistic ratio vs 0.1s utterance interval) - Assert cancelled_count == 0 (no running tasks should be cancelled) - Remove dependency on actual LLM calls for deterministic timing - Improve docstring to clearly document the bug and required fixes This test now properly validates that BOTH fixes are required: 1. Debouncer asyncio.shield() fix (already committed) 2. cancel_running=False for voice mode (NOT YET COMMITTED) Without fix #2, test fails with 4 cancellations. With both fixes, test passes with 0 cancellations.
djl11
added a commit
that referenced
this pull request
May 28, 2026
…k failures
Four E2E spending tests have been failing in CI:
- test_assistant_limit_check (DID NOT RAISE SpendingLimitExceededError)
- test_inflight_cancellation_on_limit_exceeded (timing wrong)
- test_limit_check_callback_allows_under_limit (allowed=False, cap=0.0, spend=$10)
- test_limit_exceeded_blocks_llm_call (DID NOT RAISE)
All four share a single root cause: state leaking through a
SHARED "SpendingTest Assistant" record reused across every test
in the file. The old e2e_config fixture did a "find-by-name then
reuse, else create" lookup. Every test in TestE2ESpendingLimits
got the same agent_id, so:
1. Cumulative spend (current_spend) is NEVER reset by Orchestra
once an LLM call lands on it. Once any test makes a real LLM
call, the assistant carries that spend for the rest of the
session. test_limit_check_callback_allows_under_limit fails
when it sees current_spend=$10 from earlier tests, even
though it asserts the assistant "starts fresh".
2. The PATCH-based cap restore in test bodies
(test_limit_exceeded_blocks_llm_call etc.) reads the
*current* cap then restores it. If a previous test leaked
cap=0, that becomes the "original" for the next test,
making the leak permanent.
3. The fixture-level cap=None reset is best-effort with
bare-except and silently fails on any Orchestra hiccup,
leaving the cap unreset.
The previous "await the reset PATCH" fix (c583ab2) addressed
fragility #3 but couldn't address #1 (spend accumulation) or #2
(test-body restore racing the reset).
Fix: each test gets its OWN freshly-created assistant with a
unique surname (test-node-slug + 8-char UUID). The fixture:
- Always POSTs a new assistant at setup (no find-by-name reuse)
- Raises loudly on create failure (was: silently leaving
test_agent_id=None then propagating to SESSION_DETAILS)
- DELETEs the assistant at teardown via /assistant/{id}
No state survives between tests:
- Fresh agent_id per test → spend starts at 0
- Fresh cap=25 per test → no cap-leak between tests
- Delete in teardown → no residual rows accumulate
The non-E2E tests in the file (TestAtomicUpsert,
TestUpdateCumulativeSpend, …) don't use e2e_config — they mock
SESSION_DETAILS and are unaffected.
Side effects:
- Each test creates + deletes an assistant: ~2 extra HTTP
round-trips per test. Acceptable cost given the
correctness win.
- Local DB rows accumulate transiently if a teardown DELETE
fails (bare-except), but local.sh's docker-volume rebuild
on restart clears them; CI runs are fresh per matrix job
anyway.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.