Custom planner v0#2

Closed
YushaArif99 wants to merge 65 commits into main from custom-planner

Conversation

@YushaArif99
Contributor

No description provided.

@YushaArif99 force-pushed the custom-planner branch 3 times, most recently from d2dab4e to 6e1941e May 6, 2025 11:07
YushaArif99 added 27 commits May 7, 2025 23:12
…flow, including support for pausing and resuming plans, handling task events, and integrating browser state broadcasting. **NOTE**: still a WIP...
…anner and threading for command acknowledgment
… function retrieval from the reimplement queue
YushaArif99 added 22 commits May 7, 2025 23:12
…until the event is set and returns the correct Primitive object.
…ification context, enhancing error handling and bridge function execution.
…ion only occurs if a valid main tab is returned, improving robustness of exploration flow.
…g the structure and content of the payload produced for function modifications.
…fying snapshot processing and data retrieval from the broadcast queue.
…ilitate efficient comparison of DOM structures and generate stable hashes for DOM subtrees.
…ner module, along with a helper function to retrieve it.
…ed versions of snapshots, including truncation of elements and optional DOM hashing.
…ducing a dispatcher for primitive and source code verification. Replace legacy fingerprinting and heuristic methods with streamlined checks for navigation and interactive primitives. Update payload structure for LLM verification to include detailed primitive information and DOM change summaries.
…cluding stability checks for _hash_dom, diff summary generation, and tiered verification logic for navigation, scrolling, and button interactions.
…fying plan execution and course correction handling through mocked dependencies and queues.
…ce stubbing of subsequent helper calls in zero_shot.py
@djl11 force-pushed the main branch 2 times, most recently from 0915ca5 to 85d10e2 May 14, 2025 18:12
@YushaArif99 deleted the custom-planner branch July 3, 2025 05:02
djl11 added a commit that referenced this pull request Feb 2, 2026
…tricter assertions

FAILING TEST - requires cancel_running fix to pass.

Changes:
- Use 2s simulated LLM time (realistic ratio vs 0.1s utterance interval)
- Assert cancelled_count == 0 (no running tasks should be cancelled)
- Remove dependency on actual LLM calls for deterministic timing
- Improve docstring to clearly document the bug and required fixes

This test now properly validates that BOTH fixes are required:
1. Debouncer asyncio.shield() fix (already committed)
2. cancel_running=False for voice mode (NOT YET COMMITTED)

Without fix #2, test fails with 4 cancellations.
With both fixes, test passes with 0 cancellations.
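
The interaction between the two fixes can be sketched as follows. This is a minimal illustration, not the project's Debouncer: the class SketchDebouncer, its submit/drain methods, the simulate helper, and the scaled-down timings are all assumptions made for the example. Only asyncio.shield(), the cancel_running flag, and the cancelled_count assertion come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Set, Tuple


class SketchDebouncer:
    """Hypothetical debouncer used only for illustration."""

    def __init__(self, delay: float) -> None:
        self.delay = delay
        self.cancelled_count = 0
        self.started: List[asyncio.Task] = []          # every handler ever started
        self._pending: Optional[asyncio.Task] = None   # wrapper for the latest submit
        self._running: Set[asyncio.Task] = set()       # handlers not yet finished

    def submit(self, handler: Callable[[], Awaitable[None]], *, cancel_running: bool) -> None:
        # A newer submission always replaces the previous wrapper task.
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        if cancel_running:
            # Newer input aborts handlers that are already in flight.
            for task in list(self._running):
                if task.cancel():
                    self.cancelled_count += 1
        self._pending = asyncio.create_task(self._fire(handler))

    async def _fire(self, handler: Callable[[], Awaitable[None]]) -> None:
        await asyncio.sleep(self.delay)
        work = asyncio.ensure_future(handler())
        self.started.append(work)
        self._running.add(work)
        work.add_done_callback(self._running.discard)
        # shield() keeps `work` alive when this wrapper is cancelled by a later submit().
        await asyncio.shield(work)

    async def drain(self) -> None:
        """Wait for the last wrapper and for every handler that was started."""
        if self._pending is not None:
            await asyncio.gather(self._pending, return_exceptions=True)
        await asyncio.gather(*self.started, return_exceptions=True)


async def simulate(cancel_running: bool, utterances: int = 5) -> Tuple[int, int]:
    """Return (cancelled_count, completed_handlers) for one simulated session."""
    debouncer = SketchDebouncer(delay=0.01)
    completed = 0

    async def fake_llm_call() -> None:
        nonlocal completed
        await asyncio.sleep(0.3)  # stands in for the slow LLM call (timings scaled down)
        completed += 1

    for _ in range(utterances):
        debouncer.submit(fake_llm_call, cancel_running=cancel_running)
        await asyncio.sleep(0.05)  # next utterance arrives while earlier calls still run
    await debouncer.drain()
    return debouncer.cancelled_count, completed


if __name__ == "__main__":
    print("cancel_running=False:", asyncio.run(simulate(cancel_running=False)))
    print("cancel_running=True: ", asyncio.run(simulate(cancel_running=True)))
```

In this sketch the shield alone only protects a handler from cancellation of its wrapper task; handlers are still cancelled directly while cancel_running is True. That is why the stricter cancelled_count == 0 assertion needs both changes.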
djl11 added a commit that referenced this pull request May 28, 2026
…k failures

Four E2E spending tests have been failing in CI:
  - test_assistant_limit_check (DID NOT RAISE SpendingLimitExceededError)
  - test_inflight_cancellation_on_limit_exceeded (timing wrong)
  - test_limit_check_callback_allows_under_limit (allowed=False, cap=0.0, spend=$10)
  - test_limit_exceeded_blocks_llm_call (DID NOT RAISE)

All four share a single root cause: state leaking through a
SHARED "SpendingTest Assistant" record reused across every test
in the file. The old e2e_config fixture did a "find-by-name then
reuse, else create" lookup. Every test in TestE2ESpendingLimits
got the same agent_id, so:

  1. Cumulative spend (current_spend) is NEVER reset by Orchestra
     once an LLM call lands on it. Once any test makes a real LLM
     call, the assistant carries that spend for the rest of the
     session. test_limit_check_callback_allows_under_limit fails
     when it sees current_spend=$10 from earlier tests, even
     though it asserts the assistant "starts fresh".

  2. The PATCH-based cap restore in test bodies
     (test_limit_exceeded_blocks_llm_call etc.) reads the
     *current* cap then restores it. If a previous test leaked
     cap=0, that becomes the "original" for the next test,
     making the leak permanent.

  3. The fixture-level cap=None reset is best-effort with
     bare-except and silently fails on any Orchestra hiccup,
     leaving the cap unreset.
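
Failure mode 2 can be reproduced in a few lines. The sketch below uses a plain dict as a stand-in for the shared assistant record; the run_test_body helper, the "cap" key, and the starting value of 25.0 are illustrative assumptions, not the project's test code.

```python
from typing import Dict, Optional

# Stand-in for the shared assistant row that every test used to reuse.
shared_record: Dict[str, Optional[float]] = {"cap": 25.0}


def run_test_body(record: Dict[str, Optional[float]], fail_before_restore: bool) -> None:
    """Mimic a test body that tightens the cap and restores it afterwards."""
    original = record["cap"]   # reads whatever the previous test left behind
    record["cap"] = 0.0        # tighten the cap for this test
    if fail_before_restore:
        return                 # simulated failure: the restore step never runs
    record["cap"] = original   # restores the value read above, leaked or not


# The first test fails before its restore step and leaks cap=0.
run_test_body(shared_record, fail_before_restore=True)
leaked = shared_record["cap"]

# The second test behaves correctly, but its "original" is already the leaked value.
run_test_body(shared_record, fail_before_restore=False)
after_clean_test = shared_record["cap"]

print(leaked, after_clean_test)
```

Because the second body faithfully restores the value it read, a well-behaved test ends up preserving the leak instead of repairing it.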

The previous "await the reset PATCH" fix (c583ab2) addressed
fragility #3 but couldn't address #1 (spend accumulation) or #2
(test-body restore racing the reset).

Fix: each test gets its OWN freshly-created assistant with a
unique surname (test-node-slug + 8-char UUID). The fixture:
  - Always POSTs a new assistant at setup (no find-by-name reuse)
  - Raises loudly on create failure (was: silently leaving
    test_agent_id=None then propagating to SESSION_DETAILS)
  - DELETEs the assistant at teardown via /assistant/{id}

No state survives between tests:
  - Fresh agent_id per test → spend starts at 0
  - Fresh cap=25 per test → no cap-leak between tests
  - Delete in teardown → no residual rows accumulate
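
A minimal sketch of the create-per-test pattern described above, written as a context manager so it runs with the standard library alone. FakeAssistantApi, fresh_assistant, and the record fields are hypothetical names for illustration; the real fixture talks to Orchestra over HTTP and its client calls are not shown in this commit message.

```python
import contextlib
import re
import uuid
from typing import Dict, Iterator


class FakeAssistantApi:
    """In-memory stand-in for the assistant endpoints; not the real client."""

    def __init__(self) -> None:
        self.assistants: Dict[str, dict] = {}

    def create_assistant(self, surname: str, cap: float) -> str:
        agent_id = uuid.uuid4().hex
        self.assistants[agent_id] = {"surname": surname, "cap": cap, "current_spend": 0.0}
        return agent_id

    def delete_assistant(self, agent_id: str) -> None:
        self.assistants.pop(agent_id, None)


@contextlib.contextmanager
def fresh_assistant(api: FakeAssistantApi, test_node_name: str, cap: float = 25.0) -> Iterator[str]:
    """Create a uniquely named assistant for one test and delete it afterwards."""
    slug = re.sub(r"[^a-z0-9]+", "-", test_node_name.lower()).strip("-")
    surname = f"{slug}-{uuid.uuid4().hex[:8]}"  # test-node slug + 8-char UUID
    # No try/except around creation: a failure here should abort the test loudly.
    agent_id = api.create_assistant(surname, cap)
    try:
        yield agent_id
    finally:
        api.delete_assistant(agent_id)  # teardown runs even if the test body raises


if __name__ == "__main__":
    api = FakeAssistantApi()
    with fresh_assistant(api, "test_limit_exceeded_blocks_llm_call") as agent_id:
        api.assistants[agent_id]["current_spend"] += 10.0  # spend stays on this record only
    with fresh_assistant(api, "test_limit_check_callback_allows_under_limit") as agent_id:
        print(api.assistants[agent_id])
    print(len(api.assistants))
```

In a pytest suite the same body could sit inside a fixture that derives the slug from request.node.name, with the yield separating setup from teardown.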

The non-E2E tests in the file (TestAtomicUpsert,
TestUpdateCumulativeSpend, …) don't use e2e_config — they mock
SESSION_DETAILS and are unaffected.

Side effects:
  - Each test creates + deletes an assistant: ~2 extra HTTP
    round-trips per test. Acceptable cost given the
    correctness win.
  - Local DB rows accumulate transiently if a teardown DELETE
    fails (bare-except), but local.sh's docker-volume rebuild
    on restart clears them; CI runs are fresh per matrix job
    anyway.