added a nice chat interface component to play with#3
Merged
Conversation
djl11
added a commit
that referenced
this pull request
May 28, 2026
…k failures
Four E2E spending tests have been failing in CI:
- test_assistant_limit_check (DID NOT RAISE SpendingLimitExceededError)
- test_inflight_cancellation_on_limit_exceeded (timing wrong)
- test_limit_check_callback_allows_under_limit (allowed=False, cap=0.0, spend=$10)
- test_limit_exceeded_blocks_llm_call (DID NOT RAISE)
All four share a single root cause: state leaking through a
SHARED "SpendingTest Assistant" record reused across every test
in the file. The old e2e_config fixture did a "find-by-name then
reuse, else create" lookup. Every test in TestE2ESpendingLimits
got the same agent_id, so:
1. Cumulative spend (current_spend) is NEVER reset by Orchestra
once an LLM call lands on it. Once any test makes a real LLM
call, the assistant carries that spend for the rest of the
session. test_limit_check_callback_allows_under_limit fails
when it sees current_spend=$10 from earlier tests, even
though it asserts the assistant "starts fresh".
2. The PATCH-based cap restore in test bodies
(test_limit_exceeded_blocks_llm_call etc.) reads the
*current* cap then restores it. If a previous test leaked
cap=0, that becomes the "original" for the next test,
making the leak permanent.
3. The fixture-level cap=None reset is best-effort with
bare-except and silently fails on any Orchestra hiccup,
leaving the cap unreset.
The previous "await the reset PATCH" fix (c583ab2) addressed
fragility #3 but couldn't address #1 (spend accumulation) or #2
(test-body restore racing the reset).
Fix: each test gets its OWN freshly-created assistant with a
unique surname (test-node-slug + 8-char UUID). The fixture:
- Always POSTs a new assistant at setup (no find-by-name reuse)
- Raises loudly on create failure (was: silently leaving
test_agent_id=None then propagating to SESSION_DETAILS)
- DELETEs the assistant at teardown via /assistant/{id}
No state survives between tests:
- Fresh agent_id per test → spend starts at 0
- Fresh cap=25 per test → no cap-leak between tests
- Delete in teardown → no residual rows accumulate
The non-E2E tests in the file (TestAtomicUpsert,
TestUpdateCumulativeSpend, …) don't use e2e_config — they mock
SESSION_DETAILS and are unaffected.
Side effects:
- Each test creates + deletes an assistant: ~2 extra HTTP
round-trips per test. Acceptable cost given the
correctness win.
- Local DB rows accumulate transiently if a teardown DELETE
fails (bare-except), but local.sh's docker-volume rebuild
on restart clears them; CI runs are fresh per matrix job
anyway.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.