Skip to content

test: resolve two_phase_e2e xfails (AI-3, AI-4)#63

Merged
rolandpg merged 1 commit into
masterfrom
test-fix/xfail-cleanup
Apr 17, 2026
Merged

test: resolve two_phase_e2e xfails (AI-3, AI-4)#63
rolandpg merged 1 commit into
masterfrom
test-fix/xfail-cleanup

Conversation

@rolandpg
Copy link
Copy Markdown
Owner

AI-3 — generate() call count audit

Traced every synchronous call site reaching zettelforge.llm_client.generate from MemoryManager.remember_with_extraction:

Caller File Calls per remember_with_extraction
FactExtractor.extract src/zettelforge/fact_extractor.py:46 1
MemoryUpdater.decide (per fact) src/zettelforge/memory_updater.py:45 1 × N facts
MemoryUpdater.decide retry (on parse fail) src/zettelforge/memory_updater.py:51 0–1 × N facts
MemoryUpdater.applyself.mm.remember src/zettelforge/memory_manager.py:136 0 (sync path defers all LLM work to background thread)

Synchronous total for a single-fact input: 2 calls (1 extract + 1 decide).

The xfail reason ("calls generate 4x") was an undercount of the real problem: background daemon threads (_enrichment_worker in memory_manager.py:127) share the module-level generate patch and race the foreground test, consuming side_effect entries non-deterministically. Each remember() enqueues up to 3 jobs (_run_enrichment causal extraction, _run_llm_ner, _run_evolution), each of which calls generate if the daemon drains before the patch context ends.

No production bug — this is purely a test/mock-layer issue.

AI-4 — Option chosen: Fix (prompt-aware side_effect)

Why this option: Smallest diff that restores real coverage and is robust to call-count drift.

  • Option 2 (delete) rejected: test_memory_updater.py tests UPDATE/NOOP routing at the MemoryUpdater level with direct mocking, but does not cover the end-to-end path through remember_with_extraction (extraction → decide → apply → supersession). Deleting loses meaningful integration coverage.
  • Option 3 (rewrite on mock provider) rejected: larger diff, couples Stream C to Stream B's provider-seeding API, and still has the same background-thread race.
  • Option 1 (fix): replaced the brittle call-count-indexed side_effect list with a routing function that inspects the prompt text ("Extract the most important facts" vs "Compare this new fact"). Background-thread generate() calls fall through to return "" and no-op gracefully through the existing retry/exception handlers. Robust to retries and enrichment-worker races.

Diff: tests/test_two_phase_e2e.py, +18 / -15, single file. No src/ changes.

Verification

$ ZETTELFORGE_LLM_PROVIDER=mock ZETTELFORGE_BACKEND=sqlite ZETTELFORGE_EMBEDDING_PROVIDER=fastembed \
    /tmp/nexus-stream-c-venv/bin/python -m pytest tests/test_two_phase_e2e.py -v --timeout=30 --timeout-method=thread

tests/test_two_phase_e2e.py::TestRememberWithExtraction::test_extracts_and_stores_facts PASSED  [ 20%]
tests/test_two_phase_e2e.py::TestRememberWithExtraction::test_returns_empty_for_low_importance PASSED  [ 40%]
tests/test_two_phase_e2e.py::TestRememberWithExtraction::test_update_supersedes_old_note PASSED  [ 60%]
tests/test_two_phase_e2e.py::TestRememberWithExtraction::test_noop_stores_nothing_new PASSED  [ 80%]
tests/test_two_phase_e2e.py::TestRememberWithExtraction::test_method_exists PASSED  [100%]

========================= 5 passed, 1 warning in 3.73s =========================

Before: 3 passed, 2 xfailed. After: 5 passed, 0 xfailed. No tests re-marked xfail.

AI-3 audit: remember_with_extraction synchronously calls llm_client.generate
twice per fact (1 FactExtractor.extract + 1 MemoryUpdater.decide), plus up to
1 retry in decide() when JSON parsing fails. The "4x" reason originates from
background enrichment workers (_run_enrichment causal extraction, _run_llm_ner,
_run_evolution) that share the module-level generate() patch and race the
foreground test, consuming side_effect entries unpredictably.

AI-4: fix via prompt-aware side_effect. Replaced the brittle call-count-indexed
side_effect list with a routing function that inspects the prompt text and
returns the appropriate canned response. This is robust to retries and
background-thread generate() calls (they receive "" and no-op gracefully).
Smallest diff that restores coverage without touching src/.

Verified: all 5 tests in test_two_phase_e2e.py PASS, 0 xfails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 21:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the end-to-end two-phase pipeline tests to remove brittle generate() call-order assumptions and un-xfail previously flaky cases caused by background enrichment threads sharing the patched generate().

Changes:

  • Removed xfail markers from two remember_with_extraction E2E tests.
  • Replaced call-count-indexed mock_generate.side_effect lists with prompt-aware routing functions that return the correct response based on prompt content.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 44 to +56
def test_update_supersedes_old_note(self, mock_generate, fresh_mm):
old_note, _ = fresh_mm.remember("APT28 uses DROPBEAR malware", domain="cti")

# Mock returns: extraction → decision → any extra generate calls get empty
mock_generate.side_effect = [
'[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]',
'{"operation": "UPDATE", "reason": "refines old intel"}',
] + [""] * 10 # Extra calls (synthesis, causal, etc.) get empty string
# Prompt-aware mock: responds based on prompt content rather than call order.
# This is robust to call count drift (retries, background enrichment workers).
def _route(prompt, *args, **kwargs):
if "Extract the most important facts" in prompt:
return '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]'
if "Compare this new fact" in prompt:
return '{"operation": "UPDATE", "reason": "refines old intel"}'
return ""

mock_generate.side_effect = _route
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test, fresh_mm.remember(...) can enqueue/background-run enrichment that calls the patched zettelforge.llm_client.generate before mock_generate.side_effect is set. In that window the mock returns a Mock object, which can trigger extra retries/log noise and reintroduce timing-dependent behavior. Consider setting a safe default (return_value = "" or the routing side_effect) before calling remember() to keep background-thread calls deterministic.

Copilot uses AI. Check for mistakes.
Comment on lines 68 to +79
def test_noop_stores_nothing_new(self, mock_generate, fresh_mm):
fresh_mm.remember("APT28 targets NATO", domain="cti")
initial_count = fresh_mm.store.count_notes()

mock_generate.side_effect = [
'[{"fact": "APT28 targets NATO", "importance": 6}]',
'{"operation": "NOOP", "reason": "already stored"}',
] + [""] * 10 # Extra calls get empty string
def _route(prompt, *args, **kwargs):
if "Extract the most important facts" in prompt:
return '[{"fact": "APT28 targets NATO", "importance": 6}]'
if "Compare this new fact" in prompt:
return '{"operation": "NOOP", "reason": "already stored"}'
return ""

mock_generate.side_effect = _route
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fresh_mm.remember(...) happens before mock_generate.side_effect is configured, but remember() can spawn background enrichment that calls the patched generate() concurrently. Until side_effect is set, those calls will return a Mock instance and may cause nondeterministic retries/logging. Set a safe default behavior for mock_generate before calling remember() (e.g., route function or return_value="") to fully eliminate the race.

Copilot uses AI. Check for mistakes.
@rolandpg rolandpg merged commit 0136888 into master Apr 17, 2026
14 of 15 checks passed
rolandpg added a commit that referenced this pull request Apr 19, 2026
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency
hardening, test-suite hygiene, and brand/docs polish.

Highlights:
- feat: Sigma + YARA as first-class memory entities with LLM rule
  explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold _write_lock
  (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped,
  2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict
  (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity,
  canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details.

Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml,
server.json, SECURITY.md.
@rolandpg rolandpg mentioned this pull request Apr 19, 2026
rolandpg added a commit that referenced this pull request Apr 19, 2026
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency
hardening, test-suite hygiene, and brand/docs polish.

Highlights:
- feat: Sigma + YARA as first-class memory entities with LLM rule
  explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold _write_lock
  (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped,
  2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict
  (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity,
  canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details.

Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml,
server.json, SECURITY.md.
@rolandpg rolandpg deleted the test-fix/xfail-cleanup branch April 25, 2026 05:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants