test: resolve two_phase_e2e xfails (AI-3, AI-4) #63
Conversation
AI-3 audit: `remember_with_extraction` synchronously calls `llm_client.generate` twice per fact (1 `FactExtractor.extract` + 1 `MemoryUpdater.decide`), plus up to 1 retry in `decide()` when JSON parsing fails. The "4x" reason originates from background enrichment workers (`_run_enrichment` causal extraction, `_run_llm_ner`, `_run_evolution`) that share the module-level `generate()` patch and race the foreground test, consuming `side_effect` entries unpredictably.

AI-4: fix via prompt-aware `side_effect`. Replaced the brittle call-count-indexed `side_effect` list with a routing function that inspects the prompt text and returns the appropriate canned response. This is robust to retries and background-thread `generate()` calls (they receive `""` and no-op gracefully). Smallest diff that restores coverage without touching `src/`.

Verified: all 5 tests in `test_two_phase_e2e.py` PASS, 0 xfails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Updates the end-to-end two-phase pipeline tests to remove brittle generate() call-order assumptions and un-xfail previously flaky cases caused by background enrichment threads sharing the patched generate().
Changes:
- Removed `xfail` markers from two `remember_with_extraction` E2E tests.
- Replaced call-count-indexed `mock_generate.side_effect` lists with prompt-aware routing functions that return the correct response based on prompt content.
```diff
 def test_update_supersedes_old_note(self, mock_generate, fresh_mm):
     old_note, _ = fresh_mm.remember("APT28 uses DROPBEAR malware", domain="cti")

-    # Mock returns: extraction → decision → any extra generate calls get empty
-    mock_generate.side_effect = [
-        '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]',
-        '{"operation": "UPDATE", "reason": "refines old intel"}',
-    ] + [""] * 10  # Extra calls (synthesis, causal, etc.) get empty string
+    # Prompt-aware mock: responds based on prompt content rather than call order.
+    # This is robust to call count drift (retries, background enrichment workers).
+    def _route(prompt, *args, **kwargs):
+        if "Extract the most important facts" in prompt:
+            return '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]'
+        if "Compare this new fact" in prompt:
+            return '{"operation": "UPDATE", "reason": "refines old intel"}'
+        return ""
+
+    mock_generate.side_effect = _route
```
In this test, fresh_mm.remember(...) can enqueue/background-run enrichment that calls the patched zettelforge.llm_client.generate before mock_generate.side_effect is set. In that window the mock returns a Mock object, which can trigger extra retries/log noise and reintroduce timing-dependent behavior. Consider setting a safe default (return_value = "" or the routing side_effect) before calling remember() to keep background-thread calls deterministic.
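The reviewer's concern can be reproduced in isolation: an unconfigured `Mock` returns a `Mock` object rather than a string, so any background `generate()` call that lands before `side_effect` is assigned feeds a non-string into the JSON/parsing path. A minimal sketch (standalone, not the actual test fixture):

```python
from unittest.mock import Mock

# A bare Mock, as mock_generate is during the remember() window before
# side_effect is assigned: calls return a Mock object, not a string.
bare = Mock()
r = bare("background enrichment prompt")
print(isinstance(r, str))  # False — downstream JSON parsing then misbehaves

# The reviewer's suggestion: configure a safe default up front, BEFORE any
# code path (foreground or daemon thread) can call the patched generate().
safe = Mock(return_value="")
print(safe("background enrichment prompt"))  # ""
```

Assigning the routing function itself as the default before calling `remember()` would work equally well, since unknown prompts already fall through to `""`.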
```diff
 def test_noop_stores_nothing_new(self, mock_generate, fresh_mm):
     fresh_mm.remember("APT28 targets NATO", domain="cti")
     initial_count = fresh_mm.store.count_notes()

-    mock_generate.side_effect = [
-        '[{"fact": "APT28 targets NATO", "importance": 6}]',
-        '{"operation": "NOOP", "reason": "already stored"}',
-    ] + [""] * 10  # Extra calls get empty string
+    def _route(prompt, *args, **kwargs):
+        if "Extract the most important facts" in prompt:
+            return '[{"fact": "APT28 targets NATO", "importance": 6}]'
+        if "Compare this new fact" in prompt:
+            return '{"operation": "NOOP", "reason": "already stored"}'
+        return ""
+
+    mock_generate.side_effect = _route
```
fresh_mm.remember(...) happens before mock_generate.side_effect is configured, but remember() can spawn background enrichment that calls the patched generate() concurrently. Until side_effect is set, those calls will return a Mock instance and may cause nondeterministic retries/logging. Set a safe default behavior for mock_generate before calling remember() (e.g., route function or return_value="") to fully eliminate the race.
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency hardening, test-suite hygiene, and brand/docs polish. Highlights:

- feat: Sigma + YARA as first-class memory entities with LLM rule explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold `_write_lock` (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped, 2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity, canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details. Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml, server.json, SECURITY.md.
AI-3 — generate() call count audit
Traced every synchronous call site reaching `zettelforge.llm_client.generate` from `MemoryManager.remember_with_extraction`:

| Call site | Location |
| --- | --- |
| `FactExtractor.extract` | `src/zettelforge/fact_extractor.py:46` |
| `MemoryUpdater.decide` (per fact) | `src/zettelforge/memory_updater.py:45` |
| `MemoryUpdater.decide` retry (on parse fail) | `src/zettelforge/memory_updater.py:51` |
| `MemoryUpdater.apply` → `self.mm.remember` | `src/zettelforge/memory_manager.py:136` |

Synchronous total for a single-fact input: 2 calls (1 extract + 1 decide).
The xfail reason ("calls generate 4x") was an undercount of the real problem: background daemon threads (`_enrichment_worker` in `memory_manager.py:127`) share the module-level `generate` patch and race the foreground test, consuming `side_effect` entries non-deterministically. Each `remember()` enqueues up to 3 jobs (`_run_enrichment` causal extraction, `_run_llm_ner`, `_run_evolution`), each of which calls `generate` if the daemon drains before the patch context ends.

No production bug — this is purely a test/mock-layer issue.
AI-4 — Option chosen: Fix (prompt-aware side_effect)
Why this option: Smallest diff that restores real coverage and is robust to call-count drift.
- `test_memory_updater.py` tests UPDATE/NOOP routing at the `MemoryUpdater` level with direct mocking, but does not cover the end-to-end path through `remember_with_extraction` (extraction → decide → apply → supersession). Deleting the tests would lose meaningful integration coverage.
- Replaced the `side_effect` list with a routing function that inspects the prompt text (`"Extract the most important facts"` vs `"Compare this new fact"`). Background-thread `generate()` calls fall through to `return ""` and no-op gracefully through the existing retry/exception handlers. Robust to retries and enrichment-worker races.

Diff:
`tests/test_two_phase_e2e.py`, +18 / -15, single file. No `src/` changes.

Verification
Before: 3 passed, 2 xfailed. After: 5 passed, 0 xfailed. No tests re-marked xfail.
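The fixed pattern can be sketched end to end in isolation. The pipeline below is a stand-in for the real `remember_with_extraction` flow (function names, prompt text, and canned JSON are illustrative, not the actual zettelforge API), showing both JSON phases parsing cleanly while a stray call gets `""`:

```python
import json
from unittest.mock import Mock

# Hypothetical prompt-routing mock: marker substrings select the canned reply;
# anything unrecognised (e.g. background enrichment prompts) gets "".
def _route(prompt, *args, **kwargs):
    if "Extract the most important facts" in prompt:
        return '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]'
    if "Compare this new fact" in prompt:
        return '{"operation": "UPDATE", "reason": "refines old intel"}'
    return ""

generate = Mock(side_effect=_route)

# Two-phase foreground path: extract -> decide, each response parsed as JSON.
facts = json.loads(generate("Extract the most important facts from: ..."))
decision = json.loads(generate("Compare this new fact to stored notes: ..."))
op = decision["operation"]

# A stray background-worker call falls through to "" and can no-op gracefully.
stray = generate("causal enrichment prompt")
print(op, repr(stray))
```

Because routing keys on content rather than call order, the result is identical whether the stray call arrives before, between, or after the two foreground calls.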