test: resolve two_phase_e2e xfails (AI-3, AI-4) #63
Conversation
AI-3 audit: `remember_with_extraction` synchronously calls `llm_client.generate` twice per fact (1 `FactExtractor.extract` + 1 `MemoryUpdater.decide`), plus up to 1 retry in `decide()` when JSON parsing fails. The "4x" reason originates from background enrichment workers (`_run_enrichment` causal extraction, `_run_llm_ner`, `_run_evolution`) that share the module-level `generate()` patch and race the foreground test, consuming `side_effect` entries unpredictably.

AI-4: fix via prompt-aware `side_effect`. Replaced the brittle call-count-indexed `side_effect` list with a routing function that inspects the prompt text and returns the appropriate canned response. This is robust to retries and background-thread `generate()` calls (they receive `""` and no-op gracefully). Smallest diff that restores coverage without touching `src/`.

Verified: all 5 tests in `test_two_phase_e2e.py` PASS, 0 xfails.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Updates the end-to-end two-phase pipeline tests to remove brittle generate() call-order assumptions and un-xfail previously flaky cases caused by background enrichment threads sharing the patched generate().
Changes:
- Removed `xfail` markers from two `remember_with_extraction` E2E tests.
- Replaced call-count-indexed `mock_generate.side_effect` lists with prompt-aware routing functions that return the correct response based on prompt content.
```diff
 def test_update_supersedes_old_note(self, mock_generate, fresh_mm):
     old_note, _ = fresh_mm.remember("APT28 uses DROPBEAR malware", domain="cti")

-    # Mock returns: extraction → decision → any extra generate calls get empty
-    mock_generate.side_effect = [
-        '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]',
-        '{"operation": "UPDATE", "reason": "refines old intel"}',
-    ] + [""] * 10  # Extra calls (synthesis, causal, etc.) get empty string
+    # Prompt-aware mock: responds based on prompt content rather than call order.
+    # This is robust to call count drift (retries, background enrichment workers).
+    def _route(prompt, *args, **kwargs):
+        if "Extract the most important facts" in prompt:
+            return '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]'
+        if "Compare this new fact" in prompt:
+            return '{"operation": "UPDATE", "reason": "refines old intel"}'
+        return ""
+
+    mock_generate.side_effect = _route
```
In this test, fresh_mm.remember(...) can enqueue/background-run enrichment that calls the patched zettelforge.llm_client.generate before mock_generate.side_effect is set. In that window the mock returns a Mock object, which can trigger extra retries/log noise and reintroduce timing-dependent behavior. Consider setting a safe default (return_value = "" or the routing side_effect) before calling remember() to keep background-thread calls deterministic.
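The reviewer's concern can be reproduced in isolation: an unconfigured `Mock` returns a `Mock` object rather than a string, so any background `generate()` call that lands before `side_effect` is assigned feeds a non-string into the JSON/parsing path. A minimal sketch (standalone, not the actual test fixture):

```python
from unittest.mock import Mock

# A bare Mock, as mock_generate is during the remember() window before
# side_effect is assigned: calls return a Mock object, not a string.
bare = Mock()
r = bare("background enrichment prompt")
print(isinstance(r, str))  # False — downstream JSON parsing then misbehaves

# The reviewer's suggestion: configure a safe default up front, BEFORE any
# code path (foreground or daemon thread) can call the patched generate().
safe = Mock(return_value="")
print(safe("background enrichment prompt"))  # ""
```

Assigning the routing function itself as the default before calling `remember()` would work equally well, since unknown prompts already fall through to `""`.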
```diff
 def test_noop_stores_nothing_new(self, mock_generate, fresh_mm):
     fresh_mm.remember("APT28 targets NATO", domain="cti")
     initial_count = fresh_mm.store.count_notes()

-    mock_generate.side_effect = [
-        '[{"fact": "APT28 targets NATO", "importance": 6}]',
-        '{"operation": "NOOP", "reason": "already stored"}',
-    ] + [""] * 10  # Extra calls get empty string
+    def _route(prompt, *args, **kwargs):
+        if "Extract the most important facts" in prompt:
+            return '[{"fact": "APT28 targets NATO", "importance": 6}]'
+        if "Compare this new fact" in prompt:
+            return '{"operation": "NOOP", "reason": "already stored"}'
+        return ""
+
+    mock_generate.side_effect = _route
```
fresh_mm.remember(...) happens before mock_generate.side_effect is configured, but remember() can spawn background enrichment that calls the patched generate() concurrently. Until side_effect is set, those calls will return a Mock instance and may cause nondeterministic retries/logging. Set a safe default behavior for mock_generate before calling remember() (e.g., route function or return_value="") to fully eliminate the race.
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency hardening, test-suite hygiene, and brand/docs polish. Highlights:

- feat: Sigma + YARA as first-class memory entities with LLM rule explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold `_write_lock` (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped, 2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity, canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details. Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml, server.json, SECURITY.md.
AI-3 — generate() call count audit
Traced every synchronous call site reaching `zettelforge.llm_client.generate` from `MemoryManager.remember_with_extraction`:

| Call site | Location |
| --- | --- |
| `FactExtractor.extract` | `src/zettelforge/fact_extractor.py:46` |
| `MemoryUpdater.decide` (per fact) | `src/zettelforge/memory_updater.py:45` |
| `MemoryUpdater.decide` retry (on parse fail) | `src/zettelforge/memory_updater.py:51` |
| `MemoryUpdater.apply` → `self.mm.remember` | `src/zettelforge/memory_manager.py:136` |

Synchronous total for a single-fact input: 2 calls (1 extract + 1 decide).
The xfail reason ("calls generate 4x") was an undercount of the real problem: background daemon threads (`_enrichment_worker` in `memory_manager.py:127`) share the module-level `generate` patch and race the foreground test, consuming `side_effect` entries non-deterministically. Each `remember()` enqueues up to 3 jobs (`_run_enrichment` causal extraction, `_run_llm_ner`, `_run_evolution`), each of which calls `generate` if the daemon drains before the patch context ends.

No production bug — this is purely a test/mock-layer issue.
AI-4 — Option chosen: Fix (prompt-aware side_effect)
Why this option: Smallest diff that restores real coverage and is robust to call-count drift.
- `test_memory_updater.py` tests UPDATE/NOOP routing at the `MemoryUpdater` level with direct mocking, but does not cover the end-to-end path through `remember_with_extraction` (extraction → decide → apply → supersession). Deleting the tests would lose meaningful integration coverage.
- Replaced the `side_effect` list with a routing function that inspects the prompt text (`"Extract the most important facts"` vs `"Compare this new fact"`). Background-thread `generate()` calls fall through to `return ""` and no-op gracefully through the existing retry/exception handlers. Robust to retries and enrichment-worker races.

Diff:
`tests/test_two_phase_e2e.py`, +18 / -15, single file. No `src/` changes.

Verification
Before: 3 passed, 2 xfailed. After: 5 passed, 0 xfailed. No tests re-marked xfail.
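The fixed pattern can be sketched end to end in isolation. The pipeline below is a stand-in for the real `remember_with_extraction` flow (function names, prompt text, and canned JSON are illustrative, not the actual zettelforge API), showing both JSON phases parsing cleanly while a stray call gets `""`:

```python
import json
from unittest.mock import Mock

# Hypothetical prompt-routing mock: marker substrings select the canned reply;
# anything unrecognised (e.g. background enrichment prompts) gets "".
def _route(prompt, *args, **kwargs):
    if "Extract the most important facts" in prompt:
        return '[{"fact": "APT28 no longer uses DROPBEAR", "importance": 9}]'
    if "Compare this new fact" in prompt:
        return '{"operation": "UPDATE", "reason": "refines old intel"}'
    return ""

generate = Mock(side_effect=_route)

# Two-phase foreground path: extract -> decide, each response parsed as JSON.
facts = json.loads(generate("Extract the most important facts from: ..."))
decision = json.loads(generate("Compare this new fact to stored notes: ..."))
op = decision["operation"]

# A stray background-worker call falls through to "" and can no-op gracefully.
stray = generate("causal enrichment prompt")
print(op, repr(stray))
```

Because routing keys on content rather than call order, the result is identical whether the stray call arrives before, between, or after the two foreground calls.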