Skip to content

fix(tests): stabilize 3 CI regressions exposed by test-suite audit sprint#67

Merged
rolandpg merged 5 commits into
masterfrom
stabilize/master-regressions
Apr 17, 2026
Merged

fix(tests): stabilize 3 CI regressions exposed by test-suite audit sprint#67
rolandpg merged 5 commits into
masterfrom
stabilize/master-regressions

Conversation

@rolandpg
Copy link
Copy Markdown
Owner

Summary

Fixes three CI-only test failures on master, each introduced or exposed by the AI test-suite audit sprint (commits af080ec, 7fdf013, 0136888). All three are test-layer fixes — zero source code (src/) changes. No new pytest plugins, no xfail/skipif reintroductions.

Background: docs/superpowers/research/2026-04-17-test-suite-audit.md.

Regression 1 — tests/test_causal_extraction.py::test_causal_extraction (line 86)

  • Root cause: commit af080ec (AI-8) replaced return False with assert len(causal_edges) > 0. Under CI env (ZETTELFORGE_LLM_PROVIDER=mock + ZETTELFORGE_BACKEND=sqlite) this can never pass: (a) MockProvider returns unparseable text, so extract_causal_triples returns []; (b) even with a seeded-JSON mock, store_causal_edges writes to SQLite via backend=self.store, while the test reads get_knowledge_graph()._edges — the JSONL singleton SQLite never touches.
  • Fix chosen: user's fallback — assert the ingestion pipeline ran (assert note.id) instead of asserting edge count. Diagnostic prints retained for local real-LLM runs.
  • Evidence: 1 passed under CI-sim env (empty AMEM_DATA_DIR and polluted).

Regression 2 — tests/test_core.py::TestVectorRecall::test_vector_recall_latency_reasonable (line 134)

  • Root cause: same enrichment-queue race as AI-6; seed mm.remember() loop does not block on background enrichment before recall() runs.
  • Fix chosen: apply AI-6 pattern — sync=True on each of the 10 seed remembers. Mirrors af080ec exactly.
  • Evidence: 20/20 passed with pytest-repeat --count=20.

Regression 3 — tests/test_langchain_retriever.py::TestZettelForgeRetriever::test_metadata_fields (line 66)

  • Root cause: embedding-similarity mismatch, not a race. The fuzzy query "XZ Utils CVE" does not score highly enough on fastembed (nomic-embed-text-v1.5-Q) against the 3-note seed corpus, so mm.recall returns 0 notes. Other tests in the class pass because they use stronger-matching APT28 queries. Deterministic failure (10/10) — not the flake described in the audit's Regression 3 write-up; the actual failure is assert len(docs) > 0 on line 66, not the "security_ops" in domains on line 79.
  • Fix chosen: test-only change — swap the fuzzy query for the exact CVE ID "CVE-2024-3094", which hits the CVE-regex recall path and reliably returns the seeded security_ops note. The test's intent is metadata-field presence; the specific query is incidental.
  • Evidence: 10/10 passed with pytest-repeat --count=10 under CI-sim env.

Rerun evidence (combined)

CI=true ZETTELFORGE_LLM_PROVIDER=mock ZETTELFORGE_BACKEND=sqlite \
  ZETTELFORGE_EMBEDDING_PROVIDER=fastembed pytest \
  tests/test_causal_extraction.py \
  tests/test_core.py::TestVectorRecall::test_vector_recall_latency_reasonable \
  tests/test_langchain_retriever.py::TestZettelForgeRetriever::test_metadata_fields -v
# => 3 passed

# 20x latency rerun: 20/20 passed

Test plan

  • Run each failing test solo under CI-sim env and observe it now passes
  • Run latency test with --count=20 to confirm race fix
  • Run metadata_fields with --count=10 to confirm query fix is deterministic
  • Run the full tests/test_core.py, tests/test_causal_extraction.py, tests/test_langchain_retriever.py (26 tests) to confirm no regressions
  • GitHub Actions CI on this PR passes

References

  • Audit doc: docs/superpowers/research/2026-04-17-test-suite-audit.md
  • Prior commits: af080ec (AI-6, AI-8 — source of Regression 1), 0136888 (master HEAD at sprint end)

🤖 Generated with Claude Code

rolandpg and others added 3 commits April 17, 2026 17:13
…-ran

The `assert len(causal_edges) > 0` introduced in af080ec (AI-8) fails
deterministically in CI under ZETTELFORGE_LLM_PROVIDER=mock + BACKEND=sqlite.
Causal edges require:
  1. An LLM producing a parseable JSON array of triples (mock provider
     returns "mock response" which fails extract_json(expect="array")).
  2. Routing through the JSONL KG singleton the test inspects -- but
     SQLiteBackend.add_kg_edge writes to SQLite, not the JSONL singleton.

Neither holds in CI, so the assertion can never pass regardless of the LLM.
Minimal fix: assert the ingestion pipeline ran (note.id produced) instead
of asserting causal-edge presence. Diagnostic prints remain useful when
this test is run locally against a real LLM.

Refs audit doc docs/superpowers/research/2026-04-17-test-suite-audit.md
and the AI-8 regression introduced in af080ec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y test

Same enrichment-queue race as AI-6: seed remember() calls queue LLM NER
enrichment to a background worker; subsequent recall() can race ahead and
observe zero results. Applies the AI-6 pattern (sync=True on seed calls)
to the 10-note seed loop in test_vector_recall_latency_reasonable.

Verified 20/20 with pytest-repeat --count=20 under CI-simulation env.

Refs audit doc docs/superpowers/research/2026-04-17-test-suite-audit.md
and the AI-6 fix pattern in af080ec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns docs

Root cause: the fuzzy query "XZ Utils CVE" does not score highly enough
on fastembed (nomic-embed-text-v1.5-Q) against the 3-note seed corpus.
Recall returns 0 results, so `assert len(docs) > 0` fails deterministically.
This is an embedding-similarity mismatch, not a race or source-layer bug
(other tests in the same file pass with APT28-centric queries that DO
match their seed note).

Minimal test-only fix: change the query to the exact CVE ID
"CVE-2024-3094" so the CVE-regex recall path in MemoryManager reliably
returns the seeded security_ops note. The test's intent is to validate
metadata-field presence, so the specific query is incidental.

Verified 10/10 with pytest-repeat --count=10 under CI-simulation env.

Refs audit doc docs/superpowers/research/2026-04-17-test-suite-audit.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 17, 2026 22:19
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 594a258363

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread tests/test_causal_extraction.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Stabilizes three CI-only test regressions uncovered during the recent test-suite audit sprint by adjusting test expectations and inputs to align with CI’s mock/SQLite/fastembed execution paths (no src/ changes).

Changes:

  • Update causal-extraction test to assert end-to-end ingestion succeeded rather than asserting LLM-derived causal edges under mock/SQLite.
  • Remove enrichment-queue race in vector recall latency test by using sync=True during seeding.
  • Make LangChain retriever metadata test deterministic by querying an exact CVE ID that reliably matches the seeded corpus.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
tests/test_causal_extraction.py Switches assertion from causal-edge count to ingestion success; adjusts __main__ execution path.
tests/test_core.py Seeds notes with sync=True to avoid recall racing background enrichment.
tests/test_langchain_retriever.py Uses an exact CVE query to make retrieval deterministic for metadata assertions.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_causal_extraction.py Outdated
Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 17, 2026

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • huggingface.co
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/test_causal_extraction.py -v (dns block)
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/test_causal_extraction.py::test_causal_extraction -v (dns block)
    • Triggering command: /home/REDACTED/.local/bin/pytest pytest tests/test_core.py::TestVectorRecall::test_vector_recall_latency_reasonable -v (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

…d mock

Addresses PR #67 review feedback (chatgpt-codex-connector P2): the prior
`assert note.id` passed even when causal extraction, JSON parsing, or KG
edge persistence was completely broken, defeating the test's purpose.

Now drives the full pipeline via a prompt-routed mock (same pattern as
AI-3/AI-4 in tests/test_two_phase_e2e.py) and queries the SQLite backend
directly via `get_causal_edges("intrusion_set", "apt28")` — where
`store_causal_edges(..., backend=self.store)` actually writes, not the
JSONL KnowledgeGraph singleton that the old assertion targeted.

Uses `sync=True` on `mm.remember` so causal extraction runs inline and
the backend write completes before the read — no enrichment-queue race.

Verified: 5/5 consecutive passes under the CI-sim env, plus the full
tri-file suite from PR #67 (tests/test_causal_extraction.py,
tests/test_core.py::TestVectorRecall::test_vector_recall_latency_reasonable,
tests/test_langchain_retriever.py) = 8 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rolandpg rolandpg self-assigned this Apr 17, 2026
@rolandpg rolandpg merged commit ba558a3 into master Apr 17, 2026
11 checks passed
@rolandpg rolandpg deleted the stabilize/master-regressions branch April 17, 2026 22:39
rolandpg added a commit that referenced this pull request Apr 19, 2026
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency
hardening, test-suite hygiene, and brand/docs polish.

Highlights:
- feat: Sigma + YARA as first-class memory entities with LLM rule
  explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold _write_lock
  (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped,
  2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict
  (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity,
  canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details.

Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml,
server.json, SECURITY.md.
@rolandpg rolandpg mentioned this pull request Apr 19, 2026
rolandpg added a commit that referenced this pull request Apr 19, 2026
Detection-rules-as-memory, MCP Registry publication, SQLite concurrency
hardening, test-suite hygiene, and brand/docs polish.

Highlights:
- feat: Sigma + YARA as first-class memory entities with LLM rule
  explainer (#70) + Detection Rules as Memory README section (#74)
- feat: MCP Registry publication (server.json + mcp-name tag) (#75)
- fix: SQLite reader concurrency — 16 methods now hold _write_lock
  (closes #68, fixes a production read-during-write race) (#69)
- fix: 3 CI test regressions stabilized (#67)
- chore: test-suite hygiene — 280→305 passing, 17→10 skipped,
  2→0 xfailed; migrated langchain_retriever to Pydantic V2 ConfigDict
  (#62, #63, #64, #65)
- brand: neural-chain architecture diagram + light/dark parity,
  canonical security channels, refreshed social preview (#61)

See CHANGELOG.md for details.

Bumps: pyproject.toml, src/zettelforge/__init__.py, mkdocs.yml,
server.json, SECURITY.md.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants