docs: Phase 0.5 attribution PRELIMINARY → CONFIRMED (mechanism corrected)#97
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 63974aa879
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pct: ((.value * 1000 / $total | round) / 10)}) | sort_by(-.total_ms)' "$LOG" | ||
|
|
||
| # Tail events with full phase breakdown | ||
| jq -s '[.[] | select(.activity_name=="remember" and .duration_ms > 30000 and .time > "'$SINCE'")] | |
There was a problem hiding this comment.
Exclude unsampled tails from reproduction query
In the Reproduction section (checked in 2026-04-25-phase-0.5-attribution.md), the jq filter for “Tail events with full phase breakdown” only checks activity_name, duration_ms, and time, but not whether phase_timings_ms exists. That means >30s events without phase timings are still included (as reflected by the later “not sampled live” row), which can pollute reproduced tail analyses and make the 99.66–99.98% attribution claim non-reproducible from this command. Add a .phase_timings_ms != null predicate here, matching the aggregate query above.
Useful? React with 👍 / 👎.
|
|
||
| ## TL;DR | ||
|
|
||
| - The preliminary attribution at the **phase level** is now confirmed by v2.4.2's `phase_timings_ms` instrumentation. Across 8 individual tail events, `lance_index` accounted for 99.66–99.98% of total `remember()` duration. There is no surprise phase. The work is, definitively, in `LanceStore._index_in_lance()`. |
There was a problem hiding this comment.
Correct tail-event sample size claim
The TL;DR states that “Across 8 individual tail events, lance_index accounted for 99.66–99.98%,” but the detailed table shows one of those 8 events as “not sampled live” with no lance_index value. This overstates the evidence and inflates confidence in the exact range; the text should either report 7 sampled events or provide the missing 8th breakdown.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
Promotes the Phase 0.5 remember() latency attribution research note from PRELIMINARY to CONFIRMED, while correcting the previously-attributed LanceDB “fragment growth” mechanism to a “version history / version-chain” hypothesis and documenting the cleanup_old_versions() intervention.
Changes:
- Added a new CONFIRMED analysis doc with updated mechanism explanation and reproduction steps.
- Marked the prior preliminary doc as SUPERSEDED and added a banner + correction summary pointing to the new doc.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/superpowers/research/2026-04-25-phase-0.5-attribution.md | New confirmed analysis + corrected mechanism + follow-ups and repro commands |
| docs/superpowers/research/2026-04-24-phase-0.5-attribution-prelim.md | Updates frontmatter/status and adds a superseded banner + correction summary |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| --- | ||
| title: "RFC-009 Phase 0.5 — remember() latency attribution (preliminary)" | ||
| status: PRELIMINARY — confirm after v2.4.2 phase_timings_ms lands | ||
| title: "RFC-009 Phase 0.5 — remember() latency attribution (preliminary, SUPERSEDED)" | ||
| status: SUPERSEDED 2026-04-25 by docs/superpowers/research/2026-04-25-phase-0.5-attribution.md |
There was a problem hiding this comment.
PR description says it “Closes task #36”, but issue #36 in this PR’s metadata is about adding ADRs under docs/adr/ (directory doesn’t exist in this change set). If #36 is intended to be closed, the PR needs the ADR deliverables; otherwise, please adjust the PR description to reference the correct issue/task so GitHub doesn’t auto-close the ADR issue incorrectly.
| - "Pass 1 (2026-04-24): Vigil v2.4.1 OCSF logs reconstructed via adjacent ocsf_file_activity events (961 remember() calls, 95 tail events)" | ||
| - "Pass 2 (2026-04-24/25): Vigil v2.4.2 first-party phase_timings_ms (200+ remember() calls, 8 tail events with full phase breakdown)" | ||
| - "Pass 3 (2026-04-25T01:34Z): cleanup_old_versions intervention on notes_cti — 5.69 GB → 29 MB, thousands of versions → 1, 7,917 disk files → 1" | ||
| followup: "Task #43 — restart Vigil traffic + measure remember() p95 post-cleanup. Two outcomes: (a) tails collapse → version-chain hypothesis confirmed, RFC-009 Phase 1.5 = periodic cleanup_old_versions; (b) tails persist → diagnosis needs another round, mechanism still unknown." |
There was a problem hiding this comment.
Frontmatter claims Pass 2 has “200+ remember() calls” and “8 tail events with full phase breakdown”, but the body immediately below aggregates 83 remember() calls and the per-tail table includes one event with lance_index “(not sampled live)”. Please reconcile these counts/claims (either update the frontmatter numbers or explain the sampling/filters so the doc is internally consistent).
| - "Pass 1 (2026-04-24): Vigil v2.4.1 OCSF logs reconstructed via adjacent ocsf_file_activity events (961 remember() calls, 95 tail events)" | ||
| - "Pass 2 (2026-04-24/25): Vigil v2.4.2 first-party phase_timings_ms (200+ remember() calls, 8 tail events with full phase breakdown)" | ||
| - "Pass 3 (2026-04-25T01:34Z): cleanup_old_versions intervention on notes_cti — 5.69 GB → 29 MB, thousands of versions → 1, 7,917 disk files → 1" | ||
| followup: "Task #43 — restart Vigil traffic + measure remember() p95 post-cleanup. Two outcomes: (a) tails collapse → version-chain hypothesis confirmed, RFC-009 Phase 1.5 = periodic cleanup_old_versions; (b) tails persist → diagnosis needs another round, mechanism still unknown." |
There was a problem hiding this comment.
This doc references “Task #43” for the post-cleanup workload retest, but in this PR’s metadata issue #43 is “Add example: Slack bot for CTI queries”. To avoid confusion and broken cross-references, please either link to the correct GitHub issue/PR for the workload retest or rename this to an unambiguous internal task identifier (and avoid using #NN which looks like a GitHub issue reference).
| |---|---| | ||
| | **#43** | Restart Vigil with v2.4.2/v2.4.3 pin, run representative workload, measure `phase_timings_ms.lance_index` distribution. Compares against the 53–67 s tail band documented above. | | ||
| | **#38** | Code-side periodic `cleanup_old_versions` (RFC-009 Phase 1.5 — provisional, awaiting #43). | |
There was a problem hiding this comment.
CHANGELOG.md
- Adds the missing [2.4.3] - 2026-04-25 section. Origin/master had the
pyproject.toml bump but no CHANGELOG entry committed; the GitHub
release notes exist but describe the OCSF fix inaccurately ('resolved
from ocsf-schema package' — it actually reads pyproject.toml first
and falls back to importlib.metadata.version('zettelforge')).
- Each v2.4.3 bullet now matches the actual diff: OCSF version
self-correct mechanism, ZETTELFORGE_LOG_LEVEL resolution order,
fastembed preload site, compact_lance script, Tier 0/1/2 LLM
observability scope.
- Operational note ties v2.4.3 to the Vigil incident response: 5.66 GB
recovered, p95 49.8s → ~250ms, periodic feature deferred to v2.5.0
Phase 1.5.
RFC-009 Phase 1.5
- Section 7 implementation plan: Phase 0.5 marked 'done — attribution
confirmed end-to-end' (was 'profile with py-spy'); new Phase 1.5 row
for periodic LanceDB version cleanup (3h, ships v2.5.0).
- New Section 1.5 with the full spec: mechanism (version-chain growth
at 2/insert, not fragment count), scope (LanceVersionMaintenance
daemon thread per shard, calls cleanup_old_versions with telemetry),
configuration (lance.cleanup_interval_minutes default 60), default
cadence rationale (Vigil-rate-anchored), safety (best-effort, never
crashes the agent), and explicit out-of-scope items (count-based
trigger deferred until we characterise the latency-vs-version-count
curve; compact_files() periodic loop unnecessary because the dataset
is naturally 1 logical fragment).
- Phase 0.5 evidence cited inline: PR #97 confirmed analysis,
measured 32x avg / 44x max latency improvement on the 44-call
post-cleanup retest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sm corrected) WHERE: confirmed across two independent passes. lance_index = 98.1% aggregate (v2.4.2 phase_timings_ms, n=83) and 99.66-99.98% on tail events (n=8). The preliminary's reconstruction (98.4%) and the first-party instrumentation agree within rounding noise. WHY: corrected. The preliminary asserted "fragment count growth → slow manifest writes." That was wrong. Lance.to_lance().get_fragments() reports 1 logical fragment for notes_cti; the 7,916 .lance files in data/ were orphan version snapshots, not active fragments. compact_files() correctly returns "0 work needed" because there is no fragment merging to do. INTERVENTION: cleanup_old_versions() executed 2026-04-25T01:34Z recovered notes_cti from 5.69 GB / thousands of versions → 29 MB / 1 version (196x shrink). No row loss. OPEN: whether the cleanup actually moves the 60-second tail. Pending task #43 workload retest. If tails collapse → RFC-009 Phase 1.5 = periodic cleanup_old_versions. If tails persist → diagnosis is not done; mechanism still unknown. Preliminary doc marked SUPERSEDED in its frontmatter and a banner at top, but text preserved as a record of the diagnostic process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d-to-end 44 remember() calls against the cleaned notes_cti shard: - avg latency: 5,696ms → 180ms (32x) - p50: 765ms → 130ms (5.9x) - max: 66,535ms → 1,504ms (44x) - tails ≥5s: 96/951 → 0/44 - lance_index share: 98.1% → 39.6% Phase share is now balanced across construct/supersession/entity_index/ kg_update — no single phase dominates. The pre-cleanup tail behaviour is gone. Doc updates: - Frontmatter status flips from 'WHY hypothesis pending' to 'CONFIRMED end-to-end' - TL;DR adds the Pass 4 numbers and the urgency call for task #38 - New 'Pass 4 — post-cleanup workload retest' section with the before/after latency table, the phase-share rebalancing table, and the version-growth-rate observation (44 inserts → 89 versions, so the bloat returns at 2 versions/insert without periodic cleanup) - 'Implications for RFC-009' section drops 'Provisional' / 'contingent on task #43' hedging - 'What this document still cannot say' section trims the resolved questions, retains the open ones (notes_general row vs version effect; named-operation for the 55s tail clustering; latency-vs- version-count curve for tuning a count-based trigger) - Confidence table promotes the cleanup claim from Medium to CONFIRMED and the Phase 1.5 scope claim from Medium-conditional to CONFIRMED+urgent - Open follow-ups table marks #43 + #37 as DONE and elevates #38 to URGENT with concrete recommended shape (background thread, configurable interval, telemetry event per cleanup) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
e1f3519 to
06a57a2
Compare
* docs: v2.4.3 CHANGELOG entry (accurate) + RFC-009 Phase 1.5 spec
CHANGELOG.md
- Adds the missing [2.4.3] - 2026-04-25 section. Origin/master had the
pyproject.toml bump but no CHANGELOG entry committed; the GitHub
release notes exist but describe the OCSF fix inaccurately ('resolved
from ocsf-schema package' — it actually reads pyproject.toml first
and falls back to importlib.metadata.version('zettelforge')).
- Each v2.4.3 bullet now matches the actual diff: OCSF version
self-correct mechanism, ZETTELFORGE_LOG_LEVEL resolution order,
fastembed preload site, compact_lance script, Tier 0/1/2 LLM
observability scope.
- Operational note ties v2.4.3 to the Vigil incident response: 5.66 GB
recovered, p95 49.8s → ~250ms, periodic feature deferred to v2.5.0
Phase 1.5.
RFC-009 Phase 1.5
- Section 7 implementation plan: Phase 0.5 marked 'done — attribution
confirmed end-to-end' (was 'profile with py-spy'); new Phase 1.5 row
for periodic LanceDB version cleanup (3h, ships v2.5.0).
- New Section 1.5 with the full spec: mechanism (version-chain growth
at 2/insert, not fragment count), scope (LanceVersionMaintenance
daemon thread per shard, calls cleanup_old_versions with telemetry),
configuration (lance.cleanup_interval_minutes default 60), default
cadence rationale (Vigil-rate-anchored), safety (best-effort, never
crashes the agent), and explicit out-of-scope items (count-based
trigger deferred until we characterise the latency-vs-version-count
curve; compact_files() periodic loop unnecessary because the dataset
is naturally 1 logical fragment).
- Phase 0.5 evidence cited inline: PR #97 confirmed analysis,
measured 32x avg / 44x max latency improvement on the 44-call
post-cleanup retest.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* docs: PR #98 review fixes — naming, config-driven retention, late-bound tables
Addresses 6 review comments from Codex and Copilot on PR #98:
RFC-009 — naming & spec accuracy
- 'LanceStore._index_in_lance()' → 'MemoryStore._index_in_lance()'
(line cite added: src/zettelforge/memory_store.py:160). Three sites:
Section 1.1 candidate list, Section 1.1 step 4 narrative, Section 1.5
intro. The class is MemoryStore — there is no LanceStore.
- '<domain>.lance' → 'notes_<domain>.lance' for table naming. Tables
are created at memory_store.py:163 as f'notes_{note.metadata.domain}',
so the disk path is notes_<domain>.lance/. Three sites updated.
- Section 1.5.5 illustrative call drops the hardcoded
timedelta(hours=1) in favour of timedelta(...) referencing
lance.cleanup_older_than_seconds. Spec is now consistent across
Sections 1.5.2 / 1.5.3 / 1.5.5 / Section 7 row.
RFC-009 — late-bound table registration (Codex P1)
- Section 1.5.2 step 2 now requires _register_table_for_cleanup() to
be called from MemoryStore._index_in_lance() so domains created
lazily on first write also get a maintenance thread. Without this
hook, any domain not seen at MemoryStore.__init__ would never get
cleanup, recreating the bloat condition for new shards.
RFC-009 — config-driven retention (Codex P2 + Copilot)
- Section 1.5.2 step 3 now reads lance.cleanup_older_than_seconds at
each iteration; if 0, skip; otherwise pass to older_than. This
honors the operator-facing knob introduced in 1.5.3.
RFC-009 — disable sentinel (Copilot)
- Section 1.5.3 default switched from 'null disables; 0 disables;
>0 enables' to '0 disables; >0 enables'. Note added explaining why:
ZettelForge's _parse_simple_yaml() fallback at config.py:253 (used
when PyYAML isn't installed) doesn't treat 'null' specially and
loads it as the string 'null', which breaks numeric-typed settings.
'0' survives both PyYAML and the simple-parser path identically.
CHANGELOG.md — accurate event name (Copilot)
- Tier 0/1/2 observability bullet's fact_extractor.py event name
corrected. fact_extractor.py:69-71 actually emits
parse_failed{schema='fact_extraction', reason='empty_completion'},
not 'empty_completion_returned' as the bullet originally claimed.
All claims verified against current src/ before this commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5) (#101) Implements the daemon specified in docs/rfcs/RFC-009-enrichment-pipeline-v2.md §1.5 — closes task #38. The 2026-04-25 Phase 0.5 retest confirmed `cleanup_old_versions()` is the lever for the 55-second `remember()` tail (PR #97). Without this periodic feature the bloat regrows at ~2 versions/insert on write-heavy shards. What landed: - New `LanceVersionMaintenance` class (`src/zettelforge/lance_maintenance.py`): * One daemon thread per `notes_<domain>.lance` table. * Lazy / late-bound: `register_table()` is called from `MemoryStore._index_in_lance()` so domains created after init also get coverage. Idempotent. * Each loop iteration re-reads `cleanup_interval_minutes` and `cleanup_older_than_seconds` from config so operators can flip the loop without restarting the agent. `0` disables. * Per-cleanup `lance_cleanup_old_versions` log event with `bytes_freed`, `versions_pruned`, `elapsed_seconds`. * Best-effort: any exception inside the cleanup is caught, logged at warning, and the loop continues. - New `LanceConfig` dataclass in `config.py` with `cleanup_interval_minutes=60` and `cleanup_older_than_seconds=3600` defaults; `0` is the canonical disable sentinel for both (chosen over YAML `null` because ZettelForge's `_parse_simple_yaml()` fallback at config.py:253 would treat `null` as the string "null"). - `MemoryStore.__init__` instantiates the daemon (db=None at first; bound to the live lancedb on first `_index_in_lance` via the new `_ensure_lance_maintenance_started()` helper). Lazy binding keeps read-only stores from paying the lancedb-connect cost. - 19 new tests in `tests/test_lance_maintenance.py` covering: * `_run_one()` direct iteration logic (configured older_than is threaded through; older_than=0 short-circuits; lance exceptions are swallowed; subsequent iterations still run; negative values clamp to 0). * `register_table()` is idempotent; gates on `db=None` and `interval=0`. * `start()` discovers existing `<name>.lance/` directories under table_root; ignores files and non-`.lance` subdirs. * `stop()` unblocks waiting threads via the stop_event so a 9999-minute interval doesn't hold up shutdown. * Stats-attribute extraction prefers `old_versions`, falls back to `versions_removed`, returns None on unknown shape. * Telemetry event emits as expected. Regression sweep: 71/71 green across test_basic + test_consolidation + test_logging_compliance + test_lance_maintenance. What remains for a future PR: - `MemoryManager.shutdown()` orchestrator (RFC-009 Phase 5) will call `_lance_maintenance.stop()` cleanly; today the daemon=True threads exit at process exit which is fine for current use cases. - Latency-vs-version-count curve characterisation to inform an optional count-based trigger (`lance.cleanup_threshold_versions`). Filed as future investigation in PR #97 / RFC §1.5 out-of-scope. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes task #36. Promotes the Phase 0.5 attribution doc to CONFIRMED status with a substantive correction to the failure mechanism.
What's confirmed
WHERE: 98%+ of
remember()time is inlance_index. Two independent passes agree:phase_timings_ms(2026-04-24/25, n=83 calls): 98.1%What's corrected
WHY: the fragment-count mechanism in the preliminary was wrong.
The 7,916
.lancefiles indata/were uncleaned version snapshots, not active fragments.compact_files()correctly returnsfragments_added=0, fragments_removed=0because there's no fragment merging to do. The active dataset has always been 1 logical fragment.What was executed
cleanup_old_versions()on 2026-04-25T01:34Z:notes_ctisizeWhat's still open
Whether the cleanup moves the 60-second tail. Active dataset was already 1 logical fragment, so
compact_files()was always going to be a no-op. The remaining hypothesis is that LanceDB's insert path walks the version chain. Workload retest is required (task #43).If tails collapse post-cleanup → RFC-009 Phase 1.5 becomes "periodic
cleanup_old_versions" not "periodiccompact_files". If tails persist → the diagnosis isn't done.File changes
docs/superpowers/research/2026-04-25-phase-0.5-attribution.md(the confirmed analysis)docs/superpowers/research/2026-04-24-phase-0.5-attribution-prelim.md— frontmatter status changed to SUPERSEDED, banner added at top pointing to the new doc. Text body preserved as a record of the diagnostic process.Related
Test plan
🤖 Generated with Claude Code