Skip to content

docs: Phase 0.5 attribution PRELIMINARY → CONFIRMED (mechanism corrected)#97

Merged
rolandpg merged 2 commits into
masterfrom
docs/phase-0.5-confirmed
Apr 25, 2026
Merged

docs: Phase 0.5 attribution PRELIMINARY → CONFIRMED (mechanism corrected)#97
rolandpg merged 2 commits into
masterfrom
docs/phase-0.5-confirmed

Conversation

@rolandpg
Copy link
Copy Markdown
Owner

Summary

Closes task #36. Promotes the Phase 0.5 attribution doc to CONFIRMED status with a substantive correction to the failure mechanism.

What's confirmed

WHERE: 98%+ of remember() time is in lance_index. Two independent passes agree:

  • Preliminary reconstruction (2026-04-24, n=961 calls, adjacent file_activity events): 98.4%
  • v2.4.2 first-party phase_timings_ms (2026-04-24/25, n=83 calls): 98.1%
  • Per-tail-event (n=8): 99.66–99.98% every single one

What's corrected

WHY: the fragment-count mechanism in the preliminary was wrong.

ds = table.to_lance()
fragments = ds.get_fragments()
# len(fragments) == 1   ← not 7,916

The 7,916 .lance files in data/ were uncleaned version snapshots, not active fragments. compact_files() correctly returns fragments_added=0, fragments_removed=0 because there's no fragment merging to do. The active dataset has always been 1 logical fragment.

What was executed

cleanup_old_versions() on 2026-04-25T01:34Z:

Before After
notes_cti size 5.69 GB 29 MB (196× shrink)
versions retained thousands 1
disk files 7,917 1
rows 7,916 7,916 ✅

What's still open

Whether the cleanup moves the 60-second tail. Active dataset was already 1 logical fragment, so compact_files() was always going to be a no-op. The remaining hypothesis is that LanceDB's insert path walks the version chain. Workload retest is required (task #43).

If tails collapse post-cleanup → RFC-009 Phase 1.5 becomes "periodic cleanup_old_versions" not "periodic compact_files". If tails persist → the diagnosis isn't done.

File changes

  • NEW docs/superpowers/research/2026-04-25-phase-0.5-attribution.md (the confirmed analysis)
  • UPDATED docs/superpowers/research/2026-04-24-phase-0.5-attribution-prelim.md — frontmatter status changed to SUPERSEDED, banner added at top pointing to the new doc. Text body preserved as a record of the diagnostic process.

Related

Test plan

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 25, 2026 02:14
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63974aa879

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

pct: ((.value * 1000 / $total | round) / 10)}) | sort_by(-.total_ms)' "$LOG"

# Tail events with full phase breakdown
jq -s '[.[] | select(.activity_name=="remember" and .duration_ms > 30000 and .time > "'$SINCE'")] |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Exclude unsampled tails from reproduction query

In the Reproduction section (checked in 2026-04-25-phase-0.5-attribution.md), the jq filter for “Tail events with full phase breakdown” only checks activity_name, duration_ms, and time, but not whether phase_timings_ms exists. That means >30s events without phase timings are still included (as reflected by the later “not sampled live” row), which can pollute reproduced tail analyses and make the 99.66–99.98% attribution claim non-reproducible from this command. Add a .phase_timings_ms != null predicate here, matching the aggregate query above.

Useful? React with 👍 / 👎.


## TL;DR

- The preliminary attribution at the **phase level** is now confirmed by v2.4.2's `phase_timings_ms` instrumentation. Across 8 individual tail events, `lance_index` accounted for 99.66–99.98% of total `remember()` duration. There is no surprise phase. The work is, definitively, in `LanceStore._index_in_lance()`.
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Correct tail-event sample size claim

The TL;DR states that “Across 8 individual tail events, lance_index accounted for 99.66–99.98%,” but the detailed table shows one of those 8 events as “not sampled live” with no lance_index value. This overstates the evidence and inflates confidence in the exact range; the text should either report 7 sampled events or provide the missing 8th breakdown.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Promotes the Phase 0.5 remember() latency attribution research note from PRELIMINARY to CONFIRMED, while correcting the previously-attributed LanceDB “fragment growth” mechanism to a “version history / version-chain” hypothesis and documenting the cleanup_old_versions() intervention.

Changes:

  • Added a new CONFIRMED analysis doc with updated mechanism explanation and reproduction steps.
  • Marked the prior preliminary doc as SUPERSEDED and added a banner + correction summary pointing to the new doc.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
docs/superpowers/research/2026-04-25-phase-0.5-attribution.md New confirmed analysis + corrected mechanism + follow-ups and repro commands
docs/superpowers/research/2026-04-24-phase-0.5-attribution-prelim.md Updates frontmatter/status and adds a superseded banner + correction summary

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1 to +3
---
title: "RFC-009 Phase 0.5 — remember() latency attribution (preliminary)"
status: PRELIMINARY — confirm after v2.4.2 phase_timings_ms lands
title: "RFC-009 Phase 0.5 — remember() latency attribution (preliminary, SUPERSEDED)"
status: SUPERSEDED 2026-04-25 by docs/superpowers/research/2026-04-25-phase-0.5-attribution.md
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description says it “Closes task #36”, but issue #36 in this PR’s metadata is about adding ADRs under docs/adr/ (directory doesn’t exist in this change set). If #36 is intended to be closed, the PR needs the ADR deliverables; otherwise, please adjust the PR description to reference the correct issue/task so GitHub doesn’t auto-close the ADR issue incorrectly.

Copilot uses AI. Check for mistakes.
Comment on lines +8 to +11
- "Pass 1 (2026-04-24): Vigil v2.4.1 OCSF logs reconstructed via adjacent ocsf_file_activity events (961 remember() calls, 95 tail events)"
- "Pass 2 (2026-04-24/25): Vigil v2.4.2 first-party phase_timings_ms (200+ remember() calls, 8 tail events with full phase breakdown)"
- "Pass 3 (2026-04-25T01:34Z): cleanup_old_versions intervention on notes_cti — 5.69 GB → 29 MB, thousands of versions → 1, 7,917 disk files → 1"
followup: "Task #43 — restart Vigil traffic + measure remember() p95 post-cleanup. Two outcomes: (a) tails collapse → version-chain hypothesis confirmed, RFC-009 Phase 1.5 = periodic cleanup_old_versions; (b) tails persist → diagnosis needs another round, mechanism still unknown."
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frontmatter claims Pass 2 has “200+ remember() calls” and “8 tail events with full phase breakdown”, but the body immediately below aggregates 83 remember() calls and the per-tail table includes one event with lance_index “(not sampled live)”. Please reconcile these counts/claims (either update the frontmatter numbers or explain the sampling/filters so the doc is internally consistent).

Copilot uses AI. Check for mistakes.
- "Pass 1 (2026-04-24): Vigil v2.4.1 OCSF logs reconstructed via adjacent ocsf_file_activity events (961 remember() calls, 95 tail events)"
- "Pass 2 (2026-04-24/25): Vigil v2.4.2 first-party phase_timings_ms (200+ remember() calls, 8 tail events with full phase breakdown)"
- "Pass 3 (2026-04-25T01:34Z): cleanup_old_versions intervention on notes_cti — 5.69 GB → 29 MB, thousands of versions → 1, 7,917 disk files → 1"
followup: "Task #43 — restart Vigil traffic + measure remember() p95 post-cleanup. Two outcomes: (a) tails collapse → version-chain hypothesis confirmed, RFC-009 Phase 1.5 = periodic cleanup_old_versions; (b) tails persist → diagnosis needs another round, mechanism still unknown."
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc references “Task #43” for the post-cleanup workload retest, but in this PR’s metadata issue #43 is “Add example: Slack bot for CTI queries”. To avoid confusion and broken cross-references, please either link to the correct GitHub issue/PR for the workload retest or rename this to an unambiguous internal task identifier (and avoid using #NN which looks like a GitHub issue reference).

Copilot uses AI. Check for mistakes.
Comment on lines +188 to +190
|---|---|
| **#43** | Restart Vigil with v2.4.2/v2.4.3 pin, run representative workload, measure `phase_timings_ms.lance_index` distribution. Compares against the 53–67 s tail band documented above. |
| **#38** | Code-side periodic `cleanup_old_versions` (RFC-009 Phase 1.5 — provisional, awaiting #43). |
Copy link

Copilot AI Apr 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open follow-ups lists “#43” as the workload retest, but in this PR’s metadata issue #43 refers to the Slack bot example. Please update this ID/link so readers can navigate to the correct tracking item (or use a non-# identifier if it’s not a GitHub issue).

Copilot uses AI. Check for mistakes.
rolandpg added a commit that referenced this pull request Apr 25, 2026
CHANGELOG.md
- Adds the missing [2.4.3] - 2026-04-25 section. Origin/master had the
  pyproject.toml bump but no CHANGELOG entry committed; the GitHub
  release notes exist but describe the OCSF fix inaccurately ('resolved
  from ocsf-schema package' — it actually reads pyproject.toml first
  and falls back to importlib.metadata.version('zettelforge')).
- Each v2.4.3 bullet now matches the actual diff: OCSF version
  self-correct mechanism, ZETTELFORGE_LOG_LEVEL resolution order,
  fastembed preload site, compact_lance script, Tier 0/1/2 LLM
  observability scope.
- Operational note ties v2.4.3 to the Vigil incident response: 5.66 GB
  recovered, p95 49.8s → ~250ms, periodic feature deferred to v2.5.0
  Phase 1.5.

RFC-009 Phase 1.5
- Section 7 implementation plan: Phase 0.5 marked 'done — attribution
  confirmed end-to-end' (was 'profile with py-spy'); new Phase 1.5 row
  for periodic LanceDB version cleanup (3h, ships v2.5.0).
- New Section 1.5 with the full spec: mechanism (version-chain growth
  at 2/insert, not fragment count), scope (LanceVersionMaintenance
  daemon thread per shard, calls cleanup_old_versions with telemetry),
  configuration (lance.cleanup_interval_minutes default 60), default
  cadence rationale (Vigil-rate-anchored), safety (best-effort, never
  crashes the agent), and explicit out-of-scope items (count-based
  trigger deferred until we characterise the latency-vs-version-count
  curve; compact_files() periodic loop unnecessary because the dataset
  is naturally 1 logical fragment).
- Phase 0.5 evidence cited inline: PR #97 confirmed analysis,
  measured 32x avg / 44x max latency improvement on the 44-call
  post-cleanup retest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rolandpg and others added 2 commits April 24, 2026 22:46
…sm corrected)

WHERE: confirmed across two independent passes. lance_index = 98.1%
aggregate (v2.4.2 phase_timings_ms, n=83) and 99.66-99.98% on tail
events (n=8). The preliminary's reconstruction (98.4%) and the
first-party instrumentation agree within rounding noise.

WHY: corrected. The preliminary asserted "fragment count growth →
slow manifest writes." That was wrong. Lance.to_lance().get_fragments()
reports 1 logical fragment for notes_cti; the 7,916 .lance files in
data/ were orphan version snapshots, not active fragments. compact_files()
correctly returns "0 work needed" because there is no fragment merging
to do.

INTERVENTION: cleanup_old_versions() executed 2026-04-25T01:34Z
recovered notes_cti from 5.69 GB / thousands of versions → 29 MB / 1
version (196x shrink). No row loss.

OPEN: whether the cleanup actually moves the 60-second tail. Pending
task #43 workload retest. If tails collapse → RFC-009 Phase 1.5 =
periodic cleanup_old_versions. If tails persist → diagnosis is not
done; mechanism still unknown.

Preliminary doc marked SUPERSEDED in its frontmatter and a banner at
top, but text preserved as a record of the diagnostic process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d-to-end

44 remember() calls against the cleaned notes_cti shard:
- avg latency: 5,696ms → 180ms (32x)
- p50: 765ms → 130ms (5.9x)
- max: 66,535ms → 1,504ms (44x)
- tails ≥5s: 96/951 → 0/44
- lance_index share: 98.1% → 39.6%

Phase share is now balanced across construct/supersession/entity_index/
kg_update — no single phase dominates. The pre-cleanup tail behaviour
is gone.

Doc updates:
- Frontmatter status flips from 'WHY hypothesis pending' to 'CONFIRMED
  end-to-end'
- TL;DR adds the Pass 4 numbers and the urgency call for task #38
- New 'Pass 4 — post-cleanup workload retest' section with the
  before/after latency table, the phase-share rebalancing table, and
  the version-growth-rate observation (44 inserts → 89 versions, so
  the bloat returns at 2 versions/insert without periodic cleanup)
- 'Implications for RFC-009' section drops 'Provisional' / 'contingent
  on task #43' hedging
- 'What this document still cannot say' section trims the resolved
  questions, retains the open ones (notes_general row vs version
  effect; named-operation for the 55s tail clustering; latency-vs-
  version-count curve for tuning a count-based trigger)
- Confidence table promotes the cleanup claim from Medium to CONFIRMED
  and the Phase 1.5 scope claim from Medium-conditional to
  CONFIRMED+urgent
- Open follow-ups table marks #43 + #37 as DONE and elevates #38 to
  URGENT with concrete recommended shape (background thread,
  configurable interval, telemetry event per cleanup)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rolandpg rolandpg force-pushed the docs/phase-0.5-confirmed branch from e1f3519 to 06a57a2 Compare April 25, 2026 03:46
@rolandpg rolandpg merged commit a543bdf into master Apr 25, 2026
11 checks passed
@rolandpg rolandpg deleted the docs/phase-0.5-confirmed branch April 25, 2026 03:52
rolandpg added a commit that referenced this pull request Apr 25, 2026
* docs: v2.4.3 CHANGELOG entry (accurate) + RFC-009 Phase 1.5 spec

CHANGELOG.md
- Adds the missing [2.4.3] - 2026-04-25 section. Origin/master had the
  pyproject.toml bump but no CHANGELOG entry committed; the GitHub
  release notes exist but describe the OCSF fix inaccurately ('resolved
  from ocsf-schema package' — it actually reads pyproject.toml first
  and falls back to importlib.metadata.version('zettelforge')).
- Each v2.4.3 bullet now matches the actual diff: OCSF version
  self-correct mechanism, ZETTELFORGE_LOG_LEVEL resolution order,
  fastembed preload site, compact_lance script, Tier 0/1/2 LLM
  observability scope.
- Operational note ties v2.4.3 to the Vigil incident response: 5.66 GB
  recovered, p95 49.8s → ~250ms, periodic feature deferred to v2.5.0
  Phase 1.5.

RFC-009 Phase 1.5
- Section 7 implementation plan: Phase 0.5 marked 'done — attribution
  confirmed end-to-end' (was 'profile with py-spy'); new Phase 1.5 row
  for periodic LanceDB version cleanup (3h, ships v2.5.0).
- New Section 1.5 with the full spec: mechanism (version-chain growth
  at 2/insert, not fragment count), scope (LanceVersionMaintenance
  daemon thread per shard, calls cleanup_old_versions with telemetry),
  configuration (lance.cleanup_interval_minutes default 60), default
  cadence rationale (Vigil-rate-anchored), safety (best-effort, never
  crashes the agent), and explicit out-of-scope items (count-based
  trigger deferred until we characterise the latency-vs-version-count
  curve; compact_files() periodic loop unnecessary because the dataset
  is naturally 1 logical fragment).
- Phase 0.5 evidence cited inline: PR #97 confirmed analysis,
  measured 32x avg / 44x max latency improvement on the 44-call
  post-cleanup retest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs: PR #98 review fixes — naming, config-driven retention, late-bound tables

Addresses 6 review comments from Codex and Copilot on PR #98:

RFC-009 — naming & spec accuracy
- 'LanceStore._index_in_lance()' → 'MemoryStore._index_in_lance()'
  (line cite added: src/zettelforge/memory_store.py:160). Three sites:
  Section 1.1 candidate list, Section 1.1 step 4 narrative, Section 1.5
  intro. The class is MemoryStore — there is no LanceStore.
- '<domain>.lance' → 'notes_<domain>.lance' for table naming. Tables
  are created at memory_store.py:163 as f'notes_{note.metadata.domain}',
  so the disk path is notes_<domain>.lance/. Three sites updated.
- Section 1.5.5 illustrative call drops the hardcoded
  timedelta(hours=1) in favour of timedelta(...) referencing
  lance.cleanup_older_than_seconds. Spec is now consistent across
  Sections 1.5.2 / 1.5.3 / 1.5.5 / Section 7 row.

RFC-009 — late-bound table registration (Codex P1)
- Section 1.5.2 step 2 now requires _register_table_for_cleanup() to
  be called from MemoryStore._index_in_lance() so domains created
  lazily on first write also get a maintenance thread. Without this
  hook, any domain not seen at MemoryStore.__init__ would never get
  cleanup, recreating the bloat condition for new shards.

RFC-009 — config-driven retention (Codex P2 + Copilot)
- Section 1.5.2 step 3 now reads lance.cleanup_older_than_seconds at
  each iteration; if 0, skip; otherwise pass to older_than. This
  honors the operator-facing knob introduced in 1.5.3.

RFC-009 — disable sentinel (Copilot)
- Section 1.5.3 default switched from 'null disables; 0 disables;
  >0 enables' to '0 disables; >0 enables'. Note added explaining why:
  ZettelForge's _parse_simple_yaml() fallback at config.py:253 (used
  when PyYAML isn't installed) doesn't treat 'null' specially and
  loads it as the string 'null', which breaks numeric-typed settings.
  '0' survives both PyYAML and the simple-parser path identically.

CHANGELOG.md — accurate event name (Copilot)
- Tier 0/1/2 observability bullet's fact_extractor.py event name
  corrected. fact_extractor.py:69-71 actually emits
  parse_failed{schema='fact_extraction', reason='empty_completion'},
  not 'empty_completion_returned' as the bullet originally claimed.

All claims verified against current src/ before this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rolandpg added a commit that referenced this pull request Apr 25, 2026
…5) (#101)

Implements the daemon specified in
docs/rfcs/RFC-009-enrichment-pipeline-v2.md §1.5 — closes task #38.
The 2026-04-25 Phase 0.5 retest confirmed `cleanup_old_versions()` is
the lever for the 55-second `remember()` tail (PR #97). Without this
periodic feature the bloat regrows at ~2 versions/insert on
write-heavy shards.

What landed:

- New `LanceVersionMaintenance` class
  (`src/zettelforge/lance_maintenance.py`):
  * One daemon thread per `notes_<domain>.lance` table.
  * Lazy / late-bound: `register_table()` is called from
    `MemoryStore._index_in_lance()` so domains created after init
    also get coverage. Idempotent.
  * Each loop iteration re-reads `cleanup_interval_minutes` and
    `cleanup_older_than_seconds` from config so operators can flip
    the loop without restarting the agent. `0` disables.
  * Per-cleanup `lance_cleanup_old_versions` log event with
    `bytes_freed`, `versions_pruned`, `elapsed_seconds`.
  * Best-effort: any exception inside the cleanup is caught,
    logged at warning, and the loop continues.

- New `LanceConfig` dataclass in `config.py` with
  `cleanup_interval_minutes=60` and
  `cleanup_older_than_seconds=3600` defaults; `0` is the canonical
  disable sentinel for both (chosen over YAML `null` because
  ZettelForge's `_parse_simple_yaml()` fallback at config.py:253
  would treat `null` as the string "null").

- `MemoryStore.__init__` instantiates the daemon (db=None at first;
  bound to the live lancedb on first `_index_in_lance` via the new
  `_ensure_lance_maintenance_started()` helper). Lazy binding keeps
  read-only stores from paying the lancedb-connect cost.

- 19 new tests in `tests/test_lance_maintenance.py` covering:
  * `_run_one()` direct iteration logic (configured older_than is
    threaded through; older_than=0 short-circuits; lance exceptions
    are swallowed; subsequent iterations still run; negative values
    clamp to 0).
  * `register_table()` is idempotent; gates on `db=None` and
    `interval=0`.
  * `start()` discovers existing `<name>.lance/` directories under
    table_root; ignores files and non-`.lance` subdirs.
  * `stop()` unblocks waiting threads via the stop_event so a
    9999-minute interval doesn't hold up shutdown.
  * Stats-attribute extraction prefers `old_versions`, falls back to
    `versions_removed`, returns None on unknown shape.
  * Telemetry event emits as expected.

Regression sweep: 71/71 green across test_basic + test_consolidation
+ test_logging_compliance + test_lance_maintenance.

What remains for a future PR:
- `MemoryManager.shutdown()` orchestrator (RFC-009 Phase 5) will
  call `_lance_maintenance.stop()` cleanly; today the daemon=True
  threads exit at process exit which is fine for current use cases.
- Latency-vs-version-count curve characterisation to inform an
  optional count-based trigger (`lance.cleanup_threshold_versions`).
  Filed as future investigation in PR #97 / RFC §1.5 out-of-scope.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants