Codebase memory: lazy bootstrap + per-session verify + task-closure updates #45

@ZaxShen


Workflow

flowchart TD
    Start([Session start]) --> FirstCodeAsk[First code-touching ask arrives]
    FirstCodeAsk --> HasFiles{git ls-files<br/>non-empty?}

    HasFiles -- No --> NewProj[New project<br/>no scan, no registry write]
    HasFiles -- Yes --> HasMem{file_registry<br/>populated?}

    HasMem -- No --> Bootstrap[Bootstrap scan:<br/>git ls-files + md5 each<br/>insert rows; summary = null]
    HasMem -- Yes --> CleanCheck{Working tree clean<br/>AND branch up-to-date<br/>AND HEAD == last_verified_sha?}

    CleanCheck -- Yes --> Trust[Trust registry<br/>no scan]
    CleanCheck -- No --> Verify[Verification pass per row:<br/>re-md5 file.<br/>Match: keep summary.<br/>Mismatch: mark summary stale.<br/>Missing file: delete row.<br/>New file: insert row with md5.]

    NewProj --> Route[Route to architect<br/>→ SWE → pr-reviewer]
    Bootstrap --> RecordHEAD
    Trust --> Route
    Verify --> RecordHEAD[config_set<br/>last_verified_sha = HEAD]
    RecordHEAD --> Route

    Route --> Close[SWE atomic close<br/>commits changes]
    Close --> Update[file_registry_update_summaries<br/>paths touched by commit<br/>+ advance last_verified_sha]
    Update --> Next{Next ask<br/>in same session?}

    Next -- Code-touching --> Route
    Next -- Read-only --> Next
    Next -- None --> End([Session end])

Entry-state matrix

| State at first code-touching ask | Action |
| --- | --- |
| Empty repo (no files) | No scan, no registry write |
| Files exist + registry empty | Bootstrap scan: git ls-files + md5 per file; summaries null |
| Registry populated + tree clean + branch up-to-date + HEAD == last_verified_sha | Trust: skip the scan entirely |
| Registry populated + any drift (dirty tree / behind upstream / HEAD moved) | Verification pass: md5 compare per row |
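
The matrix can be read as a small decision function. A minimal sketch, assuming the four state inputs are computed elsewhere; `EntryAction` and `decide_entry_action` are hypothetical names for illustration, not part of the proposed MCP surface.

```python
from enum import Enum
from typing import Optional


class EntryAction(Enum):
    NO_SCAN = "no scan, no registry write"
    BOOTSTRAP = "bootstrap scan"
    TRUST = "trust registry"
    VERIFY = "verification pass"


def decide_entry_action(
    has_tracked_files: bool,
    registry_populated: bool,
    tree_clean: bool,
    branch_up_to_date: bool,
    head_sha: str,
    last_verified_sha: Optional[str],
) -> EntryAction:
    """Map the state at the first code-touching ask to exactly one action."""
    if not has_tracked_files:
        return EntryAction.NO_SCAN
    if not registry_populated:
        return EntryAction.BOOTSTRAP
    # All three conditions must hold; any single failure counts as drift.
    no_drift = tree_clean and branch_up_to_date and head_sha == last_verified_sha
    return EntryAction.TRUST if no_drift else EntryAction.VERIFY
```

The rows are checked in order, so the registry is consulted only when the repo has tracked files, and the drift checks apply only to a populated registry.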

Invariants

  1. Inside a session, the registry is trusted: only TMB's SWE modifies files, and its atomic close updates the registry.
  2. Across sessions, the registry is suspect until the cheap proof passes (clean working tree, branch up-to-date, HEAD == last_verified_sha).
  3. Verification is never a full re-summarization; it is an md5 compare only. Drift marks the summary stale, and summaries regenerate lazily when the architect or SWE actually reads the file.
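
Invariant 3 could look roughly like the sketch below. It assumes the registry is an in-memory dict keyed by repo-relative path and that staleness is a boolean `stale` flag; both are illustrative assumptions, since the issue does not say how staleness is stored. `md5_of` and `verify_rows` are hypothetical helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def md5_of(path: Path) -> str:
    """Hex md5 of a file's bytes (used as a change detector, not for security)."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def verify_rows(root: Path, registry: Dict[str, dict], tracked: List[str]) -> Dict[str, str]:
    """Compare stored hashes against disk; mutate `registry`, return verdicts."""
    verdicts: Dict[str, str] = {}
    for rel in list(registry):  # copy the keys: rows may be deleted while iterating
        file_path = root / rel
        if not file_path.is_file():
            del registry[rel]  # missing file: delete row
            verdicts[rel] = "missing"
        elif md5_of(file_path) == registry[rel]["content_md5"]:
            verdicts[rel] = "match"  # keep the summary untouched
        else:
            registry[rel]["stale"] = True  # summary is regenerated lazily, not here
            verdicts[rel] = "mismatch"
    for rel in tracked:
        if rel not in registry and (root / rel).is_file():
            registry[rel] = {"content_md5": md5_of(root / rel), "summary": None, "stale": False}
            verdicts[rel] = "new"
    return verdicts
```

No summary text is read or written anywhere in the pass, which is what keeps it cheap. Whether a mismatch should also refresh the stored hash is left open here.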

Key insight

The bootstrap scan is not an upfront cost. On a cold session in an existing repo, Claude has to read files for context regardless. That read is the scan — we just persist it into file_registry.
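
For illustration, the bootstrap scan might be sketched as below: list tracked files, hash each one, and emit rows with a null summary. The function names and the row shape are assumptions; only `git ls-files` and the md5-per-file step come from this issue.

```python
import hashlib
import subprocess
from pathlib import Path
from typing import Dict, List, Optional


def list_tracked_files(repo_root: Path) -> List[str]:
    """Return repo-relative paths from `git ls-files -z` (NUL-separated output)."""
    out = subprocess.run(
        ["git", "ls-files", "-z"], cwd=repo_root, check=True, capture_output=True
    ).stdout
    return [p.decode("utf-8", "surrogateescape") for p in out.split(b"\0") if p]


def bootstrap_rows(repo_root: Path, tracked: List[str]) -> List[Dict[str, Optional[str]]]:
    """Build one registry row per tracked file; summaries start as None."""
    rows: List[Dict[str, Optional[str]]] = []
    for rel in tracked:
        file_path = repo_root / rel
        if not file_path.is_file():
            continue  # still in the index but deleted from the working tree
        digest = hashlib.md5(file_path.read_bytes()).hexdigest()
        rows.append({"path": rel, "content_md5": digest, "summary": None})
    return rows
```

A caller would pass `list_tracked_files(repo_root)` as `tracked` and insert the returned rows into file_registry.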

The cooperation insight (new session / git pull / dirty tree): across sessions, other developers on other machines may have changed files. Verification is therefore mandatory on the first code-touching ask of each session, unless the clean-tree, up-to-date-branch, and last_verified_sha checks together prove nothing has changed.
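
A hedged sketch of that cheap proof, using three standard git commands. `registry_is_trusted` and `run_git` are hypothetical names, and treating a branch with no upstream as "unproven" is an assumption rather than something this issue specifies. Note that the behind-count only reflects the last fetch.

```python
import subprocess
from typing import Callable, Optional


def run_git(*args: str) -> str:
    """Run a git command in the current directory and return stripped stdout."""
    return subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    ).stdout.strip()


def registry_is_trusted(
    last_verified_sha: Optional[str], git: Callable[..., str] = run_git
) -> bool:
    """True only if the tree is clean, not behind upstream, and HEAD is unchanged."""
    if not last_verified_sha:
        return False  # nothing has ever been verified
    if git("status", "--porcelain"):
        return False  # any output means a dirty working tree
    try:
        behind = int(git("rev-list", "--count", "HEAD..@{upstream}"))
    except subprocess.CalledProcessError:
        return False  # assumption: no upstream configured counts as unproven
    if behind:
        return False  # upstream has commits we lack (as of the last fetch)
    return git("rev-parse", "HEAD") == last_verified_sha
```

The `git` parameter exists so the check can be exercised with a stub runner in tests, without a real repository.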

Schema additions

ALTER TABLE file_registry ADD COLUMN content_md5 TEXT;
ALTER TABLE file_registry ADD COLUMN summary TEXT;
ALTER TABLE file_registry ADD COLUMN summary_updated_at TEXT;

-- plus a plugin_config key:
--   last_verified_sha: TEXT — git HEAD at last successful verify/bootstrap/update
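
One way the migration could be applied idempotently, sketched under the assumption that the store is SQLite and that plugin_config is a key/value table with `key` as its primary key. Neither detail is confirmed in this issue, and `migrate`, `set_last_verified_sha`, and `get_last_verified_sha` are hypothetical helpers.

```python
import sqlite3
from typing import Optional

NEW_COLUMNS = ("content_md5", "summary", "summary_updated_at")


def migrate(conn: sqlite3.Connection) -> None:
    """Add the new TEXT columns to file_registry, skipping any that already exist."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(file_registry)")}
    for column in NEW_COLUMNS:
        if column not in existing:
            conn.execute(f"ALTER TABLE file_registry ADD COLUMN {column} TEXT")
    conn.commit()


def set_last_verified_sha(conn: sqlite3.Connection, sha: str) -> None:
    """Upsert the config key (assumes plugin_config.key is a PRIMARY KEY)."""
    conn.execute(
        "INSERT OR REPLACE INTO plugin_config (key, value) VALUES ('last_verified_sha', ?)",
        (sha,),
    )
    conn.commit()


def get_last_verified_sha(conn: sqlite3.Connection) -> Optional[str]:
    """Return the stored sha, or None if no verify/bootstrap/update has run yet."""
    row = conn.execute(
        "SELECT value FROM plugin_config WHERE key = 'last_verified_sha'"
    ).fetchone()
    return row[0] if row else None
```

Checking `PRAGMA table_info` first makes the migration safe to re-run, since SQLite raises an error when `ADD COLUMN` names an existing column.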

MCP surface additions

  • file_registry_update_summaries(paths) — called by SWE atomic-close for each committed path. md5 + regenerate summary + touch timestamp + advance last_verified_sha.
  • file_registry_verify() — called by project-prescan when drift detected. Returns per-path verdict (match / mismatch / missing / new) so the caller decides whether to mark stale.
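
A rough sketch of what file_registry_update_summaries might do at atomic close, again assuming SQLite. The extra parameters (`conn`, `repo_root`, `head_sha`, `summarize`) are hypothetical and added only to keep the example self-contained; `summarize` stands in for whoever ends up writing summary text. Deleting the row for a path the commit removed is also an assumption.

```python
import hashlib
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable


def file_registry_update_summaries(
    conn: sqlite3.Connection,
    repo_root: Path,
    paths: Iterable[str],
    head_sha: str,
    summarize: Callable[[str, bytes], str],
) -> None:
    """Refresh md5, summary, and timestamp per path, then advance last_verified_sha."""
    now = datetime.now(timezone.utc).isoformat()
    for rel in paths:
        file_path = repo_root / rel
        if not file_path.is_file():
            # Assumption: a path deleted by the commit loses its registry row.
            conn.execute("DELETE FROM file_registry WHERE path = ?", (rel,))
            continue
        data = file_path.read_bytes()
        values = (hashlib.md5(data).hexdigest(), summarize(rel, data), now, rel)
        # UPDATE first so columns this sketch does not know about are preserved.
        cur = conn.execute(
            "UPDATE file_registry SET content_md5 = ?, summary = ?, "
            "summary_updated_at = ? WHERE path = ?",
            values,
        )
        if cur.rowcount == 0:
            conn.execute(
                "INSERT INTO file_registry (content_md5, summary, summary_updated_at, path) "
                "VALUES (?, ?, ?, ?)",
                values,
            )
    conn.execute(
        "INSERT OR REPLACE INTO plugin_config (key, value) VALUES ('last_verified_sha', ?)",
        (head_sha,),
    )
    conn.commit()
```

Everything is committed in one transaction, so the registry rows and last_verified_sha cannot disagree after a partial failure.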

Skills / agents affected

  • skills/project-prescan/ — branch on entry-state matrix; call file_registry_verify when drift detected.
  • agents/swe.md — atomic close calls file_registry_update_summaries with commit's touched paths.
  • agents/architect.md — reads from file_registry instead of scanning; may call update tool when editing non-source files.
  • skills/lazy-regen-check/ — unchanged (still governs the markdown view in docs/trustmybot/architecture/auto/).
  • skills/refresh-architecture/ — unchanged.

Acceptance criteria

  • Schema additions landed.
  • last_verified_sha config key read/written correctly.
  • Integration test — empty repo → first task closes → registry populated only for touched files.
  • Integration test — existing repo / registry empty → bootstrap scan runs once, subsequent asks don't re-scan.
  • Integration test — clean tree + unchanged HEAD + prior verify → next session trusts registry, no md5 pass.
  • Integration test — simulated git pull (HEAD moved) → next session runs verification pass, stale rows marked.
  • Integration test — simulated dirty working tree → verification detects mismatched md5, marks stale.
  • Integration test — another developer deletes a file upstream → after pull, verification deletes the registry row.
  • Integration test — another developer adds a file upstream → after pull, verification inserts a new row with null summary.
  • Budget: md5 pass on 500-file repo completes in ≤100ms.

Open questions

  • Who writes summary text — architect (fresh LLM reasoning) or a dedicated tool? Prefer architect — it's reasoning, not mechanical.
  • Skip binaries / lockfiles? Yes; filter by type column.
  • If last_verified_sha points to a commit that has been rebased away (GC'd), verification falls through to the full md5 pass. Safe but expensive — acceptable because rebase-away is rare.

Notes

This supersedes earlier drafts of #45. Cooperation case (new session / git pull / dirty tree) added per discussion — without it, the registry would drift when teammates push changes.

Related

  • docs/architecture/FLOWS.md flow 7
  • docs/architecture/ERD.md — file_registry

Labels

  • Feature (New feature or request)
  • Priority: High (High priority; blocks meaningful workflows)
  • Workflow (Bro / SWE / pr-reviewer doctrine + planning skills)
