Add storage catalogue; restore search beneath dotfile .gitignore by tony · Pull Request #4 · tony/agentgrep

tony · 2026-05-17T22:23:35Z

Summary

Fix silent "No matches found." for agentgrep search --agent claude / --agent codex / --agent cursor on systems where the agent stores live inside a dotfile-managed tree (yadm, chezmoi, stow, bare-git, mr). fd and rg both honored .gitignore semantics under $HOME and silently masked everything; removing fd -a also fixes a path-comparison mismatch between fd's canonicalized output and rg's symlink-preserving output.
Add an importable storage catalogue — frozen Pydantic descriptors for every on-disk prompt and history store agentgrep knows about, stamped with observed_version, observed_at, format, schema notes, and pointers to upstream type definitions where one is public.
Add a Cursor CLI agent transcript adapter so agentgrep search --agent cursor returns results from ~/.cursor/projects/<id>/agent-transcripts/<session_uuid>/<session_uuid>.jsonl in addition to the existing Cursor IDE state.vscdb store. Transcripts carry no native per-turn timestamp; agentgrep backfills the file's mtime.
Add a Google Gemini CLI adapter — --agent gemini is now a valid choice across the CLI and the MCP tool literals. The adapter parses both ~/.gemini/tmp/<project_hash>/chats/session-*.jsonl (mixed SessionMetadataRecord / MessageRecord / MetadataUpdateRecord lines) and ~/.gemini/tmp/<project_hash>/logs.json (flat LogEntry audit array).
Document the catalogue and the per-agent adapters in a new Sphinx page with per-agent sections, upstream-pinned schema citations, and a recipe for adding or updating a descriptor.

Changes by area

Search discovery and prefilter

src/agentgrep/__init__.py — list_files_matching passes -H -I to fd and drops -a. The flag pair bypasses gitignore/hidden filtering; dropping -a keeps fd's paths in the same symlink-preserved form that rg emits, so prefilter_sources_by_root can compare them directly. build_grep_command passes --no-ignore --hidden (rg) / --unrestricted --hidden (ag), and the argv builder is reshuffled so the fixed-string flag goes immediately before -l.
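
The flag changes above can be sketched as two small argv builders. This is a minimal illustration, not the project's code: the names `fd_argv` and `rg_argv`, the glob argument, and the overall argument order are assumptions; only the flags themselves (`-H -I` without `-a` for fd, `--no-ignore --hidden` for rg, fixed-string flag directly before `-l`) come from the description.

```python
from __future__ import annotations


def fd_argv(root: str, glob: str) -> list[str]:
    """Build an fd command that bypasses ignore-file and hidden-file filtering."""
    # -H includes hidden entries, -I stops fd from honoring .gitignore and friends.
    # -a (absolute paths) is deliberately absent so results keep the form of `root`.
    return ["fd", "-H", "-I", "--glob", glob, root]


def rg_argv(pattern: str, root: str, fixed_string: bool = True) -> list[str]:
    """Build an rg command that lists matching files without ignore filtering."""
    argv = ["rg", "--no-ignore", "--hidden"]
    if fixed_string:
        argv.append("-F")  # fixed-string flag sits immediately before -l
    argv += ["-l", pattern, root]
    return argv
```

Because neither builder resolves `root`, a symlinked search root appears in the same unresolved form in both tools' output, which is what lets the prefilter compare paths directly.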

New store catalogue

| Path | Description |
| --- | --- |
| `src/agentgrep/stores.py` | `StoreFormat` and `StoreRole` enums; frozen Pydantic `StoreDescriptor` and `StoreCatalog` models. JSON-schema-exportable. |
| `src/agentgrep/store_catalog.py` | `CATALOG` (version 2) spanning Claude, Cursor (IDE + CLI agent), Codex, Gemini. Includes `gemini_project_hash()`, a Python mirror of Gemini CLI's `getProjectHash()`, so consumers can map a working directory to a Gemini tmp shard. |
| `tests/test_stores.py` | Shape tests: unique IDs, agent-prefix discipline, token-form path patterns (no `/home/*` leaks), `distinguishes_from` resolves, frozen-model enforcement, Pydantic JSON round-trip, fixture validation. |
| `tests/conftest.py` | `fixture_path(store_id, name)` helper rooted at `tests/samples/`. |
| `tests/samples/<agent>/<store_id>/...` | Redacted, structurally-faithful fixtures per primary-chat / prompt-history / plan store. No real user content. |
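
To show how a consumer might map a working directory to a Gemini tmp shard, here is a hedged sketch. It assumes the upstream `getProjectHash()` is a SHA-256 hex digest of the project root string; that is an assumption made for illustration and should be checked against the upstream source. The name `project_hash_sketch` is hypothetical and is not the catalogue's `gemini_project_hash()`.

```python
import hashlib
from pathlib import Path


def project_hash_sketch(project_root: str) -> str:
    """Hash a project root the way upstream is assumed to (SHA-256, hex)."""
    # Assumption: the digest is taken over the UTF-8 bytes of the path string.
    return hashlib.sha256(project_root.encode("utf-8")).hexdigest()


def tmp_shard_sketch(gemini_home: Path, project_root: str) -> Path:
    """Locate the per-project shard under <gemini_home>/tmp/<project_hash>/."""
    return gemini_home / "tmp" / project_hash_sketch(project_root)
```

A caller would pass the absolute working directory, for example `tmp_shard_sketch(Path.home() / ".gemini", "/work/project")`, and then look for `chats/` and `logs.json` beneath the result.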

New Cursor CLI agent adapter

| Path | Description |
| --- | --- |
| `src/agentgrep/__init__.py` | `discover_cursor_cli_sources` walks `~/.cursor/projects/<id>/agent-transcripts/<session_uuid>/<session_uuid>.jsonl`, skipping sibling project files (`repo.json`, `mcp-approvals.json`, `terminals/`, `canvases/`). `parse_cursor_cli_transcript` wraps the existing `iter_message_candidates` (which already handles Cursor's outer-`role` / inner-`message.content[].text` shape) and backfills the file's mtime as a timestamp fallback via the new `isoformat_from_mtime_ns` helper. |
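
The mtime backfill can be sketched as follows. The real `isoformat_from_mtime_ns` helper is not shown in this description, so the body, the UTC choice, and the output format are assumptions; the function is named `isoformat_from_mtime_ns_sketch` to keep that distinction visible.

```python
from datetime import datetime, timedelta, timezone


def isoformat_from_mtime_ns_sketch(mtime_ns: int) -> str:
    """Convert a nanosecond mtime (as in os.stat().st_mtime_ns) to ISO 8601 UTC."""
    # Split with integer arithmetic to avoid float rounding on large values.
    seconds, nanos = divmod(mtime_ns, 1_000_000_000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only carries microseconds, so sub-microsecond digits are dropped.
    return (stamp + timedelta(microseconds=nanos // 1_000)).isoformat()
```

A transcript parser would call this with `path.stat().st_mtime_ns` whenever a turn carries no timestamp of its own.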

New Gemini CLI adapter

| Path | Description |
| --- | --- |
| `src/agentgrep/__init__.py` | `discover_gemini_sources` walks `~/.gemini/tmp/<project_hash>/chats/session-*.jsonl` and `logs.json`. `parse_gemini_chat_file` is a custom parser (not a reuse of `iter_message_candidates`) because Gemini stores the role in a `type` key that `extract_role` does not recognise; patching `extract_role` globally would false-positive on any JSON dict containing `"type": "user"`. The parser handles `SessionMetadataRecord` (first line), `{"$set": …}` `MetadataUpdateRecord` updates, and `MessageRecord` turns. `parse_gemini_logs_file` is a flat-array parser mapping each `LogEntry` to a prompt-history record. |
| `src/agentgrep/__init__.py` / `src/agentgrep/mcp.py` / `tests/test_agentgrep.py` | `AgentName` literal and `AGENT_CHOICES` extended with `"gemini"`; five MCP tool literal sites updated to match. |
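
A dedicated parser for the mixed-record JSONL shape described above might look like this sketch. It is not `parse_gemini_chat_file`: the function name, the returned dict shape, and the assumption that `content` is a plain string and that the metadata line carries `sessionId` are all illustrative.

```python
from __future__ import annotations

import json
from collections.abc import Iterable, Iterator

# Assumption: only these two `type` values mark conversational turns.
TURN_TYPES = {"user", "gemini"}


def iter_gemini_turns(lines: Iterable[str]) -> Iterator[dict[str, str | None]]:
    """Yield one dict per turn, skipping metadata and update records."""
    session_id: str | None = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # tolerate a malformed or truncated line
        if not isinstance(record, dict) or "$set" in record:
            continue  # update record, not a turn
        role = record.get("type")
        if not isinstance(role, str) or role not in TURN_TYPES:
            # Non-turn record such as the leading session metadata line.
            session_id = record.get("sessionId", session_id)
            continue
        content = record.get("content")
        if isinstance(content, str) and content:
            yield {"session_id": session_id, "role": role, "text": content}
```

Keeping the `type`-as-role rule local to this function is the point of the design: no shared role extractor has to learn that `"type": "user"` means a user turn.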

Documentation

| Path | Description |
| --- | --- |
| `docs/storage-catalog.md` | New reference page with per-agent sections, cross-links to upstream Rust and TypeScript type definitions, and the "Adding or updating a store" recipe. Cursor and Gemini sections name both active adapter IDs and the v1 limitations. |
| `docs/index.md` | Wires the new page into the "Get started" toctree. |
| `CHANGES` | 0.1.0a2 entry with `### What's new` (storage catalogue, Gemini CLI search, Cursor CLI search), `### Fixes` (dotfile `.gitignore`), `### Documentation` (new reference page). |

Test coverage

tests/test_agentgrep.py adds test_list_files_matching_ignores_gitignore (regression for the fd-flag fix), plus six adapter tests covering: Cursor user/assistant turns, Cursor tool_use safety, Gemini user prompts with timestamp + sessionId, Gemini metadata-record skipping, and Gemini logs.json parsing.

Design decisions

Catalogue as data, not code paths. Each StoreDescriptor is a frozen Pydantic row with observed_version / observed_at stamps and an upstream_ref pointer. Future upstream renames become a one-row edit plus a catalog_version bump. The catalogue is descriptive — adapters consume it; whether agentgrep searches a given store by default is a per-store decision the adapters carry, not a property of the catalogue itself.

search_by_default is tri-state. Each row is True / False / None. None documents stores agentgrep knows about but has not yet decided to search — Gemini Antigravity's protobuf blobs are a current example because no public .proto definition exists.
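
The tri-state field can be illustrated with a small stand-in. The project uses frozen Pydantic models; the sketch below substitutes a frozen stdlib dataclass so it runs without dependencies, and the names `StoreRow`, `default_search_ids`, and `undecided_ids` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StoreRow:
    """Stand-in for a catalogue row; only the fields needed here."""

    store_id: str
    # True: searched by default. False: known but skipped. None: not yet decided.
    search_by_default: bool | None = None


def default_search_ids(rows: tuple[StoreRow, ...]) -> list[str]:
    """Rows an adapter would search without being asked."""
    return [row.store_id for row in rows if row.search_by_default is True]


def undecided_ids(rows: tuple[StoreRow, ...]) -> list[str]:
    """Rows that are documented but carry no search decision yet."""
    return [row.store_id for row in rows if row.search_by_default is None]
```

Comparing with `is True` and `is None` rather than relying on truthiness is what keeps `False` and `None` from collapsing into one state.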

IDE vs CLI agent kept as separate Cursor entries. Both have role=PRIMARY_CHAT and distinguishes_from pointers. The --agent cursor dispatcher branch is additive — both the IDE state.vscdb adapters and the new CLI adapter run side-by-side, so users with both surfaces see hits from each.

Gemini chat parser is custom rather than reusing iter_message_candidates. Gemini stores role in a type key. Adding "type" to extract_role's keyset would false-positive on every JSON dict with a "type": "user" field across all adapters. A small dedicated parser is the cleaner boundary; it also gives a clean place to drop SessionMetadataRecord / MetadataUpdateRecord / empty-content gemini records explicitly.

Ignore-flag bypass is safe across every call site. list_files_matching is invoked only by discover_codex_sources, discover_claude_sources, discover_cursor_sources, discover_cursor_cli_sources, and discover_gemini_sources, and all five want raw agent data, never git-aware filtering.

Verification

End-to-end smoke against this user's data — all four agents return matches:

$ uv run agentgrep search libtmux --agent claude --limit 1

$ uv run agentgrep search libtmux --agent codex --limit 1

$ uv run agentgrep search libtmux --agent cursor --limit 1

$ uv run agentgrep search libtmux --agent gemini --limit 1

Catalogue round-trips through JSON without field loss:

$ uv run python -c "from agentgrep.store_catalog import CATALOG; CATALOG.model_dump_json()"

Catalogue JSON schema emits cleanly:

$ uv run python -c "from agentgrep.stores import StoreCatalog; print(StoreCatalog.model_json_schema()['title'])"

Sphinx renders the new page and the CHANGES references without warnings:

$ just build-docs

Test plan

Out of scope (follow-ups)

Refactor existing discover_codex_sources / discover_claude_sources / discover_cursor_sources to read paths from CATALOG rather than hard-coding strings — the catalogue is the contract those adapters will consume.
Surface Gemini assistant thoughts[*].description and toolCalls[*] as searchable text. The v1 Gemini chat parser drops gemini-typed records whose content is empty (output lives in thoughts[]); a clean follow-up adds a separate extraction path.
Honor GEMINI_CLI_HOME env override. Most users use the default ~/.gemini; documented in the catalogue row but not yet read at runtime.
Parse legacy Gemini .json single-file sessions (pre-Feb 2026 format).
Parse Gemini Antigravity protobuf conversations — schema is opaque without a public .proto.
Investigate why ~/.cursor/ai-tracking/ai-code-tracking.db has 0 rows on some systems and document the answer in cursor.ai_tracking.schema_notes.

why: agentgrep ran fd and rg against the user's ~/.claude, ~/.codex, and ~/.cursor without disabling ignore-file semantics. On systems where those paths sit inside a dotfile-managed tree with a .gitignore (yadm, chezmoi, stow, bare-git), both tools silently masked the agent data and the CLI reported "No matches found." fd exited 0 with empty output, so the Python rglob fallback never fired. After discovery, the rg-against-search-root prefilter applied the same mask a second time. Even after the ignore-flag fixes, paths from `fd -a` (which canonicalizes through symlinks) and rg (which doesn't) failed to compare equal in prefilter_sources_by_root, so every source got dropped.
what:
- Pass `-H -I` to fd in list_files_matching and drop `-a` so the returned paths preserve the input root's symlink structure and agree with rg's output.
- Pass `--no-ignore --hidden` (rg) / `--unrestricted --hidden` (ag) in build_grep_command and reshuffle the argv builder so the fixed-string flag goes immediately before `-l`.
- Add test_list_files_matching_ignores_gitignore that drops a `.gitignore: *` next to two JSONL files and asserts discovery still finds them.

…c descriptors
why: Each CLI agent (Claude Code, Cursor, Codex, Gemini) lays out prompt and conversation history on disk in its own way, and those layouts drift between releases: Codex renames history files, Cursor adds a CLI-agent layout that doesn't use the IDE's state.vscdb, Gemini reorganises ~/.gemini/tmp/. Burying paths and record schemas inside the search adapters made every layout change a multi-file edit and left no place to record which version was last verified. The catalogue centralises that knowledge as frozen Pydantic descriptors, each stamped with observed_version / observed_at and a pointer to the upstream type definition where one is public. Whether agentgrep searches a given store by default stays a per-store decision the adapters consult; the catalogue itself is descriptive.
what:
- src/agentgrep/stores.py: StoreFormat / StoreRole enums and the frozen StoreDescriptor / StoreCatalog Pydantic models. JSON-schema-exportable for downstream validation.
- src/agentgrep/store_catalog.py: initial CATALOG spanning Claude, Cursor (IDE + CLI agent kept distinct), Codex, and Gemini. Each entry cites the upstream source-of-truth where one exists; the Cursor CLI rows note that no public schema is published. Includes gemini_project_hash(), a Python mirror of getProjectHash() at packages/core/src/utils/paths.ts, so consumers can map a working directory to a Gemini tmp shard.
- tests/test_stores.py: shape tests for unique IDs, agent-prefix discipline, token-form path patterns (no /home/* leaks), primary-chat stores carrying upstream_ref or sample_record, distinguishes_from cross-references resolving, frozen-model enforcement, and Pydantic JSON round-trip.
- tests/conftest.py: fixture_path(store_id, name) helper rooted at tests/samples/.
- tests/samples/: redacted, structurally-faithful example records for every primary-chat / prompt-history / plan store. No real user content; UUIDs, timestamps, and project hashes are placeholders.

why: The new storage catalogue ships an importable public API but no narrative entry point. Adding a Sphinx page gives readers one place to learn the catalogue's intent, see the per-agent layouts side-by-side, and follow a recipe for adding or updating a descriptor. The CHANGES placeholder for 0.1.0a2 also needed populating so the upcoming release records the fix, the catalogue, and the doc page under the project's deliverable-prose conventions.
what:
- docs/storage-catalog.md: new reference page with sections per agent (Claude / Cursor / Codex / Gemini), cross-links to upstream Rust and TypeScript type definitions, and an "Adding or updating a store" recipe.
- docs/index.md: wire the new page into the "Get started" toctree.
- CHANGES: replace the 0.1.0a2 placeholder body with a multi-sentence lead and three deliverable sections (### What's new / ### Fixes / ### Documentation).

why: PR #4 fixed search for Claude and Codex on dotfile-managed homes and shipped the storage catalogue, but left Cursor CLI and Gemini deliberately out of scope. The user has chat history in both: 502 transcripts under ~/.cursor/projects/<id>/agent-transcripts/ and 111 .jsonl sessions under ~/.gemini/tmp/<project_hash>/chats/. This wires adapters for both and surfaces "gemini" as a valid --agent value across the CLI and the MCP tool literals.
what:
- Extend AgentName and AGENT_CHOICES to include "gemini". Mirror the change in tests/test_agentgrep.py and across the MCP tool literal sites in src/agentgrep/mcp.py.
- New discover_cursor_cli_sources walks ~/.cursor/projects/*/agent-transcripts/**/*.jsonl, skipping sibling project files (repo.json, mcp-approvals.json, terminals/, canvases/) so transcripts stay the focus.
- New parse_cursor_cli_transcript wraps iter_message_candidates (which already handles Cursor's outer-role / inner-message.content[].text shape) and backfills file mtime as the timestamp fallback since transcripts carry no native per-turn timestamp. Adds an isoformat_from_mtime_ns helper and a datetime import.
- New discover_gemini_sources walks ~/.gemini/tmp/<hash>/chats/session-*.jsonl and ~/.gemini/tmp/<hash>/logs.json.
- New parse_gemini_chat_file is custom rather than reusing iter_message_candidates: extract_role does not recognise Gemini's type-as-role key, and patching it globally would false-positive on every JSON dict with a "type" field. The parser handles SessionMetadataRecord (first line), $set MetadataUpdateRecord, and MessageRecord turns; empty-content gemini records (output lives in thoughts[]) are skipped for v1.
- New parse_gemini_logs_file is a flat-array parser mapping each LogEntry to a prompt-history record.
- Plug both new discover functions and four adapter_id branches into discover_sources and iter_source_records. The "cursor" branch becomes additive: the existing IDE / ai-tracking sources still run alongside the new CLI transcripts.
- Catalog corrections in store_catalog.py: drop the non-existent bubbleId mention from cursor.cli.transcripts; clarify that Gemini's RewindRecord / PartialMetadataRecord and info|error|warning type values are documented upstream but unobserved in real files; flip search_by_default to True on the three now-parsed rows; bump catalog_version to 2.
- Six functional tests covering user/assistant Cursor turns, tool_use safety, Gemini user-prompt extraction, metadata-record skipping, and logs.json parsing.

why: The new Cursor CLI and Gemini adapters need user-facing announcement: a 0.1.0a2 changelog entry naming the deliverables and a refresh of the storage-catalog reference page so the per-agent sections accurately describe what agentgrep now parses.
what:
- CHANGES: refresh the 0.1.0a2 lead to mention the new --agent values; add a "Gemini CLI search support" deliverable section under What's new naming the chat-session and logs.json stores and the gemini_project_hash helper; add a "Cursor CLI agent search support" section noting that --agent cursor now hits both the IDE and the CLI surfaces. Existing "Storage catalogue" and "Search beneath dotfile .gitignore" entries from PR #4 are left intact.
- docs/storage-catalog.md: rewrite the Cursor section to name both adapters (cursor.cli_jsonl.v1 and the IDE state.vscdb pair) as searched. Rewrite the Gemini section to describe the records actually observed in real files (only `user` and `gemini` type values; no Rewind / PartialMetadata records), name the active adapter_ids, and note the v1 limitation around empty-content gemini records whose output lives in thoughts[].

why: The project's changelog conventions require PR references to sit in each deliverable's #### heading so readers can navigate from a CHANGES entry to the PR that shipped it. The five 0.1.0a2 entries landed without those refs across two prior changelog commits on this branch because the PR number wasn't known at authoring time. Now that PR #4 is open and ready for review, each deliverable heading gets its (#4) suffix.
what:
- CHANGES: append `(#4)` to the five 0.1.0a2 deliverable headings: "Gemini CLI search support", "Cursor CLI agent search support", "Storage catalogue", "Search beneath dotfile .gitignore", and "New storage-catalog reference page".

tony · 2026-05-17T23:17:21Z

Code review

Found 3 issues:

build_capabilities() hardcodes agents=["codex", "claude", "cursor"] and omits "gemini". The PR extended the AgentSelector literal and CapabilitiesModel.agents's type to include gemini, but the runtime list returned by the agentgrep://capabilities MCP resource still advertises only three agents. MCP clients that read capabilities before routing queries will not see Gemini support.

agentgrep/src/agentgrep/mcp.py

Lines 320 to 327 in eb217e5

    
    def build_capabilities() -> CapabilitiesModel:
        """Build a typed capability summary."""
        backends = agentgrep.select_backends()
        return CapabilitiesModel(
            agents=["codex", "claude", "cursor"],
            search_types=["prompts", "history", "all"],
            adapters=list(KNOWN_ADAPTERS),
            tools=["search", "find"],

KNOWN_ADAPTERS does not include the three adapter IDs introduced by this PR — cursor.cli_jsonl.v1, gemini.tmp_chats_jsonl.v1, gemini.tmp_logs_json.v1. These are emitted at runtime by the new discover_cursor_cli_sources and discover_gemini_sources functions in src/agentgrep/__init__.py, but build_capabilities() publishes list(KNOWN_ADAPTERS) to MCP clients. Clients filtering on adapter ID won't recognise records from the new adapters.

agentgrep/src/agentgrep/mcp.py

Lines 36 to 45 in eb217e5

    
    SERVER_VERSION = "0.1.0"
    KNOWN_ADAPTERS: tuple[str, ...] = (
        "codex.history_json.v1",
        "codex.sessions_jsonl.v1",
        "claude.projects_jsonl.v1",
        "cursor.ai_tracking_sqlite.v1",
        "cursor.state_vscdb_legacy.v1",
        "cursor.state_vscdb_modern.v1",
    )
    READONLY_TAGS = {"readonly", "agentgrep"}

"earlier exploration mis-attributed them" / "earlier exploratory notes mis-attributed them" leaks branch-internal narrative into shipped artifacts (CLAUDE.md says: "Did users of the most recently published release ever experience this old name, old behavior, or bug? If the answer is no, it is branch-internal narrative. Move it to the commit message and describe only the current state in the artifact."). The mis-attribution happened during this PR's research; readers of 0.1.0a1 never saw a wrong attribution. The same phrasing appears in both the catalog schema_notes and the public docs page.

agentgrep/src/agentgrep/store_catalog.py

Lines 402 to 407 in eb217e5

    
            upstream_ref="github.com/openai/codex@4c89772/codex-rs/state/src/lib.rs#L71",
            schema_notes=(
                "Codex logs DB (`LOGS_DB_FILENAME`). Note: the `_N.sqlite` files belong "
                "to codex, not Cursor — earlier exploration mis-attributed them."
            ),
        ),

agentgrep/docs/storage-catalog.md

Lines 104 to 109 in eb217e5

    
    The two `_N.sqlite` files in `~/.codex/` (`state_5.sqlite`,
    `logs_2.sqlite`) belong to Codex, not Cursor — earlier exploratory
    notes mis-attributed them.

    ### Gemini CLI

🤖 Generated with Claude Code


…n capabilities
why: PR #4 (this branch) extended AgentSelector, CapabilitiesModel.agents's type, and every per-tool literal site in src/agentgrep/mcp.py to include "gemini", but neither the runtime list returned by build_capabilities() nor KNOWN_ADAPTERS was updated. MCP clients querying agentgrep://capabilities were told there are three agents and six adapters while CLI / library users see four agents and nine adapters. The branch's own narrative ("agentgrep search --agent gemini is now a valid CLI invocation") was broken at the MCP surface.
what:
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with cursor.cli_jsonl.v1, gemini.tmp_chats_jsonl.v1, and gemini.tmp_logs_json.v1 in dotted-prefix order alongside the existing adapter ids.
- src/agentgrep/mcp.py: replace the hardcoded agents=["codex", "claude", "cursor"] in build_capabilities() with agents=list(agentgrep.AGENT_CHOICES) so the MCP capabilities resource stays in lockstep with the CLI's AGENT_CHOICES tuple; future agent additions only need one source-of-truth update.
- src/agentgrep/mcp.py: declare a module-local AgentName alias and add AGENT_CHOICES: tuple[AgentName, ...] to the AgentGrepModule Protocol so ty can verify the list[AgentName] flowing into CapabilitiesModel.
- tests/test_agentgrep_mcp.py: new test_mcp_capabilities_lists_every_supported_agent_and_adapter that reads the live agentgrep://capabilities resource and asserts the advertised agents match agentgrep.AGENT_CHOICES and the three new adapter ids are present.

…rides
why: Every Codex catalogue row declares env_overrides=("CODEX_HOME",) (5 rows) and every Gemini row declares env_overrides=("GEMINI_CLI_HOME",) (8 rows). The catalogue's module docstring describes these as the contract adapters consult. But discover_codex_sources hardcoded home / ".codex" and discover_gemini_sources hardcoded home / ".gemini" / "tmp", so the catalogue shipped a promise the runtime never kept. The Codex implementation mirrors codex-rs/utils/home-dir/src/lib.rs (env var, when non-empty, replaces ~/.codex). The Gemini implementation mirrors packages/cli/index.ts (env var, when non-empty, replaces ~/.gemini; "tmp" is then appended for the chat/log root).
what:
- discover_codex_sources: read CODEX_HOME via os.environ.get; fall back to home / ".codex" when the env var is unset or empty.
- discover_gemini_sources: read GEMINI_CLI_HOME the same way; appends /tmp to the resolved base.
- Document the new behaviour in both docstrings, pointing at the upstream source-of-truth files.
- tests/test_agentgrep.py: two new tests that monkeypatch the env var to an alternate tmp_path location, plant a decoy session under ${HOME}/.codex or ${HOME}/.gemini, and assert discovery hits the env-pointed root and ignores the decoy.
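
The override rule described in this commit (a non-empty env var replaces the default root; unset or empty falls back) can be sketched as below. The helper name `resolve_root_sketch` and its injectable `environ` parameter are assumptions added to make the example testable; they are not the discovery functions' actual signatures.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from pathlib import Path


def resolve_root_sketch(
    env_var: str,
    default: Path,
    environ: Mapping[str, str] | None = None,
) -> Path:
    """Return the env-pointed root when the variable is set and non-empty."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "")
    # An empty string is treated the same as an unset variable.
    return Path(value) if value else default


def gemini_tmp_root_sketch(home: Path, environ: Mapping[str, str] | None = None) -> Path:
    """Resolve the Gemini base first, then append tmp for the chat/log root."""
    return resolve_root_sketch("GEMINI_CLI_HOME", home / ".gemini", environ) / "tmp"
```

Appending `tmp` after resolution, rather than baking it into the default, is what keeps the override and the default on the same layout.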

…t absence of type
why: parse_gemini_chat_file identified the SessionMetadataRecord line with "startTime" in mapping and "type" not in mapping. Upstream distinguishes metadata records by the kind field (packages/core/src/services/chatRecordingTypes.ts), not by the absence of type. If a future Gemini CLI release adds a type field to the metadata record (e.g. type="session_meta"), the absence-based check would silently misclassify the metadata line as a turn, reset session_id mid-stream, and drop the message because content isn't where the MessageRecord parser expects it. Switching to "kind" in mapping uses the upstream-documented discriminator. The on-disk shape sampled during the v1 adapter already includes "kind":"main" on every metadata line, so this is backwards-compatible with current Gemini releases.
what:
- src/agentgrep/__init__.py: parse_gemini_chat_file now branches on "kind" in mapping for the SessionMetadataRecord guard. Inline comment explains the forward-compat reasoning.
- tests/test_agentgrep.py: test_search_gemini_chat_session_metadata_with_future_type_field exercises the new guard by writing a metadata line that carries both kind and a hypothetical type="session_meta" field; the subsequent user MessageRecord must still emit one chat record with the correct session_id.
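
The difference between the two guards can be shown side by side. Both predicates below are written directly from the expressions quoted in this commit message; the function names and the sample records are illustrative.

```python
from __future__ import annotations

from collections.abc import Mapping


def is_metadata_by_absence(mapping: Mapping[str, object]) -> bool:
    """Old guard: metadata is whatever has startTime and lacks type."""
    return "startTime" in mapping and "type" not in mapping


def is_metadata_by_kind(mapping: Mapping[str, object]) -> bool:
    """New guard: metadata is identified by the kind discriminator."""
    return "kind" in mapping


# Shape of a current metadata line, and a hypothetical future one that adds type.
current_meta = {"sessionId": "s1", "startTime": "2026-01-01T00:00:00Z", "kind": "main"}
future_meta = {**current_meta, "type": "session_meta"}
user_turn = {"type": "user", "content": "hello"}
```

Both guards agree on today's records; only the absence-based one stops recognising the metadata line once a `type` field appears on it.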

…_records branch
why: Every other branch in iter_source_records ends with ``yield from parse_*(source); return``. The final ``gemini.tmp_logs_json.v1`` branch omitted the trailing return. Functionally identical today (it's the last branch) but inconsistent with the local pattern, and a future appended branch would have run unconditionally for log sources. One-line consistency fix locks the "each adapter_id dispatches to exactly one parser" invariant.
what:
- src/agentgrep/__init__.py: add ``return`` after the final ``yield from parse_gemini_logs_file(source)``.
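
Why the trailing `return` matters in a generator dispatcher can be seen in a reduced example. The adapter ids, the parsers, and the appended fallback below are all hypothetical; they only reproduce the `yield from ...; return` pattern the commit describes.

```python
from __future__ import annotations

from collections.abc import Iterator


def parse_a(source: str) -> Iterator[str]:
    yield f"a:{source}"


def parse_b(source: str) -> Iterator[str]:
    yield f"b:{source}"


def parse_fallback(source: str) -> Iterator[str]:
    yield f"fallback:{source}"


def dispatch(adapter_id: str, source: str, final_return: bool = True) -> Iterator[str]:
    """Route one source to one parser; later code must not run after a match."""
    if adapter_id == "a.v1":
        yield from parse_a(source)
        return
    if adapter_id == "b.v1":
        yield from parse_b(source)
        if final_return:
            return  # without this, execution falls into whatever is appended below
    # Code appended after the last branch, e.g. a catch-all parser.
    yield from parse_fallback(source)
```

With `final_return=False` the `b.v1` source is parsed twice, once by its own parser and once by the appended code, which is the situation the one-line fix rules out.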

… artifacts
why: Three sites violated the project's "Shipped vs. Branch-Internal Narrative" rule (see CLAUDE.md / AGENTS.md, Published-Release Test). The mis-attribution of the `_N.sqlite` files happened during PR #4's research; readers of 0.1.0a1 never saw a wrong attribution. The CHANGES lead's "previously did not parse" phrasing compares branch states rather than shipped states. Per the rule, the artifact should describe only the current state; the historical "we got this wrong once" belongs in commit messages, not in shipped catalogue notes, docs pages, or release-notes prose. The corrected explanation for *why* the `_N.sqlite` files belong to Codex is that their filenames are defined by `STATE_DB_FILENAME` and `LOGS_DB_FILENAME` constants in the upstream Rust crate `codex-rs/state/src/lib.rs`, which is now what the shipped artifacts say.
what:
- src/agentgrep/store_catalog.py: rewrite the `codex.logs_db` row's `schema_notes` so it cites the upstream constant names directly and drops the "earlier exploration mis-attributed them" aside.
- docs/storage-catalog.md: rewrite the Codex `_N.sqlite` paragraph to cite the upstream constants and drop the "earlier exploratory notes mis-attributed them" aside.
- CHANGES: rewrite the 0.1.0a2 lead so it describes the current state ("adds search support for the Cursor CLI agent's per-project transcripts and Google Gemini CLI's session and prompt-log files") rather than the branch diff ("broadens search to two stores agentgrep previously did not parse").

why: The catalogue declares gemini.tmp.logs as role=StoreRole.PROMPT_HISTORY; upstream's logs.json is the user-prompt audit log, analogous to Codex's history.jsonl. Records from an audit log are kind="history" by the project's SearchRecord convention; parse_codex_history_file constructs them that way explicitly. The Gemini logs parser routed records through build_search_record, which auto-classifies role="user" as kind="prompt", so agentgrep search --agent gemini --type history returned zero logs.json hits.
what:
- parse_gemini_logs_file now yields SearchRecord(kind="history", ...) directly, mirroring parse_codex_history_file. Drops the MessageCandidate + build_search_record indirection.
- test_search_gemini_logs_returns_user_message asserts kind == "history" and session_id propagates from the LogEntry.

…Spec
why: Every adapter hard-coded its agent path roots, globs, post-filters, and runtime metadata (store, adapter_id, path_kind, source_kind) inline. The catalogue's env_overrides field was the only catalogue-side declaration the runtime consulted, and even that duplicated the env-var name. Centralising the runtime metadata in the catalogue means a future upstream rename (Codex moving its history file, Cursor adding a CLI agent, Gemini reorganising tmp/) is a one-row edit; discovery catches up automatically. The catalogue earns its keep as the runtime contract, not just documentation. Pairs naturally with the codebase's first logger (logging.getLogger("agentgrep") + NullHandler in src/agentgrep/__init__.py), which fires a WARNING when CODEX_HOME or GEMINI_CLI_HOME points to a non-existent path. Upstream Codex errors in that case; agentgrep stays read-only-friendly and falls back, but the warning surfaces a misconfiguration the user almost certainly cares about. The structured extra={"agentgrep_env_var": ..., "agentgrep_env_path": ...} follows CLAUDE.md's logging conventions exactly.
what:
- src/agentgrep/stores.py: move PathKind and SourceKind here as literal type aliases (re-exported from agentgrep for compat); add the new DiscoverySpec Pydantic model with home_subpath, platform_paths, files, glob, and path_parts_required fields; add discovery: tuple[DiscoverySpec, ...] = () to StoreDescriptor. A tuple (rather than a single optional) handles stores whose on-disk shape spans more than one DiscoverySpec: Codex history has .json and .jsonl alternatives, Cursor IDE state has modern and legacy adapter ids.
- src/agentgrep/store_catalog.py: populate discovery on claude.projects.session, cursor.cli.transcripts, cursor.ai_tracking, cursor.ide.state_vscdb, codex.history, codex.sessions, gemini.tmp.chats, and gemini.tmp.logs.
- src/agentgrep/__init__.py:
  - logger = logging.getLogger(__name__); NullHandler registered.
  - resolve_env_root(env_var, default) honours the env override and logger.warning()s when the path isn't a directory.
  - handles_from_discovery(spec, agent, root, backends) emits SourceHandles by interpreting one DiscoverySpec.
  - discover_from_catalog(home, agent, base, backends) iterates CATALOG.for_agent(agent) and walks each row's discovery.
  - discover_codex / claude / cursor / gemini delegate to that helper; discover_cursor_cli_sources is a back-compat shim that returns [] because Cursor CLI transcripts now flow through discover_cursor_sources.
- tests/test_agentgrep.py: caplog-scoped test asserts the WARNING record carries the structured agentgrep_env_var / agentgrep_env_path attributes; a companion test asserts resolve_env_root falls back silently when the env var is unset.
- tests/test_stores.py: invariant test asserts every runtime adapter id is declared by some catalogue row's DiscoverySpec and also advertised in agentgrep.mcp.KNOWN_ADAPTERS.
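
The warn-and-fall-back behaviour with structured `extra` fields might be sketched as follows. The logger name `agentgrep_sketch`, the message text, and the injectable `environ` parameter are assumptions; the two `extra` keys and the fallback rule are taken from the commit message.

```python
from __future__ import annotations

import logging
import os
from collections.abc import Mapping
from pathlib import Path

# Library-style logger: silent unless the application configures handlers.
logger = logging.getLogger("agentgrep_sketch")
logger.addHandler(logging.NullHandler())


def resolve_env_root_sketch(
    env_var: str,
    default: Path,
    environ: Mapping[str, str] | None = None,
) -> Path:
    """Honour the env override; warn and fall back when it is not a directory."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "")
    if not value:
        return default  # unset or empty: fall back without logging
    candidate = Path(value)
    if not candidate.is_dir():
        logger.warning(
            "env override does not point at a directory; using default",
            extra={"agentgrep_env_var": env_var, "agentgrep_env_path": value},
        )
        return default
    return candidate
```

Keys passed through `extra` become attributes on the `LogRecord`, so a test or a structured log handler can read `record.agentgrep_env_var` without parsing the message string.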

why: Upstream Gemini CLI still reads the older single-file `.json` session format via the `isLegacyRecord` discriminator in packages/core/src/services/chatRecordingService.ts. The legacy shape is a JSON object with session metadata at the top level and the full conversation under a `messages` array; each entry carries the same per-turn fields the JSONL format uses. agentgrep had no adapter for these files, leaving ~1000 real legacy session files under ~/.gemini/tmp/*/chats/*.json invisible to search on systems that predate the JSONL migration.
what:
- src/agentgrep/__init__.py:
  - Factor _gemini_message_record_to_candidate(mapping, session_id) so the new parser shares the inner-record handling with parse_gemini_chat_file. Returns None for records with no searchable text (empty content + no thoughts/toolCalls).
  - parse_gemini_chat_legacy_file(source) reads the JSON object, pulls sessionId from the top level, iterates `messages[]`, and yields one SearchRecord per turn with searchable text.
  - Dispatcher routes adapter_id "gemini.tmp_chats_legacy_json.v1" to the new parser.
- src/agentgrep/store_catalog.py: new gemini.tmp.chats_legacy row. role=SUPPLEMENTARY_CHAT, format=JSON_OBJECT, distinguishes_from the current jsonl chats. DiscoverySpec uses home_subpath=("tmp",) and glob="session-*.json" so the catalogue drives both the enumeration and the runtime metadata.
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with "gemini.tmp_chats_legacy_json.v1".
- tests/test_agentgrep.py: test_search_gemini_chat_legacy_json_session plants a single legacy file with one user MessageRecord and asserts the prompt is surfaced with the right session_id and role.
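
The legacy single-file shape (top-level `sessionId`, turns under `messages[]`) lends itself to a short parser sketch. The function name, the returned dict shape, and the assumption that each message's `content` is a plain string are illustrative and not taken from `parse_gemini_chat_legacy_file`.

```python
from __future__ import annotations

import json
from collections.abc import Iterator


def iter_legacy_turns(text: str) -> Iterator[dict[str, object]]:
    """Yield one dict per message in a legacy single-object session file."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return  # unreadable file: nothing to yield
    if not isinstance(data, dict):
        return
    session_id = data.get("sessionId")  # lives at the top level, not per turn
    messages = data.get("messages")
    if not isinstance(messages, list):
        return
    for message in messages:
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        if isinstance(content, str) and content:
            yield {"session_id": session_id, "role": message.get("type"), "text": content}
```

Every yielded turn inherits the one top-level `sessionId`, which is the main structural difference from the line-per-record JSONL format.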

…all text

why: Gemini assistant turns on the current `.jsonl` format leave the `content` field empty when the model's prose lives in `thoughts[]` (reasoning, ~150-300 chars per entry) and tool invocations live in `toolCalls[]`. The previous adapter dropped those turns entirely, leaving ~70% of assistant output invisible to search. Concatenating `thoughts[*].subject`/`description` and `toolCalls[*].name`/`description` into the candidate's text restores searchability without breaking the conversation-turn boundary: one SearchRecord per turn keeps record counts flat and downstream filtering simple. Tool `args` is omitted: it's JSON-shaped and low signal compared to the human-readable `description` field.

what:
- src/agentgrep/__init__.py:
  - _gemini_thoughts_text(thoughts) flattens subject + description into a newline-joined string. Skipped when no `thoughts` field is present or it's not a list.
  - _gemini_tool_calls_text(toolCalls) same for tool calls' `name` + `description`. Tool `args` deliberately excluded.
  - _gemini_message_record_to_candidate reuses both helpers when the record is `type="gemini"`. Returns None only when content, thoughts, AND toolCalls all contribute no text.
- src/agentgrep/store_catalog.py: refresh `gemini.tmp.chats` search_notes to document the new behaviour.
- tests/test_agentgrep.py: test_search_gemini_chat_session_surfaces_thoughts_and_tool_calls plants a gemini turn with thoughts only and a gemini turn with toolCalls only, asserts both surface with the expected text. test_search_gemini_chat_session_drops_textless_records (renamed from drops_metadata_records) keeps the negative case: gemini-typed records with empty content AND no thoughts AND no toolCalls still produce nothing.
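As a rough illustration of the flattening described above, the sketch below joins the human-readable fields of `thoughts[]` and `toolCalls[]` into one text per turn. The helper names `join_fields` and `turn_text` are hypothetical (the commit's own helpers are _gemini_thoughts_text and _gemini_tool_calls_text), and the ordering of content, thoughts, and tool calls in the joined text is an assumption.

```python
from __future__ import annotations

from typing import Any, Mapping, Optional


def join_fields(items: Any, keys: tuple[str, ...]) -> str:
    """Newline-join the non-empty string values of ``keys`` across a list of dicts."""
    if not isinstance(items, list):
        return ""  # field missing or not a list: contributes no text
    parts: list[str] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        for key in keys:
            value = item.get(key)
            if isinstance(value, str) and value.strip():
                parts.append(value.strip())
    return "\n".join(parts)


def turn_text(record: Mapping[str, Any]) -> Optional[str]:
    """Return the searchable text for one turn, or None when there is none."""
    content = record.get("content")
    chunks = [content.strip()] if isinstance(content, str) and content.strip() else []
    if record.get("type") == "gemini":
        chunks.append(join_fields(record.get("thoughts"), ("subject", "description")))
        # Tool ``args`` is deliberately left out: only name and description are read.
        chunks.append(join_fields(record.get("toolCalls"), ("name", "description")))
    text = "\n".join(chunk for chunk in chunks if chunk)
    return text or None
```

Returning a single string per record keeps the one-record-per-turn boundary the commit message calls out.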

…avity catalogue rows

why: Upstream investigation established that three catalogue rows described behaviour upstream does not have:
- `gemini.history` claimed the row was a post-retention archive of `tmp/`. Upstream `packages/cli/src/utils/sessionCleanup.ts` hard-deletes expired sessions via `fs.unlink()`; there is no archive. The `.project_root` metadata stubs under some users' `~/.gemini/history/` are orphaned artefacts from an earlier layout, not Gemini-CLI-written archives.
- `gemini.antigravity.brain` and `gemini.antigravity.conversations` are Antigravity IDE artefacts. Gemini CLI only detects Antigravity as an IDE launcher target (`packages/core/src/ide/detect-ide.ts`); it never reads or writes the protobuf conversation files. If agentgrep ever supports Antigravity, that is a separate agent kind, not a Gemini sub-store.

Shipping catalogue claims unbacked by upstream is the same kind of branch-internal narrative the project's CLAUDE.md rule forbids; honest catalogue means dropping the rows.

what:
- src/agentgrep/store_catalog.py: delete the three rows. Bump `catalog_version` to 3 (descriptor entries removed).
- docs/storage-catalog.md: rewrite the Gemini section to name the three real adapters (`gemini.tmp_chats_jsonl.v1`, `gemini.tmp_chats_legacy_json.v1`, `gemini.tmp_logs_json.v1`), explain that thoughts and tool-call text now surface for assistant turns with empty `content`, and document Antigravity / `history/` as out-of-scope with upstream citations.
- CHANGES: refresh the "Gemini CLI search support" entry to mention the legacy `.json` adapter, thoughts/toolCalls surfacing, and the CODEX_HOME / GEMINI_CLI_HOME warning; refresh the "Storage catalogue" entry to mention DiscoverySpec and the row prunings.

tony · 2026-05-18T00:40:25Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

…riven discovery

why: src/agentgrep/store_catalog.py's module docstring describes the catalogue as the runtime contract: "each StoreDescriptor carries a search_by_default field that the per-agent discover functions consult." The runtime did not. discover_from_catalog walked every descriptor's discovery tuple unconditionally, so a row marked search_by_default=False but carrying a DiscoverySpec would silently flow into search results; the catalogue would lie about its own behaviour. The shape of the catalogue today hides this from users: every row that sets search_by_default=False also leaves discovery=(), so the loop happens to do the right thing in practice. Adding a DiscoverySpec to such a row in the future would have been a quiet regression.

what:
- src/agentgrep/__init__.py: discover_from_catalog now skips descriptors whose search_by_default is exactly False. True remains searched; None (decision-deferred) also remains searched, matching the historical default for rows that pre-date the search-by-default field.
- src/agentgrep/store_catalog.py: state the policy explicitly on the two rows that previously relied on the None default: cursor.ai_tracking (role corrected from APP_STATE to SUPPLEMENTARY_CHAT to reflect that its conversation_summaries rows are chat-derived metadata, not app state) and cursor.ide.state_vscdb.
- tests/test_stores.py: test_discover_from_catalog_skips_search_by_default_false plants a synthetic catalogue with one False row and one True row that both carry valid DiscoverySpecs against the same base directory, monkeypatches CATALOG, and asserts only the True row's source flows through.
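The tri-state policy above (True searched, None searched, False skipped) comes down to an identity check against False. A minimal sketch, using a hypothetical `Row` dataclass in place of the real Pydantic StoreDescriptor and a hypothetical `searchable_rows` helper in place of the loop inside discover_from_catalog:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a StoreDescriptor; only the relevant fields."""

    store_id: str
    search_by_default: Optional[bool]
    discovery: tuple[str, ...] = ()


def searchable_rows(rows: Iterable[Row]) -> list[Row]:
    """Keep rows unless they opt out with an explicit False."""
    # ``is not False`` keeps both True and None (decision-deferred) rows,
    # whereas a plain truthiness test would wrongly drop the None rows.
    return [row for row in rows if row.search_by_default is not False]
```

A truthiness check (`if row.search_by_default:`) would have changed behaviour for rows that pre-date the field, which is why the comparison is against False exactly.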

…e descriptor

why: A single StoreDescriptor can carry more than one DiscoverySpec; the canonical case is cursor.ide.state_vscdb, which declares one spec for the modern platform paths (~/.config/Cursor/...) and one for the legacy ~/.cursor/state.vscdb glob. On standard layouts the two roots are disjoint, but a custom XDG layout or symlinked install could place a single state.vscdb where both specs match. handles_from_discovery treated the two specs independently, so the same file flowed out as two SourceHandles with different adapter ids (cursor.state_vscdb_modern.v1 and cursor.state_vscdb_legacy.v1), producing duplicate search records downstream.

what:
- src/agentgrep/__init__.py: discover_from_catalog now keeps a per-descriptor seen_paths set and skips any handle whose path was already emitted by an earlier spec on the same descriptor. Cross-descriptor dedup stays out of scope: gemini.tmp.chats and gemini.tmp.chats_legacy are distinct stores even when they share a directory.
- tests/test_stores.py: test_discover_from_catalog_deduplicates_paths_within_descriptor monkeypatches the catalogue with one descriptor whose two specs both point at a single state.vscdb file and asserts exactly one SourceHandle is returned.
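A minimal sketch of the per-descriptor dedup described above. `dedupe_within_descriptor` is a hypothetical helper, and each spec is reduced to an `(adapter_id, matched_paths)` pair instead of a real DiscoverySpec. Paths are compared exactly as given; whether agentgrep resolves symlinks before comparing is not stated in the commit message, so that is left out.

```python
from __future__ import annotations

from pathlib import Path
from typing import Sequence


def dedupe_within_descriptor(
    spec_matches: Sequence[tuple[str, Sequence[Path]]],
) -> list[tuple[str, Path]]:
    """Emit each path once per descriptor; the first spec to match it wins."""
    seen_paths: set[Path] = set()  # reset for every descriptor, never shared
    handles: list[tuple[str, Path]] = []
    for adapter_id, paths in spec_matches:
        for path in paths:
            if path in seen_paths:
                continue  # already emitted by an earlier spec on this descriptor
            seen_paths.add(path)
            handles.append((adapter_id, path))
    return handles
```

Because the set is scoped to one call, two different descriptors that match the same directory still each emit their own handles, which matches the out-of-scope note on cross-descriptor dedup.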

…tinguishes missing from non-directory

why: The env-override warning emitted by resolve_env_root had two problems. First, the message was present-tense ("env-override path does not exist"), which clashes with CLAUDE.md's logging style rule that messages describe events in past tense. Second, the same warning fired whether CODEX_HOME or GEMINI_CLI_HOME pointed at a missing path OR at a path that existed as a regular file (or other non-directory inode). An operator reading the log could not tell which class of misconfiguration they had: a typo versus an env var aimed at the wrong inode kind.

what:
- src/agentgrep/__init__.py: resolve_env_root now logs "env-override path unavailable, fell back to default" and adds a structured agentgrep_env_path_status extra field whose value is "not_a_directory" when candidate.exists() is true and "not_found" otherwise. Operators reading the log have a stable scalar to filter on.
- tests/test_agentgrep.py: test_resolve_env_root_warns_on_missing_path now asserts agentgrep_env_path_status == "not_found" in addition to the env-var and env-path fields. New test_resolve_env_root_warns_when_env_path_is_file plants a real regular file at the env path and asserts the status is "not_a_directory".
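The behaviour described above can be approximated with the standard library alone. This is a sketch under assumptions, not the shipped function: the logger name is made up, and the handling of an empty env value and the `expanduser()` call are guesses. The message text and the three `extra` keys are taken from the commit messages.

```python
from __future__ import annotations

import logging
import os
from pathlib import Path

# Hypothetical logger name; the real module's logger may differ.
logger = logging.getLogger("agentgrep_sketch")


def resolve_env_root(env_var: str, default: Path) -> Path:
    """Return the env-override directory, or ``default`` when it is unusable."""
    raw = os.environ.get(env_var)
    if not raw:
        return default  # unset (or empty, by assumption): fall back silently
    candidate = Path(raw).expanduser()
    if candidate.is_dir():
        return candidate
    # Distinguish "exists but is not a directory" from "does not exist".
    status = "not_a_directory" if candidate.exists() else "not_found"
    logger.warning(
        "env-override path unavailable, fell back to default",
        extra={
            "agentgrep_env_var": env_var,
            "agentgrep_env_path": str(candidate),
            "agentgrep_env_path_status": status,
        },
    )
    return default
```

Values passed through `extra` become attributes on the LogRecord, so a test or log pipeline can filter on `record.agentgrep_env_path_status` without parsing the message.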

…es entry point

why: Cursor CLI transcripts are discovered by discover_cursor_sources, which reads their DiscoverySpec from CATALOG. A separate discover_cursor_cli_sources function exists only as an empty stub whose docstring describes the in-branch refactor history; that "compatibility" framing is branch-internal narrative for a symbol that never shipped to users. Removing the stub leaves discovery behind one entry point and drops misleading documentation from the shipped surface.

what:
- src/agentgrep/__init__.py: delete the discover_cursor_cli_sources definition. No callers in the repo and the symbol is not part of any released API.

tony temporarily deployed to docs May 17, 2026 22:23 — with GitHub Actions Inactive

tony added 3 commits May 17, 2026 17:34

tony force-pushed the more-backends branch from 68b428c to 6525683 Compare May 17, 2026 22:34

tony temporarily deployed to docs May 17, 2026 22:34 — with GitHub Actions Inactive

tony added 2 commits May 17, 2026 17:53

tony temporarily deployed to docs May 17, 2026 22:54 — with GitHub Actions Inactive

tony marked this pull request as ready for review May 17, 2026 23:01

tony temporarily deployed to docs May 17, 2026 23:04 — with GitHub Actions Inactive

tony added 5 commits May 17, 2026 18:32

tony temporarily deployed to docs May 17, 2026 23:38 — with GitHub Actions Inactive

tony temporarily deployed to docs May 17, 2026 23:39 — with GitHub Actions Inactive

tony added 5 commits May 17, 2026 19:10

tony temporarily deployed to docs May 18, 2026 00:30 — with GitHub Actions Inactive

tony added 3 commits May 17, 2026 19:53

tony temporarily deployed to docs May 18, 2026 00:59 — with GitHub Actions Inactive

tony merged commit cacd8b9 into master May 18, 2026
3 checks passed

tony deleted the more-backends branch May 18, 2026 01:25

tony merged 20 commits into master from more-backends
