Add storage catalogue; restore search beneath dotfile .gitignore #4

Merged
tony merged 20 commits into master from more-backends on May 18, 2026

tony commented May 17, 2026

Summary

  • Fix silent "No matches found." for agentgrep search --agent claude / --agent codex / --agent cursor on systems where the agent stores live inside a dotfile-managed tree (yadm, chezmoi, stow, bare-git, mr). fd and rg both honored .gitignore semantics under $HOME and silently masked everything; removing fd -a also fixes a path-comparison mismatch between fd's canonicalized output and rg's symlink-preserving output.
  • Add an importable storage catalogue — frozen Pydantic descriptors for every on-disk prompt and history store agentgrep knows about, stamped with observed_version, observed_at, format, schema notes, and pointers to upstream type definitions where one is public.
  • Add a Cursor CLI agent transcript adapter so agentgrep search --agent cursor returns results from ~/.cursor/projects/<id>/agent-transcripts/<session_uuid>/<session_uuid>.jsonl in addition to the existing Cursor IDE state.vscdb store. Transcripts carry no native per-turn timestamp; agentgrep backfills the file's mtime.
  • Add a Google Gemini CLI adapter — --agent gemini is now a valid choice across the CLI and the MCP tool literals. The adapter parses both ~/.gemini/tmp/<project_hash>/chats/session-*.jsonl (mixed SessionMetadataRecord / MessageRecord / MetadataUpdateRecord lines) and ~/.gemini/tmp/<project_hash>/logs.json (flat LogEntry audit array).
  • Document the catalogue and the per-agent adapters in a new Sphinx page with per-agent sections, upstream-pinned schema citations, and a recipe for adding or updating a descriptor.

Changes by area

Search discovery and prefilter

src/agentgrep/__init__.py: list_files_matching passes -H -I to fd and drops -a. The flag pair bypasses gitignore/hidden filtering; dropping -a keeps fd's paths in the same symlink-preserved form that rg emits, so prefilter_sources_by_root can compare them directly. build_grep_command passes --no-ignore --hidden (rg) / --unrestricted --hidden (ag), and the argv builder is reshuffled so the fixed-string flag goes immediately before -l.
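
The two argv shapes described above can be sketched roughly as follows. The helper names build_fd_argv and build_rg_argv are hypothetical, not agentgrep's actual functions, and the sketch only builds command lines without running anything.

```python
from __future__ import annotations


def build_fd_argv(root: str, glob: str) -> list[str]:
    """Build an fd command line that bypasses ignore files and hidden-file filtering."""
    # -H (--hidden) and -I (--no-ignore) switch off fd's default filtering.
    # -a (--absolute-path) is deliberately left out so paths keep the root's form.
    return ["fd", "-H", "-I", "--type", "f", "--glob", glob, root]


def build_rg_argv(pattern: str, root: str, *, fixed_string: bool = True) -> list[str]:
    """Build an rg command line that lists matching files without ignore rules."""
    argv = ["rg", "--no-ignore", "--hidden"]
    if fixed_string:
        # The fixed-string flag is placed immediately before -l.
        argv.append("--fixed-strings")
    argv += ["-l", pattern, root]
    return argv
```

Because neither builder asks for absolute paths, a caller that passes the same root to both tools should get paths in the same form back from each.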

New store catalogue

| Path | Description |
| --- | --- |
| src/agentgrep/stores.py | StoreFormat and StoreRole enums; frozen Pydantic StoreDescriptor and StoreCatalog models. JSON-schema-exportable. |
| src/agentgrep/store_catalog.py | CATALOG (version 2) spanning Claude, Cursor (IDE + CLI agent), Codex, Gemini. Includes gemini_project_hash(), a Python mirror of Gemini CLI's getProjectHash(), so consumers can map a working directory to a Gemini tmp shard. |
| tests/test_stores.py | Shape tests: unique IDs, agent-prefix discipline, token-form path patterns (no /home/* leaks), distinguishes_from resolves, frozen-model enforcement, Pydantic JSON round-trip, fixture validation. |
| tests/conftest.py | fixture_path(store_id, name) helper rooted at tests/samples/. |
| tests/samples/<agent>/<store_id>/... | Redacted, structurally-faithful fixtures per primary-chat / prompt-history / plan store. No real user content. |
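
As a rough illustration of what a project-hash helper like gemini_project_hash() might look like: the sketch below assumes the upstream function is a SHA-256 hex digest of the project root path string. That assumption is not confirmed by this PR, and the function names here are hypothetical.

```python
import hashlib


def project_hash_sketch(project_root: str) -> str:
    """Hash a project root path into a shard name.

    Assumption: upstream hashes the path string with SHA-256 and hex-encodes it.
    """
    return hashlib.sha256(project_root.encode("utf-8")).hexdigest()


def gemini_chats_dir_sketch(project_root: str, gemini_home: str = "~/.gemini") -> str:
    """Map a working directory to its chats directory under <home>/tmp/<hash>/."""
    return f"{gemini_home}/tmp/{project_hash_sketch(project_root)}/chats"
```

If the real hash differs, only project_hash_sketch would need to change; the directory layout comes from the paths quoted in the summary.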

New Cursor CLI agent adapter

| Path | Description |
| --- | --- |
| src/agentgrep/__init__.py | discover_cursor_cli_sources walks ~/.cursor/projects/*/agent-transcripts/**/*.jsonl, skipping sibling project files (repo.json, mcp-approvals.json, terminals/, canvases/). parse_cursor_cli_transcript wraps the existing iter_message_candidates (which already handles Cursor's outer-role / inner-message.content[].text shape) and backfills the file's mtime as a timestamp fallback via the new isoformat_from_mtime_ns helper. |
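
The mtime backfill can be sketched as below. The function names and the dict-shaped record are illustrative assumptions, and the real isoformat_from_mtime_ns helper may format its output differently.

```python
from datetime import datetime, timezone
from pathlib import Path


def isoformat_from_mtime_ns_sketch(mtime_ns: int) -> str:
    """Convert a nanosecond mtime (as in os.stat_result.st_mtime_ns) to ISO 8601 UTC."""
    seconds, nanos = divmod(mtime_ns, 1_000_000_000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime has microsecond resolution, so the remaining nanoseconds are truncated.
    return stamp.replace(microsecond=nanos // 1000).isoformat()


def backfill_timestamp(record: dict, path: Path) -> dict:
    """Return the record unchanged if it has a timestamp, else add the file's mtime."""
    if record.get("timestamp"):
        return record
    mtime_ns = path.stat().st_mtime_ns
    return {**record, "timestamp": isoformat_from_mtime_ns_sketch(mtime_ns)}
```

Splitting with integer divmod avoids the float rounding that dividing a nanosecond count by 1e9 would introduce.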

New Gemini CLI adapter

| Path | Description |
| --- | --- |
| src/agentgrep/__init__.py | discover_gemini_sources walks ~/.gemini/tmp/<project_hash>/chats/session-*.jsonl and logs.json. parse_gemini_chat_file is a custom parser (not a reuse of iter_message_candidates) because Gemini stores the role in a type key that extract_role does not recognise; patching extract_role globally would false-positive on any JSON dict containing "type": "user". The parser handles SessionMetadataRecord (first line), {"$set": …} MetadataUpdateRecord updates, and MessageRecord turns. parse_gemini_logs_file is a flat-array parser mapping each LogEntry to a prompt-history record. |
| src/agentgrep/__init__.py / src/agentgrep/mcp.py / tests/test_agentgrep.py | AgentName literal and AGENT_CHOICES extended with "gemini"; five MCP tool literal sites updated to match. |

Documentation

| Path | Description |
| --- | --- |
| docs/storage-catalog.md | New reference page with per-agent sections, cross-links to upstream Rust and TypeScript type definitions, and the "Adding or updating a store" recipe. Cursor and Gemini sections name both active adapter IDs and the v1 limitations. |
| docs/index.md | Wires the new page into the "Get started" toctree. |
| CHANGES | 0.1.0a2 entry with ### What's new (storage catalogue, Gemini CLI search, Cursor CLI search), ### Fixes (dotfile .gitignore), ### Documentation (new reference page). |

Test coverage

tests/test_agentgrep.py adds test_list_files_matching_ignores_gitignore (regression for the fd-flag fix), plus six adapter tests covering: Cursor user/assistant turns, Cursor tool_use safety, Gemini user prompts with timestamp + sessionId, Gemini metadata-record skipping, and Gemini logs.json parsing.

Design decisions

Catalogue as data, not code paths. Each StoreDescriptor is a frozen Pydantic row with observed_version / observed_at stamps and an upstream_ref pointer. Future upstream renames become a one-row edit plus a catalog_version bump. The catalogue is descriptive — adapters consume it; whether agentgrep searches a given store by default is a per-store decision the adapters carry, not a property of the catalogue itself.

search_by_default is tri-state. Each row is True / False / None. None documents stores agentgrep knows about but has not yet decided to search — Gemini Antigravity's protobuf blobs are a current example because no public .proto definition exists.
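
A minimal sketch of the "catalogue as data" idea with a tri-state flag. It uses stdlib frozen dataclasses as a stand-in for the Pydantic models, and the StoreRow class, its fields, and the example rows are all hypothetical.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from typing import Optional


@dataclass(frozen=True)
class StoreRow:
    """One descriptive catalogue row (a stand-in for a frozen Pydantic model)."""

    store_id: str
    path_pattern: str
    observed_version: str
    # True / False are decisions; None means "known, but not yet decided".
    search_by_default: Optional[bool] = None


ROWS = (
    StoreRow("example.chats", "${HOME}/.example/chats/*.jsonl", "1.0.0", True),
    StoreRow("example.cache", "${HOME}/.example/cache.db", "1.0.0", False),
    StoreRow("example.blobs", "${HOME}/.example/blobs/*.pb", "1.0.0"),
)


def searched_by_default(rows: tuple[StoreRow, ...]) -> list[str]:
    """IDs of stores an adapter would search without being asked."""
    return [row.store_id for row in rows if row.search_by_default is True]


def undecided(rows: tuple[StoreRow, ...]) -> list[str]:
    """IDs of stores that are catalogued but carry no search decision yet."""
    return [row.store_id for row in rows if row.search_by_default is None]
```

Comparing with `is True` and `is None` rather than relying on truthiness keeps the undecided rows from being silently treated as False.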

IDE vs CLI agent kept as separate Cursor entries. Both have role=PRIMARY_CHAT and distinguishes_from pointers. The --agent cursor dispatcher branch is additive — both the IDE state.vscdb adapters and the new CLI adapter run side-by-side, so users with both surfaces see hits from each.

Gemini chat parser is custom rather than reusing iter_message_candidates. Gemini stores role in a type key. Adding "type" to extract_role's keyset would false-positive on every JSON dict with a "type": "user" field across all adapters. A small dedicated parser is the cleaner boundary; it also gives a clean place to drop SessionMetadataRecord / MetadataUpdateRecord / empty-content gemini records explicitly.
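
A dedicated parser along these lines might look like the sketch below. It is not the PR's implementation: the field names (sessionId, content, timestamp), the string-valued content, and the "gemini" to "assistant" role mapping are assumptions, and the metadata guard uses the kind key that a later commit in this PR adopts.

```python
import json
from typing import Iterable, Iterator

# Assumption: Gemini's "type" values map onto roles like this.
ROLE_BY_TYPE = {"user": "user", "gemini": "assistant"}


def iter_gemini_turns(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per searchable turn from a session .jsonl file's lines."""
    session_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate a partially written trailing line
        if not isinstance(record, dict):
            continue
        if "$set" in record:
            continue  # MetadataUpdateRecord: no searchable text
        if "kind" in record:
            # SessionMetadataRecord: remember the session, emit nothing.
            session_id = record.get("sessionId", session_id)
            continue
        record_type = record.get("type")
        role = ROLE_BY_TYPE.get(record_type) if isinstance(record_type, str) else None
        content = record.get("content")
        if role is None or not isinstance(content, str) or not content.strip():
            continue  # unknown type or empty-content record
        yield {
            "role": role,
            "text": content,
            "timestamp": record.get("timestamp"),
            "session_id": session_id,
        }
```

Keeping the type-to-role mapping local to this function is what avoids touching a shared role extractor used by the other adapters.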

Ignore-flag bypass is safe across all five call sites. list_files_matching is invoked only by discover_codex_sources, discover_claude_sources, discover_cursor_sources, discover_cursor_cli_sources, and discover_gemini_sources. All of them want raw agent data, never git-aware filtering.

Verification

End-to-end smoke against this user's data — all four agents return matches:

$ uv run agentgrep search libtmux --agent claude --limit 1
$ uv run agentgrep search libtmux --agent codex --limit 1
$ uv run agentgrep search libtmux --agent cursor --limit 1
$ uv run agentgrep search libtmux --agent gemini --limit 1

Catalogue round-trips through JSON without field loss:

$ uv run python -c "from agentgrep.store_catalog import CATALOG; CATALOG.model_dump_json()"

Catalogue JSON schema emits cleanly:

$ uv run python -c "from agentgrep.stores import StoreCatalog; print(StoreCatalog.model_json_schema()['title'])"

Sphinx renders the new page and the CHANGES references without warnings:

$ just build-docs

Test plan

  • test_list_files_matching_ignores_gitignore — discovery survives .gitignore: * at the search root.
  • tests/test_stores.py — catalogue shape invariants and Pydantic round-trip.
  • test_primary_fixtures_exist_and_are_well_formed — every primary-chat / prompt-history / plan store has a valid fixture file under tests/samples/.
  • test_search_cursor_cli_transcript_user_prompt — user-turn text surfaces with mtime-backfilled timestamp.
  • test_search_cursor_cli_transcript_assistant_text — both user and assistant roles emit records.
  • test_search_cursor_cli_transcript_ignores_tool_use_blocks: tool_use content blocks with no text payload do not crash or yield empty records.
  • test_search_gemini_chat_session_user_prompt — Gemini user MessageRecord surfaces with timestamp + sessionId.
  • test_search_gemini_chat_session_drops_metadata_records: {"$set": …} updates and empty-content gemini records produce no record.
  • test_search_gemini_logs_returns_user_message: logs.json array yields prompt-history records.
  • uv run pytest — full suite passes.
  • uv run ruff format . — clean.
  • uv run ruff check . — clean.
  • uv run ty check — clean.
  • just build-docs — new storage-catalog page renders; CHANGES entry resolves its {ref} and {mod} / {class} cross-references.
  • CLI variations smoke-tested: --agent {claude,codex,cursor,gemini,all}, repeated --agent, --type {prompts,history,all}, --regex, --case-sensitive, --any, --json, --ndjson, --limit, --color {always,never}, --progress never, agentgrep find <pattern> per agent, invalid --agent foo rejected with a five-choice error.

Out of scope (follow-ups)

  • Refactor existing discover_codex_sources / discover_claude_sources / discover_cursor_sources to read paths from CATALOG rather than hard-coding strings — the catalogue is the contract those adapters will consume.
  • Surface Gemini assistant thoughts[*].description and toolCalls[*] as searchable text. The v1 Gemini chat parser drops gemini-typed records whose content is empty (output lives in thoughts[]); a clean follow-up adds a separate extraction path.
  • Honor GEMINI_CLI_HOME env override. Most users use the default ~/.gemini; documented in the catalogue row but not yet read at runtime.
  • Parse legacy Gemini .json single-file sessions (pre-Feb 2026 format).
  • Parse Gemini Antigravity protobuf conversations — schema is opaque without a public .proto.
  • Investigate why ~/.cursor/ai-tracking/ai-code-tracking.db has 0 rows on some systems and document the answer in cursor.ai_tracking.schema_notes.

tony added 3 commits May 17, 2026 17:34
why: agentgrep ran fd and rg against the user's ~/.claude, ~/.codex,
and ~/.cursor without disabling ignore-file semantics. On systems
where those paths sit inside a dotfile-managed tree with a
.gitignore (yadm, chezmoi, stow, bare-git), both tools silently
masked the agent data and the CLI reported "No matches found." fd
exited 0 with empty output, so the Python rglob fallback never
fired. After discovery, the rg-against-search-root prefilter
applied the same mask a second time. Even after the ignore-flag
fixes, paths from `fd -a` — which canonicalizes through symlinks —
and rg — which doesn't — failed to compare equal in
prefilter_sources_by_root, so every source got dropped.

what:
- Pass `-H -I` to fd in list_files_matching and drop `-a` so the
  returned paths preserve the input root's symlink structure and
  agree with rg's output.
- Pass `--no-ignore --hidden` (rg) / `--unrestricted --hidden` (ag)
  in build_grep_command and reshuffle the argv builder so the
  fixed-string flag goes immediately before `-l`.
- Add test_list_files_matching_ignores_gitignore that drops a
  `.gitignore: *` next to two JSONL files and asserts discovery
  still finds them.
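
The path-comparison mismatch described in this commit can be reproduced in isolation. The sketch below is a standalone demonstration with a hypothetical function name, not code from agentgrep, and it assumes a platform where creating symlinks is permitted.

```python
import tempfile
from pathlib import Path


def symlink_mismatch_demo() -> tuple[bool, bool]:
    """Compare a symlink-preserving path with its canonical form.

    Returns (naive_equal, normalized_equal).
    """
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp).resolve()
        real = base / "real"
        real.mkdir()
        (real / "session.jsonl").write_text("{}\n")
        link = base / "link"
        link.symlink_to(real, target_is_directory=True)

        preserved = link / "session.jsonl"  # form a symlink-preserving tool reports
        canonical = preserved.resolve()     # form a canonicalizing tool reports

        naive_equal = preserved == canonical
        normalized_equal = preserved.resolve() == canonical
        return naive_equal, normalized_equal
```

Both paths name the same file, yet the naive comparison fails, which is why keeping both tools on the same path form (or normalizing both sides) matters for the prefilter.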
…c descriptors

why: Each CLI agent (Claude Code, Cursor, Codex, Gemini) lays out
prompt and conversation history on disk in its own way, and those
layouts drift between releases — Codex renames history files,
Cursor adds a CLI-agent layout that doesn't use the IDE's
state.vscdb, Gemini reorganises ~/.gemini/tmp/. Burying paths and
record schemas inside the search adapters made every layout change
a multi-file edit and left no place to record which version was
last verified. The catalogue centralises that knowledge as frozen
Pydantic descriptors, each stamped with observed_version /
observed_at and a pointer to the upstream type definition where
one is public. Whether agentgrep searches a given store by default
stays a per-store decision the adapters consult; the catalogue
itself is descriptive.

what:
- src/agentgrep/stores.py — StoreFormat / StoreRole enums and the
  frozen StoreDescriptor / StoreCatalog Pydantic models.
  JSON-schema-exportable for downstream validation.
- src/agentgrep/store_catalog.py — initial CATALOG spanning Claude,
  Cursor (IDE + CLI agent kept distinct), Codex, and Gemini. Each
  entry cites the upstream source-of-truth where one exists; the
  Cursor CLI rows note that no public schema is published.
  Includes gemini_project_hash() — a Python mirror of
  getProjectHash() at packages/core/src/utils/paths.ts so consumers
  can map a working directory to a Gemini tmp shard.
- tests/test_stores.py — shape tests for unique IDs, agent-prefix
  discipline, token-form path patterns (no /home/* leaks),
  primary-chat stores carrying upstream_ref or sample_record,
  distinguishes_from cross-references resolving, frozen-model
  enforcement, and Pydantic JSON round-trip.
- tests/conftest.py — fixture_path(store_id, name) helper rooted at
  tests/samples/.
- tests/samples/ — redacted, structurally-faithful example records
  for every primary-chat / prompt-history / plan store. No real
  user content; UUIDs, timestamps, and project hashes are
  placeholders.

why: The new storage catalogue ships an importable public API but
no narrative entry point. Adding a Sphinx page gives readers one
place to learn the catalogue's intent, see the per-agent layouts
side-by-side, and follow a recipe for adding or updating a
descriptor. The CHANGES placeholder for 0.1.0a2 also needed
populating so the upcoming release records the fix, the
catalogue, and the doc page under the project's deliverable-prose
conventions.

what:
- docs/storage-catalog.md — new reference page with sections per
  agent (Claude / Cursor / Codex / Gemini), cross-links to upstream
  Rust and TypeScript type definitions, and an "Adding or updating
  a store" recipe.
- docs/index.md — wire the new page into the "Get started"
  toctree.
- CHANGES — replace the 0.1.0a2 placeholder body with a
  multi-sentence lead and three deliverable sections
  (### What's new / ### Fixes / ### Documentation).
tony added 2 commits May 17, 2026 17:53
why: PR #4 fixed search for Claude and Codex on dotfile-managed
homes and shipped the storage catalogue, but left Cursor CLI and
Gemini deliberately out of scope. The user has chat history in
both — 502 transcripts under
~/.cursor/projects/<id>/agent-transcripts/ and 111 .jsonl sessions
under ~/.gemini/tmp/<project_hash>/chats/. This wires adapters
for both and surfaces "gemini" as a valid --agent value across
the CLI and the MCP tool literals.

what:
- Extend AgentName and AGENT_CHOICES to include "gemini". Mirror
  the change in tests/test_agentgrep.py and across the MCP tool
  literal sites in src/agentgrep/mcp.py.
- New discover_cursor_cli_sources walks
  ~/.cursor/projects/*/agent-transcripts/**/*.jsonl, skipping
  sibling project files (repo.json, mcp-approvals.json,
  terminals/, canvases/) so transcripts stay the focus.
- New parse_cursor_cli_transcript wraps iter_message_candidates —
  which already handles Cursor's outer-role / inner-
  message.content[].text shape — and backfills file mtime as the
  timestamp fallback since transcripts carry no native per-turn
  timestamp. Adds an isoformat_from_mtime_ns helper and a
  datetime import.
- New discover_gemini_sources walks
  ~/.gemini/tmp/<hash>/chats/session-*.jsonl and
  ~/.gemini/tmp/<hash>/logs.json.
- New parse_gemini_chat_file is custom rather than reusing
  iter_message_candidates: extract_role does not recognise
  Gemini's type-as-role key, and patching it globally would
  false-positive on every JSON dict with a "type" field. The
  parser handles SessionMetadataRecord (first line), $set
  MetadataUpdateRecord, and MessageRecord turns; empty-content
  gemini records (output lives in thoughts[]) are skipped for v1.
- New parse_gemini_logs_file is a flat-array parser mapping each
  LogEntry to a prompt-history record.
- Plug both new discover functions and four adapter_id branches
  into discover_sources and iter_source_records. The "cursor"
  branch becomes additive — the existing IDE / ai-tracking
  sources still run alongside the new CLI transcripts.
- Catalog corrections in store_catalog.py: drop the non-existent
  bubbleId mention from cursor.cli.transcripts; clarify that
  Gemini's RewindRecord / PartialMetadataRecord and
  info|error|warning type values are documented upstream but
  unobserved in real files; flip search_by_default to True on the
  three now-parsed rows; bump catalog_version to 2.
- Six functional tests covering user/assistant Cursor turns,
  tool_use safety, Gemini user-prompt extraction, metadata-record
  skipping, and logs.json parsing.

why: The new Cursor CLI and Gemini adapters need user-facing
announcement: a 0.1.0a2 changelog entry naming the deliverables
and a refresh of the storage-catalog reference page so the
per-agent sections accurately describe what agentgrep now parses.

what:
- CHANGES — refresh the 0.1.0a2 lead to mention the new
  --agent values; add a "Gemini CLI search support" deliverable
  section under What's new naming the chat-session and
  logs.json stores and the gemini_project_hash helper; add a
  "Cursor CLI agent search support" section noting that
  --agent cursor now hits both the IDE and the CLI surfaces.
  Existing "Storage catalogue" and "Search beneath dotfile
  .gitignore" entries from PR #4 are left intact.
- docs/storage-catalog.md — rewrite the Cursor section to name
  both adapters (cursor.cli_jsonl.v1 and the IDE state.vscdb
  pair) as searched. Rewrite the Gemini section to describe
  the records actually observed in real files (only `user` and
  `gemini` type values; no Rewind / PartialMetadata records),
  name the active adapter_ids, and note the v1 limitation
  around empty-content gemini records whose output lives in
  thoughts[].

tony marked this pull request as ready for review May 17, 2026 23:01

why: The project's changelog conventions require PR references
to sit in each deliverable's #### heading so readers can navigate
from a CHANGES entry to the PR that shipped it. The five 0.1.0a2
entries landed without those refs across two prior changelog
commits on this branch because the PR number wasn't known at
authoring time. Now that PR #4 is open and ready for review,
each deliverable heading gets its (#4) suffix.

what:
- CHANGES: append `(#4)` to the five 0.1.0a2 deliverable
  headings — "Gemini CLI search support",
  "Cursor CLI agent search support", "Storage catalogue",
  "Search beneath dotfile .gitignore", and
  "New storage-catalog reference page".

tony commented May 17, 2026

Code review

Found 3 issues:

  1. build_capabilities() hardcodes agents=["codex", "claude", "cursor"] and omits "gemini". The PR extended the AgentSelector literal and CapabilitiesModel.agents's type to include gemini, but the runtime list returned by the agentgrep://capabilities MCP resource still advertises only three agents. MCP clients that read capabilities before routing queries will not see Gemini support.

def build_capabilities() -> CapabilitiesModel:
    """Build a typed capability summary."""
    backends = agentgrep.select_backends()
    return CapabilitiesModel(
        agents=["codex", "claude", "cursor"],
        search_types=["prompts", "history", "all"],
        adapters=list(KNOWN_ADAPTERS),
        tools=["search", "find"],

  2. KNOWN_ADAPTERS does not include the three adapter IDs introduced by this PR: cursor.cli_jsonl.v1, gemini.tmp_chats_jsonl.v1, and gemini.tmp_logs_json.v1. These are emitted at runtime by the new discover_cursor_cli_sources and discover_gemini_sources functions in src/agentgrep/__init__.py, but build_capabilities() publishes list(KNOWN_ADAPTERS) to MCP clients. Clients filtering on adapter ID won't recognise records from the new adapters.

SERVER_VERSION = "0.1.0"
KNOWN_ADAPTERS: tuple[str, ...] = (
    "codex.history_json.v1",
    "codex.sessions_jsonl.v1",
    "claude.projects_jsonl.v1",
    "cursor.ai_tracking_sqlite.v1",
    "cursor.state_vscdb_legacy.v1",
    "cursor.state_vscdb_modern.v1",
)
READONLY_TAGS = {"readonly", "agentgrep"}

  3. "earlier exploration mis-attributed them" / "earlier exploratory notes mis-attributed them" leaks branch-internal narrative into shipped artifacts (CLAUDE.md says: "Did users of the most recently published release ever experience this old name, old behavior, or bug? If the answer is no, it is branch-internal narrative. Move it to the commit message and describe only the current state in the artifact."). The mis-attribution happened during this PR's research; readers of 0.1.0a1 never saw a wrong attribution. The same phrasing appears in both the catalog schema_notes and the public docs page.

upstream_ref="github.com/openai/codex@4c89772/codex-rs/state/src/lib.rs#L71",
schema_notes=(
"Codex logs DB (`LOGS_DB_FILENAME`). Note: the `_N.sqlite` files belong "
"to codex, not Cursor — earlier exploration mis-attributed them."
),
),

The two `_N.sqlite` files in `~/.codex/` (`state_5.sqlite`,
`logs_2.sqlite`) belong to Codex, not Cursor — earlier exploratory
notes mis-attributed them.
### Gemini CLI

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

tony added 5 commits May 17, 2026 18:32
…n capabilities

why: PR #4 (this branch) extended AgentSelector, CapabilitiesModel.agents's
type, and every per-tool literal site in src/agentgrep/mcp.py to include
"gemini", but neither the runtime list returned by build_capabilities()
nor KNOWN_ADAPTERS was updated. MCP clients querying
agentgrep://capabilities were told there are three agents and six
adapters while CLI / library users see four agents and nine adapters.
The branch's own narrative ("agentgrep search --agent gemini is now a
valid CLI invocation") was broken at the MCP surface.

what:
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with cursor.cli_jsonl.v1,
  gemini.tmp_chats_jsonl.v1, and gemini.tmp_logs_json.v1 in
  dotted-prefix order alongside the existing adapter ids.
- src/agentgrep/mcp.py: replace the hardcoded
  agents=["codex", "claude", "cursor"] in build_capabilities() with
  agents=list(agentgrep.AGENT_CHOICES) so the MCP capabilities resource
  stays in lockstep with the CLI's AGENT_CHOICES tuple — future agent
  additions only need one source-of-truth update.
- src/agentgrep/mcp.py: declare a module-local AgentName alias and add
  AGENT_CHOICES: tuple[AgentName, ...] to the AgentGrepModule Protocol
  so ty can verify the list[AgentName] flowing into CapabilitiesModel.
- tests/test_agentgrep_mcp.py: new
  test_mcp_capabilities_lists_every_supported_agent_and_adapter that
  reads the live agentgrep://capabilities resource and asserts the
  advertised agents match agentgrep.AGENT_CHOICES and the three new
  adapter ids are present.
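
The single-source-of-truth change can be sketched as follows. The Capabilities dataclass is a simplified stand-in for CapabilitiesModel, the AGENT_CHOICES tuple shown here is illustrative, and missing_adapters is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass

# Illustrative values; the real tuple lives in the agentgrep package.
AGENT_CHOICES = ("codex", "claude", "cursor", "gemini")

KNOWN_ADAPTERS = (
    "claude.projects_jsonl.v1",
    "codex.history_json.v1",
    "codex.sessions_jsonl.v1",
    "cursor.ai_tracking_sqlite.v1",
    "cursor.cli_jsonl.v1",
    "cursor.state_vscdb_legacy.v1",
    "cursor.state_vscdb_modern.v1",
    "gemini.tmp_chats_jsonl.v1",
    "gemini.tmp_logs_json.v1",
)


@dataclass(frozen=True)
class Capabilities:
    agents: list[str]
    adapters: list[str]


def build_capabilities_sketch() -> Capabilities:
    """Derive the advertised lists from the shared constants, never from literals."""
    return Capabilities(agents=list(AGENT_CHOICES), adapters=list(KNOWN_ADAPTERS))


def missing_adapters(emitted: set[str]) -> set[str]:
    """Adapter IDs seen at runtime that the capabilities resource would not advertise."""
    return emitted - set(KNOWN_ADAPTERS)
```

A check like missing_adapters is the kind of invariant a test can assert so the runtime and the advertised list cannot drift apart unnoticed.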
…rides

why: Every Codex catalogue row declares
env_overrides=("CODEX_HOME",) (5 rows) and every Gemini row declares
env_overrides=("GEMINI_CLI_HOME",) (8 rows). The catalogue's module
docstring describes these as the contract adapters consult. But
discover_codex_sources hardcoded home / ".codex" and
discover_gemini_sources hardcoded home / ".gemini" / "tmp", so the
catalogue shipped a promise the runtime never kept.

The Codex implementation mirrors codex-rs/utils/home-dir/src/lib.rs
(env var, when non-empty, replaces ~/.codex). The Gemini
implementation mirrors packages/cli/index.ts (env var, when non-empty,
replaces ~/.gemini; "tmp" is then appended for the chat/log root).

what:
- discover_codex_sources: read CODEX_HOME via os.environ.get; fall
  back to home / ".codex" when the env var is unset or empty.
- discover_gemini_sources: read GEMINI_CLI_HOME the same way, then
  append /tmp to the resolved base.
- Document the new behaviour in both docstrings, pointing at the
  upstream source-of-truth files.
- tests/test_agentgrep.py: two new tests that monkeypatch the env
  var to an alternate tmp_path location, plant a decoy session under
  ${HOME}/.codex or ${HOME}/.gemini, and assert discovery hits the
  env-pointed root and ignores the decoy.
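
A minimal sketch of the override rule described in this commit, assuming hypothetical helper names and an injectable environment mapping so the behaviour can be exercised without touching os.environ.

```python
from __future__ import annotations

import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_agent_home(
    env_var: str,
    default: Path,
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Return the env-pointed root when the variable is set and non-empty."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "")
    # Unset and empty are treated the same: fall back to the default root.
    return Path(value).expanduser() if value else default


def gemini_tmp_root(home: Path, environ: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the Gemini base, then append tmp for the chat/log root."""
    return resolve_agent_home("GEMINI_CLI_HOME", home / ".gemini", environ) / "tmp"
```

Passing the mapping explicitly is only a testing convenience here; the commit itself reads the variable via os.environ.get.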
…t absence of type

why: parse_gemini_chat_file identified the SessionMetadataRecord line
with "startTime" in mapping and "type" not in mapping. Upstream
distinguishes metadata records by the kind field
(packages/core/src/services/chatRecordingTypes.ts), not by the
absence of type. If a future Gemini CLI release adds a type field
to the metadata record (e.g. type="session_meta"), the absence-based
check would silently misclassify the metadata line as a turn,
reset session_id mid-stream, and drop the message because content
isn't where the MessageRecord parser expects it.

Switching to "kind" in mapping uses the upstream-documented
discriminator. The on-disk shape sampled during the v1 adapter
already includes "kind":"main" on every metadata line, so this is
backwards-compatible with current Gemini releases.

what:
- src/agentgrep/__init__.py: parse_gemini_chat_file now branches on
  "kind" in mapping for the SessionMetadataRecord guard. Inline
  comment explains the forward-compat reasoning.
- tests/test_agentgrep.py:
  test_search_gemini_chat_session_metadata_with_future_type_field
  exercises the new guard by writing a metadata line that carries
  both kind and a hypothetical type="session_meta" field; the
  subsequent user MessageRecord must still emit one chat record with
  the correct session_id.
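
The difference between the two guards can be shown on a pair of sample records. The record contents below are illustrative, and the hypothetical type="session_meta" value is the same one the commit message uses.

```python
def is_metadata_by_absence(record: dict) -> bool:
    """Old guard: a metadata line has startTime and no type key."""
    return "startTime" in record and "type" not in record


def is_metadata_by_kind(record: dict) -> bool:
    """New guard: a metadata line carries the kind discriminator."""
    return "kind" in record


# Shape sampled today: kind is present, type is absent.
current_meta = {"sessionId": "s1", "startTime": "2026-01-01T00:00:00Z", "kind": "main"}
# Hypothetical future shape: the same line with an added type field.
future_meta = {**current_meta, "type": "session_meta"}
# An ordinary turn, which neither guard should classify as metadata.
user_turn = {"type": "user", "content": "hello"}
```

Both guards agree on current_meta and user_turn; only the kind-based guard still recognises future_meta as metadata.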
…_records branch

why: Every other branch in iter_source_records ends with
``yield from parse_*(source); return``. The final
``gemini.tmp_logs_json.v1`` branch omitted the trailing return.
Functionally identical today (it's the last branch) but inconsistent
with the local pattern, and a future appended branch would have run
unconditionally for log sources. One-line consistency fix locks the
"each adapter_id dispatches to exactly one parser" invariant.

what:
- src/agentgrep/__init__.py: add ``return`` after the final
  ``yield from parse_gemini_logs_file(source)``.
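
Why the trailing return matters in a generator-based dispatcher can be seen in a toy version. The dispatcher below is a simplified illustration with made-up adapter names, and the trailing fallback stands in for whatever code might be appended later.

```python
from typing import Iterator


def dispatch(adapter_id: str, *, with_return: bool) -> Iterator[str]:
    """Toy dispatcher: each branch should hand off to exactly one parser."""
    if adapter_id == "chats":
        yield "chats-parser"
        return
    if adapter_id == "logs":
        yield "logs-parser"
        if with_return:
            return
        # Without the return, control falls through to the code below.
    # Hypothetical code appended after the last branch.
    yield "fallback-parser"
```

With with_return=False, a "logs" source is handled twice: once by its own parser and once by the appended code.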
… artifacts

why: Three sites violated the project's "Shipped vs. Branch-Internal
Narrative" rule (see CLAUDE.md / AGENTS.md, Published-Release Test).
The mis-attribution of the `_N.sqlite` files happened during PR #4's
research; readers of 0.1.0a1 never saw a wrong attribution. The
CHANGES lead's "previously did not parse" phrasing compares branch
states rather than shipped states. Per the rule, the artifact should
describe only the current state; the historical "we got this wrong
once" belongs in commit messages, not in shipped catalogue notes,
docs pages, or release-notes prose.

The corrected explanation for *why* the `_N.sqlite` files belong to
Codex is that their filenames are defined by `STATE_DB_FILENAME` and
`LOGS_DB_FILENAME` constants in the upstream Rust crate
`codex-rs/state/src/lib.rs` — which is now what the shipped artifacts
say.

what:
- src/agentgrep/store_catalog.py: rewrite the `codex.logs_db` row's
  `schema_notes` so it cites the upstream constant names directly
  and drops the "earlier exploration mis-attributed them" aside.
- docs/storage-catalog.md: rewrite the Codex `_N.sqlite` paragraph
  to cite the upstream constants and drop the "earlier exploratory
  notes mis-attributed them" aside.
- CHANGES: rewrite the 0.1.0a2 lead so it describes the current
  state ("adds search support for the Cursor CLI agent's per-project
  transcripts and Google Gemini CLI's session and prompt-log files")
  rather than the branch diff ("broadens search to two stores
  agentgrep previously did not parse").
tony added 5 commits May 17, 2026 19:10
why: The catalogue declares gemini.tmp.logs as
role=StoreRole.PROMPT_HISTORY — upstream's logs.json is the
user-prompt audit log, analogous to Codex's history.jsonl. Records
from an audit log are kind="history" by the project's
SearchRecord convention; parse_codex_history_file constructs them
that way explicitly. The Gemini logs parser routed records through
build_search_record, which auto-classifies role="user" as
kind="prompt", so agentgrep search --agent gemini --type history
returned zero logs.json hits.

what:
- parse_gemini_logs_file now yields SearchRecord(kind="history", ...)
  directly, mirroring parse_codex_history_file. Drops the
  MessageCandidate + build_search_record indirection.
- test_search_gemini_logs_returns_user_message asserts
  kind == "history" and session_id propagates from the LogEntry.
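
A sketch of a logs parser that constructs history records directly, plus the kind filter that made the original bug visible. HistoryRecord is a simplified stand-in for SearchRecord, and the LogEntry field names (sessionId, message, timestamp) are assumptions about the on-disk shape.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HistoryRecord:
    kind: str
    role: str
    text: str
    session_id: Optional[str]
    timestamp: Optional[str]


def parse_logs_sketch(raw: str) -> list[HistoryRecord]:
    """Map each entry of a flat JSON array to a kind="history" record."""
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(entries, list):
        return []
    records = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        if not isinstance(message, str) or not message.strip():
            continue
        # kind is set explicitly instead of being inferred from role="user".
        records.append(
            HistoryRecord("history", "user", message, entry.get("sessionId"), entry.get("timestamp"))
        )
    return records


def filter_by_type(records: list[HistoryRecord], search_type: str) -> list[HistoryRecord]:
    """Keep records whose kind matches a --type value (prompts, history, all)."""
    wanted = {"prompts": {"prompt"}, "history": {"history"}, "all": {"prompt", "history"}}
    return [record for record in records if record.kind in wanted[search_type]]
```

Had the records been classified as kind="prompt", the history filter would return an empty list, which matches the symptom the commit describes.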
…Spec

why: Every adapter hard-coded its agent path roots, globs, post-
filters, and runtime metadata (store, adapter_id, path_kind,
source_kind) inline. The catalogue's env_overrides field was the
only catalogue-side declaration the runtime consulted, and even
that duplicated the env-var name. Centralising the runtime metadata
in the catalogue means a future upstream rename — Codex moving its
history file, Cursor adding a CLI agent, Gemini reorganising tmp/ —
is a one-row edit; discovery catches up automatically. The
catalogue earns its keep as the runtime contract, not just
documentation.

Pairs naturally with the codebase's first logger
(logging.getLogger("agentgrep") + NullHandler in
src/agentgrep/__init__.py), which fires a WARNING when CODEX_HOME
or GEMINI_CLI_HOME points to a non-existent path. Upstream Codex
errors in that case; agentgrep stays read-only-friendly and falls
back, but the warning surfaces a misconfiguration the user almost
certainly cares about. The structured extra={"agentgrep_env_var":
..., "agentgrep_env_path": ...} follows CLAUDE.md's logging
conventions exactly.

what:
- src/agentgrep/stores.py: move PathKind and SourceKind here as
  literal type aliases (re-exported from agentgrep for compat);
  add the new DiscoverySpec Pydantic model with home_subpath,
  platform_paths, files, glob, and path_parts_required fields;
  add discovery: tuple[DiscoverySpec, ...] = () to StoreDescriptor.
  A tuple (rather than a single optional) handles stores whose
  on-disk shape spans more than one DiscoverySpec — Codex history
  has .json and .jsonl alternatives, Cursor IDE state has modern
  and legacy adapter ids.
- src/agentgrep/store_catalog.py: populate discovery on
  claude.projects.session, cursor.cli.transcripts,
  cursor.ai_tracking, cursor.ide.state_vscdb, codex.history,
  codex.sessions, gemini.tmp.chats, and gemini.tmp.logs.
- src/agentgrep/__init__.py:
  - logger = logging.getLogger(__name__); NullHandler registered.
  - resolve_env_root(env_var, default) honours the env override and
    logger.warning()s when the path isn't a directory.
  - handles_from_discovery(spec, agent, root, backends) emits
    SourceHandles by interpreting one DiscoverySpec.
  - discover_from_catalog(home, agent, base, backends) iterates
    CATALOG.for_agent(agent) and walks each row's discovery.
  - discover_codex / claude / cursor / gemini delegate to that
    helper; discover_cursor_cli_sources is a back-compat shim that
    returns [] because Cursor CLI transcripts now flow through
    discover_cursor_sources.
- tests/test_agentgrep.py: caplog-scoped test asserts the WARNING
  record carries the structured agentgrep_env_var /
  agentgrep_env_path attributes; a companion test asserts
  resolve_env_root falls back silently when the env var is unset.
- tests/test_stores.py: invariant test asserts every runtime
  adapter id is declared by some catalogue row's DiscoverySpec and
  also advertised in agentgrep.mcp.KNOWN_ADAPTERS.
why: Upstream Gemini CLI still reads the older single-file `.json`
session format via the `isLegacyRecord` discriminator in
packages/core/src/services/chatRecordingService.ts. The legacy
shape is a JSON object with session metadata at the top level and
the full conversation under a `messages` array; each entry carries
the same per-turn fields the JSONL format uses. agentgrep had no
adapter for these files, leaving ~1000 real legacy session files
under ~/.gemini/tmp/*/chats/*.json invisible to search on systems
that predate the JSONL migration.

what:
- src/agentgrep/__init__.py:
  - Factor _gemini_message_record_to_candidate(mapping, session_id)
    so the new parser shares the inner-record handling with
    parse_gemini_chat_file. Returns None for records with no
    searchable text (empty content + no thoughts/toolCalls).
  - parse_gemini_chat_legacy_file(source) reads the JSON object,
    pulls sessionId from the top level, iterates `messages[]`, and
    yields one SearchRecord per turn with searchable text.
  - Dispatcher routes adapter_id "gemini.tmp_chats_legacy_json.v1"
    to the new parser.
- src/agentgrep/store_catalog.py: new gemini.tmp.chats_legacy row.
  role=SUPPLEMENTARY_CHAT, format=JSON_OBJECT, distinguishes_from
  the current jsonl chats. DiscoverySpec uses home_subpath=("tmp",)
  and glob="session-*.json" so the catalogue drives both the
  enumeration and the runtime metadata.
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with
  "gemini.tmp_chats_legacy_json.v1".
- tests/test_agentgrep.py:
  test_search_gemini_chat_legacy_json_session plants a single
  legacy file with one user MessageRecord and asserts the prompt
  is surfaced with the right session_id and role.
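
The legacy-file walk described above can be sketched as follows. This is an illustration under stated assumptions, not the shipped parse_gemini_chat_legacy_file: the helper name iter_legacy_turns and the plain-dict return shape are hypothetical, and only the `sessionId`, `messages`, `type`, and `content` field names are taken from the commit text.

```python
import json
from pathlib import Path
from typing import Any, Iterator


def iter_legacy_turns(path: Path) -> Iterator[dict[str, Any]]:
    """Yield one dict per searchable turn in a legacy single-file session.

    Hypothetical helper: the real adapter yields SearchRecord objects.
    """
    payload = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(payload, dict):
        return  # legacy sessions are a JSON object, not an array
    session_id = payload.get("sessionId")  # metadata sits at the top level
    messages = payload.get("messages")
    if not isinstance(messages, list):
        return
    for message in messages:
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        # Only `content` is inspected in this sketch; turns without text are skipped.
        if not isinstance(content, str) or not content.strip():
            continue
        yield {
            "session_id": session_id,
            "role": message.get("type"),  # assumption: `type` doubles as the role
            "text": content,
        }
```

Every turn carries the top-level session id, so a match can be traced back to its session without re-reading the file.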
…all text

why: Gemini assistant turns on the current `.jsonl` format leave the
`content` field empty when the model's prose lives in `thoughts[]`
(reasoning, ~150-300 chars per entry) and tool invocations live in
`toolCalls[]`. The previous adapter dropped those turns entirely,
leaving ~70% of assistant output invisible to search. Concatenating
`thoughts[*].subject`/`description` and `toolCalls[*].name`/`description`
into the candidate's text restores searchability without breaking the
conversation-turn boundary: one SearchRecord per turn keeps record
counts flat and downstream filtering simple. Tool `args` is omitted —
it's JSON-shaped and low signal compared to the human-readable
`description` field.

what:
- src/agentgrep/__init__.py:
  - _gemini_thoughts_text(thoughts) flattens subject + description
    into a newline-joined string. Skipped when no `thoughts` field is
    present or it's not a list.
  - _gemini_tool_calls_text(toolCalls) same for tool calls' `name` +
    `description`. Tool `args` deliberately excluded.
  - _gemini_message_record_to_candidate reuses both helpers when the
    record is `type="gemini"`. Returns None only when content,
    thoughts, AND toolCalls all contribute no text.
- src/agentgrep/store_catalog.py: refresh `gemini.tmp.chats`
  search_notes to document the new behaviour.
- tests/test_agentgrep.py:
  test_search_gemini_chat_session_surfaces_thoughts_and_tool_calls
  plants a gemini turn with thoughts only and a gemini turn with
  toolCalls only, asserts both surface with the expected text.
  test_search_gemini_chat_session_drops_textless_records (renamed
  from drops_metadata_records) keeps the negative case: gemini-typed
  records with empty content AND no thoughts AND no toolCalls still
  produce nothing.
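
A minimal sketch of the flattening rule described above, assuming each `thoughts` and `toolCalls` entry is a JSON object with string fields. The names flatten_fields and searchable_text are hypothetical stand-ins for the private helpers named in the commit, and the exact join order in agentgrep may differ.

```python
from typing import Any, Optional


def flatten_fields(entries: Any, keys: tuple[str, ...]) -> str:
    """Join the named string fields of each entry with newlines."""
    if not isinstance(entries, list):
        return ""  # field missing or not a list: contributes nothing
    parts: list[str] = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        for key in keys:
            value = entry.get(key)
            if isinstance(value, str) and value.strip():
                parts.append(value.strip())
    return "\n".join(parts)


def searchable_text(record: dict[str, Any]) -> Optional[str]:
    """Return the text for one turn, or None when nothing is searchable."""
    chunks: list[str] = []
    content = record.get("content")
    if isinstance(content, str) and content.strip():
        chunks.append(content.strip())
    if record.get("type") == "gemini":
        chunks.append(flatten_fields(record.get("thoughts"), ("subject", "description")))
        # Tool `args` is left out on purpose, as the commit describes.
        chunks.append(flatten_fields(record.get("toolCalls"), ("name", "description")))
    text = "\n".join(chunk for chunk in chunks if chunk)
    return text or None
```

Because everything is folded into a single string per turn, the number of records stays equal to the number of turns with text.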
…avity catalogue rows

why: Upstream investigation established that three catalogue rows
described behaviour upstream does not have:

- The `gemini.history` row claimed to be a post-retention archive
  of `tmp/`. Upstream
  `packages/cli/src/utils/sessionCleanup.ts` hard-deletes
  expired sessions via `fs.unlink()` — there is no archive. The
  `.project_root` metadata stubs under some users'
  `~/.gemini/history/` are orphaned artefacts from an earlier
  layout, not Gemini-CLI-written archives.
- `gemini.antigravity.brain` and
  `gemini.antigravity.conversations` are Antigravity IDE
  artefacts. Gemini CLI only detects Antigravity as an IDE
  launcher target
  (`packages/core/src/ide/detect-ide.ts`) — it never reads or
  writes the protobuf conversation files. If agentgrep ever
  supports Antigravity, that is a separate agent kind, not a
  Gemini sub-store.

Shipping catalogue claims unbacked by upstream is the same kind
of branch-internal narrative the project's CLAUDE.md rule
forbids; an honest catalogue means dropping the rows.

what:
- src/agentgrep/store_catalog.py: delete the three rows. Bump
  `catalog_version` to 3 (descriptor entries removed).
- docs/storage-catalog.md: rewrite the Gemini section to name the
  three real adapters (`gemini.tmp_chats_jsonl.v1`,
  `gemini.tmp_chats_legacy_json.v1`,
  `gemini.tmp_logs_json.v1`), explain that thoughts and tool-call
  text now surface for assistant turns with empty `content`, and
  document Antigravity / `history/` as out-of-scope with upstream
  citations.
- CHANGES: refresh the "Gemini CLI search support" entry to
  mention the legacy `.json` adapter, thoughts/toolCalls
  surfacing, and the CODEX_HOME / GEMINI_CLI_HOME warning;
  refresh the "Storage catalogue" entry to mention DiscoverySpec
  and the row prunings.
@tony

tony commented May 18, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code


tony added 3 commits May 17, 2026 19:53
…riven discovery

why: src/agentgrep/store_catalog.py's module docstring describes the
catalogue as the runtime contract — "each StoreDescriptor carries a
search_by_default field that the per-agent discover functions
consult." The runtime did not. discover_from_catalog walked every
descriptor's discovery tuple unconditionally, so a row marked
search_by_default=False but carrying a DiscoverySpec would silently
flow into search results — the catalogue would lie about its own
behaviour.

The shape of the catalogue today hides this from users: every row
that sets search_by_default=False also leaves discovery=(), so the
loop happens to do the right thing in practice. Adding a
DiscoverySpec to such a row in the future would have been a quiet
regression.

what:
- src/agentgrep/__init__.py: discover_from_catalog now skips
  descriptors whose search_by_default is exactly False. True
  remains searched; None (decision-deferred) also remains searched,
  matching the historical default for rows that pre-date the
  search-by-default field.
- src/agentgrep/store_catalog.py: state the policy explicitly on
  the two rows that previously relied on the None default —
  cursor.ai_tracking (role corrected from APP_STATE to
  SUPPLEMENTARY_CHAT to reflect that its conversation_summaries
  rows are chat-derived metadata, not app state) and
  cursor.ide.state_vscdb.
- tests/test_stores.py:
  test_discover_from_catalog_skips_search_by_default_false plants
  a synthetic catalogue with one False row and one True row that
  both carry valid DiscoverySpecs against the same base directory,
  monkeypatches CATALOG, and asserts only the True row's source
  flows through.
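
The tri-state policy above (False is skipped, True and None are searched) comes down to an identity check against False. The sketch below uses a hypothetical Row dataclass in place of the Pydantic StoreDescriptor; only the search_by_default field name comes from the commit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Row:
    """Hypothetical stand-in for a catalogue descriptor."""

    store_id: str
    search_by_default: Optional[bool] = None


def searchable_rows(rows: list[Row]) -> list[Row]:
    # `is not False` keeps both True and None (decision-deferred) rows;
    # a plain truthiness test would wrongly drop the None rows.
    return [row for row in rows if row.search_by_default is not False]
```

Testing with `is not False` rather than truthiness is what preserves the historical default for rows that never set the field.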
…e descriptor

why: A single StoreDescriptor can carry more than one DiscoverySpec —
the canonical case is cursor.ide.state_vscdb, which declares one spec
for the modern platform paths (~/.config/Cursor/...) and one for the
legacy ~/.cursor/state.vscdb glob. On standard layouts the two roots
are disjoint, but a custom XDG layout or symlinked install could place
a single state.vscdb where both specs match. handles_from_discovery
treated the two specs independently, so the same file flowed out as
two SourceHandles with different adapter ids
(cursor.state_vscdb_modern.v1 and cursor.state_vscdb_legacy.v1),
producing duplicate search records downstream.

what:
- src/agentgrep/__init__.py: discover_from_catalog now keeps a
  per-descriptor seen_paths set and skips any handle whose path was
  already emitted by an earlier spec on the same descriptor.
  Cross-descriptor dedup stays out of scope — gemini.tmp.chats and
  gemini.tmp.chats_legacy are distinct stores even when they share a
  directory.
- tests/test_stores.py:
  test_discover_from_catalog_deduplicates_paths_within_descriptor
  monkeypatches the catalogue with one descriptor whose two specs
  both point at a single state.vscdb file and asserts exactly one
  SourceHandle is returned.
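
The per-descriptor dedup can be illustrated with a small sketch. It assumes each spec yields (adapter_id, path) pairs and that paths are compared as given, without resolving symlinks; the function name dedupe_handles and the tuple shape are hypothetical, and the real code works on SourceHandle objects.

```python
from pathlib import Path
from typing import Iterable


def dedupe_handles(
    spec_results: Iterable[Iterable[tuple[str, Path]]],
) -> list[tuple[str, Path]]:
    """Keep the first handle per path across the specs of ONE descriptor."""
    seen_paths: set[Path] = set()  # scoped to a single descriptor
    kept: list[tuple[str, Path]] = []
    for handles in spec_results:  # specs are visited in declaration order
        for adapter_id, path in handles:
            if path in seen_paths:
                continue  # an earlier spec already emitted this file
            seen_paths.add(path)
            kept.append((adapter_id, path))
    return kept
```

Calling this once per descriptor, with a fresh set each time, keeps cross-descriptor overlaps intact, which matches the stated scope.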
…tinguishes missing from non-directory

why: The env-override warning emitted by resolve_env_root had two
problems. First, the message was present-tense ("env-override path
does not exist"), which clashes with CLAUDE.md's logging style rule
that messages describe events in past tense. Second, the same
warning fired whether CODEX_HOME or GEMINI_CLI_HOME pointed at a
missing path OR at a path that existed as a regular file (or other
non-directory inode). An operator reading the log could not tell
which class of misconfiguration they had — a typo versus an env var
aimed at the wrong inode kind.

what:
- src/agentgrep/__init__.py: resolve_env_root now logs
  "env-override path unavailable, fell back to default" and adds a
  structured agentgrep_env_path_status extra field whose value is
  "not_a_directory" when candidate.exists() is true and "not_found"
  otherwise. Operators reading the log have a stable scalar to
  filter on.
- tests/test_agentgrep.py: test_resolve_env_root_warns_on_missing_path
  now asserts agentgrep_env_path_status == "not_found" in addition
  to the env-var and env-path fields. New
  test_resolve_env_root_warns_when_env_path_is_file plants a real
  regular file at the env path and asserts the status is
  "not_a_directory".
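
A sketch of the fallback-and-warn behaviour described above, assuming the override is read from os.environ and the path is user-expanded before the check. The message text and the three extra field names are taken from the commit; the rest is illustrative and may differ from the shipped function.

```python
import logging
import os
from pathlib import Path

logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())


def resolve_env_root(env_var: str, default: Path) -> Path:
    """Return the env-override directory, or the default with a warning."""
    raw = os.environ.get(env_var)
    if not raw:
        return default  # unset or empty: fall back silently
    candidate = Path(raw).expanduser()
    if candidate.is_dir():
        return candidate
    # Separate a typo (nothing there) from a path of the wrong kind.
    status = "not_a_directory" if candidate.exists() else "not_found"
    logger.warning(
        "env-override path unavailable, fell back to default",
        extra={
            "agentgrep_env_var": env_var,
            "agentgrep_env_path": str(candidate),
            "agentgrep_env_path_status": status,
        },
    )
    return default
```

The `extra` mapping becomes attributes on the LogRecord, so a handler or a caplog-based test can filter on agentgrep_env_path_status directly.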
…es entry point

why: Cursor CLI transcripts are discovered by discover_cursor_sources,
which reads their DiscoverySpec from CATALOG. A separate
discover_cursor_cli_sources function exists only as an empty stub
whose docstring describes the in-branch refactor history — that
"compatibility" framing is branch-internal narrative for a symbol
that never shipped to users. Removing the stub leaves discovery
behind one entry point and drops misleading documentation from the
shipped surface.

what:
- src/agentgrep/__init__.py: delete the discover_cursor_cli_sources
  definition. No callers in the repo and the symbol is not part of
  any released API.
@tony tony merged commit cacd8b9 into master May 18, 2026
3 checks passed
@tony tony deleted the more-backends branch May 18, 2026 01:25