Add storage catalogue; restore search beneath dotfile .gitignore #4
why: agentgrep ran fd and rg against the user's ~/.claude, ~/.codex, and ~/.cursor without disabling ignore-file semantics. On systems where those paths sit inside a dotfile-managed tree with a .gitignore (yadm, chezmoi, stow, bare-git), both tools silently masked the agent data and the CLI reported "No matches found." fd exited 0 with empty output, so the Python rglob fallback never fired. After discovery, the rg-against-search-root prefilter applied the same mask a second time. Even after the ignore-flag fixes, paths from `fd -a` (which canonicalizes through symlinks) and rg (which doesn't) failed to compare equal in prefilter_sources_by_root, so every source got dropped.
what:
- Pass `-H -I` to fd in list_files_matching and drop `-a` so the returned paths preserve the input root's symlink structure and agree with rg's output.
- Pass `--no-ignore --hidden` (rg) / `--unrestricted --hidden` (ag) in build_grep_command and reshuffle the argv builder so the fixed-string flag goes immediately before `-l`.
- Add test_list_files_matching_ignores_gitignore that drops a `.gitignore: *` next to two JSONL files and asserts discovery still finds them.
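The flag changes in this commit can be pictured with a small argv-building sketch. It is illustrative only: the function names `build_fd_command` and `build_rg_command` are hypothetical stand-ins, not the signatures of list_files_matching or build_grep_command, and the `--type f`, `--glob`, and `--` arguments are assumptions added to make the commands complete.

```python
from pathlib import Path


def build_fd_command(root: Path, glob: str) -> list[str]:
    """Build an fd argv that sees hidden and gitignored files (hypothetical helper)."""
    # -H includes hidden entries; -I stops fd from honoring ignore files.
    # -a is deliberately absent, following the commit's reasoning that fd's
    # absolute-path output would not compare equal to rg's paths.
    return ["fd", "-H", "-I", "--type", "f", "--glob", glob, str(root)]


def build_rg_command(root: Path, needle: str) -> list[str]:
    """Build an rg argv that lists files containing a literal string (hypothetical helper)."""
    # --no-ignore / --hidden mirror fd's -I / -H.
    # -F (fixed strings) sits immediately before -l (list matching files only).
    # "--" keeps a needle that starts with "-" from being parsed as a flag.
    return ["rg", "--no-ignore", "--hidden", "-F", "-l", "--", needle, str(root)]


if __name__ == "__main__":
    print(build_fd_command(Path.home() / ".claude", "*.jsonl"))
    print(build_rg_command(Path.home() / ".claude", "needle"))
```

Building the argv as a list rather than a shell string sidesteps quoting issues when the search term contains spaces or shell metacharacters.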
…c descriptors
why: Each CLI agent (Claude Code, Cursor, Codex, Gemini) lays out prompt and conversation history on disk in its own way, and those layouts drift between releases: Codex renames history files, Cursor adds a CLI-agent layout that doesn't use the IDE's state.vscdb, Gemini reorganises ~/.gemini/tmp/. Burying paths and record schemas inside the search adapters made every layout change a multi-file edit and left no place to record which version was last verified. The catalogue centralises that knowledge as frozen Pydantic descriptors, each stamped with observed_version / observed_at and a pointer to the upstream type definition where one is public. Whether agentgrep searches a given store by default stays a per-store decision the adapters consult; the catalogue itself is descriptive.
what:
- src/agentgrep/stores.py: StoreFormat / StoreRole enums and the frozen StoreDescriptor / StoreCatalog Pydantic models. JSON-schema-exportable for downstream validation.
- src/agentgrep/store_catalog.py: initial CATALOG spanning Claude, Cursor (IDE + CLI agent kept distinct), Codex, and Gemini. Each entry cites the upstream source-of-truth where one exists; the Cursor CLI rows note that no public schema is published. Includes gemini_project_hash(), a Python mirror of getProjectHash() at packages/core/src/utils/paths.ts, so consumers can map a working directory to a Gemini tmp shard.
- tests/test_stores.py: shape tests for unique IDs, agent-prefix discipline, token-form path patterns (no /home/* leaks), primary-chat stores carrying upstream_ref or sample_record, distinguishes_from cross-references resolving, frozen-model enforcement, and Pydantic JSON round-trip.
- tests/conftest.py: fixture_path(store_id, name) helper rooted at tests/samples/.
- tests/samples/: redacted, structurally-faithful example records for every primary-chat / prompt-history / plan store. No real user content; UUIDs, timestamps, and project hashes are placeholders.
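To make the "frozen descriptor plus shape tests" idea concrete, here is a rough sketch. It swaps in a standard-library frozen dataclass for the Pydantic models so it runs without dependencies, so it is not the StoreDescriptor defined in stores.py. The fields `store_id`, `agent`, and `path_pattern`, the `${HOME}` token form, and the sample values are assumptions for illustration; only observed_version, observed_at, and upstream_ref are named in the commit message.

```python
from __future__ import annotations

from dataclasses import FrozenInstanceError, dataclass
from datetime import date


@dataclass(frozen=True)
class StoreDescriptorSketch:
    """Stand-in for a catalogue row; the real model is a frozen Pydantic class."""

    store_id: str          # hypothetical field name
    agent: str             # hypothetical field name
    path_pattern: str      # hypothetical field name, token form such as ${HOME}/...
    observed_version: str
    observed_at: date
    upstream_ref: str | None = None


def check_catalog(rows: list[StoreDescriptorSketch]) -> list[str]:
    """Return one message per violated shape rule (empty list means clean)."""
    problems: list[str] = []
    seen_ids: set[str] = set()
    for row in rows:
        if row.store_id in seen_ids:
            problems.append(f"duplicate id: {row.store_id}")
        seen_ids.add(row.store_id)
        if not row.store_id.startswith(row.agent + "."):
            problems.append(f"id lacks agent prefix: {row.store_id}")
        if row.path_pattern.startswith("/home/"):
            problems.append(f"literal home path leaked: {row.store_id}")
    return problems


if __name__ == "__main__":
    row = StoreDescriptorSketch(
        "codex.history", "codex", "${HOME}/.codex/history.jsonl",
        observed_version="0.0.0",        # placeholder, not a verified version
        observed_at=date(2000, 1, 1),    # placeholder date
    )
    print(check_catalog([row]))
    try:
        row.agent = "other"  # type: ignore[misc]
    except FrozenInstanceError:
        print("rows are immutable")
```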
why: The new storage catalogue ships an importable public API but no narrative entry point. Adding a Sphinx page gives readers one place to learn the catalogue's intent, see the per-agent layouts side-by-side, and follow a recipe for adding or updating a descriptor. The CHANGES placeholder for 0.1.0a2 also needed populating so the upcoming release records the fix, the catalogue, and the doc page under the project's deliverable-prose conventions.
what:
- docs/storage-catalog.md: new reference page with sections per agent (Claude / Cursor / Codex / Gemini), cross-links to upstream Rust and TypeScript type definitions, and an "Adding or updating a store" recipe.
- docs/index.md: wire the new page into the "Get started" toctree.
- CHANGES: replace the 0.1.0a2 placeholder body with a multi-sentence lead and three deliverable sections (### What's new / ### Fixes / ### Documentation).
why: PR #4 fixed search for Claude and Codex on dotfile-managed homes and shipped the storage catalogue, but left Cursor CLI and Gemini deliberately out of scope. The user has chat history in both: 502 transcripts under ~/.cursor/projects/<id>/agent-transcripts/ and 111 .jsonl sessions under ~/.gemini/tmp/<project_hash>/chats/. This wires adapters for both and surfaces "gemini" as a valid --agent value across the CLI and the MCP tool literals.
what:
- Extend AgentName and AGENT_CHOICES to include "gemini". Mirror the change in tests/test_agentgrep.py and across the MCP tool literal sites in src/agentgrep/mcp.py.
- New discover_cursor_cli_sources walks ~/.cursor/projects/*/agent-transcripts/**/*.jsonl, skipping sibling project files (repo.json, mcp-approvals.json, terminals/, canvases/) so transcripts stay the focus.
- New parse_cursor_cli_transcript wraps iter_message_candidates (which already handles Cursor's outer-role / inner-message.content[].text shape) and backfills file mtime as the timestamp fallback since transcripts carry no native per-turn timestamp. Adds an isoformat_from_mtime_ns helper and a datetime import.
- New discover_gemini_sources walks ~/.gemini/tmp/<hash>/chats/session-*.jsonl and ~/.gemini/tmp/<hash>/logs.json.
- New parse_gemini_chat_file is custom rather than reusing iter_message_candidates: extract_role does not recognise Gemini's type-as-role key, and patching it globally would false-positive on every JSON dict with a "type" field. The parser handles SessionMetadataRecord (first line), `$set` MetadataUpdateRecord, and MessageRecord turns; empty-content gemini records (output lives in thoughts[]) are skipped for v1.
- New parse_gemini_logs_file is a flat-array parser mapping each LogEntry to a prompt-history record.
- Plug both new discover functions and four adapter_id branches into discover_sources and iter_source_records. The "cursor" branch becomes additive: the existing IDE / ai-tracking sources still run alongside the new CLI transcripts.
- Catalog corrections in store_catalog.py: drop the non-existent bubbleId mention from cursor.cli.transcripts; clarify that Gemini's RewindRecord / PartialMetadataRecord and info|error|warning type values are documented upstream but unobserved in real files; flip search_by_default to True on the three now-parsed rows; bump catalog_version to 2.
- Six functional tests covering user/assistant Cursor turns, tool_use safety, Gemini user-prompt extraction, metadata-record skipping, and logs.json parsing.
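The mtime backfill for Cursor CLI transcripts could look roughly like the following. The commit names an isoformat_from_mtime_ns helper but does not show its signature, so the one below is a guess; `timestamp_for_turn` and the `"timestamp"` key it inspects are purely hypothetical names used to show where the fallback would apply.

```python
from datetime import datetime, timezone
from pathlib import Path


def isoformat_from_mtime_ns(mtime_ns: int) -> str:
    """Convert a nanosecond mtime (as in os.stat_result.st_mtime_ns) to ISO 8601 UTC."""
    # Split into whole seconds and leftover nanoseconds to avoid float rounding.
    seconds, nanos = divmod(mtime_ns, 1_000_000_000)
    stamp = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return stamp.replace(microsecond=nanos // 1_000).isoformat()


def timestamp_for_turn(record: dict, transcript: Path) -> str:
    """Prefer a per-turn timestamp; otherwise fall back to the file's mtime."""
    native = record.get("timestamp")  # hypothetical key; Cursor turns carry none
    if isinstance(native, str) and native:
        return native
    return isoformat_from_mtime_ns(transcript.stat().st_mtime_ns)


if __name__ == "__main__":
    print(isoformat_from_mtime_ns(1_500_000_000))
```

Because every turn in a transcript shares the same file, the fallback gives all of them the same timestamp; that is a property of this approach, not a bug in the sketch.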
why: The new Cursor CLI and Gemini adapters need user-facing announcement: a 0.1.0a2 changelog entry naming the deliverables and a refresh of the storage-catalog reference page so the per-agent sections accurately describe what agentgrep now parses.
what:
- CHANGES: refresh the 0.1.0a2 lead to mention the new --agent values; add a "Gemini CLI search support" deliverable section under What's new naming the chat-session and logs.json stores and the gemini_project_hash helper; add a "Cursor CLI agent search support" section noting that --agent cursor now hits both the IDE and the CLI surfaces. Existing "Storage catalogue" and "Search beneath dotfile .gitignore" entries from PR #4 are left intact.
- docs/storage-catalog.md: rewrite the Cursor section to name both adapters (cursor.cli_jsonl.v1 and the IDE state.vscdb pair) as searched. Rewrite the Gemini section to describe the records actually observed in real files (only `user` and `gemini` type values; no Rewind / PartialMetadata records), name the active adapter_ids, and note the v1 limitation around empty-content gemini records whose output lives in thoughts[].
why: The project's changelog conventions require PR references to sit in each deliverable's #### heading so readers can navigate from a CHANGES entry to the PR that shipped it. The five 0.1.0a2 entries landed without those refs across two prior changelog commits on this branch because the PR number wasn't known at authoring time. Now that PR #4 is open and ready for review, each deliverable heading gets its (#4) suffix.
what:
- CHANGES: append `(#4)` to the five 0.1.0a2 deliverable headings: "Gemini CLI search support", "Cursor CLI agent search support", "Storage catalogue", "Search beneath dotfile .gitignore", and "New storage-catalog reference page".
Code review

Found 3 issues:
- agentgrep/src/agentgrep/mcp.py, lines 320 to 327 in eb217e5
- agentgrep/src/agentgrep/mcp.py, lines 36 to 45 in eb217e5
- agentgrep/src/agentgrep/store_catalog.py, lines 402 to 407 in eb217e5; agentgrep/docs/storage-catalog.md, lines 104 to 109 in eb217e5

🤖 Generated with Claude Code
…n capabilities
why: PR #4 (this branch) extended AgentSelector, CapabilitiesModel.agents's type, and every per-tool literal site in src/agentgrep/mcp.py to include "gemini", but neither the runtime list returned by build_capabilities() nor KNOWN_ADAPTERS was updated. MCP clients querying agentgrep://capabilities were told there are three agents and six adapters while CLI / library users see four agents and nine adapters. The branch's own narrative ("agentgrep search --agent gemini is now a valid CLI invocation") was broken at the MCP surface.
what:
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with cursor.cli_jsonl.v1, gemini.tmp_chats_jsonl.v1, and gemini.tmp_logs_json.v1 in dotted-prefix order alongside the existing adapter ids.
- src/agentgrep/mcp.py: replace the hardcoded agents=["codex", "claude", "cursor"] in build_capabilities() with agents=list(agentgrep.AGENT_CHOICES) so the MCP capabilities resource stays in lockstep with the CLI's AGENT_CHOICES tuple; future agent additions only need one source-of-truth update.
- src/agentgrep/mcp.py: declare a module-local AgentName alias and add AGENT_CHOICES: tuple[AgentName, ...] to the AgentGrepModule Protocol so ty can verify the list[AgentName] flowing into CapabilitiesModel.
- tests/test_agentgrep_mcp.py: new test_mcp_capabilities_lists_every_supported_agent_and_adapter that reads the live agentgrep://capabilities resource and asserts the advertised agents match agentgrep.AGENT_CHOICES and the three new adapter ids are present.
…rides
why: Every Codex catalogue row declares
env_overrides=("CODEX_HOME",) (5 rows) and every Gemini row declares
env_overrides=("GEMINI_CLI_HOME",) (8 rows). The catalogue's module
docstring describes these as the contract adapters consult. But
discover_codex_sources hardcoded home / ".codex" and
discover_gemini_sources hardcoded home / ".gemini" / "tmp", so the
catalogue shipped a promise the runtime never kept.
The Codex implementation mirrors codex-rs/utils/home-dir/src/lib.rs
(env var, when non-empty, replaces ~/.codex). The Gemini
implementation mirrors packages/cli/index.ts (env var, when non-empty,
replaces ~/.gemini; "tmp" is then appended for the chat/log root).
what:
- discover_codex_sources: read CODEX_HOME via os.environ.get; fall
back to home / ".codex" when the env var is unset or empty.
- discover_gemini_sources: read GEMINI_CLI_HOME the same way; appends
/tmp to the resolved base.
- Document the new behaviour in both docstrings, pointing at the
upstream source-of-truth files.
- tests/test_agentgrep.py: two new tests that monkeypatch the env
var to an alternate tmp_path location, plant a decoy session under
${HOME}/.codex or ${HOME}/.gemini, and assert discovery hits the
env-pointed root and ignores the decoy.
…t absence of type
why: parse_gemini_chat_file identified the SessionMetadataRecord line with "startTime" in mapping and "type" not in mapping. Upstream distinguishes metadata records by the kind field (packages/core/src/services/chatRecordingTypes.ts), not by the absence of type. If a future Gemini CLI release adds a type field to the metadata record (e.g. type="session_meta"), the absence-based check would silently misclassify the metadata line as a turn, reset session_id mid-stream, and drop the message because content isn't where the MessageRecord parser expects it. Switching to "kind" in mapping uses the upstream-documented discriminator. The on-disk shape sampled during the v1 adapter already includes "kind":"main" on every metadata line, so this is backwards-compatible with current Gemini releases.
what:
- src/agentgrep/__init__.py: parse_gemini_chat_file now branches on "kind" in mapping for the SessionMetadataRecord guard. Inline comment explains the forward-compat reasoning.
- tests/test_agentgrep.py: test_search_gemini_chat_session_metadata_with_future_type_field exercises the new guard by writing a metadata line that carries both kind and a hypothetical type="session_meta" field; the subsequent user MessageRecord must still emit one chat record with the correct session_id.
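A presence-based discriminator of the kind described here might be sketched as follows. `classify_gemini_line` and its string labels are hypothetical, and the real parse_gemini_chat_file may order or combine its checks differently; the point is only that testing for `kind` before `type` keeps a metadata line classified correctly even if it later gains a `type` field.

```python
import json


def classify_gemini_line(line: str) -> str:
    """Label one JSONL line from a Gemini chat session file (hypothetical helper)."""
    mapping = json.loads(line)
    if not isinstance(mapping, dict):
        return "unknown"
    if "$set" in mapping:
        # MetadataUpdateRecord: a partial update to session metadata.
        return "update"
    if "kind" in mapping:
        # SessionMetadataRecord: identified by the presence of "kind",
        # regardless of whether a "type" field is also present.
        return "metadata"
    if "type" in mapping:
        # MessageRecord: Gemini stores the role under "type".
        return "message"
    return "unknown"


if __name__ == "__main__":
    future_meta = '{"kind": "main", "startTime": "t0", "sessionId": "s1", "type": "session_meta"}'
    print(classify_gemini_line(future_meta))
    print(classify_gemini_line('{"type": "user", "content": "hello"}'))
```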
…_records branch
why: Every other branch in iter_source_records ends with ``yield from parse_*(source); return``. The final ``gemini.tmp_logs_json.v1`` branch omitted the trailing return. Functionally identical today (it's the last branch) but inconsistent with the local pattern, and a future appended branch would have run unconditionally for log sources. One-line consistency fix locks the "each adapter_id dispatches to exactly one parser" invariant.
what:
- src/agentgrep/__init__.py: add ``return`` after the final ``yield from parse_gemini_logs_file(source)``.
… artifacts
why: Three sites violated the project's "Shipped vs. Branch-Internal Narrative" rule (see CLAUDE.md / AGENTS.md, Published-Release Test). The mis-attribution of the `_N.sqlite` files happened during PR #4's research; readers of 0.1.0a1 never saw a wrong attribution. The CHANGES lead's "previously did not parse" phrasing compares branch states rather than shipped states. Per the rule, the artifact should describe only the current state; the historical "we got this wrong once" belongs in commit messages, not in shipped catalogue notes, docs pages, or release-notes prose. The corrected explanation for *why* the `_N.sqlite` files belong to Codex is that their filenames are defined by `STATE_DB_FILENAME` and `LOGS_DB_FILENAME` constants in the upstream Rust crate `codex-rs/state/src/lib.rs`, which is now what the shipped artifacts say.
what:
- src/agentgrep/store_catalog.py: rewrite the `codex.logs_db` row's `schema_notes` so it cites the upstream constant names directly and drops the "earlier exploration mis-attributed them" aside.
- docs/storage-catalog.md: rewrite the Codex `_N.sqlite` paragraph to cite the upstream constants and drop the "earlier exploratory notes mis-attributed them" aside.
- CHANGES: rewrite the 0.1.0a2 lead so it describes the current state ("adds search support for the Cursor CLI agent's per-project transcripts and Google Gemini CLI's session and prompt-log files") rather than the branch diff ("broadens search to two stores agentgrep previously did not parse").
why: The catalogue declares gemini.tmp.logs as role=StoreRole.PROMPT_HISTORY; upstream's logs.json is the user-prompt audit log, analogous to Codex's history.jsonl. Records from an audit log are kind="history" by the project's SearchRecord convention; parse_codex_history_file constructs them that way explicitly. The Gemini logs parser routed records through build_search_record, which auto-classifies role="user" as kind="prompt", so agentgrep search --agent gemini --type history returned zero logs.json hits.
what:
- parse_gemini_logs_file now yields SearchRecord(kind="history", ...) directly, mirroring parse_codex_history_file. Drops the MessageCandidate + build_search_record indirection.
- test_search_gemini_logs_returns_user_message asserts kind == "history" and session_id propagates from the LogEntry.
…Spec
why: Every adapter hard-coded its agent path roots, globs, post-
filters, and runtime metadata (store, adapter_id, path_kind,
source_kind) inline. The catalogue's env_overrides field was the
only catalogue-side declaration the runtime consulted, and even
that duplicated the env-var name. Centralising the runtime metadata
in the catalogue means a future upstream rename — Codex moving its
history file, Cursor adding a CLI agent, Gemini reorganising tmp/ —
is a one-row edit; discovery catches up automatically. The
catalogue earns its keep as the runtime contract, not just
documentation.
Pairs naturally with the codebase's first logger
(logging.getLogger("agentgrep") + NullHandler in
src/agentgrep/__init__.py), which fires a WARNING when CODEX_HOME
or GEMINI_CLI_HOME points to a non-existent path. Upstream Codex
errors in that case; agentgrep stays read-only-friendly and falls
back, but the warning surfaces a misconfiguration the user almost
certainly cares about. The structured extra={"agentgrep_env_var":
..., "agentgrep_env_path": ...} follows CLAUDE.md's logging
conventions exactly.
what:
- src/agentgrep/stores.py: move PathKind and SourceKind here as
literal type aliases (re-exported from agentgrep for compat);
add the new DiscoverySpec Pydantic model with home_subpath,
platform_paths, files, glob, and path_parts_required fields;
add discovery: tuple[DiscoverySpec, ...] = () to StoreDescriptor.
A tuple (rather than a single optional) handles stores whose
on-disk shape spans more than one DiscoverySpec — Codex history
has .json and .jsonl alternatives, Cursor IDE state has modern
and legacy adapter ids.
- src/agentgrep/store_catalog.py: populate discovery on
claude.projects.session, cursor.cli.transcripts,
cursor.ai_tracking, cursor.ide.state_vscdb, codex.history,
codex.sessions, gemini.tmp.chats, and gemini.tmp.logs.
- src/agentgrep/__init__.py:
- logger = logging.getLogger(__name__); NullHandler registered.
- resolve_env_root(env_var, default) honours the env override and
logger.warning()s when the path isn't a directory.
- handles_from_discovery(spec, agent, root, backends) emits
SourceHandles by interpreting one DiscoverySpec.
- discover_from_catalog(home, agent, base, backends) iterates
CATALOG.for_agent(agent) and walks each row's discovery.
- discover_codex / claude / cursor / gemini delegate to that
helper; discover_cursor_cli_sources is a back-compat shim that
returns [] because Cursor CLI transcripts now flow through
discover_cursor_sources.
- tests/test_agentgrep.py: caplog-scoped test asserts the WARNING
record carries the structured agentgrep_env_var /
agentgrep_env_path attributes; a companion test asserts
resolve_env_root falls back silently when the env var is unset.
- tests/test_stores.py: invariant test asserts every runtime
adapter id is declared by some catalogue row's DiscoverySpec and
also advertised in agentgrep.mcp.KNOWN_ADAPTERS.
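The env-override fallback with a structured warning can be sketched along these lines. This is an approximation under stated assumptions: the `environ` parameter, the logger name `agentgrep_sketch`, and the warning text are invented for the example, while the extra keys agentgrep_env_var and agentgrep_env_path are the ones the commit message names.

```python
import logging
import os
from collections.abc import Mapping
from pathlib import Path
from typing import Optional

# Library-style logger: stays silent unless the application configures handlers.
logger = logging.getLogger("agentgrep_sketch")  # hypothetical logger name
logger.addHandler(logging.NullHandler())


def resolve_env_root(
    env_var: str,
    default: Path,
    environ: Optional[Mapping[str, str]] = None,  # injectable for tests (assumption)
) -> Path:
    """Return the env-pointed directory, or the default when unset or unusable."""
    env = os.environ if environ is None else environ
    value = env.get(env_var, "")
    if not value:
        # Unset or empty: fall back without logging anything.
        return default
    candidate = Path(value)
    if candidate.is_dir():
        return candidate
    # Set but unusable: warn with structured fields, then fall back.
    logger.warning(
        "env override ignored, fell back to default",  # illustrative wording
        extra={"agentgrep_env_var": env_var, "agentgrep_env_path": str(candidate)},
    )
    return default


if __name__ == "__main__":
    print(resolve_env_root("CODEX_HOME", Path.home() / ".codex"))
```

Passing the values through `extra` rather than formatting them into the message lets log consumers filter on the attributes directly.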
why: Upstream Gemini CLI still reads the older single-file `.json`
session format via the `isLegacyRecord` discriminator in
packages/core/src/services/chatRecordingService.ts. The legacy
shape is a JSON object with session metadata at the top level and
the full conversation under a `messages` array; each entry carries
the same per-turn fields the JSONL format uses. agentgrep had no
adapter for these files, leaving ~1000 real legacy session files
under ~/.gemini/tmp/*/chats/*.json invisible to search on systems
that predate the JSONL migration.
what:
- src/agentgrep/__init__.py:
- Factor _gemini_message_record_to_candidate(mapping, session_id)
so the new parser shares the inner-record handling with
parse_gemini_chat_file. Returns None for records with no
searchable text (empty content + no thoughts/toolCalls).
- parse_gemini_chat_legacy_file(source) reads the JSON object,
pulls sessionId from the top level, iterates `messages[]`, and
yields one SearchRecord per turn with searchable text.
- Dispatcher routes adapter_id "gemini.tmp_chats_legacy_json.v1"
to the new parser.
- src/agentgrep/store_catalog.py: new gemini.tmp.chats_legacy row.
role=SUPPLEMENTARY_CHAT, format=JSON_OBJECT, distinguishes_from
the current jsonl chats. DiscoverySpec uses home_subpath=("tmp",)
and glob="session-*.json" so the catalogue drives both the
enumeration and the runtime metadata.
- src/agentgrep/mcp.py: extend KNOWN_ADAPTERS with
"gemini.tmp_chats_legacy_json.v1".
- tests/test_agentgrep.py:
test_search_gemini_chat_legacy_json_session plants a single
legacy file with one user MessageRecord and asserts the prompt
is surfaced with the right session_id and role.
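As a rough picture of the legacy single-file shape, the sketch below walks a JSON object with a top-level `sessionId` and a `messages` array. It is not parse_gemini_chat_legacy_file: `iter_legacy_turns` is a hypothetical name, it yields plain tuples instead of SearchRecord objects, and it assumes `content` is a plain string, which the commit message does not confirm.

```python
import json
from collections.abc import Iterator


def iter_legacy_turns(raw: str) -> Iterator[tuple[str, str, str]]:
    """Yield (session_id, role, text) for each turn with non-empty string content."""
    doc = json.loads(raw)
    if not isinstance(doc, dict):
        return
    session_id = str(doc.get("sessionId", ""))
    messages = doc.get("messages")
    if not isinstance(messages, list):
        return
    for message in messages:
        if not isinstance(message, dict):
            continue
        content = message.get("content")
        # Assumption: content is a string; other shapes are skipped here.
        if isinstance(content, str) and content.strip():
            # Gemini stores the role under "type".
            yield session_id, str(message.get("type", "")), content


if __name__ == "__main__":
    sample = json.dumps(
        {"sessionId": "s-1", "messages": [{"type": "user", "content": "hello"}]}
    )
    for turn in iter_legacy_turns(sample):
        print(turn)
```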
…all text
why: Gemini assistant turns on the current `.jsonl` format leave the
`content` field empty when the model's prose lives in `thoughts[]`
(reasoning, ~150-300 chars per entry) and tool invocations live in
`toolCalls[]`. The previous adapter dropped those turns entirely,
leaving ~70% of assistant output invisible to search. Concatenating
`thoughts[*].subject`/`description` and `toolCalls[*].name`/`description`
into the candidate's text restores searchability without breaking the
conversation-turn boundary: one SearchRecord per turn keeps record
counts flat and downstream filtering simple. Tool `args` is omitted —
it's JSON-shaped and low signal compared to the human-readable
`description` field.
what:
- src/agentgrep/__init__.py:
- _gemini_thoughts_text(thoughts) flattens subject + description
into a newline-joined string. Skipped when no `thoughts` field is
present or it's not a list.
- _gemini_tool_calls_text(toolCalls) same for tool calls' `name` +
`description`. Tool `args` deliberately excluded.
- _gemini_message_record_to_candidate reuses both helpers when the
record is `type="gemini"`. Returns None only when content,
thoughts, AND toolCalls all contribute no text.
- src/agentgrep/store_catalog.py: refresh `gemini.tmp.chats`
search_notes to document the new behaviour.
- tests/test_agentgrep.py:
test_search_gemini_chat_session_surfaces_thoughts_and_tool_calls
plants a gemini turn with thoughts only and a gemini turn with
toolCalls only, asserts both surface with the expected text.
test_search_gemini_chat_session_drops_textless_records (renamed
from drops_metadata_records) keeps the negative case: gemini-typed
records with empty content AND no thoughts AND no toolCalls still
produce nothing.
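The flattening described above can be sketched as below. The helper names `flatten_fields` and `searchable_text` are hypothetical and merge what the commit splits across _gemini_thoughts_text, _gemini_tool_calls_text, and _gemini_message_record_to_candidate; the sketch also assumes `content` is a plain string.

```python
from typing import Any, Optional


def flatten_fields(items: Any, keys: tuple[str, ...]) -> str:
    """Join the named string fields of each dict in a list, one per line."""
    if not isinstance(items, list):
        return ""
    parts: list[str] = []
    for item in items:
        if not isinstance(item, dict):
            continue
        for key in keys:
            value = item.get(key)
            if isinstance(value, str) and value:
                parts.append(value)
    return "\n".join(parts)


def searchable_text(record: dict) -> Optional[str]:
    """Return one text blob per turn, or None when nothing is searchable."""
    content = record.get("content")
    pieces = [content if isinstance(content, str) else ""]
    if record.get("type") == "gemini":
        pieces.append(flatten_fields(record.get("thoughts"), ("subject", "description")))
        # Tool "args" is left out on purpose: only name and description are kept.
        pieces.append(flatten_fields(record.get("toolCalls"), ("name", "description")))
    text = "\n".join(piece for piece in pieces if piece)
    return text or None


if __name__ == "__main__":
    turn = {
        "type": "gemini",
        "content": "",
        "thoughts": [{"subject": "Plan", "description": "Read the file"}],
    }
    print(searchable_text(turn))
```

Returning a single string per turn keeps the one-record-per-turn property the commit message calls out.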
…avity catalogue rows
why: Upstream investigation established that three catalogue rows described behaviour upstream does not have:
- `gemini.history` claimed the row was a post-retention archive of `tmp/`. Upstream `packages/cli/src/utils/sessionCleanup.ts` hard-deletes expired sessions via `fs.unlink()`; there is no archive. The `.project_root` metadata stubs under some users' `~/.gemini/history/` are orphaned artefacts from an earlier layout, not Gemini-CLI-written archives.
- `gemini.antigravity.brain` and `gemini.antigravity.conversations` are Antigravity IDE artefacts. Gemini CLI only detects Antigravity as an IDE launcher target (`packages/core/src/ide/detect-ide.ts`); it never reads or writes the protobuf conversation files. If agentgrep ever supports Antigravity, that is a separate agent kind, not a Gemini sub-store.
Shipping catalogue claims unbacked by upstream is the same kind of branch-internal narrative the project's CLAUDE.md rule forbids; honest catalogue means dropping the rows.
what:
- src/agentgrep/store_catalog.py: delete the three rows. Bump `catalog_version` to 3 (descriptor entries removed).
- docs/storage-catalog.md: rewrite the Gemini section to name the three real adapters (`gemini.tmp_chats_jsonl.v1`, `gemini.tmp_chats_legacy_json.v1`, `gemini.tmp_logs_json.v1`), explain that thoughts and tool-call text now surface for assistant turns with empty `content`, and document Antigravity / `history/` as out-of-scope with upstream citations.
- CHANGES: refresh the "Gemini CLI search support" entry to mention the legacy `.json` adapter, thoughts/toolCalls surfacing, and the CODEX_HOME / GEMINI_CLI_HOME warning; refresh the "Storage catalogue" entry to mention DiscoverySpec and the row prunings.
Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code
…riven discovery
why: src/agentgrep/store_catalog.py's module docstring describes the catalogue as the runtime contract: "each StoreDescriptor carries a search_by_default field that the per-agent discover functions consult." The runtime did not. discover_from_catalog walked every descriptor's discovery tuple unconditionally, so a row marked search_by_default=False but carrying a DiscoverySpec would silently flow into search results, and the catalogue would lie about its own behaviour. The shape of the catalogue today hides this from users: every row that sets search_by_default=False also leaves discovery=(), so the loop happens to do the right thing in practice. Adding a DiscoverySpec to such a row in the future would have been a quiet regression.
what:
- src/agentgrep/__init__.py: discover_from_catalog now skips descriptors whose search_by_default is exactly False. True remains searched; None (decision-deferred) also remains searched, matching the historical default for rows that pre-date the search-by-default field.
- src/agentgrep/store_catalog.py: state the policy explicitly on the two rows that previously relied on the None default: cursor.ai_tracking (role corrected from APP_STATE to SUPPLEMENTARY_CHAT to reflect that its conversation_summaries rows are chat-derived metadata, not app state) and cursor.ide.state_vscdb.
- tests/test_stores.py: test_discover_from_catalog_skips_search_by_default_false plants a synthetic catalogue with one False row and one True row that both carry valid DiscoverySpecs against the same base directory, monkeypatches CATALOG, and asserts only the True row's source flows through.
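The tri-state rule reduces to an identity check against False. A minimal sketch, with `should_search` and `searched_store_ids` as hypothetical names (the real logic lives inside discover_from_catalog):

```python
from typing import Optional


def should_search(search_by_default: Optional[bool]) -> bool:
    """True and None are searched; only an explicit False opts a store out."""
    # "is not False" (rather than truthiness) keeps None on the searched side.
    return search_by_default is not False


def searched_store_ids(rows: list[tuple[str, Optional[bool]]]) -> list[str]:
    """Filter (store_id, search_by_default) pairs down to the searched ids."""
    return [store_id for store_id, flag in rows if should_search(flag)]


if __name__ == "__main__":
    # Illustrative ids only; these are not real catalogue rows.
    rows = [("a.searched", True), ("a.deferred", None), ("a.skipped", False)]
    print(searched_store_ids(rows))
```

A plain truthiness test (`if row.search_by_default:`) would drop the None rows as well, which is the behaviour this commit avoids.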
…e descriptor
why: A single StoreDescriptor can carry more than one DiscoverySpec. The canonical case is cursor.ide.state_vscdb, which declares one spec for the modern platform paths (~/.config/Cursor/...) and one for the legacy ~/.cursor/state.vscdb glob. On standard layouts the two roots are disjoint, but a custom XDG layout or symlinked install could place a single state.vscdb where both specs match. handles_from_discovery treated the two specs independently, so the same file flowed out as two SourceHandles with different adapter ids (cursor.state_vscdb_modern.v1 and cursor.state_vscdb_legacy.v1), producing duplicate search records downstream.
what:
- src/agentgrep/__init__.py: discover_from_catalog now keeps a per-descriptor seen_paths set and skips any handle whose path was already emitted by an earlier spec on the same descriptor. Cross-descriptor dedup stays out of scope: gemini.tmp.chats and gemini.tmp.chats_legacy are distinct stores even when they share a directory.
- tests/test_stores.py: test_discover_from_catalog_deduplicates_paths_within_descriptor monkeypatches the catalogue with one descriptor whose two specs both point at a single state.vscdb file and asserts exactly one SourceHandle is returned.
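The per-descriptor dedup amounts to a first-spec-wins filter over paths. In the sketch below, `dedupe_within_descriptor` is a hypothetical helper and handles are simplified to (adapter_id, path) tuples rather than SourceHandle objects.

```python
from pathlib import Path


def dedupe_within_descriptor(
    handles_per_spec: list[list[tuple[str, Path]]],
) -> list[tuple[str, Path]]:
    """Keep the first handle seen for each path across one descriptor's specs."""
    seen_paths: set[Path] = set()
    kept: list[tuple[str, Path]] = []
    for handles in handles_per_spec:      # one inner list per DiscoverySpec
        for adapter_id, path in handles:
            if path in seen_paths:
                continue                  # an earlier spec already emitted this file
            seen_paths.add(path)
            kept.append((adapter_id, path))
    return kept


if __name__ == "__main__":
    shared = Path("/example/state.vscdb")  # illustrative path
    modern = [("cursor.state_vscdb_modern.v1", shared)]
    legacy = [("cursor.state_vscdb_legacy.v1", shared)]
    print(dedupe_within_descriptor([modern, legacy]))
```

The set is created per call, so calling the helper once per descriptor keeps the dedup scoped to that descriptor and leaves cross-descriptor overlaps alone, as the commit intends.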
…tinguishes missing from non-directory
why: The env-override warning emitted by resolve_env_root had two
problems. First, the message was present-tense ("env-override path
does not exist"), which clashes with CLAUDE.md's logging style rule
that messages describe events in past tense. Second, the same
warning fired whether CODEX_HOME or GEMINI_CLI_HOME pointed at a
missing path OR at a path that existed as a regular file (or other
non-directory inode). An operator reading the log could not tell
which class of misconfiguration they had — a typo versus an env var
aimed at the wrong inode kind.
what:
- src/agentgrep/__init__.py: resolve_env_root now logs
"env-override path unavailable, fell back to default" and adds a
structured agentgrep_env_path_status extra field whose value is
"not_a_directory" when candidate.exists() is true and "not_found"
otherwise. Operators reading the log have a stable scalar to
filter on.
- tests/test_agentgrep.py: test_resolve_env_root_warns_on_missing_path
now asserts agentgrep_env_path_status == "not_found" in addition
to the env-var and env-path fields. New
test_resolve_env_root_warns_when_env_path_is_file plants a real
regular file at the env path and asserts the status is
"not_a_directory".
…es entry point
why: Cursor CLI transcripts are discovered by discover_cursor_sources, which reads their DiscoverySpec from CATALOG. A separate discover_cursor_cli_sources function exists only as an empty stub whose docstring describes the in-branch refactor history; that "compatibility" framing is branch-internal narrative for a symbol that never shipped to users. Removing the stub leaves discovery behind one entry point and drops misleading documentation from the shipped surface.
what:
- src/agentgrep/__init__.py: delete the discover_cursor_cli_sources definition. No callers in the repo and the symbol is not part of any released API.
Summary
agentgrep search --agent claude/--agent codex/--agent cursoron systems where the agent stores live inside a dotfile-managed tree (yadm, chezmoi, stow, bare-git, mr).fdandrgboth honored.gitignoresemantics under$HOMEand silently masked everything; removingfd -aalso fixes a path-comparison mismatch between fd's canonicalized output and rg's symlink-preserving output.observed_version,observed_at, format, schema notes, and pointers to upstream type definitions where one is public.agentgrep search --agent cursorreturns results from~/.cursor/projects/<id>/agent-transcripts/<session_uuid>/<session_uuid>.jsonlin addition to the existing Cursor IDEstate.vscdbstore. Transcripts carry no native per-turn timestamp; agentgrep backfills the file's mtime.--agent geminiis now a valid choice across the CLI and the MCP tool literals. The adapter parses both~/.gemini/tmp/<project_hash>/chats/session-*.jsonl(mixedSessionMetadataRecord/MessageRecord/MetadataUpdateRecordlines) and~/.gemini/tmp/<project_hash>/logs.json(flatLogEntryaudit array).Changes by area
Search discovery and prefilter
src/agentgrep/__init__.py—list_files_matchingpasses-H -Itofdand drops-a. The flag pair bypasses gitignore/hidden filtering; dropping-akeeps fd's paths in the same symlink-preserved form thatrgemits, soprefilter_sources_by_rootcan compare them directly.build_grep_commandpasses--no-ignore --hidden(rg) /--unrestricted --hidden(ag), and the argv builder is reshuffled so the fixed-string flag goes immediately before-l.New store catalogue
src/agentgrep/stores.pyStoreFormatandStoreRoleenums; frozen PydanticStoreDescriptorandStoreCatalogmodels. JSON-schema-exportable.src/agentgrep/store_catalog.pyCATALOG(version 2) spanning Claude, Cursor (IDE + CLI agent), Codex, Gemini. Includesgemini_project_hash()— a Python mirror of Gemini CLI'sgetProjectHash()so consumers can map a working directory to a Gemini tmp shard.tests/test_stores.py/home/*leaks),distinguishes_fromresolves, frozen-model enforcement, Pydantic JSON round-trip, fixture validation.tests/conftest.pyfixture_path(store_id, name)helper rooted attests/samples/.tests/samples/<agent>/<store_id>/...New Cursor CLI agent adapter
- `src/agentgrep/__init__.py`: `discover_cursor_cli_sources` walks `~/.cursor/projects/*/agent-transcripts/**/*.jsonl`, skipping sibling project files (`repo.json`, `mcp-approvals.json`, `terminals/`, `canvases/`). `parse_cursor_cli_transcript` wraps the existing `iter_message_candidates`, which already handles Cursor's outer-`role` / inner-`message.content[].text` shape, and backfills the file's mtime as a timestamp fallback via the new `isoformat_from_mtime_ns` helper.

New Gemini CLI adapter
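The dedicated chat parser this section describes can be pictured with the sketch below. It is not the PR's `parse_gemini_chat_file`. It assumes that message records carry a string `content`, a `timestamp`, and a `type` of `user` or `gemini`, and that the session metadata record exposes `sessionId`; those field shapes and the function name `iter_gemini_messages` are assumptions for illustration.

```python
from __future__ import annotations

import json
from typing import Iterable, Iterator

# Assumed role values stored under Gemini's "type" key.
MESSAGE_TYPES = {"user", "gemini"}


def iter_gemini_messages(lines: Iterable[str]) -> Iterator[dict[str, str | None]]:
    """Yield searchable turns from session JSONL lines (hypothetical helper)."""
    session_id: str | None = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        if "$set" in record:
            continue  # metadata update, never a turn
        kind = record.get("type")
        if not isinstance(kind, str) or kind not in MESSAGE_TYPES:
            # Treated as session metadata: remember the id, emit nothing.
            candidate = record.get("sessionId")
            if isinstance(candidate, str):
                session_id = candidate
            continue
        content = record.get("content")
        if not isinstance(content, str) or not content.strip():
            continue  # e.g. gemini records whose output lives elsewhere
        timestamp = record.get("timestamp")
        yield {
            "role": kind,
            "text": content,
            "timestamp": timestamp if isinstance(timestamp, str) else None,
            "session_id": session_id,
        }
```

Keeping the `type`-as-role rule local to this function is the point of the design: a shared role extractor never has to guess whether `"type": "user"` in some other agent's JSON is a role.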
- `src/agentgrep/__init__.py`: `discover_gemini_sources` walks `~/.gemini/tmp/<project_hash>/chats/session-*.jsonl` and `logs.json`. `parse_gemini_chat_file` is a custom parser (not a reuse of `iter_message_candidates`) because Gemini stores the role in a `type` key that `extract_role` does not recognise; patching `extract_role` globally would false-positive on any JSON dict containing `"type": "user"`. The parser handles `SessionMetadataRecord` (first line), `{"$set": …}` `MetadataUpdateRecord` updates, and `MessageRecord` turns. `parse_gemini_logs_file` is a flat-array parser mapping each `LogEntry` to a prompt-history record.
- `src/agentgrep/__init__.py` / `src/agentgrep/mcp.py` / `tests/test_agentgrep.py`: `AgentName` literal and `AGENT_CHOICES` extended with `"gemini"`; five MCP tool literal sites updated to match.

Documentation
- `docs/storage-catalog.md`: new reference page.
- `docs/index.md`
- `CHANGES`: `### What's new` (storage catalogue, Gemini CLI search, Cursor CLI search), `### Fixes` (dotfile `.gitignore`), `### Documentation` (new reference page).

Test coverage
`tests/test_agentgrep.py` adds `test_list_files_matching_ignores_gitignore` (regression for the fd-flag fix), plus six adapter tests covering: Cursor user/assistant turns, Cursor `tool_use` safety, Gemini user prompts with timestamp + sessionId, Gemini metadata-record skipping, and Gemini `logs.json` parsing.

Design decisions
- Catalogue as data, not code paths. Each `StoreDescriptor` is a frozen Pydantic row with `observed_version` / `observed_at` stamps and an `upstream_ref` pointer. Future upstream renames become a one-row edit plus a `catalog_version` bump. The catalogue is descriptive: adapters consume it, and whether agentgrep searches a given store by default is a per-store decision the adapters carry, not a property of the catalogue itself.
- `search_by_default` is tri-state. Each row is `True` / `False` / `None`. `None` documents stores agentgrep knows about but has not yet decided to search; Gemini Antigravity's protobuf blobs are a current example because no public `.proto` definition exists.
- IDE vs CLI agent kept as separate Cursor entries. Both have `role=PRIMARY_CHAT` and `distinguishes_from` pointers. The `--agent cursor` dispatcher branch is additive: both the IDE `state.vscdb` adapters and the new CLI adapter run side by side, so users with both surfaces see hits from each.
- Gemini chat parser is custom rather than reusing `iter_message_candidates`. Gemini stores the role in a `type` key. Adding `"type"` to `extract_role`'s keyset would false-positive on every JSON dict with a `"type": "user"` field across all adapters. A small dedicated parser is the cleaner boundary; it also gives a clean place to drop `SessionMetadataRecord` / `MetadataUpdateRecord` / empty-content gemini records explicitly.
- Ignore-flag bypass is safe across all five call sites. `list_files_matching` is invoked only by `discover_codex_sources`, `discover_claude_sources`, `discover_cursor_sources`, `discover_cursor_cli_sources`, and `discover_gemini_sources`; all of them want raw agent data, never git-aware filtering.

Verification
End-to-end smoke against this user's data — all four agents return matches:
    $ uv run agentgrep search libtmux --agent claude --limit 1
    $ uv run agentgrep search libtmux --agent codex --limit 1
    $ uv run agentgrep search libtmux --agent cursor --limit 1
    $ uv run agentgrep search libtmux --agent gemini --limit 1

Catalogue round-trips through JSON without field loss:

    $ uv run python -c "from agentgrep.store_catalog import CATALOG; CATALOG.model_dump_json()"

Catalogue JSON schema emits cleanly:

    $ uv run python -c "from agentgrep.stores import StoreCatalog; print(StoreCatalog.model_json_schema()['title'])"

Sphinx renders the new page and the CHANGES references without warnings:

    $ just build-docs

Test plan
- `test_list_files_matching_ignores_gitignore`: discovery survives `.gitignore: *` at the search root.
- `tests/test_stores.py`: catalogue shape invariants and Pydantic round-trip.
- `test_primary_fixtures_exist_and_are_well_formed`: every primary-chat / prompt-history / plan store has a valid fixture file under `tests/samples/`.
- `test_search_cursor_cli_transcript_user_prompt`: user-turn text surfaces with mtime-backfilled timestamp.
- `test_search_cursor_cli_transcript_assistant_text`: both user and assistant roles emit records.
- `test_search_cursor_cli_transcript_ignores_tool_use_blocks`: `tool_use` content blocks with no `text` payload do not crash or yield empty records.
- `test_search_gemini_chat_session_user_prompt`: Gemini user `MessageRecord` surfaces with timestamp + sessionId.
- `test_search_gemini_chat_session_drops_metadata_records`: `{"$set": …}` updates and empty-content gemini records produce no record.
- `test_search_gemini_logs_returns_user_message`: `logs.json` array yields prompt-history records.
- `uv run pytest`: full suite passes.
- `uv run ruff format .`: clean.
- `uv run ruff check .`: clean.
- `uv run ty check`: clean.
- `just build-docs`: new `storage-catalog` page renders; CHANGES entry resolves its `{ref}` and `{mod}` / `{class}` cross-references.
- `--agent {claude,codex,cursor,gemini,all}`, repeated `--agent`, `--type {prompts,history,all}`, `--regex`, `--case-sensitive`, `--any`, `--json`, `--ndjson`, `--limit`, `--color {always,never}`, `--progress never`, `agentgrep find <pattern>` per agent, invalid `--agent foo` rejected with a five-choice error.

Out of scope (follow-ups)
- Update `discover_codex_sources` / `discover_claude_sources` / `discover_cursor_sources` to read paths from `CATALOG` rather than hard-coding strings; the catalogue is the contract those adapters will consume.
- `thoughts[*].description` and `toolCalls[*]` as searchable text. The v1 Gemini chat parser drops gemini-typed records whose `content` is empty (output lives in `thoughts[]`); a clean follow-up adds a separate extraction path.
- `GEMINI_CLI_HOME` env override. Most users use the default `~/.gemini`; documented in the catalogue row but not yet read at runtime.
- `.json` single-file sessions (pre-Feb 2026 format).
- Gemini Antigravity protobuf blobs, pending a public `.proto`.
- Determine why `~/.cursor/ai-tracking/ai-code-tracking.db` has 0 rows on some systems and document the answer in `cursor.ai_tracking.schema_notes`.