feat: add listIds() to SearchProvider#11

Merged
harlan-zw merged 2 commits into main from feat/list-ids
Mar 21, 2026
Conversation

@harlan-zw harlan-zw commented Mar 21, 2026

🔗 Linked issue

Related to skilld-dev/skilld#28

❓ Type of change

  • 📖 Documentation
  • 🐞 Bug fix
  • 👌 Enhancement
  • ✨ New feature
  • 🧹 Chore
  • ⚠️ Breaking change

📚 Description

The SearchProvider interface had no way to list existing document IDs, which forced consumers into all-or-nothing rebuilds. This adds an optional listIds() method to the interface, implements it in the sqlite driver (SELECT id FROM documents_meta), and wires it through createRetriv. Consumers can now diff incoming docs against the stored set and only chunk/embed the delta.
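
The delta sync described above can be sketched roughly as follows. This is a minimal illustration, not code from the PR: the `Doc` shape and the `diffAgainstIndex` helper are hypothetical, and it assumes `listIds()` resolves to the same IDs that were originally passed to `index()`.

```typescript
// Hypothetical document shape for illustration; not the library's actual type.
interface Doc {
  id: string
  content: string
}

// Compare incoming docs with the IDs already stored in the index.
// `storedIds` is assumed to be the array returned by listIds().
function diffAgainstIndex(incoming: Doc[], storedIds: string[]) {
  const stored = new Set(storedIds)
  const incomingIds = new Set(incoming.map(d => d.id))
  return {
    // Docs that are not in the index yet and still need chunking/embedding.
    toIndex: incoming.filter(d => !stored.has(d.id)),
    // Stored IDs that no longer appear in the incoming set.
    stale: storedIds.filter(id => !incomingIds.has(id)),
  }
}
```

Comparing IDs alone only detects added and removed documents. Detecting documents whose content changed under the same ID would need something extra (for example a content hash), which this PR does not cover.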

Summary by CodeRabbit

  • New Features
    • Added a method to list all indexed document IDs, allowing inspection and management of the full document set.
    • When content is stored in chunks, the method collapses chunk-scoped IDs into their parent document IDs and returns each parent once.
    • If no provider exposes IDs, the method safely returns an empty list.

Adds an optional listIds() method to the SearchProvider interface,
implemented in the sqlite driver and wired through createRetriv.
Returns all document IDs stored in the index, enabling consumers
to diff incoming docs against what's already indexed and only
process the delta.

coderabbitai bot commented Mar 21, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cab32f08-1145-40e2-82c6-bb7fc6ed6a17

📥 Commits

Reviewing files that changed from the base of the PR and between 7835485 and 3b2ac23.

📒 Files selected for processing (1)
  • src/retriv.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/retriv.ts

📝 Walkthrough

Walkthrough

This PR adds a new async listIds(): Promise<string[]> method to the SearchProvider interface, implements it in the SQLite provider (querying documents_meta), and exposes it at the top-level retriv API which delegates to the first driver that provides listIds and post-processes chunk IDs when a chunker is configured.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Type Definition: `src/types.ts` | Added optional `listIds?: () => Promise<string[]>` to `SearchProvider`. |
| SQLite Provider: `src/db/sqlite.ts` | Implemented `listIds()` to run `SELECT id FROM documents_meta` and return a `string[]` of IDs. |
| Retriv API: `src/retriv.ts` | Added top-level `listIds()` that calls the first driver exposing `listIds`, falls back to `[]`, and collapses `#chunk-` IDs to parent document IDs when a chunker is configured. |
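
The Retriv API behavior described in this walkthrough can be sketched as below. This is a rough, self-contained approximation rather than the merged code: `ProviderLike`, `toParentId`, and `listParentIds` are hypothetical names, and the regex assumes chunk IDs always end in `#chunk-<number>`, as in the `doc1#chunk-0` example from the review.

```typescript
// Hypothetical minimal provider shape; only listIds matters for this sketch.
interface ProviderLike {
  listIds?: () => Promise<string[]>
}

// Strip a trailing "#chunk-<n>" suffix (assumed chunk ID format).
function toParentId(id: string): string {
  return id.replace(/#chunk-\d+$/, '')
}

// Delegate to the first driver that exposes listIds, fall back to [],
// and collapse chunk IDs to unique parent IDs when chunking is enabled.
async function listParentIds(drivers: ProviderLike[], chunked: boolean): Promise<string[]> {
  const driver = drivers.find(d => d.listIds)
  const ids = (await driver?.listIds?.()) ?? []
  if (!chunked)
    return ids
  return Array.from(new Set(ids.map(toParentId)))
}
```

Doing the mapping at this layer, instead of inside each driver, keeps drivers unaware of the chunk ID scheme, which matches the reviewers' point that the layer creating chunk IDs should also be the one that reverses them.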

Sequence Diagram

sequenceDiagram
    participant Client
    participant Retriv as Retriv API
    participant Driver as SearchProvider (SQLite)
    participant DB as Database (documents_meta)

    Client->>Retriv: listIds()
    Retriv->>Driver: listIds() on first supporting driver
    Driver->>DB: SELECT id FROM documents_meta
    DB-->>Driver: rows with id values
    Driver-->>Retriv: string[] of IDs
    Retriv-->>Client: Promise<string[]>

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I dug through rows both near and far,

Found every id like a shiny star,
Chunk-bits stitched back to parent ground,
One small hop and all are found,
🥕📜 Hop, retriv — list them round.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: adding a listIds() method to the SearchProvider interface. It is specific, directly related to the primary objective, and uses an appropriate Conventional Commits prefix (feat:). |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/retriv.ts">

<violation number="1" location="src/retriv.ts:229">
P2: When chunking is enabled, `listIds()` returns chunk IDs (e.g. `doc1#chunk-0`) rather than the original document IDs consumers pass to `index()`. Since `createRetriv` is the layer that creates chunk IDs in `prepareDocs`, it should also be responsible for mapping them back to parent IDs here — otherwise the stated use case of diffing incoming docs against the stored set won't work.</violation>
</file>

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/retriv.ts (1)

229-232: Prefer aggregating IDs from all supporting drivers in hybrid mode.

Current logic returns IDs from only the first driver with listIds, which can mask backend drift. Union + dedupe is safer.

Suggested refactor
 async listIds() {
-  const driver = drivers.find(d => d.listIds)
-  return driver?.listIds?.() ?? []
+  const providers = drivers.filter(d => d.listIds)
+  if (!providers.length)
+    return []
+
+  const sets = await Promise.all(providers.map(d => d.listIds!()))
+  return Array.from(new Set(sets.flat()))
 },
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/retriv.ts` around lines 229 - 232, The current listIds method only calls
the first driver with listIds, which hides mismatches; update the retriv.ts
async listIds() to call listIds on all drivers in the drivers array that
implement it (e.g., filter drivers by d.listIds), await all results
(Promise.all), flatten and union/dedupe the ID arrays, and return that
aggregated list instead of the first result so hybrid mode returns the combined
set from all supporting drivers.
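
To make the "backend drift" concern concrete, the following sketch reports IDs that some supporting drivers have and others lack. It is an illustration only: `ProviderLike` and `findDrift` are hypothetical names and are not part of the PR or of the suggested refactor.

```typescript
// Hypothetical minimal provider shape; only listIds matters for this sketch.
interface ProviderLike {
  listIds?: () => Promise<string[]>
}

// Return the union of IDs across all drivers that implement listIds,
// plus the subset of IDs that at least one of those drivers is missing.
async function findDrift(drivers: ProviderLike[]): Promise<{ union: string[], drifted: string[] }> {
  const providers = drivers.filter(d => d.listIds)
  const lists = await Promise.all(providers.map(d => d.listIds!()))

  // Count how many providers report each ID (deduping within a provider first).
  const seenBy = new Map<string, number>()
  lists.forEach((list) => {
    Array.from(new Set(list)).forEach((id) => {
      seenBy.set(id, (seenBy.get(id) ?? 0) + 1)
    })
  })

  const union = Array.from(seenBy.keys())
  // An ID has drifted if fewer providers report it than implement listIds.
  const drifted = union.filter(id => seenBy.get(id)! < lists.length)
  return { union, drifted }
}
```

Returning only the first driver's list would hide everything in `drifted` that the first driver happens to contain or lack; the union keeps those IDs visible to the caller.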
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/db/sqlite.ts`:
- Around line 336-339: listIds() currently returns chunk IDs from documents_meta
(e.g., "<doc>#chunk-..."); update listIds in src/db/sqlite.ts to return
canonical/original document IDs by extracting the prefix before any "#chunk"
suffix and deduplicating the results: run the same SELECT id FROM
documents_meta, map each row to r.id.split('#chunk')[0] (or strip the "#chunk"
part if present), collect unique values (Set) and return the array of canonical
IDs so delta-sync uses original document IDs.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d3961bdd-9aab-4f1f-9b8d-39a67417dcb4

📥 Commits

Reviewing files that changed from the base of the PR and between e8837c2 and 7835485.

📒 Files selected for processing (3)
  • src/db/sqlite.ts
  • src/retriv.ts
  • src/types.ts

Comment on lines +336 to +339
async listIds() {
  const rows = db.prepare('SELECT id FROM documents_meta').all() as Array<{ id: string }>
  return rows.map(r => r.id)
},

⚠️ Potential issue | 🟠 Major

listIds() leaks internal chunk IDs instead of canonical document IDs.

At Line 337, SELECT id FROM documents_meta returns chunk IDs (<doc>#chunk-n) when chunking is enabled, which breaks delta-sync based on original document IDs.

Proposed fix
 async listIds() {
-  const rows = db.prepare('SELECT id FROM documents_meta').all() as Array<{ id: string }>
-  return rows.map(r => r.id)
+  const rows = db.prepare(`
+    SELECT DISTINCT
+      COALESCE(json_extract(metadata, '$._parentId'), id) AS id
+    FROM documents_meta
+    ORDER BY id
+  `).all() as Array<{ id: string }>
+  return rows.map(r => r.id)
 },

When chunking is enabled, the driver returns chunk IDs (e.g.
doc1#chunk-0). Since createRetriv owns the chunking layer, it
should map these back to canonical parent document IDs so
consumers can diff against the original IDs they passed to index().
@harlan-zw harlan-zw merged commit cc70d68 into main Mar 21, 2026
3 checks passed