Skip to content

CodeDB performance improvements#257

Merged
rsnodgrass merged 9 commits intomainfrom
codedb
Mar 17, 2026
Merged

CodeDB performance improvements#257
rsnodgrass merged 9 commits intomainfrom
codedb

Conversation

@rupakm
Copy link
Copy Markdown
Contributor

@rupakm rupakm commented Mar 16, 2026

  1. Parallel blob prefetch (repoPool + prefetchBlobContents):

    • Blobs were read 3x: ensureBlob(), ParseSymbols(), ParseComments()
    • Now prefetched once in parallel via worker pool before each parse phase
    • Per-repo mutexes (repoPool) ensure thread-safe go-git packfile access
    • Shared openReposFromDB() eliminates duplicated repo-opening code
  2. Concurrent comment extraction:

    • comments.Extract() now runs across GOMAXPROCS workers in parallel
    • Results collected into a slice, then written by a single SQL/Bleve writer
    • Blob cache freed before SQL phase to reduce peak memory
  3. Batched SQL inserts:

    • Symbols: multi-row INSERT with batch size 50 (was one-by-one)
    • Symbol refs: multi-row INSERT with batch size 50 (was one-by-one)
    • Comments: multi-row INSERT with batch size 100 (was one-by-one)
    • Uses SQLite sequential ID guarantee for correct Bleve doc IDs
  4. Smaller improvements (cache size, etc)

Benchmark (200 commits, 2418 blobs, 148K comments):
comment parsing: 19.1s → 7.9s (2.4x faster)
total indexing: 29.5s → 18.5s (1.6x faster)

Summary by CodeRabbit

  • New Features

    • Added an ox codedb command suite with index, search, and sql subcommands.
  • Performance

    • Faster indexing via parallel blob prefetching, batched writes, larger commit batches, and reduced memory peaks.
    • Improved DB performance with additional pragmas and new indexes.
  • Tests

    • Added CLI discovery tests for codedb; removed several obsolete benchmarks.
  • Documentation

    • Added detailed design and implementation specs for codedb integration.

Rupak Majumdar and others added 4 commits March 7, 2026 12:29
…pragmas

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: SageOx <ox@sageox.ai>
…batched SQL inserts

Three high-impact optimizations to the codedb indexing pipeline:

1. Parallel blob prefetch (repoPool + prefetchBlobContents):
   - Blobs were read 3x: ensureBlob(), ParseSymbols(), ParseComments()
   - Now prefetched once in parallel via worker pool before each parse phase
   - Per-repo mutexes (repoPool) ensure thread-safe go-git packfile access
   - Shared openReposFromDB() eliminates duplicated repo-opening code

2. Concurrent comment extraction:
   - comments.Extract() now runs across GOMAXPROCS workers in parallel
   - Results collected into a slice, then written by a single SQL/Bleve writer
   - Blob cache freed before SQL phase to reduce peak memory

3. Batched SQL inserts:
   - Symbols: multi-row INSERT with batch size 50 (was one-by-one)
   - Symbol refs: multi-row INSERT with batch size 50 (was one-by-one)
   - Comments: multi-row INSERT with batch size 100 (was one-by-one)
   - Uses SQLite sequential ID guarantee for correct Bleve doc IDs

Benchmark (200 commits, 2418 blobs, 148K comments):
  comment parsing: 19.1s → 7.9s (2.4x faster)
  total indexing:  29.5s → 18.5s (1.6x faster)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: SageOx <ox@sageox.ai>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Mar 16, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: aafd12de-27b4-4904-87ab-f96b820b8ccc

📥 Commits

Reviewing files that changed from the base of the PR and between 98396a3 and 8411408.

📒 Files selected for processing (3)
  • docs/ai/specs/codedb-integration-design.md
  • internal/codedb/index/indexer.go
  • internal/codedb/search/executor_bench_test.go

📝 Walkthrough

Walkthrough

Adds an ox codedb CLI wrapper (index, search, sql) that delegates to an external codedb binary discovered via PATH; refactors the CodeDB indexer for parallel prefetching and batched inserts; adds DB indexes and SQLite PRAGMA tuning; removes several micro-benchmarks; and adds design/implementation docs and tests.

Changes

Cohort / File(s) Summary
CLI Wrapper
cmd/ox/codedb.go, cmd/ox/codedb_index.go, cmd/ox/codedb_search.go, cmd/ox/codedb_sql.go
New parent codedb Cobra command and passthrough subcommands. Adds findCodeDB() (PATH discovery with actionable error) and runCodeDB() (exec with streaming IO and exit-code propagation).
CLI Tests
cmd/ox/codedb_test.go
Adds tests for binary discovery: behavior when PATH is empty and presence-on-PATH case (skips if not installed).
Indexer Core
internal/codedb/index/indexer.go
Large refactor: increase batching (bleveBatchSize 200→500), enlarge tree cache eviction, introduce repoPool and parallel blob prefetching, prefetch-driven ParseSymbols/ParseComments, batched SQL inserts with RETURNING IDs, and memory/error handling changes.
Schema & DB config
internal/codedb/store/schema.go, internal/codedb/store/store.go
Add SQL indexes (idx_blobs_parsed_lang, idx_blobs_comments_parsed) and apply additional SQLite PRAGMA settings on DB open (synchronous, cache_size, mmap_size, temp_store).
Docs & Design
docs/ai/specs/codedb-integration-design.md, docs/ai/specs/codedb-integration-impl.md
Adds design and implementation specs for the ox codedb wrapper, IO/arg passthrough, data layout, testing notes, sequence diagram, and implementation plan.
Benchmarks removed
internal/codedb/index/indexer_bench_test.go, internal/codedb/search/executor_bench_test.go
Removed several micro-benchmarks comparing string formatting and parsing methods; trims benchmark surface.

Sequence Diagram

sequenceDiagram
    autonumber
    participant User as "User"
    participant oxCLI as "ox CLI"
    participant Lookup as "Binary Lookup"
    participant Exec as "Subprocess Exec"
    participant CodeDB as "codedb Binary"
    participant Output as "stdout/stderr"

    User->>oxCLI: run `ox codedb search [query]`
    oxCLI->>Lookup: findCodeDB()
    Lookup->>Lookup: scan PATH for `codedb`
    Lookup-->>oxCLI: return binary path / error
    oxCLI->>Exec: runCodeDB(bin, ["search", ...args])
    Exec->>CodeDB: exec.Command("codedb", ...)
    CodeDB->>Output: write results to stdout/stderr
    Output-->>Exec: stream back to caller
    Exec-->>oxCLI: exit status
    oxCLI-->>User: propagate output and exit code
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

sageox

Poem

🐇 I sniff the PATH and give a cheer,

codedb hops in, the output clear.
Blobs fetched in batches, comments found fast,
Indexes laid down, pragmas set to last.
Hooray — small rabbit, big code blast! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: performance improvements to CodeDB indexing through parallel blob prefetch, concurrent comment extraction, and batched SQL inserts.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codedb
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

Love it @rupak

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (6)
docs/plans/2026-03-07-ox-codedb-impl.md (2)

34-47: Test code in plan has redundant defer — actual implementation is cleaner.

The plan shows manual PATH restoration with defer os.Setenv("PATH", origPath) on line 38, but t.Setenv already handles cleanup automatically. The actual test implementation correctly uses just t.Setenv("PATH", "") without the redundant defer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/2026-03-07-ox-codedb-impl.md` around lines 34 - 47, The test
TestFindCodeDB_NotInPath redundantly restores PATH with defer os.Setenv("PATH",
origPath) even though t.Setenv("PATH", "") already handles cleanup; remove the
manual origPath capture and the defer os.Setenv call and simply use
t.Setenv("PATH", "") in TestFindCodeDB_NotInPath so the test relies on the
testing framework to restore PATH automatically.

144-153: Plan shows type assertion, but implementation correctly uses errors.As.

The plan document shows the older type assertion pattern:

if exitErr, ok := err.(*exec.ExitError); ok {

The actual implementation in cmd/ox/codedb.go correctly uses errors.As as per coding guidelines:

if errors.As(err, &exitErr) {

This is fine since the implementation is correct — consider updating the plan for consistency if it will be referenced again.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/plans/2026-03-07-ox-codedb-impl.md` around lines 144 - 153, Update the
plan document to match the current implementation by replacing the old type
assertion pattern with the errors.As usage: change the shown snippet that uses
"if exitErr, ok := err.(*exec.ExitError); ok {" to the form that uses errors.As
(as implemented in cmd/ox/codedb.go), and mention that you should check for
*exec.ExitError via errors.As into a variable named exitErr before calling
os.Exit(exitErr.ExitCode()); reference the Run method and the error handling
block that returns fmt.Errorf("failed to run codedb: %w", err) so readers see
the consistent behavior.
internal/codedb/index/indexer.go (4)

1195-1199: Silent failure when opening repos may mask issues.

Failed repo opens are silently skipped, which is reasonable for deleted repos but could mask systematic issues. Consider logging at debug level when a repo fails to open:

r, err := plainOpenTolerant(p)
if err != nil {
    // slog.Debug("skipping repo", "path", p, "error", err)
    continue
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1195 - 1199, The loop that
calls plainOpenTolerant(p) currently silently continues on error which can hide
systemic failures; update the error branch in the indexer (where
plainOpenTolerant is used and repos is appended) to emit a debug-level log
including the repo path (p) and the error before continuing so failed opens are
visible (e.g., use the package's logger to slog.Debug or equivalent with
"skipping repo", path and error context).

1340-1349: Parent ID updates are still one-by-one.

The symbols batch insert optimization is partially negated by individual UPDATE statements for parent_id. Consider batching these updates using a single statement with CASE/WHEN:

UPDATE symbols SET parent_id = CASE id
  WHEN ? THEN ? WHEN ? THEN ? ... END
WHERE id IN (?, ?, ...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1340 - 1349, The current loop
in indexer.go that calls tx.Exec per symbol (iterating syms and using symDBIDs)
should be replaced with a single batched UPDATE using a CASE WHEN expression:
collect all (id -> parent_id) pairs for symbols where ParentIdx is valid, build
an UPDATE symbols SET parent_id = CASE id WHEN ? THEN ? WHEN ? THEN ? ... END
WHERE id IN (?, ?, ...), and execute it once via tx.Exec with the parameter
slice (first the id,parent pairs then the id list). Ensure you only include
symbols with ParentIdx >= 0 and < len(symDBIDs), use symDBIDs[j] for the id and
symDBIDs[sym.ParentIdx] for parent_id, and preserve existing error handling
surrounding the tx.Exec call.

1170-1177: Workers continue processing after context cancellation.

When the context is cancelled, the main loop breaks (line 1171-1173) but items already in the channel buffer will still be processed by workers. This is generally fine for a prefetch operation, but if strict cancellation is needed, consider draining the channel or using a select with ctx.Done().

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1170 - 1177, The loop that
sends items from hashes into ch currently breaks on ctx cancellation but can
leave buffered items to be processed; change the send loop to use a select so
sends respect cancellation: label the loop (e.g., sendLoop) and replace ch <- h
with select { case <-ctx.Done(): break sendLoop; case ch <- h: } so the loop
stops immediately when ctx is cancelled, then close(ch) and wg.Wait() as before;
also ensure worker goroutines that receive from ch continue to check ctx where
appropriate (they can use a select on ctx.Done() when processing) to fully
enforce strict cancellation (references: hashes, ch, ctx, wg, worker receivers).

1098-1126: Lock held during I/O may limit parallelism.

The per-repo mutex is held while reading the blob content (lines 1103-1116). This serializes all reads from the same repo, which may reduce the benefit of parallelism when most blobs come from a single repository.

Consider releasing the lock after BlobObject() returns and before io.ReadAll(), since the blob object should be self-contained. However, if go-git's packfile reader requires the lock during the entire read, the current approach is correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1098 - 1126, The
repoPool.readBlob function currently holds rp.locks[i] across the entire blob
read (BlobObject(), Reader(), io.ReadAll()), which serializes reads per repo;
change the locking so we only hold the per-repo lock long enough to obtain the
blob object (or a safe reader) and then release it before doing io.ReadAll to
allow concurrent reads from the same repo. Concretely: inside repoPool.readBlob,
call r.BlobObject(oid) while locked, capture the returned blobObj (or, if
blobObj.Reader() is safe without holding the repo lock, capture reader while
locked), then rp.locks[i].Unlock() immediately before calling reader.Read/
io.ReadAll; retain the current behavior (keep lock across Reader()/ReadAll) only
if you confirm go-git requires the repo lock for the reader to be valid. Ensure
you still close the reader and handle errors exactly as before (use
blobObj.Reader() or reader variable names from the diff).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/codedb/index/indexer.go`:
- Around line 1596-1613: The code in the loop for indexing comments assumes
sequential DB IDs by computing firstID using res.LastInsertId() and mapping
k→commentID and also contains dead code `_ = cm`; update the indexing to not
assume sequential IDs—retrieve or store the actual inserted ID for each comment
(e.g., obtain IDs from the individual insert results or from the DB row metadata
that produced res) and use that real ID when building the Bleve document key
(replace the commentID computation based on firstID/len(batch)); also remove the
leftover `_ = cm` statement and keep the existing use of cm.Text for
BleveCommentDoc and the batching logic (references: res.LastInsertId, firstID,
commentID, cm, commentBatch, BleveCommentDoc, bleveBatchSize,
s.CommentIndex.Batch).
- Around line 1328-1338: The code incorrectly assumes sequential LastInsertId
values; in the block around tx.Exec, LastInsertId and the comment about SQLite
guarantees must be removed and replaced by using an INSERT ... RETURNING id
statement and reading the returned IDs from the result rows so you capture the
actual DB-assigned IDs; change the exec to a query (e.g., tx.Query or
tx.QueryContext) for the INSERT ... RETURNING id, iterate the returned rows in
order and assign each returned id directly into symDBIDs[batchStart+k] for the
corresponding batch element, and update stats.SymbolsExtracted as before.

---

Nitpick comments:
In `@docs/plans/2026-03-07-ox-codedb-impl.md`:
- Around line 34-47: The test TestFindCodeDB_NotInPath redundantly restores PATH
with defer os.Setenv("PATH", origPath) even though t.Setenv("PATH", "") already
handles cleanup; remove the manual origPath capture and the defer os.Setenv call
and simply use t.Setenv("PATH", "") in TestFindCodeDB_NotInPath so the test
relies on the testing framework to restore PATH automatically.
- Around line 144-153: Update the plan document to match the current
implementation by replacing the old type assertion pattern with the errors.As
usage: change the shown snippet that uses "if exitErr, ok :=
err.(*exec.ExitError); ok {" to the form that uses errors.As (as implemented in
cmd/ox/codedb.go), and mention that you should check for *exec.ExitError via
errors.As into a variable named exitErr before calling
os.Exit(exitErr.ExitCode()); reference the Run method and the error handling
block that returns fmt.Errorf("failed to run codedb: %w", err) so readers see
the consistent behavior.

In `@internal/codedb/index/indexer.go`:
- Around line 1195-1199: The loop that calls plainOpenTolerant(p) currently
silently continues on error which can hide systemic failures; update the error
branch in the indexer (where plainOpenTolerant is used and repos is appended) to
emit a debug-level log including the repo path (p) and the error before
continuing so failed opens are visible (e.g., use the package's logger to
slog.Debug or equivalent with "skipping repo", path and error context).
- Around line 1340-1349: The current loop in indexer.go that calls tx.Exec per
symbol (iterating syms and using symDBIDs) should be replaced with a single
batched UPDATE using a CASE WHEN expression: collect all (id -> parent_id) pairs
for symbols where ParentIdx is valid, build an UPDATE symbols SET parent_id =
CASE id WHEN ? THEN ? WHEN ? THEN ? ... END WHERE id IN (?, ?, ...), and execute
it once via tx.Exec with the parameter slice (first the id,parent pairs then the
id list). Ensure you only include symbols with ParentIdx >= 0 and <
len(symDBIDs), use symDBIDs[j] for the id and symDBIDs[sym.ParentIdx] for
parent_id, and preserve existing error handling surrounding the tx.Exec call.
- Around line 1170-1177: The loop that sends items from hashes into ch currently
breaks on ctx cancellation but can leave buffered items to be processed; change
the send loop to use a select so sends respect cancellation: label the loop
(e.g., sendLoop) and replace ch <- h with select { case <-ctx.Done(): break
sendLoop; case ch <- h: } so the loop stops immediately when ctx is cancelled,
then close(ch) and wg.Wait() as before; also ensure worker goroutines that
receive from ch continue to check ctx where appropriate (they can use a select
on ctx.Done() when processing) to fully enforce strict cancellation (references:
hashes, ch, ctx, wg, worker receivers).
- Around line 1098-1126: The repoPool.readBlob function currently holds
rp.locks[i] across the entire blob read (BlobObject(), Reader(), io.ReadAll()),
which serializes reads per repo; change the locking so we only hold the per-repo
lock long enough to obtain the blob object (or a safe reader) and then release
it before doing io.ReadAll to allow concurrent reads from the same repo.
Concretely: inside repoPool.readBlob, call r.BlobObject(oid) while locked,
capture the returned blobObj (or, if blobObj.Reader() is safe without holding
the repo lock, capture reader while locked), then rp.locks[i].Unlock()
immediately before calling reader.Read/ io.ReadAll; retain the current behavior
(keep lock across Reader()/ReadAll) only if you confirm go-git requires the repo
lock for the reader to be valid. Ensure you still close the reader and handle
errors exactly as before (use blobObj.Reader() or reader variable names from the
diff).

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6c486fb5-72cb-45be-9dbd-4a276c78b141

📥 Commits

Reviewing files that changed from the base of the PR and between f9878b8 and 1ee2427.

📒 Files selected for processing (10)
  • cmd/ox/codedb.go
  • cmd/ox/codedb_index.go
  • cmd/ox/codedb_search.go
  • cmd/ox/codedb_sql.go
  • cmd/ox/codedb_test.go
  • docs/plans/2026-03-07-ox-codedb-design.md
  • docs/plans/2026-03-07-ox-codedb-impl.md
  • internal/codedb/index/indexer.go
  • internal/codedb/store/schema.go
  • internal/codedb/store/store.go

Copy link
Copy Markdown
Contributor

@rsnodgrass rsnodgrass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: PR #257 — CodeDB Performance Improvements

Summary

The performance optimizations are well-targeted (parallel blob prefetch, concurrent comment extraction, batched SQL). The merge from main (#251) was clean. However, I found several issues that need attention before merge.

I've pushed fixes for items 1-8, 11, and 12 to this branch. Items 9 and 10 are tracked as follow-up issues.


Critical / High

1. Goroutine deadlock on context cancellation (indexer.go ~L1170, ~L1529)
Both worker pool send loops could hang forever if context is cancelled while workers exit but the channel buffer is full. The ch <- h send blocks with no select on ctx.Done().
Fixed: Added labeled select with ctx.Done() fallback.

2. Sequential ID assumption for batch INSERT (indexer.go ~L1332, ~L1597)
LastInsertId() arithmetic assumes contiguous rowids within a multi-row INSERT — an SQLite implementation detail, not a documented guarantee. Wrong IDs → silent Bleve corruption.
Fixed: Added RowsAffected() guard that fails fast if count doesn't match batch size.

3. No per-blob size cap in prefetch (indexer.go ~L1114)
readBlob calls io.ReadAll with no limit. The 1 MB guard in ParseComments applies after the cache is populated. Large repos could OOM.
Fixed: Added io.LimitReader(reader, maxPrefetchBlobSize+1) in readBlob.

4. Dead _ = cm code (indexer.go ~L1605)
cm is already used on the preceding line (cm.Text). Leftover from a refactor.
Fixed: Removed.

Medium

5. os.Exit() in runCodeDB bypasses defer cleanup (codedb.go ~L56)
Skips PersistentPostRunE (profiling, telemetry). Acceptable for passthrough but undocumented.
Fixed: Added comment documenting the design choice.

6. Design docs in wrong location (docs/plans/)
docs/plans/ doesn't exist in the repo structure. These should be in docs/ai/specs/.
Fixed: Moved to docs/ai/specs/codedb-integration-design.md and codedb-integration-impl.md.

7. openReposFromDB conflates two failure modes (indexer.go ~L1205)
"could not open any git repos" for both "no repos in DB" and "all paths invalid."
Fixed: Distinct messages based on whether the DB had repos.

8. Silent error swallowing in skip-path tx.Exec (indexer.go ~L1302, ~L1565)
Errors from UPDATE blobs SET parsed = 1 are ignored — causes unnecessary re-processing on next run.
Fixed: Added slog.Warn logging.

Lower Priority (also fixed)

11. mmap_size=256MB pragma undocumented (store.go ~L64)
Fixed: Added comment explaining the value and fallback behavior.

12. symbolID=0 dangling FK in symbol_refs (indexer.go ~L1368)
When ContainingSymIdx is out of range, symbolID=0 creates a dangling reference (rowids start at 1). Under PRAGMA foreign_keys=1, this would fail the entire batch.
Fixed: Changed to interface{} (nil → SQL NULL), which satisfies the FK constraint.

Tracked as Follow-up Issues

9. repoPool mutex serializes single-repo case#259
10. treeCache=256 memory analysis needed#260

Not Fixed (Commit History)

13. Co-Authored-By attribution — Both perf commits include Co-Authored-By: Claude Opus 4.6 which the project's CLAUDE.md prohibits. This is in git history and not worth rewriting.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
internal/codedb/index/indexer.go (1)

1345-1356: ⚠️ Potential issue | 🔴 Critical

Do not derive inserted IDs via LastInsertId() arithmetic in batch inserts.

At Line 1349-1356 and Line 1621-1629, the code assumes contiguous DB-assigned IDs. That can mis-link parent_id/symbol_id and Bleve comment doc IDs under valid SQLite allocation behavior.

🛠️ Safer direction (capture real IDs with RETURNING id)
-			res, err := tx.Exec(sb.String(), sqlArgs...)
+			rows, err := tx.Query(sb.String()+" RETURNING id", sqlArgs...)
 			if err != nil {
 				return stats, fmt.Errorf("batch insert symbols: %w", err)
 			}
-			lastID, _ := res.LastInsertId()
-			count, _ := res.RowsAffected()
-			if count != int64(len(batch)) {
-				return stats, fmt.Errorf("symbols batch: expected %d rows inserted, got %d", len(batch), count)
-			}
-			for k := range batch {
-				symDBIDs[batchStart+k] = lastID - int64(len(batch)-1-k)
-			}
+			k := 0
+			for rows.Next() {
+				var id int64
+				if err := rows.Scan(&id); err != nil {
+					rows.Close()
+					return stats, fmt.Errorf("scan inserted symbol id: %w", err)
+				}
+				symDBIDs[batchStart+k] = id
+				k++
+			}
+			if err := rows.Err(); err != nil {
+				rows.Close()
+				return stats, fmt.Errorf("iterate inserted symbol ids: %w", err)
+			}
+			rows.Close()
+			if k != len(batch) {
+				return stats, fmt.Errorf("symbols batch: expected %d returned ids, got %d", len(batch), k)
+			}

Apply the same pattern for comment inserts so Bleve keys use actual inserted comment IDs.

Does SQLite guarantee contiguous/sequential rowids for a multi-row INSERT without AUTOINCREMENT, and is deriving all row IDs from LastInsertId() guaranteed safe?

Also applies to: 1615-1629

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1345 - 1356, The batch-insert
logic must stop deriving individual IDs by arithmetic from res.LastInsertId();
instead modify the symbol and comment batch insert paths (the block using
tx.Exec(...) and populating symDBIDs[batchStart+k] and the analogous
comment/Bleve mapping code) to execute a multi-row INSERT with a RETURNING id
clause (or perform the insert per-row) and iterate the returned rows to capture
each actual inserted id in order, assigning them into symDBIDs (and the comment
ID mapping) based on batchStart+k; replace tx.Exec usage with a query/rows
iteration (e.g., tx.Query/QueryContext and rows.Scan) so parent_id/symbol_id and
Bleve document IDs use the real DB-assigned IDs.
🧹 Nitpick comments (2)
internal/codedb/index/indexer.go (1)

1289-1291: Wrap openReposFromDB errors with phase context.

At Line 1290 and Line 1494, returning raw errors drops where/why the failure happened.

♻️ Proposed patch
-	repos, err := openReposFromDB(s)
+	repos, err := openReposFromDB(s)
 	if err != nil {
-		return stats, err
+		return stats, fmt.Errorf("open repos for symbol parsing: %w", err)
 	}
...
-	repos, err := openReposFromDB(s)
+	repos, err := openReposFromDB(s)
 	if err != nil {
-		return stats, err
+		return stats, fmt.Errorf("open repos for comment parsing: %w", err)
 	}
As per coding guidelines: "Wrap errors with context to provide actionable error messages."

Also applies to: 1493-1495

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/codedb/index/indexer.go` around lines 1289 - 1291, The code returns
raw errors from openReposFromDB which loses phase context; update the return
paths that currently do "return stats, err" (notably in the openReposFromDB call
sites around the existing return at Line 1290 and the similar one at Line 1494)
to wrap the error with contextual text (e.g., using fmt.Errorf or errors.Wrap)
identifying the phase/operation (like "openReposFromDB: failed to open repos" or
similar) so callers can see where it failed; ensure you wrap the original error
with %w (or the equivalent) rather than replacing it.
docs/ai/specs/codedb-integration-design.md (1)

118-126: Align the runCodeDB snippet with actual exit-code propagation behavior.

The design snippet currently returns c.Run() directly, but the implementation handles non-zero subprocess exits explicitly. Keeping this snippet aligned will avoid implementation drift in future edits.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ai/specs/codedb-integration-design.md` around lines 118 - 126, The
snippet currently returns c.Run() directly but must propagate subprocess
non-zero exit codes like the implementation: call err := c.Run(); if err != nil
detect an exec.ExitError (type *exec.ExitError) and extract the exit code (use
ExitError.ExitCode() or examine syscall.WaitStatus) and call os.Exit(code) to
propagate it; otherwise return the error (or nil if no error). Update the
runCodeDB function to perform that check and os.Exit with the child exit code
when appropriate.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/ox/codedb.go`:
- Around line 37-43: The returned error message when CodeDB isn't found should
wrap the underlying LookPath error so callers can use errors.Is()/errors.As();
update the error construction in the return from the codedb lookup (the
fmt.Errorf branch that currently returns the multi-line install/build message)
to include the original err using %w (matching the pattern used later in the
file around the line with wrapping on line 62), so the formatted context message
is preserved while wrapping the LookPath cause.

In `@docs/ai/specs/codedb-integration-design.md`:
- Around line 12-20: The fenced code blocks in the codedb-integration-design.md
doc are missing language identifiers (MD040); update each triple-backtick fence
that wraps the examples (the block beginning with "User -> ox codedb search
\"lang:go Iterator\" --json", the directory listing starting with "cmd/ox/" that
lists codedb.go, codedb_index.go, codedb_search.go, codedb_sql.go, and the
storage layout block starting with "~/.local/share/sageox/codedb/") to include a
language tag (e.g., use ```text) so markdownlint stops flagging MD040.

---

Duplicate comments:
In `@internal/codedb/index/indexer.go`:
- Around line 1345-1356: The batch-insert logic must stop deriving individual
IDs by arithmetic from res.LastInsertId(); instead modify the symbol and comment
batch insert paths (the block using tx.Exec(...) and populating
symDBIDs[batchStart+k] and the analogous comment/Bleve mapping code) to execute
a multi-row INSERT with a RETURNING id clause (or perform the insert per-row)
and iterate the returned rows to capture each actual inserted id in order,
assigning them into symDBIDs (and the comment ID mapping) based on batchStart+k;
replace tx.Exec usage with a query/rows iteration (e.g., tx.Query/QueryContext
and rows.Scan) so parent_id/symbol_id and Bleve document IDs use the real
DB-assigned IDs.

---

Nitpick comments:
In `@docs/ai/specs/codedb-integration-design.md`:
- Around line 118-126: The snippet currently returns c.Run() directly but must
propagate subprocess non-zero exit codes like the implementation: call err :=
c.Run(); if err != nil detect an exec.ExitError (type *exec.ExitError) and
extract the exit code (use ExitError.ExitCode() or examine syscall.WaitStatus)
and call os.Exit(code) to propagate it; otherwise return the error (or nil if no
error). Update the runCodeDB function to perform that check and os.Exit with the
child exit code when appropriate.

In `@internal/codedb/index/indexer.go`:
- Around line 1289-1291: The code returns raw errors from openReposFromDB which
loses phase context; update the return paths that currently do "return stats,
err" (notably in the openReposFromDB call sites around the existing return at
Line 1290 and the similar one at Line 1494) to wrap the error with contextual
text (e.g., using fmt.Errorf or errors.Wrap) identifying the phase/operation
(like "openReposFromDB: failed to open repos" or similar) so callers can see
where it failed; ensure you wrap the original error with %w (or the equivalent)
rather than replacing it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1ca849ca-766b-4e09-b368-01123f97294b

📥 Commits

Reviewing files that changed from the base of the PR and between 1ee2427 and 81b6f18.

📒 Files selected for processing (5)
  • cmd/ox/codedb.go
  • docs/ai/specs/codedb-integration-design.md
  • docs/ai/specs/codedb-integration-impl.md
  • internal/codedb/index/indexer.go
  • internal/codedb/store/store.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
docs/ai/specs/codedb-integration-design.md (2)

118-136: Reconsider the os.Exit pattern in subprocess wrapper.

While the comment explains the rationale for using os.Exit to preserve subprocess exit codes, this pattern has several drawbacks:

  • Testing complexity: os.Exit terminates the process, making unit tests impossible without subprocess wrappers
  • Cleanup bypass: Defers, PersistentPostRunE, and other cleanup handlers are skipped
  • Error handling: Prevents graceful error propagation up the call stack

Alternative approach: Return a custom error type that carries the desired exit code, then handle it in main() or a top-level error handler:

type ExitCodeError struct {
    Code int
    Err  error
}

func (e *ExitCodeError) Error() string { return e.Err.Error() }

func runCodeDB(bin string, args []string) error {
    // ... setup ...
    if err := c.Run(); err != nil {
        var exitErr *exec.ExitError
        if errors.As(err, &exitErr) {
            return &ExitCodeError{Code: exitErr.ExitCode(), Err: err}
        }
        return fmt.Errorf("failed to run codedb: %w", err)
    }
    return nil
}

Then in the cobra RunE or main():

if err != nil {
    var exitErr *ExitCodeError
    if errors.As(err, &exitErr) {
        os.Exit(exitErr.Code)
    }
    return err
}

This preserves testability while still propagating exit codes correctly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ai/specs/codedb-integration-design.md` around lines 118 - 136, Replace
the direct os.Exit call inside runCodeDB with an error return that preserves the
subprocess exit code: introduce a custom ExitCodeError struct (with Code int and
Err error and an Error() method) and have runCodeDB return &ExitCodeError{Code:
exitErr.ExitCode(), Err: err} when errors.As finds an *exec.ExitError; keep
other errors wrapped as before. Update the top-level caller (e.g., the cobra
RunE handler or main) to detect ExitCodeError via errors.As and call
os.Exit(exitErr.Code) there, allowing runCodeDB to remain testable and cleanup
handlers to run.

105-110: Consider referencing external documentation for installation instructions.

Hardcoded installation commands in error messages can become stale as the CodeDBGo project evolves (e.g., binary name changes, installation methods change, or the repository moves).

Consider referencing the CodeDBGo documentation URL instead, which can be kept current by the upstream maintainers.

📚 Suggested alternative
         return "", fmt.Errorf(
             "codedb not found in PATH: %w\n\n"+
-            "Install CodeDB:\n"+
-            "  go install github.com/sageox/CodeDBGo/cmd/codedb@latest\n\n"+
-            "Or build from source:\n"+
-            "  cd CodeDBGo && make build && make install", err)
+            "Install CodeDB:\n"+
+            "  See https://github.com/sageox/CodeDBGo#installation", err)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/ai/specs/codedb-integration-design.md` around lines 105 - 110, The error
message currently returns a multi-line string with hardcoded install/build
commands in the fmt.Errorf(...) call; replace those commands with a stable
reference to the CodeDBGo documentation URL (e.g. "See
https://github.com/sageox/CodeDBGo#installation for installation instructions")
while preserving the error wrapping (%w) and return signature (the existing
return "", fmt.Errorf(..., err) expression). Update the fmt.Errorf call to
include a concise message pointing to the docs URL instead of inline install
steps so the message remains accurate as upstream changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/ai/specs/codedb-integration-design.md`:
- Line 80: Update the table entry to state that Ox controls the CodeDB data
directory by passing a per-ledger path computed via paths.CodeDBSharedDir()
(e.g. ~/.local/share/sageox/<endpoint>/ledgers/<repoID>/.sageox/cache/codedb/)
to codedb.Open(dataDir), rather than using CodeDBGo's standalone default;
mention CodeDBGo's default only as the standalone fallback and clarify that Ox
does not rely on the default nor omit --root when invoking codedb.Open.

In `@internal/codedb/index/indexer.go`:
- Around line 1316-1321: Currently a blobCache miss (content, ok :=
blobCache[blob.contentHash]; !ok || len(content) == 0) causes an immediate
tx.Exec("UPDATE blobs SET parsed = 1 ...") which permanently marks blobs as
parsed; change this so that cache misses or empty content (transient
prefetch/read failures) are NOT marked parsed/comments_parsed=1. Specifically,
remove or guard the tx.Exec call for the blobCache miss path and only set
parsed/comments_parsed when you have a definitive, non-retriable determination
that the blob is truly unreadable (for example after a confirmed repo read error
path or explicit validation), leaving the blob unmodified on simple cache misses
so future retries can process it; update the same logic at the other occurrences
that reference blobCache, blob.contentHash and the tx.Exec("UPDATE blobs SET
parsed = 1 ...") calls (the blocks around the comments_parsed updates at the
other two locations).
- Around line 1632-1659: The comment batching loop can drift if the number of
returned rows doesn't match the number of items in batch; add a sanity check
after iterating rows (before closing rows) that compares cmIdx (or the count of
scanned IDs) against the expected count (len(batch) or the local expected
comment count for this loop) and return an error if they differ to avoid
ID→content mismatches; update the same area that touches rows, commentBatch,
BleveCommentDoc, and stats.CommentsExtracted and ensure you handle
rollback/rows.Close() and do not flush or advance commentBatch when the counts
mismatch (i.e. return fmt.Errorf("mismatched returned comment ids: expected %d
got %d", expected, cmIdx)).

---

Nitpick comments:
In `@docs/ai/specs/codedb-integration-design.md`:
- Around line 118-136: Replace the direct os.Exit call inside runCodeDB with an
error return that preserves the subprocess exit code: introduce a custom
ExitCodeError struct (with Code int and Err error and an Error() method) and
have runCodeDB return &ExitCodeError{Code: exitErr.ExitCode(), Err: err} when
errors.As finds an *exec.ExitError; keep other errors wrapped as before. Update
the top-level caller (e.g., the cobra RunE handler or main) to detect
ExitCodeError via errors.As and call os.Exit(exitErr.Code) there, allowing
runCodeDB to remain testable and cleanup handlers to run.
- Around line 105-110: The error message currently returns a multi-line string
with hardcoded install/build commands in the fmt.Errorf(...) call; replace those
commands with a stable reference to the CodeDBGo documentation URL (e.g. "See
https://github.com/sageox/CodeDBGo#installation for installation instructions")
while preserving the error wrapping (%w) and return signature (the existing
return "", fmt.Errorf(..., err) expression). Update the fmt.Errorf call to
include a concise message pointing to the docs URL instead of inline install
steps so the message remains accurate as upstream changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 50835040-f29d-4f0b-94a1-b1b3dd1260e0

📥 Commits

Reviewing files that changed from the base of the PR and between 81b6f18 and 98396a3.

📒 Files selected for processing (4)
  • cmd/ox/codedb.go
  • docs/ai/specs/codedb-integration-design.md
  • internal/codedb/index/indexer.go
  • internal/codedb/index/indexer_bench_test.go
💤 Files with no reviewable changes (1)
  • internal/codedb/index/indexer_bench_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • cmd/ox/codedb.go

@rsnodgrass rsnodgrass merged commit 694916b into main Mar 17, 2026
1 check was pending
@rsnodgrass rsnodgrass deleted the codedb branch March 17, 2026 02:03
@rsnodgrass rsnodgrass restored the codedb branch March 17, 2026 02:03
rupakm added a commit that referenced this pull request Mar 17, 2026
Co-authored-by: SageOx <ox@sageox.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants