A code indexing and search system that enables Sourcegraph-style queries against repositories and their full history. Embedded, built in Rust.
- Sourcegraph-style search: intuitive query language with filters like `repo:`, `lang:`, `file:`, `type:`, `calls:`, `returns:`
- Full-text code search: literal, regex, phrase, and term search over file contents via Tantivy
- Diff search: find commits that introduced or modified specific code
- Commit search: search commit metadata by author, date range, and message
- Symbol search: find functions, structs, classes, traits, and other symbols via tree-sitter
- Cross-reference queries: `calls:fn` to find callers, `calledby:fn` to find callees
- Type-aware search: `returns:Type` to find functions by return type
- Rich SQL queries: filter by repo, branch, file path, language, commit metadata
- Unified query engine: Tantivy search indexes exposed as SQLite virtual tables, enabling JOINs across text search and metadata in a single SQL query
- Content-addressable dedup: mirrors git's blob model; identical files across branches/commits are stored and indexed once
- Incremental updates: re-indexing is proportional to new commits, not total repo size
- Embedded: no external servers; everything runs in-process
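
To make the content-addressable dedup idea concrete, here is a minimal sketch of a store that keys file contents by a hash of their bytes, so identical content is stored (and would be indexed) only once. It is an illustration, not CodeDB's implementation: `BlobStore`, `intern`, and `index_runs` are hypothetical names, and `DefaultHasher` stands in for git's SHA object IDs only to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical in-memory blob store keyed by a content hash.
/// DefaultHasher is a stand-in for git's SHA object IDs; it is not
/// collision-resistant and is used here only to avoid dependencies.
#[derive(Default)]
pub struct BlobStore {
    blobs: HashMap<u64, Vec<u8>>,
    /// Number of times content was actually stored (and would be indexed).
    pub index_runs: usize,
}

impl BlobStore {
    /// Returns the blob ID, storing the content only the first time it is seen.
    pub fn intern(&mut self, content: &[u8]) -> u64 {
        let mut hasher = DefaultHasher::new();
        content.hash(&mut hasher);
        let id = hasher.finish();
        if !self.blobs.contains_key(&id) {
            self.blobs.insert(id, content.to_vec());
            self.index_runs += 1;
        }
        id
    }

    /// Number of distinct blobs currently stored.
    pub fn len(&self) -> usize {
        self.blobs.len()
    }
}
```

Interning the same bytes from two branches returns the same ID and leaves `index_runs` unchanged, which is the property that keeps indexing cost tied to new content rather than to the number of commits that contain a file.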

```bash
# Build
cargo build --release

# Index a repository (clones, walks history, extracts symbols)
codedb --root ~/.codedb index https://github.com/user/repo

# Search for code (Sourcegraph-style query)
codedb --root ~/.codedb search "function_name"

# Filtered search
codedb --root ~/.codedb search "lang:rust file:*.rs -file:test serialize"

# Regex search
codedb --root ~/.codedb search "/fn\s+process_\w+/"

# OR search: match either term
codedb --root ~/.codedb search "serialize OR deserialize"

# Find symbols
codedb --root ~/.codedb search "type:symbol select:symbol.function SFrame"

# Cross-reference: who calls groupby()?
codedb --root ~/.codedb search "calls:groupby"

# Type info: functions returning BatchIterator
codedb --root ~/.codedb search "returns:BatchIterator"

# Diff search: commits that touched "streaming"
codedb --root ~/.codedb search "type:diff file:*.rs streaming"

# Commit search by author
codedb --root ~/.codedb search "type:commit author:Yucheng parallel"

# Show generated SQL instead of executing
codedb --root ~/.codedb search --sql "lang:rust file:*.rs serialize"

# Run arbitrary SQL (with full-text search via virtual tables)
codedb --root ~/.codedb sql "
  SELECT fr.path, cs.score
  FROM code_search('error handling') cs
  JOIN blobs b ON b.id = cs.blob_id
  JOIN file_revs fr ON fr.blob_id = b.id
  JOIN refs r ON r.commit_id = fr.commit_id
  WHERE r.name = 'refs/heads/main'
  ORDER BY cs.score DESC
  LIMIT 10
"
```

The `search` command uses a Sourcegraph-compatible query language.
Bare words are search terms; filters use `key:value` syntax.

| Filter | Description | Example |
|---|---|---|
| `repo:` / `-repo:` | Include / exclude by repository name; supports `@rev` | `repo:SFrame@main` |
| `file:` / `-file:` | Include / exclude file paths | `file:*.rs -file:test` |
| `lang:` / `-lang:` | Include / exclude by language | `lang:rust` / `-lang:python` |
| `type:` | Search type: `code`, `diff`, `commit`, `symbol` | `type:symbol` |
| `rev:` | Branch or ref (default: `refs/heads/main`) | `rev:develop` |
| `select:` | Output format: `repo`, `file`, `symbol`, `symbol.KIND` | `select:symbol.function` |
| `count:` | Max results (default: 20) | `count:50` |
| `case:` | Case sensitivity (`yes`/`no`) | `case:yes` |
| `author:` / `-author:` | Include / exclude commit/diff author | `author:Yucheng` / `-author:bot` |
| `before:` / `after:` | Date range for commits/diffs | `after:2024-01-01` |
| `message:` / `-message:` | Include / exclude by commit message | `message:refactor` / `-message:WIP` |
| `calls:` | Find functions that call a given function | `calls:groupby` |
| `calledby:` | Find functions called by a given function | `calledby:groupby` |
| `returns:` | Find functions returning a given type | `returns:SFrame` |
| `patterntype:` | Pattern interpretation: `literal`, `keyword`, `regexp` | `patterntype:regexp` |

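
The direction of `calls:` and `calledby:` is easy to mix up, so here is a small sketch of the two lookups over a list of call sites. It assumes exact name matching and an in-memory list; `CallRef`, `callers_of`, and `callees_of` are hypothetical names for illustration, not CodeDB's API, and CodeDB's real matching rules may differ.

```rust
/// A call site: `caller` is the containing function, `callee` the function invoked.
/// Hypothetical shape for illustration only.
pub struct CallRef<'a> {
    pub caller: &'a str,
    pub callee: &'a str,
}

/// `calls:NAME` direction: functions whose bodies call NAME.
pub fn callers_of<'a>(refs: &[CallRef<'a>], name: &str) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = refs
        .iter()
        .filter(|r| r.callee == name)
        .map(|r| r.caller)
        .collect();
    // Report each function once, in a stable order.
    out.sort_unstable();
    out.dedup();
    out
}

/// `calledby:NAME` direction: functions that NAME itself calls.
pub fn callees_of<'a>(refs: &[CallRef<'a>], name: &str) -> Vec<&'a str> {
    let mut out: Vec<&'a str> = refs
        .iter()
        .filter(|r| r.caller == name)
        .map(|r| r.callee)
        .collect();
    out.sort_unstable();
    out.dedup();
    out
}
```

With call sites `load -> groupby`, `report -> groupby`, and `groupby -> hash_key`, the `calls:groupby` direction yields `load` and `report`, while `calledby:groupby` yields `hash_key`.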
- Bare words: `foo bar` matches files containing both terms (implicit AND).
- Quoted phrases: `"error handling"` matches the exact phrase.
- OR operator: `foo OR bar` matches files containing either term. Works in all search types.
- Regex: `/pattern/` or `patterntype:regexp pattern` for regex matching (code and diff search only).
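
As a rough illustration of the `key:value` syntax described above, the sketch below splits a query string into filters, `OR` operators, and bare terms. It is a simplification under stated assumptions: it splits on whitespace only, so quoted phrases and regexes containing spaces are not handled, and it does not check which filters accept negation. `Token`, `FILTER_KEYS`, and `tokenize` are hypothetical names, not CodeDB's parser.

```rust
/// One parsed piece of a query string (hypothetical type for illustration).
#[derive(Debug, PartialEq)]
pub enum Token {
    /// `key:value` filter; `negated` is true for the `-key:value` form.
    Filter { key: String, value: String, negated: bool },
    /// The `OR` operator between search terms.
    Or,
    /// A bare search term.
    Term(String),
}

/// Filter names taken from the filter table in this README.
const FILTER_KEYS: &[&str] = &[
    "repo", "file", "lang", "type", "rev", "select", "count", "case",
    "author", "before", "after", "message", "calls", "calledby",
    "returns", "patterntype",
];

/// Splits a query on whitespace and classifies each word.
/// Words with an unknown key (for example `std::fmt`) stay bare terms.
pub fn tokenize(query: &str) -> Vec<Token> {
    query
        .split_whitespace()
        .map(|word| {
            if word == "OR" {
                return Token::Or;
            }
            let (negated, rest) = match word.strip_prefix('-') {
                Some(stripped) => (true, stripped),
                None => (false, word),
            };
            match rest.split_once(':') {
                Some((key, value)) if FILTER_KEYS.contains(&key) && !value.is_empty() => {
                    Token::Filter {
                        key: key.to_string(),
                        value: value.to_string(),
                        negated,
                    }
                }
                _ => Token::Term(word.to_string()),
            }
        })
        .collect()
}
```

Checking keys against a known list, rather than treating every colon as a filter separator, keeps terms such as `std::fmt` searchable as plain text.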

The `type:` filter selects one of four search types:

- `type:code` (default): Full-text search across file contents. Returns path, score, and snippet.
- `type:symbol`: Search extracted symbols. Use `select:symbol.KIND` to filter by kind (function, struct, class, etc.).
- `type:diff`: Search within commit diffs. Supports `author:`, `before:`, `after:`, `file:` filters.
- `type:commit`: Search commit metadata. Supports `author:`, `before:`, `after:`, `message:` filters.

CodeDB implements a subset of the Sourcegraph query language, tailored for searching one or a few locally-indexed repositories rather than a large multi-tenant instance. Most simple Sourcegraph queries work unchanged.
The core query experience is compatible:
- Bare search terms: `error handling` searches file contents, just like Sourcegraph.
- Quoted phrases: `"parse error"` matches the exact phrase.
- Filters: `repo:`, `file:`, `lang:` (alias `l:`), `rev:`, `case:`, `type:` (code/diff/commit/symbol), `select:`, `count:`, `author:`, `before:`, `after:`, `message:` all work as expected.
- Negation: `-file:`, `-repo:`, `-lang:`, `-author:`, `-message:` are all supported.
- `@revision`: `repo:foo@branch` works, equivalent to `repo:foo rev:branch`.
- `patterntype:`: `literal`, `keyword`, and `regexp` are all supported.
- OR operator: `foo OR bar` works across all search types.
- Regex: `/pattern/` syntax for code and diff search.

| Area | Sourcegraph | CodeDB |
|---|---|---|
| Scope | Searches across thousands of repositories on a hosted instance. Has `fork:`, `archived:`, `visibility:`, `repogroup:`, `repo:has.file()`, `repo:has.path()`, `file:has.owner()` for filtering across a large corpus. | Designed for one or a few locally-indexed repos. Repository-level metadata filters (`fork:`, `archived:`, `visibility:`, `repogroup:`, `repo:has.*`, `file:has.*`) are not supported; they don't apply at this scale. |
| Regex | Supports `/regex/` patterns and `patterntype:regexp`. | Supported for code and diff search via `/pattern/` or `patterntype:regexp`. Not available for symbol or commit search (use raw SQL). |
| Boolean operators | Full `AND`, `OR`, `NOT` with parenthesized grouping. | `OR` supported between search terms (`foo OR bar`). All other terms are implicitly AND'ed. No `NOT` or parentheses. Negation supported for filters (`-file:`, `-repo:`, `-lang:`, `-author:`, `-message:`). |
| Structural search | `type:structural` with Comby patterns for syntax-aware matching (e.g., `fmt.Sprintf(:[args])`). | Not supported. CodeDB addresses similar use cases through symbol extraction and cross-reference queries instead. |
| Code intelligence | Separate LSIF/SCIP-based precise code navigation (go-to-definition, find-references). Not part of the query language. | Built into the query language via `calls:`, `calledby:`, and `returns:` filters. Uses tree-sitter extraction, which is less precise than LSIF/SCIP but requires no separate indexing pipeline. |
| `patterntype:` | Controls interpretation: `literal`, `regexp`, `keyword`, `structural`. | `literal`, `keyword`, and `regexp` supported. `structural` not supported. |
| `@revision` syntax | `repo:foo@branch` to pin a revision. | Supported: splits into `repo:foo` + `rev:branch`. |
| `timeout:`, `stable:` | Query execution controls. | Not supported (queries run locally and complete fast). |

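
The `@revision` handling described in the table amounts to a small string split. The sketch below shows one way to do it, with the documented default of `refs/heads/main` applied when no revision is given. `split_repo_rev` and `effective_rev` are hypothetical helper names, and the sketch does not attempt to model how CodeDB maps a short name such as `develop` onto a full ref.

```rust
/// Splits a `repo:` filter value such as `foo@branch` into the repository
/// pattern and an optional revision. A trailing `@` with nothing after it
/// is treated as "no revision".
pub fn split_repo_rev(value: &str) -> (&str, Option<&str>) {
    match value.split_once('@') {
        Some((repo, rev)) if !rev.is_empty() => (repo, Some(rev)),
        Some((repo, _)) => (repo, None),
        None => (value, None),
    }
}

/// Picks the revision to search, falling back to the documented default.
pub fn effective_rev(explicit: Option<&str>) -> &str {
    explicit.unwrap_or("refs/heads/main")
}
```

For example, `split_repo_rev("SFrame@main")` yields the pair `("SFrame", Some("main"))`, which corresponds to writing `repo:SFrame rev:main` by hand.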
Most queries port directly:
| Sourcegraph query | CodeDB equivalent |
|---|---|
| `lang:go fmt.Sprintf` | `lang:go fmt.Sprintf` (identical) |
| `file:\.py$ import requests` | `file:*.py import requests` (use GLOB wildcards for file filters) |
| `repo:myorg/myrepo error` | `repo:myrepo error` (substring match on repo name) |
| `type:diff author:alice fix` | `type:diff author:alice fix` (identical) |
| `type:symbol lang:rust Iterator` | `type:symbol lang:rust Iterator` (identical) |
| `repo:foo@develop query` | `repo:foo@develop query` (identical; `@` syntax supported) |
| `patterntype:regexp err\d+` | `patterntype:regexp err\d+` (identical) or `/err\d+/` |
| `foo OR bar lang:go` | `lang:go foo OR bar` (identical) |
| `(foo OR bar) AND baz` | Not supported: no parenthesized grouping |

When the query language isn't enough, use `codedb search --sql` to see the
generated SQL, then adapt it with `codedb sql` for full control, including
JOINs, aggregations, regex via Tantivy, and anything else SQLite supports.

CodeDB uses tree-sitter to extract symbols from source code during indexing.
| Language | Symbol Types |
|---|---|
| Rust | function, struct, enum, trait, impl, const, static, module |
| Python | function, class |
| JavaScript | function, class, method, interface, enum, type_alias |
| TypeScript | function, class, method, interface, enum, type_alias |
| TSX | function, class, method, interface, enum, type_alias |
| Go | function, method, type |
| C | function, struct, enum |
| C++ | function, struct, enum, class, namespace |
- Symbols — name, kind, location (line/column), full signature
- Type info — return types, parameter lists
- Scope nesting — methods within classes/impls tracked via parent relationships
- Call references — function call sites with containing symbol context
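
To show how parent relationships can recover scope nesting, here is a minimal sketch that walks parent links to build a qualified name. The `Symbol` struct and its fields are hypothetical and only loosely modeled on the list above; they are not CodeDB's actual schema, and the sketch assumes parent links never form a cycle.

```rust
/// Hypothetical in-memory shape of an extracted symbol.
pub struct Symbol {
    pub name: String,
    pub kind: String,
    pub line: u32,
    /// Index of the enclosing symbol (class, impl, ...) in the same slice, if any.
    pub parent: Option<usize>,
}

/// Builds a qualified name such as `Parser::parse` by following parent links
/// from the symbol at `index` up to the outermost scope.
pub fn qualified_name(symbols: &[Symbol], index: usize) -> String {
    let mut parts = Vec::new();
    let mut current = Some(index);
    while let Some(i) = current {
        parts.push(symbols[i].name.as_str());
        current = symbols[i].parent;
    }
    // Parts were collected innermost-first; flip to outermost-first.
    parts.reverse();
    parts.join("::")
}
```

A method `parse` whose parent is an `impl` named `Parser` resolves to `Parser::parse`, while a top-level symbol resolves to its own name.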

```
┌─────────────────────────────────────┐
│         codedb-cli (binary)         │
├─────────────────────────────────────┤
│        codedb-core (library)        │
│  ┌───────────┐  ┌────────────────┐  │
│  │  SQLite   │  │    Tantivy     │  │
│  │ (metadata,│  │ (code_search,  │  │
│  │  DAG,     │◄─┤  diff_search   │  │
│  │  file_revs│  │  via vtab)     │  │
│  └───────────┘  └────────────────┘  │
│  ┌───────────┐  ┌────────────────┐  │
│  │    gix    │  │  tree-sitter   │  │
│  │ (git ops) │  │  (symbols,     │  │
│  │           │  │   call refs)   │  │
│  └───────────┘  └────────────────┘  │
├─────────────────────────────────────┤
│    tantivy-sqlite (vtab bridge)     │
└─────────────────────────────────────┘
```

Three crates in this workspace:
| Crate | Purpose |
|---|---|
| `tantivy-sqlite` | Generic SQLite virtual table bridge for Tantivy indexes |
| `codedb-core` | Library: git ingestion, schema, indexing, symbol extraction, query translation |
| `codedb-cli` | CLI binary wrapping `codedb-core` |

| Table | Description |
|---|---|
| `repos` | Indexed repositories |
| `commits` | Commit metadata (hash, author, message, timestamp) |
| `commit_parents` | Commit parent relationships (DAG) |
| `refs` | Branch/tag refs pointing to commits |
| `blobs` | Unique file contents (content-addressable by SHA) |
| `file_revs` | Files present at each ref tip |
| `diffs` | Per-file diffs for each commit |
| `symbols` | Extracted symbols (name, kind, signature, return type, params) |
| `symbol_refs` | Call sites and references between symbols |
| `code_search()` | Virtual table: full-text search over file contents |
| `diff_search()` | Virtual table: full-text search over diffs |

```
~/.codedb/
  db.sqlite                          # SQLite database (metadata + virtual tables)
  tantivy/code_search/               # Tantivy index for file contents
  tantivy/diff_search/               # Tantivy index for commit diffs
  repos/{host}/{owner}/{name}.git/   # Bare git clones
```

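
The `repos/{host}/{owner}/{name}.git/` layout suggests a direct mapping from clone URL to on-disk path. The sketch below shows one plausible mapping for URLs of the form `scheme://host/owner/name`. It is an assumption about how the layout could be derived, not CodeDB's code: `bare_clone_path` is a hypothetical helper, and scp-style URLs (`git@host:owner/name`) or deeper paths are simply rejected.

```rust
use std::path::PathBuf;

/// Maps a clone URL to a bare-clone location under the data root, following
/// the `repos/{host}/{owner}/{name}.git/` layout. Returns None unless the URL
/// has exactly three non-empty parts: host, owner, and name.
pub fn bare_clone_path(root: &str, url: &str) -> Option<PathBuf> {
    // Drop the scheme ("https://", "ssh://", ...) if present.
    let rest = url.split_once("://").map_or(url, |(_, r)| r);
    // Normalize a trailing slash and an optional ".git" suffix.
    let rest = rest.trim_end_matches('/');
    let rest = rest.strip_suffix(".git").unwrap_or(rest);

    let mut parts = rest.split('/');
    let (host, owner, name) = (parts.next()?, parts.next()?, parts.next()?);
    if parts.next().is_some() || host.is_empty() || owner.is_empty() || name.is_empty() {
        return None;
    }

    let mut path = PathBuf::from(root);
    path.push("repos");
    path.push(host);
    path.push(owner);
    path.push(format!("{name}.git"));
    Some(path)
}
```

Under these assumptions, `https://github.com/user/repo` with a root of `/home/me/.codedb` maps to `/home/me/.codedb/repos/github.com/user/repo.git`, and the same URL with a `.git` suffix maps to the same path.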
```sql
-- Search code on a specific branch
SELECT fr.path, cs.score, cs.snippet
FROM code_search('process_data') cs
JOIN blobs b ON b.id = cs.blob_id
JOIN file_revs fr ON fr.blob_id = b.id
JOIN refs r ON r.commit_id = fr.commit_id
WHERE r.name = 'refs/heads/main'
  AND fr.path GLOB '*.rs'
ORDER BY cs.score DESC;

-- Find commits that changed code matching a pattern
SELECT c.hash, substr(c.message, 1, 80), ds.score
FROM diff_search('deprecated_function') ds
JOIN diffs d ON d.id = ds.diff_id
JOIN commits c ON c.id = d.commit_id
ORDER BY c.timestamp DESC;

-- Most called functions (excluding common builtins)
SELECT sr.ref_name AS function, COUNT(*) AS calls
FROM symbol_refs sr
JOIN blobs b ON b.id = sr.blob_id
JOIN file_revs fr ON fr.blob_id = b.id
JOIN refs r ON r.commit_id = fr.commit_id
WHERE r.name = 'refs/heads/main'
  AND sr.kind = 'call'
GROUP BY sr.ref_name
ORDER BY calls DESC
LIMIT 15;

-- Functions with specific parameter types
SELECT DISTINCT fr.path || ':' || s.line AS location, s.params
FROM symbols s
JOIN blobs b ON b.id = s.blob_id
JOIN file_revs fr ON fr.blob_id = b.id
JOIN refs r ON r.commit_id = fr.commit_id
WHERE s.params LIKE '%SFrame%'
  AND s.kind = 'function'
  AND r.name = 'refs/heads/main';

-- Language breakdown of a repo
SELECT b.language, COUNT(*) AS file_count
FROM blobs b
JOIN file_revs fr ON fr.blob_id = b.id
JOIN refs r ON r.commit_id = fr.commit_id
WHERE r.name = 'refs/heads/main'
GROUP BY b.language
ORDER BY file_count DESC;
```

```rust
use codedb_core::CodeDB;
use std::path::Path;

let mut db = CodeDB::open(Path::new("/path/to/data"))?;

// Index a repository (clones bare, walks history, extracts symbols)
db.index_repo("https://github.com/user/repo")?;

// Re-index later (incremental: only processes new commits)
db.index_repo("https://github.com/user/repo")?;

// Sourcegraph-style search
let results = db.search("lang:rust type:symbol SFrame")?;

// Or query via SQL directly
let mut stmt = db.conn().prepare(
    "SELECT fr.path, cs.score
     FROM code_search('keyword') cs
     JOIN blobs b ON b.id = cs.blob_id
     JOIN file_revs fr ON fr.blob_id = b.id
     LIMIT 10"
)?;
```

Run the included demo script to index SFrameRust and see CodeDB in action:

```bash
./demo.sh
```

It indexes the full repo, then runs a series of queries demonstrating: database stats, language breakdown, full-text code search, symbol search, cross-reference queries (calls/calledby), type-aware queries (returns), diff search, commit search, SQL generation, and incremental re-indexing.

```bash
cargo build --release
```

Requires a working Rust toolchain. All dependencies (SQLite, Tantivy, gix, tree-sitter) are compiled from source, so no system libraries are needed.

BSD-3-Clause