A versioned, provenance-grounded knowledge layer over a codebase — served to humans and AI agents. Not RAG-over-code.
knowbase turns a git repository into a Knowledge Layer: a queryable, git-versioned model of what a codebase means — its architecture, domain entities, API contracts, dependencies, events, and business processes — where every fact is bound to the exact lines of code it came from (file:line@sha).
The one thing that makes it different: it does not embed your code and hope. It extracts durable knowledge and grounds each unit in a real code span. LLMs and embeddings are replaceable adapters around that spine — swap the model, the knowledge and its provenance stay.
flowchart LR
subgraph usual["The usual pipeline — lossy, opaque"]
direction LR
R1[Repository] --> E1["Embeddings<br/>chunks, no provenance,<br/>drifts from HEAD"] --> A1[AI Agent]
end
subgraph kbflow["knowbase — grounded, versioned"]
direction LR
R2[Repository] --> X2["Knowledge Extraction<br/>deterministic + LLM adapters"] --> K2["Knowledge Layer<br/>provenance · method ·<br/>confidence · freshness"] --> A2[AI Agent]
end
- Code is implementation, not knowledge. "What are the API contracts?", "which module owns billing?", "what invalidates this cache?" are not answered by reading one file — the answer is spread across the repo and lives in nobody's head.
- Docs rot. Hand-written architecture docs and diagrams drift from `HEAD` the moment they are merged. There is no mechanical link back to the code, so nothing tells you when they go stale.
- Agents get fragments. RAG-over-code retrieves nearest-neighbor chunks with no provenance and no notion of whether they still reflect the current commit. The model fills the gaps by guessing.
knowbase answers with units of knowledge that are versioned against git and traceable to source — or it does not answer at all.
The core invariant: nothing is stored unless it is bound to ≥ 1 exact code span (file:line@sha). That single rule buys three properties at once:
- Anti-hallucination. An ungrounded artifact is not stored. This is enforced both in-app (a `GroundingError` before any write) and in the database (a deferred `artifact_grounded_check` constraint trigger that fails the transaction at `COMMIT`). An extractor that cannot point at code cannot persist a claim.
- Incremental update. A `git diff` maps changed code to changed spans, which invalidates exactly the derived artifacts whose grounding moved: no over-invalidation, no stale survivors.
- Consumer trust. Every served unit carries its provenance, the method that produced it (deterministic vs. model), confidence, and freshness relative to the commit.
Identity is content-addressed and location-free by construction. A span's span_id is a sha256 over (normalization_version, lang, span_kind, fq_symbol_path, structural_fingerprint) — no file path, no byte offsets. The structural fingerprint is a normalized S-expression of the tree-sitter parse (named nodes only; comments and docstrings dropped; identifiers and literals kept). So reformatting, moving a file, or editing a comment does not change identity; a real rename or a structural edit does. Location is recorded per-SHA in span_occurrence, separate from identity.
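The location-free identity described above can be illustrated with a minimal sketch. It assumes a length-prefixed UTF-8 encoding of the five fields and a placeholder `NORMALIZATION_VERSION` value; neither is specified in this README, and the real `kb.ids` implementation may differ.

```python
import hashlib

NORMALIZATION_VERSION = "1"  # hypothetical value, for illustration only


def span_id(lang: str, span_kind: str, fq_symbol_path: str,
            structural_fingerprint: str,
            normalization_version: str = NORMALIZATION_VERSION) -> str:
    """Hash the five identity fields; no file path or byte offset is involved."""
    fields = (normalization_version, lang, span_kind,
              fq_symbol_path, structural_fingerprint)
    h = hashlib.sha256()
    for field in fields:
        data = field.encode("utf-8")
        # Length-prefix each field so ("ab", "c") and ("a", "bc") cannot collide.
        # This framing is an assumption of the sketch.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


fingerprint = "(function_definition (identifier charge) (parameters (identifier amount)))"
a = span_id("python", "function", "shop.billing.charge", fingerprint)
b = span_id("python", "function", "shop.billing.charge", fingerprint)  # same code, any file or line
c = span_id("python", "function", "shop.billing.refund", fingerprint)  # renamed symbol
print(a == b, a == c)  # True False
```

Because the file path never enters the hash, moving the function to another file leaves `a` unchanged, while the rename produces a different digest.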
Artifacts are content-addressed the same way — over their byte-sorted, de-duplicated grounding spans plus extractor_id/extractor_version (and prompt_version/model_id for model-backed extractors). Re-indexing the same commit reproduces the identical set of artifact ids.
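A rough sketch of how such an artifact id could be derived, and how it can refuse to exist without grounding. The `GroundingError` name comes from this README; its base class, the function signature, and the byte framing are assumptions made for illustration.

```python
import hashlib


class GroundingError(ValueError):
    """Raised when an artifact has no grounding span (base class is assumed)."""


def artifact_id(span_ids: list[str], extractor_id: str, extractor_version: str) -> str:
    if not span_ids:
        raise GroundingError("artifact needs at least one grounding span")
    # De-duplicate, then sort by encoded bytes so input order never matters.
    spans = sorted({s.encode("utf-8") for s in span_ids})
    h = hashlib.sha256()
    for part in [extractor_id.encode("utf-8"), extractor_version.encode("utf-8"), *spans]:
        h.update(len(part).to_bytes(8, "big"))  # framing is an assumption
        h.update(part)
    return h.hexdigest()


same = artifact_id(["b", "a", "a"], "imports", "1") == artifact_id(["a", "b"], "imports", "1")
print(same)  # True: order and duplicates do not affect the id

try:
    artifact_id([], "imports", "1")
except GroundingError as exc:
    print("rejected:", exc)
```

Sorting and de-duplicating before hashing is what makes re-indexing the same commit reproduce the same ids, regardless of the order in which an extractor happens to emit its spans.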
The spine is a handful of content-addressed tables: each artifact carries ≥ 1 derived_from edge to a code_span, spans are located per-SHA in span_occurrence, and a per-SHA snapshot ties the grounded artifacts to a commit.
flowchart TD
AR["artifact<br/>knowledge unit (+ embedding)"]
CS["code_span<br/>content-addressed id · location-free"]
SO["span_occurrence<br/>file:line @ sha"]
SE["snapshot_entry<br/>per-SHA manifest"]
CM["commit_ref / branch_ref"]
AR -->|"derived_from ≥1 (else rejected:<br/>GroundingError + DB trigger)"| CS
CS -->|"located per-SHA"| SO
AR -->|"appears in"| SE
SE -->|"scoped to"| CM
Indexing one commit walks that spine end to end — INGEST → STRUCTURE → INVALIDATE → EXTRACT → SNAPSHOT → SERVE, with kb embed as a separate pass that adds pgvector semantic search on top:
flowchart LR
G["git blobs @ SHA<br/>no checkout"] --> S["tree-sitter spans<br/>content-addressed identity"]
S --> I["invalidate<br/>diff-based"]
I --> X["extractors<br/>deterministic · each grounded ≥1 span"]
X --> N["per-SHA snapshot<br/>manifest of grounded artifacts"]
N --> EV["eval gate"]
EV --> SV["serve<br/>humans + AI agents (MCP)"]
N -. "kb embed (separate pass)" .-> EM["pgvector embeddings<br/>semantic search"]
EM -.-> SV
v0.2 — spine + the first knowledge extractors, MCP serving, and the knowledge-vs-RAG gate. Everything here grounds what it claims, and nothing it cannot:
- Provenance spine: content-addressed `span_id` (LOCKED); tree-sitter spans with a normalized S-expression fingerprint and per-SHA location; a single-Postgres, Alembic-managed store with content-addressed idempotent writes; the ≥ 1 `derived_from` anti-hallucination invariant enforced in-app and by a deferred DB trigger; pygit2 git ingest (no checkout) with a diff-based invalidation seed.
- Deterministic extractors: the import / dependency graph (grimp resolves the edge, tree-sitter grounds it on the exact import statement, with an honest `approximate` fallback for re-exports / relative / unmappable imports, never a silent loss), and the FastAPI API-contract extractor, which grounds a single route across files (handler in `routes.py` + `response_model` class in `schemas.py`).
- `kb introspect`: a sandboxed, network-blocked `app.openapi()` oracle, eval-only and never on the index path, that the API gate scores the static contract against.
- Read-only MCP server: `find_provenance`, `get_knowledge`, and `search_knowledge`, each returning provenance-carrying units (method + confidence + freshness).
- pgvector embeddings + semantic search: a replaceable embedding provider (sentence-transformers by default, OpenAI optional) populated by a separate `kb embed` pass; torch stays out of the index path.
- A frozen RAG-over-source baseline and the Tier-3 knowledge-vs-RAG recall gate: the honest A/B that backs the "knowledge > RAG" thesis.
- Seven HARD CI eval gates (see Development).
Not done yet (and deliberately not faked): the semantic / LLM-grounded extraction layer, the nightly LLM-judged A/B, ADR mining from git history, grounded business-process extraction, incremental re-index on git push, and languages beyond Python. See the Roadmap.
- Python 3.12+
- uv (install)
- PostgreSQL 17: required to run the daemon. For the test suite you do not need a running server: it spins an ephemeral local Postgres cluster via `initdb`/`pg_ctl` (no Docker). You just need the Postgres binaries on the machine (e.g. from Postgres.app or a system package). Point at them with `KB_PG_BINDIR` if they are not on `PATH`, or skip the ephemeral cluster entirely by setting `KB_TEST_DB_URL` to an existing database.
uv sync --extra dev # create the venv + install everything
uv sync --extra dev --extra embed # add the embedding stack (CPU torch) for `kb embed` + semantic search
The base `--extra dev` install stays torch-free; the `embed` extra pulls sentence-transformers (CPU-only torch via a pinned index) and is only needed for `kb embed` and `search_knowledge`.
uv run pytest src/kb/eval -q # the seven HARD gates (spins an ephemeral local Postgres)
uv run kb --help
uv run kb index <repo> --sha <sha> --db-url <postgres-url>
`--sha` accepts any commit-ish (sha, branch, tag, or `HEAD`, the default). The database URL can also come from the `KB_DB_URL` environment variable instead of `--db-url`. A run prints what it produced:
indexed 4f1c2a9b8d3e: 12 files, 318 spans, 27 artifacts, 1 gaps
gaps (unparseable, recorded): src/legacy/broken.py
Under the hood it runs the spine for that one commit — INGEST → STRUCTURE → EXTRACT → SNAPSHOT. For example, an import like from shop.billing import charge on line 1 of shop/orders.py becomes an import_edge artifact (import:shop.orders->shop.billing) grounded on the exact import span at that file:line@sha, with span_mapping: "exact". "Gaps" are files that hit a syntax error: they are recorded, never silently dropped, so blind spots are visible rather than invisible.
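To make the import example concrete, here is a small stand-in that grounds each import on its `file:line@sha`. It deliberately uses the standard-library `ast` module instead of tree-sitter and grimp, so it is not the extractor itself; the function name and record shape are hypothetical.

```python
import ast


def import_edges(module: str, path: str, source: str, sha: str):
    """Return (edges, gaps) for one file; every edge carries a file:line@sha anchor."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return [], [path]  # record the unparseable file instead of dropping it
    edges = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            targets = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            targets = [node.module]
        else:
            continue  # relative imports are out of scope for this sketch
        for target in targets:
            edges.append({
                "key": f"import:{module}->{target}",
                "provenance": f"{path}:{node.lineno}@{sha}",
            })
    return edges, []


edges, gaps = import_edges("shop.orders", "shop/orders.py",
                           "from shop.billing import charge\n", "4f1c2a9b8d3e")
print(edges)
# [{'key': 'import:shop.orders->shop.billing', 'provenance': 'shop/orders.py:1@4f1c2a9b8d3e'}]
```

A file that fails to parse comes back in `gaps` rather than vanishing, which mirrors the "recorded, never silently dropped" behaviour described above.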
uv run kb embed --db-url <postgres-url> # separate pass: populate artifact embeddings
`kb embed` runs a replaceable embedding provider (sentence-transformers `all-MiniLM-L6-v2` by default, OpenAI optional via `KB_EMBED_PROVIDER=openai`) over the latest snapshot's artifacts and writes them into `artifact.embedding` (pgvector). It is idempotent, and torch loads only when this command runs, never on the index path.
uv run kb serve --db-url <postgres-url> # read-only MCP server over stdio
The server exposes three read-only tools, each returning provenance-carrying knowledge units (extraction method + confidence + freshness):
| Tool | Purpose |
|---|---|
| `find_provenance(file, line, sha?)` | What grounded knowledge sits at `file:line@sha`: the spans there and the artifacts derived from them. |
| `get_knowledge(target, sha?, token_budget?)` | Resolve a logical key / file / module to its knowledge units, ranked and trimmed to a token budget (omissions reported, never silently truncated). |
| `search_knowledge(query, sha?, k?, token_budget?)` | Cosine-ranked semantic search over the embedded artifacts (requires `kb embed`). |
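
The "ranked and trimmed to a token budget, omissions reported" behaviour of `get_knowledge` could look roughly like the sketch below. The `Unit` type, the greedy packing strategy, and the `api:POST /api/orders` key are all hypothetical; the README does not say how the server ranks or trims.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    key: str
    score: float
    tokens: int


def trim_to_budget(units: list[Unit], token_budget: int):
    """Greedy pack by descending score; return (kept, omitted_keys)."""
    kept, omitted, used = [], [], 0
    for unit in sorted(units, key=lambda u: u.score, reverse=True):
        if used + unit.tokens <= token_budget:
            kept.append(unit)
            used += unit.tokens
        else:
            omitted.append(unit.key)  # reported, not silently dropped
    return kept, omitted


units = [
    Unit("api:GET /api/orders", 0.9, 120),
    Unit("import:shop.orders->shop.billing", 0.7, 40),
    Unit("api:POST /api/orders", 0.8, 100),
]
kept, omitted = trim_to_budget(units, token_budget=170)
print([u.key for u in kept])  # ['api:GET /api/orders', 'import:shop.orders->shop.billing']
print(omitted)                # ['api:POST /api/orders']
```

Returning the omitted keys alongside the kept units lets a caller see what did not fit and decide whether to ask again with a larger budget.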
uv run kb introspect app.main:app --repo <path> # sandboxed app.openapi() oracle
`kb introspect` runs a FastAPI app in a network-blocked sandbox and emits its `openapi()` as JSON, the ground truth that the Tier-1 API gate scores the static contract extractor against. It executes user code, so it is eval-only and never runs during indexing.
A Python package kb (uv, src-layout). Modules and their responsibilities:
| Module | Responsibility |
|---|---|
| `kb.ids` | Content-addressed identity hashing (LOCKED). `span_id` excludes file path and byte offsets; `artifact_id` refuses to be computed without ≥ 1 grounding span. |
| `kb.structural` | tree-sitter span extraction; the structural fingerprint is a normalized S-expression (named nodes only; comments and docstrings dropped; identifiers/literals kept). Location is recorded per-SHA. |
| `kb.store` | A single PostgreSQL via Alembic; content-addressed idempotent writes; the ≥ 1 `derived_from` invariant enforced in-app and by a deferred constraint trigger. |
| `kb.git` | pygit2 ingest (reads blobs at a SHA, no checkout) plus the diff-based invalidation seed. |
| `kb.extract.deterministic.imports` | Deterministic import / dependency edges: tree-sitter spans grounded by line, grimp edge resolution. |
| `kb.extract.deterministic.fastapi_contract` | Static FastAPI API-contract extractor; grounds a route across files (handler + `response_model` class), never imports user code. |
| `kb.introspect` | Sandboxed, network-blocked `app.openapi()` oracle: eval-only ground truth for the API gate, never on the index path. |
| `kb.mcp` | Read-only MCP server and its provenance-carrying records: `find_provenance`, `get_knowledge`, `search_knowledge`. |
| `kb.embed` | Replaceable embedding adapters (sentence-transformers default, OpenAI optional) + snapshot population. Torch isolated behind the `embed` extra and a lazy import. |
| `kb.rag` | The frozen pgvector RAG-over-source baseline: the "other arm" of the knowledge-vs-RAG A/B (no provenance, no grounding). |
| `kb.daemon.cli` | The `kb` CLI: `index`, `embed`, `serve` (MCP), and `introspect`, all functional. |
| `kb.eval` | Seven HARD CI gates (identity reproducibility, adversarial grounding, Tier-1 import oracle, Tier-1 API oracle, Tier-3 knowledge-vs-RAG recall, Tier-4 one-hop invalidation, invariants) plus the supporting MCP / embed / store suite. |
Core tables: commit_ref, branch_ref, code_span, span_occurrence, artifact (now with embedding vector(384) + embedding_model_id), artifact_derived_from, snapshot_entry, and rag_chunk (the baseline arm).
uv sync --extra dev # venv + install
uv run ruff check src/kb # lint
uv run mypy # strict type-check
uv run pytest src/kb/eval -q # the seven HARD eval gates
CI (GitHub Actions, workflow "CI", `.github/workflows/ci.yml`) runs ruff, `mypy --strict`, and the eval gates against a `pgvector/pgvector:pg17` service (with the embedding model cached). The seven HARD gates that block a merge:
- Identity reproducibility: formatting / comment / docstring / location changes must NOT change `span_id`; a rename MUST. Pure identity core, no database.
- Adversarial grounding: an ungrounded artifact is rejected by both layers (the app's `GroundingError` and the DB's deferred `artifact_grounded_check` trigger); a genuinely grounded artifact commits cleanly.
- Tier-1 import oracle: extracted import edges match a hand-labeled oracle, grounded on the actual import statement span; a dynamic import is asserted as a known gap, not a silent loss.
- Tier-1 API oracle: the statically-extracted FastAPI contract equals the app's own `openapi()` (from the sandboxed introspect oracle), and the route's cross-file grounding (handler + `response_model`) is asserted.
- Tier-3 knowledge-vs-RAG recall: knowbase cross-file recall@k == 1.0 for every contract question (a structural floor: one artifact already spans both files, so it holds regardless of embedding quality); the RAG arm is reported but never asserted, so a model bump can't redden CI.
- Tier-4 one-hop invalidation: a content diff invalidates exactly the artifacts whose grounding span changed (set-equality: no over-invalidation, no stale survivors); a version bump invalidates everything.
- Invariants: zero orphans (every snapshot artifact is grounded), and re-indexing the same SHA yields the identical set of artifact ids.
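The set-equality check behind the Tier-4 gate can be sketched as follows. The mapping shape, the span names, and the `api:GET /api/users` key are hypothetical; the real gate runs against the database rather than an in-memory dict.

```python
def invalidated(derived_from: dict[str, set[str]], changed_spans: set[str]) -> set[str]:
    """Artifacts with at least one grounding span in the changed set."""
    return {art for art, spans in derived_from.items() if spans & changed_spans}


derived_from = {
    "import:shop.orders->shop.billing": {"span_a"},
    "api:GET /api/orders": {"span_b", "span_c"},
    "api:GET /api/users": {"span_d"},
}
hit = invalidated(derived_from, {"span_c"})
# Set equality: nothing extra is invalidated, and nothing stale survives.
assert hit == {"api:GET /api/orders"}
```

Asserting equality rather than a subset relation is what catches both failure modes at once: an over-eager invalidation adds members, a missed one removes them.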
The identity rules in kb.ids (and kb.structural) are LOCKED: changing one is a breaking change, gated behind a NORMALIZATION_VERSION / extractor_version bump so existing digests are invalidated rather than silently colliding.
The honest north star is to show that a grounded knowledge layer beats RAG-over-code on real questions. The reason it can is structural: to answer a cross-file contract question, RAG must independently retrieve two source chunks across files, while one grounded knowbase artifact already spans both — which is exactly what the Tier-3 gate measures.
flowchart LR
Q["Q: shape of GET /api/orders response?"]
Q --> A["api:GET /api/orders<br/>(one grounded artifact)"]
Q --> RAG
A --> H["routes.py — handler span ✓"]
A --> M["schemas.py — response_model span ✓"]
subgraph RAG["RAG-over-source — independent top-k chunks"]
direction TB
C1["chunk · routes.py ✓"]
C2["chunk · schemas.py — only if retrieved"]
C3["chunk · other.py ✗"]
end
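One plausible reading of cross-file recall@k, shown as a sketch: each retrieved item covers a set of files, and recall is the share of required files covered by the top k. The exact metric the Tier-3 gate uses is not spelled out here, so the function and the sample data are assumptions.

```python
def recall_at_k(retrieved_files: list[set[str]], needed: set[str], k: int) -> float:
    """Fraction of needed files covered by the top-k retrieved items."""
    covered = set().union(*retrieved_files[:k]) & needed
    return len(covered) / len(needed)


needed = {"routes.py", "schemas.py"}
knowbase = [{"routes.py", "schemas.py"}]            # one artifact spans both files
rag = [{"routes.py"}, {"other.py"}, {"schemas.py"}]  # independent chunks

print(recall_at_k(knowbase, needed, 1))  # 1.0
print(recall_at_k(rag, needed, 2))       # 0.5
print(recall_at_k(rag, needed, 3))       # 1.0
```

In this toy data the grounded artifact reaches full recall at k = 1 by construction, while the chunk-based arm depends on both chunks ranking high enough.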
Next milestones:
- Nightly LLM-judged A/B (key-gated, non-gating): grounded-answer accuracy + hallucination rate on top of recall.
- LLM-grounded semantic layer: model-backed artifacts that still carry ≥ 1 span (`extraction_method = "llm_grounded"`).
- Incremental re-index on git push: turn the diff-based invalidation seed into live updates.
- ADR mining from git / PR history.
- Grounded business-process extraction.
- More languages beyond Python.
Contributions are welcome. See CONTRIBUTING.md for setup, the eval-gate discipline, and pull-request expectations. A useful rule of thumb: if an extractor cannot ground a claim on a code span, it does not get to make the claim.
Please report vulnerabilities responsibly — see SECURITY.md. Do not open public issues for security reports.