walterfan/lazy-code-kg

Lazy Rabbit Coder

Lazy Rabbit Coder is a Go command-line tool for building and querying a code knowledge base from source repositories.

The project is intended for AI-agent workflows: it scans a repository with tree-sitter and ripgrep, extracts code entities and relationships, enriches them with LLM-generated summaries and embeddings, stores the result in SQLite with sqlite-vec, and exposes query capabilities through both a CLI and an MCP server running in command-line mode.

What It Builds

Lazy Rabbit Coder turns source code into a local knowledge base and graph:

  • Parse source repositories into packages, files, functions, structs, interfaces, variables, and constants.
  • Build relationship edges such as CALLS, IMPLEMENTS, IMPORTS, CONTAINS, DEPENDS_ON, RETURNS, and ACCEPTS.
  • Chunk code for retrieval and attach vector embeddings from an embedding API.
  • Store metadata, entities, chunks, vectors, and graph edges in local SQLite.
  • Query the knowledge base with hybrid search: full text, vector similarity, and graph traversal.
  • Run as an MCP server so AI agents can inspect and query code through standard MCP tools.

The current repository is an early scaffold. It already defines the core model types and parser interface; CLI commands, tree-sitter parsing, storage, retrieval, enrichment, and MCP serving are tracked as planned work through OpenSpec.

Repository Layout

.
├── api/                 # Future API contracts and MCP schemas
├── cmd/
│   ├── codekg/          # Planned CLI entry point for indexing and querying
│   └── server/          # Planned MCP/server entry point
├── config/              # Future config examples for LLM, embeddings, and storage
├── deploy/              # Future packaging and deployment helpers
├── docs/                # Project documentation
├── internal/
│   ├── enricher/        # Planned LLM summaries and embedding generation
│   ├── generator/       # Planned graph/doc generation helpers
│   ├── model/           # Code entity, relation, chunk, repository, and query models
│   ├── parser/          # Parser interfaces and tree-sitter implementation boundary
│   ├── retriever/       # Planned hybrid retrieval logic
│   ├── store/           # Planned SQLite/sqlite-vec persistence
│   └── syncer/          # Planned repository scan and incremental sync logic
├── openspec/            # Spec-driven change proposals, designs, and tasks
└── scripts/             # Project helper scripts

Core Concepts

Code Entity

A code entity is a structured element extracted from a repository, such as a package, file, function, struct, interface, variable, or constant. Entities include source location, signature, doc string, body, summary, metadata, and optional embedding vectors.
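
As a rough illustration of the fields listed above, an entity could be modeled as follows. This is a hypothetical sketch: the type name, field names, and the Location helper are assumptions for illustration and are not taken from the internal/model package.

```go
package main

import "fmt"

// EntityKind names the kinds of code element listed above.
type EntityKind string

const (
	KindPackage   EntityKind = "package"
	KindFile      EntityKind = "file"
	KindFunction  EntityKind = "function"
	KindStruct    EntityKind = "struct"
	KindInterface EntityKind = "interface"
	KindVariable  EntityKind = "variable"
	KindConstant  EntityKind = "constant"
)

// CodeEntity is a hypothetical shape, not the project's actual model type.
type CodeEntity struct {
	Kind          EntityKind
	QualifiedName string
	FilePath      string
	StartLine     int
	EndLine       int
	Signature     string
	DocString     string
	Body          string
	Summary       string
	Metadata      map[string]string
	Embedding     []float32 // nil until an embedding is attached
}

// Location formats the source span as path:start-end.
func (e CodeEntity) Location() string {
	return fmt.Sprintf("%s:%d-%d", e.FilePath, e.StartLine, e.EndLine)
}

func main() {
	e := CodeEntity{
		Kind:          KindFunction,
		QualifiedName: "auth.Authenticate",
		FilePath:      "internal/auth/auth.go",
		StartLine:     10,
		EndLine:       42,
	}
	fmt.Println(e.Kind, e.QualifiedName, e.Location())
}
```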

Code Relation

A code relation is a typed edge between entities. Relations form the knowledge graph used for dependency analysis, impact analysis, call tracing, and context expansion during retrieval.
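
Context expansion over typed edges can be pictured as a bounded breadth-first walk. The sketch below is an assumption about how such a walk might look; the Relation type and the Expand function are hypothetical and only follow outgoing edges.

```go
package main

import "fmt"

// Relation is a hypothetical typed edge between two entity names.
type Relation struct {
	From, To string
	Type     string // e.g. CALLS, IMPORTS, CONTAINS
}

// Expand returns every entity reachable from seed by following at most
// maxHops outgoing edges, in breadth-first order, excluding seed itself.
func Expand(rels []Relation, seed string, maxHops int) []string {
	out := map[string][]string{}
	for _, r := range rels {
		out[r.From] = append(out[r.From], r.To)
	}
	seen := map[string]bool{seed: true}
	frontier := []string{seed}
	var reached []string
	for hop := 0; hop < maxHops && len(frontier) > 0; hop++ {
		var next []string
		for _, id := range frontier {
			for _, to := range out[id] {
				if !seen[to] {
					seen[to] = true
					reached = append(reached, to)
					next = append(next, to)
				}
			}
		}
		frontier = next
	}
	return reached
}

func main() {
	rels := []Relation{
		{"api.Login", "auth.Authenticate", "CALLS"},
		{"auth.Authenticate", "store.FindUser", "CALLS"},
		{"store.FindUser", "db.Query", "CALLS"},
	}
	// Two hops from api.Login stop before db.Query.
	fmt.Println(Expand(rels, "api.Login", 2)) // prints [auth.Authenticate store.FindUser]
}
```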

Code Chunk

A code chunk is a smaller retrieval unit derived from an entity. Chunks are embedded separately so natural-language queries can retrieve precise code snippets instead of only coarse file-level matches.
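
One simple way to derive chunks from an entity body is to split it into fixed-size line windows. The sketch below assumes that strategy purely for illustration; the Chunk type, the ChunkBody function, and the line-window approach are hypothetical, and the project may chunk differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Chunk is a hypothetical retrieval unit derived from one entity.
type Chunk struct {
	EntityName string
	StartLine  int // 1-based line offset within the entity body
	Text       string
}

// ChunkBody splits body into consecutive chunks of at most maxLines lines.
// A maxLines value below 1 is treated as 1.
func ChunkBody(entityName, body string, maxLines int) []Chunk {
	if maxLines < 1 {
		maxLines = 1
	}
	lines := strings.Split(body, "\n")
	var chunks []Chunk
	for start := 0; start < len(lines); start += maxLines {
		end := start + maxLines
		if end > len(lines) {
			end = len(lines)
		}
		chunks = append(chunks, Chunk{
			EntityName: entityName,
			StartLine:  start + 1,
			Text:       strings.Join(lines[start:end], "\n"),
		})
	}
	return chunks
}

func main() {
	body := "func Authenticate() {\n\tcheck()\n\tlog()\n\treturn\n}"
	for _, c := range ChunkBody("auth.Authenticate", body, 2) {
		fmt.Printf("%s@%d: %q\n", c.EntityName, c.StartLine, c.Text)
	}
}
```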

Hybrid Query

Query results combine vector search, graph expansion, and full-text search. The model layer already includes options for top-k retrieval, hop count, relevance threshold, and enabling or disabling each retrieval source.
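
To make those options concrete, here is a minimal sketch of an options struct and how a threshold and top-k limit might be applied to scored hits. All names here (QueryOptions, Hit, Apply) are assumptions and are not copied from the model layer.

```go
package main

import (
	"fmt"
	"sort"
)

// QueryOptions is a hypothetical set of hybrid-query knobs.
type QueryOptions struct {
	TopK        int     // maximum number of results; 0 means no limit
	MaxHops     int     // graph expansion depth
	MinScore    float64 // relevance threshold
	UseVector   bool
	UseGraph    bool
	UseFullText bool
}

// Hit is a hypothetical scored result.
type Hit struct {
	Name  string
	Score float64
}

// Apply drops hits below MinScore, sorts the rest by descending score,
// and truncates the list to TopK when TopK is positive.
func (o QueryOptions) Apply(hits []Hit) []Hit {
	kept := make([]Hit, 0, len(hits))
	for _, h := range hits {
		if h.Score >= o.MinScore {
			kept = append(kept, h)
		}
	}
	sort.SliceStable(kept, func(i, j int) bool { return kept[i].Score > kept[j].Score })
	if o.TopK > 0 && len(kept) > o.TopK {
		kept = kept[:o.TopK]
	}
	return kept
}

func main() {
	opts := QueryOptions{TopK: 2, MaxHops: 1, MinScore: 0.5, UseVector: true, UseGraph: true, UseFullText: true}
	hits := []Hit{{"a", 0.9}, {"b", 0.2}, {"c", 0.7}, {"d", 0.6}}
	fmt.Println(opts.Apply(hits)) // prints [{a 0.9} {c 0.7}]
}
```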

CLI

The CLI binary is codekg. The init, update, list, query, agent, embed, enrich, resolve, graph, and parsers commands are wired end-to-end. The remaining commands are registered with help text but return an actionable "not yet implemented; tracked by " error, so the full command surface is discoverable from day one.

# Build the binary.
go build -buildvcs=false -o codekg ./cmd/codekg

# Install the CLI to ~/.local/bin/codekg by default.
# Override with --bin-dir DIR or CODEKG_INSTALL_DIR=DIR.
./scripts/install.sh
./scripts/install.sh --bin-dir /usr/local/bin
CODEKG_INSTALL_DIR="$HOME/bin" ./scripts/install.sh

# List the language parsers compiled into this binary.
./codekg parsers
./codekg parsers --json

# Initialize a code knowledge base (SQLite) for a repository.
# DB defaults to $XDG_CACHE_HOME/codekg/<repo-hash>.db; --db overrides.
./codekg init /path/to/repo
./codekg init                     # defaults to current directory

# Re-running is incremental: only files whose content_sha256 changed
# are re-parsed; deleted files (and their entities/relations/chunks) are
# cascade-cleaned.
./codekg update /path/to/repo
./codekg update                   # defaults to current directory

# Resolve staged CALLS/EXTENDS/IMPLEMENTS edges after all entities are indexed.
# `init/update --resolve` run both phases; `resolve` re-runs only the resolver.
./codekg init /path/to/repo --resolve
./codekg update /path/to/repo --resolve
./codekg resolve /path/to/repo

# Inspect indexed entities and their source paths.
./codekg list
./codekg list --repo-path /path/to/repo
./codekg list --kind function,class --json
./codekg list --limit 50

# Query uses GraphRAG-style retrieval by default: keyword search, vector search
# when embeddings/config are available, and graph expansion from the seed hits,
# merged with Reciprocal Rank Fusion. When --db and database.path are unset,
# codekg derives the default DB from the current working directory, so
# `cd /path/to/repo && ./codekg query ...` uses the same DB as
# `./codekg init .`. Use --repo-path when querying from outside the repo.
./codekg query "Authenticate" --db /path/to/your.db
./codekg query "Authenticate"
./codekg query --repo-path . "Authenticate"
./codekg query "session token" --json --limit 5
./codekg query "what's the entry point" --verbose

# Answer natural-language questions with retrieved GraphRAG context and a
# configured LLM provider. Unlike `query`, `agent` sends selected retrieved
# signatures/docs/summaries/snippets to the provider, so review --show-context
# when validating what leaves the machine.
./codekg agent "what's the entry point"
./codekg agent "what's the entry point" --show-context
./codekg agent "what's the flow of graph query" --verbose # print LLM request/response
./codekg agent "what's the entry point" --json

# Vector embeddings (codekg-vector-embeddings) — pure-Go sqlite-vec via
# `modernc.org/sqlite/vec`, no CGO, no system install. Configure with a
# `.env` in the working directory (auto-loaded) or real env vars:
#
#   EMBEDDING_URL=https://your-gateway/v1
#   EMBEDDING_MODEL=text-embedding-3-small
#   EMBEDDING_API_KEY=sk-...
#
# Real env vars always win over .env. Use `--env-file PATH` to load a
# different file. `codekg embed [path]` walks chunks already in the DB
# and stores per-(provider, model) embeddings. When [path] is omitted,
# codekg derives the default DB from the current working directory.
# Re-runs are no-ops for unchanged content.
./codekg embed --dry-run                # from repo root, no HTTP call
./codekg embed . --dry-run              # report what would be sent, no HTTP call
./codekg embed .                        # generate and store embeddings
./codekg embed . --limit 1000           # embed first 1000 pending chunks only
./codekg embed --db /path/to/kg.db      # explicit DB instead of repo path
./codekg --env-file ./prod.env embed .  # use a non-default .env path

# Hybrid query controls. Vector and graph are attempted by default; use flags
# when you need to force/disable a source.
./codekg query "where is auth enforced"
./codekg query --repo-path . "where is auth enforced"
./codekg query "session token" --limit 10 --json
./codekg query --repo-path . "Authenticate" --no-fulltext --vector   # vector-only, fail if unavailable
./codekg query "Authenticate" --no-vector --max-hops 0               # keyword-only

# Graph retrieval: inspect a subgraph directly, or tune query graph expansion.
./codekg graph --entity auth.Authenticate --hops 2 --rel CALLS
./codekg graph --entity auth.Authenticate --format mermaid
./codekg query "Authenticate" --max-hops 1
./codekg query "where is auth enforced" --max-hops 2

# LLM enrichment is opt-in. `index` and `query` never make LLM calls; `enrich`
# and `agent` send selected entity signatures/docs/bodies/snippets to the
# configured provider. Use provider=openai-compatible for HTTP APIs or
# provider=codex-cli to run `codex exec` locally without LLM_API_KEY.
# For openai-compatible, configure LLM_BASE_URL, LLM_MODEL, and LLM_API_KEY
# (or enrichment.llm.*).
./codekg enrich --repo . --dry-run --kind function --limit 20
./codekg enrich --repo . --kind function --limit 20
./codekg enrich --repo . --kind function --force

# Codex CLI provider example. This avoids LLM_BASE_URL / LLM_API_KEY and uses
# the local `codex exec` command instead.
CODEKG_ENRICHMENT_LLM_PROVIDER=codex-cli ./codekg agent "what's the entry point"
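
The query comments above mention merging keyword, vector, and graph results with Reciprocal Rank Fusion. The following is a minimal sketch of the standard RRF formula, score(d) = sum over rankings of 1 / (k + rank), and not the project's own code. The FuseRRF name, the constant k = 60 (a commonly used default), and the alphabetical tie-break are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// FuseRRF merges several ranked lists of IDs with Reciprocal Rank Fusion.
// Each list contributes 1/(k+rank) per ID, with rank starting at 1.
// Ties are broken alphabetically so the output is deterministic.
func FuseRRF(rankings [][]string, k float64) []string {
	scores := map[string]float64{}
	for _, ranking := range rankings {
		for i, id := range ranking {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	keyword := []string{"A", "B", "C"}
	vector := []string{"B", "D", "A"}
	graph := []string{"B"}
	fmt.Println(FuseRRF([][]string{keyword, vector, graph}, 60)) // prints [B A D C]
}
```

An entity that appears near the top of several lists (B here) outranks one that tops a single list, which is why RRF suits combining sources whose raw scores are not comparable.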

Supported Languages

codekg-multi-language-parsers ships tree-sitter parsers for:

Language     Extensions                        Relations
Go           .go                               CONTAINS, IMPORTS
Java         .java                             CONTAINS, IMPORTS, EXTENDS, IMPLEMENTS
Python       .py                               CONTAINS, IMPORTS, EXTENDS
TypeScript   .ts                               CONTAINS, IMPORTS, EXTENDS, IMPLEMENTS
TSX          .tsx                              CONTAINS, IMPORTS, EXTENDS, IMPLEMENTS
JavaScript   .js, .mjs, .cjs                   CONTAINS, IMPORTS, EXTENDS, IMPLEMENTS
JSX          .jsx                              CONTAINS, IMPORTS, EXTENDS, IMPLEMENTS
C++          .cpp, .cc, .cxx, .h, .hh, .hpp    CONTAINS, IMPORTS, EXTENDS

Run codekg parsers to see the live matrix for the binary you built.

To build slimmer binaries that omit specific languages, use build tags:

# Drop Java + C++ grammars from the binary.
go build -buildvcs=false -tags "codekg_no_java codekg_no_cpp" -o codekg ./cmd/codekg

Available tags: codekg_no_java, codekg_no_python, codekg_no_tsjs, codekg_no_cpp. Files with no registered parser are still discovered, but counted as unsupported=N in the index summary and skipped.
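
The extension matrix and the unsupported=N counter can be pictured as a lookup table keyed by file extension. This is a hypothetical sketch: the parsersByExt map, the Classify function, and case-sensitive matching are assumptions, and the real registry is driven by build tags rather than a fixed map.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// parsersByExt mirrors the extension table above (hypothetical registry).
var parsersByExt = map[string]string{
	".go": "Go", ".java": "Java", ".py": "Python",
	".ts": "TypeScript", ".tsx": "TSX",
	".js": "JavaScript", ".mjs": "JavaScript", ".cjs": "JavaScript",
	".jsx": "JSX",
	".cpp": "C++", ".cc": "C++", ".cxx": "C++",
	".h": "C++", ".hh": "C++", ".hpp": "C++",
}

// Classify counts discovered files per language and tallies the files
// whose extension has no registered parser.
func Classify(paths []string) (byLang map[string]int, unsupported int) {
	byLang = map[string]int{}
	for _, p := range paths {
		lang, ok := parsersByExt[filepath.Ext(p)]
		if !ok {
			unsupported++
			continue
		}
		byLang[lang]++
	}
	return byLang, unsupported
}

func main() {
	byLang, unsupported := Classify([]string{
		"cmd/codekg/main.go", "web/app.tsx", "README.md", "Makefile",
	})
	fmt.Printf("go=%d tsx=%d unsupported=%d\n", byLang["Go"], byLang["TSX"], unsupported)
	// prints go=1 tsx=1 unsupported=2
}
```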

Deferred commands (registered with help text, runtime delivered later):

Command                          Tracked by
codekg repo list show
codekg mcp (stdio MCP server)    codekg-mcp-server

The MCP server lands in codekg-mcp-server.

Vector embeddings (codekg embed + hybrid query --vector) ship in codekg-vector-embeddings and use modernc.org/sqlite/vec for KNN — no CGO and no system dependency.

Graph retrieval (codekg graph, codekg query --graph) ships in codekg-graph-retrieval. The resolver is heuristic: it first tries exact qualified-name matches, then qualified-name suffix matches, then repo-wide bare-name matches. Ambiguous name matches produce multiple weighted edges. Rows that cannot be resolved remain in relations_unresolved so you can inspect/debug unresolved targets and rerun codekg resolve <repo> later.
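
The three-tier matching described above can be sketched as follows. This is an illustrative assumption, not the resolver's actual code: the Edge type, the Resolve and bareName functions, the "." separator, and the equal split of weight across ambiguous candidates are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Edge is a hypothetical resolved edge with a confidence weight.
type Edge struct {
	Target string
	Weight float64
}

// bareName returns the part of a qualified name after the last dot.
func bareName(q string) string {
	if i := strings.LastIndex(q, "."); i >= 0 {
		return q[i+1:]
	}
	return q
}

// Resolve tries exact, then suffix, then bare-name matching against the
// known qualified names. The first tier with any match wins; several
// matches share the weight equally. A nil result means unresolved.
func Resolve(target string, known []string) []Edge {
	tiers := []func(name string) bool{
		func(name string) bool { return name == target },
		func(name string) bool { return strings.HasSuffix(name, "."+target) },
		func(name string) bool { return bareName(name) == bareName(target) },
	}
	for _, match := range tiers {
		var hits []string
		for _, name := range known {
			if match(name) {
				hits = append(hits, name)
			}
		}
		if len(hits) > 0 {
			edges := make([]Edge, len(hits))
			for i, h := range hits {
				edges[i] = Edge{Target: h, Weight: 1 / float64(len(hits))}
			}
			return edges
		}
	}
	return nil
}

func main() {
	known := []string{"svc.auth.Authenticate", "legacy.auth.Authenticate", "store.FindUser"}
	fmt.Println(Resolve("store.FindUser", known))    // exact match, one edge
	fmt.Println(Resolve("auth.Authenticate", known)) // suffix match, two weighted edges
	fmt.Println(Resolve("repo.FindUser", known))     // bare-name match
	fmt.Println(Resolve("Missing", known))           // unresolved
}
```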

LLM enrichment (codekg enrich) ships in codekg-llm-enrichment. Prompt templates can use {{.Language}}, {{.Kind}}, {{.QualifiedName}}, {{.Signature}}, {{.DocString}}, {{.Body}}, {{.Summary}}, {{.FilePath}}, {{.StartLine}}, and {{.EndLine}}. Re-runs are idempotent by sha256(system + "\x00" + rendered_user)[:16] plus content hash; changing the prompt or passing --force re-summarizes selected entities. enrichment.privacy.deny_paths is applied in addition to parser excludes, and unresolved or denied entities are counted in the enrich_summary output.
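
The prompt rendering and idempotency key described above might look like the sketch below, which uses Go's text/template and crypto/sha256. The PromptData type and both function names are hypothetical, and reading [:16] as the first 16 hex characters of the digest is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"text/template"
)

// PromptData holds a subset of the template variables listed above.
type PromptData struct {
	Language, Kind, QualifiedName, Signature string
}

// RenderUser renders the user prompt template with the entity's data.
func RenderUser(tmpl string, data PromptData) (string, error) {
	t, err := template.New("user").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	if err := t.Execute(&sb, data); err != nil {
		return "", err
	}
	return sb.String(), nil
}

// PromptKey hashes the system prompt and rendered user prompt, separated
// by a NUL byte, and keeps the first 16 hex characters (assumed reading).
func PromptKey(system, renderedUser string) string {
	sum := sha256.Sum256([]byte(system + "\x00" + renderedUser))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	user, err := RenderUser(
		"Summarize the {{.Language}} {{.Kind}} {{.QualifiedName}}: {{.Signature}}",
		PromptData{"Go", "function", "auth.Authenticate", "func Authenticate(token string) error"},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(user)
	fmt.Println(PromptKey("You summarize code.", user))
}
```

Because the key depends on both prompts, editing either template changes the key and so triggers a fresh summary, while an unchanged prompt and unchanged content can be skipped.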

Agent answers (codekg agent) ship in codekg-agent-answer. The command uses the same GraphRAG retrieval defaults as query, builds a bounded context from the top retrieved entities, and calls the configured LLM provider to produce a concise cited answer. Use --show-context or --json to inspect the context and citations used for the answer. Use --verbose to print the exact prompt sent to the LLM provider and the returned response. If context is missing, it reports that it cannot answer instead of guessing.

Codex CLI LLM Provider

Use codex-cli when the OpenAI-compatible LLM API is unavailable or when you prefer to route codekg agent through the local Codex CLI.

enrichment:
  llm:
    provider: codex-cli
    command: codex              # optional; defaults to codex
    model: gpt-5.1-codex        # optional; passed to codex exec -m

CODEKG_ENRICHMENT_LLM_PROVIDER=codex-cli ./codekg agent "what's the entry point"
./codekg --config ./config/codekg.yaml agent "what's the flow of graph query" --verbose

codex-cli invokes codex exec --skip-git-repo-check --color never -, passes the generated answer prompt on stdin, and captures stdout as the answer. It does not require LLM_BASE_URL or LLM_API_KEY. --verbose prints the exact prompt sent to Codex and the returned response.

A sample config lives at config/codekg.example.yaml.

Development

Prerequisites:

  • Go 1.25.8 or compatible.
  • Optional: ripgrep (rg) for faster file discovery. When rg is missing, codekg falls back to a pure-Go walker that respects .gitignore.
  • Vector storage is provided by modernc.org/sqlite/vec, a pure-Go port of the sqlite-vec extension. No CGO, no system install, and no separate SQLite binary required.
  • An OpenAI-compatible embedding endpoint and API key are required for codekg embed (e.g. OpenAI itself or a corporate LLM gateway). Configure via EMBEDDING_URL, EMBEDDING_MODEL, and EMBEDDING_API_KEY (or the matching enrichment.embedding.* keys in config.yaml).
  • LLM API credentials are required for codekg enrich and codekg agent when using openai-compatible; codex-cli uses the local Codex CLI instead.
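
The embedding settings above can come from a .env file or from real environment variables, with real variables taking precedence. A minimal sketch of that precedence rule follows. ParseDotEnv and Merge are hypothetical helpers, and the parser handles only plain KEY=VALUE lines and # comments, which is an assumption about the file format.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// ParseDotEnv reads KEY=VALUE lines, skipping blank lines and # comments.
func ParseDotEnv(content string) map[string]string {
	vars := map[string]string{}
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		vars[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	return vars
}

// Merge returns the effective value for each key in fileVars, preferring
// a real environment value (via lookup) over the .env value.
func Merge(fileVars map[string]string, lookup func(string) (string, bool)) map[string]string {
	merged := map[string]string{}
	for k, v := range fileVars {
		if real, ok := lookup(k); ok {
			merged[k] = real
		} else {
			merged[k] = v
		}
	}
	return merged
}

func main() {
	dotenv := "# embedding settings\nEMBEDDING_URL=https://your-gateway/v1\nEMBEDDING_MODEL=text-embedding-3-small\n"
	effective := Merge(ParseDotEnv(dotenv), os.LookupEnv)
	fmt.Println(effective["EMBEDDING_URL"], effective["EMBEDDING_MODEL"])
}
```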

Useful commands (the root Makefile wraps these — make help lists every target):

# Format Go code.
make fmt              # or: go fmt ./...

# Run all Go tests (storage, parser, discovery, indexer, retriever, CLI).
make test             # or: go test ./...

# With race detector (use whenever syncer/store/retriever change).
make test-race        # or: go test -race ./...

# Build the codekg CLI to ./codekg.
make build            # or: go build -buildvcs=false -o codekg ./cmd/codekg

# Install the codekg CLI to ~/.local/bin/codekg.
./scripts/install.sh  # or: ./scripts/install.sh --bin-dir /usr/local/bin

# Build every package.
make build-all        # or: go build -buildvcs=false ./...

# Slimmer binary without a grammar (see Build tags above).
make build TAGS="codekg_no_java codekg_no_cpp"

# Run a single test.
make test RUN=TestEmbedDryRun

OpenSpec Workflow

This repository uses OpenSpec for proposal-driven changes. New behavior should be described under openspec/changes/<change-id>/ before implementation when it affects CLI behavior, persistence schema, MCP contracts, parsing semantics, or retrieval behavior.

Common flow:

openspec new change "<change-id>"
openspec status --change "<change-id>"
openspec instructions proposal --change "<change-id>" --json
openspec status --change "<change-id>"

Relationship to Async Code Mate

Lazy Rabbit Coder borrows the AI-assisted code understanding direction from async-code-mate, especially the CodeKG and MCP ideas, but this repository is focused on a smaller command-line-first tool:

  • Local repo indexing instead of a full web application.
  • SQLite/sqlite-vec as the default local knowledge store.
  • CLI and stdio MCP workflows as first-class interfaces.
  • Project knowledge base and graph generation for AI agents.

Status

The repository is in the design and scaffold stage. The model package defines the initial data contracts, and the parser package defines the parsing boundary. See openspec/changes/ for active proposals and implementation tasks.
