A powerful CLI tool for semantic code search and duplicate detection using vector embeddings and an LLM.
- Semantic Code Search: Natural language queries to find relevant code
- Duplicate Detection: Find logically similar code across your codebase
- Multi-language Support: Go, Python, TypeScript, JavaScript
- MCP Integration: Model Context Protocol server for LLM integration
- Vector Database: Uses Qdrant for efficient similarity search
- Go AST Metadata: `internal/parser/go_parser.go` now captures package names, imports, signatures, doc comments, and callees for every function/method. The indexer (`internal/indexer/indexer.go`) injects this metadata into both embeddings and Qdrant payloads so hybrid queries can combine semantic similarity with structured filters.
This project already uses Go's built-in AST packages to extract function and method definitions for indexing. We plan to extend this further to get closer to tools like claude-context that use AST-based splitting for richer semantic understanding.
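As a rough illustration of the kind of extraction described above, the following sketch uses only the standard `go/parser`, `go/ast`, and `go/types` packages. The `FuncMeta` type and the `ExtractFuncMeta` function are hypothetical names chosen for this example; the actual code in `internal/parser/go_parser.go` may be organised quite differently.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"go/types"
	"strconv"
	"strings"
)

// FuncMeta is a hypothetical record of AST-derived metadata for one function.
type FuncMeta struct {
	Package   string
	Imports   []string
	Name      string
	Receiver  string // empty for plain functions
	Signature string
	Doc       string
	Callees   []string
}

// ExtractFuncMeta parses Go source and returns one FuncMeta per function or method.
func ExtractFuncMeta(filename, src string) ([]FuncMeta, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}

	// Import paths are stored as quoted string literals in the AST.
	var imports []string
	for _, imp := range file.Imports {
		if path, err := strconv.Unquote(imp.Path.Value); err == nil {
			imports = append(imports, path)
		}
	}

	var out []FuncMeta
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok {
			continue
		}
		meta := FuncMeta{
			Package:   file.Name.Name,
			Imports:   imports,
			Name:      fn.Name.Name,
			Signature: types.ExprString(fn.Type),
			Doc:       strings.TrimSpace(fn.Doc.Text()), // Text is safe on a nil comment group
		}
		if fn.Recv != nil && len(fn.Recv.List) > 0 {
			meta.Receiver = types.ExprString(fn.Recv.List[0].Type)
		}
		if fn.Body != nil { // declarations without a body have nothing to walk
			seen := map[string]bool{}
			ast.Inspect(fn.Body, func(n ast.Node) bool {
				if call, ok := n.(*ast.CallExpr); ok {
					name := types.ExprString(call.Fun)
					if !seen[name] {
						seen[name] = true
						meta.Callees = append(meta.Callees, name)
					}
				}
				return true
			})
		}
		out = append(out, meta)
	}
	return out, nil
}

func main() {
	src := "package demo\n\nimport \"fmt\"\n\n// Greet prints a greeting.\nfunc Greet(name string) {\n\tfmt.Println(\"hello\", name)\n}\n"
	metas, err := ExtractFuncMeta("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, m := range metas {
		fmt.Printf("%s.%s imports=%v callees=%v doc=%q\n", m.Package, m.Name, m.Imports, m.Callees, m.Doc)
	}
}
```

Fields like these could be concatenated into the text sent to the embedding model and also stored as payload values, which is what makes the structured filters possible.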
Planned directions:
- Richer AST-Derived Metadata
  - For Go code (`internal/parser/go_parser.go`), extract and embed additional structural information:
    - Package name and imports (e.g. `go/ast`, `go/parser`, `go/token`, `go/types`).
    - Function signatures (parameter and return types).
    - Method receiver types.
    - Doc comments and key callees inside each function.
  - Include these fields in the text we send to the embedding model so that queries like "where do we construct Go ASTs" can match based on imports and API usage, not just raw code text.
- AST-Based Code Chunking
  - Move beyond "one function = one chunk" by using AST structure to define more semantic chunks:
    - Top-level declarations (functions, methods, types, etc.).
    - File-level summary chunks that describe the purpose of a file, its imports, and exported symbols.
    - Optional sub-chunking of very large functions by control-flow blocks.
  - This mirrors the `AstCodeSplitter` approach used in `claude-context`, improving recall for module-level queries.
- Structure-Aware Query Planning and Filtering
  - Extend the query planning step (LLM-powered `QueryPlan`) to:
    - Recognise structured signals in user queries (e.g. mentions of `go/ast`, `go/parser`, specific APIs).
    - Generate sub-queries or filters that target files/functions with matching imports, languages, or node types.
  - Store additional AST-derived fields (like imports or symbol names) in the vector payload, and use them in filter logic before ranking by semantic similarity.
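
The AST-based chunking direction could look roughly like the sketch below: one chunk per top-level declaration plus a file-level summary chunk. The `Chunk` type, the `ChunkGoFile` function, and the kind labels are assumptions made for this example and are not taken from the project or from `claude-context`. Sub-chunking of large functions is left out.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
	"strings"
)

// Chunk is a hypothetical unit of text to embed.
type Chunk struct {
	Kind string // "file_summary", "func", "method", "type", "var", or "const"
	Name string
	Text string
}

// ChunkGoFile returns a file-level summary chunk followed by one chunk
// per top-level declaration (import declarations are skipped).
func ChunkGoFile(filename, src string) ([]Chunk, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.ParseComments)
	if err != nil {
		return nil, err
	}

	// text returns the source of a node (doc comments are not included).
	text := func(n ast.Node) string {
		return src[fset.Position(n.Pos()).Offset:fset.Position(n.End()).Offset]
	}

	var imports, exported []string
	for _, imp := range file.Imports {
		if p, err := strconv.Unquote(imp.Path.Value); err == nil {
			imports = append(imports, p)
		}
	}

	var decls []Chunk
	for _, decl := range file.Decls {
		switch d := decl.(type) {
		case *ast.FuncDecl:
			kind := "func"
			if d.Recv != nil {
				kind = "method"
			}
			if d.Name.IsExported() {
				exported = append(exported, d.Name.Name)
			}
			decls = append(decls, Chunk{Kind: kind, Name: d.Name.Name, Text: text(d)})
		case *ast.GenDecl:
			if d.Tok == token.IMPORT {
				continue
			}
			// Collect every name declared by this type/var/const declaration.
			var names []string
			for _, spec := range d.Specs {
				switch s := spec.(type) {
				case *ast.TypeSpec:
					names = append(names, s.Name.Name)
				case *ast.ValueSpec:
					for _, id := range s.Names {
						names = append(names, id.Name)
					}
				}
			}
			for _, n := range names {
				if ast.IsExported(n) {
					exported = append(exported, n)
				}
			}
			decls = append(decls, Chunk{Kind: d.Tok.String(), Name: strings.Join(names, ","), Text: text(d)})
		}
	}

	summary := fmt.Sprintf("file: %s\npackage: %s\ndoc: %s\nimports: %s\nexported: %s",
		filename, file.Name.Name, strings.TrimSpace(file.Doc.Text()),
		strings.Join(imports, ", "), strings.Join(exported, ", "))
	return append([]Chunk{{Kind: "file_summary", Name: filename, Text: summary}}, decls...), nil
}

func main() {
	src := "// Package demo is a tiny example.\npackage demo\n\nimport \"fmt\"\n\ntype Greeter struct{}\n\nfunc (g Greeter) Hello() { fmt.Println(\"hi\") }\n"
	chunks, err := ChunkGoFile("demo.go", src)
	if err != nil {
		panic(err)
	}
	for _, c := range chunks {
		fmt.Printf("[%s] %s\n%s\n\n", c.Kind, c.Name, c.Text)
	}
}
```

The summary chunk gives module-level queries something to match even when no single function body mentions the file's overall purpose.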
These changes aim to combine AST structure with vector search so that the system can answer more precise questions about where particular APIs, patterns, or language features are used in a codebase.
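
To make the "filter first, then rank by similarity" idea concrete, here is a small in-memory sketch. It is only a stand-in: in the real system the filter would presumably be evaluated by Qdrant against stored payloads, and the `Point`, `Filter`, and `Search` names below are hypothetical and not part of the project's API.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// Point is a hypothetical stand-in for a stored vector plus its payload.
type Point struct {
	ID       string
	Vector   []float64
	Language string
	Imports  []string
}

// Filter holds optional structured constraints; empty fields match anything.
type Filter struct {
	Language string
	Import   string
}

// Scored pairs a point ID with its similarity to the query.
type Scored struct {
	ID    string
	Score float64
}

// cosine returns the cosine similarity, or 0 for mismatched or zero vectors.
func cosine(a, b []float64) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// matches reports whether a point's payload satisfies the filter.
func matches(p Point, f Filter) bool {
	if f.Language != "" && p.Language != f.Language {
		return false
	}
	if f.Import == "" {
		return true
	}
	for _, imp := range p.Imports {
		if imp == f.Import {
			return true
		}
	}
	return false
}

// Search drops points that fail the filter, then ranks the rest by similarity.
// A limit of 0 or less means "no limit".
func Search(points []Point, query []float64, f Filter, limit int) []Scored {
	var hits []Scored
	for _, p := range points {
		if matches(p, f) {
			hits = append(hits, Scored{ID: p.ID, Score: cosine(p.Vector, query)})
		}
	}
	sort.SliceStable(hits, func(i, j int) bool { return hits[i].Score > hits[j].Score })
	if limit > 0 && len(hits) > limit {
		hits = hits[:limit]
	}
	return hits
}

func main() {
	points := []Point{
		{ID: "parser.Parse", Vector: []float64{1, 0}, Language: "go", Imports: []string{"go/ast", "go/parser"}},
		{ID: "cli.Run", Vector: []float64{0.9, 0.1}, Language: "go", Imports: []string{"fmt"}},
	}
	// Only points importing go/ast are considered, regardless of raw similarity.
	for _, h := range Search(points, []float64{1, 0}, Filter{Language: "go", Import: "go/ast"}, 5) {
		fmt.Printf("%s %.3f\n", h.ID, h.Score)
	}
}
```

Filtering before ranking means a query that mentions `go/ast` cannot be answered by a function that merely looks similar in embedding space but never imports that package.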
go build -o codebase main.go

Use the PowerShell script under `scripts/install-codebase.ps1` to download the latest release from GitHub, install it to a system directory, and configure environment variables.
- Open PowerShell as Administrator (required to write system environment variables).
- From the repository root, run:
  powershell -ExecutionPolicy Bypass -File .\scripts\install-codebase.ps1
  By default this installs into `C:\mcp\codebase`. You can override the install directory:
  powershell -ExecutionPolicy Bypass -File .\scripts\install-codebase.ps1 -InstallDir 'C:\mcp\codebase'
What the script does:
- Fetches the latest `sxueck/codebase` release from GitHub and picks the Windows asset matching your architecture.
- Downloads and extracts/copies it into the target install directory.
- Sets `CODEBASE_HOME` (system-level) to the install directory.
- Appends the install directory to the system `PATH` if it is not already present.
- Ensures a config directory and file at `~/.codebase/config.json`:
  - Creates `%USERPROFILE%\.codebase` if needed.
  - Creates an empty `config.json` if it does not exist yet (you can edit this file later as needed).
- Go 1.22+
- Qdrant (running on localhost:6334 or configured via QDRANT_URL)
- OpenAI API Key
Set environment variables:
export OPENAI_API_KEY=your_key_here
export OPENAI_BASE_URL=https://api.openai.com/v1 # optional custom endpoint
export OPENAI_EMBEDDING_MODEL=text-embedding-3-large # optional embedding model
export OPENAI_LLM_MODEL=gpt-4-turbo-preview # optional chat model
export QDRANT_URL=localhost:6334
export QDRANT_API_KEY=your_qdrant_password # optional auth secret

Index a project:
codebase index --dir ./path/to/project

Start the MCP server:
codebase mcp

Configure in Claude Desktop:
{
  "mcpServers": {
    "codebase-cli": {
      "command": "codebase",
      "args": ["mcp"]
    }
  }
}

Run a query:
codebase query --q "Find code with highly duplicated logic"

License: MIT