# CodeRAG

Lightning-fast hybrid code search for AI assistants


Zero dependencies • <50ms search • Hybrid TF-IDF + Vector • MCP ready

Quick Start • Features • MCP Setup • API


## Why CodeRAG?

Traditional code search tools are either slow (full-text grep), inaccurate (keyword matching), or complex (requiring external services).

CodeRAG is different:

❌ Old way: Docker + ChromaDB + Ollama + 30 second startup
βœ… CodeRAG: npx @sylphx/coderag-mcp (instant)
Feature grep/ripgrep Cloud RAG CodeRAG
Semantic understanding ❌ βœ… βœ…
Zero external deps βœ… ❌ βœ…
Offline support βœ… ❌ βœ…
Startup time Instant 10-30s <1s
Search latency ~100ms ~500ms <50ms

## ✨ Features

### Search

- 🔍 Hybrid Search - TF-IDF + optional vector embeddings
- 🧠 StarCoder2 Tokenizer - Code-aware tokenization (4.7MB, trained on code)
- 📊 Smoothed IDF - No term gets ignored, stable ranking (sketched below)
- ⚡ <50ms Latency - Instant results even on large codebases
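
A minimal sketch of what smoothed IDF weighting means (illustrative only, not CodeRAG's exact formula): adding a constant inside the logarithm keeps very common terms from collapsing to zero weight, so no term is ignored and rankings stay stable.

```typescript
// Illustrative smoothed IDF, not CodeRAG's internal implementation.
// N = total number of documents, df = number of documents containing the term.
function classicIdf(N: number, df: number): number {
  return Math.log(N / df)
}

function smoothedIdf(N: number, df: number): number {
  // The +1 keeps the weight finite and positive even when df === N.
  return Math.log(1 + N / (df + 1))
}

// A term that appears in every file still contributes a small, stable weight.
console.log(classicIdf(1000, 1000))  // 0   -> term effectively ignored
console.log(smoothedIdf(1000, 1000)) // ~0.69 -> term still counts a little
```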

### Indexing

- 🚀 1000-2000 files/sec - Fast initial indexing
- 💾 SQLite Persistence - Instant startup (<100ms) with cached index
- ⚡ Incremental Updates - Smart diff detection, no full rebuilds (see the sketch after this list)
- 👁️ File Watching - Real-time index updates on file changes
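
As a rough illustration of what "smart diff detection" means (a sketch, not the library's actual implementation; `changedFiles` and `lastSeen` are hypothetical names), an incremental indexer can hash file contents and only re-tokenize files whose hash changed:

```typescript
import { createHash } from 'node:crypto'
import { readFileSync } from 'node:fs'

// Hypothetical change tracker: file path -> content hash from the previous run.
const lastSeen = new Map<string, string>()

function hashFile(path: string): string {
  return createHash('sha256').update(readFileSync(path)).digest('hex')
}

// Returns only the files that actually changed, so the index can be
// updated in place instead of being rebuilt from scratch.
function changedFiles(paths: string[]): string[] {
  return paths.filter((path) => {
    const digest = hashFile(path)
    if (lastSeen.get(path) === digest) return false
    lastSeen.set(path, digest)
    return true
  })
}
```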

### Integration

- 📦 MCP Server - Works with Claude Desktop, Cursor, VS Code, Windsurf
- 🧠 Vector Search - Optional OpenAI embeddings for semantic search
- 🌳 AST Chunking - Smart code splitting using Synth parsers (15+ languages)
- 💻 Low Memory Mode - SQL-based search for resource-constrained environments

## 🚀 Quick Start

### Option 1: MCP Server (Recommended for AI Assistants)

```bash
npx @sylphx/coderag-mcp --root=/path/to/project
```

Or add to your MCP config:

```json
{
  "mcpServers": {
    "coderag": {
      "command": "npx",
      "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"]
    }
  }
}
```

See MCP Server Setup for Claude Desktop, Cursor, VS Code, etc.

### Option 2: As a Library

```bash
npm install @sylphx/coderag
# or
bun add @sylphx/coderag
```

```typescript
import { CodebaseIndexer, PersistentStorage } from '@sylphx/coderag'

// Create indexer with persistent storage
const storage = new PersistentStorage({ codebaseRoot: './my-project' })
const indexer = new CodebaseIndexer({
  codebaseRoot: './my-project',
  storage,
})

// Index codebase (instant on subsequent runs)
await indexer.index({ watch: true })

// Search
const results = await indexer.search('authentication logic', { limit: 10 })
console.log(results)
// [{ path: 'src/auth/login.ts', score: 0.87, matchedTerms: ['authentication', 'logic'], snippet: '...' }]
```

## 📦 Packages

| Package | Description | Install |
|---|---|---|
| `@sylphx/coderag` | Core search library | `npm i @sylphx/coderag` |
| `@sylphx/coderag-mcp` | MCP server for AI assistants | `npx @sylphx/coderag-mcp` |

## 🔌 MCP Server Setup

### Claude Desktop

Add to `claude_desktop_config.json`:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

```json
{
  "mcpServers": {
    "coderag": {
      "command": "npx",
      "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"]
    }
  }
}
```

### Cursor

Add to `~/.cursor/mcp.json` (macOS) or `%USERPROFILE%\.cursor\mcp.json` (Windows):

```json
{
  "mcpServers": {
    "coderag": {
      "command": "npx",
      "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"]
    }
  }
}
```

### VS Code

Add to VS Code settings (JSON) or `.vscode/mcp.json`:

```json
{
  "mcp": {
    "servers": {
      "coderag": {
        "command": "npx",
        "args": ["-y", "@sylphx/coderag-mcp", "--root=${workspaceFolder}"]
      }
    }
  }
}
```

### Windsurf

Add to `~/.codeium/windsurf/mcp_config.json`:

```json
{
  "mcpServers": {
    "coderag": {
      "command": "npx",
      "args": ["-y", "@sylphx/coderag-mcp", "--root=/path/to/project"]
    }
  }
}
```

### Claude Code

```bash
claude mcp add coderag -- npx -y @sylphx/coderag-mcp --root=/path/to/project
```

## 🛠️ MCP Tool: `codebase_search`

Search project source files with hybrid TF-IDF + vector ranking.

### Parameters

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `query` | string | Yes | - | Search query |
| `limit` | number | No | 10 | Max results |
| `include_content` | boolean | No | true | Include code snippets |
| `file_extensions` | string[] | No | - | Filter by extension (e.g., `[".ts", ".tsx"]`) |
| `path_filter` | string | No | - | Filter by path pattern |
| `exclude_paths` | string[] | No | - | Exclude paths (e.g., `["node_modules", "dist"]`) |

### Example

```json
{
  "query": "user authentication login",
  "limit": 5,
  "file_extensions": [".ts", ".tsx"],
  "exclude_paths": ["node_modules", "dist", "test"]
}
```

### Response Format

LLM-optimized output (minimal tokens, maximum content):

````markdown
# Search: "user authentication login" (3 results)

## src/auth/login.ts:15-28
```typescript
15: export async function authenticate(credentials) {
16:   const user = await findUser(credentials.email)
17:   return validatePassword(user, credentials.password)
18: }
```

## src/middleware/auth.ts:42-55 [md→typescript]
```typescript
42: // Embedded code from markdown docs
43: const authMiddleware = (req, res, next) => {
```

## src/utils/large.ts:1-200 [truncated]
```typescript
1: // First 70% shown...

... [800 chars truncated] ...

195: // Last 20% shown
```
````
---

## 📚 API Reference

### `CodebaseIndexer`

Main class for indexing and searching.

```typescript
import { CodebaseIndexer, PersistentStorage } from '@sylphx/coderag'

const storage = new PersistentStorage({ codebaseRoot: './project' })
const indexer = new CodebaseIndexer({
  codebaseRoot: './project',
  storage,
  maxFileSize: 1024 * 1024, // 1MB default
})

// Index with file watching
await indexer.index({ watch: true })

// Search with options
const results = await indexer.search('query', {
  limit: 10,
  includeContent: true,
  fileExtensions: ['.ts', '.js'],
  excludePaths: ['node_modules'],
})

// Stop watching
await indexer.stopWatch()
```

### `PersistentStorage`

SQLite-backed storage for instant startup.

```typescript
import { PersistentStorage } from '@sylphx/coderag'

const storage = new PersistentStorage({
  codebaseRoot: './project',  // Creates .coderag/ folder
  dbPath: './custom.db',      // Optional custom path
})
```

### Low-Level TF-IDF Functions

```typescript
import { buildSearchIndex, searchDocuments, initializeTokenizer } from '@sylphx/coderag'

// Initialize StarCoder2 tokenizer (4.7MB, one-time download)
await initializeTokenizer()

// Build index
const documents = [
  { uri: 'file://auth.ts', content: 'export function authenticate...' },
  { uri: 'file://user.ts', content: 'export class User...' },
]
const index = await buildSearchIndex(documents)

// Search
const results = await searchDocuments('authenticate user', index, { limit: 5 })
```

### Vector Search (Optional)

For semantic search with embeddings:

```typescript
import { hybridSearch, createEmbeddingProvider } from '@sylphx/coderag'

// Requires OPENAI_API_KEY environment variable
const results = await hybridSearch('authentication flow', indexer, {
  vectorWeight: 0.7,  // 70% vector, 30% TF-IDF
  limit: 10,
})
```

## ⚙️ Configuration

### MCP Server Options

| Option | Default | Description |
|---|---|---|
| `--root=<path>` | Current directory | Codebase root path |
| `--max-size=<bytes>` | `1048576` (1MB) | Max file size to index |
| `--no-auto-index` | `false` | Disable auto-indexing on startup |

### Environment Variables

| Variable | Description |
|---|---|
| `OPENAI_API_KEY` | Enable vector search with OpenAI embeddings |
| `OPENAI_BASE_URL` | Custom OpenAI-compatible endpoint |
| `EMBEDDING_MODEL` | Embedding model (default: `text-embedding-3-small`) |
| `EMBEDDING_DIMENSIONS` | Custom embedding dimensions |
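
For example, a caller can fall back to plain TF-IDF search when no API key is configured. This is a sketch: `searchWithOptionalEmbeddings` is a hypothetical helper, while `hybridSearch` and `indexer.search` are used with the signatures shown in the examples above.

```typescript
import { hybridSearch, CodebaseIndexer } from '@sylphx/coderag'

// Sketch: prefer hybrid (vector + TF-IDF) search when OPENAI_API_KEY is set,
// otherwise fall back to TF-IDF-only search on the same indexer.
async function searchWithOptionalEmbeddings(indexer: CodebaseIndexer, query: string) {
  if (process.env.OPENAI_API_KEY) {
    return hybridSearch(query, indexer, { vectorWeight: 0.7, limit: 10 })
  }
  return indexer.search(query, { limit: 10 })
}
```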

## 📊 Performance

| Metric | Value |
|---|---|
| Initial indexing | ~1000-2000 files/sec |
| Startup with cache | <100ms |
| Search latency | <50ms |
| Memory per 1000 files | ~1-2 MB |
| Tokenizer size | 4.7MB (StarCoder2) |

### Benchmarks

Tested on MacBook Pro M1, 16GB RAM:

| Codebase | Files | Index Time | Search Time |
|---|---|---|---|
| Small | 100 | 0.5s | <10ms |
| Medium | 1,000 | 2s | <30ms |
| Large | 10,000 | 15s | <50ms |

## 🏗️ Architecture

```
coderag/
├── packages/
│   ├── core/                     # @sylphx/coderag
│   │   ├── src/
│   │   │   ├── indexer.ts            # Main indexer with file watching
│   │   │   ├── tfidf.ts              # TF-IDF with StarCoder2 tokenizer
│   │   │   ├── code-tokenizer.ts     # StarCoder2 tokenization
│   │   │   ├── hybrid-search.ts      # Vector + TF-IDF fusion
│   │   │   ├── incremental-tfidf.ts  # Smart incremental updates
│   │   │   ├── storage-persistent.ts # SQLite storage
│   │   │   ├── vector-storage.ts     # LanceDB vector storage
│   │   │   ├── embeddings.ts         # OpenAI embeddings
│   │   │   ├── ast-chunking.ts       # Synth AST chunking
│   │   │   └── language-config.ts    # Language registry (15+ languages)
│   │   └── package.json
│   │
│   └── mcp-server/               # @sylphx/coderag-mcp
│       ├── src/
│       │   └── index.ts              # MCP server
│       └── package.json
```

### How It Works

  1. Indexing: Scans codebase, tokenizes with StarCoder2, builds TF-IDF index
  2. AST Chunking: Splits code at semantic boundaries (functions, classes, etc.)
  3. Storage: Persists to SQLite (.coderag/ folder) for instant startup
  4. Watching: Detects file changes, performs incremental updates
  5. Search: Hybrid TF-IDF + optional vector search with score fusion (sketched below)
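
A minimal sketch of weighted score fusion (illustrative, not CodeRAG's exact implementation; `fuseScores` is a hypothetical name), assuming both score lists are already normalized to 0-1 and weighted as in the `vectorWeight: 0.7` example above (70% vector, 30% TF-IDF):

```typescript
// Illustrative score fusion: combine normalized TF-IDF and vector scores
// with a configurable weight, then rank by the fused score.
function fuseScores(
  tfidf: Map<string, number>,   // path -> normalized TF-IDF score (0-1)
  vector: Map<string, number>,  // path -> normalized cosine similarity (0-1)
  vectorWeight = 0.7,
): Array<{ path: string; score: number }> {
  const paths = new Set([...tfidf.keys(), ...vector.keys()])
  return [...paths]
    .map((path) => ({
      path,
      score:
        vectorWeight * (vector.get(path) ?? 0) +
        (1 - vectorWeight) * (tfidf.get(path) ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
}
```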

### Supported Languages

AST-based chunking with semantic boundary detection:

| Category | Languages |
|---|---|
| JavaScript | JavaScript, TypeScript, JSX, TSX |
| Systems | Python, Go, Java, C |
| Markup | Markdown, HTML, XML |
| Data/Config | JSON, YAML, TOML, INI |
| Other | Protobuf |

Embedded Code Support: Automatically parses code blocks in Markdown and `<script>`/`<style>` tags in HTML.


## 🔧 Development

```bash
# Clone
git clone https://github.com/SylphxAI/coderag.git
cd coderag

# Install
bun install

# Build
bun run build

# Test
bun run test

# Lint & Format
bun run lint
bun run format
```

🀝 Contributing

Contributions are welcome! Please:

  1. Open an issue to discuss changes
  2. Fork and create a feature branch
  3. Run `bun run lint` and `bun run test`
  4. Submit a pull request

## 📄 License

MIT © Sylphx


Powered by Sylphx

Built with @sylphx/synth • @sylphx/mcp-server-sdk • @sylphx/doctor • @sylphx/bump
