Grep your documents with context. Fast offline search for PDFs, DOCX, Markdown, and code — no vectors, no cloud, no ML dependencies. Trigram indexing + SimHash fingerprinting built in Rust.
```
contextgrep index ./docs/
contextgrep search "purchase agreement"        # grep with context
contextgrep query 'type:contract amount:>1M'   # structured DSL
contextgrep similar ./contract_draft.docx      # find near-duplicates
contextgrep recent --since 7d
contextgrep clusters ./docs/
```
1. Install the MCP server (one-time)
```
claude mcp add contextgrep npx contextgrep@latest
```

macOS Homebrew users: use `$(which npx)` instead of `npx`.
2. Index your documents (one-time per folder)
Just tell your AI assistant:
"Index my documents at /Users/john/Documents/contracts"
The assistant calls the index tool automatically. The index is saved to .searchindex/ inside that folder — subsequent runs are incremental and only process new or changed files.
3. Start searching
"Find all contracts mentioning indemnification" "Which documents have a purchase price over $1M?" "Show me files modified in the last 7 days" "Find documents similar to this NDA"
The assistant picks the right tool (search, query, recent, similar) based on your question.
| | This tool | Vector search | BM25 |
|---|---|---|---|
| Works offline | yes | no (needs model) | yes |
| Deterministic | yes | no | yes |
| Handles typos/partials | yes | sometimes | no |
| Cost | zero | $$$ (inference) | zero |
| Explainable results | yes | no | partially |
| Finds near-duplicates | yes | yes | no |
The goal: something that feels as fast as grep, understands document structure, and needs zero infrastructure.
Search runs in 3 stages:
```
Query
  ↓
Trigram index      — fast fuzzy/substring candidate retrieval
  ↓
Structural filter  — hard constraints (type:, path:, amount:, date:)
  ↓
Scoring            — trigram overlap + term proximity + recency + structure + title boost
```
Trigram index — splits text into overlapping 3-character windows and builds posting lists. Handles typos, partial matches, and substring queries without needing exact word boundaries.
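The mechanism can be sketched in a few lines of Python (an illustration of the idea, not the actual Rust implementation):

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    """Overlapping 3-character windows over lowercased text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Posting lists: trigram -> set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tri in trigrams(text):
            index[tri].add(doc_id)
    return index

def candidates(index: dict[str, set[str]], query: str) -> dict[str, float]:
    """Rank docs by the fraction of query trigrams they contain."""
    q = trigrams(query)
    hits = defaultdict(int)
    for tri in q:
        for doc_id in index.get(tri, ()):
            hits[doc_id] += 1
    return {doc_id: n / len(q) for doc_id, n in hits.items()}
```

A typo like "purchse" still shares most of its trigrams ("pur", "urc", "rch") with "purchase", so the right document surfaces without an exact match.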
SimHash — computes a 64-bit document fingerprint from word bigram shingles. Two documents with fewer than ~8 differing bits are near-duplicates. Used by contextgrep similar and contextgrep clusters.
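A minimal sketch of the scheme, using blake2b as a stand-in for whatever 64-bit hash the real implementation uses:

```python
import hashlib

def shingles(text: str) -> list[str]:
    """Word-bigram shingles: 'a b c' -> ['a b', 'b c']."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def h64(s: str) -> int:
    """Stable 64-bit hash (stand-in for the real hash function)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")

def simhash(text: str) -> int:
    """Each shingle votes +1/-1 on every bit; the sign of each vote sets the bit."""
    votes = [0] * 64
    for sh in shingles(text):
        h = h64(sh)
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    fp = 0
    for bit in range(64):
        if votes[bit] > 0:
            fp |= 1 << bit
    return fp

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Near-identical documents flip only a few votes, so their fingerprints land a few bits apart; unrelated documents land around 32 bits apart.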
Structural metadata — regex-based extraction of dates, currency amounts, email addresses, and document type inference. Used by the DSL filter layer.
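A rough illustration of this kind of extraction (the patterns below are simplified stand-ins, not the actual regexes):

```python
import re

AMOUNT = re.compile(r"\$\s?([\d,]+(?:\.\d+)?)\s*([KMB])?", re.I)
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
MULT = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

def extract_amounts(text: str) -> list[float]:
    """All currency amounts, with K/M/B suffixes expanded."""
    out = []
    for num, suffix in AMOUNT.findall(text):
        value = float(num.replace(",", ""))
        if suffix:
            value *= MULT[suffix.upper()]
        out.append(value)
    return out

def extract_dates(text: str) -> list[str]:
    return DATE.findall(text)

def extract_emails(text: str) -> list[str]:
    return EMAIL.findall(text)
```

Extracted values like these are what the `amount:`, `date:`, and `email:` filters match against.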
Scoring formula:
```
score = 0.45 × trigram_overlap
      + 0.20 × term_proximity
      + 0.10 × recency_decay
      + 0.20 × structural_field_match
      + 0.05 × title_boost
```
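As a sketch, assuming each component score is normalized to [0, 1], the weights sum to 1.0 so the final score stays in [0, 1]:

```python
# Weights from the formula above; component names are taken from the docs.
WEIGHTS = {
    "trigram_overlap": 0.45,
    "term_proximity": 0.20,
    "recency_decay": 0.10,
    "structural_field_match": 0.20,
    "title_boost": 0.05,
}

def score(components: dict[str, float]) -> float:
    """Weighted sum of per-document component scores; missing components count as 0."""
    return sum(w * components.get(name, 0.0) for name, w in WEIGHTS.items())
```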
No Rust required. Add this to your MCP config:
```json
{
  "mcpServers": {
    "contextgrep": {
      "command": "npx",
      "args": ["contextgrep@latest"]
    }
  }
}
```

Or via the Claude Code CLI:

```
claude mcp add contextgrep npx contextgrep@latest
```

macOS (Homebrew Node.js): if the server fails to connect, use the full path to `npx`:

```
claude mcp add contextgrep $(which npx) contextgrep@latest
```
With a custom index location:
```json
{
  "mcpServers": {
    "contextgrep": {
      "command": "npx",
      "args": ["contextgrep@latest", "--index", "/path/to/your/index"]
    }
  }
}
```

Config file locations:
| Client | Config file |
|---|---|
| Claude Code | ~/.claude/settings.json |
| Claude Desktop (macOS) | ~/Library/Application Support/Claude/claude_desktop_config.json |
| Cursor | ~/.cursor/mcp.json or .cursor/mcp.json in project |
| VS Code | .vscode/mcp.json |
| Zed | ~/.config/zed/settings.json → context_servers |
Download a pre-built binary from GitHub Releases, put it in your PATH, then:
```
contextgrep index ./docs/
contextgrep search "purchase agreement"
```

Build from source (requires Rust 1.75+):

```
git clone https://github.com/ranjanj1/contextgrep
cd contextgrep
cargo install --path .
```

| Type | Extensions |
|---|---|
| Plain text | .txt, .text |
| Markdown | .md, .markdown |
| PDF | .pdf |
| Word documents | .docx |
| Config/data | .toml, .yaml, .yml, .json |
| Code | .rs, .py, .js, .ts, .go, .java, .c, .cpp, .rb, .swift, .kt, .sh |
Respects .gitignore, .ignore, and .searchignore files during walks.
Index a folder. Subsequent runs are incremental — only new or changed files are re-indexed.
```
contextgrep index ./docs/
contextgrep index ./docs/ --full   # force full re-index
```

Trigram-based fuzzy search. Handles partial words and minor typos.
contextgrep search "purchase agreement"
contextgrep search "purchse agreem" # typo-tolerant
contextgrep search "2024 contract"
contextgrep search "indemnif" # prefix match
contextgrep search "agreement" -n 5 # top 5 resultsSnippet control:
--context-size N (default 120) extracts N characters on both sides of the match, so the hit sits in the middle of the snippet (~2×N chars total).
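The windowing behaves roughly like this sketch (the function name and signature are illustrative, not the tool's internals):

```python
def snippet(text: str, match_start: int, match_end: int, context_size: int = 120) -> str:
    """Take context_size chars on each side of the match (~2*N chars total)."""
    lo = max(0, match_start - context_size)
    hi = min(len(text), match_end + context_size)
    prefix = "..." if lo > 0 else ""
    suffix = "..." if hi < len(text) else ""
    return prefix + text[lo:hi] + suffix
```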
contextgrep search "indemnif" --context-size 500 # ~1000 chars around each match
contextgrep search "indemnif" --full-content # return entire file text instead of a snippetRAG usage — pipe full content as JSON into your LLM:
contextgrep search "purchase price" --full-content --output json -n 3[
{
"path": "./docs/acquisition.pdf",
"score": 0.72,
"type": "contract",
"snippet": "...the purchase price shall be...",
"content": "Agreement for Services\n\nThis agreement..."
}
]import subprocess, json
def retrieve(question: str, top_k: int = 3) -> list[dict]:
result = subprocess.run(
["contextgrep", "search", question, "--full-content", "--output", "json", "-n", str(top_k)],
capture_output=True, text=True,
)
return json.loads(result.stdout)Structured search using the filter DSL.
```
contextgrep query 'type:contract'
contextgrep query 'amount:>1M'
contextgrep query 'type:contract amount:>500K path:/legal'
contextgrep query '"non-disclosure" AND date:>2024-01-01'
contextgrep query 'type:invoice OR type:receipt'
contextgrep query 'NOT type:draft'
```

DSL reference:
| Syntax | Meaning |
|---|---|
| type:contract | Document type equals (fuzzy) |
| amount:>1M | Any extracted amount > 1,000,000 |
| amount:<=50000 | Any extracted amount ≤ 50,000 |
| date:>2024-01-01 | Any extracted date after Jan 1, 2024 |
| date:>2024 | Any extracted date after 2024 |
| path:/legal | File path contains /legal |
| email:@acme.com | Email matching @acme.com found in doc |
| "exact phrase" | Phrase must appear as written |
| AND, OR, NOT | Boolean operators |
| (...) | Grouping |
Find documents similar to a given file using SimHash Hamming distance.
```
contextgrep similar ./contract_v1.docx
contextgrep similar ./contract_v1.docx --threshold 12   # looser matching
contextgrep similar ./contract_v1.docx -n 20            # top 20 results
```

Similarity score is 1 - (hamming_distance / 64). A score of 1.0 means identical content; 0.875 means 8 bits differ.
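The score is a direct mapping from Hamming distance between fingerprints, as a one-line sketch:

```python
def similarity(fp_a: int, fp_b: int) -> float:
    """1 - hamming_distance/64 over two 64-bit SimHash fingerprints."""
    distance = bin(fp_a ^ fp_b).count("1")
    return 1.0 - distance / 64
```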
Show recently modified documents, sorted newest first.
```
contextgrep recent --since 7d           # last 7 days
contextgrep recent --since 2w           # last 2 weeks
contextgrep recent --since 3m           # last 3 months
contextgrep recent --since 1y           # last year
contextgrep recent --since 2024-06-01   # since a specific date
```

Group documents into similarity clusters using Locality-Sensitive Hashing on SimHash fingerprints.
```
contextgrep clusters ./docs/
contextgrep clusters ./docs/ --bits 4   # coarse clustering (fewer, larger groups)
contextgrep clusters ./docs/ --bits 8   # fine-grained clustering
```

All commands support --output plain|json|tsv:
```
# Default: human-readable
contextgrep search "agreement"

# JSON: newline-delimited, good for scripting
contextgrep search "agreement" --output json | jq '.[0].path'

# TSV: tab-separated path, score, snippet
contextgrep search "agreement" --output tsv | cut -f1
```

The index is stored in .searchindex/ and resolved with this precedence:
1. --index <path> CLI flag
2. .searchindex/ in the current directory or any parent (walks up)
3. ~/.searchindex/ as a global fallback
```
.searchindex/
  segments/
    0000/
      postings.trgm   # trigram posting lists (mmap binary)
      simhash.bin     # flat u64 array of SimHash fingerprints
    0001/             # incremental segment from next run
  docstore.redb       # document metadata and snippets
  metadata.redb       # structural metadata (dates, amounts, etc.)
```
Incremental indexing: each contextgrep index run compares mtime and xxh64 file hash. Unchanged files are skipped. Deleted files are removed from the index. Segments merge automatically when count exceeds 8.
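The change-detection step can be sketched like this, with blake2b standing in for xxh64 (which is not in the Python standard library); the real implementation may well hash only when the mtime differs, to avoid reading unchanged files:

```python
import hashlib
from pathlib import Path

def file_state(path: Path) -> tuple[float, str]:
    """(mtime, 64-bit content hash) for a file; blake2b stands in for xxh64."""
    data = path.read_bytes()
    return path.stat().st_mtime, hashlib.blake2b(data, digest_size=8).hexdigest()

def needs_reindex(path: Path, seen: dict[str, tuple[float, str]]) -> bool:
    """Skip a file when both mtime and content hash match the previous run."""
    mtime, digest = file_state(path)
    if seen.get(str(path)) == (mtime, digest):
        return False
    seen[str(path)] = (mtime, digest)
    return True
```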
To exclude files from indexing, add patterns to .searchignore (same syntax as .gitignore).
```
# Run all tests
cargo test

# Run specific test file
cargo test --test index_test

# Run unit tests for a module
cargo test trigram
cargo test simhash
cargo test query

# Run benchmarks
cargo bench

# Build release binary
cargo build --release
./target/release/contextgrep --help
```

Project layout:
```
src/
  cli/        commands.rs, output.rs
  parser/     text, pdf, docx, code, walker, metadata
  indexer/    trigram, simhash, schema, pipeline
  search/     query DSL, filters, proximity, scorer
  storage/    mmap (postings), redb store, segment manager
  config.rs   index path resolution
  error.rs    unified error type
tests/
  fixtures/     sample.txt, sample.md, sample.rs
  integration/  index_test, search_test, query_test
benches/        trigram, simhash, search benchmarks
```
- Semantic queries will miss results. Searching for "indemnification" won't match a document that only says "liability protection". Trigrams are lexical, not conceptual.
- PDF quality varies. Image-based PDFs (scanned documents) produce no text.
- No ranking feedback loop. Scoring weights are fixed defaults; there's no click-through learning.
- Single-machine only. No distributed index, no server mode.
MIT
