contextgrep

Grep your documents with context. Fast offline search for PDFs, DOCX, Markdown, and code — no vectors, no cloud, no ML dependencies. Trigram indexing + SimHash fingerprinting built in Rust.

Demo

contextgrep index ./docs/
contextgrep search "purchase agreement"       # grep with context
contextgrep query 'type:contract amount:>1M'  # structured DSL
contextgrep similar ./contract_draft.docx    # find near-duplicates
contextgrep recent --since 7d
contextgrep clusters ./docs/

Quick Start

1. Install the MCP server (one-time)

claude mcp add contextgrep npx contextgrep@latest

macOS Homebrew users: use $(which npx) instead of npx

2. Index your documents (one-time per folder)

Just tell your AI assistant:

"Index my documents at /Users/john/Documents/contracts"

The assistant calls the index tool automatically. The index is saved to .searchindex/ inside that folder — subsequent runs are incremental and only process new or changed files.

3. Start searching

"Find all contracts mentioning indemnification" "Which documents have a purchase price over $1M?" "Show me files modified in the last 7 days" "Find documents similar to this NDA"

The assistant picks the right tool (search, query, recent, similar) based on your question.

Why not vectors or BM25?

	This tool	Vector search	BM25
Works offline	yes	no (needs model)	yes
Deterministic	yes	no	yes
Handles typos/partials	yes	sometimes	no
Cost	zero	$$$ (inference)	zero
Explainable results	yes	no	partially
Finds near-duplicates	yes	yes	no

The goal: something that feels as fast as grep, understands document structure, and needs zero infrastructure.

How it works

Search runs in 3 stages:

Query
  ↓
Trigram index       — fast fuzzy/substring candidate retrieval
  ↓
Structural filter   — hard constraints (type:, path:, amount:, date:)
  ↓
Scoring             — trigram overlap + term proximity + recency + structure + title boost

Trigram index — splits text into overlapping 3-character windows and builds posting lists. Handles typos, partial matches, and substring queries without needing exact word boundaries.

SimHash — computes a 64-bit document fingerprint from word bigram shingles. Two documents with fewer than ~8 differing bits are near-duplicates. Used by contextgrep similar and contextgrep clusters.

Structural metadata — regex-based extraction of dates, currency amounts, email addresses, and document type inference. Used by the DSL filter layer.

Scoring formula:

score = 0.45 × trigram_overlap
      + 0.20 × term_proximity
      + 0.10 × recency_decay
      + 0.20 × structural_field_match
      + 0.05 × title_boost

Installation

MCP server (Claude Code, Claude Desktop, Cursor, VS Code…)

No Rust required. Add this to your MCP config:

{
  "mcpServers": {
    "contextgrep": {
      "command": "npx",
      "args": ["contextgrep@latest"]
    }
  }
}

Or via Claude Code CLI:

claude mcp add contextgrep npx contextgrep@latest

macOS (Homebrew Node.js): If the server fails to connect, use the full path to npx:
claude mcp add contextgrep $(which npx) contextgrep@latest

With a custom index location:

{
  "mcpServers": {
    "contextgrep": {
      "command": "npx",
      "args": ["contextgrep@latest", "--index", "/path/to/your/index"]
    }
  }
}

Config file locations:

Client	Config file
Claude Code	`~/.claude/settings.json`
Claude Desktop (macOS)	`~/Library/Application Support/Claude/claude_desktop_config.json`
Cursor	`~/.cursor/mcp.json` or `.cursor/mcp.json` in project
VS Code	`.vscode/mcp.json`
Zed	`~/.config/zed/settings.json` → `context_servers`

CLI binary

Download a pre-built binary from GitHub Releases, put it in your PATH, then:

contextgrep index ./docs/
contextgrep search "purchase agreement"

Build from source (requires Rust 1.75+):

git clone https://github.com/ranjanj1/contextgrep
cd contextgrep
cargo install --path .

Supported file types

Type	Extensions
Plain text	`.txt`, `.text`
Markdown	`.md`, `.markdown`
PDF	`.pdf`
Word documents	`.docx`
Config/data	`.toml`, `.yaml`, `.yml`, `.json`
Code	`.rs`, `.py`, `.js`, `.ts`, `.go`, `.java`, `.c`, `.cpp`, `.rb`, `.swift`, `.kt`, `.sh`

Respects .gitignore, .ignore, and .searchignore files during walks.

Commands

`contextgrep index <path>`

Index a folder. Subsequent runs are incremental — only new or changed files are re-indexed.

contextgrep index ./docs/
contextgrep index ./docs/ --full        # force full re-index

`contextgrep search <query>`

Trigram-based fuzzy search. Handles partial words and minor typos.

contextgrep search "purchase agreement"
contextgrep search "purchse agreem"           # typo-tolerant
contextgrep search "2024 contract"
contextgrep search "indemnif"                 # prefix match
contextgrep search "agreement" -n 5           # top 5 results

Snippet control:

--context-size N (default 120) extracts N characters on both sides of the match, so the hit sits in the middle of the snippet (~2×N chars total).

contextgrep search "indemnif" --context-size 500    # ~1000 chars around each match
contextgrep search "indemnif" --full-content        # return entire file text instead of a snippet

RAG usage — pipe full content as JSON into your LLM:

contextgrep search "purchase price" --full-content --output json -n 3

[
  {
    "path": "./docs/acquisition.pdf",
    "score": 0.72,
    "type": "contract",
    "snippet": "...the purchase price shall be...",
    "content": "Agreement for Services\n\nThis agreement..."
  }
]

import subprocess, json

def retrieve(question: str, top_k: int = 3) -> list[dict]:
    result = subprocess.run(
        ["contextgrep", "search", question, "--full-content", "--output", "json", "-n", str(top_k)],
        capture_output=True, text=True,
    )
    return json.loads(result.stdout)

`contextgrep query <dsl>`

Structured search using the filter DSL.

contextgrep query 'type:contract'
contextgrep query 'amount:>1M'
contextgrep query 'type:contract amount:>500K path:/legal'
contextgrep query '"non-disclosure" AND date:>2024-01-01'
contextgrep query 'type:invoice OR type:receipt'
contextgrep query 'NOT type:draft'

DSL reference:

Syntax	Meaning
`type:contract`	Document type equals (fuzzy)
`amount:>1M`	Any extracted amount > 1,000,000
`amount:<=50000`	Any extracted amount ≤ 50,000
`date:>2024-01-01`	Any extracted date after Jan 1 2024
`date:>2024`	Any extracted date after 2024
`path:/legal`	File path contains `/legal`
`email:@acme.com`	Email matching `@acme.com` found in doc
`"exact phrase"`	Phrase must appear as written
`AND`, `OR`, `NOT`	Boolean operators
`(...)`	Grouping

`contextgrep similar <file>`

Find documents similar to a given file using SimHash Hamming distance.

contextgrep similar ./contract_v1.docx
contextgrep similar ./contract_v1.docx --threshold 12    # looser matching
contextgrep similar ./contract_v1.docx -n 20             # top 20 results

Similarity score is 1 - (hamming_distance / 64). Score of 1.0 = identical content, 0.875 = 8 bits differ.

`contextgrep recent`

Show recently modified documents, sorted newest first.

contextgrep recent --since 7d      # last 7 days
contextgrep recent --since 2w      # last 2 weeks
contextgrep recent --since 3m      # last 3 months
contextgrep recent --since 1y      # last year
contextgrep recent --since 2024-06-01   # since a specific date

`contextgrep clusters`

Group documents into similarity clusters using Locality-Sensitive Hashing on SimHash fingerprints.

contextgrep clusters ./docs/
contextgrep clusters ./docs/ --bits 4    # coarse clustering (fewer, larger groups)
contextgrep clusters ./docs/ --bits 8    # fine-grained clustering

Output formats

All commands support --output plain|json|tsv:

# Default: human-readable
contextgrep search "agreement"

# JSON: newline-delimited, good for scripting
contextgrep search "agreement" --output json | jq '.[0].path'

# TSV: tab-separated path, score, snippet
contextgrep search "agreement" --output tsv | cut -f1

Index location

The index is stored in .searchindex/ and resolved with this precedence:

--index <path> CLI flag
.searchindex/ in the current directory or any parent (walks up)
~/.searchindex/ as a global fallback

.searchindex/
  segments/
    0000/
      postings.trgm    # trigram posting lists (mmap binary)
      simhash.bin      # flat u64 array of SimHash fingerprints
    0001/              # incremental segment from next run
  docstore.redb        # document metadata and snippets
  metadata.redb        # structural metadata (dates, amounts, etc.)

Incremental indexing: each contextgrep index run compares mtime and xxh64 file hash. Unchanged files are skipped. Deleted files are removed from the index. Segments merge automatically when count exceeds 8.

To exclude files from indexing, add patterns to .searchignore (same syntax as .gitignore).

Development

# Run all tests
cargo test

# Run specific test file
cargo test --test index_test

# Run unit tests for a module
cargo test trigram
cargo test simhash
cargo test query

# Run benchmarks
cargo bench

# Build release binary
cargo build --release
./target/release/contextgrep --help

Project layout:

src/
  cli/          commands.rs, output.rs
  parser/       text, pdf, docx, code, walker, metadata
  indexer/      trigram, simhash, schema, pipeline
  search/       query DSL, filters, proximity, scorer
  storage/      mmap (postings), redb store, segment manager
  config.rs     index path resolution
  error.rs      unified error type
tests/
  fixtures/     sample.txt, sample.md, sample.rs
  integration/  index_test, search_test, query_test
benches/        trigram, simhash, search benchmarks

Limitations

Semantic queries will miss results. Searching for "indemnification" won't match a document that only says "liability protection". Trigrams are lexical, not conceptual.
PDF quality varies. Image-based PDFs (scanned documents) produce no text.
No ranking feedback loop. Scoring weights are fixed defaults; there's no click-through learning.
Single-machine only. No distributed index, no server mode.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
benches		benches
npm		npm
src		src
tests		tests
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md
demo.jpg		demo.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

contextgrep

Demo

Quick Start

Why not vectors or BM25?

How it works

Installation

MCP server (Claude Code, Claude Desktop, Cursor, VS Code…)

CLI binary

Supported file types

Commands

`contextgrep index <path>`

`contextgrep search <query>`

`contextgrep query <dsl>`

`contextgrep similar <file>`

`contextgrep recent`

`contextgrep clusters`

Output formats

Index location

Development

Limitations

License

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

contextgrep

Demo

Quick Start

Why not vectors or BM25?

How it works

Installation

MCP server (Claude Code, Claude Desktop, Cursor, VS Code…)

CLI binary

Supported file types

Commands

contextgrep index <path>

contextgrep search <query>

contextgrep query <dsl>

contextgrep similar <file>

contextgrep recent

contextgrep clusters

Output formats

Index location

Development

Limitations

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`contextgrep index <path>`

`contextgrep search <query>`

`contextgrep query <dsl>`

`contextgrep similar <file>`

`contextgrep recent`

`contextgrep clusters`

Packages