Hermit is a self-contained local semantic search service for turning one or more document folders into searchable knowledge-base collections.
It is designed for local-first workflows and works well as a lightweight retrieval backend for notes, technical documents, and small RAG-style applications.
- Runs fully locally: models, vector data, and metadata live inside the project
- Semantic Markdown chunking: a state-machine parser splits .md files into 11 semantically coherent block types (headings, fenced code, math, tables, blockquotes, lists, …) before chunking, so code blocks and tables are never split mid-content
- Heading-aware sliding window: chunks always start at a section heading when possible, so each retrieved chunk is self-contained and retrieval-friendly even without surrounding context
- Dual Vector Storage Modes: Automatically handles both local (embedded) and standalone (Docker-based) Qdrant backends.
- Multi-threaded Search Acceleration: Runs ONNX-based reranking on a dedicated thread pool to sidestep the GIL, achieving roughly 3x throughput.
- Multi-collection support: one folder maps to one collection
- Hybrid retrieval: dense + sparse recall
- Reranking: a cross-encoder reranks the fused candidates
- Incremental sync: startup scan plus periodic polling
- CPU-friendly: built on fastembed + ONNX Runtime, no GPU required
Hermit is a good fit when you want to:
- search a local notes or markdown repository semantically
- expose a simple retrieval API for a local tool or agent
- build a private, small-footprint RAG layer without cloud dependencies
The current implementation reads files as text using UTF-8 with replacement on decode errors, so it works best with plain-text sources such as .md and .txt files.
Hermit uses the following search flow:
- Encode the query into dense and sparse representations
- Run hybrid retrieval in Qdrant
- Fuse candidates with RRF
- Rerank the candidate set
- Return the top matching chunks
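To make the fusion step concrete, here is a minimal sketch of what Reciprocal Rank Fusion computes. It is an illustration only, not Hermit's code: the function name `rrf_fuse`, the chunk IDs, and the constant `k=60` (a commonly used default) are assumptions, and in practice the fusion may well happen inside Qdrant.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical dense and sparse result lists, best hit first.
dense_hits = ["chunk-a", "chunk-b", "chunk-c"]
sparse_hits = ["chunk-b", "chunk-d", "chunk-a"]

fused = rrf_fuse([dense_hits, sparse_hits])
fused_ids = [doc_id for doc_id, _ in fused]
print(fused_ids)
```

Because RRF only looks at ranks, it needs no score normalization between the dense and sparse retrievers.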
Each registered folder goes through:
- startup scan
- SQLite metadata diffing
- text chunking (see below)
- embedding generation
- Qdrant upsert
- ongoing periodic polling
For .md files, Hermit uses a two-phase strategy instead of a simple token sliding window:
Phase 1 — block parsing (parse_md_blocks): A state-machine parser scans the file line-by-line and groups content into 11 semantically coherent block types:
| Block type | Examples |
|---|---|
| YAML frontmatter | --- … --- at file start |
| Fenced code block | ``` … ``` or ~~~ … ~~~ |
| Math block | $$ … $$ |
| ATX heading | # H1 … ###### H6 |
| Setext heading | underline with === or --- |
| Table | pipe-delimited rows |
| Blockquote | > prefixed lines |
| Horizontal rule | --- / *** / ___ |
| List | entire list including nested items |
| Standalone image | ![alt](path) or Obsidian ![[path]] |
| Paragraph | any other contiguous non-blank text |
This ensures fenced code blocks, math formulas, and tables are always kept intact as a unit.
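The idea of a line-by-line state machine can be sketched in a few lines. The version below is a heavily simplified illustration, not the actual parse_md_blocks: it recognizes only fenced code, ATX headings, and paragraphs, and the name `parse_blocks` and the `(type, text)` tuple shape are assumptions.

```python
import re

FENCE_RE = re.compile(r"^(`{3,}|~{3,})")   # opening fence: 3+ backticks or tildes
HEADING_RE = re.compile(r"^#{1,6}\s")      # ATX heading: 1-6 '#' then whitespace


def parse_blocks(text: str) -> list[tuple[str, str]]:
    """Split Markdown into (block_type, content) pairs (simplified sketch)."""
    blocks: list[tuple[str, str]] = []
    buffer: list[str] = []
    fence: str | None = None  # holds the opening marker while inside a fence

    def flush(kind: str) -> None:
        if buffer:
            blocks.append((kind, "\n".join(buffer)))
            buffer.clear()

    for line in text.splitlines():
        if fence is not None:
            # Inside a fence every line is kept, blank or not, until it closes.
            buffer.append(line)
            stripped = line.strip()
            if stripped and set(stripped) == {fence[0]} and len(stripped) >= len(fence):
                flush("code")
                fence = None
            continue
        match = FENCE_RE.match(line)
        if match:
            flush("paragraph")
            fence = match.group(1)
            buffer.append(line)
        elif HEADING_RE.match(line):
            flush("paragraph")
            blocks.append(("heading", line))
        elif not line.strip():
            flush("paragraph")  # a blank line ends the current paragraph
        else:
            buffer.append(line)
    flush("code" if fence is not None else "paragraph")
    return blocks


sample = "\n".join([
    "# Title",
    "Intro line one.",
    "Intro line two.",
    "",
    "~~~python",
    "x = 1",
    "",
    "y = 2",
    "~~~",
    "Closing paragraph.",
])
blocks = parse_blocks(sample)
print([kind for kind, _ in blocks])
```

Note how the blank line inside the fence does not end the code block, which is the property that keeps code intact as a unit.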
Phase 2 — heading-aware sliding window (chunk_markdown): Blocks are grouped into chunks (default: 4 blocks per chunk) with two structural rules:
- Rule 1 — no orphan headings: if the last block in a chunk is a heading, the chunk is automatically extended by one block so the heading always enters a chunk together with at least its first body block.
- Rule 2 — heading-anchored start: the next chunk begins at the nearest preceding heading rather than at an arbitrary paragraph, so every chunk carries its own section context.
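One possible reading of the two rules is sketched below. This is not the actual chunk_markdown: the function name `chunk_blocks`, the `(type, text)` block shape, and the exact way Rule 2 picks the restart point are assumptions made for illustration.

```python
Block = tuple[str, str]  # (block_type, text); an assumed shape for this sketch


def chunk_blocks(blocks: list[Block], size: int = 4) -> list[list[Block]]:
    """Group blocks into chunks of about `size` blocks, respecting headings."""
    chunks: list[list[Block]] = []
    start = 0
    while start < len(blocks):
        end = min(start + size, len(blocks))
        # Rule 1: do not end on a heading; pull in its first body block.
        if end < len(blocks) and blocks[end - 1][0] == "heading":
            end += 1
        chunks.append(blocks[start:end])
        if end >= len(blocks):
            break
        # Rule 2 (as interpreted here): restart at the nearest heading inside
        # the chunk just emitted, if one lies strictly after `start`;
        # otherwise continue from where the chunk ended.
        next_start = end
        for i in range(end - 1, start, -1):
            if blocks[i][0] == "heading":
                next_start = i
                break
        start = next_start  # always greater than the old start, so the loop ends
    return chunks


sample = [
    ("heading", "# A"),
    ("paragraph", "a1"),
    ("paragraph", "a2"),
    ("heading", "## B"),
    ("paragraph", "b1"),
    ("paragraph", "b2"),
    ("paragraph", "b3"),
]
chunks = chunk_blocks(sample)
for chunk in chunks:
    print([text for _, text in chunk])
```

In this sample the first chunk is extended to include `b1` so that `## B` is not orphaned, and the second chunk restarts at `## B` so it carries its own section heading.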
Other file types continue to use the token-based sliding window (chunk_text).
See docs/markdown-chunking.md for the full design.
- Chunk size: 256 tokens (using the embedding model's tokenizer)
- Chunk overlap: 32 tokens
- Search top_k: 5
- Default w_dense: 0.7
- Default w_sparse: 0.3
- Default rerank candidates: 20
- Max collections: 4
- Max collection name length: 64
- Default port: 8000
- API framework: FastAPI
- Vector database: Qdrant (Dual-mode: Embedded or Standalone Docker)
- Inference backend: fastembed (ONNX-based, parallelized via ThreadPoolExecutor)
- Metadata store: SQLite
- Filesystem watcher: periodic polling
Current models:
- Dense embedding: jinaai/jina-embeddings-v2-base-zh
- Sparse embedding: Qdrant/bm25
- Reranker: jinaai/jina-reranker-v2-base-multilingual
Hermit automatically detects the environment and switches between two modes:
- Local Mode (Default): Uses Qdrant's embedded client. Data is stored in data/qdrant/. Ideal for zero-configuration local use.
- Standalone Mode: If QDRANT_HOST is set (e.g., QDRANT_HOST=localhost), Hermit manages a Qdrant Docker container automatically. This mode is faster for large-scale indexing and offers better persistence management.
To force standalone mode, start Hermit with:
QDRANT_HOST=localhost hermit start
Hermit is optimized for stable local search memory usage:
- Serialized Search: Search requests run through a single-worker executor. This keeps the shared ONNX sessions from serving multiple reranker requests concurrently.
- Bounded ONNX Inference: Uses HERMIT_ONNX_THREADS=2 by default to keep ONNX Runtime per-thread arenas small. Raise it only after measuring that latency improves enough to justify the extra resident memory.
- Smaller Rerank Pool: Uses 20 candidates per query by default while keeping cross-encoder reranking enabled.
- Embedding Cache: Indexing skips ONNX inference for chunks whose exact model input was seen before. Vectors are cached on disk (HERMIT_HOME/cache, sha256-keyed by model_name::input_text) with a 7-day TTL. Cache hits validate the vector dimension and fall back to a fresh embed on mismatch, so model upgrades or partially corrupted entries are self-healing. Always on by design; the cache is bounded and self-reaping.
.
├── main.py
├── pyproject.toml
├── README.md
├── README_cn.md
├── docs/
│   └── design.md
├── hermit/
│   ├── cli.py
│   ├── config.py
│   ├── api/
│   │   ├── routes.py
│   │   └── schemas.py
│   ├── ingestion/
│   │   ├── chunker.py
│   │   ├── scanner.py
│   │   ├── task_queue.py
│   │   └── watcher.py
│   ├── retrieval/
│   │   ├── embedder.py
│   │   ├── reranker.py
│   │   └── searcher.py
│   └── storage/
│       ├── metadata.py
│       ├── model_signature.py
│       ├── qdrant.py
│       └── registry.py
├── data/
│   ├── collections.json
│   ├── metadata/
│   └── qdrant/
└── models/
- Python 3.12 ~ 3.13
- macOS or Linux
Using a virtual environment is recommended.
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
If you plan to use hermit download, make sure huggingface_hub is available in your environment, since the CLI uses it to download model snapshots.
hermit download
Optional flags:
hermit download --force
hermit download --skip-verify
Notes:
- missing models can also be downloaded automatically on first service startup
- downloading them explicitly makes first boot less surprising and easier to monitor
hermit kb add my_docs ./documents
List collections:
hermit kb list
Remove a collection:
hermit kb remove my_docs
Collection naming rules:
- must start with a letter or digit
- may contain only letters, digits, underscores, and hyphens
- must be unique
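The naming rules can be expressed as a small validator. This is a sketch, not the CLI's actual check: the function name is hypothetical, "letters" is assumed to mean ASCII letters, and the 64-character limit is taken from the defaults listed earlier.

```python
import re

MAX_NAME_LENGTH = 64  # max collection name length from the defaults
# First character: letter or digit; the rest: letters, digits, '_' or '-'.
# ASCII-only is an assumption of this sketch.
NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def is_valid_collection_name(name: str, existing: set[str]) -> bool:
    """Check the three naming rules plus the length limit."""
    return (
        len(name) <= MAX_NAME_LENGTH
        and NAME_RE.fullmatch(name) is not None
        and name not in existing
    )


registered = {"notes"}
print(is_valid_collection_name("my_docs", registered))   # valid name
print(is_valid_collection_name("_drafts", registered))   # starts with '_'
print(is_valid_collection_name("notes", registered))     # already registered
```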
python main.py
On startup, Hermit will:
- warm up embedding and reranker models
- start the background indexing worker
- restore persisted collections from data/collections.json
- scan each collection folder
- start watching registered folders for changes
Default bind address:
- Host: 0.0.0.0
- Port: 8000
curl -X POST http://127.0.0.1:8000/search \
-H 'Content-Type: application/json' \
-d '{
"query": "two sum approach",
"collection": "my_docs",
"top_k": 5,
"w_dense": 0.7,
"w_sparse": 0.3,
"rerank_candidates": 30
}'
Hermit currently provides these CLI commands.
Download all required models and optionally run a basic verification step.
hermit download
Flags:
- --force: force re-download
- --skip-verify: skip post-download verification
Register a folder as a collection.
hermit kb add notes ./documents
Remove a collection and delete its metadata store.
hermit kb remove notes
List all registered collections.
hermit kb list
The current codebase exposes the following endpoints.
Run hybrid semantic search.
Request example:
{
"query": "sliding window maximum",
"collection": "my_docs",
"top_k": 5,
"w_dense": 0.7,
"w_sparse": 0.3,
"rerank_candidates": 30
}
Response example:
{
"results": [
{
"text": "...",
"source_file": "/abs/path/to/file.md",
"chunk_index": 0,
"total_chunks": 3,
"score": 0.82
}
]
}
Trigger a manual scan/sync for a collection.
Response fields:
- added
- updated
- deleted
Example:
curl -X POST http://127.0.0.1:8000/collections/my_docs/sync
Get collection status.
Response fields:
- name
- folder_path
- indexed_files
- total_chunks
- watching
Example:
curl http://127.0.0.1:8000/collections/my_docs/status
Get background indexing task status for a collection.
Response fields:
- collection
- pending_tasks
- queued_tasks
- in_progress_tasks
- worker_alive
Example:
curl http://127.0.0.1:8000/collections/my_docs/tasks
Server health and runtime info.
Response fields:
- status: "ready" or "starting"
- uptime: seconds since server start
- models_loaded: whether embedding/reranker models are loaded
- collections: list of {name, indexed_files, total_chunks} per collection
- pending_index_tasks: total background indexing tasks waiting across all collections
- qdrant_mode: "local" (embedded) or "standalone" (external Qdrant server)
- qdrant_host: host address in standalone mode; null in local mode
Example:
curl http://127.0.0.1:8000/health
By default, Hermit stores its runtime data inside the project directory:
- models/: local model cache
- data/qdrant/: Qdrant embedded data
- data/metadata/: one SQLite database per collection
- data/collections.json: persisted collection configuration
- cache/dense/ and cache/sparse/: embedding cache (sha256-keyed, 7-day TTL)
That makes the project easy to back up, move, and clean up. No mysterious hidden cave system under your home directory.
- recursively scans all non-hidden files
- skips any path segment starting with .
- reads files as text with utf-8 and errors="replace"
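The discovery rules above map naturally onto pathlib. The following is a minimal sketch under that assumption; the function name `iter_text_files` is hypothetical and this is not the scanner's actual code.

```python
from collections.abc import Iterator
from pathlib import Path


def iter_text_files(root: Path) -> Iterator[tuple[Path, str]]:
    """Yield (path, text) for every non-hidden file under root."""
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        # Skip anything with a path segment that starts with '.',
        # which covers both hidden files and files inside hidden folders.
        if any(part.startswith(".") for part in relative.parts):
            continue
        if path.is_file():
            # Undecodable bytes become U+FFFD instead of raising an error.
            yield path, path.read_text(encoding="utf-8", errors="replace")
```

Checking segments relative to `root` means a hidden directory somewhere above the collection folder does not cause everything to be skipped.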
Hermit tracks indexed files in SQLite and uses SHA256 to detect content changes.
During scanning it handles:
- new files: enqueue or index them
- modified files: rechunk, re-embed, and replace old chunks
- deleted files: remove them from Qdrant and SQLite
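The three cases above amount to diffing two path-to-hash maps. A minimal sketch follows, assuming the SQLite metadata can be loaded as a dict of path to SHA256 hex digest; the names `sha256_hex` and `diff_files` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def diff_files(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored hashes (indexed) with hashes from a fresh scan (current)."""
    return {
        # On disk now, but not in the metadata store.
        "added": sorted(p for p in current if p not in indexed),
        # Present in both, but the content hash changed.
        "updated": sorted(p for p in current if p in indexed and current[p] != indexed[p]),
        # In the metadata store, but gone from disk.
        "deleted": sorted(p for p in indexed if p not in current),
    }


indexed = {"a.md": sha256_hex(b"one"), "b.md": sha256_hex(b"two"), "c.md": sha256_hex(b"three")}
current = {"a.md": sha256_hex(b"one"), "b.md": sha256_hex(b"two, edited"), "d.md": sha256_hex(b"four")}
changes = diff_files(indexed, current)
print(changes)
```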
- default chunk size is 256 tokens (using the embedding model's tokenizer)
- adjacent chunks overlap by 32 tokens
- empty text is skipped
- short text stays as a single chunk
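As a rough illustration of these rules, the sketch below applies a 256-token window with a 32-token overlap to a pre-tokenized list. It is not the actual chunk_text: the function name `sliding_window` is hypothetical, and the placeholder tokens stand in for the embedding model's tokenizer output.

```python
def sliding_window(tokens: list[str], size: int = 256, overlap: int = 32) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    if not tokens:
        return []            # empty text is skipped
    if len(tokens) <= size:
        return [tokens]      # short text stays as a single chunk
    step = size - overlap
    chunks: list[list[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break            # this window already reached the end
    return chunks


# Placeholder tokens; a real run would use the model tokenizer's output.
tokens = [f"t{i}" for i in range(600)]
windows = sliding_window(tokens)
print([len(w) for w in windows])
```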
- there is currently no API to create or delete collections; use the CLI for that
- w_dense and w_sparse are accepted by the API, but the current implementation uses RRF fusion rather than explicit weighted score fusion
- all files are treated as text; PDF, image, and Office parsing are out of scope
- the maximum number of collections is currently 4
- first-time model downloads may take a while and use noticeable disk space
The test suite currently covers:
- CLI validation and collection management
- scanner add/update/delete logic
- task queue status reporting
- selected API route behavior
Run tests with:
pytest
For implementation details, see:
docs/design.md
If you want a small, local-first, multi-collection semantic search service that quietly gets the job done, Hermit fits the brief nicely.