Hermit

Hermit is a self-contained local semantic search service for turning one or more document folders into searchable knowledge-base collections.

It is designed for local-first workflows and works well as a lightweight retrieval backend for notes, technical documents, and small RAG-style applications.

Highlights

Runs fully locally: models, vector data, and metadata live inside the project
Semantic Markdown chunking: a state-machine parser splits .md files into 11 semantically coherent block types (headings, fenced code, math, tables, blockquotes, lists, …) before chunking — code blocks and tables are never split mid-content
Heading-aware sliding window: chunks always start at a section heading when possible, so each retrieved chunk is self-contained and retrieval-friendly even without surrounding context
Dual vector storage modes: automatically handles both local (embedded) and standalone (Docker-based) Qdrant backends
Multi-threaded search acceleration: ONNX-based reranking runs on a dedicated thread pool to bypass the GIL, achieving ~3x throughput improvement
Multi-collection support: one folder maps to one collection
Hybrid retrieval: dense + sparse recall
Reranking: a cross-encoder reranks the fused candidates
Incremental sync: startup scan plus periodic polling
CPU-friendly: built on fastembed + ONNX Runtime, no GPU required

What it is good for

Hermit is a good fit when you want to:

search a local notes or markdown repository semantically
expose a simple retrieval API for a local tool or agent
build a private, small-footprint RAG layer without cloud dependencies

The current implementation reads files as text using UTF-8 with replacement on decode errors, so it works best with plain-text sources such as .md and .txt files.

How it works

Retrieval pipeline

Hermit uses the following search flow:

Encode the query into dense and sparse representations
Run hybrid retrieval in Qdrant
Fuse candidates with RRF
Rerank the candidate set
Return the top matching chunks
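
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion (RRF) in plain Python. In Hermit the fusion runs as part of the Qdrant hybrid query, so this is only an illustration of the idea. The function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the chunk IDs are assumptions, not taken from the codebase.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=20):
    """Fuse several ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to every ID it contains, so items
    that rank well in both the dense and the sparse list rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to stay deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:limit]]


# Hypothetical candidate lists from the dense and sparse retrievers.
dense_hits = ["c3", "c1", "c7"]
sparse_hits = ["c1", "c9", "c3"]
fused = rrf_fuse([dense_hits, sparse_hits])
```

RRF uses only rank positions, not raw scores, which is why dense and sparse results can be combined without normalizing their score scales.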

Indexing pipeline

Each registered folder goes through:

startup scan
SQLite metadata diffing
text chunking (see below)
embedding generation
Qdrant upsert
ongoing periodic polling

Markdown semantic chunking

For .md files, Hermit uses a two-phase strategy instead of a simple token sliding window:

Phase 1 — block parsing (parse_md_blocks): A state-machine parser scans the file line-by-line and groups content into 11 semantically coherent block types:

| Block type | Examples |
| --- | --- |
| YAML frontmatter | `---` … `---` at file start |
| Fenced code block | `` ``` `` … `` ``` `` or `~~~` … `~~~` |
| Math block | `$$` … `$$` |
| ATX heading | `# H1` … `###### H6` |
| Setext heading | underline with `===` or `---` |
| Table | pipe-delimited rows |
| Blockquote | `>` prefixed lines |
| Horizontal rule | `---` / `***` / `___` |
| List | entire list including nested items |
| Standalone image | `![alt](url)` or Obsidian `![[path]]` |
| Paragraph | any other contiguous non-blank text |

This ensures fenced code blocks, math formulas, and tables are always kept intact as a unit.
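
The state-machine idea can be sketched in a few lines. The example below is a simplified illustration that recognizes only three of the eleven block types (fenced code, ATX headings, paragraphs). The name `parse_blocks_sketch` and the `(type, text)` tuple shape are assumptions and do not reflect the real `parse_md_blocks` signature.

```python
import re

FENCE = re.compile(r"^(`{3,}|~{3,})")   # opening fence: 3+ backticks or tildes
ATX = re.compile(r"^#{1,6}\s")          # ATX heading: 1-6 '#' then whitespace


def parse_blocks_sketch(text):
    """Split Markdown text into (block_type, block_text) tuples."""
    blocks = []
    buf = []       # lines of the block currently being collected
    fence = None   # fence marker that opened the current code block, if any

    def flush(kind):
        if buf:
            blocks.append((kind, "\n".join(buf)))
            buf.clear()

    for line in text.splitlines():
        if fence is not None:
            # Inside a code block: keep every line, including blank ones.
            buf.append(line)
            stripped = line.strip()
            if stripped.startswith(fence) and set(stripped) == {fence[0]}:
                flush("code")
                fence = None
            continue
        match = FENCE.match(line)
        if match:
            flush("paragraph")
            fence = match.group(1)
            buf.append(line)
        elif ATX.match(line):
            flush("paragraph")
            blocks.append(("heading", line))
        elif not line.strip():
            flush("paragraph")       # a blank line ends a paragraph
        else:
            buf.append(line)
    # An unterminated fence is still emitted as one code block.
    flush("code" if fence is not None else "paragraph")
    return blocks


sample = "# Title\n\nIntro line one\nline two\n\n~~~python\nx = 1\n\ny = 2\n~~~\n\nTail"
sample_blocks = parse_blocks_sketch(sample)
```

Note how the blank line inside the fenced block does not end the block: while the parser is in the "inside fence" state, only a matching closing fence can.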

Phase 2 — heading-aware sliding window (chunk_markdown): Blocks are grouped into chunks (default: 4 blocks per chunk) with two structural rules:

Rule 1 — no orphan headings: if the last block in a chunk is a heading, the chunk is automatically extended by one block so the heading always enters a chunk together with at least its first body block.
Rule 2 — heading-anchored start: the next chunk begins at the nearest preceding heading rather than at an arbitrary paragraph, so every chunk carries its own section context.
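
One possible reading of the two rules is sketched below. It is an interpretation for illustration only: the name `chunk_blocks_sketch`, the `(type, text)` block shape, and the exact rewind behaviour are assumptions, and the real `chunk_markdown` may differ in its details.

```python
def chunk_blocks_sketch(blocks, size=4):
    """Group (block_type, text) tuples into chunks of about `size` blocks."""
    chunks = []
    start = 0
    total = len(blocks)
    while start < total:
        end = min(start + size, total)
        # Rule 1: a heading must not be the last block of a chunk,
        # so pull in one more block when that would happen.
        if end < total and blocks[end - 1][0] == "heading":
            end += 1
        chunks.append(blocks[start:end])
        if end >= total:
            break
        # Rule 2: start the next chunk at the nearest heading at or before
        # `end`, as long as it lies after the current start (this
        # guarantees forward progress). Fall back to `end` otherwise.
        next_start = end
        for index in range(end, start, -1):
            if blocks[index][0] == "heading":
                next_start = index
                break
        start = next_start
    return chunks


doc = [
    ("heading", "# A"), ("paragraph", "a1"), ("paragraph", "a2"),
    ("heading", "## B"), ("paragraph", "b1"), ("paragraph", "b2"),
    ("paragraph", "b3"),
]
chunks = chunk_blocks_sketch(doc)
```

In this example the first chunk is extended so that `## B` is followed by `b1`, and the second chunk starts again at `## B`, so both chunks carry a heading for the content they contain.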

Other file types continue to use the token-based sliding window (chunk_text).

See docs/markdown-chunking.md for the full design.

Default settings

Chunk size: 256 tokens (using the embedding model's tokenizer)
Chunk overlap: 32 tokens
Search top_k: 5
Default w_dense: 0.7
Default w_sparse: 0.3
Default rerank candidates: 20
Max collections: 4
Max collection name length: 64
Default port: 8000

Tech stack

API framework: FastAPI
Vector database: Qdrant (Dual-mode: Embedded or Standalone Docker)
Inference backend: fastembed (ONNX-based, parallelized via ThreadPoolExecutor)
Metadata store: SQLite
Filesystem watcher: periodic polling

Current models:

Dense embedding: jinaai/jina-embeddings-v2-base-zh
Sparse embedding: Qdrant/bm25
Reranker: jinaai/jina-reranker-v2-base-multilingual

Vector Storage Modes

Hermit automatically detects the environment and switches between two modes:

Local Mode (Default): Uses Qdrant's embedded client. Data is stored in data/qdrant/. Ideal for zero-configuration local use.
Standalone Mode: If QDRANT_HOST is set (e.g., QDRANT_HOST=localhost), Hermit manages a Qdrant Docker container automatically. This mode is faster for large-scale indexing and makes persisted data easier to manage.

To force standalone mode, start Hermit with:

QDRANT_HOST=localhost hermit start
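
The mode switch can be pictured as a single environment check. The sketch below is an assumption about how such detection could look, not Hermit's actual code. The function name `detect_qdrant_mode` is hypothetical, and treating an empty `QDRANT_HOST` as unset is a guess. The return values mirror the `qdrant_mode` and `qdrant_host` fields reported by `GET /health`.

```python
import os


def detect_qdrant_mode(env=None):
    """Return (mode, host) based on the QDRANT_HOST environment variable."""
    env = os.environ if env is None else env
    host = env.get("QDRANT_HOST")
    if host:
        # A non-empty host selects the standalone (external server) mode.
        return "standalone", host
    # No host configured: fall back to the embedded client.
    return "local", None
```

Passing the environment as a parameter keeps the function easy to test without mutating `os.environ`.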

Performance & Memory

Hermit is tuned to keep memory usage stable during local search:

Serialized Search: Search requests run through a single-worker executor. This keeps the shared ONNX sessions from serving multiple reranker requests concurrently.
Bounded ONNX Inference: Uses HERMIT_ONNX_THREADS=2 by default to keep ONNX Runtime per-thread arenas small. Raise it only after measuring that latency improves enough to justify the extra resident memory.
Smaller Rerank Pool: Uses 20 candidates per query by default while keeping cross-encoder reranking enabled.
Embedding Cache: Indexing skips ONNX inference for chunks whose exact model input was seen before. Vectors are cached on disk (HERMIT_HOME/cache, sha256-keyed by model_name::input_text) with a 7-day TTL. Cache hits validate the vector dimension and fall back to a fresh embed on mismatch — model upgrades or partially-corrupted entries are self-healing. Always on by design; the cache is bounded and self-reaping.
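
The cache behaviour described above (sha256 key over `model_name::input_text`, 7-day TTL, dimension check on read) can be sketched as follows. The one-JSON-file-per-entry layout and the function names are assumptions made for the example; the on-disk format Hermit actually uses is not documented here.

```python
import hashlib
import json
import time
from pathlib import Path

TTL_SECONDS = 7 * 24 * 3600  # 7-day TTL, as stated above


def cache_key(model_name, input_text):
    """sha256 over 'model_name::input_text'."""
    raw = f"{model_name}::{input_text}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


def cache_put(cache_dir, model_name, input_text, vector):
    directory = Path(cache_dir)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{cache_key(model_name, input_text)}.json"
    path.write_text(json.dumps(vector), encoding="utf-8")


def cache_get(cache_dir, model_name, input_text, expected_dim, now=None):
    """Return the cached vector, or None when the caller must re-embed."""
    path = Path(cache_dir) / f"{cache_key(model_name, input_text)}.json"
    if not path.exists():
        return None
    now = time.time() if now is None else now
    if now - path.stat().st_mtime > TTL_SECONDS:
        path.unlink()            # expired entries are removed on access
        return None
    try:
        vector = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return None              # corrupted entry: treat as a miss
    if not isinstance(vector, list) or len(vector) != expected_dim:
        return None              # dimension mismatch, e.g. after a model change
    return vector
```

Because the model name is part of the key and the dimension is checked on read, a miss simply falls through to a fresh embedding, which then overwrites the stale entry.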

Project layout

.
├── main.py
├── pyproject.toml
├── README.md
├── README_cn.md
├── docs/
│   └── design.md
├── hermit/
│   ├── cli.py
│   ├── config.py
│   ├── api/
│   │   ├── routes.py
│   │   └── schemas.py
│   ├── ingestion/
│   │   ├── chunker.py
│   │   ├── scanner.py
│   │   ├── task_queue.py
│   │   └── watcher.py
│   ├── retrieval/
│   │   ├── embedder.py
│   │   ├── reranker.py
│   │   └── searcher.py
│   └── storage/
│       ├── metadata.py
│       ├── model_signature.py
│       ├── qdrant.py
│       └── registry.py
├── data/
│   ├── collections.json
│   ├── metadata/
│   └── qdrant/
└── models/

Installation

Requirements

Python 3.12 or 3.13
macOS or Linux

Using a virtual environment is recommended.

Install from source

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

If you plan to use hermit download, make sure huggingface_hub is available in your environment, since the CLI uses it to download model snapshots.

Quick start

1. Download models (optional but recommended)

hermit download

Optional flags:

hermit download --force
hermit download --skip-verify

Notes:

missing models can also be downloaded automatically on first service startup
downloading them explicitly makes first boot less surprising and easier to monitor

2. Register a knowledge-base folder

hermit kb add my_docs ./documents

List collections:

hermit kb list

Remove a collection:

hermit kb remove my_docs

Collection naming rules:

must start with a letter or digit
may contain only letters, digits, underscores, and hyphens
must be unique
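
These rules translate naturally into a small validator. The sketch below is illustrative and not the CLI's actual check: it assumes "letters" means ASCII letters, takes the 64-character limit from the default settings, and uses a hypothetical function name.

```python
import re

MAX_NAME_LENGTH = 64  # "Max collection name length" from the default settings
NAME_PATTERN = re.compile(r"[A-Za-z0-9][A-Za-z0-9_-]*")


def is_valid_collection_name(name, existing=()):
    """Check the naming rules: first character, allowed characters, uniqueness."""
    return (
        len(name) <= MAX_NAME_LENGTH
        and NAME_PATTERN.fullmatch(name) is not None
        and name not in existing
    )
```

For example, `my_docs` and `2024-notes` pass, while `_private` (leading underscore) and `my docs` (space) do not.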

3. Start the service

python main.py

On startup, Hermit will:

warm up embedding and reranker models
start the background indexing worker
restore persisted collections from data/collections.json
scan each collection folder
start watching registered folders for changes

Default bind address:

Host: 0.0.0.0
Port: 8000

4. Search

curl -X POST http://127.0.0.1:8000/search \
	-H 'Content-Type: application/json' \
	-d '{
		"query": "two sum approach",
		"collection": "my_docs",
		"top_k": 5,
		"w_dense": 0.7,
		"w_sparse": 0.3,
		"rerank_candidates": 30
	}'
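
The same request can be sent from Python using only the standard library. This is a minimal client sketch: the helper names are hypothetical, and it assumes the service is reachable at the default address shown above and returns the `results` array described in the HTTP API section.

```python
import json
import urllib.request


def build_search_request(query, collection, base_url="http://127.0.0.1:8000",
                         top_k=5, rerank_candidates=20):
    """Build the POST /search request without sending it."""
    payload = {
        "query": query,
        "collection": collection,
        "top_k": top_k,
        "rerank_candidates": rerank_candidates,
    }
    return urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, collection, **kwargs):
    """Send the request and return the list of result dicts."""
    request = build_search_request(query, collection, **kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))["results"]
```

With the service running, `search("two sum approach", "my_docs")` would return the result objects, each carrying `text`, `source_file`, `chunk_index`, `total_chunks`, and `score`.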

CLI

Hermit currently provides these CLI commands.

`hermit download`

Download all required models and optionally run a basic verification step.

hermit download

Flags:

--force: force re-download
--skip-verify: skip post-download verification

`hermit kb add <name> <dir>`

Register a folder as a collection.

hermit kb add notes ./documents

`hermit kb remove <name>`

Remove a collection and delete its metadata store.

hermit kb remove notes

`hermit kb list`

List all registered collections.

hermit kb list

HTTP API

The current codebase exposes the following endpoints.

`POST /search`

Run hybrid semantic search.

Request example:

{
	"query": "sliding window maximum",
	"collection": "my_docs",
	"top_k": 5,
	"w_dense": 0.7,
	"w_sparse": 0.3,
	"rerank_candidates": 30
}

Response example:

{
	"results": [
		{
			"text": "...",
			"source_file": "/abs/path/to/file.md",
			"chunk_index": 0,
			"total_chunks": 3,
			"score": 0.82
		}
	]
}

`POST /collections/{name}/sync`

Trigger a manual scan/sync for a collection.

Response fields:

added
updated
deleted

Example:

curl -X POST http://127.0.0.1:8000/collections/my_docs/sync

`GET /collections/{name}/status`

Get collection status.

Response fields:

name
folder_path
indexed_files
total_chunks
watching

Example:

curl http://127.0.0.1:8000/collections/my_docs/status

`GET /collections/{name}/tasks`

Get background indexing task status for a collection.

Response fields:

collection
pending_tasks
queued_tasks
in_progress_tasks
worker_alive

Example:

curl http://127.0.0.1:8000/collections/my_docs/tasks

`GET /health`

Server health and runtime info.

Response fields:

status — "ready" or "starting"
uptime — seconds since server start
models_loaded — whether embedding/reranker models are loaded
collections — list of {name, indexed_files, total_chunks} per collection
pending_index_tasks — total background indexing tasks waiting across all collections
qdrant_mode — "local" (embedded) or "standalone" (external Qdrant server)
qdrant_host — host address in standalone mode; null in local mode

Example:

curl http://127.0.0.1:8000/health

Storage layout

By default, Hermit stores its runtime data inside the project directory:

models/: local model cache
data/qdrant/: Qdrant embedded data
data/metadata/: one SQLite database per collection
data/collections.json: persisted collection configuration
cache/dense/ and cache/sparse/: embedding cache (sha256-keyed, 7-day TTL)

That makes the project easy to back up, move, and clean up. No mysterious hidden cave system under your home directory.

Indexing behavior

File handling

recursively scans all non-hidden files
skips any path segment starting with .
reads files as text with utf-8 and errors="replace"
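
The file-handling rules above amount to a recursive walk with a hidden-path filter and a lenient decode. The sketch below shows one way to express them with `pathlib`; the function names are hypothetical and the real scanner may be organized differently.

```python
from pathlib import Path


def iter_visible_files(root):
    """Yield files under `root`, skipping any path segment that starts with '.'."""
    root = Path(root)
    for path in sorted(root.rglob("*")):
        # Only segments below the root are checked, so a hidden parent
        # directory of the collection folder itself does not matter.
        parts = path.relative_to(root).parts
        if path.is_file() and not any(part.startswith(".") for part in parts):
            yield path


def read_text_lenient(path):
    """Read as UTF-8; undecodable bytes become U+FFFD instead of raising."""
    return Path(path).read_text(encoding="utf-8", errors="replace")
```

Reading with `errors="replace"` means a binary or mis-encoded file never aborts a scan, but its content will contain replacement characters, which is why plain-text sources work best.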

Change detection

Hermit tracks indexed files in SQLite and uses SHA256 to detect content changes.

During scanning it handles:

new files: enqueue or index them
modified files: rechunk, re-embed, and replace old chunks
deleted files: remove them from Qdrant and SQLite
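
Change detection of this kind boils down to comparing two path-to-hash maps. The sketch below uses plain dictionaries in place of the SQLite metadata store; the function names and the dictionary shape are assumptions made for illustration.

```python
import hashlib


def content_hash(data):
    """SHA256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def diff_snapshot(stored, current):
    """Compare {path: sha256} maps and classify every path.

    `stored` stands in for what the metadata store remembers,
    `current` for what the latest scan found on disk.
    """
    added = sorted(p for p in current if p not in stored)
    updated = sorted(p for p in current if p in stored and stored[p] != current[p])
    deleted = sorted(p for p in stored if p not in current)
    return {"added": added, "updated": updated, "deleted": deleted}


stored = {"a.md": content_hash(b"old"), "b.md": content_hash(b"same"), "c.md": content_hash(b"gone")}
current = {"a.md": content_hash(b"new"), "b.md": content_hash(b"same"), "d.md": content_hash(b"fresh")}
changes = diff_snapshot(stored, current)
```

Hashing content rather than relying on modification times means a file that is touched but not changed is not re-embedded.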

Chunking rules

default chunk size is 256 tokens (using the embedding model's tokenizer)
adjacent chunks overlap by 32 tokens
empty text is skipped
short text stays as a single chunk
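
The rules above describe a standard overlapping sliding window. The sketch below operates on an already tokenized list, since the real `chunk_text` uses the embedding model's tokenizer, which is not reproduced here. The function name is hypothetical.

```python
def sliding_window(tokens, size=256, overlap=32):
    """Split a token list into windows of `size` that overlap by `overlap`."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    if not tokens:
        return []                 # empty text is skipped
    if len(tokens) <= size:
        return [tokens]           # short text stays as a single chunk
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                 # the last window already reaches the end
    return chunks


# 600 stand-in tokens produce windows starting at 0, 224, and 448.
windows = sliding_window(list(range(600)))
```

With the defaults, each window advances by 224 tokens, so the final 32 tokens of one chunk reappear at the start of the next.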

Known limitations

there is currently no API to create or delete collections; use the CLI for that
w_dense and w_sparse are accepted by the API, but the current implementation uses RRF fusion rather than explicit weighted score fusion
all files are treated as text; PDF, image, and Office parsing are out of scope
the maximum number of collections is currently 4
first-time model downloads may take a while and use noticeable disk space

Development and testing

The test suite currently covers:

CLI validation and collection management
scanner add/update/delete logic
task queue status reporting
selected API route behavior

Run tests with:

pytest

Design notes

For implementation details, see:

docs/design.md

In one sentence

If you want a small, local-first, multi-collection semantic search service that quietly gets the job done, Hermit fits the brief nicely.
