AsterSearch is a backend-centric, standalone search engine that indexes documents and serves ranked search results using a classic inverted index + BM25 ranking model.
It’s designed as:
- A library + HTTP service you can embed or run standalone
- Focused on developer use (like a tiny self-hosted Elastic/Lucene)
- Optimized for text-heavy documents: articles, notes, blog posts, docs
- Index: named collection of documents (e.g. `articles`, `notes`, `products`)
- Document: JSON object with a unique `id` and one or more text fields
- Field: named attribute (e.g. `title`, `body`, `tags`) with optional weights
- Token: normalized term after tokenization/stopword removal
- Posting List: list of `(docId, termFrequency, positions)` for a token
- Segment: immutable chunk of index on disk (allows incremental indexing + merge)
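To illustrate how a Token and its positions might come out of tokenization, here is a minimal Go sketch. The `Token` type, the `Tokenize` function, and the tiny stopword list are hypothetical and are not taken from the AsterSearch code base; the real `standard_en_id` tokenizer may normalize differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopwords is a tiny illustrative list (assumption), not AsterSearch's real one.
var stopwords = map[string]bool{
	"a": true, "an": true, "the": true, "is": true, "of": true, "for": true,
}

// Token is a normalized term plus its word position within the field.
type Token struct {
	Term     string
	Position int
}

// Tokenize lowercases the text, splits on anything that is not a letter or
// digit, and drops stopwords. Positions count every word, including removed
// stopwords, so distances between surviving terms stay meaningful.
func Tokenize(text string) []Token {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	tokens := make([]Token, 0, len(words))
	for pos, w := range words {
		if stopwords[w] {
			continue
		}
		tokens = append(tokens, Token{Term: w, Position: pos})
	}
	return tokens
}

func main() {
	for _, t := range Tokenize("Understanding BM25 for Search") {
		fmt.Printf("%s@%d\n", t.Term, t.Position)
	}
}
```

Grouping these tokens by term yields the `(docId, termFrequency, positions)` triples that make up a posting list.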
- Indexing Input (HTTP API)
  - Endpoint: `POST /v1/indexes/{indexName}/documents`
  - Body (JSON, batched):

        {
          "documents": [
            {
              "id": "doc-123",
              "title": "Understanding BM25 for Search",
              "body": "BM25 is a ranking function used by search engines...",
              "tags": ["search", "ranking"]
            }
          ]
        }

- Search Input (HTTP API)
  - Endpoint: `GET /v1/search`
  - Query params:
    - `index` (string, required) – which index to search
    - `q` (string, required) – user query, e.g. `"bm25 search ranking"`
    - `page` (int, default `1`)
    - `pageSize` (int, default `10`, max `100`)
    - `filters` (JSON or simple syntax, optional), e.g. `tags:search`
- Admin / Schema Input
  - Endpoint: `POST /v1/indexes`
  - Body:

        {
          "name": "articles",
          "fields": {
            "title": { "type": "text", "weight": 2.0 },
            "body": { "type": "text", "weight": 1.0 },
            "tags": { "type": "keyword", "filterOnly": true }
          },
          "tokenizer": "standard_en_id"
        }

- Search Output
  - Response:

        {
          "index": "articles",
          "query": "bm25 search ranking",
          "totalHits": 178,
          "page": 1,
          "pageSize": 10,
          "results": [
            {
              "id": "doc-123",
              "score": 12.84,
              "highlights": {
                "title": "Understanding <em>BM25</em> for <em>Search</em>",
                "body": "... <em>BM25</em> is a ranking function used by many <em>search</em> engines ..."
              },
              "metadata": { "tags": ["search", "ranking"] }
            }
          ],
          "timingMs": 7
        }

- Indexing Output
  - Response:

        {
          "indexed": 1,
          "errors": [],
          "segmentId": "seg-2025-11-20T12:00:01Z"
        }

- Admin Output
  - `GET /v1/indexes/{indexName}` → schema, stats, disk usage, doc count, segments
- Multiple named indexes in one process
- Schema-aware indexing:
  - text fields: tokenized + BM25 scoring
  - keyword fields: exact match / filter / aggregation only
- BM25 ranking:
  - Tunable `k1`, `b` per index
  - Field weights (e.g. title > body)
- Boolean queries:
  - Default: AND between terms
  - Support `+must -must_not "phrases"` in the query string
- Highlights/snippets:
  - Term position tracking → highlight matched tokens
  - Snippet extraction around best-matching spans
- Filters:
  - Exact match filters (e.g. `tags:search`, `lang:en`)
  - Numeric range filters (e.g. `views > 1000`)
- Pagination / ranked results:
  - `offset/limit` or `page/pageSize`
- Latency:
  - P50 < 20 ms for index with 100k docs
  - P95 < 100 ms for index with 1M docs
- Throughput:
  - 100 queries/s on a single node (target)
- Indexing:
  - Support at least 1M docs on a single node
- Durability:
  - Write-ahead log (WAL) or append-only segment files
  - Crash-safe after WAL flush
- HTTP API Layer
  - Validates requests
  - Translates to internal commands (IndexDocument, SearchQuery, etc.)
- Index Manager
  - Manages multiple indexes
  - Keeps in-memory registry: schema, BM25 params, segment list
- Index Writer
  - Tokenizes documents
  - Updates in-memory posting lists
  - Periodically flushes to segment files (immutable)
  - Maintains term dictionary + doc store (for highlights/snippets)
- Searcher
  - Parses query string → tokens + operators
  - Looks up posting lists
  - Applies BM25
  - Applies filters
  - Produces top-K results with scores
- Storage Engine
  - On disk:
    - `segments/seg-xxx.postings` (compressed posting lists)
    - `segments/seg-xxx.docs` (document store: original fields)
    - `segments/seg-xxx.meta` (headers, stats, doc count)
  - Optional:
    - memory-mapped files for fast access
- Background Jobs
  - Segment merge (compacts many small segments into fewer big ones)
  - Index optimization (rebuild stats, prune tombstone docs)
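The Searcher's first step, turning a query string into tokens and operators, might look roughly like the Go sketch below. The `Query` type and `ParseQuery` function are hypothetical; the sketch handles bare terms (AND by default), `+term`, `-term`, and `"quoted phrases"`, and does not cover prefixed phrases such as `+"..."` or field-scoped filters.

```go
package main

import (
	"fmt"
	"strings"
)

// Query is a parsed query string.
type Query struct {
	Must    []string // bare terms (default AND) and +terms
	MustNot []string // -terms
	Phrases []string // "quoted phrases", lowercased
}

// ParseQuery splits a query string into required terms, excluded terms, and phrases.
func ParseQuery(q string) Query {
	var out Query
	rest := strings.TrimSpace(q)
	for rest != "" {
		if rest[0] == '"' {
			body := rest[1:]
			end := strings.IndexByte(body, '"')
			if end < 0 {
				end = len(body) // unterminated quote: treat the remainder as the phrase
				rest = ""
			} else {
				rest = strings.TrimSpace(body[end+1:])
			}
			if p := strings.ToLower(strings.TrimSpace(body[:end])); p != "" {
				out.Phrases = append(out.Phrases, p)
			}
			continue
		}
		word := rest
		if i := strings.IndexAny(rest, " \t\r\n"); i >= 0 {
			word, rest = rest[:i], strings.TrimSpace(rest[i:])
		} else {
			rest = ""
		}
		term := strings.ToLower(strings.TrimLeft(word, "+-"))
		if term == "" {
			continue
		}
		if word[0] == '-' {
			out.MustNot = append(out.MustNot, term)
		} else {
			out.Must = append(out.Must, term) // bare terms behave like +term
		}
	}
	return out
}

func main() {
	fmt.Printf("%+v\n", ParseQuery(`bm25 +ranking -lucene "inverted index"`))
}
```

In a full pipeline each term would also pass through the index's tokenizer so that query terms and indexed tokens are normalized the same way.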
- Posting List Entry:

      struct Posting {
          docId: u32;
          termFreq: u16;
          positions: Vec<u16>; // token positions in field
      }

- Inverted Index:

      Map<Term, Map<FieldName, Vec<Posting>>>

- Document Store:
  - Map `docId → serialized JSON` or binary encoded doc (for retrieval & highlights)
- Index Metadata:

      struct IndexStats {
          docCount: u64;
          avgFieldLength: Map<FieldName, f64>;
          totalDocsDeleted: u64;
      }

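The structures above are written as pseudocode; a direct Go translation, with a small in-memory `AddDocument` to show how postings and stats get filled in, could look like this. The `Index` type, `NewIndex`, and `AddDocument` are hypothetical, the input is assumed to be already tokenized, and average field length is assumed to be total tokens divided by document count.

```go
package main

import "fmt"

// Posting is one document's entry in a term's posting list.
type Posting struct {
	DocID     uint32
	TermFreq  uint16
	Positions []uint16 // token positions in the field (wraps past 65535)
}

// IndexStats mirrors the index metadata above.
type IndexStats struct {
	DocCount         uint64
	AvgFieldLength   map[string]float64
	TotalDocsDeleted uint64
}

// Index maps term → field → postings and keeps running stats.
type Index struct {
	Postings map[string]map[string][]Posting
	Stats    IndexStats
	totalLen map[string]uint64 // total tokens per field, used for averages
}

// NewIndex returns an empty in-memory index.
func NewIndex() *Index {
	return &Index{
		Postings: map[string]map[string][]Posting{},
		Stats:    IndexStats{AvgFieldLength: map[string]float64{}},
		totalLen: map[string]uint64{},
	}
}

// AddDocument records pre-tokenized fields (field name → terms in order).
func (ix *Index) AddDocument(docID uint32, fields map[string][]string) {
	for field, terms := range fields {
		// Collect the positions of each distinct term in this field.
		positions := map[string][]uint16{}
		for pos, term := range terms {
			positions[term] = append(positions[term], uint16(pos))
		}
		for term, ps := range positions {
			byField := ix.Postings[term]
			if byField == nil {
				byField = map[string][]Posting{}
				ix.Postings[term] = byField
			}
			byField[field] = append(byField[field], Posting{
				DocID: docID, TermFreq: uint16(len(ps)), Positions: ps,
			})
		}
		ix.totalLen[field] += uint64(len(terms))
	}
	ix.Stats.DocCount++
	for field, total := range ix.totalLen {
		ix.Stats.AvgFieldLength[field] = float64(total) / float64(ix.Stats.DocCount)
	}
}

func main() {
	ix := NewIndex()
	ix.AddDocument(1, map[string][]string{
		"title": {"bm25", "search"},
		"body":  {"bm25", "ranks", "bm25"},
	})
	fmt.Printf("%+v\n", ix.Postings["bm25"]["body"])
	fmt.Printf("%+v\n", ix.Stats)
}
```

Adding documents in ascending `docId` order keeps each posting list sorted, which is what makes AND intersection and later compression straightforward.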
(You can change the stack below later, but it is a solid “high spec” default.)
- Language: Go or Rust (for perf & systems feel)
- Storage:
  - Raw files on disk (your own layout) OR
  - BoltDB/Badger for key-value (term → postings location, docId → doc)
- Config:
  - TOML/YAML config for global defaults & per-index overrides
- Deployment:
  - Single binary `astersearch`
  - Runs as systemd service on VPS
  - Exposes HTTP on `:8080` behind Nginx/Caddy if needed
- HTTP REST API for:
  - Index creation/listing
  - Document indexing/updating/deleting
  - Search queries
  - Stats/health (`/v1/health`, `/v1/indexes/{name}/stats`)
- (Optional later) gRPC or embedded library API:
  - `Search(index, query)`
  - `IndexDocuments(index, docs)`
- Defaults: listens on `:8080`, stores indexes in `data/indexes`, enables request logs + metrics.
- Flags: `--config` (TOML/YAML), `--listen`, `--index-path`.
- Environment: `ASTERSEARCH_INDEX_PATH` overrides the storage directory (kept for backward compatibility).
- Config examples live in `config/examples/config.toml` and `config/examples/config.yaml` and support per-index defaults (tokenizer, BM25 `k1`/`b`, merge interval/threshold, `flush_max_documents`/`flush_max_postings`) as well as logging/metrics toggles.
- Protect admin endpoints (`/v1/indexes`, `/v1/indexes/{name}`, `/v1/indexes/{name}/stats`) by setting one or more `security.admin_tokens` entries.
- Lock down indexing (`/v1/indexes/{name}/documents`) with `security.index_tokens` (admin tokens are also accepted for writers).
- Tokens are checked from `Authorization: Bearer <token>` or `X-API-Key: <token>` headers. When no tokens are configured, the endpoints remain open for backward compatibility.
- Throttle abusive clients with `security.rate_limit.requests_per_min` (per-client, per-index). Example TOML fragment:

      [security]
      admin_tokens = ["super-secret"]
      index_tokens = ["writer-token"]

      [security.rate_limit]
      requests_per_min = 120

Run the server with the example config, or override settings inline:

    go run ./... --config ./config/examples/config.toml
    # or override inline
    go run ./... --listen :8080 --index-path ./data/indexes

- JSON request logging is on by default; disable via `logging.request_logs = false`.
- `/v1/metrics` returns basic counters (requests, errors, last status/latency) when enabled.
- Standard `/v1/health` endpoint is included for liveness probes.

The `loadtest/` directory provides a k6 scenario that drives `POST /v1/indexes/{index}/documents` and `GET /v1/search` with realistic payloads. Thresholds are pre-wired for the latency/QPS targets above; see `loadtest/README.md` for commands and expected outputs.

- Install the binary at `/usr/local/bin/astersearch` and copy `deploy/systemd/astersearch.service` + `deploy/systemd/astersearch.env` to `/etc/astersearch/`.
- Ensure the working directory (default `/var/lib/astersearch`) exists and is writable by the `astersearch` user.
- Reload + start:

      sudo systemctl daemon-reload
      sudo systemctl enable --now astersearch.service
      sudo journalctl -u astersearch -f

For TLS or path normalization, place Nginx in front of the service on `:8080`:

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

Keep `/v1/health` and `/v1/metrics` reachable for monitoring.