Configuration
All configuration lives in .env.local. There are eight knobs.
Your OpenAI API key. Used for embeddings, the LLM reranker, and chat completions.
OPENAI_API_KEY=sk-proj-...
Get one at platform.openai.com/api-keys.
Embedding model default: text-embedding-3-small (1536 dims, $0.02 per 1M tokens)
Alternatives:
- text-embedding-3-large (3072 dims, $0.13 per 1M tokens): slightly better recall, about 6.5x the cost, roughly 2x latency
- text-embedding-ada-002 (1536 dims, deprecated but still works): older model
Stick with the small model unless you have measured retrieval quality issues.
Chat model default: gpt-4o-mini
Used for both answer generation and the LLM reranker.
Alternatives:
- gpt-4o: better, more expensive
- gpt-4-turbo: older, and more expensive than 4o
gpt-4o-mini is the right default. It is cheap, fast, follows the "answer only from the passages" instruction reliably, and reranks well enough in one batched call. If the reranker call fails for any reason, retrieval falls back to a deterministic lexical reranker, so a model hiccup never breaks a question.
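The fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the project's code: `llm_rerank` is a hypothetical stand-in for whatever function makes the batched model call, and the lexical scorer shown here (a count of distinct query terms found in each passage) is an assumption about what a "deterministic lexical reranker" might look like.

```python
import re
from typing import Callable, List


def lexical_rerank(query: str, passages: List[str], top_k: int) -> List[str]:
    """Deterministic fallback: rank by how many distinct query terms a passage contains."""
    terms = set(re.findall(r"\w+", query.lower()))

    def score(passage: str) -> int:
        return len(terms & set(re.findall(r"\w+", passage.lower())))

    # sorted() is stable, so tied passages keep their original retrieval order.
    return sorted(passages, key=score, reverse=True)[:top_k]


def rerank_with_fallback(
    llm_rerank: Callable[[str, List[str], int], List[str]],
    query: str,
    passages: List[str],
    top_k: int,
) -> List[str]:
    """Try the LLM reranker first; on any failure, fall back to the lexical one."""
    try:
        return llm_rerank(query, passages, top_k)
    except Exception:
        # Timeout, rate limit, malformed model output, and so on.
        return lexical_rerank(query, passages, top_k)
```

Catching a broad `Exception` is deliberate in this sketch: the goal is that a question still gets an answer whatever went wrong in the model call.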
CHUNK_SIZE default: 1000 (characters)
Bigger chunks = more context per retrieval, fewer chunks total, fewer embeddings. Smaller chunks = more precise retrieval, more granularity.
| Content type | Recommended size |
|---|---|
| Code, structured docs | 500-800 |
| Prose (articles, books) | 1000-1500 |
| Tabular data, reference | 200-500 with metadata |
Chunk overlap default: 200 (characters)
Overlap means a sentence that spans two chunks is findable in either. 20 percent of CHUNK_SIZE is the standard heuristic.
Set to 0 for highly structured content (code, JSON) where there's no sentence to straddle.
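To make the interaction between chunk size and overlap concrete, here is a minimal character-based sliding-window chunker. It is a sketch under the assumption that chunks are cut at fixed character offsets; the function name and signature are illustrative, and the project's real chunker may split on sentence or paragraph boundaries instead.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
    """Split text into chunk_size-character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window starts 800 characters after the previous one, so the last 200 characters of one chunk are the first 200 of the next.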
TOP_K default: 5
How many chunks the reranker keeps and passes to the model as context. Hybrid search first pulls a wider candidate pool (roughly four times TOP_K, minimum 20) for the reranker to work over, then the reranker trims to TOP_K.
| TOP_K | Trade-off |
|---|---|
| 1-3 | Fast, cheap, may miss the answer |
| 5-7 | Balanced (recommended) |
| 10+ | More context, more cost, can confuse the model |
The right value depends on your chunk size and content density.
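The relationship between TOP_K and the candidate pool can be written as a one-liner. The text says "roughly four times TOP_K, minimum 20", so the exact formula below is an assumption, and the function and parameter names are made up for illustration.

```python
def candidate_pool_size(top_k: int, multiplier: int = 4, floor: int = 20) -> int:
    """Number of candidates hybrid search hands to the reranker (assumed formula)."""
    return max(floor, multiplier * top_k)


# The floor means small TOP_K values all share the same pool size.
for k in (3, 5, 10):
    print(k, candidate_pool_size(k))
```

One consequence under this assumption: lowering TOP_K below 5 no longer shrinks the pool, because the floor of 20 takes over.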
Default: 1 for both HYBRID_DENSE_WEIGHT and HYBRID_LEXICAL_WEIGHT (plain, unweighted Reciprocal Rank Fusion)
Hybrid search fuses two rankings, dense (embeddings) and lexical (BM25), with weighted RRF. Each ranking contributes weight / (60 + rank) to a chunk's fused score, where rank is the chunk's position in that ranking. Equal weights reproduce plain RRF, which is the right default. Tilt the weights when you know your corpus leans one way:
| Corpus | Suggestion |
|---|---|
| Codes, SKUs, error strings, identifiers | Raise HYBRID_LEXICAL_WEIGHT to 1.5-2 |
| Prose, articles, paraphrase-heavy text | Raise HYBRID_DENSE_WEIGHT to 1.5-2 |
| Mixed or unknown | Leave both at 1 |
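As a rough illustration of the fusion formula, the sketch below combines two ranked lists of chunk ids. It assumes ranks start at 1 and that ties are broken by chunk id, neither of which the text specifies, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

RRF_K = 60  # the constant in weight / (60 + rank)


def weighted_rrf(
    dense: List[str],
    lexical: List[str],
    dense_weight: float = 1.0,
    lexical_weight: float = 1.0,
) -> List[str]:
    """Fuse two rankings of chunk ids with weighted Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for weight, ranking in ((dense_weight, dense), (lexical_weight, lexical)):
        if weight == 0:
            continue  # a zero weight drops that ranking entirely
        for rank, chunk_id in enumerate(ranking, start=1):  # 1-based rank (assumed)
            scores[chunk_id] += weight / (RRF_K + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


dense = ["a", "b", "c"]
lexical = ["c", "a", "d"]
print(weighted_rrf(dense, lexical))                      # equal weights
print(weighted_rrf(dense, lexical, lexical_weight=2.0))  # lean lexical
```

With equal weights, chunk "a" leads because it ranks well in both lists. Doubling the lexical weight moves "c", the top lexical hit, to the front.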
Weights of 0 are allowed: set the lexical weight to 0 for dense-only fusion, or the dense weight to 0 for lexical-only. Any non-negative finite value is accepted; an invalid value falls back to the default.
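The validation rule above (non-negative, finite, otherwise fall back to the default) might look something like this. It is a sketch, not the project's parser: `parse_weight` is a hypothetical helper, and it assumes the values from .env.local are already present in the process environment.

```python
import math
import os
from typing import Optional


def parse_weight(raw: Optional[str], default: float = 1.0) -> float:
    """Return a non-negative finite weight, or the default if raw is unusable."""
    if raw is None:
        return default  # variable not set
    try:
        value = float(raw)
    except ValueError:
        return default  # not a number at all, e.g. "" or "high"
    if not math.isfinite(value) or value < 0:
        return default  # rejects nan, inf, and negative values
    return value


dense_weight = parse_weight(os.environ.get("HYBRID_DENSE_WEIGHT"))
lexical_weight = parse_weight(os.environ.get("HYBRID_LEXICAL_WEIGHT"))
```

Note that "0" parses to 0.0 and is kept, which is what allows dense-only or lexical-only fusion.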
- Pick a PDF and write 10 questions you know the answer to.
- Run with defaults. Score correctness manually.
- If retrieval is missing relevant chunks, increase TOP_K or decrease CHUNK_SIZE.
- If an exact term is missed, confirm it is in the extracted text; the BM25 side of hybrid search should otherwise surface it.
- If the model is hallucinating, check that retrieval is finding the right chunks; the citations make it obvious which passages were used.
- If responses are slow or costly, decrease TOP_K, which shrinks both the candidate pool and the reranking call (until the pool reaches its floor of 20 candidates).
For a 50-page PDF (roughly 100k characters):
- 100 chunks at 1000 chars each, ignoring overlap (about 125 with the default 200-character overlap).
- Embedding cost: budgeting 100k tokens (an upper bound, since English text runs closer to 4 characters per token) at roughly £0.000016 per 1k, about £0.0016 once.
- Each question: 1 question embedding, 1 reranking call over the candidate pool, and 1 chat completion. With gpt-4o-mini the rerank plus generation is well under a penny.
- Total per question: under £0.002.
You can answer 1000 questions on a 50-page PDF for around £1 to £2. Drop the reranker if you want to remove one call.
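To rerun this estimate with your own numbers, the arithmetic can be sketched as below. The two constants are the figures quoted above; the function names are made up for illustration, and the per-question figure is a ceiling rather than a measured cost.

```python
EMBED_PRICE_PER_1K_TOKENS_GBP = 0.000016  # embedding price quoted above
PER_QUESTION_CEILING_GBP = 0.002          # "under £0.002" per question, used as a ceiling


def embedding_cost(tokens: int) -> float:
    """One-off cost of embedding the document."""
    return tokens / 1000 * EMBED_PRICE_PER_1K_TOKENS_GBP


def total_cost(tokens_to_embed: int, questions: int) -> float:
    """Upper-bound total: one embedding pass plus a per-question ceiling."""
    return embedding_cost(tokens_to_embed) + questions * PER_QUESTION_CEILING_GBP


print(f"embedding: £{embedding_cost(100_000):.4f}")
print(f"1000 questions: £{total_cost(100_000, 1000):.2f} at most")
```

The embedding pass is a rounding error next to the per-question calls, so the chat model and TOP_K are where cost tuning pays off.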