Configuration

sarmakska edited this page Jun 7, 2026 · 3 revisions

All configuration lives in .env.local. There are eight knobs.

Required

OPENAI_API_KEY

Your OpenAI API key. Used for embeddings, the LLM reranker, and chat completions.

OPENAI_API_KEY=sk-proj-...

Get one at platform.openai.com/api-keys.

Optional (with defaults)

EMBEDDING_MODEL

Default: text-embedding-3-small (1536 dims, $0.02 per 1M tokens)

Alternatives:

  • text-embedding-3-large (3072 dims, $0.13 per 1M tokens): slightly better recall, 6.5x the cost, 2x latency
  • text-embedding-ada-002 (1536 dims, legacy but still works): older generation, superseded by the text-embedding-3 models

Stick with the small model unless you have measured retrieval quality issues.

CHAT_MODEL

Default: gpt-4o-mini

Used for both answer generation and the LLM reranker.

Alternatives:

  • gpt-4o: better, more expensive
  • gpt-4-turbo: older, and more expensive than gpt-4o

gpt-4o-mini is the right default. It is cheap, fast, follows the "answer only from the passages" instruction reliably, and reranks well enough in one batched call. If the reranker call fails for any reason, retrieval falls back to a deterministic lexical reranker, so a model hiccup never breaks a question.
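
The fallback described above can be sketched as follows. This is a minimal illustration, not the project's actual code: `lexical_rerank`, `rerank_with_fallback`, and the `llm_rerank` callable are hypothetical names, and the term-overlap score stands in for whatever deterministic lexical scoring the app really uses.

```python
import re
from typing import Callable, Sequence


def lexical_rerank(question: str, chunks: Sequence[str], top_k: int) -> list[str]:
    """Deterministic fallback: rank chunks by how many question terms they share."""
    terms = set(re.findall(r"\w+", question.lower()))

    def overlap(chunk: str) -> int:
        return len(terms & set(re.findall(r"\w+", chunk.lower())))

    # sorted() is stable, so tied chunks keep their original retrieval order.
    return sorted(chunks, key=overlap, reverse=True)[:top_k]


def rerank_with_fallback(
    question: str,
    chunks: Sequence[str],
    top_k: int,
    llm_rerank: Callable[[str, Sequence[str], int], list[str]],
) -> list[str]:
    """Try the LLM reranker first; on any failure, fall back to lexical reranking."""
    try:
        return llm_rerank(question, chunks, top_k)
    except Exception:
        # Hypothetical policy: any reranker error is treated as recoverable.
        return lexical_rerank(question, chunks, top_k)
```

Because the fallback needs no network call, a question still gets ranked passages even when the model is unavailable.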

CHUNK_SIZE

Default: 1000 (characters)

Bigger chunks = more context per retrieval, fewer chunks total, fewer embeddings. Smaller chunks = more precise retrieval, more granularity.

| Content type | Recommended size (characters) |
| --- | --- |
| Code, structured docs | 500-800 |
| Prose (articles, books) | 1000-1500 |
| Tabular data, reference | 200-500 with metadata |

CHUNK_OVERLAP

Default: 200 (characters)

Overlap means a sentence that spans two chunks is findable in either. 20 percent of CHUNK_SIZE is the standard heuristic.

Set to 0 for highly structured content (code, JSON) where there's no sentence to straddle.
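
To make the interaction between CHUNK_SIZE and CHUNK_OVERLAP concrete, here is a minimal sketch of a character-based sliding window. It assumes a plain fixed-step splitter; the project's real chunker may also respect sentence or paragraph boundaries, and `chunk_text` is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next.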

TOP_K

Default: 5

How many chunks the reranker keeps and passes to the model as context. Hybrid search first pulls a wider candidate pool (roughly four times TOP_K, minimum 20) for the reranker to work over, then the reranker trims to TOP_K.

| TOP_K | Trade-off |
| --- | --- |
| 1-3 | Fast, cheap, may miss the answer |
| 5-7 | Balanced (recommended) |
| 10+ | More context, more cost, can confuse the model |

The right value depends on your chunk size and content density.
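
The candidate pool rule can be written as one line of arithmetic. This sketch takes "roughly four times TOP_K, minimum 20" literally; the exact multiplier and rounding in the app may differ, and `candidate_pool_size` is a hypothetical name.

```python
def candidate_pool_size(top_k: int, multiplier: int = 4, minimum: int = 20) -> int:
    """Number of hybrid-search candidates handed to the reranker."""
    return max(minimum, multiplier * top_k)


for top_k in (3, 5, 10):
    print(top_k, candidate_pool_size(top_k))
```

Under these assumptions the pool stays at 20 for any TOP_K of 5 or less and only grows beyond that.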

HYBRID_DENSE_WEIGHT and HYBRID_LEXICAL_WEIGHT

Default: 1 and 1 (plain, unweighted Reciprocal Rank Fusion)

Hybrid search fuses two rankings, dense (embeddings) and lexical (BM25), with weighted RRF. Each ranking contributes weight / (60 + rank). Equal weights reproduce plain RRF, which is the right default. Tilt the weights when you know your corpus leans one way:

| Corpus | Suggestion |
| --- | --- |
| Codes, SKUs, error strings, identifiers | Raise HYBRID_LEXICAL_WEIGHT to 1.5-2 |
| Prose, articles, paraphrase-heavy text | Raise HYBRID_DENSE_WEIGHT to 1.5-2 |
| Mixed or unknown | Leave both at 1 |

Weights of 0 are allowed: set the lexical weight to 0 for dense-only fusion, or the dense weight to 0 for lexical-only. Any non-negative finite value is accepted; an invalid value falls back to the default.
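
A minimal sketch of weighted RRF and the weight validation described above. It assumes ranks start at 1 and that ties are broken by document id; both are illustrative choices, and `parse_weight` and `weighted_rrf` are hypothetical names rather than the project's functions.

```python
import math
from collections import defaultdict
from typing import Optional, Sequence

RRF_K = 60  # the constant in weight / (60 + rank)


def parse_weight(raw: Optional[str], default: float = 1.0) -> float:
    """Accept any non-negative finite number; anything else falls back to the default."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return default
    return value if math.isfinite(value) and value >= 0 else default


def weighted_rrf(
    dense: Sequence[str],
    lexical: Sequence[str],
    dense_weight: float = 1.0,
    lexical_weight: float = 1.0,
) -> list[str]:
    """Fuse two ranked lists of document ids; each adds weight / (60 + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for weight, ranking in ((dense_weight, dense), (lexical_weight, lexical)):
        for rank, doc_id in enumerate(ranking, start=1):  # assumption: 1-based ranks
            scores[doc_id] += weight / (RRF_K + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, with `dense = ["a", "b", "c"]` and `lexical = ["c", "a", "d"]`, equal weights rank `a` first because it sits high in both lists, while setting `dense_weight=0` makes the lexical order decide.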

Tuning workflow

  1. Pick a PDF and write 10 questions you know the answer to.
  2. Run with defaults. Score correctness manually.
  3. If retrieval is missing relevant chunks, increase TOP_K or decrease CHUNK_SIZE.
  4. If an exact term is missed, confirm it is in the extracted text; the lexical (BM25) side of hybrid search should otherwise surface it.
  5. If the model is hallucinating, check that retrieval is finding the right chunks; the citations make it obvious which passages were used.
  6. If responses are slow or costly, decrease TOP_K. That shrinks the context passed to the model, and it also shrinks the candidate pool and the reranking call until the pool reaches its minimum of 20.
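
To separate retrieval problems from generation problems while working through these steps, a small hit-rate check can help. This is only a sketch: `retrieve` is a hypothetical callable standing in for the app's retrieval step, and matching on an expected phrase is a rough proxy for manual scoring, not a replacement for it.

```python
from typing import Callable, Sequence


def retrieval_hit_rate(
    cases: Sequence[tuple[str, str]],
    retrieve: Callable[[str, int], Sequence[str]],
    top_k: int = 5,
) -> float:
    """Fraction of questions whose expected phrase appears in a retrieved chunk.

    cases holds (question, expected_phrase) pairs written by hand.
    """
    if not cases:
        return 0.0
    hits = 0
    for question, expected in cases:
        chunks = retrieve(question, top_k)
        if any(expected.lower() in chunk.lower() for chunk in chunks):
            hits += 1
    return hits / len(cases)
```

If the hit rate is low, the fix is on the retrieval side (TOP_K, CHUNK_SIZE, weights); if it is high but answers are still wrong, look at the generation step instead.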

Cost example

For a 50-page PDF (roughly 100k characters):

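  • About 125 chunks of 1000 chars each, because the default 200-char overlap advances each chunk by only 800 chars.
  • Embedding cost: about 125k chars is roughly 31k tokens (at about 4 chars per token); at roughly £0.000016 per 1k tokens, that is about £0.0005 once.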
  • Each question: 1 question embedding, 1 reranking call over the candidate pool, and 1 chat completion. With gpt-4o-mini the rerank plus generation is well under a penny.
  • Total per question: under £0.002.

You can answer 1000 questions on a 50-page PDF for around £1 to £2. Drop the reranker if you want to remove one call.
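
The arithmetic behind these estimates can be sketched as below, in US dollars to match the model prices listed earlier. Several inputs are assumptions, not values from this page: about 4 characters per token, gpt-4o-mini at $0.15 per 1M input tokens and $0.60 per 1M output tokens (check current pricing), and a 300-token answer. The sketch ignores the question embedding, prompt overhead, and reranker output, all of which are small.

```python
import math

EMBED_USD_PER_M = 0.02     # text-embedding-3-small, per 1M tokens (from the defaults above)
CHAT_IN_USD_PER_M = 0.15   # ASSUMPTION: gpt-4o-mini input price per 1M tokens
CHAT_OUT_USD_PER_M = 0.60  # ASSUMPTION: gpt-4o-mini output price per 1M tokens
CHARS_PER_TOKEN = 4        # ASSUMPTION: rough average for English text


def chunk_count(chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> int:
    """Number of sliding-window chunks for a document of the given length."""
    step = chunk_size - chunk_overlap
    return max(1, math.ceil((chars - chunk_overlap) / step))


def indexing_cost_usd(chars: int, chunk_size: int = 1000, chunk_overlap: int = 200) -> float:
    """One-off embedding cost; overlapping text is embedded twice."""
    tokens = chunk_count(chars, chunk_size, chunk_overlap) * chunk_size / CHARS_PER_TOKEN
    return tokens / 1_000_000 * EMBED_USD_PER_M


def question_cost_usd(top_k: int = 5, chunk_size: int = 1000, answer_tokens: int = 300) -> float:
    """Rerank input over the candidate pool plus generation over TOP_K chunks."""
    pool = max(20, 4 * top_k)
    chunk_tokens = chunk_size / CHARS_PER_TOKEN
    input_tokens = (pool + top_k) * chunk_tokens
    return (input_tokens * CHAT_IN_USD_PER_M + answer_tokens * CHAT_OUT_USD_PER_M) / 1_000_000


print(chunk_count(100_000), indexing_cost_usd(100_000), question_cost_usd())
```

Under these assumptions the per-question cost is dominated by the reranking input, which is why trimming the candidate pool or dropping the reranker has the largest effect.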