How RAG Works

sarmakska edited this page Jun 7, 2026 · 3 revisions

Retrieval-Augmented Generation in plain English, using this repo as the running example. This page covers the full pipeline as implemented: chunking, embedding, hybrid retrieval, reranking, and citation streaming.

The problem RAG solves

Large language models know a lot, but they do not know your data. They were trained on a snapshot of the internet, not your internal docs, your contracts, or last quarter's strategy memo.

You can fix this two ways:

  1. Fine-tune the model on your data (expensive, slow, hard to update).
  2. Retrieval-Augmented Generation (cheap, fast, easy to update).

For most product use cases, where the data changes often and answers need to be traceable to a source, RAG is the better first choice.

The trick

When the user asks a question, instead of sending it straight to the model, you:

  1. Find the chunks of your data most relevant to the question.
  2. Stuff those chunks into the prompt as context.
  3. Tell the model to answer using only that context, and to cite the chunks it uses.

The model does not need to know your data. It needs to be good at reading the supplied passages and answering from them, which is exactly what these models are good at.

The pipeline in this repo

graph LR
  A[Chunk] --> B[Embed]
  B --> C[Hybrid retrieve]
  C --> D[Rerank]
  D --> E[Generate + cite]

1. Chunk, page by page

PDFs are too big to send whole, so they are split into chunks. This repo extracts text one page at a time (lib/pdf.ts), then chunks at 1000 characters with a 200-character overlap while recording the page each chunk starts on (lib/chunker.ts). That recorded page is what every citation points back to.

How big should a chunk be?

  • Too small (50 chars): no context, retrieval is noisy.
  • Too big (10,000 chars): you waste tokens on irrelevant text.
  • Sweet spot: a paragraph or two. 500 to 1500 chars works for most prose.

The 200-character overlap keeps a sentence that straddles two chunks findable in either.
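As a rough illustration of the chunking step, here is a minimal sliding-window sketch. It is not the code in lib/chunker.ts: the PageText and PageChunk shapes and the chunkPages name are hypothetical, and it assumes that page texts are concatenated so a chunk may run across a page boundary while still being labelled with the page it starts on.

```typescript
// Hypothetical shapes for illustration; the real types in lib/pdf.ts and
// lib/chunker.ts may differ.
interface PageText {
  page: number; // 1-based page number
  text: string;
}

interface PageChunk {
  text: string;
  page: number; // page on which the chunk starts
}

function chunkPages(pages: PageText[], size = 1000, overlap = 200): PageChunk[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");

  // Concatenate all pages, remembering the offset at which each page begins.
  let full = "";
  const pageStarts: { offset: number; page: number }[] = [];
  for (const p of pages) {
    pageStarts.push({ offset: full.length, page: p.page });
    full += p.text;
  }

  // Find the page that contains a given character offset.
  const pageAt = (offset: number): number => {
    let page = 1;
    for (const s of pageStarts) {
      if (s.offset <= offset) page = s.page;
      else break;
    }
    return page;
  };

  const chunks: PageChunk[] = [];
  const step = size - overlap; // each window starts `step` characters after the last
  for (let start = 0; start < full.length; start += step) {
    chunks.push({ text: full.slice(start, start + size), page: pageAt(start) });
    if (start + size >= full.length) break; // this window already reached the end
  }
  return chunks;
}
```

With the defaults, consecutive chunks share their last and first 200 characters, which is what keeps a boundary-straddling sentence retrievable from either side.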

2. Embed

An embedding is a vector that represents the meaning of a piece of text. Similar meaning gives similar vectors. This repo uses text-embedding-3-small, which produces 1536-dimensional vectors. You do not need to know how the model produces them, only that semantically similar text gives a higher cosine similarity score.
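To make the similarity score concrete, the sketch below computes cosine similarity and ranks stored vectors against a query vector. The StoredVector shape and the rankBySimilarity name are assumptions for illustration; the actual search in lib/vector-store.ts may be organised differently, and real embeddings have 1536 dimensions rather than the toy sizes used when trying this out.

```typescript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction
  return dot / Math.sqrt(normA * normB);
}

// Hypothetical stored-item shape; the store in lib/vector-store.ts may differ.
interface StoredVector {
  id: string;
  vector: number[];
}

// Score every stored vector against the query and keep the k best.
function rankBySimilarity(
  query: number[],
  items: StoredVector[],
  k: number,
): { id: string; score: number }[] {
  return items
    .map((item) => ({ id: item.id, score: cosineSimilarity(query, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```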

3. Hybrid retrieve

Dense embedding search is strong on meaning and paraphrase but weak on rare exact terms: product codes, error strings, surnames. BM25 lexical search is the opposite. Running both and fusing the rankings is more robust than either alone, which is why this repo does hybrid search.

  • Dense ranks chunks by cosine similarity to the question vector (search in lib/vector-store.ts).
  • BM25 ranks chunks by weighted term overlap with the question (lib/bm25.ts).
  • Weighted Reciprocal Rank Fusion combines the two rankings. Each chunk's fused score is the sum of weight / (60 + rank) over the rankings it appears in. RRF needs no score normalisation, which matters because cosine scores and BM25 scores live on completely different scales. The weights default to equal (plain RRF); set HYBRID_DENSE_WEIGHT and HYBRID_LEXICAL_WEIGHT to tilt fusion towards meaning or towards exact terms for your corpus. See Configuration.
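The fusion formula above can be sketched in a few lines. This is an illustrative version, not the repo's code: the WeightedRanking shape and the weightedRrf name are hypothetical, and it assumes ranks are 1-based (the best chunk in a ranking has rank 1).

```typescript
// One ranking: chunk ids ordered best first, plus the weight for that ranking.
interface WeightedRanking {
  ids: string[];
  weight: number;
}

// Weighted Reciprocal Rank Fusion: score = sum of weight / (k + rank)
// over every ranking the chunk appears in. Only ranks are used, so the
// underlying cosine and BM25 scores never need to be normalised.
function weightedRrf(rankings: WeightedRanking[], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const { ids, weight } of rankings) {
    ids.forEach((id, index) => {
      const rank = index + 1; // assumption: ranks start at 1
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```

A chunk that appears in both rankings collects two terms, so it tends to outrank a chunk that only one retriever found, even if that retriever placed it first.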

Hybrid search pulls a wide candidate pool, tuned for recall: get the right chunk in the pool, even if it is not ranked first yet.
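For the lexical half of the hybrid search, a compact BM25 scorer looks roughly like the following. It is a sketch under stated assumptions rather than the contents of lib/bm25.ts: the tokenizer is a simple lowercase alphanumeric split, k1 = 1.2 and b = 0.75 are conventional defaults rather than values taken from the repo, and the function names are hypothetical.

```typescript
// Naive tokenizer: lowercase runs of letters and digits.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Returns one BM25 score per document, in the same order as `docs`.
function bm25Scores(query: string, docs: string[], k1 = 1.2, b = 0.75): number[] {
  const tokenized = docs.map(tokenize);
  const n = tokenized.length;
  const avgLen = tokenized.reduce((sum, t) => sum + t.length, 0) / Math.max(n, 1);

  // Document frequency: in how many documents each term occurs.
  const df = new Map<string, number>();
  for (const tokens of tokenized) {
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }

  const queryTerms = Array.from(new Set(tokenize(query)));
  return tokenized.map((tokens) => {
    // Term frequency within this document.
    const tf = new Map<string, number>();
    for (const t of tokens) tf.set(t, (tf.get(t) ?? 0) + 1);

    let score = 0;
    for (const term of queryTerms) {
      const f = tf.get(term) ?? 0;
      if (f === 0) continue;
      const nq = df.get(term) ?? 0;
      // Rare terms get a higher inverse document frequency.
      const idf = Math.log(1 + (n - nq + 0.5) / (nq + 0.5));
      // Saturating term frequency, normalised by document length.
      score += (idf * f * (k1 + 1)) / (f + k1 * (1 - b + (b * tokens.length) / avgLen));
    }
    return score;
  });
}
```

A rare exact term such as an error code appears in few documents, so its inverse document frequency is high and the one chunk that contains it scores well above the rest.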

4. Rerank

The candidate pool is then reordered for precision. The reranker scores each candidate against the question directly and keeps the top-k (lib/reranker.ts, orchestrated in lib/retrieval.ts).

  • The default LLM reranker asks the model to score each passage 0 to 10 for how well it answers the question, in one batched call.
  • If that call fails for any reason, retrieval falls back to a deterministic lexical reranker that scores query-term coverage. The fallback means a rerank hiccup never breaks a question, and it is what the offline tests use.
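The fallback behaviour described above might be sketched like this. The names (RerankCandidate, PassageScorer, lexicalRerank, rerankWithFallback) are hypothetical and the primary scorer is passed in as a function, so this shows the try-then-fall-back pattern rather than the actual LLM call in lib/reranker.ts. Coverage is taken here to mean the fraction of distinct question terms that occur in the passage, which is an assumption about how the repo defines it.

```typescript
// Hypothetical candidate shape; lib/reranker.ts may use different types.
interface RerankCandidate {
  id: string;
  text: string;
}

// A primary scorer returns one relevance score per candidate (for example 0 to 10).
type PassageScorer = (question: string, candidates: RerankCandidate[]) => Promise<number[]>;

function words(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Sort candidates by score, best first, and keep the top k.
// Array.prototype.sort is stable, so ties keep their retrieval order.
function takeTop(candidates: RerankCandidate[], scores: number[], topK: number): RerankCandidate[] {
  return candidates
    .map((candidate, i) => ({ candidate, score: scores[i] ?? 0 }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((entry) => entry.candidate);
}

// Deterministic fallback: fraction of distinct question terms present in the passage.
function lexicalRerank(question: string, candidates: RerankCandidate[], topK: number): RerankCandidate[] {
  const terms = Array.from(new Set(words(question)));
  const scores = candidates.map((c) => {
    if (terms.length === 0) return 0;
    const present = new Set(words(c.text));
    return terms.filter((t) => present.has(t)).length / terms.length;
  });
  return takeTop(candidates, scores, topK);
}

async function rerankWithFallback(
  question: string,
  candidates: RerankCandidate[],
  topK: number,
  primary: PassageScorer,
): Promise<RerankCandidate[]> {
  try {
    const scores = await primary(question, candidates);
    if (scores.length !== candidates.length) throw new Error("score count mismatch");
    return takeTop(candidates, scores, topK);
  } catch {
    // Any failure in the primary scorer degrades to the lexical reranker.
    return lexicalRerank(question, candidates, topK);
  }
}
```

Because the fallback needs no network call, the same function can be exercised in offline tests by passing a primary scorer that always throws.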

5. Generate and cite

The top-k chunks are numbered and placed in the prompt. The model is instructed to answer only from them and to cite the passages it uses inline, for example [1] or [2][3]. The response streams back as NDJSON (lib/citations.ts): a citation event first (so the UI shows sources and pages immediately), then answer tokens, then a done event.
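The streaming format can be illustrated with a small encoder and an incremental parser. The event and field names below ("citations", "token", "done", index, source, page) are assumptions made for this sketch; the actual wire format is defined in lib/citations.ts and may differ.

```typescript
// Hypothetical event shapes: one JSON object per line.
interface SourceCitation {
  index: number; // the number used inline, e.g. 1 for [1]
  source: string;
  page: number;
}

type StreamEvent =
  | { type: "citations"; citations: SourceCitation[] }
  | { type: "token"; text: string }
  | { type: "done" };

// NDJSON: each event is serialised as JSON followed by a newline.
// JSON.stringify escapes newlines inside strings, so one event is always one line.
function encodeEvent(event: StreamEvent): string {
  return JSON.stringify(event) + "\n";
}

// Incremental parser: network fragments can split a line anywhere, so the
// trailing partial line is buffered until its newline arrives.
function createNdjsonParser(): (fragment: string) => StreamEvent[] {
  let buffer = "";
  return (fragment: string): StreamEvent[] => {
    buffer += fragment;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? "";
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  };
}
```

Sending the citations event before any tokens is what lets a client render the source list while the answer is still being generated.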

The augmented prompt

After retrieval and reranking, the prompt looks like this:

SYSTEM: You answer questions strictly from the provided document passages. Each
passage is numbered. Cite the passages you use inline with their number in
square brackets. If the answer is not in the passages, say so plainly.

USER: Passages:

[1] (source: handbook.pdf, page 12)
Refund Policy. We offer a 30-day money-back guarantee on all subscription
plans. Refunds are processed within 5 business days...

[2] (source: handbook.pdf, page 13)
Annual subscriptions are pro-rated on cancellation...

Question: what is the refund policy?

The model reads, answers, and cites. The UI renders [1] and [2] as links to the source filename and page.
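Assembling that prompt from the reranked passages is mostly string formatting. The sketch below reproduces the layout shown above; the SourcePassage shape and the buildMessages name are hypothetical, and the system/user message structure is an assumption about how the prompt is sent to the model.

```typescript
// Hypothetical passage shape for illustration.
interface SourcePassage {
  source: string; // filename, e.g. "handbook.pdf"
  page: number;
  text: string;
}

const SYSTEM_PROMPT = [
  "You answer questions strictly from the provided document passages. Each",
  "passage is numbered. Cite the passages you use inline with their number in",
  "square brackets. If the answer is not in the passages, say so plainly.",
].join(" ");

function buildMessages(
  question: string,
  passages: SourcePassage[],
): { role: "system" | "user"; content: string }[] {
  // Number passages from 1 so that [n] in the answer maps back to passages[n - 1].
  const numbered = passages
    .map((p, i) => `[${i + 1}] (source: ${p.source}, page ${p.page})\n${p.text}`)
    .join("\n\n");
  return [
    { role: "system", content: SYSTEM_PROMPT },
    { role: "user", content: `Passages:\n\n${numbered}\n\nQuestion: ${question}` },
  ];
}
```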

Why this beats fine-tuning

Aspect                 Fine-tuning            RAG
Update with new info   retrain (expensive)    re-index (cheap)
Cite sources           impossible             built in here
Cost to set up         high                   low
Hallucinations         likely                 reducible with grounding and citations
Multi-document         hard                   built in here

Common RAG mistakes (and how this repo handles them)

Bad chunking. Full-page chunks surface irrelevant material; tiny chunks lack context. This repo defaults to paragraph-sized chunks and lets you tune size and overlap.

Dense-only retrieval. Pure embedding search misses exact terms. Hybrid search with BM25 catches them.

No reranking. Top-k by similarity alone is decent, but scoring each candidate directly against the question usually puts the most relevant passages nearer the top. This repo reranks by default.

Trusting the model to refuse. If the passages do not contain the answer, a model may still hallucinate. The system prompt pins "say so plainly," and citations make ungrounded claims visible.
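One way to act on that visibility is to check the citation markers in a finished answer. The sketch below is an optional illustration, not something this repo is known to do: extractCitations and checkGrounding are hypothetical names, and the check only confirms that markers exist and point at real passages, not that the cited passage actually supports the claim.

```typescript
// Collect the distinct passage numbers cited inline, e.g. [1] or [2][3].
function extractCitations(answer: string): number[] {
  const found = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    found.add(Number(match[1]));
  }
  return Array.from(found).sort((a, b) => a - b);
}

// Flag answers that cite nothing, or cite a passage number that was never supplied.
function checkGrounding(
  answer: string,
  passageCount: number,
): { cited: number[]; outOfRange: number[]; uncited: boolean } {
  const cited = extractCitations(answer);
  const outOfRange = cited.filter((n) => n < 1 || n > passageCount);
  return { cited, outOfRange, uncited: cited.length === 0 };
}
```

An answer with no markers at all, or with a marker such as [7] when only five passages were supplied, is a candidate for a warning in the UI.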

Top-k too high. Twenty passages cost more tokens and can bury the relevant passage among distractors. The reranker trims to a tight top-k (default 5).
