How RAG Works

Retrieval-Augmented Generation in plain English, using this repo as the running example.

The problem RAG solves

LLMs like GPT-4 know a lot, but they don't know your data. They were trained on a snapshot of the internet, not your internal docs, customer database, or last quarter's strategy memo.

You can fix this two ways:

Fine-tune the model on your data (expensive, slow, hard to update)
Retrieval-Augmented Generation (cheap, fast, easy to update)

RAG wins for almost every real-world product use case where the goal is answering questions from your own data.

The trick

When the user asks a question, instead of sending it straight to the LLM, you do this:

Find the chunks of YOUR data most relevant to the question
Stuff those chunks into the prompt as context
Tell the LLM to answer using only that context

The LLM doesn't need to "know" your data. It just needs to be smart enough to read your data and answer based on it. That is what GPT-4 is good at.

The three moving parts

graph LR
  A[Chunk] --> B[Embed]
  B --> C[Retrieve]

1. Chunk

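Long PDFs and documents often exceed the model's context window, and even when they fit, sending the whole thing wastes tokens on text that has nothing to do with the question. So you split them into chunks.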

How big should a chunk be? In this repo, 1000 characters with a 200-character overlap. Why?

Too small (50 chars): a chunk has no context, retrieval is noisy.
Too big (10,000 chars): you waste tokens on irrelevant text.
Sweet spot: roughly the size of a paragraph or two. 500-1500 chars works for most prose.

The 200-char overlap means a sentence that straddles a chunk boundary still appears whole in at least one of the two chunks, so retrieval can still find it.
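A minimal sketch of fixed-size chunking with overlap, using the 1000/200 values quoted above. The function name `chunk_text` is hypothetical, and a plain character window is an assumption: the repo's actual splitter may cut on paragraph or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


doc = "x" * 2500
print([len(c) for c in chunk_text(doc)])  # → [1000, 1000, 900]
```

With these defaults each chunk starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated as the first 200 of the next.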

2. Embed

An embedding is a list of numbers that represents the meaning of a piece of text. Two chunks with similar meaning have similar embedding vectors.

This repo uses text-embedding-3-small, which produces 1536-dimensional vectors. So every chunk is now a list of 1536 numbers between -1 and 1.

You don't need to understand HOW the embedding model produces these numbers. You only need to know that semantically similar text → similar vectors → high cosine similarity score.
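To make "cosine similarity" concrete, here is a small sketch in plain Python. The three-dimensional vectors are made-up toy values standing in for real 1536-dimensional embeddings; they are not output from text-embedding-3-small.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat it as unrelated
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only).
refund = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather = [0.0, 0.1, 0.9]

print(cosine_similarity(refund, money_back))  # close to 1: similar meaning
print(cosine_similarity(refund, weather))     # close to 0: unrelated
```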

3. Retrieve

When the user asks "what is the company's refund policy?", we:

Embed the question into a 1536-dim vector
Compute cosine similarity between the question vector and EVERY chunk vector
Take the top 5 most similar chunks

These 5 chunks should contain the answer (if the document covered it). They go into the prompt as context.
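The retrieval steps above can be sketched as a brute-force scan. This assumes the question has already been embedded with the same model as the chunks; `top_k` is a hypothetical helper name, and the two-dimensional vectors are toy values rather than real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Score every chunk against the query and return the k best (index, score) pairs."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy data: chunk 0 points the same way as the query, chunk 3 the opposite way.
query = [1.0, 0.0]
chunk_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print([index for index, _ in top_k(query, chunk_vectors, k=2)])  # → [0, 2]
```

Scanning every vector is fine for a single PDF; a dedicated vector index only starts to matter once the chunk count grows large.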

The augmented prompt

After retrieval, the prompt sent to GPT-4 looks like this:

SYSTEM: You answer questions strictly from the provided document chunks. If the
answer is not in the chunks, say so plainly. Be concise. Quote short passages
when useful.

USER: Document chunks:

[Chunk 1]
Refund Policy
We offer a 30-day money-back guarantee on all subscription plans. Refunds
are processed within 5 business days...

[Chunk 2]
For annual subscriptions, refunds are pro-rated...

[Chunk 3]
... (irrelevant chunk that came in top-5 by accident)

Question: what is the company's refund policy?

The LLM reads, finds the relevant text, and answers.
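Assembling that prompt is plain string formatting. The sketch below reuses the system prompt shown above; the function name `build_messages` and the role/content message shape are assumptions based on the common chat-message format, not necessarily what this repo does.

```python
SYSTEM_PROMPT = (
    "You answer questions strictly from the provided document chunks. If the "
    "answer is not in the chunks, say so plainly. Be concise. Quote short passages "
    "when useful."
)


def build_messages(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Number the retrieved chunks and place them ahead of the question."""
    numbered = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    user_content = f"Document chunks:\n\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "what is the company's refund policy?",
    [
        "We offer a 30-day money-back guarantee on all subscription plans.",
        "For annual subscriptions, refunds are pro-rated...",
    ],
)
print(messages[1]["content"])
```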

Why this beats fine-tuning

| | Fine-tuning | RAG |
| --- | --- | --- |
| Update with new info | retrain (expensive) | re-index (cheap) |
| Cite sources | impossible | trivial |
| Cost per query | medium | medium |
| Cost to set up | high | low |
| Hallucinations | likely | possible but reducible |
| Multi-document | hard | trivial |

For the large majority of "chat with my docs" use cases, RAG is the right answer.

Common RAG mistakes

Bad chunking. If your chunks are full pages, retrieval surfaces irrelevant material. If chunks are tiny sentences, the LLM doesn't have enough context. Tune to your content.

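No re-ranking. Top-5 by cosine similarity is a good baseline. Retrieving a larger candidate set (say, top-20) by cosine and then re-ranking it with a cross-encoder is usually better, because the cross-encoder reads the question and the chunk together. Worth adding when accuracy matters.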

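Trusting the LLM to refuse. If the chunks don't contain the answer, the LLM may still hallucinate. Put an explicit instruction such as "if the answer is not in the chunks, say so plainly" in the system prompt, then test it with questions the document cannot answer.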

Top-k too high. Sending 20 chunks costs more tokens and can bury the relevant passage in noise. 3-7 is usually a good range.

Ignoring metadata. Filename, section, and date can all be stored alongside each chunk and used to filter retrieval. Not done in this starter, but easy to add.
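As an illustration of the metadata point, here is one way filtering could sit in front of similarity ranking. Everything in it is hypothetical: the `Chunk` record, the `retrieve` helper, and the toy vectors and filenames are assumptions made for the sketch, not this repo's store API.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    vector: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], chunks: list[Chunk], k: int = 5, **filters: str) -> list[Chunk]:
    """Keep only chunks whose metadata matches every filter, then rank by similarity."""
    candidates = [
        c for c in chunks
        if all(c.metadata.get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(query_vec, c.vector),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy records; vectors are not real embeddings.
store = [
    Chunk("Refunds are processed within 5 business days.", [0.9, 0.1], {"filename": "policy.pdf"}),
    Chunk("Strategy memo: expand into new markets.", [0.8, 0.2], {"filename": "memo.pdf"}),
    Chunk("For annual subscriptions, refunds are pro-rated.", [0.7, 0.3], {"filename": "policy.pdf"}),
]
hits = retrieve([1.0, 0.0], store, k=5, filename="policy.pdf")
print([c.text for c in hits])
```

Filtering before scoring means chunks from the wrong file never compete for a top-k slot, however similar their vectors are.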

What this repo doesn't do (yet)

Re-ranking: a planned addition. Running a cross-encoder over the top-20 candidates typically improves the quality of the final top-5.
Hybrid search: dense (embeddings) plus sparse (BM25) retrieval is more robust than either alone.
Citation rendering: the answer doesn't show which chunks it came from. Trivial to add: pass chunk IDs through to the response.
Multi-document indexing: the UI is single-PDF. The store API supports many.

These are intentional gaps for a starter. Add them when you understand why you need them.
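For orientation, the re-ranking gap listed above amounts to a second scoring pass over the retrieved candidates. The sketch below shows only that shape: `rerank` is a hypothetical helper that accepts any scoring function, and `word_overlap` is a deliberately crude stand-in used so the example runs without a model. It is not a cross-encoder, which would take its place in a real setup.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    k: int = 5,
) -> list[str]:
    """Re-score each candidate against the question and keep the k best."""
    ranked = sorted(candidates, key=lambda c: score_fn(question, c), reverse=True)
    return ranked[:k]


def word_overlap(question: str, passage: str) -> float:
    """Toy scorer: number of distinct lowercase words shared by both texts."""
    return float(len(set(question.lower().split()) & set(passage.lower().split())))


candidates = [
    "Office hours are nine to five.",
    "Our refund policy offers a 30-day money-back guarantee.",
    "Refunds for annual plans are pro-rated.",
]
print(rerank("what is the refund policy", candidates, word_overlap, k=1))
```

In practice the first stage would hand over something like the top-20 chunks by cosine similarity, and the scoring function would be a model that reads the question and chunk together.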
