How RAG Works
Retrieval-Augmented Generation in plain English, using this repo as the running example.
LLMs like GPT-4 know a lot, but they don't know your data. They were trained on a snapshot of the internet, not your internal docs, customer database, or last quarter's strategy memo.
You can fix this two ways:
- Fine-tune the model on your data (expensive, slow, hard to update)
- Retrieval-Augmented Generation (cheap, fast, easy to update)
RAG wins for almost every real-world product use case.
When the user asks a question, instead of sending it straight to the LLM, you do this:
- Find the chunks of YOUR data most relevant to the question
- Stuff those chunks into the prompt as context
- Tell the LLM to answer using only that context
The LLM doesn't need to "know" your data. It just needs to be smart enough to read your data and answer based on it. That is what GPT-4 is good at.
graph LR
A[Chunk] --> B[Embed]
B --> C[Retrieve]
PDFs and documents are too big to send to an LLM whole. So you split them into chunks.
How big should a chunk be? This repo uses 1000 characters with a 200-character overlap. Why?
- Too small (50 chars): a chunk has no context, retrieval is noisy.
- Too big (10,000 chars): you waste tokens on irrelevant text.
- Sweet spot: roughly the size of a paragraph or two. 500-1500 chars works for most prose.
The 200-char overlap exists so that a sentence straddling a chunk boundary (up to 200 characters long) still appears whole in at least one chunk and stays findable.
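To make the chunk size and overlap concrete, here is a minimal sketch of fixed-size character chunking. The function name `chunk_text` is hypothetical and not taken from this repo; only the 1000/200 defaults come from the text above. The repo's actual splitter may differ, for example by respecting sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration; not the repo's own splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


document = "x" * 2500
chunks = chunk_text(document)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each window starts 800 characters after the previous one, so the last 200 characters of one chunk are repeated as the first 200 of the next.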
An embedding is a list of numbers that represents the meaning of a piece of text. Two chunks with similar meaning have similar embedding vectors.
This repo uses text-embedding-3-small, which produces 1536-dimensional vectors. So every chunk is now a list of 1536 numbers between -1 and 1.
You don't need to understand HOW the embedding model produces these numbers. You only need to know that semantically similar text → similar vectors → high cosine similarity score.
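Cosine similarity itself is only a few lines. The sketch below uses made-up 3-dimensional vectors in place of real 1536-dimensional embeddings; the numbers are invented for illustration and are not produced by any embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Toy "embeddings" (assumed values, for illustration only).
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, money_back))      # close to 1
print(cosine_similarity(refund_policy, weather_report))  # close to 0
```

Vectors that point in nearly the same direction score near 1, and unrelated ones score near 0.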
When the user asks "what is the company's refund policy?", we:
- Embed the question into a 1536-dim vector
- Compute cosine similarity between the question vector and EVERY chunk vector
- Take the top 5 most similar chunks
These 5 chunks should contain the answer (if the document covered it). They go into the prompt as context.
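The retrieval steps above can be sketched as a brute-force scan: score every chunk vector against the question vector and keep the best `k`. The function `top_k` and the toy vectors are assumptions for illustration; a real setup would get the vectors from the embedding model and might use a vector index instead of a full scan.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embeddings (assumed values).
chunk_vectors = [
    [0.0, 0.1, 0.9],  # chunk 0: unrelated topic
    [0.9, 0.1, 0.0],  # chunk 1: refund policy
    [0.7, 0.3, 0.1],  # chunk 2: annual-plan refunds
]
question_vector = [0.8, 0.2, 0.0]

for index, score in top_k(question_vector, chunk_vectors, k=2):
    print(index, round(score, 3))
```

A full scan is linear in the number of chunks, which is fine for a single PDF and is the main thing a dedicated vector store would replace at larger scale.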
After retrieval, the prompt sent to GPT-4 looks like this:
SYSTEM: You answer questions strictly from the provided document chunks. If the
answer is not in the chunks, say so plainly. Be concise. Quote short passages
when useful.
USER: Document chunks:
[Chunk 1]
Refund Policy
We offer a 30-day money-back guarantee on all subscription plans. Refunds
are processed within 5 business days...
[Chunk 2]
For annual subscriptions, refunds are pro-rated...
[Chunk 3]
... (irrelevant chunk that came in top-5 by accident)
Question: what is the company's refund policy?
The LLM reads, finds the relevant text, and answers.
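Assembling that prompt is plain string formatting. The sketch below builds a system/user message pair in the common chat-message shape (a list of `role`/`content` dictionaries). The function name `build_messages` and the exact whitespace between chunks are assumptions; only the system prompt wording and the `[Chunk N]` labels come from the example above.

```python
SYSTEM_PROMPT = (
    "You answer questions strictly from the provided document chunks. If the "
    "answer is not in the chunks, say so plainly. Be concise. Quote short passages "
    "when useful."
)


def build_messages(question: str, chunks: list[str]) -> list[dict[str, str]]:
    """Build chat messages that place the retrieved chunks ahead of the question."""
    numbered = "\n\n".join(
        f"[Chunk {i}]\n{chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    user_content = f"Document chunks:\n\n{numbered}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "what is the company's refund policy?",
    [
        "Refund Policy\nWe offer a 30-day money-back guarantee on all subscription plans.",
        "For annual subscriptions, refunds are pro-rated...",
    ],
)
print(messages[1]["content"])
```

Numbering the chunks in the prompt also gives the model something to refer to if you later want it to say which chunk an answer came from.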
| | Fine-tuning | RAG |
|---|---|---|
| Update with new info | retrain (expensive) | re-index (cheap) |
| Cite sources | impossible | trivial |
| Cost per query | medium | medium |
| Cost to set up | high | low |
| Hallucinations | likely | possible but reducible |
| Multi-document | hard | trivial |
For 95 percent of "chat with my docs" use cases, RAG is the right answer.
Bad chunking. If your chunks are full pages, retrieval surfaces irrelevant material. If chunks are tiny sentences, the LLM doesn't have enough context. Tune to your content.
No re-ranking. Top-5 by cosine similarity is good. Retrieving a larger candidate set by cosine similarity (say the top 20), re-ranking it with a cross-encoder, and then keeping the top 5 is often much better. Worth adding when accuracy matters.
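The shape of a re-ranking step can be sketched without committing to a particular model: take the candidates from the first-stage retrieval, score each (question, chunk) pair with a second scorer, and keep the best few. In the sketch below, `rerank` and `word_overlap_score` are hypothetical names, and the word-overlap scorer is only a toy stand-in so the example runs; it is not a cross-encoder. In practice you would pass in a function that calls a real cross-encoder model.

```python
from typing import Callable


def rerank(
    question: str,
    candidates: list[str],
    score_pair: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-order first-stage candidates by a pairwise relevance score, best first."""
    ranked = sorted(candidates, key=lambda chunk: score_pair(question, chunk), reverse=True)
    return ranked[:keep]


def word_overlap_score(question: str, chunk: str) -> float:
    """Toy stand-in scorer: fraction of question words that appear in the chunk.

    NOT a cross-encoder; replace with a real model's scoring function.
    """
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)


candidates = [
    "Shipping takes 3 days",
    "Our refund policy allows returns within 30 days",
    "Contact support by email",
]
print(rerank("what is the refund policy", candidates, word_overlap_score, keep=1))
```

Keeping the scorer as a parameter means the retrieval code does not change when you swap the toy scorer for a real model.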
Trusting the LLM to refuse. If the chunks don't contain the answer, the LLM may still hallucinate one. Keep an explicit instruction such as "say so plainly" in the system prompt, and test it with questions the document cannot answer.
Top-k too high. Sending 20 chunks costs more tokens and buries the relevant text among irrelevant text, which can confuse the LLM. 3-7 chunks is usually a good range.
Ignoring metadata. Filename, section, and date can all be attached to each chunk and used to filter retrieval. Not done in this starter, but easy to add.
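As a rough illustration of metadata filtering, the sketch below stores a few fields next to each chunk's text and narrows the candidate set before any similarity scoring happens. The `Chunk` dataclass, its field names, and `filter_chunks` are all hypothetical; this starter does not define them.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical chunk record: text plus the metadata used for filtering."""
    text: str
    filename: str
    section: str
    updated: date


def filter_chunks(
    chunks: list[Chunk],
    filename: Optional[str] = None,
    section: Optional[str] = None,
    updated_after: Optional[date] = None,
) -> list[Chunk]:
    """Keep only chunks that match every filter that was provided."""
    kept = []
    for chunk in chunks:
        if filename is not None and chunk.filename != filename:
            continue
        if section is not None and chunk.section != section:
            continue
        if updated_after is not None and chunk.updated <= updated_after:
            continue
        kept.append(chunk)
    return kept


store = [
    Chunk("30-day money-back guarantee...", "policies.pdf", "Refunds", date(2024, 3, 1)),
    Chunk("Old refund terms...", "policies.pdf", "Refunds", date(2021, 6, 15)),
    Chunk("Quarterly revenue grew...", "report.pdf", "Summary", date(2024, 1, 10)),
]
recent_refunds = filter_chunks(store, section="Refunds", updated_after=date(2023, 1, 1))
print([c.text for c in recent_refunds])
```

Filtering first and scoring second keeps stale or off-topic chunks out of the top-k entirely, rather than hoping similarity alone ranks them low.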
- Re-ranking: a planned addition. A cross-encoder on the top-20 measurably improves quality.
- Hybrid search: dense (embeddings) plus sparse (BM25) retrieval is more robust than either alone.
- Citation rendering: the answer doesn't show which chunks it came from. Trivial to add: pass chunk IDs through to the response.
- Multi-document indexing: the UI is single-PDF. The store API supports many.
These are intentional gaps for a starter. Add them when you understand why you need them.
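For the hybrid search item above, one common way to merge a dense ranking and a sparse ranking is reciprocal rank fusion (RRF): each chunk earns `1 / (k + rank)` from every ranking it appears in, and the sums decide the final order. This is one possible technique, named here as a suggestion rather than something the repo specifies. The chunk IDs below are invented, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of chunk IDs into one ordering.

    Each ID scores 1 / (k + rank) per ranking it appears in (rank starts at 1).
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from two retrievers, best match first.
dense_ranking = ["c2", "c7", "c1"]   # from embedding similarity
sparse_ranking = ["c7", "c4", "c2"]  # from keyword search such as BM25

print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
# ['c7', 'c2', 'c4', 'c1']
```

RRF only looks at ranks, so it sidesteps the problem that cosine scores and BM25 scores live on different scales.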
- Anthropic's contextual retrieval — improvements over vanilla RAG
- pgvector — Postgres extension for vector search
- LlamaIndex docs — when you need a framework