# RAG Introduction — Notes

This notebook contains foundational notes on **Retrieval-Augmented Generation (RAG)**, classical search, LLMs, and design choices for building high-performing RAG systems.

## 1. Classical Search Systems

- **Classical search** refers to traditional information retrieval (IR) systems that existed before LLMs.
- The broader field of **information retrieval** was already mature when large language models were first developed.
- Ideas from this field underlie how **retrievers** and **RAG systems** are designed.
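- At scale, most retrievers are built on **vector databases**, which store document embeddings and support fast nearest-neighbor search.
- Several approaches exist for computing **similarity scores** between queries and documents, including keyword-based scoring (e.g. BM25) and embedding-based (vector) similarity.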
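
As one illustration of a similarity score, the sketch below computes cosine similarity between a query vector and a few document vectors. The three-dimensional vectors and the document names are made up for this example; a real system would get its embeddings from a model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [1.0, 1.0, 0.0],
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)  # best match first
print(ranked)
```

Cosine similarity is only one option; dot product and Euclidean distance are also common for dense vectors, and sparse methods score documents by term statistics instead.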

## 2. Foundational Concepts: Search and LLM

- Understanding both **search (retrieval)** and **LLM (generation)** is essential for RAG.
- **Retrieval** manages a knowledge base of trusted, relevant, and possibly private information.
- **Retrievers** search the knowledge base to find content relevant to a user query.
- **LLMs** generate text (completions) given a prompt; they benefit from having relevant information added to that prompt.

## 3. What is RAG and Why It Helps

- **RAG improves the performance of LLMs** by giving them information they did not see (or did not see often enough) during training.
- When we send a prompt to an LLM, we hope the information relevant to our question appeared in the training data, ideally many times.
- There will always be **information the model was not trained on**. In such cases, we **add that information to the prompt**.
- Supplying that information is the **retriever's** job: it **searches** its knowledge base of trusted, relevant, and possibly private information for content that matches the query.

## 4. Core RAG Building Blocks

| Component | Role |
|-----------|------|
| **Retrieve function** | Wrapper around the retriever. Accepts a text query and returns relevant documents (or chunks) from the knowledge base. |
| **Generate function** | Wrapper around the LLM. Accepts a text prompt and returns the LLM response. |

RAG pipelines typically follow the sequence **query → retrieve → build prompt (query + docs) → generate**.
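
The pipeline can be sketched end to end as follows. This is a minimal illustration, not a reference implementation: the word-overlap retriever, the prompt wording, the `KNOWLEDGE_BASE` contents, and the placeholder `generate` function (which stands in for a real LLM call) are all assumptions made for the example.

```python
import re

# Hypothetical knowledge base, invented for illustration.
KNOWLEDGE_BASE = [
    "The office closes at 6 pm on weekdays.",
    "Parking is free for visitors.",
    "The cafeteria serves lunch from noon.",
]

def _words(text):
    """Lowercased set of word tokens, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, knowledge_base, top_k=2):
    """Toy retriever: rank documents by the number of words shared with the query."""
    query_words = _words(query)
    scored = [(len(query_words & _words(doc)), doc) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Keep the top_k documents that share at least one word with the query.
    return [doc for overlap, doc in scored[:top_k] if overlap > 0]

def build_prompt(query, docs):
    """Combine the retrieved documents and the query into one prompt."""
    context = "\n".join(f"- {doc}" for doc in docs)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

def generate(prompt):
    """Placeholder for an LLM call; a real system would send the prompt to a model."""
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def rag(query, knowledge_base):
    docs = retrieve(query, knowledge_base)
    prompt = build_prompt(query, docs)
    return generate(prompt)

print(rag("When does the office close?", KNOWLEDGE_BASE))
```

Keeping `retrieve` and `generate` as separate wrappers means either side can be swapped (for example, a vector database for the toy retriever) without touching the rest of the pipeline.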

## 5. Introduction to LLMs

### 5.1 LLMs as “fancy autocomplete”
- Under the hood, LLMs use a **neural network**—a large mathematical model of language.
- The model encodes which words are commonly used together, in what order, and what they mean in context.
- When an LLM generates a completion, it is **adding new tokens to the end of the prompt, one token at a time**.

### 5.2 Tokens (not words)
- LLMs generate **tokens**, not raw words. A token is a piece of text (subword, word, or punctuation).
- Examples: “London” and “door” might be single tokens; “programmatically” and “unhappy” are often split into multiple tokens.
- Punctuation (e.g. `.`, `,`, `?`) can be separate tokens.
- Typical vocabulary size: **tens of thousands to 100,000+ tokens**. Using subword tokens lets the model represent any word without a token for every possible word.
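
To make subword splitting concrete, here is a toy greedy longest-match tokenizer. It is a simplification for illustration: production tokenizers typically learn their vocabulary with algorithms such as byte-pair encoding, and the tiny `VOCAB` set below is hypothetical.

```python
# Hypothetical vocabulary; real vocabularies hold tens of thousands of entries.
VOCAB = {"London", "door", "un", "happy", "program", "matic", "ally", "?"}

def tokenize(text, vocab):
    """Split text greedily into the longest vocabulary entries, left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character,
            # so any input can still be represented.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("unhappy", VOCAB))
print(tokenize("programmatically", VOCAB))
print(tokenize("London?", VOCAB))
```

The single-character fallback mirrors the point above: with subword pieces, the model never needs a dedicated token for every possible word.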

### 5.3 How the next token is chosen
1. Process the **current sequence** (prompt + already generated tokens) to capture relationships and meaning.
2. For **each token in the vocabulary**, compute the **probability** that it should appear next.
3. **Sample** the next token from this probability distribution (e.g. “shining” might have high probability, “rising” lower, “exploding” very low but non-zero).
4. **Repeat** for the next token. Earlier token choices affect later ones—this is **auto-regressive** behavior and keeps the completion coherent.
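
The loop above can be sketched with a toy model. The probability table is invented for illustration and conditions only on the previous token, whereas a real LLM conditions on the entire sequence and scores every token in its vocabulary.

```python
import random

# Hypothetical next-token probabilities keyed by the previous token.
NEXT_TOKEN_PROBS = {
    "the": {"sun": 0.9, "moon": 0.1},
    "sun": {"is": 1.0},
    "moon": {"is": 1.0},
    "is": {"shining": 0.7, "rising": 0.29, "exploding": 0.01},
}

def sample_next(tokens, rng):
    """Sample one token from the distribution for the current sequence."""
    probs = NEXT_TOKEN_PROBS.get(tokens[-1])
    if probs is None:
        return None  # the toy model knows no continuation, so generation stops
    candidates = list(probs)
    weights = [probs[token] for token in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def complete(prompt_tokens, max_new_tokens=5, seed=0):
    """Auto-regressive generation: each sampled token is appended and fed back in."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = sample_next(tokens, rng)
        if next_token is None:
            break
        tokens.append(next_token)
    return tokens

print(" ".join(complete(["the"])))
```

Because the next token is sampled rather than always taken as the most likely one, different seeds can yield different completions, and an unlikely token such as "exploding" can still appear occasionally.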

### 5.4 Context and cost
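- **Longer prompts** require more computation, which raises latency and cost.
- There is a **context window limit**: the maximum sequence length (in tokens) the model can process at once. The prompt, including any retrieved documents, and the generated completion both count toward it.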
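
One practical consequence for RAG is that retrieved documents have to fit in a token budget. The sketch below keeps documents, in ranked order, until the budget is used up. Counting tokens by splitting on whitespace is a rough assumption made for this example; a real system would use the model's own tokenizer.

```python
def approx_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def fit_to_budget(docs, max_tokens, count_tokens=approx_token_count):
    """Keep documents in order until adding the next one would exceed max_tokens."""
    kept = []
    used = 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > max_tokens:
            break  # stop at the first document that does not fit
        kept.append(doc)
        used += cost
    return kept

ranked_docs = ["a b c", "d e", "f g h i"]  # hypothetical documents, best first
print(fit_to_budget(ranked_docs, max_tokens=5))
```

Stopping at the first document that does not fit preserves the ranking order; an alternative design would skip the oversized document and keep trying smaller ones.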

## 6. Purpose of the Retriever — The Librarian Analogy

- **Purpose:** Supply the LLM with useful information that was **not** available (or not emphasized) in its training data.
- **Analogy:** Think of the retriever as a **librarian**:
  - The librarian has **organized** books (documents) in sections.
  - You ask a **question**; the librarian interprets it and **finds relevant documents**.
  - Documents are assigned **similarity scores** based on how related they are to the question.
- **Ideal case:** The retriever ranks documents perfectly and returns exactly the right number of them.
- **In practice:** The retriever may rank some relevant documents too low and some irrelevant ones too high, making it hard to know how many to return. **Monitoring and experimentation** (e.g. with similarity thresholds and top-k) are needed to optimize performance.
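
The two knobs mentioned above, top-k and a similarity threshold, can be combined as in this sketch. The scores, document names, and default values are hypothetical; suitable settings depend on the scoring method and have to be found by experiment.

```python
def select_documents(scored_docs, top_k=3, min_score=0.5):
    """Return at most top_k (doc, score) pairs whose score is at least min_score."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked[:top_k] if score >= min_score]

# Hypothetical similarity scores from a retriever.
scored = [("doc_a", 0.91), ("doc_b", 0.42), ("doc_c", 0.77), ("doc_d", 0.55)]

print(select_documents(scored, top_k=3, min_score=0.5))
print(select_documents(scored, top_k=1, min_score=0.5))
print(select_documents(scored, top_k=3, min_score=0.95))
```

A high threshold can return nothing at all, so the calling code should decide what to do with an empty result (for example, answering without context or telling the user nothing relevant was found).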

## 7. RAG as a Design Space (Not a Single “Concept”)

- **RAG is not one fixed concept**—there are **many ways to implement** it.
- **Design choices** directly affect the **accuracy and performance** of a RAG system.
- To build a good system, you need to make deliberate choices about retrieval, chunking, prompting, and generation.

## 8. Attributes of a High-Performing RAG System

- **Relevance:** Retrieved documents are actually useful for answering the query.
- **Precision vs. recall:** Retrieves enough documents to cover the answer (recall) without filling the prompt with irrelevant ones (precision); the balance is set by top-k and similarity thresholds.
- **Robustness:** Handles ambiguous or noisy queries and documents.
- **Efficiency:** Latency and cost (retrieval + LLM) are acceptable for the use case.
- **Maintainability:** Easy to update the knowledge base, change models, or tune parameters.
- **Observability:** You can monitor retrieval quality and generation quality over time.

## 9. Design Choices for RAG

- **Chunking strategy:** How to split documents (by paragraph, sentence, fixed size, semantic, etc.).
- **Embedding model:** Which model to use for turning text into vectors (affects similarity quality).
- **Retrieval method:** Dense (vector) vs. sparse (e.g. BM25) vs. hybrid; filters and metadata.
- **Number of documents / top-k:** How many chunks to pass to the LLM.
- **Prompt template:** How to combine the query and retrieved documents (order, instructions, format).
- **LLM choice:** Model size, instruction-tuning, and cost/latency tradeoffs.
- **Re-ranking (optional):** A second-stage model to re-rank retrieved chunks before generation.
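
As an example of one of these choices, the sketch below implements fixed-size chunking with overlap. It counts words rather than tokens for simplicity, and the default sizes are arbitrary assumptions chosen for illustration.

```python
def chunk_words(text, chunk_size=5, overlap=2):
    """Split text into chunks of chunk_size words; consecutive chunks share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks

for chunk in chunk_words("a b c d e f g h", chunk_size=5, overlap=2):
    print(chunk)
```

Overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing and retrieving some duplicated text.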

## 10. Tuning a RAG System

- **Hyperparameter tuning** matters: key parameters include chunk size, chunk overlap, top-k, similarity threshold, and LLM temperature.
- Monitor retrieval metrics (e.g. precision@k, recall@k) and generation quality (e.g. faithfulness, answer relevance).
- Iterate on **chunking**, **embedding model**, and **prompt design** based on evaluation and user feedback.
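
Retrieval metrics such as recall@k are straightforward to compute once you have labeled examples. The sketch below assumes a hand-labeled set of relevant document IDs per query; the IDs are hypothetical, and precision@k is included alongside recall@k as a commonly paired metric.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention for this sketch when nothing is labeled relevant
    hits = set(retrieved_ids[:k]) & relevant
    return len(hits) / len(relevant)

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)

# Hypothetical retriever output (best first) and ground-truth labels for one query.
retrieved = ["d1", "d4", "d2", "d9"]
relevant = {"d1", "d2", "d3"}

print(recall_at_k(retrieved, relevant, k=3))
print(precision_at_k(retrieved, relevant, k=3))
```

Averaging these values over a set of test queries gives a number to track while changing chunk size, top-k, or the embedding model.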

## 11. Advanced and Extended RAG Topics

- **Multi-step, agentic workflows:** RAG is often **one component** in a larger agentic pipeline (e.g. plan → retrieve → reason → act). The agent may call tools, use multiple retrievers, or chain several steps.
- **Agentic document extraction:** Using LLM/agent logic to extract, normalize, or structure information from raw documents before or alongside retrieval.
- **Multi-model / reasoning models:** Combining different models (e.g. one for retrieval/reranking, one for reasoning, one for generation) or using specialized “reasoning” models in the loop.
- **RAG fine-tuning:** Fine-tuning the LLM (or the retriever) on your domain or on “retrieve-then-answer” examples to improve end-to-end RAG performance.
- **Long-context models:** Models with very large context windows can reduce the need for perfect retrieval by including more (or all) documents in the prompt, at the expense of higher cost and latency.
- **Agentic RAG:** Systems where **multiple LLMs** (or agents) each handle a **single part** of a large workflow. The system has **agency** to decide **what data to retrieve**, when to re-query, and how to combine retrieval with reasoning and generation.

## Summary

| Topic | Takeaway |
|-------|----------|
| **Classical search** | IR is mature; vector DBs and similarity scoring underpin modern retrievers. |
| **RAG** | Augments LLMs by adding retrieved, trusted information to the prompt. |
| **Retrieve / Generate** | Retrieve = query → relevant docs; Generate = prompt → LLM response. |
| **LLMs** | Token-by-token, auto-regressive generation; limited by context window and compute. |
| **Retriever** | Like a librarian: organize, score, and return relevant documents; needs tuning. |
| **RAG design** | Many design choices; tuning and monitoring are key to high performance. |
| **Advanced** | Agentic workflows, multi-model setups, fine-tuning, and long-context models extend basic RAG. |