# RAG Fundamentals

**RAG (Retrieval-Augmented Generation)** retrieves *relevant information from a knowledge base*, which we then pass to an LLM to generate a response. It's like giving the LLM a *cheat sheet* of *just the right reference material* before asking it to answer.

**Why RAG matters**: Long texts can exceed context limits, but more importantly, adding noise makes models lose sight of detail and nuance, despite what needle-in-the-haystack benchmarks claim. RAG lets you supply only what's relevant.

**RAG for Category Mapping**

Say I want to map a product description against a set of predefined categories from a standardized nomenclature that is hundreds of items long. RAG helps by pre-filtering those categories down to just the relevant ones before asking the LLM to pick the best match. This matters because A) it's cheaper, and B) the LLM won't miss the right category when it's not buried in noise. RAG acts as the **smart filter** that gives the LLM a focused shortlist instead of the whole haystack.

```python
docs = [
    "Electronics > Computers > Laptops",
    "Electronics > Computers > Desktop Computers",
    "Electronics > Mobile Devices > Smartphones",
    "Electronics > Mobile Devices > Tablets",
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Headphones > Wired Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
    "Home & Kitchen > Furniture > Office Furniture > Desks",
    "Home & Kitchen > Furniture > Office Furniture > Chairs",
    "Home & Kitchen > Appliances > Small Appliances > Coffee Makers",
    "Clothing > Men's Clothing > Shirts",
    "Clothing > Women's Clothing > Dresses",
    "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
    "Sports & Outdoors > Exercise & Fitness > Cardio > Treadmills",
    "Books > Fiction > Science Fiction",
    "Books > Non-Fiction > Business & Money",
]

query = "Noise-cancelling over-ear bluetooth headphones with 30-hour battery life and premium sound quality"
```

Note that the goal here is to **surface a handful of relevant results** that can then be passed to an LLM for a further step (e.g., assigning a single category to the product description).
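
As a rough sketch of that hand-off, the snippet below formats a shortlist into a single-choice prompt. The function name `build_category_prompt`, the prompt wording, and the example shortlist are all hypothetical; a real retriever would produce the shortlist, and the prompt would need tuning for whichever LLM you use.

```python
def build_category_prompt(description: str, shortlist: list[str]) -> str:
    """Format a retrieved shortlist into a single-choice prompt for an LLM."""
    options = "\n".join(f"{i}. {cat}" for i, cat in enumerate(shortlist, start=1))
    return (
        "Pick the single best category for the product below.\n"
        f"Product: {description}\n"
        "Candidate categories:\n"
        f"{options}\n"
        "Answer with the number of the best match."
    )


# Hypothetical shortlist, standing in for what a retriever might return.
shortlist = [
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Headphones > Wired Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
]
prompt = build_category_prompt(
    "Noise-cancelling over-ear bluetooth headphones", shortlist
)
print(prompt)
```

The LLM now chooses among three candidates instead of the full list, which is where the cost and accuracy benefits come from.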

## Retrieval Architectures Overview

Once we've decided to retrieve relevant documents, the question becomes: *how do we actually compare a query to documents?* All approaches boil down to encoding text into vectors and measuring similarity, but *when* and *how* we do that encoding matters a lot for both accuracy and speed.

We will cover three approaches:

1. **Bi-encoders** - encode separately, compare vectors (fast, scalable)
2. **Cross-encoders** - encode together, get direct relevance score (accurate, slow)
3. **Late interaction** - encode separately, compare token-by-token (middle ground)
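
To make the difference between "compare vectors" and "compare token-by-token" concrete, here is a minimal sketch with toy numbers. It assumes unit-length embeddings (so a dot product behaves like cosine similarity) and uses the MaxSim-style scoring associated with ColBERT-like models; the vectors and function names are invented for illustration. A cross-encoder is not shown because its score comes from a model forward pass rather than from a formula over stored vectors.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def single_vector_score(query_vec, doc_vec):
    """Bi-encoder style: one vector per text, one similarity per pair."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: each query token takes its best-matching doc token; sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy unit-length 2-d token embeddings, invented for illustration.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0]]   # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [1.0, 0.0]]   # covers only the first query token

score_a = late_interaction_score(query_tokens, doc_a_tokens)
score_b = late_interaction_score(query_tokens, doc_b_tokens)
print(score_a, score_b)
```

Document A scores higher because every query token finds a close match in it, which is the kind of fine-grained signal a single pooled vector can blur.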

## Model Background

I have discussed elsewhere how BERT embeddings capture the contextual meaning of text.

At a high level, the models we use for RAG *compare* embeddings for a list of documents (i.e., product categories) against the query (i.e., product description) to find the most similar documents. In other words, *meaning determines proximity* between docs and query.

However, the models we use for RAG differ from BERT base models: they are fine-tuned for *similarity* search, typically via **contrastive learning** (pulling similar pairs closer together in embedding space while pushing dissimilar pairs further apart).
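
One common contrastive objective is the InfoNCE loss, a softmax cross-entropy in which the matching document should out-score the non-matching ones. The sketch below is an assumption about what "contrastive learning" looks like in practice, not a statement about how any particular model was trained; the temperature value and the toy vectors are made up.

```python
import math


def info_nce_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Cross-entropy over similarities: low when the positive beats the negatives."""

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Index 0 holds the positive pair; the rest are negatives.
    logits = [dot(query_vec, positive_vec) / temperature]
    logits += [dot(query_vec, neg) / temperature for neg in negative_vecs]

    # Numerically stable log-sum-exp.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
# Positive already close to the query, negative far away: loss near zero.
good_loss = info_nce_loss(query, [1.0, 0.0], [[0.0, 1.0]])
# Positive far from the query, negative close: large loss.
bad_loss = info_nce_loss(query, [0.0, 1.0], [[1.0, 0.0]])
print(good_loss, bad_loss)
```

Minimizing this loss is what pulls matching pairs together and pushes mismatched pairs apart in embedding space.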

### Semantic Embedding Models

When we fine-tune a model for similarity or RAG, what exactly does it output?

Traditionally, it outputs a **single vector representation** for the *entire input sequence*.

Where does the single vector come from? There are a few different ways:

#### CLS pooling

Use the `[CLS]` token's output vector as the representation of the whole input, and compute the loss against it (BERT models already output this vector by default).

```
[CLS] + input tokens + [SEP]
             ↓
        Transformer
             ↓
   [CLS] embedding captures
  "what is this text about?"
```

The key point is that fine-tuning trains the model so that the **`[CLS]` vector** minimizes the loss. Although the model outputs an embedding for every token in the *input sequence*, only the `[CLS]` embedding vector (`e_cls`) enters the loss calculation.

```
Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
Output:   ↓      ↓         ↓           ↓         ↓
        e_cls   e_1       e_2         e_3       e_sep
```

We keep only the `[CLS]` vector and discard everything else.

NOTE: Notice how we are **not** generating the embeddings for the `document` alongside the `query`. These are generated **separately**, in what is called the ***bi-encoder*** approach.

With bi-encoders, we do something like this:
```
Query:    [CLS] "noise" "cancelling" "headphones" [SEP]  →  e_cls_query
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP]  →  e_cls_doc
```
and then compare `e_cls_query` with `e_cls_doc` using cosine similarity (or dot product).
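
The comparison step can be sketched in a few lines. The vectors below are toy stand-ins for `e_cls_query` and `e_cls_doc` (real embeddings have hundreds of dimensions), and `rank_by_similarity` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 3-d vectors, invented for illustration.
e_cls_query = [0.9, 0.1, 0.0]
e_cls_docs = [
    [0.0, 1.0, 0.0],   # points in a different direction from the query
    [1.0, 0.0, 0.0],   # nearly the same direction as the query
    [0.5, 0.5, 0.0],   # partially aligned with the query
]
ranking = rank_by_similarity(e_cls_query, e_cls_docs)
print(ranking)
```

If the embeddings are normalized to unit length beforehand, the dot product gives the same ranking as cosine similarity and is cheaper to compute.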

Some popular models that use the `[CLS]` vector:

1. **BERT** (original) - The `[CLS]` token was designed for classification tasks, though raw BERT isn't great for similarity without fine-tuning
2. **Sentence-BERT (SBERT)** - One of the first widely used sentence embedding models, fine-tuned specifically for semantic similarity. It supports CLS pooling, although mean pooling is its default
3. **ColBERT** - Famous for its late interaction approach, which keeps one embedding per token rather than pooling everything into `[CLS]`, so it belongs here only loosely
4. **Many cross-encoder models** - When you concatenate query + document and pass them through a single encoder, the `[CLS]` output feeds a classification head that produces the relevance score

Worth noting: CLS pooling has fallen somewhat out of favor for pure bi-encoder similarity models; **mean pooling** (averaging all token embeddings) often works better in practice. But CLS remains very common in cross-encoder setups where query and document are processed together.

#### Mean Pooling

Mean pooling averages the embeddings of all tokens in the sequence to produce a single vector representation.

```
Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
Output:   ↓      ↓         ↓           ↓         ↓
        e_cls   e_1       e_2         e_3       e_sep
                 \         |          /
                  \        |         /
                   → mean(e_1, e_2, e_3) → final embedding
```

Rather than relying on a single special token to carry all the semantic weight, mean pooling lets *every content token **contribute equally** to the final representation*. Padding tokens are always masked out of the average. Whether `[CLS]` and `[SEP]` are included varies by implementation: the diagram above excludes them, while Sentence-Transformers averages over every non-padding token, special tokens included.
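
Both pooling strategies reduce the per-token outputs to one vector, and for mean pooling a mask decides which tokens count. The sketch below uses toy 2-d embeddings and hypothetical helper names; real models do the same thing on tensors, using the tokenizer's attention mask.

```python
def cls_pool(token_embeddings):
    """CLS pooling: keep the first token's vector, ignore the rest."""
    return token_embeddings[0]


def mean_pool(token_embeddings, mask):
    """Mean pooling: average the vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_embeddings, mask) if keep]
    dim = len(kept[0])
    return [sum(vec[j] for vec in kept) / len(kept) for j in range(dim)]


# Toy outputs for: [CLS] noise cancelling headphones [SEP] [PAD]
token_embeddings = [
    [9.0, 9.0],  # e_cls
    [1.0, 0.0],  # e_1
    [0.0, 1.0],  # e_2
    [2.0, 2.0],  # e_3
    [9.0, 9.0],  # e_sep
    [0.0, 0.0],  # padding
]
content_only = [0, 1, 1, 1, 0, 0]  # average e_1..e_3 only
non_padding = [1, 1, 1, 1, 1, 0]   # include special tokens, skip padding

cls_vector = cls_pool(token_embeddings)
mean_content = mean_pool(token_embeddings, content_only)
mean_non_padding = mean_pool(token_embeddings, non_padding)
print(cls_vector, mean_content, mean_non_padding)
```

The two masks give different vectors, so the query and the documents must be pooled the same way, and in the way the model was trained with.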

**Why it often works better than CLS:**
- The `[CLS]` token is a single point that must learn to summarize everything, which puts a lot of pressure on one vector
- Mean pooling distributes the representation across all tokens, which can be more robust
- For longer sequences especially, averaging captures more of the content

**Popular models using mean pooling:**

1. **Sentence-Transformers models** (e.g., `all-MiniLM-L6-v2`, `all-mpnet-base-v2`) - the default pooling strategy for most modern sentence embedding models
2. **E5 models** (from Microsoft) - use mean pooling with instruction-tuned embeddings
3. **GTE models** (from Alibaba) - general text embeddings using mean pooling
4. **BGE models** (from BAAI) - another popular family, listed here for contrast: its model cards specify CLS pooling rather than mean pooling

Mean pooling has become the **de facto standard** for bi-encoder similarity models where query and document are embedded separately.

Again, we would then encode the `document` the same way and compare the mean-pooled `query` and `document` embeddings for similarity.

> **In summary**:
> 
> ```
> Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
> Output:   ↓      ↓         ↓           ↓         ↓
>         e_cls   e_1       e_2         e_3       e_sep
>
> CLS pooling:  use e_cls
> Mean pooling: use mean(e_1, e_2, e_3)
> ```


fine - so show me the improved blog outline then

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

Here's the restructured outline:

```
# RAG Fundamentals
- What is RAG, why it matters
- RAG for Category Mapping (your use case example)

## Retrieval Architectures Overview
- Brief intro: "how do we compare query to documents?"

### Bi-Encoders
- One sentence: encode query and docs separately, compare vectors
- Use sentence-transformers as the concrete example under Bi-Encoders: "Here's how this looks with Sentence-Transformers" with a quick code snippet showing model.encode(query) and model.encode(docs)
- This raises the question: how do we get a single vector?

#### Pooling Strategies
- CLS pooling (with your token diagram)
- Mean pooling
- Summary comparison

### Cross-Encoders
- Encode query + doc together, get relevance score directly
- More accurate, but expensive (can't pre-compute doc embeddings)

### Late Interaction (ColBERT)
- Best of both worlds: embed separately, compare token-by-token
- Trade-off: more storage, but better accuracy than bi-encoders

## When to Use Which?
- Practical guidance / comparison table
```

The key change: pooling becomes a *subsection* of bi-encoders, since it's specifically answering "how do bi-encoders get their vectors?" Cross-encoders don't need pooling discussion (they output a score directly), and ColBERT has its own token-level comparison mechanism.

Does this structure feel right?

so we have some text under RAG fundamentals. what goes in the Retrieval Architectures overview?

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

This section is your **bridge**: it sets up the question that the subsections will answer.

Something like:

> Once we've decided to retrieve relevant documents, the question becomes: *how do we actually compare a query to documents?* All approaches boil down to encoding text into vectors and measuring similarity, but *when* and *how* we do that encoding matters a lot for both accuracy and speed.

Then a quick preview of the three approaches you'll cover:

1. **Bi-encoders** - encode separately, compare vectors (fast, scalable)
2. **Cross-encoders** - encode together, get direct relevance score (accurate, slow)
3. **Late interaction** - encode separately, compare token-by-token (middle ground)

That's it: just enough to orient the reader before diving into bi-encoders.

Does that feel like the right level of detail for an overview, or do you want to expand/trim it?

OK i have added that. what do I put in the next section - bi-encoders? let's re-use as much of my current text as possible (then tell me what we can delete). return the full text that should go in that section

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->

Here's the full text for the **Bi-encoders** section:

---

## Bi-Encoders

Bi-encoders encode the query and each document **separately**, producing a single vector for each. We then compare these vectors using cosine similarity (or dot product) to find the most relevant documents.

```
Query:    [CLS] "noise" "cancelling" "headphones" [SEP]  →  e_query
Document: [CLS] "Electronics" ">" "Audio" ">" ... [SEP]  →  e_doc

similarity = cosine(e_query, e_doc)
```

**Why this matters for scale**: Because documents are encoded independently, we can pre-compute all document embeddings once and store them. At query time, we only encode the query and compare against the stored vectors, which makes bi-encoders very fast.
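
A minimal sketch of that pre-compute-then-search pattern is below. `CategoryIndex` and `toy_encode` are hypothetical names, and `toy_encode` is a keyword counter standing in for a real embedding model so the example runs without downloads. With Sentence-Transformers installed, passing `SentenceTransformer("all-MiniLM-L6-v2").encode` as `encode` should slot in, though that path is not exercised here.

```python
import math


class CategoryIndex:
    """Encode documents once up front; encode only the query at search time."""

    def __init__(self, docs, encode):
        # `encode` maps a list of strings to a list of vectors.
        self.docs = list(docs)
        self.encode = encode
        self.doc_vecs = [self._normalize(v) for v in encode(self.docs)]

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    def search(self, query, top_k=3):
        q = self._normalize(self.encode([query])[0])
        # Unit vectors, so the dot product equals cosine similarity.
        scores = [sum(a * b for a, b in zip(q, d)) for d in self.doc_vecs]
        order = sorted(range(len(self.docs)), key=lambda i: scores[i], reverse=True)
        return [(self.docs[i], scores[i]) for i in order[:top_k]]


def toy_encode(texts):
    """Stand-in encoder: counts of a few hand-picked keywords (not a real model)."""
    vocab = ["headphones", "bluetooth", "laptops", "yoga"]
    return [[float(t.lower().count(w)) for w in vocab] for t in texts]


categories = [
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
    "Electronics > Computers > Laptops",
    "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
]
index = CategoryIndex(categories, toy_encode)  # document vectors computed once here
results = index.search("Over-ear bluetooth headphones, wireless headphones", top_k=2)
print(results)
```

Each call to `search` encodes a single query and reuses the stored document vectors, so query-time cost does not include re-encoding the catalogue.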

### Pooling Strategies

When we fine-tune a model for similarity, it outputs embeddings for the *entire input sequence*. But we need a **single vector** to represent the text. Where does it come from?

#### CLS Pooling

Use the `[CLS]` token's embedding as the representation:

```
Input:  [CLS] "noise" "cancelling" "headphones" [SEP]
Output:   ↓      ↓         ↓           ↓         ↓
        e_cls   e_1       e_2         e_3       e_sep

→ use e_cls
```

#### Mean Pooling

Average all token embeddings (excluding special tokens):

```
→ use mean(e_1, e_2, e_3)
```

**Why mean pooling often works better**: The `[CLS]` token must learn to summarize everything, which puts a lot of pressure on one vector. Mean pooling distributes the representation across all tokens, which can be more robust, especially for longer sequences.

Mean pooling has become the **de facto standard** for bi-encoder models (e.g., `all-MiniLM-L6-v2`, E5, GTE), though some families such as BGE use CLS pooling.

---

**You can delete these existing notes** (they're now absorbed above):
- "Semantic Embedding Models" 
- "CLS pooling" section
- "Mean Pooling" section
- The summary comparison note
- The note starting "Notice how we are **not** generating..."

Keep the model lists if you want, or I can show you where to work them back in as a reference section later. Want me to continue with Cross-Encoders next?