# RAG Fundamentals

**RAG (Retrieval-Augmented Generation)** retrieves *relevant information from a knowledge base*, which we then pass to an LLM to generate a response. It's like giving the LLM a *cheat sheet* of *just the right reference material* before asking it to answer.

**Why RAG matters**: Long texts can exceed context limits, but more importantly, adding noise makes models lose sight of detail and nuance, despite what needle-in-the-haystack benchmarks claim. RAG lets you supply only what's relevant.

**RAG for Category Mapping**

Say I want to map a product description against a set of predefined categories from a standardized nomenclature that is hundreds of items long. RAG helps by pre-filtering those categories down to just the relevant ones before asking the LLM to pick the best match. This matters because A) it's cheaper, and B) the LLM won't miss the right category when it's not buried in noise. RAG acts as the **smart filter** that gives the LLM a focused shortlist instead of the whole haystack.

```python
docs = [
    "Electronics > Computers > Laptops",
    "Electronics > Computers > Desktop Computers",
    "Electronics > Mobile Devices > Smartphones",
    "Electronics > Mobile Devices > Tablets",
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Headphones > Wired Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
    "Home & Kitchen > Furniture > Office Furniture > Desks",
    "Home & Kitchen > Furniture > Office Furniture > Chairs",
    "Home & Kitchen > Appliances > Small Appliances > Coffee Makers",
    "Clothing > Men's Clothing > Shirts",
    "Clothing > Women's Clothing > Dresses",
    "Sports & Outdoors > Exercise & Fitness > Yoga > Yoga Mats",
    "Sports & Outdoors > Exercise & Fitness > Cardio > Treadmills",
    "Books > Fiction > Science Fiction",
    "Books > Non-Fiction > Business & Money",
]

query = "Noise-cancelling over-ear bluetooth headphones with 30-hour battery life and premium sound quality"
```

Note that the goal here is to **surface a handful of relevant results** that can then be passed to an LLM for a further step (e.g. assigning a single category to the product description).
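
As a rough sketch of that hand-off, the snippet below formats a shortlist into a single-choice prompt. The function name `build_prompt`, the prompt wording, and the example shortlist are all assumptions made for illustration; the actual LLM call is left out.

```python
def build_prompt(product_description, shortlist):
    """Format a shortlist of candidate categories into a single-choice prompt."""
    # Number the candidates so the LLM can answer with a single index.
    options = "\n".join(
        f"{i}. {category}" for i, category in enumerate(shortlist, start=1)
    )
    return (
        "Pick the single best category for the product below.\n\n"
        f"Product: {product_description}\n\n"
        f"Candidate categories:\n{options}\n\n"
        "Answer with the number of the best match."
    )

# Hypothetical shortlist, as a retrieval step might return it.
example_shortlist = [
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Headphones > Wired Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
]
prompt = build_prompt(
    "Noise-cancelling over-ear bluetooth headphones", example_shortlist
)
print(prompt)
```

The prompt stays short because the LLM only sees the shortlist, not the full nomenclature.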

## Model Background

I have discussed BERT embeddings elsewhere, including how they capture the contextual meaning of text.

At a high level, the models we use for RAG *compare* the embeddings of a list of documents (i.e. product categories) against the embedding of the query (i.e. product description) to find the most similar documents. In other words, *meaning determines proximity* between docs and query.

However, the models we use for RAG differ from BERT base models: they are fine-tuned for *similarity* search.
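
The comparison step can be sketched without any model at all. In the snippet below, `toy_embed` is a deliberately crude stand-in (bag-of-words counts) for a real embedding model, and `top_k`, `sample_docs`, and `sample_query` are names invented for this illustration. Only the ranking mechanics carry over to a real setup.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: lowercase bag-of-words counts."""
    tokens = "".join(c.lower() if c.isalnum() else " " for c in text).split()
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, docs, k=3):
    """Score every doc against the query and keep the k most similar."""
    q = toy_embed(query)
    scored = [(cosine(q, toy_embed(d)), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]

sample_docs = [
    "Electronics > Audio > Headphones > Wireless Headphones",
    "Electronics > Audio > Speakers > Bluetooth Speakers",
    "Books > Fiction > Science Fiction",
]
sample_query = "over-ear bluetooth headphones"

shortlist = top_k(sample_query, sample_docs, k=2)
for score, doc in shortlist:
    print(f"{score:.3f}  {doc}")
```

The headphone category ranks first here only because of the shared word "headphones". Word overlap cannot tell that "bluetooth" and "wireless" are related, which is the gap a similarity-tuned embedding model is meant to close.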

In a few sentences, how do you train a model for similarity search? (Assume you would start from a BERT base model and then fine-tune it. If so, I want to know about the fine-tuning procedure.)

##### 🤖Reply🤖<!-- SOLVEIT_SEPARATOR_7f3a9b2c -->


Great question! The fine-tuning procedure typically uses **contrastive learning**:

1. **Training data**: You need pairs (or triplets) of texts with similarity labels. For example: (query, positive_match) pairs, or (anchor, positive, negative) triplets.

2. **Loss function**: The model learns to pull similar pairs *closer together* in embedding space while pushing dissimilar pairs *further apart*. Common losses include:
   - **Contrastive loss**: penalizes distance between positive pairs, and penalizes negative pairs that sit closer than a margin
   - **Triplet loss**: ensures anchor is closer to positive than to negative by some margin
   - **Multiple Negatives Ranking Loss**: treats other items in a batch as negatives (very efficient)

3. **Architecture tweak**: BERT gives you token-level embeddings, but for similarity you need a single vector per text. So you add a **pooling layer** (e.g., mean of all tokens, or just the [CLS] token) to get one embedding per sentence.
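
To make the pooling and loss steps above concrete, here is a minimal pure-Python sketch. It is not how a training library implements them: the 2-D vectors are made-up numbers, the function names are hypothetical, and `margin=0.5` and `scale=20.0` are assumed values chosen for illustration. A real setup would compute these on batched tensors and backpropagate through the encoder.

```python
import math

def mean_pool(token_vectors):
    """Average token-level vectors into one sentence vector (the pooling step)."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the anchor is closer to the positive than to the negative by `margin`."""
    d_pos = 1.0 - cosine(anchor, positive)  # cosine distance to the positive
    d_neg = 1.0 - cosine(anchor, negative)  # cosine distance to the negative
    return max(0.0, d_pos - d_neg + margin)

def in_batch_negatives_loss(queries, positives, scale=20.0):
    """Multiple-negatives-style loss: for query i, every positives[j != i] is a negative."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in positives]
        log_sum_exp = math.log(sum(math.exp(v) for v in logits))
        total += log_sum_exp - logits[i]  # cross-entropy with target index i
    return total / len(queries)

# Toy 2-D "token embeddings" (made-up numbers, purely for illustration).
anchor = mean_pool([[1.0, 0.0], [0.8, 0.2]])    # roughly [0.9, 0.1]
positive = mean_pool([[0.9, 0.1], [0.7, 0.3]])  # roughly [0.8, 0.2]
negative = mean_pool([[0.0, 1.0], [0.2, 0.8]])  # roughly [0.1, 0.9]

well_separated = triplet_loss(anchor, positive, negative)
swapped = triplet_loss(anchor, negative, positive)

aligned = in_batch_negatives_loss([anchor, negative], [positive, negative])
misaligned = in_batch_negatives_loss([anchor, negative], [negative, positive])
```

The triplet loss is zero when the positive is already much closer than the negative, and grows when the roles are swapped. The in-batch loss behaves the same way: it is small when each query lines up with its own positive and large when the pairing is shuffled.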

The result is a model where `cosine_similarity(embed(text_A), embed(text_B))` meaningfully reflects semantic similarity, which base BERT does not offer because it wasn't trained with that objective.

Does the triplet/contrastive setup make sense, or would an example help clarify?