---



# Unit 2: RAG, Vector Stores, and Indexing

## Introduction
LLMs have a knowledge cutoff and can hallucinate. **Retrieval-Augmented Generation (RAG)** mitigates both problems by retrieving relevant data at query time and injecting it into the prompt.

In this notebook, we will master:
1.  **Embeddings:** Representing text as vectors.
2.  **Vector Stores:** Storing and searching vectors (FAISS).
3.  **Naïve RAG:** The standard Retrieval -> Augment -> Generate pipeline.
4.  **Indexing Challenges:** Deep dive into how vector databases search efficiently (Flat, IVF, HNSW, PQ).

---

## Part 4a: Embeddings & Vector Space

### 1. Introduction: Computers Don't Read English

If you ask a computer "Is a cat similar to a dog?", it doesn't know. To a computer, "cat" is just a string of bytes: `01100011...`.

To solve this, we use **Embeddings**.

### What is an Embedding?
An embedding is a translation from **Words** to **Lists of Numbers (Vectors)**, such that words with similar meanings map to vectors that lie close together.

### The Process (Flowchart)
```mermaid
graph LR
    A["Input Text ('Apple')"] -->|Tokenization| B["Tokens (101, 255)"]
    B -->|Embedding Model| C["Vector List ([0.1, -0.5, 0.9...])"]
    C -->|Store| D["Vector Database"]
```
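The flowchart can be mimicked with a toy example. This is only a sketch: the three-word vocabulary, the hand-written 3-dimensional vectors, and the `tokenize`/`embed` helpers are all hypothetical, and real embedding models learn their vocabulary and vectors from data and take context into account rather than doing a plain table lookup.

```python
# Toy illustration only: real models learn their vocabulary and vectors from data.
vocab = {"apple": 0, "banana": 1, "laptop": 2}   # hypothetical token IDs
embedding_table = [
    [0.9, 0.1, 0.0],   # apple
    [0.8, 0.2, 0.1],   # banana
    [0.1, 0.0, 0.9],   # laptop
]

def tokenize(text):
    """Lowercase, split on whitespace, and map known words to token IDs."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def embed(text):
    """Average the vectors of the tokens to get one vector for the whole text."""
    ids = tokenize(text)
    if not ids:
        raise ValueError("no known tokens in input")
    dims = len(embedding_table[0])
    return [sum(embedding_table[i][d] for i in ids) / len(ids) for d in range(dims)]

vector_store = {}                       # stands in for the vector database
vector_store["Apple"] = embed("Apple")
print(tokenize("Apple"), vector_store["Apple"])
```

Note that "apple" and "banana" were deliberately given similar numbers, which is the property a trained model is expected to produce on its own.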

In [None]:
# Setup
%pip install python-dotenv --upgrade --quiet langchain langchain-huggingface sentence-transformers langchain-community

from dotenv import load_dotenv
load_dotenv()

import os
from langchain_huggingface import HuggingFaceEmbeddings

# Using a FREE, open-source model from Hugging Face
# 'all-MiniLM-L6-v2' is small, fast, and very good for English.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## 2. Viewing a Vector

Let's see what the word "Apple" looks like to the machine.

### Conceptual Note: Dimensions
The vector below has **384 dimensions** (for MiniLM).
- Imagine a graph with X and Y axes (2 Dimensions). You can plot a point (x, y).
- Now imagine adding Z (3 Dimensions).
- Now imagine **384 axes**.

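It helps to picture each axis as a feature (e.g., "Is it a fruit?", "Is it red?", "Is it tech-related?"). In practice the dimensions are learned by the model and rarely correspond to a single nameable concept, but the numbers aren't random; together they encode meaning.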

In [None]:
vector = embeddings.embed_query("Apple")

print(f"Dimensionality: {len(vector)}")
print(f"First 5 numbers: {vector[:5]}")

## 3. The Math: Cosine Similarity

How do we know if two vectors are close? We measure the **Angle** between them.

### Cosine Similarity Formula
$$ \text{Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} $$

- **1.0**: Arrows point in the Exact Same Direction (Identical).
- **0.0**: Arrows are Perpendicular (Unrelated).
- **-1.0**: Arrows point in Opposite Directions (Opposite).
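
The three reference values can be checked by hand with 2-dimensional vectors. The sketch below uses only the standard library; the helper name `cosine` and the example vectors are chosen for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

east = [1.0, 0.0]
print(cosine(east, [2.0, 0.0]))    # same direction, different length
print(cosine(east, [0.0, 3.0]))    # perpendicular
print(cosine(east, [-1.0, 0.0]))   # opposite direction
```

The first comparison shows that vector length does not matter: only the direction does.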

**Experiment:** Let's compare "Cat", "Dog", and "Car".

In [None]:
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

vec_cat = embeddings.embed_query("Cat")
vec_dog = embeddings.embed_query("Dog")
vec_car = embeddings.embed_query("Car")

print(f"Cat vs Dog: {cosine_similarity(vec_cat, vec_dog):.4f}")
print(f"Cat vs Car: {cosine_similarity(vec_cat, vec_car):.4f}")

### Analysis
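You should see that **Cat & Dog** score higher than **Cat & Car**. The exact values depend on the embedding model, so treat figures such as ~0.8 vs. ~0.3 as illustrative.
This similarity score is the foundation of semantic search and of the retrieval step in RAG systems.

It is arguably one of the most important concepts behind modern retrieval-based AI.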
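
To see how a similarity score turns into search, here is a minimal sketch that ranks a few documents against a query. The 3-dimensional vectors are hand-made stand-ins for real embeddings, and `top_k` is a hypothetical helper, not a library function.

```python
import numpy as np

def top_k(query, doc_vectors, k=2):
    """Return indices and scores of the k rows most similar to the query by cosine."""
    q = query / np.linalg.norm(query)
    docs = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    scores = docs @ q                    # cosine similarity per document
    order = np.argsort(-scores)[:k]      # highest score first
    return order, scores[order]

# Hypothetical 3-d "embeddings", hand-made for illustration
doc_vectors = np.array([
    [0.9, 0.1, 0.0],   # doc 0: about pets
    [0.1, 0.9, 0.1],   # doc 1: about vehicles
    [0.8, 0.3, 0.1],   # doc 2: also about pets
])
query = np.array([1.0, 0.2, 0.0])
idx, scores = top_k(query, doc_vectors)
print(idx, scores)
```

Normalizing every vector first means a single matrix product yields all cosine scores at once.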

---



# Unit 2 - Part 4b: Naive RAG Pipeline

## 1. Introduction: The Open-Book Test

RAG (Retrieval-Augmented Generation) is just an Open-Book Test architecture.
1.  **Retrieval:** Find the right page in the textbook.
2.  **Generation:** Write the answer using that page.

### The Pipeline (Flowchart)
```mermaid
graph TD
    User[User Question] --> Retriever[Retriever System]
    Retriever -->|Search Database| Docs[Relevant Documents]
    Docs --> Combiner[Prompt Template]
    User --> Combiner
    Combiner -->|Full Prompt w/ Context| LLM[Gemini Model]
    LLM --> Answer[Final Answer]
```
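
Before wiring up real components, the flow in the diagram can be sketched with stubs. Everything here is a hypothetical stand-in: `retrieve` ranks by shared words instead of embeddings, and `fake_llm` only reports the prompt length instead of calling a model.

```python
# All names below are hypothetical stand-ins, not LangChain APIs.
knowledge_base = [
    "The office wifi password is rotated every Monday.",
    "Lunch is served at noon in the cafeteria.",
]

def retrieve(question, k=1):
    """Stub retriever: rank documents by how many words they share with the question."""
    q_words = set(question.lower().replace("?", "").split())

    def overlap(doc):
        return len(q_words & set(doc.lower().replace(".", "").split()))

    return sorted(knowledge_base, key=overlap, reverse=True)[:k]

def build_prompt(question, docs):
    """Augment: place the retrieved text ahead of the question."""
    context = "\n".join(docs)
    return f"Answer based ONLY on the context below:\n{context}\n\nQuestion: {question}"

def fake_llm(prompt):
    """Stub generator: a real system would send the prompt to a model here."""
    return f"[model output for a prompt of {len(prompt)} characters]"

question = "When is the wifi password rotated?"
docs = retrieve(question)
prompt = build_prompt(question, docs)
print(prompt)
print(fake_llm(prompt))
```

The user question reaches the prompt twice, once directly and once through the retrieved context, which matches the two arrows leaving `User` in the diagram.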

In [None]:
%pip install --quiet langchain-google-genai

In [None]:
# Setup
%pip install python-dotenv --upgrade --quiet faiss-cpu langchain-huggingface sentence-transformers langchain-community
from dotenv import load_dotenv
load_dotenv()

import getpass
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_huggingface import HuggingFaceEmbeddings

if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter your Google API Key: ")

llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash")

# Using the same free model as Part 4a
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## 2. The "Knowledge Base" (Grounding)

LLMs hallucinate in part because they rely only on "parametric memory" (what they learned during training).
RAG introduces "non-parametric memory" (external facts).

Let's define some facts the LLM definitely *does not* know.

In [None]:
from langchain_core.documents import Document

docs = [
    Document(page_content="Piyush's favorite food is Pizza with extra cheese."),
    Document(page_content="The secret password to the lab is 'Blueberry'."),
    Document(page_content="LangChain is a framework for developing applications powered by language models."),
]

## 3. Indexing (Storing the Knowledge)

We use **FAISS** (Facebook AI Similarity Search) to store the embeddings.
Think of FAISS as a super-fast librarian that organizes books by content, not title.

In [None]:
from langchain_community.vectorstores import FAISS

vectorstore = FAISS.from_documents(docs, embeddings)
retriever = vectorstore.as_retriever()

## 4. The RAG Chain

We use LCEL to stitch it together.

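- **Step 1:** The `retriever` takes the question, converts it to numbers, and returns the closest documents. LangChain's default is the top 4, so with only three documents in the store, all of them come back, ranked by similarity.
- **Step 2:** `RunnablePassthrough` forwards the question unchanged.
- **Step 3:** The `prompt` fills `{context}` and `{question}` with those two values.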
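
The data flow of these three steps can be imitated in plain Python. This is a sketch of the idea only, not how LCEL is implemented: `parallel`, `pipe`, and every `stub_*` function are hypothetical helpers written for this illustration.

```python
def parallel(steps):
    """Run every step on the same input and collect the results in a dict."""
    def run(x):
        return {name: step(x) for name, step in steps.items()}
    return run

def pipe(*stages):
    """Feed the output of each stage into the next one."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

# Hypothetical stand-ins for the retriever, prompt, model, and output parser
def stub_retriever(question):
    return ["The secret password to the lab is 'Blueberry'."]

def passthrough(question):
    return question

def stub_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"

def stub_llm(prompt_text):
    return {"content": f"(answer derived from {len(prompt_text)} prompt characters)"}

def stub_parser(message):
    return message["content"]

toy_chain = pipe(
    parallel({"context": stub_retriever, "question": passthrough}),
    stub_prompt,
    stub_llm,
    stub_parser,
)
print(toy_chain("What is the secret password?"))
```

The dictionary stage is what lets one input (the question) fan out into the two values the prompt needs.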

In [None]:
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

template = """
Answer based ONLY on the context below:
{context}

Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

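def format_docs(retrieved):
    # Join the retrieved documents' text so the prompt receives plain text
    # instead of the repr of a list of Document objects.
    return "\n\n".join(doc.page_content for doc in retrieved)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)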

result = chain.invoke("What is the secret password?")
print(result)

## 5. Analysis

The retrieval step is opaque here. In the next notebook (**4c**), we will look *inside* the retriever to understand how FAISS actually finds that document among millions of others.

---



# Unit 2 - Part 4c: Deep Dive into Indexing Algorithms

## 1. Introduction: The Scale Problem

Comparing 1 vector against 10 vectors is fast.
Comparing 1 vector against **100 Million** vectors is slow.

**FAISS (Facebook AI Similarity Search)** was built to solve this.

### The Trade-off Triangle
You can pick 2:
- **Speed** (Query time)
- **Accuracy** (Recall)
- **Memory** (RAM usage)

We will explore algorithms that optimize different corners of this triangle.
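
The memory corner is easy to quantify with back-of-the-envelope arithmetic. The sketch below assumes uncompressed float32 vectors (4 bytes per value) with 384 dimensions, as produced by MiniLM, and ignores any index overhead; `flat_index_bytes` is a hypothetical helper.

```python
def flat_index_bytes(n_vectors, dim, bytes_per_value=4):
    """Raw storage for uncompressed float32 vectors (ignores index overhead)."""
    return n_vectors * dim * bytes_per_value

for n in (10_000, 1_000_000, 100_000_000):
    gb = flat_index_bytes(n, dim=384) / 1e9
    print(f"{n:>11,} vectors x 384 dims = {gb:,.3f} GB")
```

At 100 million vectors the raw data alone is roughly 154 GB, which is why compression and approximate search become attractive at that scale.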

In [None]:
import faiss
import numpy as np

# Mock Data: 10,000 vectors of size 128
d = 128
nb = 10000
xb = np.random.random((nb, d)).astype('float32')

## 2. Flat Index (Brute Force)

**Concept:** Check every single item.

- **Algo:** `IndexFlatL2`
- **Pros:** 100% Accuracy (Gold Standard).
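- **Cons:** Slow (O(N)): every query is compared against every stored vector, so cost grows linearly and becomes expensive as collections reach many millions of vectors.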
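
What a flat index computes at query time can be written in a few lines of NumPy. This is a sketch of exact search by squared L2 distance (the metric `IndexFlatL2` reports), not FAISS's optimized implementation; `flat_search` and the `_demo` arrays are hypothetical names.

```python
import numpy as np

def flat_search(xb, xq, k):
    """Exact k-nearest-neighbour search by squared L2 distance."""
    # Pairwise squared distances, shape (n_queries, n_database)
    diff = xq[:, None, :] - xb[None, :, :]
    dist = (diff ** 2).sum(axis=2)
    idx = np.argsort(dist, axis=1)[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx

rng = np.random.default_rng(0)
xb_demo = rng.random((1000, 16), dtype=np.float32)
xq_demo = xb_demo[:3]                     # query with three database vectors
distances, ids = flat_search(xb_demo, xq_demo, k=4)
print(ids[:, 0])                          # each query's nearest neighbour is itself
```

Every query touches all 1000 rows, which is exactly the linear cost listed under Cons.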


In [None]:
index = faiss.IndexFlatL2(d)
index.add(xb)
print(f"Flat Index contains {index.ntotal} vectors")

## 3. IVF (Inverted File Index)

**Concept:** Clustering / Partitioning.

Imagine looking for a book. Instead of checking every shelf, you go to the "Sci-Fi" section. Then you only search books *in that section*.

### How it works (Flowchart)
```mermaid
graph TD
    Data[All 1M Vectors] -->|Train| Clusters["1000 Cluster Centers (Centroids)"]
    Query[User Query] -->|Step 1| FindClosest[Find Closest Centroid]
    FindClosest -->|Step 2| Search[Search ONLY vectors in that Cluster]
```

**Analogy:** Voronoi Cells (Zip Codes). We only search the local zip code.
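
The two steps in the flowchart can be sketched in NumPy. Two simplifications are assumed here and are not how FAISS does it: centroids are random data points instead of k-means results, and all names (`n_cells`, `n_probe`, `ivf_search`) are hypothetical. `n_probe` plays the role of FAISS's `nprobe` setting, which controls how many neighbouring cells are searched (FAISS defaults to 1).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((2000, 16), dtype=np.float32)
n_cells, n_probe = 20, 3

# "Training" shortcut: pick random data points as centroids (FAISS runs k-means instead)
centroids = data[rng.choice(len(data), n_cells, replace=False)]

def nearest_centroids(x, count):
    """Indices of the `count` centroids closest to vector x."""
    return np.argsort(((centroids - x) ** 2).sum(axis=1))[:count]

# Inverted lists: cell id -> ids of the vectors assigned to that cell
assignments = np.array([nearest_centroids(v, 1)[0] for v in data])
inverted_lists = {c: np.flatnonzero(assignments == c) for c in range(n_cells)}

def ivf_search(query, k):
    """Search only the vectors stored in the n_probe closest cells."""
    cells = nearest_centroids(query, n_probe)
    candidates = np.concatenate([inverted_lists[c] for c in cells])
    dist = ((data[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dist)[:k]], len(candidates)

ids, scanned = ivf_search(data[0], k=5)
print(f"scanned {scanned} of {len(data)} vectors; nearest id = {ids[0]}")
```

Probing more cells scans more vectors and recovers neighbours that sit just across a cell boundary, trading speed for recall.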

In [None]:
nlist = 100 # How many 'zip codes' (clusters) we want
quantizer = faiss.IndexFlatL2(d) # Coarse quantizer: assigns each vector to its nearest centroid
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)

# We MUST train it first so it learns where the clusters are
index_ivf.train(xb)
index_ivf.add(xb)

## 4. HNSW (Hierarchical Navigable Small World)

**Concept:** Six Degrees of Separation.

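HNSW builds a multi-layer **Graph** in which every vector is a node linked to its nearest neighbors.
- **Layer 0:** Contains every point, each connected to its close neighbors.
- **Upper layers:** Sparse "Express Highways" that hold only a subset of points and connect distant regions.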

**Analogy:** Catching a flight.
You don't fly Local -> Local -> Local.
You fly Local -> **HUB** (Chicago) -> **HUB** (London) -> Local.

- **Pros:** Extremely fast retrieval.
- **Cons:** Heavier on RAM (needs to store the edges of the graph).
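
The core move in graph search is a greedy walk: hop to whichever neighbour is closest to the query until no hop helps. The sketch below is heavily simplified and is not the HNSW algorithm itself: it uses a single layer, builds the graph by brute force, and keeps only one candidate at a time. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.random((500, 8), dtype=np.float32)
n_links = 8

# Build a single-layer k-NN graph by brute force (real HNSW builds it
# incrementally and adds sparse upper layers)
pairwise = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
neighbours = np.argsort(pairwise, axis=1)[:, 1:n_links + 1]   # skip column 0 (self)

def greedy_search(query, entry=0):
    """Hop to the neighbour closest to the query until no hop improves."""
    current = entry
    current_dist = ((points[current] - query) ** 2).sum()
    hops = 0
    while True:
        cand = neighbours[current]
        cand_dist = ((points[cand] - query) ** 2).sum(axis=1)
        best = cand_dist.argmin()
        if cand_dist[best] >= current_dist:
            return current, hops           # local minimum: stop here
        current, current_dist = cand[best], cand_dist[best]
        hops += 1

query = rng.random(8, dtype=np.float32)
found, hops = greedy_search(query)
exact = ((points - query) ** 2).sum(axis=1).argmin()
print(f"greedy result: {found} after {hops} hops; exact nearest: {exact}")
```

The walk can stop at a local minimum that is not the true nearest neighbour, which is why the result is approximate; the upper "highway" layers exist to give the walk a good starting point.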

In [None]:
M = 16 # Number of connections per node (The 'Hub' factor)
index_hnsw = faiss.IndexHNSWFlat(d, M)
index_hnsw.add(xb)

## 5. PQ (Product Quantization)

**Concept:** Compression (Lossy).

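Do we need 32-bit float precision (`0.123456789`)? Usually not; an approximation is fine.
PQ breaks each vector into chunks (sub-vectors) and replaces every chunk with the ID of the closest entry in a small learned codebook, so a whole vector is stored in just a few bytes.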

**Analogy:** 4K Video vs 480p Video.
- 480p is blurry, but it's 10x smaller and faster to stream.
- Use PQ when you are RAM constrained (e.g., storing 1 Billion vectors).
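
The encode/decode cycle can be sketched in NumPy. One shortcut is assumed: codebook entries are sampled from the training data, whereas real PQ learns them with k-means per chunk. The names `codebooks`, `encode`, and `decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_sub, n_codes = 16, 4, 256           # 4 sub-vectors, 256 codes each (8 bits)
sub_dim = dim // n_sub
train = rng.random((5000, dim), dtype=np.float32)

# Shortcut: sample training sub-vectors as codebook entries
# (real PQ runs k-means per chunk)
codebooks = np.stack([
    train[rng.choice(len(train), n_codes, replace=False), j * sub_dim:(j + 1) * sub_dim]
    for j in range(n_sub)
])                                         # shape (n_sub, n_codes, sub_dim)

def encode(vectors):
    """Replace each sub-vector with the 1-byte id of its nearest codebook entry."""
    codes = np.empty((len(vectors), n_sub), dtype=np.uint8)
    for j in range(n_sub):
        chunk = vectors[:, j * sub_dim:(j + 1) * sub_dim]
        dist = ((chunk[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(axis=2)
        codes[:, j] = dist.argmin(axis=1)
    return codes

def decode(codes):
    """Rebuild approximate vectors by concatenating the selected codebook entries."""
    return np.hstack([codebooks[j][codes[:, j]] for j in range(n_sub)])

codes = encode(train[:100])
approx = decode(codes)
print(f"{train[:100].nbytes} bytes -> {codes.nbytes} bytes")
print(f"mean squared reconstruction error: {((train[:100] - approx) ** 2).mean():.4f}")
```

Each 64-byte vector shrinks to 4 bytes here, and the reconstruction error is the "blur" from the video analogy.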

In [None]:
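m = 8 # Split vector into 8 sub-vectors
nbits = 8 # Bits per sub-vector code (2**8 = 256 centroids per sub-quantizer)
index_pq = faiss.IndexPQ(d, m, nbits)
index_pq.train(xb)
index_pq.add(xb)
print(f"PQ compression complete: {d * 4} bytes -> {m * nbits // 8} bytes per vector")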