## Why Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is an architecture that combines
information retrieval with language model generation.

RAG consists of three main components:

1. **Retrieval**  
   Relevant documents are retrieved from an external knowledge base
   using vector embeddings and similarity search.

2. **Augmentation**  
   The retrieved documents are injected into the model's context.
   This step augments the language model with external, factual information
   that is not contained in its internal parameters.

3. **Generation**  
   A language model generates the final answer conditioned on the retrieved context.

The key idea behind RAG is that generation quality depends on retrieval quality.
By grounding the model in retrieved documents, RAG reduces hallucinations
and enables the use of private or domain-specific data.
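
The three steps above can be sketched end to end with toy data. This is a minimal illustration, not a real pipeline: the two-dimensional vectors, the `KNOWLEDGE_BASE` list, and the `generate` placeholder are all hypothetical stand-ins for an embedding model, a vector database, and a language model call.

```python
import math

# Hypothetical toy knowledge base: (text, hand-made 2-D "embedding").
# Real embeddings come from a model and have hundreds of dimensions.
KNOWLEDGE_BASE = [
    ("git rebase replays commits onto another base.", [0.9, 0.1]),
    ("RAG grounds answers in retrieved documents.", [0.1, 0.9]),
]

def cosine(a, b):
    """Cosine similarity between two plain-Python vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, k=1):
    """Step 1: return the k texts whose vectors are closest to the query."""
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

def augment(question, passages):
    """Step 2: inject the retrieved passages into the prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def generate(prompt):
    """Step 3: placeholder for a language model call (assumed, not a real API)."""
    return f"<model output conditioned on {len(prompt)} prompt characters>"

question = "What does RAG do?"
query_vec = [0.2, 0.8]  # pretend this is the embedded question
prompt = augment(question, retrieve(query_vec))
print(prompt)
print(generate(prompt))
```

If `retrieve` returns the wrong passage, the prompt built by `augment` carries the wrong context, which is why retrieval quality bounds generation quality.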

## Role of Embeddings and Vector Databases

Embeddings are numerical vector representations of text that encode its meaning.
Texts with similar meaning are mapped to vectors that are close to each other
in the embedding space, even if they use different words.

In a RAG pipeline:
- Documents are converted into embeddings and stored in a vector database.
- User queries are embedded using the same embedding model.
- A similarity search compares the query vector with document vectors
  to find the most relevant content.

Vector databases are designed to perform this similarity search efficiently,
even when the number of documents is large.
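
As a rough mental model, a vector database stores one vector per document and answers "which stored vectors are closest to this query vector?". The sketch below is a hypothetical brute-force stand-in (`TinyVectorStore` is not a real library class); production systems typically replace the exhaustive scan with an approximate nearest-neighbor index.

```python
import numpy as np

class TinyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (brute-force search)."""

    def __init__(self):
        self.vectors = []   # one unit-length vector per document
        self.payloads = []  # the text stored alongside each vector

    def add(self, vector, payload):
        v = np.asarray(vector, dtype=float)
        self.vectors.append(v / np.linalg.norm(v))
        self.payloads.append(payload)

    def search(self, query_vector, k=3):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        # All vectors are unit length, so the dot product equals cosine similarity.
        scores = np.stack(self.vectors) @ q
        top = np.argsort(scores)[::-1][:k]
        return [(self.payloads[i], float(scores[i])) for i in top]

# Hand-made vectors standing in for real embeddings
store = TinyVectorStore()
store.add([1.0, 0.0, 0.2], "git branching guide")
store.add([0.1, 1.0, 0.0], "gcloud CLI reference")
store.add([0.9, 0.1, 0.3], "resolving merge conflicts")

for text, score in store.search([1.0, 0.0, 0.3], k=2):
    print(f"{score:.3f}  {text}")
```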

In this Week 1 experiment, embeddings are evaluated directly to confirm
that semantically related queries retrieve documents from the correct topic,
which is a fundamental requirement for a reliable RAG system.

## Similarity Metrics

To retrieve relevant documents, we need a way to compare embeddings.
This is done using similarity (or distance) metrics.

The following metrics are used in this experiment:

- **Cosine similarity**  
  Compares two vectors by the angle between them.
  It depends only on direction, not on vector length, and is the most commonly used
  metric for embedding-based retrieval.

- **Dot product**  
  Measures similarity using both the direction and the length of vectors.
  If embeddings are normalized, the dot product produces the same ranking
  as cosine similarity.

- **Euclidean distance**  
  Measures the straight-line distance between two vectors.
  Because it is sensitive to vector length, it can rank results differently
  from cosine similarity when embeddings are not normalized; on unit-length
  embeddings it produces the same ranking.

In this notebook, these metrics are compared to observe how different
similarity choices affect document retrieval results.
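
A small numeric example makes the difference concrete. The three vectors below are hypothetical and were picked by hand so that each metric prefers a different document on raw vectors, while all three agree once the vectors are scaled to unit length.

```python
import numpy as np

# Hypothetical 2-D vectors chosen so that the metrics disagree.
docs = np.array([
    [0.5, 0.5],   # doc 0: same direction as the query, but short
    [4.0, 1.0],   # doc 1: different direction, but long
    [1.0, 0.6],   # doc 2: closest point in space to the query
])
query = np.array([1.0, 1.0])

def best_doc_per_metric(d, q):
    """Index of the top-ranked document under each metric."""
    cosine = (d @ q) / (np.linalg.norm(d, axis=1) * np.linalg.norm(q))
    dot = d @ q
    neg_euclid = -np.linalg.norm(d - q, axis=1)  # negated so higher = closer
    scores = {"cosine": cosine, "dot": dot, "euclidean": neg_euclid}
    return {name: int(np.argmax(s)) for name, s in scores.items()}

raw = best_doc_per_metric(docs, query)

# After L2 normalization only direction is left, so the rankings coincide.
unit_docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
normalized = best_doc_per_metric(unit_docs, unit_query)

print("raw:       ", raw)
print("normalized:", normalized)
```

On the raw vectors, cosine picks doc 0 (best direction), the dot product picks doc 1 (longest vector), and Euclidean distance picks doc 2 (nearest point). On the normalized vectors all three pick doc 0.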

# Week 1: PDF Embedding Benchmark

**Goal:**  
Evaluate embedding models on a small multi-topic PDF dataset.

**Scope (Week 1 only):**
- Embeddings
- Similarity metrics (cosine, dot product, Euclidean)
- Retrieval correctness

## What is being evaluated

This notebook evaluates whether vector embeddings can correctly retrieve
documents from the appropriate topic when multiple unrelated topics are present.

The experiment focuses on:
- Understanding embedding behavior
- Comparing similarity metrics
- Verifying retrieval correctness across topics

In [11]:
# Imports
import os
import numpy as np
import pandas as pd
from pathlib import Path
from sentence_transformers import SentenceTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

## Configuration

- PDFs are organized by topic folders
- Documents are chunked before embedding
- Multiple embedding models are compared
- Top-k retrieval is evaluated

In [14]:
# Project paths (notebook-safe)
PROJECT_ROOT = Path.cwd().parent
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(exist_ok=True)

print("Project root:", PROJECT_ROOT)
print("Data dir exists:", DATA_DIR.exists())
print("Artifacts dir exists:", ARTIFACTS_DIR.exists())

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
K = 3

Project root: /Users/tkhamidulin/Desktop/First Project - RAG
Data dir exists: True
Artifacts dir exists: True


## Loading PDFs

PDF files are loaded from topic-specific folders.
Each document is tagged with metadata:
- topic
- source file
- page number

In [15]:
def load_pdfs(data_dir: Path) -> list:
    """
    Load PDFs from topic subfolders.
    Adds metadata: topic, source file, page number.
    """
    docs = []

    for topic_dir in data_dir.iterdir():
        if not topic_dir.is_dir():
            continue

        topic = topic_dir.name
        for pdf_path in topic_dir.glob("*.pdf"):
            loader = PyPDFLoader(str(pdf_path))
            pages = loader.load()
            for doc in pages:
                doc.metadata["topic"] = topic
                doc.metadata["source"] = pdf_path.name
                docs.append(doc)

        print(f"[{topic}] loaded")

    print(f"Total pages loaded: {len(docs)}")
    return docs

## Document Chunking

Documents are split into overlapping chunks to preserve semantic context.
Chunk metadata is preserved.
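
To show what "overlapping" means, here is a simplified sliding-window chunker. It is only an illustration: `sliding_window_chunks` is a hypothetical helper, and it cuts at fixed character offsets, whereas `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks.

```python
def sliding_window_chunks(text: str, chunk_size: int, overlap: int) -> list:
    """Fixed-size character windows; each window repeats the last `overlap` characters of the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, overlap=3)
print(pieces)
```

With `chunk_size=10` and `overlap=3`, the second chunk starts with the last three characters of the first, so a sentence that straddles a boundary still appears whole in at least one chunk when it is short enough.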

In [16]:
def chunk_docs(docs: list, chunk_size: int, overlap: int) -> list:
    """
    Split documents into overlapping chunks.
    """
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )
    chunks = splitter.split_documents(docs)

    print(f"Total chunks created: {len(chunks)}")
    return chunks

## Embedding Models

The following standard open-source embedding models are evaluated:

- MiniLM
- MPNet
- E5-base
- BGE-base
- GTE-base

These models are widely used as baselines in retrieval and RAG systems.

In [17]:
MODELS = {
    "MiniLM": "sentence-transformers/all-MiniLM-L6-v2",
    "MPNet": "sentence-transformers/all-mpnet-base-v2",
    "E5-base": "intfloat/e5-base-v2",
    "BGE-base": "BAAI/bge-base-en-v1.5",
    "GTE-base": "thenlper/gte-base",
}

## Evaluation Queries

Each query is associated with an expected topic.
A query counts as a hit when at least one of the top-k retrieved chunks comes from its expected topic.

In [19]:
QUERIES = [
    ("What is retrieval augmented generation?", "RAG"),
    ("How does HyDE improve retrieval?", "RAG"),
    ("What are LangChain retrieval methods?", "RAG"),

    ("How to create a git branch?", "GIT"),
    ("What is git rebase?", "GIT"),
    ("How to resolve merge conflicts?", "GIT"),

    ("What is Google Cloud Platform?", "GCP"),
    ("How to use gcloud CLI?", "GCP"),
    ("What are GCP security features?", "GCP"),
]

## Similarity Metrics and Retrieval Evaluation

The following similarity metrics are compared:
- Cosine similarity
- Dot product (raw and normalized)
- Euclidean distance

Retrieval quality is measured using:
- hit@k
- MRR
- topic accuracy@k
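
A worked example on hypothetical rankings shows how the three measures differ. The helper definitions mirror the ones used in this notebook; the ranked topic lists are made up for illustration and are not benchmark results.

```python
def hit_at_k(topics, expected, k):
    """1.0 if the expected topic appears anywhere in the top k."""
    return 1.0 if expected in topics[:k] else 0.0

def reciprocal_rank(topics, expected):
    """1 / rank of the first result with the expected topic, or 0.0 if none."""
    for rank, topic in enumerate(topics, start=1):
        if topic == expected:
            return 1.0 / rank
    return 0.0

def topic_acc_at_k(topics, expected, k):
    """Fraction of the top k results that have the expected topic."""
    top = topics[:k]
    return sum(t == expected for t in top) / len(top) if top else 0.0

# Hypothetical ranked topics for two queries whose expected topic is "GIT"
rankings = [
    ["RAG", "GIT", "GIT"],  # first correct result at rank 2
    ["GIT", "GIT", "RAG"],  # first correct result at rank 1
]

hit = hit_at_k(rankings[0], "GIT", 3)        # GIT is in the top 3
rr = reciprocal_rank(rankings[0], "GIT")     # first GIT at rank 2 -> 1/2
acc = topic_acc_at_k(rankings[0], "GIT", 3)  # 2 of 3 results are GIT

# MRR is the mean of the reciprocal ranks over all queries: (1/2 + 1/1) / 2
mean_rr = sum(reciprocal_rank(r, "GIT") for r in rankings) / len(rankings)
print(hit, rr, round(acc, 3), mean_rr)
```

hit@k only asks whether any correct chunk made it into the top k, MRR rewards placing the first correct chunk early, and topic accuracy@k penalizes off-topic chunks anywhere in the top k.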

In [20]:
# %% 3) Similarity functions
def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity (normalized dot product)."""
    a_norm = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-9)
    b_norm = b / (np.linalg.norm(b) + 1e-9)
    return a_norm @ b_norm

def dot_raw(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Raw dot product (no normalization)."""
    return a @ b

def dot_norm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Dot product on pre-normalized vectors."""
    return a @ b  # assumes a, b already normalized

def euclidean_dist(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Negative Euclidean distance (higher = closer)."""
    return -np.linalg.norm(a - b, axis=1)

SIMILARITY_FNS = {
    "cosine": cosine_sim,
    "dot_raw": dot_raw,
    "dot_norm": dot_norm,
    "euclidean": euclidean_dist,
}

In [21]:

# %% 4) Retrieval metrics
def hit_at_k(topics: list, expected: str, k: int) -> float:
    """1.0 if expected topic in top-k."""
    return 1.0 if expected in topics[:k] else 0.0

def mrr(topics: list, expected: str) -> float:
    """Reciprocal rank of first correct result."""
    for i, t in enumerate(topics, 1):
        if t == expected:
            return 1.0 / i
    return 0.0

def topic_acc_at_k(topics: list, expected: str, k: int) -> float:
    """Fraction of top-k matching expected topic."""
    top = topics[:k]
    return sum(1 for t in top if t == expected) / len(top) if top else 0.0


In [22]:

# %% 5) Model-specific formatting
def format_query(text: str, model_id: str) -> str:
    if "e5" in model_id.lower():
        return f"query: {text}"
    if "bge" in model_id.lower():
        return f"Represent this sentence for searching relevant passages: {text}"
    return text

def format_passage(text: str, model_id: str) -> str:
    if "e5" in model_id.lower():
        return f"passage: {text}"
    return text

## Embedding Benchmark

For each embedding model:
- All chunks are embedded
- Each query is embedded and scored against every chunk with cosine similarity
- The top-k chunks are evaluated with hit@k and MRR

In [27]:
def run_benchmark(chunks, queries):
    """Embed all chunks per model, retrieve the top-K by cosine similarity, and score each query."""
    results = []

    for model_name, model_id in MODELS.items():
        model = SentenceTransformer(model_id)
        # Apply model-specific prefixes (a no-op for models that need none)
        texts = [format_passage(c.page_content, model_id) for c in chunks]
        doc_emb = model.encode(texts, convert_to_numpy=True)

        for query, expected in queries:
            q_emb = model.encode(format_query(query, model_id), convert_to_numpy=True)
            scores = cosine_sim(doc_emb, q_emb)

            # Indices of the K highest-scoring chunks, best first
            top_idx = np.argsort(scores)[::-1][:K]
            top_topics = [chunks[i].metadata["topic"] for i in top_idx]

            results.append({
                "model": model_name,
                "query": query,
                "expected": expected,
                f"hit@{K}": hit_at_k(top_topics, expected, K),
                "mrr": mrr(top_topics, expected)
            })

    return pd.DataFrame(results)
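
The benchmark function above scores chunks with cosine similarity only. One way the inner step could be extended to the other metrics is sketched below. It is an assumption about how such a comparison might look, not the notebook's implementation: `score_all_metrics` and `top_k_topics` are hypothetical helpers, and the small arrays stand in for `model.encode(...)` output so the block runs without downloading a model.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-9)

def score_all_metrics(doc_emb, q_emb):
    """Return one score array per metric; higher always means more similar."""
    doc_unit, q_unit = l2_normalize(doc_emb), l2_normalize(q_emb)
    return {
        "cosine": doc_unit @ q_unit,
        "dot_raw": doc_emb @ q_emb,
        "dot_norm": doc_unit @ q_unit,  # identical to cosine by construction
        "euclidean": -np.linalg.norm(doc_emb - q_emb, axis=1),
    }

def top_k_topics(scores, topics, k):
    """Topics of the k highest-scoring chunks, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [topics[i] for i in order]

# Synthetic stand-ins for chunk embeddings, chunk topics, and a query embedding
doc_emb = np.array([[0.9, 0.1], [0.8, 0.3], [0.1, 0.9], [0.2, 0.7]])
topics = ["GIT", "GIT", "GCP", "GCP"]
q_emb = np.array([1.0, 0.2])

per_metric = {
    name: top_k_topics(scores, topics, k=2)
    for name, scores in score_all_metrics(doc_emb, q_emb).items()
}
print(per_metric)
```

Storing the metric name as an extra column in each result row would then allow grouping the final table by model and metric.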


## Running the Experiment

In [29]:
docs = load_pdfs(DATA_DIR)
chunks = chunk_docs(docs, CHUNK_SIZE, CHUNK_OVERLAP)

df = run_benchmark(chunks, QUERIES)
df


[GCP] loaded


Ignoring wrong pointing object 11 0 (offset 0)


[RAG] loaded
[GIT] loaded
Total pages loaded: 154
Total chunks created: 662


| | model | query | expected | hit@3 | mrr |
|---|---|---|---|---|---|
| 0 | MiniLM | What is retrieval augmented generation? | RAG | 1.0 | 1.0 |
| 1 | MiniLM | How does HyDE improve retrieval? | RAG | 1.0 | 1.0 |
| 2 | MiniLM | What are LangChain retrieval methods? | RAG | 1.0 | 1.0 |
| 3 | MiniLM | How to create a git branch? | GIT | 1.0 | 1.0 |
| 4 | MiniLM | What is git rebase? | GIT | 1.0 | 1.0 |
| 5 | MiniLM | How to resolve merge conflicts? | GIT | 1.0 | 1.0 |
| 6 | MiniLM | What is Google Cloud Platform? | GCP | 1.0 | 1.0 |
| 7 | MiniLM | How to use gcloud CLI? | GCP | 1.0 | 1.0 |
| 8 | MiniLM | What are GCP security features? | GCP | 1.0 | 1.0 |
| 9 | MPNet | What is retrieval augmented generation? | RAG | 1.0 | 1.0 |
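
Per-query rows are easier to compare once they are averaged per model. The sketch below shows one way to do that with a pandas `groupby`; the `demo` rows are made-up values in the same shape as the benchmark output and are not results from this experiment.

```python
import pandas as pd

# Made-up rows in the same shape as the benchmark output (not real results)
demo = pd.DataFrame([
    {"model": "A", "query": "q1", "hit@3": 1.0, "mrr": 1.0},
    {"model": "A", "query": "q2", "hit@3": 1.0, "mrr": 0.5},
    {"model": "B", "query": "q1", "hit@3": 0.0, "mrr": 0.0},
    {"model": "B", "query": "q2", "hit@3": 1.0, "mrr": 1.0},
])

# Mean hit@3 and MRR per model, one row per model
summary = demo.groupby("model")[["hit@3", "mrr"]].mean()
print(summary)
```

Applied to the real results, the same call would be `df.groupby("model")[["hit@3", "mrr"]].mean()`.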


## Conclusions

- All tested models are capable of retrieving documents from the correct topic
- Cosine similarity is a stable default choice
- Dataset diversity (multiple topics) is important to validate retrieval correctness