# Week 2: Building the Local Retrieval Pipeline (FAISS)

**Scope:** Build a complete retrieval pipeline using real PDF documents: loading → chunking → embeddings → FAISS index → retrieval → evaluation.

**Data:** PDF documents from `data/` folder (RAG, GIT, GCP topics)

**Models:** `all-MiniLM-L6-v2` (384d), `all-mpnet-base-v2` (768d)

**Metrics:** Hit@K, Mean Reciprocal Rank (MRR)

In [2]:
import numpy as np
import pandas as pd
import faiss
from pathlib import Path
from sentence_transformers import SentenceTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

faiss.omp_set_num_threads(4)
print("FAISS configured: Using CPU mode")

FAISS configured: Using CPU mode


## How FAISS Works

### Architecture Overview

**FAISS** (Facebook AI Similarity Search) enables fast similarity search:

1. **Indexing**: Store embeddings in an optimized data structure
2. **Search**: Find K nearest neighbors to a query embedding
3. **Scoring**: Return similarity scores for ranking
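
The three steps above can be sketched without FAISS. The following is a minimal brute-force illustration in plain Python, assuming a handful of small in-memory vectors; `flat_ip_search` is a hypothetical helper written for this example, not a FAISS function. An exact index performs the same comparison against every stored vector, only far more efficiently.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def flat_ip_search(stored, query, k):
    """Score every stored vector against the query and return the top k.

    Returns (scores, ids), both sorted by descending inner product.
    """
    scored = sorted(
        ((dot(vec, query), i) for i, vec in enumerate(stored)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    top = scored[:k]
    return [s for s, _ in top], [i for _, i in top]


# 1. Indexing: keep the vectors in a list
stored = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
# 2. Search and 3. Scoring: rank all vectors against the query
scores, ids = flat_ip_search(stored, query=(0.0, 1.0), k=2)
print(ids, scores)
```

The returned pair mirrors the `(scores, indices)` shape used by the search calls later in this notebook.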

### Index Types

| Index | Description | Use Case |
|-------|-------------|----------|
| **IndexFlatIP** | Exact inner product search | Small datasets, highest accuracy |
| **IndexFlatL2** | Exact L2 distance search | When using non-normalized embeddings |
| **IndexIVFFlat** | Approximate search: vectors are clustered and only the nearest clusters are scanned | Large datasets where exact search is too slow |
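
To make the clustering-based row of the table more concrete, here is a toy sketch of the inverted-file idea in plain Python. It is an illustration under simplifying assumptions, not FAISS code: the centroids are fixed by hand (a real IVF index learns them during training), and `build_inverted_lists` and `ivf_search` are hypothetical names. The `nprobe` argument plays the same role as the FAISS parameter of that name: how many clusters to scan per query.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_inverted_lists(vectors, centroids):
    """Assign each vector id to the cell of its best-matching centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        cell = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        lists[cell].append(vid)
    return lists


def ivf_search(query, vectors, centroids, lists, k, nprobe):
    """Scan only the nprobe cells closest to the query, then rank candidates."""
    cells = sorted(
        range(len(centroids)),
        key=lambda c: dot(query, centroids[c]),
        reverse=True,
    )[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    candidates.sort(key=lambda vid: dot(query, vectors[vid]), reverse=True)
    return candidates[:k]


vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
centroids = [(1.0, 0.0), (0.0, 1.0)]  # fixed here; normally learned by clustering
lists = build_inverted_lists(vectors, centroids)
print(ivf_search((1.0, 0.0), vectors, centroids, lists, k=2, nprobe=1))
```

With `nprobe=1` only half of the vectors are scored. That is the source of the speed-up, and also the reason results are approximate: a true neighbor sitting in an unscanned cell is missed.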

### Why IndexFlatIP with Normalized Embeddings?

- **Normalized embeddings** have an L2 norm of 1.0
- For unit-length vectors, **inner product equals cosine similarity**
- No norms need to be computed at query time, so the index only has to evaluate dot products
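
The equivalence can be checked numerically. This is a small self-contained sketch using only the standard library; the vectors and the helper names `l2_normalize` and `cosine` are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed directly from raw vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization a plain dot product gives the cosine similarity
inner_product = sum(x * y for x, y in zip(a_unit, b_unit))
print(round(cosine(a, b), 6), round(inner_product, 6))
```

Both printed values agree, which is why the notebook passes `normalize_embeddings=True` when encoding and then uses an inner-product index.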

In [3]:
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir: {DATA_DIR} (exists: {DATA_DIR.exists()})")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

Project root: /Users/tkhamidulin/Desktop/First Project - RAG
Data dir: /Users/tkhamidulin/Desktop/First Project - RAG/data (exists: True)
Artifacts dir: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts


---
## Experiment 1: Load PDF Documents

**Goal:** Load real PDF documents from the `data/` folder.

**Structure:**
```
data/
├── RAG/     (3 PDFs about RAG, HyDE, LangChain)
├── GIT/     (3 PDFs about Git)
└── GCP/     (3 PDFs about Google Cloud)
```

Each subfolder name becomes the **topic** for evaluation.
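
Because every chunk carries a topic label, a retrieval result can be scored by comparing the topics of the retrieved chunks with the topic a query is expected to hit. The sketch below shows one common way to compute Hit@K and Mean Reciprocal Rank from such labels. It is an assumption about how the evaluation could look, not the notebook's own evaluation code; the example results are hypothetical.

```python
def hit_at_k(retrieved_topics, expected_topic, k):
    """1.0 if any of the top-k retrieved topics matches, else 0.0."""
    return 1.0 if expected_topic in retrieved_topics[:k] else 0.0


def reciprocal_rank(retrieved_topics, expected_topic):
    """1 / rank of the first matching result, or 0.0 if none matches."""
    for rank, topic in enumerate(retrieved_topics, start=1):
        if topic == expected_topic:
            return 1.0 / rank
    return 0.0


# Hypothetical results: (topics of the retrieved chunks, expected topic)
results = [
    (["GIT", "GIT", "RAG"], "GIT"),  # correct at rank 1
    (["GCP", "RAG", "RAG"], "RAG"),  # correct at rank 2
    (["GCP", "GCP", "GIT"], "RAG"),  # never correct
]

hit_at_1 = sum(hit_at_k(r, t, 1) for r, t in results) / len(results)
hit_at_3 = sum(hit_at_k(r, t, 3) for r, t in results) / len(results)
mrr = sum(reciprocal_rank(r, t) for r, t in results) / len(results)
print(f"Hit@1={hit_at_1:.2f} Hit@3={hit_at_3:.2f} MRR={mrr:.2f}")
```

Topic-level matching is a coarse signal: a chunk from the right folder counts as a hit even if it does not answer the query.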

In [4]:
def load_pdfs(data_dir: Path) -> list:
    """
    Load all PDFs from topic subfolders.
    Adds metadata: topic (folder name), source (filename), page.
    """
    docs = []
    topics_found = []
    
    for topic_dir in sorted(data_dir.iterdir()):
        if not topic_dir.is_dir():
            continue
        
        topic = topic_dir.name
        pdf_files = list(topic_dir.glob("*.pdf"))
        
        if not pdf_files:
            print(f"  WARNING: No PDFs in {topic}/")
            continue
        
        topics_found.append(topic)
        print(f"\n[{topic}]")
        
        for pdf_path in pdf_files:
            try:
                loader = PyPDFLoader(str(pdf_path))
                pdf_docs = loader.load()
                for doc in pdf_docs:
                    doc.metadata["topic"] = topic
                    doc.metadata["source"] = pdf_path.name
                    docs.append(doc)
                print(f"  + {pdf_path.name} ({len(pdf_docs)} pages)")
            except Exception as e:
                print(f"  ERROR: {pdf_path.name} - {e}")
    
    print(f"\nLoaded: {len(docs)} pages from {len(topics_found)} topics: {topics_found}")
    return docs


# Load all PDFs
raw_docs = load_pdfs(DATA_DIR)


[GCP]
  + gcloud-cheat-sheet.pdf (2 pages)
  + google_security_wp.pdf (18 pages)


Ignoring wrong pointing object 11 0 (offset 0)


  + A-Complete-Guide-to-the-Google-Cloud-Platform.pdf (53 pages)

[GIT]
  + GitGuide.pdf (8 pages)
  + git-cheat-sheet-education.pdf (2 pages)
  + How_to_Git.pdf (45 pages)

[RAG]
  + Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf (19 pages)
  + Retrieval - Docs by LangChain.pdf (3 pages)
  + Retrieval-Augmented Generation (RAG) _ Pinecone.pdf (4 pages)

Loaded: 154 pages from 3 topics: ['GCP', 'GIT', 'RAG']


In [5]:

print("Sample document:")
print(f"  Topic: {raw_docs[0].metadata['topic']}")
print(f"  Source: {raw_docs[0].metadata['source']}")
print(f"  Page: {raw_docs[0].metadata.get('page', 0)}")
print(f"  Content preview: {raw_docs[0].page_content[:200]}...")

Sample document:
  Topic: GCP
  Source: gcloud-cheat-sheet.pdf
  Page: 0
  Content preview: gcloud init
I n i t i a l i z e ,  a u t h o r i z e ,  a n d  c o n fi g u r e  g c l o u d
gcloud version
D i s p l a y  ve r s i o n  a n d  i n s t a l l e d  c o m p o n e n t s
gcloud components...


---
## Experiment 2: Implement Chunking with ONE Configuration

**Goal:** Split documents into chunks using LangChain's RecursiveCharacterTextSplitter.

**Why RecursiveCharacterTextSplitter?**
- Tries to split on natural boundaries (paragraphs, sentences)
- Falls back to character-level splits if needed
- Preserves semantic coherence better than fixed-size splits
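
The fallback behavior described above can be illustrated with a stripped-down recursive splitter. This is a simplified sketch, not LangChain's implementation: `recursive_split` is a hypothetical function, it drops the separators, and it does not merge small pieces back together or add overlap as the real splitter does.

```python
def recursive_split(text, max_len, separators):
    """Split on the coarsest separator first; recurse on pieces still too long."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No natural boundary left: fall back to hard character cuts
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_len, finer)
    pieces = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_len, finer))
    return pieces


print(recursive_split("aaa bbb\n\nccccccc", max_len=4, separators=["\n\n", " "]))
```

The paragraph break is tried first, then the space, and only the unbroken run of characters is cut at an arbitrary position.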

**Parameters:**
- `chunk_size`: Maximum characters per chunk
- `chunk_overlap`: Characters shared between consecutive chunks
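
To see how the two parameters interact, consider the simplest possible case: fixed-size windows with no attention to natural boundaries. The helper below, `window_chunks`, is a hypothetical illustration and deliberately simpler than `RecursiveCharacterTextSplitter`.

```python
def window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


print(window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk reappears as the first character of the next.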

In [6]:

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50

splitter = RecursiveCharacterTextSplitter(
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(raw_docs)

print(f"Original documents: {len(raw_docs)} pages")
print(f"After chunking: {len(chunks)} chunks")
print(f"Chunk size: {CHUNK_SIZE}, Overlap: {CHUNK_OVERLAP}")

Original documents: 154 pages
After chunking: 663 chunks
Chunk size: 500, Overlap: 50


In [7]:

topic_counts = {}
for c in chunks:
    topic = c.metadata["topic"]
    topic_counts[topic] = topic_counts.get(topic, 0) + 1

print("Chunks per topic:")
for topic, count in sorted(topic_counts.items()):
    print(f"  {topic}: {count} chunks")

Chunks per topic:
  GCP: 299 chunks
  GIT: 91 chunks
  RAG: 273 chunks


In [8]:
# Inspect a sample chunk
sample_chunk = chunks[501]
print("Sample chunk:")
print(f"  Topic: {sample_chunk.metadata['topic']}")
print(f"  Source: {sample_chunk.metadata['source']}")
print(f"  Length: {len(sample_chunk.page_content)} chars")
print(f"  Content: {sample_chunk.page_content[:300]}...")

Sample chunk:
  Topic: RAG
  Source: Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf
  Length: 471 chars
  Content: significant influence on core managerial activities, notably each of the element
organizational design.  In addition, of course, management must remain focused o
creating value for customers in an efficient manner and on de veloping and  impl
strategies to attract necessary inputs such as capital, t...


---
## Experiment 3: Build FAISS Index with ONE Model

**Goal:** Create a working retrieval pipeline before comparing models.

**Steps:**
1. Extract text from chunks
2. Generate embeddings with SentenceTransformer
3. Build FAISS IndexFlatIP
4. Test retrieval with a simple query

In [9]:
# Load ONE model first
model = SentenceTransformer("all-MiniLM-L6-v2")

# Extract texts
texts = [c.page_content for c in chunks]
topics = [c.metadata["topic"] for c in chunks]

# Generate embeddings (normalized so that inner product equals cosine similarity)
if not texts:
    raise ValueError("Cannot generate embeddings: texts list is empty")

embeddings = model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
# FAISS requires a C-contiguous float32 array
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)

print(f"\nEmbeddings shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")
print(f"L2 norm of first embedding: {np.linalg.norm(embeddings[0]):.4f} (should be 1.0)")



BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



Embeddings shape: (663, 384)
Embedding dimension: 384
L2 norm of first embedding: 1.0000 (should be 1.0)


In [10]:
# Validate the embedding matrix before building the index
if embeddings.ndim != 2 or 0 in embeddings.shape:
    raise ValueError(f"Cannot create index: invalid embeddings shape {embeddings.shape}")

# FAISS requires a C-contiguous float32 array (no copy is made if it already is one)
embeddings = np.ascontiguousarray(embeddings, dtype=np.float32)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product = cosine for normalized vectors
index.add(embeddings)

print("FAISS index built")
print("  Index type: IndexFlatIP")
print(f"  Dimension: {dimension}")
print(f"  Total vectors: {index.ntotal}")

FAISS index built
  Index type: IndexFlatIP
  Dimension: 384
  Total vectors: 663


In [11]:
# Test retrieval with ONE query
query = "What is RAG?"
query_embedding = model.encode([query], normalize_embeddings=True)
query_embedding = np.array(query_embedding, dtype="float32")

# Search top 3
k = 3
scores, indices = index.search(query_embedding, k)

print(f"Query: '{query}'")
print(f"\nTop {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
    chunk = chunks[idx]
    print(f"\n{rank}. [{chunk.metadata['topic']}] score={score:.4f}")
    print(f"   Source: {chunk.metadata['source']}")
    print(f"   {chunk.page_content[:150]}...")

Query: 'What is RAG?'

Top 3 results:

1. [RAG] score=0.6102
   Source: Retrieval - Docs by LangChain.pdf
   Building blocks
RAG architectures
RAG can be implemented in multiple ways, depending on your systemʼs needs. We
outline each type in the sections belo...

2. [RAG] score=0.5627
   Source: Retrieval-Augmented Generation (RAG) _ Pinecone.pdf
   Traditional RAG
By combining relevant data from an external data source with the user’s query and providing it to the
model as context for the generat...

3. [RAG] score=0.5067
   Source: Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf
   Mar 13, 2025
In by
Implement RAG with Knowledge
Graph and Llama-Index
Introduction
Dec 3, 2023
In by
Semantic Chunking for RAG
What is Chunking ?
Apr ...


In [12]:
# Test retrieval with ONE query
query = "What is GIT?"
query_embedding = model.encode([query], normalize_embeddings=True)
query_embedding = np.array(query_embedding, dtype="float32")

# Search top 3
k = 3
scores, indices = index.search(query_embedding, k)

print(f"Query: '{query}'")
print(f"\nTop {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
    chunk = chunks[idx]
    print(f"\n{rank}. [{chunk.metadata['topic']}] score={score:.4f}")
    print(f"   Source: {chunk.metadata['source']}")
    print(f"   {chunk.page_content[:150]}...")

Query: 'What is GIT?'

Top 3 results:

1. [GIT] score=0.7412
   Source: GitGuide.pdf
   Beginner’s   Guide   to   using   Git  
Written   By:   Kyle   Swygert  
Voiland   College   Tutoring   Services  
WSU   CptS   223   and   onward   
...

2. [GIT] score=0.7135
   Source: How_to_Git.pdf
   Version control tool that tracks file change history (like track
changes for word but much more sophisticated)
Popular among software developers
GitHu...

3. [GIT] score=0.6573
   Source: How_to_Git.pdf
   Introduction to Git
Basic Git commands
Collaboration with Git and GitHub
3
Objectives...


In [13]:

query = "What is GCP?"
query_embedding = model.encode([query], normalize_embeddings=True)
query_embedding = np.array(query_embedding, dtype="float32")

# Search top 3
k = 3
scores, indices = index.search(query_embedding, k)

print(f"Query: '{query}'")
print(f"\nTop {k} results:")
for rank, (score, idx) in enumerate(zip(scores[0], indices[0]), 1):
    chunk = chunks[idx]
    print(f"\n{rank}. [{chunk.metadata['topic']}] score={score:.4f}")
    print(f"   Source: {chunk.metadata['source']}")
    print(f"   {chunk.page_content[:150]}...")

Query: 'What is GCP?'

Top 3 results:

1. [GCP] score=0.6503
   Source: A-Complete-Guide-to-the-Google-Cloud-Platform.pdf
   enterprise network. But since the cloud 
operates web scale its failures often 
get ampliﬁed. 
Comparing the 
Networking Services of 
Google and Amazo...

2. [GCP] score=0.5649
   Source: A-Complete-Guide-to-the-Google-Cloud-Platform.pdf
   IV. BONUS! Live Migration in
GCP
Another advantage of working with 
GCP networking is the availability of 
live migration. Hardware failures 
happen i...

3. [GCP] score=0.5511
   Source: A-Complete-Guide-to-the-Google-Cloud-Platform.pdf
   based on Google’s Andromeda. This 
architecture allows the creation of 
networking elements at any level and 
supports customization of the network
to...


---
## Experiment 4: Compare Chunk Sizes

**Goal:** Understand how chunk size affects retrieval quality.

**Configurations:**
- Small: 300 chars, 30 overlap → more precise, less context
- Medium: 500 chars, 50 overlap → balanced
- Large: 800 chars, 80 overlap → more context, may include noise

**Expected behavior:**
- Small chunks → more chunks, precise matches
- Large chunks → fewer chunks, broader context
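
The expected change in chunk count can be estimated with a little arithmetic. For fixed-size windows, a text of length `L` needs `ceil((L - overlap) / (chunk_size - overlap))` windows. The sketch below applies that formula; `estimate_window_count` is a hypothetical helper, and the estimate is a lower bound for the recursive splitter, which cuts at natural boundaries and therefore tends to produce shorter and more numerous chunks.

```python
import math


def estimate_window_count(text_len, chunk_size, overlap):
    """Number of fixed-size windows needed to cover text_len characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if text_len <= chunk_size:
        return 1 if text_len > 0 else 0
    step = chunk_size - overlap
    return math.ceil((text_len - overlap) / step)


# A hypothetical 2000-character page under the three configurations
for size, overlap in [(300, 30), (500, 50), (800, 80)]:
    print(size, overlap, estimate_window_count(2000, size, overlap))
```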

In [15]:
chunk_configs = [
    {"chunk_size": 300, "overlap": 30},
    {"chunk_size": 500, "overlap": 50},
    {"chunk_size": 800, "overlap": 80},
]

print("Chunk size comparison:")
print("=" * 50)

for cfg in chunk_configs:
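    # No explicit separators here, so the splitter's defaults apply (no ". " split).
    # The 500-char chunk count can therefore differ slightly from Experiment 2.
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=cfg["chunk_size"],
        chunk_overlap=cfg["overlap"],
    )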
    chunks_temp = splitter.split_documents(raw_docs)
    
    avg_len = np.mean([len(c.page_content) for c in chunks_temp])
    
    print(f"\nChunk size: {cfg['chunk_size']}, Overlap: {cfg['overlap']}")
    print(f"  Total chunks: {len(chunks_temp)}")
    print(f"  Avg chunk length: {avg_len:.0f} chars")

Chunk size comparison:

Chunk size: 300, Overlap: 30
  Total chunks: 1097
  Avg chunk length: 248 chars

Chunk size: 500, Overlap: 50
  Total chunks: 662
  Avg chunk length: 418 chars

Chunk size: 800, Overlap: 80
  Total chunks: 435
  Avg chunk length: 640 chars


---
## Experiment 5: Compare Embedding Models

**Goal:** Compare MiniLM vs MPNet on the same retrieval task.

| Model | Dimensions | Speed | Quality |
|-------|------------|-------|--------|
| all-MiniLM-L6-v2 | 384 | Fast | Good |
| all-mpnet-base-v2 | 768 | Slower | Better |
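
Dimension also determines index size. A flat index stores every vector as raw float32 values, so its vector storage is roughly `n × d × 4` bytes. The following sketch works this out for the 663 chunks from Experiment 2; `flat_index_bytes` is a hypothetical helper and ignores small bookkeeping overheads.

```python
def flat_index_bytes(num_vectors, dimension, bytes_per_value=4):
    """Raw vector storage for a flat float32 index (excludes small overheads)."""
    return num_vectors * dimension * bytes_per_value


num_chunks = 663  # chunk count from Experiment 2
for name, dim in [("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)]:
    size_mib = flat_index_bytes(num_chunks, dim) / 2**20
    print(f"{name}: {size_mib:.2f} MiB")
```

At this corpus size both indexes are tiny. The memory pressure discussed below comes from loading the models themselves, not from the index.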

In [17]:
models = {
    "MiniLM": "all-MiniLM-L6-v2",
    "MPNet": "all-mpnet-base-v2",
}


def build_index_with_model(texts: list[str], model: SentenceTransformer):
    """Build a FAISS IndexFlatIP from texts using a pre-loaded model."""
    if not texts:
        raise ValueError("Cannot build index: texts list is empty")

    # encode() batches internally; a small batch_size keeps peak memory low
    emb = model.encode(
        texts,
        batch_size=32,
        normalize_embeddings=True,
        show_progress_bar=False,
    )

    # FAISS requires a C-contiguous float32 array
    emb = np.ascontiguousarray(emb, dtype=np.float32)

    if emb.ndim != 2 or 0 in emb.shape:
        raise ValueError(f"Cannot build index: embeddings have invalid shape {emb.shape}")

    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index


def build_index(texts: list[str], model_name: str):
    """
    Build a FAISS index from a model name (kept for backward compatibility).

    Loads the model on every call, which is slow and memory-hungry. Prefer
    build_index_with_model() with a pre-loaded model when building several indexes.
    """
    # Fail fast before paying the cost of loading the model
    if not texts:
        raise ValueError("Cannot build index: texts list is empty")

    model = SentenceTransformer(model_name)
    index = build_index_with_model(texts, model)
    return model, index


def retrieve(model, index, query: str, top_k: int = 3):
    """Retrieve top-k results for a query."""
    if not query or not query.strip():
        raise ValueError("Query cannot be empty")

    if index.ntotal == 0:
        raise ValueError("Index is empty, cannot retrieve")

    q = model.encode([query], normalize_embeddings=True)
    q = np.ascontiguousarray(q, dtype=np.float32)

    # Cap k at the number of indexed vectors; FAISS pads missing results with -1
    actual_k = min(top_k, index.ntotal)
    scores, indices = index.search(q, actual_k)
    return scores[0], indices[0]


# Test query for model comparison
query = "How to create a git branch?"

print(f"Query: '{query}'")
print("=" * 50)
print("\nNOTE: Models are tested in separate cells to prevent memory overload.")
print("Run the cells below one at a time.")

Query: 'How to create a git branch?'

NOTE: Models are tested in separate cells to prevent memory overload.
Run the cells below one at a time.


---
## Model Comparison: MiniLM

**Goal:** Test MiniLM model (384d embeddings) on the query.

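**Why separate cells?** Each model is tested in its own cell so that only one model is held in memory at a time. This cell depends on `build_index_with_model` and `retrieve` from the Experiment 5 cell. The `NameError` in the output below comes from running it before that cell (execution count 16 precedes 17); run the definitions first, then rerun this cell.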

In [16]:
# Test MiniLM model
model_name = "MiniLM"
model_id = "all-MiniLM-L6-v2"

try:
    print(f"Loading {model_name}...")
    model = SentenceTransformer(model_id)
    print(f"✓ {model_name} loaded successfully\n")
    
    print("Building index...")
    idx = build_index_with_model(texts, model)
    print(f"✓ Index built: {idx.ntotal} vectors\n")
    
    print("Retrieving...")
    scores, indices = retrieve(model, idx, query)
    
    print(f"\n{model_name} Results:")
    print("-" * 50)
    for rank, (score, i) in enumerate(zip(scores, indices), 1):
        print(f"{rank}. [{topics[i]}] score={score:.4f} - {chunks[i].metadata['source'][:30]}")
    
    # Clean up to free memory
    del idx
    del model
    print(f"\n✓ {model_name} completed and memory freed")
    
except Exception as e:
    print(f"\n✗ ERROR with {model_name}: {str(e)}")
    import traceback
    traceback.print_exc()

Loading MiniLM...


BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✓ MiniLM loaded successfully

Building index...

✗ ERROR with MiniLM: name 'build_index_with_model' is not defined


Traceback (most recent call last):
  File "/var/folders/z8/cqzr9q312nzfqkq1hrzrwqnr0000gp/T/ipykernel_28878/4041347915.py", line 11, in <module>
    idx = build_index_with_model(texts, model)
          ^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'build_index_with_model' is not defined


---
## Model Comparison: MPNet

**Goal:** Test MPNet model (768d embeddings) on the query.


In [18]:
# Test MPNet model
model_name = "MPNet"
model_id = "all-mpnet-base-v2"

try:
    print(f"Loading {model_name}...")
    model = SentenceTransformer(model_id)
    print(f"✓ {model_name} loaded successfully\n")
    
    print("Building index...")
    idx = build_index_with_model(texts, model)
    print(f"✓ Index built: {idx.ntotal} vectors\n")
    
    print("Retrieving...")
    scores, indices = retrieve(model, idx, query)
    
    print(f"\n{model_name} Results:")
    print("-" * 50)
    for rank, (score, i) in enumerate(zip(scores, indices), 1):
        print(f"{rank}. [{topics[i]}] score={score:.4f} - {chunks[i].metadata['source'][:30]}")
    
    # Clean up to free memory
    del idx
    del model
    print(f"\n✓ {model_name} completed and memory freed")
    
except Exception as e:
    print(f"\n✗ ERROR with {model_name}: {str(e)}")
    print("This may indicate insufficient memory. Try:")
    print("  1. Restart kernel and run only this cell")
    print("  2. Close other applications to free memory")
    print("  3. Use MiniLM instead (smaller model)")
    import traceback
    traceback.print_exc()

Loading MPNet...


MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.

