# Week 2: Building the Local Retrieval Pipeline

**Scope:** local retrieval pipeline only: load PDFs -> chunk -> embed -> FAISS index -> retrieve.

**Input from Week 1:**
- **Champion model:** **MPNet** (`all-mpnet-base-v2`) — chosen in Week 1 for better semantic separation.
- **Champion metric:** **cosine similarity**.
- Week 2 uses only the champion: one embedding model (MPNet) and one metric (cosine). An optional section later compares with MiniLM for reference.

**Week 2 objectives covered:**
1. Local vector store (FAISS).
2. Chunking with different chunk sizes/overlap.
3. Retriever + manual quality evaluation.


### Notebook structure

1. **Setup** — imports, paths, and Week 1 metric helpers (dot, cosine, Euclidean); retrieval uses cosine via normalized embeddings + FAISS.
2. **Experiment 1** — load PDFs from `data/` (RAG, GIT, GCP); each subfolder is a topic; we get page-level docs with metadata.
3. **Chunking** — two strategies (A: 300/50, B: 800/100) with `RecursiveCharacterTextSplitter`; compare chunk counts per topic.
4. **VectorStore & RAG** — `VectorStore.build_index()` / `retrieve()`; `RAG.retrieve()` returns `(score, chunk)` for evaluation.
5. **Retrieval experiment** — 5 queries, top-3 for A and B; manual evaluation table (expected topic vs top-1/top-3).
6. **Strategy choice** — aggregate top1_accuracy, top3_hit_rate, avg_top1_score; pick best strategy; save CSVs to `artifacts/`.
7. **Analysis** — chunk size/overlap effects; alignment with Week 1 (cosine, MPNet); handoff to Week 3.
8. **Experiment 4** — compare chunk sizes (300/30, 500/50, 800/80) with fixed model and metric.
9. **Experiment 5** — compare MiniLM vs MPNet on the chosen chunk strategy (separate cells to avoid memory issues).
10. **Conclusions** — chosen strategy, champion model (MPNet), metric (cosine); Week 3 reuses retriever and adds LLM.

In [1]:
import numpy as np
import pandas as pd
import faiss
from pathlib import Path
from sentence_transformers import SentenceTransformer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

faiss.omp_set_num_threads(4)
print("FAISS configured: Using CPU mode")

FAISS configured: Using CPU mode


## Pipeline Architecture

```
PDF folder -> extract text -> chunk -> embed (MPNet) -> index (FAISS) -> retrieve (top-k)
```

| Step | What we use | Why |
|------|-------------|-----|
| Load | `PyPDFLoader` | Local PDF parsing with metadata |
| Chunk | `RecursiveCharacterTextSplitter` | Compare different chunk strategies |
| Embed | MPNet, `normalize_embeddings=True` | Week 1 champion model |
| Index | FAISS `IndexFlatIP` | Inner product on normalized vectors = cosine |
| Retrieve | `index.search` | Fast top-k retrieval |

Cosine is implemented via `normalize_embeddings=True` + `IndexFlatIP`, consistent with Week 1 metric decision.
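
The equivalence can be checked numerically. The sketch below uses plain Python with random stand-in vectors (no FAISS or MPNet involved; the helper names `dot` and `l2_normalize` are illustrative, and 768 is simply MPNet's embedding size). It confirms that the inner product of L2-normalized vectors matches the cosine similarity of the raw vectors.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def l2_normalize(u):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


random.seed(0)
a = [random.gauss(0, 1) for _ in range(768)]
b = [random.gauss(0, 1) for _ in range(768)]

# Cosine similarity computed directly on the raw vectors.
cosine_raw = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Inner product after normalization: what an inner-product index scores.
inner_product_unit = dot(l2_normalize(a), l2_normalize(b))

print(f"cosine on raw vectors:      {cosine_raw:.6f}")
print(f"inner product on unit vecs: {inner_product_unit:.6f}")
```

The two printed values agree up to floating-point rounding, which is why no separate cosine index type is needed.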


## Paths and directories

- **`PROJECT_ROOT`** — project root (parent of `notebooks/`).
- **`DATA_DIR`** — `data/` with subfolders RAG, GIT, GCP containing PDFs.
- **`ARTIFACTS_DIR`** — `artifacts/` for saving evaluation CSVs (chunk strategy, model comparison, final decisions).

In [2]:
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir: {DATA_DIR} (exists: {DATA_DIR.exists()})")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

Project root: /Users/tkhamidulin/Desktop/First Project - RAG
Data dir: /Users/tkhamidulin/Desktop/First Project - RAG/data (exists: True)
Artifacts dir: /Users/tkhamidulin/Desktop/First Project - RAG/artifacts


## Similarity metric functions (from Week 1, explicit numpy)

Week 1 chose **cosine similarity** as the champion metric. We keep dot, cosine, and Euclidean here for reference; the FAISS pipeline uses cosine via `normalize_embeddings=True` and `IndexFlatIP` (inner product on unit vectors).
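
On unit vectors the three metrics are tightly linked: the squared Euclidean distance equals `2 - 2 * cosine`, and the dot product equals the cosine itself. The following sketch checks this on random stand-in vectors (plain Python, illustrative helper names, not the notebook's NumPy helpers) and shows that ranking by descending cosine matches ranking by ascending Euclidean distance.

```python
import math
import random


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [x / norm for x in u]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


random.seed(1)
query = normalize([random.gauss(0, 1) for _ in range(16)])
docs = [normalize([random.gauss(0, 1) for _ in range(16)]) for _ in range(5)]

# Identity on unit vectors: ||q - d||^2 == 2 - 2 * cos(q, d)
identity_holds = all(
    math.isclose(euclidean(query, d) ** 2, 2 - 2 * dot(query, d), abs_tol=1e-9)
    for d in docs
)

# Highest cosine first vs. smallest distance first.
by_cosine = sorted(range(len(docs)), key=lambda i: -dot(query, docs[i]))
by_euclid = sorted(range(len(docs)), key=lambda i: euclidean(query, docs[i]))

print("identity holds:", identity_holds)
print("same ranking:  ", by_cosine == by_euclid)
```

This holds only after normalization; on raw embeddings the three metrics can rank documents differently.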

In [3]:
def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sum(a * b))

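def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot = np.sum(a * b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for zero vectors; return 0.0 instead of dividing by zero.
        return 0.0
    return float(dot / (norm_a * norm_b))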

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Week 2 retrieval uses cosine (inner product on L2-normalized embeddings in FAISS).

---
## Experiment 1: Load PDF Documents

**Goal:** Load real PDF documents from the `data/` folder.

**Structure:**
```
data/
├── RAG/     (3 PDFs about RAG, HyDE, LangChain)
├── GIT/     (3 PDFs about Git)
└── GCP/     (3 PDFs about Google Cloud)
```

Each subfolder name becomes the **topic** for evaluation.

**Conclusion:** We obtain a flat list of page-level documents with `topic`, `source`, and `page` metadata, ready for chunking.
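
A quick sanity check on the loaded pages is to count them per topic and flag any page missing one of the three metadata keys. The sketch below is a hypothetical helper (`summarize_pages` is not part of the notebook). It runs on stand-in objects built with `SimpleNamespace` and made-up file names, so it needs no PDF files; it assumes only that each document exposes a `metadata` dict.

```python
from collections import Counter
from types import SimpleNamespace

REQUIRED_KEYS = ("topic", "source", "page")


def summarize_pages(docs):
    """Return (pages per topic, number of docs missing a required metadata key)."""
    per_topic = Counter(
        d.metadata["topic"] for d in docs if "topic" in d.metadata
    )
    missing = sum(
        1 for d in docs if any(key not in d.metadata for key in REQUIRED_KEYS)
    )
    return per_topic, missing


# Stand-in pages; file names are invented for illustration.
sample_docs = [
    SimpleNamespace(page_content="git init", metadata={"topic": "GIT", "source": "a.pdf", "page": 0}),
    SimpleNamespace(page_content="git add", metadata={"topic": "GIT", "source": "a.pdf", "page": 1}),
    SimpleNamespace(page_content="RAG intro", metadata={"topic": "RAG", "source": "b.pdf", "page": 0}),
    SimpleNamespace(page_content="gcloud", metadata={"topic": "GCP", "source": "c.pdf"}),  # no "page"
]

per_topic, missing_count = summarize_pages(sample_docs)
print(dict(per_topic))
print("docs with incomplete metadata:", missing_count)
```

Applied to the real page list, the same call would show at a glance whether a topic folder loaded fewer pages than expected.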

In [4]:
def load_pdfs(data_dir: Path) -> list:
    """
    Load all PDFs from topic subfolders.
    Adds metadata: topic (folder name), source (filename), page.
    """
    docs = []
    topics_found = []
    
    for topic_dir in sorted(data_dir.iterdir()):
        if not topic_dir.is_dir():
            continue
        
        topic = topic_dir.name
        pdf_files = list(topic_dir.glob("*.pdf"))
        
        if not pdf_files:
            print(f"  WARNING: No PDFs in {topic}/")
            continue
        
        topics_found.append(topic)
        print(f"\n[{topic}]")
        
        for pdf_path in pdf_files:
            try:
                loader = PyPDFLoader(str(pdf_path))
                pdf_docs = loader.load()
                for doc in pdf_docs:
                    doc.metadata["topic"] = topic
                    doc.metadata["source"] = pdf_path.name
                    docs.append(doc)
                print(f"  + {pdf_path.name} ({len(pdf_docs)} pages)")
            except Exception as e:
                print(f"  ERROR: {pdf_path.name} - {e}")
    
    print(f"\nLoaded: {len(docs)} pages from {len(topics_found)} topics: {topics_found}")
    return docs


# Load all PDFs
raw_docs = load_pdfs(DATA_DIR)


[GCP]
  + gcloud-cheat-sheet.pdf (2 pages)


incorrect startxref pointer(1)
parsing for Object Streams
Error -3 while decompressing data: incorrect header check
found 0 objects within Object(775,0) whereas 200 expected
Error -3 while decompressing data: incorrect header check
found 0 objects within Object(776,0) whereas 20 expected
Cannot find "/Root" key in trailer
Searching object with "/Catalog" key
Ignoring wrong pointing object 11 0 (offset 0)


  + google_security_wp.pdf (18 pages)
  ERROR: A-Complete-Guide-to-the-Google-Cloud-Platform.pdf - Cannot find Root object in pdf

[GIT]
  + GitGuide.pdf (8 pages)
  + git-cheat-sheet-education.pdf (2 pages)
  + How_to_Git.pdf (45 pages)

[RAG]
  + Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf (19 pages)
  + Retrieval - Docs by LangChain.pdf (3 pages)
  + Retrieval-Augmented Generation (RAG) _ Pinecone.pdf (4 pages)

Loaded: 101 pages from 3 topics: ['GCP', 'GIT', 'RAG']


In [5]:

print("Sample document:")
print(f"  Topic: {raw_docs[0].metadata['topic']}")
print(f"  Source: {raw_docs[0].metadata['source']}")
print(f"  Page: {raw_docs[0].metadata.get('page', 0)}")
print(f"  Content preview: {raw_docs[0].page_content[:200]}...")

Sample document:
  Topic: GCP
  Source: gcloud-cheat-sheet.pdf
  Page: 0
  Content preview: gcloud init
I n i t i a l i z e ,  a u t h o r i z e ,  a n d  c o n fi g u r e  g c l o u d
gcloud version
D i s p l a y  ve r s i o n  a n d  i n s t a l l e d  c o m p o n e n t s
gcloud components...


---
## Chunking: Two Strategies

**Goal:** Split documents into chunks so we can compare how chunk size and overlap affect retrieval.

**Why RecursiveCharacterTextSplitter?**
- Splits on natural boundaries (paragraphs, sentences) when possible
- Falls back to character-level splits if needed
- Preserves semantic coherence better than fixed-size splits

**Two configurations (as required):**
- **Strategy A (small):** `chunk_size=300`, `chunk_overlap=50` — more chunks, finer granularity.
- **Strategy B (larger):** `chunk_size=800`, `chunk_overlap=100` — fewer chunks, more context per chunk.

**Conclusion:** Strategy A produces more chunks (finer granularity); Strategy B produces fewer, longer chunks. We index both and compare retrieval quality in the next section.
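
To make the two parameters concrete, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers separator boundaries). `sliding_window_chunks` is a hypothetical helper that only illustrates how `chunk_size` and `chunk_overlap` trade chunk count against shared context.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # A new window is needed only while the previous one has not reached the end.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


# 1000 characters of dummy text.
text = "".join(str(i % 10) for i in range(1000))

small = sliding_window_chunks(text, chunk_size=300, chunk_overlap=50)
large = sliding_window_chunks(text, chunk_size=800, chunk_overlap=100)

print(len(small), "chunks at 300/50")
print(len(large), "chunks at 800/100")
# The tail of one chunk is repeated at the head of the next.
print(small[0][-50:] == small[1][:50])
```

The smaller window yields more chunks from the same text, and the repeated tail is what lets a sentence near a boundary appear in both neighbors.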

In [6]:
def chunk_documents(docs, chunk_size: int, chunk_overlap: int, separators=None):
    """
    Split a list of LangChain documents into chunks.
    Returns list of documents (each has page_content and metadata).
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators,
    )
    return splitter.split_documents(docs)


# Two chunk strategies (Strategy A = small, Strategy B = larger)
CHUNK_STRATEGY_A = {"chunk_size": 300, "chunk_overlap": 50}
CHUNK_STRATEGY_B = {"chunk_size": 800, "chunk_overlap": 100}

chunks_a = chunk_documents(raw_docs, **CHUNK_STRATEGY_A)
chunks_b = chunk_documents(raw_docs, **CHUNK_STRATEGY_B)

print(f"Original documents: {len(raw_docs)} pages")
print(f"Strategy A ({CHUNK_STRATEGY_A}): {len(chunks_a)} chunks")
print(f"Strategy B ({CHUNK_STRATEGY_B}): {len(chunks_b)} chunks")

Original documents: 101 pages
Strategy A ({'chunk_size': 300, 'chunk_overlap': 50}): 833 chunks
Strategy B ({'chunk_size': 800, 'chunk_overlap': 100}): 330 chunks


In [7]:
def chunks_per_topic(chunks):
    out = {}
    for c in chunks:
        t = c.metadata["topic"]
        out[t] = out.get(t, 0) + 1
    return out

print("Chunks per topic — Strategy A (small):")
for topic, count in sorted(chunks_per_topic(chunks_a).items()):
    print(f"  {topic}: {count} chunks")
print("\nChunks per topic — Strategy B (larger):")
for topic, count in sorted(chunks_per_topic(chunks_b).items()):
    print(f"  {topic}: {count} chunks")

Chunks per topic — Strategy A (small):
  GCP: 202 chunks
  GIT: 145 chunks
  RAG: 486 chunks

Chunks per topic — Strategy B (larger):
  GCP: 78 chunks
  GIT: 69 chunks
  RAG: 183 chunks


In [8]:
# Inspect one sample from each strategy
print("Sample chunk — Strategy A (small):")
c = chunks_a[len(chunks_a) // 2]
print(f"  Topic: {c.metadata['topic']}, Source: {c.metadata['source']}")
print(f"  Length: {len(c.page_content)} chars")
print(f"  Content: {c.page_content[:200]}...")
print("\nSample chunk — Strategy B (larger):")
c = chunks_b[len(chunks_b) // 2]
print(f"  Topic: {c.metadata['topic']}, Source: {c.metadata['source']}")
print(f"  Length: {len(c.page_content)} chars")
print(f"  Content: {c.page_content[:200]}...")

Sample chunk — Strategy A (small):
  Topic: RAG, Source: Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf
  Length: 298 chars
  Content: of their decisions on org anizational stakeholders:  
 
• The “utilitarian” rule:  An ethical decision is one that produces the greatest
number of people, which means that managers should compare alte...

Sample chunk — Strategy B (larger):
  Topic: RAG, Source: Advanced RAG — Improving retrieval using Hypothetical Document Embeddings(HyDE) _ by Plaban Nayak _ AI Planet.pdf
  Length: 780 chars
  Content: 0.094572514295578,
 -0.0002244404749944806,
 0.005685583222657442,
 -0.038341324776411057,
 -0.030211780220270157,
 -0.04658368602395058,
 0.048414770513772964,
 -0.26101475954055786,
 -0.010802186094...


---
## VectorStore and Retriever Classes

**VectorStore**
- `build_index(chunks)` builds FAISS `IndexFlatIP` on normalized embeddings.
- `retrieve(query, top_k=3)` returns top-k scores and indices.

**Retriever wrapper**
- `retrieve(query, top_k=3)` returns `(score, chunk)` for debugging and manual evaluation.

**Conclusion:** Retrieval scores are cosine similarity (normalized embeddings + IndexFlatIP). The pipeline uses the Week 1 champion (MPNet). Experiment 5 optionally compares with MiniLM for reference.
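
Conceptually, a flat inner-product index performs an exhaustive search: score the query against every stored vector and keep the k highest. The sketch below reproduces that idea in plain Python on tiny hand-made unit vectors. `top_k_inner_product` is an illustrative stand-in, not the FAISS API, and it makes no attempt at FAISS's speed.

```python
def top_k_inner_product(query, vectors, k):
    """Exhaustive top-k search by inner product; returns (scores, indices)."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    k = min(k, len(vectors))  # never ask for more results than stored vectors
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:k]
    return [scores[i] for i in order], order


# Three unit vectors in 2-D and a query identical to the first one.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
query = [1.0, 0.0]

scores, indices = top_k_inner_product(query, vectors, k=2)
print(indices)  # positions of the best matches, best first
print(scores)   # their inner products (cosine, since all vectors are unit length)
```

Because the stored vectors and the query are unit length, the returned scores can be read directly as cosine similarities.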


In [9]:
# Week 2 uses Week 1 champion: MPNet (all-mpnet-base-v2).
EMBEDDING_MODEL_ID = "all-mpnet-base-v2"


class VectorStore:
    """
    Local vector store: chunks + embeddings + FAISS IndexFlatIP.
    Similarity = cosine (inner product on L2-normalized vectors).
    """

    def __init__(self, model_id: str = EMBEDDING_MODEL_ID):
        self.model_id = model_id
        self.model = SentenceTransformer(model_id)
        self.chunks = []
        self._index = None

    def build_index(self, chunks):
        """Build FAISS index from chunk documents."""
        self.chunks = list(chunks)
        if not self.chunks:
            raise ValueError("Cannot index: chunks list is empty")
        texts = [c.page_content for c in self.chunks]
        emb = self.model.encode(texts, normalize_embeddings=True, show_progress_bar=True)
        emb = np.ascontiguousarray(emb, dtype=np.float32)
        dim = emb.shape[1]
        self._index = faiss.IndexFlatIP(dim)
        self._index.add(emb)

    def retrieve(self, query: str, top_k: int = 3):
        """Return (scores, indices) for top-k chunks. Scores are cosine similarity."""
        if not query or not query.strip():
            raise ValueError("Query cannot be empty")
        if self._index is None or self._index.ntotal == 0:
            raise ValueError("Index is empty; call build_index(chunks) first")
        q = self.model.encode([query], normalize_embeddings=True)
        q = np.ascontiguousarray(q, dtype=np.float32)
        k = min(top_k, self._index.ntotal)
        scores, indices = self._index.search(q, k)
        return scores[0], indices[0]


class RAG:
    """Retrieval-only wrapper (no generation in Week 2)."""

    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store

    def retrieve(self, query: str, top_k: int = 3):
        scores, indices = self.vector_store.retrieve(query, top_k=top_k)
        return [(float(s), self.vector_store.chunks[i]) for s, i in zip(scores, indices)]


print("VectorStore and RAG classes defined. Embedding model:", EMBEDDING_MODEL_ID)


VectorStore and RAG classes defined. Embedding model: all-mpnet-base-v2


---
## Experiment: Retrieval with Two Chunk Strategies

**Goal:** Build two retrievers (A/B chunking), run at least 5 queries, print top-3 results, and evaluate quality manually.

**Setup:**
1. Build `store_a` and `store_b` with the same embedding model (MPNet, Week 1 champion) and same metric (cosine via normalized IP).
2. Run the query set.
3. Compare relevance and context quality across chunk strategies.

**Conclusion:** The printed top-3 and the evaluation table show whether the expected topic is retrieved at top-1 and in top-3 for each strategy; the best strategy is selected in the next cells.


In [10]:
# Build VectorStores for both chunk strategies using the same model/metric (MPNet + cosine)
store_a = VectorStore()
store_a.build_index(chunks_a)
store_b = VectorStore()
store_b.build_index(chunks_b)

rag_a = RAG(store_a)
rag_b = RAG(store_b)

# At least 5 test queries (mix of topics)
TEST_QUERIES = [
    "What is RAG?",
    "What is GIT?",
    "What is GCP?",
    "How to create a git branch?",
    "What is HyDE in RAG?",
]

def print_top3(label, results):
    for rank, (score, chunk) in enumerate(results, 1):
        topic = chunk.metadata.get("topic", "?")
        src = chunk.metadata.get("source", "?")[:40]
        text = chunk.page_content[:120].replace("\n", " ")
        print(f"  {rank}. [{topic}] score={score:.4f} | {src}")
        print(f"      {text}...")

for query in TEST_QUERIES:
    print("=" * 60)
    print(f"Query: {query!r}")
    print("-" * 60)
    print("Strategy A (small chunks):")
    print_top3("A", rag_a.retrieve(query, top_k=3))
    print("\nStrategy B (larger chunks):")
    print_top3("B", rag_b.retrieve(query, top_k=3))
    print()


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Batches:   0%|          | 0/27 [00:00<?, ?it/s]


### Manual Evaluation Summary

The table below records, per query: **expected_topic**, and for strategies A and B — **A_top1_topic**, **A_top1_score**, **B_top1_topic**, **B_top1_score**, and whether the expected topic appears in top-3 (**A_top3_hit**, **B_top3_hit**). The next cell aggregates **top1_accuracy**, **top3_hit_rate**, and **avg_top1_score** per strategy.

**Conclusion:** We choose the strategy with the best top1_accuracy; ties are broken by avg_top1_score. Both strategies use the same model (MPNet) and metric (cosine), so any difference in scores or relevance is due to chunking only.

In practice: smaller chunks (A) often give more precise, local matches; larger chunks (B) give more context per hit but can dilute the top-1 match.
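
The two topic-level metrics are simple counts. As a minimal sketch (plain Python, with made-up toy results rather than the notebook's actual retrievals; `retrieval_metrics` is a hypothetical helper), top-1 accuracy checks only the first retrieved topic, while top-3 hit rate checks whether the expected topic appears anywhere in the first three.

```python
def retrieval_metrics(expected_topics, retrieved_topics):
    """expected_topics: one topic per query; retrieved_topics: ranked topic list per query."""
    if len(expected_topics) != len(retrieved_topics):
        raise ValueError("need one ranked list per query")
    n = len(expected_topics)
    top1_hits = sum(
        1 for exp, ranked in zip(expected_topics, retrieved_topics)
        if ranked and ranked[0] == exp
    )
    top3_hits = sum(
        1 for exp, ranked in zip(expected_topics, retrieved_topics)
        if exp in ranked[:3]
    )
    return {"top1_accuracy": top1_hits / n, "top3_hit_rate": top3_hits / n}


# Toy data: 3 queries with invented retrieval results.
expected = ["RAG", "GIT", "GCP"]
retrieved = [
    ["RAG", "RAG", "GIT"],  # correct at rank 1
    ["GCP", "GIT", "GIT"],  # wrong at rank 1, but present in top-3
    ["RAG", "GIT", "GIT"],  # expected topic missing from top-3
]

metrics = retrieval_metrics(expected, retrieved)
print(metrics)
```

With only five queries, each query moves either metric by 0.2, so small differences between strategies should be read with caution.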


In [12]:
# Manual evaluation table: expected topic vs retrieved topic (top-1 and top-3)
QUERY_EXPECTED_TOPIC = {
    "What is RAG?": "RAG",
    "What is GIT?": "GIT",
    "What is GCP?": "GCP",
    "How to create a git branch?": "GIT",
    "What is HyDE in RAG?": "RAG",
}

rows = []
for query in TEST_QUERIES:
    expected = QUERY_EXPECTED_TOPIC.get(query, "?")

    ra = rag_a.retrieve(query, top_k=3)
    rb = rag_b.retrieve(query, top_k=3)

    a_topics = [item[1].metadata.get("topic", "?") for item in ra]
    b_topics = [item[1].metadata.get("topic", "?") for item in rb]

    rows.append({
        "query": query,
        "expected_topic": expected,
        "A_top1_topic": a_topics[0] if a_topics else "-",
        "A_top1_score": round(ra[0][0], 4) if ra else None,
        "A_top3_hit": expected in a_topics,
        "B_top1_topic": b_topics[0] if b_topics else "-",
        "B_top1_score": round(rb[0][0], 4) if rb else None,
        "B_top3_hit": expected in b_topics,
    })

eval_df = pd.DataFrame(rows)
print("Manual evaluation table (chunk strategy A vs B):")
eval_df


Manual evaluation table (chunk strategy A vs B):


Unnamed: 0,query,expected_topic,A_top1_topic,A_top1_score,A_top3_hit,B_top1_topic,B_top1_score,B_top3_hit
0,What is RAG?,RAG,RAG,0.6892,True,RAG,0.5923,True
1,What is GIT?,GIT,GIT,0.7795,True,GIT,0.7399,True
2,What is GCP?,GCP,GCP,0.7446,True,GCP,0.631,True
3,How to create a git branch?,GIT,GIT,0.6926,True,GIT,0.6746,True
4,What is HyDE in RAG?,RAG,RAG,0.4953,True,RAG,0.4013,True


In [13]:
# Strategy-level summary (same model + metric; only chunking changes)
strategy_summary = pd.DataFrame([
    {
        "strategy": "A_small_300_50",
        "top1_accuracy": (eval_df["A_top1_topic"] == eval_df["expected_topic"]).mean(),
        "top3_hit_rate": eval_df["A_top3_hit"].mean(),
        "avg_top1_score": eval_df["A_top1_score"].mean(),
    },
    {
        "strategy": "B_large_800_100",
        "top1_accuracy": (eval_df["B_top1_topic"] == eval_df["expected_topic"]).mean(),
        "top3_hit_rate": eval_df["B_top3_hit"].mean(),
        "avg_top1_score": eval_df["B_top1_score"].mean(),
    },
])

strategy_summary[["top1_accuracy", "top3_hit_rate", "avg_top1_score"]] = (
    strategy_summary[["top1_accuracy", "top3_hit_rate", "avg_top1_score"]].round(4)
)

# Primary decision rule: top1_accuracy, tie-breaker: avg_top1_score.
# Sorting on both columns applies the tie-breaker; idxmax() per column would ignore it.
best_idx = strategy_summary.sort_values(
    ["top1_accuracy", "avg_top1_score"], ascending=False
).index[0]
CHOSEN_STRATEGY = strategy_summary.loc[best_idx, "strategy"]
CHOSEN_CHUNKS = chunks_a if CHOSEN_STRATEGY.startswith("A_") else chunks_b

print("Chunk strategy summary:")
strategy_summary


Chunk strategy summary:


Unnamed: 0,strategy,top1_accuracy,top3_hit_rate,avg_top1_score
0,A_small_300_50,1.0,1.0,0.6802
1,B_large_800_100,1.0,1.0,0.6078


In [14]:
print(f"Selected chunk strategy for the rest of Week 2: {CHOSEN_STRATEGY}")
print("Reason: best top-1 relevance (tie-breaker: higher average top-1 cosine score).")


Selected chunk strategy for the rest of Week 2: A_small_300_50
Reason: best top-1 relevance (tie-breaker: higher average top-1 cosine score).
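
The decision rule (top-1 accuracy first, average top-1 score as tie-breaker) amounts to a lexicographic comparison. A minimal sketch in plain Python, using the two summary rows printed above and comparing tuples so the second value matters only when the first is tied:

```python
# Values copied from the chunk strategy summary table.
summary_rows = [
    {"strategy": "A_small_300_50", "top1_accuracy": 1.0, "avg_top1_score": 0.6802},
    {"strategy": "B_large_800_100", "top1_accuracy": 1.0, "avg_top1_score": 0.6078},
]

# Tuples compare element by element: accuracy decides, score breaks ties.
best = max(summary_rows, key=lambda row: (row["top1_accuracy"], row["avg_top1_score"]))
print(best["strategy"])  # A_small_300_50
```

Here both strategies tie on accuracy, so the choice rests entirely on the average top-1 score.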


In [15]:
# Persist retrieval evaluation artifacts for report/mentor review
(eval_df.sort_values("query")
 .to_csv(ARTIFACTS_DIR / "week2_chunk_strategy_eval.csv", index=False))
strategy_summary.to_csv(ARTIFACTS_DIR / "week2_chunk_strategy_summary.csv", index=False)
print("Saved:")
print("- artifacts/week2_chunk_strategy_eval.csv")
print("- artifacts/week2_chunk_strategy_summary.csv")


Saved:
- artifacts/week2_chunk_strategy_eval.csv
- artifacts/week2_chunk_strategy_summary.csv


---
## Analysis

### 1) Chunk size and overlap
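- Smaller chunks usually improve pinpoint matching: in this run Strategy A (300/50) averaged a top-1 score of 0.6802 versus 0.6078 for Strategy B (800/100), with both at 1.0 top-1 accuracy.
- Larger chunks usually improve context completeness, at the cost of diluting the match with less relevant text.
- Overlap reduces boundary loss: text cut at a chunk edge is repeated at the start of the next chunk.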

### 2) Metric consistency with Week 1
- Champion metric is cosine.
- With normalized embeddings, the inner product returned by `IndexFlatIP` equals cosine similarity, so both scores and ranking match the Week 1 metric.

### 3) Model usage logic from Week 1
- **Pipeline model:** MPNet (Week 1 champion).
- **Optional:** Experiment 5 compares with MiniLM for reference.

**Conclusion:** Chunk size and overlap directly affect retrieval; we use the same metric (cosine) and operational model (MPNet) for consistency with Week 1.

### Handoff to Week 3
Week 3 reuses this retriever (same chunking and indexing) and adds prompt + LLM generation over the retrieved context.


In [16]:
# Setup for model comparison: keep chunking fixed to the selected strategy
chunks = CHOSEN_CHUNKS
texts = [c.page_content for c in chunks]
topics = [c.metadata["topic"] for c in chunks]

print(f"Model comparison will use strategy: {CHOSEN_STRATEGY}")
print(f"Chunks used: {len(chunks)}")


Model comparison will use strategy: A_small_300_50
Chunks used: 1139


---
## Experiment 4: Compare Chunk Sizes

**Goal:** Understand how chunk size affects retrieval quality.

**Configurations:**
- Small: 300 chars, 30 overlap → more precise, less context
- Medium: 500 chars, 50 overlap → balanced
- Large: 800 chars, 80 overlap → more context, may include noise

**Expected behavior:**
- Small chunks → more chunks, precise matches
- Large chunks → fewer chunks, broader context

**Conclusion:** The table below shows top1_accuracy, top3_hit_rate, and avg_top1_score for each configuration; the best combination depends on the query set and corpus.


In [17]:
chunk_configs = [
    {"chunk_size": 300, "overlap": 30},
    {"chunk_size": 500, "overlap": 50},
    {"chunk_size": 800, "overlap": 80},
]

# Evaluate chunk-size impact with fixed model/metric (MPNet + cosine)
chunk_eval_rows = []
for cfg in chunk_configs:
    chunks_temp = chunk_documents(raw_docs, cfg["chunk_size"], cfg["overlap"])
    store_temp = VectorStore(model_id=EMBEDDING_MODEL_ID)
    store_temp.build_index(chunks_temp)
    rag_temp = RAG(store_temp)

    top1_hits = 0
    top3_hits = 0
    top1_scores = []

    for query in TEST_QUERIES:
        expected = QUERY_EXPECTED_TOPIC[query]
        results = rag_temp.retrieve(query, top_k=3)
        top_topics = [r[1].metadata.get("topic", "?") for r in results]

        if top_topics and top_topics[0] == expected:
            top1_hits += 1
        if expected in top_topics:
            top3_hits += 1
        if results:
            top1_scores.append(results[0][0])

    chunk_eval_rows.append({
        "chunk_size": cfg["chunk_size"],
        "overlap": cfg["overlap"],
        "num_chunks": len(chunks_temp),
        "top1_accuracy": round(top1_hits / len(TEST_QUERIES), 4),
        "top3_hit_rate": round(top3_hits / len(TEST_QUERIES), 4),
        "avg_top1_score": round(float(np.mean(top1_scores)), 4),
    })

chunk_eval_df = pd.DataFrame(chunk_eval_rows).sort_values(["top1_accuracy", "avg_top1_score"], ascending=False)
print("Chunk-size experiment summary (MPNet + cosine):")
chunk_eval_df


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Batches:   0%|          | 0/35 [00:00<?, ?it/s]

Chunk-size experiment summary (MiniLM + cosine):


Unnamed: 0,chunk_size,overlap,num_chunks,top1_accuracy,top3_hit_rate,avg_top1_score
0,300,30,1097,1.0,1.0,0.6502
1,500,50,663,1.0,1.0,0.6211
2,800,80,435,1.0,1.0,0.6104


---
## Experiment 5 (Optional): Compare Embedding Models

**Goal:** Compare MiniLM and MPNet on the same retrieval setup. The pipeline already uses the Week 1 champion (MPNet); this section is for reference.

Method in this notebook:
- Fix chunking to the selected strategy from the chunking experiment.
- Keep metric fixed to cosine (normalized IP).
- Evaluate both models on the same 5-query set with top-1 accuracy, top-3 hit rate, and average top-1 score.

| Model | Dimensions | Speed | Expected quality |
|-------|------------|-------|------------------|
| all-MiniLM-L6-v2 | 384 | Faster | Good |
| all-mpnet-base-v2 | 768 | Slower | Slightly better |

**Conclusion:** This is optional (pipeline already uses the champion MPNet). Run the helper cell first, then MiniLM and MPNet cells; the table compares both on the same chunks and metric.
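
The dimension column also determines index size. As a rough estimate, assuming a flat index stores every vector as raw float32 and ignoring any bookkeeping overhead, the footprint is `num_vectors * dim * 4` bytes. The helper name `flat_index_bytes` is illustrative; 833 is the Strategy A chunk count reported in the chunking section.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_float=4):
    """Approximate size of the raw vector storage in a flat float32 index."""
    return num_vectors * dim * bytes_per_float


NUM_CHUNKS = 833  # Strategy A chunk count from the chunking output

for name, dim in [("all-MiniLM-L6-v2", 384), ("all-mpnet-base-v2", 768)]:
    size = flat_index_bytes(NUM_CHUNKS, dim)
    print(f"{name}: {size} bytes ({size / 2**20:.2f} MiB)")
```

At this corpus size both indexes are a few MiB at most, so the memory pressure that motivates separate cells comes from holding the models themselves, not from the vectors.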


In [18]:
models = {
    "MiniLM": "all-MiniLM-L6-v2",
    "MPNet": "all-mpnet-base-v2",
}


def build_index_with_model(texts, model: SentenceTransformer):
    if not texts:
        raise ValueError("Cannot build index: texts list is empty")

    batch_size = 32
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        embeddings.append(model.encode(batch, normalize_embeddings=True, show_progress_bar=False))

    emb = np.ascontiguousarray(np.vstack(embeddings), dtype=np.float32)
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    return index


def retrieve(model: SentenceTransformer, index, query: str, top_k: int = 3):
    if not query or not query.strip():
        raise ValueError("Query cannot be empty")
    if index.ntotal == 0:
        raise ValueError("Index is empty")

    q = np.ascontiguousarray(model.encode([query], normalize_embeddings=True), dtype=np.float32)
    k = min(top_k, index.ntotal)
    scores, indices = index.search(q, k)
    return scores[0], indices[0]


def evaluate_model_on_queries(model_name: str, model_id: str, texts, chunks, queries, expected_topic):
    model = SentenceTransformer(model_id)
    idx = build_index_with_model(texts, model)

    rows = []
    for query in queries:
        scores, indices = retrieve(model, idx, query, top_k=3)
        retrieved_topics = [chunks[i].metadata.get("topic", "?") for i in indices]
        expected = expected_topic[query]
        rows.append({
            "model": model_name,
            "query": query,
            "expected_topic": expected,
            "top1_topic": retrieved_topics[0],
            "top1_score": float(scores[0]),
            "top1_correct": retrieved_topics[0] == expected,
            "top3_hit": expected in retrieved_topics,
        })

    df = pd.DataFrame(rows)
    summary = {
        "model": model_name,
        "top1_accuracy": round(float(df["top1_correct"].mean()), 4),
        "top3_hit_rate": round(float(df["top3_hit"].mean()), 4),
        "avg_top1_score": round(float(df["top1_score"].mean()), 4),
    }

    del idx
    del model
    return df, summary


print("Model comparison setup ready.")
print("Fixed settings:")
print(f"- Chunk strategy: {CHOSEN_STRATEGY}")
print("- Metric: cosine (normalized embeddings + IndexFlatIP)")


Model comparison setup ready.
Fixed settings:
- Chunk strategy: A_small_300_50
- Metric: cosine (normalized embeddings + IndexFlatIP)


---
## Model Comparison: MiniLM

**Goal:** Evaluate the MiniLM model (384-dimensional embeddings) on the 5-query test set.

**Why separate cells?** Each model is loaded and evaluated in its own cell to avoid memory overload. Run the previous cell (with `evaluate_model_on_queries` and `build_index_with_model`) before this one.

**Conclusion:** The output is the MiniLM summary (top1_accuracy, top3_hit_rate, avg_top1_score); the MPNet cell produces the same metrics for comparison.


In [16]:
# Evaluate MiniLM on the selected chunk strategy
mini_df, mini_summary = evaluate_model_on_queries(
    model_name="MiniLM",
    model_id=models["MiniLM"],
    texts=texts,
    chunks=chunks,
    queries=TEST_QUERIES,
    expected_topic=QUERY_EXPECTED_TOPIC,
)

print("MiniLM summary:")
print(pd.DataFrame([mini_summary]))
mini_df


Loading MiniLM...


Loading weights:   0%|          | 0/103 [00:00<?, ?it/s]

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


✓ MiniLM loaded successfully

Building index...

✗ ERROR with MiniLM: name 'build_index_with_model' is not defined


Traceback (most recent call last):
  File "/var/folders/z8/cqzr9q312nzfqkq1hrzrwqnr0000gp/T/ipykernel_28878/4041347915.py", line 11, in <module>
    idx = build_index_with_model(texts, model)
          ^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'build_index_with_model' is not defined


---
## Model Comparison: MPNet

Evaluate MPNet (768d) on the same chunks and queries, then compare with MiniLM.

**Conclusion:** The pipeline uses MPNet (Week 1 champion); this comparison benchmarks MPNet against MiniLM on the same chunks, metric, and queries. Results are saved to `artifacts/week2_model_eval_*.csv` and `artifacts/week2_model_comparison_summary.csv`.


In [None]:
# Evaluate MPNet on the selected chunk strategy and compare to MiniLM
mpnet_df, mpnet_summary = evaluate_model_on_queries(
    model_name="MPNet",
    model_id=models["MPNet"],
    texts=texts,
    chunks=chunks,
    queries=TEST_QUERIES,
    expected_topic=QUERY_EXPECTED_TOPIC,
)

model_compare_df = pd.DataFrame([mini_summary, mpnet_summary]).sort_values(
    ["top1_accuracy", "avg_top1_score"], ascending=False
)

# Pipeline uses Week 1 champion
WEEK2_OPERATIONAL_MODEL = "MPNet"
MODEL_BENCHMARK_WINNER = model_compare_df.iloc[0]["model"]

print("Model comparison summary:")
print(model_compare_df)
print(f"\nBenchmark winner on this sample: {MODEL_BENCHMARK_WINNER}")
print(f"Operational model for pipeline continuity: {WEEK2_OPERATIONAL_MODEL}")

# Save artifacts
mini_df.to_csv(ARTIFACTS_DIR / "week2_model_eval_minilm.csv", index=False)
mpnet_df.to_csv(ARTIFACTS_DIR / "week2_model_eval_mpnet.csv", index=False)
model_compare_df.to_csv(ARTIFACTS_DIR / "week2_model_comparison_summary.csv", index=False)
print("Saved model-comparison artifacts to artifacts/.")


Loading MPNet...


Loading weights:   0%|          | 0/199 [00:00<?, ?it/s]

MPNetModel LOAD REPORT from: sentence-transformers/all-mpnet-base-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



---
## Final Week 2 Conclusions

1. **Chunking decision:** selected by measured retrieval relevance (top-1/top-3) on the fixed query set.
2. **Model comparison:** MiniLM vs MPNet evaluated on the same chunking and metric.
3. **Pipeline model:** **MPNet + cosine** (Week 1 champion).

### Handoff to Week 3
Week 3 should reuse:
- the selected chunking strategy,
- the MPNet embedding model (champion from Week 1),
- cosine retrieval via FAISS `IndexFlatIP`.

Week 3 then adds prompt + LLM generation on top.

**Conclusion:** The final decisions (chunk strategy, operational model, metric) are written to `artifacts/week2_final_decisions.csv` for reproducibility and for use in Week 3.


In [None]:
final_week2_summary = pd.DataFrame([
    {
        "decision": "chunk_strategy",
        "value": CHOSEN_STRATEGY,
        "basis": "top1_accuracy -> tie-break avg_top1_score",
    },
    {
        "decision": "benchmark_model_winner",
        "value": MODEL_BENCHMARK_WINNER,
        "basis": "same chunks + same metric + same query set",
    },
    {
        "decision": "operational_model",
        "value": WEEK2_OPERATIONAL_MODEL,
        "basis": "Week 1 champion (MPNet)",
    },
    {
        "decision": "metric",
        "value": "cosine",
        "basis": "Week 1 champion; implemented via normalized IP",
    },
])

final_week2_summary.to_csv(ARTIFACTS_DIR / "week2_final_decisions.csv", index=False)
print("Final Week 2 decisions:")
final_week2_summary
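
One way Week 3 could consume this file is to read it back into a dictionary keyed by decision name. The sketch below is an assumption about that later step, not code from the notebook. It uses the standard `csv` module on an in-memory stand-in with the same three columns (`decision`, `value`, `basis`), limited to rows whose values appear in this notebook's output or code.

```python
import csv
import io


def load_decisions(csv_file):
    """Map each decision name to its chosen value."""
    return {row["decision"]: row["value"] for row in csv.DictReader(csv_file)}


# In-memory stand-in for artifacts/week2_final_decisions.csv.
sample_csv = io.StringIO(
    "decision,value,basis\n"
    "chunk_strategy,A_small_300_50,top1_accuracy -> tie-break avg_top1_score\n"
    "operational_model,MPNet,Week 1 champion (MPNet)\n"
    "metric,cosine,Week 1 champion; implemented via normalized IP\n"
)

decisions = load_decisions(sample_csv)
print(decisions["chunk_strategy"])
print(decisions["operational_model"], "+", decisions["metric"])
```

Against the real artifact, the same function would be called on a file opened with `open(ARTIFACTS_DIR / "week2_final_decisions.csv", newline="")`.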
