# Week 2: Building the Local Retrieval Pipeline

**Scope:** local retrieval pipeline only: load PDFs -> chunk -> embed -> FAISS index -> retrieve.

## Engineering Decision (Important)
Week 1 selected **MPNet** as the semantic-quality winner. In this environment, running MPNet through `sentence-transformers` (on `torch`) triggers a **native kernel crash** during indexing: the process dies with no Python traceback.

To keep Week 2 reproducible and runnable, this notebook uses a stable embedder:
- `HashingVectorizer` (`hashing-768-stable`) for embeddings
- FAISS `IndexFlatIP` for retrieval scoring

This is a reliability-driven fallback for Week 2 execution, not a claim that hashing is better than MPNet.

**Week 2 objectives covered:**
1. Local vector store (FAISS).
2. Stable chunking for retrieval.
3. Retriever + manual quality evaluation.


### Notebook structure

1. **Setup** — imports, paths, metric helpers, and kernel-stability settings.
2. **Experiment 1** — load PDFs from `data/` (RAG, GIT, GCP); each subfolder is a topic.
3. **Chunking setup** — fixed configuration (`chunk_size=300`, `chunk_overlap=50`).
4. **VectorStore & RAG** — stable embedder + FAISS retrieval.
5. **Retrieval evaluation** — 5 queries, top-3 results and manual evaluation table.
6. **Summary** — top1_accuracy, top3_hit_rate, avg_top1_score; save CSVs to `artifacts/`.
7. **Decision record** — explicit explanation of why MPNet is not executed in this notebook.


In [None]:
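import os
# Pin native thread pools to one thread each. Set these before importing
# numpy/faiss so the native libraries see them when they initialize.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"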

import numpy as np
import pandas as pd
import faiss
from pathlib import Path
from sklearn.feature_extraction.text import HashingVectorizer
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

faiss.omp_set_num_threads(1)
print("FAISS configured: CPU mode, 1 thread")


## Pipeline Architecture

```
PDF folder -> extract text -> chunk -> embed (HashingVectorizer) -> index (FAISS) -> retrieve (top-k)
```

| Step | What we use | Why |
|------|-------------|-----|
| Load | `PyPDFLoader` | Local PDF parsing with metadata |
| Chunk | `RecursiveCharacterTextSplitter` | Stable document segmentation |
| Embed | `HashingVectorizer` (768 dims, L2 norm) | Kernel-safe embedding in this environment |
| Index | FAISS `IndexFlatIP` | Inner product; with L2-normalized vectors this is cosine-equivalent |
| Retrieve | `index.search` | Fast top-k retrieval |

MPNet remains the Week 1 quality winner; Week 2 runs on a stability-first embedder because the MPNet path crashes the kernel in this environment.
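
To make the Index and Retrieve rows concrete: `IndexFlatIP` is an exact, exhaustive index, so a search scores the query against every stored vector and keeps the best `k`. The sketch below reproduces that idea in dependency-free Python. It is an illustration only, not FAISS code; `top_k_inner_product` and the toy vectors are hypothetical.

```python
def top_k_inner_product(query, vectors, k=3):
    """Exhaustive search: score every stored vector, return the k best."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in vectors]
    # Rank vector positions by score, highest first.
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    top = order[:k]
    return [scores[i] for i in top], top


# Toy unit-length vectors (made up for illustration).
vectors = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
]
query = [0.8, 0.6]

scores, indices = top_k_inner_product(query, vectors, k=2)
print(indices, scores)
```

The return shape mirrors the `(scores, indices)` pair used later in the notebook, which is why the retriever can map indices straight back to chunks.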


## Paths and directories

- **`PROJECT_ROOT`** — project root (parent of `notebooks/`).
- **`DATA_DIR`** — `data/` with subfolders RAG, GIT, GCP containing PDFs.
- **`ARTIFACTS_DIR`** — `artifacts/` for saving retrieval evaluation CSVs and final decisions.


In [None]:
PROJECT_ROOT = Path("..").resolve()
DATA_DIR = PROJECT_ROOT / "data"
ARTIFACTS_DIR = PROJECT_ROOT / "artifacts"
ARTIFACTS_DIR.mkdir(exist_ok=True)

print(f"Project root: {PROJECT_ROOT}")
print(f"Data dir: {DATA_DIR} (exists: {DATA_DIR.exists()})")
print(f"Artifacts dir: {ARTIFACTS_DIR}")

## Similarity metric functions (from Week 1, explicit numpy)

Week 1 selected **cosine similarity** as the preferred retrieval metric.
In Week 2 we keep dot, cosine, and Euclidean helpers for transparency.
Operational retrieval still follows cosine logic via L2-normalized vectors + `IndexFlatIP`.
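
The cosine equivalence relied on here can be checked by hand. The following is a small standard-library sketch, separate from the numpy helpers in this notebook; `dot`, `cosine`, and `l2_normalize` are illustrative names and the two vectors are made-up toy values.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# Inner product after L2 normalization ...
ip_on_normalized = dot(l2_normalize(a), l2_normalize(b))
# ... matches cosine similarity of the raw vectors.
assert math.isclose(ip_on_normalized, cosine(a, b))
print(ip_on_normalized)
```

Because normalization divides each vector by its own length, ranking by inner product on normalized vectors gives the same order as ranking by cosine similarity.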


In [None]:
def dot_product(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.sum(a * b))

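def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    dot = np.sum(a * b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    # A zero vector has no direction; return 0.0 instead of dividing by zero.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(dot / (norm_a * norm_b))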

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

# Week 2 retrieval uses cosine (inner product on L2-normalized embeddings in FAISS).

---
## Experiment 1: Load PDF Documents

**Goal:** Load real PDF documents from the `data/` folder.

**Structure:**
```
data/
├── RAG/     (3 PDFs about RAG, HyDE, LangChain)
├── GIT/     (3 PDFs about Git)
└── GCP/     (3 PDFs about Google Cloud)
```

Each subfolder name becomes the **topic** for evaluation.

**Conclusion:** We obtain a flat list of page-level documents with `topic`, `source`, and `page` metadata, ready for chunking.

In [None]:
def load_pdfs(data_dir: Path) -> list:
    """
    Load all PDFs from topic subfolders.
    Adds metadata: topic (folder name), source (filename), page.
    """
    docs = []
    topics_found = []
    
    for topic_dir in sorted(data_dir.iterdir()):
        if not topic_dir.is_dir():
            continue
        
        topic = topic_dir.name
        pdf_files = sorted(topic_dir.glob("*.pdf"))  # sorted: glob order is filesystem-dependent
        
        if not pdf_files:
            print(f"  WARNING: No PDFs in {topic}/")
            continue
        
        topics_found.append(topic)
        print(f"\n[{topic}]")
        
        for pdf_path in pdf_files:
            try:
                loader = PyPDFLoader(str(pdf_path))
                pdf_docs = loader.load()
                for doc in pdf_docs:
                    doc.metadata["topic"] = topic
                    doc.metadata["source"] = pdf_path.name
                    docs.append(doc)
                print(f"  + {pdf_path.name} ({len(pdf_docs)} pages)")
            except Exception as e:
                print(f"  ERROR: {pdf_path.name} - {e}")
    
    print(f"\nLoaded: {len(docs)} pages from {len(topics_found)} topics: {topics_found}")
    return docs


# Load all PDFs
raw_docs = load_pdfs(DATA_DIR)

In [None]:

print("Sample document:")
print(f"  Topic: {raw_docs[0].metadata['topic']}")
print(f"  Source: {raw_docs[0].metadata['source']}")
print(f"  Page: {raw_docs[0].metadata.get('page', 0)}")
print(f"  Content preview: {raw_docs[0].page_content[:200]}...")

---
## Chunking Configuration

**Goal:** split documents into consistent chunks for retrieval.

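**Method:** `RecursiveCharacterTextSplitter` tries the separators in order (paragraph break, line break, sentence end, space) so that chunks end on natural boundaries where possible, and splits between characters only as a last resort.

**Configuration used in Week 2** (sizes are in characters, the splitter's default length measure):
- `chunk_size=300`
- `chunk_overlap=50`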

**Conclusion:** this fixed configuration is used across the whole notebook.
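
To show what the two numbers mean, here is a deliberately simplified fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers separator boundaries); it only illustrates how a 300-character window advancing by `chunk_size - chunk_overlap` characters repeats 50 characters between neighbours. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text, chunk_size=300, chunk_overlap=50):
    """Cut text into fixed windows; consecutive windows share chunk_overlap chars."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 700-character text (digits cycle so positions are easy to compare).
text = "".join(str(i % 10) for i in range(700))
parts = sliding_chunks(text)

print([len(p) for p in parts])
# The tail of one chunk is the head of the next.
assert parts[0][-50:] == parts[1][:50]
```

The repeated tail is what lets a sentence that straddles a boundary appear intact in at least one chunk.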


In [None]:
def chunk_documents(docs, chunk_size: int, chunk_overlap: int, separators=None):
    """
    Split a list of LangChain documents into chunks.
    Returns list of documents (each has page_content and metadata).
    """
    if separators is None:
        separators = ["\n\n", "\n", ". ", " ", ""]
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators,
    )
    return splitter.split_documents(docs)


CHUNK_CONFIG = {"chunk_size": 300, "chunk_overlap": 50}
chunks = chunk_documents(raw_docs, **CHUNK_CONFIG)
CHOSEN_CHUNKS = chunks

print(f"Original documents: {len(raw_docs)} pages")
print(f"Chunk config {CHUNK_CONFIG}: {len(chunks)} chunks")


In [None]:
def chunks_per_topic(chunks):
    out = {}
    for c in chunks:
        t = c.metadata["topic"]
        out[t] = out.get(t, 0) + 1
    return out

print("Chunks per topic:")
for topic, count in sorted(chunks_per_topic(chunks).items()):
    print(f"  {topic}: {count} chunks")


In [None]:
print("Sample chunk:")
c = chunks[len(chunks) // 2]
print(f"  Topic: {c.metadata['topic']}, Source: {c.metadata['source']}")
print(f"  Length: {len(c.page_content)} chars")
print(f"  Content: {c.page_content[:200]}...")


---
## VectorStore and Retriever Classes

**VectorStore**
- `build_index(chunks)` builds FAISS `IndexFlatIP` in streaming batches.
- `retrieve(query, top_k=3)` returns top-k scores and indices.

**Embedder choice in this notebook**
- `StableEmbedder` uses `HashingVectorizer` (deterministic, CPU-only, no torch runtime).
- Chosen to avoid native kernel termination observed with MPNet in this exact environment.

**Retriever wrapper**
- `RAG.retrieve(query, top_k=3)` returns `(score, chunk)` for inspection and evaluation.


In [None]:
# Week 2 stable embedder (no torch/sentence-transformers in kernel).
EMBEDDING_MODEL_ID = "hashing-768-stable"


class StableEmbedder:
    """Lightweight deterministic embedder based on HashingVectorizer."""

    def __init__(self, n_features: int = 768):
        self.n_features = n_features
        self.vectorizer = HashingVectorizer(
            n_features=n_features,
            norm="l2",
            alternate_sign=False,
            lowercase=True,
        )

    def encode(self, texts):
        if isinstance(texts, str):
            texts = [texts]
        x = self.vectorizer.transform(texts)
        return np.ascontiguousarray(x.toarray(), dtype=np.float32)


class VectorStore:
    """
    Local vector store: chunks + embeddings + FAISS IndexFlatIP.
    Similarity = cosine (inner product on L2-normalized vectors).
    """

    def __init__(self, model_id: str = EMBEDDING_MODEL_ID, batch_size: int = 64):
        self.model_id = model_id
        self.batch_size = batch_size
        self.model = StableEmbedder(n_features=768)
        self.chunks = []
        self._index = None

    def build_index(self, chunks):
        """Build FAISS index from chunk documents in streaming batches."""
        self.chunks = list(chunks)
        if not self.chunks:
            raise ValueError("Cannot index: chunks list is empty")

        texts = [c.page_content for c in self.chunks]
        self._index = None

        for i in range(0, len(texts), self.batch_size):
            batch = texts[i:i + self.batch_size]
            batch_emb = self.model.encode(batch)

            if self._index is None:
                self._index = faiss.IndexFlatIP(batch_emb.shape[1])

            self._index.add(batch_emb)

    def retrieve(self, query: str, top_k: int = 3):
        """Return (scores, indices) for top-k chunks. Scores are cosine similarity."""
        if not query or not query.strip():
            raise ValueError("Query cannot be empty")
        if self._index is None or self._index.ntotal == 0:
            raise ValueError("Index is empty; call build_index(chunks) first")

        q = self.model.encode([query])
        k = min(top_k, self._index.ntotal)
        scores, indices = self._index.search(q, k)
        return scores[0], indices[0]


class RAG:
    """Retrieval-only wrapper (no generation in Week 2)."""

    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store

    def retrieve(self, query: str, top_k: int = 3):
        scores, indices = self.vector_store.retrieve(query, top_k=top_k)
        return [(float(s), self.vector_store.chunks[i]) for s, i in zip(scores, indices)]


print("VectorStore and RAG classes defined. Embedding model:", EMBEDDING_MODEL_ID)


---
## Retrieval Evaluation

**Goal:** run a full retrieval pass with fixed chunking and measure relevance.

**Setup:**
1. Build a single `VectorStore` with the stable embedder and cosine-equivalent FAISS retrieval.
2. Run the fixed query set.
3. Check top-1 and top-3 relevance.


In [None]:
# Step 1: initialize vector store (cheap: the hashing embedder has no model weights to load)
store = VectorStore()
print("VectorStore initialized successfully.")


In [None]:
# Step 2: build FAISS index from chunks (heaviest operation)
print(f"Building index for {len(chunks)} chunks with batch_size={store.batch_size}...")
store.build_index(chunks)
print(f"Index built. ntotal={store._index.ntotal}")


In [None]:
# Step 3: retrieval checks
rag = RAG(store)

TEST_QUERIES = [
    "What is RAG?",
    "What is GIT?",
    "What is GCP?",
    "How to create a git branch?",
    "What is HyDE in RAG?",
]

def print_top3(results):
    for rank, (score, chunk) in enumerate(results, 1):
        topic = chunk.metadata.get("topic", "?")
        src = chunk.metadata.get("source", "?")[:40]
        text = chunk.page_content[:120].replace("\n", " ")
        print(f"  {rank}. [{topic}] score={score:.4f} | {src}")
        print(f"      {text}...")

for query in TEST_QUERIES:
    print("=" * 60)
    print(f"Query: {query!r}")
    print("-" * 60)
    print_top3(rag.retrieve(query, top_k=3))
    print()


### Manual Evaluation Summary

The table records, per query:
- `expected_topic`
- `top1_topic`
- `top1_score`
- `top3_hit`

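Relevance is judged at topic level: a result counts as correct when the retrieved chunk's `topic` metadata equals `expected_topic`, and `top3_hit` is true if any of the top 3 chunks matches.

This is the evidence table for Week 2 retrieval quality under the stable execution setup.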


In [None]:
QUERY_EXPECTED_TOPIC = {
    "What is RAG?": "RAG",
    "What is GIT?": "GIT",
    "What is GCP?": "GCP",
    "How to create a git branch?": "GIT",
    "What is HyDE in RAG?": "RAG",
}

rows = []
for query in TEST_QUERIES:
    expected = QUERY_EXPECTED_TOPIC.get(query, "?")
    results = rag.retrieve(query, top_k=3)
    topics = [item[1].metadata.get("topic", "?") for item in results]

    rows.append({
        "query": query,
        "expected_topic": expected,
        "top1_topic": topics[0] if topics else "-",
        "top1_score": round(results[0][0], 4) if results else None,
        "top3_hit": expected in topics,
    })

eval_df = pd.DataFrame(rows)
print("Manual evaluation table:")
eval_df


In [None]:
retrieval_summary = pd.DataFrame([
    {
        "top1_accuracy": round((eval_df["top1_topic"] == eval_df["expected_topic"]).mean(), 4),
        "top3_hit_rate": round(eval_df["top3_hit"].mean(), 4),
        "avg_top1_score": round(eval_df["top1_score"].mean(), 4),
        "chunk_size": CHUNK_CONFIG["chunk_size"],
        "chunk_overlap": CHUNK_CONFIG["chunk_overlap"],
    }
])

print("Retrieval summary:")
retrieval_summary


In [None]:
print(f"Week 2 fixed chunk config: {CHUNK_CONFIG}")
print(f"Model: {EMBEDDING_MODEL_ID} | Metric: cosine-equivalent (IndexFlatIP on normalized vectors)")


In [None]:
eval_df.sort_values("query").to_csv(ARTIFACTS_DIR / "week2_retrieval_eval.csv", index=False)
retrieval_summary.to_csv(ARTIFACTS_DIR / "week2_retrieval_summary.csv", index=False)
print("Saved:")
print("- artifacts/week2_retrieval_eval.csv")
print("- artifacts/week2_retrieval_summary.csv")


---
## Analysis and Rationale

### 1) Chunking
- Fixed configuration: `chunk_size=300`, `chunk_overlap=50`.
- Overlap preserves context across chunk boundaries.

### 2) Metric
- Retrieval uses cosine-equivalent ranking (`IndexFlatIP` on L2-normalized vectors).

### 3) Why MPNet is not used in Week 2 execution
- MPNet was the quality winner in the Week 1 evaluation.
- In this local runtime (`Python 3.13` + Jupyter kernel), the MPNet path (`sentence-transformers` on `torch`) crashes the kernel process during indexing.
- Because the process dies at the native level with no Python traceback, the notebook cannot be run reproducibly with MPNet here.
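
A native crash cannot be caught with `try`/`except`, but it can be contained by running the risky code in a child interpreter and checking its exit status. The following is a minimal sketch of that idea, not something this notebook does; `survives_in_subprocess` is a hypothetical helper, and `os._exit(1)` merely stands in for a process that dies without a traceback.

```python
import subprocess
import sys


def survives_in_subprocess(code: str, timeout: float = 120.0) -> bool:
    """Run `code` in a fresh interpreter; True only if it exits with status 0."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    # A killed or crashed child reports a non-zero return code.
    return result.returncode == 0


ok = survives_in_subprocess("print('hello')")
died = survives_in_subprocess("import os; os._exit(1)")  # stand-in for a crash
print(ok, died)
```

Under the same assumptions, the probe string could load and call the MPNet model instead, so that a failure kills only the child process and leaves the notebook kernel alive. Whether a short probe reproduces a crash seen during full indexing is not guaranteed.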

### 4) Why this fallback was chosen
- `HashingVectorizer` is deterministic, CPU-only, and stable in the same environment.
- It lets us complete Week 2 goals (chunking, indexing, retrieval evaluation, artifact export) without kernel death.
- Tradeoff: hashed bag-of-words vectors capture token overlap rather than meaning, so semantic quality is typically lower than MPNet, but operational stability is higher.


---
## Final Week 2 Conclusions

1. Retrieval pipeline is built and validated on the local corpus.
2. Chunking is fixed: `chunk_size=300`, `chunk_overlap=50`.
3. Metric is cosine-equivalent via FAISS `IndexFlatIP` on normalized vectors.
4. Week 2 embedder is `hashing-768-stable` for runtime reliability.

## Decision Record
- **Preferred model from Week 1:** MPNet.
- **Week 2 runtime model:** hashing fallback.
- **Reason:** repeated native kernel termination when running MPNet in this environment.
- **Principle used:** reproducible execution over model optimality for this stage.

### Handoff to Week 3
Week 3 should reuse the following, then add prompt + LLM generation:
- the fixed chunking configuration,
- the stable retrieval interface (`VectorStore`/`RAG`),
- the artifact outputs from Week 2.


In [None]:
WEEK2_OPERATIONAL_MODEL = "hashing-768-stable"

final_week2_summary = pd.DataFrame([
    {
        "decision": "chunking_config",
        "value": str(CHUNK_CONFIG),
        "basis": "fixed in Week 2 pipeline",
    },
    {
        "decision": "preferred_model_from_week1",
        "value": "all-mpnet-base-v2",
        "basis": "Week 1 quality winner",
    },
    {
        "decision": "operational_model_week2",
        "value": WEEK2_OPERATIONAL_MODEL,
        "basis": "runtime stability in Python 3.13 Jupyter",
    },
    {
        "decision": "model_policy",
        "value": "stability_fallback",
        "basis": "MPNet path caused native kernel crash",
    },
    {
        "decision": "metric",
        "value": "cosine_equivalent",
        "basis": "normalized vectors + FAISS IndexFlatIP",
    },
])

final_week2_summary.to_csv(ARTIFACTS_DIR / "week2_final_decisions.csv", index=False)
print("Final Week 2 decisions:")
final_week2_summary
