# Week 4 – Retrieval-Augmented Generation (RAG) with arXiv Papers

This notebook implements a complete RAG-style retrieval system using a small collection of scientific PDFs.  
The workflow includes:

1. **PDF text extraction** using PyMuPDF  
2. **Token-based sliding window chunking**  
3. **Embedding generation** using `SentenceTransformer` (`all-MiniLM-L6-v2`)  
4. **Vector index construction** using FAISS  
5. **Semantic search demo** retrieving top-k relevant chunks  
6. **FastAPI web service** exposing `/search` as an API endpoint  

This notebook also saves:
- `chunks.pkl`
- `faiss.index`
- `meta.pkl`

These are loaded by `main.py` when serving results via FastAPI.


## 1. Environment and Dataset

- Python environment: `ollama314` (Anaconda, Python 3.14)
- Key libraries:
  - `pymupdf` (imported as `fitz`) for PDF → text
  - `sentence-transformers` for embeddings (`all-MiniLM-L6-v2`)
  - `faiss-cpu` for vector search (with a pure NumPy fallback if FAISS is unavailable)
  - `fastapi` + `uvicorn` for the search API

**Data assumption**

- I have a folder of arXiv PDFs, for example:

```text
data/arxiv_pdfs/
    paper_01.pdf
    paper_02.pdf
    ...
```


In [1]:
# 2. (Optional) Install dependencies
# Run this once if the packages are not installed in the environment.

# !pip install -q "sentence-transformers>=3.0.0" pymupdf fastapi "uvicorn[standard]"
# !pip install -q faiss-cpu   # may fail on some Python versions; we handle that later

print("If you see errors, install packages manually in your conda env.")

If you see errors, install packages manually in your conda env.


In [2]:
# 3. Imports and configuration

import os
from pathlib import Path
from typing import List, Tuple, Dict

import numpy as np

import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer

# Try to import FAISS; if it fails, we'll use a simple NumPy-based fallback index.
try:
    import faiss
    FAISS_AVAILABLE = True
except ImportError:
    FAISS_AVAILABLE = False
    print("⚠️ faiss-cpu not available – will use a NumPy fallback index instead.")

# ----------------------
# Paths and parameters
# ----------------------
PDF_FOLDER = Path("data/arxiv_pdfs")         # folder containing the PDFs
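# Note: all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces, so only
# the first part of a 512-token chunk is reflected in its embedding.
MAX_TOKENS = 512                             # chunk size (in whitespace tokens)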
OVERLAP = 50                                 # overlap between chunks
EMBED_MODEL_NAME = "all-MiniLM-L6-v2"        # sentence-transformers model
TOP_K = 3                                    # top-k passages to retrieve

print("PDF folder:", PDF_FOLDER.resolve())
print("FAISS available:", FAISS_AVAILABLE)




PDF folder: D:\AI study\MLE_in_Gen_AI-Course\week 0\MLE_in_Gen_AI-Course\class4\data\arxiv_pdfs
FAISS available: True


## 📁 Files Saved in `data/index/`

Running this notebook generates three important files:

### **`chunks.pkl`**
- List of all text chunks (strings)  
- Used so the API can return the original text  

### **`faiss.index`**
- Vector index created by FAISS  
- Stores embeddings for fast k-NN search  

### **`meta.pkl`**
- Metadata for each chunk:
  - PDF filename
  - chunk ID  

These files are loaded directly by `main.py` so the FastAPI service works **without running the notebook again**.
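
The three files are linked only by position: row *i* of the vector index corresponds to `chunks[i]` and `metadata[i]`. The sketch below illustrates that convention with made-up placeholder data and a temporary directory (both hypothetical, not the notebook's real files). It round-trips two parallel lists through pickle and looks up one result by row number.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical placeholder data standing in for the real chunks and metadata.
demo_chunks = ["first chunk text", "second chunk text", "third chunk text"]
demo_meta = [{"pdf_name": "paper_01.pdf", "chunk_id": i} for i in range(3)]

with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    # Write both lists in the same order, as the notebook does.
    with open(index_dir / "chunks.pkl", "wb") as f:
        pickle.dump(demo_chunks, f)
    with open(index_dir / "meta.pkl", "wb") as f:
        pickle.dump(demo_meta, f)

    # Read them back, as a separate serving process would.
    with open(index_dir / "chunks.pkl", "rb") as f:
        loaded_chunks = pickle.load(f)
    with open(index_dir / "meta.pkl", "rb") as f:
        loaded_meta = pickle.load(f)

# A vector index returns row numbers; the same number addresses both lists.
row = 1
hit = {"text": loaded_chunks[row], **loaded_meta[row]}
print(hit)
```

Because nothing but list order ties the files together, all of them should be regenerated together whenever the PDFs or the chunking parameters change.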


In [3]:
# 4. PDF → text extraction (PyMuPDF)

def extract_text_from_pdf(pdf_path: Path) -> str:
    """
    Open a PDF and extract all text as a single string.
    """
    # The context manager closes the file handle once extraction is done.
    with fitz.open(str(pdf_path)) as doc:
        pages = []
        for page in doc:
            page_text = page.get_text()      # raw text from each page
            # TODO (optional): clean page_text (remove headers/footers)
            pages.append(page_text)
    full_text = "\n".join(pages)
    return full_text


# Quick smoke test (only if there is at least one PDF)
sample_pdfs = sorted(PDF_FOLDER.glob("*.pdf"))
print(f"Found {len(sample_pdfs)} PDF files.")
if sample_pdfs:
    txt_preview = extract_text_from_pdf(sample_pdfs[0])
    print("Sample PDF:", sample_pdfs[0].name)
    print("Preview (first 400 chars):")
    print(txt_preview[:400])


Found 1 PDF files.
Sample PDF: Build_a_Large_Language_Model_(From_Scrat_v8_MEAP.pdf
Preview (first 400 chars):

MEAP Edition
Manning Early Access Program
Build a Large Language Model (From Scratch)
Version 8
Copyright 2024 Manning Publications
For more information on this and other Manning titles go to manning.com.
© Manning Publications Co. To comment go to liveBook
Licensed to     <149533107@qq.com>

welcome

Thank you for purchasing the MEAP edition of Build a Large Language Model (From
Scratch).

I
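
The preview above already contains publisher boilerplate (the "© Manning Publications Co. ..." line), and the `TODO` in `extract_text_from_pdf` mentions removing headers and footers. One possible approach, sketched here under the assumption that a line appearing on most pages is boilerplate, is to count lines per page and drop the frequent ones. The function name `strip_repeated_lines`, the `min_fraction` parameter, and the toy pages are all hypothetical.

```python
from collections import Counter
from typing import List


def strip_repeated_lines(pages: List[str], min_fraction: float = 0.6) -> List[str]:
    """Drop lines that occur on at least `min_fraction` of the pages."""
    # Count each distinct non-empty line once per page.
    page_counts = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page.splitlines() if line.strip()})

    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, min_fraction * len(pages))
    repeated = {line for line, n in page_counts.items() if n >= threshold}

    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


# Toy pages with a shared footer (made-up text).
pages = [
    "Attention basics\nFooter: To comment go to liveBook",
    "Self-attention\nFooter: To comment go to liveBook",
    "Causal masks\nFooter: To comment go to liveBook",
]
print(strip_repeated_lines(pages))
```

Exact matching will not catch footers that vary from page to page (for example, ones that include a page number), so this is only a starting point.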


In [4]:
# 5. Text chunking (sliding window over whitespace tokens)

def chunk_text(text: str, max_tokens: int = MAX_TOKENS, overlap: int = OVERLAP) -> List[str]:
    """
    Split a long text into overlapping chunks of up to max_tokens tokens.
    Uses a simple whitespace tokenizer for clarity.
    """
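    if not 0 <= overlap < max_tokens:
        # A non-positive step would make range() fail or produce no chunks.
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = text.split()
    chunks = []
    step = max_tokens - overlap              # how far the window advances each time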
    for i in range(0, len(tokens), step):
        chunk = tokens[i : i + max_tokens]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
    return chunks


# Quick test on the sample text
if sample_pdfs:
    test_chunks = chunk_text(txt_preview, max_tokens=64, overlap=16)
    print(f"Created {len(test_chunks)} small test chunks (64 tokens, overlap 16).")
    print("First chunk preview:")
    print(test_chunks[0][:300])


Created 1898 small test chunks (64 tokens, overlap 16).
First chunk preview:
MEAP Edition Manning Early Access Program Build a Large Language Model (From Scratch) Version 8 Copyright 2024 Manning Publications For more information on this and other Manning titles go to manning.com. © Manning Publications Co. To comment go to liveBook Licensed to     <149533107@qq.com> welcome
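
The window in `chunk_text` advances by `step = max_tokens - overlap` tokens, so a text of `n` whitespace tokens yields `ceil(n / step)` chunks. The short sketch below mirrors the loop's `range(0, n, step)` to make that arithmetic visible. The token count of 91,100 is a hypothetical value chosen for illustration (the notebook does not print the real count), and the helper names are not part of the notebook.

```python
import math


def expected_chunk_count(n_tokens: int, max_tokens: int, overlap: int) -> int:
    """Number of windows produced by range(0, n_tokens, max_tokens - overlap)."""
    step = max_tokens - overlap
    return math.ceil(n_tokens / step)


def window_starts(n_tokens: int, max_tokens: int, overlap: int) -> list:
    """Start offsets of each chunk, mirroring the loop in chunk_text."""
    return list(range(0, n_tokens, max_tokens - overlap))


# Small case: 10 tokens, windows of 4 with overlap 1.
print(window_starts(10, 4, 1))

# Hypothetical token count (an assumption, not printed by the notebook).
n = 91_100
print(expected_chunk_count(n, 512, 50))   # full-size chunks
print(expected_chunk_count(n, 64, 16))    # small test chunks
```

In the small case the last window starts at token 9 and holds a single token that the previous window (tokens 6 to 9) already covers, so the final chunk of a document can be short and fully redundant. Both recorded counts (198 chunks at 512/50 and 1898 at 64/16) are consistent with a text of roughly 91,000 whitespace tokens.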


In [9]:
# 6. Build corpus of chunks and metadata from all PDFs

def build_corpus(pdf_folder: Path) -> Tuple[List[str], List[Dict]]:
    """
    Process all PDFs in the folder:
      - extract full text
      - chunk into segments
    Returns:
      chunks: list[str]
      metadata: list[dict] (per chunk: pdf_name, chunk_id, etc.)
    """
    all_chunks: List[str] = []
    all_meta: List[Dict] = []

    for pdf_path in sorted(pdf_folder.glob("*.pdf")):
        full_text = extract_text_from_pdf(pdf_path)
        chunks = chunk_text(full_text, max_tokens=MAX_TOKENS, overlap=OVERLAP)

        for idx, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_meta.append({
                "pdf_name": pdf_path.name,
                "chunk_id": idx,
            })

        print(f"{pdf_path.name}: {len(chunks)} chunks")

    print(f"\nTotal chunks in corpus: {len(all_chunks)}")
    return all_chunks, all_meta


chunks, metadata = build_corpus(PDF_FOLDER)


Build_a_Large_Language_Model_(From_Scrat_v8_MEAP.pdf: 198 chunks

Total chunks in corpus: 198


In [10]:
# 7. Embedding generation with Sentence-Transformers

embed_model = SentenceTransformer(EMBED_MODEL_NAME)
print(f"Loaded embedding model: {EMBED_MODEL_NAME}")

# Compute embeddings for all chunks
embeddings = embed_model.encode(
    chunks,
    convert_to_numpy=True,
    show_progress_bar=True,
)
print("Embeddings shape:", embeddings.shape)




Loaded embedding model: all-MiniLM-L6-v2


Batches: 100%|██████████| 7/7 [00:08<00:00,  1.19s/it]

Embeddings shape: (198, 384)





In [19]:
import pickle
from pathlib import Path
import os

DATA_DIR = Path("data/index")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# Save chunks
with open(DATA_DIR / "chunks.pkl", "wb") as f:
    pickle.dump(chunks, f)

# Save metadata
with open(DATA_DIR / "meta.pkl", "wb") as f:
    pickle.dump(metadata, f)

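# Save the vector index under the file name that main.py loads (faiss.index).
# Note: `index` is created in the "Build vector index" cell (step 8), so run that cell first.
if FAISS_AVAILABLE:
    faiss.write_index(index, str(DATA_DIR / "faiss.index"))
else:
    # Fallback: save the raw embeddings for the NumPy index
    np.save(DATA_DIR / "fallback_index.npy", np.array(embeddings))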

print("Saved chunks, metadata, and index.")


Saved chunks, metadata, and index.


In [11]:
# 8. Build vector index (FAISS if available, otherwise simple NumPy search)

class SimpleNumpyIndex:
    """
    Minimal FAISS-like interface using brute-force cosine similarity, for environments
    where faiss-cpu is not available.
    """
    def __init__(self, vectors: np.ndarray):
        self.vectors = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + 1e-10)

    def search(self, query_vectors: np.ndarray, k: int):
        # query_vectors: shape (n_queries, dim)
        query_norm = query_vectors / (np.linalg.norm(query_vectors, axis=1, keepdims=True) + 1e-10)
        sims = query_norm @ self.vectors.T                # cosine similarity
        # For compatibility with FAISS, we return distances (1 - sim)
        indices = np.argsort(-sims, axis=1)[:, :k]
        distances = 1.0 - np.take_along_axis(sims, indices, axis=1)
        return distances.astype("float32"), indices.astype("int64")


if FAISS_AVAILABLE:
    dim = embeddings.shape[1]
    faiss_index = faiss.IndexFlatL2(dim)
    faiss_index.add(embeddings.astype("float32"))
    index = faiss_index
    print("Using FAISS IndexFlatL2.")
else:
    index = SimpleNumpyIndex(embeddings.astype("float32"))
    print("Using SimpleNumpyIndex (NumPy-based).")


Using FAISS IndexFlatL2.
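
`IndexFlatL2` reports squared Euclidean distances, while `SimpleNumpyIndex` reports `1 - cosine similarity`, so the `distance` values from the two backends are on different scales. For unit-length vectors the two are related by `||a - b||^2 = 2 - 2 * cos(a, b)`. The pure-Python sketch below checks that identity on two made-up vectors. Whether it applies to the notebook's numbers depends on the embeddings being unit-normalized, which is an assumption here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # For unit vectors the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))


# Two made-up vectors, scaled to unit length.
a = normalize([3.0, 4.0, 0.0])
b = normalize([4.0, 3.0, 0.0])

d2 = squared_l2(a, b)
cos = cosine(a, b)
print(d2, 2 - 2 * cos)                              # agree up to rounding
print("NumPy-fallback style distance:", 1 - cos)    # equals d2 / 2
```

Under that normalization assumption, a FAISS distance of 0.9793 would correspond to a cosine similarity of about 0.51, and the fallback index would report roughly half the FAISS value for the same pair.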


In [None]:
# Quick sanity checks for the built index and saved artifacts

from pathlib import Path

DATA_DIR = Path("data/index")

print("Embeddings shape:", embeddings.shape)
print("Number of chunks:", len(chunks))
print("Number of metadata entries:", len(metadata))

assert embeddings.shape[0] == len(chunks) == len(metadata), "Mismatch between embeddings, chunks, and metadata lengths."
assert (DATA_DIR / "faiss.index").exists(), "faiss.index file is missing."
assert (DATA_DIR / "chunks.pkl").exists(), "chunks.pkl file is missing."
assert (DATA_DIR / "meta.pkl").exists(), "meta.pkl file is missing."

print("✅ Sanity checks passed: index and artifacts look consistent.")

In [12]:
# 9. Semantic search helper function

def search_chunks(query: str, k: int = TOP_K) -> List[Dict]:
    """
    Embed the query, search the index, and return the top-k chunks with metadata.
    """
    query_vec = embed_model.encode([query], convert_to_numpy=True).astype("float32")
    distances, indices = index.search(query_vec, k)
    results = []
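    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        if idx < 0:
            # FAISS pads with -1 when fewer than k vectors are available;
            # without this check, metadata[-1] would silently return the last chunk.
            continue
        entry = {
            "rank": rank,
            "distance": float(dist),
            "pdf_name": metadata[idx]["pdf_name"],
            "chunk_id": metadata[idx]["chunk_id"],
            "text": chunks[idx],
        }
        results.append(entry)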
    return results


In [13]:
# 10. Notebook demo: try a few sample queries

sample_queries = [
    "What is attention in neural networks?",
    "How do transformers handle long sequences?",
    "What are common evaluation metrics in NLP?",
]

for q in sample_queries:
    print("=" * 80)
    print("QUERY:", q)
    results = search_chunks(q, k=TOP_K)
    for r in results:
        print(f"\nRank {r['rank']} | pdf={r['pdf_name']} | chunk_id={r['chunk_id']} | distance={r['distance']:.4f}")
        print("-" * 40)
        print(r["text"][:600], "...")


QUERY: What is attention in neural networks?

Rank 1 | pdf=Build_a_Large_Language_Model_(From_Scrat_v8_MEAP.pdf | chunk_id=32 | distance=0.9793
----------------------------------------
selected attention weights with dropout to reduce overfitting Stacking multiple causal attention modules into a multi-head attention module In the previous chapter, you learned how to prepare the input text for training LLMs. This involved splitting text into individual word and subword tokens, which can be encoded into vector representations, the so-called embeddings, for the LLM. In this chapter, we will now look at an integral part of the LLM architecture itself, attention mechanisms, as illustrated in Figure 3.1. 60 © Manning Publications Co. To comment go to liveBook Licensed to     <1495 ...

Rank 2 | pdf=Build_a_Large_Language_Model_(From_Scrat_v8_MEAP.pdf | chunk_id=176 | distance=1.0050
----------------------------------------
neural networks to prevent overfitting by randomly dropping units (a

## 🚀 Running the FastAPI Semantic Search Server

After building the FAISS index and saving the chunk data, you can run a real web service using FastAPI.

### **1. Activate the environment**
```bash
conda activate rag311
```

### **2. Go to the Week 4 project folder**
```bash
cd "D:/AI study/MLE_in_Gen_AI-Course/week 0/MLE_in_Gen_AI-Course/class4"
```

### **3. Start the API server**
```bash
uvicorn main:app --reload
```

If successful, you will see output like:

```text
Uvicorn running on http://127.0.0.1:8000
```

### **4. Open the API documentation**

- Swagger UI → http://127.0.0.1:8000/docs
- ReDoc → http://127.0.0.1:8000/redoc

You can run queries such as:

- What is PyTorch?
- What is attention in neural networks?
- How do transformers handle long sequences?

## 2. FastAPI Retrieval Service

Next, I expose the same retrieval logic via a small FastAPI app.

The `/search` endpoint:

- takes a query parameter `q` (the user question)
- embeds `q`
- searches the FAISS (or NumPy) index
- returns the top-k chunks as JSON

In a real project, this code would live in a separate `main.py` file and be run with:

```bash
uvicorn main:app --reload
```


In [17]:
# 11. FastAPI app definition

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Week 4 RAG Search API")


class SearchResponseChunk(BaseModel):
    rank: int
    distance: float
    pdf_name: str
    chunk_id: int
    text: str


class SearchResponse(BaseModel):
    query: str
    results: List[SearchResponseChunk]


@app.get("/search", response_model=SearchResponse)
async def search_endpoint(q: str, k: int = TOP_K):
    """
    Receive a query `q`, embed it, retrieve top-k passages, and return them as JSON.
    """
    results = search_chunks(q, k=k)
    response_chunks = [SearchResponseChunk(**r) for r in results]
    return SearchResponse(query=q, results=response_chunks)


### How to run the FastAPI app (outside the notebook)

For the actual service, I would:

1. Copy the relevant code (imports, model loading, index building, `search_chunks`, and the `FastAPI` app) into `main.py`.
2. Make sure `embeddings`, `index`, `chunks`, and `metadata` are built at import time (or loaded from disk).
3. Start the server:

```bash
uvicorn main:app --reload
```


In [18]:
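%%writefile main.py
from pathlib import Path
import pickle

import faiss
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer

# Load the artifacts written by the notebook (make sure these files exist)
DATA_DIR = Path("data/index")

with open(DATA_DIR / "chunks.pkl", "rb") as f:
    chunks = pickle.load(f)

with open(DATA_DIR / "meta.pkl", "rb") as f:
    metadata = pickle.load(f)

faiss_index = faiss.read_index(str(DATA_DIR / "faiss.index"))

# Load embedding model (must match the model used to build the index)
model = SentenceTransformer("all-MiniLM-L6-v2")

app = FastAPI(title="Week 4 RAG Search API")


@app.get("/search")
async def search(q: str, k: int = 3):
    """Return top-k matching chunks (with metadata) for query q."""
    query_vec = model.encode([q], convert_to_numpy=True).astype("float32")
    distances, indices = faiss_index.search(query_vec, k)
    results = []
    for rank, (dist, idx) in enumerate(zip(distances[0], indices[0]), start=1):
        if idx < 0:  # FAISS pads with -1 when fewer than k vectors exist
            continue
        results.append({
            "rank": rank,
            "distance": float(dist),
            "pdf_name": metadata[idx]["pdf_name"],
            "chunk_id": metadata[idx]["chunk_id"],
            "text": chunks[idx],
        })
    return {"query": q, "results": results}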


Writing main.py


### How to run the FastAPI semantic search service (from the terminal)

To expose the same search functionality as a web API, run these steps **outside the notebook**:

1. Open a terminal and activate the Week 4 environment:

   ```bash
   conda activate rag311
   ```

2. Change into the Week 4 project folder (where `main.py` lives):

   ```bash
   cd "D:/AI study/MLE_in_Gen_AI-Course/week 0/MLE_in_Gen_AI-Course/class4"
   ```

3. Start the FastAPI app with Uvicorn:

   ```bash
   uvicorn main:app --reload
   ```

   - `main:app` means: import the `app` object from `main.py`  
   - `--reload` watches for code changes and restarts the server automatically (handy during development).

4. Open your browser at:

   - `http://127.0.0.1:8000/docs`  → interactive Swagger UI  
   - `http://127.0.0.1:8000/search?q=your+question&k=3` → direct JSON response

5. When you are done testing, stop the server with **Ctrl + C** in the terminal.
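
While the server is running, the endpoint can also be called from Python. The sketch below uses only the standard library and assumes the default Uvicorn address shown in the steps above. The helper names `build_search_url` and `search` are hypothetical. `urlencode` takes care of percent-encoding characters such as spaces and `?` in the question.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:8000"   # default Uvicorn address from the steps above


def build_search_url(q: str, k: int = 3, base_url: str = BASE_URL) -> str:
    """Return the /search URL with q and k safely percent-encoded."""
    return f"{base_url}/search?{urlencode({'q': q, 'k': k})}"


def search(q: str, k: int = 3) -> dict:
    """Call the running API and decode its JSON body (requires the server to be up)."""
    with urlopen(build_search_url(q, k), timeout=10) as resp:
        return json.load(resp)


print(build_search_url("What is attention in neural networks?"))
# With the server running:
# print(search("What is attention in neural networks?"))
```

Only `build_search_url` runs without a server; the `search` call is left commented out because it fails with a connection error when nothing is listening on port 8000.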


# 3. Reflection and Notes

## ✅ What I Implemented
- A full RAG-style retrieval system over a PDF document dataset  
- PDF → text pipeline using PyMuPDF (`fitz`)  
- Token-based sliding window chunking with adjustable sizes  
- Dense vector embeddings via Sentence-Transformers (`all-MiniLM-L6-v2`)  
- Fast vector search using FAISS (`IndexFlatL2`)  
- Notebook-based semantic search demo  
- A complete FastAPI `/search` endpoint returning JSON results  

## 🔧 Potential Improvements
- Try more powerful embedding models (e.g., `all-mpnet-base-v2`)  
- Add richer metadata such as authors, year, or sections  
- Add a cross-encoder (re-ranking step) for higher retrieval accuracy  
- Integrate this retriever with an LLM (local via Ollama or via OpenAI)  
- Expand the PDF dataset to multiple papers and build a larger index  
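
For the LLM integration item, the retrieved chunks would typically be placed into a prompt ahead of the question. Below is a minimal sketch of that assembly step, assuming results shaped like the dictionaries returned by `search_chunks`. The function `build_prompt`, the prompt wording, and the sample result are hypothetical, and the call to an actual LLM is left out.

```python
from typing import Dict, List


def build_prompt(question: str, results: List[Dict], max_chars: int = 1500) -> str:
    """Format retrieved chunks as numbered context followed by the question."""
    context_blocks = []
    for r in results:
        source = f"{r['pdf_name']} (chunk {r['chunk_id']})"
        # Truncate long chunks so the prompt stays within a rough size budget.
        context_blocks.append(f"[{r['rank']}] {source}\n{r['text'][:max_chars]}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up result in the same shape as search_chunks() output.
demo_results = [
    {"rank": 1, "distance": 0.98, "pdf_name": "paper_01.pdf", "chunk_id": 32,
     "text": "Attention lets a model weigh input tokens differently."},
]
prompt = build_prompt("What is attention?", demo_results)
print(prompt)
```

Keeping the source label next to each chunk makes it possible to ask the model to cite which passage an answer came from.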


# Appendix: Running Search Programmatically

Example GET request:

```bash
curl -X GET "http://127.0.0.1:8000/search?q=What%20is%20PyTorch%3F&k=3" \
     -H "accept: application/json"
```

Sample response:

```json
{
  "query": "What is PyTorch?",
  "results": [
    {
      "rank": 1,
      "pdf_name": "...",
      "chunk_id": 32,
      "distance": 0.9793,
      "text": "PyTorch is a tensor library..."
    }
  ]
}
```
