# PDF Question Answering with Embedding Search

This notebook demonstrates how to build a simple RAG (Retrieval-Augmented Generation) system that:

1. Extracts and chunks text from PDF documents.
2. Embeds the text using a local embedding model served via MLX Server.
3. Stores the embeddings in a FAISS index for fast retrieval.
4. Answers user queries by retrieving relevant chunks and using a chat model to respond based on context.

Before running the notebook, launch the local MLX Server by executing the following command in your terminal (`--model-type lm` selects a text-only model):
```bash
mlx-server launch --model-path mlx-community/Qwen3-4B-8bit --model-type lm
```
This command starts the MLX API server locally at http://localhost:8000/v1, which exposes an OpenAI-compatible interface. It enables the specified model to be used for both embedding (vector representation of text) and response generation (chat completion).

For this illustration, we use the model `mlx-community/Qwen3-4B-8bit`, a lightweight and efficient language model that supports both tasks. You can substitute this with any other compatible model depending on your use case and hardware capability.

## Install dependencies


In [15]:
# Install required packages
%pip install -Uq numpy PyMuPDF faiss-cpu openai

Note: you may need to restart the kernel to use updated packages.


## Initialize MLX Server client


In [16]:
# Connect to the local MLX Server that serves embedding and chat models
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # This is your MLX Server endpoint
    api_key="fake-api-key"                # Dummy key, not used by local server
)
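
Before indexing anything, it can help to confirm that the client actually reaches the server. The sketch below is one possible check, not part of the original notebook. It assumes the server implements the OpenAI-style `GET /v1/models` route, which not every local server does, and `list_served_models` is a hypothetical helper name.

```python
def list_served_models(client):
    """Return the model IDs reported by an OpenAI-compatible server.

    Assumption: the server implements GET /v1/models. If it does not,
    this call raises an error even though chat and embeddings may work.
    """
    return [model.id for model in client.models.list().data]

# Usage with the client created in this notebook:
# print(list_served_models(client))
```

If the call fails with a connection error, the server is most likely not running or is listening on a different port.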

## Load and chunk PDF document

We load the PDF file and split it into smaller chunks to ensure each chunk fits within the context window of the model.

### Read PDF
Extracts text from each page of the PDF.

In [17]:
import fitz  # PyMuPDF

def read_pdf(path):
    """Extract text from each page of a PDF file."""
    doc = fitz.open(path)
    texts = []
    for page in doc:
        text = page.get_text().strip()
        if text: 
            texts.append(text)
    doc.close()
    return texts

### Chunk Text
Splits the extracted text into smaller chunks with overlap to preserve context.
We use the following default parameters:
- `chunk_size=400`: Maximum number of words in a chunk.
- `overlap=200`: Number of words shared by consecutive chunks, so that context is not cut off abruptly at chunk boundaries.

Each page is split on whitespace with `str.split()` and each chunk is rejoined with single spaces, which works well for general text.

In [18]:
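def chunk_text(texts, chunk_size=400, overlap=200):
    """Split each page's text into overlapping word-based chunks."""
    if not 0 <= overlap < chunk_size:
        # Otherwise the window would never advance and the loop would not end
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # How far the window moves each iteration
    chunks = []
    for text in texts:
        words = text.split()  # Split on any run of whitespace
        i = 0
        while i < len(words):
            chunks.append(" ".join(words[i:i + chunk_size]))
            if i + chunk_size >= len(words):
                break  # This chunk already reaches the end of the page
            i += step
    return chunks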
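
To make the window arithmetic concrete, here is a small standalone sketch that computes only the word offsets of each chunk. `chunk_spans` is a hypothetical helper introduced for illustration. It stops as soon as a window reaches the last word, and the tiny sizes are chosen so the result is easy to check by hand.

```python
def chunk_spans(n_words, chunk_size, overlap):
    """Return (start, end) word offsets for overlapping sliding windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # distance the window advances each step
    spans = []
    start = 0
    while start < n_words:
        end = min(start + chunk_size, n_words)
        spans.append((start, end))
        if end == n_words:  # this window already reaches the last word
            break
        start += stride
    return spans

print(chunk_spans(10, chunk_size=4, overlap=2))
# [(0, 4), (2, 6), (4, 8), (6, 10)]
```

Each window starts `chunk_size - overlap` words after the previous one, so a larger overlap produces more chunks for the same text and therefore more embedding calls.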

## Save embeddings and chunks to FAISS

### Embed Chunks

Uses the MLX Server to generate embeddings for the text chunks.

In [19]:
import numpy as np

# Embed chunks using the model served by MLX Server
def embed_chunks(chunks, model_name):
    """Generate embeddings for text chunks using the MLX Server model."""
    response = client.embeddings.create(input=chunks, model=model_name)
    embeddings = [np.array(item.embedding).astype('float32') for item in response.data]
    return embeddings
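
For long PDFs, sending every chunk in a single request may exceed what a local server handles comfortably. One possible variation, sketched here under assumptions, is to embed in fixed-size batches. `embed_in_batches`, `embed_fn`, and `batch_size=32` are illustrative choices, not part of the original notebook.

```python
import numpy as np

def embed_in_batches(chunks, embed_fn, batch_size=32):
    """Embed `chunks` in fixed-size batches and stack the results.

    `embed_fn` is a hypothetical callable that maps a list of strings to a
    list of equal-length float vectors, for example a thin wrapper around
    `client.embeddings.create`.
    """
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        vectors.extend(embed_fn(batch))
    return np.asarray(vectors, dtype="float32")

# Possible wrapper around the notebook's client (model_name is assumed to be defined):
# embed_fn = lambda batch: [
#     item.embedding
#     for item in client.embeddings.create(input=batch, model=model_name).data
# ]
```

The function returns a 2D `float32` array with one row per chunk, which is the layout FAISS expects.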

### Save to FAISS

Saves the embeddings in a FAISS index and the chunks in a metadata file.

In [None]:
import faiss
import os
import pickle
import numpy as np

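def normalize(vectors):
    """
    Scale each row vector to unit L2 norm.
    """
    vectors = np.asarray(vectors, dtype="float32")  # Accept a list of vectors or a 2D array
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)  # L2 norm of each row
    return vectors / np.maximum(norms, 1e-12)  # Guard against division by zero for all-zero rows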

def save_faiss_index(embeddings, chunks, index_path="db/index.faiss", meta_path="db/meta.pkl"):
    """
    Store the embeddings in a FAISS index and save the corresponding
    text chunks in a local metadata file.
    """
    # Create the parent directories of both output files if they are missing
    for path in (index_path, meta_path):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    dim = len(embeddings[0])

    # Normalize the embeddings to unit length so that the inner product
    # computed by FAISS's IndexFlatIP equals cosine similarity
    embeddings = normalize(embeddings)
    index = faiss.IndexFlatIP(dim)
    index.add(np.ascontiguousarray(embeddings, dtype="float32"))
    faiss.write_index(index, index_path)
    with open(meta_path, "wb") as f:
        pickle.dump(chunks, f)
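
The reason normalization matters is that the inner product of two unit-length vectors equals their cosine similarity. The NumPy-only sketch below checks this on two small vectors. `l2_normalize` and its `eps` guard are illustrative names, not part of the notebook.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm; eps guards against all-zero rows."""
    vectors = np.asarray(vectors, dtype="float32")
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, eps)

a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed directly from its definition
cosine = (a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product of the normalized vectors
inner = (l2_normalize(a) @ l2_normalize(b).T).item()

print(cosine, inner)  # the two values agree up to float32 rounding
```

Because the stored vectors and the query are both normalized, ranking by inner product is the same as ranking by cosine similarity.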

Combines the above steps into a single pipeline to process a PDF.

In [21]:
# Full pipeline: Read PDF → Chunk → Embed → Save
def prepare_pdf(pdf_path, model_name):
    texts = read_pdf(pdf_path)
    chunks = chunk_text(texts)
    embeddings = embed_chunks(chunks, model_name)
    save_faiss_index(embeddings, chunks)

## Query PDF using FAISS

### Load FAISS Index

In [22]:
def load_faiss_index(index_path="db/index.faiss", meta_path="db/meta.pkl"):
    """Load the FAISS index and corresponding text chunks from disk."""
    index = faiss.read_index(index_path)
    with open(meta_path, "rb") as f:
        chunks = pickle.load(f)
    return index, chunks
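
Since the metadata here is just a list of strings, JSON is a possible alternative to pickle. `pickle.load` can execute arbitrary code when given an untrusted file, whereas JSON cannot. The sketch below is an optional variation with hypothetical helper names and a hypothetical `meta.json` filename.

```python
import json
import os
import tempfile

def save_chunks_json(chunks, path):
    """Write the list of text chunks as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)

def load_chunks_json(path):
    """Read the list of text chunks back from JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Round trip in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meta.json")
    save_chunks_json(["first chunk", "second chunk"], path)
    restored = load_chunks_json(path)

print(restored)  # ['first chunk', 'second chunk']
```

The position of each chunk in the list must still match its row in the FAISS index, exactly as with the pickle file.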

### Embed Query
Embeds the user's query using the same model.

In [23]:
def embed_query(query, model_name):
    """Convert a query string into an embedding vector."""
    embedding = client.embeddings.create(input=[query], model=model_name).data[0].embedding
    return np.array(embedding).astype('float32')

### Retrieve Chunks
Retrieves the top-k most relevant chunks based on the query embedding.

In [24]:
def retrieve_chunks(query, index, chunks, model_name, top_k=5):
    """Retrieve the top-k most relevant chunks from the FAISS index."""
    query_vector = embed_query(query, model_name).reshape(1, -1)
    query_vector = normalize(query_vector)  # Normalize the query vector
    # With IndexFlatIP on unit vectors, the returned scores are cosine similarities
    scores, indices = index.search(query_vector, top_k)
    # FAISS pads the result with -1 when the index holds fewer than top_k vectors
    relevant_chunks = [chunks[i] for i in indices[0] if i >= 0]
    return relevant_chunks
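
`IndexFlatIP` performs an exact, brute-force search: it scores the query against every stored vector and returns the highest inner products first. The NumPy sketch below mimics that behavior on three toy vectors to show what the scores and indices mean. `top_k_inner_product` is a hypothetical helper, not a FAISS function.

```python
import numpy as np

def top_k_inner_product(query, vectors, k):
    """Exact top-k search by inner product, highest score first."""
    scores = vectors @ query             # one score per stored vector
    k = min(k, len(scores))              # never return more than we have
    order = np.argsort(-scores)[:k]      # indices sorted by descending score
    return scores[order], order

vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype="float32")
query = np.array([0.0, 1.0], dtype="float32")

scores, ids = top_k_inner_product(query, vectors, k=2)
print(ids.tolist())  # [1, 2]
```

Vector 1 is identical to the query and ranks first. Vector 2 points in a similar direction and ranks second. Vector 0 is orthogonal to the query and is left out.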

### Generate Answer with Context

In [25]:
def answer_with_context(query, retrieved_chunks, model_name):
    """Generate a response to the query using retrieved chunks as context."""
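    context = "\n\n".join(retrieved_chunks)  # Blank line between chunks
    # Build the prompt line by line so that code indentation does not leak into it
    prompt = (
        "You are a helpful assistant. Use the context below to answer the question.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
        "Answer:"
    )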
    
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
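
A possible refinement is to number the retrieved chunks and move the instructions into a system message, so the model can tell the context apart from the question. The sketch below only builds the message list. `build_messages` is a hypothetical helper, and it assumes the served model's chat template accepts a `system` role.

```python
def build_messages(query, retrieved_chunks):
    """Build chat messages that keep instructions, context, and question separate."""
    # Number each chunk so an answer can refer back to its source
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    system = (
        "You are a helpful assistant. Answer using only the context provided. "
        "If the context does not contain the answer, say so."
    )
    user = f"Context:\n{context}\n\nQuestion: {query}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = build_messages("What is due?", ["A report.", "A patch file."])
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of `client.chat.completions.create`.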

Combines the query steps into a single function.

In [26]:
# Full pipeline: Query → Embed → Retrieve → Answer
def query_pdf(query, model_name="mlx-community/Qwen3-4B-8bit"):
    index, chunks = load_faiss_index()
    top_chunks = retrieve_chunks(query, index, chunks, model_name)
    return answer_with_context(query, top_chunks, model_name)

## Example Usage

Index the text chunks from the PDF into FAISS using the Qwen3-4B-8bit model:

In [27]:
prepare_pdf("./pdfs/lab03.pdf", "mlx-community/Qwen3-4B-8bit")

Sample query:

In [28]:
# Ask a question related to the content of the PDF
query = "What submissions do I need to submit in this lab?"
print("Query: ", query)
response = query_pdf(query, model_name="mlx-community/Qwen3-4B-8bit")
print("Response: ", response)

Query:  What submissions do I need to submit in this lab?
Response:  For this lab, you need to submit the following:

1. **StudentID1_StudentID2_Report.pdf**:  
   - A short report explaining your solution, any problems encountered, and any approaches you tried that didn't work. This report should not include your source code but should clarify your solution and any issues you faced.

2. **<StudentID1_StudentID2>.patch**:  
   - A git diff file showing the changes you made to the xv6 codebase. You can generate this by running the command:  
     ```bash
     $ git diff > <StudentID1_StudentID2>.patch
     ```

3. **Zip file of xv6**:  
   - A zip file containing the modified xv6 codebase. The code should be in a clean state (i.e., after running `make clean`). The filename should follow the format:  
     ```bash
     <StudentID1_StudentID2>.zip
     ```  
     For example, if the students' IDs are 2312001 and 2312002, the filename would be:  
     ```bash
     2312001_2312002.zip
     ```

In [29]:
# Ask a question related to the content of the PDF
query = "What is the hint for Lab 4.2 – Speed up system calls?"
print("Query: ", query)
response = query_pdf(query, model_name="mlx-community/Qwen3-4B-8bit")
print("Response: ", response)

Query:  What is the hint for Lab 4.2 – Speed up system calls?
Response:  The hint for Lab 4.2 – Speed up system calls is to choose permission bits that allow userspace to only read the page. This ensures that the shared page between userspace and the kernel can be accessed by userspace for reading the PID, but not modified, which is necessary for the optimization of the getpid() system call.
