<a href="https://colab.research.google.com/github/roya90/DocuQuery/blob/main/Building_a_PDF_Question_Answering_System_with_Retrieval_Augmented_Generation_(RAG)_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [3]:
# Install required libraries
!pip install -q PyMuPDF  # PyMuPDF (FAISS dependencies are usually pre-installed in Colab)
!pip install -q sentence-transformers faiss-cpu google-generativeai tqdm python-dotenv
!pip install -q --upgrade google-cloud-aiplatform

# Imports
import os
import sys
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
from tqdm.notebook import tqdm  # Use notebook version for Colab
import spacy
import re
from transformers import AutoTokenizer
from google import genai
from google.genai import types
from google.colab import files, output  # For file uploads and output control
import vertexai
from dotenv import load_dotenv

print("Libraries installed and imported successfully!")

Libraries installed and imported successfully!


# Download Spacy Model

In [4]:
# Download the spacy model
!python -m spacy download en_core_web_sm

print("spaCy model downloaded successfully!")

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m12.7 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
spaCy model downloaded successfully!


# Authenticate to Google Cloud Generative AI

In [5]:
!pip install --upgrade google-genai
!gcloud auth application-default login

Go to the following link in your browser, and complete the sign-in prompts:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=764086051850-6qr4p6gpi6hn506pt8ejuq83di341hur.apps.googleusercontent.com&redirect_uri=https%3A%2F%2Fsdk.cloud.google.com%2Fapplicationdefaultauthcode.html&scope=openid+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fuserinfo.email+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fsqlservice.login&state=oaN2xRQaWVTOYpGBZr5aFVsI087Wpj&prompt=consent&token_usage=remote&access_type=offline&code_challenge=4UKi2j0r2EE6R22ncspjClG6CE_fzo8agKuzGKgHIVc&code_challenge_method=S256

Once finished, enter the verification code provided in your browser: 4/0AQSTgQFmHrgo_OXPyvaflZO4dt43lMfh-yKThYytKaKIH_VpKdSJHzedrjHTlbrrMvBbmg

Credentials saved to file: [/content/.config/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).
Ca

# Define Utility Functions (Text Extraction, Chunking, Embedding)

In [6]:
# --- PDF Text Extraction ---
def extract_text_from_pdf(pdf_path):
    try:
        pdf_path = Path(pdf_path)
        if not pdf_path.is_file() or not pdf_path.suffix.lower() == '.pdf':
            raise ValueError("The provided file is not a valid PDF.")

        text = ""
        with fitz.open(pdf_path) as pdf_document:
            for page_num in range(len(pdf_document)):
                text += pdf_document[page_num].get_text()
        return text

    except FileNotFoundError:
        print("The specified PDF file was not found.")
        return None
    except fitz.FileDataError:
        print("The PDF file is corrupted or unreadable.")
        return None
    except Exception as e:
        print(f"An unexpected error occurred: {e}", exc_info=True)
        return None


# --- Text Chunking ---

try:
    SPACY_NLP = spacy.load("en_core_web_sm")
    TOKENIZER = AutoTokenizer.from_pretrained("bert-base-uncased")
except Exception as e:
    raise RuntimeError(f"Failed to load models: {e}")

def is_meaningful(sentence, threshold=5):
    sentence = sentence.strip()
    if len(sentence) < threshold:
        return False
    if re.fullmatch(r"[\W\d_]+", sentence):
        return False
    return True

def validate_text_input(text, max_length=1_000_000):
    if not isinstance(text, str):
        raise ValueError("Input text must be a string.")
    if len(text) > max_length:
        raise ValueError("Input text is too large to process.")
    return text.strip()

def smart_chunk_spacy_by_paragraph(text):
    text = validate_text_input(text)
    paragraphs = [para.strip() for para in text.split("\n") if is_meaningful(para)]
    return paragraphs

def smart_chunk_spacy(text):
    text = validate_text_input(text)
    doc = SPACY_NLP(text)
    sentences = [sent.text for sent in doc.sents if is_meaningful(sent.text)]
    return sentences

def smart_chunk_spacy_advanced(text, min_chunk_length=50, max_chunk_length=500):
    text = validate_text_input(text)
    raw_paragraphs = re.sub(r"\n{2,}", "\n\n", text).split("\n\n")
    refined_paragraphs = []
    for paragraph in raw_paragraphs:
        if len(paragraph.strip()) < min_chunk_length:
            continue
        doc = SPACY_NLP(paragraph)
        current_chunk = []
        current_length = 0
        for sent in doc.sents:
            sent_text = sent.text.strip()
            if current_length + len(sent_text) > max_chunk_length:
                refined_paragraphs.append(" ".join(current_chunk).strip())
                current_chunk = []
                current_length = 0
            current_chunk.append(sent_text)
            current_length += len(sent_text)
        if current_chunk:
            refined_paragraphs.append(" ".join(current_chunk).strip())
    return refined_paragraphs

def smart_chunk_transformers(text, max_tokens=128):
    text = validate_text_input(text)
    tokens = TOKENIZER(text, truncation=False, return_tensors="pt")
    chunks = [text[i:i+max_tokens] for i in range(0, len(tokens['input_ids'][0]), max_tokens)]
    return chunks

# --- Vector Database (FAISS) Utilities ---
from pathlib import Path
# Load the embedding model globally
try:
    MODEL = SentenceTransformer("all-MiniLM-L6-v2")
except Exception as e:
    raise RuntimeError(f"Failed to load the embedding model: {e}")

def validate_text_chunks(text):
    if isinstance(text, str):
        text = [text]
    if not isinstance(text, list) or not all(isinstance(t, str) for t in text):
        raise ValueError("Input must be a string or a list of strings.")
    return [t.strip() for t in text if t.strip()]

def generate_embeddings(chunks, model_name="all-MiniLM-L6-v2"):
    chunks = validate_text_chunks(chunks)
    embeddings = MODEL.encode(chunks, convert_to_tensor=False, show_progress_bar=True) # Add progress bar
    embeddings = np.array(embeddings)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normalized_embeddings = embeddings / (norms + 1e-10)  # Add small value for numerical stability
    return normalized_embeddings

def store_in_faiss(embeddings, db_file="vector_db_cosine.index"):
    try:
        dimension = embeddings.shape[1]
        index = faiss.IndexFlatIP(dimension)  # Use inner product (for cosine similarity with normalized vectors)
        index.add(embeddings)
        faiss.write_index(index, str(db_file)) #cast to string for cross-platform
        print(f"FAISS index saved to {db_file}")
        return index
    except Exception as e:
        raise RuntimeError(f"Failed to store FAISS index: {e}")

def load_faiss_index(db_file):
    try:
        db_file = Path(db_file).resolve()
        return faiss.read_index(str(db_file))
    except Exception as e:
        raise RuntimeError(f"Failed to load FAISS index: {e}")

def query_faiss_index(query_text, vector_index, chunks, model_name="all-MiniLM-L6-v2", top_k=2):
    try:
        query_text = validate_text_chunks(query_text)
        if not query_text:
            raise ValueError("Query text cannot be empty.")
        query_embedding = MODEL.encode(query_text, convert_to_tensor=False)
        query_embedding = np.array(query_embedding)
        query_embedding = query_embedding / (np.linalg.norm(query_embedding, axis=1, keepdims=True) + 1e-10)
        distances, indices = vector_index.search(query_embedding, top_k)
        results = [(chunks[idx], distances[0][i], idx) for i, idx in enumerate(indices[0])]
        return results
    except Exception as e:
        raise RuntimeError(f"Failed to query FAISS index: {e}")

print("Utility functions defined.")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Utility functions defined.


# Define Gemini Query Function

In [7]:
from google.colab import userdata
def query_flash(question = "What is APR?", context_chunks = [], model_name="gemini-pro", top_k=3, test = False):
    """
    Queries a Large Language Model (LLM) with a question and relevant context chunks.

    Args:
        question (str): The question to be answered.
        context_chunks (list): A list of tuples, each containing (text_chunk, similarity_score, index).
        model_name (str): The name of the Gemini model to use. Defaults to "gemini-pro".
        top_k (int): The number of top context chunks to use. Defaults to 3.

    Returns:
        dict: A dictionary containing the generated answer and the relevant context chunks.
              Returns an error message as a string in case of exceptions.
    """
    try:
        # Input validation
        if not isinstance(question, str) or not question.strip():
            raise ValueError("The question must be a non-empty string.")
        if not isinstance(context_chunks, list) or not all(
            isinstance(chunk, (list, tuple)) for chunk in context_chunks
        ):
            raise ValueError("Context chunks must be a list of tuples or lists.")

        client = genai.Client(
            vertexai=True,
            project=userdata.get('project_id'),
            location=userdata.get('location'),

        )

        # Construct the context
        context = " ".join([context_chunk[0] for context_chunk in context_chunks])

        # Prompt Engineering
        prompt = (
            f"You are a legal assistant specializing in contracts. "
            f"Answer the question based on the following context, and cite the sources explicitly. "
            f"Do not include any information not present in the provided context.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:"
        )

        model = "gemini-2.0-flash-lite-001"
        contents = [prompt
        ]
        generate_content_config = types.GenerateContentConfig(
            temperature = 1,
            top_p = 0.95,
            max_output_tokens = 8192,
            response_modalities = ["TEXT"],
            safety_settings = [types.SafetySetting(
            category="HARM_CATEGORY_HATE_SPEECH",
            threshold="OFF"
            ),types.SafetySetting(
            category="HARM_CATEGORY_DANGEROUS_CONTENT",
            threshold="OFF"
            ),types.SafetySetting(
            category="HARM_CATEGORY_SEXUALLY_EXPLICIT",
            threshold="OFF"
            ),types.SafetySetting(
            category="HARM_CATEGORY_HARASSMENT",
            threshold="OFF"
            )],
        )

        if test:
            res = client.models.generate_content_stream(model = model,
                    contents = ["Write a short story about a kitten"],
                    config = generate_content_config,)
            for chunk in res:
                print(chunk.text)
            return

        # Generate content (streaming)
        response_stream = client.models.generate_content_stream(model = model,
                    contents = contents,
                    config = generate_content_config,)

        # Collect the streamed response
        generated_answer = ""
        for chunk in response_stream:
            generated_answer += chunk.text

        relevant_chunks = context_chunks[:top_k]  # Top k chunks for citation
        return {"answer": generated_answer.strip(), "relevant_context": relevant_chunks}

    except Exception as e:
        return f"An unexpected error occurred: {e}"


print("Gemini query function defined.")
query_flash(test = True)

Gemini query function defined.




Pip
kin was a wisp of smoke and mischief, a tiny whirlwind of grey fluff
. He was barely bigger than a teacup, with enormous, emerald eyes that reflected
 the world in miniature. He lived in the attic of a dusty old house, his kingdom a land of forgotten trunks, cobweb castles, and sunlight-dappled dust
 motes.

His days were a symphony of exploration. He would stalk the frayed edges of the carpet, transforming into a miniature panther, his tiny body low
 to the ground. He would bat at the dancing dust motes with delicate paws, imagining them to be mischievous fairies. He would climb the towering legs of the old chaise lounge, his claws clicking against the wood, scaling Everest in his tiny world
.

His greatest treasure was a sunbeam that streamed through a crack in the boarded-up window. In that golden rectangle, he would bask, purring like a tiny engine, his fur turning to a shimmering silver. Sometimes, he’d
 catch a stray moth, a furry snack that brought a surge of triumph and a p

# Main Question Answering Function

In [8]:
def main(pdf_path, query_text, relevance_threshold=0.3):
    """
    Main function to perform question answering on a PDF document.

    Args:
        pdf_path (str): Path to the PDF file.  (Will be a temporary path in Colab)
        query_text (str): The question to ask.
        relevance_threshold (float): Minimum similarity score for a chunk.
    """

    print("\nExtracting text from the PDF...")
    try:
        extracted_text = extract_text_from_pdf(pdf_path)
        print("Text extracted successfully.")
        if not extracted_text:
            print("Error: Failed to extract text. Document might be empty/unreadable.")
            return  # Exit if extraction fails

        print("\nChunking the extracted text...")
        chunks = smart_chunk_spacy_advanced(extracted_text)  # Use advanced chunker
        if not chunks:
            print("Error: Failed to create text chunks.")
            return
        print(f"Text chunked into {len(chunks)} chunks.")

        print("\nGenerating embeddings for the chunks...")
        embeddings = generate_embeddings(chunks)
        print("Embeddings generated.")

        print("\nStoring embeddings in FAISS index...")
        index = store_in_faiss(embeddings)  # Use default filename
        print("FAISS index created and stored.")

        # No need to reload the index immediately, we just created it!

        print(f"\nQuerying FAISS index with: '{query_text}'...")
        results = query_faiss_index(query_text, index, chunks, top_k=5)

        # Filter results by relevance threshold
        context_chunks = [(text, dis, idx) for text, dis, idx in results if dis > relevance_threshold]

        print("\nResults from FAISS index:")
        for idx, (text, score, doc_id) in enumerate(results):
          if score > relevance_threshold:
            print(f"{idx}. Score: {score:.4f}, Document ID: {doc_id}\n{text[:200]}...\n")
        print(f"Found {len(context_chunks)} relevant context chunks.")


        print("\nQuerying the Gemini model for an answer...")
        answer = query_flash(query_text, context_chunks)  # Use Gemini query
        if isinstance(answer, str) and "Error" in answer: #check for errors from query_flash
          print(f"Error querying Gemini: {answer}")
          return

        print("\nGenerated Answer:")
        print(answer["answer"])
        print("\nCited Context:")
        for text, _, idx in answer["relevant_context"]:
            print(f"\tSource {idx}: {text}")

    except Exception as e:
        print(f"An unexpected error occurred in main(): {e}")



print("Main function defined.")

Main function defined.


# Get API Key, Project ID, and Location, Upload PDF, and Run!

In [9]:


# File upload
print("Upload your PDF file:")
uploaded = files.upload()
if not uploaded:
    print("No file uploaded.  Exiting.")
    sys.exit(1)  # Exit if no file

pdf_filename = list(uploaded.keys())[0]  # Get the filename
print(f"Uploaded file: {pdf_filename}")



Upload your PDF file:


Saving ExampleCo - NDA - John Appleseed.pdf to ExampleCo - NDA - John Appleseed (4).pdf
Uploaded file: ExampleCo - NDA - John Appleseed (4).pdf


In [10]:
# Get the question
query = "what state is this contract binding at "  #@param {type:"string"}

# Run the main function
main(pdf_filename, query)
print("Done!")


Extracting text from the PDF...
Text extracted successfully.

Chunking the extracted text...
Text chunked into 25 chunks.

Generating embeddings for the chunks...


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Embeddings generated.

Storing embeddings in FAISS index...
FAISS index saved to vector_db_cosine.index
FAISS index created and stored.

Querying FAISS index with: 'what state is this contract binding at '...

Results from FAISS index:
0. Score: 0.3535, Document ID: 18
The Recipient shall hold harmless and indemnify the Company, as well as the
shareholders, officers, directors, employees, agents and representatives of the Company, from
3
- -
4853-1848-6325.v1
and ag...

1. Score: 0.3384, Document ID: 22
The covenants and agreements set forth in this Agreement are each deemed
separate and independent, and if any such covenant or agreement is determined by any court of
competent jurisdiction to be inva...

2. Score: 0.3354, Document ID: 15
The Company makes no warranty whatsoever relating to the
Confidential Information and the use to be made thereof by the Recipient, and the Company
disclaims all implied warranties. 12....

3. Score: 0.3174, Document ID: 0
...

4. Score: 0.3083, Documen




Generated Answer:
I am sorry, but the provided text does not specify what state this contract is binding at.

Cited Context:
	Source 18: The Recipient shall hold harmless and indemnify the Company, as well as the
shareholders, officers, directors, employees, agents and representatives of the Company, from
3
- -
4853-1848-6325.v1
and against any and all claims, judgments, obligations, costs, awards, expenses (including,
without limitation, reasonable attorneys’ fees and costs) and liabilities of every kind arising from
any use made by the Recipient of the Confidential Information.
15.
	Source 22: The covenants and agreements set forth in this Agreement are each deemed
separate and independent, and if any such covenant or agreement is determined by any court of
competent jurisdiction to be invalid or unenforceable for any reason, including, without
limitation, by reason of such covenant or agreement extending for too great a period of time or
over too great a geographical area, or by re