# 📚 Document Question-Answering System with Local AI

This notebook creates a simple, powerful system that can read your documents and answer questions about them using:
- **Haystack** for processing documents and finding relevant information
- **gemma3:1b** (or any local AI model via Ollama) for generating human-like answers
- **BM25 keyword search** for finding relevant content (no complex embeddings needed!)

Perfect for working with PDFs, Word documents, and text files on your local computer.

In [55]:
# Install the tools we need for our RAG system
import subprocess
import sys

required_tools = [
    "haystack-ai",      # Main RAG framework
    "ollama-haystack",  # Connect to local Ollama models
    "PyPDF2",          # Read PDF files
    "python-docx"      # Read Word documents
]

print("Installing required tools (this might take a minute)...")
for tool_name in required_tools:
    try:
        subprocess.check_call([sys.executable, "-m", "pip", "install", tool_name, "-q"])
        print(f"✅ {tool_name}")
    except:
        print(f"❌ {tool_name} - already installed or error")

print("\n🎉 All tools installed successfully!")

Installing required tools (this might take a minute)...
✅ haystack-ai
✅ haystack-ai
✅ ollama-haystack
✅ ollama-haystack
✅ PyPDF2
✅ PyPDF2
✅ python-docx

🎉 All tools installed successfully!
✅ python-docx

🎉 All tools installed successfully!


In [56]:
# Import the libraries we need for our document Q&A system
from pathlib import Path
from typing import List, Dict

# Haystack components - these handle documents and generate answers
from haystack import Document, Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.generators.ollama import OllamaGenerator

# Tools for reading different types of files
import PyPDF2  # For PDF files
import docx    # For Word documents

print("✅ All libraries loaded and ready to use!")

✅ All libraries loaded and ready to use!


In [61]:
# Create our Document Question-Answering System

class DocumentQASystem:
    """
    A smart system that reads your documents and answers questions about them.
    Uses keyword search (no complex AI embeddings) and your local AI model.
    """
    
    def __init__(self, ai_model_name="gemma3:1b"):
        print(f"🚀 Setting up Document Q&A system with {ai_model_name}...")
        
        # Where we store all the document pieces
        self.document_storage = InMemoryDocumentStore()
        
        # The components that do the actual work
        self.document_finder = InMemoryBM25Retriever(document_store=self.document_storage)
        self.answer_generator = OllamaGenerator(model=ai_model_name)
        
        # Set up the workflows for processing documents and answering questions
        self._setup_workflows()
        print("✅ Document Q&A system is ready to use!")
    
    def _setup_workflows(self):
        # Workflow for adding documents to our storage
        self.document_adding_workflow = Pipeline()
        self.document_adding_workflow.add_component("document_saver", DocumentWriter(document_store=self.document_storage))
        
        # Workflow for answering questions
        answer_template = """Based on the provided information, answer the question clearly and helpfully.

Information from documents:
{% for document in documents %}
{{ document.content }}

{% endfor %}

Question: {{ question }}

Please provide a helpful answer based on the information above:"""
        
        self.question_answering_workflow = Pipeline()
        self.question_answering_workflow.add_component("document_finder", self.document_finder)
        self.question_answering_workflow.add_component("answer_builder", PromptBuilder(template=answer_template))
        self.question_answering_workflow.add_component("answer_generator", self.answer_generator)
        
        # Connect the workflow steps
        self.question_answering_workflow.connect("document_finder", "answer_builder.documents")
        self.question_answering_workflow.connect("answer_builder", "answer_generator")
    
    def _clean_messy_text(self, raw_text):
        """Clean up text from PDFs that might have weird formatting"""
        import re
        
        # Fix spacing issues and normalize whitespace
        clean_text = re.sub(r'\s+', ' ', raw_text)
        
        # Remove weird characters that sometimes come from PDF conversion
        clean_text = re.sub(r'[^\w\s\.\,\!\?\;\:\-\(\)]', ' ', clean_text)
        
        # Only keep lines that have actual content
        text_lines = clean_text.split('\n')
        meaningful_lines = []
        for text_line in text_lines:
            if len(text_line.strip()) > 10:  # Skip very short lines (usually junk)
                meaningful_lines.append(text_line.strip())
        
        return ' '.join(meaningful_lines)
    
    def _break_into_smaller_pieces(self, long_text, piece_size=500, overlap_words=50):
        """Break long text into smaller, overlapping pieces for better searching"""
        all_words = long_text.split()
        text_pieces = []
        
        for word_index in range(0, len(all_words), piece_size - overlap_words):
            piece_words = all_words[word_index:word_index + piece_size]
            text_piece = ' '.join(piece_words)
            
            if len(text_piece.strip()) > 100:  # Only keep pieces with real content
                text_pieces.append(text_piece)
        
        return text_pieces
    
    def add_text_content(self, text_content, source_information=None):
        """Add some text content to our knowledge base"""
        if not text_content or len(text_content.strip()) < 50:
            print("⚠️ Text is too short, skipping...")
            return
        
        # Clean up the text
        cleaned_text = self._clean_messy_text(text_content)
        
        # Break it into smaller pieces for better searching
        text_pieces = self._break_into_smaller_pieces(cleaned_text)
        
        print(f"📝 Processing text: {len(text_content)} characters → {len(text_pieces)} searchable pieces")
        
        # Add each piece as a document
        document_pieces = []
        for piece_number, text_piece in enumerate(text_pieces):
            piece_information = (source_information or {}).copy()
            piece_information['piece_number'] = piece_number + 1
            piece_information['total_pieces'] = len(text_pieces)
            
            document_piece = Document(content=text_piece, meta=piece_information)
            document_pieces.append(document_piece)
        
        if document_pieces:
            self.document_adding_workflow.run({"document_saver": {"documents": document_pieces}})
            print(f"✅ Added {len(document_pieces)} pieces to knowledge base")
    
    def add_document_file(self, file_location):
        """Add a file (PDF, Word doc, or text file) to our knowledge base"""
        file_path = Path(file_location)
        if not file_path.exists():
            print(f"❌ Can't find file: {file_location}")
            return
        
        print(f"📄 Reading {file_path.name}...")
        
        try:
            file_content = ""
            if file_path.suffix.lower() == '.pdf':
                file_content = self._extract_text_from_pdf(file_path)
            elif file_path.suffix.lower() == '.txt':
                file_content = file_path.read_text(encoding='utf-8', errors='ignore')
            elif file_path.suffix.lower() == '.docx':
                file_content = self._extract_text_from_word_doc(file_path)
            else:
                print(f"❌ Don't know how to read {file_path.suffix} files")
                return
            
            if file_content and len(file_content.strip()) > 100:
                self.add_text_content(file_content, {"source": str(file_path), "filename": file_path.name})
            else:
                print(f"⚠️ Couldn't extract meaningful content from {file_path.name}")
        
        except Exception as error:
            print(f"❌ Error reading {file_path.name}: {error}")
    
    def _extract_text_from_pdf(self, pdf_file_path):
        """Extract text from a PDF file"""
        extracted_content = ""
        try:
            with open(pdf_file_path, 'rb') as pdf_file:
                pdf_reader = PyPDF2.PdfReader(pdf_file)
                for page_number, pdf_page in enumerate(pdf_reader.pages):
                    page_content = pdf_page.extract_text()
                    if page_content and len(page_content.strip()) > 50:
                        extracted_content += f"\\n[Page {page_number + 1}]\\n{page_content}\\n"
        except Exception as pdf_error:
            print(f"PDF reading error: {pdf_error}")
        return extracted_content
    
    def _extract_text_from_word_doc(self, word_file_path):
        """Extract text from a Word document"""
        word_content = ""
        try:
            word_document = docx.Document(word_file_path)
            for paragraph in word_document.paragraphs:
                if paragraph.text and len(paragraph.text.strip()) > 10:
                    word_content += paragraph.text + "\\n"
        except Exception as word_error:
            print(f"Word document reading error: {word_error}")
        return word_content
    
    def answer_question(self, user_question, max_results_to_search=3):
        """Ask a question and get an answer based on our documents"""
        total_document_pieces = self.document_storage.count_documents()
        if total_document_pieces == 0:
            return "❌ I don't have any documents to search through yet. Please add some files first!"
        
        try:
            print(f"🔍 Searching through {total_document_pieces} document pieces...")
            
            search_result = self.question_answering_workflow.run({
                "document_finder": {"query": user_question, "top_k": max_results_to_search},
                "answer_builder": {"question": user_question}
            })
            
            generated_answer = search_result["answer_generator"]["replies"][0]
            return generated_answer
            
        except Exception as processing_error:
            return f"❌ Something went wrong: {processing_error}"
    
    def show_knowledge_base_info(self):
        """Show info about what documents we have in our knowledge base"""
        document_count = self.document_storage.count_documents()
        print(f"📊 Knowledge base contains {document_count} document pieces")
        return document_count

# Create our Document Q&A system - ready to use!
print("🔧 Creating your Document Q&A system...")
qa_system = DocumentQASystem(ai_model_name="gemma3:1b")

🔧 Creating your Document Q&A system...
🚀 Setting up Document Q&A system with gemma3:1b...


PromptBuilder has 2 prompt variables, but `required_variables` is not set. By default, all prompt variables are treated as optional, which may lead to unintended behavior in multi-branch pipelines. To avoid unexpected execution, ensure that variables intended to be required are explicitly set in `required_variables`.


✅ Document Q&A system is ready to use!


In [69]:
# Easy-to-use functions for your Document Q&A system

def ask_about_documents(user_question):
    """Ask any question about the documents you've added"""
    print(f"❓ {user_question}")
    generated_answer = qa_system.answer_question(user_question)
    
    # Keep answers readable on screen
    if len(generated_answer) > 500:
        display_answer = generated_answer[:500] + "..."
    else:
        display_answer = generated_answer
    
    print(f"🤖 {display_answer}")
    print("─" * 60)
    return generated_answer

def add_document_to_system(file_location):
    """Add a document (PDF, Word, or text file) to your knowledge base"""
    qa_system.add_document_file(file_location)

def check_knowledge_base():
    """See what documents are in your knowledge base"""
    return qa_system.show_knowledge_base_info()

# Show current status of the system
print("📋 Current Knowledge Base Status:")
check_knowledge_base()

print("\n💡 How to use your Document Q&A system:")
print("• add_document_to_system('path/to/file.pdf')     - Add a document")
print("• ask_about_documents('What is this about?')     - Ask questions")
print("• check_knowledge_base()                         - See what you have")

print("\n🧪 Quick test:")
ask_about_documents("What documents do I currently have in my knowledge base?")

📋 Current Knowledge Base Status:
📊 Knowledge base contains 23 document pieces

💡 How to use your Document Q&A system:
• add_document_to_system('path/to/file.pdf')     - Add a document
• ask_about_documents('What is this about?')     - Ask questions
• check_knowledge_base()                         - See what you have

🧪 Quick test:
❓ What documents do I currently have in my knowledge base?
🔍 Searching through 23 document pieces...
🤖 Based on the provided information, here’s a breakdown of the documents you currently have in your knowledge base:

*   **Page 1:** Vishal Rajesh Kushwaha
*   **Page 13:** A service of Via medici online www.thieme.de
*   **Page 14:** Physics exam spring 2001
*   **Page 15:** A service of Via medici online www.thieme.de
*   **Page 16:** A service of Via medici online www.thieme.de
*   **Page 17:** A service of Via medici online www.thieme.de
*   **Page 18:** A service of Via medici online www.thie...
────────────────────────────────────────────────────────────

'Based on the provided information, here’s a breakdown of the documents you currently have in your knowledge base:\n\n*   **Page 1:** Vishal Rajesh Kushwaha\n*   **Page 13:** A service of Via medici online www.thieme.de\n*   **Page 14:** Physics exam spring 2001\n*   **Page 15:** A service of Via medici online www.thieme.de\n*   **Page 16:** A service of Via medici online www.thieme.de\n*   **Page 17:** A service of Via medici online www.thieme.de\n*   **Page 18:** A service of Via medici online www.thieme.de\n*   **Page 23:** A service of Via medici online www.thieme.de\n*   **Page 24:** A service of Via medici online www.thieme.de\n*   **Page 25:** A service of Via medici online www.thieme.de\n*   **Page 26:** A service of Via medici online www.thieme.de\n*   **Page 27:** A service of Via medici online www.thieme.de\n\nDo you have any other questions?'

In [64]:
# Example: Add your documents and test the system

# Add your PDF files to the knowledge base
add_document_to_system(r"c:\Users\visha\Downloads\approval docx\rag\aerz_vorpr_f2001_2a (1).pdf")
add_document_to_system(r"c:\Users\visha\Downloads\approval docx\rag\cover letter.pdf")

# Check what documents we now have
check_knowledge_base()

# Ask some questions about your documents
ask_about_documents("What is the main topic of these documents?")
ask_about_documents("Can you summarize the key points from these documents?")
ask_about_documents("What type of research or content is discussed?")

📄 Reading aerz_vorpr_f2001_2a (1).pdf...
📝 Processing text: 70274 characters → 22 searchable pieces
✅ Added 22 pieces to knowledge base
📄 Reading cover letter.pdf...
📝 Processing text: 70274 characters → 22 searchable pieces
✅ Added 22 pieces to knowledge base
📄 Reading cover letter.pdf...
📝 Processing text: 2041 characters → 1 searchable pieces
✅ Added 1 pieces to knowledge base
📊 Knowledge base contains 23 document pieces
❓ What is the main topic of these documents?
🔍 Searching through 23 document pieces...
📝 Processing text: 2041 characters → 1 searchable pieces
✅ Added 1 pieces to knowledge base
📊 Knowledge base contains 23 document pieces
❓ What is the main topic of these documents?
🔍 Searching through 23 document pieces...
🤖 Based on the provided documents, the main topic is **Health-damaging behaviors**.

Here’s a breakdown of how the documents relate to this topic:

*   **Early focus on Modeling:** The documents consistently discuss the impact of behavioral patterns, specifical

"The text describes a research study focusing on nursing staff's personality traits and the impact of a service of Via medici online.\n\nHere’s a breakdown of the key aspects:\n\n*   **The Study's Focus:** It’s investigating the reasons behind personality differences among nursing staff, aiming to clarify the reasons.\n*   **The Investigation Method:** The hospital management commissioned an external social science investigation to gather information.\n*   **The Type of Research:** It’s a **personality assessment study** – specifically, a study using an **Inventory and resignation rates among nursing staff** to identify potential causes.\n*   **The Content Involved:** The information is related to psychological conditions, such as potential personality traits, which may influence a nurse's behavior and possibly impact their feelings and engagement.\n\nEssentially, the text is about understanding personality factors within the nursing profession, prompting a social science investigation

In [None]:
# Your turn! Add your own documents and ask your own questions

# Add your documents (change these file paths to your actual files)
# add_document_to_system("path/to/your/document.pdf")
# add_document_to_system("path/to/another/file.docx")
# add_document_to_system("path/to/text/file.txt")

# Ask your own questions about your documents
# ask_about_documents("What is this document about?")
# ask_about_documents("What are the main conclusions?")
# ask_about_documents("Can you explain the methodology used?")
# ask_about_documents("What are the key findings?")
# ask_about_documents("What is this document about?")
# ask_about_documents("What are the main conclusions?")
# ask_about_documents("Can you explain the methodology?")

In [68]:
ask_about_documents("What is the main topic of the cover letter?")

❓ What is the main topic of the cover letter?
🔍 Searching through 23 document pieces...
🤖 The cover letter is focused on expressing interest in a position at the company and highlighting Vishal’s skills and experience in machine learning and automation. It emphasizes his expertise in developing intelligent systems for various use cases, including customer behavior, scientific simulation, and model building. The letter also demonstrates his passion for learning and his desire to contribute creatively and precisely to the company's projects.

**Therefore, the main topic of the cover le...
────────────────────────────────────────────────────────────
🤖 The cover letter is focused on expressing interest in a position at the company and highlighting Vishal’s skills and experience in machine learning and automation. It emphasizes his expertise in developing intelligent systems for various use cases, including customer behavior, scientific simulation, and model building. The letter also demons

"The cover letter is focused on expressing interest in a position at the company and highlighting Vishal’s skills and experience in machine learning and automation. It emphasizes his expertise in developing intelligent systems for various use cases, including customer behavior, scientific simulation, and model building. The letter also demonstrates his passion for learning and his desire to contribute creatively and precisely to the company's projects.\n\n**Therefore, the main topic of the cover letter is his interest in the specific position and his professional background in machine learning and automation.**\n\n"