# RAG Chatbot Testing Without Gradio - Multi-Document Knowledge Base

This notebook lets you test the RAG (Retrieval-Augmented Generation) pipeline directly, without the Gradio interface.

## 🔗 Unified Multi-Document RAG System

This implementation creates a **unified knowledge base** that combines **ALL PDF files** in the directory into a single searchable database. Key features:

- **📚 Multi-Document Reference:** Questions can draw information from any or all documents simultaneously
- **🔍 Cross-Document Intelligence:** Find connections and patterns across your entire document collection  
- **🎯 Source Attribution:** See exactly which documents contributed to each answer
- **💡 Comprehensive Answers:** Get insights that span multiple documents for richer responses

## 📋 Example Input and Output

### Input Setup:
```
./sample_pdfs/
├── harry_potter_book1.pdf
└── harry_potter_book2.pdf
```

### Example Questions and Answers:

Question: "Which house was Harry assigned to?"

Output:
```
💬 Answer: Harry Potter was assigned to Gryffindor house at Hogwarts School of 
Witchcraft and Wizardry. The Sorting Hat placed him in Gryffindor after 
considering his qualities and characteristics.

📖 Referenced documents: harry_potter_book1.pdf
📊 Used 2 chunks from 1 documents
```
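
The two footer lines in the output above can be derived from per-chunk metadata alone. The following is a minimal sketch, assuming each retrieved chunk carries a `source_file` metadata entry (the loading cell in this notebook attaches one). The `summarize_sources` helper and the plain dicts standing in for LangChain `Document.metadata` are hypothetical, not part of `rag_functions`.

```python
from typing import Iterable, Mapping


def summarize_sources(chunk_metadata: Iterable[Mapping]) -> str:
    """Build the attribution footer from chunk metadata dicts (hypothetical helper)."""
    metadata = list(chunk_metadata)
    # Sort the unique file names so the footer order is deterministic
    files = sorted({m.get("source_file", "unknown") for m in metadata})
    noun = "document" if len(files) == 1 else "documents"
    return (
        f"📖 Referenced documents: {', '.join(files)}\n"
        f"📊 Used {len(metadata)} chunks from {len(files)} {noun}"
    )


print(summarize_sources([
    {"source_file": "harry_potter_book1.pdf"},
    {"source_file": "harry_potter_book1.pdf"},
]))
```

Unlike the notebook's own print statements, this sketch switches between "document" and "documents" depending on the count.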

## Import Required Libraries

**Prerequisites:** Install Ollama from [https://ollama.com/](https://ollama.com/) and download the `llama2` model with `ollama pull llama2`.

In [1]:
from pathlib import Path
import requests

# Import RAG functions from separate module
from langchain.chains import RetrievalQA
from rag_functions import document_loader, text_splitter, vector_database, get_llm

## Check Ollama status

In [2]:

def check_ollama_status() -> bool:
    """
    Check if Ollama service is running and what models are available.

    Returns
    -------
    bool
        True if Ollama is running and llama2 model is available, False otherwise.
    """
    
    try:
        # Check if Ollama service is running
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json().get('models', [])
            print("✅ Ollama is running!")
            print(f"📦 Available models: {[m['name'] for m in models]}")
            
            # Check if llama2 is available
            llama_models = [m for m in models if 'llama2' in m['name']]
            if llama_models:
                print("✅ llama2 model is ready!")
                return True
            else:
                print("⚠️  llama2 model not found. Run: ollama pull llama2")
                return False
        else:
            print("❌ Ollama API not responding")
            return False
    except requests.exceptions.RequestException:
        print("❌ Ollama is not running. Visit https://ollama.com/ for installation instructions.")
        return False
    except Exception as e:
        print(f"❌ Error checking Ollama: {e}")
        return False


# Check Ollama status
ollama_ready = check_ollama_status()

if not ollama_ready:
    print("\n🔧 Setup required:")
    print("1. Install Ollama: https://ollama.com/")
    print("2. Start Ollama service")
    print("3. Download model: ollama pull llama2")
    print("4. Re-run this cell to verify")

✅ Ollama is running!
📦 Available models: ['llama2:latest']
✅ llama2 model is ready!
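
The model lookup can be separated from the HTTP call so that it can be exercised without a running Ollama server. This is a minimal sketch, assuming the `/api/tags` payload has the shape the function above already relies on (a `models` list of dicts with a `name` key). The `has_model` helper is hypothetical. It compares the part of the name before the `:` tag instead of doing a substring match, so a differently named model that merely contains `llama2` is not counted.

```python
def has_model(tags_payload: dict, base_name: str) -> bool:
    """Return True if a listed model's name, ignoring its ':tag' suffix, equals base_name."""
    for model in tags_payload.get("models", []):
        name = model.get("name", "")
        # "llama2:latest" -> "llama2"
        if name.split(":", 1)[0] == base_name:
            return True
    return False


# Sample payloads are illustrative, not captured from a real server
print(has_model({"models": [{"name": "llama2:latest"}]}, "llama2"))
print(has_model({"models": [{"name": "llama2-uncensored:7b"}]}, "llama2"))
```

Whether exact matching is preferable to the substring check depends on which model variants you consider acceptable.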


## Set Directory Path

Define the directory path where your PDF files are located. The cell below creates the directory if it doesn't exist.

In [3]:
# Set the directory path for PDF files
pdf_directory: Path = Path("./sample_pdfs")

# Create directory if it doesn't exist
pdf_directory.mkdir(exist_ok=True)

print(f"PDF directory: {pdf_directory.absolute()}")
print(f"Directory exists: {pdf_directory.exists()}")

# You can also use an absolute path to an existing directory
# pdf_directory = Path("/path/to/your/pdf/files")

PDF directory: /Users/tirthshah/git_personal/qa-bot-with-rag/sample_pdfs
Directory exists: True


## Process All PDF Files and Create Combined Knowledge Base

Scan the directory and list all available PDF files.

In [4]:
# List all PDF files in the directory
pdf_files = list(pdf_directory.glob("*.pdf"))

print(f"Found {len(pdf_files)} PDF files:")
for i, file_path in enumerate(pdf_files, 1):
    file_size = file_path.stat().st_size / 1024  # Size in KB
    print(f"{i}. {file_path.name} ({file_size:.1f} KB)")

if not pdf_files:
    print("\n⚠️  No PDF files found!")
    print(f"Please add some PDF files to: {pdf_directory.absolute()}")
else:
    print(f"\n✅ Ready to process {len(pdf_files)} files")

Found 2 PDF files:
1. chamber_of_secrets.pdf (33.9 KB)
2. philosophers_stone.pdf (27.8 KB)

✅ Ready to process 2 files
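
`Path.glob` does not guarantee any particular ordering, and on case-sensitive filesystems the pattern `*.pdf` skips files ending in `.PDF`. If a stable, case-insensitive listing matters, a sketch along these lines may help; `find_pdfs` is a hypothetical helper, not something the notebook defines.

```python
import tempfile
from pathlib import Path


def find_pdfs(directory: Path) -> list[Path]:
    """Return PDF files in `directory`, sorted by name, matching the extension case-insensitively."""
    return sorted(
        (p for p in directory.iterdir() if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )


# Demonstrate on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.pdf", "A.PDF", "notes.txt"):
        (root / name).touch()
    print([p.name for p in find_pdfs(root)])
```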


Process all PDF files and combine them into a single knowledge base.

In [5]:
# Process all PDF files and combine into a single knowledge base
all_documents = []
processed_files = []

if pdf_files:
    print(f"Processing {len(pdf_files)} PDF files for combined knowledge base...")
    print("=" * 70)
    
    # Load all documents
    for i, file_path in enumerate(pdf_files, 1):
        print(f"\n📁 Loading file {i}/{len(pdf_files)}: {file_path.name}")
        print("-" * 40)
        
        try:
            # Load document
            documents = document_loader(file_path)
            print(f"✅ Document loaded successfully!")
            print(f"📄 Number of pages: {len(documents)}")
            
            # Add source metadata to each document
            for doc in documents:
                doc.metadata['source_file'] = file_path.name
                doc.metadata['file_path'] = str(file_path)
            
            # Add to combined document list
            all_documents.extend(documents)
            
            # Track processing info
            processed_files.append({
                'name': file_path.name,
                'pages': len(documents),
                'path': file_path
            })
            
            print(f"📚 Added {len(documents)} pages to combined knowledge base")
            
        except Exception as e:
            print(f"❌ Error loading document: {e}")
    
    # Create combined chunks from all documents
    if all_documents:
        print(f"\n🔄 Creating combined knowledge base from {len(all_documents)} total pages...")
        print("-" * 60)
        
        # Split all documents into chunks
        all_chunks = text_splitter(all_documents)
        
        # Report chunk statistics (guard against an empty chunk list)
        print(f"📊 Created {len(all_chunks)} chunks from all documents")
        avg_chunk_length = sum(len(chunk.page_content) for chunk in all_chunks) / max(len(all_chunks), 1)
        print(f"📊 Average chunk length: {avg_chunk_length:.0f} characters")
        
        # Show source distribution
        source_counts = {}
        for chunk in all_chunks:
            source = chunk.metadata.get('source_file', 'unknown')
            source_counts[source] = source_counts.get(source, 0) + 1
        
        print(f"\n📋 Chunk distribution by source:")
        for source, count in source_counts.items():
            print(f"  - {source}: {count} chunks")
        
        # Create the unified vector database
        print(f"\n🗄️ Creating unified vector database...")
        unified_vectordb = vector_database(all_chunks)
        unified_retriever = unified_vectordb.as_retriever(
            search_kwargs={'k': 2}
        )
        
        print(f"✅ Combined knowledge base created successfully!")
        print(f"🎯 Ready for RAG queries across all {len(processed_files)} documents")
    
    else:
        print("❌ No documents loaded successfully")
        unified_retriever = None
        all_chunks = []
    
else:
    print("❌ No PDF files available to process")
    processed_files = []
    unified_retriever = None
    all_chunks = []

Ignoring wrong pointing object 6 0 (offset 0)
Ignoring wrong pointing object 48 0 (offset 0)
Ignoring wrong pointing object 48 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)
Ignoring wrong pointing object 8 0 (offset 0)


Processing 2 PDF files for combined knowledge base...

📁 Loading file 1/2: chamber_of_secrets.pdf
----------------------------------------
✅ Document loaded successfully!
📄 Number of pages: 2
📚 Added 2 pages to combined knowledge base

📁 Loading file 2/2: philosophers_stone.pdf
----------------------------------------
✅ Document loaded successfully!
📄 Number of pages: 1
📚 Added 1 pages to combined knowledge base

🔄 Creating combined knowledge base from 3 total pages...
------------------------------------------------------------
📊 Created 9 chunks from all documents
📊 Average chunk length: 804 characters

📋 Chunk distribution by source:
  - chamber_of_secrets.pdf: 5 chunks
  - philosophers_stone.pdf: 4 chunks

🗄️ Creating unified vector database...


✅ Combined knowledge base created successfully!
🎯 Ready for RAG queries across all 2 documents
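
The `text_splitter` function lives in `rag_functions` and is not shown in this notebook. To illustrate the general idea of chunking, here is a sketch of fixed-size character windows with overlap. This is a simpler technique than whatever splitter the module actually uses, and the `chunk_size` and `overlap` defaults are assumptions, not values taken from the project.

```python
def split_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, overlap=1))
```

The overlap means the last character of one window reappears at the start of the next, which helps keep sentences that straddle a boundary retrievable from at least one chunk.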


## Test Unified RAG System

Ask questions that can be answered using information from any document in the combined knowledge base.
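
The retriever was configured earlier with `search_kwargs={'k': 2}`, so each question is answered from the two chunks ranked most similar to it. The sketch below illustrates top-k ranking with cosine similarity over toy vectors. The real embedding model and vector store are defined in `rag_functions` and may score chunks differently; the vectors and chunk names here are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query, best first."""
    ranked = sorted(chunk_vecs, key=lambda cid: cosine(query_vec, chunk_vecs[cid]), reverse=True)
    return ranked[:k]


# Hypothetical 3-dimensional embeddings for three chunks
toy_index = {
    "sorting_hat": [0.9, 0.1, 0.0],
    "basilisk": [0.0, 0.2, 0.9],
    "gryffindor_table": [0.8, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```

With `k=2`, a question whose answer is spread over more than two chunks can lose context, which is one reason to experiment with larger values.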

In [6]:
# Create QA system
llm = get_llm()

unified_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=unified_retriever,
    return_source_documents=True
)
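
With `chain_type="stuff"`, the retrieved chunks are concatenated ("stuffed") into a single prompt together with the question, and the LLM is called once. The sketch below shows the idea only. The template wording is an assumption and is not LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Join all retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Chunk texts are invented placeholders, not content from the sample PDFs
prompt = build_stuff_prompt(
    "Which house was Harry assigned to?",
    ["First retrieved chunk text.", "Second retrieved chunk text."],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach only works while the combined text fits in the model's context window, which is easy to satisfy with two chunks.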

In [7]:
# Test questions - each one can draw on any document in the knowledge base
questions = [
    "Which house was Harry assigned to?",
    "What monster was hidden in the Chamber of Secrets?",
    "Who is the main villain that Harry faces at the end of each of his first two years at Hogwarts?"
]

if unified_retriever is None:
    # Nothing to query: report which earlier step failed
    print("❌ No unified knowledge base available for testing")
    if not processed_files:
        print("No files were processed successfully")
    else:
        print("Vector database creation failed")
else:
    for question in questions:
        response = unified_qa.invoke(question)
        answer = response['result']
        source_docs = response['source_documents']

        print(f"\n❓ Question: {question}")
        print("-" * 60)

        print(f"💬 Answer: {answer}")

        # Show which documents were referenced
        if source_docs:
            referenced_files = sorted(
                {doc.metadata.get('source_file', 'unknown') for doc in source_docs}
            )
            print(f"\n📖 Referenced documents: {', '.join(referenced_files)}")
            print(f"📊 Used {len(source_docs)} chunks from {len(referenced_files)} documents")
        else:
            print("\n⚠️  No source documents were returned for this answer")

        print("\n" + "=" * 60)


❓ Question: Which house was Harry assigned to?
------------------------------------------------------------
💬 Answer: According to the passage, Harry Potter was assigned to Gryffindor House at Hogwarts School of Witchcraft and Wizardry.

📖 Referenced documents: philosophers_stone.pdf
📊 Used 2 chunks from 1 documents


❓ Question: What monster was hidden in the Chamber of Secrets?
------------------------------------------------------------
💬 Answer: According to the context provided, the monster hidden in the Chamber of Secrets is a Basilisk.

📖 Referenced documents: chamber_of_secrets.pdf
📊 Used 2 chunks from 1 documents


❓ Question: Who is the main villain that Harry faces at