# Ancestor RAG System - Complete Interactive Notebook

This notebook demonstrates Retrieval-Augmented Generation (RAG) for answering questions about your ancestors from your own PDF documents.

## 📋 What This Demo Does:

1. ✅ Prompts for API key
2. ✅ Asks a question BEFORE loading PDFs (baseline)
3. ✅ Lets you select which PDFs to load
4. ✅ Times the RAG embedding process
5. ✅ Asks the same question AFTER loading (with RAG)
6. ✅ Shows before/after comparison
7. ✅ Interactive Q&A mode

## 🚀 Quick Start:

1. Upload your PDF files to this directory
2. Run all cells: **Cell → Run All**
3. Follow the prompts!

## 📦 Step 1: Install Required Packages

In [None]:
# Install dependencies (run once)
!pip install anthropic pypdf numpy scikit-learn ipywidgets -q

print("✓ Packages installed successfully!")

## 📚 Step 2: Import Libraries

In [None]:
import os
from pathlib import Path
from typing import List, Dict, Any, Optional
import time
import pickle
import anthropic
from pypdf import PdfReader
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import display, HTML, Markdown, clear_output
import ipywidgets as widgets

print("✓ All libraries imported successfully!")

## 🔧 Step 3: Define AncestorRAG Class

In [None]:
class AncestorRAG:
    """
    A RAG system for ancestor research. Retrieval uses local hashed word-count
    vectors; Anthropic's Claude API is used only to generate answers.
    """
    
    def __init__(self, anthropic_api_key: str):
        """Initialize the RAG system."""
        self.api_key = anthropic_api_key
        self.client = anthropic.Anthropic(api_key=self.api_key)
        self.documents = []  # Store document chunks
        self.embeddings = []  # Store embeddings
        print("✓ Ancestor RAG system initialized")
    
    def extract_text_from_pdf(self, pdf_path: str) -> str:
        """Extract all text content from a PDF file."""
        try:
            reader = PdfReader(pdf_path)
            text = ""
            for page in reader.pages:
                page_text = page.extract_text()
                if page_text:
                    text += page_text + "\n"
            return text
        except Exception as e:
            print(f"Error reading {pdf_path}: {e}")
            return ""
    
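    def chunk_text(self, text: str, chunk_size: int = 1000, overlap: int = 200) -> List[str]:
        """Split text into overlapping chunks."""
        if not 0 <= overlap < chunk_size:
            # Otherwise the window would never advance and the loop would not end
            raise ValueError("overlap must be non-negative and smaller than chunk_size")
        
        chunks = []
        start = 0
        step = chunk_size - overlap
        
        while start < len(text):
            chunk = text[start:start + chunk_size].strip()
            if chunk:
                chunks.append(chunk)
            start += step
        
        return chunks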
    
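    def create_embedding(self, text: str) -> List[float]:
        """Create a simple embedding by hashing words into a fixed-size count vector."""
        import zlib  # local import keeps this method self-contained
        
        embedding_dim = 512
        embedding = np.zeros(embedding_dim)
        
        for word in text.lower().split():
            # crc32 is stable across sessions; the built-in hash() is salted per
            # process for strings, which would make a saved index unusable later
            bucket = zlib.crc32(word.encode("utf-8")) % embedding_dim
            embedding[bucket] += 1
        
        # Normalize to unit length so scores are comparable across chunks
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm
        
        return embedding.tolist()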
    
    def add_pdf(self, pdf_path: str, metadata: Optional[Dict[str, Any]] = None):
        """Add a PDF document to the RAG system."""
        if not os.path.exists(pdf_path):
            print(f"❌ Error: File not found: {pdf_path}")
            return
        
        print(f"📄 Processing {pdf_path}...")
        
        # Extract text
        text = self.extract_text_from_pdf(pdf_path)
        if not text.strip():
            print(f"❌ Warning: No text extracted from {pdf_path}")
            return
        
        # Split into chunks
        chunks = self.chunk_text(text)
        if not chunks:
            print(f"❌ Warning: No chunks created from {pdf_path}")
            return
        
        print(f"   Creating embeddings for {len(chunks)} chunks...")
        
        # Create embeddings
        for i, chunk in enumerate(chunks):
            embedding = self.create_embedding(chunk)
            doc_metadata = metadata.copy() if metadata else {}
            doc_metadata.update({
                "source": pdf_path,
                "filename": Path(pdf_path).name,
                "chunk_id": i,
                "total_chunks": len(chunks)
            })
            
            self.documents.append({"text": chunk, "metadata": doc_metadata})
            self.embeddings.append(embedding)
        
        print(f"✓ Added {len(chunks)} chunks from {Path(pdf_path).name}")
    
    def search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
        """Search for relevant document chunks."""
        if not self.documents:
            return []
        
        query_embedding = self.create_embedding(query)
        similarities = cosine_similarity([query_embedding], self.embeddings)[0]
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                "text": self.documents[idx]["text"],
                "metadata": self.documents[idx]["metadata"],
                "score": float(similarities[idx])
            })
        
        return results
    
    def query(self, question: str, top_k: int = 5, show_sources: bool = True) -> Dict[str, Any]:
        """Answer a question using RAG with Claude."""
        if not self.documents:
            return {
                "answer": "No documents loaded. Please add PDF files first.",
                "sources": []
            }
        
        # Retrieve relevant documents
        results = self.search(question, top_k=top_k)
        
        # Build context
        context_parts = []
        for i, result in enumerate(results):
            filename = result['metadata']['filename']
            chunk_text = result['text']
            context_parts.append(f"[Source {i+1}: {filename}]\n{chunk_text}")
        
        context = "\n\n".join(context_parts)
        
        # Create prompt
        prompt = f"""You are a helpful genealogy research assistant. Answer the following question based on the provided information about ancestors.

Context from ancestor documents:
{context}

Question: {question}

Instructions:
- Provide a clear, accurate answer based on the documents
- Include specific details like dates, places, and names when available
- If the documents don't contain enough information to answer fully, say so
- Be conversational and helpful
- Don't make up information that isn't in the documents

Answer:"""
        
        # Query Claude
        try:
            message = self.client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=2000,
                messages=[{"role": "user", "content": prompt}]
            )
            answer = message.content[0].text
        except Exception as e:
            return {"answer": f"Error querying Claude: {e}", "sources": []}
        
        # Prepare response
        response = {"answer": answer}
        if show_sources:
            response["sources"] = [
                {
                    "file": r['metadata']['filename'],
                    "chunk": r['metadata']['chunk_id'],
                    "score": r['score']
                }
                for r in results
            ]
        
        return response
    
    def save_index(self, filepath: str = "ancestor_index.pkl"):
        """Save the document index to a file."""
        data = {"documents": self.documents, "embeddings": self.embeddings}
        with open(filepath, "wb") as f:
            pickle.dump(data, f)
        print(f"✓ Index saved to {filepath} ({len(self.documents)} chunks)")
    
    def load_index(self, filepath: str = "ancestor_index.pkl"):
        """Load a previously saved index."""
        if not os.path.exists(filepath):
            print(f"❌ Error: Index file '{filepath}' not found")
            return False
        
        try:
            with open(filepath, "rb") as f:
                data = pickle.load(f)
            self.documents = data["documents"]
            self.embeddings = data["embeddings"]
            print(f"✓ Index loaded from {filepath} ({len(self.documents)} chunks)")
            return True
        except Exception as e:
            print(f"❌ Error loading index: {e}")
            return False

print("✓ AncestorRAG class defined successfully!")
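
The retrieval step in the class above (hash each word into a fixed-size vector, normalize, then rank chunks by cosine similarity) can be tried in isolation. The following is a minimal sketch under stated assumptions, not the class itself: `embed` and `rank` are hypothetical helper names, the two documents are invented toy text, and `zlib.crc32` is used as the word hash so that bucket assignment is reproducible across Python sessions.

```python
import zlib
from typing import List, Tuple

import numpy as np

DIM = 512  # same vector size as the class uses


def embed(text: str) -> np.ndarray:
    """Hashed bag-of-words vector, scaled to unit length."""
    vec = np.zeros(DIM)
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def rank(query: str, docs: List[str], top_k: int = 2) -> List[Tuple[int, float]]:
    """Return (doc_index, cosine_score) pairs, best match first."""
    matrix = np.stack([embed(d) for d in docs])
    scores = matrix @ embed(query)  # unit vectors, so dot product = cosine
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(i), float(scores[i])) for i in order]


# Toy corpus, invented purely for illustration
docs = [
    "alice example was born in springfield",
    "bob sample worked as a carpenter near the harbor",
]
ranked = rank("alice example was born in springfield", docs)
print(ranked)
```

Because the query here repeats the first document word for word, that document scores approximately 1.0 and ranks first. Real questions share only some words with a chunk, so scores are lower and depend on exact word overlap, which is the main limitation of this kind of embedding.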

## 🔑 Step 4: Configure API Key

In [None]:
# API Key Configuration
print("="*70)
print("🔑 API Key Setup")
print("="*70)

# Check if already set in environment
API_KEY = os.environ.get("ANTHROPIC_API_KEY")

if API_KEY:
    print("✓ Found API key in environment variable")
else:
    print("\nPlease enter your Anthropic API key:")
    print("(Get one at: https://console.anthropic.com/)")
    from getpass import getpass
    API_KEY = getpass("API Key: ")
    
    if API_KEY:
        os.environ["ANTHROPIC_API_KEY"] = API_KEY
        print("✓ API key set for this session")
    else:
        print("❌ No API key provided")

print()

## 📝 Step 5: Ask Question BEFORE Loading Documents

This establishes a baseline - what does Claude know without your documents?

In [None]:
print("="*70)
print("Step 1: Ask a Question (WITHOUT your documents)")
print("="*70)
print("\nAsk a question about one of your ancestors.")
print("Example: 'Where was Giovanni Parone born?'\n")

question = input("Your question: ").strip()

if not question:
    question = "Where was Giovanni Parone born?"
    print(f"Using default: {question}")

print(f"\n🤔 Asking Claude WITHOUT documents...\n")

# Ask Claude directly (no RAG)
client = anthropic.Anthropic(api_key=API_KEY)
try:
    message = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1000,
        messages=[{"role": "user", "content": question}]
    )
    answer_without_rag = message.content[0].text
except Exception as e:
    answer_without_rag = f"Error: {e}"

display(Markdown("### 💬 Answer WITHOUT documents:"))
display(Markdown(f"---\n{answer_without_rag}\n---"))

print("\n✓ Baseline answer recorded!")

## 📁 Step 6: Select and Load PDF Documents

In [None]:
print("="*70)
print("Step 2: Load Your PDF Documents")
print("="*70)

# Find all PDFs in current directory
# (case-insensitive match so files named .PDF are found too; sorted for a stable menu)
pdf_files = sorted(f for f in os.listdir(".") if f.lower().endswith(".pdf"))

if not pdf_files:
    print("\n❌ No PDF files found in current directory.")
    print("Please upload PDF files and re-run this cell.")
else:
    print(f"\nFound {len(pdf_files)} PDF file(s):")
    for i, pdf in enumerate(pdf_files, 1):
        size_mb = os.path.getsize(pdf) / (1024 * 1024)
        print(f"  {i}. {pdf} ({size_mb:.2f} MB)")
    
    # Ask user which PDFs to load
    print("\nWhich PDFs to load?")
    print("  • Enter numbers (e.g., 1,2,3)")
    print("  • Enter 'all' for all files")
    
    choice = input("\nChoice: ").strip().lower()
    
    if choice == 'all':
        selected_pdfs = pdf_files
    else:
        try:
            indices = [int(x.strip()) for x in choice.split(',')]
            selected_pdfs = [pdf_files[i-1] for i in indices if 1 <= i <= len(pdf_files)]
        except ValueError:
            print("⚠️  Invalid choice. Loading all PDFs.")
            selected_pdfs = pdf_files
    
    if selected_pdfs:
        print(f"\n✓ Will process {len(selected_pdfs)} PDF(s)")
    else:
        print("❌ No PDFs selected.")
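
The selection logic in this cell (accept `all` or a comma-separated list of 1-based numbers, ignore out-of-range entries) can be factored into a small function that is easy to test without typing into a prompt. This is a sketch only: `parse_selection` is a hypothetical helper that is not part of the notebook, and dropping duplicate numbers is an added assumption.

```python
from typing import List


def parse_selection(choice: str, n_files: int) -> List[int]:
    """Turn 'all' or '1,3' into zero-based indices.

    Raises ValueError for non-numeric input so the caller can decide
    what the fallback should be.
    """
    choice = choice.strip().lower()
    if choice == "all":
        return list(range(n_files))

    indices: List[int] = []
    for part in choice.split(","):
        number = int(part.strip())  # ValueError on anything that is not an integer
        # Keep only in-range numbers, and skip repeats (an assumption of this sketch)
        if 1 <= number <= n_files and number - 1 not in indices:
            indices.append(number - 1)
    return indices


print(parse_selection("1, 3", 3))   # [0, 2]
print(parse_selection("all", 2))    # [0, 1]
print(parse_selection("2,2,9", 3))  # [1]
```

With a helper like this, the cell would only need to map the returned indices onto `pdf_files` and catch `ValueError` to fall back to loading everything.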

## ⚙️ Step 7: Process PDFs with RAG (Timed)

In [None]:
print("="*70)
print("Step 3: Processing Documents (RAG)")
print("="*70)

if not pdf_files or not selected_pdfs:
    print("\n❌ No PDFs to process. Please upload PDFs first.")
else:
    print(f"\n⚙️  Loading {len(selected_pdfs)} PDF(s) into RAG system...")
    print("This will:")
    print("  1. Extract text from PDFs")
    print("  2. Split into chunks")
    print("  3. Create embeddings")
    print()
    
    # Initialize RAG
    rag = AncestorRAG(anthropic_api_key=API_KEY)
    print()
    
    # Process with timing
    start_time = time.time()
    
    for pdf in selected_pdfs:
        rag.add_pdf(pdf)
        print()
    
    elapsed = time.time() - start_time
    
    print("="*70)
    print(f"✓ Processing complete!")
    print(f"⏱️  Time: {elapsed:.2f} seconds ({elapsed/60:.2f} minutes)")
    print(f"📊 Total chunks: {len(rag.documents)}")
    print("="*70)
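
The total chunk count printed by this cell follows from the sliding-window arithmetic in `chunk_text`: each window starts `chunk_size - overlap` characters after the previous one. The sketch below works that out for the default values (1000 and 200); `chunk_starts` is a hypothetical helper written for illustration, and the 2500-character length is an arbitrary example.

```python
from typing import List


def chunk_starts(text_len: int, chunk_size: int = 1000, overlap: int = 200) -> List[int]:
    """Start offsets visited by a sliding window with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 800 characters with the defaults
    return list(range(0, text_len, stride))


starts = chunk_starts(2500)
print(starts)       # [0, 800, 1600, 2400]
print(len(starts))  # 4
```

The count is an upper bound for the notebook, since `chunk_text` skips any window that is empty after stripping whitespace. Note also that the last window here covers only 100 characters, so the final chunk of a document is often much shorter than the rest.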

## 🔍 Step 8: Ask Same Question WITH Documents

In [None]:
print("="*70)
print("Step 4: Ask the Same Question (WITH your documents)")
print("="*70)

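# Guard against `rag` not existing yet (e.g. no PDFs were processed in Step 7)
if 'rag' not in globals() or not rag.documents:
    print("\n❌ No documents loaded. Please run the previous cells first.")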
else:
    print(f"\nOriginal question: \"{question}\"")
    change = input("Ask a different question? (yes/no): ").strip().lower()
    
    if change in ['yes', 'y']:
        new_q = input("New question: ").strip()
        if new_q:
            question = new_q
    
    print(f"\n🔎 Asking Claude WITH documents (using RAG)...\n")
    
    result = rag.query(question, top_k=5)
    
    display(Markdown("### 💬 Answer WITH documents:"))
    display(Markdown(f"---\n{result['answer']}\n---"))
    
    if result.get('sources'):
        print("\n📚 Sources:")
        for i, s in enumerate(result['sources'], 1):
            print(f"  {i}. {s['file']} (chunk {s['chunk']}, relevance: {s['score']:.0%})")

## 📊 Step 9: Side-by-Side Comparison

In [None]:
display(Markdown("# 🔍 COMPARISON - Before vs After RAG"))
display(Markdown(f"## ❓ Question: {question}\n"))

display(Markdown("### 🔴 BEFORE (without documents):"))
display(Markdown(f"---\n{answer_without_rag}\n---"))

if 'result' in locals():
    display(Markdown("### 🟢 AFTER (with RAG):"))
    display(Markdown(f"---\n{result['answer']}\n---"))

display(Markdown("""
### 💡 Key Differences:
- **BEFORE**: Based only on Claude's general knowledge
- **AFTER**: Based on passages retrieved from YOUR documents, with the source chunks listed
- **RAG** grounds the answer in your own records instead of general knowledge
"""))

## 💬 Step 10: Interactive Q&A (Ask More Questions)

In [None]:
# Create interactive interface
question_input = widgets.Textarea(
    placeholder='Ask another question about your ancestors...',
    description='Question:',
    layout=widgets.Layout(width='100%', height='80px')
)

ask_button = widgets.Button(
    description='Ask Claude',
    button_style='primary',
    icon='search'
)

output_area = widgets.Output()

def on_ask_clicked(b):
    with output_area:
        clear_output()
        q = question_input.value.strip()
        
        if not q:
            print("⚠️  Please enter a question")
            return
        
        # Guard against `rag` not existing yet (Step 7 may have been skipped)
        if 'rag' not in globals() or not rag.documents:
            print("❌ No documents loaded. Please run the cells above first.")
            return
        
        print(f"🔎 Searching for: {q}\n")
        print("⏳ Generating answer...\n")
        
        result = rag.query(q)
        
        display(Markdown(f"### 💬 Answer:\n\n{result['answer']}"))
        
        if result.get('sources'):
            print("\n📚 Sources:")
            for i, source in enumerate(result['sources'], 1):
                print(f"  {i}. {source['file']} (chunk {source['chunk']}, relevance: {source['score']:.0%})")

ask_button.on_click(on_ask_clicked)

print("🌳 Interactive Q&A - Ask more questions below!")
print()
display(question_input)
display(ask_button)
display(output_area)

## 💾 Step 11: Save Your Index (Optional)

In [None]:
# Save the index to avoid re-processing next time
if 'rag' in locals() and rag.documents:
    save = input("Save the index for faster loading next time? (yes/no): ").strip().lower()
    
    if save in ['yes', 'y']:
        filename = input("Filename (press Enter for 'ancestor_index.pkl'): ").strip()
        if not filename:
            filename = "ancestor_index.pkl"
        
        rag.save_index(filename)
        print(f"\n💡 Next time, load with: rag.load_index('{filename}')")
else:
    print("No index to save. Please process PDFs first.")
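
To see what saving and reloading amounts to, the round trip can be reproduced on a toy index with the same `{"documents": ..., "embeddings": ...}` layout that `save_index` writes. This is a sketch with invented data: the file name `toy_index.pkl`, the sample chunk, and the three-element vector are placeholders, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

# Toy index in the same layout as save_index(); contents are placeholders
index = {
    "documents": [
        {"text": "sample chunk", "metadata": {"filename": "example.pdf", "chunk_id": 0}}
    ],
    "embeddings": [[0.0, 1.0, 0.0]],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_index.pkl")

    # Write the index, then read it back from the same file
    with open(path, "wb") as f:
        pickle.dump(index, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == index)  # True
```

In a later session, `load_index` reads this same layout back into `rag.documents` and `rag.embeddings`, so the PDFs do not need to be re-processed.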

## 🎯 Example Questions to Try

Use the interactive box above to ask:
- Where was [name] born?
- When did [name] immigrate to America?
- What was [name]'s occupation?
- Tell me about [name]'s family
- Where did [name] live in the United States?
- What military service did [name] have?

## 📝 Summary

This notebook demonstrated:
1. ✅ Baseline question (without documents)
2. ✅ PDF selection and loading
3. ✅ Timed RAG processing
4. ✅ Same question with RAG
5. ✅ Before/after comparison
6. ✅ Interactive Q&A mode

**Key Takeaway:** RAG transforms general AI into a personalized research assistant for YOUR documents! 🌳