# Week 3 - Exercise 1: Personal Knowledge Base

## 📋 Exercise Overview

**Due:** Monday (Week 3)  
**Estimated Time:** 3-4 hours  
**Difficulty:** Intermediate

---

## 🎯 Learning Objectives

In this exercise, you will:
1. Build a complete RAG system from scratch
2. Process and upload multiple documents
3. Implement semantic search with PGVector
4. Create a Q&A system with source citations
5. Add conversation memory for context

---

## 📝 Requirements

Your Personal Knowledge Base must:

### Core Features:
- ✅ **Document Upload:** Load at least 3 different documents (txt, pdf, etc.)
- ✅ **Text Splitting:** Implement proper chunking with overlap
- ✅ **Vector Storage:** Store embeddings in PGVector
- ✅ **Semantic Search:** Retrieve relevant chunks based on query
- ✅ **Q&A System:** Answer questions with source citations
- ✅ **Conversation Memory:** Maintain context across multiple questions

### Technical Requirements:
- Use `RecursiveCharacterTextSplitter` with appropriate chunk size
- Store metadata (source, chunk_id, timestamp)
- Implement similarity search with configurable `k` value
- Return source documents with answers
- Handle errors gracefully (missing docs, connection issues)
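
The metadata requirement can be pictured with a plain-Python sketch. The helper name `build_chunk_metadata` and the exact key names are assumptions for illustration; in your solution, a dictionary like this would be merged into each chunk's `metadata`.

```python
from datetime import datetime, timezone
from typing import Dict


def build_chunk_metadata(source: str, chunk_id: int, text: str) -> Dict:
    """Return the metadata fields listed in the requirements (hypothetical helper)."""
    return {
        "source": source,                                     # file the chunk came from
        "chunk_id": chunk_id,                                 # position of the chunk in its document
        "timestamp": datetime.now(timezone.utc).isoformat(),  # when the chunk was processed
        "chunk_size": len(text),                              # length in characters
    }


chunk_texts = ["First chunk of text.", "Second chunk."]
metadata = [build_chunk_metadata("doc1.txt", i, text) for i, text in enumerate(chunk_texts)]
```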

### Bonus Challenges (Optional):
- 🌟 Support multiple document formats (PDF, DOCX, TXT)
- 🌟 Implement metadata filtering (by source, date, category)
- 🌟 Add document summarization feature
- 🌟 Create a relevance scoring system
- 🌟 Implement document update/delete functionality

---

## 💡 Hints

<details>
<summary>Click for Hint 1: Document Processing Pipeline</summary>

```python
# Pipeline structure:
# 1. Load documents    → loader.load()
# 2. Split into chunks → text_splitter.split_documents()
# 3. Add metadata      → chunk.metadata.update()
# 4. Create embeddings → PGVector.from_documents()
```
</details>

<details>
<summary>Click for Hint 2: Choosing Chunk Size</summary>

Good starting values:
- chunk_size: 500-1000 characters
- chunk_overlap: 50-200 characters (10-20% of chunk_size)
- Experiment to find what works best for your documents
</details>
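
To see how `chunk_size` and `chunk_overlap` interact, here is a minimal fixed-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers natural separators such as paragraph breaks); `split_with_overlap` is a hypothetical helper that only illustrates the arithmetic.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.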

<details>
<summary>Click for Hint 3: Source Citations</summary>

```python
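# Return both the answer and the chunks it was based on.
# Assumes `retriever` and `rag_chain` have already been created.
def answer_with_sources(question: str) -> dict:
    retrieved_docs = retriever.invoke(question)  # get_relevant_documents() is deprecated
    answer = rag_chain.invoke(question)
    return {"answer": answer, "sources": retrieved_docs}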
```
</details>
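
Once the retrieved chunks are available, their metadata can be turned into readable citations. The sketch below uses plain dictionaries in place of LangChain `Document` objects, and `format_citations` is a hypothetical helper; it assumes each chunk carries the `source` and `chunk_id` metadata listed in the requirements.

```python
from typing import Dict, List


def format_citations(sources: List[Dict]) -> List[str]:
    """Build one numbered citation per unique (source, chunk_id) pair."""
    seen = set()
    citations = []
    for meta in sources:
        key = (meta.get("source", "Unknown"), meta.get("chunk_id"))
        if key in seen:
            continue  # skip chunks that were retrieved more than once
        seen.add(key)
        citations.append(f"[{len(citations) + 1}] {key[0]} (chunk {key[1]})")
    return citations


retrieved = [
    {"source": "doc1.txt", "chunk_id": 0},
    {"source": "doc2.txt", "chunk_id": 3},
    {"source": "doc1.txt", "chunk_id": 0},  # duplicate hit
]
print("\n".join(format_citations(retrieved)))
```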

---

## 🔧 Setup

In [None]:
# Import required libraries
import os
from dotenv import load_dotenv
from typing import List, Dict
from datetime import datetime

# LangChain imports
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import PGVector
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_text_splitters import RecursiveCharacterTextSplitter  # current home of the splitters
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain_core.documents import Document  # replaces the legacy langchain.schema import
from langchain.memory import ConversationBufferMemory

# Load environment variables
load_dotenv()

# Initialize components
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

print("✅ Setup complete!")

---

## 📝 Step 1: Create Sample Documents

Create at least 3 sample documents for your knowledge base.

In [None]:
# TODO: Create sample documents
# You can create text files manually or use the code below

# Example documents
doc1_content = """
# TODO: Add your first document content here
# Topic: [Your choice - e.g., Python Programming]
"""

doc2_content = """
# TODO: Add your second document content here
# Topic: [Your choice - e.g., Machine Learning]
"""

doc3_content = """
# TODO: Add your third document content here
# Topic: [Your choice - e.g., Data Science]
"""

# Save documents
# YOUR CODE HERE

print("✅ Sample documents created")
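
One way to save the documents is with `pathlib`. This is a sketch under assumptions: the file names `doc1.txt` to `doc3.txt` match the ones used in Test 1, `save_documents` is a hypothetical helper, and the one-line strings stand in for your own content. The demo writes to a temporary directory so that it leaves no files behind.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def save_documents(contents: Dict[str, str], directory: Path) -> List[Path]:
    """Write each document to directory/<name> as UTF-8 text and return the paths."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, text in contents.items():
        path = directory / name
        path.write_text(text.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Placeholder content; replace with your own documents.
sample_contents = {
    "doc1.txt": "Python is a general-purpose programming language.",
    "doc2.txt": "Machine learning fits models to data.",
    "doc3.txt": "Data science combines statistics and programming.",
}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_documents(sample_contents, Path(tmp))
    saved_names = [p.name for p in saved]
    first_text = saved[0].read_text(encoding="utf-8")
```

Passing `Path(".")` instead of the temporary directory writes the files next to the notebook, which is where Test 1 looks for them.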

---

## 📝 Step 2: Implement DocumentProcessor Class

Create a class to handle document loading and processing.

In [None]:
class DocumentProcessor:
    """
    Handles document loading, splitting, and metadata management.
    """
    
    def __init__(self, chunk_size: int = 500, chunk_overlap: int = 50):
        """
        Initialize the document processor.
        
        Args:
            chunk_size: Target size for text chunks
            chunk_overlap: Overlap between chunks
        """
        # TODO: Initialize text splitter
        self.text_splitter = None  # YOUR CODE HERE
        
        # Track processed documents
        self.processed_docs = []
    
    def load_document(self, file_path: str, doc_type: str = "txt") -> List[Document]:
        """
        Load a document from file.
        
        Args:
            file_path: Path to the document
            doc_type: Type of document (txt, pdf)
            
        Returns:
            List of loaded documents
        """
        # TODO: Implement document loading
        # Handle different file types
        # YOUR CODE HERE
        pass
    
    def process_document(self, file_path: str, category: str = "general") -> List[Document]:
        """
        Load, split, and add metadata to a document.
        
        Args:
            file_path: Path to the document
            category: Category/tag for the document
            
        Returns:
            List of processed document chunks
        """
        try:
            # TODO: Load document
            # YOUR CODE HERE
            
            # TODO: Split into chunks
            # YOUR CODE HERE
            
            # TODO: Add metadata to each chunk
            # Metadata should include:
            # - source (file path)
            # - category
            # - chunk_id
            # - timestamp
            # - chunk_size
            # YOUR CODE HERE
            
            # Track processed documents
            self.processed_docs.extend(chunks)
            
            return chunks
            
        except Exception as e:
            print(f"❌ Error processing {file_path}: {e}")
            return []
    
    def get_statistics(self) -> Dict:
        """
        Get processing statistics.
        """
        # TODO: Return statistics about processed documents
        # YOUR CODE HERE
        pass
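
For `get_statistics`, one possible shape of the result is sketched below, with plain dictionaries standing in for chunk metadata. The function name `summarize_chunks` and the returned keys are assumptions, not a required format.

```python
from collections import Counter
from typing import Dict, List


def summarize_chunks(chunk_metadata: List[Dict]) -> Dict:
    """Aggregate simple counts from a list of per-chunk metadata dicts."""
    sizes = [m["chunk_size"] for m in chunk_metadata]
    return {
        "total_chunks": len(chunk_metadata),
        "total_documents": len({m["source"] for m in chunk_metadata}),
        "chunks_per_category": dict(Counter(m["category"] for m in chunk_metadata)),
        "avg_chunk_size": sum(sizes) / len(sizes) if sizes else 0.0,
    }


example = [
    {"source": "doc1.txt", "category": "programming", "chunk_size": 480},
    {"source": "doc1.txt", "category": "programming", "chunk_size": 320},
    {"source": "doc2.txt", "category": "machine-learning", "chunk_size": 400},
]
stats = summarize_chunks(example)
```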

---

## 📝 Step 3: Implement KnowledgeBase Class

Create a class to manage the vector store and retrieval.

In [None]:
class KnowledgeBase:
    """
    Manages vector store, retrieval, and Q&A functionality.
    """
    
    def __init__(self, collection_name: str = "personal_kb"):
        """
        Initialize the knowledge base.
        
        Args:
            collection_name: Name for the PGVector collection
        """
        self.collection_name = collection_name
        self.embeddings = embeddings
        self.llm = llm
        
        # TODO: Initialize connection string
        self.connection_string = None  # YOUR CODE HERE
        
        # TODO: Initialize vector store (will be set in add_documents)
        self.vectorstore = None
        self.retriever = None
        
        # TODO: Initialize conversation memory
        self.memory = None  # YOUR CODE HERE
        
        # TODO: Create RAG chain
        self._create_rag_chain()
    
    def _create_rag_chain(self):
        """
        Create the RAG chain for Q&A.
        """
        # TODO: Create prompt template
        self.rag_prompt = None  # YOUR CODE HERE
        
        # Chain will be created after vectorstore is initialized
        self.rag_chain = None
    
    def add_documents(self, documents: List[Document]):
        """
        Add documents to the vector store.
        
        Args:
            documents: List of documents to add
        """
        # TODO: Create or update vector store
        # YOUR CODE HERE
        
        # TODO: Create retriever
        # YOUR CODE HERE
        
        # TODO: Create complete RAG chain
        # YOUR CODE HERE
        
        print(f"✅ Added {len(documents)} documents to knowledge base")
    
    def search(self, query: str, k: int = 3, filter_dict: Dict = None) -> List[Document]:
        """
        Search for relevant documents.
        
        Args:
            query: Search query
            k: Number of results to return
            filter_dict: Metadata filters
            
        Returns:
            List of relevant documents
        """
        # TODO: Implement search
        # YOUR CODE HERE
        pass
    
    def ask(self, question: str, return_sources: bool = True) -> Dict:
        """
        Ask a question and get an answer with sources.
        
        Args:
            question: The question to ask
            return_sources: Whether to return source documents
            
        Returns:
            Dictionary with answer and sources
        """
        # TODO: Get answer from RAG chain
        # YOUR CODE HERE
        
        # TODO: Get source documents if requested
        # YOUR CODE HERE
        
        # TODO: Store in memory
        # YOUR CODE HERE
        
        pass
    
    def get_conversation_history(self) -> List:
        """
        Get the conversation history.
        """
        # TODO: Return conversation history from memory
        # YOUR CODE HERE
        pass
    
    def clear_memory(self):
        """
        Clear conversation memory.
        """
        # TODO: Clear memory
        # YOUR CODE HERE
        pass
    
    # BONUS: Implement these methods
    
    def summarize_document(self, source: str) -> str:
        """
        Generate a summary of a specific document.
        """
        # TODO (BONUS): Implement document summarization
        pass
    
    def get_statistics(self) -> Dict:
        """
        Get knowledge base statistics.
        """
        # TODO (BONUS): Return statistics
        pass
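
The connection string is usually assembled from environment variables rather than hard-coded. The sketch below assumes a SQLAlchemy-style PostgreSQL URL with the `psycopg2` driver; the variable names (`PGVECTOR_USER` and so on) and the defaults are hypothetical, so match them to your own `.env` file and database setup.

```python
import os
from typing import Mapping, Optional
from urllib.parse import quote_plus


def build_connection_string(env: Optional[Mapping[str, str]] = None) -> str:
    """Assemble a postgresql+psycopg2 URL from environment-style settings."""
    env = os.environ if env is None else env
    user = env.get("PGVECTOR_USER", "postgres")
    password = quote_plus(env.get("PGVECTOR_PASSWORD", ""))  # escape characters such as @ and /
    host = env.get("PGVECTOR_HOST", "localhost")
    port = env.get("PGVECTOR_PORT", "5432")
    database = env.get("PGVECTOR_DATABASE", "postgres")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"


# Demo with an explicit mapping instead of the real environment.
demo = build_connection_string(
    {"PGVECTOR_USER": "kb_user", "PGVECTOR_PASSWORD": "p@ss/word", "PGVECTOR_DATABASE": "kb"}
)
print(demo)
```

Escaping the password matters because characters such as `@` or `/` would otherwise be read as part of the URL structure.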

---

## ✅ Testing Your Implementation

Run these tests to verify your knowledge base works correctly:

### Test 1: Document Processing

In [None]:
print("Test 1: Document Processing")
print("="*60)

# Initialize processor
processor = DocumentProcessor(chunk_size=500, chunk_overlap=50)

# Process documents
all_chunks = []
documents = [
    ("doc1.txt", "programming"),
    ("doc2.txt", "machine-learning"),
    ("doc3.txt", "data-science")
]

for file_path, category in documents:
    chunks = processor.process_document(file_path, category)
    all_chunks.extend(chunks)
    print(f"✅ Processed {file_path}: {len(chunks)} chunks")

print(f"\n📊 Total chunks: {len(all_chunks)}")

# Show statistics
stats = processor.get_statistics()
print(f"\n📊 Statistics:")
for key, value in stats.items():
    print(f"  {key}: {value}")

# ✅ Should successfully process all documents!

### Test 2: Knowledge Base Creation

In [None]:
print("\nTest 2: Knowledge Base Creation")
print("="*60)

# Create knowledge base
kb = KnowledgeBase(collection_name="test_kb")

# Add documents
kb.add_documents(all_chunks)

print("✅ Knowledge base created successfully")

# ✅ Should create vector store without errors!

### Test 3: Semantic Search

In [None]:
print("\nTest 3: Semantic Search")
print("="*60)

# Search for relevant documents
query = "What is machine learning?"
results = kb.search(query, k=3)

print(f"üîç Query: {query}")
print(f"\nüìö Found {len(results)} relevant chunks:\n")

for i, doc in enumerate(results, 1):
    print(f"{i}. Source: {doc.metadata.get('source', 'Unknown')}")
    print(f"   Category: {doc.metadata.get('category', 'Unknown')}")
    print(f"   Content: {doc.page_content[:100]}...")
    print()

# ✅ Should return relevant chunks!
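
Conceptually, `kb.search` ranks stored chunks by how close their embeddings are to the query embedding and keeps the top `k`. The sketch below shows that ranking with cosine similarity, one common choice of measure, on hand-made 3-dimensional vectors. The vectors and the `top_k` helper are illustrative assumptions; in the exercise, the embeddings come from the embedding model and the ranking happens inside PGVector.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    items: List[Tuple[str, Sequence[float]]],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the k (text, score) pairs with the highest similarity."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_index = [
    ("Machine learning fits models to data.", [0.9, 0.1, 0.0]),
    ("Python is a programming language.", [0.1, 0.9, 0.0]),
    ("Data science uses statistics.", [0.6, 0.2, 0.6]),
]
query_vector = [1.0, 0.0, 0.0]  # pretend embedding of "What is machine learning?"
results = top_k(query_vector, toy_index, k=2)
```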

### Test 4: Q&A with Sources

In [None]:
print("\nTest 4: Q&A with Sources")
print("="*60)

questions = [
    "What is Python used for?",
    "Explain machine learning in simple terms.",
    "What is data science?"
]

for question in questions:
    print(f"\n❓ Question: {question}")
    
    result = kb.ask(question, return_sources=True)
    
    print(f"💡 Answer: {result['answer']}")
    
    if 'sources' in result:
        print(f"\n📚 Sources:")
        for i, source in enumerate(result['sources'], 1):
            print(f"  {i}. {source.metadata.get('source', 'Unknown')}")
    
    print("-" * 60)

# ✅ Should provide accurate answers with sources!

### Test 5: Conversational Memory

In [None]:
print("\nTest 5: Conversational Memory")
print("="*60)

# Clear previous memory
kb.clear_memory()

# Have a conversation
conversation = [
    "What is Python?",
    "What are its main features?",  # Should reference Python
    "How is it used in data science?"  # Should maintain context
]

for question in conversation:
    print(f"\nüë§ User: {question}")
    result = kb.ask(question, return_sources=False)
    print(f"ü§ñ Assistant: {result['answer']}")

# View conversation history
print("\nüìú Conversation History:")
history = kb.get_conversation_history()
print(f"Total exchanges: {len(history)}")

# ‚úÖ Should maintain context across questions!
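
Follow-up questions such as "What are its main features?" only make sense if earlier turns reach the model. Here is a minimal sketch of that idea, independent of LangChain's memory classes: keep a list of turns and place the most recent ones before the next question. `SimpleMemory` and its `max_turns` parameter are hypothetical names.

```python
from typing import List, Tuple


class SimpleMemory:
    """Keeps the last max_turns question/answer pairs."""

    def __init__(self, max_turns: int = 5):
        if max_turns < 1:
            raise ValueError("max_turns must be at least 1")
        self.max_turns = max_turns
        self.turns: List[Tuple[str, str]] = []

    def add(self, question: str, answer: str) -> None:
        self.turns.append((question, answer))
        self.turns = self.turns[-self.max_turns:]  # drop the oldest turns

    def as_prompt_prefix(self) -> str:
        """Format the stored turns as text to place before the new question."""
        lines = []
        for question, answer in self.turns:
            lines.append(f"User: {question}")
            lines.append(f"Assistant: {answer}")
        return "\n".join(lines)

    def clear(self) -> None:
        self.turns = []


memory = SimpleMemory(max_turns=2)
memory.add("What is Python?", "A programming language.")
memory.add("What are its main features?", "Readable syntax and a large ecosystem.")
memory.add("How is it used in data science?", "Through libraries for analysis.")
prefix = memory.as_prompt_prefix()
```

Capping the number of stored turns keeps the prompt from growing without bound, at the cost of forgetting the oldest exchanges.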

### Test 6: Metadata Filtering (Bonus)

In [None]:
print("\nTest 6: Metadata Filtering")
print("="*60)

# Search with category filter
filtered_results = kb.search(
    query="programming",
    k=5,
    filter_dict={"category": "programming"}
)

print(f"üîç Filtered search (category='programming'):")
print(f"Found {len(filtered_results)} results")

for doc in filtered_results:
    print(f"  - {doc.metadata.get('source')}: {doc.metadata.get('category')}")

# ‚úÖ Should only return documents from specified category!
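
Metadata filtering narrows the candidate chunks before (or while) ranking them. The plain-Python sketch below applies an exact-match filter like the `filter_dict` in Test 6 to a list of metadata dictionaries. `matches_filter` and `filter_chunks` are hypothetical helpers; a vector store would normally apply the filter inside the database query.

```python
from typing import Dict, List, Optional


def matches_filter(metadata: Dict, filter_dict: Optional[Dict]) -> bool:
    """True when every key in filter_dict has the same value in metadata."""
    if not filter_dict:
        return True  # no filter means every chunk is eligible
    return all(metadata.get(key) == value for key, value in filter_dict.items())


def filter_chunks(chunks: List[Dict], filter_dict: Optional[Dict] = None) -> List[Dict]:
    """Keep only the chunks whose metadata satisfies the filter."""
    return [chunk for chunk in chunks if matches_filter(chunk, filter_dict)]


stored = [
    {"source": "doc1.txt", "category": "programming"},
    {"source": "doc2.txt", "category": "machine-learning"},
    {"source": "doc3.txt", "category": "data-science"},
]
programming_only = filter_chunks(stored, {"category": "programming"})
```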

---

## 🎨 Your Own Tests

Add your own test cases here:

In [None]:
# YOUR TEST CASES HERE


---

## 📊 Self-Assessment

Rate your implementation (1-5):

| Criteria | Rating | Notes |
|----------|--------|-------|
| Document Processing | /5 | Loads and splits correctly? |
| Vector Store | /5 | PGVector integration works? |
| Semantic Search | /5 | Returns relevant results? |
| Q&A Accuracy | /5 | Provides correct answers? |
| Source Citations | /5 | Properly cites sources? |
| Conversation Memory | /5 | Maintains context? |
| Error Handling | /5 | Handles errors gracefully? |
| Code Quality | /5 | Clean, documented code? |
| Bonus Features | /5 | Extra features implemented? |
| **Total** | **/45** | |

---

## 🤔 Reflection Questions

Answer these questions in the markdown cell below:

1. What chunk size and overlap worked best for your documents? Why?
2. How did you handle documents with different structures?
3. What strategies did you use to improve retrieval accuracy?
4. How would you scale this to thousands of documents?
5. What metadata proved most useful for filtering?

---

### Your Answers:

**1. Chunk Size Selection:**
- [Your answer here]

**2. Handling Different Document Structures:**
- [Your answer here]

**3. Improving Retrieval Accuracy:**
- [Your answer here]

**4. Scaling Strategy:**
- [Your answer here]

**5. Useful Metadata:**
- [Your answer here]

---

## 📤 Submission

### Before Submitting:

- [ ] All tests pass
- [ ] Documents are properly processed and stored
- [ ] Semantic search returns relevant results
- [ ] Q&A provides accurate answers with sources
- [ ] Conversation memory works correctly
- [ ] Code is well-documented
- [ ] Error handling is implemented
- [ ] Reflection questions answered
- [ ] Notebook runs from top to bottom without errors

### How to Submit:

1. Save this notebook
2. Commit to your git branch: `git commit -m "Complete Week 3 Exercise 1"`
3. Push to repository: `git push origin week3-exercise1`
4. Submit repository link to instructor

---

**Excellent work on building your RAG system! 🎉**