# 07 - RAG Systems: Building Knowledge-Enhanced Applications

## Overview
In this notebook, we'll build complete Retrieval-Augmented Generation (RAG) systems that combine LLMs with external knowledge sources. You'll learn how to create applications that can answer questions using your own documents.

## Learning Objectives
By the end of this notebook, you will be able to:
- Build a complete RAG pipeline from document loading to answer generation
- Implement different retrieval strategies for better results
- Handle context length limitations and optimize retrieval
- Create multi-query RAG for comprehensive answers
- Add metadata filtering for targeted retrieval
- Evaluate and optimize RAG system performance

## Prerequisites
- Completion of notebooks 01-06 (especially 05 on document loading and 06 on embeddings)
- Understanding of vector stores and similarity search
- Basic knowledge of document processing

## Back-and-Forth Teaching Pattern
This notebook follows our pattern:
1. **Instructor Activity**: Demonstrates a concept with complete examples
2. **Learner Activity**: You apply the concept with guidance and hidden solutions

## Setup

Let's install and import the necessary libraries:

In [None]:
# Install required packages
!pip install langchain langchain-community langchain-openai chromadb faiss-cpu pypdf rank_bm25

In [None]:
import os
from typing import List, Dict, Any
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma, FAISS
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain
from langchain.prompts import ChatPromptTemplate, PromptTemplate
from langchain.schema.runnable import RunnablePassthrough, RunnableParallel
from langchain.schema.output_parser import StrOutputParser
from langchain_community.document_loaders import TextLoader, PyPDFLoader
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
import warnings
warnings.filterwarnings('ignore')

# Set your OpenAI API key (a key already present in the environment is left untouched)
os.environ.setdefault("OPENAI_API_KEY", "your-api-key-here")

## Create Sample Documents

Let's create some sample documents to work with:

In [None]:
# Create sample documents about AI topics
sample_texts = [
    """Machine Learning Fundamentals
    Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience.
    There are three main types: supervised learning (with labeled data), unsupervised learning (finding patterns),
    and reinforcement learning (learning through rewards). Common algorithms include decision trees, neural networks,
    and support vector machines. Applications range from image recognition to recommendation systems.""",
    
    """Natural Language Processing
    NLP is a field of AI that helps computers understand, interpret, and manipulate human language.
    Key techniques include tokenization, named entity recognition, sentiment analysis, and language modeling.
    Modern NLP uses transformer models like BERT and GPT. Applications include chatbots, translation,
    and text summarization. Challenges include context understanding and handling ambiguity.""",
    
    """Computer Vision Applications
    Computer vision enables machines to interpret and understand visual information from the world.
    Core tasks include object detection, image classification, facial recognition, and semantic segmentation.
    Convolutional Neural Networks (CNNs) are the backbone of most vision systems. Real-world applications
    include autonomous vehicles, medical imaging, and augmented reality. Recent advances include
    vision transformers and self-supervised learning.""",
    
    """Deep Learning Architecture
    Deep learning uses neural networks with multiple layers to progressively extract higher-level features.
    Key architectures include CNNs for images, RNNs for sequences, and Transformers for various tasks.
    Training requires large datasets and computational resources. Techniques like transfer learning
    and fine-tuning help adapt pre-trained models. Challenges include interpretability and overfitting.""",
    
    """AI Ethics and Bias
    Ethical AI development requires addressing bias, fairness, transparency, and accountability.
    Bias can enter through training data, algorithm design, or deployment contexts. Mitigation strategies
    include diverse datasets, fairness metrics, and regular audits. Important considerations include
    privacy protection, explainable AI, and responsible deployment. Regulatory frameworks are emerging globally."""
]

# Save documents to files
for i, text in enumerate(sample_texts):
    with open(f"ai_doc_{i}.txt", "w") as f:
        f.write(text)

print("Sample documents created!")

---

## Instructor Activity 1: Building a Basic RAG Pipeline

Let's build a complete RAG system from scratch, understanding each component:

In [None]:
# Step 1: Load and split documents
documents = []
for i in range(5):
    loader = TextLoader(f"ai_doc_{i}.txt")
    documents.extend(loader.load())

# Split documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", " ", ""]
)
splits = text_splitter.split_documents(documents)

print(f"Created {len(splits)} document chunks")
print(f"\nExample chunk: {splits[0].page_content[:150]}...")
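To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing with overlap. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on the listed separators, whereas the hypothetical `sliding_chunks` helper below cuts purely by character position.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.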

In [None]:
# Step 2: Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=embeddings,
    collection_name="ai_knowledge_base"
)

# Create retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 3}  # Return top 3 most relevant chunks
)

print("Vector store created with retriever configured")
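Conceptually, similarity search ranks chunks by how close their embedding vectors are to the query embedding and keeps the top `k`. The sketch below uses cosine similarity on hand-written three-dimensional vectors purely for illustration; the toy vectors and the `top_k` helper are assumptions, and the metric a real vector store uses depends on its configuration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          index: List[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy "embeddings": each axis loosely stands for one topic
toy_index = [
    ("supervised learning", [0.9, 0.1, 0.0]),
    ("image classification", [0.1, 0.9, 0.1]),
    ("fairness audits", [0.0, 0.1, 0.9]),
]
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```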

In [None]:
# Step 3: Build RAG chain using LCEL
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create RAG prompt
rag_prompt = ChatPromptTemplate.from_template(
    """You are an AI expert assistant. Use the following context to answer the question.
    If you don't know the answer based on the context, say so.
    
    Context:
    {context}
    
    Question: {question}
    
    Answer:"""
)

# Build the RAG chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)

print("RAG chain built successfully")
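`format_docs` joins every retrieved chunk, which is fine for three small chunks but can overflow the model's context window with a larger `k` or bigger chunks. One possible guard, sketched here under the assumption that a character count is an acceptable rough proxy for tokens, is to add chunks in rank order until a budget is reached. The `fit_to_budget` helper is hypothetical and not part of LangChain.

```python
from typing import List


def fit_to_budget(chunks: List[str], max_chars: int, separator: str = "\n\n") -> str:
    """Join chunks in rank order, stopping before the character budget is exceeded."""
    selected = []
    used = 0
    for chunk in chunks:
        # The separator only costs characters once something is already selected
        extra = len(chunk) + (len(separator) if selected else 0)
        if used + extra > max_chars:
            break
        selected.append(chunk)
        used += extra
    return separator.join(selected)


ranked = ["a" * 40, "b" * 40, "c" * 40]  # stand-ins for retrieved chunks, best first
context = fit_to_budget(ranked, max_chars=90)
print(len(context))
```

Because the chunks arrive best first, truncating from the tail drops the least relevant material.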

In [None]:
# Step 4: Test the RAG system
questions = [
    "What are the three main types of machine learning?",
    "How do CNNs relate to computer vision?",
    "What are the key ethical considerations in AI?"
]

for question in questions:
    print(f"\nQ: {question}")
    answer = rag_chain.invoke(question)
    print(f"A: {answer}")
    print("-" * 50)

In [None]:
# Advanced: RAG with source citations
rag_prompt_with_sources = ChatPromptTemplate.from_template(
    """You are an AI expert assistant. Use the following context to answer the question.
    Include relevant source references in your answer.
    
    Context (with sources):
    {context}
    
    Question: {question}
    
    Answer (include source references):"""
)

def format_docs_with_sources(docs):
    formatted = []
    for i, doc in enumerate(docs):
        source = doc.metadata.get('source', f'Document {i+1}')
        formatted.append(f"[Source: {source}]\n{doc.page_content}")
    return "\n\n".join(formatted)

rag_chain_with_sources = (
    {"context": retriever | format_docs_with_sources, "question": RunnablePassthrough()}
    | rag_prompt_with_sources
    | llm
    | StrOutputParser()
)

# Test with sources
question = "What techniques are used in NLP?"
answer = rag_chain_with_sources.invoke(question)
print(f"Q: {question}")
print(f"A: {answer}")

---

## Learner Activity 1: Build Your Own RAG System

Now it's your turn! Build a RAG system for a customer support knowledge base.

**Task**: Create a RAG system that can answer questions about a product using a knowledge base.

Requirements:
1. Create documents about a fictional product (e.g., a smart home device)
2. Build a vector store with appropriate chunk sizes
3. Implement a RAG chain with a custom prompt for customer support
4. Add functionality to handle "I don't know" cases gracefully
5. Include confidence scoring in responses

In [None]:
# Create your product knowledge base documents
product_docs = [
    """SmartHome Hub - Product Overview
    # Your product description here""",
    
    """Installation Guide
    # Installation steps here""",
    
    # Add more documents...
]

# TODO: Save documents to files
# Your code here

# TODO: Load and split documents
# Your code here

# TODO: Create vector store and retriever
# Your code here

# TODO: Build customer support RAG chain
# Hint: Create a prompt that:
# - Acts as a friendly support agent
# - Provides step-by-step help when appropriate
# - Admits when information isn't available
# - Suggests contacting human support for complex issues

# Your code here

# TODO: Test with customer questions
test_questions = [
    "How do I set up the SmartHome Hub?",
    "What's the warranty period?",
    "Can it work with my existing smart lights?"
]

# Your code here

In [None]:
# Solution (hidden by default - try solving it yourself first!)

"""
# Create product knowledge base
product_docs = [
    '''SmartHome Hub - Product Overview
    The SmartHome Hub is an all-in-one home automation controller that connects all your smart devices.
    Features include voice control, mobile app, automation routines, and compatibility with 1000+ devices.
    It supports WiFi, Zigbee, Z-Wave, and Bluetooth protocols. The hub has built-in security features
    including encryption and regular security updates. Price: $199. Warranty: 2 years.''',
    
    '''Installation Guide
    Step 1: Unbox your SmartHome Hub and connect the power adapter.
    Step 2: Download the SmartHome app from App Store or Google Play.
    Step 3: Create an account or sign in to your existing account.
    Step 4: Follow the in-app setup wizard to connect the hub to your WiFi.
    Step 5: The LED will turn solid green when connected successfully.
    Step 6: Start adding your smart devices through the app.
    Troubleshooting: If LED is red, check your WiFi password. If blinking blue, hub is in pairing mode.''',
    
    '''Compatible Devices
    The SmartHome Hub works with: Philips Hue lights, LIFX bulbs, Nest thermostats, Ring doorbells,
    August smart locks, Samsung SmartThings devices, Amazon Alexa, Google Home, Apple HomeKit (via bridge),
    Sonos speakers, and many more. Check our website for the full compatibility list.
    Most devices using Zigbee, Z-Wave, or WiFi protocols are supported.''',
    
    '''Troubleshooting Common Issues
    Hub won't connect: Ensure you're using 2.4GHz WiFi (not 5GHz). Reset hub by holding button for 10 seconds.
    Devices not responding: Check if device is powered on. Try removing and re-adding the device.
    App crashes: Update to latest version. Clear app cache. Reinstall if needed.
    Automation not working: Check time zone settings. Verify all devices in automation are online.
    For other issues, contact support at support@smarthomehub.com or call 1-800-SMARTHUB.'''
]

# Save documents
for i, doc in enumerate(product_docs):
    with open(f"product_doc_{i}.txt", "w") as f:
        f.write(doc)

# Load and split
from langchain_community.document_loaders import TextLoader

product_documents = []
for i in range(len(product_docs)):
    loader = TextLoader(f"product_doc_{i}.txt")
    product_documents.extend(loader.load())

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=150,
    chunk_overlap=30
)
product_splits = text_splitter.split_documents(product_documents)

# Create vector store
product_vectorstore = Chroma.from_documents(
    documents=product_splits,
    embedding=OpenAIEmbeddings(),
    collection_name="product_support"
)

product_retriever = product_vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

# Customer support RAG chain
support_prompt = ChatPromptTemplate.from_template(
    '''You are a friendly SmartHome Hub customer support assistant. 
    Use the following context to help the customer. Be helpful, clear, and professional.
    
    Guidelines:
    - Provide step-by-step instructions when applicable
    - If the information isn't in the context, politely say you'll need to check with the team
    - For complex technical issues, suggest contacting support directly
    - Always be encouraging and positive
    
    Context:
    {context}
    
    Customer Question: {question}
    
    Support Response:'''
)

def format_product_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

support_chain = (
    {"context": product_retriever | format_product_docs, "question": RunnablePassthrough()}
    | support_prompt
    | ChatOpenAI(model="gpt-3.5-turbo", temperature=0.3)
    | StrOutputParser()
)

# Test the system
test_questions = [
    "How do I set up the SmartHome Hub?",
    "What's the warranty period?",
    "Can it work with my Philips Hue lights?",
    "My hub LED is red, what should I do?",
    "Does it support Matter protocol?"  # This isn't in our docs
]

for question in test_questions:
    print(f"\nCustomer: {question}")
    response = support_chain.invoke(question)
    print(f"Support: {response}")
    print("-" * 60)
"""

print("Try implementing the customer support RAG system above!")
print("The solution shows how to create a helpful, professional support assistant.")
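The solution above does not address requirement 5 (confidence scoring). One way to approach it, sketched here under stated assumptions, is to derive a label from the best retrieval score and fall back to a handoff message when that score is low. The thresholds, the `FALLBACK` text, and both helper functions are illustrative assumptions; the scores are assumed to lie in [0, 1] with higher meaning more relevant, as with values from `similarity_search_with_relevance_scores`.

```python
from typing import Callable, List, Tuple

FALLBACK = "I'm not sure about that. Please contact our support team for help."


def confidence_label(scores: List[float], high: float = 0.8, low: float = 0.5) -> str:
    """Map the best retrieval score to a coarse confidence label."""
    if not scores:
        return "none"
    best = max(scores)
    if best >= high:
        return "high"
    if best >= low:
        return "medium"
    return "low"


def answer_with_confidence(scored_chunks: List[Tuple[str, float]],
                           generate: Callable[[str], str]) -> dict:
    """Generate an answer only when retrieval looks trustworthy enough."""
    label = confidence_label([score for _, score in scored_chunks])
    if label in ("none", "low"):
        return {"answer": FALLBACK, "confidence": label}
    context = "\n\n".join(text for text, _ in scored_chunks)
    return {"answer": generate(context), "confidence": label}


# Stand-in for the LLM call so the sketch runs without an API key
fake_llm = lambda ctx: f"Per our docs: {ctx}"

good = answer_with_confidence([("Warranty: 2 years.", 0.91)], fake_llm)
bad = answer_with_confidence([("Unrelated text", 0.2)], fake_llm)
print(good)
print(bad)
```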

---

## Instructor Activity 2: Advanced Retrieval Strategies

Let's explore advanced retrieval techniques for better RAG performance:

In [None]:
# Multi-Query Retriever: Generate multiple queries for comprehensive retrieval
from langchain.retrievers.multi_query import MultiQueryRetriever

# Create a prompt to generate multiple search queries
query_prompt = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI assistant helping to retrieve information.
    Generate 3 different versions of the given question to retrieve relevant documents.
    Provide these alternative questions separated by newlines.
    Original question: {question}
    Alternative questions:"""
)

llm = ChatOpenAI(temperature=0)

# Create multi-query retriever
multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
    prompt=query_prompt
)

# Test multi-query retrieval
question = "What are the challenges in implementing AI systems?"
docs = multi_query_retriever.invoke(question)  # invoke() replaces the deprecated get_relevant_documents()

print(f"Original question: {question}")
print(f"\nRetrieved {len(docs)} documents using multi-query approach:")
for i, doc in enumerate(docs[:3]):
    print(f"\nDoc {i+1}: {doc.page_content[:150]}...")
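After running each generated query, the multi-query approach merges the per-query results into a unique union, so a chunk found by several query variants appears only once. A minimal sketch of that merge step, using plain strings as stand-ins for documents (the `unique_union` function here is an illustration, not the library code):

```python
from typing import Iterable, List


def unique_union(result_lists: Iterable[List[str]]) -> List[str]:
    """Merge several result lists, keeping the first occurrence of each chunk."""
    seen = set()
    merged = []
    for results in result_lists:
        for chunk in results:
            if chunk not in seen:
                seen.add(chunk)
                merged.append(chunk)
    return merged


# One result list per generated query variant
per_query = [
    ["chunk A", "chunk B"],
    ["chunk B", "chunk C"],
    ["chunk A", "chunk D"],
]
merged = unique_union(per_query)
print(merged)
```

This is why the number of documents returned can exceed the base retriever's `k`: each variant contributes up to `k` chunks before deduplication.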

In [None]:
# Contextual Compression: Extract only relevant parts of retrieved documents
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# Create compressor
compressor = LLMChainExtractor.from_llm(llm)

# Create compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

# Compare regular vs compressed retrieval
question = "What specific techniques are used in NLP?"

print("Regular Retrieval:")
regular_docs = vectorstore.as_retriever().invoke(question)
print(f"First doc ({len(regular_docs[0].page_content)} chars): {regular_docs[0].page_content}")

print("\n" + "="*50 + "\n")

print("Compressed Retrieval (only relevant parts):")
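compressed_docs = compression_retriever.invoke(question)
if compressed_docs:
    print(f"First doc ({len(compressed_docs[0].page_content)} chars): {compressed_docs[0].page_content}")
else:
    # The compressor can discard every document when nothing looks relevant
    print("The compressor found no relevant passages for this question")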
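To build intuition for what compression does, here is a rough stand-in that keeps only the sentences sharing at least one content word with the question. `LLMChainExtractor` delegates this judgment to an LLM; the keyword overlap rule, the `STOPWORDS` set, and the `extract_relevant_sentences` helper below are assumptions chosen so the sketch runs offline.

```python
import re

STOPWORDS = {"what", "are", "used", "in", "the", "is", "a", "of", "and"}


def tokenize(text: str) -> set:
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extract_relevant_sentences(text: str, question: str, min_overlap: int = 1) -> str:
    """Keep sentences that share at least min_overlap content words with the question."""
    question_terms = tokenize(question) - STOPWORDS
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = [s for s in sentences if len(question_terms & tokenize(s)) >= min_overlap]
    return " ".join(kept)


chunk = (
    "NLP helps computers understand human language. "
    "Key techniques include tokenization and sentiment analysis. "
    "Applications include chatbots and translation."
)
compressed = extract_relevant_sentences(chunk, "What techniques are used in NLP?")
print(compressed)
```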

In [None]:
# Hybrid Search: Combine semantic and keyword search
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# Create BM25 retriever for keyword search
bm25_retriever = BM25Retriever.from_documents(splits)
bm25_retriever.k = 3

# Create ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vectorstore.as_retriever(search_kwargs={"k": 3})],
    weights=[0.4, 0.6]  # 40% keyword, 60% semantic
)

# Test hybrid search
question = "CNN architecture neural networks"
hybrid_docs = ensemble_retriever.invoke(question)

print(f"Hybrid search for: {question}")
print(f"\nRetrieved {len(hybrid_docs)} documents:")
for i, doc in enumerate(hybrid_docs[:3]):
    print(f"\nDoc {i+1}: {doc.page_content[:100]}...")
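`EnsembleRetriever` is documented as combining the ranked lists with weighted Reciprocal Rank Fusion. The sketch below shows the idea on document IDs: each retriever contributes `weight / (rank + c)` to a document's score. The constant `c=60` is the commonly used default and is treated here as an assumption, as are the toy rankings.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(rankings: Sequence[List[str]],
                 weights: Sequence[float],
                 c: int = 60) -> List[str]:
    """Fuse ranked lists: each list adds weight / (rank + c) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)


keyword_ranking = ["cnn", "rnn", "ethics"]    # e.g. from BM25
semantic_ranking = ["vision", "cnn", "rnn"]   # e.g. from the vector store
fused = weighted_rrf([keyword_ranking, semantic_ranking], weights=[0.4, 0.6])
print(fused)
```

Documents that appear in both lists ("cnn", "rnn") outrank the semantic retriever's top hit, which appears in only one list.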

In [None]:
# Metadata Filtering: Target specific document categories
from langchain.schema import Document

# Create documents with metadata
categorized_docs = [
    Document(
        page_content="Supervised learning uses labeled data for training...",
        metadata={"category": "ml_basics", "level": "beginner"}
    ),
    Document(
        page_content="Transformer architectures revolutionized NLP with attention mechanisms...",
        metadata={"category": "deep_learning", "level": "advanced"}
    ),
    Document(
        page_content="Ethical AI requires addressing bias in training data...",
        metadata={"category": "ethics", "level": "intermediate"}
    ),
]

# Create vector store with metadata
metadata_vectorstore = Chroma.from_documents(
    documents=categorized_docs,
    embedding=OpenAIEmbeddings(),
    collection_name="categorized_knowledge"
)

# Retrieve with metadata filter
filtered_retriever = metadata_vectorstore.as_retriever(
    search_kwargs={
        "k": 2,
        "filter": {"category": "deep_learning"}
    }
)

# Test filtered retrieval
question = "How do modern NLP systems work?"
filtered_docs = filtered_retriever.invoke(question)

print(f"Question: {question}")
print("\nFiltered retrieval (deep_learning category only):")
for doc in filtered_docs:
    print(f"- {doc.page_content}")
    print(f"  Metadata: {doc.metadata}")
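Metadata filtering narrows the candidate set before ranking, so a document outside the requested category can never be returned, however similar it is. A small offline sketch of that filter-then-rank order, with word overlap standing in for embedding similarity (the `matches`, `word_overlap`, and `filtered_search` helpers and the records are hypothetical):

```python
from typing import Callable, Dict, List


def matches(metadata: Dict[str, str], where: Dict[str, str]) -> bool:
    """True when every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in where.items())


def word_overlap(query: str) -> Callable[[str], int]:
    """Build a scorer that counts words shared with the query."""
    query_words = set(query.lower().split())
    return lambda text: len(query_words & set(text.lower().split()))


def filtered_search(records: List[dict], where: Dict[str, str],
                    score: Callable[[str], int], k: int = 2) -> List[dict]:
    """Apply the metadata filter first, then rank the surviving records."""
    candidates = [r for r in records if matches(r["metadata"], where)]
    candidates.sort(key=lambda r: score(r["text"]), reverse=True)
    return candidates[:k]


records = [
    {"text": "Supervised learning uses labeled data",
     "metadata": {"category": "ml_basics", "level": "beginner"}},
    {"text": "Transformers use attention for NLP",
     "metadata": {"category": "deep_learning", "level": "advanced"}},
    {"text": "CNNs extract image features",
     "metadata": {"category": "deep_learning", "level": "advanced"}},
    {"text": "NLP systems can encode bias",
     "metadata": {"category": "ethics", "level": "intermediate"}},
]
hits = filtered_search(records, {"category": "deep_learning"}, word_overlap("NLP attention"))
for hit in hits:
    print(hit["text"])
```

The ethics record mentions NLP but is excluded by the filter before any scoring happens.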

---

## Learner Activity 2: Implement Advanced Retrieval

Build a RAG system with advanced retrieval for a technical documentation chatbot.

**Task**: Create a system that uses multiple retrieval strategies to provide accurate technical answers.

Requirements:
1. Implement multi-query retrieval for better coverage
2. Add contextual compression for concise responses
3. Use metadata filtering to separate beginner/advanced content
4. Create a fallback mechanism when confidence is low
5. Implement query routing based on question type

In [None]:
# Create technical documentation with different levels
tech_docs = [
    # Create documents with metadata for level (beginner/advanced) and topic
    # Your documents here
]

# TODO: Create vector store with metadata
# Your code here

# TODO: Implement multi-query retriever
# Hint: Create a prompt that generates technical variations of the question
# Your code here

# TODO: Add contextual compression
# Your code here

# TODO: Build query router
# Hint: Detect if question is beginner or advanced, then use appropriate filter
def route_query(question: str) -> str:
    # Determine question level
    # Return "beginner" or "advanced"
    pass

# TODO: Create adaptive RAG chain
# Should:
# - Route queries to appropriate level
# - Use multi-query for complex questions
# - Compress results for clarity
# - Handle low-confidence cases

# Your code here

# Test with different types of questions
test_queries = [
    "What is an API?",  # Beginner question
    "How do I implement OAuth 2.0 with PKCE flow?",  # Advanced question
    "Explain microservices architecture trade-offs"  # Complex question
]

# Your test code here

In [None]:
# Solution (hidden by default)

"""
from langchain.schema import Document

# Create technical documentation
tech_docs = [
    Document(
        page_content='''API Basics: An API (Application Programming Interface) is a set of rules
        that allows different software applications to communicate. Think of it like a menu
        in a restaurant - it tells you what you can order and how to order it.
        Common types include REST APIs and GraphQL APIs.''',
        metadata={"level": "beginner", "topic": "api"}
    ),
    Document(
        page_content='''Advanced API Security: OAuth 2.0 with PKCE (Proof Key for Code Exchange)
        prevents authorization code interception attacks. Implementation: Generate code_verifier
        (random string), create code_challenge (SHA256 hash), include in authorization request,
        send code_verifier in token exchange. Critical for mobile and SPA applications.''',
        metadata={"level": "advanced", "topic": "api_security"}
    ),
    Document(
        page_content='''Microservices Architecture involves breaking applications into small,
        independent services. Benefits: scalability, technology diversity, fault isolation.
        Trade-offs: increased complexity, network latency, data consistency challenges.
        Requires robust service discovery, API gateway, and monitoring.''',
        metadata={"level": "advanced", "topic": "architecture"}
    ),
    Document(
        page_content='''REST API Basics: REST uses HTTP methods - GET (read), POST (create),
        PUT (update), DELETE (remove). URLs identify resources. Status codes indicate results:
        200 (success), 404 (not found), 500 (server error). JSON is the common data format.''',
        metadata={"level": "beginner", "topic": "api"}
    ),
]

# Create vector store with metadata
tech_vectorstore = Chroma.from_documents(
    documents=tech_docs,
    embedding=OpenAIEmbeddings(),
    collection_name="tech_docs"
)

# Multi-query retriever with technical variations
tech_query_prompt = PromptTemplate(
    input_variables=["question"],
    template='''You are a technical documentation assistant.
    Generate 3 different versions of this question to search technical docs:
    - One using simpler terms
    - One using technical terminology
    - One focusing on practical implementation
    
    Original question: {question}
    
    Alternative questions:'''
)

multi_query_tech = MultiQueryRetriever.from_llm(
    retriever=tech_vectorstore.as_retriever(),
    llm=ChatOpenAI(temperature=0.2),
    prompt=tech_query_prompt
)

# Contextual compression for technical docs
tech_compressor = LLMChainExtractor.from_llm(
    ChatOpenAI(temperature=0)
)

compression_tech = ContextualCompressionRetriever(
    base_compressor=tech_compressor,
    base_retriever=multi_query_tech
)

# Query router
def route_query(question: str) -> str:
    # Simple keyword-based routing (could use LLM for better classification)
    beginner_keywords = ['what is', 'basic', 'introduction', 'explain', 'simple']
    advanced_keywords = ['implement', 'optimize', 'architecture', 'advanced', 'pattern']
    
    question_lower = question.lower()
    
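    # Check advanced keywords first so that 'Explain ... architecture trade-offs'
    # is not routed to beginner content just because it contains 'explain'
    if any(keyword in question_lower for keyword in advanced_keywords):
        return "advanced"
    elif any(keyword in question_lower for keyword in beginner_keywords):
        return "beginner"
    else:
        return "all"  # No filter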

# Adaptive RAG chain
def create_adaptive_chain(level_filter=None):
    if level_filter and level_filter != "all":
        retriever = tech_vectorstore.as_retriever(
            search_kwargs={"k": 3, "filter": {"level": level_filter}}
        )
    else:
        retriever = tech_vectorstore.as_retriever(search_kwargs={"k": 3})
    
    # Add compression
    compressed_retriever = ContextualCompressionRetriever(
        base_compressor=tech_compressor,
        base_retriever=retriever
    )
    
    prompt = ChatPromptTemplate.from_template(
        '''You are a technical documentation assistant. Answer based on the context provided.
        If confidence is low, suggest consulting official documentation.
        
        Context level: {level}
        Context: {context}
        
        Question: {question}
        
        Answer (adjust complexity based on level):'''
    )
    
    chain = (
        {
            "context": compressed_retriever | format_docs,
            "question": RunnablePassthrough(),
            "level": lambda x: level_filter or "mixed"
        }
        | prompt
        | ChatOpenAI(temperature=0.2)
        | StrOutputParser()
    )
    
    return chain

# Test the system
test_queries = [
    "What is an API?",
    "How do I implement OAuth 2.0 with PKCE flow?",
    "Explain microservices architecture trade-offs"
]

for query in test_queries:
    level = route_query(query)
    print(f"\nQuestion: {query}")
    print(f"Detected Level: {level}")
    
    chain = create_adaptive_chain(level)
    answer = chain.invoke(query)
    
    print(f"Answer: {answer}")
    print("-" * 70)
"""

print("Implement an advanced retrieval system with routing and compression!")
print("The solution demonstrates multi-query, compression, and adaptive routing.")

---

## Instructor Activity 3: RAG Evaluation and Optimization

Let's learn how to evaluate and optimize RAG system performance:

In [None]:
# RAG Evaluation Metrics
from typing import List, Dict, Tuple
import numpy as np

class RAGEvaluator:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever
    
    def evaluate_retrieval_relevance(self, question: str, docs: List) -> float:
        """Evaluate how relevant retrieved documents are to the question"""
        relevance_prompt = ChatPromptTemplate.from_template(
            """Rate the relevance of this document to the question on a scale of 0-10.
            Question: {question}
            Document: {document}
            
            Return only a number between 0-10:"""
        )
        
        scores = []
        for doc in docs:
            chain = relevance_prompt | self.llm | StrOutputParser()
            score_str = chain.invoke({
                "question": question,
                "document": doc.page_content
            })
            try:
                scores.append(float(score_str.strip()))
            except ValueError:  # the reply was not a bare number
                scores.append(0)
        
        return np.mean(scores) if scores else 0
    
    def evaluate_answer_faithfulness(self, context: str, answer: str) -> float:
        """Check if answer is grounded in the context"""
        faithfulness_prompt = ChatPromptTemplate.from_template(
            """Is this answer fully supported by the context? Rate 0-10.
            Context: {context}
            Answer: {answer}
            
            Return only a number (10 = fully supported, 0 = not supported):"""
        )
        
        chain = faithfulness_prompt | self.llm | StrOutputParser()
        score_str = chain.invoke({"context": context, "answer": answer})
        
        try:
            return float(score_str.strip())
        except ValueError:  # the reply was not a bare number
            return 0
    
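    def evaluate_rag_pipeline(self, test_cases: List[Dict]) -> Dict:
        """Score retrieval relevance for each test case.

        Only "retrieval_relevance" is populated here; the faithfulness and
        overall-quality lists are placeholders, and expected_topics is read
        but not yet used.
        """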
        results = {
            "retrieval_relevance": [],
            "answer_faithfulness": [],
            "overall_quality": []
        }
        
        for case in test_cases:
            question = case["question"]
            expected_topics = case.get("expected_topics", [])
            
            # Get retrieved docs
            docs = self.retriever.invoke(question)
            
            # Evaluate retrieval
            relevance = self.evaluate_retrieval_relevance(question, docs)
            results["retrieval_relevance"].append(relevance)
            
            print(f"\nQ: {question}")
            print(f"Retrieval Relevance: {relevance:.1f}/10")
        
        return results

# Create evaluator
evaluator = RAGEvaluator(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    retriever=vectorstore.as_retriever()
)

# Test cases for evaluation
test_cases = [
    {"question": "What are the types of machine learning?", "expected_topics": ["supervised", "unsupervised"]},
    {"question": "How does computer vision work?", "expected_topics": ["CNN", "image"]}
]

# Run evaluation
eval_results = evaluator.evaluate_rag_pipeline(test_cases)
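The test cases carry `expected_topics`, but the evaluator above never uses them. One inexpensive way to put them to work, sketched here as an assumption rather than as part of the class, is a topic coverage score: the fraction of expected topics that appear somewhere in the retrieved text. The `topic_coverage` helper is hypothetical and needs no LLM call.

```python
from typing import List, Optional


def topic_coverage(expected_topics: List[str], retrieved_texts: List[str]) -> Optional[float]:
    """Fraction of expected topics found (case-insensitively) in the retrieved text.

    Returns None when there are no expected topics to check.
    Note: this is substring matching, so "supervised" also matches "unsupervised".
    """
    if not expected_topics:
        return None
    combined = " ".join(retrieved_texts).lower()
    hits = [topic for topic in expected_topics if topic.lower() in combined]
    return len(hits) / len(expected_topics)


ml_coverage = topic_coverage(
    ["supervised", "unsupervised"],
    ["There are three main types: supervised learning (with labeled data)",
     "reinforcement learning (learning through rewards)"],
)
cv_coverage = topic_coverage(
    ["CNN", "image"],
    ["Convolutional Neural Networks (CNNs) are the backbone of most vision systems",
     "Core tasks include object detection and image classification"],
)
print(ml_coverage, cv_coverage)
```

A low coverage value points at a retrieval problem (wrong chunks) rather than a generation problem, which helps decide where to tune first.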

In [None]:
# RAG Optimization Strategies

class OptimizedRAG:
    def __init__(self, vectorstore, llm):
        self.vectorstore = vectorstore
        self.llm = llm
        self.cache = {}  # Simple cache for repeated queries
    
    def optimize_chunk_size(self, documents: List, sizes: Tuple[int, ...] = (100, 200, 500)):
        """Compare candidate chunk sizes using a simple size-based heuristic"""
        best_size = None
        best_score = 0
        
        for size in sizes:
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=size,
                chunk_overlap=size // 10
            )
            splits = splitter.split_documents(documents)
            
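            # Proxy statistics only: no retrieval is actually run here
            avg_length = np.mean([len(s.page_content) for s in splits])
            num_chunks = len(splits)
            
            # Toy heuristic: rewards longer and fewer chunks, so the largest size
            # tends to win; replace with a real retrieval metric for serious tuning
            score = (avg_length / 500) * (100 / num_chunks)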
            
            print(f"Chunk size {size}: {num_chunks} chunks, avg {avg_length:.0f} chars, score: {score:.2f}")
            
            if score > best_score:
                best_score = score
                best_size = size
        
        return best_size
    
    def rerank_results(self, question: str, docs: List, top_k: int = 3):
        """Rerank retrieved documents by asking the LLM to score each one"""
        rerank_prompt = ChatPromptTemplate.from_template(
            """Score this document's relevance to the question (0-10).
            Question: {question}
            Document: {document}
            Score:"""
        )
        
        scored_docs = []
        for doc in docs:
            chain = rerank_prompt | self.llm | StrOutputParser()
            score = chain.invoke({
                "question": question,
                "document": doc.page_content[:500]
            })
            try:
                scored_docs.append((float(score.strip()), doc))
            except ValueError:  # the reply was not a bare number
                scored_docs.append((0, doc))
        
        # Sort by score and return top k
        scored_docs.sort(key=lambda x: x[0], reverse=True)
        return [doc for _, doc in scored_docs[:top_k]]
    
    def query_with_cache(self, question: str):
        """Cache frequently asked questions"""
        if question in self.cache:
            print("Using cached result")
            return self.cache[question]
        
        # Perform RAG
        docs = self.vectorstore.as_retriever().invoke(question)
        reranked_docs = self.rerank_results(question, docs)
        
        # Generate answer
        context = "\n\n".join([doc.page_content for doc in reranked_docs])
        answer = f"Based on: {context[:200]}..."  # Simplified
        
        # Cache result
        self.cache[question] = answer
        return answer

# Create optimized RAG
optimized_rag = OptimizedRAG(vectorstore, ChatOpenAI(temperature=0))

# Test optimization techniques
print("Testing chunk size optimization:")
best_chunk = optimized_rag.optimize_chunk_size(documents[:3])
print(f"\nBest chunk size: {best_chunk}")

print("\n" + "="*50)
print("\nTesting query with caching:")
question = "What is machine learning?"
result1 = optimized_rag.query_with_cache(question)
result2 = optimized_rag.query_with_cache(question)  # Should use cache

In [None]:
# RAG Performance Monitoring
import time
from datetime import datetime

class RAGMonitor:
    def __init__(self):
        self.metrics = []
    
    def log_query(self, question: str, retrieval_time: float, 
                  generation_time: float, num_docs: int, answer_length: int):
        """Log performance metrics for each query"""
        self.metrics.append({
            "timestamp": datetime.now().isoformat(),
            "question": question,
            "retrieval_time": retrieval_time,
            "generation_time": generation_time,
            "total_time": retrieval_time + generation_time,
            "num_docs": num_docs,
            "answer_length": answer_length
        })
    
    def get_stats(self) -> Dict:
        """Get performance statistics"""
        if not self.metrics:
            return {}
        
        retrieval_times = [m["retrieval_time"] for m in self.metrics]
        generation_times = [m["generation_time"] for m in self.metrics]
        total_times = [m["total_time"] for m in self.metrics]
        
        return {
            "avg_retrieval_time": np.mean(retrieval_times),
            "avg_generation_time": np.mean(generation_times),
            "avg_total_time": np.mean(total_times),
            "queries_processed": len(self.metrics),
            "p95_total_time": np.percentile(total_times, 95)
        }

# Monitored RAG pipeline
monitor = RAGMonitor()

def monitored_rag_query(question: str, retriever, llm):
    """RAG query with monitoring"""
    # Retrieval phase
    start_retrieval = time.time()
    docs = retriever.get_relevant_documents(question)
    retrieval_time = time.time() - start_retrieval
    
    # Generation phase
    start_generation = time.time()
    context = "\n\n".join([doc.page_content for doc in docs])
    
    prompt = ChatPromptTemplate.from_template(
        """Answer based on context:
        Context: {context}
        Question: {question}
        Answer:"""
    )
    
    chain = prompt | llm | StrOutputParser()
    answer = chain.invoke({"context": context, "question": question})
    generation_time = time.time() - start_generation
    
    # Log metrics
    monitor.log_query(
        question=question,
        retrieval_time=retrieval_time,
        generation_time=generation_time,
        num_docs=len(docs),
        answer_length=len(answer)
    )
    
    return answer

# Test with monitoring
test_questions = [
    "What is deep learning?",
    "Explain computer vision",
    "What are ethical considerations in AI?"
]

for q in test_questions:
    answer = monitored_rag_query(
        q, 
        vectorstore.as_retriever(),
        ChatOpenAI(temperature=0)
    )
    print(f"Q: {q}")
    print(f"A: {answer[:100]}...\n")

# Display performance stats
stats = monitor.get_stats()
print("\nPerformance Statistics:")
for key, value in stats.items():
    if "time" in key:
        print(f"{key}: {value:.3f}s")
    else:
        print(f"{key}: {value}")

---
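
The monitored query above repeats the same start/stop timing pattern for retrieval and generation. One way to reduce that boilerplate is a small context manager. This is a minimal sketch, not part of the notebook's pipeline: the `timed` helper is a hypothetical name, and the two timed blocks are placeholders standing in for the retriever and chain calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(timings: dict, phase: str):
    """Store the elapsed wall-clock time of the enclosed block in timings[phase]."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        # Runs even if the block raises, so failed calls are still timed
        timings[phase] = time.perf_counter() - start


timings = {}

with timed(timings, "retrieval_time"):
    docs = ["chunk one", "chunk two"]  # placeholder for the retriever call

with timed(timings, "generation_time"):
    answer = " ".join(docs)  # placeholder for chain.invoke(...)

timings["total_time"] = timings["retrieval_time"] + timings["generation_time"]
print(timings)
```

The resulting dictionary uses the same keys that `RAGMonitor.log_query` expects, so it could be passed along with `num_docs` and `answer_length`.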

## Learner Activity 3: Build and Optimize a Production RAG System

Create a production-ready RAG system with evaluation and monitoring.

**Task**: Build a complete RAG system for a company knowledge base with optimization and monitoring.

Requirements:
1. Implement automatic chunk size optimization
2. Add query result caching for common questions
3. Build evaluation metrics for quality assurance
4. Create performance monitoring with alerts
5. Implement fallback strategies for poor retrieval
6. Add user feedback collection mechanism
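
As a possible starting point for requirement 4, the sketch below raises an alert when the rolling average of a metric crosses a threshold. The class name `RollingAlert`, the window size, and the sample latencies are illustrative assumptions, not values prescribed by this notebook.

```python
from collections import deque
from statistics import mean


class RollingAlert:
    """Flag when the mean of the last `window` values exceeds `threshold`."""

    def __init__(self, threshold: float, window: int = 10):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # old values drop off automatically

    def record(self, value: float) -> bool:
        """Add a measurement and return True if the rolling mean is too high."""
        self.values.append(value)
        return mean(self.values) > self.threshold


# Hypothetical latencies in seconds; the alert fires once slow queries dominate
latency_alert = RollingAlert(threshold=5.0, window=3)
alerts = [latency_alert.record(v) for v in [1.0, 2.0, 9.0, 9.0, 9.0]]
print(alerts)
```

Averaging over a window rather than alerting on single values avoids firing on one slow query. The trade-off is that a real regression takes a few queries to show up.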

In [None]:
# Build your production RAG system

class ProductionRAGSystem:
    def __init__(self):
        # Initialize your system components
        pass
    
    # TODO: Implement automatic optimization
    def auto_optimize(self, sample_queries: List[str]):
        # Test different configurations
        # Find best settings
        pass
    
    # TODO: Implement smart caching
    def cached_query(self, question: str):
        # Check cache with similarity threshold
        # Return cached or compute new
        pass
    
    # TODO: Add quality checks
    def quality_check(self, question: str, answer: str, context: str) -> Dict:
        # Check answer quality
        # Return quality metrics
        pass
    
    # TODO: Implement monitoring with alerts
    def monitor_performance(self):
        # Track latency, errors, quality
        # Send alerts if thresholds exceeded
        pass
    
    # TODO: Add fallback strategies
    def query_with_fallback(self, question: str):
        # Try primary retrieval
        # If quality low, try alternative strategies
        # If still poor, escalate to human
        pass
    
    # TODO: Collect user feedback
    def collect_feedback(self, question: str, answer: str, rating: int, comment: str = None):
        # Store feedback
        # Use for continuous improvement
        pass

# TODO: Create and test your system
# Your implementation here

# Test scenarios
test_scenarios = [
    "Normal query",
    "Repeated query (should cache)",
    "Query with no good matches (should fallback)",
    "Complex multi-part query"
]

# Your test code here

In [None]:
# Solution (hidden by default)

"""
import hashlib
import time
from collections import deque
from datetime import datetime

import numpy as np

class ProductionRAGSystem:
    def __init__(self, documents: List):
        self.documents = documents
        self.llm = ChatOpenAI(temperature=0)
        self.embeddings = OpenAIEmbeddings()
        self.vectorstore = None
        self.cache = {}
        self.performance_log = deque(maxlen=100)  # Keep last 100 queries
        self.feedback_log = []
        self.optimal_chunk_size = 200
        
        # Quality thresholds
        self.min_relevance_score = 6.0
        self.latency_threshold = 5.0  # seconds
    
    def auto_optimize(self, sample_queries: List[str]):
        '''Test different configurations to find optimal settings'''
        print("Starting auto-optimization...")
        
        configurations = [
            {"chunk_size": 150, "overlap": 30, "k": 3},
            {"chunk_size": 200, "overlap": 50, "k": 4},
            {"chunk_size": 300, "overlap": 75, "k": 5}
        ]
        
        best_config = None
        best_score = 0
        
        for config in configurations:
            # Create vector store with config
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=config["chunk_size"],
                chunk_overlap=config["overlap"]
            )
            splits = splitter.split_documents(self.documents)
            
            test_vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=self.embeddings,
                collection_name=f"test_{config['chunk_size']}"
            )
            
            # Evaluate with sample queries
            scores = []
            for query in sample_queries:
                docs = test_vectorstore.as_retriever(
                    search_kwargs={"k": config["k"]}
                ).get_relevant_documents(query)
                
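                # Placeholder score: fraction of the k requested slots that were filled.
                # This is close to 1.0 for any non-empty store, so replace it with a real
                # relevance metric before trusting the ranking of configurations.
                scores.append(len(docs) / config["k"])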
            
            avg_score = np.mean(scores) if scores else 0
            print(f"Config {config}: Score {avg_score:.2f}")
            
            if avg_score > best_score:
                best_score = avg_score
                best_config = config
        
        # Apply best configuration
        if best_config:
            self.optimal_chunk_size = best_config["chunk_size"]
            splitter = RecursiveCharacterTextSplitter(
                chunk_size=best_config["chunk_size"],
                chunk_overlap=best_config["overlap"]
            )
            splits = splitter.split_documents(self.documents)
            self.vectorstore = Chroma.from_documents(
                documents=splits,
                embedding=self.embeddings,
                collection_name="production"
            )
            print(f"\nOptimization complete. Best config: {best_config}")
    
    def _cache_key(self, question: str) -> str:
        '''Generate cache key for question'''
        return hashlib.md5(question.lower().encode()).hexdigest()
    
    def cached_query(self, question: str):
        '''Query with intelligent caching'''
        # Check exact match cache
        cache_key = self._cache_key(question)
        if cache_key in self.cache:
            print("[Cache Hit] Returning cached result")
            return self.cache[cache_key]["answer"]
        
        # Check for similar questions in cache. Keys are MD5 hashes, so the
        # original question text is stored with each answer for comparison.
        question_words = set(question.lower().split())
        for entry in self.cache.values():
            # Simple word-overlap check (could use embeddings)
            if len(question_words & set(entry["question"].lower().split())) > 3:
                print("[Partial Cache Hit] Found similar question")
                return entry["answer"]
        
        # Compute new answer
        answer = self.query_with_fallback(question)
        self.cache[cache_key] = {"question": question, "answer": answer}
        
        # Limit cache size
        if len(self.cache) > 100:
            # Remove the oldest entry (dicts preserve insertion order)
            del self.cache[next(iter(self.cache))]
        
        return answer
    
    def quality_check(self, question: str, answer: str, docs: List) -> Dict:
        '''Evaluate answer quality'''
        # Check relevance
        relevance_prompt = ChatPromptTemplate.from_template(
            '''Rate how well this answer addresses the question (0-10):
            Question: {question}
            Answer: {answer}
            Score:'''
        )
        
        chain = relevance_prompt | self.llm | StrOutputParser()
        relevance_score = chain.invoke({"question": question, "answer": answer})
        
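        try:
            relevance = float(relevance_score.strip())
        except ValueError:
            # Non-numeric reply from the LLM: treat as not relevant
            relevance = 0.0
        
        # Check groundedness: share of distinct answer words that also appear in the context
        context = "\n".join([doc.page_content for doc in docs])
        answer_words = set(answer.split())
        grounded = (
            len(answer_words & set(context.split())) / len(answer_words)
            if answer_words else 0.0  # avoid dividing by zero on an empty answer
        )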
        
        return {
            "relevance": relevance,
            "groundedness": grounded,
            "confidence": (relevance / 10) * grounded,
            "passed": relevance >= self.min_relevance_score
        }
    
    def monitor_performance(self, query_data: Dict):
        '''Monitor and alert on performance issues'''
        self.performance_log.append(query_data)
        
        # Check recent performance
        recent_latencies = [d["latency"] for d in self.performance_log if "latency" in d]
        recent_qualities = [d["quality"]["relevance"] for d in self.performance_log if "quality" in d]
        
        # Alert conditions
        if recent_latencies and np.mean(recent_latencies[-10:]) > self.latency_threshold:
            print("⚠️  ALERT: High latency detected!")
            print(f"   Average latency: {np.mean(recent_latencies[-10:]):.2f}s")
        
        if recent_qualities and np.mean(recent_qualities[-10:]) < self.min_relevance_score:
            print("⚠️  ALERT: Low quality responses detected!")
            print(f"   Average quality: {np.mean(recent_qualities[-10:]):.1f}/10")
    
    def query_with_fallback(self, question: str):
        '''Query with multiple fallback strategies'''
        start_time = time.time()
        
        if not self.vectorstore:
            return "System not initialized. Please run auto_optimize first."
        
        # Strategy 1: Standard retrieval
        retriever = self.vectorstore.as_retriever(search_kwargs={"k": 3})
        docs = retriever.get_relevant_documents(question)
        
        if docs:
            # Generate answer
            context = "\n\n".join([doc.page_content for doc in docs])
            prompt = ChatPromptTemplate.from_template(
                '''Answer the question based on the context.
                Context: {context}
                Question: {question}
                Answer:'''
            )
            
            chain = prompt | self.llm | StrOutputParser()
            answer = chain.invoke({"context": context, "question": question})
            
            # Quality check
            quality = self.quality_check(question, answer, docs)
            
            if quality["passed"]:
                # Log performance
                self.monitor_performance({
                    "question": question,
                    "latency": time.time() - start_time,
                    "quality": quality,
                    "strategy": "standard"
                })
                return answer
        
        # Strategy 2: Expanded search
        print("[Fallback] Trying expanded search...")
        retriever_expanded = self.vectorstore.as_retriever(search_kwargs={"k": 6})
        docs = retriever_expanded.get_relevant_documents(question)
        
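        # Simplified: returns a context preview instead of a generated answer.
        # A vector store returns its nearest chunks even for off-topic questions,
        # so a production system would run quality_check again before answering here.
        if docs: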
            context = "\n\n".join([doc.page_content for doc in docs[:4]])
            answer = f"Based on expanded search: {context[:200]}..."
            return answer
        
        # Strategy 3: Escalate to human
        print("[Fallback] No good matches found. Escalating...")
        return '''I couldn't find sufficient information to answer your question confidently. 
        This has been flagged for human review. In the meantime, you may want to:
        1. Try rephrasing your question
        2. Contact support directly at support@company.com'''
    
    def collect_feedback(self, question: str, answer: str, rating: int, comment: str = None):
        '''Collect and store user feedback'''
        feedback = {
            "timestamp": datetime.now().isoformat(),
            "question": question,
            "answer": answer[:200],  # Store preview
            "rating": rating,
            "comment": comment
        }
        
        self.feedback_log.append(feedback)
        
        # Analyze feedback trends
        recent_ratings = [f["rating"] for f in self.feedback_log[-20:]]
        if recent_ratings:
            avg_rating = np.mean(recent_ratings)
            if avg_rating < 3:
                print(f"⚠️  Low user satisfaction: {avg_rating:.1f}/5")
                print("   Consider updating the knowledge base or retrieval settings")
        
        return f"Thank you for your feedback! (Rating: {rating}/5)"

# Create and test the system
print("Creating Production RAG System...")
prod_system = ProductionRAGSystem(documents)

# Auto-optimize
sample_queries = [
    "What is machine learning?",
    "How do neural networks work?",
    "What are AI ethics?"
]
prod_system.auto_optimize(sample_queries)

print("\n" + "="*60)
print("Testing Production System:\n")

# Test different scenarios
test_cases = [
    "What are the types of machine learning?",
    "What are the types of machine learning?",  # Repeated (should cache)
    "Explain quantum computing",  # Not in knowledge base
    "How do CNNs work in computer vision?"
]

for i, question in enumerate(test_cases):
    print(f"\nTest {i+1}: {question}")
    answer = prod_system.cached_query(question)
    print(f"Answer: {answer[:150]}...")
    
    # Simulate user feedback
    rating = 4 if "couldn't find" not in answer else 2
    feedback_response = prod_system.collect_feedback(
        question, answer, rating, "Good answer" if rating > 3 else "Needs improvement"
    )
    print(feedback_response)
    print("-" * 40)
"""

print("Build a complete production RAG system with optimization and monitoring!")
print("The solution includes auto-optimization, caching, quality checks, and feedback collection.")

---
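
Both the reranker and the quality check above convert the LLM's reply with `float(...)`, which fails whenever the model adds any text around the number. A more tolerant parser might look like the sketch below. The helper name `parse_score` and the clamping range are assumptions for illustration. It takes the first number in the reply, so a response that restates the scale before giving its score would be misread.

```python
import re
from typing import Optional

# Matches an integer or decimal such as "7" or "7.5"
_NUMBER = re.compile(r"\d+(?:\.\d+)?")


def parse_score(raw: str, low: float = 0.0, high: float = 10.0) -> Optional[float]:
    """Extract the first number from an LLM reply and clamp it to [low, high].

    Returns None when no number is present so the caller can decide how to
    treat an unparseable reply (for example, rank the document last).
    """
    match = _NUMBER.search(raw)
    if match is None:
        return None
    return min(max(float(match.group()), low), high)


for reply in ["8", "Score: 7.5/10", "I would say 12", "not sure"]:
    print(repr(reply), "->", parse_score(reply))
```

Returning `None` instead of silently using `0` keeps "the model gave a low score" distinguishable from "the reply could not be parsed", which is useful when monitoring quality.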

## Summary and Next Steps

Congratulations! You've learned how to build complete RAG systems. You can now:

- ✅ Build end-to-end RAG pipelines from documents to answers
- ✅ Implement advanced retrieval strategies (multi-query, compression, hybrid)
- ✅ Use metadata filtering for targeted retrieval
- ✅ Evaluate and optimize RAG performance
- ✅ Monitor production RAG systems
- ✅ Implement fallback strategies and quality checks

### Key Takeaways:
- **Retrieval Quality Matters**: Good retrieval is crucial for good answers
- **Multiple Strategies**: Combine different retrieval approaches for robustness
- **Evaluation is Essential**: Measure relevance, faithfulness, and performance
- **Optimization**: Chunk size, reranking, and caching improve results
- **Production Considerations**: Monitor, collect feedback, and iterate
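
To make the evaluation takeaway concrete, here is a rough lexical proxy for faithfulness: the share of distinct answer tokens that also occur in the retrieved context. This is a sketch under simplifying assumptions. Token overlap ignores paraphrase and word order, so it is a cheap sanity check and not a substitute for LLM-graded or human evaluation. The function names are hypothetical.

```python
import re
from typing import Set


def token_set(text: str) -> Set[str]:
    """Lowercase the text and return its distinct alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def groundedness(answer: str, context: str) -> float:
    """Fraction of distinct answer tokens that also appear in the context."""
    answer_tokens = token_set(answer)
    if not answer_tokens:
        return 0.0  # an empty answer is treated as ungrounded
    return len(answer_tokens & token_set(context)) / len(answer_tokens)


context = "The capital of France is Paris."
print(groundedness("Paris is the capital", context))
print(groundedness("Berlin is big", context))
```

The first answer reuses only words from the context and scores high. The second shares a single word and scores low.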

### Next Steps:
- **Notebook 08**: Learn about Tools and Agents for autonomous AI systems
- **Practice**: Build RAG systems for different domains
- **Experiment**: Try different embedding models and vector stores
- **Scale**: Implement distributed RAG for large knowledge bases

### Additional Challenges:
1. Implement RAG with multiple data sources (PDFs, websites, databases)
2. Build a conversational RAG with memory
3. Create a RAG system with real-time document updates
4. Implement semantic caching with embedding similarity
5. Build a multi-lingual RAG system
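
For challenge 4, a semantic cache returns a stored answer when a new question is close enough to a cached one in embedding space. The sketch below shows the lookup logic only. `toy_embed` is a stand-in based on word counts so the example runs offline, and the `0.8` threshold is an arbitrary assumption. In practice you would pass in a real embedding function and tune the threshold on your own queries.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Optional, Tuple

Vector = Dict[str, float]


def toy_embed(text: str) -> Vector:
    """Stand-in embedding (word counts). Replace with a real embedding model."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(token, 0.0) for token, value in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], Vector] = toy_embed, threshold: float = 0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []  # (question vector, answer)

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))

    def get(self, question: str) -> Optional[str]:
        """Return the answer of the most similar cached question, if similar enough."""
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None


semantic_cache = SemanticCache()
semantic_cache.put("what is machine learning", "ML answer")
print(semantic_cache.get("what is machine learning ?"))   # near-duplicate question
print(semantic_cache.get("explain quantum computing"))    # unrelated question
```

The linear scan is fine for a small cache. A larger one could store the question vectors in the same kind of vector store used for document retrieval.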