# Lab 27: Web-Based RAG System - Real-Time Information Retrieval

## Learning Objectives
In this lab, you will learn how to:
- Build a complete RAG system that processes live web content for up-to-date information retrieval
- Load and extract content from web pages using WebBaseLoader for HTML processing
- Reuse the RAG pipeline architecture from the PDF lab (Lab 26) with web-based content sources
- Understand how to adapt RAG systems for dynamic, frequently-updated content
- Compare web-based RAG with document-based RAG for different use cases
- Build systems that can answer questions about current events and online information

## Overview
This lab extends the RAG system architecture to work with web content, enabling real-time information retrieval from online sources. You'll learn how to process HTML content, handle web-specific challenges, and create question-answering systems that can work with the latest information from websites. This is essential for building AI systems that need access to current, frequently-updated information rather than static documents.

In [None]:
# Web-Based RAG System Implementation - Essential Imports
# This lab demonstrates adapting the RAG architecture for real-time web content processing
# enabling question-answering systems that work with live, frequently-updated information

# Web Content Loading and Processing
from langchain_community.document_loaders import WebBaseLoader  # HTML content extraction from web pages
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Separator-aware text chunking for web content

# Vector Storage and Embeddings (same as PDF RAG)
from langchain_openai import OpenAIEmbeddings  # High-quality semantic embeddings for web text
from langchain_chroma import Chroma  # Vector database for similarity search

# LLM and Chain Components (consistent RAG architecture)
from langchain_openai import ChatOpenAI  # OpenAI's chat model for answer generation
from langchain_core.prompts import PromptTemplate  # Structured prompt templates
from langchain_core.runnables import RunnablePassthrough  # Data flow management
from langchain_core.output_parsers import StrOutputParser  # Clean string output

In [None]:
# OpenAI API Configuration
# Configure authentication for OpenAI services (embeddings and chat model)
import os

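# Set OpenAI API key for embedding generation and LLM inference
# Required for both the text-embedding-3-large model and the ChatOpenAI model
# Replace the placeholder with your key, or export OPENAI_API_KEY before starting the notebook
# setdefault() keeps a key that is already set instead of overwriting it with the placeholder
os.environ.setdefault("OPENAI_API_KEY", "your-api-key")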

In [None]:
# Initialize Web Content Loader for Real-Time Information
# WebBaseLoader extracts clean text content from HTML web pages
# Essential for building RAG systems that work with current, online information

# Target URL: The Verge article about Meta's AI assistant and Llama 3
# This demonstrates processing current tech news and product announcements
URL = "https://www.theverge.com/2024/4/18/24133808/meta-ai-assistant-llama-3-chatgpt-openai-rival"

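# WebBaseLoader behavior:
# - Fetches web page content via HTTP requests
# - Parses the HTML with BeautifulSoup and extracts the page's text
# - Does not filter navigation, ads, or other boilerplate by default;
#   pass bs_kwargs (for example a bs4.SoupStrainer) to restrict parsing to the article body
# - Drops HTML markup, so the original formatting is not preserved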
loader = WebBaseLoader(URL)

print("🌐 Web loader initialized for The Verge article")
print("📰 URL: Meta AI assistant and Llama 3 announcement")
print("🔧 WebBaseLoader will extract clean text from HTML content")
print("📊 Content will be processed for real-time question answering")

In [None]:
# Load and Extract Web Page Content
# The load() method fetches the web page and extracts its text content
# Unlike PDF processing (one document per page), a single URL yields a single Document

# Web content loading process:
# 1. HTTP request to fetch the web page
# 2. HTML parsing with BeautifulSoup
# 3. Text extraction that drops HTML tags (navigation and ad text remain unless filtered)
# 4. Wrapping the text and page metadata in a Document object
# Note: load() is used rather than load_and_split(), which would also apply a default
# text splitter; chunking is configured explicitly in the next cell
pages = loader.load()

print(f"🌐 Successfully loaded web content: {len(pages)} document(s)")
print("📰 Extracted clean text from HTML article")
print("🔍 Content ready for chunking and embedding")

# Display information about the loaded web content
if pages:
    print(f"📄 Content preview: {pages[0].page_content[:300]}...")
    print(f"🏷️ Metadata: {pages[0].metadata}")
    print(f"📊 Total content length: {len(pages[0].page_content)} characters")
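
The extraction step can be illustrated without any network access. The following is a minimal sketch using only Python's standard library, not WebBaseLoader's actual implementation (which relies on BeautifulSoup). The `TextExtractor` class, the `SKIP_TAGS` set, and the sample HTML are all hypothetical and exist only to show why boilerplate such as navigation links ends up in the text unless it is filtered out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything nested inside SKIP_TAGS."""

    # Hypothetical choice of tags to treat as boilerplate
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def extract_text(html):
    """Return the visible text of an HTML string, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a> <a href="/tech">Tech</a></nav>
  <article><h1>Example headline</h1><p>Example body text.</p></article>
  <footer>Example footer</footer>
</body></html>
"""

article_text = extract_text(SAMPLE_HTML)
print(article_text)
```

If `SKIP_TAGS` were empty, the navigation and footer text would be included in the result, which is closer to what an unfiltered page load produces.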

In [None]:
# Text Chunking for Web Content Processing
# Apply the same chunking strategy used for PDF documents
# Keeping chunking parameters consistent makes results comparable across content types

# Configure the text splitter:
# - chunk_size=200: Small chunks (measured in characters) for precise retrieval and focused context
# - chunk_overlap=50: 25% overlap preserves context across chunk boundaries
# These values are a starting point; tune them for your own content and questions
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=50)

# Split web content into optimized chunks for vector storage
chunks = text_splitter.split_documents(pages)

print(f"📦 Split web content into {len(chunks)} chunks")
print("📏 Consistent chunking: 200 characters with 50-character overlap")
print("🎯 Optimized for precise retrieval from web article content")

# Display chunk statistics for web content
if chunks:
    avg_length = sum(len(chunk.page_content) for chunk in chunks) / len(chunks)
    print(f"📊 Average chunk length: {avg_length:.1f} characters")
    print(f"🔍 Sample chunk: {chunks[0].page_content[:150]}...")
    print(f"📰 Web content successfully prepared for vector storage")
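
To make the effect of `chunk_size=200` and `chunk_overlap=50` concrete, here is a simplified sketch of fixed-width sliding-window chunking. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs and sentences, so its chunks are usually shorter than the limit). The function name `sliding_chunks` and the synthetic sample text are assumptions made for illustration.

```python
def sliding_chunks(text, chunk_size=200, chunk_overlap=50):
    """Split text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # each window starts 150 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += stride
    return chunks


# Synthetic 1000-character text so the arithmetic is easy to follow
sample_text = "".join(chr(ord("a") + i % 26) for i in range(1000))
sample_chunks = sliding_chunks(sample_text)

print(f"Number of chunks: {len(sample_chunks)}")
print(f"Last chunk length: {len(sample_chunks[-1])}")
print(f"Overlap preserved: {sample_chunks[0][-50:] == sample_chunks[1][:50]}")
```

Because each window advances by 150 characters, the overlap means roughly one third more text is embedded than the page actually contains, which is worth keeping in mind when estimating embedding cost.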

In [None]:
# Initialize High-Quality Embedding Model for Web Content
# Using the same OpenAI text-embedding-3-large model ensures consistent performance
# across different content sources (PDF documents, web pages, etc.)

# Configure OpenAI's larger text-embedding-3 model:
# - 3072 dimensions by default for rich semantic representation
# - Strong performance on diverse content types
# - Suitable for both formal documents and web article content
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

print("🚀 OpenAI Embeddings initialized with text-embedding-3-large")
print("📐 Generates 3072-dimensional vectors for semantic search")
print("🌐 Optimized for diverse content: web articles, news, technical content")
print("🎯 Consistent embedding quality across PDF and web content sources")
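
The 3072-dimension figure has a direct storage cost. The short calculation below is a rough sketch that assumes each vector component is stored as a 32-bit float and ignores the text, metadata, and index overhead a real vector database adds. The function name `embedding_storage_bytes` is hypothetical.

```python
EMBEDDING_DIMENSIONS = 3072  # dimensionality of text-embedding-3-large, as stated above
BYTES_PER_FLOAT32 = 4        # assumption: components stored as 32-bit floats


def embedding_storage_bytes(num_chunks, dimensions=EMBEDDING_DIMENSIONS):
    """Raw size of the vectors alone, excluding text, metadata, and indexes."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32


for num_chunks in (100, 10_000):
    mebibytes = embedding_storage_bytes(num_chunks) / (1024 ** 2)
    print(f"{num_chunks} chunks -> about {mebibytes:.1f} MiB of raw vectors")
```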

In [None]:
# Create Vector Database from Web Content Chunks
# Apply the same vector storage approach used for PDF documents
# Demonstrates the flexibility of RAG architecture across different content sources

# The from_documents() method performs the complete embedding pipeline:
# 1. Generates embeddings for each web content chunk
# 2. Creates Chroma vector database instance
# 3. Stores embedded chunks with original text and web metadata
# 4. Builds similarity search indexes for fast question-answering
vectorstore = Chroma.from_documents(documents=chunks, embedding=embeddings)

print("🗄️ Vector store created from web content!")
print(f"📚 Embedded and stored {len(chunks)} web article chunks")
print("🌐 Vector database contains real-time information from The Verge")
print("⚡ Ready for semantic search on current tech news and AI developments")

In [None]:
# Initialize Web Content Retriever
# Convert vector store into a retriever for seamless integration with RAG chain
# Same interface as PDF RAG system, demonstrating architectural consistency

# as_retriever() creates a standardized interface that:
# - Accepts questions about web content and converts them to embeddings
# - Performs similarity search against stored web article vectors
# - Returns the most relevant chunks (the top 4 by default)
# - Integrates seamlessly with the question-answering pipeline
retriever = vectorstore.as_retriever()

print("🔍 Web content retriever initialized")
print("📊 Retriever will find relevant chunks from The Verge article")
print("🎯 Semantic search on current Meta AI and Llama 3 information")
print("⚡ Ready for real-time question answering about tech developments")
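
Conceptually, retrieval ranks stored chunk vectors by how close they are to the question's vector and returns the best few. The sketch below shows that idea with hand-written three-dimensional vectors and cosine similarity. It is illustrative only: the vectors, chunk labels, and the `top_k` helper are made up, and the distance metric a real vector store uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, indexed_chunks, k=2):
    """Return the texts of the k chunks most similar to the query vector."""
    ranked = sorted(
        indexed_chunks,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy index: (chunk text, made-up embedding)
TOY_INDEX = [
    ("chunk about model sizes", [0.9, 0.1, 0.0]),
    ("chunk about release dates", [0.1, 0.9, 0.0]),
    ("chunk about site navigation", [0.0, 0.1, 0.9]),
]

toy_query = [0.8, 0.2, 0.0]  # made-up embedding of a question about model sizes
toy_results = top_k(toy_query, TOY_INDEX, k=2)
print(toy_results)
```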

In [None]:
# Document Formatting Utility Function
# Identical to PDF RAG system - demonstrates reusable components across content types
# Prepares web content chunks for optimal LLM processing

def format_docs(docs):
    """
    Format retrieved web content chunks for LLM consumption.
    
    Args:
        docs: List of retrieved document chunks from web content vector search
    
    Returns:
        str: Formatted string with content separated by double newlines
    """
    # Join web content chunks with clear separators for better LLM comprehension
    # Double newlines create clear boundaries between different article sections
    return "\n\n".join(doc.page_content for doc in docs)

print("📝 Document formatter function defined")
print("🌐 Converts retrieved web content into clean context for LLM")
print("🔧 Same formatting approach works for both PDF and web content")
print("📋 Maintains clear separation between different content chunks")
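
The formatter can be tried in isolation before it is wired into a chain. This sketch uses a small stand-in class, `FakeDocument`, which is hypothetical and only mimics the `page_content` and `metadata` attributes of a LangChain document; the chunk texts are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in exposing the two attributes the formatter relies on."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    """Join chunk texts with blank lines, as in the cell above."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [
    FakeDocument("First retrieved chunk."),
    FakeDocument("Second retrieved chunk."),
]
formatted_context = format_docs(retrieved)
print(formatted_context)
```

An empty retrieval result produces an empty context string, so the prompt's fallback instruction is what keeps the model from guessing in that case.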

In [None]:
# Initialize Language Model for Web Content Question Answering
# Same ChatOpenAI configuration as PDF RAG system
# Demonstrates how LLM components work consistently across different content sources

# ChatOpenAI configuration for web-based RAG:
# - Names gpt-3.5-turbo explicitly (cost-effective and fast) rather than relying on the library default
# - Receives the retrieved web content as context and generates the answer
# - Grounding comes from the prompt, which restricts answers to the retrieved context
# - Handles current events and technical information when the context contains them
llm = ChatOpenAI(model="gpt-3.5-turbo")

print("🤖 ChatOpenAI language model initialized")
print("💬 Using GPT-3.5-turbo for web content question answering")
print("📰 Model will generate answers based on current tech article content")
print("✅ Ready to process questions about Meta AI, Llama 3, and current developments")

In [None]:
# Create Structured Prompt Template for Web-Based RAG
# Identical prompt template structure as PDF RAG system
# Demonstrates consistency and reusability of RAG architecture components

# The same prompt design principles apply to web content:
# 1. Define a clear role as a factual question-answering bot
# 2. Emphasize responding only from the provided web article context
# 3. Include fallback behavior for information not in the article
# 4. Use clear variable placeholders for the question and the retrieved context
# Note: "SYSTEM:" is plain text here; PromptTemplate produces a single string, not chat roles
template = """SYSTEM: You are a question-answering bot.
Be factual in your response.
Answer the following question using only the context below.
If the context does not contain the answer, just say that you don't know.

Question: {question}

Context:
{context}
"""

# Convert template string into a LangChain PromptTemplate object
prompt = PromptTemplate.from_template(template)

print("📝 Web RAG prompt template created")
print("🎯 Template ensures factual responses based on web article content")
print("🚫 Reduces hallucination risk - instructs the model to answer only from retrieved web content")
print("🔧 Variables: {question} for user query, {context} for web article chunks")
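
It can help to look at the final text the model receives. The sketch below fills a template with plain `str.format` as a stand-in for what the prompt object does with its two variables. The shortened `TEMPLATE` string, the `render_prompt` helper, and the placeholder context are assumptions for illustration, not the lab's exact prompt.

```python
# Shortened stand-in for the lab's template, with the same two placeholders
TEMPLATE = (
    "You are a question-answering bot. Be factual.\n"
    "Answer the question using only the context below.\n"
    "If the context does not contain the answer, say that you don't know.\n\n"
    "Question: {question}\n\n"
    "Context:\n{context}\n"
)


def render_prompt(question, context):
    """Substitute the user question and the formatted chunks into the template."""
    return TEMPLATE.format(question=question, context=context)


rendered = render_prompt(
    question="What's the size of the largest Llama 3 model?",
    context="Placeholder chunk A.\n\nPlaceholder chunk B.",
)
print(rendered)
```

Printing the rendered prompt during development is a quick way to confirm that the retrieved chunks actually contain the information the question asks for.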

In [None]:
# Build Complete Web-Based RAG Question-Answering Chain
# Identical architecture to PDF RAG system - demonstrates RAG pattern consistency
# Shows how the same pipeline works for different content sources

# Web RAG Chain Architecture (same as PDF RAG):
# 1. Input: User question about web content
# 2. Parallel Processing:
#    - retriever | format_docs: Finds relevant web article chunks and formats context
#    - RunnablePassthrough(): Passes the original question unchanged
# 3. prompt: Combines web content context and question into structured prompt
# 4. llm: Generates answer based on web article information
# 5. StrOutputParser(): Extracts clean string response

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print("⛓️ Complete web-based RAG chain constructed!")
print("🔄 Pipeline: Question → Retrieve Web Content → Format → Generate → Parse")
print("🌐 Parallel processing: Web content retrieval + Question passthrough")
print("🎯 End-to-end system ready for real-time web content question answering")
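
The data flow through the chain can be mimicked with ordinary functions, which makes the fan-out step easier to see: the same input question feeds both the retrieval branch and the passthrough branch. Everything in this sketch is a stand-in. `fake_retriever` and `fake_llm` are hypothetical stubs that return placeholder text, and no LangChain code is involved.

```python
def fake_retriever(question):
    """Stub retriever: returns placeholder chunks regardless of the question."""
    return ["Placeholder chunk A.", "Placeholder chunk B."]


def format_chunks(chunks):
    return "\n\n".join(chunks)


def build_inputs(question):
    """Mirrors the dict step: one input, two branches."""
    return {
        "context": format_chunks(fake_retriever(question)),  # retriever | format_docs
        "question": question,                                # RunnablePassthrough()
    }


def render_prompt(inputs):
    return f"Question: {inputs['question']}\n\nContext:\n{inputs['context']}"


def fake_llm(prompt_text):
    """Stub model: reports the prompt size instead of generating an answer."""
    return f"[model output for a prompt of {len(prompt_text)} characters]"


def run_chain(question):
    """Question -> inputs dict -> prompt text -> model output -> plain string."""
    inputs = build_inputs(question)
    prompt_text = render_prompt(inputs)
    return str(fake_llm(prompt_text))


chain_inputs = build_inputs("Example question?")
chain_output = run_chain("Example question?")
print(chain_inputs)
print(chain_output)
```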

In [None]:
# Execute Web RAG System with Current Tech Question
# Test the system with a question about recent AI developments
# Demonstrates real-time information retrieval from web sources

print("❓ Testing web RAG system with current AI technology question...")
print("🔍 Question: 'What's the size of the largest Llama 3 model?'")
print("🌐 System will search through The Verge article for Llama 3 information")
print()

# Invoke the complete web RAG chain:
# 1. Retrieves relevant chunks about Llama 3 model specifications
# 2. Formats web content context for the language model
# 3. Generates accurate answer based on current article information
# 4. Returns factual response about latest AI developments
response = chain.invoke("What's the size of the largest Llama 3 model?")

print("💬 Web RAG System Response:")
print("=" * 50)
print(response)
print("=" * 50)
print()
print("✅ Web RAG chain returned a response - check it against the article before trusting it")
print("🌐 Response is grounded in the content fetched from The Verge article at load time")
print("📰 Demonstrates RAG system capability with frequently-updated online sources")

## Key Takeaways and Web RAG Insights

### What You've Accomplished
1. **Web-Based RAG System**: Adapted the RAG architecture for real-time web content processing
2. **Content Diversity**: Demonstrated RAG flexibility across different source types (PDF vs Web)
3. **Current Information**: Built a system capable of processing frequently-updated online content
4. **Architectural Consistency**: Reused the same components and patterns from the PDF RAG system
5. **Real-Time Retrieval**: Created a question-answering system for current events and developments

### Technical Comparison: PDF vs Web RAG
| Component | PDF RAG (Lab 26) | Web RAG (Lab 27) |
|-----------|------------------|------------------|
| **Loader** | PyPDFLoader | WebBaseLoader |
| **Content Type** | Static documents | Dynamic web pages |
| **Update Frequency** | Infrequent | Frequent; re-fetch and re-index to stay current |
| **Processing** | Page-based chunks | Article-based chunks |
| **Use Cases** | Policies, manuals | News, current events |

### Architectural Advantages
- **Component Reusability**: Same text splitter, embeddings, vector store, and chain architecture
- **Consistent Performance**: Identical chunking and retrieval strategies across content types
- **Scalable Design**: Easy to extend to additional content sources (APIs, databases, etc.)
- **Unified Interface**: Same question-answering experience regardless of content source

### Web RAG Specific Benefits
- **Current Information**: Access to latest news, announcements, and developments
- **Dynamic Content**: Handles frequently-updated online sources
- **Source Metadata**: WebBaseLoader records the source URL and, when the page provides them, the title, description, and language
- **Broad Coverage**: Can process diverse web content from news sites, blogs, documentation
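
Handling frequently-updated sources usually means deciding when a page has changed enough to re-embed. One possible approach, sketched here under the assumption that comparing a hash of the extracted text is sufficient, is to store a fingerprint per URL and re-index only when it differs. The helper names and the in-memory `fingerprints` dictionary are hypothetical; a real system would persist this state.

```python
import hashlib


def content_fingerprint(text):
    """SHA-256 of the text after collapsing whitespace, so layout noise is ignored."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def needs_reindex(url, new_text, fingerprints):
    """Return True and record the new fingerprint if the page text changed."""
    new_fingerprint = content_fingerprint(new_text)
    if fingerprints.get(url) == new_fingerprint:
        return False
    fingerprints[url] = new_fingerprint
    return True


fingerprints = {}
example_url = "https://example.com/article"  # placeholder URL

first_visit = needs_reindex(example_url, "Original article text.", fingerprints)
same_text = needs_reindex(example_url, "Original   article text.\n", fingerprints)
updated = needs_reindex(example_url, "Updated article text.", fingerprints)
print(first_visit, same_text, updated)
```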

### Production Considerations for Web RAG
- **Content Freshness**: Implement regular re-indexing for frequently-updated sources
- **Rate Limiting**: Respect website rate limits and robots.txt policies
- **Content Quality**: Filter and validate web content for accuracy and relevance
- **Legal Compliance**: Ensure proper permissions for web content usage
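
For the robots.txt point, Python's standard library includes `urllib.robotparser`. The sketch below parses an inline example policy so it runs offline; against a live site you would instead call `set_url()` and `read()` on the parser. The robots.txt content, the user agent string, and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up policy for illustration; a real one is served at /robots.txt
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_policy(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


USER_AGENT = "example-rag-bot"  # hypothetical user agent name

policy = build_policy(EXAMPLE_ROBOTS_TXT)
allowed = policy.can_fetch(USER_AGENT, "https://example.com/news/article")
blocked = policy.can_fetch(USER_AGENT, "https://example.com/private/draft")
delay_seconds = policy.crawl_delay(USER_AGENT)

print(f"News article allowed: {allowed}")
print(f"Private page allowed: {blocked}")
print(f"Requested delay between requests: {delay_seconds} seconds")
```

A loader loop could check `can_fetch` before each request and sleep for the crawl delay between requests. Note that robots.txt covers crawling etiquette only and does not settle licensing or copyright questions.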

### Real-World Applications
- **News Analysis**: Question-answering systems for current events and breaking news
- **Market Research**: Real-time analysis of industry developments and trends
- **Product Updates**: Customer support systems with latest product information
- **Compliance Monitoring**: Track regulatory changes and policy updates
- **Competitive Intelligence**: Monitor competitor announcements and developments

### Integration Possibilities
- **Hybrid Systems**: Combine PDF and web RAG for comprehensive knowledge bases
- **Multi-Source Retrieval**: Aggregate information from documents and web sources
- **Automated Updates**: Schedule regular web content indexing for fresh information
- **Source Attribution**: Track and cite specific web sources in generated responses
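
Source attribution can start with the formatter. The following sketch is a hypothetical variant of the lab's formatting function that numbers each chunk by its source URL and returns a matching reference list. It assumes each document carries a `source` entry in its metadata; `FakeDocument` and the example URLs are stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in exposing page_content and metadata like a loaded document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs):
    """Prefix each chunk with a source number and build the reference list."""
    sources = []  # unique source URLs in first-seen order
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown source")
        if source not in sources:
            sources.append(source)
        blocks.append(f"[{sources.index(source) + 1}] {doc.page_content}")
    references = [f"[{number}] {url}" for number, url in enumerate(sources, start=1)]
    return "\n\n".join(blocks), references


example_docs = [
    FakeDocument("Chunk A.", {"source": "https://example.com/a"}),
    FakeDocument("Chunk B.", {"source": "https://example.com/a"}),
    FakeDocument("Chunk C.", {"source": "https://example.com/b"}),
]
cited_context, reference_list = format_docs_with_sources(example_docs)
print(cited_context)
print(reference_list)
```

The prompt can then ask the model to cite the bracketed numbers, and the reference list can be appended to the answer.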