<a href="https://colab.research.google.com/github/sayaleepande/GenAI/blob/main/M8_Lab1_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![RAG Intro Lab](https://www.dropbox.com/scl/fi/hyyhk4rkslkgi6rd1ki5z/RAG_Intro_Lab.png?rlkey=31iuzmmvt9ta3xf649jma4y67&raw=1)


In [1]:
# 📦 Installing Required Libraries for LangChain RAG Lab (Quiet Mode)
# ==================================================

# --- Core LangChain and OpenAI Integration ---
!pip install -q --upgrade langchain langchain-community langchain-openai

# --- OpenAI SDK ---
# 'openai': Required to access GPT-3.5/4 and manage API keys, works with both LangChain and direct calls
!pip install -q --upgrade openai

# --- Vector Databases for Retrieval (RAG) ---
# 'faiss-cpu': Facebook's FAISS for fast vector search (in-memory or persistent)
# 'chromadb': Lightweight vector database, ideal for local demos and quick setup
!pip install -q --upgrade faiss-cpu chromadb

# --- Tokenization and Unstructured Data Support ---
# 'tiktoken': Fast, efficient tokenizer (used with OpenAI, supports counting tokens accurately)
# 'unstructured': Loads/cleans data from PDFs, DOCX, HTML, email, etc. for use in retrieval pipelines
# 'unstructured[pdf]': Adds PDF parsing support (using pdfminer, pypdf, etc.)
# 'pypdf', 'pdfminer.six': Popular PDF parsing backends, required for some document loaders
!pip install -q --upgrade tiktoken unstructured "unstructured[pdf]" pypdf pdfminer.six



In [None]:
# 📚 LangChain RAG Lab: Library Imports & Setup
# ====================================================
# ✅ This cell handles all required imports, grouped by category for clarity.

# 🧱 System & Environment Setup
import os  # Environment variable access
import requests  # For fetching remote resources (e.g., PDFs, data files)
from google.colab import userdata  # Accessing Colab-specific secure data

# 🧪 Jupyter & Colab Display Utilities
import ipywidgets as widgets  # Interactive widgets
from IPython.display import clear_output, display, HTML  # Display controls

# 🔑 OpenAI API
import openai  # Optional: raw API access (not required for LangChain unless custom use)

# 🧠 LangChain Core Modules
from langchain_openai import ChatOpenAI, OpenAIEmbeddings  # LLM + Embeddings via OpenAI
from langchain_core.prompts import PromptTemplate  # Structured prompt templates
from langchain.memory import ConversationBufferMemory  # For chat history memory


# ✅ Confirmation
print("✅ All libraries imported and categorized successfully!")


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.description-box {
    background-color: #f8f9fa;
    padding: 14px;
    margin: 12px 0;
    border-radius: 4px;
    border: 1px solid #e0e0e0;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}

.code-note {
    background-color: #fff;
    padding: 10px;
    margin: 10px 0;
    border-radius: 4px;
    border: 1px solid #dadce0;
    font-size: 0.9em;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>🖨️ Pretty Print Function</h2>
</div>

<div class="description-box">
    <p style="margin: 0 0 8px 0;">The <span class="highlight">pretty_print()</span> function renders text as a styled HTML block instead of plain <code>print()</code> output, which makes model responses and status messages easier to read.</p>
    
    <p style="margin: 8px 0 0 0;">When every non-empty line starts with a bullet marker (-, •, or *), the text is rendered as an HTML list; otherwise line breaks are converted to HTML line breaks. The styling matches the rest of the lab. The function accepts two parameters: the text to display and an optional title that appears as a header above the formatted output.</p>
</div>

<div class="code-note">
    <strong>Usage:</strong> Replace standard <code>print()</code> statements with <code>pretty_print()</code> throughout your notebook to maintain consistent, professional output formatting.
</div>

</body>
</html>

In [None]:
# 🖨️ pretty_print(): Reusable HTML display function for model outputs
def pretty_print(text, title="🤖 Model Response"):
    """
    Display model response in styled HTML block.
    Handles bulleted lists and line breaks.
    """
    lines = text.strip().split('\n')
    is_bulleted = all(line.strip().startswith(("-", "•", "*")) for line in lines if line.strip())

    if is_bulleted:
        list_items = ''.join(f"<li>{line.lstrip('-•* ').strip()}</li>" for line in lines if line.strip())
        content_html = f"<ul style='margin-top: 6px;'>{list_items}</ul>"
    else:
        content_html = text.replace("\n", "<br>")  # fallback for plain lines

    display(HTML(f"""
    <div style="background-color:#f8f9fc; border-left:5px solid #4285f4;
                padding:16px; margin-top:16px; font-family:'Segoe UI', sans-serif;
                color:#202124; line-height:1.6;">
      <strong>{title}</strong><br><br>
      {content_html}
    </div>
    """))
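
To check which branch pretty_print() takes without rendering anything, the same logic can be pulled into a pure function that returns the HTML string. This is a minimal sketch: render_body is a hypothetical helper name that is not part of the notebook, and it mirrors the list detection in the function above.

```python
def render_body(text):
    """Return the inner HTML that pretty_print() would build for `text`."""
    lines = text.strip().split("\n")
    non_empty = [line for line in lines if line.strip()]
    # Treated as a bulleted list only if every non-empty line starts with a marker.
    is_bulleted = all(line.strip().startswith(("-", "•", "*")) for line in non_empty)
    if is_bulleted:
        items = "".join(f"<li>{line.lstrip('-•* ').strip()}</li>" for line in non_empty)
        return f"<ul style='margin-top: 6px;'>{items}</ul>"
    # Fallback: keep the text as-is and convert newlines to <br>.
    return text.replace("\n", "<br>")

bullets = render_body("- first\n- second")
mixed = render_body("Heading\n- first")
```

A single non-bulleted line, such as a heading, sends the whole text down the fallback path, which is why mixed output is displayed as plain lines rather than as a list.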

<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>🔑 OpenAI API Key Setup from Colab Secrets</h2>
</div>

</body>
</html>

In [None]:
# ==================================================
# 🔑 OpenAI API Key Setup from Colab Secrets
# ==================================================

# ✅ Retrieve OpenAI API Key securely from Colab's secret storage
try:
    from google.colab import userdata  # Colab-specific secure storage
    openai_key = userdata.get('OPENAI_API_KEY')  # Must be pre-stored via UI

    if openai_key:
        os.environ["OPENAI_API_KEY"] = openai_key
        pretty_print("🔐 OpenAI API Key successfully set from Colab Secrets!", title="✅ API Key Setup")
    else:
        pretty_print("⚠️ OpenAI API Key not found in Colab Secrets. Please add it via Colab ➤ More ➤ Secrets.", title="❌ Missing API Key")

except Exception as e:
    pretty_print(f"🚫 Error retrieving OpenAI API Key: {e}", title="❗ API Key Setup Error")


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>🔷 Part 1: Non-RAG Model Implementation</h2>
</div>

</body>
</html>

In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# SIMPLE LANGCHAIN LLM QUERY - NO RAG COMPONENTS
# Direct language model query using LangChain without document retrieval
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# Initialize language model
llm = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.7,
    max_tokens=500
)

# Execute query
query = "What do I learn in the GenAI course in 3 bullets, software, application. Also who is the prof? any hints for me to gain a good grade?"
response = llm.invoke(query)

# Display result
pretty_print(response.content, title="🎯 Direct LLM Response")

<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.explanation-box {
    background-color: #f0f6ff;
    border-left: 3px solid #1a73e8;
    padding: 16px;
    margin: 16px 0;
    border-radius: 4px;
}

.process-diagram {
    background: white;
    border: 1px solid #dadce0;
    border-radius: 8px;
    padding: 20px;
    margin: 16px 0;
}

.step-container {
    display: grid;
    grid-template-columns: repeat(4, 1fr);
    gap: 15px;
    margin: 20px 0;
}

.step-card {
    background: #f8f9fa;
    border-radius: 8px;
    padding: 15px;
    text-align: center;
    border-top: 3px solid #1a73e8;
    position: relative;
}

.step-number {
    background: #1a73e8;
    color: white;
    width: 28px;
    height: 28px;
    border-radius: 50%;
    display: flex;
    align-items: center;
    justify-content: center;
    font-weight: bold;
    margin: 0 auto 10px;
}

.step-arrow {
    position: absolute;
    right: -10px;
    top: 50%;
    transform: translateY(-50%);
    color: #1a73e8;
    font-size: 20px;
}

.technology-grid {
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 12px;
    margin: 16px 0;
}

.tech-card {
    background: white;
    border: 1px solid #e0e0e0;
    border-radius: 6px;
    padding: 12px;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}

h3 {
    color: #1a73e8;
    font-size: 1em;
    margin: 8px 0;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>🔷 Part 2: RAG Model Implementation</h2>
</div>

<div class="explanation-box">
    <h3>Understanding the RAG Architecture</h3>
    <p>The Retrieval-Augmented Generation system enhances language model responses by incorporating document-specific context. Unlike standard LLMs that rely solely on training data, RAG systems actively search through your documents to find relevant information before generating responses.</p>
</div>

<div class="chart-container">
    <img src="https://www.dropbox.com/scl/fi/w7w1hfzgzdu46ydv9on00/RAG_Structure.png?rlkey=ef8r6nfdtbg3zvw90900w6rt8&dl=1" alt="RAG Architecture Overview" style="width: 70%; max-width: 70%;">
</div>

</body>
</html>


<div class="explanation-box" style="background-color: #f8f9fa;">
    <h3>How Retrieval Works</h3>
    <p>When a user submits a query, the system:</p>
    <ol style="margin: 8px 0 0 20px; padding: 0;">
        <li>Converts the query into an embedding vector using the same embedding model that was used to index the document chunks</li>
        <li>Searches the FAISS index for the k most similar document chunks</li>
        <li>Passes these relevant chunks as context to the language model</li>
        <li>Generates a response grounded in the retrieved information</li>
    </ol>
    <p style="margin-top: 12px;">This approach grounds responses in your specific documents rather than in the model's general training data, which typically improves accuracy on document-specific questions.</p>
</div>
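
The four steps above can be sketched end to end without any external service. This is a toy illustration under stated assumptions: embed is a hypothetical bag-of-words stand-in for the OpenAI embedding model, the brute-force cosine ranking stands in for the FAISS index, and the chunk texts are made up.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    """Steps 1-2: embed the query, then return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Made-up chunks for illustration only.
chunks = [
    "grading policy and exam dates",
    "office hours are held on monday",
    "the course covers retrieval augmented generation",
]
top = retrieve("what is the grading policy", chunks, k=1)

# Step 3: the retrieved chunks become the context handed to the LLM (step 4).
context = "\n\n".join(top)
```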


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.task-header {
    background-color: #e8f0fe;
    padding: 12px 14px;
    margin: 10px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.task-details {
    background-color: #f8f9fa;
    padding: 12px;
    margin: 8px 0;
    border-radius: 4px;
    border: 1px solid #e0e0e0;
    font-size: 0.9em;
}

h3 {
    color: #1a73e8;
    font-size: 1.05em;
    margin: 0 0 8px 0;
}

.process-list {
    margin: 8px 0 0 0;
    padding-left: 20px;
}

.process-list li {
    margin: 4px 0;
    color: #444;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}
</style>
</head>
<body>

<div class="task-header">
    <h3>📥 Task 1: Document Loading</h3>
    <div class="task-details">
        <p style="margin: 0 0 8px 0;">This task retrieves the course syllabus PDF from Dropbox and prepares it for RAG processing. The document undergoes three key transformations to enable efficient semantic search:</p>
        <ul class="process-list">
            <li><span class="highlight">Download:</span> Fetch the PDF file using the provided Dropbox URL</li>
            <li><span class="highlight">Load:</span> Extract text content from all pages using PyPDFLoader</li>
            <li><span class="highlight">Chunk:</span> Split the document into segments of up to 1000 characters with a 200-character overlap, so that context spanning a chunk boundary is preserved</li>
        </ul>
        <p style="margin: 8px 0 0 0;">The chunking strategy ensures that related information remains together while creating appropriately sized segments for embedding generation. The overlap prevents important context from being lost at chunk boundaries.</p>
    </div>
</div>

</body>
</html>
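
The effect of the 1000/200 setting is easiest to see with a simplified fixed-window splitter. This is only a sketch: RecursiveCharacterTextSplitter prefers to break on separators such as paragraph and line breaks, whereas the hypothetical sliding_chunks helper below cuts at exact character offsets.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows whose starts are chunk_size - chunk_overlap apart."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

text = "".join(str(i % 10) for i in range(2500))  # 2500 placeholder characters
chunks = sliding_chunks(text)
sizes = [len(c) for c in chunks]
```

With these settings each chunk repeats the final 200 characters of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.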

In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#>> DOCUMENT LOADING AND PROCESSING
# This cell downloads the PDF from Dropbox, loads it into memory,
# and splits it into manageable chunks for vector search
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 📦 Document Loaders
from langchain_community.document_loaders import PyPDFLoader  # Load content from PDFs
from langchain_community.document_loaders import TextLoader  # Load plain text files

# ✂️ Text Processing & Chunking
from langchain_text_splitters import RecursiveCharacterTextSplitter  # Split text into chunks


# Download PDF from Dropbox
dropbox_url = "https://www.dropbox.com/scl/fi/zedqrdppb6et1sm3s09r6/IE_5250_Applied_Generative_AI-2025.pdf?rlkey=tn3130kcd5o03twalmydn8t6p&e=1&dl=1"
pdf_path = "/content/document.pdf"

response = requests.get(dropbox_url, timeout=60)
response.raise_for_status()  # Stop here if the download failed
with open(pdf_path, "wb") as file:
    file.write(response.content)

# Load and process the PDF
loader = PyPDFLoader(pdf_path)
documents = loader.load()

# Split into chunks for better retrieval accuracy
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)
docs = text_splitter.split_documents(documents)

pretty_print(f"PDF successfully downloaded and processed\n{len(documents)} pages converted into {len(docs)} searchable chunks",
             title="📥 Document Loading Complete")

<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.task-header {
    background-color: #e8f0fe;
    padding: 10px 12px;
    margin: 10px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

h3 {
    color: #1a73e8;
    font-size: 1.05em;
    margin: 0;
}
</style>
</head>
<body>

<!-- Task 2 Header -->
<div class="task-header">
    <h3>🧮 Task 2: Embedding Generation & Vector Store Creation</h3>
</div>


</body>
</html>

In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#>> EMBEDDING GENERATION AND VECTOR STORE CREATION
# This cell converts text chunks into vector embeddings using OpenAI's
# model and stores them in a FAISS index for fast similarity search
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 📚 Vector Store & Embeddings
from langchain_community.vectorstores import FAISS  # FAISS for fast vector search

# Initialize embedding model
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create FAISS vector store
vector_db = FAISS.from_documents(docs, embedding_model)

# Prepare sample chunks display
sample_chunks = []
for i in range(min(3, len(docs))):
    chunk_preview = docs[i].page_content[:150].strip()
    sample_chunks.append(f"• Chunk {i+1}: {chunk_preview}...")

sample_text = f"Embeddings successfully created for {len(docs)} chunks\n\nSample chunks:\n" + "\n".join(sample_chunks)
pretty_print(sample_text, title="🧠 Embedding Generation Complete")


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.task-header {
    background-color: #e8f0fe;
    padding: 10px 12px;
    margin: 10px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

h3 {
    color: #1a73e8;
    font-size: 1.05em;
    margin: 0;
}
</style>
</head>
<body>


<!-- Task 3 Header -->
<div class="task-header">
    <h3>🔍 Task 3: Query and Retrieval</h3>
</div>

</body>
</html>

In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#>> RETRIEVAL AND QUESTION ANSWERING
# This cell sets up the RAG chain, performs retrieval testing,
# and executes queries against the document
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# 🔁 Retrieval-Augmented Generation (RAG)
from langchain.chains import RetrievalQA  # Combine retriever + LLM into a QA system

# Initialize language model
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Create retriever
retriever = vector_db.as_retriever(search_kwargs={"k": 4})

# Test retrieval functionality
test_docs = retriever.invoke("document")  # invoke() replaces the deprecated get_relevant_documents()
retrieval_status = f"Retrieval system operational: {len(test_docs)} documents successfully retrieved"

# Build RAG chain
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# Execute query
query = "What do I learn in the GenAI course in 3 bullets, software, application. Also who is the prof? any hints for me to gain a good grade?"
result = rag_chain.invoke({"query": query})

# Format the complete response
query_result_text = f"{retrieval_status}\n\nQuery: {query}\n\nAnswer:\n{result['result']}\n\nSource Documents Used: {len(result['source_documents'])}"
pretty_print(query_result_text, title="🔍 RAG Query Results")
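
With chain_type="stuff", all retrieved chunks are placed into a single prompt alongside the question. The sketch below illustrates that idea in plain Python; the template wording and the build_stuff_prompt name are assumptions for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one context block, then append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for retriever output.
prompt = build_stuff_prompt(
    "Who teaches the course?",
    ["chunk one text", "chunk two text"],
)
```

Because every chunk goes into one prompt, k and the chunk size together bound how much of the model's context window the retrieval step consumes.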

<div style="background-color: #f0f6ff; border-left: 3px solid #1a73e8; padding: 16px; margin: 12px 0; border-radius: 4px; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif; color: #1a1a1a; font-size: 14px; line-height: 1.4; max-width: 750px;">

<h2 style="color: #1a73e8; font-size: 1.15em; font-weight: 600; margin: 0 0 12px 0;">🔷 Vector Database Alternatives to FAISS</h2>

<p>While FAISS excels at similarity search, several alternatives offer unique features for RAG applications. Each database addresses different needs regarding ease of use, scalability, and deployment options.</p>

<ul style="margin: 12px 0; padding-left: 20px;">
  <li style="margin: 10px 0;">
    <a href="https://www.trychroma.com/" target="_blank" style="color: #1a73e8; font-weight: 600; text-decoration: none;">ChromaDB</a> – Open-source embedding DB with simple API design and LangChain support.  
    <a href="https://docs.trychroma.com/" target="_blank" style="color: #1a73e8; text-decoration: none;">Learn more →</a>
  </li>
  
  <li style="margin: 10px 0;">
    <a href="https://www.pinecone.io/" target="_blank" style="color: #1a73e8; font-weight: 600; text-decoration: none;">Pinecone</a> – Fully managed vector DB, ideal for production-scale with zero infra hassle.  
    <a href="https://docs.pinecone.io/" target="_blank" style="color: #1a73e8; text-decoration: none;">Learn more →</a>
  </li>

  <li style="margin: 10px 0;">
    <a href="https://weaviate.io/" target="_blank" style="color: #1a73e8; font-weight: 600; text-decoration: none;">Weaviate</a> – Combines vector + structured search with GraphQL and ML modules.  
    <a href="https://weaviate.io/developers/weaviate" target="_blank" style="color: #1a73e8; text-decoration: none;">Learn more →</a>
  </li>

  <li style="margin: 10px 0;">
    <a href="https://qdrant.tech/" target="_blank" style="color: #1a73e8; font-weight: 600; text-decoration: none;">Qdrant</a> – Rust-based engine offering fast vector search and advanced filtering.  
    <a href="https://qdrant.tech/documentation/" target="_blank" style="color: #1a73e8; text-decoration: none;">Learn more →</a>
  </li>
</ul>

<div style="margin-top: 14px; padding-top: 12px; border-top: 1px solid #dadce0; font-size: 0.9em; color: #555;">
  <strong>Selection Guide:</strong> Use <b>ChromaDB</b> for quick dev, <b>Pinecone</b> for managed infra, <b>Weaviate</b> for hybrid search, or <b>Qdrant</b> for speed.  
  <i>FAISS</i> is still great for offline and lightweight use.
</div>

</div>


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.intro-box {
    background-color: #f0f6ff;
    border-left: 3px solid #1a73e8;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    font-size: 0.95em;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>📊 Loading CSV Data in LangChain</h2>
</div>

<div class="intro-box">
    <p style="margin: 0;">LangChain's <span class="highlight">CSVLoader</span> enables seamless integration of structured tabular data into RAG systems. This capability transforms spreadsheet data into searchable documents, allowing natural language queries against datasets containing sales records, inventory, research data, or any information organized in rows and columns.</p>
</div>

<div style="background: white; border: 1px solid #dadce0; padding: 20px; border-radius: 6px; margin: 16px 0;">
    <h3 style="color: #1a73e8; font-size: 1em; margin: 0 0 10px 0;">📊 CSV Data Loading</h3>
    <p style="margin: 0 0 8px 0; font-size: 0.9em;">Process structured tabular data for analysis and question-answering. The CSVLoader converts each row into a document, preserving column relationships while enabling semantic search across your datasets. This approach bridges the gap between traditional data analysis and natural language processing.</p>
    <p style="margin: 8px 0 0 0; font-size: 0.85em; color: #666;"><strong>Common use cases:</strong> Sales and financial data, product catalogs, customer records, survey responses, scientific datasets, inventory management, performance metrics</p>
</div>

<div style="background-color: #f8f9fa; padding: 12px; margin: 16px 0; border-radius: 4px; border: 1px solid #e0e0e0;">
    <p style="margin: 0; font-size: 0.9em;"><strong>💡 Pro Tip:</strong> Each CSV row becomes its own document with its column names attached, so the retriever can surface the specific rows a question refers to. Questions about trends, comparisons, or aggregations are answered correctly only when all of the relevant rows are retrieved, so check such answers against the source data.</p>
</div>

</body>
</html>
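
The row-to-document conversion can be pictured with the standard library alone. This is a sketch of the idea, not CSVLoader's source: rows_to_documents is a hypothetical helper, the column names follow a World Bank style layout, and the values are placeholders rather than real GDP figures.

```python
import csv
import io

def rows_to_documents(csv_text):
    """Turn each CSV row into a dict with 'column: value' text plus its row index."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for index, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append({"page_content": content, "metadata": {"row": index}})
    return documents

# Placeholder data: not real GDP values.
sample = "Country Name,Country Code,1980,2020\nAlpha,AAA,100,250\nBeta,BBB,40,60\n"
documents = rows_to_documents(sample)
```

Each row is embedded as free text, so a question that needs values from several rows depends on all of those rows being retrieved.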

In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# This cell generates a non-grounded LLM response without using any data source
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

# ✅ Initialize LLM
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# ✅ Query about GDP
query = (
    "Compare the GDP of USA, Japan, China, and Qatar in 1980 and 2020. "
    "For each country, show GDP in 1980, GDP in 2020, and percent change in a concise format, e.g., "
    "<flag emoji> Country: (1980 → GDP), (2020 → GDP), (% change: #.##%). "
    "Use billions or trillions of USD, rounded."
)
hallucinated_response = llm.invoke(query).content

# ✅ Format and display
hallucination_output = f"""
🧠 Hallucinated Response (No RAG):
---------------------------------
{hallucinated_response}
"""

pretty_print(hallucination_output, title="🌍 GDP Query without RAG")


In [None]:
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#>> RAG RESPONSE (CSV) – GPT-4, One-Line Query, No Prompt Template
# ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.document_loaders import CSVLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
import requests
import pandas as pd

# ✅ Step 1: Download CSV from Dropbox
csv_url = "https://www.dropbox.com/scl/fi/nzc5p2gpgb3wja2kf7qdz/GDP_World_Bank.csv?rlkey=jydgntc8jfkm6ajyswovmso3z&st=gwqgc976&dl=1"
with open("gdp_data.csv", "wb") as f:
    f.write(requests.get(csv_url).content)

# ✅ Step 2: Load CSV
try:
    loader = CSVLoader(
        file_path="gdp_data.csv",
        encoding="utf-8",
        csv_args={
            'delimiter': ',',
            'quotechar': '"',
            'fieldnames': None
        }
    )
    docs = loader.load()
except Exception as e:
    print(f"Error loading CSV: {e}")
    df = pd.read_csv("gdp_data.csv")
    from langchain_core.documents import Document
    docs = []
    for idx, row in df.iterrows():
        content = f"Country: {row.get('Country Name', 'Unknown')}, "
        for col in df.columns:
            if col not in ['Country Name', 'Country Code']:
                content += f"{col}: {row.get(col, 'N/A')}, "
        docs.append(Document(page_content=content, metadata={"row": idx}))

# ✅ Step 3: Optional - Split documents
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# ✅ Step 4: Embedding and vector store
embedding = OpenAIEmbeddings()
vector_db = FAISS.from_documents(split_docs if split_docs else docs, embedding)

# ✅ Step 5: Retriever setup
retriever = vector_db.as_retriever(
    search_type="similarity",
    search_kwargs={
        "k": 10,
        "score_threshold": 0.5
    }
)

# ✅ Step 6: RAG chain with no template
llm = ChatOpenAI(model="gpt-4", temperature=0)
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# ✅ Step 7: Simple one-line query about GDP (same question as the non-RAG cell)
query = (
    "Compare the GDP of USA, Japan, China, and Qatar in 1980 and 2020. "
    "For each country, show GDP in 1980, GDP in 2020, and percent change in a concise format, e.g., "
    "<flag emoji> Country: (1980 → GDP), (2020 → GDP), (% change: #.##%). "
    "Use billions or trillions of USD, rounded."
)


rag_response = rag_chain.invoke({"query": query})

# ✅ Step 8: Display result
rag_output = f"""
Retrieval Status: {len(split_docs if 'split_docs' in locals() else docs)} documents in vector store

📊 RAG-Based Response (Using CSV):
---------------------------------
{rag_response['result']}

Source Documents Used: {len(rag_response['source_documents'])}

"""

pretty_print(rag_output, title="🌍 GDP Query with CSV-RAG (Simple Query)")


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.intro-box {
    background-color: #f0f6ff;
    border-left: 3px solid #1a73e8;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    font-size: 0.95em;
}

.data-type-grid {
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 16px;
    margin: 16px 0;
}

.data-card {
    background: white;
    border: 1px solid #dadce0;
    padding: 16px;
    border-radius: 6px;
    border-top: 3px solid #1a73e8;
}

.code-example {
    background-color: #f5f5f5;
    border: 1px solid #ddd;
    border-radius: 4px;
    padding: 12px;
    margin: 10px 0;
    font-family: 'Courier New', monospace;
    font-size: 0.85em;
    overflow-x: auto;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}

h3 {
    color: #1a73e8;
    font-size: 1em;
    margin: 0 0 8px 0;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}

.use-case-list {
    margin: 8px 0;
    padding-left: 20px;
    font-size: 0.9em;
}

.use-case-list li {
    margin: 4px 0;
    color: #555;
}

.chart-container {
    background: white;
    border: 1px solid #dadce0;
    padding: 16px;
    border-radius: 6px;
    margin: 16px 0;
    text-align: center;
}

.chart-container img {
    max-width: 70%;
    height: auto;
    border-radius: 4px;
}

.question-box {
    background-color: #fff3cd;
    border-left: 3px solid #ffc107;
    padding: 12px;
    margin: 16px 0;
    border-radius: 4px;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>📊 World Bank GDP Data Analysis</h2>
</div>

<div class="intro-box">
    <p style="margin: 0;">The <span class="highlight">World Bank</span> dataset used in this lab reports GDP by country, as charted in the image below. Use the chart to check whether your <span class="highlight">RAG model implementation</span> returns accurate values when it processes economic data.</p>
</div>

<div class="chart-container">
    <img src="https://www.dropbox.com/scl/fi/bmitdcpfoqmib886t26vv/GDP_Chart.png?rlkey=1xmwtmvybl7h1rp3dxvbj972m&raw=1" alt="GDP Chart" style="width: 70%; max-width: 70%;">
</div>

<div class="question-box">
    <h3 style="color: #856404; margin: 0;">❓ Did your RAG model provide accurate values based on this dataset?</h3>
</div>

</body>
</html>
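
When checking the model's answer against the chart, the percent change can be recomputed by hand as (new - old) / old × 100. A minimal sketch, with placeholder inputs rather than values from the dataset:

```python
def percent_change(old, new):
    """Percent change from old to new; old must be non-zero."""
    if old == 0:
        raise ValueError("percent change is undefined when the starting value is 0")
    return (new - old) / old * 100

# Placeholder values in billions of USD, not taken from the dataset.
change = percent_change(200.0, 500.0)
formatted = f"{change:.2f}%"
```

The same function can be applied to the 1980 and 2020 values the model reports, to confirm that its stated percentage matches its own figures.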


<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}

.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}

.intro-box {
    background-color: #f0f6ff;
    border-left: 3px solid #1a73e8;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    font-size: 0.95em;
}

.data-type-grid {
    display: grid;
    grid-template-columns: repeat(2, 1fr);
    gap: 16px;
    margin: 16px 0;
}

.data-card {
    background: white;
    border: 1px solid #dadce0;
    padding: 16px;
    border-radius: 6px;
    border-top: 3px solid #1a73e8;
}

.code-example {
    background-color: #f5f5f5;
    border: 1px solid #ddd;
    border-radius: 4px;
    padding: 12px;
    margin: 10px 0;
    font-family: 'Courier New', monospace;
    font-size: 0.85em;
    overflow-x: auto;
}

h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}

h3 {
    color: #1a73e8;
    font-size: 1em;
    margin: 0 0 8px 0;
}

.highlight {
    color: #1a73e8;
    font-weight: 600;
}

.use-case-list {
    margin: 8px 0;
    padding-left: 20px;
    font-size: 0.9em;
}

.use-case-list li {
    margin: 4px 0;
    color: #555;
}
</style>
</head>
<body>

<div class="section-header">
    <h2>🔷 Loading External Data: HTML & CSV in LangChain</h2>
</div>

<div class="intro-box">
    <p style="margin: 0;">LangChain is not limited to PDF processing; it supports many data sources. <span class="highlight">HTML loaders</span> ingest web pages so retrieval can draw on content that is current at load time, while <span class="highlight">CSV loaders</span> turn structured, tabular data into documents. Together they let RAG systems work with web content and tabular datasets alongside traditional documents.</p>
</div>

<div style="background: white; border: 1px solid #dadce0; padding: 20px; border-radius: 6px; margin: 16px 0;">
    <h3 style="color: #1a73e8; font-size: 1em; margin: 0 0 10px 0;">🌐 HTML Data Loading</h3>
    <p style="margin: 0 0 8px 0; font-size: 0.9em;">Extract text from web pages so your RAG system can retrieve up-to-date information such as current events, documentation, or other web-based content. LangChain's HTML loaders return the page text as documents that feed directly into the same splitting, embedding, and retrieval steps used for other sources.</p>
    <p style="margin: 8px 0 0 0; font-size: 0.85em; color: #666;"><strong>Common use cases:</strong> News articles and blog posts, technical documentation, Wikipedia entries, company websites, product pages, FAQ sections</p>
</div>

<div style="background-color: #f8f9fa; padding: 12px; margin: 16px 0; border-radius: 4px; border: 1px solid #e0e0e0;">
    <p style="margin: 0; font-size: 0.9em;"><strong>💡 Pro Tip:</strong> Both loaders convert content into LangChain Document objects, maintaining consistency across different data sources. This allows you to apply the same embedding and retrieval pipeline regardless of whether your source is a PDF, web page, or CSV file.</p>
</div>

</body>
</html>
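
To make the shared Document shape from the Pro Tip concrete, here is a standard-library sketch of what an HTML loader and a CSV loader produce conceptually: text in `page_content` plus a `metadata` dict. The `Doc` dataclass and both helper functions are hypothetical stand-ins, not LangChain's classes, and real loaders handle many more cases.

```python
import csv
import io
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_docs(html: str, source: str) -> list[Doc]:
    """One document per page: all visible text joined with spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return [Doc(" ".join(parser.parts), {"source": source})]


def csv_to_docs(csv_text: str, source: str) -> list[Doc]:
    """One document per row, formatted as 'column: value' lines."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [
        Doc("\n".join(f"{k}: {v}" for k, v in row.items()), {"source": source, "row": i})
        for i, row in enumerate(rows)
    ]


html_docs = html_to_docs("<html><style>p{}</style><p>Solar power is growing.</p></html>", "page.html")
csv_docs = csv_to_docs("country,year\nExampleland,2020\n", "data.csv")
print(html_docs[0])
print(csv_docs[0])
```

Because both helpers return the same `Doc` shape, any downstream splitter or embedder only needs to read `page_content` and `metadata`, regardless of where the text came from.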

# ✋**Hands-On: RAG with HTML Data**

---

## 🌐 Load HTML from a Webpage or Local File

```python
from langchain_community.document_loaders import WebBaseLoader, BSHTMLLoader

# Load from a webpage (WebBaseLoader fetches the URL and parses it with BeautifulSoup)
html_loader = WebBaseLoader("https://en.wikipedia.org/wiki/Artificial_intelligence")
html_docs = html_loader.load()

# OR load from a local file (BSHTMLLoader reads an HTML file from disk)
# html_loader = BSHTMLLoader("data/my_page.html")
# html_docs = html_loader.load()
```

In [None]:
# ==================================================
# ✋ **Hands-On: Load & Retrieve Renewable Energy Info from Wikipedia**
# ==================================================
# 📌 **Task Instructions:**
# 1️⃣ Fill in the missing placeholders (`-----`) to complete the process.
# 2️⃣ Use `WebBaseLoader` to load the Wikipedia page.
# 3️⃣ Split the text into retrievable chunks.
# 4️⃣ Embed the chunks and store the vectors in FAISS.
# 5️⃣ Use retrieval to answer a question about renewable energy.

from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import RetrievalQA

# ==================================================
# ✅ Step 1: Load Wikipedia Page on Renewable Energy
# ==================================================
wiki_url = "https://en.wikipedia.org/wiki/Renewable_energy"
loader = -----  # Load HTML from Wikipedia
documents = -----  # Extract text from the page

# ==================================================
# ✅ Step 2: Split Text into Chunks
# ==================================================
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
docs = -----  # Split extracted text into smaller chunks

# ==================================================
# ✅ Step 3: Convert Chunks to Embeddings & Store in FAISS
# ==================================================
embedding_model = -----  # Use OpenAIEmbeddings or another model
vector_db = -----  # Convert docs into vector embeddings and store in FAISS

# ==================================================
# ✅ Step 4: Create a Retriever to Fetch Relevant Information
# ==================================================
retriever = -----  # Convert FAISS vector store into a retriever

# ==================================================
# ✅ Step 5: Ask AI a Question About Renewable Energy
# ==================================================
rag_chain = RetrievalQA.from_chain_type(-----, retriever=retriever)  # Define the RAG pipeline

query = "What are the main types of renewable energy sources?"
response_rag = rag_chain.invoke({"query": query})["result"]  # invoke() returns a dict; the answer is under "result"

# ✅ Step 6: Display Retrieved Answer
print("\n🌍 🔋 AI Answer on Renewable Energy:")
print(response_rag)

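To see what `chunk_size=500` and `chunk_overlap=50` in Step 2 control, the sketch below splits text with a fixed-width sliding window. This is a simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs and sentences rather than at fixed offsets, and `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(text, chunk_size=10, chunk_overlap=3)
for c in chunks:
    print(c)
```

Each chunk repeats the last three characters of the previous one. The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of storing some text twice.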

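Steps 3 to 5 rest on one idea: embed the chunks and the question as vectors, then return the chunks whose vectors are closest to the question's. Below is a toy sketch that uses word counts as the "embedding" and brute-force cosine similarity. Real pipelines use a learned embedding model and an index such as FAISS; `embed`, `cosine`, and `top_k` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (an assumption, not a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best match first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "Solar power converts sunlight into electricity.",
    "Wind power uses turbines to generate electricity.",
    "The stock market closed higher on Friday.",
]
print(top_k("How do turbines generate power from wind?", chunks, k=1))
```

In the lab, the retrieved chunks are then passed to the language model as context, which is the "augmented generation" half of RAG.
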
<!DOCTYPE html>
<html>
<head>
<style>
body {
    font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
    line-height: 1.4;
    color: #1a1a1a;
    max-width: 750px;
    margin: 0 auto;
    padding: 8px;
    font-size: 14px;
}
.section-header {
    background-color: #e8f0fe;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    border-left: 3px solid #1a73e8;
}
.intro-box {
    background-color: #f0f6ff;
    border-left: 3px solid #1a73e8;
    padding: 12px;
    margin: 12px 0;
    border-radius: 4px;
    font-size: 0.95em;
}
h1 {
    color: #1a73e8;
    font-size: 1.5em;
    text-align: center;
    margin: 20px 0;
}
h2 {
    color: #1a73e8;
    font-size: 1.15em;
    margin: 0;
}
.highlight {
    color: #1a73e8;
    font-weight: 600;
}
</style>
</head>
<body>

<h1>🎉 Congratulations!</h1>

<div class="section-header">
    <h2>✅ Lab Completed: RAG with LangChain & FAISS</h2>
</div>

<div class="intro-box">
    <p style="margin: 0 0 12px 0;">You successfully integrated <span class="highlight">LangChain</span> with <span class="highlight">FAISS</span> to build a RAG pipeline for World Bank GDP data. Great job mastering vector databases and semantic search!</p>
    <p style="margin: 0;">❓ Next, extend this lab by trying other embedding models and/or vector store settings. Check Canvas and the lab requirements for details.</p>
</div>

</body>
</html>