# 🤖 Notebook 02: Building the RAG Pipeline with LangChain (FAISS)

## Learning Objectives
In this notebook, you will learn how to:
1. **Load the vector store** from disk (no re-embedding needed)
2. **Create a retriever** for finding relevant documents
3. **Load an LLM** (FLAN-T5-small) for text generation
4. **Build a RAG prompt** that instructs the LLM to use context
5. **Run the complete RAG pipeline** and see answers with sources

## What is RAG?

**RAG = Retrieval-Augmented Generation**

It's a technique that combines:
1. **Retrieval**: Find relevant documents from a knowledge base
2. **Augmentation**: Add those documents as context to a prompt
3. **Generation**: Use an LLM to generate an answer based on the context
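The three stages can be sketched in plain Python before any libraries get involved. This is a toy illustration, not the notebook's implementation: the knowledge base is made-up data, word overlap stands in for vector similarity, and `generate` is a stub in place of a real LLM call.

```python
# Toy knowledge base of support-ticket snippets (made-up data)
KNOWLEDGE_BASE = [
    "Customer was charged twice for the same order.",
    "Smart TV loses Wi-Fi connection after firmware update.",
    "Refund requested because the laptop battery drains quickly.",
]


def retrieve(question: str, docs: list[str], k: int = 2) -> list[str]:
    """Retrieval: rank docs by how many lowercase words they share with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(docs, key=lambda d: len(q_words & set(d.lower().split())), reverse=True)
    return ranked[:k]


def augment(question: str, docs: list[str]) -> str:
    """Augmentation: paste the retrieved docs into the prompt as context."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt: str) -> str:
    """Generation: stub for an LLM call; it just echoes the first context line."""
    first_context_line = prompt.splitlines()[1]
    return f"Based on the context: {first_context_line[2:]}"


question = "Why was the customer charged twice?"
docs = retrieve(question, KNOWLEDGE_BASE)
print(generate(augment(question, docs)))
```

In the real pipeline, retrieval uses embeddings and FAISS, and generation uses FLAN-T5, but the data flow (question → documents → prompt → answer) is the same.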

Why RAG?
- LLMs can hallucinate (make up facts)
- RAG grounds the LLM in your actual data (support tickets)
- You can update the knowledge base without retraining

---

## Step 1: Setup and Imports

In [None]:
# Standard library imports
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# IMPORTANT: Set up HuggingFace cache BEFORE importing transformers
from src.config import setup_hf_cache
setup_hf_cache()

print("✓ Setup complete!")
print(f"Project root: {project_root}")

In [None]:
# Import our custom modules
from src import config
from src.vectorstore import load_vector_store, get_retriever
from src.llm import get_llm, test_llm, RAG_PROMPT_TEMPLATE, format_docs_for_context
from src.rag_pipeline import RAGPipeline, print_rag_response

print("✓ Custom modules imported!")

---

## Step 2: Load the Vector Store

We'll load the **FAISS** vector store we created in Notebook 01.
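Loading from disk is cheap because the embeddings were computed once in Notebook 01 and saved alongside the texts. The sketch below illustrates that idea with a hypothetical stdlib stand-in (a JSON file and made-up 3-dimensional vectors). It is not how FAISS or `load_vector_store` actually serialize an index; it only shows why reloading needs no embedding model.

```python
import json
import tempfile
from pathlib import Path


def save_store(path: Path, texts: list[str], vectors: list[list[float]]) -> None:
    """Persist texts together with their precomputed embeddings."""
    path.write_text(json.dumps({"texts": texts, "vectors": vectors}))


def load_store(path: Path) -> tuple[list[str], list[list[float]]]:
    """Reload texts and vectors; no embedding model is called here."""
    data = json.loads(path.read_text())
    return data["texts"], data["vectors"]


# Made-up 3-d "embeddings" standing in for real model output
texts = ["Billing: double charge", "TV: no signal"]
vectors = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.95]]

with tempfile.TemporaryDirectory() as tmp:
    store_file = Path(tmp) / "store.json"
    save_store(store_file, texts, vectors)
    loaded_texts, loaded_vectors = load_store(store_file)

print(loaded_texts == texts, loaded_vectors == vectors)  # True True
```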

In [None]:
# Load vector store from disk
vectorstore = load_vector_store()

print(f"\n✓ Vector store loaded from: {config.VECTOR_STORE_DIR}")

In [None]:
# Create a retriever
# k=5 means we'll retrieve the top 5 most similar documents
retriever = get_retriever(vectorstore, k=5)

print("\n✓ Retriever created (k=5)")
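
Conceptually, a `k=5` retriever embeds the question and returns the five stored vectors that score highest against it. The sketch below does this by brute force with cosine similarity over toy 2-d vectors. This is an illustrative assumption: depending on how the index was built, the actual FAISS store may rank by L2 distance or inner product instead, and it searches far more efficiently.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 5) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = ((i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy 2-d document vectors; the query points mostly along the first axis
doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=2))
```

If fewer than `k` documents exist, `heapq.nlargest` simply returns all of them, ranked.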

---

## Step 3: Load the LLM

We'll use **google/flan-t5-small**, a small, CPU-friendly model.

### About FLAN-T5-small:
- ~300MB download (first time only)
- Instruction-tuned (good at following prompts)
- Runs on CPU (no GPU needed)
- Good for demos and learning

**Note:** The first run downloads the model; later runs load it from the cache.

In [None]:
# Show LLM configuration
print("LLM Configuration:")
print(f"  Model: {config.LLM_MODEL_NAME}")
print(f"  Cache directory: {config.MODELS_DIR}")

In [None]:
# Load the LLM
# This will download the model on first run (~300MB)
llm = get_llm()

In [None]:
# Quick test to make sure the LLM is working
test_response = test_llm(llm)
print("\n✓ LLM is working!")

---

## Step 4: Understand the RAG Prompt

The prompt is **critical** for RAG. It tells the LLM:
1. What role to play (support analytics assistant)
2. To use ONLY the provided context
3. To admit when the context is insufficient instead of guessing

This reduces hallucination, although it cannot eliminate it entirely.

In [None]:
# Let's look at our RAG prompt template
print("RAG PROMPT TEMPLATE")
print("=" * 60)
print(RAG_PROMPT_TEMPLATE)

**Key elements of the prompt:**
- **Role**: "helpful customer support analytics assistant"
- **Instruction**: Use ONLY the provided context
- **Fallback**: Say "I don't have enough information" if needed
- **Placeholders**: `{context}` and `{question}` will be filled in
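Filling the placeholders is plain string substitution, similar in spirit to how LangChain's `PromptTemplate` handles f-string-style templates. The template below is hypothetical: it mirrors the key elements listed above, but the real `RAG_PROMPT_TEMPLATE` in `src/llm.py` may be worded differently, and the ticket snippets are made up.

```python
# Hypothetical template; the real RAG_PROMPT_TEMPLATE may use different wording
TEMPLATE = (
    "You are a helpful customer support analytics assistant.\n"
    "Use ONLY the context below to answer. If the context is not enough, "
    "say \"I don't have enough information.\"\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

# Made-up retrieved snippets and a user question
example_context = "Ticket 101: Customer charged twice.\nTicket 202: Refund delayed by two weeks."
example_question = "What billing issues are reported?"

# str.format replaces {context} and {question} with the values above
prompt = TEMPLATE.format(context=example_context, question=example_question)
print(prompt)
```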

---

## Step 5: Build the RAG Pipeline

Now let's put it all together! The `RAGPipeline` class handles:
1. Retrieving relevant documents
2. Formatting them into context
3. Generating an answer with the LLM
4. Returning the answer + sources
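To make those four responsibilities concrete, here is a hypothetical, heavily simplified class with the same shape. It is not the code in `src/rag_pipeline.py`: `Doc` is a minimal stand-in for a LangChain `Document`, word overlap replaces vector search, and `fake_llm` replaces FLAN-T5. The instance is named `mini_rag` so it does not overwrite the notebook's `rag`.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class MiniRAGPipeline:
    """Hypothetical sketch: retrieve -> format -> generate -> answer + sources."""

    def __init__(self, docs: list[Doc], llm: Callable[[str], str], retrieval_k: int = 5):
        self.docs = docs
        self.llm = llm
        self.retrieval_k = retrieval_k

    def retrieve(self, question: str) -> list[Doc]:
        # 1. Rank by shared lowercase words (toy substitute for vector similarity)
        q = set(question.lower().split())
        ranked = sorted(
            self.docs,
            key=lambda d: len(q & set(d.page_content.lower().split())),
            reverse=True,
        )
        return ranked[: self.retrieval_k]

    def format_context(self, docs: list[Doc]) -> str:
        # 2. Label each chunk with its ticket id so answers can be traced back
        return "\n\n".join(
            f"[Ticket {d.metadata.get('ticket_id', 'N/A')}] {d.page_content}" for d in docs
        )

    def ask(self, question: str) -> dict:
        docs = self.retrieve(question)
        context = self.format_context(docs)
        # 3. Generate an answer from the filled-in prompt
        answer = self.llm(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
        # 4. Return the answer together with its sources
        return {"answer": answer, "sources": [d.metadata for d in docs]}


def fake_llm(prompt: str) -> str:
    """Stand-in for FLAN-T5; always returns the same made-up answer."""
    return "Laptop batteries drain quickly."


sample_docs = [
    Doc("battery drains overnight on laptop", {"ticket_id": 1}),
    Doc("tv remote not pairing", {"ticket_id": 2}),
]
mini_rag = MiniRAGPipeline(sample_docs, fake_llm, retrieval_k=1)
result = mini_rag.ask("why does my laptop battery drain")
print(result["answer"], result["sources"])
```

Passing the LLM in as a constructor argument, as `RAGPipeline` does with `vectorstore` and `llm`, also makes it easy to swap in a stub like this when testing.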

In [None]:
# Create the RAG pipeline
# We pass in our pre-loaded vectorstore and llm to avoid reloading
rag = RAGPipeline(vectorstore=vectorstore, llm=llm, retrieval_k=5)

---

## Step 6: Ask Questions!

Let's test the RAG pipeline with various questions about support tickets.

### Question 1: Common billing issues

In [None]:
# Ask about billing issues
question1 = "What are the most common billing and payment issues customers report?"

response1 = rag.ask(question1)
print_rag_response(response1)

### Question 2: Technical problems with TVs

In [None]:
# Ask about TV issues
question2 = "What technical problems do customers face with smart TVs?"

response2 = rag.ask(question2)
print_rag_response(response2)

### Question 3: Refund requests

In [None]:
# Ask about refunds
question3 = "How are refund requests typically handled? What are common reasons for refunds?"

response3 = rag.ask(question3)
print_rag_response(response3)

### Question 4: Critical priority tickets

In [None]:
# Ask about critical issues
question4 = "What types of issues are marked as critical priority?"

response4 = rag.ask(question4)
print_rag_response(response4)

### Question 5: Device connectivity issues

In [None]:
# Ask about connectivity
question5 = "What connectivity or network issues do customers report with their devices?"

response5 = rag.ask(question5)
print_rag_response(response5)

---

## Step 7: Examine the RAG Process in Detail

Let's break down what happens at each step of the RAG pipeline.

In [None]:
# Let's trace through a question step by step
demo_question = "What problems do customers have with laptop batteries?"

print("STEP-BY-STEP RAG PROCESS")
print("=" * 60)
print(f"\n1. QUESTION: {demo_question}")

In [None]:
# Step 2: Retrieve relevant documents
print("\n2. RETRIEVAL: Finding similar documents...")
print("-" * 60)

retrieved_docs = rag.retrieve(demo_question)

print(f"   Retrieved {len(retrieved_docs)} documents")
for i, doc in enumerate(retrieved_docs, 1):
    print(f"\n   Doc {i}: Ticket {doc.metadata.get('ticket_id', 'N/A')}")
    print(f"   Product: {doc.metadata.get('product', 'N/A')}")
    print(f"   Preview: {doc.page_content[:100]}...")

In [None]:
# Step 3: Format context
print("\n3. CONTEXT FORMATTING:")
print("-" * 60)

context = format_docs_for_context(retrieved_docs)
print(f"   Context length: {len(context)} characters")
print(f"\n   Context preview (first 500 chars):")
print(f"   {context[:500]}...")

In [None]:
# Step 4: Generate answer
print("\n4. GENERATION: Calling LLM...")
print("-" * 60)

answer = rag.generate(demo_question, context)
print(f"\n   Generated Answer:")
print(f"   {answer}")

---

## Step 8: Get Detailed Response Information

For debugging and evaluation, we can get more details about each response.
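The cell below reads five keys from the result: `question`, `answer`, `num_sources`, `context_length`, and `sources` (each with `ticket_id`, `product`, `ticket_type`). A hypothetical helper that assembles a dictionary of that shape might look like this; the helper name and the example values are made up for illustration.

```python
def build_details(question: str, answer: str, context: str, source_metadata: list[dict]) -> dict:
    """Hypothetical helper assembling the fields that Step 8 prints."""
    sources = [
        {
            "ticket_id": m.get("ticket_id", "N/A"),
            "product": m.get("product", "N/A"),
            "ticket_type": m.get("ticket_type", "N/A"),
        }
        for m in source_metadata
    ]
    return {
        "question": question,
        "answer": answer,
        "num_sources": len(sources),
        "context_length": len(context),  # measured in characters, not tokens
        "sources": sources,
    }


# Made-up example values; the second source is missing fields on purpose
details = build_details(
    "What issues do customers have with GoPro cameras?",
    "(example answer)",
    "ctx" * 10,
    [{"ticket_id": 7, "product": "(example product)", "ticket_type": "(example type)"}, {"ticket_id": 9}],
)
print(details["num_sources"], details["context_length"])  # 2 30
```

Keeping these fields in a fixed structure makes responses easy to log and compare later; a very large `context_length` is also a hint that a small model may be truncating or ignoring part of the context.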

In [None]:
# Get detailed response as dictionary
detailed = rag.ask_with_details("What issues do customers have with GoPro cameras?")

print("DETAILED RESPONSE")
print("=" * 60)
print(f"\nQuestion: {detailed['question']}")
print(f"\nAnswer: {detailed['answer']}")
print(f"\nNumber of sources: {detailed['num_sources']}")
print(f"Context length: {detailed['context_length']} chars")
print("\nSources:")
for i, src in enumerate(detailed['sources'], 1):
    print(f"  {i}. Ticket {src['ticket_id']} - {src['product']} ({src['ticket_type']})")

---

## Summary

### What We Accomplished
1. ✅ Loaded the vector store from disk
2. ✅ Created a retriever (k=5)
3. ✅ Loaded FLAN-T5-small LLM
4. ✅ Built the complete RAG pipeline
5. ✅ Asked 5+ questions and got answers with sources
6. ✅ Traced through the RAG process step-by-step

### Key Takeaways
- **RAG = Retrieval + Augmentation + Generation**
- The **prompt** is critical: it tells the LLM to use the context
- **Sources** provide transparency and traceability
- Small models like FLAN-T5 work for demos but have limitations

### Limitations of FLAN-T5-small
- Answers may be short or incomplete
- May not fully utilize all context
- For production, consider larger models or API-based LLMs

### Next Steps
→ **Notebook 03**: Evaluate RAG quality systematically

In [None]:
print("\n" + "=" * 60)
print("🎉 Notebook 02 Complete!")
print("=" * 60)
print("\nYou've built a working RAG pipeline!")
print("\nProceed to: 03_rag_evaluation.ipynb")