# Ticket RAG System - LangChain Introduction

**Simple introduction to LangChain and LCEL (LangChain Expression Language)**

**What is LCEL?**: A way to chain components together using the pipe operator `|`

**Example**: `input | step1 | step2 | output`

**Pipeline**: Retrieve documents → Format context → Generate answer

## Install Dependencies

### 📦 What We're Installing

This cell installs the LangChain ecosystem packages we need:

- **`langchain`**: Core LangChain framework with base classes and utilities
- **`langchain-community`**: Community integrations (HuggingFace embeddings, etc.)
- **`langchain-openai`**: OpenAI-compatible LLM interface (works with LM Studio!)
- **`langchain-chroma`**: ChromaDB vector store integration

### 💡 Why These Packages?

LangChain is modular - you only install what you need. We're using:
- HuggingFace for **embeddings** (free, local)
- ChromaDB for **vector storage** (fast, embedded database)
- OpenAI API format for **LLM** (compatible with LM Studio, OpenAI, many others)

### ⚡ The `-q` Flag

`-q` means "quiet": it suppresses verbose installation output to keep the notebook clean.

In [None]:
# Install LangChain packages
!pip install -q langchain langchain-community langchain-openai langchain-chroma

## Imports

### 📚 Import Organization

Good practice: organize imports by category for readability.

#### Standard Library and Data Handling
- **`os`**: File system operations
- **`pandas`**: Data manipulation (DataFrames); a third-party package, not part of the standard library
- **`typing`**: Type hints for better code documentation

#### LangChain Core Components
- **`ChatPromptTemplate`**: Structures prompts with system/user messages
- **`StrOutputParser`**: Extracts text from LLM responses
- **`RunnableLambda`**: Wraps functions to use in LCEL chains
- **`ChatOpenAI`**: LLM interface (OpenAI API compatible)

#### LangChain Integrations
- **`HuggingFaceEmbeddings`**: Free, local text embeddings
- **`Chroma`**: Vector database for similarity search
- **`Document`**: Standard document format in LangChain
- **`BaseRetriever`**: Base class for custom retrievers

#### Supporting Libraries
- **`chromadb`**: Direct ChromaDB client access
- **`CrossEncoder`**: Re-ranking model from sentence-transformers
- **`dotenv`**: Load environment variables (API keys)
- **`IPython.display`**: Pretty output in Jupyter

### 🎯 Key Concept: LangChain Components

LangChain uses **composable building blocks**:
1. **Retrievers** → Find relevant documents
2. **Prompts** → Structure LLM input
3. **LLMs** → Generate responses
4. **Output Parsers** → Format results

We chain these together with the **`|` operator**!

In [None]:
import os
import pandas as pd
from typing import List, Dict

# LangChain imports
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_core.retrievers import BaseRetriever

# ChromaDB
import chromadb
from chromadb.config import Settings

# Re-ranker
from sentence_transformers import CrossEncoder
from dotenv import load_dotenv
from IPython.display import display, Markdown

# Load secrets
load_dotenv('./secrets.env', override=True)
print("✅ Libraries loaded")

## Configuration

### ⚙️ Centralized Configuration Pattern

**Best Practice**: Keep all configuration in one dictionary for:
- Easy modification (change once, affects everywhere)
- Clear documentation of parameters
- Reproducibility (share config = share exact setup)

### 🔧 Configuration Breakdown

#### Data Configuration
- **`csv_path`**: Location of ticket dataset (4k tickets, translated)
- **`train_test_split`**: 80% training, 20% testing
- **`random_seed`**: Ensures reproducible splits

#### Vector Store
- **`chroma_db_path`**: Where to persist the vector database
- **`embedding_model`**: `all-MiniLM-L6-v2` (384 dimensions, fast, good quality)

#### Retrieval Configuration
- **`reranker_model`**: `mxbai-rerank-base-v1` (cross-encoder for better ranking)
- **`top_k_initial`**: First stage retrieves 20 candidates
- **`top_k_reranked`**: Re-ranker picks best 5

#### LLM Configuration
- **`lm_studio_url`**: Local LM Studio server (can also use OpenAI)
- **`llm_model`**: Model name in LM Studio
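
A quick sanity check can catch inconsistent settings before any model is loaded. The sketch below is only illustrative: `validate_config` is a hypothetical helper (not part of the notebook), the URL is a placeholder, and it checks just a few of the keys described above.

```python
from typing import List


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []

    split = config.get("train_test_split")
    if not isinstance(split, float) or not 0.0 < split < 1.0:
        problems.append("train_test_split must be a fraction between 0 and 1")

    k_initial = config.get("top_k_initial", 0)
    k_final = config.get("top_k_reranked", 0)
    if k_initial < 1 or k_final < 1:
        problems.append("top_k_initial and top_k_reranked must be at least 1")
    elif k_final > k_initial:
        # The re-ranker can only choose from what stage 1 returned.
        problems.append("top_k_reranked cannot exceed top_k_initial")

    if not str(config.get("lm_studio_url", "")).startswith(("http://", "https://")):
        problems.append("lm_studio_url must start with http:// or https://")
    return problems


example = {
    "train_test_split": 0.8,
    "top_k_initial": 20,
    "top_k_reranked": 5,
    "lm_studio_url": "http://localhost:1234",  # placeholder address
}
print(validate_config(example))  # → []
```

In the notebook, calling `validate_config(CONFIG)` right after the configuration cell would serve the same purpose.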

### 💡 Why Two-Stage Retrieval?

**Stage 1 (Vector Search)**: Fast but approximate → Get 20 candidates

**Stage 2 (Re-ranking)**: Slow but accurate → Pick best 5

This balances **speed** (don't re-rank everything) and **quality** (accurate final selection).
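
To make the idea concrete before we bring in real models, here is a toy sketch of the two-stage pattern in plain Python. Everything in it is a stand-in: `word_overlap` plays the role of the fast vector search and `overlap_ratio` plays the role of the slower cross-encoder. Neither is what the notebook actually uses.

```python
from typing import Callable, List, Tuple


def two_stage_search(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    k_initial: int,
    k_final: int,
) -> List[Tuple[float, str]]:
    """Shortlist with the cheap scorer, then re-order the shortlist with the precise one."""
    # Stage 1: score every document cheaply and keep the k_initial best candidates.
    shortlist = sorted(corpus, key=lambda doc: cheap_score(query, doc), reverse=True)[:k_initial]
    # Stage 2: run the expensive scorer on the shortlist only.
    rescored = [(precise_score(query, doc), doc) for doc in shortlist]
    rescored.sort(key=lambda pair: pair[0], reverse=True)
    return rescored[:k_final]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for vector search: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: shared words divided by all distinct words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)


corpus = [
    "vpn connection drops every hour",
    "cannot reset email password",
    "password reset link expired for email account and vpn and printer and wifi",
    "printer out of toner",
]
results = two_stage_search(
    "reset email password", corpus, word_overlap, overlap_ratio, k_initial=3, k_final=1
)
print(results[0][1])  # → cannot reset email password
```

Both the second and third documents tie in stage 1 (three shared words each); the stage 2 scorer is what separates them.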

In [None]:
CONFIG = {
    'csv_path': './dataset-tickets-multi-lang3-4k-translated-all.csv',
    'chroma_db_path': './chroma_ticket_db_langchain',
    'train_test_split': 0.8,
    'random_seed': 42,
    'embedding_model': 'all-MiniLM-L6-v2',
    'reranker_model': 'mixedbread-ai/mxbai-rerank-base-v1',
    'top_k_initial': 20,
    'top_k_reranked': 5,
    'lm_studio_url': 'http://192.168.7.171:1234',
    'llm_model': 'gpt-oss-20b',
}

print("⚙️  Configuration loaded")

## Load Data

### 📊 Data Preparation Steps

This cell performs several important data operations:

#### 1. Load CSV
```python
df = pd.read_csv(CONFIG['csv_path'])
```
Loads the multi-language ticket dataset with translations.

#### 2. Clean Data
```python
df = df.dropna(subset=['subject_english', 'body_english', 'answer_english'])
```
Removes tickets with missing English translations (our RAG system uses English).

#### 3. Shuffle Data
```python
df_shuffled = df.sample(frac=1, random_state=42)
```
- **`frac=1`**: Sample 100% of rows (shuffle all)
- **`random_state=42`**: Reproducible shuffle (same order every run)

#### 4. Train/Test Split
```python
split_idx = int(len(df_shuffled) * 0.8)  # 80% mark
train_df = df_shuffled[:split_idx]   # First 80%
test_df = df_shuffled[split_idx:]    # Last 20%
```

### 🎯 Why Split Data?

**Training Set**: Build the vector database (knowledge base)

**Test Set**: Evaluate RAG performance (unseen tickets)

This prevents "cheating" - the system must generalize, not memorize!

### 📈 Expected Output

You should see approximately:
- **Train**: ~3,200 tickets (80%)
- **Test**: ~800 tickets (20%)
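
If you want to convince yourself that a seeded shuffle-and-split is reproducible and leak-free, the same logic can be sketched without pandas. This is a pure-Python stand-in for the `df.sample(...)` approach; `shuffle_and_split` is a hypothetical helper, and the 4,000 IDs assume the nominal dataset size (the real count after `dropna` may be lower).

```python
import random


def shuffle_and_split(items, train_fraction, seed):
    """Return (train, test) after a seeded shuffle; the input list is left untouched."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    split_idx = int(len(shuffled) * train_fraction)
    return shuffled[:split_idx], shuffled[split_idx:]


ticket_ids = list(range(4000))  # stand-in for 4,000 ticket rows
train, test = shuffle_and_split(ticket_ids, train_fraction=0.8, seed=42)

print(len(train), len(test))        # → 3200 800
print(set(train).isdisjoint(test))  # → True (no ticket appears in both sets)
```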

In [None]:
# Load and split
df = pd.read_csv(CONFIG['csv_path'])
df = df.dropna(subset=['subject_english', 'body_english', 'answer_english'])
df_shuffled = df.sample(frac=1, random_state=CONFIG['random_seed']).reset_index(drop=True)

split_idx = int(len(df_shuffled) * CONFIG['train_test_split'])
train_df = df_shuffled[:split_idx].reset_index(drop=True)
test_df = df_shuffled[split_idx:].reset_index(drop=True)

print(f"📊 Train: {len(train_df):,} tickets")
print(f"📊 Test: {len(test_df):,} tickets")

## Initialize LangChain Components

### 🧱 The Three Building Blocks of RAG

This RAG system is built from three components (a re-ranker is optional in general, but we use one here):

1. **Embeddings** → Convert text to vectors
2. **LLM** → Generate answers
3. **Re-ranker** → Improve retrieval quality

### 🎯 Component Details

#### 1. Embeddings Model (`HuggingFaceEmbeddings`)

**What it does**: Converts text → 384-dimensional vectors

**Key parameters**:
- `model_name`: `all-MiniLM-L6-v2` (fast, good quality, popular choice)
- `device='mps'`: Use Apple Silicon GPU (use `'cpu'` on other systems)
- `normalize_embeddings=True`: Unit vectors for cosine similarity

**Why this model?**: Balance of speed, quality, and size
- ✅ Fast inference (~1000 docs/sec)
- ✅ Small memory footprint (~80MB)
- ✅ Good quality for most tasks
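
Why does `normalize_embeddings=True` matter? For unit-length vectors the dot product equals the cosine similarity, and squared Euclidean distance becomes `2 - 2 * cosine`, so ranking by either gives the same order. The sketch below checks this on two small hand-picked vectors (not real embeddings) using only the standard library.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]
ua, ub = normalize(a), normalize(b)

# For unit vectors the dot product is the cosine similarity ...
assert math.isclose(dot(ua, ub), cosine(a, b))
# ... and squared L2 distance is a decreasing function of it: d^2 = 2 - 2 * cos.
assert math.isclose(squared_l2(ua, ub), 2 - 2 * cosine(a, b))

print(round(cosine(a, b), 4))  # → 0.7333
```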

#### 2. LLM (`ChatOpenAI`)

**What it does**: Generates natural language answers

**Key parameters**:
- `base_url`: LM Studio endpoint (OpenAI-compatible API)
- `api_key`: Not checked by a local LM Studio server, but the client still requires a value, so we pass a placeholder string
- `temperature=0.2`: Low randomness (more focused answers)
- `max_tokens=6000`: Maximum response length

**Why LM Studio?**: Free, local, private, fast!

#### 3. Re-ranker (`CrossEncoder`)

**What it does**: Scores query-document pairs for better ranking

**Model**: `mxbai-rerank-base-v1` (from Mixedbread AI)
- More accurate than vector similarity alone
- Slower (too expensive to run on every document)
- Used in Stage 2 of retrieval

### 📊 Performance Notes

**First run**: Downloads models (the re-ranker is considerably larger than the ~80MB embedding model, so expect well over 100MB in total)

**Subsequent runs**: Loads from cache (fast!)

In [None]:
# 1. Embeddings
print("Loading embeddings...")
embeddings = HuggingFaceEmbeddings(
    model_name=CONFIG['embedding_model'],
    model_kwargs={'device': 'mps'},  # Apple Silicon GPU; change to 'cpu' on other systems
    encode_kwargs={'normalize_embeddings': True}
)
print(f"✅ Embeddings ready")

# 2. LLM
print("\nConnecting to LM Studio...")
llm = ChatOpenAI(
    base_url=f"{CONFIG['lm_studio_url']}/v1",
    api_key="not-needed",
    model=CONFIG['llm_model'],
    temperature=0.2,
    max_tokens=6000
)
print(f"✅ LLM ready")

# 3. Re-ranker
print("\nLoading re-ranker...")
reranker = CrossEncoder(CONFIG['reranker_model'])
print(f"✅ Re-ranker ready")

## Create Vector Store

### 🗄️ Building the Knowledge Base

This is where we create the **vector database** - the core of our RAG system.

### 📝 Two-Step Process

#### Step 1: Convert Tickets → LangChain Documents

**Helper Function**: `create_ticket_text(row)`
```python
def create_ticket_text(row):
    return f"""Subject: {row['subject_english']}
Body: {row['body_english']}
Answer: {row['answer_english']}"""
```

Formats each ticket as structured text with clear sections.

**Document Creation**:
```python
Document(
    page_content=create_ticket_text(row),  # The text content
    metadata={                              # Extra information
        'ticket_id': f"ticket_{idx}",
        'type': str(row.get('type', 'unknown')),
        'priority': str(row.get('priority', 'unknown'))
    }
)
```

**LangChain `Document` Class**:
- `page_content`: The actual text (searchable)
- `metadata`: Tags/attributes (filterable, returnable)

#### Step 2: Build Vector Store

**ChromaDB** is an embedded vector database:
- ✅ Fast similarity search
- ✅ Persistent storage (saves to disk when a `persist_directory` is set)
- ✅ No separate server needed

### ⚠️ Note About This Cell

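**SKIP IF ALREADY BUILT**: Two alternative build cells follow, so run only one of them. The first creates an in-memory store that is rebuilt on every run. The second creates a persisted store at `chroma_db_path` and deletes any existing database there before rebuilding. Neither cell reloads an existing database from disk.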
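
One way to reuse a persisted database instead of rebuilding it is to reopen the collection when the directory already exists. The sketch below is an assumption about how you might wire this up, not the notebook's own code: `persisted_store_exists` and `load_or_build_vectorstore` are hypothetical helpers, and "directory is non-empty" is only a rough proxy for "already built". The `Chroma(...)` constructor and `Chroma.from_documents(...)` calls use the same arguments as the cells in this section.

```python
import os


def persisted_store_exists(path: str) -> bool:
    """True if `path` is a non-empty directory (a rough proxy for 'already built')."""
    return os.path.isdir(path) and len(os.listdir(path)) > 0


def load_or_build_vectorstore(path, embeddings, documents, collection_name="tickets_simple"):
    """Reopen a persisted Chroma collection if present; otherwise embed and persist."""
    # Imported lazily so the helper above works without LangChain installed.
    from langchain_chroma import Chroma

    if persisted_store_exists(path):
        # Reopening reads the stored vectors; nothing is re-embedded.
        return Chroma(
            collection_name=collection_name,
            embedding_function=embeddings,
            persist_directory=path,
        )
    return Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        collection_name=collection_name,
        persist_directory=path,
    )
```

In the notebook this could be called as `vectorstore = load_or_build_vectorstore(CONFIG['chroma_db_path'], embeddings, documents)`. Remember to rebuild whenever the embedding model or the training split changes.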

### 🔧 Technical Details

**Embedding Process**:
1. Take each document's `page_content`
2. Pass through embedding model → 384D vector
3. Store vector + metadata in ChromaDB

**Time**: ~30-60 seconds for 3,200 tickets

**Storage**: ~50MB on disk

In [None]:
#### SKIP IF ALREADY BUILT ####


# Helper function
def create_ticket_text(row):
    return f"""Subject: {row['subject_english']}
Body: {row['body_english']}
Answer: {row['answer_english']}"""

# Convert to LangChain Documents
print("Creating documents...")
documents = [
    Document(
        page_content=create_ticket_text(row),
        metadata={
            'ticket_id': f"ticket_{idx}",
            'type': str(row.get('type', 'unknown')),
            'priority': str(row.get('priority', 'unknown'))
        }
    )
    for idx, row in train_df.iterrows()
]
print(f"✅ Created {len(documents):,} documents")

# Build vector store
print("\nBuilding vector store...")
print("📝 Using in-memory mode (fast, but rebuilds each run)")

# Simple in-memory approach - always works!
vectorstore = Chroma(
    collection_name="tickets_simple",
    embedding_function=embeddings
)

print("Adding documents to vector store...")
vectorstore.add_documents(documents)
print(f"✅ Vector store ready with {vectorstore._collection.count():,} documents")

In [None]:
# Helper function
def create_ticket_text(row):
    return f"""Subject: {row['subject_english']}
Body: {row['body_english']}
Answer: {row['answer_english']}"""

# Convert to LangChain Documents
print("Creating documents...")
documents = [
    Document(
        page_content=create_ticket_text(row),
        metadata={
            'ticket_id': f"ticket_{idx}",
            'type': str(row.get('type', 'unknown')),
            'priority': str(row.get('priority', 'unknown'))
        }
    )
    for idx, row in train_df.iterrows()
]
print(f"✅ Created {len(documents):,} documents")

# Build vector store
print("\nBuilding vector store...")

# Fix for ChromaDB readonly database error
import shutil
if os.path.exists(CONFIG['chroma_db_path']):
    print(f"⚠️  Removing existing ChromaDB at {CONFIG['chroma_db_path']}")
    shutil.rmtree(CONFIG['chroma_db_path'])

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    collection_name="tickets_simple",
    persist_directory=CONFIG['chroma_db_path']
)
print(f"✅ Vector store ready with {vectorstore._collection.count():,} documents")

## Custom Retriever with Re-ranking

### 🎯 Two-Stage Retrieval Strategy

This custom retriever implements a **sophisticated retrieval pipeline**:

```
Query → Vector Search (fast, ~20 docs) → Re-rank (slow, accurate) → Top 5 docs
```

### 📊 Why Custom Retriever?

LangChain's `BaseRetriever` class lets us create specialized retrievers that fit into LCEL chains seamlessly.

### 🔧 Code Breakdown

#### Class Definition
```python
class SimpleRerankRetriever(BaseRetriever):
```
Inherits from `BaseRetriever` → compatible with all LangChain chain operations.

#### Required Fields
```python
vectorstore: Chroma         # Where to search
reranker: CrossEncoder      # How to re-rank
k_initial: int = 20         # Stage 1: get 20 candidates
k_final: int = 5            # Stage 2: return top 5
```

#### Core Method: `_get_relevant_documents(query)`

**Stage 1: Vector Search**
```python
docs = self.vectorstore.similarity_search(query, k=self.k_initial)
```
- Ranks by embedding distance; because the embeddings are unit-normalized, the ordering matches cosine similarity
- Fast: ~10-20ms for 3k documents
- Gets more candidates than needed

**Stage 2: Re-ranking**
```python
pairs = [[query, doc.page_content] for doc in docs]
scores = self.reranker.predict(pairs)
```
- Cross-encoder scores each query-document pair
- Slower: ~100-200ms for 20 pairs
- Much more accurate than vector similarity alone

**Stage 3: Sort and Filter**
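```python
for doc, score in zip(docs, scores):
    doc.metadata['rerank_score'] = float(score)
docs_sorted = sorted(docs, key=lambda x: x.metadata['rerank_score'], reverse=True)
return docs_sorted[:self.k_final]
```
- Stores each score in the document's metadata
- Sorts by re-rank score (highest first)
- Returns only the top `k_final` documents (5 here)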

### 💡 Key Insight: Trade-offs

| Method | Speed | Accuracy | Scalability |
|--------|-------|----------|-------------|
| Vector search only | ⚡ Fast | 😐 OK | ✅ Great |
| Re-rank all docs | 🐌 Slow | ✅ Great | ❌ Poor |
| **Two-stage** | ✅ Fast | ✅ Great | ✅ Great |

### 🧪 Performance Numbers

For 3,200 documents:
- **Vector search** (k=20): ~15ms
- **Re-ranking** (20 docs): ~150ms
- **Total**: ~165ms per query

Compare to re-ranking all 3,200 docs: ~24,000ms! ⏱️
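
The arithmetic behind these numbers is simple enough to check directly. The sketch below assumes that re-ranking cost grows linearly with the number of query-document pairs and reuses the approximate timings quoted above; it is a back-of-the-envelope estimate, not a benchmark.

```python
def rerank_cost_ms(num_pairs: int, ms_per_pair: float) -> float:
    """Estimated cross-encoder cost, assuming it is linear in the number of pairs."""
    return num_pairs * ms_per_pair


MS_PER_PAIR = 150 / 20   # ~150 ms for 20 pairs, i.e. 7.5 ms per pair
VECTOR_SEARCH_MS = 15    # approximate figure quoted above

two_stage = VECTOR_SEARCH_MS + rerank_cost_ms(20, MS_PER_PAIR)
rerank_everything = rerank_cost_ms(3200, MS_PER_PAIR)

print(two_stage)                             # → 165.0
print(rerank_everything)                     # → 24000.0
print(round(rerank_everything / two_stage))  # → 145
```

Under these assumptions the two-stage pipeline is roughly 145 times cheaper per query than re-ranking the whole collection.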

In [None]:
class SimpleRerankRetriever(BaseRetriever):
    """Two-stage retrieval: vector search → re-rank."""
    
    vectorstore: Chroma
    reranker: CrossEncoder
    k_initial: int = 20
    k_final: int = 5
    
    def _get_relevant_documents(self, query: str) -> List[Document]:
        # Stage 1: Get initial results
        docs = self.vectorstore.similarity_search(query, k=self.k_initial)
        
        # Stage 2: Re-rank
        pairs = [[query, doc.page_content] for doc in docs]
        scores = self.reranker.predict(pairs)
        
        # Add scores and sort
        for doc, score in zip(docs, scores):
            doc.metadata['rerank_score'] = float(score)
        
        docs_sorted = sorted(docs, key=lambda x: x.metadata['rerank_score'], reverse=True)
        return docs_sorted[:self.k_final]

# Create retriever
retriever = SimpleRerankRetriever(
    vectorstore=vectorstore,
    reranker=reranker,
    k_initial=CONFIG['top_k_initial'],
    k_final=CONFIG['top_k_reranked']
)

print("✅ Retriever ready")

## Define Helper Functions

### 🔧 Context Formatting Function

This simple but crucial function transforms retrieved documents into a formatted context string for the LLM.

### 📝 Code Analysis

```python
def format_docs_to_context(docs: List[Document]) -> str:
```

**Input**: List of `Document` objects (from our retriever)

**Output**: Single formatted string

### 🎨 Formatting Strategy

For each document:
1. Add a separator: `--- Ticket {i} ---`
2. Include the re-rank score: `(score: 0.845)`
3. Add the document content

**Example Output**:
```
--- Ticket 1 (score: 0.892) ---
Subject: Cannot access email
Body: User reports login fails...
Answer: Reset password via admin panel...

--- Ticket 2 (score: 0.756) ---
Subject: VPN connection issues
Body: Remote worker cannot connect...
Answer: Check firewall settings...
```

### 💡 Why This Matters

**Clear Structure**: LLM can easily distinguish between tickets

**Score Information**: LLM knows which tickets are most relevant

**Numbered Tickets**: LLM can cite specific tickets in answer

### 🔗 LCEL Integration

This function will be wrapped in `RunnableLambda` and used as the `context` branch of our chain:

```python
{"context": retriever | RunnableLambda(format_docs_to_context), "question": lambda x: x} | prompt | llm
```

Simple functions → Powerful chains! 🚀

In [None]:
def format_docs_to_context(docs: List[Document]) -> str:
    """Format documents into context string."""
    parts = []
    for i, doc in enumerate(docs, 1):
        score = doc.metadata.get('rerank_score', 0)
        parts.append(f"\n--- Ticket {i} (score: {score:.3f}) ---\n{doc.page_content}")
    return "\n".join(parts)

print("✅ Helper functions ready")

## Build LCEL Chain

### 🔗 Introduction to LCEL (LangChain Expression Language)

**This is the magic of LangChain!** LCEL lets you chain components together using the **pipe operator `|`**.

Think of it like Unix pipes: `cat file.txt | grep "error" | wc -l`
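
To see why `|` can chain Python objects at all, here is a toy class that overloads the operator via `__or__`. It is a teaching sketch only: `Step` is a made-up name and this is not how LangChain implements its runnables, but the composition idea is the same.

```python
class Step:
    """Toy runnable: wraps a function and composes with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `a | b` builds a new Step that feeds a's output into b.
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


# Rough analogue of: cat file.txt | grep "error" | wc -l
split_lines = Step(str.splitlines)
keep_errors = Step(lambda lines: [line for line in lines if "error" in line])
count = Step(len)

pipeline = split_lines | keep_errors | count
print(pipeline.invoke("ok\nerror: disk full\nok\nerror: timeout"))  # → 2
```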

### 🎯 Building Blocks

#### 1. Define the Prompt Template

```python
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an IT support assistant..."),
    ("user", "HISTORICAL TICKETS:\n{context}\n\nUSER QUESTION:\n{question}\n...")
])
```

**Prompt variables**: `{context}` and `{question}` will be filled by our chain.

**Message types**:
- `system`: Sets assistant behavior/role
- `user`: The actual query

### 🔄 Understanding the Chain

Let's build it step by step:

#### Simple Chain (Conceptual)
```python
retriever | format | {"context": ..., "question": ...}
```

#### Complete RAG Chain
```python
rag_chain = (
    {"context": retriever | RunnableLambda(format_docs_to_context), 
     "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)
```

### 📊 Flow Visualization

```
Input Query
    ↓
┌───────────────────────────────────────┐
│ Step 1: Parallel Execution            │
│  ├─ retriever → format → context      │
│  └─ lambda x: x → question            │
└───────────────────────────────────────┘
    ↓
┌───────────────────────────────────────┐
│ Step 2: Prompt Template               │
│  Combines {context} + {question}      │
└───────────────────────────────────────┘
    ↓
┌───────────────────────────────────────┐
│ Step 3: LLM Generation                │
│  Sends prompt to LM Studio            │
└───────────────────────────────────────┘
    ↓
┌───────────────────────────────────────┐
│ Step 4: Output Parsing                │
│  Extracts string from LLM response    │
└───────────────────────────────────────┘
    ↓
Final Answer
```

### 💡 Key LCEL Concepts

#### 1. Dictionary Runnables
```python
{"context": retriever | format, "question": lambda x: x}
```
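LCEL coerces this dict into a `RunnableParallel`: every branch receives the same input, the branches run in parallel, and the results are collected into a dict with the same keys.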
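
The fan-out behaviour can be imitated with the standard library. In this sketch `run_branches` and `fake_retrieve_and_format` are hypothetical stand-ins (the latter replaces `retriever | format`); it only illustrates the pattern of "same input to every branch, results gathered by key".

```python
from concurrent.futures import ThreadPoolExecutor


def run_branches(branches: dict, value):
    """Call every branch with the same input and collect the results under its key."""
    with ThreadPoolExecutor() as pool:
        futures = {key: pool.submit(func, value) for key, func in branches.items()}
        return {key: future.result() for key, future in futures.items()}


def fake_retrieve_and_format(question: str) -> str:
    # Hypothetical stand-in for `retriever | format`
    return f"[tickets related to: {question}]"


prompt_inputs = run_branches(
    {"context": fake_retrieve_and_format, "question": lambda x: x},
    "VPN keeps disconnecting",
)
print(prompt_inputs)
# → {'context': '[tickets related to: VPN keeps disconnecting]', 'question': 'VPN keeps disconnecting'}
```

The resulting dict has exactly the keys the prompt template expects, which is why this construct sits directly in front of `prompt` in the chain.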

#### 2. RunnableLambda
```python
RunnableLambda(format_docs_to_context)
```
Wraps any function to work in LCEL chains.

#### 3. Lambda Functions
```python
lambda x: x
```
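Pass-through function (returns input unchanged), so the original query reaches the prompt as `{question}`. `RunnablePassthrough()` from `langchain_core.runnables` does the same job.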

#### 4. Output Parsers
```python
StrOutputParser()
```
Extracts the text string from LLM response object.

### 🚀 Why LCEL is Powerful

✅ **Readable**: Flow is clear and linear

✅ **Composable**: Swap components easily

✅ **Streaming**: Built-in support for streaming responses

✅ **Parallel**: Automatic parallel execution where possible

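✅ **Typed interfaces**: Each component exposes input and output schemas, although mismatches surface at runtime rather than being checked statically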

In [None]:
# Define prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an IT support assistant. Answer based on the historical tickets provided."),
    ("user", """HISTORICAL TICKETS:
{context}

USER QUESTION:
{question}

Provide a helpful answer based on the historical tickets above.""")
])

# Retrieval half of the pipeline on its own (handy for inspecting the context string)
# Step 1: Retrieve → Step 2: Format
chain = (
    retriever                                    # Step 1: Get documents
    | RunnableLambda(format_docs_to_context)     # Step 2: Format to context string
)

# Complete RAG chain
rag_chain = (
    {"context": retriever | RunnableLambda(format_docs_to_context), "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)

print("✅ LCEL chain built!")
print("   Flow: Query → Retrieve → Format → Prompt → LLM → Answer")

## Test the Chain

### 🧪 End-to-End RAG System Test

Now we put everything together and see the RAG system in action!

### 📝 Test Setup

#### 1. Select Test Ticket
```python
test_ticket = test_df.iloc[555]
```
Picks the ticket at position 555 (zero-based) in our **test set** (unseen by the vector database).

#### 2. Format Query
```python
query = f"""Subject: {test_ticket['subject_english']}
Body: {test_ticket['body_english']}"""
```
Creates query in same format as training data (but **without** the answer!).

### 🔄 What Happens When You Run This Cell

#### Behind the Scenes:

1. **Retrieval Stage**
   - Query embedding: Convert query text → 384D vector
   - Vector search: Find 20 similar tickets (cosine similarity)
   - Re-ranking: Score all 20 with cross-encoder
   - Selection: Pick top 5 re-ranked results

2. **Generation Stage**
   - Format: Convert 5 documents → context string
   - Prompt: Combine context + query → complete prompt
   - LLM Call: Send to LM Studio
   - Parse: Extract text from response

#### Performance Expectations:
- **Retrieval**: ~165ms (vector search + re-ranking)
- **LLM Generation**: ~2-5 seconds (depends on model/hardware)
- **Total**: ~2-6 seconds
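
These figures depend heavily on hardware and model, so it is worth measuring them yourself. A minimal timing helper could look like the sketch below; `timed` is a hypothetical name, and the demonstration uses a stand-in workload so that it runs anywhere.

```python
import time
from typing import Any, Callable, Tuple


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Run func and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, (time.perf_counter() - start) * 1000.0


# Stand-in workload; in the notebook you might call
# timed(retriever.invoke, query) and timed(rag_chain.invoke, query) instead.
result, elapsed_ms = timed(sum, range(1_000_000))
print(f"result={result}, elapsed={elapsed_ms:.1f} ms")
```

Note that `rag_chain.invoke` includes retrieval, so generation time is roughly the chain time minus the retriever time.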

### 📊 Evaluating Results

**Good signs**:
- ✅ Answer addresses the specific question
- ✅ Uses information from retrieved tickets
- ✅ Formatted professionally
- ✅ Includes troubleshooting steps

**Red flags**:
- ❌ Generic answer (not using retrieval context)
- ❌ Hallucinated information
- ❌ Doesn't match the ticket type
- ❌ Overly brief or unclear

### 💡 Interactive Experiment

Try changing `test_df.iloc[555]` to different indices:
- `test_df.iloc[0]` - First test ticket
- `test_df.iloc[100]` - Another fixed ticket
- `test_df.sample(1).iloc[0]` - Random ticket

Compare:
1. Original answer (in test set)
2. RAG-generated answer
3. Retrieved ticket answers

This helps you understand when RAG works well vs. needs improvement!

In [None]:
# Get test ticket
test_ticket = test_df.iloc[555]

query = f"""Subject: {test_ticket['subject_english']}
Body: {test_ticket['body_english']}"""

print("="*80)
print("TEST TICKET")
print("="*80)
print(query)
print("\n" + "="*80)

# Run the chain
print("\n🔄 Running LCEL chain...\n")
answer = rag_chain.invoke(query)

print("="*80)
print("GENERATED ANSWER")
print("="*80)

# Display answer as Markdown for better formatting
display(Markdown(answer))

print("\n" + "="*80)

## View Retrieved Documents

### 🔍 Examining the Retrieval Quality

This cell lets us inspect **what the system actually retrieved** - crucial for debugging and understanding RAG performance!

### 🎯 Why This Matters

The quality of your RAG answer is **directly dependent** on retrieval quality:

```
Bad Retrieval → Bad Answer (even with great LLM)
Good Retrieval → Good Answer (even with mediocre LLM)
```

### 📊 What to Look For

When examining retrieved documents:

#### ✅ Good Retrieval Signs

1. **Relevant Content**
   - Documents address similar issues
   - Terminology matches the query
   - Solutions are applicable

2. **High Re-rank Scores**
   - Top document: >0.7 (strong match)
   - Documents 2-5: >0.4 (relevant)
   - Clear score separation (best doc significantly higher)

3. **Diversity**
   - Different tickets with same solution
   - Multiple perspectives on same problem
   - Various contexts (different users, setups)

#### ❌ Poor Retrieval Signs

1. **Irrelevant Content**
   - Documents about different topics
   - Wrong product/system
   - Unrelated technical issues

2. **Low Re-rank Scores**
   - All scores <0.5 (weak matches)
   - Scores very similar (no clear winner)
   - Top score <0.3 (likely poor answer quality)

3. **Duplicates**
   - Same ticket multiple times
   - Identical or near-identical solutions
   - No information diversity
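
Some of these warning signs are easy to check automatically. The sketch below is one possible helper, not part of the notebook: `retrieval_warnings` is a hypothetical name, it works on plain `(rerank_score, text)` pairs, and the thresholds are simply the rules of thumb listed above.

```python
from typing import List, Tuple


def retrieval_warnings(results: List[Tuple[float, str]]) -> List[str]:
    """Flag the warning signs listed above for (rerank_score, text) pairs."""
    if not results:
        return ["no documents retrieved"]

    warnings = []
    scores = [score for score, _ in results]
    if max(scores) < 0.3:
        warnings.append("top score below 0.3")
    elif all(score < 0.5 for score in scores):
        warnings.append("all scores below 0.5")

    # Treat texts that differ only in case or surrounding whitespace as duplicates.
    texts = [text.strip().lower() for _, text in results]
    if len(set(texts)) < len(texts):
        warnings.append("duplicate documents")
    return warnings


sample = [
    (0.42, "Reset the password via the admin panel"),
    (0.40, "reset the password via the admin panel"),
    (0.21, "Check firewall settings"),
]
print(retrieval_warnings(sample))  # → ['all scores below 0.5', 'duplicate documents']
```

With the retriever from this notebook, the input could be built as `[(d.metadata['rerank_score'], d.page_content) for d in docs]`.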

### 🔬 Analysis Workflow

For each retrieved document:

1. **Check relevance**: Does it actually relate to the query?
2. **Evaluate score**: Is the re-rank score reasonable?
3. **Assess quality**: Would this help answer the question?
4. **Compare ranks**: Are better documents ranked higher?

### 💡 Debugging Tips

**If retrieval is poor**:
- Try different embedding models
- Adjust `k_initial` and `k_final` values
- Check if query format matches training data
- Verify documents are properly indexed

**If scores seem wrong**:
- Re-ranker might need different model
- Query might be too short/vague
- Documents might need better formatting

### 🎓 Learning Exercise

Compare the retrieved documents to:
1. The original test ticket answer
2. The RAG-generated answer
3. Other similar tickets in the test set

This helps you understand:
- What information the LLM had available
- Why it generated its specific answer
- How retrieval quality affects output quality

In [None]:
# Re-run retrieval for the same query to inspect the documents the chain received
docs = retriever.invoke(query)

print("="*80)
print(f"TOP {len(docs)} RETRIEVED TICKETS")
print("="*80)

for i, doc in enumerate(docs, 1):
    score = doc.metadata.get('rerank_score', 0)
    print(f"\n--- Ticket {i} (rerank score: {score:.3f}) ---")
    print(doc.page_content[:300] + "...")

print("\n" + "="*80)

## Understanding LCEL

### What We Just Built

```python
rag_chain = (
    {"context": retriever | RunnableLambda(format_docs_to_context), "question": lambda x: x}
    | prompt
    | llm
    | StrOutputParser()
)
```

### Breaking It Down

1. **`retriever`** - Gets relevant documents from vector store
2. **`| RunnableLambda(format_docs_to_context)`** - Formats documents into context string
3. **`| prompt`** - Creates the prompt with context and question
4. **`| llm`** - Sends to LLM and gets response
5. **`| StrOutputParser()`** - Extracts text from LLM response

### Key Benefits

**1. Readable**: Chain flow is clear and easy to understand

**2. Modular**: Each component can be tested independently
```python
# Test just the retriever
docs = retriever.invoke("test query")
```

**3. Reusable**: Components can be used in different chains
```python
# Use same retriever in a different chain (different_prompt is a placeholder)
another_chain = (
    {"context": retriever | RunnableLambda(format_docs_to_context), "question": lambda x: x}
    | different_prompt
    | llm
)
```

**4. Composable**: Easy to add or remove steps
```python
# Add a translation step (translator is a placeholder runnable)
chain_with_translation = rag_chain | translator
```

## Comparison: Original vs LCEL

### Original Approach (Procedural)
```python
def rag_pipeline(query):
    docs = retrieve(query)
    context = format(docs)
    prompt_text = build_prompt(context, query)
    answer = llm.generate(prompt_text)
    return answer
```

### LCEL Approach (Declarative)
```python
rag_chain = (
    {"context": retriever | RunnableLambda(format_docs_to_context), "question": lambda x: x}
    | prompt | llm | StrOutputParser()
)
answer = rag_chain.invoke(query)
```

### Why LCEL?

| Feature | Original | LCEL |
|---------|----------|------|
| Code style | Procedural | Declarative |
| Testing | Full pipeline | Per component |
| Reusability | Copy-paste | Compose |
| Streaming | Manual | Built-in |
| Modification | Edit function | Swap components |

## Next Steps

Now that you understand basic LCEL chains, you can:

1. **Add streaming**: Get answers word-by-word
```python
for chunk in rag_chain.stream(query):
    print(chunk, end="", flush=True)
```

2. **Add error handling**: Catch and handle errors gracefully
```python
chain_with_fallback = rag_chain.with_fallbacks([backup_chain])
```

3. **Add evaluation**: Chain an evaluation step
```python
full_chain = rag_chain | evaluation_chain
```

4. **Batch processing**: Process multiple queries
```python
answers = rag_chain.batch([query1, query2, query3])
```

## Summary

**What We Learned**:
1. LangChain components (embeddings, LLM, retriever, prompts)
2. LCEL syntax: chaining with `|` operator
3. Sequential data flow through components
4. Benefits: modularity, testability, reusability

**What We Built**:
- Simple RAG chain: Retrieve → Format → Generate
- Two-stage retrieval with re-ranking
- Clean, composable architecture

**This is the foundation** - from here we can build more complex chains with parallel execution, conditional logic, and advanced patterns!

## 📝 Homework Assignments

Complete these exercises to deepen your understanding of LangChain and RAG systems.

### Assignment 1: Add Streaming Output

**Goal**: Modify the RAG chain to stream the LLM response word-by-word instead of waiting for the complete answer.

**Tasks**:
1. Research the `.stream()` method in LangChain
2. Create a new cell that uses `rag_chain.stream(query)` instead of `rag_chain.invoke(query)`
3. Print each chunk as it arrives with `end=""` and `flush=True`
4. Compare the user experience between streaming vs. non-streaming

**Starter Code**:
```python
# TODO: Implement streaming
query = "How do I reset my password?"

print("Streaming answer: ", end="", flush=True)
# Your code here
```

**Bonus**: Add a timer to measure and compare response times for first token vs. complete response.

**Hint**: Replace this block of code with a loop over `rag_chain.stream(query)`:

```python
answer = rag_chain.invoke(query)

print("="*80)
print("GENERATED ANSWER")
print("="*80)
print(answer)
print("\n" + "="*80)
```

### Assignment 2: Experiment with Retrieval Parameters (Intermediate)

**Goal**: Understand how retrieval parameters affect answer quality by testing different configurations.

**Tasks**:
1. Create a new retriever with different `k_initial` and `k_final` values
2. Test at least 3 different configurations:
   - Configuration A: k_initial=10, k_final=3
   - Configuration B: k_initial=30, k_final=10
   - Configuration C: k_initial=20, k_final=5 (baseline)
3. Run the same query through each configuration
4. Compare the retrieved documents and generated answers
5. Analyze trade-offs between:
   - Retrieval quality (are the right documents found?)
   - Answer accuracy (is the LLM generating better answers?)
   - Performance (speed and resource usage)

**Starter Code**:
```python
# TODO: Create experimental configurations
configs_to_test = [
    {"name": "Config A", "k_initial": 10, "k_final": 3},
    {"name": "Config B", "k_initial": 30, "k_final": 10},
    {"name": "Config C", "k_initial": 20, "k_final": 5},
]

test_ticket = test_df.iloc[100]  # Pick a different test ticket
test_query = f"""Subject: {test_ticket['subject_english']}
Body: {test_ticket['body_english']}"""

for config in configs_to_test:
    # Create retriever with config
    # Run query
    # Compare results
    pass
```

**Questions**:
- Which configuration provided the best answer quality?
- What's the performance impact of larger k_initial values?
- Is there a point where more documents hurt answer quality?

### Assignment 3: Build a Confidence Scoring Chain (Advanced)

**Goal**: Extend the RAG chain to include a confidence score based on document relevance and answer quality.

**Tasks**:
1. Create a confidence calculation function that:
   - Analyzes rerank scores from retrieved documents
   - Calculates metrics like: average score, score variance, top score
   - Returns a confidence level: "High" (>0.7), "Medium" (0.4-0.7), "Low" (<0.4)
2. Wrap the function in a `RunnableLambda`
3. Add it to the chain using the pipe operator
4. Test with multiple queries and verify confidence scores make sense

**Starter Code**:
```python
def calculate_confidence(chain_output: dict) -> dict:
    """
    Calculate confidence score from retrieval and answer.
    
    Args:
        chain_output: Dict with 'answer', 'docs', and other chain data
        
    Returns:
        Dict with original data plus 'confidence' and 'confidence_level'
    """
    # TODO: Extract rerank scores from documents
    # TODO: Calculate statistics (mean, max, variance)
    # TODO: Determine confidence level
    # TODO: Return enriched output
    pass

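# TODO: Create enhanced chain
from langchain_core.runnables import RunnableParallel, RunnablePassthrough

# Generates the answer from already-retrieved docs, so retrieval runs only once
answer_chain = (
    {"context": lambda x: format_docs_to_context(x["docs"]),
     "question": lambda x: x["question"]}
    | prompt
    | llm
    | StrOutputParser()
)

enhanced_chain = (
    RunnableParallel(docs=retriever, question=RunnablePassthrough())
    | RunnablePassthrough.assign(answer=answer_chain)  # Keeps 'docs' and 'question', adds 'answer'
    | RunnableLambda(calculate_confidence)             # Receives {'docs', 'question', 'answer'}
)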

# Test with various queries
test_queries = [
    "How do I reset my password?",  # Should have high confidence
    "What is quantum computing?",     # Should have low confidence (off-topic)
]
```

**Challenge**: Create a fallback mechanism that triggers when confidence is low, asking the user to rephrase or providing a disclaimer that the system is uncertain.

**Bonus**: Log confidence scores and correlate them with actual answer quality by manual review.