# Ticket RAG System - Week 05

##  Building on Previous Work

### Week 02 Recap: Ticket Triage
In Week 02, we built a ticket triage system that:
- **Detected language** of support tickets (German, English, French, Portuguese, Spanish)
- **Translated** subject and body to English using LLM
- **Classified tickets** into type, queue, and priority categories
- Used **individual LLM calls** for each task (language detection → translation → classification)

**Key Learning**: We can process tickets one at a time, but each ticket is handled in isolation, with no reference to similar past tickets.

### Week 03 Recap: Optimization & Consolidation
In Week 03, we improved our approach:
- **Consolidated LLM calls**: Combined 6 separate calls into 1 comprehensive call
- **JSON structured output**: Added schema validation for consistent responses
- **Token tracking**: Measured cost/efficiency of LLM operations
- **Batch processing**: Handled multiple tickets efficiently

**Key Learning**: Efficiency matters - consolidate operations and track resources.

### Week 04 Recap: RAG with PDFs
In Week 04, we learned Retrieval-Augmented Generation (RAG):
- **PDF → Chunks**: Split documents into manageable pieces
- **Embeddings**: Convert text to vector representations
- **ChromaDB**: Store vectors for fast similarity search
- **LLM + Context**: Generate answers using retrieved relevant chunks

**Pattern**: `Document → Embed → Store → Query → Retrieve Similar → LLM Answer`
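The pattern above can be sketched end to end with a toy example. This is only an illustration under loud assumptions: the `embed` function below is a word-count stand-in for a real embedding model, the in-memory `store` list stands in for ChromaDB, and the three documents are invented. The final LLM call is omitted; the sketch stops at building the prompt.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Document -> Embed -> Store (invented example documents)
documents = [
    "reset vpn password from the self-service portal",
    "printer is offline after driver update",
    "outlook cannot sync mailbox",
]
store = [(doc, embed(doc)) for doc in documents]


def retrieve(query: str, top_k: int = 1) -> list:
    """Query -> Retrieve Similar: rank stored documents by similarity."""
    query_vec = embed(query)
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:top_k]]


def build_prompt(query: str, context_docs: list) -> str:
    """Assemble the text that would be sent to the LLM for the answer step."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


question = "how do I reset my vpn password"
print(build_prompt(question, retrieve(question)))
```

Swapping `embed` for a sentence-transformers model and `store` for a ChromaDB collection gives the Week 04 setup; the control flow stays the same.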



## This Week: RAG for Ticket Support

### The Challenge
We can triage individual tickets (Week 02/03), but can we leverage **historical tickets** to provide better answers?

**Questions to answer:**
- How do we find similar past tickets?
- Can historical solutions help with new problems?
- How do we measure answer quality?

### The Solution: Apply RAG to Tickets
Combine all previous concepts:
- Use **pre-translated tickets** from Week 02/03 preprocessing (CSV with `_english` columns)
- Apply **RAG pattern from Week 04** (but with structured CSV instead of PDF)
- **Search** for similar historical tickets using vector similarity
- **Generate** context-aware answers using LLM with relevant examples

### New This Week
1. **Train/Test Split**: Proper evaluation methodology (80/20)
2. **Confidence Scoring**: How certain is the system about its answer?
3. **LLM Evaluation**: Let an LLM judge answer quality against the ground truth
4. **Batch Testing**: Systematic quality assessment

## Architecture Overview
```
Pre-translated CSV (from Week 02/03 work)
    ↓
Extract English columns (subject_english, body_english, answer_english)
    ↓
Train/Test Split (80/20)
    ↓
Generate Embeddings (Week 04 pattern)
    ↓
Store in ChromaDB (Week 04 pattern)
    ↓
New Ticket → Embed → Vector Search → Retrieve Top-K Similar Tickets
    ↓
Context + Question → LM Studio LLM → Answer + Confidence + References
```

## CSV Data Structure

**Original columns** (preserved from Week 02 work):
- `subject`, `body`, `answer` (in original languages: es/fr/de/pt/en)

**Translation columns** (created by `translate_tickets.ipynb`):
- `subject_english` - English translation of subject
- `body_english` - English translation of body
- `answer_english` - English translation of answer
- `original_language` - Detected language code (en/es/fr/de/pt)

**Why separate columns?**
- Original content preserved for reference
- Fast loading (no translation overhead)
- Consistent translations across experiments
- Separation of concerns (translate once, use many times)
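As a small illustration of this layout, the sketch below reads a one-row CSV that carries both the original and the `_english` columns, then keeps only the English fields for embedding. The row content is invented for the example, and the standard-library `csv` module is used here instead of pandas only to keep the snippet dependency-free.

```python
import csv
import io

# Invented one-row sample following the column layout described above
sample_csv = io.StringIO(
    "subject,body,answer,subject_english,body_english,answer_english,original_language\n"
    "Probleme mit Outlook,Outlook startet nicht,Bitte Office reparieren,"
    "Outlook issues,Outlook does not start,Please repair Office,de\n"
)

ENGLISH_COLUMNS = ["subject_english", "body_english", "answer_english"]

rows = list(csv.DictReader(sample_csv))

# Keep only the English fields for embedding; the originals stay in `rows`
english_only = [{col: row[col] for col in ENGLISH_COLUMNS} for row in rows]

print(rows[0]["original_language"], "->", english_only[0]["subject_english"])
```

Because translation happened once upstream, nothing in this step calls an LLM.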

## 1. Configuration & Imports

### Why Centralized Configuration?

**Evolution of our approach:**
- **Week 02**: Hardcoded values like `max_tokens=1200` directly in functions
- **Week 03**: Started grouping related values (model selection at top)
- **Week 04**: Configuration-first approach for PDF processing
- **Week 05**: All parameters in one `CONFIG` dictionary

**Benefits of CONFIG dictionary:**
- **Easy Experimentation**: Change `top_k` once instead of in multiple functions
- **Track Changes**: See all parameters at a glance
- **Reproducibility**: Same config = same results

**Common Parameters You'll Adjust:**
- `top_k`: Number of similar tickets to retrieve (try 3, 5, 10)
- `temperature`: LLM creativity (0.1 = factual, 0.9 = creative)
- `embedding_model`: Trade speed vs quality
- `train_test_split`: Fraction of tickets that go into the knowledge base (0.8 = 80% train, 20% test); a lower value leaves more tickets for evaluation
- `test_mode`: Set to `True` to load only the first `test_limit` tickets (100) for quick experiments; `False` processes all ~4K tickets

**Test Mode Note**: The configuration below has `'test_mode': False`, so the full dataset is loaded. To switch to quick experimentation on the first 100 tickets:
1. Find the line: `'test_mode': False`
2. Change to: `'test_mode': True`
3. Re-run the configuration and data loading cells
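The effect of the `test_mode` and `test_limit` settings can be sketched in isolation. The helper name `limit_rows` is hypothetical (it is not part of the notebook), and plain lists stand in for the DataFrame.

```python
def limit_rows(rows: list, config: dict) -> list:
    """Return only the first `test_limit` rows when test mode is on."""
    if config.get("test_mode", False):
        return rows[: config.get("test_limit", 100)]
    return rows


all_rows = list(range(4000))  # stand-in for ~4K tickets

quick = limit_rows(all_rows, {"test_mode": True, "test_limit": 100})
full = limit_rows(all_rows, {"test_mode": False, "test_limit": 100})

print(len(quick), len(full))
```

Keeping the switch in the config means the loading code itself never needs editing between quick runs and full runs.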

In [None]:
# Install required packages (run once)
# !pip install chromadb sentence-transformers pandas requests python-dotenv openai

In [122]:
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Optional, Tuple
from datetime import datetime

from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [123]:
# =====================================================
# CONFIGURATION - All parameters in one place
# =====================================================

CONFIG = {
    # File paths
    'csv_path': './dataset-tickets-multi-lang3-4k-translated-all.csv',  # Use pre-translated CSV
    'chroma_db_path': './chroma_ticket_db',
    
    # Data split
    'train_test_split': 0.8,  # 80% train, 20% test
    'random_seed': 42,
    
    # TEST MODE - Limit data loading for quick testing
    'test_mode': False,  # Set to False to process all tickets
    'test_limit': 100,  # Only load first N tickets in test mode
    
    # Embedding configuration
    'embedding_model': 'all-MiniLM-L6-v2',
    'embedding_fields': ['subject_english', 'body_english', 'answer_english'],  # Use English translation columns
    
    # Metadata fields
    'metadata_fields': ['type', 'queue', 'priority', 'business_type', 'original_language'],
    
    # Vector search configuration
    'top_k': 5,  # Number of similar tickets to retrieve
    'similarity_threshold': 0.3,  # Minimum similarity score (similarity = 1 - cosine distance)
    
    # RAG Mode Configuration
    # 'strict': LLM uses ONLY historical tickets (context-only, traceable)
    # 'augmented': LLM uses historical tickets + general knowledge (helpful, flexible)
    'rag_mode': 'strict',  
    
    # LLM configuration
    'lm_studio_url': 'http://169.254.233.17:1234',
    'llm_model': 'gpt-oss-20b',
    'temperature': 0.2,
    'max_tokens': 2000,
    
    # ChromaDB collection
    'collection_name': 'ticket_rag_collection',
    
    # Future extensibility
    'enable_reranking': False,  # Placeholder for future lessons
    'reranking_model': None,
}

print("✅ Configuration loaded")
print(f"✅ Train/Test Split: {CONFIG['train_test_split']*100:.0f}%/{(1-CONFIG['train_test_split'])*100:.0f}%")
print(f"✅ Top-K Retrieval: {CONFIG['top_k']}")
print(f"✅ Embedding Model: {CONFIG['embedding_model']}")
print(f"✅ Using English translation columns: {', '.join(CONFIG['embedding_fields'])}")

# Display RAG mode with explanation
rag_mode = CONFIG['rag_mode']
rag_mode_desc = 'Context-only (traceable)' if rag_mode == 'strict' else 'Context + LLM knowledge (flexible)'
print(f"✅ RAG Mode: {rag_mode.upper()} ({rag_mode_desc})")

if CONFIG['test_mode']:
    print(f"⚠️  TEST MODE: Will load only first {CONFIG['test_limit']} tickets")
print(f"✅ CSV: {CONFIG['csv_path']}")

✅ Configuration loaded
✅ Train/Test Split: 80%/20%
✅ Top-K Retrieval: 5
✅ Embedding Model: all-MiniLM-L6-v2
✅ Using English translation columns: subject_english, body_english, answer_english
✅ RAG Mode: STRICT (Context-only (traceable))
✅ CSV: ./dataset-tickets-multi-lang3-4k-translated-all.csv


## 2. Initialize Components 

Just like Week 04, we initialize our core components:
- **Embedding Model**: Converts text to vectors (sentence-transformers)
- **ChromaDB**: Vector database for similarity search
- **LM Studio Client**: OpenAI-compatible API for local LLM

In [124]:
# =====================================================
# INITIALIZE COMPONENTS
# =====================================================

# Initialize embedding model
print("✅ Loading embedding model...")
embedder = SentenceTransformer(CONFIG['embedding_model'])
print(f"✅ Embedding model loaded: {CONFIG['embedding_model']}")
print(f"   Embedding dimensions: {len(embedder.encode(['test'])[0])}")

# Initialize ChromaDB client
print("\n✅ Initializing ChromaDB...")
chroma_client = chromadb.PersistentClient(path=CONFIG['chroma_db_path'])
print(f"✅ ChromaDB initialized at: {CONFIG['chroma_db_path']}")

# Initialize LM Studio client
print("\n✅ Connecting to LM Studio...")
client = OpenAI(
    base_url=f"{CONFIG['lm_studio_url']}/v1",
    api_key="lm-studio"
)
print(f"✅ LM Studio client initialized")
print(f"   URL: {CONFIG['lm_studio_url']}")
print(f"   Model: {CONFIG['llm_model']}")

print("\n✅ All components initialized successfully")

✅ Loading embedding model...
✅ Embedding model loaded: all-MiniLM-L6-v2
   Embedding dimensions: 384

✅ Initializing ChromaDB...
✅ ChromaDB initialized at: ./chroma_ticket_db

✅ Connecting to LM Studio...
✅ LM Studio client initialized
   URL: http://169.254.233.17:1234
   Model: gpt-oss-20b

✅ All components initialized successfully


## 3. Helper Functions 
Following the pattern from Week 02 and Week 03, we define reusable helper functions at the top of our notebook. These make our code more modular and easier to understand.

In [125]:
# =====================================================
# HELPER FUNCTIONS (Matches Week 02/03 Pattern)
# =====================================================

def show_error(err_string: str):
    """
    Print an error message and stop execution.
    (Consistent with Week 02/03 pattern)
    
    Args:
        err_string: Error message to display
    """
    print(f"❌ Error: {err_string}")
    raise SystemExit()


def load_data(csv_path: str) -> pd.DataFrame:
    """
    Load ticket data from CSV.
    Enhanced version of Week 02 function - now handles translated columns.
    
    In Week 02, we loaded `dataset-tickets-multi-lang_cleaned.csv` from IT_Tickets/.
    Now we load pre-translated data with separate English columns.
    
    Args:
        csv_path: Path to CSV file with translations
    
    Returns:
        pd.DataFrame: Ticket data with both original and *_english columns
    
    Raises:
        SystemExit: Via show_error() if the file is missing, cannot be
            loaded, or lacks the translation columns
    """
    try:
        df = pd.read_csv(csv_path)
        
        # Verify required English translation columns exist
        required_cols = ['subject_english', 'body_english', 'answer_english']
        missing = [c for c in required_cols if c not in df.columns]
        
        if missing:
            show_error(
                f"Missing translation columns: {missing}.\n"
                f"Please run translate_tickets.ipynb first to generate English translations!"
            )
        
        print(f"✅ Loaded {len(df):,} tickets from CSV")
        return df
        
    except FileNotFoundError:
        show_error(f"CSV file not found at {csv_path}")
    except Exception as e:
        show_error(f"Error loading CSV: {str(e)}")


def call_llm_sdk(
    system_content: str,
    user_content: str,
    model: str = "gpt-oss-20b",
    max_tokens: int = 2000,
    temperature: float = 0.1
) -> str:
    """
    Call LLM via OpenAI SDK.
    Matches Week 03 signature for consistency.
    
    This is a wrapper around the OpenAI client that we initialized earlier.
    Same interface as Week 03, but now used for RAG answer generation.
    
    Args:
        system_content: System prompt/instructions
        user_content: User message/question
        model: Model name (defaults to "gpt-oss-20b", the same value as CONFIG['llm_model'])
        max_tokens: Maximum tokens to generate
        temperature: Sampling temperature (0=deterministic, 1=creative)
    
    Returns:
        str: LLM response text
    """
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_content},
                {"role": "user", "content": user_content}
            ],
            max_tokens=max_tokens,
            temperature=temperature,
            timeout=60
        )
        return response.choices[0].message.content.strip()
        
    except Exception as e:
        show_error(f"LLM call failed: {str(e)}")


print("✅ Helper functions loaded (Week 02/03 pattern restored)")

✅ Helper functions loaded (Week 02/03 pattern restored)


## 4. Understanding Our Data 

### Why Pre-Translated Data?

Remember in Week 02/03, we translated tickets one-at-a-time using LLM calls. For this RAG system, we need to process **thousands** of tickets efficiently.

**Solution**: Pre-translate once, reuse many times
- Run the `translate_tickets.py` script once. Expect roughly 1-6 hours for 4K tickets, depending on model and provider: 75 minutes with gpt-oss-20b on Groq (cost: $1.24), over 4 hours on a local Mac
- Creates CSV with separate `_english` columns
- This notebook uses the pre-translated data

### Why Train/Test Split?

In Week 02/03, we processed tickets individually without evaluation. Now we need to **test** our RAG system properly:

- **Training Set (80%)**: Build our knowledge base from these tickets (technically not "training"!)
- **Test Set (20%)**: Evaluate how well we answer *new, unseen* tickets

This is standard ML practice for unbiased evaluation.
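A minimal sketch of a seeded shuffle-and-split, assuming plain Python lists rather than the pandas DataFrame the notebook uses. The function name `split_train_test` is hypothetical; the ticket count of 3,601 is the post-cleaning figure reported later in this notebook.

```python
import random


def split_train_test(items, train_ratio: float = 0.8, seed: int = 42):
    """Shuffle with a fixed seed, then cut at the train/test boundary."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order
    split_idx = int(len(shuffled) * train_ratio)
    return shuffled[:split_idx], shuffled[split_idx:]


train, test = split_train_test(range(3601))
print(len(train), len(test))
```

Fixing the seed matters here: without it, every run would put different tickets in the test set and evaluation scores would not be comparable across experiments.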

In [126]:
# Test LM Studio connection
def test_lm_studio():
    """Test connection to LM Studio"""
    try:
        response = client.chat.completions.create(
            model=CONFIG['llm_model'],
            messages=[{"role": "user", "content": "Say 'connected'"}],
            temperature=CONFIG['temperature'],
            max_tokens=10,
            timeout=30
        )
        print("✅ LM Studio is connected and ready")
        return True
    except Exception as e:
        print(f"⚠️ LM Studio connection failed: {e}")
        print("Please ensure LM Studio is running with a model loaded on port 1234")
        return False

test_lm_studio()

✅ LM Studio is connected and ready


True

## 4a. Load and Split Data

In [127]:
def load_and_split_data(csv_path: str, train_ratio: float = 0.8, random_seed: int = 42) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Load CSV and split into train/test sets.
    
    Args:
        csv_path: Path to CSV file
        train_ratio: Ratio for training set (default 0.8 = 80%)
        random_seed: Random seed for reproducibility
    
    Returns:
        Tuple of (train_df, test_df)
    """
    print(f" Loading CSV from: {csv_path}")
    df = pd.read_csv(csv_path)
    
    # TEST MODE - Limit data for quick testing
    if CONFIG.get('test_mode', False):
        test_limit = CONFIG.get('test_limit', 10)
        print(f"⚠️  TEST MODE: Loading only first {test_limit} tickets")
        df = df.head(test_limit)
    
    # Check if English translation columns exist
    required_cols = ['subject_english', 'body_english', 'answer_english']
    missing_cols = [col for col in required_cols if col not in df.columns]
    
    if missing_cols:
        print(f"\n  Missing translation columns: {', '.join(missing_cols)}")
        print(f"   Please run translate_tickets.ipynb first to generate English translations!")
        raise ValueError(f"Missing required columns: {missing_cols}")
    
    # Clean data: drop rows with missing English translations
    initial_count = len(df)
    df = df.dropna(subset=['subject_english', 'body_english', 'answer_english'])
    
    # Also filter out empty strings
    df = df[(df['subject_english'].str.strip() != '') & 
            (df['body_english'].str.strip() != '') & 
            (df['answer_english'].str.strip() != '')]
    
    cleaned_count = len(df)
    
    print(f" Total tickets: {initial_count:,}")
    print(f" After cleaning: {cleaned_count:,} (removed {initial_count - cleaned_count:,} with missing/empty English translations)")
    
    # Check if we have any data
    if cleaned_count == 0:
        print(f"\n❌ No tickets with English translations found!")
        print(f"   Run translate_tickets.ipynb to generate translations first.")
        raise ValueError("No valid English translations in CSV")
    
    # Shuffle and split
    df_shuffled = df.sample(frac=1, random_state=random_seed).reset_index(drop=True)
    split_idx = int(len(df_shuffled) * train_ratio)
    
    train_df = df_shuffled[:split_idx].reset_index(drop=True)
    test_df = df_shuffled[split_idx:].reset_index(drop=True)
    
    print(f"\n Split complete:")
    print(f"    Train set: {len(train_df):,} tickets ({train_ratio*100:.0f}%)")
    print(f"    Test set: {len(test_df):,} tickets ({(1-train_ratio)*100:.0f}%)")
    
    return train_df, test_df

# Load and split data
train_df, test_df = load_and_split_data(
    CONFIG['csv_path'], 
    CONFIG['train_test_split'],
    CONFIG['random_seed']
)

# Display sample with both original and English columns
print("\n Sample train ticket:")
if 'subject' in train_df.columns:
    print("\nOriginal (multilingual):")
    print(train_df[['subject', 'type', 'priority', 'original_language']].head(3))
    print("\nEnglish translations:")
    print(train_df[['subject_english', 'type', 'priority']].head(3))
else:
    print(train_df[['subject_english', 'type', 'priority', 'original_language']].head(3))

 Loading CSV from: ./dataset-tickets-multi-lang3-4k-translated-all.csv
 Total tickets: 3,998
 After cleaning: 3,601 (removed 397 with missing/empty English translations)

 Split complete:
    Train set: 2,880 tickets (80%)
    Test set: 721 tickets (20%)

 Sample train ticket:

Original (multilingual):
                                         subject      type priority  \
0                                            NaN   Request      low   
1  Probleme mit Outlook bei Microsoft Office 365  Incident     high   
2             Touchscreen Issue on Surface Pro 7   Problem     high   

  original_language  
0                es  
1                de  
2                en  

English translations:
                                     subject_english      type priority
0  Dear Support, I’m requesting a return for my H...   Request      low
1           Outlook Issues with Microsoft Office 365  Incident     high
2                 Touchscreen Issue on Surface Pro 7   Problem     high


## 5. Creating Embeddings (Week 04 Concept → Tickets)

### Embeddings: From PDF Chunks to Ticket Fields

**Week 04 Recap**: We embedded PDF chunks (arbitrary text segments)  
**Week 05**: We embed structured ticket fields (subject + body + answer)

**Key Difference:**
- PDF had arbitrary text → needed chunking strategy
- Tickets are pre-structured → use fields directly

### Why Combine Subject + Body + Answer?

Each field contributes different information:
- **Subject**: Main issue keywords ("VPN connection failed")
- **Body**: Detailed context ("tried restarting router...")  
- **Answer**: Solution keywords ("check firewall settings...")

→ **Embedding captures the full "problem→solution" pattern**

Let's see this in action with a single ticket first, then scale to all tickets.

In [128]:
def create_combined_text(row: pd.Series, fields: List[str]) -> str:
    """
    Combine multiple fields into single text for embedding.
    
    Args:
        row: DataFrame row
        fields: List of field names to combine
    
    Returns:
        Combined text string
    """
    texts = []
    for field in fields:
        if pd.notna(row.get(field)):
            texts.append(f"{field.capitalize()}: {row[field]}")
    return "\n".join(texts)


def embed_tickets(df: pd.DataFrame, embedding_fields: List[str]) -> List[List[float]]:
    """
    Generate embeddings for ticket data.
    
    Args:
        df: DataFrame with ticket data (already translated to English)
        embedding_fields: Fields to include in embeddings
    
    Returns:
        List of embedding vectors
    """
    print(f"🔄 Generating embeddings for {len(df):,} tickets...")
    print(f"   Fields: {', '.join(embedding_fields)}")
    
    # Combine fields into single text per ticket
    combined_texts = [create_combined_text(row, embedding_fields) for _, row in df.iterrows()]
    
    # Generate embeddings
    embeddings = embedder.encode(combined_texts, show_progress_bar=True)
    
    print(f"✅ Generated {len(embeddings):,} embeddings (dim: {embeddings.shape[1]})")
    return embeddings.tolist()


def prepare_metadata(row: pd.Series, metadata_fields: List[str]) -> Dict:
    """
    Extract metadata from DataFrame row.
    
    Args:
        row: DataFrame row
        metadata_fields: Fields to extract as metadata
    
    Returns:
        Metadata dictionary
    """
    metadata = {}
    for field in metadata_fields:
        value = row.get(field)
        # Convert to string, handle NaN
        metadata[field] = str(value) if pd.notna(value) else "unknown"
    return metadata


def load_vector_db(
    df: pd.DataFrame,
    collection_name: str,
    embedding_fields: List[str],
    metadata_fields: List[str],
    clear_existing: bool = True
) -> chromadb.Collection:
    """
    Load tickets into ChromaDB vector database.
    
    Args:
        df: DataFrame with ticket data (already translated to English)
        collection_name: Name for ChromaDB collection
        embedding_fields: Fields to embed
        metadata_fields: Fields to store as metadata
        clear_existing: Whether to clear existing collection
    
    Returns:
        ChromaDB collection object
    """
    print(f"\n Loading vector database: {collection_name}")
    print("="*60)
    
    # Create or get collection
    if clear_existing:
        try:
            chroma_client.delete_collection(name=collection_name)
            print("  Cleared existing collection")
        except Exception:
            # Collection did not exist yet, so there is nothing to clear
            pass
    
    collection = chroma_client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
    print(f"✅ Created collection: {collection_name}")
    
    # Generate embeddings (data already in English)
    embeddings = embed_tickets(df, embedding_fields)
    
    # Prepare data for ChromaDB
    ids = [f"ticket_{i}" for i in range(len(df))]
    documents = [create_combined_text(row, embedding_fields) for _, row in df.iterrows()]
    metadatas = [prepare_metadata(row, metadata_fields) for _, row in df.iterrows()]
    
    # Add to collection in batches (ChromaDB has batch size limits)
    batch_size = 1000
    for i in range(0, len(df), batch_size):
        end_idx = min(i + batch_size, len(df))
        collection.add(
            embeddings=embeddings[i:end_idx],
            documents=documents[i:end_idx],
            ids=ids[i:end_idx],
            metadatas=metadatas[i:end_idx]
        )
        print(f"   Loaded batch {i//batch_size + 1}: {end_idx:,}/{len(df):,} tickets")
    
    print(f"\n✅ Vector database loaded successfully")
    print(f"   Total tickets: {collection.count():,}")
    
    return collection
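The batching loop inside `load_vector_db` can be looked at on its own. The sketch below uses a hypothetical helper, `iter_batches`, and a plain list in place of embeddings, only to show how 2,880 items break down with a batch size of 1,000.

```python
from typing import Iterator, Sequence


def iter_batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


items = list(range(2880))  # stand-in for 2,880 embeddings
sizes = [len(batch) for batch in iter_batches(items, 1000)]
print(sizes)
```

Slicing past the end of a list is safe in Python, so the last, shorter batch needs no special handling.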

## 5a. Embedding & Vector DB Functions - Single Ticket Test



In [129]:
# =====================================================
# STEP 1: Single Ticket Embedding (Simple Example)
# =====================================================

# Get a sample ticket from our training data
sample_ticket = train_df.iloc[0]

# Combine the English fields into one text block
sample_text = f"""Subject_english: {sample_ticket['subject_english']}
Body_english: {sample_ticket['body_english']}
Answer_english: {sample_ticket['answer_english']}"""

print(" Sample ticket text to embed:")
print("=" * 80)
print(sample_text[:300] + "..." if len(sample_text) > 300 else sample_text)
print("=" * 80)

# Generate embedding for this single ticket
sample_embedding = embedder.encode([sample_text])[0]

print(f"\n Generated embedding vector:")
print(f"    Dimensions: {len(sample_embedding)}")
print(f"    First 10 values: {sample_embedding[:10].round(4)}")
print(f"    These numbers capture the semantic meaning of the ticket")

print(f"\nWhat does this embedding represent?")
print(f"   - Tickets with similar problems will have similar embedding vectors")
print(f"   - Vector similarity (cosine distance) = semantic similarity")
print(f"   - This is how we'll find relevant historical tickets!")

print(f"\nNext step: Scale this to all {len(train_df):,} training tickets...")

 Sample ticket text to embed:
Subject_english: Dear Support, I’m requesting a return for my HP DeskJet 3755. It’s not working properly and the wireless connectivity is spotty.
Body_english: Hi there, to start your return, just head over to our returns page or give our Customer Service a call at <tel_num>. Thanks!
Answer_english:...

 Generated embedding vector:
    Dimensions: 384
    First 10 values: [-0.1047  0.0166  0.025  -0.0016 -0.0491 -0.0374 -0.0098 -0.0488 -0.1002
 -0.0509]
    These numbers capture the semantic meaning of the ticket

What does this embedding represent?
   - Tickets with similar problems will have similar embedding vectors
   - Vector similarity (cosine distance) = semantic similarity
   - This is how we'll find relevant historical tickets!

Next step: Scale this to all 2,880 training tickets...
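
The output above mentions cosine distance as the measure of semantic similarity. The sketch below spells out the relationship on tiny hand-made vectors, assuming the usual definition `distance = 1 - similarity`. Real ticket embeddings have 384 dimensions, but the arithmetic is the same.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance: small for similar directions, larger for unrelated ones."""
    return 1.0 - cosine_similarity(a, b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # exactly 0.0

print(round(same_direction, 4), orthogonal)
print(round(cosine_distance([1.0, 2.0], [2.0, 4.0]), 4))
```

Vector length does not affect the score: `[1, 2]` and `[2, 4]` point in the same direction, so they count as maximally similar.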


## 5b. Embedding & Vector DB Functions (Scale to All Tickets)

Now that we understand how to embed a single ticket, let's scale this to all training tickets and store them in ChromaDB for fast similarity search.

In [130]:
# Load training data into vector database
collection = load_vector_db(
    train_df,
    CONFIG['collection_name'],
    CONFIG['embedding_fields'],
    CONFIG['metadata_fields'],
    clear_existing=True
)

print(f"\nDatabase Statistics:")
print(f"   Collection: {CONFIG['collection_name']}")
print(f"   Total tickets: {collection.count():,}")
print(f"   Embedding dim: {len(embedder.encode(['test'])[0])}")
print(f"   Metadata fields: {', '.join(CONFIG['metadata_fields'])}")


 Loading vector database: ticket_rag_collection
  Cleared existing collection
✅ Created collection: ticket_rag_collection
🔄 Generating embeddings for 2,880 tickets...
   Fields: subject_english, body_english, answer_english


✅ Generated 2,880 embeddings (dim: 384)
   Loaded batch 1: 1,000/2,880 tickets
   Loaded batch 2: 2,000/2,880 tickets
   Loaded batch 3: 2,880/2,880 tickets

✅ Vector database loaded successfully
   Total tickets: 2,880

Database Statistics:
   Collection: ticket_rag_collection
   Total tickets: 2,880
   Embedding dim: 384
   Metadata fields: type, queue, priority, business_type, original_language


## 6. Query & Response Functions

###  Understanding RAG Modes

Our system supports two different RAG (Retrieval-Augmented Generation) modes:

####  Strict RAG Mode (`rag_mode: 'strict'`)
**What it does:**
- LLM is **constrained** to ONLY use information from retrieved historical tickets
- Cannot supplement with its general knowledge
- Will explicitly say "I don't have enough information..." if context is insufficient

**When to use:**
- Compliance-critical environments where answers must be traceable to documented sources
- Quality assurance testing to verify RAG retrieval quality
- When you want to test if your knowledge base has sufficient coverage

**Example behavior:**
- Question: "How do I reset my VPN password?"
- If similar tickets exist → Answers based on those tickets
- If no similar tickets → "I don't have enough information in the historical tickets to answer this question fully."

---

####  Augmented RAG Mode (`rag_mode: 'augmented'`)  
**What it does:**
- LLM uses retrieved historical tickets as **context** to inform answers
- Can **supplement** with its general IT support knowledge
- Provides helpful answers even when historical tickets are incomplete

**When to use:**
- Production support systems where helpfulness matters most
- When your knowledge base has gaps but you still want useful answers
- General-purpose IT support scenarios

**Example behavior:**
- Question: "How do I reset my VPN password?"
- Uses similar historical tickets as context
- Supplements with general VPN troubleshooting knowledge if needed
- Always provides a helpful, actionable answer

---

### Switching Between Modes

To change modes, update `rag_mode` in the `CONFIG` dictionary (the configuration cell in Section 1):

```python
CONFIG = {
    # ... other settings ...
    'rag_mode': 'strict',      # Options: 'strict' or 'augmented'
}
```

Then re-run the query cells to see the difference in behavior!

---

###  Mode Comparison

| Aspect | Strict RAG | Augmented RAG |
|--------|-----------|---------------|
| **Sources** | Historical tickets only | Historical tickets + LLM knowledge |
| **Coverage** | Limited to knowledge base | Can answer beyond knowledge base |
| **Traceability** | Answers cite retrieved tickets (enforced by the prompt, not guaranteed) | Mixed sources |
| **Helpfulness** | May say "I don't know" | Always attempts helpful answer |
| **Best for** | Compliance, QA testing | Production support |

---

###  Value

By implementing both modes, you can:
1. **Test retrieval quality**: Strict mode shows if your vector search is finding relevant tickets
2. **Compare approaches**: See how LLM knowledge augmentation changes answers
3. **Understand trade-offs**: Coverage vs. traceability, helpfulness vs. compliance
4. **Learn RAG patterns**: Both are valid approaches used in real-world systems
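
The mode switch comes down to choosing a different instruction block for the same context and question. The sketch below is a simplified illustration, not the notebook's implementation: the name `build_rag_prompt` and the shortened instruction texts are assumptions made for the example. It also rejects unknown mode names, a stricter choice than silently falling back to one of the two modes.

```python
# Shortened instruction texts, assumed for illustration only
RAG_INSTRUCTIONS = {
    "strict": (
        "Answer ONLY using information from the historical tickets above. "
        "If they are insufficient, say you don't have enough information."
    ),
    "augmented": (
        "Use the historical tickets as context. "
        "You may supplement with general IT support knowledge."
    ),
}


def build_rag_prompt(question: str, context: str, rag_mode: str = "strict") -> str:
    """Build the user prompt for the given RAG mode."""
    if rag_mode not in RAG_INSTRUCTIONS:
        raise ValueError(f"Unknown rag_mode: {rag_mode!r}")
    return (
        f"SIMILAR HISTORICAL TICKETS:\n{context}\n\n"
        f"NEW SUPPORT QUESTION:\n{question}\n\n"
        f"INSTRUCTIONS:\n{RAG_INSTRUCTIONS[rag_mode]}\n\nANSWER:"
    )


prompt = build_rag_prompt(
    question="How do I reset my VPN password?",
    context="--- Similar Ticket 1 ---\nSubject: VPN password reset",
    rag_mode="strict",
)
print(prompt)
```

Failing fast on a typo such as `'stict'` avoids running a whole evaluation batch in the wrong mode without noticing.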

In [131]:
def search_similar_tickets(query_text: str, top_k: int = 5) -> List[Dict]:
    """
    Search for similar tickets in vector database using pure vector similarity.
    
    Args:
        query_text: Text to search for
        top_k: Number of results to return
    
    Returns:
        List of similar tickets with metadata and scores
    """
    # Generate query embedding
    query_embedding = embedder.encode([query_text])[0].tolist()
    
    # Execute search (pure vector similarity, no filters)
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    # Format results
    similar_tickets = []
    for i in range(len(results['ids'][0])):
        similar_tickets.append({
            'id': results['ids'][0][i],
            'document': results['documents'][0][i],
            'metadata': results['metadatas'][0][i],
            'distance': results['distances'][0][i],
            'similarity': 1 - results['distances'][0][i]
        })
    
    return similar_tickets


def generate_answer(
    question: str,
    context_tickets: List[Dict],
    temperature: float = 0.2
) -> str:
    """
    Generate answer using LM Studio LLM with configurable RAG mode.
    
    RAG Modes:
    - 'strict': Context-only mode - LLM must ONLY use information from historical tickets
    - 'augmented': Context + knowledge mode - LLM can supplement with general knowledge
    
    Args:
        question: User's question
        context_tickets: List of similar tickets for context
        temperature: LLM temperature
    
    Returns:
        Generated answer text
    """
    # Build context from similar tickets
    context_parts = []
    for i, ticket in enumerate(context_tickets, 1):
        context_parts.append(f"\n--- Similar Ticket {i} (similarity: {ticket['similarity']:.2f}) ---\n{ticket['document']}")
    
    context = "\n".join(context_parts)
    
    # Choose prompt based on RAG mode
    if CONFIG.get('rag_mode', 'augmented') == 'strict':
        # STRICT RAG: Context-only mode
        prompt = f"""You are a helpful IT support assistant. You must ONLY use information from the historical tickets provided below to answer the question. Do not use any external knowledge.

SIMILAR HISTORICAL TICKETS:
{context}

NEW SUPPORT QUESTION:
{question}

INSTRUCTIONS:
- Answer ONLY using information from the historical tickets above
- If the historical tickets don't contain enough information to answer the question, say "I don't have enough information in the historical tickets to answer this question fully."
- Do NOT use any general knowledge or information not present in the historical tickets
- Provide a clear, actionable solution based solely on the provided context
- Be concise but thorough
- Reference which historical ticket(s) your answer comes from

ANSWER:"""
    else:
        # AUGMENTED RAG: Context + LLM knowledge mode
        prompt = f"""You are a helpful IT support assistant. Based on the similar historical tickets provided below, generate a clear and helpful answer to the new support question.

SIMILAR HISTORICAL TICKETS:
{context}

NEW SUPPORT QUESTION:
{question}

INSTRUCTIONS:
- Use the historical tickets as context to inform your answer
- You may supplement with your general IT support knowledge when appropriate
- Provide a clear, actionable solution
- Be concise but thorough
- If the question is unclear or lacks context, ask clarifying questions

ANSWER:"""
    
    # Call LLM
    try:
        response = client.chat.completions.create(
            model=CONFIG['llm_model'],
            messages=[
                {"role": "system", "content": "You are a helpful IT support assistant."},
                {"role": "user", "content": prompt}
            ],
            temperature=temperature,
            max_tokens=CONFIG['max_tokens'],
            timeout=60
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error generating answer: {str(e)}"


def calculate_confidence(similar_tickets: List[Dict]) -> float:
    """
    Calculate confidence score based on similarity of retrieved tickets.
    
    Args:
        similar_tickets: List of similar tickets with similarity scores
    
    Returns:
        Confidence score between 0 and 1
    """
    if not similar_tickets:
        return 0.0
    
    # Use weighted average of top similarities
    # Top result weighted more heavily; only the first five tickets contribute
    weights = [1.0, 0.8, 0.6, 0.4, 0.2][:len(similar_tickets)]
    weighted_sim = sum(t['similarity'] * w for t, w in zip(similar_tickets, weights))
    total_weight = sum(weights)
    
    # Clamp so the result stays in [0, 1] even if a similarity falls outside that range
    confidence = max(0.0, min(1.0, weighted_sim / total_weight))
    return round(confidence, 3)


def answer_ticket(ticket_json: Dict) -> Dict:
    """
    Complete RAG pipeline: retrieve context and generate answer.
    
    Args:
        ticket_json: Dictionary with ticket data
            Required: 'subject', 'body'
            Optional fields are stored but not used for filtering
    
    Returns:
        Dictionary with answer, confidence, and references
    """
    # Extract fields
    subject = ticket_json.get('subject', '')
    body = ticket_json.get('body', '')
    
    # Create query text
    query_text = f"Subject: {subject}\nBody: {body}"
    
    # Search for similar tickets (pure vector similarity)
    print(f" Searching for similar tickets using vector similarity...")
    
    similar_tickets = search_similar_tickets(query_text, CONFIG['top_k'])
    
    print(f"   Found {len(similar_tickets)} similar tickets")
    
    # Generate answer
    rag_mode = CONFIG.get('rag_mode', 'augmented')
    print(f" Generating answer using {rag_mode.upper()} RAG mode...")
    answer = generate_answer(query_text, similar_tickets, CONFIG['temperature'])
    
    # Calculate confidence
    confidence = calculate_confidence(similar_tickets)
    
    # Format references
    references = []
    for ticket in similar_tickets:
        references.append({
            'ticket_id': ticket['id'],
            'similarity': round(ticket['similarity'], 3),
            'metadata': ticket['metadata'],
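            # Append an ellipsis only when the document was actually truncated
            'preview': ticket['document'][:200] + ('...' if len(ticket['document']) > 200 else '')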
        })
    
    # Build response
    response = {
        'answer': answer,
        'confidence': confidence,
        'num_references': len(references),
        'references': references,
        'rag_mode': rag_mode,
        'timestamp': datetime.now().isoformat()
    }
    
    print(f"✅ Answer generated (confidence: {confidence:.2%}, mode: {rag_mode})")
    
    return response

print("✅ Query functions ready")

✅ Query functions ready
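
To make the weighting in `calculate_confidence` concrete, here is a small standalone sketch of the same weighted average applied to made-up similarity scores (the scores are illustrative and not taken from the dataset; `weighted_confidence` is a hypothetical name). It also shows a side effect worth knowing: because only five weights are defined, tickets beyond the fifth do not influence the score.

```python
from typing import List

# Same position weights as calculate_confidence above.
WEIGHTS = [1.0, 0.8, 0.6, 0.4, 0.2]


def weighted_confidence(similarities: List[float]) -> float:
    """Standalone restatement of the weighted average, for illustration only."""
    if not similarities:
        return 0.0
    weights = WEIGHTS[:len(similarities)]
    # zip() stops at the shorter sequence, so at most five scores are used.
    weighted_sum = sum(s * w for s, w in zip(similarities, weights))
    return round(weighted_sum / sum(weights), 3)


# Hypothetical similarity scores for three retrieved tickets:
# (0.9*1.0 + 0.8*0.8 + 0.7*0.6) / (1.0 + 0.8 + 0.6)
three = [0.9, 0.8, 0.7]
print(weighted_confidence(three))

# With ten tickets, entries 6-10 are ignored by the weighting.
ten = [0.9, 0.8, 0.7, 0.6, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(weighted_confidence(ten) == weighted_confidence(ten[:5]))
```

This is relevant when experimenting with `top_k` values above 5: the answer sees more context, but the confidence score does not change.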


## 7. Test with Sample Query

**Note**: This system uses pure vector similarity search based on the semantic meaning of the ticket content (subject + body). Metadata (type, queue, priority, etc.) is stored but not used for filtering or ranking; this keeps the system simple, and filtering can be added in future lessons.
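
As a rough illustration of what "vector similarity" means here, the sketch below ranks two toy vectors against a query using cosine similarity. The vectors and ticket names are invented for the example, and it assumes the collection uses cosine distance (the collection setup is not shown in this section); under that assumption, the `similarity = 1 - distance` conversion in `search_similar_tickets` recovers the cosine similarity.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; real embedding models produce far more dimensions.
query = [1.0, 0.0, 1.0]
tickets = {
    "vpn_timeout": [0.9, 0.1, 0.8],
    "printer_jam": [0.0, 1.0, 0.1],
}

# Rank tickets from most to least similar to the query.
ranked = sorted(
    tickets,
    key=lambda tid: cosine_similarity(query, tickets[tid]),
    reverse=True,
)
for tid in ranked:
    sim = cosine_similarity(query, tickets[tid])
    print(f"{tid}: similarity={sim:.3f}, cosine distance={1 - sim:.3f}")
```

If the collection uses a different distance metric, `1 - distance` is no longer a cosine similarity and the confidence values should be interpreted with care.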

In [136]:
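# Switch to strict mode, then re-run the sample query below to see the effect.
# (The output recorded below was produced in augmented mode.)
CONFIG['rag_mode'] = 'strict'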


In [135]:
import textwrap
from IPython.display import display, Markdown

def render_answer(answer_text: str, title: str = "GENERATED ANSWER"):
    """
    Render markdown-formatted answer beautifully in Jupyter.
    
    Args:
        answer_text: The LLM-generated answer (may contain markdown)
        title: Title to display above the answer
    """
    print(f"\n {title}")
    print("=" * 100)
    display(Markdown(answer_text))
    print("=" * 100)

# Create a sample ticket query
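sample_ticket = {
    "subject": "Cannot connect to VPN from home",
    # Adjacent string literals are joined, which avoids the stray indentation
    # that a backslash continuation inside a string would embed in the text.
    "body": (
        "I'm trying to connect to the company VPN from my home network but keep getting a connection timeout error. "
        "I've tried restarting my router and computer but the issue persists. This is blocking my work."
    )
}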

print("=" * 100)
print("SAMPLE TICKET QUERY")
print("=" * 100)

print(f"\nSUBJECT:")
print("-" * 100)
print(sample_ticket['subject'])

print(f"\nBODY:")
print("-" * 100)
wrapped_body = textwrap.fill(sample_ticket['body'], width=100)
print(wrapped_body)

print("\n" + "=" * 100)

# Get answer
response = answer_ticket(sample_ticket)

# Display results with beautiful markdown rendering
render_answer(response['answer'])

print(f"\n Confidence: {response['confidence']:.1%}")
print(f" References: {response['num_references']} similar tickets")

print("\n TOP 3 SIMILAR TICKETS:")
print("=" * 100)
for i, ref in enumerate(response['references'][:3], 1):
    print(f"\n{i}. Similarity: {ref['similarity']:.1%} | {ref['metadata'].get('type', 'Unknown')} | Priority: {ref['metadata'].get('priority', 'N/A')}")
    print("   " + "-" * 96)
    preview_wrapped = textwrap.fill(ref['preview'], width=96, initial_indent='   ', subsequent_indent='   ')
    print(preview_wrapped)

print("\n" + "=" * 100)

SAMPLE TICKET QUERY

SUBJECT:
----------------------------------------------------------------------------------------------------
Cannot connect to VPN from home

BODY:
----------------------------------------------------------------------------------------------------
I'm trying to connect to the company VPN from my home network but keep getting a connection timeout
error.         I've tried restarting my router and computer but the issue persists. This is blocking
my work.

 Searching for similar tickets using vector similarity...
   Found 5 similar tickets
 Generating answer using AUGMENTED RAG mode...
✅ Answer generated (confidence: 41.70%, mode: augmented)

 GENERATED ANSWER


**Subject:** Re: Cannot connect to VPN from home – troubleshooting steps

Hi [Name],

Thanks for reaching out. A connection‑timeout usually points to a networking or configuration issue on the client side (or sometimes an IP restriction on the VPN server). Below are some quick checks and actions that often resolve this type of problem.

| Step | What to do | Why it helps |
|------|------------|--------------|
| **1. Verify the VPN address** | Make sure you’re entering the exact hostname/IP the IT team gave you (e.g., `vpn.company.com`). A typo or old address will cause a timeout. | |
| **2. Check your local firewall / antivirus** | Temporarily disable any third‑party firewall/antivirus software and try again. Some programs block VPN ports (UDP 1194, TCP 443). | |
| **3. Confirm the required port is open** | Run `telnet vpn.company.com 443` or `nc -vz vpn.company.com 443`. If it fails, your home router may be blocking that port. | |
| **4. Try a different device / network** | Connect from another computer on the same Wi‑Fi, or use a mobile hotspot. This tells us whether the issue is with your machine or the home network. | |
| **5. Reinstall/Update the VPN client** | Uninstall the current VPN client, download the latest installer from the IT portal, and install it again. Older clients can have bugs that cause timeouts. | |
| **6. Clear any proxy settings** | In your browser or system network settings make sure no HTTP/HTTPS proxy is configured – VPNs usually don’t work with a proxy in place. | |
| **7. Check for IP restrictions** | Some VPN servers only allow connections from whitelisted IP ranges. If you’re on a dynamic home IP, the IT team may need to add it. Ask them if your current public IP (you can find it at `whatismyip.com`) is allowed. | |
| **8. Review client logs** | Most VPN clients write a log file (`vpn.log` or similar). If you can share that (or just the first 20‑30 lines), we can spot specific errors. | |

### Quick test

1. Open a terminal/command prompt.
2. Run:  
   ```bash
   ping vpn.company.com
   ```
3. If you get replies, the DNS and basic connectivity are fine.  
4. Then run:  
   ```bash
   telnet vpn.company.com 443
   ```
   or for UDP: `nc -vz -u vpn.company.com 1194` (Linux/macOS).  

If either command times out, the issue is likely on your local network side.

### If you’re still stuck

- Let me know which VPN client you’re using (Cisco AnyConnect, OpenVPN, etc.).
- Share any error code or log snippet.
- Tell me if you can connect from another device or network.

Once we have that info, I’ll be able to narrow it down further. In the meantime, try the steps above and let me know what happens.

Best regards,

[Your Name]  
IT Support Team

---


 Confidence: 41.7%
 References: 5 similar tickets

 TOP 3 SIMILAR TICKETS:

1. Similarity: 44.5% | Problem | Priority: medium
   ------------------------------------------------------------------------------------------------
   Subject_english: Intermittent Connection Issues Reported Body_english: You might need to
   update the firmware or reboot the device. Answer_english: Try rebooting the device first. If
   the issue persists...

2. Similarity: 41.2% | Incident | Priority: medium
   ------------------------------------------------------------------------------------------------
   Subject_english: Help Needed with Connectivity Issues Body_english: Hi Customer Support,  I'm
   <name> from <company_name>. We're having connectivity problems that are slowing down our
   remote team's pro...

3. Similarity: 39.7% | Problem | Priority: low
   ------------------------------------------------------------------------------------------------
   Subject_english: Intermittent Connectivity I

In [141]:
# Now set the RAG mode to augmented and re-run the sample query:

CONFIG['rag_mode'] = 'augmented'

## 8. LLM Evaluation System

In [138]:
def evaluate_answer(
    question: str,
    original_answer: str,
    generated_answer: str
) -> Dict:
    """
    Use LLM to evaluate generated answer against original.
    
    Args:
        question: Original ticket question
        original_answer: Original answer from dataset
        generated_answer: RAG-generated answer
    
    Returns:
        Evaluation results with scores and feedback
    """
    prompt = f"""You are an expert evaluator for IT support responses. Compare the generated answer against the original answer and rate it on the following criteria:

ORIGINAL QUESTION:
{question}

ORIGINAL ANSWER:
{original_answer}

GENERATED ANSWER:
{generated_answer}

EVALUATION CRITERIA:
1. Accuracy (1-5): Does the generated answer provide correct information?
2. Completeness (1-5): Does it cover all important points from the original?
3. Clarity (1-5): Is it clear and easy to understand?
4. Helpfulness (1-5): Would this answer help resolve the user's issue?

Please provide:
- A score for each criterion (1-5, where 5 is best)
- Brief feedback explaining the scores
- An overall score (1-5)

Format your response as JSON:
{{
  "accuracy": <score>,
  "completeness": <score>,
  "clarity": <score>,
  "helpfulness": <score>,
  "overall": <score>,
  "feedback": "<explanation>"
}}"""
    
    raw_response = None  # Kept so the except branch can report what the LLM returned
    try:
        response = client.chat.completions.create(
            model=CONFIG['llm_model'],
            messages=[
                {"role": "system", "content": "You are an expert evaluator. Respond only with valid JSON."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.1,
            max_tokens=500,
            timeout=60
        )
        
        # Parse JSON response
        raw_response = response.choices[0].message.content
        result_text = raw_response.strip()
        # Extract JSON if wrapped in markdown code blocks
        if "```json" in result_text:
            result_text = result_text.split("```json")[1].split("```")[0].strip()
        elif "```" in result_text:
            result_text = result_text.split("```")[1].split("```")[0].strip()
        
        evaluation = json.loads(result_text)
        return evaluation
        
    except Exception as e:
        # Covers API failures as well as malformed JSON from the model
        return {
            "error": str(e),
            "raw_response": raw_response
        }

print("✅ Evaluation function ready")

✅ Evaluation function ready
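
The fence-splitting in `evaluate_answer` handles replies wrapped in a markdown code block. Local models sometimes also add a sentence before or after the JSON, so a slightly more tolerant parser can be useful. The sketch below is one possible approach; `extract_json` is a hypothetical helper, not part of the notebook, and it assumes the reply contains exactly one top-level JSON object.

```python
import json
import re
from typing import Any, Dict


def extract_json(text: str) -> Dict[str, Any]:
    """Parse a JSON object from an LLM reply, tolerating fences and surrounding prose."""
    # Take everything from the first "{" to the last "}" (DOTALL lets "." span newlines).
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    return json.loads(match.group(0))


# Build a fenced reply without typing a literal fence inside this example.
fence = "`" * 3
reply = (
    "Here is my evaluation:\n"
    + fence + "json\n"
    + '{"accuracy": 4, "overall": 4, "feedback": "Clear steps."}\n'
    + fence
)
print(extract_json(reply))
```

A reply with no JSON object raises `ValueError`, and malformed JSON raises `json.JSONDecodeError`, so a caller can still record the failure the way `evaluate_answer` does.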


In [139]:
print(CONFIG['rag_mode'])

strict


In [142]:
# Test evaluation with a ticket from test set
test_ticket_idx = 380
test_row = test_df.iloc[test_ticket_idx]

# Prepare ticket for RAG (only subject and body needed)
test_ticket_input = {
    "subject": test_row['subject_english'],
    "body": test_row['body_english']
}

print("=" * 100)
print(" TEST TICKET EVALUATION")
print("=" * 100)

# Display ticket details
print(f"\n TICKET SUBJECT (English):")
print("-" * 100)
print(f"{test_row['subject_english']}")

print(f"\n TICKET BODY (English):")
print("-" * 100)
# Wrap body text to 100 characters
import textwrap
wrapped_body = textwrap.fill(test_row['body_english'], width=100)
print(wrapped_body)

print(f"\n Ticket Metadata:")
print(f"   Type: {test_row.get('type')} | Priority: {test_row.get('priority')} | Language: {test_row.get('original_language')}")

print("\n" + "=" * 100)

# Generate answer
rag_response = answer_ticket(test_ticket_input)

# Display answers with markdown rendering for generated answer
print("\n ANSWER COMPARISON")
print("=" * 100)

print("\n ORIGINAL ANSWER (from dataset):")
print("-" * 100)
original_wrapped = textwrap.fill(test_row['answer_english'], width=100)
print(original_wrapped)

# Render generated answer with beautiful markdown formatting
from IPython.display import display, Markdown
print("\n GENERATED ANSWER (RAG system with markdown rendering):")
print("-" * 100)
display(Markdown(rag_response['answer']))

print("=" * 100)

# Evaluate
print("\n Evaluating answer quality...")
question_text = f"Subject: {test_row['subject_english']}\nBody: {test_row['body_english']}"
evaluation = evaluate_answer(
    question_text,
    test_row['answer_english'],
    rag_response['answer']
)

# Display results
print("\n EVALUATION RESULTS:")
print("=" * 100)
if 'error' not in evaluation:
    print(f"\n{'Metric':<20} {'Score':<10} {'Rating'}")
    print("-" * 100)
    print(f"{'Overall':<20} {evaluation['overall']}/5      {'⭐' * int(evaluation['overall'])}")
    print(f"{'Accuracy':<20} {evaluation['accuracy']}/5      {'⭐' * int(evaluation['accuracy'])}")
    print(f"{'Completeness':<20} {evaluation['completeness']}/5      {'⭐' * int(evaluation['completeness'])}")
    print(f"{'Clarity':<20} {evaluation['clarity']}/5      {'⭐' * int(evaluation['clarity'])}")
    print(f"{'Helpfulness':<20} {evaluation['helpfulness']}/5      {'⭐' * int(evaluation['helpfulness'])}")
    
    print(f"\n Confidence Score: {rag_response['confidence']:.1%}")
    print(f" Similar Tickets Used: {rag_response['num_references']}")
    
    print(f"\n Evaluator Feedback:")
    print("-" * 100)
    feedback_wrapped = textwrap.fill(evaluation['feedback'], width=100)
    print(feedback_wrapped)
else:
    print(f"❌ Evaluation error: {evaluation['error']}")
    if evaluation.get('raw_response'):
        print(f"\nRaw response: {evaluation['raw_response']}")

print("\n" + "=" * 100)

 TEST TICKET EVALUATION

 TICKET SUBJECT (English):
----------------------------------------------------------------------------------------------------
Recurring Excel Crash Issue After the Update

 TICKET BODY (English):
----------------------------------------------------------------------------------------------------
Dear Customer Support,  I’m writing to report a persistent problem that’s surfaced after the latest
software update installed on my computer. My name is <name> and I’m experiencing frequent crashes
specifically with the Microsoft Office 365 Excel application. Every time I try to open or work
within Excel, the app unexpectedly closes, preventing me from accessing important data and documents
I need for daily tasks. This issue seems to persist regardless of the spreadsheet file I’m trying to
open or create. It started immediately after the update was installed, so we suspect it may be
linked to recent changes. Could you provide guidance or a possible solution to resolve

**Subject:** Re: Recurring Excel Crash Issue After the Update  

Dear <name>,

Thank you for reaching out. I understand how disruptive these crashes are and will walk you through a few steps that usually resolve the issue quickly.

| Step | What to do | Why it helps |
|------|------------|--------------|
| **1 – Repair Office** | 1️⃣ Open Control Panel → Programs & Features. <br>2️⃣ Select *Microsoft Office 365* and click **Change**. <br>3️⃣ Choose **Quick Repair** first; if that doesn’t help, run **Online Repair** (requires internet). | Fixes corrupted files or missing components introduced by the update. |
| **2 – Verify Updates** | 1️⃣ Open Windows Settings → Update & Security → Check for updates. <br>2️⃣ Install any pending Office updates (sometimes a patch follows the initial release). | Keeps both Windows and Office at the same supported version, reducing incompatibilities. |
| **3 – Safe Mode / Add‑ins** | 1️⃣ Hold **Ctrl** while launching Excel (or run `excel.exe /safe`). <br>2️⃣ If Excel stays stable, disable all add‑ins: File → Options → Add‑Ins → Manage → Excel Add‑Ins → Go. Uncheck everything and restart normally. | Identifies whether a third‑party add‑in is causing the crash. |
| **4 – Check for Conflicting Software** | • Disable or uninstall recently added antivirus/backup tools temporarily. <br>• Ensure no other Office apps (Word, PowerPoint) are crashing; if they do, the problem is broader than Excel alone. | Some security or backup utilities can interfere with Office’s runtime after an update. |
| **5 – Test on a Clean Profile** | Create a new Windows user account and try opening Excel there. If it works, your original profile may have corrupted settings. | Confirms whether the issue is system‑wide or tied to your user profile. |

### Quick Checklist

- [ ] Office 365 fully updated  
- [ ] Quick Repair (then Online if needed) completed  
- [ ] Excel runs in Safe Mode without crashing  
- [ ] All add‑ins disabled → normal mode works  
- [ ] No recent antivirus/backup changes  

If after completing these steps Excel still crashes, please let me know the exact error message or any crash logs you can find (Event Viewer → Windows Logs → Application). Those details will help us dig deeper.

Feel free to reply here or call me directly at <tel_num>. I’ll follow up within 2 hours if I don’t hear back from you.

Thank you for your patience—let’s get Excel running smoothly again.

Best regards,

<Your Name>  
IT Support Specialist  
Customer Support Team


 Evaluating answer quality...

 EVALUATION RESULTS:

Metric               Score      Rating
----------------------------------------------------------------------------------------------------
Overall              5/5      ⭐⭐⭐⭐⭐
Accuracy             5/5      ⭐⭐⭐⭐⭐
Completeness         5/5      ⭐⭐⭐⭐⭐
Clarity              5/5      ⭐⭐⭐⭐⭐
Helpfulness          5/5      ⭐⭐⭐⭐⭐

 Confidence Score: 90.3%
 Similar Tickets Used: 5

 Evaluator Feedback:
----------------------------------------------------------------------------------------------------
The generated answer accurately reflects the troubleshooting steps from the original response and
expands on them with additional useful details (e.g., add‑in disabling, clean profile test). It
presents information in a clear table format, making it easy to follow. The extra guidance on
checking logs and offering further contact also enhances its helpfulness.



## 9. Batch Testing & Statistics

In [143]:
import textwrap

def run_batch_evaluation(test_df: pd.DataFrame, num_samples: int = 10) -> pd.DataFrame:
    """
    Run RAG system on multiple test tickets and evaluate results.
    
    Args:
        test_df: Test dataset
        num_samples: Number of tickets to test
    
    Returns:
        DataFrame with evaluation results
    """
    results = []
    
    sample_indices = np.random.choice(len(test_df), min(num_samples, len(test_df)), replace=False)
    
    print("=" * 100)
    print(f" BATCH EVALUATION - Testing {len(sample_indices)} tickets")
    print("=" * 100)
    
    for idx in sample_indices:
        row = test_df.iloc[idx]
        
        print(f"\n Processing ticket {idx}...", end=" ")
        
        # Prepare input (only subject and body needed)
        ticket_input = {
            "subject": row['subject_english'],
            "body": row['body_english']
        }
        
        try:
            # Generate answer
            rag_response = answer_ticket(ticket_input)
            
            # Evaluate
            question_text = f"Subject: {row['subject_english']}\nBody: {row['body_english']}"
            evaluation = evaluate_answer(
                question_text,
                row['answer_english'],
                rag_response['answer']
            )
            
            # Store results
            if 'error' not in evaluation:
                results.append({
                    'ticket_idx': idx,
                    'subject': row['subject_english'][:60] + '...',
                    'type': row.get('type'),
                    'priority': row.get('priority'),
                    'language': row.get('original_language'),
                    'confidence': rag_response['confidence'],
                    'num_refs': rag_response['num_references'],
                    'accuracy': evaluation['accuracy'],
                    'completeness': evaluation['completeness'],
                    'clarity': evaluation['clarity'],
                    'helpfulness': evaluation['helpfulness'],
                    'overall': evaluation['overall']
                })
                print(f"✅ Overall Score: {evaluation['overall']}/5 | Confidence: {rag_response['confidence']:.1%}")
            else:
                print(f"❌ Evaluation failed")
                
        except Exception as e:
            print(f"❌ Error: {e}")
    
    return pd.DataFrame(results)


# Run batch evaluation on the TEST DataFrame. These tickets were held out of the vector DB,
# so their original answers cannot be retrieved directly.

eval_results = run_batch_evaluation(test_df, num_samples=5)

# Display statistics
if len(eval_results) > 0:
    print("\n" + "=" * 100)
    print(" BATCH EVALUATION SUMMARY")
    print("=" * 100)
    
    print(f"\n{'Metric':<20} {'Mean Score':<15} {'Rating'}")
    print("-" * 100)
    print(f"{'Overall':<20} {eval_results['overall'].mean():.2f}/5         {'⭐' * int(eval_results['overall'].mean())}")
    print(f"{'Accuracy':<20} {eval_results['accuracy'].mean():.2f}/5         {'⭐' * int(eval_results['accuracy'].mean())}")
    print(f"{'Completeness':<20} {eval_results['completeness'].mean():.2f}/5         {'⭐' * int(eval_results['completeness'].mean())}")
    print(f"{'Clarity':<20} {eval_results['clarity'].mean():.2f}/5         {'⭐' * int(eval_results['clarity'].mean())}")
    print(f"{'Helpfulness':<20} {eval_results['helpfulness'].mean():.2f}/5         {'⭐' * int(eval_results['helpfulness'].mean())}")
    
    print(f"\n Average Confidence: {eval_results['confidence'].mean():.1%}")
    print(f" Average References Used: {eval_results['num_refs'].mean():.1f}")
    
    print("\n DETAILED RESULTS:")
    print("=" * 100)
    print(f"{'Idx':<6} {'Subject':<40} {'Type':<10} {'Lang':<6} {'Score':<8} {'Conf':<8}")
    print("-" * 100)
    for _, row in eval_results.iterrows():
        subject_short = row['subject'][:38] + '..' if len(row['subject']) > 38 else row['subject']
        # str() guards against missing metadata (None cannot be padded with a format spec)
        print(f"{row['ticket_idx']:<6} {subject_short:<40} {str(row['type']):<10} {str(row['language']):<6} {row['overall']}/5      {row['confidence']:.1%}")
    
    print("=" * 100)
else:
    print("\n⚠️ No results to display")

 BATCH EVALUATION - Testing 5 tickets

 Processing ticket 622...  Searching for similar tickets using vector similarity...
   Found 5 similar tickets
 Generating answer using AUGMENTED RAG mode...
✅ Answer generated (confidence: 87.60%, mode: augmented)
✅ Overall Score: 5/5 | Confidence: 87.6%

 Processing ticket 555...  Searching for similar tickets using vector similarity...
   Found 5 similar tickets
 Generating answer using AUGMENTED RAG mode...
✅ Answer generated (confidence: 89.10%, mode: augmented)
✅ Overall Score: 5/5 | Confidence: 89.1%

 Processing ticket 539...  Searching for similar tickets using vector similarity...
   Found 5 similar tickets
 Generating answer using AUGMENTED RAG mode...
✅ Answer generated (confidence: 85.80%, mode: augmented)
✅ Overall Score: 5/5 | Confidence: 85.8%

 Processing ticket 434...  Searching for similar tickets using vector similarity...
   Found 5 similar tickets
 Generating answer using AUGMENTED RAG mode...
✅ Answer generated (confidence: 

## 10. Summary & Next Steps

### What We Did This Week

**Built on Previous Weeks:**
1. **From Week 02/03**: Used pre-translated ticket data, applied LLM for answer generation
2. **From Week 04**: Applied RAG pattern (embed → store → search → retrieve → generate)
3. **New Additions**: Train/test split, confidence scoring, LLM evaluation, batch testing

**Complete RAG Pipeline:**
1. Load pre-translated CSV with English ticket data
2. Split into 80% training / 20% test sets
3. Generate embeddings for training tickets (subject + body + answer)
4. Store in ChromaDB vector database with metadata
5. For new tickets: embed → search similar → generate context-aware answer
6. Evaluate answer quality using LLM-as-judge
7. Batch test for systematic quality assessment
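
The split in step 2 matters because only the training tickets are embedded and stored, so the test tickets stay unseen by the retriever. Below is a minimal sketch of a repeatable 80/20 split using only the standard library; `split_train_test` and the seed value are illustrative choices, not necessarily what the notebook's earlier cells use.

```python
import random
from typing import List, Sequence, Tuple


def split_train_test(
    items: Sequence, test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List, List]:
    """Shuffle item positions with a fixed seed, then cut off the test portion."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = int(len(items) * test_fraction)
    test = [items[i] for i in indices[:n_test]]
    train = [items[i] for i in indices[n_test:]]
    return train, test


# Placeholder ticket IDs standing in for rows of the translated CSV.
tickets = [f"ticket-{i}" for i in range(100)]
train, test = split_train_test(tickets)
print(len(train), len(test))
```

Fixing the seed means the same tickets land in the test set on every run, which keeps evaluation scores comparable across experiments.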

### Connections to Previous Work

**Week 02 → Week 05:**
- Week 02: Translated individual tickets using LLM
- Week 05: Uses pre-translated data for efficiency

**Week 03 → Week 05:**
- Week 03: Consolidated multiple LLM calls, added JSON output
- Week 05: Uses same OpenAI SDK pattern, structured responses

**Week 04 → Week 05:**
- Week 04: RAG with PDF documents (chunk → embed → search)
- Week 05: RAG with structured tickets (fields → embed → search)

### Key Concepts

1. **Retrieval-Augmented Generation (RAG)**: Combining search with generation
2. **Vector Similarity**: Finding semantically similar content
3. **Embeddings**: Converting text to numerical representations
4. **Train/Test Split**: Proper evaluation methodology
5. **Confidence Scoring**: Measuring system certainty
6. **LLM Evaluation**: Using AI to judge AI outputs

### Future Enhancements (Later Lessons)

**Metadata Filtering**:
- Filter by `type`: Only retrieve similar Incidents, Requests, etc.
- Filter by `priority`: P1 tickets get P1 solutions
- Filter by `queue`: Department-specific knowledge
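
One simple way to prototype metadata filtering is to post-filter the retrieved list. The sketch below assumes results shaped like the output of `search_similar_tickets`; `filter_by_metadata` and the sample data are hypothetical.

```python
from typing import Dict, List


def filter_by_metadata(tickets: List[Dict], **required: str) -> List[Dict]:
    """Keep only tickets whose metadata matches every required key/value pair."""
    return [
        t for t in tickets
        if all(t["metadata"].get(key) == value for key, value in required.items())
    ]


# Hypothetical retrieval results, shaped like search_similar_tickets output.
retrieved = [
    {"id": "t1", "similarity": 0.45, "metadata": {"type": "Problem", "priority": "medium"}},
    {"id": "t2", "similarity": 0.41, "metadata": {"type": "Incident", "priority": "medium"}},
    {"id": "t3", "similarity": 0.40, "metadata": {"type": "Problem", "priority": "low"}},
]

problems = filter_by_metadata(retrieved, type="Problem")
print([t["id"] for t in problems])
```

Post-filtering can leave fewer than `top_k` tickets. ChromaDB's `collection.query` also accepts a `where` filter, which applies the restriction during the search instead of after it.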

**Re-ranking**:
- Semantic re-ranking of retrieved results
- Combine vector similarity with other relevance signals

**Hybrid Search**:
- Combine vector similarity (semantic) with keyword matching (lexical)
- Best of both worlds for retrieval

**Multilingual Support**:
- Return answers in user's preferred language
- Cross-lingual retrieval (query in Spanish, find English solutions)

**Production Features**:
- API wrapper for easy integration
- Monitoring and logging
- Caching for common queries
- Feedback loop for continuous improvement

### Key Files

- **Translation**: `translate_tickets.py` (run once)
- **RAG System**: `ticket_rag_system.ipynb` (this notebook)
- **Original CSV**: `dataset-tickets-multi-lang3-4k.csv`
- **Translated CSV**: `dataset-tickets-multi-lang3-4k-translated -all.csv`
- **Vector DB**: `./chroma_ticket_db/`

### Quick Start Guide

**First Time Setup:**
1. Run `translate_tickets.py` to create translated CSV (one-time)
2. Run this notebook from top to bottom
3. Experiment with different `CONFIG` parameters
4. Complete homework assignments

**Adjusting Parameters:**
- `top_k`: Number of similar tickets (3-10 recommended)
- `temperature`: LLM creativity (0.1-0.3 for factual support)
- `embedding_model`: Trade speed vs quality
- `test_mode`: Set `False` to use all ~4K tickets

**Testing Your Changes:**
- Modify `CONFIG` dictionary
- Re-run affected sections
- Compare evaluation metrics

### Performance Notes

**Current Configuration (TEST MODE):**
- Loading: First 100 tickets only (fast, for learning)
- Embedding: ~5 seconds for 100 tickets
- Query: <1 second per ticket
- Evaluation: ~10 seconds per ticket (includes LLM calls)

**Full Dataset (test_mode=False):**
- Loading: ~4,000 tickets
- Embedding: ~2 minutes for all tickets
- Query: <1 second per ticket (same)
- Batch evaluation: Scales linearly with sample size

### Success Metrics

**Quality Metrics** (from evaluation):
- Overall score: target 3-5 out of 5 (LLM-judged)
- Accuracy: Correctness of information
- Completeness: Coverage of important points
- Clarity: Ease of understanding
- Helpfulness: Actionability of solution

**Confidence Metrics**:
- Based on similarity of retrieved tickets
- Range: 0-1 (higher = more confident)
- Weighted by position (top results weighted more)

**When to Worry:**
- Confidence consistently <0.5: Check retrieval quality
- Overall scores <3: Review LLM prompts or `top_k`
- High confidence but low scores: Check training data quality
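
The checks above can be turned into a small diagnostic over batch results. This is a sketch under stated assumptions: it uses the mean as a stand-in for "consistently", treats confidence at or above 0.5 as "high", and expects rows with `confidence` and `overall` keys like those produced by `run_batch_evaluation`. The `diagnose` helper itself is hypothetical.

```python
from statistics import mean
from typing import Dict, List


def diagnose(
    results: List[Dict], conf_threshold: float = 0.5, score_threshold: float = 3.0
) -> List[str]:
    """Return warnings matching the 'When to Worry' checklist."""
    if not results:
        return ["No results to diagnose"]
    avg_conf = mean(r["confidence"] for r in results)
    avg_score = mean(r["overall"] for r in results)
    warnings = []
    if avg_conf < conf_threshold:
        warnings.append("Low confidence: check retrieval quality")
    if avg_score < score_threshold:
        warnings.append("Low scores: review LLM prompts or top_k")
    if avg_conf >= conf_threshold and avg_score < score_threshold:
        warnings.append("High confidence but low scores: check training data quality")
    return warnings


# Hypothetical batch results: retrieval looks confident, but the judge scores are low.
sample_results = [
    {"confidence": 0.88, "overall": 2},
    {"confidence": 0.90, "overall": 3},
]
for warning in diagnose(sample_results):
    print(warning)
```

In the notebook, the rows could come from `eval_results.to_dict("records")`.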



## 11. Homework & Experiments

Following the tradition from Week 02 and Week 03, here are hands-on assignments to deepen your understanding of the RAG system.

### Homework 1: Tune Retrieval Parameters

The `top_k` parameter controls how many similar tickets we retrieve for context.

**Your Task:**
1. Test with `top_k = 1, 3, 5, 10`...
2. Use the same sample question each time
3. Observe how the answer quality and confidence change

**Questions:**
- At what point does adding more tickets stop helping?
- How does confidence score correlate with `top_k`?

**How to do it:**
```python
# Update config
CONFIG['top_k'] = 1  # Try 1, 3, 5, 10

# Re-run Section 7 (Sample Query) or Section 8 (Evaluation)
# Compare the results
```

**Success Criteria**: Identify which `top_k` value works best for support tickets.
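
To avoid editing `CONFIG` by hand for each value, the comparison can be wrapped in a small loop. The sketch below is one way to do it; `sweep_top_k` is a hypothetical helper, and `fake_answer` is a stand-in for `answer_ticket` so the example runs without the LLM or vector DB (its fixed confidence value is a placeholder, not a real result).

```python
from typing import Callable, Dict, Iterable, List


def sweep_top_k(
    ticket: Dict,
    values: Iterable[int],
    answer_fn: Callable[[Dict], Dict],
    config: Dict,
) -> List[Dict]:
    """Run answer_fn once per top_k value and collect the headline numbers."""
    original = config["top_k"]
    rows = []
    try:
        for k in values:
            config["top_k"] = k
            response = answer_fn(ticket)
            rows.append({
                "top_k": k,
                "confidence": response["confidence"],
                "num_references": response["num_references"],
            })
    finally:
        config["top_k"] = original  # restore the setting even if a call fails
    return rows


# Stand-in config and answer function, used only to demonstrate the loop.
demo_config = {"top_k": 5}


def fake_answer(ticket: Dict) -> Dict:
    return {"confidence": 0.5, "num_references": demo_config["top_k"]}


for row in sweep_top_k({"subject": "VPN", "body": "timeout"}, [1, 3, 5, 10], fake_answer, demo_config):
    print(row)
```

In the notebook, the call would look like `sweep_top_k(sample_ticket, [1, 3, 5, 10], answer_ticket, CONFIG)`.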

---

### Homework 2: Temperature Experimentation

Temperature controls how much randomness the LLM uses when sampling tokens (0 = nearly deterministic, higher = more varied and "creative").

**Your Task:**
1. Try `temperature = 0.1, 0.5, 0.9`
2. Ask the same question multiple times at each temperature
3. Compare answer consistency and quality

**Questions:**
- Are higher temperatures better for support tickets? Why or why not?
- Which temperature gives the most helpful answers?
- Does low temperature make answers too robotic?

**How to do it:**
```python
# Update config
CONFIG['temperature'] = 0.1  # Try 0.1, 0.5, 0.9

# Re-run sample queries
# Note: You may need to ask the same question multiple times to see variation
```

**Success Criteria**: Recommend an optimal temperature range for IT support answers.
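
To put a number on "answer consistency", one rough option is to compare repeated answers pairwise at the character level. The sketch below uses the standard library's `difflib`; `mean_pairwise_similarity` is a hypothetical helper, the sample answers are invented, and the measure captures surface overlap only, not whether two answers mean the same thing.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def mean_pairwise_similarity(answers: List[str]) -> float:
    """Average character-level similarity over all pairs of answers (1.0 = identical)."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 1.0  # zero or one answer: nothing to disagree with
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


# Hypothetical answers collected by asking the same question three times.
low_temp_answers = ["Restart the VPN client."] * 3
high_temp_answers = [
    "Restart the VPN client.",
    "Check your firewall settings first.",
    "Please reinstall the client software.",
]

print(f"low temperature:  {mean_pairwise_similarity(low_temp_answers):.2f}")
print(f"high temperature: {mean_pairwise_similarity(high_temp_answers):.2f}")
```

A score close to 1.0 at every temperature suggests the setting has little effect on this question. Lower scores at higher temperatures are expected.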

---

### Bonus Challenge: End-to-End Evaluation

**Task**: Run batch evaluation on 20 test tickets and analyze the results.

**Questions:**
- What's the average accuracy score?
- Which ticket types get the best answers?
- Do answer scores differ by ticket priority (for example, `medium` vs `low`)?

**How to do it:**
```python
# Run batch evaluation
eval_results = run_batch_evaluation(test_df, num_samples=20)

# Analyze by ticket type
eval_results.groupby('type')['overall'].mean()

# Analyze by priority
eval_results.groupby('priority')['overall'].mean()
```
