# 📊 Notebook 03: RAG Evaluation

## Learning Objectives
In this notebook, you will learn:
1. **Why to evaluate RAG systems** by understanding their failure modes
2. **How to create evaluation questions** covering different aspects
3. **How to run a systematic evaluation** and collect results
4. **How to score responses manually** using a clear rubric
5. **How to generate evaluation reports** in Markdown format
6. **How to debug and tune** retrieval parameters (k, chunk size)

## Why Evaluate RAG?

RAG systems can fail in many ways:
- **Retrieval failures**: Wrong documents retrieved
- **Context ignored**: LLM doesn't use the provided context
- **Hallucination**: LLM makes up information not in context
- **Incomplete answers**: Important details missing

Systematic evaluation helps identify and fix these issues.
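
As a rough illustration of how the hallucination and "context ignored" failure modes might be flagged automatically, here is a minimal sketch that measures how many of an answer's words also appear in the retrieved context. The function name `grounding_ratio` and the sample strings are assumptions made for this example, not part of the project code. Lexical overlap is only a crude proxy, so treat a low ratio as a hint to look closer rather than as a verdict.

```python
import re


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens of length 3+, as a set."""
    return set(re.findall(r"[a-z0-9]{3,}", text.lower()))


def grounding_ratio(answer: str, context: str) -> float:
    """Fraction of distinct answer words that also occur in the context."""
    answer_words = _words(answer)
    if not answer_words:
        return 0.0
    return len(answer_words & _words(context)) / len(answer_words)


# Illustrative strings only (not taken from the ticket dataset)
context = "Customer reports the smart TV will not connect to WiFi after the firmware update."
grounded = "The smart TV will not connect to WiFi."
ungrounded = "The battery overheats during charging."

ratio_grounded = grounding_ratio(grounded, context)
ratio_ungrounded = grounding_ratio(ungrounded, context)
print(f"grounded answer:   {ratio_grounded:.2f}")
print(f"ungrounded answer: {ratio_ungrounded:.2f}")
```

A paraphrased but correct answer can still score low here, which is one reason the manual rubric later in this notebook remains the main tool.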

## Evaluation Dimensions

| Dimension | Question to Ask |
|-----------|----------------|
| **Retrieval Quality** | Are the right documents being retrieved? |
| **Answer Accuracy** | Is the answer factually correct? |
| **Faithfulness** | Does the answer stick to the context? |
| **Relevance** | Does the answer address the question? |
| **Completeness** | Are all important points covered? |

---

## Step 1: Setup and Imports

In [None]:
# Standard library imports
import sys
from pathlib import Path

# Add project root to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

# Set up HuggingFace cache
from src.config import setup_hf_cache
setup_hf_cache()

# Data manipulation
import pandas as pd

print("✓ Setup complete!")

In [None]:
# Import our custom modules
from src import config
from src.rag_pipeline import RAGPipeline, print_rag_response
from src.evaluation import (
    SAMPLE_EVALUATION_QUESTIONS,
    run_evaluation,
    results_to_dataframe,
    results_to_markdown_table,
    print_scoring_guide,
    EvaluationResult,
    EvaluationReport,
    compare_retrieval_k,
)
from src.vectorstore import load_vector_store, search_similar

print("✓ Custom modules imported!")

---

## Step 2: Initialize the RAG Pipeline

In [None]:
# Initialize RAG pipeline
# This loads the vector store and LLM
rag = RAGPipeline(retrieval_k=5)

---

## Step 3: Review the Scoring Guide

Before evaluating, let's understand how to score responses.

In [None]:
# Print the scoring guide
print_scoring_guide()

---

## Step 4: Define Evaluation Questions

We'll create questions that cover different aspects of the support ticket data.
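
One way to keep that coverage explicit is to tag each question with the aspect it probes, so scores can later be grouped per category. The sketch below uses a few of this notebook's questions. The category labels and the helper `group_by_category` are illustrative assumptions, not names from `src.evaluation`.

```python
from collections import defaultdict

# (category, question) pairs; the category labels are assumptions for illustration
tagged_questions = [
    ("product", "What are the most common issues reported for smart TVs?"),
    ("product", "What problems do customers face with GoPro cameras?"),
    ("issue_type", "What are typical billing and payment issues customers report?"),
    ("resolution", "How are refund requests typically handled?"),
]


def group_by_category(pairs):
    """Collect questions under their category, preserving first-seen order."""
    grouped = defaultdict(list)
    for category, question in pairs:
        grouped[category].append(question)
    return dict(grouped)


questions_by_category = group_by_category(tagged_questions)
for category, questions in questions_by_category.items():
    print(f"{category}: {len(questions)} question(s)")
```

A category with only one or two questions is a signal that the question set may be too thin to say much about that aspect.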

In [None]:
# Our evaluation questions
# These cover different categories to test RAG comprehensively

evaluation_questions = [
    # Product-specific questions
    "What are the most common issues reported for smart TVs?",
    "What problems do customers face with GoPro cameras?",
    
    # Issue type questions  
    "What are typical billing and payment issues customers report?",
    "What technical issues are most frequently mentioned?",
    
    # Priority/severity questions
    "What types of issues are marked as critical priority?",
    "What patterns do you see in high priority tickets?",
    
    # Resolution questions
    "How are refund requests typically handled?",
    "What solutions are provided for device connectivity issues?",
    
    # Channel-specific questions
    "What issues come through social media channels?",
    "Are there differences in issues reported via email vs chat?",
]

print(f"Prepared {len(evaluation_questions)} evaluation questions:")
print("-" * 50)
for i, q in enumerate(evaluation_questions, 1):
    print(f"{i}. {q}")

---

## Step 5: Run Evaluation

Let's run all questions through the RAG pipeline and collect results.
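
To make the idea concrete, here is a minimal sketch of what such an evaluation loop could look like. It is not the implementation of `run_evaluation`. `SketchResult`, `run_questions`, and the stub `fake_ask` are hypothetical names that stand in for the real pipeline, so the block runs on its own.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class SketchResult:
    question: str
    answer: str
    seconds: float
    error: Optional[str] = None


def run_questions(ask: Callable[[str], str], questions: List[str]) -> List[SketchResult]:
    """Call `ask` once per question, recording the answer, timing, and any error."""
    results = []
    for question in questions:
        start = time.perf_counter()
        try:
            answer, error = ask(question), None
        except Exception as exc:  # keep going so one failure does not lose the run
            answer, error = "", str(exc)
        results.append(SketchResult(question, answer, time.perf_counter() - start, error))
    return results


def fake_ask(question: str) -> str:
    """Stand-in for a real pipeline call; fails on empty input."""
    if not question.strip():
        raise ValueError("empty question")
    return f"(stub answer to: {question})"


sketch_results = run_questions(fake_ask, ["How are refund requests typically handled?", ""])
for r in sketch_results:
    status = "ok" if r.error is None else f"failed: {r.error}"
    print(f"{r.question!r}: {status}")
```

Catching errors per question means a single bad query shows up as a failed row in the results instead of aborting the whole run.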

In [None]:
# Run evaluation on all questions
report = run_evaluation(rag, evaluation_questions, verbose=True)

---

## Step 6: Review Results and Score

Now let's review each result in detail and assign scores.

In [None]:
# Review each result
print("DETAILED EVALUATION RESULTS")
print("=" * 70)

for i, result in enumerate(report.results, 1):
    print(f"\n{'='*70}")
    print(f"QUESTION {i}: {result.question}")
    print("=" * 70)
    print(f"\nANSWER:\n{result.answer}")
    print(f"\nSOURCES: {result.sources_summary}")
    print("-" * 70)

### Manual Scoring

Based on the results above, let's assign scores. 

**Instructions:**
1. Review each answer and its sources
2. Assign a score 1-5 based on the scoring guide
3. Add comments explaining your score
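
Before applying hand-entered scores, it can help to check that there is exactly one score per question and that each score is within the 1-5 range. The helper below is a hypothetical sketch, not part of `src.evaluation`, and the example scores are placeholders.

```python
from statistics import mean


def validate_scores(scores_and_comments, n_questions, low=1, high=5):
    """Raise ValueError if the scores do not line up with the questions; return the mean."""
    if len(scores_and_comments) != n_questions:
        raise ValueError(
            f"expected {n_questions} scores, got {len(scores_and_comments)}"
        )
    for index, (score, comment) in enumerate(scores_and_comments, 1):
        if not isinstance(score, int) or not low <= score <= high:
            raise ValueError(
                f"score #{index} must be an integer from {low} to {high}: {score!r}"
            )
        if not comment.strip():
            raise ValueError(f"score #{index} has no comment")
    return mean(score for score, _ in scores_and_comments)


# Placeholder scores for three questions
example_scores = [
    (3, "Relevant but brief"),
    (4, "Good coverage"),
    (2, "Misses the question"),
]
average = validate_scores(example_scores, n_questions=3)
print(f"Average score: {average:.2f} / 5.0")
```

Failing loudly on a length mismatch avoids the situation where a missing score silently leaves a result unscored.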

In [None]:
# Score the results
# Modify these scores based on your review of the outputs above!

# Example scoring (you should adjust based on actual outputs)
scores_and_comments = [
    (3, "Provides some relevant info about TV issues, but could be more specific"),
    (3, "Mentions GoPro issues but answer is brief"),
    (4, "Good coverage of billing issues from retrieved tickets"),
    (3, "Lists some technical issues, sources are relevant"),
    (3, "Identifies critical issues but limited detail"),
    (3, "Some patterns identified, could use more analysis"),
    (4, "Good explanation of refund handling process"),
    (3, "Mentions connectivity solutions, sources relevant"),
    (3, "Some social media issues identified"),
    (2, "Limited differentiation between channels"),
]

# Apply scores to results
for i, (score, comment) in enumerate(scores_and_comments):
    if i < len(report.results):
        report.results[i].score = score
        report.results[i].comments = comment

print("✓ Scores applied to results")
print(f"\nAverage score: {report.average_score:.2f} / 5.0")

---

## Step 7: Generate Evaluation Report

Let's create a formatted evaluation report.
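
For readers curious about what goes into a Markdown table, here is a small standalone sketch of rendering scored results by hand. It is not the implementation of `results_to_markdown_table`. The helper name `to_markdown_table` and the example rows are assumptions for illustration.

```python
def to_markdown_table(rows, headers):
    """Render a list of dicts as a Markdown table with the given column order."""

    def cell(value):
        # Escape pipes and flatten newlines so a cell cannot break the table layout
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(row[h]) for h in headers) + " |")
    return "\n".join(lines)


# Illustrative rows; the second comment contains a pipe to show the escaping
example_rows = [
    {"Question": "How are refund requests typically handled?", "Score": 4, "Comments": "Good coverage"},
    {"Question": "What issues come through social media channels?", "Score": 3, "Comments": "Brief | few sources"},
]
table = to_markdown_table(example_rows, ["Question", "Score", "Comments"])
print(table)
```

LLM answers often contain newlines or pipe characters, so escaping cell content matters more for RAG reports than it might for ordinary tabular data.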

In [None]:
# Convert to DataFrame for easy viewing
eval_df = results_to_dataframe(report)

print("EVALUATION RESULTS TABLE")
print("=" * 70)
eval_df

In [None]:
# Generate markdown table
markdown_report = results_to_markdown_table(report)

print("MARKDOWN EVALUATION TABLE")
print("=" * 70)
print(markdown_report)

In [None]:
# Save the markdown report to a file
report_path = project_root / "evaluation_report.md"

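# Write as UTF-8 so non-ASCII characters in answers or comments are saved correctly
with open(report_path, 'w', encoding='utf-8') as f:
    f.write("# RAG Evaluation Report\n\n")
    f.write(f"**Date:** {pd.Timestamp.now().strftime('%Y-%m-%d %H:%M')}\n\n")
    f.write("**Configuration:**\n")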
    f.write(f"- Retrieval k: {rag.retrieval_k}\n")
    f.write(f"- Chunk size: {config.CHUNK_SIZE}\n")
    f.write(f"- Chunk overlap: {config.CHUNK_OVERLAP}\n")
    f.write(f"- LLM: {config.LLM_MODEL_NAME}\n\n")
    f.write("## Results\n\n")
    f.write(markdown_report)

print(f"✓ Report saved to: {report_path}")

---

## Step 8: Debug Section - Parameter Tuning

Let's explore how different parameters affect RAG performance.

### 8.1 Varying Retrieval k

How does the number of retrieved documents affect results?
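
If you hand-label which tickets are relevant to a question, the effect of k can also be put into numbers with precision@k and recall@k. The sketch below assumes such labels exist. The ticket IDs and the function `precision_recall_at_k` are hypothetical and not part of the project code.

```python
def precision_recall_at_k(retrieved_ids, relevant_ids, k):
    """Return (precision@k, recall@k) for a ranked list of retrieved IDs."""
    top_k = retrieved_ids[:k]
    # Count distinct relevant IDs in the top k, so repeated chunks of one ticket count once
    hits = len(set(top_k) & set(relevant_ids))
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall


# Hypothetical ranked ticket IDs and a hand-labeled set of relevant tickets
retrieved = ["T-101", "T-205", "T-317", "T-422", "T-508", "T-611", "T-704", "T-890"]
relevant = {"T-101", "T-317", "T-611", "T-999"}

for k in (3, 5, 8):
    p, r = precision_recall_at_k(retrieved, relevant, k)
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

In this made-up example, raising k improves recall but lowers precision, which is the trade-off to look for when comparing k values on real data.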

In [None]:
# Compare retrieval with different k values
test_question = "What are common billing issues?"

print("COMPARING RETRIEVAL WITH DIFFERENT k VALUES")
print("=" * 60)
print(f"Question: {test_question}")
print("=" * 60)

for k in [3, 5, 8]:
    print(f"\n--- k = {k} ---")
    results = search_similar(rag.vectorstore, test_question, k=k)
    print(f"Retrieved {len(results)} documents:")
    for i, doc in enumerate(results, 1):
        print(f"  {i}. Ticket {doc.metadata.get('ticket_id', 'N/A')} - {doc.metadata.get('product', 'N/A')}")

In [None]:
# Compare answers with different k values
print("\nCOMPARING ANSWERS WITH DIFFERENT k VALUES")
print("=" * 60)

for k in [3, 8]:
    print(f"\n{'='*60}")
    print(f"k = {k}")
    print("=" * 60)
    
    # Create temporary pipeline with different k
    temp_rag = RAGPipeline(
        vectorstore=rag.vectorstore, 
        llm=rag.llm, 
        retrieval_k=k
    )
    
    response = temp_rag.ask(test_question)
    print(f"\nAnswer: {response.answer}")
    print(f"\nSources: {len(response.sources)} documents")

### 8.2 Analysis: k=3 vs k=5 vs k=8

| k Value | Pros | Cons |
|---------|------|------|
| k=3 | Faster, more focused | May miss relevant info |
| k=5 | Good balance (our default) | Compromise: neither the fastest nor the most context |
| k=8 | More context | May include noise, slower |

### 8.3 Chunk Size Impact

Different chunk sizes affect retrieval precision:

| Chunk Size | Pros | Cons |
|------------|------|------|
| 300 chars | More precise matching | Less context per chunk |
| 500 chars | Balanced, good for most cases (our default) | Compromise: neither the most precise nor the most context |
| 800 chars | More context | Less precise, may miss specific info |

**Note:** Changing chunk size requires rebuilding the vector store.
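
To get a feel for what the chunk size does to the index, the sketch below splits a placeholder text into fixed-width character chunks with overlap and counts the results. This is a naive splitter written for illustration, not the one used to build this project's vector store, and the overlap of 50 characters is an assumed value.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into fixed-width character chunks that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


sample = "x" * 2000  # placeholder standing in for 2,000 characters of ticket text
for size in (300, 500, 800):
    chunks = chunk_text(sample, chunk_size=size, overlap=50)
    print(f"chunk_size={size}: {len(chunks)} chunks")
```

Smaller chunks mean more vectors to store and search, while larger chunks mean each retrieved document takes up more of the LLM's context.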

In [None]:
# Show current chunk configuration
print("CURRENT CHUNK CONFIGURATION")
print("=" * 40)
print(f"Chunk size: {config.CHUNK_SIZE} characters")
print(f"Chunk overlap: {config.CHUNK_OVERLAP} characters")
print("\nTo test different chunk sizes:")
print("1. Modify config.CHUNK_SIZE in src/config.py")
print("2. Re-run Notebook 01 to rebuild the vector store")
print("3. Re-run evaluation to compare results")

---

## Step 9: Recommendations Summary

In [None]:
# Generate recommendations based on evaluation
print("EVALUATION SUMMARY & RECOMMENDATIONS")
print("=" * 60)

avg_score = report.average_score
print(f"\n📊 Average Score: {avg_score:.2f} / 5.0")

if avg_score >= 4.0:
    print("\n✅ RAG pipeline is performing well!")
elif avg_score >= 3.0:
    print("\n⚠️ RAG pipeline is acceptable but has room for improvement.")
else:
    print("\n❌ RAG pipeline needs significant improvement.")

print("\n📋 RECOMMENDATIONS:")
print("-" * 40)

recommendations = [
    "1. Try a larger LLM (flan-t5-base or flan-t5-large) for better answers",
    "2. Experiment with k=3 to k=8 to find optimal retrieval count",
    "3. Consider chunk_size=300 for more precise retrieval",
    "4. Add more specific prompting for structured answers",
    "5. For production: use API-based LLMs (OpenAI, Anthropic) for quality",
]

for rec in recommendations:
    print(f"  {rec}")

---

## Summary

### What We Accomplished
1. ✅ Defined 10 evaluation questions covering different aspects
2. ✅ Ran systematic evaluation on the RAG pipeline
3. ✅ Scored responses using a clear rubric (1-5 scale)
4. ✅ Generated evaluation reports (DataFrame and Markdown)
5. ✅ Explored parameter tuning (k values, chunk sizes)
6. ✅ Documented recommendations for improvement

### Key Takeaways
- **Systematic evaluation** is essential for RAG quality
- **Manual scoring** provides a human reference judgment, but it is time-consuming and somewhat subjective
- **Parameter tuning** (k, chunk size) can noticeably change results, so compare settings on the same question set
- **Small LLMs** have limitations - consider larger models for production

### Files Created
- `evaluation_report.md` - Markdown evaluation report

### Next Steps
- Try the Streamlit app (`app.py`) for interactive Q&A
- Experiment with different parameters
- Consider using larger LLMs for better quality

In [None]:
print("\n" + "=" * 60)
print("🎉 Notebook 03 Complete!")
print("=" * 60)
print("\nYou've completed the RAG evaluation!")
print(f"\nFinal Average Score: {report.average_score:.2f} / 5.0")
print(f"\nReport saved to: {report_path}")
print("\n🚀 Try the Streamlit app: streamlit run app.py")