# RAG Evaluation with RAGAS

## The Problem
How do you know if your RAG improvements actually work?

**You need systematic measurement!**

## The Solution
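**RAGAS:** A framework for evaluating RAG systems with metrics such as:
- Context Recall
- Context Precision
- Faithfulness
- Answer Relevancy

This notebook computes three of them: Context Recall, Faithfulness, and Answer Relevancy.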

**Difficulty:** ⭐⭐⭐☆☆

## What is RAGAS?

**RAGAS** (Retrieval-Augmented Generation Assessment) is an automated evaluation framework that uses LLMs to judge your RAG system's quality.

**The key insight:** Instead of manually reviewing answers, RAGAS uses AI to evaluate:
1. **Retrieval quality** - Did we find the right documents?
2. **Generation quality** - Is the answer accurate and relevant?

**Why it matters:**
- Without metrics, you're flying blind
- You need objective comparison between techniques

Let's learn how to measure RAG performance scientifically!

## Step 1: Imports

In [None]:
from utils_openai import (
    setup_openai_api,
    create_embeddings,
    create_llm,
    load_msme_data,
    create_vectorstore,
    get_baseline_prompt,
    load_existing_vectorstore,
)
from ragas import evaluate
from ragas.metrics import context_recall, faithfulness, answer_relevancy
from ragas import EvaluationDataset
from ragas.llms import LangchainLLMWrapper
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

print('[OK] Imports done!')

## Step 2: Setup RAG System

In [None]:
# Reuse a vector store that is already persisted on disk
api_key = setup_openai_api()
embeddings = create_embeddings(api_key)
llm = create_llm(api_key)
vectorstore = load_existing_vectorstore(embeddings, 'msme_ragas', './chroma_raags')
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

prompt = get_baseline_prompt()
rag_chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print('[OK] RAG system ready!')

In [None]:
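# Alternative setup: build the vector store from msme.csv (only needed if no persisted store exists yet)
api_key = setup_openai_api()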
embeddings = create_embeddings(api_key)
llm = create_llm(api_key)
docs, metas, ids = load_msme_data('msme.csv')
vectorstore = create_vectorstore(docs, metas, ids, embeddings, 'msme_ragas', './chroma_raags')
retriever = vectorstore.as_retriever(search_kwargs={'k': 5})

prompt = get_baseline_prompt()
rag_chain = (
    {'context': retriever, 'question': RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print('[OK] RAG system ready!')

**What is Ground Truth?**

Ground truth = the "correct" or "ideal" answer you expect for each question.

Think of it like a test answer key. RAGAS compares your RAG's actual answers against these reference answers to measure quality.

**Note:** You don't need perfect ground truth - even approximate reference answers help RAGAS evaluate performance.

## Step 3: Prepare Evaluation Dataset
Create test queries with ground truth:

In [None]:
eval_questions = [
    'What are the financing options for MSMEs in Nigeria?',
    'How do I register a business in Nigeria?',
    'What challenges do MSMEs face?'
]

eval_ground_truth = [
    'MSMEs can access financing through Development Bank of Nigeria, Bank of Industry, microfinance banks, and government intervention funds.',
    'Register with Corporate Affairs Commission (CAC), obtain Tax Identification Number (TIN), and get necessary licenses.',
    'MSMEs face challenges including poor access to credit, weak infrastructure, discriminatory legislation, and lack of technical skills.'
]

print(f'[OK] Prepared {len(eval_questions)} test cases')

## Step 4: Generate Answers and Contexts

In [None]:
dataset = []

for question, reference in zip(eval_questions, eval_ground_truth):
    # Retrieve documents
    retrieved = retriever.invoke(question)
    contexts = [doc.page_content for doc in retrieved]
    
    # Generate answer
    answer = rag_chain.invoke(question)
    
    dataset.append({
        'user_input': question,
        'retrieved_contexts': contexts,
        'response': answer,
        'reference': reference
    })

print(f'[OK] Generated {len(dataset)} evaluation samples')

**What's happening here:**

For each test question, we're collecting 4 pieces of information:
1. **user_input** - The question asked
2. **retrieved_contexts** - Documents the retriever found
3. **response** - Answer generated by the RAG system
4. **reference** - Ground truth (expected answer)

RAGAS needs all 4 to evaluate different aspects:
- Context Recall uses: question + contexts + reference
- Faithfulness uses: contexts + response
- Answer Relevancy uses: question + response
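
The field requirements above can be checked before spending API calls on an evaluation run. The sketch below is a hypothetical helper, not part of RAGAS: `REQUIRED_FIELDS` and `missing_fields` are names invented here, and the mapping simply mirrors the list above (RAGAS's own validation may ask for more columns).

```python
# Hypothetical helper: which sample fields each metric reads,
# mirroring the list above (not taken from the RAGAS source).
REQUIRED_FIELDS = {
    "context_recall": ["user_input", "retrieved_contexts", "reference"],
    "faithfulness": ["retrieved_contexts", "response"],
    "answer_relevancy": ["user_input", "response"],
}


def missing_fields(sample, metric):
    """Return the fields `metric` needs that are absent or empty in `sample`."""
    return [field for field in REQUIRED_FIELDS[metric] if not sample.get(field)]


# Illustrative sample; 'reference' is deliberately left out
sample = {
    "user_input": "What challenges do MSMEs face?",
    "retrieved_contexts": ["MSMEs often struggle to access credit."],
    "response": "Access to credit is a common challenge.",
}

for metric in REQUIRED_FIELDS:
    print(metric, "->", missing_fields(sample, metric) or "ok")
```

A sample without a `reference` can still be scored for Faithfulness and Answer Relevancy, but not for Context Recall.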

**Why an evaluator LLM?**

RAGAS uses an LLM to judge your RAG system's outputs. This "judge" LLM:
- Compares retrieved contexts to reference answers
- Checks whether generated answers are faithful to the context
- Measures how relevant answers are to questions

For simplicity, this notebook wraps the same LLM that generates the answers with `LangchainLLMWrapper` so RAGAS can use it for evaluation. If you want the judge to be independent of the system under test, wrap a different model in the same way.
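
To make the judge's role concrete: Faithfulness is the fraction of claims in the response that the retrieved context supports, and Context Recall is the fraction of claims in the reference that the retrieved context supports. In RAGAS the LLM extracts the claims and issues the verdicts; the sketch below hard-codes hypothetical verdicts and shows only the final arithmetic. `supported_fraction` is an illustrative name, not a RAGAS function.

```python
def supported_fraction(verdicts):
    """Share of claims judged as supported (True); 0.0 for an empty list."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hypothetical judge verdicts, one boolean per extracted claim
response_claims_supported = [True, True, False, True]  # claims from the generated answer
reference_claims_supported = [True, False]             # claims from the ground truth

faithfulness_score = supported_fraction(response_claims_supported)
context_recall_score = supported_fraction(reference_claims_supported)

print(f"faithfulness={faithfulness_score:.2f}, context_recall={context_recall_score:.2f}")
```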

## Step 5: Create RAGAS Dataset

In [None]:
eval_dataset = EvaluationDataset.from_list(dataset)
print('[OK] RAGAS dataset created!')

## Step 6: Setup Evaluator LLM

In [None]:
evaluator_llm = LangchainLLMWrapper(llm)
print('[OK] Evaluator LLM ready!')

## Step 7: Run Evaluation

In [None]:
result = evaluate(
    dataset=eval_dataset,
    metrics=[context_recall, faithfulness, answer_relevancy],
    llm=evaluator_llm
)

print('\n' + '='*80)
print('RAGAS EVALUATION RESULTS')
print('='*80)

# Convert to pandas DataFrame for easier viewing
result_df = result.to_pandas()
print(result_df)

print('\n' + '='*80)
print('AVERAGE SCORES')
print('='*80)
print(f'Context Recall: {result_df["context_recall"].mean():.3f}')
print(f'Faithfulness: {result_df["faithfulness"].mean():.3f}')
print(f'Answer Relevancy: {result_df["answer_relevancy"].mean():.3f}')
print('='*80)

## Interpreting Your Scores

**All metrics range from 0 to 1 (higher is better)**

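### Quick Score Guide:
Treat these bands as rough rules of thumb rather than fixed thresholds; with only a few test questions, averages can shift noticeably between runs.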
- **0.9 - 1.0**: Excellent - Production ready
- **0.7 - 0.9**: Good - Minor improvements needed
- **0.5 - 0.7**: Fair - Significant issues to address
- **Below 0.5**: Poor - Major problems with retrieval or generation
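
The bands above can be turned into a small lookup. This is a sketch under one assumption: the guide's ranges share their endpoints (0.7 and 0.9), so the hypothetical `score_band` helper assigns a boundary value to the higher band.

```python
def score_band(score):
    """Map a 0-1 metric score to the rough label from the guide above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Fair"
    return "Poor"


# Illustrative values, not real results
for s in (0.95, 0.9, 0.72, 0.5, 0.31):
    print(f"{s:.2f} -> {score_band(s)}")
```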

### What to do if scores are low:
- **Low Context Recall?** → Improve retrieval (try hybrid search, increase k, better chunking)
- **Low Faithfulness?** → LLM is hallucinating → Improve prompts, use RAG-specific models
- **Low Answer Relevancy?** → Answers off-topic → Improve prompt instructions, better context filtering
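
The checklist above can be applied mechanically to a set of average scores. The sketch below is illustrative only: `SUGGESTIONS`, `diagnose`, and the 0.7 threshold are assumptions made here, and the scores are invented rather than measured.

```python
# Suggestions paraphrased from the list above, keyed by the result column names
SUGGESTIONS = {
    "context_recall": "improve retrieval (hybrid search, larger k, better chunking)",
    "faithfulness": "improve prompts or try a RAG-specific model",
    "answer_relevancy": "improve prompt instructions or filter context better",
}


def diagnose(avg_scores, threshold=0.7):
    """Return (metric, suggestion) pairs for every metric scoring below `threshold`."""
    return [
        (metric, SUGGESTIONS[metric])
        for metric, score in avg_scores.items()
        if metric in SUGGESTIONS and score < threshold
    ]


# Hypothetical averages, not real results
avg_scores = {"context_recall": 0.62, "faithfulness": 0.91, "answer_relevancy": 0.68}

for metric, tip in diagnose(avg_scores):
    print(f"{metric}: {tip}")
```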

**How to compare techniques:**

1. Run the SAME test questions on each technique
2. Record the three metrics for each
3. Look for trade-offs:
   - Some techniques improve recall but hurt faithfulness
   - Others are slower but more accurate
   
**Pro tip:** No single technique wins everything. Choose based on your priorities:
- Need accuracy? → Prioritize Faithfulness
- Missing documents? → Prioritize Context Recall
- Vague answers? → Prioritize Answer Relevancy
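
Once the averages for each technique are recorded, finding the winner per metric takes a few lines. This is a minimal sketch: `best_per_metric` is a hypothetical helper, the technique names are placeholders, and the numbers are invented for illustration.

```python
# Hypothetical averages for illustration only; replace with your own results
scores = {
    "Baseline":    {"context_recall": 0.80, "faithfulness": 0.90, "answer_relevancy": 0.85},
    "Technique A": {"context_recall": 0.88, "faithfulness": 0.86, "answer_relevancy": 0.84},
    "Technique B": {"context_recall": 0.78, "faithfulness": 0.94, "answer_relevancy": 0.89},
}


def best_per_metric(scores):
    """Return {metric: technique with the highest score for that metric}."""
    metrics = next(iter(scores.values())).keys()
    return {
        metric: max(scores, key=lambda technique: scores[technique][metric])
        for metric in metrics
    }


for metric, technique in best_per_metric(scores).items():
    print(f"{metric:<20} best: {technique}")
```

In this made-up data no technique leads on every metric, which is the trade-off pattern described above.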

## Step 8: Compare Techniques
Now you can objectively compare all techniques!

In [None]:
# Example comparison: the scores below are illustrative placeholders; replace them with your own results
print('\nTECHNIQUE COMPARISON')
print('='*80)
print(f'{"Technique":<30} {"Recall":<10} {"Faithful":<10} {"Relevant":<10}')
print('-'*80)
print(f'{"Baseline":<30} {0.85:<10.2f} {0.92:<10.2f} {0.88:<10.2f}')
print(f'{"BM25 Hybrid":<30} {0.89:<10.2f} {0.91:<10.2f} {0.90:<10.2f}')
print(f'{"Contextual Compression":<30} {0.83:<10.2f} {0.95:<10.2f} {0.91:<10.2f}')
print(f'{"HyDE":<30} {0.88:<10.2f} {0.89:<10.2f} {0.92:<10.2f}')
print('='*80)

## When to Use
**Always!** Evaluation should be part of every RAG system.

**Use RAGAS to:**
- Compare techniques objectively
- Track improvements over time
- Identify weak points
- Make data-driven decisions

## Exercise

**Final Project:**
1. Evaluate ALL 5 techniques on the same test set
2. Create a comparison table
3. Identify the best technique for each metric
4. Recommend an optimal combination

This is your capstone - show what you've learned!

In [None]:
# Your comprehensive evaluation code here

## Congratulations!

You've completed all 6 Advanced RAG Techniques!

**What you learned:**
- Multiple Query Retrieval
- Contextual Compression
- Semantic Chunking
- Reranking
- HyDE
- RAGAS Evaluation

**Next steps:**
1. Build your own RAG system
2. Combine techniques that work well together
3. Deploy to production
4. Keep learning!

**Resources:**
- Check README.md for decision guides
- Explore https://github.com/NirDiamant/RAG_Techniques/tree/main for more

**Happy building! 🚀**