# 🔍 DeepEval - Advanced RAG Evaluation

Welcome to **Notebook 2** in the DeepEval framework series!

## 🎯 Goals of This Notebook

1. **Build a complete RAG pipeline** with LangChain
2. **RAG-Specific Metrics**: ContextualPrecision, ContextualRecall, ContextualRelevancy, Faithfulness
3. **Generate datasets automatically** with DeepEval's `Synthesizer`
4. **Advanced Evaluation Techniques** for retrieval systems
5. **Performance Analysis** and optimization strategies

## 📖 Why Does RAG Evaluation Matter?

Retrieval-Augmented Generation (RAG) is one of the most widely used architectures for LLM applications, but evaluating RAG systems poses its own challenges:

### 🔍 Challenges of RAG Evaluation:
- **Multi-stage process**: Retrieval → Ranking → Generation
- **Context quality**: Is the context relevant and sufficient?
- **Faithfulness**: Does the LLM stay faithful to the retrieved context?
- **Completeness**: Is any important information missing?
- **Redundancy**: Does the context contain duplicates?

### ✅ How DeepEval Addresses Them:
- **Contextual Metrics**: Assess retrieval quality
- **Faithfulness Metrics**: Check consistency with the context
- **Automated Dataset Generation**: Generate test cases from documents
- **End-to-end Evaluation**: Evaluate the entire RAG pipeline

## 🛠️ Part 1: Setup and Imports

In [None]:
# Core imports
import os
import json
import pandas as pd
import numpy as np
from typing import List, Dict, Any, Optional
import warnings
warnings.filterwarnings('ignore')

# DeepEval imports
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric, 
    ContextualRelevancyMetric,
    FaithfulnessMetric,
    AnswerRelevancyMetric
)
from deepeval.synthesizer import Synthesizer

# LangChain imports for RAG
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA
from langchain.schema import Document

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('default')

print(f"✅ DeepEval version: {deepeval.__version__}")
print("✅ All imports successful!")

In [None]:
# Setup environment
from dotenv import load_dotenv
load_dotenv()

# Check API keys
api_keys_status = {
    "OpenAI": "✅ Configured" if os.getenv("OPENAI_API_KEY") else "❌ Missing",
    "Anthropic": "✅ Configured" if os.getenv("ANTHROPIC_API_KEY") else "❌ Missing"
}

print("🔑 API Keys Status:")
for provider, status in api_keys_status.items():
    print(f"  {provider}: {status}")

if not os.getenv("OPENAI_API_KEY"):
    print("\n⚠️  OPENAI_API_KEY is required to run the RAG evaluation!")
    print("   Create a .env file containing: OPENAI_API_KEY=your_key_here")

## 🏗️ Part 2: Building the RAG Pipeline

### 2.1 Load and Prepare Documents

In [None]:
def load_and_prepare_documents():
    """
    Load the document from the data folder and prepare it for RAG.
    """
    
    # Load document
    doc_path = "data/rag_document.txt"
    
    try:
        with open(doc_path, 'r', encoding='utf-8') as f:
            content = f.read()
        
        print(f"📄 Loaded document: {len(content)} characters")
        print(f"Preview: {content[:200]}...")
        
        # Create Document object
        doc = Document(
            page_content=content,
            metadata={"source": doc_path, "title": "AI and ML Guide"}
        )
        
        return [doc]
        
    except FileNotFoundError:
        print(f"❌ File not found: {doc_path}")
        print("💡 Make sure the notebook is run from the correct directory")
        return []
    except Exception as e:
        print(f"❌ Error loading document: {e}")
        return []

# Load documents
documents = load_and_prepare_documents()

In [None]:
def split_documents(documents: List[Document]) -> List[Document]:
    """
    Split documents into smaller chunks for retrieval.
    """
    if not documents:
        return []
    
    # Create the text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=500,  # Maximum chunk size in characters
        chunk_overlap=50,  # Overlap between chunks
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""]
    )
    
    # Split documents
    chunks = text_splitter.split_documents(documents)
    
    print(f"📑 Document split into {len(chunks)} chunks")
    print(f"📏 Average chunk size: {np.mean([len(chunk.page_content) for chunk in chunks]):.0f} characters")
    
    # Preview first few chunks
    print("\n🔍 Preview chunks:")
    for i, chunk in enumerate(chunks[:3]):
        print(f"  Chunk {i+1}: {chunk.page_content[:100]}...")
    
    return chunks

# Split documents
document_chunks = split_documents(documents)
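
To build intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a plain sliding-window splitter. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on the configured separators); the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# A synthetic 1200-character string stands in for the real document
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_window_chunks(sample, chunk_size=500, chunk_overlap=50)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])
```

The tail of one chunk reappears at the head of the next, which is what keeps a sentence that straddles a boundary retrievable from at least one chunk.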

### 2.2 Create the Vector Store

In [None]:
def create_vector_store(chunks: List[Document]) -> Optional[FAISS]:
    """
    Create a FAISS vector store from the document chunks.
    """
    if not chunks:
        print("❌ No chunks available to create the vector store")
        return None
    
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ OPENAI_API_KEY is required to create embeddings")
        return None
    
    try:
        print("🔄 Creating embeddings...")
        
        # Create embeddings
        embeddings = OpenAIEmbeddings(
            model="text-embedding-ada-002"
        )
        
        # Create the FAISS vector store
        vector_store = FAISS.from_documents(
            documents=chunks,
            embedding=embeddings
        )
        
        print(f"✅ Vector store created successfully with {len(chunks)} documents")
        
        # Test similarity search
        test_query = "What is machine learning?"
        similar_docs = vector_store.similarity_search(test_query, k=2)
        
        print(f"\n🔍 Test similarity search for '{test_query}':")
        for i, doc in enumerate(similar_docs):
            print(f"  Result {i+1}: {doc.page_content[:100]}...")
        
        return vector_store
        
    except Exception as e:
        print(f"❌ Error creating vector store: {e}")
        return None

# Create the vector store
vector_store = create_vector_store(document_chunks)
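
The `similarity_search` call above ranks chunks by how close their embeddings are to the query embedding. The sketch below illustrates the idea with hand-made 3-dimensional vectors and cosine similarity. It is only an illustration: the vectors, document IDs, and helper names are made up, and as far as I know LangChain's FAISS wrapper defaults to Euclidean (L2) distance rather than cosine similarity.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best matches."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (hypothetical values, real ones have ~1500 dimensions)
toy_index = {
    "ml_types": [0.9, 0.1, 0.0],
    "deep_learning": [0.7, 0.7, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A real index avoids scoring every vector exhaustively at scale, but the ranking principle is the same.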

### 2.3 Create the RAG Chain

In [None]:
def create_rag_chain(vector_store: FAISS) -> Optional[RetrievalQA]:
    """
    Create a RAG chain with retrieval and generation.
    """
    if not vector_store:
        print("❌ No vector store available to create the RAG chain")
        return None
    
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ OPENAI_API_KEY is required to create the LLM")
        return None
    
    try:
        # Create the LLM
        llm = OpenAI(
            model_name="gpt-3.5-turbo-instruct",
            temperature=0.1,  # Low temperature for factual responses
            max_tokens=500
        )
        
        # Create the retriever
        retriever = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 3}  # Retrieve top 3 relevant chunks
        )
        
        # Create the RAG chain
        rag_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",  # Combine all retrieved docs
            retriever=retriever,
            return_source_documents=True,  # Return source docs for evaluation
            verbose=True
        )
        
        print("✅ RAG chain created successfully!")
        
        return rag_chain
        
    except Exception as e:
        print(f"❌ Error creating RAG chain: {e}")
        return None

# Create the RAG chain
rag_chain = create_rag_chain(vector_store)

### 2.4 Test RAG Pipeline

In [None]:
def test_rag_pipeline(rag_chain: RetrievalQA) -> List[Dict[str, Any]]:
    """
    Test the RAG pipeline with sample questions.
    """
    if not rag_chain:
        print("❌ No RAG chain available to test")
        return []
    
    # Test questions
    test_questions = [
        "What are the main types of machine learning?",
        "How does deep learning differ from conventional machine learning?",
        "In which fields is AI applied?"
    ]
    
    results = []
    
    print("🧪 Testing RAG Pipeline:\n")
    
    for i, question in enumerate(test_questions, 1):
        try:
            print(f"❓ Question {i}: {question}")
            
            # Get RAG response
            response = rag_chain({"query": question})
            
            answer = response["result"]
            source_docs = response["source_documents"]
            
            print(f"💡 Answer: {answer[:200]}...")
            print(f"📚 Sources: {len(source_docs)} documents retrieved")
            
            # Store result
            results.append({
                "question": question,
                "answer": answer,
                "source_documents": source_docs,
                "num_sources": len(source_docs)
            })
            
            print("-"*50)
            
        except Exception as e:
            print(f"❌ Error on question {i}: {e}")
            results.append({
                "question": question,
                "answer": None,
                "error": str(e)
            })
    
    return results

# Test RAG pipeline
rag_test_results = test_rag_pipeline(rag_chain)

## 📊 Part 3: RAG-Specific Metrics

### 3.1 ContextualRelevancyMetric

Measures how relevant the retrieved context is to the query:

In [None]:
def demo_contextual_relevancy():
    """
    Demo ContextualRelevancyMetric
    """
    
    if not rag_test_results or not rag_test_results[0].get("answer"):
        print("❌ RAG results are required for the contextual relevancy demo")
        return None, None
    
    # Use the first result
    test_result = rag_test_results[0]
    
    # Create the test case for DeepEval
    test_case = LLMTestCase(
        input=test_result["question"],
        actual_output=test_result["answer"],
        retrieval_context=[doc.page_content for doc in test_result["source_documents"]]
    )
    
    print("🎯 ContextualRelevancyMetric Demo")
    print(f"Query: {test_case.input}")
    print(f"Retrieved contexts: {len(test_case.retrieval_context)}")
    
    # Preview contexts
    for i, context in enumerate(test_case.retrieval_context):
        print(f"  Context {i+1}: {context[:100]}...")
    
    try:
        # Create the metric
        relevancy_metric = ContextualRelevancyMetric(
            threshold=0.7,
            model="gpt-3.5-turbo",
            include_reason=True
        )
        
        # Evaluate
        relevancy_metric.measure(test_case)
        
        print("\n📊 Results:")
        print(f"  Score: {relevancy_metric.score:.3f}")
        print(f"  Passed: {'✅' if relevancy_metric.is_successful() else '❌'}")
        print(f"  Reason: {relevancy_metric.reason}")
        
        return relevancy_metric, test_case
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return None, test_case

# Run demo
contextual_relevancy_result = demo_contextual_relevancy()
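
As I understand DeepEval's documentation, the contextual relevancy score is the share of statements in the retrieved context that an LLM judge considers relevant to the query. The sketch below reproduces only that final ratio; the per-chunk verdicts are hard-coded stand-ins for the judge's output, and `contextual_relevancy` is a hypothetical helper name.

```python
from typing import Dict, List


def contextual_relevancy(verdicts_per_node: Dict[str, List[bool]]) -> float:
    """Relevant statements divided by all statements, pooled across retrieved chunks."""
    all_verdicts = [v for verdicts in verdicts_per_node.values() for v in verdicts]
    if not all_verdicts:
        return 0.0
    return sum(all_verdicts) / len(all_verdicts)


# Hypothetical judge verdicts: True means "this statement is relevant to the query"
verdicts = {
    "chunk_1": [True, True],
    "chunk_2": [True, False],
    "chunk_3": [False, False],
}
relevancy = contextual_relevancy(verdicts)
print(f"score={relevancy:.3f}, passed={relevancy >= 0.7}")
```

With `k=3`, a single off-topic chunk is enough to pull the score below the 0.7 threshold used in this notebook, so a low score here often points at `k` or chunk size rather than at the generator.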

### 3.2 ContextualPrecisionMetric

Measures retrieval precision: whether the relevant contexts are ranked ahead of the irrelevant ones:

In [None]:
def demo_contextual_precision():
    """
    Demo ContextualPrecisionMetric
    """
    
    # Use the second result when available, otherwise fall back to the first
    test_result = None
    if rag_test_results:
        test_result = rag_test_results[1] if len(rag_test_results) > 1 else rag_test_results[0]
    
    if not test_result or not test_result.get("answer"):
        print("❌ RAG results are required for the contextual precision demo")
        return None, None
    
    # Create a test case with expected_output so precision can be judged
    test_case = LLMTestCase(
        input=test_result["question"],
        actual_output=test_result["answer"],
        expected_output="Deep learning uses neural networks with many layers to learn complex patterns in data, unlike traditional ML.",  # Expected answer
        retrieval_context=[doc.page_content for doc in test_result["source_documents"]]
    )
    
    print("🎯 ContextualPrecisionMetric Demo")
    print(f"Query: {test_case.input}")
    print(f"Expected: {test_case.expected_output}")
    print(f"Actual: {test_case.actual_output[:100]}...")
    
    try:
        # Create the metric
        precision_metric = ContextualPrecisionMetric(
            threshold=0.7,
            model="gpt-3.5-turbo",
            include_reason=True
        )
        
        # Evaluate
        precision_metric.measure(test_case)
        
        print("\n📊 Results:")
        print(f"  Score: {precision_metric.score:.3f}")
        print(f"  Passed: {'✅' if precision_metric.is_successful() else '❌'}")
        print(f"  Reason: {precision_metric.reason}")
        
        return precision_metric, test_case
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return None, test_case

# Run demo
contextual_precision_result = demo_contextual_precision()
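
Contextual precision is rank-aware. The formula documented by DeepEval, as I understand it, is a weighted cumulative precision: for each position k holding a relevant node, add (relevant nodes up to k) / k, then divide by the total number of relevant nodes. The sketch below implements that arithmetic on hard-coded relevance flags; in the real metric the flags come from an LLM judge.

```python
from typing import List


def contextual_precision(relevance: List[bool]) -> float:
    """Weighted cumulative precision over a ranked list of relevance flags."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0

    relevant_so_far = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            relevant_so_far += 1
            weighted_sum += relevant_so_far / k  # precision@k, counted only at relevant ranks
    return weighted_sum / total_relevant


# Same three chunks, two relevant, in two different orders
good_ranking = contextual_precision([True, True, False])
poor_ranking = contextual_precision([False, True, True])
print(f"relevant chunks first: {good_ranking:.3f}")
print(f"irrelevant chunk first: {poor_ranking:.3f}")
```

Both rankings retrieve the same chunks, yet the one that puts an irrelevant chunk first scores lower. That is the behavior to look for when tuning a reranker.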

### 3.3 ContextualRecallMetric

Measures recall: whether all the information needed was retrieved:

In [None]:
def demo_contextual_recall():
    """
    Demo ContextualRecallMetric
    """
    
    # Use the question about AI applications (third result) when available
    test_result = None
    if rag_test_results:
        test_result = rag_test_results[2] if len(rag_test_results) > 2 else rag_test_results[0]
    
    if not test_result or not test_result.get("answer"):
        print("❌ RAG results are required for the contextual recall demo")
        return None, None
    
    test_case = LLMTestCase(
        input=test_result["question"],
        actual_output=test_result["answer"],
        expected_output="AI is applied in healthcare (diagnosis, drug development), finance (fraud detection, automated trading), transportation (self-driving cars), and education (adaptive learning).",
        retrieval_context=[doc.page_content for doc in test_result["source_documents"]]
    )
    
    print("🎯 ContextualRecallMetric Demo")
    print(f"Query: {test_case.input}")
    print("Expected coverage: Healthcare, Finance, Transportation, Education")
    
    try:
        # Create the metric
        recall_metric = ContextualRecallMetric(
            threshold=0.7,
            model="gpt-3.5-turbo",
            include_reason=True
        )
        
        # Evaluate
        recall_metric.measure(test_case)
        
        print("\n📊 Results:")
        print(f"  Score: {recall_metric.score:.3f}")
        print(f"  Passed: {'✅' if recall_metric.is_successful() else '❌'}")
        print(f"  Reason: {recall_metric.reason}")
        
        return recall_metric, test_case
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return None, test_case

# Run demo
contextual_recall_result = demo_contextual_recall()
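
Contextual recall asks how much of the expected answer can be attributed to the retrieved chunks. The sketch below mirrors the four-domain example above. The attribution step is an assumption: DeepEval uses an LLM judge, whereas `keyword_judge` here is a naive, hypothetical substring check used only to keep the example self-contained.

```python
from typing import Callable, List


def contextual_recall(
    expected_statements: List[str],
    retrieval_context: List[str],
    is_attributable: Callable[[str, str], bool],
) -> float:
    """Fraction of expected statements supported by at least one retrieved chunk."""
    if not expected_statements:
        return 0.0
    attributed = sum(
        1
        for statement in expected_statements
        if any(is_attributable(statement, node) for node in retrieval_context)
    )
    return attributed / len(expected_statements)


def keyword_judge(statement: str, node: str) -> bool:
    """Naive stand-in for an LLM judge: case-insensitive substring match."""
    return statement.lower() in node.lower()


expected_domains = ["healthcare", "finance", "transportation", "education"]
retrieved_nodes = [
    "AI supports healthcare diagnostics and drug discovery.",
    "In finance, AI is used for fraud detection.",
]
recall = contextual_recall(expected_domains, retrieved_nodes, keyword_judge)
print(f"recall={recall:.2f}")
```

Only two of the four expected domains appear in the retrieved chunks, so the retriever, not the generator, is what limits the answer here.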

### 3.4 FaithfulnessMetric

Measures faithfulness: whether the answer stays faithful to the retrieved context:

In [None]:
def demo_faithfulness():
    """
    Demo FaithfulnessMetric
    """
    
    if not rag_test_results or not rag_test_results[0].get("answer"):
        print("❌ RAG results are required for the faithfulness demo")
        return
    
    # Create a good (faithful) test case
    test_result = rag_test_results[0]
    
    faithful_test = LLMTestCase(
        input=test_result["question"],
        actual_output=test_result["answer"],
        retrieval_context=[doc.page_content for doc in test_result["source_documents"]]
    )
    
    # Create an unfaithful test case
    unfaithful_test = LLMTestCase(
        input="What are the main types of machine learning?",
        actual_output="Machine learning was invented in 1955 by Alan Turing. There are 5 main types: Quantum Learning, Bio Learning, Cosmic Learning, Magic Learning, and Time Learning. They all use crystal processors.",
        retrieval_context=[doc.page_content for doc in test_result["source_documents"]]
    )
    
    print("🎯 FaithfulnessMetric Demo")
    
    # Test faithful case
    print("\n🧪 Test Case 1: Faithful Answer")
    print(f"Answer: {faithful_test.actual_output[:150]}...")
    
    try:
        faithfulness_metric_1 = FaithfulnessMetric(
            threshold=0.7,
            model="gpt-3.5-turbo",
            include_reason=True
        )
        
        faithfulness_metric_1.measure(faithful_test)
        
        print("\n📊 Faithful Test Results:")
        print(f"  Score: {faithfulness_metric_1.score:.3f}")
        print(f"  Passed: {'✅' if faithfulness_metric_1.is_successful() else '❌'}")
        print(f"  Reason: {faithfulness_metric_1.reason[:150]}...")
        
    except Exception as e:
        print(f"❌ Error in faithful test: {e}")
    
    # Test unfaithful case
    print("\n🧪 Test Case 2: Unfaithful Answer")
    print(f"Answer: {unfaithful_test.actual_output}")
    
    try:
        faithfulness_metric_2 = FaithfulnessMetric(
            threshold=0.7,
            model="gpt-3.5-turbo",
            include_reason=True
        )
        
        faithfulness_metric_2.measure(unfaithful_test)
        
        print("\n📊 Unfaithful Test Results:")
        print(f"  Score: {faithfulness_metric_2.score:.3f}")
        print(f"  Passed: {'✅' if faithfulness_metric_2.is_successful() else '❌'}")
        print(f"  Reason: {faithfulness_metric_2.reason[:150]}...")
        
    except Exception as e:
        print(f"❌ Error in unfaithful test: {e}")

# Run demo
demo_faithfulness()
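
When reading the two scores above, it helps to know how claims are counted. As I understand DeepEval's behavior, the judge labels each claim in the answer as supported ("yes"), contradicted ("no"), or not determinable from the context ("idk"), and only contradicted claims lower the score. The sketch below encodes that assumption; the verdict lists are hard-coded, and returning 1.0 when there are no claims is a choice made for this sketch.

```python
from typing import List


def faithfulness_score(verdicts: List[str]) -> float:
    """Share of claims that do not contradict the retrieved context."""
    if not verdicts:
        return 1.0  # assumption of this sketch: no claims means nothing to contradict
    not_contradicted = sum(1 for v in verdicts if v.strip().lower() != "no")
    return not_contradicted / len(verdicts)


# Hypothetical judge verdicts for an answer containing four claims
mixed_answer = faithfulness_score(["yes", "idk", "no", "idk"])
# An answer whose claims all contradict the context
contradicting_answer = faithfulness_score(["no", "no", "no"])
print(f"mixed: {mixed_answer:.2f}, contradicting: {contradicting_answer:.2f}")
```

Under this counting rule, a fabricated claim that the context neither confirms nor denies may not be penalized, so it is worth reading `reason` and not only `score` for the unfaithful test case.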

## 🔬 Part 4: Comprehensive RAG Evaluation

### 4.1 Multi-Metric RAG Evaluation

In [None]:
def comprehensive_rag_evaluation(rag_results: List[Dict]) -> pd.DataFrame:
    """
    Run a comprehensive evaluation over all RAG results.
    """
    if not rag_results or not any(result.get("answer") for result in rag_results):
        print("❌ No RAG results to evaluate")
        return pd.DataFrame()
    
    # Create all metrics
    metrics = {
        "Answer Relevancy": AnswerRelevancyMetric(threshold=0.7, model="gpt-3.5-turbo"),
        "Contextual Relevancy": ContextualRelevancyMetric(threshold=0.7, model="gpt-3.5-turbo"),
        "Faithfulness": FaithfulnessMetric(threshold=0.7, model="gpt-3.5-turbo")
    }
    
    evaluation_results = []
    
    print("🔍 Comprehensive RAG Evaluation")
    print(f"Evaluating {len(rag_results)} RAG responses with {len(metrics)} metrics\n")
    
    for i, result in enumerate(rag_results):
        if not result.get("answer"):
            continue
            
        print(f"📝 Evaluating Question {i+1}: {result['question'][:50]}...")
        
        # Create the test case
        test_case = LLMTestCase(
            input=result["question"],
            actual_output=result["answer"],
            retrieval_context=[doc.page_content for doc in result["source_documents"]]
        )
        
        result_row = {
            "Question_ID": i + 1,
            "Question": result["question"],
            "Answer_Length": len(result["answer"]),
            "Num_Retrieved_Docs": len(result["source_documents"])
        }
        
        # Evaluate each metric
        for metric_name, metric in metrics.items():
            try:
                # Create a fresh metric instance to avoid state conflicts
                metric_instance = metric.__class__(
                    threshold=metric.threshold,
                    model=getattr(metric, 'model', 'gpt-3.5-turbo')
                )
                
                metric_instance.measure(test_case)
                
                result_row[f"{metric_name}_Score"] = round(metric_instance.score, 3)
                result_row[f"{metric_name}_Passed"] = metric_instance.is_successful()
                
                status = "✅" if metric_instance.is_successful() else "❌"
                print(f"  {metric_name}: {status} ({metric_instance.score:.3f})")
                
            except Exception as e:
                print(f"  {metric_name}: ❌ Error - {e}")
                result_row[f"{metric_name}_Score"] = 0.0
                result_row[f"{metric_name}_Passed"] = False
        
        evaluation_results.append(result_row)
        print()
    
    # Create the DataFrame
    df = pd.DataFrame(evaluation_results)
    
    return df

# Run comprehensive evaluation
evaluation_df = comprehensive_rag_evaluation(rag_test_results)
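
The evaluation above runs only three metrics because its test cases carry no `expected_output`. As I understand DeepEval's documentation, ContextualPrecisionMetric and ContextualRecallMetric need an expected answer, while the other three do not. The sketch below encodes that assumption as a lookup table; `REQUIRED_FIELDS` and `runnable_metrics` are hypothetical names, not part of DeepEval.

```python
from typing import Any, Dict, List, Set

# Assumed field requirements per metric (check against the DeepEval docs for your version)
REQUIRED_FIELDS: Dict[str, Set[str]] = {
    "AnswerRelevancyMetric": {"input", "actual_output"},
    "FaithfulnessMetric": {"input", "actual_output", "retrieval_context"},
    "ContextualRelevancyMetric": {"input", "actual_output", "retrieval_context"},
    "ContextualPrecisionMetric": {"input", "actual_output", "expected_output", "retrieval_context"},
    "ContextualRecallMetric": {"input", "actual_output", "expected_output", "retrieval_context"},
}


def runnable_metrics(test_case_fields: Dict[str, Any]) -> List[str]:
    """Return the metrics whose required fields are all present and non-empty."""
    provided = {name for name, value in test_case_fields.items() if value}
    return sorted(
        metric for metric, required in REQUIRED_FIELDS.items() if required <= provided
    )


without_expected = runnable_metrics({
    "input": "In which fields is AI applied?",
    "actual_output": "Healthcare and finance.",
    "retrieval_context": ["AI supports healthcare diagnostics."],
})
print(without_expected)
```

Adding reference answers, for example from the synthetic dataset in Part 5, is what makes the remaining two metrics usable in a batch run.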

In [None]:
# Display and analyze the results
def analyze_evaluation_results(df: pd.DataFrame):
    """
    Analyze and summarize the evaluation results.
    """
    if df.empty:
        print("❌ No data to analyze")
        return
    
    print("📊 RAG Evaluation Results Summary\n")
    
    # Display the table
    display_columns = [col for col in df.columns if not col.endswith('_Passed')]
    print(df[display_columns].to_string(index=False))
    
    # Calculate statistics
    print("\n📈 Evaluation Statistics:")
    
    metric_columns = [col for col in df.columns if col.endswith('_Score')]
    
    for col in metric_columns:
        metric_name = col.replace('_Score', '')
        avg_score = df[col].mean()
        
        # Pass rate
        pass_col = col.replace('_Score', '_Passed')
        if pass_col in df.columns:
            pass_rate = df[pass_col].mean() * 100
        else:
            pass_rate = 0
        
        print(f"  {metric_name}:")
        print(f"    Average Score: {avg_score:.3f}")
        print(f"    Pass Rate: {pass_rate:.1f}%")
    
    # Overall statistics
    print("\n🎯 Overall Performance:")
    print(f"  Questions Evaluated: {len(df)}")
    print(f"  Average Answer Length: {df['Answer_Length'].mean():.0f} characters")
    print(f"  Average Retrieved Docs: {df['Num_Retrieved_Docs'].mean():.1f}")
    
    return df

# Analyze results
analyzed_df = analyze_evaluation_results(evaluation_df)
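
One caveat when reading the averages: the evaluation loop stores a score of 0.0 whenever a metric raises an exception, so an API error looks identical to a genuinely bad answer. The sketch below shows an alternative summary that treats failed measurements as missing. It is an illustration with made-up scores, and `summarize_scores` is a hypothetical helper.

```python
from typing import Dict, List, Optional


def summarize_scores(scores: List[Optional[float]], threshold: float = 0.7) -> Dict[str, Optional[float]]:
    """Average and pass rate over successful measurements only; None marks an errored run."""
    valid = [s for s in scores if s is not None]
    errors = len(scores) - len(valid)
    if not valid:
        return {"evaluated": 0, "errors": errors, "average": None, "pass_rate": None}
    passed = sum(1 for s in valid if s >= threshold)
    return {
        "evaluated": len(valid),
        "errors": errors,
        "average": sum(valid) / len(valid),
        "pass_rate": 100.0 * passed / len(valid),
    }


# Second question hit an API error (hypothetical data)
summary = summarize_scores([0.9, None, 0.8])
zero_filled_average = (0.9 + 0.0 + 0.8) / 3
print(summary)
print(f"average if the error is recorded as 0.0: {zero_filled_average:.3f}")
```

Reporting the error count next to the average keeps infrastructure problems from being mistaken for quality regressions.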

### 4.2 Visualization

In [None]:
def visualize_rag_evaluation(df: pd.DataFrame):
    """
    Create visualizations for the RAG evaluation results.
    """
    if df.empty:
        print("❌ No data to visualize")
        return
    
    # Setup plots
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('RAG Evaluation Results Analysis', fontsize=16, fontweight='bold')
    
    # 1. Metric Scores Comparison
    metric_scores = [col for col in df.columns if col.endswith('_Score')]
    score_data = df[metric_scores].copy()  # Copy so renaming columns does not affect df
    score_data.columns = [col.replace('_Score', '') for col in score_data.columns]
    
    score_data.boxplot(ax=axes[0,0])
    axes[0,0].set_title('Distribution of Metric Scores')
    axes[0,0].set_ylabel('Score')
    axes[0,0].tick_params(axis='x', rotation=45)
    
    # 2. Pass Rate by Metric
    pass_rates = []
    metric_names = []
    
    for col in metric_scores:
        metric_name = col.replace('_Score', '')
        pass_col = col.replace('_Score', '_Passed')
        if pass_col in df.columns:
            pass_rate = df[pass_col].mean() * 100
            pass_rates.append(pass_rate)
            metric_names.append(metric_name)
    
    bars = axes[0,1].bar(metric_names, pass_rates, color=['skyblue', 'lightgreen', 'coral'])
    axes[0,1].set_title('Pass Rate by Metric')
    axes[0,1].set_ylabel('Pass Rate (%)')
    axes[0,1].tick_params(axis='x', rotation=45)
    
    # Add value labels on bars
    for bar, rate in zip(bars, pass_rates):
        axes[0,1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
                      f'{rate:.1f}%', ha='center', va='bottom')
    
    # 3. Score Correlation Heatmap
    if len(score_data.columns) > 1:
        correlation_matrix = score_data.corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
                   square=True, ax=axes[1,0])
        axes[1,0].set_title('Metric Score Correlations')
    else:
        axes[1,0].text(0.5, 0.5, 'Need more metrics\nfor correlation', 
                      ha='center', va='center', transform=axes[1,0].transAxes)
        axes[1,0].set_title('Metric Score Correlations')
    
    # 4. Answer Length vs Performance
    if 'Answer_Length' in df.columns and metric_scores:
        avg_score = df[metric_scores].mean(axis=1)
        scatter = axes[1,1].scatter(df['Answer_Length'], avg_score, 
                                  c=df['Num_Retrieved_Docs'], cmap='viridis', alpha=0.7)
        axes[1,1].set_xlabel('Answer Length (characters)')
        axes[1,1].set_ylabel('Average Score')
        axes[1,1].set_title('Answer Length vs Performance')
        
        # Add colorbar
        cbar = plt.colorbar(scatter, ax=axes[1,1])
        cbar.set_label('Num Retrieved Docs')
    
    plt.tight_layout()
    plt.show()
    
    # Print insights
    print("\n🔍 Key Insights:")
    
    if not score_data.empty:
        best_metric = score_data.mean().idxmax()
        worst_metric = score_data.mean().idxmin()
        
        print(f"  • Best performing metric: {best_metric} ({score_data[best_metric].mean():.3f})")
        print(f"  • Lowest performing metric: {worst_metric} ({score_data[worst_metric].mean():.3f})")
        
        if len(score_data.columns) > 1:
            correlation_matrix = score_data.corr()
            high_corr_pairs = []
            for i in range(len(correlation_matrix.columns)):
                for j in range(i+1, len(correlation_matrix.columns)):
                    corr_val = correlation_matrix.iloc[i, j]
                    if abs(corr_val) > 0.7:
                        high_corr_pairs.append((correlation_matrix.columns[i], 
                                              correlation_matrix.columns[j], corr_val))
            
            if high_corr_pairs:
                print("  • High correlations found:")
                for metric1, metric2, corr in high_corr_pairs:
                    print(f"    - {metric1} & {metric2}: {corr:.3f}")

# Create visualizations
visualize_rag_evaluation(evaluation_df)

## 🤖 Part 5: Automated Dataset Generation

### 5.1 Using the DeepEval Synthesizer

In [None]:
def create_synthetic_dataset():
    """
    Create a synthetic dataset from the documents using the DeepEval Synthesizer.
    """
    
    if not document_chunks:
        print("❌ Document chunks are required to create the synthetic dataset")
        return []
    
    if not os.getenv("OPENAI_API_KEY"):
        print("❌ OPENAI_API_KEY is required to create the synthetic dataset")
        return []
    
    try:
        print("🔄 Creating synthetic dataset...")
        
        # Create the Synthesizer
        synthesizer = Synthesizer(
            model="gpt-3.5-turbo",
            multithreading=False  # Set to False to avoid rate limiting
        )
        
        # Prepare contexts from the document chunks
        contexts = []
        for chunk in document_chunks[:5]:  # Use only the first 5 chunks for the demo
            contexts.append([chunk.page_content])
        
        print(f"📝 Generating synthetic data from {len(contexts)} contexts...")
        
        # Generate synthetic test cases
        synthetic_test_cases = synthesizer.generate_goldens_from_contexts(
            contexts=contexts,
            max_goldens_per_context=2  # 2 test cases per context
        )
        
        print(f"✅ Generated {len(synthetic_test_cases)} synthetic test cases")
        
        # Preview first few test cases
        print("\n🔍 Preview Synthetic Test Cases:")
        for i, test_case in enumerate(synthetic_test_cases[:3]):
            print(f"\n  Test Case {i+1}:")
            print(f"    Input: {test_case.input}")
            print(f"    Expected Output: {test_case.expected_output[:100]}...")
            print(f"    Context: {len(test_case.context)} items")
        
        return synthetic_test_cases
        
    except Exception as e:
        print(f"❌ Error creating synthetic dataset: {e}")
        print("💡 This may be caused by rate limiting or API quota. Try reducing the number of contexts.")
        return []

# Generate synthetic dataset
synthetic_test_cases = create_synthetic_dataset()

### 5.2 Evaluate Synthetic Dataset

In [None]:
def evaluate_synthetic_dataset(test_cases: List[LLMTestCase]) -> pd.DataFrame:
    """
    Evaluate the synthetic test cases against the RAG system.
    """
    if not test_cases:
        print("❌ No synthetic test cases to evaluate")
        return pd.DataFrame()
    
    if not rag_chain:
        print("❌ A RAG chain is required to evaluate the synthetic cases")
        return pd.DataFrame()
    
    print(f"🧪 Evaluating {len(test_cases)} synthetic test cases with the RAG system")
    
    results = []
    
    for i, test_case in enumerate(test_cases):
        try:
            print(f"\n📝 Test Case {i+1}: {test_case.input[:50]}...")
            
            # Get RAG response
            rag_response = rag_chain({"query": test_case.input})
            
            # Build a test case with the actual output from the RAG chain
            evaluated_test_case = LLMTestCase(
                input=test_case.input,
                actual_output=rag_response["result"],
                expected_output=test_case.expected_output,
                retrieval_context=[doc.page_content for doc in rag_response["source_documents"]]
            )
            
            # Evaluate with multiple metrics
            metrics = {
                "Answer_Relevancy": AnswerRelevancyMetric(threshold=0.7),
                "Faithfulness": FaithfulnessMetric(threshold=0.7)
            }
            
            result_row = {
                "Test_ID": i + 1,
                "Question": test_case.input,
                "Expected_Output": test_case.expected_output[:100] + "...",
                "Actual_Output": rag_response["result"][:100] + "...",
                "Num_Retrieved": len(rag_response["source_documents"])
            }
            
            # Evaluate each metric
            for metric_name, metric in metrics.items():
                try:
                    metric.measure(evaluated_test_case)
                    result_row[f"{metric_name}_Score"] = round(metric.score, 3)
                    result_row[f"{metric_name}_Passed"] = metric.is_successful()
                    
                    status = "✅" if metric.is_successful() else "❌"
                    print(f"  {metric_name}: {status} ({metric.score:.3f})")
                    
                except Exception as e:
                    print(f"  {metric_name}: ❌ Error - {e}")
                    result_row[f"{metric_name}_Score"] = 0.0
                    result_row[f"{metric_name}_Passed"] = False
            
            results.append(result_row)
            
        except Exception as e:
            print(f"❌ Error on test case {i+1}: {e}")
    
    return pd.DataFrame(results)

# Evaluate synthetic dataset
synthetic_evaluation_df = evaluate_synthetic_dataset(synthetic_test_cases)

In [None]:
# Analyze synthetic dataset results
def analyze_synthetic_results(df: pd.DataFrame):
    """
    Analyze the results of the synthetic dataset evaluation.
    """
    if df.empty:
        print("❌ No synthetic results to analyze")
        return
    
    print("📊 Synthetic Dataset Evaluation Results\n")
    
    # Display summary table
    display_cols = ['Test_ID', 'Question', 'Num_Retrieved', 'Answer_Relevancy_Score', 'Faithfulness_Score']
    available_cols = [col for col in display_cols if col in df.columns]
    
    if available_cols:
        print(df[available_cols].to_string(index=False))
    
    # Statistics
    print("\n📈 Synthetic Dataset Statistics:")
    
    metric_cols = [col for col in df.columns if col.endswith('_Score')]
    for col in metric_cols:
        metric_name = col.replace('_Score', '')
        avg_score = df[col].mean()
        
        pass_col = col.replace('_Score', '_Passed')
        if pass_col in df.columns:
            pass_rate = df[pass_col].mean() * 100
        else:
            pass_rate = 0
        
        print(f"  {metric_name}:")
        print(f"    Average Score: {avg_score:.3f}")
        print(f"    Pass Rate: {pass_rate:.1f}%")
    
    print("\n🎯 Dataset Quality:")
    print(f"  Total Synthetic Cases: {len(df)}")
    if 'Num_Retrieved' in df.columns:
        print(f"  Average Retrieved Docs: {df['Num_Retrieved'].mean():.1f}")
    
    # Identify potential issues
    print("\n🔍 Potential Issues:")
    
    if 'Answer_Relevancy_Passed' in df.columns:
        failed_relevancy = df[~df['Answer_Relevancy_Passed']]
        if not failed_relevancy.empty:
            print(f"  • {len(failed_relevancy)} cases failed Answer Relevancy")
    
    if 'Faithfulness_Passed' in df.columns:
        failed_faithfulness = df[~df['Faithfulness_Passed']]
        if not failed_faithfulness.empty:
            print(f"  • {len(failed_faithfulness)} cases failed Faithfulness")
    
    return df

# Analyze synthetic results
analyzed_synthetic_df = analyze_synthetic_results(synthetic_evaluation_df)

## 🎯 Part 6: Advanced Evaluation Techniques

### 6.1 Custom RAG Evaluation Pipeline

In [None]:
class RAGEvaluationPipeline:
    """
    Custom pipeline for comprehensive RAG evaluation.
    """
    
    def __init__(self, rag_chain, vector_store, model="gpt-3.5-turbo"):
        self.rag_chain = rag_chain
        self.vector_store = vector_store
        self.model = model
        self.evaluation_history = []
    
    def create_comprehensive_metrics(self, threshold=0.7):
        """
        Create a comprehensive set of metrics for RAG evaluation.
        """
        return {
            "answer_relevancy": AnswerRelevancyMetric(threshold=threshold, model=self.model),
            "contextual_relevancy": ContextualRelevancyMetric(threshold=threshold, model=self.model),
            "faithfulness": FaithfulnessMetric(threshold=threshold, model=self.model)
        }
    
    def evaluate_question(self, question: str, expected_answer: Optional[str] = None) -> Dict[str, Any]:
        """
        Evaluate a single question with the RAG system.
        """
        if not self.rag_chain:
            raise ValueError("RAG chain not available")
        
        # Get RAG response
        rag_response = self.rag_chain({"query": question})
        
        # Create test case
        test_case = LLMTestCase(
            input=question,
            actual_output=rag_response["result"],
            expected_output=expected_answer,
            retrieval_context=[doc.page_content for doc in rag_response["source_documents"]]
        )
        
        # Evaluate with each metric
        metrics = self.create_comprehensive_metrics()
        results = {
            "question": question,
            "answer": rag_response["result"],
            "retrieved_docs": len(rag_response["source_documents"]),
            "metrics": {}
        }
        
        for metric_name, metric in metrics.items():
            try:
                metric.measure(test_case)
                results["metrics"][metric_name] = {
                    "score": round(metric.score, 3),
                    "passed": metric.is_successful(),
                    "reason": getattr(metric, 'reason', 'No reason provided')
                }
            except Exception as e:
                results["metrics"][metric_name] = {
                    "score": 0.0,
                    "passed": False,
                    "error": str(e)
                }
        
        # Store in history
        self.evaluation_history.append(results)
        
        return results
    
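    def batch_evaluate(self, questions: List[str], expected_answers: Optional[List[str]] = None) -> List[Dict[str, Any]]:
        """
        Batch evaluation of multiple questions.
        """
        if expected_answers is None:
            expected_answers = [None] * len(questions)
        elif len(expected_answers) != len(questions):
            # zip() below would silently drop the unmatched questions
            raise ValueError("questions and expected_answers must have the same length")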
        
        results = []
        
        for i, (question, expected) in enumerate(zip(questions, expected_answers)):
            print(f"🔍 Evaluating {i+1}/{len(questions)}: {question[:50]}...")
            
            try:
                result = self.evaluate_question(question, expected)
                results.append(result)
                
                # Print quick summary
                passed_metrics = sum(1 for m in result["metrics"].values() if m.get("passed", False))
                total_metrics = len(result["metrics"])
                print(f"  ✅ {passed_metrics}/{total_metrics} metrics passed")
                
            except Exception as e:
                print(f"  ❌ Error: {e}")
                results.append({
                    "question": question,
                    "error": str(e)
                })
        
        return results
    
    def get_performance_summary(self) -> Dict[str, Any]:
        """
        Get a performance summary of all evaluations.
        """
        if not self.evaluation_history:
            return {"message": "No evaluations performed yet"}
        
        # Collect all metric scores
        metric_scores = {}
        metric_pass_counts = {}
        
        for result in self.evaluation_history:
            if "metrics" in result:
                for metric_name, metric_data in result["metrics"].items():
                    if "score" in metric_data:
                        if metric_name not in metric_scores:
                            metric_scores[metric_name] = []
                            metric_pass_counts[metric_name] = 0
                        
                        metric_scores[metric_name].append(metric_data["score"])
                        if metric_data.get("passed", False):
                            metric_pass_counts[metric_name] += 1
        
        # Calculate summary statistics
        summary = {
            "total_evaluations": len(self.evaluation_history),
            "metrics_summary": {}
        }
        
        for metric_name, scores in metric_scores.items():
            summary["metrics_summary"][metric_name] = {
                "average_score": round(np.mean(scores), 3),
                "min_score": round(min(scores), 3),
                "max_score": round(max(scores), 3),
                "pass_rate": round(metric_pass_counts[metric_name] / len(scores) * 100, 1)
            }
        
        return summary

# Create evaluation pipeline
if rag_chain and vector_store:
    eval_pipeline = RAGEvaluationPipeline(rag_chain, vector_store)
    print("✅ RAG Evaluation Pipeline created successfully!")
else:
    print("❌ Cannot create pipeline without RAG chain and vector store")
    eval_pipeline = None

In [None]:
# Demo advanced evaluation pipeline
def demo_advanced_evaluation():
    """
    Demo advanced RAG evaluation pipeline
    """
    if not eval_pipeline:
        print("❌ Evaluation pipeline not available")
        return None, None  # keep the (results, summary) shape that callers unpack
    
    # Advanced test questions
    advanced_questions = [
        "Compare the advantages and disadvantages of supervised and unsupervised learning.",
        "Why is deep learning more effective than traditional machine learning for image processing?",
        "What ethical challenges is AI currently facing?",
        "Which AI development trends will matter most in the future?"
    ]
    
    print("🚀 Advanced RAG Evaluation Demo")
    print(f"Testing {len(advanced_questions)} complex questions\n")
    
    # Run batch evaluation
    results = eval_pipeline.batch_evaluate(advanced_questions)
    
    # Get performance summary
    summary = eval_pipeline.get_performance_summary()
    
    print("\n📊 Advanced Evaluation Summary:")
    # The summary holds only a "message" key when no evaluation completed
    print(f"Total Evaluations: {summary.get('total_evaluations', 0)}")
    
    if "metrics_summary" in summary:
        for metric_name, stats in summary["metrics_summary"].items():
            print(f"\n{metric_name.replace('_', ' ').title()}:")
            print(f"  Average Score: {stats['average_score']}")
            print(f"  Score Range: {stats['min_score']} - {stats['max_score']}")
            print(f"  Pass Rate: {stats['pass_rate']}%")
    
    return results, summary

# Run advanced evaluation
if eval_pipeline:
    advanced_results, advanced_summary = demo_advanced_evaluation()
else:
    print("⏭️  Skipping advanced evaluation demo")
    advanced_results, advanced_summary = None, None
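
The pipeline returns nested dictionaries, while the synthetic-dataset section worked with flat rows. The sketch below shows one way to flatten results shaped like the output of `batch_evaluate()` into rows. `flatten_results` and the sample data are hypothetical and not part of DeepEval or of the notebook's pipeline.

```python
def flatten_results(results):
    """Turn nested evaluation results into flat rows, one per question."""
    rows = []
    for case_id, result in enumerate(results, start=1):
        row = {
            "case_id": case_id,
            "question": result.get("question"),
            "retrieved_docs": result.get("retrieved_docs"),
            "error": result.get("error"),
        }
        for name, data in result.get("metrics", {}).items():
            # .get() yields None when a metric entry carries no score
            row[f"{name}_score"] = data.get("score")
            row[f"{name}_passed"] = data.get("passed", False)
        rows.append(row)
    return rows


# Made-up results in the shape produced by batch_evaluate()
sample_results = [
    {
        "question": "What is overfitting?",
        "answer": "...",
        "retrieved_docs": 3,
        "metrics": {
            "faithfulness": {"score": 0.91, "passed": True, "reason": "..."},
            "answer_relevancy": {"score": 0.64, "passed": False, "reason": "..."},
        },
    },
    # A question whose evaluation raised before any metric ran
    {"question": "What is RLHF?", "error": "rate limit"},
]

rows = flatten_results(sample_results)
for row in rows:
    print(row)
```

Passing `rows` to `pd.DataFrame(rows)` should give a table that can be inspected the same way as the synthetic evaluation DataFrame.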

## 🔧 Part 7: RAG Optimization Strategies

### 7.1 Identifying Performance Bottlenecks

In [None]:
def analyze_rag_bottlenecks(evaluation_results: List[Dict]) -> Dict[str, Any]:
    """
    Analyze performance bottlenecks in the RAG system.
    """
    if not evaluation_results:
        return {"message": "No evaluation results to analyze"}
    
    print("🔍 Analyzing RAG Performance Bottlenecks\n")
    
    # Collect metrics data
    failed_cases = []
    low_score_cases = []
    metric_issues = {}
    
    for i, result in enumerate(evaluation_results):
        if "metrics" not in result:
            continue
        
        case_failed = False
        case_scores = []
        
        for metric_name, metric_data in result["metrics"].items():
            if "score" in metric_data:
                score = metric_data["score"]
                passed = metric_data.get("passed", False)
                
                case_scores.append(score)
                
                if not passed:
                    case_failed = True
                    if metric_name not in metric_issues:
                        metric_issues[metric_name] = []
                    metric_issues[metric_name].append({
                        "case_id": i,
                        "question": result["question"],
                        "score": score,
                        "reason": metric_data.get("reason", "No reason")
                    })
        
        if case_failed:
            failed_cases.append(i)
        
        avg_score = np.mean(case_scores) if case_scores else 0
        if avg_score < 0.6:  # Low overall performance
            low_score_cases.append({
                "case_id": i,
                "question": result["question"],
                "avg_score": avg_score,
                "retrieved_docs": result.get("retrieved_docs", 0)
            })
    
    # Analysis results
    analysis = {
        "total_cases": len(evaluation_results),
        "failed_cases": len(failed_cases),
        "low_score_cases": len(low_score_cases),
        "failure_rate": round(len(failed_cases) / len(evaluation_results) * 100, 1)
    }
    
    print("📊 Performance Overview:")
    print(f"  Total Cases: {analysis['total_cases']}")
    print(f"  Failed Cases: {analysis['failed_cases']} ({analysis['failure_rate']}%)")
    print(f"  Low Score Cases: {analysis['low_score_cases']}")
    
    # Metric-specific issues
    print("\n🎯 Metric-Specific Issues:")
    for metric_name, issues in metric_issues.items():
        print(f"\n  {metric_name.replace('_', ' ').title()}:")
        print(f"    Failed Cases: {len(issues)}")
        
        if issues:
            avg_failed_score = np.mean([issue["score"] for issue in issues])
            print(f"    Average Failed Score: {avg_failed_score:.3f}")
            
            # Show worst case
            worst_case = min(issues, key=lambda x: x["score"])
            print(f"    Worst Case: '{worst_case['question'][:50]}...' (Score: {worst_case['score']})")
    
    # Recommendations
    print("\n💡 Optimization Recommendations:")
    
    if "answer_relevancy" in metric_issues and len(metric_issues["answer_relevancy"]) > 0:
        print("  • Answer Relevancy Issues:")
        print("    - Consider improving prompt engineering")
        print("    - Review question-answer alignment")
        print("    - Fine-tune retrieval parameters")
    
    if "contextual_relevancy" in metric_issues and len(metric_issues["contextual_relevancy"]) > 0:
        print("  • Contextual Relevancy Issues:")
        print("    - Improve chunking strategy")
        print("    - Retrieve fewer documents or rerank them to cut irrelevant context")
        print("    - Enhance embedding model")
    
    if "faithfulness" in metric_issues and len(metric_issues["faithfulness"]) > 0:
        print("  • Faithfulness Issues:")
        print("    - Add grounding instructions to prompts")
        print("    - Implement citation mechanisms")
        print("    - Review context completeness")
    
    # Retrieval analysis
    if low_score_cases:
        avg_retrieved = np.mean([case["retrieved_docs"] for case in low_score_cases])
        print("  • Retrieval Analysis:")
        print(f"    - Low scoring cases average {avg_retrieved:.1f} retrieved docs")
        if avg_retrieved < 2:
            print("    - Consider increasing retrieval count (k parameter)")
        elif avg_retrieved > 5:
            print("    - Consider reducing retrieval count to avoid noise")
    
    return analysis

# Analyze bottlenecks
if advanced_results:
    bottleneck_analysis = analyze_rag_bottlenecks(advanced_results)
else:
    print("⏭️  Skipping bottleneck analysis (no advanced results)")
    bottleneck_analysis = None  # keep the name defined for later cells
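
To decide where to optimize first, it can help to attribute each failed metric to a pipeline stage. The sketch below counts failures per stage for results shaped like the pipeline output. The `STAGE_BY_METRIC` mapping and `failures_by_stage` are illustrative assumptions, not a DeepEval API; they follow the usual split of contextual metrics for the retriever and faithfulness and answer relevancy for the generator.

```python
# Assumed mapping from metric name to pipeline stage (illustrative only)
STAGE_BY_METRIC = {
    "contextual_relevancy": "retrieval",
    "contextual_precision": "retrieval",
    "contextual_recall": "retrieval",
    "faithfulness": "generation",
    "answer_relevancy": "generation",
}


def failures_by_stage(results):
    """Count failed metrics per stage, ignoring entries without a score."""
    counts = {"retrieval": 0, "generation": 0}
    for result in results:
        for name, data in result.get("metrics", {}).items():
            stage = STAGE_BY_METRIC.get(name)
            if stage and "score" in data and not data.get("passed", False):
                counts[stage] += 1
    return counts


# Made-up results for illustration
sample = [
    {"question": "q1", "metrics": {
        "contextual_relevancy": {"score": 0.4, "passed": False},
        "faithfulness": {"score": 0.9, "passed": True},
    }},
    {"question": "q2", "metrics": {
        "contextual_relevancy": {"score": 0.5, "passed": False},
        "answer_relevancy": {"score": 0.6, "passed": False},
    }},
]

stage_counts = failures_by_stage(sample)
print(stage_counts)  # {'retrieval': 2, 'generation': 1}
```

When most failures fall on the retrieval side, tuning chunking, `k`, or the embedding model is likely to pay off before any prompt changes.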

### 7.2 A/B Testing Different RAG Configurations

In [None]:
def ab_test_rag_configurations():
    """
    A/B test different RAG configurations
    """
    if not vector_store or not os.getenv("OPENAI_API_KEY"):
        print("❌ Need vector store and API key for A/B testing")
        return
    
    print("🧪 A/B Testing RAG Configurations\n")
    
    # Configuration A: Conservative (k=2, higher temperature)
    try:
        llm_a = OpenAI(
            model_name="gpt-3.5-turbo-instruct",
            temperature=0.3,  # Higher temperature
            max_tokens=300    # Shorter responses
        )
        
        retriever_a = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 2}  # Fewer documents
        )
        
        rag_chain_a = RetrievalQA.from_chain_type(
            llm=llm_a,
            chain_type="stuff",
            retriever=retriever_a,
            return_source_documents=True
        )
        
        print("✅ Configuration A: Conservative (k=2, temp=0.3, max_tokens=300)")
        
    except Exception as e:
        print(f"❌ Error creating config A: {e}")
        return
    
    # Configuration B: Aggressive (k=4, lower temperature)
    try:
        llm_b = OpenAI(
            model_name="gpt-3.5-turbo-instruct",
            temperature=0.1,  # Lower temperature
            max_tokens=600    # Longer responses
        )
        
        retriever_b = vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}  # More documents
        )
        
        rag_chain_b = RetrievalQA.from_chain_type(
            llm=llm_b,
            chain_type="stuff",
            retriever=retriever_b,
            return_source_documents=True
        )
        
        print("✅ Configuration B: Aggressive (k=4, temp=0.1, max_tokens=600)")
        
    except Exception as e:
        print(f"❌ Error creating config B: {e}")
        return
    
    # Test questions
    test_questions = [
        "How do machine learning and deep learning differ?",
        "What applications does AI have in healthcare?"
    ]
    
    # Run A/B test
    results_comparison = []
    
    for question in test_questions:
        print(f"\n🔍 Testing: {question}")
        
        # Test Configuration A
        try:
            response_a = rag_chain_a({"query": question})
            
            test_case_a = LLMTestCase(
                input=question,
                actual_output=response_a["result"],
                retrieval_context=[doc.page_content for doc in response_a["source_documents"]]
            )
            
            # Quick evaluation
            relevancy_a = AnswerRelevancyMetric(threshold=0.7)
            relevancy_a.measure(test_case_a)
            
            print(f"  Config A: Score {relevancy_a.score:.3f}, Docs: {len(response_a['source_documents'])}, Length: {len(response_a['result'])}")
            
        except Exception as e:
            print(f"  Config A: Error - {e}")
            relevancy_a = None
        
        # Test Configuration B
        try:
            response_b = rag_chain_b({"query": question})
            
            test_case_b = LLMTestCase(
                input=question,
                actual_output=response_b["result"],
                retrieval_context=[doc.page_content for doc in response_b["source_documents"]]
            )
            
            # Quick evaluation
            relevancy_b = AnswerRelevancyMetric(threshold=0.7)
            relevancy_b.measure(test_case_b)
            
            print(f"  Config B: Score {relevancy_b.score:.3f}, Docs: {len(response_b['source_documents'])}, Length: {len(response_b['result'])}")
            
        except Exception as e:
            print(f"  Config B: Error - {e}")
            relevancy_b = None
        
        # Compare
        if relevancy_a is not None and relevancy_b is not None:
            if relevancy_a.score > relevancy_b.score:
                print("  🏆 Winner: Config A (Conservative)")
            elif relevancy_b.score > relevancy_a.score:
                print("  🏆 Winner: Config B (Aggressive)")
            else:
                print("  🤝 Tie")
            
            results_comparison.append({
                "question": question,
                "config_a_score": relevancy_a.score,
                "config_b_score": relevancy_b.score,
                "winner": "A" if relevancy_a.score > relevancy_b.score else "B" if relevancy_b.score > relevancy_a.score else "Tie"
            })
    
    # Summary
    if results_comparison:
        print("\n📊 A/B Test Summary:")
        winners = [r["winner"] for r in results_comparison]
        a_wins = winners.count("A")
        b_wins = winners.count("B")
        ties = winners.count("Tie")
        
        print(f"  Config A (Conservative) wins: {a_wins}")
        print(f"  Config B (Aggressive) wins: {b_wins}")
        print(f"  Ties: {ties}")
        
        if a_wins > b_wins:
            print("  🏆 Overall Winner: Configuration A (Conservative)")
            print("  💡 Recommendation: Use fewer docs (k=2) with higher temperature")
        elif b_wins > a_wins:
            print("  🏆 Overall Winner: Configuration B (Aggressive)")
            print("  💡 Recommendation: Use more docs (k=4) with lower temperature")
        else:
            print("  🤝 Results are tied - consider context-specific tuning")
    
    return results_comparison

# Run A/B test
ab_test_results = ab_test_rag_configurations()
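
LLM-judged scores can vary between runs, so a tiny score gap may not reflect a real difference between configurations. The sketch below treats gaps under a margin as ties. `pick_winner` and the `min_delta` value of 0.05 are hypothetical choices for illustration; a suitable margin depends on how noisy the judge is in practice.

```python
def pick_winner(score_a, score_b, min_delta=0.05):
    """Return 'A', 'B', or 'Tie'; gaps smaller than min_delta count as a tie."""
    if abs(score_a - score_b) < min_delta:
        return "Tie"
    return "A" if score_a > score_b else "B"


# Made-up (config A, config B) score pairs, one per test question
pairs = [(0.82, 0.80), (0.60, 0.90), (0.95, 0.70)]
winners = [pick_winner(a, b) for a, b in pairs]
print(winners)  # ['Tie', 'B', 'A']
```

With only a handful of test questions, a win count is weak evidence; running more questions or repeating each measurement would make the comparison more trustworthy.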

## 🎓 Part 8: Exercises and Practice

### Exercise 1: Custom RAG Evaluation

In [None]:
# Exercise 1: Create a custom evaluation for domain-specific questions
def exercise_1_custom_rag_evaluation():
    """
    TODO: Write 5 in-depth AI/ML questions and evaluate them with the RAG system.
    Requirements:
    1. Questions must be hard and require context from the document
    2. Evaluate with at least 3 metrics
    3. Analyze the results and draw insights
    """
    
    # TODO: Create the list of in-depth questions
    expert_questions = [
        # Add your expert-level questions here
        "Your question 1",
        "Your question 2",
        "Your question 3",
        "Your question 4",
        "Your question 5"
    ]
    
    # TODO: Implement evaluation logic
    results = []
    
    # TODO: Analyze and visualize results
    
    return results

print("💡 Exercise 1 Template created. Complete the function above!")
print("Hints:")
print("- Use rag_chain to get responses")
print("- Create an LLMTestCase with retrieval_context")
print("- Use multiple metrics for a comprehensive evaluation")
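
As a starting point for the second hint, the sketch below converts a chain response into the keyword arguments used for `LLMTestCase` earlier in this notebook. `to_test_case_kwargs` is a hypothetical helper, and the fake response only imitates the `result` / `source_documents` shape returned by the RAG chain, so the block runs without an API key.

```python
from types import SimpleNamespace


def to_test_case_kwargs(question, response, expected_output=None):
    """Map a RetrievalQA-style response dict to LLMTestCase keyword arguments."""
    return {
        "input": question,
        "actual_output": response["result"],
        "expected_output": expected_output,
        "retrieval_context": [doc.page_content for doc in response["source_documents"]],
    }


# Stand-in for a real chain response (made-up content)
fake_response = {
    "result": "Supervised learning uses labeled data.",
    "source_documents": [SimpleNamespace(page_content="Chunk about labeled data.")],
}

kwargs = to_test_case_kwargs("What is supervised learning?", fake_response)
print(kwargs["retrieval_context"])  # ['Chunk about labeled data.']
```

With a real response, `LLMTestCase(**kwargs)` builds the test case that each metric's `measure()` call expects.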

### Exercise 2: Synthetic Dataset Generation

In [None]:
# Exercise 2: Generate and evaluate a synthetic dataset
def exercise_2_synthetic_dataset():
    """
    TODO: Create a synthetic dataset from custom documents.
    Requirements:
    1. Load additional documents (you can write custom content)
    2. Generate synthetic test cases
    3. Evaluate the quality of the synthetic data
    4. Compare with real questions
    """
    
    # TODO: Create custom document content
    custom_content = """
    Add your custom content here about a specific AI/ML topic
    that you want to test RAG evaluation with.
    Make it detailed and information-rich.
    """
    
    # TODO: Process the content and create the synthetic dataset
    
    # TODO: Evaluate synthetic vs real questions
    
    return None

print("💡 Exercise 2 Template created. Complete the function above!")
print("Hints:")
print("- Use Synthesizer.generate_goldens_from_contexts()")
print("- Compare quality metrics between synthetic and real data")
print("- Analyze the types of questions that get generated")
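
One preparatory step for the first hint is grouping document chunks into contexts. The sketch below assumes that `generate_goldens_from_contexts()` takes a `contexts` argument shaped as a list of contexts, each a list of strings; check this against the DeepEval version you have installed. `group_chunks` and its `group_size` parameter are hypothetical.

```python
def group_chunks(chunks, group_size=2):
    """Group consecutive chunks into contexts (a list of lists of strings)."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return [chunks[i:i + group_size] for i in range(0, len(chunks), group_size)]


# Made-up chunks standing in for the output of a text splitter
chunks = ["chunk a", "chunk b", "chunk c", "chunk d", "chunk e"]
contexts = group_chunks(chunks, group_size=2)
print(contexts)
# [['chunk a', 'chunk b'], ['chunk c', 'chunk d'], ['chunk e']]
```

Grouping neighbouring chunks keeps related passages together, which tends to produce questions that need more than one chunk to answer.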

### Exercise 3: RAG Optimization

In [None]:
# Exercise 3: Optimize RAG performance
def exercise_3_rag_optimization():
    """
    TODO: Experiment with different RAG configurations to optimize performance.
    Requirements:
    1. Test at least 3 different configurations
    2. Vary parameters: chunk_size, k, temperature, max_tokens
    3. Measure performance with multiple metrics
    4. Recommend the optimal config
    """
    
    configurations = [
        {
            "name": "Config 1",
            "chunk_size": 300,
            "k": 2,
            "temperature": 0.1,
            "max_tokens": 400
        },
        # TODO: Add more configurations
    ]
    
    # TODO: Implement testing logic
    
    # TODO: Compare configurations and choose a winner
    
    return None

print("💡 Exercise 3 Template created. Complete the function above!")
print("Hints:")
print("- Create a separate RAG chain for each configuration")
print("- Use the same test questions for a fair comparison")
print("- Consider tradeoffs: accuracy vs speed vs cost")
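
Rather than writing every configuration by hand, the list can be generated from a small parameter grid. The sketch below uses `itertools.product`; `expand_grid` and the grid values are illustrative assumptions, and `max_tokens` is left out only to keep the example short.

```python
from itertools import product

# Hypothetical parameter grid; the values are examples, not recommendations
grid = {
    "chunk_size": [300, 500],
    "k": [2, 4],
    "temperature": [0.1, 0.3],
}


def expand_grid(grid):
    """Build one config dict per combination of parameter values."""
    keys = list(grid)
    configs = []
    for i, values in enumerate(product(*(grid[key] for key in keys)), start=1):
        config = dict(zip(keys, values))
        config["name"] = f"Config {i}"
        configs.append(config)
    return configs


configurations = expand_grid(grid)
print(len(configurations))  # 8
print(configurations[0])
```

Each added parameter value multiplies the number of chains to build and evaluate, so a coarse grid keeps API cost manageable.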

## 🎯 Summary and Next Steps

### 🏆 What you learned in this notebook:

1. **✅ RAG Pipeline Construction**
   - Document loading and chunking strategies
   - Vector store creation with FAISS
   - End-to-end RAG chain with LangChain

2. **✅ RAG-Specific Metrics**
   - **ContextualRelevancyMetric**: Evaluates the quality of the retrieved context
   - **ContextualPrecisionMetric**: Evaluates the relevance ranking of the retrieved contexts
   - **ContextualRecallMetric**: Evaluates the completeness of retrieval
   - **FaithfulnessMetric**: Evaluates consistency with the source material

3. **✅ Automated Dataset Generation**
   - Using the DeepEval Synthesizer
   - Generating test cases from documents
   - Quality assessment of synthetic data

4. **✅ Advanced Evaluation Techniques**
   - Custom RAG evaluation pipeline
   - Batch evaluation strategies
   - Performance bottleneck analysis
   - A/B testing different configurations

5. **✅ Optimization Strategies**
   - Parameter tuning (k, temperature, chunk_size)
   - Configuration comparison
   - Performance recommendations

### 🚀 Next Steps - Notebook 3: Code Generation Evaluation

In the next notebook, we will learn:

- 💻 **Custom Metrics with G-Eval** for code evaluation
- 🔍 **Code Quality Metrics**: Correctness, Readability, Efficiency
- 🛡️ **Security Review Metrics**: Vulnerability detection
- 📊 **Code Review Automation** with DeepEval
- 🧪 **Testing Code Generation** systems

### 📊 Key Insights from RAG Evaluation:

- **Context Quality > Quantity**: 2-3 relevant chunks usually beat 5-6 noisy chunks
- **Faithfulness is Critical**: LLMs hallucinate easily without proper grounding
- **Threshold Tuning**: Use different thresholds for different use cases
- **Synthetic Data**: Useful for scaling evaluation, but its quality must be validated
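
To make the threshold-tuning point concrete, the sketch below shows how the pass rate of one fixed set of scores shifts as the threshold moves. The scores are made up, `pass_rate` is a hypothetical helper, and a case is assumed to pass when its score is at or above the threshold.

```python
def pass_rate(scores, threshold):
    """Percentage of scores at or above the threshold."""
    if not scores:
        return 0.0
    passed = sum(score >= threshold for score in scores)
    return round(passed / len(scores) * 100, 1)


# Made-up metric scores for five test cases
scores = [0.55, 0.62, 0.71, 0.78, 0.90]

for threshold in (0.5, 0.7, 0.8):
    print(f"threshold={threshold}: pass rate {pass_rate(scores, threshold)}%")
# threshold=0.5: pass rate 100.0%
# threshold=0.7: pass rate 60.0%
# threshold=0.8: pass rate 20.0%
```

The same system looks very different at 0.5 and at 0.8, so the threshold should be chosen from the cost of a wrong answer in the target use case rather than left at a default.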

### 💡 Best Practices Summary:

1. **Always evaluate retrieval and generation separately**
2. **Use multiple metrics for a comprehensive assessment**
3. **A/B test different configurations**
4. **Monitor performance continuously**
5. **Balance accuracy, speed, and cost**

---

## 🎉 Excellent Work!

You have mastered advanced RAG evaluation with DeepEval!

Ready for **Notebook 3: Evaluating Code Generation and Review**? 🚀💻