# EVALUATION OF SEMANTIC SEARCH AND ITS ROLE IN RETRIEVED-AUGMENTED-GENERATION (RAG) FOR ARABIC LANGUAGE

## Paper Information
- **Title**: Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language
- **Authors**: Ali Mahboub, Muhy Eddin Za'ter, Bashar Al-Rfooh, Yazan Estaitia, Adnan Jaljuli, Asma Hakouz
- **Organization**: Maqsam, Amman, Jordan
- **ArXiv Link**: [2403.18350v2](https://arxiv.org/abs/2403.18350v2)

## Abstract
The paper establishes a benchmark for Arabic semantic search and evaluates how effective semantic search is within a RAG framework. The study compares five multilingual encoders and demonstrates the importance of semantic search for improving RAG quality in Arabic.

## 1. Environment Setup & Dependencies

In [None]:
# Install required packages
!pip install langchain langchain-openai langchain-community
!pip install sentence-transformers
!pip install faiss-cpu
!pip install pandas numpy matplotlib seaborn
!pip install scikit-learn
!pip install tqdm
!pip install python-dotenv

In [None]:
import os
import numpy as np
import pandas as pd
from typing import List, Dict, Tuple, Any
from dataclasses import dataclass
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

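# LangChain imports (split packages, matching the pip installs above)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# Aliased so it is not shadowed by the Document dataclass defined in Section 2
from langchain_core.documents import Document as LCDocument
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate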

# Evaluation metrics
from sklearn.metrics import ndcg_score

# Set random seed for reproducibility
np.random.seed(42)

# Load environment variables
from dotenv import load_dotenv
load_dotenv()

## 2. Data Structures & Dataset Generation

The paper uses a dataset consisting of:
- 2030 customer support call summaries (in Arabic)
- 406 search queries
- Relevance scores: 0 (irrelevant), 1 (somewhat relevant), 2 (very relevant)
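
The counts above work out to five summaries per query (2030 / 406 = 5). A minimal sketch of how such graded judgments might be stored and turned into a label list for a ranked result; the IDs (`d1`, `d9`, ...) are hypothetical and not taken from the paper's dataset:

```python
from collections import Counter

# Corpus sizes quoted above.
N_SUMMARIES, N_QUERIES = 2030, 406
docs_per_query = N_SUMMARIES // N_QUERIES

# Hypothetical judgments for one query: doc_id -> relevance label (0, 1 or 2).
judgments = {"d1": 2, "d2": 1, "d3": 0, "d4": 0, "d5": 1}

# A ranked list returned by some retriever; "d9" was never judged.
ranked = ["d2", "d9", "d1"]

# Unjudged documents are treated as irrelevant (label 0).
ranked_labels = [judgments.get(doc_id, 0) for doc_id in ranked]
label_counts = Counter(judgments.values())

print(docs_per_query, ranked_labels, dict(label_counts))
```

Treating unjudged documents as irrelevant is a design choice: it keeps the metrics defined for any retrieved list, at the cost of penalising relevant documents that simply were never labeled.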

In [None]:
@dataclass
class SearchQuery:
    """Represents a search query with associated documents and relevance scores"""
    query_id: str
    query_text: str
    document_relevance: Dict[str, int]  # doc_id -> relevance score (0, 1, 2)

@dataclass 
class Document:
    """Represents a document in the corpus"""
    doc_id: str
    content: str
    metadata: Dict[str, Any] = None

In [None]:
# Generate synthetic dataset mimicking the paper's approach
# In production, this would use GPT-4 as described in Section 3.1

def generate_synthetic_arabic_dataset(n_docs: int = 100, n_queries: int = 20) -> Tuple[List[Document], List[SearchQuery]]:
    """
    Generate synthetic dataset for testing.
    In real implementation, this would use GPT-4 to generate Arabic customer support data.
    """
    # Mock Arabic customer support summaries
    arabic_summaries = [
        "العميل يواجه مشكلة في تسجيل الدخول إلى حسابه",
        "طلب استرداد المبلغ المدفوع للمنتج المعيب",
        "استفسار حول موعد التسليم المتأخر",
        "شكوى بخصوص جودة الخدمة المقدمة",
        "طلب تحديث معلومات الحساب الشخصي"
    ]
    
    documents = []
    for i in range(n_docs):
        doc = Document(
            doc_id=f"doc_{i}",
            content=arabic_summaries[i % len(arabic_summaries)] + f" - حالة رقم {i}",
            metadata={"category": f"category_{i % 5}"}
        )
        documents.append(doc)
    
    # Generate queries with relevance scores
    queries = []
    arabic_queries = [
        "مشكلة تسجيل دخول",
        "استرداد مبلغ",
        "تأخير التسليم",
        "جودة الخدمة",
        "تحديث المعلومات"
    ]
    
    for i in range(n_queries):
        query = SearchQuery(
            query_id=f"query_{i}",
            query_text=arabic_queries[i % len(arabic_queries)],
            document_relevance={}
        )
        
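        # Assign relevance scores (simulating GPT-4 labeling).
        # Every document is labeled by topic, so near-duplicate summaries on the
        # query's topic are not scored as irrelevant when they are retrieved.
        n_topics = len(arabic_summaries)
        for j, doc in enumerate(documents):
            topic_gap = abs(j % n_topics - i % n_topics)
            if topic_gap == 0:  # Very relevant
                query.document_relevance[doc.doc_id] = 2
            elif topic_gap == 1:  # Somewhat relevant
                query.document_relevance[doc.doc_id] = 1
            else:  # Irrelevant
                query.document_relevance[doc.doc_id] = 0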
        
        queries.append(query)
    
    return documents, queries

# Generate test dataset
documents, queries = generate_synthetic_arabic_dataset()

## 3. Semantic Search Implementation with LangChain

### 3.1 Encoder Setup
The paper evaluates five multilingual encoders. Here they are loaded through LangChain's `HuggingFaceEmbeddings`.
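
For intuition about what the vector store does with these embeddings: assuming a flat (exhaustive) index over unit-length vectors, retrieval amounts to ranking documents by their inner product with the query. A minimal NumPy sketch with toy 2-D vectors standing in for real encoder outputs; `top_k_cosine` is an illustrative helper, not part of LangChain or the paper:

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k documents most similar to the query."""
    # Normalise to unit length so the inner product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(-scores)[:k]  # highest similarity first
    return order, scores[order]

# Toy "embeddings": four documents and one query in 2-D.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([2.0, 0.1])

idx, sims = top_k_cosine(query_vec, doc_vecs, k=3)
print(idx, sims)
```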

In [None]:
# Define encoders from the paper
ENCODERS = {
    "encoder_1": {
        "model_name": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "embedding_size": 384,
        "description": "Paraphrase Multilingual MiniLM"
    },
    "encoder_2": {
        "model_name": "sentence-transformers/use-cmlm-multilingual", 
        "embedding_size": 768,
        "description": "CMLM Multilingual"
    },
    "encoder_3": {
        "model_name": "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
        "embedding_size": 768,
        "description": "Paraphrase Multilingual MPNet"
    },
    "encoder_4": {
        "model_name": "sentence-transformers/distiluse-base-multilingual-cased-v1",
        "embedding_size": 512,
        "description": "Multilingual DistilBERT"
    },
    "encoder_5": {
        "model_name": "symanto/sn-xlm-roberta-base-snli-mnli-anli-xnli",
        "embedding_size": 768,
        "description": "XLM-RoBERTa"
    }
}

In [None]:
class SemanticSearchEvaluator:
    """Evaluates semantic search performance using different encoders"""
    
    def __init__(self, encoder_config: Dict[str, Any]):
        self.encoder_name = encoder_config["model_name"]
        self.embeddings = HuggingFaceEmbeddings(
            model_name=self.encoder_name,
            encode_kwargs={'normalize_embeddings': True}
        )
        self.vector_store = None
        
    def index_documents(self, documents: List[Document]):
        """Create FAISS index from documents using LangChain"""
        # Pass raw texts and metadata instead of LangChain Document objects:
        # the Document dataclass from Section 2 shadows LangChain's class,
        # so Document(page_content=...) would raise a TypeError here.
        self.vector_store = FAISS.from_texts(
            texts=[doc.content for doc in documents],
            embedding=self.embeddings,
            metadatas=[{"doc_id": doc.doc_id} for doc in documents]
        )
        
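    def search(self, query: str, k: int = 3) -> List[Tuple[str, float]]:
        """Perform semantic search and return (doc_id, score) pairs.

        With LangChain's default FAISS settings the score is an L2 distance,
        so a lower value means a closer match.
        """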
        if not self.vector_store:
            raise ValueError("Documents not indexed yet")
            
        results = self.vector_store.similarity_search_with_score(query, k=k)
        return [(doc.metadata["doc_id"], score) for doc, score in results]
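
The score returned by `similarity_search_with_score` is a distance rather than a similarity when the index uses the L2 metric. Assuming unit-normalised embeddings (as `normalize_embeddings=True` requests) and a squared L2 distance (which is what FAISS flat L2 indexes are generally understood to report), the distance `d` maps to cosine similarity as `1 - d / 2`. The sketch below checks that identity on random vectors; `squared_l2_to_cosine` is a hypothetical helper name:

```python
import numpy as np

def squared_l2_to_cosine(dist):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - dist / 2.0

# Two random vectors standing in for 384-dimensional embeddings.
rng = np.random.default_rng(0)
a, b = rng.normal(size=(2, 384))
a = a / np.linalg.norm(a)
b = b / np.linalg.norm(b)

cosine = float(a @ b)
squared_l2 = float(np.sum((a - b) ** 2))

# For unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
recovered = squared_l2_to_cosine(squared_l2)
print(cosine, recovered)
```

If the index was built with a different metric, this conversion does not apply, so it is worth confirming the distance strategy before interpreting the scores.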

### 3.2 Evaluation Metrics Implementation

Implementation of the metrics from Section 3.2 of the paper: nDCG, MRR, and mAP.
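
Before the class-based version, a small worked example may help fix the definitions. The functions below use common textbook forms (exponential gain for DCG, binary relevance for average precision); the paper's equations may differ in detail, and the helper names are illustrative only:

```python
import math

def dcg_at_k(labels, k):
    # Gain (2^rel - 1) discounted by log2(rank + 1), ranks starting at 1.
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels[:k], start=1))

def ndcg_at_k(labels, k):
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0

def reciprocal_rank_at_k(labels, k, target=2):
    # Rank of the first document whose label equals `target`.
    for rank, rel in enumerate(labels[:k], start=1):
        if rel == target:
            return 1.0 / rank
    return 0.0

def average_precision_at_k(labels, k):
    # Binary relevance: any label above 0 counts as a hit.
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels[:k], start=1):
        if rel > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

labels = [1, 0, 2]  # graded labels of the top-3 results, in ranked order
# DCG  = 1/log2(2) + 0 + 3/log2(4) = 2.5
# IDCG = 3/log2(2) + 1/log2(3) + 0 (ideal order [2, 1, 0])
ndcg = ndcg_at_k(labels, 3)
rr = reciprocal_rank_at_k(labels, 3)
ap = average_precision_at_k(labels, 3)
print(round(ndcg, 4), round(rr, 4), round(ap, 4))
```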

In [None]:
class MetricsCalculator:
    """Calculate evaluation metrics for semantic search as defined in the paper"""
    
    @staticmethod
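    def calculate_ndcg_at_k(relevance_scores: List[int], k: int = 3) -> float:
        """
        Calculate Normalized Discounted Cumulative Gain at k
        Equation (1) and (2) from the paper

        Note: the ideal ranking is built from the retrieved list only, so the
        result is an upper bound on nDCG computed against all judged documents.
        """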
        if not relevance_scores:
            return 0.0
            
        # Calculate DCG@k
        dcg = 0.0
        for i in range(min(k, len(relevance_scores))):
            dcg += (2**relevance_scores[i] - 1) / np.log2(i + 2)
        
        # Calculate IDCG@k (ideal ranking)
        ideal_scores = sorted(relevance_scores, reverse=True)
        idcg = 0.0
        for i in range(min(k, len(ideal_scores))):
            idcg += (2**ideal_scores[i] - 1) / np.log2(i + 2)
        
        # Return nDCG
        return dcg / idcg if idcg > 0 else 0.0
    
    @staticmethod
    def calculate_mrr_at_k(relevance_scores: List[int], k: int = 3) -> float:
        """
        Calculate the reciprocal rank at k for a single query
        (averaging over queries in evaluate_encoder gives MRR)
        Equation (3) from the paper
        """
        # Find first very relevant document (score = 2)
        for i in range(min(k, len(relevance_scores))):
            if relevance_scores[i] == 2:
                return 1.0 / (i + 1)
        return 0.0
    
    @staticmethod
    def calculate_map_at_k(relevance_scores: List[int], k: int = 3) -> float:
        """
        Calculate Average Precision at k for a single query
        (averaging over queries in evaluate_encoder gives mAP)
        Equations (4) and (5) from the paper
        """
        # Relevance is binarized (score > 0) so the result stays in [0, 1];
        # weighting precision by the graded score could push it above 1.
        num_relevant = sum(1 for score in relevance_scores[:k] if score > 0)
        if num_relevant == 0:
            return 0.0
        
        ap = 0.0
        relevant_count = 0
        
        for i in range(min(k, len(relevance_scores))):
            if relevance_scores[i] > 0:
                relevant_count += 1
                precision_at_i = relevant_count / (i + 1)
                ap += precision_at_i
        
        return ap / num_relevant

### 3.3 Run Evaluation

In [None]:
def evaluate_encoder(encoder_config: Dict, documents: List[Document], 
                    queries: List[SearchQuery], k: int = 3) -> Dict[str, float]:
    """
    Evaluate a single encoder on the dataset
    """
    # Initialize evaluator
    evaluator = SemanticSearchEvaluator(encoder_config)
    evaluator.index_documents(documents)
    
    # Calculate metrics
    metrics_calc = MetricsCalculator()
    ndcg_scores = []
    mrr_scores = []
    map_scores = []
    
    for query in tqdm(queries, desc=f"Evaluating {encoder_config['description']}"):
        # Perform search
        search_results = evaluator.search(query.query_text, k=k)
        
        # Get relevance scores for retrieved documents
        relevance_scores = []
        for doc_id, _ in search_results:
            relevance_scores.append(query.document_relevance.get(doc_id, 0))
        
        # Calculate metrics
        ndcg_scores.append(metrics_calc.calculate_ndcg_at_k(relevance_scores, k))
        mrr_scores.append(metrics_calc.calculate_mrr_at_k(relevance_scores, k))
        map_scores.append(metrics_calc.calculate_map_at_k(relevance_scores, k))
    
    return {
        "NDCG@3": np.mean(ndcg_scores),
        "MRR@3": np.mean(mrr_scores),
        "mAP@3": np.mean(map_scores),
        "Emb_Size": encoder_config["embedding_size"]
    }

In [None]:
# Evaluate all encoders
results = {}

# Note: In practice, you would use the actual encoders.
# For demonstration, we'll show how to evaluate one encoder
print("Evaluating Semantic Search Encoders...")
print("Note: Full evaluation requires downloading all 5 encoder models.")
print("\nDemonstrating with Encoder #1...")

# Evaluate first encoder as example
encoder_1_results = evaluate_encoder(
    ENCODERS["encoder_1"], 
    documents, 
    queries
)

print(f"\nEncoder #1 Results:")
for metric, value in encoder_1_results.items():
    print(f"{metric}: {value:.3f}")

## 4. RAG Pipeline Implementation with LangChain

### 4.1 RAG Dataset for Arabic FAQ

In [None]:
@dataclass
class FAQ:
    """Represents a FAQ with question and answer"""
    faq_id: str
    question: str
    answer: str
    domain: str

def generate_arabic_faq_dataset() -> List[FAQ]:
    """
    Generate synthetic Arabic FAQ dataset.
    Paper mentions 816 FAQs from 4 domains.
    """
    faqs = [
        FAQ("faq_1", "كيف يمكنني إعادة تعيين كلمة المرور؟", 
            "يمكنك إعادة تعيين كلمة المرور من خلال النقر على 'نسيت كلمة المرور' في صفحة تسجيل الدخول",
            "account"),
        FAQ("faq_2", "ما هي سياسة الإرجاع؟",
            "يمكنك إرجاع المنتجات خلال 30 يومًا من تاريخ الشراء مع الاحتفاظ بالفاتورة",
            "policy"),
        FAQ("faq_3", "كيف أتتبع طلبي؟",
            "يمكنك تتبع طلبك من خلال رقم التتبع المرسل إلى بريدك الإلكتروني",
            "shipping"),
        FAQ("faq_4", "ما هي طرق الدفع المتاحة؟",
            "نقبل البطاقات الائتمانية، PayPal، والدفع عند الاستلام",
            "payment")
    ]
    return faqs

faqs = generate_arabic_faq_dataset()

### 4.2 LangChain RAG Implementation

In [None]:
class ArabicRAGPipeline:
    """RAG Pipeline for Arabic as described in Section 3.4"""
    
    def __init__(self, encoder_name: str, faqs: List[FAQ]):
        # Initialize embeddings
        self.embeddings = HuggingFaceEmbeddings(
            model_name=encoder_name,
            encode_kwargs={'normalize_embeddings': True}
        )
        
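        # Build one text per FAQ. Raw texts and metadata are passed to FAISS
        # because the Document dataclass from Section 2 shadows LangChain's
        # Document class, so Document(page_content=...) would fail here.
        self.texts = [
            f"السؤال: {faq.question}\nالإجابة: {faq.answer}"
            for faq in faqs
        ]
        self.metadatas = [
            {"faq_id": faq.faq_id, "domain": faq.domain}
            for faq in faqs
        ]
        
        # Create vector store
        self.vector_store = FAISS.from_texts(
            texts=self.texts,
            embedding=self.embeddings,
            metadatas=self.metadatas
        )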
        
        # Initialize LLM (GPT-3.5-turbo as mentioned in paper)
        self.llm = ChatOpenAI(
            model="gpt-3.5-turbo",
            temperature=0
        )
        
        # Create retrieval chain
        self.qa_chain = self._create_qa_chain()
    
    def _create_qa_chain(self):
        """Create the QA chain with custom Arabic prompt"""
        prompt_template = """أنت مساعد ذكي يجيب على الأسئلة باللغة العربية.
        استخدم المعلومات التالية للإجابة على السؤال. إذا لم تجد الإجابة في المعلومات المتاحة، قل "لا أعرف".
        
        المعلومات المتاحة:
        {context}
        
        السؤال: {question}
        الإجابة:"""
        
        PROMPT = PromptTemplate(
            template=prompt_template,
            input_variables=["context", "question"]
        )
        
        return RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",
            retriever=self.vector_store.as_retriever(search_kwargs={"k": 3}),
            chain_type_kwargs={"prompt": PROMPT}
        )
    
    def answer_question(self, question: str) -> str:
        """Answer a question using RAG pipeline"""
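        # RetrievalQA takes its input under "query" and returns the answer under "result"
        return self.qa_chain.invoke({"query": question})["result"]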
    
    def get_retrieved_docs(self, question: str, k: int = 3) -> List[Document]:
        """Get retrieved documents for analysis"""
        return self.vector_store.similarity_search(question, k=k)
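
To make the "stuff" strategy concrete: the retrieved passages are concatenated into a single context string and substituted into the prompt template. The sketch below mimics that step in plain Python under the assumption of a blank-line separator; `build_stuffed_prompt` is a hypothetical helper, not a LangChain function:

```python
def build_stuffed_prompt(template, retrieved_texts, question, separator="\n\n"):
    """Join retrieved passages into one context block and fill the template."""
    context = separator.join(retrieved_texts)
    return template.format(context=context, question=question)

# Shortened version of the Arabic prompt used in the pipeline.
template = "المعلومات المتاحة:\n{context}\n\nالسؤال: {question}\nالإجابة:"

# Two passages standing in for the top results of the retriever.
retrieved = [
    "السؤال: ما هي سياسة الإرجاع؟\nالإجابة: يمكنك إرجاع المنتجات خلال 30 يومًا",
    "السؤال: كيف أتتبع طلبي؟\nالإجابة: من خلال رقم التتبع",
]

prompt = build_stuffed_prompt(template, retrieved, "هل يمكنني إرجاع المنتج؟")
print(prompt)
```

Because every retrieved passage ends up in the prompt, the value of `k` directly controls prompt length, which is one reason a more precise retriever can reduce token usage.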

### 4.3 RAG Evaluation

In [None]:
def evaluate_rag_accuracy(rag_pipeline: ArabicRAGPipeline, test_questions: List[Dict]) -> float:
    """
    Evaluate RAG accuracy as described in Section 3.4.2
    Uses GPT-4 to check if generated answer matches ground truth
    """
    correct_answers = 0
    
    # Initialize evaluator LLM (GPT-4 as mentioned in paper)
    evaluator_llm = ChatOpenAI(model="gpt-4", temperature=0)
    
    for test_q in test_questions:
        # Generate answer using RAG
        generated_answer = rag_pipeline.answer_question(test_q["question"])
        
        # Evaluate using GPT-4
        eval_prompt = f"""قارن بين الإجابتين التاليتين:
        
        السؤال: {test_q['question']}
        الإجابة الصحيحة: {test_q['ground_truth']}
        الإجابة المولدة: {generated_answer}
        
        هل الإجابة المولدة تحتوي على نفس المعلومات الأساسية؟ أجب بنعم أو لا فقط."""
        
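        # .invoke() returns a chat message; the reply text is in .content
        evaluation = evaluator_llm.invoke(eval_prompt).content
        
        if "نعم" in evaluation:
            correct_answers += 1
    
    # Guard against an empty test set
    return correct_answers / len(test_questions) if test_questions else 0.0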
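
A substring test for "نعم" can misfire when the judge replies "no" but mentions the word "yes" later in its explanation. One possible refinement, sketched here with hypothetical helper names (`parse_verdict`, `accuracy`), is to look only at the first word of the reply:

```python
import re

def parse_verdict(judge_reply: str) -> bool:
    """Return True only when the first word of the reply is 'نعم' (yes)."""
    words = re.findall(r"\w+", judge_reply)
    return bool(words) and words[0] == "نعم"

def accuracy(verdicts):
    """Fraction of positive verdicts; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Made-up judge replies for illustration.
replies = [
    "نعم",
    "نعم، تحتوي على نفس المعلومات",
    "لا، لم تذكر نعم أو لا بوضوح",  # starts with "no" but contains "yes"
    "",
]
verdicts = [parse_verdict(reply) for reply in replies]
print(verdicts, accuracy(verdicts))
```

This still assumes the judge follows the instruction to answer with a single yes or no; replies in another format would all count as incorrect.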

In [None]:
# Demo RAG pipeline
print("Demonstrating Arabic RAG Pipeline...")
print("Note: Full implementation requires OpenAI API key for LLM calls\n")

# Create RAG pipeline with best encoder (Encoder #3 from paper)
# rag_pipeline = ArabicRAGPipeline(
#     encoder_name=ENCODERS["encoder_3"]["model_name"],
#     faqs=faqs
# )

# Example query
# test_question = "كيف أستطيع تغيير كلمة السر الخاصة بي؟"
# answer = rag_pipeline.answer_question(test_question)
# print(f"Question: {test_question}")
# print(f"Answer: {answer}")

## 5. Results Visualization & Analysis

In [None]:
# Results from the paper (Table 1)
paper_results = pd.DataFrame({
    'Model': ['Encoder #1', 'Encoder #2', 'Encoder #3', 'Encoder #4', 'Encoder #5'],
    'NDCG@3': [0.853, 0.789, 0.879, 0.868, 0.837],
    'MRR@3': [0.888, 0.798, 0.911, 0.890, 0.848],
    'mAP@3': [0.863, 0.793, 0.888, 0.876, 0.854],
    'Emb_Size': [384, 768, 768, 512, 768]
})

# Visualize results
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Plot 1: Metrics comparison
metrics = ['NDCG@3', 'MRR@3', 'mAP@3']
x = np.arange(len(paper_results))
width = 0.25

for i, metric in enumerate(metrics):
    axes[0, 0].bar(x + i*width, paper_results[metric], width, label=metric)

axes[0, 0].set_xlabel('Encoder')
axes[0, 0].set_ylabel('Score')
axes[0, 0].set_title('Semantic Search Metrics Comparison')
axes[0, 0].set_xticks(x + width)
axes[0, 0].set_xticklabels(paper_results['Model'])
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Plot 2: Embedding size vs performance
axes[0, 1].scatter(paper_results['Emb_Size'], paper_results['NDCG@3'], s=100, alpha=0.6)
for i, model in enumerate(paper_results['Model']):
    axes[0, 1].annotate(model, (paper_results['Emb_Size'][i], paper_results['NDCG@3'][i]))
axes[0, 1].set_xlabel('Embedding Size')
axes[0, 1].set_ylabel('NDCG@3')
axes[0, 1].set_title('Embedding Size vs Performance')
axes[0, 1].grid(True, alpha=0.3)

# Plot 3: RAG results (Table 2)
rag_results = pd.DataFrame({
    'Encoder': ['Encoder #1', 'Encoder #2', 'Encoder #3', 'Encoder #4', 'Encoder #5'],
    'Top_3_Accuracy': [59.31, 62.01, 63.11, 62.5, 57.84],
    'Top_1_Accuracy': [61.15, 63.23, 63.84, 63.24, np.nan]
})

axes[1, 0].bar(rag_results['Encoder'], rag_results['Top_3_Accuracy'], alpha=0.7, label='Top 3')
axes[1, 0].bar(rag_results['Encoder'], rag_results['Top_1_Accuracy'], alpha=0.7, label='Top 1')
axes[1, 0].set_ylabel('Accuracy (%)')
axes[1, 0].set_title('RAG Accuracy with Different Encoders')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Plot 4: Correlation between Semantic Search and RAG
axes[1, 1].scatter(paper_results['NDCG@3'], rag_results['Top_3_Accuracy'], s=100)
for i, model in enumerate(paper_results['Model']):
    axes[1, 1].annotate(model, 
                       (paper_results['NDCG@3'][i], rag_results['Top_3_Accuracy'][i]),
                       fontsize=8)
axes[1, 1].set_xlabel('NDCG@3 (Semantic Search)')
axes[1, 1].set_ylabel('RAG Top-3 Accuracy (%)')
axes[1, 1].set_title('Correlation: Semantic Search vs RAG Performance')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
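
The fourth panel shows the relationship visually; it can also be quantified. A small sketch that computes Pearson and Spearman coefficients from the same Table 1 and Table 2 figures used above. With only five encoders these numbers are indicative at best, and the `ranks` helper assumes there are no tied values:

```python
import numpy as np

# Values copied from the paper_results and rag_results tables above.
ndcg_at_3 = np.array([0.853, 0.789, 0.879, 0.868, 0.837])
rag_top3 = np.array([59.31, 62.01, 63.11, 62.5, 57.84])

def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = np.argsort(values)
    result = np.empty(len(values))
    result[order] = np.arange(1, len(values) + 1)
    return result

# Pearson: linear association between the raw scores.
pearson_r = float(np.corrcoef(ndcg_at_3, rag_top3)[0, 1])
# Spearman: Pearson correlation of the ranks, i.e. agreement in ordering.
spearman_rho = float(np.corrcoef(ranks(ndcg_at_3), ranks(rag_top3))[0, 1])

print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

A gap between the two coefficients would suggest that the encoders are ordered similarly on both tasks even where the raw scores do not move together linearly.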

## 6. Key Findings & Insights

### Paper's Main Conclusions:

1. **Best Encoder**: Encoder #3 (paraphrase-multilingual-mpnet-base-v2) achieved best performance:
   - NDCG@3: 0.879
   - MRR@3: 0.911
   - mAP@3: 0.888

2. **Asymmetric vs Symmetric Search**: 
   - Encoder #1 performed well in asymmetric search (long docs, short queries)
   - Encoder #2 better for symmetric search (similar length texts)

3. **RAG Integration Benefits**:
   - Shorter prompts (fewer tokens)
   - More precise outcomes
   - Cost-effective inference

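4. **Arabic Challenges**: In Table 1 the best encoder (#3) uses 768-dimensional embeddings, but embedding size alone does not predict quality: the 384- and 512-dimensional encoders (#1, #4) score higher than the other two 768-dimensional encoders (#2, #5)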
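
The embedding-size observation can be checked directly against the Table 1 figures reproduced in Section 5. A short sketch that averages NDCG@3 per embedding size, using only the values already listed in this notebook:

```python
from collections import defaultdict

# (encoder, embedding size, NDCG@3) as listed in Table 1 above.
table_1 = [
    ("Encoder #1", 384, 0.853),
    ("Encoder #2", 768, 0.789),
    ("Encoder #3", 768, 0.879),
    ("Encoder #4", 512, 0.868),
    ("Encoder #5", 768, 0.837),
]

# Group the scores by embedding size.
by_size = defaultdict(list)
for _, size, ndcg in table_1:
    by_size[size].append(ndcg)

mean_ndcg = {size: sum(v) / len(v) for size, v in sorted(by_size.items())}
best_encoder = max(table_1, key=lambda row: row[2])

for size, value in mean_ndcg.items():
    print(f"{size}-dim: mean NDCG@3 = {value:.3f} (n={len(by_size[size])})")
print("Best single encoder:", best_encoder[0])
```

On these five data points the 768-dimensional group contains the best single encoder yet has the lowest group mean, so size on its own is a weak predictor here.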

## 7. Template for Personal Research

Use this section to extend the research with your own experiments:

In [None]:
# Template for testing new encoders
def test_custom_encoder(encoder_name: str, documents: List[Document], queries: List[SearchQuery]):
    """
    Template function to test your own encoder
    """
    custom_config = {
        "model_name": encoder_name,
        "embedding_size": 768,  # Update based on your model
        "description": "Custom Encoder"
    }
    
    results = evaluate_encoder(custom_config, documents, queries)
    return results

# Example: Test a new Arabic-specific encoder
# custom_results = test_custom_encoder("aubmindlab/bert-base-arabertv2", documents, queries)

In [None]:
# Template for custom RAG experiments
class CustomRAGExperiment:
    """
    Extend this class for your own RAG experiments
    """
    def __init__(self):
        # Add your initialization
        pass
    
    def experiment_1_hybrid_search(self):
        """
        Experiment: Combine semantic search with keyword search
        """
        # Your implementation
        pass
    
    def experiment_2_reranking(self):
        """
        Experiment: Add reranking layer after retrieval
        """
        # Your implementation
        pass
    
    def experiment_3_cross_lingual(self):
        """
        Experiment: Test cross-lingual retrieval (Arabic query, English docs)
        """
        # Your implementation
        pass
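
As a possible starting point for the hybrid search experiment, one widely used way to merge a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF). This is not a method from the paper; the function name, the constant `k=60`, and the document IDs below are illustrative assumptions:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc_ids; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical outputs of a semantic retriever and a keyword retriever.
semantic_ranking = ["doc_3", "doc_1", "doc_7"]
keyword_ranking = ["doc_1", "doc_9", "doc_3"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

RRF only needs ranks, not comparable scores, which sidesteps the problem of mixing L2 distances with keyword-match scores.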

## References

1. Mahboub, A., Za'ter, M. E., Al-Rfooh, B., Estaitia, Y., Jaljuli, A., & Hakouz, A. (2024). Evaluation of Semantic Search and its Role in Retrieved-Augmented-Generation (RAG) for Arabic Language. arXiv preprint arXiv:2403.18350v2.

2. LangChain Documentation: https://python.langchain.com/

3. Sentence Transformers: https://www.sbert.net/