# Evaluating the Quality of a RAG Pipeline

When developing **Retrieval-Augmented Generation (RAG)** pipelines, it is essential to ensure that the system performs reliably across two critical dimensions:

1. **Retrieval Quality**

   * The pipeline must accurately identify and retrieve all documents relevant to the user query.
   * Retrieved documents should be **highly relevant**, minimizing noise and irrelevant information that could mislead the generation stage.
   * Evaluation metrics may include recall, precision, and relevance scoring based on domain-specific criteria.

2. **Generation Accuracy**

   * The content generated by the RAG model must faithfully reflect the information contained in the retrieved documents.
   * **No hallucinations** or unsupported statements should be present; the generated output should be verifiable against the source documents.
   * Consider evaluating fluency, factual consistency, and adherence to context in addition to relevance.

To evaluate the RAG pipeline, we use the sample documents in `tests/samples` and metrics inspired by the RAGAS framework.
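As a minimal sketch of the retrieval metrics named above, precision and recall can be computed per query from sets of chunk identifiers. The function name `precision_recall` and the identifiers `c1`, `c2`, ... are hypothetical; the evaluator built later in this notebook uses LLM-judged variants rather than labelled IDs.

```python
def precision_recall(retrieved_ids, relevant_ids):
    """Set-based precision and recall for a single query."""
    retrieved = set(retrieved_ids)
    relevant = set(relevant_ids)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved items that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant items that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 5 chunks retrieved, 2 of the 3 relevant ones among them.
p, r = precision_recall(["c1", "c2", "c3", "c4", "c5"], ["c2", "c5", "c9"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.40 recall=0.67
```

High recall with low precision means the generator sees the right material buried in noise; the reverse means the context is clean but incomplete.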

In [14]:
# setting up the python path to use internal modules here
import sys
import os

sys.path.append(os.path.abspath(".."))

In [15]:
# getting env variables
elastic_url = os.environ["ELASTIC_SEARCH_URL"]
elastic_api_key = os.environ["ELASTIC_SEARCH_API_KEY"]

## Creating the evaluation pipeline

### Generating the evaluation index

First, we set up the environment the RAG pipeline needs, starting with the evaluation index.

In [None]:
from app.pipeline.extract import PdfReader
from app.pipeline.index import ElasticVectorManager

# --- config ---
user_id = "eval_user"
session_id = "eval_session"
index_name = "default-evaluation-index"

pdf_paths = [
    "../tests/samples/LB5001.pdf",
    "../tests/samples/MN414_0224.pdf",
    "../tests/samples/WEG-CESTARI-manual-iom-guia-consulta-rapida-50111652-pt-en-es-web.pdf",
    "../tests/samples/WEG-motores-eletricos-guia-de-especificacao-50032749-brochure-portuguese-web.pdf",
]

# --- init vector DB ---
vector_database = ElasticVectorManager(
    elastic_url=elastic_url,
    api_key=elastic_api_key,
    index_name=index_name,
)

total_docs = 0
total_chunks = 0
all_docs = []

reader = PdfReader()

for pdf in pdf_paths:
    print(f"Reading {pdf} ...")
    docs = reader.read(pdf)

    # attach metadata
    for doc in docs:
        doc.user_id = user_id
        doc.session_id = session_id

    all_docs.extend(docs)
    total_docs += 1
    total_chunks += len(docs)

print(f"\nFinished reading {total_docs} documents -> {total_chunks} chunks")

# --- index documents ---
try:
    vector_database.index_documents(all_docs)
    print(f"Indexed into [{index_name}] with {len(all_docs)} chunks")
except Exception as e:
    print(f"Failed to index documents: {e}")
    vector_database.es.indices.delete(index=index_name, ignore=[400, 404])

[2025-08-17 19:20:44] [app.pipeline.index] [index.__init__:42] [INFO] Connected to Elasticsearch at https://my-elasticsearch-project-bfcdc2.es.us-central1.gcp.elastic.cloud:443
[2025-08-17 19:20:45] [app.pipeline.index] [index.__init__:46] [INFO] Index 'default-evaluation-index' did not exist. Creating index...
[2025-08-17 19:20:47] [app.pipeline.index] [index._create_index:128] [INFO] Successfully created index 'default-evaluation-index'
Reading ../tests/samples/LB5001.pdf ...
[2025-08-17 19:20:47] [app.pipeline.extract] [extract.read:27] [INFO] Starting PDF read | user_id=unknown_user | doc_id=d54c6707-4d34-4611-8aec-a93c8cedca3d | chunk_size=300 | chunk_overlap=50
[2025-08-17 19:20:47] [app.pipeline.extract] [extract.read:34] [DEBUG] PDF source type: path | path=../tests/samples/LB5001.pdf
[2025-08-17 19:20:47] [app.pipeline.extract] [extract.read:52] [INFO] Opened PDF successfully | title=LB5001.pdf | pages=2
[2025-08-17 19:20:47] [app.pipeline.extract] [extract.read:55] [DEBUG] Pr
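The extraction log above reports `chunk_size=300` and `chunk_overlap=50`. The following is a sketch of what overlapping chunking with those two parameters could look like. It is not the `PdfReader` implementation: the function `chunk_text` is hypothetical, and it assumes sizes are counted in characters, which the log does not confirm.

```python
def chunk_text(text, chunk_size=300, chunk_overlap=50):
    """Split text into windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "x" * 700
print([len(c) for c in chunk_text(sample)])  # [300, 300, 200]
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of indexing some text twice.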

### Generating RAGAS-inspired metrics

Here, we will evaluate the RAG pipeline based on the following criteria:

* Retrieval Evaluation:
    - Context Recall
    - Context Relevance
* Generation Evaluation:
    - Faithfulness
    - Response Relevancy

Let's create the complete pipeline for evaluating the RAG system.

In [42]:
import json
import time
import random
import pandas as pd
from typing import List, Dict, Optional, Tuple
from dataclasses import dataclass, asdict
from google import genai
from google.genai import types
import numpy as np
from collections import Counter
import re

# Configure Google client (make sure your API key is set)
google_client = genai.Client()

@dataclass
class EvaluationResult:
    """Data class to store evaluation results"""
    question: str
    ground_truth: str
    retrieved_contexts: List[str]
    generated_answer: str
    document_recall: float
    context_relevance: float
    faithfulness: float
    response_relevance: float
    overall_score: float

class RAGEvaluator:
    """
    Comprehensive RAG evaluation system based on RAGAS metrics
    """
    
    def __init__(self, 
                 model: str = "gemini-2.0-flash-exp", 
                 max_retries: int = 10,
                 delay_between_calls: float = 1.0):
        self.model = model
        self.max_retries = max_retries
        self.delay_between_calls = delay_between_calls
    
    def _make_llm_call(self, prompt: str, temperature: float = 0.1) -> str:
        """Make an LLM call with retry logic and rate limiting"""
        for attempt in range(self.max_retries):
            try:
                response = google_client.models.generate_content(
                    model=self.model,
                    contents=[types.Content(
                        role="user",
                        parts=[types.Part.from_text(text=prompt)]
                    )],
                    config=types.GenerateContentConfig(
                        temperature=temperature,
                        max_output_tokens=2048
                    )
                )
                time.sleep(self.delay_between_calls)  # Rate limiting
                return response.text.strip()
            except Exception as e:
                if attempt < self.max_retries - 1:
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    print(f"API call failed (attempt {attempt + 1}/{self.max_retries}): {e}")
                    print(f"Retrying in {wait_time:.2f} seconds...")
                    time.sleep(wait_time)
                else:
                    print(f"Max retries reached. Final error: {e}")
                    raise
    
    def evaluate_document_recall(self, ground_truth: str, retrieved_contexts: List[str]) -> float:
        """
        Evaluate how well the retrieved contexts cover the ground truth information.
        Extracts key facts from the ground truth with the LLM, then asks the LLM
        whether each fact is covered by the retrieved contexts.
        """
        if not retrieved_contexts:
            return 0.0
        
        # Extract key facts from ground truth
        fact_extraction_prompt = f"""
        Extract the key facts and information from the following ground truth answer. 
        List each fact as a separate bullet point.
        
        Ground Truth: {ground_truth}
        
        Key Facts:
        """
        
        try:
            facts_response = self._make_llm_call(fact_extraction_prompt)
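            # Parentheses matter: without them, `and` binds tighter than `or`
            # and blank lines containing '-' would pass the filter.
            facts = [fact.strip() for fact in facts_response.split('\n')
                     if fact.strip() and ('•' in fact or '-' in fact)]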
            
            if not facts:
                return 0.0
            
            # Check coverage of each fact in retrieved contexts
            covered_facts = 0
            contexts_text = " ".join(retrieved_contexts)
            
            for fact in facts:
                coverage_prompt = f"""
                Does the following context contain information that covers this fact?
                
                Fact: {fact}
                
                Context: {contexts_text}
                
                Answer only 'YES' or 'NO'.
                """
                
                coverage_response = self._make_llm_call(coverage_prompt)
                if 'YES' in coverage_response.upper():
                    covered_facts += 1
            
            return covered_facts / len(facts) if facts else 0.0
            
        except Exception as e:
            print(f"Error in document recall evaluation: {e}")
            return 0.0
    
    def evaluate_context_relevance(self, question: str, retrieved_contexts: List[str]) -> float:
        """
        Evaluate how relevant the retrieved contexts are to the question.
        """
        if not retrieved_contexts:
            return 0.0
        
        relevant_contexts = 0
        
        for context in retrieved_contexts:
            relevance_prompt = f"""
            Rate the relevance of the following context to the given question on a scale of 0-1.
            
            Question: {question}
            
            Context: {context}
            
            Consider:
            - Does the context contain information that could help answer the question?
            - Is the context topically related to the question?
            - Would this context be useful for generating a response?
            
            Return only a number between 0 and 1 (e.g., 0.8, 0.3, 1.0).
            """
            
            try:
                relevance_response = self._make_llm_call(relevance_prompt)
                # Extract numeric score
                score_match = re.search(r'(\d+\.?\d*)', relevance_response)
                if score_match:
                    score = float(score_match.group(1))
                    # Ensure score is between 0 and 1
                    score = min(max(score, 0.0), 1.0)
                    relevant_contexts += score
                else:
                    print(f"Could not parse relevance score: {relevance_response}")
            except Exception as e:
                print(f"Error evaluating context relevance: {e}")
        
        return relevant_contexts / len(retrieved_contexts) if retrieved_contexts else 0.0
    
    def evaluate_faithfulness(self, generated_answer: str, retrieved_contexts: List[str]) -> float:
        """
        Evaluate how faithful the generated answer is to the retrieved contexts.
        Checks if the answer contains information not supported by the contexts.
        """
        if not retrieved_contexts or not generated_answer:
            return 0.0
        
        # Extract claims from the generated answer
        claim_extraction_prompt = f"""
        Extract all factual claims from the following generated answer. 
        List each claim as a separate bullet point.
        
        Generated Answer: {generated_answer}
        
        Factual Claims:
        """
        
        try:
            claims_response = self._make_llm_call(claim_extraction_prompt)
            claims = [claim.strip() for claim in claims_response.split('\n') 
                     if claim.strip() and ('•' in claim or '-' in claim)]
            
            if not claims:
                return 1.0  # If no claims extracted, assume faithful
            
            # Check if each claim is supported by the contexts
            supported_claims = 0
            contexts_text = " ".join(retrieved_contexts)
            
            for claim in claims:
                support_prompt = f"""
                Is the following claim supported by or consistent with the provided contexts?
                
                Claim: {claim}
                
                Contexts: {contexts_text}
                
                Answer 'YES' if the claim is supported/consistent, 'NO' if it contradicts or is not supported.
                """
                
                support_response = self._make_llm_call(support_prompt)
                if 'YES' in support_response.upper():
                    supported_claims += 1
            
            return supported_claims / len(claims) if claims else 1.0
            
        except Exception as e:
            print(f"Error in faithfulness evaluation: {e}")
            return 0.0
    
    def evaluate_response_relevance(self, user_input: str, generated_answer: str, num_questions: int = 3) -> float:
        """
        Evaluate response relevancy using the RAGAS methodology:
        1. Generate artificial questions based on the response
        2. Compute cosine similarity between user input and each generated question
        3. Take the average of cosine similarity scores
        """
        if not generated_answer:
            return 0.0
        
        try:
            # Step 1: Generate artificial questions based on the response
            question_generation_prompt = f"""
            Based on the following response, generate {num_questions} questions that this response would appropriately answer.
            The questions should reflect the main content and intent of the response.
            
            Response: {generated_answer}
            
            Generate exactly {num_questions} questions, one per line, without numbering or bullets:
            """
            
            questions_response = self._make_llm_call(question_generation_prompt)
            # Keep every non-empty line; numbering and bullets are stripped in the cleanup below
            # (filtering them out here would silently discard numbered questions).
            generated_questions = [q.strip() for q in questions_response.split('\n') if q.strip()]
            
            # Clean up questions - remove numbering if present
            cleaned_questions = []
            for q in generated_questions:
                # Remove common prefixes
                q = re.sub(r'^\d+\.\s*', '', q)  # Remove "1. ", "2. ", etc.
                q = re.sub(r'^[-•]\s*', '', q)   # Remove "- " or "• "
                if q.strip():
                    cleaned_questions.append(q.strip())
            
            if len(cleaned_questions) < num_questions:
                # If we didn't get enough questions, pad the list
                while len(cleaned_questions) < num_questions:
                    cleaned_questions.append(generated_answer[:100] + "...")
            
            # Take only the requested number of questions
            generated_questions = cleaned_questions[:num_questions]
            
            # Step 2: Get embeddings for user input and generated questions
            user_embedding = self._get_embedding(user_input)
            question_embeddings = [self._get_embedding(q) for q in generated_questions]
            
            # Step 3: Compute cosine similarities
            similarities = []
            for q_embedding in question_embeddings:
                similarity = self._cosine_similarity(user_embedding, q_embedding)
                similarities.append(similarity)
            
            # Step 4: Return average similarity
            return np.mean(similarities) if similarities else 0.0
            
        except Exception as e:
            print(f"Error in response relevance evaluation: {e}")
            return 0.0
    
    def _get_embedding(self, text: str) -> List[float]:
        """Generate embedding for text using Google GenAI"""
        try:
            response = google_client.models.embed_content(
                model="gemini-embedding-001",
                contents=text,
                config=types.EmbedContentConfig(
                    task_type="SEMANTIC_SIMILARITY",
                    output_dimensionality=768,
                )
            )
            return response.embeddings[0].values
        except Exception as e:
            print(f"Error generating embedding: {e}")
            # Return zero vector as fallback
            return [0.0] * 768
    
    def _cosine_similarity(self, vec1: List[float], vec2: List[float]) -> float:
        """Calculate cosine similarity between two vectors"""
        try:
            vec1 = np.array(vec1)
            vec2 = np.array(vec2)
            
            dot_product = np.dot(vec1, vec2)
            norm1 = np.linalg.norm(vec1)
            norm2 = np.linalg.norm(vec2)
            
            if norm1 == 0 or norm2 == 0:
                return 0.0
            
            return dot_product / (norm1 * norm2)
        except Exception as e:
            print(f"Error calculating cosine similarity: {e}")
            return 0.0
    
    def evaluate_single_case(self, 
                           question: str, 
                           ground_truth: str,
                           retrieved_contexts: List[str], 
                           generated_answer: str) -> EvaluationResult:
        """
        Evaluate a single RAG case across all metrics
        """
        print(f"Evaluating question: {question[:100]}...")
        
        # Evaluate all metrics
        doc_recall = self.evaluate_document_recall(ground_truth, retrieved_contexts)
        context_rel = self.evaluate_context_relevance(question, retrieved_contexts)
        faithfulness = self.evaluate_faithfulness(generated_answer, retrieved_contexts)
        response_rel = self.evaluate_response_relevance(question, generated_answer)
        
        # Calculate overall score (weighted average)
        overall_score = (doc_recall * 0.25 + context_rel * 0.25 + 
                        faithfulness * 0.25 + response_rel * 0.25)
        
        return EvaluationResult(
            question=question,
            ground_truth=ground_truth,
            retrieved_contexts=retrieved_contexts,
            generated_answer=generated_answer,
            document_recall=doc_recall,
            context_relevance=context_rel,
            faithfulness=faithfulness,
            response_relevance=response_rel,
            overall_score=overall_score
        )
    
    def evaluate_dataset(self, evaluation_data: List[Dict]) -> Tuple[List[EvaluationResult], Dict]:
        """
        Evaluate an entire dataset of RAG cases
        
        Args:
            evaluation_data: List of dicts with keys:
                - 'question': The user question
                - 'ground_truth': Expected/ideal answer
                - 'retrieved_contexts': List of retrieved context strings
                - 'generated_answer': The RAG system's generated answer
        
        Returns:
            Tuple of (detailed_results, summary_stats)
        """
        results = []
        
        for i, data in enumerate(evaluation_data):
            print(f"\n--- Evaluating case {i+1}/{len(evaluation_data)} ---")
            
            try:
                result = self.evaluate_single_case(
                    question=data['question'],
                    ground_truth=data['ground_truth'],
                    retrieved_contexts=data['retrieved_contexts'],
                    generated_answer=data['generated_answer']
                )
                results.append(result)
                
                print(f"Document Recall: {result.document_recall:.3f}")
                print(f"Context Relevance: {result.context_relevance:.3f}")
                print(f"Faithfulness: {result.faithfulness:.3f}")
                print(f"Response Relevance: {result.response_relevance:.3f}")
                print(f"Overall Score: {result.overall_score:.3f}")
                
            except Exception as e:
                print(f"Error evaluating case {i+1}: {e}")
                continue
        
        # Calculate summary statistics
        if results:
            summary_stats = {
                'total_cases': len(results),
                'avg_document_recall': np.mean([r.document_recall for r in results]),
                'avg_context_relevance': np.mean([r.context_relevance for r in results]),
                'avg_faithfulness': np.mean([r.faithfulness for r in results]),
                'avg_response_relevance': np.mean([r.response_relevance for r in results]),
                'avg_overall_score': np.mean([r.overall_score for r in results]),
                'std_overall_score': np.std([r.overall_score for r in results])
            }
        else:
            summary_stats = {}
        
        return results, summary_stats
    
    def export_results(self, results: List[EvaluationResult], filename: str = "rag_evaluation_results.json"):
        """Export evaluation results to JSON file"""
        results_dict = [asdict(result) for result in results]
        
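        # Explicit UTF-8: ensure_ascii=False writes accented characters (e.g. Portuguese) verbatim
        with open(filename, 'w', encoding='utf-8') as f: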
            json.dump(results_dict, f, indent=2, ensure_ascii=False)
        
        print(f"Results exported to {filename}")
    
    def create_results_dataframe(self, results: List[EvaluationResult]) -> pd.DataFrame:
        """Convert results to pandas DataFrame for analysis"""
        df_data = []
        
        for result in results:
            df_data.append({
                'question': result.question,
                'ground_truth': result.ground_truth,
                'generated_answer': result.generated_answer,
                'num_contexts': len(result.retrieved_contexts),
                'document_recall': result.document_recall,
                'context_relevance': result.context_relevance,
                'faithfulness': result.faithfulness,
                'response_relevance': result.response_relevance,
                'overall_score': result.overall_score
            })
        
        return pd.DataFrame(df_data)
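
To make the response relevance computation concrete, here is a standard-library sketch of its last two steps: cosine similarity between the user question and each generated question, then the average. The 2-D vectors are made up for illustration; real embeddings from `gemini-embedding-001` are requested with 768 dimensions in the evaluator above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction
    return dot / (norm_a * norm_b)


def response_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the question and each generated question."""
    sims = [cosine_similarity(question_vec, v) for v in generated_question_vecs]
    return sum(sims) / len(sims) if sims else 0.0


# Toy "embeddings": identical, orthogonal, and 45 degrees apart.
user_q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = response_relevance(user_q, generated)
print(round(score, 3))  # 0.569
```

Note that the evaluator's zero-vector fallback for failed embedding calls produces a similarity of 0, so an embedding error lowers this metric without raising an exception.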

### Helper Functions

The helper function below generates the evaluation data by running the test questions through the retriever and the RAG agent.

In [43]:
def create_sample_evaluation_data(rag_agent, retriever):
    """
    Create sample evaluation data using your actual RAG system.
    This generates realistic test cases by running questions through your RAG pipeline.
    
    Args:
        rag_agent: Your RAGAgent instance
        retriever: Your ElasticRetriever instance
    """
    
    # Test questions and their ground truth answers
    test_cases = [
        {
            "question": "What is the procedure upon receiving the motor?",
            "ground_truth": """Inspect the motor for damage before accepting it. The Motor shaft should rotate freely with 
                no rubs. Report any damage immediately to the commercial carrier that delivered your 
                motor."""
        },
        {
            "question": "What is the procedure to install Three Phase Motors from Baldor Motors?",
            "ground_truth": """Installation Procedure

                This procedure outlines the steps for the safe and proper installation of a submersible pump motor, focusing on electrical and mechanical precautions and checks.

                General Instructions
                Prioritize Safety: Turn off and lock out all power before starting. Verify that the voltage at the motor starter connectors is zero.

                Handle with Care: Do not use force to drive the pump onto the motor shaft or to remove it, as this can cause damage.

                Connection and Verification Procedure
                Electrical Connection: Connect the motor power leads to the connectors in the motor starter. The motor's cable assembly has three power leads, two ground leads, two thermal leads, and two moisture sensing probe leads.

                Rotation Check (Three-Phase Motors ONLY):

                Preparation: Turn off the power, disconnect the motor shaft from the load, and remove any loose rotating parts.

                Rotation Test: Momentarily apply power and check the direction of rotation of the motor shaft.

                Correcting Rotation: If the shaft rotation is incorrect, turn off the power again and reverse any two of the three motor power leads at the motor starter. Restore power to verify the correct rotation.

                Additional Connections: Connect the two Thermal Protectors and the two Moisture Sensing Probes at the motor starter as shown in the manual's diagrams (Figure 2-2 and Figure 2-3).

                Assembly and Finalization
                Pump Mounting: Follow the pump manufacturer's instructions to mount the pump onto the motor shaft.

                Securing: Secure the pump case to the motor flange and attach the drain piping to the pump.

                Lowering: Use a spreader bar and lifting eyes to lower the motor/pump assembly to the proper depth. Ensure that the motor wires are not damaged during this process.

                Configuration: Set the control parameter values (if applicable) according to the motor nameplate values."""
        },
        {
            "question": "Quais são as temperaturas ambientais mínimas para o início de operação dos redutores e motorredutores?",
            "ground_truth": """Óleo Mineral CLP:

                Viscosidade ISO VG 220: Temperatura mínima de +2°C para lubrificação por imersão e +8°C para lubrificação forçada.

                Viscosidade ISO VG 320: Temperatura mínima de +7°C para lubrificação por imersão e +14°C para lubrificação forçada.

                Óleo Sintético CLP HC (PAO):

                Viscosidade ISO VG 220: Temperatura mínima de -5°C para lubrificação por imersão e +2°C para lubrificação forçada.

                Viscosidade ISO VG 320: Temperatura mínima de 0°C para lubrificação por imersão e +8°C para lubrificação forçada.

                Viscosidade ISO VG 460: Temperatura mínima de +6°C para lubrificação por imersão. A informação para lubrificação forçada não está disponível."""
        },
    ]
    
    evaluation_data = []
    
    print("Generating evaluation data using your RAG system...")
    
    for i, test_case in enumerate(test_cases):
        print(f"\nProcessing question {i+1}/{len(test_cases)}: {test_case['question'][:60]}...")
        
        try:
            # Step 1: Retrieve contexts using your retriever
            retrieved_docs = retriever.retrieve(test_case['question'], top_k=5)
            retrieved_contexts = [doc['text'] for doc in retrieved_docs]
            
            print(f"  Retrieved {len(retrieved_contexts)} contexts")
            
            # Step 2: Generate answer using your RAG agent
            rag_response = rag_agent.run(test_case['question'])
            
            # Extract the generated answer from your RAG response
            # Adjust this based on your RAGResponse schema structure
            if isinstance(rag_response, dict):
                generated_answer = rag_response.get('answer', '') or rag_response.get('response', '') or str(rag_response)
            else:
                generated_answer = str(rag_response)
            
            print(f"  Generated answer: {generated_answer[:100]}...")
            
            # Create evaluation data entry
            eval_entry = {
                'question': test_case['question'],
                'ground_truth': test_case['ground_truth'],
                'retrieved_contexts': retrieved_contexts,
                'generated_answer': generated_answer
            }
            
            evaluation_data.append(eval_entry)
            
        except Exception as e:
            print(f"  Error processing question {i+1}: {e}")
            # Add a fallback entry to keep the evaluation going
            evaluation_data.append({
                'question': test_case['question'],
                'ground_truth': test_case['ground_truth'],
                'retrieved_contexts': ["Error retrieving contexts"],
                'generated_answer': f"Error generating answer: {str(e)}"
            })
    
    print(f"\nGenerated {len(evaluation_data)} evaluation cases")
    return evaluation_data
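
Because the helper above appends a fallback entry when retrieval or generation fails, it can be useful to check entries before scoring them. The sketch below is one way to do that; `validate_entry` and `REQUIRED_KEYS` are hypothetical names, not part of the project.

```python
REQUIRED_KEYS = ("question", "ground_truth", "retrieved_contexts", "generated_answer")


def validate_entry(entry):
    """Return a list of problems found in one evaluation entry (empty if it looks fine)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    contexts = entry.get("retrieved_contexts", [])
    if not isinstance(contexts, list) or not all(isinstance(c, str) for c in contexts):
        problems.append("retrieved_contexts must be a list of strings")
    # The helper above uses this exact placeholder when a question fails.
    if contexts == ["Error retrieving contexts"]:
        problems.append("fallback entry: retrieval failed for this question")
    return problems


good = {
    "question": "What is the procedure upon receiving the motor?",
    "ground_truth": "Inspect the motor for damage before accepting it.",
    "retrieved_contexts": ["some retrieved chunk"],
    "generated_answer": "Inspect the motor before accepting it.",
}
print(validate_entry(good))  # []
```

Filtering out flagged entries before calling `evaluate_dataset` avoids spending LLM calls on cases whose scores would only reflect the error message.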

### Run Evaluation

In [44]:
from app.pipeline.retrieve import ElasticRetriever
from app.pipeline.generate import RAGAgent

def run_evaluation_example():
    """
    Example of how to run the RAG evaluation system with sample data
    """
    # Initialize evaluator
    evaluator = RAGEvaluator()

    retriever = ElasticRetriever(
        elastic_url=elastic_url,
        api_key=elastic_api_key, 
        index_name="default-evaluation-index"
    )

    rag_agent = RAGAgent(
        model="gemini-2.5-flash",
        retriever=retriever
    )
    
    # Create or load your evaluation data
    evaluation_data = create_sample_evaluation_data(rag_agent, retriever)
    
    # Run evaluation
    results, summary = evaluator.evaluate_dataset(evaluation_data)
    
    # Print summary
    print("\n" + "="*50)
    print("EVALUATION SUMMARY")
    print("="*50)
    for metric, value in summary.items():
        print(f"{metric}: {value:.4f}")
    
    # Create DataFrame for analysis
    df = evaluator.create_results_dataframe(results)
    
    # Export results
    evaluator.export_results(results)
    
    return results, summary, df
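
Once a summary is available, a quick way to decide where to focus is to find the lowest per-metric average. The sketch below assumes the key names produced by `evaluate_dataset`; the function `weakest_metric` is hypothetical and the numbers are invented for illustration, not results from this notebook.

```python
def weakest_metric(summary):
    """Return (name, value) of the lowest per-metric average, or None if unavailable."""
    metric_keys = [
        "avg_document_recall",
        "avg_context_relevance",
        "avg_faithfulness",
        "avg_response_relevance",
    ]
    # The summary is an empty dict when no case could be evaluated.
    present = {key: summary[key] for key in metric_keys if key in summary}
    if not present:
        return None
    name = min(present, key=present.get)
    return name, present[name]


# Invented values, for illustration only.
example_summary = {
    "total_cases": 3,
    "avg_document_recall": 0.80,
    "avg_context_relevance": 0.55,
    "avg_faithfulness": 0.90,
    "avg_response_relevance": 0.70,
}
print(weakest_metric(example_summary))  # ('avg_context_relevance', 0.55)
```

A weak retrieval metric points at chunking or search settings, while a weak generation metric points at the prompt or model used by the agent.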

In [None]:
results, summary, df = run_evaluation_example()
print(df)

[2025-08-17 21:18:36] [app.pipeline.retrieve] [retrieve.__init__:40] [INFO] Connected to Elasticsearch at https://my-elasticsearch-project-bfcdc2.es.us-central1.gcp.elastic.cloud:443
Generating evaluation data using your RAG system...

Processing question 1/3: What is the procedure upon receiving the motor?...
[2025-08-17 21:18:36] [app.pipeline.retrieve] [retrieve.retrieve:47] [INFO] Running vector search | Top-K: 5 | Query: What is the procedure upon receiving the motor?...


[2025-08-17 21:18:39] [app.pipeline.retrieve] [retrieve.retrieve:70] [INFO] Retrieved 5 results for query.
  Retrieved 5 contexts
[2025-08-17 21:18:39] [app.pipeline.generate] [generate.run:53] [INFO] Agent 'RAG Agent is searching for relevant documents'
[2025-08-17 21:18:39] [app.pipeline.retrieve] [retrieve.retrieve:47] [INFO] Running vector search | Top-K: 5 | Query: What is the procedure upon receiving the motor?...
[2025-08-17 21:18:39] [app.pipeline.retrieve] [retrieve.retrieve:70] [INFO] Retrieved 5 results for query.
[2025-08-17 21:18:39] [app.pipeline.generate] [generate.run:58] [INFO] Agent 'RAG Agent' found 5 relevant documents from 5 retrieved
[2025-08-17 21:18:39] [app.pipeline.generate] [generate.run:65] [DEBUG] Final formatted context from retrieved documents
[2025-08-17 21:18:39] [app.pipeline.generate] [generate.run:66] [DEBUG] 
==== FORMATTED CONTEXT START ====
Title: LB5001.pdf
Score: 0.71286416
Content: AC & DC Motor Installation & Maintenance Instructions Handling 