##### ScholarAgent: Quantitative Evaluation with RAGAS
**Objective:** Quantitatively measure the performance of our advanced RAG pipeline using the RAGAS framework, so that claims about retrieval and answer quality rest on reproducible metrics rather than on a qualitative demo.

In [1]:
import sys
import os
from dotenv import load_dotenv

# Add the project root to the Python path.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Load environment variables
load_dotenv()

True

#### 1. Imports and Setup

In [4]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

from src.rag_pipeline.core import create_rag_chain

if not os.getenv("GOOGLE_API_KEY"):
    print("WARNING: GOOGLE_API_KEY not found in the environment or .env file. RAGAS evaluation may fail.")


#### 2. Define the Evaluation Set
This is the most critical part of a good evaluation. We need high-quality questions and "ground truth" answers that are derived directly from our source documents.

In [9]:
test_questions = [
    "What is the core problem with polysemantic neurons?",
    "How does dictionary learning with sparse autoencoders attempt to solve polysemanticity?",
    "What is a 'feature' in the context of mechanistic interpretability?",
]

ground_truth_answers = [
    "The core problem with polysemantic neurons is that they are frequently activated by several completely different types of inputs, making them difficult to interpret and assign a single, clear function to.",
    "Dictionary learning with sparse autoencoders attempts to solve polysemanticity by decomposing model activations into a larger set of more specific, interpretable features, where each feature corresponds to a single meaningful concept (monosemanticity).",
    "In mechanistic interpretability, a 'feature' is a specific, human-interpretable variable or concept that a model uses for computation, often represented as a pattern of neuron activations.",
]
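
Because the two lists are matched by position, a missing or reordered entry would silently pair a question with the wrong reference answer. A minimal sketch of a sanity check follows; `build_eval_records` is a hypothetical helper for illustration, not part of RAGAS or this project, and the sample data is a placeholder.

```python
def build_eval_records(questions, ground_truths):
    """Pair each question with its reference answer, failing loudly on mismatches."""
    if len(questions) != len(ground_truths):
        raise ValueError(
            f"{len(questions)} questions but {len(ground_truths)} ground-truth answers"
        )
    records = []
    for i, (q, gt) in enumerate(zip(questions, ground_truths)):
        # Reject blank entries so an empty reference never reaches the evaluator.
        if not q.strip() or not gt.strip():
            raise ValueError(f"Empty question or ground truth at index {i}")
        records.append({"question": q, "ground_truth": gt})
    return records


# Illustrative placeholder data; in the notebook, pass test_questions and
# ground_truth_answers instead.
sample_records = build_eval_records(
    ["What is a polysemantic neuron?"],
    ["A neuron that activates for several unrelated kinds of input."],
)
```

Running this once over `test_questions` and `ground_truth_answers` before generation would catch a length mismatch early, before any LLM calls are spent.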

#### 3. Generate Answers with our RAG Chain

In [10]:
print("Initializing RAG chain...")
rag_chain = create_rag_chain()
generated_answers = []
retrieved_contexts = []
# RAGAS needs the retrieved context as well as the answer, so this loop
# assumes the chain returns a dict with "answer" and "context" keys.
# If the chain returns a plain string, the indexing below raises a TypeError.

for question in test_questions:
    print(f" Answering: {question}")
    # `with_config` names the run so it is easy to find in traces
    response = rag_chain.with_config(run_name="test_question_run").invoke(question)

    generated_answers.append(response["answer"])
    retrieved_contexts.append([doc.page_content for doc in response["context"]])

print("Answer generation complete.")


2025-08-29 22:54:07,267 - src.rag_pipeline.core - INFO - Creating the RAG chain with re-ranking...
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2


Initializing RAG chain...


2025-08-29 22:54:07,601 - src.rag_pipeline.core - INFO - Base retriever created successfully.
2025-08-29 22:54:07,749 - src.rag_pipeline.core - INFO - Prompt template created.
2025-08-29 22:54:07,751 - src.rag_pipeline.core - INFO - LLM initialized with model: gemini-1.5-flash
2025-08-29 22:54:07,752 - src.rag_pipeline.core - INFO - RAG chain with re-ranking created successfully.


 Answering: What is the core problem with polysemantic neurons?


TypeError: string indices must be integers, not 'str'
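
This traceback indicates that `response` is a plain string: indexing a `str` with the key `"answer"` raises exactly this `TypeError`. One plausible cause (an assumption, since the body of `create_rag_chain` is not shown here) is that the chain ends in a string output parser and returns only the answer text. The sketch below shows one way to accept either shape. `unpack_response` is a hypothetical helper and `FakeDoc` is an illustrative stand-in for a retrieved document object with a `page_content` attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document exposing `page_content` (illustrative only)."""
    page_content: str


def unpack_response(response):
    """Return (answer, contexts) from a chain response.

    Assumes a dict response carries "answer" and "context" keys, as the
    generation loop expects. A plain-string response yields no contexts.
    """
    if isinstance(response, str):
        return response, []
    answer = response["answer"]
    contexts = [doc.page_content for doc in response.get("context", [])]
    return answer, contexts


answer_only = unpack_response("Polysemantic neurons respond to unrelated inputs.")
with_context = unpack_response(
    {"answer": "A short answer.", "context": [FakeDoc("chunk one"), FakeDoc("chunk two")]}
)
```

A helper like this only avoids the crash. With a string-only response the retrieved contexts are unavailable, and the retriever metrics need them, so the more complete fix is likely to have the chain itself return the retrieved documents alongside the answer.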

#### 4. Run the RAGAS Evaluation

In [None]:
# Create a Hugging Face Dataset from our results
response_dataset = Dataset.from_dict({
    "question": test_questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": ground_truth_answers,
})

print("Running RAGAS evaluation...")

result = evaluate(
    dataset=response_dataset,
    metrics=[
        context_precision,  # Evaluates the retriever
        context_recall,     # Evaluates the retriever
        faithfulness,       # Evaluates the generator
        answer_relevancy,   # Evaluates the generator
    ],
)

print("\n--- RAGAS Evaluation Complete ---")
print(result)
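
Before building the dataset, it can help to confirm that the four columns line up. `Dataset.from_dict` needs columns of equal length, and this notebook stores `contexts` as one list of strings per question. If answer generation stops partway through, `generated_answers` and `retrieved_contexts` end up shorter than the question list. The following is a minimal pre-flight sketch; `check_eval_columns` is a hypothetical helper, and the sample row is placeholder data rather than real pipeline output.

```python
def check_eval_columns(columns):
    """Raise ValueError if columns differ in length or contexts are malformed."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    for i, ctx in enumerate(columns.get("contexts", [])):
        # Each row of contexts should be a list of plain strings.
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")
    return lengths


# Illustrative placeholder row, not real pipeline output.
ok_lengths = check_eval_columns({
    "question": ["q1"],
    "answer": ["a1"],
    "contexts": [["chunk"]],
    "ground_truth": ["gt1"],
})
```

Calling the check on the same dict that is passed to `Dataset.from_dict` gives a readable error naming the short column, rather than a lower-level failure during dataset construction.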