##### ScholarAgent: Quantitative Evaluation with RAGAS
**Objective:** To quantitatively measure the performance of our advanced RAG pipeline using the RAGAS framework. This moves our project from a qualitative demo to a rigorous, research-grade system.

In [1]:
import sys
import os
from dotenv import load_dotenv

# Add the project root to the Python path.
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))

if project_root not in sys.path:
    sys.path.insert(0, project_root)

# Load environment variables
load_dotenv()
print("Paths and environment variables loaded.")

Paths and environment variables loaded.


#### 1. Imports and Setup

In [2]:
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import SentenceTransformerEmbeddings
from src.rag_pipeline.core import create_rag_chain
import configs.settings as settings

print("All libraries imported successfully.")

if not os.getenv("GOOGLE_API_KEY"):
    print("WARNING: GOOGLE_API_KEY not found in .env file. RAGAS evaluation may fail.")


All libraries imported successfully.


#### 2. Define the Evaluation Set
This is the most critical part of a good evaluation. We need high-quality questions and "ground truth" answers that are derived directly from our source documents.

In [3]:
test_questions = [
    "What is the core problem with polysemantic neurons?",
    "How does dictionary learning with sparse autoencoders attempt to solve polysemanticity?",
    "What is a 'feature' in the context of mechanistic interpretability?",
]

ground_truth_answers = [
    "The core problem with polysemantic neurons is that they are frequently activated by several completely different types of inputs, making them difficult to interpret and assign a single, clear function to.",
    "Dictionary learning with sparse autoencoders attempts to solve polysemanticity by decomposing model activations into a larger set of more specific, interpretable features, where each feature corresponds to a single meaningful concept (monosemanticity).",
    "In mechanistic interpretability, a 'feature' is a specific, human-interpretable variable or concept that a model uses for computation, often represented as a pattern of neuron activations.",
]
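
Because the questions and ground-truth answers are paired by position, it is cheap to check the pairing before spending any API calls. The following is a minimal sketch of such a check; `validate_eval_set` and the sample inputs are hypothetical and not part of the project code.

```python
def validate_eval_set(questions, ground_truths):
    """Return a list of problems found in a question/ground-truth pairing."""
    problems = []
    if len(questions) != len(ground_truths):
        problems.append(
            f"length mismatch: {len(questions)} questions vs "
            f"{len(ground_truths)} ground truths"
        )
    # zip stops at the shorter list, so this loop is safe on a mismatch.
    for i, (question, truth) in enumerate(zip(questions, ground_truths)):
        if not question.strip():
            problems.append(f"question {i} is empty")
        if not truth.strip():
            problems.append(f"ground truth {i} is empty")
    if len(set(questions)) != len(questions):
        problems.append("duplicate questions found")
    return problems


# Tiny illustrative inputs (not the notebook's real evaluation set).
sample_questions = ["What is a feature?", "What is polysemanticity?"]
sample_truths = ["A human-interpretable variable.", ""]
print(validate_eval_set(sample_questions, sample_truths))
```

An empty list means the set passed these basic checks; it says nothing about whether the ground-truth answers are actually correct.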

#### 3. Generate Answers with our RAG Chain

In [4]:
print("Initializing RAG chain...")
rag_chain = create_rag_chain()
generated_answers = []
retrieved_contexts = []
# RAGAS needs the retrieved context as well as the answer, so the chain
# is expected to return a dict with both an "answer" and a "context" key.

for question in test_questions:
    print(f" Answering: {question}")
    # `with_config` names the run so it is easy to find in traces
    response = rag_chain.with_config(run_name="test_question_run").invoke(question)

    generated_answers.append(response["answer"])
    retrieved_contexts.append([doc.page_content for doc in response["context"]])

print("Answer generation complete.")


2025-08-30 12:55:27,194 - src.rag_pipeline.core - INFO - Creating the RAG chain for evaluation...


Initializing RAG chain...


2025-08-30 12:55:34,121 - src.rag_pipeline.core - INFO - Base retriever created successfully.
2025-08-30 12:55:34,294 - src.rag_pipeline.core - INFO - Flashrank re-ranker initialized.
2025-08-30 12:55:34,295 - src.rag_pipeline.core - INFO - Prompt template created.
2025-08-30 12:55:34,309 - src.rag_pipeline.core - INFO - LLM initialized with model: gemini-1.5-flash
2025-08-30 12:55:34,310 - src.rag_pipeline.core - INFO - RAG chain for evaluation created successfully.


 Answering: What is the core problem with polysemantic neurons?


2025-08-30 12:55:36,942 - src.rag_pipeline.core - INFO - Re-ranked 20 documents down to 5.


 Answering: How does dictionary learning with sparse autoencoders attempt to solve polysemanticity?


2025-08-30 12:55:40,329 - src.rag_pipeline.core - INFO - Re-ranked 20 documents down to 5.


 Answering: What is a 'feature' in the context of mechanistic interpretability?


2025-08-30 12:55:42,981 - src.rag_pipeline.core - INFO - Re-ranked 20 documents down to 5.


Answer generation complete.
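
The loop above relies on the chain returning a dict with an `"answer"` string and a `"context"` list of documents that expose `page_content`. A minimal sketch of that contract, using a stub in place of the real chain so it runs without any API calls, might look like the following. `collect_rag_outputs` and `StubChain` are hypothetical names introduced for illustration only.

```python
from types import SimpleNamespace


def collect_rag_outputs(chain, questions):
    """Invoke `chain` once per question and split out answers and contexts."""
    answers, contexts = [], []
    for question in questions:
        response = chain.invoke(question)
        answers.append(response["answer"])
        # Each sample's contexts are stored as a list of plain strings.
        contexts.append([doc.page_content for doc in response["context"]])
    return answers, contexts


class StubChain:
    """Hypothetical stand-in for the real chain, used only for this sketch."""

    def invoke(self, question):
        doc = SimpleNamespace(page_content=f"Passage about: {question}")
        return {"answer": f"Stub answer to: {question}", "context": [doc]}


answers, contexts = collect_rag_outputs(StubChain(), ["What is a feature?"])
print(answers)
print(contexts)
```

Wrapping the loop in a function like this also makes it easy to rerun the same question set against a modified chain.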


#### 4. Run the RAGAS Evaluation
We now combine our questions, ground truth answers, generated answers, and retrieved contexts into a dataset and pass it to RAGAS for scoring.

In [7]:
# Combine all the data into a Hugging Face Dataset object
response_dataset = Dataset.from_dict({
    "question": test_questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": ground_truth_answers,
})

# Initialize the judge LLM and the embedding model RAGAS will use
ragas_llm = ChatGoogleGenerativeAI(model=settings.LLM_MODEL_NAME)
ragas_embeddings = SentenceTransformerEmbeddings(model_name=settings.EMBEDDING_MODEL_NAME)

print("Running RAGAS evaluation...")

result = evaluate(
    dataset=response_dataset,
    metrics=[
        context_precision,  # Evaluates the retriever
        context_recall,     # Evaluates the retriever
        faithfulness,       # Evaluates the generator
        answer_relevancy,   # Evaluates the generator
    ],
    llm=ragas_llm,
    embeddings=ragas_embeddings,
)

print("--- RAGAS Evaluation Complete ---")
print(result)

INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2


Running RAGAS evaluation...


Evaluating: 100%|██████████| 12/12 [00:14<00:00,  1.24s/it]


--- RAGAS Evaluation Complete ---
{'context_precision': 0.5810, 'context_recall': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.6829}
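
To make the weaker metrics stand out, the scores printed above can be filtered against a cut-off and sorted. This is a small sketch only: the values are copied by hand from the output above, the `0.8` threshold is an arbitrary assumption chosen for illustration, and `weakest_first` is a hypothetical helper.

```python
# Scores copied from the printed RAGAS result above.
scores = {
    "context_precision": 0.5810,
    "context_recall": 1.0000,
    "faithfulness": 1.0000,
    "answer_relevancy": 0.6829,
}

THRESHOLD = 0.8  # assumed cut-off, for illustration only


def weakest_first(scores, threshold):
    """Return (metric, score) pairs below `threshold`, lowest score first."""
    below = {name: value for name, value in scores.items() if value < threshold}
    return sorted(below.items(), key=lambda item: item[1])


for name, value in weakest_first(scores, THRESHOLD):
    print(f"{name}: {value:.4f} (below {THRESHOLD})")
```

With these numbers, `context_precision` and `answer_relevancy` are the two metrics that fall below the assumed threshold.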


#### 5. Analyze the Results
The output above is a dictionary of scores, each ranging from `0.0` (worst) to `1.0` (best). In this run the pipeline scores `1.0` on `faithfulness` and `context_recall`, while `context_precision` (0.5810) and `answer_relevancy` (0.6829) leave the most room for improvement. With only three test questions, these numbers are indicative rather than conclusive. The four metrics measure:

- **`faithfulness`**: Are the claims in the answer supported by the retrieved context? This is a key metric for detecting hallucinations.
- **`context_recall`**: Did the retriever find all the information needed to produce the ground truth answer?
- **`answer_relevancy`**: Does the answer directly address the question being asked?
- **`context_precision`**: Are the relevant chunks ranked near the top of the retrieved context, or are they buried among irrelevant ones?
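
As we understand the RAGAS documentation, `context_precision` is rank-aware: it averages precision@k over the ranks that hold a relevant chunk. The sketch below illustrates that idea with hand-assigned relevance flags. In RAGAS itself the relevance judgments come from the judge LLM, so this is a simplified illustration under that assumption, not the library's implementation, and `context_precision_at_k` is a hypothetical name.

```python
def context_precision_at_k(relevance):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance` is a list of 0/1 flags, one per retrieved chunk, in rank order.
    """
    hits = 0
    weighted_sum = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / k  # precision@k at this relevant rank
    return weighted_sum / hits if hits else 0.0


# Two relevant chunks out of five, ranked first versus ranked last.
top_ranked = context_precision_at_k([1, 1, 0, 0, 0])
bottom_ranked = context_precision_at_k([0, 0, 0, 1, 1])
print(f"{top_ranked:.3f}")
print(f"{bottom_ranked:.3f}")
```

The same two relevant chunks score 1.000 when ranked first and 0.325 when ranked fourth and fifth. This is one plausible reading of how a run can reach a `context_recall` of 1.0 while `context_precision` stays lower: the needed information is retrieved, but not always at the top of the list.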