## RAGAS Evaluation

RAGAS = Retrieval-Augmented Generation Assessment Suite (an evaluation framework).

Itâ€™s designed to evaluate retrieval-augmented generation (RAG) systems.



### 1 - Install dependencies

If you haven't installed the required packages, run the cell below once. It's commented out so you don't accidentally reinstall packages every run.

In [None]:
# !pip install --upgrade pip
# !pip install ragas langchain openai python-dotenv
print('Run the pip install commands above if you need to install dependencies')

### 2 - Imports & environment

This cell loads environment variables (from a `.env` file if present) and checks that `ragas` is importable.

In [1]:
import os
from dotenv import load_dotenv
load_dotenv()  # reads .env if present

# Basic check for ragas
try:
    import ragas
except Exception as e:
    raise RuntimeError("Missing 'ragas'. Install it with: pip install ragas") from e

print('Python and ragas imported OK')

  from .autonotebook import tqdm as notebook_tqdm


Python and ragas imported OK


### 3 - Configure LLM (Azure or OpenAI)

In [2]:
use_azure = bool(os.getenv('AZURE_API_KEY') and os.getenv('AZURE_DEPLOYMENT_NAME'))

if use_azure:
    from langchain.chat_models import AzureChatOpenAI
    azure_endpoint = os.getenv('AZURE_ENDPOINT')
    azure_key = os.getenv('AZURE_API_KEY')
    deployment_name = os.getenv('AZURE_DEPLOYMENT_NAME')
    if not (azure_endpoint and azure_key and deployment_name):
        raise RuntimeError('Set AZURE_ENDPOINT, AZURE_API_KEY and AZURE_DEPLOYMENT_NAME for Azure')
    llm = AzureChatOpenAI(azure_deployment=deployment_name, api_key=azure_key, azure_endpoint=azure_endpoint, temperature=0.0)
    print('Using AzureChatOpenAI with deployment:', deployment_name)
else:
    from langchain.chat_models import ChatOpenAI
    openai_api_key = os.getenv('OPENAI_API_KEY')
    if not openai_api_key:
        raise RuntimeError('Set OPENAI_API_KEY for non-Azure usage')
    model_name = os.getenv('OPENAI_MODEL', 'gpt-4o-mini')
    llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=0.0, model=model_name)
    print('Using ChatOpenAI model:', model_name)

Using ChatOpenAI model: gpt-4o-mini


  llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=0.0, model=model_name)


### 4 - Wrap LangChain LLM for Ragas

Ragas expects a wrapper around LangChain LLMs. This cell wraps the `llm` created above.

In [3]:
from ragas.llms import LangchainLLMWrapper
evaluator_llm = LangchainLLMWrapper(llm)
print('Wrapped LangChain LLM for Ragas')

Wrapped LangChain LLM for Ragas


### 5 - Prepare evaluation samples

Replace the `samples` list below with your system outputs. Each sample should be a dict with (commonly) these keys: `question`, `answer` (the system's response), `contexts` (list of retrieved documents/strings), and optionally `ground_truths` (list of correct answers for metrics that require them).

In [4]:
samples = [
    {
        "question": "Where was Albert Einstein born and when?",
        "answer": "Albert Einstein was born in Ulm, Germany on 14 March 1879.",
        "contexts": [
            "Albert Einstein (born 14 March 1879 in Ulm) was a German-born theoretical physicist."
        ],
        "ground_truths": ["Ulm, Germany on 14 March 1879"]
    },
    {
        "question": "What is the capital of France?",
        "answer": "Paris is the capital of France.",
        "contexts": ["Paris is the capital and most populous city of France."],
        "ground_truths": ["Paris"]
    }
]

from ragas import EvaluationDataset

eval_dataset = EvaluationDataset.from_list(samples)
print('Constructed EvaluationDataset with', len(eval_dataset.samples), 'samples')
try:
    print('Available features:', eval_dataset.features())
except Exception:
    print('Could not call eval_dataset.features(); available attributes on dataset object:')
    print([a for a in dir(eval_dataset) if not a.startswith('_')][:50])

Constructed EvaluationDataset with 2 samples
Available features: []


### 6 - Run evaluation (default metric set)

This runs `evaluate(dataset, llm=...)` letting Ragas choose a reasonable default metric set for your dataset/version.

In [5]:
from ragas import evaluate

print('Running evaluate(dataset, llm=...) with default metric set...')
results_default = evaluate(dataset=eval_dataset, llm=evaluator_llm, show_progress=False)

print('\nDefault evaluation results:')
print(results_default)

Running evaluate(dataset, llm=...) with default metric set...


ValueError: The metric [answer_relevancy] that is used requires the following additional columns ['user_input', 'response'] to be present in the dataset.

### 7 - Optional: explicit metric run (robust across Ragas versions)

Ragas changes metric class names across versions. This cell tries several common metric names dynamically and runs them if available.

In [6]:
from ragas import metrics as ragas_metrics_module

metric_candidates = [
    "Faithfulness", "AnswerRelevancy", "ContextRecall", "ContextPrecision",
    "LLMContextRecall", "FactualCorrectness", "SemanticSimilarity"
]

metric_instances = []
for mn in metric_candidates:
    cls = getattr(ragas_metrics_module, mn, None)
    if cls is not None:
        try:
            metric_instances.append(cls())
        except Exception:
            metric_instances.append(cls)

if metric_instances:
    try:
        print('Attempting explicit metrics:', [getattr(m, '__class__', m).__name__ for m in metric_instances])
    except Exception:
        pass
    try:
        results_explicit = evaluate(dataset=eval_dataset, metrics=metric_instances, llm=evaluator_llm, show_progress=False)
        print('Explicit metrics results:')
        print(results_explicit)
    except Exception as e:
        print('Explicit metrics run failed:', e)
else:
    print('No compatible explicit metric classes found on this ragas installation; stick to defaults.')

Attempting explicit metrics: ['Faithfulness', 'AnswerRelevancy', 'ContextRecall', 'ContextPrecision', 'LLMContextRecall', 'FactualCorrectness', 'SemanticSimilarity']
Explicit metrics run failed: The metric [faithfulness] that is used requires the following additional columns ['retrieved_contexts', 'user_input', 'response'] to be present in the dataset.


### 8 - Save & Export results (optional)

This cell saves the dataset samples to CSV and the default evaluation results to JSON.

In [None]:
try:
    import pandas as pd
    df = eval_dataset.to_pandas()
    df.to_csv('ragas_eval_dataset_samples.csv', index=False)
    print('Saved ragas_eval_dataset_samples.csv')
except Exception as e:
    print('Could not save sample CSV:', e)

import json
with open('ragas_eval_results.json', 'w') as f:
    json.dump(results_default, f, indent=2)
print('Saved ragas_eval_results.json')

### Notes & Troubleshooting

- If you hit `AttributeError` regarding metric class names, run `dir(ragas.metrics)` to inspect available metric classes and adapt the list in Cell 7.
- If your original notebook used different column names (e.g., `retrieved_contexts` or `response`), map them to `contexts` / `answer` / `ground_truths` when building `samples` above.
- For Azure: make sure `AZURE_ENDPOINT` ends with `/` and matches the format from your Azure portal.
- If you want me to patch your exact uploaded `.ipynb` and return the updated file, tell me and I'll create a notebook file in /mnt/data for download.