source: https://docs.ragas.io/en/latest/getstarted/evaluation.html

In [5]:
import os
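# The key must already be set; assigning os.environ.get(...) back raises TypeError when it is unset.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running the evaluation.")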

### The Data
For this tutorial we use the `explodinggradients/amnesty_qa` example dataset, which is loaded below. It has the following columns.

- question: list[str] - These are the questions your RAG pipeline will be evaluated on.

- answer: list[str] - The answer generated from the RAG pipeline and given to the user.

- contexts: list[list[str]] - The contexts which were passed into the LLM to answer the question.

- ground_truth: list[str] - The ground truth answer to each question (only required if you are using context_recall).

Ideally your list of questions should reflect the questions your users ask, including those that have been problematic in the past.

In [6]:
from datasets import load_dataset

# loading the V2 dataset
amnesty_qa = load_dataset("explodinggradients/amnesty_qa", "english_v2")
amnesty_qa

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


DatasetDict({
    eval: Dataset({
        features: ['question', 'ground_truth', 'answer', 'contexts'],
        num_rows: 20
    })
})
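Before evaluating your own data, it can help to check that it follows the same column layout. The helper below is a minimal sketch, not part of Ragas: `validate_rag_columns` is a hypothetical name, the sample record is an illustrative placeholder, and treating `ground_truth` as one string per question is an assumption based on the features printed above.

```python
from typing import Any


def validate_rag_columns(data: dict[str, list[Any]]) -> list[str]:
    """Return a list of problems found; an empty list means the layout looks fine."""
    problems = [
        f"missing column: {name}"
        for name in ("question", "answer", "contexts")
        if name not in data
    ]
    if problems:
        return problems

    # Every column must have one entry per question.
    expected_rows = len(data["question"])
    for name, column in data.items():
        if len(column) != expected_rows:
            problems.append(f"{name} has {len(column)} rows, expected {expected_rows}")

    # question and answer hold one string per row.
    for name in ("question", "answer"):
        if not all(isinstance(value, str) for value in data[name]):
            problems.append(f"{name} must contain only strings")

    # contexts holds a list of strings per row.
    for i, context_list in enumerate(data["contexts"]):
        if not isinstance(context_list, list) or not all(
            isinstance(chunk, str) for chunk in context_list
        ):
            problems.append(f"contexts[{i}] must be a list of strings")

    # ground_truth is optional (assumed: one string per row when present).
    if "ground_truth" in data and not all(
        isinstance(value, str) for value in data["ground_truth"]
    ):
        problems.append("ground_truth must contain only strings")

    return problems


# Illustrative placeholder record, not taken from the real dataset.
sample = {
    "question": ["What does the report recommend?"],
    "answer": ["It recommends an independent review."],
    "contexts": [["The report recommends an independent review."]],
    "ground_truth": ["An independent review."],
}
print(validate_rag_columns(sample))
```

A non-empty return value lists each problem in plain text, which makes it easy to fix the data before spending API calls on an evaluation run.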

### Metrics
Ragas provides a few metrics to evaluate the different aspects of your RAG system:

1.  Retriever: context_precision and context_recall measure the performance of your retrieval system.

2.  Generator (LLM): faithfulness measures hallucinations, and answer_relevancy measures how to the point the answers are with respect to the question.

The harmonic mean of these 4 aspects gives you the ragas score which is a single measure of the performance of your QA system across all the important aspects.
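To make the harmonic mean concrete, here is a small sketch that combines the four aggregate scores reported by the evaluation step later in this tutorial. It uses only the Python standard library; the variable names are illustrative, and the printed evaluation result itself does not include this combined value.

```python
from statistics import harmonic_mean, mean

# Aggregate scores copied from the evaluation output in this tutorial.
scores = {
    "context_precision": 1.0000,
    "faithfulness": 0.6842,
    "answer_relevancy": 0.9724,
    "context_recall": 0.9938,
}

values = list(scores.values())

# Harmonic mean: n divided by the sum of reciprocals of the scores.
combined_score = harmonic_mean(values)

# Arithmetic mean, shown only for comparison.
arithmetic_score = mean(values)

print(f"harmonic mean:   {combined_score:.4f}")
print(f"arithmetic mean: {arithmetic_score:.4f}")
```

The harmonic mean comes out lower than the arithmetic mean here because it penalizes the weakest component (faithfulness) more heavily, which is the reason to prefer it for a single summary score.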

Now let's import these metrics and understand more about what they denote.

In [7]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision
)

Here you can see that we are using 4 metrics, but what do they represent?

faithfulness - the factual consistency of the answer with the context, based on the question.

context_precision - a measure of how relevant the retrieved context is to the question. It conveys the quality of the retrieval pipeline.

answer_relevancy - a measure of how relevant the answer is to the question.

context_recall - a measure of the retriever's ability to retrieve all the information needed to answer the question.

### Evaluation
Running the evaluation is as simple as calling evaluate on the Dataset with the metrics of your choice.

In [8]:
from ragas import evaluate

result = evaluate(
    amnesty_qa["eval"],
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result

Evaluating: 100%|██████████| 80/80 [01:49<00:00,  1.36s/it]


{'context_precision': 1.0000, 'faithfulness': 0.6842, 'answer_relevancy': 0.9724, 'context_recall': 0.9938}

In [9]:
df = result.to_pandas()
df.head()

|   | question | ground_truth | answer | contexts | context_precision | faithfulness | answer_relevancy | context_recall |
|---|----------|--------------|--------|----------|-------------------|--------------|------------------|----------------|
| 0 | What are the global implications of the USA Su... | The global implications of the USA Supreme Cou... | The global implications of the USA Supreme Cou... | [- In 2022, the USA Supreme Court handed down ... | 1.0 | 1.0 | 0.988044 | 1.0 |
| 1 | Which companies are the main contributors to G... | According to the Carbon Majors database, the m... | According to the Carbon Majors database, the m... | [- Fossil fuel companies, whether state or pri... | 1.0 | 1.0 | 0.934202 | 1.0 |
| 2 | Which private companies in the Americas are th... | The largest private companies in the Americas ... | According to the Carbon Majors database, the l... | [The private companies responsible for the mos... | 1.0 | 0.0 | 0.987074 | 1.0 |
| 3 | What action did Amnesty International urge its... | Amnesty International urged its supporters to ... | Amnesty International urged its supporters to ... | [Amnesty International called on its vast netw... | 1.0 | NaN | 0.985182 | 1.0 |
| 4 | What are the recommendations made by Amnesty I... | The recommendations made by Amnesty Internatio... | Amnesty International made several recommendat... | [Amnesty International recommends that the Spe... | 1.0 | 1.0 | 0.99367 | 1.0 |
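Per-row scores are most useful for finding the samples that need a closer look. The sketch below reuses the faithfulness and answer_relevancy values from the five rows shown above and flags rows whose faithfulness is low or missing. It is plain Python rather than pandas so it runs anywhere; the `needs_review` helper and the 0.5 threshold are illustrative assumptions, and the empty faithfulness value in row 3 is treated as NaN.

```python
import math

# Scores copied from the five rows displayed above.
rows = [
    {"index": 0, "faithfulness": 1.0, "answer_relevancy": 0.988044},
    {"index": 1, "faithfulness": 1.0, "answer_relevancy": 0.934202},
    {"index": 2, "faithfulness": 0.0, "answer_relevancy": 0.987074},
    {"index": 3, "faithfulness": float("nan"), "answer_relevancy": 0.985182},
    {"index": 4, "faithfulness": 1.0, "answer_relevancy": 0.99367},
]


def needs_review(row: dict, threshold: float = 0.5) -> bool:
    """Flag a row when faithfulness is missing (NaN) or below the threshold."""
    score = row["faithfulness"]
    return math.isnan(score) or score < threshold


flagged = [row["index"] for row in rows if needs_review(row)]
print(flagged)  # prints [2, 3]
```

Row 2 is flagged even though its answer_relevancy is high, which shows why the metrics are worth reading separately: an answer can be on topic and still not be supported by the retrieved context.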


In [10]:
df.to_csv('data.csv', sep='|')
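Because the file is written with `|` as the separator, it has to be read back with the same delimiter. The following is a minimal standard-library sketch of that round trip. It writes a small stand-in file to a temporary directory instead of touching `data.csv`, and it includes only two of the columns shown above.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "data.csv"

    # Stand-in for the exported file: an unnamed index column plus two
    # of the columns from the result table, separated by '|'.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="|")
        writer.writerow(["", "question", "faithfulness"])
        writer.writerow([0, "What are the global implications of the USA Su...", 1.0])

    # Read it back using the same delimiter.
    with path.open(newline="", encoding="utf-8") as handle:
        loaded = list(csv.DictReader(handle, delimiter="|"))

print(loaded[0]["question"])
print(loaded[0]["faithfulness"])
```

Values come back as strings when read this way, so numeric columns such as faithfulness need an explicit `float(...)` conversion before further analysis.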