# Evaluation using RAGAS

- Author: [Sungchul Kim](https://github.com/rlatjcj)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/02-Evaluation-using-RAGAS.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/16-Evaluations/02-Evaluation-using-RAGAS.ipynb)

## Overview
This tutorial will show you how to evaluate the quality of your LLM output using RAGAS.

Before starting, let's review four RAGAS metrics: Context Recall, Context Precision, Answer Relevancy, and Faithfulness.

### Context Recall

It estimates "how well the retrieved context matches the ground-truth answer".  
It is calculated using the question, the ground truth, and the retrieved context. The value ranges from 0 to 1, and higher values indicate better performance. To estimate context recall, each claim in the ground truth answer is checked to see whether it can be attributed to the retrieved context. Ideally, every claim in the ground truth answer can be attributed to the retrieved context.

$$\text{context recall} = \frac{|\text{GT claims that can be attributed to context}|}{|\text{claims in GT}|}$$
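The ratio above can be sketched in a few lines of Python. This is only an illustration: `context_recall_score` is a hypothetical helper, and the per-claim verdicts are hard-coded here, whereas RAGAS obtains them by asking an LLM whether each ground-truth claim is supported by the retrieved context.

```python
def context_recall_score(attributed: list[bool]) -> float:
    """Fraction of ground-truth claims that can be attributed to the context."""
    if not attributed:
        raise ValueError("at least one ground-truth claim is required")
    # True counts as 1, so the sum is the number of attributed claims.
    return sum(attributed) / len(attributed)


# Hypothetical verdicts: 3 of the 5 ground-truth claims are found in the context.
verdicts = [True, True, False, True, False]
recall = context_recall_score(verdicts)
print(recall)  # 0.6
```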


### Context Precision

It estimates "whether ground-truth related items in contexts are ranked at the top".

Ideally, all relevant chunks should appear in the top ranks. This metric is calculated using question, ground_truth, and contexts, with values ranging from 0 to 1. Higher scores indicate better precision.

The formula for Context Precision@K is as follows:

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times v_k)}{\text{Total number of relevant items in the top K results}}$$

Here, Precision@k is calculated as follows:

$$\text{Precision@k} = \frac{\text{true positives@k}}{(\text{true positives@k + false positives@k})}$$

K is the total number of chunks in contexts, and $v_k \in \{0, 1\}$ is the relevance indicator at rank k.

This metric evaluates the quality of the retrieved context in information retrieval systems: the score rises when relevant chunks are ranked above irrelevant ones.
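To see how ranking affects the score, the formula can be evaluated by hand on a list of relevance indicators. The sketch below is a plain-Python illustration, not the RAGAS implementation: `context_precision_at_k` is a hypothetical helper, the relevance lists are made up, and returning 0.0 when no chunk is relevant is an assumption.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """Context Precision@K for relevance indicators v_k ordered by rank."""
    hits = 0            # true positives seen so far (relevant chunks up to rank k)
    weighted_sum = 0.0  # running sum of Precision@k * v_k
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k
        precision_at_k = hits / k  # TP@k / (TP@k + FP@k)
        weighted_sum += precision_at_k * v_k
    # Assumption: define the score as 0.0 when there is no relevant chunk.
    return weighted_sum / hits if hits else 0.0


# The same two relevant chunks, ranked first versus ranked last.
top_ranked = context_precision_at_k([1, 1, 0])
bottom_ranked = context_precision_at_k([0, 1, 1])
print(top_ranked, round(bottom_ranked, 4))
```

Both lists contain two relevant chunks out of three, yet the first ordering scores higher because the relevant chunks appear at the top.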


### Answer Relevancy

It is a metric that evaluates "how well the generated answer matches the given prompt".

The main features and calculation methods of this metric are as follows:

1. Purpose: Evaluate the relevance of the generated answer.
2. Score interpretation: Lower scores indicate incomplete or redundant information in the answer, while higher scores indicate better relevance.
3. Elements used in calculation: question, context, answer

Answer Relevancy is defined as the mean cosine similarity between the original question and a set of synthetic questions that an LLM generates from the answer.

$$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^N \cos(E_{g_i}, E_o)$$

or

$$\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^N \frac{E_{g_i} \cdot E_o}{\|E_{g_i}\| \|E_o\|}$$

Here:
- $E_{g_i}$ is the embedding of the generated question $i$
- $E_o$ is the embedding of the original question
- $N$ is the number of generated questions (default value is 3)

Note:
- The actual score is mostly between 0 and 1, but mathematically it can be between -1 and 1 due to the characteristics of cosine similarity.

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the original question's intent.
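The averaging step can be illustrated with the standard library alone. In this sketch the 2-D vectors are toy stand-ins for real embeddings, and `cosine` and `mean_cosine_relevancy` are hypothetical helper names; RAGAS itself generates the questions with an LLM and embeds them with an embedding model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_cosine_relevancy(original: list[float], generated: list[list[float]]) -> float:
    """Average cosine similarity between E_o and each E_{g_i}."""
    return sum(cosine(g, original) for g in generated) / len(generated)


# Toy embeddings (assumed values): one aligned, one orthogonal, one opposite.
e_o = [1.0, 0.0]
e_g = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

relevancy = mean_cosine_relevancy(e_o, e_g)
print(relevancy)  # (1 + 0 - 1) / 3 = 0.0
```

The third vector contributes -1, which shows why the score can in principle fall below 0.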


### Faithfulness

It is a metric that evaluates "the factual consistency of the generated answer compared to the given context".

The main features and calculation methods of this metric are as follows:

1. Purpose: Evaluate the factual consistency of the generated answer compared to the given context.
2. Calculation elements: Use the generated answer and the retrieved context.
3. Score range: Scaled to the range 0 to 1, with higher values indicating better performance.

The calculation method for Faithfulness score is as follows:

$$\text{Faithfulness score} = \frac{|\text{claims in the generated answer that can be inferred from the given context}|}{|\text{claims in the generated answer}|}$$

Calculation process:
1. Identify claims in the generated answer.
2. Verify each claim against the given context to check if it can be inferred from the context.
3. Use the above formula to calculate the score.

Example:
- Question: "When and where was Einstein born?"
- Context: "Albert Einstein (born March 14, 1879) is a German-born theoretical physicist, widely considered one of the most influential scientists of all time."
- High faithfulness answer: "Einstein was born in Germany on March 14, 1879."
- Low faithfulness answer: "Einstein was born in Germany on March 20, 1879."

This metric is useful for evaluating the performance of question-answering systems, particularly for measuring how well the generated answer reflects the given context.
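Applied to the Einstein example, the formula might work out as follows. This is a hand-worked sketch: splitting each answer into two claims and marking them as supported or not are assumptions made for illustration (RAGAS performs both steps with an LLM), and `faithfulness_score` is a hypothetical helper.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Claims split by hand; True means the context supports the claim.
high_claims = {
    "Einstein was born in Germany.": True,
    "Einstein was born on March 14, 1879.": True,
}
low_claims = {
    "Einstein was born in Germany.": True,
    "Einstein was born on March 20, 1879.": False,  # the context says March 14
}

high_score = faithfulness_score(list(high_claims.values()))
low_score = faithfulness_score(list(low_claims.values()))
print(high_score, low_score)  # 1.0 0.5
```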

----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. 
- You can check out the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) repository for more details.

In [9]:
%%capture --no-stderr
%pip install langchain-opentutorial ragas pymupdf

In [10]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "ragas",
        "pymupdf",
    ],
    verbose=False,
    upgrade=False,
)

In [None]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Evaluation-using-RAGAS",  # Set this to match the tutorial title
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [3]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True
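Missing or empty keys usually surface only later as authentication errors, so it may help to check them up front. The sketch below assumes that `OPENAI_API_KEY` is the key needed for the OpenAI models used in this tutorial; `find_missing_keys` is a hypothetical helper, not part of `langchain-opentutorial` or `python-dotenv`.

```python
import os


def find_missing_keys(required: list[str]) -> list[str]:
    """Return the names in `required` that are unset or blank in os.environ."""
    return [name for name in required if not os.environ.get(name, "").strip()]


missing = find_missing_keys(["OPENAI_API_KEY"])
if missing:
    print(f"Missing or empty environment variables: {', '.join(missing)}")
else:
    print("All required keys are set.")
```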

## Load saved RAGAS dataset

Load the RAGAS dataset that you saved in the previous step.

In [4]:
import pandas as pd

df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()

|  | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What specific recent developments in generativ... | ['SPRi AI Brief \|\n2023-12월호\n삼성전자, 자체 개발 생성 A... | TechRepublic highlighted several recent develo... | simple | [{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pd... | True |
| 1 | What are the dates and location for CES 2024? | ['Ⅱ\n. 주요 행사 일정\n행사명 행사 주요 개요\n- 미국 소비자기술 협회(C... | CES 2024 will take place from January 9 to Jan... | simple | [{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pd... | True |
| 2 | What are the key aspects of AI global cooperat... | ['문제를 방지하는 조치를 확대\n∙ 형사사법 시스템에서 AI 사용 모범사례를 개발... | The key aspects of AI global cooperation highl... | simple | [{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pd... | True |
| 3 | What measures are suggested to enhance data pr... | ['∙ 첨단 AI 시스템의 성능과 한계를 공개하고 적절하거나 부적절한 사용영역을 알... | The context suggests several measures to enhan... | simple | [{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pd... | True |
| 4 | What were the main takeaways from the AI Safet... | ['관련된 경우 해당 국가와 결과를 공유하며, 적절한 시기에 공동 표준 개발을 위해... | The main takeaways from the AI Safety Summit o... | reasoning | [{'source': 'data/SPRI_AI_Brief_2023년12월호_F.pd... | True |


In [5]:
from datasets import Dataset

test_dataset = Dataset.from_pandas(df)
test_dataset

Dataset({
    features: ['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'],
    num_rows: 10
})

In [6]:
import ast


# Convert contexts column from string to list
def convert_to_list(example):
    contexts = ast.literal_eval(example["contexts"])
    return {"contexts": contexts}


test_dataset = test_dataset.map(convert_to_list)
print(test_dataset)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', 'episode_done'],
    num_rows: 10
})
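The conversion is needed because a list written to CSV is read back as its string representation. The small sketch below, using made-up chunk text, shows what `ast.literal_eval` does to such a string and that it rejects anything other than a Python literal.

```python
import ast

# A list that has been round-tripped through CSV arrives as a plain string.
raw = "['first chunk', 'second chunk']"
parsed = ast.literal_eval(raw)

print(type(raw).__name__, "->", type(parsed).__name__)  # str -> list
print(len(parsed))  # 2

# literal_eval only accepts literals, so expressions such as function calls fail.
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```

This is why `ast.literal_eval` is generally preferred over `eval` for parsing stored data.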


In [7]:
test_dataset[1]["contexts"]

['Ⅱ\n. 주요 행사 일정\n행사명 행사 주요 개요\n- 미국 소비자기술 협회(CTA)가 주관하는 세계 최대 가전·IT·소\n비재 전시회로 5G, AR&VR, 디지털헬스, 교통·모빌리티 등\n주요 카테고리 중심으로 기업들이 최신의 기술 제품군을 전시\n- CTA 사피로 회장은 가장 주목받는 섹터로 AI를 조명하였으며,\n모든 산업을 포괄한다는 의미에서 ‘올 온(All on)’을 주제로 한\nCES 2024\n이번 전시에는 500곳 이상의 한국기업 참가 예정\n기간 장소 홈페이지\n2024.1.9~12 미국, 라스베가스 https://www.ces.tech/\n- 머신러닝 및 응용에 관한 국제 컨퍼런스(AIMLA 2024)는\n인공지능 및 머신러닝의 이론, 방법론 및 실용적 접근에 관한\n지식과 최신 연구 결과 공유\n- 이론 및 실무 측면에서 인공지능, 기계학습의 주요 분야를\n논의하고, 학계, 산업계의 연구자와 실무자들에게 해당 분\nAIMLA 2024\n야의 최첨단 개발 소식 공유\n기간 장소 홈페이지\nhttps://ccnet2024.org/aimla\n2024.1.27~28 덴마크, 코펜하겐\n/index\n- AI 발전 협회 컨퍼런스(AAAI)는 AI 연구를 촉진하고, AI 분야\n연구원, 실무자, 과학자, 학생 및 공학자 간 교류의 기회 제공\n- 컨퍼런스에서 AI 관련 기술 발표, 특별 트랙, 초청 연사,\nAAAI Conference\n워크숍, 튜토리얼, 포스터 세션, 주제 발표, 대회, 전시 프\non Artificial\n로그램 등 진행\nIntelligence\n기간 장소 홈페이지\nhttps://aaai.org/aaai-confere\n2024.2.20~27 캐나다, 밴쿠버\nnce/']

In [11]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf") # TODO (sungchul): update the path
docs = loader.load()

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# Step 3: Create Embeddings
embeddings = OpenAIEmbeddings()

# Step 4: Create DB and Save
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

# Step 5: Create Retriever
retriever = vectorstore.as_retriever()

# Step 6: Create Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks. 
Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. 

#Context: 
{context}

#Question:
{question}

#Answer:"""
)

# Step 7: Create LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Create a batch dataset. A batch dataset is useful when you want to process a large number of questions at once.

- Reference for `batch`: [Link](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/01-Basic/07-LCEL-Interface.ipynb)

In [12]:
batch_dataset = [question for question in test_dataset["question"]]
batch_dataset[:3]

['What specific recent developments in generative AI have been highlighted by TechRepublic in their November 2023 articles?',
 'What are the dates and location for CES 2024?',
 "What are the key aspects of AI global cooperation highlighted in the G7's approach to ethical considerations and regulatory frameworks for advanced AI systems?"]

Call `batch()` to get answers for the batch dataset.

In [13]:
answer = chain.batch(batch_dataset)
answer[:3]

["I don't know. The provided context does not contain information about specific recent developments in generative AI highlighted by TechRepublic in their November 2023 articles.",
 'The dates for CES 2024 are January 9 to January 12, and it will be held in Las Vegas, USA.',
 "The key aspects of AI global cooperation highlighted in the G7's approach to ethical considerations and regulatory frameworks for advanced AI systems include:\n\n1. **International Code of Conduct**: The G7 has developed an International Code of Conduct for Advanced AI Systems, which encourages companies to voluntarily adopt measures for identifying and mitigating AI risks throughout the AI lifecycle.\n\n2. **Risk Assessment and Mitigation**: The code emphasizes the importance of assessing and mitigating risks associated with advanced AI systems, both during development and after deployment, to address vulnerabilities and misuse.\n\n3. **Transparency and Accountability**: It calls for transparency in disclosing t

Store the answers generated by the LLM in the 'answer' column.

In [14]:
# Overwrite or add 'answer' column
if "answer" in test_dataset.column_names:
    test_dataset = test_dataset.remove_columns(["answer"]).add_column("answer", answer)
else:
    test_dataset = test_dataset.add_column("answer", answer)



## Evaluate the answers

In [15]:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)

result = evaluate(
    dataset=test_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_precision': 0.7000, 'faithfulness': 0.5094, 'answer_relevancy': 0.5897, 'context_recall': 0.5333}

In [16]:
result_df = result.to_pandas()
result_df.head()

|  | user_input | retrieved_contexts | response | reference | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|---|---|---|---|
| 0 | What specific recent developments in generativ... | [SPRi AI Brief \|\n2023-12월호\n삼성전자, 자체 개발 생성 AI... | I don't know. The provided context does not co... | TechRepublic highlighted several recent develo... | 0.5 | 0.333333 | 0.0 | 0.6 |
| 1 | What are the dates and location for CES 2024? | [Ⅱ\n. 주요 행사 일정\n행사명 행사 주요 개요\n- 미국 소비자기술 협회(CT... | The dates for CES 2024 are January 9 to Januar... | CES 2024 will take place from January 9 to Jan... | 1.0 | 1.0 | 0.957657 | 1.0 |
| 2 | What are the key aspects of AI global cooperat... | [문제를 방지하는 조치를 확대\n∙ 형사사법 시스템에서 AI 사용 모범사례를 개발하... | The key aspects of AI global cooperation highl... | The key aspects of AI global cooperation highl... | 0.5 | 0.5 | 1.0 | 0.4 |
| 3 | What measures are suggested to enhance data pr... | [∙ 첨단 AI 시스템의 성능과 한계를 공개하고 적절하거나 부적절한 사용영역을 알리... | To enhance data privacy in the context of adva... | The context suggests several measures to enhan... | 1.0 | 1.0 | 0.995689 | 0.5 |
| 4 | What were the main takeaways from the AI Safet... | [관련된 경우 해당 국가와 결과를 공유하며, 적절한 시기에 공동 표준 개발을 위해 ... | The main takeaways from the AI Safety Summit o... | The main takeaways from the AI Safety Summit o... | 1.0 | 0.111111 | 1.0 | 0.666667 |


In [17]:
result_df.to_csv("data/ragas_evaluation_result.csv", index=False)

In [18]:
result_df.loc[:, "context_precision":"context_recall"]

|  | context_precision | faithfulness | answer_relevancy | context_recall |
|---|---|---|---|---|
| 0 | 0.5 | 0.333333 | 0.0 | 0.6 |
| 1 | 1.0 | 1.0 | 0.957657 | 1.0 |
| 2 | 0.5 | 0.5 | 1.0 | 0.4 |
| 3 | 1.0 | 1.0 | 0.995689 | 0.5 |
| 4 | 1.0 | 0.111111 | 1.0 | 0.666667 |
| 5 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | 1.0 | 0.733333 | 0.952757 | 0.666667 |
| 7 | 1.0 | 0.916667 | 0.990732 | 0.5 |
| 8 | 0.0 | 0.5 | 0.0 | 0.0 |
| 9 | 1.0 | 0.0 | 0.0 | 0.0 |
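As a sanity check, the per-row scores can be averaged and compared with the aggregate values that `evaluate()` reported earlier. The sketch below copies the scores from the output above into plain lists so it runs without `pandas`. The 0.5 cutoff for flagging rows is an arbitrary threshold chosen for illustration.

```python
from statistics import mean

# Per-row scores copied from the evaluation output above (rows 0 to 9).
scores = {
    "context_precision": [0.5, 1.0, 0.5, 1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0],
    "faithfulness": [0.333333, 1.0, 0.5, 1.0, 0.111111, 0.0, 0.733333, 0.916667, 0.5, 0.0],
    "answer_relevancy": [0.0, 0.957657, 1.0, 0.995689, 1.0, 0.0, 0.952757, 0.990732, 0.0, 0.0],
    "context_recall": [0.6, 1.0, 0.4, 0.5, 0.666667, 1.0, 0.666667, 0.5, 0.0, 0.0],
}

# Column means rounded to four decimals, the precision of the aggregate result.
averages = {name: round(mean(values), 4) for name, values in scores.items()}
print(averages)

# Rows whose faithfulness falls below the (arbitrary) 0.5 cutoff.
low_faithfulness_rows = [i for i, s in enumerate(scores["faithfulness"]) if s < 0.5]
print(low_faithfulness_rows)  # [0, 4, 5, 9]
```

The rounded means match the aggregate result shown earlier, and the flagged rows are natural candidates for manual inspection.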
