# Use Ragas to evaluate a RAG pipeline

Ragas is an open-source project for evaluating RAG components.

<div>
<img src="./ragas_eval_image.png" width="80%"/>
</div>

**Note that Ragas can consume a large number of OpenAI API tokens.** <br> 

Read through this notebook carefully and pay attention to the number of questions and metrics you want to evaluate.

### 1. Prepare Ragas environment and ground truth data

In [1]:
# ! python -m pip install openai datasets ragas langchain pandas

In [2]:
# Read questions and ground truth answers into a pandas dataframe.
# Note: Surround each context string with ''' to avoid issues with quotes inside.
# Note: Separate each context string with a comma.
import pandas as pd
import numpy as np

# Read ground truth answers from file.
eval_df = pd.read_csv("../../../christy_coding_scratch/data/milvus_ground_truth.csv", 
                      header=0, skip_blank_lines=True)
display(eval_df.head())

|   | Question | ground_truth_answer | OpenAI_RAG_answer | Custom_RAG_answer | Custom_RAG_context | Uri | H1 | H2 | Score | Reason |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean?\n | M: maximum out-degree of the graph in a layer;... | The HNSW parameters include the “nlist” which ... | The parameters for HNSW are as follows:\n- M: ... | '''the next layer to begin another search. Aft... | https://pymilvus.readthedocs.io/en/latest/para... | Index | Milvus support to create index to accelerate v... |  |  |
| 1 | What are HNSW good default parameters when dat... | M=16, efConstruction=32, ef=32 | The default HNSW parameters for data size of 2... | A good default value for the HNSW parameters w... | '''to reduce the probability that the target v... | https://pymilvus.readthedocs.io/en/latest/para... |  |  |  |  |
| 2 | what is the default distance metric used in AU... | Trick answer: IP inner product, not yet updat... | The default AUTOINDEX distance metric in Milvu... | The default AUTOINDEX distance metric in Milvu... | '''please refer to Milvus documentation index ... | https://pymilvus.readthedocs.io/en/latest/tuto... |  |  |  |  |
| 3 | How did New York City get its name? | In the 1600’s, the Dutch planted a trading pos... | I'm sorry, but I couldn't find any information... | New York City was originally named New Amsterd... | '''Etymology See also: Nicknames of New York C... | https://en.wikipedia.org/wiki/New_York_City |  |  |  |  |


In [3]:
# Ragas uses HuggingFace Datasets by default.
# https://docs.ragas.io/en/latest/getstarted/evaluation.html
from datasets import Dataset

def assemble_ragas_dataset(input_df):
    """Assemble a RAGAS HuggingFace Dataset from an input pandas df."""

    # Assemble the Ragas lists: questions, ground truth answers, retrieval contexts, and RAG answers.

    # Get all the questions.
    question_list = input_df.Question.to_list()

    # Get all the ground truth answers.
    truth_list = input_df.ground_truth_answer.to_list()

    # Get all the Milvus Retrieval Contexts as list[list[str]]
    context_list = input_df.Custom_RAG_context.to_list()
    context_list = [[context] for context in context_list]

    # Get all the RAG answers based on contexts.
    rag_answer_list = input_df.Custom_RAG_answer.to_list()

    # Create a HuggingFace Dataset from the ground truth lists.
    ragas_ds = Dataset.from_dict({"question": question_list,
                            "contexts": context_list,
                            "answer": rag_answer_list,
                            "ground_truth": truth_list
                            })
    return ragas_ds
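
The function above wraps each `Custom_RAG_context` cell in a single-element list, so a cell that holds several `'''`-quoted chunks is passed to Ragas as one context. If separate chunks are wanted, the cell could be split first. The following is a minimal sketch, assuming each chunk is wrapped in `'''` and chunks are separated by commas as described in the notes of cell 2. `split_contexts` is a hypothetical helper, not part of Ragas, and the sample cell text is made up for illustration.

```python
import re

# Matches text wrapped in triple single quotes (non-greedy, may span newlines).
_TRIPLE_QUOTED = re.compile(r"'''(.*?)'''", re.DOTALL)

def split_contexts(cell: str) -> list[str]:
    """Split one context cell into a list of context strings.

    Falls back to a single-element list when no '''-quoted chunk is found.
    """
    chunks = [chunk.strip() for chunk in _TRIPLE_QUOTED.findall(cell)]
    chunks = [chunk for chunk in chunks if chunk]
    return chunks if chunks else [cell.strip()]

# Illustrative cell content (not taken from the ground truth file).
cell = "'''HNSW builds a layered graph.''', '''M is the maximum out-degree.'''"
contexts = split_contexts(cell)
print(contexts)
```

With such a helper, the wrapping line in `assemble_ragas_dataset` could become `context_list = [split_contexts(context) for context in context_list]`.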

In [4]:
# Create a Ragas HuggingFace Dataset from the ground truth lists.
ragas_input_ds = assemble_ragas_dataset(eval_df)
display(ragas_input_ds)

Dataset({
    features: ['question', 'contexts', 'answer', 'ground_truth'],
    num_rows: 4
})
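
The `features` and `num_rows` shown above rely on every column list having the same length and on `contexts` being a list of lists of strings. A small pre-check can catch shape problems before any paid LLM call is made. This is a sketch only: `check_ragas_columns` is a hypothetical helper, not a Ragas or Datasets API, and the sample row is abbreviated for illustration.

```python
def check_ragas_columns(columns: dict) -> int:
    """Validate the column dict intended for Dataset.from_dict; return the row count."""
    required = ("question", "contexts", "answer", "ground_truth")
    missing = [name for name in required if name not in columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    # All columns must describe the same number of rows.
    lengths = {name: len(columns[name]) for name in required}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")

    # Each row's contexts must be a list of strings (list[list[str]] overall).
    for i, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise ValueError(f"contexts[{i}] must be a list of strings")

    return lengths["question"]

columns = {
    "question": ["What do the parameters for HNSW mean?"],
    "contexts": [["M: maximum out-degree of the graph in a layer"]],
    "answer": ["M is the maximum out-degree of the graph."],
    "ground_truth": ["M: maximum out-degree of the graph in a layer"],
}
n_rows = check_ragas_columns(columns)
print(f"{n_rows} row(s) look well-formed")
```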

In [5]:
# For debugging, inspect all the data.
ragas_input_df = ragas_input_ds.to_pandas()
display(ragas_input_df.head())

|   | question | contexts | answer | ground_truth |
|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean?\n | ['''the next layer to begin another search. Af... | The parameters for HNSW are as follows:\n- M: ... | M: maximum out-degree of the graph in a layer;... |
| 1 | What are HNSW good default parameters when dat... | ['''to reduce the probability that the target ... | A good default value for the HNSW parameters w... | M=16, efConstruction=32, ef=32 |
| 2 | what is the default distance metric used in AU... | ['''please refer to Milvus documentation index... | The default AUTOINDEX distance metric in Milvu... | Trick answer: IP inner product, not yet updat... |
| 3 | How did New York City get its name? | ['''Etymology See also: Nicknames of New York ... | New York City was originally named New Amsterd... | In the 1600’s, the Dutch planted a trading pos... |


### 2. Start Ragas Evaluation with a custom Evaluation LLM

The default OpenAI model used by Ragas is `gpt-3.5-turbo-16k`; the cells below override it with `gpt-3.5-turbo`.

Note that evaluation consumes a large number of OpenAI API tokens. Every question and every metric you evaluate sends requests to the OpenAI service, so keep an eye on your token consumption. 
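
As a rough planning aid, the number of evaluation jobs grows as questions times metrics: the progress bar in the evaluation cell of this notebook counts 12 jobs for 4 questions and 3 metrics. The sketch below is only an estimate. A single job may issue more than one LLM request, so `calls_per_job` is an assumed placeholder to adjust from your own observed usage, and `estimate_eval_jobs` is a hypothetical helper.

```python
def estimate_eval_jobs(n_questions: int, n_metrics: int, calls_per_job: int = 1) -> int:
    """Estimate how many LLM requests an evaluation run may trigger."""
    if min(n_questions, n_metrics, calls_per_job) < 0:
        raise ValueError("all arguments must be non-negative")
    return n_questions * n_metrics * calls_per_job

# 4 questions and 3 metrics, as in this notebook.
jobs = estimate_eval_jobs(n_questions=4, n_metrics=3)
print(f"Estimated evaluation jobs: {jobs}")
```

Checking this number before a run makes it easier to decide whether to trim the question set or drop a metric.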

In [6]:
import os, openai, pprint
from openai import OpenAI

# Read the API key from an environment variable.
# https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety
openai_api_key=os.environ['OPENAI_API_KEY']
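
Indexing `os.environ` directly raises a bare `KeyError` when the variable is unset. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper name, and the error wording is only a suggestion.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; "
            "export it before running this notebook."
        )
    return value

# Usage (commented out so the sketch runs without a real key):
# openai_api_key = require_env("OPENAI_API_KEY")
```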

In [7]:
# Choose the metrics you want to see.
# The context relevancy metric is omitted: it is deprecated and no longer maintained.
from ragas.metrics import (
    context_recall, 
    context_precision, 
    faithfulness, 
    # answer_relevancy, 
    # answer_similarity,
    )
metrics = ['context_recall', 'context_precision', 'faithfulness']

# Change the LLM used as the critic.
# You can also swap in a HuggingFace open LLM here if you want.
# https://docs.ragas.io/en/stable/howtos/customisations/bring-your-own-llm-or-embs.html
from ragas.llms import llm_factory
LLM_NAME = "gpt-3.5-turbo"
# Default temperature = 1e-8
ragas_llm = llm_factory(model=LLM_NAME)

# Also change the embeddings using HuggingFace models.
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas.embeddings import LangchainEmbeddingsWrapper
EMB_NAME = "BAAI/bge-large-en-v1.5"
lc_embeddings = HuggingFaceEmbeddings(model_name=EMB_NAME)

# # Alternatively use OpenAI embedding models.
# # https://openai.com/blog/new-embedding-models-and-api-updates
# from langchain_openai.embeddings import OpenAIEmbeddings
# lc_embeddings = OpenAIEmbeddings(
#     model="text-embedding-3-small", 
#     # 512 or 1536 possible for 3-small
#     # 256, 1024, or 3072 for 3-large
#     dimensions=512)
ragas_emb = LangchainEmbeddingsWrapper(embeddings=lc_embeddings)

# Change the default models used for each metric.
for metric in (context_recall, context_precision, faithfulness):
    metric.llm = ragas_llm
    metric.embeddings = ragas_emb

In [8]:
# Evaluate the dataset.
from ragas import evaluate

ragas_result = evaluate(
    ragas_input_ds,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
    ],
    llm=ragas_llm,
)

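# View evaluations.
ragas_output_df = ragas_result.to_pandas()
# Compute a per-question context F1 (harmonic mean of precision and recall).
# Missing scores are treated as 0, and F1 is set to 0 when precision + recall is 0.
temp = ragas_output_df.fillna(0.0)
temp['context_f1'] = (2.0 * temp.context_precision * temp.context_recall
                      / (temp.context_precision + temp.context_recall)).fillna(0.0)
temp.head()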

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

Invalid JSON response. Expected dictionary with key 'Attributed'
Invalid JSON response. Expected dictionary with key 'Attributed'


|   | question | contexts | answer | ground_truth | context_precision | context_recall | faithfulness | context_f1 |
|---|---|---|---|---|---|---|---|---|
| 0 | What do the parameters for HNSW mean?\n | ['''the next layer to begin another search. Af... | The parameters for HNSW are as follows:\n- M: ... | M: maximum out-degree of the graph in a layer;... | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | What are HNSW good default parameters when dat... | ['''to reduce the probability that the target ... | A good default value for the HNSW parameters w... | M=16, efConstruction=32, ef=32 | 0.0 | 1.0 | 0.333333 | 0.0 |
| 2 | what is the default distance metric used in AU... | ['''please refer to Milvus documentation index... | The default AUTOINDEX distance metric in Milvu... | Trick answer: IP inner product, not yet updat... | 1.0 | 1.0 | 0.5 | 1.0 |
| 3 | How did New York City get its name? | ['''Etymology See also: Nicknames of New York ... | New York City was originally named New Amsterd... | In the 1600’s, the Dutch planted a trading pos... | 1.0 | 0.0 | 0.333333 | 0.0 |


In [9]:
# Display the mean retrieval score.
print(f"Using {eval_df.shape[0]} eval questions, Mean Retrieval F1 Score = {np.round(temp.context_f1.mean(),2)}")

Using 4 eval questions, Mean Retrieval F1 Score = 0.25
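
The `context_f1` column is the harmonic mean of context precision `P` and context recall `R`, that is `F1 = 2 * P * R / (P + R)`. The formula is undefined when both scores are 0, so the sketch below returns 0.0 in that case (a choice made here for illustration). It recomputes the mean from the four (precision, recall) pairs in the output table above; `context_f1` as a function name is hypothetical.

```python
def context_f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, defined as 0.0 when both are 0."""
    denom = precision + recall
    if denom == 0:
        return 0.0
    return 2.0 * precision * recall / denom

# (context_precision, context_recall) for the four questions in the table above.
rows = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
scores = [context_f1(p, r) for p, r in rows]
mean_f1 = sum(scores) / len(scores)
print(f"Per-question F1: {scores}, mean = {round(mean_f1, 2)}")
```

Only the third question has both non-zero precision and non-zero recall, which is why the mean lands at 0.25 for this small set.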


In [10]:
# Props to Sebastian Raschka for this handy watermark.
# !pip install watermark

%load_ext watermark
%watermark -a 'Christy Bergman' -v -p datasets,langchain,openai,ragas --conda

Author: Christy Bergman

Python implementation: CPython
Python version       : 3.11.7
IPython version      : 8.21.0

datasets : 2.18.0
langchain: 0.1.11
openai   : 1.13.3
ragas    : 0.1.5

conda environment: py311

