![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)
# Evaluating RAG

This notebook uses the [ragas library](https://docs.ragas.io/en/stable/getstarted/index.html) and [Redis](https://redis.com) to evaluate the performance of a sample RAG application. See the original [source paper](https://arxiv.org/pdf/2309.15217) for a more detailed understanding.

## Let's Begin!
<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/RAG/06_ragas_evaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

To start, we need a RAG app to evaluate. Let's create one using LangChain and connect it with Redis as the vector DB.

## Init redis, data prep, and populating the vector DB

In [4]:
# install deps
# NBVAL_SKIP
%pip install -q redis "unstructured[pdf]" sentence-transformers langchain langchain-community langchain-openai ragas datasets

Note: you may need to restart the kernel to use updated packages.


#### Running Redis in Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are many ways to get the necessary redis-stack instance running:
1. On cloud, deploy a [FREE instance of Redis in the cloud](https://redis.com/try-free/). Or, if you have your own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

In [5]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
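The scheme choice mentioned in the comment above (`redis://` vs `rediss://`) can be folded into a small helper. This is a minimal sketch: `build_redis_url` is a hypothetical name, not part of redis-py, and it assumes the client percent-decodes the password when parsing the URL.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", ssl: bool = False) -> str:
    """Assemble a Redis connection URL (hypothetical helper, not a redis-py API)."""
    # rediss:// signals a TLS-enabled endpoint, redis:// a plain one.
    scheme = "rediss" if ssl else "redis"
    # Percent-encode the password so characters such as '@' or '/' do not
    # break URL parsing (assumes the client decodes it again).
    return f"{scheme}://:{quote(password, safe='')}@{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("example-host", "18374", "p@ss", ssl=True))
```

With the default arguments this produces the same URL format as the `REDIS_URL` f-string in the cell above.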

In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

source_doc = "resources/nke-10k-2023.pdf"

loader = UnstructuredFileLoader(
    source_doc, mode="single", strategy="fast"
)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)

chunks = loader.load_and_split(text_splitter)
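To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified fixed-width character chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `chunk_text` is a hypothetical helper for illustration only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts chunk_size - chunk_overlap characters after the previous one.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4))                   # no shared characters
print(chunk_text(sample, chunk_size=4, chunk_overlap=2))  # neighbors share 2 characters
```

Smaller chunks give the retriever finer-grained units to rank, while overlap reduces the chance that a relevant passage is cut in half at a chunk boundary.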

In [None]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [21]:
from langchain.vectorstores.redis import Redis as LangChainRedis


# set the index name for this example
index_name = "ragas_ex"

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "source"}, {"name": "content"}],
    "content_vector_key": "chunk_vector"    # name of the vector field in langchain
}


# construct the vector store class from texts and metadata
rds = LangChainRedis.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
    redis_url=REDIS_URL,
    index_schema=index_schema,
)

`index_schema` does not match generated metadata schema.
If you meant to manually override the schema, please ignore this message.
index_schema: {'vector': [{'name': 'chunk_vector', 'algorithm': 'HNSW', 'dims': 384, 'distance_metric': 'COSINE', 'datatype': 'FLOAT32'}], 'text': [{'name': 'source'}, {'name': 'content'}], 'content_vector_key': 'chunk_vector'}
generated_schema: {'text': [{'name': 'source'}], 'numeric': [], 'tag': []}



## Test the vector store

In [7]:
rds.similarity_search("What was nike's revenue last year?")[0]

Document(page_content='As discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\n\n11 % 308 %\n\nTOTAL NIKE BRAND Converse\n\n$\n\n48,763 $ 2,427\n\n44,436 2,346\n\n10 % 3 %\n\n16 % $ 8 %\n\n42,293 2,205\n\n5 % 6 

## Setup RAG

Now that the vector db is populated let's initialize our RAG app.

In [13]:
import getpass
from langchain_openai import OpenAI

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY"))

In [9]:
def get_prompt():
    """Create the QA prompt template."""
    from langchain.prompts import PromptTemplate

    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. If you don't know the answer, say that you don't know, don't try to make up an answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

In [10]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(
        search_type="similarity_distance_threshold",
        search_kwargs={"distance_threshold": 0.5},
    ),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    #verbose=True
)

## Test it out

In [11]:
query = "What was nike's revenue last year?"
res = qa.invoke(query)
res

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


{'query': "What was nike's revenue last year?",
 'result': '\n$51,217 million.',
 'source_documents': [Document(page_content='As discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\n\n11 % 308 %\n\nTOTAL NIKE BR

## Creating a test set

Now that our setup is complete and we have a RAG app to evaluate, we need a test set to evaluate against. The ragas library provides a helpful class for generating a synthetic test set from our data, which we will use here. The output of this generation is a set of `questions`, `contexts`, and `ground_truth`.

The questions are generated by an LLM based on slices of context from the provided doc, and the ground_truth is determined via a critic LLM. Note that there is nothing special about this data itself; you can provide your own `questions` and `ground_truth` for evaluation purposes. When starting a project, however, quality human-labeled data is often lacking, so a synthetic dataset is a valuable place to start before live user/process data is available (incorporating that data should be the ultimate goal).

For more detail, see [the docs](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

In [None]:
# NBVAL_SKIP
# source: https://docs.ragas.io/en/latest/getstarted/testset_generation.html
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-3.5-turbo-16k") # can use more advanced model here.
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# generate testset
testset = generator.generate_with_langchain_docs(chunks, test_size=10, distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25})

## Check testset output

We will also save the test set to a file. This is good practice so that we don't have to regenerate examples on every run.

In [15]:
# NBVAL_SKIP
testset_df = testset.to_pandas()
testset_df.to_csv("resources/testset.csv", index=False)
testset_df.head()

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What potential consequences could securities c... | [Our Class B Common Stock is traded publicly, ... | Any litigation could result in reputational da... | simple | [{'source': 'resources/nke-10k-2023.pdf'}] | True |
| 1 | How do currency exchange rates impact the fina... | [Economic factors beyond our control, and chan... | Currency exchange rate fluctuations could disr... | simple | [{'source': 'resources/nke-10k-2023.pdf'}] | True |
| 2 | Why are cost-effective investments considered ... | [From time to time, we may invest in technolog... | Cost-effective investments are considered esse... | simple | [{'source': 'resources/nke-10k-2023.pdf'}] | True |
| 3 | What changes were made to the U.S. corporate i... | [FISCAL 2023 COMPARED TO FISCAL 2022\n\nOther ... | The Inflation Reduction Act of 2022 made chang... | simple | [{'source': 'resources/nke-10k-2023.pdf'}] | True |
| 4 | How does master netting arrangements impact re... | [The Company records the assets and liabilitie... | The Company's derivative financial instruments... | multi_context | [{'source': 'resources/nke-10k-2023.pdf'}] | True |
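Since the test set is saved to CSV to avoid regenerating it, a later session needs to read it back. One wrinkle is that list-valued columns such as `contexts` come back from CSV as plain strings. The sketch below shows one way to handle that; `load_testset` is a hypothetical helper, and the tiny frame in the demo is made up purely to show the round trip.

```python
import ast
import os
import tempfile

import pandas as pd


def load_testset(path: str) -> pd.DataFrame:
    """Read a saved test set and turn the stringified `contexts` column back into lists."""
    df = pd.read_csv(path)
    if "contexts" in df.columns:
        # CSV stores Python lists as their string repr; literal_eval safely parses them back.
        df["contexts"] = df["contexts"].apply(ast.literal_eval)
    return df


# Round-trip demo with a made-up single-row frame (not real test set data).
demo = pd.DataFrame(
    {"question": ["q1"], "contexts": [["chunk a", "chunk b"]], "ground_truth": ["a1"]}
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "testset.csv")
    demo.to_csv(demo_path, index=False)
    reloaded = load_testset(demo_path)

print(type(reloaded.loc[0, "contexts"]))
```

In this notebook the call would be `load_testset("resources/testset.csv")`. The evaluation helpers below only read `question` and `ground_truth`, so parsing `contexts` matters mainly if you want to inspect the generated contexts.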


## Evaluation helper functions

The following code takes a RetrievalQA chain, testset dataframe, and the metrics to be evaluated and returns a dataframe including the metrics calculated.

In [16]:

# define reusable helper function for evaluating our test set against different chains
import pandas as pd
from datasets import Dataset
from ragas import evaluate

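def parse_contexts(source_docs):
    """Extract the raw text from each retrieved source document."""
    return [doc.page_content for doc in source_docs]

def create_evaluation_dataset(chain: RetrievalQA, testset: pd.DataFrame) -> dict:
    """Run every test question through the chain and collect the columns ragas expects."""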
    res_set = {
        "question": [],
        "answer": [],
        "contexts": [],
        "ground_truth": []
    }

    for _, row in testset.iterrows():
        # call QA chain
        result = chain.invoke(row["question"])

        res_set["question"].append(row["question"])
        res_set["answer"].append(result["result"])
        res_set["contexts"].append(parse_contexts(result["source_documents"]))
        res_set["ground_truth"].append(str(row["ground_truth"]))

    return res_set

def evaluate_chain(chain: RetrievalQA, testset: pd.DataFrame, test_name: str, metrics: list):
    """Evaluate the chain on the test set with the given ragas metrics and save the results to CSV."""
    eval_dataset = create_evaluation_dataset(chain, testset)

    parsed = Dataset.from_dict(eval_dataset)

    eval_result = evaluate(
        parsed,
        metrics=metrics
    )

    eval_df = eval_result.to_pandas()
    # store the results of our test for future reference in csv
    eval_df.to_csv(f"{test_name}.csv")
    return eval_df

# First let's evaluate generation metrics
Generation metrics quantify how well the RAG app did at creating answers to the provided questions (i.e. the G in **R**etrieval **A**ugmented **G**eneration). We will calculate the generation metrics **faithfulness** and **answer relevancy** for this example.

The ragas library conveniently abstracts the calculation of these metrics so we don't have to write redundant code, but please review the following definitions to build intuition for what these metrics actually measure.

Note: the following examples are paraphrased from the [ragas docs](https://docs.ragas.io/en/stable/concepts/metrics/index.html)

------

### Faithfulness

An answer to a question can be said to be "faithful" if the **claims** that are made in the answer **can be inferred** from the **context**.

#### Mathematically:

$$
Faithfulness\ score = \frac{Number\ of\ claims\ in\ the\ generated\ answer\ that\ can\ be\ inferred\ from\ the\ given\ context}{Total\ number\ of\ claims\ in\ the\ generated\ answer}
$$

#### Example process:

> Question: Where and when was Einstein born?
> 
> Context: Albert Einstein (born 14 March 1879) was a German-born theoretical physicist, widely held to be one of the greatest and most influential scientists of all time
>
> answer: Einstein was born in Germany on 20th March 1879.

Step 1: Use LLM to break generated answer into individual statements.
- “Einstein was born in Germany.”
- “Einstein was born on 20th March 1879.”

Step 2: For each statement use LLM to verify if it can be inferred from the context.
- “Einstein was born in Germany.” => yes. 
- “Einstein was born on 20th March 1879.” => no.

Step 3: plug into formula

- Number of claims inferred from context = 1
- Total number of claims = 2
- Faithfulness = 1/2
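The arithmetic in Step 3 can be sketched in a few lines. This assumes the LLM verdicts from Step 2 are already available as booleans; `faithfulness_score` is a hypothetical helper for illustration, not the ragas implementation.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of claims in the answer that can be inferred from the context."""
    if not verdicts:
        raise ValueError("the answer produced no claims to verify")
    # True counts as 1 and False as 0, so sum() gives the number of supported claims.
    return sum(verdicts) / len(verdicts)


# Verdicts from the Einstein example: one supported claim, one unsupported.
verdicts = {
    "Einstein was born in Germany.": True,
    "Einstein was born on 20th March 1879.": False,
}
print(faithfulness_score(list(verdicts.values())))  # 0.5
```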

### Answer Relevance

An answer can be said to be relevant if it directly addresses the question (intuitively).

#### Example process:

1. Use an LLM to generate "hypothetical" questions to a given answer with the following prompt:

    > Generate a question for the given answer.
    > answer: [answer]

2. Embed the generated "hypothetical" questions as vectors.
3. Calculate the cosine similarity of the hypothetical questions and the original question, sum those similarities, and divide by n.

With data:

> Question: Where is France and what is its capital?
> 
> Answer: France is in western Europe.

Step 1 - use LLM to create 'n' variants of question from the generated answer.

- “In which part of Europe is France located?”
- “What is the geographical location of France within Europe?”
- “Can you identify the region of Europe where France is situated?”

Step 2 - Calculate the mean cosine similarity between the generated questions and the actual question.
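The final averaging step can be sketched without any embedding model. The 2-D vectors below are toy stand-ins for real embeddings (an assumption made so the example runs offline), and `answer_relevancy_score` is a hypothetical helper, not the ragas implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy_score(question_vec: list[float], generated_vecs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and the generated questions."""
    sims = [cosine_similarity(question_vec, vec) for vec in generated_vecs]
    return sum(sims) / len(sims)


# Toy vectors: one generated question points the same way as the original,
# the other is orthogonal to it.
question_vec = [1.0, 0.0]
generated_vecs = [[2.0, 0.0], [0.0, 3.0]]
print(answer_relevancy_score(question_vec, generated_vecs))  # 0.5
```

An answer that only covers part of the question, as in the France example, tends to produce generated questions that drift away from the original, which lowers the mean similarity.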

## Now let's implement using our helper functions



In [17]:
# NBVAL_SKIP
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)

eval_metrics = [
    answer_relevancy,
    faithfulness,
]

gen_basic_rag_test = evaluate_chain(qa, testset_df, "resources/generation_basic_rag_test", eval_metrics)

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

In [18]:
# NBVAL_SKIP
gen_basic_rag_test.describe()

| | answer_relevancy | faithfulness |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 0.955967 | 0.958333 |
| std | 0.03971 | 0.102062 |
| min | 0.901141 | 0.75 |
| 25% | 0.930429 | 1.0 |
| 50% | 0.959636 | 1.0 |
| 75% | 0.985929 | 1.0 |
| max | 1.0 | 1.0 |


# Next let's evaluate the retrieval metrics

Retrieval metrics quantify how well the system performed at fetching the best possible context for generation. As before, please review the definitions below to understand what happens under the hood when we execute the evaluation code.

-----

### Context Relevance

"The context is considered relevant to the extent that it exclusively contains information that is needed to answer the question."

#### Example process:

1. Use the following LLM prompt to extract a subset of sentences necessary to answer the question. The context is defined as the formatted search result from the vector database.

    > Please extract relevant sentences from
    > the provided context that can potentially
    > help answer the following `{question}`. If no
    > relevant sentences are found, or if you
    > believe the question cannot be answered
    > from the given context, return the phrase
    > "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences
    > from given `{context}`.

2. Compute the context relevance score = (number of extracted sentences) / (total number of sentences in context)
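The ratio in step 2 can be sketched as follows. The sentence splitter is a naive regex, and the choice of which sentence was "extracted" is an assumption standing in for the LLM call in step 1; `context_relevance_score` is a hypothetical helper for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def context_relevance_score(extracted: list[str], context: str) -> float:
    """Number of extracted sentences divided by the number of sentences in the context."""
    total = len(split_sentences(context))
    if total == 0:
        raise ValueError("context contains no sentences")
    return len(extracted) / total


context = (
    "France, in Western Europe, encompasses medieval cities, alpine villages and "
    "Mediterranean beaches. The country is also renowned for its wines and "
    "sophisticated cuisine. Lascaux's ancient cave drawings, Lyon's Roman theater "
    "and the vast Palace of Versailles attest to its rich history."
)
# Assumption: for the question "Where is France?" the LLM extracted only the first sentence.
extracted = [split_sentences(context)[0]]
print(context_relevance_score(extracted, context))
```

Here one of three sentences is needed, so the score is 1/3: most of the retrieved context is not required to answer the question.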

Moving from the original paper to the actively maintained ragas library, there are a few more insightful metrics to evaluate. From the library [docs](https://docs.ragas.io/en/stable/concepts/metrics/index.html), let's introduce `context precision` and `context recall`.

### Context recall
Context can be said to have high recall if retrieved context aligns with the ground truth answer.

#### Mathematically:

$$
Context\ recall = \frac{Ground\ Truth\ sentences\ that\ can\ be\ attributed\ to\ context}{Total\ number\ of\ sentences\ in\ the\ ground\ truth}
$$

#### Example process:

Data:
> Question: Where is France and what is its capital?
>
> Ground truth answer: France is in Western Europe and its capital is Paris.
>
> Context: France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and the vast Palace of Versailles attest to its rich history.
>
> Note: the ground truth answer can be created by a critic LLM or taken from your own human-labeled dataset.

Step 1 - use an LLM to break the ground truth down into individual statements:
- `France is in Western Europe`
- `Its capital is Paris`

Step 2 - for each ground truth statement, use an LLM to determine if it can be attributed to the context.
- `France is in Western Europe` => yes
- `Its capital is Paris` => no


Step 3 - plug in to formula

context recall = (1 + 0) / 2 = 0.5
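The same calculation as a sketch, assuming the per-statement verdicts from Step 2 are already available. Returning the unattributed statements alongside the score is a design choice made here because they show what the retriever failed to fetch; `context_recall_score` is a hypothetical helper, not the ragas implementation.

```python
def context_recall_score(attributions: dict[str, bool]) -> tuple[float, list[str]]:
    """Return (recall, statements that could not be attributed to the context)."""
    if not attributions:
        raise ValueError("ground truth produced no statements")
    missing = [statement for statement, found in attributions.items() if not found]
    recall = (len(attributions) - len(missing)) / len(attributions)
    return recall, missing


# Verdicts from the France example above.
attributions = {
    "France is in Western Europe": True,
    "Its capital is Paris": False,
}
recall, missing = context_recall_score(attributions)
print(recall)   # 0.5
print(missing)  # ['Its capital is Paris']
```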

### Context precision

This metric relates to how chunks are ranked in a response. Ideally the most relevant chunks are at the top.

#### Mathematically:

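$$
Context\ Precision@K = \frac{\sum_{k=1}^{K} \left(Precision@k \times v_k\right)}{Total\ number\ of\ relevant\ items\ in\ the\ top\ K\ results}
$$

$$
Precision@k = \frac{true\ positives@k}{true\ positives@k + false\ positives@k}
$$

Here $K$ is the number of retrieved chunks and $v_k \in \{0, 1\}$ is 1 if the chunk at rank $k$ is relevant and 0 otherwise.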

#### Example process:

Data:
> Question: Where is France and what is its capital?
> 
> Ground truth: France is in Western Europe and its capital is Paris.
> 
> Context: [ “The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”, “France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”]

Step 1 - for each chunk use the LLM to check if it's relevant or not to the ground truth answer.

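Step 2 - for each chunk in the context calculate the precision at its rank, defined as true positives / (true positives + false positives), counting all chunks up to and including that rank: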
- `“The country is also renowned for its wines and sophisticated cuisine. Lascaux’s ancient cave drawings, Lyon’s Roman theater and”` => precision = 0/1 or 0.
- `“France, in Western Europe, encompasses medieval cities, alpine villages and Mediterranean beaches. Paris, its capital, is famed for its fashion houses, classical art museums including the Louvre and monuments like the Eiffel Tower”` => the precision would be (1) / (1 true positive + 1 false positive) = 0.5. 


Step 3 - calculate the overall context precision = (0 + 0.5) / 1 = 0.5
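The steps above can be sketched as a small function: precision is computed at every rank, only ranks holding a relevant chunk contribute to the sum, and the sum is divided by the number of relevant chunks. This follows the ragas definition as I understand it; `context_precision_at_k` is a hypothetical helper, and the relevance flags are assumed to come from the LLM check in Step 1.

```python
def context_precision_at_k(relevance: list[int]) -> float:
    """relevance[i] is 1 if the chunk at rank i + 1 is relevant, else 0."""
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant chunk was retrieved at all
    score = 0.0
    hits = 0
    for rank, is_relevant in enumerate(relevance, start=1):
        hits += is_relevant
        precision_at_rank = hits / rank           # true positives / chunks seen so far
        score += precision_at_rank * is_relevant  # only relevant ranks contribute
    return score / total_relevant


print(context_precision_at_k([0, 1]))  # 0.5, the example above: relevant chunk ranked second
print(context_precision_at_k([1, 0]))  # 1.0, same chunks with the relevant one ranked first
```

The second call shows why this is a ranking metric: the retrieved chunks are identical, but putting the relevant one first doubles the score.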

In [19]:
# NBVAL_SKIP
from ragas.metrics import (
    context_precision,
    context_recall
)

eval_metrics = [
    context_precision,
    context_recall
]

ret_basic_rag_test = evaluate_chain(qa, testset_df, "resources/retrieval_basic_rag_test", eval_metrics)

Evaluating:   0%|          | 0/12 [00:00<?, ?it/s]

In [20]:
# NBVAL_SKIP
ret_basic_rag_test.describe()

| | context_precision | context_recall |
|---|---|---|
| count | 6.0 | 6.0 |
| mean | 0.833333 | 0.666667 |
| std | 0.408248 | 0.516398 |
| min | 0.0 | 0.0 |
| 25% | 1.0 | 0.25 |
| 50% | 1.0 | 1.0 |
| 75% | 1.0 | 1.0 |
| max | 1.0 | 1.0 |


## Review

- we initialized our RAG app with data from a 10k document
- generated a testset to evaluate 
- calculated both retrieval and generation metrics

## Analysis
- Both generation metrics averaged above 95% on our small example. This means two things: 1) the generated answers were reliably attributable to the provided context, and 2) the hypothetical questions that could be asked of our generated answers align with the actual questions asked.
- The retrieval metrics reveal the most room for improvement. Looking at the data generated, we can see that retrieval fetched a lot of information that is not required for answering the questions provided. This indicates we may want to experiment with a smaller chunk size, or with creating dense propositions from the initial chunks, to improve the quality of retrieval.

## Next steps

Now that we know how to measure our system and have a baseline in place, we can quickly and easily experiment with different techniques to improve it.
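One way to keep such experiments comparable is to put the mean of each metric side by side, one row per run. The sketch below assumes each run's results are in a dataframe like the ones `evaluate_chain` returns; `compare_runs` is a hypothetical helper, and the numbers in the demo are placeholders chosen for easy arithmetic, not measurements.

```python
import pandas as pd


def compare_runs(runs: dict[str, pd.DataFrame], metrics: list[str]) -> pd.DataFrame:
    """Return one row per run holding the mean of each metric column."""
    # Each run contributes a Series of metric means; transpose so runs become rows.
    return pd.DataFrame({name: df[metrics].mean() for name, df in runs.items()}).T


# Placeholder values only, to show the shape of the output.
baseline = pd.DataFrame({"context_precision": [1.0, 0.0], "context_recall": [1.0, 0.5]})
candidate = pd.DataFrame({"context_precision": [1.0, 1.0], "context_recall": [0.5, 1.0]})

summary = compare_runs(
    {"baseline": baseline, "smaller_chunks": candidate},
    ["context_precision", "context_recall"],
)
print(summary)
```

In practice the frames could be loaded from the CSVs saved earlier, for example `pd.read_csv("resources/retrieval_basic_rag_test.csv")`.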