# How to evaluate a RAG application

This example uses [LangChain](https://www.langchain.com) and [Giskard](https://github.com/Giskard-AI/giskard) to evaluate the quality of a RAG application.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-4"

## Scrape the Website and Split the Content

In [3]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

loader = WebBaseLoader("https://en.wikipedia.org/wiki/Human_mission_to_Mars")
documents = loader.load_and_split(text_splitter)
documents[:5]

[Document(page_content='Human mission to Mars - Wikipedia\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nJump to content\n\n\n\n\n\n\n\nMain menu\n\n\n\n\n\nMain menu\nmove to sidebar\nhide\n\n\n\n\t\tNavigation\n\t\n\n\nMain pageContentsCurrent eventsRandom articleAbout WikipediaContact usDonate\n\n\n\n\n\n\t\tContribute\n\t\n\n\nHelpLearn to editCommunity portalRecent changesUpload file\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\nSearch\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCreate account\n\nLog in\n\n\n\n\n\n\n\n\nPersonal tools\n\n\n\n\n\n Create account Log in\n\n\n\n\n\n\t\tPages for logged out editors learn more\n\n\n\nContributionsTalk\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nContents\nmove to sidebar\nhide\n\n\n\n\n(Top)\n\n\n\n\n\n1Travel to Mars\n\n\n\n\n\n\n\n2Landing on Mars\n\n\n\nToggle Landing on Mars subsection\n\n\n\n\n\n2.1Orbital capture\n\n\n\n\n\n\n\n2.2Survey work\n\n\n\n\n\n\
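
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sketch of fixed-size chunking with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and newlines), and `chunk_text` is a hypothetical helper name used only for illustration.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Split text into fixed-size windows that share chunk_overlap characters.

    Simplified illustration only; not LangChain's splitting algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Tiny example: windows of 4 characters, each sharing 1 character
# with the previous window.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A small overlap like the `chunk_overlap=20` used above keeps a little shared text across chunk boundaries, so a sentence cut at a boundary is less likely to lose its context entirely.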

## Load the Content in a Vector Store

In [4]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(
    documents, embedding=OpenAIEmbeddings()
)

## Create a Knowledge Base

Let's start by loading the content into a pandas DataFrame.

In [7]:
import pandas as pd

df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
df.head(10)

| | text |
|---|---|
| 0 | Human mission to Mars - Wikipedia\n\n\n\n\n\n\... |
| 1 | 5.2Mars sample return missions\n\n\n\n\n\n\n\n... |
| 2 | Proposed concepts\n"Man on Mars" redirects her... |
| 3 | Conceptual proposals for missions that would i... |
| 4 | Meanwhile, the uncrewed exploration of Mars ha... |
| 5 | Travel to Mars[edit]\nThe minimum distance bet... |
| 6 | Several types of mission plans have been propo... |
| 7 | Shorter Mars mission plans have round-trip fli... |
| 8 | perhaps 10–30 days before it needed to launch ... |
| 9 | In the 1980s, it was suggested that aerobrakin... |


We can now create a Knowledge Base using the DataFrame we created before.

In [9]:
from giskard.rag import KnowledgeBase

knowledge_base = KnowledgeBase(df)



## Generate the Test Set

In [10]:
from giskard.rag import generate_testset

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description="A chatbot answering questions about Human Missions on MARS",
)

2024-05-19 10:05:08,193 pid:509 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
2024-05-19 10:05:17,552 pid:509 MainThread giskard.rag  INFO     Found 3 topics in the knowledge base.



Let's display a few samples from the test set.

In [12]:
test_set_df = testset.to_pandas()

# Print the first three samples, numbering them from 1.
for number, (_, row) in enumerate(test_set_df.head(3).iterrows(), start=1):
    print(f"Question {number}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row["reference_context"])
    print("******************", end="\n\n")


Question 1: What is the purpose of the 'Red Rocks Project' proposed by Lockheed Martin?
Reference answer: The 'Red Rocks Project', proposed by Lockheed Martin as part of their 'Stepping stones to Mars' project, aims to explore Mars robotically from Deimos.
Reference context:
Document 27: Missions to Deimos or Phobos[edit]
Many Mars mission concepts propose precursor missions to the moons of Mars, for example a sample return mission to the Mars moon Phobos[63] – not quite Mars, but perhaps a convenient stepping stone to an eventual Martian surface mission. Lockheed Martin, as part of their "Stepping stones to Mars" project, called the "Red Rocks Project", proposed to explore Mars robotically from Deimos.[64][65][66]
Use of fuel produced from water resources on Phobos or Deimos has also been proposed.
******************

Question 2: When is NASA planning to launch astronauts to Mars?
Reference answer: NASA is planning to launch astronauts to Mars in the 2030s.
Reference context:
Document

Let's now save the test set to a file:

In [13]:
testset.save("test-set.jsonl")
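
If you want to inspect a saved file without going through Giskard, a generic JSON Lines reader is enough. The sketch below assumes only that the file holds one JSON object per line, which is what the `.jsonl` extension suggests. `read_jsonl` is a hypothetical helper, and the field names in the demo data are illustrative rather than a guaranteed match for what `testset.save()` writes.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with a made-up two-line file (illustrative field names only).
sample_lines = [
    {"question": "What is a day on Mars called?", "reference_answer": "Sol"},
    {"question": "Which moons orbit Mars?", "reference_answer": "Phobos and Deimos"},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        for record in sample_lines:
            handle.write(json.dumps(record) + "\n")
    loaded = read_jsonl(sample_path)

print(len(loaded), "records loaded")
```

Against the file saved above, the call would be `read_jsonl("test-set.jsonl")`.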

## Prepare the Prompt Template

In [14]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: Here is some context

Question: Here is a question



## Create the RAG Chain

Create a retriever from the vector store. Given a question, it returns the documents that are most similar to it.

In [17]:
retriever = vectorstore.as_retriever()
# Retrievers are runnables, so `invoke` replaces the deprecated `get_relevant_documents`.
retriever.invoke("How many MARS mission happened till now?")

[Document(page_content='Conceptual proposals for missions that would involve human explorers started in the early 1950s, with planned missions typically being stated as taking place between 10 and 30 years from the time they are drafted.[2] The list of crewed Mars mission plans shows the various mission proposals that have been put forth by multiple organizations and space agencies in this field of space exploration. The plans for these crews have varied—from scientific expeditions, in which a small group (between two and eight astronauts) would visit Mars for a period of a few weeks or more, to a continuous presence (e.g. through research stations, colonization, or other continuous habitation).[citation needed] Some have also considered exploring the Martian moons of Phobos and Deimos.[3] By 2020, virtual visits to Mars, using haptic technologies, had also been proposed.[4]', metadata={'source': 'https://en.wikipedia.org/wiki/Human_mission_to_Mars', 'title': 'Human mission to Mars - W
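
As a rough picture of what "most similar" means, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the top `k`. It assumes cosine similarity as the metric and uses hand-written 2D vectors in place of real embeddings. `cosine_similarity` and `top_k_indices` are hypothetical helpers, not part of LangChain or DocArray.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b)


def top_k_indices(query_vector, document_vectors, k=2):
    """Return the indices of the k vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(document_vectors)
    ]
    # Highest similarity first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [index for _, index in scored[:k]]


# Toy "embeddings": documents 0 and 2 point roughly the same way as the query.
document_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [1.0, 0.1]
print(top_k_indices(query_vector, document_vectors, k=2))
```

In the real pipeline the vectors come from `OpenAIEmbeddings` and have far more dimensions, but the ranking idea is the same.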

We can now create our chain.

In [18]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model=MODEL)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()
)

Let's make sure the chain works by testing it with a simple question.

In [21]:
chain.invoke({"question": "when did the first MARS mission happened?"})

'The first uncrewed exploration of Mars was achieved in 1965 with the Mariner 4 flyby.'
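
One detail worth knowing: the chain feeds the retriever's output into the prompt as a list of `Document` objects, so `{context}` is filled with the string form of that list, metadata included. A common variant joins only the page contents first. The sketch below shows that idea in plain Python. `Doc` is a stand-in dataclass rather than LangChain's `Document`, and `format_docs` and `build_prompt` are hypothetical helper names.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document (not LangChain's class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def format_docs(docs):
    """Join document texts with blank lines, dropping metadata."""
    return "\n\n".join(doc.page_content for doc in docs)


def build_prompt(question, docs):
    """Fill the template with the joined context and the question."""
    return TEMPLATE.format(context=format_docs(docs), question=question)


docs = [
    Doc("Mariner 4 flew by Mars in 1965.", {"source": "example"}),
    Doc("A day on Mars is called a sol."),
]
print(build_prompt("What is a day on Mars called?", docs))
```

In a LangChain chain, a function like `format_docs` could be piped after the retriever in the `"context"` branch. Whether that improves answers for this application is something the evaluation itself can tell you.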

## Evaluating the Model on the Test Set

We need to create a function that invokes the chain with a specific question and returns the answer.

In [22]:
def answer_fn(question, history=None):
    return chain.invoke({"question": question})

We can now use the `evaluate()` function to evaluate the model on the test set. This function will compare the answers from the chain with the reference answers in the test set.

In [23]:
from giskard.rag import evaluate

report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
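
Conceptually, the evaluation asks the agent every question in the test set and then judges each answer against its reference answer. The sketch below mimics that loop with a deliberately naive judge. It is not Giskard's implementation: the `correctness_reason` column in the report suggests the real judgment is made by an LLM, while `exact_match_judge` here is a hypothetical stand-in that only compares normalized strings.

```python
def exact_match_judge(agent_answer, reference_answer):
    """Naive stand-in judge: case-insensitive exact match."""
    return agent_answer.strip().lower() == reference_answer.strip().lower()


def evaluate_answers(answer_fn, samples, judge=exact_match_judge):
    """Ask every question and record whether the answer was judged correct."""
    results = []
    for sample in samples:
        agent_answer = answer_fn(sample["question"])
        results.append({
            **sample,
            "agent_answer": agent_answer,
            "correctness": judge(agent_answer, sample["reference_answer"]),
        })
    return results


def correctness_rate(results):
    """Fraction of answers judged correct (0.0 for an empty result list)."""
    if not results:
        return 0.0
    return sum(result["correctness"] for result in results) / len(results)


# Made-up samples and a fake agent, used only to exercise the loop.
samples = [
    {"question": "What is a day on Mars called?", "reference_answer": "Sol"},
    {"question": "Which moon is closer to Mars?", "reference_answer": "Phobos"},
]


def fake_answer_fn(question, history=None):
    return "sol" if "day" in question else "I don't know."


results = evaluate_answers(fake_answer_fn, samples)
print(correctness_rate(results))
```

The fake agent has the same `(question, history=None)` signature as `answer_fn`, so the same loop would accept the real chain wrapper.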


Let's now display the report.

The report breaks the results down by the following five components of a RAG application:

* **Generator**: This is the LLM used in the chain to generate the answers.
* **Retriever**: This is the retriever that fetches relevant documents from the knowledge base according to a query.
* **Rewriter**: This is a component that rewrites the user query to make it more relevant to the knowledge base or to account for chat history.
* **Router**: This is a component that filters the user's query based on their intent.
* **Knowledge Base**: This is the set of documents given to the RAG to generate the answers.

In [28]:
display(report)

In [29]:
report.to_html("report.html")

We can display the correctness results organized by question type.

In [30]:
report.correctness_by_question_type()

| question_type | correctness |
|---|---|
| complex | 0.7 |
| conversational | 0.6 |
| distracting element | 0.5 |
| double | 0.8 |
| simple | 1.0 |
| situational | 0.9 |
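
The per-type scores can be combined into a single overall figure with a weighted average. The sketch below assumes the 60 questions are split evenly across the six question types (10 each). That split is an assumption, not something the output above states, so adjust `question_counts` if your test set differs. `overall_correctness` is a hypothetical helper.

```python
# Scores copied from the table above.
correctness_by_type = {
    "complex": 0.7,
    "conversational": 0.6,
    "distracting element": 0.5,
    "double": 0.8,
    "simple": 1.0,
    "situational": 0.9,
}

# Assumption: 60 questions split evenly over 6 question types.
question_counts = {question_type: 10 for question_type in correctness_by_type}


def overall_correctness(scores, counts):
    """Average the per-type scores, weighted by questions per type."""
    total_questions = sum(counts.values())
    correct_answers = sum(scores[name] * counts[name] for name in scores)
    return correct_answers / total_questions


overall = overall_correctness(correctness_by_type, question_counts)
print(round(overall, 2))  # 0.75 under the even-split assumption
```

Under that assumption, roughly a quarter of the answers were judged incorrect, with the "distracting element" questions contributing the most failures.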


We can also display the specific failures.

In [31]:
report.get_failures()

| id | question | reference_answer | reference_context | conversation_history | metadata | agent_answer | correctness | correctness_reason |
|---|---|---|---|---|---|---|---|---|
| 3546b286-30e3-4e20-97a5-6e210e7fb46b | Could you elaborate on the specific proposal f... | G. A. Landis proposed 'Teleoperation from Mars... | Document 48: ^ "Touchdown! NASA's Mars Perseve... | [] | {'question_type': 'complex', 'seed_document_id... | I don't know. | False | The agent failed to provide the correct inform... |
| a57ba652-3038-4065-9e71-1954a580acb3 | Could you elaborate on the active proposals fr... | The active 21st-century proposals for a human ... | Document 53: Category\n Solar System portal\n... | [] | {'question_type': 'complex', 'seed_document_id... | The documents do not provide specific informat... | False | The agent failed to provide the correct inform... |
| b4459a47-0eaf-4bab-aeff-009166b3d4a0 | According to the timekeeping system on Mars, w... | The day on Mars is called Sol. | Document 52: Missions\nList of missions to Mar... | [] | {'question_type': 'complex', 'seed_document_id... | I don't know. | False | The agent failed to provide the correct answer... |
| 9a6799aa-eae9-443a-ac1c-37b6252f1bdc | While planning a mission to Mars, how could th... | The mobile view option on Wikipedia allows use... | Document 55: Privacy policy\nAbout Wikipedia\n... | [] | {'question_type': 'distracting element', 'seed... | I don't know. | False | The agent's answer does not address the questi... |
| 3597399a-27a4-44f1-b96a-09758da3cdab | Considering the potential use of fuel produced... | The data-sharing deal between NASA and SpaceX ... | Document 44: ^ Coates, Andrew (2 December 2016... | [] | {'question_type': 'distracting element', 'seed... | The documents do not provide information on ho... | False | The agent failed to provide the information ab... |
| d21ffb8b-15e7-45f1-bcff-bdccd73dcafc | Considering the potential hazards of the biolo... | The article on Human mission to Mars is availa... | Document 1: 5.2Mars sample return missions\n\n... | [] | {'question_type': 'distracting element', 'seed... | I don't know. | False | The agent did not provide the correct answer. ... |
| 4e2f73ed-fc71-43cf-9467-b886ff68fdef | In the context of various Mars mission plans a... | The study by Bethany L. Ehlmann in 2005 focuse... | Document 39: ^ Marshall-Goebel, Karina; et al.... | [] | {'question_type': 'distracting element', 'seed... | I don't know. | False | The agent failed to provide the correct inform... |
| cb61ac7c-d2ac-4829-a3ce-e426380d5704 | Given that Michael Meltzer's work discusses NA... | The title of the chapter is 'Return to Mars'. | Document 41: ^ Meltzer, Michael (May 31, 2012)... | [] | {'question_type': 'distracting element', 'seed... | The documents do not provide information on th... | False | The agent failed to provide the correct chapte... |
| 87f8ec6e-b6bd-49e4-b3ec-81c6e9e247e4 | As a science enthusiast wanting to learn about... | The page does not provide any content about hu... | Document 1: 5.2Mars sample return missions\n\n... | [] | {'question_type': 'situational', 'seed_documen... | The page provides information on various aspec... | False | The agent's answer is incorrect because it pro... |
| 9877d40c-1c94-4445-b886-e537c894b002 | What event doubled the radiation levels on the... | A large solar storm sparked a global aurora an... | Document 37: ^ Scott, Jim (30 September 2017).... | [] | {'question_type': 'double', 'original_question... | A massive unexpected solar storm doubled the r... | False | The agent correctly identified the event that ... |
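
To see where the failures concentrate, it can help to count them by question type. The sketch below does this in plain Python over the `question_type` values visible in the ten rows displayed above; it does not cover any failures that the display left out. `count_failures_by_type` is a hypothetical helper, and each failure is modeled as a dict with a `metadata` entry, mirroring the column of that name.

```python
from collections import Counter


def count_failures_by_type(failures):
    """Count failed questions per question type, most frequent first."""
    counts = Counter(
        failure["metadata"]["question_type"] for failure in failures
    )
    return counts.most_common()


# Question types of the ten failures displayed above, in row order.
displayed_types = (
    ["complex"] * 3
    + ["distracting element"] * 5
    + ["situational", "double"]
)
failures = [
    {"metadata": {"question_type": question_type}}
    for question_type in displayed_types
]

for question_type, count in count_failures_by_type(failures):
    print(f"{question_type}: {count}")
```

Several of the "distracting element" failures have reference contexts made of page chrome or reference-list entries (for example "Privacy policy" and "About Wikipedia"), which suggests that cleaning the scraped content before building the knowledge base is worth trying.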
