# How to evaluate a RAG application

This example uses [LangChain](https://www.langchain.com) and [Giskard](https://github.com/Giskard-AI/giskard) to evaluate the quality of a RAG application.

In [1]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
MODEL = "gpt-3.5-turbo"

## Scrape the Website and Split the Content

In [2]:
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=20)

loader = WebBaseLoader("https://www.ml.school/")
documents = loader.load_and_split(text_splitter)
documents



[Document(page_content='Building Machine Learning Systems That Don\'t Suck"This is the best machine learning course I\'ve done. Worth every cent."Jose Reyes, AI/ML at Cevo AustraliaLearn how to design, build, deploy, and scale machine learning systems to solve real-world problems.I\'ll lose my mind if I see another book or course teaching people the same basic ideas for the hundredth time. Most people are stuck in beginner mode, and finding help to solve real-world problems is hard.I want to change that.I started writing software 30 years ago. I\'ve written pipelines and trained models for some of the largest companies in the world. I want to show you how to do the same.This is the class I wish I had taken when I started.This program will help you unlearn what you think machine learning is. It\'s a practical, hands-on class where you\'ll learn from years of experience and real-world examples.When you join, you get lifetime access to the following:18 hours of live, interactive sessions.
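To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-size character chunking with overlap. This is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on natural separators such as paragraphs and sentences). The `split_with_overlap` helper is hypothetical and only illustrates how consecutive chunks share a few characters.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Hypothetical helper: cut text into fixed-size windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts `step` characters after the previous one,
    # so the last `chunk_overlap` characters are repeated.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)
```

The overlap keeps a sentence that straddles a chunk boundary at least partially visible in both chunks, which can help retrieval.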

## Load the Content in a Vector Store

In [3]:
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import DocArrayInMemorySearch

vectorstore = DocArrayInMemorySearch.from_documents(
    documents, embedding=OpenAIEmbeddings()
)

## Create a Knowledge Base

Let's start by loading the content into a pandas DataFrame.

In [4]:
import pandas as pd

df = pd.DataFrame([d.page_content for d in documents], columns=["text"])
df.head(10)

,text
0,Building Machine Learning Systems That Don't S...
1,use this time to discuss the first principles ...
2,this is the class you don't want to miss.Who I...
3,"full-scale machine learning.""Brian H. HoughSof..."
4,I've learned from real-life examples I've buil...
5,work.Wednesday: Optional office hours.Thursday...
6,questionsComplete source code of a working pro...
7,labels and weak supervision.Active learning us...
8,"question, please reach out on social media and..."
9,pay once to join the program and get immediate...


We can now create a Knowledge Base using the DataFrame we created before.

In [5]:
from giskard.rag import KnowledgeBase

knowledge_base = KnowledgeBase(df)



## Generate the Test Set

In [7]:
from giskard.rag import generate_testset

testset = generate_testset(
    knowledge_base,
    num_questions=60,
    agent_description="A chatbot answering questions about the Machine Learning School Website",
)

2024-03-23 13:21:30,425 pid:2158 MainThread giskard.rag  INFO     Finding topics in the knowledge base.
2024-03-23 13:21:30,426 pid:2158 MainThread giskard.rag  INFO     Computing Knowledge Base embeddings.




2024-03-23 13:21:33,342 pid:2158 MainThread giskard.rag  INFO     Found 1 topics in the knowledge base.


Generating questions: 100%|██████████| 60/60 [05:43<00:00,  5.72s/it]


Let's display a few samples from the test set.

In [40]:
test_set_df = testset.to_pandas()

# Print the first three generated questions with their reference answer and context.
for number, (_, row) in enumerate(test_set_df.head(3).iterrows(), start=1):
    print(f"Question {number}: {row['question']}")
    print(f"Reference answer: {row['reference_answer']}")
    print("Reference context:")
    print(row['reference_context'])
    print("******************", end="\n\n")


Question 1: What does the Machine Learning Systems course offer?
Reference answer: The Machine Learning Systems course offers 18 hours of live, interactive sessions. It is a practical, hands-on class where participants can learn from years of experience and real-world examples. When you join, you get lifetime access to the course.
Reference context:
Document 0: Building Machine Learning Systems That Don't Suck"This is the best machine learning course I've done. Worth every cent."Jose Reyes, AI/ML at Cevo AustraliaLearn how to design, build, deploy, and scale machine learning systems to solve real-world problems.I'll lose my mind if I see another book or course teaching people the same basic ideas for the hundredth time. Most people are stuck in beginner mode, and finding help to solve real-world problems is hard.I want to change that.I started writing software 30 years ago. I've written pipelines and trained models for some of the largest companies in the world. I want to show you how 

Let's now save the test set to a file:

In [8]:
testset.save("test-set.jsonl")
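The `.jsonl` extension suggests the JSON Lines convention, where every line of the file is one JSON object. Assuming the saved test set follows that convention, a file like this can be inspected with the standard library alone. The sketch below writes two illustrative records to a temporary file and reads them back; the records are made up, and the field names are borrowed from the test set columns rather than taken from the actual file.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records only; the real file is produced by testset.save().
records = [
    {"question": "What does the course offer?", "reference_answer": "18 hours of live sessions."},
    {"question": "Who teaches the program?", "reference_answer": "Santiago."},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.jsonl"

    # Write one JSON object per line.
    path.write_text("\n".join(json.dumps(r) for r in records) + "\n", encoding="utf-8")

    # Read the file back, skipping any blank lines.
    loaded = [
        json.loads(line)
        for line in path.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]

print(len(loaded), loaded[0]["question"])
```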

## Prepare the Prompt Template

In [6]:
from langchain.prompts import PromptTemplate

template = """
Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = PromptTemplate.from_template(template)
print(prompt.format(context="Here is some context", question="Here is a question"))


Answer the question based on the context below. If you can't 
answer the question, reply "I don't know".

Context: Here is some context

Question: Here is a question



## Create the RAG Chain

Create a retriever from the vector store. Given a question, it returns the documents that are most similar to it.

In [39]:
retriever = vectorstore.as_retriever()
retriever.get_relevant_documents("What is the Machine Learning School?")

[Document(page_content='Building Machine Learning Systems That Don\'t Suck"This is the best machine learning course I\'ve done. Worth every cent."Jose Reyes, AI/ML at Cevo AustraliaLearn how to design, build, deploy, and scale machine learning systems to solve real-world problems.I\'ll lose my mind if I see another book or course teaching people the same basic ideas for the hundredth time. Most people are stuck in beginner mode, and finding help to solve real-world problems is hard.I want to change that.I started writing software 30 years ago. I\'ve written pipelines and trained models for some of the largest companies in the world. I want to show you how to do the same.This is the class I wish I had taken when I started.This program will help you unlearn what you think machine learning is. It\'s a practical, hands-on class where you\'ll learn from years of experience and real-world examples.When you join, you get lifetime access to the following:18 hours of live, interactive sessions.
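Conceptually, a vector store retriever ranks documents by how close their embeddings are to the embedding of the question. The sketch below shows that idea with cosine similarity over made-up three-dimensional vectors. Real embeddings have many more dimensions, and the `top_k` helper is hypothetical, not part of LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=2):
    """Hypothetical helper: return the k texts most similar to the query."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vector, doc["vector"]),
        reverse=True,
    )
    return [doc["text"] for doc in ranked[:k]]


# Made-up embeddings, for illustration only.
documents = [
    {"text": "course schedule", "vector": [1.0, 0.0, 0.0]},
    {"text": "pricing", "vector": [0.0, 1.0, 0.0]},
    {"text": "instructor bio", "vector": [0.7, 0.7, 0.0]},
]
query_vector = [0.9, 0.1, 0.0]

results = top_k(query_vector, documents, k=2)
print(results)
```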

We can now create our chain.

In [45]:
from langchain_openai.chat_models import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from operator import itemgetter

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model=MODEL)

chain = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()
)

Let's make sure the chain works by testing it with a simple question.

In [46]:
chain.invoke({"question": "What is the Machine Learning School?"})

'The Machine Learning School is a live, interactive program that helps individuals build production-ready machine learning systems from the ground up.'
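To see what the chain does at each step, here is a rough plain-Python equivalent. It is only a sketch: `fake_retriever` and `fake_model` are hypothetical stand-ins for the vector store retriever and the chat model, and the context is a list of plain strings rather than LangChain `Document` objects.

```python
TEMPLATE = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever: always returns one passage."""
    return ["The program includes 18 hours of live, interactive sessions."]


def fake_model(prompt_text: str) -> str:
    """Hypothetical stand-in for the chat model: reports the prompt size."""
    return f"(model reply to a {len(prompt_text)}-character prompt)"


def build_prompt(inputs: dict) -> str:
    # Step 1: derive both prompt variables from the same input dictionary.
    variables = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: fill in the template.
    return TEMPLATE.format(**variables)


def run_chain(inputs: dict) -> str:
    # Step 3: send the prompt to the model and return its reply as a string.
    return fake_model(build_prompt(inputs))


question = "What is the Machine Learning School?"
prompt_text = build_prompt({"question": question})
answer = run_chain({"question": question})
print(answer)
```

The dictionary at the start of the real chain plays the role of `variables` here: both entries read the same `question` key, and one of them is routed through the retriever first.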

## Evaluating the Model on the Test Set

We need to create a function that invokes the chain with a specific question and returns the answer.

In [45]:
def answer_fn(question, history=None):
    return chain.invoke({"question": question})

We can now use the `evaluate()` function to evaluate the model on the test set. This function will compare the answers from the chain with the reference answers in the test set.

In [None]:
from giskard.rag import evaluate

report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)

Let's now display the report.

The report scores five components of a RAG application:

* **Generator**: This is the LLM used in the chain to generate the answers.
* **Retriever**: This is the retriever that fetches relevant documents from the knowledge base according to a query.
* **Rewriter**: This is a component that rewrites the user query to make it more relevant to the knowledge base or to account for chat history.
* **Router**: This is a component that filters the user's query based on their intent.
* **Knowledge Base**: This is the set of documents given to the RAG to generate the answers.

In [63]:
display(report)

Component,Description,Score
GENERATOR,The Generator is the LLM inside the RAG to generate the answers.,78.0%
RETRIEVER,The Retriever fetches relevant documents from the knowledge base according to a user query.,60.0%
REWRITER,The Rewriter modifies the user query to match a predefined format or to include the context from the chat history.,60.0%
ROUTING,The Router filters the query of the user based on his intentions (intentions detection).,70.0%
KNOWLEDGE_BASE,The knowledge base is the set of documents given to the RAG to generate the answers. Its scores is computed differently than the other components: it is the difference between the maximum and minimum correctness score across all the topics of the knowledge base.,100.0%


In [50]:
report.to_html("report.html")

We can display the correctness results organized by question type.

In [51]:
report.correctness_by_question_type()

question_type,correctness
complex,0.9
conversational,0.5
distracting element,0.5
double,0.8
simple,0.7
situational,1.0
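A table like this can be read as the share of correct answers within each question type. The sketch below computes that share from per-question pass/fail judgments. The judgments are made up and `correctness_by_type` is a hypothetical helper; this is not Giskard's implementation.

```python
from collections import defaultdict


def correctness_by_type(results):
    """Hypothetical helper: fraction of correct answers per question type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {qt: correct[qt] / totals[qt] for qt in totals}


# Made-up (question_type, is_correct) pairs, for illustration only.
results = [
    ("simple", True), ("simple", False),
    ("complex", True), ("complex", True),
    ("double", True), ("double", True), ("double", False), ("double", True),
]

scores = correctness_by_type(results)
print(scores)
```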


We can also display the specific failures.

In [52]:
report.get_failures()

id,question,reference_answer,reference_context,conversation_history,metadata,agent_answer,correctness,correctness_reason
964fc6fa-491d-4b6a-aa9d-835328b2443e,What does the Machine Learning Systems course ...,The Machine Learning Systems course offers 18 ...,Document 0: Building Machine Learning Systems ...,[],"{'question_type': 'simple', 'seed_document_id'...",The Machine Learning Systems course offers 10 ...,False,The agent's answer includes details that are n...
7fb7d403-67a1-429b-bc19-d5d014074a0f,What does the Machine Learning Systems course ...,The Machine Learning Systems course offers 18 ...,Document 0: Building Machine Learning Systems ...,[],"{'question_type': 'simple', 'seed_document_id'...",The Machine Learning Systems course offers pra...,False,The agent's answer includes details that are n...
9cf283f9-f19b-4c58-b27b-fca0c5cc7ab2,What is the cost for joining the Machine Learn...,The cost for joining the Machine Learning prog...,Document 9: pay once to join the program and g...,[],"{'question_type': 'simple', 'seed_document_id'...",The cost for joining the Machine Learning prog...,False,The agent's answer is incorrect because it doe...
9f9fd3a9-0b3c-48fa-a90b-bd970238cd1e,What resources and benefits can I expect to re...,"When you join, you get lifetime access to 18 h...",Document 0: Building Machine Learning Systems ...,[],"{'question_type': 'complex', 'seed_document_id...",Upon enrolling in the Machine Learning Systems...,False,The agent's answer does not match the ground t...
8881989e-d95f-462f-931d-8604a75139f7,Could you provide information about the instru...,The instructor of the program is Santiago. He ...,Document 9: pay once to join the program and g...,[],"{'question_type': 'distracting element', 'seed...",The instructor of the Machine Learning program...,False,The agent's answer is partially correct but it...
cf85c206-d8e7-465d-a2a1-040f91ac49aa,What is the cost of the program that includes ...,The cost of the program is $450. It includes l...,Document 9: pay once to join the program and g...,[],"{'question_type': 'distracting element', 'seed...",The cost of the program is a one-time payment ...,False,The agent did not provide the specific cost of...
a805b425-f0b8-480a-9bb7-bbceb19263d2,Considering the course 'Building Machine Learn...,The cost of the program is $450. This includes...,Document 5: work.Wednesday: Optional office ho...,[],"{'question_type': 'distracting element', 'seed...",The course 'Building Machine Learning Systems ...,False,The agent did not provide the correct cost of ...
b07946fd-28af-42d6-9117-8959af8b1d9d,What is the cost to join the program that incl...,The cost to join the program is $450. It inclu...,Document 9: pay once to join the program and g...,[],"{'question_type': 'distracting element', 'seed...",The cost to join the program that includes des...,False,The agent did not provide the specific cost of...
d3af5418-5b97-4397-8173-d7238e219adf,"Considering the program's time commitment, wha...",The second session of the course covers topics...,Document 7: labels and weak supervision.Active...,[],"{'question_type': 'distracting element', 'seed...",For those interested in implementing the codin...,False,The agent's answer does not match the ground t...
99d154a1-d88b-4f11-9ab2-428289571c34,What is included in the machine learning progr...,The program includes 10 hours of step-by-step ...,Document 1: use this time to discuss the first...,[],"{'question_type': 'double', 'original_question...","The machine learning program includes live, in...",False,The agent's answer is missing some key compone...


## Creating a Test Suite

We can create a test suite and use it to compare different models.

Load the test set from disk.

In [22]:
from giskard.rag import QATestset

testset = QATestset.load("test-set.jsonl")

Create a Test Suite from the test set.

In [23]:
test_suite = testset.to_test_suite("Machine Learning School Test Suite")

We need a function that takes a DataFrame of questions, invokes the chain with each question, and returns the answers.

In [24]:
import giskard


def batch_prediction_fn(df: pd.DataFrame):
    return chain.batch([{"question": q} for q in df["question"].values])

We can now create a Giskard Model object to run our test suite.

In [25]:
giskard_model = giskard.Model(
    model=batch_prediction_fn,
    model_type="text_generation",
    name="Machine Learning School Question and Answer Model",
    description="This model answers questions about the Machine Learning School website.",
    feature_names=["question"], 
)

2024-03-23 16:20:54,903 pid:46357 MainThread giskard.models.automodel INFO     Your 'prediction_function' is successfully wrapped by Giskard's 'PredictionFunctionModel' wrapper class.


Let's now run the test suite using the model we created before.

In [64]:
test_suite_results = test_suite.run(model=giskard_model)

2024-03-23 15:57:39,422 pid:2158 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-03-23 15:57:39,423 pid:2158 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (60, 5) executed in 0:00:00.007341
Executed 'TestsetCorrectnessTest' with arguments {'model': <giskard.models.function.PredictionFunctionModel object at 0x2fa779b50>, 'dataset': <giskard.datasets.base.Dataset object at 0x2fa899b80>}: 
               Test succeeded
               Metric: 0.62
               
               
2024-03-23 15:58:58,289 pid:2158 MainThread giskard.core.suite INFO     Executed test suite 'Machine Learning School Test Suite'
2024-03-23 15:58:58,291 pid:2158 MainThread giskard.core.suite INFO     result: success
2024-03-23 15:58:58,292 pid:2158 MainThread giskard.core.suite INFO     TestsetCorrectnessTest ({'model': <giskard.models.function.PredictionFunctionModel object at 0x2fa779b50>, 'dataset': <gi

We can display the results.

In [65]:
display(test_suite_results)

## Integrating with Pytest

In [27]:
import ipytest
ipytest.autoconfig()  # registers the %%ipytest cell magic used in the next cell

We can now integrate our test suite with Pytest.

In [36]:
%%ipytest

import pytest
from giskard.rag import QATestset
from giskard.testing.tests.llm import test_llm_correctness


@pytest.fixture
def dataset():
    testset = QATestset.load("test-set.jsonl")
    return testset.to_dataset()


@pytest.fixture
def model():
    return giskard_model


def test_chain(dataset, model):
    test_llm_correctness(model=model, dataset=dataset, threshold=0.5).assert_()

[32m.[0m2024-03-23 16:27:56,471 pid:46357 MainThread giskard.datasets.base INFO     Casting dataframe columns from {'question': 'object'} to {'question': 'object'}
2024-03-23 16:27:56,472 pid:46357 MainThread giskard.utils.logging_utils INFO     Predicted dataset with shape (60, 5) executed in 0:00:00.005269


[32m.[0m[33m                                                                                           [100%][0m
../.venv/lib/python3.9/site-packages/_pytest/config/__init__.py:1276
    self._mark_plugins_for_rewrite(hook)

t_66406511b9d84eb38baa6b0a22141dd0.py::test_llm_correctness
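The `threshold=0.5` argument sets how many answers must be correct for the test to pass. As a sketch of that idea, the hypothetical function below fails when the share of correct answers drops below the threshold. Whether Giskard uses `>=` or `>` for the comparison is an assumption here, and the judgments are made up.

```python
def assert_correctness(judgments, threshold=0.5):
    """Hypothetical check: raise AssertionError if too few answers are correct."""
    metric = sum(judgments) / len(judgments)
    assert metric >= threshold, (
        f"correctness {metric:.2f} is below threshold {threshold}"
    )
    return metric


# Made-up judgments: 5 correct answers out of 8.
judgments = [True] * 5 + [False] * 3
metric = assert_correctness(judgments, threshold=0.5)
print(f"{metric:.3f}")
```

Raising the threshold makes the test stricter: with the same judgments, `threshold=0.7` would fail.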

