# Part 1: Simple RAG

Part 1 of this series covers a simple RAG application built around a retrieval model. The goal is to establish a baseline with evaluation metrics that the more complex RAG pipelines in the following parts can be compared against. A visual graph of the pipeline is shown further below.


## RAG Card:

The RAG card defines the components of the pipeline and the models used in the pipeline.

| Component | Details |
|---|---|
| Generator LLM | [NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) hosted on HF API |
| Retriever | [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) with `384` dimension |
| Dataset | [philschmid/easyrag-mini-wikipedia](https://huggingface.co/datasets/philschmid/easyrag-mini-wikipedia) |
| Evaluator | [NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) | 
| Metrics | [Answer Correctness](../src/easyrag/metrics/answer_correctness.py), [Answer Faithfulness](../src/easyrag/metrics/answer_faithfulness.py), [Context Precision](../src/easyrag/metrics/context_precision.py), [Context Recall](../src/easyrag/metrics/context_recall.py)   |

## Metrics definition
* **Answer Correctness**: Compares the `answer` with the `ground truth` and returns 0 (INCORRECT) or 1 (CORRECT).
* **Answer Faithfulness**: Evaluates whether the `answer` is faithful to the provided `context` and returns 0 (UNFAITHFUL) or 1 (FAITHFUL), e.g. for an answer such as "I cannot answer since no information is given in the context."
* **Context Precision**: Evaluates how many of the retrieved documents are relevant for answering the question. Uses the `question`, the `context` and the `ground truth` answer.
* **Context Recall**: Evaluates how many sentences in the `ground truth` can be attributed to the retrieved documents. Uses the `context` and the `ground truth`, matching the arguments passed to `ContextRecall` in the evaluation code.
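
Context Precision and Context Recall are both ratios between 0 and 1. The sketch below shows how such a ratio could be derived once each item has been judged. The function name `ratio_score` and the verdict list are hypothetical, and the actual easyrag implementation may differ.

```python
from typing import List, Optional


def ratio_score(verdicts: List[bool]) -> Optional[float]:
    """Return the share of positive verdicts, or None if there is nothing to judge."""
    if not verdicts:
        return None
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts: 2 of 5 retrieved documents were judged relevant.
context_precision = ratio_score([True, False, True, False, False])
print(context_precision)  # 0.4
```

Returning `None` for an empty input keeps "could not be scored" distinct from a genuine score of 0.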


## Simple RAG pipeline

Below you can find a visual graph of the simple RAG pipeline. 

<img src="../assets/simple-rag.png" width="650">


## 1. Installing dependencies

The first step is to install the necessary dependencies.

In [None]:
!pip install git+https://github.com/philschmid/easyrag.git faiss-cpu datasets sentence_transformers openai --upgrade

## 2. Indexing the documents

Before we can build and evaluate our RAG pipeline we need to index our dataset and create the embeddings for the retriever. We are going to use the [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model, [faiss](https://github.com/facebookresearch/faiss) and [philschmid/easyrag-mini-wikipedia](https://huggingface.co/datasets/philschmid/easyrag-mini-wikipedia) dataset.

The [philschmid/easyrag-mini-wikipedia](https://huggingface.co/datasets/philschmid/easyrag-mini-wikipedia) dataset is a simple dataset for evaluating RAG pipelines. It consists of ~900 questions with ground truth answers from Wikipedia articles. In addition to the questions, it has a second config with ~3,200 documents for retrieval. For evaluation we will use the `mini_100` split from the `questions` config.

In [None]:
from datasets import load_dataset

raw_documents = load_dataset("philschmid/easyrag-mini-wikipedia","documents",split="passages")

print(raw_documents[0]["document"])

We are going to use the [HuggingFaceEmbeddings](https://python.langchain.com/docs/integrations/text_embedding/sentence_transformers) class to embed the documents locally and store them in a local index.

In [None]:
from langchain_community.embeddings import HuggingFaceEmbeddings
from transformers import BertTokenizerFast
from langchain.text_splitter import CharacterTextSplitter


embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")
tokenizer = BertTokenizerFast.from_pretrained("BAAI/bge-small-en-v1.5")
text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=256, chunk_overlap=64,
)



We can now start indexing the documents into our Faiss index. We keep things simple and split a document into chunks of `256` tokens if it is longer than that.

In [None]:
from langchain_community.vectorstores import FAISS

chunked_documents = text_splitter.create_documents(raw_documents["document"])
print(f"Chunked documents: {len(chunked_documents)}")
db = FAISS.from_documents(chunked_documents, embeddings)
retriever = db.as_retriever()
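
To make the `chunk_size=256` and `chunk_overlap=64` settings concrete, here is a minimal pure-Python sketch of a sliding window over a token list. It only illustrates what chunk size and overlap mean. `CharacterTextSplitter` splits on a separator and merges the pieces, so its real chunk boundaries will differ. The helper `sliding_chunks` and the dummy tokens are hypothetical.

```python
from typing import List


def sliding_chunks(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(600)]  # stand-in for a tokenized document
chunks = sliding_chunks(tokens, chunk_size=256, chunk_overlap=64)
print([len(c) for c in chunks])  # [256, 256, 216]
```

Each chunk starts with the last 64 tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.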


Let's test our retriever with a simple question.

In [None]:
query = "What is the capital of Germany?"
docs = retriever.invoke(query)
print(docs[0].page_content)
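
Conceptually, the retriever embeds the query and returns the documents whose embeddings are closest to it. The sketch below does this by brute force with cosine similarity on toy 3-dimensional vectors. Faiss uses its own optimized index and distance metric, so cosine similarity, the helper names and the vectors are assumptions for illustration only.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (index, similarity) pairs of the k most similar documents, best first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings; the real model produces 384-dimensional vectors.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
print(top_k([1.0, 0.0, 0.0], doc_vecs, k=2))
```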


We now save the index to disk so we can load it later in our LangGraph pipeline.

In [None]:
db.save_local("faiss_index")

## 3. Building the RAG pipeline with LangGraph

LangGraph is a library for building stateful, multi-actor applications with LLMs, built on top of (and intended to be used with) LangChain. Its main use is adding cycles and control flow to your LLM application. We are not going to use the full capabilities of LangGraph in this part, but we will in the following parts.

Our Nodes in the graph are:
* **Retriever**: This node is responsible for retrieving the documents from the index.
* **Generator**: This node is responsible for generating the answer based on the retrieved documents.

The first step is to define our graph and its state parameters. The state only holds `question`, `context` (the retrieved documents) and `answer`. Each node reads from the state and returns an update to it. Nodes are connected by edges.

In [1]:
from typing import List, TypedDict

class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: The question that is used to retrieve the documents.
        context: The documents that were retrieved.
        answer: The answer generated from the question and the context.
    """
    question: str
    context: List[str]
    answer: str

Next we define our nodes. They are simple Python functions that take the state as input and return the fields they want to update.

In [2]:
def retrieve(state: GraphState):
    # read the question from the state
    question = state.get("question", None)

    # retrieve the top 5 documents; `retriever` is the FAISS vector store loaded below
    documents = retriever.similarity_search_with_score(question, k=5)

    # drop the scores and reverse the order, so the best match
    # (returned first by the similarity search) ends up last in the context
    documents = [doc[0].page_content for doc in documents]
    documents.reverse()

    # add documents to state
    return {"context": documents}

def generate(state):
    question = state.get("question", None)
    context = state.get("context", None)
    # generate answer
    answer = chain.invoke({"question": question, "context": context}) 
    return {
        "answer": answer
    }


After we have defined our Nodes, we can build our Graph by creating a `StateGraph` object and adding the nodes to it and then connecting them with edges.

In [3]:
from langgraph.graph import END, StateGraph

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("generate", generate)  # generate

# Build graph
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
# Set entry point
workflow.set_entry_point("retrieve")

# Compile
app = workflow.compile()
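
As a mental model for this linear graph, each node receives the current state and returns a partial update that is merged into it before the next node runs. The pure-Python sketch below mimics that flow with stub nodes. It is a simplification for illustration and not how LangGraph is implemented; `run_pipeline`, `fake_retrieve` and `fake_generate` are hypothetical names.

```python
from typing import Callable, Dict, List


def run_pipeline(state: Dict, nodes: List[Callable[[Dict], Dict]]) -> Dict:
    """Run nodes in order, merging each node's partial update into the state."""
    for node in nodes:
        state = {**state, **node(state)}
    return state


def fake_retrieve(state: Dict) -> Dict:
    # Stand-in for the retriever node: returns a fixed context.
    return {"context": ["Berlin is the capital of Germany."]}


def fake_generate(state: Dict) -> Dict:
    # Stand-in for the generator node: echoes the first context document.
    return {"answer": state["context"][0]}


final_state = run_pipeline(
    {"question": "What is the capital of Germany?"},
    [fake_retrieve, fake_generate],
)
print(final_state["answer"])  # Berlin is the capital of Germany.
```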

Before we can run the graph we need to define our retrieval model and the generator model. We are going to use the [BAAI/bge-small-en-v1.5](https://huggingface.co/BAAI/bge-small-en-v1.5) model for the retriever and the [NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) model for the generator. 
First we load the `HuggingFaceEmbeddings` class again and use the `FAISS` class to load the index from disk.

In [4]:
from langchain_community.vectorstores.faiss import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

retriever = FAISS.load_local("faiss_index", embeddings,allow_dangerous_deserialization=True)

For the LLM we use the `ChatOpenAI` class with the Hugging Face Inference API. We also create our `chain`, which is used inside the generate node. We write a simple prompt and compose it with the LLM and an output parser into a chain.

In [5]:
import huggingface_hub

from langchain_community.chat_models.openai import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(
    model_name="NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO",
    openai_api_key=huggingface_hub.get_token(),
    openai_api_base="https://api-inference.huggingface.co/v1/",
    max_tokens=1024,
    temperature=0.1,
)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an assistant for question-answering tasks. Use the following pieces of retrieved Context to answer the Question. If the context doesn't provide any helpful information to answer the question say that you cannot answer."),
        ("human", "Question: {question}\nContext: {context}"),
    ]
)
chain = prompt | llm | StrOutputParser()
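
Note that `context` is a list of strings when it reaches the prompt. The sketch below uses plain `str.format` to show what that looks like once interpolated, assuming the template fills its placeholders the same way. The documents and variable names are made up for illustration, and joining the documents first is shown only as an alternative, not as what the pipeline above does.

```python
HUMAN = "Question: {question}\nContext: {context}"

context = ["Doc A text.", "Doc B text."]
question = "Which county was Lincoln born in?"

# Passing the list directly interpolates its Python repr, brackets and quotes included.
as_list = HUMAN.format(question=question, context=context)

# Joining first yields plain text with one document per paragraph.
as_text = HUMAN.format(question=question, context="\n\n".join(context))

print(as_list)
print(as_text)
```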




Nice. Now let's test our graph with two questions: one that can be answered from the indexed documents and one that cannot.

In [6]:
# correct = app.invoke({"question": "Which county was Lincoln born in?"})
# print(correct["answer"])
# false = app.invoke({"question": "Which county was Philipp born in?"})
# print(false["answer"])

Fantastic. The graph is working as expected. We can now move on to the evaluation of the pipeline.

## 4. Generating Answers

Before we can evaluate the pipeline we need to generate the answers for the questions. We are going to iterate over the questions and generate the answers using our graph. We will use the `mini_100` split from the `questions` config, which includes 100 questions.

In [7]:
from datasets import load_dataset

test_data = load_dataset("philschmid/easyrag-mini-wikipedia","questions",split="mini_100").select(range(100))


To speed up generation we use the `AsyncRunner` from `easyrag`, which runs the requests in parallel through the `async` methods of `langgraph`. We use a maximum concurrency of 8.

In [8]:
from easyrag.runner import AsyncRunner

async def predict(sample):
    answer = await app.ainvoke({"question": sample["question"]})
    return {"answer": answer["answer"], "context": answer["context"], "ground_truth": sample["ground_truth"], "question": sample["question"]}

r = AsyncRunner(concurrency_limit=8, callable=predict)
results = r.run(test_data)
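
A concurrency limit like this is commonly implemented with a semaphore. The sketch below shows the general technique with `asyncio.Semaphore` and a stubbed prediction function. It is not the actual `AsyncRunner` implementation; `run_bounded` and `fake_predict` are hypothetical names.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List


async def run_bounded(fn: Callable[[Any], Awaitable[Any]], items: Iterable[Any], limit: int) -> List[Any]:
    """Apply fn to every item with at most `limit` calls in flight; results keep input order."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: Any) -> Any:
        async with semaphore:
            return await fn(item)

    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_predict(sample: dict) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a network call to the LLM
    return {"question": sample["question"], "answer": "stub"}


samples = [{"question": f"q{i}"} for i in range(20)]
results = asyncio.run(run_bounded(fake_predict, samples, limit=8))
print(len(results))  # 20
```

Inside a notebook an event loop is already running, so there you would write `await run_bounded(...)` instead of calling `asyncio.run`.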


Now, let's convert our list back into a `Dataset`.

In [9]:
from datasets import Dataset 

results = Dataset.from_list(results)
# let's save it just in case
results.to_json("results.json")


In [10]:
from datasets import load_dataset

results = load_dataset("json", data_files="results.json", split="train")


## 5. Evaluate the RAG pipeline with easyrag

For the evaluation we are going to use [easyrag](https://github.com/philschmid/easyrag) and the [philschmid/easyrag-mini-wikipedia](https://huggingface.co/datasets/philschmid/easyrag-mini-wikipedia) dataset, with the [NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO](https://huggingface.co/NousResearch/Nous-Hermes-2-Mixtral-8x7B-DPO) model as the evaluator. First we initialize our metrics from easyrag. Each metric makes one or more requests to the model.

In [11]:
from easyrag.metrics import (
    ContextRecall,
    ContextPrecision,
    AnswerCorrectness,
    AnswerFaithfulness,
)


cr = ContextRecall(llm=llm, verbose=False) # verbose=True to see the prompts
cp = ContextPrecision(llm=llm, verbose=False) # verbose=True to see the prompts
ac = AnswerCorrectness(llm=llm, verbose=False) # verbose=True to see the prompts
af = AnswerFaithfulness(llm=llm, verbose=False) # verbose=True to see the prompts

Let's select a random sample from our results and evaluate it with our metrics.

In [12]:
from random import randrange

sample = results[randrange(len(results))]
print(f"Question: {sample['question']}")
print(f"Ground Truth: {sample['ground_truth']}")
print(f"Answer: {sample['answer']}")

cr_score = cr.compute(context=sample["context"],ground_truth=sample["ground_truth"])
cp_score = cp.compute(context=sample["context"],ground_truth=sample["ground_truth"],question=sample["question"])
ac_score = ac.compute(answer=sample["answer"],ground_truth=sample["ground_truth"],question=sample["question"])
af_score = af.compute(answer=sample["answer"],context=sample["context"])

print(f"Context Recall: {cr_score}")
print(f"Context Precision: {cp_score}")
print(f"Answer Correctness: {ac_score}")
print(f"Answer Faithfulness: {af_score}")



Question: Did the U.S. join the League of Nations?
Ground Truth: no
Answer: The U.S. did not join the League of Nations. As mentioned in the context, "Wilson's own Congress did not accept the League and only four of the original Fourteen Points were implemented fully in Europe." Additionally, it states that "Coolidge saw the landslide Republican victory of 1920 as a rejection of the Wilsonian idea that the United States should join the League of Nations."




Context Recall: 0.0
Context Precision: 0.4
Answer Correctness: 1.0
Answer Faithfulness: 1.0


Nice. The metrics are working as expected. We can now run the evaluation on the whole dataset, again using the `AsyncRunner` from `easyrag`.

In [13]:
from easyrag.runner import AsyncRunner
import logging

logging.getLogger().setLevel(logging.ERROR)

async def calc_metrics(sample):
    cr_score = await cr.acompute(context=sample["context"],ground_truth=sample["ground_truth"])
    cp_score = await cp.acompute(context=sample["context"],ground_truth=sample["ground_truth"],question=sample["question"])
    ac_score = await ac.acompute(answer=sample["answer"],ground_truth=sample["ground_truth"],question=sample["question"])
    af_score = await af.acompute(answer=sample["answer"],context=sample["context"])
    return {**sample, **{"context_recall": cr_score, "context_precision": cp_score, "answer_correctness": ac_score, "answer_faithfulness": af_score}}

r = AsyncRunner(concurrency_limit=16, callable=calc_metrics)

metrics = r.run(results)

  1%|          | 1/100 [00:01<02:17,  1.39s/it]ERROR:easyrag.metrics.answer_correctness:Response from LLM is not valid JSON: {"result": {"reason": "The answer provided does not directly answer the question, as it does not explicitly state the most popular rock group in Finland. Instead, it offers a list of popular Finnish rock bands and mentions that CMX is arguably one of Finland's most domestically popular rock groups. While this information is relevant, it does not definitively answer the question. The answer should have clearly stated the most popular rock group in Finland based on the provided context or acknowledged the lack of definitive information.", "score": "0"}}
question: "What is the capital of Australia?"
ground_truth: "The capital of Australia is Canberra."
answer: "Canberra is the capital city of Australia."
{"result": {"reason": "The answer provided is correct based on the question and the context given. The question asks for the capital of Australia, and the answer co

In [14]:
from datasets import Dataset 

results_with_metrics = Dataset.from_list(metrics)
# let's save it just in case
results_with_metrics.to_json("results_with_metrics.json")


Let's calculate our final metrics and print them out.

In [15]:
metrics = {
    "context_recall": [],
    "context_precision": [],
    "answer_correctness": [],
    "answer_faithfulness": [],
}

# flatten the results
for sample in results_with_metrics:
    metrics["context_recall"].append(sample["context_recall"])
    metrics["context_precision"].append(sample["context_precision"])
    metrics["answer_correctness"].append(sample["answer_correctness"])
    metrics["answer_faithfulness"].append(sample["answer_faithfulness"])

# replace None with 0 for the metrics to count them as false
metrics = {k: [0 if v is None else v for v in metrics[k]] for k in metrics}

    
print(f"Context Recall: {sum(metrics['context_recall'])/len(metrics['context_recall'])*100:.2f}%")
print(f"Context Precision: {sum(metrics['context_precision'])/len(metrics['context_precision'])*100:.2f}%")
print(f"Answer Correctness: {sum(metrics['answer_correctness'])/len(metrics['answer_correctness'])*100:.2f}%")
print(f"Answer Faithfulness: {sum(metrics['answer_faithfulness'])/len(metrics['answer_faithfulness'])*100:.2f}%")

Context Recall: 57.50%
Context Precision: 31.35%
Answer Correctness: 75.00%
Answer Faithfulness: 89.00%
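
Because `None` scores are counted as 0, failed evaluations lower the averages above. The sketch below is a small helper that computes the same percentage and also reports how many samples were unscored, which makes that effect visible. The name `mean_percent` and the example scores are hypothetical.

```python
from typing import List, Optional, Tuple


def mean_percent(scores: List[Optional[float]]) -> Tuple[float, int]:
    """Return the mean as a percentage (None counted as 0) and the number of None values."""
    if not scores:
        raise ValueError("scores must not be empty")
    missing = sum(score is None for score in scores)
    total = sum(0 if score is None else score for score in scores)
    return total / len(scores) * 100, missing


percent, missing = mean_percent([1.0, 0.0, None, 1.0])
print(f"{percent:.2f}% ({missing} unscored)")  # 50.00% (1 unscored)
```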
