# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the [RAGAS](https://github.com/explodinggradients/ragas) framework for our evaluations today, as it's becoming a standard method of evaluating RAG systems (at least directionally).

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [1]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

In [2]:
import os
import openai
from dotenv import load_dotenv

load_dotenv()

True

### Data Collection

We're going to be using papers from arXiv as our context today.

We can collect these documents rather straightforwardly with the `ArxivLoader` document loader from LangChain.

Let's grab and load 5 documents.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [3]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()
len(base_docs)

5

In [5]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.'}
{'Published': '2023-05-11', 'Title': 'Active Retrieval Augmented Generation', 'Authors': 'Zhengb

### Creating an Index

Let's use a naive index creation strategy of just using `RecursiveCharacterTextSplitter` on our documents and embedding each into our `VectorStore` using `OpenAIEmbeddings()`.

Let's use a rather generic 1,000-character chunk size with a 100-character overlap.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)
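
To make the roles of `chunk_size` and `chunk_overlap` in the next cell concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences, which this toy version does not. The helper name `sliding_window_chunks` is hypothetical and is not part of LangChain.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap exists so that a sentence cut at a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.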

In [6]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000,
                                               chunk_overlap = 100,
                                               length_function = len)


docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [7]:
len(docs)

357

In [8]:
print(max([len(chunk.page_content) for chunk in docs]))

998


### Setting Up our Basic QA Chain

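Now we can instantiate our basic `RetrievalQA` chain. The retriever below uses the `as_retriever()` defaults; to retrieve exactly the top `k=3` documents, pass `search_kwargs={"k": 3}` to `as_retriever()`.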

- [`RetrievalQA`](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html?highlight=retrievalqa#langchain-chains-retrieval-qa-base-retrievalqa)

In [10]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

primary_qa_llm = ChatOpenAI(
    model_name="gpt-3.5-turbo-16k", 
    temperature=0
)

qa_chain = RetrievalQA.from_chain_type(
    llm = primary_qa_llm,
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

Let's test it out!

In [11]:
query = "What is RAG?"

result = qa_chain({"query" : query})

print(result["result"])

RAG stands for Retrieval-Augmented Generation. It is a framework that combines retrieval and generation mechanisms to enhance language models. In the context of the given text, HybridRAG is a specific implementation of the RAG framework for real-time composition assistance. It leverages a hybrid setting that combines both client and cloud models to generate retrieval-augmented memory asynchronously.


### Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4

The next section might take you a long time to run, so the evaluation dataset is provided.

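The basic idea is that we can use LangChain to have one model (`gpt-3.5-turbo-16k`) write a question for each context chunk, and then have a stronger model (`gpt-4`) answer each question from that same chunk. Those answers become our ground truth.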

Let's look at how that works in the code!

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

texts = docs

In [13]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [14]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [15]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

{format_instructions}

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=texts[0],
    format_instructions=format_instructions
)

response = primary_qa_llm(messages)
output_dict = question_output_parser.parse(response.content)

In [16]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What are the advantages of retrieval-augmented text generation compared to conventional generation models?


In [17]:
%pip install -q -U tqdm

Note: you may need to restart the kernel to use updated packages.


In [18]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(texts[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = primary_qa_llm(messages)
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 10/10 [00:57<00:00,  5.73s/it]


In [19]:
primary_ground_truth_llm = ChatOpenAI(model_name="gpt-4", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer to the question, based on the context.

{format_instructions}

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

response = primary_ground_truth_llm(messages)
output_dict = answer_output_parser.parse(response.content)

In [20]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
Retrieval-augmented text generation has remarkable advantages over conventional generation models, particularly achieving state-of-the-art performance in many NLP tasks. It follows a generic paradigm and has been applied successfully in different tasks including dialogue response generation, machine translation, and other generation tasks.


In [21]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = primary_ground_truth_llm(messages)
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 10/10 [00:54<00:00,  5.41s/it]


In [24]:
%pip install -q -U datasets

Note: you may need to restart the kernel to use updated packages.


In [25]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)



In [26]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 112.49ba/s]


16039

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly, uncomment the code below.

If you're running this notebook in Colab, make sure you upload `groundtruth_eval_dataset.csv` to your session files first.

In [None]:
# from datasets import Dataset
# eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

In [27]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth', 'metadata'],
    num_rows: 10
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.

More details on the specific metrics can be found [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)!

In [28]:
%pip install ragas

Collecting ragas
  Using cached ragas-0.0.16-py3-none-any.whl (38 kB)
Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting sentence-transformers
  Using cached sentence_transformers-2.2.2-py3-none-any.whl
Collecting pydantic<2.0
  Using cached pydantic-1.10.13-cp311-cp311-macosx_11_0_arm64.whl (2.5 MB)
Collecting pysbd>=0.3.4
  Using cached pysbd-0.3.4-py3-none-any.whl (71 kB)
Collecting torch>=1.6.0
  Using cached torch-2.0.1-cp311-none-macosx_11_0_arm64.whl (55.8 MB)
Collecting torchvision
  Using cached torchvision-0.15.2-cp311-cp311-macosx_11_0_arm64.whl (1.4 MB)
Collecting scikit-learn
  Using cached scikit_learn-1.3.1-cp311-cp311-macosx_12_0_arm64.whl (9.4 MB)
Collecting scipy
  Using cached scipy-1.11.3-cp311-cp311-macosx_12_0_arm64.whl (29.7 MB)
Collecting nltk
  Using cached nltk-3.8.1-py3-none-a

In [29]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline({"query" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["result"],
         "contexts" : [context.page_content for context in answer["source_documents"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
  )
  return result

Let's create our dataset first:

In [31]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(
    qa_chain,
    eval_dataset)

100%|██████████| 10/10 [00:50<00:00,  5.10s/it]


Save it for later:

In [None]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

And finally - evaluate how it did!

In [32]:
basic_qa_result = evaluate_ragas_dataset(
    basic_qa_ragas_dataset
)

Downloading (…)lve/main/config.json: 100%|██████████| 647/647 [00:00<00:00, 2.85MB/s]
Downloading pytorch_model.bin: 100%|██████████| 57.4M/57.4M [00:10<00:00, 5.26MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 517/517 [00:00<00:00, 2.54MB/s]
Downloading (…)solve/main/vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 875kB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 1.03MB/s]


evaluating with [context_relevancy]


100%|██████████| 1/1 [00:37<00:00, 37.25s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:48<00:00, 108.23s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:24<00:00, 24.76s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:55<00:00, 55.29s/it]


In [33]:
basic_qa_result

{'ragas_score': 0.0799, 'context_relevancy': 0.0215, 'faithfulness': 0.7800, 'answer_relevancy': 0.9903, 'context_recall': 0.8250}
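
The very low `ragas_score` above is worth a second look. The component scores are consistent with `ragas_score` being a harmonic mean of the four metrics, which a single near-zero metric (here `context_relevancy`) drags down sharply. The sketch below recomputes it with the standard library from the numbers printed above. Treating `ragas_score` as a plain harmonic mean is an assumption about this RAGAS version, not something the output itself states.

```python
from statistics import harmonic_mean

# Component scores copied from the basic_qa_result output above.
basic_scores = {
    "context_relevancy": 0.0215,
    "faithfulness": 0.7800,
    "answer_relevancy": 0.9903,
    "context_recall": 0.8250,
}

# Assumption: ragas_score is the harmonic mean of the component metrics.
combined = harmonic_mean(list(basic_scores.values()))
print(f"harmonic mean of all four metrics: {combined:.4f}")

# The same calculation without context_relevancy, to show how strongly
# one very small value dominates a harmonic mean.
without_context = harmonic_mean(
    [score for name, score in basic_scores.items() if name != "context_relevancy"]
)
print(f"harmonic mean without context_relevancy: {without_context:.4f}")
```

The first value lands within rounding distance of the reported `ragas_score`, so when comparing retrievers it is usually more informative to read the individual metrics than the combined score.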

### Testing Other Retrievers

Now we can test how changing our retriever impacts our RAGAS evaluation!

In [34]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo-16k", temperature=0)

  created_qa_chain = RetrievalQA.from_chain_type(
      primary_qa_llm,
      retriever=retriever,
      return_source_documents=True
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks for embedding, search over those small chunks, and then return the larger "parent" chunk that surrounds each match, so the LLM receives significantly more context than the small chunk alone.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

- [`ParentDocumentRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html?highlight=parentdocumentretriever#langchain-retrievers-parent-document-retriever-parentdocumentretriever)
- [`InMemoryStore`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html?highlight=parentdocumentretriever#langchain-retrievers-parent-document-retriever-parentdocumentretriever)
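
The small-chunk search and large-chunk return described above can be sketched without any libraries. This is a toy illustration under stated assumptions, not how `ParentDocumentRetriever` is implemented: chunks are split by word count rather than characters, and a word-overlap score (`toy_score`, a hypothetical helper) stands in for embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Child:
    text: str
    parent_id: int  # index of the parent chunk this child was cut from


def split_words(text: str, size: int) -> list[str]:
    """Split text into pieces of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def build_index(document: str, parent_size: int, child_size: int):
    """Cut the document into parents, then cut each parent into children."""
    parents = split_words(document, parent_size)
    children = [
        Child(piece, parent_id)
        for parent_id, parent in enumerate(parents)
        for piece in split_words(parent, child_size)
    ]
    return parents, children


def toy_score(query: str, text: str) -> int:
    """Stand-in for embedding similarity: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parent(query: str, parents: list[str], children: list[Child]):
    """Match against the small child chunks, but return the full parent."""
    best = max(children, key=lambda child: toy_score(query, child.text))
    return best.text, parents[best.parent_id]


document = (
    "Retrieval augmented generation pairs a retriever with a generator. "
    "The retriever finds passages in an external corpus. "
    "The generator conditions on those passages. "
    "BM25 is a sparse keyword ranking function. "
    "Dense retrievers compare embedding vectors instead."
)
parents, children = build_index(document, parent_size=12, child_size=4)
matched_child, returned_parent = retrieve_parent("sparse keyword ranking", parents, children)
print("matched child  :", matched_child)
print("returned parent:", returned_parent)
```

The matched child is only four words long, while the returned parent carries the surrounding sentences. That is the trade-off this retriever aims for: precise matching on small chunks, richer context for generation.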

In [35]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

vectorstore = Chroma(
    collection_name="split_parents", 
    embedding_function= OpenAIEmbeddings()
)

store = InMemoryStore()

In [37]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter)

In [38]:
parent_document_retriever.add_documents(base_docs)

Let's create, test, and then evaluate our new chain!

In [39]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [40]:
parent_document_retriever_qa_chain({"query" : "What is RAG?"})["result"]

'RAG stands for Retrieval-Augmented Generation. It is an approach that combines retrieval and generation models to enhance the performance of language models. In the RAG framework, a retrieval model is used to retrieve relevant information from external documents, and this information is then used by the generation model to generate responses or completions.'

In [41]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:48<00:00,  4.81s/it]


In [None]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

In [42]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

evaluating with [context_relevancy]


100%|██████████| 1/1 [00:48<00:00, 48.80s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:54<00:00, 114.55s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:28<00:00, 28.29s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:43<00:00, 43.88s/it]


In [43]:
pdr_qa_result

{'ragas_score': 0.0990, 'context_relevancy': 0.0270, 'faithfulness': 0.9067, 'answer_relevancy': 0.9924, 'context_recall': 0.8000}

#### Ensemble Retrieval

There are a number of excellent options for retrieving documents - we'll look at one more today: the `EnsembleRetriever`.

The method this is using is outlined in [this paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf).

The brief explanation is:

- We collect results from two different retrieval methods over the same corpus
- We apply a rank-fusion algorithm to merge the two result lists, so the most relevant documents rise to the top while information-rich documents that one method ranked low are not lost
- We feed the top-k results into the LLM with our query as context.

> HINT: Your weight list should be of type List[float] and the sum(List[float]) should be 1.
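
The fusion step and the role of the weights can be sketched in a few lines. This is a minimal weighted Reciprocal Rank Fusion, assuming each retriever returns an ordered list of document ids and using the constant `c=60` suggested in the linked paper. The function name `weighted_rrf` and the document ids are hypothetical, and this is not LangChain's own implementation.

```python
from collections import defaultdict


def weighted_rrf(rankings: list[list[str]], weights: list[float], c: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores sum(weight / (c + rank))."""
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        # Ranks start at 1, so the top hit contributes weight / (c + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense_results = ["doc_a", "doc_b", "doc_c"]   # e.g. from a vector store
sparse_results = ["doc_c", "doc_a", "doc_d"]  # e.g. from BM25

fused = weighted_rrf([dense_results, sparse_results], weights=[0.5, 0.5])
print(fused)
```

Documents that appear in both lists accumulate score from each, so they tend to outrank documents that only one retriever found, while those single-retriever documents still stay in the fused list.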

We'll be leveraging the following tools:

- [`BM25Retriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.bm25.BM25Retriever.html)
- [`EnsembleRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.ensemble.EnsembleRetriever.html)

##### High Level Diagram

The `EnsembleRetriever` leverages the [Reciprocal Rank Fusion (RRF)](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm to combine sparse and dense search results, which can make retrieval of relevant documents more effective.

![image](https://i.imgur.com/mn4jXAz.png)

In [51]:
%pip install rank_bm25

Collecting rank_bm25
  Using cached rank_bm25-0.2.2-py3-none-any.whl (8.6 kB)
Installing collected packages: rank_bm25
Successfully installed rank_bm25-0.2.2

Note: you may need to restart the kernel to use updated packages.


In [54]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 1000, chunk_overlap = 200, length_function = len)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(
    docs
)
bm25_retriever.k = 4

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever()

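# Equal weights for the dense and sparse retrievers; the weights should sum to 1.
ensemble_retriever = EnsembleRetriever(retrievers=[chroma_retriever, bm25_retriever], weights=[0.5, 0.5])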

In [55]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [56]:
ensemble_retriever_qa_chain({"query" : "What is RAG?"})["result"]

'RAG stands for Retrieval-Augmented Generation. It is a framework that combines retrieval-based models with generation-based models to enhance their performance in natural language processing tasks. The retrieval component helps improve contextual understanding and reduces hallucination, while the generation component allows for creative and fluent text generation.'

In [57]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:48<00:00,  4.89s/it]


In [58]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 186.30ba/s]


70941

In [59]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

evaluating with [context_relevancy]


100%|██████████| 1/1 [00:44<00:00, 44.04s/it]


evaluating with [faithfulness]


100%|██████████| 1/1 [01:29<00:00, 89.57s/it]


evaluating with [answer_relevancy]


100%|██████████| 1/1 [00:19<00:00, 19.44s/it]


evaluating with [context_recall]


100%|██████████| 1/1 [00:39<00:00, 39.85s/it]


In [60]:
ensemble_qa_result

{'ragas_score': 0.0800, 'context_relevancy': 0.0214, 'faithfulness': 0.8667, 'answer_relevancy': 0.9636, 'context_recall': 0.8833}

### Conclusion

Observe your results in a table!

In [None]:
### YOUR CODE HERE
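
One possible way to line the three runs up side by side is sketched below, using only the standard library and the scores printed earlier in this notebook. The row labels (`basic`, `parent_document`, `ensemble`) are names chosen here for illustration.

```python
# Scores copied from the three result outputs above.
results = {
    "basic": {
        "ragas_score": 0.0799, "context_relevancy": 0.0215,
        "faithfulness": 0.7800, "answer_relevancy": 0.9903, "context_recall": 0.8250,
    },
    "parent_document": {
        "ragas_score": 0.0990, "context_relevancy": 0.0270,
        "faithfulness": 0.9067, "answer_relevancy": 0.9924, "context_recall": 0.8000,
    },
    "ensemble": {
        "ragas_score": 0.0800, "context_relevancy": 0.0214,
        "faithfulness": 0.8667, "answer_relevancy": 0.9636, "context_recall": 0.8833,
    },
}

metrics = ["ragas_score", "context_relevancy", "faithfulness", "answer_relevancy", "context_recall"]

# Print a fixed-width table: one row per retriever, one column per metric.
print(f"{'retriever':<16}" + "".join(f"{metric:>19}" for metric in metrics))
for name, scores in results.items():
    print(f"{name:<16}" + "".join(f"{scores[metric]:>19.4f}" for metric in metrics))

# For each metric, record which retriever scored highest.
best_per_metric = {
    metric: max(results, key=lambda name: results[name][metric])
    for metric in metrics
}
print(best_per_metric)
```

With only 10 evaluation questions, small differences between rows may well be noise, so the table is best read directionally.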