# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the [RAGAS](https://github.com/explodinggradients/ragas) framework for our evaluations today, as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [39]:
!pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

In [41]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

### Data Collection

We're going to be using papers from Arxiv as our context today.

We can collect these documents rather straightforwardly with the `ArxivLoader` document loader from LangChain.

Let's grab and load 5 documents.

- [`ArxivLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.arxiv.ArxivLoader.html)

In [42]:
from langchain.document_loaders import ArxivLoader

base_docs = ArxivLoader(query="Retrieval Augmented Generation", load_max_docs=5).load()
len(base_docs)

5

In [43]:
for doc in base_docs:
  print(doc.metadata)

{'Published': '2022-02-13', 'Title': 'A Survey on Retrieval-Augmented Text Generation', 'Authors': 'Huayang Li, Yixuan Su, Deng Cai, Yan Wang, Lemao Liu', 'Summary': 'Recently, retrieval-augmented text generation attracted increasing attention\nof the computational linguistics community. Compared with conventional\ngeneration models, retrieval-augmented text generation has remarkable\nadvantages and particularly has achieved state-of-the-art performance in many\nNLP tasks. This paper aims to conduct a survey about retrieval-augmented text\ngeneration. It firstly highlights the generic paradigm of retrieval-augmented\ngeneration, and then it reviews notable approaches according to different tasks\nincluding dialogue response generation, machine translation, and other\ngeneration tasks. Finally, it points out some important directions on top of\nrecent methods to facilitate future research.'}
{'Published': '2023-05-11', 'Title': 'Active Retrieval Augmented Generation', 'Authors': 'Zhengb

### Creating an Index

Let's use a naive index creation strategy of just using `RecursiveCharacterTextSplitter` on our documents and embedding each into our `VectorStore` using `OpenAIEmbeddings()`.

Let's use a rather generic 500 character chunk size.

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)
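
To make the splitting step less of a black box, here is a minimal pure-Python sketch of the recursive idea: try coarse separators first, and only fall back to finer ones when a piece is still too long. The `split_text` function is a hypothetical helper written for this illustration. It is not LangChain's implementation, which also handles chunk overlap and other details.

```python
def split_text(text, chunk_size=500, separators=("\n\n", "\n", " ", "")):
    """Recursively split `text` so that no chunk exceeds `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    sep, rest = separators[0], separators[1:]
    # The empty separator is the last resort: cut at fixed character offsets.
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is still too long: retry with a finer separator.
            chunks.extend(split_text(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta gamma\n\n" + "x" * 1200 + "\n\nshort tail"
for piece in split_text(sample, chunk_size=500):
    print(len(piece))
```

Every chunk stays at or below `chunk_size`, which is the same property we check on the real splitter's output further down.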

In [44]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

docs = text_splitter.split_documents(base_docs)

vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

In [45]:
len(docs)

1061

In [46]:
print(max([len(chunk.page_content) for chunk in docs]))

499


### Setting Up our Basic QA Chain

Now we can instantiate our basic `RetrievalQA` chain. Let's retrieve the top `k=3` documents for each query.

- [`RetrievalQA`](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval_qa.base.RetrievalQA.html?highlight=retrievalqa#langchain-chains-retrieval-qa-base-retrievalqa)
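
As a rough picture of what "top `k`" means, the sketch below ranks a handful of chunks by cosine similarity to a query vector and keeps the best three. It assumes cosine similarity and uses made-up 3-dimensional vectors, so `cosine_similarity`, `top_k`, and `toy_index` are illustrative names only. The actual vector store may use a different distance measure, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, indexed_chunks, k=3):
    """Return the k chunk texts whose vectors are most similar to the query vector."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy "embeddings" invented for this example.
toy_index = [
    ("chunk about retrieval", [0.9, 0.1, 0.0]),
    ("chunk about generation", [0.1, 0.9, 0.0]),
    ("chunk about evaluation", [0.0, 0.1, 0.9]),
    ("chunk about retrieval and generation", [0.7, 0.7, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=3))
```

The retrieved chunks are what the chain places into the prompt as context for the LLM.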

In [47]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

primary_qa_llm = ChatOpenAI(
    model_name="gpt-3.5-turbo-16k", 
    temperature=0
)
retriever = vectorstore.as_retriever(search_kwargs={"k":3})
qa_chain = RetrievalQA.from_chain_type(
    llm=primary_qa_llm,
    retriever=retriever,
    return_source_documents=True
)

Let's test it out!

In [48]:
query = "What is RAG?"

result = qa_chain({"query" : query})

print(result["result"])

RAG stands for Retrieval-Augmented Generation. It is a framework that combines both client and cloud models to overcome the limitations of small language models on edge devices. RAG incorporates retrieval-augmented memory generated asynchronously by a Large Language Model (LLM) in the cloud, allowing the client model to generate highly effective responses.


### Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!
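
Before the real code, it may help to see roughly what "parsing structured output" involves: pull a JSON object out of the model's reply and check that the keys we asked for are present. The sketch below is a simplified stand-in written for this notebook. `parse_structured_output` and the sample `reply` are hypothetical, and LangChain's `StructuredOutputParser` may behave differently in its details.

```python
import json

FENCE = "`" * 3  # built programmatically so this example can live inside a fenced block


def parse_structured_output(text, expected_keys):
    """Extract a JSON object from model output and check that the expected keys exist."""
    start = text.find(FENCE + "json")
    if start != -1:
        start += len(FENCE + "json")
        end = text.find(FENCE, start)
        payload = text[start:end] if end != -1 else text[start:]
    else:
        payload = text  # fall back to treating the whole reply as JSON
    parsed = json.loads(payload.strip())
    missing = [key for key in expected_keys if key not in parsed]
    if missing:
        raise ValueError(f"model output is missing keys: {missing}")
    return parsed


# A hypothetical model reply, invented for this example.
reply = FENCE + 'json\n{"question": "What does the survey review?"}\n' + FENCE
print(parse_structured_output(reply, ["question"]))
```

Malformed replies raise an exception here, which is why the generation loops in this section wrap the parse step in `try`/`except` and skip any chunk that fails.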

In [49]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

texts = docs

In [50]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [51]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [52]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

{format_instructions}

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=texts[0],
    format_instructions=format_instructions
)

response = primary_qa_llm(messages)
output_dict = question_output_parser.parse(response.content)

In [53]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What is the main focus of the paper 'A Survey on Retrieval-Augmented Text Generation'?


In [54]:
!pip install -q -U tqdm

In [55]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(texts[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = primary_qa_llm(messages)
  try:
    output_dict = question_output_parser.parse(response.content)
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)


In [56]:
primary_ground_truth_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer to the question, based on the context.

{format_instructions}

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

response = primary_ground_truth_llm(messages)
output_dict = answer_output_parser.parse(response.content)

In [57]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
The main focus of the paper 'A Survey on Retrieval-Augmented Text Generation' is to conduct a survey about retrieval-augmented text generation.


In [58]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = primary_ground_truth_llm(messages)
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]


In [59]:
!pip install -q -U datasets

In [60]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [61]:
eval_dataset.to_csv("../data/groundtruth_eval_dataset1.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 328.09ba/s]


7651

### Evaluating RAG Pipelines

If you skipped ahead and need to load the `.csv` directly, run the cell below.

If you're using Colab to do this notebook - please ensure you add it to your session files.

In [62]:
from datasets import Dataset
eval_dataset = Dataset.from_csv("../data/groundtruth_eval_dataset.csv")

In [63]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 10
})

### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.

More details on the specific metrics can be found [here](https://github.com/explodinggradients/ragas/blob/main/docs/metrics.md)!
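
Since the evaluation depends on the dataset having the right shape, a quick sanity check on each row can save a failed run. The sketch below assumes the four columns this notebook builds in `create_ragas_dataset` (`question`, `answer`, `contexts`, `ground_truths`). `validate_row` and `example_row` are hypothetical, and the column names RAGAS expects can differ between versions.

```python
REQUIRED_COLUMNS = {
    "question": str,
    "answer": str,
    "contexts": list,       # one string per retrieved chunk
    "ground_truths": list,  # one or more reference answers
}


def validate_row(row):
    """Return a list of problems found in one evaluation row (empty list means OK)."""
    problems = []
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column} should be {expected_type.__name__}")
        elif expected_type is list and not all(isinstance(item, str) for item in row[column]):
            problems.append(f"{column} should contain only strings")
    return problems


# Placeholder text only; this is not real pipeline output.
example_row = {
    "question": "placeholder question",
    "answer": "placeholder generated answer",
    "contexts": ["placeholder retrieved chunk 1", "placeholder retrieved chunk 2"],
    "ground_truths": ["placeholder reference answer"],
}
print(validate_row(example_row))
```

An empty list means the row has every column in the expected form.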

In [64]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline({"query" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["result"],
         "contexts" : [context.page_content for context in answer["source_documents"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
  )
  return result

Let's create our dataset first:

In [65]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(qa_chain,eval_dataset)


In [69]:
# Updating to other key for better throughput
openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Save it for later:

In [70]:
basic_qa_ragas_dataset.to_csv(
    "../data/basic_qa_ragas_dataset1.csv"
)

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 502.73ba/s]


24646

And finally - evaluate how it did!

In [71]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)


evaluating with [context_relevancy]


  0%|          | 0/1 [01:23<?, ?it/s]


KeyboardInterrupt: 

In [None]:
basic_qa_result

### Testing Other Retrievers

Now we can test how changing our Retriever impacts our RAGAS evaluation!

In [76]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo-0301", temperature=0)

  created_qa_chain = RetrievalQA.from_chain_type(
      primary_qa_llm,
      retriever=retriever,
      return_source_documents=True
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks for embedding, and then, at query time, return the larger parent chunk that "surrounds" each matching small chunk.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

- [`ParentDocumentRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html?highlight=parentdocumentretriever#langchain-retrievers-parent-document-retriever-parentdocumentretriever)
- [`InMemoryStore`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.parent_document_retriever.ParentDocumentRetriever.html?highlight=parentdocumentretriever#langchain-retrievers-parent-document-retriever-parentdocumentretriever)
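
The small-to-big idea can be sketched without any libraries: search over small child chunks, but hand back the larger parent chunk each hit came from. In the sketch below, plain word overlap stands in for embedding similarity, and `chunk`, `build_index`, and `retrieve_parents` are hypothetical helpers written for illustration. It is not how `ParentDocumentRetriever` is implemented.

```python
def chunk(text, size):
    """Cut text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_index(documents, parent_size=2000, child_size=400):
    """Map every small child chunk to the id of the larger parent chunk it came from."""
    parents, children = {}, []
    for document in documents:
        for parent_text in chunk(document, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent_text
            for child_text in chunk(parent_text, child_size):
                children.append((child_text, parent_id))
    return parents, children


def retrieve_parents(query, parents, children, k=2):
    """Score children by word overlap with the query, then return their unique parents."""
    query_words = set(query.lower().split())
    ranked = sorted(
        children,
        key=lambda item: len(query_words & set(item[0].lower().split())),
        reverse=True,
    )
    seen, results = set(), []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results


# Two synthetic "documents" made of repeated phrases, purely for demonstration.
documents = ["retrieval augmented generation " * 100, "unrelated cooking recipes " * 100]
parents, children = build_index(documents, parent_size=200, child_size=50)
print(retrieve_parents("what is retrieval", parents, children, k=1)[0][:60])
```

The match happens on a 50-character child, but the caller receives the 200-character parent. That is the trade-off this retriever aims for: precise matching with richer context for generation.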

In [77]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)  # larger chunks returned to the LLM
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)  # smaller chunks that get embedded

vectorstore = Chroma(
    collection_name="split_parents",
    embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()  # holds the parent chunks

In [78]:
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter
)

In [79]:
parent_document_retriever.add_documents(base_docs)



KeyboardInterrupt: 

Let's create, test, and then evaluate our new chain!

In [None]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)


In [None]:
parent_document_retriever_qa_chain({"query" : "What is RAG?"})["result"]

In [None]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)


In [None]:
pdr_qa_ragas_dataset.to_csv("../data/pdr_qa_ragas_dataset1.csv")

In [None]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

In [None]:
pdr_qa_result

#### Ensemble Retrieval

There are a number of excellent options for retrieving documents - today we'll look at one more example, the `EnsembleRetriever`.

The method this is using is outlined in [this paper](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf).

The brief explanation is:

- We collect results from two different retrieval methods over the same corpus
- We apply a reranking algorithm to reorder our source documents by relevance without losing specific or potentially low-ranked, information-rich documents
- We feed the top-k results into the LLM with our query as context.

> HINT: Your weight list should be of type List[float] and the sum(List[float]) should be 1.
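
The fusion step above can be sketched as a weighted form of Reciprocal Rank Fusion: each document earns `weight / (c + rank)` from every ranked list it appears in, and the totals decide the final order. The constant `c=60` follows the linked paper. `weighted_rrf` and the document ids are hypothetical names for illustration, and the weighting shown here is an assumption about how the per-retriever weights are applied rather than a copy of the library's code.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(weight / (c + rank)) over the lists it appears in,
    where rank starts at 1 for the top result.
    """
    if len(rankings) != len(weights):
        raise ValueError("need exactly one weight per ranking")
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


sparse_hits = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword (BM25-style) ranking
dense_hits = ["doc_c", "doc_a", "doc_d"]   # hypothetical embedding-based ranking
print(weighted_rrf([sparse_hits, dense_hits], weights=[0.5, 0.5]))
```

Documents that both retrievers return (`doc_a` and `doc_c` here) rise above documents found by only one of them, even when neither retriever ranked them first.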

We'll be leveraging the following tools:

- [`BM25Retriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.bm25.BM25Retriever.html)
- [`EnsembleRetriever`](https://api.python.langchain.com/en/latest/retrievers/langchain.retrievers.ensemble.EnsembleRetriever.html)

##### High Level Diagram

The ensemble uses the [RRF](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) reranking algorithm to combine sparse and dense search results, with the goal of retrieving relevant documents more effectively.

![image](https://i.imgur.com/mn4jXAz.png)

In [None]:
!pip install -q -U rank_bm25

In [None]:
from langchain.retrievers import BM25Retriever, EnsembleRetriever

text_splitter = RecursiveCharacterTextSplitter()
docs = text_splitter.split_documents(base_docs)

# Sparse (keyword) retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

# Dense (embedding) retriever
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 2})

# Equal weights; the weights should sum to 1
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.5, 0.5],
)

In [None]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [None]:
ensemble_retriever_qa_chain({"query" : "What is RAG?"})["result"]

In [None]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

In [None]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

In [None]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)

In [None]:
ensemble_qa_result

### Conclusion

Observe your results in a table!

In [None]:
### YOUR CODE HERE
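
One possible way to line the results up side by side is a small helper that turns a mapping of pipeline name to metric scores into a Markdown table. `results_table` is a hypothetical helper and only a sketch. It assumes each result can be treated as a plain dictionary of metric name to score, which may not hold for every RAGAS version.

```python
def results_table(results):
    """Format {pipeline_name: {metric_name: score}} as a Markdown table."""
    metrics = sorted({metric for scores in results.values() for metric in scores})
    lines = [
        "| pipeline | " + " | ".join(metrics) + " |",
        "|---|" + "---|" * len(metrics),
    ]
    for name, scores in results.items():
        # Pipelines that lack a metric get "n/a" instead of a number.
        cells = [f"{scores[m]:.3f}" if m in scores else "n/a" for m in metrics]
        lines.append(f"| {name} | " + " | ".join(cells) + " |")
    return "\n".join(lines)
```

Under that assumption, a call such as `print(results_table({"basic": dict(basic_qa_result), "parent document": dict(pdr_qa_result), "ensemble": dict(ensemble_qa_result)}))` would print one row per pipeline and one column per metric.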