# Evaluation with RAGAS and Advanced Retrieval Methods Using LangChain

In the following notebook we'll discuss a major component of LLM Ops:

- Evaluation

We're going to be leveraging the [RAGAS]() framework for our evaluations today as it's becoming a standard method of evaluating (at least directionally) RAG systems.

We're also going to discuss a few more powerful Retrieval Systems that can potentially improve the quality of our generations!

Let's start as we always do: Grabbing our dependencies!

In [2]:
# !pip install -U -q langchain openai ragas arxiv pymupdf chromadb wandb tiktoken

# Step 1: Load dataset and Modules

In [1]:
from helper_utils import word_wrap
from pypdf import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
import os
import openai
from openai import OpenAI
from dotenv import load_dotenv, find_dotenv
import umap.umap_ as umap
import numpy as np
from tqdm import tqdm
from sentence_transformers import CrossEncoder



In [2]:
import os
import openai

# Load OPENAI_API_KEY from a local .env file (load_dotenv was imported in the previous cell)
_ = load_dotenv('.env')
openai.api_key = os.environ['OPENAI_API_KEY']

### Data Collection

We're going to use Tesla's 2023 Form 10-K annual report as our context today.

We read the PDF with `PdfReader` from `pypdf`, split each page into chunks (first by characters, then by tokens), and wrap every chunk in a LangChain `Document` that keeps its page number as metadata.

Finally, we embed the chunks with a SentenceTransformer model and store them in a Chroma vector store.

In [32]:
# from PyPDF2 import PdfReader
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from langchain.schema import Document
from sentence_transformers import SentenceTransformer

# Load Tesla 2023 10K report
reader = PdfReader("../data/tesla10K.pdf")

# Extract text from each page and store with page numbers
pdf_texts = []
for page_num, page in enumerate(reader.pages):
    text = page.extract_text().strip()
    if text:
        pdf_texts.append({"page_number": page_num + 1, "content": text})

# Split text by sentences while maintaining page number
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_size=1000,
    chunk_overlap=0
)

# Split each page's content and store in a list with metadata
character_split_texts = []
for entry in pdf_texts:
    chunks = character_splitter.split_text(entry["content"])
    for chunk in chunks:
        character_split_texts.append({"page_number": entry["page_number"], "content": chunk})

# Print an example chunk and total number of chunks
print(character_split_texts[10]["content"])
print(f"\nTotal chunks: {len(character_split_texts)}")

# Tokenize the sentence chunks
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

# Split each character chunk while maintaining metadata
token_split_texts = []
for entry in character_split_texts:
    chunks = token_splitter.split_text(entry["content"])
    for chunk in chunks:
        token_split_texts.append({"page_number": entry["page_number"], "content": chunk})

# Create base_docs structure
base_docs = []
for entry in token_split_texts:
    base_docs.append(Document(page_content=entry["content"], metadata={"page_number": entry["page_number"]}))

# Print an example document from base_docs and total number of documents
print(base_docs[10])
print(f"\nTotal documents: {len(base_docs)}")

from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

# Define the embedding function using SentenceTransformer
embedding_function = SentenceTransformerEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')

# Use the embedding function with Chroma
vectorstore = Chroma.from_documents(base_docs, embedding_function)

print("Vectorstore created successfully.")


such	risks	have	occurred	at	the	time	of	this	filing.	We	do	not	assume	any	obligation	to	update	any	forward-looking	statements.

Total chunks: 528
page_content='such risks have occurred at the time of this filing. we do not assume any obligation to update any forward - looking statements.' metadata={'page_number': 4}

Total documents: 556
Vectorstore created successfully.
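
To make the "split while keeping the page number" idea concrete, here is a minimal pure-Python sketch of the first splitting stage. It is a simplified stand-in, not LangChain's `RecursiveCharacterTextSplitter` (which falls back through several separators): it uses a single separator and greedily packs pieces up to a character budget. The function `split_page` and the sample pages are hypothetical.

```python
def split_page(text, chunk_size=40, separator=". "):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Hypothetical pages in the same shape as pdf_texts above
pages = [
    {"page_number": 1, "content": "Tesla designs vehicles. It also sells energy storage. Revenue grew in 2023."},
    {"page_number": 2, "content": "Risk factors follow."},
]

# Every chunk inherits the page number of the page it came from
records = [
    {"page_number": page["page_number"], "content": chunk}
    for page in pages
    for chunk in split_page(page["content"])
]

for record in records:
    print(record)
```

Carrying the page number on every chunk is what lets the retrieved context later report which page of the filing an answer came from.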


In [89]:
import textwrap

# Function to print documents with wrapped text
def print_documents_with_wrap(documents, width=70):
    for doc in documents:
        wrapped_content = textwrap.fill(doc.page_content, width=width)
        print(f"Page Number: {doc.metadata['page_number']}\n")
        print(wrapped_content)
        print("\n" + "-" * 80 + "\n")

# Example usage
print_documents_with_wrap(base_docs)

Page Number: 1

united states securities and exchange commission washington, d. c.
20549 form 10 - k ( mark one ) x annual report pursuant to section 13
or 15 ( d ) of the securities exchange act of 1934 for the fiscal year
ended december 31, 2023 or o transition report pursuant to section 13
or 15 ( d ) of the securities exchange act of 1934 for the transition
period from _ _ _ _ _ _ _ _ _ to _ _ _ _ _ _ _ _ _ commission file
number : 001 - 34756 tesla, inc. ( exact name of registrant as
specified in its charter ) delaware 91 - 2197729 ( state or other
jurisdiction of incorporation or organization ) ( i. r. s. employer
identification no. ) 1 tesla road austin, texas 78725 ( address of
principal executive offices ) ( zip code ) ( 512 ) 516 - 8177 (
registrant ’ s telephone number, including area code ) securities
registered pursuant to section 12 ( b ) of the act : title of each
class trading symbol ( s ) name of each exchange on which registered
common stock tsla the nasdaq global sel

In [33]:
base_retriever = vectorstore.as_retriever(search_kwargs={"k" : 2})

In [34]:
relevant_docs = base_retriever.get_relevant_documents("What is tesla 2023 revenue?")



### Creating an Index

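An alternative, naive index creation strategy is to apply `RecursiveCharacterTextSplitter` to our documents and embed each chunk into our `VectorStore` using `OpenAIEmbeddings()`. The cells below keep that variant commented out, since the vector store above was already built with SentenceTransformer embeddings.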

- [`RecursiveCharacterTextSplitter()`](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html)
- [`Chroma`](https://api.python.langchain.com/en/latest/vectorstores/langchain.vectorstores.chroma.Chroma.html?highlight=chroma#langchain.vectorstores.chroma.Chroma)
- [`OpenAIEmbeddings()`](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.openai.OpenAIEmbeddings.html?highlight=openaiembeddings#langchain-embeddings-openai-openaiembeddings)

In [36]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

# # Create Document objects from base_docs
# documents = [Document(page_content=doc["content"], metadata=doc["metadata"]) for doc in base_docs]

# # Initialize the text splitter
# text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

# # Split the documents
# docs = text_splitter.split_documents(documents)

# # Initialize the vector store with the documents and embeddings
# vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# print("Vector store created successfully.")


In [37]:
# from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# text_splitter = RecursiveCharacterTextSplitter(chunk_size=250)

# docs = text_splitter.split_documents(base_docs)

# vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

## Creating a Retrieval Augmented Generation Prompt

Now we can set up a prompt template that will be used to provide the LLM with the necessary contexts, user query, and instructions!

In [38]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll follow *exactly* the chain we made on Tuesday to keep things simple for now. If you need a refresher on what it looked like, check out last week's notebook!

In [39]:
from operator import itemgetter

from langchain.chat_models import ChatOpenAI
from langchain.schema.output_parser import StrOutputParser
from langchain.schema.runnable import RunnableLambda, RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
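
The comments in the chain describe a dictionary flowing through three steps. As a rough illustration of that data flow only, here is the same shape in plain Python. `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins, not LangChain APIs, so the snippet runs without a vector store or an API key.

```python
def fake_retriever(question):
    # Stand-in for base_retriever: returns a fixed list of "documents"
    return ["total revenues 96,773", "automotive sales 78,509"]

def fake_llm(prompt_text):
    # Stand-in for the chat model: just reports the prompt length
    return f"(answer based on a {len(prompt_text)}-character prompt)"

def run_chain(inputs):
    # Step 1: build "context" from the question and pass the question through
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: the RunnablePassthrough.assign step re-assigns "context" to itself, so state is unchanged
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and keep the context next to the response
    prompt_text = f"### CONTEXT\n{state['context']}\n\n### QUESTION\nQuestion: {state['question']}\n"
    return {"response": fake_llm(prompt_text), "context": state["context"]}

result = run_chain({"question": "What is Tesla's 2023 revenue?"})
print(sorted(result))  # ['context', 'response']
```

Returning the context alongside the response is what makes the later evaluation possible, because the metrics need both the answer and the chunks it was based on.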

Let's test it out!

In [75]:
question = "What is tesla 2023 revenue and show how do you get it"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result)

{'response': AIMessage(content="Answer: Tesla's 2023 revenue is $96,773 million. This is calculated by adding up the total automotive revenues ($82,419 million), energy generation and storage revenues ($6,035 million), and services and other revenues ($8,319 million).", response_metadata={'token_usage': {'completion_tokens': 53, 'prompt_tokens': 586, 'total_tokens': 639}, 'model_name': 'gpt-3.5-turbo', 'system_fingerprint': None, 'finish_reason': 'stop', 'logprobs': None}, id='run-8f500bfe-e856-49eb-9107-1c5072172763-0'), 'context': [Document(page_content='tesla, inc. consolidated statements of operations ( in millions, except per share data ) year ended december 31, 2023 2022 2021 revenues automotive sales $ 78, 509 $ 67, 210 $ 44, 125 automotive regulatory credits 1, 790 1, 776 1, 465 automotive leasing 2, 120 2, 476 1, 642 total automotive revenues 82, 419 71, 462 47, 232 energy generation and storage 6, 035 3, 909 2, 789 services and other 8, 319 6, 091 3, 802 total revenues 96, 77

In [76]:
import json
import textwrap
from langchain.schema import Document
from langchain.schema.messages import AIMessage  # Import AIMessage or other message types as needed

def convert_to_serializable(obj):
    if isinstance(obj, Document):
        return {"page_content": obj.page_content, "metadata": obj.metadata}
    if isinstance(obj, AIMessage):
        return {"content": obj.content}  # Adjust based on actual structure of AIMessage
    raise TypeError(f"Object of type {obj.__class__.__name__} is not JSON serializable")

def pretty_print_dict_wrapped(d, width=70):
    """
    Prints a dictionary in a formatted and more readable way with wrapped text.
    
    Args:
    d (dict): The dictionary to print.
    width (int): The maximum width of the wrapped text.
    """
    print(textwrap.fill(json.dumps(d, indent=4, sort_keys=True, default=convert_to_serializable), width=width))



In [77]:
pretty_print_dict_wrapped(result)

{     "context": [         {             "metadata": {
"page_number": 51             },             "page_content": "tesla,
inc. consolidated statements of operations ( in millions, except per
share data ) year ended december 31, 2023 2022 2021 revenues
automotive sales $ 78, 509 $ 67, 210 $ 44, 125 automotive regulatory
credits 1, 790 1, 776 1, 465 automotive leasing 2, 120 2, 476 1, 642
total automotive revenues 82, 419 71, 462 47, 232 energy generation
and storage 6, 035 3, 909 2, 789 services and other 8, 319 6, 091 3,
802 total revenues 96, 773 81, 462 53, 823 cost of revenues automotive
sales 65, 121 49, 599 32, 415 automotive leasing 1, 268 1, 509 978
total automotive cost of revenues 66, 389 51, 108 33, 393 energy
generation and storage 4, 894 3, 621 2, 918 services and other 7, 830
5, 880 3, 906 total cost of revenues 79, 113 60, 609 40, 217 gross
profit 17, 660 20, 853 13, 606 operating expenses research and
development 3, 969 3, 075 2, 593 selling, general and administrative

### Code to generate ground truth
* question
* answer
* context
* ground truth (generated by `gpt-4o` from the context and question)

In [44]:
docs=base_docs

In [45]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser
from langchain.prompts import ChatPromptTemplate
# generate questions based on the doc contents
from tqdm import tqdm
import pandas as pd
from datasets import Dataset

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()
question_generation_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")

bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)


qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)
for k, v in output_dict.items():
  print(k)
  print(v)


#create qac_triples
qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages}) # generate a question
  try:
    output_dict = question_output_parser.parse(response.content) # parse the question
  except Exception as e:
    continue
  output_dict["context"] = text
  qac_triples.append(output_dict)

#add answer to qac_triples
answer_generation_llm = ChatOpenAI(model="gpt-4o", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:
answer

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)
for k, v in output_dict.items():
  print(k)
  print(v)

# generate an answer for every question/context pair
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]
#ground truth dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

question
What is the trading symbol for Tesla's common stock?


100%|██████████| 10/10 [00:08<00:00,  1.12it/s]


answer
TSLA


100%|██████████| 10/10 [00:17<00:00,  1.73s/it]
Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 147.45ba/s]


11747

In [79]:
pd.DataFrame(eval_dataset)

Unnamed: 0,question,context,ground_truth
0,What is the trading symbol for Tesla's common ...,united states securities and exchange commissi...,TSLA
1,Has the registrant filed all reports required ...,securities registered pursuant to section 12 (...,"Yes, the registrant has filed all reports requ..."
2,"According to the provided information, is the ...",( § 232. 405 of this chapter ) during the prec...,"Yes, the registrant is a large accelerated filer."
3,"According to the given context, has the regist...","if an emerging growth company, indicate by che...","No, the registrant has not elected to use the ..."
4,"According to the context, what is the aggregat...",indicate by check mark whether any of those er...,The aggregate market value of voting stock hel...
5,How many shares of the registrant's common sto...,"as of january 22, 2024, there were 3, 184, 790...","As of January 22, 2024, there were 3,184,790,4..."
6,What is the purpose of the annual report on Fo...,"tesla, inc. annual report on form 10 - k for t...",The purpose of the annual report on Form 10-K ...
7,What information is included in item 13 of the...,item 11. executive compensation 95 item 12. se...,Item 13 of the document includes information a...
8,What specific statements are included in the f...,table of contents forward - looking statements...,The forward-looking statements in this annual ...
9,What are the risks and uncertainties mentioned...,statements contain these identifying words. we...,The risks and uncertainties mentioned in the f...
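
One thing to watch in the loop above: when answer parsing fails, the triple is skipped with `continue`, so it still reaches the DataFrame but without an `answer` key and ends up with a missing ground truth. All ten rows parsed fine in this run. For larger runs, a minimal sketch of filtering out incomplete triples first could look like this (`keep_complete` and the sample data are hypothetical):

```python
def keep_complete(triples, required=("question", "context", "answer")):
    """Return only the triples that have a non-empty value for every required key."""
    return [t for t in triples if all(t.get(key) for key in required)]

sample_triples = [
    {"question": "What is the trading symbol?", "context": "common stock tsla", "answer": "TSLA"},
    {"question": "Unparsed example", "context": "some chunk"},  # answer parsing failed
]

complete = keep_complete(sample_triples)
print(len(complete))  # 1
```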


In [None]:
# test chunk size sensitivity using RAGAS metrics

In [47]:
# create pipeline testing chunksize sensitivity
# use openai embedding, need to test on chromadb embedding cheaper option

# from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
# from langchain.schema import Document
# from langchain.text_splitter import RecursiveCharacterTextSplitter

# def qa_sensitivity(base_docs, chunk_size, question):
#     # Create Document objects from base_docs
#     documents = [Document(page_content=doc["content"], metadata=doc["metadata"]) for doc in base_docs]

#     # Initialize the text splitter
#     text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

#     # Split the documents
#     docs = text_splitter.split_documents(documents)

#     # Initialize the vector store with the documents and embeddings
#     vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

#     # print("Vector store created successfully.")

#     result = retrieval_augmented_qa_chain.invoke({"question" : question})
#     return result

# question = "What is tesla 2023 revenue"
# chunk_size=200
# result= qa_sensitivity(base_docs, chunk_size, question)


In [48]:
# Extracting the response content
# response_content = result['response'].content

# # Extracting the context content
# context_content = [doc.page_content for doc in result['context']]

# print("Response Content:", response_content)
# print("Context Content:", context_content)

In [None]:
#chromaDB option

In [49]:
reader = PdfReader("../data/tesla10K.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages if p.extract_text()]

# Split text by sentences
character_splitter = RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ". ", " ", ""], chunk_size=1000, chunk_overlap=0)
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))


In [50]:
# create pipeline testing chunksize sensitivity
# use openai embedding, need to test on chromadb embedding cheaper option

from langchain.vectorstores import Chroma
# from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [51]:
class RAGPipeline:
    def __init__(self, character_split_texts, tokens_per_chunk):
        self.tokens_per_chunk = tokens_per_chunk
        self.character_split_texts = character_split_texts
        self.embedding_function = SentenceTransformerEmbeddingFunction()
        self.token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=tokens_per_chunk)

        # Tokenize the sentence chunks
        self.token_split_texts = [token_split_text for text in character_split_texts for token_split_text in self.token_splitter.split_text(text)]

        # Create vector database using ChromaDB collection
        self.chroma_client = chromadb.Client()
        collection_name = f"tesla202310k_{tokens_per_chunk}"

        # Check if the collection exists
        if self.collection_exists(collection_name):
            self.chroma_collection = self.chroma_client.get_collection(collection_name, embedding_function=self.embedding_function)
            print(f"Collection '{collection_name}' already exists. Using existing collection.")
        else:
            self.chroma_collection = self.chroma_client.create_collection(collection_name, embedding_function=self.embedding_function)
            ids = [str(i) for i in range(len(self.token_split_texts))]
            self.chroma_collection.add(ids=ids, documents=self.token_split_texts)
            print(f"Collection '{collection_name}' created and documents added successfully.")

    def collection_exists(self, collection_name):
        try:
            self.chroma_client.get_collection(collection_name)
            return True
        except Exception as e:
            return False

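    def invoke(self, input_dict):
        question = input_dict.get("question")
        # NOTE: this delegates to the module-level chain, which retrieves through
        # base_retriever. self.chroma_collection is not queried here, so
        # tokens_per_chunk does not change which chunks are retrieved.
        result = retrieval_augmented_qa_chain.invoke({"question": question})
        return result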

# Function to use the RAGPipeline
def qa_sensitivity(character_split_texts, tokens_per_chunk, question):
    rag_pipeline = RAGPipeline(character_split_texts, tokens_per_chunk)
    result = rag_pipeline.invoke({"question": question})
    return result

# Define the parameters
tokens_per_chunk = 256
# character_split_texts = ["Your character split texts go here"]

# Create an instance of RAGPipeline
rag_pipeline = RAGPipeline(character_split_texts, tokens_per_chunk)

# Invoke the method
# result = rag_pipeline.invoke({"question": "What is the revenue for Tesla in 2023?"})
# print(result)

# see below on how to create eval_dataset
from datasets import Dataset
import pandas as pd
from tqdm import tqdm
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
    rag_dataset = []
    for row in tqdm(eval_dataset):
        answer = rag_pipeline.invoke({"question" : row["question"]})
        rag_dataset.append(
            {"question" : row["question"],
             "answer" : answer["response"].content,
             "contexts" : [context.page_content for context in answer["context"]],
             "ground_truths" : [row["ground_truth"]]
             }
        )
    rag_df = pd.DataFrame(rag_dataset)
    rag_eval_dataset = Dataset.from_pandas(rag_df)
    return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
    result = evaluate(
        ragas_dataset,
        metrics=[
            context_precision,
            faithfulness,
            answer_relevancy,
            context_recall,
            context_relevancy,
            answer_correctness,
            answer_similarity
        ],
    )
    return result

# Load the evaluation dataset
eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

# Create the RAGAS dataset
ragas_dataset = create_ragas_dataset(rag_pipeline, eval_dataset)

# Evaluate the RAGAS dataset
evaluation_results = evaluate_ragas_dataset(ragas_dataset)
print(evaluation_results)


Collection 'tesla202310k_256' already exists. Using existing collection.


Generating train split: 10 examples [00:00, 772.62 examples/s]
100%|██████████| 10/10 [00:09<00:00,  1.11it/s]
passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:  31%|███▏      | 22/70 [00:03<00:07,  6.22it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 70/70 [00:28<00:00,  2.49it/s]


{'context_precision': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.8573, 'context_recall': 1.0000, 'context_relevancy': 0.1363, 'answer_correctness': 0.7742, 'answer_similarity': 0.8795}
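
Note that `RAGPipeline.invoke` calls the module-level `retrieval_augmented_qa_chain`, which retrieves through `base_retriever`, so the per-chunk-size Chroma collection is built but never queried. One possible way to retrieve from the collection itself is sketched below. `retrieve_from_collection` and `FakeCollection` are hypothetical; the stub only mimics the shape of the dictionary returned by Chroma's `Collection.query(query_texts=..., n_results=...)`, so the snippet runs without a database.

```python
def retrieve_from_collection(collection, question, k=2):
    """Return the top-k document texts for a question from a Chroma-style collection."""
    results = collection.query(query_texts=[question], n_results=k)
    # Results are grouped per query text; we sent one query, so take the first group
    return results["documents"][0]

class FakeCollection:
    """Stand-in for this sketch only; a real collection would rank by embedding distance."""

    def __init__(self, documents):
        self.documents = documents

    def query(self, query_texts, n_results):
        return {"documents": [self.documents[:n_results] for _ in query_texts]}

collection = FakeCollection(["chunk a", "chunk b", "chunk c"])
contexts = retrieve_from_collection(collection, "What is Tesla's 2023 revenue?", k=2)
print(contexts)  # ['chunk a', 'chunk b']
```

With a real collection, the returned texts could be placed in the `{context}` slot of the prompt, so that each `tokens_per_chunk` setting is evaluated against its own index.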


In [81]:
# Inspect the RAGAS dataset (questions, answers, retrieved contexts, ground truths)
pd.DataFrame(ragas_dataset)

Unnamed: 0,question,answer,contexts,ground_truths
0,What is the trading symbol for Tesla's common ...,Answer: TSLA,[united states securities and exchange commiss...,[TSLA]
1,Has the registrant filed all reports required ...,Yes.,[securities registered pursuant to section 12 ...,"[Yes, the registrant has filed all reports req..."
2,"According to the provided information, is the ...",Yes.,[( § 232. 405 of this chapter ) during the pre...,"[Yes, the registrant is a large accelerated fi..."
3,"According to the given context, has the regist...","Yes, the registrant has elected not to use the...","[if an emerging growth company, indicate by ch...","[No, the registrant has not elected to use the..."
4,"According to the context, what is the aggregat...",Answer: $722.52 billion,[indicate by check mark whether any of those e...,[The aggregate market value of voting stock he...
5,How many shares of the registrant's common sto...,"Answer: 3,184,790,415 shares","[as of january 22, 2024, there were 3, 184, 79...","[As of January 22, 2024, there were 3,184,790,..."
6,What is the purpose of the annual report on Fo...,Answer: The purpose of the annual report on Fo...,"[weisshorn solar manager i, llc delaware zep s...",[The purpose of the annual report on Form 10-K...
7,What information is included in item 13 of the...,I don't know,[item 13. certain relationships and related tr...,[Item 13 of the document includes information ...
8,What specific statements are included in the f...,The specific statements included in the forwar...,[statements contain these identifying words. w...,[The forward-looking statements in this annual...
9,What are the risks and uncertainties mentioned...,The risks and uncertainties mentioned in the f...,[statements contain these identifying words. w...,[The risks and uncertainties mentioned in the ...


In [52]:

# Define the parameters
tokens_per_chunk = 128
# character_split_texts = ["Your character split texts go here"]

# Create an instance of RAGPipeline
rag_pipeline = RAGPipeline(character_split_texts, tokens_per_chunk)

# Create the RAGAS dataset
ragas_dataset = create_ragas_dataset(rag_pipeline, eval_dataset)

# Evaluate the RAGAS dataset
evaluation_results = evaluate_ragas_dataset(ragas_dataset)
print(evaluation_results)


Collection 'tesla202310k_128' created and documents added successfully.


100%|██████████| 10/10 [00:07<00:00,  1.40it/s]
passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:  54%|█████▍    | 38/70 [00:04<00:02, 13.68it/s]No statements were generated from the answer.
Evaluating:  60%|██████    | 42/70 [00:05<00:03,  8.59it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 70/70 [00:29<00:00,  2.36it/s]


{'context_precision': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.7583, 'context_recall': 1.0000, 'context_relevancy': 0.1363, 'answer_correctness': 0.7014, 'answer_similarity': 0.8655}


In [57]:

# Define the parameters
tokens_per_chunk = 64
# character_split_texts = ["Your character split texts go here"]

# Create an instance of RAGPipeline
rag_pipeline = RAGPipeline(character_split_texts, tokens_per_chunk)

# Create the RAGAS dataset
ragas_dataset = create_ragas_dataset(rag_pipeline, eval_dataset)

# Evaluate the RAGAS dataset
evaluation_results = evaluate_ragas_dataset(ragas_dataset)
print(evaluation_results)


Collection 'tesla202310k_64' created and documents added successfully.


100%|██████████| 10/10 [00:08<00:00,  1.23it/s]
passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:  73%|███████▎  | 51/70 [00:07<00:02,  8.74it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 70/70 [00:30<00:00,  2.27it/s]


{'context_precision': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.8574, 'context_recall': 1.0000, 'context_relevancy': 0.1363, 'answer_correctness': 0.7742, 'answer_similarity': 0.8794}


In [55]:
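# tokens_per_chunk=512 is rejected by the token splitter's default model, so 384 is the largest size tested:
# ValueError: The token limit of the models 'sentence-transformers/all-mpnet-base-v2' is: 384. Argument tokens_per_chunk=512 > maximum token limit.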


In [56]:

# Define the parameters
tokens_per_chunk = 384
# character_split_texts = ["Your character split texts go here"]

# Create an instance of RAGPipeline
rag_pipeline = RAGPipeline(character_split_texts, tokens_per_chunk)

# Create the RAGAS dataset
ragas_dataset = create_ragas_dataset(rag_pipeline, eval_dataset)

# Evaluate the RAGAS dataset
evaluation_results = evaluate_ragas_dataset(ragas_dataset)
print(evaluation_results)


Collection 'tesla202310k_384' created and documents added successfully.


100%|██████████| 10/10 [00:07<00:00,  1.35it/s]
passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:  56%|█████▌    | 39/70 [00:06<00:03, 10.13it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 70/70 [00:29<00:00,  2.37it/s]


{'context_precision': 1.0000, 'faithfulness': 1.0000, 'answer_relevancy': 0.8573, 'context_recall': 1.0000, 'context_relevancy': 0.1363, 'answer_correctness': 0.7742, 'answer_similarity': 0.8794}


In [87]:
pd.DataFrame.from_dict(evaluation_results, orient='index')

Unnamed: 0,0
context_precision,1.0
faithfulness,1.0
answer_relevancy,0.857432
context_recall,1.0
context_relevancy,0.136346
answer_correctness,0.774174
answer_similarity,0.879449
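
To compare the four runs at a glance, the sketch below computes the spread (max minus min) of each metric across chunk sizes. The scores are copied from the printed results above; `metric_spread` is a hypothetical helper.

```python
# Scores copied from the four evaluation runs above, keyed by tokens_per_chunk
runs = {
    256: {"context_precision": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.8573, "context_recall": 1.0,
          "context_relevancy": 0.1363, "answer_correctness": 0.7742, "answer_similarity": 0.8795},
    128: {"context_precision": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.7583, "context_recall": 1.0,
          "context_relevancy": 0.1363, "answer_correctness": 0.7014, "answer_similarity": 0.8655},
    64: {"context_precision": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.8574, "context_recall": 1.0,
         "context_relevancy": 0.1363, "answer_correctness": 0.7742, "answer_similarity": 0.8794},
    384: {"context_precision": 1.0, "faithfulness": 1.0, "answer_relevancy": 0.8573, "context_recall": 1.0,
          "context_relevancy": 0.1363, "answer_correctness": 0.7742, "answer_similarity": 0.8794},
}

def metric_spread(runs):
    """Return max - min of each metric across runs, rounded to 4 decimals."""
    metrics = next(iter(runs.values())).keys()
    return {
        m: round(max(r[m] for r in runs.values()) - min(r[m] for r in runs.values()), 4)
        for m in metrics
    }

spread = metric_spread(runs)
for metric, value in spread.items():
    print(f"{metric:<20} {value:.4f}")
```

The retrieval-side metrics (`context_precision`, `context_recall`, `context_relevancy`) do not move at all, which is what we would expect if every run retrieves from the same index. The small differences in the answer-side metrics may simply reflect run-to-run variation in the LLM's output rather than an effect of chunk size.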


In [90]:
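def ndcg(scores, ideal_scores, k):
    # DCG@k with exponential gain: sum of (2^rel - 1) / log2(position + 1), positions starting at 1
    dcg = lambda scores: sum((2**score - 1) / np.log2(idx + 2) for idx, score in enumerate(scores[:k]))
    actual_dcg = dcg(scores)
    # Ideal DCG: the reference relevances sorted from most to least relevant
    ideal_dcg = dcg(sorted(ideal_scores, reverse=True))
    return actual_dcg / ideal_dcg if ideal_dcg > 0 else 0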

In [114]:

# relevancy rank method 1
from sentence_transformers import CrossEncoder

def rank_relevancy_pairs(pairs, k_pairs):
    # Initialize the CrossEncoder with a specific pre-trained model
    cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')
    
    # Predict a relevance score for each (query, text) pair
    scores = cross_encoder.predict(pairs)
    
    # Indices of the k_pairs highest-scoring pairs, best first
    top_indices = np.argsort(scores)[::-1][:k_pairs]
    
    # Graded relevance per pair: k_pairs-1 for the best pair down to 0;
    # pairs outside the top k_pairs keep relevance 0.
    # Sized by len(pairs) so indexing stays valid when len(pairs) > k_pairs.
    predicted_relevance = [0] * len(pairs)
    for rank, index in enumerate(top_indices):
        predicted_relevance[index] = k_pairs - 1 - rank
    
    # Retrieve the top pairs in ranked order
    top_pairs = [pairs[index] for index in top_indices]
    
    return top_indices, predicted_relevance, top_pairs

In [118]:
def ndcg_from_ragas(ragas_dataset, k_pairs):
    df = pd.DataFrame.from_dict(ragas_dataset)

    questions = df['question']

    # (question, generated answer) pairs
    qa_pairs = list(zip(questions, df['answer']))

    # (question, ground truth) pairs; each ground_truths entry is a list, so cast to str
    qgt_pairs = [(str(q), str(gt)) for q, gt in zip(questions, df['ground_truths'])]

    # Only the graded relevance lists are needed for NDCG
    _, rank_qa_pairs, _ = rank_relevancy_pairs(qa_pairs, k_pairs)
    _, rank_qgt_pairs, _ = rank_relevancy_pairs(qgt_pairs, k_pairs)

    ndcg_score = ndcg(rank_qa_pairs, rank_qgt_pairs, k=k_pairs)
    print(f"NDCG Score: {ndcg_score}")
    return ndcg_score

In [119]:
ndcg_ragas= ndcg_from_ragas(ragas_dataset, 10)
ndcg_ragas

NDCG Score: 0.409057702101361


0.409057702101361
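
To make the NDCG number easier to interpret, here is a small worked sketch on hand-written relevance grades. It uses the same exponential-gain formula as the `ndcg` function above, but with `math` instead of NumPy so it runs on its own. The names `dcg_at_k` and `ndcg_at_k` and the example grades are hypothetical and chosen only for illustration.

```python
import math

def dcg_at_k(relevances, k):
    # Exponential gain (2^rel - 1), discounted by log2(position + 1), positions starting at 1
    return sum((2 ** rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, ideal_relevances, k):
    # Normalize by the DCG of the best possible ordering
    ideal = dcg_at_k(sorted(ideal_relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

predicted = [3, 2, 0, 1]   # hypothetical grades in the order the system ranked the items
reference = [3, 2, 1, 0]   # the same grades in the best possible order

score = ndcg_at_k(predicted, reference, k=4)
perfect = ndcg_at_k(reference, reference, k=4)
print(round(score, 4), perfect)
```

A value of 1.0 means the ranking matches the ideal ordering. Swapping only the last two items costs very little, because low positions are heavily discounted.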

## -------------------My test pipeline above here --------------------

### Ground Truth Dataset Creation Using GPT-3.5-turbo and GPT-4

The next section might take you a long time to run, so the evaluation dataset is provided.

The basic idea is that we can use LangChain to create questions based on our contexts, and then answer those questions.

Let's look at how that works in the code!
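
Before the LangChain version, the two-pass flow can be sketched without any API calls. This is a minimal sketch, not the notebook's implementation: `fake_llm` is a hypothetical stand-in for a chat model that returns JSON, and `build_triples` is an illustrative helper name. Unlike the notebook's loops, it drops a triple entirely if either pass fails to parse.

```python
import json

def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat model; always replies with a JSON string."""
    if "create a question" in prompt:
        return json.dumps({"question": "What does the context describe?"})
    context = prompt.split("context: ", 1)[1]
    return json.dumps({"answer": "It describes: " + context})

def build_triples(contexts, llm=fake_llm):
    triples = []
    for context in contexts:
        # Pass 1: ask for a question grounded in this context
        raw_question = llm(f"create a question\ncontext: {context}")
        try:
            question = json.loads(raw_question)["question"]
        except (json.JSONDecodeError, KeyError):
            continue  # skip contexts whose output cannot be parsed
        # Pass 2: ask for an answer to that question, using the same context
        raw_answer = llm(f"create an answer\nquestion: {question}\ncontext: {context}")
        try:
            answer = json.loads(raw_answer)["answer"]
        except (json.JSONDecodeError, KeyError):
            continue
        triples.append({"question": question, "context": context, "answer": answer})
    return triples

triples = build_triples(["Tesla files an annual report.", "Form 10-K covers fiscal year 2023."])
print(triples[0])
```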

In [35]:
from langchain.output_parsers import ResponseSchema
from langchain.output_parsers import StructuredOutputParser

question_schema = ResponseSchema(
    name="question",
    description="a question about the context."
)

question_response_schemas = [
    question_schema,
]

In [36]:
question_output_parser = StructuredOutputParser.from_response_schemas(question_response_schemas)
format_instructions = question_output_parser.get_format_instructions()

In [37]:
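from langchain.prompts import ChatPromptTemplate

question_generation_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")

# Pass-through template: the fully formatted messages are supplied as "content"
bare_prompt_template = "{content}"
bare_template = ChatPromptTemplate.from_template(template=bare_prompt_template)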

In [50]:
from langchain.prompts import ChatPromptTemplate

qa_template = """\
You are a University Professor creating a test for advanced students. For each context, create a question that is specific to the context. Avoid creating generic or general questions.

question: a question about the context.

Format the output as JSON with the following keys:
question

context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=docs[0],
    format_instructions=format_instructions
)

question_generation_chain = bare_template | question_generation_llm

response = question_generation_chain.invoke({"content" : messages})
output_dict = question_output_parser.parse(response.content)

In [51]:
for k, v in output_dict.items():
  print(k)
  print(v)

question
What is the purpose of Form 10-K according to the United States Securities and Exchange Commission?


In [52]:
!pip install -q -U tqdm



In [55]:
# generate questions based on the doc contents

In [53]:
from tqdm import tqdm

qac_triples = []

for text in tqdm(docs[:10]):
  messages = prompt_template.format_messages(
      context=text,
      format_instructions=format_instructions
  )
  response = question_generation_chain.invoke({"content" : messages}) # generate a question
  try:
    output_dict = question_output_parser.parse(response.content) # parsed dict with a "question" key
  except Exception:
    continue  # skip contexts whose response could not be parsed
  output_dict["context"] = text
  qac_triples.append(output_dict)

100%|██████████| 10/10 [00:08<00:00,  1.17it/s]


In [54]:
# pretty_print_dict_wrapped(qac_triples[6])
qac_triples[6]


{'question': 'What is the exact name of the registrant as specified in its charter?',
 'context': Document(page_content='_ _ _ _ _ _ _ _ _ to _ _ _ _ _ _ _ _ _ commission file number : 001 - 34756 tesla, inc. ( exact name of registrant as specified in its charter ) delaware 91 - 2197729 ( state or other jurisdiction of incorporation or organization ) ( i. r. s.', metadata={'page_number': 1})}

In [63]:
answer_generation_llm = ChatOpenAI(model="gpt-4-1106-preview", temperature=0)

answer_schema = ResponseSchema(
    name="answer",
    description="an answer to the question"
)

answer_response_schemas = [
    answer_schema,
]

answer_output_parser = StructuredOutputParser.from_response_schemas(answer_response_schemas)
format_instructions = answer_output_parser.get_format_instructions()

qa_template = """\
You are a University Professor creating a test for advanced students. For each question and context, create an answer.

answer: an answer about the context.

Format the output as JSON with the following keys:
answer

question: {question}
context: {context}
"""

prompt_template = ChatPromptTemplate.from_template(template=qa_template)

messages = prompt_template.format_messages(
    context=qac_triples[0]["context"],
    question=qac_triples[0]["question"],
    format_instructions=format_instructions
)

answer_generation_chain = bare_template | answer_generation_llm

response = answer_generation_chain.invoke({"content" : messages})
output_dict = answer_output_parser.parse(response.content)

In [64]:
for k, v in output_dict.items():
  print(k)
  print(v)

answer
The purpose of Form 10-K is to provide a comprehensive overview of a company's financial performance and operations for the fiscal year. It is an annual report required by the Securities and Exchange Commission (SEC) and must be filed by public companies to comply with federal securities laws. The form includes information about the company's financial condition, risk factors, management discussion and analysis, market information, corporate governance, executive compensation, and other relevant data.
question
What is the purpose of Form 10-K?


In [65]:
for triple in tqdm(qac_triples):
  messages = prompt_template.format_messages(
      context=triple["context"],
      question=triple["question"],
      format_instructions=format_instructions
  )
  response = answer_generation_chain.invoke({"content" : messages})
  try:
    output_dict = answer_output_parser.parse(response.content)
  except Exception as e:
    continue
  triple["answer"] = output_dict["answer"]

100%|██████████| 9/9 [00:37<00:00,  4.18s/it]


In [70]:
#ground truth dataset

In [69]:
qac_triples

[{'question': 'What is the purpose of Form 10-K?',
  'context': Document(page_content='washington, d. c. 20549 form 10 - k ( mark one ) x annual report pursuant to section 13 or 15 ( d ) of the securities exchange act of 1934 for the fiscal year ended december 31, 2023 or o transition report pursuant to section 13 or 15 ( d ) of the', metadata={'page_number': 1}),
  'answer': "The purpose of Form 10-K is to provide a comprehensive overview of a company's financial performance and operations for the fiscal year. It is an annual report required by the Securities and Exchange Commission (SEC) and must be filed by public companies to comply with federal securities laws. The form includes information about the company's financial condition, risk factors, management discussion and analysis, market information, corporate governance, executive compensation, and other relevant data."},
 {'question': 'What type of report is being created in this context?',
  'context': Document(page_content='x a

In [66]:
# !pip install -q -U datasets

In [71]:
import pandas as pd
from datasets import Dataset

ground_truth_qac_set = pd.DataFrame(qac_triples)
ground_truth_qac_set["context"] = ground_truth_qac_set["context"].map(lambda x: str(x.page_content))
ground_truth_qac_set = ground_truth_qac_set.rename(columns={"answer" : "ground_truth"})


eval_dataset = Dataset.from_pandas(ground_truth_qac_set)

In [79]:
type(eval_dataset)

datasets.arrow_dataset.Dataset

In [80]:
eval_dataset[0]

{'question': 'What is the purpose of Form 10-K?',
 'context': 'washington, d. c. 20549 form 10 - k ( mark one ) x annual report pursuant to section 13 or 15 ( d ) of the securities exchange act of 1934 for the fiscal year ended december 31, 2023 or o transition report pursuant to section 13 or 15 ( d ) of the',
 'ground_truth': "The purpose of Form 10-K is to provide a comprehensive overview of a company's financial performance and operations for the fiscal year. It is an annual report required by the Securities and Exchange Commission (SEC) and must be filed by public companies to comply with federal securities laws. The form includes information about the company's financial condition, risk factors, management discussion and analysis, market information, corporate governance, executive compensation, and other relevant data."}

In [81]:
eval_dataset.to_csv("groundtruth_eval_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 110.14ba/s]


6114

### Evaluating RAG Pipelines

If you skipped ahead, load the provided `.csv` directly with the cell below.

If you're running this notebook in Colab, make sure the file is added to your session files.

In [82]:
from datasets import Dataset
eval_dataset = Dataset.from_csv("groundtruth_eval_dataset.csv")

Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 9 examples [00:00, 764.27 examples/s]


In [136]:
eval_dataset

Dataset({
    features: ['question', 'context', 'ground_truth'],
    num_rows: 9
})

In [85]:
# load microsoft 10K and apply the selected pipeline
# train test split. split the eval_dataset to 80/20 x
# use the ground truth to evaluate different pipelines and select the better one v


### Evaluation Using RAGAS

Now we can evaluate using RAGAS!

The set-up is fairly straightforward - we simply need to create a dataset with our generated answers and our contexts, and then evaluate using the framework.
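
As a rough sketch of what one evaluation row looks like, the helper below builds a single record with the column names this notebook passes to RAGAS (`question`, `answer`, `contexts`, `ground_truths`). The function name `make_ragas_row` and all example strings are hypothetical. The deprecation warning in the outputs suggests newer RAGAS versions expect a string `ground_truth` column instead, so treat the column names as tied to the version used here.

```python
def make_ragas_row(question, answer, contexts, ground_truth):
    """Build one evaluation row using the column names used in this notebook."""
    return {
        "question": question,
        "answer": answer,                  # text generated by the RAG pipeline
        "contexts": list(contexts),        # retrieved chunks, one string per chunk
        "ground_truths": [ground_truth],   # wrapped in a list, as in create_ragas_dataset
    }

row = make_ragas_row(
    question="What is the purpose of Form 10-K?",
    answer="It is an annual report filed with the SEC.",        # hypothetical answer
    contexts=["annual report pursuant to section 13 or 15(d)"],  # hypothetical chunk
    ground_truth="Form 10-K is an annual report required by the SEC.",  # hypothetical
)
print(sorted(row))
```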

In [132]:
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
    context_relevancy,
    answer_correctness,
    answer_similarity
)

from ragas.metrics.critique import harmfulness
from ragas import evaluate

def create_ragas_dataset(rag_pipeline, eval_dataset):
  rag_dataset = []
  for row in tqdm(eval_dataset):
    answer = rag_pipeline.invoke({"question" : row["question"]})
    rag_dataset.append(
        {"question" : row["question"],
         "answer" : answer["response"].content,
         "contexts" : [context.page_content for context in answer["context"]],
         "ground_truths" : [row["ground_truth"]]
         }
    )
  rag_df = pd.DataFrame(rag_dataset)
  rag_eval_dataset = Dataset.from_pandas(rag_df)
  return rag_eval_dataset

def evaluate_ragas_dataset(ragas_dataset):
  result = evaluate(
    ragas_dataset,
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
        context_relevancy,
        answer_correctness,
        answer_similarity
    ],
  )
  return result

Let's create our dataset first:

In [133]:
from tqdm import tqdm
import pandas as pd

basic_qa_ragas_dataset = create_ragas_dataset(retrieval_augmented_qa_chain, eval_dataset)

100%|██████████| 9/9 [00:06<00:00,  1.33it/s]


In [158]:
# pretty_print_dict_wrapped( basic_qa_ragas_dataset[0])
# basic_qa_ragas_dataset

Save it for later:

In [89]:
basic_qa_ragas_dataset.to_csv("basic_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 195.51ba/s]


8602

And finally - evaluate how it did!

In [90]:
basic_qa_result = evaluate_ragas_dataset(basic_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:   8%|▊         | 5/63 [00:00<00:07,  8.21it/s]No statements were generated from the answer.
No statements were generated from the answer.
Evaluating:  11%|█         | 7/63 [00:02<00:18,  3.08it/s]No statements were generated from the answer.
Evaluating:  33%|███▎      | 21/63 [00:03<00:07,  5.76it/s]No statements were generated from the answer.
Evaluating:  49%|████▉     | 31/63 [00:04<00:03,  9.22it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 63/63 [00:07<00:00,  8.44it/s]


In [91]:
pretty_print_dict_wrapped(basic_qa_result)

{     "answer_correctness": 0.4511098380111778,
"answer_relevancy": 0.38902964973295784,     "answer_similarity":
0.7822178154504189,     "context_precision": 0.7222222221833333,
"context_recall": 0.7777777777777778,     "context_relevancy":
0.07222222222222223,     "faithfulness": 1.0 }


In [97]:
# hyperparameter tuning
# retrieval_augmented_qa_chain
primary_qa_llm_gpt4o = ChatOpenAI(model_name="gpt-4o", temperature=0)

rag_gpt4o_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm_gpt4o, "context": itemgetter("context")}
)

In [98]:
rag_gpt4o_ragas_dataset = create_ragas_dataset(rag_gpt4o_chain, eval_dataset)
rag_gpt4o_ragas_dataset.to_csv("rag_gpt4o_ragas_dataset.csv")
rag_gpt4o_result = evaluate_ragas_dataset(rag_gpt4o_ragas_dataset)
pretty_print_dict_wrapped(rag_gpt4o_result )


100%|██████████| 9/9 [00:09<00:00,  1.01s/it]
Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 386.68ba/s]
passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating:   8%|▊         | 5/63 [00:00<00:06,  9.00it/s]No statements were generated from the answer.
Evaluating:  16%|█▌        | 10/63 [00:01<00:09,  5.80it/s]No statements were generated from the answer.
Evaluating:  51%|█████     | 32/63 [00:03<00:01, 16.80it/s]No statements were generated from the answer.
Evaluating:  56%|█████▌    | 35/63 [00:04<00:04,  5.92it/s]No statements were generated from the answer.
Evaluating:  89%|████████▉ | 56/63 [00:06<00:00,  9.51it/s]No statements were generated from the answer.
Evaluating: 100%|██████████| 63/63 [00:07<00:00,  8.71it/s]


{     "answer_correctness": 0.4563945569982671,
"answer_relevancy": 0.42863584865198795,     "answer_similarity":
0.8255782279930683,     "context_precision": 0.7222222221833333,
"context_recall": 0.7777777777777778,     "context_relevancy":
0.07222222222222223,     "faithfulness": 1.0 }


In [109]:
#chunk size sensitivity

### Testing Other Retrievers

Now we can test how changing our retriever impacts our RAGAS evaluation!

We'll build this simple qa_chain factory to create standardized qa_chains where the only different component will be the retriever.

In [None]:
def create_qa_chain(retriever):
  primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
  created_qa_chain = (
    {"context": itemgetter("question") | retriever,
     "question": itemgetter("question")
    }
    | RunnablePassthrough.assign(
        context=itemgetter("context")
      )
    | {
         "response": prompt | primary_qa_llm,
         "context": itemgetter("context"),
      }
  )

  return created_qa_chain

#### Parent Document Retriever

One of the easier ways we can imagine improving a retriever is to split our documents into small chunks for embedding and search, and then return the larger parent chunk that "surrounds" each match as context.

You can read more about this method [here](https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever)!

The basic outline of this retrieval method is as follows:

1. Obtain User Question
2. Retrieve child documents using Dense Vector Retrieval
3. Merge the child documents by parent: children that share a parent are collapsed into a single entry.
4. Replace the child documents with their respective parent documents from an in-memory-store.
5. Use the parent documents to augment generation.
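
The steps above can be sketched in plain Python. This is a toy illustration under loud assumptions: chunks are split by word count rather than characters, and a simple word-overlap score stands in for dense vector retrieval. All function names (`split_words`, `build_index`, `retrieve_parents`) are hypothetical and are not LangChain APIs.

```python
def split_words(text, size):
    # Split text into chunks of at most `size` words
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(documents, parent_size=8, child_size=3):
    parents = {}    # parent_id -> parent text (plays the role of the docstore)
    children = []   # (child_text, parent_id) pairs (plays the role of the vector store)
    for doc in documents:
        for parent in split_words(doc, parent_size):
            parent_id = len(parents)
            parents[parent_id] = parent
            for child in split_words(parent, child_size):
                children.append((child, parent_id))
    return parents, children

def overlap(query, text):
    # Stand-in for vector similarity: number of shared lowercase words
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_parents(query, parents, children, k=2):
    # Step 2: retrieve the top-k child chunks
    top_children = sorted(children, key=lambda c: overlap(query, c[0]), reverse=True)[:k]
    # Steps 3 and 4: merge children by parent, then swap in the parent text
    seen, results = set(), []
    for _, parent_id in top_children:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

docs = [
    "tesla builds electric vehicles and batteries in several gigafactories around the world",
    "the form 10-k is an annual report filed with the securities and exchange commission",
]
parents, children = build_index(docs)
print(retrieve_parents("annual report", parents, children))
```

Both matching child chunks share one parent here, so a single, larger parent chunk is returned rather than two small fragments.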

In [108]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.docstore.document import Document

# Convert base_docs to a list of Document objects
documents = [Document(page_content=doc['content'], metadata=doc['metadata']) for doc in base_docs]

# Initialize text splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1500)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Initialize the vector store
vectorstore = Chroma(collection_name="split_parents", embedding_function=OpenAIEmbeddings())

# Initialize the in-memory document store
store = InMemoryStore()

# Initialize the ParentDocumentRetriever
parent_document_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Add documents to the ParentDocumentRetriever
parent_document_retriever.add_documents(documents)


KeyboardInterrupt: 

Let's create, test, and then evaluate our new chain!

In [50]:
parent_document_retriever_qa_chain = create_qa_chain(parent_document_retriever)

In [51]:
parent_document_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'Answer: Retrieval-augmented generation (RAG) is a practicable complement to large language models (LLMs) that relies heavily on the relevance of retrieved documents to improve the robustness of generation.'

In [52]:
pdr_qa_ragas_dataset = create_ragas_dataset(parent_document_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:10<00:00,  1.10s/it]


In [53]:
pdr_qa_ragas_dataset.to_csv("pdr_qa_ragas_dataset.csv")

Creating CSV from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 306.38ba/s]


55684

In [54]:
pdr_qa_result = evaluate_ragas_dataset(pdr_qa_ragas_dataset)

passing column names as 'ground_truths' is deprecated and will be removed in the next version, please use 'ground_truth' instead. Note that `ground_truth` should be of type string and not Sequence[string] like `ground_truths`
Evaluating: 100%|██████████| 70/70 [01:30<00:00,  1.29s/it]


In [48]:
pdr_qa_result

{'context_precision': 0.9806, 'faithfulness': 0.9500, 'answer_relevancy': 0.9600, 'context_recall': 1.0000, 'context_relevancy': 0.0338, 'answer_correctness': 0.6026, 'answer_similarity': 0.9366}

#### Ensemble Retrieval

Next let's look at ensemble retrieval!

You can read more about this [here](https://python.langchain.com/docs/modules/data_connection/retrievers/ensemble)!

The basic idea is as follows:

1. Obtain User Question
2. Hit the Retriever Pair
    - Retrieve Documents with BM25 Sparse Vector Retrieval
    - Retrieve Documents with Dense Vector Retrieval Method
3. Collect and "fuse" the retrieved docs based on their weighting using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm into a single ranked list.
4. Use those documents to augment our generation.

Ensure your `weights` list - the relative weighting of each retriever - sums to 1!
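
The fusion step can be illustrated with a short sketch of weighted Reciprocal Rank Fusion. Each document receives `weight / (c + rank)` from every ranked list it appears in, and the totals decide the final order. The function name `weighted_rrf`, the document IDs, and the constant `c = 60` (the value suggested in the RRF paper) are assumptions for illustration; check the library source before assuming it uses the same constant.

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of document IDs into a single list, best first."""
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Higher-ranked documents contribute more; the weight scales each retriever's vote
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b"]             # hypothetical sparse retriever output (k=2)
dense_ranking = ["doc_b", "doc_c", "doc_a"]   # hypothetical dense retriever output (k=3)

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.75, 0.25])
print(fused)
```

With a 0.75 weight on BM25, its top hit edges out the dense retriever's top hit even though both documents appear in both lists.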

In [49]:
!pip install -q -U rank_bm25

In [100]:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.docstore.document import Document

# Assuming base_docs is a list of dictionaries with 'content' and 'metadata' keys
# Convert base_docs to a list of Document objects
documents = [Document(page_content=doc['content'], metadata=doc['metadata']) for doc in base_docs]

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)

# Split the documents
docs = text_splitter.split_documents(documents)

# Initialize the BM25 retriever
bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

# Initialize the embeddings and vector store
embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)

# Initialize the Chroma retriever
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Initialize the Ensemble retriever
ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])


In [99]:
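# Earlier attempt, kept for reference: base_docs holds plain dicts rather than Document
# objects, so split_documents raises the AttributeError shown below.
from langchain.retrievers import BM25Retriever, EnsembleRetriever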

text_splitter = RecursiveCharacterTextSplitter(chunk_size=450, chunk_overlap=75)
docs = text_splitter.split_documents(base_docs)

bm25_retriever = BM25Retriever.from_documents(docs)
bm25_retriever.k = 2

embedding = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(docs, embedding)
chroma_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

ensemble_retriever = EnsembleRetriever(retrievers=[bm25_retriever, chroma_retriever], weights=[0.75, 0.25])

AttributeError: 'dict' object has no attribute 'page_content'

In [51]:
ensemble_retriever_qa_chain = create_qa_chain(ensemble_retriever)

In [52]:
ensemble_retriever_qa_chain.invoke({"question" : "What is RAG?"})["response"].content

'Retrieval Augmented Generation (RAG) is a practicable complement to Large Language Models (LLMs) that relies heavily on the relevance of retrieved documents.'

In [53]:
ensemble_qa_ragas_dataset = create_ragas_dataset(ensemble_retriever_qa_chain, eval_dataset)

100%|██████████| 10/10 [00:13<00:00,  1.34s/it]


In [54]:
ensemble_qa_ragas_dataset.to_csv("ensemble_qa_ragas_dataset.csv")

Creating CSV from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

24578

In [55]:
ensemble_qa_result = evaluate_ragas_dataset(ensemble_qa_ragas_dataset)



Evaluating:   0%|          | 0/70 [00:00<?, ?it/s]

In [56]:
ensemble_qa_result

{'context_precision': 0.7746, 'faithfulness': 1.0000, 'answer_relevancy': 0.8441, 'context_recall': 1.0000, 'context_relevancy': 0.0325, 'answer_correctness': 0.6255, 'answer_similarity': 0.9327}

### Conclusion

Observe your results in a table!

In [57]:
basic_qa_result

{'context_precision': 0.9000, 'faithfulness': 0.8500, 'answer_relevancy': 0.9651, 'context_recall': 0.7000, 'context_relevancy': 0.1051, 'answer_correctness': 0.5615, 'answer_similarity': 0.9298}

In [58]:
pdr_qa_result

{'context_precision': 0.9806, 'faithfulness': 0.9500, 'answer_relevancy': 0.9600, 'context_recall': 1.0000, 'context_relevancy': 0.0338, 'answer_correctness': 0.6026, 'answer_similarity': 0.9366}

In [59]:
ensemble_qa_result

{'context_precision': 0.7746, 'faithfulness': 1.0000, 'answer_relevancy': 0.8441, 'context_recall': 1.0000, 'context_relevancy': 0.0325, 'answer_correctness': 0.6255, 'answer_similarity': 0.9327}

We can also zoom in on each result and find specific information about each of the questions and answers.

In [60]:
ensemble_qa_result_df = ensemble_qa_result.to_pandas()

In [61]:
ensemble_qa_result_df

Unnamed: 0,question,answer,contexts,ground_truths,ground_truth,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
0,What is the main focus of the paper 'A Survey ...,The main focus of the paper 'A Survey on Retri...,[A Survey on Retrieval-Augmented Text Generati...,[The main focus of the paper 'A Survey on Retr...,The main focus of the paper 'A Survey on Retri...,0.679167,1.0,1.0,1.0,0.055556,0.581007,0.990696
1,What is the main focus of the paper 'A Survey ...,The main focus of the paper 'A Survey on Retri...,[A Survey on Retrieval-Augmented Text Generati...,[The main focus of the paper 'A Survey on Retr...,The main focus of the paper 'A Survey on Retri...,0.679167,1.0,1.0,1.0,0.055556,0.581211,0.991511
2,What is the main focus of this paper?,The main focus of this paper is to improve the...,[and main intent within questions.\nquestion: ...,[The main focus of this paper is to conduct a ...,The main focus of this paper is to conduct a s...,1.0,1.0,0.737677,1.0,0.0,0.72221,0.888841
3,What is the main focus of the paper 'A Survey ...,The main focus of the paper 'A Survey on Retri...,[A Survey on Retrieval-Augmented Text Generati...,[The main focus of the paper 'A Survey on Retr...,The main focus of the paper 'A Survey on Retri...,0.679167,1.0,1.0,1.0,0.055556,0.62244,0.989762
4,What is the aim of this paper?,The aim of this paper is to propose Corrective...,[and main intent within questions.\nquestion: ...,[The aim of this paper is to conduct a survey ...,The aim of this paper is to conduct a survey a...,0.804167,1.0,0.737583,1.0,0.0,0.473361,0.893446
5,What is the focus of this paper?,The focus of this paper is to improve the robu...,[and main intent within questions.\nquestion: ...,[The focus of this paper is on retrieval-augme...,The focus of this paper is on retrieval-augmen...,0.916667,1.0,0.745384,1.0,0.0,0.911593,0.919097
6,What is the main focus of this paper?,The main focus of this paper is to improve the...,[and main intent within questions.\nquestion: ...,[The main focus of this paper is to conduct a ...,The main focus of this paper is to conduct a s...,0.7,1.0,0.737677,1.0,0.0,0.719003,0.876011
7,What is the purpose of this paper?,The purpose of this paper is to propose the Co...,[and main intent within questions.\nquestion: ...,[The purpose of this paper is to conduct a com...,The purpose of this paper is to conduct a comp...,0.804167,1.0,0.741192,1.0,0.069767,0.473029,0.892116
8,What is the main focus of the paper 'A Survey ...,The main focus of the paper 'A Survey on Retri...,[A Survey on Retrieval-Augmented Text Generati...,[The main focus of the paper 'A Survey on Retr...,The main focus of the paper 'A Survey on Retri...,0.679167,1.0,1.0,1.0,0.018519,0.698101,0.992403
9,What is the purpose of this paper?,The purpose of this paper is to propose the Co...,[and main intent within questions.\nquestion: ...,[The purpose of this paper is to conduct a com...,The purpose of this paper is to conduct a comp...,0.804167,1.0,0.741192,1.0,0.069767,0.473232,0.892926


We'll also look at combining the results and looking at them in a single table so we can make inferences about them!

In [62]:
def create_df_dict(pipeline_name, pipeline_items):
  df_dict = {"name" : pipeline_name}
  for name, score in pipeline_items:
    df_dict[name] = score
  return df_dict

In [63]:
basic_rag_df_dict = create_df_dict("basic_rag", basic_qa_result.items())

In [64]:
pdr_rag_df_dict = create_df_dict("pdr_rag", pdr_qa_result.items())

In [65]:
ensemble_rag_df_dict = create_df_dict("ensemble_rag", ensemble_qa_result.items())

In [66]:
results_df = pd.DataFrame([basic_rag_df_dict, pdr_rag_df_dict, ensemble_rag_df_dict])

In [67]:
results_df.sort_values("answer_correctness", ascending=False)

Unnamed: 0,name,context_precision,faithfulness,answer_relevancy,context_recall,context_relevancy,answer_correctness,answer_similarity
2,ensemble_rag,0.774583,1.0,0.84407,1.0,0.032472,0.625519,0.932681
1,pdr_rag,0.980556,0.95,0.960035,1.0,0.03382,0.602607,0.936583
0,basic_rag,0.9,0.85,0.965148,0.7,0.105105,0.561453,0.929758


### ❓QUESTION❓

What conclusions can you draw about the above results?

Describe in your own words what the metrics are expressing.

In [68]:
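# Note: RunnableParallel must be imported before this cell runs, e.g.
# `from langchain.schema.runnable import RunnableParallel`; without it the NameError below is raised.
retrieval_augmented_qa_chain = (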
    RunnableParallel({
        'context': itemgetter('question') | base_retriever,
        'question': RunnablePassthrough()
    }) | {
        'response': prompt | primary_qa_llm | parser,
        'context': itemgetter('context')
    }
)

NameError: name 'RunnableParallel' is not defined