# Evaluating RAG

The extent to which you can **evaluate** your system is the extent to which you can **improve** your system. Before going to production, it is in your best interest to establish a framework for quickly and effectively measuring the quality of your RAG application. In this notebook, we will use the RAGAS framework, as proposed in [this paper](https://arxiv.org/pdf/2309.15217), to evaluate the RAG application developed in the previous examples.

There is no substitute for reading the paper, but the main metrics we will work with are summarized below. Note: many more metrics can be used depending on the use case, but these are the main ones covered in the paper, so we will start there.

# Quality metric breakdown

The 3 quality metrics in the RAGAS framework are: **faithfulness**, **answer relevance**, and **context relevance**. Let's take a moment to define each and understand how we can arrive at their values.

## Faithfulness

An answer to a question can be said to be "faithful" if the **claims** that are made in the answer **can be inferred** from the **context**.

The process for quantifying this score is as follows:

1. Use the following prompt with an LLM to break the answer into shorter, more focused statements, given the question and answer.

    > Given a question and answer, create one
    > or more statements from each sentence
    > in the given answer.
    > question: [question]
    > answer: [answer]

2. For each generated statement, verify if it can be inferred from the context with the following prompt.

    > Consider the given context and following
    > statements, then determine whether they
    > are supported by the information present
    > in the context. Provide a brief explanation for each statement before arriving
    > at the verdict (Yes/No). Provide a final
    > verdict for each statement in order at the
    > end in the given format. Do not deviate
    > from the specified format.
    > statement: [statement 1]
    > ...
    > statement: [statement n]

3. Calculate the final score: Faithfulness = (number of supported statements) / (total number of statements)
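
The final step can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAS implementation: the function name `faithfulness_score` and the `verdicts` list (one "Yes"/"No" string per statement, already parsed from the LLM response) are assumptions made for this example.

```python
def faithfulness_score(verdicts):
    """Return the fraction of statements judged as supported by the context.

    `verdicts` is a hypothetical list of "Yes"/"No" strings, one per
    statement, parsed from the LLM verification response.
    """
    if not verdicts:
        # Assumption: with no statements to check, report a score of 0.0.
        return 0.0
    supported = sum(1 for v in verdicts if v.strip().lower() == "yes")
    return supported / len(verdicts)


# Three of the four statements are supported by the context.
print(faithfulness_score(["Yes", "No", "Yes", "Yes"]))  # 0.75
```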

## Answer Relevance

Intuitively, an answer is relevant if it directly addresses the question.

The process for quantifying this score is:

1. Use an LLM to generate "hypothetical" questions for a given answer with the following prompt:

    > Generate a question for the given answer.
    > answer: [answer]

2. Embed the generated "hypothetical" questions as vectors.
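3. Calculate the cosine similarity between each hypothetical question and the original question, sum those similarities, and divide by n, the number of hypothetical questions.

Expressed computationally: `Answer Relevance = sum(cos_sim(q, q_i) for q_i in generated_questions) / n`, where `q` is the embedding of the original question and each `q_i` is the embedding of one generated question.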
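
As a rough sketch of that calculation, the snippet below averages cosine similarities over plain Python lists. The function names and the toy 2-dimensional vectors are assumptions for illustration; in practice the vectors would come from an embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def answer_relevance(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and each generated question."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine_similarity(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


# Toy vectors: one generated question matches exactly, one is orthogonal.
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(q, generated))  # 0.5
```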

## Context Relevance

"The context is considered relevant to the extent that it exclusively contains information that is needed to answer the question."

The process:

1. Use the following LLM prompt to extract a subset of sentences necessary to answer the question. The context is defined as the formatted search result from the vector database.

    > Please extract relevant sentences from
    > the provided context that can potentially
    > help answer the following `{question}`. If no
    > relevant sentences are found, or if you
    > believe the question cannot be answered
    > from the given context, return the phrase
    > "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences
    > from given `{context}`.

2. Compute the context relevance score = (number of extracted sentences) / (total number of sentences in context)
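
The ratio in step 2 could be computed roughly as follows. The regex-based sentence splitter is a naive assumption for this sketch (RAGAS may segment sentences differently), and the function names and example text are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def context_relevance(extracted_sentences, context):
    """Ratio of extracted sentences to all sentences in the retrieved context."""
    total = len(split_sentences(context))
    if total == 0:
        return 0.0  # assumption: an empty context scores 0.0
    return len(extracted_sentences) / total


context = (
    "Apple reported record revenue. The weather was sunny. "
    "Services grew. Tim Cook spoke."
)
# Suppose the LLM extracted only one of the four sentences as relevant.
extracted = ["Apple reported record revenue."]
print(context_relevance(extracted, context))  # 0.25
```

A low score here means most of the retrieved text was not needed to answer the question.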

# Let's start coding!

If you just finished the previous examples, the Redis setup below may already be done for you.


# Initialize Redis and create chunks to populate the index

In [1]:
# init Redis connection and index
import os
from redisvl.index import SearchIndex
from redisvl.schema import IndexSchema
from redis import Redis

# init Redis connection
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
os.environ["REDIS_URL"] = REDIS_URL

index_name = 'langchain'
prefix = 'chunk'
schema = IndexSchema.from_yaml('sec_index.yaml')
client = Redis.from_url(REDIS_URL)

# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

In [2]:
# configure env
import json
import os
import warnings
warnings.filterwarnings("ignore")
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["ROOT_DIR"] = parent_directory

# set the folder for the locally downloaded sentence transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

In [33]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
from ingestion import get_sec_data
from ingestion import redis_bulk_upload

embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))
sec_data = get_sec_data()
chunks = redis_bulk_upload(sec_data, index, embeddings, tickers=['AAPL'])

 ✅ Loaded doc info for  110 tickers...
✅ Loaded 108 10K chunks for ticker=AAPL from AAPL-2021-10K.pdf
✅ Loaded 94 10K chunks for ticker=AAPL from AAPL-2023-10K.pdf
✅ Loaded 103 10K chunks for ticker=AAPL from AAPL-2022-10K.pdf
✅ Loaded 27 earning_call chunks for ticker=AAPL from 2018-May-01-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2019-Oct-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2016-Jan-26-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2020-Jul-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2017-Aug-01-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2020-Jan-28-AAPL.txt
✅ Loaded 34 earning_call chunks for ticker=AAPL from 2016-Apr-26-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2017-Jan-31-AAPL.txt
✅ Loaded 28 earning_call chunks for ticker=AAPL from 2019-Apr-30-AAPL.txt
✅ Loaded 26 earning_call chunks for ticker=AAPL from 2017-Nov-02-AAPL.txt
✅ Loaded 31 earning_call chunks f

In [34]:
flattened_chunks = [item for sublist in chunks for item in sublist]
len(flattened_chunks)

874

# Populate index and create vector store
This is the same as what we did in the previous examples.

In [4]:
from langchain_community.vectorstores import Redis as LangChainRedis
from utils import create_langchain_schemas_from_redis_schema

index_name = 'langchain'

vec_schema, main_schema = create_langchain_schemas_from_redis_schema('sec_index.yaml')

rds = LangChainRedis.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
    schema=main_schema
)

## Test it out!
We can see the vector store is populated and returning results.

In [6]:
rds.similarity_search("What was apples revenue last year?")[0]

Document(page_content="Earlier this month, released macOS Catalina with all new entertainment apps, innovative Sidecar feature that uses iPad to expand Mac workspace and new accessibility tools that enable users to control their Mac entirely with their voice. 1. Catalina brings Apple Arcade experience to Mac. 1. Already seeing some third-party developers bring their iPad apps to Mac App Store with Mac Catalyst, including Twitter, Post-it and more. 4. Launching newly redesigned Mac Pro this fall, which Co. is manufacturing in Austin, Texas. 7. Others: 1. In FY19, crossed $100b in revenue in US for first time. 2. Introduce new services from Apple Card to Apple TV+ and generated over $46b in total Services revenue, setting new yearly Services records in all five geographic segments and driving Services business to size of Fortune 70 co. 3. Delivered new hardware in all device categories. 4. Wearables business showed explosive growth and generated more annual revenue than two-thirds of com

# Setup RAG

In [7]:
from langchain_community.llms import Ollama

# we will use llama3 as our local llm for this use case
llm = Ollama(model="llama3")

In [8]:
def get_prompt():
    """Create the prompt template used by the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

In [9]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}
    

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold",
                               search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

# Now we have our RAG QA to test out

In [10]:
query = "What was Apple's revenue last year compared to this year??"
res = qa(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': "What was Apple's revenue last year compared to this year??",
 'result': "Question: What was Apple's revenue last year compared to this year??\nAnswer: According to the transcript, Apple's revenue grew by $36.4 billion in fiscal year '18, which is a significant increase from the previous year.\nSource: The transcript is based on the financial 10K filing of Apple Inc., specifically the section discussing revenue growth.",
 'source_documents': [Document(page_content="Thank you, Nancy. Good afternoon, everyone, and thanks for joining us. I just got back from Brooklyn, where we marked our fourth major launch at the end of the year. In addition to being a great time, it put an exclamation point at the end of a remarkable fiscal 2018. This year, we shipped our 2 billionth iOS device, celebrated the 10th anniversary of the App Store and achieved the strongest revenue and earnings in Apple's history. In fiscal year '18, our revenue grew by $36.4 billion. That's the equivalent of a Fo

# Setup complete!
Now let's generate some test questions to evaluate the answering abilities of the RAG QA using the metrics we introduced at the beginning. To do this we can use the LLM to come up with some potential questions.

In [26]:
prompt = """
    You are a helpful question generating bot.
    Generate 15 questions you might ask about Apple's financial performance from its 2023 annual report, earnings calls,
    and other financial documents. Return the response without any additional text as a json object of the form
    {"questions": [question1, question2, ..., question15]}
"""

questions = json.loads(llm.generate([prompt]).generations[0][0].text)["questions"]
questions

["What is Apple's total revenue for 2023 compared to the previous year?",
 'What percentage increase in Services revenue did Apple report in 2023?',
 "How much has Apple's gross margin increased/decreased over the past three years?",
 "What was Apple's operating cash flow for 2023, and how does it compare to 2022?",
 'In what sectors did Apple see significant growth in its hardware sales (e.g., Mac, iPad, etc.)?',
 "By what percentage did Apple's iPhone revenue increase or decrease in 2023 compared to the previous year?",
 "What was Apple's research and development expense for 2023, and how does it compare to 2022?",
 "How has Apple's capital expenditures changed over the past five years?",
 'In what regions did Apple see significant growth in its sales (e.g., Asia, Americas, etc.)?',
 "By what percentage did Apple's China revenue increase or decrease in 2023 compared to the previous year?",
 "What was Apple's effective tax rate for 2023, and how does it compare to the previous year?",

# Helper function for creating test dataset

In the following code we take a list of questions and a QA retrieval chain as input. We call the chain and store the answer returned along with the context (aka source documents) to be used as the essential data for our evaluation.


In [27]:

# define reusable helper function for evaluating our test set against different chains

from datasets import Dataset
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_relevancy,
)

from ragas import evaluate

def parse_contexts(source_docs):
    return [doc.page_content for doc in source_docs]

def create_evaluation_dataset(chain, questions):
    res_set = {
        "question": [],
        "answer": [],
        "contexts": [],
        # "ground_truth": []
    }

    for question in questions:
        # call QA chain
        result = chain(question)

        res_set["question"].append(question)
        res_set["answer"].append(result["result"])
        res_set["contexts"].append(parse_contexts(result["source_documents"]))
        # res_set["ground_truth"].append(test[1]["ground_truth"])
    return Dataset.from_dict(res_set)

def evaluate_chain(chain, questions, test_name):
    eval_dataset = create_evaluation_dataset(chain, questions)

    eval_result = evaluate(
        eval_dataset,
        metrics=[
            faithfulness,
            answer_relevancy,
            context_relevancy
        ],
    )

    eval_df = eval_result.to_pandas()
    # store the results of our test for future reference in csv
    eval_df.to_csv(f"{test_name}.csv")
    return eval_df

In [29]:
import getpass

# by default ragas evaluation uses OpenAI
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

basic_rag_test = evaluate_chain(qa, questions, "basic_rag_test")

> Entering new RetrievalQA chain...

> Finished chain.

Evaluating:   0%|          | 0/45 [00:00<?, ?it/s]

In [30]:
basic_rag_test.describe()

| | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|
| count | 15.0 | 15.0 | 15.0 |
| mean | 0.55623 | 0.585421 | 0.008943 |
| std | 0.262812 | 0.495011 | 0.004356 |
| min | 0.0 | 0.0 | 0.004274 |
| 25% | 0.416667 | 0.0 | 0.005445 |
| 50% | 0.6 | 0.951487 | 0.006849 |
| 75% | 0.732143 | 0.976251 | 0.011905 |
| max | 1.0 | 1.0 | 0.016667 |


# Analysis

We can see from the above results that our basic RAG didn't score particularly well. This is okay: now that we have a baseline for the performance of our RAG, we can try different techniques and measure whether they improve our results. This is why it is so important to have an evaluation framework in place; it lets us experiment systematically and see what actually helps our particular system.

As an example, we can see that our system scored worst on context_relevancy. This means that the context provided to the LLM contained a lot of information that was not needed to answer the question. One technique we could try to improve this score is creating dense propositions from our chunks, which aims to improve the searchability and economy of words of our content.

In [107]:
# for the sake of speed the propositions have been precomputed, but you can generate them with the create_dense_props function defined below
with open("propositions.json", "r") as f:
    preloaded_propositions = json.load(f)

In [108]:
preloaded_propositions[0]

'\n\n* United States Securities and Exchange Commission is a regulatory agency.\n* The commission is located in Washington, D.C. and has an address of 20549.\n* Apple Inc. is a company that filed this report with the SEC.\n* The company is registered under the name "Apple Inc." and is incorporated in California (State or other jurisdiction of incorporation or organization).\n* The company\'s I.R.S. Employer Identification Number is 94-2404110.\n* The company\'s principal executive offices are located at One Apple Park Way, Cupertino, California with a zip code of 95014 and a telephone number including area code as (408) 996-1010.\n* Securities registered pursuant to Section 12(b) of the Act include common stock with a par value per share of $0.00001 and various notes with different interest rates and maturity dates.\n* The trading symbol for these securities is AAPL, which is listed on The Nasdaq Stock Market LLC.\n* Apple Inc. is not required to file reports pursuant to Section 13 or 

In [110]:
flattened_chunks[0].page_content

'UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549\n\nFORM 10-K\n\n(Mark One)\n\n☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year ended September 25, 2021 or ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\nFor the transition period from to .\n\nCommission File Number: 001-36743\n\nApple Inc.\n\n(Exact name of Registrant as specified in its charter)\n\nCalifornia (State or other jurisdiction of incorporation or organization)\n\n94-2404110 (I.R.S. Employer Identification No.)\n\nOne Apple Park Way Cupertino, California (Address of principal executive offices)\n\n95014 (Zip Code)\n\n(408) 996-1010 (Registrant’s telephone number, including area code)\n\nSecurities registered pursuant to Section 12(b) of the Act:\n\nTitle of each class Common Stock, $0.00001 par value per share\n\n1.000% Notes due 2022 1.375% Notes due 2024 0.000% Notes due 2025 0.875% Notes due 2025

In [92]:
import time


def create_dense_props(chunk):
    """Create dense representation of raw text content."""

    preamble = """
        You are a helpful PDF extractor tool. You will be presented with segments from
        raw documents composed of information about public companies.

        Decompose and summarize the raw content into clear and simple propositions,
        ensuring they are interpretable out of context. Consider the following rules:
        1. Split compound sentences into simpler dense phrases that retain existing
        meaning.
        2. Simplify technical jargon or wording if possible while retaining existing
        meaning.
        3. For any named entity that is accompanied by additional descriptive information,
        separate this information into its own distinct proposition.
        4. Decontextualize the proposition by adding necessary modifier to nouns or
        entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that")
        with the full name of the entities they refer to.
        5. Respond in the format content: results where results is the raw decomposed content.
    """

    # use the chunk passed in, not always the first chunk
    prompt = f"""
        {preamble}
        Decompose this raw content using the rules above: {chunk.page_content}
    """

    try:
        return llm.generate([prompt]).generations[0][0].text
    except Exception as e:
        print("Failed to generate propositions; waiting before retrying:", str(e), flush=True)
        time.sleep(10)
        # Retry the same chunk (note: the number of retries is not bounded)
        return create_dense_props(chunk)
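
The retry in `create_dense_props` recurses without a limit, so a persistent failure would keep retrying indefinitely. One possible alternative is a small wrapper that gives up after a fixed number of attempts and waits longer between each one. The helper `call_with_retries` and its parameters are hypothetical and not part of the notebook; this is a sketch of the pattern only.

```python
import time


def call_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait and retry, re-raising after max_attempts.

    The delay doubles after each failed attempt (exponential backoff).
    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.0f}s", flush=True)
            sleep(delay)


# Demo with a function that fails twice, then succeeds (no real waiting).
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

print(call_with_retries(flaky, sleep=lambda seconds: None))  # ok
```

In the notebook, the wrapped callable could be something like `lambda: llm.generate([prompt]).generations[0][0].text`.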

In [111]:
# if creating from scratch
# propositions = [create_dense_props(chunk) for chunk in flattened_chunks]
# with open("propositions.json", "w") as f:
#     json.dump(propositions, f)

In [112]:
from langchain.docstore.document import Document

prop_docs = [Document(page_content=prop, metadata={"source": "local"}) for prop in preloaded_propositions]

# Embed the props and store them in a new index as vector field

In [116]:
from langchain.vectorstores.redis import Redis
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


# set the index name for this example
index_name = "proposition_index"

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "prop_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}],
    "content_vector_key": "prop_vector"    # name of the vector field in langchain
}


# construct the vector store class from texts and metadata
prop_rds = Redis.from_documents(
    documents=prop_docs,
    embedding=embeddings,
    index_name=index_name,
    redis_url=REDIS_URL,
    index_schema=index_schema,
)

`index_schema` does not match generated metadata schema.
If you meant to manually override the schema, please ignore this message.
index_schema: {'vector': [{'name': 'prop_vector', 'algorithm': 'HNSW', 'dims': 384, 'distance_metric': 'COSINE', 'datatype': 'FLOAT32'}], 'text': [{'name': 'content'}], 'content_vector_key': 'prop_vector'}
generated_schema: {'text': [{'name': 'source'}], 'numeric': [], 'tag': []}



# Create RAG chain but use prop index instead

In [117]:
prop_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=prop_rds.as_retriever(search_type="similarity_distance_threshold",search_kwargs={"distance_threshold":0.5}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    #verbose=True
)

In [118]:
prop_qa("What is Apple's total revenue for 2023 compared to the previous year?")

{'query': "What is Apple's total revenue for 2023 compared to the previous year?",
 'result': "Question: What is Apple's total revenue for 2023 compared to the previous year?\nAnswer: I don't know\nSource: The provided context does not include financial data or information about Apple's revenue for 2023. The annual report (FORM 10-K) only provides information up to September 25, 2021, and does not include data for subsequent years.",
 'source_documents': [Document(page_content="Results\n\nThe following are the decomposed and summarized propositions from the raw content:\n\n1. The UNITED STATES SECURITIES AND EXCHANGE COMMISSION is a regulatory agency responsible for overseeing public companies.\n2. Apple Inc. is a public company that has filed an annual report (FORM 10-K) with the SEC.\n3. Apple's fiscal year ended on September 25, 2021.\n4. Apple is headquartered in California and can be reached at (408) 996-1010.\n5. The company's common stock has a par value of $0.00001 per share.\n

In [102]:
prop_rag_test = evaluate_chain(prop_qa, questions, "prop_rag_test")

Evaluating:   0%|          | 0/45 [00:00<?, ?it/s]

In [103]:
prop_rag_test.describe()

| | faithfulness | answer_relevancy | context_relevancy |
|---|---|---|---|
| count | 15.0 | 15.0 | 15.0 |
| mean | 0.510476 | 0.333314 | 0.030353 |
| std | 0.451568 | 0.487921 | 0.036376 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.008475 |
| 50% | 0.666667 | 0.0 | 0.022222 |
| 75% | 0.928571 | 0.999852 | 0.031676 |
| max | 1.0 | 1.0 | 0.142857 |
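
One simple way to compare two evaluation runs is to look at the per-metric change in the mean score. The sketch below hard-codes the means reported by the two `describe()` outputs in this notebook; the helper name `compare_runs` is an assumption for illustration.

```python
# Mean scores copied from the two describe() outputs in this notebook.
baseline_means = {
    "faithfulness": 0.55623,
    "answer_relevancy": 0.585421,
    "context_relevancy": 0.008943,
}
prop_means = {
    "faithfulness": 0.510476,
    "answer_relevancy": 0.333314,
    "context_relevancy": 0.030353,
}


def compare_runs(before, after):
    """Return the per-metric change in mean score (after minus before)."""
    return {metric: after[metric] - before[metric] for metric in before}


for metric, delta in compare_runs(baseline_means, prop_means).items():
    direction = "improved" if delta > 0 else "regressed"
    print(f"{metric}: {delta:+.3f} ({direction})")
```

With only 15 questions per run, differences in the means should be read as a rough signal rather than a firm conclusion.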


# Analysis and conclusion


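Comparing the two runs, the proposition index raised the mean context_relevancy from 0.009 to 0.030, but the mean faithfulness (0.556 to 0.510) and answer_relevancy (0.585 to 0.333) both dropped. Dense propositions alone did not improve this system across the board, which is exactly the kind of trade-off an evaluation framework makes visible.

As a review, in this notebook we covered:
- why it's important to have an evaluation framework
- the basic theory of RAGAS
- how to generate and interpret faithfulness, answer_relevancy, and context_relevancy
- code to evaluate two different RAG chains to monitor how creating dense propositions might improve our results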
