# Document Retrieval


Retrieval-Augmented Generation (RAG) is a natural language processing paradigm that combines information retrieval with language generation. In the RAG framework, a pre-trained language model, often a large transformer model, is augmented with a retrieval component that fetches relevant passages from a set of source documents based on the input query. Feeding the retrieved passages into the generation step gives the model access to external knowledge, so its responses are better informed and more relevant to the query.

The RAG pattern involves two main phases: Retrieval and Augmented Generation. During Retrieval, the model identifies and retrieves pertinent information from the source documents. Subsequently, in the Augmented Generation phase, the retrieved content is utilized to enrich the generation of language, resulting in more contextually aware and coherent responses.

Evaluating RAG involves assessing the effectiveness of the Retrieval and Augmented Generation components independently. This notebook focuses specifically on evaluating the Retrieval part of RAG. The analysis compares three retriever configurations (BM25 keyword search, FAISS similarity search, and an equally weighted ensemble of the two) by checking whether each one returns the source document associated with a given query. The goal is to understand how well these retrieval techniques perform within the broader RAG pattern.

In [1]:
__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')
import os
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import MarkdownTextSplitter
from langchain.vectorstores import Chroma, FAISS
from langchain.retrievers import BM25Retriever, EnsembleRetriever
import pandas as pd
from tqdm import tqdm

pd.set_option('display.max_colwidth', None)

In [2]:
def ensemble_retriever_retrieval(query, ensemble_retriever):
    docs = ensemble_retriever.get_relevant_documents(query)
    sources = [doc.metadata.get('source') for doc in docs if 'source' in doc.metadata]
    return sources

In [3]:
# Load documents 
loader = DirectoryLoader('../data/external', glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()
text_splitter = MarkdownTextSplitter(chunk_overlap=0)
texts = text_splitter.split_documents(documents)

text_content = [doc.page_content if hasattr(doc, 'page_content') else doc.content for doc in texts]

In [4]:
# Embeddings
embeddings = HuggingFaceEmbeddings()
faiss_vectorstore = FAISS.from_documents(texts, embeddings)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 5})
bm25_retriever = BM25Retriever.from_texts(text_content)
bm25_retriever.k = 5

In the first case, represented by bm_25, an EnsembleRetriever combines retrievals from bm25_retriever with full emphasis (weight 1) and completely disregards contributions from faiss_retriever (weight 0). In the second case, denoted as faiss_ret, the emphasis is inverted, assigning weight 1 to faiss_retriever and weight 0 to bm25_retriever. This prioritizes results from the Faiss retriever exclusively. In the third case, ensemble_ret, both retrievers contribute equally (weights 0.5 each), resulting in a balanced ensemble that considers both BM25 and Faiss retrievers with equal influence on the final retrieval outcomes.
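To make the effect of the weights concrete, here is a minimal sketch of weighted Reciprocal Rank Fusion, the rank-fusion approach that LangChain's documentation describes for `EnsembleRetriever`. It is an illustration under assumptions, not the library's code: the function name `weighted_rrf`, the smoothing constant `c=60`, and the document ids are all hypothetical.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    rankings: list of ranked lists of document ids (best first).
    weights: one non-negative weight per ranked list.
    c: smoothing constant (60 is an assumed, commonly used value).
    """
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical document ids, for illustration only.
bm25_ranking = ["faq", "architecture", "welcome"]
faiss_ranking = ["architecture", "welcome", "faq"]

bm25_only = weighted_rrf([bm25_ranking, faiss_ranking], [1, 0])
faiss_only = weighted_rrf([bm25_ranking, faiss_ranking], [0, 1])
balanced = weighted_rrf([bm25_ranking, faiss_ranking], [0.5, 0.5])

print("weights [1, 0]:    ", bm25_only)
print("weights [0, 1]:    ", faiss_only)
print("weights [0.5, 0.5]:", balanced)
```

With weights `[1, 0]` the fused order follows the BM25 ranking, with `[0, 1]` it follows the FAISS ranking, and with equal weights a document that ranks well in both lists moves to the top.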

In [5]:
bm_25 = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[1, 0]
)

faiss_ret= EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0, 1]
)

ensemble_ret= EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

## Loading queries into a DataFrame

In [6]:
df = pd.read_csv('../data/processed/validation_data.csv')[['Question']]
df['Sources'] = '../data/external/rosaworkshop/14-faq.md'

In [7]:
df.head(2)

Unnamed: 0,Question,Sources
0,What is Red Hat OpenShift Service on AWS (ROSA)?,../data/external/rosaworkshop/14-faq.md
1,Where can I go to get more information/details?,../data/external/rosaworkshop/14-faq.md


In [8]:
# Iterate through rows in the DataFrame and retrieve sources
for index, row in tqdm(df.iterrows()):
    question = row['Question']

    # Retrieve sources using bm_25
    sources_bm25 = ensemble_retriever_retrieval(question, bm_25)
    
    # Retrieve sources using similarity search
    source_Faiss_ret = ensemble_retriever_retrieval(question, faiss_ret)

    # Retrieve sources using ensemble retriever
    sources_ensemble_retriever = ensemble_retriever_retrieval(question, ensemble_ret)

    # Update the DataFrame with the new columns
    df.at[index, 'source_Faiss_ret'] = ', '.join(map(str, source_Faiss_ret))
    df.at[index, 'source_bm25'] = ', '.join(map(str, sources_bm25))
    df.at[index, 'source_ensemble_retriever'] = ', '.join(map(str, sources_ensemble_retriever))


65it [00:34,  1.89it/s]


In [9]:
df = df[['Question','Sources','source_Faiss_ret', 'source_bm25','source_ensemble_retriever']]
df.head(2)

Unnamed: 0,Question,Sources,source_Faiss_ret,source_bm25,source_ensemble_retriever
0,What is Red Hat OpenShift Service on AWS (ROSA)?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/welcome.md, ../data/external/rh-mobb/_index.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md"
1,Where can I go to get more information/details?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md"


# Metrics for Evaluation

In this segment, we will provide a concise overview of various metrics employed in the assessment of retrieval performance.

## Recall@K

Recall@K measures the proportion of relevant items that are retrieved within the top K results.


$Recall@k = \frac{Number\:of\:relevant\:documents\:retrieved\:in\:top\:k}{Total\:number\:of\:relevant\:documents}$


In [10]:
def recall(actual, predicted, k):
    act_set = set([actual])
    pred_set = set(predicted[:k])
    if len(act_set & pred_set) == 0:
        return 0
    result = round(len(act_set & pred_set) / float(len(act_set)), 2)
    return result

In [11]:
actual = "apple"
predicted = "banana","apple","orange","grape", "apple"

recall_5 = recall(actual, predicted, 5)

print(f"Recall@5 : {recall_5}")

Recall@5 : 1.0


In this example, the retrieval system returns a list of 5 predicted fruit names, and "apple" is the only relevant item. Recall@5 measures whether the relevant items appear within the top 5 results. A Recall@5 of 1 means that every relevant item is among the first 5 results returned by the system. The duplicate "apple" in the list does not change the score, because the function compares sets.

**Pros:**

- Easily interpretable metric.
- A perfect score (1) indicates retrieval of all relevant items.
- Using a smaller k makes the metric stricter, which shows how the system performs when only a few results are used.

**Cons:**

- Increasing K makes a perfect score easier to reach, which can be misleading.
- Doesn't consider the position of relevant items in the ranking.
- As a result, it cannot distinguish between a relevant result at rank one and one at rank four.
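The position-insensitivity noted above can be seen in a small sketch. The helper `recall_at_k` and the fruit lists are hypothetical; unlike the notebook's `recall` function, this version accepts a list of relevant items.

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of the relevant items that appear in the top-k retrieved items."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    hits = relevant_set & set(retrieved[:k])
    return len(hits) / len(relevant_set)


relevant = ["apple"]
early_hit = ["apple", "banana", "orange", "grape", "kiwi"]  # relevant item at rank 1
late_hit = ["banana", "orange", "grape", "apple", "kiwi"]   # relevant item at rank 4

recall_early = recall_at_k(relevant, early_hit, 5)
recall_late = recall_at_k(relevant, late_hit, 5)
recall_late_top3 = recall_at_k(relevant, late_hit, 3)

print(f"Recall@5, hit at rank 1: {recall_early}")
print(f"Recall@5, hit at rank 4: {recall_late}")
print(f"Recall@3, hit at rank 4: {recall_late_top3}")
```

Both lists score the same at k=5 even though one places the relevant item first and the other fourth; only lowering k to 3 separates them.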

## Mean Reciprocal Rank

Unlike recall@k, MRR (Mean Reciprocal Rank) actually takes the order of relevant items into consideration. It is particularly focused on the position of the first relevant item in the ranked list of results, providing a more fine-grained evaluation in the context of information retrieval systems.

For each query, we calculate the reciprocal rank (RR), which is the inverse of the rank at which the first relevant item appears in the ranked list.

$Reciprocal\:Rank(RR) = \frac{1}{Rank\:of\:the\:first\:relevant\:item}$; the value is $0$ if no relevant item is found in the list.

Then we calculate the average reciprocal rank across all queries. MRR provides a single aggregated measure of how well a system places the first relevant item in the ranked list.

$MRR = \frac{RR_1\:+\:RR_2\:+\:RR_3\:+\:...}{Total\:number\:of\:queries}$, where $RR_i$ is the reciprocal rank for the $i$-th query.

**Pros:**

- MRR takes order into consideration, making it suitable for scenarios where the position of the first relevant result is crucial, such as chatbots or question-answering systems.

**Cons:**

- MRR considers only the rank of the first relevant item and ignores others, which may not be suitable for use cases involving the retrieval of multiple relevant items, as in recommendation or search engines.

- In situations where multiple relevant items need to be returned (e.g., recommending ~10 products), MRR may yield a perfect score even if only one relevant item is returned in the top position, potentially masking the poor performance of returning other irrelevant items.
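As a minimal sketch of the MRR definition above, the following code computes the reciprocal rank per query and averages it. The function names and the three toy queries are hypothetical and are not taken from the dataset.

```python
def reciprocal_rank(relevant, retrieved):
    """Return 1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant_set = set(relevant)
    for position, item in enumerate(retrieved, start=1):
        if item in relevant_set:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over (relevant, retrieved) pairs."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(rel, ret) for rel, ret in queries) / len(queries)


queries = [
    (["apple"], ["apple", "banana", "orange"]),  # first relevant item at rank 1
    (["apple"], ["banana", "apple", "orange"]),  # first relevant item at rank 2
    (["apple"], ["banana", "orange", "grape"]),  # no relevant item retrieved
]

mrr = mean_reciprocal_rank(queries)
print(f"MRR: {mrr:.2f}")  # (1 + 0.5 + 0) / 3 = 0.50
```

Only the first relevant item in each list affects the score, which is exactly the limitation described in the cons above.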

## Normalized Discounted Cumulative Gain (NDCG@K)

Normalized Discounted Cumulative Gain (NDCG@K) is a metric commonly used in information retrieval and recommendation systems to evaluate the quality of ranked results. Unlike Recall@K, which only checks whether relevant items are present in the top K results, NDCG@K considers both the relevance and the position of each item in the top K, yielding a normalized score between 0 and 1.

$NDCG@K= \frac{DCG@K}{IDCG@K}$ 

- **Discounted Cumulative Gain (DCG@K):** It evaluates the quality of the ranked list by considering both item relevance and position. It sums up the relevance scores with logarithmic discounting based on item position, prioritizing highly relevant items at the top. Higher DCG@K values indicate superior ranking.

$DCG@K = \sum_{k = 1}^K \frac{rel_k}{\log_2(1+k)}$

- **Ideal Discounted Cumulative Gain (IDCG@K):** It represents the highest achievable DCG@K under perfect ranking conditions. In the ideal scenario, the items are arranged in descending order of relevance.

$IDCG@K = \sum_{k = 1}^K \frac{rel_{ideal,k}}{\log_2(1+k)}$

**How to calculate NDCG@K:**

Consider the following example lists:

- Actual List: ["Apple", "Banana", "Orange"]
- Retrieved List: ["Banana", "Apple", "Orange"] 

Now based on the actual list, we observe that "Apple" is the most relevant, whereas "Orange" is the least relevant. Hence, we will assign a relevance score to each depending upon their relevance.

$Apple:2\:,\:Banana:1\:,\:Orange:0$ 
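The mapping from the retrieved list to a list of relevance scores can be sketched as follows. The dictionary name `relevance_scores` is hypothetical, and treating unknown items as relevance 0 is an assumption of this sketch.

```python
# Relevance scores taken from the example above.
relevance_scores = {"Apple": 2, "Banana": 1, "Orange": 0}
retrieved = ["Banana", "Apple", "Orange"]

# Look up each retrieved item in order; items without a score default to 0
# (an assumption of this sketch).
relevance = [relevance_scores.get(item, 0) for item in retrieved]

print(relevance)  # [1, 2, 0]
```

The resulting list gives, position by position, the relevance of what was retrieved, which is the input that the DCG formula needs.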

Using these scores, we calculate:

$DCG@3 = \frac{1}{\log_2 (1+1)}+\frac{2}{\log_2 (1+2)}+\frac{0}{\log_2 (1+3)} = 2.26$

$IDCG@3 = \frac{2}{\log_2 (1+1)}+\frac{1}{\log_2 (1+2)}+\frac{0}{\log_2 (1+3)} = 2.63$

$NDCG@3 = \frac{2.26}{2.63} = 0.86$

Here is how we can implement the calculation in Python.

In [12]:
from math import log2

# The retrieved list is ["Banana", "Apple", "Orange"]; each entry below is the
# relevance score of the item at that position.
relevance = [1, 2, 0]

# Number of retrieved items
K = len(relevance)

# Calculate DCG
dcg = sum(rel_k / log2(1 + k) for k, rel_k in enumerate(relevance, 1))

# Sort items in 'relevance' from most relevant to less relevant for IDCG
ideal_relevance = sorted(relevance, reverse=True)

# Calculate IDCG
idcg = sum(rel_k / log2(1 + k) for k, rel_k in enumerate(ideal_relevance, 1))

# Calculate NDCG
ndcg = dcg / idcg if idcg != 0 else 0

print(f"Relevance Scores: {relevance}")
print(f"DCG@{K}: {dcg:.2f}")
print(f"IDCG@{K}: {idcg:.2f}")
print(f"NDCG@{K}: {ndcg:.2f}")

Relevance Scores: [1, 2, 0]
DCG@3: 2.26
IDCG@3: 2.63
NDCG@3: 0.86


An NDCG@K of 1.0 means the ranking matches the ideal ordering (DCG@K equals IDCG@K). A score of 0.86 therefore indicates that the retrieved ranking is close to ideal: the only error is that the two most relevant items are swapped.
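The same calculation can be wrapped in a reusable function to compare different orderings. This is a sketch under one simplifying assumption, shared with the cell above: the ideal ordering is obtained by sorting the retrieved items' own relevance scores. The names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
from math import log2


def dcg_at_k(relevance, k):
    """Discounted cumulative gain over the first k relevance scores."""
    return sum(rel / log2(1 + pos) for pos, rel in enumerate(relevance[:k], start=1))


def ndcg_at_k(relevance, k):
    """NDCG@K, taking the ideal order to be the same scores sorted descending."""
    idcg = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / idcg if idcg else 0.0


ndcg_retrieved = ndcg_at_k([1, 2, 0], 3)  # order from the example
ndcg_ideal = ndcg_at_k([2, 1, 0], 3)      # most relevant item first
ndcg_reversed = ndcg_at_k([0, 1, 2], 3)   # least relevant item first

print(f"Retrieved order: {ndcg_retrieved:.2f}")
print(f"Ideal order:     {ndcg_ideal:.2f}")
print(f"Reversed order:  {ndcg_reversed:.2f}")
```

The ideal ordering scores 1.0, the example ordering scores 0.86, and the reversed ordering scores lower still, which shows how the metric penalizes relevant items that appear late.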

**Pros of NDCG@K:**

- **Relevance Optimization:** Emphasizes highly relevant document ranking for better user satisfaction.

- **Order-Awareness:** Considers both relevance and order, capturing the importance of document position.

- **Interpretable Score:** Easily understandable score (0 to 1), facilitating quick system assessment.

**Cons of NDCG@K:**

- **Data Labeling Complexity:** Requires well-labeled data with detailed relevance and relative importance information.

- **Subjectivity in Ranking:** Introduces subjectivity in determining relative relevance.

# Data Analysis

In this section, we will apply the metrics discussed above to evaluate our dataset. Due to insufficient labeling in our dataset, we will focus exclusively on utilizing Recall@K and the Mean Reciprocal Rank (MRR) metric for evaluation.

In [13]:
df.head(1)

Unnamed: 0,Question,Sources,source_Faiss_ret,source_bm25,source_ensemble_retriever
0,What is Red Hat OpenShift Service on AWS (ROSA)?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/welcome.md, ../data/external/rh-mobb/_index.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md"


In [14]:
# Define the recall and rank functions if not defined already
def recall(actual, predicted, k):
    act_set = set([actual])
    pred_set = set(predicted[:k])
    if len(act_set & pred_set) == 0:
        return 0
    result = round(len(act_set & pred_set) / float(len(act_set)), 2)
    return result

def rank(actual, predicted):
    return predicted.index(actual) + 1 if actual in predicted else None

# Loop over the rows
for index, row in df.iterrows():
    query = row['Question']
    actual_source = row['Sources']
    sources_faiss_ret = row['source_Faiss_ret'].split(', ')
    sources_bm25 = row['source_bm25'].split(', ')
    sources_ensemble = row['source_ensemble_retriever'].split(', ')

    # Calculate Recall@5 and Rank@5 for source_Faiss_ret
    recall_5_faiss_ret = recall(actual_source, sources_faiss_ret, 5)
    rank_5_faiss_ret = rank(actual_source, sources_faiss_ret)

    # Calculate Recall@5 and Rank@5 for source_bm25
    recall_5_bm25 = recall(actual_source, sources_bm25, 5)
    rank_5_bm25 = rank(actual_source, sources_bm25)

    # Calculate Recall@5 and Rank@5 for source_ensemble_retriever
    recall_5_ensemble = recall(actual_source, sources_ensemble, 5)
    rank_5_ensemble = rank(actual_source, sources_ensemble)

    # Add new columns to the DataFrame
    df.at[index, 'Recall@5_source_Faiss_ret'] = recall_5_faiss_ret
    df.at[index, 'Rank@5_source_Faiss_ret'] = rank_5_faiss_ret

    df.at[index, 'Recall@5_source_bm25'] = recall_5_bm25
    df.at[index, 'Rank@5_source_bm25'] = rank_5_bm25

    df.at[index, 'Recall@5_source_ensemble_retriever'] = recall_5_ensemble
    df.at[index, 'Rank@5_source_ensemble_retriever'] = rank_5_ensemble

# Replace NaN values with 0 in the new columns
df.fillna(0, inplace=True)

In [15]:
df.head(5)

Unnamed: 0,Question,Sources,source_Faiss_ret,source_bm25,source_ensemble_retriever,Recall@5_source_Faiss_ret,Rank@5_source_Faiss_ret,Recall@5_source_bm25,Rank@5_source_bm25,Recall@5_source_ensemble_retriever,Rank@5_source_ensemble_retriever
0,What is Red Hat OpenShift Service on AWS (ROSA)?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/welcome.md, ../data/external/rh-mobb/_index.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/welcome.md",1.0,1.0,1.0,1.0,1.0,1.0
1,Where can I go to get more information/details?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md",1.0,5.0,1.0,1.0,1.0,1.0
2,What are the benefits of Red Hat OpenShift Service on AWS (Key Features)?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rh-mobb/_index.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md",1.0,1.0,1.0,1.0,1.0,1.0
3,What are the differences between Red Hat OpenShift Service on AWS and Kubernetes?,../data/external/rosaworkshop/14-faq.md,"../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/welcome.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_install_access_delete_clusters.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_install_access_delete_clusters.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/welcome.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/welcome.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_install_access_delete_clusters.md",1.0,1.0,1.0,1.0,1.0,1.0
4,What exactly am I responsible for and what is Red Hat / AWS responsible for?,../data/external/rosaworkshop/14-faq.md,"../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rh-mobb/_index.md","../data/external/rosaworkshop/14-faq.md, ../data/external/rh-mobb/_index.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md, ../data/external/rosa-docs/rosa_architecture.md",1.0,4.0,1.0,1.0,1.0,1.0


Let's check the recall values for the three cases.

In [16]:
print(f"Recall value for Faiss retriever: {df['Recall@5_source_Faiss_ret'].sum()}")
print(f"Recall value for bm25 retriever: {df['Recall@5_source_bm25'].sum()}")
print(f"Recall value for ensemble retriever: {df['Recall@5_source_ensemble_retriever'].sum()}")
print(f"Total number of queries : {df['Question'].nunique()}")

Recall value for Faiss retriever: 35.0
Recall value for bm25 retriever: 35.0
Recall value for ensemble retriever: 35.0
Total number of queries : 65


The summed Recall@5 values for the FAISS retriever, BM25 retriever, and ensemble retriever are all 35.0. Because each query has exactly one relevant source, Recall@5 for a query is either 0 or 1, so the sum counts the queries for which the source document appeared in the top 5: 35 of 65 queries, or a mean Recall@5 of about 0.54 for each configuration. By this metric, the three configurations perform identically.

Now, let's evaluate the Mean Reciprocal Rank (MRR) for three different retrieval cases, considering the order of relevance. We will compare the MRR values to assess the performance of each case.

In [17]:
mrr_values = {}

for column in ['Rank@5_source_Faiss_ret', 'Rank@5_source_bm25', 'Rank@5_source_ensemble_retriever']:
    ranks_column = df[column]
    reciprocal_ranks = 1 / ranks_column.replace(0, float('inf')).astype(float)
    sum_reciprocal_ranks = reciprocal_ranks.sum()
    mrr = sum_reciprocal_ranks / len(df)
    mrr_values[column] = mrr

# Print the MRR for each retriever configuration
for column, mrr in mrr_values.items():
    print(f"Mean Reciprocal Rank (MRR) for {column}: {mrr}")

Mean Reciprocal Rank (MRR) for Rank@5_source_Faiss_ret: 0.3953846153846154
Mean Reciprocal Rank (MRR) for Rank@5_source_bm25: 0.48435897435897435
Mean Reciprocal Rank (MRR) for Rank@5_source_ensemble_retriever: 0.47948717948717945


These values indicate how early each retrieval method places the relevant document within the top 5 results; a higher MRR means the relevant document tends to appear earlier in the ranking. BM25 has the highest MRR (0.484), closely followed by the ensemble retriever (0.479), while the FAISS retriever scores lowest (0.395). Since all three configurations have the same summed Recall@5, the difference comes from where the relevant document is ranked, not from whether it is retrieved.

# Conclusion

In this notebook, we:

- Explored document retrieval using a basic ensemble retriever.
- Examined various evaluation metrics, including recall and mean reciprocal rank.
- Conducted a comparative analysis of different retriever configurations.
- Gained insights into effective document retrieval strategies and metric considerations.