# Document Retrieval


Retrieval-Augmented Generation (RAG) is a natural language processing pattern that combines information retrieval with language generation. A pre-trained language model, often a large transformer, is paired with a retrieval component that fetches passages relevant to the input query from a set of source documents. Supplying those passages to the model at generation time gives it access to external knowledge and lets it produce better-grounded, more relevant responses.

The RAG pattern involves two main phases: Retrieval and Augmented Generation. During Retrieval, the model identifies and retrieves pertinent information from the source documents. Subsequently, in the Augmented Generation phase, the retrieved content is utilized to enrich the generation of language, resulting in more contextually aware and coherent responses.

Evaluating RAG involves assessing the Retrieval and Augmented Generation components independently. This notebook focuses on the Retrieval part only. It compares two retrieval methods, similarity search and ensemble retrieval, by checking whether the source document associated with each query appears among the documents each method retrieves.
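As a rough sketch of what this kind of retrieval evaluation measures: for every query we know which document the question was written from, and we count how often that document shows up among the retrieved sources. The helper name `hit_rate` and the toy data are illustrative and not part of the notebook. The notebook's own check, further down, uses substring matching on file names rather than exact membership.

```python
def hit_rate(expected_sources, retrieved_sources):
    """Fraction of queries whose expected source appears among the retrieved ones."""
    if not expected_sources:
        return 0.0
    hits = sum(
        1
        for expected, retrieved in zip(expected_sources, retrieved_sources)
        if expected in retrieved
    )
    return hits / len(expected_sources)


# Toy data (illustrative only): three queries, the second one misses.
expected = ["serverless", "logging", "networking"]
retrieved = [
    ["serverless", "rosa_planning"],
    ["authentication", "rosa_architecture"],
    ["networking"],
]
print(hit_rate(expected, retrieved))
```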

In [1]:
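# Swap the standard-library sqlite3 module for pysqlite3 before Chroma is imported
# (presumably because the system SQLite is older than the version Chroma expects).
__import__('pysqlite3')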
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

In [2]:
import os
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import MarkdownTextSplitter
from langchain.vectorstores import Chroma
from langchain.retrievers import BM25Retriever, EnsembleRetriever
import pandas as pd
from tqdm import tqdm

In [3]:
pd.set_option('display.max_colwidth', None)

In [4]:
def similarity_search_retrieval(query, retriever):
    # `retriever` is the Chroma vector store here, which exposes similarity_search()
    docs = retriever.similarity_search(query, k=5)
    sources = [i.metadata.get('source') for i in docs]
    return sources

In [5]:
def ensemble_retriever_retrieval(query, ensemble_retriever):
    docs = ensemble_retriever.get_relevant_documents(query)
    sources = [doc.metadata.get('source') for doc in docs]
    return sources

In [6]:
# Load documents 
loader = DirectoryLoader('../data/external', glob="**/*.md", loader_cls=TextLoader)
documents = loader.load()
text_splitter = MarkdownTextSplitter(chunk_overlap=0)
texts = text_splitter.split_documents(documents)

text_content = [doc.page_content if hasattr(doc, 'page_content') else doc.content for doc in texts]

In [7]:
# Embeddings
embeddings = HuggingFaceEmbeddings()

In [8]:
# create Chroma vector store
chroma_vectorstore = Chroma.from_documents(texts, embeddings)

In [9]:
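# Create BM25 retriever
# Note: from_texts() receives plain strings, so these documents carry no
# 'source' metadata and BM25 hits appear as None in the results below.
# BM25Retriever.from_documents(texts) would keep the metadata.
bm25_retriever = BM25Retriever.from_texts(text_content)
bm25_retriever.k = 5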

In [10]:
# Create Chroma retriever
chroma_retriever = chroma_vectorstore.as_retriever(search_kwargs={"k": 5})

In [11]:
# Create ensemble retriever
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, chroma_retriever],
    weights=[0.6, 0.4]  # 0.6 for BM25, 0.4 for Chroma (same order as `retrievers`)
)
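LangChain's documentation describes `EnsembleRetriever` as combining the ranked lists from its retrievers with weighted Reciprocal Rank Fusion. The standalone sketch below illustrates that idea only; it is not the library's implementation, and the function name `weighted_rrf`, the constant `c=60`, and the toy chunk IDs are assumptions made for illustration.

```python
from collections import defaultdict


def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of IDs into a single ranking.

    Each item scores weight / (c + rank) in every list that contains it
    (rank starts at 1), and the scores are summed across lists.
    """
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, item in enumerate(ranking, start=1):
            scores[item] += weight / (c + rank)
    # Highest fused score first; sorted() is stable, so ties keep first-seen order.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical chunk IDs standing in for retrieved documents.
bm25_hits = ["chunk_a", "chunk_b", "chunk_c"]
dense_hits = ["chunk_b", "chunk_d", "chunk_a"]
fused = weighted_rrf([bm25_hits, dense_hits], weights=[0.6, 0.4])
print(fused)
```

Because the fused list is the union of both inputs, two retrievers with `k=5` each can return up to ten documents per query, which is consistent with the nine or ten entries per row in the ensemble column of the output further down.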

## Getting queries from the DataFrame

In [12]:
df = pd.read_csv('final-ds.csv')[['Question','Source']]

In [13]:
# Iterate through rows in the DataFrame and retrieve sources
for index, row in tqdm(df.iterrows()):
    question = row['Question']

    # Retrieve sources using similarity search
    sources_similarity_search = similarity_search_retrieval(question, chroma_vectorstore)

    # Retrieve sources using ensemble retriever
    sources_ensemble_retriever = ensemble_retriever_retrieval(question, ensemble_retriever)

    # Update the DataFrame with the new columns
    df.at[index, 'source_similarity_search'] = ', '.join(map(str, sources_similarity_search))
    df.at[index, 'source_ensemble_retriever'] = ', '.join(map(str, sources_ensemble_retriever))


7560it [28:10,  4.47it/s]


In [14]:
df = df[['Question','Source','source_similarity_search','source_ensemble_retriever']]

In [15]:
df.head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever
0,How do I install the Knative CLI? \n,../data/processed/dataset-qa/rosa-docs-serverless_35.json,"../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md","None, None, None, None, None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"
1,How do I use the Knative CLI to describe a subscription?\n,../data/processed/dataset-qa/rosa-docs-serverless_35.json,"../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md","../data/external/rosa-docs/serverless.md, None, None, None, None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"
2,How do I use the Knative CLI to list existing subscriptions?\n,../data/processed/dataset-qa/rosa-docs-serverless_35.json,"../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md","None, None, None, None, None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"
3,How do I update a subscription using the Knative CLI?\n,../data/processed/dataset-qa/rosa-docs-serverless_35.json,"../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md","../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, None, None, None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"
4,What parameters can I configure for event delivery? \n,../data/processed/dataset-qa/rosa-docs-serverless_35.json,"../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md","None, None, None, None, None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"


In [18]:
#df.to_csv('analysed-data.csv')

## Data analysis

In [19]:
#df = pd.read_csv('analysed-data.csv')

In [20]:
df = df[['Question','Source','source_similarity_search','source_ensemble_retriever']]

In [21]:
# Extracting the file name from the 'Source' column
df['Source'] = df['Source'].apply(lambda x: '_'.join(x.split("/")[-1].split("_")[:-1]))

In [22]:
# Pre-processing 'source_similarity_search' column
df['source_similarity_search'] = df['source_similarity_search'].apply(lambda x: ', '.join(set(y.split("/")[-1].split(".")[0] for y in x.split(", "))))

In [23]:
# Pre-processing 'source_ensemble_retriever' column
df['source_ensemble_retriever'] = df['source_ensemble_retriever'].apply(
    lambda x: ', '.join(set(y.split("/")[-1].split(".")[0] for y in x.split(", ") if y != 'None')) if pd.notna(x) else None
)
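To make the three pre-processing cells above concrete, here is the same string handling applied to paths taken from the `df.head()` output. The function names are illustrative; the logic mirrors the lambdas above.

```python
def expected_doc_name(source_path):
    """'.../rosa-docs-serverless_35.json' -> 'rosa-docs-serverless'."""
    file_name = source_path.split("/")[-1]
    # Drop the trailing '_<number>.json' piece and keep everything before it.
    return "_".join(file_name.split("_")[:-1])


def retrieved_doc_names(joined_paths):
    """Comma-joined paths -> set of base file names, skipping 'None' entries."""
    return {
        path.split("/")[-1].split(".")[0]  # keeps only the text before the first dot
        for path in joined_paths.split(", ")
        if path != "None"
    }


print(expected_doc_name("../data/processed/dataset-qa/rosa-docs-serverless_35.json"))
print(retrieved_doc_names(
    "None, ../data/external/rosa-docs/serverless.md, ../data/external/rosa-docs/serverless.md"
))
```

The first call yields `rosa-docs-serverless` and the second a set containing only `serverless`, matching the `Source` and `source_ensemble_retriever` values in the processed table below.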

In [24]:
# Count the number of elements in 'source_similarity_search' and 'source_ensemble_retriever'
df['Number_of_sources_from_similarity_search'] = df['source_similarity_search'].apply(lambda x: len(x.split(', ')))
df['Number_of_sources_from_ensemble_retriever'] = df['source_ensemble_retriever'].apply(lambda x: len(x.split(', ')) if pd.notna(x) else 0)

In [25]:
# Create a boolean column for 'source_similarity_search'
df['Match_in_source_similarity_search'] = df.apply(lambda row: any(source in row['Source'] for source in row['source_similarity_search'].split(', ')), axis=1).astype(int)

In [26]:
# Create a boolean column for 'source_ensemble_retriever'
df['Match_in_source_ensemble_retriever'] = df.apply(lambda row: any(source in row['Source'] for source in row['source_ensemble_retriever'].split(', ')) if pd.notna(row['source_ensemble_retriever']) else False, axis=1).astype(int)
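The match test above is a substring check: a query counts as matched when any retrieved file name occurs inside the expected source name. The sketch below reproduces that check on values from this notebook and shows one edge case worth knowing about. `has_match` and `has_match_safe` are illustrative names, and the guarded variant is a suggestion rather than what the notebook runs.

```python
def has_match(expected, retrieved_joined):
    """Same test as the notebook: some retrieved name is a substring of `expected`."""
    return any(name in expected for name in retrieved_joined.split(", "))


def has_match_safe(expected, retrieved_joined):
    """Variant that ignores empty names before running the substring test."""
    names = [name for name in retrieved_joined.split(", ") if name]
    return any(name in expected for name in names)


# 'serverless' occurs inside 'rosa-docs-serverless', so this is a match.
assert has_match("rosa-docs-serverless", "serverless, rosa_planning")
# No retrieved name occurs inside 'rosa-docs-logging'.
assert not has_match("rosa-docs-logging", "authentication, rosa_architecture")

# Edge case: an empty string splits to [''], and '' is a substring of every
# string, so a row with no usable sources would be counted as a match.
assert has_match("rosa-docs-logging", "")
assert not has_match_safe("rosa-docs-logging", "")
```

The same edge case affects the source counts, since `''.split(', ')` has length 1. It only matters if a row ends up with no non-`None` sources at all.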

In [27]:
df.head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever,Number_of_sources_from_similarity_search,Number_of_sources_from_ensemble_retriever,Match_in_source_similarity_search,Match_in_source_ensemble_retriever
0,How do I install the Knative CLI? \n,rosa-docs-serverless,serverless,serverless,1,1,1,1
1,How do I use the Knative CLI to describe a subscription?\n,rosa-docs-serverless,serverless,serverless,1,1,1,1
2,How do I use the Knative CLI to list existing subscriptions?\n,rosa-docs-serverless,serverless,serverless,1,1,1,1
3,How do I update a subscription using the Knative CLI?\n,rosa-docs-serverless,serverless,serverless,1,1,1,1
4,What parameters can I configure for event delivery? \n,rosa-docs-serverless,serverless,serverless,1,1,1,1


**Total queries (unique questions, then total rows)**

In [28]:
df.Question.nunique()

6212

In [46]:
len(df.Question)

7560

**Percentage Match from similarity search**

In [44]:
(df.Match_in_source_similarity_search.sum() / len(df.Question))*100

69.97354497354497

**Percentage Match from Ensemble search**

In [45]:
(df.Match_in_source_ensemble_retriever.sum() / len(df.Question))*100

69.78835978835978

**Cases when there is no match in Similarity search**

In [31]:
df[df.Match_in_source_similarity_search==0].head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever,Number_of_sources_from_similarity_search,Number_of_sources_from_ensemble_retriever,Match_in_source_similarity_search,Match_in_source_ensemble_retriever
33,How do I create an AWS service account for clusters with AWS Security Token Service (STS) enabled?\n\n,rosa-docs-logging,"authentication, ecr-secret-operator-iam_assume_role, using-sts-with-aws-services-_index, aws-secrets-manager-csi-_index","authentication, ecr-secret-operator-iam_assume_role, using-sts-with-aws-services-_index, aws-secrets-manager-csi-_index",4,4,0,0
34,How do I create a role for AWS using the CredentialsRequest CR?\n,rosa-docs-logging,"rosa_architecture, aws-secrets-manager-csi-_index, rosa_planning, authentication, ecr-secret-operator-iam_assume_role","rosa_architecture, aws-secrets-manager-csi-_index, rosa_planning, authentication, ecr-secret-operator-iam_assume_role",5,5,0,0
45,What is the purpose of a security group?\n\n,rosa-docs-rosa_install_access_delete_clusters,"authentication, rosa_architecture","authentication, rosa_architecture",2,2,0,0
49,What is the purpose of the MasterSecurityGroup?\n,rosa-docs-rosa_install_access_delete_clusters,"authentication, rosa_architecture","authentication, rosa_architecture",2,2,0,0
52,What is the purpose of the port range used to control egress traffic from the master nodes?\n,rosa-docs-rosa_install_access_delete_clusters,"networking, rosa_architecture, rosa_planning, service_mesh, rosa-asea-landing-zone-index","networking, rosa_architecture, rosa_planning, service_mesh, rosa-asea-landing-zone-index",5,5,0,0


**Cases when there is no match in Ensemble retriever**

In [32]:
df[df.Match_in_source_ensemble_retriever==0].head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever,Number_of_sources_from_similarity_search,Number_of_sources_from_ensemble_retriever,Match_in_source_similarity_search,Match_in_source_ensemble_retriever
33,How do I create an AWS service account for clusters with AWS Security Token Service (STS) enabled?\n\n,rosa-docs-logging,"authentication, ecr-secret-operator-iam_assume_role, using-sts-with-aws-services-_index, aws-secrets-manager-csi-_index","authentication, ecr-secret-operator-iam_assume_role, using-sts-with-aws-services-_index, aws-secrets-manager-csi-_index",4,4,0,0
34,How do I create a role for AWS using the CredentialsRequest CR?\n,rosa-docs-logging,"rosa_architecture, aws-secrets-manager-csi-_index, rosa_planning, authentication, ecr-secret-operator-iam_assume_role","rosa_architecture, aws-secrets-manager-csi-_index, rosa_planning, authentication, ecr-secret-operator-iam_assume_role",5,5,0,0
45,What is the purpose of a security group?\n\n,rosa-docs-rosa_install_access_delete_clusters,"authentication, rosa_architecture","authentication, rosa_architecture",2,2,0,0
49,What is the purpose of the MasterSecurityGroup?\n,rosa-docs-rosa_install_access_delete_clusters,"authentication, rosa_architecture","authentication, rosa_architecture",2,2,0,0
52,What is the purpose of the port range used to control egress traffic from the master nodes?\n,rosa-docs-rosa_install_access_delete_clusters,"networking, rosa_architecture, rosa_planning, service_mesh, rosa-asea-landing-zone-index","networking, rosa_architecture, rosa_planning, service_mesh, rosa-asea-landing-zone-index",5,5,0,0


**Cases when there is no match from similarity search but there is a match from ensemble retriever**

In [33]:
df[(df.Match_in_source_similarity_search==0) & (df.Match_in_source_ensemble_retriever==1)].head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever,Number_of_sources_from_similarity_search,Number_of_sources_from_ensemble_retriever,Match_in_source_similarity_search,Match_in_source_ensemble_retriever


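There are no such cases. This is expected in this setup: the BM25 hits carry no source path (they appear as `None` and are dropped during pre-processing), so the only usable sources in the ensemble results come from the same Chroma top-5 that similarity search returns.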

**Cases when there is no match from ensemble retriever but there is a match from similarity search**

In [34]:
df[(df.Match_in_source_similarity_search==1) & (df.Match_in_source_ensemble_retriever==0)].head()

Unnamed: 0,Question,Source,source_similarity_search,source_ensemble_retriever,Number_of_sources_from_similarity_search,Number_of_sources_from_ensemble_retriever,Match_in_source_similarity_search,Match_in_source_ensemble_retriever
451,What is an Issuer URL?\n,rosa-docs-rosa_install_access_delete_clusters,"rosa_install_access_delete_clusters, rosa_planning",rosa_planning,2,1,1,0
765,What are the domains used for?\n,rosa-docs-rosa_install_access_delete_clusters,"rosa_install_access_delete_clusters, serverless, rosa_planning","serverless, rosa_planning",3,2,1,0
805,What is a domain address?\n,rosa-docs-rosa_install_access_delete_clusters,"rosa_install_access_delete_clusters, rosa_planning",rosa_planning,2,1,1,0
1130,What is the purpose of using wildcard entries in an allowlist?\n,rosa-docs-rosa_install_access_delete_clusters,"rosa_install_access_delete_clusters, service_mesh, networking, rosa_planning","networking, service_mesh, rosa_planning",4,3,1,0
1134,What is the purpose of allowing the Amazon Web Services (AWS) API URls?\n,rosa-docs-rosa_install_access_delete_clusters,"rosa_install_access_delete_clusters, service_mesh, rosa_planning","service_mesh, rosa_planning",3,2,1,0


In [35]:
df1 = df[(df.Match_in_source_similarity_search==1) & (df.Match_in_source_ensemble_retriever==0)]

In [38]:
df1.Question.nunique()

13

There are 13 unique questions for which similarity search finds a match but the ensemble retriever does not.
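As a quick sanity check, the printed percentages can be converted back into row counts. The sketch below uses only numbers printed in this notebook; the variable names are illustrative. Since no rows match in the ensemble alone, the gap between the two hit counts should equal the number of rows in `df1`.

```python
total_rows = 7560                   # len(df.Question)
pct_similarity = 69.97354497354497  # printed similarity-search match percentage
pct_ensemble = 69.78835978835978    # printed ensemble match percentage

# Convert each percentage back into a whole number of matching rows.
hits_similarity = round(pct_similarity / 100 * total_rows)
hits_ensemble = round(pct_ensemble / 100 * total_rows)
row_gap = hits_similarity - hits_ensemble

print(hits_similarity, hits_ensemble, row_gap)
```

This gives 5290 and 5276 matching rows, a gap of 14 rows. That is one more than the 13 unique questions reported by `nunique()`, which suggests that one of those questions appears in two rows.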

In [47]:
df.Number_of_sources_from_ensemble_retriever.mean()

2.515079365079365

In [48]:
df.Number_of_sources_from_similarity_search.mean()

2.534656084656085

# Conclusion

- **Total Unique Queries:**
The dataset contains 7560 question rows, of which 6212 are unique questions.

- **Similarity Search vs. Ensemble Retriever:**
    - Percentage Match from Similarity Search: Similarity search returned the expected source document for 69.97% of the 7560 rows.
    - Percentage Match from Ensemble Retriever: The ensemble retriever was marginally lower at 69.79% of the 7560 rows.


- **Cases with Match Discrepancy:**

    - **Cases with Match in Similarity Search but Not in Ensemble Retriever:** We identified 13 unique questions where there was a match in the similarity search but not in the ensemble retriever. There were no cases the other way around.

- **Average Number of Sources Retrieved:**

    - **Average Number of Sources from Ensemble Retriever:** After removing duplicates and `None` entries, the ensemble retriever returned results from 2.515 distinct source documents per query on average.
    - **Average Number of Sources from Similarity Search:** Similarity search returned results from 2.535 distinct source documents per query on average.


Both methods place the expected source document among the retrieved results for about 70% of the 7560 question rows (69.97% for similarity search, 69.79% for the ensemble retriever). That leaves roughly 30% of rows without a match, so there is clear room for improvement. Identifying what the unmatched queries have in common is the natural next step toward better retrieval.
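One simple way to start on that follow-up is to count misses per expected source document. The sketch below uses only the first five no-match rows shown earlier (indices 33, 34, 45, 49 and 52), so the counts are illustrative rather than a result. On the full DataFrame the same grouping would run over every row where the match flag is 0.

```python
from collections import Counter

# (expected source, match flag) pairs copied from the first five no-match rows.
rows = [
    ("rosa-docs-logging", 0),
    ("rosa-docs-logging", 0),
    ("rosa-docs-rosa_install_access_delete_clusters", 0),
    ("rosa-docs-rosa_install_access_delete_clusters", 0),
    ("rosa-docs-rosa_install_access_delete_clusters", 0),
]

# Count how many unmatched queries each expected source accounts for.
misses_by_source = Counter(source for source, matched in rows if matched == 0)
for source, count in misses_by_source.most_common():
    print(source, count)
```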