# Development of a RAG Model for RFP Question Similarity

## Notebook setup

In [1]:
import pandas as pd

In [2]:
%pip install -qU langchain langchain-openai langchain-cohere


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
import os

import dotenv

dotenv.load_dotenv()

if os.getenv("OPENAI_API_KEY") is None:
    raise Exception("OPENAI_API_KEY not found")

In [4]:
import textwrap
from IPython.display import HTML, display
from tabulate import tabulate


def _format_cell_text(text, width=50):
    """Private function to format a cell's text."""
    return "\n".join([textwrap.fill(line, width=width) for line in text.split("\n")])


def _format_dataframe_for_tabulate(df):
    """Private function to format the entire DataFrame for tabulation."""
    df_out = df.copy()

    # Format all string columns
    for column in df_out.columns:
        # Check if column is of type object (likely strings)
        if df_out[column].dtype == object:
            df_out[column] = df_out[column].apply(_format_cell_text)
    return df_out


def _dataframe_to_html_table(df):
    """Private function to convert a DataFrame to an HTML table."""
    headers = df.columns.tolist()
    table_data = df.values.tolist()
    return tabulate(table_data, headers=headers, tablefmt="html")


def display_nice(df, num_rows=None):
    """Primary function to format and display a DataFrame."""
    if num_rows is not None:
        df = df.head(num_rows)
    formatted_df = _format_dataframe_for_tabulate(df)
    html_table = _dataframe_to_html_table(formatted_df)
    display(HTML(html_table))

In [5]:
def print_dict_keys(data, indent=0):
    for key, value in data.items():
        print(' ' * indent + str(key))
        if isinstance(value, dict):  # if the value is another dictionary, recurse
            print_dict_keys(value, indent + 4)

## Data preparation

### Load Existing RFPs

In [6]:
# List of CSV file paths
existing_rfp_paths = [
    "datasets/rag/rfp_existing_questions_client_1.csv",
    "datasets/rag/rfp_existing_questions_client_2.csv",
    "datasets/rag/rfp_existing_questions_client_3.csv",
    "datasets/rag/rfp_existing_questions_client_4.csv",
    "datasets/rag/rfp_existing_questions_client_5.csv",
]

# Read each CSV into its own DataFrame
existing_rfp_frames = [pd.read_csv(file_path) for file_path in existing_rfp_paths]

# Concatenate all DataFrames into one
existing_rfp_df = pd.concat(existing_rfp_frames, ignore_index=True)

In [7]:
existing_rfp_df

Unnamed: 0,Project_Title,RFP_Question_ID,RFP_Question,RFP_Answer,Area,Last_Accessed_At,Requester,Status
0,Gen AI-Driven Financial Advisory System,1,What is your experience in developing AI-based...,Our company has 15 years of experience in deve...,General,18/12/2023,Bank A,Under Review
1,Gen AI-Driven Financial Advisory System,2,How do you ensure your AI-based apps remain up...,We maintain a dedicated R&D team focused on in...,General,18/12/2023,Bank A,Under Review
2,Gen AI-Driven Financial Advisory System,3,Can your AI-based applications be customized t...,"Absolutely, customization is a core aspect of ...",General,18/12/2023,Bank A,Under Review
3,Gen AI-Driven Financial Advisory System,4,What measures do you take to ensure user priva...,User privacy and data security are paramount. ...,General,18/12/2023,Bank A,Under Review
4,Gen AI-Driven Financial Advisory System,5,How do you approach user interface and experie...,Our design philosophy centers on simplicity an...,General,18/12/2023,Bank A,Under Review
...,...,...,...,...,...,...,...,...
110,Generative AI Solutions for Fraud Detection an...,19,What steps do you take to ensure the transpare...,We prioritize transparency by incorporating ex...,AI Regulation,18/10/2022,Bank E,Under Review
111,Generative AI Solutions for Fraud Detection an...,20,How do you monitor and assess AI risk exposure...,We have developed a set of Key Performance Ind...,AI Regulation,18/10/2022,Bank E,Under Review
112,Generative AI Solutions for Fraud Detection an...,21,How do you handle the management and mitigatio...,We implement and maintain robust risk manageme...,AI Regulation,18/10/2022,Bank E,Under Review
113,Generative AI Solutions for Fraud Detection an...,22,How do you ensure your AI solutions adhere to ...,We ensure compliance with U.S. regulations suc...,AI Regulation,18/10/2022,Bank E,Under Review


## Convert Questions and Answers to Embeddings 

In [8]:
# Add unique identifier to each row in the rfp df
existing_rfp_df["unique_id"] = existing_rfp_df.index.astype(str)
existing_rfp_df.head()

Unnamed: 0,Project_Title,RFP_Question_ID,RFP_Question,RFP_Answer,Area,Last_Accessed_At,Requester,Status,unique_id
0,Gen AI-Driven Financial Advisory System,1,What is your experience in developing AI-based...,Our company has 15 years of experience in deve...,General,18/12/2023,Bank A,Under Review,0
1,Gen AI-Driven Financial Advisory System,2,How do you ensure your AI-based apps remain up...,We maintain a dedicated R&D team focused on in...,General,18/12/2023,Bank A,Under Review,1
2,Gen AI-Driven Financial Advisory System,3,Can your AI-based applications be customized t...,"Absolutely, customization is a core aspect of ...",General,18/12/2023,Bank A,Under Review,2
3,Gen AI-Driven Financial Advisory System,4,What measures do you take to ensure user priva...,User privacy and data security are paramount. ...,General,18/12/2023,Bank A,Under Review,3
4,Gen AI-Driven Financial Advisory System,5,How do you approach user interface and experie...,Our design philosophy centers on simplicity an...,General,18/12/2023,Bank A,Under Review,4


In [9]:
# now we can import and use the openai client
from openai import OpenAI

client = OpenAI()


def get_embedding_OpenAI(text):
    """Returns a text embedding for the given text"""
    return client.embeddings.create(
        input=text,
        model="text-embedding-3-small",
    ).data[0].embedding

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-small")

# Embed a single text item and return the embedding
def get_embedding_LC(text):
    return embeddings_model.embed_query(text) 

In [11]:
run = True
if run:
    # Apply the function to each question and answer and create new columns
    existing_rfp_df['Question_Embeddings_LC'] = existing_rfp_df['RFP_Question'].apply(get_embedding_LC)
    existing_rfp_df['Answer_Embeddings_LC'] = existing_rfp_df['RFP_Answer'].apply(get_embedding_LC)
    #existing_rfp_df['Question_Embeddings_OpenAI'] = existing_rfp_df['RFP_Question'].apply(get_embedding_OpenAI)
    #existing_rfp_df['Answer_Embeddings_OpenAI'] = existing_rfp_df['RFP_Answer'].apply(get_embedding_OpenAI)

In [12]:
existing_rfp_df.head()

Unnamed: 0,Project_Title,RFP_Question_ID,RFP_Question,RFP_Answer,Area,Last_Accessed_At,Requester,Status,unique_id,Question_Embeddings_LC,Answer_Embeddings_LC
0,Gen AI-Driven Financial Advisory System,1,What is your experience in developing AI-based...,Our company has 15 years of experience in deve...,General,18/12/2023,Bank A,Under Review,0,"[0.009233377394315142, -0.030979380265132028, ...","[-0.014074281457419971, -0.019768934230661707,..."
1,Gen AI-Driven Financial Advisory System,2,How do you ensure your AI-based apps remain up...,We maintain a dedicated R&D team focused on in...,General,18/12/2023,Bank A,Under Review,1,"[-0.015557997606513376, 0.001083239429675231, ...","[0.010924095968986537, 0.0148562794873187, 0.0..."
2,Gen AI-Driven Financial Advisory System,3,Can your AI-based applications be customized t...,"Absolutely, customization is a core aspect of ...",General,18/12/2023,Bank A,Under Review,2,"[-0.012713182022339392, 0.0019399791850565832,...","[0.0012248214130063306, 0.009834839798469275, ..."
3,Gen AI-Driven Financial Advisory System,4,What measures do you take to ensure user priva...,User privacy and data security are paramount. ...,General,18/12/2023,Bank A,Under Review,3,"[-0.00713508370007173, -0.016262165748897776, ...","[0.03035999154741477, 0.02386305199822589, 0.0..."
4,Gen AI-Driven Financial Advisory System,5,How do you approach user interface and experie...,Our design philosophy centers on simplicity an...,General,18/12/2023,Bank A,Under Review,4,"[-0.01446926920262831, 0.007049784771264524, 0...","[0.008875330851434665, 0.0019440388876996923, ..."


## Set up the vector store

In [13]:
import chromadb
from langchain.vectorstores.chroma import Chroma

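persistent_client = chromadb.PersistentClient()
# The distance metric is a collection-level setting (Chroma defaults to squared L2).
# It only takes effect when the collection is first created.
collection = persistent_client.get_or_create_collection(
    name="rfp_qa_collection",
    metadata={"hnsw:space": "cosine"},
)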


In [14]:
# Initialize lists to store data for batch addition
all_embeddings = []
all_metadatas = []
all_documents = []
all_ids = []

# Loop through the DataFrame rows
for index, row in existing_rfp_df.iterrows():
    # Append each piece of data to its respective list
    all_embeddings.append(row['Question_Embeddings_LC'])
    all_metadatas.append({
        'Project_Title': row['Project_Title'],
        'RFP_Question_ID': row['RFP_Question_ID'],
        'RFP_Question': row['RFP_Question'],
        'RFP_Answer': row['RFP_Answer'],
        'Area': row['Area'],
        'Last_Accessed_At': row['Last_Accessed_At'],
        'Requester': row['Requester'],
        'Status': row['Status'],
        'hnsw:space': 'cosine',  # display label only; per-document metadata does not configure the index
    })
    all_documents.append(row['RFP_Question'])
    all_ids.append(row['unique_id'])

# Add all data to the collection in a single operation
collection.add(
    ids=all_ids, 
    documents=all_documents,
    embeddings=all_embeddings,
    metadatas=all_metadatas,
)


In [15]:
langchain_chroma = Chroma(
    client=persistent_client,
    collection_name="rfp_qa_collection",
)

print("There are", langchain_chroma._collection.count(), "documents in the collection")

There are 115 documents in the collection


In [16]:
query = existing_rfp_df['RFP_Question'][0]
documents = langchain_chroma.similarity_search_by_vector_with_relevance_scores(
    get_embedding_LC(query), k=10)


In [17]:
number_of_documents = 10

print(f"New RFP Question:\n{query}")
print()
print(f"Top {number_of_documents} most similar existing RFP questions:")
print()

context = ""

for i, document in enumerate(documents[:number_of_documents]):
    page_content = document[0].page_content  # Text of the stored document (the RFP question)
    metadata = document[0].metadata  # Metadata stored alongside the document
    score = document[1]  # Distance returned by the vector store (lower means more similar)

    # Extracting the metadata
    rfp = metadata['Project_Title']
    question = metadata['RFP_Question']
    answer = metadata['RFP_Answer']
    metric = metadata['hnsw:space']

    context += f"Question: {question}\nAnswer: {answer}\n"

    # Print formatted output (1 - distance is a cosine similarity only if the collection uses cosine distance)
    print(f"Document {i + 1}")
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print(f"Score: {1-score} ({metric})\n")


New RFP Question:
What is your experience in developing AI-based applications, and can you provide examples of successful projects?

Top 10 most similar existing RFP questions:

Document 1
Question: What is your experience in developing AI-based applications, and can you provide examples of successful projects?
Answer: Our company has 15 years of experience in developing AI-based applications, with a strong portfolio in sectors such as healthcare, finance, and education. For instance, our project MediAI Insight for the healthcare industry demonstrated significant achievements in patient data analysis, resulting in a 30% reduction in diagnostic errors and a 40% improvement in treatment personalization. Our platform has engaged over 200 healthcare facilities, achieving a user satisfaction rate of 95%.
Score: 1.0 (cosine)

Document 2
Question: Please share your experience with developing AI-enabled applications and provide examples of notable projects.
Answer: Our company has 15 years of 
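
The printed score is computed as `1 - score`, which reads as a similarity only when the value returned by the vector store is a cosine distance. As a minimal, dependency-free sketch of that relationship (the helper names and toy vectors are illustrative, not part of the notebook): cosine distance is one minus cosine similarity, while for unit-length vectors the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so the two metrics are not interchangeable in that formula.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def squared_l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy unit-length vectors (hypothetical, chosen so the arithmetic is easy to check).
u = [1.0, 0.0]
v = [0.6, 0.8]

sim = cosine_similarity(u, v)
print("cosine similarity:   ", sim)
print("1 - cosine distance: ", 1.0 - cosine_distance(u, v))
print("1 - squared L2:      ", 1.0 - squared_l2_distance(u, v))
```

The first two printed values agree, while the third differs, which is why it matters which distance the collection was created with.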

In [18]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based only on the following context. 
If you cannot answer the question with the context, please respond with 'I don't know':

### CONTEXT
{context}

### QUESTION
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

In [20]:
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

#context_runnable = ContextRunnable(documents)
llm = ChatOpenAI(model="gpt-3.5-turbo-16k")

In [21]:
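from operator import itemgetter

# Route each key of the input dict to the matching prompt variable.
# (RunnablePassthrough() would hand the whole input dict to both variables.)
rag_chain = (
    {"context": itemgetter("context"), "question": itemgetter("question")}
    | prompt
    | llm
)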
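
When a dict is piped into a prompt in LCEL, each value in the dict is run on the same input and the results are collected under the corresponding key. The plain-Python sketch below imitates that behaviour without LangChain to show the difference between passing the whole input through and selecting one key from it. `run_mapping` and `passthrough` are hypothetical stand-ins, not LangChain APIs, and the sample input is made up.

```python
from operator import itemgetter
from typing import Any, Callable, Dict


def run_mapping(mapping: Dict[str, Callable[[Any], Any]], value: Any) -> Dict[str, Any]:
    """Apply every callable in `mapping` to the same input value."""
    return {key: fn(value) for key, fn in mapping.items()}


def passthrough(value: Any) -> Any:
    """Return the input unchanged."""
    return value


# Hypothetical chain input with the two keys the prompt expects.
chain_input = {
    "question": "What is your experience?",
    "context": "Question: sample question\nAnswer: sample answer\n",
}

# Every key receives the whole input dict.
whole_input = run_mapping(
    {"context": passthrough, "question": passthrough}, chain_input
)

# Every key receives only its own entry from the input dict.
selected = run_mapping(
    {"context": itemgetter("context"), "question": itemgetter("question")},
    chain_input,
)

print(type(whole_input["question"]).__name__)  # type of what the passthrough variant yields
print(type(selected["question"]).__name__)     # type of what the itemgetter variant yields
```

In the passthrough variant both prompt variables would be filled with the string form of the full dict, so the question text would also carry the context and vice versa.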

In [23]:
question = existing_rfp_df['RFP_Question'][0]

# Generate an answer using the RAG chain
generated_answer = rag_chain.invoke(
    {"question" : question,
     "context": context}
)


In [27]:
print(question)

What is your experience in developing AI-based applications, and can you provide examples of successful projects?


In [26]:
print(context)

Question: What is your experience in developing AI-based applications, and can you provide examples of successful projects?
Answer: Our company has 15 years of experience in developing AI-based applications, with a strong portfolio in sectors such as healthcare, finance, and education. For instance, our project MediAI Insight for the healthcare industry demonstrated significant achievements in patient data analysis, resulting in a 30% reduction in diagnostic errors and a 40% improvement in treatment personalization. Our platform has engaged over 200 healthcare facilities, achieving a user satisfaction rate of 95%.
Question: Please share your experience with developing AI-enabled applications and provide examples of notable projects.
Answer: Our company has 15 years of experience in developing AI-based applications, with a strong portfolio in sectors such as healthcare, finance, and education. For instance, our project MediAI Insight for the healthcare industry demonstrated significant 
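
The printed context appears to repeat the same answer under differently worded questions. If that repetition is unwanted, one option is to keep only the first occurrence of each answer when building the context string. The sketch below is one possible approach under that assumption. `build_context` and the shortened sample pairs are hypothetical and not part of the notebook.

```python
from typing import Iterable, List, Tuple


def build_context(pairs: Iterable[Tuple[str, str]]) -> str:
    """Join question/answer pairs, skipping answers that were already included."""
    seen = set()
    blocks: List[str] = []
    for question, answer in pairs:
        # Normalise whitespace and case so trivially different copies match.
        key = " ".join(answer.split()).lower()
        if key in seen:
            continue
        seen.add(key)
        blocks.append(f"Question: {question}\nAnswer: {answer}\n")
    return "".join(blocks)


# Hypothetical, shortened question/answer pairs in retrieval order.
pairs = [
    ("What is your experience in developing AI-based applications?",
     "Our company has 15 years of experience."),
    ("Please share your experience with AI-enabled applications.",
     "Our company has 15 years of experience."),
    ("How do you ensure user privacy?",
     "User privacy and data security are paramount."),
]

deduplicated_context = build_context(pairs)
print(deduplicated_context)
```

Because the pairs arrive in similarity order, keeping the first occurrence retains the question closest to the query for each distinct answer.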

In [28]:
print(generated_answer.content)

Our company has 15 years of experience in developing AI-based applications, with a strong portfolio in sectors such as healthcare, finance, and education. For instance, our project MediAI Insight for the healthcare industry demonstrated significant achievements in patient data analysis, resulting in a 30% reduction in diagnostic errors and a 40% improvement in treatment personalization. Our platform has engaged over 200 healthcare facilities, achieving a user satisfaction rate of 95%.
