## BUILDING HELPMATE AI USING RAG FOR LIFE INSURANCE POLICY

Steps involved in the RAG model are as follows:
1. Building the vector store
2. Searching the vector store (cache first, then the main collection), retrieving the top 10 results, and re-ranking them
3. Generative search

#### STEP 1: Building the vector store

Importing the libraries

In [2]:
import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import openai
import chromadb

We use the pdfplumber library to open the PDF and read the text on each page.

In [3]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences
pdf_path = r'D:\Courses\AI_ML_UPGRAD\Materials\Course-6 GenAI\Module 9- Vector Database\Principal-Sample-Life-Insurance-Policy.pdf'

Function to check whether a word lies inside a table's bounding box, so that regular text and tables can be separated.

In [4]:
def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]
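
To see how this containment test behaves, here is a minimal, self-contained sketch. The coordinates are made-up values, and `is_inside` is a hypothetical stand-in that mirrors the comparison in `check_bboxes`; it assumes bounding boxes in the `(x0, top, x1, bottom)` order used above.

```python
def is_inside(word, table_bbox):
    """Return True when the word's box lies strictly inside the table's box."""
    x0, top, x1, bottom = table_bbox
    return (word["x0"] > x0 and word["top"] > top
            and word["x1"] < x1 and word["bottom"] < bottom)


# Hypothetical coordinates in PDF points: (x0, top, x1, bottom)
table_bbox = (50.0, 100.0, 400.0, 300.0)
word_in_table = {"text": "Benefit", "x0": 60.0, "top": 120.0, "x1": 110.0, "bottom": 132.0}
word_outside = {"text": "Section", "x0": 60.0, "top": 40.0, "x1": 110.0, "bottom": 52.0}

inside_flags = [is_inside(w, table_bbox) for w in (word_in_table, word_outside)]
print(inside_flags)
```

Because the comparison is strict, a word whose box touches the table border exactly is treated as regular text.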

Function to extract text from a PDF file.

In [5]:

def extract_text_from_pdf(pdf_path):
    """Return a list of [page label, page text] pairs, with tables serialized as JSON."""
    full_text = []

    with pdfplumber.open(pdf_path) as pdf:
        for page_index, page in enumerate(pdf.pages, start=1):
            page_no = f"Page {page_index}"

            # Locate the tables and keep each table's rows with its vertical position.
            tables = page.find_tables()
            table_bboxes = [table.bbox for table in tables]
            tables = [{'table': table.extract(), 'top': table.bbox[1]} for table in tables]

            # Keep only the words that fall outside every table bounding box.
            non_table_words = [
                word for word in page.extract_words()
                if not any(check_bboxes(word, bbox) for bbox in table_bboxes)
            ]

            lines = []
            # Cluster words and tables by vertical position so they stay in reading order.
            for cluster in pdfplumber.utils.cluster_objects(
                    non_table_words + tables, itemgetter('top'), tolerance=5):
                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join(item['text'] for item in cluster))
                    except KeyError:
                        # A cluster that mixes words with a table entry is skipped.
                        pass
                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))

            full_text.append([page_no, " ".join(lines)])

    return full_text

In [6]:
extracted_text = extract_text_from_pdf(pdf_path)

In [7]:
extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])

In [8]:
extracted_text_df

| | Page No. | Page_Text |
|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... |
| 1 | Page 2 | This page left blank intentionally |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... |
| 3 | Page 4 | This page left blank intentionally |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... |
| ... | ... | ... |
| 59 | Page 60 | I f a Dependent who was insured dies during th... |
| 60 | Page 61 | Section D - Claim Procedures Article 1 - Notic... |
| 61 | Page 62 | A claimant may request an appeal of a claim de... |
| 62 | Page 63 | This page left blank intentionally |


In [9]:
extracted_text_df.Page_Text[4]

"PRINCIPAL LIFE INSURANCE COMPANY (called The Principal in this Group Policy) Des Moines, Iowa 50392-0002 This group insurance policy is issued to: RHODE ISLAND JOHN DOE (called the Policyholder in this Group Policy) The Date of Issue is November 1, 2007. In return for the Policyholder's application and payment of all premiums when due, The Principal agrees to provide: MEMBER LIFE INSURANCE MEMBER ACCIDENTAL DEATH AND DISMEMBERMENT INSURANCE DEPENDENT LIFE INSURANCE subject to the terms and conditions described in this Group Policy. GROUP POLICY NO. GL S655 RENEWABLE TERM - NON-PARTICIPATING CONTRACT STATE OF ISSUE: RHODE ISLAND This policy has been updated effective January 1, 2014 GC 6000 TITLE PAGE"

Let's check the number of words on each page of the document.

In [10]:

extracted_text_df['Text_Length'] = extracted_text_df['Page_Text'].apply(lambda x: len(x.split(' ')))
extracted_text_df['Text_Length']

0      30
1       5
2     230
3       5
4     110
     ... 
59    285
60    418
61    322
62      5
63      8
Name: Text_Length, Length: 64, dtype: int64
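
A small aside on the counting method, as a sketch with a made-up sample string: `split(' ')` counts an empty string for every extra space, while `split()` collapses runs of whitespace. In this notebook the page text is assembled with single-space joins, so the two should largely agree, but the difference matters for text that contains repeated spaces.

```python
sample = "GROUP  POLICY NO.  S655"  # note the two double spaces

count_single_space = len(sample.split(' '))  # empty strings between double spaces are counted
count_whitespace = len(sample.split())       # runs of whitespace are treated as one separator

print(count_single_space, count_whitespace)  # prints: 6 4
```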

In [11]:
# Retain only the pages that contain at least 10 words

extracted_text_df = extracted_text_df.loc[extracted_text_df['Text_Length'] >= 10]
extracted_text_df

| | Page No. | Page_Text | Text_Length |
|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | 30 |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | 230 |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | 110 |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | 153 |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | 176 |
| 7 | Page 8 | Section A - Member Life Insurance Schedule of ... | 171 |
| 8 | Page 9 | P ART I - DEFINITIONS When used in this Group ... | 387 |
| 9 | Page 10 | T he legally recognized union of two eligible ... | 251 |
| 10 | Page 11 | (2) has been placed with the Member or spouse ... | 299 |
| 11 | Page 12 | An institution that is licensed as a Hospital ... | 352 |


In [90]:
extracted_text_df.info()


<class 'pandas.core.frame.DataFrame'>
Index: 60 entries, 0 to 61
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Page No.     60 non-null     object
 1   Page_Text    60 non-null     object
 2   Text_Length  60 non-null     int64 
 3   Metadata     60 non-null     object
dtypes: int64(1), object(3)
memory usage: 2.3+ KB


The filter removed 4 pages that had fewer than 10 words, leaving 60 pages.

In [12]:
max_value = extracted_text_df['Text_Length'].max()
print(f"The maximum value in the 'Text_Length' column is: {max_value}")

The maximum value in the 'Text_Length' column is: 462


The longest page has only 462 words, so every page is small enough to embed as a single chunk. Page-wise chunking is therefore a reasonable choice here, and it keeps each embedding tied to one page number.
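
For documents with longer pages, the same idea can be extended with a fallback split. The following is a minimal sketch, not the method used in this notebook: `chunk_pages`, the 500-word limit, and the 50-word overlap are all assumptions chosen for illustration. Pages at or under the limit pass through whole, so with a 462-word maximum every page of this policy would stay a single chunk.

```python
def chunk_pages(pages, max_words=500, overlap=50):
    """Yield (page_label, chunk_text); pages at or under max_words stay whole."""
    step = max_words - overlap
    for label, text in pages:
        words = text.split()
        if len(words) <= max_words:
            yield label, text
            continue
        # Slide a window over long pages so neighbouring chunks share some context.
        for start in range(0, len(words), step):
            yield label, " ".join(words[start:start + max_words])
            if start + max_words >= len(words):
                break


# Hypothetical pages: one short (462 words) and one long (1200 words).
demo_pages = [
    ("Page 1", " ".join(["word"] * 462)),
    ("Page 2", " ".join(["word"] * 1200)),
]
demo_chunks = list(chunk_pages(demo_pages))
print([(label, len(text.split())) for label, text in demo_chunks])
```

Each chunk keeps its page label, so page-level citations still work after splitting.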


In [13]:
# Add the page number as metadata. Taking an explicit copy first avoids the
# SettingWithCopyWarning that pandas raises when assigning to a filtered slice.
extracted_text_df = extracted_text_df.copy()
extracted_text_df['Metadata'] = extracted_text_df.apply(lambda x: {'Page_No.': x['Page No.']}, axis=1)


In [14]:
extracted_text_df

| | Page No. | Page_Text | Text_Length | Metadata |
|---|---|---|---|---|
| 0 | Page 1 | DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/... | 30 | {'Page_No.': 'Page 1'} |
| 2 | Page 3 | POLICY RIDER GROUP INSURANCE POLICY NO: S655 C... | 230 | {'Page_No.': 'Page 3'} |
| 4 | Page 5 | PRINCIPAL LIFE INSURANCE COMPANY (called The P... | 110 | {'Page_No.': 'Page 5'} |
| 5 | Page 6 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | 153 | {'Page_No.': 'Page 6'} |
| 6 | Page 7 | Section A – Eligibility Member Life Insurance ... | 176 | {'Page_No.': 'Page 7'} |
| 7 | Page 8 | Section A - Member Life Insurance Schedule of ... | 171 | {'Page_No.': 'Page 8'} |
| 8 | Page 9 | P ART I - DEFINITIONS When used in this Group ... | 387 | {'Page_No.': 'Page 9'} |
| 9 | Page 10 | T he legally recognized union of two eligible ... | 251 | {'Page_No.': 'Page 10'} |
| 10 | Page 11 | (2) has been placed with the Member or spouse ... | 299 | {'Page_No.': 'Page 11'} |
| 11 | Page 12 | An institution that is licensed as a Hospital ... | 352 | {'Page_No.': 'Page 12'} |


### EMBEDDINGS

Create embeddings for the page texts with the OpenAIEmbeddingFunction and store them in a ChromaDB collection.

In [15]:
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
import chromadb

In [16]:
filepath = r'D:\Courses\AI_ML_UPGRAD\Materials\Course-6 GenAI\Module 9- Vector Database\OPENAI_API_Key.txt'
with open(filepath, 'r') as f:
    # Strip surrounding whitespace so a trailing newline does not end up in the key
    openai.api_key = f.read().strip()
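
As an optional variation, the key can be looked up in an environment variable first, with the text file as a fallback. This is only a sketch: `load_api_key` is a hypothetical helper, and the demonstration uses a throwaway file, a placeholder key, and a made-up variable name that is assumed to be unset.

```python
import os
import tempfile
from pathlib import Path


def load_api_key(key_file, env_var="OPENAI_API_KEY"):
    """Prefer the environment variable; otherwise read and strip the key file."""
    key = os.environ.get(env_var, "").strip()
    if key:
        return key
    return Path(key_file).read_text(encoding="utf-8").strip()


# Demonstration with a temporary file and a placeholder value (not a real credential).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "demo_key.txt"
    demo_file.write_text("sk-placeholder\n", encoding="utf-8")
    loaded = load_api_key(demo_file, env_var="HELPMATE_DEMO_KEY_UNSET")

print(repr(loaded))
```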

In [17]:
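# Raw string keeps the Windows backslashes literal. Note that this variable holds the PDF path
# and is not passed to PersistentClient in the next cell, so Chroma uses its default storage location.
chroma_data_path = r'D:\Courses\AI_ML_UPGRAD\Materials\Course-6 GenAI\Module 9- Vector Database\Principal-Sample-Life-Insurance-Policy.pdf'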

In [18]:
client = chromadb.PersistentClient()

In [19]:
model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name=model)

In [43]:
life_policy_collection = client.get_or_create_collection(name='RAG_on_policy_latest', embedding_function=embedding_function)

In [44]:
documents_list = extracted_text_df["Page_Text"].tolist()
metadata_list=extracted_text_df["Metadata"].tolist()

In [47]:
extracted_text_df["Metadata"]

0      {'Page_No.': 'Page 1'}
2      {'Page_No.': 'Page 3'}
4      {'Page_No.': 'Page 5'}
5      {'Page_No.': 'Page 6'}
6      {'Page_No.': 'Page 7'}
7      {'Page_No.': 'Page 8'}
8      {'Page_No.': 'Page 9'}
9     {'Page_No.': 'Page 10'}
10    {'Page_No.': 'Page 11'}
11    {'Page_No.': 'Page 12'}
12    {'Page_No.': 'Page 13'}
13    {'Page_No.': 'Page 14'}
14    {'Page_No.': 'Page 15'}
15    {'Page_No.': 'Page 16'}
16    {'Page_No.': 'Page 17'}
17    {'Page_No.': 'Page 18'}
18    {'Page_No.': 'Page 19'}
19    {'Page_No.': 'Page 20'}
20    {'Page_No.': 'Page 21'}
21    {'Page_No.': 'Page 22'}
22    {'Page_No.': 'Page 23'}
23    {'Page_No.': 'Page 24'}
24    {'Page_No.': 'Page 25'}
25    {'Page_No.': 'Page 26'}
26    {'Page_No.': 'Page 27'}
27    {'Page_No.': 'Page 28'}
28    {'Page_No.': 'Page 29'}
29    {'Page_No.': 'Page 30'}
30    {'Page_No.': 'Page 31'}
31    {'Page_No.': 'Page 32'}
32    {'Page_No.': 'Page 33'}
33    {'Page_No.': 'Page 34'}
34    {'Page_No.': 'Page 35'}
35    {'Pa

In [48]:
life_policy_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embe

Check the embeddings stored in ChromaDB for the first three documents.

In [None]:

life_policy_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-2.24228799e-02,  1.87183432e-02, -2.72361692e-02, ...,
         -3.69149223e-02,  2.83710100e-03, -1.30930578e-03],
        [-1.32057490e-02,  8.82212631e-03, -4.67860838e-03, ...,
         -1.56548154e-02, -4.84764605e-05,  7.25115696e-03],
        [-1.20378779e-02,  1.40740369e-02, -3.30295507e-03, ...,
         -2.85194907e-02, -9.43796150e-03,  1.02139572e-02]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Principal a Financial Services Hotline and Gri

#### STEP 2: Searching the vector store (cache first, then the main collection), retrieving the top 10 results, and re-ranking them

Add a cache collection so that a repeated or near-identical query can reuse earlier results instead of searching the main collection again.

In [70]:
cache_collection = client.get_or_create_collection(name='Policy_Cache_latest', embedding_function=embedding_function)

In [71]:
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.embeddings: 'embeddings'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [72]:
query = input()
print(query)

what must the format of the certificates?


In [73]:
cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [74]:
cache_results

{'ids': [[]],
 'embeddings': None,
 'documents': [[]],
 'uris': None,
 'data': None,
 'metadatas': [[]],
 'distances': [[]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [75]:
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []

results_df = pd.DataFrame()

# Check if cache is empty or the closest cached distance exceeds the threshold
if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
    results = life_policy_collection.query(
        query_texts=query,
        n_results=10
    )

    keys_list = []
    values_list = []

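    # Flatten only the result fields we need; Chroma metadata values must be scalars,
    # so each value is stored as a string under a key such as 'documents0'.
    for key in ('ids', 'documents', 'distances', 'metadatas'):
        value = results.get(key)
        if value is None or not value[0]:  # Skip fields that are missing or empty
            continue
        for i in range(min(10, len(value[0]))):  # Ensure we do not exceed the length of the list
            keys_list.append(f"{key}{i}")
            values_list.append(str(value[0][i]))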

    # Add results to the cache
    cache_collection.add(
        documents=[query],
        ids=[query],
        metadatas=dict(zip(keys_list, values_list))
    )

    print("Results are not found in the cache db, it is found in the main db")

    # Create a DataFrame from the main collection results
    results_dict = {
        'Metadatas': results['metadatas'][0],
        'Distances': results['distances'][0],
        'IDs': results['ids'][0],
        'Documents': results['documents'][0]
    }
    results_df = pd.DataFrame.from_dict(results_dict)

else:
    # Retrieve results from the cache
    cache_results_dict = cache_results['metadatas'][0][0]

    for key, val in cache_results_dict.items():
        if 'ids' in key:
            ids.append(val)
        elif 'documents' in key:
            documents.append(val)
        elif 'distances' in key:
            distances.append(val)
        elif 'metadatas' in key:
            metadatas.append(val)
    
    print("Found in cache")

    # Create a DataFrame from the cached results
    results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
    })

# Print the final results DataFrame (optional)
print(results_df)


Results are not found in the cache db, it is found in the main db
                 Metadatas  Distances IDs  \
0  {'Page_No.': 'Page 18'}   0.463001  15   
1  {'Page_No.': 'Page 16'}   0.479960  13   
2   {'Page_No.': 'Page 6'}   0.509958   3   
3  {'Page_No.': 'Page 37'}   0.510362  34   
4  {'Page_No.': 'Page 14'}   0.513758  11   
5  {'Page_No.': 'Page 15'}   0.517100  12   
6  {'Page_No.': 'Page 17'}   0.520049  14   
7  {'Page_No.': 'Page 19'}   0.524604  16   
8  {'Page_No.': 'Page 13'}   0.526859  10   
9  {'Page_No.': 'Page 25'}   0.539437  22   

                                           Documents  
0  c . a copy of the form which contains the stat...  
1  PART II - POLICY ADMINISTRATION Section A - Co...  
2  TABLE OF CONTENTS PART I - DEFINITIONS PART II...  
3  b. a business assignment; or c. full-time stud...  
4  c . end stage renal failure; or d. acquired im...  
5  A record which is on or transmitted by paper o...  
6  a. be actively engaged in business for profit ... 

In [76]:
results_df

| | Metadatas | Distances | IDs | Documents |
|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 18'} | 0.463001 | 15 | c . a copy of the form which contains the stat... |
| 1 | {'Page_No.': 'Page 16'} | 0.47996 | 13 | PART II - POLICY ADMINISTRATION Section A - Co... |
| 2 | {'Page_No.': 'Page 6'} | 0.509958 | 3 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... |
| 3 | {'Page_No.': 'Page 37'} | 0.510362 | 34 | b. a business assignment; or c. full-time stud... |
| 4 | {'Page_No.': 'Page 14'} | 0.513758 | 11 | c . end stage renal failure; or d. acquired im... |
| 5 | {'Page_No.': 'Page 15'} | 0.5171 | 12 | A record which is on or transmitted by paper o... |
| 6 | {'Page_No.': 'Page 17'} | 0.520049 | 14 | a. be actively engaged in business for profit ... |
| 7 | {'Page_No.': 'Page 19'} | 0.524604 | 16 | T he Principal has complete discretion to cons... |
| 8 | {'Page_No.': 'Page 13'} | 0.526859 | 10 | a . A licensed Doctor of Medicine (M.D.) or Os... |
| 9 | {'Page_No.': 'Page 25'} | 0.539437 | 22 | Section D - Policy Renewal Article 1 - Renewal... |
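
The cache step stores each retrieved field under flattened keys such as `documents0` or `ids1`, and a cache hit rebuilds the lists from those keys. The sketch below illustrates that round trip in plain Python with placeholder data. It swaps in `json.dumps`/`json.loads` instead of the notebook's `str()` so that distances and metadata come back with their original types; `flatten_results`, `restore_results`, and `is_cache_hit` are hypothetical helper names, not part of ChromaDB.

```python
import json

RESULT_KEYS = ("ids", "documents", "distances", "metadatas")


def flatten_results(results, top_n=10):
    """Turn a query-style result dict into flat string-valued metadata."""
    flat = {}
    for key in RESULT_KEYS:
        for i, value in enumerate(results[key][0][:top_n]):
            flat[f"{key}{i}"] = json.dumps(value)
    return flat


def restore_results(flat):
    """Rebuild the per-field lists from flattened metadata."""
    restored = {key: [] for key in RESULT_KEYS}
    for key in RESULT_KEYS:
        i = 0
        while f"{key}{i}" in flat:
            restored[key].append(json.loads(flat[f"{key}{i}"]))
            i += 1
    return restored


def is_cache_hit(cache_results, threshold=0.2):
    """A hit needs at least one cached query within the distance threshold."""
    distances = cache_results["distances"][0]
    return bool(distances) and distances[0] <= threshold


# Placeholder result in the nested-list shape returned by a single-query search.
demo_results = {
    "ids": [["15", "13"]],
    "documents": [["placeholder text A", "placeholder text B"]],
    "distances": [[0.463001, 0.47996]],
    "metadatas": [[{"Page_No.": "Page 18"}, {"Page_No.": "Page 16"}]],
}
restored = restore_results(flatten_results(demo_results))
print(restored["distances"], restored["metadatas"][0])
print(is_cache_hit({"distances": [[]]}), is_cache_hit({"distances": [[0.05]]}))
```

With typed values restored, the cached rows can be sorted or filtered by distance in the same way as fresh results.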


Let's re-rank the 10 retrieved results with a cross-encoder and keep the top 3.

In [77]:
from sentence_transformers import CrossEncoder, util

In [78]:
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [79]:
cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [80]:
results_df['Reranked_scores'] = cross_rerank_scores

In [81]:
results_df

| | Metadatas | Distances | IDs | Documents | Reranked_scores |
|---|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 18'} | 0.463001 | 15 | c . a copy of the form which contains the stat... | -6.53682 |
| 1 | {'Page_No.': 'Page 16'} | 0.47996 | 13 | PART II - POLICY ADMINISTRATION Section A - Co... | -8.311792 |
| 2 | {'Page_No.': 'Page 6'} | 0.509958 | 3 | TABLE OF CONTENTS PART I - DEFINITIONS PART II... | -10.444969 |
| 3 | {'Page_No.': 'Page 37'} | 0.510362 | 34 | b. a business assignment; or c. full-time stud... | -10.949223 |
| 4 | {'Page_No.': 'Page 14'} | 0.513758 | 11 | c . end stage renal failure; or d. acquired im... | -9.951252 |
| 5 | {'Page_No.': 'Page 15'} | 0.5171 | 12 | A record which is on or transmitted by paper o... | -10.730024 |
| 6 | {'Page_No.': 'Page 17'} | 0.520049 | 14 | a. be actively engaged in business for profit ... | -9.575331 |
| 7 | {'Page_No.': 'Page 19'} | 0.524604 | 16 | T he Principal has complete discretion to cons... | -11.217015 |
| 8 | {'Page_No.': 'Page 13'} | 0.526859 | 10 | a . A licensed Doctor of Medicine (M.D.) or Os... | -10.456139 |
| 9 | {'Page_No.': 'Page 25'} | 0.539437 | 22 | Section D - Policy Renewal Article 1 - Renewal... | -11.336096 |


In [82]:
top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

| | Metadatas | Distances | IDs | Documents | Reranked_scores |
|---|---|---|---|---|---|
| 0 | {'Page_No.': 'Page 18'} | 0.463001 | 15 | c . a copy of the form which contains the stat... | -6.53682 |
| 1 | {'Page_No.': 'Page 16'} | 0.47996 | 13 | PART II - POLICY ADMINISTRATION Section A - Co... | -8.311792 |
| 6 | {'Page_No.': 'Page 17'} | 0.520049 | 14 | a. be actively engaged in business for profit ... | -9.575331 |


In [83]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]
top_3_RAG

| | Documents | Metadatas |
|---|---|---|
| 0 | c . a copy of the form which contains the stat... | {'Page_No.': 'Page 18'} |
| 1 | PART II - POLICY ADMINISTRATION Section A - Co... | {'Page_No.': 'Page 16'} |
| 6 | a. be actively engaged in business for profit ... | {'Page_No.': 'Page 17'} |
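
The selection step itself is just a descending sort on the cross-encoder score, where a higher (less negative) score indicates a more relevant passage. A minimal plain-Python sketch, using the page labels and scores shown in the output above; `top_k_by_score` is a hypothetical helper name:

```python
def top_k_by_score(candidates, scores, k=3):
    """Pair each candidate with its score and keep the k highest-scoring pairs."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


pages = ["Page 18", "Page 16", "Page 6", "Page 37", "Page 14",
         "Page 15", "Page 17", "Page 19", "Page 13", "Page 25"]
scores = [-6.53682, -8.311792, -10.444969, -10.949223, -9.951252,
          -10.730024, -9.575331, -11.217015, -10.456139, -11.336096]

top_3 = top_k_by_score(pages, scores)
print([page for page, _ in top_3])
```

Note that Page 17 moves from seventh place by embedding distance to third place after re-ranking, which is the kind of reordering the cross-encoder is there to provide.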


#### STEP 3: Generative search

In [123]:
def generate_response(query, results_df):
    """
    Generate a response using GPT-3.5's ChatCompletion based on the user query and retrieved information.
    """
    # Use the function argument rather than a notebook-level variable, and render it with
    # to_string() so the model receives the full page text instead of the truncated DataFrame repr.
    context = results_df.to_string(index=False)

    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe below. Each search result is one page of an insurance document that may be relevant to the user query.

                                                {context}

                                                The column 'Documents' inside this dataframe contains the actual text from the policy document and the column 'Metadatas' contains the source page where the information is present.

                                                Use the documents in the dataframe to answer the query '{query}'. Frame an informative answer and also use the dataframe to return the relevant policy names and article numbers present in the document with the page numbers as citations.
                                                If you do not find a policy name or an article number, you can just provide the heading of the document.
                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular format.
                                                4. Use the 'Metadatas' column in the dataframe to retrieve and cite the article name(s) and page number(s) as citations.
                                                5. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                6. You are a customer-facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly, addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')
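
One alternative way to hand the retrieved pages to the model is to build an explicit context string in which every passage is labelled with its page. The sketch below is an illustration under assumptions, not the notebook's method: `build_context` is a hypothetical helper, the records use the `Documents` and `Metadatas` column names from this notebook, and the passage texts are placeholders.

```python
def build_context(records, max_chars=None):
    """Join retrieved passages into one prompt-ready string with page labels."""
    blocks = []
    for record in records:
        page = record["Metadatas"].get("Page_No.", "unknown page")
        text = record["Documents"]
        if max_chars is not None:
            text = text[:max_chars]  # optional cap to keep the prompt small
        blocks.append(f"[{page}]\n{text}")
    return "\n\n".join(blocks)


# Placeholder records in the shape of the rows in the re-ranked DataFrame.
demo_records = [
    {"Documents": "first page text", "Metadatas": {"Page_No.": "Page 18"}},
    {"Documents": "second page text", "Metadatas": {"Page_No.": "Page 16"}},
]
context = build_context(demo_records)
print(context)
```

With pandas, records of this shape can be obtained from the re-ranked frame via `to_dict("records")`.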

In [124]:
response = generate_response(query, top_3_RAG)

In [125]:
print("\n".join(response))

To fill a claim, follow these steps:
1. Obtain a claim form containing the necessary details.
2. Fill out the form accurately with all the required information.
3. Submit supporting documents such as proof of loss, medical records, and receipts.
4. Ensure you are actively engaged in business for profit, as per the policy terms.
5. Submit the completed claim form and documents to the insurance company for processing.

Here are the relevant details from the insurance documents along with citations:
- Policy Name/Article Number: PART II - POLICY ADMINISTRATION Section A
  Page Number: Page 16

These details should guide you on how to proceed with filling a claim.

Citations:
PART II - POLICY ADMINISTRATION Section A - Page 16


#### PIPELINE

The function below combines the steps above into a single end-to-end RAG pipeline.

In [113]:
def pipeline(query):
    
    # Check whether a similar query is already in the cache before searching the main collection
    cache_results = cache_collection.query(
        query_texts=query,
        n_results=1
    )
    threshold = 0.2
    ids = []
    documents = []
    distances = []
    metadatas = []
    results_df = pd.DataFrame()
    if not cache_results['distances'][0] or cache_results['distances'][0][0] > threshold:
        results = life_policy_collection.query(
            query_texts=query,
            n_results=10
        )

        keys_list = []
        values_list = []
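        # Flatten only the result fields we need into string-valued cache metadata
        for key in ('ids', 'documents', 'distances', 'metadatas'):
            value = results.get(key)
            if value is None or not value[0]:
                continue
            for i in range(min(10, len(value[0]))):
                keys_list.append(f"{key}{i}")
                values_list.append(str(value[0][i]))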

        cache_collection.add(
            documents=[query],
            ids=[query],
            metadatas=dict(zip(keys_list, values_list))
        )

        print("Results are not found in the cache db, it is found in the main db")

        results_dict = {
            'Metadatas': results['metadatas'][0],
            'Distances': results['distances'][0],
            'IDs': results['ids'][0],
            'Documents': results['documents'][0]
        }
        results_df = pd.DataFrame.from_dict(results_dict)

    else:
        cache_results_dict = cache_results['metadatas'][0][0]

        for key, val in cache_results_dict.items():
            if 'ids' in key:
                ids.append(val)
            elif 'documents' in key:
                documents.append(val)
            elif 'distances' in key:
                distances.append(val)
            elif 'metadatas' in key:
                metadatas.append(val)
        
        print("Found in cache")
        results_df = pd.DataFrame({
            'IDs': ids,
            'Documents': documents,
            'Distances': distances,
            'Metadatas': metadatas
        })
    print("Top 10 results are as follows: \n")
    print(results_df[['Documents','Metadatas']])
    # Re-rank all retrieved results with the cross-encoder and keep the top 3
    cross_inputs = [[query, response] for response in results_df['Documents']]
    cross_rerank_scores = cross_encoder.predict(cross_inputs)
    results_df['Reranked_scores'] = cross_rerank_scores
    top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
    top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

    #Generate response using LLM
    response = generate_response(query, top_3_RAG)
    print("Bot response:")
    print("\n".join(response))
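
The control flow of this pipeline can also be expressed with the individual steps passed in as functions, which makes the cache-or-main decision easy to exercise without a vector store or an LLM. This is a hedged sketch only: `run_pipeline` and every stub below are hypothetical, and the passages are placeholders rather than policy text.

```python
def run_pipeline(query, search_cache, search_main, store_in_cache, rerank, generate, top_k=3):
    """Cache lookup, fallback search, re-ranking, then generation."""
    docs = search_cache(query)
    source = "cache"
    if docs is None:
        # Cache miss: search the main store and remember the results.
        docs = search_main(query)
        store_in_cache(query, docs)
        source = "main"
    ranked = rerank(query, docs)[:top_k]
    return {"source": source, "answer": generate(query, ranked)}


def word_overlap(query, doc):
    """Toy relevance score: number of words shared by query and passage."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Stubs standing in for the Chroma collections, the cross-encoder, and the LLM.
cache = {}
corpus = ["placeholder passage about claim forms", "placeholder passage about premiums",
          "placeholder passage about a claim appeal", "placeholder passage about renewal"]
handlers = dict(
    search_cache=cache.get,
    search_main=lambda q: list(corpus),
    store_in_cache=cache.__setitem__,
    rerank=lambda q, docs: sorted(docs, key=lambda d: word_overlap(q, d), reverse=True),
    generate=lambda q, docs: f"Answer based on {len(docs)} passages",
)

first = run_pipeline("how do I file a claim", **handlers)
second = run_pipeline("how do I file a claim", **handlers)
print(first["source"], second["source"])
```

Here the cache is an exact-match dictionary; the notebook's cache instead matches on embedding distance with a threshold of 0.2.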



In [126]:
query =input("")
print("Customer: ", query)
pipeline(query)

Customer:  Give me steps in brief on how can i fill a claim.
Found in cache
Top 10 results are as follows: 

                                           Documents                Metadatas
0  Section D - Claim Procedures Article 1 - Notic...  {'Page_No.': 'Page 61'}
1  A claimant may request an appeal of a claim de...  {'Page_No.': 'Page 62'}
2  c . a copy of the form which contains the stat...  {'Page_No.': 'Page 18'}
3  a. be actively engaged in business for profit ...  {'Page_No.': 'Page 17'}
4  f . claim requirements listed in PART IV, Sect...  {'Page_No.': 'Page 54'}
5  Section A - Member Life Insurance Schedule of ...   {'Page_No.': 'Page 8'}
6  TABLE OF CONTENTS PART I - DEFINITIONS PART II...   {'Page_No.': 'Page 6'}
7  T he Principal has complete discretion to cons...  {'Page_No.': 'Page 19'}
8  Coverage During Disability will cease on the e...  {'Page_No.': 'Page 51'}
9  b. a business assignment; or c. full-time stud...  {'Page_No.': 'Page 37'}
Bot response:
To fill a claim, yo

In [127]:
query_2 = input("Enter your query: ")
print(query_2)
pipeline(query_2)

How many accelerated benefit payments can be requested?
Found in cache
Top 10 results are as follows: 

                                           Documents                Metadatas
0  (1) only one Accelerated Benefit payment will ...  {'Page_No.': 'Page 52'}
1  % of Scheduled Covered Loss Benefit Loss of Sp...  {'Page_No.': 'Page 57'}
2  (1) If termination is as described in b. (1) a...  {'Page_No.': 'Page 45'}
3  f . claim requirements listed in PART IV, Sect...  {'Page_No.': 'Page 54'}
4  M ember's death, the Death Benefits Payable ma...  {'Page_No.': 'Page 47'}
5  Any individual policy issued will then be in f...  {'Page_No.': 'Page 43'}
6  Coverage During Disability will cease on the e...  {'Page_No.': 'Page 51'}
7  A claimant may request an appeal of a claim de...  {'Page_No.': 'Page 62'}
8  Scheduled Benefit in force for the Member befo...  {'Page_No.': 'Page 31'}
9  c . If a beneficiary dies at the same time or ...  {'Page_No.': 'Page 48'}
Bot response:
The information about ho

In [129]:
query_3 = input("Enter your query: ")
print("Query: ", query_3)
pipeline(query_3)

Query:  What is the life insurance policy on accidental death?
Results are not found in the cache db, it is found in the main db
Top 10 results are as follows: 

                                           Documents                Metadatas
0  Exposure Exposure to the elements will be pres...  {'Page_No.': 'Page 55'}
1  Section A – Eligibility Member Life Insurance ...   {'Page_No.': 'Page 7'}
2  PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...  {'Page_No.': 'Page 26'}
3  Section C - Individual Terminations Article 1 ...  {'Page_No.': 'Page 35'}
4  Section A - Member Life Insurance Schedule of ...   {'Page_No.': 'Page 8'}
5  f . claim requirements listed in PART IV, Sect...  {'Page_No.': 'Page 54'}
6  (1) marriage or establishment of a Civil Union...  {'Page_No.': 'Page 32'}
7  Section B - Member Accidental Death and Dismem...  {'Page_No.': 'Page 53'}
8  Any individual policy issued will then be in f...  {'Page_No.': 'Page 43'}
9  M ember's death, the Death Benefits Payable ma...  {'Pa