## Part 1: Introduction

#### Project Background

Buiding a searc system project related to the insurance industry. An AI-powered project to extract information from insurance policy documents can revolutionize how policies are analyzed and understood. Using natural language processing (NLP), this system can efficiently scan, interpret, and extract key details such as coverage limits, exclusions, claim procedures, and premium details from the policy document. To address this, we will develop **Search System AI, a question-answering system that combines the power of Embedding models, RAG & Large language models to ensure accurate and reliable information delivery**.


#### Problem Statement

*Given Principal-Sample-Life-Insurance-Policy document containing information about the Group Member Life Insurance (Eligibility, Terminations, Reinstatement, BENEFITS, Member Accidental Death and Dismemberment Insurance, Dependent Life Insurance, Claim Procedures, cuisine type, area etc.), I will build a chatbot that parses the document and provides accurate answers based on the user queries*.

In [1]:
# Install all the required libraries
!pip install -U -q pdfplumber tiktoken openai chromadb sentence-transformers

In [2]:
# Import all the required Libraries

import pdfplumber
from pathlib import Path
import pandas as pd
from operator import itemgetter
import json
import tiktoken
import openai
import chromadb
import os, ast
import textwrap

#### Approach:

1. **Embedding Layer**: The embedding layer generates embeddings for our text corpus and allows the RAG model to understand the meaning of the query and to generate a relevant and informative response. This is essential for a variety of tasks, such as question answering system like what we are developing
2. **Search and Rank Layer**: It ensures that the retrieved text is accurate, relevant and contextually appropriate. The search component will use a technique called semantic similarity. It is a measure of how similar two pieces of text are in terms of their meaning. The search component uses semantic similarity to retrieve documents from a knowledge base that are relevant to the user's query. The re-rank component of the search typically uses a variety of techniques to re-rank the retrieved documents
3. **Generation Layer**: The generation layer will allow the model to generate new text in response to a user's query. The generative model takes the retrieved information, synthesises all the data and shapes it into a coherent and contextually appropriate response.

## Part 2: System Design

The question-answering system would help user to ask the questions to the system based on the policy document

After that the question-answering system will provide the answer, and engages in further conversation to help the user find the appropiate answer, specifically from the document we have and not from the internet


`Stage 1`

- Emebddding Layer - To embedded the user query & policy document into the ChromaDB

`Stage 2`

- Search the relevant text from the policy document
- Re-Rank the best possible match to the user Query

`Stage 3`

- `moderation_check()`: This checks if the user's or the assistant's message is inappropriate. If any of these is inappropriate, it ends the conversation.
- Generate new text based on the user query, prompt & relevant documents retreived from Stage 2

## Part 3: Implementation

## 3.1 <font color = black> Read, Process, and Chunk the PDF Files

We will be using [pdfplumber](https://https://pypi.org/project/pdfplumber/) to read and process the PDF files.

`pdfplumber` allows for better parsing of the PDF file as it can read various elements of the PDF apart from the plain text, such as, tables, images, etc. It also offers wide functionaties and visual debugging features to help with advanced preprocessing as well.

In [3]:
# Define the path of the PDF
pdf_path = '/Users/sahilavasthi/upGrad/genai/rag/project/'

In [4]:
# Function to check whether a word is present in a table or not for segregation of regular text and tables

def check_bboxes(word, table_bbox):
    # Check whether word is inside a table bbox.
    l = word['x0'], word['top'], word['x1'], word['bottom']
    r = table_bbox
    return l[0] > r[0] and l[1] > r[1] and l[2] < r[2] and l[3] < r[3]

In [5]:
# Function to extract text from a PDF file.
# 1. Declare a variable p to store the iteration of the loop that will help us store page numbers alongside the text
# 2. Declare an empty list 'full_text' to store all the text files
# 3. Use pdfplumber to open the pdf pages one by one
# 4. Find the tables and their locations in the page
# 5. Extract the text from the tables in the variable 'tables'
# 6. Extract the regular words by calling the function check_bboxes() and checking whether words are present in the table or not
# 7. Use the cluster_objects utility to cluster non-table and table words together so that they retain the same chronology as in the original PDF
# 8. Declare an empty list 'lines' to store the page text
# 9. If a text element in present in the cluster, append it to 'lines', else if a table element is present, append the table
# 10. Append the page number and all lines to full_text, and increment 'p'
# 11. When the function has iterated over all pages, return the 'full_text' list


def extract_text_from_pdf(pdf_path):
    p = 0
    full_text = []


    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_no = f"Page {p+1}"
            text = page.extract_text()

            tables = page.find_tables()
            table_bboxes = [i.bbox for i in tables]
            tables = [{'table': i.extract(), 'top': i.bbox[1]} for i in tables]
            non_table_words = [word for word in page.extract_words() if not any(
                [check_bboxes(word, table_bbox) for table_bbox in table_bboxes])]
            lines = []

            for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables, itemgetter('top'), tolerance=5):

                if 'text' in cluster[0]:
                    try:
                        lines.append(' '.join([i['text'] for i in cluster]))
                    except KeyError:
                        pass

                elif 'table' in cluster[0]:
                    lines.append(json.dumps(cluster[0]['table']))


            full_text.append([page_no, " ".join(lines)])
            p +=1

    return full_text

In [6]:
# Define the directory containing the PDF files
pdf_directory = Path(pdf_path)

# Initialize an empty list to store the extracted texts and document names
data = []

# Loop through all files in the directory
for pdf_path in pdf_directory.glob("*.pdf"):

    # Process the PDF file
    print(f"...Processing {pdf_path.name}")

    # Call the function to extract the text from the PDF
    extracted_text = extract_text_from_pdf(pdf_path)

    # Convert the extracted list to a PDF, and add a column to store document names
    extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
    extracted_text_df['Document Name'] = pdf_path.name

    # Append the extracted text and document name to the list
    data.append(extracted_text_df)

    # Print a message to indicate progress
    print(f"Finished processing {pdf_path.name}")

# Print a message to indicate all PDFs have been processed
print("All PDFs have been processed.")

CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, def

...Processing Principal-Sample-Life-Insurance-Policy.pdf


CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, defaulting to MediaBox
CropBox missing from /Page, def

Finished processing Principal-Sample-Life-Insurance-Policy.pdf
All PDFs have been processed.


In [7]:
# Concatenate all the DFs in the list 'data' together
life_insurance_pdf_data = pd.concat(data, ignore_index=True)

In [8]:
# Printing the DataFrame
life_insurance_pdf_data

Unnamed: 0,Page No.,Page_Text,Document Name
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,Principal-Sample-Life-Insurance-Policy.pdf
1,Page 2,This page left blank intentionally,Principal-Sample-Life-Insurance-Policy.pdf
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,Principal-Sample-Life-Insurance-Policy.pdf
3,Page 4,This page left blank intentionally,Principal-Sample-Life-Insurance-Policy.pdf
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,Principal-Sample-Life-Insurance-Policy.pdf
...,...,...,...
59,Page 60,I f a Dependent who was insured dies during th...,Principal-Sample-Life-Insurance-Policy.pdf
60,Page 61,Section D - Claim Procedures Article 1 - Notic...,Principal-Sample-Life-Insurance-Policy.pdf
61,Page 62,A claimant may request an appeal of a claim de...,Principal-Sample-Life-Insurance-Policy.pdf
62,Page 63,This page left blank intentionally,Principal-Sample-Life-Insurance-Policy.pdf


In [9]:
# Check one of the extracted page texts to ensure that the text has been correctly read
life_insurance_pdf_data.Page_Text[5]

'TABLE OF CONTENTS PART I - DEFINITIONS PART II - POLICY ADMINISTRATION Section A – Contract Entire Contract Article 1 Policy Changes Article 2 Policyholder Eligibility Requirements Article 3 Policy Incontestability Article 4 Individual Incontestability Article 5 Information to be Furnished Article 6 Certificates Article 7 Assignments Article 8 Dependent Rights Article 9 Policy Interpretation Article 10 Electronic Transactions Article 11 Section B – Premium Payment Responsibility; Due Dates; Grace Period Article 1 Premium Rates Article 2 Premium Rate Changes Article 3 Premium Amount Article 4 Contributions from Members Article 5 Section C - Policy Termination Failure to Pay Premium Article 1 Termination Rights of the Policyholder Article 2 Termination Rights of The Principal Article 3 Policyholder Responsibility to Members Article 4 Section D - Policy Renewal Renewal Article 1 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS This policy has been updated effective January 1, 2014 GC 6001 T

In [10]:
# Let's also check the length of all the texts as there might be some empty pages or pages with very few words that we can drop
life_insurance_pdf_data['Text_Length'] = life_insurance_pdf_data['Page_Text'].apply(lambda x: len(x.split(' ')))

In [11]:
# Retain only the rows with a text length of at least 10

life_insurance_pdf_data = life_insurance_pdf_data.loc[life_insurance_pdf_data['Text_Length'] >= 10]
life_insurance_pdf_data.head(10)

Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,Principal-Sample-Life-Insurance-Policy.pdf,30
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,Principal-Sample-Life-Insurance-Policy.pdf,230
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,Principal-Sample-Life-Insurance-Policy.pdf,110
5,Page 6,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,Principal-Sample-Life-Insurance-Policy.pdf,153
6,Page 7,Section A – Eligibility Member Life Insurance ...,Principal-Sample-Life-Insurance-Policy.pdf,176
7,Page 8,Section A - Member Life Insurance Schedule of ...,Principal-Sample-Life-Insurance-Policy.pdf,171
8,Page 9,P ART I - DEFINITIONS When used in this Group ...,Principal-Sample-Life-Insurance-Policy.pdf,387
9,Page 10,T he legally recognized union of two eligible ...,Principal-Sample-Life-Insurance-Policy.pdf,251
10,Page 11,(2) has been placed with the Member or spouse ...,Principal-Sample-Life-Insurance-Policy.pdf,299
11,Page 12,An institution that is licensed as a Hospital ...,Principal-Sample-Life-Insurance-Policy.pdf,352


In [12]:
# Maximum length of the text
life_insurance_pdf_data['Text_Length'].max()

462

In [13]:
# Store the metadata for each page in a separate column
life_insurance_pdf_data['Metadata'] = life_insurance_pdf_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  life_insurance_pdf_data['Metadata'] = life_insurance_pdf_data.apply(lambda x: {'Policy_Name': x['Document Name'][:-4], 'Page_No.': x['Page No.']}, axis=1)


In [14]:
life_insurance_pdf_data

Unnamed: 0,Page No.,Page_Text,Document Name,Text_Length,Metadata
0,Page 1,DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/...,Principal-Sample-Life-Insurance-Policy.pdf,30,{'Policy_Name': 'Principal-Sample-Life-Insuran...
2,Page 3,POLICY RIDER GROUP INSURANCE POLICY NO: S655 C...,Principal-Sample-Life-Insurance-Policy.pdf,230,{'Policy_Name': 'Principal-Sample-Life-Insuran...
4,Page 5,PRINCIPAL LIFE INSURANCE COMPANY (called The P...,Principal-Sample-Life-Insurance-Policy.pdf,110,{'Policy_Name': 'Principal-Sample-Life-Insuran...
5,Page 6,TABLE OF CONTENTS PART I - DEFINITIONS PART II...,Principal-Sample-Life-Insurance-Policy.pdf,153,{'Policy_Name': 'Principal-Sample-Life-Insuran...
6,Page 7,Section A – Eligibility Member Life Insurance ...,Principal-Sample-Life-Insurance-Policy.pdf,176,{'Policy_Name': 'Principal-Sample-Life-Insuran...
7,Page 8,Section A - Member Life Insurance Schedule of ...,Principal-Sample-Life-Insurance-Policy.pdf,171,{'Policy_Name': 'Principal-Sample-Life-Insuran...
8,Page 9,P ART I - DEFINITIONS When used in this Group ...,Principal-Sample-Life-Insurance-Policy.pdf,387,{'Policy_Name': 'Principal-Sample-Life-Insuran...
9,Page 10,T he legally recognized union of two eligible ...,Principal-Sample-Life-Insurance-Policy.pdf,251,{'Policy_Name': 'Principal-Sample-Life-Insuran...
10,Page 11,(2) has been placed with the Member or spouse ...,Principal-Sample-Life-Insurance-Policy.pdf,299,{'Policy_Name': 'Principal-Sample-Life-Insuran...
11,Page 12,An institution that is licensed as a Hospital ...,Principal-Sample-Life-Insurance-Policy.pdf,352,{'Policy_Name': 'Principal-Sample-Life-Insuran...


This concludes the chunking aspect also, as we can see that mostly the pages contain few hundred words, maximum going upto 1000. So, we don't need to chunk the documents further; we can perform the embeddings on individual pages. This strategy makes sense for 2 reasons:
1. The way insurance documents are generally structured, you will not have a lot of extraneous information in a page, and all the text pieces in that page will likely be interrelated.
2. We want to have larger chunk sizes to be able to pass appropriate context to the LLM during the generation layer.

## 4. <font color = black> Generate and Store Embeddings using OpenAI and ChromaDB

In this section, we will embed the pages in the dataframe through OpenAI's `text-embedding-ada-002` model, and store them in a ChromaDB collection.

In [15]:
filepath = '/Users/sahilavasthi/upGrad/genai/'

with open(filepath + "OpenAI_API_Key.txt", "r") as f:
  openai.api_key = ' '.join(f.readlines())

os.environ['OPENAI_API_KEY'] = openai.api_key.strip()

In [16]:
# Import the OpenAI Embedding Function into chroma
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

In [17]:
# Define the path where chroma collections will be stored
chroma_data_path = '/Users/sahilavasthi/upGrad/genai/rag/ChromaDB_Data'

In [18]:
# Call PersistentClient()
client = chromadb.PersistentClient(chroma_data_path)

In [19]:
# Set up the embedding function using the OpenAI embedding model

model = "text-embedding-ada-002"
embedding_function = OpenAIEmbeddingFunction(api_key=openai.api_key, model_name=model)

In [20]:
# Initialise a collection in chroma and pass the embedding_function to it so that it used OpenAI embeddings to embed the documents
life_insurance_collection = client.get_or_create_collection(name='RAG_on_LifeInsurance', embedding_function=embedding_function)

In [21]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma
documents_list = life_insurance_pdf_data["Page_Text"].tolist()
metadata_list = life_insurance_pdf_data['Metadata'].tolist()

In [22]:
# Add the documents and metadata to the collection alongwith generic integer IDs. You can also feed the metadata information as IDs by combining the policy name and page no.

life_insurance_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)

In [23]:
# Let's take a look at the first few entries in the collection

life_insurance_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-2.24228799e-02,  1.87183432e-02, -2.72361692e-02, ...,
         -3.69149223e-02,  2.83710100e-03, -1.30930578e-03],
        [-1.32057490e-02,  8.82212631e-03, -4.67860838e-03, ...,
         -1.56548154e-02, -4.84764605e-05,  7.25115696e-03],
        [-1.24035338e-02,  1.34377144e-02, -2.85228249e-03, ...,
         -2.97525711e-02, -1.01760682e-02,  9.71201342e-03]]),
 'documents': ['DOROTHEA GLAUSE S655 RHODE ISLAND JOHN DOE 01/01/2014 711 HIGH STREET GEORGE RI 02903 GROUP POLICY FOR: RHODE ISLAND JOHN DOE ALL MEMBERS Group Member Life Insurance Print Date: 07/16/2014',
  'POLICY RIDER GROUP INSURANCE POLICY NO: S655 COVERAGE: Life EMPLOYER: RHODE ISLAND JOHN DOE Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the following will apply to your Policy: From time to time The Principal may offer or provide certain employer groups who apply for coverage with The Principal a Financial Services Hotline and Gri

In [24]:
# Create a cache collection to store the embeddings of the documents so that we can use them for similarity search later
# This will help us avoid re-embedding the same documents multiple times, which can be time-consuming and costly
cache_collection = client.get_or_create_collection(name='Life_Insurance_Cache', embedding_function=embedding_function)

## 5. <font color = black> Semantic Search with Cache

In this section, we will perform a semantic search of a query in the collections embeddings to get several top semantically similar results.

In [25]:
# Read the user query

query = input()

In [26]:
# Searh the Cache collection first
# Query the collection against the user query and return the top 20 results

cache_results = cache_collection.query(
    query_texts=query,
    n_results=1
)

In [27]:
cache_results

{'ids': [['How are the premiums determined for Member Life, Member Accidental Death and Dismemberment, and Dependent Life Insurance (including details on premium rates, rate changes, multiple policy discounts, and contribution responsibilities)?']],
 'embeddings': None,
 'documents': [['How are the premiums determined for Member Life, Member Accidental Death and Dismemberment, and Dependent Life Insurance (including details on premium rates, rate changes, multiple policy discounts, and contribution responsibilities)?']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[{'distances3': '0.2581002116203308',
    'metadatas8': "{'Page_No.': 'Page 53', 'Policy_Name': 'Principal-Sample-Life-Insurance-Policy'}",
    'distances6': '0.2846359610557556',
    'distances4': '0.2599128484725952',
    'ids9': '46',
    'distances2': '0.25669312477111816',
    'metadatas4': "{'Policy_Name': 'Principal-Sample-Life-Insurance-Policy', 'Page_No.': 'Page 

In [28]:
results = life_insurance_collection.query(
query_texts=query,
n_results=10
)

In [29]:
results
# Print the results

{'ids': [['18', '19', '4', '23', '32', '5', '29', '17', '50', '46']],
 'embeddings': None,
 'documents': [["b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decreased by more than 

In [30]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids = []
documents = []
distances = []
metadatas = []
results_df = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = life_insurance_collection.query(
      query_texts=query,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      for key, val in results.items():
        if val is None or len(val) == 0 or len(val[0]) == 0:
          continue
        for i in range(len(val[0])):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query],
          ids = [query],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas': results['metadatas'][0], 'Documents': results['documents'][0], 'Distances': results['distances'][0], "IDs":results["ids"][0]}
      results_df = pd.DataFrame.from_dict(result_dict)
      results_df


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Loop through each inner list and then through the dictionary
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids.append(value)
          elif 'documents' in key:
              documents.append(value)
          elif 'distances' in key:
              distances.append(value)
          elif 'metadatas' in key:
              metadatas.append(value)

      print("Found in cache!")

      # Create a DataFrame
      results_df = pd.DataFrame({
        'IDs': ids,
        'Documents': documents,
        'Distances': distances,
        'Metadatas': metadatas
      })


Found in cache!


In [31]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas
0,46,Section B - Premiums Article 1 - Payment Respo...,0.2581002116203308,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi..."
1,18,Section C - Individual Terminations Article 1 ...,0.2846359610557556,{'Policy_Name': 'Principal-Sample-Life-Insuran...
2,50,b . on any date the definition of Member or De...,0.2599128484725952,{'Policy_Name': 'Principal-Sample-Life-Insuran...
3,17,Section A – Eligibility Member Life Insurance ...,0.2566931247711181,"{'Page_No.': 'Page 20', 'Policy_Name': 'Princi..."
4,29,(1) marriage or establishment of a Civil Union...,0.3051510751247406,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip..."
5,23,The number of Members insured for Dependent Li...,0.2981325984001159,{'Policy_Name': 'Principal-Sample-Life-Insuran...
6,4,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.2926594316959381,{'Policy_Name': 'Principal-Sample-Life-Insuran...
7,19,Section B - Member Accidental Death and Dismem...,0.2175861299037933,{'Policy_Name': 'Principal-Sample-Life-Insuran...
8,32,Payment of benefits will be subject to the Ben...,0.2701891958713531,"{'Page_No.': 'Page 22', 'Policy_Name': 'Princi..."
9,5,Section A - Member Life Insurance Schedule of ...,0.2397646009922027,"{'Page_No.': 'Page 49', 'Policy_Name': 'Princi..."


## 6. <font color = black> Re-Ranking with a Cross Encoder

Re-ranking the results obtained from your semantic search can sometime significantly improve the relevance of the retrieved results. This is often done by passing the query paired with each of the retrieved responses into a cross-encoder to score the relevance of the response w.r.t. the query.

In [32]:
# Upgrade torch to fix AttributeError: module 'torch' has no attribute 'compiler'
%pip install --upgrade torch

# After running the above line, please restart the kernel before running the next lines.

from sentence_transformers import CrossEncoder, util

Note: you may need to restart the kernel to use updated packages.


  from .autonotebook import tqdm as notebook_tqdm


In [33]:
# Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [34]:
# Input (query, response) pairs for each of the top 20 responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs = [[query, response] for response in results_df['Documents']]
cross_rerank_scores = cross_encoder.predict(cross_inputs)

In [35]:
# Store the rerank_scores in results_df

results_df['Reranked_scores'] = cross_rerank_scores

In [36]:
results_df

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
0,46,Section B - Premiums Article 1 - Payment Respo...,0.2581002116203308,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",4.352727
1,18,Section C - Individual Terminations Article 1 ...,0.2846359610557556,{'Policy_Name': 'Principal-Sample-Life-Insuran...,3.707824
2,50,b . on any date the definition of Member or De...,0.2599128484725952,{'Policy_Name': 'Principal-Sample-Life-Insuran...,5.085
3,17,Section A – Eligibility Member Life Insurance ...,0.2566931247711181,"{'Page_No.': 'Page 20', 'Policy_Name': 'Princi...",3.289842
4,29,(1) marriage or establishment of a Civil Union...,0.3051510751247406,"{'Page_No.': 'Page 8', 'Policy_Name': 'Princip...",2.96484
5,23,The number of Members insured for Dependent Li...,0.2981325984001159,{'Policy_Name': 'Principal-Sample-Life-Insuran...,2.79957
6,4,PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS ...,0.2926594316959381,{'Policy_Name': 'Principal-Sample-Life-Insuran...,3.753448
7,19,Section B - Member Accidental Death and Dismem...,0.2175861299037933,{'Policy_Name': 'Principal-Sample-Life-Insuran...,3.466119
8,32,Payment of benefits will be subject to the Ben...,0.2701891958713531,"{'Page_No.': 'Page 22', 'Policy_Name': 'Princi...",4.568875
9,5,Section A - Member Life Insurance Schedule of ...,0.2397646009922027,"{'Page_No.': 'Page 49', 'Policy_Name': 'Princi...",1.003377


In [37]:
# Return the top 3 results from semantic search

top_3_semantic = results_df.sort_values(by='Distances')
top_3_semantic[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
7,19,Section B - Member Accidental Death and Dismem...,0.2175861299037933,{'Policy_Name': 'Principal-Sample-Life-Insuran...,3.466119
9,5,Section A - Member Life Insurance Schedule of ...,0.2397646009922027,"{'Page_No.': 'Page 49', 'Policy_Name': 'Princi...",1.003377
3,17,Section A – Eligibility Member Life Insurance ...,0.2566931247711181,"{'Page_No.': 'Page 20', 'Policy_Name': 'Princi...",3.289842


In [38]:
# Return the top 3 results after reranking

top_3_rerank = results_df.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank[:3]

Unnamed: 0,IDs,Documents,Distances,Metadatas,Reranked_scores
2,50,b . on any date the definition of Member or De...,0.2599128484725952,{'Policy_Name': 'Principal-Sample-Life-Insuran...,5.085
8,32,Payment of benefits will be subject to the Ben...,0.2701891958713531,"{'Page_No.': 'Page 22', 'Policy_Name': 'Princi...",4.568875
0,46,Section B - Premiums Article 1 - Payment Respo...,0.2581002116203308,"{'Page_No.': 'Page 53', 'Policy_Name': 'Princi...",4.352727


In [39]:
top_3_RAG = top_3_rerank[["Documents", "Metadatas"]][:3]

In [40]:
top_3_RAG["Documents"].values[0]

"b . on any date the definition of Member or Dependent is changed; and c. on any date the Policyholder's business, as specified on the Policyholder application, is changed; and d. on any date that a schedule of insurance or class of insured Members is changed; and e. on any premium due date, if the Policyholder has been receiving a multiple policy discount rate and the Policyholder drops below the minimum number of coverages to receive such discount rate; and f. on any date the premium contribution required of Members is changed; and g. with respect to Member Life Insurance, on any Policy Anniversary, if the average age, average Scheduled Benefit amount, or the male/female distribution for then insured Members has changed since the last Policy Anniversary; and h. on any Policy Anniversary, if the volume of insurance for then insured Members has increased or decreased by more than 25% since the last Policy Anniversary. If the Policyholder has other group insurance with The Principal, an

## 7. Retrieval Augmented Generation

Now that we have the final top search results, we can pass it to an GPT 3.5 along with the user query and a well-engineered prompt, to generate a direct answer to the query along with citations, rather than returning whole pages/chunks.

In [41]:
# Define a function called moderation_check that takes user_input as a parameter.
def moderation_check(user_input):
    # Call the OpenAI API to perform moderation on the user's input.
    response = openai.moderations.create(input=user_input, model="omni-moderation-latest")

    # Extract the moderation result from the API response.
    moderation_output = response.results[0].flagged
    # Check if the input was flagged by the moderation system.
    if moderation_output == True:
        # If flagged, return "Flagged"
        return "Flagged"
    else:
        # If not flagged, return "Not Flagged"
        return "Not Flagged"

In [42]:
# Define the function to generate the response. Provide a comprehensive prompt that passes the user query and the top 3 results to the model

def generate_response(query, results_df):
    """
    Generate a response using GPT-4's ChatCompletion based on the user query and retrieved information.
    """
    messages = [
                {"role": "system", "content":  "You are a helpful assistant in the life insurance domain who can effectively answer user queries about insurance policies and documents."},
                {"role": "user", "content": f"""You are a helpful assistant in the life insurance domain who can effectively answer user queries about insurance policies and documents.
                                                You have a question asked by the user in '{query}' and you have some search results from a corpus of insurance documents in the dataframe '{top_3_RAG}'. These search results are essentially one page of an insurance document that may be relevant to the user query.

                                                The column 'documents' inside this dataframe contains the actual text from the policy document and the column 'metadata' contains the policy name and source page. The text inside the document may also contain tables in the format of a list of lists where each of the nested lists indicates a row.

                                                Use the documents in '{top_3_RAG}' to answer the query '{query}'. Frame an informative answer and also, use the dataframe to return the relevant policy names and page numbers as citations.

                                                Follow the guidelines below when performing the task.
                                                1. Try to provide relevant/accurate numbers if available.
                                                2. You don’t have to necessarily use all the information in the dataframe. Only choose information that is relevant.
                                                3. If the document text has tables with relevant information, please reformat the table and return the final information in a tabular in format.
                                                3. Use the Metadatas columns in the dataframe to retrieve and cite the policy name(s) and page numbers(s) as citation.
                                                4. If you can't provide the complete answer, please also provide any information that will help the user to search specific sections in the relevant cited documents.
                                                5. You are a customer facing assistant, so do not provide any information on internal workings, just answer the query directly.

                                                The generated response should answer the query directly addressing the user and avoiding additional information. If you think that the query is not relevant to the document, reply that the query is irrelevant. Provide the final response as a well-formatted and easily readable text along with the citation. Provide your complete response first with all information, and then provide the citations.
                                                """},
              ]

    response = openai.chat.completions.create(
        model="gpt-4.1-mini",
        messages=messages
    )

    return response.choices[0].message.content.split('\n')

In [43]:
# Generate the response
moderation = moderation_check(query)
response = None
if moderation == 'Flagged':
    display("Sorry, this message has been flagged. Please restart your conversation.")
else:
    response = generate_response(query, top_3_RAG)

In [44]:
# Print the response

print("\n".join(response))

The premiums for Member Life, Member Accidental Death and Dismemberment (AD&D), and Dependent Life Insurance are determined as follows:

1. **Premium Rates and Changes:**
   - Premiums for these insurance coverages are based on the rates specified in the policy.
   - The rates may be adjusted or changed by the insurer, and any such changes will be applied as per the terms of the policy.

2. **Multiple Policy Discounts:**
   - If a member has more than one policy with the insurer under this coverage, the insurer may provide a multiple policy discount, which affects the premium calculation.

3. **Contribution Responsibilities:**
   - The member and/or the employer share responsibility for paying the premiums as specified in the policy.
   - Payment terms and billing responsibility are clearly defined and must be adhered to for coverage to remain in effect.

4. **Payment Responsibility Details:**
   - Premium payments must be maintained timely; failure to pay premiums can result in termin