#### Problem Statement: 
Building a RAG-based Generative Search System for Insurance Policy Documents
The insurance industry generates vast amounts of policy documents containing complex and extensive details. Customers and agents often struggle to retrieve specific information efficiently, leading to delays in decision-making and increased dependency on manual customer support.

The goal of this project is to develop a Retrieval-Augmented Generation (RAG) system that enables users to ask natural language queries and receive accurate, context-aware responses derived from insurance policy documents. By leveraging LLMs (Large Language Models) with advanced retrieval mechanisms, the system will improve the efficiency of document search, reduce customer service overhead, and enhance user experience.

This project uses LlamaIndex (LangChain would be an alternative framework) to integrate retrieval-based search with generative AI capabilities. It processes and indexes various insurance policies, so users can extract precise answers instead of scanning lengthy documents manually.

In [None]:
# Importing necessary libraries
!pip install llama-index

Collecting llama-index
  Downloading llama_index-0.11.11-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.4.0,>=0.3.4 (from llama-index)
  Downloading llama_index_agent_openai-0.3.4-py3-none-any.whl.metadata (728 bytes)
Collecting llama-index-cli<0.4.0,>=0.3.1 (from llama-index)
  Downloading llama_index_cli-0.3.1-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.12.0,>=0.11.10 (from llama-index)
  Downloading llama_index_core-0.11.11-py3-none-any.whl.metadata (2.4 kB)
Collecting llama-index-embeddings-openai<0.3.0,>=0.2.4 (from llama-index)
  Downloading llama_index_embeddings_openai-0.2.5-py3-none-any.whl.metadata (686 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.3.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.3.1-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post3-py3-none-any.whl.metadata (8.5 kB)
Collecti

In [None]:
# Document loaders for SimpleDirectoryReader
!pip install docx2txt
!pip install pypdf

# Install OpenAI
!pip install openai


Collecting docx2txt
  Downloading docx2txt-0.8.tar.gz (2.8 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docx2txt
  Building wheel for docx2txt (setup.py) ... done
  Created wheel for docx2txt: filename=docx2txt-0.8-py3-none-any.whl size=3959 sha256=f80a4c3b581e15e207d5fa8f7f6f7762812efdd7ce990f462e4e3f1899a51882
  Stored in directory: /root/.cache/pip/wheels/22/58/cf/093d0a6c3ecfdfc5f6ddd5524043b88e59a9a199cb02352966
Successfully built docx2txt
Installing collected packages: docx2txt
Successfully installed docx2txt-0.8


In [None]:
#import openAI
from llama_index.llms.openai import OpenAI
#import ChatMessage
from llama_index.core.llms import ChatMessage
#import os
import os
import openai

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [None]:
# Set the API key; strip() drops the trailing newline that a text file usually ends with
filepath="/content/drive/MyDrive/"
with open(filepath + "OpenAI_API_Key.txt","r") as f:
  openai.api_key=f.read().strip()

In [None]:
#import SimpleDirectoryReader
from llama_index.core import SimpleDirectoryReader

# Create object of SimpleDirectoryReader
reader=SimpleDirectoryReader(input_dir="/content/drive/MyDrive/Policy+Documents (3)/")

In [None]:
documents=reader.load_data()
# Number of loaded documents (the PDF loader yields one document per page)
print(f"Loaded {len(documents)} documents/pages successfully.")

Loaded 217 documents/pages successfully.
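
The count of 217 refers to pages rather than files, since each loaded document carries a `file_name` and a `page_label` in its metadata. A minimal sketch of how the pages could be grouped by source file is shown below. It uses plain dictionaries with hypothetical file names as stand-ins for `doc.metadata`; with the real loader output one would pass `[doc.metadata for doc in documents]`.

```python
from collections import Counter

def pages_per_file(metadata_list):
    """Count how many loaded pages came from each source file."""
    return Counter(meta.get("file_name", "<unknown>") for meta in metadata_list)

# Stand-in metadata with hypothetical file names.
sample = [
    {"file_name": "policy_a.pdf", "page_label": "1"},
    {"file_name": "policy_a.pdf", "page_label": "2"},
    {"file_name": "policy_b.pdf", "page_label": "1"},
]
counts = pages_per_file(sample)
for name, n in counts.most_common():
    print(f"{name}: {n} pages")
```

A quick check like this makes it easy to spot a file that failed to load or loaded with far fewer pages than expected.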


In [None]:
documents[0]

Document(id_='63aa5cf2-3460-40bc-90aa-f2ad407c6775', embedding=None, metadata={'page_label': '1', 'file_name': 'HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf', 'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf', 'file_type': 'application/pdf', 'file_size': 1303156, 'creation_date': '2024-09-21', 'last_modified_date': '2024-09-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' \n             Part A \n<<Date>> \n<<Policyholder’s Name>>  \n<<Policyholder’s Address>> \n<<Policyholder’s Contact Number>>  \n \nDear <<Policyholder’s Name>>,  \n \nSub: Your Policy no. <<  >> \nWe are glad to inform you that your proposal has been accepted and the HDFC Life Easy Health (“Policy”

Step 4: Building the query engine

In [None]:
# import SimpleNodeParser
from llama_index.core.node_parser import SimpleNodeParser
# import VectorStoreIndex
from llama_index.core import VectorStoreIndex
# import display, HTML
from IPython.display import display, HTML


# Create parser and parse documents into nodes
parser=SimpleNodeParser.from_defaults()
nodes=parser.get_nodes_from_documents(documents)

# build index
index=VectorStoreIndex(nodes)

#construct query engine
query_engine=index.as_query_engine()
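
Conceptually, the query engine embeds the question, ranks the indexed nodes by vector similarity, and hands the best matches to the LLM as context. The toy sketch below illustrates only the ranking step. It is not LlamaIndex's implementation: the bag-of-words `embed` function is a stand-in for a real embedding model, and the three chunks are made-up examples.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]

# Made-up chunks for illustration only.
chunks = [
    "the policy may be revived within the revival period",
    "the free-look period is 15 days from receipt of the policy",
    "premium is payable on the due date",
]
results = retrieve("when can a lapsed policy be revived", chunks)
for score, chunk in results:
    print(f"{score:.3f}  {chunk}")
```

Note that retrieval always returns the `top_k` closest chunks, even when none of them actually answers the question, so the similarity score and the source text are worth inspecting.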


In [None]:
#query
response=query_engine.query("What provisions may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?")

In [None]:
response.response

'The provisions that may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA) are those that specify a Revival Period of three years from the date of the first unpaid Premium.'

In [None]:
dir(response)

['__annotations__',
 '__class__',
 '__dataclass_fields__',
 '__dataclass_params__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__match_args__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'get_formatted_sources',
 'metadata',
 'response',
 'source_nodes']

In [None]:
response.metadata

{'b107ad26-7f4a-4449-90a0-8fb47ac0edc8': {'page_label': '11',
  'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf',
  'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf',
  'file_type': 'application/pdf',
  'file_size': 1990500,
  'creation_date': '2024-09-21',
  'last_modified_date': '2024-09-21'},
 'ceafe426-b428-4297-8bb0-45c72cd50edf': {'page_label': '11',
  'file_name': 'HDFC-Life-Smart-Pension-Plan-Policy-Document-Online.pdf',
  'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Smart-Pension-Plan-Policy-Document-Online.pdf',
  'file_type': 'application/pdf',
  'file_size': 983547,
  'creation_date': '2024-09-21',
  'last_modified_date': '2024-09-21'}}

In [None]:
response.source_nodes

[NodeWithScore(node=TextNode(id_='b107ad26-7f4a-4449-90a0-8fb47ac0edc8', embedding=None, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_type': 'application/pdf', 'file_size': 1990500, 'creation_date': '2024-09-21', 'last_modified_date': '2024-09-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='3cae4910-ffc1-43f2-9585-0600d598012e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Polic

In [None]:
len(response.source_nodes)

2

In [None]:
print(response.source_nodes[0].node.metadata['file_name'])
print(response.source_nodes[0].node.metadata['page_label'])

HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf
11


In [None]:
print(response.source_nodes[0].node.metadata['file_name'] + " Page No " + response.source_nodes[0].node.metadata['page_label'])

HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf Page No 11


In [None]:
# Extract the score
print(response.source_nodes[0].score)

0.7720022811789837
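
The score is the similarity between the query and the retrieved node, so it measures closeness in embedding space rather than whether the node answers the question. One option is to drop weak matches before citing them. The sketch below assumes a simple stand-in class instead of the real `NodeWithScore`, and the cutoff of 0.75 is an arbitrary illustrative value. To my understanding, LlamaIndex offers a `SimilarityPostprocessor` with a `similarity_cutoff` argument for the same purpose.

```python
from dataclasses import dataclass

@dataclass
class ScoredSource:
    """Minimal stand-in for a retrieved node: a label and a similarity score."""
    name: str
    score: float

def filter_by_score(sources, cutoff):
    """Keep only sources whose similarity score reaches the cutoff."""
    return [s for s in sources if s.score is not None and s.score >= cutoff]

sources = [ScoredSource("page 11", 0.772), ScoredSource("page 3", 0.65)]
kept = filter_by_score(sources, cutoff=0.75)  # 0.75 is an arbitrary illustrative cutoff
print([s.name for s in kept])
```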


In [None]:
# Response Node Text
response.source_nodes[0].node.text

'D.2.2.  Notwithstanding anything to the contrary contained elsewhere in this Policy, the Company reserves the right to revive  \nthe lapsed Policy either on its original terms and conditions or on such other or modified terms and conditions as the \nCompany may specify or to reject the Revival . If needed the Company may refer it to its medical examiner in decid ing \non Revival  of lapse d Policy. Subject to the provisions of Clauses D.2.1 above, the Revival  shall come into effect on the \ndate when the Company specifically communicates it in writing to the Policyholder.  \n \nD.2.3  If the Policy is not revived for full Benefits before the Policy  Maturity Date but within five years from the due date for \npayment of the first unpaid Premium and if the Policy has not acquired Guaranteed Surrender Value, then the Policy \nwill terminate.  \n \nD.3.  Non-Forfeiture options : PART D  \nPolicy Servicing Related Aspects'

Step 6: Creating response pipeline

In [None]:
# Query response function
def query_response(user_input):
  response=query_engine.query(user_input)
  file_name=response.source_nodes[0].node.metadata['file_name'] + " Page No " + response.source_nodes[0].node.metadata['page_label']
  final_response=response.response + "\nCheck further at " + file_name
  return final_response
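
`query_response` cites only the first source node, although the engine returns several. A possible variant, sketched below, builds one citation per retrieved node and skips duplicates. The `make_node` helper is hypothetical and only mimics the `.node.metadata` shape used above so the example runs without an index.

```python
from types import SimpleNamespace

def format_sources(source_nodes):
    """Build one 'file Page No N' citation per retrieved node, skipping duplicates."""
    seen, lines = set(), []
    for item in source_nodes:
        meta = item.node.metadata
        citation = f"{meta.get('file_name', 'unknown file')} Page No {meta.get('page_label', '?')}"
        if citation not in seen:
            seen.add(citation)
            lines.append(citation)
    return lines

def make_node(file_name, page_label):
    """Hypothetical stand-in mimicking the .node.metadata shape of a source node."""
    meta = {"file_name": file_name, "page_label": page_label}
    return SimpleNamespace(node=SimpleNamespace(metadata=meta))

fake_nodes = [make_node("a.pdf", "11"), make_node("b.pdf", "11"), make_node("a.pdf", "11")]
citations = format_sources(fake_nodes)
print("\n".join(citations))
```

Because the function returns an empty list for an empty input, it also avoids the `IndexError` that `response.source_nodes[0]` would raise if nothing were retrieved.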

In [None]:
def initialize_conv():
  print("Feel free to ask questions related to insurance policies. Enter exit once you are done!")
  while True:
    user_input=input()
    if user_input.lower() == "exit":
      print("Exiting the program. Bye!!!")
      break
    else:
      response=query_response(user_input)
      display(HTML(f'<p style="font-size:20px">{response}</p>'))
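
The response is interpolated into HTML as-is, so any angle brackets in it (the policy text contains placeholders such as `<<Policyholder’s Name>>`) would be interpreted as markup, and the line break before "Check further at" would be collapsed. A minimal sketch of escaping the text first, using only the standard library; the helper name `to_html_paragraph` is an assumption for illustration:

```python
import html

def to_html_paragraph(text):
    """Escape HTML-special characters and keep line breaks when rendering text."""
    escaped = html.escape(text)
    return '<p style="font-size:20px">' + escaped.replace("\n", "<br>") + "</p>"

snippet = to_html_paragraph("Dear <<Policyholder's Name>>\nCheck further at a.pdf Page No 1")
print(snippet)
```

The returned string could then be passed to `display(HTML(...))` in place of the f-string.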


In [None]:
initialize_conv()

Feel free to ask questions related to insurance policies. Enter exit once you are done!
What provisions may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?


exit
Exiting the program. Bye!!!


Step 7: Building a test pipeline

In [None]:
questions=["What provisions may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?",
           "How is the peroid of time during which a reinstated Member's insurance was not in force treated for the purpose of determining the length of continuous coverage under the Group Policy?",
           "What are the requirements for placing in force any Scheduled benefit that would have been subject to Proof of Good Health has the member remained continuously insured?"]

In [None]:
import pandas as pd

def testing_pipeline(questions):
  test_feedback=[]
  for question in questions:
    # Query once and reuse the result, so the printed and the recorded response are identical
    answer=query_response(question)
    print(question)
    print(answer)
    print("\nPlease provide your feedback on the response provided by bot")
    user_input=input()
    # The response ends with "... Page No <page_label>", so the last token is the page
    page=answer.split()[-1]
    test_feedback.append((question,answer,page,user_input))

  feedback_df=pd.DataFrame(test_feedback,columns=["Question","Response","Page","Good/Bad"])
  return feedback_df

In [None]:
testing_pipeline(questions)

What provisions may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?
The provisions that may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA) are those that specify a Revival Period of three years from the date of the first unpaid Premium for the Policyholder to revive the Policy.
Check further at HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf Page No 11

Please provide your feedback on the response provided by bot
How is the peroid of time during which a reinstated Member's insurance was not in force treated for the purpose of determining the length of continuous coverage under the Group Policy?
How is the peroid of time during which a reinstated Member's insurance was not in force treated for the purpose of determini

|   | Question | Response | Page | Good/Bad |
|---|----------|----------|------|----------|
| 0 | What provisions may allow for a longer reinsta... | The provisions that may allow for a longer rei... | 11 | How is the peroid of time during which a reins... |
| 1 | How is the peroid of time during which a reins... | The period of time during which a reinstated M... | 15 | What are the requirements for placing in force... |
| 2 | What are the requirements for placing in force... | The requirements for placing in force any Sche... | 15 | good |
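
In this run the Good/Bad column of the first two rows holds pasted question text rather than a rating, so the feedback cannot be aggregated directly. A small sketch of validating the input and summarizing it follows. The function names and the accepted labels ("good"/"bad") are assumptions based on the column name.

```python
def normalize_feedback(raw):
    """Map free-text feedback to 'good' or 'bad'; return None for anything else."""
    value = raw.strip().lower()
    return value if value in {"good", "bad"} else None

def feedback_summary(labels):
    """Return (number of valid labels, fraction marked good); fraction is None without valid labels."""
    valid = [label for label in map(normalize_feedback, labels) if label is not None]
    if not valid:
        return 0, None
    return len(valid), valid.count("good") / len(valid)

raw_labels = ["How is the period of time ...", " Good ", "bad"]
n_valid, good_rate = feedback_summary(raw_labels)
print(n_valid, good_rate)
```

Inside the testing loop, `normalize_feedback` could be used to re-prompt until the tester enters a valid rating.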


Part 3: Next steps

3.1 Building a custom prompt template

In [None]:
response=query_engine.query("What provisions may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?")

In [None]:
response.response

'The provisions that may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA) could include the option to revive the lapsed policy within a specified Revival Period, which typically lasts for three years from the date of the first unpaid premium.'

In [None]:
# response source nodes

response.source_nodes

[NodeWithScore(node=TextNode(id_='b107ad26-7f4a-4449-90a0-8fb47ac0edc8', embedding=None, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_type': 'application/pdf', 'file_size': 1990500, 'creation_date': '2024-09-21', 'last_modified_date': '2024-09-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='3cae4910-ffc1-43f2-9585-0600d598012e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Polic

In [None]:
response.source_nodes[0]

NodeWithScore(node=TextNode(id_='b107ad26-7f4a-4449-90a0-8fb47ac0edc8', embedding=None, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Policy+Documents (3)/HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_type': 'application/pdf', 'file_size': 1990500, 'creation_date': '2024-09-21', 'last_modified_date': '2024-09-21'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='3cae4910-ffc1-43f2-9585-0600d598012e', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '11', 'file_name': 'HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Document (1).pdf', 'file_path': '/content/drive/MyDrive/Policy

In [None]:
response.source_nodes[1].node.text

'HDFC  Life Smart Pension Plan 101L164V02  – Terms and Conditions  (Direct & \nOnline Sales)  \n(A Unit Linked Non -Participating Individual Pension Plan)   \n  Page 11 of 37  \n \nPART D  \nPOLICY SERVICING RELATED ASPECTS  \n \nThe Policyholder shall have a period of 15 days from the date of receipt of the Policy document to review the terms \nand conditions of this Policy and if the Policyholder disagrees with the said terms and conditions, the Policyholder shall \nhave the option t o return the Policy to the Company for cancellation, stating the reasons for His objections. Upon such \nFree-Look cancellation , the Company shall return the Premium paid subject to deduction of a proportionate risk \nPremium for the period of insurance cover and medical examination fees (if any) in addition to the stamp duty charges. \nAll Benefits and rights under this Policy shall immediately stand terminated on the cancellation of the Policy.  \n \nThe Policyholder shall have a period of 30 days if 

In [None]:
reference_0 = " Check further at " + response.source_nodes[0].node.metadata['file_name'] + " Page No " + response.source_nodes[0].node.metadata['page_label']
reference_1 = " Check further at " + response.source_nodes[1].node.metadata['file_name'] + " Page No " + response.source_nodes[1].node.metadata['page_label']
retrieved = response.source_nodes[0].node.text + reference_0 + response.source_nodes[1].node.text + reference_1
retrieved

'D.2.2.  Notwithstanding anything to the contrary contained elsewhere in this Policy, the Company reserves the right to revive  \nthe lapsed Policy either on its original terms and conditions or on such other or modified terms and conditions as the \nCompany may specify or to reject the Revival . If needed the Company may refer it to its medical examiner in decid ing \non Revival  of lapse d Policy. Subject to the provisions of Clauses D.2.1 above, the Revival  shall come into effect on the \ndate when the Company specifically communicates it in writing to the Policyholder.  \n \nD.2.3  If the Policy is not revived for full Benefits before the Policy  Maturity Date but within five years from the due date for \npayment of the first unpaid Premium and if the Policy has not acquired Guaranteed Surrender Value, then the Policy \nwill terminate.  \n \nD.3.  Non-Forfeiture options : PART D  \nPolicy Servicing Related Aspects Check further at HDFC-Life-Sampoorna-Jeevan-101N158V04-Policy-Docum

In [None]:
messages=[
          {
              "role":"system",
              "content":"You are an AI assistant to the user."
          },
          {
              "role":"user",
              "content": f"""What provisions may allow for a longer reinstatement period for an approved leave of absence taken
              in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)? Check in '{retrieved}'
              """
          }
        ]
messages

[{'role': 'system', 'content': 'You are an AI assistant to the user.'},
 {'role': 'user',
  'content': "What provisions may allow for a longer reinstatement period for an approved leave of absence taken\n              in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)? Check in 'D.2.2.  Notwithstanding anything to the contrary contained elsewhere in this Policy, the Company reserves the right to revive  \nthe lapsed Policy either on its original terms and conditions or on such other or modified terms and conditions as the \nCompany may specify or to reject the Revival . If needed the Company may refer it to its medical examiner in decid ing \non Revival  of lapse d Policy. Subject to the provisions of Clauses D.2.1 above, the Revival  shall come into effect on the \ndate when the Company specifically communicates it in writing to the Policyholder.  \n \nD.2.3  If the Policy is not revived for full Benefits before the Policy  Maturity Date but wit
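
The prompt above pastes the retrieved text inline after the question. A template that separates context from question and tells the model what to do when the context lacks the answer may reduce unsupported answers; this matters here because the retrieved clauses shown above do not mention USERRA. The sketch below is one possible layout. The function name and the instruction wording are assumptions, not part of the original notebook.

```python
def build_messages(question, context_chunks):
    """Assemble chat messages that keep retrieved context separate from the question."""
    context = "\n\n".join(
        f"[Source {i}]\n{chunk.strip()}" for i, chunk in enumerate(context_chunks, start=1)
    )
    system = (
        "You are an assistant answering questions about insurance policy documents. "
        "Answer only from the provided context and cite the source number. "
        "If the context does not contain the answer, say so."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Shortened example chunks for illustration.
demo = build_messages(
    "What is the free-look period?",
    ["The Policyholder shall have a period of 15 days ... ", "D.2.3 If the Policy is not revived ..."],
)
print(demo[1]["content"])
```

The returned list has the same shape as `messages` and could be passed to `openai.chat.completions.create` unchanged.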

In [None]:
response2=openai.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=messages
)
response2.choices[0].message.content

"In the context of an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA), the provisions that may allow for a longer reinstatement period include the following:\n\n1. Notwithstanding anything to the contrary contained elsewhere in the policy, the company reserves the right to revive a lapsed policy on its original terms and conditions or on modified terms as specified by the company. The decision on revival may involve consultation with the company's medical examiner.\n\n2. If the policy is not revived for full benefits before the policy maturity date but within five years from the due date for payment of the first unpaid premium and if the policy has not acquired guaranteed surrender value, then the policy will terminate.\n\n3. Non-forfeiture options may also provide flexibility in reinstating the policy, with specific conditions outlined in the policy servicing-related aspects.\n\nThese provisions offer a 

3.2 Recommendations on How to improve further



*   Based on the testing pipeline's feedback, develop a strategy for improving the system further.
*   This can be done through building a better/cleaner dataset or using better data pre-processing techniques.
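
As one example of pre-processing, the extracted page text shown earlier contains doubled spaces and stray blank lines from PDF extraction. The sketch below normalizes whitespace before indexing. It is an assumption about what "cleaner" could mean here, and it does not repair words split by extraction (such as "decid ing"), which would need a different approach.

```python
import re

def clean_text(text):
    """Collapse runs of spaces/tabs and excess blank lines left by PDF extraction."""
    text = re.sub(r"[ \t]+", " ", text)      # runs of spaces or tabs -> one space
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line in a row
    return text.strip()

raw = "D.2.3  If the Policy  is not revived  \n \n \nwill terminate.  "
cleaned = clean_text(raw)
print(repr(cleaned))
```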
   



Suggestion 1: Using customized nodes and LLMs

This can be used if the responses are inaccurate or poorly summarized.

In [None]:
#import OpenAIEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
#import SentenceSplitter
from llama_index.core.node_parser import SentenceSplitter
#import OpenAI
from llama_index.llms.openai import OpenAI
#import Settings
from llama_index.core import Settings

# Initialize the OpenAI model (the attribute is `llm`; assigning to `lm` is silently ignored)
Settings.llm=OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)

#Initialize the embedding model
Settings.embed_model=OpenAIEmbedding()

#Initialize the node_parser with custom node settings
Settings.node_parser=SentenceSplitter(chunk_size=512, chunk_overlap=20)

# Initialize the num_output and context window
Settings.num_output=512
Settings.context_window=3900

# Create a VectorStoreIndex from the documents using the global Settings configured above
index=VectorStoreIndex.from_documents(documents)

# Initialize a query engine that retrieves the top-k (here 3) most similar nodes
query_engine=index.as_query_engine(similarity_top_k=3)
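
The `chunk_size` and `chunk_overlap` settings control how much text goes into each node and how much neighbouring nodes share. The toy splitter below illustrates the idea with whole words. It is a simplification under stated assumptions: the real `SentenceSplitter` counts tokens and prefers sentence boundaries, which this sketch does not attempt.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into word windows of chunk_size that share chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks

text = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(text, chunk_size=4, chunk_overlap=1)
print(pieces)
```

Smaller chunks tend to give more focused retrieval hits, while the overlap reduces the chance that a clause is cut in half at a chunk boundary.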


In [None]:
#Query the engine with specific question

response=query_engine.query("""What provisions may allow for a longer reinstatement period for an approved leave of absence taken
              in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA)?""")

In [None]:
response.response

'The provisions that may allow for a longer reinstatement period for an approved leave of absence taken in accordance with the Uniformed Services Employment and Reemployment Rights Act of 1994 (USERRA) are those that specify the conditions under which a lapsed policy can be revived.'