### Part 1

***Problem Statement*** - The goal of the project is to build a RAG system using frameworks such as LlamaIndex or LangChain.


This starter notebook contains the general steps to create an RAG using LlamaIndex framework. Feel free to modify the code as per your requirement.

**Step 1** : Import the necessary libraries

In [1]:
# Install OpenAI, LlamaIndex
!pip install -U -qq llama-index openai

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/389.8 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m389.8/389.8 kB[0m [31m18.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.6 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m1.6/1.6 MB[0m [31m115.1 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m41.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m45.6 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m195.8/195.8 kB[0m [31m13.4 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m298.0/298.0 kB[0m [31m19.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [2]:
## Install additional supporting libraries as required
import nest_asyncio
nest_asyncio.apply()

In [3]:
# Importing the libraries
from llama_index.llms.openai import OpenAI
from llama_index.core.llms import ChatMessage
import os
import openai
import pandas as pd

**Step 2**: Mount your Google Drive and Set the API key

In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

Mounted at /content/drive


In [5]:
# Read the API key from the text file and strip any leading or trailing whitespace
with open("/content/OPENAI_API_KEY.txt", "r") as f:
    api_key = f.read().strip()

# Set the API key for OpenAI
openai.api_key = api_key

Step 3 - Data Loading

Dataset:

- For HelpMate AI project, the insurance documents can be downloaded from the module 'RAG Demonstration' or download the files from the following [link](https://cdn.upgrad.com/uploads/production/8e278245-506c-4c8c-9246-892280692919/Policy+Documents.zip)

- For BYOP project, you may create your own dataset or open-source datasets from [Kaggle](www.kaggle.com)

Use the appropriate document loader for loading the documents.

**NOTE** - No matter how powerful the given data loader is, ensure that your file is properly formatted, and the loader is able to read the file clearly else the query engine might fail.

In [7]:
# Import the necessary Reader
from llama_index.core import SimpleDirectoryReader

# Let us take input from a directory
reader = SimpleDirectoryReader(input_dir="/content/drive/MyDrive/InsuranceAI/")

# Use the load_data() method to read the files from the directory
documents2 = reader.load_data()
# number of files
print(f"Loaded {len(documents2)} docs")

Loaded 64 docs


In [None]:
documents2[0]

Document(id_='7e51290c-91fa-4525-9e3a-0223707a4ea4', embedding=None, metadata={'page_label': '1', 'file_name': 'HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/Insurance doc/HDFC-Life-Easy-Health-101N110V03-Policy-Bond-Single-Pay.pdf', 'file_type': 'application/pdf', 'file_size': 1303156, 'creation_date': '2024-05-27', 'last_modified_date': '2024-05-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text=' \n             Part A \n<<Date>> \n<<Policyholder’s Name>>  \n<<Policyholder’s Address>> \n<<Policyholder’s Contact Number>>  \n \nDear <<Policyholder’s Name>>,  \n \nSub: Your Policy no. <<  >> \nWe are glad to inform you that your proposal has been accepted and the HDFC Life Easy Health 

Step 4 - Building the query engine

The general process for creating the query_engine is:
- Load the documents
- Create nodes from the documents
- Create index from documents
- Initialise the Query Engine
- Query the index with the prompt
- Generate the response using the retrieved nodes

In [21]:
# Import the necessary libraries
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core import VectorStoreIndex
from IPython.display import display, HTML

# create parser and parse document into nodes
parser = SimpleNodeParser.from_defaults()
nodes = parser.get_nodes_from_documents(documents2)

# # build index
index = VectorStoreIndex(nodes)

# Construct Query Engine
query_engine = index.as_query_engine()
# Perform query operation and generate the response
query_response = query_engine.query("What are all the conditions for termination of Member Life Insurance?")

In [22]:
query_response.response

"The conditions for termination of Member Life Insurance are: \n- The date the Group Policy is terminated\n- The date the last premium is paid for the Member's insurance\n- Any requested date before a specified date\n- The date the Member ceases to be a Member as defined in PART I\n- The date the Member ceases to be in a class for which Member Life Insurance is provided\n- The date the Member retires\n- The date the Member ceases Active Work."

In [10]:
query_response.metadata

{'54ab4098-de4b-473d-9ba3-15f264b10af2': {'page_label': '37',
  'file_name': 'Principal-Sample-Life-Insurance-Policy.pdf',
  'file_path': '/content/drive/MyDrive/InsuranceAI/Principal-Sample-Life-Insurance-Policy.pdf',
  'file_type': 'application/pdf',
  'file_size': 222772,
  'creation_date': '2024-12-04',
  'last_modified_date': '2024-12-04'},
 '634e3ac5-ca2e-4c46-82b8-113bafc87c54': {'page_label': '23',
  'file_name': 'Principal-Sample-Life-Insurance-Policy.pdf',
  'file_path': '/content/drive/MyDrive/InsuranceAI/Principal-Sample-Life-Insurance-Policy.pdf',
  'file_type': 'application/pdf',
  'file_size': 222772,
  'creation_date': '2024-12-04',
  'last_modified_date': '2024-12-04'}}

Step 5 - Creating a Response Pipeline

A Query Response pipeline encapsulates all the necssary steps to build a RAG pipeline. Modify the functions `query_response` and `initialize_conv()`  below. The `query_response` functions return the query response from the query engine along with the supporting documents and the `initialize_conv()` function creates an interactive chatbot.

In [None]:
# Streaming
#query_engine = index.as_query_engine(streaming=True)
#streaming_response = query_engine.query("What is the policy number of HDFC SURGICARE PLAN")
#streaming_response.print_response_stream()

The policy number of HDFC SURGICARE PLAN is 10123654.

In [31]:
## Query response function
def query_response(user_input):
    """
    Generate a response based on user input by querying the query engine and
    retrieving metadata from the source nodes.

    Args:
    user_input (str): The input query provided by the user.

    Returns:
    final_response (str): The final response generated by the query engine, including a
    reference to the source file names and page numbers.
    """
    response = query_engine.query(user_input)
    file_name = response.source_nodes[0].node.metadata['file_name'] + "page nos: " + response.source_nodes[0].node.metadata['page_label'] + ", " + response.source_nodes[1].node.metadata['page_label']
    final_response = response.response + '\n Check further at ' + file_name
    return final_response

In [32]:
def initialize_conv():
    """
    Initialize a conversation with the user, allowing them to ask questions
    about the policy documents. The user can type 'exit' to end the
    conversation.

    The function continuously prompts the user for input, processes the input
    using the query_response function, and displays the response. The loop
    terminates when the user types 'exit'.
    """
    print('Feel free to ask Questions regarding PRINCIPAL LIFE INSURANCE COMPANY insurance plans. Press exit once you are done')
    while True:
      user_input = input()
      # Type 'exit' to exit conversation
      if user_input.lower() == 'exit':
        print('Exiting the program... bye')
        break
      else:
        response = query_response(user_input)
        display(HTML(f'<p style="font-size:20px">{response}</p>'))


In [20]:
initialize_conv()

Feel free to ask Questions regarding PRINCIPAL LIFE INSURANCE COMPANY insurance plans. Press exit once you are done
How many insurance plans available in PRINCIPAL LIFE INSURANCE COMPANY


What is cessation in Group term life policy?


Policyholder Eligibility Requirements


what are Premium Rates?


exit
Exiting the program... bye



**Step 7** - Build a Testing Pipeline

Here we feed a series of questions to the Q/A bot and store the responses along with the feedback on whether it's accurate or not from the user

Create atleast 5 questions and store them in the `questions` list to be queried by the RAG system using the `testing_pipeline` function.

In [33]:
questions = ['What are  Premium Rates', "What are the Prior Policy  details?","what are Policy Termination conditions?"]

In [38]:
def testing_pipeline(questions):

    """
    Conduct a testing pipeline for a series of questions, collecting user
    feedback on the responses provided by the bot.

    Args:
    questions (list): A list of questions to be tested.

    Returns:
    pd.DataFrame: A DataFrame containing the questions, their corresponding
        responses, the page number from the response, and the user feedback
        indicating whether the response was good or bad.
    """
    test_feedback  = []
    for i in questions:
      print(i)
      print(query_response(i))
      print('\n Please provide your feedback on the response provided by the bot. ("GOOD"/"BAD")')
      user_input = input()
      page = query_response(i).split()[-1]
      test_feedback.append((i,query_response(i),page,user_input))

    feedback_df = pd.DataFrame(test_feedback, columns =['Question', 'Response', 'Page','Feedback'])
    return feedback_df


In [39]:
testing_pipeline(questions)

What are  Premium Rates
Premium rates for this policy are as follows:
- Member Life Insurance: $0.210 for each $1,000 of insurance in force.
- Member Accidental Death and Dismemberment Insurance: $0.025 for each $1,000 of Member Life Insurance in force.
- Dependent Life Insurance: $1.46 for each Member insured for Dependent Life Insurance.
 Check further at Principal-Sample-Life-Insurance-Policy.pdfpage nos: 20, 21

 Please provide your feedback on the response provided by the bot. ("GOOD"/"BAD")
GOOD
What are the Prior Policy  details?
The prior policy details are not explicitly mentioned in the provided context information.
 Check further at Principal-Sample-Life-Insurance-Policy.pdfpage nos: 19, 18

 Please provide your feedback on the response provided by the bot. ("GOOD"/"BAD")
GOOD
what are Policy Termination conditions?
The Policy Termination conditions include termination due to failure to pay the premium within the Grace Period, termination rights of the Policyholder to termin

Unnamed: 0,Question,Response,Page,Feedback
0,What are Premium Rates,The premium rates for each Member insured for ...,21,GOOD
1,What are the Prior Policy details?,The prior policy details are not explicitly me...,18,GOOD
2,what are Policy Termination conditions?,The Policy Termination conditions include term...,23,GOOD
