# Introduction

Welcome to the Mr.HelpMate AI Project! This notebook is designed to build a generative search system
capable of answering questions from an insurance policy document. The project includes three key layers:
embedding, search, and generation. Each layer is thoroughly documented and demonstrated with
examples for clarity.


# Problem Statement

The goal of this project is to process a group life insurance policy document and develop a retrieval-augmented
generation system capable of accurately and effectively answering user queries. The project leverages state-of-the-art
language models and vector databases to achieve this.


## 1) Installing Libraries

In [1]:
pip install PyPDF2

Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install sentence-transformers

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install chromadb

Note: you may need to restart the kernel to use updated packages.


In [4]:
pip install faiss-cpu

Note: you may need to restart the kernel to use updated packages.


In [5]:

pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [6]:
pip install --upgrade openai

Note: you may need to restart the kernel to use updated packages.


In [7]:

pip install diskcache

Note: you may need to restart the kernel to use updated packages.


In [8]:

pip install rich

Note: you may need to restart the kernel to use updated packages.


## 2) Embedding Layer

The EmbeddingLayer class processes the policy document, chunks it into smaller units, and converts
the chunks into dense vector embeddings using a pre-trained model.

In [9]:
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer

In [10]:
class EmbeddingLayer:
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)

    def process_document(self, pdf_path):
        reader = PdfReader(pdf_path)
        text = " ".join([page.extract_text() for page in reader.pages])
        text = " ".join(text.split())  # Normalize whitespace
        return text

    def chunk_text(self, text, strategy="fixed", chunk_size=100):
        if strategy == "fixed":
            words = text.split()
            return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]
        elif strategy == "sentence":
            return text.split(". ")
        elif strategy == "semantic":
            # Placeholder: Replace with advanced semantic chunking logic if needed
            return text.split(". ")
        else:
            raise ValueError("Invalid strategy!")

    def embed_chunks(self, chunks):
        return self.model.encode(chunks, convert_to_tensor=True)

## 3) Search layer

In [11]:

import chromadb
from chromadb.utils import embedding_functions

In [12]:
class SearchLayer:
    def __init__(self, model, db_path="./vector_db"):
        self.model = model
        self.client = chromadb.PersistentClient(path=db_path)
        self.collection = self.client.get_or_create_collection(name="policy_docs")

    def index_chunks(self, chunks, embeddings):
        for idx, (chunk, embedding) in enumerate(zip(chunks, embeddings)):
            self.collection.add(
                documents=[chunk],
                metadatas={"id": idx},
                embeddings=[embedding.tolist()],
                ids=[str(idx)]
            )

    def search(self, query, top_k=3):
        query_embedding = self.model.encode(query)
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )
        return results

## 3) Generation Layer

In [13]:
import openai
from itertools import chain

In [14]:
class GenerationLayer:
    def __init__(self, api_key):
        # Set your OpenAI API key
        openai.api_key = open("OpenAI_API_Key.txt", "r").read().strip()

    def generate_answer(self, query, retrieved_chunks):
        # Flatten the retrieved chunks if they are nested lists
        flattened_chunks = list(chain.from_iterable(retrieved_chunks))

        # Create the system and user messages for the API call
        system_message = "You are a helpful assistant providing concise answers based on the given policy details."
        user_message = f"The policy document contains the following relevant details:\n\n" + " ".join(flattened_chunks) + "\n\nQuestion: {query}\n\nProvide a concise and clear answer."

        # Call the OpenAI API using the correct 'messages' format
        response = openai.chat.completions.create(  # Corrected API call method
            model="gpt-3.5-turbo",  # Specify the model
            messages=[
                {"role": "system", "content": system_message},
                {"role": "user", "content": user_message}
            ],
            max_tokens=100,  # Adjust based on desired response length
            temperature=0.7,  # Adjust creativity
        )

        # Extract the answer from the response correctly
        return response.choices[0].message.content.strip()  # Corrected response handling



## 4) Response Layer

In [15]:
def main(pdf_path, queries):
    # Initialize layers
    embedding_layer = EmbeddingLayer()
    search_layer = SearchLayer(embedding_layer.model)

    # Replace YOUR_API_KEY with your actual OpenAI API key
    generation_layer = GenerationLayer(api_key=open("OpenAI_API_Key.txt", "r").read().strip())

    # Process document
    text = embedding_layer.process_document(pdf_path)
    chunks = embedding_layer.chunk_text(text, strategy="fixed", chunk_size=100)
    embeddings = embedding_layer.embed_chunks(chunks).cpu().numpy()

    # Index chunks
    search_layer.index_chunks(chunks, embeddings)

    # Test queries
    for query in queries:
      print(f"\nQuery: {query}")
      results = search_layer.search(query)

      # Adjusted retrieval of top chunks
      retrieved_chunks = results["documents"]  # Fixed here
      print("Top Retrieved Chunks:\n", retrieved_chunks)

      # Generate answer
      answer = generation_layer.generate_answer(query, retrieved_chunks)
      print("Generated Answer:\n", answer)


In [16]:
if __name__ == "__main__":
    pdf_path = "/Users/vkrkscb/AIMLPGCourseWorkspace/9.97_5_HelpMateAI/LifePolicy.pdf"
    queries = [
        "What are the benefits included under the Group Policy for Life Insurance?",
        "What is the coverage for Dependent Life Insurance?",
        "What is considered a 'Qualifying Event' for Accelerated Benefits under this policy?"
    ]
    main(pdf_path, queries)

Add of existing embedding ID: 0
Add of existing embedding ID: 1
Add of existing embedding ID: 2
Add of existing embedding ID: 3
Add of existing embedding ID: 4
Add of existing embedding ID: 5
Add of existing embedding ID: 6
Add of existing embedding ID: 7
Add of existing embedding ID: 8
Add of existing embedding ID: 9
Add of existing embedding ID: 10
Add of existing embedding ID: 11
Add of existing embedding ID: 12
Add of existing embedding ID: 13
Add of existing embedding ID: 14
Add of existing embedding ID: 15
Add of existing embedding ID: 16
Add of existing embedding ID: 17
Add of existing embedding ID: 18
Add of existing embedding ID: 19
Add of existing embedding ID: 20
Add of existing embedding ID: 21
Add of existing embedding ID: 22
Add of existing embedding ID: 23
Add of existing embedding ID: 24
Add of existing embedding ID: 25
Add of existing embedding ID: 26
Add of existing embedding ID: 27
Add of existing embedding ID: 28
Add of existing embedding ID: 29
Add of existing embe


Query: What are the benefits included under the Group Policy for Life Insurance?
Top Retrieved Chunks:
 [["qualifies and makes timely application, he or she may convert the group coverage by purchasing an individual policy of life insurance under these terms: (1) The Member will not be required to submit Proof of Good Health. (2) The policy will be for life insurance only. No disabilit y or other benefits will be included. (3) The policy will be on one of the forms, other than term insurance, then issued by The Principal to persons in the risk class to which the Member belongs on the individual policy's effective date. (4) Premium will be based on the Member's age", 'Policyholder The entity to whom this Group Policy is issued (see Title Page). Prior Policy The Group Term Life coverage of either: a. the Policyholder; or b. a business entity which has been obtained by the Policyholder through a merger or acquisition; for which this Group Policy is a replacement. Proof of Good Health Wri

In [17]:
if __name__ == "__main__":
    pdf_path = "/Users/vkrkscb/AIMLPGCourseWorkspace/9.97_5_HelpMateAI/LifePolicy.pdf"
    queries = [
        "Who is eligible to enroll in the group life insurance policy, and what are the conditions for enrollment?",
        "What benefits are provided in case of accidental death and dismemberment under this policy?",
        "What events are considered 'Qualifying Events' under the policy for making changes to coverage?"
    ]
    main(pdf_path, queries)

Add of existing embedding ID: 0
Insert of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7
Add of existing embedding ID: 8
Insert of existing embedding ID: 8
Add of existing embedding ID: 9
Insert of existing embedding ID: 9
Add of existing embedding ID: 10
Insert of existing embedding ID: 10
Add of existing embedding ID: 11
Insert of existing embedding ID: 11
Add of existing embedding ID: 12
Insert of existing embedding ID: 12
Add of existing embedding ID: 13
Insert of existing embedding ID: 13
Add of existing embedding ID: 14
Insert of existing em


Query: Who is eligible to enroll in the group life insurance policy, and what are the conditions for enrollment?
Top Retrieved Chunks:
 [["qualifies and makes timely application, he or she may convert the group coverage by purchasing an individual policy of life insurance under these terms: (1) The Member will not be required to submit Proof of Good Health. (2) The policy will be for life insurance only. No disabilit y or other benefits will be included. (3) The policy will be on one of the forms, other than term insurance, then issued by The Principal to persons in the risk class to which the Member belongs on the individual policy's effective date. (4) Premium will be based on the Member's age", "for Dependent Life Insurance on the latest of: a. the date the person is eligible for Member Life Insurance; or b. the date the person first acquires a Dependent; or c. the date the person enters a class for which Dependent Life Insurance is provided under this Group Policy; or d. the date 

In [18]:
if __name__ == "__main__":
    pdf_path = "/Users/vkrkscb/AIMLPGCourseWorkspace/9.97_5_HelpMateAI/LifePolicy.pdf"
    queries = [
        "When does the life insurance coverage become effective for a new member?",
        "Are there any requirements or details about premium contributions under this policy?",
        "Under what circumstances does the life insurance coverage terminate for a member?"
    ]
    main(pdf_path, queries)

Add of existing embedding ID: 0
Insert of existing embedding ID: 0
Add of existing embedding ID: 1
Insert of existing embedding ID: 1
Add of existing embedding ID: 2
Insert of existing embedding ID: 2
Add of existing embedding ID: 3
Insert of existing embedding ID: 3
Add of existing embedding ID: 4
Insert of existing embedding ID: 4
Add of existing embedding ID: 5
Insert of existing embedding ID: 5
Add of existing embedding ID: 6
Insert of existing embedding ID: 6
Add of existing embedding ID: 7
Insert of existing embedding ID: 7
Add of existing embedding ID: 8
Insert of existing embedding ID: 8
Add of existing embedding ID: 9
Insert of existing embedding ID: 9
Add of existing embedding ID: 10
Insert of existing embedding ID: 10
Add of existing embedding ID: 11
Insert of existing embedding ID: 11
Add of existing embedding ID: 12
Insert of existing embedding ID: 12
Add of existing embedding ID: 13
Insert of existing embedding ID: 13
Add of existing embedding ID: 14
Insert of existing em


Query: When does the life insurance coverage become effective for a new member?
Top Retrieved Chunks:
 [["coverage or coverages provided by the Dependent's employer, the date such coverage terminates because the Dependent is no longer eligible unde r his/her employer's coverage will be considered the date the Member first acquires that Dependent (and any other Dependent who was also covered under such group coverage or coverages). This policy has been updated effective January 1, 2014 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS GC 6007 Section B - Effective Dates, Page 1 Section B - Effective Dates Article 1 - Member Life Insurance a. Actively at Work A Member's effective date for Member Life Insurance will be", 'renew at the applicable premium rates in effect on the Policy Anniversary. This policy has been updated effective January 1, 2014 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS GC 6006 Section A - Eligibility, Page 1 PART III - INDIVIDUAL REQUIREMENTS AND RIGHTS Section A - 