# **Semantic Spotter Project Submission by Satya Prakash - C58 Batch**

# Option 1: Build a RAG System - RAG-Based Chatbot Implemented with LangChain to Search a Life Insurance Document
## Overview
This project builds a Retrieval-Augmented Generation (RAG) chatbot using LangChain to automate the process of extracting and answering questions based on life insurance policies. The system combines document processing, vector-based retrieval, caching, and generative AI to generate concise answers efficiently. LangChain simplifies the integration of these components by providing a modular framework to connect document loaders, vector databases, language models, and custom chains.

The chatbot extracts information from policy documents, stores it in ChromaDB (a vector store), and uses OpenAI’s GPT-3.5-turbo to generate answers. With LangChain’s retrieval and caching mechanisms, the system can optimize responses for both complex queries and repeated questions.

## Problem Statement
Insurance documents are long, complex, and challenging to navigate. Customers or employees seeking specific policy details (such as benefits or claims procedures) often struggle to find the relevant sections quickly. Searching through large PDFs manually is time-consuming and prone to errors, leading to frustration for both internal staff and customers.

The goal of this project is to automate the retrieval and question-answering process using RAG-based techniques. The system needs to:

- Extract relevant sections from insurance documents.
- Provide accurate and concise answers to user queries.
- Optimize response time by leveraging a caching mechanism for repeated queries.
- Use generative AI (GPT-3.5) to answer complex, context-dependent questions.
- Ensure that answers are well-cited, linking back to specific sections or pages from the source documents.

## Architecture
The following is the design of the system:

1. PDF Processing and Document Chunking:
  - Use LangChain’s PyPDFLoader to extract text from insurance policy PDFs.
  - Chunk large documents into manageable pieces to ensure relevant information is indexed correctly.
2. Embedding and Vector Store Setup:
  - Generate text embeddings using OpenAI’s text-embedding-ada-002 model.
  - Store these embeddings in a persistent Chroma vector database for efficient retrieval.
3. Retrieval and Caching:
  - Implement a retriever that searches for relevant sections based on user queries using similarity search.
  - Incorporate a cache mechanism to store frequent queries and minimize redundant computation.
4. Prompt Engineering and Response Generation:
  - Use LLMChain to connect the retrieved information with a structured prompt for GPT-3.5-turbo.
  - Ensure the generated responses are concise, informative, and cited, including references to specific policy names and pages.
5. User-Friendly Query Handling:
  - Provide clear and actionable answers to user questions (e.g., "What are the policy benefits for accidental death?").
  - Support both first-time queries and cached queries to enhance the user experience.

## 1. Install and Import the Required Libraries

In [1]:
# Install all the required libraries

!pip install langchain openai pdfplumber sentence-transformers tiktoken pypdf langchain-community

Collecting langchain
  Downloading langchain-0.3.4-py3-none-any.whl.metadata (7.1 kB)
Collecting openai
  Downloading openai-1.52.1-py3-none-any.whl.metadata (24 kB)
Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.2.1-py3-none-any.whl.metadata (10 kB)
Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting pypdf
  Downloading pypdf-5.0.1-py3-none-any.whl.metadata (7.4 kB)
Collecting langchain-community
  Downloading langchain_community-0.3.3-py3-none-any.whl.metadata (2.8 kB)
Collecting langchain-core<0.4.0,>=0.3.12 (from langchain)
  Downloading langchain_core-0.3.12-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-text-splitters<0.4.0,>=0.3.0 (from langchain)
  Dow

In [2]:
!pip install chromaDB==0.5.3

Collecting chromaDB==0.5.3
  Downloading chromadb-0.5.3-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromaDB==0.5.3)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.3 (from chromaDB==0.5.3)
  Downloading chroma_hnswlib-0.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromaDB==0.5.3)
  Downloading fastapi-0.115.3-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromaDB==0.5.3)
  Downloading uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromaDB==0.5.3)
  Downloading posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromaDB==0.5.3)
  Downloading onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromaDB==0.5.3)
  Downlo

In [3]:
# Import all the required Libraries

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA, LLMChain
from langchain.llms import OpenAI
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate
from sentence_transformers import CrossEncoder
import os


In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [5]:
os.chdir('/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project')
!ls

OPENAI_API_Key.txt			    Semantic_Spotter_RAG_Langchain_Satya_Prakash.ipynb
Principal-Sample-Life-Insurance-Policy.pdf


In [6]:
# Set OpenAI API key (strip whitespace so a trailing newline in the file does not end up in the key)
with open("/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/OPENAI_API_Key.txt", "r") as f:
    openai_api_key = f.read().strip()

## 2. Read, Process, and Chunk the PDF File

In [7]:
# Load and process PDFs using LangChain's PyPDFLoader
pdf_directory = "/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/"
pdf_loader = PyPDFLoader(pdf_directory + "Principal-Sample-Life-Insurance-Policy.pdf")

In [8]:
# Extract documents from the PDF
documents = pdf_loader.load()

In [9]:
# Use RecursiveCharacterTextSplitter to chunk the text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_documents(documents)
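
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping chunking using a plain fixed-size sliding window. This is a simplification and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to break on separators such as paragraph and line breaks. The function name `chunk_text` and the sample string are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last three characters of the previous one, which is the same idea as the 100-character overlap in the cell above: a sentence that straddles a chunk boundary still appears intact in at least one chunk.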

In [10]:
texts

[Document(metadata={'source': '/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/Principal-Sample-Life-Insurance-Policy.pdf', 'page': 0}, page_content='GROUP POLICY FOR:  \nRHODE ISLAND JOHN DOE  \n \nALL MEMBERS  \nGroup Member Life Insurance  \n \nPrint Date: 07/16/2014  \n DOROTHEA GLAUSE  S655  \nRHODE ISLAND JOHN DOE  01/01/2014  \n711 HIGH STREET   \nGEORGE RI 02903'),
 Document(metadata={'source': '/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/Principal-Sample-Life-Insurance-Policy.pdf', 'page': 1}, page_content='This page left blank intentionally'),
 Document(metadata={'source': '/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/Principal-Sample-Life-Insurance-Policy.pdf', 'page': 2}, page_content='GC 806 VAL   \n POLICY RIDER  \n \nGROUP INSURANCE      \n \n \nPOLICY NO:  S655  \n \nCOVERAGE:  Life  \n EMPLOYER: RHODE ISLAND JOHN DOE   \n Effective on the later of the Date of Issue of this Group Policy or March 1, 2005, the followi

## 3. Generate and Store Embeddings using OpenAI and ChromaDB

In this section, we embed the document chunks with OpenAI's `text-embedding-ada-002` model and store them in a ChromaDB collection.

In [11]:
# Set up OpenAI embeddings
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002", openai_api_key=openai_api_key)


In [12]:
# Initialize Chroma for vector storage with persistent storage path
chroma_persist_path = "/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/ChromaDB_Data"
vector_store = Chroma.from_documents(texts, embedding_model, persist_directory=chroma_persist_path)

In [13]:
# Save the vector store for future use
vector_store.persist()


In [14]:
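# Implement caching with a second Chroma collection, kept separate from the document chunks
# ("query_cache" is an arbitrary collection name)
cache_store = Chroma(collection_name="query_cache", persist_directory=chroma_persist_path, embedding_function=embedding_model)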


## 4. Semantic Search with Cache

In this section, we run a semantic search of the query against the collection's embeddings to retrieve the most semantically similar chunks.

In [15]:
# Define a retrieval-based question-answering chain with LangChain
retriever = vector_store.as_retriever(search_type="similarity", search_kwargs={"k": 10})
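
As an illustration of what a top-k similarity search does, the sketch below ranks a few vectors against a query vector by cosine similarity. The three-dimensional vectors are hypothetical stand-ins (real `text-embedding-ada-002` vectors have 1536 dimensions), and cosine similarity is used only for illustration; the distance metric Chroma actually applies depends on how the collection is configured.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Sequence[Sequence[float]],
          k: int) -> List[Tuple[int, float]]:
    """Return (chunk index, similarity) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings" for three chunks and one query
chunk_vectors = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0], [0.0, 0.0, 1.0]]
query_vector = [1.0, 0.1, 0.0]
ranked = top_k(query_vector, chunk_vectors, k=2)
print(ranked)
```

With `k=10`, as in the retriever above, the same ranking step simply keeps the ten best-scoring chunks instead of two.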

In [16]:
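# Define query handling with caching
cached_results = {}  # in-memory map: cached query text -> documents retrieved for it

def get_results_with_cache(query: str, max_distance: float = 0.2):
    # Search the cache first. similarity_search_with_score returns (Document, distance)
    # pairs; plain similarity_search returns Documents without any score attribute.
    cache_hits = cache_store.similarity_search_with_score(query, k=1)

    if cache_hits:
        cached_doc, distance = cache_hits[0]
        cached_query = cached_doc.metadata.get("query")
        # A hit needs a close enough match whose results are still held in memory
        if cached_query in cached_results and distance < max_distance:
            print("Found in cache!")
            return cached_results[cached_query]

    print("Not found in cache. Searching main collection...")
    results = retriever.get_relevant_documents(query)
    cache_store.add_texts([query], metadatas=[{"query": query}])
    cached_results[query] = results
    return results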
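
The cache lookup above depends on a distance threshold (0.2): a new query counts as a repeat only if its embedding lies close enough to a previously seen query. Below is a minimal, library-free sketch of that idea. The `SemanticCache` class, the two-dimensional vectors, and the use of Euclidean distance are all assumptions made for illustration, not the notebook's actual storage layer.

```python
import math
from typing import List, Optional, Tuple


class SemanticCache:
    """Stores (query vector, results) pairs and matches new queries by vector distance."""

    def __init__(self, max_distance: float) -> None:
        self.max_distance = max_distance
        self.entries: List[Tuple[List[float], List[str]]] = []

    def store(self, query_vec: List[float], results: List[str]) -> None:
        self.entries.append((query_vec, results))

    def lookup(self, query_vec: List[float]) -> Optional[List[str]]:
        """Return the results of the nearest cached query, or None if it is too far away."""
        best_results: Optional[List[str]] = None
        best_distance = math.inf
        for cached_vec, results in self.entries:
            distance = math.dist(cached_vec, query_vec)  # Euclidean distance
            if distance < best_distance:
                best_results, best_distance = results, distance
        return best_results if best_distance < self.max_distance else None


# Hypothetical toy vectors standing in for query embeddings
cache = SemanticCache(max_distance=0.2)
cache.store([1.0, 0.0], ["chunk about accidental death benefits"])

hit = cache.lookup([0.95, 0.05])   # close to the cached query
miss = cache.lookup([0.0, 1.0])    # far from anything cached
print(hit, miss)
```

The threshold is a trade-off: a larger value serves more queries from the cache but risks returning results for a question that only looks similar.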

In [17]:
# Define a prompt template for LangChain
prompt_template = """
You are a helpful assistant specializing in insurance policies. Answer the user's query accurately using the provided documents.
Query: {query}
Documents: {context}

Respond clearly and concisely, citing relevant policy names and page numbers where appropriate.
"""

prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)
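
To see what the model actually receives, the template placeholders can be filled by hand. The sketch below uses plain `str.format`, which mirrors the brace-style substitution of a default `PromptTemplate`; the context string is a hypothetical placeholder, and the template text is repeated only so the snippet runs on its own.

```python
# Same template text as the cell above, repeated so this snippet is self-contained
template = """
You are a helpful assistant specializing in insurance policies. Answer the user's query accurately using the provided documents.
Query: {query}
Documents: {context}

Respond clearly and concisely, citing relevant policy names and page numbers where appropriate.
"""

# The context value is a placeholder, not real retrieved text
filled = template.format(
    query="What are the policy benefits for accidental death?",
    context="(retrieved chunk text goes here)",
)
print(filled)
```

One consequence of brace-style substitution is that any literal `{` or `}` in the template itself has to be doubled (`{{`, `}}`) so it is not mistaken for a placeholder.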

In [18]:
# Perform a sample query
query = "What are the policy benefits for accidental death?"

In [19]:
# Read the user query

#query = input()

In [20]:
# Get relevant documents (with caching support)
results = get_results_with_cache(query)

Not found in cache. Searching main collection...



In [21]:
results

[Document(metadata={'page': 54, 'source': '/content/drive/MyDrive/Colab_Notebooks/Semantic_Spotter_Project/Principal-Sample-Life-Insurance-Policy.pdf'}, page_content='This policy has been updated effective  January 1, 2014  \n \n      PART IV - BENEFITS  \nGC 6015  Section B - Member Accidental Death and \nDismemberment Insurance, Page 3  \n  \nExposure  \n \nExposure to the elements will be presumed to be an injury if:  \n \na. such exposure is due to an accidental bodily injury; and  \n b. within 365 days after the injury, the Member incurs a loss that is the result of the exposure; and \n c. this Group Policy would have covered the injury resulting from the accident. \n  \nArticle 4 - Seat Belt/Airbag Benefit   \n \nIf the Member loses his or her life as a result of an acciden tal injury sustained while driving or \nriding in an Automobile, an additional benefit of $10,000 will be paid to the beneficiary named \nfor Member Life Insurance, provided all Benefit Qualifications as descr

In [22]:
# Format the retrieved documents into a context string
context = "\n\n".join([doc.page_content for doc in results])
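
The join above keeps only `page_content`, so the page numbers stored in each document's metadata never reach the prompt, even though the prompt asks for page citations. One possible variation, sketched below, prefixes each chunk with a page label. The `Doc` dataclass is a hypothetical stand-in with the same two attributes as a LangChain `Document`, and the `+ 1` assumes the zero-based `page` index seen in the loader output earlier; the printed page numbers inside the PDF may still differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in exposing page_content and metadata like a LangChain Document."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_context(docs: List[Doc]) -> str:
    """Join chunks into one string, labelling each with its (one-based) PDF page."""
    blocks = []
    for doc in docs:
        page = doc.metadata.get("page")
        # Convert the zero-based index to a one-based label; fall back if metadata is missing
        label = f"[page {page + 1}]" if isinstance(page, int) else "[page unknown]"
        blocks.append(f"{label}\n{doc.page_content}")
    return "\n\n".join(blocks)


# Hypothetical sample documents
sample_docs = [
    Doc("Article 4 - Seat Belt/Airbag Benefit ...", {"page": 54}),
    Doc("No page metadata here"),
]
labeled_context = format_context(sample_docs)
print(labeled_context)
```

Passing a labelled string like this as `context` gives the model something concrete to cite, instead of relying on page numbers that happen to appear inside the chunk text.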

## 5. Generate the LLM Output using the GPT-3.5-turbo Model

In this section, we generate the final answer using LangChain chains.

In [23]:
# Initialize the OpenAI chat model
llm = ChatOpenAI(model="gpt-3.5-turbo", openai_api_key=openai_api_key)


In [24]:
# Define an LLMChain to structure the interaction between the LLM and the prompt
llm_chain = LLMChain(prompt=prompt, llm=llm)


In [25]:
# Prepare the inputs for the LLMChain
inputs = {"query": query, "context": context}

In [26]:
# Generate the final response using the LLMChain
response = llm_chain.run(inputs)


In [27]:
import textwrap

# Wrap the response to a width of 150 characters
wrapped_response = textwrap.fill(response, width=150)
print(wrapped_response)

The policy benefits for accidental death under Section B - Member Accidental Death and Dismemberment Insurance include the following:  - Scheduled
Benefit of $10,000 for loss of life (Section B, Page 1) - Additional $10,000 benefit for loss of life in Automobile accidents with seat belt and
airbag (Article 4) - Various percentages of the Scheduled Benefit for specific losses like paralysis, loss of limbs, speech, and hearing (Page 4, Page
5) - Repatriation Benefit of up to $2,000 for transportation of the body (Article 7) - Educational Benefit of $3,000 annually for a maximum of four
years for a Qualified Student (Article 8)  Please refer to Section B of the policy document, specifically Pages 1, 4, and 5 for detailed information
on benefits payable for accidental death.
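
By default `textwrap.fill` replaces newlines with spaces, which is likely why the bullet items in the output above run together on the same lines. A small sketch of one alternative is to wrap each line separately so that existing line breaks survive. The helper name `wrap_preserving_lines` and the sample text are hypothetical.

```python
import textwrap


def wrap_preserving_lines(text: str, width: int = 150) -> str:
    """Wrap each line on its own so the original line breaks and bullets are kept."""
    wrapped_lines = []
    for line in text.splitlines():
        # Indent continuation lines of a bullet so they align under the bullet text
        indent = "  " if line.startswith("- ") else ""
        # textwrap.fill returns "" for an empty line, which preserves blank lines
        wrapped_lines.append(textwrap.fill(line, width=width, subsequent_indent=indent))
    return "\n".join(wrapped_lines)


# Hypothetical sample response with one bullet per line
sample = "Benefits include:\n- Scheduled benefit for loss of life\n- Repatriation benefit"
result = wrap_preserving_lines(sample, width=30)
print(result)
```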
