<a href="https://colab.research.google.com/github/sunilkumarrudragada/Semantic_Spotter/blob/master/Semantic_Spotter_Sunil.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 🧠 Problem Statement

Insurance policy documents are often long, complex, and filled with technical jargon that makes it challenging for customers and agents to find specific information such as claim limits, exclusions, coverage details, and waiting periods. Traditional keyword-based searches are insufficient, as they fail to capture the contextual meaning of user queries.

To address this problem, **IntelliPolicy** aims to build a **Retrieval-Augmented Generation (RAG)** system that can accurately answer natural language questions from multiple insurance policy documents. The system retrieves the most relevant sections from the documents and generates concise, context-aware responses using a Large Language Model (LLM).

---

### 🎯 Objective

The main objective of this project is to:
- Enable users to query large insurance documents conversationally.  
- Retrieve and summarize relevant policy details automatically.  
- Provide accurate, context-based answers in plain language.

---

## 🚀 Project Goals

- To build an intelligent question-answering system for insurance policy documents.  
- To enable users to extract specific information such as claim limits, exclusions, and waiting periods using natural language.  
- To demonstrate the practical application of Retrieval-Augmented Generation (RAG) using LangChain.  
- To showcase the integration of embeddings, vector stores, and LLMs in a real-world use case.  

---

## 📂 Data Sources

The project uses **sample insurance policy PDFs** stored in **Google Drive**.  
These documents include general insurance policies (e.g., health, life, or term insurance) containing coverage details, exclusions, and claim procedures.  

The documents are processed using LangChain's `PyPDFLoader` to extract textual content, which is later split into smaller chunks for embedding and retrieval.

**Data Processing Workflow:**
1. Upload the policy PDF(s) to Google Drive.  
2. Load and parse using `PyPDFLoader`.  
3. Split the text into semantically meaningful chunks using `RecursiveCharacterTextSplitter`.  
4. Generate embeddings using `OpenAIEmbeddings` (model: `text-embedding-3-small`).  
5. Store the embeddings in a vector database (Chroma) for fast and accurate retrieval.

---

### 🤖 Why LangChain?

**LangChain** is chosen as the core framework because it provides modular tools to seamlessly integrate document retrieval and generative AI. Specifically, LangChain supports:
- **Document Loaders** to extract text from PDFs.  
- **Text Splitters** to process long documents efficiently.  
- **Embeddings and Vector Stores** (like FAISS or Chroma) for semantic search.  
- **Retrieval and Generation Chains** for building the end-to-end RAG pipeline.  

By combining these components, LangChain allows us to design a scalable, modular, and efficient generative question-answering system tailored for the insurance domain.


## 🏗️ Overall System Design

The **IntelliPolicy** system is designed as an end-to-end **Retrieval-Augmented Generation (RAG)** pipeline that retrieves relevant information from insurance policy documents and generates precise, context-based answers using a Large Language Model (LLM).

---

### 🔹 System Workflow

The overall workflow consists of the following major stages:

1. **Document Loading**  
   - Load insurance policy PDFs using LangChain's `PyPDFLoader`.  
   - Each document is parsed into structured text data.

2. **Text Splitting**  
   - The extracted text is divided into smaller, overlapping chunks using `RecursiveCharacterTextSplitter`.  
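   - Overlap keeps text that straddles a chunk boundary available in both neighbouring chunks, which helps preserve context (the notebook uses `chunk_size=1000` and `chunk_overlap=200`).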

3. **Embedding Generation**  
   - Each text chunk is transformed into a high-dimensional vector representation using `OpenAIEmbeddings`.  
   - These embeddings help capture semantic meaning for efficient retrieval.

4. **Vector Store Creation**  
   - All embeddings are stored in a vector database (like **ChromaDB**) for fast and accurate similarity search.

5. **Retriever Setup**  
   - The retriever fetches the top-k most similar chunks for a given query based on vector similarity (`k=3` in this notebook).

6. **LLM Integration (RAG Pipeline)**  
   - The retrieved context is passed to the LLM (`ChatOpenAI`) to generate a concise and context-aware answer.  
   - LangChain's `RetrievalQA` chain is used to connect the retriever with the LLM.
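
The retrieval idea behind steps 3 to 5 can be illustrated without any external service. The sketch below is a toy stand-in, not the notebook's implementation: the three-dimensional vectors are made-up placeholders for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helper names. It ranks chunks by cosine similarity to a query vector, one common similarity measure; the vector store may use a different distance internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed_chunks, k=3):
    """Return the k (chunk_text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in indexed_chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up 3-d vectors standing in for real embeddings (assumption).
toy_index = [
    ("Coverage terminates when the policy ends.", [0.9, 0.1, 0.0]),
    ("Premiums are due on the first of each month.", [0.1, 0.9, 0.1]),
    ("Members must be actively at work to enroll.", [0.2, 0.2, 0.9]),
]
toy_query = [0.8, 0.2, 0.1]  # pretend embedding of "When does coverage end?"

ranked = top_k(toy_query, toy_index, k=2)
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

In the real pipeline the vectors come from `OpenAIEmbeddings` and the ranking is delegated to the vector store, so only the general idea carries over.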

---

### 🔹 System Architecture Diagram

```text
+----------------------------------+
| Insurance Policy PDFs            |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| Document Loader                  |
| (LangChain PyPDFLoader)          |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| Text Splitter                    |
| (RecursiveCharacterTextSplitter) |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| Embedding Generator              |
| (OpenAI)                         |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| Vector Store                     |
| (Chroma)                         |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| Retriever + LLM                  |
| (LangChain RetrievalQA)          |
+----------------+-----------------+
                 |
                 v
+----------------------------------+
| User Query + Response            |
+----------------------------------+
```


![System Design Flowchart](https://drive.google.com/uc?export=view&id=1ewWZ54BwfVSAdZfmc1yqmveUvYT1vhv7)


### 🧩 Design Choices

| **Component** | **Choice** | **Reason** |
|----------------|-------------|-------------|
| **Framework** | LangChain | Provides modular, scalable RAG pipelines with retrievers and chains. |
| **Vector Store** | ChromaDB | Lightweight, open-source, and easy to persist locally. |
| **LLM Model** | GPT-3.5 Turbo | Reliable, cost-effective, and accurate for contextual question answering. |
| **Embedding Model** | text-embedding-3-small | Efficient embedding generation for semantic similarity. |
| **Chunking Strategy** | RecursiveCharacterTextSplitter | Maintains sentence continuity with overlapping chunks. |
| **Memory** | Excluded | Each query is independent; memory is unnecessary for single-turn Q&A. |


### ⚙️ Implementation Highlights

| **Aspect** | **Description** |
|-------------|-----------------|
| **Code Modularity** | Each component (loading, splitting, embedding, retrieval, and generation) is implemented independently for better maintainability and debugging. |
| **Flexibility** | The system supports multiple policy documents and can be easily extended to include metadata-based filters or additional vector stores. |
| **Transparency** | All intermediate outputs (e.g., embeddings, retrieval results) can be inspected for performance tuning and optimization. |
| **Reproducibility** | The entire pipeline runs in Google Colab, so evaluators can re-execute it end to end. Because the LLM is sampled at `temperature=0.3`, the wording of answers may vary slightly between runs. |
| **Output Formatting** | The `pretty_print_rag_result()` function neatly formats the results, making them screenshot-ready for documentation and reports. |


### 💡 Challenges Faced

| **Challenge** | **Description** |
|----------------|-----------------|
| **PDF Text Extraction** | Extracting clean, structured text from complex insurance PDFs containing tables, footnotes, and irregular formatting was challenging. |
| **Chunk Size Optimization** | Balancing the chunk size and overlap to retain context while staying within token limits required multiple iterations. |
| **Embedding Performance** | Generating embeddings for large documents using OpenAI's API introduced latency and cost considerations. |
| **Retrieval Accuracy** | Ensuring that the retriever returns contextually relevant and precise chunks for reliable LLM responses was crucial. |
| **Model Latency & Rate Limits** | API rate limits and occasional delays had to be managed when processing multiple queries in Colab during testing. |
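
One common way to cope with rate limits is to retry with exponential backoff. The notebook itself does not include retry logic, so the following is only a minimal sketch under assumptions: `call_with_backoff` and `flaky_call` are hypothetical names, and a real version would catch the API client's specific rate-limit exception rather than every `Exception`.

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait base_delay * 2**attempt (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # assumption: in practice, catch only the rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            jitter = random.uniform(0, 0.1 * base_delay)
            sleep(base_delay * (2 ** attempt) + jitter)

# Stand-in for an API call that fails twice, then succeeds.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(flaky_call, base_delay=0.01, sleep=waits.append)
print(outcome, attempts["count"], len(waits))
```

Passing `sleep=waits.append` keeps the demonstration instant; with the default `time.sleep` the helper would pause between attempts.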


## 📘 How to Run the Project

1. Open `Semantic_Spotter_Sunil.ipynb` in **Google Colab**.  
2. Mount Google Drive and update the path to your insurance PDF file.  
3. Run all code cells sequentially:
   - Install dependencies  
   - Load and split PDF  
   - Create embeddings and vector store  
   - Build the RAG pipeline  
   - Run sample queries  
4. Use the `pretty_print_rag_result()` function to view formatted answers.  
5. Capture the sample outputs as screenshots for your project report.


In [20]:
# Installing required libraries
!pip install langchain langchain_community langchain-openai chromadb tiktoken pdfplumber pypdf

Collecting langchain-openai
  Downloading langchain_openai-1.0.1-py3-none-any.whl.metadata (1.8 kB)
INFO: pip is looking at multiple versions of langchain-openai to determine which version is compatible with other requirements. This could take a while.
  Downloading langchain_openai-1.0.0-py3-none-any.whl.metadata (1.8 kB)
  Downloading langchain_openai-0.3.35-py3-none-any.whl.metadata (2.4 kB)
Downloading langchain_openai-0.3.35-py3-none-any.whl (75 kB)
Installing collected packages: langchain-openai
Successfully installed langchain-openai-0.3.35


In [6]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


In [7]:
# Define the path of the PDF
pdf_path = '/content/drive/MyDrive/GenAi_Prac/HelpMate/Project_deps/Principal-Sample-Life-Insurance-Policy.pdf'

In [5]:
from google.colab import userdata
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

**1. Imports**

In [30]:
# Required imports (package-specific paths instead of the deprecated `langchain.*` shims)
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
import textwrap

**2. Document Loading**

In [14]:
loader = PyPDFLoader(pdf_path)
documents = loader.load()
print(len(documents), "pages loaded")

64 pages loaded


**3. Split Text into Chunks**

In [16]:
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)
print(len(chunks), "chunks created")

150 chunks created
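
To make the effect of `chunk_size=1000` and `chunk_overlap=200` concrete, here is a simplified sketch of overlapping windows. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter also tries to break on separators such as paragraph and line breaks and handles each loaded page separately, so its boundaries and chunk counts differ. `split_with_overlap` is a hypothetical helper name.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each window starts 800 characters after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "".join(str(i % 10) for i in range(2500))  # 2,500 placeholder characters
pieces = split_with_overlap(sample)
print([len(p) for p in pieces])
print(pieces[0][-200:] == pieces[1][:200])  # neighbouring chunks share 200 characters
```

With a step of 800 characters, a 2,500-character text yields windows starting at 0, 800, and 1600, so the last one is shorter than the rest.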


**4. Create Embeddings and Vector Store**

In [23]:
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=OPENAI_API_KEY
)

db = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./vector_store"
)
# Chroma 0.4+ writes to persist_directory automatically,
# so the deprecated db.persist() call is omitted here.

print("✅ Vector store created successfully!")

✅ Vector store created successfully!




**5. RAG (Retrieval + QA) Pipeline**

In [28]:
qa_model = ChatOpenAI(
    model="gpt-3.5-turbo",
    temperature=0.3,
    openai_api_key=OPENAI_API_KEY
)
retriever = db.as_retriever(search_kwargs={"k": 3})

rag_pipeline = RetrievalQA.from_chain_type(
    llm=qa_model,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)
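
With `chain_type="stuff"`, all retrieved chunks are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates that idea only; the template wording is an assumption and not LangChain's actual prompt, and `build_stuff_prompt` is a hypothetical helper.

```python
def build_stuff_prompt(question, retrieved_chunks):
    """Concatenate all retrieved chunks into one context block ahead of the question."""
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Placeholder chunks standing in for the k=3 retrieved passages (assumption).
example_chunks = [
    "Placeholder chunk one.",
    "Placeholder chunk two.",
    "Placeholder chunk three.",
]
prompt = build_stuff_prompt("When does coverage terminate?", example_chunks)
print(prompt)
```

Because every chunk ends up in one prompt, `k` and `chunk_size` together determine how much of the model's context window each query consumes.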

**6. Questions**

In [36]:
def pretty_print_rag_result(result, query, width=100):
    """
    Nicely formats and prints the RAG response (question and answer)
    so it fits in a single Colab page for screenshots.
    """
    print("=" * 120)
    print("🧠 INTELLIPOLICY: INSURANCE POLICY Q&A SYSTEM")
    print("=" * 120)
    print(f"\n🔹 Question:\n{textwrap.fill(query, width=width)}")

    print("\n💬 Assistant Answer:\n")
    print(textwrap.fill(result['result'], width=width))

    print("\n" + "=" * 120 + "\n✅ Response Generated Successfully!\n" + "=" * 120)


In [39]:
query = "What are the eligibility requirements for employees under this group insurance policy?"
result = rag_pipeline.invoke({"query": query})
pretty_print_rag_result(result, query)

🧠 INTELLIPOLICY: INSURANCE POLICY Q&A SYSTEM

🔹 Question:
What are the eligibility requirements for employees under this group insurance policy?

💬 Assistant Answer:

Employees must enroll in the insurance policy to be eligible for coverage. At least 75% of all
eligible employees must enroll to maintain eligibility under this group insurance policy.

✅ Response Generated Successfully!
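
Because the pipeline is built with `return_source_documents=True`, the result dictionary should also carry the retrieved chunks under `source_documents`, which `pretty_print_rag_result()` does not display. A minimal sketch of a helper that lists them follows; `format_sources` is a hypothetical name, and the stand-in objects below only mimic the `page_content` and `metadata` attributes of LangChain documents so the example runs without the pipeline.

```python
from types import SimpleNamespace

def format_sources(result, max_chars=80):
    """Return one line per source document: page index plus a short text preview."""
    lines = []
    for doc in result.get("source_documents", []):
        page = doc.metadata.get("page", "?")  # index as stored by the loader, if any
        preview = " ".join(doc.page_content.split())[:max_chars]  # collapse whitespace
        lines.append(f"[page {page}] {preview}")
    return lines

# Stand-in result (assumption, not real pipeline output).
fake_result = {
    "result": "placeholder answer",
    "source_documents": [
        SimpleNamespace(page_content="Placeholder   text\nfrom one chunk.", metadata={"page": 12}),
        SimpleNamespace(page_content="Another placeholder chunk.", metadata={}),
    ],
}
source_lines = format_sources(fake_result)
print("\n".join(source_lines))
```

Showing the source pages next to each answer makes it easier to check a response against the policy text.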


In [40]:
query = "When does coverage terminate for members?"
result = rag_pipeline.invoke({"query": query})
pretty_print_rag_result(result, query)

🧠 INTELLIPOLICY: INSURANCE POLICY Q&A SYSTEM

🔹 Question:
When does coverage terminate for members?

💬 Assistant Answer:

Coverage for members terminates on the earliest of the following conditions:  a. the date this Group
Policy is terminated;  b. the date the last premium is paid for the Member's insurance;  c. any date
desired, if requested by the Member before that date;  d. the date the Member ceases to be a Member
as defined in PART I;  e. the date the Member ceases to be in a class for which Member Life
Insurance is provided;  f. the date the Member retires;  g. the date the Member ceases Active Work.

✅ Response Generated Successfully!


In [41]:
query = "What does the policy state about suicide or self-inflicted injury?"
result = rag_pipeline.invoke({"query": query})
pretty_print_rag_result(result, query)

🧠 INTELLIPOLICY: INSURANCE POLICY Q&A SYSTEM

🔹 Question:
What does the policy state about suicide or self-inflicted injury?

💬 Assistant Answer:

The policy states that no benefits will be paid for any disability that results from willful self-
injury or self-destruction, while sane or insane.

✅ Response Generated Successfully!


## 🧾 Conclusion

The IntelliPolicy RAG system successfully retrieves and summarizes key information from insurance policy documents using LangChain.  
It enables users to query complex policy terms in natural language and receive accurate, context-based answers.  

### Key Learnings
- Gained hands-on experience with RAG architecture.
- Implemented embeddings, vector stores, and retrievers using LangChain.
- Built an interpretable and modular pipeline for document-based Q&A.

### Future Enhancements

- Integrate multiple policy documents with metadata-based retrieval (e.g., by insurer or policy type).

- Add a Streamlit or Gradio-based web interface for end users.

- Implement hybrid retrieval (semantic + keyword search) for better accuracy.

- Include summarization and confidence scoring in the response.

- Extend the system into a conversational assistant using LangChain's memory for multi-turn Q&A.

- Explore using local open-source LLMs (like Llama 3 or Mistral) for cost optimization.
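
As an illustration of the hybrid retrieval enhancement listed above, one possible approach is reciprocal rank fusion, which merges a semantic ranking and a keyword ranking using only each item's position in the two lists. This is a sketch of one option, not a planned design: the chunk ids are placeholders, `reciprocal_rank_fusion` is a hypothetical helper, and the constant 60 is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, rank_constant=60):
    """Fuse several ranked lists of ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (rank_constant + rank) to the id's score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic_ranking = ["chunk_7", "chunk_2", "chunk_9"]  # placeholder ids
keyword_ranking = ["chunk_2", "chunk_4", "chunk_7"]   # placeholder ids
fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Chunks that rank well in both lists rise to the top, while chunks found by only one retriever are still kept further down.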
