### Optimize RAG

##### First, use UnstructuredPDFLoader instead of the simple PyPDFLoader

In [1]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.document_loaders import UnstructuredPDFLoader
import os
import sys
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")

In [22]:
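#Load the PDF with UnstructuredPDFLoader
#(assumption: same file as FILEPATH in the parsing section of this notebook)
loader=UnstructuredPDFLoader("sample_policy_doc_AU1234.pdf")
docs=loader.load()

#Split the documents into chunks
splitter=RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len
)

texts=splitter.split_documents(docs)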


#Embeddings 
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

#Create vector store
vectorstore=FAISS.from_documents(texts,embeddings)
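
##### What chunk_size and chunk_overlap do

A minimal sketch of the two numbers above, using a plain sliding window over characters. This is a simplification: `RecursiveCharacterTextSplitter` prefers to cut at separators (paragraphs, lines, spaces) before falling back to hard cuts. The helper name `chunk_with_overlap` is hypothetical and not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Cut text into windows of at most chunk_size characters.
    Each window starts chunk_size - chunk_overlap characters after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last 3 characters of each piece reappear at the start of the next one, so a sentence that straddles a boundary is still seen whole in at least one chunk when it is shorter than the overlap.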

In [5]:
#Retriever
retriever=vectorstore.as_retriever(search_kwargs={'k':4})
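
##### What k=4 means

A toy sketch of top-k retrieval: score every stored vector against the query vector and keep the k best. Cosine similarity is used here only for readability; the FAISS store may rank by a different metric (as far as I know, LangChain's FAISS wrapper defaults to L2 distance). The vectors and the names `cosine`, `top_k` and `toy_index` are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, doc_vecs, k=4):
    """Return the names of the k stored vectors most similar to the query."""
    scored = sorted(((cosine(query_vec, vec), name) for name, vec in doc_vecs.items()), reverse=True)
    return [name for _, name in scored[:k]]

# Hypothetical 3-dimensional "embeddings" for three chunks
toy_index = {
    "contents cover": [0.9, 0.1, 0.0],
    "unoccupied home": [0.1, 0.9, 0.1],
    "complaints": [0.0, 0.1, 0.9],
}
hits = top_k([0.2, 0.8, 0.0], toy_index, k=2)
print(hits)  # ['unoccupied home', 'contents cover']
```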

In [6]:
from langchain_core.prompts import PromptTemplate
prompt=PromptTemplate(
    template="""
    You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer
    the question. If you don't know the answer, say that you don't know.
    Keep the answer concise.
    {context}
    Question: {question}
    """,
    input_variables=['context','question']
)

In [7]:
#Building chain
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda
from langchain_core.output_parsers import StrOutputParser

def format_docs(retrieved_docs):
    #Join the retrieved chunks into a single context string
    return "\n".join(doc.page_content for doc in retrieved_docs)

#Retrieve the context and pass the question through unchanged
parallel_chain=RunnableParallel({
    'context':retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

parser = StrOutputParser()

rag_chain = parallel_chain | prompt | llm | parser
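
##### Data flow of the chain, without LangChain

A rough sketch of what the chain above does, written with plain functions so the flow is easy to follow. The retriever and the model are replaced by hypothetical stand-ins (`fake_retriever`, `fake_llm`); only the shape of the data matches the real chain.

```python
def fake_retriever(question: str) -> list[str]:
    # Hypothetical stand-in for the vector store retriever: returns fixed chunks
    return ["Contents are covered while in the home.", "See your schedule for limits."]

def format_docs(chunks: list[str]) -> str:
    # Join the retrieved chunks into one context string
    return "\n".join(chunks)

def build_inputs(question: str) -> dict:
    # Both branches receive the same question, as in the parallel step
    return {"context": format_docs(fake_retriever(question)), "question": question}

def fill_prompt(inputs: dict) -> str:
    # Stand-in for the prompt template
    return f"{inputs['context']}\nQuestion: {inputs['question']}"

def fake_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for the chat model
    return f"[model reply to a {len(prompt_text)}-character prompt]"

question = "Is an empty house covered?"
inputs = build_inputs(question)
print(inputs)
print(fake_llm(fill_prompt(inputs)))
```

The point is that the question is used twice: once to fetch context and once, unchanged, to fill the `{question}` slot of the prompt.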

In [10]:
answer=rag_chain.invoke("""I am planning a vacation and might leave my house empty.
Does my policy cover any potential damage that might occur?""")
print(answer)

Based on the provided context, your policy covers contents for loss or damage resulting from various causes, including when they are temporarily removed from the home. However, it does not explicitly mention coverage for damage when the house is empty. To confirm, I would need more information or clarification on your specific policy.


### PDF parsing with Marker

In [2]:
from marker.converters.pdf import PdfConverter



In [3]:
from marker.models import create_model_dict
from marker.output import text_from_rendered

In [4]:
FILEPATH = "sample_policy_doc_AU1234.pdf"

In [6]:
converter = PdfConverter(
    artifact_dict=create_model_dict(),
)
rendered = converter(FILEPATH)
text, _, images = text_from_rendered(rendered)

Recognizing layout: 100%|██████████| 10/10 [04:41<00:00, 28.11s/it]
Running OCR Error Detection: 100%|██████████| 15/15 [00:43<00:00,  2.89s/it]
Detecting bboxes: 0it [00:00, ?it/s]
Detecting bboxes: 0it [00:00, ?it/s]
Recognizing tables: 100%|██████████| 4/4 [01:58<00:00, 29.56s/it]


In [7]:
text

'# **This is a sample Policy document that provides full wording for all the covers we offer.**\n\nAll available options are on our website which will enable you to choose the level and type of cover. Once you have bought your Policy you will be provided with the documentation specific to what you have requested.\n\n| Section                 | Page |\n|-------------------------|------|\n|                         |      |\n| Buildings               | 3    |\n| Covers                  | 3    |\n| Causes                  | 6    |\n| Contents                | 9    |\n| Covers                  | 9    |\n| Causes                  | 15   |\n|                         |      |\n| Personal Possessions    | 17   |\n| Essential Information   | 19   |\n| General Conditions      | 19   |\n| Cancelling Your Cover   | 22   |\n| General Exclusions      | 24   |\n| Definitions             | 26   |\n| Claims Conditions       | 29   |\n| Making a Complaint      | 33   |\n| Sharing of Information  | 35   |

In [9]:
from marker.output import save_output
save_output(
    rendered,output_dir="output",fname_base="sample_policy"
  
)

In [10]:
from langchain_community.document_loaders import TextLoader
loader = TextLoader("output/sample_policy.md")
documents = loader.load()

print(f"✅ Loaded {len(documents)} markdown sections.")


✅ Loaded 1 markdown sections.


In [11]:
print(documents)

[Document(metadata={'source': 'output/sample_policy.md'}, page_content='# **This is a sample Policy document that provides full wording for all the covers we offer.**\n\nAll available options are on our website which will enable you to choose the level and type of cover. Once you have bought your Policy you will be provided with the documentation specific to what you have requested.\n\n| Section                 | Page |\n|-------------------------|------|\n|                         |      |\n| Buildings               | 3    |\n| Covers                  | 3    |\n| Causes                  | 6    |\n| Contents                | 9    |\n| Covers                  | 9    |\n| Causes                  | 15   |\n|                         |      |\n| Personal Possessions    | 17   |\n| Essential Information   | 19   |\n| General Conditions      | 19   |\n| Cancelling Your Cover   | 22   |\n| General Exclusions      | 24   |\n| Definitions             | 26   |\n| Claims Conditions       | 29   |\

In [12]:
from langchain_text_splitters import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size=500, chunk_overlap=50)
chunked_docs = splitter.split_documents(documents)

print(f"✅ Chunked into {len(chunked_docs)} sections.")


✅ Chunked into 486 sections.


In [19]:
print(chunked_docs[6].page_content)

| Buildings cover Limit - please refer to your                                                           |                                                                                                                                    |
| schedule.                                                                                              | •<br>The exclusions listed in this column. These<br>exclusions relate to the corresponding cover<br>identified in the left column. |
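
##### Cleaning table chunks before embedding

The chunk printed above carries `<br>` tags and long runs of padding spaces, which use up part of the 500-character budget without adding meaning. A possible clean-up step, not part of the original notebook, is sketched below; `clean_table_chunk` is a hypothetical helper that could be applied to each `page_content` before building the vector store.

```python
import re

def clean_table_chunk(chunk: str) -> str:
    """Replace <br> tags with spaces and collapse padding, line by line."""
    text = re.sub(r"<br\s*/?>", " ", chunk)
    # Collapse runs of spaces or tabs and trim each line; keep the table pipes
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)

raw = "| schedule.            | •<br>The exclusions listed in this column. |"
cleaned = clean_table_chunk(raw)
print(cleaned)  # | schedule. | • The exclusions listed in this column. |
```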


In [21]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_groq import ChatGroq
from langchain_huggingface import HuggingFaceEmbeddings
import os
import sys
from dotenv import load_dotenv
load_dotenv()

True

In [23]:
groq_api_key=os.getenv("GROQ_API_KEY")
llm=ChatGroq(groq_api_key=groq_api_key,model_name="Llama3-8b-8192")

In [None]:
#Embeddings 
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

#Create vector store
vectorstore=FAISS.from_documents(chunked_docs,embeddings)

retriever=vectorstore.as_retriever(search_kwargs={'k':4})

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda

prompt=PromptTemplate(
    template="""
    You are an assistant for question-answering tasks.
    Use the following pieces of retrieved context to answer
    the question. If you don't know the answer, say that you don't know.
    Keep the answer concise.
    {context}
    Question: {question}
    """,
    input_variables=['context','question']
)

#Building chain
def format_docs(retrieved_docs):
    #Join the retrieved chunks into a single context string
    return "\n".join(doc.page_content for doc in retrieved_docs)

#Retrieve the context and pass the question through unchanged
parallel_chain=RunnableParallel({
    'context':retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

parser = StrOutputParser()

rag_chain = parallel_chain | prompt | llm | parser

In [25]:
answer=rag_chain.invoke("""I am planning a vacation and might leave my house empty.
Does my policy cover any potential damage that might occur?""")
print(answer)

Your policy covers contents while in the home for loss or damage resulting from various causes. It also covers buildings for loss or damage resulting from various causes. So, yes, your policy covers potential damage that might occur to your home while you're away on vacation.


In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Define section-based chunking
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["Buildings Insurance","Contents Insurance", "Cover", "Causes", "General Exclusions","\n"]  # Prioritize section headers
)

chunked_docs_rec = splitter.split_documents(documents)
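
##### How the separator list is applied

A simplified sketch of recursive splitting: split on the first separator, and only fall back to the next one for pieces that are still too long. It omits the step where the real splitter merges small pieces back up to `chunk_size`, and `recursive_split` is a hypothetical name. It also shows a side effect of using a common word such as "Cover" as a separator: the text is cut wherever that word occurs, including mid-phrase.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on separators in priority order until pieces fit chunk_size."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep) if sep in text else [text]
    out = []
    for i, piece in enumerate(pieces):
        # Keep the separator at the start of every piece after the first
        piece = sep + piece if i > 0 else piece
        if piece:
            out.extend(recursive_split(piece, rest, chunk_size))
    return out

sample = "Buildings Cover applies to the home. Cover limits are on the schedule."
parts = recursive_split(sample, ["Cover", "\n"], chunk_size=30)
print(parts)
# ['Buildings ', 'Cover applies to the home. ', 'Cover limits are on the schedule.']
```

Here "Buildings Cover" is separated into two pieces, which is probably not what a section-based split is meant to do; separators that only match heading lines would avoid this.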


In [51]:
print(chunked_docs_rec[12].page_content)

Causes below, we will also pay:<br>•<br>Architects, surveyors, consulting engineers<br>and legal fees, but not fees for preparing a<br>claim; |                                     |


In [50]:
print(chunked_docs[11].page_content)

| •<br>the cost of clearing debris from the site or<br>demolishing or shoring up the buildings;<br>•<br>the cost to comply with government or local<br>authority requirements but not if the order<br>predates the loss or damage.                                                                                                                                             |                                     |


In [53]:
from langchain_experimental.text_splitter import SemanticChunker
#Embeddings
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
text_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type='percentile',
    breakpoint_threshold_amount=90
)
docs = text_splitter.split_documents(documents)
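
##### What the percentile threshold does

A toy sketch of the idea behind `breakpoint_threshold_type='percentile'` as I understand it: measure the distance between each pair of neighbouring sentence embeddings, then start a new chunk wherever that distance is above the chosen percentile of all distances. The vectors are made up, and `semantic_breakpoints` is a hypothetical helper, not the library's code.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - (dot / norm if norm else 0.0)

def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (pct / 100) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

def semantic_breakpoints(vectors, pct=90):
    """Indices i such that a new chunk starts after sentence i."""
    if len(vectors) < 2:
        return []
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]

# Three sentences on one topic followed by two on another (toy 2-d vectors)
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
breaks = semantic_breakpoints(toy_vectors, pct=90)
print(breaks)  # [2]
```

With a threshold of 90, only about the top 10% of the gaps become chunk boundaries, which fits the much lower chunk count here (64) compared with the fixed-size split (486).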

In [66]:
print(docs[0])

page_content='# **This is a sample Policy document that provides full wording for all the covers we offer.**

All available options are on our website which will enable you to choose the level and type of cover. Once you have bought your Policy you will be provided with the documentation specific to what you have requested. | Section                 | Page |
|-------------------------|------|
|                         |      |
| Buildings               | 3    |
| Covers                  | 3    |
| Causes                  | 6    |
| Contents                | 9    |
| Covers                  | 9    |
| Causes                  | 15   |
|                         |      |
| Personal Possessions    | 17   |
| Essential Information   | 19   |
| General Conditions      | 19   |
| Cancelling Your Cover   | 22   |
| General Exclusions      | 24   |
| Definitions             | 26   |
| Claims Conditions       | 29   |
| Making a Complaint      | 33   |
| Sharing of Information  | 35   |
| Bicycle

In [54]:
len(docs)

64

In [61]:
print(docs[4].page_content)

| Accidental damage to cables, drain inspection<br>covers and underground drains, pipes or tanks<br>providing services to or from the home and for<br>which you are responsible.<br>We will also pay up to the limit for any one claim<br>for necessary and reasonable costs that you<br>incur in finding the source of the damage to the<br>home.


In [64]:
#Embeddings 
embeddings=HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

#Create vector store
vectorstore=FAISS.from_documents(docs,embeddings)

retriever=vectorstore.as_retriever(search_kwargs={'k':4})

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough, RunnableLambda

prompt=PromptTemplate(
    template="""Utilize the following contextual fragments to address the question at hand. Adhere to these guidelines:
1. If you're unsure of the answer, refrain from fabricating one.
2. Upon finding the answer, deliver it in a comprehensive manner, omitting references.
{context}
Question: {question}
Helpful Answer:""",
    input_variables=['context','question']
)

#Building chain
def format_docs(retrieved_docs):
    #Join the retrieved chunks into a single context string
    return "\n".join(doc.page_content for doc in retrieved_docs)

#Retrieve the context and pass the question through unchanged
parallel_chain=RunnableParallel({
    'context':retriever | RunnableLambda(format_docs),
    'question': RunnablePassthrough()
})

parser = StrOutputParser()

rag_chain = parallel_chain | prompt | llm | parser

In [65]:
answer=rag_chain.invoke("""I am planning a vacation and might leave my house empty.
Does my policy cover any potential damage that might occur?""")
print(answer)

Based on the provided contextual fragments, your policy does not provide coverage for normal day-to-day maintenance at your home, including maintenance that you should do yourself. Additionally, it does not cover replacement of items that wear out over a period of time or replacement of parts on a like-for-like basis where the replacement is necessary to resolve the immediate emergency.

However, if you are planning to leave your house empty for an extended period, you should notify your insurance provider. The policy does not explicitly mention coverage for damage caused by vacancy or unoccupied properties, but it's recommended that you check the cover provided on your schedule.

It's also important to note that any loss where you did not contact us to arrange repairs will not be covered under this insurance. Therefore, if you plan to leave your house empty for an extended period, it's essential to contact your insurance provider and make arrangements for any potential issues that may