<a href="https://colab.research.google.com/github/verified-HUMAN/RAGbot/blob/main/RAGbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [401]:
# Installing necessary libraries
! pip install openai langchain chromadb tiktoken lark

In [403]:
# Importing standard libraries.
# More libraries will be imported as we proceed.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
llm_name = "gpt-3.5-turbo-0301"


# **Loading files...**

In [432]:
# Load text data from Knowledge Document using TextLoader
from langchain.document_loaders import TextLoader

loader = TextLoader("/content/KnowledgeDocument(pan_card_services).txt", encoding = 'UTF-8')
document = loader.load()
txt = ' '.join([d.page_content for d in document])

In [438]:
# Load data from SampleQuestions.xlsx
questions_df = pd.read_excel('SampleQuestions.xlsx')

In [439]:
# Preprocessing the Excel file
# Count the number of duplicate rows
num_duplicates = questions_df.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Remove duplicate rows from the DataFrame
questions_df = questions_df.drop_duplicates()


Number of duplicate rows: 2


In [440]:
# Extracting the questions and ideal answers
ideal_answers = list(questions_df['Ideal Answer'])
questions = list(questions_df['Question'])

len(ideal_answers)

32

# **Splitting Data**

In [390]:
# Split the document at the first "# FAQs" heading (maxsplit=1 keeps the unpacking safe if the marker repeats)
txt1, txt2 = txt.split("# FAQs", 1)
txt2 = "# FAQs" + txt2
txt2 = txt2.replace("**", "** ")


In [None]:
print(txt2)

In [392]:
# Define the headers to split on
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

In [393]:
# Create the splitter
from langchain.text_splitter import MarkdownHeaderTextSplitter

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)


In [None]:
# Use the splitter to split txt1
md_header_splits1 = markdown_splitter.split_text(txt1)
md_header_splits1[1:50]

In [None]:
# Use the splitter to split txt2
headers_to_split_two =[
    ("#", "Header 1"),
    ("**", "Ques&Ans"),
]
markdown_splitter_two = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_two)
md_header_splits2 = markdown_splitter_two.split_text(txt2)
md_header_splits2[1:50]


In [396]:
# Combine md_header_splits1 and md_header_splits2
md_header_splits = md_header_splits1 + md_header_splits2

In [None]:
md_header_splits[:100]

In [398]:
# Helper that prints retrieved documents in a readable form.
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [None]:
document[0].metadata
print(document[0].page_content[0:500])

Context aware splitting
Chunking aims to keep text with common context together.
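
To make the idea concrete, here is a minimal pure-Python sketch of header-aware chunking. It is not LangChain's implementation; the function name `split_by_headers` and the sample text are invented for illustration. Each chunk carries the headers it sits under as metadata.

```python
import re

def split_by_headers(text, max_level=3):
    """Split markdown text into chunks, tagging each with its enclosing headers."""
    header_re = re.compile(r"^(#{1,%d})\s+(.*)$" % max_level)
    chunks, current_lines, headers = [], [], {}

    def flush():
        # Emit the text collected so far together with the active headers
        body = "\n".join(current_lines).strip()
        if body:
            chunks.append({"content": body, "metadata": dict(headers)})
        current_lines.clear()

    for line in text.splitlines():
        match = header_re.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header invalidates headers at the same or a deeper level
            for key in [k for k in headers if int(k.split()[-1]) >= level]:
                del headers[key]
            headers[f"Header {level}"] = match.group(2).strip()
        else:
            current_lines.append(line)
    flush()
    return chunks

# Hypothetical sample document
sample = "# Pan Card\nIntro text.\n## Fees\nFee details.\n## Documents\nDocument list."
chunks = split_by_headers(sample)
for chunk in chunks:
    print(chunk["metadata"], "->", chunk["content"])
```

Because the header trail travels with every chunk, a retriever can later filter on it instead of relying on the chunk text alone.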

# **Embedding and storing in Vectorstores**

In [441]:
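import os

from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

# Assumption: the API key is supplied through the OPENAI_API_KEY environment variable
openai_api_key = os.environ["OPENAI_API_KEY"]
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)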

persist_directory = 'docs/chroma/'

!rm -rf ./docs/chroma  # remove old database files if any
vectordb = Chroma.from_documents(
    documents=md_header_splits,
    embedding=embedding,
    persist_directory=persist_directory
)
print(vectordb._collection.count())

50


In [413]:
question = "Can Pan Card be made in 2 mins. How can I make one for myself?"

In [414]:
docs = vectordb.similarity_search(question,k=3)
pretty_print_docs(docs)



Document 1:

****If you have Aadhaar card****  
You can get a Pan Card instantly **(in under 10 minutes)**, if you have an Aadhaar card. You can apply through ABC.  
********************************************************************If you don’t have an Aadhaar card********************************************************************  
Once the payment is made to ABC, we will contact you and initiate the process. Pan card will be issued in 3 weeks.
----------------------------------------------------------------------------------------------------
Document 2:

To reprint your PAN card, you need to follow a specific procedure that involves providing certain documents and information to authenticate your identity. The process can take around 2-3 weeks to complete. You can apply for a reprint through ABC. We will guide you through the process and help you obtain a new copy of your PAN card.
---------------------------------------------------------------------------------------------------

In [415]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
pretty_print_docs(docs_mmr)



Document 1:

****If you have Aadhaar card****  
You can get a Pan Card instantly **(in under 10 minutes)**, if you have an Aadhaar card. You can apply through ABC.  
********************************************************************If you don’t have an Aadhaar card********************************************************************  
Once the payment is made to ABC, we will contact you and initiate the process. Pan card will be issued in 3 weeks.
----------------------------------------------------------------------------------------------------
Document 2:

No, in the absence of the Pan Card, ** NRIs can sign Form 60**  [Form 60 is a declaration to be filed by an individual or a person (not being a company or firm) who does not have a Permanent Account Number (PAN) and who in involved in any transaction] to open an NRI Account.
----------------------------------------------------------------------------------------------------
Document 3:

The charges for reprinting the PAN Card are
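
The two searches above differ in how they pick results: plain similarity search returns the nearest chunks, while max marginal relevance (MMR) also penalises chunks that resemble ones already chosen. The following is a rough sketch of that selection rule on toy 2-D vectors, not the vector store's actual code; the vectors and the `lambda_mult` value are assumptions chosen for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy embeddings: documents 0 and 1 are near-duplicates, document 2 is different
query_vec = [1.0, 0.0]
doc_vectors = [[0.9, 0.3], [0.9, 0.35], [0.9, -0.4]]

top_similar = sorted(range(len(doc_vectors)),
                     key=lambda i: cosine(query_vec, doc_vectors[i]),
                     reverse=True)[:2]
top_mmr = mmr(query_vec, doc_vectors, k=2)
print(top_similar, top_mmr)
```

On these toy vectors, similarity search keeps both near-duplicates, whereas MMR swaps the second one for the more distinct document.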

# **Retrieval** of relevant information
Retrieval is the centerpiece of our retrieval augmented generation (RAG) flow.

## **Addressing Specificity:** working with metadata using self-query retriever
We have an interesting challenge: we often want to infer the metadata from the query itself.

To address this, we can use SelfQueryRetriever, which uses an LLM to extract:

- the query string to use for vector search
- a metadata filter to pass along with it

Most vector databases support metadata filters, so this doesn't require any new databases or indexes.
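
As a rough illustration of the second half of that flow, the sketch below applies an already-extracted query and filter to a list of chunks. The `StructuredQuery` class, the `apply_filter` helper, and the sample chunks are hypothetical names invented here; the LLM extraction step itself is not shown.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredQuery:
    query: str                      # text used for the vector search
    filter: Optional[dict] = None   # metadata key/value pairs that must match

def apply_filter(docs, structured_query):
    """Keep only documents whose metadata matches every filter entry."""
    if not structured_query.filter:
        return list(docs)
    return [
        doc for doc in docs
        if all(doc["metadata"].get(key) == value
               for key, value in structured_query.filter.items())
    ]

# Hypothetical chunks and header values, invented for illustration
sample_docs = [
    {"content": "Steps to apply for a new PAN card.", "metadata": {"Header 1": "Apply"}},
    {"content": "Charges for reprinting a PAN card.", "metadata": {"Header 1": "Reprint"}},
]

unfiltered = apply_filter(sample_docs, StructuredQuery(query="make Pan Card"))
filtered = apply_filter(sample_docs, StructuredQuery(query="charges", filter={"Header 1": "Reprint"}))
print(len(unfiltered), [d["content"] for d in filtered])
```

When the LLM finds no metadata constraint in the question, the filter stays empty and the retriever behaves like an ordinary vector search.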

In [416]:
from langchain.llms import OpenAI
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [417]:
metadata_field_info = [
    AttributeInfo(
        name="Header 1",
        description=" This represents a primary header or section title in the markdown text. It's the highest level of division for the content. The corresponding metadata attribute 'Header 1' would contain the title or main topic of the section. For example, in a document about animals, 'Header 1' might be 'Mammals'.",
        type="string",
    ),
    AttributeInfo(
        name="Header 2",
        description="This represents a secondary header or subsection within a primary section. It's the second level of division for the content. The corresponding metadata attribute 'Header 2' would contain the title or topic of the subsection. For example, within the 'Mammals' section, 'Header 2' might be 'Carnivores'.",
        type="string",
    ),
    AttributeInfo(
        name="Header 3",
        description="This represents a tertiary header or sub-subsection within a secondary section. It's the third level of division for the content. The corresponding metadata attribute 'Header 3' would contain the title or topic of the sub-subsection. For example, within the 'Carnivores' subsection, 'Header 3' might be 'Lions'. It can correspond to different queries related to Header 2",
        type="string",
    ),
]

In [418]:
!pip install lark




In [419]:

document_content_description = "Pan_Card"
llm = OpenAI(openai_api_key=openai_api_key,temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)

In [420]:
docs = retriever.get_relevant_documents(question)



query='make Pan Card' filter=None limit=None




In [421]:
pretty_print_docs(docs)

Document 1:

Here are the steps for *PAN CARD* processing.  
- Visit ABC app
- Navigate to Services > NRI Pan Card > Apply New PAN
- Select the required form of PAN card and proceed with the payment
- Our team will get in touch with you to ask for the following documents:
- Passport(Any Country) / OCI Card
- Passport Size Photograph
- Overseas address proof with zip code (Supporting documents - Indian NRO/NRE Account statement or Overseas bank statement or Utility bill)
----------------------------------------------------------------------------------------------------
Document 2:

**If you have Aadhaar card**  
No other document is required. You can get your pan card through your Aadhaar card in 10 minutes.  
**If you don’t have an Aadhaar card**  
- Passport(Any Country) / OCI Card
- Passport Size Photograph
- Overseas address proof with zip code (Supporting documents - Indian NRO/NRE Account statement or Overseas bank statement or Utility bill)
--------------------------------------

## Compression

Another approach for improving the quality of retrieved docs is compression.
Information most relevant to a query may be buried in a document with a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.

Contextual compression is meant to fix this.
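
A simplified way to picture compression: keep only the parts of a chunk that relate to the query. The sketch below uses plain keyword overlap as a stand-in for the LLM-based extractor, so it is an approximation under that assumption; the `compress` helper, the stopword list, and the sample text are all invented for illustration.

```python
import re

# Small hand-picked stopword list, an assumption for this example
STOPWORDS = {"a", "an", "the", "is", "does", "how", "on", "in", "to"}

def content_words(text):
    return set(re.findall(r"\w+", text.lower())) - STOPWORDS

def compress(document, query, min_overlap=1):
    """Keep only the sentences that share at least min_overlap content words with the query."""
    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(query_words & content_words(s)) >= min_overlap]
    return " ".join(kept)

# Hypothetical chunk with one relevant sentence and two irrelevant ones
document_text = (
    "Our office is closed on weekends. "
    "A reprint takes around 2-3 weeks. "
    "Customers receive a monthly newsletter."
)
compressed = compress(document_text, "How long does a reprint take?")
print(compressed)
```

An LLM extractor can judge relevance by meaning rather than shared words, which is why it costs extra model calls but copes with paraphrases.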

In [422]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

In [423]:
def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

In [424]:
# Build an LLM-based compressor that extracts only the passages relevant to the query
llm = OpenAI(openai_api_key=openai_api_key,temperature=0)
compressor = LLMChainExtractor.from_llm(llm)

In [425]:
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

In [426]:
compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)



Document 1:

"in under 10 minutes" and "Once the payment is made to ABC, we will contact you and initiate the process. Pan card will be issued in 3 weeks."
----------------------------------------------------------------------------------------------------
Document 2:

**NRIs can sign Form 60**


# **Getting answers from the model**
### ---using RetrievalQA chain

In [427]:
from langchain.chains import RetrievalQA

In [428]:
from langchain.prompts import PromptTemplate

# Building a prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Give a systematic answer in a calm and polite tone. Keep the answer as concise as possible. Always say "Have a nice day" at the end of the answer in a separate sentence.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)


In [429]:
# Run chain
llm = OpenAI(openai_api_key=openai_api_key,temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=False,
    chain_type="stuff",
    verbose=False,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
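
Conceptually, the "stuff" chain type places every retrieved chunk into the `{context}` slot of one prompt and makes a single model call. A minimal sketch of that assembly step follows, with a shortened stand-in template and invented chunk texts; the blank-line separator is an assumption.

```python
def build_stuff_prompt(template, docs, question):
    """Join every retrieved chunk into one context string and fill the template."""
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)

# Shortened stand-in for the notebook's template, with the same placeholders
short_template = "Use the context to answer.\n{context}\nQuestion: {question}\nHelpful Answer:"
retrieved = ["Chunk one.", "Chunk two."]  # hypothetical retrieved chunk texts
prompt = build_stuff_prompt(short_template, retrieved, "What are the fees?")
print(prompt)
```

This is the cheapest option in model calls, but the combined chunks must fit inside the model's context window.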

# **Evaluation:**

In [430]:
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

llm = ChatOpenAI(openai_api_key=openai_api_key,temperature=0)

# Pair each question with its ideal answer
examples = [
    {"query": q, "answer": a}
    for q, a in zip(questions, ideal_answers)
]

qa_chain.run(examples[0]["query"])

predictions = qa_chain.apply(examples)
eval_chain = QAEvalChain.from_llm(llm)

graded_outputs = eval_chain.evaluate(examples, predictions)

for i, eg in enumerate(examples):
    print(f"Example {i+1}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Prediction Grade: " + graded_outputs[i]['results'])
    print()



Example 1:
Question: What are the documents required to apply for the new pan
Real Answer: If you have Aadhaar card
No other document is required. You can get your pan card through your Aadhaar card in 10 minutes.

If you don’t have an Aadhaar card
- Passport(Any Country) / OCI Card
- Passport Size Photograph
- Overseas address proof with zip code (Supporting documents - Indian NRO/NRE Account statement or Overseas bank statement or Utility bill). If you have any further questions or need assistance with the application process, please feel free to contact SBNRI.
Predicted Answer:  To apply for a new PAN card, you will need to submit a passport, passport size photograph, and overseas address proof with zip code. Supporting documents for the address proof include Indian NRO/NRE Account statement, Overseas bank statement, or Utility bill. Have a nice day.
Prediction Grade: CORRECT

Example 2:
Question: What is the cost/fees of a PAN card?
Real Answer: The cost of applying for a new PAN c
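
The per-example grades can be rolled up into one number. Below is a small sketch that assumes each graded output is a dict with a `results` key holding `CORRECT` or `INCORRECT`, as printed above; the `accuracy` helper and the sample grades are invented for illustration.

```python
def accuracy(graded_outputs, key="results"):
    """Fraction of examples whose grade is exactly CORRECT."""
    if not graded_outputs:
        return 0.0
    # Compare for equality: "INCORRECT" contains "CORRECT" as a substring
    correct = sum(1 for g in graded_outputs if g[key].strip().upper() == "CORRECT")
    return correct / len(graded_outputs)

# Hypothetical grades, not results from this notebook
sample_grades = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": " CORRECT"},
    {"results": "CORRECT"},
]
score = accuracy(sample_grades)
print(f"Accuracy: {score:.0%}")
```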

In [431]:
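# qa_chain was built with return_source_documents=False, so its output has no "source_documents" key.
# Fetch the documents for `question` straight from the same retriever instead.
pretty_print_docs(vectordb.as_retriever().get_relevant_documents(question))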

## RetrievalQA-map_reduce and refine

In [None]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

In [None]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]
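
The two chain types differ mainly in how model calls are arranged. The sketch below mimics both patterns with a fake model so the call counts are visible; the prompt wording and the helper names are assumptions for illustration, not the prompts LangChain uses.

```python
def map_reduce(llm, docs, question):
    # Map: answer the question against each chunk independently
    partial_answers = [llm(f"Context: {doc}\nQuestion: {question}") for doc in docs]
    # Reduce: combine the partial answers in one final call
    return llm("Combine these answers:\n" + "\n".join(partial_answers))

def refine(llm, docs, question):
    # Start from the first chunk, then revise the answer one chunk at a time
    answer = llm(f"Context: {docs[0]}\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(f"Existing answer: {answer}\nNew context: {doc}\nQuestion: {question}")
    return answer

calls = []

def fake_llm(prompt):
    """Stand-in for a real model: records the prompt and returns a numbered reply."""
    calls.append(prompt)
    return f"answer-{len(calls)}"

sample_chunks = ["chunk A", "chunk B", "chunk C"]

mr_result = map_reduce(fake_llm, sample_chunks, "question?")
mr_calls = len(calls)

calls.clear()
refine_result = refine(fake_llm, sample_chunks, "question?")
refine_calls = len(calls)

print(mr_calls, refine_calls)
```

The map calls are independent of each other and could run in parallel, while refine is sequential because each step needs the previous answer.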

# **Memory**

In [None]:
from langchain.memory import ConversationBufferMemory
from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    memory_key="chat_history",
    return_messages=True,
    max_token_limit=500
)
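
The idea behind a summary buffer is to keep recent turns verbatim and fold older ones into a running summary once a token budget is exceeded. Here is a simplified sketch of that behaviour, assuming a word count as the token count and a trivial summarizer in place of the LLM; the class and helper names are invented for illustration.

```python
class SummaryBuffer:
    """Keep recent messages verbatim; fold older ones into a running summary."""

    def __init__(self, summarize, max_token_limit=500):
        self.summarize = summarize          # callable(summary, message) -> new summary
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.messages = []

    @staticmethod
    def count_tokens(text):
        # Crude assumption: one token per whitespace-separated word
        return len(text.split())

    def add(self, message):
        self.messages.append(message)
        # Evict the oldest messages until the buffer fits the limit again
        while sum(self.count_tokens(m) for m in self.messages) > self.max_token_limit:
            oldest = self.messages.pop(0)
            self.summary = self.summarize(self.summary, oldest)

def naive_summarize(summary, message):
    """Stand-in for an LLM summarizer: keep the first three words of each message."""
    snippet = " ".join(message.split()[:3])
    return f"{summary} | {snippet}".strip(" |")

buffer = SummaryBuffer(naive_summarize, max_token_limit=8)
buffer.add("Can a Pan card be made quickly")   # 7 words, fits the limit
buffer.add("Yes with an Aadhaar card")         # pushes the total to 12, so the first message is summarized
print(buffer.summary, buffer.messages)
```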

# **Conversation_With_Bot**
### **---using ConversationalRetrievalChain**

In [None]:
from langchain.chains import ConversationalRetrievalChain
qa = ConversationalRetrievalChain.from_llm(
    llm,
    retriever=vectordb.as_retriever(),
    memory=memory
)

In [None]:
question = "Can Pan card be made in 2 min? Is it safe?"
result = qa({"question": question})

In [None]:
print(result['answer'])

In [None]:
question = "Why are those prerequisites needed?"
result = qa({"question": question})

In [None]:
print(result['answer'])