# Unlocking Custom Document 🗎 Conversations 💬 with LangChain 🦜 and the Hugging Face API 🤗

## Architecture Diagram

![Architecture Diagram](arct.jpg "Architecture Diagram")

## Note
- The following `10 steps` build a custom chat assistant that answers questions about a document you choose.
- The runtime of each cell depends on the machine you run it on.
- Steps 2, 5, and 9 rely on Hugging Face and free models, so they may take longer to run.
- Consider alternatives such as OpenAI or Azure OpenAI in place of the Hugging Face API, as they may offer better performance.

## Step 1: Installation of Essential Python Modules

In [1]:
! pip install PyPDF2 langchain InstructorEmbedding sentence_transformers faiss-cpu
! pip freeze > requirements.txt



## Step 2: Setting Environment Variables in the System

In [2]:
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass("Enter your access token generated from https://huggingface.co/settings/tokens : ")

import os

os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN
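
Prompting for the token on every run can get tedious. The following is a minimal sketch, assuming the token may already be exported in the shell: it checks the environment first and prompts only when the variable is missing. The helper name `resolve_token` and its parameters are hypothetical and not part of any library.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping

TOKEN_VAR = "HUGGINGFACEHUB_API_TOKEN"


def resolve_token(
    env: MutableMapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the token from `env`, prompting only if it is not set yet."""
    token = env.get(TOKEN_VAR, "").strip()
    if not token:
        token = prompt("Enter your Hugging Face access token: ").strip()
        if not token:
            raise ValueError("An access token is required.")
        env[TOKEN_VAR] = token  # make it visible to later cells
    return token
```

Calling `resolve_token()` once near the top of the notebook would cover both the prompt and the environment assignment shown above.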

## Step 3: Parsing PDF Documents

In [3]:
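from PyPDF2 import PdfReader

# Read every page of the PDF and concatenate the extracted text.
pdf = PdfReader("sample.pdf")
text = ""

for page in pdf.pages:
    # Fall back to an empty string for pages that yield no text.
    text += page.extract_text() or ""

print(text)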

 
ADH 15. 10a 
1  
 
 
 
 
 
 
This is a sample Policy document that provides full 
wording for all the covers we offer.  
 
All available options are on our website which will enable you to choose the level and type of cover. Once you 
have bought your Policy you will be provided with the documentation specific to what you have requested.  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
ADH 15. 10a 
2 Section  Page  
 
Buildings   
3 
Covers  3 
Causes  6 
  
Contents  9 
Covers  9 
Causes  15 
  
Personal Possessions  17 
  
Essential Information  19 
General Conditions  19 
Cancelling Your Cover  22 
General Exclusions  24 
Definitions  26 
Claims Conditions  29 
Making a Complaint  33 
Sharing of Information  35 
  
Bicycle Cover  36 
  
Student Cover  37 
  
Home Assistance  38 
  
Family Legal Protection  
 46 
 
  
 
 
 
  
  
ADH 15. 10a 
3  
Buildings Insurance  
 
What your policy covers:  What your policy does not cover:  
We will pay you up to the maximu
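
The extracted text above contains long runs of whitespace-only lines, and these take up room in every chunk. Below is a minimal sketch of one possible cleanup pass to run before chunking. `normalize_whitespace` is a hypothetical helper, and whether collapsing blank lines suits a given PDF is an assumption to check against your own document.

```python
import re


def normalize_whitespace(raw: str) -> str:
    """Strip trailing spaces and collapse runs of blank lines into one."""
    # Remove spaces and tabs that sit directly before a line break.
    stripped = re.sub(r"[ \t]+\n", "\n", raw)
    # Collapse three or more consecutive newlines into a single blank line.
    collapsed = re.sub(r"\n{3,}", "\n\n", stripped)
    return collapsed.strip()


# A short excerpt shaped like the output above.
sample = "ADH 15. 10a \n1  \n \n \n \nThis is a sample Policy document"
print(normalize_whitespace(sample))
```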

## Step 4: Dividing Text into Chunks

In [4]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)

chunks = text_splitter.split_text(text=text)
print(chunks)

['ADH 15. 10a \n1  \n \n \n \n \n \n \nThis is a sample Policy document that provides full \nwording for all the covers we offer.  \n \nAll available options are on our website which will enable you to choose the level and type of cover. Once you \nhave bought your Policy you will be provided with the documentation specific to what you have requested.  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  \nADH 15. 10a \n2 Section  Page  \n \nBuildings   \n3 \nCovers  3 \nCauses  6 \n  \nContents  9 \nCovers  9 \nCauses  15 \n  \nPersonal Possessions  17 \n  \nEssential Information  19 \nGeneral Conditions  19 \nCancelling Your Cover  22 \nGeneral Exclusions  24 \nDefinitions  26 \nClaims Conditions  29 \nMaking a Complaint  33 \nSharing of Information  35 \n  \nBicycle Cover  36 \n  \nStudent Cover  37 \n  \nHome Assistance  38 \n  \nFamily Legal Protection  \n 46 \n \n  \n \n \n \n  \n  \nADH 15. 10a \n3  \nBuildings Insurance  \n \nWhat 
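
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts text into fixed-width windows where each window repeats the tail of the previous one. This is a plain sliding window, not the algorithm `CharacterTextSplitter` uses (which splits on the separator first and then merges pieces), and `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.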

## Step 5: Converting Chunks into Embeddings & Storing them in a Vector Store
- This step uses the `intfloat/e5-large-v2` model from Hugging Face for the embeddings and Facebook's `FAISS` as the vector store.

In [5]:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Embed each chunk with the e5 model, then index the vectors in FAISS.
embedding = HuggingFaceEmbeddings(
    model_name="intfloat/e5-large-v2",
)
vectorstore = FAISS.from_texts(texts=chunks, embedding=embedding)
print(vectorstore)

<langchain.vectorstores.faiss.FAISS object at 0x0000029643F12150>


## Step 6: Saving the Vector Store for Future Reuse with `Pickle`

In [6]:
import pickle

with open("vectorstore.pkl", "wb") as pkl:
    pickle.dump(vectorstore, pkl)

## Step 7: Directly Loading the Vector Store from the `vectorstore.pkl` File, Skipping Steps 3, 4, 5, and 6 😃

In [7]:
with open("vectorstore.pkl", "rb") as pkl:
    vectorstore = pickle.load(pkl)
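
Steps 6 and 7 can be folded into a single "load if cached, otherwise build" helper. The sketch below uses only the standard library and a stand-in object; `load_or_build` and `build_demo` are hypothetical names. In this notebook the `build` callable would wrap Steps 3 to 5.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def load_or_build(cache_path: Path, build: Callable[[], Any]) -> Any:
    """Load a pickled object if the cache file exists, else build and save it."""
    if cache_path.exists():
        with cache_path.open("rb") as fh:
            return pickle.load(fh)
    obj = build()
    with cache_path.open("wb") as fh:
        pickle.dump(obj, fh)
    return obj


# Demo with a stand-in object: the second call reads the cache file
# instead of calling the build function again.
build_calls = []


def build_demo() -> dict:
    build_calls.append(1)
    return {"chunks": ["a", "b"]}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "demo.pkl"
    first = load_or_build(cache, build_demo)
    second = load_or_build(cache, build_demo)

print(first == second, len(build_calls))
```

Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code while loading.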

## Step 8: Performing Similarity Search Using the Vector Store

In [8]:
query = "What your policy covers?"
search_result = vectorstore.similarity_search(query=query)
print(search_result)  # returns 4 documents by default
search_result = vectorstore.similarity_search(query=query, k=2)
print(search_result)  # returns 2 documents because k=2

[Document(page_content='ADH 15. 10a \n1  \n \n \n \n \n \n \nThis is a sample Policy document that provides full \nwording for all the covers we offer.  \n \nAll available options are on our website which will enable you to choose the level and type of cover. Once you \nhave bought your Policy you will be provided with the documentation specific to what you have requested.  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n  \nADH 15. 10a \n2 Section  Page  \n \nBuildings   \n3 \nCovers  3 \nCauses  6 \n  \nContents  9 \nCovers  9 \nCauses  15 \n  \nPersonal Possessions  17 \n  \nEssential Information  19 \nGeneral Conditions  19 \nCancelling Your Cover  22 \nGeneral Exclusions  24 \nDefinitions  26 \nClaims Conditions  29 \nMaking a Complaint  33 \nSharing of Information  35 \n  \nBicycle Cover  36 \n  \nStudent Cover  37 \n  \nHome Assistance  38 \n  \nFamily Legal Protection  \n 46 \n \n  \n \n \n \n  \n  \nADH 15. 10a \n3  \nBuildings
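
Conceptually, a similarity search embeds the query and returns the `k` stored chunks whose vectors are closest to it. The toy sketch below makes that concrete with hand-made 3-dimensional vectors and cosine similarity over plain Python lists. It is an illustration only: the vectors are invented, `top_k` is a hypothetical helper, and FAISS uses optimized indexes and its own distance metric rather than this loop.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    indexed: List[Tuple[str, Sequence[float]]],
    k: int = 4,
) -> List[Tuple[float, str]]:
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hand-made vectors stand in for real embeddings.
index = [
    ("Buildings cover", [0.9, 0.1, 0.0]),
    ("Making a complaint", [0.0, 0.2, 0.9]),
    ("Contents cover", [0.8, 0.3, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]
for score, text in top_k(query_vec, index, k=2):
    print(f"{score:.3f}  {text}")
```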

## Step 9: Setting Up a Large Language Model (LLM) with Hugging Face's `google/flan-t5-xxl`

In [9]:
from langchain.llms import HuggingFaceHub

llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={
        "temperature": 0.5,  # Sampling randomness: lower is more deterministic, higher is more varied
    }
)
print(llm)

HuggingFaceHub
Params: {'repo_id': 'google/flan-t5-xxl', 'task': None, 'model_kwargs': {'temperature': 0.5}}


## Optional: Creating a Custom Chat History to Set Context

In [10]:
chat_history = [ # (question, answer)
    (
        "How we can make our complaint?", 
        """
        Please write to:
            The Managing Director
            Arc Legal Assistance Limited
            PO Box 8921
            Colchester CO4 5YD
            Tel: 01206 615000*
            Email: customerservice@arclegal.co.uk
        """
    )
]
print(chat_history)

[('How we can make our complaint?', '\n        Please write to:\n            The Managing Director\n            Arc Legal Assistance Limited\n            PO Box 8921\n            Colchester CO4 5YD\n            Tel: 01206 615000*\n            Email: customerservice@arclegal.co.uk\n        ')]
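
Before the model can use this history, the list of `(question, answer)` tuples has to be turned into text. Here is a rough sketch of how such tuples might be flattened into a transcript. The `Human:` and `Assistant:` labels and the `render_history` helper are illustrative assumptions, not a copy of LangChain's internals.

```python
import textwrap
from typing import List, Tuple


def render_history(history: List[Tuple[str, str]]) -> str:
    """Flatten (question, answer) tuples into a plain-text transcript."""
    lines: List[str] = []
    for question, answer in history:
        lines.append(f"Human: {question.strip()}")
        # dedent removes the indentation left by the triple-quoted string.
        lines.append(f"Assistant: {textwrap.dedent(answer).strip()}")
    return "\n".join(lines)


# A shortened version of the history entry above.
history = [
    (
        "How we can make our complaint?",
        "\n        Please write to:\n            The Managing Director\n        ",
    )
]
print(render_history(history))
```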


# Now, with our LLM, PDF Vector Store, and Optional Custom Chat History in Place 🎉🎉🎉

## Step 10: Combining All the Information into a Unified Chain Named `Conversational Retrieval QA Chain`

In [11]:
from langchain.chains import ConversationalRetrievalChain

qa = ConversationalRetrievalChain.from_llm(
    llm=llm, 
    retriever=vectorstore.as_retriever(search_type = "similarity", search_kwargs = {"k":2}), 
)

print(qa)

memory=None callbacks=None callback_manager=None verbose=False tags=None combine_docs_chain=StuffDocumentsChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, input_key='input_documents', output_key='output_text', llm_chain=LLMChain(memory=None, callbacks=None, callback_manager=None, verbose=False, tags=None, prompt=PromptTemplate(input_variables=['context', 'question'], output_parser=None, partial_variables={}, template="Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n{context}\n\nQuestion: {question}\nHelpful Answer:", template_format='f-string', validate_template=True), llm=HuggingFaceHub(cache=None, verbose=False, callbacks=None, callback_manager=None, tags=None, client=InferenceAPI(api_url='https://api-inference.huggingface.co/pipeline/text2text-generation/google/flan-t5-xxl', task='text2text-generation', options={'wait_for_model': Tr
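
The printed chain shows the prompt template that wraps the retrieved chunks. As a sketch of what "stuffing" documents into that template amounts to, the snippet below fills it by hand. The template text is copied from the output above; joining chunks with a blank line, the placeholder chunk texts, and the `build_prompt` helper are assumptions made for illustration.

```python
from typing import List

# Template text copied from the printed chain above.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\nQuestion: {question}\nHelpful Answer:"
)


def build_prompt(chunks: List[str], question: str) -> str:
    """Place the retrieved chunks and the question into the template."""
    # Assumption: retrieved chunks are joined with a blank line between them.
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context=context, question=question)


# Placeholder chunk texts; real ones would come from the retriever (k=2).
prompt = build_prompt(
    ["Buildings Insurance ...", "Contents Insurance ..."],
    "What your policy covers?",
)
print(prompt)
```

With `k=2` in the retriever, only two chunks reach the model, which helps keep the prompt within the model's input limit.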

## Our Conversation Chain is Ready! Let's Engage in Some Conversations 😎😎

In [12]:
result = qa({"question": query, "chat_history": chat_history})
print("Human Question :", result['question'])
print("AI Answer :", result['answer'])

Human Question : What your policy covers?
AI Answer : home insurance policy


## Let's Proceed with the Chat and Update the Chat History Accordingly

In [13]:
chat_history = [(query, result["answer"])]
query = "HOW TO MAKE A CLAIM?"

result = qa({"question": query, "chat_history": chat_history})
print("Human Question :", result['question'])
print("AI Answer :", result['answer'])

Human Question : HOW TO MAKE A CLAIM?
AI Answer : 2. Call 0330 024 8086 (Calls are recorded and monitored)


In [14]:
chat_history = [(query, result["answer"])]
query = "Under Cover 11 (Tax) What is insured?"

result = qa({"question": query, "chat_history": chat_history})
print("Human Question :", result['question'])
print("AI Answer :", result['answer'])

Human Question : Under Cover 11 (Tax) What is insured?
AI Answer : Standard Advisers’ Costs incurred by an Accountant if You are subject to an
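
The cells above rebuild `chat_history` from the latest turn only, so earlier turns are dropped. A variant that keeps every turn might look like the sketch below. `ask` is a hypothetical helper, and `fake_qa` is a stand-in for the real chain so the example runs without a model.

```python
from typing import Callable, Dict, List, Tuple

History = List[Tuple[str, str]]


def ask(qa: Callable[[Dict], Dict], question: str, history: History) -> str:
    """Send one question through the chain and record the turn in `history`."""
    result = qa({"question": question, "chat_history": history})
    answer = result["answer"]
    history.append((question, answer))  # keep every turn, not just the last
    return answer


def fake_qa(inputs: Dict) -> Dict:
    """Stand-in chain: numbers its answers by how many turns came before."""
    turns = len(inputs["chat_history"])
    return {"question": inputs["question"], "answer": f"answer #{turns + 1}"}


history: History = []
ask(fake_qa, "What your policy covers?", history)
ask(fake_qa, "HOW TO MAKE A CLAIM?", history)
print(history)
```

Passing the `qa` chain from Step 10 in place of `fake_qa` would accumulate the real conversation in the same way.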
