# LangChain: Q&A over Documents

Question answering over documents lets you query your own content through an LLM. An example is a tool that lets you query a product catalog for items of interest.

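Sources: [DeepLearning.AI LangChain course, question and answer lesson](https://learn.deeplearning.ai/langchain/lesson/5/question-and-answer),
[Better Programming article on building a multi-document reader and chatbot](https://betterprogramming.pub/building-a-multi-document-reader-and-chatbot-with-langchain-and-chatgpt-d1864d47e339) and
[LangChain FAISS integration docs](https://python.langchain.com/docs/integrations/vectorstores/faiss)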

In [0]:
#!pip install -q docarray
#!pip install python-docx
!pip install docx2txt

In [0]:
!pip install -q pydantic==1.10.9  #https://stackoverflow.com/questions/76934579/pydanticusererror-if-you-use-root-validator-with-pre-false-the-default-you

In [0]:
!pip install -q transformers
!pip install -q InstructorEmbedding

In [0]:
!pip install -q pypdf

In [0]:
!pip install -q unstructured[pdf]

In [0]:
#!pip install -qU numpy
#!pip install numpy==1.20.0

In [0]:
#!pip install -q chromadb
!pip install faiss-cpu

In [0]:
#dbutils.library.restartPython()

In [0]:
import os
import glob
from pathlib import Path
import pandas as pd
from IPython.display import display, Markdown

import openai
#from langchain.llms import OpenAI
from langchain.llms import AzureOpenAI

#from langchain.chat_models import ChatOpenAI
from langchain.chat_models import AzureChatOpenAI

from langchain.prompts import PromptTemplate
from langchain.prompts import ChatPromptTemplate

from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain.chains import LLMChain
from langchain.chains import ConversationChain
from langchain.chains import ConversationalRetrievalChain
from langchain.chains import RetrievalQA
from langchain.chains.mapreduce import MapReduceChain
from langchain.chains.summarize import load_summarize_chain
from langchain.chains.question_answering import load_qa_chain

from langchain.docstore.document import Document
from langchain.schema import Document as LangchainDocument

from langchain.memory import ConversationBufferMemory

from langchain.document_loaders import CSVLoader
from langchain.document_loaders import PyPDFLoader
from langchain.document_loaders import Docx2txtLoader
from langchain.document_loaders import TextLoader
from langchain.document_loaders import DirectoryLoader

#from langchain.vectorstores import DocArrayInMemorySearch
#from langchain.vectorstores.base import VectorStore
#from langchain.vectorstores import Chroma
from langchain.vectorstores import FAISS

#from langchain.embeddings import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceInstructEmbeddings

#from docx import Document
import tiktoken
#from funcy import lcat, lmap, linvoke

import warnings
warnings.filterwarnings('ignore')

#### Loading the `gpt-35-turbo` model

In [0]:
openai.api_type = "azure"
openai.api_base = "https://rg-rbi-aa-aitest-dsacademy.openai.azure.com/"
openai.api_key = os.environ["OPENAI_API_KEY"]

openai_model_name = "gpt-35-turbo"
openai_deploy_name = "model-gpt-35-turbo"
openai.api_version = "2023-07-01-preview"

In [0]:
llm = AzureChatOpenAI(openai_api_base=openai.api_base,
                      openai_api_version=openai.api_version,
                      deployment_name=openai_deploy_name,
                      openai_api_key=os.environ["OPENAI_API_KEY"],
                      openai_api_type=openai.api_type,
                      temperature=0.9,
                      #max_tokens=4000,
                      )


llm

#### Loading files in the examples folder

(In this first part, we only consider PDF documents)

In [0]:
fullpath = "/Workspace/ds-academy-embedded-wave-4/ExampleDocs"
docs = os.listdir(fullpath)
docs = [d for d in docs if d.endswith(".pdf")]
for doc in docs:
    print(doc)

Now we instantiate the PDF loader, load one small document, and create a list of LangChain `Document` objects

Info about the page splitting [here](https://datascience.stackexchange.com/questions/123076/splitting-documents-with-langchain-when-a-sentence-straddles-the-a-page-break)  
You can also pass your own text splitter to `pdf_loader.load_and_split()`

In [0]:
pdf_loader = PyPDFLoader(fullpath+"/"+docs[4])
documents = pdf_loader.load()
print(f"We have {len(documents)} pages in the pdf file")
print(type(documents))
print(type(documents[0]))

The simplest Q&A chain implementation we can use is `load_qa_chain`.  
It loads a chain that lets you pass in all of the documents you want to query with your LLM. With its default `stuff` chain type, all of the documents are placed into a single prompt.

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*rF3UlC7vWiVFGlXFNZ1XHw.png)

In [0]:
chain = load_qa_chain(llm=llm, verbose=False)
query = 'What is the document about?'
response = chain.run(input_documents=documents, question=query)
print(response) 

This approach works well when the documents are short enough to fit in the [context size of our model](https://platform.openai.com/docs/models/overview).  
However, most LLMs limit how much text can be sent in a single request, so for larger documents we cannot send everything at once.  
To overcome this, we need a way to send only the information that is likely to be relevant to our question/prompt.  
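
As a rough illustration of that limit, the sketch below estimates whether a set of page texts would fit into a single request. The four-characters-per-token ratio and the 4,096-token budget are assumptions chosen for illustration (a tokenizer such as `tiktoken` gives exact counts), and the function names are hypothetical.

```python
from typing import List

# ASSUMPTIONS: ~4 characters per token (a coarse heuristic for English text)
# and a 4,096-token context window; both are illustrative values only.
CHARS_PER_TOKEN = 4
CONTEXT_TOKENS = 4096


def estimate_tokens(text: str) -> int:
    """Approximate the token count from the character count (rounded up)."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(texts: List[str], reserved_for_answer: int = 500) -> bool:
    """True if all texts, plus room reserved for the answer, fit in the budget."""
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserved_for_answer <= CONTEXT_TOKENS


short_pages = ["A short page of text. " * 10] * 3
long_pages = ["A long page of text. " * 200] * 10
print(fits_in_context(short_pages))
print(fits_in_context(long_pages))
```

A check like this only tells us that a document is too large; the next section deals with what to send instead.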


#### Interacting With a Single PDF Using Embeddings

We can use embeddings and vector stores to send only relevant information to our prompt.  
The steps we will need to follow are:

+ Split all the documents into small chunks of text
+ Pass each chunk of text into an embedding transformer to turn it into an embedding
+ Store the embeddings and related pieces of text in a vector store, instead of a list of Langchain document objects

![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*FWwgOvUE660a04zoQplS7A.png)
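
The three steps above can be sketched without any external library. This is a toy illustration, not how `HuggingFaceInstructEmbeddings` or FAISS work internally: the word-count "embedding" and the `ToyVectorStore` class are stand-ins invented here to show the data flow.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words count vector (real models return dense floats)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Keeps each chunk next to its embedding and returns the closest chunks."""

    def __init__(self):
        self.entries = []  # list of (embedding, chunk_text)

    def add(self, chunks):
        for chunk in chunks:
            self.entries.append((embed(chunk), chunk))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = ToyVectorStore()
store.add([
    "The warranty covers repairs for two years.",
    "Shipping takes three to five business days.",
    "Returns are accepted within thirty days.",
])
print(store.search("How long does shipping take?", k=1))
```

The real pipeline has the same shape: chunks go in with their vectors, and a query is embedded with the same model before the nearest chunks are looked up.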

First, we test with a single, larger PDF file:

In [0]:
pdf_loader = PyPDFLoader(fullpath+"/"+docs[0])
documents = pdf_loader.load()
print(f"We have {len(documents)} pages in the pdf file")

We split the data into chunks of about 1,000 characters, with an overlap of 200 characters between consecutive chunks. The overlap helps preserve context that would otherwise be cut at a chunk boundary, which tends to give better results.

In [0]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
documents = text_splitter.split_documents(documents)
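
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap on a plain string. It is a simplification of the general idea only: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunk boundaries will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is the role the 200-character overlap plays in the cell above.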

We could create embeddings with many different models. We could have used **OpenAIEmbeddings**, but then we would have to pay for each token sent to the API. Instead, we create our vector DB with the **HuggingFaceInstructEmbeddings** class, which uses the Instructor-XL model from **[Hugging Face](https://huggingface.co/hkunlp/instructor-xl)** to embed our text chunks.  
We store all the database files inside `/Workspace/ds-academy-embedded-wave-4/VectorDB`, so they don't clutter up our source files.  

In [0]:
#openai_embeddings = OpenAIEmbeddings(deployment="model-text-embedding-ada-002", chunk_size = 1)

instruct_embeddings = HuggingFaceInstructEmbeddings(query_instruction="Represent the query for retrieval: ", 
                                                    model_name="hkunlp/instructor-xl")

##### Setting up a Vector Database

![Vector Databases](https://miro.medium.com/v2/resize:fit:828/format:webp/1*vIkxM-u3zrkHMZuIRURc0A.png)

There are [many Vector Database](https://thenewstack.io/top-5-vector-database-solutions-for-your-ai-project/) products, both paid and open source, that could be used.
We first tried [ChromaDB](https://www.trychroma.com/), but some incompatibilities with the current versions of Python motivated us to try [FAISS](https://faiss.ai/) (from Meta).

First attempt with ChromaDB (commented out)

In [0]:
#vectordb = Chroma.from_documents(documents,
#                                 #embedding=openai_embeddings,
#                                 embedding=instruct_embeddings,
#                                 persist_directory='/Workspace/ds-academy-embedded-wave-4/VectorDB'
#)
#vectordb.persist()

Deleting previous databases from the folder we created to store the files (only needed when building a new index)

In [0]:
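import shutil
for f in glob.glob('/Workspace/ds-academy-embedded-wave-4/VectorDB/*'):
    if os.path.isdir(f):
        shutil.rmtree(f)  # sub-folders (e.g. a saved index folder) cannot be deleted with os.remove
    else:
        os.remove(f)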

Loading the chunks of the PDF document into the vector database

In [0]:
vectordb = FAISS.from_documents(documents, 
                                embedding=instruct_embeddings,
                               )
print(f"There are {vectordb.index.ntotal} chunks in the index")
vectordb.save_local('/Workspace/ds-academy-embedded-wave-4/VectorDB/')

Once our content is stored as embeddings in the vector store, we are in a similar position to when we had only one small PDF to interact with: we are ready to pass information into the LLM prompt.  
However, instead of passing all the documents to the chain as context, as we did initially, we pass in our vector store as a retriever. The chain then retrieves only the text relevant to our question and sends just that text in the LLM prompt.

![](https://miro.medium.com/v2/resize:fit:828/format:webp/1*leoW-Pn0ohWalrUBbzdidA.png)

First we use only the `RetrievalQA` chain, which uses our vector store as the source of the context information.

Again, the chain wraps our question in some instructions that tell the LLM to use only the information provided when answering.  
So the prompt we end up sending to the LLM looks something like this:

    Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to
    make up an answer.

    {context} // i.e. the retrieved chunks of text deemed to be most semantically
              // relevant to our question

    Question: {query} // i.e. our actual query
    Helpful Answer:
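
As a sketch of how such a prompt could be assembled from retrieved chunks, the snippet below fills the template with plain string formatting. The template text is copied from above; the function name `build_prompt` and the blank-line joining of chunks are assumptions for illustration, not LangChain's implementation.

```python
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you don't know the answer, just say that you don't know, don't try to\n"
    "make up an answer.\n\n"
    "{context}\n\n"
    "Question: {query}\n"
    "Helpful Answer:"
)


def build_prompt(chunks, query):
    """Join the retrieved chunks and place them, with the question, in the template."""
    context = "\n\n".join(chunks)
    return PROMPT_TEMPLATE.format(context=context, query=query)


prompt = build_prompt(
    ["Chunk one about pricing.", "Chunk two about delivery."],
    "How much does delivery cost?",
)
print(prompt)
```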

Loading the recently created Vector Database object

In [0]:
docsearch = FAISS.load_local("/Workspace/ds-academy-embedded-wave-4/VectorDB/", instruct_embeddings)

We can now query our documents directly using the native similarity search of the vector DB:

In [0]:
query = "What are the documents about?"
result = docsearch.similarity_search(query)
print(result[0].page_content)

We can also ask for the similarity score of each result and limit the number of documents returned with `k`:

In [0]:
query = "What are the documents about?"
result = docsearch.similarity_search_with_score(query, k=2)
for r in result:
    print(r)
    print()
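
To read these scores, it helps to know that, according to the LangChain FAISS documentation linked at the top, the returned score is an L2 distance, so a lower score means a closer match. The toy vectors below are made up, and `search_with_score` is a hypothetical helper that only mimics the shape of the result, a list of `(item, score)` pairs.

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def search_with_score(query_vec, indexed, k=2):
    """Return the k closest (text, distance) pairs, smallest distance first."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in indexed]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Made-up 2-D vectors standing in for real embeddings
indexed = [
    ("chunk A", (0.0, 1.0)),
    ("chunk B", (1.0, 0.0)),
    ("chunk C", (3.0, 4.0)),
]
for text, score in search_with_score((0.0, 0.0), indexed, k=2):
    print(text, score)
```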

But it is much better to query the document with the Q&A chain.  
Now we create a retrieval chain using the vector database object:

In [0]:
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       retriever=docsearch.as_retriever(),
                                       #retriever=docsearch.as_retriever(search_kwargs={'k': 7}),
                                       return_source_documents=True)

In [0]:
query = "What are the documents about?"
result = qa_chain(query)
print(result['result'])

#### Adding Chat History
Now, if we want to take things one step further, we can make our chatbot remember previous questions.

Implementation-wise, on each interaction with the chatbot, all of our previous conversation history, including the questions and answers, needs to be passed into the prompt. That is because the LLM has no way to store information about our previous requests, so we must pass in all the information on every call to the LLM.

Fortunately, LangChain has a class that supports this: `ConversationalRetrievalChain`. It accepts an extra parameter called `chat_history`, which contains a list of our previous exchanges with the LLM.

In [0]:
qa_chain = ConversationalRetrievalChain.from_llm(llm=llm,
                                                 retriever=docsearch.as_retriever(),
                                                 return_source_documents=True)

The chain call accepts `chat_history` as a parameter. We must build up this list manually based on our conversation with the LLM.  
The chain does not do this for us, so after each question and answer we append to a list called `chat_history`, which we pass back into the chain on the next call.

In [0]:
chat_history = []
while True:
    # this prints to the terminal, and waits to accept an input from the user
    query = input('Prompt: ')
    # give us a way to exit the script
    if query == "exit" or query == "quit" or query == "q":
        print('Exiting')
        break
    # we pass in the query to the LLM, and print out the response. As well as
    # our query, the context of semantically relevant information from our
    # vector store will be passed in, as well as list of our chat history
    result = qa_chain({'question': query, 'chat_history': chat_history})
    print('Answer: ' + result['answer'])
    # we build up the chat_history list, based on our question and response
    # from the LLM, and the script then returns to the start of the loop
    # and is again ready to accept user input.
    chat_history.append((query, result['answer']))

In [0]:
chat_history
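
The same bookkeeping can be exercised without a terminal or an LLM. In the sketch below, `fake_chain` is a stand-in invented for illustration: it accepts the same `question` and `chat_history` keys as the chain above and returns an `answer`, which is enough to show how the history list grows by one `(question, answer)` tuple per turn.

```python
def fake_chain(inputs):
    """Stand-in for the retrieval chain: reports how much history it received."""
    question = inputs["question"]
    turns = len(inputs["chat_history"])
    return {"answer": f"(answer to '{question}' given {turns} earlier turns)"}


def run_conversation(chain, questions):
    """Ask each question in order, passing the accumulated history every time."""
    chat_history = []
    for question in questions:
        result = chain({"question": question, "chat_history": chat_history})
        chat_history.append((question, result["answer"]))
    return chat_history


history = run_conversation(fake_chain, ["What is the report about?", "Who wrote it?"])
for question, answer in history:
    print(question, "->", answer)
```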

#### Interacting With Multiple Document Types
If you remember, what our PDF document loader created is just a list of `Document` objects, one per page of a single file. So to extend the set of documents we can interact with, we just add more `Document` objects to this list.

We can simply iterate over all of the files in the folder and convert their contents into `Document` objects. From then on, the process is the same as before: we pass our list of documents to the text splitter, and the resulting chunks go to the embeddings model and the vector store.

In our case, we want to be able to handle PDFs, Microsoft Word documents, and text files. We iterate over the docs folder, pick the appropriate loader for each file based on its extension, and add the results to the `documents` list, which we then pass on to the text splitter.

The cell that deletes the old VectorDB is kept below but commented out, because we are going to add to the existing index.

In [0]:
#files = glob.glob('/Workspace/ds-academy-embedded-wave-4/VectorDB/*')
#for f in files:
#    os.remove(f)

Let's now create LangChain `Document` objects for all the files in our storage folder

In [0]:
fullpath = "/Workspace/ds-academy-embedded-wave-4/ExampleDocs/"
documents = []
for filename in os.listdir(fullpath):
    print(f"Ingesting document {filename}")
    if filename.endswith('.pdf'):
        pdf_path = fullpath + filename
        loader = PyPDFLoader(pdf_path)
        documents.extend(loader.load())
    elif filename.endswith('.docx') or filename.endswith('.doc'):
        doc_path = fullpath + filename
        loader = Docx2txtLoader(doc_path)
        documents.extend(loader.load())
    elif filename.endswith('.txt'):
        text_path = fullpath + filename
        loader = TextLoader(text_path)
        documents.extend(loader.load())
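
The extension checks can also be expressed as a lookup table, which keeps the loop short as more file types are added. The sketch below only decides which loader would handle each file name; the loader names are stored as strings so that it runs without LangChain, and `pick_loader` is a hypothetical helper. Unlike the loop above, it also matches upper-case extensions.

```python
from pathlib import Path

# Map file suffixes to the name of the loader used in the loop above
LOADER_BY_SUFFIX = {
    ".pdf": "PyPDFLoader",
    ".docx": "Docx2txtLoader",
    ".doc": "Docx2txtLoader",
    ".txt": "TextLoader",
}


def pick_loader(filename):
    """Return the loader name for a file, or None if the type is not handled."""
    return LOADER_BY_SUFFIX.get(Path(filename).suffix.lower())


for name in ["report.pdf", "notes.TXT", "letter.docx", "photo.png"]:
    print(name, "->", pick_loader(name))
```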

Checking how many objects were created:

In [0]:
print(len(documents))
for d in documents[0:5]:
    print(d.metadata)

Now we split the texts into chunks as before, this time with `RecursiveCharacterTextSplitter`:

In [0]:
#text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=10)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0, separators=["\n", ",", " "])  # separators are tried in order, so the coarsest comes first

chunked_documents = text_splitter.split_documents(documents)

In [0]:
print(len(chunked_documents))
for d in chunked_documents[0:5]:
    print(d.metadata)

Now we are going to add documents to the previously created Vector Database index.

In [0]:
print(f"We have {len(vectordb.docstore._dict)} documents in the collection")
vectordb.add_documents(chunked_documents)  # the index reuses the embedding model it was created with
print(f"We have {len(vectordb.docstore._dict)} documents in the collection")
vectordb.save_local('/Workspace/ds-academy-embedded-wave-4/VectorDB/faiss_index')

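The vector database does not keep track of which documents were indexed before, so we have to take care when ingesting to avoid duplicates. Here, for example, the PDF indexed in the previous section is read again from the folder, so its chunks are added a second time.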
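
One way to guard against this is to fingerprint each chunk before adding it and skip fingerprints that were already seen. This is a sketch under assumptions: it treats two chunks as duplicates only when their text is identical, and `new_chunks_only` and the `seen` set are names invented here. The set would need to be saved alongside the index to work across sessions.

```python
import hashlib


def fingerprint(text):
    """Stable SHA-256 fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_chunks_only(chunks, seen):
    """Return the chunks whose fingerprint is not in seen, and record them in seen."""
    fresh = []
    for chunk in chunks:
        key = fingerprint(chunk)
        if key not in seen:
            seen.add(key)
            fresh.append(chunk)
    return fresh


seen = set()
first_batch = new_chunks_only(["alpha", "beta"], seen)
second_batch = new_chunks_only(["beta", "gamma", "gamma"], seen)
print(first_batch)
print(second_batch)
```

Only the chunks returned by such a filter would then be passed on to the index.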

Now we can chat with our documents of multiple types through the LLM

In [0]:
pdf_qa = ConversationalRetrievalChain.from_llm(llm,
                                               retriever=vectordb.as_retriever(),
                                               return_source_documents=True,
                                               verbose=False
                                               )

chat_history = []
print('---------------------------------------------------------------------------------')
print('Welcome to the DocBot. You are now ready to start interacting with your documents')
print('---------------------------------------------------------------------------------')
while True:
    query = input('Prompt: ')
    if query in ("exit", "quit", "q", "f"):
        print('Exiting')
        break
    if query == '':
        continue
    result = pdf_qa({"question": query, "chat_history": chat_history})
    print('Answer: ' + result["answer"])
    chat_history.append((query, result["answer"]))

In [0]:
chat_history

### Bonus: operations on Vector Databases

##### You can merge many FAISS vector indexes

In [0]:
db1 = FAISS.from_texts(["Oranges are orange or yellow when ripe"], embedding=instruct_embeddings,)
db2 = FAISS.from_texts(["Grapes can be red, purple or green"], embedding=instruct_embeddings,)
db3 = FAISS.from_texts(["Watermelons are green outside, and red inside"], embedding=instruct_embeddings,)
db4 = FAISS.from_texts(["Lemons are green or yellow"], embedding=instruct_embeddings,)
db5 = FAISS.from_texts(["Oranges are orange or yellow when ripe"], embedding=instruct_embeddings,)

In [0]:
print(db1.docstore._dict)
print(db2.docstore._dict)
print(db3.docstore._dict)
print(db4.docstore._dict)
print(db5.docstore._dict)

In [0]:
db1.merge_from(db2)
db1.merge_from(db3)
db1.merge_from(db4)
db1.merge_from(db5)
db1.docstore._dict

In [0]:
results_with_scores = db1.similarity_search_with_score("red and green",)
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Score: {score}")

##### Another useful thing is to add documents with metadata

In [0]:
list_of_documents = [
    LangchainDocument(page_content="Orange is orange", metadata=dict(topic="Fruit")),
    LangchainDocument(page_content="Lemon is green",  metadata=dict(topic="Fruit")),
    LangchainDocument(page_content="Watermelon is green",  metadata=dict(topic="Fruit")),
    LangchainDocument(page_content="Grapes are red or green",  metadata=dict(topic="Fruit")),
    LangchainDocument(page_content="The sun is orange",  metadata=dict(topic="Astronomy")),
    LangchainDocument(page_content="Mars is red",  metadata=dict(topic="Astronomy")),
    LangchainDocument(page_content="The Earth is blue",  metadata=dict(topic="Astronomy")),
    LangchainDocument(page_content="Our planet is Earth",  metadata=dict(topic="Astronomy")),
]
db = FAISS.from_documents(list_of_documents, embedding=instruct_embeddings)

First we make the query without filtering:

In [0]:
results_with_scores = db.similarity_search_with_score("orange")
for doc, score in results_with_scores:
    print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")

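Now we run the same query but keep only the results whose topic is "Fruit". Note that the filter is applied after the search, so fewer results than the search returned may be printed.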

In [0]:
results_with_scores = db.similarity_search_with_score("orange")
for doc, score in results_with_scores:
    if doc.metadata['topic'] == "Fruit":
        print(f"Content: {doc.page_content}, Metadata: {doc.metadata}, Score: {score}")
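
The difference between filtering after the search and filtering before ranking can be seen on toy data. In this sketch the distances are made-up numbers and both helper functions are hypothetical; it only illustrates why post-filtering the top `k` results can return fewer than `k` matches.

```python
# (text, topic, distance to the query); the distances are made-up numbers
scored_docs = [
    ("The sun is orange", "Astronomy", 0.10),
    ("Orange is orange", "Fruit", 0.12),
    ("Mars is red", "Astronomy", 0.30),
    ("The Earth is blue", "Astronomy", 0.35),
    ("Lemon is green", "Fruit", 0.40),
    ("Grapes are red or green", "Fruit", 0.45),
]


def post_filter(docs, topic, k):
    """Take the k nearest documents first, then drop the ones with another topic."""
    nearest = sorted(docs, key=lambda d: d[2])[:k]
    return [d[0] for d in nearest if d[1] == topic]


def pre_filter(docs, topic, k):
    """Keep only documents with the topic, then take the k nearest of those."""
    matching = [d for d in docs if d[1] == topic]
    return [d[0] for d in sorted(matching, key=lambda d: d[2])[:k]]


print(post_filter(scored_docs, "Fruit", k=4))
print(pre_filter(scored_docs, "Fruit", k=4))
```

The LangChain FAISS documentation linked at the top also describes a `filter` argument for the search methods; whether it is available depends on the installed version.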