## Chat with documents using LangChain

### Optional step: Check that the correct virtual environment is active

In [None]:
import sys
sys.prefix

In [None]:
pip --version

### Install the required libraries

In [None]:
!pip install openai
!pip install python-dotenv
!pip install langchain
!pip install pypdf
!pip install chromadb
!pip install tiktoken
!pip install lark # Parsing library for Python (can parse any context-free grammar); required by the self-query retriever used below.

In [105]:
pip freeze > requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Import the necessary libraries

In [1]:
import openai
import os
from dotenv import load_dotenv, find_dotenv

### Setup OpenAI API

In [2]:
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key  = os.environ['OPENAI_API_KEY']
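Reading `os.environ['OPENAI_API_KEY']` directly raises a bare `KeyError` when the variable is missing. The following is a minimal sketch of a friendlier check. `require_env` is a hypothetical helper written for this notebook, not part of `openai` or `python-dotenv`.

```python
import os


def require_env(name, env=None):
    """Return a non-empty environment value or raise a descriptive error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value


# Demonstration with an explicit mapping instead of the real environment.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-placeholder"}))
```

In the notebook this could be used as `openai.api_key = require_env("OPENAI_API_KEY")` after `load_dotenv` has run.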

### Instantiate the LLM

In [51]:
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0) # temperature=0 keeps answers as deterministic as possible; gpt-3.5-turbo is the default model used

### Load the document

In [3]:
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("docs/progit.pdf")
pages = loader.load()

print(f"Number of pages in the book: {len(pages)}")

Number of pages in the book: 501


## Split the document into chunks

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

r_text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = r_text_splitter.split_documents(pages)

print(f"Number of chunks: {len(chunks)}")

Number of chunks: 1247
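To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that cuts fixed-size character windows. It is an assumption-laden illustration only: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and newlines, which this hypothetical `split_with_overlap` helper does not do.

```python
def split_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Cut text into fixed-size windows that share `chunk_overlap` characters.

    Hypothetical helper; not the algorithm LangChain uses internally.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(pieces)  # ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk starts with the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.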


In [5]:
for i in range(30, 34):
    print(f"chunk_{i+1}: {chunks[i]}\n")

chunk_31: page_content='anywhere near the ubiquity it has today. Since then, nearly every open source community has\nadopted it. Git has made incredible progress on Windows, in the explosion of graphical user\ninterfaces to it for all platforms, in IDE support and in business use. The Pro Git of four years ago\nknows about none of that. One of the main aims of this new edition is to touch on all of those new\nfrontiers in the Git community.\nThe Open Source community using Git has also exploded. When I originally sat down to write the\nbook nearly five years ago (it took me a while to get the first version out), I had just started working\nat a very little known company developing a Git hosting website called GitHub. At the time of\npublishing there were maybe a few thousand people using the site and just four of us working on it.\nAs I write this introduction, GitHub is announcing our 10 millionth hosted project, with nearly 5' metadata={'source': 'docs/progit.pdf', 'page': 7}

chunk_

## Create embeddings and store them in a vector database

In [6]:
db_dir = "chatdb/chroma"
# Remove any previously persisted database so the store is rebuilt from scratch.
!rm -rf ./chatdb/chroma

In [7]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

In [8]:
from langchain.vectorstores import Chroma
vector_db = Chroma.from_documents(documents=chunks, embedding=embedding, persist_directory=db_dir)
vector_db.persist()

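# count() returns the number of embedded chunks stored in the collection, not the number of collections
print(f"Number of chunks in the collection: {vector_db._collection.count()}")

Number of chunks in the collection: 1247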


## Query and retrieve data

### Similarity Search

In [None]:
question = "who are the authors of this book?"
docs = vector_db.similarity_search(question, k=10)

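# Iterate over whatever was returned; fewer than k documents will not raise an IndexError.
for i, doc in enumerate(docs, start=1):
    print(f"doc[{i}]: {doc}\n")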
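Conceptually, a similarity search ranks stored vectors by how close they are to the query vector and returns the top `k`. The sketch below uses cosine similarity on toy two-dimensional vectors. The helper names and the vectors are hypothetical, and the distance metric a vector store actually applies depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # stand-ins for embeddings
toy_query = [1.0, 0.0]
print(top_k(toy_query, toy_docs, k=2))  # [0, 1]
```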

### Maximum Marginal Relevance

In [None]:
question = "who are the authors of this book?"
docs = vector_db.max_marginal_relevance_search(question, k=10)

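# Iterate over whatever was returned; fewer than k documents will not raise an IndexError.
for i, doc in enumerate(docs, start=1):
    print(f"doc[{i}]: {doc}\n")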
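Maximum marginal relevance picks results one at a time, rewarding similarity to the query and penalizing similarity to results already chosen. The following is a minimal sketch on toy vectors. The `mmr` helper, its `lambda_mult` weight, and the vectors are illustrative assumptions, not the vector store's internal implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Greedy MMR: lambda_mult=1.0 is pure relevance, lower values favor diversity."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(idx):
            relevance = cosine_similarity(query_vec, doc_vecs[idx])
            # Similarity to the closest already-selected document (0.0 if none yet).
            redundancy = max(
                (cosine_similarity(doc_vecs[idx], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


toy_docs = [[1.0, 0.0], [0.95, 0.05], [0.0, 1.0]]  # docs 0 and 1 are near-duplicates
toy_query = [0.9, 0.3]
print(mmr(toy_query, toy_docs, k=2, lambda_mult=1.0))  # [1, 0]
print(mmr(toy_query, toy_docs, k=2, lambda_mult=0.5))  # [1, 2]
```

With pure relevance the two near-duplicate vectors are both returned. With a diversity penalty the second slot goes to the dissimilar vector instead.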

### Self Query Retrieval

In [24]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

In [48]:
document_content_description = "A book on Git"
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The book the chunk is from, it should be from `docs/progit.pdf`",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the book",
        type="integer",
    ),
]

In [38]:
sq_retriever = SelfQueryRetriever.from_llm(
    llm, 
    vector_db, 
    document_content_description,
    metadata_field_info,
    #search_type="mmr", #kwargs
    #enable_limit=True,
    verbose=True
)

In [34]:
print(sq_retriever.search_type)

similarity
True


### Retrieve the relevant documents

In [44]:
question = "What is the main focus of discussion between the pages 100 to 120?"
#question = "What is the 2 main focus of discussion between the pages 100 to 120?" # Asking for a specific number of documents has no effect here because enable_limit is not set

docs = sq_retriever.get_relevant_documents(question)
for doc in docs:
    print(doc.metadata)

{'page': 103, 'source': 'docs/progit.pdf'}
{'page': 109, 'source': 'docs/progit.pdf'}
{'page': 108, 'source': 'docs/progit.pdf'}
{'page': 109, 'source': 'docs/progit.pdf'}


### Retrieve a specific number of relevant documents

Set the `enable_limit` parameter to `True` so that the retriever honors the number of documents (`k`) requested in the question.  
**Reference**: https://python.langchain.com/docs/modules/data_connection/retrievers/self_query/#filter-k

In [45]:
sq_retriever_1 = SelfQueryRetriever.from_llm(
    llm, 
    vector_db, 
    document_content_description,
    metadata_field_info,
    enable_limit=True,
    verbose=True
)

In [46]:
question = "What is the 2 main focus of discussion between the pages 100 to 120?"

docs = sq_retriever_1.get_relevant_documents(question)
for doc in docs:
    print(doc.metadata)

{'page': 105, 'source': 'docs/progit.pdf'}
{'page': 110, 'source': 'docs/progit.pdf'}


In [47]:
question = "who are the original or main authors of this book?"

docs = sq_retriever.get_relevant_documents(question)
for doc in docs:
    print(doc.metadata)

{'page': 10, 'source': 'docs/progit.pdf'}
{'page': 11, 'source': 'docs/progit.pdf'}
{'page': 10, 'source': 'docs/progit.pdf'}
{'page': 10, 'source': 'docs/progit.pdf'}


## Question and Answer
Pass the chunks retrieved from the vector store to an LLM to get a final answer to the user's question.

### Using RetrievalQA chain

In [71]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm, 
    retriever=vector_db.as_retriever(), #default is similarity search
    #retriever=vector_db.as_retriever(search_type="mmr"),
    #retriever=sq_retriever,
    return_source_documents=True,
    verbose=True
)

In [72]:
print(qa_chain.retriever.search_type)

similarity


In [73]:
question = "who are the authors of this book?"
response = qa_chain({"query": question})

print(response["result"])



> Entering new RetrievalQA chain...

> Finished chain.
The authors of this book are Scott Chacon and Ben Straub.


In [70]:
print(response)

{'query': 'who are the authors of this book?', 'result': 'The authors of this book are Scott Chacon and Ben Straub.', 'source_documents': [Document(page_content='Contributors\nSince this is an Open Source book, we have gotten several errata and content changes donated over\nthe years. Here are all the people who have contributed to the English version of Pro Git as an open\nsource project. Thank you everyone for helping make this a better book for everyone.\nContributors as of 3c1bc3b8:\n4wk-                            Jon Freed                       Sean Jacobs\nAdam Laflamme                   Jonathan                        Sebastian Krause\nAdrien Ollier                   Jordan Hayashi                  Sergey Kuznetsov\nAkrom K                         Joris Valette                   Severino Lorilla Jr\nAlan D. Salewski                Josh Byster                     Shengbin Meng\nAlba Mendez                     Joshua Webb                     Sherry Hietala\nAleh Suprunovich      

## Chat
### Memory

In [91]:
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True,
    output_key="answer"
)
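A buffer memory simply appends every question and answer to a growing list so that later turns can refer back to earlier ones. The following toy class is a sketch of that idea under simplified assumptions. `ToyBufferMemory` is hypothetical and is not how `ConversationBufferMemory` is implemented.

```python
class ToyBufferMemory:
    """Keeps the whole conversation as an ordered list of (role, text) pairs."""

    def __init__(self):
        self.messages = []

    def save_turn(self, question, answer):
        # Store one human/AI exchange at the end of the history.
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def as_prompt_text(self):
        # Flatten the history so it could be prepended to the next prompt.
        return "\n".join(f"{role}: {text}" for role, text in self.messages)


toy_memory = ToyBufferMemory()
toy_memory.save_turn("who are the authors of this book?",
                     "The authors of this book are Ben Straub and Scott Chacon.")
toy_memory.save_turn("please give more details about them.", "(answer text)")
print(toy_memory.as_prompt_text())
```

Because nothing is ever dropped, the history grows with every turn, which is worth keeping in mind for long conversations and limited context windows.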

### Using ConversationalRetrievalChain

In [92]:
from langchain.chains import ConversationalRetrievalChain

# The conversation history is tracked by `memory`, so no separate chat_history list is needed.
conv_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=vector_db.as_retriever(search_type="mmr"),
    return_source_documents=True,
    #verbose=True,
    memory=memory
)

In [93]:
question = "who are the authors of this book?"
response = conv_chain({"question": question})

print(response)

{'question': 'who are the authors of this book?', 'chat_history': [HumanMessage(content='who are the authors of this book?'), AIMessage(content='The authors of this book are Ben Straub and Scott Chacon.')], 'answer': 'The authors of this book are Ben Straub and Scott Chacon.', 'source_documents': [Document(page_content='Contributors\nSince this is an Open Source book, we have gotten several errata and content changes donated over\nthe years. Here are all the people who have contributed to the English version of Pro Git as an open\nsource project. Thank you everyone for helping make this a better book for everyone.\nContributors as of 3c1bc3b8:\n4wk-                            Jon Freed                       Sean Jacobs\nAdam Laflamme                   Jonathan                        Sebastian Krause\nAdrien Ollier                   Jordan Hayashi                  Sergey Kuznetsov\nAkrom K                         Joris Valette                   Severino Lorilla Jr\nAlan D. Salewski     

In [94]:
print(response["answer"])

The authors of this book are Ben Straub and Scott Chacon.


In [95]:
question = "please give more details about them."
response = conv_chain({"question": question})

print(response)

{'question': 'please give more details about them.', 'chat_history': [HumanMessage(content='who are the authors of this book?'), AIMessage(content='The authors of this book are Ben Straub and Scott Chacon.'), HumanMessage(content='please give more details about them.'), AIMessage(content='Ben Straub is one of the authors of the book "Pro Git." He has written a preface for the book and is mentioned in the dedications section, where he expresses gratitude to his wife, Becky, for her support in embarking on this adventure.\n\nScott Chacon is also one of the authors of "Pro Git." He has written a preface for the book and is mentioned in the dedications section. Scott dedicates this edition of the book to his wife, Jessica, and his daughter, Josephine, expressing appreciation for their support throughout the years and in the future.')], 'answer': 'Ben Straub is one of the authors of the book "Pro Git." He has written a preface for the book and is mentioned in the dedications section, where 

In [96]:
print(response["answer"])

Ben Straub is one of the authors of the book "Pro Git." He has written a preface for the book and is mentioned in the dedications section, where he expresses gratitude to his wife, Becky, for her support in embarking on this adventure.

Scott Chacon is also one of the authors of "Pro Git." He has written a preface for the book and is mentioned in the dedications section. Scott dedicates this edition of the book to his wife, Jessica, and his daughter, Josephine, expressing appreciation for their support throughout the years and in the future.


In [97]:
memory.buffer

[HumanMessage(content='who are the authors of this book?'),
 AIMessage(content='The authors of this book are Ben Straub and Scott Chacon.'),
 HumanMessage(content='please give more details about them.'),
 AIMessage(content='Ben Straub is one of the authors of the book "Pro Git." He has written a preface for the book and is mentioned in the dedications section, where he expresses gratitude to his wife, Becky, for her support in embarking on this adventure.\n\nScott Chacon is also one of the authors of "Pro Git." He has written a preface for the book and is mentioned in the dedications section. Scott dedicates this edition of the book to his wife, Jessica, and his daughter, Josephine, expressing appreciation for their support throughout the years and in the future.')]

In [102]:
memory.load_memory_variables({})

{'chat_history': [HumanMessage(content='who are the authors of this book?'),
  AIMessage(content='The authors of this book are Ben Straub and Scott Chacon.'),
  HumanMessage(content='please give more details about them.'),
  AIMessage(content='Ben Straub is one of the authors of the book "Pro Git." He has written a preface for the book and is mentioned in the dedications section, where he expresses gratitude to his wife, Becky, for her support in embarking on this adventure.\n\nScott Chacon is also one of the authors of "Pro Git." He has written a preface for the book and is mentioned in the dedications section. Scott dedicates this edition of the book to his wife, Jessica, and his daughter, Josephine, expressing appreciation for their support throughout the years and in the future.')]}

In [103]:
question = "what does NASA do?"
result = conv_chain({"question": question})
print(result["answer"])

I don't know.
