# LangChain: Q&A over Documents

One example application is a tool that lets you query a product catalog for items of interest.

We want to combine language models with a large collection of our own documents, but there is a key issue: a language model can only inspect a few thousand words at a time. So if we have really large documents, how can we get the language model to answer questions about everything in them?

This is where embeddings and vector stores come into play. <br>
First, let's talk about embeddings.

### Embeddings

- An embedding vector captures the content/meaning of a piece of text.
- Texts with similar content have similar vectors.

Embeddings are useful for deciding which pieces of text to include when passing context to the language model to answer a question.
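
To make "similar content, similar vectors" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors and their labels are invented for illustration; real embedding models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (not produced by any real model).
toy_vectors = {
    "sun-protective shirt": [0.9, 0.1, 0.0],
    "UPF hiking top": [0.8, 0.3, 0.1],
    "insulated snow boots": [0.0, 0.2, 0.9],
}

shirt = toy_vectors["sun-protective shirt"]
sim_related = cosine_similarity(shirt, toy_vectors["UPF hiking top"])
sim_unrelated = cosine_similarity(shirt, toy_vectors["insulated snow boots"])

# The two shirt-like items score much higher than the shirt/boots pair.
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

A score close to 1 means the vectors point in nearly the same direction; a score close to 0 means they are nearly unrelated.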



### Vector database

The next component that we're going to cover is the vector database. 
A vector database is a way to store these vector representations that we created in the previous step. 

Storing and searching the full vector representation of every incoming document is inefficient. Instead, we split each incoming document into chunks, embed each chunk, index each chunk via locality-sensitive hashing (LSH), product quantization (PQ), or a similar technique, and store the index representation together with the index lookup table (which maps a given index to the corresponding embedding vector).

By doing so, we reduce each original embedding to a coarser, approximate representation. It may not be 100% accurate, but it saves memory. Furthermore, since the number of distinct representations is now limited, we can refer to each one by an index and use an index table for the lookup when needed, which further reduces memory usage.
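
As a rough illustration of the indexing idea, the sketch below implements random-hyperplane LSH, one common LSH family for cosine similarity. It is not meant to describe what any particular vector database does internally; the function names, the toy vectors, and the choice of 8 hyperplanes are all assumptions made for the example.

```python
import random

def make_hyperplanes(num_planes, dim, seed=0):
    """Draw random hyperplane normals; a fixed seed keeps the index reproducible."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_planes)]

def lsh_signature(vector, hyperplanes):
    """One bit per hyperplane: '1' if the vector lies on its non-negative side."""
    bits = []
    for plane in hyperplanes:
        dot = sum(v * p for v, p in zip(vector, plane))
        bits.append("1" if dot >= 0 else "0")
    return "".join(bits)

planes = make_hyperplanes(num_planes=8, dim=3)

# Hypothetical toy embeddings, invented for illustration.
toy_vectors = {
    "chunk-0": [0.9, 0.1, 0.0],
    "chunk-1": [0.8, 0.3, 0.1],
    "chunk-2": [0.0, 0.2, 0.9],
}

# The lookup table: short signature -> ids of the chunks that share it.
buckets = {}
for chunk_id, vec in toy_vectors.items():
    buckets.setdefault(lsh_signature(vec, planes), []).append(chunk_id)

print(buckets)
```

Each vector is reduced to an 8-bit signature, so it loses detail but becomes cheap to store and compare. Vectors separated by a small angle are likely, though not guaranteed, to share a signature, so a query only needs to be compared against the chunks in its own bucket.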

Reference: [What is a Vector Database?](https://www.pinecone.io/learn/vector-database/)


When we get a big incoming document, we first break it up into smaller chunks. This is useful because we may not be able to pass the whole document to the language model. With small chunks, we can pass only the most relevant ones to the language model. We then create an embedding for each of these chunks and store those embeddings in a vector database. That is what happens when we create the index.
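
The chunking step can be sketched in a few lines. This is a simplified character-based splitter with overlap, written for illustration; `split_into_chunks` and its parameters are hypothetical names, and LangChain's own text splitters are more sophisticated (they try to split on natural boundaries, for instance).

```python
def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into windows of chunk_size characters that overlap by `overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample_chunks = split_into_chunks("abcdefghij", chunk_size=4, overlap=1)
print(sample_chunks)  # ['abcd', 'defg', 'ghij']
```

The overlap means text that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some characters twice.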

#pip install --upgrade langchain
#pip install docarray

In [None]:
import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

In [None]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown
import pandas as pd 
from langchain.indexes import VectorstoreIndexCreator

In [None]:
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

df = pd.read_csv(file, index_col=0)
df.head()

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])

query = "Please list all your shirts with sun protection \
in a table in markdown and summarize each one."
response = index.query(query)
display(Markdown(response))

### Understand Step-by-Step

In [None]:
loader = CSVLoader(file_path=file)
docs = loader.load()
docs[0]  # each document corresponds to one row of the CSV

In [None]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

# query/sentence embedding
embed = embeddings.embed_query("Hi my name is Harrison")
print(len(embed))
print(embed[:5])

In [None]:
# create a vector database from docs and an embedding model
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [None]:
# now we can query the vector database
query = "Please suggest a shirt with sunblocking"
docs = db.similarity_search(query)
# similarity_search returns the 4 most similar docs by default
print(len(docs))
print(docs[0])
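
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The brute-force sketch below shows that idea with invented two-dimensional vectors. It is not the implementation behind `DocArrayInMemorySearch`; the helper names and the toy `store` are assumptions for illustration.

```python
import heapq
import math

def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def similarity_search(query_vector, store, k=4):
    """Return the k (text, vector) entries most similar to the query."""
    q = normalize(query_vector)
    return heapq.nlargest(k, store, key=lambda item: dot(q, item[1]))

# Hypothetical store: vectors are normalized once, when they are added.
store = [
    ("doc A", normalize([0.9, 0.1])),
    ("doc B", normalize([0.8, 0.3])),
    ("doc C", normalize([0.5, 0.5])),
    ("doc D", normalize([0.2, 0.9])),
    ("doc E", normalize([0.0, 1.0])),
]

results = similarity_search([1.0, 0.0], store, k=4)
print([text for text, _ in results])
```

Comparing the query against every stored vector is fine for a small in-memory catalog. Larger collections are where approximate indexes such as LSH or PQ become worthwhile.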

In [None]:
# Create a retriever from vector database, which is a generic interface 
# that can be underpinned by any method that takes in queries and outputs docs
# It's an interface for fetching docs
retriever = db.as_retriever()

llm = ChatOpenAI(temperature=0.0)
# concatenate the text of the retrieved docs into one string
qdocs = "".join(doc.page_content for doc in docs)
print(qdocs)

In [None]:
response = llm.call_as_llm(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 
display(Markdown(response))

In [None]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
)

query = "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."
response = qa_stuff.run(query)

In [None]:
display(Markdown(response))
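
The `"stuff"` chain type places all retrieved documents into a single prompt, much like the manual `qdocs` step earlier. The sketch below shows the idea; `build_stuff_prompt` is a hypothetical helper, and the template wording is an assumption, not the prompt LangChain actually uses.

```python
def build_stuff_prompt(chunks, question):
    """Concatenate every retrieved chunk into one context block, then ask the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following pieces of context to answer the question at the end.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, invented for illustration.
example_chunks = [
    "name: Shirt A\ndescription: Lightweight shirt with sun protection.",
    "name: Shirt B\ndescription: Long-sleeve shirt with sun protection.",
]
prompt = build_stuff_prompt(example_chunks, "Which shirts offer sun protection?")
print(prompt)
```

Because everything goes into one call, this approach is simple and uses a single LLM request, but the combined chunks must still fit within the model's context window.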

### The steps above are equivalent to the following

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])

response = index.query(query, llm=llm)