# LangChain: Q&A over Documents

One example of Q&A over documents is a tool that lets you query a product catalog for items of interest.


Different types of chains
1. `stuff` - puts all documents into a single prompt as context
2. `map_reduce` - passes each chunk with the question to an LLM, collects the individual responses, then uses a final LLM call to summarise them into one answer (divide and conquer)
    * allows for a large number of docs
    * faster to run, because the per-chunk calls can run in parallel
    * treats all docs as independent
3. `refine` - iterates over the docs one by one, updating the answer at each step (similar in spirit to boosting)
    * good for combining info and building up an answer over time
    * tends to create longer answers
    * tends to be slow because each call depends on the previous one, so the calls run one by one
4. `map_rerank` - makes one call per doc and asks the LLM for a score along with its answer; the highest-scoring answer is returned
    * you need to give the LLM judging criteria for "what is a good score"
    * relatively fast, because you can batch the calls
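
The four strategies differ mainly in how many LLM calls they make and what each call sees. Here is a minimal sketch in plain Python, not LangChain's actual implementation: `CountingLLM` and `fake_scored_llm` are hypothetical stand-ins for a real model, and the prompt wording is made up for illustration.

```python
from typing import Callable, List, Tuple


class CountingLLM:
    """Hypothetical stand-in for a model: counts calls, returns a dummy answer."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        return f"answer {self.calls}"


def fake_scored_llm(prompt: str) -> Tuple[str, int]:
    """Hypothetical scorer: the score is how often 'sun' appears in the prompt."""
    return f"based on: {prompt.splitlines()[0]}", prompt.lower().count("sun")


def stuff(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    # One call: every document goes into a single prompt.
    context = "\n\n".join(docs)
    return llm(f"{context}\n\nQuestion: {question}")


def map_reduce(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    # Map: one independent call per document (these could run in parallel).
    partials = [llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    # Reduce: one final call that combines the partial answers.
    return llm("Combine these answers:\n" + "\n".join(partials))


def refine(llm: Callable[[str], str], docs: List[str], question: str) -> str:
    # Sequential: each call sees the answer so far plus the next document.
    answer = ""
    for doc in docs:
        answer = llm(
            f"Current answer: {answer}\nNew context: {doc}\nQuestion: {question}"
        )
    return answer


def map_rerank(
    scored_llm: Callable[[str], Tuple[str, int]], docs: List[str], question: str
) -> str:
    # One call per document returns (answer, score); keep the best-scoring answer.
    scored = [scored_llm(f"{doc}\n\nQuestion: {question}") for doc in docs]
    return max(scored, key=lambda pair: pair[1])[0]


sample_docs = ["Sun Shield Shirt, UPF 50+ sun protection", "Campside Oxfords, canvas shoe"]
counter = CountingLLM()
map_reduce(counter, sample_docs, "Which item blocks UV rays?")
print(counter.calls)  # 3: one call per document plus the final combine call
```

With two documents, `stuff` makes one call, `refine` makes two dependent calls, and `map_reduce` makes two independent calls plus one combining call.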

In [46]:
!pip3 install --upgrade sqlalchemy


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [7]:
!pip3 install docarray


In [1]:
import os
import openai

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

In [2]:
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-3.5-turbo"
else:
    llm_model = "gpt-3.5-turbo-0301"

In [11]:
from langchain.chains import RetrievalQA
from langchain_openai import OpenAI  # single import; a second OpenAI import from langchain.llms would shadow this one
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display, Markdown

In [4]:
file = 'L4_OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)

In [5]:
from langchain.indexes import VectorstoreIndexCreator

In [8]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])



In [9]:
query ="Please list all your shirts with sun protection \
in a table in markdown and summarize each one."

In [12]:
llm_replacement_model = OpenAI(temperature=0, 
                               model='gpt-3.5-turbo-instruct')

response = index.query(query, 
                       llm = llm_replacement_model)

In [13]:
display(Markdown(response))



| Name | Description | Sun Protection Rating |
| --- | --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | Made of 100% polyester, UPF 50+ rating, wrinkle-resistant, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's Plaid Tropic Shirt, Short-Sleeve | Made of 52% polyester and 48% nylon, UPF 50+ rating, SunSmart technology, wrinkle-free, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Men's TropicVibe Shirt, Short-Sleeve | Made of 71% nylon and 29% polyester, UPF 50+ rating, front and back cape venting, two front bellows pockets | SPF 50+, blocks 98% of harmful UV rays |
| Sun Shield Shirt | Made of 78% nylon and 22% Lycra Xtra Life fiber, UPF 50+ rating, moisture-wicking, abrasion-resistant, fits over swimsuit | SPF 50+, blocks 98% of harmful UV rays |

## Step By Step

In [14]:
from langchain.document_loaders import CSVLoader
loader = CSVLoader(file_path=file)

In [15]:
docs = loader.load()

In [17]:
docs[0] # each document represents one product
# the documents are small, so there is no need to chunk them before creating embeddings

Document(page_content=": 0\nname: Women's Campside Oxfords\ndescription: This ultracomfortable lace-to-toe Oxford boasts a super-soft canvas, thick cushioning, and quality construction for a broken-in feel from the first time you put them on. \n\nSize & Fit: Order regular shoe size. For half sizes not offered, order up to next whole size. \n\nSpecs: Approx. weight: 1 lb.1 oz. per pair. \n\nConstruction: Soft canvas material for a broken-in feel and look. Comfortable EVA innersole with Cleansport NXT® antimicrobial odor control. Vintage hunt, fish and camping motif on innersole. Moderate arch contour of innersole. EVA foam midsole for cushioning and support. Chain-tread-inspired molded rubber outsole with modified chain-tread pattern. Imported. \n\nQuestions? Please contact us for any inquiries.", metadata={'source': 'L4_OutdoorClothingCatalog_1000.csv', 'row': 0})
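
The `page_content` above is one CSV row flattened into `column: value` lines, and the leading `: 0` suggests the first column has an empty header. Here is a rough sketch of that flattening using only the standard library. It approximates what the loader appears to do rather than reproducing its code; `SAMPLE_CSV` and `rows_to_documents` are hypothetical names, and the sample rows are shortened.

```python
import csv
import io

# Hypothetical two-row sample; the first column has an empty header,
# like the index column in the catalog CSV.
SAMPLE_CSV = (
    ",name,description\n"
    "0,Women's Campside Oxfords,Soft canvas Oxford\n"
    "1,Sun Shield Shirt,UPF 50+ sun shirt\n"
)


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with page_content and metadata."""
    documents = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader):
        # One "column: value" line per field, in column order.
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {
                "page_content": page_content,
                "metadata": {"source": source, "row": row_number},
            }
        )
    return documents


for document in rows_to_documents(SAMPLE_CSV, "sample.csv"):
    print(repr(document["page_content"]), document["metadata"])
```

Because each row becomes its own small document, every product is embedded and retrieved as a unit.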

In [18]:
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [19]:
embed = embeddings.embed_query("Hi my name is Harrison")

In [20]:
print(len(embed))

1536


In [21]:
print(embed[:5])

[-0.021929145991628055, 0.00675322289990029, -0.01822616999085032, -0.03916366973085613, -0.014046201276369918]


In [23]:
# create a vector store based on OpenAI embeddings
db = DocArrayInMemorySearch.from_documents(
    docs, 
    embeddings
)

In [24]:
query = "Please suggest a shirt with sunblocking"

In [25]:
docs = db.similarity_search(query) # find the most relevant documents for the query 

In [27]:
len(docs)

4

In [28]:
docs[0]

Document(page_content=': 255\nname: Sun Shield Shirt by\ndescription: "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. \n\nSize & Fit: Slightly Fitted: Softly shapes the body. Falls at hip.\n\nFabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry.\n\nAdditional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported.\n\nSun Protection That Won\'t Wear Off\nOur high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun\'s harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant.', metadata={'source': 'L4_OutdoorClothingCatalog_1000.csv', 'row': 255})
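
Conceptually, a similarity search embeds the query, compares that vector with every stored document vector, and returns the closest matches. Here is a toy sketch, assuming cosine similarity as the metric (the metric the real store uses is not shown in this notebook). The three-dimensional vectors in `TOY_VECTORS` are invented for illustration; real embeddings here have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, document_vectors, k=4):
    """Return the names of the k documents most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in document_vectors.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]


# Invented toy vectors: the first axis loosely stands for "sun protection".
TOY_VECTORS = {
    "Sun Shield Shirt": [0.9, 0.1, 0.0],
    "Men's TropicVibe Shirt": [0.8, 0.3, 0.1],
    "Women's Campside Oxfords": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], TOY_VECTORS, k=2))
```

The default `k=4` in the sketch mirrors the four documents returned by the search above.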

In [29]:
# create a retriever for the vector db
retriever = db.as_retriever()

In [31]:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(temperature = 0.0, model=llm_model)

In [33]:
# join all relevant documents into one variable
qdocs = "".join(doc.page_content for doc in docs)


In [37]:
# pass the variable as context 
response = llm.invoke(f"{qdocs} Question: Please list all your \
shirts with sun protection in a table in markdown and summarize each one.") 


In [38]:
display(Markdown(response.content)) # llm.invoke returns an AIMessage; Markdown needs its text content

| Shirt Name | Description |
| --- | --- |
| Sun Shield Shirt | High-performance sun shirt with UPF 50+ sun protection, moisture-wicking, and abrasion-resistant fabric. Recommended by The Skin Cancer Foundation. |
| Men's Plaid Tropic Shirt | Ultracomfortable shirt with UPF 50+ sun protection, wrinkle-free fabric, and front/back cape venting. Made with 52% polyester and 48% nylon. |
| Men's TropicVibe Shirt | Men's sun-protection shirt with built-in UPF 50+ and front/back cape venting. Made with 71% nylon and 29% polyester. |
| Men's Tropical Plaid Short-Sleeve Shirt | Lightest hot-weather shirt with UPF 50+ sun protection, front/back cape venting, and two front bellows pockets. Made with 100% polyester and is wrinkle-resistant. |

All of these shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. They are made with high-performance fabrics that are moisture-wicking, wrinkle-resistant, and abrasion-resistant. The Men's Plaid Tropic Shirt and Men's Tropical Plaid Short-Sleeve Shirt both have front/back cape venting for added breathability. The Sun Shield Shirt is recommended by The Skin Cancer Foundation as an effective UV protectant.

## Create a retrieval question & answer chain

In [40]:
qa_stuff = RetrievalQA.from_chain_type(
    llm=llm, 
    chain_type="stuff", # stuff all documents into context
    retriever=retriever, 
    verbose=True
)

In [41]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

In [42]:
response = qa_stuff.invoke(query)



> Entering new RetrievalQA chain...

> Finished chain.


In [44]:
display(response)

{'query': 'Please list all your shirts with sun protection in a table in markdown and summarize each one.',
 'result': "| Shirt Number | Name | Description |\n| --- | --- | --- |\n| 618 | Men's Tropical Plaid Short-Sleeve Shirt | Rated UPF 50+ for superior protection from the sun's UV rays. Made of 100% polyester and is wrinkle-resistant. With front and back cape venting that lets in cool breezes and two front bellows pockets. |\n| 374 | Men's Plaid Tropic Shirt, Short-Sleeve | Rated to UPF 50+ and offers sun protection. Made with 52% polyester and 48% nylon, this shirt is machine washable and dryable. Additional features include front and back cape venting, two front bellows pockets. |\n| 535 | Men's TropicVibe Shirt, Short-Sleeve | Built-in UPF 50+ has the lightweight feel you want and the coverage you need when the air is hot and the UV rays are strong. Made with 71% Nylon, 29% Polyester. Additional features include wrinkle resistance, front and back cape venting, and two front bell
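
The calls in this notebook return three different shapes: `index.query` gives a plain string, `llm.invoke` gives an `AIMessage` whose text is in `.content`, and `qa_stuff.invoke` gives a dict with `query` and `result` keys. One way to render any of them with `Markdown` is a small normalising helper. The sketch below is an assumption-laden illustration: `to_markdown_text` is a hypothetical helper, and `FakeMessage` is a stand-in for a real message object.

```python
from dataclasses import dataclass


@dataclass
class FakeMessage:
    """Hypothetical stand-in for a chat message object with a .content field."""

    content: str


def to_markdown_text(response):
    """Extract displayable text from a str, a chain result dict, or a message."""
    if isinstance(response, str):
        return response
    if isinstance(response, dict) and "result" in response:
        return response["result"]
    content = getattr(response, "content", None)
    if isinstance(content, str):
        return content
    raise TypeError(f"Cannot extract text from {type(response).__name__}")


print(to_markdown_text("| Name |"))
print(to_markdown_text(FakeMessage(content="| Shirt Name |")))
print(to_markdown_text({"query": "shirts?", "result": "| Shirt Number |"}))
```

In the notebook, the result would then be passed along as `display(Markdown(to_markdown_text(response)))`.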

In [45]:
response = index.query(query, llm=llm)

In [None]:
index = VectorstoreIndexCreator(
    vectorstore_cls=DocArrayInMemorySearch,
    embedding=embeddings,
).from_loaders([loader])
