## Q&A over a document or user data.

Run queries against data from PDFs and internal company files.  
**Uses embedding models and vector stores.**

In [None]:
%pip install langchain -U

In [1]:
import os
from dotenv import load_dotenv,find_dotenv
from langchain.chat_models import ChatOpenAI
_ = load_dotenv(find_dotenv())

chat_model = ChatOpenAI(temperature=0.0)


In [2]:
from langchain.chains import RetrievalQA # a chain for Q&A-style conversations
from langchain.document_loaders import CSVLoader
from langchain.vectorstores import DocArrayInMemorySearch
from IPython.display import display,Markdown

# load csv file 

loader = CSVLoader(file_path = "qa_data.csv")

In [None]:
#%pip install docarray

LLMs can only work with a few thousand words at a time (the context length, or token limit).

How do we answer questions about large or multiple documents?  
Answer: use embeddings and vector stores.

Embeddings:
- An embedding is a numerical representation (a vector) of a piece of text.
- The vector captures the semantic meaning of that text.
- Pieces of text with similar meaning have similar vectors.

Vector database/store:
- A way to store vector representations.
- It is populated with chunks of text, i.e. pieces smaller than the original document.
- An embedding is created for each chunk, and together these form the index.
- At query time, the store finds the chunks most relevant to the incoming query.

When a query comes in, first create an embedding for the query,
then compare it to all vectors in the store and pick the most similar ones.
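
The lookup described above can be sketched with toy data. This is a minimal illustration, not how `DocArrayInMemorySearch` is implemented: the three-dimensional vectors are made up (real embeddings have many more dimensions), and `cosine_similarity`, `top_k`, `toy_index` and `toy_query_vec` are hypothetical names used only here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k (text, score) pairs most similar to query_vec."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy "embeddings", invented for illustration only.
toy_index = [
    ("sun protection shirt", [0.9, 0.1, 0.0]),
    ("waterproof hiking boots", [0.0, 0.2, 0.9]),
    ("UPF 50+ fishing shirt", [0.8, 0.3, 0.1]),
]
toy_query_vec = [1.0, 0.2, 0.0]  # stands in for the embedded query

results = top_k(toy_query_vec, toy_index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The two shirt entries rank above the boots because their vectors point in nearly the same direction as the query vector.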



In [None]:
# create a vector index - helper
# change import in _sql_record_manager
from langchain.indexes import VectorstoreIndexCreator

index = VectorstoreIndexCreator( 
    # embedding the whole file in one go can raise RateLimitError (tokens-per-minute limit)
    vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])



In [7]:
# now use index to ask questions

query = "Please list all shirts with sun protection in a table in markdown and summarize each one."

response = index.query(query)

display(Markdown(response))



| Name | Description |
| --- | --- |
| Men's Tropical Plaid Short-Sleeve Shirt | UPF 50+ rated, 100% polyester, wrinkle-resistant, front and back cape venting, two front bellows pockets |
| Men's Plaid Tropic Shirt, Short-Sleeve | UPF 50+ rated, 52% polyester and 48% nylon, machine washable and dryable, front and back cape venting, two front bellows pockets |
| Men's TropicVibe Shirt, Short-Sleeve | UPF 50+ rated, 71% Nylon, 29% Polyester, 100% Polyester knit mesh, wrinkle resistant, front and back cape venting, two front bellows pockets |
| Sun Shield Shirt by | UPF 50+ rated, 78% nylon, 22% Lycra Xtra Life fiber, wicks moisture, fits comfortably over swimsuit, abrasion resistant |

All four shirts provide UPF 50+ sun protection, blocking 98% of the sun's harmful rays. The Men's Tropical Plaid Short-Sleeve Shirt is made of 100% polyester and is wrinkle-resistant. The Men's Plaid Trop

## Step by step breakdown

In [3]:
# load docs
docs = loader.load()
print(len(docs))

# get embedding model
from langchain.embeddings import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

# embed = embedding.embed_query(query)
# print(len(embed))

1000


In [4]:
import tiktoken

def count_tokens(docs):
  """
  Counts the total number of tokens in a list of documents.

  Args:
    docs: A list of documents.

  Returns:
    The total number of tokens.
  """
  total_tokens = 0
  encoding = tiktoken.encoding_for_model("text-embedding-ada-002") # default-text-embedding-ada-002
  for doc in docs:
    total_tokens += len(encoding.encode(doc.page_content) )
  return total_tokens

total_tokens = count_tokens(docs)
print(total_tokens)

176510


In [5]:
# create vector db
# total token count exceeds the per-minute limit, so split the docs into two halves
docs_len = len(docs)
half = docs_len//2
docs_set_one = docs[:half]
print(count_tokens(docs_set_one))
docs_set_two = docs[half:]
print(count_tokens(docs_set_two))

# calls embedding API
# db_one = DocArrayInMemorySearch.from_documents(documents=docs_set_one , embedding=embedding)
# need logic to bypass ratelimit error
# db_two = DocArrayInMemorySearch.from_documents(documents=docs_set_two , embedding=embedding) 





87415
89095
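
Splitting in half works for this file, but the same idea generalizes to any number of batches. The sketch below groups items into consecutive batches that each stay within a token budget. It is an assumption-laden illustration: `batch_by_token_budget` and `toy_counts` are hypothetical names, and the counts are made up rather than taken from `count_tokens`.

```python
def batch_by_token_budget(token_counts, budget):
    """Group item indices into consecutive batches whose token totals stay within budget.

    An item that is larger than the budget on its own still gets a batch to itself.
    """
    batches, current, current_total = [], [], 0
    for i, n in enumerate(token_counts):
        # Close the current batch if adding this item would exceed the budget.
        if current and current_total + n > budget:
            batches.append(current)
            current, current_total = [], 0
        current.append(i)
        current_total += n
    if current:
        batches.append(current)
    return batches

# Made-up per-document token counts, for illustration only.
toy_counts = [60, 50, 40, 70, 30]
batches = batch_by_token_budget(toy_counts, budget=100)
print(batches)
```

In the notebook, the per-document counts could come from the tokenizer used in `count_tokens`, and each batch could then be embedded separately with a pause between batches so the per-minute window can reset.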


In [6]:
# RateLimitError: embedding all docs in one call exceeds the 150,000 tokens/min limit
#db_all = DocArrayInMemorySearch.from_documents(documents=docs , embedding=embedding)

vector_db = DocArrayInMemorySearch.from_documents(documents=docs_set_one , embedding=embedding)
added_ids = vector_db.add_documents(docs_set_two) # add set two docs to initial vector db 




Retrying langchain.embeddings.openai.embed_with_retry.<locals>._embed_with_retry in 4.0 seconds as it raised RateLimitError: Rate limit reached for default-text-embedding-ada-002 in organization org-Ipkgjb8eV212suWndEgA3XmD on tokens per min. Limit: 150000 / min. Current: 74149 / min. Contact us through our help center at help.openai.com if you continue to have issues. Please add a payment method to your account to increase your rate limit. Visit https://platform.openai.com/account/billing to add a payment method..


In [7]:
temp_query = "Please suggest a shirt with sunblocking for men"

filtered_docs = vector_db.similarity_search(temp_query)

print(len(filtered_docs))

4


## Note :

The similarity search result from one combined vector DB differs from  
running similarity search on two sub-vector DBs and concatenating the results.
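
One reason for the difference: each sub-DB returns its own top k, so concatenating gives 2k documents, some of which would not make the combined top k. The sketch below uses made-up scores to show this, and that re-ranking the concatenated results by score recovers the single-index answer. It assumes exact search and scores that are comparable across the two stores, which may not hold for every vector store; all names here are hypothetical.

```python
def top_k(scored_docs, k):
    """Return the k highest-scoring (name, score) pairs."""
    return sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:k]

# Made-up similarity scores for one query; higher means more similar.
shard_one = [("a", 0.95), ("b", 0.90), ("c", 0.40), ("d", 0.30)]
shard_two = [("e", 0.85), ("f", 0.80), ("g", 0.20), ("h", 0.10)]
k = 2

# What a single index over all documents would return.
single_index = top_k(shard_one + shard_two, k)

# What you get by querying each shard and concatenating: 2k results.
concatenated = top_k(shard_one, k) + top_k(shard_two, k)

# Re-ranking the concatenated results and truncating to k.
merged = top_k(concatenated, k)

print(single_index)
print(concatenated)
print(merged)
```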

In [64]:

# response_one = db_one.similarity_search(temp_query)
# response_two = db_two.similarity_search(temp_query)
# response_docs = response_one + response_two
# print(len(response_one))
# print(len(response_two))
# print(len(response_docs))

4
4
8


In [None]:
"""
Men's Plaid Tropic Shirt - 374
Sun Shield Shirt - 255
Girls' Ocean Breeze Long-Sleeve Stripe Shirt - 293
Women's Tropical Plaid Shirt - 87
Men's TropicVibe Shirt, Short-Sleeve - 535
Men's Tropical Plaid Short-Sleeve Shirt - 618
Women's Tropical Tee, Sleeveless - 679
Girls' Beachside Breeze Shirt, Half-Sleeve - 617
"""

In [8]:


query_docs = "".join([doc.page_content for doc in filtered_docs])

In [10]:
llm_response = chat_model.call_as_llm(f"""{query_docs} \n Question:\n
Please list all your \
shirts with sun protection in a table in markdown and summarize each one.
""")

In [11]:
display(Markdown(llm_response))

| Shirt Name | Description | Summary |
|------------|-------------|---------|
| Men's Plaid Tropic Shirt, Short-Sleeve | Our Ultracomfortable sun protection is rated to UPF 50+, helping you stay cool and dry. Originally designed for fishing, this lightest hot-weather shirt offers UPF 50+ coverage and is great for extended travel. SunSmart technology blocks 98% of the sun's harmful UV rays, while the high-performance fabric is wrinkle-free and quickly evaporates perspiration. Made with 52% polyester and 48% nylon, this shirt is machine washable and dryable. Additional features include front and back cape venting, two front bellows pockets and an imported design. With UPF 50+ coverage, you can limit sun exposure and feel secure with the highest rated sun protection available. | Ultracomfortable and lightweight shirt with UPF 50+ sun protection. Features include wrinkle-free fabric, quick-drying, and venting for breathability. |
| Men's TropicVibe Shirt, Short-Sleeve | This Men’s sun-protection shirt with built-in UPF 50+ has the lightweight feel you want and the coverage you need when the air is hot and the UV rays are strong. Size & Fit: Traditional Fit: Relaxed through the chest, sleeve and waist. Fabric & Care: Shell: 71% Nylon, 29% Polyester. Lining: 100% Polyester knit mesh. UPF 50+ rated – the highest rated sun protection possible. Machine wash and dry. Additional Features: Wrinkle resistant. Front and back cape venting lets in cool breezes. Two front bellows pockets. Imported. Sun Protection That Won't Wear Off: Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. | Lightweight and breathable shirt with UPF 50+ sun protection. Features include wrinkle resistance, venting for breathability, and two front pockets. |
| Men's Tropical Plaid Short-Sleeve Shirt | Our lightest hot-weather shirt is rated UPF 50+ for superior protection from the sun's UV rays. With a traditional fit that is relaxed through the chest, sleeve, and waist, this fabric is made of 100% polyester and is wrinkle-resistant. With front and back cape venting that lets in cool breezes and two front bellows pockets, this shirt is imported and provides the highest rated sun protection possible. Sun Protection That Won't Wear Off. Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. | Lightweight and relaxed fit shirt with UPF 50+ sun protection. Features include wrinkle resistance, venting for breathability, and two front pockets. |
| Sun Shield Shirt by | "Block the sun, not the fun – our high-performance sun shirt is guaranteed to protect from harmful UV rays. Size & Fit: Slightly Fitted: Softly shapes the body. Falls at hip. Fabric & Care: 78% nylon, 22% Lycra Xtra Life fiber. UPF 50+ rated – the highest rated sun protection possible. Handwash, line dry. Additional Features: Wicks moisture for quick-drying comfort. Fits comfortably over your favorite swimsuit. Abrasion resistant for season after season of wear. Imported. Sun Protection That Won't Wear Off Our high-performance fabric provides SPF 50+ sun protection, blocking 98% of the sun's harmful rays. This fabric is recommended by The Skin Cancer Foundation as an effective UV protectant. | High-performance sun shirt with UPF 50+ sun protection. Features include moisture-wicking fabric, comfortable fit, and abrasion resistance. |


## Retrieval Methods :

`stuff` method: the simplest; put all the retrieved data into one prompt and call the LLM once.
 - pro: a single call to the LLM
 - con: the context length can be a bottleneck

`map_reduce` method: call the LLM once per chunk (map), then call it again to combine the per-chunk answers (reduce).
 - pro: the context limit is less of a problem, since each call sees only one chunk and the map calls can run in parallel
 - con: the many calls can hit RateLimitError, and because each chunk is processed independently, context that spans chunks may be lost

`refine` method: call the LLM iteratively over the chunks, refining the previous answer with each new chunk.
 - con: slower, since the calls run one after another and each builds on the previous result
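
The call patterns of the three methods can be compared with a stub in place of the model. This is a rough sketch of the control flow only, not LangChain's implementation: `make_counting_llm` is a hypothetical stand-in that records prompts instead of calling an API, and the prompt wording is invented.

```python
def make_counting_llm():
    """Return a hypothetical stand-in for a model call, plus its call log."""
    log = []
    def llm(prompt):
        log.append(prompt)
        return f"answer-{len(log)}"
    return llm, log

def stuff(llm, chunks, question):
    # One call: every chunk goes into a single prompt.
    context = "\n".join(chunks)
    return llm(f"{context}\nQuestion: {question}")

def map_reduce(llm, chunks, question):
    # Map: one independent call per chunk. Reduce: one more call to combine.
    partials = [llm(f"{chunk}\nQuestion: {question}") for chunk in chunks]
    combined = "\n".join(partials)
    return llm(f"{combined}\nCombine these partial answers to: {question}")

def refine(llm, chunks, question):
    # Sequential: each call sees the previous answer plus the next chunk.
    answer = llm(f"{chunks[0]}\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {chunk}\n"
            f"Refine the answer to: {question}"
        )
    return answer

chunks = ["chunk one", "chunk two", "chunk three"]
call_counts = {}
for name, strategy in [("stuff", stuff), ("map_reduce", map_reduce), ("refine", refine)]:
    llm, log = make_counting_llm()
    strategy(llm, chunks, "Which shirts have sun protection?")
    call_counts[name] = len(log)

print(call_counts)
```

With three chunks, `stuff` makes one call, `map_reduce` makes four (three map calls plus one reduce), and `refine` makes three calls that must run in order.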

In [22]:
# get a retriever to run Questions against 
retriever = vector_db.as_retriever(search_type = "similarity")

qa_stuff = RetrievalQA.from_chain_type(
    llm=chat_model, 
    chain_type="stuff", 
    retriever=retriever, 
    verbose=True
) # building the chain makes no call to the LLM

In [24]:
print(retriever)

tags=['DocArrayInMemorySearch'] metadata=None vectorstore=<langchain.vectorstores.docarray.in_memory.DocArrayInMemorySearch object at 0x12933cf90> search_type='similarity' search_kwargs={}


In [23]:
query =  "Please list all your shirts with sun protection in a table \
in markdown and summarize each one."

retreiver_reponse = qa_stuff.run(query) # the run takes 5 min and returns no result




> Entering new RetrievalQA chain...

> Finished chain.


In [26]:
print(retreiver_reponse)

| Shirt ID | Shirt Name | Description