# Question Answering

## Overview

Recall the overall workflow for retrieval augmented generation (RAG):

![overview.jpeg](attachment:overview.jpeg)

We discussed `Document Loading` and `Splitting` as well as `Storage` and `Retrieval`.

Let's load our vectorDB.

In [46]:
# !pip install langchain openai chromadb tiktoken
# !pip install pypdf
# !pip install python-dotenv
# !pip install jupyter_bokeh
# !pip install -U langchain-community
# !pip install PyPDF2
# !pip install python-docx

In [47]:
# Libraries for extracting text from PDF and Word files and embedding it
from PyPDF2 import PdfReader
from docx import Document
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings

In [48]:
import os
import openai
import sys
sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key  = os.environ['OPENAI_API_KEY']
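
Reading the key from the environment keeps it out of the notebook itself. As a minimal sketch, the guard below fails early with a readable message when the variable is missing. `require_env` is a hypothetical helper written for this illustration; it is not part of `openai` or `python-dotenv`.

```python
import os

def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file")
    return value
```

In this notebook it could be used as `openai.api_key = require_env("OPENAI_API_KEY")`.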

The code below pins the OpenAI model version that was used when the course was filmed, until that version is deprecated (scheduled for September 2023).
LLM responses often vary between runs, and they may differ significantly when a different model version is used.

In [49]:
import datetime
current_date = datetime.datetime.now().date()
if current_date < datetime.date(2023, 9, 2):
    llm_name = "gpt-3.5-turbo-0301"
else:
    llm_name = "gpt-3.5-turbo"
print(llm_name)

gpt-3.5-turbo


In [50]:
# from google.colab import drive  # only needed when running on Google Colab
# drive.mount('/content/drive')
persist_directory = 'Embeddings'
# persist_directory = 'sample_data/The_History_of_Starbucks.pdf'
# embedding = OpenAIEmbeddings()
# Read the API key from the environment rather than hard-coding it in the notebook
embedding = OpenAIEmbeddings(openai_api_key=os.environ['OPENAI_API_KEY'])
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [51]:
# Extract the text of every page of a PDF file
def extract_text_from_pdf(pdf_path):
    reader = PdfReader(pdf_path)
    text = ''
    for page in reader.pages:
        text += page.extract_text()
    return text

In [52]:
# Extract the text of every paragraph of a .docx file
def extract_text_from_docx(docx_path):
    doc = Document(docx_path)
    text = ''
    for para in doc.paragraphs:
        text += para.text + '\n'
    return text
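
PDF extraction can return text with line breaks in odd places; the source documents printed later in this notebook show a newline between nearly every word. The sketch below collapses that whitespace before the text is stored. `normalize_whitespace` is a hypothetical helper, and treating a blank line as a paragraph break is an assumption that may not suit every file.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of whitespace inside paragraphs; keep blank-line paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text)              # split on blank lines
    cleaned = [" ".join(p.split()) for p in paragraphs]  # single spaces within a paragraph
    return "\n\n".join(p for p in cleaned if p)          # drop empty paragraphs

print(normalize_whitespace("The\nEvolution\nof\nStarbucks"))
```

It could be applied to the return value of `extract_text_from_pdf` before the text is appended to `documents`.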

In [69]:
documents = []

for filename in os.listdir(persist_directory):
    file_path = os.path.join(persist_directory, filename)
    print("file_path", file_path)

    # Process .pdf and .docx files; legacy .doc files are detected but not parsed
    if filename.endswith('.pdf'):
        print(f"Processing PDF: {filename}")
        text = extract_text_from_pdf(file_path)
        if text:
            documents.append(text)
    elif filename.endswith('.docx'):
        print(f"Processing DOCX: {filename}")
        text = extract_text_from_docx(file_path)
        if text:
            documents.append(text)
    elif filename.endswith('.doc'):
        # python-docx cannot read the legacy .doc format, so these files are skipped
        print(f"Skipping unsupported DOC: {filename}")
    else:
        print(f"Skipping file: {filename}")


if not documents:
    print("no documents found.")
else:
    vectordb.add_texts(documents)
    print("vectordb._collection.count() count:", vectordb._collection.count())

file_path Embeddings/chroma.sqlite3
Skipping file: chroma.sqlite3
file_path Embeddings/Pret.pdf
Processing PDF: Pret.pdf
file_path Embeddings/Starbucks.pdf
Processing PDF: Starbucks.pdf
file_path Embeddings/Pret.docx
Processing DOCX: Pret.docx
file_path Embeddings/.ipynb_checkpoints
Skipping file: .ipynb_checkpoints
file_path Embeddings/75014254-e6cf-4165-8a6c-5baedd6339bd
Skipping file: 75014254-e6cf-4165-8a6c-5baedd6339bd
vectordb._collection.count() count: 9
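
The loop above passes only raw text to `add_texts`, so nothing records which file each text came from, and re-running the cell appears to add the same texts again (the count grows between runs). A minimal sketch of one way to prepare the data: pair each text with a `{"source": filename}` metadata entry and an id derived from its content. `build_records` is a hypothetical helper, and the sample strings are placeholders rather than the real file contents.

```python
import hashlib

def build_records(files):
    """Turn (filename, text) pairs into parallel lists for a vector store.

    Returns (texts, metadatas, ids). Each id is a SHA-256 digest of the
    filename and text, so identical content always maps to the same id.
    """
    texts, metadatas, ids = [], [], []
    for filename, text in files:
        if not text:
            continue  # skip files that yielded no text
        digest = hashlib.sha256(f"{filename}\n{text}".encode("utf-8")).hexdigest()
        texts.append(text)
        metadatas.append({"source": filename})
        ids.append(digest)
    return texts, metadatas, ids

# Placeholder inputs for illustration only
texts, metadatas, ids = build_records([
    ("Starbucks.pdf", "Starbucks was founded in 1971."),
    ("Pret.pdf", "Pret originated in London."),
    ("empty.pdf", ""),
])
print(metadatas)
```

With the notebook's objects, the call would then be `vectordb.add_texts(texts, metadatas=metadatas, ids=ids)`. How repeated ids are handled (overwritten or rejected) depends on the installed LangChain and Chroma versions, so that behaviour should be checked rather than assumed.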


In [54]:
# documents = []
# for filename in os.listdir(persist_directory):
#     print("filename", filename)
#     file_path = os.path.join(persist_directory, filename)
#     print("file_path", file_path)
#     if filename.endswith('.pdf'):
#         print("pdf")
#         text = extract_text_from_pdf(file_path)
#         print(text)
#         documents.append(text)

# vectordb.add_texts(documents)

In [15]:
# from google.colab import drive
# drive.mount('/content/drive')

In [16]:
# texts = ["This is a test document.", "Another document to add."]
# vectordb.add_texts(texts)

['793c3cf9-7fce-407e-af56-13e6072a3e8d',
 'cbc8cb5e-a923-449b-adf1-bdd6b1954078']

In [70]:
vectordb._collection

Collection(id=8eb279ed-a621-4c6d-8ab8-d6110c0d5358, name=langchain)

In [72]:
vectordb._collection.metadata

In [55]:
print(vectordb._collection.count())

3


In [62]:
question = "Tell me something about Starbucks? Can you tell me which file you find the source from?"
docs = vectordb.similarity_search(question,k=3)
len(docs)

3
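
Conceptually, `similarity_search(question, k=3)` embeds the question and returns the `k` stored texts whose embeddings are closest to it. The sketch below illustrates the idea with toy two-dimensional vectors in place of real embeddings. Cosine similarity is used here only for illustration; the metric actually applied depends on how the collection is configured, and `top_k` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k most similar document vectors, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy vectors standing in for document embeddings
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=3))  # prints [0, 2, 1]
```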

In [63]:
from langchain.chat_models import ChatOpenAI
# Read the API key from the environment rather than hard-coding it in the notebook
llm = ChatOpenAI(model_name=llm_name, temperature=0, openai_api_key=os.environ['OPENAI_API_KEY'])

### RetrievalQA chain

In [64]:
from langchain.chains import RetrievalQA

In [65]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [66]:
result = qa_chain({"query": question})



In [67]:
result["result"]

'The information provided about Starbucks was sourced from the text titled "The Evolution of Starbucks: From a Single Store to a Global Coffee Powerhouse."'

### Prompt

In [78]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
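
With the default `stuff` chain type, the retrieved texts are joined together and substituted for `{context}`, and the query is substituted for `{question}`. The sketch below imitates that substitution with plain string formatting on a shortened template. `fill_prompt` and `short_template` are hypothetical names for this illustration, not LangChain's implementation.

```python
short_template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

def fill_prompt(template, docs, question):
    """Join the document texts and substitute them into the template."""
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)

prompt = fill_prompt(
    short_template,
    ["Starbucks was founded in 1971.", "The first store was at Pike Place Market."],
    "When was Starbucks founded?",
)
print(prompt)
```

Printing the filled prompt like this is a quick way to see exactly what text the model receives.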


In [74]:
# Build the chain with the custom prompt and ask it to return the source documents
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)

In [82]:
question = "Who are Starbucks and Pret's founders? Tell me where you get the info from? Did you get info from the articles I give you or some other place?"

In [99]:
question2 = "What are the sources' titles? The sources you get the answer from."

In [83]:
result = qa_chain({"query": question})

In [100]:
result2 = qa_chain({"query": question2})

In [101]:
result["result"]

'The founders of Starbucks are Jerry Baldwin, Zev Siegl, and Gordon Bowker. The information was obtained from the provided article on the evolution of Starbucks. On the other hand, the founders of Pret A Manger are Sinclair Beecham and Julian Metcalfe. This information was not included in the text provided.'

In [103]:
result2['result']

'The sources are titled "The Evolution of Starbucks: From a Single Store to a Global Coffee Powerhouse" and "Review: \'PRET\'". Thanks for asking!'

In [85]:
result["source_documents"][0]

Document(page_content='The\nEvolution\nof\nStarbucks:\nFrom\na\nSingle\nStore\nto\na\nGlobal\nCoffee\nPowerhouse\nStarbucks,\na\nname\nsynonymous\nwith\ncoffee\nculture\naround\nthe\nworld,\nhas\na\nhistory\nthat\ntraces\nback\nto\na\nsmall\nstore\nin\nSeattle,\nWashington.\nThe\njourney\nof\nStarbucks\nfrom\na\nhumble\ncoffee\nshop\nto\nan\ninternational\ngiant\nis\na\nstory\nof\ninnovation,\nbranding,\nand\nstrategic\nexpansion.\nThe\nBeginning:\n1971\nStarbucks\nwas\nfounded\nin\n1971\nby\nthree\npartners—Jerry\nBaldwin,\nZev\nSiegl,\nand\nGordon\nBowker.\nThe\ntrio\nwas\ninspired\nby\na\nlove\nfor\nhigh-quality\ncoffee\nand\na\ndesire\nto\nbring\npremium\nbeans\nto\nconsumers.\nThe\nfirst\nstore,\nlocated\nat\nPike\nPlace\nMarket\nin\nSeattle,\nspecialized\nin\nselling\nhigh-quality\ncoffee\nbeans\nand\nequipment.\nThe\nname\n"Starbucks"\nwas\ninspired\nby\nthe\ncharacter\nStarbuck\nfrom\nHerman\nMelville’s\nclassic\nnovel\nMoby-Dick\n,\nreflecting\nthe\nfounders\'\nmaritime\nherit

In [90]:
result.keys()

dict_keys(['query', 'result', 'source_documents'])

In [93]:
result["query"]

"Who are Starbucks and Pret's founders? Tell me where you get the info from? Did you get info from the articles I give you or some other place?"

In [86]:
result["source_documents"]

[Document(page_content='The\nEvolution\nof\nStarbucks:\nFrom\na\nSingle\nStore\nto\na\nGlobal\nCoffee\nPowerhouse\nStarbucks,\na\nname\nsynonymous\nwith\ncoffee\nculture\naround\nthe\nworld,\nhas\na\nhistory\nthat\ntraces\nback\nto\na\nsmall\nstore\nin\nSeattle,\nWashington.\nThe\njourney\nof\nStarbucks\nfrom\na\nhumble\ncoffee\nshop\nto\nan\ninternational\ngiant\nis\na\nstory\nof\ninnovation,\nbranding,\nand\nstrategic\nexpansion.\nThe\nBeginning:\n1971\nStarbucks\nwas\nfounded\nin\n1971\nby\nthree\npartners—Jerry\nBaldwin,\nZev\nSiegl,\nand\nGordon\nBowker.\nThe\ntrio\nwas\ninspired\nby\na\nlove\nfor\nhigh-quality\ncoffee\nand\na\ndesire\nto\nbring\npremium\nbeans\nto\nconsumers.\nThe\nfirst\nstore,\nlocated\nat\nPike\nPlace\nMarket\nin\nSeattle,\nspecialized\nin\nselling\nhigh-quality\ncoffee\nbeans\nand\nequipment.\nThe\nname\n"Starbucks"\nwas\ninspired\nby\nthe\ncharacter\nStarbuck\nfrom\nHerman\nMelville’s\nclassic\nnovel\nMoby-Dick\n,\nreflecting\nthe\nfounders\'\nmaritime\nheri

In [94]:
len(result['source_documents'])

4
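
The raw `Document` output above is hard to read because of the embedded newlines. A minimal sketch of a helper that prints one short preview line per source document follows. `summarize_sources` is hypothetical and relies only on the `page_content` and `metadata` attributes; the stand-in objects below are not real LangChain documents. Because the texts in this notebook were added without metadata, the source would likely show as `unknown` here.

```python
from types import SimpleNamespace

def summarize_sources(docs, width=80):
    """Return one preview line per document, with whitespace collapsed."""
    lines = []
    for i, doc in enumerate(docs):
        source = (doc.metadata or {}).get("source", "unknown")
        preview = " ".join(doc.page_content.split())[:width]
        lines.append(f"[{i}] {source}: {preview}")
    return lines

# Stand-in objects exposing the same two attributes as a Document
fake_docs = [
    SimpleNamespace(page_content="The\nEvolution\nof\nStarbucks:", metadata={}),
    SimpleNamespace(page_content="Starbucks,\na\nname", metadata={"source": "Starbucks.pdf"}),
]
for line in summarize_sources(fake_docs):
    print(line)
```

With the chain result above, the call would be `summarize_sources(result["source_documents"])`.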

### RetrievalQA chain types

In [95]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)

In [96]:
result = qa_chain_mr({"query": question})

In [None]:
result["result"]

'Starbucks was founded by Jerry Baldwin, Zev Siegl, and Gordon Bowker. Pret A Manger was founded by Julian Metcalfe and Sinclair Beecham.'
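
The `map_reduce` chain type answers the question against each retrieved document separately and then combines the partial answers in a final call. The sketch below shows only that control flow. It is a conceptual illustration, not LangChain's implementation: the prompt wording is invented, and `fake_llm` is a stand-in that records prompts instead of calling a model.

```python
def map_reduce_answer(llm, docs, question):
    """Answer per document (map), then combine the partial answers (reduce)."""
    partial = [llm(f"Context: {d}\nQuestion: {question}") for d in docs]  # map step
    combined = "\n".join(partial)
    return llm(f"Combine these partial answers:\n{combined}\nQuestion: {question}")  # reduce step

calls = []

def fake_llm(prompt):
    """Stand-in for a model call: record the prompt and return a canned string."""
    calls.append(prompt)
    return f"answer {len(calls)}"

final = map_reduce_answer(fake_llm, ["doc A", "doc B", "doc C"], "Who founded Starbucks?")
print(len(calls))  # prints 4: one call per document plus one combining call
```

The extra calls are the trade-off: each document is processed independently, at the cost of more model requests than a single stuffed prompt.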

If you wish to experiment on the `LangSmith platform` (previously known as LangChain Plus):

 * Go to [LangSmith](https://www.langchain.com/langsmith) and sign up
 * Create an API key from your account's settings
 * Use this API key in the code below
 * Uncomment the code

Note: the endpoint in the video differs from the one below. Use the one below.

In [97]:
import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_API_KEY"] = "..."  # replace the dots with your LangSmith API key (not your OpenAI key)

In [98]:
qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
result = qa_chain_mr({"query": question})
result["result"]

'The founders of Starbucks are Jerry Baldwin, Zev Siegl, and Gordon Bowker. The information was obtained from the provided article on the evolution of Starbucks. On the other hand, the founders of Pret A Manger are Sinclair Beecham and Julian Metcalfe. This information was not included in the text provided.'

In [105]:
from langchain.chains import RetrievalQA

retriever = vectordb.as_retriever()

# Same map_reduce chain as above, written with explicit keyword arguments
qa_chain_mr = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="map_reduce"
)

# query
question = "Can you tell me something about Pret?"
result = qa_chain_mr({"query": question})

# Output the result
print(result["result"])


Pret A Manger, often known as Pret, is a globally recognized coffee shop and sandwich chain that originated in London. It has become a favorite for fresh, healthy, and convenient meals. The brand focuses on quality, freshness, ethical sourcing, and sustainability, which has helped it build a loyal customer base worldwide. Pret's stores are designed to be inviting spaces for relaxation, work, or socializing. The brand's emphasis on simplicity, quality, and consistency sets it apart in the market and has garnered a loyal following globally.


In [106]:
qa_chain_refine = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_refine({"query": question})
result["result"]

'Pret A Manger, commonly known as Pret, is a globally recognized coffee shop and sandwich chain that originated in London in 1984. Founded by college friends Sinclair Beecham and Julian Metcalfe, Pret aimed to provide fresh, natural food quickly to busy Londoners as an alternative to processed fast food. The name "Pret A Manger," meaning "ready to eat" in French, reflects their commitment to freshly prepared, high-quality food.\n\nPret\'s success was driven by its innovative approach to food preparation, with each store\'s kitchen making food daily using natural ingredients without artificial additives. The brand expanded across London and internationally, with a focus on quality, freshness, and ethical sourcing. Pret\'s welcoming atmosphere, friendly service, and charitable initiatives, such as donating unsold food to local charities, have contributed to its popularity.\n\nDuring the COVID-19 pandemic, Pret faced challenges but adapted by expanding delivery services and introducing ne
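
The `refine` chain type works sequentially: it drafts an answer from the first document and then revises that answer once for each remaining document. The sketch below shows only that control flow. It is a conceptual illustration, not LangChain's implementation: the prompt wording is invented, and `stub_llm` is a stand-in that records prompts instead of calling a model.

```python
def refine_answer(llm, docs, question):
    """Draft an answer from the first document, then revise it with each later one."""
    if not docs:
        raise ValueError("refine needs at least one document")
    answer = llm(f"Context: {docs[0]}\nQuestion: {question}")
    for d in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context: {d}\n"
            f"Refine the existing answer to the question: {question}"
        )
    return answer

prompts_seen = []

def stub_llm(prompt):
    """Stand-in for a model call: record the prompt and return a canned string."""
    prompts_seen.append(prompt)
    return f"draft {len(prompts_seen)}"

refined = refine_answer(stub_llm, ["doc A", "doc B", "doc C"], "Tell me about Pret.")
print(len(prompts_seen))  # prints 3: one call per document
```

Because each step sees the previous answer, later documents can add detail, which is consistent with the longer answer shown above.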

### RetrievalQA limitations

`RetrievalQA` does not preserve conversational history: each call is answered independently of earlier questions and answers.

In [107]:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [108]:
question = "What are Starbucks and Pret's reputations? What are their differences?"
result = qa_chain({"query": question})
result["result"]

'Starbucks and Pret have established strong reputations in the coffee industry. Starbucks is known for being a global coffee powerhouse, with over 30,000 stores in more than 80 countries. It has become a symbol of the global coffee culture and is recognized for its iconic green logo and commitment to social responsibility, including sustainability and community programs.\n\nOn the other hand, Pret is known for its chic and vibrant coffee shop atmosphere that appeals to both locals and tourists. It emphasizes modern and stylish design, locally sourced ingredients, and a focus on sustainability and community support. Pret offers exceptional coffee, specialty brews, and unique signature drinks, along with fresh artisanal pastries and light bites.\n\nThe main differences between Starbucks and Pret lie in their scale and focus. Starbucks is a global giant with a wide range of products and a strong emphasis on customer experience, while Pret focuses on creating a welcoming environment with a

In [109]:
question = "Which cafe is more profitable, Starbucks or Pret?"
result = qa_chain({"query": question})
result["result"]

"I don't have access to the financial data of Starbucks or Pret A Manger to determine which cafe is more profitable. You would need to refer to their respective financial reports or industry analyses for that information."

Note: LLM responses vary. Some responses **do** include details that might be gleaned from the retrieved documents. The point is simply that the model does not have access to past questions or answers; this will be covered in the next section.
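
To make the limitation concrete: each call receives only `{"query": question}`, so any earlier turn has to be supplied explicitly. The sketch below shows one crude workaround, prefixing the new question with earlier turns. `with_history` is a hypothetical helper written for illustration; it is not a LangChain API and not necessarily the approach taken in the next section.

```python
def with_history(history, question):
    """Prefix a question with earlier (question, answer) turns."""
    if not history:
        return question
    turns = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    return f"Previous conversation:\n{turns}\n\nFollow-up question: {question}"

history = [("What are Starbucks and Pret's reputations?", "Both have strong reputations.")]
print(with_history(history, "Which of them is more profitable?"))
```

The combined string could then be passed as the `query` value, at the cost of a longer prompt and a retrieval step that also sees the history text.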