# Lab: Document Question Answering with AI Applications Search and LangChain

## Overview

This lab shows how to ask and answer questions about your data by combining an AI Applications Search engine with LLMs. In particular, we focus on querying 'unstructured' data such as PDFs and HTML files.

These patterns are useful if you have an AI Applications Search engine pointed at a store of documents, such as a Google Cloud Storage bucket containing PDFs.

## Objectives

This lab focuses on utilizing Gemini for Developers in the following ways:
 1. Ask a question about uploaded documents and receive answers with citations.
 2. Ask follow-up questions for deeper understanding.
 3. Customize prompts for tailored responses.
 
## Task 1. Open Python Notebook and Install Packages

In your Google Cloud project, navigate to Vertex AI Workbench. In the top search bar of the Google Cloud console, enter Vertex AI Workbench, and click on the first result.

In [1]:
# Install the required packages by running the following command
! pip install -q --user google-cloud-aiplatform google-cloud-discoveryengine langchain-google-vertexai langchain-google-community

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
kfp 2.5.0 requires requests-toolbelt<1,>=0.8.0, but you have requests-toolbelt 1.0.0 which is incompatible.

In [2]:
! pip install --user google-cloud-aiplatform google-cloud-discoveryengine langchain-google-vertexai langchain-google-community



In [3]:
# Restart the kernel after the packages are installed so that your environment can access them
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}
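
Once the kernel has restarted, it can help to confirm that the newly installed packages are visible before moving on. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the listed import names are assumed to correspond to the packages installed above.

```python
import importlib.util


def missing_modules(module_names):
    """Return the subset of module_names that cannot be found in this environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package of a dotted name is absent
            found = False
        if not found:
            missing.append(name)
    return missing


# Import names assumed to correspond to the packages installed in Task 1.
required = [
    "vertexai",
    "langchain_google_vertexai",
    "langchain_google_community",
]
print("Missing modules:", missing_modules(required))
```

An empty list suggests the environment is ready; any name that is printed points to a package that may need to be reinstalled.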

## Task 2. Set up project information and populate the dataset into a data store

In this section, we will set up the project environment and import the data into a pre-created data store.

In [1]:
# Set up project info and initialize Vertex AI
PROJECT_ID = "qwiklabs-gcp-02-984e5c922496"
LOCATION = "us-east4"

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

2. Navigate to AI Applications by searching for it at the top of the Cloud Console.

3. Go to Data Stores, select the pre-created data store qna-unstructured-datastore, and choose Cloud Storage as the data source.

4. Enter the PDF location (a folder, not a single file) as cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs, select Unstructured Documents, and click Import.

Note: This step takes around 5 to 10 minutes to complete.

## Task 3. Set data store and model information, and import libraries

In [2]:
# Run the code below to set the data store ID and location.
DATA_STORE_ID = "qna-datastore-id"  # @param {type:"string"}
DATA_STORE_LOCATION = "global"  # @param {type:"string"}

MODEL = "gemini-2.0-flash"  # @param {type:"string"}

if PROJECT_ID == "YOUR_PROJECT_ID" or DATA_STORE_ID == "YOUR_DATA_STORE_ID":
    raise ValueError(
        "Please set the PROJECT_ID, DATA_STORE_ID constants to reflect your environment."
    )

In [3]:
# Import the required libraries.
from langchain.chains import RetrievalQA
from langchain.chains import RetrievalQAWithSourcesChain
from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.prompts import PromptTemplate

from langchain_google_vertexai import ChatVertexAI
from langchain_google_community import VertexAISearchRetriever
from langchain_google_community import VertexAIMultiTurnSearchRetriever

## Task 4. LangChain retrieval Q&A chains

Here, we'll use three LangChain retrieval Q&A chains:
 - RetrievalQA
 - RetrievalQAWithSourcesChain
 - ConversationalRetrievalChain

In [6]:
# 1. We begin by initializing a Vertex AI LLM and a LangChain 'retriever' to fetch documents from our AI Applications Search engine.
llm = ChatVertexAI(model_name=MODEL)

retriever = VertexAISearchRetriever(
    project_id=PROJECT_ID,
    location_id=DATA_STORE_LOCATION,
    data_store_id=DATA_STORE_ID,
    get_extractive_answers=True,
    max_documents=10,
    max_extractive_segment_count=1,
    max_extractive_answer_count=5,
)



2. RetrievalQA is the simplest document Q&A chain offered by LangChain.
 - Here, we use the stuff chain type, which simply inserts all of the document chunks into the prompt.
 - This has the advantage of making only a single LLM call, which is faster and more cost-efficient.
 - However, if we have a large number of search results, we risk exceeding the prompt's token limit or truncating useful information.
 - Other chain types, such as map_reduce and refine, make multiple LLM calls, processing individual document chunks one at a time and refining the answer iteratively.
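
To make the trade-off between chain types concrete, here is a small conceptual sketch that counts LLM calls for a stuff-style and a map_reduce-style strategy. It is not LangChain's implementation: `fake_llm`, `answer_with_stuff`, and `answer_with_map_reduce` are hypothetical names, and the "LLM" is a stand-in function that only records the prompts it receives.

```python
def fake_llm(prompt, call_log):
    """Stand-in for a real LLM call: records the prompt and returns a short string."""
    call_log.append(prompt)
    return f"summary of {len(prompt)} chars"


def answer_with_stuff(chunks, question, llm, call_log):
    # "stuff": concatenate every chunk into one prompt and call the LLM once.
    context = "\n\n".join(chunks)
    return llm(f"Context:\n{context}\n\nQuestion: {question}", call_log)


def answer_with_map_reduce(chunks, question, llm, call_log):
    # "map": one LLM call per chunk; "reduce": one final call over the partial answers.
    partials = [llm(f"Context:\n{c}\n\nQuestion: {question}", call_log) for c in chunks]
    combined = "\n".join(partials)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}", call_log)


chunks = ["chunk one text", "chunk two text", "chunk three text"]
question = "What was the revenue?"

stuff_calls, map_reduce_calls = [], []
answer_with_stuff(chunks, question, fake_llm, stuff_calls)
answer_with_map_reduce(chunks, question, fake_llm, map_reduce_calls)

print(len(stuff_calls))       # one call, but the single prompt grows with every chunk
print(len(map_reduce_calls))  # len(chunks) + 1 calls, each with a smaller prompt
```

The single stuff prompt contains every chunk, which is why it is the one at risk of hitting the token limit, while the map_reduce strategy trades that risk for additional calls.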

In [7]:
search_query = "What was Alphabet's Revenue in Q2 2021?"  # @param {type:"string"}

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever
)
retrieval_qa.invoke(search_query)

{'query': "What was Alphabet's Revenue in Q2 2021?",
 'result': "Alphabet's revenue in Q2 2021 was $61.9 billion.\n"}

3. Next, we inspect the source documents. If we add return_source_documents=True, we can inspect the document chunks that were returned by the retriever.

This is helpful for debugging, as these chunks may not always be relevant to the answer, or their relevance might not be obvious.

In [9]:
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)

results = retrieval_qa.invoke(search_query)

print("*" * 79)
print(results["result"])
print("*" * 79)
for doc in results["source_documents"]:
    print("-" * 33)
    print(doc.page_content)

*******************************************************************************
Alphabet's revenue in Q2 2021 was $61.9 billion.

*******************************************************************************
---------------------------------
Our long-term investments in AI and Google Cloud are helping us drive significant improvements in everyone&#39;s digital experience.” “Our strong second quarter revenues of <b>$61.9 billion</b> reflect elevated consumer online activity and broad-based strength in advertiser spend.
---------------------------------
Alphabet Inc. CONSOLIDATED STATEMENTS OF INCOME (In millions, except share amounts which are reflected in thousands and per share amounts) Quarter Ended June 30, Year To Date June 30, 2020 2021 2020 2021 (unaudited) (unaudited) Revenues $ 38297 $ 61880 $ 79456 $ 117194 Costs and expenses: Cost of revenues 18553 26227 37535 50330 Research and development 6875 7675 13695 15160 Sales and marketing 3901 5276 8401 9792 General and administra
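
The chunks printed above contain HTML entities (such as `&#39;`) and highlight tags (such as `<b>`), which can make them harder to read while debugging. A minimal sketch of a clean-up helper using only the standard library follows; `clean_snippet` is a hypothetical name, and the regular expression is a simple heuristic for inline tags rather than a full HTML parser.

```python
import html
import re


def clean_snippet(text):
    """Drop inline tags such as <b>...</b>, decode HTML entities, and collapse whitespace."""
    without_tags = re.sub(r"</?[a-zA-Z][^>]*>", "", text)
    unescaped = html.unescape(without_tags)
    return " ".join(unescaped.split())


sample = "everyone&#39;s digital experience. revenues of <b>$61.9 billion</b> reflect"
print(clean_snippet(sample))
```

Inside the loop over `results["source_documents"]`, the same helper could be applied as `clean_snippet(doc.page_content)`.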

4. The **RetrievalQAWithSourcesChain** variant returns an answer to the question alongside the source documents that were used to generate the answer.

In [10]:
retrieval_qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever
)

retrieval_qa_with_sources.invoke(search_query, return_only_outputs=True)

{'answer': "Alphabet's revenue in Q2 2021 was $61.9 billion.\n",
 'sources': 'gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q2_alphabet_earnings_release.pdf1, gs://cloud-samples-data/gen-app-builder/search/alphabet-investor-pdfs/2021Q2_alphabet_earnings_release.pdf2'}
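
The `sources` value in the output above is a single comma-separated string. If you want to iterate over the individual entries, a small helper like the one below may be enough. This is a sketch: `split_sources` is a hypothetical name, the example paths are placeholders, and the approach assumes that no source URI itself contains a comma.

```python
def split_sources(sources):
    """Split the comma-separated 'sources' string into a list of non-empty entries."""
    return [entry.strip() for entry in sources.split(",") if entry.strip()]


example = {
    "answer": "Alphabet's revenue in Q2 2021 was $61.9 billion.\n",
    "sources": "gs://bucket/report.pdf1, gs://bucket/report.pdf2",  # hypothetical paths
}
for source in split_sources(example["sources"]):
    print(source)
```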

5. **ConversationalRetrievalChain** remembers and uses previous questions so you can have a chat-like discovery process.

To use this chain we must provide a memory class to store and pass the previous messages to the LLM as context. Here we use the **ConversationBufferMemory** class that comes with LangChain.

**VertexAIMultiTurnSearchRetriever** uses multi-turn search (also called conversational search or search with followups) to preserve context between requests.

This chain works with either retriever, and the multi-turn retriever can be substituted into any of the previous examples.
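
Conceptually, the memory keeps earlier turns so that a follow-up question can be interpreted in context. The toy sketch below illustrates that idea only; it does not reproduce LangChain's internals or its actual prompts, and `SimpleBufferMemory`, `build_condense_prompt`, and the placeholder answer are all hypothetical.

```python
class SimpleBufferMemory:
    """Toy stand-in for a buffer memory: keeps every (question, answer) turn in order."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def as_text(self):
        return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in self.turns)


def build_condense_prompt(memory, follow_up):
    # With no history, the question already stands alone.
    if not memory.turns:
        return follow_up
    return (
        "Given the conversation below, rephrase the follow-up as a standalone question.\n\n"
        f"{memory.as_text()}\n\nFollow-up: {follow_up}"
    )


memory_sketch = SimpleBufferMemory()
first = "What were alphabet revenues in 2022?"
print(build_condense_prompt(memory_sketch, first))

memory_sketch.save(first, "(answer returned by the chain)")  # placeholder answer
print(build_condense_prompt(memory_sketch, "What about costs and expenses?"))
```

The second prompt carries the first question and its answer, which is what lets a short follow-up like "What about costs and expenses?" be resolved against the earlier turn.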

In [17]:
multi_turn_retriever = VertexAIMultiTurnSearchRetriever(
    project_id=PROJECT_ID, location_id=DATA_STORE_LOCATION, data_store_id=DATA_STORE_ID
)

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

conversational_retrieval = ConversationalRetrievalChain.from_llm(
    llm=llm, retriever=multi_turn_retriever, memory=memory
)

search_query = "What were alphabet revenues in 2022?"

result = conversational_retrieval.invoke(search_query)
print(result["answer"])

Alphabet revenues in 2022 were $282,836 million.


In [18]:
new_query = "What about costs and expenses?"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

Alphabet's costs and expenses in 2022 were $207,994 million.


In [19]:
new_query = "Is this more than in 2021?"

result = conversational_retrieval.invoke(new_query)
print(result["answer"])

Yes, Alphabet's total costs and expenses in 2022 were more than in 2021. In 2021, total costs and expenses were $178,923 million, while in 2022, they were $207,994 million.



In [20]:
new_query = "How does revenue relate to costs and expenses?"
result = conversational_retrieval.invoke(new_query)
print(result["answer"])

In 2022, Alphabet's revenue was $282.836 billion, and its total costs and expenses were $207.994 billion.
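
The last answer restates the two figures without relating them. A quick calculation with the numbers reported by the chain above (in millions of USD) shows one way to check or extend the model's response; the variable names are illustrative.

```python
# Figures in millions of USD, copied from the chain's answers above.
revenue_2022 = 282_836
costs_2022 = 207_994
costs_2021 = 178_923

income_2022 = revenue_2022 - costs_2022               # revenue minus costs and expenses
margin_2022 = income_2022 / revenue_2022              # share of revenue left after costs
cost_growth = (costs_2022 - costs_2021) / costs_2021  # year-over-year change in costs

print(f"Revenue minus costs and expenses: ${income_2022:,} million")
print(f"As a share of revenue: {margin_2022:.1%}")
print(f"Cost growth 2021 to 2022: {cost_growth:.1%}")
```

Simple arithmetic like this is a useful sanity check, since the LLM's answers depend on the chunks the retriever happened to return.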



## Task 5. Advanced: Modifying the default LangChain prompt
In all of the previous steps, we used the default prompt that comes with LangChain.

We can inspect our chain object to discover the wording of the prompt template being used.

We may find that this is not suitable for our purposes, and we may wish to customize the prompt, for example to present our results in a different format, or to specify additional constraints.

In [21]:
# 1. Run the snippet below to print the default prompt template used by the chain.
qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever, return_source_documents=True
)

print(qa.combine_documents_chain.llm_chain.prompt.messages[0].prompt.template)

Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
{context}


In [22]:
# 2. Let's modify the prompt to return an answer in a single word (useful for yes/no questions).
#    We will constrain the LLM to say 'I don't know' if it cannot answer.

# We create a new prompt_template string, wrap it in a PromptTemplate via its
# template argument, and pass the result to the chain through the prompt argument.

prompt_template = """Use the context to answer the question at the end.
You must always use the context and context only to answer the question. Never try to make up an answer. If the context is empty or you do not know the answer, just say "I don't know".
The answer should consist of only 1 word and not a sentence.

Context: {context}

Question: {question}
Helpful Answer:
"""
prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_llm(
    llm=llm, prompt=prompt, retriever=retriever, return_source_documents=True
)

In [23]:
print(qa_chain.combine_documents_chain.llm_chain.prompt.template)

Use the context to answer the question at the end.
You must always use the context and context only to answer the question. Never try to make up an answer. If the context is empty or you do not know the answer, just say "I don't know".
The answer should consist of only 1 word and not a sentence.

Context: {context}

Question: {question}
Helpful Answer:
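
The `{context}` and `{question}` placeholders in the template are filled in at query time. The sketch below uses plain Python string formatting to illustrate the idea, on the assumption that the template uses Python format-string syntax; the shortened template and the sample chunk text are illustrative only.

```python
from string import Formatter

template = """Use the context to answer the question at the end.

Context: {context}

Question: {question}
Helpful Answer:
"""

# List the placeholder names the template expects.
fields = [name for _, name, _, _ in Formatter().parse(template) if name]
print(fields)

# Fill the placeholders the way a retrieval chain conceptually does:
# retrieved chunks go into {context}, the user's query into {question}.
filled = template.format(
    context="Revenues were $61.9 billion in Q2 2021.",  # hypothetical chunk text
    question="What was Alphabet's revenue in Q2 2021?",
)
print(filled)
```

One consequence of this syntax is that any literal braces you want to appear in a custom prompt need to be doubled (`{{` and `}}`).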



In [24]:
search_query = "Were 2020 EMEA revenues higher than 2020 APAC revenues?"

results = qa_chain.invoke(search_query)

print("*" * 79)
print(results["result"])
print("*" * 79)
for doc in results["source_documents"]:
    print("-" * 79)
    print(doc.page_content)

*******************************************************************************
Yes

*******************************************************************************
-------------------------------------------------------------------------------
Year Ended December 31, % Change from 2020 2021 Prior Year EMEA revenues $ $ 43% EMEA constant currency revenues 38 % APAC revenues 32550 42% APAC constant currency revenues 40% Other Americas revenues 14404 53% Other Americas constant currency revenues 52% United States revenues 85014 39% Hedging gains (losses) 149 Total revenues $ $ 41% Revenues, excluding hedging effect $ 182351 $ Exchange rate effect (3330) Total constant currency revenues $ 254158 39% EMEA revenue growth from 2020 to 2021 was favorably affected by foreign currency exchange rates, primarily due to the US dollar weakening relative to the Euro and British pound.
-------------------------------------------------------------------------------
Google Cloud&#39;s infrastructure an
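
Because the custom prompt constrains the model to a one-word answer, the result is easy to post-process. Here is a minimal sketch that maps the raw answer to a boolean, or to `None` when the model declines to answer; `parse_one_word_answer` is a hypothetical helper, not part of LangChain.

```python
def parse_one_word_answer(raw):
    """Map a one-word model answer to True, False, or None when it is not yes/no."""
    word = raw.strip().strip(".!\"'").lower()
    if word == "yes":
        return True
    if word == "no":
        return False
    return None  # covers "I don't know" and any unexpected output


print(parse_one_word_answer("Yes\n"))
print(parse_one_word_answer("I don't know"))
```

In the cell above, this could be applied as `parse_one_word_answer(results["result"])`.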