# In-context Q&A with Intelligent Document Processing
____

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the "Data Science 3.0" image.
</div>

<div class="alert alert-block alert-warning"> 
    <b>NOTE:</b> You will need 3rd party model access to Anthropic Claude V1 model, and Amazon Titan Embedding G1 Text model to be able to run this notebook. Verify if you have access to the models by going to <a href="https://console.aws.amazon.com/bedrock" target="_blank">Amazon Bedrock console</a> > left menu "Model access". The "Access status" for Anthropic Claude and Amazon Titan Embedding G1 Text must be in "<span style="color:green;">Access granted</span>" status. If you do not have access, then click "Edit" button on the top right > select the model checkboxes > click "Save changes" button at the bottom. You should have access to the model within a few moments.
</div>

In this notebook we will walk through Q&A with a document: first we extract text from the document using Amazon Textract, then we split the text into chunks and store them in a vector DB, and finally we perform Q&A with an Anthropic Claude model via Amazon Bedrock to get precise answers from the model. Later on, we will also implement a chat application with chat history to chat with documents.

In [None]:
!pip install -U boto3 langchain faiss-cpu transformers
!pip install amazon-textract-textractor pypdf Pillow

In [170]:
import json
import os
import sys
import sagemaker
import boto3

role = sagemaker.get_execution_role()
data_bucket = sagemaker.Session().default_bucket()
bedrock = boto3.client('bedrock-runtime')
br = boto3.client('bedrock')
s3 = boto3.client("s3")
print(f"SageMaker bucket is {data_bucket}, and SageMaker Execution Role is {role}")

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml
SageMaker bucket is sagemaker-us-east-1-710096454740, and SageMaker Execution Role is arn:aws:iam::710096454740:role/service-role/AmazonSageMaker-ExecutionRole-20220504T135260


## Upload sample data to S3 bucket


The sample documents are in the `samples` directory. For this workshop, we will upload them to S3 and work with one of them.

In [5]:
# Upload images to S3 bucket:

!aws s3 cp samples s3://{data_bucket}/idp/genai --recursive --only-show-errors

In [171]:
!aws s3 ls s3://{data_bucket}/idp/genai/

                           PRE .ipynb_checkpoints/
2023-09-29 17:12:03     770065 bank_statement.jpg
2023-09-29 17:12:03     232029 discharge-summary.png
2023-09-29 17:12:03     224721 hand_written_note.pdf
2023-09-29 17:12:03     108247 health_plan.pdf
2023-09-29 17:12:03     781244 health_plan_pg1.png


---
# Perform Common sense reasoning and QA on a document

In this section, we will perform common sense reasoning and Q&A on a document. This section does the following:

- Extracts text from a document stored in S3 using Amazon Textract
- Generates embeddings from the text
- Uses an in-memory vector database to store the embeddings. In this case we will use [FAISS](https://ai.meta.com/tools/faiss/).
- Performs a similarity search on the in-memory vector DB to find the pieces of text that are relevant to the question asked by the user
- Generates the context for the LLM using the search results
- Gives the model the context and the original question asked
- Gets the answer back from the LLM
- Profit

> _"Wait but that's a lot of steps just for getting an answer back? Why?"_

We would love to dive deeper into why, but here is a paper that does a better job of explaining the why and the how: https://arxiv.org/pdf/2005.11401.pdf. In short, LLMs know a lot, _sometimes so much that a model may get confused, wander into the proverbial forest of its own world knowledge and start gathering firewood when it was actually asked to go pick some fruit_. To solve this problem and get accurate, factual answers, we use Retrieval-Augmented Generation (RAG) to give the LLM more _context_ to work with so that it produces the desired output (like handing it a fruit basket in our example, so that it knows it is only supposed to pick fruit).

As a first step, we read a file (document) with Amazon Textract using the LangChain Textract document loader.

In [9]:
from langchain.document_loaders import AmazonTextractPDFLoader
loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/idp/genai/health_plan.pdf")
document = loader.load()
print(f"Textract extracted {len(document)} pages from the document")

Textract extracted 5 pages from the document


Let's look at the extracted text

In [10]:
for index,page in enumerate(document):
    print(f"=========Page {index+1}==========")
    print(page.page_content)
    print("\n")

Health Benefit Summary Plan Description Revised 01-01-2022 BENEFITS Healthcare Policy Plan Vision: To provide high quality, affordable healthcare for all citizens. Mission: To implement policy reforms and programs that expand access to healthcare, reduce costs, and improve health outcomes. Goals: 1. Achieve universal healthcare coverage. Provide health insurance for all citizens regardless of income or health status. 2. Reduce healthcare costs for individuals and government. Implement policies and programs to lower premiums, out-of-pocket costs, and overall healthcare spending. 3. Improve population health. Invest in public health programs and prevention to promote healthy lifestyles, reduce health risks, and improve health outcomes. 4. Support healthcare innovation. Invest in research and new technologies to improve treatments, cures, and the healthcare system. Policy Reforms: 1. Establish a public healthcare option. Provide a government-run health plan to compete with private insurer

Now that we have extracted the document, we split it into smaller chunks. This is required because we may have a large multi-page document and our LLMs have token limits. It also ensures that we build the context from only the relevant parts of the document instead of full page texts. These chunks will then be loaded into the vector DB for similarity search in the subsequent steps.
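To make the idea of chunking concrete, here is a minimal pure-Python sketch. It is not how `RecursiveCharacterTextSplitter` is implemented; the `split_text` helper and its separator list are assumptions made for illustration. It only shows the basic trade-off: each chunk stays under a size limit and, where possible, ends at a natural boundary.

```python
def split_text(text, chunk_size=400, separators=(". ", ", ", " ")):
    """Greedy splitter: each chunk holds at most chunk_size characters and,
    where possible, ends at the last separator found inside the window."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):  # not the final chunk, so look for a clean cut
            window = text[start:end]
            for sep in separators:
                cut = window.rfind(sep)
                if cut > 0:
                    end = start + cut + len(sep)
                    break
        chunks.append(text[start:end])
        start = end
    return chunks


sample = ("Goals: 1. Achieve universal healthcare coverage. "
          "2. Reduce healthcare costs for individuals and government. "
          "3. Improve population health.")
chunks = split_text(sample, chunk_size=60)
for number, chunk in enumerate(chunks, start=1):
    print(f"==== Chunk {number} ({len(chunk)} chars) ====")
    print(chunk)
```

Smaller chunks give more focused search hits but carry less surrounding context, so the chunk size is worth tuning for your documents.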

However, before we store the chunks in the vector DB, we have to generate embeddings for the text. We use `BedrockEmbeddings`, which is built into LangChain, with the Amazon Titan embedding model for that purpose. For other models, you may choose an embedding model as suggested by the model provider. Let's start by splitting the document into smaller chunks.

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)

for index, text in enumerate(texts):
    print(f"==== Chunk {index+1}, From Page {text.metadata['page']} ====")
    print(text.page_content)
    print("\n")

==== Chunk 1, From Page 1 ====
Health Benefit Summary Plan Description Revised 01-01-2022 BENEFITS Healthcare Policy Plan Vision: To provide high quality, affordable healthcare for all citizens. Mission: To implement policy reforms and programs that expand access to healthcare, reduce costs, and improve health outcomes. Goals: 1. Achieve universal healthcare coverage


==== Chunk 2, From Page 1 ====
. Provide health insurance for all citizens regardless of income or health status. 2. Reduce healthcare costs for individuals and government. Implement policies and programs to lower premiums, out-of-pocket costs, and overall healthcare spending. 3. Improve population health


==== Chunk 3, From Page 1 ====
. Invest in public health programs and prevention to promote healthy lifestyles, reduce health risks, and improve health outcomes. 4. Support healthcare innovation. Invest in research and new technologies to improve treatments, cures, and the healthcare system. Policy Reforms: 1. Establi



We have split the document into smaller chunks. We will now do two things:

- Generate embeddings of these chunks
- Store these embeddings into a vector database
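Before using the real services, the two steps above can be sketched in plain Python. This is only an illustration under loud assumptions: `ToyVectorStore` is a hypothetical class, its word-count "embedding" is a stand-in for the Amazon Titan model, and its brute-force L2 search is a stand-in for FAISS. The chunk texts are sentences taken from the sample document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyVectorStore:
    """In-memory store: one unit-length word-count vector per chunk."""

    def __init__(self, texts):
        self.texts = list(texts)
        vocab = sorted({tok for t in self.texts for tok in tokenize(t)})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        self.vectors = [self.embed(t) for t in self.texts]

    def embed(self, text):
        """Stand-in embedding: normalized counts of known words."""
        vec = [0.0] * len(self.vocab)
        for tok, count in Counter(tokenize(text)).items():
            if tok in self.vocab:
                vec[self.vocab[tok]] = float(count)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def search(self, query, k=4):
        """Return the k chunks closest to the query (smallest L2 distance)."""
        q = self.embed(query)
        scored = [(text, math.dist(q, vec))
                  for text, vec in zip(self.texts, self.vectors)]
        return sorted(scored, key=lambda pair: pair[1])[:k]


store = ToyVectorStore([
    "Deductible amounts are shown on the Schedule of Benefits.",
    "The Co-pay is typically a flat dollar amount and is paid at the time of service.",
    "ANYCOMPANY, INC. is named the Plan Administrator for this Plan.",
])
results = store.search("Who is the administrator for this plan?", k=2)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

A real embedding model captures meaning rather than word overlap, which is why we use Amazon Titan in the cells that follow.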



## Vector database

This vector database stores the embeddings that we generate. This notebook uses FAISS, which is transient and held in memory. FAISS (Facebook AI Similarity Search) is a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It addresses limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions. The VectorStore APIs that use FAISS within LangChain are available [here](https://python.langchain.com/en/harrison-docs-refactor-3-24/reference/modules/vectorstore.html).

We will use the Amazon Titan embedding model to generate the embeddings.

In [180]:
resp = br.list_foundation_models(
    byOutputModality='EMBEDDING'
)
for model in resp['modelSummaries']:
    print(model['modelId'])

amazon.titan-e1t-medium
amazon.titan-embed-g1-text-02
amazon.titan-embed-text-v1


In [8]:
from langchain.embeddings import BedrockEmbeddings

embeddings = BedrockEmbeddings(client=bedrock)

In [191]:
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS


# Ensure that you have enabled amazon.titan-embed-text-v1 model in Amazon Bedrock console
embeddings = BedrockEmbeddings(client=bedrock,model_id="amazon.titan-embed-text-v1")
vector_db = FAISS.from_documents(documents=texts, embedding=embeddings)

In [189]:
# Since we are loading the FAISS Vector DB in memory, it will load into the SageMaker Studio instance's memory
# you may want to free up memory from time to time. To do that, uncomment the line below and execute this cell

# CAUTION! This will delete every entry in the vector index

# vector_db.delete(list(vector_db.index_to_docstore_id.values()))

True

We have loaded our vector DB with the document. Now let's run a query.

In [16]:
query = "What is the annual deductible per person?"
docs = vector_db.similarity_search(query)

In [17]:
docs

[Document(page_content='Note: Embedded Deductible Means That If You Have Family Coverage, Any Combination Of Covered Family Members May Help Meet The Maximum Family Deductible; However, No One Person Will Pay More Than His Or Her Embedded Individual Deductible Amount. Annual Total Out-Of-Pocket Maximum: Note: Medical And Pharmacy Expenses Are Subject To The Same Out-Of-Pocket Maximum', metadata={'source': 's3://sagemaker-us-east-1-710096454740/idp/genai/health_plan.pdf', 'page': 4}),
 Document(page_content='. A Deductible applies to each Covered Person up to a family Deductible limit. When a new Plan Year begins, a new Deductible must be satisfied. Deductible amounts are shown on the Schedule of Benefits. Generally, the applicable Deductible must be met before any benefits will be paid under this Plan. However, certain covered benefits may be considered Preventive / Routine Care and paid first dollar', metadata={'source': 's3://sagemaker-us-east-1-710096454740/idp/genai/health_plan.pdf

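The query returns the chunks from the document that are most similar to the query. By default it returns the top 4 chunks. Let's see how to return just the top 3 along with their scores. Note that with FAISS in LangChain the score is an L2 distance by default, so a lower score means a closer match.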
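Because the score that LangChain's FAISS wrapper returns by default is an L2 distance, a smaller number means a better match. One way to use it, sketched below, is to drop results beyond a cutoff. The `within_distance` helper, the cutoff, and the `(text, distance)` pairs are all hypothetical values chosen for illustration, not output from the notebook.

```python
def within_distance(results, max_distance):
    """Keep (item, distance) pairs at or below the cutoff, closest first."""
    kept = [(item, dist) for item, dist in results if dist <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])


# Hypothetical (chunk, L2 distance) pairs, shaped like the tuples
# returned by similarity_search_with_score
results = [
    ("chunk about embedded deductibles", 208.3),
    ("chunk about co-pays", 305.2),
    ("chunk about plan year deductibles", 231.7),
]
close = within_distance(results, max_distance=250.0)
for item, dist in close:
    print(f"{dist:7.1f}  {item}")
```

A sensible cutoff depends on the embedding model and the documents, so it is best chosen by inspecting real scores.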

In [18]:
docs = vector_db.similarity_search_with_score(query, k = 3)
docs

[(Document(page_content='Note: Embedded Deductible Means That If You Have Family Coverage, Any Combination Of Covered Family Members May Help Meet The Maximum Family Deductible; However, No One Person Will Pay More Than His Or Her Embedded Individual Deductible Amount. Annual Total Out-Of-Pocket Maximum: Note: Medical And Pharmacy Expenses Are Subject To The Same Out-Of-Pocket Maximum', metadata={'source': 's3://sagemaker-us-east-1-710096454740/idp/genai/health_plan.pdf', 'page': 4}),
  208.32166),
 (Document(page_content='. A Deductible applies to each Covered Person up to a family Deductible limit. When a new Plan Year begins, a new Deductible must be satisfied. Deductible amounts are shown on the Schedule of Benefits. Generally, the applicable Deductible must be met before any benefits will be paid under this Plan. However, certain covered benefits may be considered Preventive / Routine Care and paid first dollar', metadata={'source': 's3://sagemaker-us-east-1-710096454740/idp/genai

## Vector store-backed retriever
---

According to the LangChain documentation:

>A vector store retriever is a retriever that uses a vector store to retrieve documents. It is a lightweight wrapper around the vector store class to make it conform to the retriever interface. It uses the search methods implemented by a vector store, like similarity search and MMR, to query the texts in the vector store.

Wrapping our vector DB in a retriever is going to be useful when we use it in the Q&A chain for our chatbot in subsequent sections. But first, let's look at how it works. The functionality is similar to before (i.e. querying), with a slightly different interface.

We first define a retriever with search type `mmr` (Maximal Marginal Relevance); the other option is `similarity`. Note that the supported `search_type` values depend on the vector DB you are using; some vector DBs may not support `mmr`.

>MMR considers the similarity of keywords/keyphrases with the document, along with the similarity of already selected keywords and keyphrases. This results in a selection of keywords that maximize their within diversity with respect to the document.

We also define how many top results to return, in this case 3. Finally, we query the retriever using `get_relevant_documents` by passing in the query.
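To see what MMR does differently from a plain similarity search, here is a small self-contained sketch. It is not LangChain's or FAISS's implementation; the `mmr` function, the weighting parameter `lambda_mult`, and the three toy vectors are assumptions for illustration. Each pick balances relevance to the query against redundancy with what has already been picked.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of up to k documents chosen by maximal marginal relevance."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest already-selected document (0 if none yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 1.0, 0.0]
docs = [
    [1.0, 0.02, 0.0],  # relevant, but nearly identical to the next one
    [1.0, 0.05, 0.0],  # most relevant
    [0.0, 1.0, 0.0],   # slightly less relevant, but covers a different direction
]
by_similarity = sorted(range(len(docs)),
                       key=lambda i: cosine(query, docs[i]),
                       reverse=True)[:2]
by_mmr = mmr(query, docs, k=2)
print("top-2 by similarity:", by_similarity)
print("top-2 by MMR:       ", by_mmr)
```

Plain similarity returns the two near-duplicates, while MMR swaps the second one for the document that adds new information.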


In [181]:
query = "What is the total pharmacy out-of-pocket?"

retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 3})
relevant_docs = retriever.get_relevant_documents(query)   
relevant_docs

[Document(page_content='. Pharmacy out of pocket maximum per person is $6000 and for family out of pocket maximum is $12000 Note: Embedded Out-Of-Pocket Maximum Means That If You Have Family Coverage, Any Combination Of Covered Family Members May Help Meet The Family Out-Of-Pocket Maximum; However, No One Person Will Pay More Than His Or Her Embedded Individual Out-Of-Pocket Maximum Amount. OUT-OF-POCKET EXPENSES AND MAXIMUMS Benefit Plan(s) 001, 002, 005 CO-PAYS A Co-pay is the amount that the Covered Person pays each time certain services are received. The Co- pay is typically a flat dollar amount and is paid at the time of service or when billed by the provider', metadata={'source': 's3://sagemaker-us-east-1-710096454740/idp/genai/health_plan.pdf', 'page': 4}),
 Document(page_content='. However, certain covered benefits may be considered Preventive / Routine Care and paid first dollar. The Deductible amounts that the Covered Person Incurs for Covered Expenses, including covered Phar

## Build context from retrieved documents
---

We now have the three relevant pieces of text that may "contain" the answer to our question, but we are not quite there yet. So we will build a context from them and ask the question to the Anthropic Claude model. In this case, we create the context by simply concatenating the text chunks we retrieved from the vector DB.

In [172]:
full_context = str()
for doc in relevant_docs:
    full_context += doc.page_content+" "
    
print(full_context.strip(".").strip())

, and Your employer is pleased to sponsor this Plan to provide benefits that can help meet Your health care needs. Please read this document carefully and contact Your Human Resources or Personnel office if You have questions or if You have difficulty translating this document. ANYCOMPANY, INC. is named the Plan Administrator for this Plan. The Plan Administrator has retained the services of independent Third Party Administrators to process claims and handle other duties for this self-funded Plan. The Third Party Administrators for this Plan are UMR, Inc. (hereinafter "UMR") for medical claims, and Express Scripts for pharmacy claims . Amounts the Covered Person Incurs for Covered Expenses will be used to satisfy the Covered Person's (or family's, if applicable) annual out-of-pocket maximum(s). If the Covered Person's out-of-pocket expenses in a Plan Year exceed the annual out-of-pocket maximum, the Plan pays 100% of the Covered Expenses through the end of the Plan Year. The following 

The similarity search query gave us a good output, but we want some more key details out of it. Let's use an LLM to ask this question, but this time using the context that we created above.

In [32]:
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{document}</text>
<question>{question}</question>
<answer>"""

prompt = PromptTemplate(template=template, input_variables=["document","question"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")


llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
answer = llm_chain.run(document=full_context, question="What is the per-person pharmacy out-of-pocket?")
print(answer.strip())

$6,000


Now let's run it with a different question

In [33]:
answer = llm_chain.run(document=full_context, question="Who is the administrator for this plan?")
print(answer.strip())

I don't know.


The model doesn't know the answer because our context in `full_context` has no information about the administrator of the plan, and we asked the model to strictly answer from within the provided context. This means we will have to run a similarity search on the Vector database again using our new question, create the full context again, and then ask the question. Thankfully, LangChain makes it easy for us and we will see how.

### Performing Q&A with RAG with `RetrievalQA`
---

For this purpose, we will first define a question, and then generate embeddings from it. Once we have that, we can perform a similarity search on the vector database to find relevant pieces of information from the document. These relevant pieces of information will then be passed on to the model so that it can answer the question. We will use LangChain's `RetrievalQA` chain to perform Q&A with the model. The chain takes care of prompt creation and context generation with help from the vector database.

NOTE: In order to use the `RetrievalQA` from LangChain, your prompt template must have the two variables `context` and `question`. Using any other variable names will cause an error.
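A quick way to catch a misnamed variable before building the chain is to inspect the template with Python's standard `string.Formatter`. This is a small sketch; the `template_variables` helper is a hypothetical name, and the check itself is our own addition rather than something LangChain requires you to run.

```python
from string import Formatter


def template_variables(template):
    """Return the set of {placeholder} names used in a format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{context}</text>
<question>{question}</question>
<answer>"""

found = template_variables(template)
expected = {"context", "question"}
if found == expected:
    print("Template variables look right:", sorted(found))
else:
    print("Unexpected template variables:", sorted(found ^ expected))
```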

In [202]:
from langchain.chains import RetrievalQA
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate

retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.

<text>{context}</text>
<question>{question}</question>
<answer>"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question"])

chain_type_kwargs = { "prompt": qa_prompt, "verbose": False } # change verbose to True if you need to see what's happening

bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")
qa = RetrievalQA.from_chain_type(
    llm=bedrock_llm, 
    chain_type="stuff", 
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    verbose=False # change verbose to True if you need to see what's happening
)

question="Who is the administrator for this plan?"

result = qa.run(question)
print(result.strip())

 ANYCOMPANY, INC.


Perfect! Our model can now precisely answer the question. But how did it work?

- First, the question text was embedded using the Amazon Titan embedding model. This happened inside the `retriever` we defined earlier with `retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 3})`; our `vector_db` is a FAISS object that was initialized with the Amazon Titan embedding model.
- Next, the `RetrievalQA` chain used the retriever to search the vector database with the question embedding (an MMR search in our case) to find the pieces of text most relevant to the question we are trying to answer.
- Then the chain built the full context using the returned chunks and generated the full prompt using the `qa_prompt` template we provided.
- Finally, the chain invoked the model to get the response.
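The steps above can be sketched as one small function. This is a simplified picture, not LangChain's code: `answer_question` is a hypothetical helper, and the retriever and model are passed in as plain callables so that the flow can be followed (and run) without any AWS calls. The stubs below stand in for the FAISS retriever and the Claude model.

```python
PROMPT = """
Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know".

<text>{context}</text>
<question>{question}</question>
<answer>"""


def answer_question(question, retrieve, llm, prompt=PROMPT):
    """Retrieve chunks, stuff them into the prompt, and call the model."""
    chunks = retrieve(question)          # embedding + vector search happen here
    context = "\n\n".join(chunks)        # "stuff" all chunks into one context
    full_prompt = prompt.format(context=context, question=question)
    return llm(full_prompt)              # the model sees context + question


def stub_retrieve(question):
    """Stand-in for the vector DB retriever: always returns one known chunk."""
    return ["ANYCOMPANY, INC. is named the Plan Administrator for this Plan."]


def stub_llm(full_prompt):
    """Stand-in for the model: echoes the prompt so we can inspect it."""
    return full_prompt


seen = answer_question("Who is the administrator for this plan?",
                       stub_retrieve, stub_llm)
print(seen)
```

Printing the prompt like this is also a handy way to debug a real chain, which is what the `verbose` flags in the cell above are for.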

## Chat with your document
---

We will now create a simple chat application to chat with our document. This application will not only perform in-context Q&A, but will also be able to answer questions based on chat history. For the chatbot we need context management, history, vector stores, and many other things. We will start with a `ConversationalRetrievalChain`.

This chain uses conversation memory together with a retrieval Q&A chain, and allows passing in chat history, which can be used for follow-up questions. Source: https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html

_We will use Gradio to quickly spin up our chat interface. So we will install Gradio next. Then we will define our Conversation chain and plug that into the chat application._

In [None]:
!pip install -U gradio

In [174]:
# let's delete all entries from the index, we will create it again
vector_db.delete(list(vector_db.index_to_docstore_id.values()))

True

We will now read a document using `AmazonTextractPDFLoader`, split the pages into smaller chunks, generate embeddings using the Amazon Titan embedding model, and load them into our vector DB. We will also initialize our Claude model with `temperature=0.1` for less diversified responses, and with some stop sequences so that the model knows when to stop generating tokens.

In [175]:
from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import Bedrock
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import AmazonTextractPDFLoader

loader = AmazonTextractPDFLoader(f"s3://{data_bucket}/idp/genai/health_plan.pdf")
document = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=700,
                                               separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)

# Ensure that you have enabled amazon.titan-embed-text-v1 model in Amazon Bedrock console
embeddings = BedrockEmbeddings(client=bedrock,model_id="amazon.titan-embed-text-v1")
vector_db = FAISS.from_documents(documents=texts,embedding=embeddings)

bedrock_llm = Bedrock(client=bedrock, 
                      model_id="anthropic.claude-v1", 
                      model_kwargs={"temperature": 0.1,"stop_sequences": ["\n\nHuman:","</answer>"]})

To build our chat application, we will use a built-in LangChain chain called `ConversationalRetrievalChain`. This chain allows us to build a conversational interface that retains chat history and performs RAG on our vector DB retriever at the same time, without us having to code each of those steps individually. The purpose of the chat history is that, when it is provided as context, the model can recall the conversation and may enrich its responses (with the help of additional context retrieved by the `retriever`) based on the current question.
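Conceptually, each chat turn has two model calls: one rewrites the follow-up into a standalone question, and one answers it from the retrieved context. The sketch below is a simplified picture of that flow, not LangChain's actual code. `chat_turn`, `format_history`, the shortened templates, and the `Human:`/`Assistant:` history layout are assumptions, and the stubs stand in for the models and the retriever.

```python
CONDENSE = """Rephrase the follow up question to be a standalone question.
<chat_history>
{chat_history}
</chat_history>
<follow_up_question>
{question}
</follow_up_question>"""

QA = """<text>
{context}
{chat_history}
</text>
<question>
{question}
</question>
<answer>"""


def format_history(chat_history):
    """Turn (question, answer) tuples into a plain-text transcript."""
    return "\n".join(f"Human: {q}\nAssistant: {a}" for q, a in chat_history)


def chat_turn(question, chat_history, condense_llm, retrieve, answer_llm):
    """One turn: condense the question (if there is history), retrieve, answer."""
    history_text = format_history(chat_history)
    if chat_history:
        standalone = condense_llm(
            CONDENSE.format(chat_history=history_text, question=question))
    else:
        standalone = question  # nothing to condense on the first turn
    context = "\n\n".join(retrieve(standalone))
    return answer_llm(QA.format(context=context,
                                chat_history=history_text,
                                question=standalone))


calls = []


def stub_condense(prompt):
    calls.append("condense")
    return "What is the annual deductible per person?"


def stub_retrieve(question):
    calls.append("retrieve: " + question)
    return ["Deductible amounts are shown on the Schedule of Benefits."]


def stub_answer(prompt):
    return prompt  # echo the prompt so we can inspect what the model would see


history = [("Who is the plan administrator for this plan?", "ANYCOMPANY, INC.")]
# "And the deductible per person?" is a hypothetical follow-up question
prompt_seen = chat_turn("And the deductible per person?", history,
                        stub_condense, stub_retrieve, stub_answer)
print(calls)
print(prompt_seen)
```

Retrieval runs on the rewritten question, which is why a vague follow-up can still find the right chunks.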

In [176]:
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain
# from langchain.memory import ConversationBufferMemory
# from langchain.memory import ConversationBufferWindowMemory

def create_prompt_template():
    _template = """
    
Given the following chat history and a follow up question, rephrase the follow up question to be a standalone question, in its original language. Skip the preamble and just get to the question.

<chat_history>    
{chat_history}
</chat_history>    
<follow_up_question>
{question}
</follow_up_question>
"""
    conversation_prompt = PromptTemplate.from_template(_template)
    return conversation_prompt

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer. If the user greets you, just greet them back.

<text>
{context}
{chat_history}
</text>

<question>
{question}
</question>

<answer>
"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question","chat_history"])

retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

# all_memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True, output_key='answer')
# windowed_memory = ConversationBufferWindowMemory(memory_key="chat_history", k=4, return_messages=True)

qa = ConversationalRetrievalChain.from_llm(llm=bedrock_llm, 
                                           retriever=retriever, 
                                           condense_question_prompt=create_prompt_template(),
                                           condense_question_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1"),
                                           combine_docs_chain_kwargs={"prompt": qa_prompt},
                                           # verbose=True    # uncomment this to see logs
                                          )

questions = [
    "Hi AI, I am Bob Doe. How are you?",
    "Who is the plan administrator for this plan?",
    "What is the annual deductible per person?",
    "Do you remember my name?"
]
chat_history = []

for question in questions:
    result = qa({"question": question, "chat_history":chat_history})
    chat_history.append((question, result["answer"]))
    print(f"-> **Question**: {question} \n")
    print(f"**Answer**: {result['answer'].strip()} \n")

-> **Question**: Hi AI, I am Bob Doe. How are you? 

**Answer**: Hello Bob Doe. I'm doing well, thanks for asking. 

-> **Question**: Who is the plan administrator for this plan? 

**Answer**: ANYCOMPANY, INC. 

-> **Question**: What is the annual deductible per person? 

**Answer**: $5000 

-> **Question**: Do you remember my name? 

**Answer**: I don't know. 



We just had an automated chat session with a set of predetermined questions. Because the chat history is passed to the chain with every call, the model can use it to answer the fourth question and recall the name, as the `chat_history` contents below show. Responses can vary from run to run; in the run printed above, the model answered "I don't know" instead. Keep in mind that as the chat session gets longer, the chat memory keeps growing (for example, the commented-out `all_memory` line keeps ALL chat history). In such cases, it is important to limit how far back the chat is remembered so that you don't exceed token limits or get slower responses.
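Since we pass `chat_history` to the chain as a plain list of `(question, answer)` tuples, one simple way to cap its size is to keep only the most recent turns. The sketch below shows the idea; `windowed` is a hypothetical helper and the window size is an assumption to tune for your model's token limit.

```python
def windowed(chat_history, k=4):
    """Return only the last k (question, answer) turns of the history."""
    return chat_history[-k:] if k > 0 else []


history = [
    ("Hi AI, I am Bob Doe. Please remember my name.", " Hello Bob Doe."),
    ("Who is the plan administrator for this plan?", " ANYCOMPANY, INC.\n"),
    ("What is the annual deductible per person?", " $5000\n\n"),
    ("Do you remember my name?", " Yes, your name is Bob Doe.\n\n"),
]
recent = windowed(history, k=2)
for question, answer in recent:
    print(f"{question} -> {answer.strip()}")
```

The trimmed list would then be passed as `chat_history` in place of the full list. The trade-off is visible here: with `k=2` the turn where the user introduced themselves is no longer in the window.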

In [128]:
chat_history

[('Hi AI, I am Bob Doe. Please remember my name.', ' Hello Bob Doe.'),
 ('Who is the plan administrator for this plan?', ' ANYCOMPANY, INC.\n'),
 ('What is the annual deductible per person?', ' $5000\n\n'),
 ('Do you remember my name?', ' Yes, your name is Bob Doe.\n\n')]

## The Chat App with Gradio
---

Next we will build a simple chat app using Gradio and the same method we used above using `ConversationalRetrievalChain` and our vector database as a retriever. Note that our vector database is currently loaded with only one document. But you can imagine that you could have any number of documents loaded into the vector database.

Once you run the following code cell, here are some questions you can ask in the chat interface:

- Hi I am John Doe.
- Who is the plan Administrator?
- Who are the third party administrators?
- What is the per-person deductible?
- What is ERISA?
- Do you remember my name?       ---> Test if the bot remembers your name
- Based on your previous answers, who are the primary and the third party administrators of the plan? ---> Test chat history
- What is Co-pay?
- What is a deductible?
- What is the co-pay maximum for a family?  ---> Info not in the document
- What is the co-pay maximum for a person?  ---> Info not in the document
- What is the deductible maximum for a family?
- What is the maximum out of pocket for pharmacy for a person?
- What is the maximum out of pocket for pharmacy for a family?


In [179]:
import gradio as gr
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import ConversationalRetrievalChain

def create_prompt_template():
    _template = """
    
Given the following chat history and a follow up question, simply respond back with the question without any modifications. Skip any preamble text and reasoning and just generate the question.

<chat_history>
{chat_history}
</chat_history>
<follow_up_question>
{question}
</follow_up_question>
"""
    conversation_prompt = PromptTemplate.from_template(_template)
    return conversation_prompt

template = """

Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Always respond in the language the question was asked. Skip any preamble text and reasoning and give just the answer. If the user greets you, just greet them back.

<text>
{context}
{chat_history}
</text>

<question>
{question}
</question>

<answer>
"""

# define the prompt template
qa_prompt = PromptTemplate(template=template, input_variables=["context","question","chat_history"])

retriever = vector_db.as_retriever(search_type='mmr', search_kwargs={"k": 3})

qa = ConversationalRetrievalChain.from_llm(llm=bedrock_llm, 
                                           retriever=retriever, 
                                           condense_question_prompt=create_prompt_template(),
                                           condense_question_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1"),
                                           combine_docs_chain_kwargs={"prompt": qa_prompt}
                                          )
chat_history = []

def qa_fn(message, history):
    result = qa({"question": message, "chat_history":chat_history})
    chat_history.append((message, result["answer"]))
    return result['answer'].strip()

gr.ChatInterface(qa_fn).launch()

Running on local URL:  http://127.0.0.1:7868
Sagemaker notebooks may require sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Running on public URL: https://22e68bf81f537e344d.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)




## Cleanup
---

Let's clean up the files we uploaded to S3 earlier.

In [None]:
!aws s3 rm s3://{data_bucket}/idp/genai --recursive