# Augment Intelligent Document Processing with generative AI
---

<div class="alert alert-block alert-info"> 
    <b>NOTE:</b> You will need to use a Jupyter Kernel with Python 3.9 or above to use this notebook. If you are in Amazon SageMaker Studio, you can use the "Data Science 3.0" image.
</div>

In this notebook, we demonstrate how you can integrate Amazon Textract with LangChain as a document loader to extract data from documents and use generative AI capabilities within the various IDP phases. We will perform the following with different LLMs.

- Classification
- Summarization
- Standardization
- Spell check corrections
- Q&A with tables

In [None]:
!pip install langchain huggingface_hub

In [None]:
!pip install amazon-textract-textractor pypdf Pillow

In [None]:
import os
from getpass import getpass

HUGGINGFACEHUB_API_TOKEN = getpass()
os.environ["HUGGINGFACEHUB_API_TOKEN"] = HUGGINGFACEHUB_API_TOKEN

## 1. Classification
---

Classify a document based on its content, given a list of classes.

In [None]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """Given a list of 'Classes', classify the 'Document' into one of these classes. 

Classes: DISCHARGE_SUMMARY, RECEIPT, PRESCRIPTION
Document: {doc_text}"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.5, "max_length": 50},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
class_name = llm_chain.run(document[0].page_content)

print(f"The provided document is a {class_name}")
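
The model returns free-form text, so `class_name` may come back with different casing, extra whitespace, or trailing punctuation. Below is a minimal sketch of mapping the raw answer onto the allowed classes. The helper name `normalize_class` and the `None` fallback are assumptions for illustration and are not part of LangChain.

```python
from typing import Optional, Sequence

# The same classes that are listed in the classification prompt
CLASSES = ("DISCHARGE_SUMMARY", "RECEIPT", "PRESCRIPTION")

def normalize_class(raw_answer: str, classes: Sequence[str] = CLASSES) -> Optional[str]:
    """Map a free-form LLM answer onto one of the allowed class names."""
    # Upper-case and turn spaces/hyphens into underscores to match the class style
    cleaned = raw_answer.strip().upper().replace(" ", "_").replace("-", "_")
    # Return the first allowed class that appears in the cleaned answer
    for name in classes:
        if name in cleaned:
            return name
    return None  # the answer did not mention any known class

print(normalize_class(" discharge summary."))
```

Returning `None` for an unknown answer lets the caller decide whether to retry the prompt or route the document for manual review.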



## 2. Summarization
---

Summarize large pieces of text from a document into smaller, more concise explanations.

In [None]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template = """Given a full 'Document', summarize it for me. 

Document: {doc_text}"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.1, "max_length": 512},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)
summary = llm_chain.run(document[0].page_content)

print("Here's the summary of the document\n")
print("==================================\n")
print(summary)

## 3. Standardization
---

Note that the flan-t5-xxl model has a 1024-token limit. For this reason, we divide the problem into two parts:

- First, we ask the model to extract the desired value from the document text using the prompt template `template1`.
- Then we take the output of the first LLM call and pass it to a second template, `template2`, for standardization and formatting. `template2` uses few-shot prompting with examples to guide the LLM toward the desired output.

In [None]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

loader = AmazonTextractPDFLoader("./samples/discharge-summary.png")
document = loader.load()

template1 = """Given a full 'Document', answer the question. 

Document: {doc_text}
Question: {question}"""

template2 = """Convert the given date to MM/DD/YYYY format.
Examples-
Date: Nov-14-2023
Answer: 11/14/2023
Date: 05-Sep-2020
Answer: 09/05/2020
Date: {dt}
Answer:
"""

rep = "google/flan-t5-xxl"

llm = HuggingFaceHub(
    repo_id=rep,
    model_kwargs={"temperature": 0.1, "max_length": 50},
)

prompt1 = PromptTemplate(template=template1, input_variables=["doc_text", "question"])
llm_chain = LLMChain(prompt=prompt1, llm=llm)

prompt2 = PromptTemplate(template=template2, input_variables=["dt"])
llm_chain2 = LLMChain(prompt=prompt2, llm=llm)

chain = ( 
    llm_chain 
    | {'dt': lambda x: x['text'] }  
    | llm_chain2
)

std_op = chain.invoke({ "doc_text": document[0].page_content, 
                        "question": "Can you give me the patient admitted date?"})

print(std_op['text'])
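
Because an LLM can occasionally produce a wrong or malformed date, it may help to cross-check the standardized value deterministically. The sketch below uses only the Python standard library. The helper name `to_mmddyyyy` and the list of accepted input formats are assumptions for illustration; month abbreviations are parsed using the default (English) locale.

```python
from datetime import datetime
from typing import Optional

# Input formats to try, in order; this list is an assumption and can be extended
KNOWN_FORMATS = ("%b-%d-%Y", "%d-%b-%Y", "%m/%d/%Y")

def to_mmddyyyy(raw_date: str) -> Optional[str]:
    """Parse a date in any known format and re-emit it as MM/DD/YYYY."""
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(raw_date.strip(), fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        return parsed.strftime("%m/%d/%Y")
    return None  # no known format matched

print(to_mmddyyyy("Nov-14-2023"))
print(to_mmddyyyy("05-Sep-2020"))
```

Comparing this result with the chain's output is one way to flag dates that need a second look.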

## 4. Spell check and corrections
---

Perform grammatical and spelling corrections on text extracted from a handwritten document.

In [None]:
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain import HuggingFaceHub
from langchain import PromptTemplate, LLMChain

loader = AmazonTextractPDFLoader("./samples/hand_written_note.pdf")
document = loader.load()


template = """Given a detailed 'Document', perform spelling and grammatical corrections. Ensure the output is coherent, 
polished, and free from errors.

Document: {doc_text}
Corrected text:
"""


prompt = PromptTemplate(template=template, input_variables=["doc_text"])
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xxl",
    model_kwargs={"temperature": 0.8, "max_length": 1024},
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

try:
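    # Split the extracted text into sentences and trim surrounding whitespace
    lines = [x.strip() for x in document[0].page_content.split(".")]
    # Drop duplicates while preserving the original sentence order
    lines = list(dict.fromkeys(lines))
    for line in lines:
        if line:  # skip empty fragments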
            print("Extracted text")
            print("==============")
            print(line)
            std_op = llm_chain.run({"doc_text": line})

            print("Corrected text")
            print("==============")
            print(std_op)
            print("\n")
except Exception as e:
    print(str(e))
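
To review what the model actually changed, the extracted and corrected sentences can be compared word by word. The following is a minimal sketch using the standard library's `difflib`. The helper name `changed_words` and the sample sentences are hypothetical and do not come from the sample document.

```python
import difflib
from typing import List, Tuple

def changed_words(original: str, corrected: str) -> List[Tuple[str, str]]:
    """Return (before, after) pairs for every word span that differs."""
    a, b = original.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing word spans back into short strings
            changes.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes

# Hypothetical example sentences
for before, after in changed_words("The pateint was admited today",
                                   "The patient was admitted today"):
    print(f"{before!r} -> {after!r}")
```

A long list of changes for a short sentence can be a sign that the model rewrote the text rather than merely correcting it.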

## 5. Q&A with Tables
---

If you are in a SageMaker Studio environment, you will need to install gcc, gcc-c++, and a C++ compiler that supports C++11. On a Debian- or Ubuntu-based image (where `apt-get` is available), running the following may help if you encounter issues installing ChromaDB.

```
!apt-get update
!apt-get install build-essential -y
```

In [None]:
!apt-get update
!apt-get install build-essential -y

In [None]:
!pip install -U amazon-textract-prettyprinter amazon-textract-textractor langchain spacy -q

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
!pip install -U lark chromadb

View the list of available models.

In [None]:
# List all the available text generation (LLM) models in SageMaker JumpStart
from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
filter_value = "task == llm"

print("===== Available Models =====")
text_generation_models = list_jumpstart_models(filter=filter_value)
text_generation_models


### Model selection

LangChain's self-querying capabilities need a model that can accept more than 2k tokens at a time for a reasonably sized table. Larger tables may need models with larger context windows. For this demonstration, we will deploy a Falcon 40b BF16 model. To execute this section, you need access to SageMaker JumpStart models and must be in the us-east-1 region.

Note: SageMaker JumpStart is just one option for hosting an LLM; feel free to use any LLM of your choice.

### Falcon 40b BF16

In [None]:
(
    model_id,
    model_version,
) = (
    "huggingface-llm-falcon-40b-instruct-bf16",
    "*",
)

Please note that deploying this model with SageMaker JumpStart requires an `ml.g5.12xlarge` instance. Make sure that you have at least 1 instance of capacity available in the account and Region where you are deploying this endpoint. You can check the quota in the Service Quotas console [here](https://console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas) by searching for "ml.g5.12xlarge". The "Applied Quota Value" must show a value greater than 0.



In [None]:
from sagemaker.jumpstart.model import JumpStartModel

try:
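    # Deploy the selected JumpStart model and version to a real-time endpoint
    model = JumpStartModel(model_id=model_id, model_version=model_version, instance_type="ml.g5.12xlarge")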
    predictor = model.deploy()
except Exception as e:
    print(str(e))

In [None]:
endpoint_name = predictor.endpoint_name
region = "us-east-1"

In [None]:
from textractcaller.t_call import call_textract, Textract_Features
from trp import Document

# Call Amazon Textract with the TABLES feature to get the table structure
textract_json = call_textract(input_document="./samples/bank_statement.jpg", features=[Textract_Features.TABLES])

doc = Document(textract_json)
all_tables = []
for page in doc.pages:
    for table in page.tables:
        # Build one CSV-like string per table: quoted cells, one row per line
        row_text = ""
        for row in table.rows:
            row_text += ",".join(f'"{cell.text}"' for cell in row.cells) + "\n"
        all_tables.append(row_text)

len(all_tables)
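
Building the CSV text by hand works as long as no cell contains a double quote. As an alternative, the standard library's `csv` module handles quoting and escaping. The sketch below is illustrative only: the helper name `rows_to_csv` and the sample rows are hypothetical, and each row is assumed to be a plain list of cell strings.

```python
import csv
from io import StringIO
from typing import List

def rows_to_csv(rows: List[List[str]]) -> str:
    """Serialize rows of cell text to CSV, quoting every field."""
    buffer = StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in rows:
        # Trim surrounding whitespace from each cell before writing
        writer.writerow([cell.strip() for cell in row])
    return buffer.getvalue()

# Hypothetical rows, including a cell with both quotes and a comma
sample_rows = [
    ["Date", "Description", "Amount ($)"],
    ["02/01/2022", 'Payment "ACME", Inc.', "1,200.00"],
]
print(rows_to_csv(sample_rows))
```

Text written this way can be read back with `csv.reader` without losing embedded commas or quotes.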

This document contains more than one table so we will use the first table to perform Q&A on it.

In [None]:
print(all_tables[0])

There are two tables on this page; let's do Q&A on the first one. Note that we are going to use LangChain's `SelfQueryRetriever`, which is helpful for Q&A with tables. However, since we are using the Flan-T5 Hugging Face hosted API, the input token limit is only 1024 tokens, which is not sufficient to accommodate all the rows of our table. You can deploy this model on SageMaker JumpStart with a larger instance type to get a higher token limit, or use a different, larger model such as one from Anthropic. For demonstration purposes, we will keep only the first 3 rows of the table via the `docs = docs[:3]` line of code.

We will now load the 3 row table into Chroma DB and try to perform Q&A with it.
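
Rather than hard-coding the number of rows, the cut-off could be derived from a token budget. The sketch below is a rough illustration only. The four-characters-per-token ratio is a rule-of-thumb assumption and not a property of any specific tokenizer, and the helper names `estimate_tokens` and `rows_within_budget` are hypothetical.

```python
import math
from typing import List

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is a rule-of-thumb assumption."""
    return math.ceil(len(text) / chars_per_token)

def rows_within_budget(rows: List[str], token_budget: int) -> List[str]:
    """Keep leading rows until the estimated token total would exceed the budget."""
    kept, used = [], 0
    for row in rows:
        cost = estimate_tokens(row)
        if used + cost > token_budget:
            break  # adding this row would exceed the budget
        kept.append(row)
        used += cost
    return kept

# Hypothetical rows of 40 characters each (about 10 estimated tokens per row)
sample = ["a" * 40, "b" * 40, "c" * 40]
print(len(rows_within_budget(sample, token_budget=25)))
```

For an accurate count, the model's own tokenizer should be used in place of the character-based estimate.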

In [None]:
from langchain.schema import Document
from langchain.vectorstores import Chroma
from langchain.embeddings.spacy_embeddings import SpacyEmbeddings
import csv
from io import StringIO

f = StringIO(all_tables[0])
reader = csv.reader(f)
headers = next(reader)
result = []

def to_integer(value):
    try:
        return int(float(value.replace(',', '')))
    except ValueError:
        return ""

for row in reader:
    if len(row) == len(headers):
        for i in [-1, -2, -3]:
            row[i] = to_integer(row[i])
        metadata = {headers[i].strip(): row[i] for i in range(len(headers))}
        page_content = ",".join(map(str, row))
        tuple_entry = (page_content, metadata)
        result.append(tuple_entry)
        
docs = list()
for item in result:
    docs.append(Document(page_content=item[0],metadata=item[1]))
# Keep only the first 3 rows so that the table fits within the model's token limit
docs = docs[:3]

# create the open-source embedding function
embedding_function = SpacyEmbeddings()

vectorstore = Chroma.from_documents(docs, embedding_function)

In [None]:
# To delete the Chroma DB in-memory collection, un-comment and execute the line below
# vectorstore.delete_collection()

We have loaded the table into the vector database. Next, we create an LLM object using the LangChain `SagemakerEndpoint` class. This object calls the SageMaker endpoint hosting the Falcon model from within the LangChain chain for inference. The `ContentHandler` class receives the prompt and formats it the way the SageMaker endpoint expects; it also receives the output from the LLM and returns the model's generated text.
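
The request and response shapes can be illustrated without an endpoint. The sketch below mirrors the JSON layout used in this notebook (`inputs` and `parameters` in the request, a list with a `generated_text` entry in the response). The function names are hypothetical, and `io.BytesIO` stands in for the streaming body that the endpoint returns.

```python
import io
import json

def build_payload(prompt: str, model_kwargs: dict) -> bytes:
    """Encode the prompt and generation parameters as a JSON request body."""
    return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

def parse_response(body) -> str:
    """Read a file-like response body and return the generated text."""
    response_json = json.loads(body.read().decode("utf-8"))
    return response_json[0]["generated_text"]

payload = build_payload("Hello", {"max_new_tokens": 100})
print(payload)

# Stand-in for the endpoint's response stream (an assumption for illustration)
fake_body = io.BytesIO(b'[{"generated_text": "Hi there"}]')
print(parse_response(fake_body))
```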

In [None]:
import json
import re
from langchain import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain import PromptTemplate, LLMChain
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:        
        # Shorten the LangChain-injected prompt: drop Example 2 and renumber Example 3
        prompt = re.sub(r'<< Example 2. >>.*?(?=<< Example 3. >>)', '', prompt, flags=re.DOTALL)
        prompt = prompt.replace('<< Example 3. >>', '<< Example 2. >>')
        input_str = json.dumps({"inputs": prompt,  "parameters": model_kwargs})         
        return input_str.encode('utf-8')
    
    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))        
        return response_json[0]["generated_text"]

content_handler = ContentHandler()

llm=SagemakerEndpoint(
        endpoint_name=endpoint_name, 
        region_name=region,
        model_kwargs={"do_sample": True,
                                    "top_p": 0.9,
                                    "temperature": 0.8,
                                    "max_new_tokens":  100,
                                    "stop": ["<|endoftext|>", "</s>"]},
        content_handler=content_handler
    )
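
The prompt-shortening step in `transform_input` can be tried in isolation. The sketch below applies the same pattern (with the dots escaped) to a toy prompt. The prompt text is hypothetical and much shorter than what LangChain actually injects; the helper name `shorten_prompt` is also an assumption.

```python
import re

def shorten_prompt(prompt: str) -> str:
    """Drop the '<< Example 2. >>' section and renumber Example 3 as Example 2."""
    # Lazily match from the Example 2 marker up to (not including) the Example 3 marker
    prompt = re.sub(r"<< Example 2\. >>.*?(?=<< Example 3\. >>)", "", prompt, flags=re.DOTALL)
    return prompt.replace("<< Example 3. >>", "<< Example 2. >>")

# Hypothetical stand-in for the injected few-shot prompt
toy_prompt = "<< Example 1. >>\nA\n<< Example 2. >>\nB\n<< Example 3. >>\nC\n"
print(shorten_prompt(toy_prompt))
```

Removing one few-shot example trades some guidance for a shorter prompt, which leaves more of the token budget for the table rows.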


In the final step, we define the schema of the table using LangChain's `AttributeInfo` model, which helps the LLM understand the structure of the table. We then create a retriever from the LLM (created earlier), the vector store, and the schema.

In [None]:

metadata_field_info = [
    AttributeInfo(
        name="Date",
        description="Date of the bank transaction",
        type="string",
    ),
    AttributeInfo(
        name="Description",
        description="Description of the bank transaction",
        type="string",
    ),
    AttributeInfo(
        name="Deposits ($)",
        description="The dollar amount deposited into the bank account",
        type="integer",
    ),
    AttributeInfo(
        name="Withdrawals ($)",
        description="The dollar amount withdrawn from the bank account",
        type="integer",
    ),
    AttributeInfo(
        name="Amount ($)",
        description="The total dollar amount balance in the bank account",
        type="integer",
    )
]
document_content_description = "Bank Statement"

retriever = SelfQueryRetriever.from_llm(
    llm, vectorstore, document_content_description, metadata_field_info, 
    verbose=True
)

In [None]:
try:
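    results = retriever.get_relevant_documents("List the transactions with more than $1000 in Deposits ($)")
    # Values inside a try block are not auto-displayed by the notebook, so print each match
    for match in results:
        print(match.page_content, match.metadata)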
except Exception as e:
    print(str(e))
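
As a sanity check that does not depend on the LLM, the same question can be answered directly against the row metadata. The sketch below filters plain dictionaries shaped like the metadata built earlier in this notebook. The rows and the helper name `deposits_over` are hypothetical and do not come from the sample bank statement.

```python
from typing import Dict, List

# Hypothetical rows; non-numeric cells are stored as "" as in the notebook's to_integer
rows: List[Dict[str, object]] = [
    {"Date": "02/01/2022", "Description": "Payroll", "Deposits ($)": 2500, "Withdrawals ($)": "", "Amount ($)": 5200},
    {"Date": "02/03/2022", "Description": "Groceries", "Deposits ($)": "", "Withdrawals ($)": 120, "Amount ($)": 5080},
    {"Date": "02/07/2022", "Description": "Refund", "Deposits ($)": 300, "Withdrawals ($)": "", "Amount ($)": 5380},
]

def deposits_over(rows: List[Dict[str, object]], threshold: int) -> List[Dict[str, object]]:
    """Return the rows whose deposit is an integer greater than the threshold."""
    return [
        row for row in rows
        if isinstance(row["Deposits ($)"], int) and row["Deposits ($)"] > threshold
    ]

for row in deposits_over(rows, 1000):
    print(row["Date"], row["Description"], row["Deposits ($)"])
```

If the retriever's results differ from a direct filter like this one, the structured query generated by the LLM is a good first place to look.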

## Cleanup
---

Delete the SageMaker JumpStart model and endpoint.

In [None]:
predictor.delete_model()
predictor.delete_endpoint()