# Document Extraction

In this lab, we will look at how to extract table information from documents.



- [Step 1: Setup notebook](#step1)
- [Step 2: Extract table from a sample doc using Amazon Textract](#step2)
- [Step 3: Look at the other ways to extract structured and semi-structured data using Textract](#step3)

---

# Step 1: Setup notebook <a id="step1"></a>

In this step, we import the libraries used throughout this notebook and create the AWS clients we need.

In [None]:
import boto3
import botocore
import sagemaker
import pandas as pd
from IPython.display import display
from textractcaller.t_call import call_textract, Textract_Features
from textractprettyprinter.t_pretty_print import convert_table_to_list
from trp import Document
import os

# variables
data_bucket = sagemaker.Session().default_bucket()
region = boto3.session.Session().region_name
account_id = boto3.client('sts').get_caller_identity().get('Account')

os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
role = sagemaker.get_execution_role()

print(f"SageMaker role is: {role}\nDefault SageMaker Bucket: s3://{data_bucket}")

s3 = boto3.client('s3')
textract = boto3.client('textract', region_name=region)
comprehend = boto3.client('comprehend', region_name=region)

%store -r document_classifier_arn
print(f"Amazon Comprehend Custom Classifier ARN: {document_classifier_arn}")


---
# Step 2: Extract table using Amazon Textract <a id="step2"></a>

In this step we will take a brief look at how to extract table information from one of the bank statements from our dataset. 

In [None]:
prefix = 'idp/comprehend/classified-docs/bank-statements'
start_after = 'idp/comprehend/classified-docs/bank-statements/'

paginator = s3.get_paginator('list_objects_v2')
operation_parameters = {'Bucket': data_bucket,
                        'Prefix': prefix,
                        'StartAfter':start_after}
list_items=[]
page_iterator = paginator.paginate(**operation_parameters)

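for page in page_iterator:
    # A page has no "Contents" key when the prefix matches no objects
    for item in page.get('Contents', []):
        print(item['Key'])
        list_items.append(f's3://{data_bucket}/{item["Key"]}')

# Fall back to the local sample document if no bank statements were found in S3
if not list_items:
    list_items.append('./samples/mixedbag/document_0.png')
list_items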
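
The listing logic above can also be factored into a small helper that works on the page dictionaries alone, which makes it easy to check without S3 access. The following is a minimal sketch, not part of the lab code: `collect_s3_uris`, the example bucket name, and the example keys are hypothetical, and the hand-written pages only mimic the shape of `list_objects_v2` output.

```python
from typing import Iterable, List


def collect_s3_uris(pages: Iterable[dict], bucket: str, fallback: str) -> List[str]:
    """Build s3:// URIs from list_objects_v2 pages; use `fallback` only if nothing was found."""
    uris = []
    for page in pages:
        # A page has no "Contents" key when the prefix matches no objects.
        for item in page.get("Contents", []):
            uris.append(f"s3://{bucket}/{item['Key']}")
    return uris or [fallback]


# Hand-written pages that mimic the shape of list_objects_v2 output (not real data).
example_pages = [
    {"Contents": [{"Key": "idp/comprehend/classified-docs/bank-statements/doc_1.png"}]},
    {"Contents": [{"Key": "idp/comprehend/classified-docs/bank-statements/doc_2.png"}]},
]
fallback_doc = "./samples/mixedbag/document_0.png"

example_uris = collect_s3_uris(example_pages, "example-bucket", fallback_doc)
empty_uris = collect_s3_uris([{"KeyCount": 0}], "example-bucket", fallback_doc)

print(example_uris)
print(empty_uris)
```

In the notebook, the same helper could be called as `collect_s3_uris(page_iterator, data_bucket, './samples/mixedbag/document_0.png')`.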

Let's select a random bank statement from the list.

In [None]:
import random

# Select a random bank statement document from the list
file = random.choice(list_items)
file

Each of our bank statements contains two tables. We will extract them with Amazon Textract and convert each one into a pandas DataFrame using the Textract pretty printer tool.

In [None]:
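# Call Textract with the TABLES feature so the response includes table and cell blocks
resp = call_textract(input_document=file, features=[Textract_Features.TABLES])

# Parse the raw response into pages, tables, rows, and cells
tdoc = Document(resp)
dfs = []

for page in tdoc.pages:
    for table in page.tables:
        # Convert each table into a list of rows and load it into a DataFrame
        dfs.append(pd.DataFrame(convert_table_to_list(trp_table=table)))

# The bank statements are expected to contain two tables
df1 = dfs[0]
df2 = dfs[1]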
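
To see what the parser is doing for us, it can help to look at the raw response shape. In a Textract response, a `TABLE` block lists its `CELL` blocks as `CHILD` relationships, and each cell lists its `WORD` blocks the same way. The following is a simplified sketch of that walk, not the implementation used by `trp`: the function name `tables_from_blocks` and the sample blocks are made up for illustration, and merged cells, row and column spans, and selection elements are ignored.

```python
from typing import Dict, List


def tables_from_blocks(blocks: List[dict]) -> List[List[List[str]]]:
    """Rebuild each TABLE block as a list of rows, each row a list of cell strings."""
    by_id: Dict[str, dict] = {b["Id"]: b for b in blocks}

    def child_ids(block: dict) -> List[str]:
        # "Relationships" is absent on blocks without children (for example, empty cells).
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for block_id in rel["Ids"]
        ]

    tables = []
    for block in blocks:
        if block["BlockType"] != "TABLE":
            continue
        cells = [by_id[i] for i in child_ids(block) if by_id[i]["BlockType"] == "CELL"]
        n_rows = max((c["RowIndex"] for c in cells), default=0)
        n_cols = max((c["ColumnIndex"] for c in cells), default=0)
        grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
        for cell in cells:
            words = [by_id[i]["Text"] for i in child_ids(cell) if by_id[i]["BlockType"] == "WORD"]
            # RowIndex and ColumnIndex are 1-based in the Textract response.
            grid[cell["RowIndex"] - 1][cell["ColumnIndex"] - 1] = " ".join(words)
        tables.append(grid)
    return tables


# Hand-written blocks that mimic the response shape (illustrative, not real output).
sample_blocks = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3", "w4"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2},
    {"Id": "w1", "BlockType": "WORD", "Text": "Date"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Amount"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Jan"},
    {"Id": "w4", "BlockType": "WORD", "Text": "3"},
]

sample_tables = tables_from_blocks(sample_blocks)
print(sample_tables)
```

In the notebook, the block list would come from `resp["Blocks"]`. For real documents, the response parser used above is the safer choice because it handles the cases this sketch skips.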

In [None]:
df1

In [None]:
df2
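
Displayed this way, the DataFrames carry integer column labels, because `pd.DataFrame` receives plain lists of rows. If the first row of a table holds the column names (an assumption about these statements; check the output above before relying on it), the rows can be turned into labeled records first. The sketch below uses plain Python so it runs without pandas; `rows_to_records` and the sample rows are hypothetical and not taken from the lab's dataset.

```python
from typing import Dict, List


def rows_to_records(rows: List[List[str]]) -> List[Dict[str, str]]:
    """Treat the first row as column names and return one dict per remaining row."""
    if not rows:
        return []
    # Strip surrounding whitespace, which extracted cell text may carry.
    header = [cell.strip() for cell in rows[0]]
    return [
        {name: cell.strip() for name, cell in zip(header, row)}
        for row in rows[1:]
    ]


# Made-up rows in the same list-of-lists shape (illustrative only).
sample_rows = [
    ["Date ", "Description ", "Amount "],
    ["2022-01-03 ", "Deposit ", "100.00 "],
    ["2022-01-05 ", "Withdrawal ", "40.00 "],
]

records = rows_to_records(sample_rows)
print(records)
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame(records)` to get a DataFrame with named columns.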

---
# Step 3: Extract structured and semi-structured data using Amazon Textract <a id="step3"></a>

Let's look at some of the other ways Amazon Textract can extract structured as well as semi-structured data from documents. We will pull in a notebook from the Amazon Textract [code sample repository](https://github.com/aws-samples/amazon-textract-code-samples/tree/master/python). 

Run the code cell below to pull the notebook. Once the notebook named `02-idp-document-extraction-01.ipynb` appears, open it and run the sections listed below. These sections demonstrate how to extract form data, table data, and invoice and receipt data using Amazon Textract. We pull a single notebook and look at a few specific functionalities.

- Section 8. Forms: Key/Values
- Section 10. Tables
- Section 12. Invoices and Receipts processing

Run the code below, then execute the sections listed above in the `02-idp-document-extraction-01.ipynb` file.

In [None]:
!wget 'https://github.com/aws-samples/amazon-textract-code-samples/raw/master/python/Textract.ipynb' -O './02-idp-document-extraction-01.ipynb'
!wget 'https://github.com/aws-samples/amazon-textract-code-samples/raw/master/python/OneKeyValue.png' -O './OneKeyValue.png'


You can further explore all Amazon Textract capabilities by cloning the entire code repository using the `git clone` command below.

`git clone https://github.com/aws-samples/amazon-textract-code-samples`
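
As orientation for the Forms section of the pulled notebook: when the `FORMS` feature is requested, the response contains `KEY_VALUE_SET` blocks. A key block points to its value block through a `VALUE` relationship, and both point to their `WORD` blocks through `CHILD` relationships. The following is a simplified sketch of how key/value pairs could be read from those blocks; `key_values_from_blocks` and the sample blocks are made up for illustration, and selection elements such as checkboxes are ignored.

```python
from typing import Dict, List


def key_values_from_blocks(blocks: List[dict]) -> Dict[str, str]:
    """Map the text of each KEY block to the text of its linked VALUE block(s)."""
    by_id = {b["Id"]: b for b in blocks}

    def related_ids(block: dict, rel_type: str) -> List[str]:
        return [
            block_id
            for rel in block.get("Relationships", [])
            if rel["Type"] == rel_type
            for block_id in rel["Ids"]
        ]

    def text_of(block: dict) -> str:
        # Join the WORD children of a key or value block into one string.
        return " ".join(
            by_id[i]["Text"]
            for i in related_ids(block, "CHILD")
            if by_id[i]["BlockType"] == "WORD"
        )

    pairs = {}
    for block in blocks:
        is_key = block["BlockType"] == "KEY_VALUE_SET" and "KEY" in block.get("EntityTypes", [])
        if is_key:
            values = [by_id[i] for i in related_ids(block, "VALUE")]
            pairs[text_of(block)] = " ".join(text_of(v) for v in values)
    return pairs


# Hand-written blocks that mimic the response shape (illustrative, not real output).
sample_form_blocks = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1", "w2"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Account"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Number:"},
    {"Id": "w3", "BlockType": "WORD", "Text": "12345"},
]

form_pairs = key_values_from_blocks(sample_form_blocks)
print(form_pairs)
```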

---
# Cleanup

Skip cleanup for now if you plan to execute the subsequent notebooks.

Refer to `05-idp-cleanup.ipynb` for cleanup and deletion of resources.

---
# Conclusion

In this notebook, we extracted tables from a bank statement and looked at a few additional ways Amazon Textract can extract structured and semi-structured data, such as form data, from our documents. In the next notebook, we will extract entity information from our documents using Amazon Comprehend.