# Watson Document Understanding Notebook

### Use Case 

This notebook demonstrates how to extract data from readable and non-readable (for example, scanned) PDFs and from images using Watson Document Understanding (WDU). The extracted data can include text, tables, images, and other content from documents such as contracts, invoices, and financial documents. It can then be used for tasks such as data entry, document categorization, and information retrieval. Watson Document Understanding combines IBM optical character recognition (OCR) with machine learning to automate information extraction, which can make it faster and more accurate than manual data entry. Automating the extraction and storage of important information in a database can also help organizations meet their compliance and regulatory requirements.

The data used in this notebook is taken from [FinTabNet](https://developer.ibm.com/exchanges/data/all/fintabnet/). This dataset contains a large number of tabular documents of different types, whose contents can be extracted and loaded into a database.



### What you'll learn in this notebook
Watson Document Understanding offers so-called blocks for various parser and OCR tasks. This notebook shows:

- **Parse API** This API parses the input PDF file and returns all tokens with their corresponding bounding boxes. The input file to be parsed must be sent under the name `file`. The API returns its output in JSON format. Several parameters can be passed to the Parse API to improve the results.
    
    
- **Convert API** This API converts the input PDF file into HTML that comprises the text parsed from the document, along with additional metadata including font style information and document structures such as section titles, tables, headers, and footers. You can open the resulting HTML file in any browser to see the output.


- **Customizing the Watson Document Understanding REST API Output** Several parameters are available; using them can give more accurate results.

    1. **Turn off image processing**: The `image_processing` parameter controls whether data is parsed from images. It is a boolean; set it to `False` if you do not want to parse data from images.
    
    1. **Use custom table identification (GTE)**: Global Table Extraction (GTE) uses convolutional neural networks (CNNs) to detect tables and AI models to determine table structure. With this feature you can extract tables of any type from documents. To enable it, set the `table` parameter to `'gte'`.
    
    1. **Use custom OCR model (IOCR)**: IOCR (IBM OCR) is a deep-learning-based OCR engine that is optimized for speed and accuracy. IOCR contains two main deep-learning models:

          1. A segmentation model, which detects and segments text in the image.
          2. A recognition model, which transcribes the text in the areas located by the segmentation model. IOCR was trained using synthetic data.

          By using both models, IOCR provides good accuracy in many cases where other OCR engines do not:
              1. Better token-level accuracy
              2. Higher character accuracy when the background is noisy or not white
              3. Higher detection recall
              4. Higher bounding-box detection accuracy
      
      To enable the IOCR model, set `ocr_model` to `iocr`.
    
    1. **Use custom table identification (GTE) and custom OCR model (IOCR)**: 
    With the IOCR model and GTE both enabled, you can extract data and tables from images embedded in PDF files and from scanned PDF files. To do so, set the `image_processing` and `table` config parameters, and define the `ocr_model` and `table` values within them.
    
    1. **Text filters**: 
    By default, all text filters are applied in both `parse_api` and `convert_api`. These filters remove watermarks, white text on a white background, and white lines from the document. The filter configuration below is set through the `config` parameter of the REST API.
    
    {
    filters {
            ##Filters used to remove/cleanup unwanted text and graphical lines in the document
            ##Removes text watermark from document
                exclude_watermarks = false
            ##Removes white text on white background from document
                filter_white_text = true
            ##Removes white lines from the document
                filter_white_lines = true
            }
    }
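
As a rough illustration of how the options above might be collected before calling the REST API, the sketch below builds a parameter dictionary. The names `image_processing`, `table`, and `ocr_model` come from the descriptions above; the helper name `build_wdu_params` and the way the values are serialized (flat, lowercase query parameters) are assumptions, so check the WDU API documentation for the format the service actually expects.

```python
def build_wdu_params(output, image_processing=None, table=None, ocr_model=None):
    """Collect WDU options into a dict, skipping options that were not set.

    Hypothetical helper: whether WDU expects these values as query parameters
    or inside a config payload is an assumption, not confirmed by this notebook.
    """
    params = {"output": output}
    if image_processing is not None:
        # Serialize the boolean as a lowercase string (assumed format)
        params["image_processing"] = str(bool(image_processing)).lower()
    if table is not None:
        params["table"] = table          # for example "gte"
    if ocr_model is not None:
        params["ocr_model"] = ocr_model  # for example "iocr"
    return params


params = build_wdu_params("page_55.json", image_processing=False, table="gte", ocr_model="iocr")
print(params)
```

Leaving an argument as `None` omits it from the dictionary, so the service defaults would apply for that option.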

  

        

## Table of Contents

1. [Setting up the environment for WDU](#beforeYouStart)
    1.  [Before you start](#beforeYouStart)
    1.	[IBM cloud login](#cloudlogin)
    1.  [Docker login for image registry](#dockerlogin)
    1.  [Run pre-built WDU container](#WDUContainer)
1.  [Using the WDU service](#wduservice)
    1.  [Use the parse document API](#parseapi)
    1.  [Use the convert document API](#convertapi)

In [28]:
# Importing required libraries

from IPython.display import IFrame
import json
# import drawSvg as draw

import pandas as pd
import watson_nlp

### 1. Setting up the environment for WDU
<a id="beforeYouStart"></a>
### 1.1 Before you start
For now, the WDU images are stored in a container registry in the IBM Cloud account 1473161 - redsonja. Follow the process outlined [here](https://pages.github.ibm.com/ai-foundation/one-conversion/setup/) to get access to that account.

In addition, you will need the following on your machine:
1. The ibmcloud CLI.
1. Docker or Podman
1. Run the Service


**Log in to the Container Registry**
Once you have access to the account, create an API key and make sure that you are using the 1473161 - redsonja account. In your shell, set an environment variable to your API key.

<div class="alert alert-block alert-info">
export YOUR_IBM_CLOUD_API_KEY="XXXXXXXX"
</div>
    

<a id="cloudlogin"></a>
### 1.2 IBM Cloud login
To log in to IBM Cloud, replace "XXXXXXXX" in the command below with your API key and run it. Do not save a real API key in the notebook.

In [4]:
!ibmcloud login --apikey XXXXXXXX

API endpoint: https://cloud.ibm.com
Authenticating...
OK

Targeted account redsonja (f264f50d54eb2ef70b77ff502ade71b7) <-> 1473161


Select a region (or press enter to skip):
1. au-syd
2. in-che
3. jp-osa
4. jp-tok
5. kr-seo
6. eu-de
7. eu-es
8. eu-gb
9. ca-tor
10. us-south
11. us-south-test
12. us-east
13. br-sao
Enter a number> ^C



<a id="dockerlogin"></a>
### 1.3  Docker login for image registry:
To run the Docker image, you first have to log in to the image registry with Docker. Replace 'XXXXX' in the command below with the API key from above and run it. Do not save a real API key in the notebook.


In [6]:
!docker login -u iamapikey -p XXXXX us.icr.io

Login Succeeded


<a id="WDUContainer"></a>
### 1.4 Run pre-built WDU container:
The command below runs the WDU container at a specific version. To use a different version, replace 'v0.3.1' in the image tag with another tag, for example 'latest'.


In [7]:
!docker run --rm --name oneconversion -p 9443:9443 us.icr.io/discovery_ingestion/oneconversion:v0.3.1

Unable to find image 'us.icr.io/discovery_ingestion/oneconversion:v0.3.1' locally
v0.3.1: Pulling from discovery_ingestion/oneconversion
Digest: sha256:69451fa79a910592dddaa832399643efc4bd21384b9f9d19e6c15d2c4c7e0f16

2023-10-18 09:22:42,386 start_model.py        monitor_model() 37   INFO     Checking state ... [3s][300s]
2023-10-18 09:22:42,386 start_model.py        monitor_model() 64   INFO     Model is loaded and running!
2023-10-18 09:22:42,386 start_model.py        monitor_model() 23   INFO     Starting monitoring model: /models/GTE/
2023-10-18 09:22:42,386 start_model.py        monitor_model() 30   INFO     Model state JSON: /models/GTE/process_state.json
2023-10-18 09:22:42,387 start_model.py        monitor_model() 37   INFO     Checking state ... [0s][300s]
2023-10-18 09:22:45,390 start_model.py        monitor_model() 37   INFO     Checking state ... [3s][300s]
2023-10-18 09:22:48,394 start_model.py        monitor_model() 37   INFO     Checking state ... [6s][300s]
2023-10-18 09:22:51,397 start_model.py        monitor_model() 37   INFO     Checking state ... [9s][300s]
2023-10-18 09:22:54,401 start_model.py        monitor_model() 37   INFO     Checking state ... [12s][300s]
2023-10-18 09:22:
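
The container log above shows the service polling its models every few seconds with a 300-second limit, so requests sent immediately after `docker run` may fail. A minimal sketch of waiting for the service port before calling the API follows; the helper name `wait_for_port` is hypothetical, and the default timeout and interval only mirror the values seen in the log. An open port shows that the server accepts connections, not necessarily that every model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=3.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example: call wait_for_port("localhost", 9443) before sending the first request
```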

<a id="wduservice"></a>
### 2.  Using the WDU service
<a id="parseapi"></a>
### 2.1 Use the parse document API:

With the Parse API (`parse_document`), you can extract all tokens with their corresponding bounding boxes. It returns the output in JSON format, which you can normalize to extract the full text. 
To parse an example PDF document from the repository using the REST API, run the following code.

In [8]:
# Import the requests library to call the REST API
import requests
# URL of the parse_document REST endpoint
parse_documnet_url ='https://localhost:9443/api/v1/parse_document?'

<span style="color:blueviolet"><strong>Step 2.1.1</strong> The getDatafromDocument method calls the REST API with a POST request and the given parameters. You pass it the input file and the output filename.</span>

In [9]:
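# Send a document to the DU service and save the HTML or JSON response
def getDatafromDocument(url, params, input_filename, output_filename):
    # input_filename is the files dict, e.g. {'file': <open file object>}
    # verify=False skips TLS certificate verification for the local container
    response = requests.request("POST", url, verify=False, params=params, files=input_filename)
    print(response.status_code)
    if response.status_code != 200:
        print("DU Service status:", response.text)
        print("Creating file ---", output_filename)
    # The with-block closes the file, so no explicit close() is needed
    with open(output_filename, mode='w') as f:
        f.write(response.text)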

<span style="color:blueviolet"><strong>Step 2.1.2</strong> Below, pass the file you want to parse in the files dictionary, and pass output_filename as the name of the file that will store the extracted text.</span>

In [7]:
output_filename='page_55.json'
params ={'output':output_filename}
files = {'file': open('page_55.pdf', 'rb')}
getDatafromDocument(parse_documnet_url,params,files,output_filename)



200
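
The call above opens the PDF without ever closing it. One possible variant, sketched below, opens the input inside a `with` block and raises an error on a non-200 response instead of writing the error body to the output file. The name `post_document` and the injectable `post` argument are illustrative additions, not part of the WDU tooling; `verify=False` simply mirrors the notebook's setting for the local container.

```python
def post_document(url, params, input_path, output_path, post=None):
    """POST a file to a WDU endpoint and save the response body.

    `post` defaults to requests.post; it can be swapped out so the function
    can be exercised without a running service (hypothetical convenience).
    """
    if post is None:
        import requests
        post = requests.post
    # The with-block closes the input file even if the request raises
    with open(input_path, "rb") as pdf:
        response = post(url, params=params, files={"file": pdf}, verify=False)
    if response.status_code != 200:
        raise RuntimeError(f"DU service returned {response.status_code}: {response.text}")
    with open(output_path, "w", encoding="utf-8") as out:
        out.write(response.text)
    return response.status_code
```

With the service running, `post_document(parse_documnet_url, {'output': 'page_55.json'}, 'page_55.pdf', 'page_55.json')` would be the equivalent of the cell above.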


In [8]:
IFrame(src='./page_55.json', width=900, height=400)

<span style="color:blue">The output JSON file stores the extracted data together with the bounding-box parameters.</span>
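
To turn the parsed JSON back into plain text, one option is to walk the structure and collect the token strings. The sketch below does this generically; the key name `text` and the sample structure are assumptions made for illustration, since the exact schema of the WDU output is not shown in this notebook. Inspect `page_55.json` to find the real key names before relying on it.

```python
import json


def collect_values(node, key="text"):
    """Recursively gather string values stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, str):
                found.append(v)
            else:
                found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up sample shaped like "tokens with bounding boxes"; not real WDU output
sample = json.loads(
    '{"pages": [{"tokens": ['
    '{"text": "Total", "bbox": [10, 20, 50, 30]}, '
    '{"text": "42", "bbox": [60, 20, 75, 30]}]}]}'
)
print(" ".join(collect_values(sample)))
```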

<a id="convertapi"></a>
### 2.2 Use the convert document API:

With the Convert API (`convert_document`), you can convert a PDF file into HTML. The output contains all tokens from the file along with other metadata and HTML tags for paragraphs, tables, headers, titles, and page information. You can save this output to an HTML file and open it in any web browser.

The REST API call below converts a PDF file into an HTML file.

<span style="color:blueviolet"><strong>Step 2.2.1</strong> Define the convert_document URL, which is used to extract the text from the document.</span>

In [12]:
# URL of the convert_document REST endpoint
convert_documnet_url ='https://localhost:9443/api/v1/convert_document?'

<span style="color:blueviolet"><strong>Step 2.2.2</strong> Below, pass the file you want to convert in the files dictionary, and pass output_filename as the name of the file that will store the extracted text.
</span>

# PDF DU PII Use Case 

In [13]:
output_filename='sample-data.html'
params ={'output':output_filename}
files = {'file': open('sample-data.pdf', 'rb')}
getDatafromDocument(convert_documnet_url,params,files,output_filename)



200


In [14]:
IFrame(src='./sample-data.html', width=900, height=900)

In [15]:
import re
def remove_extra_lines(data):
    data =re.sub(r'\n\s*\n', '\n', data, flags=re.MULTILINE)
    return data

def pre_processingtext(text_data):
    # Strip HTML tags step by step; each substitution builds on the previous result
    replaced = re.sub("</?html[^>]*>", "", text_data)
    replaced = re.sub("</?p[^>]*>", "", replaced)
    replaced = re.sub("</?div[^>]*>", "", replaced)
    replaced = re.sub("</?a[^>]*>", "", replaced)
    # "h*" matches zero or more "h", so this pattern removes any remaining tag
    replaced = re.sub("</?h*[^>]*>", "", replaced)
    replaced = re.sub("</?em*[^>]*>", "", replaced)
    replaced = re.sub("</?img*[^>]*>", "", replaced)
    # Drop escaped ampersands
    replaced = re.sub("&amp;", "", replaced)
    replaced = re.sub("{*}", "", replaced)
    
    return replaced

def pre_processing_html(html_data):
    final_data1 = pre_processingtext(html_data)
    final_data1 = remove_extra_lines(final_data1)
    return final_data1
def read_text_file(file_name):
    with open(file_name, "r", encoding="latin1") as f:
        text = f.read()
    return text
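
Regular expressions work for this sample but can be fragile on arbitrary HTML. As an alternative technique (not the notebook's method), the sketch below uses the standard library's `html.parser.HTMLParser` to collect the text nodes. The names `TextExtractor` and `html_to_lines` are illustrative. Unlike the regex version, it decodes entities such as `&amp;` to `&` instead of deleting them.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_lines(html_text):
    """Return the visible text fragments of an HTML string as a list."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.parts
```

Calling `html_to_lines(read_text_file("sample-data.html"))` would give a list of text fragments that can be indexed in the same way as `content_array` below, although the exact positions may differ from the regex-based result.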

In [59]:
content = pre_processing_html(read_text_file("sample-data.html"))
content_array = content.split('\\n')
sub_content =""
for i in range(len(content_array)):
#     print("i--",i,"-----",content_array[i])
    if i in [1,2,3,30,31,32,60,61,62]:
#         print(content_array[i])
        sub_content = sub_content+"\n"+content_array[i]

In [72]:
content_str= " ".join(content_array[1:])
print(content_str)

514-14-8905 f 12/22/1944 Amaker Borden Ashley 213-46-8915 f 4/21/1958 Pinson Green Marjorie 524-02-7657 m 3/25/1962 Hall Munsch Jerome 489-36-8350 m 1964/09/06 Porter Aragon Robert 514-30-2668 f 1986/05/27 Nicholson Russell Jacki 505-88-5714 f 1963/09/23 Mcclain Venson Lillian 690-05-5315 m 1969/10/02 Kings Conley Thomas 646-44-9061 M 1978/01/12 Kurtz Jackson Charles 421-37-1396 f 1980/04/09 Linden Davis Susan 461-97-5660 f 1975/01/04 Kingdon Watson Gail 660-03-8360 f 1953/07/11 Onwunli Garrison Lisa 751-01-2327 f 1968/02/16 Simpson Renfro Julie 559-81-1301 m 1952/01/20 Mcafee Heard James 624-84-9181 m 1980/01/16 Frazier Reyes Danny 449-48-3135 m 1982/06/14 Feusier Hall Mark 477-36-0282 m 1961/03/10 Vasquez Mceachern Monte 458-02-6124 m 1955/09/20 Pennebaker Diaz Christopher 044-34-6954 m 1967/05/28 Simpson Lowe Tim 587-03-2682 f 1958/10/24 Dickerson Oyola Lynette 421-90-3440 f 1953/07/17 Kroeger Morrison Adriane 451-80-3526 m 1950/06/09 Parmer Santos Thomas 300-62-3266 m 1965/02/10 Sp

In [60]:
print(sub_content)


514-14-8905 f 12/22/1944 Amaker Borden Ashley
213-46-8915 f 4/21/1958 Pinson Green Marjorie
524-02-7657 m 3/25/1962 Hall Munsch Jerome
10932 Bigge Rd Menlo Park CA 94025 408 496-7223
4469 Sherman Street Goff KS 66428 785-939-6046
309 63rd St. #411 Oakland CA 94618 415 986-7020
jwhite@domain.com m 5270-4267-6450-5516 123 2010/06/25
aborden@domain.com m 5370-4638-8881-3020 713 2011/02/01
mgreen@domain.com v 4916-9766-5240-6147 258 2009/02/25


In [105]:
# df_csv =pd.read_csv("SamplePII/sample-data.csv")
# df_str = df_csv.to_string(index=False)
# df_str

In [29]:
# Load a syntax model to split the text into sentences and tokens
syntax_model = watson_nlp.load(watson_nlp.download('syntax_izumo_en_stock'))
# Load bilstm model in WatsonNLP
bilstm_model = watson_nlp.load(watson_nlp.download('entity-mentions_bilstm_en_pii'))
# Load rbr model in WatsonNLP
rbr_model = watson_nlp.load(watson_nlp.download('entity-mentions_rbr_multi_pii'))
# # sir Load rbr model in WatsonNLP
# sire = watson_nlp.load(watson_nlp.download('entity-mentions_sire_en_stock-wf'))

In [89]:
rbr_result_pii = rbr_model.run(content_str, language_code='en')

#Test Pretrained bilstm_model model in WatsonNLP
syntax_result = syntax_model.run(content_str)
bilstm_result = bilstm_model.run(syntax_result)


result = bilstm_result + rbr_result_pii
pii_type_list = []
for i in result.mentions:
#     print("PII: ", i.span.text.ljust(15, " "), "Type: ", i.type)
    pii_type_dict = {"PII": i.span.text, "Type": i.type,"Source":"PDF"}
    pii_type_list.append(pii_type_dict)
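
Once the mentions are collected, a natural follow-up is to redact them from the text. The sketch below does a simple string replacement using only the `PII` and `Type` keys built above; the function name `mask_pii` and the `[Type]` placeholder format are illustrative choices. It replaces every occurrence of each detected string, which is cruder than using the character offsets of the mentions.

```python
def mask_pii(text, pii_items):
    """Replace each detected PII string in `text` with a [Type] placeholder."""
    # Longest strings first, so a short match cannot break up a longer one
    ordered = sorted(pii_items, key=lambda item: len(item["PII"]), reverse=True)
    for item in ordered:
        if item["PII"]:
            text = text.replace(item["PII"], f"[{item['Type']}]")
    return text


# Made-up example values, not taken from the sample document
example_items = [
    {"PII": "jdoe@example.com", "Type": "EmailAddress", "Source": "PDF"},
    {"PII": "555-0100", "Type": "PhoneNumber", "Source": "PDF"},
]
print(mask_pii("Contact jdoe@example.com or 555-0100", example_items))
```

In the notebook, `mask_pii(content_str, pii_type_list)` would apply the same idea to the converted document text.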



In [91]:
data= pd.DataFrame(pii_type_list)
data.head(20)


Unnamed: 0,PII,Type,Source
0,172-32-1176,NationalNumber.SocialSecurityNumber.US,CSV
1,4/21/1958 Smith White Johnson ...,Location,CSV
2,94025,Location,CSV
3,94025 408 496,PhoneNumber,CSV
4,jwhite@domain.com,EmailAddress,CSV
5,5270-4267-6450-5516,BankAccountNumber.CreditCardNumber.Master,CSV
6,514-14-8905,NationalNumber.SocialSecurityNumber.US,CSV
7,Amaker Borden Ashley,Person,CSV
8,4469 Sherman Street,Location,CSV
9,Goff KS,Location,CSV


In [92]:
data.to_csv("PII_Extracted_PDF.csv",index=False)

Please note that this content is made available by IBM Build Lab to foster Embedded AI technology adoption. The content may include systems & methods pending patent with USPTO and protected under US Patent Laws. For redistribution of this content, IBM will use release process. For any questions please log an issue in the [GitHub](https://github.ibm.com/hcbt/Watson-Document-Understanding). 

Developed by IBM Build Lab 

Copyright - 2023 IBM Corporation 