<a href="https://colab.research.google.com/github/rahiakela/general-utility-notebooks/blob/main/unstructured_data_extraction_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Unstructured - PDF data extraction

## Setup

In [None]:
%%capture
!apt-get install -y poppler-utils
!apt-get install -y tesseract-ocr-all
# Tested with unstructured 0.11.5 and unstructured-inference 0.7.19
!pip install "unstructured[all-docs]" unstructured-inference
!pip install langchain-community
!pip install --upgrade --quiet rapidocr-onnxruntime
!pip install --upgrade --quiet extract_msg

**Restart session in Colab!**

## Import libraries

**You must restart your session before running the next cell.**

In [None]:
from pathlib import Path

In [None]:
# select the partition function
from unstructured.partition.pdf import partition_pdf # version unstructured 0.11.5

In [None]:
# Define parameters for Unstructured's library

## include_page_breaks
# include page breaks (default is False)
include_page_breaks = True

## strategy
# The strategy to use for partitioning the PDF. Valid strategies are "auto", "hi_res", "ocr_only", and "fast" (default is "auto").
# With the "hi_res" strategy, the function uses a layout detection model to identify document elements.
# "hi_res" is the strategy to use for analyzing PDFs and extracting table structure.
strategy = "hi_res"

## infer_table_structure
# Only applicable if `strategy=hi_res`.
# If True, any Table elements that are extracted will also have a metadata field named "text_as_html" where the table's text content is rendered into an html string.
# I.e., rows and cells are preserved.
# Whether True or False, the "text" field is always present in any Table element and is the text content of the table (no structure).

infer_table_structure = strategy == "hi_res"

## extract_element_types
# Get images of tables
extract_element_types = ["Table"] if infer_table_structure else None

## max_characters
# The maximum number of characters to include in a partition (document element)
# If None is passed, no maximum is applied.
# Only applies to the "ocr_only" strategy (default is 1500)
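# Always define max_characters so the partition_pdf call does not raise a NameError with "ocr_only"
max_characters = 1500 if strategy == "ocr_only" else None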

## languages
# The languages to use for the Tesseract agent.
# To use a language, you'll first need to install the appropriate Tesseract language pack.
languages = ["eng"] # for more than one language, list each code, e.g. ["eng", "por"] (default is ["eng"])

## model_name
# @requires_dependencies("unstructured_inference")
# yolox: best model for table extraction. Other options are yolox_quantized, detectron2_onnx and chipper depending on file layout
# source: https://unstructured-io.github.io/unstructured/best_practices/models.html
hi_res_model_name = "yolox"

## PDF file

In [None]:
!wget https://github.com/piegu/language-models/raw/master/docs/Quarterly.Financial.Report.Template.pdf

In [None]:
path = "/content/"
# filename = path + "Quarterly.Financial.Report.Template.pdf"  # the file downloaded by the wget cell above
filename = path + "Sample_30_Fax_1.pdf"  # not downloaded above: upload it to /content/ first

## Get partition in json file

In [None]:
# Returns a List[Element] present in the pages of the parsed pdf document
elements = partition_pdf(
        filename=filename,
        include_page_breaks=include_page_breaks,
        strategy=strategy,
        infer_table_structure=infer_table_structure,
        extract_element_types=extract_element_types,
        max_characters=max_characters,
        languages=languages,
        hi_res_model_name=hi_res_model_name,
        )

# Save the elements to a JSON file next to the PDF
from unstructured.staging.base import elements_to_json
elements_to_json(elements, filename=f"{filename}.json") # The file can take a while to show up in the Colab file browser
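
Before converting the partition to HTML, it can help to check which element types it contains. The following is a minimal sketch, assuming the JSON file is a list of objects that each carry a `"type"` key (the key that `process_json_file` reads). The helper name `count_element_types` and the sample records are hypothetical.

```python
from collections import Counter

def count_element_types(entries):
    """Count how many elements of each type a partition contains."""
    return Counter(entry.get("type", "Unknown") for entry in entries)

# Hypothetical records mimicking the structure of the JSON file
sample = [
    {"type": "Title", "text": "Chennai Regional Physician"},
    {"type": "NarrativeText", "text": "To Whom it may concern,"},
    {"type": "Table", "text": "cell text", "metadata": {"text_as_html": "<table></table>"}},
    {"type": "PageBreak", "text": ""},
]
counts = count_element_types(sample)
print(counts.most_common())
```

With the real file, the same helper can be applied to the result of `json.load` on `f"{filename}.json"`.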

## Get content in html file

In [None]:
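import json
from html import escape

def process_json_file(input_filename):
    # Read the JSON file written by elements_to_json
    with open(input_filename, 'r') as file:
        data = json.load(file)

    # Convert each element to an HTML fragment, skipping consecutive duplicates
    extracted_elements = []
    text_prev = ""
    for entry in data:
        if entry["type"] == "Title":
            text = "<h1>" + escape(entry["text"]) + "</h1>"
        elif entry["type"] == "Table" and "text_as_html" in entry.get("metadata", {}):
            # Tables keep their row/cell structure through the text_as_html metadata field
            text = entry["metadata"]["text_as_html"]
        else:
            # Escape plain text so characters such as "<" or "&" do not break the HTML
            text = "<p>" + escape(entry["text"]) + "</p>"

        if text != text_prev:
            extracted_elements.append(text)
        text_prev = text
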
    # Write the extracted elements to the output file
    html_start = """
    <!DOCTYPE html>
    <html>
    <head>
    <title>Document Information</title>
    <style>
        table {
            width: 100%;
            border-collapse: collapse;
        }
        th, td {
            border: 1px solid black;
            padding: 8px;
            text-align: left;
        }
        th {
            background-color: #f2f2f2;
        }
    </style>
    </head>
    <body>
    """

    html_end = """
    </body>
    </html>
    """

    output_file_html = path + Path(input_filename).name.replace(".json", "") + "_" + hi_res_model_name + ".html"
    with open(output_file_html, 'w') as output_file:
        output_file.write(html_start + "\n")
        for element in extracted_elements:
            output_file.write(element + "\n")
        output_file.write(html_end + "\n")

    return str(output_file_html)
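
The `text_as_html` field used for `Table` elements preserves rows and cells, so it can also be read back into a list of rows without any extra dependency. This is a minimal sketch using the standard library's `html.parser`; the class name `TableRows` and the sample HTML string are hypothetical and not taken from the sample PDF.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell text of each <tr> in an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # start a new row
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")      # start a new cell in the current row
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data     # accumulate text inside the current cell

# Hypothetical text_as_html value
sample_html = "<table><tr><th>Field</th><th>Value</th></tr><tr><td>NPI</td><td>0000000000</td></tr></table>"
parser = TableRows()
parser.feed(sample_html)
rows = parser.rows
print(rows)
```

Nested tables or cells spanning several rows are not handled by this sketch.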

In [None]:
import json
output_file_html = process_json_file(f"{filename}.json") # It can take a while for the .html file to show up in Colab

from google.colab import files
files.download(output_file_html)


In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

file_path = "Sample_30_Fax_1.pdf_yolox.html"

loader = UnstructuredHTMLLoader(file_path)
data = loader.load()

print(data)

[Document(metadata={'source': 'Sample_30_Fax_1.pdf_yolox.html'}, page_content='Chennai Regional Physician\n\n127 Besant Nagar,\n\nChennai,\n\nTN, 44444\n\nChennai\n\n111-22-3333\n\nTo Whom it may concern,\n\nThis Letter is notify TN a new provider Mani Murugan MD, joining our group Chennai regional physicians.\n\nProvider individual NPI number is 9999999999\n\nThe effective date requested is 10/11/2020\n\nThe group name is Chennai regional physician TIN - 62-2222222\n\n1 Raja st, 127 Kamarajar st, PO BoX 1111 Adyar, CH 66666 Adyar, CH 55555 Adyar, CH-1111 P: 222-333-4444 P: 666-888-9999 P: 111-222-3333 F: 333-444-5555 F: 111-000-1010 F: 111-333-4444\n\nThank you,\n\nRaja Krishnan,\n\nCredential specialist,\n\nRajakcredential@secure.com\n\nOffice: (222) 111 2222, Fax: (111) 555 6666')]


In [None]:
print(data[0].page_content)

Chennai Regional Physician

127 Besant Nagar,

Chennai,

TN, 44444

Chennai

111-22-3333

To Whom it may concern,

This Letter is notify TN a new provider Mani Murugan MD, joining our group Chennai regional physicians.

Provider individual NPI number is 9999999999

The effective date requested is 10/11/2020

The group name is Chennai regional physician TIN - 62-2222222

1 Raja st, 127 Kamarajar st, PO BoX 1111 Adyar, CH 66666 Adyar, CH 55555 Adyar, CH-1111 P: 222-333-4444 P: 666-888-9999 P: 111-222-3333 F: 333-444-5555 F: 111-000-1010 F: 111-333-4444

Thank you,

Raja Krishnan,

Credential specialist,

Rajakcredential@secure.com

Office: (222) 111 2222, Fax: (111) 555 6666


## Get content in PDF file

In [None]:
from langchain_community.document_loaders import PyPDFLoader

file_path = "Sample_27.pdf"
loader = PyPDFLoader(file_path, extract_images=True)
pages = loader.load_and_split()

pages[0]

Document(metadata={'source': 'Sample_27.pdf', 'page': 0}, page_content='Provider Enrollment Department\n2000 Health Park Drive\nMadura i, TN 38028\nEmail: HCAPS. PayorRequest @secur e.com\nOctober 08, 2020\nAttn: Credentialing Department/Demographics Updates - TN\nRE: Adding location to provider demographics\nProvider Name: E Mary Mythili , MD\nNPI: 1760445457\nCAQH :11051777\nSpecialty:Orthopaedic  Surgery\nEffective 09/09/2020, E Mary Mythili , MD will be adding the following practice location(s ).\nPlease update your files to ASSOCIATE/ADD this provider to the demographic information listed below .\nPrimary Practice Address:\nLegal Name: Little Mount Specialty Services, LLC\nAddress:1160 E 3900 S Ste 5000\nMadurai, TN, 841241275\nPhone:801 -222-7479\nFax:801 -222-7429\nManager:Practice  Manager')

In [None]:
print(pages[0].page_content)

Provider Enrollment Department
2000 Health Park Drive
Madura i, TN 38028
Email: HCAPS. PayorRequest @secur e.com
October 08, 2020
Attn: Credentialing Department/Demographics Updates - TN
RE: Adding location to provider demographics
Provider Name: E Mary Mythili , MD
NPI: 1760445457
CAQH :11051777
Specialty:Orthopaedic  Surgery
Effective 09/09/2020, E Mary Mythili , MD will be adding the following practice location(s ).
Please update your files to ASSOCIATE/ADD this provider to the demographic information listed below .
Primary Practice Address:
Legal Name: Little Mount Specialty Services, LLC
Address:1160 E 3900 S Ste 5000
Madurai, TN, 841241275
Phone:801 -222-7479
Fax:801 -222-7429
Manager:Practice  Manager


In [None]:
print(pages[1].page_content)

Billing Address:
Legal Name: Little Mount Specialty Services, LLC
Address:PO  Box 100253
Madurai, TN 303840253, Phone :615-373-7600
New Practice Addresses :
Legal Name: Little Mount Specialty Services, LLC
Tax ID:061787666
Address:74 E KIMBALLS LN
STE 330
Draper, TN 840220000
Phone:801 -266-3564 Fax:801 -266-3613
 Manager:Practice  Manager
If you have any additional questions, please contact me at the information below .
Thank you,
Sandy Mike
Provider Enrollment Department
Phone: 615-377-7610
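
Note that `load_and_split()` returns chunks produced by a text splitter, so an index into `pages` is a chunk index and a long page may be spread over several entries. Each chunk keeps the page number in `metadata['page']`, as the output of `pages[0]` shows. The following minimal sketch regroups chunks by page; `text_by_page` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real `Document` chunks, with texts shortened from the output above.

```python
from collections import defaultdict
from types import SimpleNamespace

def text_by_page(chunks):
    """Join chunk texts that share the same metadata['page'], in chunk order."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.metadata["page"]].append(chunk.page_content)
    return {page: "\n".join(parts) for page, parts in sorted(pages.items())}

# Stand-ins for Document chunks (hypothetical split of the first page)
chunks = [
    SimpleNamespace(metadata={"page": 0}, page_content="Provider Enrollment Department"),
    SimpleNamespace(metadata={"page": 0}, page_content="NPI: 1760445457"),
    SimpleNamespace(metadata={"page": 1}, page_content="Billing Address:"),
]
by_page = text_by_page(chunks)
print(by_page[0])
```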


In [None]:
file_path = "Sample_30_Fax_1.pdf"
loader = PyPDFLoader(file_path, extract_images=True)
pages = loader.load_and_split()

In [None]:
print(pages[0].page_content)

Chennai Regional Physician
127 Besant Nagar,
Chennai, TN, 44444
Chennai
111-22-3333
To Whom it may concern,
This Letter is notify TN a new provider Mani Murugan MD, joining our group Chennai
regional physicians.
Provider individual NPl number is 9999999999
The effective dlate requested is 10/11/2020
The group name is Chennai regional physician TIN - 62-2222222
Practice address:
Mailing / credlentialing Address:
Billing Address:
1 Raja st,
127Kamaraiar st,
PO BoX1111
Adyar, CH 66666
Adyar, CH 55555
Adyar, CH - 1111
P: 222-333-4444
P: 666-888-9999
P: 111-222-3333
F:333-444-5555
F: 111-000-1010
F: 111-333-4444
Thank you,
Raja Krishnan,
Credential specialist,
Rajakcredential@secure.com
Office: (222) 111 2222, Fax: (111) 555 6666
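
The same fax has now been read twice, once through the Unstructured HTML route and once with `PyPDFLoader`, and the two texts are not identical (for example "NPI" versus "NPl" and "date" versus "dlate"). A minimal sketch for spotting such differences with the standard library's `difflib` follows; `differing_lines` is a hypothetical helper, and the lines are copied from the two outputs shown above.

```python
import difflib

# A few lines copied from the two outputs shown above
unstructured_lines = [
    "Provider individual NPI number is 9999999999",
    "The effective date requested is 10/11/2020",
    "Thank you,",
]
pypdf_lines = [
    "Provider individual NPl number is 9999999999",
    "The effective dlate requested is 10/11/2020",
    "Thank you,",
]

def differing_lines(a, b):
    """Return (line_from_a, line_from_b) pairs for replaced runs of lines."""
    pairs = []
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs.extend(zip(a[i1:i2], b[j1:j2]))
    return pairs

diffs = differing_lines(unstructured_lines, pypdf_lines)
for left, right in diffs:
    print(f"- {left}\n+ {right}")
```

On full outputs the two loaders also break lines differently, so the texts would need to be aligned (for instance by normalizing whitespace) before a line-by-line comparison is meaningful.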


## Get content in Word file

In [None]:
from langchain_community.document_loaders import UnstructuredWordDocumentLoader

In [None]:
file_path = "Sample_20.docx"
loader = UnstructuredWordDocumentLoader(file_path, mode="elements")
pages = loader.load_and_split()

In [None]:
print(pages)

[Document(metadata={'source': 'Sample_20.docx', 'category_depth': 0, 'filename': 'Sample_20.docx', 'last_modified': '2024-07-12T07:23:03', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'UncategorizedText'}, page_content='1184067779'), Document(metadata={'source': 'Sample_20.docx', 'category_depth': 0, 'emphasized_text_contents': ['PROVIDER NAME:', 'L', 'eon Sweatha', ', PT'], 'emphasized_text_tags': ['b', 'b', 'b', 'b'], 'filename': 'Sample_20.docx', 'last_modified': '2024-07-12T07:23:03', 'languages': ['eng'], 'filetype': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'category': 'Title'}, page_content='PROVIDER NAME: Leon Sweatha, PT'), Document(metadata={'source': 'Sample_20.docx', 'category_depth': 0, 'emphasized_text_contents': ['SPECIALTY:'], 'emphasized_text_tags': ['b'], 'filename': 'Sample_20.docx', 'last_modified': '2024-07-12T07:23:03', 'languages': ['eng'], 'filetype': 'ap
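
In `mode="elements"`, each `Document` carries a `category` in its metadata (`UncategorizedText` and `Title` appear in the output above), which makes it possible to group the text by element type. This is a minimal sketch: `group_by_category` is a hypothetical helper, and `SimpleNamespace` objects stand in for the real `Document` objects, using two values visible in the output above.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_category(docs):
    """Map each metadata['category'] to the page_content strings in that category."""
    grouped = defaultdict(list)
    for doc in docs:
        grouped[doc.metadata.get("category", "Unknown")].append(doc.page_content)
    return dict(grouped)

# Stand-ins for langchain Document objects
docs = [
    SimpleNamespace(metadata={"category": "UncategorizedText"}, page_content="1184067779"),
    SimpleNamespace(metadata={"category": "Title"}, page_content="PROVIDER NAME: Leon Sweatha, PT"),
]
grouped = group_by_category(docs)
print(grouped["Title"])
```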

In [None]:
%pip install --upgrade --quiet  docx2txt



In [None]:
from langchain_community.document_loaders import Docx2txtLoader

loader = Docx2txtLoader("Sample_20.docx")

data = loader.load()

data

[Document(metadata={'source': 'Sample_20.docx'}, page_content='1184067779\n\nABC BASIC PROFILE LOADING INFORMATION\n\nABC BASIC PROFILE LOADING INFORMATION\n\n\n\n\n\nPROVIDER NAME: Leon Sweatha, PT    \n\nSPECIALTY: Physical Therapy\n\nDOB: 08/26/1985\n\nNPI: 8825614598\n\nABC ID: 231456\n\nEFF DATE: 03/10/2020\n\n\n\nVENDOR/GROUP NAME: Mithran Rehab Center\n\nTIN: 111111122\n\nEFF DATE: 03/10/2020\n\nNETWORK: ABC PPO/POS\n\nCLASS: Allied Health\n\nFEE SCHEDULE: P99 ABC Standard Procedure Fee Schedule\n\nDIRECTORY STATUS: Publish\n\nADD TO LOCAL PLUS: N/A – Outside Service Area\n\n\n\nBilling Address\n\nPrimary Address\n\nSecondary Office Address\n\nSecondary Office Address\n\nPlot No.25 Door No.1-60/1\n\n156 Gurukrupanagar 144 - LIG Street No. 8\n\n222 2nd Floor\n\nMahabalipuram, TN 66502\n\nMadurai, TN 66502\n\nTrichy, TN 66502\n\nPoondi, TN 66801\n\n\n\n\n\n\n\nVENDOR/GROUP NAME: Madurai Medical LLC\n\nTIN: 111111123\n\nEFF DATE: 03/10/2020\n\nNETWORK: ABC PPO/POS\n\nCLASS: Allied 

In [None]:
print(data[0].page_content)

1184067779

ABC BASIC PROFILE LOADING INFORMATION

ABC BASIC PROFILE LOADING INFORMATION





PROVIDER NAME: Leon Sweatha, PT    

SPECIALTY: Physical Therapy

DOB: 08/26/1985

NPI: 8825614598

ABC ID: 231456

EFF DATE: 03/10/2020



VENDOR/GROUP NAME: Mithran Rehab Center

TIN: 111111122

EFF DATE: 03/10/2020

NETWORK: ABC PPO/POS

CLASS: Allied Health

FEE SCHEDULE: P99 ABC Standard Procedure Fee Schedule

DIRECTORY STATUS: Publish

ADD TO LOCAL PLUS: N/A – Outside Service Area



Billing Address

Primary Address

Secondary Office Address

Secondary Office Address

Plot No.25 Door No.1-60/1

156 Gurukrupanagar 144 - LIG Street No. 8

222 2nd Floor

Mahabalipuram, TN 66502

Madurai, TN 66502

Trichy, TN 66502

Poondi, TN 66801







VENDOR/GROUP NAME: Madurai Medical LLC

TIN: 111111123

EFF DATE: 03/10/2020

NETWORK: ABC PPO/POS

CLASS: Allied Health

FEE SCHEDULE: P99 ABC Standard Procedure Fee Schedule

DIRECTORY STATUS: Publish

ADD TO LOCAL PLUS: N/A – Outside Service Area



Bi
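
The `Docx2txtLoader` output contains trailing spaces and long runs of blank lines. A minimal clean-up sketch follows, assuming that collapsing every run of blank lines to a single one is acceptable for downstream use; `collapse_blank_lines` is a hypothetical helper, and the sample string is copied from the beginning of the output above.

```python
import re

def collapse_blank_lines(text):
    """Strip trailing spaces on each line and collapse blank-line runs to one."""
    lines = [line.rstrip() for line in text.splitlines()]
    return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()

raw = (
    "1184067779\n\nABC BASIC PROFILE LOADING INFORMATION\n\n\n\n\n\n"
    "PROVIDER NAME: Leon Sweatha, PT    \n\nSPECIALTY: Physical Therapy"
)
cleaned = collapse_blank_lines(raw)
print(cleaned)
```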

## Get content in email

In [None]:
%pip install --upgrade --quiet extract_msg

In [None]:
from langchain_community.document_loaders import OutlookMessageLoader

loader = OutlookMessageLoader("example_data/fake-email.msg")

data = loader.load()

data[0]