# 1 - Normalize Content

This lab material is adapted from the course 'Preprocessing Unstructured Data for LLM Applications', Coursera, March 2024.

In this Jupyter notebook we will learn how to normalize PDF content.

In [1]:
#!pip install unstructured_client    #from old SDK
!pip install unstructured-client     #new SDK


In [2]:
import os, json

import unstructured_client
from unstructured_client.models import shared

## Examine PDF Files

In the **datasci-patient-charts/example_files** directory, double-click each PDF to see what example data the team has to work with.
Closely examine the file **CP_CHRT_C_G4M3BA_De-identified.pdf**, which is the PDF we will work with in this lab.

Processing PDFs is different from processing HTML or ppt files. Those formats are semi-structured, so their markup gives clues about how to divide a document into element types. In PDFs you instead look for visual cues such as formatting. For example:
* text that is bolded or underlined is more likely to be a title.
* text that is longer and blockier, contains multiple sentences, and has no emphasis (e.g. bolding, underlining) is more likely to be narrative text.
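
The two rules of thumb above can be sketched as a toy classifier. This is only an illustration of the idea, not how the unstructured.io model works: the function name `guess_element_type`, the `is_bold` and `is_underlined` flags, the 10-word threshold, and the second sample sentence are all assumptions made up for this example.

```python
def guess_element_type(text: str, is_bold: bool = False, is_underlined: bool = False) -> str:
    """Toy heuristic only; the real model uses far richer layout signals."""
    emphasized = is_bold or is_underlined
    # Rough sentence count: number of sentence-ending punctuation marks.
    sentence_count = sum(text.count(mark) for mark in ".!?")
    word_count = len(text.split())

    # Short, emphasized text is more likely to be a title.
    if emphasized and word_count <= 10 and sentence_count <= 1:
        return "Title"
    # Longer text with several sentences and no emphasis is more likely narrative.
    if sentence_count >= 2 and not emphasized:
        return "NarrativeText"
    return "UncategorizedText"


print(guess_element_type("SURGICAL HISTORY", is_bold=True))
print(guess_element_type("This is the first sentence. This is the second one."))
```

The first call is classified as a title and the second as narrative text; anything that fits neither rule falls through to a catch-all type.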

Let's take the example medical file (CP_CHRT_C_G4M3BA_De-identified.pdf) and pass it to the Unstructured API, where the unstructured.io model is set up.
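
Because `os.getenv` returns `None` when a variable is not set, it may help to confirm that both settings are present before creating the client. The sketch below is an optional check, not part of the original lab; the helper name `missing_env_vars` is hypothetical, while the two variable names are the ones used in the next cell.

```python
import os


def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]


missing = missing_env_vars(["UNSTRUCTURED_API_KEY", "UNSTRUCTURED_API_URL"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("Both settings are present.")
```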

In [12]:
#Use your stored keys from the healthcare workbench
client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

filename = "example_files/CP_CHRT_C_G4M3BA_De-identified.pdf"
filename_out = "example_files_out/json_elements.json"

# Read the PDF into memory so the file handle is closed before the request is sent.
with open(filename, "rb") as f:
    file_content = f.read()

req = {
    "partition_parameters": {
        "files": {
            "content": file_content,
            "file_name": filename,
        },
        "strategy": shared.Strategy.HI_RES,
        "languages": ['eng'],
        "split_pdf_page": True,            # If True, the PDF is split into smaller chunks of pages.
        "split_pdf_allow_failed": True,    # If True, partitioning continues even if some pages fail.
        "split_pdf_concurrency_level": 15  # Number of concurrent requests; 15 is the maximum value.
    }
}

try:
    res = client.general.partition(request=req)
    element_dicts = [element for element in res.elements]

    # Print the processed data's first element only.
    print(element_dicts[0])

    # Write the processed data to a local file, creating the output directory if needed.
    os.makedirs(os.path.dirname(filename_out), exist_ok=True)
    json_elements = json.dumps(element_dicts, indent=2)

    with open(filename_out, "w") as file:
        file.write(json_elements)
except Exception as e:
    print(e)

INFO: HTTP Request: POST https://api.unstructured.io/general/v0/general "HTTP/1.1 200 OK"


{'type': 'Image', 'element_id': 'df79fa92715475e38b3320154fa85207', 'text': 'PAST MEDICAL HISTORY ', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'CP_CHRT_C_G4M3BA_De-identified.pdf'}}



Examine the original PDF example_files/**CP_CHRT_C_G4M3BA_De-identified.pdf** and the resulting **json_elements.json** file.

Compare the element types (and associated text) in the JSON output. Do they match the element types (e.g. Title, Narrative text, List items) that you can see in the original PDF?
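
One way to make this comparison easier is to tally how many elements of each type were returned. The sketch below assumes the elements are plain dictionaries with a 'type' key, as in the printed output above. The helper name `count_element_types` is hypothetical, and every sample entry except the first is placeholder data invented for illustration.

```python
from collections import Counter


def count_element_types(elements):
    """Tally elements by their 'type' field; elements without one count as 'Unknown'."""
    return Counter(element.get("type", "Unknown") for element in elements)


sample_elements = [
    {"type": "Image", "text": "PAST MEDICAL HISTORY "},                # first element from the output above
    {"type": "Title", "text": "Placeholder heading"},                  # hypothetical
    {"type": "NarrativeText", "text": "Placeholder narrative text."},  # hypothetical
    {"type": "Title", "text": "Another placeholder heading"},          # hypothetical
]

for element_type, count in count_element_types(sample_elements).most_common():
    print(element_type, count)
```

In the notebook, the same function could be applied to `element_dicts`, or to the list loaded from the saved JSON file with `json.load`.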

Note that you can visually identify 'SURGICAL HISTORY' as a title. It would get the same normalized type as a title serialized from a ppt or HTML file.

Also note the 'element_id' that is created and associated with each element.

Finally, look at the 'metadata' field. If you expand it, you will see the 'filename' and the 'page_number' of the document that this structured data was produced from.
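
As a small illustration of reading these fields programmatically, the sketch below flattens one element into a single record. The helper name `summarize_element` is hypothetical; the sample element is the one printed earlier in this notebook.

```python
def summarize_element(element):
    """Flatten one element dict into a record with its id, type, text, and source location."""
    metadata = element.get("metadata", {})
    return {
        "element_id": element.get("element_id"),
        "type": element.get("type"),
        "text": element.get("text", "").strip(),  # drop trailing whitespace
        "filename": metadata.get("filename"),
        "page_number": metadata.get("page_number"),
    }


first_element = {
    "type": "Image",
    "element_id": "df79fa92715475e38b3320154fa85207",
    "text": "PAST MEDICAL HISTORY ",
    "metadata": {
        "filetype": "application/pdf",
        "languages": ["eng"],
        "page_number": 1,
        "filename": "CP_CHRT_C_G4M3BA_De-identified.pdf",
    },
}

print(summarize_element(first_element))
```

Using `.get` with defaults keeps the helper from raising an error if an element lacks a metadata entry.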

Now that we have obtained the document elements and their metadata, we are ready to perform metadata extraction and chunking.