# 1 - Normalize Content

This Lab material is an adaptation of the course 'Preprocessing Unstructured Data for LLM Applications', Coursera, March 2024

In this Jupyter notebook we will learn how to normalize PDF content.

In [4]:
#!pip install unstructured_client
!pip install unstructured-client


In [5]:
import os, json

import unstructured_client
from unstructured_client.models import shared

## Examine PDF Files

In the **datasci-patient-charts/example_files** directory, double-click each PDF to see what example data the team has to work with.
Closely examine the file **CP_CHRT_C_G4M3BA_De-identified.pdf**; this is the PDF we will work with in this lab.

Processing PDFs is different from processing HTML or PowerPoint (ppt) files. In those formats, the semi-structured information in the document gives clues about how to divide it into element types. In PDFs, you instead look for visual cues such as formatting. For example:
* text that is bold or underlined is more likely to be a title.
* text that is longer and blockier, contains multiple sentences, and has no emphasis (e.g. bolding, underlining) is more likely to be narrative text.
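
The two heuristics above can be sketched as a toy classifier. This is only an illustration of the idea, not how the unstructured.io model actually works: the function name `guess_element_type`, its parameters, the two-sentence threshold, and the sample narrative sentence are all assumptions made up for this example.

```python
import re


def guess_element_type(text: str, is_bold: bool = False, is_underlined: bool = False) -> str:
    """Guess an element type from simple formatting cues (hypothetical helper)."""
    # Count sentence-ending punctuation as a rough proxy for "multiple sentences".
    sentence_count = len(re.findall(r"[.!?](?:\s|$)", text.strip()))
    has_emphasis = is_bold or is_underlined

    if has_emphasis and sentence_count <= 1:
        return "Title"
    if not has_emphasis and sentence_count >= 2:
        return "NarrativeText"
    # Anything the two heuristics do not cover stays uncategorized.
    return "UncategorizedText"


# Emphasized, short text looks like a title.
print(guess_element_type("SURGICAL HISTORY", is_bold=True))
# Made-up multi-sentence text without emphasis looks like narrative text.
print(guess_element_type("The patient reports mild pain. No fever was noted."))
```

A real partitioning model combines many more signals (layout, font size, position on the page), so treat this only as a mental model for why formatting matters in PDFs.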

Let's take the example medical file (CP_CHRT_C_G4M3BA_De-identified.pdf) and pass it to the Unstructured API, where the unstructured.io model is set up.

In [6]:
# The client reads its credentials from environment variables, so set
# UNSTRUCTURED_API_KEY and UNSTRUCTURED_API_URL
# (e.g. https://api.unstructured.io/general/v0/general) before running this cell.
# Avoid hard-coding the API key in the notebook.
client = unstructured_client.UnstructuredClient(
    api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"),
    server_url=os.getenv("UNSTRUCTURED_API_URL"),
)

#filename = "PATH_TO_INPUT_FILE"
filename = "example_files/CP_CHRT_C_G4M3BA_De-identified.pdf"

req = {
    "partition_parameters": {
        "files": {
            "content": open(filename, "rb"),
            "file_name": filename,
        },
        "strategy": shared.Strategy.HI_RES,
        "languages": ['eng'],
        "split_pdf_page": True,             # If True, split the PDF into smaller chunks of pages before sending.
        "split_pdf_allow_failed": True,     # If True, partitioning continues even if some pages fail.
        "split_pdf_concurrency_level": 15,  # Number of concurrent requests, set here to the maximum value: 15.
    }
}

try:
    res = client.general.partition(request=req)
    element_dicts = [element for element in res.elements]

    # Print the processed data's first element only.
    print(element_dicts[0])

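    # Write the processed data to a local file.
    # Replace PATH_TO_OUTPUT_FILE with your own output path; the text below
    # refers to the result as json/elements.json.
    json_elements = json.dumps(element_dicts, indent=2)

    with open("PATH_TO_OUTPUT_FILE", "w") as file:
        file.write(json_elements)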
except Exception as e:
    print(e)

INFO: HTTP Request: POST https://api.unstructured.io/general/v0/general "HTTP/1.1 200 OK"


{'type': 'Image', 'element_id': 'df79fa92715475e38b3320154fa85207', 'text': 'PAST MEDICAL HISTORY ', 'metadata': {'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'filename': 'CP_CHRT_C_G4M3BA_De-identified.pdf'}}



Examine the original PDF example_files/**CP_CHRT_C_G4M3BA_De-identified.pdf** and the resulting json/**elements.json** file.

Compare the element types (and their associated text) in the JSON output. Do they match the element types (e.g. Title, Narrative text, List items) you can see in the original PDF?
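
One way to make this comparison easier is to tally the types and print each type next to its text. The sketch below is a minimal example that uses only the first element from the output shown above; in the lab you would pass the full `element_dicts` list (or the contents of the saved JSON file) instead. The helper name `count_types` is an assumption introduced for this example.

```python
from collections import Counter

# First element copied from the API output shown earlier in this notebook.
elements = [
    {
        "type": "Image",
        "element_id": "df79fa92715475e38b3320154fa85207",
        "text": "PAST MEDICAL HISTORY ",
        "metadata": {
            "filetype": "application/pdf",
            "languages": ["eng"],
            "page_number": 1,
            "filename": "CP_CHRT_C_G4M3BA_De-identified.pdf",
        },
    },
]


def count_types(elements):
    """Return how many elements of each type are present (hypothetical helper)."""
    return Counter(el["type"] for el in elements)


type_counts = count_types(elements)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")

# Show each type beside its text so it can be checked against the PDF by eye.
for el in elements:
    print(f"{el['type']:<15} {el['text'].strip()}")
```

Note that this first element came back with type `Image` even though its text reads like a heading, which is exactly the kind of mismatch worth looking for.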

Note that you can visually identify 'SURGICAL HISTORY' as a title. It would get the same normalized type as a title serialized from a ppt or HTML file.

Also note the 'element_id' that is created and associated with each element.

Finally, look at the 'metadata' field. If you expand it, you will see the 'filename' and the 'page_number' of the document that this structured data was produced from.
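
As a small sketch of how the metadata can be used, the snippet below groups element IDs by source file and page, reading the `filename` and `page_number` keys that appear in the output shown earlier. The helper name `group_ids_by_page` is an assumption introduced for this example, and only the first element from that output is used as sample input.

```python
from collections import defaultdict

# First element copied from the API output shown earlier in this notebook.
elements = [
    {
        "type": "Image",
        "element_id": "df79fa92715475e38b3320154fa85207",
        "text": "PAST MEDICAL HISTORY ",
        "metadata": {
            "filetype": "application/pdf",
            "languages": ["eng"],
            "page_number": 1,
            "filename": "CP_CHRT_C_G4M3BA_De-identified.pdf",
        },
    },
]


def group_ids_by_page(elements):
    """Map (filename, page_number) to the element IDs found there (hypothetical helper)."""
    pages = defaultdict(list)
    for el in elements:
        # Use .get so elements with missing metadata do not raise a KeyError.
        meta = el.get("metadata", {})
        key = (meta.get("filename"), meta.get("page_number"))
        pages[key].append(el["element_id"])
    return dict(pages)


ids_by_page = group_ids_by_page(elements)
for (filename, page_number), ids in ids_by_page.items():
    print(f"{filename}, page {page_number}: {len(ids)} element(s)")
```

Keeping the element ID together with its file name and page number makes it possible to trace any piece of extracted text back to its place in the original document.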

Now that we were able to obtain document elements and metadata, we are ready to perform metadata extraction and chunking.