# 1 - Normalize Content

This lab material is adapted from the course 'Preprocessing Unstructured Data for LLM Applications' (Coursera, March 2024).

In this Jupyter notebook we will learn how to normalize PDF content.

In [None]:
# Install unstructured libraries

!pip install unstructured_client
!pip install unstructured

In [None]:
# Warning control
import warnings
warnings.filterwarnings('ignore')

In [None]:
from IPython.display import JSON

import json

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured.partition.html import partition_html
#from unstructured.partition.pptx import partition_pptx
from unstructured.staging.base import dict_to_elements, elements_to_json

In [None]:
#Use the DLAI_API_KEY & DLAI_API_URL that you obtained from Unstructured.io

#DLAI_API_KEY = 'your DLAI_API_KEY'
#DLAI_API_URL = 'your DLAI_API_URL'

#Example
DLAI_API_KEY = 'tkp3I9iABLDbcJvfgGvnELB4Y2usgn'
DLAI_API_URL = 'https://naaissa-62qdjqlm.api.unstructuredapp.io/'

s = UnstructuredClient(
    api_key_auth=DLAI_API_KEY,
    server_url=DLAI_API_URL,
)
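
As an alternative to pasting the key and URL into the notebook, the two values can be read from environment variables. The following is a minimal sketch, assuming you have exported variables named `DLAI_API_KEY` and `DLAI_API_URL` before starting Jupyter; the helper `load_credentials` is a hypothetical name introduced here for illustration and is not part of the course material or the SDK.

```python
import os


def load_credentials(env=os.environ):
    """Return (api_key, api_url) from a mapping of environment variables.

    Hypothetical helper: the variable names mirror the ones used in this
    notebook and are an assumption about how you stored your credentials.
    """
    api_key = env.get("DLAI_API_KEY")
    api_url = env.get("DLAI_API_URL")
    if not api_key or not api_url:
        raise RuntimeError("Set DLAI_API_KEY and DLAI_API_URL before running this cell.")
    return api_key, api_url


# Usage (uncomment once the variables are set):
# DLAI_API_KEY, DLAI_API_URL = load_credentials()
```

The returned values can then be passed to `UnstructuredClient` exactly as in the cell above.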

## Examine PDF Files

In the **datasci-patient-charts/example_files** directory, double-click each PDF to see what example data the team has to work with.
Closely examine the file CP_CHRT_C_G4M3BA_De-identified.pdf; this is the PDF we will work with in this lab.

Processing PDFs is different from processing HTML or PPTX files. Those documents are semi-structured, so you can use their structure for clues on how to divide the content into element types. In PDFs you instead look for visual cues such as formatting. For example:
* Text that is bold or underlined is more likely to be a title.
* Text that is longer and blockier, contains multiple sentences, and has no emphasis (e.g. bolding, underlining) is more likely to be narrative text.
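
To make the two rules of thumb above concrete, here is a toy sketch of a formatting-based classifier. It is only an illustration of the idea, not how the unstructured.io model works; the function name `guess_element_type` and the `is_emphasized` flag are assumptions made up for this example.

```python
import re


def guess_element_type(text: str, is_emphasized: bool) -> str:
    """Toy heuristic based on emphasis and sentence count (illustrative only)."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if is_emphasized and len(sentences) <= 1:
        return "Title"  # short, bold or underlined text
    if not is_emphasized and len(sentences) >= 2:
        return "NarrativeText"  # longer, multi-sentence, unemphasized text
    return "UncategorizedText"  # the heuristic has no opinion


print(guess_element_type("Patient History", is_emphasized=True))
print(guess_element_type(
    "The patient reported mild pain. Symptoms began two days ago.",
    is_emphasized=False,
))
```

A real partitioning model weighs many more signals (layout, font size, position on the page), so treat this only as a mental model for the labels you will see in the API output.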

Let's pass the file named above to the Unstructured API, where the unstructured.io model is set up.

In [None]:
# Process the PDF. Note that this may take a minute or so.

filename = "example_files/CP_CHRT_C_G4M3BA_De-identified.pdf"
with open(filename, "rb") as f:
    files=shared.Files(
        content=f.read(), 
        file_name=filename,
    )

req = shared.PartitionParameters(
    files=files,
    strategy='hi_res',
    pdf_infer_table_structure=True,
    languages=["eng"],
)
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements[:3], indent=2))
except SDKError as e:
    print(e)

Take a look at the processed PDF output above. Check whether the text labelled 'Title' or 'NarrativeText' is accurate.

Next, let's explore the JSON using the IPython JSON display function.

In [None]:
JSON(json.dumps(resp.elements, indent=2))

Compare the text types above (and their associated text). Do they match the text types (e.g. Title, Narrative text, List items) within the original PDF?
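
When comparing against the original PDF, a tally of how many elements of each type were returned can help. The sketch below assumes each element is a dict with a `"type"` key, as in the JSON shown above; `sample_elements` is made-up data standing in for `resp.elements`, and `count_element_types` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical stand-in for resp.elements; the values are invented.
sample_elements = [
    {"type": "Title", "element_id": "a1", "text": "Chart Summary"},
    {"type": "NarrativeText", "element_id": "b2", "text": "Patient seen today. No acute distress."},
    {"type": "ListItem", "element_id": "c3", "text": "Follow up in two weeks"},
    {"type": "NarrativeText", "element_id": "d4", "text": "Vitals stable. Labs pending."},
]


def count_element_types(elements):
    """Count how many elements carry each 'type' label."""
    return Counter(el["type"] for el in elements)


type_counts = count_element_types(sample_elements)
for element_type, count in type_counts.most_common():
    print(f"{element_type}: {count}")
```

In the notebook you would call `count_element_types(resp.elements)` instead of passing the sample list.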

Also note the 'element_id' that is created for and associated with each element.

Finally, look at the 'metadata' associated with each element. The metadata reveals the file type, page, and file that each element was extracted from.
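
The metadata can also be summarized programmatically. This is a minimal sketch, assuming each element dict has a `"metadata"` dict with `filename`, `filetype`, and `page_number` keys as seen in the JSON output; the sample values are invented and `summarize_sources` is a hypothetical helper name.

```python
# Hypothetical stand-in for resp.elements; the values are invented.
sample_elements = [
    {"type": "Title", "metadata": {"filename": "chart.pdf", "filetype": "application/pdf", "page_number": 1}},
    {"type": "NarrativeText", "metadata": {"filename": "chart.pdf", "filetype": "application/pdf", "page_number": 1}},
    {"type": "ListItem", "metadata": {"filename": "chart.pdf", "filetype": "application/pdf", "page_number": 2}},
]


def summarize_sources(elements):
    """Collect the distinct filenames, file types, and pages found in element metadata."""
    filenames, filetypes, pages = set(), set(), set()
    for el in elements:
        meta = el.get("metadata", {})
        # Keys may be missing for some elements, so only add values that exist.
        if "filename" in meta:
            filenames.add(meta["filename"])
        if "filetype" in meta:
            filetypes.add(meta["filetype"])
        if "page_number" in meta:
            pages.add(meta["page_number"])
    return {
        "filenames": sorted(filenames),
        "filetypes": sorted(filetypes),
        "pages": sorted(pages),
    }


summary = summarize_sources(sample_elements)
print(summary)
```

In the notebook, `summarize_sources(resp.elements)` would give the same overview for the processed patient chart.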