# Better PDF processing with LLMs

In [24]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv



## Initializing LLMWhisperer

In [4]:
import os

LLMWHISPERER_API_KEY = os.getenv("LLMWHISPERER_API_KEY")
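
`os.getenv` returns `None` when the variable is not set, so a missing key is easier to diagnose if the notebook stops at this point. A minimal sketch of that check follows; `require_env` is a hypothetical helper, not part of the LLMWhisperer client.

```python
import os


def require_env(name):
    """Return the value of an environment variable or fail with a clear message.

    `require_env` is a hypothetical helper introduced for illustration.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this helper, the assignment above could be written as `LLMWHISPERER_API_KEY = require_env("LLMWHISPERER_API_KEY")`.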

In [5]:
from unstract.llmwhisperer.client import LLMWhispererClient

llm_whisperer_client = LLMWhispererClient(
    base_url="https://llmwhisperer-api.unstract.com/v1", api_key=LLMWHISPERER_API_KEY
)

2024-08-22 08:34:25,563 - unstract.llmwhisperer.client - DEBUG - logging_level set to DEBUG
2024-08-22 08:34:25,564 - unstract.llmwhisperer.client - DEBUG - base_url set to https://llmwhisperer-api.unstract.com/v1
2024-08-22 08:34:25,564 - unstract.llmwhisperer.client - DEBUG - api_key set to e7dbxxxxxxxxxxxxxxxxxxxxxxxxxxxx


## Processing the Document

In [20]:
response = llm_whisperer_client.whisper(file_path="billofsale.pdf")
whisper_hash = response["whisper_hash"]

print("Status:", response["status_code"])
print("Hash:", whisper_hash)

2024-08-22 08:59:36,476 - unstract.llmwhisperer.client - DEBUG - whisper called
2024-08-22 08:59:36,476 - unstract.llmwhisperer.client - DEBUG - api_url: https://llmwhisperer-api.unstract.com/v1/whisper
2024-08-22 08:59:36,477 - unstract.llmwhisperer.client - DEBUG - params: {'url': '', 'processing_mode': 'ocr', 'output_mode': 'line-printer', 'page_seperator': '<<<', 'force_text_processing': False, 'pages_to_extract': '', 'timeout': 200, 'store_metadata_for_highlighting': False, 'median_filter_size': 0, 'gaussian_blur_radius': 0, 'ocr_provider': 'advanced', 'line_splitter_tolerance': 0.4, 'horizontal_stretch_factor': 1.0}


Status: 200
Hash: bc3d5fbc|d5a80b735b076cfa184e1c3b0fb86897


In [21]:
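import time

# Fall back to whatever the initial response returned, if anything.
text = response.get("extracted_text")

# Poll until the service reports that extraction has finished.
while True:
    status = llm_whisperer_client.whisper_status(whisper_hash=whisper_hash)

    if status["status"] == "processed":
        # Extraction finished: fetch the text and stop polling.
        text = llm_whisperer_client.whisper_retrieve(whisper_hash=whisper_hash)[
            "extracted_text"
        ]
        break
    elif status["status"] != "processing":
        # Neither processed nor processing: stop rather than loop forever.
        print("Unexpected status:", status["status"])
        break

    time.sleep(2)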

2024-08-22 08:59:44,352 - unstract.llmwhisperer.client - DEBUG - whisper_status called
2024-08-22 08:59:44,352 - unstract.llmwhisperer.client - DEBUG - url: https://llmwhisperer-api.unstract.com/v1/whisper-status
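
The polling loop above waits indefinitely if the status never leaves `processing`. A minimal sketch of the same loop with a deadline is shown below. The function name `wait_for_text` and the parameters `timeout_s` and `poll_s` are hypothetical; the client methods and the `processed` / `processing` status values are the ones already used in this notebook.

```python
import time


def wait_for_text(client, whisper_hash, timeout_s=120.0, poll_s=2.0):
    """Poll until the document is processed, then return its extracted text.

    `wait_for_text`, `timeout_s` and `poll_s` are hypothetical names; the
    client calls mirror the ones used in the cell above.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.whisper_status(whisper_hash=whisper_hash)["status"]
        if status == "processed":
            # Extraction finished: fetch and return the text.
            return client.whisper_retrieve(whisper_hash=whisper_hash)["extracted_text"]
        if status != "processing":
            # Any status other than the two known ones is treated as a failure.
            raise RuntimeError(f"Unexpected status: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still processing after {timeout_s} seconds")
        time.sleep(poll_s)
```

It could then be called as `text = wait_for_text(llm_whisperer_client, whisper_hash)`. Raising on an unexpected status, instead of silently breaking out of the loop, is a design choice of this sketch.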


In [22]:
print(text)




       Al, Spalding              Bros.        SPALDING           PLEASE REMIT TO SPALDING SALES CORP. 
        SION OF SPALDING SALES CORPORATION     MARK 


                                                                         #1 
                                                             STORE NO.           FOLIO C 


   FAMOUS FOR ATHLETIC EQUIPMENT 
                                                                           INVOICE NO. S 2812 


                                                                           CUSTOMER'S 
 Sold    To           DATE      6/1/39                Ship To              ORDER NO. 


            BKLYN EAGLES B B CLUB                                 DELD TO DIRK LUNDY 
            EMANLEY - 
 ADDRESS                                            ADDRESS 
            101 MONTGOMERY STREET 
 TOWN       NEWARK, N.J.      STATE                  TOWN                        STATE 
  TERMS: 
 2% CASH TO DAYS-NET 30 DAYS-                       VIA 


  

## Initializing OpenAI's Client

In [27]:
from openai import OpenAI

client = OpenAI()

## Processing the PDF Directly

In [28]:
instructions = (
    "You are a sales assistant. Answer questions about the supplied bill of sale."
)
question = "How many bats were ordered?"

assistant = client.beta.assistants.create(
    name="Sales assistant",
    instructions=instructions,
    model="gpt-4o",
    tools=[{"type": "file_search"}],
)

# Upload the PDF; the context manager closes the file handle afterwards.
with open("billofsale.pdf", "rb") as pdf_file:
    message_file = client.files.create(file=pdf_file, purpose="assistants")

thread = client.beta.threads.create(
    messages=[
        {
            "role": "user",
            "content": question,
            "attachments": [
                {"file_id": message_file.id, "tools": [{"type": "file_search"}]}
            ],
        }
    ]
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id, assistant_id=assistant.id
)
messages = list(client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id))
print(messages[0].content[0].text.value)

The bill of sale indicates that a total of 125 bats were ordered【4:0†billofsale.pdf】.


## Processing the Extracted Text

In [26]:
prompt = """
Look at the following bill of sale and answer the following question:

Question: How many bats were ordered?

Bill of sale:
"""

completion = client.chat.completions.create(
    model="gpt-4o", messages=[{"role": "user", "content": prompt + text}], stream=False
)

print(completion.choices[0].message.content)

The total number of bats ordered, as per the bill of sale, is calculated by summing up the quantities for each line item. Here are the quantities of bats ordered:

- 125 Bats: 9 ordered
- 120 Bats: 1 ordered
- 200 Bats: 6 ordered
- 130 Bats: 2 ordered

Adding these quantities together, we get:

9 + 1 + 6 + 2 = 18

Therefore, 18 bats were ordered.
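
The arithmetic in the answer above can be checked mechanically. The sketch below copies the four line items from the model's reply and sums the quantities. It only verifies the addition, not whether the model read the bill of sale correctly; `ordered_quantity` is a hypothetical helper.

```python
import re

# Line items copied from the model's answer above.
answer_lines = [
    "- 125 Bats: 9 ordered",
    "- 120 Bats: 1 ordered",
    "- 200 Bats: 6 ordered",
    "- 130 Bats: 2 ordered",
]


def ordered_quantity(line):
    """Return the integer between the colon and the word 'ordered'."""
    match = re.search(r":\s*(\d+)\s+ordered", line)
    if match is None:
        raise ValueError(f"No quantity found in: {line!r}")
    return int(match.group(1))


quantities = [ordered_quantity(line) for line in answer_lines]
total = sum(quantities)
print(quantities, total)  # [9, 1, 6, 2] 18
```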
