# Better PDF processing with LLMs

In [1]:
%load_ext autoreload
%autoreload 2
%load_ext dotenv
%dotenv

In [2]:
from openai import OpenAI

client = OpenAI()

## Setting Things Up

Let's create a function that turns a document into text using LLMWhisperer:

In [3]:
import time
import os
from unstract.llmwhisperer.client import LLMWhispererClient

LLMWHISPERER_API_KEY = os.getenv("LLMWHISPERER_API_KEY")

llm_whisperer_client = LLMWhispererClient(
    base_url="https://llmwhisperer-api.unstract.com/v1", api_key=LLMWHISPERER_API_KEY
)


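def preprocess_document(file_path):
    """Extract the text of a document with LLMWhisperer, polling until it is ready."""
    response = llm_whisperer_client.whisper(file_path=file_path)
    whisper_hash = response["whisper_hash"]

    print("Status:", response["status_code"])
    print("Hash:", whisper_hash)

    # The initial response may already include the text; keep it as a fallback.
    text = response.get("extracted_text")

    while True:
        status = llm_whisperer_client.whisper_status(whisper_hash=whisper_hash)

        if status["status"] == "processed":
            # Extraction finished: fetch the final text.
            text = llm_whisperer_client.whisper_retrieve(whisper_hash=whisper_hash)[
                "extracted_text"
            ]
            break
        elif status["status"] != "processing":
            # Any other status: stop polling and return whatever we already have.
            break

        time.sleep(2)

    return text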

2024-08-24 09:52:45,962 - unstract.llmwhisperer.client - DEBUG - logging_level set to DEBUG
2024-08-24 09:52:45,962 - unstract.llmwhisperer.client - DEBUG - base_url set to https://llmwhisperer-api.unstract.com/v1
2024-08-24 09:52:45,962 - unstract.llmwhisperer.client - DEBUG - api_key set to e7dbxxxxxxxxxxxxxxxxxxxxxxxxxxxx
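
The polling loop in `preprocess_document` keeps going for as long as the status stays `"processing"`. A minimal sketch of a bounded version follows; `poll_until` and its parameters are hypothetical names introduced here, not part of the LLMWhisperer client.

```python
import time


def poll_until(get_status, pending="processing", interval=2.0, timeout=120.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it returns something other than `pending`.

    Returns the first non-pending status. Raises TimeoutError if the status
    is still pending once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status != pending:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"still {pending!r} after {timeout} seconds")
        sleep(interval)
```

Inside `preprocess_document`, this could be called as `poll_until(lambda: llm_whisperer_client.whisper_status(whisper_hash=whisper_hash)["status"])`, with the text retrieved only when the returned status is `"processed"`.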


Let's create a function that uses OpenAI's Assistants API to answer questions from a document:

In [4]:
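def answer_from_document(instructions, question, file_path):
    # Create an assistant that can search the files attached to a message.
    assistant = client.beta.assistants.create(
        name="Assistant",
        instructions=instructions,
        model="gpt-4o",
        tools=[{"type": "file_search"}],
    )

    # Upload the document, closing the file handle once the upload is done.
    with open(file_path, "rb") as document:
        message_file = client.files.create(file=document, purpose="assistants")

    # Start a thread whose first message carries the question and the document.
    thread = client.beta.threads.create(
        messages=[
            {
                "role": "user",
                "content": question,
                "attachments": [
                    {"file_id": message_file.id, "tools": [{"type": "file_search"}]}
                ],
            }
        ]
    )

    # Run the assistant on the thread and wait for it to finish.
    run = client.beta.threads.runs.create_and_poll(
        thread_id=thread.id, assistant_id=assistant.id
    )

    # Return the text of the first message produced by this run.
    messages = list(
        client.beta.threads.messages.list(thread_id=thread.id, run_id=run.id)
    )
    return messages[0].content[0].text.value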
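
Each call to `answer_from_document` creates a new assistant, uploaded file, and thread, and leaves them on the account. A possible cleanup step is sketched below. `cleanup_run` is a hypothetical helper; it assumes `answer_from_document` is changed to expose the three ids, and it takes the client as an argument so it can be tried out without network access.

```python
def cleanup_run(client, assistant_id, file_id, thread_id):
    """Delete the per-question resources; return the ids that could not be deleted."""
    failed = []
    steps = [
        (client.beta.threads.delete, thread_id),
        (client.beta.assistants.delete, assistant_id),
        (client.files.delete, file_id),
    ]
    for delete, resource_id in steps:
        try:
            delete(resource_id)
        except Exception:
            # Keep going so that one failure does not strand the other resources.
            failed.append(resource_id)
    return failed
```

With the objects from the function above, the call would look like `cleanup_run(client, assistant.id, message_file.id, thread.id)`.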

Now let's create a function that uses OpenAI's Chat Completions API to answer a question:

In [5]:
def answer(question):
    completion = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": question}], stream=False
    )

    return completion.choices[0].message.content

## GPT-4o + Bill of Sale PDF

In [12]:
response = answer_from_document(
    instructions="You are a sales assistant. Answer questions about the supplied bill of sale.",
    question="How many bats were ordered?",
    file_path="bill-of-sale.pdf",
)
print(response)

A total of 125 bats were ordered【4:0†bill-of-sale.pdf】.
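
The answer ends with a citation marker added by the file search tool. A small sketch for stripping such markers before display follows. It assumes every marker has the `【…†…】` shape seen in the output above; `strip_citations` is a hypothetical helper name.

```python
import re

# Matches markers shaped like 【4:0†bill-of-sale.pdf】 (assumed format).
CITATION = re.compile(r"【[^】]*†[^】]*】")


def strip_citations(answer_text):
    """Remove file-search citation markers from an answer string."""
    return CITATION.sub("", answer_text)


print(strip_citations("A total of 125 bats were ordered【4:0†bill-of-sale.pdf】."))
# prints: A total of 125 bats were ordered.
```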


## GPT-4o + Extracted Text From The Bill of Sale

In [13]:
text = preprocess_document("bill-of-sale.pdf")
print(text)

2024-08-24 10:06:46,808 - unstract.llmwhisperer.client - DEBUG - whisper called
2024-08-24 10:06:46,808 - unstract.llmwhisperer.client - DEBUG - api_url: https://llmwhisperer-api.unstract.com/v1/whisper
2024-08-24 10:06:46,808 - unstract.llmwhisperer.client - DEBUG - params: {'url': '', 'processing_mode': 'ocr', 'output_mode': 'line-printer', 'page_seperator': '<<<', 'force_text_processing': False, 'pages_to_extract': '', 'timeout': 200, 'store_metadata_for_highlighting': False, 'median_filter_size': 0, 'gaussian_blur_radius': 0, 'ocr_provider': 'advanced', 'line_splitter_tolerance': 0.4, 'horizontal_stretch_factor': 1.0}
2024-08-24 10:06:53,012 - unstract.llmwhisperer.client - DEBUG - whisper_status called
2024-08-24 10:06:53,012 - unstract.llmwhisperer.client - DEBUG - url: https://llmwhisperer-api.unstract.com/v1/whisper-status


Status: 200
Hash: c96b3ffe|d5a80b735b076cfa184e1c3b0fb86897



       Al, Spalding              Bros.        SPALDING           PLEASE REMIT TO SPALDING SALES CORP. 
        SION OF SPALDING SALES CORPORATION     MARK 


                                                                         #1 
                                                             STORE NO.           FOLIO C 


   FAMOUS FOR ATHLETIC EQUIPMENT 
                                                                           INVOICE NO. S 2812 


                                                                           CUSTOMER'S 
 Sold    To           DATE      6/1/39                Ship To              ORDER NO. 


            BKLYN EAGLES B B CLUB                                 DELD TO DIRK LUNDY 
            EMANLEY - 
 ADDRESS                                            ADDRESS 
            101 MONTGOMERY STREET 
 TOWN       NEWARK, N.J.      STATE                  TOWN                        STATE 
  TERMS: 
 

In [14]:
prompt = """
Look at the following bill of sale and answer the following question:

Question: How many bats were ordered?

Bill of sale:
"""

response = answer(prompt + text)
print(response)

The bill of sale indicates that the following quantities of bats were ordered:

- Item 125: 9 bats
- Item 120: 1 bat
- Item 200: 6 bats
- Item 130: 2 bats

Adding these quantities:

9 (Item 125) + 1 (Item 120) + 6 (Item 200) + 2 (Item 130) = 18 bats

Therefore, 18 bats were ordered.
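
The prompt in the cell above is built by concatenating a fixed template with the extracted text. A sketch of a reusable helper for that pattern is shown here; `build_prompt` and its parameter names are hypothetical, chosen to reproduce the same layout.

```python
def build_prompt(document_kind, question, document_text):
    """Assemble a question-answering prompt around the extracted document text."""
    return (
        f"\nLook at the following {document_kind} and answer the following question:\n\n"
        f"Question: {question}\n\n"
        f"{document_kind.capitalize()}:\n"
        f"{document_text}"
    )
```

A call such as `answer(build_prompt("bill of sale", "How many bats were ordered?", text))` would then replace the manual concatenation.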


## GPT-4o + Loan Application PDF

In [9]:
response = answer_from_document(
    instructions="You are a loan application assistant. Answer questions about the supplied loan application.",
    question="What's the full address of the applicant?",
    file_path="loan-application.pdf",
)
print(response)

I wasn't able to locate the applicant’s full address using the search function. Please provide the exact section or type of document where this information may be found, or alternatively, I can attempt a detailed manual review.


## GPT-4o + Extracted Text From The Loan Application

In [10]:
text = preprocess_document("loan-application.pdf")
print(text)

2024-08-24 09:53:34,570 - unstract.llmwhisperer.client - DEBUG - whisper called
2024-08-24 09:53:34,571 - unstract.llmwhisperer.client - DEBUG - api_url: https://llmwhisperer-api.unstract.com/v1/whisper
2024-08-24 09:53:34,571 - unstract.llmwhisperer.client - DEBUG - params: {'url': '', 'processing_mode': 'ocr', 'output_mode': 'line-printer', 'page_seperator': '<<<', 'force_text_processing': False, 'pages_to_extract': '', 'timeout': 200, 'store_metadata_for_highlighting': False, 'median_filter_size': 0, 'gaussian_blur_radius': 0, 'ocr_provider': 'advanced', 'line_splitter_tolerance': 0.4, 'horizontal_stretch_factor': 1.0}
2024-08-24 09:53:50,390 - unstract.llmwhisperer.client - DEBUG - whisper_status called
2024-08-24 09:53:50,391 - unstract.llmwhisperer.client - DEBUG - url: https://llmwhisperer-api.unstract.com/v1/whisper-status


Status: 200
Hash: 4fea52d2|a50523a76fbcf5cb5d802aadf86b4574



 To be completed by the Lender: 
 Lender Loan No./Universal Loan Identifier                                                         Agency Case No. 


Uniform Residential Loan Application 
Verify and complete the information on this application. If you are applying for this loan with others, each additional Borrower must provide 
information as directed by your Lender. 


Section 1: Borrower Information. This section asks about your personal information and your income from 
employment and other sources, such as retirement, that you want considered to qualify for this loan. 


 1a. Personal Information 
Name (First, Middle, Last, Suffix)                                             Social Security Number 175-678-910 
  IMA          CARDHOLDER                                                      (or Individual Taxpayer Identification Number) 
Alternate Names - List any names by which you are known or any names           Date 

In [11]:
prompt = """
Look at the following loan application and answer the following question:

Question: What's the full address of the applicant?

Loan application:
"""

response = answer(prompt + text)
print(response)

The full address of the applicant, as listed in the loan application, is:

1024 Sullivan Street, Los Angeles, CA 90210, USA
