### The Problem

How do we reliably extract all text-based requirements from a very long and unstructured PDF? How do we generalize this to multiple PDFs?

### The chosen approach

For each section of the PDF:
 1. parse that section's pages into text
 2. get a list of requirements and their sources in the PDF
 3. verify that those sources really exist in the input PDF

### Motivation & Evals

In order for this problem to be considered solved, we have to make sure:
1. we only include requirements that exist in the PDF
2. we extract all requirements from the PDF

To check for requirement 1, the evals take the source returned alongside each requirement, go back into the PDF, and confirm that the source text actually exists there. More layers could be added to this (e.g. checking that the source is in the correct section), but I found that this works pretty well without these layers, at least for this PDF.
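As a rough illustration of the source check described above, here is a minimal sketch. It is not the `verify_sources_in_pdf` implementation from `evals`; the helper names `normalize` and `source_in_text` are hypothetical, and it assumes a source counts as found when it appears in the parsed text after collapsing whitespace and ignoring case.

```python
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace to one space, strip, and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def source_in_text(source: str, section_text: str) -> bool:
    """Return True if the quoted source appears in the parsed section text.

    Hypothetical helper: matching ignores case and whitespace differences,
    because PDF extraction often inserts line breaks mid-sentence.
    An empty source never counts as a match.
    """
    needle = normalize(source)
    return bool(needle) and needle in normalize(section_text)


# Made-up parsed text with the kind of line breaks a PDF parser produces.
parsed = "Compliance is necessary\nfor all   required items."
print(source_in_text("Compliance is necessary for all required items.", parsed))
print(source_in_text("Vendors must ship within 24 hours.", parsed))
```

Normalizing both sides keeps the check strict about wording while tolerating layout noise; a stricter or fuzzier match may be needed for other PDFs.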

The evals don't currently include a check for requirement 2. One possible approach is to take each section and its extracted requirements and ask a model to verify that no requirements were missed. Intuitively, this should provide decent baseline performance. If not, another approach is to remove every source we did find from the parsed PDF text and inspect the remaining text for missing requirements.
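The second idea, removing matched sources and inspecting what is left, could look something like the sketch below. This is not part of the current evals; `remaining_text` is a hypothetical helper, and it assumes sources match the parsed text exactly once whitespace is collapsed.

```python
import re
from typing import List


def remaining_text(section_text: str, sources: List[str]) -> str:
    """Remove every matched source from the section text and return the rest.

    Hypothetical helper: whitespace is collapsed on both sides first, so
    line breaks introduced by PDF parsing do not prevent a match.
    """
    remaining = re.sub(r"\s+", " ", section_text).strip()
    for source in sources:
        needle = re.sub(r"\s+", " ", source).strip()
        if needle:
            remaining = remaining.replace(needle, " ")
    # Collapse the gaps left behind by the removed sources.
    return re.sub(r"\s+", " ", remaining).strip()


# Made-up section text and a single extracted source.
section = "Vendors must label every carton. Labels are 4x6 inches.\nContact us with questions."
leftover = remaining_text(section, ["Vendors must label every carton."])
print(leftover)
```

The leftover text could then be reviewed by hand or passed to a model with a prompt asking whether it contains any requirement that was not extracted.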

Many more layers of evals can be added, both with and without LLMs, based on how more PDFs behave under this pipeline. Since PDF parsing happens only once per PDF, not each time a user uses the app, cost and latency are probably not a large concern when adding more LLM calls, as long as those calls significantly increase output accuracy and product quality.

### The code

In [12]:
import os
import json
import anthropic

os.environ['ANTHROPIC_API_KEY'] = 'API_KEY_HERE'

PDF_FILEPATH = "./DicksRoutingGuide.pdf"
TABLE_OF_CONTENTS_PATH = "./table_of_contents.json"
SECTION_REQUIREMENTS_FILE_FULL = "./section_requirements_full.json"
SECTION_REQUIREMENTS_FILE_HALF = "./section_requirements_half.json"

The table of contents below is loaded from a predetermined file to keep the demo simple. In practice, this could be automated with either an LLM call (more generalizable, easier to implement, less reliable) or some in-depth parsing logic (less generalizable, harder to implement, more reliable). Again, since this parsing happens once per document rather than every time a user uses the app, having multiple LLM calls to set up this data (or future data, such as maps of where images are located) probably doesn't incur too much cost or latency.
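For the parsing-logic option, a minimal sketch might look like the following. It assumes the table-of-contents page parses into lines such as `SECTION 1 INTRODUCTION ........ 11`, which is a guess about the layout and not something verified against this PDF; `parse_table_of_contents` and `TOC_LINE` are hypothetical names. It also assumes each section ends on the page where the next one starts, which over-includes pages when a section actually ends earlier.

```python
import re

# Assumed line shape: "SECTION <n> <TITLE> ..... <page>"
TOC_LINE = re.compile(r"^(SECTION\s+\d+\s+.+?)\s*\.{2,}\s*(\d+)\s*$")


def parse_table_of_contents(toc_text: str, last_page: int) -> dict:
    """Build {title: {"start": page, "end": page}} from parsed TOC text.

    Hypothetical helper: a section's end page is taken to be the next
    section's start page, and the final section ends at `last_page`.
    """
    entries = []
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line.strip())
        if match:
            entries.append((match.group(1), int(match.group(2))))

    toc = {}
    for i, (title, start) in enumerate(entries):
        end = entries[i + 1][1] if i + 1 < len(entries) else last_page
        toc[title] = {"start": start, "end": end}
    return toc


# Made-up TOC text in the assumed layout.
sample = """
SECTION 1 INTRODUCTION ........ 11
SECTION 2 CONFIDENTIAL INFORMATION POLICY ........ 12
"""
parsed_toc = parse_table_of_contents(sample, last_page=12)
print(parsed_toc)
```

The output has the same shape as the JSON file loaded below, so it could be swapped in without changing the rest of the pipeline.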

In [13]:
# load table of contents

with open(TABLE_OF_CONTENTS_PATH, 'r') as file:
    table_of_contents = json.load(file)
    
table_of_contents

{'SECTION 1 INTRODUCTION': {'start': 11, 'end': 12},
 'SECTION 2 CONFIDENTIAL INFORMATION POLICY': {'start': 12, 'end': 12},
 'SECTION 3 VENDOR INDEMNIFICATION AND INSURANCE': {'start': 12, 'end': 12},
 'SECTION 4 PRODUCT DATA AND ATTRIBUTES': {'start': 12, 'end': 18},
 'SECTION 5 VALUE ADDED SERVICES (VAS) REQUIREMENTS': {'start': 20, 'end': 26},
 'SECTION 6 PURCHASE ORDERS': {'start': 27, 'end': 30},
 'SECTION 7 EDI': {'start': 30, 'end': 33},
 'SECTION 8 NON-EDI': {'start': 33, 'end': 34},
 'SECTION 9 GENERAL PACKING AND SHIPPING': {'start': 34, 'end': 38},
 'SECTION 10 CARTON LABELING': {'start': 39, 'end': 44},
 'SECTION 11 SHIPPING LABEL PLACEMENT': {'start': 44, 'end': 48},
 'SECTION 12 VENDOR CERTIFICATION': {'start': 48, 'end': 48},
 'SECTION 13 ROUTING (TMS TRAINING)': {'start': 49, 'end': 50},
 'SECTION 14 DOMESTIC TRANSPORTATION': {'start': 51, 'end': 58},
 'SECTION 15 DIRECT IMPORT COLLECT TRANSPORTATION': {'start': 58, 'end': 64},
 'SECTION 16 DAMAGE/DEFECTIVE RTV PROGRAM

Next, we want to query the LLM to get requirements and sources for each section. For demo purposes, I've included two pre-run samples: SECTION_REQUIREMENTS_FILE_HALF and SECTION_REQUIREMENTS_FILE_FULL. If the selected file does not exist, the code below queries the LLM and writes a new JSON file of section requirements.

In [14]:
from parsing import extract_text_from_pdf, extract_json
from query import query_llm

if os.path.exists(SECTION_REQUIREMENTS_FILE_HALF):
    with open(SECTION_REQUIREMENTS_FILE_HALF, 'r') as f:
        section_requirements = json.load(f)
else:
    print("File does not exist. Rerunning LLM pipeline...")
    parsed_sections = {}

    for section, pages in table_of_contents.items():
        parsed_sections[section] = extract_text_from_pdf(PDF_FILEPATH, pages["start"], pages["end"])

    # Uncomment to see how long each section's text is; this matters for the LLM's context length
    # for content in parsed_sections.values():
    #     print(len(content))
    
    section_requirements = {}
    for section, text in parsed_sections.items():
        print(f"Processing {section}...")
        result = query_llm(text, anthropic.Anthropic())
        section_requirements[section] = extract_json(result)
    
    with open(SECTION_REQUIREMENTS_FILE_HALF, 'w') as f:
        json.dump(section_requirements, f, indent=4)

for section, requirements in section_requirements.items():
    print(section)
    print(json.dumps(requirements, indent=4))

SECTION 1 INTRODUCTION
{
    "requirements": [
        {
            "requirement": "Vendors must abide by the routing guide as stated in the Dick's Sporting Goods Vendor Agreement.",
            "source": "As stated in the Dick's Sporting Goods Vendor Agreement (\"Vendor Agreement\"), it is our expectation that vendors will abide by this routing guide."
        },
        {
            "requirement": "Unless identified as recommended or requested, a section in the routing guide is considered to be required.",
            "source": "Please note that unless identified as recommended or requested a section is considered to be required."
        },
        {
            "requirement": "Compliance is necessary for all required items in the routing guide.",
            "source": "Compliance is necessary for all required items."
        },
        {
            "requirement": "Recommended or requested items in the routing guide can become required at any time as Dick's Sporting Goods finds i

Now that we have the requirements and sources, we want to verify that the sources were not hallucinated and actually exist in the PDF.

In [15]:
from evals import verify_sources_in_pdf, calculate_percentage_passed, print_negatives
result = verify_sources_in_pdf(section_requirements, table_of_contents, PDF_FILEPATH)

percent_passed = calculate_percentage_passed(result)
print(f"Percentage of requirements that had a valid source: {percent_passed:.2%}\n")

# If not all the sources passed, print the ones that failed
if percent_passed < 1.0:
    print_negatives(result)

Percentage of requirements that had a valid source: 100.00%

