# Project Outlining

This project needs to cover the following steps:
1. Parse the PDF into a searchable format (markdown?)
2. Search for and store numbers
3. For each number, check whether the surrounding words are modifiers
4. If a modifier is found, convert it to a multiplier value
5. Combine the multiplier with the initial number
6. Find the max value
7. Report the max value (value, page number it was found on, whether it was modified, and the original context)

Which portions of this pipeline would make sense to define as a class?
- Text data. The class would load the PDF, parse it, and split the text by page.
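
A minimal sketch of what such a class could look like. The method names match the ones called later in this notebook (`extract_text`, `get_page_count`, `get_page_text`, `get_all_text`), but the bodies are assumptions, not the actual contents of `pdf_processor.py`. It relies on the same `pymupdf.open` and `pymupdf4llm.to_markdown` calls used in the manual parsing cells, and the `separator` argument is hypothetical.

```python
class PDFProcessor:
    """Hypothetical sketch: load a PDF and keep its markdown text page by page."""

    def __init__(self, pdf_path):
        self.pdf_path = pdf_path
        self.pages = []  # one markdown string per page, filled by extract_text()

    def extract_text(self):
        # Imported lazily so the class can be defined (and exercised with
        # pre-filled pages) on machines without the PDF libraries installed.
        import pymupdf
        import pymupdf4llm

        document = pymupdf.open(self.pdf_path)
        page_count = len(document)
        document.close()

        # Convert one page at a time so each entry maps to exactly one page.
        self.pages = [
            pymupdf4llm.to_markdown(self.pdf_path, pages=[page_num])
            for page_num in range(page_count)
        ]
        return self.pages

    def get_page_count(self):
        return len(self.pages)

    def get_page_text(self, page_num):
        # page_num is a 0-based index into the extracted pages.
        return self.pages[page_num]

    def get_all_text(self, separator="\n"):
        # Assumption: pages are joined with a newline; the real class may differ.
        return separator.join(self.pages)
```

Keeping the pages in a plain list means later steps can work either per page or on the joined text without re-reading the PDF.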

# Parsing the PDF 

Notes about the PDF:
- Values often preceded by '$'
- High values often have a decimal point and a modifier
- Must be able to parse tables
    - Tables have modifiers in the headers as well! 
    - pymupdf4llm would be a good choice to ensure I can extract tables 
- First ~20 pages are the summary. I suspect the highest number will be found there
- It's reasonable to assume all modifiers are on the same page as the number, so I should go page-by-page
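
Going page by page also keeps the page index attached to every number, which the final report needs. A rough sketch of that idea, using a simplified number pattern and a hypothetical helper name (`largest_per_page`); neither is part of the project code:

```python
import re
from decimal import Decimal

# Simplified pattern for illustration only: digits with optional thousands
# separators and an optional decimal part.
SIMPLE_NUMBER = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def largest_per_page(pages):
    """Return {page_index: (value, original_text)} for the largest number on each page."""
    best = {}
    for page_index, text in enumerate(pages):
        for match in SIMPLE_NUMBER.finditer(text):
            value = Decimal(match.group(0).replace(",", ""))
            if page_index not in best or value > best[page_index][0]:
                best[page_index] = (value, match.group(0))
    return best


sample_pages = ["Total 1,250.5 and 300", "No digits here", "Costs of $42"]
print(largest_per_page(sample_pages))
```

Pages without any number simply do not appear in the result, so the caller can tell "no numbers" apart from "largest number is 0".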


## Parse the PDF manually

In [61]:

import pymupdf
import pymupdf4llm
import random

In [62]:
# Path to the PDF file (it is opened in the next cell)
pdf_name = "FY25 Air Force Working Capital Fund.pdf"

In [63]:
# Open the document so we can see the page numbers
document = pymupdf.open(pdf_name)
page_count = len(document)

print(f"Total pages: {page_count}")

Total pages: 114


In [64]:
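# Extract markdown page by page (page numbers are 0-based)
pages = []
for page_num in range(page_count):
    # Convert a single page to markdown, reusing the already-open document
    # instead of re-opening the file by name on every iteration
    markdown = pymupdf4llm.to_markdown(document, pages=[page_num])
    pages.append(markdown)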

In [65]:
print("\n--- Preview of first page (markdown) ---")
first_page_text = pages[0]
print(first_page_text)


--- Preview of first page (markdown) ---
# ***UNITED STATES *** ***AIR FORCE *** ***WORKING CAPITAL FUND *** ***(Appropriation: 4930) ***
## ***Fiscal Year (FY) 2025 *** ***Budget Estimates *** ***January 2024***


-----




In [66]:
# Pick 3 random page indices (0-based, skipping the cover page at index 0) and show how they look
random_pages = random.sample(range(1, page_count), min(3, page_count-1))

for page_num in random_pages:
    print(f"\n--- Preview of page {page_num} text ---")
    page_text = pages[page_num]
    print(page_text)


--- Preview of page 79 text ---
###### Fiscal Year (FY) 2025 Budget Estimates February 2024

###### Fund 11 (Dollars in Millions) **United States Transportation Command**

###### Source of New Orders and Revenue Air Force Working Capital Fund Transportation Working Capital Fund (TWCF)

###### **FY2023 FY2024 FY2025** **1. New Orders 1. New Orders** **  a. Orders From DOD Components:  a. Orders From DOD Components: 8,510.7 7,282.2 8,066.8** **  Total Air Force  Total Air Force 4,441.8 3,838.1 4,271.1** **   Military Personnel    Military Personnel - AF 108.7 95.0 129.4** **   Aircraft Procurement    Aircraft Procurement .8 .1 .4** **   Missile Procurement    Missile Procurement .0 .0 .0** **   Other Procurement    Other Procurement 13.3 12.4 14.1** **   Operations & Maintenance    Operations & Maintenance - AF 4,024.1 3,371.8 3,754.8** **   Operations & Maintenance - ANG    Operations & Maintenance - ANG 3.5 3.5 3.8** **   Operations and Maintenance - AFRES AFRES 274.6 339.5 350.0** **

## Move code to class, check that class behaves as expected

In [67]:
from pdf_processor import PDFProcessor
processor = PDFProcessor(pdf_name)
# extract text
pages = processor.extract_text()

In [68]:
# print total pages
total_pages = processor.get_page_count()
print(f"Total pages: {total_pages}")

# print the first page
print("\n--- Preview of first page (markdown) ---")
first_page_text = processor.get_page_text(0)
print(first_page_text)


Total pages: 114

--- Preview of first page (markdown) ---
# ***UNITED STATES *** ***AIR FORCE *** ***WORKING CAPITAL FUND *** ***(Appropriation: 4930) ***
## ***Fiscal Year (FY) 2025 *** ***Budget Estimates *** ***January 2024***


-----




In [69]:
# Use the same random pages as earlier to preview the text
for page_num in random_pages:
    print(f"\n--- Preview of page {page_num} text ---")
    page_text = processor.get_page_text(page_num)
    print(page_text)
    


--- Preview of page 79 text ---
###### Fiscal Year (FY) 2025 Budget Estimates February 2024

###### Fund 11 (Dollars in Millions) **United States Transportation Command**

###### Source of New Orders and Revenue Air Force Working Capital Fund Transportation Working Capital Fund (TWCF)

###### **FY2023 FY2024 FY2025** **1. New Orders 1. New Orders** **  a. Orders From DOD Components:  a. Orders From DOD Components: 8,510.7 7,282.2 8,066.8** **  Total Air Force  Total Air Force 4,441.8 3,838.1 4,271.1** **   Military Personnel    Military Personnel - AF 108.7 95.0 129.4** **   Aircraft Procurement    Aircraft Procurement .8 .1 .4** **   Missile Procurement    Missile Procurement .0 .0 .0** **   Other Procurement    Other Procurement 13.3 12.4 14.1** **   Operations & Maintenance    Operations & Maintenance - AF 4,024.1 3,371.8 3,754.8** **   Operations & Maintenance - ANG    Operations & Maintenance - ANG 3.5 3.5 3.8** **   Operations and Maintenance - AFRES AFRES 274.6 339.5 350.0** **

# Extract numbers with no context

In [70]:
import re
from decimal import Decimal

In [71]:
# Regex pattern for numbers with an optional minus sign, thousands separators,
# decimals (including a leading '.'), and scientific notation
number_pattern = r'(-?(?:(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|\.\d+)(?:[eE][-+]?\d+)?)'

found_numbers = []

# Find all matches
for match in re.finditer(number_pattern, processor.get_all_text()):
    original = match.group(0)
    start_pos = match.start()
    
    # Convert to Decimal (remove commas first)
    clean_num = original.replace(',', '')
    
    decimal_value = Decimal(clean_num)
    found_numbers.append((decimal_value, original, start_pos))
    

In [72]:
# print how many numbers were found
print(f"Found {len(found_numbers)} numbers")
# print the first 5 numbers
for num, original, start_pos in found_numbers[:5]:
    print(f"Number: {num}, Original: {original}, Start Position: {start_pos}")
# print the max number
max_num = max(found_numbers, key=lambda x: x[0])
print(f"Max number: {max_num[0]}, Original: {max_num[1]}, Start Position: {max_num[2]}")

Found 7089 numbers
Number: 4930, Original: 4930, Start Position: 87
Number: 2025, Original: 2025, Start Position: 120
Number: 2024, Original: 2024, Start Position: 164
Number: 2025, Original: 2025, Start Position: 299
Number: 0.1, Original: .1, Start Position: 525
Max number: 6000000, Original: 6,000,000, Start Position: 161361


## Create .py file called 'number_extraction', move code there, and test

In [73]:
import number_extraction as ne

# Now analyze the numbers
found_numbers = ne.extract_numbers(processor.get_all_text())

# print how many numbers were found
print(f"Found {len(found_numbers)} numbers")
# print the first 5 numbers
for num, original, start_pos in found_numbers[:5]:
    print(f"Number: {num}, Original: {original}, Start Position: {start_pos}")
# print the max number
max_num = max(found_numbers, key=lambda x: x[0])
print(f"Max number: {max_num[0]}, Original: {max_num[1]}, Start Position: {max_num[2]}")

Found 7089 numbers
Number: 4930, Original: 4930, Start Position: 87
Number: 2025, Original: 2025, Start Position: 120
Number: 2024, Original: 2024, Start Position: 164
Number: 2025, Original: 2025, Start Position: 299
Number: 0.1, Original: .1, Start Position: 525
Max number: 6000000, Original: 6,000,000, Start Position: 161361
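
For reference, a sketch of how `extract_numbers` could wrap the cell above. The regex is copied from that cell; the function body is an assumption about `number_extraction.py`, not its actual contents. The sample string also shows two edge cases worth knowing about: a leading-dot value such as `.8`, and a hyphenated range, where the hyphen is read as a minus sign.

```python
import re
from decimal import Decimal

NUMBER_PATTERN = re.compile(
    r'(-?(?:(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|\.\d+)(?:[eE][-+]?\d+)?)'
)


def extract_numbers(text):
    """Return a (Decimal value, original text, start position) tuple per match."""
    found = []
    for match in NUMBER_PATTERN.finditer(text):
        original = match.group(0)
        # Decimal cannot parse thousands separators, so strip the commas first.
        value = Decimal(original.replace(',', ''))
        found.append((value, original, match.start()))
    return found


sample = "between $250,000 and $6,000,000; .8 .1; FY2023-2025"
for value, original, start in extract_numbers(sample):
    print(value, repr(original), start)
```

The `-2025` read out of `FY2023-2025` cannot win a max search, but it would matter if the same function were reused to find a minimum.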


# Find Relevant Modifiers from the Context

There are two types of context to detect: a modifier placed near the number in a block of text, and a modifier found in a table heading or title.


Start by finding the context near the numbers.

In [74]:
# Define a function that takes the position of a number and extracts the context around it
def get_context(text, position, window=50):
    start = max(0, position - window)
    end = min(len(text), position + window)
    return text[start:end]

In [75]:
# print the context of the max number
context = get_context(processor.get_all_text(), max_num[2])
print(f"Context around max number: {context}")

Context around max number: e smaller in scale (costing between $250,000 and $6,000,000) and are designed, scheduled, and constr
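
The fixed window cuts words in half at both ends ("e smaller", "constr"). That is harmless for substring checks, but a whole-word match on a multiplier could miss a word that straddles the edge. One possible variant, sketched here under the hypothetical name `get_context_whole_words`, widens the slice outward to the nearest whitespace:

```python
def get_context_whole_words(text, position, window=50):
    """Like get_context, but widen the slice outward to the nearest whitespace."""
    start = max(0, position - window)
    end = min(len(text), position + window)
    # Move start left until it sits at the beginning of a word.
    while start > 0 and not text[start - 1].isspace():
        start -= 1
    # Move end right until it sits just past the end of a word.
    while end < len(text) and not text[end].isspace():
        end += 1
    return text[start:end]


sample = "Projects are smaller in scale (costing between $250,000 and $6,000,000) and are designed"
position = sample.index("6,000,000")
print(get_context_whole_words(sample, position, window=20))
```

The returned slice can be slightly longer than `2 * window` characters, which is usually an acceptable trade for never splitting a word.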


Now check whether the number is in a table.

In [76]:
# first need to find the tables in the text
def detect_table_boundaries(text):
    """
    Identify all tables in a text and their precise boundaries
    
    Args:
        text (str): The text to analyze
        
    Returns:
        list: List of dictionaries with table info including boundaries
    """
    tables = []
    lines = text.split('\n')
    
    # Look for table patterns
    table_start = None
    current_table = None
    
    for i, line in enumerate(lines):
        # Table rows typically have pipe characters or consistent spacing
        if '|' in line or re.search(r'\s{2,}', line):
            # Check if we're starting a new table
            if table_start is None:
                table_start = i

                current_table = {
                    'start_line': i,
                    'title': None,
                    'headers': [],
                    'end_line': None
                }
                
                # Look for a title in the (up to) four lines above the table;
                # the stop value of -1 lets the search reach line 0
                for j in range(i-1, max(-1, i-5), -1):
                    if lines[j].strip() and '|' not in lines[j]:
                        current_table['title'] = lines[j].strip()
                        break
            
            # If the second line has separator markers, the first line is likely headers
            # (this also applies to a table that starts on line 0)
            if i == table_start + 1 and re.search(r'[-=|]+', line):
                current_table['headers'] = [h.strip() for h in re.split(r'\|', lines[table_start]) if h.strip()]
        
        # Any line that is not table-like (including an empty line) ends the current table.
        # Reaching this branch already means the line has no '|' and no run of 2+ spaces.
        elif table_start is not None:
            current_table['end_line'] = i - 1
            tables.append(current_table)
            table_start = None
            current_table = None
    
    # If we ended in a table, add it
    if current_table is not None:
        current_table['end_line'] = len(lines) - 1
        tables.append(current_table)
    
    return tables

In [77]:
tables = detect_table_boundaries(processor.get_all_text())
print(f"Found {len(tables)} tables")


Found 131 tables


In [78]:
# find if the number is in a table
def get_table_for_position(tables, position, text):
    """
    Find which table (if any) contains the given position
    
    Args:
        tables (list): List of table dictionaries
        position (int): Position in text
        text (str): Full text
        
    Returns:
        dict or None: Table containing the position, or None
    """
    lines = text.split('\n')
    
    # Find which line contains our position
    current_pos = 0
    current_line = 0
    
    for i, line in enumerate(lines):
        if current_pos + len(line) >= position:
            current_line = i
            break
        current_pos += len(line) + 1  # +1 for the newline
    
    # Check which table contains this line
    for table in tables:
        if table['start_line'] <= current_line <= table['end_line']:
            return table
    
    return None
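
The line lookup in the loop above can also be expressed by counting the newlines that come before the position. A small sketch with a hypothetical helper name; for positions inside the text it should give the same line index as the loop:

```python
def line_index_for_position(text, position):
    """Return the 0-based line number that contains character offset `position`."""
    # Every newline before `position` pushes the offset one line further down.
    return text.count("\n", 0, position)


sample = "title\n| a | b |\n| 1 | 2 |"
print(line_index_for_position(sample, sample.index("2")))
```

This avoids splitting the whole document into lines on every call, which matters when the lookup runs once per extracted number.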

In [79]:
table = get_table_for_position(tables, max_num[2], processor.get_all_text())
print(f"Number found in table: {table['title'] if table else None}")

Number found in table: ###### Fiscal Year (FY) 2025 Budget Estimates February 2024
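
The context printed earlier for this number reads like running prose, and the reported "title" is a page header, which suggests this hit may be a false positive of the two-or-more-spaces rule. A stricter alternative, sketched below with hypothetical names, only accepts markdown pipe tables (a header row followed by a `|---|---|` separator row). It only helps where pymupdf4llm actually emitted pipe tables; the page previews above show some tables flattened into a single line, which this sketch would not catch.

```python
import re

# A markdown separator row such as |---|---| or |:---|---:|
SEPARATOR_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def find_pipe_tables(text):
    """Return (start_line, end_line, headers) for each header + separator + rows block."""
    lines = text.split("\n")
    tables = []
    i = 0
    while i < len(lines) - 1:
        is_header = "|" in lines[i]
        # Require a pipe in the separator so a plain '-----' rule is not accepted.
        is_separator = "|" in lines[i + 1] and SEPARATOR_ROW.match(lines[i + 1])
        if is_header and is_separator:
            end = i + 1
            # Extend the table over every following line that still has a pipe.
            while end + 1 < len(lines) and "|" in lines[end + 1]:
                end += 1
            headers = [cell.strip() for cell in lines[i].strip().strip("|").split("|")]
            tables.append((i, end, headers))
            i = end + 1
        else:
            i += 1
    return tables


sample = (
    "Intro  with  double spaces\n\n"
    "| Item | FY 2025 |\n|---|---|\n| Revenue | 30,704.1 |\n\n"
    "Closing prose"
)
print(find_pipe_tables(sample))
```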


# Find multiplier in context (title/headers of table or near the number)

In [80]:
# define common multipliers, notations, and phrases to help identify multipliers 
multipliers = {
    'million': 1_000_000,
    'millions': 1_000_000,
    'billion': 1_000_000_000,
    'billions': 1_000_000_000,
    'trillion': 1_000_000_000_000,
    'trillions': 1_000_000_000_000,
    'thousands': 1_000,
    'thousand': 1_000
}
    
notation_patterns = [
    (r'\(\$M\)', 1_000_000),       # ($M) - millions
    (r'\(M\$\)', 1_000_000),       # (M$) - millions
    (r'\(\$B\)', 1_000_000_000),   # ($B) - billions
    (r'\(B\$\)', 1_000_000_000),   # (B$) - billions
    (r'\(\$K\)', 1_000),           # ($K) - thousands
    (r'\(K\$\)', 1_000),           # (K$) - thousands
    (r'\$ *M', 1_000_000),         # $M or $ M - millions
    (r'M\$', 1_000_000),           # M$ - millions
    (r'\$ *B', 1_000_000_000),     # $B or $ B - billions
    (r'B\$', 1_000_000_000),       # B$ - billions
    (r'\$ *K', 1_000),             # $K or $ K - thousands
    (r'K\$', 1_000)                # K$ - thousands
]

phrases = [
    ('dollars in millions', 1_000_000),
    ('(dollars in millions)', 1_000_000),
    ('$ in millions', 1_000_000),
    ('($ in millions)', 1_000_000),
    ('in millions', 1_000_000),
    ('(in millions)', 1_000_000),
    ('dollars in billions', 1_000_000_000),
    ('(dollars in billions)', 1_000_000_000),
    ('$ in billions', 1_000_000_000),
    ('($ in billions)', 1_000_000_000),
    ('in billions', 1_000_000_000),
    ('(in billions)', 1_000_000_000)
]

In [81]:
def detect_multiplier(context_text, multipliers, notations, phrases):
    """
    Detect if there's a multiplier like 'millions' or 'billions' in the context.
    """
    
    # Convert to lowercase for case-insensitive matching
    context_lower = context_text.lower()
    
    for pattern, value in notations:
        if re.search(pattern, context_text, re.IGNORECASE):
            return pattern, value

    for phrase, value in phrases:
        if phrase in context_lower:
            return phrase, value
    
    # Then look for standalone multiplier words
    for word, value in multipliers.items():
        if word in context_lower:
            # Make sure it's a whole word by checking for spaces or punctuation around it
            word_pattern = r'\b' + word + r'\b'
            if re.search(word_pattern, context_lower):
                return word, value
    
    # If no multiplier is found, return None and a value of 1 (no modification)
    return None, 1

In [82]:
context = "This is a test for 10 million dollars"

# Test the multiplier detection
multiplier, multiplier_value = detect_multiplier(context, multipliers, notation_patterns, phrases)
print(f"Multiplier: {multiplier}, Value: {multiplier_value}")


Multiplier: million, Value: 1000000


In [83]:
context = "This is a test for 10 dollars in millions"

# Test the multiplier detection
multiplier, multiplier_value = detect_multiplier(context, multipliers, notation_patterns, phrases)
print(f"Multiplier: {multiplier}, Value: {multiplier_value}")

Multiplier: dollars in millions, Value: 1000000


In [84]:
context = "This is a test for 10 M$"

# Test the multiplier detection
multiplier, multiplier_value = detect_multiplier(context, multipliers, notation_patterns, phrases)
print(f"Multiplier: {multiplier}, Value: {multiplier_value}")

Multiplier: M\$, Value: 1000000
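
Each test above passes a context that contains a single number. In a real 50-character window, `detect_multiplier` returns the first multiplier found anywhere in the context, so an unrelated neighbor (for example "300 aircraft" next to "$10 million") would be scaled as well. A possible refinement, sketched here with hypothetical names, only accepts a scale word that directly follows the number:

```python
import re
from decimal import Decimal

WORD_MULTIPLIERS = {
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
    "trillion": 1_000_000_000_000,
}

# A number immediately followed by a scale word, e.g. "10 million" or "1.5 billion".
ADJACENT = re.compile(
    r"(?P<number>(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?|\.\d+)\s*"
    r"(?P<word>thousand|million|billion|trillion)s?\b",
    re.IGNORECASE,
)


def scaled_values(text):
    """Return (scaled Decimal, matched text) for every 'number + scale word' pair."""
    results = []
    for match in ADJACENT.finditer(text):
        raw = Decimal(match.group("number").replace(",", ""))
        factor = WORD_MULTIPLIERS[match.group("word").lower()]
        results.append((raw * factor, match.group(0)))
    return results


print(scaled_values("A $10 million contract covers 300 aircraft and $1.5 billion in parts"))
```

This covers only inline wording; unit phrases in table titles such as "(Dollars in Millions)" still need the title or header lookup.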


# Move functions into a common .py file, find the highest value with context

In [85]:
results = ne.analyze_numbers(processor)

In [86]:
# Print the results
print("LARGEST RAW NUMBER:")
print(f"Value: {results['largest_raw']['value']}")
print(f"Original text: {results['largest_raw']['original']}")
print(f"Page: {results['largest_raw']['page']}")
print(f"Context: {results['largest_raw']['context'][:150]}...")

print("\nLARGEST MODIFIED NUMBER:")
print(f"Modified value: {results['largest_modified']['value']}")
print(f"Raw value: {results['largest_modified']['raw_value']}")
print(f"Original text: {results['largest_modified']['original']}")
print(f"Page: {results['largest_modified']['page']}")
print(f"Multiplier: {results['largest_modified']['multiplier']} ({results['largest_modified']['multiplier_value']})")
print(f"Context: {results['largest_modified']['context'][:150]}...")


LARGEST RAW NUMBER:
Value: 6000000
Original text: 6,000,000
Page: 93
Context: dapting to new and changing workloads. Projects are smaller in scale (costing between $250,000 and $6,000,000) and are designed, scheduled, and constr...

LARGEST MODIFIED NUMBER:
Modified value: 30704100000.0
Raw value: 30704.1
Original text: 30,704.1
Page: 13
Multiplier: dollars in millions (1000000)
Context:  Summary**
###### ( Dollars in Millions ) FY 2023 FY 2024 FY 2025 Total Revenue T 28,239.2 29,176.6 30,704.1 Cost of Goods Sold C 27,950.4 29,494.7 30...
