# Walkthrough: TRACERA - ESG Data Extractor

Welcome to this interactive walkthrough of the LLM-Powered Invoice Data Extractor project. This notebook guides you through the entire pipeline: setting up the environment, parsing PDFs, extracting data with an LLM, and collecting the final structured output.

You can execute each code cell sequentially to see the process in action.

## 1. Setup and Configuration

First, we need to set up our environment. This involves:
1. Adding the project's root directory to the system path to allow imports from the `src` folder.
2. Loading the environment variables (like your API keys) from the `.env` file.
3. Importing all the necessary modules and classes from the project.

In [2]:
import sys
from pathlib import Path
import pandas as pd
from dotenv import load_dotenv

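# Add the project root (the parent of the notebook's working directory)
# to sys.path so that the `src` package can be imported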
project_root = Path.cwd().parent
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Load environment variables from .env file
load_dotenv(project_root / ".env")

print(f"Project root added to path: {project_root}")

Project root added to path: /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment


In [3]:
# Import all the necessary classes and configuration from our source code
from src import config
from src.utils.pdf_parser import PDFParser
from src.utils.llm_service import LLMService
from src.utils.data_extractor import DataExtractor
from src.utils.file_handler import get_pdf_files

print("Successfully imported project modules.")

Successfully imported project modules.


### Configuration

Let's verify that our configuration is loaded correctly. The `src/config.py` file manages all our project paths and API keys.

In [12]:
print(f"Using LLM Model            : {config.LLM_MODEL_NAME}")
print(f"Documents Directory        : {config.DOCUMENTS_DIR}")
print(f"Output CSV Path            : {config.OUTPUT_CSV_PATH}")
print(f"Llama Cloud API Key is set : {bool(config.LLAMA_CLOUD_API_KEY)}")
print(f"Gemini API Key is set      : {bool(config.GEMINI_API_KEY)}")

Using LLM Model            : gemini-2.5-flash
Documents Directory        : /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/data
Output CSV Path            : /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/output/extracted_data.csv
Llama Cloud API Key is set : True
Gemini API Key is set      : True
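
For orientation, a settings module of this kind could look roughly like the sketch below. The actual `src/config.py` is not shown in this notebook, so the `Settings` class, the `load_settings` function, the directory layout, the default model name, and the environment variable names are all assumptions inferred from the output above.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values printed in the cell above."""
    llm_model_name: str
    documents_dir: Path
    output_csv_path: Path
    llama_cloud_api_key: Optional[str]
    gemini_api_key: Optional[str]


def load_settings(project_root: Path, env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from a project root and an environment mapping."""
    return Settings(
        # Assumed variable name and default; the real project may differ.
        llm_model_name=env.get("LLM_MODEL_NAME", "gemini-2.5-flash"),
        documents_dir=project_root / "data",
        output_csv_path=project_root / "output" / "extracted_data.csv",
        llama_cloud_api_key=env.get("LLAMA_CLOUD_API_KEY"),
        gemini_api_key=env.get("GEMINI_API_KEY"),
    )


# Pass an explicit mapping so the example does not depend on the real environment.
settings = load_settings(Path("/tmp/project"), {"GEMINI_API_KEY": "dummy"})
print(bool(settings.gemini_api_key), bool(settings.llama_cloud_api_key))
```

Printing only `bool(...)` of each key, as the cell above does, confirms that a key is present without leaking it into the notebook output.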


## 2. Step-by-Step Extraction on a Sample PDF

Now, let's run the pipeline step-by-step on a single document to understand how each component works.

### Step 2.1: Select a Sample PDF

In [5]:
pdf_files = get_pdf_files(config.DOCUMENTS_DIR)
if not pdf_files:
    raise FileNotFoundError(
        f"No PDF files found in {config.DOCUMENTS_DIR}. Please add some PDFs to run this notebook."
    )

sample_pdf_path = pdf_files[0]
print(f"Selected sample file: {sample_pdf_path.name}")

Selected sample file: test11.pdf
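
The implementation of `get_pdf_files` is not shown here. A helper with the same purpose could be sketched as follows; the name `list_pdf_files`, the case-insensitive extension match, and the sorting are choices made for this illustration, not a description of the project's helper.

```python
import tempfile
from pathlib import Path
from typing import List


def list_pdf_files(directory: Path) -> List[Path]:
    """Return the PDF files directly inside `directory` (hypothetical helper)."""
    if not directory.is_dir():
        return []
    # Match the extension case-insensitively so "SCAN.PDF" is also picked up.
    return sorted(
        p for p in directory.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (root / name).touch()
    found = [p.name for p in list_pdf_files(root)]

print(found)
```

Returning an empty list for a missing directory lets the caller decide how to react, as the `FileNotFoundError` check in the cell above does.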


### Step 2.2: Parse PDF to Markdown using LlamaParse

The `PDFParser` class is responsible for taking a raw PDF file and converting it into clean, structured Markdown text. It uses the LlamaParse API for its high-fidelity OCR and layout recognition capabilities.

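**Caching:** Parsing requires a call to an external API, which makes it the slow step of the pipeline, so the parser automatically caches the resulting Markdown in the `cache/` directory. If you run this cell again for the same file, the parser loads the result from the cache instead of calling the API, as the `Loading from cache: ...` message in the output confirms.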
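
The caching idea can be sketched independently of LlamaParse. The example below is an assumption-laden illustration rather than the project's `PDFParser`: the function `cached_parse` is hypothetical, and keying the cache on an MD5 digest of the file's bytes is a guess (the cache file names in this notebook end in a 32-character hex string, which is consistent with MD5 but does not reveal what is being hashed).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_parse(pdf_path: Path, cache_dir: Path, parse_fn: Callable[[Path], str]) -> str:
    """Return parsed Markdown for `pdf_path`, calling `parse_fn` only on a cache miss."""
    # Key the cache on the file's bytes so an edited PDF is parsed again.
    digest = hashlib.md5(pdf_path.read_bytes()).hexdigest()
    cache_file = cache_dir / f"{pdf_path.stem}_{digest}.md"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    markdown = parse_fn(pdf_path)  # the expensive call, e.g. a remote parsing API
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(markdown, encoding="utf-8")
    return markdown


# Demonstration with a stand-in parser that records how often it is called.
calls = []


def fake_parser(path: Path) -> str:
    calls.append(path.name)
    return "# parsed " + path.name


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "sample.pdf"
    pdf.write_bytes(b"%PDF-1.4 placeholder")
    first = cached_parse(pdf, Path(tmp) / "cache", fake_parser)
    second = cached_parse(pdf, Path(tmp) / "cache", fake_parser)

print(first == second, len(calls))
```

The second call returns the same text without invoking the parser again, which is the behavior the caching note describes.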

In [6]:
pdf_parser = PDFParser()
markdown_text = pdf_parser.parse_document(sample_pdf_path)

print(f"""--- Parsed Markdown (first 1000 characters) ---
    {markdown_text[:1000]}
...""")

Loading from cache: /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/cache/test11_cff30a58e73a613041f4b8855505635a.md
--- Parsed Markdown (first 1000 characters) ---
    
TAX INVOICE

# Invoice No: 9817263870123

Date: 15/03/2024

Page: 1 of 1

| Description                          | Date From  | Date To    | Net Amount | GST   | Gross Amount |
| ------------------------------------ | ---------- | ---------- | ---------- | ----- | ------------ |
| Recharge: Watercare Acc No 535633803 | 02/02/2024 | 04/03/2024 | 98.12      | 14.72 | 112.84       |

Total Amount ($NZD): $98.12

GST: $14.72

Gross Amount: $112.84

Email remittance to: accountsreceivablenz@goodman.com

Tax Invoice No: 9817263870123

Due Date: 02/04/2024

Total Amount Due ($NZD): $112.84





# Statement and tax invoice

Account number: 5356338-03

Watercare Services Limited

www.watercare.co.nz

Private Bag 94010

Auckland 2241

Customer Service: commercialcu

### Step 2.3: Extract Structured Data with the LLM

Next, the `LLMService` takes the clean Markdown text and sends it to the configured LLM (Gemini or OpenAI) with a specific prompt. The prompt instructs the model to extract the desired fields according to the Pydantic schema defined in `src/schemas.py`.

This is the raw output from the model, before any cleaning or consolidation.

In [7]:
llm_service = LLMService()
raw_extraction_result = llm_service.extract_structured_data(markdown_text)

print("--- Raw Extracted Records ---")
for record in raw_extraction_result.records:
    print(record.model_dump_json(indent=2))

Using Gemini LLM
--- Raw Extracted Records ---
{
  "account_number": "5356338-03",
  "meter_number": "M06A213583",
  "from_date": "2024-02-04",
  "to_date": "2024-03-04",
  "usage": "11.00",
  "cost": "112.84"
}
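
The field names in the output above hint at the shape of the schema. A minimal sketch of how such a schema could be declared with Pydantic v2 follows. The class names `UtilityRecord` and `ExtractionResult`, the field descriptions, and the all-optional string typing are assumptions, since `src/schemas.py` itself is not shown here.

```python
from typing import List, Optional

from pydantic import BaseModel, Field


class UtilityRecord(BaseModel):
    """One billing record extracted from a document (hypothetical schema)."""
    account_number: Optional[str] = Field(default=None, description="Customer account number")
    meter_number: Optional[str] = Field(default=None, description="Meter identifier, if any")
    from_date: Optional[str] = Field(default=None, description="Period start, YYYY-MM-DD")
    to_date: Optional[str] = Field(default=None, description="Period end, YYYY-MM-DD")
    usage: Optional[str] = Field(default=None, description="Consumption for the period")
    cost: Optional[str] = Field(default=None, description="Amount charged for the period")


class ExtractionResult(BaseModel):
    """Wrapper so the model can return any number of records."""
    records: List[UtilityRecord] = Field(default_factory=list)


# Validate a JSON payload of the kind an LLM might return.
payload = '{"records": [{"account_number": "5356338-03", "cost": "112.84"}]}'
result = ExtractionResult.model_validate_json(payload)
print(result.records[0].model_dump_json(indent=2))
```

Making every field optional lets a partially filled record pass validation, which leaves room for a later consolidation step to fill the gaps.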


### Step 2.4: Consolidate and Clean Records

An LLM often finds the same information in several places in a document, which leads to duplicate or partial records. The consolidation step sends all the raw extracted records back to the LLM with a new prompt, asking it to merge duplicates, fill in missing values, and produce a final, clean list of unique records. For this sample the raw extraction already contains a single complete record, so consolidation returns it unchanged.

In [8]:
consolidated_result = llm_service.consolidate_records(raw_extraction_result.records)

print("--- Consolidated & Cleaned Records ---")
for record in consolidated_result.records:
    print(record.model_dump_json(indent=2))

   Calling LLM to consolidate results...
--- Consolidated & Cleaned Records ---
{
  "account_number": "5356338-03",
  "meter_number": "M06A213583",
  "from_date": "2024-02-04",
  "to_date": "2024-03-04",
  "usage": "11.00",
  "cost": "112.84"
}
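
For comparison, a purely rule-based consolidation can be sketched without an LLM. This is not how `consolidate_records` works (that method delegates to the model); it is a simplified stand-in that merges records sharing the same account, meter, and billing period, filling gaps from later duplicates. The function name `merge_duplicates` and the choice of key fields are assumptions, and the `None` gaps in the sample input are invented for illustration.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]
KEY_FIELDS = ("account_number", "meter_number", "from_date", "to_date")


def merge_duplicates(records: List[Record]) -> List[Record]:
    """Merge records with identical key fields, filling missing values."""
    merged: Dict[Tuple[Optional[str], ...], Record] = {}
    for record in records:
        key = tuple(record.get(field) for field in KEY_FIELDS)
        if key not in merged:
            merged[key] = dict(record)
            continue
        # Same key seen before: keep existing values, fill only the gaps.
        target = merged[key]
        for field, value in record.items():
            if target.get(field) in (None, "") and value not in (None, ""):
                target[field] = value
    return list(merged.values())


raw = [
    {"account_number": "5356338-03", "meter_number": "M06A213583",
     "from_date": "2024-02-04", "to_date": "2024-03-04", "usage": None, "cost": "112.84"},
    {"account_number": "5356338-03", "meter_number": "M06A213583",
     "from_date": "2024-02-04", "to_date": "2024-03-04", "usage": "11.00", "cost": None},
]
clean = merge_duplicates(raw)
print(len(clean), clean[0]["usage"], clean[0]["cost"])
```

A rule like this only merges records whose key fields match exactly. Records with a missing or differently formatted key stay separate, which is the kind of fuzzy case the LLM-based step is meant to handle.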


## 3. Running the End-to-End Pipeline

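The `DataExtractor` class orchestrates the entire process for a single file: parsing, extraction, and consolidation. As the output shows, it also renames the fields to display headers and adds a `Filename` column. Let's use it to see the complete, formatted output for our sample PDF.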

In [9]:
extractor = DataExtractor()
final_records_for_file = extractor.extract_from_file(sample_pdf_path)

print("--- Final Formatted Output for a Single File ---")
pd.DataFrame(final_records_for_file)

Using Gemini LLM
-> Starting extraction for: test11.pdf
Loading from cache: /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/cache/test11_cff30a58e73a613041f4b8855505635a.md
   => Document Parsing completed for test11.pdf
   Calling LLM to consolidate results...
{'Account Number': '5356338-03', 'Meter Number': 'M06A213583', 'From Date': '2024-02-04', 'To Date': '2024-03-04', 'Usage': '11.00', 'Cost': '112.84', 'Filename': 'test11'}
   => Found 1 records in test11.pdf
--- Final Formatted Output for a Single File ---


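|   | Account Number | Meter Number | From Date  | To Date    | Usage | Cost   | Filename |
| - | -------------- | ------------ | ---------- | ---------- | ----- | ------ | -------- |
| 0 | 5356338-03     | M06A213583   | 2024-02-04 | 2024-03-04 | 11.0  | 112.84 | test11   |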
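
Comparing the raw records from Step 2.3 with the logged dictionary above shows that the keys change from `snake_case` to title-cased headers and that a `Filename` entry holding the file stem is added. How `DataExtractor` does this is not shown; the hypothetical `format_record` below is one way such a step could be written.

```python
from pathlib import Path
from typing import Dict, Optional


def format_record(record: Dict[str, Optional[str]], source: Path) -> Dict[str, Optional[str]]:
    """Rename snake_case fields to display headers and tag the row with its source file."""
    formatted = {key.replace("_", " ").title(): value for key, value in record.items()}
    formatted["Filename"] = source.stem  # "test11.pdf" -> "test11"
    return formatted


row = format_record(
    {"account_number": "5356338-03", "from_date": "2024-02-04"},
    Path("test11.pdf"),
)
print(row)
```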


## 4. Running the Full Pipeline on All Documents

Finally, let's simulate the `main.py` script by iterating through all the PDFs in the `data/` directory, extracting the data from each, and collecting the results into a single pandas DataFrame.

In [11]:
all_pdf_files = get_pdf_files(config.DOCUMENTS_DIR)
all_extracted_records = []
extractor = DataExtractor()

for file_path in all_pdf_files:
    try:
        records = extractor.extract_from_file(file_path)
        all_extracted_records.extend(records)
    except Exception as e:
        print(f"!! An unexpected error occurred while processing {file_path.name}: {e}")

final_df = pd.DataFrame(all_extracted_records)

print("--- Final Combined DataFrame from All PDFs ---")
final_df

Using Gemini LLM
-> Starting extraction for: test11.pdf
Loading from cache: /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/cache/test11_cff30a58e73a613041f4b8855505635a.md
   => Document Parsing completed for test11.pdf
   Calling LLM to consolidate results...
{'Account Number': '5356338-03', 'Meter Number': 'M06A213583', 'From Date': '2024-02-04', 'To Date': '2024-03-04', 'Usage': '11.00', 'Cost': '112.84', 'Filename': 'test11'}
   => Found 1 records in test11.pdf
-> Starting extraction for: test10.pdf
Loading from cache: /Users/vatsalthakkar/Desktop/VATSAL_THAKKAR/MS-CS-UGA/VT-Code/VTs-Lab/Coding-Challenge/tracera-coding-assessment/cache/test10_9e7e00720a64915dbe9cf949b0932484.md
   => Document Parsing completed for test10.pdf
   Calling LLM to consolidate results...
{'Account Number': '2843619', 'Meter Number': '95417343', 'From Date': '2024-02-01', 'To Date': '2024-02-10', 'Usage': '657.23', 'Cost': '647,280.93', 'Fi

|   | Account Number  | Meter Number | From Date  | To Date    | Usage    | Cost      | Filename |
| - | --------------- | ------------ | ---------- | ---------- | -------- | --------- | -------- |
| 0 | 5356338-03      | M06A213583   | 2024-02-04 | 2024-03-04 | 11.0     | 112.84    | test11   |
| 1 | 2843619         | 95417343     | 2024-02-01 | 2024-02-10 | 657.23   | 647280.93 | test10   |
| 2 | 16134528        | WF69FPO      | 2024-06-14 | 2024-06-14 | 47.4     | 71.16     | test12   |
| 3 | 16134528        | WF69FEJ      | 2024-06-13 | 2024-06-13 | 1.0      | 28.99     | test12   |
| 4 | 16134528        | WF69FEJ      | 2024-06-13 | 2024-06-13 | 58.22    | 87.42     | test12   |
| 5 | 982 121 827 236 | A872397620   | 2024-04-16 | 2024-05-14 | 233348.0 | 26792.54  | test4    |
| 6 | 113423700       | 2179421      | 2023-10-11 | 2023-11-12 | 639.0    | 31.7      | test5    |
| 7 | 8677 0264 22    | -            | 2024-05-08 | 2024-06-07 | 8.0      | 141.08    | test7    |
| 8 | 416957143135    | 1087677      | 2022-12-30 | 2023-01-30 | 1384.5   | 20819.52  | test6    |
| 9 | 10221125        | A9128362     | 2024-02-01 | 2024-02-29 | 20679.8  | 7037.81   | test3    |
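
The log line for `test10.pdf` reports the cost as the string `'647,280.93'`, whereas the table shows the number `647280.93`, so the value is converted at some point before display. Where and how the project does this is not shown. The hypothetical helper `to_float` below sketches one way to do it, assuming the comma is always a thousands separator and never a decimal comma.

```python
from typing import Optional


def to_float(value: Optional[str]) -> Optional[float]:
    """Convert strings such as '647,280.93' to floats; return None when not numeric."""
    if value is None:
        return None
    # Drop thousands separators and a currency sign before parsing.
    cleaned = value.replace(",", "").replace("$", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None  # e.g. "-" used as a placeholder for a missing value


print(to_float("647,280.93"), to_float("11.00"), to_float("-"))
```

Returning `None` instead of raising keeps placeholder values, such as the `-` meter number in row 7, from aborting a batch run if the same helper is applied to other columns.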


## 5. Conclusion

This notebook has demonstrated the complete, step-by-step workflow of the data extraction pipeline. We have seen how the system:

- Parses PDFs into clean Markdown, using a cache to avoid repeated API calls.
- Uses an LLM to perform initial data extraction based on a defined schema.
- Uses a second LLM call to consolidate and de-duplicate the results.
- Orchestrates the entire process across multiple files and collects the results into a single DataFrame, ready to be written to the CSV path in `config.OUTPUT_CSV_PATH`.
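
The cells above stop at the combined DataFrame and do not write a file. As a minimal sketch of that last step, the example below writes a list of record dictionaries to CSV using the standard library's `csv` module as a stand-in; the function `write_records` is hypothetical, and the project's `main.py` may well use pandas instead.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List


def write_records(records: List[Dict[str, str]], csv_path: Path) -> None:
    """Write rows to `csv_path`, creating the parent directory if needed."""
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    # Collect headers in first-seen order so rows with extra keys are still written.
    fieldnames: List[str] = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)


rows = [{"Account Number": "5356338-03", "Cost": "112.84", "Filename": "test11"}]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "output" / "extracted_data.csv"
    write_records(rows, target)
    written = target.read_text(encoding="utf-8").splitlines()

print(written)
```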

You can now modify the code cells above to experiment with different files or inspect the outputs at each stage.