Utility Invoice Processing System

A lightweight AI-powered system that extracts structured data from utility invoices using LLMs and outputs results as CSV files.

Project Overview

This system processes utility invoices (electricity, gas, water) in multiple languages and formats, extracting key information using Large Language Models.

Features

📄 Processes PDF and text-based utility invoices
🌍 Handles multilingual invoices (English, Spanish, French)
🤖 Uses LLM for intelligent data extraction
📊 Outputs clean CSV with standardized fields
✅ Includes validation and confidence scoring
🔧 Modular and extensible design

Project Structure

invoice-processor/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── config.py                # Configuration settings
├── main.py                  # Main execution script
├── src/
│   ├── __init__.py
│   ├── extractor.py        # Text extraction from documents
│   ├── llm_processor.py    # LLM-based data parsing
│   ├── csv_writer.py       # CSV output generation
│   └── validator.py        # Data validation utilities
├── sample_invoices/         # Sample input files
│   ├── invoice_1_electricity_english.txt
│   ├── invoice_2_gas_spanish.txt
│   ├── invoice_3_water_english.txt
│   ├── invoice_4_electricity_french.txt
│   └── invoice_5_gas_english.txt
├── output/                  # Generated CSV files
│   └── extracted_data.csv
└── TESTING.md              # Testing approach documentation

Setup Instructions

Prerequisites

Python 3.8 or higher
OpenAI API key (or a key for another LLM provider)

Installation

Clone or extract this repository
Install dependencies:
```
pip install -r requirements.txt
```

Set up your API key:

Create a .env file in the project root
Add your API key:
```
OPENAI_API_KEY=your_api_key_here
```

Or set it as an environment variable:

# Windows PowerShell
$env:OPENAI_API_KEY="your_api_key_here"

# Windows CMD
set OPENAI_API_KEY=your_api_key_here

# Linux/Mac
export OPENAI_API_KEY=your_api_key_here
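
Either way, the key ends up in the process environment. A minimal sketch of how a script might read it is shown here; `load_api_key` is a hypothetical helper (not taken from this repository's `config.py`), and python-dotenv is treated as optional so the function still works with plain environment variables:

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, loading .env first if possible."""
    try:
        # python-dotenv is listed under Technologies Used; load_dotenv() reads a
        # .env file and, by default, does not override variables already set.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # fall back to plain environment variables

    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; add it to .env or export it")
    return key
```

Failing early with a clear message is a design choice for this sketch; it avoids a less obvious authentication error on the first API call.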

Running the Project

Process all sample invoices:

python main.py

Process a specific invoice:

python main.py --input sample_invoices/invoice_1_electricity_english.txt

Write results to a custom output file:

python main.py --output custom_output.csv
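
The two flags above could be wired up roughly as follows. This is a sketch only: the defaults (`sample_invoices/` and `output/extracted_data.csv`) are inferred from the project structure, and `build_parser` and `collect_inputs` are hypothetical names, not necessarily what `main.py` uses.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Extract utility invoice data to CSV")
    parser.add_argument("--input", type=Path, default=Path("sample_invoices"),
                        help="Invoice file or directory (assumed default: sample_invoices/)")
    parser.add_argument("--output", type=Path, default=Path("output/extracted_data.csv"),
                        help="CSV path (assumed default: output/extracted_data.csv)")
    return parser


def collect_inputs(path: Path) -> list:
    """Return the files to process: one file, or every .txt/.pdf in a directory."""
    if path.is_dir():
        return sorted(p for p in path.iterdir() if p.suffix.lower() in {".txt", ".pdf"})
    return [path]


if __name__ == "__main__":
    args = build_parser().parse_args()
    for invoice in collect_inputs(args.input):
        print(f"Would process {invoice} -> {args.output}")
```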

Output Format

The system generates a CSV file with the following fields:

| Field | Description | Example |
| --- | --- | --- |
| `vendor_name` | Utility company name | "PowerCo Energy" |
| `invoice_date` | Invoice issue date | "2026-04-15" |
| `service_address` | Service location | "123 Main St, Anytown, CA" |
| `utility_type` | Type of utility | "electricity" |
| `usage_amount` | Numeric usage value | "450.5" |
| `usage_unit` | Unit of measurement | "kWh" |
| `billing_period_start` | Billing period start date | "2026-03-15" |
| `billing_period_end` | Billing period end date | "2026-04-14" |
| `confidence_score` | Extraction confidence (0-1) | "0.95" |
| `language_detected` | Detected language | "en" |
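
As an illustration of this layout, the sketch below writes one record with the standard library's `csv.DictWriter`. The column order follows the field list above; `write_rows` is a hypothetical helper and may differ from what `src/csv_writer.py` actually does.

```python
import csv
import io

FIELDNAMES = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start",
    "billing_period_end", "confidence_score", "language_detected",
]


def write_rows(rows, stream) -> None:
    """Write extracted records as CSV; missing keys become empty cells."""
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES,
                            restval="", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# In-memory demo; for a real file, open it with newline="" as the csv docs advise.
buffer = io.StringIO()
write_rows([{"vendor_name": "PowerCo Energy", "invoice_date": "2026-04-15",
             "usage_amount": 450.5, "usage_unit": "kWh"}], buffer)
print(buffer.getvalue())
```

Using `restval=""` keeps rows aligned when the extractor could not find a field, which matches the "leave blank" behaviour described under Handling Edge Cases.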

How It Works

1. Document Ingestion

Accepts PDF and text files
Extracts raw text content
Preserves document structure
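
A minimal sketch of this ingestion step, assuming pdfplumber (one of the two PDF libraries listed under Technologies Used) handles text-based PDFs. `extract_text` is a hypothetical function name and may not match `src/extractor.py`.

```python
from pathlib import Path


def extract_text(path: Path) -> str:
    """Return the raw text of a .txt or text-based .pdf invoice."""
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        import pdfplumber  # third-party; imported lazily so .txt works without it
        with pdfplumber.open(str(path)) as pdf:
            # extract_text() can return None for pages without a text layer
            pages = [page.extract_text() or "" for page in pdf.pages]
        # Keep page boundaries visible so some document structure survives
        return "\n\n".join(pages)
    raise ValueError(f"Unsupported file type: {suffix or path.name}")
```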

2. Language Detection

Automatically detects invoice language
Adapts LLM prompts accordingly
Supports English, Spanish, and French
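
One way this step could look, sketched with langdetect (listed under Technologies Used). The prompt hints, the English fallback, and the function names are illustrative assumptions, not the project's actual prompts.

```python
# Hypothetical per-language hints appended to the extraction prompt.
PROMPT_HINTS = {
    "en": "The invoice is in English.",
    "es": "The invoice is in Spanish; terms such as 'factura' and 'consumo' may appear.",
    "fr": "The invoice is in French; terms such as 'facture' and 'consommation' may appear.",
}


def detect_language(text: str, default: str = "en") -> str:
    """Return a supported ISO 639-1 code, falling back to `default`."""
    try:
        from langdetect import DetectorFactory, detect
        from langdetect.lang_detect_exception import LangDetectException
    except ImportError:
        return default  # langdetect not installed
    DetectorFactory.seed = 0  # make langdetect's results deterministic
    try:
        code = detect(text)
    except LangDetectException:
        return default  # e.g. empty text or no usable features
    return code if code in PROMPT_HINTS else default


def prompt_hint(code: str) -> str:
    """Pick the hint for a language code, defaulting to English."""
    return PROMPT_HINTS.get(code, PROMPT_HINTS["en"])
```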

3. LLM-Based Extraction

Uses structured prompts to extract fields
Handles variations in layout and terminology
Normalizes data to standard formats
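
The sketch below shows the general shape of a structured prompt and a tolerant parser for the reply. It deliberately leaves out the API call itself; the prompt wording and the helper names `build_prompt` and `parse_response` are assumptions rather than the contents of `src/llm_processor.py`.

```python
import json

FIELDS = ["vendor_name", "invoice_date", "service_address", "utility_type",
          "usage_amount", "usage_unit", "billing_period_start", "billing_period_end"]


def build_prompt(invoice_text: str) -> str:
    """Ask for a single JSON object with exactly the expected keys."""
    return (
        "Extract the following fields from the utility invoice below and reply "
        "with one JSON object only. Use null for anything not stated. "
        "Dates must be YYYY-MM-DD.\n"
        f"Fields: {', '.join(FIELDS)}\n\n"
        f"Invoice:\n{invoice_text}"
    )


def parse_response(raw: str) -> dict:
    """Pull the JSON object out of the reply and keep only the expected fields."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("No JSON object found in model reply")
    data = json.loads(raw[start:end + 1])
    # Missing keys become None; unexpected keys are dropped
    return {field: data.get(field) for field in FIELDS}
```

Slicing from the first `{` to the last `}` tolerates replies that wrap the JSON in extra prose, at the cost of failing on replies that contain more than one object.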

4. Validation & Output

Validates extracted data
Scores confidence for each extraction
Generates clean CSV output
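
This README does not spell out how the confidence score is computed. As one plausible sketch, the score below is simply the fraction of expected fields that are present and well-formed; the formula, field list, and function names are assumptions, not the logic in `src/validator.py`.

```python
from datetime import date

REQUIRED = ["vendor_name", "invoice_date", "service_address", "utility_type",
            "usage_amount", "usage_unit", "billing_period_start", "billing_period_end"]
DATE_FIELDS = {"invoice_date", "billing_period_start", "billing_period_end"}
UTILITY_TYPES = {"electricity", "gas", "water"}


def is_valid(field: str, value) -> bool:
    """Check that a value is non-empty and matches the field's expected form."""
    if value is None or str(value).strip() == "":
        return False
    try:
        if field in DATE_FIELDS:
            date.fromisoformat(str(value))  # raises ValueError if not YYYY-MM-DD
        elif field == "usage_amount":
            return float(value) >= 0
        elif field == "utility_type":
            return str(value).lower() in UTILITY_TYPES
    except ValueError:
        return False
    return True


def confidence_score(record: dict) -> float:
    """Fraction of required fields that are present and well-formed (0 to 1)."""
    valid = sum(is_valid(f, record.get(f)) for f in REQUIRED)
    return round(valid / len(REQUIRED), 2)
```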

Assumptions & Design Decisions

Assumptions

Text is pre-extracted: OCR is not implemented; PDFs are assumed to contain an embedded text layer
Standard utility types: Focus on electricity, gas, and water
Date formats: Handles common international date formats
API availability: Requires active LLM API connection
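
To make the date-format assumption concrete, here is a small deterministic fallback that tries several common layouts and emits ISO 8601. The format list and its day-first ordering are assumptions for illustration; the project itself may rely on the LLM to normalize dates.

```python
from datetime import datetime

# Candidate formats, tried in order. Day-first comes before month-first, which
# is an assumption: "03/04/2026" is read as 3 April rather than March 4.
# The %B formats match English month names under the default C locale.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%d-%m-%Y",
                "%m/%d/%Y", "%B %d, %Y", "%d %B %Y"]


def normalize_date(raw: str):
    """Return the date as YYYY-MM-DD, or None if no format matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None
```

Returning `None` instead of guessing lets the caller leave the field blank and lower the confidence score.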

Design Decisions

LLM Choice: Uses OpenAI GPT models for reliability and multilingual support
Modular design: Separate modules for extraction, processing, and output
Error handling: Graceful degradation with confidence scoring
Extensibility: Easy to add new languages or utility types

Handling Edge Cases

Missing fields: Leaves the field blank, logs a warning, and factors the gap into the confidence score
Ambiguous dates: Uses context clues and common patterns
Multiple languages: Detects language, uses appropriate prompts
Varying units: Normalizes to standard units (e.g., kWh, therms, gallons)
Different layouts: LLM handles structural variation
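
Unit normalization, for example, can be table-driven. The sketch below converts a few energy and volume units to the standard units mentioned above; the table contents and the name `normalize_usage` are illustrative assumptions, and gallons are taken to mean US gallons.

```python
# Conversion factors to the standard units. Gas volume units (m3 or CCF to
# therms) are left out because the factor depends on the gas's heat content.
TO_STANDARD = {
    "kwh": ("kWh", 1.0),
    "wh": ("kWh", 0.001),
    "mwh": ("kWh", 1000.0),
    "therm": ("therms", 1.0),
    "therms": ("therms", 1.0),
    "gal": ("gallons", 1.0),
    "gallons": ("gallons", 1.0),
    "liters": ("gallons", 0.264172),   # US gallons per liter
    "litres": ("gallons", 0.264172),
    "m3": ("gallons", 264.172),        # US gallons per cubic meter
}


def normalize_usage(amount: float, unit: str):
    """Return (amount, unit) converted to a standard unit when one is known."""
    key = unit.strip().lower().replace("³", "3")
    if key not in TO_STANDARD:
        return amount, unit  # leave unknown units untouched
    standard_unit, factor = TO_STANDARD[key]
    return round(amount * factor, 3), standard_unit
```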

Testing

See TESTING.md for detailed testing approach and validation strategy.

Improvements for Production

Given more time, I would add:

OCR support: Integrate OCR for scanned documents
Batch processing: Parallel processing for large volumes
Database integration: Store results in structured database
Web interface: Upload and process invoices via web UI
Advanced validation: Cross-field validation rules
Cost optimization: Caching and prompt optimization
More languages: Expand language support
PDF generation: Export results as formatted reports
Unit tests: Comprehensive automated testing
Monitoring: Logging and error tracking

Technologies Used

Python 3.8+: Core language
OpenAI API: LLM for data extraction
python-dotenv: Environment configuration
langdetect: Language detection
PyPDF2/pdfplumber: PDF text extraction

License

This is a demonstration project for evaluation purposes.

Author

Created as part of an AI Engineer take-home project (4 hours max)
