A lightweight AI-powered system that extracts structured data from utility invoices using LLMs and outputs results as CSV files.
This system processes utility invoices (electricity, gas, water) in multiple languages and formats, extracting key information using Large Language Models.
- 📄 Processes PDF and text-based utility invoices
- 🌍 Handles multilingual invoices (English, Spanish, French)
- 🤖 Uses LLM for intelligent data extraction
- 📊 Outputs clean CSV with standardized fields
- ✅ Includes validation and confidence scoring
- 🔧 Modular and extensible design
invoice-processor/
├── README.md # This file
├── requirements.txt # Python dependencies
├── config.py # Configuration settings
├── main.py # Main execution script
├── src/
│ ├── __init__.py
│ ├── extractor.py # Text extraction from documents
│ ├── llm_processor.py # LLM-based data parsing
│ ├── csv_writer.py # CSV output generation
│ └── validator.py # Data validation utilities
├── sample_invoices/ # Sample input files
│ ├── invoice_1_electricity_english.txt
│ ├── invoice_2_gas_spanish.txt
│ ├── invoice_3_water_english.txt
│ ├── invoice_4_electricity_french.txt
│ └── invoice_5_gas_english.txt
├── output/ # Generated CSV files
│ └── extracted_data.csv
└── TESTING.md # Testing approach documentation
- Python 3.8 or higher
- OpenAI API key (or other LLM API)
- Clone or extract this repository
- Install dependencies:

      pip install -r requirements.txt

- Set up your API key:
  - Create a `.env` file in the project root
  - Add your API key:

        OPENAI_API_KEY=your_api_key_here

  Or set it as an environment variable:

      # Windows PowerShell
      $env:OPENAI_API_KEY="your_api_key_here"

      # Windows CMD
      set OPENAI_API_KEY=your_api_key_here

      # Linux/Mac
      export OPENAI_API_KEY=your_api_key_here

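As a minimal sketch of how the key configured above could be read at startup: the function name `load_api_key` is hypothetical, and the project's actual `config.py` may do this differently. The `python-dotenv` import is guarded so the sketch still works when only the environment variable is set.

```python
import os


def load_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, loading ./.env first if possible."""
    try:
        # python-dotenv is optional here. load_dotenv() reads the given file if
        # it exists and does not override variables that are already set.
        from dotenv import load_dotenv
        load_dotenv(dotenv_path=".env")
    except ImportError:
        pass  # Fall back to plain environment variables.

    key = os.environ.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"{var_name} is not set; add it to .env or the environment")
    return key
```

Failing early with a clear message avoids a less obvious authentication error on the first LLM call.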
Process all sample invoices:

    python main.py

Process specific invoice:

    python main.py --input sample_invoices/invoice_1_electricity_english.txt

Process with custom output:

    python main.py --output custom_output.csv

The system generates a CSV file with the following fields:
| Field | Description | Example |
|---|---|---|
| `vendor_name` | Utility company name | "PowerCo Energy" |
| `invoice_date` | Invoice issue date | "2026-04-15" |
| `service_address` | Service location | "123 Main St, Anytown, CA" |
| `utility_type` | Type of utility | "electricity" |
| `usage_amount` | Numeric usage value | "450.5" |
| `usage_unit` | Unit of measurement | "kWh" |
| `billing_period_start` | Billing period start date | "2026-03-15" |
| `billing_period_end` | Billing period end date | "2026-04-14" |
| `confidence_score` | Extraction confidence (0-1) | "0.95" |
| `language_detected` | Detected language | "en" |
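The field list above maps naturally onto `csv.DictWriter` from the standard library. The following is a minimal sketch, not the contents of `csv_writer.py`; the function name `write_rows` is hypothetical. Missing fields are written as empty cells and unexpected keys are ignored.

```python
import csv
from typing import Dict, Iterable

# Column order taken from the field table.
FIELDNAMES = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start",
    "billing_period_end", "confidence_score", "language_detected",
]


def write_rows(path: str, rows: Iterable[Dict[str, object]]) -> None:
    """Write extracted records to a CSV file with a fixed header."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=FIELDNAMES,
            restval="",             # blank cell for fields missing from a record
            extrasaction="ignore",  # drop keys that are not known columns
        )
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
```

For example, `write_rows("output/extracted_data.csv", records)` would produce one line per invoice below the header row, provided the `output/` directory exists.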
- Accepts PDF and text files
- Extracts raw text content
- Preserves document structure
- Automatically detects invoice language
- Adapts LLM prompts accordingly
- Supports English, Spanish, and French
- Uses structured prompts to extract fields
- Handles variations in layout and terminology
- Normalizes data to standard formats
- Validates extracted data
- Scores confidence for each extraction
- Generates clean CSV output
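The prompt-adaptation and field-extraction steps above can be sketched without calling any API. Everything here is an assumption for illustration: the prompt wording, the names `build_prompt` and `parse_response`, and the choice to ask the model for a JSON object. The real `llm_processor.py` may be structured differently.

```python
import json
from typing import Dict, Optional

# Fields the model is asked to fill (from the output table, minus metadata).
FIELDS = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start", "billing_period_end",
]

# Hypothetical per-language instructions; real prompts would be more detailed.
PROMPT_HEADERS = {
    "en": "Extract the following fields from this utility invoice.",
    "es": "Extrae los siguientes campos de esta factura de servicios.",
    "fr": "Extrayez les champs suivants de cette facture de services publics.",
}


def build_prompt(invoice_text: str, language: str) -> str:
    """Build a structured prompt, falling back to English for unknown languages."""
    header = PROMPT_HEADERS.get(language, PROMPT_HEADERS["en"])
    return (
        f"{header}\n"
        f"Fields: {', '.join(FIELDS)}\n"
        "Respond with a single JSON object. Use null for missing fields "
        "and YYYY-MM-DD for dates.\n\n"
        f"Invoice:\n{invoice_text}"
    )


def parse_response(raw: str) -> Dict[str, Optional[object]]:
    """Pull the JSON object out of a model reply and keep only known fields."""
    empty = {field: None for field in FIELDS}
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return empty
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return empty
    return {field: data.get(field) for field in FIELDS}
```

Returning a record with every field set to `None` on a malformed reply keeps the downstream validation and CSV steps uniform, since they always see the same keys.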
- Text is pre-extracted: OCR is not implemented, so PDFs are assumed to contain an embedded text layer
- Standard utility types: Focus on electricity, gas, and water
- Date formats: Handles common international date formats
- API availability: Requires active LLM API connection
- LLM Choice: Uses OpenAI GPT models for reliability and multilingual support
- Modular design: Separate modules for extraction, processing, and output
- Error handling: Graceful degradation with confidence scoring
- Extensibility: Easy to add new languages or utility types
- Missing fields: Leaves blank, logs warning, includes in confidence score
- Ambiguous dates: Uses context clues and common patterns
- Multiple languages: Detects language, uses appropriate prompts
- Varying units: Normalizes to standard units (e.g., kWh, therms, gallons)
- Different layouts: LLM handles structural variation
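A few of the edge cases above can be illustrated with a short sketch. The alias table, the list of date formats, and the completeness-based score are all assumptions, not the logic in `validator.py`. In particular, the sketch only normalizes unit spelling (it does not convert between units), and it treats slash-separated dates as day-first.

```python
from datetime import datetime
from typing import Dict, List, Optional

# Hypothetical spelling variants mapped to the standard unit labels.
UNIT_ALIASES = {
    "kwh": "kWh", "kilowatt-hours": "kWh",
    "therm": "therms", "therms": "therms",
    "gal": "gallons", "gallons": "gallons",
}

# Tried in order; "%d/%m/%Y" assumes day-first for ambiguous dates.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y", "%B %d, %Y")


def normalize_unit(unit: Optional[str]) -> Optional[str]:
    """Map a known unit spelling to its standard label; pass others through."""
    if not unit:
        return None
    cleaned = unit.strip()
    return UNIT_ALIASES.get(cleaned.lower(), cleaned)


def normalize_date(value: Optional[str]) -> Optional[str]:
    """Return an ISO 8601 date string, or None if no known format matches."""
    if not value:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def completeness_score(record: Dict[str, object], required: List[str]) -> float:
    """Fraction of required fields that are present and non-empty."""
    if not required:
        return 0.0
    filled = sum(1 for name in required if record.get(name) not in (None, ""))
    return round(filled / len(required), 2)
```

A score like this only reflects missing fields; it says nothing about whether the values that were extracted are correct.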
See TESTING.md for detailed testing approach and validation strategy.
Given more time, I would add:
- OCR support: Integrate OCR for scanned documents
- Batch processing: Parallel processing for large volumes
- Database integration: Store results in structured database
- Web interface: Upload and process invoices via web UI
- Advanced validation: Cross-field validation rules
- Cost optimization: Caching and prompt optimization
- More languages: Expand language support
- PDF generation: Export results as formatted reports
- Unit tests: Comprehensive automated testing
- Monitoring: Logging and error tracking
- Python 3.8+: Core language
- OpenAI API: LLM for data extraction
- python-dotenv: Environment configuration
- langdetect: Language detection
- PyPDF2/pdfplumber: PDF text extraction
This is a demonstration project for evaluation purposes.
Created as part of an AI Engineer Take-Home Project (4 hours max).