softwarecore1995/invoice-extraction-system

Utility Invoice Processing System

A lightweight AI-powered system that extracts structured data from utility invoices using LLMs and outputs results as CSV files.

Project Overview

This system processes utility invoices (electricity, gas, water) in multiple languages and formats, extracting key information using Large Language Models.

Features

  • 📄 Processes PDF and text-based utility invoices
  • 🌍 Handles multilingual invoices (English, Spanish, French)
  • 🤖 Uses an LLM for intelligent data extraction
  • 📊 Outputs clean CSV with standardized fields
  • ✅ Includes validation and confidence scoring
  • 🔧 Modular and extensible design

Project Structure

invoice-processor/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── config.py                # Configuration settings
├── main.py                  # Main execution script
├── src/
│   ├── __init__.py
│   ├── extractor.py        # Text extraction from documents
│   ├── llm_processor.py    # LLM-based data parsing
│   ├── csv_writer.py       # CSV output generation
│   └── validator.py        # Data validation utilities
├── sample_invoices/         # Sample input files
│   ├── invoice_1_electricity_english.txt
│   ├── invoice_2_gas_spanish.txt
│   ├── invoice_3_water_english.txt
│   ├── invoice_4_electricity_french.txt
│   └── invoice_5_gas_english.txt
├── output/                  # Generated CSV files
│   └── extracted_data.csv
└── TESTING.md              # Testing approach documentation

Setup Instructions

Prerequisites

  • Python 3.8 or higher
  • OpenAI API key (or a key for another LLM API)

Installation

  1. Clone or extract this repository

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up your API key:

    • Create a .env file in the project root
    • Add your API key:
      OPENAI_API_KEY=your_api_key_here
      

    Or set it as an environment variable:

    # Windows PowerShell
    $env:OPENAI_API_KEY="your_api_key_here"
    
    # Windows CMD
    set OPENAI_API_KEY=your_api_key_here
    
    # Linux/Mac
    export OPENAI_API_KEY=your_api_key_here
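
As a rough sketch of how config.py might pick up the key from either source: the function name get_api_key is hypothetical and not taken from this repository, and the error message is only an illustration.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key, or raise a clear error if it is missing."""
    try:
        # python-dotenv is listed under Technologies Used; load_dotenv()
        # copies variables from a .env file, if one is found, into os.environ.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # fall back to variables already set in the shell

    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; add it to .env or export it")
    return key
```

Failing early with a readable message avoids a less obvious authentication error on the first API call.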

Running the Project

Process all sample invoices:

python main.py

Process a specific invoice:

python main.py --input sample_invoices/invoice_1_electricity_english.txt

Write results to a custom output file:

python main.py --output custom_output.csv
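
The two flags above could be wired up with argparse roughly as follows. This is a sketch only: the defaults (the sample_invoices directory and output/extracted_data.csv) are assumptions inferred from the project structure, and parse_args is a hypothetical helper name.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input and --output flags shown in the commands above."""
    parser = argparse.ArgumentParser(
        description="Extract utility invoice data to CSV")
    # Assumed default: process everything in the sample directory.
    parser.add_argument("--input", default="sample_invoices",
                        help="invoice file or directory to process")
    # Assumed default: the CSV path shown in the project structure.
    parser.add_argument("--output", default="output/extracted_data.csv",
                        help="path of the CSV file to write")
    return parser.parse_args(argv)
```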

Output Format

The system generates a CSV file with the following fields:

Field                  Description                   Example
vendor_name            Utility company name          "PowerCo Energy"
invoice_date           Invoice issue date            "2026-04-15"
service_address        Service location              "123 Main St, Anytown, CA"
utility_type           Type of utility               "electricity"
usage_amount           Numeric usage value           "450.5"
usage_unit             Unit of measurement           "kWh"
billing_period_start   Billing period start date     "2026-03-15"
billing_period_end     Billing period end date       "2026-04-14"
confidence_score       Extraction confidence (0-1)   "0.95"
language_detected      Detected language             "en"
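
A minimal sketch of how rows with these columns could be serialized, using the standard csv module. The helper name rows_to_csv is hypothetical, and the actual csv_writer.py may be organized differently.

```python
import csv
import io

# Column order follows the field table above.
FIELDS = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start",
    "billing_period_end", "confidence_score", "language_detected",
]


def rows_to_csv(rows):
    """Serialize extraction results to CSV text with a header row."""
    buf = io.StringIO(newline="")
    # restval fills fields missing from a row with an empty cell;
    # extrasaction="ignore" drops keys that are not output columns.
    writer = csv.DictWriter(buf, fieldnames=FIELDS, restval="",
                            extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()
```

When writing straight to disk instead, open the file with newline="" so the csv module controls line endings.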

How It Works

1. Document Ingestion

  • Accepts PDF and text files
  • Extracts raw text content
  • Preserves document structure
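
The ingestion step might look something like the sketch below. It assumes pdfplumber (one of the two PDF libraries listed under Technologies Used) handles PDFs; extract_text is a hypothetical function name, and no OCR is attempted.

```python
from pathlib import Path


def extract_text(path):
    """Return the raw text of a .txt or .pdf invoice."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".txt":
        return path.read_text(encoding="utf-8")
    if suffix == ".pdf":
        # Imported lazily so plain-text invoices work without pdfplumber.
        import pdfplumber
        with pdfplumber.open(path) as pdf:
            # Pages without a text layer yield no text, hence the `or ""`.
            pages = [page.extract_text() or "" for page in pdf.pages]
        # One page per block keeps the page order of the document.
        return "\n".join(pages)
    raise ValueError(f"Unsupported file type: {suffix}")
```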

2. Language Detection

  • Automatically detects invoice language
  • Adapts LLM prompts accordingly
  • Supports English, Spanish, and French
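
One way this step could be sketched with langdetect is shown below. Falling back to English for unsupported or undetectable languages is an assumption, as are the function names and the prompt wording.

```python
SUPPORTED = {"en": "English", "es": "Spanish", "fr": "French"}


def detect_language(text, detector=None):
    """Return 'en', 'es', or 'fr', defaulting to 'en' when unsure."""
    if detector is None:
        # langdetect is listed under Technologies Used; detect() returns a
        # short language code such as "en".
        from langdetect import detect as detector
    try:
        code = detector(text)
    except Exception:
        # langdetect raises when the text has no usable features (e.g. empty).
        code = "en"
    return code if code in SUPPORTED else "en"


def prompt_preamble(code):
    """Build a language-specific opening line for the extraction prompt."""
    return (f"The invoice below is written in {SUPPORTED[code]}. "
            "Field labels may appear in that language; return values in English.")
```

Passing the detector as a parameter keeps the function testable without the langdetect dependency installed.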

3. LLM-Based Extraction

  • Uses structured prompts to extract fields
  • Handles variations in layout and terminology
  • Normalizes data to standard formats
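
As an illustration of what a structured prompt and its parsing could look like: the prompt text and both function names are hypothetical, and the API call itself is left out. The sketch asks the model for a JSON object and tolerates extra prose around it.

```python
import json

# Fields the LLM is asked for; confidence and language are added elsewhere.
EXTRACTED_FIELDS = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start", "billing_period_end",
]


def build_prompt(invoice_text):
    """Build an extraction prompt that requests a single JSON object."""
    field_list = ", ".join(EXTRACTED_FIELDS)
    return (
        "Extract the following fields from the utility invoice below and "
        f"reply with a single JSON object using exactly these keys: {field_list}. "
        "Use ISO dates (YYYY-MM-DD) and null for anything not present.\n\n"
        f"Invoice:\n{invoice_text}"
    )


def parse_reply(reply):
    """Parse the model reply into a dict with every expected key present."""
    # Keep only the outermost {...} in case the model adds surrounding text.
    start, end = reply.find("{"), reply.rfind("}")
    data = {}
    if start != -1 and end > start:
        try:
            data = json.loads(reply[start:end + 1])
        except json.JSONDecodeError:
            data = {}
    # Missing or unparseable fields come back as None.
    return {name: data.get(name) for name in EXTRACTED_FIELDS}
```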

4. Validation & Output

  • Validates extracted data
  • Scores confidence for each extraction
  • Generates clean CSV output
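
This README does not spell out how the confidence score is computed, so the following is only one plausible sketch: each failed check subtracts a fixed penalty of 0.1 (an arbitrary choice made here), and the function name validate is hypothetical.

```python
from datetime import date

REQUIRED_FIELDS = [
    "vendor_name", "invoice_date", "service_address", "utility_type",
    "usage_amount", "usage_unit", "billing_period_start", "billing_period_end",
]
DATE_FIELDS = ("invoice_date", "billing_period_start", "billing_period_end")
UTILITY_TYPES = {"electricity", "gas", "water"}


def _parse_iso(value):
    """Return a date for an ISO string, or None if it cannot be parsed."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def validate(record):
    """Return (confidence, problems) for one extracted record."""
    problems = [f"missing {name}" for name in REQUIRED_FIELDS
                if record.get(name) in (None, "")]
    for name in DATE_FIELDS:
        value = record.get(name)
        if value not in (None, "") and _parse_iso(value) is None:
            problems.append(f"{name} is not an ISO date")
    utility = record.get("utility_type")
    if utility not in (None, "") and utility not in UTILITY_TYPES:
        problems.append("unknown utility_type")
    usage = record.get("usage_amount")
    if usage not in (None, ""):
        try:
            if float(usage) < 0:
                problems.append("negative usage_amount")
        except (TypeError, ValueError):
            problems.append("usage_amount is not numeric")
    # Cross-check the billing period only when both ends parsed cleanly.
    start = _parse_iso(record.get("billing_period_start"))
    end = _parse_iso(record.get("billing_period_end"))
    if start and end and start > end:
        problems.append("billing period ends before it starts")
    # Assumed scoring rule: fixed 0.1 penalty per problem, floored at 0.
    confidence = round(max(0.0, 1.0 - 0.1 * len(problems)), 2)
    return confidence, problems
```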

Assumptions & Design Decisions

Assumptions

  1. Text is pre-extracted: OCR is not implemented, so PDFs are assumed to contain an embedded text layer
  2. Standard utility types: Focus on electricity, gas, and water
  3. Date formats: Handles common international date formats
  4. API availability: Requires active LLM API connection

Design Decisions

  1. LLM Choice: Uses OpenAI GPT models for reliability and multilingual support
  2. Modular design: Separate modules for extraction, processing, and output
  3. Error handling: Graceful degradation with confidence scoring
  4. Extensibility: Easy to add new languages or utility types

Handling Edge Cases

  • Missing fields: Left blank, logged as a warning, and reflected in a lower confidence score
  • Ambiguous dates: Uses context clues and common patterns
  • Multiple languages: Detects language, uses appropriate prompts
  • Varying units: Normalizes to standard units (e.g., kWh, therms, gallons)
  • Different layouts: LLM handles structural variation
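
Two of these cases, ambiguous dates and varying units, can be illustrated with a small sketch. It assumes day-first numeric dates for Spanish and French invoices and month-first for US English ones; that rule, the alias table, and the function names are illustrative and not taken from the repository. The unit helper only canonicalizes spellings and does not convert quantities.

```python
from datetime import datetime


def normalize_date(text, day_first):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    if day_first:
        numeric = ["%d/%m/%Y", "%d-%m-%Y", "%d.%m.%Y"]
    else:
        numeric = ["%m/%d/%Y", "%m-%d-%Y"]
    # ISO dates are unambiguous, so they are tried first.
    for fmt in ["%Y-%m-%d"] + numeric:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


# Hypothetical alias table mapping spellings to the standard unit names.
UNIT_ALIASES = {
    "kwh": "kWh", "kw h": "kWh",
    "therm": "therms", "therms": "therms",
    "gal": "gallons", "gallon": "gallons", "gallons": "gallons",
}


def normalize_unit(unit):
    """Map a unit spelling to its standard form; unknown units pass through."""
    cleaned = unit.strip()
    return UNIT_ALIASES.get(cleaned.lower().rstrip("."), cleaned)
```

For example, normalize_date("15/04/2026", day_first=True) and normalize_date("04/15/2026", day_first=False) both give "2026-04-15".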

Testing

See TESTING.md for detailed testing approach and validation strategy.

Improvements for Production

Given more time, I would add:

  1. OCR support: Integrate OCR for scanned documents (none is implemented yet)
  2. Batch processing: Parallel processing for large volumes
  3. Database integration: Store results in structured database
  4. Web interface: Upload and process invoices via web UI
  5. Advanced validation: Cross-field validation rules
  6. Cost optimization: Caching and prompt optimization
  7. More languages: Expand language support
  8. PDF generation: Export results as formatted reports
  9. Unit tests: Comprehensive automated testing
  10. Monitoring: Logging and error tracking

Technologies Used

  • Python 3.8+: Core language
  • OpenAI API: LLM for data extraction
  • python-dotenv: Environment configuration
  • langdetect: Language detection
  • PyPDF2/pdfplumber: PDF text extraction

License

This is a demonstration project for evaluation purposes.

Author

Created as part of an AI Engineer take-home project (4 hours max)
