AI-powered invoice data extraction widget that converts PDF invoices into structured CSV/JSON data.
- Overview
- Features
- Project Structure
- Installation
- Usage
- Configuration
- Testing
- Improvements Summary
- API Reference
This widget extracts structured data from invoice PDFs using AI (OpenAI GPT-4 or Anthropic Claude). It handles text-based and image-based PDFs, validates inputs, and outputs clean CSV/JSON files with comprehensive error handling and logging.
Key Capabilities:
- Extract vendor, customer, invoice details, line items, and financial data
- Support for OpenAI GPT-4 and Anthropic Claude
- Works without API keys - automatic fallback to basic extraction
- OCR support for image-based PDFs
- Multiple output formats (CSV, JSON, or both)
- Comprehensive error handling and validation
- Robust file discovery for Abyss platform deployment
- AI-Powered Parsing - Uses GPT-4 or Claude for intelligent data extraction
- Works Without API Keys - Automatic fallback to mock parser for basic extraction
- Multi-Format Output - CSV, JSON, or both formats
- OCR Support - Extract text from image-based PDFs
- Robust Validation - Validates files, API keys, and extracted data
- Error Handling - Comprehensive error messages and troubleshooting guides
- Logging System - Step-by-step progress tracking
- Smart File Discovery - Handles various file path scenarios on Abyss platform
- Vendor information (name, address, email, phone)
- Customer information (name, address)
- Invoice details (number, date, due date)
- Financial data (subtotal, tax, total, currency)
- Line items (description, quantity, unit price, total)
- Payment terms
invoice-parser-widget/
│
├── run.py                    # Main entry point
├── requirements.txt          # Python dependencies
├── requirements.system       # System dependencies (apt packages)
├── .gitignore                # Git ignore rules
├── README.md                 # This documentation
│
├── utils/                    # Core utility modules
│   ├── __init__.py           # Module exports
│   ├── config.py             # Configuration management
│   ├── logger.py             # Logging system
│   ├── pdf_processor.py      # PDF extraction & OCR
│   ├── ai_parser.py          # AI parsing & validation
│   └── formatter.py          # Output formatting (CSV/JSON)
│
├── tests/                    # Test suite (46 tests, 74% coverage)
│   ├── __init__.py
│   ├── test_config.py        # Configuration tests (16 tests)
│   ├── test_formatter.py     # Formatter tests (10 tests)
│   ├── test_ai_parser.py     # AI parser tests (16 tests)
│   └── test_integration.py   # Integration tests (4 tests)
│
└── output/                   # Generated output files
    └── .gitkeep              # Keep directory in git
run.py is the main entry point and orchestrates the entire workflow:
- Reads environment variables
- Validates configuration
- Extracts text from PDF
- Parses invoice with AI
- Saves output files
- Handles errors gracefully
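The workflow above can be sketched as a single function that returns the exit code. This is a minimal illustration, not the actual run.py: the name `run_workflow` and the three injected callables are hypothetical stand-ins for the real extraction, parsing, and formatting steps.

```python
import sys
from typing import Callable


def run_workflow(
    extract_text: Callable[[str], str],
    parse_invoice: Callable[[str], dict],
    save_outputs: Callable[[dict], None],
    pdf_path: str = "invoice.pdf",
) -> int:
    """Run extract -> parse -> save and map failures to an exit code.

    All three callables are hypothetical placeholders for the real steps.
    """
    try:
        text = extract_text(pdf_path)
        data = parse_invoice(text)
        save_outputs(data)
    except (FileNotFoundError, ValueError, ImportError) as exc:
        # Report the problem and signal failure instead of crashing.
        print(f"Error: {exc}", file=sys.stderr)
        return 1
    return 0
```

Passing the steps in as arguments keeps the sketch self-contained and makes the error path easy to exercise in tests without a real PDF or API key.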
utils/config.py handles configuration management and validation:
- Provider validation (OpenAI, Anthropic)
- Output format validation (CSV, JSON, both)
- API key retrieval
- Model configuration
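Provider and output format validation could look roughly like the sketch below. The allowed values come from the configuration options documented in this README; the function name `validate_choice` and the error wording are assumptions, not the module's actual API.

```python
VALID_PROVIDERS = {"openai", "anthropic"}
VALID_OUTPUT_FORMATS = {"csv", "json", "both"}


def validate_choice(value: str, allowed: set, setting_name: str) -> str:
    """Normalize a setting and reject values outside the allowed set."""
    normalized = value.strip().lower()
    if normalized not in allowed:
        options = ", ".join(sorted(allowed))
        raise ValueError(
            f"Invalid {setting_name}: {value!r} (expected one of: {options})"
        )
    return normalized
```

For example, `validate_choice("JSON", VALID_OUTPUT_FORMATS, "output_format")` returns `"json"`, while an unsupported value such as `"xml"` raises `ValueError`.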
utils/logger.py provides the logging system with:
- Console and file output
- Timestamped messages
- Step-by-step progress tracking
- Multiple log levels
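A logger with these properties can be built from the standard `logging` module alone. The sketch below is one possible setup; `build_logger`, `log_step`, and the message format are illustrative names and choices, not the widget's actual implementation.

```python
import logging


def build_logger(name: str, log_file: str, level: int = logging.INFO) -> logging.Logger:
    """Create a logger that writes timestamped messages to console and file."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.propagate = False  # avoid duplicate lines via the root logger
    logger.handlers.clear()   # make repeated calls idempotent

    formatter = logging.Formatter("%(asctime)s [%(levelname)s] %(message)s")
    handlers = (
        logging.StreamHandler(),
        logging.FileHandler(log_file, encoding="utf-8"),
    )
    for handler in handlers:
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


def log_step(logger: logging.Logger, step: int, total: int, message: str) -> None:
    """Emit a step-by-step progress line such as 'Step 2/5: ...'."""
    logger.info("Step %d/%d: %s", step, total, message)
```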
utils/pdf_processor.py handles PDF processing with:
- File validation (size, type, existence)
- Text extraction from PDFs
- OCR support for image-based PDFs
- Metadata extraction
- Password protection detection
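The file validation step (existence, type, size) needs only the standard library. The following is a sketch under stated assumptions: the 5 MB limit and the error messages mirror the ones documented in this README, while the function name `validate_pdf_file` and the `%PDF-` header check are illustrative additions.

```python
from pathlib import Path

MAX_SIZE_BYTES = 5 * 1024 * 1024  # 5 MB input limit stated in this README


def validate_pdf_file(pdf_path: str) -> Path:
    """Check existence, extension, size, and PDF header before processing."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"PDF file not found: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"Not a PDF file: {path.name}")

    size = path.stat().st_size
    if size > MAX_SIZE_BYTES:
        size_mb = size / (1024 * 1024)
        raise ValueError(f"PDF file too large: {size_mb:.2f} MB (max 5MB)")

    # Assumption: treat a missing '%PDF-' header as an invalid file.
    with path.open("rb") as handle:
        if handle.read(5) != b"%PDF-":
            raise ValueError(f"File does not look like a PDF: {path.name}")
    return path
```

Running these cheap checks first means an oversized or mislabeled file is rejected before any text extraction or OCR work starts.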
utils/ai_parser.py handles AI-powered parsing with:
- API key validation
- Structured prompt generation
- JSON response cleaning
- Data validation and type conversion
- Mock parser for testing
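Model responses often wrap the JSON in explanatory text or a markdown code fence, so the cleaning and type conversion steps matter. The sketch below shows one simple approach: keep everything between the first `{` and the last `}`, then coerce the numeric fields. The function names and the comma-stripping rule are assumptions for illustration, not the module's confirmed behavior.

```python
import json

NUMERIC_FIELDS = ("subtotal", "tax_amount", "tax_rate", "total_amount")


def clean_json_response(raw: str) -> dict:
    """Extract and parse the JSON object embedded in a model response."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("No JSON object found in response")
    return json.loads(raw[start : end + 1])


def to_float(value):
    """Convert numbers or numeric strings like '1,234.50' to float, else None."""
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value.replace(",", "").strip())
        except ValueError:
            return None
    return None


def coerce_numbers(data: dict) -> dict:
    """Return a copy of data with the known numeric fields converted."""
    result = dict(data)
    for field in NUMERIC_FIELDS:
        if field in result:
            result[field] = to_float(result[field])
    return result
```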
utils/formatter.py handles output formatting with:
- CSV generation (main data + line items)
- JSON generation
- Summary text creation
- Edge case handling (empty/null values)
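The split into a main CSV and a separate line items CSV can be illustrated with the standard `csv` module. This is a sketch, not the formatter's actual code: `write_invoice_csvs` is a hypothetical name, the file names follow the output files listed in this README, and null values are written as empty cells by assumption.

```python
import csv
from pathlib import Path

LINE_ITEM_FIELDS = ["description", "quantity", "unit_price", "total"]


def write_invoice_csvs(parsed_data: dict, output_dir: str) -> list:
    """Write invoice_data.csv and, if line items exist, line_items.csv."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Main file: every top-level field except the nested line items.
    main_row = {
        key: "" if value is None else value
        for key, value in parsed_data.items()
        if key != "line_items"
    }
    main_path = out_dir / "invoice_data.csv"
    with main_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(main_row))
        writer.writeheader()
        writer.writerow(main_row)
    written = [main_path]

    # Line items file: only created when there is at least one item.
    line_items = parsed_data.get("line_items") or []
    if line_items:
        items_path = out_dir / "line_items.csv"
        with items_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(
                handle, fieldnames=LINE_ITEM_FIELDS, restval="", extrasaction="ignore"
            )
            writer.writeheader()
            for item in line_items:
                writer.writerow({k: "" if v is None else v for k, v in item.items()})
        written.append(items_path)
    return written
```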
git clone <repository-url>
cd invoice-parser-widget
pip install -r requirements.txt
# Ubuntu/Debian
sudo apt-get install poppler-utils tesseract-ocr
# macOS
brew install poppler tesseract
The widget works without API keys (it falls back to the mock parser), but setting an AI provider key gives the best results:
# Optional: AI provider (widget works without this)
export OPENAI_API_KEY=your_openai_key_here
# OR
export ANTHROPIC_API_KEY=your_anthropic_key_here
# Optional: Configure behavior
export invoice_file=invoice.pdf
export output_format=csv
export ai_provider=openai
export use_ocr=false
python run.py
export invoice_file=my_invoice.pdf
export output_format=json
export ai_provider=anthropic
python run.py
export test_mode=true
python run.py
export use_ocr=true
python run.py
After successful execution, the output/ directory will contain:
- invoice_data.csv - Main invoice data
- line_items.csv - Line items (if present)
- invoice_data.json - JSON format (if selected)
- summary.txt - Human-readable summary
Variable | Default | Options | Description |
---|---|---|---|
invoice_file | invoice.pdf | Any PDF path | Path to invoice PDF |
output_format | csv | csv, json, both | Output format |
ai_provider | openai | openai, anthropic | AI provider (optional) |
use_ocr | false | true, false | Enable OCR processing |
OPENAI_API_KEY | - | API key | OpenAI API key (optional - uses mock parser if not set) |
ANTHROPIC_API_KEY | - | API key | Anthropic API key (optional - uses mock parser if not set) |
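Reading these variables with their documented defaults might look like the sketch below. The variable names and defaults come from the table; the function `load_settings`, the returned dictionary layout, and the `use_mock_parser` flag are hypothetical and only illustrate the fallback when no API key is set.

```python
import os

DEFAULTS = {
    "invoice_file": "invoice.pdf",
    "output_format": "csv",
    "ai_provider": "openai",
    "use_ocr": "false",
}


def load_settings(env=None) -> dict:
    """Read widget settings from the environment, applying defaults."""
    env = os.environ if env is None else env
    settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    settings["use_ocr"] = settings["use_ocr"].strip().lower() == "true"

    # Pick the key that matches the selected provider.
    key_name = (
        "OPENAI_API_KEY" if settings["ai_provider"] == "openai" else "ANTHROPIC_API_KEY"
    )
    settings["api_key"] = env.get(key_name) or None
    # Assumption: no key means the mock parser is used instead of an AI call.
    settings["use_mock_parser"] = settings["api_key"] is None
    return settings
```

Accepting an optional `env` mapping keeps the function easy to test without touching the real process environment.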
- Input: PDF files (up to 5MB)
- Output: CSV, JSON, TXT
- OpenAI: GPT-4, GPT-4-turbo, GPT-3.5-turbo
- Anthropic: Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Sonnet
python -m pytest tests/ -v
python -m pytest tests/ -v --cov=utils --cov-report=term-missing
python -m pytest tests/test_config.py -v
python -m pytest tests/test_formatter.py -v
python -m pytest tests/test_ai_parser.py -v
python -m pytest tests/test_integration.py -v
- Total Tests: 46
- Passed: 46
- Failed: 0
- Code Coverage: 74%
Module | Coverage | Lines |
---|---|---|
utils/__init__.py | 100% | 4 |
utils/formatter.py | 100% | 52 |
utils/config.py | 95% | 39 |
utils/logger.py | 76% | 37 |
utils/ai_parser.py | 65% | 98 |
utils/pdf_processor.py | 54% | 78 |
- utils/config.py - Configuration management
- utils/logger.py - Logging system
- utils/__init__.py - Module structure
- run.py - Better error handling, logging, validation
- utils/pdf_processor.py - OCR support, validation, metadata
- utils/ai_parser.py - Better validation, mock parser
- utils/formatter.py - Edge case handling, enhanced summaries
- 46 comprehensive tests (unit + integration)
- 74% code coverage
- Test fixtures and utilities
- Mock parser for API-free testing
- Type hints throughout
- Comprehensive docstrings
- Better error messages
- Input validation at all levels
- Security improvements
- Configuration Management - Centralized validation and configuration
- Logging System - Step-by-step progress tracking
- OCR Support - Handle image-based PDFs
- Error Handling - Comprehensive validation and user-friendly errors
- Test Coverage - 46 tests with 74% coverage
- Mock Parser - Test without API calls
- Smart File Discovery - Handles Abyss platform file path variations
- API Key Fallback - Automatic graceful degradation without API keys
- Enhanced Output - Better formatting and summaries
- Type Safety - Type hints throughout codebase
- Documentation - Complete API reference and guides
- Security - File validation, size limits, sanitization
Main entry point that orchestrates the workflow.
Returns: int - Exit code (0 = success, 1 = error)
Extract text from PDF file.
Parameters:
- pdf_path (str): Path to PDF file
- use_ocr (bool): Enable OCR for image-based PDFs
Returns: str - Extracted text
Raises:
- FileNotFoundError: If PDF doesn't exist
- ValueError: If PDF is invalid or unreadable
Parse invoice text using AI.
Parameters:
- invoice_text (str): Extracted invoice text
- provider (str): AI provider ('openai' or 'anthropic')
- model (str): Specific model to use (optional)
Returns: dict - Parsed invoice data
Raises:
- ValueError: If API key is missing or parsing fails
- ImportError: If required library is not installed
Format parsed data to CSV files.
Parameters:
- parsed_data (dict): Parsed invoice data
- output_path (Path): Output file path
Creates:
- Main CSV with invoice data
- Separate CSV for line items (if present)
Format parsed data to JSON file.
Parameters:
- parsed_data (dict): Parsed invoice data
- output_path (Path): Output file path
{
"vendor_name": str,
"vendor_address": str,
"vendor_email": str,
"vendor_phone": str,
"invoice_number": str,
"invoice_date": str, # YYYY-MM-DD
"due_date": str, # YYYY-MM-DD
"customer_name": str,
"customer_address": str,
"subtotal": float,
"tax_amount": float,
"tax_rate": float,
"total_amount": float,
"currency": str, # USD, EUR, etc.
"payment_terms": str,
"line_items": [
{
"description": str,
"quantity": float,
"unit_price": float,
"total": float
}
]
}
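A parsed record can be checked against this schema with a few lines of Python. The sketch below is illustrative only: `find_schema_problems` is a hypothetical helper, and treating missing or null fields as acceptable is an assumption rather than documented behavior.

```python
from datetime import datetime

DATE_FIELDS = ("invoice_date", "due_date")
NUMBER_FIELDS = ("subtotal", "tax_amount", "tax_rate", "total_amount")
LINE_ITEM_NUMBER_FIELDS = ("quantity", "unit_price", "total")


def is_iso_date(value) -> bool:
    """Return True if value parses as a YYYY-MM-DD date string."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def find_schema_problems(record: dict) -> list:
    """List fields whose values do not match the documented schema."""
    problems = []
    for field in DATE_FIELDS:
        value = record.get(field)
        # Assumption: a missing or null date is allowed.
        if value is not None and not is_iso_date(value):
            problems.append(f"{field}: expected YYYY-MM-DD, got {value!r}")
    for field in NUMBER_FIELDS:
        value = record.get(field)
        if value is not None and not isinstance(value, (int, float)):
            problems.append(f"{field}: expected a number, got {value!r}")
    for index, item in enumerate(record.get("line_items") or []):
        for field in LINE_ITEM_NUMBER_FIELDS:
            if not isinstance(item.get(field), (int, float)):
                problems.append(f"line_items[{index}].{field}: expected a number")
    return problems
```

An empty list means the record matches the schema as far as this check goes; each entry otherwise names the offending field.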
Error: FileNotFoundError: PDF file not found
Solution: Check the file path and ensure the PDF exists
export invoice_file=/full/path/to/invoice.pdf
Note: The widget automatically uses a basic mock parser if no API key is provided
Solution: Set an API key for better AI-powered extraction
export OPENAI_API_KEY=your_key_here
# OR
export ANTHROPIC_API_KEY=your_key_here
Error: ValueError: No readable text found in PDF
Solution: Enable OCR for image-based PDFs
export use_ocr=true
Error: ValueError: PDF is password-protected
Solution: Remove password protection or use an unencrypted version
Error: ValueError: PDF file too large: X.XX MB (max 5MB)
Solution: Compress the PDF or split into smaller files
- API keys are never logged or exposed
- File paths are validated before processing
- File size limits prevent memory issues
- Input sanitization prevents injection attacks
- OCR libraries may have vulnerabilities (keep updated)
- PyPDF2 >= 3.0.1
- openai >= 1.12.0
- anthropic >= 0.18.1
- pdf2image >= 1.16.3 (for OCR)
- pytesseract >= 0.3.10 (for OCR)
- pytest >= 7.4.3 (for testing)
- poppler-utils (for OCR)
- tesseract-ocr (for OCR)
- Python 3.10+
- 512 MB RAM or less
- 4 vCPUs or less (Abyss platform requirement)
This project is designed for the Abyss AI Widget platform.
This is a widget for the Abyss platform. For issues or improvements:
- Run tests before submitting changes
- Maintain backward compatibility
- Add tests for new features
- Update documentation
For issues or questions:
- Check the troubleshooting section
- Review test cases for examples
- Verify environment variables are set correctly
- Ensure all dependencies are installed
- Multi-language invoice support
- Batch processing (multiple invoices)
- Advanced OCR with layout analysis
- Duplicate invoice detection
- Excel output format
- Database integration
- Automatic amount validation
- Invoice categorization
Built with ❤️ for the Abyss AI Widget Platform